Commit 1bd6ed5

add missing docs
1 parent 4a6f957 commit 1bd6ed5

6 files changed

Lines changed: 226 additions & 1 deletion

Lines changed: 122 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,122 @@
1+
---
2+
layout: default
3+
title: Advanced Features
4+
nav_order: 9
5+
parent: How-To Guides
6+
grand_parent: Ruby Gem
7+
---
8+
9+
# Advanced Features
10+
11+
This guide covers advanced features and performance optimizations for html2rss.
12+
13+
## Parallel Processing
14+
15+
html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn't require any configuration.
16+
17+
### How It Works
18+
19+
- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the page
20+
- **Item processing:** Each scraped item is processed in parallel
21+
- **Performance benefit:** Significantly faster when dealing with many items
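
The item-processing step can be pictured with plain Ruby threads. This is a minimal sketch, not html2rss's actual implementation: this guide does not show the concurrency mechanism, and `process_item` is a hypothetical stand-in for per-item work.

```ruby
# Hypothetical stand-in for per-item work such as post-processing.
def process_item(item)
  item.strip.upcase
end

items = ['  first ', ' second', 'third  ']

# One thread per item. Thread#value joins the thread and returns the
# block's result, so the output order matches the input order.
results = items.map { |item| Thread.new { process_item(item) } }.map(&:value)

puts results.inspect
```

Collecting results with `Thread#value` keeps the items in page order even if the threads finish in a different order.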
22+
23+
### Performance Tips
24+
25+
1. **Use appropriate selectors:** More specific selectors reduce processing time
26+
2. **Limit items when possible:** Use CSS selectors that target only the content you need
27+
3. **Cache responses:** The web application caches responses automatically
28+
4. **Choose the right strategy:** Use `faraday` for static content, `browserless` only when JavaScript is required
29+
30+
## Memory Optimization
31+
32+
html2rss is designed to be memory-efficient:
33+
34+
- **Frozen objects:** Parsed content is frozen to prevent accidental modifications
35+
- **Efficient data structures:** Uses `Set` instead of `Array` for lookups
36+
- **Minimal allocations:** Prefers in-place (bang) methods, which mutate the receiver instead of allocating a copy
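
The three techniques above are ordinary Ruby idioms. The sketch below illustrates them in isolation; the variable names and values are made up for the example and do not come from the html2rss source.

```ruby
require 'set'

# Frozen objects: mutation raises instead of silently changing shared data.
title = 'Breaking news'.freeze
begin
  title << '!'
rescue FrozenError => e
  puts "rejected: #{e.class}"
end

# Set lookups are hash-based; Array#include? scans every element.
seen_urls = Set.new(['https://example.com/a', 'https://example.com/b'])
puts seen_urls.include?('https://example.com/a')

# Bang methods mutate the receiver instead of allocating a new array.
words = %w[alpha beta]
original_id = words.object_id
words.map!(&:upcase)
puts words.object_id == original_id
```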
37+
38+
## Large Feed Handling
39+
40+
For websites with many items:
41+
42+
```yaml
43+
# Use specific selectors to limit items
44+
selectors:
45+
items:
46+
selector: ".article:not(.advertisement)" # Exclude ads
47+
title:
48+
selector: "h2" # More specific than generic selectors
49+
```
50+
51+
## Error Recovery
52+
53+
html2rss includes built-in error handling:
54+
55+
- **Graceful degradation:** If one scraper fails, others continue
56+
- **Detailed logging:** Set `LOG_LEVEL=debug` for detailed information
57+
- **Validation:** Configuration is validated before processing
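
Graceful degradation can be sketched as "rescue per scraper, keep whatever the others return". The scraper names below come from the auto-source documentation, but the lambdas and the `failures` hash are hypothetical and only illustrate the pattern.

```ruby
# Hypothetical scrapers: each is a callable that returns an array of items.
scrapers = {
  schema: -> { raise ArgumentError, 'no structured data found' },
  semantic_html: -> { ['Article from <article> tag'] },
  html: -> { ['Article from frequent selector'] }
}

failures = {}
items = scrapers.flat_map do |name, scraper|
  scraper.call
rescue StandardError => e
  failures[name] = e.message # record the failure and keep going
  []
end

puts items.inspect
puts failures.inspect
```

Recording the failure instead of swallowing it is what makes debug-level logging useful afterwards.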
58+
59+
## Custom Headers for Performance
60+
61+
Optimize requests with appropriate headers:
62+
63+
```yaml
64+
headers:
65+
Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed
66+
Accept-Encoding: "gzip, deflate" # Enable compression
67+
User-Agent: "html2rss/1.0" # Identify your requests
68+
```
69+
70+
## Monitoring and Debugging
71+
72+
### Enable Debug Logging
73+
74+
```bash
75+
LOG_LEVEL=debug html2rss feed config.yml
76+
```
77+
78+
### Web Application Health Checks
79+
80+
Use the health check endpoint to monitor feed generation:
81+
82+
```bash
83+
curl -u username:password http://localhost:3000/health_check.txt
84+
```
85+
86+
## Article Validation
87+
88+
html2rss includes built-in validation for articles to ensure feed quality:
89+
90+
### Validation Rules
91+
92+
Articles are considered valid if they have:
93+
- A non-empty URL
94+
- A title, a description, or both
95+
- A unique ID
96+
97+
### Invalid Articles
98+
99+
Invalid articles are automatically filtered out to prevent empty or broken feed items.
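
The rules above can be expressed as a small filter. This is a sketch under stated assumptions: `Article`, `blank?`, and `valid_articles` are hypothetical names, and "unique ID" is read here as "no earlier article in the same feed has the same ID".

```ruby
require 'set'

# Hypothetical article shape used only for this example.
Article = Struct.new(:id, :url, :title, :description, keyword_init: true)

def blank?(value)
  value.nil? || value.to_s.strip.empty?
end

def valid_articles(articles)
  seen_ids = Set.new
  articles.select do |article|
    next false if blank?(article.url) || blank?(article.id)
    next false if blank?(article.title) && blank?(article.description)

    seen_ids.add?(article.id) # nil (falsy) when the ID was already seen
  end
end

articles = [
  Article.new(id: '1', url: 'https://example.com/1', title: 'One'),
  Article.new(id: '1', url: 'https://example.com/dup', title: 'Duplicate ID'),
  Article.new(id: '2', url: '', title: 'No URL'),
  Article.new(id: '3', url: 'https://example.com/3'),
  Article.new(id: '4', url: 'https://example.com/4', description: 'Only a description')
]

puts valid_articles(articles).map(&:id).inspect
```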
100+
101+
### Custom Validation
102+
103+
You can add custom validation by using post-processors:
104+
105+
```yaml
106+
selectors:
107+
title:
108+
selector: "h2"
109+
post_process:
110+
- name: "gsub"
111+
pattern: "^\\s*$"
112+
replacement: "Untitled"
113+
```
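
To see what this pattern does, the sketch below applies the same regular expression with Ruby's `String#gsub`. It assumes the `gsub` post-processor behaves like `String#gsub` with a `Regexp`; `fallback_title` is a hypothetical helper name.

```ruby
# The double-quoted YAML string "^\\s*$" unescapes to ^\s*$,
# which matches a line that is empty or contains only whitespace.
PATTERN = /^\s*$/.freeze

def fallback_title(title)
  title.gsub(PATTERN, 'Untitled')
end

['', '   ', 'Release notes'].each do |title|
  puts fallback_title(title).inspect
end
```

Because `^` and `$` match at line boundaries in Ruby, a multi-line title with a blank line inside it would also be affected.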
114+
115+
## Best Practices
116+
117+
1. **Test configurations:** Always test your configurations before deploying
118+
2. **Monitor performance:** Use health checks to detect issues early
119+
3. **Keep selectors simple:** Complex selectors are harder to maintain
120+
4. **Use auto-source when possible:** It's often more reliable than manual selectors
121+
5. **Handle errors gracefully:** Implement proper error handling in your applications
122+
6. **Validate your data:** Ensure your selectors return valid content
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
---
2+
layout: default
3+
title: Backward Compatibility
4+
nav_order: 11
5+
parent: How-To Guides
6+
grand_parent: Ruby Gem
7+
---
8+
9+
# Backward Compatibility
10+
11+
html2rss maintains backward compatibility with older configuration formats and attribute names.
12+
13+
## Renamed Attributes
14+
15+
Some attribute names have been renamed for clarity, but the old names still work:
16+
17+
| Current Name | Legacy Names | Description |
18+
| --------------- | ------------------- | ------------------------------ |
19+
| `published_at` | `updated`, `pubDate` | Publication date of the item |
20+
21+
### Example
22+
23+
Both of these configurations work identically:
24+
25+
```yaml
26+
# Current format (recommended)
27+
selectors:
28+
published_at:
29+
selector: ".date"
30+
31+
# Legacy format (still supported)
32+
selectors:
33+
updated:
34+
selector: ".date"
35+
```
36+
37+
## Migration Guide
38+
39+
If you're upgrading from an older version of html2rss:
40+
41+
1. **Update attribute names**: Replace the legacy names `updated` and `pubDate` with `published_at` in your configurations
42+
2. **Test your feeds**: Verify that all feeds still work correctly after the update
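
For configurations kept in YAML files, the renaming step can be scripted. The following is a minimal sketch, not an official migration tool: `LEGACY_NAMES` and `migrate_selectors` are hypothetical names, and only the mappings from the table above are covered.

```ruby
require 'yaml'

# Legacy name => current name, taken from the "Renamed Attributes" table.
LEGACY_NAMES = { 'updated' => 'published_at', 'pubDate' => 'published_at' }.freeze

# Returns a copy of the config with legacy selector names replaced.
def migrate_selectors(config)
  selectors = config.fetch('selectors', {})
  migrated = selectors.to_h { |name, value| [LEGACY_NAMES.fetch(name, name), value] }
  config.merge('selectors' => migrated)
end

legacy = YAML.safe_load(<<~YAML)
  selectors:
    updated:
      selector: ".date"
YAML

puts migrate_selectors(legacy).to_yaml
```

If a configuration defines both a legacy name and `published_at`, the later key wins in this sketch, so check such files by hand.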
43+
44+
## Deprecated Features
45+
46+
The following features are deprecated but still supported:
47+
48+
- **Legacy attribute names**: `updated` and `pubDate` remain supported, but new configurations should use `published_at`
49+
50+
## Getting Help
51+
52+
If you encounter issues with backward compatibility:
53+
54+
- **Report issues**: Open an issue if you find compatibility problems

ruby-gem/reference/auto-source.md

Lines changed: 11 additions & 0 deletions
@@ -18,13 +18,22 @@ channel:
1818
auto_source: {}
1919
```
2020
21+
## Default Configuration
22+
23+
When you use `auto_source: {}`, html2rss uses these default settings:
24+
25+
- **All scrapers enabled:** `schema`, `semantic_html`, `html`, and `rss_feed_detector`
26+
- **HTML scraper settings:** `minimum_selector_frequency: 2`, `use_top_selectors: 5`
27+
- **Cleanup settings:** `keep_different_domain: true`, `min_words_title: 3`
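
One way to picture how these defaults combine with your own settings is a deep merge of the overrides onto the defaults. The sketch below is an illustration only: the values come from the list above, but the nesting under `:scraper` and `:cleanup` and the `deep_merge` helper are assumptions, not the gem's confirmed internals.

```ruby
# Defaults as listed above; the nesting under :scraper and :cleanup is assumed.
DEFAULTS = {
  scraper: {
    schema: { enabled: true },
    semantic_html: { enabled: true },
    html: { enabled: true, minimum_selector_frequency: 2, use_top_selectors: 5 },
    rss_feed_detector: { enabled: true }
  },
  cleanup: { keep_different_domain: true, min_words_title: 3 }
}.freeze

# Recursively merges overrides into base without mutating either hash.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

effective = deep_merge(DEFAULTS, { scraper: { rss_feed_detector: { enabled: false } } })

puts effective[:scraper][:rss_feed_detector].inspect
puts effective[:scraper][:html].inspect
```

Only the overridden key changes; every other default stays in place.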
28+
2129
## How It Works
2230

2331
`auto_source` uses the following strategies to find content:
2432

2533
1. **`schema`:** Parses `<script type="application/ld+json">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
2634
2. **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
2735
3. **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
36+
4. **`rss_feed_detector`:** Detects RSS feeds that the target website already publishes, so an existing feed can be consumed directly.
2837

2938
## Fine-Tuning
3039

@@ -45,6 +54,8 @@ auto_source:
4554
enabled: true
4655
minimum_selector_frequency: 3 # default: 2
4756
use_top_selectors: 3 # default: 5
57+
rss_feed_detector:
58+
enabled: false # default: true
4859
```
4960

5061
### Cleanup Options

ruby-gem/reference/selectors.md

Lines changed: 15 additions & 0 deletions
@@ -37,6 +37,21 @@ selectors:
3737
enhance: true # default: true
3838
```
3939

40+
## Item Ordering
41+
42+
You can control the order of items in your feed:
43+
44+
```yml
45+
selectors:
46+
items:
47+
selector: ".article"
48+
    order: "reverse" # Reverse the page order (newest first if the site lists oldest first)
49+
```
50+
51+
Available options:
52+
- `"reverse"`: Reverses the order of items (useful when the website shows oldest items first)
53+
- Default: Items appear in the order they are found on the page
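
In plain Ruby terms, the option amounts to reversing the list of scraped items. This sketch reads `order: "reverse"` as a flip of page order rather than a sort by date; the sample data is made up.

```ruby
# Items as found on a page that lists the oldest entry first.
page_order = ['2024-01-01 post', '2024-02-01 post', '2024-03-01 post']

# Reversing flips document order; it does not inspect the dates.
feed_order = page_order.reverse

puts feed_order.inspect
```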
54+
4055
## RSS 2.0 Selectors
4156

4257
While you can define any named selector, only the following are used in the final RSS feed:

support/troubleshooting.md

Lines changed: 20 additions & 0 deletions
@@ -23,6 +23,17 @@ If your feed is empty, check the following:
2323
- **URL:** Ensure the `url` in your configuration is correct and accessible.
2424
- **`items.selector`:** Verify that the `items.selector` matches the elements on the page.
2525
- **Website Changes:** Websites change their HTML structure frequently. Your selectors may be outdated.
26+
- **JavaScript Content:** If the content is loaded via JavaScript, use the `browserless` strategy instead of `faraday`.
27+
- **Authentication:** Some sites require authentication. Check whether you need to add headers or use a different strategy.
28+
29+
### Configuration Errors
30+
31+
Common configuration-related errors:
32+
33+
- **`UnsupportedResponseContentType`:** The website returned content that html2rss can't parse (not HTML or JSON).
34+
- **`UnsupportedStrategy`:** The specified strategy is not available. Use `faraday` or `browserless`.
35+
- **`Configuration must include at least 'selectors' or 'auto_source'`:** You need to specify either manual selectors or enable auto-source.
36+
- **`stylesheet.type invalid`:** Only `text/css` and `text/xsl` are supported for stylesheets.
2637

2738
### Missing Item Parts
2839

@@ -46,6 +57,15 @@ If you are getting a "command not found" error, try the following:
4657
- **Re-install:** Re-install `html2rss` to ensure it is installed correctly: `gem install html2rss`.
4758
- **Check `PATH`:** Ensure that the directory where Ruby gems are installed is in your system's `PATH`.
4859

60+
### Web Application Errors
61+
62+
For html2rss-web specific issues:
63+
64+
- **`401 Unauthorized`:** Check your `AUTO_SOURCE_USERNAME` and `AUTO_SOURCE_PASSWORD` environment variables.
65+
- **`403 Forbidden`:** The URL is not in the `AUTO_SOURCE_ALLOWED_URLS` list, or the origin is not in `AUTO_SOURCE_ALLOWED_ORIGINS`.
66+
- **`500 Internal Server Error`:** Check the application logs for detailed error information.
67+
- **Health check failures:** Use the `/health_check.txt` endpoint to identify which specific feed configurations are broken.
68+
4969
## Tips & Tricks
5070

5171
- **Mobile Redirects:** Check that the channel URL does not redirect to a mobile page with a different markup structure.

web-application/reference/env-variables.md

Lines changed: 4 additions & 1 deletion
@@ -12,7 +12,6 @@ grand_parent: Web Application
1212

1313
| Name | Description |
1414
| ------------------------------ | ---------------------------------- |
15-
| `BASE_URL` | default: '<http://localhost:3000>' |
1615
| `LOG_LEVEL` | default: 'warn' |
1716
| `HEALTH_CHECK_USERNAME` | default: auto-generated on start |
1817
| `HEALTH_CHECK_PASSWORD` | default: auto-generated on start |
@@ -21,6 +20,7 @@ grand_parent: Web Application
2120
| `AUTO_SOURCE_USERNAME` | no default. |
2221
| `AUTO_SOURCE_PASSWORD` | no default. |
2322
| `AUTO_SOURCE_ALLOWED_ORIGINS` | no default. |
23+
| `AUTO_SOURCE_ALLOWED_URLS` | no default. Wildcard patterns supported. |
2424
| | |
2525
| `PORT` | default: 3000 |
2626
| `RACK_ENV` | default: 'development' |
@@ -29,3 +29,6 @@ grand_parent: Web Application
2929
| `WEB_MAX_THREADS` | default: 5 |
3030
| | |
3131
| `SENTRY_DSN` | no default. |
32+
| | |
33+
| `RUBY_PATH` | default: 'ruby' |
34+
| `APP_ROOT` | default: '.' |
