| title | Advanced Features |
|---|---|
| description | Advanced features and performance optimizations for html2rss. |
This guide covers advanced features and performance optimizations for html2rss.
html2rss uses parallel processing in auto-source discovery. This happens automatically and doesn't require any configuration.
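
As a rough illustration of the idea (not html2rss's actual internals), the following Ruby sketch runs several scrapers concurrently with plain threads and merges their results. The scraper names, the `discover_in_parallel` helper, and the returned values are all hypothetical placeholders.

```ruby
# Hypothetical scrapers: each takes a URL and returns a list of item links.
# These names and return values are placeholders, not html2rss APIs.
SCRAPERS = {
  semantic_html: ->(url) { ["#{url}#article-1", "#{url}#article-2"] },
  json_ld: ->(url) { ["#{url}#article-2", "#{url}#article-3"] }
}.freeze

# Run every scraper in its own thread and merge the unique results.
def discover_in_parallel(url, scrapers = SCRAPERS)
  threads = scrapers.map do |name, scraper|
    Thread.new { [name, scraper.call(url)] }
  end

  # Thread#value waits for the thread to finish and returns the block's result.
  threads.map(&:value).flat_map { |_name, items| items }.uniq
end

puts discover_in_parallel("https://example.com/articles")
```

Threads are used here only because they need no extra dependencies; the real library may schedule its work differently.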
- Use appropriate selectors: More specific selectors reduce processing time
- Limit items when possible: Use CSS selectors that target only the content you need
- Cache responses: The web application caches responses automatically
- Choose the right strategy: Use `faraday` for static content, `browserless` only when JavaScript is required
html2rss is designed to be memory-efficient:
- Frozen objects: Parsed content is frozen to prevent accidental modifications
- Efficient data structures: Uses `Set` instead of `Array` for lookups
- Minimal allocations: Prefers bang methods to avoid unnecessary memory allocations
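
The three techniques above are general Ruby idioms. The short sketch below shows each one in isolation; the variable names and sample values are made up for illustration and are not taken from the html2rss source.

```ruby
require "set"

# Frozen objects: mutation attempts raise instead of silently changing data.
title = "Example article".freeze
begin
  title << " (edited)"
rescue FrozenError => e
  puts "Cannot modify frozen string: #{e.class}"
end

# Set vs. Array: Set#include? is a hash lookup, Array#include? scans the list.
seen_urls = Set.new(["https://example.com/a", "https://example.com/b"])
puts seen_urls.include?("https://example.com/a")

# Bang methods: uniq! and sort! change the receiver in place
# instead of allocating a new array for each step.
ids = [3, 1, 2, 3]
ids.uniq!
ids.sort!
puts ids.inspect
```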
For websites with many items:

```yaml
channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: ".article:not(.advertisement)" # Exclude ads
  title:
    selector: "h2" # More specific than generic selectors
  url:
    selector: "a"
    extractor: "href"
```

html2rss includes built-in error handling:
- Graceful degradation: If one scraper fails, others continue
- Detailed logging: Set `LOG_LEVEL=debug` for detailed information
- Validation: Configuration is validated before processing
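
To make the graceful-degradation and logging points concrete, here is a minimal Ruby sketch. It assumes a set of hypothetical scraper lambdas and a `run_all` helper invented for this example; it does not reproduce how html2rss handles errors internally. Only the `LOG_LEVEL` environment variable comes from the text above.

```ruby
require "logger"

# Read the log level from the environment, defaulting to "info" when unset.
LOGGER = Logger.new($stderr)
LOGGER.level = ENV.fetch("LOG_LEVEL", "info")

# Hypothetical scrapers; the second one always fails to show the fallback path.
scrapers = {
  working: -> { ["Article A", "Article B"] },
  broken: -> { raise "unexpected markup" }
}

# Call each scraper in turn. A failure is logged and contributes no items,
# but the remaining scrapers still run.
def run_all(scrapers, logger)
  scrapers.flat_map do |name, scraper|
    logger.debug("running #{name}")
    scraper.call
  rescue StandardError => e
    logger.warn("#{name} failed: #{e.message}")
    []
  end
end

items = run_all(scrapers, LOGGER)
puts items.inspect
```

Running the script with `LOG_LEVEL=debug` additionally prints the `running ...` lines to standard error.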
Optimize requests with appropriate headers:

```yaml
headers:
  Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed
  Accept-Encoding: "gzip, deflate" # Enable compression
channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: "article"
  title:
    selector: "h2"
  url:
    selector: "a"
    extractor: "href"
```

```bash
LOG_LEVEL=debug html2rss feed config.yml
```

Use the health check endpoint to monitor feed generation:

```bash
curl -u username:password http://localhost:4000/health_check.txt
```

html2rss includes built-in validation for articles to ensure feed quality:
Articles are considered valid if they have:
- A non-empty URL
- Either a title OR description (or both)
- A unique ID
Invalid articles are automatically filtered out to prevent empty or broken feed items.
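
The validity rules above can be sketched in a few lines of Ruby. The `Article` struct and the helper names below are hypothetical stand-ins, not html2rss classes, and treating "a unique ID" as "keep the first article per ID" is an assumption made for this example.

```ruby
# Minimal stand-in for a scraped article; not html2rss's own class.
Article = Struct.new(:id, :url, :title, :description, keyword_init: true)

# True when the value is neither nil nor blank.
def filled?(value)
  !value.to_s.strip.empty?
end

# Valid when URL and ID are present and at least one of title/description is.
def valid_article?(article)
  filled?(article.url) &&
    filled?(article.id) &&
    (filled?(article.title) || filled?(article.description))
end

# Drop invalid articles, then keep only the first article for each ID.
def filter_articles(articles)
  articles.select { |article| valid_article?(article) }.uniq(&:id)
end

articles = [
  Article.new(id: "1", url: "https://example.com/1", title: "First"),
  Article.new(id: "2", url: "", title: "No URL"),
  Article.new(id: "3", url: "https://example.com/3"),
  Article.new(id: "1", url: "https://example.com/1", description: "Duplicate ID")
]

puts filter_articles(articles).map(&:id).inspect
```

Only the first article survives: the second has an empty URL, the third has neither title nor description, and the fourth repeats an ID that was already seen.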
You can add custom validation by using post-processors:

```yaml
channel:
  url: "https://example.com/articles"
selectors:
  items:
    selector: "article"
  title:
    selector: "h2"
    post_process:
      - name: "gsub"
        pattern: "^\\s*$"
        replacement: "Untitled"
  url:
    selector: "a"
    extractor: "href"
```

- Test configurations: Always test your configurations before deploying
- Monitor performance: Use health checks to detect issues early
- Keep selectors simple: Complex selectors are harder to maintain
- Use auto-source when possible: It's often more reliable than manual selectors
- Handle errors gracefully: Implement proper error handling in your applications
- Validate your data: Ensure your selectors return valid content