| 1 | +--- |
| 2 | +layout: default |
| 3 | +title: Advanced Features |
| 4 | +nav_order: 9 |
| 5 | +parent: How-To Guides |
| 6 | +grand_parent: Ruby Gem |
| 7 | +--- |
| 8 | + |
| 9 | +# Advanced Features |
| 10 | + |
| 11 | +This guide covers advanced features and performance optimizations for html2rss. |
| 12 | + |
| 13 | +## Parallel Processing |
| 14 | + |
| 15 | +html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn't require any configuration. |
| 16 | + |
| 17 | +### How It Works |
| 18 | + |
| 19 | +- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the page |
| 20 | +- **Item processing:** Each scraped item is processed in parallel |
| 21 | +- **Performance benefit:** Significantly faster when dealing with many items |
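
The idea of processing items in parallel can be sketched in plain Ruby. This is a minimal illustration that uses threads from the standard library, not html2rss's actual implementation; `process_item` and `process_in_parallel` are hypothetical names introduced only for this example.

```ruby
# Hypothetical stand-in for per-item work (e.g. extracting and cleaning fields).
def process_item(raw)
  { title: raw.strip.capitalize }
end

# Process every item in its own thread and collect the results in input order.
def process_in_parallel(raw_items)
  raw_items
    .map { |raw| Thread.new { process_item(raw) } }
    .map(&:value) # Thread#value joins the thread and returns the block's result
end

results = process_in_parallel(["  first post ", "second post", " third post"])
puts results.map { |item| item[:title] }.inspect
```

Because the results are collected by mapping over the threads in their original order, the output order matches the input order even when threads finish at different times.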
| 22 | + |
| 23 | +### Performance Tips |
| 24 | + |
| 25 | +1. **Use appropriate selectors:** More specific selectors reduce processing time |
| 26 | +2. **Limit items when possible:** Use CSS selectors that target only the content you need |
| 27 | +3. **Cache responses:** The web application caches responses automatically |
| 28 | +4. **Choose the right strategy:** Use `faraday` for static content, `browserless` only when JavaScript is required |
| 29 | + |
| 30 | +## Memory Optimization |
| 31 | + |
| 32 | +html2rss is designed to be memory-efficient: |
| 33 | + |
| 34 | +- **Frozen objects:** Parsed content is frozen to prevent accidental modifications |
| 35 | +- **Efficient data structures:** Uses `Set` instead of `Array` for lookups |
| 36 | +- **Minimal allocations:** Prefers in-place (bang) methods such as `map!` to avoid allocating intermediate objects |
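
The three techniques listed above can be illustrated with standard Ruby. This sketch shows the general idioms only and does not reproduce html2rss internals; all variable names and URLs are made up for the example.

```ruby
require "set"

# Frozen objects: mutation attempts raise instead of silently changing shared data.
title = "Example title".freeze
frozen_error = nil
begin
  title << "!"
rescue FrozenError => e
  frozen_error = e.class
end

# Set lookups are hash-based, so a membership check does not scan the collection.
seen_urls = Set.new(["https://example.com/a", "https://example.com/b"])
already_seen = seen_urls.include?("https://example.com/a")

# Bang methods modify the receiver in place instead of allocating a new array.
tags = [" ruby ", " rss "]
original_object_id = tags.object_id
tags.map!(&:strip)

puts frozen_error, already_seen, tags.inspect
```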
| 37 | + |
| 38 | +## Large Feed Handling |
| 39 | + |
| 40 | +For websites with many items: |
| 41 | + |
| 42 | +```yaml |
| 43 | +# Use specific selectors to limit items |
| 44 | +selectors: |
| 45 | +  items: |
| 46 | +    selector: ".article:not(.advertisement)" # Exclude ads |
| 47 | +  title: |
| 48 | +    selector: "h2" # More specific than generic selectors |
| 49 | +``` |
| 50 | + |
| 51 | +## Error Recovery |
| 52 | + |
| 53 | +html2rss includes built-in error handling: |
| 54 | + |
| 55 | +- **Graceful degradation:** If one scraper fails, others continue |
| 56 | +- **Detailed logging:** Set `LOG_LEVEL=debug` for detailed information |
| 57 | +- **Validation:** Configuration is validated before processing |
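
Graceful degradation, as described in the list above, can be pictured as follows. This is a sketch under stated assumptions: the scraper names and their return values are hypothetical, and the real scrapers in html2rss may be structured differently.

```ruby
# Hypothetical scrapers: each is a callable that returns a list of item titles.
scrapers = {
  semantic_html: -> { ["Article A", "Article B"] },
  json_ld: -> { raise ArgumentError, "malformed JSON-LD" },
  feed_links: -> { ["Article C"] }
}

errors = {}

# Run every scraper; a failure is recorded and contributes no items,
# while the remaining scrapers still run.
items = scrapers.flat_map do |name, scraper|
  scraper.call
rescue StandardError => e
  errors[name] = e.message
  []
end

puts items.inspect
puts errors.inspect
```

Recording the error instead of re-raising it keeps one broken scraper from discarding the items the others found.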
| 58 | + |
| 59 | +## Custom Headers for Performance |
| 60 | + |
| 61 | +Optimize requests with appropriate headers: |
| 62 | + |
| 63 | +```yaml |
| 64 | +headers: |
| 65 | +  Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed |
| 66 | +  Accept-Encoding: "gzip, deflate" # Enable compression |
| 67 | +  User-Agent: "html2rss/1.0" # Identify your requests |
| 68 | +``` |
| 69 | + |
| 70 | +## Monitoring and Debugging |
| 71 | + |
| 72 | +### Enable Debug Logging |
| 73 | + |
| 74 | +```bash |
| 75 | +LOG_LEVEL=debug html2rss feed config.yml |
| 76 | +``` |
| 77 | + |
| 78 | +### Web Application Health Checks |
| 79 | + |
| 80 | +Use the health check endpoint to monitor feed generation: |
| 81 | + |
| 82 | +```bash |
| 83 | +curl -u username:password http://localhost:3000/health_check.txt |
| 84 | +``` |
| 85 | + |
| 86 | +## Article Validation |
| 87 | + |
| 88 | +html2rss includes built-in validation for articles to ensure feed quality: |
| 89 | + |
| 90 | +### Validation Rules |
| 91 | + |
| 92 | +Articles are considered valid if they have: |
| 93 | +- A non-empty URL |
| 94 | +- A title, a description, or both |
| 95 | +- A unique ID |
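
The rules above can be expressed as a small predicate. This is an illustrative sketch only: the `Article` struct and the `present?` and `valid_article?` helpers are hypothetical and are not the classes or methods html2rss uses internally.

```ruby
# Hypothetical article shape; html2rss's internal classes may differ.
Article = Struct.new(:url, :title, :description, :id, keyword_init: true)

# True when the value is neither nil nor blank.
def present?(value)
  !value.to_s.strip.empty?
end

# Applies the three rules: URL, title or description, and an ID.
def valid_article?(article)
  present?(article.url) &&
    (present?(article.title) || present?(article.description)) &&
    present?(article.id)
end

articles = [
  Article.new(url: "https://example.com/1", title: "Hello", id: "1"),
  Article.new(url: "https://example.com/2", description: "Text only", id: "2"),
  Article.new(url: "", title: "No URL", id: "3"),
  Article.new(url: "https://example.com/4", id: "4"),
  Article.new(url: "https://example.com/1", title: "Duplicate", id: "1")
]

# Keep valid articles, then drop later articles that reuse an earlier ID.
valid = articles.select { |article| valid_article?(article) }.uniq(&:id)
puts valid.map(&:id).inspect
```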
| 96 | + |
| 97 | +### Invalid Articles |
| 98 | + |
| 99 | +Invalid articles are automatically filtered out to prevent empty or broken feed items. |
| 100 | + |
| 101 | +### Custom Validation |
| 102 | + |
| 103 | +You can approximate custom validation with post-processors, for example by replacing a blank title with a fallback value: |
| 104 | + |
| 105 | +```yaml |
| 106 | +selectors: |
| 107 | +  title: |
| 108 | +    selector: "h2" |
| 109 | +    post_process: |
| 110 | +      - name: "gsub" |
| 111 | +        pattern: "^\\s*$" |
| 112 | +        replacement: "Untitled" |
| 113 | +``` |
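
To see what the pattern in this configuration matches, it can be tried directly in Ruby. This sketch assumes the `gsub` post-processor behaves like Ruby's `String#gsub` with the pattern compiled to a `Regexp`; `fallback_title` is a hypothetical helper written only for this example.

```ruby
# The YAML string "^\\s*$" corresponds to this regular expression.
BLANK_TITLE = /^\s*$/

# Replace an empty or whitespace-only title with a fallback.
def fallback_title(title)
  title.gsub(BLANK_TITLE, "Untitled")
end

empty_result = fallback_title("")
spaces_result = fallback_title("   ")
normal_result = fallback_title("Release notes")

puts empty_result, spaces_result, normal_result
```

In Ruby, `^` and `$` anchor to line boundaries rather than to the whole string, so a multi-line title that contains a blank line would also be affected by this pattern.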
| 114 | + |
| 115 | +## Best Practices |
| 116 | + |
| 117 | +1. **Test configurations:** Always test your configurations before deploying |
| 118 | +2. **Monitor performance:** Use health checks to detect issues early |
| 119 | +3. **Keep selectors simple:** Complex selectors are harder to maintain |
| 120 | +4. **Use auto-source when possible:** It's often more reliable than manual selectors |
| 121 | +5. **Handle errors gracefully:** Implement proper error handling in your applications |
| 122 | +6. **Validate your data:** Ensure your selectors return valid content |