title: Auto Source
description: Learn about the auto_source scraper that automatically finds items on a page. No CSS selectors needed - html2rss intelligently detects content.

The auto_source scraper automatically finds items on a page, so you don't have to specify CSS selectors.

To enable it, add auto_source: {} to your configuration:

channel:
  url: https://example.com
auto_source: {}

How It Works

auto_source uses the following strategies to find content:

  1. schema: Parses <script type="application/ld+json"> tags containing structured data (e.g., Schema.org).
  2. semantic_html: Searches for semantic HTML5 tags like <article>, <main>, and <section>.
  3. html: Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
  4. json_state: Single-page applications often stash pre-rendered article data in <script type="application/json"> tags or global variables such as window.__NEXT_DATA__, window.__NUXT__, or window.STATE. The JSON-state scraper walks those blobs, finds arrays with title/url pairs, and converts them into the same hashes produced by HtmlExtractor.
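
To make the schema strategy more concrete, here is a minimal sketch of reading JSON-LD out of a page using only Ruby's standard library. It is not html2rss's implementation: the helper names (extract_json_ld, articles_from), the regex-based tag matching, and the list of accepted @type values are all assumptions made for illustration.

```ruby
require 'json'

# Hypothetical pattern, not part of html2rss: matches JSON-LD script tags in raw HTML.
JSON_LD_SCRIPT = %r{<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>}mi

# Hypothetical helper: returns every parsed JSON-LD node found in the HTML.
def extract_json_ld(html)
  html.scan(JSON_LD_SCRIPT).flatten.flat_map do |body|
    parsed = JSON.parse(body)
    parsed.is_a?(Array) ? parsed : [parsed]
  rescue JSON::ParserError
    [] # skip malformed blocks instead of failing the whole page
  end
end

# Hypothetical helper: keeps article-like nodes and maps them to title/url hashes.
# The accepted @type values are an assumption for this sketch.
def articles_from(html)
  extract_json_ld(html)
    .select { |node| node.is_a?(Hash) && %w[Article NewsArticle BlogPosting].include?(node['@type']) }
    .map { |node| { title: node['headline'], url: node['url'] } }
end
```

A real scraper would likely use an HTML parser rather than a regular expression; the regex keeps this sketch dependency-free.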

json_state Limitations: the scraper requires discoverable arrays of hashes containing clear title and url fields. Minified or obfuscated state objects, heavily encoded values, or blobs that require executing embedded functions are ignored.
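
The walk described for json_state can be sketched roughly as follows. This is an illustration under stated assumptions, not the scraper's actual code: the helper names (item_like?, find_item_arrays, items_from_state) are hypothetical, and the rule "every entry must have string title and url keys" is a simplification of whatever heuristics the real scraper applies.

```ruby
require 'json'

# Hypothetical check: does this value look like an item with a clear title and url?
def item_like?(value)
  value.is_a?(Hash) && value['title'].is_a?(String) && value['url'].is_a?(String)
end

# Hypothetical walker: recursively collects arrays whose entries all look like items.
def find_item_arrays(node, found = [])
  case node
  when Array
    if !node.empty? && node.all? { |entry| item_like?(entry) }
      found << node
    else
      node.each { |child| find_item_arrays(child, found) }
    end
  when Hash
    node.each_value { |child| find_item_arrays(child, found) }
  end
  found
end

# Hypothetical entry point: parses a JSON state blob and returns title/url hashes.
def items_from_state(json)
  find_item_arrays(JSON.parse(json)).flatten(1).map do |entry|
    { title: entry['title'], url: entry['url'] }
  end
end
```

Because the sketch only inspects plain keys and values, it also shows why obfuscated keys or values that need code execution would be skipped: nothing in the blob would pass item_like?.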

Fine-Tuning

You can customize auto_source to improve its accuracy.

Scraper Options

Enable or disable specific scrapers and adjust their settings:

channel:
  url: https://example.com
auto_source:
  scraper:
    schema:
      enabled: false # default: true
    semantic_html:
      enabled: false # default: true
    json_state:
      enabled: false # default: true
    html:
      enabled: true
      minimum_selector_frequency: 3 # default: 2
      use_top_selectors: 3 # default: 5
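
As a rough illustration of how the two html options could interact, the sketch below counts candidate selectors, drops those seen fewer than minimum_selector_frequency times, and keeps the use_top_selectors most frequent ones. The function name pick_selectors and the exact semantics are assumptions inferred from the option names, not a description of html2rss internals; the defaults mirror the ones noted in the comments above.

```ruby
# Hypothetical sketch: selects the most frequent selectors from a list of candidates.
# `selectors` is assumed to hold one selector string per link found on the page.
def pick_selectors(selectors, minimum_selector_frequency: 2, use_top_selectors: 5)
  selectors
    .tally                                                      # selector => occurrence count
    .select { |_selector, count| count >= minimum_selector_frequency }
    .sort_by { |selector, count| [-count, selector] }           # most frequent first, ties by name
    .first(use_top_selectors)
    .map(&:first)                                               # keep only the selector strings
end

seen = ['ul > li > a', 'ul > li > a', 'ul > li > a', 'nav > a', 'nav > a', 'footer > a']
pick_selectors(seen, minimum_selector_frequency: 3, use_top_selectors: 3)
```

Under these assumptions, raising minimum_selector_frequency filters out one-off links such as footer entries, while lowering use_top_selectors limits how many repeating patterns are considered.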

Cleanup Options

Remove unwanted items from the results:

channel:
  url: https://example.com
auto_source:
  cleanup:
    keep_different_domain: false # default: true
    min_words_title: 4 # default: 3
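
The effect of the cleanup options can be sketched as a simple filter. This is an approximation for illustration only: the cleanup function, the item hash shape, and the choice to compare exact hosts are assumptions, and the real implementation may define "different domain" differently.

```ruby
require 'uri'

# Hypothetical sketch: filters items the way the cleanup options suggest.
# Items are assumed to be hashes with :title and :url keys.
def cleanup(items, channel_url:, keep_different_domain: true, min_words_title: 3)
  channel_host = URI.parse(channel_url).host

  items.select do |item|
    # Resolve relative URLs against the channel URL before comparing hosts.
    same_host = URI.join(channel_url, item[:url]).host == channel_host
    enough_words = item[:title].to_s.split.size >= min_words_title

    (keep_different_domain || same_host) && enough_words
  end
end

items = [
  { title: 'A long enough headline here', url: '/posts/1' },
  { title: 'Too short', url: '/posts/2' },
  { title: 'Another sufficiently long headline', url: 'https://other.example.org/x' }
]
cleanup(items, channel_url: 'https://example.com', keep_different_domain: false, min_words_title: 4)
```

With the settings from the configuration above, only the first item would survive in this sketch: the second title is too short and the third link points to another host.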

For detailed documentation on the Ruby API, see the official YARD documentation.