
title	Auto Source
description	Learn about the auto_source scraper that automatically finds items on a page. No CSS selectors needed - html2rss intelligently detects content.

The auto_source scraper automatically finds items on a page, so you don't have to specify CSS selectors.

To enable it, add auto_source: {} to your configuration:

channel:
  url: https://example.com
auto_source: {}

How It Works

auto_source uses the following strategies to find content:

schema: Parses <script type="application/ld+json"> tags containing structured data (e.g., Schema.org).
semantic_html: Searches for semantic HTML5 tags like <article>, <main>, and <section>.
html: Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
json_state: Single-page applications often stash pre-rendered article data in <script type="application/json"> tags or global variables such as window.__NEXT_DATA__, window.__NUXT__, or window.STATE. The JSON-state scraper walks those blobs, finds arrays with title/url pairs, and converts them into the same hashes produced by HtmlExtractor.

json_state Limitations: the scraper requires discoverable arrays of hashes containing clear title and url fields. Minified or obfuscated state objects, heavily encoded values, or blobs that require executing embedded functions are ignored.
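
The walk described for json_state can be illustrated with a small, self-contained sketch. It is not the gem's code: find_article_arrays is a hypothetical helper, the sample blob is made up, and the exact matching rules html2rss applies may differ. It only shows the idea of searching a parsed state object for arrays of hashes that carry both a title and a url.

```ruby
require "json"

# Hypothetical helper (not part of html2rss): recursively collect arrays
# whose elements are all hashes with string "title" and "url" values.
def find_article_arrays(node, found = [])
  case node
  when Array
    article_like = !node.empty? && node.all? do |el|
      el.is_a?(Hash) && el["title"].is_a?(String) && el["url"].is_a?(String)
    end

    if article_like
      found << node
    else
      node.each { |child| find_article_arrays(child, found) }
    end
  when Hash
    node.each_value { |child| find_article_arrays(child, found) }
  end
  found
end

# Made-up blob shaped like a framework state object.
state = JSON.parse(<<~BLOB)
  {
    "props": {
      "pageProps": {
        "articles": [
          { "title": "First post", "url": "/posts/1" },
          { "title": "Second post", "url": "/posts/2" }
        ],
        "tags": ["ruby", "rss"]
      }
    }
  }
BLOB

# Normalize every match into a simple title/url hash.
items = find_article_arrays(state).flatten(1).map do |entry|
  { title: entry["title"], url: entry["url"] }
end
```

In this sketch the "tags" array is skipped because its elements are not hashes, which mirrors the limitation above: data without clear title and url fields yields nothing.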

Fine-Tuning

You can tune auto_source by toggling individual scrapers and by filtering the items they return.

Scraper Options

Enable or disable specific scrapers and adjust their settings:

channel:
  url: https://example.com
auto_source:
  scraper:
    schema:
      enabled: false # default: true
    semantic_html:
      enabled: false # default: true
    json_state:
      enabled: false # default: true
    html:
      enabled: true
      minimum_selector_frequency: 3 # default: 2
      use_top_selectors: 3 # default: 5
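
As a rough mental model for the two html options, the sketch below assumes that minimum_selector_frequency drops selectors seen fewer than that many times and that use_top_selectors keeps only the most frequent ones that remain. This is one plausible reading, not the gem's implementation; top_selectors and the sample selectors are hypothetical.

```ruby
# Hypothetical helper (not part of html2rss): count selector occurrences,
# drop the rare ones, then keep the most frequent few.
def top_selectors(selectors, minimum_selector_frequency: 2, use_top_selectors: 5)
  selectors
    .tally                                                   # selector => count
    .select { |_selector, count| count >= minimum_selector_frequency }
    .sort_by { |selector, count| [-count, selector] }        # most frequent first
    .first(use_top_selectors)
    .map(&:first)
end

# Made-up selectors, one entry per matching link found on a page.
observed = %w[
  div.card>a div.card>a div.card>a div.card>a
  li.nav>a li.nav>a li.nav>a
  footer>a footer>a
  aside>a
]

strict  = top_selectors(observed, minimum_selector_frequency: 3, use_top_selectors: 3)
default = top_selectors(observed)
```

Under these assumptions, raising minimum_selector_frequency trades recall for precision: the strict call ignores the footer links that the default call would still consider.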

Cleanup Options

Remove unwanted items from the results:

channel:
  url: https://example.com
auto_source:
  cleanup:
    keep_different_domain: false # default: true
    min_words_title: 4 # default: 3
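
The effect of these two options can be approximated with a short filter. The sketch assumes that min_words_title means "at least this many whitespace-separated words" and that keep_different_domain: false drops items whose host differs from the channel URL's host; the gem's actual rules may be stricter or looser. The cleanup helper and the sample items are hypothetical.

```ruby
require "uri"

# Hypothetical helper (not part of html2rss): filter scraped items by title
# length and, optionally, by whether they live on the channel's host.
def cleanup(items, channel_url:, keep_different_domain: true, min_words_title: 3)
  channel_host = URI.parse(channel_url).host

  items.select do |item|
    enough_words = item[:title].to_s.split.size >= min_words_title
    # URI.join resolves relative item URLs against the channel URL.
    same_domain = URI.join(channel_url, item[:url]).host == channel_host

    enough_words && (keep_different_domain || same_domain)
  end
end

# Made-up items for illustration.
items = [
  { title: "A long enough headline here", url: "https://example.com/a" },
  { title: "Too short", url: "https://example.com/b" },
  { title: "An article hosted somewhere else", url: "https://other.example.org/c" }
]

kept = cleanup(items,
               channel_url: "https://example.com",
               keep_different_domain: false,
               min_words_title: 4)
```

With the settings from the configuration above, only the first item survives in this sketch: the second title has too few words and the third is hosted on another domain.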

For detailed documentation on the Ruby API, see the official YARD documentation.
