---
layout: default
title: Selectors
parent: Reference
grand_parent: Ruby Gem
nav_order: 3
---

The selectors scraper gives you fine-grained control over content extraction using CSS selectors.

A valid RSS item requires at least a `title` or a `description`.

At a minimum, you need an `items` selector to define the list of articles and a `title` selector for the article titles.

```yaml
channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"
```

To simplify configuration, html2rss can automatically extract the `title`, `url`, and `image` from each item. This feature is enabled by default.

```yaml
selectors:
  items:
    selector: ".article"
    enhance: true # default: true
```

While you can define any named selector, only the following are used in the final RSS feed:

| RSS 2.0 Tag | html2rss Name |
| ------------- | --------------- |
| title | title |
| description | description |
| link | url |
| author | author |
| category | categories |
| guid | guid |
| enclosure | enclosure |
| pubDate | published_at |
| comments | comments |
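
As a rough illustration of the mapping above, the following sketch fills four of the listed names. The CSS selectors (`.article`, `.summary`, `.byline`) are hypothetical and depend entirely on the markup of the target page.

```yaml
channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"   # hypothetical: one match per article
  title:
    selector: "h1"         # becomes <title>
  description:
    selector: ".summary"   # hypothetical class; becomes <description>
  url:
    selector: "a"
    extractor: "href"      # becomes <link>
  author:
    selector: ".byline"    # hypothetical class; becomes <author>
```
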

Each selector can be configured with the following options:

| Name | Description |
|---|---|
| `selector` | The CSS selector for the target element. |
| `extractor` | The extractor to use for this selector. |
| `attribute` | The attribute name (required for the `attribute` extractor). |
| `static` | The static value (required for the `static` extractor). |
| `post_process` | A list of post-processors to apply to the value. |

Extractors define how to get the value from a selected element.

- `text`: The inner text of the element (default).
- `html`: The outer HTML of the element.
- `href`: The value of the `href` attribute.
- `attribute`: The value of a specified attribute.
- `static`: A static value.

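
The extractors listed above might be combined as in this sketch. The `thumbnail` selector is a custom name chosen for illustration, the CSS selectors and the static value are hypothetical, and omitting `selector` for the `static` extractor is an assumption worth checking against your html2rss version.

```yaml
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"            # no extractor given, so text is used
  description:
    selector: ".body"
    extractor: "html"         # outer HTML of the matched element
  url:
    selector: "a"
    extractor: "href"         # value of the href attribute
  thumbnail:                  # custom named selector (illustrative)
    selector: "img"
    extractor: "attribute"
    attribute: "src"          # required by the attribute extractor
  author:
    extractor: "static"
    static: "Example Newsroom" # hypothetical fixed value
```
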
Post-processors manipulate the extracted value.

- `gsub`: Performs a global substitution on a string.
- `html_to_markdown`: Converts HTML to Markdown.
- `markdown_to_html`: Converts Markdown to HTML.
- `parse_time`: Parses a string into a `Time` object.
- `parse_uri`: Parses a string into a `URI` object.
- `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
- `substring`: Extracts a substring from a string.
- `template`: Creates a new string from a template and other selector values.

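
A minimal sketch of how post-processors might be chained onto a selector. It assumes each `post_process` entry is a mapping with a `name` key, and that `gsub` takes `pattern` and `replacement` options. These keys are not spelled out on this page, so verify them against the post-processor reference for your version.

```yaml
selectors:
  title:
    selector: "h1"
    post_process:
      - name: "gsub"            # assumed key names: name, pattern, replacement
        pattern: "Breaking: "   # hypothetical prefix to strip
        replacement: ""
  published_at:
    selector: "time"
    extractor: "attribute"
    attribute: "datetime"
    post_process:
      - name: "parse_time"      # turns the extracted string into a Time
```
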

Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.

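
For instance, a description extracted as HTML could be sanitized like this. The `.content` selector is hypothetical, and the `name` key inside `post_process` is an assumption about the entry format.

```yaml
selectors:
  description:
    selector: ".content"        # hypothetical container of the article body
    extractor: "html"           # raw HTML from the page is untrusted
    post_process:
      - name: "sanitize_html"   # assumed entry format: a mapping with a name key
```
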
To add categories to an item, provide a list of selector names to the `categories` selector.

```yaml
selectors:
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
```

To create a custom GUID for an item, provide a list of selector names to the `guid` selector.

```yaml
selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url
```

To add an enclosure (e.g., an image, audio, or video file) to an item, use the `enclosure` selector to specify the URL of the file.

```yaml
selectors:
  items:
    selector: ".post"
  title:
    selector: "h2"
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mp3"
```