| 1 | +--- |
| 2 | +title: 'Selectors' |
| 3 | +description: 'The selectors scraper gives you fine-grained control over content extraction using CSS selectors.' |
| 4 | +--- |
| 5 | + |
| 6 | +# Selectors |
| 7 | + |
| 8 | +The `selectors` scraper gives you fine-grained control over content extraction using CSS selectors. |
| 9 | + |
| 10 | +> A valid RSS item requires at least a `title` or a `description`. |
| 11 | + |
| 12 | +## Basic Configuration |
| 13 | + |
| 14 | +At a minimum, you need an `items` selector to define the list of articles and a `title` selector for the article titles. |
| 15 | + |
| 16 | +```yml |
| 17 | +channel: |
| 18 | +  url: "https://example.com" |
| 19 | +selectors: |
| 20 | +  items: |
| 21 | +    selector: ".article" |
| 22 | +  title: |
| 23 | +    selector: "h1" |
| 24 | +``` |
| 25 | + |
| 26 | +## Automatic Item Enhancement |
| 27 | + |
| 28 | +To simplify configuration, `html2rss` can automatically extract the `title`, `url`, and `image` from each item. This feature is enabled by default. |
| 29 | + |
| 30 | +```yml |
| 31 | +selectors: |
| 32 | +  items: |
| 33 | +    selector: ".article" |
| 34 | +    enhance: true # default: true |
| 35 | +``` |
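If the automatically extracted values are not what you want, one option is to switch enhancement off and define the selectors yourself. The following is a minimal sketch; the `.article`, `h2`, and `a` selectors are hypothetical and depend on the markup of the page you scrape.

```yml
selectors:
  items:
    selector: ".article" # hypothetical item container
    enhance: false # turn off automatic title/url/image extraction
  title:
    selector: "h2" # hypothetical heading inside each item
  url:
    selector: "a" # hypothetical link inside each item
    extractor: "href"
```

With `enhance: false`, only the selectors you define explicitly contribute to each item.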
| 36 | + |
| 37 | +## RSS 2.0 Selectors |
| 38 | + |
| 39 | +While you can define any named selector, only the following are used in the final RSS feed: |
| 40 | + |
| 41 | +| RSS 2.0 Tag   | `html2rss` Name | Notes                          | |
| 42 | +| ------------- | --------------- | ------------------------------ | |
| 43 | +| `title`       | `title`         |                                | |
| 44 | +| `description` | `description`   |                                | |
| 45 | +| `link`        | `url`           |                                | |
| 46 | +| `author`      | `author`        |                                | |
| 47 | +| `category`    | `categories`    |                                | |
| 48 | +| `guid`        | `guid`          |                                | |
| 49 | +| `enclosure`   | `enclosure`     |                                | |
| 50 | +| `pubDate`     | `published_at`  |                                | |
| 51 | +| `comments`    | `comments`      | ⚠️ _Not currently implemented_ | |
| 52 | + |
| 53 | +## Selector Options |
| 54 | + |
| 55 | +Each selector can be configured with the following options: |
| 56 | + |
| 57 | +| Name | Description | |
| 58 | +| -------------- | -------------------------------------------------------- | |
| 59 | +| `selector` | The CSS selector for the target element. | |
| 60 | +| `extractor` | The extractor to use for this selector. | |
| 61 | +| `attribute` | The attribute name (required for `attribute` extractor). | |
| 62 | +| `static` | The static value (required for `static` extractor). | |
| 63 | +| `post_process` | A list of post-processors to apply to the value. | |
| 64 | + |
| 65 | +### Extractors |
| 66 | + |
| 67 | +Extractors define how to get the value from a selected element. |
| 68 | + |
| 69 | +- `text`: The inner text of the element (default). |
| 70 | +- `html`: The outer HTML of the element. |
| 71 | +- `href`: The value of the `href` attribute. |
| 72 | +- `attribute`: The value of a specified attribute. |
| 73 | +- `static`: A static value. |
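As a rough illustration of how the extractors listed above combine with the selector options, the sketch below uses one extractor per selector. All CSS selectors, the `thumbnail` name, and the author value are hypothetical placeholders, not taken from a real feed.

```yml
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1" # no extractor given, so the default `text` applies
  description:
    selector: ".summary"
    extractor: "html" # keeps the element's HTML instead of plain text
  url:
    selector: "a"
    extractor: "href"
  thumbnail: # custom named selector, not emitted in the feed by itself
    selector: "img"
    extractor: "attribute"
    attribute: "src" # required by the `attribute` extractor
  author:
    extractor: "static"
    static: "Example Author" # required by the `static` extractor
```

The `static` extractor does not read from the page, so this sketch gives it no `selector`.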
| 74 | + |
| 75 | +### Post-Processors |
| 76 | + |
| 77 | +Post-processors manipulate the extracted value. |
| 78 | + |
| 79 | +- `gsub`: Performs a global substitution on a string. |
| 80 | +- `html_to_markdown`: Converts HTML to Markdown. |
| 81 | +- `markdown_to_html`: Converts Markdown to HTML. |
| 82 | +- `parse_time`: Parses a string into a `Time` object. |
| 83 | +- `parse_uri`: Parses a string into a `URI` object. |
| 84 | +- `sanitize_html`: Sanitizes HTML by removing unsafe markup, such as scripts. |
| 85 | +- `substring`: Extracts a substring from a string. |
| 86 | +- `template`: Creates a new string from a template and other selector values. |
| 87 | + |
| 88 | +> Always use the `sanitize_html` post-processor for any HTML content to prevent security risks. |
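A post-processor chain might look like the sketch below. The key names inside each `post_process` entry (`name`, `pattern`, `replacement`) are assumptions about the option shape and are not confirmed by this page, so check them against the post-processor reference before relying on them. The CSS selectors and the `"Breaking: "` prefix are hypothetical.

```yml
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"
    post_process:
      - name: "gsub" # assumed key names: name, pattern, replacement
        pattern: "Breaking: "
        replacement: ""
  description:
    selector: ".content"
    extractor: "html"
    post_process:
      - name: "sanitize_html" # sanitize any extracted HTML
```

Because `post_process` is a list, processors can be chained; the example assumes they run in the order they are listed.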
| 89 | + |
| 90 | +## Advanced Usage |
| 91 | + |
| 92 | +### Categories |
| 93 | + |
| 94 | +To add categories to an item, provide a list of selector names to the `categories` selector. |
| 95 | + |
| 96 | +```yml |
| 97 | +selectors: |
| 98 | +  genre: |
| 99 | +    selector: ".genre" |
| 100 | +  branch: |
| 101 | +    selector: ".branch" |
| 102 | +  categories: |
| 103 | +    - genre |
| 104 | +    - branch |
| 105 | +``` |
| 106 | + |
| 107 | +### Custom GUID |
| 108 | + |
| 109 | +To create a custom GUID for an item, provide a list of selector names to the `guid` selector. |
| 110 | + |
| 111 | +```yml |
| 112 | +selectors: |
| 113 | +  title: |
| 114 | +    selector: "h1" |
| 115 | +  url: |
| 116 | +    selector: "a" |
| 117 | +    extractor: "href" |
| 118 | +  guid: |
| 119 | +    - url |
| 120 | +``` |
| 121 | + |
| 122 | +### Enclosures |
| 123 | + |
| 124 | +To add an enclosure (e.g., an image, audio, or video file) to an item, use the `enclosure` selector to specify the URL of the file. |
| 125 | + |
| 126 | +```yml |
| 127 | +selectors: |
| 128 | +  items: |
| 129 | +    selector: ".post" |
| 130 | +  title: |
| 131 | +    selector: "h2" |
| 132 | +  enclosure: |
| 133 | +    selector: "audio" |
| 134 | +    extractor: "attribute" |
| 135 | +    attribute: "src" |
| 136 | +    content_type: "audio/mpeg" # registered MIME type for MP3 audio |
| 137 | +``` |