Commit c2b676e

committed
phase 2
1 parent 000bd7e commit c2b676e

26 files changed

Lines changed: 873 additions & 0 deletions
Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
---
2+
title: 'Get Involved'
3+
description: 'Engage with the html2rss project. Contribute and connect with the community.'
4+
---
5+
6+
# Get Involved
7+
8+
- [**Sponsoring**](/get-involved/sponsoring)
9+
10+
Engage with the `html2rss` project. Contribute and connect with the community.
11+
12+
- [**Project Roadmap**](https://github.com/orgs/html2rss/projects/3/views/1): View current work, plans, and priorities.
13+
- [**Report Bugs & Discuss Features**](/get-involved/issues-and-features): Report bugs or propose features.
14+
- [**Join Community Discussions**](/get-involved/discussions): Connect with users and contributors.
15+
- [**Contribute to html2rss**](/get-involved/contributing): Contribute code, documentation, or feed configurations.
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
---
2+
title: 'Advanced Content Extraction'
3+
description: 'While basic selectors are straightforward, you can achieve very precise content extraction by combining selectors with different extractors and post-processors.'
4+
---
5+
6+
# Advanced Content Extraction with Selectors
7+
8+
While basic selectors are straightforward, you can achieve very precise content extraction by combining selectors with different extractors and post-processors.
9+
10+
## Extractors
11+
12+
Learn how to extract specific attributes (like `src` for images) or static values. See [Extractors](/ruby-gem/reference/selectors).
13+
14+
## Post Processors
15+
16+
Manipulate extracted text, sanitize HTML, convert Markdown, or apply custom logic. See [Post Processors](/ruby-gem/reference/selectors).
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
---
2+
title: 'How-To Guides'
3+
description: 'This section provides practical examples and solutions for common tasks when using the html2rss gem.'
4+
---
5+
6+
# How-To Guides
7+
8+
This section provides practical examples and solutions for common tasks when using the `html2rss` gem.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
---
2+
title: 'Ruby Gem'
3+
description: 'This section provides comprehensive documentation for the html2rss Ruby gem.'
4+
---
5+
6+
# The html2rss Ruby Gem
7+
8+
This section provides comprehensive documentation for the `html2rss` Ruby gem.
9+
10+
## Getting Started
11+
12+
If you are new to `html2rss`, we recommend starting with the [tutorials](/ruby-gem/tutorials).
13+
14+
## Documentation Sections
15+
16+
- **[Tutorials](/ruby-gem/tutorials)**: Step-by-step guides to help you get started with `html2rss`.
17+
- **[How-To Guides](/ruby-gem/how-to)**: Practical examples and solutions for common tasks.
18+
- **[Reference](/ruby-gem/reference)**: Detailed information on configuration options.
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
1+
---
2+
title: 'Installation'
3+
description: 'This guide will walk you through the process of installing html2rss on your system.'
4+
---
5+
6+
# Installation
7+
8+
This guide walks you through installing html2rss. Several installation methods are available; pick the one that fits your environment.
9+
10+
---
11+
12+
### Prerequisites
13+
14+
- **Ruby:** html2rss is built with Ruby. Ensure you have Ruby installed (version 3.2 or higher required). You can check your Ruby version by running `ruby -v` in your terminal. If you don't have Ruby, visit [ruby-lang.org](https://www.ruby-lang.org/en/documentation/installation/) for installation instructions.
15+
- **Bundler (Recommended):** Bundler is a Ruby gem that manages your application's dependencies, and you need it for the Gemfile method below. It ships with current Ruby versions; if `bundle -v` reports that it is missing, install it with `gem install bundler`.
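
As a minimal sketch of the version prerequisite above, the following compares a Ruby version string against the 3.2 minimum. The helper name `ruby_supported?` and the constant `MINIMUM_RUBY` are hypothetical and not part of html2rss; `Gem::Version` comes from RubyGems, which ships with Ruby.

```ruby
# Hypothetical helper, not part of html2rss: checks a version string
# against the minimum Ruby version named in the prerequisites.
MINIMUM_RUBY = Gem::Version.new('3.2')

def ruby_supported?(version_string)
  # Gem::Version compares segment by segment, so '3.10' ranks above '3.2'.
  Gem::Version.new(version_string) >= MINIMUM_RUBY
end

puts "Ruby #{RUBY_VERSION} supported: #{ruby_supported?(RUBY_VERSION)}"
```

Comparing versions as `Gem::Version` objects rather than plain strings avoids the trap where `'3.10' < '3.2'` when compared as text.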
16+
17+
---
18+
19+
### Method 1: Gem Installation (Recommended for CLI Usage)
20+
21+
The simplest way to get html2rss for command-line usage is to install it as a Ruby gem.
22+
23+
```bash
gem install html2rss
```
26+
27+
After installation, you should be able to run `html2rss --version` to confirm it's working.
28+
29+
---
30+
31+
### Method 2: Using a Gemfile (For Ruby Projects)
32+
33+
If you're integrating html2rss into an existing Ruby project, add it to your `Gemfile`:
34+
35+
```ruby
# Gemfile
gem 'html2rss'
```
39+
40+
Then, run `bundle install` in your project directory.
41+
42+
---
43+
44+
### Method 3: GitHub Codespaces (For Cloud Development)
45+
46+
For a quick start without local setup, you can develop html2rss directly in your browser using GitHub Codespaces:
47+
48+
[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://github.com/codespaces/new?repo=html2rss/html2rss)
49+
50+
The Codespace comes pre-configured with Ruby 3.4, all dependencies, and VS Code extensions ready to go!
51+
52+
---
53+
54+
### Verifying Installation
55+
56+
To ensure html2rss is installed correctly, open your terminal and run:
57+
58+
```bash
html2rss --version
```
61+
62+
You should see the installed version number. If you encounter any issues, please refer to the [Troubleshooting Guide](/support/troubleshooting).
63+
64+
---
65+
66+
### Next Steps
67+
68+
Now that html2rss is installed, let's create your [first RSS feed](/ruby-gem/tutorials/your-first-feed)!
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
---
2+
title: 'Reference'
3+
description: 'This section provides detailed information on the various configuration options available in html2rss.'
4+
---
5+
6+
# Reference
7+
8+
This section provides detailed information on the various configuration options available in `html2rss`.
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
1+
---
2+
title: 'Selectors'
3+
description: 'The selectors scraper gives you fine-grained control over content extraction using CSS selectors.'
4+
---
5+
6+
# Selectors
7+
8+
The `selectors` scraper gives you fine-grained control over content extraction using CSS selectors.
9+
10+
> A valid RSS item requires at least a `title` or a `description`.
11+
12+
## Basic Configuration
13+
14+
At a minimum, you need an `items` selector to define the list of articles and a `title` selector for the article titles.
15+
16+
```yml
channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"
```
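
The minimum described above can be checked mechanically before generating a feed. The sketch below parses the same configuration with Ruby's standard `yaml` library and reports missing pieces. The helper `config_problems` is hypothetical and not part of html2rss, and it only mirrors the rules stated on this page (it does not account for automatic item enhancement).

```ruby
require 'yaml'

CONFIG_YAML = <<~YAML
  channel:
    url: "https://example.com"
  selectors:
    items:
      selector: ".article"
    title:
      selector: "h1"
YAML

# Hypothetical helper, not part of html2rss: lists what is missing
# from a parsed configuration hash.
def config_problems(config)
  problems = []
  problems << 'channel.url is missing' unless config.dig('channel', 'url')

  selectors = config['selectors'] || {}
  problems << 'selectors.items.selector is missing' unless selectors.dig('items', 'selector')

  # A valid RSS item needs at least a title or a description.
  unless selectors.key?('title') || selectors.key?('description')
    problems << 'selectors need a title or a description'
  end

  problems
end

config = YAML.safe_load(CONFIG_YAML)
problems = config_problems(config)
puts problems.empty? ? 'Configuration looks complete.' : problems
```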
25+
26+
## Automatic Item Enhancement
27+
28+
To simplify configuration, `html2rss` can automatically extract the `title`, `url`, and `image` from each item. This feature is enabled by default.
29+
30+
```yml
selectors:
  items:
    selector: ".article"
    enhance: true # default: true
```
36+
37+
## RSS 2.0 Selectors
38+
39+
While you can define any named selector, only the following are used in the final RSS feed:
40+
41+
| RSS 2.0 Tag   | `html2rss` Name | Notes                          |
| ------------- | --------------- | ------------------------------ |
| `title`       | `title`         |                                |
| `description` | `description`   |                                |
| `link`        | `url`           |                                |
| `author`      | `author`        |                                |
| `category`    | `categories`    |                                |
| `guid`        | `guid`          |                                |
| `enclosure`   | `enclosure`     |                                |
| `pubDate`     | `published_at`  |                                |
| `comments`    | `comments`      | ⚠️ _Not currently implemented_ |
52+
53+
## Selector Options
54+
55+
Each selector can be configured with the following options:
56+
57+
| Name           | Description                                              |
| -------------- | -------------------------------------------------------- |
| `selector`     | The CSS selector for the target element.                 |
| `extractor`    | The extractor to use for this selector.                  |
| `attribute`    | The attribute name (required for `attribute` extractor). |
| `static`       | The static value (required for `static` extractor).      |
| `post_process` | A list of post-processors to apply to the value.         |
64+
65+
### Extractors
66+
67+
Extractors define how to get the value from a selected element.
68+
69+
- `text`: The inner text of the element (default).
70+
- `html`: The outer HTML of the element.
71+
- `href`: The value of the `href` attribute.
72+
- `attribute`: The value of a specified attribute.
73+
- `static`: A static value.
74+
75+
### Post-Processors
76+
77+
Post-processors manipulate the extracted value.
78+
79+
- `gsub`: Performs a global substitution on a string.
80+
- `html_to_markdown`: Converts HTML to Markdown.
81+
- `markdown_to_html`: Converts Markdown to HTML.
82+
- `parse_time`: Parses a string into a `Time` object.
83+
- `parse_uri`: Parses a string into a `URI` object.
84+
- `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
85+
- `substring`: Extracts a substring from a string.
86+
- `template`: Creates a new string from a template and other selector values.
87+
88+
> Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.
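
To give a feel for what several of these post-processors do, here are plain Ruby standard-library equivalents applied to made-up sample values. This is an analogy only: it does not use html2rss, and the gem's own option names and implementation may differ.

```ruby
require 'time'
require 'uri'

# Made-up raw values, as they might come out of an extractor.
raw_title = '  Breaking: Ruby 3.4 released  '
raw_date  = '2024-12-25 10:00:00 UTC'
raw_link  = 'https://example.com/blog/post-1'

# gsub: global substitution on a string.
title = raw_title.strip.gsub('Breaking: ', '')

# substring: keep only a slice of the string (start 0, length 8).
short_title = title[0, 8]

# parse_time: turn a date string into a Time object.
published_at = Time.parse(raw_date)

# parse_uri: turn a string into a URI object.
link = URI.parse(raw_link)

# template: build a new string from other values.
summary = format('%<title>s (%<host>s)', title: title, host: link.host)

puts title, short_title, published_at.year, link.host, summary
```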
89+
90+
## Advanced Usage
91+
92+
### Categories
93+
94+
To add categories to an item, provide a list of selector names to the `categories` selector.
95+
96+
```yml
selectors:
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch
```
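
Note that `categories` lists selector names, not CSS selectors. The sketch below illustrates that indirection in plain Ruby with made-up item values; it is not html2rss code, and how the gem handles selectors that yield no value is not covered here.

```ruby
# Made-up values that the named selectors might yield for one item.
item_values = {
  'title' => 'Live Session',
  'genre' => 'Jazz',
  'branch' => 'Downtown'
}

# The names listed under `categories` in the configuration.
category_names = %w[genre branch]

# Look up each name; in this sketch, names without a value are skipped.
categories = category_names.filter_map { |name| item_values[name] }

puts categories.inspect
```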
106+
107+
### Custom GUID
108+
109+
To create a custom GUID for an item, provide a list of selector names to the `guid` selector.
110+
111+
```yml
selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url
```
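
Choosing which selectors feed the GUID matters because feed readers use the GUID to decide whether an item is new. The sketch below shows the idea with a hypothetical helper, `guid_for`. Joining the values and hashing them with SHA-1 is an assumption made for illustration; html2rss may derive its GUIDs differently.

```ruby
require 'digest/sha1'

# Hypothetical helper, not part of html2rss: builds an identifier from
# the values of the selectors listed under `guid`.
def guid_for(item_values, guid_selector_names)
  source = guid_selector_names.map { |name| item_values.fetch(name).to_s }.join('|')
  Digest::SHA1.hexdigest(source)
end

original = { 'title' => 'First Post Title', 'url' => 'https://example.com/blog/post-1' }
retitled = original.merge('title' => 'First Post Title (updated)')

# Based on the URL, the identifier survives a title edit.
puts guid_for(original, ['url']) == guid_for(retitled, ['url'])

# Based on the title, the same edit produces a different identifier.
puts guid_for(original, ['title']) == guid_for(retitled, ['title'])
```

In this sketch, a GUID built from `url` stays the same when only the title changes, so readers would not see the edited post as a new item.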
121+
122+
### Enclosures
123+
124+
To add an enclosure (e.g., an image, audio, or video file) to an item, use the `enclosure` selector to specify the URL of the file.
125+
126+
```yml
selectors:
  items:
    selector: ".post"
  title:
    selector: "h2"
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mp3"
```
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
---
2+
title: 'Tutorials'
3+
description: 'This section provides step-by-step tutorials to help you get started with the html2rss Ruby gem.'
4+
---
5+
6+
# Tutorials
7+
8+
This section provides step-by-step tutorials to help you get started with the `html2rss` Ruby gem.
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
1+
---
2+
title: 'Scraping a Simple Blog List'
3+
description: 'This example demonstrates how to create a feed from a typical blog that has a list of articles on its homepage.'
4+
---
5+
6+
# Tutorial: Scraping a Simple Blog List
7+
8+
This example demonstrates how to create a feed from a typical blog that has a list of articles on its homepage.
9+
10+
---
11+
12+
## The Goal
13+
14+
We want to create an RSS feed that contains the title, link, and summary of each article on the blog.
15+
16+
---
17+
18+
## The HTML
19+
20+
Here's a simplified view of the HTML structure we're targeting. The key is to find a container element that wraps each blog post (in this case, `.post-item`) and then find the selectors for the title, link, and summary within that container.
21+
22+
```html
<div class="posts">
  <div class="post-item">
    <h2 class="post-title"><a href="/blog/post-1">First Post Title</a></h2>
    <p class="post-summary">Summary of the first post...</p>
  </div>
  <div class="post-item">
    <h2 class="post-title"><a href="/blog/post-2">Second Post Title</a></h2>
    <p class="post-summary">Summary of the second post...</p>
  </div>
</div>
```
34+
35+
---
36+
37+
## The Configuration
38+
39+
This configuration uses the `selectors` scraper to precisely extract the content we want.
40+
41+
```yaml
channel:
  url: https://example.com/blog
selectors:
  items:
    selector: ".post-item"
  title:
    selector: ".post-title a"
  url:
    selector: ".post-title a"
    extractor: "href"
  description:
    selector: ".post-summary"
```
55+
56+
### Configuration Breakdown
57+
58+
- **`items.selector: ".post-item"`**: This is the most important selector. It tells `html2rss` that every element with the class `post-item` is a single item in the RSS feed.
59+
- **`title.selector: ".post-title a"`**: Within each `.post-item`, this finds the `<a>` tag inside the element with the class `post-title`. No extractor is set, so the default `text` extractor uses the link text as the item title.
60+
- **`url.selector: ".post-title a"`**: This finds the same `<a>` tag.
61+
- **`url.extractor: "href"`**: This extracts the URL from the `href` attribute of the `<a>` tag.
62+
- **`description.selector: ".post-summary"`**: This finds the element with the class `post-summary`; its text becomes the item description.
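
To make the breakdown concrete, the sketch below walks the sample HTML and collects the title, URL, and description of each post. It does not use html2rss: it relies on REXML (bundled with Ruby) and XPath expressions as stand-ins for the CSS selectors, which works here only because the sample markup is well-formed XML. Resolving the relative `href` against the channel URL is an assumption made for illustration.

```ruby
require 'rexml/document'
require 'uri'

SAMPLE_HTML = <<~HTML
  <div class="posts">
    <div class="post-item">
      <h2 class="post-title"><a href="/blog/post-1">First Post Title</a></h2>
      <p class="post-summary">Summary of the first post...</p>
    </div>
    <div class="post-item">
      <h2 class="post-title"><a href="/blog/post-2">Second Post Title</a></h2>
      <p class="post-summary">Summary of the second post...</p>
    </div>
  </div>
HTML

CHANNEL_URL = 'https://example.com/blog'

doc = REXML::Document.new(SAMPLE_HTML)

# XPath stand-in for the `.post-item` items selector.
items = REXML::XPath.match(doc, "//div[@class='post-item']").map do |post|
  # XPath stand-ins for `.post-title a` and `.post-summary`.
  link = REXML::XPath.first(post, ".//h2[@class='post-title']/a")
  summary = REXML::XPath.first(post, ".//p[@class='post-summary']")

  {
    title: link.text,                                            # like the `text` extractor
    url: URI.join(CHANNEL_URL, link.attributes['href']).to_s,    # like the `href` extractor
    description: summary.text
  }
end

items.each { |item| puts item }
```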
