Commit 6d2a855

docs: everything of gem readme
1 parent 37655ed commit 6d2a855

16 files changed

Lines changed: 347 additions & 486 deletions

get-involved/index.md

Lines changed: 3 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,12 +1,14 @@
11
---
22
layout: default
33
title: Get Involved
4+
nav_order: 5
45
has_children: true
5-
nav_order: 4
66
---
77

88
# Get Involved
99

10+
- [**Sponsoring**]({{ '/get-involved/sponsoring' | relative_url }})
11+
1012
Engage with the `html2rss` project. Contribute and connect with the community.
1113

1214
- [**Project Roadmap**]({{ 'https://github.com/orgs/html2rss/projects/3/views/1' }}): View current work, plans, and priorities.

get-involved/sponsoring.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
---
2+
layout: default
3+
title: Sponsoring
4+
parent: Get Involved
5+
nav_order: 4
6+
---
7+
8+
# Sponsoring html2rss
9+
10+
`html2rss` is an open-source project, and its development is made possible by the support of our community. If you find `html2rss` useful, please consider sponsoring the project.
11+
12+
## Why Sponsor?
13+
14+
- **Ensure the project's longevity:** Your sponsorship helps to ensure that the project remains actively maintained and developed.
15+
- **Support new features:** Your contribution will help to fund the development of new features and improvements.
16+
- **Show your appreciation:** Sponsoring is a great way to show your appreciation for the project and the work that goes into it.
17+
18+
## How to Sponsor
19+
20+
You can sponsor the project through [GitHub Sponsors](https://github.com/sponsors/gildesmarais).

ruby-gem/how-to/index.md

Lines changed: 3 additions & 23 deletions
@@ -1,31 +1,11 @@
11
---
22
layout: default
33
title: How-To Guides
4-
nav_order: 3
54
parent: Ruby Gem
5+
nav_order: 2
66
has_children: true
77
---
88

9-
# How-To Guides: Practical `html2rss` Configurations
9+
# How-To Guides
1010

11-
This section provides a collection of ready-to-use `html2rss` configuration examples for various popular websites and common use cases. These examples demonstrate how to tackle different HTML structures and content types.
12-
13-
Use these as a starting point, modify them to fit your specific needs, or get inspiration for building your own custom feeds.
14-
15-
---
16-
17-
### How to Use an Example
18-
19-
1. **Copy the YAML:** Copy the entire YAML configuration block for the example you're interested in.
20-
2. **Save as `.yml`:** Save the copied content into a file, e.g., `my-example.yml`.
21-
3. **Generate the Feed:** Run `html2rss` from your terminal:
22-
```bash
23-
html2rss feed my-example.yml > my-example.xml
24-
```
25-
4. **Enjoy!** Open `my-example.xml` in your favorite RSS reader.
26-
27-
---
28-
29-
### Contribute Your Own Examples!
30-
31-
Have you created a useful `html2rss` configuration? We encourage you to share it with the community by contributing to the [`html2rss-configs`](https://github.com/html2rss/html2rss-configs) repository.
11+
This section provides practical examples and solutions for common tasks when using the `html2rss` gem.
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
1+
---
2+
layout: default
3+
title: Managing Feed Configs
4+
parent: How-To Guides
5+
grand_parent: Ruby Gem
6+
nav_order: 7
7+
---
8+
9+
# Managing Feed Configurations with YAML
10+
11+
For easier management, especially when using the CLI or `html2rss-web`, you can store your feed configurations in a YAML file.
12+
13+
## Global and Feed-Specific Configurations
14+
15+
You can define global settings that apply to all feeds, and then define individual feed configurations under the `feeds` key.
16+
17+
```yml
18+
# Global settings
19+
headers:
20+
"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
21+
"Accept": "text/html"
22+
23+
# Feed-specific settings
24+
feeds:
25+
my-first-feed:
26+
channel:
27+
url: "https://example.com/blog"
28+
selectors:
29+
# ...
30+
my-second-feed:
31+
channel:
32+
url: "https://example.com/news"
33+
selectors:
34+
# ...
35+
```
36+
37+
## Building Feeds from a YAML File
38+
39+
### Ruby
40+
41+
```ruby
42+
require 'html2rss'
43+
44+
# Build a specific feed from the YAML file
45+
my_feed_config = Html2rss.config_from_yaml_file('feeds.yml', 'my-first-feed')
46+
rss = Html2rss.feed(my_feed_config)
47+
puts rss
48+
49+
# If the YAML file contains only one feed, you can omit the feed name
50+
single_feed_config = Html2rss.config_from_yaml_file('single.yml')
51+
rss = Html2rss.feed(single_feed_config)
52+
puts rss
53+
```
54+
55+
### Command Line
56+
57+
```sh
58+
# Build a specific feed
59+
html2rss feed feeds.yml my-first-feed
60+
61+
# Build a feed from a single-feed YAML file
62+
html2rss feed single.yml
63+
```
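
To picture how global settings relate to the entries under `feeds`, here is a plain-Ruby sketch that uses only the standard `yaml` library. The helper `feed_config_for` is hypothetical and not part of `html2rss`; it assumes that every top-level key other than `feeds` is a global setting and that feed-specific keys take precedence. The gem's own loader, `Html2rss.config_from_yaml_file`, may combine the two differently.

```ruby
require 'yaml'

# Hypothetical helper (not part of html2rss): combine the global settings
# with one named feed from a multi-feed YAML document.
def feed_config_for(yaml_text, feed_name)
  config = YAML.safe_load(yaml_text)
  feeds = config.fetch('feeds')
  # Everything outside `feeds` is treated as a global setting.
  global = config.reject { |key, _| key == 'feeds' }
  # Assumption: feed-specific keys win over global keys of the same name.
  global.merge(feeds.fetch(feed_name))
end

feeds_yaml = <<~YAML
  headers:
    Accept: text/html
  feeds:
    my-first-feed:
      channel:
        url: https://example.com/blog
    my-second-feed:
      channel:
        url: https://example.com/news
YAML

merged = feed_config_for(feeds_yaml, 'my-first-feed')
puts merged
```

The merged hash carries the global `headers` alongside the `channel` of `my-first-feed`, and the `feeds` key itself is dropped.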

ruby-gem/how-to/scraping-json.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
1+
---
2+
layout: default
3+
title: Scraping JSON Responses
4+
parent: How-To Guides
5+
grand_parent: Ruby Gem
6+
nav_order: 6
7+
---
8+
9+
# Scraping JSON Responses
10+
11+
When a website returns a JSON response (i.e., with a `Content-Type` of `application/json`), `html2rss` converts the JSON to XML, allowing you to use CSS selectors for data extraction.
12+
13+
> [!NOTE]
14+
> The JSON response must be an Array or a Hash for the conversion to work.
15+
16+
## JSON to XML Conversion Examples
17+
18+
### JSON Object
19+
20+
A JSON object like this:
21+
22+
```json
23+
{
24+
"data": [{ "title": "Headline", "url": "https://example.com" }]
25+
}
26+
```
27+
28+
is converted to this XML structure:
29+
30+
```xml
31+
<object>
32+
<data>
33+
<array>
34+
<object>
35+
<title>Headline</title>
36+
<url>https://example.com</url>
37+
</object>
38+
</array>
39+
</data>
40+
</object>
41+
```
42+
43+
You would use `array > object` as your `items` selector.
44+
45+
### JSON Array
46+
47+
A JSON array like this:
48+
49+
```json
50+
[{ "title": "Headline", "url": "https://example.com" }]
51+
```
52+
53+
is converted to this XML structure:
54+
55+
```xml
56+
<array>
57+
<object>
58+
<title>Headline</title>
59+
<url>https://example.com</url>
60+
</object>
61+
</array>
62+
```
63+
64+
You would use `array > object` as your `items` selector.
65+
66+
## Configuration Examples
67+
68+
### Ruby
69+
70+
```ruby
71+
Html2rss.feed(
72+
headers: {
73+
Accept: 'application/json'
74+
},
75+
channel: {
76+
url: 'http://domainname.tld/whatever.json'
77+
},
78+
selectors: {
79+
title: { selector: 'foo' }
80+
}
81+
)
82+
```
83+
84+
### YAML
85+
86+
```yml
87+
headers:
88+
Accept: application/json
89+
channel:
90+
url: "http://domainname.tld/whatever.json"
91+
selectors:
92+
title:
93+
selector: "foo"
94+
```

ruby-gem/index.md

Lines changed: 6 additions & 15 deletions
@@ -5,25 +5,16 @@ nav_order: 3
55
has_children: true
66
---
77

8-
# The html2rss Ruby Gem ([GitHub Repo](https://github.com/html2rss/html2rss))
8+
# The html2rss Ruby Gem
99

10-
This section documents the `html2rss` Ruby gem, the core library for `html2rss-web`. This documentation targets developers using the gem directly. For an easier start, use the [web application]({{ '/web-application' | relative_url }}).
10+
This section provides comprehensive documentation for the `html2rss` Ruby gem.
1111

1212
## Getting Started
1313

14-
Start with the [Installation guide]({{ '/ruby-gem/tutorials/installation' | relative_url }}). Then, create your [first feed]({{ '/ruby-gem/tutorials/your-first-feed' | relative_url }}).
14+
If you are new to `html2rss`, we recommend starting with the [tutorials]({{ '/ruby-gem/tutorials' | relative_url }}).
1515

1616
## Documentation Sections
1717

18-
- **[Tutorials]({{ '/ruby-gem/tutorials' | relative_url }})**: Step-by-step guides to get you started.
19-
- **[How-To Guides]({{ '/ruby-gem/how-to' | relative_url }})**: Solutions to common problems and tasks.
20-
- **[Reference]({{ '/ruby-gem/reference' | relative_url }})**: Technical details and configuration options.
21-
22-
## Advanced Topics
23-
24-
- [**Handling Dynamic Content and JavaScript**]({{ '/ruby-gem/how-to/handling-dynamic-content' | relative_url }}): Process JavaScript-heavy websites.
25-
- [**Customizing HTTP Requests**]({{ '/ruby-gem/how-to/custom-http-requests' | relative_url }}): Send custom HTTP headers.
26-
- [**Dynamic Parameters in URLs and Headers**]({{ '/ruby-gem/how-to/dynamic-parameters' | relative_url }}): Use dynamic parameters in URLs and headers.
27-
- [**Advanced Content Extraction with Selectors**]({{ '/ruby-gem/how-to/advanced-content-extraction' | relative_url }}): Advanced content extraction.
28-
- [**Styling Your RSS Feed**]({{ '/ruby-gem/how-to/styling-rss-feed' | relative_url }}): Add stylesheets to RSS feeds.
29-
- [**Debugging Your Configuration**]({{ '/support/troubleshooting' | relative_url }}): Debug feed configurations.
18+
- **[Tutorials]({{ '/ruby-gem/tutorials' | relative_url }})**: Step-by-step guides to help you get started with `html2rss`.
19+
- **[How-To Guides]({{ '/ruby-gem/how-to' | relative_url }})**: Practical examples and solutions for common tasks.
20+
- **[Reference]({{ '/ruby-gem/reference' | relative_url }})**: Detailed information on configuration options.

ruby-gem/reference/auto-source.md

Lines changed: 12 additions & 22 deletions
@@ -6,37 +6,33 @@ parent: Reference
66
grand_parent: Ruby Gem
77
---
88

9-
# `auto_source`
9+
# Auto Source
1010

11-
The `auto_source` scraper is the easiest way to create a feed. It intelligently finds items on a page without requiring you to specify CSS selectors.
11+
The `auto_source` scraper automatically finds items on a page, so you don't have to specify CSS selectors.
1212

13-
You can enable it in your YAML config like this:
13+
To enable it, add `auto_source: {}` to your configuration:
1414

1515
```yaml
1616
channel:
1717
url: https://example.com
1818
auto_source: {}
1919
```
2020
21-
---
22-
23-
## How it Works
24-
25-
The `auto_source` scraper uses a series of strategies to find content:
21+
## How It Works
2622
27-
1. **`schema`:** It looks for structured data in the form of `<script type="json/ld">` tags. Many websites use this to provide machine-readable information about their content, often following the [Schema.org](https://schema.org/) standard.
28-
2. **`semantic_html`:** It searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`. These tags are often used to define the main content of a page.
29-
3. **`html`:** As a last resort, it analyzes the entire HTML structure to find frequently occurring selectors that are likely to contain the main content.
23+
`auto_source` uses the following strategies to find content:
3024

31-
---
25+
1. **`schema`:** Parses `<script type="application/ld+json">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
26+
2. **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
27+
3. **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
3228

33-
## Fine-Tuning `auto_source`
29+
## Fine-Tuning
3430

35-
You can customize the behavior of the `auto_source` scraper to improve its accuracy.
31+
You can customize `auto_source` to improve its accuracy.
3632

3733
### Scraper Options
3834

39-
You can enable or disable specific scrapers and adjust their settings.
35+
Enable or disable specific scrapers and adjust their settings:
4036

4137
```yaml
4238
auto_source:
@@ -51,19 +47,13 @@ auto_source:
5147
use_top_selectors: 3 # default: 5
5248
```
5349

54-
- `minimum_selector_frequency`: The minimum number of times a selector must appear to be considered a candidate for the main content.
55-
- `use_top_selectors`: The number of top candidate selectors to consider.
56-
5750
### Cleanup Options
5851

59-
You can also clean up the results to remove unwanted items.
52+
Remove unwanted items from the results:
6053

6154
```yaml
6255
auto_source:
6356
cleanup:
6457
keep_different_domain: false # default: true
6558
min_words_title: 4 # default: 3
6659
```
67-
68-
- `keep_different_domain`: Whether to keep items that link to a different domain.
69-
- `min_words_title`: The minimum number of words a title must have to be included.
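
To make the effect of the two cleanup options concrete, the following Ruby sketch filters a list of items in the way the option names suggest. The `cleanup` method and the item hashes are hypothetical and are not `html2rss` internals; the defaults mirror the ones noted in the YAML comments above.

```ruby
require 'uri'

# Hypothetical filter illustrating the two cleanup options; the item
# hashes and this method are assumptions, not html2rss internals.
def cleanup(items, channel_url:, keep_different_domain: true, min_words_title: 3)
  channel_host = URI.parse(channel_url).host
  items.select do |item|
    same_domain = URI.parse(item[:url]).host == channel_host
    enough_words = item[:title].split.size >= min_words_title
    # Keep the item if its domain is acceptable and its title is long enough.
    (keep_different_domain || same_domain) && enough_words
  end
end

items = [
  { title: 'A long enough headline here', url: 'https://example.com/a' },
  { title: 'Too short', url: 'https://example.com/b' },
  { title: 'An article hosted somewhere else', url: 'https://other.example.org/c' }
]

kept = cleanup(items, channel_url: 'https://example.com',
                      keep_different_domain: false, min_words_title: 4)
puts kept.map { |item| item[:title] }
```

With these settings only the first item survives: the second title has fewer than four words, and the third links to a different domain.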

ruby-gem/reference/channel.md

Lines changed: 12 additions & 14 deletions
@@ -6,9 +6,9 @@ parent: Reference
66
grand_parent: Ruby Gem
77
---
88

9-
# `channel`
9+
# Channel
1010

11-
The `channel` key contains information about the RSS feed itself, such as its title, URL, and description.
11+
The `channel` configuration block defines the metadata for your RSS feed.
1212

1313
```yaml
1414
channel:
@@ -21,16 +21,14 @@ channel:
2121
time_zone: "Europe/Berlin"
2222
```
2323
24-
---
25-
26-
## Channel Options
24+
## Options
2725
28-
| Attribute | Required | Type | Default | Remark |
29-
| :------------ | :----------- | :------ | :------------- | :-------------------------------------------------------------------------------------------------------------------------------------- |
30-
| `url` | **Required** | String | | The URL of the website to scrape. |
31-
| `title` | Optional | String | Auto-generated | The title of the RSS feed. |
32-
| `description` | Optional | String | Auto-generated | Retrieved from meta description tags. |
33-
| `author` | Optional | String | Blank | Format: `email (Name)`. |
34-
| `ttl` | Optional | Integer | Auto-generated | Time to live in minutes. `html2rss` will use the `max-age` from the response headers if available, otherwise it will default to `360`. |
35-
| `language` | Optional | String | Auto-generated | Determined by the `lang` attribute of the `<html>` tag. |
36-
| `time_zone` | Optional | String | `'UTC'` | The time zone to use for parsing dates. See a [list of valid time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones). |
26+
| Attribute | Required | Description |
27+
| :------------ | :----------- | :--------------------------------------------------------------------------------------------------------------------------------------- |
28+
| `url` | **Required** | The URL of the website to scrape. |
29+
| `title` | Optional | The title of the RSS feed. Defaults to the website's title. |
30+
| `description` | Optional | A description for the RSS feed. Defaults to the website's meta description. |
31+
| `author` | Optional | The author of the feed, in the format `email (Name)`. |
32+
| `ttl` | Optional | The "time to live" for the feed in minutes. Defaults to the `max-age` from the response headers, or `360`. |
33+
| `language` | Optional | The language of the feed. Defaults to the `lang` attribute of the `<html>` tag. |
34+
| `time_zone` | Optional | The time zone for parsing dates. See the [list of tz database time zones](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones). |
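
As a rough illustration of the `ttl` fallback described in the table, the sketch below derives a value in minutes from a `Cache-Control` header and falls back to `360`. The `ttl_minutes` helper is hypothetical, and converting `max-age` (which HTTP expresses in seconds) to whole minutes by integer division is an assumption about the behaviour, not something taken from the gem.

```ruby
# Hypothetical helper: derive a TTL in minutes from a Cache-Control header.
# Assumption: max-age (seconds) is rounded down to whole minutes.
def ttl_minutes(cache_control, default: 360)
  match = cache_control.to_s.match(/max-age=(\d+)/)
  return default unless match

  match[1].to_i / 60
end

puts ttl_minutes('public, max-age=7200')
puts ttl_minutes(nil)
```

A header without `max-age`, or no header at all, yields the default.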
