Commit bb2134d

feat: update strategy, request and cli
1 parent 55cf057 commit bb2134d

4 files changed

Lines changed: 189 additions & 1 deletion

src/content/docs/ruby-gem/how-to/custom-http-requests.mdx

Lines changed: 32 additions & 0 deletions
@@ -5,6 +5,12 @@ description: "Learn how to customize HTTP requests with custom headers, authenti
 
 Some websites require custom HTTP headers, authentication, or other request settings to access their content. `html2rss` lets you customize requests for those cases.
 
+Keep this structure in mind:
+
+- `headers` stays top-level
+- `strategy` stays top-level
+- request-specific controls such as budgets and Browserless options live under `request`
+
 ## When You Need Custom Headers
 
 You might need custom HTTP requests when:
@@ -35,6 +41,32 @@ selectors:
     selector: "url"
 ```
 
+## Request Controls
+
+Request budgets are configured under `request`, not as top-level keys:
+
+```yaml
+headers:
+  User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
+request:
+  max_redirects: 5
+  max_requests: 6
+channel:
+  url: https://example.com/articles
+selectors:
+  items:
+    selector: article
+  title:
+    selector: h2
+  url:
+    selector: a
+    extractor: href
+```
+
+- `request.max_redirects` limits redirect hops
+- `request.max_requests` limits the total request budget for the feed build
+- `request.browserless.*` is reserved for Browserless-only behavior such as preload actions
+
 ## Common Use Cases
 
 ### API Authentication
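
The "Request Controls" text added in this file says budgets belong under `request` and not at the top level. A minimal sketch of that contrast follows, using only keys that appear in this commit. What `html2rss` does with a misplaced top-level budget key (reject it or ignore it) is not stated in the diff, so the commented-out lines only mark the placement to avoid.

```yaml
# Placement to avoid: budgets written as top-level keys.
# max_redirects: 5
# max_requests: 6

# Placement described in the docs change:
headers:                 # stays top-level
  User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
request:                 # request-specific controls live here
  max_redirects: 5       # limits redirect hops
  max_requests: 6        # total request budget for the feed build
channel:
  url: https://example.com/articles
selectors:
  items:
    selector: article
  title:
    selector: h2
  url:
    selector: a
    extractor: href
```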

src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx

Lines changed: 73 additions & 0 deletions
@@ -9,6 +9,29 @@ Some websites load their content dynamically using JavaScript. The default `html
 
 Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScript-heavy websites with a headless browser.
 
+Keep the strategy at the top level and put request-specific options under `request`:
+
+```yaml
+strategy: browserless
+request:
+  max_redirects: 5
+  max_requests: 6
+  browserless:
+    preload:
+      wait_for_network_idle:
+        timeout_ms: 5000
+channel:
+  url: https://example.com/app
+selectors:
+  items:
+    selector: .article
+  title:
+    selector: h2
+  url:
+    selector: a
+    extractor: href
+```
+
 ## When to Use Browserless
 
 The `browserless` strategy is necessary when:
@@ -18,6 +41,56 @@ The `browserless` strategy is necessary when:
 - **Infinite scroll** - Content loads as you scroll
 - **Dynamic forms** - Content changes based on user interaction
 
+## Preload Actions
+
+For dynamic sites, rendering once is often not enough. Use `request.browserless.preload` to wait, click, or scroll before the
+HTML snapshot is taken.
+
+### Wait for JavaScript Requests
+
+```yaml
+strategy: browserless
+request:
+  browserless:
+    preload:
+      wait_for_network_idle:
+        timeout_ms: 4000
+```
+
+### Click "Load More" Buttons
+
+```yaml
+strategy: browserless
+request:
+  browserless:
+    preload:
+      click_selectors:
+        - selector: ".load-more"
+          max_clicks: 3
+          delay_ms: 250
+          wait_for_network_idle:
+            timeout_ms: 3000
+```
+
+### Scroll Infinite Lists
+
+```yaml
+strategy: browserless
+request:
+  browserless:
+    preload:
+      scroll_down:
+        iterations: 5
+        delay_ms: 200
+        wait_for_network_idle:
+          timeout_ms: 2500
+```
+
+These preload steps can be combined in a single config when a site needs several interactions before all items appear.
+
+If a click or scroll step causes a real navigation, html2rss returns the final document metadata, not the original page-load
+metadata. That keeps extracted relative links anchored to the rendered page.
+
 ## Performance Considerations
 
 The `browserless` strategy is slower than the default `faraday` strategy because it:

src/content/docs/ruby-gem/reference/cli-reference.mdx

Lines changed: 8 additions & 0 deletions
@@ -24,6 +24,9 @@ html2rss auto https://example.com/articles
 # Force browserless for JavaScript-heavy pages
 html2rss auto https://example.com/app --strategy browserless
 
+# Override request budgets at runtime
+html2rss auto https://example.com/app --strategy browserless --max-redirects 5 --max-requests 6
+
 # Hint the item selector while keeping auto enhancement
 html2rss auto https://example.com/articles --items_selector ".post-card"
 ```
@@ -44,12 +47,17 @@ html2rss feed feeds.yml my-first-feed
 # Override the request strategy at runtime
 html2rss feed single.yml --strategy browserless
 
+# Override request budgets at runtime
+html2rss feed single.yml --max-redirects 5 --max-requests 6
+
 # Pass dynamic parameters into %<param>s placeholders
 html2rss feed single.yml --params id:42 foo:bar
 ```
 
 Command: `html2rss feed YAML_FILE [feed_name]`
 
+The CLI keeps `strategy` as a top-level override and writes runtime request limits into the generated config under `request`.
+
 ### Schema
 
 Prints the exported JSON Schema for the current gem version.
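
The sentence added in this file says the CLI keeps `strategy` as a top-level override and writes runtime limits under `request`. One way to picture that, as a sketch only: the flags from the examples above would map onto config keys roughly as shown here. The mapping is inferred from the wording of this diff. How the flags interact with values already present in `single.yml` is not described in the commit and is not assumed here.

```yaml
# Illustrative mapping of CLI flags to config keys (inferred, not verified against the gem)
strategy: browserless   # --strategy browserless  -> top-level key
request:
  max_redirects: 5      # --max-redirects 5       -> request.max_redirects
  max_requests: 6       # --max-requests 6        -> request.max_requests
```

The `channel` and `selectors` sections are omitted because they come from the YAML file itself, not from the flags.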

src/content/docs/ruby-gem/reference/strategy.mdx

Lines changed: 76 additions & 1 deletion
@@ -8,6 +8,8 @@ The `strategy` key defines how `html2rss` fetches a website's content.
 - **`faraday`** (default): Makes a direct HTTP request. It is fast but does not execute JavaScript.
 - **`browserless`**: Renders the website in a headless Chrome browser, which is necessary for JavaScript-heavy sites.
 
+`strategy` stays a top-level config key. Request-specific controls now live under `request`.
+
 ## `browserless`
 
 To use the `browserless` strategy, you need a running instance of [Browserless.io](https://www.browserless.io/).
@@ -27,10 +29,48 @@ docker run \
 
 ### Configuration
 
-Set the `strategy` at the top level of your feed configuration:
+Set the `strategy` at the top level of your feed configuration and put request controls under `request`:
+
+```yml
+strategy: browserless
+request:
+  max_redirects: 5
+  max_requests: 6
+channel:
+  url: "https://example.com/app"
+selectors:
+  items:
+    selector: ".article"
+  title:
+    selector: "h2"
+  url:
+    selector: "a"
+    extractor: "href"
+```
+
+### Request Structure
+
+Use this split consistently:
+
+- `strategy`: selects `faraday` or `browserless`
+- `headers`: top-level headers shared by all strategies
+- `request.max_redirects`: redirect limit for the request session
+- `request.max_requests`: total request budget for the whole feed build
+- `request.browserless.*`: Browserless-only options
+
+Example:
 
 ```yml
 strategy: browserless
+headers:
+  User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
+request:
+  max_redirects: 5
+  max_requests: 6
+  browserless:
+    preload:
+      wait_for_network_idle:
+        timeout_ms: 5000
 channel:
   url: "https://example.com/app"
 selectors:
@@ -43,6 +83,38 @@ selectors:
     extractor: "href"
 ```
 
+### Browserless Preload
+
+Browserless can interact with the page before html2rss captures the final HTML. Configure preload steps under
+`request.browserless.preload`.
+
+```yml
+strategy: browserless
+request:
+  browserless:
+    preload:
+      wait_for_network_idle:
+        timeout_ms: 5000
+      click_selectors:
+        - selector: ".load-more"
+          max_clicks: 3
+          delay_ms: 250
+          wait_for_network_idle:
+            timeout_ms: 4000
+      scroll_down:
+        iterations: 5
+        delay_ms: 200
+        wait_for_network_idle:
+          timeout_ms: 3000
+```
+
+- `wait_for_network_idle`: pauses before and after preload steps
+- `click_selectors`: clicks matching elements until they disappear or `max_clicks` is reached
+- `scroll_down`: scrolls until the page height stops growing or `iterations` is reached
+
+If preload triggers a real navigation or redirect, html2rss keeps the final document metadata. Relative links and follow-up
+pagination therefore resolve against the page that was actually rendered after preload completed.
+
 ### Command-Line Usage
 
 You can also specify the strategy on the command line:
@@ -53,6 +125,9 @@ BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" \
 BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
 html2rss feed my_config.yml --strategy browserless
 
+# Override request budgets at runtime
+html2rss feed my_config.yml --max-redirects 5 --max-requests 6
+
 # Or rely on the strategy stored in the YAML config
 html2rss feed my_config.yml
 ```
