
Commit e9db410

feat(import) Add pagination, rate limiting, retry, and truncation warnings (#518)
## Summary

- **`--limit 0` means "no limit"**: Common CLI convention so users don't have to guess a large number when importing many repos
- **Retry with exponential backoff on HTTP 429**: Rate-limited requests now automatically retry (up to 3 times) using the `Retry-After` header or exponential backoff, instead of failing immediately
- **GitLab rate limit header logging**: Log `ratelimit-remaining`/`ratelimit-limit` headers after each API request (GitHub already had this)
- **Truncation warnings when `--limit` caps results**: Both GitLab and GitHub importers now warn when results are silently truncated, showing "Showing N of M repositories"

## Changes

- `base.py`: Add `max_retries`, `retry_base_delay` to `HTTPClient`; retry loop with `_calculate_retry_delay()` for 429s
- `gitlab.py`: Add `_log_rate_limit()`, `_warn_truncation()`; capture response headers in pagination methods
- `github.py`: Add truncation detection using `total_count` (search) and mid-page limit hit (user/org)
- `cli/import_cmd/_common.py`: Allow `limit=0` → `sys.maxsize` in `ImportOptions`
2 parents 2c5c16c + 8e6163b commit e9db410

11 files changed

Lines changed: 976 additions & 40 deletions
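To make the retry cadence concrete: with the defaults added in this commit (`max_retries=3`, `retry_base_delay=1.0`), the exponential-backoff fallback yields roughly the schedule below. This is an illustrative sketch mirroring the `_calculate_retry_delay()` hunk in `base.py` further down; it ignores the `Retry-After` branch.

```python
# Illustrative only: the exponential-backoff fallback used when a 429 response
# has no usable Retry-After header (mirrors _calculate_retry_delay in base.py).
import random

retry_base_delay = 1.0  # HTTPClient default from this commit
max_retries = 3         # HTTPClient default from this commit

for attempt in range(max_retries):
    backoff_delay = float(min(2**attempt * retry_base_delay, 60.0))
    jitter = random.uniform(0, 0.5 * backoff_delay)  # 0-50% jitter, as in the diff
    print(f"retry {attempt + 1}/{max_retries}: "
          f"{backoff_delay:.1f}s base + {jitter:.2f}s jitter")

# Typical output:
# retry 1/3: 1.0s base + 0.23s jitter
# retry 2/3: 2.0s base + 0.71s jitter
# retry 3/3: 4.0s base + 1.58s jitter
```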


CHANGES

Lines changed: 35 additions & 0 deletions
@@ -33,6 +33,41 @@ $ uvx --from 'vcspull' --prerelease allow vcspull
 _Notes on upcoming releases will be added here_
 <!-- END PLACEHOLDER - ADD NEW CHANGELOG ENTRIES BELOW THIS LINE -->
 
+### Bug fixes
+
+#### `vcspull import`: Fix silent truncation of GitLab/GitHub results (#518)
+
+Previously, `--limit 100` (the default) would silently discard repositories
+beyond the cap with no indication that more were available.
+
+- GitLab: Read `x-total` and `x-next-page` response headers to detect
+  truncation and warn users
+- GitHub search: Use `total_count` from the JSON body to detect truncation
+- GitHub user/org: Detect mid-page limit hit as a "more available" signal
+- All providers now warn when results are capped by `--limit`
+
+#### `vcspull import`: Fix HTTP 429 rate-limit failures on large imports (#518)
+
+Rate-limited API requests previously failed immediately with an unrecoverable
+error. Large imports that triggered rate limits had to be manually restarted.
+
+- Add automatic retry with exponential backoff (up to 3 attempts) on HTTP 429
+- Honor the `Retry-After` response header when present (capped at 120 s)
+- Fall back to exponential backoff with jitter when the header is absent
+
+### New features
+
+#### `vcspull import`: `--limit 0` means "no limit" (#518)
+
+`--limit 0` now fetches every repository instead of raising a validation
+error. This follows the common CLI convention where 0 means unlimited.
+
+#### `vcspull import`: GitLab rate-limit header logging (#518)
+
+GitLab `ratelimit-remaining` / `ratelimit-limit` headers are now logged
+after each API request, matching the existing GitHub rate-limit logging.
+A warning is emitted when fewer than 10 requests remain.
+
 ## vcspull v1.56.0 (2026-02-15)
 
 ### New features
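The `gitlab.py` changes referenced in these changelog entries are not among the hunks shown on this page. As a rough sketch of the described behavior, header-driven rate-limit logging and truncation detection could look like the following; `check_gitlab_headers` is a hypothetical helper, while the header names, the under-10 warning threshold, and the warning wording come from the changelog and the `github.py` hunk below.

```python
# Rough sketch of the behavior the changelog describes for gitlab.py; the
# function below is hypothetical and not vcspull's actual implementation.
import logging

log = logging.getLogger(__name__)


def check_gitlab_headers(headers: dict[str, str], count: int, limit: int) -> None:
    """Log GitLab rate-limit headers and warn when --limit truncated results."""
    remaining = headers.get("ratelimit-remaining")
    total_limit = headers.get("ratelimit-limit")
    if remaining is not None:
        log.debug("GitLab rate limit: %s/%s remaining", remaining, total_limit)
        if int(remaining) < 10:  # threshold per the changelog entry
            log.warning("GitLab rate limit nearly exhausted: %s remaining", remaining)

    total = int(headers.get("x-total", "0") or 0)
    has_next = bool(headers.get("x-next-page", ""))
    if count >= limit and (total > count or has_next):
        log.warning("Showing %d of %d repositories (use --limit 0 to fetch all)",
                    count, total)
```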

docs/cli/import/gitlab.md

Lines changed: 34 additions & 0 deletions
@@ -14,6 +14,24 @@ Import repositories from GitLab or a self-hosted GitLab instance.
 :path: import gitlab
 ```
 
+## Subgroup targeting
+
+Use slash notation to target a specific subgroup or sub-subgroup directly:
+
+```console
+$ vcspull import gl my-group/my-subgroup \
+    --mode org \
+    --workspace ~/code/
+```
+
+```console
+$ vcspull import gl my-group/my-subgroup/my-leaf \
+    --mode org \
+    --workspace ~/code/
+```
+
+The `TARGET` argument accepts any depth of slash-separated group path.
+
 ## Group flattening
 
 When importing a GitLab group with `--mode org`, vcspull preserves subgroup
@@ -27,6 +45,22 @@ $ vcspull import gl my-group \
     --flatten-groups
 ```
 
+### Workspace structure by target and flag
+
+Given a group tree `my-group → sub → leaf`, importing from `~/code/`:
+
+| Target | `--flatten-groups` | Workspace sections written |
+|--------|:-----------------:|---------------------------|
+| `my-group` | no | `~/code/`, `~/code/sub/`, `~/code/sub/leaf/` |
+| `my-group` | yes | `~/code/` only |
+| `my-group/sub` | no | `~/code/`, `~/code/leaf/` |
+| `my-group/sub` | yes | `~/code/` only |
+| `my-group/sub/leaf` | no | `~/code/` only (leaf — no further nesting) |
+| `my-group/sub/leaf` | yes | `~/code/` only |
+
+When the target is already the deepest group (a leaf), `--flatten-groups` has
+no effect — all repositories already land in the base workspace.
+
 ## Authentication
 
 - **Env vars**: `GITLAB_TOKEN` (primary), `GL_TOKEN` (fallback)
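The table added in this docs change maps targets to workspace sections; the sketch below expresses the same rule as a small function. It is a hypothetical illustration rather than vcspull's actual implementation; the function name, signature, and path handling are assumptions made for the example.

```python
# Hypothetical sketch (not vcspull's code): derive the workspace section a repo
# lands in from its GitLab group path relative to the imported target group,
# reproducing the table above.
def workspace_section(workspace: str, target: str, full_path: str, flatten: bool) -> str:
    # Path of the repo's group relative to the imported target group
    relative = full_path[len(target):].strip("/")
    if flatten or not relative:
        return workspace  # e.g. "~/code/"
    return f"{workspace.rstrip('/')}/{relative}/"  # e.g. "~/code/sub/leaf/"


print(workspace_section("~/code/", "my-group", "my-group/sub/leaf", flatten=False))
# ~/code/sub/leaf/
print(workspace_section("~/code/", "my-group/sub", "my-group/sub/leaf", flatten=False))
# ~/code/leaf/
print(workspace_section("~/code/", "my-group/sub/leaf", "my-group/sub/leaf", flatten=True))
# ~/code/
```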

src/vcspull/_internal/remotes/base.py

Lines changed: 87 additions & 19 deletions
@@ -7,6 +7,9 @@
 import json
 import logging
 import os
+import random
+import sys
+import time
 import typing as t
 import urllib.error
 import urllib.parse
@@ -251,14 +254,21 @@ def __post_init__(self) -> None:
         >>> opts.limit
         10
 
-        >>> ImportOptions(limit=0)
+        >>> import sys
+        >>> opts = ImportOptions(limit=0)
+        >>> opts.limit == sys.maxsize
+        True
+
+        >>> ImportOptions(limit=-1)
         Traceback (most recent call last):
             ...
-        ValueError: limit must be >= 1, got 0
+        ValueError: limit must be >= 0, got -1
         """
-        if self.limit < 1:
-            msg = f"limit must be >= 1, got {self.limit}"
+        if self.limit < 0:
+            msg = f"limit must be >= 0, got {self.limit}"
             raise ValueError(msg)
+        if self.limit == 0:
+            self.limit = sys.maxsize
 
 
 class HTTPClient:
@@ -273,6 +283,8 @@ def __init__(
         auth_prefix: str = "Bearer",
         user_agent: str = "vcspull",
         timeout: int = 30,
+        max_retries: int = 3,
+        retry_base_delay: float = 1.0,
     ) -> None:
         """Initialize the HTTP client.
 
@@ -290,6 +302,10 @@ def __init__(
             User-Agent header value
         timeout : int
            Request timeout in seconds
+        max_retries : int
+            Maximum number of retries on HTTP 429 (rate limit) responses
+        retry_base_delay : float
+            Base delay in seconds for exponential backoff
 
         Examples
         --------
@@ -309,6 +325,8 @@ def __init__(
         self.auth_prefix = auth_prefix
         self.user_agent = user_agent
         self.timeout = timeout
+        self.max_retries = max_retries
+        self.retry_base_delay = retry_base_delay
 
     def _build_headers(self) -> dict[str, str]:
         """Build request headers.
@@ -368,7 +386,7 @@ def get(
         AuthenticationError
             When authentication fails (401)
         RateLimitError
-            When rate limit is exceeded (403/429)
+            When rate limit is exceeded (403/429) after retries exhausted
         NotFoundError
             When resource is not found (404)
         ServiceUnavailableError
@@ -392,24 +410,74 @@ def get(
 
         log.debug("GET %s", url)
 
-        try:
-            with urllib.request.urlopen(request, timeout=self.timeout) as response:
-                body = response.read().decode("utf-8")
-                response_headers = {k.lower(): v for k, v in response.getheaders()}
-                return json.loads(body), response_headers
-        except urllib.error.HTTPError as exc:
-            self._handle_http_error(exc, service_name)
-        except urllib.error.URLError as exc:
-            msg = f"Connection error: {exc.reason}"
-            raise ServiceUnavailableError(msg, service=service_name) from exc
-        except json.JSONDecodeError as exc:
-            msg = f"Invalid JSON response from {service_name}"
-            raise ServiceUnavailableError(msg, service=service_name) from exc
+        for attempt in range(self.max_retries + 1):
+            try:
+                with urllib.request.urlopen(request, timeout=self.timeout) as response:
+                    body = response.read().decode("utf-8")
+                    response_headers = {k.lower(): v for k, v in response.getheaders()}
+                    return json.loads(body), response_headers
+            except urllib.error.HTTPError as exc:  # noqa: PERF203
+                if exc.code == 429 and attempt < self.max_retries:
+                    delay = self._calculate_retry_delay(exc, attempt)
+                    log.warning(
+                        "Rate limited by %s, retrying in %.1fs (attempt %d/%d)",
+                        service_name,
+                        delay,
+                        attempt + 1,
+                        self.max_retries,
+                    )
+                    time.sleep(delay)
+                    continue
+                self._handle_http_error(exc, service_name)
+            except urllib.error.URLError as exc:
+                msg = f"Connection error: {exc.reason}"
+                raise ServiceUnavailableError(msg, service=service_name) from exc
+            except json.JSONDecodeError as exc:
+                msg = f"Invalid JSON response from {service_name}"
+                raise ServiceUnavailableError(msg, service=service_name) from exc
 
-        # Should never reach here, but for type checker
        msg = "Unexpected error"
        raise ServiceUnavailableError(msg, service=service_name)
 
+    def _calculate_retry_delay(
+        self,
+        exc: urllib.error.HTTPError,
+        attempt: int,
+    ) -> float:
+        """Calculate delay before retrying a rate-limited request.
+
+        Uses the ``Retry-After`` header if present (capped at 120s),
+        otherwise falls back to exponential backoff with jitter.
+
+        Parameters
+        ----------
+        exc : urllib.error.HTTPError
+            The 429 HTTP error response
+        attempt : int
+            Zero-based attempt number
+
+        Returns
+        -------
+        float
+            Delay in seconds before the next retry
+        """
+        retry_after = None
+        if exc.headers:
+            retry_after = exc.headers.get("Retry-After")
+
+        if retry_after is not None:
+            try:
+                delay = min(float(retry_after), 120.0)
+                return max(delay, 0.0)
+            except (ValueError, TypeError):
+                pass
+
+        # Exponential backoff: 2^attempt * base_delay, capped at 60s
+        backoff_delay = float(min(2**attempt * self.retry_base_delay, 60.0))
+        # Add jitter: 0 to 50% of the delay
+        jitter = random.uniform(0, 0.5 * backoff_delay)
+        return backoff_delay + jitter
+
     def _handle_http_error(
         self,
         exc: urllib.error.HTTPError,
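As a standalone illustration of the `Retry-After` precedence and capping in `_calculate_retry_delay()` above, the hypothetical helper below reproduces the same branch logic (jitter omitted for clarity):

```python
# Standalone illustration of the Retry-After precedence in the hunk above;
# pick_delay is a hypothetical helper written just for this example.
def pick_delay(retry_after: str | None, attempt: int, base: float = 1.0) -> float:
    if retry_after is not None:
        try:
            return max(min(float(retry_after), 120.0), 0.0)  # header wins, capped
        except (ValueError, TypeError):
            pass
    return float(min(2**attempt * base, 60.0))  # backoff fallback, capped at 60s


print(pick_delay("30", attempt=0))    # 30.0  -> header honored
print(pick_delay("600", attempt=0))   # 120.0 -> header capped at 120 s
print(pick_delay("soon", attempt=2))  # 4.0   -> unparsable header, backoff used
print(pick_delay(None, attempt=1))    # 2.0   -> no header, backoff used
```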

src/vcspull/_internal/remotes/github.py

Lines changed: 41 additions & 7 deletions
@@ -191,6 +191,7 @@ def _fetch_search(self, options: ImportOptions) -> t.Iterator[RemoteRepo]:
         endpoint = "/search/repositories"
         page = 1
         count = 0
+        total_available: int | None = None
 
         while count < options.limit:
             # Always use DEFAULT_PER_PAGE to maintain consistent pagination offset.
@@ -212,12 +213,14 @@ def _fetch_search(self, options: ImportOptions) -> t.Iterator[RemoteRepo]:
             self._log_rate_limit(headers)
 
             total_count = data.get("total_count", 0)
-            if page == 1 and total_count > 1000:
-                log.warning(
-                    "GitHub search returned %d total results but API limits "
-                    "to 1000; consider narrowing your query",
-                    total_count,
-                )
+            if page == 1:
+                total_available = total_count
+                if total_count > 1000:
+                    log.warning(
+                        "GitHub search returned %d total results but API limits "
+                        "to 1000; consider narrowing your query",
+                        total_count,
+                    )
 
             items = data.get("items", [])
             if not items:
@@ -242,6 +245,18 @@ def _fetch_search(self, options: ImportOptions) -> t.Iterator[RemoteRepo]:
 
             page += 1
 
+        # Warn if results were truncated by --limit
+        if (
+            count >= options.limit
+            and total_available is not None
+            and total_available > count
+        ):
+            log.warning(
+                "Showing %d of %d repositories (use --limit 0 to fetch all)",
+                count,
+                total_available,
+            )
+
     def _paginate_repos(
         self,
         endpoint: str,
@@ -263,6 +278,7 @@ def _paginate_repos(
         """
         page = 1
         count = 0
+        more_available = False
 
         while count < options.limit:
             # Always use DEFAULT_PER_PAGE to maintain consistent pagination offset.
@@ -285,21 +301,39 @@ def _paginate_repos(
             if not data:
                 break
 
-            for item in data:
+            for idx, item in enumerate(data):
                 if count >= options.limit:
+                    # Remaining items on this page or a full page = more exist
+                    more_available = (
+                        idx < len(data) - 1 or len(data) == DEFAULT_PER_PAGE
+                    )
                     break
 
                 repo = self._parse_repo(item)
                 if filter_repo(repo, options):
                     yield repo
                     count += 1
 
+            # Boundary: limit reached on the last item of a full page
+            if count >= options.limit and len(data) == DEFAULT_PER_PAGE:
+                more_available = True
+                break
+
             # Check if there are more pages
             if len(data) < DEFAULT_PER_PAGE:
                 break
 
             page += 1
 
+        # Warn if results were truncated by --limit
+        # GitHub user/org endpoints don't return total count
+        if count >= options.limit and more_available:
+            log.warning(
+                "Showing %d repositories; more may be available "
+                "(use --limit 0 to fetch all)",
+                count,
+            )
+
     def _parse_repo(self, data: dict[str, t.Any]) -> RemoteRepo:
         """Parse GitHub API response into RemoteRepo.
 
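Worked values for the mid-page "more available" heuristic added to `_paginate_repos` above. `DEFAULT_PER_PAGE = 100` is assumed here for illustration; its actual value is defined elsewhere in `github.py` and not shown in this diff.

```python
# Illustrative check of the in-loop truncation heuristic from _paginate_repos.
# DEFAULT_PER_PAGE = 100 is an assumption for this example.
DEFAULT_PER_PAGE = 100


def more_available(limit_hit_at_idx: int, page_len: int) -> bool:
    # Mirrors the condition in the hunk above: the limit was hit before the
    # page's last item, or the page was full (so another page likely exists).
    return limit_hit_at_idx < page_len - 1 or page_len == DEFAULT_PER_PAGE


print(more_available(limit_hit_at_idx=49, page_len=100))  # True: mid-page hit
print(more_available(limit_hit_at_idx=99, page_len=100))  # True: full page
print(more_available(limit_hit_at_idx=36, page_len=37))   # False: final item of a short page
```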