
Commit 226fb20

Feat: Implement globbing in obspec_utils (#42)
* Add design doc
* Initial implementation
* Test against stdlib globbing
* Clarify differences between obspec_utils.glob() and fsspec.glob()
* Test against fsspec on S3
* Test edge cases
* Parameterize tests
* Add to docs
* Test error handling
* Test OSN
1 parent b52ff23 commit 226fb20

11 files changed

Lines changed: 1912 additions & 2 deletions


docs/api/glob.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
::: obspec_utils.glob.glob
::: obspec_utils.glob.glob_objects
::: obspec_utils.glob.glob_async
::: obspec_utils.glob.glob_objects_async

docs/design/glob.md

Lines changed: 327 additions & 0 deletions
@@ -0,0 +1,327 @@
# Glob Implementation Design

This document describes the design of `obspec_utils.glob`, which provides glob pattern matching for object stores using the obspec `List` primitive.

## Overview

The glob module provides functions to match paths against glob patterns, similar to `fsspec.glob`, `pathlib.glob`, and `glob.glob`. It enables users to find objects in stores using familiar wildcard patterns like `data/**/*.nc`.

## API Design

### Two-Function Approach

We provide two separate functions rather than a single function with a `detail` kwarg:

```python
from obspec_utils import glob, glob_objects

# Get paths only
paths = list(glob(store, "data/**/*.nc"))
# ['data/2024/file1.nc', 'data/2024/01/file2.nc', ...]

# Get full metadata
for obj in glob_objects(store, "data/**/*.nc"):
    print(f"{obj['path']}: {obj['size']} bytes")
```

**Rationale:**

| Approach | Typing | API Clarity |
|----------|--------|-------------|
| Two functions | Clean return types | Explicit intent |
| Single function with kwarg | Requires `@overload` decorators | Runtime-dependent return type |

Following Python's "explicit is better than implicit" philosophy, two functions provide:

- **Clean typing**: each function has a single return type
- **Discoverability**: both options are visible in autocomplete
- **No ambiguity**: the return type is known at the call site
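
To make the typing cost in the table concrete, here is a minimal sketch of what a single-function API with a `detail` flag would need. The names `glob_single` and `Meta` are hypothetical and are not part of `obspec_utils`; a plain list of dicts stands in for a store.

```python
from collections.abc import Iterator
from typing import Literal, TypedDict, Union, overload


class Meta(TypedDict):
    """Hypothetical stand-in for obspec's ObjectMeta (illustration only)."""

    path: str
    size: int


# Two overloads are needed so a type checker can pick the return type
# from the literal value of `detail`.
@overload
def glob_single(objects: list[Meta], detail: Literal[False] = ...) -> Iterator[str]: ...
@overload
def glob_single(objects: list[Meta], detail: Literal[True]) -> Iterator[Meta]: ...
def glob_single(
    objects: list[Meta], detail: bool = False
) -> Union[Iterator[str], Iterator[Meta]]:
    """Single entry point whose return type depends on a runtime flag."""
    if detail:
        return iter(objects)
    return (obj["path"] for obj in objects)
```

With two separate functions, each signature carries exactly one return type and the overload boilerplate is unnecessary.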

### Function Matrix

| Function | Protocol | Returns |
|----------|----------|---------|
| `glob` | `obspec.List` | `Iterator[str]` |
| `glob_objects` | `obspec.List` | `Iterator[ObjectMeta]` |
| `glob_async` | `obspec.ListAsync` | `AsyncIterator[str]` |
| `glob_objects_async` | `obspec.ListAsync` | `AsyncIterator[ObjectMeta]` |

### Protocol Requirements

Following obspec's philosophy, we use `obspec.List` and `obspec.ListAsync` directly rather than defining wrapper protocols:

```python
from collections.abc import Iterator

from obspec import List

def glob(store: List, pattern: str) -> Iterator[str]:
    ...
```

This keeps the API minimal and avoids unnecessary abstraction layers.

## Pattern Support

The glob functions support standard Unix-style glob patterns:

| Pattern | Meaning | Example |
|---------|---------|---------|
| `*` | Matches any characters within a single path segment | `data/*.nc` matches `data/file.nc` but not `data/sub/file.nc` |
| `**` | Matches any number of path segments (recursive) | `data/**/*.nc` matches `data/a/b/c/file.nc` |
| `?` | Matches exactly one character | `file?.nc` matches `file1.nc` but not `file10.nc` |
| `[abc]` | Matches characters in set | `file[123].nc` matches `file1.nc`, `file2.nc`, `file3.nc` |
| `[a-z]` | Matches characters in range | `file[a-c].nc` matches `filea.nc`, `fileb.nc`, `filec.nc` |
| `[!abc]` | Matches characters NOT in set | `file[!0-9].nc` matches `filea.nc` but not `file1.nc` |

## Implementation Algorithm

### 1. Prefix Extraction

Extract the literal prefix from the pattern to optimize the `list()` call:

```python
GLOB_CHARS = frozenset('*?[')

def _parse_pattern(pattern: str) -> tuple[str, str]:
    """Find the longest prefix without glob characters.

    The prefix must end at a path separator boundary to work with
    obspec's segment-based prefix matching.
    """
    for i, char in enumerate(pattern):
        if char in GLOB_CHARS:
            prefix_end = pattern.rfind('/', 0, i) + 1
            return pattern[:prefix_end], pattern[prefix_end:]

    # No glob chars - use parent directory as prefix
    last_slash = pattern.rfind('/')
    if last_slash >= 0:
        return pattern[:last_slash + 1], pattern[last_slash + 1:]
    return "", pattern
```

Examples:
- `data/2024/**/*.nc` → prefix `data/2024/`, remaining `**/*.nc`
- `data/*.nc` → prefix `data/`, remaining `*.nc`
- `**/*.nc` → prefix `""`, remaining `**/*.nc`
- `data/file.nc` → prefix `data/`, remaining `file.nc` (literal path)
- `file.nc` → prefix `""`, remaining `file.nc` (no directory)

### 2. Pattern Compilation

Convert the glob pattern to a compiled regex using a segment-by-segment approach inspired by CPython's `glob.translate()`:

```python
import re

def _compile_pattern(pattern: str) -> re.Pattern[str]:
    """
    Convert glob pattern to regex, processing segment by segment.

    Inspired by CPython 3.13+ glob.translate() but simplified for
    object stores (/ separator only, no hidden file handling).
    """
    segments = pattern.split('/')
    regex_parts = []

    i = 0
    while i < len(segments):
        segment = segments[i]
        is_last = (i == len(segments) - 1)

        if segment == '**':
            # Skip consecutive ** segments
            while i + 1 < len(segments) and segments[i + 1] == '**':
                i += 1
                is_last = (i == len(segments) - 1)

            if is_last:
                # ** at end: match everything remaining
                regex_parts.append('.*')
            else:
                # ** in middle: match zero or more segments
                regex_parts.append('(?:.+/)?')
        else:
            # Convert segment with wildcards
            segment_regex = _translate_segment(segment)
            if is_last:
                regex_parts.append(segment_regex)
            else:
                regex_parts.append(segment_regex + '/')

        i += 1

    return re.compile(''.join(regex_parts) + r'\Z')

def _translate_segment(segment: str) -> str:
    """Translate a single path segment (no /) to regex."""
    # Handle *, ?, [abc], [!abc], [a-z] and literal characters
    # * -> [^/]* (any chars except /)
    # ? -> [^/] (single char except /)
    # [...] -> [...] (character class, passed through)
    ...
```

**Key design choices** (inspired by CPython `glob.translate()`):

| Pattern | Regex | Rationale |
|---------|-------|-----------|
| `*` | `[^/]*` | Match any chars within segment (not across `/`) |
| `**` (middle) | `(?:.+/)?` | Match zero or more complete segments |
| `**` (end) | `.*` | Match everything remaining |
| `?` | `[^/]` | Match single char within segment |
| `[abc]` | `[abc]` | Character class (passed through) |
| `[!abc]` | `[^abc]` | Negated character class |

**Differences from CPython:**
- Object stores use `/` only (no `os.sep` handling)
- No hidden file handling (object stores don't have this concept)
- Simpler implementation focused on object store paths
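
The design leaves `_translate_segment` as a stub. The following is one possible way to fill it in, following the translation table above. It is a sketch and not the code from this commit: the name `translate_segment` and the handling of an unclosed `[` (treated as a literal character) are assumptions made here.

```python
import re


def translate_segment(segment: str) -> str:
    """Translate one path segment (no '/') into a regex fragment (sketch)."""
    out = []
    i, n = 0, len(segment)
    while i < n:
        char = segment[i]
        if char == "*":
            out.append("[^/]*")  # any run of characters within the segment
        elif char == "?":
            out.append("[^/]")  # exactly one character within the segment
        elif char == "[":
            # Locate the closing ']'. A ']' directly after '[' or '[!' is
            # treated as a literal member of the class.
            j = i + 1
            if j < n and segment[j] == "!":
                j += 1
            if j < n and segment[j] == "]":
                j += 1
            while j < n and segment[j] != "]":
                j += 1
            if j >= n:
                # Assumption: an unclosed '[' is matched literally.
                out.append(re.escape(char))
            else:
                body = segment[i + 1 : j].replace("\\", "\\\\")
                if body.startswith("!"):
                    body = "^" + body[1:]  # glob negation -> regex negation
                elif body.startswith("^"):
                    body = "\\" + body  # a leading '^' is literal in glob syntax
                out.append(f"[{body}]")
                i = j  # continue after the closing ']'
        else:
            out.append(re.escape(char))
        i += 1
    return "".join(out)


# Usage: anchor the fragment and test a single segment.
single_digit_free = re.compile(translate_segment("file[!0-9].nc") + r"\Z")
assert single_digit_free.match("filea.nc")
assert single_digit_free.match("file1.nc") is None
```

Note that in this sketch a negated class such as `[^0-9]` can still match `/` when the fragment is embedded in a full-path regex; a stricter version would need to exclude the separator explicitly.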

### 3. List and Filter

```python
def _glob_impl(store: List, pattern: str) -> Iterator[ObjectMeta]:
    list_prefix, _ = _parse_pattern(pattern)
    compiled = _compile_pattern(pattern)

    for chunk in store.list(prefix=list_prefix if list_prefix else None):
        for obj in chunk:
            if compiled.match(obj["path"]):
                yield obj
```

Note: The compiled pattern ends with a `\Z` anchor and `match()` anchors at the start of the string, so `compiled.match(path)` succeeds only when the whole path matches, just as `fullmatch()` would. In some regex engines this may be more efficient than `fullmatch()`.
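
To see the list-then-filter step and the effect of the `\Z` anchor in isolation, here is a small self-contained sketch. `FakeStore` is a hypothetical in-memory stand-in that yields chunks of dicts with a `path` key; it uses plain `str.startswith` for the prefix, which is a simplification of obspec's segment-based prefix matching. The regex is written by hand for the pattern `data/*.nc`.

```python
import re


class FakeStore:
    """Hypothetical in-memory store with a chunked list() (illustration only)."""

    def __init__(self, paths, chunk_size=2):
        self._paths = sorted(paths)
        self._chunk_size = chunk_size

    def list(self, prefix=None):
        selected = [p for p in self._paths if prefix is None or p.startswith(prefix)]
        for start in range(0, len(selected), self._chunk_size):
            yield [{"path": p} for p in selected[start : start + self._chunk_size]]


def list_and_filter(store, prefix, compiled):
    """Yield objects under `prefix` whose full path matches `compiled`."""
    for chunk in store.list(prefix=prefix or None):
        for obj in chunk:
            if compiled.match(obj["path"]):
                yield obj


store = FakeStore(["data/a.nc", "data/a.nc.bak", "data/sub/b.nc", "other/c.nc"])

# Hand-written regex for the glob "data/*.nc", anchored at the end with \Z.
anchored = re.compile(r"data/[^/]*\.nc\Z")
matches = [obj["path"] for obj in list_and_filter(store, "data/", anchored)]
assert matches == ["data/a.nc"]

# Without the trailing anchor, match() would accept "data/a.nc.bak" as well,
# because it only requires a match at the start of the string.
unanchored = re.compile(r"data/[^/]*\.nc")
assert unanchored.match("data/a.nc.bak") is not None
```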

## Behavior Comparison

| Feature | `obspec_utils.glob` | `fsspec.glob` | `pathlib.glob` | `glob.glob` |
|---------|---------------------|---------------|----------------|-------------|
| Returns | `Iterator[str]` or `Iterator[ObjectMeta]` | `list[str]` or `dict` | `Iterator[Path]` | `list[str]` |
| `*` matches `/` | No | No | No | No |
| `**` recursive | Yes (always) | Yes | Yes (always) | Yes (if `recursive=True`) |
| Hidden files | Matched | Matched | Matched | Only if pattern starts with `.` |
| Case sensitive | Yes (always) | Platform-dependent | Platform-dependent | Platform-dependent |
| Directories | Not included | Yes (`withdirs`) | Yes | Yes |
| `maxdepth` | Not supported | Yes | No | No |
| Metadata | `glob_objects()` | `detail=True` | No | No |
| Streaming | Yes (iterator) | No (returns list) | Yes (iterator) | No (returns list) |

### Key Differences and Rationale

#### 1. Two functions instead of `detail` kwarg

| `obspec_utils` | `fsspec` |
|----------------|----------|
| `glob()` returns `Iterator[str]` | `glob()` returns `list[str]` |
| `glob_objects()` returns `Iterator[ObjectMeta]` | `glob(..., detail=True)` returns `dict` |

**Rationale**: fsspec uses a runtime `detail` parameter that changes the return type, requiring `@overload` decorators for proper typing. Two separate functions provide:
- Clean static typing without runtime-dependent return types
- Better IDE autocomplete and type inference
- Consistency with Python's "explicit is better than implicit" principle
225+
226+
#### 2. No `maxdepth` parameter
227+
228+
**Rationale**: The obspec `List` primitive is always recursive—there's no way to request a
229+
shallow listing. Adding `maxdepth` would require:
230+
- Counting path segments in every result
231+
- Post-filtering results that exceed the depth limit
232+
- No performance benefit since all objects are fetched anyway
233+
234+
If depth limiting is needed, users can post-filter:
```python
max_depth = 2
results = [p for p in glob(store, "**/*.nc") if p.count("/") <= max_depth]
```
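
The list comprehension above materializes every result. If streaming should be preserved, the same filter can be written as a generator. The helper below is a hypothetical sketch, not part of `obspec_utils`; the `base` parameter is an assumption added here so that depth can be counted relative to a prefix rather than from the root of the store.

```python
from collections.abc import Iterable, Iterator


def limit_depth(paths: Iterable[str], max_depth: int, base: str = "") -> Iterator[str]:
    """Lazily yield paths with at most `max_depth` separators below `base`."""
    for path in paths:
        # Count separators only in the part of the path below `base`.
        relative = path[len(base):] if path.startswith(base) else path
        if relative.count("/") <= max_depth:
            yield path


# Usage with a plain list standing in for the iterator returned by glob().
paths = ["a.nc", "x/b.nc", "x/y/c.nc", "x/y/z/d.nc"]
shallow = list(limit_depth(paths, max_depth=2))
below_x = list(limit_depth(paths[1:], max_depth=1, base="x/"))
```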

#### 3. Always case-sensitive

**Rationale**: Object stores (S3, GCS, Azure Blob) treat paths as case-sensitive. Unlike filesystems, where case sensitivity varies by platform (typically case-insensitive on Windows and macOS, case-sensitive on Linux), object stores are consistent. Matching this behavior avoids surprises when patterns work locally but fail in production.

#### 4. No directory results

**Rationale**: Object stores don't have real directories, only objects with `/`-separated paths. What appears as a "directory" is just a common prefix. fsspec's `withdirs=True` returns these pseudo-directories, but:
- They don't exist as separate entities with metadata
- Including them would require using `ListWithDelimiter` and merging results
- Most use cases want actual objects, not prefixes

#### 5. Streaming results (iterator vs list)

| `obspec_utils` | `fsspec` |
|----------------|----------|
| `Iterator[str]` (lazy) | `list[str]` (eager) |

**Rationale**: Object store listings can return millions of objects. fsspec materializes all results into a list before returning, which:
- Blocks until all pages are fetched
- Consumes memory proportional to result count
- Can't process results incrementally

Returning an iterator enables:
- Processing results as they arrive
- Early termination (e.g., "find first 10 matches")
- Bounded memory usage regardless of result count
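
As an illustration of early termination, the sketch below takes the first ten results from a lazy iterator with `itertools.islice`. `fake_glob` is a hypothetical generator standing in for `glob(store, pattern)`; it counts how many paths it actually produced so the laziness is observable.

```python
from collections.abc import Iterator
from itertools import islice

stats = {"produced": 0}


def fake_glob(total: int) -> Iterator[str]:
    """Hypothetical stand-in for glob(): lazily yields `total` paths."""
    for i in range(total):
        stats["produced"] += 1
        yield f"data/file{i}.nc"


# Only the first ten paths are ever generated, even though a million are available.
first_ten = list(islice(fake_glob(1_000_000), 10))
assert len(first_ten) == 10
assert stats["produced"] == 10
```

With an eager, list-returning API the full listing would be fetched before the first ten entries could be sliced off.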

#### 6. Pattern always required

Unlike `fsspec.glob()`, which accepts `"bucket/**"` to list everything, `obspec_utils.glob` requires a pattern. To list all objects, use `store.list()` directly.

## Usage Examples

### Basic Patterns

```python
from obspec_utils import glob, glob_objects

# Find all NetCDF files in a directory
paths = list(glob(store, "data/2024/*.nc"))

# Find all NetCDF files recursively
paths = list(glob(store, "data/**/*.nc"))

# Find files with single-character suffix
paths = list(glob(store, "data/file?.nc"))

# Find files matching character set
paths = list(glob(store, "data/[abc]*.nc"))
```

### With Metadata

```python
from datetime import datetime, timedelta, timezone

from obspec_utils import glob_objects

# Get file sizes for matching objects
total_size = sum(obj["size"] for obj in glob_objects(store, "data/**/*.nc"))

# Find recently modified files
cutoff = datetime.now(timezone.utc) - timedelta(days=7)
recent = [
    obj for obj in glob_objects(store, "data/**/*.nc")
    if obj["last_modified"] > cutoff
]
```

### Async Usage

```python
from obspec_utils import glob_async

async def process_files(store):
    # `process` is a placeholder for a user-supplied coroutine
    async for path in glob_async(store, "data/**/*.nc"):
        await process(path)
```
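
The same early-exit idea applies to the async variants. The sketch below is self-contained: `fake_glob_async` is a hypothetical async generator standing in for `glob_async(store, pattern)`, and `collect` is an illustrative helper, not part of `obspec_utils`.

```python
import asyncio
from collections.abc import AsyncIterator


async def fake_glob_async(paths: list[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for glob_async(): yields paths one at a time."""
    for path in paths:
        await asyncio.sleep(0)  # give control back to the event loop
        yield path


async def collect(paths: AsyncIterator[str], limit: int) -> list[str]:
    """Gather at most `limit` paths, then stop consuming the iterator."""
    results: list[str] = []
    if limit <= 0:
        return results
    async for path in paths:
        results.append(path)
        if len(results) >= limit:
            break
    return results


result = asyncio.run(collect(fake_glob_async(["a.nc", "b.nc", "c.nc"]), limit=2))
assert result == ["a.nc", "b.nc"]
```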

## Dependencies

- `obspec`: for `List`, `ListAsync`, and `ObjectMeta` types
- `re`: standard library regex

No new external dependencies required. We implement our own `translate()` function rather than using `fnmatch.translate()` to properly handle path separators and `**` patterns.

mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -17,7 +17,9 @@ nav:
   - "Design":
       - "Protocols": "design/protocols.md"
       - "Caching": "design/caching.md"
+      - "Glob": "design/glob.md"
   - "API":
+      - Glob: "api/glob.md"
       - Protocols: "api/protocols.md"
       - Readers: "api/readers.md"
       - Stores:

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@ test = [
     "pytest-xdist",
     "minio",
     "docker",
+    "s3fs",
 ]
 xarray = [
     "xarray",

src/obspec_utils/__init__.py

Lines changed: 10 additions & 2 deletions
@@ -7,6 +7,7 @@
 - `obspec_utils.wrappers`: Caching, tracing, and request splitting
 - `obspec_utils.stores`: Concrete store implementations
 - `obspec_utils.registry`: URL-to-store mapping
+- `obspec_utils.glob`: Glob pattern matching for object stores

 Example
 -------
@@ -29,5 +30,12 @@
 """

 from obspec_utils._version import __version__
-
-__all__ = ["__version__"]
+from obspec_utils.glob import glob, glob_async, glob_objects, glob_objects_async
+
+__all__ = [
+    "__version__",
+    "glob",
+    "glob_objects",
+    "glob_async",
+    "glob_objects_async",
+]
