| 1 | +# Glob Implementation Design |
| 2 | + |
| 3 | +This document describes the design of `obspec_utils.glob`, which provides glob pattern matching for object stores using the obspec `List` primitive. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The glob module provides functions to match paths against glob patterns, similar to `fsspec.glob`, `pathlib.glob`, and `glob.glob`. It enables users to find objects in stores using familiar wildcard patterns like `data/**/*.nc`. |
| 8 | + |
| 9 | +## API Design |
| 10 | + |
| 11 | +### Two-Function Approach |
| 12 | + |
| 13 | +We provide two separate functions rather than a single function with a `detail` kwarg: |
| 14 | + |
| 15 | +```python |
| 16 | +from obspec_utils import glob, glob_objects |
| 17 | + |
| 18 | +# Get paths only |
| 19 | +paths = list(glob(store, "data/**/*.nc")) |
| 20 | +# ['data/2024/file1.nc', 'data/2024/01/file2.nc', ...] |
| 21 | + |
| 22 | +# Get full metadata |
| 23 | +for obj in glob_objects(store, "data/**/*.nc"): |
| 24 | + print(f"{obj['path']}: {obj['size']} bytes") |
| 25 | +``` |
| 26 | + |
| 27 | +**Rationale:** |
| 28 | + |
| 29 | +| Approach | Typing | API Clarity | |
| 30 | +|----------|--------|-------------| |
| 31 | +| Two functions | Clean return types | Explicit intent | |
| 32 | +| Single function with kwarg | Requires `@overload` decorators | Runtime-dependent return type | |
| 33 | + |
| 34 | +Following Python's "explicit is better than implicit" philosophy, two functions provide: |
| 35 | + |
| 36 | +- **Clean typing** — each function has a single return type |
| 37 | +- **Discoverability** — both options visible in autocomplete |
| 38 | +- **No ambiguity** — return type known at call site |
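
To make the typing argument concrete, here is a minimal sketch of what the single-function alternative would need. The function name `glob_with_detail`, the in-memory `_OBJECTS` listing, and the simplified `ObjectMeta` type are hypothetical stand-ins for illustration, not part of `obspec_utils` or obspec.

```python
from collections.abc import Iterator
from typing import Literal, TypedDict, Union, overload


class ObjectMeta(TypedDict):
    """Simplified stand-in for obspec's ObjectMeta (assumption)."""

    path: str
    size: int


# Hypothetical in-memory listing used only for this illustration.
_OBJECTS: list[ObjectMeta] = [
    {"path": "data/a.nc", "size": 10},
    {"path": "data/b.nc", "size": 20},
]


# Two overloads are needed so a type checker can pick the return type
# from the literal value of `detail`.
@overload
def glob_with_detail(detail: Literal[False] = ...) -> Iterator[str]: ...
@overload
def glob_with_detail(detail: Literal[True]) -> Iterator[ObjectMeta]: ...
def glob_with_detail(
    detail: bool = False,
) -> Union[Iterator[str], Iterator[ObjectMeta]]:
    """Single entry point whose return type depends on a runtime flag."""
    if detail:
        return iter(_OBJECTS)
    return (obj["path"] for obj in _OBJECTS)


paths = list(glob_with_detail())
sizes = [obj["size"] for obj in glob_with_detail(True)]
```

If `detail` is passed as a plain `bool` variable rather than a literal, a type checker can no longer choose between the overloads, which is the ambiguity the two-function design avoids.
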
| 39 | + |
| 40 | +### Function Matrix |
| 41 | + |
| 42 | +| Function | Protocol | Returns | |
| 43 | +|----------|----------|---------| |
| 44 | +| `glob` | `obspec.List` | `Iterator[str]` | |
| 45 | +| `glob_objects` | `obspec.List` | `Iterator[ObjectMeta]` | |
| 46 | +| `glob_async` | `obspec.ListAsync` | `AsyncIterator[str]` | |
| 47 | +| `glob_objects_async` | `obspec.ListAsync` | `AsyncIterator[ObjectMeta]` | |
| 48 | + |
| 49 | +### Protocol Requirements |
| 50 | + |
| 51 | +Following obspec's philosophy, we use `obspec.List` and `obspec.ListAsync` directly rather than defining wrapper protocols: |
| 52 | + |
| 53 | +```python |
| 54 | +from obspec import List |
| 55 | + |
| 56 | +def glob(store: List, pattern: str) -> Iterator[str]: |
| 57 | + ... |
| 58 | +``` |
| 59 | + |
| 60 | +This keeps the API minimal and avoids unnecessary abstraction layers. |
| 61 | + |
| 62 | +## Pattern Support |
| 63 | + |
| 64 | +The glob functions support standard Unix-style glob patterns: |
| 65 | + |
| 66 | +| Pattern | Meaning | Example | |
| 67 | +|---------|---------|---------| |
| 68 | +| `*` | Matches any characters within a single path segment | `data/*.nc` matches `data/file.nc` but not `data/sub/file.nc` | |
| 69 | +| `**` | Matches any number of path segments (recursive) | `data/**/*.nc` matches `data/a/b/c/file.nc` | |
| 70 | +| `?` | Matches exactly one character | `file?.nc` matches `file1.nc` but not `file10.nc` | |
| 71 | +| `[abc]` | Matches characters in set | `file[123].nc` matches `file1.nc`, `file2.nc`, `file3.nc` | |
| 72 | +| `[a-z]` | Matches characters in range | `file[a-c].nc` matches `filea.nc`, `fileb.nc`, `filec.nc` | |
| 73 | +| `[!abc]` | Matches characters NOT in set | `file[!0-9].nc` matches `filea.nc` but not `file1.nc` | |
| 74 | + |
| 75 | +## Implementation Algorithm |
| 76 | + |
| 77 | +### 1. Prefix Extraction |
| 78 | + |
| 79 | +Extract the literal prefix from the pattern so that the `list()` call only fetches objects under that prefix: |
| 80 | + |
| 81 | +```python |
| 82 | +GLOB_CHARS = frozenset('*?[') |
| 83 | + |
| 84 | +def _parse_pattern(pattern: str) -> tuple[str, str]: |
| 85 | + """Find the longest prefix without glob characters. |
| 86 | + |
| 87 | + The prefix must end at a path separator boundary to work with |
| 88 | + obspec's segment-based prefix matching. |
| 89 | + """ |
| 90 | + for i, char in enumerate(pattern): |
| 91 | + if char in GLOB_CHARS: |
| 92 | + prefix_end = pattern.rfind('/', 0, i) + 1 |
| 93 | + return pattern[:prefix_end], pattern[prefix_end:] |
| 94 | + |
| 95 | + # No glob chars - use parent directory as prefix |
| 96 | + last_slash = pattern.rfind('/') |
| 97 | + if last_slash >= 0: |
| 98 | + return pattern[:last_slash + 1], pattern[last_slash + 1:] |
| 99 | + return "", pattern |
| 100 | +``` |
| 101 | + |
| 102 | +Examples: |
| 103 | +- `data/2024/**/*.nc` → prefix `data/2024/`, remaining `**/*.nc` |
| 104 | +- `data/*.nc` → prefix `data/`, remaining `*.nc` |
| 105 | +- `**/*.nc` → prefix `""`, remaining `**/*.nc` |
| 106 | +- `data/file.nc` → prefix `data/`, remaining `file.nc` (literal path) |
| 107 | +- `file.nc` → prefix `""`, remaining `file.nc` (no directory) |
| 108 | + |
| 109 | +### 2. Pattern Compilation |
| 110 | + |
| 111 | +Convert the glob pattern to a compiled regex using a segment-by-segment approach |
| 112 | +inspired by CPython's `glob.translate()`: |
| 113 | + |
| 114 | +```python |
| 115 | +import re |
| 116 | + |
| 117 | +def _compile_pattern(pattern: str) -> re.Pattern[str]: |
| 118 | + """ |
| 119 | + Convert glob pattern to regex, processing segment by segment. |
| 120 | + |
| 121 | + Inspired by CPython 3.13+ glob.translate() but simplified for |
| 122 | + object stores (/ separator only, no hidden file handling). |
| 123 | + """ |
| 124 | + segments = pattern.split('/') |
| 125 | + regex_parts = [] |
| 126 | + |
| 127 | + i = 0 |
| 128 | + while i < len(segments): |
| 129 | + segment = segments[i] |
| 130 | + is_last = (i == len(segments) - 1) |
| 131 | + |
| 132 | + if segment == '**': |
| 133 | + # Skip consecutive ** segments |
| 134 | + while i + 1 < len(segments) and segments[i + 1] == '**': |
| 135 | + i += 1 |
| 136 | + is_last = (i == len(segments) - 1) |
| 137 | + |
| 138 | + if is_last: |
| 139 | + # ** at end: match everything remaining |
| 140 | + regex_parts.append('.*') |
| 141 | + else: |
| 142 | + # ** in middle: match zero or more segments |
| 143 | + regex_parts.append('(?:.+/)?') |
| 144 | + else: |
| 145 | + # Convert segment with wildcards |
| 146 | + segment_regex = _translate_segment(segment) |
| 147 | + if is_last: |
| 148 | + regex_parts.append(segment_regex) |
| 149 | + else: |
| 150 | + regex_parts.append(segment_regex + '/') |
| 151 | + |
| 152 | + i += 1 |
| 153 | + |
| 154 | + return re.compile(''.join(regex_parts) + r'\Z') |
| 155 | + |
| 156 | +def _translate_segment(segment: str) -> str: |
| 157 | + """Translate a single path segment (no /) to regex.""" |
| 158 | + # Handle *, ?, [abc], [!abc], [a-z] and literal characters |
| 159 | + # * -> [^/]* (any chars except /) |
| 160 | + # ? -> [^/] (single char except /) |
| 161 | + # [...] -> [...] (character class; a leading ! becomes ^) |
| 162 | + ... |
| 163 | +``` |
| 164 | + |
| 165 | +**Key design choices** (inspired by CPython `glob.translate()`): |
| 166 | + |
| 167 | +| Pattern | Regex | Rationale | |
| 168 | +|---------|-------|-----------| |
| 169 | +| `*` | `[^/]*` | Match any chars within segment (not across `/`) | |
| 170 | +| `**` (middle) | `(?:.+/)?` | Match zero or more complete segments | |
| 171 | +| `**` (end) | `.*` | Match everything remaining | |
| 172 | +| `?` | `[^/]` | Match single char within segment | |
| 173 | +| `[abc]` | `[abc]` | Character class (passed through) | |
| 174 | +| `[!abc]` | `[^abc]` | Negated character class | |
| 175 | + |
| 176 | +**Differences from CPython:** |
| 177 | +- Object stores use `/` only (no `os.sep` handling) |
| 178 | +- No hidden file handling (object stores don't have this concept) |
| 179 | +- Simpler implementation focused on object store paths |
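
`_translate_segment` is left as a stub above. One possible way to fill it in, consistent with the translation table, is sketched below. The name `translate_segment_sketch` and its edge-case handling (an unclosed `[` treated as a literal, `]` allowed as the first class member) are assumptions made for illustration, not the module's confirmed behavior.

```python
import re


def translate_segment_sketch(segment: str) -> str:
    """Translate one glob path segment (no '/') into a regex fragment."""
    out = []
    i, n = 0, len(segment)
    while i < n:
        char = segment[i]
        i += 1
        if char == "*":
            out.append("[^/]*")  # any run of characters except '/'
        elif char == "?":
            out.append("[^/]")  # exactly one character except '/'
        elif char == "[":
            # Locate the closing bracket. A leading '!' and a ']' in
            # first position are treated as part of the class body.
            j = i
            if j < n and segment[j] == "!":
                j += 1
            if j < n and segment[j] == "]":
                j += 1
            while j < n and segment[j] != "]":
                j += 1
            if j >= n:
                # Assumption: an unclosed '[' is matched literally.
                out.append(re.escape("["))
            else:
                body = segment[i:j].replace("\\", "\\\\")
                i = j + 1
                if body.startswith("!"):
                    body = "^" + body[1:]  # glob negation -> regex negation
                elif body.startswith("^"):
                    body = "\\" + body  # a literal '^' must be escaped
                out.append(f"[{body}]")
        else:
            out.append(re.escape(char))
    return "".join(out)


star = translate_segment_sketch("*.nc")
negated = re.compile(translate_segment_sketch("file[!0-9].nc") + r"\Z")
```

One detail worth noting: in Python's `re`, a negated class such as `[^0-9]` also matches `/`, so a full-path regex built from this fragment would let `file[!0-9].nc` match `file/.nc`. Whether the separator should be excluded there is a design decision this sketch leaves open.
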
| 180 | + |
| 181 | +### 3. List and Filter |
| 182 | + |
| 183 | +```python |
| 184 | +def _glob_impl(store: List, pattern: str) -> Iterator[ObjectMeta]: |
| 185 | + list_prefix, _ = _parse_pattern(pattern) |
| 186 | + compiled = _compile_pattern(pattern) |
| 187 | + |
| 188 | + for chunk in store.list(prefix=list_prefix or None): |
| 189 | + for obj in chunk: |
| 190 | + if compiled.match(obj["path"]): |
| 191 | + yield obj |
| 192 | +``` |
| 193 | + |
| 194 | +Note: The compiled pattern ends with a `\Z` anchor, so `match()` (which anchors at the start) |
| 195 | +effectively performs a full match, equivalent to calling `fullmatch()`. |
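
The list-and-filter step can be exercised without a real object store. The sketch below uses a hypothetical `FakeStore` whose `list()` yields chunks of dictionaries with a `path` key (an assumption about the result shape, based on the loop above), together with a regex written out by hand from the translation table for `data/**/*.nc`. The names `FakeStore`, `DATA_NC`, and `glob_sketch` are illustrative only.

```python
import re
from collections.abc import Iterator, Sequence
from typing import Optional


class FakeStore:
    """Hypothetical in-memory stand-in for a store exposing `list()`."""

    def __init__(self, paths: Sequence[str], chunk_size: int = 2) -> None:
        self._paths = paths
        self._chunk_size = chunk_size

    def list(self, prefix: Optional[str] = None) -> Iterator[Sequence[dict]]:
        # Keep only paths under the prefix, then yield them in pages.
        selected = [{"path": p} for p in self._paths if p.startswith(prefix or "")]
        for start in range(0, len(selected), self._chunk_size):
            yield selected[start:start + self._chunk_size]


# Regex for "data/**/*.nc", assembled by hand from the table above.
DATA_NC = re.compile(r"data/(?:.+/)?[^/]*\.nc\Z")


def glob_sketch(store: FakeStore, prefix: str, compiled: re.Pattern) -> Iterator[str]:
    """List under `prefix`, then keep only paths the pattern matches."""
    for chunk in store.list(prefix=prefix or None):
        for obj in chunk:
            if compiled.match(obj["path"]):
                yield obj["path"]


store = FakeStore([
    "data/top.nc",
    "data/2024/01/deep.nc",
    "data/2024/readme.txt",
    "other/file.nc",
])
matches = list(glob_sketch(store, "data/", DATA_NC))
```

Here the prefix keeps `other/file.nc` from being listed at all, while the regex rejects `data/2024/readme.txt` after listing.
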
| 196 | + |
| 197 | +## Behavior Comparison |
| 198 | + |
| 199 | +| Feature | `obspec_utils.glob` | `fsspec.glob` | `pathlib.glob` | `glob.glob` | |
| 200 | +|---------|---------------------|---------------|----------------|-------------| |
| 201 | +| Returns | `Iterator[str]` or `Iterator[ObjectMeta]` | `list[str]` or `dict` | `Iterator[Path]` | `list[str]` | |
| 202 | +| `*` matches `/` | No | No | No | No | |
| 203 | +| `**` recursive | Yes (always) | Yes | Yes (always) | Yes (if `recursive=True`) | |
| 204 | +| Hidden files | Matched | Matched | Matched | Only if pattern starts with `.` | |
| 205 | +| Case sensitive | Yes (always) | Platform-dependent | Platform-dependent | Platform-dependent | |
| 206 | +| Directories | Not included | Yes (`withdirs`) | Yes | Yes | |
| 207 | +| `maxdepth` | Not supported | Yes | No | No | |
| 208 | +| Metadata | `glob_objects()` | `detail=True` | No | No | |
| 209 | +| Streaming | Yes (iterator) | No (returns list) | Yes (iterator) | No (returns list) | |
| 210 | + |
| 211 | +### Key Differences and Rationale |
| 212 | + |
| 213 | +#### 1. Two functions instead of `detail` kwarg |
| 214 | + |
| 215 | +| `obspec_utils` | `fsspec` | |
| 216 | +|----------------|----------| |
| 217 | +| `glob()` returns `Iterator[str]` | `glob()` returns `list[str]` | |
| 218 | +| `glob_objects()` returns `Iterator[ObjectMeta]` | `glob(..., detail=True)` returns `dict` | |
| 219 | + |
| 220 | +**Rationale**: fsspec uses a runtime `detail` parameter that changes the return type, requiring |
| 221 | +`@overload` decorators for proper typing. Two separate functions provide: |
| 222 | +- Clean static typing without runtime-dependent return types |
| 223 | +- Better IDE autocomplete and type inference |
| 224 | +- Consistency with Python's "explicit is better than implicit" principle |
| 225 | + |
| 226 | +#### 2. No `maxdepth` parameter |
| 227 | + |
| 228 | +**Rationale**: The obspec `List` primitive is always recursive; there's no way to request a |
| 229 | +shallow listing. Adding `maxdepth` would: |
| 230 | +- Require counting path segments in every result |
| 231 | +- Require post-filtering results that exceed the depth limit |
| 232 | +- Provide no performance benefit, since all objects are fetched anyway |
| 233 | + |
| 234 | +If depth limiting is needed, users can post-filter: |
| 235 | +```python |
| 236 | +max_depth = 2 |
| 237 | +results = [p for p in glob(store, "**/*.nc") if p.count("/") <= max_depth] |
| 238 | +``` |
| 239 | + |
| 240 | +#### 3. Always case-sensitive |
| 241 | + |
| 242 | +**Rationale**: Object stores (S3, GCS, Azure Blob) treat paths as case-sensitive. Unlike |
| 243 | +filesystems, where case sensitivity varies by platform (typically case-insensitive on Windows and macOS, |
| 244 | +case-sensitive on Linux), object stores are consistent. Matching this behavior avoids |
| 245 | +surprises when patterns work locally but fail in production. |
| 246 | + |
| 247 | +#### 4. No directory results |
| 248 | + |
| 249 | +**Rationale**: Object stores don't have real directories—only objects with `/`-separated paths. |
| 250 | +What appears as a "directory" is just a common prefix. fsspec's `withdirs=True` returns these |
| 251 | +pseudo-directories, but: |
| 252 | +- They don't exist as separate entities with metadata |
| 253 | +- Including them would require using `ListWithDelimiter` and merging results |
| 254 | +- Most use cases want actual objects, not prefixes |
| 255 | + |
| 256 | +#### 5. Streaming results (iterator vs list) |
| 257 | + |
| 258 | +| `obspec_utils` | `fsspec` | |
| 259 | +|----------------|----------| |
| 260 | +| `Iterator[str]` (lazy) | `list[str]` (eager) | |
| 261 | + |
| 262 | +**Rationale**: Object store listings can return millions of objects. fsspec materializes all |
| 263 | +results into a list before returning, which: |
| 264 | +- Blocks until all pages are fetched |
| 265 | +- Consumes memory proportional to result count |
| 266 | +- Can't process results incrementally |
| 267 | + |
| 268 | +Returning an iterator enables: |
| 269 | +- Processing results as they arrive |
| 270 | +- Early termination (e.g., "find first 10 matches") |
| 271 | +- Bounded memory usage regardless of result count |
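
Early termination can be illustrated with `itertools.islice`. In the sketch below, `lazy_matches` is a hypothetical generator standing in for `glob()`, and the `pulled` list exists only to show that results beyond the requested ones are never produced.

```python
from collections.abc import Iterator
from itertools import islice

# Records every path the generator actually produced.
pulled: list[str] = []


def lazy_matches(total: int) -> Iterator[str]:
    """Hypothetical stand-in for glob(): yields paths one at a time."""
    for i in range(total):
        path = f"data/file{i}.nc"
        pulled.append(path)
        yield path


# Ask for the first 3 matches out of a potential million.
first_three = list(islice(lazy_matches(1_000_000), 3))
```

Because the generator is only advanced three times, `pulled` ends up with three entries; an eager, list-returning function would have built all one million paths first.
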
| 272 | + |
| 273 | +#### 6. Pattern always required |
| 274 | + |
| 275 | +Unlike `fsspec.glob()`, which accepts `"bucket/**"` to list everything, `obspec_utils.glob` |
| 276 | +requires a pattern. To list all objects, use `store.list()` directly. |
| 277 | + |
| 278 | +## Usage Examples |
| 279 | + |
| 280 | +### Basic Patterns |
| 281 | + |
| 282 | +```python |
| 283 | +from obspec_utils import glob, glob_objects |
| 284 | + |
| 285 | +# Find all NetCDF files in a directory |
| 286 | +paths = list(glob(store, "data/2024/*.nc")) |
| 287 | + |
| 288 | +# Find all NetCDF files recursively |
| 289 | +paths = list(glob(store, "data/**/*.nc")) |
| 290 | + |
| 291 | +# Find files with single-character suffix |
| 292 | +paths = list(glob(store, "data/file?.nc")) |
| 293 | + |
| 294 | +# Find files matching character set |
| 295 | +paths = list(glob(store, "data/[abc]*.nc")) |
| 296 | +``` |
| 297 | + |
| 298 | +### With Metadata |
| 299 | + |
| 300 | +```python |
| 301 | +# Get file sizes for matching objects |
| 302 | +total_size = sum(obj["size"] for obj in glob_objects(store, "data/**/*.nc")) |
| 303 | + |
| 304 | +# Find recently modified files |
| 305 | +from datetime import datetime, timedelta, timezone |
| 306 | +cutoff = datetime.now(timezone.utc) - timedelta(days=7) |
| 307 | +recent = [ |
| 308 | + obj for obj in glob_objects(store, "data/**/*.nc") |
| 309 | + if obj["last_modified"] > cutoff |
| 310 | +] |
| 311 | +``` |
| 312 | + |
| 313 | +### Async Usage |
| 314 | + |
| 315 | +```python |
| 316 | +async def process_files(store): |
| 317 | + async for path in glob_async(store, "data/**/*.nc"): |
| 318 | + await process(path) |
| 319 | +``` |
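
For a self-contained illustration of the async variant, the sketch below pairs an async generator with a hypothetical `FakeAsyncStore`. The `list_async()` method name and the chunk shape are assumptions made for illustration; consult obspec's `ListAsync` protocol for the actual interface. The regex is the hand-written form of `data/**/*.nc` from the translation table.

```python
import asyncio
import re
from collections.abc import AsyncIterator, Sequence
from typing import Optional


class FakeAsyncStore:
    """Hypothetical in-memory store; `list_async` is an assumed method name."""

    def __init__(self, paths: Sequence[str]) -> None:
        self._paths = paths

    async def list_async(
        self, prefix: Optional[str] = None
    ) -> AsyncIterator[Sequence[dict]]:
        for path in self._paths:
            if path.startswith(prefix or ""):
                await asyncio.sleep(0)  # stand-in for waiting on a network page
                yield [{"path": path}]


async def glob_async_sketch(
    store: FakeAsyncStore, prefix: str, compiled: re.Pattern
) -> AsyncIterator[str]:
    """Async list-and-filter: same logic as the sync version, with `async for`."""
    async for chunk in store.list_async(prefix=prefix or None):
        for obj in chunk:
            if compiled.match(obj["path"]):
                yield obj["path"]


async def collect() -> list:
    store = FakeAsyncStore(["data/a.nc", "data/sub/b.nc", "data/c.txt"])
    pattern = re.compile(r"data/(?:.+/)?[^/]*\.nc\Z")
    return [p async for p in glob_async_sketch(store, "data/", pattern)]


found = asyncio.run(collect())
```
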
| 320 | + |
| 321 | +## Dependencies |
| 322 | + |
| 323 | +- `obspec` — for `List`, `ListAsync`, and `ObjectMeta` types |
| 324 | +- `re` — standard library regex |
| 325 | + |
| 326 | +No new external dependencies required. We implement our own glob-to-regex translation (`_compile_pattern`) |
| 327 | +rather than using `fnmatch.translate()`, so that path separators and `**` patterns are handled properly. |