Commit 88a57c0: Add user guide section on caching (#54)
1 parent d6cf4bf

2 files changed, 140 additions & 0 deletions
# Caching Remote Data

This guide shows how to reduce network requests when you need to read the same remote data multiple times.

## The Problem

When working with cloud-hosted data, every read operation can trigger a network request. If you access the same data repeatedly, this can be slow and wasteful. Repeated access happens even when reading only one file, and also happens when reading a file multiple times.

## The Solution

Wrap your store with [`CachingReadableStore`][obspec_utils.wrappers.CachingReadableStore] to cache files after the first access:

```python exec="on" source="above" session="cache" result="code"
from obstore.store import S3Store
from obspec_utils.wrappers import CachingReadableStore

# Create the underlying store
store = S3Store(
    bucket="nasanex",
    aws_region="us-west-2",
    skip_signature=True,
)

# Wrap with caching (256 MB default cache size)
cached_store = CachingReadableStore(store)

# First access fetches from network
path = "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2100.nc"
data1 = cached_store.get_range(path, start=0, length=1000)
print(f"After first read: {cached_store.cache_size / 1e6:.1f} MB cached")

# Second access served from cache (no network request)
data2 = cached_store.get_range(path, start=1000, length=1000)
print(f"After second read: {cached_store.cache_size / 1e6:.1f} MB cached")
print(f"Cached files: {len(cached_store.cached_paths)}")
```

## Sizing Your Cache

Set `max_size` based on your available memory and the files you're working with:

```python exec="on" source="above" session="cache2" result="code"
from obstore.store import S3Store
from obspec_utils.wrappers import CachingReadableStore

store = S3Store(
    bucket="nasanex",
    aws_region="us-west-2",
    skip_signature=True,
)

# 512 MB cache for larger workloads
cached_store = CachingReadableStore(store, max_size=512 * 1024 * 1024)

# Cache multiple files
paths = [
    "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2099.nc",
    "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2100.nc",
]

for path in paths:
    cached_store.get_range(path, start=0, length=100)

print(f"Cache size: {cached_store.cache_size / 1e6:.1f} MB")
print(f"Cached files ({len(cached_store.cached_paths)}):")
for p in cached_store.cached_paths:
    print(f"  {p.split('/')[-1]}")
```

When the cache exceeds [`max_size`][obspec_utils.wrappers.CachingReadableStore], the least recently used files are evicted automatically.
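
The eviction policy can be pictured with a small standalone sketch. This is an illustration of whole-file LRU eviction only, not `CachingReadableStore`'s actual implementation; the class name and sizes here are made up for the example:

```python
from collections import OrderedDict


class LRUFileCache:
    """Toy whole-file LRU cache: evicts the least recently used
    files once total cached bytes exceed max_size."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._files: "OrderedDict[str, bytes]" = OrderedDict()

    @property
    def cache_size(self) -> int:
        return sum(len(data) for data in self._files.values())

    def put(self, path: str, data: bytes) -> None:
        self._files[path] = data
        self._files.move_to_end(path)  # mark as most recently used
        while self.cache_size > self.max_size and len(self._files) > 1:
            evicted, _ = self._files.popitem(last=False)  # drop the LRU file
            print(f"evicted {evicted}")

    def get(self, path: str) -> bytes:
        self._files.move_to_end(path)  # a read also refreshes recency
        return self._files[path]


cache = LRUFileCache(max_size=250)
cache.put("a.nc", b"x" * 100)
cache.put("b.nc", b"x" * 100)
cache.put("c.nc", b"x" * 100)  # total would be 300 > 250, so "a.nc" goes
print(sorted(cache._files))    # ['b.nc', 'c.nc']
```

The same recency bookkeeping is why re-reading a file keeps it safe from eviction: every access moves it to the "most recently used" end.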

## Using with Xarray

Combine caching with readers for xarray workflows:

```python exec="on" source="above" session="cache3" result="code"
import xarray as xr
from obstore.store import HTTPStore
from obspec_utils.wrappers import CachingReadableStore
from obspec_utils.readers import EagerStoreReader

# Access sample NetCDF files over HTTP
store = HTTPStore.from_url("https://github.com/pydata/xarray-data/raw/refs/heads/master/")
cached_store = CachingReadableStore(store)

path = "air_temperature.nc"

# First open: fetches from network
with EagerStoreReader(cached_store, path) as reader:
    ds1 = xr.open_dataset(reader, engine="scipy")
    var_names = list(ds1.data_vars)

print(f"Variables: {var_names}")
print(f"Cache size after first open: {cached_store.cache_size / 1e6:.2f} MB")

# Second open: served entirely from cache
with EagerStoreReader(cached_store, path) as reader:
    ds2 = xr.open_dataset(reader, engine="scipy")

print(f"Cache size after second open: {cached_store.cache_size / 1e6:.2f} MB (unchanged)")
```

## Cleaning Up

Use the context manager for automatic cleanup, or call [`clear_cache()`][obspec_utils.wrappers.CachingReadableStore.clear_cache] explicitly:

```python exec="on" source="above" session="cache4" result="code"
from obstore.store import S3Store
from obspec_utils.wrappers import CachingReadableStore

store = S3Store(
    bucket="nasanex",
    aws_region="us-west-2",
    skip_signature=True,
)
path = "NEX-GDDP/BCSD/rcp85/day/atmos/tasmax/r1i1p1/v1.0/tasmax_day_BCSD_rcp85_r1i1p1_inmcm4_2100.nc"

# Option 1: Context manager (cache cleared on exit)
with CachingReadableStore(store) as cached_store:
    cached_store.get_range(path, start=0, length=100)
    print(f"Inside context: {cached_store.cache_size / 1e6:.1f} MB cached")
# Cache automatically cleared when exiting the context
print(f"Outside context: {cached_store.cache_size / 1e6:.1f} MB cached")

# Option 2: Explicit cleanup
cached_store = CachingReadableStore(store)
cached_store.get_range(path, start=0, length=100)
print(f"Before clear: {cached_store.cache_size / 1e6:.1f} MB cached")
cached_store.clear_cache()
print(f"After clear: {cached_store.cache_size / 1e6:.1f} MB cached")
```

## When Caching Helps

Caching is most effective when:

- You read the same files multiple times (for example, parsing metadata, then reading data)
- Multiple operations access overlapping files
- Files are small enough to fit in your cache budget
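
The first point can be made concrete with a toy backend that counts simulated network requests. All names here (`CountingBackend`, `SimpleCachingWrapper`, `fetch`) are hypothetical stand-ins used only to show why a whole-file cache collapses repeated reads into one request:

```python
class CountingBackend:
    """Stand-in for a remote store that counts simulated network requests."""

    def __init__(self):
        self.requests = 0

    def fetch(self, path: str) -> bytes:
        self.requests += 1  # each fetch represents one network round trip
        return b"fake bytes for " + path.encode()


class SimpleCachingWrapper:
    """Minimal whole-file cache in front of a backend (illustration only)."""

    def __init__(self, backend):
        self.backend = backend
        self._cache = {}

    def get_range(self, path: str, start: int, length: int) -> bytes:
        if path not in self._cache:  # first access: one backend request
            self._cache[path] = self.backend.fetch(path)
        return self._cache[path][start:start + length]


backend = CountingBackend()
store = SimpleCachingWrapper(backend)

store.get_range("data.nc", start=0, length=4)
store.get_range("data.nc", start=4, length=4)   # served from cache
print(f"network requests: {backend.requests}")  # network requests: 1
```

Both range reads hit the same file, so only the first one touches the backend; the second is answered from memory.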

mkdocs.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,7 @@ nav:
1717
- "User Guide":
1818
- "Opening Data with Xarray": "user-guide/opening-data-with-xarray.md"
1919
- "Finding Files on the Cloud": "user-guide/finding-files.md"
20+
- "Minimizing Data Transfer via Caching": "user-guide/caching-remote-data.md"
2021
- "Debugging Slow Data Access": "user-guide/debugging-data-access.md"
2122
- "API":
2223
- Glob: "api/glob.md"
