
Commit 1bd0448

docs: Document the deployment dump format
1 parent 1cfbd05 commit 1bd0448

2 files changed

Lines changed: 256 additions & 0 deletions

File tree

docs/dump.md

Lines changed: 255 additions & 0 deletions
@@ -0,0 +1,255 @@
## Dump Format

The `graphman dump` command exports all entity data and metadata for a
single subgraph deployment into a self-contained directory of Parquet files
and JSON metadata. The resulting dump can be used to restore the deployment
into a different `graph-node` instance via `graphman restore`. Dumps are
consistent snapshots of the deployment's state at a specific point in time.

**WARNING**: The dump and restore commands are experimental and cannot
replace proper database backups at this point. In particular, there is no
guarantee that a dump will be restorable. That said, we encourage users
to try out the dump and restore commands in non-production environments
and report any issues they encounter.

**WARNING**: Dumping happens in a single transaction and can put significant
load on the database for large subgraphs. Use with caution on production
instances.

**WARNING**: Restoring a dump currently creates all the default indexes
that a new deployment gets and ignores the indexes that might have been
carefully curated for the original deployment, even though they are recorded
in the dump. This can lead to very long restore times for large subgraphs.
The restore process will be optimized in the future to only create indexes
that are present in the dump's metadata.

### Directory layout

A dump directory has the following structure:

```
<dump-dir>/
  metadata.json            -- deployment metadata + per-table state
  schema.graphql           -- raw GraphQL schema text
  subgraph.yaml            -- raw subgraph manifest YAML (optional)
  <EntityType>/
    chunk_000000.parquet   -- rows ordered by vid
    chunk_000001.parquet   -- incremental append (future chunks)
    ...
  data_sources$/
    chunk_000000.parquet   -- dynamic data sources
```

Each entity type defined in the GraphQL schema gets its own subdirectory,
named after the entity type exactly as it appears in the schema (e.g.
`Token/`, `Pool/`). The Proof of Indexing appears as a regular entity
directory named `Poi$`. The special `data_sources$` directory holds dynamic
data sources created at runtime.

Within each directory, data is stored in numbered chunk files
(`chunk_000000.parquet`, `chunk_000001.parquet`, ...). A fresh dump
produces a single `chunk_000000.parquet` per table. Incremental dumps
append new chunks rather than rewriting existing ones.

The GraphQL schema and subgraph manifest are stored as separate plain-text
files `schema.graphql` and `subgraph.yaml`.
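
A restore tool can discover a table's chunks by sorting the numbered files. The following Python sketch is illustrative (the `chunk_files` helper is not part of graphman); it relies only on the zero-padded naming scheme documented above:

```python
from pathlib import Path
import tempfile

def chunk_files(dump_dir: Path, table: str) -> list[Path]:
    """Return a table's Parquet chunks in ascending chunk order.

    The zero-padded names (chunk_000000.parquet, ...) make
    lexicographic order equal to numeric order.
    """
    return sorted((dump_dir / table).glob("chunk_*.parquet"))

# Demo with an empty stand-in dump directory.
with tempfile.TemporaryDirectory() as tmp:
    dump = Path(tmp)
    (dump / "Token").mkdir()
    for i in (1, 0, 2):  # create out of order on purpose
        (dump / "Token" / f"chunk_{i:06d}.parquet").touch()
    names = [p.name for p in chunk_files(dump, "Token")]
    print(names)
```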

### metadata.json

The top-level `metadata.json` contains everything needed to reconstruct the
deployment's table structure, plus diagnostic information captured at dump
time. Its structure is:

```json
{
  "version": 1,
  "deployment": "Qm...",
  "network": "mainnet",

  "manifest": {
    "spec_version": "1.0.0",
    "description": "Optional subgraph description",
    "repository": "https://github.com/...",
    "features": ["..."],
    "entities_with_causality_region": ["EntityType1"],
    "history_blocks": 2147483647
  },

  "earliest_block_number": 12345,
  "start_block": { "number": 12345, "hash": "0xabc..." },
  "head_block": { "number": 99999, "hash": "0xdef..." },
  "entity_count": 150000,

  "graft_base": null,
  "graft_block": null,
  "debug_fork": null,

  "health": {
    "failed": false,
    "health": "healthy",
    "fatal_error": null,
    "non_fatal_errors": []
  },

  "indexes": {
    "token": [
      "CREATE INDEX CONCURRENTLY IF NOT EXISTS attr_0_0_id ON sgd.token USING btree (id)"
    ]
  },

  "tables": {
    "Token": {
      "immutable": true,
      "has_causality_region": false,
      "chunks": [
        {
          "file": "Token/chunk_000000.parquet",
          "min_vid": 0,
          "max_vid": 50000,
          "row_count": 50000
        }
      ],
      "max_vid": 50000
    },
    "data_sources$": {
      "immutable": false,
      "has_causality_region": true,
      "chunks": [
        {
          "file": "data_sources$/chunk_000000.parquet",
          "min_vid": 0,
          "max_vid": 100,
          "row_count": 100
        }
      ],
      "max_vid": 100
    }
  }
}
```

**Field descriptions:**

| Field | Description |
| ----------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
| `version` | Format version. Must be `1`. |
| `deployment` | Deployment hash (`Qm...`). |
| `network` | The blockchain network (e.g. `mainnet`, `goerli`). |
| `manifest` | Manifest metadata extracted from `subgraphs.subgraph_manifest`. |
| `manifest.spec_version` | Subgraph API version. Required to parse `schema.graphql`. |
| `manifest.entities_with_causality_region` | Entity types that have a `causality_region` column. |
| `manifest.history_blocks` | How many blocks of entity version history are retained. |
| `earliest_block_number` | Earliest block for which data exists (accounts for pruning). |
| `start_block` | The block where indexing started. Null if not set. |
| `head_block` | The latest indexed block at dump time. |
| `entity_count` | Total entity count across all tables. |
| `graft_base` | Deployment hash of the graft base, if any. |
| `graft_block` | Block pointer of the graft point, if any. |
| `debug_fork` | Debug fork deployment hash, if any. |
| `health` | Point-in-time health snapshot. Not used during restore. |
| `indexes` | Point-in-time index definitions as SQL. Not used during restore (indexes are auto-created by `Layout::create_relational_schema()`). |
| `tables` | Per-table metadata keyed by entity type name (or `data_sources$`). |
Each entry in `tables` contains:

| Field | Description |
| ---------------------- | ------------------------------------------------------------------------------ |
| `immutable` | Whether the entity type is immutable (uses `block$` instead of `block_range`). |
| `has_causality_region` | Whether rows have a `causality_region` column. |
| `chunks` | Ordered list of Parquet chunk files for this table. |
| `chunks[].file` | Relative path from the dump directory. |
| `chunks[].min_vid` | Minimum `vid` value in this chunk. |
| `chunks[].max_vid` | Maximum `vid` value in this chunk. |
| `chunks[].row_count` | Number of rows in this chunk. |
| `max_vid` | Maximum `vid` across all chunks. `-1` if the table is empty. |
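
As a sketch of how a consumer might use this metadata, the following Python snippet (illustrative; `load_metadata` is not a graphman API) loads a minimal `metadata.json`, checks the format version, and enumerates a table's chunk files:

```python
import json
import tempfile
from pathlib import Path

def load_metadata(dump_dir: Path) -> dict:
    """Read and validate metadata.json from a dump directory."""
    meta = json.loads((dump_dir / "metadata.json").read_text())
    if meta["version"] != 1:
        raise ValueError(f"unsupported dump format version {meta['version']}")
    return meta

# Build a minimal stand-in dump directory for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    dump = Path(tmp)
    (dump / "metadata.json").write_text(json.dumps({
        "version": 1,
        "deployment": "Qm...",
        "tables": {
            "Token": {
                "immutable": True,
                "has_causality_region": False,
                "chunks": [{"file": "Token/chunk_000000.parquet",
                            "min_vid": 0, "max_vid": 50000,
                            "row_count": 50000}],
                "max_vid": 50000,
            }
        },
    }))
    meta = load_metadata(dump)
    files = [c["file"] for c in meta["tables"]["Token"]["chunks"]]
    print(files)
```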
### Parquet schema: entity tables

Each entity table's Parquet file uses an Arrow schema derived from the
entity's GraphQL definition. Columns are ordered as follows:

1. **System columns** (always present, in this order):
   - `vid` (Int64, non-nullable) -- row version ID
   - Block tracking (one of):
     - Immutable entities: `block$` (Int32, non-nullable)
     - Mutable entities: `block_range_start` (Int32, non-nullable),
       `block_range_end` (Int32, nullable -- null means unbounded/current)
   - `causality_region` (Int32, non-nullable) -- only if the entity has one

2. **Data columns** in GraphQL declaration order, skipping fulltext
   (`TSVector`) columns, which are generated and rebuilt on restore.

The PostgreSQL `int4range` type used for `block_range` is decomposed into
two scalar columns (`block_range_start`, `block_range_end`) in the Parquet
representation. This avoids the need for a custom range type in Arrow.
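
The decomposition can be illustrated with a small Python sketch that splits PostgreSQL's canonical `int4range` text form (`[lower,upper)`, with an empty upper bound meaning unbounded) into the two scalar columns. The `split_block_range` helper is hypothetical, not part of graphman:

```python
import re
from typing import Optional, Tuple

def split_block_range(block_range: str) -> Tuple[int, Optional[int]]:
    """Decompose an int4range literal into (start, end).

    Assumes the canonical '[lower,upper)' text form; an empty upper
    bound means the range is unbounded (the current entity version).
    """
    m = re.fullmatch(r"\[(\d+),(\d*)\)", block_range)
    if m is None:
        raise ValueError(f"unexpected range literal: {block_range!r}")
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) else None  # None = unbounded/current
    return start, end

print(split_block_range("[12345,99999)"))  # superseded entity version
print(split_block_range("[12345,)"))       # current entity version
```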
#### Type mapping

GraphQL/PostgreSQL column types map to Arrow data types as follows:

| ColumnType | Arrow DataType | Notes |
| --------------- | ------------------------------ | -------------------------------------------------------------- |
| `Boolean` | `Boolean` | |
| `Int` | `Int32` | |
| `Int8` | `Int64` | |
| `Bytes` | `Binary` | Raw bytes, no hex encoding |
| `BigInt` | `Utf8` | Stored as decimal string for arbitrary precision |
| `BigDecimal` | `Utf8` | Stored as decimal string for arbitrary precision |
| `Timestamp` | `Timestamp(Microsecond, None)` | Microseconds since epoch, no timezone |
| `String` | `Utf8` | |
| `Enum(...)` | `Utf8` | Enum variant as string (cast from PG enum to text during dump) |
| `TSVector(...)` | _skipped_ | Fulltext index columns are generated; rebuilt on restore |

**Array columns:** A GraphQL list field (e.g. `tags: [String!]!`) is
stored as `List<T>` where `T` is the base Arrow type from the table
above. Whether a column is a list is determined by the GraphQL field type,
not by `ColumnType`. For example, `[String!]!` becomes `List<Utf8>` and
`[Int!]` becomes `List<Int32>`.
**Nullability** follows the GraphQL schema: non-null fields produce
non-nullable Arrow columns; optional fields produce nullable columns. List
elements within list columns are always marked nullable in the Arrow schema.
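
The mapping rules above can be summarized in a Python sketch that describes the resulting Arrow type as a string (so no Arrow library is needed); the `arrow_type` helper and the trailing `?` for nullable columns are illustrative conventions, not graphman output:

```python
# ColumnType -> Arrow type name, per the table above.
SCALAR_ARROW_TYPE = {
    "Boolean": "Boolean",
    "Int": "Int32",
    "Int8": "Int64",
    "Bytes": "Binary",
    "BigInt": "Utf8",
    "BigDecimal": "Utf8",
    "Timestamp": "Timestamp(Microsecond, None)",
    "String": "Utf8",
    "Enum": "Utf8",
}

def arrow_type(column_type: str, is_list: bool, nullable: bool) -> str:
    """Describe a column's Arrow type. List-ness and nullability come
    from the GraphQL field type, not from ColumnType."""
    if column_type == "TSVector":
        raise ValueError("fulltext columns are skipped in the dump")
    base = SCALAR_ARROW_TYPE[column_type]
    described = f"List<{base}>" if is_list else base
    return described + ("?" if nullable else "")

print(arrow_type("String", is_list=True, nullable=False))  # tags: [String!]!
print(arrow_type("Int", is_list=False, nullable=True))     # score: Int
```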
### Parquet schema: data_sources$

The `data_sources$` table has a fixed schema independent of the GraphQL
definition:

| Column | Arrow DataType | Nullable | Description |
| ------------------- | -------------- | -------- | -------------------------------------------------- |
| `vid` | `Int64` | no | Row version ID |
| `block_range_start` | `Int32` | no | Lower bound of `block_range` |
| `block_range_end` | `Int32` | yes | Upper bound (null = unbounded) |
| `causality_region` | `Int32` | no | Causality region |
| `manifest_idx` | `Int32` | no | Index into the manifest's data source list |
| `parent` | `Int32` | yes | Self-referencing parent data source |
| `id` | `Binary` | yes | Data source identifier |
| `param` | `Binary` | yes | Data source parameter |
| `context` | `Utf8` | yes | JSON context |
| `done_at` | `Int32` | yes | Block number where the data source was marked done |
### Compression

All Parquet files use ZSTD compression (default level).
### Row ordering

Within each Parquet chunk file, rows are ordered by `vid` (ascending).
This matches the primary key ordering in PostgreSQL and enables efficient
sequential reads during restore.
### Incremental dumps

An incremental dump reads the existing `metadata.json`, determines the
`max_vid` for each table, and queries only rows with `vid > max_vid`. New
rows are written to new chunk files (e.g. `chunk_000001.parquet`) and the
metadata is updated atomically (write to a temp file, then rename).
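
The per-table bookkeeping amounts to two values: the `vid` cutoff and the next chunk file name. A minimal Python sketch (the `next_chunk` helper is illustrative; field names follow `metadata.json`):

```python
def next_chunk(table_meta: dict, table: str) -> tuple[int, str]:
    """Given one table's metadata, return the vid cutoff and the name
    of the chunk file the incremental dump would write next."""
    cutoff = table_meta["max_vid"]   # dump only rows with vid > cutoff
    n = len(table_meta["chunks"])    # existing chunks are never rewritten
    return cutoff, f"{table}/chunk_{n:06d}.parquet"

token = {
    "chunks": [{"file": "Token/chunk_000000.parquet",
                "min_vid": 0, "max_vid": 50000, "row_count": 50000}],
    "max_vid": 50000,
}
cutoff, chunk = next_chunk(token, "Token")
print(cutoff, chunk)
```

For an empty table (`max_vid` of `-1`, no chunks), the same logic yields a cutoff of `-1` and `chunk_000000.parquet`, i.e. a full dump.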
### Atomicity

The `metadata.json` file is always written atomically: the dump writes to
`metadata.json.tmp` first, then renames it to `metadata.json`. This
ensures that a reader never sees a partially-written metadata file. If the
dump process crashes mid-write, the previous `metadata.json` remains
intact. The Parquet chunk files are written before `metadata.json` is
updated, so chunk files referenced by `metadata.json` are always complete.
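
The write-then-rename pattern looks like this in Python (a sketch of the same technique, not graphman's code; `os.replace` is atomic within a filesystem on POSIX):

```python
import json
import os
import tempfile
from pathlib import Path

def write_metadata_atomically(dump_dir: Path, meta: dict) -> None:
    """Write metadata.json via a temp file so readers never observe
    a partially-written file."""
    tmp = dump_dir / "metadata.json.tmp"
    tmp.write_text(json.dumps(meta, indent=2))
    os.replace(tmp, dump_dir / "metadata.json")  # atomic rename

with tempfile.TemporaryDirectory() as d:
    dump = Path(d)
    write_metadata_atomically(dump, {"version": 1, "tables": {}})
    loaded = json.loads((dump / "metadata.json").read_text())
    print(loaded["version"])
    print((dump / "metadata.json.tmp").exists())  # temp file is gone
```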

docs/implementation/README.md

Lines changed: 1 addition & 0 deletions
@@ -10,3 +10,4 @@ the code should go into comments.
* [SQL Query Generation](./sql-query-generation.md)
* [Adding support for a new chain](./add-chain.md)
* [Pruning](./pruning.md)
* [Dump Format](./dump.md)
