## Dump Format

The `graphman dump` command exports all entity data and metadata for a
single subgraph deployment into a self-contained directory of Parquet files
and JSON metadata. The resulting dump can be used to restore the deployment
into a different `graph-node` instance via `graphman restore`. Dumps are
consistent snapshots of the deployment's state at a specific point in time.

**WARNING**: The dump and restore commands are experimental and cannot
replace proper database backups at this point. In particular, there is no
guarantee that a dump will be restorable. That said, we encourage users to
try out the dump and restore commands in non-production environments and
report any issues they encounter.
| 14 | + |
**WARNING**: Dumping happens in a single transaction and can put significant
load on the database for large subgraphs. Use with caution on production
instances.

**WARNING**: Restoring a dump currently creates all the default indexes
that a new deployment gets and ignores the indexes that might have been
carefully curated for the original deployment, even though they are recorded
in the dump. This can lead to very long restore times for large subgraphs.
The restore process will be optimized in the future to only create indexes
that are present in the dump's metadata.
| 25 | + |
### Directory layout

A dump directory has the following structure:

```
<dump-dir>/
  metadata.json            -- deployment metadata + per-table state
  schema.graphql           -- raw GraphQL schema text
  subgraph.yaml            -- raw subgraph manifest YAML (optional)
  <EntityType>/
    chunk_000000.parquet   -- rows ordered by vid
    chunk_000001.parquet   -- incremental append (future chunks)
    ...
  data_sources$/
    chunk_000000.parquet   -- dynamic data sources
```
| 42 | + |
Each entity type defined in the GraphQL schema gets its own subdirectory,
named after the entity type exactly as it appears in the schema (e.g.
`Token/`, `Pool/`). The Proof of Indexing appears as a regular entity
directory named `Poi$`. The special `data_sources$` directory holds dynamic
data sources created at runtime.

Within each directory, data is stored in numbered chunk files
(`chunk_000000.parquet`, `chunk_000001.parquet`, ...). A fresh dump
produces a single `chunk_000000.parquet` per table. Incremental dumps
append new chunks rather than rewriting existing ones.

The GraphQL schema and subgraph manifest are stored as separate plain-text
files, `schema.graphql` and `subgraph.yaml`.
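To illustrate the layout, the chunk files of each table directory can be enumerated with a few lines of Python. This is a hedged sketch, not part of graph-node: `list_chunks` is a hypothetical helper name, and it relies only on the naming convention above (zero-padded chunk names sort lexicographically in vid order).

```python
from pathlib import Path

def list_chunks(dump_dir):
    """Map each table directory in a dump to its ordered chunk files."""
    chunks = {}
    for table_dir in sorted(Path(dump_dir).iterdir()):
        if table_dir.is_dir():
            # chunk_NNNNNN.parquet names sort lexicographically in vid order
            chunks[table_dir.name] = sorted(
                p.name for p in table_dir.glob("chunk_*.parquet")
            )
    return chunks
```

For example, a dump containing `Token/chunk_000000.parquet` and `Token/chunk_000001.parquet` yields `{"Token": ["chunk_000000.parquet", "chunk_000001.parquet"]}`.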
| 56 | + |
### metadata.json

The top-level `metadata.json` contains everything needed to reconstruct the
deployment's table structure, plus diagnostic information captured at dump
time. Its structure is:

```json
{
  "version": 1,
  "deployment": "Qm...",
  "network": "mainnet",

  "manifest": {
    "spec_version": "1.0.0",
    "description": "Optional subgraph description",
    "repository": "https://github.com/...",
    "features": ["..."],
    "entities_with_causality_region": ["EntityType1"],
    "history_blocks": 2147483647
  },

  "earliest_block_number": 12345,
  "start_block": { "number": 12345, "hash": "0xabc..." },
  "head_block": { "number": 99999, "hash": "0xdef..." },
  "entity_count": 150000,

  "graft_base": null,
  "graft_block": null,
  "debug_fork": null,

  "health": {
    "failed": false,
    "health": "healthy",
    "fatal_error": null,
    "non_fatal_errors": []
  },

  "indexes": {
    "token": [
      "CREATE INDEX CONCURRENTLY IF NOT EXISTS attr_0_0_id ON sgd.token USING btree (id)"
    ]
  },

  "tables": {
    "Token": {
      "immutable": true,
      "has_causality_region": false,
      "chunks": [
        {
          "file": "Token/chunk_000000.parquet",
          "min_vid": 0,
          "max_vid": 50000,
          "row_count": 50000
        }
      ],
      "max_vid": 50000
    },
    "data_sources$": {
      "immutable": false,
      "has_causality_region": true,
      "chunks": [
        {
          "file": "data_sources$/chunk_000000.parquet",
          "min_vid": 0,
          "max_vid": 100,
          "row_count": 100
        }
      ],
      "max_vid": 100
    }
  }
}
```
| 130 | + |
**Field descriptions:**

| Field | Description |
| ----- | ----------- |
| `version` | Format version. Must be `1`. |
| `deployment` | Deployment hash (`Qm...`). |
| `network` | The blockchain network (e.g. `mainnet`, `goerli`). |
| `manifest` | Manifest metadata extracted from `subgraphs.subgraph_manifest`. |
| `manifest.spec_version` | Subgraph API version. Required to parse `schema.graphql`. |
| `manifest.entities_with_causality_region` | Entity types that have a `causality_region` column. |
| `manifest.history_blocks` | How many blocks of entity version history are retained. |
| `earliest_block_number` | Earliest block for which data exists (accounts for pruning). |
| `start_block` | The block where indexing started. Null if not set. |
| `head_block` | The latest indexed block at dump time. |
| `entity_count` | Total entity count across all tables. |
| `graft_base` | Deployment hash of the graft base, if any. |
| `graft_block` | Block pointer of the graft point, if any. |
| `debug_fork` | Debug fork deployment hash, if any. |
| `health` | Point-in-time health snapshot. Not used during restore. |
| `indexes` | Point-in-time index definitions as SQL. Not used during restore (indexes are auto-created by `Layout::create_relational_schema()`). |
| `tables` | Per-table metadata keyed by entity type name (or `data_sources$`). |

Each entry in `tables` contains:

| Field | Description |
| ----- | ----------- |
| `immutable` | Whether the entity type is immutable (uses `block$` instead of `block_range`). |
| `has_causality_region` | Whether rows have a `causality_region` column. |
| `chunks` | Ordered list of Parquet chunk files for this table. |
| `chunks[].file` | Relative path from the dump directory. |
| `chunks[].min_vid` | Minimum `vid` value in this chunk. |
| `chunks[].max_vid` | Maximum `vid` value in this chunk. |
| `chunks[].row_count` | Number of rows in this chunk. |
| `max_vid` | Maximum `vid` across all chunks. `-1` if the table is empty. |
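The per-table invariants above can be checked mechanically. The following is a sketch (not part of graphman; `load_metadata` and `check_table_metadata` are hypothetical names), assuming that consecutive chunks cover disjoint, strictly increasing `vid` ranges and that the table-level `max_vid` equals the last chunk's `max_vid` (or `-1` for an empty table):

```python
import json

def load_metadata(dump_dir):
    """Read the top-level metadata.json of a dump directory."""
    with open(f"{dump_dir}/metadata.json") as f:
        return json.load(f)

def check_table_metadata(meta):
    """Return a list of consistency problems found in `tables`."""
    problems = []
    for name, table in meta["tables"].items():
        chunks = table["chunks"]
        if not chunks:
            if table["max_vid"] != -1:
                problems.append(f"{name}: empty table but max_vid != -1")
            continue
        prev_max = -1
        for chunk in chunks:
            # vid ranges must be disjoint and increasing across chunks
            if chunk["min_vid"] <= prev_max:
                problems.append(f"{name}: overlapping vid ranges at {chunk['file']}")
            prev_max = chunk["max_vid"]
        if table["max_vid"] != prev_max:
            problems.append(f"{name}: max_vid does not match last chunk")
    return problems
```

Run against the example `metadata.json` above, `check_table_metadata` returns an empty list.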
| 165 | + |
### Parquet schema: entity tables

Each entity table's Parquet file uses an Arrow schema derived from the
entity's GraphQL definition. Columns are ordered as follows:

1. **System columns** (always present, in this order):
   - `vid` (Int64, non-nullable) -- row version ID
   - Block tracking (one of):
     - Immutable entities: `block$` (Int32, non-nullable)
     - Mutable entities: `block_range_start` (Int32, non-nullable),
       `block_range_end` (Int32, nullable -- null means unbounded/current)
   - `causality_region` (Int32, non-nullable) -- only if the entity has one

2. **Data columns** in GraphQL declaration order, skipping fulltext
   (`TSVector`) columns, which are generated and rebuilt on restore.

The PostgreSQL `int4range` type used for `block_range` is decomposed into
two scalar columns (`block_range_start`, `block_range_end`) in the Parquet
representation. This avoids the need for a custom range type in Arrow.
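The decomposition is straightforward. A sketch, assuming the canonical Postgres text form `[lower,upper)` for `int4range` (the hypothetical `split_block_range` helper is for illustration only):

```python
def split_block_range(block_range):
    """Decompose a canonical Postgres int4range literal such as
    "[100,200)" or "[12345,)" into (block_range_start, block_range_end).
    An absent upper bound becomes None, matching a null
    block_range_end column in the Parquet file."""
    lower, upper = block_range.strip("[)").split(",")
    return int(lower), int(upper) if upper else None
```

For example, the unbounded range `[12345,)` of a current entity version becomes `(12345, None)`.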
| 185 | + |
#### Type mapping

GraphQL/PostgreSQL column types map to Arrow data types as follows:

| ColumnType | Arrow DataType | Notes |
| ---------- | -------------- | ----- |
| `Boolean` | `Boolean` | |
| `Int` | `Int32` | |
| `Int8` | `Int64` | |
| `Bytes` | `Binary` | Raw bytes, no hex encoding |
| `BigInt` | `Utf8` | Stored as decimal string for arbitrary precision |
| `BigDecimal` | `Utf8` | Stored as decimal string for arbitrary precision |
| `Timestamp` | `Timestamp(Microsecond, None)` | Microseconds since epoch, no timezone |
| `String` | `Utf8` | |
| `Enum(...)` | `Utf8` | Enum variant as string (cast from PG enum to text during dump) |
| `TSVector(...)` | _skipped_ | Fulltext index columns are generated; rebuilt on restore |

**Array columns:** A GraphQL list field (e.g. `tags: [String!]!`) is
stored as `List<T>` where `T` is the base Arrow type from the table
above. Whether a column is a list is determined by the GraphQL field type,
not by `ColumnType`. For example, `[String!]!` becomes `List<Utf8>` and
`[Int!]` becomes `List<Int32>`.

**Nullability** follows the GraphQL schema: non-null fields produce
non-nullable Arrow columns; optional fields produce nullable columns. List
elements within list columns are always marked nullable in the Arrow schema.
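The mapping rules above can be condensed into a small lookup. This sketch uses plain strings for the Arrow type names rather than a real Arrow library, and the `arrow_type` helper name is hypothetical:

```python
# Scalar ColumnType -> Arrow DataType, per the table above.
SCALAR_ARROW_TYPES = {
    "Boolean": "Boolean",
    "Int": "Int32",
    "Int8": "Int64",
    "Bytes": "Binary",
    "BigInt": "Utf8",
    "BigDecimal": "Utf8",
    "Timestamp": "Timestamp(Microsecond, None)",
    "String": "Utf8",
    "Enum": "Utf8",
}

def arrow_type(column_type, is_list):
    """Return the Arrow type name for a column, or None for TSVector
    columns, which are skipped entirely. Listness comes from the
    GraphQL field type, not from ColumnType."""
    if column_type == "TSVector":
        return None
    base = SCALAR_ARROW_TYPES[column_type]
    return f"List<{base}>" if is_list else base
```

So `arrow_type("String", is_list=True)` yields `List<Utf8>`, matching the `[String!]!` example above.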
| 212 | + |
### Parquet schema: data_sources$

The `data_sources$` table has a fixed schema independent of the GraphQL
definition:

| Column | Arrow DataType | Nullable | Description |
| ------ | -------------- | -------- | ----------- |
| `vid` | `Int64` | no | Row version ID |
| `block_range_start` | `Int32` | no | Lower bound of `block_range` |
| `block_range_end` | `Int32` | yes | Upper bound (null = unbounded) |
| `causality_region` | `Int32` | no | Causality region |
| `manifest_idx` | `Int32` | no | Index into the manifest's data source list |
| `parent` | `Int32` | yes | Self-referencing parent data source |
| `id` | `Binary` | yes | Data source identifier |
| `param` | `Binary` | yes | Data source parameter |
| `context` | `Utf8` | yes | JSON context |
| `done_at` | `Int32` | yes | Block number where the data source was marked done |
| 231 | +### Compression |
| 232 | + |
| 233 | +All Parquet files use ZSTD compression (default level). |
| 234 | + |
| 235 | +### Row ordering |
| 236 | + |
| 237 | +Within each Parquet chunk file, rows are ordered by `vid` (ascending). |
| 238 | +This matches the primary key ordering in PostgreSQL and enables efficient |
| 239 | +sequential reads during restore. |
| 240 | + |
### Incremental dumps

An incremental dump reads the existing `metadata.json`, determines the
`max_vid` for each table, and queries only rows with `vid > max_vid`. New
rows are written to new chunk files (e.g. `chunk_000001.parquet`) and the
metadata is updated atomically (write to a temp file, then rename).
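The planning step for one table reduces to two values: the `vid` cutoff and the next chunk file name. A minimal sketch (the `incremental_plan` helper is hypothetical), assuming a table entry from `metadata.json` as shown earlier:

```python
def incremental_plan(table):
    """Given a table entry from metadata.json, return the vid cutoff
    to query beyond (rows with vid > cutoff are new) and the name of
    the next chunk file to write."""
    next_index = len(table["chunks"])
    return table["max_vid"], f"chunk_{next_index:06d}.parquet"
```

An empty table (`max_vid` of `-1`, no chunks) naturally yields a cutoff of `-1` and `chunk_000000.parquet`, so a fresh dump is just the degenerate case of an incremental one.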
| 247 | + |
### Atomicity

The `metadata.json` file is always written atomically: the dump writes to
`metadata.json.tmp` first, then renames it to `metadata.json`. This
ensures that a reader never sees a partially-written metadata file. If the
dump process crashes mid-write, the previous `metadata.json` remains
intact. The Parquet chunk files are written before `metadata.json` is
updated, so chunk files referenced by `metadata.json` are always complete.
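The write-then-rename pattern looks like this in Python (a generic sketch of the technique, not graphman's Rust implementation; `write_metadata_atomically` is a hypothetical name):

```python
import json
import os

def write_metadata_atomically(dump_dir, meta):
    """Write metadata.json via a temp file plus rename, so a reader
    never observes a partially-written file. os.replace is atomic when
    source and destination are on the same filesystem."""
    tmp = os.path.join(dump_dir, "metadata.json.tmp")
    final = os.path.join(dump_dir, "metadata.json")
    with open(tmp, "w") as f:
        json.dump(meta, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # flush bytes to disk before the rename
    os.replace(tmp, final)
```

Crucially, the rename is the commit point: until `os.replace` runs, the old `metadata.json` (if any) is untouched, which mirrors the crash behavior described above.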