Skip to content

Slave rewrite#143

Open
qzhodl wants to merge 5 commits into
mainfrom
slave-rewrite
Open

Slave rewrite#143
qzhodl wants to merge 5 commits into
mainfrom
slave-rewrite

Conversation

@qzhodl
Copy link
Copy Markdown
Contributor

@qzhodl qzhodl commented May 21, 2026

No description provided.

@qzhodl qzhodl changed the title Slave redesign Slave rewrite May 21, 2026
Comment thread L1/slave-rewrite-validation.md Outdated
9. EL: tx execution + state/receipts validation + persist + payload-self-consistency check (recomputes Header.Hash() and rejects on mismatch — standard EL behavior, same as vanilla post-merge geth against its beacon CL)
```

**EVM-level note**: post-merge Ethereum redefined opcode `0x44` from `DIFFICULTY` to `PREVRANDAO` (EIP-4399). For QKC's EVM upgrade, what `block.difficulty` / `block.prevrandao` actually return is a patched-geth decision (preserve the pre-Paris numeric-difficulty semantics, or move to PREVRANDAO with mixhash as the source). Either choice has implications for existing QKC mainnet contracts that read this opcode; audited under the patched-geth workstream, not here.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PREVRANDAO / Mixhash Risk:
Suppose the Engine API payload is built with payloadAttributes.prevRandao. If a pending transaction reads PREVRANDAO, geth executes that transaction against this placeholder value while computing the payload state root.
If the CL later replaces prevRandao with the mined PoW mixhash before calling engine_newPayload, any transaction that read PREVRANDAO may observe a different value on re-execution. That can make the sealed payload invalid or expose a consensus mismatch, depending on how the EL validates the payload.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in eea8194

|---|---|
| Slave Process (this Go binary) | `SlaveServer` ([slave.py](../quarkchain/cluster/slave.py)) |
| ShardCL | `Shard` ([shard.py:502](../quarkchain/cluster/shard.py#L502)) |
| ShardCL's geth subprocess | (new) — replaces `MinorBlockChain` + EVM + StateDB |
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should geth's builtin p2p capability be disabled?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we don't use geth's p2p

| `GET_ROOT_CHAIN_STAKES_REQUEST` | **Only the slave owning chain 0 shard 0 serves this** (master hardcodes `full_shard_id = 1` at [master.py:1777](../quarkchain/cluster/master.py#L1777)). `eth_call` against the `ROOT_CHAIN_POSW` system contract at the last-confirmed minor block's state (requires geth `--gcmode=archive` or a short-term snapshot) |
| `GET_ECO_INFO_LIST_REQUEST` | CL local stats (shard tip, difficulty, etc.) |

### 4.4 Mining ops
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we removing 'GET_NEXT_BLOCK_TO_MINE_REQUEST' on purpose? It is also in 'MASTER_OP_RPC_MAP'.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c437634

Comment thread L1/slave-rewrite-validation.md Outdated
Reference: [`quarkchain/protocol.py`](../quarkchain/protocol.py), [`quarkchain/cluster/protocol.py`](../quarkchain/cluster/protocol.py).

Field semantics:
- `branch`: shard identifier; `ROOT_BRANCH` = `0x1` for root chain / cluster control plane.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ROOT_BRANCH is 0

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c437634

Comment thread L1/slave-rewrite-validation.md Outdated

**Not stored at all** (compared to earlier drafts of this doc that proposed them):
- `td` (total difficulty) — **doesn't apply to shard chain**. Pyquarkchain's `MinorBlockHeader` ([core.py:681](../quarkchain/core.py#L681)) has per-block `difficulty` but **no `total_difficulty` field**, and `shard_state.py` never references TD on the shard side. Shard tip-update uses height + root-anchor tie-breaker, not TD (see §10 / B.3). Only `RootBlockHeader` carries `total_difficulty`, used by master's root-chain fork choice.
- Canonical / root tip pointers — reconstructed at startup the same way pyquarkchain does it (see [shard_state.py:279 `init_from_root_block`](../quarkchain/cluster/shard_state.py#L279)): master sends the current root tip via `CONNECT_TO_SLAVES_REQUEST`; CL looks up `b"r_last_m" + root_tip_hash` to find the last confirmed mheader for this shard, and that becomes the starting `header_tip`. Any unconfirmed mblocks mined after the last root checkpoint are discarded on restart — same behavior as pyquarkchain today.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems CONNECT_TO_SLAVES_REQUEST only carries slave info. root_tip is handled by PING.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c437634


**Exit criteria**: a master + one new slave can boot from existing mainnet genesis state, ingest a `CONNECT_TO_SLAVES_REQUEST` + `ADD_ROOT_BLOCK_REQUEST`, mine minor blocks, and report headers back to master. Round-trip on existing wire formats.

### M3 — Peer-shard P2P and sync
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest change M3 to M3 — Xshard, checkdb
Rationale: checkdb reads historical blocks from go/quarkchain db, re-inserts them into the chain, and verifies state — zero network dependency. Xshard hooks (pre/post-block cursor, deposit consumption) operate on block payload data, not real-time P2P. Implementing these first lets us validate the full execution pipeline via checkdb before any P2P work.


**Exit criteria**: new slave joins a running testnet, downloads the chain from a peer, and stays at the tip with vanilla pyquarkchain slaves continuing to mine.

### M4 — Xshard, RPC surface, indexer
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can put All network-related development at M4, P2P VirtualConnection, sync, real-time slave-to-slave xshard delivery, and master TCP integration belong here.

This same field set also defines the CL↔EL Engine API extension schema (`PayloadAttributesV?QKC` / `ExecutionPayloadV?QKC`) — the contract the CL workstream and the patched-geth workstream must agree on, replacing M1's seal-only dev path.
- No p2p, no sync, no xshard yet.

**Exit criteria**: a master + one new slave can boot from existing mainnet genesis state, ingest a `CONNECT_TO_SLAVES_REQUEST` + `ADD_ROOT_BLOCK_REQUEST`, mine minor blocks, and report headers back to master. Round-trip on existing wire formats.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "existing mainnet genesis state" mean? Does it mean mainnet config?

- Divergence triage tooling: when a block fails, capture pre-state, tx list, post-state diff so the patched-geth or CL bug is bisectable.
- Performance pass: long sync needs reasonable throughput; sequential `engine_newPayload` may bottleneck (open item — tracked separately in the perf workstream).

**Anticipated problem — EVM version divergence**: pyquarkchain's EVM is an old fork of pyethereum (Constantinople/Petersburg era), while the new EL is latest-geth. Between them, opcode gas costs were repriced (EIP-2929 access lists, EIP-3529 refund changes, EIP-1884, EIP-2200, …), opcodes were added (`PUSH0`, …), and behaviors changed. If a historical block is re-executed under the *new* EVM rules, gas accounting diverges → `gasUsed`, refunds, and coinbase amounts differ → state root won't match → replay fails. This is the single most likely cause of replay divergence.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can reuse existing --check_db tool as a light-weight pre-test before full replay harness sync test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants