fix(follow): recover from safe-head divergence by resetting to common ancestor#3533
fix(follow): recover from safe-head divergence by resetting to common ancestor#3533niran wants to merge 2 commits into
Conversation
|
@niran Good catch from the review bot. The SynchronizeTask::execute returning Ok(()) on Syncing means reset_to_ancestor needs explicit response checking. A simple fix would be: rust This way recovery fails loudly instead of silently stalling the node. |
14ed6d9 to
2c05d69
Compare
🟡 Heimdall Review Status
|
f08d2ae to
7714405
Compare
| Err(e) => { | ||
| warn!(target: "follow", error = %e, "Failed to update safe/finalized labels"); | ||
| } |
There was a problem hiding this comment.
FinalizedDivergence is documented as unrecoverable ("requires operator intervention"), but it's caught here by the blanket Err(e) arm and logged as a warn! — then the loop retries on the next 30-second tick, hitting the same divergence repeatedly.
Consider either:
- Propagating
FinalizedDivergenceas a fatal error (return it sostart/ the caller can shut down cleanly), or - At minimum, logging it at
error!level so it doesn't blend in with transient RPC failures.
As-is, an operator seeing warn "Failed to update safe/finalized labels" every 30 seconds may not realize the node is in an unrecoverable state.
88b76d5 to
dea180e
Compare
|
Solid design overall @niran. The 1.
|
… ancestor When follow mode detects the local chain has diverged from the source at the safe label, reset the engine's forkchoice down to the highest block both nodes agree on (the common ancestor) and replay source payloads forward, instead of wedging forever. The reset targets a block the EL already has, so the forkchoice update is Valid and the head reorgs immediately (no EL sync) — unlike pointing the EL at the canonical tip it lacks, which would answer Syncing and never recover. - Finality is never rewound: a finalized-head divergence is fatal (FinalizedDivergence), and a reset below finalized is refused (ReorgBelowFinalized). - Adds an opt-in stateful test EL so recovery tests fail if the design regresses to forkchoice-to-tip. - Removes the now-dead SourceBlockHashMismatch (superseded by recovery). Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
dea180e to
67a6987
Compare
| LabelOutcome::Diverged { number, local: local_hash, remote } => { | ||
| // The source is authoritative for the L2 chain in follow mode. Rather than wedge | ||
| // forever, reset to the common ancestor and replay the source chain forward. | ||
| warn!( | ||
| target: "follow", | ||
| number, | ||
| local = %local_hash, | ||
| source = %remote, | ||
| "Local chain diverged from source safe head; recovering to common ancestor", | ||
| ); | ||
| generation.cancel(); | ||
| let finalized = local_finalized | ||
| .as_ref() | ||
| .ok_or(FollowError::LocalBlockUnavailable(BlockNumberOrTag::Finalized))?; | ||
| let ancestor = | ||
| recovery::recover(local.as_ref(), source.as_ref(), &engine, finalized, number) | ||
| .await?; | ||
| return Ok(SafetyOutcome::Reorged { ancestor }); | ||
| } |
There was a problem hiding this comment.
generation.cancel() is called unconditionally at line 201 before recover(). If recovery then fails (e.g. ResetToAncestorUnconfirmed from the EL returning Syncing), the error is caught here and logged as warn!, but the insert/fetch loops are already dying from the cancelled generation token.
The insert loop will exit — either with Ok(()) (cancellation check at line 94) or Err(BlocksToInsertChannelClosed) (prefetcher drops sender). In both cases, select! at line 319 picks up the result before the safety loop's next tick:
Ok(())→ mapped toSafetyOutcome::Updated→ runtime exits cleanly at line 337, silently masking the failed recoveryErr(BlocksToInsertChannelClosed)→ propagated via?with a misleading error that obscures the real failure
Consider deferring generation.cancel() until after recover() succeeds, so a failed recovery attempt leaves the current generation alive and can be retried on the next safety tick:
LabelOutcome::Diverged { number, local: local_hash, remote } => {
warn!(...);
let finalized = ...;
let ancestor = recovery::recover(...).await?;
generation.cancel(); // cancel only after successful reset
return Ok(SafetyOutcome::Reorged { ancestor });
}This way a transient Syncing response doesn't tear down the node, and the next safety tick retries the recovery while the existing insert loop continues operating on the current (divergent but functional) chain.
Review SummaryThe recovery design is sound: binary-search for the common ancestor, reset the EL to a block it already has (Valid forkchoice), then replay source payloads forward. The Finding
Notes on prior bot commentsSeveral earlier inline comments on this PR are based on incorrect premises:
|
#3527 stops follow mode from promoting a non-canonical block to
safe, but a node that has already diverged still wedges. This PR adds recovery: on a safe-head divergence, find the highest block local and source agree on (the common ancestor), reset the engine's forkchoice down to it (Valid — the EL has that block, no P2P sync), then replay source payloads forward so the EL rebuilds onto the canonical chain. Mirrors op-node'sFindL2Heads+ reset model.ReorgBelowFinalized) and preserves the existing finalized-label hash-mismatch behavior.Local devnet validation
Validated on local
base/basedevnet only, fromfix/follow-recoverybefore the rebase atf08d2aeea7ccb4946591b62fc98c757c1c3445e3. The branch has since been rebased ontomainafter #3527 merged; current head is67a6987b8c42b45f1fe287e011fe745d88fdb76d.just devnet upcompleted.just devnet checkspassed. The expected local L1 blob smoke warning still appeared:unexpected eip-4844 sidecar after osaka./Users/niran/workspace/tmp/2026-06-16-follow-mode-devnet/with a dedicated follow EL/CL pair (base-proofs-follow-el,base-proofs-follow-cl) so the test exercisedbase-consensus follow, not the normal validator/sequencer path.base-proofs-follow-cl;base-proofs-follow-el;2329(0x919):0xab2f9889e60f1c149e0a3303b32796f4aab8b9bdb8eed0c6216de492f267b7330xe9c95ac24501c2327134772e1e23952f7dc9b26061b493c87f756c7e196466e4safecoherently report canonical block2329.Local chain diverged from source safe head; recovering to common ancestor number=2329 ...Reset to common ancestor; restarting follow fetch/insert ancestor=2328Inserting source payload block=2329Inserted new payload hash=0xab2f... number=2329 payload_safety="unsafe"2328,2329, and2330matched between builder and follow EL;builder=3004 follow=3004during the recovery run; after restoring normal source RPC,builder=3131 follow=3131);SourceBlockHashMismatch,ReorgBelowFinalized, invalid forkchoice, panic,Failed, or invalid payload errors.Did not retry
--proofsgated validation in this pass. Prior local devnet attempts were blocked by proofs-history initialization /debug_proofsSyncStatus; this validation targeted the follow runtime and Engine API insertion/recovery path.Follow-up
SourceBlockHashMismatchpath and the safety polling loop logs the failed safe/finalized update before retrying. A follow-up should decide whether finalized divergence should instead be fatal at the loop/node level, or at least logged with stronger severity, since it requires operator intervention.Rebase validation
Rebased onto
mainafter #3527 merged asa11dd1a6a0dd8b2026f6070e4d993dd3f45842ef.cargo +nightly fmt --checkjust --justfile etc/docker/Justfile --listcargo test -p base-consensus-engine test_utilscargo test -p base-consensus-node follow::type=routine
risk=medium
impact=sev3