fix(follow): recover from safe-head divergence by resetting to common ancestor by niran · Pull Request #3533 · base/base

niran · 2026-06-13T23:08:10Z

#3527 stops follow mode from promoting a non-canonical block to safe, but a node that has already diverged still wedges. This PR adds recovery: on a safe-head divergence, find the highest block local and source agree on (the common ancestor), reset the engine's forkchoice down to it (Valid — the EL has that block, no P2P sync), then replay source payloads forward so the EL rebuilds onto the canonical chain. Mirrors op-node's FindL2Heads + reset model.

Finality is never rewound: recovery refuses resets below finalized (ReorgBelowFinalized) and preserves the existing finalized-label hash-mismatch behavior.
Trust model inherited from fix(consensus): coherent follow-mode safe/finalized read #3527 (source authoritative for L2, finalized-guarded). A transient safe-head flap across source replicas is tolerated: recovery fires only on a genuine hash divergence at a given height, not when a replica briefly reports a stale-but-canonical head (that just yields Skip/Promote).
Adds a stateful test EL so recovery tests fail if the design regresses to forkchoice-to-tip.

Local devnet validation

Validated on local base/base devnet only, from fix/follow-recovery before the rebase at f08d2aeea7ccb4946591b62fc98c757c1c3445e3. The branch has since been rebased onto main after #3527 merged; current head is 67a6987b8c42b45f1fe287e011fe745d88fdb76d.

just devnet up completed.
just devnet checks passed. The expected local L1 blob smoke warning still appeared: unexpected eip-4844 sidecar after osaka.
Added a scratch compose overlay under /Users/niran/workspace/tmp/2026-06-16-follow-mode-devnet/ with a dedicated follow EL/CL pair (base-proofs-follow-el, base-proofs-follow-cl) so the test exercised base-consensus follow, not the normal validator/sequencer path.
Confirmed the follow CL started and inserted source payloads through live head.
Simulated an already-diverged follow node:
- stopped only base-proofs-follow-cl;
- sent a tx only to base-proofs-follow-el;
- temporarily ran a standalone sequencer CL against the follow EL Engine API;
- created a local/source mismatch at block 2329 (0x919):
  - source canonical hash: 0xab2f9889e60f1c149e0a3303b32796f4aab8b9bdb8eed0c6216de492f267b733
  - local divergent hash: 0xe9c95ac24501c2327134772e1e23952f7dc9b26061b493c87f756c7e196466e4
- used a scratch source-RPC shim to make source safe coherently report canonical block 2329.
Observed recovery in follow CL logs:
- Local chain diverged from source safe head; recovering to common ancestor number=2329 ...
- Reset to common ancestor; restarting follow fetch/insert ancestor=2328
- Inserting source payload block=2329
- Inserted new payload hash=0xab2f... number=2329 payload_safety="unsafe"
Verified post-recovery convergence:
- blocks 2328, 2329, and 2330 matched between builder and follow EL;
- latest matched after catch-up (builder=3004 follow=3004 during the recovery run; after restoring normal source RPC, builder=3131 follow=3131);
- both follow services were healthy.
Scanned follow CL/EL logs and found no SourceBlockHashMismatch, ReorgBelowFinalized, invalid forkchoice, panic, Failed, or invalid payload errors.

Did not retry --proofs gated validation in this pass. Prior local devnet attempts were blocked by proofs-history initialization / debug_proofsSyncStatus; this validation targeted the follow runtime and Engine API insertion/recovery path.

Follow-up

Finalized-head divergence handling is intentionally not changing in this PR: finalized-label hash mismatches still surface through the existing SourceBlockHashMismatch path and the safety polling loop logs the failed safe/finalized update before retrying. A follow-up should decide whether finalized divergence should instead be fatal at the loop/node level, or at least logged with stronger severity, since it requires operator intervention.

Rebase validation

Rebased onto main after #3527 merged as a11dd1a6a0dd8b2026f6070e4d993dd3f45842ef.

cargo +nightly fmt --check
just --justfile etc/docker/Justfile --list
cargo test -p base-consensus-engine test_utils
cargo test -p base-consensus-node follow::

type=routine
risk=medium
impact=sev3

Sertug17 · 2026-06-14T08:32:38Z

@niran Good catch from the review bot. The SynchronizeTask::execute returning Ok(()) on Syncing means reset_to_ancestor needs explicit response checking. A simple fix would be:

rust
let result = task.execute(&mut state).await.map_err(FollowError::engine_task)?;
// Verify the forkchoice update was actually applied
if !state.sync_state.safe_head().map_or(false, |h| h.block_info.hash == ancestor.block_info.hash) {
return Err(FollowError::RecoveryFailed(
"reset_to_ancestor: EL returned Syncing, head not updated".into()
));
}

This way recovery fails loudly instead of silently stalling the node.

cb-heimdall · 2026-06-16T20:14:07Z

🟡 Heimdall Review Status

Requirement Status More Info

Reviews

🟡 0/1

Denominator calculation

Show calculation

1 if user is bot 0

1 if user is external 0

2 if repo is sensitive 0

From .codeflow.yml 1

Additional review requirements

Show calculation

Max	0

0

From CODEOWNERS 0

Global minimum 0

Max 1

1

1 if commit is unverified 0

Sum 1

github-actions · 2026-06-16T21:24:14Z

+                Err(e) => {
+                    warn!(target: "follow", error = %e, "Failed to update safe/finalized labels");
+                }


FinalizedDivergence is documented as unrecoverable ("requires operator intervention"), but it's caught here by the blanket Err(e) arm and logged as a warn! — then the loop retries on the next 30-second tick, hitting the same divergence repeatedly.

Consider either:

Propagating FinalizedDivergence as a fatal error (return it so start / the caller can shut down cleanly), or

At minimum, logging it at error! level so it doesn't blend in with transient RPC failures.

As-is, an operator seeing warn "Failed to update safe/finalized labels" every 30 seconds may not realize the node is in an unrecoverable state.

Sertug17 · 2026-06-17T12:01:11Z

Solid design overall @niran. The reset_to_ancestor → replay flow is clean and the stateful EL mock properly validates the Valid-vs-Syncing distinction. Two observations the bots didn't flag:

1. `local_finalized` can be stale when passed to `recovery::recover`

In update_safe_and_finalized (runtime.rs:~190), local_finalized is fetched before evaluate_label is called for the safe head. If evaluate_label returns Diverged, the same local_finalized which may now be stale is passed to recovery::recover → find_common_ancestor:

// runtime.rs:~190 — fetched early
let local_finalized = local.block_info(BlockNumberOrTag::Finalized).await?;

// runtime.rs:~207 — evaluate_label might trigger recovery later
LabelOutcome::Diverged { number, .. } => {
    generation.cancel();
    let finalized = local_finalized.as_ref().ok_or(...)?;  // <-- stale?
    let ancestor = recovery::recover(..., finalized, number).await?;

If the local's finalized head advances between the local.block_info(Finalized) call and the recovery::recover call, the finalized parameter points to a block below the actual finalized head. The binary search floor in find_common_ancestor is then too low, producing a lower common ancestor than necessary. Impact is benign the ancestor is still valid (on both chains) and reset_to_ancestor independently checks the live sync_state.finalized_head() for the finality guard. The cost is a slightly longer replay window, not a correctness issue.

Not requesting a change this is conservative by design, and re-fetching inside the Diverged arm would trade one TOCTOU for another. Worth documenting for future readers though.

2. `SynchronizeTask`'s `EngineSyncStateUpdate` with `finalized_head: None`

In reset_to_ancestor (engine.rs:~124), the update uses ..Default::default():

EngineSyncStateUpdate {
    unsafe_head: Some(ancestor),
    local_safe_head: Some(ancestor),
    safe_head: Some(ancestor),
    ..Default::default()  // finalized_head = None
}

When SynchronizeTask::execute sends the fork_choice_updated_v3 to the EL with finalized_block_hash absent/zero, different EL implementations may behave differently. The Engine API spec says forkchoiceState requires headBlockHash; safeBlockHash and finalizedBlockHash are "optional" in the JSON-RPC spec but ForkchoiceState in alloy/alloy-rpc-types-engine serializes them as B256 (not Option<B256>), so they always transmit.

If the non-zero finalized hash from before the reset is sent (which is what actually happens since SynchronizeTask uses the state's current finalized_head from sync_state, not the update), the EL sees a finalized block above the new unsafe head which could trigger a SYNCING or INVALID response depending on how strict the EL's forkchoice validation is.

From the engine tests, the stateful EL mock accepts any forkchoice update to a known block as Valid regardless of the safe/finalized fields, so this doesn't surface in tests. But against a real EL (geth/reth), a finalized_block_hash above the current head would be rejected with InvalidPayloadAttributes or Syncing. Might be worth verifying on the local devnet that the replayed forkchoice updates during the recovery window carry the right finalized hash (the ancestor or below).

Neither is blocking the recovery logic is sound. Mostly surfacing these so the next person reading the code doesn't have to trace them separately.

… ancestor When follow mode detects the local chain has diverged from the source at the safe label, reset the engine's forkchoice down to the highest block both nodes agree on (the common ancestor) and replay source payloads forward, instead of wedging forever. The reset targets a block the EL already has, so the forkchoice update is Valid and the head reorgs immediately (no EL sync) — unlike pointing the EL at the canonical tip it lacks, which would answer Syncing and never recover. - Finality is never rewound: a finalized-head divergence is fatal (FinalizedDivergence), and a reset below finalized is refused (ReorgBelowFinalized). - Adds an opt-in stateful test EL so recovery tests fail if the design regresses to forkchoice-to-tip. - Removes the now-dead SourceBlockHashMismatch (superseded by recovery). Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2026-06-23T22:00:15Z

+            LabelOutcome::Diverged { number, local: local_hash, remote } => {
+                // The source is authoritative for the L2 chain in follow mode. Rather than wedge
+                // forever, reset to the common ancestor and replay the source chain forward.
+                warn!(
+                    target: "follow",
+                    number,
+                    local = %local_hash,
+                    source = %remote,
+                    "Local chain diverged from source safe head; recovering to common ancestor",
+                );
+                generation.cancel();
+                let finalized = local_finalized
+                    .as_ref()
+                    .ok_or(FollowError::LocalBlockUnavailable(BlockNumberOrTag::Finalized))?;
+                let ancestor =
+                    recovery::recover(local.as_ref(), source.as_ref(), &engine, finalized, number)
+                        .await?;
+                return Ok(SafetyOutcome::Reorged { ancestor });
+            }


generation.cancel() is called unconditionally at line 201 before recover(). If recovery then fails (e.g. ResetToAncestorUnconfirmed from the EL returning Syncing), the error is caught here and logged as warn!, but the insert/fetch loops are already dying from the cancelled generation token.

The insert loop will exit — either with Ok(()) (cancellation check at line 94) or Err(BlocksToInsertChannelClosed) (prefetcher drops sender). In both cases, select! at line 319 picks up the result before the safety loop's next tick:

Ok(()) → mapped to SafetyOutcome::Updated → runtime exits cleanly at line 337, silently masking the failed recovery

Err(BlocksToInsertChannelClosed) → propagated via ? with a misleading error that obscures the real failure

Consider deferring generation.cancel() until after recover() succeeds, so a failed recovery attempt leaves the current generation alive and can be retried on the next safety tick:

LabelOutcome::Diverged { number, local: local_hash, remote } => { warn!(...); let finalized = ...; let ancestor = recovery::recover(...).await?; generation.cancel(); // cancel only after successful reset return Ok(SafetyOutcome::Reorged { ancestor }); }

This way a transient Syncing response doesn't tear down the node, and the next safety tick retries the recovery while the existing insert loop continues operating on the current (divergent but functional) chain.

github-actions · 2026-06-23T22:00:47Z

Review Summary

The recovery design is sound: binary-search for the common ancestor, reset the EL to a block it already has (Valid forkchoice), then replay source payloads forward. The reset_to_ancestor confirmation check (engine.rs:126) correctly detects Syncing responses by verifying the sync state actually advanced. Error variants are well-modeled and tests exercise the key invariants (ancestor reset + replay, unknown-tip rejection, below-finalized refusal).

Finding

generation.cancel() before recover() tears down insert/fetch on recovery failure (runtime.rs:201) — If recover() fails after the generation token is already cancelled, the insert/fetch loops die and select! picks up their exit as either a clean shutdown or a misleading BlocksToInsertChannelClosed error. The actual recovery failure is only logged as warn!, not propagated. Moving generation.cancel() after a successful recover() would let failed attempts be retried on the next safety tick without tearing down the running insert loop.

Notes on prior bot comments

Several earlier inline comments on this PR are based on incorrect premises:

The comment on recovery.rs claiming reset_to_ancestor silently succeeds on Syncing is wrong — the confirmation check at engine.rs:126 catches this and returns ResetToAncestorUnconfirmed.
The comment on runtime.rs suggesting generation should be cancelled before recover — it already is (line 201).
The comment on recovery.rs about re-fetching finalized — finalized is passed as a parameter from the caller; the only fetch is from the source to verify agreement.

github-actions Bot reviewed Jun 13, 2026

View reviewed changes

Comment thread crates/consensus/service/src/follow/recovery.rs Outdated

niran force-pushed the fix/follow-safe-coherent-read branch 2 times, most recently from 14ed6d9 to 2c05d69 Compare June 16, 2026 17:13

niran mentioned this pull request Jun 16, 2026

fix(consensus): coherent follow-mode safe/finalized read #3527

Merged

Base automatically changed from fix/follow-safe-coherent-read to main June 16, 2026 20:14

niran force-pushed the fix/follow-recovery branch from f08d2ae to 7714405 Compare June 16, 2026 21:17

github-actions Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread crates/consensus/service/src/follow/runtime.rs Outdated

github-actions Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread crates/consensus/service/src/follow/recovery.rs Outdated

github-actions Bot reviewed Jun 16, 2026

View reviewed changes

niran force-pushed the fix/follow-recovery branch 2 times, most recently from 88b76d5 to dea180e Compare June 16, 2026 21:41

niran and others added 2 commits June 23, 2026 16:45

fix(devnet): use base compose for ingress commands

67a6987

niran force-pushed the fix/follow-recovery branch from dea180e to 67a6987 Compare June 23, 2026 21:48

niran marked this pull request as ready for review June 23, 2026 21:53

niran requested a review from refcell June 23, 2026 21:53

niran enabled auto-merge June 23, 2026 21:58

github-actions Bot reviewed Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(follow): recover from safe-head divergence by resetting to common ancestor#3533

fix(follow): recover from safe-head divergence by resetting to common ancestor#3533
niran wants to merge 2 commits into
mainfrom
fix/follow-recovery

niran commented Jun 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

Sertug17 commented Jun 14, 2026

Uh oh!

cb-heimdall commented Jun 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

github-actions Bot Jun 16, 2026

Uh oh!

Sertug17 commented Jun 17, 2026 •

edited

Loading

Uh oh!

github-actions Bot Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

niran commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Local devnet validation

Follow-up

Rebase validation

Uh oh!

Uh oh!

Sertug17 commented Jun 14, 2026

Uh oh!

cb-heimdall commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🟡 Heimdall Review Status

Uh oh!

Uh oh!

Uh oh!

github-actions Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Sertug17 commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. local_finalized can be stale when passed to recovery::recover

2. SynchronizeTask's EngineSyncStateUpdate with finalized_head: None

Uh oh!

github-actions Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 23, 2026

Review Summary

Finding

Notes on prior bot comments

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

niran commented Jun 13, 2026 •

edited

Loading

cb-heimdall commented Jun 16, 2026 •

edited

Loading

Sertug17 commented Jun 17, 2026 •

edited

Loading

1. `local_finalized` can be stale when passed to `recovery::recover`

2. `SynchronizeTask`'s `EngineSyncStateUpdate` with `finalized_head: None`