Skip to content

fix(follow): recover from safe-head divergence by resetting to common ancestor#3533

Open
niran wants to merge 2 commits into
mainfrom
fix/follow-recovery
Open

fix(follow): recover from safe-head divergence by resetting to common ancestor#3533
niran wants to merge 2 commits into
mainfrom
fix/follow-recovery

Conversation

@niran

@niran niran commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

#3527 stops follow mode from promoting a non-canonical block to safe, but a node that has already diverged still wedges. This PR adds recovery: on a safe-head divergence, find the highest block local and source agree on (the common ancestor), reset the engine's forkchoice down to it (Valid — the EL has that block, no P2P sync), then replay source payloads forward so the EL rebuilds onto the canonical chain. Mirrors op-node's FindL2Heads + reset model.

  • Finality is never rewound: recovery refuses resets below finalized (ReorgBelowFinalized) and preserves the existing finalized-label hash-mismatch behavior.
  • Trust model inherited from fix(consensus): coherent follow-mode safe/finalized read #3527 (source authoritative for L2, finalized-guarded). A transient safe-head flap across source replicas is tolerated: recovery fires only on a genuine hash divergence at a given height, not when a replica briefly reports a stale-but-canonical head (that just yields Skip/Promote).
  • Adds a stateful test EL so recovery tests fail if the design regresses to forkchoice-to-tip.

Local devnet validation

Validated on local base/base devnet only, from fix/follow-recovery before the rebase at f08d2aeea7ccb4946591b62fc98c757c1c3445e3. The branch has since been rebased onto main after #3527 merged; current head is 67a6987b8c42b45f1fe287e011fe745d88fdb76d.

  • just devnet up completed.
  • just devnet checks passed. The expected local L1 blob smoke warning still appeared: unexpected eip-4844 sidecar after osaka.
  • Added a scratch compose overlay under /Users/niran/workspace/tmp/2026-06-16-follow-mode-devnet/ with a dedicated follow EL/CL pair (base-proofs-follow-el, base-proofs-follow-cl) so the test exercised base-consensus follow, not the normal validator/sequencer path.
  • Confirmed the follow CL started and inserted source payloads through live head.
  • Simulated an already-diverged follow node:
    • stopped only base-proofs-follow-cl;
    • sent a tx only to base-proofs-follow-el;
    • temporarily ran a standalone sequencer CL against the follow EL Engine API;
    • created a local/source mismatch at block 2329 (0x919):
      • source canonical hash: 0xab2f9889e60f1c149e0a3303b32796f4aab8b9bdb8eed0c6216de492f267b733
      • local divergent hash: 0xe9c95ac24501c2327134772e1e23952f7dc9b26061b493c87f756c7e196466e4
    • used a scratch source-RPC shim to make source safe coherently report canonical block 2329.
  • Observed recovery in follow CL logs:
    • Local chain diverged from source safe head; recovering to common ancestor number=2329 ...
    • Reset to common ancestor; restarting follow fetch/insert ancestor=2328
    • Inserting source payload block=2329
    • Inserted new payload hash=0xab2f... number=2329 payload_safety="unsafe"
  • Verified post-recovery convergence:
    • blocks 2328, 2329, and 2330 matched between builder and follow EL;
    • latest matched after catch-up (builder=3004 follow=3004 during the recovery run; after restoring normal source RPC, builder=3131 follow=3131);
    • both follow services were healthy.
  • Scanned follow CL/EL logs and found no SourceBlockHashMismatch, ReorgBelowFinalized, invalid forkchoice, panic, Failed, or invalid payload errors.

Did not retry --proofs gated validation in this pass. Prior local devnet attempts were blocked by proofs-history initialization / debug_proofsSyncStatus; this validation targeted the follow runtime and Engine API insertion/recovery path.

Follow-up

  • Finalized-head divergence handling is intentionally not changing in this PR: finalized-label hash mismatches still surface through the existing SourceBlockHashMismatch path and the safety polling loop logs the failed safe/finalized update before retrying. A follow-up should decide whether finalized divergence should instead be fatal at the loop/node level, or at least logged with stronger severity, since it requires operator intervention.

Rebase validation

Rebased onto main after #3527 merged as a11dd1a6a0dd8b2026f6070e4d993dd3f45842ef.

  • cargo +nightly fmt --check
  • just --justfile etc/docker/Justfile --list
  • cargo test -p base-consensus-engine test_utils
  • cargo test -p base-consensus-node follow::

type=routine
risk=medium
impact=sev3

Comment thread crates/consensus/service/src/follow/recovery.rs Outdated
@Sertug17

Copy link
Copy Markdown

@niran Good catch from the review bot. The SynchronizeTask::execute returning Ok(()) on Syncing means reset_to_ancestor needs explicit response checking. A simple fix would be:

rust
let result = task.execute(&mut state).await.map_err(FollowError::engine_task)?;
// Verify the forkchoice update was actually applied
if !state.sync_state.safe_head().map_or(false, |h| h.block_info.hash == ancestor.block_info.hash) {
return Err(FollowError::RecoveryFailed(
"reset_to_ancestor: EL returned Syncing, head not updated".into()
));
}

This way recovery fails loudly instead of silently stalling the node.

@niran niran force-pushed the fix/follow-safe-coherent-read branch 2 times, most recently from 14ed6d9 to 2c05d69 Compare June 16, 2026 17:13
Base automatically changed from fix/follow-safe-coherent-read to main June 16, 2026 20:14
@cb-heimdall

cb-heimdall commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

🟡 Heimdall Review Status

Requirement Status More Info
Reviews 🟡 0/1
Denominator calculation
Show calculation
1 if user is bot 0
1 if user is external 0
2 if repo is sensitive 0
From .codeflow.yml 1
Additional review requirements
Show calculation
Max 0
0
From CODEOWNERS 0
Global minimum 0
Max 1
1
1 if commit is unverified 0
Sum 1

@niran niran force-pushed the fix/follow-recovery branch from f08d2ae to 7714405 Compare June 16, 2026 21:17
Comment thread crates/consensus/service/src/follow/runtime.rs Outdated
Comment thread crates/consensus/service/src/follow/recovery.rs Outdated
Comment on lines +152 to +154
Err(e) => {
warn!(target: "follow", error = %e, "Failed to update safe/finalized labels");
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FinalizedDivergence is documented as unrecoverable ("requires operator intervention"), but it's caught here by the blanket Err(e) arm and logged as a warn! — then the loop retries on the next 30-second tick, hitting the same divergence repeatedly.

Consider either:

  1. Propagating FinalizedDivergence as a fatal error (return it so start / the caller can shut down cleanly), or
  2. At minimum, logging it at error! level so it doesn't blend in with transient RPC failures.

As-is, an operator seeing warn "Failed to update safe/finalized labels" every 30 seconds may not realize the node is in an unrecoverable state.

@niran niran force-pushed the fix/follow-recovery branch 2 times, most recently from 88b76d5 to dea180e Compare June 16, 2026 21:41
@Sertug17

Sertug17 commented Jun 17, 2026

Copy link
Copy Markdown

Solid design overall @niran. The reset_to_ancestor → replay flow is clean and the stateful EL mock properly validates the Valid-vs-Syncing distinction. Two observations the bots didn't flag:

1. local_finalized can be stale when passed to recovery::recover

In update_safe_and_finalized (runtime.rs:~190), local_finalized is fetched before evaluate_label is called for the safe head. If evaluate_label returns Diverged, the same local_finalized which may now be stale is passed to recovery::recoverfind_common_ancestor:

// runtime.rs:~190 — fetched early
let local_finalized = local.block_info(BlockNumberOrTag::Finalized).await?;

// runtime.rs:~207 — evaluate_label might trigger recovery later
LabelOutcome::Diverged { number, .. } => {
    generation.cancel();
    let finalized = local_finalized.as_ref().ok_or(...)?;  // <-- stale?
    let ancestor = recovery::recover(..., finalized, number).await?;

If the local's finalized head advances between the local.block_info(Finalized) call and the recovery::recover call, the finalized parameter points to a block below the actual finalized head. The binary search floor in find_common_ancestor is then too low, producing a lower common ancestor than necessary. Impact is benign the ancestor is still valid (on both chains) and reset_to_ancestor independently checks the live sync_state.finalized_head() for the finality guard. The cost is a slightly longer replay window, not a correctness issue.

Not requesting a change this is conservative by design, and re-fetching inside the Diverged arm would trade one TOCTOU for another. Worth documenting for future readers though.

2. SynchronizeTask's EngineSyncStateUpdate with finalized_head: None

In reset_to_ancestor (engine.rs:~124), the update uses ..Default::default():

EngineSyncStateUpdate {
    unsafe_head: Some(ancestor),
    local_safe_head: Some(ancestor),
    safe_head: Some(ancestor),
    ..Default::default()  // finalized_head = None
}

When SynchronizeTask::execute sends the fork_choice_updated_v3 to the EL with finalized_block_hash absent/zero, different EL implementations may behave differently. The Engine API spec says forkchoiceState requires headBlockHash; safeBlockHash and finalizedBlockHash are "optional" in the JSON-RPC spec but ForkchoiceState in alloy/alloy-rpc-types-engine serializes them as B256 (not Option<B256>), so they always transmit.

If the non-zero finalized hash from before the reset is sent (which is what actually happens since SynchronizeTask uses the state's current finalized_head from sync_state, not the update), the EL sees a finalized block above the new unsafe head which could trigger a SYNCING or INVALID response depending on how strict the EL's forkchoice validation is.

From the engine tests, the stateful EL mock accepts any forkchoice update to a known block as Valid regardless of the safe/finalized fields, so this doesn't surface in tests. But against a real EL (geth/reth), a finalized_block_hash above the current head would be rejected with InvalidPayloadAttributes or Syncing. Might be worth verifying on the local devnet that the replayed forkchoice updates during the recovery window carry the right finalized hash (the ancestor or below).


Neither is blocking the recovery logic is sound. Mostly surfacing these so the next person reading the code doesn't have to trace them separately.

niran and others added 2 commits June 23, 2026 16:45
… ancestor

When follow mode detects the local chain has diverged from the source at
the safe label, reset the engine's forkchoice down to the highest block
both nodes agree on (the common ancestor) and replay source payloads
forward, instead of wedging forever.

The reset targets a block the EL already has, so the forkchoice update is
Valid and the head reorgs immediately (no EL sync) — unlike pointing the
EL at the canonical tip it lacks, which would answer Syncing and never
recover.

- Finality is never rewound: a finalized-head divergence is fatal
  (FinalizedDivergence), and a reset below finalized is refused
  (ReorgBelowFinalized).
- Adds an opt-in stateful test EL so recovery tests fail if the design
  regresses to forkchoice-to-tip.
- Removes the now-dead SourceBlockHashMismatch (superseded by recovery).

Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
@niran niran force-pushed the fix/follow-recovery branch from dea180e to 67a6987 Compare June 23, 2026 21:48
@niran niran marked this pull request as ready for review June 23, 2026 21:53
@niran niran requested a review from refcell June 23, 2026 21:53
@niran niran enabled auto-merge June 23, 2026 21:58
Comment on lines +191 to +209
LabelOutcome::Diverged { number, local: local_hash, remote } => {
// The source is authoritative for the L2 chain in follow mode. Rather than wedge
// forever, reset to the common ancestor and replay the source chain forward.
warn!(
target: "follow",
number,
local = %local_hash,
source = %remote,
"Local chain diverged from source safe head; recovering to common ancestor",
);
generation.cancel();
let finalized = local_finalized
.as_ref()
.ok_or(FollowError::LocalBlockUnavailable(BlockNumberOrTag::Finalized))?;
let ancestor =
recovery::recover(local.as_ref(), source.as_ref(), &engine, finalized, number)
.await?;
return Ok(SafetyOutcome::Reorged { ancestor });
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generation.cancel() is called unconditionally at line 201 before recover(). If recovery then fails (e.g. ResetToAncestorUnconfirmed from the EL returning Syncing), the error is caught here and logged as warn!, but the insert/fetch loops are already dying from the cancelled generation token.

The insert loop will exit — either with Ok(()) (cancellation check at line 94) or Err(BlocksToInsertChannelClosed) (prefetcher drops sender). In both cases, select! at line 319 picks up the result before the safety loop's next tick:

  • Ok(()) → mapped to SafetyOutcome::Updated → runtime exits cleanly at line 337, silently masking the failed recovery
  • Err(BlocksToInsertChannelClosed) → propagated via ? with a misleading error that obscures the real failure

Consider deferring generation.cancel() until after recover() succeeds, so a failed recovery attempt leaves the current generation alive and can be retried on the next safety tick:

LabelOutcome::Diverged { number, local: local_hash, remote } => {
    warn!(...);
    let finalized = ...;
    let ancestor = recovery::recover(...).await?;
    generation.cancel();  // cancel only after successful reset
    return Ok(SafetyOutcome::Reorged { ancestor });
}

This way a transient Syncing response doesn't tear down the node, and the next safety tick retries the recovery while the existing insert loop continues operating on the current (divergent but functional) chain.

@github-actions

Copy link
Copy Markdown
Contributor

Review Summary

The recovery design is sound: binary-search for the common ancestor, reset the EL to a block it already has (Valid forkchoice), then replay source payloads forward. The reset_to_ancestor confirmation check (engine.rs:126) correctly detects Syncing responses by verifying the sync state actually advanced. Error variants are well-modeled and tests exercise the key invariants (ancestor reset + replay, unknown-tip rejection, below-finalized refusal).

Finding

generation.cancel() before recover() tears down insert/fetch on recovery failure (runtime.rs:201) — If recover() fails after the generation token is already cancelled, the insert/fetch loops die and select! picks up their exit as either a clean shutdown or a misleading BlocksToInsertChannelClosed error. The actual recovery failure is only logged as warn!, not propagated. Moving generation.cancel() after a successful recover() would let failed attempts be retried on the next safety tick without tearing down the running insert loop.

Notes on prior bot comments

Several earlier inline comments on this PR are based on incorrect premises:

  • The comment on recovery.rs claiming reset_to_ancestor silently succeeds on Syncing is wrong — the confirmation check at engine.rs:126 catches this and returns ResetToAncestorUnconfirmed.
  • The comment on runtime.rs suggesting generation should be cancelled before recover — it already is (line 201).
  • The comment on recovery.rs about re-fetching finalizedfinalized is passed as a parameter from the caller; the only fetch is from the source to verify agreement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants