Skip to content

Samply#6344

Merged
blp merged 5 commits into
mainfrom
samply
Jun 3, 2026
Merged

Samply#6344
blp merged 5 commits into
mainfrom
samply

Conversation

@blp

@blp blp commented May 29, 2026

Copy link
Copy Markdown
Member

This adds a bunch of useful features in CPU profiles. It fixes a bug that made multihost profiling panic. It also adds node ids to CPU profiling data related to spines, so that it is clear what operators cause exchanges, merges, and other work in spines.

Describe Manual Test Plan

I ran a pipeline and captured CPU profiles a few times and looked at the output.

@blp blp requested a review from ryzhyk May 29, 2026 00:25
@blp blp self-assigned this May 29, 2026
@blp blp added bug Something isn't working DBSP core Related to the core DBSP library rust Pull requests that update Rust code profiler Issues related to the profiler and its APIs metrics Metrics about feldera pipelines multihost Related to multihost or distributed pipelines labels May 29, 2026

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two blockers: a variable-name swap that produces wrong profiling output, and behavior changes without tests. See inline.

Comment thread crates/dbsp/src/trace/spine_async.rs Outdated
Comment thread crates/dbsp/src/circuit/operator_traits.rs
Comment thread crates/dbsp/src/trace/spine_async.rs Outdated
Comment thread crates/dbsp/src/circuit/operator_traits.rs
Comment thread crates/samply/src/lib.rs
Comment on lines +1210 to +1215
// Record when marker space was exhausted. The combination
// of `load` and `store` is not an atomic transaction, but
// it's good enough.
if MARKERS_EXHAUSTED.load().is_none() {
MARKERS_EXHAUSTED.store(Some(Timestamp::now()));
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The load-then-store race is acknowledged and is fine for this use case. However, this marker-overflow path has no test. A unit test that fills the block capacity and verifies MARKERS_EXHAUSTED gets set (and that the resulting Capture::finish() annotations contain the overflow marker) would be valuable.

@mihaibudiu mihaibudiu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Leonid is indeed the right person to review this.

Comment thread crates/dbsp/src/operator/dynamic/accumulator.rs
/// An operator doesn't initially know its global node ID, but the global node
/// ID is useful for debugging and profiling. This type allows the name to
/// initially omit the ID but adds it when it becomes available.
pub struct OperatorName(ArcSwap<String>);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not keep the name and the id separate, and just concatenate them on Display?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would work too.

There are a few reasons I went with the current approach:

  • I wanted to minimize the space and time overheads (this is just a single pointer and a single atomic increment/decrement) because the names end up getting copied around a bit and most of the time they won't get used.
  • From memory (I didn't go back and recheck), the names also end up getting used in places where there's already some other string type and it's not necessarily convenient to do a conversion at that point.

Really I wish that the global node id was available at construction instead of sometime later, it would simplify a lot here.

@ryzhyk ryzhyk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't read every line. Adding operator names to batches is a neat improvement.

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-APPROVE after the blocker fix in 8df3120 / 2b83ea4.

  • spine_async.rs backpressure_wait: the swapped (initial_batches, name) destructuring is gone. New shape is much cleaner — single let (name, initial_batches); initialised under the lock with an early-return for the already-drained case, then a tight wait loop, then unconditional Span recording. tooltip format string now reads correctly.
  • exchange.rs / sharded_accumulator.rs: receiver_global_node_id.get().map_or("Exchange"/"ShardedAccumulator", ...) replaces the panicking .unwrap() in fn name() — addresses the multihost CPU profiling crash directly.
  • samply: marker-overflow recording lands as in-profile markers + const fn new cleanup; AtomicOptionTimestamp tightened.
  • sharded_accumulator tests: test_config helper + with_streaming_exchange(true) threaded through single-host and multihost test_circuit paths.

Non-blocking: still no unit test directly asserting the new fn name() fallback string for the unset-OnceCell branch, and the samply marker-overflow in-profile recording is exercised only at integration time. Both are reasonable to defer.

@blp blp enabled auto-merge June 2, 2026 17:02
@blp blp added this pull request to the merge queue Jun 2, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 2, 2026
@blp blp added this pull request to the merge queue Jun 2, 2026

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of 636fee3: rebased onto a new base; the only substantive change vs the previously approved aef0805 is one line in js-packages/web-console/vite.config.ts — adding 'fancy-ansi > escape-html' to testOptimizeDepsInclude. The remaining 111 files are byte-identical to the prior tip. LGTM.

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 2, 2026
blp added 4 commits June 2, 2026 15:34
Sometimes `receiver_global_node_id` can be uninitialized on multihost
pipelines.  The reason isn't entirely clear to me.

Signed-off-by: Ben Pfaff <blp@feldera.com>
DBSP doesn't use `arcstr` so there is no need to use the `size-of` feature
for it.

feldera-sqllib does use it, so enable it there.

Signed-off-by: Ben Pfaff <blp@feldera.com>
Signed-off-by: Ben Pfaff <blp@feldera.com>
… out.

Before, we just logged a message.

Signed-off-by: Ben Pfaff <blp@feldera.com>
@blp blp enabled auto-merge June 2, 2026 22:35
Signed-off-by: feldera-bot <feldera-bot@feldera.com>
@blp blp added this pull request to the merge queue Jun 2, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Jun 3, 2026
@blp blp added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit 07ce4ee Jun 3, 2026
2 checks passed
@blp blp deleted the samply branch June 3, 2026 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working DBSP core Related to the core DBSP library metrics Metrics about feldera pipelines multihost Related to multihost or distributed pipelines profiler Issues related to the profiler and its APIs rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants