Skip to content

[adapters] Fix samply on MacOS.#6321

Merged
ryzhyk merged 2 commits into
mainfrom
samply-deadlock-macos
May 28, 2026
Merged

[adapters] Fix samply on MacOS.#6321
ryzhyk merged 2 commits into
mainfrom
samply-deadlock-macos

Conversation

@ryzhyk

@ryzhyk ryzhyk commented May 26, 2026

Copy link
Copy Markdown
Contributor

Trying to run samply via the /samply_profile endpoint caused the process to deadlock on MacOS. Apparently, something on MacOS doesn't allow a process to own its own profiler. As a workaround, we change the way we launch samply on MacOS: instead of starting samply directly, we run a small shell script, which runs samply and returns its PID. We use this PID to monitor the profiler and send signals to it.

The new code only affects MacOS; on Linux we still use the simpler and well tested approach of launching samply directly.

@ryzhyk ryzhyk requested a review from blp May 26, 2026 15:43
@ryzhyk ryzhyk added the connectors Issues related to the adapters/connectors crate label May 26, 2026
let log_file = tempfile::Builder::new()
.prefix("samply_log_")
.suffix(".log")
.rand_bytes(10)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are these bytes?

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

macOS detached-subshell workaround is a clean way to dodge task_for_pid self-deadlock, and the Linux/Unix path is preserved with a tight cfg. sh_quote test covers the embedded-single-quote case. Polling at 100ms with a 30s grace + SIGKILL fallback is sensible. LGTM.

Comment thread python/feldera/pipeline.py
if profile.is_empty() {
bail!("samply profile is empty; samply log: `{}`", log.trim());
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.contains("Error:") is a fragile failure signal — it relies on samply's specific log formatting and false-positives if the profiled program prints Error: to stdout (it doesn't here, but only because we redirect samply's own pipes). Since you already check profile.is_empty() above, this branch is mostly a belt-and-suspenders extra. Either drop it or capture samply's exit status (the launcher subshell could write $! and later wait $!; echo $? to a second tempfile).

while process_exists(pid).await {
tokio::time::sleep(Duration::from_millis(100)).await;
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: 100ms × N polls for process liveness is fine in practice, but on macOS kill(pid, 0) returns Ok for zombies too. If the subshell exits before samply does, samply gets reparented to launchd and you're polling launchd-owned PID which is fine — just worth a one-line comment that this is intentionally PID-not-pgid based.

Comment on lines +22 to +25
tokio::task::spawn_blocking(move || kill(Pid::from_raw(pid as i32), None).is_ok())
.await
.unwrap_or(false)
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spawn_blocking doesn't hurt but I'm surprised that it's useful. The kill system call shouldn't block AFAIK.

Comment on lines +36 to +44
let deadline = tokio::time::Instant::now() + timeout;
while tokio::time::Instant::now() < deadline {
if !process_exists(pid).await {
return true;
}
tokio::time::sleep(Duration::from_millis(100)).await;
}
false
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be reduced to something like tokio::time::timeout(timeout, wait_for_process_exit_unbounded(pid)).await.map_or(false, |_| true)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Current approach is OK too.)

.context("failed to convert samply log path to str")?;

let sh_cmd = format!(
"( samply record -p {target_pid} -o {} --save-only --presymbolicate > {} 2>&1 & echo $! )",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The additional subshell (the "()" around the shell commands) looks weird, but I assume it's necessary?

ryzhyk added 2 commits May 26, 2026 11:50
Trying to run samply via the /samply_profile endpoint caused the pipeline
process to deadlock on MacOS. Apparently, something on MacOS doesn't allow a
process to own its own profiler.  As a workaround, we change the way we launch
samply on MacOS: instead of starting samply directly, we run a small shell
script, which runs samply and returns its PID. We use this PID to monitor the
profiler and send signals to it.

The new code only affects MacOS; on Linux we still use the simpler and well
tested approach of launching samply directly.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
When making many small transactions, the default polling interval of 1s can be too long.
This commit makes the interval configurable, while reducing the default to 0.5s.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
@ryzhyk ryzhyk force-pushed the samply-deadlock-macos branch from 5a459ae to a18859a Compare May 26, 2026 18:51

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-APPROVE on a18859a3 (poll_interval follow-up).

Clean addition: poll_interval_s parameter threaded through Pipeline.commit_transaction and FelderaClient.commit_transaction with default 0.5s (was hard-coded 1.0s sleep). Default tightening is appropriate for the small-transaction workload. Docstrings updated in both layers; the debug log substitutes the actual interval.

Unrelated to the samply MacOS fix — PR title now slightly underdescribes the scope, but not blocking.

No new tests for the parameter itself; it's a passthrough to time.sleep.

@ryzhyk ryzhyk added this pull request to the merge queue May 28, 2026
Merged via the queue into main with commit 358e702 May 28, 2026
1 check passed
@ryzhyk ryzhyk deleted the samply-deadlock-macos branch May 28, 2026 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

connectors Issues related to the adapters/connectors crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants