[adapters] Fix samply on MacOS.#6321
Conversation
| let log_file = tempfile::Builder::new() | ||
| .prefix("samply_log_") | ||
| .suffix(".log") | ||
| .rand_bytes(10) |
mythical-fred
left a comment
There was a problem hiding this comment.
macOS detached-subshell workaround is a clean way to dodge task_for_pid self-deadlock, and the Linux/Unix path is preserved with a tight cfg. sh_quote test covers the embedded-single-quote case. Polling at 100ms with a 30s grace + SIGKILL fallback is sensible. LGTM.
| if profile.is_empty() { | ||
| bail!("samply profile is empty; samply log: `{}`", log.trim()); | ||
| } | ||
|
|
There was a problem hiding this comment.
log.contains("Error:") is a fragile failure signal — it relies on samply's specific log formatting and false-positives if the profiled program prints Error: to stdout (it doesn't here, but only because we redirect samply's own pipes). Since you already check profile.is_empty() above, this branch is mostly a belt-and-suspenders extra. Either drop it or capture samply's exit status (the launcher subshell could write $! and later wait $!; echo $? to a second tempfile).
| while process_exists(pid).await { | ||
| tokio::time::sleep(Duration::from_millis(100)).await; | ||
| } | ||
| } |
There was a problem hiding this comment.
Nit: 100ms × N polls for process liveness is fine in practice, but on macOS kill(pid, 0) returns Ok for zombies too. If the subshell exits before samply does, samply gets reparented to launchd and you're polling launchd-owned PID which is fine — just worth a one-line comment that this is intentionally PID-not-pgid based.
| tokio::task::spawn_blocking(move || kill(Pid::from_raw(pid as i32), None).is_ok()) | ||
| .await | ||
| .unwrap_or(false) | ||
| } |
There was a problem hiding this comment.
spawn_blocking doesn't hurt but I'm surprised that it's useful. The kill system call shouldn't block AFAIK.
| let deadline = tokio::time::Instant::now() + timeout; | ||
| while tokio::time::Instant::now() < deadline { | ||
| if !process_exists(pid).await { | ||
| return true; | ||
| } | ||
| tokio::time::sleep(Duration::from_millis(100)).await; | ||
| } | ||
| false | ||
| } |
There was a problem hiding this comment.
I think this can be reduced to something like tokio::time::timeout(timeout, wait_for_process_exit_unbounded(pid)).await.map_or(false, |_| true)
| .context("failed to convert samply log path to str")?; | ||
|
|
||
| let sh_cmd = format!( | ||
| "( samply record -p {target_pid} -o {} --save-only --presymbolicate > {} 2>&1 & echo $! )", |
There was a problem hiding this comment.
The additional subshell (the "()" around the shell commands) looks weird, but I assume it's necessary?
Trying to run samply via the /samply_profile endpoint caused the pipeline process to deadlock on MacOS. Apparently, something on MacOS doesn't allow a process to own its own profiler. As a workaround, we change the way we launch samply on MacOS: instead of starting samply directly, we run a small shell script, which runs samply and returns its PID. We use this PID to monitor the profiler and send signals to it. The new code only affects MacOS; on Linux we still use the simpler and well tested approach of launching samply directly. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
When making many small transactions, the default polling interval of 1s can be too long. This commit makes the interval configurable, while reducing the default to 0.5s. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
5a459ae to
a18859a
Compare
mythical-fred
left a comment
There was a problem hiding this comment.
Re-APPROVE on a18859a3 (poll_interval follow-up).
Clean addition: poll_interval_s parameter threaded through Pipeline.commit_transaction and FelderaClient.commit_transaction with default 0.5s (was hard-coded 1.0s sleep). Default tightening is appropriate for the small-transaction workload. Docstrings updated in both layers; the debug log substitutes the actual interval.
Unrelated to the samply MacOS fix — PR title now slightly underdescribes the scope, but not blocking.
No new tests for the parameter itself; it's a passthrough to time.sleep.
Trying to run samply via the /samply_profile endpoint caused the process to deadlock on MacOS. Apparently, something on MacOS doesn't allow a process to own its own profiler. As a workaround, we change the way we launch samply on MacOS: instead of starting samply directly, we run a small shell script, which runs samply and returns its PID. We use this PID to monitor the profiler and send signals to it.
The new code only affects MacOS; on Linux we still use the simpler and well tested approach of launching samply directly.