cgroups: add cgroup-name Go helper#22685
Conversation
There was a problem hiding this comment.
8 issues found across 16 files
Confidence score: 3/5
- In
src/collectors/cgroups.plugin/cgroup-name/main.go, Kubernetes annotation and Docker label values are emitted without escaping, so quotes/commas/newlines can corrupt label serialization or inject malformed labels into output; this is the highest user-facing risk because downstream parsing and attribution can break — escape/sanitize label values before emitting them. - In
src/collectors/cgroups.plugin/cgroup-discovery.c, treatingpoll()EINTRas timeout/error can trigger unnecessary helper kills and miss cgroup rename events, leading to intermittent metadata gaps under signal-heavy conditions — handleEINTRas a retry path before merging. - In
src/collectors/cgroups.plugin/sys_fs_cgroup.c, timeout unit conversion can overflow and effectively disable or distort cgroup-name timeout behavior, which risks hangs or premature timeouts depending on inputs — add bounds checks/safe conversion and a regression test around extreme timeout values. - In
src/collectors/cgroups.plugin/cgroup-name/main.go, cache files under/tmpare world-readable, exposing pod/container metadata to other local users; lower severity but concrete privacy leakage on multi-user hosts — restrict file mode (e.g., 0600) and consider a private runtime directory before release.
Architecture diagram
sequenceDiagram
participant Plugin as cgroups.plugin (C)
participant Discovery as discovery thread
participant Helper as cgroup-name (Go binary)
participant DockerAPI as Docker/Podman API
participant K8sAPI as Kubernetes API
participant CRI as Container runtime (docker/crio/containerd)
participant Env as Environment (timeout, log level)
Note over Plugin,Discovery: Configuration loading (once)
Plugin->>Plugin: read_cgroup_plugin_configuration()
Plugin->>Plugin: Set cgroup_name_timeout_ms from config (default 15000)
Plugin->>Env: Set NETDATA_CGROUP_NAME_TIMEOUT_MS env var
Plugin->>Plugin: Determine cgroups_rename_script path to /usr/libexec/netdata/plugins.d/cgroup-name
Note over Discovery,Helper: Per-cgroup rename cycle
Discovery->>Discovery: discovery_rename_cgroup(cg)
alt cgroup matches rename pattern
Discovery->>Helper: spawn cgroup-name <cgroup_path> <cgroup_id>
Note over Helper: Receives args, sets PATH, LC_ALL=C
Helper->>Env: Read NETDATA_CGROUP_NAME_TIMEOUT_MS
alt timeout > 0
Helper->>Helper: Set expiresAt = now + timeout
Helper->>Helper: Create context.WithDeadline(expiresAt)
else timeout == 0
Helper->>Helper: Create context.WithCancel() (unbounded)
end
alt cgroup id contains "kubepods"
Helper->>Helper: k8sGetName()
alt KUBERNETES_SERVICE_HOST set
Helper->>K8sAPI: curl to in-cluster API with TLS verify (or insecure if K8S_TLS_INSECURE)
K8sAPI-->>Helper: Pod JSON
Helper->>Helper: podsToContainerLines() → parse container metadata
else kubectl available
Helper->>Helper: exec kubectl get pods
else
Helper->>Helper: Try GCP metadata API for cluster name
end
alt pause container
Helper->>CRI: Read cgroup.procs, check /proc/<pid>/comm
CRI-->>Helper: "pause"
Helper->>Helper: exit disable (code 3)
else valid container
Helper->>Helper: Build k8s_ prefixed name with labels
end
else non-K8s path
alt matches Docker regex
Helper->>DockerAPI: docker inspect via socket or API
DockerAPI-->>Helper: Container config (env, labels)
Helper->>Helper: parseDockerLikeInspectOutput()
else matches ECS regex
Helper->>DockerAPI: Similar docker inspect
else matches Containerd regex
Helper->>DockerAPI: Similar docker inspect
else matches Podman regex
Helper->>DockerAPI: podman inspect via podman socket
else matches systemd-nspawn
Helper->>Helper: Extract machine name from cgroup path
else matches libvirt regex
Helper->>Helper: Parse libvirt/LXC/QEMU machine names
else matches Proxmox regex
Helper->>Helper: Read /etc/pve config files for VM/container names
end
end
alt name resolved successfully
Helper-->>Discovery: stdout: "<name> <labels>\n"
alt exit code 0
Discovery->>Discovery: Set pending_renames = 0
else exit code 3
Discovery->>Discovery: Set pending_renames = 0, processed = 1
end
else budget expired (no name resolved)
Helper-->>Discovery: stderr: time breakdown log
Helper-->>Discovery: exit code 2 (retry)
else helper stuck/timeout
Discovery->>Helper: poll() with wait_ms = cgroup_name_timeout_ms + 2000ms grace
alt poll timeout or error
Discovery->>Helper: spawn_popen_kill() - kill helper
Discovery->>Discovery: Log error, keep pending_renames for retry
else poll success
Discovery->>Helper: spawn_popen_timedwait() with 2s grace
alt still running after grace
Discovery->>Helper: spawn_popen_kill()
end
end
end
end
Note over Plugin,Helper: Installer & permissions (deployment time)
Participant Installer as Installer scripts
Installer->>Helper: Copy cgroup-name binary to /usr/libexec/netdata/plugins.d/
Installer->>Helper: chown root:netdata, chmod 0750
Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.
Re-trigger cubic
There was a problem hiding this comment.
1 issue found across 12 files (changes from recent commits).
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
|
@cubic-dev-ai please review again |
@ktsaou I have started the AI code review. It will take a few minutes to complete. |
4503c9a to
85ae59f
Compare
|
@cubic-dev-ai please review again |
@ktsaou I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
1 issue found and verified against the latest diff
Confidence score: 4/5
- In
src/collectors/cgroups.plugin/cgroup-name/capture-wrappers/cgroup-name.sh.logger,diff_detailonly compares stdout, so failures caused by exit-code or stderr differences can show an empty diff and hide the real mismatch; merging as-is risks harder triage when capture-wrapper checks fail. Expanddiff_detailto include exit status and stderr deltas before merging to preserve the wrapper’s diagnostic value.
Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.
Re-trigger cubic
|
@cubic-dev-ai please review again |
@ktsaou I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
4 issues found across 22 files
Confidence score: 3/5
- In
src/collectors/cgroups.plugin/cgroup-name/main.go, predictable temp-file creation can follow symlinks, which could let an attacker or local process clobber unintended files when this path runs — switch to nofollow/atomic creation in a private temp directory before merging. - In
src/collectors/cgroups.plugin/cgroup-name/capture-wrappers/cgroup-name.sh.logger, the wrapper currently dropsshell.errand can abort on capture-write failure, so resolver errors may be hidden and normal passthrough behavior can be lost — relayshell.errto stderr and make capture logging best-effort so resolver results still return. - In
src/collectors/cgroups.plugin/cgroup-name/main.go, recompiling a regex on every dispatch adds avoidable overhead in a hot path; this is low severity but can become noticeable under load — precompile it once at package scope (like the other matchers) to de-risk performance regression.
Architecture diagram
sequenceDiagram
participant CG as cgroups.plugin (C)
participant CN as cgroup-name (Go)
participant ENV as Environment
participant DOCK as Docker/Podman
participant K8S as Kubernetes
participant FS as Filesystem
Note over CG,FS: Cgroup Name Resolution Flow
CG->>ENV: read cgroup-name timeout config (default 15s)
ENV-->>CG: cgroup_name_timeout_ms
CG->>CG: set NETDATA_CGROUP_NAME_TIMEOUT_MS env var
CG->>CN: spawn cgroup-name <cgroup_path> <cgroup_id>
Note over CN: self-terminates within timeout budget
CN->>ENV: read NETDATA_CGROUP_NAME_TIMEOUT_MS
ENV-->>CN: timeout budget X ms
CN->>CN: setup deadline context
alt cgroup contains "kubepods"
CN->>K8S: Kubernetes resolution path
K8S-->>CN: container/pod name + labels
else Non-K8s cgroup
CN->>CN: match against docker/ecs/containerd/podman/libvirt/proxmox regex
alt Matched docker/containerd/ecs
CN->>DOCK: docker/podman inspect via API or socket
DOCK-->>CN: container metadata (name, labels)
else Matched libvirt/qemu
CN->>FS: read /etc/pve/qemu-server/<id>.conf
FS-->>CN: VM name
else Matched Proxmox LXC
CN->>FS: read /etc/pve/lxc/<id>.conf
FS-->>CN: container hostname
end
end
CN->>CN: build output: <name> [<labels>]
alt budget expired, name unresolved
CN->>CN: log call time breakdown
CN-->>CG: exit code 2 (retry)
else success
CN-->>CG: stdout: <name> <labels>, exit 0
end
CG->>CG: poll for output with timeout + 2s grace
alt helper timed out
CG->>CG: kill helper process, treat as failure
CG->>CG: schedule retry
else output received
CG->>CG: parse name and labels
alt exit code 0
CG->>CG: apply name/labels to cgroup chart
else exit code 2 (retry)
CG->>CG: keep pending_renames, retry later
else exit code 3 (disable)
CG->>CG: mark cgroup as processed
end
end
alt timeout set to 0 (legacy unbounded)
CG->>CG: wait indefinitely for helper
end
Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.
Re-trigger cubic
…exes - writePrivateFile now writes the Kubernetes cache files atomically through a private temp file plus rename, so a symlink planted under the predictable name in a world-writable TMPDIR is replaced instead of followed. - cache readers (firstLineFile, grepFile) and the cache-presence gate reject non-regular files via isPrivateRegularFile, closing the matching symlink-follow read vector; add tests for both directions. - precompile the libvirt-qemu, lxc.payload and proxmox name/hostname regexes at package scope and reuse reMachineIDSegment, removing per-dispatch regexp compilation from the hot path. - capture wrapper: make capture logging best-effort and relay the shell resolver's stderr so it stays a transparent stand-in for cgroup-name.sh.real.
581ca23 to
ca513c1
Compare
|
@cubic-dev-ai please review again — pushed ca513c1 addressing the 4 open findings (best-effort capture + stderr relay in the dev wrapper, atomic symlink-safe cache writes plus symlink-rejecting reads with tests, and package-scope precompiled dispatch regexes), and rebased onto current master. |
@ktsaou I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
No issues found across 22 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant Operator
participant Plugin as CGroup Plugin (cgroup-discovery.c)
participant Helper as cgroup-name (Go helper)
participant ExtAPI as External APIs (Docker/Podman/K8s/kubectl)
participant FS as File System (cgroup paths, cache files, Proxmox configs)
Note over Operator,Plugin: Configuration Phase
Operator->>Plugin: Sets `[plugin:cgroups] cgroup-name timeout` (default 15s)
Plugin->>Plugin: Reads config, sets env `NETDATA_CGROUP_NAME_TIMEOUT_MS`
Plugin->>Plugin: Points `script to get cgroup names` to `/usr/libexec/netdata/plugins.d/cgroup-name`
Note over Plugin,Helper: Discovery Invocation
Plugin->>Plugin: discovery_rename_cgroup() matched cgroup
Plugin->>Helper: spawn_popen (cgroup-name <cgroup_path> <cgroup_id>)
Note over Helper: Inherits env: PATH, LC_ALL, TIMEOUT_MS
Note over Helper,ExtAPI: Name Resolution
Helper->>Helper: Parse args, set up deadline context
alt cgroup id contains "kubepods"
Helper->>FS: Check cache files in TMPDIR (k8s-containers, cluster-name, etc.)
alt Cache miss
Helper->>ExtAPI: Query Kubernetes API or kubectl (with context deadline)
ExtAPI-->>Helper: Pod/container list JSON
Helper->>FS: Write cache files (non-atomic)
Helper->>Helper: Parse to container lines
else Cache hit
FS-->>Helper: Cached container lines
end
Helper->>Helper: Build k8s_<name> with labels
else Non-K8s cgroup
Helper->>Helper: Dispatch regex over cgroup id
alt Docker/containerd/ECS/Podman
Helper->>ExtAPI: Docker/Podman inspect (API or CLI)
ExtAPI-->>Helper: Container metadata
else Proxmox
Helper->>FS: Read /etc/pve/<qemu|lxc>/<id>.conf
FS-->>Helper: Hostname or name
else systemd-nspawn/libvirt/LXC
Helper->>Helper: Extract name from cgroup path
end
end
Note over Helper,Plugin: Output and Exit
alt Name resolved before deadline
Helper-->>Plugin: stdout: `<name> <labels>` or just `<name>`
Helper-->>Plugin: Exit code 0 (success) or 3 (disable)
else Timeout (deadline passed)
Helper->>Helper: Log time breakdown per external call
Helper-->>Plugin: (no stdout)
Helper-->>Plugin: Exit code 2 (retry)
end
Note over Plugin,Plugin: Plugin side handling
Plugin->>Plugin: poll() for output with timeout = operator timeout + 2s grace
alt poll returns > 0 and output available
Plugin->>Plugin: fgets() read line
Plugin->>Plugin: spawn_popen_timedwait() for exit code
alt Exit code 0
Plugin->>Plugin: cgroup_parse_resolved_name_and_labels()
Plugin->>Plugin: set cgroup name, apply labels
else Exit code 3
Plugin->>Plugin: mark cgroup as processed (disable)
end
else poll timeout or error
Plugin->>Plugin: spawn_popen_kill() helper
Plugin->>Plugin: Retry later (retry ladder)
end
Note over Plugin,FS: Fallback
alt No name from helper
Plugin->>Plugin: Use raw cgroup id as name
Plugin->>Plugin: cgroup.name = cgroup_id (truncated to 100 chars)
end
Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.
Re-trigger cubic
…per tests in CI - Restore the shell helper's PATH handling: append the standard system directories to the inherited PATH and resolve docker/kubectl/podman/ps via PATH, instead of locking lookups to a fixed directory list. The helper runs unprivileged (as the netdata user), like the shell helper it replaces, so the lockdown only broke tool discovery on hosts that install them elsewhere. - Remove the jq availability gate: the helper parses JSON natively and never executes jq, so the check needlessly skipped docker-API and Kubernetes name resolution on hosts without jq. - Point Kubernetes API TLS verification failures at the K8S_TLS_INSECURE override in the warning message. - Add the standalone cgroup-name Go module to the Go test matrix and the go-tests path filter so its go test/vet/build/race actually run in CI.
…st the label parser
Runtime behavior:
- Discovery applies the cgroup-name helper's exit-2 fallback name (short
container id, k8s_<id>) on the final retry again, restoring the prior
behavior that a recent change had dropped. Only a kill/timeout discards the
output now, and a timed-out helper is no longer retried, so one hung helper
cannot pile up a backlog and stall discovery.
- Raise the default `cgroup-name timeout` from 15s to 120s. The old shell
helper had no timeout on its docker/kubectl calls, so 120s is closer to its
behavior and avoids giving up on slow Kubernetes API responses.
- Fix the helper path (drop the `script to get cgroup names` override); the
helper always ships next to the other plugins.
Testability:
- Extract the resolved-name/label parser into cgroup-name-labels.{c,h} so it
can be unit tested in isolation; cgroup-discovery.c keeps identical behavior
through a thin wrapper.
- Wire the previously-orphaned C label-parser test (test_cgroups_plugin.c) into
CMake as cgroups-plugin-labels-test and into the topology container tests. It
links the real rrdlabels implementation through a small stub and runs as a
plain-C test, like the sibling cgroup tests.
Document that the cgroup-name helper verifies the Kubernetes API server against the mounted cluster CA by default, and the K8S_TLS_INSECURE escape hatch for clusters with custom PKI, an intercepting proxy, or an external load balancer. Drop the removed `script to get cgroup names` option from the configuration example.
… drop dead fields, test API-server path - cgroup-name-labels.c: add an explicit `len > 0` guard before indexing `n[len - 1]` in the rename-label parse. Behavior-identical (the `n[0] == '"'` short-circuit already guaranteed it), but provable to static analysis. - main.go: match the container id in the cache as a single line (`max=1`), mirroring the shell's line-oriented read and removing a latent multi-line parse foot-gun. - main.go: drop the unused `resolver.dockerHost`/`podmanHost` fields; the DOCKER_HOST/PODMAN_HOST env defaults are still set for the API paths. - add TestMirroredK8sPodListFixtureViaAPIServer covering the in-cluster API-server path (kube-system namespace + /api/v1/pods fetch and parse), which was previously exercised only indirectly through the kubelet path.
…nore the built helper - README: document the `[plugin:cgroups] cgroup-name timeout` option (default 120s, 0 = unbounded) and its retry-on-later-cycle behavior. - FLOW.md: correct the stated default timeout (15s -> 120s). - .gitignore: ignore the compiled cgroup-name Go binary and drop the stale entry for the removed generated cgroup-name.sh.
There was a problem hiding this comment.
1 issue found across 6 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/collectors/cgroups.plugin/cgroup-name/main.go">
<violation number="1" location="src/collectors/cgroups.plugin/cgroup-name/main.go:968">
P2: Cache-file read helpers `grepFile` and `firstLineFile` have a TOCTOU race: they `os.Lstat` the path to verify it is a regular file, then call `os.ReadFile(path)`, which follows symlinks. A local attacker can replace the file with a symlink between the check and the read in a world-writable TMPDIR. `repairPrivateFileMode` has the same flaw with `os.Chmod`. Open with `O_NOFOLLOW` and verify the resulting fd with `fstat` instead.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
|
|
||
| var kubeClusterName, kubeSystemUID, labels, containers string | ||
| if cntrID != "" && isPrivateRegularFile(tmpCluster) && isPrivateRegularFile(tmpSystemUID) && isPrivateRegularFile(tmpContainers) { | ||
| if matched, ok := grepFile(tmpContainers, cntrID, 1); ok { |
There was a problem hiding this comment.
P2: Cache-file read helpers grepFile and firstLineFile have a TOCTOU race: they os.Lstat the path to verify it is a regular file, then call os.ReadFile(path), which follows symlinks. A local attacker can replace the file with a symlink between the check and the read in a world-writable TMPDIR. repairPrivateFileMode has the same flaw with os.Chmod. Open with O_NOFOLLOW and verify the resulting fd with fstat instead.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/collectors/cgroups.plugin/cgroup-name/main.go, line 968:
<comment>Cache-file read helpers `grepFile` and `firstLineFile` have a TOCTOU race: they `os.Lstat` the path to verify it is a regular file, then call `os.ReadFile(path)`, which follows symlinks. A local attacker can replace the file with a symlink between the check and the read in a world-writable TMPDIR. `repairPrivateFileMode` has the same flaw with `os.Chmod`. Open with `O_NOFOLLOW` and verify the resulting fd with `fstat` instead.</comment>
<file context>
@@ -965,7 +965,7 @@ func (r *resolver) k8sGetKubePodName(ctx context.Context, cgroupPath, id string)
var kubeClusterName, kubeSystemUID, labels, containers string
if cntrID != "" && isPrivateRegularFile(tmpCluster) && isPrivateRegularFile(tmpSystemUID) && isPrivateRegularFile(tmpContainers) {
- if matched, ok := grepFile(tmpContainers, cntrID, 0); ok {
+ if matched, ok := grepFile(tmpContainers, cntrID, 1); ok {
labels = matched
kubeSystemUID = firstLineFile(tmpSystemUID)
</file context>
|
Extract the shared pod-list fixture and the resolved-name/label/cache assertions used by both the kubelet and API-server metadata-path tests into k8sPodFixture() and assertK8sBurstableContainerResolved(), removing the duplicated blocks SonarCloud flagged on new code. Both tests keep their distinct transport setup, and the API-server test keeps its kube-system-uid assertion. The kubelet handler no longer calls t.Fatal from a non-test goroutine.
|
@ilyam8 this has been reviewed extensively. With your review we should merge it. |

Summary
cgroup-name.shhelper with a Gocgroup-namehelper.Split Scope
masterand does not include topology container attribution changes.Validation
go test ./...insrc/collectors/cgroups.plugin/cgroup-namecmake -S . -B .local/build-cgroup-name-only -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_PLUGIN_GO=ON -DENABLE_PLUGIN_XENSTAT=OFF -DENABLE_PLUGIN_FREEIPMI=Off -DENABLE_PLUGIN_CUPS=Off -DENABLE_PLUGIN_NFACCT=Off -DENABLE_PLUGIN_OTEL=Off -DENABLE_PLUGIN_OTEL_SIGNAL_VIEWER=Off -DENABLE_EXPORTER_PROMETHEUS_REMOTE_WRITE=Off -DENABLE_EXPORTER_MONGODB=Off -DENABLE_BUNDLED_PROTOBUF=Offcmake --build .local/build-cgroup-name-only --target cgroup-name-target netdata -j "$(nproc)"Upgrade Notes
curl -kand never verified the Kubernetes API-server certificate. The Go helper verifies it against the in-cluster CA (/var/run/secrets/kubernetes.io/serviceaccount/ca.crt), matchingclient-goand the go.d Kubernetes collectors. Standard clusters (kubeadm/EKS/GKE/AKS) are unaffected. On clusters where the API endpoint does not chain to that CA (custom PKI, a TLS-intercepting proxy, or an external load balancer with its own certificate), Kubernetes pods/containers fall back to generick8s_<id>names until you setK8S_TLS_INSECURE=truein the Netdata service environment. This can change chart identity for those nodes on upgrade.[plugin:cgroups] script to get cgroup namesoption was removed. The helper path is now fixed at/usr/libexec/netdata/plugins.d/cgroup-name; any custom value is ignored.[plugin:cgroups] cgroup-name timeoutis now 120s (was 15s; the old shell helper had no timeout). Set0to disable.Summary by cubic
Replaces the shell
cgroup-name.shwith a Gocgroup-namebinary and wires it into build, installers, packages, and runtime. Raises the default timeout to 120s (+2s grace), restores fallback naming, resolves tools via PATH, and hardens cache I/O and label parsing.Refactors
cgroup-name; removedcgroup-name.sh; addedENABLE_CGROUP_NAME; installed as/usr/libexec/netdata/plugins.d/cgroup-namewith root:netdata0750 and runtime checks.cgroup-name timeoutto 120s; discovery adds +2s grace, skips retries after timeouts, and applies the exit-2 fallback name on the final retry.docker/kubectl/podman/psvia PATH (with sbin dirs); dropped thejqgate; improved Kubernetes TLS failure hints (K8S_TLS_INSECURE).Migration
/usr/libexec/netdata/plugins.d/cgroup-name.cgroup-name timeoutin[plugin:cgroups](default 120000 ms; set 0 to disable).Written for commit d5984fc. Summary will update on new commits.