Skip to content

cgroups: add cgroup-name Go helper#22685

Open
ktsaou wants to merge 12 commits into
netdata:masterfrom
ktsaou:topology-cgroup-name
Open

cgroups: add cgroup-name Go helper#22685
ktsaou wants to merge 12 commits into
netdata:masterfrom
ktsaou:topology-cgroup-name

Conversation

@ktsaou

@ktsaou ktsaou commented Jun 11, 2026

Copy link
Copy Markdown
Member

Summary

  • Replace the shell cgroup-name.sh helper with a Go cgroup-name helper.
  • Wire the helper into CMake, installer, and package permission paths.
  • Preserve cgroups plugin timeout handling and helper failure behavior with focused tests.

Split Scope

  • This PR contains only the cgroup-name helper replacement surface.
  • It is based directly on master and does not include topology container attribution changes.

Validation

  • go test ./... in src/collectors/cgroups.plugin/cgroup-name
  • cmake -S . -B .local/build-cgroup-name-only -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_PLUGIN_GO=ON -DENABLE_PLUGIN_XENSTAT=OFF -DENABLE_PLUGIN_FREEIPMI=Off -DENABLE_PLUGIN_CUPS=Off -DENABLE_PLUGIN_NFACCT=Off -DENABLE_PLUGIN_OTEL=Off -DENABLE_PLUGIN_OTEL_SIGNAL_VIEWER=Off -DENABLE_EXPORTER_PROMETHEUS_REMOTE_WRITE=Off -DENABLE_EXPORTER_MONGODB=Off -DENABLE_BUNDLED_PROTOBUF=Off
  • cmake --build .local/build-cgroup-name-only --target cgroup-name-target netdata -j "$(nproc)"

Upgrade Notes

  • Kubernetes API TLS is now verified by default. The previous shell helper used curl -k and never verified the Kubernetes API-server certificate. The Go helper verifies it against the in-cluster CA (/var/run/secrets/kubernetes.io/serviceaccount/ca.crt), matching client-go and the go.d Kubernetes collectors. Standard clusters (kubeadm/EKS/GKE/AKS) are unaffected. On clusters where the API endpoint does not chain to that CA (custom PKI, a TLS-intercepting proxy, or an external load balancer with its own certificate), Kubernetes pods/containers fall back to generic k8s_<id> names until you set K8S_TLS_INSECURE=true in the Netdata service environment. This can change chart identity for those nodes on upgrade.
  • The [plugin:cgroups] script to get cgroup names option was removed. The helper path is now fixed at /usr/libexec/netdata/plugins.d/cgroup-name; any custom value is ignored.
  • The default [plugin:cgroups] cgroup-name timeout is now 120s (was 15s; the old shell helper had no timeout). Set 0 to disable.

Summary by cubic

Replaces the shell cgroup-name.sh with a Go cgroup-name binary and wires it into build, installers, packages, and runtime. Raises the default timeout to 120s (+2s grace), restores fallback naming, resolves tools via PATH, and hardens cache I/O and label parsing.

  • Refactors

    • Introduced Go cgroup-name; removed cgroup-name.sh; added ENABLE_CGROUP_NAME; installed as /usr/libexec/netdata/plugins.d/cgroup-name with root:netdata 0750 and runtime checks.
    • Increased cgroup-name timeout to 120s; discovery adds +2s grace, skips retries after timeouts, and applies the exit-2 fallback name on the final retry.
    • Resolved docker/kubectl/podman/ps via PATH (with sbin dirs); dropped the jq gate; improved Kubernetes TLS failure hints (K8S_TLS_INSECURE).
    • Hardened Kubernetes cache I/O (atomic writes, reject non-regular files), switched container-id cache lookups to single-line matches, precompiled dispatch regexes, guarded label parsing, and expanded tests (standalone Go module in CI, API-server path test, de-duplicated Kubernetes pod-list tests, fixed t.Fatal from a goroutine; C label-parser test); included a dev-only dual-run wrapper.
  • Migration

    • Remove any custom “script to get cgroup names” setting; Netdata now calls /usr/libexec/netdata/plugins.d/cgroup-name.
    • Configure the helper via cgroup-name timeout in [plugin:cgroups] (default 120000 ms; set 0 to disable).

Written for commit d5984fc. Summary will update on new commits.

Review in cubic

@github-actions github-actions Bot added area/packaging Packaging and operating systems support area/docs area/collectors Everything related to data collection area/build Build system (autotools and cmake). collectors/cgroups labels Jun 11, 2026
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Fixed
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Fixed
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Fixed
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Fixed

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 issues found across 16 files

Confidence score: 3/5

  • In src/collectors/cgroups.plugin/cgroup-name/main.go, Kubernetes annotation and Docker label values are emitted without escaping, so quotes/commas/newlines can corrupt label serialization or inject malformed labels into output; this is the highest user-facing risk because downstream parsing and attribution can break — escape/sanitize label values before emitting them.
  • In src/collectors/cgroups.plugin/cgroup-discovery.c, treating poll() EINTR as timeout/error can trigger unnecessary helper kills and miss cgroup rename events, leading to intermittent metadata gaps under signal-heavy conditions — handle EINTR as a retry path before merging.
  • In src/collectors/cgroups.plugin/sys_fs_cgroup.c, timeout unit conversion can overflow and effectively disable or distort cgroup-name timeout behavior, which risks hangs or premature timeouts depending on inputs — add bounds checks/safe conversion and a regression test around extreme timeout values.
  • In src/collectors/cgroups.plugin/cgroup-name/main.go, cache files under /tmp are world-readable, exposing pod/container metadata to other local users; lower severity but concrete privacy leakage on multi-user hosts — restrict file mode (e.g., 0600) and consider a private runtime directory before release.
Architecture diagram
sequenceDiagram
    participant Plugin as cgroups.plugin (C)
    participant Discovery as discovery thread
    participant Helper as cgroup-name (Go binary)
    participant DockerAPI as Docker/Podman API
    participant K8sAPI as Kubernetes API
    participant CRI as Container runtime (docker/crio/containerd)
    participant Env as Environment (timeout, log level)

    Note over Plugin,Discovery: Configuration loading (once)
    Plugin->>Plugin: read_cgroup_plugin_configuration()
    Plugin->>Plugin: Set cgroup_name_timeout_ms from config (default 15000)
    Plugin->>Env: Set NETDATA_CGROUP_NAME_TIMEOUT_MS env var
    Plugin->>Plugin: Determine cgroups_rename_script path to /usr/libexec/netdata/plugins.d/cgroup-name

    Note over Discovery,Helper: Per-cgroup rename cycle
    Discovery->>Discovery: discovery_rename_cgroup(cg)
    alt cgroup matches rename pattern
        Discovery->>Helper: spawn cgroup-name <cgroup_path> <cgroup_id>
        Note over Helper: Receives args, sets PATH, LC_ALL=C
        Helper->>Env: Read NETDATA_CGROUP_NAME_TIMEOUT_MS
        alt timeout > 0
            Helper->>Helper: Set expiresAt = now + timeout
            Helper->>Helper: Create context.WithDeadline(expiresAt)
        else timeout == 0
            Helper->>Helper: Create context.WithCancel() (unbounded)
        end

        alt cgroup id contains "kubepods"
            Helper->>Helper: k8sGetName()
            alt KUBERNETES_SERVICE_HOST set
                Helper->>K8sAPI: curl to in-cluster API with TLS verify (or insecure if K8S_TLS_INSECURE)
                K8sAPI-->>Helper: Pod JSON
                Helper->>Helper: podsToContainerLines() → parse container metadata
            else kubectl available
                Helper->>Helper: exec kubectl get pods
            else
                Helper->>Helper: Try GCP metadata API for cluster name
            end
            alt pause container
                Helper->>CRI: Read cgroup.procs, check /proc/<pid>/comm
                CRI-->>Helper: "pause"
                Helper->>Helper: exit disable (code 3)
            else valid container
                Helper->>Helper: Build k8s_ prefixed name with labels
            end
        else non-K8s path
            alt matches Docker regex
                Helper->>DockerAPI: docker inspect via socket or API
                DockerAPI-->>Helper: Container config (env, labels)
                Helper->>Helper: parseDockerLikeInspectOutput()
            else matches ECS regex
                Helper->>DockerAPI: Similar docker inspect
            else matches Containerd regex
                Helper->>DockerAPI: Similar docker inspect
            else matches Podman regex
                Helper->>DockerAPI: podman inspect via podman socket
            else matches systemd-nspawn
                Helper->>Helper: Extract machine name from cgroup path
            else matches libvirt regex
                Helper->>Helper: Parse libvirt/LXC/QEMU machine names
            else matches Proxmox regex
                Helper->>Helper: Read /etc/pve config files for VM/container names
            end
        end

        alt name resolved successfully
            Helper-->>Discovery: stdout: "<name> <labels>\n"
            alt exit code 0
                Discovery->>Discovery: Set pending_renames = 0
            else exit code 3
                Discovery->>Discovery: Set pending_renames = 0, processed = 1
            end
        else budget expired (no name resolved)
            Helper-->>Discovery: stderr: time breakdown log
            Helper-->>Discovery: exit code 2 (retry)
        else helper stuck/timeout
            Discovery->>Helper: poll() with wait_ms = cgroup_name_timeout_ms + 2000ms grace
            alt poll timeout or error
                Discovery->>Helper: spawn_popen_kill() - kill helper
                Discovery->>Discovery: Log error, keep pending_renames for retry
            else poll success
                Discovery->>Helper: spawn_popen_timedwait() with 2s grace
                alt still running after grace
                    Discovery->>Helper: spawn_popen_kill()
                end
            end
        end
    end

    Note over Plugin,Helper: Installer & permissions (deployment time)
    Participant Installer as Installer scripts
    Installer->>Helper: Copy cgroup-name binary to /usr/libexec/netdata/plugins.d/
    Installer->>Helper: chown root:netdata, chmod 0750
Loading

Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.

Re-trigger cubic

Comment thread src/collectors/cgroups.plugin/sys_fs_cgroup.c Outdated
Comment thread src/collectors/cgroups.plugin/cgroup-discovery.c Outdated
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Outdated
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Outdated
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Outdated
Comment thread src/collectors/cgroups.plugin/cgroup-name/capture-wrappers/cgroup-name.sh.logger Outdated
Comment thread src/collectors/cgroups.plugin/cgroup-name/FLOW.md Outdated
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Outdated
@github-actions github-actions Bot added the area/metadata Integrations metadata label Jun 14, 2026

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 12 files (changes from recent commits).

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread src/collectors/cgroups.plugin/cgroup-discovery.c Outdated
@ktsaou

ktsaou commented Jun 14, 2026

Copy link
Copy Markdown
Member Author

@cubic-dev-ai please review again

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

@cubic-dev-ai please review again

@ktsaou I have started the AI code review. It will take a few minutes to complete.

@ktsaou ktsaou force-pushed the topology-cgroup-name branch from 4503c9a to 85ae59f Compare June 14, 2026 08:17
@ktsaou

ktsaou commented Jun 14, 2026

Copy link
Copy Markdown
Member Author

@cubic-dev-ai please review again

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

@cubic-dev-ai please review again

@ktsaou I have started the AI code review. It will take a few minutes to complete.

Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Fixed
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Fixed
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Fixed

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found and verified against the latest diff

Confidence score: 4/5

  • In src/collectors/cgroups.plugin/cgroup-name/capture-wrappers/cgroup-name.sh.logger, diff_detail only compares stdout, so failures caused by exit-code or stderr differences can show an empty diff and hide the real mismatch; merging as-is risks harder triage when capture-wrapper checks fail. Expand diff_detail to include exit status and stderr deltas before merging to preserve the wrapper’s diagnostic value.

Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.

Re-trigger cubic

Comment thread src/collectors/cgroups.plugin/cgroup-name/capture-wrappers/cgroup-name.sh.logger Outdated
@ktsaou

ktsaou commented Jun 14, 2026

Copy link
Copy Markdown
Member Author

@cubic-dev-ai please review again

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

@cubic-dev-ai please review again

@ktsaou I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 22 files

Confidence score: 3/5

  • In src/collectors/cgroups.plugin/cgroup-name/main.go, predictable temp-file creation can follow symlinks, which could let an attacker or local process clobber unintended files when this path runs — switch to nofollow/atomic creation in a private temp directory before merging.
  • In src/collectors/cgroups.plugin/cgroup-name/capture-wrappers/cgroup-name.sh.logger, the wrapper currently drops shell.err and can abort on capture-write failure, so resolver errors may be hidden and normal passthrough behavior can be lost — relay shell.err to stderr and make capture logging best-effort so resolver results still return.
  • In src/collectors/cgroups.plugin/cgroup-name/main.go, recompiling a regex on every dispatch adds avoidable overhead in a hot path; this is low severity but can become noticeable under load — precompile it once at package scope (like the other matchers) to de-risk performance regression.
Architecture diagram
sequenceDiagram
    participant CG as cgroups.plugin (C)
    participant CN as cgroup-name (Go)
    participant ENV as Environment
    participant DOCK as Docker/Podman
    participant K8S as Kubernetes
    participant FS as Filesystem

    Note over CG,FS: Cgroup Name Resolution Flow

    CG->>ENV: read cgroup-name timeout config (default 15s)
    ENV-->>CG: cgroup_name_timeout_ms
    CG->>CG: set NETDATA_CGROUP_NAME_TIMEOUT_MS env var
    CG->>CN: spawn cgroup-name <cgroup_path> <cgroup_id>
    Note over CN: self-terminates within timeout budget

    CN->>ENV: read NETDATA_CGROUP_NAME_TIMEOUT_MS
    ENV-->>CN: timeout budget X ms
    CN->>CN: setup deadline context

    alt cgroup contains "kubepods"
        CN->>K8S: Kubernetes resolution path
        K8S-->>CN: container/pod name + labels
    else Non-K8s cgroup
        CN->>CN: match against docker/ecs/containerd/podman/libvirt/proxmox regex
        alt Matched docker/containerd/ecs
            CN->>DOCK: docker/podman inspect via API or socket
            DOCK-->>CN: container metadata (name, labels)
        else Matched libvirt/qemu
            CN->>FS: read /etc/pve/qemu-server/<id>.conf
            FS-->>CN: VM name
        else Matched Proxmox LXC
            CN->>FS: read /etc/pve/lxc/<id>.conf
            FS-->>CN: container hostname
        end
    end

    CN->>CN: build output: <name> [<labels>]

    alt budget expired, name unresolved
        CN->>CN: log call time breakdown
        CN-->>CG: exit code 2 (retry)
    else success
        CN-->>CG: stdout: <name> <labels>, exit 0
    end

    CG->>CG: poll for output with timeout + 2s grace
    alt helper timed out
        CG->>CG: kill helper process, treat as failure
        CG->>CG: schedule retry
    else output received
        CG->>CG: parse name and labels
        alt exit code 0
            CG->>CG: apply name/labels to cgroup chart
        else exit code 2 (retry)
            CG->>CG: keep pending_renames, retry later
        else exit code 3 (disable)
            CG->>CG: mark cgroup as processed
        end
    end

    alt timeout set to 0 (legacy unbounded)
        CG->>CG: wait indefinitely for helper
    end
Loading

Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.

Re-trigger cubic

Comment thread src/collectors/cgroups.plugin/cgroup-name/capture-wrappers/cgroup-name.sh.logger Outdated
Comment thread src/collectors/cgroups.plugin/cgroup-name/capture-wrappers/cgroup-name.sh.logger Outdated
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Outdated
Comment thread src/collectors/cgroups.plugin/cgroup-name/main.go Outdated
ktsaou added 6 commits June 16, 2026 18:17
…exes

- writePrivateFile now writes the Kubernetes cache files atomically through a
  private temp file plus rename, so a symlink planted under the predictable
  name in a world-writable TMPDIR is replaced instead of followed.
- cache readers (firstLineFile, grepFile) and the cache-presence gate reject
  non-regular files via isPrivateRegularFile, closing the matching
  symlink-follow read vector; add tests for both directions.
- precompile the libvirt-qemu, lxc.payload and proxmox name/hostname regexes
  at package scope and reuse reMachineIDSegment, removing per-dispatch regexp
  compilation from the hot path.
- capture wrapper: make capture logging best-effort and relay the shell
  resolver's stderr so it stays a transparent stand-in for cgroup-name.sh.real.
@ktsaou ktsaou force-pushed the topology-cgroup-name branch from 581ca23 to ca513c1 Compare June 16, 2026 15:19
@ktsaou

ktsaou commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

@cubic-dev-ai please review again — pushed ca513c1 addressing the 4 open findings (best-effort capture + stderr relay in the dev wrapper, atomic symlink-safe cache writes plus symlink-rejecting reads with tests, and package-scope precompiled dispatch regexes), and rebased onto current master.

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@cubic-dev-ai please review again — pushed ca513c1 addressing the 4 open findings (best-effort capture + stderr relay in the dev wrapper, atomic symlink-safe cache writes plus symlink-rejecting reads with tests, and package-scope precompiled dispatch regexes), and rebased onto current master.

@ktsaou I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 22 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Operator
    participant Plugin as CGroup Plugin (cgroup-discovery.c)
    participant Helper as cgroup-name (Go helper)
    participant ExtAPI as External APIs (Docker/Podman/K8s/kubectl)
    participant FS as File System (cgroup paths, cache files, Proxmox configs)

    Note over Operator,Plugin: Configuration Phase
    Operator->>Plugin: Sets `[plugin:cgroups] cgroup-name timeout` (default 15s)
    Plugin->>Plugin: Reads config, sets env `NETDATA_CGROUP_NAME_TIMEOUT_MS`
    Plugin->>Plugin: Points `script to get cgroup names` to `/usr/libexec/netdata/plugins.d/cgroup-name`

    Note over Plugin,Helper: Discovery Invocation
    Plugin->>Plugin: discovery_rename_cgroup() matched cgroup
    Plugin->>Helper: spawn_popen (cgroup-name <cgroup_path> <cgroup_id>)
    Note over Helper: Inherits env: PATH, LC_ALL, TIMEOUT_MS

    Note over Helper,ExtAPI: Name Resolution
    Helper->>Helper: Parse args, set up deadline context
    alt cgroup id contains "kubepods"
        Helper->>FS: Check cache files in TMPDIR (k8s-containers, cluster-name, etc.)
        alt Cache miss
            Helper->>ExtAPI: Query Kubernetes API or kubectl (with context deadline)
            ExtAPI-->>Helper: Pod/container list JSON
            Helper->>FS: Write cache files (non-atomic)
            Helper->>Helper: Parse to container lines
        else Cache hit
            FS-->>Helper: Cached container lines
        end
        Helper->>Helper: Build k8s_<name> with labels
    else Non-K8s cgroup
        Helper->>Helper: Dispatch regex over cgroup id
        alt Docker/containerd/ECS/Podman
            Helper->>ExtAPI: Docker/Podman inspect (API or CLI)
            ExtAPI-->>Helper: Container metadata
        else Proxmox
            Helper->>FS: Read /etc/pve/<qemu|lxc>/<id>.conf
            FS-->>Helper: Hostname or name
        else systemd-nspawn/libvirt/LXC
            Helper->>Helper: Extract name from cgroup path
        end
    end

    Note over Helper,Plugin: Output and Exit
    alt Name resolved before deadline
        Helper-->>Plugin: stdout: `<name> <labels>` or just `<name>`
        Helper-->>Plugin: Exit code 0 (success) or 3 (disable)
    else Timeout (deadline passed)
        Helper->>Helper: Log time breakdown per external call
        Helper-->>Plugin: (no stdout)
        Helper-->>Plugin: Exit code 2 (retry)
    end

    Note over Plugin,Plugin: Plugin side handling
    Plugin->>Plugin: poll() for output with timeout = operator timeout + 2s grace
    alt poll returns > 0 and output available
        Plugin->>Plugin: fgets() read line
        Plugin->>Plugin: spawn_popen_timedwait() for exit code
        alt Exit code 0
            Plugin->>Plugin: cgroup_parse_resolved_name_and_labels()
            Plugin->>Plugin: set cgroup name, apply labels
        else Exit code 3
            Plugin->>Plugin: mark cgroup as processed (disable)
        end
    else poll timeout or error
        Plugin->>Plugin: spawn_popen_kill() helper
        Plugin->>Plugin: Retry later (retry ladder)
    end

    Note over Plugin,FS: Fallback
    alt No name from helper
        Plugin->>Plugin: Use raw cgroup id as name
        Plugin->>Plugin: cgroup.name = cgroup_id (truncated to 100 chars)
    end
Loading

Tip: cubic can generate docs of your entire codebase and keep them up to date. Try it here.

Re-trigger cubic

ktsaou added 3 commits June 17, 2026 02:00
…per tests in CI

- Restore the shell helper's PATH handling: append the standard system
  directories to the inherited PATH and resolve docker/kubectl/podman/ps via
  PATH, instead of locking lookups to a fixed directory list. The helper runs
  unprivileged (as the netdata user), like the shell helper it replaces, so the
  lockdown only broke tool discovery on hosts that install them elsewhere.
- Remove the jq availability gate: the helper parses JSON natively and never
  executes jq, so the check needlessly skipped docker-API and Kubernetes name
  resolution on hosts without jq.
- Point Kubernetes API TLS verification failures at the K8S_TLS_INSECURE
  override in the warning message.
- Add the standalone cgroup-name Go module to the Go test matrix and the
  go-tests path filter so its go test/vet/build/race actually run in CI.
…st the label parser

Runtime behavior:
- Discovery applies the cgroup-name helper's exit-2 fallback name (short
  container id, k8s_<id>) on the final retry again, restoring the prior
  behavior that a recent change had dropped. Only a kill/timeout discards the
  output now, and a timed-out helper is no longer retried, so one hung helper
  cannot pile up a backlog and stall discovery.
- Raise the default `cgroup-name timeout` from 15s to 120s. The old shell
  helper had no timeout on its docker/kubectl calls, so 120s is closer to its
  behavior and avoids giving up on slow Kubernetes API responses.
- Fix the helper path (drop the `script to get cgroup names` override); the
  helper always ships next to the other plugins.

Testability:
- Extract the resolved-name/label parser into cgroup-name-labels.{c,h} so it
  can be unit tested in isolation; cgroup-discovery.c keeps identical behavior
  through a thin wrapper.
- Wire the previously-orphaned C label-parser test (test_cgroups_plugin.c) into
  CMake as cgroups-plugin-labels-test and into the topology container tests. It
  links the real rrdlabels implementation through a small stub and runs as a
  plain-C test, like the sibling cgroup tests.
Document that the cgroup-name helper verifies the Kubernetes API server against
the mounted cluster CA by default, and the K8S_TLS_INSECURE escape hatch for
clusters with custom PKI, an intercepting proxy, or an external load balancer.
Drop the removed `script to get cgroup names` option from the configuration
example.
Comment thread src/collectors/cgroups.plugin/cgroup-name-labels.c Fixed
ktsaou added 2 commits June 17, 2026 14:13
… drop dead fields, test API-server path

- cgroup-name-labels.c: add an explicit `len > 0` guard before indexing
  `n[len - 1]` in the rename-label parse. Behavior-identical (the `n[0] == '"'`
  short-circuit already guaranteed it), but provable to static analysis.
- main.go: match the container id in the cache as a single line (`max=1`),
  mirroring the shell's line-oriented read and removing a latent multi-line
  parse foot-gun.
- main.go: drop the unused `resolver.dockerHost`/`podmanHost` fields; the
  DOCKER_HOST/PODMAN_HOST env defaults are still set for the API paths.
- add TestMirroredK8sPodListFixtureViaAPIServer covering the in-cluster
  API-server path (kube-system namespace + /api/v1/pods fetch and parse), which
  was previously exercised only indirectly through the kubelet path.
…nore the built helper

- README: document the `[plugin:cgroups] cgroup-name timeout` option
  (default 120s, 0 = unbounded) and its retry-on-later-cycle behavior.
- FLOW.md: correct the stated default timeout (15s -> 120s).
- .gitignore: ignore the compiled cgroup-name Go binary and drop the stale
  entry for the removed generated cgroup-name.sh.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 6 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/collectors/cgroups.plugin/cgroup-name/main.go">

<violation number="1" location="src/collectors/cgroups.plugin/cgroup-name/main.go:968">
P2: Cache-file read helpers `grepFile` and `firstLineFile` have a TOCTOU race: they `os.Lstat` the path to verify it is a regular file, then call `os.ReadFile(path)`, which follows symlinks. A local attacker can replace the file with a symlink between the check and the read in a world-writable TMPDIR. `repairPrivateFileMode` has the same flaw with `os.Chmod`. Open with `O_NOFOLLOW` and verify the resulting fd with `fstat` instead.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic


var kubeClusterName, kubeSystemUID, labels, containers string
if cntrID != "" && isPrivateRegularFile(tmpCluster) && isPrivateRegularFile(tmpSystemUID) && isPrivateRegularFile(tmpContainers) {
if matched, ok := grepFile(tmpContainers, cntrID, 1); ok {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Cache-file read helpers grepFile and firstLineFile have a TOCTOU race: they os.Lstat the path to verify it is a regular file, then call os.ReadFile(path), which follows symlinks. A local attacker can replace the file with a symlink between the check and the read in a world-writable TMPDIR. repairPrivateFileMode has the same flaw with os.Chmod. Open with O_NOFOLLOW and verify the resulting fd with fstat instead.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/collectors/cgroups.plugin/cgroup-name/main.go, line 968:

<comment>Cache-file read helpers `grepFile` and `firstLineFile` have a TOCTOU race: they `os.Lstat` the path to verify it is a regular file, then call `os.ReadFile(path)`, which follows symlinks. A local attacker can replace the file with a symlink between the check and the read in a world-writable TMPDIR. `repairPrivateFileMode` has the same flaw with `os.Chmod`. Open with `O_NOFOLLOW` and verify the resulting fd with `fstat` instead.</comment>

<file context>
@@ -965,7 +965,7 @@ func (r *resolver) k8sGetKubePodName(ctx context.Context, cgroupPath, id string)
 	var kubeClusterName, kubeSystemUID, labels, containers string
 	if cntrID != "" && isPrivateRegularFile(tmpCluster) && isPrivateRegularFile(tmpSystemUID) && isPrivateRegularFile(tmpContainers) {
-		if matched, ok := grepFile(tmpContainers, cntrID, 0); ok {
+		if matched, ok := grepFile(tmpContainers, cntrID, 1); ok {
 			labels = matched
 			kubeSystemUID = firstLineFile(tmpSystemUID)
</file context>

@sonarqubecloud

sonarqubecloud Bot commented Jun 17, 2026

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
5.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Extract the shared pod-list fixture and the resolved-name/label/cache assertions
used by both the kubelet and API-server metadata-path tests into k8sPodFixture()
and assertK8sBurstableContainerResolved(), removing the duplicated blocks
SonarCloud flagged on new code. Both tests keep their distinct transport setup,
and the API-server test keeps its kube-system-uid assertion. The kubelet handler
no longer calls t.Fatal from a non-test goroutine.
@ktsaou ktsaou marked this pull request as ready for review June 17, 2026 18:12
@ktsaou

ktsaou commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

@ilyam8 this has been reviewed extensively. With your review we should merge it.
The key difference is that now k8s API is verified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/build Build system (autotools and cmake). area/ci area/collectors Everything related to data collection area/docs area/metadata Integrations metadata area/packaging Packaging and operating systems support collectors/cgroups

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants