OpenTelemetry logs ingestion/query subsystem #22720
Adds the OpenTelemetry logs subsystem and supporting storage engine: OTLP/gRPC ingestion, a write-ahead log, SFST index files with a query engine, catalog, and the otel-plugin that exposes the otel-logs function.
- New crates: otel-plugin, otel-ledger, otel-ingestor, otel-catalog, otel-normalize, otel-streams, sfst, sfsq, sfst-indexer, wal, wal-otap, treight, file-registry, chunk-file, fst-index, ferryboat, and supporting journal crates.
- Removes the superseded otel-signal-viewer plugin and the unused nlogql, nlogql-eval, sd-compat, and journal-function crates; renames netdata-otel/flatten_otel -> flatten-otel and the otel-plugin tree.
- CMake/packaging: build otel-plugin, install its stock otel.yaml, drop signal-viewer wiring.
Adds a local MCP server (packaging/tools/automation/mcp) that lets an LLM configure, build, run, and query the Netdata Agent from a worktree, to drive the otel-logs subsystem and verify changes end to end.
- Build/run lifecycle (declare/start/status/logs/stop), two profiles over one build dir per worktree, Cloud auto-claim for launched agents.
- otel-logs tooling: otel_config (rotation/retention knobs), otel_push (deterministic synth corpus), otel_stream_* (live sources), otel_logs (typed query with Cloud-bearer auth).
- CMake: a setup-mcp convenience target.
- Docs: mcp-build-run.md spec, .env.template / ENV.md claim-token keys.
SonarQube complains about duplicate code in packaging/tools/automation/mcp/netdata_mcp/agent_tools.py. However, this file is generated automatically by the agent's own MCP tooling. I don't see an option in SonarQube's UI to resolve the warning.
What do you mean by resolve it? There is some duplicate code, or a chance to dedup code, which is why it is raising this. You can ignore it, but it is there.
Yes, it is there and it's fine because it's generated code. What I'm trying to say is that I can't find any option in SonarQube's UI to mark this as accepted/wont-fix/etc.
Ok, let me check.
The duplication warning can't be cleared (as in marked resolved). Just ignore it.
ENABLE_PLUGIN_OTEL was a plain option() with no platform guard, so the build tried to compile otel-plugin on Windows (where it does not build) and macOS (untested). It is tested only on Linux. Gate it to OS_LINUX — matching the sibling Linux-only plugins (debugfs, ebpf, cgroup-network) and the plugin's metadata.yaml supported_platforms.
…tore
The otel/log-viewer removal left the OpenTelemetry docs broken and partly false, and dropped the live OTLP-metrics integration. Consolidated documentation fixes:
- Fix broken links and remove the false signal-viewer / systemd-journal-files prose (logs-tab, working-with-logs, the Learn map OpenTelemetry node, and the OpenTelemetry Logs integration).
- Restore the OpenTelemetry metrics integration for src/crates/otel-plugin/: recreate metadata.yaml (metrics-only; the old journal-based logs config is dropped since logs are a separate integration), register the source in integrations/_common.py, and regenerate the integration page + README symlink. Flesh out the OpenTelemetry Logs integration prerequisites.
- Remove internal design/plan notes (docs/wal-query-design.md, catalog-implementation-plan.md) and scrub their references + 'milestone N' language from Rust doc-comments (otel-ledger, sfsq); rewrite the otel-streams README.
No functional code is changed in this commit; only docs, comments, integration metadata, and generated integration pages.
Three improvements to the MCP build/run automation server (plus a docs pass):
- buildcfg: reconfigure when the cached CMAKE_INSTALL_PREFIX (path-normalized) differs from the canonical install prefix, or is absent/malformed, not only when the build type changes. Previously a stale cached prefix made `ninja install` write to one tree while the run launched a stale binary from another.
- otel config: add a typed `journal_dir` knob to `netdata_agent_otel_config`, emitted as `logs.journal_dir` (WAL/index stay pinned under the run dir; an empty string omits the key). This points the read-only `legacy-otel-logs` viewer at a journal-file fixture for scripted verification.
- Require a per-agent Cloud bearer on the wrapped `netdata_agent_*` forwarder. Agent functions are access-gated (e.g. SIGNED_ID) even on localhost once the agent is claimed, so anonymous forwarding returned HTTP 412. The forwarder now mints a per-agent bearer (a hard error without NETDATA_CLOUD_TOKEN or on mint failure; there is no anonymous fallback) and attaches it via the SDK's create_mcp_http_client factory, which preserves the MCP 30s/300s timeouts; the bearer is scrubbed from error strings. This unblocks terminal verification of access-gated agent functions.
Tests cover all three paths; the build-run spec is updated to match.
The former OpenTelemetry plugin stored logs as systemd journal files, served by a separate otel-signal-viewer plugin via an `otel-logs` function. The new WAL/SFST implementation replaced ingestion and removed that viewer, leaving already-stored logs unviewable. Restore read-only access to them without re-introducing the removed ingestion or the standalone plugin.
- Restore the `journal-function` crate: the query / facet / histogram / UI-formatting stack over journal files.
- Add an `otel-legacy-logs` worker: a third, read-only otel-plugin worker that serves a `legacy-otel-logs` function over the former journal files via a new `bridge` LegacyLogs IPC protocol. It never writes, prunes, or rotates. The journal directory is resolved from the former `logs.journal_dir` in otel.yaml, defaulting to <NETDATA_LOG_DIR>/otel/v1 (falling back to /var/log/netdata/otel/v1 when unset); a malformed config warns rather than silently using the default. A distinct function name avoids a nondeterministic registry collision with the new ledger's `otel-logs`, and the advertised request params are trimmed to those actually honored.
- Fully isolate the worker from the new pipeline: a configure failure, an init failure, an absent journal directory, a runtime crash, or an oversize response disables only the legacy viewer (degrading to idle, dropped routes, or a status-500) and never affects the ingestor/ledger, which keep their fatal contract by design.
- Deny launching the obsolete otel-signal-viewer plugin in pluginsd, so a leftover binary from before the upgrade cannot re-register `otel-logs` and collide with the new ledger. Physical package removal is tracked separately (netdata#22728).
Includes unit tests (the GET-args shim and journal-dir resolution) and a durable spec under .agents/sow/specs/.
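The journal-directory fallback chain described above can be sketched roughly as follows. This is a minimal illustration, not the worker's actual code: `resolve_journal_dir` and its parameters are hypothetical names, and the warning on a malformed config is left out.

```rust
use std::path::PathBuf;

/// Hypothetical helper (name and signature are assumptions, not from the PR).
/// `configured` stands for `logs.journal_dir` from otel.yaml, and
/// `netdata_log_dir` for the NETDATA_LOG_DIR environment value, if set.
fn resolve_journal_dir(configured: Option<&str>, netdata_log_dir: Option<&str>) -> PathBuf {
    // An explicit setting wins over every default.
    if let Some(dir) = configured {
        return PathBuf::from(dir);
    }
    // Otherwise use <NETDATA_LOG_DIR>/otel/v1, with /var/log/netdata as the base
    // when the variable is unset.
    let base = netdata_log_dir.unwrap_or("/var/log/netdata");
    PathBuf::from(base).join("otel").join("v1")
}
```

A caller would pass `std::env::var("NETDATA_LOG_DIR").ok().as_deref()` as the second argument; keeping the environment lookup outside the helper makes the resolution easy to unit-test.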
… identity
OpenTelemetry treats a zero-length service.namespace as equal to an
unspecified one, but the logs pipeline encoded stream identity two ways
that disagreed: the ingestor named files with compute_ns_hash(None, name)
for an absent namespace, while the WAL stream filter recomputed
compute_ns_hash(Some(""), name) from the collapsed ServiceStream. The two
never matched, so a stream-filtered query missed every absent-namespace
file.
Centralize identity on a single ServiceStream::ns_hash that maps an empty
field to absent, and route the ingestor, the WAL candidates filter, and
identity through it:
- file-registry: add ServiceStream::ns_hash (empty -> None); keep
compute_ns_hash as the low-level primitive, documented as not for the
identity layer.
- otel-ingestor: extract_stream returns a ServiceStream; drop the
Option-based Stream/CanonicalStream; the per-(tenant, ns_hash) collision
table carries ServiceStream so absent and literal-empty land in one
partition.
- wal: Registry::candidates derives the filter hash via ServiceStream::ns_hash,
fixing the absent-namespace filter miss.
The empty->absent rule keeps the common absent-namespace ns_hash unchanged,
so existing files need no migration and no on-disk/wire format changes; only
a rare literal-empty-namespace file re-partitions. The SFST registry already
filters by ServiceStream equality and is unaffected.
Adds unit and regression tests (absent==empty hashing, the previously-broken
absent-namespace candidate match) and a durable spec under .agents/sow/specs/.
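The empty->absent rule can be illustrated with a small sketch. The struct layout and the hash function here are assumptions: the source does not show how `compute_ns_hash` is implemented, so `DefaultHasher` stands in for it.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the low-level primitive. The real hash function is not shown
/// in the source; DefaultHasher is used only to make the sketch runnable.
fn compute_ns_hash(namespace: Option<&str>, name: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    namespace.hash(&mut hasher);
    name.hash(&mut hasher);
    hasher.finish()
}

/// Assumed shape of the collapsed stream identity: an empty namespace
/// string means "not specified".
struct ServiceStream {
    namespace: String,
    name: String,
}

impl ServiceStream {
    /// Identity-layer hash: an empty namespace is mapped to None before hashing,
    /// so absent and literal-empty namespaces land in the same partition.
    fn ns_hash(&self) -> u64 {
        let ns = if self.namespace.is_empty() {
            None
        } else {
            Some(self.namespace.as_str())
        };
        compute_ns_hash(ns, &self.name)
    }
}
```

Routing every caller through `ns_hash` rather than the primitive is what removes the `None` versus `Some("")` disagreement between the ingestor and the WAL filter.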
Add a CLI that reads OTel logs directly from on-disk WAL and SFST
directories and queries them through the wire-neutral sfsq engine, with
no running agent — terminal and forensic inspection alongside the live
otel-logs Function.
- Directories resolve per-dir from otel.yaml (explicit --wal-dir/--sfst-dir
override --config over --stock-config); logs read from {dir}/{tenant}.
Sealed SFSTs are time-pruned via their summaries; WAL files carry no
on-disk timestamp index, so each is row-scanned as a tail. A WAL whose
sequence is already sealed into an SFST is skipped (SFST wins), matching
the live query planner.
- Stream filtering mirrors the planner and the absent==empty stream-identity
contract: exact ServiceStream for SFST, ServiceStream::ns_hash for WAL
(an empty namespace collapses to absent) — so an absent-namespace query
matches the files the ingestor actually named, not only literal-empty ones.
- Window [since, until) accepts now / relative / epoch / a UTC datetime
(timezone offsets rejected; post-2106 errors instead of truncating);
newest-first with --reverse; --filter/--query grammar; --fields projection;
NDJSON output with a stable [key,value] field shape.
Tested with unit + integration tests over real OTAP->WAL->SFST fixtures
(tail-vs-sealed row equivalence, SFST-wins dedup, time pruning, and an
absent-namespace stream-filter regression). Adds an operator usage README
and a durable spec under .agents/sow/specs/.
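The "SFST wins" rule for overlapping sequences might look like the sketch below. The `Source` enum and `plan` function are hypothetical; the real registry and planner types are not shown in the source.

```rust
use std::collections::HashSet;

/// Hypothetical file descriptor, identified only by its sequence number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    Sfst { seq: u64 },
    Wal { seq: u64 },
}

/// Keeps every sealed SFST, plus only those WAL files whose sequence
/// has not already been sealed into an SFST.
fn plan(sfst_seqs: &[u64], wal_seqs: &[u64]) -> Vec<Source> {
    let sealed: HashSet<u64> = sfst_seqs.iter().copied().collect();
    let mut out: Vec<Source> = sfst_seqs.iter().map(|&seq| Source::Sfst { seq }).collect();
    out.extend(
        wal_seqs
            .iter()
            .filter(|seq| !sealed.contains(*seq))
            .map(|&seq| Source::Wal { seq }),
    );
    out
}
```

With SFSTs for sequences 1 and 2 and WAL files for 2 and 3, only the WAL for sequence 3 is scanned as a tail, so no row is returned twice.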
The WAL identified each file only by its ns_hash (in the filename); the (service.namespace, service.name) text lived solely in the frame payloads and was recovered by the indexer at seal time. So an unsealed WAL's stream could not be named without decoding frames, which blocks listing all queryable streams (e.g. a stream selector) for streams not yet sealed into an SFST.
Record the stream in the WAL file header so it is available cheaply (recovery, enumeration) without touching frames:
- format: bump FORMAT_VERSION to 2; FileHeader carries a ServiceStream, written as two length-prefixed UTF-8 fields (each capped at 256 bytes, truncated on a char boundary; display only, the ns_hash partition key is unaffected). from_bytes hard-rejects v1: OTel logs are experimental and WAL files are short-lived, so there is no back-compat (an unsealed v1 WAL is lost on upgrade).
- writer: the per-stream writer carries its ServiceStream and writes it into every file's header and the Created event.
- registry: File carries the stream; apply_event(Created) sets it live, and recover() reads it back from the header on restart.
- ingestor: thread the ServiceStream (already the collision-table key) down to write_frame.
FileEvent::Created (the writer->ledger IPC, which has no version field) gains the stream; safe because the two run in one binary.
Adds header round-trip tests (including an absent namespace and oversize truncation), a v1-rejection test, and a recovery assertion; updates the write_frame call sites.
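A length-prefixed, capped header field of the kind described above could be encoded as in this sketch. The function names and the little-endian u16 prefix are assumptions; the source only says the fields are length-prefixed, capped at 256 bytes, and truncated on a char boundary.

```rust
const MAX_FIELD_BYTES: usize = 256;

/// Truncates `s` to at most `max` bytes without splitting a UTF-8 character.
fn truncate_on_char_boundary(s: &str, max: usize) -> &str {
    if s.len() <= max {
        return s;
    }
    let mut end = max;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Appends one field as a little-endian u16 length followed by the UTF-8 bytes.
/// The u16 prefix width is an assumption made for this sketch.
fn write_field(buf: &mut Vec<u8>, s: &str) {
    let text = truncate_on_char_boundary(s, MAX_FIELD_BYTES);
    buf.extend_from_slice(&(text.len() as u16).to_le_bytes());
    buf.extend_from_slice(text.as_bytes());
}

/// Reads one field back, returning the text and the remaining bytes,
/// or None if the buffer is short or the bytes are not valid UTF-8.
fn read_field(buf: &[u8]) -> Option<(&str, &[u8])> {
    let len = u16::from_le_bytes([*buf.first()?, *buf.get(1)?]) as usize;
    let body = buf.get(2..2 + len)?;
    let text = std::str::from_utf8(body).ok()?;
    Some((text, &buf[2 + len..]))
}
```

Writing the namespace and then the name with `write_field`, and reading them back in the same order, gives the round trip; an absent namespace is simply a zero-length field.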
…ntrol)
Give the live otel-logs function a stream selector modeled on the systemd-journal "sources" control. Picks prune which files the query opens; the default view still spans every stream.
- file_registry::Query: stream filter is now a set of ns_hash values (stream_hashes: Vec<u64>, empty = all) with a shared matches_stream predicate, replacing the single Option<ServiceStream>.
- WAL, SFST and catalog candidate filters match by ns_hash membership (SFST and catalog move off ServiceStream equality), safe under the ingestor's per-(tenant, ns_hash) collision invariant.
- Handler removes the reserved __streams selection before building the engine query (so it prunes files, never row-filters a phantom facet), decodes the hex picks into Query.stream_hashes, and advertises the tenant's streams as a required_params MultiSelection on every response.
- enumerate_streams lists streams window-independent from SFST summaries and WAL File.stream, deduped by ns_hash (SFST wins by seq); an active WAL is sized by valid_up_to since File.size lands only on close.
- Every option is defaultSelected so the mandatory control defaults to all streams (MultiSelectionOption gains defaultSelected).
- sfsq-cli filters by a one-element stream_hashes from --namespace/--name.
Spec otel-stream-identity.md updated: all tiers filter by ns_hash membership, and the former dormant-filter note becomes the live selector contract.
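The membership filter and the hex decoding of picks can be sketched as below. Only the `stream_hashes` field and the `matches_stream` name come from the commit message; the rest of `Query` is omitted, `decode_picks` is a hypothetical helper, and plain hexadecimal without a `0x` prefix is an assumed encoding.

```rust
/// Hypothetical slice of the query type: only the stream filter is modeled.
struct Query {
    /// ns_hash values of the selected streams; empty means all streams.
    stream_hashes: Vec<u64>,
}

impl Query {
    /// Shared predicate: a file matches when no filter is set
    /// or its ns_hash is one of the selected values.
    fn matches_stream(&self, ns_hash: u64) -> bool {
        self.stream_hashes.is_empty() || self.stream_hashes.contains(&ns_hash)
    }
}

/// Decodes hex picks into ns_hash values, failing on the first pick
/// that does not parse as a hexadecimal u64.
fn decode_picks(picks: &[&str]) -> Result<Vec<u64>, std::num::ParseIntError> {
    picks.iter().map(|p| u64::from_str_radix(p, 16)).collect()
}
```

Because an empty set means "all", a request without the selector behaves exactly like one with every option selected, which matches the default view spanning every stream.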
Marking this as ready for review because it's a big one and there are parts that won't change. The … You can use …
@Ferroin I need you to check the build/package-related stuff. IIRC, you mentioned in the past that we should mark packages of removed plugins as deprecated/superseded when we introduce a plugin that replaces them.
@netdata/agent The only non-Rust code change is in src/plugins.d/plugins_d.c, where I've added a deny-list for the …
Remaining work revolves around: …
None of these touch the existing agent/collectors at all.
[1] @ilyam8 last time I tried the deny-list approach, I targeted the …
# Conflicts:
#	.agents/sow/specs/README.md
#	src/crates/Cargo.lock
74 issues, all related to the MCP tooling and in particular to the auto-generated Python code for the wrapped MCP tools of the agent. I'm going to start ignoring them.
The otel/log-viewer removal dropped the otel-signal-viewer-plugin target that carried --cfg=io_uring_skip_arch_check. otel-plugin now pulls io-uring transitively (otel-legacy-logs -> journal-function -> journal-engine -> foyer -> foyer-storage -> io-uring), which has no prebuilt bindings for 32-bit arches, so i386/armhf builds failed the compile-time arch check. Apply the same skip used by netflow-plugin; it is a compile-time, not runtime, dependency.
Summary
Adds the OpenTelemetry logs ingestion/query subsystem and a local build/run MCP automation server. Two commits, split by area:
otel: WAL-backed OpenTelemetry logs ingestion, query, and plugin
- … `otel-plugin` that exposes the `otel-logs` function.
- … `otel-plugin`, `otel-ledger`, `otel-ingestor`, `otel-catalog`, `otel-normalize`, `otel-streams`, `sfst`, `sfsq`, `sfst-indexer`, `wal`, `wal-otap`, `treight`, `file-registry`, `chunk-file`, `fst-index`, `ferryboat`, plus supporting journal crates.
- … `otel-signal-viewer` plugin and the unused `nlogql`, `nlogql-eval`, `sd-compat`, `journal-function` crates; renames the `flatten_otel` / `otel-plugin` trees.
- … `otel-plugin`, install its stock `otel.yaml`, drop signal-viewer wiring.
mcp: build/run automation server for the Netdata Agent
- … (`packaging/tools/automation/mcp`) that lets an LLM configure, build, run, and query the Agent from a worktree to drive the otel-logs subsystem end to end.
- … `otel_config` (rotation/retention knobs), `otel_push` (deterministic synth corpus), `otel_stream_*` (live sources), `otel_logs` (typed query with Cloud-bearer auth).
- … `setup-mcp` CMake convenience target; `mcp-build-run.md` spec; `.env.template` / `ENV.md` claim-token keys.