OpenTelemetry logs ingestion/query subsystem#22720

Open
vkalintiris wants to merge 14 commits into netdata:master from vkalintiris:sjr-pr

vkalintiris commented Jun 15, 2026

Summary

Adds the OpenTelemetry logs ingestion/query subsystem and a local build/run MCP automation server. Two commits, split by area:

  1. otel — WAL-backed OpenTelemetry logs ingestion, query, and plugin

    • OTLP/gRPC ingestion → write-ahead log → SFST index files with a query engine, catalog, and the otel-plugin that exposes the otel-logs function.
    • New crates: otel-plugin, otel-ledger, otel-ingestor, otel-catalog, otel-normalize, otel-streams, sfst, sfsq, sfst-indexer, wal, wal-otap, treight, file-registry, chunk-file, fst-index, ferryboat, plus supporting journal crates.
    • Removes the superseded otel-signal-viewer plugin and the unused nlogql, nlogql-eval, sd-compat, journal-function crates; renames the flatten_otel/otel-plugin trees.
    • CMake/packaging: build otel-plugin, install its stock otel.yaml, drop signal-viewer wiring.
  2. mcp — build/run automation server for the Netdata Agent

    • A local MCP server (packaging/tools/automation/mcp) that lets an LLM configure, build, run, and query the Agent from a worktree to drive the otel-logs subsystem end to end.
    • Build/run lifecycle (declare/start/status/logs/stop), two profiles over one build dir per worktree, Cloud auto-claim for launched agents.
    • otel-logs tooling: otel_config (rotation/retention knobs), otel_push (deterministic synth corpus), otel_stream_* (live sources), otel_logs (typed query with Cloud-bearer auth).
    • A setup-mcp CMake convenience target; mcp-build-run.md spec; .env.template/ENV.md claim-token keys.

Adds the OpenTelemetry logs subsystem and supporting storage engine:
OTLP/gRPC ingestion, a write-ahead log, SFST index files with a query
engine, catalog, and the otel-plugin that exposes the otel-logs function.

- New crates: otel-plugin, otel-ledger, otel-ingestor, otel-catalog,
  otel-normalize, otel-streams, sfst, sfsq, sfst-indexer, wal, wal-otap,
  treight, file-registry, chunk-file, fst-index, ferryboat, and supporting
  journal crates.
- Removes the superseded otel-signal-viewer plugin, the unused nlogql,
  nlogql-eval, sd-compat, and journal-function crates; renames
  netdata-otel/flatten_otel -> flatten-otel and the otel-plugin tree.
- CMake/packaging: build otel-plugin, install its stock otel.yaml, drop
  signal-viewer wiring.

A local MCP server (packaging/tools/automation/mcp) that lets an LLM
configure, build, run, and query the Netdata Agent from a worktree, to
drive the otel-logs subsystem and verify changes end to end.

- Build/run lifecycle (declare/start/status/logs/stop), two profiles over
  one build dir per worktree, Cloud auto-claim for launched agents.
- otel-logs tooling: otel_config (rotation/retention knobs), otel_push
  (deterministic synth corpus), otel_stream_* (live sources), otel_logs
  (typed query with Cloud-bearer auth).
- CMake: a setup-mcp convenience target. Docs: mcp-build-run.md spec,
  .env.template / ENV.md claim-token keys.
@github-actions github-actions Bot added area/packaging Packaging and operating systems support area/docs area/build Build system (autotools and cmake). area/metadata Integrations metadata labels Jun 15, 2026
Comment thread packaging/tools/automation/mcp/tests/test_profiles.py Fixed
@vkalintiris vkalintiris changed the title OpenTelemetry logs ingestion/query subsystem + build/run MCP automation OpenTelemetry logs ingestion/query subsystem Jun 15, 2026
vkalintiris commented Jun 15, 2026

Quality Gate failed

Failed conditions: 6.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

SonarQube complains about duplicate code in packaging/tools/automation/mcp/netdata_mcp/agent_tools.py. However, this file is generated automatically by the agent's own MCP tooling, and I don't see an option in SonarQube's UI to resolve the warning.

CC: @stelfrag @ktsaou

@stelfrag

What do you mean by "resolve it"? There is some duplicate code, or a chance to dedup code, which is why it is raising this.

You can ignore it, but it is there.

@vkalintiris

Yes, it is there and it's fine because it's generated code.

What I'm trying to say is that I can't find any option in SonarQube's UI to mark this as accepted/wont-fix/etc.

@stelfrag

Ok let me check

@stelfrag

The duplication warning can't be cleared (as in, marked resolved). Just ignore it.

ENABLE_PLUGIN_OTEL was a plain option() with no platform guard, so the build
tried to compile otel-plugin on Windows (where it does not build) and macOS
(untested). It is tested only on Linux. Gate it to OS_LINUX — matching the
sibling Linux-only plugins (debugfs, ebpf, cgroup-network) and the plugin's
metadata.yaml supported_platforms.
…tore

The otel/log-viewer removal left the OpenTelemetry docs broken and partly false,
and dropped the live OTLP-metrics integration. Consolidated documentation fixes:

- Fix broken links and remove the false signal-viewer / systemd-journal-files
  prose (logs-tab, working-with-logs, the Learn map OpenTelemetry node, and the
  OpenTelemetry Logs integration).
- Restore the OpenTelemetry metrics integration for src/crates/otel-plugin/:
  recreate metadata.yaml (metrics-only; the old journal-based logs config is
  dropped since logs are a separate integration), register the source in
  integrations/_common.py, and regenerate the integration page + README symlink.
  Flesh out the OpenTelemetry Logs integration prerequisites.
- Remove internal design/plan notes (docs/wal-query-design.md,
  catalog-implementation-plan.md) and scrub their references + 'milestone N'
  language from Rust doc-comments (otel-ledger, sfsq); rewrite the otel-streams
  README. Comment/doc-only — no functional code change.

No functional code is changed in this commit; only docs, comments, integration
metadata, and generated integration pages.
@github-actions github-actions Bot added the area/collectors Everything related to data collection label Jun 15, 2026
Three improvements to the MCP build/run automation server (plus a docs pass):

- buildcfg: reconfigure when the cached CMAKE_INSTALL_PREFIX (path-normalized)
  differs from the canonical install prefix, or is absent/malformed — not only
  when the build type changes. Previously a stale cached prefix made `ninja
  install` write to one tree while the run launched a stale binary from another.
- otel config: add a typed `journal_dir` knob to `netdata_agent_otel_config`,
  emitted as `logs.journal_dir` (WAL/index stay pinned under the run dir; an
  empty string omits the key). This points the read-only `legacy-otel-logs`
  viewer at a journal-file fixture for scripted verification.
- Require a per-agent Cloud bearer on the wrapped `netdata_agent_*` forwarder.
  Agent functions are access-gated (e.g. SIGNED_ID) even on localhost once the
  agent is claimed, so anonymous forwarding returned HTTP 412. The forwarder now
  mints a per-agent bearer (a hard error without NETDATA_CLOUD_TOKEN or on mint
  failure — no anonymous fallback) and attaches it via the SDK's
  create_mcp_http_client factory, which preserves the MCP 30s/300s timeouts; the
  bearer is scrubbed from error strings. This unblocks terminal verification of
  access-gated agent functions.

Tests cover all three paths; the build-run spec is updated to match.
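
The buildcfg reconfigure rule described above can be sketched as follows. The MCP tooling in this PR is Python; the sketch uses Rust only to illustrate the decision. The names `cached_value`, `normalize`, and `needs_reconfigure` are hypothetical, and the sketch assumes a standard `CMakeCache.txt` with `KEY:TYPE=VALUE` lines and purely lexical path normalization.

```rust
use std::path::{Path, PathBuf};

/// Look up `key` in CMakeCache.txt-style text (`KEY:TYPE=VALUE` lines).
fn cached_value<'a>(cache: &'a str, key: &str) -> Option<&'a str> {
    cache.lines().find_map(|line| {
        let (lhs, value) = line.split_once('=')?;
        let (name, _ty) = lhs.split_once(':')?;
        (name == key).then_some(value)
    })
}

/// Lexical normalization only: drops trailing and repeated separators and
/// interior `.` components. It does not resolve `..` or symlinks.
fn normalize(p: &str) -> PathBuf {
    Path::new(p).components().collect()
}

/// Reconfigure when the cached prefix is absent, empty, or differs from the
/// wanted one after normalization, or when the build type changed.
fn needs_reconfigure(cache: &str, want_prefix: &str, want_build_type: &str) -> bool {
    let prefix_ok = cached_value(cache, "CMAKE_INSTALL_PREFIX")
        .filter(|v| !v.is_empty())
        .map(|v| normalize(v) == normalize(want_prefix))
        .unwrap_or(false);
    let type_ok = cached_value(cache, "CMAKE_BUILD_TYPE") == Some(want_build_type);
    !(prefix_ok && type_ok)
}
```

With a cache holding `CMAKE_INSTALL_PREFIX:PATH=/opt/netdata/`, a wanted prefix of `/opt/netdata` does not trigger a reconfigure, while `/opt/other` or a missing entry does.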
The former OpenTelemetry plugin stored logs as systemd journal files, served by
a separate otel-signal-viewer plugin via an `otel-logs` function. The new
WAL/SFST implementation replaced ingestion and removed that viewer, leaving
already-stored logs unviewable. Restore read-only access to them without
re-introducing the removed ingestion or the standalone plugin.

- Restore the `journal-function` crate: the query / facet / histogram /
  UI-formatting stack over journal files.
- Add an `otel-legacy-logs` worker — a third, read-only otel-plugin worker that
  serves a `legacy-otel-logs` function over the former journal files via a new
  `bridge` LegacyLogs IPC protocol. It never writes, prunes, or rotates. The
  journal directory is resolved from the former `logs.journal_dir` in otel.yaml,
  defaulting to <NETDATA_LOG_DIR>/otel/v1 (falling back to /var/log/netdata/otel/v1
  when unset); a malformed config warns rather than silently using the default.
  A distinct function name avoids a nondeterministic registry collision with the
  new ledger's `otel-logs`, and the advertised request params are trimmed to
  those actually honored.
- Fully isolate the worker from the new pipeline: a configure failure, an init
  failure, an absent journal directory, a runtime crash, or an oversize response
  disables only the legacy viewer (degrading to idle, dropped routes, or a
  status-500) and never affects the ingestor/ledger, which keep their fatal
  contract by design.
- Deny launching the obsolete otel-signal-viewer plugin in pluginsd, so a
  leftover binary from before the upgrade cannot re-register `otel-logs` and
  collide with the new ledger. Physical package removal is tracked separately
  (netdata#22728).

Includes unit tests (the GET-args shim and journal-dir resolution) and a durable
spec under .agents/sow/specs/.
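
The journal-directory resolution order described above can be sketched as a pure function. `resolve_journal_dir` and its signature are hypothetical; the sketch also assumes that an empty `logs.journal_dir` is treated as unset, which the commit message does not state for the plugin itself.

```rust
use std::path::PathBuf;

/// Resolve the legacy journal directory.
/// Order: explicit `logs.journal_dir`, then <NETDATA_LOG_DIR>/otel/v1,
/// then /var/log/netdata/otel/v1 when NETDATA_LOG_DIR is unset.
fn resolve_journal_dir(configured: Option<&str>, netdata_log_dir: Option<&str>) -> PathBuf {
    match configured {
        // Assumption: an empty value behaves like a missing key.
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(netdata_log_dir.unwrap_or("/var/log/netdata"))
            .join("otel")
            .join("v1"),
    }
}
```

Passing the environment value as a parameter, rather than reading it inside the function, keeps the resolution testable without touching process state.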
… identity

OpenTelemetry treats a zero-length service.namespace as equal to an
unspecified one, but the logs pipeline encoded stream identity two ways
that disagreed: the ingestor named files with compute_ns_hash(None, name)
for an absent namespace, while the WAL stream filter recomputed
compute_ns_hash(Some(""), name) from the collapsed ServiceStream. The two
never matched, so a stream-filtered query missed every absent-namespace
file.

Centralize identity on a single ServiceStream::ns_hash that maps an empty
field to absent, and route the ingestor, the WAL candidates filter, and
identity through it:

- file-registry: add ServiceStream::ns_hash (empty -> None); keep
  compute_ns_hash as the low-level primitive, documented as not for the
  identity layer.
- otel-ingestor: extract_stream returns a ServiceStream; drop the
  Option-based Stream/CanonicalStream; the per-(tenant, ns_hash) collision
  table carries ServiceStream so absent and literal-empty land in one
  partition.
- wal: Registry::candidates derives the filter hash via ServiceStream::ns_hash,
  fixing the absent-namespace filter miss.

The empty->absent rule keeps the common absent-namespace ns_hash unchanged,
so existing files need no migration and no on-disk/wire format changes; only
a rare literal-empty-namespace file re-partitions. The SFST registry already
filters by ServiceStream equality and is unaffected.

Adds unit and regression tests (absent==empty hashing, the previously-broken
absent-namespace candidate match) and a durable spec under .agents/sow/specs/.
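
A minimal sketch of the empty-to-absent rule follows. The field layout of `ServiceStream` and the use of the standard library's `DefaultHasher` are assumptions made for illustration; the real `compute_ns_hash` algorithm is not shown in this PR description, and `DefaultHasher` is not a stable on-disk hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Low-level primitive: hashes exactly what it is given, so `None` and
/// `Some("")` produce different values. Stand-in hasher, illustration only.
fn compute_ns_hash(namespace: Option<&str>, name: &str) -> u64 {
    let mut h = DefaultHasher::new();
    namespace.hash(&mut h);
    name.hash(&mut h);
    h.finish()
}

/// Assumed shape: the namespace is stored collapsed, with "" meaning absent.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ServiceStream {
    namespace: String,
    name: String,
}

impl ServiceStream {
    /// Identity-layer hash: an empty namespace maps to `None` before hashing,
    /// so absent and literal-empty namespaces share one partition.
    fn ns_hash(&self) -> u64 {
        let ns = if self.namespace.is_empty() {
            None
        } else {
            Some(self.namespace.as_str())
        };
        compute_ns_hash(ns, &self.name)
    }
}
```

Routing every caller through `ServiceStream::ns_hash` means the file name chosen at ingest and the hash recomputed by a stream filter cannot disagree on the absent-namespace case.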
Add a CLI that reads OTel logs directly from on-disk WAL and SFST
directories and queries them through the wire-neutral sfsq engine, with
no running agent — terminal and forensic inspection alongside the live
otel-logs Function.

- Directories resolve per-dir from otel.yaml (explicit --wal-dir/--sfst-dir
  override --config over --stock-config); logs read from {dir}/{tenant}.
  Sealed SFSTs are time-pruned via their summaries; WAL files carry no
  on-disk timestamp index, so each is row-scanned as a tail. A WAL whose
  sequence is already sealed into an SFST is skipped (SFST wins), matching
  the live query planner.
- Stream filtering mirrors the planner and the absent==empty stream-identity
  contract: exact ServiceStream for SFST, ServiceStream::ns_hash for WAL
  (an empty namespace collapses to absent) — so an absent-namespace query
  matches the files the ingestor actually named, not only literal-empty ones.
- Window [since, until) accepts now / relative / epoch / a UTC datetime
  (timezone offsets rejected; post-2106 errors instead of truncating);
  newest-first with --reverse; --filter/--query grammar; --fields projection;
  NDJSON output with a stable [key,value] field shape.
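
The window rules above can be illustrated for the epoch case. This sketch assumes timestamps are held as unsigned 32-bit epoch seconds, which is inferred from the "post-2106" remark and not confirmed by the source; `epoch_bound` and `in_window` are hypothetical names.

```rust
/// Convert epoch seconds to a u32 bound, returning an error instead of
/// silently truncating values past the u32 range (which ends in early 2106).
fn epoch_bound(secs: u64) -> Result<u32, String> {
    if secs > u64::from(u32::MAX) {
        Err(format!("timestamp {} does not fit in 32-bit epoch seconds", secs))
    } else {
        Ok(secs as u32)
    }
}

/// Half-open window [since, until): `since` is included, `until` is not.
fn in_window(ts: u32, since: u32, until: u32) -> bool {
    since <= ts && ts < until
}
```

The half-open form lets adjacent windows tile a time range without counting a boundary row twice.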

Tested with unit + integration tests over real OTAP->WAL->SFST fixtures
(tail-vs-sealed row equivalence, SFST-wins dedup, time pruning, and an
absent-namespace stream-filter regression). Adds an operator usage README
and a durable spec under .agents/sow/specs/.
The WAL identified each file only by its ns_hash (in the filename); the
(service.namespace, service.name) text lived solely in the frame payloads and
was recovered by the indexer at seal time. So an unsealed WAL's stream could
not be named without decoding frames — which blocks listing all queryable
streams (e.g. a stream selector) for streams not yet sealed into an SFST.

Record the stream in the WAL file header so it is available cheaply (recovery,
enumeration) without touching frames:

- format: bump FORMAT_VERSION to 2; FileHeader carries a ServiceStream, written
  as two length-prefixed UTF-8 fields (each capped at 256 bytes, truncated on a
  char boundary — display only; the ns_hash partition key is unaffected).
  from_bytes hard-rejects v1: OTel logs are experimental and WAL files are
  short-lived, so there is no back-compat (an unsealed v1 WAL is lost on upgrade).
- writer: the per-stream writer carries its ServiceStream and writes it into
  every file's header and the Created event.
- registry: File carries the stream; apply_event(Created) sets it live, and
  recover() reads it back from the header on restart.
- ingestor: thread the ServiceStream (already the collision-table key) down to
  write_frame.

FileEvent::Created (the writer->ledger IPC, which has no version field) gains
the stream; safe because the two run in one binary. Adds header round-trip tests
(including an absent namespace and oversize truncation), a v1-rejection test, and
a recovery assertion; updates the write_frame call sites.
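
The header encoding described above can be sketched as follows. The one-byte version field, the little-endian `u16` length prefix, and the `ServiceStream` field layout are assumptions; the real `FileHeader` presumably carries more fields than shown here.

```rust
const FORMAT_VERSION: u8 = 2;
const MAX_FIELD_BYTES: usize = 256;

#[derive(Debug, Clone, PartialEq, Eq)]
struct ServiceStream { namespace: String, name: String }

/// Cut `s` to at most `max` bytes without splitting a UTF-8 character.
fn truncate_on_char_boundary(s: &str, max: usize) -> &str {
    if s.len() <= max { return s; }
    let mut end = max;
    while !s.is_char_boundary(end) { end -= 1; }
    &s[..end]
}

/// Append one length-prefixed UTF-8 field (assumed prefix: u16, little-endian).
fn put_field(buf: &mut Vec<u8>, s: &str) {
    let s = truncate_on_char_boundary(s, MAX_FIELD_BYTES);
    buf.extend_from_slice(&(s.len() as u16).to_le_bytes());
    buf.extend_from_slice(s.as_bytes());
}

fn to_bytes(stream: &ServiceStream) -> Vec<u8> {
    let mut buf = vec![FORMAT_VERSION];
    put_field(&mut buf, &stream.namespace);
    put_field(&mut buf, &stream.name);
    buf
}

fn take_field(bytes: &[u8], pos: &mut usize) -> Result<String, String> {
    let len_bytes = bytes.get(*pos..*pos + 2).ok_or("truncated length prefix")?;
    let len = u16::from_le_bytes([len_bytes[0], len_bytes[1]]) as usize;
    *pos += 2;
    if len > MAX_FIELD_BYTES { return Err("field exceeds cap".into()); }
    let raw = bytes.get(*pos..*pos + len).ok_or("truncated field")?;
    *pos += len;
    String::from_utf8(raw.to_vec()).map_err(|e| e.to_string())
}

/// Hard-rejects any version other than 2, including v1.
fn from_bytes(bytes: &[u8]) -> Result<ServiceStream, String> {
    match bytes.first() {
        Some(&FORMAT_VERSION) => {}
        Some(v) => return Err(format!("unsupported WAL format version {}", v)),
        None => return Err("empty header".into()),
    }
    let mut pos = 1;
    let namespace = take_field(bytes, &mut pos)?;
    let name = take_field(bytes, &mut pos)?;
    Ok(ServiceStream { namespace, name })
}
```

Because the header text is display-only, truncating it does not affect which partition a file belongs to; that is still decided by the ns_hash in the file name.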
…ntrol)

Give the live otel-logs function a stream selector modeled on the
systemd-journal "sources" control. Picks prune which files the query
opens; the default view still spans every stream.

- file_registry::Query: stream filter is now a set of ns_hash values
  (stream_hashes: Vec<u64>, empty = all) with a shared matches_stream
  predicate, replacing the single Option<ServiceStream>.
- WAL, SFST and catalog candidate filters match by ns_hash membership
  (SFST and catalog move off ServiceStream equality), safe under the
  ingestor's per-(tenant, ns_hash) collision invariant.
- Handler removes the reserved __streams selection before building the
  engine query (so it prunes files, never row-filters a phantom facet),
  decodes the hex picks into Query.stream_hashes, and advertises the
  tenant's streams as a required_params MultiSelection on every response.
- enumerate_streams lists streams window-independent from SFST summaries
  and WAL File.stream, deduped by ns_hash (SFST wins by seq); an active
  WAL is sized by valid_up_to since File.size lands only on close.
- Every option is defaultSelected so the mandatory control defaults to
  all streams (MultiSelectionOption gains defaultSelected).
- sfsq-cli filters by a one-element stream_hashes from --namespace/--name.

Spec otel-stream-identity.md updated: all tiers filter by ns_hash
membership, and the former dormant-filter note becomes the live selector
contract.
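
The membership filter and the decoding of selector picks can be sketched as below. `Query` is reduced to the one field named in the commit message; `decode_picks` is a hypothetical helper, and the hex encoding (no `0x` prefix) is an assumption.

```rust
use std::num::ParseIntError;

/// Reduced to the stream filter; the real Query has more fields.
struct Query {
    /// ns_hash values to keep; an empty set means "all streams".
    stream_hashes: Vec<u64>,
}

impl Query {
    /// Shared predicate: a file matches if no streams are picked,
    /// or its ns_hash is one of the picks.
    fn matches_stream(&self, ns_hash: u64) -> bool {
        self.stream_hashes.is_empty() || self.stream_hashes.contains(&ns_hash)
    }
}

/// Decode hex-encoded picks into ns_hash values, failing on the first bad one.
fn decode_picks(picks: &[&str]) -> Result<Vec<u64>, ParseIntError> {
    picks.iter().map(|p| u64::from_str_radix(p, 16)).collect()
}
```

Treating the empty set as "all" is what lets the default view span every stream without enumerating them in the query.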
@vkalintiris vkalintiris marked this pull request as ready for review June 17, 2026 12:41
vkalintiris commented Jun 17, 2026

Marking this as ready for review because it's a big one and there are parts that won't change. The vkalintiris/sjr-pr branch will always stay green/working in case anyone wants to check out my changes and run the otel-plugin locally.

You can use src/crates/otel-streams for real OTel logs from the Bluesky social network and the certificate transparency stream. For the latter, you need to spawn a container because there's no public endpoint we can use: docker run -d --rm --name certstream-server -p 8080:8080 0rickyy0/certstream-server-go.

@Ferroin I need you to check the build/package-related stuff. IIRC, you mentioned in the past that we should mask packages of removed plugins as deprecated/superseded when we introduce a plugin that replaces them.

@netdata/agent The only non-Rust code change is in src/plugins.d/plugins_d.c, where I've added a deny-list for the otel-signal-viewer plugin [1].

Remaining work revolves around:

  • Cleaning up the stock configuration file,
  • Documentation (for which I'll bring in @Ancairon when the time comes),
  • Fetching/caching from remote object storage.

None of these touch the existing agent/collectors at all.

[1] @ilyam8 Last time I tried the deny-list approach, I targeted the systemd-journal plugin. This time it's an OTel-related plugin that will no longer be used by anyone.

Comment thread packaging/tools/automation/mcp/scripts/setup_mcp.py Fixed
# Conflicts:
#	.agents/sow/specs/README.md
#	src/crates/Cargo.lock
@vkalintiris

Quality Gate failed

Failed conditions: 5.9% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

All 74 issues relate to the MCP tooling, in particular the auto-generated Python code for the wrapped MCP tools of the agent. I'm going to start ignoring them.

The otel/log-viewer removal dropped the otel-signal-viewer-plugin target
that carried --cfg=io_uring_skip_arch_check. otel-plugin now pulls io-uring
transitively (otel-legacy-logs -> journal-function -> journal-engine -> foyer
-> foyer-storage -> io-uring), which has no prebuilt bindings for 32-bit
arches, so i386/armhf builds failed the compile-time arch check. Apply the
same skip used by netflow-plugin; it is a compile-time, not runtime,
dependency.
@sonarqubecloud

Quality Gate failed

Failed conditions: 5.9% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Labels

area/build Build system (autotools and cmake). area/collectors Everything related to data collection area/docs area/metadata Integrations metadata area/packaging Packaging and operating systems support area/plugins.d

3 participants