Skip to content

[Feat] Integrate ucm metrics into vLLM connector metrics#995

Open
dante159753 wants to merge 2 commits into
ModelEngine-Group:developfrom
dante159753:ucm-vllm-connector-metrics
Open

[Feat] Integrate ucm metrics into vLLM connector metrics#995
dante159753 wants to merge 2 commits into
ModelEngine-Group:developfrom
dante159753:ucm-vllm-connector-metrics

Conversation

@dante159753

@dante159753 dante159753 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add Python-side UCM metrics dispatching so existing UCM metrics can feed both multiproc and vLLM connector consumers.
  • Expose UCM connector stats through vLLM KV connector metrics with vllm:-prefixed Prometheus names and worker/engine labels.
  • Add and refine Grafana dashboards for UCM vLLM metrics, including direct connector task breakdowns, layerwise batch breakdowns, engine/worker filtering, and aggregated/per-worker views.
  • Keep the dashboard identity isolated from the existing origin dashboards so both sets can coexist before merge.

Why

The previous multiproc metrics path only covered the local multiproc exporter cleanly and did not fit vLLM's KV connector metrics snapshot path. This change lets UCM metrics be consumed through vLLM-native metric collection while preserving the existing multiproc path and making multi-worker/multi-engine deployments easier to inspect.

Impact

  • Users can view UCM connector, layerwise, cache/store, and vLLM overview metrics through the vLLM-prefixed dashboards.
  • Aggregated dashboard mode now aggregates across engines, while Per Worker mode keeps engine and worker_rank dimensions for disambiguation.
  • Existing ucm: multiproc metric output remains available through the existing configuration path.

Verification

test using jenkins pipeline, no performance impact, metrics works fine.

Comment thread ucm/metrics_config.py
Comment thread ucm/metrics_dispatcher.py
Comment thread ucm/metrics_dispatcher.py
Comment thread ucm/integration/vllm/metrics.py Outdated
Comment thread ucm/metrics_dispatcher.py
@dante159753 dante159753 requested a review from flesher0813 as a code owner June 5, 2026 06:38
@dante159753 dante159753 changed the title [Feat] Add vLLM-native UCM connector metrics [Feat] Add vLLM UCM connector metrics dashboards Jun 9, 2026
{
if (this->logger_) { return this->logger_; }
std::lock_guard<std::mutex> lg(this->mutex_);
if (this->logger_) { return this->logger_; }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 L4 Performance Consideration: The InternSourceString function uses a static mutex for protecting the string pool. In high-frequency logging scenarios (many log statements per millisecond), this mutex could become a contention bottleneck since every log call with a new source location needs to intern the string.

Consider alternatives:

  1. Use a concurrent hash map (e.g., folly::ConcurrentHashMap or similar)
  2. Pre-intern common source locations at initialization
  3. Use lock-free string interning with atomic pointers

For typical UCM usage this may not be critical, but worth documenting if high-frequency logging is expected.

Comment thread ucm/metrics_dispatcher.py
if len(current_counts) < len(bucket_counts):
current_counts.extend([0] * (len(bucket_counts) - len(current_counts)))
for index, count in enumerate(bucket_counts):
current_counts[index] += int(count)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Previous Concern Still Unaddressed: The int(count) conversion on very large float values could have precision issues. For example, if count is a float like 1e15, int(count) would work, but intermediate floating-point operations could introduce rounding errors.

Consider using int(round(count)) or validating that count is within reasonable bounds (e.g., abs(count) < 1e12) before conversion. This is especially important for histogram bucket counts which should always be integers.

Comment thread ucm/metrics_dispatcher.py
if isinstance(value, (tuple, list)) and len(value) == 2:
bucket_counts, sum_delta = value
return list(bucket_counts), float(sum_delta)
return list(getattr(value, "bucketCounts", [])), float(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Previous Concern Still Unaddressed: The _histogram_tuple function returns empty list [] when the histogram value format is unrecognized (line 137-139). This silently drops histogram data without any warning.

Consider adding a warning log:

if not bucket_counts:
    logger.warning(f"Unrecognized histogram format: {type(value)}, data dropped")

This helps diagnose configuration issues or unexpected metric formats.

Comment on lines +119 to +121
auto cbStatus = stream.AppendCallback([eventReadyTp](bool) {
eventReadyTp->store(NowTime::Now(), std::memory_order_relaxed);
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ L3 Issue - Potential Race Condition: The callback stores the timestamp using memory_order_relaxed, which may not provide sufficient synchronization guarantees. The callback could be executed on a different thread (e.g., driver/stream callback thread) while the main thread reads the value. Consider using memory_order_release in the store and memory_order_acquire in the load:

eventReadyTp->store(NowTime::Now(), std::memory_order_release);
// ...
auto ready = eventReadyTp->load(std::memory_order_acquire);

While relaxed is technically correct for a single atomic value, the timing semantics (splitting compute-wait vs D2H) rely on precise ordering which relaxed may not guarantee.

Comment on lines 168 to +177
}
const auto copiedShards = holder_.size() + 1;
s = stream.Synchronize();
auto h2dSyncMs = (NowTime::Now() - tpH2dSubmitted) * 1e3;
UC::Metrics::UpdateStats(NAME_TO_METRIC_ID("cache_h2d_sync_ms"), h2dSyncMs);
if (copiedShards > 0 && h2dSyncMs > 0.0) {
auto copiedBytes = static_cast<double>(copiedShards) * static_cast<double>(shardBytes_);
UC::Metrics::UpdateStats(NAME_TO_METRIC_ID("cache_h2d_bandwidth_gbps"),
copiedBytes / (h2dSyncMs * 1e-3) / 1e9);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 L4 Observation: The bandwidth calculation copiedShards = holder_.size() + 1 assumes all tasks in holder_ are from the same batch and have identical shard sizes. However, holder_ accumulates tasks across multiple TransferOneTask calls before synchronization. If different tasks have different sizes, the total bytes calculation could be inaccurate.

Consider:

  1. Tracking actual bytes per task
  2. Adding a comment explaining that shardBytes_ is assumed constant for all tasks in holder_

@ygwpz

ygwpz commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Follow-up Review Summary (2026-06-13)

This follow-up review focused on the most recent changes since the last review on 2026-06-05, particularly:

Previous Concerns Status

  • ucm/metrics_config.py:55 (PyYAML silent failure) - Author addressed with error logging. Accepted.
  • ucm/integration/vllm/metrics.py (hardcoded reduce()) - Method now returns empty dict. Accepted.
  • ucm/metrics_dispatcher.py:40 (singleton reset) - Author declined. Accepted given use case.
  • ucm/metrics_dispatcher.py:128 (int conversion) - Still unaddressed, posted follow-up comment.
  • ucm/metrics_dispatcher.py:137 (silent histogram drop) - Still unaddressed, posted follow-up comment.

New Issues Found

  • L3: Potential race condition in dump_queue.cc callback timing mechanism (memory_order_relaxed)
  • L4: Bandwidth calculation assumption in load_queue.cc may be inaccurate for variable-sized batches
  • L4: String interning mutex contention in spdlog_logger.cc could impact high-frequency logging

Inline comments have been posted for these issues.

Overall Assessment

The logger redesign addresses the self-reference and memory issues mentioned in the commit. The new bandwidth metrics provide useful visibility into H2D/D2H performance. The main concerns are around memory ordering in the callback-based timing and the assumptions in bandwidth calculations.

@dante159753 dante159753 changed the title [Feat] Add vLLM UCM connector metrics dashboards [Feat] Integrate ucm metrics into vLLM connector metrics Jun 18, 2026
ygwpz
ygwpz previously approved these changes Jun 18, 2026

@ygwpz ygwpz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New changes look good - comprehensive metrics integration with proper consumer architecture and updated Grafana dashboards. The metrics dispatcher pattern allows proper fanout to multiple consumers without clearing data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants