AXON-HLT- Coprocessor Inference Benchmarking

A Python-based proof-of-concept of CERN's SONIC (Services for Optimized Network Inference on Coprocessors) architecture — demonstrating coprocessor offloading for real-time particle physics triggers using NVIDIA Triton Inference Server, ONNX, and gRPC.

Background

At the LHC, protons collide 40 million times per second. The High-Level Trigger (HLT) must reduce this to ~1,000 events per second in real time, under a strict <7 ms latency budget per inference. Modern ML triggers — CNNs, GNNs like ParticleNet — are too slow on CPUs alone and too expensive to provision one GPU per HLT node.

CMS's solution, published in [CMS-MLG-23-001], is SONIC: decouple inference from the trigger CPUs entirely. CPU nodes send gRPC requests to a centralized Triton GPU cluster. The GPU cluster uses dynamic batching to aggregate requests from hundreds of concurrent clients, amortizing per-request overhead and sustaining high throughput even with small individual batches.

This project replicated that architecture in a clean standalone PoC, using the same production tools CMS deploys (Triton 25.12, ONNX, gRPC), against the UCI HIGGS benchmark dataset (28 features, particle classification).

Architecture

Training (offline)
──────────────────
HIGGS dataset (100K rows, 28 features)
  → StandardScaler → train/val split
  → MLP v1 (28→64→32→1) or v2 (28→128→128→64→1)
  → export to ONNX (opset 17, dynamic batch axis)


Inference (online)
──────────────────

  ┌─────────────────────────────────────────┐
  │  HLT simulation (benchmark client)      │
  │  asyncio workers × N concurrency        │
  │  each sends: one FP32[1×28] batch       │
  └────────────────┬────────────────────────┘
                   │  gRPC (port 8001)
                   ▼
  ┌─────────────────────────────────────────┐
  │  NVIDIA Triton Inference Server         │
  │  (Docker, --gpus all)                   │
  │                                         │
  │  Dynamic batching:                      │
  │    max_batch_size: 256                  │
  │    max_queue_delay: 100 µs              │
  │    preferred: [64, 128, 256]            │
  │                                         │
  │  model_repository/                      │
  │    particle_classifier_v1/1/model.onnx  │
  │    particle_classifier_v2/1/model.onnx  │
  └─────────────────────────────────────────┘
                   │
                   ▼
  ┌─────────────────────────────────────────┐
  │  Prometheus metrics (port 8002)         │
  │  GPU utilization, inference counts,     │
  │  queue durations                        │
  └─────────────────────────────────────────┘

Setup

Option 1 — Docker (no Python setup required):

# Pull the benchmark client image
docker pull karandev7/axon-hlt:latest

# Run against any live Triton server
docker run --network host karandev7/axon-hlt \
  --url <triton-host>:8001 \
  --model particle_classifier_v1 \
  --concurrency 64 \
  --duration 30

Option 2 — From source:

Prerequisites: Docker with NVIDIA Container Toolkit, Python 3.10+.

git clone https://github.com/KaranSinghDev/AXON-HLT.git
cd AXON-HLT
make install       # create venv, install tritonclient + deps
make train         # download HIGGS subset, train, export ONNX (~2 min)
make serve         # docker compose up: Triton with dynamic batching
make benchmark     # run 30s gRPC load test at c=64
make plot          # generate plots → results/plots/

To run the A/B comparison (batching ON vs OFF):

# batching ON (already the default)
PYTHONPATH=src venv/bin/python3 scripts/benchmark.py \
  --concurrency 64 --duration 30 \
  --output results/benchmark_batching_on.json

# batching OFF
docker compose down
docker compose -f docker-compose.no-batching.yml up -d
PYTHONPATH=src venv/bin/python3 scripts/benchmark.py \
  --concurrency 64 --duration 30 \
  --output results/benchmark_batching_off.json

# generate plots
PYTHONPATH=src venv/bin/python3 scripts/plot_results.py \
  --batching-on  results/benchmark_batching_on.json \
  --batching-off results/benchmark_batching_off.json \
  --sweep        results/benchmark_sweep.json

Results

All benchmarks ran on: RTX 3060 (12 GB), i7-12700H, 32 GB RAM, WSL2 (Ubuntu 22.04), Triton 25.12-py3, tritonclient 2.64.0.

Dynamic batching A/B — `particle_classifier_v1`, c=64

Metric	Batching ON	Batching OFF
Throughput	10,811 inf/sec	2,893 inf/sec
Latency p50	5.60 ms	21.89 ms
Latency p95	8.70 ms	26.15 ms
Latency p99	10.91 ms	29.01 ms

Dynamic batching delivers 3.7× throughput and 3.9× lower p50 latency at the same concurrency. Without batching, every client request becomes its own GPU kernel launch, saturating the gRPC overhead; with batching, Triton coalesces 64+ requests into a single FP32[64×28] matrix multiply.

Latency CDF with CERN HLT budget

The 7 ms HLT budget (dashed line) is met at the ~85th percentile with dynamic batching enabled at c=64. p50 (5.6 ms) is within budget. p99 (10.9 ms) exceeds it — expected behavior at near-saturation concurrency; CERN's production deployment runs with concurrency headroom to keep p99 in budget.

Concurrency sweep — dynamic batching ON

Concurrency	Throughput	p50 latency	p99 latency
1	816 inf/sec	1.07 ms	2.52 ms
4	2,462 inf/sec	1.45 ms	3.62 ms
16	6,745 inf/sec	2.18 ms	4.70 ms
64	11,701 inf/sec	5.25 ms	8.87 ms
128	11,444 inf/sec	10.82 ms	18.71 ms
256	11,097 inf/sec	22.48 ms	36.54 ms

Throughput saturates around c=64 — the server's GPU pipeline is fully utilized. Beyond that, latency grows with queue depth while throughput holds flat, which is the expected behavior from queueing theory (M/D/1 regime at high load).

The c=1 case (816 inf/sec, p99=2.52ms) is notably low: a single client never fills a preferred batch size, so Triton waits the full 100µs queue delay on every request. This validates the dynamic batching config — it's tuned for concurrent workloads, not sequential ones.

Model depth comparison — v1 vs v2 at c=64

Model	Parameters	Throughput	p50 latency
v1 (28→64→32→1)	~3K	10,811 inf/sec	5.60 ms
v2 (28→128→128→64→1)	~28K	10,743 inf/sec	5.66 ms

v2 (9× more parameters) shows no measurable throughput or latency difference. This is explained in the profiling section below.

Profiling Insights

GPU utilization stayed below 1% throughout all benchmarks. This is not a configuration error — it reflects the compute profile of the workload:

Model size: v1 has ~3K parameters. A single forward pass is ~6K FP32 multiply-adds. At 10K inf/sec with batch size 64, peak theoretical FLOP demand is ~4 GFLOPs/sec — about 0.04% of the RTX 3060's 12.7 TFLOPS.
Bottleneck: the pipeline is I/O-bound on gRPC serialization and deserialization, not compute-bound on the GPU. This is why v1 and v2 perform identically — neither saturates the GPU.
Production context: CERN's SONIC deployment uses larger models (CNNs, GNNs like ParticleNet with ~500K–1M parameters) that are genuinely compute-intensive. The ~122K inf/sec reported in [CMS-MLG-23-001] reflects a multi-GPU server with C++ clients and larger batch sizes. This PoC replicates the architecture and the batching effect; not the production scale.

The I/O-bound result is actually useful: it shows that Triton's gRPC layer is the limiting factor for small models, and that the dynamic batching throughput gains (3.7×) come from reducing the number of kernel launches and amortizing the fixed per-call overhead — exactly the mechanism SONIC describes.

Repository layout

axon/
├── src/axon/
│   ├── model.py          # MLP v1 and v2
│   ├── data.py           # HIGGS downloader + preprocessing
│   ├── export.py         # ONNX export (dynamic batch axis)
│   ├── benchmark.py      # asyncio gRPC load generator
│   ├── client.py         # TritonClient with retry
│   ├── metrics.py        # Prometheus scraper (GPU util, queue depth)
│   └── plot.py           # matplotlib: sweep, A/B, CDF
├── scripts/
│   ├── train.py          # CLI: download → train → export ONNX
│   ├── benchmark.py      # CLI: run benchmark → save JSON
│   └── plot_results.py   # CLI: load JSON → generate plots
├── model_repository/                  # Triton config: dynamic batching ON
├── model_repository_no_batching/      # Triton config: batching OFF (A/B)
├── results/
│   ├── benchmark_batching_on.json
│   ├── benchmark_batching_off.json
│   ├── benchmark_sweep.json
│   ├── benchmark_v2_c64.json
│   └── plots/
├── docker-compose.yml
├── docker-compose.no-batching.yml
├── Makefile
└── tests/

Tests

make test   # runs pytest (20 tests: data, model, ONNX export, schema)

References

CMS Collaboration. Portable acceleration of CMS computing workflows with coprocessors as a service. Computing and Software for Big Science 8, 17 (2024). https://doi.org/10.1007/s41781-024-00124-1 — CMS-MLG-23-001, arXiv:2402.15366
Krupa, J. et al. GPU coprocessors as a service for deep learning inference in high energy physics. Machine Learning: Science and Technology 2, 035005 (2021). https://doi.org/10.1088/2632-2153/abec21
Duarte, J. et al. FPGA-accelerated machine learning inference as a service for particle physics computing. Computing and Software for Big Science 3, 13 (2019). https://doi.org/10.1007/s41781-019-0027-2
Baldi, P., Sadowski, P. & Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nature Communications 5, 4308 (2014). https://doi.org/10.1038/ncomms5308 — UCI HIGGS Dataset, https://archive.ics.uci.edu/ml/datasets/HIGGS
NVIDIA Triton Inference Server

License

Apache 2.0 — see LICENSE.

Disclaimer

This is an independent proof-of-concept project. It is not affiliated with, endorsed by, or sponsored by CERN, the CMS Collaboration, or NVIDIA Corporation. References to SONIC, the CMS High-Level Trigger, NVIDIA Triton, and related trademarks are made for descriptive purposes only and remain the property of their respective owners.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AXON-HLT- Coprocessor Inference Benchmarking

Background

Architecture

Setup

Results

Dynamic batching A/B — `particle_classifier_v1`, c=64

Latency CDF with CERN HLT budget

Concurrency sweep — dynamic batching ON

Model depth comparison — v1 vs v2 at c=64

Profiling Insights

Repository layout

Tests

References

License

Disclaimer

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.github/workflows		.github/workflows
model_repository		model_repository
model_repository_no_batching		model_repository_no_batching
results		results
scripts		scripts
src/axon		src/axon
tests		tests
.gitignore		.gitignore
CITATION.cff		CITATION.cff
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-compose.no-batching.yml		docker-compose.no-batching.yml
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

AXON-HLT- Coprocessor Inference Benchmarking

Background

Architecture

Setup

Results

Dynamic batching A/B — particle_classifier_v1, c=64

Latency CDF with CERN HLT budget

Concurrency sweep — dynamic batching ON

Model depth comparison — v1 vs v2 at c=64

Profiling Insights

Repository layout

Tests

References

License

Disclaimer

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Dynamic batching A/B — `particle_classifier_v1`, c=64

Packages