PR10: Triton/NVRTC/PTX Backend (C++) #10443

Open: agibsonccc wants to merge 12 commits into master from pr/10-triton-nvrtc-ptx-backend
agibsonccc (Contributor) commented Jun 15, 2026

Summary

PR 10 of 22 PRs in the ag_new_release_updates_2 branch split. Merge after Layer 2 (native core ops + helpers).

  • Three JIT paths: Triton (MLIR-based, full register allocation) → NVRTC (runtime CUDA C compile via nvrtcCompileProgram()) → PTX templates (string assembly, no compiler, microseconds)
  • OpCategoryTable.h: Single source of truth mapping 130+ op names to 18-variant TritonOpCategory enum; both Triton and NVRTC consult the same table — new ops must be registered here to get JIT-compiled
  • Triton section fusion: identifySections() groups contiguous element-wise ops into one ELEMENTWISE section; each matmul/attention/reduction/gather/concat/slice/conv gets its own typed section; multi-section kernels use tt.grid_barrier or __threadfence_system()
  • 9 section emitters: emitElementwiseSection(), emitMatmulSection(), emitAttentionSection(), emitGatherSection(), emitConcatSection(), emitSliceSection(), emitTileSection(), emitConvolutionSection(), emitIm2colSection()
  • Multi-vendor targets: TritonTargetDispatch selects NVIDIA PTX, AMD AMDGCN, or Intel SPIR-V based on detected device — same IR targets multiple GPU vendors
  • Async precompilation: TritonGraphBackend_preload.cpp background thread compiles kernels during warmup, eliminating first-token JIT latency on the decode path
  • Capture safety: GpuKernelLauncher checks tl_streamCaptureActive before making any CUDA API call that is illegal during graph capture; it also records per-kernel timing in microseconds via DSP diagnostics
  • NVRTC caching: JitSegmentCacheKey (shapeKey + segmentOpHash + dtypeKey); compiled PTX cached in NvrtcKernelCache; eviction on OOM
  • Fusion scoring: FusionScoring heuristics gate fusion on element count (≥1M), op count (≥2 fusible ops), and op mix ratio
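
The section-fusion rule described above can be illustrated with a small standalone sketch. It is not the PR's identifySections(); the `SectionKind` enum, the `Section` struct, the `kindOf()` classifier, and the handful of op names are hypothetical stand-ins chosen only to show how contiguous element-wise ops collapse into one section while every other op gets a section of its own.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the PR's section types.
enum class SectionKind { Elementwise, Matmul, Attention, Reduction, Other };

struct Section {
    SectionKind kind;
    std::size_t firstOp;  // index of the first op in the section
    std::size_t opCount;  // number of ops fused into the section
};

// Assumed classifier covering only a few illustrative op names.
inline SectionKind kindOf(const std::string& op) {
    if (op == "matmul") return SectionKind::Matmul;
    if (op == "attention") return SectionKind::Attention;
    if (op == "reduce_sum") return SectionKind::Reduction;
    if (op == "add" || op == "mul" || op == "relu" || op == "exp") {
        return SectionKind::Elementwise;
    }
    return SectionKind::Other;
}

// Merge contiguous element-wise ops; every other op gets its own section.
inline std::vector<Section> identifySectionsSketch(const std::vector<std::string>& ops) {
    std::vector<Section> sections;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        const SectionKind kind = kindOf(ops[i]);
        const bool extend = kind == SectionKind::Elementwise && !sections.empty() &&
                            sections.back().kind == SectionKind::Elementwise;
        if (extend) {
            ++sections.back().opCount;  // grow the open element-wise section
        } else {
            sections.push_back({kind, i, 1});
        }
    }
    // Invariant: every op landed in exactly one section.
    std::size_t covered = 0;
    for (const Section& s : sections) covered += s.opCount;
    assert(covered == ops.size());
    return sections;
}
```

For the op list add, relu, matmul, exp, mul, add this yields three sections: an element-wise section of two ops, a matmul section, and an element-wise section of three ops. The multi-section kernel would then need a barrier between each pair, which the sketch does not model.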

What Changed

Triton MLIR backend — graph/gpu/ (19 files)

  • TritonIRBuilder.h — core class: buildModule() entry point; isTritonMappable(opName) via OpCategoryTable; classifySegment(); tile size constants (BLOCK_SIZE=1024 element-wise, BLOCK_M/N=128 BLOCK_K=32 matmul)
  • TritonIRBuilder.cpp — orchestrates: loadInputTensors → identifySections → emit*(section) → finalize; uses mlir::MLIRContext with Triton TTIR, arith, math, SCF dialects
  • TritonIRBuilder_types.h — NVCC-safe types: KernelSectionType enum, KernelSection struct, EpilogueOp enum, TritonIRModule
  • TritonIRBuilder_analysis.cpp: classifySegment() returns SegmentKernelPattern (ELEMENTWISE_DOMINANT, MATMUL_DOMINANT, MIXED, REDUCTION_DOMINANT, ATTENTION_DOMINANT)
  • TritonIRBuilder_sections.cpp: identifySections() fusion pass; all 9 section emitters; computeSectionGrid(); inter-section barriers
  • TritonIRBuilder_emitters.cpp — per-op MLIR value emission: relu→arith.maxf, exp→math.exp, softmax→tt.reduce + exp + reciprocal
  • TritonIRBuilder_kernels.cpp — kernel function construction: pid/num_programs prologue, pointer arithmetic, tl.load, per-element ops, tl.store
  • TritonIRBuilder_module.cpp: mlir::PassManager pipeline running triton-to-triton-gpu, then triton-gpu-to-llvm
  • TritonIRBuilder_cuda.cu — compiles via Triton Python compiler subprocess or embedded C API; loads .cubin via cuModuleLoadData
  • TritonIRBuilder_types.cpp: mapNd4jDtypeToTritonType() for SD_FLOAT/SD_HALF/SD_BFLOAT16/SD_DOUBLE/SD_INT32/SD_INT64
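
To make the tile-size constants concrete, here is a hedged sketch of how a launch grid could be derived from them. The constants come from the list above; the helpers `ceilDiv`, `elementwiseGrid`, and `matmulGrid` are hypothetical and are not claimed to match computeSectionGrid(). The sketch assumes the usual tiling convention in which BLOCK_K only sets the step of the inner reduction loop and therefore does not affect the grid size.

```cpp
#include <cassert>
#include <cstdint>

// Tile sizes quoted in the PR description.
constexpr std::int64_t BLOCK_SIZE = 1024;  // element-wise elements per program
constexpr std::int64_t BLOCK_M = 128;      // matmul output rows per tile
constexpr std::int64_t BLOCK_N = 128;      // matmul output columns per tile

// Integer division rounded up, for non-negative a and positive b.
constexpr std::int64_t ceilDiv(std::int64_t a, std::int64_t b) {
    return (a + b - 1) / b;
}

// Hypothetical helper: one program per BLOCK_SIZE elements.
inline std::int64_t elementwiseGrid(std::int64_t numElements) {
    assert(numElements >= 0);
    return ceilDiv(numElements, BLOCK_SIZE);
}

// Hypothetical helper: one program per BLOCK_M x BLOCK_N output tile.
inline std::int64_t matmulGrid(std::int64_t m, std::int64_t n) {
    assert(m >= 0 && n >= 0);
    return ceilDiv(m, BLOCK_M) * ceilDiv(n, BLOCK_N);
}
```

Under these assumptions a 1,000,000-element element-wise section needs 977 programs, and a 4096 x 4096 matmul output needs 32 x 32 = 1024 programs. The last program of a non-divisible size is only partly full, so the kernel body must mask out-of-range indices.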

Triton backend — graph/gpu/ (12 files)

  • TritonGraphBackend.h/.cpp: GraphBackend subclass; canFuseSegment() calls isTritonMappable() per slot; per-segment LRU cache of compiled modules
  • TritonGraphBackend_compile.cu — invokes TritonIRBuilder, runs MLIR pass pipeline, loads .cubin or .ptx
  • TritonGraphBackend_execute.cu — resolves buffer pointers from NativeSlots, launches via GpuKernelLauncher on DSP stream
  • TritonGraphBackend_cache.cpp — LRU by (shapeKey, segmentId, graphExecutionMode); TritonCacheBundle stores handle + kernel pointer + launch params
  • TritonGraphBackend_preload.cpp — background thread compiles kernels observed during warmup before hot path needs them
  • TritonGraphBackend_binary.cpp — loads/stores pre-compiled .cubin from/to disk cache
  • TritonTargetDispatch.h/.cpp — selects NVIDIA PTX / AMD AMDGCN / Intel SPIR-V based on detected device
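
The LRU cache keyed by (shapeKey, segmentId, graphExecutionMode) can be sketched as follows. This is an illustration of the general technique, not the PR's TritonGraphBackend_cache.cpp: the field types in `CacheKey`, the `BundleSketch` struct standing in for TritonCacheBundle, and the `LruCacheSketch` class are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <tuple>
#include <utility>

// Key fields follow the PR text; the concrete types are assumptions.
// Order: shapeKey, segmentId, graphExecutionMode.
using CacheKey = std::tuple<std::uint64_t, int, int>;

// Hypothetical stand-in for TritonCacheBundle (handle + launch params).
struct BundleSketch {
    int moduleHandle;
    unsigned gridX;
};

class LruCacheSketch {
public:
    explicit LruCacheSketch(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns the cached bundle (or nullptr) and marks it most recently used.
    const BundleSketch* get(const CacheKey& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }

    // Inserts or updates an entry, evicting the least recently used one if full.
    void put(const CacheKey& key, const BundleSketch& bundle) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = bundle;
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, bundle);
        index_[key] = order_.begin();
    }

    std::size_t size() const { return order_.size(); }

private:
    using Entry = std::pair<CacheKey, BundleSketch>;
    std::size_t capacity_;
    std::list<Entry> order_;  // front = most recently used
    std::map<CacheKey, std::list<Entry>::iterator> index_;
};
```

A real backend would also have to unload the evicted GPU module before dropping the entry; the sketch leaves that out because the unload path is not described here.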

NVRTC backend — graph/gpu/ (4 files)

  • NvrtcGraphBackend.h: GraphBackend subclass; cache keyed by JitSegmentCacheKey
  • NvrtcGraphBackend.cu: generateCudaSource() iterates slots, emitting per-op CUDA C expressions; compiles via nvrtcCompileProgram() targeting the detected SM; loads via cuModuleLoadDataEx
  • NvrtcKernelBuilder.h/.cu — maps TritonOpCategory to CUDA math inline: relu→fmaxf(x,0.0f), add→(a+b), softmax→multi-step reduce+normalize
  • NvrtcKernelCache.h/.cu: JitCompiledKernel (CUmodule + CUfunction + launch params); JitSegmentCacheKey; OOM eviction
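
The source-generation step of the NVRTC path can be pictured as plain string assembly on the host. The sketch below is an assumption-laden illustration, not the PR's generateCudaSource(): `emitExpr` and `generateCudaSourceSketch` are hypothetical names, the kernel signature is invented for the example, and only the two mappings quoted above (relu to fmaxf(x,0.0f), add to (a+b)) are implemented.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Maps an op name to a CUDA C expression. Only the two mappings quoted in the
// PR description are covered; anything else is rejected in this sketch.
inline std::string emitExpr(const std::string& op, const std::string& x,
                            const std::string& y) {
    if (op == "relu") return "fmaxf(" + x + ",0.0f)";
    if (op == "add") return "(" + x + "+" + y + ")";
    throw std::invalid_argument("unsupported op in sketch: " + op);
}

// Builds the source of one fused element-wise kernel: v starts as a[i], then
// each op is applied to v in order, with b[i] as the second operand of binary ops.
inline std::string generateCudaSourceSketch(const std::vector<std::string>& ops) {
    assert(!ops.empty());  // a fused segment has at least one op
    std::string src =
        "extern \"C\" __global__ void fused(const float* a, const float* b, "
        "float* out, int n) {\n"
        "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "  if (i >= n) return;\n"
        "  float v = a[i];\n";
    for (const std::string& op : ops) {
        src += "  v = " + emitExpr(op, "v", "b[i]") + ";\n";
    }
    src += "  out[i] = v;\n}\n";
    return src;
}
```

The returned string is what would be handed to NVRTC for compilation. Keeping the intermediate value in a single register-resident variable is what lets the fused kernel read each input once and write the output once.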

PTX template backend — graph/gpu/ (2 files)

  • PtxGraphBackend.h: GraphBackend subclass; no compiler invocation, direct PTX string generation
  • PtxGraphBackend.cu: generatePtxSource() constructs .global/.func PTX via string templates; loads via cuModuleLoadDataEx for driver JIT PTX→SASS

Shared GPU infrastructure — graph/gpu/ (9 files)

  • OpCategoryTable.h — SSOT: 130+ op name strings → TritonOpCategory; covers BINARY_ELEMENTWISE (add/sub/mul/div/min/max/mod and TF aliases), UNARY_ELEMENTWISE (relu/sigmoid/tanh/gelu/exp/log/sqrt/abs/silu and 20+ more), COMPARISON, LOGICAL, TERNARY (where/select), IDENTITY, MATMUL, REDUCTION, NORMALIZATION, CAST, FUSED_ATTENTION, SHAPE_MANIPULATION, DATA_MOVEMENT, CONSTANT_GENERATION, CONVOLUTION, ROPE, FUSED_LLM, UNSUPPORTED
  • JitGraphBackendCommon.h/.cu — shared types and utilities for NVRTC and PTX backends
  • GpuKernelLauncher.h/.cu — unified kernel launch: CUfunction + grid/block dims + stream; capture-safe; DSP TIMING diagnostics integration
  • CaptureBufferRegistry.h/.cu — per-segment GPU buffer registry; validates pointer stability before CUDA graph replay
  • FusionScoring.h/.cpp — fusion heuristics: fuse ≥1M elements, ≥2 fusible ops, shape compatibility
  • SymbolicShapeRanges.h — min/max shape bound propagation for ahead-of-time segment planning
  • SectionTypeConfig.h — flags enabling/disabling individual KernelSectionTypes
  • ViewRecipe.h — logical view relationship expressed as pointer arithmetic rather than a kernel launch
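
As a rough illustration of the fusion gate listed above, the following sketch checks the two stated thresholds plus shape compatibility. It is not the PR's FusionScoring implementation: `SegmentStatsSketch` and `shouldFuseSketch` are hypothetical, "1M" is assumed to mean 1,000,000 elements, and any scoring beyond these three checks is not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Thresholds quoted in the PR description ("1M" assumed to be 1,000,000).
constexpr std::int64_t kMinElements = 1000000;
constexpr int kMinFusibleOps = 2;

// Hypothetical summary of a candidate segment.
struct SegmentStatsSketch {
    std::int64_t elementCount;  // elements processed per op
    int fusibleOps;             // number of ops eligible for fusion
    bool shapesCompatible;      // all ops share a compatible iteration shape
};

// Fuse only when the segment is large enough, has enough fusible ops,
// and its shapes are compatible.
inline bool shouldFuseSketch(const SegmentStatsSketch& s) {
    assert(s.elementCount >= 0 && s.fusibleOps >= 0);
    return s.shapesCompatible && s.elementCount >= kMinElements &&
           s.fusibleOps >= kMinFusibleOps;
}
```

The intuition behind such a gate is that fusion saves intermediate memory traffic in proportion to element count, so small tensors or single-op segments may not repay the JIT compile cost.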

Dependencies

  • Depends on: PR05, PR06, PR07 (NDArray, DataBuffer, LaunchDims); PR09 (DSP graph engine — NativeDynamicShapePlan.h, GraphBackend.h, GraphSegment types)
  • Required by: PR09 (NativeDynamicShapePlan_gpubackend.cpp includes TritonGraphBackend.h, NvrtcGraphBackend.h, PtxGraphBackend.h)

Merge Order

These 22 PRs must merge in layer order. Each layer depends on the layers above it being merged first. PRs within the same layer are independent and can merge in parallel.

This PR: Merge after Layer 2 (native core ops + helpers).

| Layer | PRs |
| --- | --- |
| 0 (no deps) | PR01, PR02, PR20 |
| 1 (build/infra) | PR03, PR04 |
| 2 (native core) | PR05, PR06, PR07 |
| 3 (native feat) | PR08, PR09, PR10, PR11 |
| 4 (java core) | PR12, PR13, PR14, PR15 |
| 5 (java feat) | PR16 |
| 6 (import/gen) | PR17, PR18, PR19, PR21 |
| 7 (validation) | PR22 |

Part of the 22-PR split of ag_new_release_updates_2 branch.
Merge layer: 3 (native features)
Files: 44

See pr-plans/00-master-plan.md for the full split plan and merge order.

Copilot AI left a comment

Pull request overview

Adds native GPU JIT compilation backends for the graph executor, establishing a Triton-first pipeline with NVRTC and PTX-template fallbacks plus supporting infrastructure (fusion scoring, module residency/LRU, preload, capture-safe buffer tracking, and view-recipe metadata).

Changes:

  • Introduces Triton multi-target dispatch + Triton IR builder core types/utilities for segment compilation and launch planning.
  • Adds NVRTC and PTX-template graph backends for fused elementwise segments, sharing common cache/execute logic.
  • Adds supporting GPU infrastructure: section-type configuration + fusion scoring, capture buffer registry, symbolic shape range profiling, and view-recipe structures.

Reviewed changes

Copilot reviewed 41 out of 44 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| libnd4j/include/graph/gpu/ViewRecipe.h | Defines a compact representation for view-producing ops to avoid kernel launches during replay. |
| libnd4j/include/graph/gpu/TritonTargetDispatch.h | Declares runtime target detection + compile/load/launch interface for Triton outputs. |
| libnd4j/include/graph/gpu/TritonTargetDispatch.cpp | Implements Triton compilation pipeline and target-specific module load/launch. |
| libnd4j/include/graph/gpu/TritonIRBuilder_types.h | Defines NVCC-safe Triton IR builder data structures (sections, args, launch phases). |
| libnd4j/include/graph/gpu/TritonIRBuilder_types.cpp | Implements MLIR type mapping + constant emission helpers for Triton IR builder. |
| libnd4j/include/graph/gpu/TritonIRBuilder_cuda.cu | CUDA-only helper(s) for cooperative launch sizing. |
| libnd4j/include/graph/gpu/TritonGraphBackend.cpp | Triton backend core: singleton/config, availability, segment fusibility gating. |
| libnd4j/include/graph/gpu/TritonGraphBackend_preload.cpp | Batch-preload path to ensure compiled modules are resident ahead of replay. |
| libnd4j/include/graph/gpu/TritonGraphBackend_lru.cpp | Module residency tracking + LRU eviction/reload plumbing. |
| libnd4j/include/graph/gpu/TritonGraphBackend_internal.h | Shared internal helpers (hashing, slot resolution, CUDA helpers). |
| libnd4j/include/graph/gpu/SymbolicShapeRanges.h | API for collecting shape range profiles and computing range-based shape keys. |
| libnd4j/include/graph/gpu/SectionTypeConfig.h | Central table/logic describing compilation/fusion behavior per section type. |
| libnd4j/include/graph/gpu/FusionScoring.h | Declares the fusion scoring heuristic used for section-range extension decisions. |
| libnd4j/include/graph/gpu/FusionScoring.cpp | Implements fusion scoring (grid compatibility, memory traffic savings, penalties/bonuses). |
| libnd4j/include/graph/gpu/GpuKernelLauncher.h | Declares CUDA-driver module load/function lookup/kernel launch helpers for JIT backends. |
| libnd4j/include/graph/gpu/GpuKernelLauncher.cu | Implements CUDA-driver load/launch/unload helpers with JIT log capture on failure. |
| libnd4j/include/graph/gpu/JitGraphBackendCommon.h | Shared cache/execute/invalidate primitives for NVRTC/PTX JIT backends. |
| libnd4j/include/graph/gpu/JitGraphBackendCommon.cu | Implements shared fusibility check + kernel launch argument packing and execution. |
| libnd4j/include/graph/gpu/NvrtcGraphBackend.h | Declares NVRTC backend entry points and per-segment cache. |
| libnd4j/include/graph/gpu/NvrtcGraphBackend.cu | Implements CUDA C source generation + NVRTC compilation + shared execute path. |
| libnd4j/include/graph/gpu/PtxGraphBackend.h | Declares PTX-template backend entry points and per-segment cache. |
| libnd4j/include/graph/gpu/PtxGraphBackend.cu | Implements PTX text generation and shared execute path. |
| libnd4j/include/graph/gpu/NvrtcKernelBuilder.h | Declares an alternate NVRTC kernel source builder API (not referenced elsewhere in this PR snapshot). |
| libnd4j/include/graph/gpu/NvrtcKernelCache.h | Declares an alternate NVRTC kernel cache/handle API (not referenced elsewhere in this PR snapshot). |
| libnd4j/include/graph/gpu/NvrtcKernelCache.cu | Implements the alternate NVRTC kernel cache/handle API. |
| libnd4j/include/graph/gpu/CaptureBufferRegistry.h | Declares capture-workspace allocation tracking by segment. |
| libnd4j/include/graph/gpu/CaptureBufferRegistry.cu | Implements capture buffer registry backed by CudaMemoryPool with cross-device rejection. |
| libnd4j/include/graph/gpu/TritonGraphBackend_compile.cu | Segment compilation path (also contains dtype checks used in compile-time analysis). |
| libnd4j/include/graph/gpu/TritonIRBuilder_module.cpp | Triton module-building pipeline (includes precision emulation logic). |

Comment thread libnd4j/include/graph/gpu/FusionScoring.cpp
Comment thread libnd4j/include/graph/gpu/NvrtcKernelCache.cu Outdated
Comment thread libnd4j/include/graph/gpu/NvrtcKernelCache.cu Outdated
Comment thread libnd4j/include/graph/gpu/TritonGraphBackend_internal.h
Comment on lines +115 to +119
    if (rank != sliceRank) return false;
    for (int i = 0; i < rank; i++) {
        if (outputShape[i] != sliceBegin[i]) return false;  // sliceBegin reused for input shape
    }
    return true;
Comment thread libnd4j/include/graph/gpu/TritonGraphBackend_compile.cu
Comment thread libnd4j/include/graph/gpu/TritonIRBuilder_module.cpp
- Remove hardcoded startSlot==347 debug PTX dump from compileToGpuBinary()
- Gate TTIR file dumps in buildModule() and buildSectionedModule() behind isDebug()
- Gate fprintf(stderr) error messages in GpuKernelLauncher behind isDebug()
@agibsonccc

Contributor Author

Architecture Overview

This PR implements the Triton/NVRTC/PTX JIT compilation backends — three progressively simpler GPU kernel generation paths that the DSP engine selects from based on availability and op compatibility. Triton generates optimal MLIR-based kernels with full register allocation; NVRTC compiles runtime CUDA C; PTX templates emit string assembly with zero compiler overhead.

Highlights

  • Triton MLIR kernel generation with section fusion: TritonIRBuilder groups contiguous ops into typed sections (9 emitters: elementwise, matmul, attention, gather, concat, slice, tile, conv, im2col) with tt.grid_barrier between sections; TritonTargetDispatch selects NVIDIA PTX, AMD AMDGCN, or Intel SPIR-V from the same IR, enabling multi-vendor GPU support
  • OpCategoryTable: 130+ op SSOT for JIT eligibility — single header maps op names to 18-variant TritonOpCategory enum; both Triton and NVRTC backends consult this table — new ops must be registered here to be JIT-compiled; FusionScoring heuristics gate fusion on element count (≥1M), op count (≥2), and op mix ratio
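
The single-source-of-truth idea behind OpCategoryTable can be sketched as one lookup that every backend shares. The code below is a hypothetical miniature, not the PR's header: it lists only five of the 18 categories, the "matmul" op name is an assumption, and treating an unregistered name as UNSUPPORTED is likewise an assumption made for the example.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// A few of the 18 categories named in the PR; the rest are omitted here.
enum class TritonOpCategory {
    BINARY_ELEMENTWISE,
    UNARY_ELEMENTWISE,
    MATMUL,
    REDUCTION,
    UNSUPPORTED
};

// Hypothetical miniature of the table: op name -> category.
inline const std::unordered_map<std::string, TritonOpCategory>& opCategoryTable() {
    static const std::unordered_map<std::string, TritonOpCategory> table = {
        {"add", TritonOpCategory::BINARY_ELEMENTWISE},
        {"sub", TritonOpCategory::BINARY_ELEMENTWISE},
        {"mul", TritonOpCategory::BINARY_ELEMENTWISE},
        {"relu", TritonOpCategory::UNARY_ELEMENTWISE},
        {"exp", TritonOpCategory::UNARY_ELEMENTWISE},
        {"matmul", TritonOpCategory::MATMUL},  // assumed op name
    };
    return table;
}

// Assumption: names missing from the table are treated as UNSUPPORTED.
inline TritonOpCategory categoryOf(const std::string& opName) {
    assert(!opName.empty());
    const auto& table = opCategoryTable();
    const auto it = table.find(opName);
    return it == table.end() ? TritonOpCategory::UNSUPPORTED : it->second;
}

// Both JIT backends would ask the same question through the same table.
inline bool isJitEligible(const std::string& opName) {
    return categoryOf(opName) != TritonOpCategory::UNSUPPORTED;
}
```

Because eligibility is derived from the table alone, registering a new op in one place is enough for every backend that consults it, which matches the registration requirement stated above.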

agibsonccc and others added 2 commits June 16, 2026 16:04
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
agibsonccc and others added 7 commits June 16, 2026 16:05
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Apply THROW_EXCEPTION macro consolidation to Triton backend files.
Apply THROW_EXCEPTION macro consolidation to Triton backend files.