PR10: Triton/NVRTC/PTX Backend (C++) #10443
Part of the 22-PR split of the `ag_new_release_updates_2` branch. Merge layer: 3 (native features). Files: 44. See `pr-plans/00-master-plan.md` for the full split plan and merge order.
Pull request overview
Adds native GPU JIT compilation backends for the graph executor, establishing a Triton-first pipeline with NVRTC and PTX-template fallbacks plus supporting infrastructure (fusion scoring, module residency/LRU, preload, capture-safe buffer tracking, and view-recipe metadata).
Changes:
- Introduces Triton multi-target dispatch + Triton IR builder core types/utilities for segment compilation and launch planning.
- Adds NVRTC and PTX-template graph backends for fused elementwise segments, sharing common cache/execute logic.
- Adds supporting GPU infrastructure: section-type configuration + fusion scoring, capture buffer registry, symbolic shape range profiling, and view-recipe structures.
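
As a rough illustration of the Triton-first pipeline with NVRTC and PTX-template fallbacks described above, the tier selection might be sketched as follows. The names `JitTier`, `TierProbe`, `makeProbe`, and `selectTier` are hypothetical and do not appear in this PR; the actual backends expose their own availability and fusibility checks, and the supported-op sets here are placeholders.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical tier identifiers, listed from most to least preferred.
enum class JitTier { Triton, Nvrtc, PtxTemplate, None };

// Hypothetical probe: answers whether one tier can compile a whole segment.
struct TierProbe {
  JitTier tier;
  std::function<bool(const std::vector<std::string>&)> canHandle;
};

// Build a probe that accepts a segment only if the tier is available and
// every op name in the segment is in the tier's supported set.
inline TierProbe makeProbe(JitTier tier, bool available,
                           std::set<std::string> supported) {
  return TierProbe{
      tier, [available, supported](const std::vector<std::string>& ops) {
        if (!available) return false;
        for (const auto& op : ops) {
          if (supported.count(op) == 0) return false;
        }
        return true;
      }};
}

// Walk the probes in priority order and return the first tier that accepts
// the segment. JitTier::None means no JIT tier applies to this segment.
inline JitTier selectTier(const std::vector<TierProbe>& probes,
                          const std::vector<std::string>& segmentOps) {
  for (const auto& probe : probes) {
    if (probe.canHandle && probe.canHandle(segmentOps)) return probe.tier;
  }
  return JitTier::None;
}
```

With probes ordered Triton, NVRTC, PTX template, a segment falls through to the next tier only when an earlier tier is unavailable or rejects one of its ops, which keeps the priority order in one place.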
Reviewed changes
Copilot reviewed 41 out of 44 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| libnd4j/include/graph/gpu/ViewRecipe.h | Defines a compact representation for view-producing ops to avoid kernel launches during replay. |
| libnd4j/include/graph/gpu/TritonTargetDispatch.h | Declares runtime target detection + compile/load/launch interface for Triton outputs. |
| libnd4j/include/graph/gpu/TritonTargetDispatch.cpp | Implements Triton compilation pipeline and target-specific module load/launch. |
| libnd4j/include/graph/gpu/TritonIRBuilder_types.h | Defines NVCC-safe Triton IR builder data structures (sections, args, launch phases). |
| libnd4j/include/graph/gpu/TritonIRBuilder_types.cpp | Implements MLIR type mapping + constant emission helpers for Triton IR builder. |
| libnd4j/include/graph/gpu/TritonIRBuilder_cuda.cu | CUDA-only helper(s) for cooperative launch sizing. |
| libnd4j/include/graph/gpu/TritonGraphBackend.cpp | Triton backend core: singleton/config, availability, segment fusibility gating. |
| libnd4j/include/graph/gpu/TritonGraphBackend_preload.cpp | Batch-preload path to ensure compiled modules are resident ahead of replay. |
| libnd4j/include/graph/gpu/TritonGraphBackend_lru.cpp | Module residency tracking + LRU eviction/reload plumbing. |
| libnd4j/include/graph/gpu/TritonGraphBackend_internal.h | Shared internal helpers (hashing, slot resolution, CUDA helpers). |
| libnd4j/include/graph/gpu/SymbolicShapeRanges.h | API for collecting shape range profiles and computing range-based shape keys. |
| libnd4j/include/graph/gpu/SectionTypeConfig.h | Central table/logic describing compilation/fusion behavior per section type. |
| libnd4j/include/graph/gpu/FusionScoring.h | Declares the fusion scoring heuristic used for section-range extension decisions. |
| libnd4j/include/graph/gpu/FusionScoring.cpp | Implements fusion scoring (grid compatibility, memory traffic savings, penalties/bonuses). |
| libnd4j/include/graph/gpu/GpuKernelLauncher.h | Declares CUDA-driver module load/function lookup/kernel launch helpers for JIT backends. |
| libnd4j/include/graph/gpu/GpuKernelLauncher.cu | Implements CUDA-driver load/launch/unload helpers with JIT log capture on failure. |
| libnd4j/include/graph/gpu/JitGraphBackendCommon.h | Shared cache/execute/invalidate primitives for NVRTC/PTX JIT backends. |
| libnd4j/include/graph/gpu/JitGraphBackendCommon.cu | Implements shared fusibility check + kernel launch argument packing and execution. |
| libnd4j/include/graph/gpu/NvrtcGraphBackend.h | Declares NVRTC backend entry points and per-segment cache. |
| libnd4j/include/graph/gpu/NvrtcGraphBackend.cu | Implements CUDA C source generation + NVRTC compilation + shared execute path. |
| libnd4j/include/graph/gpu/PtxGraphBackend.h | Declares PTX-template backend entry points and per-segment cache. |
| libnd4j/include/graph/gpu/PtxGraphBackend.cu | Implements PTX text generation and shared execute path. |
| libnd4j/include/graph/gpu/NvrtcKernelBuilder.h | Declares an alternate NVRTC kernel source builder API (not referenced elsewhere in this PR snapshot). |
| libnd4j/include/graph/gpu/NvrtcKernelCache.h | Declares an alternate NVRTC kernel cache/handle API (not referenced elsewhere in this PR snapshot). |
| libnd4j/include/graph/gpu/NvrtcKernelCache.cu | Implements the alternate NVRTC kernel cache/handle API. |
| libnd4j/include/graph/gpu/CaptureBufferRegistry.h | Declares capture-workspace allocation tracking by segment. |
| libnd4j/include/graph/gpu/CaptureBufferRegistry.cu | Implements capture buffer registry backed by CudaMemoryPool with cross-device rejection. |
| libnd4j/include/graph/gpu/TritonGraphBackend_compile.cu | Segment compilation path (also contains dtype checks used in compile-time analysis). |
| libnd4j/include/graph/gpu/TritonIRBuilder_module.cpp | Triton module-building pipeline (includes precision emulation logic). |
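
The module residency tracking and LRU eviction attributed to `TritonGraphBackend_lru.cpp` in the table can be pictured with the minimal sketch below. It is an assumption-laden illustration, not the PR's code: `ModuleResidencyLru` and its methods are hypothetical names, and plain integer keys stand in for compiled-module handles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

// Hypothetical residency tracker; keys stand in for compiled-module handles.
class ModuleResidencyLru {
 public:
  explicit ModuleResidencyLru(std::size_t capacity) : capacity_(capacity) {}

  // Mark `key` as most recently used, making it resident if it was absent.
  // Returns the keys evicted to stay within capacity; a real backend would
  // unload the corresponding GPU modules at this point.
  std::vector<std::uint64_t> touch(std::uint64_t key) {
    std::vector<std::uint64_t> evicted;
    auto it = index_.find(key);
    if (it != index_.end()) {
      // Already resident: move the node to the front without reallocating.
      order_.splice(order_.begin(), order_, it->second);
      return evicted;
    }
    order_.push_front(key);
    index_[key] = order_.begin();
    while (order_.size() > capacity_) {
      const std::uint64_t victim = order_.back();  // least recently used
      order_.pop_back();
      index_.erase(victim);
      evicted.push_back(victim);
    }
    return evicted;
  }

  bool isResident(std::uint64_t key) const { return index_.count(key) != 0; }
  std::size_t size() const { return order_.size(); }

 private:
  std::size_t capacity_;
  std::list<std::uint64_t> order_;  // front = most recently used
  std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> index_;
};
```

Returning the evicted keys from `touch` leaves the unload (and any later reload) decision to the caller, which is one way to keep the bookkeeping separate from driver calls.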

    if (rank != sliceRank) return false;
    for (int i = 0; i < rank; i++) {
      if (outputShape[i] != sliceBegin[i]) return false;  // sliceBegin reused for input shape
    }
    return true;

- Remove hardcoded `startSlot==347` debug PTX dump from `compileToGpuBinary()`
- Gate TTIR file dumps in `buildModule()` and `buildSectionedModule()` behind `isDebug()`
- Gate `fprintf(stderr)` error messages in `GpuKernelLauncher` behind `isDebug()`
Architecture Overview

This PR implements the Triton/NVRTC/PTX JIT compilation backends: three progressively simpler GPU kernel generation paths that the DSP engine selects from based on availability and op compatibility. Triton generates optimal MLIR-based kernels with full register allocation; NVRTC compiles runtime CUDA C; PTX templates emit string assembly with zero compiler overhead.

Highlights
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Apply THROW_EXCEPTION macro consolidation to Triton backend files.
Summary
PR 10 of 22 PRs in the `ag_new_release_updates_2` branch split. Merge after Layer 2 (native core ops + helpers).

- […] `nvrtcCompileProgram()`) → PTX templates (string assembly, no compiler, microseconds)
- `TritonOpCategory` enum; both Triton and NVRTC consult the same table, so new ops must be registered here to get JIT-compiled
- `identifySections()` groups contiguous element-wise ops into one `ELEMENTWISE` section; each matmul/attention/reduction/gather/concat/slice/conv gets its own typed section; multi-section kernels use `tt.grid_barrier` or `__threadfence_system()`
- `emitElementwiseSection()`, `emitMatmulSection()`, `emitAttentionSection()`, `emitGatherSection()`, `emitConcatSection()`, `emitSliceSection()`, `emitTileSection()`, `emitConvolutionSection()`, `emitIm2colSection()`
- `TritonTargetDispatch` selects NVIDIA PTX, AMD AMDGCN, or Intel SPIR-V based on the detected device, so the same IR targets multiple GPU vendors
- `TritonGraphBackend_preload.cpp` background thread compiles kernels during warmup, eliminating first-token JIT latency on the decode path
- `GpuKernelLauncher` checks `tl_streamCaptureActive` before any CUDA API call that is illegal during graph capture; records per-kernel microsecond timing via DSP diagnostics
- `JitSegmentCacheKey` (shapeKey + segmentOpHash + dtypeKey); compiled PTX cached in `NvrtcKernelCache`; eviction on OOM
- `FusionScoring` heuristics gate fusion on element count (≥1M), op count (≥2 fusible ops), and op mix ratio

What Changed
Triton MLIR backend: graph/gpu/ (19 files)

- `TritonIRBuilder.h`: core class; `buildModule()` entry point; `isTritonMappable(opName)` via `OpCategoryTable`; `classifySegment()`; tile size constants (`BLOCK_SIZE=1024` element-wise; `BLOCK_M/N=128`, `BLOCK_K=32` matmul)
- `TritonIRBuilder.cpp`: orchestrates `loadInputTensors` → `identifySections` → `emit*(section)` → `finalize`; uses `mlir::MLIRContext` with Triton TTIR, arith, math, SCF dialects
- `TritonIRBuilder_types.h`: NVCC-safe types (`KernelSectionType` enum, `KernelSection` struct, `EpilogueOp` enum, `TritonIRModule`)
- `TritonIRBuilder_analysis.cpp`: `classifySegment()` returns `SegmentKernelPattern` (ELEMENTWISE_DOMINANT, MATMUL_DOMINANT, MIXED, REDUCTION_DOMINANT, ATTENTION_DOMINANT)
- `TritonIRBuilder_sections.cpp`: `identifySections()` fusion pass; all 9 section emitters; `computeSectionGrid()`; inter-section barriers
- `TritonIRBuilder_emitters.cpp`: per-op MLIR value emission (relu→arith.maxf, exp→math.exp, softmax→tt.reduce + exp + reciprocal)
- `TritonIRBuilder_kernels.cpp`: kernel function construction (pid/num_programs prologue, pointer arithmetic, tl.load, per-element ops, tl.store)
- `TritonIRBuilder_module.cpp`: `mlir::PassManager` pipeline, triton-to-triton-gpu then triton-gpu-to-llvm
- `TritonIRBuilder_cuda.cu`: compiles via Triton Python compiler subprocess or embedded C API; loads `.cubin` via `cuModuleLoadData`
- `TritonIRBuilder_types.cpp`: `mapNd4jDtypeToTritonType()` for SD_FLOAT/SD_HALF/SD_BFLOAT16/SD_DOUBLE/SD_INT32/SD_INT64

Triton backend: graph/gpu/ (12 files)

- `TritonGraphBackend.h/.cpp`: `GraphBackend` subclass; `canFuseSegment()` calls `isTritonMappable()` per slot; per-segment LRU cache of compiled modules
- `TritonGraphBackend_compile.cu`: invokes `TritonIRBuilder`, runs the MLIR pass pipeline, loads `.cubin` or `.ptx`
- `TritonGraphBackend_execute.cu`: resolves buffer pointers from NativeSlots, launches via `GpuKernelLauncher` on the DSP stream
- `TritonGraphBackend_cache.cpp`: LRU by (shapeKey, segmentId, graphExecutionMode); `TritonCacheBundle` stores handle + kernel pointer + launch params
- `TritonGraphBackend_preload.cpp`: background thread compiles kernels observed during warmup before the hot path needs them
- `TritonGraphBackend_binary.cpp`: loads/stores pre-compiled `.cubin` from/to disk cache
- `TritonTargetDispatch.h/.cpp`: selects NVIDIA PTX / AMD AMDGCN / Intel SPIR-V based on the detected device

NVRTC backend: graph/gpu/ (4 files)

- `NvrtcGraphBackend.h`: `GraphBackend` subclass; cache keyed by `JitSegmentCacheKey`
- `NvrtcGraphBackend.cu`: `generateCudaSource()` iterates slots emitting per-op CUDA C expressions; compiles via `nvrtcCompileProgram()` targeting the detected SM; loads via `cuModuleLoadDataEx`
- `NvrtcKernelBuilder.h/.cu`: maps `TritonOpCategory` to inline CUDA math (relu→`fmaxf(x,0.0f)`, add→`(a+b)`, softmax→multi-step reduce+normalize)
- `NvrtcKernelCache.h/.cu`: `JitCompiledKernel` (CUmodule + CUfunction + launch params); `JitSegmentCacheKey`; OOM eviction

PTX template backend: graph/gpu/ (2 files)

- `PtxGraphBackend.h`: `GraphBackend` subclass; no compiler invocation, direct PTX string generation
- `PtxGraphBackend.cu`: `generatePtxSource()` constructs `.global`/`.func` PTX via string templates; loads via `cuModuleLoadDataEx` for driver JIT PTX→SASS

Shared GPU infrastructure: graph/gpu/ (9 files)

- `OpCategoryTable.h`: single source of truth (SSOT) mapping 130+ op name strings → `TritonOpCategory`; covers BINARY_ELEMENTWISE (add/sub/mul/div/min/max/mod and TF aliases), UNARY_ELEMENTWISE (relu/sigmoid/tanh/gelu/exp/log/sqrt/abs/silu and 20+ more), COMPARISON, LOGICAL, TERNARY (where/select), IDENTITY, MATMUL, REDUCTION, NORMALIZATION, CAST, FUSED_ATTENTION, SHAPE_MANIPULATION, DATA_MOVEMENT, CONSTANT_GENERATION, CONVOLUTION, ROPE, FUSED_LLM, UNSUPPORTED
- `JitGraphBackendCommon.h/.cu`: shared types and utilities for the NVRTC and PTX backends
- `GpuKernelLauncher.h/.cu`: unified kernel launch (CUfunction + grid/block dims + stream); capture-safe; DSP TIMING diagnostics integration
- `CaptureBufferRegistry.h/.cu`: per-segment GPU buffer registry; validates pointer stability before CUDA graph replay
- `FusionScoring.h/.cpp`: fusion heuristics (fuse ≥1M elements, ≥2 fusible ops, shape compatibility)
- `SymbolicShapeRanges.h`: min/max shape bound propagation for ahead-of-time segment planning
- `SectionTypeConfig.h`: flags enabling/disabling individual `KernelSectionType` values
- `ViewRecipe.h`: logical view relationship expressed as pointer arithmetic rather than a kernel launch

Dependencies
Merge Order
These 22 PRs must merge in layer order. Each layer depends on the layers above it being merged first. PRs within the same layer are independent and can merge in parallel.
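
For readers unfamiliar with the section grouping that the summary attributes to `identifySections()` (contiguous element-wise ops merged into one section, every other op in its own typed section), here is a minimal sketch under stated assumptions. `SectionKind`, `Section`, `classifyOp`, and `groupSections` are hypothetical stand-ins; the PR classifies ops through its op category table rather than the hard-coded names used here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, reduced set of section kinds for illustration only.
enum class SectionKind { Elementwise, Matmul, Reduction, Other };

// A section covers ops [firstOp, firstOp + opCount) of the segment.
struct Section {
  SectionKind kind;
  std::size_t firstOp;
  std::size_t opCount;
};

// Placeholder classifier; the op names listed are assumptions.
inline SectionKind classifyOp(const std::string& op) {
  if (op == "add" || op == "mul" || op == "relu" || op == "exp")
    return SectionKind::Elementwise;
  if (op == "matmul") return SectionKind::Matmul;
  if (op == "reduce_sum") return SectionKind::Reduction;
  return SectionKind::Other;
}

// Merge runs of adjacent element-wise ops into one section; every
// non-element-wise op starts (and ends) its own section.
inline std::vector<Section> groupSections(const std::vector<std::string>& ops) {
  std::vector<Section> sections;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    const SectionKind kind = classifyOp(ops[i]);
    const bool extendRun = kind == SectionKind::Elementwise &&
                           !sections.empty() &&
                           sections.back().kind == SectionKind::Elementwise;
    if (extendRun) {
      ++sections.back().opCount;
    } else {
      sections.push_back(Section{kind, i, 1});
    }
  }
  return sections;
}
```

Because sections are appended in op order, the last section always ends at the op just before the current one, so checking only `sections.back()` is enough to detect a contiguous element-wise run.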