PR10: Triton/NVRTC/PTX Backend (C++) by agibsonccc · Pull Request #10443 · deeplearning4j/deeplearning4j

agibsonccc · 2026-06-15T02:25:03Z

Summary

PR 10 of 22 PRs in the ag_new_release_updates_2 branch split. Merge after Layer 2 (native core ops + helpers).

Three JIT paths, tried in order: Triton (MLIR-based, full register allocation) → NVRTC (runtime CUDA C compile via nvrtcCompileProgram()) → PTX templates (string assembly, no compiler invocation, source generated in microseconds)
OpCategoryTable.h: Single source of truth mapping 130+ op names to 18-variant TritonOpCategory enum; both Triton and NVRTC consult the same table — new ops must be registered here to get JIT-compiled
Triton section fusion: identifySections() groups contiguous element-wise ops into one ELEMENTWISE section; each matmul/attention/reduction/gather/concat/slice/conv gets its own typed section; multi-section kernels use tt.grid_barrier or __threadfence_system()
9 section emitters: emitElementwiseSection(), emitMatmulSection(), emitAttentionSection(), emitGatherSection(), emitConcatSection(), emitSliceSection(), emitTileSection(), emitConvolutionSection(), emitIm2colSection()
Multi-vendor targets: TritonTargetDispatch selects NVIDIA PTX, AMD AMDGCN, or Intel SPIR-V based on detected device — same IR targets multiple GPU vendors
Async precompilation: TritonGraphBackend_preload.cpp background thread compiles kernels during warmup, eliminating first-token JIT latency on the decode path
Capture safety: GpuKernelLauncher checks tl_streamCaptureActive before issuing any CUDA API call that is illegal during graph capture, and records per-kernel microsecond timing via DSP diagnostics
NVRTC caching: JitSegmentCacheKey (shapeKey + segmentOpHash + dtypeKey); compiled PTX cached in NvrtcKernelCache; eviction on OOM
Fusion scoring: FusionScoring heuristics gate fusion on element count (≥1M), op count (≥2 fusible ops), and op mix ratio
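
The NVRTC cache key above is described only as shapeKey + segmentOpHash + dtypeKey. A minimal sketch of how such a composite key could be hashed for an unordered map follows. The field types, the `JitSegmentCacheKeyHash` functor, the boost-style mixing step, and the `PtxCache` alias are all assumptions made for illustration and are not taken from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical layout: the PR names the three fields but not their types.
struct JitSegmentCacheKey {
  std::uint64_t shapeKey;
  std::uint64_t segmentOpHash;
  std::uint64_t dtypeKey;

  bool operator==(const JitSegmentCacheKey& o) const {
    return shapeKey == o.shapeKey && segmentOpHash == o.segmentOpHash &&
           dtypeKey == o.dtypeKey;
  }
};

// Hypothetical hasher: folds the three fields with a hash_combine-style mix.
struct JitSegmentCacheKeyHash {
  std::size_t operator()(const JitSegmentCacheKey& k) const {
    std::size_t h = std::hash<std::uint64_t>{}(k.shapeKey);
    auto mix = [&h](std::uint64_t v) {
      h ^= std::hash<std::uint64_t>{}(v) + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
    };
    mix(k.segmentOpHash);
    mix(k.dtypeKey);
    return h;
  }
};

// Stand-in for NvrtcKernelCache: maps a key to compiled PTX text.
using PtxCache =
    std::unordered_map<JitSegmentCacheKey, std::string, JitSegmentCacheKeyHash>;
```

Two segments that differ only in dtype get distinct entries under this scheme, which is the property the three-part key is meant to provide.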

What Changed

Triton MLIR backend — graph/gpu/ (19 files)

TritonIRBuilder.h — core class: buildModule() entry point; isTritonMappable(opName) via OpCategoryTable; classifySegment(); tile size constants (BLOCK_SIZE=1024 element-wise, BLOCK_M/N=128 BLOCK_K=32 matmul)
TritonIRBuilder.cpp — orchestrates: loadInputTensors → identifySections → emit*(section) → finalize; uses mlir::MLIRContext with Triton TTIR, arith, math, SCF dialects
TritonIRBuilder_types.h — NVCC-safe types: KernelSectionType enum, KernelSection struct, EpilogueOp enum, TritonIRModule
TritonIRBuilder_analysis.cpp — classifySegment(): returns SegmentKernelPattern (ELEMENTWISE_DOMINANT, MATMUL_DOMINANT, MIXED, REDUCTION_DOMINANT, ATTENTION_DOMINANT)
TritonIRBuilder_sections.cpp — identifySections() fusion pass; all 9 section emitters; computeSectionGrid(); inter-section barriers
TritonIRBuilder_emitters.cpp — per-op MLIR value emission: relu→arith.maxf, exp→math.exp, softmax→tt.reduce + exp + reciprocal
TritonIRBuilder_kernels.cpp — kernel function construction: pid/num_programs prologue, pointer arithmetic, tt.load, per-element ops, tt.store
TritonIRBuilder_module.cpp — mlir::PassManager pipeline: triton-to-triton-gpu then triton-gpu-to-llvm
TritonIRBuilder_cuda.cu — compiles via Triton Python compiler subprocess or embedded C API; loads .cubin via cuModuleLoadData
TritonIRBuilder_types.cpp — mapNd4jDtypeToTritonType() for SD_FLOAT/SD_HALF/SD_BFLOAT16/SD_DOUBLE/SD_INT32/SD_INT64
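
The fusion rule attributed to identifySections() (contiguous element-wise ops collapse into one ELEMENTWISE section, every other op gets its own typed section) can be sketched as a single pass. This is an illustration under assumptions: `SectionKind`, `Section`, and the toy `classify()` are hypothetical stand-ins, and the real pass consults OpCategoryTable and covers more section types.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical subset of KernelSectionType (the real enum is in TritonIRBuilder_types.h).
enum class SectionKind { ELEMENTWISE, MATMUL, ATTENTION, REDUCTION };

struct Section {
  SectionKind kind;
  std::size_t firstOp;  // index of the first op covered by this section
  std::size_t opCount;  // number of consecutive ops covered
};

// Toy classifier (assumption): anything not listed is treated as element-wise.
inline SectionKind classify(const std::string& op) {
  if (op == "matmul") return SectionKind::MATMUL;
  if (op == "attention") return SectionKind::ATTENTION;
  if (op == "reduce_sum") return SectionKind::REDUCTION;
  return SectionKind::ELEMENTWISE;
}

// Single pass: extend the previous section only when both it and the
// current op are element-wise; otherwise open a new section.
inline std::vector<Section> identifySections(const std::vector<std::string>& ops) {
  std::vector<Section> out;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    const SectionKind k = classify(ops[i]);
    const bool extend = k == SectionKind::ELEMENTWISE && !out.empty() &&
                        out.back().kind == SectionKind::ELEMENTWISE;
    if (extend) {
      ++out.back().opCount;
    } else {
      out.push_back({k, i, 1});
    }
  }
  return out;
}
```

In this sketch two adjacent matmuls stay in separate sections, matching the "each matmul gets its own typed section" wording in the summary.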

Triton backend — graph/gpu/ (12 files)

TritonGraphBackend.h/.cpp — GraphBackend subclass; canFuseSegment() calls isTritonMappable() per slot; per-segment LRU cache of compiled modules
TritonGraphBackend_compile.cu — invokes TritonIRBuilder, runs MLIR pass pipeline, loads .cubin or .ptx
TritonGraphBackend_execute.cu — resolves buffer pointers from NativeSlots, launches via GpuKernelLauncher on DSP stream
TritonGraphBackend_cache.cpp — LRU by (shapeKey, segmentId, graphExecutionMode); TritonCacheBundle stores handle + kernel pointer + launch params
TritonGraphBackend_preload.cpp — background thread compiles kernels observed during warmup before hot path needs them
TritonGraphBackend_binary.cpp — loads/stores pre-compiled .cubin from/to disk cache
TritonTargetDispatch.h/.cpp — selects NVIDIA PTX / AMD AMDGCN / Intel SPIR-V based on detected device
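
For readers unfamiliar with the caching scheme, here is a minimal LRU sketch keyed by (shapeKey, segmentId, graphExecutionMode). It is not the code in TritonGraphBackend_cache.cpp: the integer key types, the `Bundle` placeholder for TritonCacheBundle, and the `LruCache` class are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <optional>
#include <tuple>
#include <utility>

// (shapeKey, segmentId, graphExecutionMode); the integer types are assumptions.
using CacheKey = std::tuple<std::uint64_t, int, int>;

// Placeholder for TritonCacheBundle (handle + kernel pointer + launch params).
struct Bundle {
  int handle;
};

class LruCache {
 public:
  explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

  // Returns the cached bundle and marks it most recently used.
  std::optional<Bundle> get(const CacheKey& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return std::nullopt;
    order_.splice(order_.begin(), order_, it->second);
    return it->second->second;
  }

  // Inserts or updates an entry, evicting the least recently used one when full.
  void put(const CacheKey& key, Bundle value) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = value;
      order_.splice(order_.begin(), order_, it->second);
      return;
    }
    if (capacity_ == 0) return;
    if (order_.size() == capacity_) {
      index_.erase(order_.back().first);
      order_.pop_back();
    }
    order_.emplace_front(key, value);
    index_[key] = order_.begin();
  }

  std::size_t size() const { return order_.size(); }

 private:
  using Entry = std::pair<CacheKey, Bundle>;
  std::size_t capacity_;
  std::list<Entry> order_;  // front = most recently used
  std::map<CacheKey, std::list<Entry>::iterator> index_;
};
```

The list plus index pairing keeps both lookup and recency updates cheap, and `splice` moves a node without invalidating the iterators stored in the index.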

NVRTC backend — graph/gpu/ (4 files)

NvrtcGraphBackend.h — GraphBackend subclass; cache keyed by JitSegmentCacheKey
NvrtcGraphBackend.cu — generateCudaSource(): iterates slots emitting per-op CUDA C expressions; compiles via nvrtcCompileProgram() targeting detected SM; loads via cuModuleLoadDataEx
NvrtcKernelBuilder.h/.cu — maps TritonOpCategory to CUDA math inline: relu→fmaxf(x,0.0f), add→(a+b), softmax→multi-step reduce+normalize
NvrtcKernelCache.h/.cu — JitCompiledKernel (CUmodule + CUfunction + launch params); JitSegmentCacheKey; OOM eviction
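
The source-generation step can be pictured as string composition: each fused op wraps the running expression, and the result is dropped into a kernel skeleton that would then be handed to nvrtcCompileProgram(). The sketch below only builds the CUDA C text and does not call NVRTC. `FusedOp`, `emitUnary()`, the unary-only chain, and the float-only signature are simplifying assumptions, not the PR's generateCudaSource().

```cpp
#include <string>
#include <vector>

// One fused op; op names follow the PR text, the struct itself is hypothetical.
struct FusedOp {
  std::string op;
};

// Returns a CUDA C expression that applies `op` to the expression `x`.
inline std::string emitUnary(const std::string& op, const std::string& x) {
  if (op == "relu") return "fmaxf(" + x + ", 0.0f)";
  if (op == "exp") return "expf(" + x + ")";
  if (op == "sqrt") return "sqrtf(" + x + ")";
  return x;  // unknown ops pass through unchanged in this sketch
}

// Builds one element-wise kernel whose body is the nested op expression.
inline std::string generateCudaSource(const std::string& kernelName,
                                      const std::vector<FusedOp>& chain) {
  std::string expr = "in[i]";
  for (const auto& f : chain) expr = emitUnary(f.op, expr);
  return "extern \"C\" __global__ void " + kernelName +
         "(const float* in, float* out, int n) {\n"
         "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
         "  if (i < n) out[i] = " + expr + ";\n"
         "}\n";
}
```

The `extern "C"` linkage keeps the kernel name unmangled, so the compiled module can be queried by the same name that was generated.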

PTX template backend — graph/gpu/ (2 files)

PtxGraphBackend.h — GraphBackend subclass; no compiler invocation, direct PTX string generation
PtxGraphBackend.cu — generatePtxSource(): constructs .global/.func PTX via string templates; loads via cuModuleLoadDataEx for driver JIT PTX→SASS
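
To make "direct PTX string generation" concrete, here is a small sketch that fills a PTX text template by placeholder substitution. It is not the PR's generatePtxSource(): the `{NAME}` and `{OP}` placeholders, the `replaceAll()` helper, the `.version`/`.target` values, and the restriction to a single unary f32 instruction are all assumptions chosen to keep the example short.

```cpp
#include <cstddef>
#include <string>

// Replaces every occurrence of `token` in `text` with `value`.
inline std::string replaceAll(std::string text, const std::string& token,
                              const std::string& value) {
  if (token.empty()) return text;
  for (std::size_t pos = text.find(token); pos != std::string::npos;
       pos = text.find(token, pos + value.size())) {
    text.replace(pos, token.size(), value);
  }
  return text;
}

// Hypothetical template: out[i] = OP(in[i]) for i < n, one thread per element.
inline const char* unaryPtxTemplate() {
  return R"PTX(.version 7.0
.target sm_70
.address_size 64

.visible .entry {NAME}(
    .param .u64 p_in,
    .param .u64 p_out,
    .param .u32 p_n)
{
    .reg .pred %p<2>;
    .reg .f32 %f<3>;
    .reg .b32 %r<5>;
    .reg .b64 %rd<6>;

    ld.param.u64 %rd1, [p_in];
    ld.param.u64 %rd2, [p_out];
    ld.param.u32 %r1, [p_n];
    mov.u32 %r2, %ctaid.x;
    mov.u32 %r3, %ntid.x;
    mov.u32 %r4, %tid.x;
    mad.lo.s32 %r2, %r2, %r3, %r4;
    setp.ge.s32 %p1, %r2, %r1;
    @%p1 bra DONE;
    cvta.to.global.u64 %rd1, %rd1;
    cvta.to.global.u64 %rd2, %rd2;
    mul.wide.s32 %rd3, %r2, 4;
    add.s64 %rd4, %rd1, %rd3;
    add.s64 %rd5, %rd2, %rd3;
    ld.global.f32 %f1, [%rd4];
    {OP} %f2, %f1;
    st.global.f32 [%rd5], %f2;
DONE:
    ret;
}
)PTX";
}

// Fills the template with a kernel name and a unary opcode such as "abs.f32".
inline std::string generateUnaryPtx(const std::string& name, const std::string& opcode) {
  return replaceAll(replaceAll(unaryPtxTemplate(), "{NAME}", name), "{OP}", opcode);
}
```

Because no compiler runs on the host, the cost of this path is essentially the string substitution; the PTX→SASS translation still happens in the driver when the module is loaded.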

Shared GPU infrastructure — graph/gpu/ (9 files)

OpCategoryTable.h — SSOT: 130+ op name strings → TritonOpCategory; covers BINARY_ELEMENTWISE (add/sub/mul/div/min/max/mod and TF aliases), UNARY_ELEMENTWISE (relu/sigmoid/tanh/gelu/exp/log/sqrt/abs/silu and 20+ more), COMPARISON, LOGICAL, TERNARY (where/select), IDENTITY, MATMUL, REDUCTION, NORMALIZATION, CAST, FUSED_ATTENTION, SHAPE_MANIPULATION, DATA_MOVEMENT, CONSTANT_GENERATION, CONVOLUTION, ROPE, FUSED_LLM, UNSUPPORTED
JitGraphBackendCommon.h/.cu — shared types and utilities for NVRTC and PTX backends
GpuKernelLauncher.h/.cu — unified kernel launch: CUfunction + grid/block dims + stream; capture-safe; DSP TIMING diagnostics integration
CaptureBufferRegistry.h/.cu — per-segment GPU buffer registry; validates pointer stability before CUDA graph replay
FusionScoring.h/.cpp — fusion heuristics: fuse ≥1M elements, ≥2 fusible ops, shape compatibility
SymbolicShapeRanges.h — min/max shape bound propagation for ahead-of-time segment planning
SectionTypeConfig.h — flags enabling/disabling individual KernelSectionTypes
ViewRecipe.h — logical view relationship expressed as pointer arithmetic rather than a kernel launch
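
Since OpCategoryTable.h is the gate for JIT eligibility, a reduced sketch of the lookup may help reviewers see what "registering an op" amounts to. Only a handful of ops and four of the 18 categories are shown. The `opCategoryTable()`, `lookupCategory()`, and `isMappable()` helpers are hypothetical names, and the actual header may use a different container or a constexpr table.

```cpp
#include <string>
#include <unordered_map>

// Subset of the 18 TritonOpCategory values listed above.
enum class TritonOpCategory { BINARY_ELEMENTWISE, UNARY_ELEMENTWISE, MATMUL, UNSUPPORTED };

// Hypothetical table: op name -> category. New ops would be added here.
inline const std::unordered_map<std::string, TritonOpCategory>& opCategoryTable() {
  static const std::unordered_map<std::string, TritonOpCategory> table = {
      {"add", TritonOpCategory::BINARY_ELEMENTWISE},
      {"mul", TritonOpCategory::BINARY_ELEMENTWISE},
      {"relu", TritonOpCategory::UNARY_ELEMENTWISE},
      {"exp", TritonOpCategory::UNARY_ELEMENTWISE},
      {"matmul", TritonOpCategory::MATMUL},
  };
  return table;
}

// Unregistered names fall through to UNSUPPORTED.
inline TritonOpCategory lookupCategory(const std::string& op) {
  const auto& table = opCategoryTable();
  const auto it = table.find(op);
  return it == table.end() ? TritonOpCategory::UNSUPPORTED : it->second;
}

// An op is JIT-eligible in this sketch only if it has a supported category.
inline bool isMappable(const std::string& op) {
  return lookupCategory(op) != TritonOpCategory::UNSUPPORTED;
}
```

With one shared table, the Triton and NVRTC backends cannot disagree about whether an op is eligible, which is the point of the single-source-of-truth design.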

Dependencies

Depends on: PR05, PR06, PR07 (NDArray, DataBuffer, LaunchDims); PR09 (DSP graph engine — NativeDynamicShapePlan.h, GraphBackend.h, GraphSegment types)
Required by: PR09 (NativeDynamicShapePlan_gpubackend.cpp includes TritonGraphBackend.h, NvrtcGraphBackend.h, PtxGraphBackend.h)

Merge Order

These 22 PRs must merge in layer order. Each layer depends on all lower-numbered layers being merged first. PRs within the same layer are independent and can merge in parallel.

This PR: Merge after Layer 2 (native core ops + helpers).

Layer	PRs
0 (no deps)	PR01, PR02, PR20
1 (build/infra)	PR03, PR04
2 (native core)	PR05, PR06, PR07
3 (native feat)	PR08, PR09, PR10, PR11
4 (java core)	PR12, PR13, PR14, PR15
5 (java feat)	PR16
6 (import/gen)	PR17, PR18, PR19, PR21
7 (validation)	PR22

Part of the 22-PR split of the ag_new_release_updates_2 branch. Merge layer: 3 (native features). Files: 44. See pr-plans/00-master-plan.md for the full split plan and merge order.


Copilot

Pull request overview

Adds native GPU JIT compilation backends for the graph executor, establishing a Triton-first pipeline with NVRTC and PTX-template fallbacks plus supporting infrastructure (fusion scoring, module residency/LRU, preload, capture-safe buffer tracking, and view-recipe metadata).

Changes:

Introduces Triton multi-target dispatch + Triton IR builder core types/utilities for segment compilation and launch planning.
Adds NVRTC and PTX-template graph backends for fused elementwise segments, sharing common cache/execute logic.
Adds supporting GPU infrastructure: section-type configuration + fusion scoring, capture buffer registry, symbolic shape range profiling, and view-recipe structures.

Reviewed changes

Copilot reviewed 41 out of 44 changed files in this pull request and generated 7 comments.


File	Description
libnd4j/include/graph/gpu/ViewRecipe.h	Defines a compact representation for view-producing ops to avoid kernel launches during replay.
libnd4j/include/graph/gpu/TritonTargetDispatch.h	Declares runtime target detection + compile/load/launch interface for Triton outputs.
libnd4j/include/graph/gpu/TritonTargetDispatch.cpp	Implements Triton compilation pipeline and target-specific module load/launch.
libnd4j/include/graph/gpu/TritonIRBuilder_types.h	Defines NVCC-safe Triton IR builder data structures (sections, args, launch phases).
libnd4j/include/graph/gpu/TritonIRBuilder_types.cpp	Implements MLIR type mapping + constant emission helpers for Triton IR builder.
libnd4j/include/graph/gpu/TritonIRBuilder_cuda.cu	CUDA-only helper(s) for cooperative launch sizing.
libnd4j/include/graph/gpu/TritonGraphBackend.cpp	Triton backend core: singleton/config, availability, segment fusibility gating.
libnd4j/include/graph/gpu/TritonGraphBackend_preload.cpp	Batch-preload path to ensure compiled modules are resident ahead of replay.
libnd4j/include/graph/gpu/TritonGraphBackend_lru.cpp	Module residency tracking + LRU eviction/reload plumbing.
libnd4j/include/graph/gpu/TritonGraphBackend_internal.h	Shared internal helpers (hashing, slot resolution, CUDA helpers).
libnd4j/include/graph/gpu/SymbolicShapeRanges.h	API for collecting shape range profiles and computing range-based shape keys.
libnd4j/include/graph/gpu/SectionTypeConfig.h	Central table/logic describing compilation/fusion behavior per section type.
libnd4j/include/graph/gpu/FusionScoring.h	Declares the fusion scoring heuristic used for section-range extension decisions.
libnd4j/include/graph/gpu/FusionScoring.cpp	Implements fusion scoring (grid compatibility, memory traffic savings, penalties/bonuses).
libnd4j/include/graph/gpu/GpuKernelLauncher.h	Declares CUDA-driver module load/function lookup/kernel launch helpers for JIT backends.
libnd4j/include/graph/gpu/GpuKernelLauncher.cu	Implements CUDA-driver load/launch/unload helpers with JIT log capture on failure.
libnd4j/include/graph/gpu/JitGraphBackendCommon.h	Shared cache/execute/invalidate primitives for NVRTC/PTX JIT backends.
libnd4j/include/graph/gpu/JitGraphBackendCommon.cu	Implements shared fusibility check + kernel launch argument packing and execution.
libnd4j/include/graph/gpu/NvrtcGraphBackend.h	Declares NVRTC backend entry points and per-segment cache.
libnd4j/include/graph/gpu/NvrtcGraphBackend.cu	Implements CUDA C source generation + NVRTC compilation + shared execute path.
libnd4j/include/graph/gpu/PtxGraphBackend.h	Declares PTX-template backend entry points and per-segment cache.
libnd4j/include/graph/gpu/PtxGraphBackend.cu	Implements PTX text generation and shared execute path.
libnd4j/include/graph/gpu/NvrtcKernelBuilder.h	Declares an alternate NVRTC kernel source builder API (not referenced elsewhere in this PR snapshot).
libnd4j/include/graph/gpu/NvrtcKernelCache.h	Declares an alternate NVRTC kernel cache/handle API (not referenced elsewhere in this PR snapshot).
libnd4j/include/graph/gpu/NvrtcKernelCache.cu	Implements the alternate NVRTC kernel cache/handle API.
libnd4j/include/graph/gpu/CaptureBufferRegistry.h	Declares capture-workspace allocation tracking by segment.
libnd4j/include/graph/gpu/CaptureBufferRegistry.cu	Implements capture buffer registry backed by CudaMemoryPool with cross-device rejection.
libnd4j/include/graph/gpu/TritonGraphBackend_compile.cu	Segment compilation path (also contains dtype checks used in compile-time analysis).
libnd4j/include/graph/gpu/TritonIRBuilder_module.cpp	Triton module-building pipeline (includes precision emulation logic).


+    if (rank != sliceRank) return false;
+    for (int i = 0; i < rank; i++) {
+      if (outputShape[i] != sliceBegin[i]) return false; // sliceBegin reused for input shape
+    }
+    return true;


- Remove hardcoded startSlot==347 debug PTX dump from compileToGpuBinary()
- Gate TTIR file dumps in buildModule() and buildSectionedModule() behind isDebug()
- Gate fprintf(stderr) error messages in GpuKernelLauncher behind isDebug()

agibsonccc · 2026-06-15T12:19:23Z

Architecture Overview

This PR implements the Triton/NVRTC/PTX JIT compilation backends: three progressively simpler GPU kernel generation paths that the DSP engine selects from based on availability and op compatibility. Triton generates MLIR-based kernels with full register allocation; NVRTC compiles CUDA C at runtime; PTX templates assemble PTX text directly, with no compiler invocation.
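
The selection order described here can be sketched as a first-match walk over an ordered backend list. This is illustrative only: the `Backend` struct, its two callbacks, and `selectBackend()` are hypothetical, and the real DSP engine works with GraphBackend subclasses and their canFuseSegment() methods.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical backend descriptor used only for this sketch.
struct Backend {
  std::string name;
  std::function<bool()> available;  // e.g. compiler or driver found at runtime
  std::function<bool(const std::vector<std::string>&)> canFuse;  // op compatibility
};

// Walks the backends in preference order and returns the first usable one.
inline std::optional<std::string> selectBackend(const std::vector<Backend>& ordered,
                                                const std::vector<std::string>& ops) {
  for (const auto& b : ordered) {
    if (b.available() && b.canFuse(ops)) return b.name;
  }
  return std::nullopt;  // no JIT path applies; caller falls back to non-fused execution
}
```

Passing the list in Triton, NVRTC, PTX order reproduces the "progressively simpler" fallback: a segment only reaches the PTX templates when the two compiler-backed paths are unavailable or reject it.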

Highlights

Triton MLIR kernel generation with section fusion — TritonIRBuilder groups contiguous ops into typed sections (9 emitters: elementwise, matmul, attention, gather, concat, slice, tile, conv, im2col) with tt.grid_barrier between sections; TritonTargetDispatch selects NVIDIA PTX, AMD AMDGCN, or Intel SPIR-V from the same IR, enabling multi-vendor GPU support
OpCategoryTable: 130+ op SSOT for JIT eligibility — single header maps op names to 18-variant TritonOpCategory enum; both Triton and NVRTC backends consult this table — new ops must be registered here to be JIT-compiled; FusionScoring heuristics gate fusion on element count (≥1M), op count (≥2), and op mix ratio
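
The fusion gate can be read as three threshold checks. The sketch below encodes the two thresholds stated in this PR (at least 1M elements, at least 2 fusible ops). The `SegmentStats` struct, the reading of 1M as 10^6, the definition of "op mix ratio" as fusible ops over total ops, and the 0.5 cutoff are assumptions, since the PR text does not give them.

```cpp
#include <cstddef>

// Hypothetical per-segment statistics fed to the gate.
struct SegmentStats {
  std::size_t elementCount;  // elements processed by the segment
  std::size_t fusibleOps;    // ops that a JIT backend can map
  std::size_t totalOps;      // all ops in the segment
};

constexpr std::size_t kMinElements = 1'000'000;  // ">=1M" from the PR, read as 10^6 (assumption)
constexpr std::size_t kMinFusibleOps = 2;        // ">=2 fusible ops" from the PR
constexpr double kMinFusibleRatio = 0.5;         // assumed cutoff for the op mix ratio

// Returns true only when all three checks pass.
inline bool shouldFuse(const SegmentStats& s) {
  if (s.elementCount < kMinElements) return false;
  if (s.fusibleOps < kMinFusibleOps) return false;
  if (s.totalOps == 0) return false;  // guards the division below
  const double ratio = static_cast<double>(s.fusibleOps) / static_cast<double>(s.totalOps);
  return ratio >= kMinFusibleRatio;
}
```

The element-count check comes first because it is the cheapest way to reject small segments, where launch and compile overhead would outweigh the saved memory traffic.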


agibsonccc added 2 commits June 15, 2026 11:24

PR10: Triton/NVRTC/PTX Backend (C++)

6821091

Part of the 22-PR split of the ag_new_release_updates_2 branch. Merge layer: 3 (native features). Files: 44. See pr-plans/00-master-plan.md for the full split plan and merge order.

Merge remote-tracking branch 'origin/master' into pr/10-triton-nvrtc-ptx-backend

d426f28

agibsonccc requested a review from Copilot June 15, 2026 05:40

Copilot started reviewing on behalf of agibsonccc June 15, 2026 05:41

Copilot AI reviewed Jun 15, 2026

agibsonccc and others added 2 commits June 16, 2026 16:04

Potential fix for pull request finding

39afbdd

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

07f48b1

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

agibsonccc and others added 7 commits June 16, 2026 16:05

Potential fix for pull request finding

7d02cfe

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

ae89748

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

0aa4074

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

a79423f

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

fix: sync build fixes from ag_new_release_updates_2

ccef1a9

Apply THROW_EXCEPTION macro consolidation to Triton backend files.

fix: sync build fixes from ag_new_release_updates_2

714fc91

Apply THROW_EXCEPTION macro consolidation to Triton backend files.

fix: sync Triton/NVRTC CUDA dispatch backend from ag_new_release_updates_2

18cd204

agibsonccc wants to merge 12 commits into master from pr/10-triton-nvrtc-ptx-backend
