Architecture Overview#
This section provides an overview of TensorRT’s architecture, design principles, and ecosystem. It introduces key concepts and complementary tools that work alongside TensorRT for optimized inference deployment.
Samples#
The Sample Support Guide illustrates many of the topics discussed in this section.
Complementary GPU Features#
Beyond engine compilation and runtime APIs, NVIDIA GPUs expose platform features for sharing or partitioning hardware among concurrent workloads. The topics in this section describe three complementary options for TensorRT deployments: Multi-instance GPU (MIG) for dedicated GPU slices, green contexts for lightweight compute isolation without MIG hardware, and multi-device execution for spreading a single network across multiple GPUs. Refer to Multi-Device Inference for API setup and operator details.
Multi-instance GPU#
Multi-instance GPU (MIG) is a feature of NVIDIA GPUs with NVIDIA Ampere Architecture or later architectures. It enables user-directed partitioning of a single GPU into multiple smaller GPUs.
The physical partitions provide dedicated compute and memory slices with quality of service. They also support independent execution of parallel workloads on fractions of the GPU.
For TensorRT applications with low GPU utilization, MIG can increase throughput with little or no latency impact. The optimal partitioning scheme is application-specific. Use MIG when strict hardware isolation is required, such as in multi-tenant systems or when GPU memory must be partitioned.
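As a minimal sketch of how one inference worker might be pinned to one MIG slice: CUDA applications can be restricted to a single MIG instance by setting the `CUDA_VISIBLE_DEVICES` environment variable to that instance's UUID. The helper name and the UUIDs below are placeholders invented for this example; real instance UUIDs are listed by `nvidia-smi -L`.

```python
import os


def mig_worker_env(mig_uuid, base_env=None):
    """Return a copy of the environment that exposes only one MIG instance."""
    if not mig_uuid.startswith("MIG-"):
        raise ValueError(f"expected a MIG instance UUID, got {mig_uuid!r}")
    env = dict(os.environ if base_env is None else base_env)
    # The CUDA runtime in the child process will see only this MIG slice.
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


# Placeholder UUIDs (hypothetical); substitute the values reported by `nvidia-smi -L`.
MIG_UUIDS = [
    "MIG-00000000-0000-0000-0000-000000000001",
    "MIG-00000000-0000-0000-0000-000000000002",
]

# One environment per slice; each could be passed as `env=` to subprocess.Popen
# to start an independent TensorRT worker on its own partition.
worker_envs = [mig_worker_env(uuid, base_env={}) for uuid in MIG_UUIDS]
print(worker_envs[0]["CUDA_VISIBLE_DEVICES"])
```

Running one process per slice keeps the workers independent, which matches the isolation model described above.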
Green Contexts#
A green context (introduced in CUDA 13.1) is a lightweight context that is associated, from its creation, with a specific set of GPU resources. When creating a green context, users can partition GPU resources (currently streaming multiprocessors (SMs) and work queues (WQs)) so that GPU work targeting that green context can use only its provisioned SMs and work queues. This can reduce, or make more predictable, interference caused by contention for shared resources. An application can have multiple green contexts.
Green contexts are a lighter-weight alternative to MIG for keeping concurrent CUDA workloads from interfering with each other’s compute usage during inference. A primary advantage of green contexts over MIG is that green contexts do not require special hardware support, so they are available on more GPU architectures. However, unlike MIG, green contexts do not allow you to partition memory and do not provide the strict isolation that might be required in multi-tenant environments.
Multi-device Execution#
Multi-device execution is a feature for partitioning a single network across multiple GPUs, enabling multi-GPU inference. Each GPU runs its own instance of the TensorRT engine as a distinct rank and exchanges intermediate tensors with other ranks using distributed collective primitives.
The multi-device feature is exposed through the per-layer nbRanks attribute on IDistCollectiveLayer and IAttention, together with the IDistCollectiveLayer collectives (ALL_REDUCE, ALL_GATHER, BROADCAST, REDUCE, REDUCE_SCATTER, ALL_TO_ALL, GATHER, SCATTER).
At runtime, the application initializes an NCCL communicator on each participating rank and attaches it to the engine.
For TensorRT applications whose models are too large to fit on a single GPU, or whose latency is dominated by compute that parallelizes well across devices, multi-device execution lowers per-rank memory pressure and shortens single-query latency at the cost of inter-GPU communication.
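To make the collective names above concrete, the following pure-Python illustration shows what three of them compute, following the usual NCCL/MPI definitions with a sum reduction. It simulates ranks as entries of a list inside one process; it is not the TensorRT or NCCL API, and the function names are invented for this sketch.

```python
def all_reduce(per_rank):
    """Every rank ends up with the elementwise sum of all ranks' buffers."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_gather(per_rank):
    """Every rank ends up with all buffers concatenated in rank order."""
    gathered = [x for buf in per_rank for x in buf]
    return [list(gathered) for _ in per_rank]


def reduce_scatter(per_rank):
    """Sum elementwise, then rank r keeps only the r-th equal chunk of the sum."""
    n_ranks = len(per_rank)
    total = [sum(values) for values in zip(*per_rank)]
    if len(total) % n_ranks != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = len(total) // n_ranks
    return [total[r * chunk:(r + 1) * chunk] for r in range(n_ranks)]


# Two simulated ranks, each holding a four-element buffer.
ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(all_reduce(ranks))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(all_gather(ranks))      # both ranks hold [1, 2, 3, 4, 10, 20, 30, 40]
print(reduce_scatter(ranks))  # [[11, 22], [33, 44]]
```

In this model, a reduce-scatter followed by an all-gather produces the same result as an all-reduce, which is one way to see how the collectives relate to each other.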
Complementary Software#
| Tool | Description |
|---|---|
| | Higher-level library providing optimized inference across CPUs and GPUs with model management, REST, and gRPC endpoints. |
| | High-performance primitives for preprocessing image, audio, and video data. TensorRT inference can be integrated as a custom operator in a DALI pipeline. Refer to GitHub: DALI for a working example. |
| | PyTorch-TensorRT compiler that converts PyTorch modules into TensorRT engines. Subgraphs are accelerated through TensorRT while Torch executes the rest natively. Refer to GitHub: Examples. |
| | Unified library for quantization, pruning, and distillation. Compresses models for TensorRT-LLM or TensorRT deployment. Replaces the deprecated PyTorch and TensorFlow Quantization Toolkits. To quantize TensorFlow models, export to ONNX first. |
| | Profiling tool integrated with TensorRT for performance analysis. |
| | IDE for ONNX model editing, performance profiling, and TensorRT engine building. |
| | A restricted subset of TensorRT is certified for NVIDIA DRIVE products. Some APIs are marked for DRIVE use only. |
ONNX#
TensorRT’s primary means of importing a trained model from a framework is the ONNX interchange format. TensorRT ships with an ONNX parser library to assist in importing models. The table below summarizes common model entry paths and follow-on tooling.
| Entry path | When to use | Next step |
|---|---|---|
| ONNX export + ONNX parser | Most training frameworks (PyTorch, TensorFlow, JAX, and others) | Export to the latest supported opset, then parse with TensorRT’s ONNX parser (refer to C++ or Python API walkthrough) |
| PyTorch ONNX export | PyTorch-trained models | |
| TensorFlow → ONNX | TensorFlow or Keras models | |
| Post-export cleanup | Parser errors, unsupported subgraphs, or constant-folding opportunities | Run Polygraphy constant folding; edit with ONNX-GraphSurgeon when needed |
Where possible, the parser is backward compatible to opset 9. The ONNX Model Opset Version Converter can assist in resolving incompatibilities.
The GitHub version of the ONNX parser can support later opsets than the version shipped with TensorRT. Refer to the ONNX-TensorRT operator support matrix for the latest information on the supported opset and operators. For TensorRT deployment, we recommend exporting to the latest available ONNX opset.
The ONNX operator support list for TensorRT can be found on GitHub: Supported ONNX Operators.
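The following is a minimal sketch of importing an ONNX file with the TensorRT Python API and surfacing parser errors, for example those caused by unsupported operators. It assumes TensorRT 10 or later, where networks are explicit-batch by default; the helper name `parse_onnx` is invented for this example.

```python
import os


def parse_onnx(onnx_path):
    """Parse an ONNX file into a TensorRT network definition.

    Returns (builder, network, parser). Keep the parser alive until the
    engine is built, because it owns the weights the network refers to.
    """
    if not os.path.isfile(onnx_path):
        raise FileNotFoundError(onnx_path)

    # Imported lazily so the path check above works without TensorRT installed.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Assumption: TensorRT 10+, where flags=0 yields an explicit-batch network.
    network = builder.create_network(0)
    parser = trt.OnnxParser(network, logger)

    if not parser.parse_from_file(onnx_path):
        # Collect every parser error so unsupported nodes are easy to spot.
        errors = [parser.get_error(i).desc() for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))
    return builder, network, parser
```

A typical call would be `builder, network, parser = parse_onnx("model.onnx")`, where the file name is a placeholder.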
After exporting a model to ONNX, a good first step is to run constant folding with Polygraphy. This often resolves conversion issues in the TensorRT ONNX parser and simplifies the rest of the workflow.
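As a hedged sketch, constant folding can be scripted by invoking the Polygraphy command-line tool (`polygraphy surgeon sanitize` with `--fold-constants`). The wrapper functions and file names below are illustrative, and the sketch assumes the `polygraphy` executable is installed and on `PATH`.

```python
import shutil
import subprocess


def fold_constants_cmd(src, dst):
    """Build the Polygraphy command line that folds constants in an ONNX model."""
    return ["polygraphy", "surgeon", "sanitize", src, "--fold-constants", "-o", dst]


def fold_constants(src, dst):
    """Run constant folding, failing early if Polygraphy is not installed."""
    if shutil.which("polygraphy") is None:
        raise RuntimeError("polygraphy executable not found on PATH")
    subprocess.run(fold_constants_cmd(src, dst), check=True)


# Show the command without running it; the file names are placeholders.
print(" ".join(fold_constants_cmd("model.onnx", "model_folded.onnx")))
# prints: polygraphy surgeon sanitize model.onnx --fold-constants -o model_folded.onnx
```

The folded model can then be passed to the ONNX parser in place of the original export.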
Code Analysis Tools#
For guidance using the Valgrind and Clang sanitizer tools with TensorRT, refer to the Troubleshooting section.
API Versioning#
The TensorRT version number (MAJOR.MINOR.PATCH) follows Semantic Versioning 2.0.0 for its public APIs and library ABIs. Version numbers change as follows:
MAJOR version when making incompatible API or ABI changes.
MINOR version when adding functionality in a backward-compatible manner.
PATCH version when making backward-compatible bug fixes.
Warning
Semantic versioning does not extend to serialized objects by default. To reuse plan files and timing caches, either use a version-compatible engine or make sure the version numbers match across major, minor, patch, and build versions. Some exceptions exist for the safety runtime, as detailed in the NVIDIA DriveOS Developer Guide.
Calibration caches can typically be reused within a major version, but compatibility beyond a specific patch version is not guaranteed.
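The rules above can be sketched as a small compatibility check. This is an illustration, not a TensorRT API: the type and function names are invented, the version strings are made up, and the treatment of version-compatible engines (usable by an equal or newer runtime within the same major version) is an assumption that should be checked against the version compatibility documentation.

```python
from typing import NamedTuple


class TrtVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    build: int = 0


def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' or 'MAJOR.MINOR.PATCH.BUILD'."""
    parts = [int(p) for p in text.split(".")]
    if not 3 <= len(parts) <= 4:
        raise ValueError(f"unexpected version string: {text!r}")
    return TrtVersion(*parts)


def api_compatible(built_against, running):
    """Simplified SemVer reading: same MAJOR, and MINOR not older than at build time."""
    return running.major == built_against.major and running.minor >= built_against.minor


def plan_reusable(plan, runtime, version_compatible=False):
    """Serialized plans need an exact match unless built as version compatible."""
    if plan == runtime:
        return True
    # Assumption: a version-compatible plan runs on newer runtimes of the same major.
    return version_compatible and runtime.major == plan.major and runtime >= plan


built = parse_version("10.4.0.26")    # hypothetical build-time version
newer = parse_version("10.5.0.18")    # hypothetical runtime version
print(api_compatible(built, newer))                          # True
print(plan_reusable(built, newer))                           # False
print(plan_reusable(built, newer, version_compatible=True))  # True
```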
Deprecation Policy#
Deprecation informs developers that some APIs and tools are no longer recommended. TensorRT has the following deprecation policy, beginning with version 8.0:
Deprecation notices are communicated in the Release Notes.
When using the C++ API:
API functions are marked with the TRT_DEPRECATED_API macro.
Enums are marked with the TRT_DEPRECATED_ENUM macro.
All other locations are marked with the TRT_DEPRECATED macro.
Classes, functions, and objects will have a statement documenting when they were deprecated.
When using the Python API, deprecated methods and classes will issue deprecation warnings at runtime if they are used.
TensorRT provides a 12-month migration period after the deprecation.
APIs and tools continue to work during the migration period.
After the migration period ends, APIs and tools are removed in a manner consistent with semantic versioning.
For any APIs and tools specifically deprecated in TensorRT 7.x, the 12-month migration period starts from the TensorRT 8.0 GA release date.
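On the Python side, one way to find deprecated calls before the migration period ends is to capture deprecation warnings in tests. The sketch below uses only the standard `warnings` module and assumes the warnings are issued as `DeprecationWarning` through that module; `old_api` is a hypothetical stand-in, not a TensorRT function.

```python
import warnings


def old_api():
    """Hypothetical stand-in for a deprecated method (not part of TensorRT)."""
    warnings.warn("old_api is deprecated; use new_api", DeprecationWarning, stacklevel=2)
    return 42


def deprecated_calls(fn):
    """Run fn and return the deprecation messages it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones the default filters would hide.
        warnings.simplefilter("always")
        fn()
    return [str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)]


print(deprecated_calls(old_api))  # ['old_api is deprecated; use new_api']
```

Alternatively, running the interpreter with `-W error::DeprecationWarning` turns such warnings into exceptions, which makes deprecated usage fail loudly in continuous integration.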
Hardware Support Lifetime#
TensorRT 8.5.3 was the last release supporting NVIDIA Kepler (SM 3.x) and NVIDIA Maxwell (SM 5.x) devices. These devices are no longer supported in TensorRT 8.6. NVIDIA Pascal (SM 6.x) devices were deprecated in TensorRT 8.6. TensorRT 10.4 was the last release supporting NVIDIA Volta (SM 7.0) devices. Refer to the Support Matrix section for more information.
Support#
Support, resources, and information about TensorRT can be found online at https://developer.nvidia.com/tensorrt. This includes blogs, samples, and more.
You can also visit the NVIDIA Developer TensorRT forum at https://forums.developer.nvidia.com/c/ai-data-science/deep-learning/tensorrt/ for all things related to TensorRT. The forum is a place to find answers, make connections, and join discussions with customers, developers, and TensorRT engineers.
Reporting Bugs#
NVIDIA appreciates all types of feedback. If you encounter any problems, follow the instructions in the Reporting TensorRT Issues section to report the issues.