Add Biren SUPA accelerator support by frozenleaves · Pull Request #8054 · deepspeedai/DeepSpeed

frozenleaves · 2026-06-08T06:45:33Z

Add Biren SUPA Accelerator Support

Summary

This PR adds accelerator backend support for the Biren SUPA GPU (the Biren Technology GPU, software stack SUPA) to DeepSpeed, enabling DeepSpeed to automatically detect the device, run training and inference on Biren GPUs, and reuse DeepSpeed's existing operator invocation framework (fused optimizer, transformer inference, quantizer, async-io, etc.).

SUPA is onboarded as the 9th supported accelerator, following cuda / cpu / xpu / npu / mps / hpu / mlu / sdaa. It adheres to DeepSpeed's existing DeepSpeedAccelerator abstract interface and the op_builder plugin mechanism, with zero intrusion into existing backends — the only existing file modified is the accelerator auto-detection entry point accelerator/real_accelerator.py.

Changes

1. Accelerator auto-detection and registration — `accelerator/real_accelerator.py` (the only existing file modified)

- Add `'supa'` to `SUPPORTED_ACCELERATOR_LIST`.
- **Explicit specification** (`DS_ACCELERATOR=supa`): attempt `import torch_supa`, and emit a clear error message if it is missing.
- **Auto-detection**: add a SUPA probing branch that determines availability via `import torch_supa` and checking `torch.supa.is_available()`.
  - Critical ordering: because `torch_supa` spoofs `torch.cuda`, the SUPA detection branch **must come before** the CUDA detection; otherwise Biren cards would be misidentified as CUDA devices. A comment in the code documents this constraint.
- In the third-step instantiation logic, add the `accelerator_name == 'supa'` → `SUPA_Accelerator()` branch.
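The ordering constraint above can be illustrated with a minimal sketch. This is not the code in `real_accelerator.py`: the function `detect_accelerator_name`, the probe helpers, and the `probes` parameter are hypothetical names introduced here, and the sketch assumes only what the description states (`import torch_supa` registers `torch.supa`, and `torch.supa.is_available()` reports device presence).

```python
import importlib
import os


def _supa_available():
    """Probe for SUPA: torch_supa must import and torch.supa must report a device."""
    try:
        import torch
        importlib.import_module("torch_supa")  # assumed side effect: registers torch.supa
        return bool(torch.supa.is_available())
    except Exception:
        return False


def _cuda_available():
    """Probe for CUDA; returns False when torch is missing or has no CUDA device."""
    try:
        import torch
        return bool(torch.cuda.is_available())
    except Exception:
        return False


# Order matters: SUPA is probed before CUDA because torch_supa spoofs torch.cuda,
# so a CUDA-first probe would report a Biren card as a CUDA device.
DEFAULT_PROBES = (("supa", _supa_available), ("cuda", _cuda_available))


def detect_accelerator_name(probes=DEFAULT_PROBES, env=None):
    """Return the accelerator name from DS_ACCELERATOR, else the first probe that succeeds."""
    env = os.environ if env is None else env
    explicit = env.get("DS_ACCELERATOR")
    if explicit == "supa":
        try:
            importlib.import_module("torch_supa")
        except ImportError as exc:
            raise ValueError("DS_ACCELERATOR=supa requires the torch_supa package") from exc
        return "supa"
    if explicit:
        return explicit
    for name, is_available in probes:
        if is_available():
            return name
    return "cpu"
```

Passing the probes as a parameter makes the ordering easy to exercise without hardware: when both probes report a device, the first entry wins.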

2. Accelerator implementation — `accelerator/supa_accelerator.py`

Implements all interfaces of the DeepSpeedAccelerator abstract base class. The vast majority of APIs delegate directly to torch.supa.* (mirroring the semantics of torch.cuda.*):

- **Device management**: `device / set_device / current_device / device_count / synchronize`, etc.
- **RNG**: `manual_seed(_all) / get_rng_state / set_rng_state / default_generator`.
- **Stream / Event**: `Stream / Event / current_stream / default_stream`.
- **Memory management**: `empty_cache / memory_allocated / max_memory_allocated / memory_reserved / memory_stats / total_memory / available_memory`, etc. (some use `hasattr` for capability probing, for compatibility across different versions of `torch_supa`).
- **Data types**: declares support for fp32 / fp16 / bf16.
- **Communication backend**: uses **BCCL** (the Biren collective communication library) on Linux, falling back to `gloo` on Windows.
- **CUDA Graph**: mapped to `torch.supa.SUPAGraph()` / `torch.supa.graph(...)`.
- **op_builder loading**: `op_builder_dir()` returns `op_builder.supa` (local install) or `deepspeed.ops.op_builder.supa` (pip install), and lazily loads via `pkgutil`, scanning all `*Builder` classes in that directory to build the `class_dict`.
- **Environment variables**: `export_envs` exports `BCCL / BIREN / SUPA / LD_LIBRARY / PATH`; `visible_devices_envs` uses `SUPA_VISIBLE_DEVICES`.
- **Compile backend**: defaults to `inductor`, with Triton support.
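The `hasattr` capability probing mentioned for the memory APIs might look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the free function `available_memory` and its `backend` parameter are hypothetical, and the sketch assumes the backend namespace mirrors `torch.cuda` names (`mem_get_info`, `get_device_properties`, `memory_allocated`), which may not hold for every `torch_supa` version.

```python
from types import SimpleNamespace


def available_memory(backend, device_index=0):
    """Return free device memory in bytes, probing for the best API the backend offers.

    `backend` stands in for a device namespace such as torch.supa (assumption).
    """
    # Preferred path: a direct free/total query, if this backend version provides it.
    if hasattr(backend, "mem_get_info"):
        free_bytes, _total_bytes = backend.mem_get_info(device_index)
        return free_bytes
    # Fallback: estimate from total capacity minus what the allocator currently holds.
    total = backend.get_device_properties(device_index).total_memory
    return total - backend.memory_allocated(device_index)


# Stand-in backends so the sketch runs without any accelerator installed.
older_backend = SimpleNamespace(
    get_device_properties=lambda index: SimpleNamespace(total_memory=100),
    memory_allocated=lambda index: 30,
)
newer_backend = SimpleNamespace(mem_get_info=lambda index: (55, 100))

print(available_memory(older_backend))  # estimated from total minus allocated
print(available_memory(newer_backend))  # taken directly from mem_get_info
```

The fallback is only an estimate, since memory held by other processes is invisible to the allocator counters.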

3. SUPA op_builder plugin package — `op_builder/supa/` (new)

A new SUPA builder package, parallel to op_builder/{cpu,xpu,npu,...}:

| File | Purpose |
|------|---------|
| `builder.py` | `SUPAOpBuilder` base class, compiling host-side C++ sources based on `CppExtension` (`-O3 -std=c++17 -fopenmp` + CPU arch / SIMD width). |
| `fused_adam.py` | `FusedAdamBuilder` + `SUPAFusedAdam`: prefers calling the `torch.ops.deepspeed.multi_tensor_adam` compiled kernel, falling back to a numerically equivalent pure-PyTorch implementation when missing (supports Adam mode=0 / AdamW mode=1). |
| `fused_lamb.py` | `FusedLambBuilder` + `SUPAFusedLamb`: `torch.ops.deepspeed.lamb`, with a pure-PyTorch fallback (trust-ratio clamp). |
| `fused_lion.py` | `FusedLionBuilder` + `SUPAFusedLion`: `torch.ops.deepspeed.multi_tensor_lion`, with a pure-PyTorch fallback. |
| `inference.py` | `InferenceBuilder` + `SUPAInference`: wraps the full set of transformer inference kernels (`layer_norm` / `rms_norm` / `softmax(_context)` / `bias_*` / `qkv_gemm` / `mlp_gemm` / `vector_matmul` / `linear_layer` / `rotary` / `einsum` / MoE / `gated_activation`), in fp16/bf16/fp32 precisions, each delegating to `torch.ops.deepspeed.*`. |
| `quantizer.py` | `QuantizerBuilder` + `SUPAQuantizer`: symmetric/asymmetric quantization, stochastic rounding (SR), int4/int8 dequantization, swizzle_quant, quantized_reduction, LoCo, etc. |
| `async_io.py` | `AsyncIOBuilder`: reuses DeepSpeed's existing `csrc/aio/*` C++ sources, depends on `libaio`, includes a package-manager detection hint. |
| `cpu_adam.py` / `cpu_lion.py` / `cpu_adagrad.py` | CPU offload optimizer builders, reusing the `csrc/{adam,lion,adagrad}/*` sources. |
| `no_impl.py` | `NotImplementedBuilder`: a placeholder stub for unimplemented ops; `load()` raises a clear `NotImplementedError`. |
| `__init__.py` | Exports all builders. |
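To make the "Adam mode=0 / AdamW mode=1" distinction in the `fused_adam.py` row concrete, here is a scalar sketch of one update step. It uses plain Python floats in place of tensors and is not the PR's fallback implementation; the function name `adam_step` and its defaults are hypothetical. It assumes the usual convention that mode 0 folds weight decay into the gradient (L2 regularization) while mode 1 applies decoupled weight decay.

```python
import math


def adam_step(param, grad, exp_avg, exp_avg_sq, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, mode=1):
    """One Adam (mode=0) or AdamW (mode=1) update on scalar values."""
    if mode == 0:
        # Adam with L2 regularization: the decay term enters the moment estimates.
        grad = grad + weight_decay * param
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized moment estimates.
    m_hat = exp_avg / (1 - beta1 ** step)
    v_hat = exp_avg_sq / (1 - beta2 ** step)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if mode == 1:
        # AdamW: decay is applied to the parameter directly, outside the moments.
        update = update + weight_decay * param
    return param - lr * update, exp_avg, exp_avg_sq


# With no weight decay the two modes coincide; with decay they diverge.
print(adam_step(1.0, 1.0, 0.0, 0.0, step=1, mode=0))
print(adam_step(1.0, 1.0, 0.0, 0.0, step=1, weight_decay=0.1, mode=0))
print(adam_step(1.0, 1.0, 0.0, 0.0, step=1, weight_decay=0.1, mode=1))
```

A tensor version would apply the same arithmetic elementwise across each parameter group.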

Design highlights:

- Compiled kernels are hooked in via `import torch_supa_ext.deepspeed` (side effect: registers `torch.ops.deepspeed.*`); all imports are wrapped in `try/except` so the module remains importable even without the compiled extension.
- `is_compatible()` uses a two-stage decision: "fast path checks whether the op is already registered → otherwise attempt to import the extension".
- Optimizer builders provide a pure-PyTorch fallback, making it convenient to do functional verification in cmodel / hardware-free environments.
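The two-stage `is_compatible()` decision can be sketched as follows. This is a simplified stand-in rather than the builder code in the PR: the free function, the `registry` object (playing the role of `torch.ops.deepspeed`), and the injectable `importer` are assumptions made so the example runs without `torch_supa_ext`.

```python
import importlib
from types import SimpleNamespace


def is_compatible(op_name, registry,
                  extension_module="torch_supa_ext.deepspeed",
                  importer=importlib.import_module):
    """Two-stage check: is the op registered already, else does importing the extension register it?"""
    # Stage 1 (fast path): the op is already registered, nothing to import.
    if hasattr(registry, op_name):
        return True
    # Stage 2: importing the extension is expected to register ops as a side effect.
    try:
        importer(extension_module)
    except Exception:
        # A missing or broken extension must not break module import; report incompatibility.
        return False
    return hasattr(registry, op_name)


# Stand-ins: an empty registry and an importer that registers one op when called.
registry = SimpleNamespace()


def fake_importer(module_name):
    registry.multi_tensor_adam = object()


def failing_importer(module_name):
    raise ImportError(module_name)


print(is_compatible("multi_tensor_adam", SimpleNamespace(), importer=failing_importer))
print(is_compatible("multi_tensor_adam", registry, importer=fake_importer))
```

Catching the import failure and returning `False` is what lets a caller choose the pure-PyTorch fallback instead of crashing at import time.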

Dependencies

Runtime dependencies (all are Biren software-stack components, needed only when using the SUPA backend):

- **`torch_supa`**: the Biren PyTorch device extension, providing the `torch.supa.*` namespace. **Required** (the basis for accelerator detection and all device APIs).
- **`torch_supa_ext`**: the Biren compiled operator extension, with submodules:
  - `torch_supa_ext.deepspeed`: registers `torch.ops.deepspeed.*` (fused optimizer / inference / quantizer kernels). *Optional*: when missing, the optimizer falls back to pure PyTorch, while inference/quantizer raise a clear error on invocation and tests are skipped automatically.
- **BCCL**: the Biren collective communication library (the communication backend for distributed training).
- **libaio**: required by `AsyncIOBuilder` (ZeRO-Infinity NVMe offload) via `libaio-dev`.

No new dependencies are introduced for DeepSpeed's existing code or other backends.

Usage

Prerequisite: the Biren driver and `torch_supa` (plus `torch_supa_ext` as needed) are already installed.

# Option 1: explicitly specify the backend
export DS_ACCELERATOR=supa

# Option 2: auto-detection (just install torch_supa; no environment variable needed)

Usage in code is exactly the same as for other backends, through the unified get_accelerator() abstraction:

import torch
from deepspeed.accelerator import get_accelerator

accelerator = get_accelerator()          # automatically returns SUPA_Accelerator
print(accelerator.device_name())         # 'supa'
device = accelerator.device(0)           # torch.device('supa', 0)
tensor = torch.randn(3, device=device)   # e.g. tensor([-0.8643,  1.3154,  1.5823], device='supa:0')

# DeepSpeed training/inference initialization requires no changes; op_builder is automatically routed to op_builder.supa

Multi-card visibility is controlled via the SUPA_VISIBLE_DEVICES environment variable; the distributed communication backend defaults to bccl.

Compatibility and scope of impact

- The SUPA path is activated only when `DS_ACCELERATOR=supa` is explicitly set or `torch_supa` is present in the environment; behavior in all other environments is completely unchanged.
- The only existing file modified, `real_accelerator.py`, only adds branches and does not modify existing logic.
- Tests are skipped automatically when no hardware is present, remaining transparent to upstream CI.
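The automatic skipping described above could be expressed as in the sketch below. It uses the standard-library `unittest` purely for illustration and does not reproduce the skip markers in the PR's test changes; the class name `SupaSmokeTest` is hypothetical, and the check only assumes that `torch_supa` is importable when the Biren stack is installed.

```python
import importlib.util
import unittest

# True only when the torch_supa package can be found in this environment.
SUPA_INSTALLED = importlib.util.find_spec("torch_supa") is not None


@unittest.skipUnless(SUPA_INSTALLED, "torch_supa is not installed")
class SupaSmokeTest(unittest.TestCase):
    """Hypothetical smoke test that only runs where the Biren stack is present."""

    def test_device_is_available(self):
        import torch
        import torch_supa  # noqa: F401  (assumed to register torch.supa on import)

        self.assertTrue(torch.supa.is_available())


if __name__ == "__main__":
    unittest.main()
```

On a machine without `torch_supa` the test is reported as skipped rather than failed, which is what keeps hardware-free CI runs green.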

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3eb1e1811a


Signed-off-by: frozenleaves <914814442@qq.com>

frozenleaves · 2026-06-08T08:42:17Z

CC @PKUWZP @delock

PKUWZP · 2026-06-08T22:55:13Z

@frozenleaves Thanks for submitting this PR, very exciting work. A couple of suggestions:

Can you make sure to run pre-commit command to fix any formatting errors?
Can you add some testing results and ideally some benchmarking results? No need to be very large-scale, but a few Biren GPUs are fine.

delock · 2026-06-09T02:46:05Z

> @frozenleaves Thanks for submitting this PR, very exciting work. A couple of suggestions:
>
> Can you make sure to run pre-commit command to fix any formatting errors?
>
> Can you add some testing results and ideally some benchmarking results? No need to be very large-scale, but a few Biren GPUs are fine.

I think the testing level should be two tiers:

UT test result -- whether a core subset of UT had passed on SUPA
Workload test result -- which ZeRO stages are currently supported, whether SP/TP is supported etc.
Benchmarking results -- could be shared in the comments or a followup blog post.

Also this table can be updated along with this PR.
https://github.com/deepspeedai/DeepSpeed#contributed-hw-support

frozenleaves · 2026-06-09T03:06:27Z

> > @frozenleaves Thanks for submitting this PR, very exciting work. A couple of suggestions:
> >
> > Can you make sure to run pre-commit command to fix any formatting errors?
> >
> > Can you add some testing results and ideally some benchmarking results? No need to be very large-scale, but a few Biren GPUs are fine.
>
> I think the testing level should be two tiers:
>
> UT test result -- whether a core subset of UT had passed on SUPA
>
> Workload test result -- which ZeRO stages are currently supported, whether SP/TP is supported etc.
>
> Benchmarking results -- could be shared in the comments or a followup blog post.
>
> Also this table can be updated along with this PR. https://github.com/deepspeedai/DeepSpeed#contributed-hw-support

Thank you very much for your review comments. I will fix the CI issue as soon as possible and add the workload test results to the PR comments later. Maybe we can share the benchmarking results in a follow-up blog post.

Signed-off-by: frozenleaves <914814442@qq.com>

delock · 2026-06-12T00:30:08Z

Hi @frozenleaves can you fix the DCO error in CI?
When you have workload test result you can post in comments and ping me. Thanks!

Signed-off-by: frozenleaves <914814442@qq.com>

frozenleaves · 2026-06-12T09:00:32Z

UT case test results

✅ Passed

accelerator/

accelerator/test_accelerator.py

autotuning/

autotuning/test_autotuning.py

checkpoint/

checkpoint/test_autotp_uc_checkpoint.py
checkpoint/test_convert_checkpoint.py
checkpoint/test_reshape_checkpoint.py
checkpoint/test_sparse.py
checkpoint/test_shared_weights.py
checkpoint/test_tag_validation.py

compile/

compile/test_inductor_aot_kwargs.py
compile/test_list_schedule.py

compression/

compression/test_compression.py [2 passed, 1 skipped]

launcher/

launcher/test_ds_arguments.py
launcher/test_multinode_runner.py
launcher/test_run.py

model_parallelism/

model_parallelism/test_tp_plan_e2e.py
model_parallelism/test_autotp_custom_patterns.py

module_inject/

module_inject/test_tp_plan_converter.py

monitor/

monitor/test_monitor.py

ops/

ops/adam/
ops/adagrad/
ops/aio/ [44 passed, 46 skipped]
ops/deepspeed4science/ [9 passed, 1 skipped]
ops/lion/
ops/test_op_builder.py

pipe/

pipe/test_pipe_module.py

profiling/

profiling/flops_profiler/test_flops_profiler.py

runtime/

runtime/activation_checkpointing/
runtime/comm/
runtime/tensor_parallel/
runtime/zero/
runtime/utils/
runtime/zenflow/
runtime/test_autocast.py
runtime/test_data_efficiency.py
runtime/test_data.py
runtime/test_ds_config_dict.py
runtime/test_ds_initialize.py
runtime/test_ds_config_model.py
runtime/test_lr_schedulers.py
runtime/test_multi_output_model.py
runtime/test_multiple_models.py
runtime/test_mup_optimizers.py
runtime/test_no_sync_ctxt.py
runtime/test_pld.py
runtime/test_precision_config_loss_scale.py
runtime/test_runtime_utils.py
runtime/test_tp_plan_extraction.py

sequence_parallelism/

sequence_parallelism/test_autosp_equivalence.py
sequence_parallelism/test_autosp_integration.py

ulysses_alst/

ulysses_alst/test_tiled_compute.py
ulysses_alst/test_ulysses_sp_hf.py

utils/

utils/test_get_optim_files.py
utils/test_groups.py
utils/test_init_on_device.py
utils/test_nvtx.py

⏭️ Skipped

comm/

comm/test_dist.py

checkpoint/

checkpoint/test_latest_checkpoint.py
checkpoint/test_mics_optimizer.py
checkpoint/test_moe_checkpoint.py
checkpoint/test_pipeline.py
checkpoint/test_universal_checkpoint.py

compression/

compression/test_dequantization.py

elasticity/

elasticity/test_elastic.py

hybrid_engine/

hybrid_engine/test_he_all.py
hybrid_engine/test_he_llama.py
hybrid_engine/test_he_lora.py

inference/

inference/ [skipped]

linear/

linear/test_ctx.py
linear/test_linear.py
linear/test_quant_param.py

model_parallelism/

model_parallelism/test_autotp_training.py
model_parallelism/test_configurable_parallel_mp.py
model_parallelism/test_configurable_parallel_pp.py

ops/

ops/accelerators/
ops/fp_quantizer/
ops/muon/
ops/quantizer/
ops/sparse_attention/
ops/spatial/
ops/transformer/

sequence_parallelism/

sequence_parallelism/test_ulysses.py

v1/

v1/compile/test_compile_autosp.py
v1/compile/test_compile_fx.py
v1/half_precision/test_bf16.py

❌ Failed

checkpoint/

checkpoint/test_other_optimizer.py

launcher/

launcher/test_user_args.py [4 failed, 7 passed] [RuntimeError: launcher 'pdsh' not installed]

moe/

moe/test_moe.py
moe/test_moe_tp.py

runtime/

runtime/sparse_tensor/

v1/

v1/compile/test_compile_zero.py

Signed-off-by: frozenleaves <914814442@qq.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>


frozenleaves requested review from loadams and tjruwase as code owners June 8, 2026 06:45

chatgpt-codex-connector Bot reviewed Jun 8, 2026


Comment thread op_builder/supa/fused_adam.py

Comment thread op_builder/supa/builder.py Outdated

PKUWZP requested review from PKUWZP and delock June 8, 2026 07:05

frozenleaves force-pushed the main-supa branch from 1f3b2d4 to f82a8ca Compare June 8, 2026 08:01

frozenleaves requested review from hwchen2017 and tohtana as code owners June 8, 2026 08:01

frozenleaves force-pushed the main-supa branch from f82a8ca to 850b322 Compare June 8, 2026 08:15

frozenleaves added 2 commits June 8, 2026 16:17

Add Biren SUPA accelerator support

b51750d

Signed-off-by: frozenleaves <914814442@qq.com>

fix

18e77a6

Signed-off-by: frozenleaves <914814442@qq.com>

frozenleaves force-pushed the main-supa branch from 850b322 to 18e77a6 Compare June 8, 2026 08:20

fix ci

eeab2ff

Signed-off-by: frozenleaves <914814442@qq.com>

frozenleaves force-pushed the main-supa branch from 5154882 to eeab2ff Compare June 9, 2026 03:30

add supa branch for ut

5a3b67e

Signed-off-by: frozenleaves <914814442@qq.com>

frozenleaves force-pushed the main-supa branch from ecb5bd5 to 5a3b67e Compare June 12, 2026 01:44

delock approved these changes Jun 15, 2026


Merge branch 'master' into main-supa

136427c

delock merged commit 7ad4108 into deepspeedai:master Jun 16, 2026
13 checks passed

delock merged 5 commits into deepspeedai:master from frozenleaves:main-supa
