Add Biren SUPA accelerator support #8054
Signed-off-by: frozenleaves <914814442@qq.com>
@frozenleaves Thanks for submitting this PR, very exciting work. A couple of suggestions:
I think the testing level should be two tiers:
Also this table can be updated along with this PR.

Thank you very much for your review comments. I will fix the CI issue as soon as possible and add the workload test results to the comment section of the PR later. Maybe we can share the benchmarking results in the follow-up blog post.

Hi @frozenleaves can you fix the DCO error in CI?
Signed-off-by: frozenleaves <914814442@qq.com>
UT case test results

✅ Passed: accelerator/, autotuning/, checkpoint/, compile/, compression/, launcher/, model_parallelism/, module_inject/, monitor/, ops/, pipe/, profiling/, runtime/, sequence_parallelism/, ulysses_alst/, utils/

⏭️ Skipped: comm/, checkpoint/, compression/, elasticity/, hybrid_engine/, inference/, linear/, model_parallelism/, ops/, sequence_parallelism/, v1/

❌ Failed: checkpoint/, launcher/, moe/, runtime/, v1/
# Add Biren SUPA Accelerator Support
## Summary
This PR adds an accelerator backend for the **Biren SUPA GPU** (GPUs from
Biren Technology, whose software stack is named SUPA) to DeepSpeed. With
it, DeepSpeed can automatically detect the device, run training and
inference on Biren GPUs, and reuse its existing operator invocation
framework (fused optimizer, transformer inference, quantizer, async-io,
etc.).
SUPA is onboarded as the 9th supported accelerator, following `cuda /
cpu / xpu / npu / mps / hpu / mlu / sdaa`. It adheres to DeepSpeed's
existing `DeepSpeedAccelerator` abstract interface and the `op_builder`
plugin mechanism, with **zero intrusion** into existing backends — the
only existing file modified is the accelerator auto-detection entry
point `accelerator/real_accelerator.py`.
## Changes
### 1. Accelerator auto-detection and registration: `accelerator/real_accelerator.py` (the only existing file modified)
- Add `'supa'` to `SUPPORTED_ACCELERATOR_LIST`.
- **Explicit specification** (`DS_ACCELERATOR=supa`): attempt `import
torch_supa`, and emit a clear error message if it is missing.
- **Auto-detection**: add a SUPA probing branch that determines
availability via `import torch_supa` and checking
`torch.supa.is_available()`.
- Critical ordering: because `torch_supa` spoofs `torch.cuda`, the SUPA
detection branch **must come before** the CUDA detection, otherwise
Biren cards would be misidentified as CUDA devices. This constraint is
clearly noted with a comment in the code.
- In the third-step instantiation logic, add the `accelerator_name ==
'supa'` → `SUPA_Accelerator()` branch.
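
The ordering constraint can be illustrated with a minimal sketch. This is not the code in `real_accelerator.py`: `detect_accelerator`, `supa_available`, and the `cpu` default are hypothetical names chosen for illustration, and only `import torch_supa` and `torch.supa.is_available()` are taken from the description above.

```python
from typing import Callable, Sequence, Tuple

Probe = Tuple[str, Callable[[], bool]]


def supa_available() -> bool:
    """Probe for SUPA as described above: import torch_supa, then ask torch.supa."""
    try:
        import torch
        import torch_supa  # noqa: F401  (provides the torch.supa namespace)
    except ImportError:
        return False
    return bool(torch.supa.is_available())


def detect_accelerator(probes: Sequence[Probe], default: str = "cpu") -> str:
    """Return the name of the first probe that reports an available device."""
    for name, is_available in probes:
        if is_available():
            return name
    return default


# Simulate a machine with torch_supa installed: both probes answer True,
# because torch_supa also makes torch.cuda look available.
spoofed = [("supa", lambda: True), ("cuda", lambda: True)]
print(detect_accelerator(spoofed))        # supa
print(detect_accelerator(spoofed[::-1]))  # cuda (the misidentification to avoid)
```

Because the first matching probe wins, placing the SUPA probe ahead of the CUDA probe is what keeps a Biren card from being reported as `cuda`.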
### 2. Accelerator implementation — `accelerator/supa_accelerator.py`
Implements all interfaces of the `DeepSpeedAccelerator` abstract base
class. The vast majority of APIs delegate directly to `torch.supa.*`
(mirroring the semantics of `torch.cuda.*`):
- **Device management**: `device / set_device / current_device /
device_count / synchronize`, etc.
- **RNG**: `manual_seed(_all) / get_rng_state / set_rng_state /
default_generator`.
- **Stream / Event**: `Stream / Event / current_stream /
default_stream`.
- **Memory management**: `empty_cache / memory_allocated /
max_memory_allocated / memory_reserved / memory_stats / total_memory /
available_memory`, etc. (some use `hasattr` for capability probing, for
compatibility across different versions of torch_supa).
- **Data types**: declares support for fp32 / fp16 / bf16.
- **Communication backend**: uses **BCCL** (the Biren collective
communication library) on Linux, falling back to `gloo` on Windows.
- **CUDA Graph**: mapped to `torch.supa.SUPAGraph()` /
`torch.supa.graph(...)`.
- **op_builder loading**: `op_builder_dir()` returns `op_builder.supa`
(local install) or `deepspeed.ops.op_builder.supa` (pip install), and
lazily loads via `pkgutil`, scanning all `*Builder` classes in that
directory to build the `class_dict`.
- **Environment variables**: `export_envs` exports `BCCL / BIREN / SUPA
/ LD_LIBRARY / PATH`; `visible_devices_envs` uses
`SUPA_VISIBLE_DEVICES`.
- **Compile backend**: defaults to `inductor`, with Triton support.
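
The `pkgutil`-based builder discovery described in the op_builder loading item could look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: `collect_builders` is a hypothetical helper, and the use of `functools.lru_cache` to get load-once behaviour is one possible way to make the scan lazy.

```python
import functools
import importlib
import inspect
import pkgutil
from typing import Dict


@functools.lru_cache(maxsize=None)
def collect_builders(package_name: str) -> Dict[str, type]:
    """Import every module of a package and index the classes named *Builder.

    The scan runs on the first call only; later calls reuse the cached dict.
    """
    package = importlib.import_module(package_name)
    class_dict: Dict[str, type] = {}
    for module_info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{module_info.name}")
        # getmembers also returns classes imported into the module, so a
        # builder re-exported by several modules simply maps to the same class.
        for name, obj in inspect.getmembers(module, inspect.isclass):
            if name.endswith("Builder"):
                class_dict[name] = obj
    return class_dict
```

A caller would pass whichever package name `op_builder_dir()` returned, for example `collect_builders("op_builder.supa")`, and then look up a builder class by name in the resulting dictionary.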
### 3. SUPA op_builder plugin package — `op_builder/supa/` (new)
A new SUPA builder package, parallel to `op_builder/{cpu,xpu,npu,...}`:
| File | Purpose |
|------|------|
| `builder.py` | `SUPAOpBuilder` base class, compiling host-side C++ sources based on `CppExtension` (`-O3 -std=c++17 -fopenmp` + CPU arch / SIMD width). |
| `fused_adam.py` | `FusedAdamBuilder` + `SUPAFusedAdam`: prefers calling the `torch.ops.deepspeed.multi_tensor_adam` compiled kernel, falling back to a **numerically equivalent pure-PyTorch implementation** when missing (supports Adam mode=0 / AdamW mode=1). |
| `fused_lamb.py` | `FusedLambBuilder` + `SUPAFusedLamb`: `torch.ops.deepspeed.lamb`, with a pure-PyTorch fallback (trust-ratio clamp). |
| `fused_lion.py` | `FusedLionBuilder` + `SUPAFusedLion`: `torch.ops.deepspeed.multi_tensor_lion`, with a pure-PyTorch fallback. |
| `inference.py` | `InferenceBuilder` + `SUPAInference`: wraps the full set of transformer inference kernels (layer_norm / rms_norm / softmax(_context) / bias_* / qkv_gemm / mlp_gemm / vector_matmul / linear_layer / rotary / einsum / MoE / gated_activation), in fp16/bf16/fp32 precisions, each delegating to `torch.ops.deepspeed.*`. |
| `quantizer.py` | `QuantizerBuilder` + `SUPAQuantizer`: symmetric/asymmetric quantization, stochastic rounding (SR), int4/int8 dequantization, swizzle_quant, quantized_reduction, LoCo, etc. |
| `async_io.py` | `AsyncIOBuilder`: reuses DeepSpeed's existing `csrc/aio/*` C++ sources, depends on `libaio`, includes a package-manager detection hint. |
| `cpu_adam.py` / `cpu_lion.py` / `cpu_adagrad.py` | CPU offload optimizer builders, reusing the `csrc/{adam,lion,adagrad}/*` sources. |
| `no_impl.py` | `NotImplementedBuilder`: a placeholder stub for unimplemented ops; `load()` raises a clear `NotImplementedError`. |
| `__init__.py` | Exports all builders. |
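
To make the two Adam modes concrete, here is a scalar sketch of the standard update that a fallback of this kind would need to reproduce. It is written under assumptions and is not the PR's code: `adam_step` is a hypothetical name, bias correction is assumed to be enabled, epsilon is assumed to be added outside the square root, and plain floats stand in for tensors. Mode 0 folds weight decay into the gradient (L2 regularization), while mode 1 applies decoupled weight decay (AdamW).

```python
import math
from typing import Tuple


def adam_step(p: float, g: float, m: float, v: float, step: int,
              lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8, weight_decay: float = 0.0,
              mode: int = 1) -> Tuple[float, float, float]:
    """One Adam (mode=0) or AdamW (mode=1) step; returns (new_p, new_m, new_v)."""
    if mode not in (0, 1):
        raise ValueError("mode must be 0 (Adam with L2) or 1 (AdamW)")
    if mode == 0:
        g = g + weight_decay * p          # L2 regularization enters the gradient
    m = beta1 * m + (1.0 - beta1) * g     # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g # second-moment estimate
    m_hat = m / (1.0 - beta1 ** step)     # bias correction (assumed enabled)
    v_hat = v / (1.0 - beta2 ** step)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if mode == 1:
        update += weight_decay * p        # decoupled weight decay
    return p - lr * update, m, v


p0, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, step=1, lr=0.1, weight_decay=0.1, mode=0)
p1, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, step=1, lr=0.1, weight_decay=0.1, mode=1)
print(round(p0, 4), round(p1, 4))  # 0.9 0.89
```

On the first step the normalized update is close to 1 in both modes, so the only difference is the extra `lr * weight_decay * p` term that mode 1 subtracts directly from the parameter.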
**Design highlights**:
- Compiled kernels are hooked in via `import torch_supa_ext.deepspeed`
(side effect: registers `torch.ops.deepspeed.*`); all imports are
wrapped in `try/except` so the module remains importable even without
the compiled extension.
- `is_compatible()` uses a two-stage decision: "fast path checks whether
the op is already registered → otherwise attempt to import the
extension".
- optimizer builders provide a pure-PyTorch fallback, making it
convenient to do functional verification in cmodel / hardware-free
environments.
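
The two-stage `is_compatible()` decision can be sketched as follows. This is a simplified illustration rather than the builder code itself: the free function and its `op_registered` callback are hypothetical, and catching only `ImportError` is an assumption about what the `try/except` covers.

```python
import importlib
from typing import Callable


def is_compatible(op_registered: Callable[[], bool],
                  extension_module: str = "torch_supa_ext.deepspeed") -> bool:
    """Two-stage check: registered already, else try importing the extension."""
    # Stage 1 (fast path): the op is already registered, nothing to import.
    if op_registered():
        return True
    # Stage 2: importing the extension registers its ops as a side effect.
    try:
        importlib.import_module(extension_module)
    except ImportError:
        return False
    # Re-check, since a successful import does not guarantee this specific op.
    return op_registered()
```

With PyTorch available, the callback could be something like `lambda: hasattr(torch.ops.deepspeed, "multi_tensor_adam")`; this relies on recent PyTorch versions raising `AttributeError` for an unregistered op, which should be verified against the version in use.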
## Dependencies
Runtime dependencies (all are Biren software-stack components, needed
only when using the SUPA backend):
- **`torch_supa`**: the Biren PyTorch device extension, providing the `torch.supa.*` namespace. **Required** (the basis for accelerator detection and all device APIs).
- **`torch_supa_ext`**: the Biren compiled operator extension, with submodules:
  - `torch_supa_ext.deepspeed`: registers `torch.ops.deepspeed.*` (fused optimizer / inference / quantizer kernels). *Optional*: when it is missing, the optimizers fall back to pure PyTorch, while the inference and quantizer ops raise a clear error on invocation and tests are skipped automatically.
- **BCCL**: the Biren collective communication library (the communication backend for distributed training).
- **libaio**: required by `AsyncIOBuilder` (ZeRO-Infinity NVMe offload); on Debian/Ubuntu it is provided by the `libaio-dev` package.
**No new dependencies** are introduced for DeepSpeed's existing code or
other backends.
## Usage
Prerequisite: the Biren driver and `torch_supa` (plus `torch_supa_ext` as
needed) are already installed.
```bash
# Option 1: explicitly specify the backend
export DS_ACCELERATOR=supa
# Option 2: auto-detection (just install torch_supa; no environment variable needed)
```
Usage in code is exactly the same as for other backends, through the
unified `get_accelerator()` abstraction:
```python
import torch
from deepspeed.accelerator import get_accelerator
accelerator = get_accelerator() # automatically returns SUPA_Accelerator
print(accelerator.device_name()) # 'supa'
device = accelerator.device(0) # torch.device('supa', 0)
tensor = torch.randn(3, device=device) # e.g. tensor([-0.8643, 1.3154, 1.5823], device='supa:0')
# DeepSpeed training/inference initialization requires no changes; op_builder is automatically routed to op_builder.supa
```
Multi-card visibility is controlled via the `SUPA_VISIBLE_DEVICES`
environment variable; the distributed communication backend defaults to
`bccl`.
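
As a small sketch of these two settings, the helpers below pick the backend name by platform and build an environment that limits which cards a child process sees. Both function names are hypothetical, and the comma-separated index format for `SUPA_VISIBLE_DEVICES` is an assumption modelled on `CUDA_VISIBLE_DEVICES`, not something stated in this PR.

```python
import os
import sys
from typing import Dict, Iterable, Optional


def default_comm_backend(platform: str = sys.platform) -> str:
    """BCCL on Linux, gloo on Windows, as described for SUPA_Accelerator."""
    return "gloo" if platform.startswith("win") else "bccl"


def env_with_visible_devices(device_ids: Iterable[int],
                             base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and restrict the visible SUPA devices in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumed format: comma-separated device indices.
    env["SUPA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


print(default_comm_backend("linux"))                               # bccl
print(env_with_visible_devices([0, 2], {})["SUPA_VISIBLE_DEVICES"])  # 0,2
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching a worker, which leaves the parent process's environment untouched.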
## Compatibility and scope of impact
- The SUPA path is activated only when `DS_ACCELERATOR=supa` is
explicitly set or `torch_supa` is present in the environment; behavior
in all other environments is completely unchanged.
- The only existing file modified, `real_accelerator.py`, only adds
branches and does not modify existing logic.
- Tests are skipped automatically when no hardware is present, remaining
transparent to upstream CI.
---------
Signed-off-by: frozenleaves <914814442@qq.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>