Add Biren SUPA accelerator support #8054

Merged: delock merged 5 commits into deepspeedai:master from frozenleaves:main-supa on Jun 16, 2026
@frozenleaves frozenleaves commented Jun 8, 2026

Contributor

Add Biren SUPA Accelerator Support

Summary

This PR adds an accelerator backend for Biren SUPA GPUs (GPUs from Biren Technology, whose software stack is called SUPA) to DeepSpeed. With it, DeepSpeed can automatically detect the device, run training and inference on Biren GPUs, and reuse its existing operator invocation framework (fused optimizer, transformer inference, quantizer, async-io, etc.).

SUPA is onboarded as the 9th supported accelerator, following cuda / cpu / xpu / npu / mps / hpu / mlu / sdaa. It adheres to DeepSpeed's existing DeepSpeedAccelerator abstract interface and the op_builder plugin mechanism, with zero intrusion into existing backends — the only existing file modified is the accelerator auto-detection entry point accelerator/real_accelerator.py.

Changes

1. Accelerator auto-detection and registration — accelerator/real_accelerator.py (the only existing file modified)

  • Add 'supa' to SUPPORTED_ACCELERATOR_LIST.
  • Explicit specification (DS_ACCELERATOR=supa): attempt import torch_supa, and emit a clear error message if it is missing.
  • Auto-detection: add a SUPA probing branch that determines availability via import torch_supa and checking torch.supa.is_available().
    • Critical ordering: because torch_supa spoofs torch.cuda, the SUPA detection branch must come before the CUDA detection, otherwise Biren cards would be misidentified as CUDA devices. This constraint is clearly noted with a comment in the code.
  • In the third-step instantiation logic, add the accelerator_name == 'supa' → SUPA_Accelerator() branch.
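
The ordering constraint above can be illustrated with a small, self-contained sketch. It is not the code in real_accelerator.py: `detect_accelerator`, the `probes` list, and the error messages are hypothetical stand-ins, and only `DS_ACCELERATOR`, `torch_supa`, and `torch.supa.is_available()` come from the PR description.

```python
import os


def detect_accelerator(probes, env=None):
    """Pick an accelerator name from an ordered list of (name, is_available) probes.

    Hypothetical helper for illustration; DeepSpeed's real logic lives in
    accelerator/real_accelerator.py and is structured differently.
    """
    env = os.environ if env is None else env
    availability = dict(probes)

    # Explicit selection: honour DS_ACCELERATOR, but fail loudly if it is unusable.
    requested = env.get("DS_ACCELERATOR")
    if requested is not None:
        if requested not in availability:
            raise ValueError(f"DS_ACCELERATOR={requested!r} is not a supported accelerator")
        if not availability[requested]():
            raise ValueError(f"DS_ACCELERATOR={requested!r} was requested but its runtime is unavailable")
        return requested

    # Auto-detection: the first probe that succeeds wins, so the order matters.
    for name, is_available in probes:
        if is_available():
            return name
    return "cpu"


def supa_probe():
    """Probe in the shape the PR describes: import torch_supa, then ask torch.supa."""
    try:
        import torch
        import torch_supa  # noqa: F401  (assumed to register torch.supa on import)
        return bool(torch.supa.is_available())
    except Exception:
        return False


# On a Biren machine torch_supa makes torch.cuda look usable, so both probes pass.
spoofed = [("supa", lambda: True), ("cuda", lambda: True)]
print(detect_accelerator(spoofed, env={}))  # supa
```

If the two entries in `spoofed` were swapped, the same machine would be reported as `cuda`, which is the misidentification the ordering rule is meant to prevent.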

2. Accelerator implementation — accelerator/supa_accelerator.py

Implements all interfaces of the DeepSpeedAccelerator abstract base class. The vast majority of APIs delegate directly to torch.supa.* (mirroring the semantics of torch.cuda.*):

  • Device management: device / set_device / current_device / device_count / synchronize, etc.
  • RNG: manual_seed(_all) / get_rng_state / set_rng_state / default_generator.
  • Stream / Event: Stream / Event / current_stream / default_stream.
  • Memory management: empty_cache / memory_allocated / max_memory_allocated / memory_reserved / memory_stats / total_memory / available_memory, etc. (some use hasattr for capability probing, for compatibility across different versions of torch_supa).
  • Data types: declares support for fp32 / fp16 / bf16.
  • Communication backend: uses BCCL (the Biren collective communication library) on Linux, falling back to gloo on Windows.
  • CUDA Graph: mapped to torch.supa.SUPAGraph() / torch.supa.graph(...).
  • op_builder loading: op_builder_dir() returns op_builder.supa (local install) or deepspeed.ops.op_builder.supa (pip install), and lazily loads via pkgutil, scanning all *Builder classes in that directory to build the class_dict.
  • Environment variables: export_envs exports BCCL / BIREN / SUPA / LD_LIBRARY / PATH; visible_devices_envs uses SUPA_VISIBLE_DEVICES.
  • Compile backend: defaults to inductor, with Triton support.
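
As a rough illustration of the op_builder loading step, the sketch below scans a package with `pkgutil` and collects every class whose name ends in `Builder`, deferring the scan until the first lookup. `collect_builders` and `LazyBuilderRegistry` are hypothetical names chosen for this example; the actual implementation in supa_accelerator.py may differ in structure and naming.

```python
import importlib
import inspect
import pkgutil


def collect_builders(package_name):
    """Return {class_name: class} for every *Builder class found in the package's modules."""
    package = importlib.import_module(package_name)
    class_dict = {}
    for module_info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{module_info.name}")
        for name, obj in inspect.getmembers(module, inspect.isclass):
            if name.endswith("Builder"):
                class_dict[name] = obj
    return class_dict


class LazyBuilderRegistry:
    """Defers the package scan until a builder is first requested (hypothetical helper)."""

    def __init__(self, package_name):
        self._package_name = package_name
        self._class_dict = None

    def get(self, class_name):
        if self._class_dict is None:  # first use triggers the scan
            self._class_dict = collect_builders(self._package_name)
        return self._class_dict.get(class_name)
```

With this shape, something like `LazyBuilderRegistry("op_builder.supa").get("FusedAdamBuilder")` would import the builder modules only on the first call and return `None` for unknown names.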

3. SUPA op_builder plugin package — op_builder/supa/ (new)

A new SUPA builder package, parallel to op_builder/{cpu,xpu,npu,...}:

| File | Purpose |
|------|------|
| `builder.py` | `SUPAOpBuilder` base class, compiling host-side C++ sources based on `CppExtension` (`-O3 -std=c++17 -fopenmp` + CPU arch / SIMD width). |
| `fused_adam.py` | `FusedAdamBuilder` + `SUPAFusedAdam`: prefers calling the `torch.ops.deepspeed.multi_tensor_adam` compiled kernel, falling back to a numerically equivalent pure-PyTorch implementation when missing (supports Adam mode=0 / AdamW mode=1). |
| `fused_lamb.py` | `FusedLambBuilder` + `SUPAFusedLamb`: `torch.ops.deepspeed.lamb`, with a pure-PyTorch fallback (trust-ratio clamp). |
| `fused_lion.py` | `FusedLionBuilder` + `SUPAFusedLion`: `torch.ops.deepspeed.multi_tensor_lion`, with a pure-PyTorch fallback. |
| `inference.py` | `InferenceBuilder` + `SUPAInference`: wraps the full set of transformer inference kernels (layer_norm / rms_norm / softmax(_context) / bias_* / qkv_gemm / mlp_gemm / vector_matmul / linear_layer / rotary / einsum / MoE / gated_activation), in fp16/bf16/fp32 precisions, each delegating to `torch.ops.deepspeed.*`. |
| `quantizer.py` | `QuantizerBuilder` + `SUPAQuantizer`: symmetric/asymmetric quantization, stochastic rounding (SR), int4/int8 dequantization, swizzle_quant, quantized_reduction, LoCo, etc. |
| `async_io.py` | `AsyncIOBuilder`: reuses DeepSpeed's existing `csrc/aio/*` C++ sources, depends on `libaio`, includes a package-manager detection hint. |
| `cpu_adam.py` / `cpu_lion.py` / `cpu_adagrad.py` | CPU offload optimizer builders, reusing the `csrc/{adam,lion,adagrad}/*` sources. |
| `no_impl.py` | `NotImplementedBuilder`: a placeholder stub for unimplemented ops; `load()` raises a clear `NotImplementedError`. |
| `__init__.py` | Exports all builders. |
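
To make the "Adam mode=0 / AdamW mode=1" distinction in the fused_adam.py row concrete, here is a scalar, dependency-free sketch of one update step. It assumes the usual convention for fused Adam kernels (mode 0 folds weight decay into the gradient as L2 regularization, mode 1 applies decoupled AdamW decay) and always applies bias correction; `adam_step` is a hypothetical function, not the fallback shipped in this PR, which operates on PyTorch tensors.

```python
import math


def adam_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, mode=1):
    """One Adam update on scalars; returns the new (param, exp_avg, exp_avg_sq)."""
    if mode == 0:
        # Adam with L2 regularization: decay is folded into the gradient.
        g = g + weight_decay * p
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** step)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** step)  # bias-corrected second moment
    update = m_hat / (math.sqrt(v_hat) + eps)
    if mode == 1:
        # AdamW: decoupled weight decay is added to the update itself.
        update = update + weight_decay * p
    return p - lr * update, m, v
```

With `weight_decay=0.0` the two modes produce identical results, which is a convenient sanity check when comparing a fallback against a compiled kernel.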

Design highlights:

  • Compiled kernels are hooked in via import torch_supa_ext.deepspeed (side effect: registers torch.ops.deepspeed.*); all imports are wrapped in try/except so the module remains importable even without the compiled extension.
  • is_compatible() uses a two-stage decision: "fast path checks whether the op is already registered → otherwise attempt to import the extension".
  • Optimizer builders provide a pure-PyTorch fallback, which makes functional verification convenient in cmodel / hardware-free environments.
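
The two-stage compatibility check described above might look roughly like the following. This is a sketch under assumptions: `is_compatible` here is a free function taking an arbitrary namespace object so it can run without PyTorch, whereas the PR's builders implement it as a method against `torch.ops.deepspeed`; only the extension module name `torch_supa_ext.deepspeed` is taken from the description.

```python
import importlib
from types import SimpleNamespace


def is_compatible(ops_namespace, op_name, extension="torch_supa_ext.deepspeed"):
    """Two-stage check: registered already? Otherwise try importing the extension."""
    # Stage 1 (fast path): the op is already registered, nothing to import.
    if hasattr(ops_namespace, op_name):
        return True
    # Stage 2: importing the extension is expected to register the ops as a side effect.
    try:
        importlib.import_module(extension)
    except Exception:
        # Keep the caller importable even when the compiled extension is absent.
        return False
    return hasattr(ops_namespace, op_name)


# Stand-in for torch.ops.deepspeed with one op already registered.
registered = SimpleNamespace(multi_tensor_adam=object())
print(is_compatible(registered, "multi_tensor_adam"))  # True
```

Catching the import failure rather than letting it propagate is what allows a builder to fall back to a pure-PyTorch path instead of failing at import time.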

Dependencies

Runtime dependencies (all are Biren software-stack components, needed only when using the SUPA backend):

  • torch_supa — the Biren PyTorch device extension, providing the torch.supa.* namespace. Required (the basis for accelerator detection and all device APIs).

  • torch_supa_ext — the Biren compiled operator extension, with submodules:

    • torch_supa_ext.deepspeed — registers torch.ops.deepspeed.* (fused optimizer / inference / quantizer kernels).

    Optional: when missing, the optimizer falls back to pure PyTorch, while inference/quantizer raise a clear error on invocation and tests are skipped automatically.

  • BCCL — the Biren collective communication library (the communication backend for distributed training).

  • libaio: required by AsyncIOBuilder (ZeRO-Infinity NVMe offload); on Debian/Ubuntu it is installed via the libaio-dev package.

No new dependencies are introduced for DeepSpeed's existing code or other backends.

Usage

Prerequisite: the Biren driver and torch_supa (plus torch_supa_ext as needed) are already installed.

```bash
# Option 1: explicitly specify the backend
export DS_ACCELERATOR=supa

# Option 2: auto-detection (just install torch_supa; no environment variable needed)
```

Usage in code is exactly the same as for other backends, through the unified get_accelerator() abstraction:

```python
import torch
from deepspeed.accelerator import get_accelerator

accelerator = get_accelerator()          # automatically returns SUPA_Accelerator
print(accelerator.device_name())         # 'supa'
device = accelerator.device(0)           # torch.device('supa', 0)
tensor = torch.randn(3, device=device)   # e.g. tensor([-0.8643,  1.3154,  1.5823], device='supa:0')

# DeepSpeed training/inference initialization requires no changes; op_builder is automatically routed to op_builder.supa
```

Multi-card visibility is controlled via the SUPA_VISIBLE_DEVICES environment variable; the distributed communication backend defaults to bccl.
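
A small sketch of how a launcher script might prepare these settings is shown below. Both helpers are hypothetical, and the comma-separated format for `SUPA_VISIBLE_DEVICES` is an assumption made by analogy with `CUDA_VISIBLE_DEVICES`; the PR description does not specify it. The backend choice mirrors the description: BCCL by default, `gloo` on Windows.

```python
import sys


def supa_launch_env(device_ids, base_env=None):
    """Build an environment dict that selects the SUPA backend and restricts visible cards."""
    env = dict(base_env or {})
    env["DS_ACCELERATOR"] = "supa"
    # Assumed format: comma-separated device indices, e.g. "0,1".
    env["SUPA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


def communication_backend_name(platform=None):
    """Return 'bccl' by default, falling back to 'gloo' on Windows."""
    platform = sys.platform if platform is None else platform
    return "gloo" if platform.startswith("win") else "bccl"


print(supa_launch_env([0, 1])["SUPA_VISIBLE_DEVICES"])  # 0,1
print(communication_backend_name("linux"))              # bccl
```

The resulting dict could be passed as the `env` argument of `subprocess.run` when spawning worker processes.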

Compatibility and scope of impact

  • The SUPA path is activated only when DS_ACCELERATOR=supa is explicitly set or torch_supa is present in the environment; behavior in all other environments is completely unchanged.
  • The only existing file modified, real_accelerator.py, only adds branches and does not modify existing logic.
  • Tests are skipped automatically when no hardware is present, remaining transparent to upstream CI.
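
One common way to get the "skip when no hardware is present" behavior is to gate test classes on whether the backend package can be found. The sketch below uses the standard-library `unittest` module for that; it is an illustration only, since the PR does not show its skip mechanism, and `SupaSmokeTest` and `run_smoke_tests` are hypothetical names.

```python
import importlib.util
import io
import unittest

# True only when the Biren PyTorch extension is installed in this environment.
HAS_TORCH_SUPA = importlib.util.find_spec("torch_supa") is not None


@unittest.skipUnless(HAS_TORCH_SUPA, "torch_supa is not installed; skipping SUPA tests")
class SupaSmokeTest(unittest.TestCase):

    def test_backend_is_importable(self):
        import torch_supa  # noqa: F401


def run_smoke_tests():
    """Run the smoke tests quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SupaSmokeTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

On a machine without `torch_supa`, the run reports the test as skipped rather than failed, so CI for other backends stays green.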

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3eb1e1811a


Comment thread op_builder/supa/fused_adam.py
Comment thread op_builder/supa/builder.py Outdated
Signed-off-by: frozenleaves <914814442@qq.com>
Signed-off-by: frozenleaves <914814442@qq.com>
@frozenleaves

Contributor Author

CC @PKUWZP @delock

@PKUWZP

PKUWZP commented Jun 8, 2026

Collaborator

@frozenleaves Thanks for submitting this PR, very exciting work. A couple of suggestions:

  • Can you make sure to run the pre-commit command to fix any formatting errors?
  • Can you add some testing results and, ideally, some benchmarking results? They don't need to be large-scale; results from a few Biren GPUs are fine.

@delock

delock commented Jun 9, 2026

Collaborator

I think the testing should be reported in tiers:

  1. UT test result -- whether a core subset of UT has passed on SUPA.
  2. Workload test result -- which ZeRO stages are currently supported, whether SP/TP is supported, etc.
  3. Benchmarking results -- could be shared in the comments or a follow-up blog post.

Also this table can be updated along with this PR.
https://github.com/deepspeedai/DeepSpeed#contributed-hw-support

@frozenleaves

Contributor Author

Thank you very much for your review comments. I will fix the CI issue as soon as possible and add the workload test results to the PR comments later. Maybe we can share the benchmarking results in a follow-up blog post.

Signed-off-by: frozenleaves <914814442@qq.com>
@delock

delock commented Jun 12, 2026

Collaborator

Hi @frozenleaves, can you fix the DCO error in CI?
When you have the workload test results, you can post them in the comments and ping me. Thanks!

Signed-off-by: frozenleaves <914814442@qq.com>
@frozenleaves

frozenleaves commented Jun 12, 2026

Contributor Author

UT case test results

✅ Passed

accelerator/

  • accelerator/test_accelerator.py

autotuning/

  • autotuning/test_autotuning.py

checkpoint/

  • checkpoint/test_autotp_uc_checkpoint.py
  • checkpoint/test_convert_checkpoint.py
  • checkpoint/test_reshape_checkpoint.py
  • checkpoint/test_sparse.py
  • checkpoint/test_shared_weights.py
  • checkpoint/test_tag_validation.py

compile/

  • compile/test_inductor_aot_kwargs.py
  • compile/test_list_schedule.py

compression/

  • compression/test_compression.py [2 passed, 1 skipped]

launcher/

  • launcher/test_ds_arguments.py
  • launcher/test_multinode_runner.py
  • launcher/test_run.py

model_parallelism/

  • model_parallelism/test_tp_plan_e2e.py
  • model_parallelism/test_autotp_custom_patterns.py

module_inject/

  • module_inject/test_tp_plan_converter.py

monitor/

  • monitor/test_monitor.py

ops/

  • ops/adam/
  • ops/adagrad/
  • ops/aio/ [44 passed, 46 skipped]
  • ops/deepspeed4science/ [9 passed, 1 skipped]
  • ops/lion/
  • ops/test_op_builder.py

pipe/

  • pipe/test_pipe_module.py

profiling/

  • profiling/flops_profiler/test_flops_profiler.py

runtime/

  • runtime/activation_checkpointing/
  • runtime/comm/
  • runtime/tensor_parallel/
  • runtime/zero/
  • runtime/utils/
  • runtime/zenflow/
  • runtime/test_autocast.py
  • runtime/test_data_efficiency.py
  • runtime/test_data.py
  • runtime/test_ds_config_dict.py
  • runtime/test_ds_initialize.py
  • runtime/test_ds_config_model.py
  • runtime/test_lr_schedulers.py
  • runtime/test_multi_output_model.py
  • runtime/test_multiple_models.py
  • runtime/test_mup_optimizers.py
  • runtime/test_no_sync_ctxt.py
  • runtime/test_pld.py
  • runtime/test_precision_config_loss_scale.py
  • runtime/test_runtime_utils.py
  • runtime/test_tp_plan_extraction.py

sequence_parallelism/

  • sequence_parallelism/test_autosp_equivalence.py
  • sequence_parallelism/test_autosp_integration.py

ulysses_alst/

  • ulysses_alst/test_tiled_compute.py
  • ulysses_alst/test_ulysses_sp_hf.py

utils/

  • utils/test_get_optim_files.py
  • utils/test_groups.py
  • utils/test_init_on_device.py
  • utils/test_nvtx.py

⏭️ Skipped

comm/

  • comm/test_dist.py

checkpoint/

  • checkpoint/test_latest_checkpoint.py
  • checkpoint/test_mics_optimizer.py
  • checkpoint/test_moe_checkpoint.py
  • checkpoint/test_pipeline.py
  • checkpoint/test_universal_checkpoint.py

compression/

  • compression/test_dequantization.py

elasticity/

  • elasticity/test_elastic.py

hybrid_engine/

  • hybrid_engine/test_he_all.py
  • hybrid_engine/test_he_llama.py
  • hybrid_engine/test_he_lora.py

inference/

  • inference/ [skipped]

linear/

  • linear/test_ctx.py
  • linear/test_linear.py
  • linear/test_quant_param.py

model_parallelism/

  • model_parallelism/test_autotp_training.py
  • model_parallelism/test_configurable_parallel_mp.py
  • model_parallelism/test_configurable_parallel_pp.py

ops/

  • ops/accelerators/
  • ops/fp_quantizer/
  • ops/muon/
  • ops/quantizer/
  • ops/sparse_attention/
  • ops/spatial/
  • ops/transformer/

sequence_parallelism/

  • sequence_parallelism/test_ulysses.py

v1/

  • v1/compile/test_compile_autosp.py
  • v1/compile/test_compile_fx.py
  • v1/half_precision/test_bf16.py

❌ Failed

checkpoint/

  • checkpoint/test_other_optimizer.py

launcher/

  • launcher/test_user_args.py [4 failed, 7 passed] [RuntimeError: launcher 'pdsh' not installed]

moe/

  • moe/test_moe.py
  • moe/test_moe_tp.py

runtime/

  • runtime/sparse_tensor/

v1/

  • v1/compile/test_compile_zero.py

@delock delock merged commit 7ad4108 into deepspeedai:master Jun 16, 2026
13 checks passed
nathon-lee pushed a commit to nathon-lee/DeepSpeed_woo that referenced this pull request Jul 1, 2026
Signed-off-by: frozenleaves <914814442@qq.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
nathon-lee pushed a commit to nathon-lee/DeepSpeed_woo that referenced this pull request Jul 1, 2026
Signed-off-by: frozenleaves <914814442@qq.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>