
[feat] Add H2D and D2H Path Based on GPUDirect RDMA #958

Merged: mag1c-h merged 35 commits into ModelEngine-Group:develop from relat-ivity:gdr-stream on May 15, 2026

@relat-ivity

Contributor

Purpose

🔗 Issue Link: #946

Traditional H2D and D2H transfers are performed through cudaMemcpy. The main drawbacks of cudaMemcpy are its relatively large fixed submission overhead and poor small-I/O transfer bandwidth.

This proposal introduces a GPUDirect RDMA transfer path, referred to as GDR below. GDR is based on the NIC DMA engine and directly performs RDMA operations between GPU HBM and Host DRAM. The transfer path is:

GPU <-> NIC <-> CPU

The architecture of GDR stream is shown in the figure below:

[Figure: GDR stream architecture (image-20260429170635310)]

Modifications

The detailed implementation specification is available in issue #946.

trans Layer

  • gdr_stream.h/cc: Added GdrStream, which implements ordered asynchronous GDR copy through a scheduler thread and a completion thread.

  • device.h, cuda_device.cc, ascend_device.cc, simu_device.cc, trans.py.cc: Added MakeGdrStream to the general Device interface and Python bindings. The CUDA backend supports creating GDR streams, while other backends provide compatible implementations.

  • cuda_buffer.cc, gdr_mr_buffer.h/cc: Added GDR MR registration, query, resolution, and release management for CUDA host/device buffers.

  • gdr_config.h/cc: Added GDR NIC selection logic and GPU KV buffer pre-registration configuration management.

  • gdr_copy.h/cc: Added an ibverbs-based asynchronous GDR copy channel, supporting H2D/D2H copy submission, completion polling, and buffer MR management.
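The scheduler-thread and completion-thread split described for GdrStream can be illustrated with a toy model. The sketch below is not the PR's C++ implementation: `OrderedAsyncCopier`, `copy_async`, and the in-memory `bytearray` copy are hypothetical stand-ins, and a real GDR stream would post work requests to a NIC instead of copying in Python. It only shows how two threads joined by FIFO queues can keep copies in submission order while the caller continues asynchronously.

```python
import queue
import threading
from concurrent.futures import Future


class OrderedAsyncCopier:
    """Toy ordered asynchronous copy stream (hypothetical, not the real GdrStream)."""

    _STOP = object()  # sentinel that shuts both threads down

    def __init__(self):
        self._submitted = queue.Queue()  # copies waiting to be posted
        self._in_flight = queue.Queue()  # copies posted, waiting for completion
        self._scheduler = threading.Thread(target=self._schedule, daemon=True)
        self._completer = threading.Thread(target=self._complete, daemon=True)
        self._scheduler.start()
        self._completer.start()

    def copy_async(self, src, dst, dst_offset=0):
        """Queue a copy and return immediately with a Future."""
        done = Future()
        self._submitted.put((src, dst, dst_offset, done))
        return done

    def _schedule(self):
        # Scheduler thread: forwards copies in FIFO order (stand-in for posting them).
        while True:
            item = self._submitted.get()
            self._in_flight.put(item)
            if item is self._STOP:
                return

    def _complete(self):
        # Completion thread: finishes copies in the order they were posted.
        while True:
            item = self._in_flight.get()
            if item is self._STOP:
                return
            src, dst, offset, done = item
            dst[offset:offset + len(src)] = src  # stand-in for the DMA transfer
            done.set_result(len(src))

    def close(self):
        self._submitted.put(self._STOP)
        self._scheduler.join()
        self._completer.join()


copier = OrderedAsyncCopier()
host = bytearray(8)
futures = [
    copier.copy_async(b"abcd", host, 0),
    copier.copy_async(b"wxyz", host, 4),
    copier.copy_async(b"AB", host, 0),  # overlaps the first copy on purpose
]
copied = [f.result() for f in futures]
copier.close()
print(bytes(host), copied)
```

Because the third copy overlaps the first, the final buffer contents depend on the copies being applied in submission order, which is the property an ordered stream has to preserve.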

store Layer

  • cache_store.cc, global_config.h: Added use_gdr to CacheStore, along with GPU KV buffer configuration parsing, validation, and pre-registration flow.

  • copy_stream.h, dump_queue.h/cc, load_queue.h/cc: The dump/load transfer path in CacheStore now supports selecting either a normal CUDA stream or a GDR stream based on use_gdr.

  • pcstore.h/cc: Added transferUseGdr and GPU KV buffer range fields to the PcStore configuration, and completed validation, pre-registration, and configuration forwarding during initialization.

  • pcstore.py.cc, pcstore_connector.py, pcstore_connector_v1.py: Added use_gdr, GPU KV buffer address, and size configuration mappings to the PcStore Python bindings and connectors.

  • trans_manager.h/cc, trans_queue.h/cc, trans_share_queue.h/cc: Added GDR switch forwarding to the PcStore transfer layer; GDR streams are now created in both normal queues and shared queues based on the configuration.

connector Layer

  • ucm_connector.py: Collects GPU buffer addresses and sizes of the vLLM KV cache, and passes them to UCM for GDR pre-registration.

  • CMakeLists.txt, setup.py: Added a GDR build switch. Whether UCM_ENABLE_GDR_STREAM is enabled is controlled by the ENABLE_GDR environment variable.

  • ucm_config_example.yaml: Added a use_gdr configuration example and environment instructions required to enable GDR.
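As a rough illustration of the build switch, the sketch below maps the `ENABLE_GDR` environment variable to the `UCM_ENABLE_GDR_STREAM` CMake option. Both names come from this PR, but the helper `gdr_cmake_args` and the rule that only the value `"1"` enables the switch are assumptions made for the example; the actual parsing in `setup.py` may differ.

```python
def gdr_cmake_args(environ):
    """Return the extra CMake arguments implied by the GDR build switch.

    Assumption: the switch counts as enabled only when ENABLE_GDR is exactly "1".
    """
    cmake_args = []
    if environ.get("ENABLE_GDR", "0") == "1":
        cmake_args.append("-DUCM_ENABLE_GDR_STREAM=ON")
    return cmake_args


args_off = gdr_cmake_args({})                   # variable not set
args_on = gdr_cmake_args({"ENABLE_GDR": "1"})   # variable set to "1"
print(args_off, args_on)
```

Passing a mapping instead of reading `os.environ` directly keeps the helper easy to test; a build script would call it with `os.environ`.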

Test

Bandwidth Test

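Unit: GB/s

| Transfer size | CUDA  | GDR without Stream | GDR Stream |
| ------------- | ----- | ------------------ | ---------- |
| 4KB           | 1.6   | 11.36              | 7.61       |
| 16KB          | 5.84  | 42.83              | 30.77      |
| 64KB          | 17.43 | 45.46              | 44.88      |
| 1024KB        | 48.34 | 45.84              | 45.33      |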
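To make the bandwidth figures easier to interpret, the sketch below converts them into an average time per transfer and a speedup relative to the CUDA path. The numbers are copied from the table above; the dictionary layout, the function names, and the unit conventions (1 KB = 1024 bytes, 1 GB = 10^9 bytes) are assumptions for illustration.

```python
# Bandwidth in GB/s, keyed by transfer size in KB (values from the bandwidth test).
BANDWIDTH_GBPS = {
    4: {"cuda": 1.6, "gdr_no_stream": 11.36, "gdr_stream": 7.61},
    16: {"cuda": 5.84, "gdr_no_stream": 42.83, "gdr_stream": 30.77},
    64: {"cuda": 17.43, "gdr_no_stream": 45.46, "gdr_stream": 44.88},
    1024: {"cuda": 48.34, "gdr_no_stream": 45.84, "gdr_stream": 45.33},
}


def time_per_transfer_us(size_kb, gbps):
    """Average microseconds per transfer (assumes 1 KB = 1024 B, 1 GB = 1e9 B)."""
    return size_kb * 1024 / (gbps * 1e9) * 1e6


def speedup_over_cuda(size_kb, path="gdr_stream"):
    """Ratio of the chosen path's bandwidth to the CUDA bandwidth at one size."""
    row = BANDWIDTH_GBPS[size_kb]
    return row[path] / row["cuda"]


for size_kb, row in BANDWIDTH_GBPS.items():
    cuda_us = time_per_transfer_us(size_kb, row["cuda"])
    gdr_us = time_per_transfer_us(size_kb, row["gdr_stream"])
    print(f"{size_kb:>5} KB  CUDA {cuda_us:8.2f} us  GDR stream {gdr_us:8.2f} us  "
          f"speedup {speedup_over_cuda(size_kb):.2f}x")
```

Under these assumptions the GDR stream is several times faster than the CUDA path at 4KB and 16KB, while at 1024KB the ratio drops slightly below 1, which is consistent with the fixed per-submission overhead mattering most for small transfers.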

GDR Switch Test

  1. Build with USE_GDR=0, configuration file with use_gdr: false: uses CUDA stream and runs normally, as expected.

  2. Build with USE_GDR=0, configuration file with use_gdr: true: reports an error, as expected.

  3. Build with USE_GDR=1, configuration file with use_gdr: false: uses CUDA stream and runs normally, as expected.

  4. Build with USE_GDR=1, configuration file with use_gdr: true: uses GDR stream and runs normally, as expected.
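The four cases above amount to a small decision table over the build flag and the `use_gdr` setting. The sketch below encodes that table; `select_stream`, `GdrNotBuiltError`, and the returned labels are hypothetical names chosen for the example and do not appear in the PR.

```python
class GdrNotBuiltError(RuntimeError):
    """Raised when use_gdr is requested but the build has no GDR support."""


def select_stream(built_with_gdr, use_gdr):
    """Return which stream type to use for a given build flag and config value."""
    if not use_gdr:
        return "cuda"  # cases 1 and 3: the config keeps the normal CUDA stream
    if not built_with_gdr:
        # case 2: the config asks for GDR but the build cannot provide it
        raise GdrNotBuiltError("use_gdr is true but this build has no GDR support")
    return "gdr"       # case 4: GDR stream


outcomes = {}
for built in (False, True):
    for requested in (False, True):
        try:
            outcomes[(built, requested)] = select_stream(built, requested)
        except GdrNotBuiltError:
            outcomes[(built, requested)] = "error"
print(outcomes)
```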

Online Inference Test

Test Configuration

The configuration file is as follows:

ucm_connectors:
  - ucm_connector_name: "UcmPipelineStore"
    ucm_connector_config:
      store_pipeline: "Cache|Posix"
      storage_backends: "/mnt/test"
      io_direct: false
      use_gdr: true
      share_buffer_enable: false
      cache_buffer_capacity_gb: 1

enable_event_sync: true
use_layerwise: true
enable_record_traces: false
use_lite: false
persist_token_threshold: 0

The test script is as follows. The input length is 1024, and the reuse ratio is 80%.

vllm bench serve \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --endpoint /v1/completions \
  --model /home/models/DeepSeek-V2-Lite \
  --dataset-name random \
  --num-prompts 20 \
  --random-prefix-len 819 \
  --random-input-len 205 \
  --random-output-len 128 \
  --max-concurrency 1
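The stated input length and reuse ratio follow from the two length flags in the script. The short check below assumes that the reuse ratio means the share of each prompt covered by the common random prefix; the variable names are illustrative.

```python
prefix_len = 819  # --random-prefix-len
input_len = 205   # --random-input-len

total_len = prefix_len + input_len
reuse_ratio = prefix_len / total_len

print(f"input length: {total_len}, reuse ratio: {reuse_ratio:.1%}")
```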

Single-node Single-GPU Test

GDR transfer result:

============ Serving Benchmark Result ============
Successful requests:                     20        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  87.18     
Total input tokens:                      20460     
Total generated tokens:                  2560      
Request throughput (req/s):              0.23      
Output token throughput (tok/s):         29.36     
Peak output token throughput (tok/s):    31.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          264.05    
---------------Time to First Token----------------
Mean TTFT (ms):                          93.56     
Median TTFT (ms):                        80.61     
P99 TTFT (ms):                           289.95    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.58     
Median TPOT (ms):                        33.35     
P99 TPOT (ms):                           35.17     
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.58     
Median ITL (ms):                         33.55     
P99 ITL (ms):                            35.84     
==================================================

Test Result

The service runs normally (20 of 20 requests succeed), and performance is as expected: mean TTFT is 93.56 ms and mean TPOT is 33.58 ms.
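As a sanity check on the single-GPU report, the throughput lines can be re-derived from the request count, token counts, and benchmark duration. The sketch assumes each throughput is simply the corresponding count divided by the duration; small differences are expected because the reported duration is rounded to two decimals.

```python
# Values copied from the single-GPU benchmark output.
duration_s = 87.18
num_requests = 20
input_tokens = 20460
output_tokens = 2560

reported = {
    "request_throughput": 0.23,
    "output_token_throughput": 29.36,
    "total_token_throughput": 264.05,
}

# Assumption: each throughput is a plain count divided by the benchmark duration.
derived = {
    "request_throughput": num_requests / duration_s,
    "output_token_throughput": output_tokens / duration_s,
    "total_token_throughput": (input_tokens + output_tokens) / duration_s,
}

for name, value in derived.items():
    print(f"{name}: derived {value:.2f}, reported {reported[name]:.2f}")
```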

Single-node Four-GPU DP2 + TP2 Test

GDR transfer result:

============ Serving Benchmark Result ============
Successful requests:                     20        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  13.18     
Total input tokens:                      20460     
Total generated tokens:                  2560      
Request throughput (req/s):              1.52      
Output token throughput (tok/s):         194.28    
Peak output token throughput (tok/s):    212.00    
Peak concurrent requests:                3.00      
Total token throughput (tok/s):          1747.03   
---------------Time to First Token----------------
Mean TTFT (ms):                          30.59     
Median TTFT (ms):                        23.00     
P99 TTFT (ms):                           102.96    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.94      
Median TPOT (ms):                        4.65      
P99 TPOT (ms):                           8.04      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.95      
Median ITL (ms):                         4.65      
P99 ITL (ms):                            4.93      
==================================================

Test Result

The service runs normally (20 of 20 requests succeed), and performance is as expected: mean TTFT is 30.59 ms and mean TPOT is 4.94 ms.

Path Coverage

  1. cache store has been integrated with GdrStream.

  2. The normal pcstore path has been integrated with GdrStream.

  3. The special scatter/gather path in pcstore has not been integrated with GdrStream.

  4. nfsstore has not been integrated with GdrStream.

@relat-ivity changed the title from "[feat] Gdr stream" to "[feat] Add H2D and D2H Path Based on GPUDirect RDMA" on May 14, 2026
ygwpz previously approved these changes on May 14, 2026

@ygwpz ygwpz left a comment

Contributor

Code review completed. No critical issues found.

@mag1c-h mag1c-h linked an issue May 15, 2026 that may be closed by this pull request
@mag1c-h mag1c-h merged commit 0f88e34 into ModelEngine-Group:develop May 15, 2026
21 of 24 checks passed
Comment thread setup.py
cmake_args += ["-DBUILD_UCM_SPARSE=ON"]

if ENABLE_GDR:
    cmake_args += ["-DUCM_ENABLE_GDR_STREAM=ON"]

Contributor

don't use "UCMXX" here

Fengli5355 pushed a commit to Fengli5355/unified-cache-management that referenced this pull request May 20, 2026
…p#958)

Frankielx pushed a commit to Frankielx/unified-cache-management that referenced this pull request May 26, 2026
…p#958)

@relat-ivity relat-ivity deleted the gdr-stream branch May 28, 2026 02:03
Fengli5355 pushed a commit to Fengli5355/unified-cache-management that referenced this pull request May 29, 2026
…p#958)

## Purpose

🔗 **Issue Link:** ModelEngine-Group#946 

Traditional H2D and D2H transfers are performed through `cudaMemcpy`.
The main drawbacks of `cudaMemcpy` are its relatively large fixed
submission overhead and poor small-I/O transfer bandwidth.

This proposal introduces a GPUDirect RDMA transfer path, referred to as
**GDR** below. GDR is based on the NIC DMA engine and directly performs
RDMA operations between GPU HBM and Host DRAM. The transfer path is:

```txt
GPU <-> NIC <-> CPU
```

The architecture of GDR stream is shown in the figure below:

<img width="2423" height="497" alt="image-20260429170635310"
src="https://github.com/user-attachments/assets/5fe3eaf7-c725-4ad1-8e00-8645a0031e81"
/>

## Modifications 

The detailed implementation specification is available in the issue
ModelEngine-Group#946.

### trans Layer

- `gdr_stream.h/cc`: Added `GdrStream`, which implements ordered
asynchronous GDR copy through a scheduler thread and a completion
thread.

- `device.h`, `cuda_device.cc`, `ascend_device.cc`, `simu_device.cc`,
`trans.py.cc`: Added `MakeGdrStream` to the general Device interface and
Python bindings. The CUDA backend supports creating GDR streams, while
other backends provide compatible implementations.

- `cuda_buffer.cc`, `gdr_mr_buffer.h/cc`: Added GDR MR registration,
query, resolution, and release management for CUDA host/device buffers.

- `gdr_config.h/cc`: Added GDR NIC selection logic and GPU KV buffer
pre-registration configuration management.

- `gdr_copy.h/cc`: Added an ibverbs-based asynchronous GDR copy channel,
supporting H2D/D2H copy submission, completion polling, and buffer MR
management.

### store Layer

- `cache_store.cc`, `global_config.h`: Added `use_gdr` to CacheStore,
along with GPU KV buffer configuration parsing, validation, and
pre-registration flow.

- `copy_stream.h`, `dump_queue.h/cc`, `load_queue.h/cc`: The dump/load
transfer path in CacheStore now supports selecting either a normal CUDA
stream or a GDR stream based on `use_gdr`.

- `pcstore.h/cc`: Added `transferUseGdr` and GPU KV buffer range fields
to the PcStore configuration, and completed validation,
pre-registration, and configuration forwarding during initialization.

- `pcstore.py.cc`, `pcstore_connector.py`, `pcstore_connector_v1.py`:
Added `use_gdr`, GPU KV buffer address, and size configuration mappings
to the PcStore Python bindings and connectors.

- `trans_manager.h/cc`, `trans_queue.h/cc`, `trans_share_queue.h/cc`:
Added GDR switch forwarding to the PcStore transfer layer, and create
GDR streams in both normal queues and shared queues based on the
configuration.

### connector Layer

- `ucm_connector.py`: Collects GPU buffer addresses and sizes of the
vLLM KV cache, and passes them to UCM for GDR pre-registration.

- `CMakeLists.txt`, `setup.py`: Added a GDR build switch. Whether
`UCM_ENABLE_GDR_STREAM` is enabled is controlled by the `ENABLE_GDR`
environment variable.

- `ucm_config_example.yaml`: Added a `use_gdr` configuration example and
environment instructions required to enable GDR.

## Test

### Bandwidth Test

Unit: GBps

|        | CUDA  | GDR without Stream | GDR Stream |
| ------ | ----- | ------------------ | ---------- |
| 4KB    | 1.6   | 11.36              | 7.61       |
| 16KB   | 5.84  | 42.83              | 30.77      |
| 64KB   | 17.43 | 45.46              | 44.88      |
| 1024KB | 48.34 | 45.84              | 45.33      |

### GDR Switch Test

1. Build with `USE_GDR=0`, configuration file with `use_gdr: false`:
uses CUDA stream and runs normally, as expected.

2. Build with `USE_GDR=0`, configuration file with `use_gdr: true`:
reports an error, as expected.

3. Build with `USE_GDR=1`, configuration file with `use_gdr: false`:
uses CUDA stream and runs normally, as expected.

4. Build with `USE_GDR=1`, configuration file with `use_gdr: true`: uses
GDR stream and runs normally, as expected.

### Online Inference Test

#### Test Configuration

The configuration file is as follows:

```bash
ucm_connectors:
  - ucm_connector_name: "UcmPipelineStore"
    ucm_connector_config:
      store_pipeline: "Cache|Posix"
      storage_backends: "/mnt/test"
      io_direct: false
      use_gdr: true
      share_buffer_enable: false
      cache_buffer_capacity_gb: 1

enable_event_sync: true
use_layerwise: true
enable_record_traces: false
use_lite: false
persist_token_threshold: 0
```

The test script is as follows. The input length is 1024, and the reuse
ratio is 80%.

```bash
vllm bench serve \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --endpoint /v1/completions \
  --model /home/models/DeepSeek-V2-Lite \
  --dataset-name random \
  --num-prompts 20 \
  --random-prefix-len 819 \
  --random-input-len 205 \
  --random-output-len 128 \
  --max-concurrency 1
```

#### Single-node Single-GPU Test

GDR transfer result:

```bash
============ Serving Benchmark Result ============
Successful requests:                     20        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  87.18     
Total input tokens:                      20460     
Total generated tokens:                  2560      
Request throughput (req/s):              0.23      
Output token throughput (tok/s):         29.36     
Peak output token throughput (tok/s):    31.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          264.05    
---------------Time to First Token----------------
Mean TTFT (ms):                          93.56     
Median TTFT (ms):                        80.61     
P99 TTFT (ms):                           289.95    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.58     
Median TPOT (ms):                        33.35     
P99 TPOT (ms):                           35.17     
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.58     
Median ITL (ms):                         33.55     
P99 ITL (ms):                            35.84     
==================================================
```

**Test Result**

The service runs normally, and the performance is normal.

#### Single-node Four-GPU DP2 + TP2 Test

GDR transfer result:

```bash
============ Serving Benchmark Result ============
Successful requests:                     20        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  13.18     
Total input tokens:                      20460     
Total generated tokens:                  2560      
Request throughput (req/s):              1.52      
Output token throughput (tok/s):         194.28    
Peak output token throughput (tok/s):    212.00    
Peak concurrent requests:                3.00      
Total token throughput (tok/s):          1747.03   
---------------Time to First Token----------------
Mean TTFT (ms):                          30.59     
Median TTFT (ms):                        23.00     
P99 TTFT (ms):                           102.96    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.94      
Median TPOT (ms):                        4.65      
P99 TPOT (ms):                           8.04      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.95      
Median ITL (ms):                         4.65      
P99 ITL (ms):                            4.93      
==================================================
```

**Test Result**

The service runs normally and performance is as expected: all 20 requests succeed with no failures, with a mean TTFT of 30.59 ms and a mean TPOT of 4.94 ms.
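To put the two GDR runs side by side, the sketch below computes simple ratios from the reported numbers. The dictionary names and keys are hypothetical labels chosen for illustration. The two runs use different parallelism setups (single GPU versus DP2 + TP2 on four GPUs), so the ratios only describe these two reports and are not a controlled GDR-versus-CUDA comparison.

```python
# Values copied from the two "Serving Benchmark Result" reports above.
single_gpu = {
    "duration_s": 87.18,
    "mean_ttft_ms": 93.56,
    "mean_tpot_ms": 33.58,
    "total_tok_s": 264.05,
}
four_gpu = {
    "duration_s": 13.18,
    "mean_ttft_ms": 30.59,
    "mean_tpot_ms": 4.94,
    "total_tok_s": 1747.03,
}

# Metrics where a larger value is better; all others are lower-is-better.
HIGHER_IS_BETTER = {"total_tok_s"}


def improvement_ratios(base, other):
    """Return, per metric, how many times better `other` is than `base`."""
    result = {}
    for key, base_value in base.items():
        if key in HIGHER_IS_BETTER:
            result[key] = other[key] / base_value
        else:
            result[key] = base_value / other[key]
    return result


ratios = improvement_ratios(single_gpu, four_gpu)
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.2f}x")
```

On these reported values, the four-GPU run finishes the same 20-request workload in roughly one sixth of the time. Mean TTFT improves by a smaller factor than mean TPOT.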

## Path Coverage

1. `cache store` has been integrated with `GdrStream`.

2. The normal `pcstore` path has been integrated with `GdrStream`.

3. The special scatter/gather path in `pcstore` has not been integrated
with `GdrStream`.

4. `nfsstore` has not been integrated with `GdrStream`.

Fengli5355 pushed a commit to Fengli5355/unified-cache-management that referenced this pull request Jun 1, 2026
…p#958)