[feat] Add H2D and D2H Path Based on GPUDirect RDMA by relat-ivity · Pull Request #958 · ModelEngine-Group/unified-cache-management

relat-ivity · 2026-05-14T03:39:32Z

Purpose

🔗 Issue Link: #946

Traditional host-to-device (H2D) and device-to-host (D2H) transfers are performed through `cudaMemcpy`, whose main drawbacks are a relatively large fixed submission overhead and poor bandwidth for small I/O.

This proposal introduces a GPUDirect RDMA transfer path, referred to as GDR below. GDR is based on the NIC DMA engine and directly performs RDMA operations between GPU HBM and Host DRAM. The transfer path is:

GPU <-> NIC <-> CPU

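The architecture of the GDR stream is shown in the figure below:

[Figure: GDR stream architecture, https://github.com/user-attachments/assets/5fe3eaf7-c725-4ad1-8e00-8645a0031e81]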

Modifications

The detailed implementation specification is available in issue #946.

trans Layer

- `gdr_stream.h/cc`: Added `GdrStream`, which implements ordered asynchronous GDR copy through a scheduler thread and a completion thread.
- `device.h`, `cuda_device.cc`, `ascend_device.cc`, `simu_device.cc`, `trans.py.cc`: Added `MakeGdrStream` to the general Device interface and Python bindings. The CUDA backend supports creating GDR streams, while other backends provide compatible implementations.
- `cuda_buffer.cc`, `gdr_mr_buffer.h/cc`: Added GDR MR registration, query, resolution, and release management for CUDA host/device buffers.
- `gdr_config.h/cc`: Added GDR NIC selection logic and GPU KV buffer pre-registration configuration management.
- `gdr_copy.h/cc`: Added an ibverbs-based asynchronous GDR copy channel, supporting H2D/D2H copy submission, completion polling, and buffer MR management.
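
The scheduler-thread and completion-thread split described for `GdrStream` can be pictured with a small model. The following is only a sketch in Python, not the C++ implementation in this PR: the class name `OrderedAsyncStream`, its methods, and the use of an in-memory copy in place of an RDMA operation are all assumptions made for illustration.

```python
import queue
import threading


class OrderedAsyncStream:
    """Toy model of an ordered asynchronous copy stream (hypothetical, not the UCM API)."""

    def __init__(self):
        self._pending = queue.Queue()    # copies submitted by the caller
        self._in_flight = queue.Queue()  # copies posted but not yet completed
        self.completed = []              # ids of destinations, in completion order
        threading.Thread(target=self._schedule, daemon=True).start()
        threading.Thread(target=self._complete, daemon=True).start()

    def copy_async(self, dst, src):
        """Enqueue a copy and return immediately."""
        self._pending.put((dst, src))

    def synchronize(self):
        """Block until every copy submitted so far has completed."""
        self._pending.join()
        self._in_flight.join()

    def _schedule(self):
        # Scheduler thread: takes submissions in FIFO order and "posts" them.
        while True:
            item = self._pending.get()
            self._in_flight.put(item)
            self._pending.task_done()

    def _complete(self):
        # Completion thread: finishes posted copies in the order they were posted.
        while True:
            dst, src = self._in_flight.get()
            dst[:] = src  # stands in for the DMA transfer finishing
            self.completed.append(id(dst))
            self._in_flight.task_done()


stream = OrderedAsyncStream()
buffers = [bytearray(4) for _ in range(8)]
for i, buf in enumerate(buffers):
    stream.copy_async(buf, bytes([i]) * 4)
stream.synchronize()
```

Because both queues are FIFO and each has a single consumer, copies complete in submission order, which is the property the word "ordered" refers to here.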

store Layer

- `cache_store.cc`, `global_config.h`: Added `use_gdr` to CacheStore, along with GPU KV buffer configuration parsing, validation, and pre-registration flow.
- `copy_stream.h`, `dump_queue.h/cc`, `load_queue.h/cc`: The dump/load transfer path in CacheStore now supports selecting either a normal CUDA stream or a GDR stream based on `use_gdr`.
- `pcstore.h/cc`: Added `transferUseGdr` and GPU KV buffer range fields to the PcStore configuration, and completed validation, pre-registration, and configuration forwarding during initialization.
- `pcstore.py.cc`, `pcstore_connector.py`, `pcstore_connector_v1.py`: Added `use_gdr`, GPU KV buffer address, and size configuration mappings to the PcStore Python bindings and connectors.
- `trans_manager.h/cc`, `trans_queue.h/cc`, `trans_share_queue.h/cc`: Added GDR switch forwarding to the PcStore transfer layer. GDR streams are now created in both normal queues and shared queues based on the configuration.

connector Layer

- `ucm_connector.py`: Collects GPU buffer addresses and sizes of the vLLM KV cache, and passes them to UCM for GDR pre-registration.
- `CMakeLists.txt`, `setup.py`: Added a GDR build switch. The `ENABLE_GDR` environment variable controls whether `UCM_ENABLE_GDR_STREAM` is enabled.
- `ucm_config_example.yaml`: Added a `use_gdr` configuration example and environment instructions required to enable GDR.
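
As a rough illustration of the build switch, the snippet below maps an environment variable to a CMake argument. The flag `-DUCM_ENABLE_GDR_STREAM=ON` is taken from the diff quoted in the review comment on this PR and may have been renamed by later commits. The helper name `gdr_cmake_args` and the rule that only the value `1` enables the switch are assumptions, not the actual `setup.py` logic.

```python
import os


def gdr_cmake_args(environ=None):
    """Return the extra CMake arguments implied by ENABLE_GDR (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # Assumption: only the literal value "1" turns the switch on.
    enable_gdr = env.get("ENABLE_GDR", "0") == "1"
    cmake_args = []
    if enable_gdr:
        cmake_args += ["-DUCM_ENABLE_GDR_STREAM=ON"]
    return cmake_args


print(gdr_cmake_args({"ENABLE_GDR": "1"}))
print(gdr_cmake_args({}))
```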

Test

Bandwidth Test

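Unit: GB/s

| Transfer size | CUDA | GDR without Stream | GDR Stream |
| ------------- | ----- | ------------------ | ---------- |
| 4KB | 1.6 | 11.36 | 7.61 |
| 16KB | 5.84 | 42.83 | 30.77 |
| 64KB | 17.43 | 45.46 | 44.88 |
| 1024KB | 48.34 | 45.84 | 45.33 |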
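
To make the table easier to read, the sketch below derives two quantities from it: the time each copy takes at the measured bandwidth, and the ratio of GDR Stream bandwidth to CUDA bandwidth. It assumes 1 KB = 1024 bytes and 1 GB = 10^9 bytes, which the PR does not state, so the absolute times are approximate.

```python
KIB = 1024

# Bandwidth in GB/s, copied from the table above.
bandwidth = {
    4 * KIB: {"cuda": 1.6, "gdr_no_stream": 11.36, "gdr_stream": 7.61},
    16 * KIB: {"cuda": 5.84, "gdr_no_stream": 42.83, "gdr_stream": 30.77},
    64 * KIB: {"cuda": 17.43, "gdr_no_stream": 45.46, "gdr_stream": 44.88},
    1024 * KIB: {"cuda": 48.34, "gdr_no_stream": 45.84, "gdr_stream": 45.33},
}


def time_per_copy_us(size_bytes, gb_per_s):
    """Time for one copy at the given bandwidth, in microseconds (assumes 1 GB = 1e9 bytes)."""
    return size_bytes / (gb_per_s * 1e9) * 1e6


def gdr_stream_speedup(size_bytes):
    """Ratio of GDR Stream bandwidth to CUDA bandwidth at one transfer size."""
    row = bandwidth[size_bytes]
    return row["gdr_stream"] / row["cuda"]


for size, row in bandwidth.items():
    print(
        f"{size // KIB:>5} KB: CUDA {time_per_copy_us(size, row['cuda']):7.2f} us per copy, "
        f"GDR Stream / CUDA = {gdr_stream_speedup(size):.2f}x"
    )
```

Under these assumptions a 4KB CUDA copy takes about 2.56 µs, and the GDR Stream path is roughly 4.8x faster at 4KB but slightly slower than CUDA at 1024KB, which matches the stated motivation of improving small-I/O bandwidth.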

GDR Switch Test

1. Build with `USE_GDR=0`, configuration file with `use_gdr: false`: uses CUDA stream and runs normally, as expected.
2. Build with `USE_GDR=0`, configuration file with `use_gdr: true`: reports an error, as expected.
3. Build with `USE_GDR=1`, configuration file with `use_gdr: false`: uses CUDA stream and runs normally, as expected.
4. Build with `USE_GDR=1`, configuration file with `use_gdr: true`: uses GDR stream and runs normally, as expected.
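
The four cases above amount to a small decision table. The function below is a sketch of that table only: `select_stream`, its parameters, and the error message are hypothetical and do not reproduce the checks in the C++ code.

```python
def select_stream(built_with_gdr: bool, use_gdr: bool) -> str:
    """Return which stream type the switch test expects for a build/config pair."""
    if not use_gdr:
        # The configuration does not request GDR, so the CUDA stream is used
        # regardless of how the binary was built.
        return "cuda"
    if not built_with_gdr:
        # The configuration requests GDR but the build lacks it: report an error.
        raise RuntimeError("use_gdr is true but this build does not include GDR support")
    return "gdr"


for built in (False, True):
    for requested in (False, True):
        try:
            outcome = select_stream(built, requested)
        except RuntimeError as err:
            outcome = f"error: {err}"
        print(f"built_with_gdr={built}, use_gdr={requested} -> {outcome}")
```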

Online Inference Test

Test Configuration

The configuration file is as follows:

ucm_connectors:
  - ucm_connector_name: "UcmPipelineStore"
    ucm_connector_config:
      store_pipeline: "Cache|Posix"
      storage_backends: "/mnt/test"
      io_direct: false
      use_gdr: true
      share_buffer_enable: false
      cache_buffer_capacity_gb: 1

enable_event_sync: true
use_layerwise: true
enable_record_traces: false
use_lite: false
persist_token_threshold: 0

The test script is as follows. The input length is 1024, and the reuse ratio is 80%.

vllm bench serve \
  --backend vllm \
  --base-url http://127.0.0.1:8000 \
  --endpoint /v1/completions \
  --model /home/models/DeepSeek-V2-Lite \
  --dataset-name random \
  --num-prompts 20 \
  --random-prefix-len 819 \
  --random-input-len 205 \
  --random-output-len 128 \
  --max-concurrency 1
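
The quoted input length and reuse ratio follow from the two length flags in the command. A quick check, with the values copied from the command above:

```python
# Values copied from the vllm bench serve command above.
random_prefix_len = 819  # --random-prefix-len: prefix tokens placed before the random part
random_input_len = 205   # --random-input-len: random tokens following the prefix

total_input_len = random_prefix_len + random_input_len
reuse_ratio = random_prefix_len / total_input_len

print(total_input_len)       # 1024
print(f"{reuse_ratio:.1%}")  # 80.0%
```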

Single-node Single-GPU Test

GDR transfer result:

============ Serving Benchmark Result ============
Successful requests:                     20        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  87.18     
Total input tokens:                      20460     
Total generated tokens:                  2560      
Request throughput (req/s):              0.23      
Output token throughput (tok/s):         29.36     
Peak output token throughput (tok/s):    31.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          264.05    
---------------Time to First Token----------------
Mean TTFT (ms):                          93.56     
Median TTFT (ms):                        80.61     
P99 TTFT (ms):                           289.95    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.58     
Median TPOT (ms):                        33.35     
P99 TPOT (ms):                           35.17     
---------------Inter-token Latency----------------
Mean ITL (ms):                           33.58     
Median ITL (ms):                         33.55     
P99 ITL (ms):                            35.84     
==================================================

Test Result

The service ran normally: all 20 requests succeeded with no failures, with a mean TTFT of 93.56 ms and a mean TPOT of 33.58 ms.

Single-node Four-GPU DP2 + TP2 Test

GDR transfer result:

============ Serving Benchmark Result ============
Successful requests:                     20        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  13.18     
Total input tokens:                      20460     
Total generated tokens:                  2560      
Request throughput (req/s):              1.52      
Output token throughput (tok/s):         194.28    
Peak output token throughput (tok/s):    212.00    
Peak concurrent requests:                3.00      
Total token throughput (tok/s):          1747.03   
---------------Time to First Token----------------
Mean TTFT (ms):                          30.59     
Median TTFT (ms):                        23.00     
P99 TTFT (ms):                           102.96    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.94      
Median TPOT (ms):                        4.65      
P99 TPOT (ms):                           8.04      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.95      
Median ITL (ms):                         4.65      
P99 ITL (ms):                            4.93      
==================================================

Test Result

The service ran normally: all 20 requests succeeded with no failures, with a mean TTFT of 30.59 ms and a mean TPOT of 4.94 ms.
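
The throughput lines in both benchmark reports can be cross-checked against the request counts, token counts, and durations printed alongside them. The helper below is a sketch for that check. The name `derived_throughput` is hypothetical, and because the reported durations are rounded to two decimals, the derived values only match the reported ones approximately.

```python
def derived_throughput(requests, input_tokens, output_tokens, duration_s):
    """Recompute throughput figures from counts and the benchmark duration."""
    return {
        "req_per_s": requests / duration_s,
        "output_tok_per_s": output_tokens / duration_s,
        "total_tok_per_s": (input_tokens + output_tokens) / duration_s,
    }


# Counts and durations copied from the two benchmark reports above.
single_gpu = derived_throughput(20, 20460, 2560, 87.18)
four_gpu = derived_throughput(20, 20460, 2560, 13.18)

for name, result in (("single GPU", single_gpu), ("four GPU", four_gpu)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```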

Path Coverage

1. `cache store` has been integrated with `GdrStream`.
2. The normal `pcstore` path has been integrated with `GdrStream`.
3. The special scatter/gather path in `pcstore` has not been integrated with `GdrStream`.
4. `nfsstore` has not been integrated with `GdrStream`.

ygwpz

Code review completed. No critical issues found.

ygwpz · 2026-05-15T01:49:05Z

            cmake_args += ["-DBUILD_UCM_SPARSE=ON"]

+        if ENABLE_GDR:
+            cmake_args += ["-DUCM_ENABLE_GDR_STREAM=ON"]


don't use "UCMXX" here


relat-ivity added 30 commits May 14, 2026 11:18

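Add GdrStream

7923fc1

Emulate a stream with GDR

9d10a06

Modify GPU MR registration logic

0040005

Add NIC configuration

e162212

MIT license

ce420e6

Commit test files

0a2d942

Fix tests

7ec61e1

Fix tests (2)

feb1df7

Fix tests (3)

13206f5

Protect asynchronous submission with atomics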

2c4647c

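GDR bandwidth test

560d347

GDR bandwidth test

50ee13b

GDR bandwidth test without stream

3d7b8e4

Reduce the overhead of calling notify_all on every copy

c0390b1

SPSC lock-free optimization

6b77bb5

Fix stream hang

86a0b9d

Fix hang in copy test

152dfe1

Fix stream hang

ef6cf8c

Stream logging test

0d26f90

Fix stream hang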

08e681c

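Change hang prevention to a short spin and remove logs

a9fa564

Change scheduler thread to a busy loop

e5b3cb8

Remove MR hash logic

59843e9

Add GDR switch

328061f

Add build switch to setup

155cfec

Remove test files and add GDR logs

dea01d5

Fix formatting of the lines added to setup

6e7231e

Remove GDR transfer logs

76b40dc

Change scheduler thread busy loop to a short spin

bd64729

GDR: remove cpurelax and sync spec ring logic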

0a79058

relat-ivity added 3 commits May 14, 2026 11:20

Change NIC configuration to an environment variable

26b270b

Modify UCM build variable

e205101

Configuration example

b06fceb

relat-ivity requested review from FangRun2, Infinite666, Tarrei, harrisonyhq, mag1c-h, qyh111 and ygwpz as code owners May 14, 2026 03:39

relat-ivity changed the title ~~[feat] Gdr stream~~ [feat] Add H2D and D2H Path Based on GPUDirect RDMA May 14, 2026

ygwpz previously approved these changes May 14, 2026


Refactor build switch settings

34ab0cd

relat-ivity dismissed ygwpz’s stale review via 34ab0cd May 14, 2026 06:45

Fix clang-format formatting issues

5a5dd5d

mag1c-h linked an issue May 15, 2026 that may be closed by this pull request

[RFC]: Add H2D and D2H Path Based on GPUDirect RDMA #946

Closed

mag1c-h approved these changes May 15, 2026


mag1c-h merged commit 0f88e34 into ModelEngine-Group:develop May 15, 2026
21 of 24 checks passed

ygwpz reviewed May 15, 2026


wangwenxin0312 mentioned this pull request May 18, 2026

[bugfix] buffer size & request_finished_all_groups fix #963

Merged

relat-ivity deleted the gdr-stream branch May 28, 2026 02:03

relat-ivity mentioned this pull request May 28, 2026

[feat] hma connector supports GPU buffer MR for GPUDirct RDMA #981

Open

mag1c-h merged 35 commits into ModelEngine-Group:develop from relat-ivity:gdr-stream
