[feat] Add H2D and D2H Path Based on GPUDirect RDMA#958
Merged
Conversation
ygwpz
previously approved these changes
May 14, 2026
ygwpz
left a comment
Contributor
There was a problem hiding this comment.
Code review completed. No critical issues found.
mag1c-h
approved these changes
May 15, 2026
ygwpz
reviewed
May 15, 2026
| cmake_args += ["-DBUILD_UCM_SPARSE=ON"] | ||
|
|
||
| if ENABLE_GDR: | ||
| cmake_args += ["-DUCM_ENABLE_GDR_STREAM=ON"] |
Fengli5355
pushed a commit
to Fengli5355/unified-cache-management
that referenced
this pull request
May 20, 2026
…p#958) ## Purpose 🔗 **Issue Link:** ModelEngine-Group#946 Traditional H2D and D2H transfers are performed through `cudaMemcpy`. The main drawbacks of `cudaMemcpy` are its relatively large fixed submission overhead and poor small-I/O transfer bandwidth. This proposal introduces a GPUDirect RDMA transfer path, referred to as **GDR** below. GDR is based on the NIC DMA engine and directly performs RDMA operations between GPU HBM and Host DRAM. The transfer path is: ```txt GPU <-> NIC <-> CPU ``` The architecture of GDR stream is shown in the figure below: <img width="2423" height="497" alt="image-20260429170635310" src="https://github.com/user-attachments/assets/5fe3eaf7-c725-4ad1-8e00-8645a0031e81" /> ## Modifications The detailed implementation specification is available in the issue ModelEngine-Group#946. ### trans Layer - `gdr_stream.h/cc`: Added `GdrStream`, which implements ordered asynchronous GDR copy through a scheduler thread and a completion thread. - `device.h`, `cuda_device.cc`, `ascend_device.cc`, `simu_device.cc`, `trans.py.cc`: Added `MakeGdrStream` to the general Device interface and Python bindings. The CUDA backend supports creating GDR streams, while other backends provide compatible implementations. - `cuda_buffer.cc`, `gdr_mr_buffer.h/cc`: Added GDR MR registration, query, resolution, and release management for CUDA host/device buffers. - `gdr_config.h/cc`: Added GDR NIC selection logic and GPU KV buffer pre-registration configuration management. - `gdr_copy.h/cc`: Added an ibverbs-based asynchronous GDR copy channel, supporting H2D/D2H copy submission, completion polling, and buffer MR management. ### store Layer - `cache_store.cc`, `global_config.h`: Added `use_gdr` to CacheStore, along with GPU KV buffer configuration parsing, validation, and pre-registration flow. - `copy_stream.h`, `dump_queue.h/cc`, `load_queue.h/cc`: The dump/load transfer path in CacheStore now supports selecting either a normal CUDA stream or a GDR stream based on `use_gdr`. - `pcstore.h/cc`: Added `transferUseGdr` and GPU KV buffer range fields to the PcStore configuration, and completed validation, pre-registration, and configuration forwarding during initialization. - `pcstore.py.cc`, `pcstore_connector.py`, `pcstore_connector_v1.py`: Added `use_gdr`, GPU KV buffer address, and size configuration mappings to the PcStore Python bindings and connectors. - `trans_manager.h/cc`, `trans_queue.h/cc`, `trans_share_queue.h/cc`: Added GDR switch forwarding to the PcStore transfer layer, and create GDR streams in both normal queues and shared queues based on the configuration. ### connector Layer - `ucm_connector.py`: Collects GPU buffer addresses and sizes of the vLLM KV cache, and passes them to UCM for GDR pre-registration. - `CMakeLists.txt`, `setup.py`: Added a GDR build switch. Whether `UCM_ENABLE_GDR_STREAM` is enabled is controlled by the `ENABLE_GDR` environment variable. - `ucm_config_example.yaml`: Added a `use_gdr` configuration example and environment instructions required to enable GDR. ## Test ### Bandwidth Test Unit: GBps | | CUDA | GDR without Stream | GDR Stream | | ------ | ----- | ------------------ | ---------- | | 4KB | 1.6 | 11.36 | 7.61 | | 16KB | 5.84 | 42.83 | 30.77 | | 64KB | 17.43 | 45.46 | 44.88 | | 1024KB | 48.34 | 45.84 | 45.33 | ### GDR Switch Test 1. Build with `USE_GDR=0`, configuration file with `use_gdr: false`: uses CUDA stream and runs normally, as expected. 2. Build with `USE_GDR=0`, configuration file with `use_gdr: true`: reports an error, as expected. 3. Build with `USE_GDR=1`, configuration file with `use_gdr: false`: uses CUDA stream and runs normally, as expected. 4. Build with `USE_GDR=1`, configuration file with `use_gdr: true`: uses GDR stream and runs normally, as expected. ### Online Inference Test #### Test Configuration The configuration file is as follows: ```bash ucm_connectors: - ucm_connector_name: "UcmPipelineStore" ucm_connector_config: store_pipeline: "Cache|Posix" storage_backends: "/mnt/test" io_direct: false use_gdr: true share_buffer_enable: false cache_buffer_capacity_gb: 1 enable_event_sync: true use_layerwise: true enable_record_traces: false use_lite: false persist_token_threshold: 0 ``` The test script is as follows. The input length is 1024, and the reuse ratio is 80%. ```bash vllm bench serve \ --backend vllm \ --base-url http://127.0.0.1:8000 \ --endpoint /v1/completions \ --model /home/models/DeepSeek-V2-Lite \ --dataset-name random \ --num-prompts 20 \ --random-prefix-len 819 \ --random-input-len 205 \ --random-output-len 128 \ --max-concurrency 1 ``` #### Single-node Single-GPU Test GDR transfer result: ```bash ============ Serving Benchmark Result ============ Successful requests: 20 Failed requests: 0 Maximum request concurrency: 1 Benchmark duration (s): 87.18 Total input tokens: 20460 Total generated tokens: 2560 Request throughput (req/s): 0.23 Output token throughput (tok/s): 29.36 Peak output token throughput (tok/s): 31.00 Peak concurrent requests: 2.00 Total token throughput (tok/s): 264.05 ---------------Time to First Token---------------- Mean TTFT (ms): 93.56 Median TTFT (ms): 80.61 P99 TTFT (ms): 289.95 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 33.58 Median TPOT (ms): 33.35 P99 TPOT (ms): 35.17 ---------------Inter-token Latency---------------- Mean ITL (ms): 33.58 Median ITL (ms): 33.55 P99 ITL (ms): 35.84 ================================================== ``` **Test Result** The service runs normally, and the performance is normal. #### Single-node Four-GPU DP2 + TP2 Test GDR transfer result: ```bash ============ Serving Benchmark Result ============ Successful requests: 20 Failed requests: 0 Maximum request concurrency: 1 Benchmark duration (s): 13.18 Total input tokens: 20460 Total generated tokens: 2560 Request throughput (req/s): 1.52 Output token throughput (tok/s): 194.28 Peak output token throughput (tok/s): 212.00 Peak concurrent requests: 3.00 Total token throughput (tok/s): 1747.03 ---------------Time to First Token---------------- Mean TTFT (ms): 30.59 Median TTFT (ms): 23.00 P99 TTFT (ms): 102.96 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 4.94 Median TPOT (ms): 4.65 P99 TPOT (ms): 8.04 ---------------Inter-token Latency---------------- Mean ITL (ms): 4.95 Median ITL (ms): 4.65 P99 ITL (ms): 4.93 ================================================== ``` **Test Result** The service runs normally, and the performance is normal. ## Path Coverage 1. `cache store` has been integrated with `GdrStream`. 2. The normal `pcstore` path has been integrated with `GdrStream`. 3. The special scatter/gather path in `pcstore` has not been integrated with `GdrStream`. 4. `nfsstore` has not been integrated with `GdrStream`.
Frankielx
pushed a commit
to Frankielx/unified-cache-management
that referenced
this pull request
May 26, 2026
…p#958) ## Purpose 🔗 **Issue Link:** ModelEngine-Group#946 Traditional H2D and D2H transfers are performed through `cudaMemcpy`. The main drawbacks of `cudaMemcpy` are its relatively large fixed submission overhead and poor small-I/O transfer bandwidth. This proposal introduces a GPUDirect RDMA transfer path, referred to as **GDR** below. GDR is based on the NIC DMA engine and directly performs RDMA operations between GPU HBM and Host DRAM. The transfer path is: ```txt GPU <-> NIC <-> CPU ``` The architecture of GDR stream is shown in the figure below: <img width="2423" height="497" alt="image-20260429170635310" src="https://github.com/user-attachments/assets/5fe3eaf7-c725-4ad1-8e00-8645a0031e81" /> ## Modifications The detailed implementation specification is available in the issue ModelEngine-Group#946. ### trans Layer - `gdr_stream.h/cc`: Added `GdrStream`, which implements ordered asynchronous GDR copy through a scheduler thread and a completion thread. - `device.h`, `cuda_device.cc`, `ascend_device.cc`, `simu_device.cc`, `trans.py.cc`: Added `MakeGdrStream` to the general Device interface and Python bindings. The CUDA backend supports creating GDR streams, while other backends provide compatible implementations. - `cuda_buffer.cc`, `gdr_mr_buffer.h/cc`: Added GDR MR registration, query, resolution, and release management for CUDA host/device buffers. - `gdr_config.h/cc`: Added GDR NIC selection logic and GPU KV buffer pre-registration configuration management. - `gdr_copy.h/cc`: Added an ibverbs-based asynchronous GDR copy channel, supporting H2D/D2H copy submission, completion polling, and buffer MR management. ### store Layer - `cache_store.cc`, `global_config.h`: Added `use_gdr` to CacheStore, along with GPU KV buffer configuration parsing, validation, and pre-registration flow. - `copy_stream.h`, `dump_queue.h/cc`, `load_queue.h/cc`: The dump/load transfer path in CacheStore now supports selecting either a normal CUDA stream or a GDR stream based on `use_gdr`. - `pcstore.h/cc`: Added `transferUseGdr` and GPU KV buffer range fields to the PcStore configuration, and completed validation, pre-registration, and configuration forwarding during initialization. - `pcstore.py.cc`, `pcstore_connector.py`, `pcstore_connector_v1.py`: Added `use_gdr`, GPU KV buffer address, and size configuration mappings to the PcStore Python bindings and connectors. - `trans_manager.h/cc`, `trans_queue.h/cc`, `trans_share_queue.h/cc`: Added GDR switch forwarding to the PcStore transfer layer, and create GDR streams in both normal queues and shared queues based on the configuration. ### connector Layer - `ucm_connector.py`: Collects GPU buffer addresses and sizes of the vLLM KV cache, and passes them to UCM for GDR pre-registration. - `CMakeLists.txt`, `setup.py`: Added a GDR build switch. Whether `UCM_ENABLE_GDR_STREAM` is enabled is controlled by the `ENABLE_GDR` environment variable. - `ucm_config_example.yaml`: Added a `use_gdr` configuration example and environment instructions required to enable GDR. ## Test ### Bandwidth Test Unit: GBps | | CUDA | GDR without Stream | GDR Stream | | ------ | ----- | ------------------ | ---------- | | 4KB | 1.6 | 11.36 | 7.61 | | 16KB | 5.84 | 42.83 | 30.77 | | 64KB | 17.43 | 45.46 | 44.88 | | 1024KB | 48.34 | 45.84 | 45.33 | ### GDR Switch Test 1. Build with `USE_GDR=0`, configuration file with `use_gdr: false`: uses CUDA stream and runs normally, as expected. 2. Build with `USE_GDR=0`, configuration file with `use_gdr: true`: reports an error, as expected. 3. Build with `USE_GDR=1`, configuration file with `use_gdr: false`: uses CUDA stream and runs normally, as expected. 4. Build with `USE_GDR=1`, configuration file with `use_gdr: true`: uses GDR stream and runs normally, as expected. ### Online Inference Test #### Test Configuration The configuration file is as follows: ```bash ucm_connectors: - ucm_connector_name: "UcmPipelineStore" ucm_connector_config: store_pipeline: "Cache|Posix" storage_backends: "/mnt/test" io_direct: false use_gdr: true share_buffer_enable: false cache_buffer_capacity_gb: 1 enable_event_sync: true use_layerwise: true enable_record_traces: false use_lite: false persist_token_threshold: 0 ``` The test script is as follows. The input length is 1024, and the reuse ratio is 80%. ```bash vllm bench serve \ --backend vllm \ --base-url http://127.0.0.1:8000 \ --endpoint /v1/completions \ --model /home/models/DeepSeek-V2-Lite \ --dataset-name random \ --num-prompts 20 \ --random-prefix-len 819 \ --random-input-len 205 \ --random-output-len 128 \ --max-concurrency 1 ``` #### Single-node Single-GPU Test GDR transfer result: ```bash ============ Serving Benchmark Result ============ Successful requests: 20 Failed requests: 0 Maximum request concurrency: 1 Benchmark duration (s): 87.18 Total input tokens: 20460 Total generated tokens: 2560 Request throughput (req/s): 0.23 Output token throughput (tok/s): 29.36 Peak output token throughput (tok/s): 31.00 Peak concurrent requests: 2.00 Total token throughput (tok/s): 264.05 ---------------Time to First Token---------------- Mean TTFT (ms): 93.56 Median TTFT (ms): 80.61 P99 TTFT (ms): 289.95 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 33.58 Median TPOT (ms): 33.35 P99 TPOT (ms): 35.17 ---------------Inter-token Latency---------------- Mean ITL (ms): 33.58 Median ITL (ms): 33.55 P99 ITL (ms): 35.84 ================================================== ``` **Test Result** The service runs normally, and the performance is normal. #### Single-node Four-GPU DP2 + TP2 Test GDR transfer result: ```bash ============ Serving Benchmark Result ============ Successful requests: 20 Failed requests: 0 Maximum request concurrency: 1 Benchmark duration (s): 13.18 Total input tokens: 20460 Total generated tokens: 2560 Request throughput (req/s): 1.52 Output token throughput (tok/s): 194.28 Peak output token throughput (tok/s): 212.00 Peak concurrent requests: 3.00 Total token throughput (tok/s): 1747.03 ---------------Time to First Token---------------- Mean TTFT (ms): 30.59 Median TTFT (ms): 23.00 P99 TTFT (ms): 102.96 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 4.94 Median TPOT (ms): 4.65 P99 TPOT (ms): 8.04 ---------------Inter-token Latency---------------- Mean ITL (ms): 4.95 Median ITL (ms): 4.65 P99 ITL (ms): 4.93 ================================================== ``` **Test Result** The service runs normally, and the performance is normal. ## Path Coverage 1. `cache store` has been integrated with `GdrStream`. 2. The normal `pcstore` path has been integrated with `GdrStream`. 3. The special scatter/gather path in `pcstore` has not been integrated with `GdrStream`. 4. `nfsstore` has not been integrated with `GdrStream`.
Fengli5355
pushed a commit
to Fengli5355/unified-cache-management
that referenced
this pull request
May 29, 2026
…p#958) ## Purpose 🔗 **Issue Link:** ModelEngine-Group#946 Traditional H2D and D2H transfers are performed through `cudaMemcpy`. The main drawbacks of `cudaMemcpy` are its relatively large fixed submission overhead and poor small-I/O transfer bandwidth. This proposal introduces a GPUDirect RDMA transfer path, referred to as **GDR** below. GDR is based on the NIC DMA engine and directly performs RDMA operations between GPU HBM and Host DRAM. The transfer path is: ```txt GPU <-> NIC <-> CPU ``` The architecture of GDR stream is shown in the figure below: <img width="2423" height="497" alt="image-20260429170635310" src="https://github.com/user-attachments/assets/5fe3eaf7-c725-4ad1-8e00-8645a0031e81" /> ## Modifications The detailed implementation specification is available in the issue ModelEngine-Group#946. ### trans Layer - `gdr_stream.h/cc`: Added `GdrStream`, which implements ordered asynchronous GDR copy through a scheduler thread and a completion thread. - `device.h`, `cuda_device.cc`, `ascend_device.cc`, `simu_device.cc`, `trans.py.cc`: Added `MakeGdrStream` to the general Device interface and Python bindings. The CUDA backend supports creating GDR streams, while other backends provide compatible implementations. - `cuda_buffer.cc`, `gdr_mr_buffer.h/cc`: Added GDR MR registration, query, resolution, and release management for CUDA host/device buffers. - `gdr_config.h/cc`: Added GDR NIC selection logic and GPU KV buffer pre-registration configuration management. - `gdr_copy.h/cc`: Added an ibverbs-based asynchronous GDR copy channel, supporting H2D/D2H copy submission, completion polling, and buffer MR management. ### store Layer - `cache_store.cc`, `global_config.h`: Added `use_gdr` to CacheStore, along with GPU KV buffer configuration parsing, validation, and pre-registration flow. - `copy_stream.h`, `dump_queue.h/cc`, `load_queue.h/cc`: The dump/load transfer path in CacheStore now supports selecting either a normal CUDA stream or a GDR stream based on `use_gdr`. - `pcstore.h/cc`: Added `transferUseGdr` and GPU KV buffer range fields to the PcStore configuration, and completed validation, pre-registration, and configuration forwarding during initialization. - `pcstore.py.cc`, `pcstore_connector.py`, `pcstore_connector_v1.py`: Added `use_gdr`, GPU KV buffer address, and size configuration mappings to the PcStore Python bindings and connectors. - `trans_manager.h/cc`, `trans_queue.h/cc`, `trans_share_queue.h/cc`: Added GDR switch forwarding to the PcStore transfer layer, and create GDR streams in both normal queues and shared queues based on the configuration. ### connector Layer - `ucm_connector.py`: Collects GPU buffer addresses and sizes of the vLLM KV cache, and passes them to UCM for GDR pre-registration. - `CMakeLists.txt`, `setup.py`: Added a GDR build switch. Whether `UCM_ENABLE_GDR_STREAM` is enabled is controlled by the `ENABLE_GDR` environment variable. - `ucm_config_example.yaml`: Added a `use_gdr` configuration example and environment instructions required to enable GDR. ## Test ### Bandwidth Test Unit: GBps | | CUDA | GDR without Stream | GDR Stream | | ------ | ----- | ------------------ | ---------- | | 4KB | 1.6 | 11.36 | 7.61 | | 16KB | 5.84 | 42.83 | 30.77 | | 64KB | 17.43 | 45.46 | 44.88 | | 1024KB | 48.34 | 45.84 | 45.33 | ### GDR Switch Test 1. Build with `USE_GDR=0`, configuration file with `use_gdr: false`: uses CUDA stream and runs normally, as expected. 2. Build with `USE_GDR=0`, configuration file with `use_gdr: true`: reports an error, as expected. 3. Build with `USE_GDR=1`, configuration file with `use_gdr: false`: uses CUDA stream and runs normally, as expected. 4. Build with `USE_GDR=1`, configuration file with `use_gdr: true`: uses GDR stream and runs normally, as expected. ### Online Inference Test #### Test Configuration The configuration file is as follows: ```bash ucm_connectors: - ucm_connector_name: "UcmPipelineStore" ucm_connector_config: store_pipeline: "Cache|Posix" storage_backends: "/mnt/test" io_direct: false use_gdr: true share_buffer_enable: false cache_buffer_capacity_gb: 1 enable_event_sync: true use_layerwise: true enable_record_traces: false use_lite: false persist_token_threshold: 0 ``` The test script is as follows. The input length is 1024, and the reuse ratio is 80%. ```bash vllm bench serve \ --backend vllm \ --base-url http://127.0.0.1:8000 \ --endpoint /v1/completions \ --model /home/models/DeepSeek-V2-Lite \ --dataset-name random \ --num-prompts 20 \ --random-prefix-len 819 \ --random-input-len 205 \ --random-output-len 128 \ --max-concurrency 1 ``` #### Single-node Single-GPU Test GDR transfer result: ```bash ============ Serving Benchmark Result ============ Successful requests: 20 Failed requests: 0 Maximum request concurrency: 1 Benchmark duration (s): 87.18 Total input tokens: 20460 Total generated tokens: 2560 Request throughput (req/s): 0.23 Output token throughput (tok/s): 29.36 Peak output token throughput (tok/s): 31.00 Peak concurrent requests: 2.00 Total token throughput (tok/s): 264.05 ---------------Time to First Token---------------- Mean TTFT (ms): 93.56 Median TTFT (ms): 80.61 P99 TTFT (ms): 289.95 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 33.58 Median TPOT (ms): 33.35 P99 TPOT (ms): 35.17 ---------------Inter-token Latency---------------- Mean ITL (ms): 33.58 Median ITL (ms): 33.55 P99 ITL (ms): 35.84 ================================================== ``` **Test Result** The service runs normally, and the performance is normal. #### Single-node Four-GPU DP2 + TP2 Test GDR transfer result: ```bash ============ Serving Benchmark Result ============ Successful requests: 20 Failed requests: 0 Maximum request concurrency: 1 Benchmark duration (s): 13.18 Total input tokens: 20460 Total generated tokens: 2560 Request throughput (req/s): 1.52 Output token throughput (tok/s): 194.28 Peak output token throughput (tok/s): 212.00 Peak concurrent requests: 3.00 Total token throughput (tok/s): 1747.03 ---------------Time to First Token---------------- Mean TTFT (ms): 30.59 Median TTFT (ms): 23.00 P99 TTFT (ms): 102.96 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 4.94 Median TPOT (ms): 4.65 P99 TPOT (ms): 8.04 ---------------Inter-token Latency---------------- Mean ITL (ms): 4.95 Median ITL (ms): 4.65 P99 ITL (ms): 4.93 ================================================== ``` **Test Result** The service runs normally, and the performance is normal. ## Path Coverage 1. `cache store` has been integrated with `GdrStream`. 2. The normal `pcstore` path has been integrated with `GdrStream`. 3. The special scatter/gather path in `pcstore` has not been integrated with `GdrStream`. 4. `nfsstore` has not been integrated with `GdrStream`.
Fengli5355
pushed a commit
to Fengli5355/unified-cache-management
that referenced
this pull request
Jun 1, 2026
…p#958) ## Purpose 🔗 **Issue Link:** ModelEngine-Group#946 Traditional H2D and D2H transfers are performed through `cudaMemcpy`. The main drawbacks of `cudaMemcpy` are its relatively large fixed submission overhead and poor small-I/O transfer bandwidth. This proposal introduces a GPUDirect RDMA transfer path, referred to as **GDR** below. GDR is based on the NIC DMA engine and directly performs RDMA operations between GPU HBM and Host DRAM. The transfer path is: ```txt GPU <-> NIC <-> CPU ``` The architecture of GDR stream is shown in the figure below: <img width="2423" height="497" alt="image-20260429170635310" src="https://github.com/user-attachments/assets/5fe3eaf7-c725-4ad1-8e00-8645a0031e81" /> ## Modifications The detailed implementation specification is available in the issue ModelEngine-Group#946. ### trans Layer - `gdr_stream.h/cc`: Added `GdrStream`, which implements ordered asynchronous GDR copy through a scheduler thread and a completion thread. - `device.h`, `cuda_device.cc`, `ascend_device.cc`, `simu_device.cc`, `trans.py.cc`: Added `MakeGdrStream` to the general Device interface and Python bindings. The CUDA backend supports creating GDR streams, while other backends provide compatible implementations. - `cuda_buffer.cc`, `gdr_mr_buffer.h/cc`: Added GDR MR registration, query, resolution, and release management for CUDA host/device buffers. - `gdr_config.h/cc`: Added GDR NIC selection logic and GPU KV buffer pre-registration configuration management. - `gdr_copy.h/cc`: Added an ibverbs-based asynchronous GDR copy channel, supporting H2D/D2H copy submission, completion polling, and buffer MR management. ### store Layer - `cache_store.cc`, `global_config.h`: Added `use_gdr` to CacheStore, along with GPU KV buffer configuration parsing, validation, and pre-registration flow. - `copy_stream.h`, `dump_queue.h/cc`, `load_queue.h/cc`: The dump/load transfer path in CacheStore now supports selecting either a normal CUDA stream or a GDR stream based on `use_gdr`. - `pcstore.h/cc`: Added `transferUseGdr` and GPU KV buffer range fields to the PcStore configuration, and completed validation, pre-registration, and configuration forwarding during initialization. - `pcstore.py.cc`, `pcstore_connector.py`, `pcstore_connector_v1.py`: Added `use_gdr`, GPU KV buffer address, and size configuration mappings to the PcStore Python bindings and connectors. - `trans_manager.h/cc`, `trans_queue.h/cc`, `trans_share_queue.h/cc`: Added GDR switch forwarding to the PcStore transfer layer, and create GDR streams in both normal queues and shared queues based on the configuration. ### connector Layer - `ucm_connector.py`: Collects GPU buffer addresses and sizes of the vLLM KV cache, and passes them to UCM for GDR pre-registration. - `CMakeLists.txt`, `setup.py`: Added a GDR build switch. Whether `UCM_ENABLE_GDR_STREAM` is enabled is controlled by the `ENABLE_GDR` environment variable. - `ucm_config_example.yaml`: Added a `use_gdr` configuration example and environment instructions required to enable GDR. ## Test ### Bandwidth Test Unit: GBps | | CUDA | GDR without Stream | GDR Stream | | ------ | ----- | ------------------ | ---------- | | 4KB | 1.6 | 11.36 | 7.61 | | 16KB | 5.84 | 42.83 | 30.77 | | 64KB | 17.43 | 45.46 | 44.88 | | 1024KB | 48.34 | 45.84 | 45.33 | ### GDR Switch Test 1. Build with `USE_GDR=0`, configuration file with `use_gdr: false`: uses CUDA stream and runs normally, as expected. 2. Build with `USE_GDR=0`, configuration file with `use_gdr: true`: reports an error, as expected. 3. Build with `USE_GDR=1`, configuration file with `use_gdr: false`: uses CUDA stream and runs normally, as expected. 4. Build with `USE_GDR=1`, configuration file with `use_gdr: true`: uses GDR stream and runs normally, as expected. ### Online Inference Test #### Test Configuration The configuration file is as follows: ```bash ucm_connectors: - ucm_connector_name: "UcmPipelineStore" ucm_connector_config: store_pipeline: "Cache|Posix" storage_backends: "/mnt/test" io_direct: false use_gdr: true share_buffer_enable: false cache_buffer_capacity_gb: 1 enable_event_sync: true use_layerwise: true enable_record_traces: false use_lite: false persist_token_threshold: 0 ``` The test script is as follows. The input length is 1024, and the reuse ratio is 80%. ```bash vllm bench serve \ --backend vllm \ --base-url http://127.0.0.1:8000 \ --endpoint /v1/completions \ --model /home/models/DeepSeek-V2-Lite \ --dataset-name random \ --num-prompts 20 \ --random-prefix-len 819 \ --random-input-len 205 \ --random-output-len 128 \ --max-concurrency 1 ``` #### Single-node Single-GPU Test GDR transfer result: ```bash ============ Serving Benchmark Result ============ Successful requests: 20 Failed requests: 0 Maximum request concurrency: 1 Benchmark duration (s): 87.18 Total input tokens: 20460 Total generated tokens: 2560 Request throughput (req/s): 0.23 Output token throughput (tok/s): 29.36 Peak output token throughput (tok/s): 31.00 Peak concurrent requests: 2.00 Total token throughput (tok/s): 264.05 ---------------Time to First Token---------------- Mean TTFT (ms): 93.56 Median TTFT (ms): 80.61 P99 TTFT (ms): 289.95 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 33.58 Median TPOT (ms): 33.35 P99 TPOT (ms): 35.17 ---------------Inter-token Latency---------------- Mean ITL (ms): 33.58 Median ITL (ms): 33.55 P99 ITL (ms): 35.84 ================================================== ``` **Test Result** The service runs normally, and the performance is normal. #### Single-node Four-GPU DP2 + TP2 Test GDR transfer result: ```bash ============ Serving Benchmark Result ============ Successful requests: 20 Failed requests: 0 Maximum request concurrency: 1 Benchmark duration (s): 13.18 Total input tokens: 20460 Total generated tokens: 2560 Request throughput (req/s): 1.52 Output token throughput (tok/s): 194.28 Peak output token throughput (tok/s): 212.00 Peak concurrent requests: 3.00 Total token throughput (tok/s): 1747.03 ---------------Time to First Token---------------- Mean TTFT (ms): 30.59 Median TTFT (ms): 23.00 P99 TTFT (ms): 102.96 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 4.94 Median TPOT (ms): 4.65 P99 TPOT (ms): 8.04 ---------------Inter-token Latency---------------- Mean ITL (ms): 4.95 Median ITL (ms): 4.65 P99 ITL (ms): 4.93 ================================================== ``` **Test Result** The service runs normally, and the performance is normal. ## Path Coverage 1. `cache store` has been integrated with `GdrStream`. 2. The normal `pcstore` path has been integrated with `GdrStream`. 3. The special scatter/gather path in `pcstore` has not been integrated with `GdrStream`. 4. `nfsstore` has not been integrated with `GdrStream`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
🔗 Issue Link: #946
Traditional H2D and D2H transfers are performed through
cudaMemcpy. The main drawbacks ofcudaMemcpyare its relatively large fixed submission overhead and poor small-I/O transfer bandwidth.This proposal introduces a GPUDirect RDMA transfer path, referred to as GDR below. GDR is based on the NIC DMA engine and directly performs RDMA operations between GPU HBM and Host DRAM. The transfer path is:
The architecture of GDR stream is shown in the figure below:
Modifications
The detailed implementation specification is available in the issue #946.
trans Layer
gdr_stream.h/cc: AddedGdrStream, which implements ordered asynchronous GDR copy through a scheduler thread and a completion thread.device.h,cuda_device.cc,ascend_device.cc,simu_device.cc,trans.py.cc: AddedMakeGdrStreamto the general Device interface and Python bindings. The CUDA backend supports creating GDR streams, while other backends provide compatible implementations.cuda_buffer.cc,gdr_mr_buffer.h/cc: Added GDR MR registration, query, resolution, and release management for CUDA host/device buffers.gdr_config.h/cc: Added GDR NIC selection logic and GPU KV buffer pre-registration configuration management.gdr_copy.h/cc: Added an ibverbs-based asynchronous GDR copy channel, supporting H2D/D2H copy submission, completion polling, and buffer MR management.store Layer
cache_store.cc,global_config.h: Addeduse_gdrto CacheStore, along with GPU KV buffer configuration parsing, validation, and pre-registration flow.copy_stream.h,dump_queue.h/cc,load_queue.h/cc: The dump/load transfer path in CacheStore now supports selecting either a normal CUDA stream or a GDR stream based onuse_gdr.pcstore.h/cc: AddedtransferUseGdrand GPU KV buffer range fields to the PcStore configuration, and completed validation, pre-registration, and configuration forwarding during initialization.pcstore.py.cc,pcstore_connector.py,pcstore_connector_v1.py: Addeduse_gdr, GPU KV buffer address, and size configuration mappings to the PcStore Python bindings and connectors.trans_manager.h/cc,trans_queue.h/cc,trans_share_queue.h/cc: Added GDR switch forwarding to the PcStore transfer layer, and create GDR streams in both normal queues and shared queues based on the configuration.connector Layer
ucm_connector.py: Collects GPU buffer addresses and sizes of the vLLM KV cache, and passes them to UCM for GDR pre-registration.CMakeLists.txt,setup.py: Added a GDR build switch. WhetherUCM_ENABLE_GDR_STREAMis enabled is controlled by theENABLE_GDRenvironment variable.ucm_config_example.yaml: Added ause_gdrconfiguration example and environment instructions required to enable GDR.Test
Bandwidth Test
Unit: GBps
GDR Switch Test
Build with
USE_GDR=0, configuration file withuse_gdr: false: uses CUDA stream and runs normally, as expected.Build with
USE_GDR=0, configuration file withuse_gdr: true: reports an error, as expected.Build with
USE_GDR=1, configuration file withuse_gdr: false: uses CUDA stream and runs normally, as expected.Build with
USE_GDR=1, configuration file withuse_gdr: true: uses GDR stream and runs normally, as expected.Online Inference Test
Test Configuration
The configuration file is as follows:
The test script is as follows. The input length is 1024, and the reuse ratio is 80%.
Single-node Single-GPU Test
GDR transfer result:
Test Result
The service runs normally, and the performance is normal.
Single-node Four-GPU DP2 + TP2 Test
GDR transfer result:
Test Result
The service runs normally, and the performance is normal.
Path Coverage
cache storehas been integrated withGdrStream.The normal
pcstorepath has been integrated withGdrStream.The special scatter/gather path in
pcstorehas not been integrated withGdrStream.nfsstorehas not been integrated withGdrStream.