
pci: vfio: Implement migration v2 for snapshot and restore #8303

Draft
saravan2 wants to merge 6 commits into cloud-hypervisor:main from saravan2:vfio-migration


@saravan2 saravan2 commented May 30, 2026

Member

Summary

Adds VFIO v2 migration protocol support for same-host snapshot and restore of migratable VFIO devices, for example ConnectX VFs bound to mlx5_vfio_pci. Non-migratable devices retain their existing snapshot behavior. Live migration with VFIO devices, DMA dirty page tracking, and precopy iteration are follow-up work and are not in this PR.

VFIO Migration v2 State Transitions

| Phase | Action | Device State Transition |
|---|---|---|
| Save | pause() | RUNNING → STOP |
| Save | snapshot() requests data | STOP → STOP_COPY |
| Save | snapshot() drains data_fd (device blob) | no transition |
| Save | snapshot() finalizes | STOP_COPY → STOP |
| Restore | load_migration_data() requests data_fd | RUNNING → RESUMING (*) |
| Restore | load_migration_data() writes device blob | no transition |
| Restore | post load fixups (PCI_COMMAND, rearm MSI or MSI-X) | no transition |
| Resume | resume() after save (optional) | STOP → RUNNING |
| Resume | resume() after restore (mandatory) | RESUMING → RUNNING (*) |

(*) Each arrow (→) is one VFIO_DEVICE_FEATURE ioctl that sets VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE.
Non-adjacent transitions such as RUNNING → RESUMING and RESUMING → RUNNING must pass through STOP per the protocol. The host kernel walks that intermediate STOP state internally via the variant driver (mlx5_vfio_pci), so this implementation does not issue an explicit STOP in the restore and resume phases. This follows QEMU's VFIO migration ioctl sequence.
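The sequence in the table can be modeled as a small sketch. This is not code from the PR: `MigState`, `Phase`, `requested_states`, and `replay` are hypothetical names used only to show which target state each phase requests, with one entry per state-change request.

```rust
/// Illustrative subset of the migration v2 device states used in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MigState {
    Stop,
    Running,
    StopCopy,
    Resuming,
}

/// Hypothetical phases, named after the rows of the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Pause,
    Snapshot,
    Restore,
    ResumeAfterSave,
    ResumeAfterRestore,
}

/// Target states requested in each phase, one entry per state-change request.
fn requested_states(phase: Phase) -> &'static [MigState] {
    match phase {
        Phase::Pause => &[MigState::Stop],
        // Request the data stream, drain it, then return to STOP.
        Phase::Snapshot => &[MigState::StopCopy, MigState::Stop],
        // A single request; the intermediate STOP arc is left to the kernel.
        Phase::Restore => &[MigState::Resuming],
        Phase::ResumeAfterSave | Phase::ResumeAfterRestore => &[MigState::Running],
    }
}

/// Replays phases from an initial state and returns every (from, to) arc.
fn replay(initial: MigState, phases: &[Phase]) -> Vec<(MigState, MigState)> {
    let mut current = initial;
    let mut arcs = Vec::new();
    for phase in phases {
        for &next in requested_states(*phase) {
            arcs.push((current, next));
            current = next;
        }
    }
    arcs
}
```

Replaying `[Pause, Snapshot, ResumeAfterSave]` from `Running` yields the four save-side arcs in the table, and `[Restore, ResumeAfterRestore]` yields the two arcs marked with (*).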

Test

  • Unit Tests
cargo test -p pci --lib vfio::tests
   Compiling log v0.4.30
   Compiling either v1.16.0
   Compiling serde_json v1.0.150
   Compiling vm-allocator v0.1.0 (/home/saravanand/cloud-hypervisor/vm-allocator)
   Compiling vfio-ioctls v0.6.1
   Compiling itertools v0.14.0
   Compiling hypervisor v0.1.0 (/home/saravanand/cloud-hypervisor/hypervisor)
   Compiling vfio_user v0.1.3
   Compiling vm-device v0.1.0 (/home/saravanand/cloud-hypervisor/vm-device)
   Compiling vm-migration v0.1.0 (/home/saravanand/cloud-hypervisor/vm-migration)
   Compiling pci v0.1.0 (/home/saravanand/cloud-hypervisor/pci)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 1.79s
     Running unittests src/lib.rs (target/debug/deps/pci-2f3a4b128c172bab)

running 9 tests
test vfio::tests::default_query_migration_support_returns_none ... ok
test vfio::tests::default_set_migration_state_errors ... ok
test vfio::tests::load_migration_data_recovers_on_failure ... ok
test vfio::tests::vfio_migration_state_invalid_errors ... ok
test vfio::tests::load_migration_data_happy_path ... ok
test vfio::tests::save_migration_data_happy_path ... ok
test vfio::tests::vfio_migration_state_round_trips ... ok
test vfio::tests::save_migration_data_recovers_on_failure ... ok
test vfio::tests::write_config_register_mirrors_non_bar_into_shadow ... ok

test result: ok. 9 passed; 0 failed; 0 ignored; 0 measured; 9 filtered out; finished in 0.00s

  • Snapshot and Restore validated on a Mellanox ConnectX 7 VF (mlx5_vfio_pci, firmware 28.43.1014)
Snaphsot-Restore-VM-small.mov

@saravan2 saravan2 requested a review from a team as a code owner May 30, 2026 01:27
@saravan2 saravan2 self-assigned this May 30, 2026
@saravan2

Member Author

@Lencerf VFIO migration v2 implementation for snapshot and restore

@saravan2

saravan2 commented May 30, 2026

Member Author

I require assistance in re-running CI / integration-x86-64-pr (pull_request) to determine whether the integration vfio test failure is caused by the changes introduced in this PR.

vfio-ioctls 0.6.1 adds the VFIO device migration v2 and DMA logging
ioctls. Its VfioIommufd constructor now takes the iommufd-ioctls
0.2.0 IommuFd type, so iommufd-ioctls moves to 0.2.0.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>
@saravan2

saravan2 commented Jun 2, 2026

Member Author

> I require assistance in re-running CI / integration-x86-64-pr (pull_request) to determine whether the integration vfio test failure is caused by the changes introduced in this PR.

Found the root cause. I was missing the new VFIO_DEVICE_FEATURE ioctl in the VMM seccomp filter. It went unnoticed during development because I had to use --seccomp=false.

saravan2 added 5 commits June 2, 2026 00:54
Probe VFIO_DEVICE_FEATURE_MIGRATION during VfioCommon::new() and store
the result in a new migration_flags field so later migration phases can
gate state machine transitions.

The probe runs on every instantiation, including snapshot restore,
because migration capability is a property of the host kernel and its
variant driver rather than of any saved VM state.

query_migration_support is added to the internal Vfio trait with a
default implementation that returns Ok(None), meaning not migratable.
VfioDeviceWrapper overrides it to issue the kernel ioctl, while
vfio-user devices keep the default and are always treated as non
migratable.
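A minimal sketch of the trait-default pattern described above, under stated assumptions: `ProbeError`, `UserDevice`, `KernelDevice`, and `probe_migration_flags` are hypothetical stand-ins, and the stored `Result` replaces the real kernel ioctl. Only the shape (default returns `Ok(None)`, one implementor overrides it, a failed probe degrades to non-migratable) is taken from the commit message and the review snippet.

```rust
/// Hypothetical error type standing in for the PR's VfioError.
#[derive(Debug, PartialEq)]
enum ProbeError {
    Kernel(i32),
}

trait Vfio {
    /// Default: not migratable. Implementors that never override this
    /// (vfio-user in the PR) are always treated as non-migratable.
    fn query_migration_support(&self) -> Result<Option<u64>, ProbeError> {
        Ok(None)
    }
}

/// Stand-in for a vfio-user device: keeps the default.
struct UserDevice;
impl Vfio for UserDevice {}

/// Stand-in for a kernel VFIO device. `flags` fakes the ioctl outcome:
/// Ok(feature flags) or Err(errno).
struct KernelDevice {
    flags: Result<u64, i32>,
}

impl Vfio for KernelDevice {
    fn query_migration_support(&self) -> Result<Option<u64>, ProbeError> {
        match self.flags {
            Ok(flags) => Ok(Some(flags)),
            Err(errno) => Err(ProbeError::Kernel(errno)),
        }
    }
}

/// Probe once at construction time; a failed probe is treated as
/// "non-migratable" rather than as a fatal error.
fn probe_migration_flags(dev: &dyn Vfio) -> Option<u64> {
    match dev.query_migration_support() {
        Ok(flags) => flags,
        Err(_) => None,
    }
}
```

Later phases can then gate on the stored `Option<u64>`, which is what the commit message describes with `migration_flags.is_some()`.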

Allow the VFIO_DEVICE_FEATURE ioctl in the VMM seccomp filter.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>

Use VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE to pause, snapshot, and
resume migratable VFIO devices. pause() transitions to STOP,
resume() back to RUNNING, and snapshot() walks STOP_COPY, drains
the data_fd to EOF, and returns to STOP. The opaque state blob
is attached to the device snapshot as a base64 encoded child
snapshot.

All new behavior is gated on migration_flags.is_some(), so
devices without migration support (including vfio-user) retain
their previous snapshot behavior.

On any transition failure during save, STOP is attempted as best
effort before bubbling the error. Full recovery including device
reset is deferred.
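The save path and its best-effort recovery can be sketched roughly as follows. This is an assumption-laden model, not the PR's code: `MigDevice`, `try_save`, `save_migration_data`, and `MockDevice` are hypothetical, and an in-memory `Cursor` stands in for the kernel's data_fd.

```rust
use std::io::{self, Cursor, Read};

/// Subset of migration v2 states needed on the save path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MigState {
    Stop,
    StopCopy,
}

/// Hypothetical device abstraction; the PR itself goes through VFIO ioctls.
trait MigDevice {
    fn set_state(&mut self, state: MigState) -> io::Result<()>;
    fn data_reader(&mut self) -> &mut dyn Read;
}

/// STOP -> STOP_COPY, drain the data stream to EOF, then STOP_COPY -> STOP.
fn try_save(dev: &mut dyn MigDevice) -> io::Result<Vec<u8>> {
    dev.set_state(MigState::StopCopy)?;
    let mut blob = Vec::new();
    dev.data_reader().read_to_end(&mut blob)?;
    dev.set_state(MigState::Stop)?;
    Ok(blob)
}

/// Wraps `try_save`: on any failure, attempt STOP and report the first error.
fn save_migration_data(dev: &mut dyn MigDevice) -> io::Result<Vec<u8>> {
    let result = try_save(dev);
    if result.is_err() {
        // Best effort only; a failure here is deliberately ignored.
        let _ = dev.set_state(MigState::Stop);
    }
    result
}

/// In-memory mock that records every requested state.
struct MockDevice {
    payload: Cursor<Vec<u8>>,
    fail_on: Option<MigState>,
    requested: Vec<MigState>,
}

impl MockDevice {
    fn new(payload: Vec<u8>, fail_on: Option<MigState>) -> Self {
        Self { payload: Cursor::new(payload), fail_on, requested: Vec::new() }
    }
}

impl MigDevice for MockDevice {
    fn set_state(&mut self, state: MigState) -> io::Result<()> {
        self.requested.push(state);
        if self.fail_on == Some(state) {
            return Err(io::Error::new(io::ErrorKind::Other, "injected failure"));
        }
        Ok(())
    }

    fn data_reader(&mut self) -> &mut dyn Read {
        &mut self.payload
    }
}
```

With the mock, a failed STOP_COPY request still ends with a STOP attempt, while the caller sees the original error.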

Mirror every non BAR, non MSI config write into the
PciConfiguration shadow via write_byte / write_word / write_reg
so snapshot() captures PCI_COMMAND. The non BAR read path goes
directly to VFIO, leaving the shadow at post init zeros. Without
this, snapshots encode cmd=0 and mlx5_vfio_pci loses bus master
across restore, wedging its firmware cmd queue with ACCESS_REG
timeouts. Avoid write_config_register because its drain of
pending_bar_reprogram when MSE=1 would swallow the BAR reprog
params that the caller reads immediately below.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>

When a snapshot is loaded, walk the migration v2 state machine
in VfioCommon::new() after interrupt state has been restored.
If the device supports migration and a blob is present, drive
RUNNING to RESUMING in a single transition and write the blob
to the data_fd. The kernel handles the intermediate STOP arc
internally. An explicit STOP dwell was observed to make
mlx5_vfio_pci re initialize SQ, CQ, and EQ indices on top of
the just loaded blob, wedging queue state on resume. The
device is left in RESUMING and resume() drives it to RUNNING
during VM resume.

After the blob load, push PCI_COMMAND to the device via
write_config(). set_state() rebuilds in memory MSI / MSI-X
structs but does not touch the kernel's view of PCI_COMMAND,
so without this the VF sits at post reset defaults with no
bus master and mlx5_core ACCESS_REG times out. Rearm
VFIO_DEVICE_SET_IRQS via enable_msi or enable_msix because
set_state() does not reissue the ioctl and the kernel has
no eventfds for this device until this is done. Both match
QEMU vfio_pci_load_config().
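The ordering described above (one RESUMING request, blob write, then the PCI_COMMAND push and interrupt rearm) can be sketched as below. `RestoreTarget`, `restore_device`, `Step`, and `MockTarget` are hypothetical names. Limiting the best-effort STOP to the transition and blob write is an assumption drawn from the commit message's wording, not a statement about the actual code.

```rust
use std::io;

/// Steps recorded by the mock, in the order the restore path issues them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    SetResuming,
    WriteBlob(usize),
    PushPciCommand(u16),
    RearmInterrupts,
    BestEffortStop,
}

/// Hypothetical abstraction over the device operations named above.
trait RestoreTarget {
    fn set_resuming(&mut self) -> io::Result<()>;
    fn write_blob(&mut self, blob: &[u8]) -> io::Result<()>;
    fn write_pci_command(&mut self, command: u16) -> io::Result<()>;
    fn rearm_interrupts(&mut self) -> io::Result<()>;
    fn set_stop(&mut self) -> io::Result<()>;
}

fn restore_device(dev: &mut dyn RestoreTarget, blob: &[u8], saved_command: u16) -> io::Result<()> {
    // RUNNING -> RESUMING in one request, then stream the saved blob.
    let loaded = dev.set_resuming().and_then(|()| dev.write_blob(blob));
    if let Err(e) = loaded {
        let _ = dev.set_stop(); // best effort; report the original error
        return Err(e);
    }
    // Post-load fixups for state that the blob does not restore.
    dev.write_pci_command(saved_command)?;
    dev.rearm_interrupts()
}

/// Mock that records steps and can fail the blob write on demand.
#[derive(Default)]
struct MockTarget {
    steps: Vec<Step>,
    fail_blob_write: bool,
}

impl RestoreTarget for MockTarget {
    fn set_resuming(&mut self) -> io::Result<()> {
        self.steps.push(Step::SetResuming);
        Ok(())
    }
    fn write_blob(&mut self, blob: &[u8]) -> io::Result<()> {
        self.steps.push(Step::WriteBlob(blob.len()));
        if self.fail_blob_write {
            return Err(io::Error::new(io::ErrorKind::Other, "injected failure"));
        }
        Ok(())
    }
    fn write_pci_command(&mut self, command: u16) -> io::Result<()> {
        self.steps.push(Step::PushPciCommand(command));
        Ok(())
    }
    fn rearm_interrupts(&mut self) -> io::Result<()> {
        self.steps.push(Step::RearmInterrupts);
        Ok(())
    }
    fn set_stop(&mut self) -> io::Result<()> {
        self.steps.push(Step::BestEffortStop);
        Ok(())
    }
}
```

In this model a saved command value of `0x0006` (memory space plus bus master) is pushed only after the blob has been written, and the device is left in RESUMING for `resume()` to finish.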

In allocate_bars, skip add_pci_bar and add_pci_rom_bar on
restore. PciConfiguration::new(Some(state)) already populated
the BAR registers with used=true, so the extra call trips
BarInUse. The bars vec and mmio_regions pushes still need to
happen so the caller can wire bus mappings.

The new state machine is gated on migration_flags.is_some(),
so devices without migration support (including vfio-user)
skip it entirely.

On any transition or write failure during restore, STOP is
attempted as best effort before bubbling the error.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>

Cover the Vfio trait defaults (query_migration_support returns
Ok(None), set_migration_state and get_mig_data_size return
NoMigrationSupport), VfioMigrationState round trip and invalid
value handling, and the VfioCommon save and load helpers using
a mock Vfio wrapper driven through a libc pipe for the data_fd.

Happy paths verify the transition sequence and the transferred
byte payload. The save happy path asserts the full STOP_COPY to
STOP arc. The load happy path asserts the device is left in
RESUMING so resume() can drive it back to RUNNING. Error paths
verify the best effort STOP recovery when set_migration_state
fails mid sequence.

Also verify the save side PciConfiguration shadow sync: writing
a non BAR config register through VfioCommon::write_config_register
mirrors the bytes into the local shadow so a later snapshot picks
up the live value instead of the post init zero.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>

Add a Snapshot and Restore section to docs/vfio.md covering the
migration v2 requirements (Linux 5.18 kernel, variant VFIO
driver such as mlx5_vfio_pci) and the classification into
migratable and non migratable modes.

The migratable mode description covers the full restore sequence:
the RUNNING to RESUMING single transition (the kernel walks the
intermediate STOP arc), the post load PCI_COMMAND push to the
device, and the MSI or MSI-X eventfd rearm that the kernel state
does not carry. Behavior matches QEMU vfio_pci_load_config().

Note one limitation: the snapshot format stores the opaque
device blob as base64 inside the snapshot JSON, which may
benefit from a binary transport path for very large state.

Extend docs/snapshot_restore.md with a short VFIO section that
distinguishes migratable from non migratable devices and
redirects to docs/vfio.md for details. Replace the stale
"VFIO devices is out of scope" limitation with a note that
live migration support for VFIO devices is tracked as a follow
up.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>
@rbradford

Member

@saravan2 Is this going to conflict with #8287 - should we wait until that is merged before reviewing?

@saravan2

Member Author

> @saravan2 Is this going to conflict with #8287 - should we wait until that is merged before reviewing?

This PR and #8287 share only the first commit in the series, which updates the rust-vmm VFIO crate, so this PR can be reviewed independently.

Comment thread pci/src/vfio.rs
unimplemented!()
}

fn query_migration_support(&self) -> Result<Option<u64>, VfioError> {

Member

nit: we should call it migration_feature_flags() or migration_flags() since that's what this function does, it returns the set of migration feature flags (rather than checking for the migration support, which is a much more general concept).

Comment thread pci/src/vfio.rs
.map_err(VfioError::KernelVfio)
}

fn query_migration_support(&self) -> Result<Option<u64>, VfioError> {

Member

Ditto.

Comment thread pci/src/vfio.rs
None
}
Err(e) => {
warn!("VFIO device {bdf} migration probe failed, treating as non-migratable: {e}");

Member

nit: do we really want to make this a warning? On older kernels, people using VFIO will always get this warning, even if they don't try to perform any migration related work. That seems a bit too much, I'd keep this at the debug level.

Comment thread pci/src/vfio.rs
vfio_device_mig_state_VFIO_DEVICE_STATE_RUNNING as VFIO_DEV_STATE_RUNNING,
vfio_device_mig_state_VFIO_DEVICE_STATE_RUNNING_P2P as VFIO_DEV_STATE_RUNNING_P2P,
vfio_device_mig_state_VFIO_DEVICE_STATE_STOP as VFIO_DEV_STATE_STOP,
vfio_device_mig_state_VFIO_DEVICE_STATE_STOP_COPY as VFIO_DEV_STATE_STOP_COPY, *,

Member

nit: for a separate commit/PR, but it'd be worth converting * into the actual list of imports that are needed.

Comment thread pci/src/vfio.rs
Err(VfioError::NoMigrationSupport)
}

fn read_migration_data_to_end(&self) -> Result<Vec<u8>, VfioError> {

Member

nit: read_migration_data()

Comment thread pci/src/vfio.rs
Comment on lines +386 to +395
pub(crate) enum VfioMigrationState {
    Error = VFIO_DEV_STATE_ERROR,
    Stop = VFIO_DEV_STATE_STOP,
    Running = VFIO_DEV_STATE_RUNNING,
    StopCopy = VFIO_DEV_STATE_STOP_COPY,
    Resuming = VFIO_DEV_STATE_RESUMING,
    RunningP2P = VFIO_DEV_STATE_RUNNING_P2P,
    PreCopy = VFIO_DEV_STATE_PRE_COPY,
    PreCopyP2P = VFIO_DEV_STATE_PRE_COPY_P2P,
}

Member

nit: might be more idiomatic to define the list as:

pub(crate) enum VfioMigrationState {
    Error,
    Stop,
    Running,
    StopCopy,
    Resuming,
    RunningP2P,
    PreCopy,
    PreCopyP2P,
}

and then implement Into for u32.
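One way the suggestion could look, as a sketch rather than the PR's code. It implements `From<VfioMigrationState> for u32` (which provides `Into` for free) plus a `TryFrom<u32>` for the reverse direction. The `VFIO_DEV_STATE_*` constants below are local stand-ins for the bindgen aliases shown in the diff, with values following the Linux UAPI `enum vfio_device_mig_state`.

```rust
use std::convert::TryFrom;

// Local stand-ins for the bindgen constants imported in the PR.
const VFIO_DEV_STATE_ERROR: u32 = 0;
const VFIO_DEV_STATE_STOP: u32 = 1;
const VFIO_DEV_STATE_RUNNING: u32 = 2;
const VFIO_DEV_STATE_STOP_COPY: u32 = 3;
const VFIO_DEV_STATE_RESUMING: u32 = 4;
const VFIO_DEV_STATE_RUNNING_P2P: u32 = 5;
const VFIO_DEV_STATE_PRE_COPY: u32 = 6;
const VFIO_DEV_STATE_PRE_COPY_P2P: u32 = 7;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VfioMigrationState {
    Error,
    Stop,
    Running,
    StopCopy,
    Resuming,
    RunningP2P,
    PreCopy,
    PreCopyP2P,
}

impl From<VfioMigrationState> for u32 {
    fn from(state: VfioMigrationState) -> u32 {
        match state {
            VfioMigrationState::Error => VFIO_DEV_STATE_ERROR,
            VfioMigrationState::Stop => VFIO_DEV_STATE_STOP,
            VfioMigrationState::Running => VFIO_DEV_STATE_RUNNING,
            VfioMigrationState::StopCopy => VFIO_DEV_STATE_STOP_COPY,
            VfioMigrationState::Resuming => VFIO_DEV_STATE_RESUMING,
            VfioMigrationState::RunningP2P => VFIO_DEV_STATE_RUNNING_P2P,
            VfioMigrationState::PreCopy => VFIO_DEV_STATE_PRE_COPY,
            VfioMigrationState::PreCopyP2P => VFIO_DEV_STATE_PRE_COPY_P2P,
        }
    }
}

impl TryFrom<u32> for VfioMigrationState {
    /// The rejected raw value is returned as the error.
    type Error = u32;

    // The return type is spelled out because `Self::Error` would be
    // ambiguous with the `Error` variant of this enum.
    fn try_from(raw: u32) -> Result<Self, u32> {
        match raw {
            VFIO_DEV_STATE_ERROR => Ok(VfioMigrationState::Error),
            VFIO_DEV_STATE_STOP => Ok(VfioMigrationState::Stop),
            VFIO_DEV_STATE_RUNNING => Ok(VfioMigrationState::Running),
            VFIO_DEV_STATE_STOP_COPY => Ok(VfioMigrationState::StopCopy),
            VFIO_DEV_STATE_RESUMING => Ok(VfioMigrationState::Resuming),
            VFIO_DEV_STATE_RUNNING_P2P => Ok(VfioMigrationState::RunningP2P),
            VFIO_DEV_STATE_PRE_COPY => Ok(VfioMigrationState::PreCopy),
            VFIO_DEV_STATE_PRE_COPY_P2P => Ok(VfioMigrationState::PreCopyP2P),
            other => Err(other),
        }
    }
}
```

The trade-off is an explicit `match` in each direction instead of `as` casts on discriminants. In exchange, the enum no longer depends on the numeric layout of the bindings.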

Comment thread pci/src/vfio.rs

#[derive(Serialize, Deserialize)]
struct VfioMigrationData {
blob: String,

Member

Out of curiosity, what's the rationale for encoding the opaque blob as a base64 string? Why not use Vec<u8>? Are you trying to reduce the blob size?
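As background for this question, and without speaking for the author, the size trade-off inside a JSON snapshot can be estimated with a short sketch. It assumes a `Vec<u8>` field would be written as a compact JSON array of decimal numbers, which is how serde_json serializes byte vectors by default. `base64_len` and `json_array_len` are hypothetical helpers.

```rust
/// Length of standard padded base64 for `n` input bytes.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

/// Length of `bytes` rendered as a compact JSON array, e.g. `[0,17,255]`.
fn json_array_len(bytes: &[u8]) -> usize {
    if bytes.is_empty() {
        return 2; // "[]"
    }
    let digits: usize = bytes
        .iter()
        .map(|b| match *b {
            0..=9 => 1,
            10..=99 => 2,
            _ => 3,
        })
        .sum();
    // Digits, plus one comma between neighbours, plus the two brackets.
    digits + (bytes.len() - 1) + 2
}

/// Length of the same bytes as a quoted base64 JSON string.
fn json_base64_len(bytes: &[u8]) -> usize {
    base64_len(bytes.len()) + 2
}
```

For 1024 bytes of `0xff`, the array form takes 4097 characters against 1370 for the quoted base64 string, so base64 costs about 1.33 characters per byte while the array form costs between 2 and 4.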

Comment thread pci/src/vfio.rs
})();

if result.is_err() {
let _ = self.transition_migration_state(VfioMigrationState::Stop);

Member

nit: do we really expect this to help the recovery?

Comment thread pci/src/vfio.rs
Comment on lines +1437 to +1449
// Mirror into the shadow so snapshot() captures PCI_COMMAND.
// Avoid write_config_register as it drains pending_bar_reprogram.
let byte_offset = reg_idx * PCI_CONFIG_REGISTER_SIZE + offset as usize;
match data.len() {
    1 => self.configuration.write_byte(byte_offset, data[0]),
    2 => self
        .configuration
        .write_word(byte_offset, u16::from(data[0]) | (u16::from(data[1]) << 8)),
    4 => self
        .configuration
        .write_reg(reg_idx, LittleEndian::read_u32(data)),
    _ => {}
}

Member

Why can't we invoke write_config_register() after the BAR reprogramming block of code?
Also, please make sure to document very clearly (multiple lines of explanation) why we're doing this as it's not obvious which side effects this can have.

@likebreath

Member

@saravan2 You will need a rebase, as the rust-vmm crates were updated from main.

@rbradford rbradford marked this pull request as draft June 18, 2026 21:29

4 participants