pci: vfio: Implement migration v2 for snapshot and restore #8303
saravan2 wants to merge 6 commits
@Lencerf VFIO migration v2 implementation for snapshot and restore
vfio-ioctls 0.6.1 adds the VFIO device migration v2 and DMA logging ioctls. Its VfioIommufd constructor now takes the iommufd-ioctls 0.2.0 IommuFd type, so iommufd-ioctls moves to 0.2.0.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Found the root cause. I was missing the new |
Probe VFIO_DEVICE_FEATURE_MIGRATION during VfioCommon::new() and store the result in a new migration_flags field so later migration phases can gate state machine transitions. The probe runs on every instantiation, including snapshot restore, because migration capability is a property of the host kernel and its variant driver rather than of any saved VM state.

query_migration_support is added to the internal Vfio trait with a default implementation that returns Ok(None), meaning not migratable. VfioDeviceWrapper overrides it to issue the kernel ioctl, while vfio-user devices keep the default and are always treated as non-migratable.

Allow the VFIO_DEVICE_FEATURE ioctl in the VMM seccomp filter.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>
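The trait default and the override described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `UserDevice`, `KernelDevice`, the `probe` field, and `probe_migration_flags` are hypothetical stand-ins, and `VfioError` is reduced to a single variant.

```rust
// Hypothetical, simplified stand-ins for the PR's types; these are not the
// actual cloud-hypervisor definitions.
#[derive(Debug, PartialEq)]
pub enum VfioError {
    KernelVfio(i32),
}

pub trait Vfio {
    /// Default implementation: the device is not migratable.
    fn query_migration_support(&self) -> Result<Option<u64>, VfioError> {
        Ok(None)
    }
}

/// Stand-in for a vfio-user device: it keeps the trait default.
pub struct UserDevice;
impl Vfio for UserDevice {}

/// Stand-in for a kernel-backed device. `probe` simulates the outcome of
/// the feature ioctl (Ok(flags) or Err(errno)); it is an assumption made
/// so the sketch runs without a real device.
pub struct KernelDevice {
    pub probe: Result<u64, i32>,
}

impl Vfio for KernelDevice {
    fn query_migration_support(&self) -> Result<Option<u64>, VfioError> {
        match self.probe {
            Ok(flags) => Ok(Some(flags)),
            Err(errno) => Err(VfioError::KernelVfio(errno)),
        }
    }
}

/// Probe at construction time: a failed probe degrades to "not migratable"
/// instead of failing device creation.
pub fn probe_migration_flags(dev: &dyn Vfio) -> Option<u64> {
    dev.query_migration_support().unwrap_or(None)
}
```

Later phases would then check the stored `Option<u64>` with `is_some()` before touching the migration state machine.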
Use VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE to pause, snapshot, and resume migratable VFIO devices. pause() transitions to STOP, resume() back to RUNNING, and snapshot() walks STOP_COPY, drains the data_fd to EOF, and returns to STOP. The opaque state blob is attached to the device snapshot as a base64-encoded child snapshot.

All new behavior is gated on migration_flags.is_some(), so devices without migration support (including vfio-user) retain their previous snapshot behavior. On any transition failure during save, STOP is attempted as best effort before bubbling the error. Full recovery including device reset is deferred.

Mirror every non-BAR, non-MSI config write into the PciConfiguration shadow via write_byte / write_word / write_reg so snapshot() captures PCI_COMMAND. The non-BAR read path goes directly to VFIO, leaving the shadow at post-init zeros. Without this, snapshots encode cmd=0 and mlx5_vfio_pci loses bus master across restore, wedging its firmware cmd queue with ACCESS_REG timeouts. Avoid write_config_register because its drain of pending_bar_reprogram when MSE=1 would swallow the BAR reprog params that the caller reads immediately below.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>
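The save sequence and its best-effort STOP recovery might look roughly like the sketch below. The `MigDevice` trait, `MockDevice`, and `save_device_state` are hypothetical names introduced for illustration; the real code drives ioctls and a data_fd rather than an in-memory mock.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum MigState {
    Stop,
    Running,
    StopCopy,
}

/// Hypothetical device abstraction, assumed for this sketch only.
pub trait MigDevice {
    fn set_state(&mut self, state: MigState) -> Result<(), String>;
    /// Stands in for draining the data_fd to EOF.
    fn read_data(&mut self) -> Result<Vec<u8>, String>;
}

/// pause() and resume() are single transitions.
pub fn pause<D: MigDevice>(dev: &mut D) -> Result<(), String> {
    dev.set_state(MigState::Stop)
}

pub fn resume<D: MigDevice>(dev: &mut D) -> Result<(), String> {
    dev.set_state(MigState::Running)
}

/// STOP -> STOP_COPY, drain the blob, STOP_COPY -> STOP.
pub fn save_device_state<D: MigDevice>(dev: &mut D) -> Result<Vec<u8>, String> {
    // The closure funnels every `?` into a single recovery path below.
    let result = (|| -> Result<Vec<u8>, String> {
        dev.set_state(MigState::StopCopy)?;
        let blob = dev.read_data()?;
        dev.set_state(MigState::Stop)?;
        Ok(blob)
    })();

    if result.is_err() {
        // Best effort only: the original error is what gets reported.
        let _ = dev.set_state(MigState::Stop);
    }
    result
}

/// In-memory mock that records every requested state.
pub struct MockDevice {
    pub data: Vec<u8>,
    pub fail_read: bool,
    pub history: Vec<MigState>,
}

impl MigDevice for MockDevice {
    fn set_state(&mut self, state: MigState) -> Result<(), String> {
        self.history.push(state);
        Ok(())
    }

    fn read_data(&mut self) -> Result<Vec<u8>, String> {
        if self.fail_read {
            return Err("read failed".to_string());
        }
        Ok(self.data.clone())
    }
}
```

With the mock, both the happy path and a failed read end with the device asked to enter STOP, which is the property the commit message describes.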
When a snapshot is loaded, walk the migration v2 state machine in VfioCommon::new() after interrupt state has been restored. If the device supports migration and a blob is present, drive RUNNING to RESUMING in a single transition and write the blob to the data_fd. The kernel handles the intermediate STOP arc internally. An explicit STOP dwell was observed to make mlx5_vfio_pci re-initialize SQ, CQ, and EQ indices on top of the just-loaded blob, wedging queue state on resume. The device is left in RESUMING and resume() drives it to RUNNING during VM resume.

After the blob load, push PCI_COMMAND to the device via write_config(). set_state() rebuilds in-memory MSI / MSI-X structs but does not touch the kernel's view of PCI_COMMAND, so without this the VF sits at post-reset defaults with no bus master and mlx5_core ACCESS_REG times out. Rearm VFIO_DEVICE_SET_IRQS via enable_msi or enable_msix because set_state() does not reissue the ioctl and the kernel has no eventfds for this device until this is done. Both match QEMU vfio_pci_load_config().

In allocate_bars, skip add_pci_bar and add_pci_rom_bar on restore. PciConfiguration::new(Some(state)) already populated the BAR registers with used=true, so the extra call trips BarInUse. The bars vec and mmio_regions pushes still need to happen so the caller can wire bus mappings.

The new state machine is gated on migration_flags.is_some(), so devices without migration support (including vfio-user) skip it entirely. On any transition or write failure during restore, STOP is attempted as best effort before bubbling the error.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>
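The restore ordering described in this commit can be sketched as an ordered log of steps. Everything here is hypothetical scaffolding (`Step`, `MockDevice`, `restore_device`, `resume_device`); in particular, the relative order of the PCI_COMMAND push and the IRQ rearm is an assumption, since the commit message only says both happen after the blob load.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Step {
    SetResuming,          // RUNNING -> RESUMING in a single transition
    WriteBlob(usize),     // write the saved blob to the data_fd
    PushPciCommand(u16),  // push PCI_COMMAND to the device
    RearmIrqs,            // reissue the SET_IRQS setup for MSI / MSI-X
    SetStop,              // best-effort recovery
    SetRunning,           // done later, by resume()
}

/// Mock that records steps and can fail the blob write on demand.
pub struct MockDevice {
    pub fail_write: bool,
    pub log: Vec<Step>,
}

impl MockDevice {
    fn step(&mut self, s: Step) -> Result<(), String> {
        self.log.push(s);
        if self.fail_write && matches!(s, Step::WriteBlob(_)) {
            return Err("data_fd write failed".to_string());
        }
        Ok(())
    }
}

pub fn restore_device(
    dev: &mut MockDevice,
    blob: &[u8],
    pci_command: u16,
) -> Result<(), String> {
    let result = (|| -> Result<(), String> {
        // No explicit STOP dwell: the kernel walks that arc internally.
        dev.step(Step::SetResuming)?;
        dev.step(Step::WriteBlob(blob.len()))
    })();

    if result.is_err() {
        // Best effort: try to park the device in STOP, then report the error.
        let _ = dev.step(Step::SetStop);
        return result;
    }

    // The device stays in RESUMING here; resume() moves it to RUNNING.
    dev.step(Step::PushPciCommand(pci_command))?;
    dev.step(Step::RearmIrqs)
}

pub fn resume_device(dev: &mut MockDevice) -> Result<(), String> {
    dev.step(Step::SetRunning)
}
```

A `pci_command` of `0x0006` in such a sketch would correspond to memory space enable plus bus master, the bit whose loss the commit message blames for the ACCESS_REG timeouts.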
Cover the Vfio trait defaults (query_migration_support returns Ok(None), set_migration_state and get_mig_data_size return NoMigrationSupport), VfioMigrationState round-trip and invalid value handling, and the VfioCommon save and load helpers using a mock Vfio wrapper driven through a libc pipe for the data_fd.

Happy paths verify the transition sequence and the transferred byte payload. The save happy path asserts the full STOP_COPY to STOP arc. The load happy path asserts the device is left in RESUMING so resume() can drive it back to RUNNING. Error paths verify the best-effort STOP recovery when set_migration_state fails mid-sequence.

Also verify the save-side PciConfiguration shadow sync: writing a non-BAR config register through VfioCommon::write_config_register mirrors the bytes into the local shadow so a later snapshot picks up the live value instead of the post-init zero.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Add a Snapshot and Restore section to docs/vfio.md covering the migration v2 requirements (Linux 5.18 kernel, variant VFIO driver such as mlx5_vfio_pci) and the classification into migratable and non-migratable modes.

The migratable mode description covers the full restore sequence: the RUNNING to RESUMING single transition (the kernel walks the intermediate STOP arc), the post-load PCI_COMMAND push to the device, and the MSI or MSI-X eventfd rearm that the kernel state does not carry. Behavior matches QEMU vfio_pci_load_config().

Note one limitation: the snapshot format stores the opaque device blob as base64 inside the snapshot JSON, which may benefit from a binary transport path for very large state.

Extend docs/snapshot_restore.md with a short VFIO section that distinguishes migratable from non-migratable devices and redirects to docs/vfio.md for details. Replace the stale "VFIO devices is out of scope" limitation with a note that live migration support for VFIO devices is tracked as a follow-up.

Signed-off-by: Saravanan D <saravanand@crusoe.ai>
    unimplemented!()
}

fn query_migration_support(&self) -> Result<Option<u64>, VfioError> {
nit: we should call it migration_feature_flags() or migration_flags(), since that is what this function does: it returns the set of migration feature flags, rather than checking for migration support, which is a much more general concept.
        .map_err(VfioError::KernelVfio)
}

fn query_migration_support(&self) -> Result<Option<u64>, VfioError> {

        None
    }
    Err(e) => {
        warn!("VFIO device {bdf} migration probe failed, treating as non-migratable: {e}");
nit: do we really want to make this a warning? On older kernels, people using VFIO will always get this warning, even if they don't try to perform any migration related work. That seems a bit too much, I'd keep this at the debug level.
    vfio_device_mig_state_VFIO_DEVICE_STATE_RUNNING as VFIO_DEV_STATE_RUNNING,
    vfio_device_mig_state_VFIO_DEVICE_STATE_RUNNING_P2P as VFIO_DEV_STATE_RUNNING_P2P,
    vfio_device_mig_state_VFIO_DEVICE_STATE_STOP as VFIO_DEV_STATE_STOP,
    vfio_device_mig_state_VFIO_DEVICE_STATE_STOP_COPY as VFIO_DEV_STATE_STOP_COPY, *,
nit: for a separate commit/PR, but it'd be worth converting * into the actual list of imports that are needed.
    Err(VfioError::NoMigrationSupport)
}

fn read_migration_data_to_end(&self) -> Result<Vec<u8>, VfioError> {

pub(crate) enum VfioMigrationState {
    Error = VFIO_DEV_STATE_ERROR,
    Stop = VFIO_DEV_STATE_STOP,
    Running = VFIO_DEV_STATE_RUNNING,
    StopCopy = VFIO_DEV_STATE_STOP_COPY,
    Resuming = VFIO_DEV_STATE_RESUMING,
    RunningP2P = VFIO_DEV_STATE_RUNNING_P2P,
    PreCopy = VFIO_DEV_STATE_PRE_COPY,
    PreCopyP2P = VFIO_DEV_STATE_PRE_COPY_P2P,
}
nit: might be more idiomatic to define the list as:
pub(crate) enum VfioMigrationState {
    Error,
    Stop,
    Running,
    StopCopy,
    Resuming,
    RunningP2P,
    PreCopy,
    PreCopyP2P,
}
and then implement Into for u32.
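One way to read this suggestion is sketched below: a plain enum plus a `From<VfioMigrationState> for u32` impl (which provides `Into<u32>` for free) and a `TryFrom<u32>` for the reverse direction. The constants are local placeholders standing in for the vfio-bindings values imported in the PR, and the `TryFrom` error type is an arbitrary choice for illustration.

```rust
use std::convert::TryFrom;

// Local placeholders standing in for the constants that the PR imports
// from the kernel bindings.
const VFIO_DEV_STATE_ERROR: u32 = 0;
const VFIO_DEV_STATE_STOP: u32 = 1;
const VFIO_DEV_STATE_RUNNING: u32 = 2;
const VFIO_DEV_STATE_STOP_COPY: u32 = 3;
const VFIO_DEV_STATE_RESUMING: u32 = 4;
const VFIO_DEV_STATE_RUNNING_P2P: u32 = 5;
const VFIO_DEV_STATE_PRE_COPY: u32 = 6;
const VFIO_DEV_STATE_PRE_COPY_P2P: u32 = 7;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VfioMigrationState {
    Error,
    Stop,
    Running,
    StopCopy,
    Resuming,
    RunningP2P,
    PreCopy,
    PreCopyP2P,
}

// Implementing From yields the matching Into<u32> automatically.
impl From<VfioMigrationState> for u32 {
    fn from(state: VfioMigrationState) -> u32 {
        match state {
            VfioMigrationState::Error => VFIO_DEV_STATE_ERROR,
            VfioMigrationState::Stop => VFIO_DEV_STATE_STOP,
            VfioMigrationState::Running => VFIO_DEV_STATE_RUNNING,
            VfioMigrationState::StopCopy => VFIO_DEV_STATE_STOP_COPY,
            VfioMigrationState::Resuming => VFIO_DEV_STATE_RESUMING,
            VfioMigrationState::RunningP2P => VFIO_DEV_STATE_RUNNING_P2P,
            VfioMigrationState::PreCopy => VFIO_DEV_STATE_PRE_COPY,
            VfioMigrationState::PreCopyP2P => VFIO_DEV_STATE_PRE_COPY_P2P,
        }
    }
}

impl TryFrom<u32> for VfioMigrationState {
    // The unrecognized raw value is returned as the error.
    type Error = u32;

    // `u32` is spelled out in the return type because `Self::Error` would
    // be ambiguous with the enum's own `Error` variant.
    fn try_from(raw: u32) -> Result<Self, u32> {
        match raw {
            VFIO_DEV_STATE_ERROR => Ok(VfioMigrationState::Error),
            VFIO_DEV_STATE_STOP => Ok(VfioMigrationState::Stop),
            VFIO_DEV_STATE_RUNNING => Ok(VfioMigrationState::Running),
            VFIO_DEV_STATE_STOP_COPY => Ok(VfioMigrationState::StopCopy),
            VFIO_DEV_STATE_RESUMING => Ok(VfioMigrationState::Resuming),
            VFIO_DEV_STATE_RUNNING_P2P => Ok(VfioMigrationState::RunningP2P),
            VFIO_DEV_STATE_PRE_COPY => Ok(VfioMigrationState::PreCopy),
            VFIO_DEV_STATE_PRE_COPY_P2P => Ok(VfioMigrationState::PreCopyP2P),
            other => Err(other),
        }
    }
}
```

The trade-off against explicit discriminants is a longer mapping in exchange for keeping the enum independent of the binding constants' types.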
#[derive(Serialize, Deserialize)]
struct VfioMigrationData {
    blob: String,
Out of curiosity, what's the rationale for encoding the opaque blob as a base64 string? Why not use Vec<u8>? Are you trying to reduce the blob size?
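The thread does not record the author's answer. For context only, the size arithmetic behind the two options can be worked out with a small sketch. It assumes the blob ends up inside the snapshot JSON (as the docs commit states) and that a `Vec<u8>` field would be written as a compact JSON array of numbers; `base64_len` and `json_array_len` are hypothetical helpers, not part of the PR.

```rust
/// Length of padded base64 text for `n` input bytes: 4 * ceil(n / 3).
/// The two JSON string quotes are not counted.
pub fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

/// Length of a compact JSON array of byte values such as `[0,17,255]`:
/// two brackets, one to three digits per byte, and a comma between items.
pub fn json_array_len(bytes: &[u8]) -> usize {
    let digits: usize = bytes
        .iter()
        .map(|b| match *b {
            0..=9 => 1,
            10..=99 => 2,
            _ => 3,
        })
        .sum();
    let commas = bytes.len().saturating_sub(1);
    2 + digits + commas
}
```

Under those assumptions base64 costs a fixed 4/3 of the raw size, while the number-array form varies with the byte values and approaches four characters per byte for values of 100 and above.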
})();

if result.is_err() {
    let _ = self.transition_migration_state(VfioMigrationState::Stop);
nit: do we really expect this to help the recovery?
// Mirror into the shadow so snapshot() captures PCI_COMMAND.
// Avoid write_config_register as it drains pending_bar_reprogram.
let byte_offset = reg_idx * PCI_CONFIG_REGISTER_SIZE + offset as usize;
match data.len() {
    1 => self.configuration.write_byte(byte_offset, data[0]),
    2 => self
        .configuration
        .write_word(byte_offset, u16::from(data[0]) | (u16::from(data[1]) << 8)),
    4 => self
        .configuration
        .write_reg(reg_idx, LittleEndian::read_u32(data)),
    _ => {}
}
Why can't we invoke write_config_register() after the BAR reprogramming block of code?
Also, please make sure to document very clearly (multiple lines of explanation) why we're doing this as it's not obvious which side effects this can have.
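To make the mirrored write concrete, here is a minimal sketch of a byte-addressed shadow that accepts the same 1-, 2-, and 4-byte little-endian accesses. `Shadow`, `mirror_write`, and `read_reg` are hypothetical; the real PciConfiguration also applies writable-bit masks and has BAR side effects that this sketch deliberately leaves out.

```rust
const PCI_CONFIG_REGISTER_SIZE: usize = 4;

/// Hypothetical shadow of the 256-byte PCI configuration space.
pub struct Shadow {
    pub bytes: [u8; 256],
}

impl Shadow {
    pub fn new() -> Self {
        Shadow { bytes: [0; 256] }
    }

    /// Copy a guest config write into the shadow. `offset` is the byte
    /// offset inside register `reg_idx`; data is little-endian.
    pub fn mirror_write(&mut self, reg_idx: usize, offset: usize, data: &[u8]) {
        let start = reg_idx * PCI_CONFIG_REGISTER_SIZE + offset;
        let end = start + data.len();
        // Accept only 1-, 2-, or 4-byte accesses that stay inside one
        // register and inside the shadow; ignore anything else.
        if !matches!(data.len(), 1 | 2 | 4)
            || offset + data.len() > PCI_CONFIG_REGISTER_SIZE
            || end > self.bytes.len()
        {
            return;
        }
        self.bytes[start..end].copy_from_slice(data);
    }

    /// Read back a full 32-bit register as little-endian.
    /// `reg_idx` must be below 64.
    pub fn read_reg(&self, reg_idx: usize) -> u32 {
        let s = reg_idx * PCI_CONFIG_REGISTER_SIZE;
        u32::from_le_bytes([
            self.bytes[s],
            self.bytes[s + 1],
            self.bytes[s + 2],
            self.bytes[s + 3],
        ])
    }
}
```

PCI_COMMAND lives in the low 16 bits of register 1, so a guest write of `[0x06, 0x00]` to register 1 at offset 0 leaves `0x0006` there for a later snapshot to pick up.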
@saravan2 You will need a rebase, as the rust-vmm crates were updated from main.
Summary
Adds VFIO v2 migration protocol support for same-host snapshot and restore of migratable VFIO devices, for example ConnectX VFs bound to mlx5_vfio_pci. Non-migratable devices retain their existing snapshot behavior. Live migration with VFIO devices, DMA dirty page tracking, and precopy iteration are follow-up work and are not in this PR.

VFIO Migration v2 State Transitions

pause(): RUNNING → STOP
snapshot() requests data: STOP → STOP_COPY
snapshot() drains data_fd (device blob): no transition
snapshot() finalizes: STOP_COPY → STOP
load_migration_data() requests data_fd: RUNNING → RESUMING
load_migration_data() writes device blob: no transition
resume() after save (optional): STOP → RUNNING
resume() after restore (mandatory): RESUMING → RUNNING

(*) Each arrow (→) is one VFIO_DEVICE_FEATURE_SET_MIG_DEVICE_STATE ioctl.

Non-adjacent transitions such as RUNNING → RESUMING and RESUMING → RUNNING have to pass through STOP as per the protocol. The host kernel walks that intermediate STOP state internally via the variant driver (mlx5_vfio_pci), so in this implementation Cloud-Hypervisor does not enforce STOP in the restore and resume phases. This follows QEMU's VFIO migration ioctl sequence.

Test

mlx5_vfio_pci, firmware 28.43.1014)
Snaphsot-Restore-VM-small.mov