The device_id field was added to both NumaConfig and NumaNode as part
of the Generic Initiator support, but the matching create_numa_nodes()
change was missed when the commits were reorganized.
As a result, node.device_id is never propagated from the config to
the runtime node, and the ACPI SRAT Type 5 (Generic Initiator Affinity)
entries are never emitted.
Add the missing propagation so that create_srat_table() can resolve
the device and emit the correct affinity structure.
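A minimal sketch of the propagation step, using simplified stand-ins for
the real types. The struct layouts and the function signature below are
assumptions for illustration, not the actual cloud-hypervisor definitions:

```rust
use std::collections::BTreeMap;

// Simplified, hypothetical stand-ins for the real config and runtime types.
#[derive(Clone, Debug, Default)]
pub struct NumaConfig {
    pub guest_numa_id: u32,
    pub device_id: Option<String>,
}

#[derive(Clone, Debug, Default, PartialEq)]
pub struct NumaNode {
    pub device_id: Option<String>,
}

// Build the runtime nodes from the config, carrying device_id across.
pub fn create_numa_nodes(configs: &[NumaConfig]) -> BTreeMap<u32, NumaNode> {
    let mut nodes = BTreeMap::new();
    for config in configs {
        // Copying device_id here is the step the commit adds; without it
        // the runtime node would keep device_id == None.
        let node = NumaNode {
            device_id: config.device_id.clone(),
        };
        nodes.insert(config.guest_numa_id, node);
    }
    nodes
}
```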
Fixes: #7717
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Add performance tests for standalone qcow2 images without backing
files - uncompressed, zlib and zstd compressed. Each variant
includes single queue and multiqueue tests for sequential
read, random read and warmed up sequential read.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add multiqueue num_queues=4 performance tests for qcow2 overlay
images with both qcow2 and raw backing files - sequential read,
random read, and warm read variants.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The backing_files option defaults to false, so qcow2 overlay
tests fail with MaxNestingDepthExceeded. Pass backing_files=on
when the test file is an overlay.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The focal image checksums have been moved to the -common
sha1sums file. Use the correct file for metrics.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Unlike most virtio feature bits, VIRTIO_BLK_F_RO is not optional.
It indicates that the host is refusing to permit write operations, and
the guest must not be allowed to override it.
However, the block device currently does not enforce this. If the guest
does not negotiate VIRTIO_BLK_F_RO, the block device will think the
device is writable and forward write requests to the backend.
This is not a security problem right now because the backing device of a
read-only device is always opened read-only. The kernel will thus
reject the write operations with EBADF. If support is added for
receiving the backing device file descriptor via SCM_RIGHTS (#7704),
it will be possible to have a read-only block device backed by a
writable file descriptor. This would make the bug a genuine security
vulnerability.
Fix the bug by explicitly checking if VIRTIO_BLK_F_RO was offered but
not negotiated. In this case, log a warning and proceed as if the guest
did acknowledge the feature. This always indicates a guest driver bug.
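The check can be sketched as follows. The bit number comes from the virtio
specification; the function name and the tuple return value are
hypothetical and only illustrate the decision, not the actual device code:

```rust
// Feature bit number from the virtio-blk specification.
pub const VIRTIO_BLK_F_RO: u64 = 5;

// Returns (read_only, guest_bug).
// read_only depends only on what the device offered, so the guest cannot
// turn a read-only device into a writable one by not acknowledging the bit.
// guest_bug is true when the bit was offered but not negotiated, which is
// the case where a warning would be logged.
pub fn resolve_read_only(offered_features: u64, acked_features: u64) -> (bool, bool) {
    let ro_mask = 1u64 << VIRTIO_BLK_F_RO;
    let offered_ro = (offered_features & ro_mask) != 0;
    let acked_ro = (acked_features & ro_mask) != 0;
    (offered_ro, offered_ro && !acked_ro)
}
```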
Fixes: #7697
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
OVMF sends FLUSH requests to read-only virtio-block devices. Refusing
these requests prevents OVMF from accessing the EFI System Partition and
therefore makes VMs unable to boot. Accept these requests instead.
Ignoring these requests is possible, but inconsistent with fsync(2)
which honors them.
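A rough sketch of a request filter for read-only devices. The request type
values come from the virtio-blk specification. Only the treatment of FLUSH
is taken from this commit; the set of rejected request types and the
function name are assumptions for illustration:

```rust
// Request type values from the virtio-blk specification.
pub const VIRTIO_BLK_T_IN: u32 = 0;
pub const VIRTIO_BLK_T_OUT: u32 = 1;
pub const VIRTIO_BLK_T_FLUSH: u32 = 4;
pub const VIRTIO_BLK_T_DISCARD: u32 = 11;
pub const VIRTIO_BLK_T_WRITE_ZEROES: u32 = 13;

// Decide whether a request may run against a read-only device.
// FLUSH is allowed: it does not modify data, and firmware such as OVMF
// sends it even to read-only disks.
pub fn allowed_on_read_only(request_type: u32) -> bool {
    !matches!(
        request_type,
        VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_DISCARD | VIRTIO_BLK_T_WRITE_ZEROES
    )
}
```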
Fixes: #7698
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Add handling for GHCB_INFO_SPECIAL_DBGPRINT VMG exit in the SEV-SNP
guest exit handler. This exit occurs when the guest sends debug print
requests through the GHCB interface.
Without this handler, SEV-SNP guests fail to boot when debug output
is triggered, such as when a debugger is attached to the guest image.
The handler acknowledges the exit without printing to avoid performance
degradation from frequent debug print requests.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The --net help text documented fd as fd=<fd1,fd2...>, but
comma-separated FD lists in option values must be bracketed to avoid
top-level option splitting.
Update NetConfig::SYNTAX to use fd=<[fd1,fd2,...]>, matching parser
behavior and existing net parsing tests:
`cargo test -p vmm test_net_parsing`
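To illustrate why the brackets matter, here is a small bracket-aware
splitter. It is a simplified sketch and not the option parser used by the
project; split_top_level is a hypothetical helper name:

```rust
// Split an option string on commas that are not inside square brackets.
pub fn split_top_level(input: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut depth = 0usize;
    for c in input.chars() {
        match c {
            '[' => {
                depth += 1;
                current.push(c);
            }
            ']' => {
                depth = depth.saturating_sub(1);
                current.push(c);
            }
            // A comma at depth 0 ends the current top-level option.
            ',' if depth == 0 => parts.push(std::mem::take(&mut current)),
            _ => current.push(c),
        }
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}
```

With brackets, "fd=[3,4,5]" stays one option. Without them, "fd=3,4,5"
is split into three pieces and only "fd=3" carries the key.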
On-behalf-of: SAP leander.kohler@sap.com
Signed-off-by: Leander Kohler <leander.kohler@cyberus-technology.de>
Add comprehensive integration tests for DISCARD and WRITE_ZEROES:
Multiqueue stress tests verify concurrent operations across queues,
testing scattered writes with simultaneous fstrim, and write/discard
races that stress refcount table locking.
Format-specific tests verify that QCOW2 deallocates clusters after
DISCARD, that raw files create holes using fallocate, and that the
unsupported formats VHD and VHDX correctly reject DISCARD requests.
Tests for sparse=off verify raw files preallocate full disk size and
QCOW2 uses zero flag instead of deallocating clusters.
Add helper functions to verify sparse files, count QCOW2 zero flagged
regions using qemu-img map, and verify guest reads zeros from
discarded regions.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add comprehensive tests for DISCARD and WRITE_ZEROES operations:
QCOW2 zero flag test validates the complete workflow: allocate
cluster, DISCARD it, verify reads return zeros, write new data,
verify cluster reallocated.
QcowSync tests verify punch_hole and write_zeroes with Arc<Mutex<>>
sharing, including tests for cache consistency with multiple async
I/O operations.
RawFileSync tests verify punch_hole and write_zeroes using
fallocate.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement punch_hole() and write_zeroes() for raw file backends using
io_uring and fallocate.
punch_hole() uses FALLOC_FL_PUNCH_HOLE to deallocate storage.
write_zeroes() uses FALLOC_FL_ZERO_RANGE to write zeros efficiently.
Both use FALLOC_FL_KEEP_SIZE to maintain file size.
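The mode flags combine as sketched below. The constant values are those of
<linux/falloc.h>; the helper names are hypothetical, and the submission of
the fallocate operation through io_uring is not shown:

```rust
// Flag values from <linux/falloc.h>.
pub const FALLOC_FL_KEEP_SIZE: i32 = 0x01;
pub const FALLOC_FL_PUNCH_HOLE: i32 = 0x02;
pub const FALLOC_FL_ZERO_RANGE: i32 = 0x10;

// Mode for punch_hole(): deallocate the range, keep the file size.
pub fn punch_hole_mode() -> i32 {
    FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
}

// Mode for write_zeroes(): zero the range, keep the file size.
pub fn write_zeroes_mode() -> i32 {
    FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE
}
```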
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement punch_hole and write_zeroes for QcowSync backend by
delegating to QcowFile::punch_hole which triggers cluster
deallocation. write_zeroes delegates to punch_hole as unallocated
clusters read as zeros in QCOW2.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add VIRTIO_BLK_T_DISCARD and VIRTIO_BLK_T_WRITE_ZEROES request types.
Parse discard/write_zeroes descriptors (sector, num_sectors, flags),
convert to byte offsets, and call punch_hole/write_zeroes on the disk
backend. Mark as unsupported in sync mode.
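A sketch of the descriptor parsing and byte-offset conversion. The 16-byte
little-endian layout (sector, num_sectors, flags) follows the virtio-blk
specification; the type and method names are hypothetical, and the
512-byte sector size is the virtio-blk sector unit:

```rust
pub const SECTOR_SIZE: u64 = 512;

// One discard/write-zeroes segment as laid out by the virtio-blk spec.
#[derive(Debug, PartialEq)]
pub struct DiscardWriteZeroes {
    pub sector: u64,
    pub num_sectors: u32,
    pub flags: u32,
}

impl DiscardWriteZeroes {
    // Decode the three little-endian fields from the raw segment bytes.
    pub fn parse(bytes: &[u8; 16]) -> Self {
        let mut sector = [0u8; 8];
        let mut num_sectors = [0u8; 4];
        let mut flags = [0u8; 4];
        sector.copy_from_slice(&bytes[0..8]);
        num_sectors.copy_from_slice(&bytes[8..12]);
        flags.copy_from_slice(&bytes[12..16]);
        DiscardWriteZeroes {
            sector: u64::from_le_bytes(sector),
            num_sectors: u32::from_le_bytes(num_sectors),
            flags: u32::from_le_bytes(flags),
        }
    }

    // Convert to (byte offset, byte length); None if the offset overflows.
    pub fn byte_range(&self) -> Option<(u64, u64)> {
        let offset = self.sector.checked_mul(SECTOR_SIZE)?;
        let length = u64::from(self.num_sectors) * SECTOR_SIZE;
        Some((offset, length))
    }
}
```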
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement DISCARD using QCOW2 zero flag (bit 0 of L2 entries) with
sparse aware behavior.
When sparse=true - fully deallocate clusters by decrementing
refcount, clearing L2 entry, and reclaiming storage via punch_hole
when refcount reaches zero.
When sparse=false - use zero flag to keep storage allocated while
marking as reading zeros. Only works when cluster is not shared.
Shared clusters are fully deallocated.
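The decision described above can be summarized in a small sketch. The enum
and function names are hypothetical, and treating a refcount greater than
one as "shared" is an assumption about how sharing is detected:

```rust
// Bit 0 of an L2 entry marks the cluster as reading zeros.
pub const L2_ZERO_FLAG: u64 = 1;

#[derive(Debug, PartialEq)]
pub enum DiscardAction {
    // Decrement the refcount, clear the L2 entry, and punch a hole once
    // the refcount reaches zero.
    Deallocate,
    // Keep the storage allocated and set the zero flag.
    MarkZero,
}

// Choose how to handle a DISCARD for one cluster.
pub fn discard_action(sparse: bool, refcount: u64) -> DiscardAction {
    if sparse || refcount > 1 {
        DiscardAction::Deallocate
    } else {
        DiscardAction::MarkZero
    }
}

// Set the zero flag on an L2 entry without touching the other bits.
pub fn mark_zero(l2_entry: u64) -> u64 {
    l2_entry | L2_ZERO_FLAG
}
```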
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When sparse=false is configured, preallocate the entire raw disk file
at startup using fallocate(). This provides space reservation and
reduces fragmentation.
Only applies to raw disks. QCOW2/VHD/VHDX formats manage their own
allocation.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add sparse parameter to QcowFile constructors and propagate it from
device_manager through QcowDiskSync. This makes the sparse configuration
available throughout the QCOW2 implementation for controlling allocation
and deallocation behavior.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add supports_zero_flag() to DiskFile trait to indicate whether a disk
format can mark clusters/blocks as reading zeros without deallocating
storage.
QCOW2 supports this via the zero flag in L2 entries. VHDX also has
PAYLOAD_BLOCK_ZERO state for this, though it's not yet implemented in
cloud-hypervisor.
This enables DISCARD to be advertised even with sparse=false for formats
with zero-flag support, since they can mark regions as zeros (keeps
storage allocated) instead of requiring full deallocation.
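A sketch of the resulting feature selection. The feature bit numbers come
from the virtio-blk specification. The Backend struct and its field names
are hypothetical, and gating both features on the backend's sparse
operation support is an assumption about how the checks combine:

```rust
// Feature bit numbers from the virtio-blk specification.
pub const VIRTIO_BLK_F_DISCARD: u64 = 13;
pub const VIRTIO_BLK_F_WRITE_ZEROES: u64 = 14;

// Hypothetical summary of what a disk backend reports about itself.
pub struct Backend {
    pub supports_sparse_operations: bool,
    pub supports_zero_flag: bool,
}

// Compute the DISCARD / WRITE_ZEROES feature bits to offer to the guest.
pub fn sparse_features(backend: &Backend, sparse: bool) -> u64 {
    let mut features = 0u64;
    if backend.supports_sparse_operations {
        // WRITE_ZEROES does not depend on the sparse setting.
        features |= 1u64 << VIRTIO_BLK_F_WRITE_ZEROES;
        // DISCARD needs either real deallocation (sparse=true) or a
        // format that can mark regions as zero while keeping storage.
        if sparse || backend.supports_zero_flag {
            features |= 1u64 << VIRTIO_BLK_F_DISCARD;
        }
    }
    features
}
```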
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add sparse boolean configuration option to DiskConfig with a default
value of true to control disk space allocation behavior.
When sparse is true, the disk uses sparse allocation where deallocated
blocks are returned to the filesystem, and the DISCARD feature is
advertised to the guest.
When sparse is false, disk space is kept fully allocated and DISCARD
is not advertised.
WRITE_ZEROES is always advertised when the backend supports it,
regardless of the sparse setting.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add capability query to DiskFile trait to check backend
support for sparse operations (punch hole, write zeroes,
discard). Only advertise VIRTIO_BLK_F_DISCARD and
VIRTIO_BLK_F_WRITE_ZEROES when the backend supports these
operations.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add functions to probe whether a file or block device actually
supports PUNCH_HOLE and ZERO_RANGE operations at runtime. The
probe is performed at file open time by testing the operations
at EOF with a zero-length range, which is a safe no-op.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add punch_hole() and write_zeroes() methods to the AsyncIo trait
with stub implementations for all backends. These will be used to
support DISCARD and WRITE_ZEROES operations.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Decompose the monolithic `new_from_memory_manager` function into
smaller, focused helper methods to improve code readability,
maintainability, and testability.
Changes:
- Extract `should_force_iommu()` to determine IOMMU requirements for
confidential computing (TDX/SEV-SNP)
- Extract `should_stop_on_boot()` to check debug pause configuration
- Extract `create_cpu_manager()` to encapsulate CPU manager creation
and CPUID population
- Extract `init_tdx_if_enabled()` for TDX-specific VM initialization
- Extract `create_device_manager()` to encapsulate device manager setup
- Extract `hypervisor_specific_init()` to orchestrate initialization
sequences for different hypervisors (KVM, MSHV, SEV-SNP)
- Extract `init_sev_snp()` for SEV-SNP confidential VM setup
- Extract `init_mshv()` for MSHV hypervisor initialization
- Extract `init_kvm()` for KVM hypervisor initialization
- Extract `create_fw_cfg_if_enabled()` for fw_cfg device creation
This refactoring replaces complex nested `cfg_if!` blocks with cleaner
conditional method calls, providing clear separation between hypervisor-
specific initialization paths while preserving existing functionality.
No functional changes intended.
Issue: https://github.com/cloud-hypervisor/cloud-hypervisor/issues/7598
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The lock doesn't make any sense. There is no shared ownership. All
accesses are already synchronized by accesses on a higher level.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Update .lychee.toml to exclude the following patterns:
- ARM domains (developer.arm.com, infocenter.arm.com) which return 403
Forbidden due to anti-bot protections in CI.
- Local TCP addresses (192.168.1.10) which are unsupported by the
link-checker tool.
- The .lychee.toml file itself, to prevent the tool from recursively
checking its own regex exclusion patterns as valid URLs.
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Document the device_id parameter in NumaConfig, automatic
guest_numa_id assignment, default NUMA distances, and
restrictions on Generic Initiator NUMA nodes.
Add numa configuration examples with GPU device and distance
relationships.
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Add test_guest_numa_generic_initiator to validate ACPI Generic
Initiator Affinity (SRAT Type 5) support for a VFIO device.
The test verifies the following:
- Guest VM boots with a VFIO device associated with a {cpu,
memory}-less NUMA node
- Guest Kernel correctly detects Generic Initiator through
ACPI tables SRAT, SLIT
- NUMA topology in the guest includes the device-only node
with correct distances
Invoked via:
./scripts/dev_cli.sh tests --integration -- --hypervisor kvm \
--test-filter test_guest_numa_generic_initiator
The test requires a real VFIO device bound to vfio-pci driver and
skips gracefully if hardware is unavailable.
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Update FDT generation to skip NUMA properties when Generic Initiator
nodes are present, preventing conflicts between FDT and ACPI NUMA
information. FDT cannot represent Generic Initiator nodes, so ACPI
(via SRAT Type 5) becomes the authoritative source for the entire
NUMA topology when Generic Initiators exist.
- Skip FDT numa-node-id properties in CPU and memory nodes
when a Generic Initiator is present
- Distance map bug fix: iterate over actual NUMA node IDs instead
of 0..len()
- Use distance symmetry to derive the distance when the forward
config is missing
- Default to distance cost 20 when neither direction is specified
- Only create memory nodes if the NUMA node has a memory region
- Add unit tests
ARM64 boot protocol:
https://docs.kernel.org/arch/arm64/booting.html
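The distance rules listed in this commit can be sketched as follows. The
function names and the map layout are hypothetical. The default of 20
comes from this commit; the local distance of 10 is the usual ACPI SLIT
convention and is an assumption here:

```rust
use std::collections::BTreeMap;

pub const LOCAL_DISTANCE: u8 = 10;
pub const DEFAULT_REMOTE_DISTANCE: u8 = 20;

// Resolve the distance from one node to another:
// forward entry, else the reverse entry (symmetry), else the default.
pub fn resolve_distance(configured: &BTreeMap<(u32, u32), u8>, from: u32, to: u32) -> u8 {
    if from == to {
        return LOCAL_DISTANCE;
    }
    configured
        .get(&(from, to))
        .or_else(|| configured.get(&(to, from)))
        .copied()
        .unwrap_or(DEFAULT_REMOTE_DISTANCE)
}

// Build the full map by iterating over the actual node IDs, which need
// not be contiguous, rather than over 0..len().
pub fn distance_map(
    node_ids: &[u32],
    configured: &BTreeMap<(u32, u32), u8>,
) -> Vec<(u32, u32, u8)> {
    let mut entries = Vec::new();
    for &from in node_ids {
        for &to in node_ids {
            entries.push((from, to, resolve_distance(configured, from, to)));
        }
    }
    entries
}
```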
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Support ACPI Generic Initiator Affinity to associate
PCI devices with NUMA proximity domains
- Add GenericInitiatorAffinity struct
- Add from_pci_bdf() to encode PCI Segment:Bus:Device.Function
- Add from_acpi_device() for ACPI device handles (future use)
- Generate SRAT Type 5 entries for nodes with device_id
- Improve create_slit_table() to check distance symmetry when
forward distance is missing
- Track device ID to BDF mappings in DeviceManager
- Includes comprehensive unit tests
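For reference, a sketch of the conventional 16-bit BDF encoding (bus in
bits 15:8, device in bits 7:3, function in bits 2:0). The helper names are
hypothetical, and the byte layout of the SRAT device handle itself is not
shown here:

```rust
// Pack bus, device, and function into the conventional 16-bit BDF value.
// Returns None if device or function is out of range.
pub fn encode_bdf(bus: u8, device: u8, function: u8) -> Option<u16> {
    if device > 0x1f || function > 0x07 {
        return None;
    }
    Some((u16::from(bus) << 8) | (u16::from(device) << 3) | u16::from(function))
}

// Parse "SSSS:BB:DD.F" (hexadecimal fields) into (segment, bdf).
pub fn parse_sbdf(s: &str) -> Option<(u16, u16)> {
    let mut parts = s.split(':');
    let segment = u16::from_str_radix(parts.next()?, 16).ok()?;
    let bus = u8::from_str_radix(parts.next()?, 16).ok()?;
    let dev_fn = parts.next()?;
    if parts.next().is_some() {
        return None;
    }
    let mut df = dev_fn.split('.');
    let device = u8::from_str_radix(df.next()?, 16).ok()?;
    let function = u8::from_str_radix(df.next()?, 16).ok()?;
    if df.next().is_some() {
        return None;
    }
    Some((segment, encode_bdf(bus, device, function)?))
}
```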
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Validate device_id in numa config is mutually
exclusive with cpus and memory_zones
Add NumaConfig::validate() and modify NumaConfig::parse()
Add ValidationError::InvalidNumaConfig for detailed error
messages
Include unit tests covering valid and invalid configs
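A minimal sketch of the mutual-exclusion check. The struct is a simplified
stand-in with assumed field types, and the error messages are
illustrative only:

```rust
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    InvalidNumaConfig(String),
}

// Simplified, hypothetical stand-in for the real NumaConfig.
#[derive(Default)]
pub struct NumaConfig {
    pub guest_numa_id: u32,
    pub cpus: Option<Vec<u32>>,
    pub memory_zones: Option<Vec<String>>,
    pub device_id: Option<String>,
}

impl NumaConfig {
    // A node with device_id must not also list cpus or memory_zones.
    pub fn validate(&self) -> Result<(), ValidationError> {
        if self.device_id.is_some() {
            if self.cpus.is_some() {
                return Err(ValidationError::InvalidNumaConfig(format!(
                    "node {}: device_id cannot be combined with cpus",
                    self.guest_numa_id
                )));
            }
            if self.memory_zones.is_some() {
                return Err(ValidationError::InvalidNumaConfig(format!(
                    "node {}: device_id cannot be combined with memory_zones",
                    self.guest_numa_id
                )));
            }
        }
        Ok(())
    }
}
```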
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Add an optional device_id string field to NumaConfig for identifying
PCI devices associated with a NUMA node. This is used by the Generic
Initiator support to map devices to their proximity domain.
Update OpenAPI spec (cloud-hypervisor.yaml) to include the
new device_id field in the NumaConfig schema.
The device_id is optional and parsed from the --numa parameter:
--numa "device_id=<device_id>,distances=[...],..."
At this stage the field is parsed and accepted but not yet used.
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
The documentation says guest_numa_id is required to be unique, so it
is dangerous for the parser to silently default a missing
guest_numa_id to 0 with .unwrap_or(0).
Return a validation error if guest_numa_id is not provided instead
of silently defaulting to 0.
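The difference can be sketched as below. The function name and the use of
String as the error type are assumptions for illustration; the real code
returns a validation error type:

```rust
// Parse guest_numa_id, treating a missing value as an error instead of
// falling back to 0.
pub fn parse_guest_numa_id(value: Option<&str>) -> Result<u32, String> {
    let raw = value.ok_or_else(|| "guest_numa_id is required".to_string())?;
    raw.parse::<u32>()
        .map_err(|e| format!("invalid guest_numa_id '{}': {}", raw, e))
}
```

With .unwrap_or(0), two nodes that both omit guest_numa_id would silently
collide on ID 0; with the explicit error, the misconfiguration is
reported at parse time.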
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Verify live resize of QCOW2 disks works via the API, including
resizing that requires L1 table growth.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for live resizing QCOW2 images. This enables growing
the virtual size of a QCOW2 disk while the VM is running.
Key features:
- Growing the image automatically expands the L1 table if needed
- Shrinking is not supported
- Resizing for images with backing files is not supported
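To show when L1 table growth is needed, here is a sketch of the size
arithmetic. It relies on the QCOW2 format's 8-byte L2 entries, so one L1
entry covers cluster_size * (cluster_size / 8) bytes; the function names
are hypothetical:

```rust
// Number of L1 entries needed to map virtual_size bytes.
pub fn l1_entries_needed(virtual_size: u64, cluster_bits: u32) -> u64 {
    let cluster_size = 1u64 << cluster_bits;
    // Each L2 table occupies one cluster and holds 8-byte entries.
    let l2_entries = cluster_size / 8;
    let bytes_per_l1_entry = cluster_size * l2_entries;
    // Round up without risking overflow in the addition.
    virtual_size / bytes_per_l1_entry + u64::from(virtual_size % bytes_per_l1_entry != 0)
}

// True if growing to new_size requires a larger L1 table.
pub fn l1_needs_growth(current_l1_entries: u64, new_size: u64, cluster_bits: u32) -> bool {
    l1_entries_needed(new_size, cluster_bits) > current_l1_entries
}
```

With 64 KiB clusters (cluster_bits = 16), one L1 entry covers 512 MiB, so
growing an image from 512 MiB to 1 GiB needs a second L1 entry.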
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Provide a stub implementation for save_data_tables() to unblock pause
functionality. Without this, pausing a VM causes Cloud Hypervisor to
panic due to the unimplemented!() macro. This unblocks the
test_api_http_pause_resume testcase. We don't need to save any state
just to pause and resume.
Signed-off-by: Anirudh Rayabharam <anrayabh@microsoft.com>
Instead of closing a file descriptor that belongs to the vhost-user
frontend, drop the vu_common_ctrl::VhostUserHandle and the
vhost::vhost_user::Frontend it contains. This causes the destructor to
drop the file descriptor.
This breaks the last DPDK test, so disable it. See #7689.
Fixes: #7163
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
When a VFIO device with multiple MMIO regions is hot-unplugged, each
region must be individually matched and removed from the DeviceManager's
mmio_regions list. Compare per-region rather than building an aggregate
across all regions, which would never match any individual entry.
Also remove the now-unused HashSet import.
Signed-off-by: Damian Barabonkov <dbctl@pm.me>
Change has_matching_slots() to compare two MmioRegion instances
directly rather than requiring callers to construct an intermediate
HashSet of slot numbers. Remove the now-unused
user_memory_region_slots() method and HashSet import.
Signed-off-by: Damian Barabonkov <dbctl@pm.me>
Replace `clone()` with `take()` when retrieving device configurations
from `DeviceManager.config`.
This avoids unnecessarily copying the device configuration lists (e.g.,
`disks`, `net`, `fs`) when they are being processed and subsequently
moved out of the configuration. This optimization improves performance
by reducing memory allocations and cloning overhead.
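The pattern looks roughly like this. The types are simplified stand-ins
and take_disks is a hypothetical helper, shown only to illustrate
Option::take versus clone:

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct DiskConfig {
    pub path: String,
}

// Simplified, hypothetical stand-in for the VM configuration.
#[derive(Default)]
pub struct VmConfig {
    pub disks: Option<Vec<DiskConfig>>,
}

// Move the disk list out of the config instead of cloning it.
// Option::take leaves None behind, so no copy of the Vec is made.
pub fn take_disks(config: &mut VmConfig) -> Vec<DiskConfig> {
    config.disks.take().unwrap_or_default()
}
```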
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Since kernel commit 6693731487a8 ("vsock/virtio: Allocate nonlinear SKBs
for handling large transmit buffers"), a large vsock packet can be split
into multiple descriptors.
If we encounter such TX packets, pull the content into an owned buffer.
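Pulling a split packet into an owned buffer can be sketched as below. The
descriptors are modelled as plain byte slices and gather_packet is a
hypothetical helper; the real code reads from guest memory:

```rust
// Concatenate the payload of several descriptors into one owned buffer.
pub fn gather_packet(descriptors: &[&[u8]]) -> Vec<u8> {
    let total: usize = descriptors.iter().map(|d| d.len()).sum();
    let mut buffer = Vec::with_capacity(total);
    for descriptor in descriptors {
        buffer.extend_from_slice(descriptor);
    }
    buffer
}
```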
Fixes: #7672
Signed-off-by: Wei Liu <liuwe@microsoft.com>
When running manual tests locally, it is sometimes necessary to
generate a cloud-init file at a custom path instead of defaulting
to /tmp. This is useful for developers and higher-level management
layers where files in /tmp may be cleaned up automatically.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Backing files (e.g. for QCOW2) interact badly with landlock since they
are not obvious from the initial VM configuration. Only enable their use
with an explicit option.
Signed-off-by: Rob Bradford <rbradford@meta.com>