The lock doesn't make any sense. There is no shared ownership. All
accesses are already synchronized by accesses on a higher level.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Update .lychee.toml to exclude the following patterns:
- ARM domains (developer.arm.com, infocenter.arm.com) which return 403
Forbidden due to anti-bot protections in CI.
- Local TCP addresses (192.168.1.10) which are unsupported by the
link-checker tool.
- The .lychee.toml file itself, to prevent the tool from recursively
checking its own regex exclusion patterns as valid URLs.
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Document the device_id parameter in NumaConfig, automatic
guest_numa_id assignment, default NUMA distances, and
restrictions on Generic Initiator NUMA nodes.
Add numa configuration examples with GPU device and distance
relationships.
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Add test_guest_numa_generic_initiator to validate ACPI Generic
Initiator Affinity (SRAT Type 5) support for VFIO device.
The test verifies the following:
- Guest VM boots with a VFIO device associated with a {cpu,
memory}-less NUMA node
- Guest Kernel correctly detects Generic Initiator through
ACPI tables SRAT, SLIT
- NUMA topology in the guest includes the device-only node
with correct distances
Invoked via:
./scripts/dev_cli.sh tests --integration -- --hypervisor kvm \
--test-filter test_guest_numa_generic_initiator
The test requires a real VFIO device bound to vfio-pci driver and
skips gracefully if hardware is unavailable.
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Update FDT generation to skip NUMA properties when Generic Initiator
nodes are present, preventing conflicts between FDT and ACPI NUMA
information. FDT cannot represent Generic Initiator nodes, so ACPI
(via SRAT Type 5) becomes the authoritative source for the entire
NUMA topology when Generic Initiators exist.
Skip FDT numa-node-id properties in CPU and memory nodes
when Generic Initiator is present
Distance map bug fix: iterate over actual NUMA node IDs instead
of 0..len()
Use distance symmetry to derive distance when forward config is
missing
Default to distance cost 20 when neither direction specified
Only create memory nodes if NUMA node has memory region
Add unit tests
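
The distance rules above can be sketched roughly as follows. This is
an illustration only, not the code from this change: resolve_distance(),
distance_matrix() and the (from, to) map layout are hypothetical, and
the local distance of 10 is an assumption based on the usual SLIT
convention.

```rust
use std::collections::HashMap;

/// Distance from a node to itself (assumption: usual SLIT convention).
const LOCAL_DISTANCE: u8 = 10;
/// Fallback when neither direction is configured (from the commit text).
const DEFAULT_DISTANCE: u8 = 20;

/// Hypothetical helper: `distances` maps (from, to) node IDs to a distance.
fn resolve_distance(distances: &HashMap<(u32, u32), u8>, from: u32, to: u32) -> u8 {
    if from == to {
        return LOCAL_DISTANCE;
    }
    distances
        .get(&(from, to))
        // Forward entry missing: fall back to the reverse direction.
        .or_else(|| distances.get(&(to, from)))
        .copied()
        .unwrap_or(DEFAULT_DISTANCE)
}

/// Build the matrix over the actual node IDs, which may be sparse
/// (for example 0, 2, 5), rather than over 0..len().
fn distance_matrix(node_ids: &[u32], distances: &HashMap<(u32, u32), u8>) -> Vec<Vec<u8>> {
    node_ids
        .iter()
        .map(|&from| {
            node_ids
                .iter()
                .map(|&to| resolve_distance(distances, from, to))
                .collect()
        })
        .collect()
}
```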
ARM64 boot protocol:
https://docs.kernel.org/arch/arm64/booting.html
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Support ACPI Generic Initiator Affinity to associate
PCI devices with NUMA proximity domains
Add GenericInitiatorAffinity struct
Add from_pci_bdf() to encode PCI Segment:Bus:Device.Function
Add from_acpi_device() for ACPI device handles (future use)
Generate SRAT Type 5 entries for nodes with device_id
Improve create_slit_table() to check distance symmetry when
forward distance is missing
Track device ID to BDF mappings in DeviceManager
Includes comprehensive unit tests
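
For illustration, the Bus:Device.Function part of such an encoding
follows the conventional 16-bit PCI layout. The sketch below is not
the from_pci_bdf() added here; encode_bdf() and its signature are
hypothetical, and the PCI segment is assumed to be carried in a
separate field.

```rust
/// Hypothetical helper: pack bus/device/function into the conventional
/// 16-bit PCI BDF value (bus in bits 15:8, device in bits 7:3,
/// function in bits 2:0). Returns None for out-of-range inputs.
fn encode_bdf(bus: u8, device: u8, function: u8) -> Option<u16> {
    // A PCI bus has at most 32 devices, each with at most 8 functions.
    if device > 0x1f || function > 0x07 {
        return None;
    }
    Some((u16::from(bus) << 8) | (u16::from(device) << 3) | u16::from(function))
}
```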
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Validate device_id in numa config is mutually
exclusive with cpus and memory_zones
Add NumaConfig::validate() and modify NumaConfig::parse()
Add ValidationError::InvalidNumaConfig for detailed error
messages
Include unit tests covering valid and invalid configs
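
A minimal sketch of the mutual-exclusion rule, assuming a simplified
node type. NumaNode, its field types and the error text below are
hypothetical and do not mirror the real NumaConfig definition; only
the name ValidationError::InvalidNumaConfig comes from this change.

```rust
/// Simplified stand-in for a NUMA node configuration (hypothetical).
#[derive(Debug, Default)]
struct NumaNode {
    guest_numa_id: u32,
    cpus: Option<Vec<u32>>,
    memory_zones: Option<Vec<String>>,
    device_id: Option<String>,
}

#[derive(Debug, PartialEq)]
enum ValidationError {
    InvalidNumaConfig(String),
}

/// A device-only node must not also list CPUs or memory zones.
fn validate(node: &NumaNode) -> Result<(), ValidationError> {
    if node.device_id.is_some() && (node.cpus.is_some() || node.memory_zones.is_some()) {
        return Err(ValidationError::InvalidNumaConfig(format!(
            "node {}: device_id cannot be combined with cpus or memory_zones",
            node.guest_numa_id
        )));
    }
    Ok(())
}
```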
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Add an optional device_id string field to NumaConfig for identifying
PCI devices associated with a NUMA node. This is used by the Generic
Initiator support to map devices to their proximity domain.
Update OpenAPI spec (cloud-hypervisor.yaml) to include the
new device_id field in the NumaConfig schema.
The device_id is optional and parsed from the --numa parameter:
--numa "device_id=<device_id>,distances=[...],..."
The optional field is accepted but not used.
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
The documentation says guest_numa_id is required to be unique, so
having the parser silently default a missing guest_numa_id to 0
with .unwrap_or(0) is dangerous.
Return a validation error if guest_numa_id is not provided instead
of silently defaulting to 0.
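
The difference can be sketched as below, assuming the options have
already been split into key/value pairs. parse_guest_numa_id() and
ParseError are hypothetical names for illustration, not the real
parser API.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum ParseError {
    MissingGuestNumaId,
    InvalidGuestNumaId(String),
}

/// Hypothetical helper: return an error for a missing key instead of
/// falling back to 0, which could collide with a real node 0.
fn parse_guest_numa_id(options: &HashMap<&str, &str>) -> Result<u32, ParseError> {
    let raw = options
        .get("guest_numa_id")
        .ok_or(ParseError::MissingGuestNumaId)?;
    raw.parse::<u32>()
        .map_err(|_| ParseError::InvalidGuestNumaId(raw.to_string()))
}
```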
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Verify live resize of QCOW2 disks works via the API, including
resizing that requires L1 table growth.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for live resizing QCOW2 images. This enables growing
the virtual size of a QCOW2 disk while the VM is running.
Key features:
- Growing the image automatically expands the L1 table if needed
- Shrinking is not supported
- Resizing for images with backing files is not supported
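
As a rough illustration of when the L1 table has to grow: in the
QCOW2 format each L2 table occupies one cluster of 8-byte entries, so
one L1 entry covers cluster_size * (cluster_size / 8) bytes of guest
data. The helpers below are hypothetical and are not taken from this
change.

```rust
/// Size of one L2 table entry in bytes (QCOW2 format).
const L2_ENTRY_SIZE: u64 = 8;

/// Number of L1 entries needed to map `virtual_size` bytes.
fn l1_entries_needed(cluster_bits: u32, virtual_size: u64) -> u64 {
    let cluster_size = 1u64 << cluster_bits;
    let l2_entries = cluster_size / L2_ENTRY_SIZE;
    let bytes_per_l1_entry = cluster_size * l2_entries;
    // Round up so a partially used L2 table still gets an L1 entry.
    virtual_size / bytes_per_l1_entry + u64::from(virtual_size % bytes_per_l1_entry != 0)
}

/// True if growing to `new_size` needs more L1 entries than exist now.
fn l1_needs_growth(cluster_bits: u32, current_l1_entries: u64, new_size: u64) -> bool {
    l1_entries_needed(cluster_bits, new_size) > current_l1_entries
}
```

With 64 KiB clusters (cluster_bits = 16), one L1 entry covers 512 MiB,
so growing a 1 GiB image by even one byte requires a third L1 entry.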
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Provide a stub implementation for save_data_tables() to unblock pause
functionality. Without this, pausing a VM causes Cloud Hypervisor to
panic due to the unimplemented!() macro. This unblocks the
test_api_http_pause_resume testcase. We don't need to save any state
just to pause and resume.
Signed-off-by: Anirudh Rayabharam <anrayabh@microsoft.com>
Instead of closing a file descriptor that belongs to the vhost-user
frontend, drop the vu_common_ctrl::VhostUserHandle and the
vhost::vhost_user::Frontend it contains. This causes the destructor to
drop the file descriptor.
This breaks the last DPDK test, so disable it. See #7689.
Fixes: #7163
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
When a VFIO device with multiple MMIO regions is hot-unplugged, each
region must be individually matched and removed from the DeviceManager's
mmio_regions list. Compare per-region rather than building an aggregate
across all regions, which would never match any individual entry.
Also remove the now-unused HashSet import.
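
A simplified sketch of per-region removal. The MmioRegion type here
is a hypothetical stand-in that compares by start and length; the
real code matches regions by their memory slots.

```rust
/// Hypothetical, simplified region type for illustration.
#[derive(Debug, Clone, PartialEq)]
struct MmioRegion {
    start: u64,
    length: u64,
}

/// Drop every tracked region that matches one of the device's regions.
/// Each device region is compared individually against each entry.
fn remove_device_regions(tracked: &mut Vec<MmioRegion>, device_regions: &[MmioRegion]) {
    tracked.retain(|entry| !device_regions.iter().any(|region| region == entry));
}
```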
Signed-off-by: Damian Barabonkov <dbctl@pm.me>
Change has_matching_slots() to compare two MmioRegion instances
directly rather than requiring callers to construct an intermediate
HashSet of slot numbers. Remove the now-unused
user_memory_region_slots() method and HashSet import.
Signed-off-by: Damian Barabonkov <dbctl@pm.me>
Replace `clone()` with `take()` when retrieving device configurations
from `DeviceManager.config`.
This avoids unnecessarily copying the device configuration lists (e.g.,
`disks`, `net`, `fs`) when they are being processed and subsequently
moved out of the configuration. This optimization improves performance
by reducing memory allocations and cloning overhead.
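
A minimal sketch of the pattern, using a hypothetical VmConfig with a
single `disks` field. Option::take() moves the list out and leaves
None behind, so nothing is copied; the caller is assumed to store the
processed list back if it is still needed.

```rust
/// Hypothetical, reduced configuration type for illustration.
#[derive(Debug, Default)]
struct VmConfig {
    disks: Option<Vec<String>>,
}

/// Move the disk list out of the config without cloning it.
fn take_disks(config: &mut VmConfig) -> Vec<String> {
    // take() replaces the field with None and returns the old value.
    config.disks.take().unwrap_or_default()
}
```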
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Since kernel commit 6693731487a8 ("vsock/virtio: Allocate nonlinear SKBs
for handling large transmit buffers"), a large vsock packet can be split
into multiple descriptors.
If we encounter such TX packets, pull the content into an owned buffer.
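
Conceptually, this amounts to concatenating the descriptor buffers
into one owned buffer. The sketch below uses plain byte slices in
place of guest memory descriptors; gather_packet() is a hypothetical
helper, not the function added here.

```rust
/// Copy a packet that is split across several buffers into one Vec.
fn gather_packet(chunks: &[&[u8]]) -> Vec<u8> {
    // Reserve once so the copy does not reallocate.
    let total: usize = chunks.iter().map(|chunk| chunk.len()).sum();
    let mut owned = Vec::with_capacity(total);
    for chunk in chunks {
        owned.extend_from_slice(chunk);
    }
    owned
}
```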
Fixes: #7672
Signed-off-by: Wei Liu <liuwe@microsoft.com>
When running manual tests locally, it is sometimes necessary to
generate a cloud-init file at a custom path instead of defaulting
to /tmp. This is useful for developers and higher-level management
layers where files in /tmp may be cleaned up automatically.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Backing files (e.g. for QCOW2) interact badly with landlock since they
are not obvious from the initial VM configuration. Only enable their use
with an explicit option.
Signed-off-by: Rob Bradford <rbradford@meta.com>
Test that reading from an overlay at offsets beyond the backing file
returns zeros. Covers reads within the backing range, beyond the
backing file, and spanning the boundary.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When an overlay QCOW2 image is larger than its backing file, reads
from offsets beyond the backing file virtual size would previously
fail with an I/O error.
The backing file virtual size is determined at open time and stored
for bounds checking during read operations:
- If the entire read is beyond the backing size, return all zeros
- If the read spans the boundary, read available data from backing and
fill the remainder with zeros
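
The two cases can be sketched with an in-memory backing buffer. This
is an illustration under simplified assumptions (a byte slice instead
of a file, a hypothetical read_from_backing() helper), not the code
from this change.

```rust
/// Fill `buf` from `backing` starting at `offset`; any part of the
/// read that lies beyond the end of the backing data reads as zeros.
fn read_from_backing(backing: &[u8], offset: usize, buf: &mut [u8]) {
    // Bytes actually available from the backing data for this read.
    let available = backing.len().saturating_sub(offset).min(buf.len());
    if available > 0 {
        buf[..available].copy_from_slice(&backing[offset..offset + available]);
    }
    // Entirely beyond the backing size: available == 0, all zeros.
    // Spanning the boundary: only the tail is zero-filled.
    buf[available..].fill(0);
}
```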
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Set HV_PARTITION_CREATION_FLAG_SMT_ENABLED_GUEST when the guest
topology has more than one thread per core. This allows the
hypervisor to schedule guest VPs correctly on SMT-enabled hosts.
Without this flag, the hypervisor schedules guest VPs incorrectly,
making SMT unusable.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The kernel allows madvise on shared memory if
/sys/kernel/mm/transparent_hugepage/shmem_enabled is set.
Always try to configure THP via madvise when
the user requests THP be enabled.
If this fails, only a warning log is emitted and THP won't be enabled.
Signed-off-by: Champ-Goblem <cameron@northflank.com>
When restoring from snapshot with shared=false, write access to the
backing file is not required. Opening it read-only allows restore to
succeed on read-only media and overlay lower layers while preserving
MAP_PRIVATE semantics.
Signed-off-by: Rowen-Ye <rowenye1@gmail.com>
This patch adds the skeleton of CVM test support and modifies
the existing scripts and test framework to enable this
scenario. Split the sha1sum to support both regular and CVM
guests. Add one test case for CVM; more test cases will
follow.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The x86_64 image download steps are used for both
regular and CVM guests. Keep the steps within a function
in test-util.sh.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The Guest struct now has an option to set a timeout, so there is
no need to pass a timeout when booting the guest.
If no timeout is set, the default is used.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Modify the Guest struct to keep some test-specific
data so that test cases can be shared between
regular and CVM guests.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Testing generic vhost-user devices will require virtiofsd to support the
--tag option, which v1.8.0 does not support.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Replace generic WritingHeader error with specific SyncingHeader
error for header fsync operations. This provides more precise
error reporting when syncing QCOW2 header changes to disk fails.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Wrap QcowFile in Arc<Mutex<>> to ensure thread safety when multiple
virtio queues access the same QCOW2 image concurrently.
Previously, each queue received its own QcowSync instance via
new_async_io() that shared the underlying QcowFile through Clone.
However, cloned QcowFile instances share internal mutable state
(L2 cache, reference counts, file seek position) without
synchronization, leading to data corruption under concurrent I/O.
This change serializes all QCOW2 operations through a mutex, which
ensures correctness at the cost of parallelism. A more performant
solution would require separating metadata locking from actual I/O
operations, tracked in #7560.
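
The serialization approach can be sketched as follows. ImageState is
a hypothetical stand-in for the mutable state inside QcowFile, and
each thread plays the role of one virtio queue; the real types and
I/O paths differ.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for shared mutable image state.
#[derive(Default)]
struct ImageState {
    writes: u64,
    position: u64,
}

/// Run `queues` workers that all update the same state through a mutex.
/// Returns (total writes, final position).
fn run_queues(queues: usize, writes_per_queue: u64) -> (u64, u64) {
    let shared = Arc::new(Mutex::new(ImageState::default()));
    let handles: Vec<_> = (0..queues)
        .map(|_| {
            let image = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..writes_per_queue {
                    // Every operation holds the lock, so the updates to
                    // position and counter never interleave.
                    let mut state = image.lock().unwrap();
                    state.position += 512;
                    state.writes += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let state = shared.lock().unwrap();
    (state.writes, state.position)
}
```

Because every operation takes the same lock, throughput is limited to
one queue at a time, which is the trade-off described above.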
Related: #7560
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
TL;DR: Would reduce CI pressure by cancelling more "unnecessary" runs
but I can't verify without running a merge queue.
A common development pattern is to push a change and then immediately
check CI results. Follow-up fix pushes are quite common, which leads to
multiple CI runs being queued for the same pull request.
In Cloud Hypervisor, the size and cost of the CI matrix means that
several consecutive pushes (for example 3-4 in a short time) put
significant pressure on CI runners and noticeably increase feedback
latency.
In practice, concurrency handling is especially tricky for the merge
queue. From personal experience: If one does not take special care, CI
runs triggered by a `merge_group` can cancel each other, as in a merge
queue there are two runs for each job by default: one for the normal PR
and one for the merge commit. This is easy to run into, partly because
the available documentation and best practices for this feature are not
very good.
At the same time, our workflows do not run on `push` events, but only
on `pull_request` and `merge_group`. Because of this, using
`${{ github.ref }}` alone as a concurrency key is not very meaningful,
and in practice only a few runs are actually cancelled for successive PR
updates. Therefore, we should improve the usage of this feature.
This change tries to improve the situation by refining the concurrency
group key. The goal is to keep cancellation for multiple PR pushes,
while at the same time preventing unintended cancellations in the merge
queue by separating `merge_group` runs from regular PR runs.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
In kata-containers we use the api_client crate, but it's currently
failing our cargo deny check due to missing license, and there aren't
any license files within the crate, so I haven't found a good way to
work around this.
Alternatively I'd be happy to add the license to the workspace crate
and then reference it here, but that seems to clash with the direction
of the project in #7525.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
QCOW2 v3 autoclear_features field contains bits for features whose
metadata becomes invalid when the image is modified by software that
doesn't understand them. Defined bits:
- Bit 0: Bitmaps extension
- Bit 1: Raw external data
Cloud-hypervisor doesn't support bitmaps or external data files, so
all autoclear bits are cleared on writable open. This signals other
tools that these features' data may be stale.
Readonly opens preserve autoclear bits unchanged.
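
The rule can be summarized in a few lines. The constant and function
names below are hypothetical; only the bit positions come from the
description above.

```rust
/// Autoclear bit 0: bitmaps extension.
const AUTOCLEAR_BITMAPS: u64 = 1 << 0;
/// Autoclear bit 1: raw external data.
const AUTOCLEAR_RAW_EXTERNAL_DATA: u64 = 1 << 1;

/// Value of autoclear_features to keep after opening the image.
fn autoclear_after_open(autoclear_features: u64, writable: bool) -> u64 {
    if writable {
        // None of the autoclear features are supported, so clear all bits.
        0
    } else {
        // Read-only opens never modify the image, so keep the bits.
        autoclear_features
    }
}
```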
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
This adds some missing features that are useful. In particular it adds
VIRTIO_F_RING_INDIRECT_DESC which gives a performance improvement.
Signed-off-by: Rob Bradford <rbradford@meta.com>
Reported-by: Daniel Farina <daniel@ubicloud.com>
TDX builds its own ACPI tables in `create_acpi_tables_tdx` so it will
return None in the standard `create_acpi_tables` function and the
assertion for `rsdp_addr` will fail.
Signed-off-by: Zhibin Li <banlu.lzb@antgroup.com>