When restoring from snapshot with shared=false, write access to the
backing file is not required. Opening it read-only allows restore to
succeed on read-only media and overlay lower layers while preserving
MAP_PRIVATE semantics.
Signed-off-by: Rowen-Ye <rowenye1@gmail.com>
Add the skeleton of the CVM test support and modify the existing
scripts and test framework to enable this scenario. Split the
sha1sum handling to support both regular and CVM guests. Add one
test case for CVM; more test cases will follow.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The x86_64 image download steps are used for both
regular and CVM guests. Move the steps into a function
in test-util.sh.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The Guest struct now has an option to set a timeout, so there is
no need to pass a timeout when booting the guest.
If no timeout is set, the default is used.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Modify the Guest struct to keep some test-specific
data so that test cases can be shared between
regular guests and CVM guests.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Testing generic vhost-user devices will require virtiofsd to support the
--tag option, which v1.8.0 does not support.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Replace generic WritingHeader error with specific SyncingHeader
error for header fsync operations. This provides more precise
error reporting when syncing QCOW2 header changes to disk fails.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Wrap QcowFile in Arc<Mutex<>> to ensure thread safety when multiple
virtio queues access the same QCOW2 image concurrently.
Previously, each queue received its own QcowSync instance via
new_async_io() that shared the underlying QcowFile through Clone.
However, cloned QcowFile instances share internal mutable state
(L2 cache, reference counts, file seek position) without
synchronization, leading to data corruption under concurrent I/O.
This change serializes all QCOW2 operations through a mutex, which
ensures correctness at the cost of parallelism. A more performant
solution would require separating metadata locking from actual I/O
operations, tracked in #7560.
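A minimal sketch of the serialization pattern described above. `FakeQcow`,
`QueueHandle` and `run_queues` are hypothetical stand-ins for illustration
and are not the actual types or functions in the block crate:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for QcowFile: mutable state shared by all queues.
#[derive(Default)]
struct FakeQcow {
    bytes_written: u64,
}

impl FakeQcow {
    fn write_at(&mut self, _offset: u64, len: u64) {
        self.bytes_written += len;
    }
}

/// Each queue clones the Arc, so all queues share one image state.
#[derive(Clone)]
struct QueueHandle {
    file: Arc<Mutex<FakeQcow>>,
}

impl QueueHandle {
    fn write(&self, offset: u64, len: u64) {
        // The lock is held for the whole operation, so queues never interleave.
        let mut file = self.file.lock().unwrap();
        file.write_at(offset, len);
    }
}

/// Spawns one thread per queue and returns the total bytes recorded.
fn run_queues(queues: u64, writes_per_queue: u64) -> u64 {
    let handle = QueueHandle {
        file: Arc::new(Mutex::new(FakeQcow::default())),
    };
    let workers: Vec<_> = (0..queues)
        .map(|q| {
            let h = handle.clone();
            thread::spawn(move || {
                for i in 0..writes_per_queue {
                    h.write((q * writes_per_queue + i) * 512, 512);
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
    let total = handle.file.lock().unwrap().bytes_written;
    total
}
```

Every write takes the same lock, which is the correctness-over-parallelism
trade-off mentioned above.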
Related: #7560
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
TL;DR: Would reduce CI pressure by cancelling more "unnecessary" runs
but I can't verify without running a merge queue.
A common development pattern is to push a change and then immediately
check CI results. Follow-up fix pushes are quite common, which leads to
multiple CI runs being queued for the same pull request.
In Cloud Hypervisor, the size and cost of the CI matrix means that
several consecutive pushes (for example 3-4 in a short time) put
significant pressure on CI runners and noticeably increase feedback
latency.
In practice, concurrency handling is especially tricky for the merge
queue. From personal experience: If one does not take special care, CI
runs triggered by a `merge_group` can cancel each other, as in a merge
queue there are two runs for each job by default: one for the normal PR
and one for the merge commit. This is easy to run into, also because the
available documentation and best practices for this feature are not very
good.
At the same time, our workflows do not run on `push` events, but only
on `pull_request` and `merge_group`. Because of this, using
`${{ github.ref }}` alone as a concurrency key is not very meaningful,
and in practice only few runs are actually cancelled for successive PR
updates. Therefore, we should improve the usage of this feature.
This change tries to improve the situation by refining the concurrency
group key. The goal is to keep cancellation for multiple PR pushes,
while at the same time preventing unintended cancellations in the merge
queue by separating `merge_group` runs from regular PR runs.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
In kata-containers we use the api_client crate, but it's currently
failing our cargo deny check due to missing license, and there aren't
any license files within the crate, so I haven't found a good way to
work around this.
Alternatively I'd be happy to add the license to the workspace crate
and then reference it here, but that seems to clash with the direction
of the project in #7525.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
QCOW2 v3 autoclear_features field contains bits for features whose
metadata becomes invalid when the image is modified by software that
doesn't understand them. Defined bits:
- Bit 0: Bitmaps extension
- Bit 1: Raw external data
Cloud-hypervisor doesn't support bitmaps or external data files, so
all autoclear bits are cleared on writable open. This signals other
tools that these features' data may be stale.
Readonly opens preserve autoclear bits unchanged.
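A minimal sketch of the rule above. The constant and function names are
illustrative assumptions and not the identifiers used in the code:

```rust
/// Autoclear feature bits listed in this commit message.
const AUTOCLEAR_BITMAPS: u64 = 1 << 0;
const AUTOCLEAR_RAW_EXTERNAL_DATA: u64 = 1 << 1;

/// Returns the autoclear_features value to keep after opening the image.
/// Writable opens clear every bit; read-only opens leave the field untouched.
fn autoclear_after_open(autoclear_features: u64, readonly: bool) -> u64 {
    if readonly {
        autoclear_features
    } else {
        0
    }
}
```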
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
This adds some missing features that are useful. In particular it adds
VIRTIO_F_RING_INDIRECT_DESC which gives a performance improvement.
Signed-off-by: Rob Bradford <rbradford@meta.com>
Reported-by: Daniel Farina <daniel@ubicloud.com>
TDX builds its own ACPI tables in `create_acpi_tables_tdx` so it will
return None in the standard `create_acpi_tables` function and the
assertion for `rsdp_addr` will fail.
Signed-off-by: Zhibin Li <banlu.lzb@antgroup.com>
Add tests for corrupt bit behavior during I/O operations.
- Unaligned L2 table address triggers corrupt bit on read
- Unaligned cluster address triggers corrupt bit on read and write
- Normal operations do not set the corrupt bit
- V2 images work correctly without feature bits
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
MSHV doesn't present an ITS to guests, so /proc/interrupts would never
have "ITS-PCI-MSIX".
Instead, a GICv2m frame is presented to guests, so expect
"GICv2m-PCI-MSIX" in test cases.
Signed-off-by: Anirudh Rayabharam <anrayabh@microsoft.com>
Add integration tests for QCOW2 corrupt bit handling. Verify that
images with the corrupt bit set are rejected for writable access but
allowed for read-only access with a warning.
Helper functions are added to read and modify the corrupt flag in the
QCOW2 v3 header.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Validate that L2 table offsets and refcount block offsets are cluster
aligned. Set the corrupt bit when unaligned offsets are detected, as
this indicates corrupted L1 or refcount table entries.
Validate that data cluster offsets from L2 entries are cluster aligned
during both reads and writes to existing clusters. Set the corrupt bit
when unaligned data cluster offsets are detected.
Prevent allocation of clusters at offset 0, which contains the QCOW2
header and should never be allocated. This catches corruption in the
available clusters list. Set the corrupt bit when this condition is
detected.
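A minimal sketch of the checks described above, assuming a hypothetical
`OffsetError` type and helper names. The real code additionally sets the
corrupt bit in the header when a check fails:

```rust
#[derive(Debug, PartialEq)]
enum OffsetError {
    /// The offset is not a multiple of the cluster size.
    Unaligned(u64),
    /// Offset 0 holds the QCOW2 header and must never be allocated.
    HeaderCluster,
}

/// Validates an offset read from an L1, L2 or refcount table entry.
fn check_cluster_offset(offset: u64, cluster_bits: u32) -> Result<u64, OffsetError> {
    let cluster_size = 1u64 << cluster_bits;
    if offset & (cluster_size - 1) != 0 {
        return Err(OffsetError::Unaligned(offset));
    }
    Ok(offset)
}

/// Validates an offset taken from the available clusters list.
fn check_new_allocation(offset: u64, cluster_bits: u32) -> Result<u64, OffsetError> {
    if offset == 0 {
        return Err(OffsetError::HeaderCluster);
    }
    check_cluster_offset(offset, cluster_bits)
}
```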
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Set the QCOW2 corrupt bit when internal inconsistencies are detected
that indicate image metadata may be corrupted:
- Decompression decode failure, meaning compressed cluster data is
invalid
- Decompression size mismatch, where decompressed data doesn't match
expected cluster size
- Partial write after decompression, where L2 table was updated but
data cluster not fully written, leaving metadata inconsistent
- Invalid refcount index, where cluster address is outside valid
refcount table range, indicating a corrupted L2 entry
- Dirty L2 with zero L1 address, where L2 table is marked dirty but
L1 has no address for it
Note: Marking decompression failures as corrupt is more conservative
than QEMU, which returns EIO without setting the corrupt bit. This is
debatable since corrupted compressed data doesn't necessarily indicate
metadata corruption, but it provides a stronger safety guarantee by
preventing further writes to potentially damaged images.
Once set, the image can only be opened read-only until repaired with
qemu-img check -r.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add comprehensive tests for the corrupt bit handling. Cover writable
rejection, read-only access, persistence, and dirty bit
coexistence.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement proper handling of the QCOW2 corrupt bit (incompatible feature
bit 1) according to the specification:
- Add Error::CorruptImage for rejecting writable opens of corrupt images
- Add CORRUPT to SUPPORTED features (handled specially, not rejected)
- Add QcowHeader::set_corrupt_bit() to mark images as corrupt
- Add QcowHeader::is_corrupt() helper method
- Reject writable opens of corrupt images with Error::CorruptImage
- Allow readonly opens of corrupt images with a warning
The corrupt bit indicates that image metadata may be inconsistent. Per
spec, such images must not be written to until repaired by external
tools like qemu-img. Read-only access is permitted to allow data
recovery.
Users can open corrupt images read-only using:
--disk path=/path/to/image.qcow2,readonly=on
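A minimal sketch of the open-time decision, assuming a free function
`check_corrupt` and a local `Error` enum that only mirror the names above;
the actual method signatures may differ:

```rust
/// Incompatible feature bit 1 marks the image as corrupt.
const CORRUPT: u64 = 1 << 1;

#[derive(Debug, PartialEq)]
enum Error {
    CorruptImage,
}

/// Returns Ok(true) when the open may proceed but a warning should be logged,
/// Ok(false) when the image is not marked corrupt.
fn check_corrupt(incompatible_features: u64, readonly: bool) -> Result<bool, Error> {
    let corrupt = (incompatible_features & CORRUPT) != 0;
    match (corrupt, readonly) {
        (false, _) => Ok(false),
        (true, true) => Ok(true),
        (true, false) => Err(Error::CorruptImage),
    }
}
```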
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
VsockPacket::hdr holds a raw pointer to the address of the vsock packet
header, which is in guest memory. This opens the door to double-fetch
(or TOCTOU) race conditions. Therefore, the VsockPacket::hdr content
can't be trusted since it can be arbitrarily changed by the guest at any
time.
To mitigate this, we can copy the header content to an array in VMM's
memory that the guest can't modify.
Signed-off-by: Thomas Leroy <thomas.leroy.mp@gmail.com>
Update QcowHeader and other related places to use BeUint methods
internally for reading/writing header fields.
This removes the byteorder dependency from mod.rs and consolidates
all big-endian file I/O through the shared BeUint trait.
Suggested-by: Rob Bradford <rbradford@rivosinc.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a read_be() method to the BeUint trait and make it pub(super)
so it can be used across the qcow module. Change BeUint::write_be()
to take Self instead of u64, providing type safety through TryFrom
conversion.
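A minimal sketch of what such a trait could look like. The method
signatures and the macro are assumptions for illustration; the real BeUint
trait may be shaped differently:

```rust
use std::io::{self, Read, Write};

/// Hypothetical shape of a big-endian integer trait.
trait BeUint: Sized {
    fn read_be<R: Read>(r: &mut R) -> io::Result<Self>;
    fn write_be<W: Write>(self, w: &mut W) -> io::Result<()>;
}

macro_rules! impl_be_uint {
    ($($t:ty),*) => {$(
        impl BeUint for $t {
            fn read_be<R: Read>(r: &mut R) -> io::Result<Self> {
                let mut buf = [0u8; std::mem::size_of::<$t>()];
                r.read_exact(&mut buf)?;
                Ok(<$t>::from_be_bytes(buf))
            }
            fn write_be<W: Write>(self, w: &mut W) -> io::Result<()> {
                w.write_all(&self.to_be_bytes())
            }
        }
    )*};
}

impl_be_uint!(u16, u32, u64);

/// Writes the QCOW2 magic as a u32 and reads it back.
fn roundtrip_magic() -> io::Result<(Vec<u8>, u32)> {
    let mut buf = Vec::new();
    0x5146_49fb_u32.write_be(&mut buf)?;
    let value = u32::read_be(&mut buf.as_slice())?;
    Ok((buf, value))
}
```

Because write_be() takes Self, the width written is fixed by the type of
the value rather than by a separate size argument.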
Suggested-by: Rob Bradford <rbradford@rivosinc.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add tests to verify dirty bit is set while VM runs and cleared on
clean shutdown. As part of it, ensure graceful shutdown when OS
disk verification requires consistent image state.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify dirty bit is set on open and cleared on close for v3 images.
Ensure v2 and read-only files are not affected. Update existing
tests.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for the dirty bit (bit 0 of incompatible_features) which
indicates the image was not closed cleanly. This improves data
integrity by allowing detection of potentially corrupted images.
On open:
- If dirty bit is already set, log a warning and trigger
refcount rebuild
- Set the dirty bit and write it to disk immediately
- Sync to ensure persistence before any writes
- Skip dirty bit and refcount rebuild for readonly files
On clean close:
- Clear the dirty bit in the header
- Write it to disk and sync
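A minimal sketch of the open/close sequence, assuming a hypothetical
in-memory `Header` type. The header write and fsync steps are only marked
by comments:

```rust
/// Incompatible feature bit 0: the image was not closed cleanly.
const DIRTY: u64 = 1 << 0;

/// Hypothetical in-memory view of the relevant header field.
struct Header {
    incompatible_features: u64,
}

/// Marks the image dirty on open. Returns true when the bit was already
/// set, meaning the caller should warn and rebuild the refcounts.
fn mark_dirty_on_open(header: &mut Header, readonly: bool) -> bool {
    if readonly {
        return false;
    }
    let was_dirty = (header.incompatible_features & DIRTY) != 0;
    header.incompatible_features |= DIRTY;
    // Real code writes the header and syncs here, before any guest write.
    was_dirty
}

/// Clears the dirty bit on a clean close.
fn clear_dirty_on_close(header: &mut Header, readonly: bool) {
    if readonly {
        return;
    }
    header.incompatible_features &= !DIRTY;
    // Real code writes the header and syncs here.
}
```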
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Test all refcount_order values (0-6):
- Basic open for each width
- Write/read roundtrip
- Overwrite and multi-cluster allocation
- L2 cache eviction under memory pressure
- Sub-byte and byte-aligned max value handling
- Overflow error detection
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Reject refcount values exceeding the maximum for the image's
refcount_order. This prevents silent truncation when storing
refcounts in narrow widths (e.g., 1-bit max is 1, 4-bit max is 15,
etc.).
Returns RefcountOverflow error with the attempted value, maximum,
and bit width. Propagates as EINVAL to the guest.
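A minimal sketch of the bound check. The struct fields follow the
description above, but the exact names and the helper functions are
assumptions:

```rust
#[derive(Debug, PartialEq)]
struct RefcountOverflow {
    value: u64,
    max: u64,
    bits: u32,
}

/// Largest refcount that fits in 1 << refcount_order bits (order 0-6).
fn max_refcount(refcount_order: u32) -> u64 {
    let bits = 1u32 << refcount_order;
    if bits == 64 {
        u64::MAX
    } else {
        (1u64 << bits) - 1
    }
}

/// Rejects values that would be truncated when stored.
fn check_refcount(value: u64, refcount_order: u32) -> Result<u64, RefcountOverflow> {
    let max = max_refcount(refcount_order);
    if value > max {
        return Err(RefcountOverflow {
            value,
            max,
            bits: 1 << refcount_order,
        });
    }
    Ok(value)
}
```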
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
QCOW2 v3 specifies refcount_order 0-6 with
refcount_bits = 1 << refcount_order. Previously only 16-bit (order 4)
was supported.
Changes:
- RefcountBytes trait handles byte-aligned types (8/16/32/64-bit)
- Generic pack/unpack for sub-byte widths (1/2/4-bit)
- Function pointers for read/write selected at open time
- Internal refcount type widened from u16 to u64
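A minimal sketch of generic sub-byte packing. It assumes that lower
indices occupy the less significant bits of each byte, which matches
QEMU's layout as far as I know; the function names are hypothetical:

```rust
/// Reads the refcount at `index` from a table of `bits`-wide entries
/// (bits must be 1, 2 or 4).
fn get_refcount(table: &[u8], index: usize, bits: u32) -> u8 {
    debug_assert!(matches!(bits, 1 | 2 | 4));
    let per_byte = (8 / bits) as usize;
    let shift = (index % per_byte) as u32 * bits;
    let mask = (1u8 << bits) - 1;
    (table[index / per_byte] >> shift) & mask
}

/// Stores `value` at `index`, leaving neighbouring entries untouched.
fn set_refcount(table: &mut [u8], index: usize, bits: u32, value: u8) {
    debug_assert!(matches!(bits, 1 | 2 | 4));
    let per_byte = (8 / bits) as usize;
    let shift = (index % per_byte) as u32 * bits;
    let mask = (1u8 << bits) - 1;
    let byte = &mut table[index / per_byte];
    *byte = (*byte & !(mask << shift)) | ((value & mask) << shift);
}
```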
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Parse the feature name table header extension to provide descriptive
error messages when unsupported incompatible features are detected.
Currently only the compression bit (bit 3, zstd) is supported.
This prevents opening qcow2 images with features that would cause
incorrect behavior or data corruption (e.g., dirty bit, corrupt bit,
external data file, extended L2 entries).
Feature names are resolved in the following order:
1. The image's feature name table header extension (if present)
2. Hardcoded fallback names for known features
3. Generic "unknown feature bit N" for undefined features
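A minimal sketch of this lookup. The function name and the hardcoded
strings are illustrative assumptions; the bit numbers follow the QCOW2
incompatible feature list:

```rust
use std::collections::HashMap;

/// Resolves a display name for an incompatible feature bit.
/// `image_table` holds names parsed from the image's feature name table
/// header extension and is empty when the extension is absent.
fn feature_name(bit: u8, image_table: &HashMap<u8, String>) -> String {
    // 1. Prefer the name stored in the image itself.
    if let Some(name) = image_table.get(&bit) {
        return name.clone();
    }
    // 2. Fall back to hardcoded names for known bits.
    let known = match bit {
        0 => "dirty bit",
        1 => "corrupt bit",
        2 => "external data file",
        3 => "compression type",
        4 => "extended L2 entries",
        // 3. Generic name for anything else.
        _ => return format!("unknown feature bit {bit}"),
    };
    known.to_string()
}
```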
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Co-developed-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Implement read support for bit 0 in QCOW2 L2 table entries.
When this flag is set, the cluster reads as zeros without accessing
disk. This improves compatibility with QCOW2 images that use this
optimization.
According to the QCOW2 specification, bit 0 of the standard cluster
descriptor indicates that the cluster reads as zeros. Unlike
l2_entry == 0 indicating a completely unallocated entry, bit 0 can
be set on an allocated cluster.
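A minimal sketch of the distinction, limited to standard (uncompressed)
cluster descriptors. The enum and function names are hypothetical; the
offset mask covers bits 9-55 as in the specification:

```rust
/// Bit 0 of a standard cluster descriptor: the cluster reads as zeros.
const ZERO_FLAG: u64 = 1 << 0;
/// Bits 9-55 hold the host cluster offset.
const OFFSET_MASK: u64 = 0x00ff_ffff_ffff_fe00;

#[derive(Debug, PartialEq)]
enum ClusterRead {
    /// l2_entry == 0: nothing is allocated for this cluster.
    Unallocated,
    /// Zero flag set: return zeros without touching the disk.
    Zero,
    /// Read the data from this host offset.
    Data(u64),
}

fn classify_l2_entry(entry: u64) -> ClusterRead {
    if entry == 0 {
        ClusterRead::Unallocated
    } else if (entry & ZERO_FLAG) != 0 {
        ClusterRead::Zero
    } else {
        ClusterRead::Data(entry & OFFSET_MASK)
    }
}
```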
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
build_edk2 was leaving behind .built markers
even when compilation failed.
Gate creation of the .built marker so that it occurs
only on a successful build.
Modify build_edk2() to exit with an error code
when the arm64 firmware artifact, CLOUDHV_EFI.fd,
is not produced.
Fixes #7608
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
The build_edk2() function in scripts/common-aarch64.sh
does not produce the UEFI firmware for aarch64 because
the commits used to assemble sources for acpica,
edk2-platforms and edk2 do not compile after the GCC
version was upgraded from 11.4.0 to 13.3.0 in the
developer container (Ubuntu 22.04 to 24.04).
Apply the minimum upgrade to EDK2_REPO and ACPICA_REPO
required to compile with GCC 13.3.0
while still ensuring guest VM boot for all
integration tests.
BaseTools: the Brotli compression submodule that was
previously failing is fixed by the commit
bump.
Developers can now produce UEFI firmware for
aarch64 using the following commands:
```
./scripts/dev_cli.sh shell
source scripts/test-util.sh
source scripts/common-aarch64.sh
build_edk2
```
Update docs/uefi.md.
Fixes #7608
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Recent changes related to arm64 support in MSHV exposed
inconsistencies in the VM initialization and CVM boot paths.
The VM creation flow currently diverges across multiple scenarios,
including regular MSHV, CVM, and arm64, with each path performing
guest initialization steps in a different order.
Certain platform-specific requirements further constrain the ordering
of operations, such as the timing of address space creation,
IGVM loading, interrupt controller setup, and payload loading. For the
CVM case, address space creation must be done after IGVM loading and
PSP measurement. For regular and arm64 guests, this memory initialization
must be done early. For MSHV, vm.init() and sev_snp.init() are called in
a different order that depends on both run-time and build-time checks.
Additionally, while the KVM initialization path differs slightly
from MSHV, it shares common logic that is currently split across
separate conditional and build-time code paths, contributing to
fragmentation of the overall flow.
This change restructures the VM creation and initialization sequence
to better align shared logic, enforce scenario-specific ordering
constraints, and ensure consistent and correct behavior across all
supported configurations. In doing so, it restores proper CVM boot
behavior and improves the maintainability of the initialization code.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The spec simply says that an empty payload should be returned on
error. Be slightly more helpful by adding a warning.
Signed-off-by: Rob Bradford <rbradford@meta.com>
Based upon the discussion in
https://github.com/rust-vmm/vhost/issues/29#issue-830820820 and the QEMU
behaviour, the get_config offset should be zero. This was not caught by
our integration tests because the vhost-user-blk backend implemented in
this repository does not use the offset.
Fixes: #7615
Signed-off-by: Rob Bradford <rbradford@meta.com>