Set the QCOW2 corrupt bit when internal inconsistencies are detected
that indicate image metadata may be corrupted:
- Decompression decode failure, meaning compressed cluster data is
invalid
- Decompression size mismatch, where decompressed data doesn't match
expected cluster size
- Partial write after decompression, where L2 table was updated but
data cluster not fully written, leaving metadata inconsistent
- Invalid refcount index, where cluster address is outside valid
refcount table range, indicating a corrupted L2 entry
- Dirty L2 with zero L1 address, where L2 table is marked dirty but
L1 has no address for it
Note: Marking decompression failures as corrupt is more conservative
than QEMU, which returns EIO without setting the corrupt bit. This is
debatable since corrupted compressed data doesn't necessarily indicate
metadata corruption, but it provides a stronger safety guarantee by
preventing further writes to potentially damaged images.
Once set, the image can only be opened read-only until repaired with
qemu-img check -r.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add comprehensive tests for the corrupt bit handling. Cover writable
rejection, read-only access, persistence, and dirty bit
coexistence.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement proper handling of the QCOW2 corrupt bit (incompatible feature
bit 1) according to the specification:
- Add Error::CorruptImage for rejecting writable opens of corrupt images
- Add CORRUPT to SUPPORTED features (handled specially, not rejected)
- Add QcowHeader::set_corrupt_bit() to mark images as corrupt
- Add QcowHeader::is_corrupt() helper method
- Reject writable opens of corrupt images with Error::CorruptImage
- Allow readonly opens of corrupt images with a warning
The corrupt bit indicates that image metadata may be inconsistent. Per
spec, such images must not be written to until repaired by external
tools like qemu-img. Read-only access is permitted to allow data
recovery.
Users can open corrupt images read-only using:
--disk path=/path/to/image.qcow2,readonly=on
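
A minimal sketch of the open-time check described above. The Header
struct, OpenError enum and check_open() are hypothetical stand-ins, not
the actual QcowHeader code; only the bit position (incompatible feature
bit 1) comes from the specification.

```rust
/// Incompatible feature bit 1: image metadata may be inconsistent.
const CORRUPT: u64 = 1 << 1;

#[derive(Debug, PartialEq)]
enum OpenError {
    CorruptImage,
}

/// Hypothetical stand-in for the relevant part of the image header.
struct Header {
    incompatible_features: u64,
}

impl Header {
    fn is_corrupt(&self) -> bool {
        self.incompatible_features & CORRUPT != 0
    }

    fn set_corrupt_bit(&mut self) {
        self.incompatible_features |= CORRUPT;
    }
}

/// Reject writable opens of corrupt images, allow read-only ones.
fn check_open(header: &Header, read_only: bool) -> Result<(), OpenError> {
    if header.is_corrupt() && !read_only {
        return Err(OpenError::CorruptImage);
    }
    Ok(())
}
```

The sketch leaves out the warning that is logged on read-only opens.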
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
VsockPacket::hdr holds a raw pointer to the vsock packet header, which
lives in guest memory. This opens the door to double-fetch (TOCTOU)
race conditions: the content behind VsockPacket::hdr can't be trusted,
since the guest can change it arbitrarily at any time.
To mitigate this, we can copy the header content to an array in VMM's
memory that the guest can't modify.
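
A minimal sketch of the idea, with guest memory modelled as a plain
byte slice. PacketHeader and its methods are hypothetical names; the
44-byte size and the offset of the len field are assumed from the
virtio vsock header layout.

```rust
/// Size of the virtio vsock packet header (assumed layout).
const HDR_SIZE: usize = 44;
/// Byte offset of the little-endian `len` field (assumed layout).
const LEN_OFFSET: usize = 24;

/// Header bytes copied once into VMM-owned memory.
struct PacketHeader {
    bytes: [u8; HDR_SIZE],
}

impl PacketHeader {
    /// Snapshot the header; later guest writes cannot affect the copy.
    fn copy_from_guest(guest: &[u8]) -> Option<Self> {
        let src = guest.get(..HDR_SIZE)?;
        let mut bytes = [0u8; HDR_SIZE];
        bytes.copy_from_slice(src);
        Some(Self { bytes })
    }

    /// Every read of the payload length sees the same snapshot value.
    fn payload_len(&self) -> u32 {
        let b = &self.bytes;
        u32::from_le_bytes([
            b[LEN_OFFSET],
            b[LEN_OFFSET + 1],
            b[LEN_OFFSET + 2],
            b[LEN_OFFSET + 3],
        ])
    }
}
```

Validation and use of a field then operate on the same bytes, so a
value that passed a bounds check cannot change before it is used.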
Signed-off-by: Thomas Leroy <thomas.leroy.mp@gmail.com>
Update QcowHeader and other related places to use BeUint methods
internally for reading/writing header fields.
This removes the byteorder dependency from mod.rs and consolidates
all big-endian file I/O through the shared BeUint trait.
Suggested-by: Rob Bradford <rbradford@rivosinc.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a read_be() method to the BeUint trait and make it pub(super)
so it can be used across the qcow module. Change BeUint::write_be()
to take Self instead of u64, providing type safety through TryFrom
conversion.
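
A rough sketch of what such a trait could look like. The signatures
below are assumptions made for illustration and may differ from the
actual BeUint definition.

```rust
use std::io::{self, Read, Write};

/// Hypothetical shape of a big-endian integer I/O trait.
trait BeUint: Sized {
    fn read_be<R: Read>(r: &mut R) -> io::Result<Self>;
    fn write_be<W: Write>(w: &mut W, value: Self) -> io::Result<()>;
}

macro_rules! impl_be_uint {
    ($($t:ty),*) => {$(
        impl BeUint for $t {
            fn read_be<R: Read>(r: &mut R) -> io::Result<Self> {
                let mut buf = [0u8; std::mem::size_of::<$t>()];
                r.read_exact(&mut buf)?;
                Ok(<$t>::from_be_bytes(buf))
            }

            fn write_be<W: Write>(w: &mut W, value: Self) -> io::Result<()> {
                w.write_all(&value.to_be_bytes())
            }
        }
    )*};
}

impl_be_uint!(u16, u32, u64);
```

Because write_be() takes Self, a caller holding a u64 has to narrow it
explicitly (for example with u32::try_from), so an out-of-range value
surfaces as an error instead of being truncated.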
Suggested-by: Rob Bradford <rbradford@rivosinc.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add tests to verify that the dirty bit is set while the VM runs and
cleared on clean shutdown. As part of this, ensure graceful shutdown
when OS disk verification requires a consistent image state.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify dirty bit is set on open and cleared on close for v3 images.
Ensure v2 and read-only files are not affected. Update existing
tests.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for the dirty bit (bit 0 of incompatible_features) which
indicates the image was not closed cleanly. This improves data
integrity by allowing detection of potentially corrupted images.
On open:
- If dirty bit is already set, log a warning and trigger
refcount rebuild
- Set the dirty bit and write it to disk immediately
- Sync to ensure persistence before any writes
- Skip dirty bit and refcount rebuild for readonly files
On clean close:
- Clear the dirty bit in the header
- Write it to disk and sync
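
The open/close rules above can be sketched as pure functions on the
feature bits. OpenOutcome, on_open() and on_clean_close() are
hypothetical names; the sketch ignores the actual disk writes and syncs
and does not model the image version.

```rust
/// Incompatible feature bit 0: image was not closed cleanly.
const DIRTY: u64 = 1 << 0;

/// Result of the open-time dirty bit handling.
#[derive(Debug, PartialEq)]
struct OpenOutcome {
    /// Feature bits to persist (write + sync) before any guest write.
    features: u64,
    /// True when a previous unclean shutdown requires a refcount rebuild.
    rebuild_refcounts: bool,
}

fn on_open(features: u64, read_only: bool) -> OpenOutcome {
    if read_only {
        // Read-only files are left untouched: no dirty bit, no rebuild.
        return OpenOutcome { features, rebuild_refcounts: false };
    }
    OpenOutcome {
        features: features | DIRTY,
        rebuild_refcounts: features & DIRTY != 0,
    }
}

fn on_clean_close(features: u64, read_only: bool) -> u64 {
    if read_only {
        features
    } else {
        features & !DIRTY
    }
}
```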
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Test all refcount_order values (0-6):
- Basic open for each width
- Write/read roundtrip
- Overwrite and multi-cluster allocation
- L2 cache eviction under memory pressure
- Sub-byte and byte-aligned max value handling
- Overflow error detection
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Reject refcount values exceeding the maximum for the image's
refcount_order. This prevents silent truncation when storing
refcounts in narrow widths (e.g., 1-bit max is 1, 4-bit max is 15,
etc.).
Returns RefcountOverflow error with the attempted value, maximum,
and bit width. Propagates as EINVAL to the guest.
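
A minimal sketch of the bound check. The struct and function names are
illustrative only and do not mirror the real error type; refcount_order
is assumed to be already validated to the 0-6 range.

```rust
/// Illustrative error carrying the attempted value, maximum and width.
#[derive(Debug, PartialEq)]
struct RefcountOverflow {
    value: u64,
    max: u64,
    bits: u32,
}

/// Largest refcount representable for a given refcount_order (0..=6).
fn max_refcount(refcount_order: u32) -> u64 {
    let bits = 1u32 << refcount_order;
    if bits >= 64 {
        u64::MAX
    } else {
        (1u64 << bits) - 1
    }
}

/// Return the value unchanged if it fits, otherwise report the overflow.
fn check_refcount(value: u64, refcount_order: u32) -> Result<u64, RefcountOverflow> {
    let max = max_refcount(refcount_order);
    if value > max {
        Err(RefcountOverflow { value, max, bits: 1 << refcount_order })
    } else {
        Ok(value)
    }
}
```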
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
QCOW2 v3 specifies refcount_order 0-6 with
refcount_bits = 1 << refcount_order. Previously only 16-bit (order 4)
was supported.
Changes:
- RefcountBytes trait handles byte-aligned types (8/16/32/64-bit)
- Generic pack/unpack for sub-byte widths (1/2/4-bit)
- Function pointers for read/write selected at open time
- Internal refcount type widened from u16 to u64
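
For the sub-byte widths, packing and unpacking can be sketched as
below. The function names are hypothetical, and the sketch assumes
entries are stored least-significant-bits first within each byte; it is
not the actual implementation.

```rust
/// Read the `index`-th entry of width `bits` (1, 2 or 4) from `buf`.
fn read_sub_byte(buf: &[u8], index: usize, bits: u32) -> u8 {
    debug_assert!(matches!(bits, 1 | 2 | 4));
    let per_byte = (8 / bits) as usize;
    let shift = (index % per_byte) as u32 * bits;
    let mask = ((1u16 << bits) - 1) as u8;
    (buf[index / per_byte] >> shift) & mask
}

/// Store `value` as the `index`-th entry of width `bits` (1, 2 or 4).
/// Bits of `value` above the width are dropped; range checking is
/// expected to happen before this point.
fn write_sub_byte(buf: &mut [u8], index: usize, bits: u32, value: u8) {
    debug_assert!(matches!(bits, 1 | 2 | 4));
    let per_byte = (8 / bits) as usize;
    let shift = (index % per_byte) as u32 * bits;
    let mask = ((1u16 << bits) - 1) as u8;
    let byte = &mut buf[index / per_byte];
    *byte = (*byte & !(mask << shift)) | ((value & mask) << shift);
}
```

Selecting one such reader/writer pair per image at open time keeps the
per-access path free of width dispatch.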
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Parse the feature name table header extension to provide descriptive
error messages when unsupported incompatible features are detected.
Currently only the compression bit (bit 3, zstd) is supported.
This prevents opening qcow2 images with features that would cause
incorrect behavior or data corruption (e.g., dirty bit, corrupt bit,
external data file, extended L2 entries).
Feature names are resolved in the following order:
1. The image's feature name table header extension (if present)
2. Hardcoded fallback names for known features
3. Generic "unknown feature bit N" for undefined features
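
The lookup order can be sketched as follows. feature_name() and
fallback_name() are hypothetical helpers, the table is modelled as a
map from bit number to name, and the fallback strings are illustrative
labels for the incompatible feature bits defined by the specification.

```rust
use std::collections::HashMap;

/// Hardcoded names for known incompatible feature bits.
fn fallback_name(bit: u8) -> Option<&'static str> {
    match bit {
        0 => Some("dirty bit"),
        1 => Some("corrupt bit"),
        2 => Some("external data file"),
        3 => Some("compression type"),
        4 => Some("extended L2 entries"),
        _ => None,
    }
}

/// Prefer the image's own table, then the fallback, then a generic name.
fn feature_name(bit: u8, table: &HashMap<u8, String>) -> String {
    table
        .get(&bit)
        .cloned()
        .or_else(|| fallback_name(bit).map(String::from))
        .unwrap_or_else(|| format!("unknown feature bit {}", bit))
}
```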
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Co-developed-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Implement read support for bit 0 in QCOW2 L2 table entries.
When this flag is set, the cluster reads as zeros without accessing
disk. This improves compatibility with QCOW2 images that use this
optimization.
According to the QCOW2 specification, bit 0 of the standard cluster
descriptor indicates that the cluster reads as zeros. Unlike
l2_entry == 0 indicating a completely unallocated entry, bit 0 can
be set on an allocated cluster.
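
A small sketch of the distinction, for standard (uncompressed) cluster
descriptors only. ClusterKind and classify_l2_entry() are hypothetical
names; the offset mask assumes the host cluster offset occupies bits
9-55 of the entry.

```rust
/// Bit 0 of a standard cluster descriptor: cluster reads as zeros.
const ZERO_FLAG: u64 = 1 << 0;
/// Bits 9-55: host cluster offset (assumed from the L2 entry layout).
const HOST_OFFSET_MASK: u64 = 0x00ff_ffff_ffff_fe00;

#[derive(Debug, PartialEq)]
enum ClusterKind {
    /// l2_entry == 0: nothing allocated, fall through to the backing file.
    Unallocated,
    /// Zero flag set: return zeros without touching the disk.
    Zero,
    /// Regular data cluster at the given host offset.
    Data { host_offset: u64 },
}

fn classify_l2_entry(entry: u64) -> ClusterKind {
    if entry == 0 {
        ClusterKind::Unallocated
    } else if entry & ZERO_FLAG != 0 {
        // The flag wins even if the entry also carries a host offset.
        ClusterKind::Zero
    } else {
        ClusterKind::Data { host_offset: entry & HOST_OFFSET_MASK }
    }
}
```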
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
build_edk2 was leaving behind .built markers
even when compilation failed.
Gate creation of the .built marker so that it
occurs only on a successful build.
Modify build_edk2() to exit with an error code
when the arm64 firmware artifact, CLOUDHV_EFI.fd,
is not produced.
Fixes: #7608
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
build_edk2() in scripts/common-aarch64.sh does not
produce the UEFI firmware for aarch64: the commits
used to assemble the acpica, edk2-platforms and
edk2 sources no longer compile after GCC was
upgraded from 11.4.0 to 13.3.0 in the developer
container (Ubuntu 22.04 to 24.04).
Apply the minimum upgrade to EDK2_REPO and
ACPICA_REPO required to compile with GCC 13.3.0
while still ensuring that the guest VM boots in
all integration tests.
BaseTools: the Brotli compression submodule that
was previously failing is fixed by the commit
bump.
Developers can now produce UEFI firmware for
aarch64 using the following commands:
```
./scripts/dev_cli.sh shell
source scripts/test-util.sh
source scripts/common-aarch64.sh
build_edk2
```
Update docs/uefi.md
Fixes: #7608
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
Recent changes related to arm64 support in MSHV exposed
inconsistencies in the VM initialization and CVM boot paths.
The VM creation flow currently diverges across multiple scenarios,
including regular MSHV, CVM, and arm64, with each path performing
guest initialization steps in a different order.
Certain platform-specific requirements further constrain the ordering
of operations, such as the timing of address space creation,
IGVM loading, interrupt controller setup, and payload loading. In the
CVM case, address space creation must happen after IGVM loading and
PSP measurement. For regular and arm64 guests, this memory
initialization must happen early. For MSHV, vm.init() and
sev_snp.init() are called in different orders, selected by a mix of
run-time and build-time conditional checks.
Additionally, while the KVM initialization path differs slightly
from MSHV, it shares common logic that is currently split across
separate conditional and build-time code paths, contributing to
fragmentation of the overall flow.
This change restructures the VM creation and initialization sequence
to better align shared logic, enforce scenario-specific ordering
constraints, and ensure consistent and correct behavior across all
supported configurations. In doing so, it restores proper CVM boot
behavior and improves the maintainability of the initialization code.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The spec says simply that an empty payload should be returned on
error. Be slightly more helpful by adding a warning.
Signed-off-by: Rob Bradford <rbradford@meta.com>
Based upon the discussion in
https://github.com/rust-vmm/vhost/issues/29#issue-830820820 and the QEMU
behaviour, the get_config offset should be zero. This was not caught by
our integration tests as the vhost-user-blk backend as implemented in
this repository does not use the offset.
Fixes: #7615
Signed-off-by: Rob Bradford <rbradford@meta.com>
Since the mshv integration workflow has been stable for a long time,
make the workflows no longer optional.
Signed-off-by: Aastha Rawat <aastharawat@microsoft.com>
Add warmup_iterations field to run iterations before measuring
performance. This complements existing cold start tests
by separating cache effects from steady state throughput.
New tests with 2 warmup iterations:
- block_qcow2_backing_qcow2_read_warm_MiBps
- block_qcow2_backing_raw_read_warm_MiBps
Results show warm cache is much faster and more consistent:
- QCOW2: 1766 MiB/s (4% variance) vs cold 960 MiB/s (73% variance)
- RAW: 1822 MiB/s (6% variance) vs cold 1300 MiB/s (55% variance)
RAW backing is 3% faster than QCOW2 in steady state.
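
The measurement scheme can be sketched as a helper that discards the
warmup runs. run_with_warmup() is a hypothetical function written for
illustration, not the harness code; each call to the closure stands for
one benchmark iteration returning a throughput figure.

```rust
/// Run `warmup_iterations` unmeasured iterations, then collect the
/// results of `measured_iterations` further iterations.
fn run_with_warmup<F>(
    warmup_iterations: u32,
    measured_iterations: u32,
    mut run_once: F,
) -> Vec<f64>
where
    F: FnMut() -> f64,
{
    for _ in 0..warmup_iterations {
        // Warm caches; the result is intentionally discarded.
        let _ = run_once();
    }
    (0..measured_iterations).map(|_| run_once()).collect()
}
```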
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The write_to() function is used by test code to create qcow2 files for
testing. For v3 headers with extended header_size (>104), it needs to:
1. Write the mandatory compression_type field at bytes 104-111
2. Write the header extension end marker at the header_size offset
3. Seek to backing_file_offset before writing the backing file path
Additionally, create_for_size_and_path() must set backing_file_offset
to account for the 8 byte extension end marker in v3 files, so the
backing file path doesn't overwrite the extension area.
Add unit tests for read_header_extensions() covering backing format
parsing (raw/qcow2), unknown extensions, and error cases (invalid
formats, invalid UTF-8). These tests depend on the header writing fixes
to create properly formatted v3 test files.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for parsing QCOW v3 header extensions to read the
backing file format. The QCOW v3 spec allows optional header
extensions between the fixed header and the backing file name.
Implement read_header_extensions() to parse the extension area,
which starts at the header_size offset. At the moment it is
only used to read the backing file format. Processing of further
extensions is left to follow-up implementations.
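
A simplified sketch of walking the extension area, operating on an
in-memory slice that starts at header_size. The function and error
names are hypothetical and differ from read_header_extensions(). The
layout assumed here is a big-endian u32 type, a big-endian u32 length,
then data padded to a multiple of 8 bytes, with type 0 as end marker
and 0xE2792ACA as the backing file format name.

```rust
const EXT_END: u32 = 0;
const EXT_BACKING_FORMAT: u32 = 0xE279_2ACA;

#[derive(Debug, PartialEq)]
enum ExtError {
    Truncated,
    InvalidUtf8,
}

fn be_u32(bytes: &[u8], at: usize) -> Result<u32, ExtError> {
    let end = at.checked_add(4).ok_or(ExtError::Truncated)?;
    let b = bytes.get(at..end).ok_or(ExtError::Truncated)?;
    Ok(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Return the backing format name, if an extension declares one.
/// A missing end marker is reported as Truncated in this sketch.
fn read_backing_format(area: &[u8]) -> Result<Option<String>, ExtError> {
    let mut pos = 0usize;
    let mut format = None;
    loop {
        let ext_type = be_u32(area, pos)?;
        let len = be_u32(area, pos + 4)? as usize;
        if ext_type == EXT_END {
            return Ok(format);
        }
        let start = pos + 8;
        let end = start.checked_add(len).ok_or(ExtError::Truncated)?;
        let data = area.get(start..end).ok_or(ExtError::Truncated)?;
        if ext_type == EXT_BACKING_FORMAT {
            let name = std::str::from_utf8(data).map_err(|_| ExtError::InvalidUtf8)?;
            format = Some(name.to_string());
        }
        // Unknown extensions are skipped; data is padded to 8 bytes.
        pos = start + ((len + 7) & !7);
    }
}
```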
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for raw backing files in addition to qcow2 backing
files. This enables QCOW2 overlays to use raw images as their
backing store.
The backing file format is auto-detected when not specified,
using the existing detect_image_type() function.
Add backing_file_format field to QcowHeader to store the format
type, which will be populated from header extensions by a
subsequent patch.
Modify new_from_backing() to accept a backing_format parameter,
consolidating support for both raw and qcow2 backing files in a
single function. The backing_file_size parameter allows overlay
creation without opening the backing file multiple times.
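
The format selection can be sketched as below. ImageType, detect() and
resolve_backing_format() are simplified, hypothetical stand-ins for the
real types and for detect_image_type(); the sketch only distinguishes
qcow2 (by its "QFI\xfb" magic) from everything else.

```rust
/// Big-endian magic at the start of a qcow2 file: "QFI\xfb".
const QCOW_MAGIC: u32 = 0x5146_49fb;

#[derive(Debug, Clone, Copy, PartialEq)]
enum ImageType {
    Raw,
    Qcow2,
}

/// Simplified detection: anything without the qcow2 magic is raw.
fn detect(first_bytes: &[u8]) -> ImageType {
    match first_bytes.get(..4) {
        Some(m) if u32::from_be_bytes([m[0], m[1], m[2], m[3]]) == QCOW_MAGIC => {
            ImageType::Qcow2
        }
        _ => ImageType::Raw,
    }
}

/// An explicitly declared format wins; otherwise auto-detect.
fn resolve_backing_format(declared: Option<ImageType>, first_bytes: &[u8]) -> ImageType {
    declared.unwrap_or_else(|| detect(first_bytes))
}
```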
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
In most cases, special registers don't change during emulation, but the
current code writes them back unconditionally. Although some of them
are set through the register page, others require a system call and a
hypercall to be updated, which is wasted work if nothing changed.
Introduce a CPU update method for the Microsoft Hypervisor emulator and
set special registers only when they have changed. This change reduces
guest boot time by 4% for a single-VP guest boot (in an L1VH partition)
in my experiments.
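
The idea can be sketched as a compare-before-write. SpecialRegs, Vcpu
and commit() are hypothetical types made up for illustration; the
hypercalls counter merely stands in for the cost of the real update
path.

```rust
/// Hypothetical subset of the special register state.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
struct SpecialRegs {
    cr0: u64,
    cr3: u64,
    cr4: u64,
    efer: u64,
}

/// Stand-in for a vCPU whose register writes are expensive.
#[derive(Default)]
struct Vcpu {
    sregs: SpecialRegs,
    hypercalls: u32,
}

impl Vcpu {
    fn set_sregs(&mut self, sregs: SpecialRegs) {
        self.sregs = sregs;
        self.hypercalls += 1;
    }

    /// Write the registers back only if emulation changed them.
    fn commit(&mut self, old: &SpecialRegs, new: &SpecialRegs) {
        if old != new {
            self.set_sregs(*new);
        }
    }
}
```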
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
This is a precursor change to overall ioctl and hypercall reduction
effort. The old (current) CPU state can be compared to the new to
determine what has changed and avoid unnecessary register updates.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Minor modifications were made to make the sentences sound more natural.
Also fixed some parameter usage issues in the bash code block.
Signed-off-by: Yi Wang <foxywang@tencent.com>
There are some syntax and formatting issues in the tdx/sev documents.
Make some modifications so the descriptions read more naturally.
Also fix the invalid SEV-SNP link.
Signed-off-by: Yi Wang <foxywang@tencent.com>
There are some minor syntax and command issues in the debug-port
document. Since commit 5febdec81a (vmm: Enable `gdbstub` on AArch64)
added aarch64 support, the docs should be kept consistent with it.
Signed-off-by: Yi Wang <foxywang@tencent.com>
Fix some minor syntax issues in the api/building document to make
the sentences more fluent and easier to read.
Signed-off-by: Yi Wang <foxywang@tencent.com>
Some descriptions in the device document were inconsistent with the
source code. Also fix some syntax issues to make the sentences more
fluent.
Signed-off-by: Yi Wang <foxywang@tencent.com>
CI reports:
In scripts/test-util.sh line 216:
cleanup() {
^-- SC2329 (info): This function is never invoked. Check usage (or ignored if invoked indirectly).
shellcheck can't trace calls made through trap, so we need to add a
hint to make it happy.
Signed-off-by: Yi Wang <foxywang@tencent.com>
The IORT table's ID mapping uses a 256-ID offset per PCI segment to
ensure unique device IDs across all segments. This partitioning scheme
(output_base = 256 * segment_id) must match the device ID encoding used
in KVM MSI routing configuration [1].
This mapping assumes one bus per PCI segment, and supports up to 256 PCI
segments in the system.
[1] c9374d87ac
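
The partitioning can be illustrated with a short sketch. The function
names are hypothetical, and device_id() assumes the ID within a segment
is the 8-bit devfn on that segment's single bus.

```rust
/// Number of device IDs reserved per PCI segment.
const IDS_PER_SEGMENT: u32 = 256;

/// IORT ID mapping output base for a segment: 256 * segment_id.
fn iort_output_base(segment_id: u16) -> u32 {
    IDS_PER_SEGMENT * u32::from(segment_id)
}

/// Device ID as it would need to appear in the MSI routing entry,
/// assuming a single bus per segment so that only devfn varies.
fn device_id(segment_id: u16, devfn: u8) -> u32 {
    iort_output_base(segment_id) + u32::from(devfn)
}
```

With this scheme, segment 1 covers device IDs 256 to 511, so IDs from
different segments never collide.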
Signed-off-by: Bo Chen <bchen@crusoe.ai>
The IORT specification (Revision E.b, Table 12) defines the ITS Group
Node structure with an ITS Identifiers array following the node header.
Although the IORT table is zero-initialized, this commit adds an
explicit write of the ITS identifier value (0) for clarity and spec
compliance.
This ITS identifier must match the `translation_id` field in the MADT
GIC ITS structure to ensure proper interrupt routing on ARM platforms.
Signed-off-by: Bo Chen <bchen@crusoe.ai>
The current IORT table implementation is based on IORT Spec revision E.b
[1], as evidenced by:
* The PCI root complex node revision being set to `3`
* The code being updated in late 2021 [2] when revision E.b was the
latest version
This patch ensures the IORT table is properly generated according to
this specification revision, fixing three issues:
1. The IORT table revision should be `3` rather than `2` (see Table 2 in
the spec [1])
2. The GIC ITS group node revision should be `1` rather than `0`
(see Table 12 in the spec [1])
3. The "Memory access properties" and "ATS Attribute" fields of the PCI
root complex node were set incorrectly, specifically the MAF (Memory
Access Flags), including the CPM and DACS bits (see Tables 14, 15,
and 17 in the spec [1])
[1] https://developer.arm.com/documentation/den0049/eb/?lang=en
[2] https://github.com/cloud-hypervisor/cloud-hypervisor/pull/3356
Signed-off-by: Bo Chen <bchen@crusoe.ai>
It should always succeed and is apparently implicitly called by libc or
some dependency somewhere.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>