Verify that opening a QCOW2 image with a backing file reference
through QcowDiskSync with backing_files=off produces the user-facing
BackingFilesDisabled error rather than MaxNestingDepthExceeded.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When a QCOW2 image has a backing file but backing_files=on is not set,
the reported error was MaxNestingDepthExceeded, which gives no
indication that this is a policy decision or how to resolve it.
Add a BackingFilesDisabled error variant whose message indicates that
backing file support is disabled and references the backing_files
option. The translation from MaxNestingDepthExceeded to
BackingFilesDisabled happens at the QcowDiskSync boundary where the
policy decision is made, preserving the original error for genuine
recursive depth exhaustion.
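The translation described above can be sketched roughly as follows; the enum and function names here are illustrative, not the actual cloud-hypervisor identifiers:

```rust
// Sketch of the error translation at the QcowDiskSync policy boundary.
#[derive(Debug, PartialEq)]
enum DiskError {
    MaxNestingDepthExceeded,
    BackingFilesDisabled,
}

fn translate_open_error(e: DiskError, backing_files_enabled: bool) -> DiskError {
    match e {
        // Only rewrite the error when the backing_files=off policy caused
        // it; genuine recursive depth exhaustion is reported unchanged.
        DiskError::MaxNestingDepthExceeded if !backing_files_enabled => {
            DiskError::BackingFilesDisabled
        }
        other => other,
    }
}

fn main() {
    assert_eq!(
        translate_open_error(DiskError::MaxNestingDepthExceeded, false),
        DiskError::BackingFilesDisabled
    );
    assert_eq!(
        translate_open_error(DiskError::MaxNestingDepthExceeded, true),
        DiskError::MaxNestingDepthExceeded
    );
    println!("ok");
}
```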
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add comprehensive tests for DISCARD and WRITE_ZEROES operations:
The QCOW2 zero flag test validates the complete workflow: allocate a
cluster, DISCARD it, verify reads return zeros, write new data, and
verify the cluster is reallocated.
QcowSync tests verify punch_hole and write_zeroes with Arc<Mutex<>>
sharing, including tests for cache consistency with multiple async
I/O operations.
RawFileSync tests verify punch_hole and write_zeroes using
fallocate.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement punch_hole() and write_zeroes() for raw file backends using
io_uring and fallocate.
punch_hole() uses FALLOC_FL_PUNCH_HOLE to deallocate storage.
write_zeroes() uses FALLOC_FL_ZERO_RANGE to write zeros efficiently.
Both use FALLOC_FL_KEEP_SIZE to maintain file size.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement punch_hole and write_zeroes for QcowSync backend by
delegating to QcowFile::punch_hole which triggers cluster
deallocation. write_zeroes delegates to punch_hole as unallocated
clusters read as zeros in QCOW2.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add VIRTIO_BLK_T_DISCARD and VIRTIO_BLK_T_WRITE_ZEROES request types.
Parse discard/write_zeroes descriptors (sector, num_sectors, flags),
convert to byte offsets, and call punch_hole/write_zeroes on the disk
backend. Mark as unsupported in sync mode.
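The descriptor-to-byte conversion can be sketched like this; the struct and function names are ours, not the actual virtio-blk code:

```rust
// Illustrative sketch: convert one virtio-blk discard/write-zeroes
// segment (sector, num_sectors) into a byte range with overflow and
// bounds checks before calling into the disk backend.
const SECTOR_SIZE: u64 = 512;

struct DiscardSegment {
    sector: u64,
    num_sectors: u32,
}

fn segment_to_byte_range(seg: &DiscardSegment, disk_len: u64) -> Option<(u64, u64)> {
    let offset = seg.sector.checked_mul(SECTOR_SIZE)?;
    let len = (seg.num_sectors as u64).checked_mul(SECTOR_SIZE)?;
    let end = offset.checked_add(len)?;
    if end > disk_len {
        return None; // range exceeds the disk: reject the request
    }
    Some((offset, len)) // handed to punch_hole()/write_zeroes()
}

fn main() {
    // Sector 4, 8 sectors -> bytes 2048..6144 on a 1 MiB disk.
    let seg = DiscardSegment { sector: 4, num_sectors: 8 };
    assert_eq!(segment_to_byte_range(&seg, 1 << 20), Some((2048, 4096)));
    // Out-of-bounds range is rejected.
    let bad = DiscardSegment { sector: 1 << 15, num_sectors: 1 };
    assert_eq!(segment_to_byte_range(&bad, 1 << 20), None);
    println!("ok");
}
```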
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement DISCARD using the QCOW2 zero flag (bit 0 of L2 entries) with
sparse-aware behavior.
When sparse=true, fully deallocate clusters by decrementing the
refcount, clearing the L2 entry, and reclaiming storage via punch_hole
when the refcount reaches zero.
When sparse=false, use the zero flag to keep storage allocated while
marking the cluster as reading zeros. This only works when the cluster
is not shared; shared clusters are fully deallocated.
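The two strategies can be sketched on a single L2 entry; the bit layout follows the QCOW2 spec, while the helper names are illustrative:

```rust
// Sketch of the two DISCARD strategies applied to one L2 entry.
const L2_ZERO_FLAG: u64 = 1; // bit 0: cluster reads as zeros

fn discard_entry(entry: u64, sparse: bool, shared: bool) -> u64 {
    if sparse || shared {
        // Full deallocation: clear the entry. In the real code this is
        // paired with a refcount decrement and a punch_hole on the host
        // file once the refcount reaches zero.
        0
    } else {
        // Keep the cluster allocated but mark it as reading zeros.
        entry | L2_ZERO_FLAG
    }
}

fn reads_as_zeros(entry: u64) -> bool {
    entry == 0 || entry & L2_ZERO_FLAG != 0
}

fn main() {
    let entry = 0x10000; // cluster allocated at host offset 64 KiB
    let zeroed = discard_entry(entry, false, false);
    assert!(reads_as_zeros(zeroed));
    assert_eq!(zeroed & !L2_ZERO_FLAG, entry); // storage still allocated
    assert_eq!(discard_entry(entry, true, false), 0); // sparse: deallocated
    println!("ok");
}
```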
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When sparse=false is configured, preallocate the entire raw disk file
at startup using fallocate(). This provides space reservation and
reduces fragmentation.
Only applies to raw disks. QCOW2/VHD/VHDX formats manage their own
allocation.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add sparse parameter to QcowFile constructors and propagate it from
device_manager through QcowDiskSync. This makes the sparse configuration
available throughout the QCOW2 implementation for controlling allocation
and deallocation behavior.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add supports_zero_flag() to DiskFile trait to indicate whether a disk
format can mark clusters/blocks as reading zeros without deallocating
storage.
QCOW2 supports this via the zero flag in L2 entries. VHDX also has
PAYLOAD_BLOCK_ZERO state for this, though it's not yet implemented in
cloud-hypervisor.
This enables DISCARD to be advertised even with sparse=false for formats
with zero-flag support, since they can mark regions as zeros (keeps
storage allocated) instead of requiring full deallocation.
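The capability query and the advertisement rule can be sketched as follows; the real DiskFile trait has many more methods than shown here:

```rust
// Sketch of a format-capability query on the disk trait.
trait DiskFile {
    /// Can this format mark regions as reading zeros without
    /// deallocating storage (e.g. the QCOW2 zero flag)?
    fn supports_zero_flag(&self) -> bool {
        false
    }
}

struct RawFileDisk;
struct QcowDisk;

impl DiskFile for RawFileDisk {}
impl DiskFile for QcowDisk {
    fn supports_zero_flag(&self) -> bool {
        true // zero flag in L2 entries
    }
}

// DISCARD can be advertised when the backend either deallocates
// (sparse=true) or can at least mark regions as zeros.
fn advertise_discard(disk: &dyn DiskFile, sparse: bool) -> bool {
    sparse || disk.supports_zero_flag()
}

fn main() {
    assert!(!advertise_discard(&RawFileDisk, false));
    assert!(advertise_discard(&RawFileDisk, true));
    assert!(advertise_discard(&QcowDisk, false));
    println!("ok");
}
```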
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add capability query to DiskFile trait to check backend
support for sparse operations (punch hole, write zeroes,
discard). Only advertise VIRTIO_BLK_F_DISCARD and
VIRTIO_BLK_F_WRITE_ZEROES when the backend supports these
operations.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add functions to probe whether a file or block device actually
supports PUNCH_HOLE and ZERO_RANGE operations at runtime. The
probe is performed at file open time by testing the operations
at EOF with a zero-length range, which is a safe no-op.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add punch_hole() and write_zeroes() methods to the AsyncIo trait
with stub implementations for all backends. These will be used to
support DISCARD and WRITE_ZEROES operations.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for live resizing QCOW2 images. This enables growing
the virtual size of a QCOW2 disk while the VM is running.
Key features:
- Growing the image automatically expands the L1 table if needed
- Shrinking is not supported
- Resizing for images with backing files is not supported
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Backing files (e.g. for QCOW2) interact badly with landlock since they
are not obvious from the initial VM configuration. Only enable their use
with an explicit option.
Signed-off-by: Rob Bradford <rbradford@meta.com>
Test that reading from an overlay at offsets beyond the backing file
returns zeros. Cover reads within the backing range, beyond it, and
spanning the boundary.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When an overlay QCOW2 image is larger than its backing file, reads
from offsets beyond the backing file virtual size would previously
fail with an I/O error.
The backing file virtual size is determined at open time and stored
for bounds checking during read operations:
- If the entire read is beyond the backing size, return all zeros
- If the read spans the boundary, read available data from backing and
fill the remainder with zeros
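The bounds check can be sketched with an in-memory buffer standing in for the backing file I/O; the function name is illustrative:

```rust
// Sketch of overlay reads against a shorter backing file.
fn read_from_backing(backing: &[u8], offset: usize, buf: &mut [u8]) {
    let backing_size = backing.len(); // determined once at open time
    if offset >= backing_size {
        buf.fill(0); // entire read is beyond the backing file
        return;
    }
    let avail = backing_size - offset;
    if buf.len() <= avail {
        buf.copy_from_slice(&backing[offset..offset + buf.len()]);
    } else {
        // Read spans the boundary: copy what exists, zero-fill the rest.
        buf[..avail].copy_from_slice(&backing[offset..]);
        buf[avail..].fill(0);
    }
}

fn main() {
    let backing = [1u8, 2, 3, 4];
    let mut buf = [0xffu8; 4];
    read_from_backing(&backing, 2, &mut buf); // spans the boundary
    assert_eq!(buf, [3, 4, 0, 0]);
    read_from_backing(&backing, 8, &mut buf); // fully beyond
    assert_eq!(buf, [0, 0, 0, 0]);
    println!("ok");
}
```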
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Replace the generic WritingHeader error with a specific SyncingHeader
error for header fsync operations. This provides more precise error
reporting when syncing QCOW2 header changes to disk fails.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Wrap QcowFile in Arc<Mutex<>> to ensure thread safety when multiple
virtio queues access the same QCOW2 image concurrently.
Previously, each queue received its own QcowSync instance via
new_async_io() that shared the underlying QcowFile through Clone.
However, cloned QcowFile instances share internal mutable state
(L2 cache, reference counts, file seek position) without
synchronization, leading to data corruption under concurrent I/O.
This change serializes all QCOW2 operations through a mutex, which
ensures correctness at the cost of parallelism. A more performant
solution would require separating metadata locking from actual I/O
operations, tracked in #7560.
Related: #7560
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The QCOW2 v3 autoclear_features field contains bits for features whose
metadata becomes invalid when the image is modified by software that
doesn't understand them. Defined bits:
- Bit 0: Bitmaps extension
- Bit 1: Raw external data
Cloud-hypervisor doesn't support bitmaps or external data files, so
all autoclear bits are cleared on writable open. This signals other
tools that these features' data may be stale.
Readonly opens preserve autoclear bits unchanged.
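The open-time handling reduces to a small rule, sketched here with illustrative names:

```rust
// Sketch of autoclear handling on open. Per the QCOW2 spec, software
// that does not implement an autoclear feature must clear its bit
// before writing to the image.
const AUTOCLEAR_BITMAPS: u64 = 1 << 0; // bitmaps extension
const AUTOCLEAR_RAW_EXTERNAL: u64 = 1 << 1; // raw external data

fn autoclear_on_open(autoclear_features: u64, readonly: bool) -> u64 {
    if readonly {
        autoclear_features // read-only: preserve all bits unchanged
    } else {
        // cloud-hypervisor implements neither feature, so clear all
        // bits to signal other tools the feature data may be stale.
        0
    }
}

fn main() {
    let bits = AUTOCLEAR_BITMAPS | AUTOCLEAR_RAW_EXTERNAL;
    assert_eq!(autoclear_on_open(bits, true), bits);
    assert_eq!(autoclear_on_open(bits, false), 0);
    println!("ok");
}
```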
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add tests for corrupt bit behavior during I/O operations.
- Unaligned L2 table address triggers corrupt bit on read
- Unaligned cluster address triggers corrupt bit on read and write
- Normal operations do not set the corrupt bit
- V2 images work correctly without feature bits
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Validate that L2 table offsets and refcount block offsets are cluster
aligned. Set the corrupt bit when unaligned offsets are detected, as
this indicates corrupted L1 or refcount table entries.
Validate that data cluster offsets from L2 entries are cluster aligned
during both reads and writes to existing clusters. Set the corrupt bit
when unaligned data cluster offsets are detected.
Prevent allocation of clusters at offset 0, which contains the QCOW2
header and should never be allocated. This catches corruption in the
available clusters list. Set the corrupt bit when this condition is
detected.
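The offset checks above can be sketched as one validation helper (the corrupt-bit side effect is elided; names are illustrative):

```rust
// Sketch of the cluster-offset validation; cluster_size is a power of
// two (1 << cluster_bits).
fn validate_cluster_offset(offset: u64, cluster_size: u64) -> Result<u64, &'static str> {
    if offset == 0 {
        // Offset 0 holds the QCOW2 header and must never be handed out
        // as a cluster; this indicates a corrupted free-cluster list.
        return Err("cluster allocated at offset 0");
    }
    if offset & (cluster_size - 1) != 0 {
        // Unaligned L2/refcount-block/data offset: corrupted table entry.
        return Err("unaligned cluster offset");
    }
    Ok(offset)
}

fn main() {
    let cluster_size = 64 * 1024;
    assert!(validate_cluster_offset(0x20000, cluster_size).is_ok());
    assert!(validate_cluster_offset(0x20200, cluster_size).is_err());
    assert!(validate_cluster_offset(0, cluster_size).is_err());
    println!("ok");
}
```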
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Set the QCOW2 corrupt bit when internal inconsistencies are detected
that indicate image metadata may be corrupted:
- Decompression decode failure, meaning compressed cluster data is
invalid
- Decompression size mismatch, where decompressed data doesn't match
expected cluster size
- Partial write after decompression, where L2 table was updated but
data cluster not fully written, leaving metadata inconsistent
- Invalid refcount index, where cluster address is outside valid
refcount table range, indicating a corrupted L2 entry
- Dirty L2 with zero L1 address, where L2 table is marked dirty but
L1 has no address for it
Note: Marking decompression failures as corrupt is more conservative
than QEMU, which returns EIO without setting the corrupt bit. This is
debatable since corrupted compressed data doesn't necessarily indicate
metadata corruption, but it provides a stronger safety guarantee by
preventing further writes to potentially damaged images.
Once set, the image can only be opened read-only until repaired with
qemu-img check -r.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add comprehensive tests for the corrupt bit handling. Cover writable
rejection, read-only access, persistence, and dirty bit
coexistence.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement proper handling of the QCOW2 corrupt bit (incompatible feature
bit 1) according to the specification:
- Add Error::CorruptImage for rejecting writable opens of corrupt images
- Add CORRUPT to SUPPORTED features (handled specially, not rejected)
- Add QcowHeader::set_corrupt_bit() to mark images as corrupt
- Add QcowHeader::is_corrupt() helper method
- Reject writable opens of corrupt images with Error::CorruptImage
- Allow readonly opens of corrupt images with a warning
The corrupt bit indicates that image metadata may be inconsistent. Per
spec, such images must not be written to until repaired by external
tools like qemu-img. Read-only access is permitted to allow data
recovery.
Users can open corrupt images read-only using:
--disk path=/path/to/image.qcow2,readonly=on
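The open-time policy can be sketched as follows; bit 1 of incompatible_features is the corrupt bit per the QCOW2 spec, while the error type and function names here are illustrative:

```rust
// Sketch of the corrupt-bit check when opening an image.
const INCOMPAT_CORRUPT: u64 = 1 << 1;

fn is_corrupt(incompatible_features: u64) -> bool {
    incompatible_features & INCOMPAT_CORRUPT != 0
}

fn check_open(incompatible_features: u64, readonly: bool) -> Result<(), &'static str> {
    if is_corrupt(incompatible_features) {
        if !readonly {
            // Writable opens of corrupt images are rejected.
            return Err("corrupt image: open read-only or repair with qemu-img");
        }
        // Read-only opens proceed with a warning, for data recovery.
        eprintln!("warning: opening corrupt image read-only");
    }
    Ok(())
}

fn main() {
    assert!(check_open(INCOMPAT_CORRUPT, false).is_err());
    assert!(check_open(INCOMPAT_CORRUPT, true).is_ok());
    assert!(check_open(0, false).is_ok());
    println!("ok");
}
```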
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Update QcowHeader and other related places to use BeUint methods
internally for reading/writing header fields.
This removes the byteorder dependency from mod.rs and consolidates
all big-endian file I/O through the shared BeUint trait.
Suggested-by: Rob Bradford <rbradford@rivosinc.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a read_be() method to the BeUint trait and make it pub(super)
so it can be used across the qcow module. Change BeUint::write_be()
to take Self instead of u64, providing type safety through TryFrom
conversion.
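A minimal sketch of such a trait, over in-memory readers/writers rather than the real file I/O: write_be taking Self means the field width is fixed by the integer type rather than checked at runtime.

```rust
// Sketch of a BeUint-style trait for big-endian header fields.
use std::io::{Cursor, Read, Write};

trait BeUint: Sized {
    fn read_be(r: &mut impl Read) -> std::io::Result<Self>;
    fn write_be(self, w: &mut impl Write) -> std::io::Result<()>;
}

macro_rules! impl_be_uint {
    ($($t:ty),*) => {$(
        impl BeUint for $t {
            fn read_be(r: &mut impl Read) -> std::io::Result<Self> {
                let mut buf = [0u8; std::mem::size_of::<$t>()];
                r.read_exact(&mut buf)?;
                Ok(<$t>::from_be_bytes(buf))
            }
            fn write_be(self, w: &mut impl Write) -> std::io::Result<()> {
                w.write_all(&self.to_be_bytes())
            }
        }
    )*};
}
impl_be_uint!(u16, u32, u64);

fn main() {
    let mut buf = Cursor::new(Vec::new());
    0x514649fb_u32.write_be(&mut buf).unwrap(); // QCOW magic "QFI\xfb"
    buf.set_position(0);
    assert_eq!(u32::read_be(&mut buf).unwrap(), 0x514649fb);
    println!("ok");
}
```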
Suggested-by: Rob Bradford <rbradford@rivosinc.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify dirty bit is set on open and cleared on close for v3 images.
Ensure v2 and read-only files are not affected. Update existing
tests.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for the dirty bit (bit 0 of incompatible_features) which
indicates the image was not closed cleanly. This improves data
integrity by allowing detection of potentially corrupted images.
On open:
- If dirty bit is already set, log a warning and trigger
refcount rebuild
- Set the dirty bit and write it to disk immediately
- Sync to ensure persistence before any writes
- Skip dirty bit and refcount rebuild for readonly files
On clean close:
- Clear the dirty bit in the header
- Write it to disk and sync
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Test all refcount_order values (0-6):
- Basic open for each width
- Write/read roundtrip
- Overwrite and multi-cluster allocation
- L2 cache eviction under memory pressure
- Sub-byte and byte-aligned max value handling
- Overflow error detection
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Reject refcount values exceeding the maximum for the image's
refcount_order. This prevents silent truncation when storing refcounts
in narrow widths (e.g. the 1-bit maximum is 1, the 4-bit maximum
is 15).
Return a RefcountOverflow error carrying the attempted value, the
maximum, and the bit width; it propagates to the guest as EINVAL.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
QCOW2 v3 specifies refcount_order 0-6 with
refcount_bits = 1 << refcount_order. Previously only 16-bit (order 4)
was supported.
Changes:
- RefcountBytes trait handles byte-aligned types (8/16/32/64-bit)
- Generic pack/unpack for sub-byte widths (1/2/4-bit)
- Function pointers for read/write selected at open time
- Internal refcount type widened from u16 to u64
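The sub-byte case can be sketched as follows, assuming entries with lower indices occupy the lower-order bits of each byte (matching QEMU's layout); the helper names are ours:

```rust
// Sketch of sub-byte refcount pack/unpack for refcount_order 0..=2
// (1-, 2- and 4-bit widths).
fn read_subbyte(block: &[u8], index: usize, order: u32) -> u64 {
    let bits = 1usize << order; // refcount_bits = 1 << refcount_order
    let per_byte = 8 / bits;    // entries packed per byte
    let shift = (index % per_byte) * bits;
    ((block[index / per_byte] >> shift) as u64) & ((1u64 << bits) - 1)
}

fn write_subbyte(block: &mut [u8], index: usize, order: u32, value: u64) {
    let bits = 1usize << order;
    let per_byte = 8 / bits;
    let shift = (index % per_byte) * bits;
    let mask = ((1u16 << bits) - 1) as u8;
    let b = &mut block[index / per_byte];
    *b = (*b & !(mask << shift)) | (((value as u8) & mask) << shift);
}

fn main() {
    let mut block = [0u8; 2];
    write_subbyte(&mut block, 3, 2, 0xA); // 4-bit entry 3: byte 1, high nibble
    assert_eq!(block[1], 0xA0);
    assert_eq!(read_subbyte(&block, 3, 2), 0xA);
    write_subbyte(&mut block, 5, 0, 1); // 1-bit entry 5: byte 0, bit 5
    assert_eq!(read_subbyte(&block, 5, 0), 1);
    assert_eq!(block[0], 0x20);
    println!("ok");
}
```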
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Parse the feature name table header extension to provide descriptive
error messages when unsupported incompatible features are detected.
Currently only the compression bit (bit 3, zstd) is supported.
This prevents opening qcow2 images with features that would cause
incorrect behavior or data corruption (e.g., dirty bit, corrupt bit,
external data file, extended L2 entries).
Feature names are resolved in the following order:
1. The image's feature name table header extension (if present)
2. Hardcoded fallback names for known features
3. Generic "unknown feature bit N" for undefined features
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Co-developed-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Implement read support for bit 0 in QCOW2 L2 table entries.
When this flag is set, the cluster reads as zeros without accessing
disk. This improves compatibility with QCOW2 images that use this
optimization.
According to the QCOW2 specification, bit 0 of the standard cluster
descriptor indicates that the cluster reads as zeros. Unlike an L2
entry of 0, which denotes a completely unallocated cluster, bit 0 can
also be set on an allocated cluster.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The write_to() function is used by test code to create qcow2 files for
testing. For v3 headers with extended header_size (>104), it needs to:
1. Write the mandatory compression_type field at bytes 104-111
2. Write the header extension end marker at the header_size offset
3. Seek to backing_file_offset before writing the backing file path
Additionally, create_for_size_and_path() must set backing_file_offset
to account for the 8 byte extension end marker in v3 files, so the
backing file path doesn't overwrite the extension area.
Add unit tests for read_header_extensions() covering backing format
parsing (raw/qcow2), unknown extensions, and error cases (invalid
formats, invalid UTF-8). These tests depend on the header writing fixes
to create properly formatted v3 test files.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for parsing QCOW v3 header extensions to read the
backing file format. The QCOW v3 spec allows optional header
extensions between the fixed header and the backing file name.
Implement read_header_extensions() to parse the extension area,
which starts at the header_size offset. For now it is only used to
read the backing file format; further extension processing is left to
follow-up changes.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for raw backing files in addition to qcow2 backing
files. This enables QCOW2 overlays to use raw images as their
backing store.
The backing file format is auto-detected when not specified,
using the existing detect_image_type() function.
Add backing_file_format field to QcowHeader to store the format
type, which will be populated from header extensions by a
subsequent patch.
Modify new_from_backing() to accept a backing_format parameter,
consolidating support for both raw and qcow2 backing files in a
single function. The backing_file_size parameter allows overlay
creation without opening the backing file multiple times.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for resize events for raw_async disks.
On-behalf-of: SAP thomas.prescher@sap.com
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Add basic infrastructure so resize events are
propagated to the underlying disk implementation.
On-behalf-of: SAP thomas.prescher@sap.com
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
This is a prerequisite for the bug fix in the following commit.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This better reflects the actual usage.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Refactor write_pointer_table to accept iterators instead of requiring
materialized vectors, eliminating temporary allocations in L1 table
sync operations.
Changes:
- Modified write_pointer_table() to take Iterator<Item = &T> and
dereference internally before passing owned values to the callback
- Added write_pointer_table_direct() convenience wrapper for cases
without value transformation
- Updated sync_caches() to use l1_table.iter() directly instead of
.get_values().iter().copied()
- Implemented Deref<Target = [T]> for VecCache to enable direct .iter()
Performance impact:
- Eliminates L1 table allocation during sync (~2KB per 100GB disk)
- L2 and refcount table writes already used slices, no change there
- Zero performance overhead: iterator dereferencing is equivalent to
.copied() and optimizes identically
The L1 sync previously collected entries into a Vec to apply the
OFLAG_COPIED flag. The new iterator+callback pattern computes this
on-the-fly, avoiding the allocation entirely.
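The iterator + callback pattern can be sketched as follows, writing into an in-memory buffer instead of the disk file; the function signature is illustrative:

```rust
// Sketch: entries are transformed on the fly (e.g. ORing in
// OFLAG_COPIED for the L1 sync) and serialized big-endian without
// materializing an intermediate Vec<u64>.
const OFLAG_COPIED: u64 = 1 << 63;

fn write_pointer_table<'a, I, F>(entries: I, transform: F) -> Vec<u8>
where
    I: Iterator<Item = &'a u64>,
    F: Fn(u64) -> u64,
{
    let mut out = Vec::new();
    for &entry in entries {
        out.extend_from_slice(&transform(entry).to_be_bytes());
    }
    out
}

fn main() {
    let l1_table = [0x10000u64, 0];
    let bytes = write_pointer_table(l1_table.iter(), |e| {
        if e != 0 { e | OFLAG_COPIED } else { 0 } // unallocated stays 0
    });
    assert_eq!(&bytes[..8], &(0x10000u64 | OFLAG_COPIED).to_be_bytes());
    assert_eq!(&bytes[8..], &[0u8; 8]);
    println!("ok");
}
```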
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add Deref<Target = [T]> implementation for VecCache<T> to allow direct
slice operations without explicitly calling get_values(). This enables
cleaner code patterns like cache.iter() instead
of cache.get_values().iter().
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
https://github.com/cloud-hypervisor/cloud-hypervisor/pull/7294 adjusted
the checks for read-only requests made to virtio-blk devices and started
rejecting VIRTIO_BLK_T_GET_ID requests. These requests do not perform
any writes and are needed in order to access device serials from within
the guest.
Signed-off-by: Connor Brewster <cbrewster@hey.com>
The BufWriter must be flushed explicitly so that errors are handled
properly. Without an explicit flush, errors during the implicit flush
on drop are ignored.
This is the same issue fixed for write_pointer_table
in commit 85556951a.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>