Add support for live resizing QCOW2 images. This enables growing
the virtual size of a QCOW2 disk while the VM is running.
Key features:
- Growing the image automatically expands the L1 table if needed
- Shrinking is not supported
- Resizing for images with backing files is not supported
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Backing files (e.g. for QCOW2) interact badly with landlock since they
are not obvious from the initial VM configuration. Only enable their use
with an explicit option.
Signed-off-by: Rob Bradford <rbradford@meta.com>
Test that reads from an overlay at offsets beyond the backing file
return zeros. Cover reads within the backing range, beyond the backing
file, and spanning the boundary.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When an overlay QCOW2 image is larger than its backing file, reads
from offsets beyond the backing file virtual size would previously
fail with an I/O error.
The backing file virtual size is determined at open time and stored
for bounds checking during read operations:
- If the entire read is beyond the backing size, return all zeros
- If the read spans the boundary, read available data from backing and
fill the remainder with zeros
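The bounds handling above can be sketched as follows. This is a minimal
illustration, not the actual implementation: the backing image is modelled
as an in-memory slice, and read_backing_zero_fill() is a hypothetical name.

```rust
/// Read `buf.len()` bytes at `offset` from a backing image whose contents
/// are modelled by the `backing` slice (a stand-in for the real file).
/// Bytes past the end of the backing image read as zeros.
fn read_backing_zero_fill(backing: &[u8], offset: u64, buf: &mut [u8]) {
    let backing_size = backing.len() as u64;
    if offset >= backing_size {
        // The entire read is beyond the backing size: return all zeros.
        buf.fill(0);
        return;
    }
    // Number of bytes the backing image can actually provide.
    let available = ((backing_size - offset) as usize).min(buf.len());
    let start = offset as usize;
    buf[..available].copy_from_slice(&backing[start..start + available]);
    // The read spans the boundary: zero the remainder.
    buf[available..].fill(0);
}
```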
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Replace generic WritingHeader error with specific SyncingHeader
error for header fsync operations. This provides more precise
error reporting when syncing QCOW2 header changes to disk fails.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Wrap QcowFile in Arc<Mutex<>> to ensure thread safety when multiple
virtio queues access the same QCOW2 image concurrently.
Previously, each queue received its own QcowSync instance via
new_async_io() that shared the underlying QcowFile through Clone.
However, cloned QcowFile instances share internal mutable state
(L2 cache, reference counts, file seek position) without
synchronization, leading to data corruption under concurrent I/O.
This change serializes all QCOW2 operations through a mutex, which
ensures correctness at the cost of parallelism. A more performant
solution would require separating metadata locking from actual I/O
operations, tracked in #7560.
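A minimal sketch of the serialization pattern, assuming a stand-in
FakeImage type in place of QcowFile (all names below are hypothetical):
every queue thread takes the mutex before touching shared state.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for the shared mutable state of an image (hypothetical).
struct FakeImage {
    seek_pos: u64,
    writes: u64,
}

/// Spawn one thread per queue; each performs `per_queue` serialized writes.
/// Returns the total number of writes observed by the shared image.
fn concurrent_writes(queues: usize, per_queue: u64) -> u64 {
    let image = Arc::new(Mutex::new(FakeImage { seek_pos: 0, writes: 0 }));
    let handles: Vec<_> = (0..queues)
        .map(|_| {
            let image = Arc::clone(&image);
            thread::spawn(move || {
                for i in 0..per_queue {
                    // Seek and write happen under one lock, so no other
                    // queue can move the seek position in between.
                    let mut guard = image.lock().unwrap();
                    guard.seek_pos = i * 512;
                    guard.writes += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = image.lock().unwrap().writes;
    total
}
```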
Related: #7560
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The QCOW2 v3 autoclear_features field contains bits for features whose
metadata becomes invalid when the image is modified by software that
doesn't understand them. Defined bits:
- Bit 0: Bitmaps extension
- Bit 1: Raw external data
Cloud-hypervisor doesn't support bitmaps or external data files, so
all autoclear bits are cleared on writable open. This signals other
tools that these features' data may be stale.
Readonly opens preserve autoclear bits unchanged.
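The rule can be illustrated with a small sketch. The constant and function
names are assumptions made for this example only.

```rust
/// Autoclear bits as listed above (names are hypothetical).
const AUTOCLEAR_BITMAPS: u64 = 1 << 0;
const AUTOCLEAR_RAW_EXTERNAL_DATA: u64 = 1 << 1;

/// Value of autoclear_features to keep in the header after an open.
/// Writable opens clear every bit; readonly opens leave the field as-is.
fn autoclear_after_open(autoclear_features: u64, writable: bool) -> u64 {
    if writable {
        0
    } else {
        autoclear_features
    }
}
```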
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add tests for corrupt bit behavior during I/O operations.
- Unaligned L2 table address triggers corrupt bit on read
- Unaligned cluster address triggers corrupt bit on read and write
- Normal operations do not set the corrupt bit
- V2 images work correctly without feature bits
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Validate that L2 table offsets and refcount block offsets are cluster
aligned. Set the corrupt bit when unaligned offsets are detected, as
this indicates corrupted L1 or refcount table entries.
Validate that data cluster offsets from L2 entries are cluster aligned
during both reads and writes to existing clusters. Set the corrupt bit
when unaligned data cluster offsets are detected.
Prevent allocation of clusters at offset 0, which contains the QCOW2
header and should never be allocated. This catches corruption in the
available clusters list. Set the corrupt bit when this condition is
detected.
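A sketch of the checks described above, assuming a hypothetical helper that
records the corrupt state in a bool instead of updating the real header:

```rust
/// Returns true when `offset` is a multiple of the cluster size.
fn is_cluster_aligned(offset: u64, cluster_bits: u32) -> bool {
    let cluster_size = 1u64 << cluster_bits;
    (offset & (cluster_size - 1)) == 0
}

/// Hypothetical validation helper: rejects offset 0 (the header cluster)
/// and unaligned offsets, flagging the image as corrupt in both cases.
fn validate_cluster_offset(
    offset: u64,
    cluster_bits: u32,
    corrupt: &mut bool,
) -> Result<u64, &'static str> {
    if offset == 0 {
        *corrupt = true;
        return Err("cluster 0 holds the header");
    }
    if !is_cluster_aligned(offset, cluster_bits) {
        *corrupt = true;
        return Err("offset is not cluster aligned");
    }
    Ok(offset)
}
```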
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Set the QCOW2 corrupt bit when internal inconsistencies are detected
that indicate image metadata may be corrupted:
- Decompression decode failure, meaning compressed cluster data is
invalid
- Decompression size mismatch, where decompressed data doesn't match
expected cluster size
- Partial write after decompression, where L2 table was updated but
data cluster not fully written, leaving metadata inconsistent
- Invalid refcount index, where cluster address is outside valid
refcount table range, indicating a corrupted L2 entry
- Dirty L2 with zero L1 address, where L2 table is marked dirty but
L1 has no address for it
Note: Marking decompression failures as corrupt is more conservative
than QEMU, which returns EIO without setting the corrupt bit. This is
debatable since corrupted compressed data doesn't necessarily indicate
metadata corruption, but it provides a stronger safety guarantee by
preventing further writes to potentially damaged images.
Once set, the image can only be opened read-only until repaired with
qemu-img check -r.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add comprehensive tests for the corrupt bit handling. Cover writable
rejection, read-only access, persistence, and dirty bit
coexistence.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement proper handling of the QCOW2 corrupt bit (incompatible feature
bit 1) according to the specification:
- Add Error::CorruptImage for rejecting writable opens of corrupt images
- Add CORRUPT to SUPPORTED features (handled specially, not rejected)
- Add QcowHeader::set_corrupt_bit() to mark images as corrupt
- Add QcowHeader::is_corrupt() helper method
- Reject writable opens of corrupt images with Error::CorruptImage
- Allow readonly opens of corrupt images with a warning
The corrupt bit indicates that image metadata may be inconsistent. Per
spec, such images must not be written to until repaired by external
tools like qemu-img. Read-only access is permitted to allow data
recovery.
Users can open corrupt images read-only using:
--disk path=/path/to/image.qcow2,readonly=on
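The open-time policy can be sketched as below. OpenError and
check_corrupt_on_open() are hypothetical stand-ins for the real
Error::CorruptImage path, and the warning is only noted in a comment.

```rust
/// Incompatible feature bit 1 marks the image as corrupt.
const INCOMPAT_CORRUPT: u64 = 1 << 1;

#[derive(Debug, PartialEq)]
enum OpenError {
    CorruptImage,
}

/// Reject writable opens of corrupt images; allow readonly opens.
fn check_corrupt_on_open(incompatible_features: u64, writable: bool) -> Result<(), OpenError> {
    let corrupt = (incompatible_features & INCOMPAT_CORRUPT) != 0;
    if corrupt && writable {
        return Err(OpenError::CorruptImage);
    }
    // A readonly open of a corrupt image would log a warning here.
    Ok(())
}
```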
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Update QcowHeader and other related places to use BeUint methods
internally for reading/writing header fields.
This removes the byteorder dependency from mod.rs and consolidates
all big-endian file I/O through the shared BeUint trait.
Suggested-by: Rob Bradford <rbradford@rivosinc.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a read_be() method to the BeUint trait and make it pub(super)
so it can be used across the qcow module. Change BeUint::write_be()
to take Self instead of u64, providing type safety through TryFrom
conversion.
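For illustration, a trait of this shape could look as follows. This is a
sketch under assumptions: the signatures are guessed from the description
above and may differ from the real BeUint trait.

```rust
use std::io::{self, Read, Write};

/// Hypothetical shape of a big-endian integer I/O trait.
trait BeUint: Sized {
    fn read_be<R: Read>(reader: &mut R) -> io::Result<Self>;
    fn write_be<W: Write>(writer: &mut W, value: Self) -> io::Result<()>;
}

macro_rules! impl_be_uint {
    ($($t:ty),*) => {$(
        impl BeUint for $t {
            fn read_be<R: Read>(reader: &mut R) -> io::Result<Self> {
                let mut buf = [0u8; std::mem::size_of::<$t>()];
                reader.read_exact(&mut buf)?;
                Ok(<$t>::from_be_bytes(buf))
            }
            fn write_be<W: Write>(writer: &mut W, value: Self) -> io::Result<()> {
                // Taking Self means a u64 cannot be written where a u32
                // field is expected without an explicit conversion.
                writer.write_all(&value.to_be_bytes())
            }
        }
    )*};
}

impl_be_uint!(u16, u32, u64);
```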
Suggested-by: Rob Bradford <rbradford@rivosinc.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify dirty bit is set on open and cleared on close for v3 images.
Ensure v2 and read-only files are not affected. Update existing
tests.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for the dirty bit (bit 0 of incompatible_features) which
indicates the image was not closed cleanly. This improves data
integrity by allowing detection of potentially corrupted images.
On open:
- If dirty bit is already set, log a warning and trigger
refcount rebuild
- Set the dirty bit and write it to disk immediately
- Sync to ensure persistence before any writes
- Skip dirty bit and refcount rebuild for readonly files
On clean close:
- Clear the dirty bit in the header
- Write it to disk and sync
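The lifecycle above can be sketched as pure functions on the
incompatible_features value. Writing and syncing the header are left out,
and both function names are hypothetical.

```rust
/// Bit 0 of incompatible_features is the dirty bit.
const INCOMPAT_DIRTY: u64 = 1 << 0;

/// Returns the flags to persist on open and whether a refcount rebuild
/// is needed because the image was not closed cleanly.
fn dirty_on_open(incompatible_features: u64, readonly: bool) -> (u64, bool) {
    if readonly {
        // Readonly files are left untouched and are never rebuilt.
        return (incompatible_features, false);
    }
    let was_dirty = (incompatible_features & INCOMPAT_DIRTY) != 0;
    (incompatible_features | INCOMPAT_DIRTY, was_dirty)
}

/// Returns the flags to persist on a clean close.
fn dirty_on_clean_close(incompatible_features: u64) -> u64 {
    incompatible_features & !INCOMPAT_DIRTY
}
```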
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Test all refcount_order values (0-6):
- Basic open for each width
- Write/read roundtrip
- Overwrite and multi-cluster allocation
- L2 cache eviction under memory pressure
- Sub-byte and byte-aligned max value handling
- Overflow error detection
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Reject refcount values exceeding the maximum for the image's
refcount_order. This prevents silent truncation when storing
refcounts in narrow widths (e.g., 1-bit max is 1, 4-bit max is 15,
etc.).
Returns RefcountOverflow error with the attempted value, maximum,
and bit width. Propagates as EINVAL to the guest.
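A minimal sketch of the bounds check. The struct mirrors the fields named
above, but its exact layout and the helper names are assumptions.

```rust
#[derive(Debug, PartialEq)]
struct RefcountOverflow {
    value: u64,
    max: u64,
    refcount_bits: u32,
}

/// Largest refcount representable for a given refcount_order (0-6).
fn max_refcount(refcount_order: u32) -> u64 {
    let bits = 1u32 << refcount_order;
    if bits == 64 {
        u64::MAX
    } else {
        (1u64 << bits) - 1
    }
}

/// Reject values that would be truncated when stored.
fn check_refcount(value: u64, refcount_order: u32) -> Result<u64, RefcountOverflow> {
    let max = max_refcount(refcount_order);
    if value > max {
        return Err(RefcountOverflow {
            value,
            max,
            refcount_bits: 1 << refcount_order,
        });
    }
    Ok(value)
}
```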
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
QCOW2 v3 specifies refcount_order 0-6 with
refcount_bits = 1 << refcount_order. Previously only 16-bit (order 4)
was supported.
Changes:
- RefcountBytes trait handles byte-aligned types (8/16/32/64-bit)
- Generic pack/unpack for sub-byte widths (1/2/4-bit)
- Function pointers for read/write selected at open time
- Internal refcount type widened from u16 to u64
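The sub-byte packing can be sketched as below. The helper names are
hypothetical, and the least-significant-bits-first ordering within a byte
is an assumption chosen to match how QEMU lays out narrow refcounts.

```rust
/// Read the refcount at `index` from a table of `bits`-wide entries
/// (bits must be 1, 2 or 4).
fn get_sub_byte_refcount(table: &[u8], index: usize, bits: u32) -> u8 {
    debug_assert!(matches!(bits, 1 | 2 | 4));
    let per_byte = (8 / bits) as usize;
    let shift = bits * (index % per_byte) as u32;
    let mask = (1u8 << bits) - 1;
    (table[index / per_byte] >> shift) & mask
}

/// Store `value` at `index`, leaving neighbouring entries untouched.
fn set_sub_byte_refcount(table: &mut [u8], index: usize, bits: u32, value: u8) {
    debug_assert!(matches!(bits, 1 | 2 | 4));
    let per_byte = (8 / bits) as usize;
    let shift = bits * (index % per_byte) as u32;
    let mask = (1u8 << bits) - 1;
    let byte = &mut table[index / per_byte];
    *byte = (*byte & !(mask << shift)) | ((value & mask) << shift);
}
```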
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Parse the feature name table header extension to provide descriptive
error messages when unsupported incompatible features are detected.
Currently only the compression bit (bit 3, zstd) is supported.
This prevents opening qcow2 images with features that would cause
incorrect behavior or data corruption (e.g., dirty bit, corrupt bit,
external data file, extended L2 entries).
Feature names are resolved in the following order:
1. The image's feature name table header extension (if present)
2. Hardcoded fallback names for known features
3. Generic "unknown feature bit N" for undefined features
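The lookup order can be sketched as follows. The function name, the map
type, and the fallback strings are assumptions for this example; only the
three-step order is taken from the description above.

```rust
use std::collections::HashMap;

/// Resolve a display name for an incompatible feature bit.
/// `table` stands in for the parsed feature name table extension.
fn feature_name(bit: u8, table: &HashMap<u8, String>) -> String {
    // 1. Prefer the name supplied by the image itself.
    if let Some(name) = table.get(&bit) {
        return name.clone();
    }
    // 2. Fall back to hardcoded names for known features.
    match bit {
        0 => "dirty bit".to_string(),
        1 => "corrupt bit".to_string(),
        2 => "external data file".to_string(),
        3 => "compression type".to_string(),
        4 => "extended L2 entries".to_string(),
        // 3. Generic name for anything else.
        _ => format!("unknown feature bit {bit}"),
    }
}
```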
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Co-developed-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Implement read support for bit 0 in QCOW2 L2 table entries.
When this flag is set, the cluster reads as zeros without accessing
disk. This improves compatibility with QCOW2 images that use this
optimization.
According to the QCOW2 specification, bit 0 of the standard cluster
descriptor indicates that the cluster reads as zeros. Unlike
l2_entry == 0 indicating a completely unallocated entry, bit 0 can
be set on an allocated cluster.
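A sketch of the read-side decision, covering standard cluster descriptors
only (compressed entries are ignored here). The enum and function names
are hypothetical; the offset mask covers bits 9-55 of the entry.

```rust
/// Bit 0 of a standard cluster descriptor: the cluster reads as zeros.
const L2_ZERO_FLAG: u64 = 1 << 0;
/// Bits 9-55 hold the host cluster offset of a standard cluster.
const L2_OFFSET_MASK: u64 = 0x00ff_ffff_ffff_fe00;

#[derive(Debug, PartialEq)]
enum ReadSource {
    /// l2_entry == 0: nothing allocated, defer to backing file or zeros.
    Unallocated,
    /// Zero flag set: return zeros without touching the disk.
    Zero,
    /// Regular data cluster at the given host offset.
    Data(u64),
}

fn classify_l2_entry(l2_entry: u64) -> ReadSource {
    if l2_entry == 0 {
        ReadSource::Unallocated
    } else if (l2_entry & L2_ZERO_FLAG) != 0 {
        // The flag wins even if the entry also carries an allocation.
        ReadSource::Zero
    } else {
        ReadSource::Data(l2_entry & L2_OFFSET_MASK)
    }
}
```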
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The write_to() function is used by test code to create qcow2 files for
testing. For v3 headers with extended header_size (>104), it needs to:
1. Write the mandatory compression_type field at bytes 104-111
2. Write the header extension end marker at the header_size offset
3. Seek to backing_file_offset before writing the backing file path
Additionally, create_for_size_and_path() must set backing_file_offset
to account for the 8 byte extension end marker in v3 files, so the
backing file path doesn't overwrite the extension area.
Add unit tests for read_header_extensions() covering backing format
parsing (raw/qcow2), unknown extensions, and error cases (invalid
formats, invalid UTF-8). These tests depend on the header writing fixes
to create properly formatted v3 test files.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for parsing QCOW v3 header extensions to read the
backing file format. The QCOW v3 spec allows optional header
extensions between the fixed header and the backing file name.
Implement read_header_extensions() to parse the extension area,
which starts at the header_size offset. At the moment it is
used only to read the backing file format. Processing of further
extensions is left to follow-up implementations.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for raw backing files in addition to qcow2 backing
files. This enables QCOW2 overlays to use raw images as their
backing store.
The backing file format is auto-detected when not specified,
using the existing detect_image_type() function.
Add backing_file_format field to QcowHeader to store the format
type, which will be populated from header extensions by a
subsequent patch.
Modify new_from_backing() to accept a backing_format parameter,
consolidating support for both raw and qcow2 backing files in a
single function. The backing_file_size parameter allows overlay
creation without opening the backing file multiple times.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for resize events for raw_async disks.
On-behalf-of: SAP thomas.prescher@sap.com
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Add basic infrastructure so resize events are
propagated to the underlying disk implementation.
On-behalf-of: SAP thomas.prescher@sap.com
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
This is a pre-requisite for the bug fix in the following commit.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This better reflects the actual usage.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Refactor write_pointer_table to accept iterators instead of requiring
materialized vectors, eliminating temporary allocations in L1 table
sync operations.
Changes:
- Modified write_pointer_table() to take Iterator<Item = &T> and
dereference internally before passing owned values to the callback
- Added write_pointer_table_direct() convenience wrapper for cases
without value transformation
- Updated sync_caches() to use l1_table.iter() directly instead of
.get_values().iter().copied()
- Implemented Deref<Target = [T]> for VecCache to enable direct .iter()
Performance impact:
- Eliminates L1 table allocation during sync (~2KB per 100GB disk)
- L2 and refcount table writes already used slices, no change there
- Zero performance overhead: iterator dereferencing is equivalent to
.copied() and optimizes identically
The L1 sync previously collected entries into a Vec to apply the
OFLAG_COPIED flag. The new iterator+callback pattern computes this
on-the-fly, avoiding the allocation entirely.
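The iterator+callback pattern can be sketched as below. This is a
simplified, hypothetical version fixed to u64 entries; the real function
is generic and its exact signature may differ.

```rust
use std::io::{self, Write};

/// Write each table entry in big-endian order, applying `transform`
/// to the owned value just before it is written.
fn write_pointer_table<'a, W, I, F>(
    writer: &mut W,
    entries: I,
    mut transform: F,
) -> io::Result<()>
where
    W: Write,
    I: Iterator<Item = &'a u64>,
    F: FnMut(u64) -> u64,
{
    for &entry in entries {
        // No intermediate Vec: the flag is computed per entry on the fly.
        writer.write_all(&transform(entry).to_be_bytes())?;
    }
    writer.flush()
}
```

A caller would pass something like l1_table.iter() together with a closure
that ORs in the flag bit for non-zero entries.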
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add Deref<Target = [T]> implementation for VecCache<T> to allow direct
slice operations without explicitly calling get_values(). This enables
cleaner code patterns like cache.iter() instead
of cache.get_values().iter().
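A minimal sketch of the pattern, using a simplified VecCache with
hypothetical fields and constructor rather than the real type:

```rust
use std::ops::Deref;

/// Simplified stand-in for the real cache type.
struct VecCache<T> {
    values: Box<[T]>,
    dirty: bool,
}

impl<T> VecCache<T> {
    fn from_vec(values: Vec<T>) -> Self {
        VecCache { values: values.into_boxed_slice(), dirty: false }
    }

    fn get_values(&self) -> &[T] {
        &self.values
    }
}

impl<T> Deref for VecCache<T> {
    type Target = [T];

    // Slice methods such as iter(), len() and indexing become available
    // directly on the cache through auto-deref.
    fn deref(&self) -> &[T] {
        &self.values
    }
}
```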
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
https://github.com/cloud-hypervisor/cloud-hypervisor/pull/7294 adjusted
the checks for read-only requests made to virtio-blk devices and started
rejecting VIRTIO_BLK_T_GET_ID requests. These requests do not perform
any writes and are needed in order to access device serials from within
the guest.
Signed-off-by: Connor Brewster <cbrewster@hey.com>
The BufWriter must be flushed explicitly to handle errors
properly. Without explicit flush, errors during the implicit
drop flush are ignored.
This is the same issue fixed for write_pointer_table
in commit 85556951a.
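The difference can be shown with a small sketch. write_table() and
FailingSink are hypothetical names used only for this example; the sink
accepts no data, so the error only appears once the buffer is flushed.

```rust
use std::io::{self, BufWriter, Write};

/// Write entries through a BufWriter and flush explicitly, so that a
/// failing sink is reported instead of being swallowed on drop.
fn write_table<W: Write>(sink: W, entries: &[u64]) -> io::Result<()> {
    let mut writer = BufWriter::new(sink);
    for entry in entries {
        writer.write_all(&entry.to_be_bytes())?;
    }
    writer.flush()
}

/// A sink whose writes always fail, to make the error path visible.
struct FailingSink;

impl Write for FailingSink {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "disk full"))
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

Without the final flush() call, write_table(FailingSink, ..) would return
Ok(()) for small inputs, because the data never leaves the buffer before
the writer is dropped.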
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add set_cluster_refcount_track_freed() helper to consolidate the
common pattern of setting a cluster refcount and tracking freed
refblocks. This reduces code duplication and improves readability.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Freed clusters correctly have refcount=0. Remove the assertion that
expected no clusters with zero refcount, as it was validating the
buggy behavior.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When converting a compressed cluster to standard during write
operations, the old compressed cluster's refcount was never
decremented, causing `qemu-img check` to report leak warnings such as
`Leaked cluster X refcount=N reference=M`
Additionally, compressed data can span multiple physical clusters,
not just one. The compressed cluster address and size are encoded
in the L2 entry, and the data may cross cluster boundaries.
The proper handling is implemented as follows:
- Extract compressed cluster address and size before overwriting
L2 entry
- Identify all clusters occupied by the compressed data
- Decrement refcount for each cluster in the range
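Identifying the affected clusters can be sketched as below, assuming the
compressed address and byte size have already been extracted from the L2
entry. clusters_spanned() is a hypothetical helper name.

```rust
/// Return the start offset of every host cluster touched by `size`
/// bytes of compressed data beginning at `addr`.
fn clusters_spanned(addr: u64, size: u64, cluster_bits: u32) -> Vec<u64> {
    if size == 0 {
        return Vec::new();
    }
    let cluster_size = 1u64 << cluster_bits;
    let first = addr & !(cluster_size - 1);
    let last = (addr + size - 1) & !(cluster_size - 1);
    // Compressed data is not cluster aligned, so it may cross a boundary
    // and occupy more than one cluster.
    (first..=last).step_by(cluster_size as usize).collect()
}
```

Each returned offset would then have its refcount decremented once.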
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When a refcount block is evicted from cache and replaced with a new
one, the old refcount block cluster was added to unref_clusters but
its refcount was never decremented to 0 on disk. This left the cluster
with refcount=1 while no metadata referenced it, causing qemu-img check
to report errors such as
`Leaked cluster X refcount=1 reference=0`
This fix recursively calls set_cluster_refcount(freed_cluster, 0) to
properly decrement the freed refcount block's refcount on disk. The
recursion handles cascading replacements where freeing one refcount
block may trigger the replacement of another.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The OFLAG_COPIED bit (bit 63) indicates a cluster's refcount is exactly
1 and doesn't need copy-on-write. This bit must be set in L1 entries
when their referenced L2 clusters have refcount=1.
Previously, L1 entries were always written as raw addresses without the
OFLAG_COPIED bit, violating the QCOW2 specification and causing qemu-img
check to report errors like
`ERROR OFLAG_COPIED L2 cluster: l1_index=X .... refcount=1`
The implementation queries each L2 cluster's refcount in sync_caches()
and sets OFLAG_COPIED appropriately when writing the L1 table. This
ensures QCOW2 images are specification compliant and maintain correct
COW semantics to avoid data corruption.
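The flag computation can be sketched as follows. l1_entry_for_disk() is a
hypothetical name, and the refcount is passed in rather than queried from
the refcount table.

```rust
/// Bit 63: the referenced cluster has a refcount of exactly one.
const OFLAG_COPIED: u64 = 1 << 63;

/// Compute the on-disk L1 entry for an L2 table at `l2_addr` whose
/// cluster currently has the given `refcount`.
fn l1_entry_for_disk(l2_addr: u64, refcount: u64) -> u64 {
    if l2_addr != 0 && refcount == 1 {
        l2_addr | OFLAG_COPIED
    } else {
        // Unallocated entries and shared L2 tables keep the flag clear.
        l2_addr
    }
}
```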
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Previously the code relied on the implicit flush when BufWriter is
dropped. That is not safe.
Per BufWriter's documentation:
```
It is critical to call flush before BufWriter<W> is dropped. Though
dropping will attempt to flush the contents of the buffer, any errors
that happen in the process of dropping will be ignored. Calling flush
ensures that the buffer is empty and thus dropping will not even attempt
file operations.
```
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This helps to uncover expensive and needless clones in the code base.
For example, I prevented extensive clones in the snapshot path where
(nested) BTreeMaps were cloned over and over again. Further,
the lint helps developers reason much better about the ownership of
parameters.
All of these changes have been done manually with the necessary
caution. A few structs that are cheap to clone are now `Copy` so that
this lint won't trigger for them.
I haven't enabled the lint so far as it is a massive rabbit hole and
needs many more fixes. Nevertheless, it is very useful.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This removes cognitive load when reading if statements.
All changes were applied by clippy via `--fix`.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This commit is part of a series of similar commits.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Add support for reading and writing compressed clusters.
Support zlib and zstd compression.
L2 cache: store entire L2 entries, not only standard cluster addresses.
Read path. Offsets of compressed clusters cannot be determined,
therefore replace QcowFile.file_offset_read() with QcowFile.file_read().
This method reads the cluster, decompresses it if necessary and returns
the data to the caller.
Write path. QcowFile.file_offset_write(): since writing to compressed
clusters is not generally possible, allocate a new standard
(non-compressed) cluster if compressed L2 entry is encountered; then
decompress compressed cluster into new cluster; then return offset
inside new cluster to the caller. Processing of standard clusters is
not changed.
Signed-off-by: Eugene Korenevsky <ekorenevsky@aliyun.com>
The granularity has significant implications in typical cloud
deployments with network storage. The Linux kernel will sync advisory
locks to network file systems, but these backends may have different
policies and handle locks differently. For example, NetApp speaks an NFS
API but will treat advisory OFD locks for the whole file as mandatory
locks, whereas byte-range locks for the whole file will remain
advisory [0].
As it is a valid use case to prevent multiple CHV instances from
accessing the same disk but disk management software (e.g., Cinder in
OpenStack) should be able to snapshot disks while VMs are running, we
need special control over the lock granularity. Therefore, it is a valid
use case to lock the whole byte range of a disk image without
technically locking the whole file - to get the best of both worlds.
This also brings CHV's behavior in line with QEMU [1].
Whole-file locks remain a valid use case and could be supported later.
This patch only provides the necessary groundwork; making it
configurable is out of scope for now.
[0] https://kb.netapp.com/on-prem/ontap/da/NAS/NAS-KBs/How_is_Mandatory_Locking_supported_for_NFSv4_on_ONTAP_9
[1] <qemu>/util/osdep.c::qemu_lock_fcntl()
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This should be guaranteed by GuestMemory and GuestMemoryRegion, but
those traits are currently safe, so add checks to guard against
incorrect implementations of them.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>