Unlike most virtio feature bits, VIRTIO_BLK_F_RO is not optional.
It indicates that the host is refusing to permit write operations, and
the guest must not be allowed to override it.
However, the block device currently does not enforce this. If the guest
does not negotiate VIRTIO_BLK_F_RO, the block device will think the
device is writable and forward write requests to the backend.
This is not a security problem right now because the backing device of a
read-only device is always opened read-only. The kernel will thus
reject the write operations with EBADF. If support is added for
receiving the backing device file descriptor via SCM_RIGHTS (#7704),
it will be possible to have a read-only block device backed by a
writable file descriptor. This would make the bug a genuine security
vulnerability.
Fix the bug by explicitly checking if VIRTIO_BLK_F_RO was offered but
not negotiated. In this case, log a warning and proceed as if the guest
did acknowledge the feature. This always indicates a guest driver bug.
Fixes: #7697
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
OVMF sends FLUSH requests to read-only virtio-block devices. Refusing
these requests prevents OVMF from accessing the EFI System Partition and
therefore makes VMs unable to boot. Accept these requests instead.
them.
Ignoring these requests is possible, but inconsistent with fsync(2)
which honors them.
Fixes: #7698
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Add supports_zero_flag() to DiskFile trait to indicate whether a disk
format can mark clusters/blocks as reading zeros without deallocating
storage.
QCOW2 supports this via the zero flag in L2 entries. VHDX also has
PAYLOAD_BLOCK_ZERO state for this, though it's not yet implemented in
cloud-hypervisor.
This enables DISCARD to be advertised even with sparse=false for formats
with zero-flag support, since they can mark regions as zeros (keeps
storage allocated) instead of requiring full deallocation.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add sparse boolean configuration option to DiskConfig with a default
value of true to control disk space allocation behavior.
When sparse is true, the disk uses sparse allocation where deallocated
blocks are returned to the filesystem, and the DISCARD feature is
advertised to the guest.
When sparse is false, disk space is kept fully allocated and DISCARD
is not advertised.
WRITE_ZEROES is always advertised when the backend supports it,
regardless of the sparse setting.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add capability query to DiskFile trait to check backend
support for sparse operations (punch hole, write zeroes,
discard). Only advertise VIRTIO_BLK_F_DISCARD and
VIRTIO_BLK_F_WRITE_ZEROES when the backend supports these
operations.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Instead of closing a file descriptor that belongs to the vhost-user
frontend, drop the vu_common_ctrl::VhostUserHandle and the
vhost::vhost_user::Frontend it contains. This causes the destructor to
drop the file descriptor.
This breaks the last DPDK test, so disable it. See #7689.
Fixes: #7163
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Since kernel commit 6693731487a8 ("vsock/virtio: Allocate nonlinear SKBs
for handling large transmit buffers"), a large vsock packet can be split
into multiple descriptors.
If we encounter such TX packets, pull the content into an owned buffer.
Fixes: #7672
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This adds some missing features that are useful. In particular it adds
VIRTIO_F_RING_INDIRECT_DESC which gives a performance improvement.
Signed-off-by: Rob Bradford <rbradford@meta.com>
Reported-by: Daniel Farina <daniel@ubicloud.com>
VsockPacket::hdr holds a raw pointer to the address of the VSock packet
header, which is in guest memory. It opens the door to double-fetch
(or TOCTOU) race conditions. Therefore, VSockPacket::hdr content can't
be trusted since it can be arbitrarily changed by the guest, at any
time.
To mitigate this, we can copy the header content to an array in VMM's
memory that the guest can't modify.
Signed-off-by: Thomas Leroy <thomas.leroy.mp@gmail.com>
Based upon the discussion and in
https://github.com/rust-vmm/vhost/issues/29#issue-830820820 and the QEMU
behaviour the get_config offset should be zero. This was not caught by
our integration tests as the vhost-user-blk backend as implemented in
this repository does not use the offset.
Fixes: #7615
Signed-off-by: Rob Bradford <rbradford@meta.com>
It should always succeed and is apparently implicitly called by libc or
some dependency somewhere.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Add basic infrastructure so resize events are
propagated to the underlying disk implementation.
On-behalf-of: SAP thomas.prescher@sap.com
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
This change is a prerequisite for live disk resizing. Before this
commit, the epoll-handler threads just got a copy of the sector
size which we cannot update during runtime.
On-behalf-of: SAP thomas.prescher@sap.com
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
The lock must always correspond to the physical size of the file,
everything else doesn't make sense.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This better reflects the actual usage.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Add fcntl to virtio_block_thread_rules to allow try_clone() on file
descriptors. The try_clone() method uses fcntl(fd, F_DUPFD_CLOEXEC)
to duplicate file descriptors, which is needed for efficient QCOW2
L1 table sync that avoids temporary allocations.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
https://github.com/cloud-hypervisor/cloud-hypervisor/pull/7294 adjusted
the checks for read-only requests made to virtio-blk devices and started
rejecting VIRTIO_BLK_T_GET_ID requests. These requests do not perform
any writes and are needed in order to access device serials from within
the guest.
Signed-off-by: Connor Brewster <cbrewster@hey.com>
This is a follow-up of [0].
# Advantages
- This saves dozens of unneeded clone()s across the whole code base
- Makes it much easier to reason about how parameters are used
(often we passed owned Arc/Rc versions without actually needing
ownership)
# Exceptions
For certain code paths, the alternatives would require awkward or overly
complex code, and in some cases the functions are the logical owners of
the values they take. In those cases, I've added
#[allow(clippy::needless_pass_by_value)].
This does not mean that one should not improve this in the future.
[0] 6a86c157af
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This helps to uncover expensive and needless clones in the code base.
For example, I prevented extensive clones in the snapshot path where
(nested) BTreeMap's have been cloned over and over again. Further,
the lint helps devs to much better reason about the ownership of
parameters.
All of these changes have been done manually with the necessary
caution. A few structs that are cheap to clone are now `copy` so that
this lint won't trigger for them.
I didn't enable the lint so far as it is a massive rabbit hole and
needs much more fixes. Nevertheless, it is very useful.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This removes cognitive load when reading if statements.
All changes were applied by clippy via `--fix`.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This commit is part of a series of similar commits.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
The granularity has significant implications in typical cloud
deployments with network storage. The Linux kernel will sync advisory
locks to network file systems, but these backends may have different
policies and handle locks differently. For example, Netapp speaks a NFS
API but will treat advisory OFD locks for the whole file as mandatory
locks, whereas byte-range locks for the whole file will remain
advisory [0].
As it is a valid use case to prevent multiple CHV instances from
accessing the same disk but disk management software (e.g., Cinder in
OpenStack) should be able to snapshot disks while VMs are running, we
need special control over the lock granularity. Therefore, it is a valid
use case to lock the whole byte range of a disk image without
technically locking the whole file - to get the best of both worlds.
This also brings CHVs behavior in line with QEMU [1].
Whole-file locks remain a valid use case and could be supported later.
This patch only provides the necessary groundwork; making it
configurable is out of scope for now.
[0] https://kb.netapp.com/on-prem/ontap/da/NAS/NAS-KBs/How_is_Mandatory_Locking_supported_for_NFSv4_on_ONTAP_9
[1] <qemu>/util/osdep.c::qemu_lock_fcntl()
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This should be guaranteed by GuestMemory and GuestMemoryRegion, but
those traits are currently safe, so add checks to guard against
incorrect implementations of them.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Callers of get_host_address_range() rely on it returning a pointer to at
least size bytes of memory. mem.get_host_address() is an overrideable
method of a safe trait, so it is better for safe code to not rely on its
correctness for safety. Instead, use mem.get_slice(), which returns a
VolatileSlice whose invariants guarantee that it points to a sufficient
amount of memory. If mem.check_range() succeeds but mem.get_slice()
returns a slice that is too small, this means that there is either a
logic error or a situation the code cannot support yet, so panic.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
This API passes a u64 to a kernel API that treats the u64 as a userspace
address. Therefore, it should be marked unsafe, but it currently is not
[1]. Wrap the call in an unsafe block to document that invariants must
be upheld to avoid undefined behavior. This causes a compiler warning,
so suppress the warning with #[allow(unused_unsafe)].
[1]: https://github.com/rust-vmm/vfio/issues/100
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
I believe that its only caller used it safely, but it is still better to
mark the code as unsafe. Also add additional validity checks.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
To ensure that struct sizes are the same on 32-bit and 64-bit, various
kernel APIs use __u64 (Rust u64) to represent userspace pointers.
Userspace is expected to cast pointers to __u64 before passing them to
the kernel, and cast kernel-provided __u64 to a pointer before using
them. However, various safe APIs in Cloud Hypervisor took
caller-provided u64 values and passed them to syscalls that interpret
them as userspace addresses. Therefore, passing bad u64 values would
cause memory disclosure or corruption.
Fix the bug by using usize and pointer types as appropriate. To make
soundness of the code easier to reason about, the PCI code gains a new
MmapRegion abstraction that ensures the validity of pointers. The rest
of the code already has an MmapRegion abstraction it can use. To avoid
having to reason about whether something is keeping the MmapRegion
alive, reference counting is added. MmapRegion cannot hold references
to other objects, so the reference counting cannot introduce cycles.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
This better aligns with the rest of the code and makes it clearer
that these tests can run "as is" in a normal hosted environments
without the special test environment.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
A major improvement to the developer experience of clippy in
Cloud Hypervisor.
1. Make `cargo clippy` just work with the same lints we use in CI
2. Simplify adding new lints
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
... by passing the slice along instead.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
This would be a good opportunity to optimize another pointless vector
away, but I don't have a good way to test this at the moment. But
maybe someone else gives it a shot.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
as of rust 1.90, writes to unix socket streams use send_with_flags
instead of write, so it uses a sendto syscall instead of write.
Signed-off-by: Matt Moriarity <matt@mattmoriarity.com>
Consuming `&Arc<T>` as argument is almost always an antipattern as it
hides whether the callee is going to take over (shared) ownership
(by .clone()) or not. Instead, it is better to consume `&dyn T` or
`Arc<dyn T>` to be more explicit. This commit cleans up the code.
The change is very mechanic and was very easy to implement across the
code base.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
The virtio spec defines the feature bit range and 4 x 32-bit pages.
However no features are currently defined with bits > 63. However Linux
has now started acking features in those higher pages. Since this is
harmless (we drop the acks to those feature pages) and quietly return 0
for available features in those higher pages this warning can be
removed.
Fixes: #7414
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Fix clippy warning `uninlined_format_args` reported by rustc rustc
1.89.0 (29483883e 2025-08-04).
```console
warning: variables can be used directly in the `format!` string
--> block/src/lib.rs:649:17
|
649 | info!("{} failed to create io_uring instance: {}", error_msg, e);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#uninlined_format_args
= note: `#[warn(clippy::uninlined_format_args)]` on by default
help: change this to
|
649 - info!("{} failed to create io_uring instance: {}", error_msg, e);
649 + info!("{error_msg} failed to create io_uring instance: {e}");
|
```
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
The function `read_local_stream_port` had no proper handling for
unexpected or incomplete input.
When the control socket of the VSOCK device was closed without sending
the expected `CONNECT <PORT>\n` statement completely, the thread
got stuck in an infinite loop as it attempted to read from a closed
socket over and over again which never returned any data.
This resulted in the thread responsible for `epoll` being completely
blocked. New VSOCK connections could not be established and existing
ones became defunct, effectively leading to a Denial of Service of
the entire VSOCK device.
The issue can be reproduced by opening a socket and immediately
closing it.
```
socat - UNIX-CONNECT:/socket.vsock
<Ctrl-C>
```
Instead of applying a quick fix by handling the `EPOLLHUP` event before
reading, the function is refactored to remove the error-prone `while`
loop and multiple `read`s.
Notably, we now check if the number of bytes read is zero, which occurs
when `event_set == EPOLLHUP | EPOLLIN`, indicating that the socket has
been closed by the client.
Additionally, the actual parsing code is now extracted into a dedicated
function that is tested.
Fixes: #6798
Signed-off-by: Maximilian Güntner <code@mguentner.de>
This bumps the MSRV to 1.88 (also, Rust edition 2024 is mandatory).
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This upgrades the Cargo workspace to Rust edition 2024 to keep the
code base clean and up to date.
The commit only contains the adjustments to the Cargo.toml files and
basic compiler error fixes. Also, this commit includes new SAFETY
comments as discussed in [1]. The changes were not automatically
fixed by `cargo fix --edition` but needed manual adjustments.
Apart from that, all formatting and clippy adjustments follow in
subsequent commits.
*
As only exception, workspace member net_gen sticks to edition 2021
for now as discussed in [0].
[0] https://github.com/cloud-hypervisor/cloud-hypervisor/pull/7295#discussion_r2310851041
[1] https://github.com/cloud-hypervisor/cloud-hypervisor/pull/7256#issuecomment-3271888674
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
These can differ between platforms, so it's better to use centralized
definitions of them. We can't currently do this for the KVM and VFIO
ioctls, because the corresponding crates don't publicly expose the
ioctl numbers.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Cache and batch IO requests after parsing all
items in the queue, improving performance—especially
for small block sizes—by reducing per-request overhead.
Introduced two methods in the AsyncIo trait for batch
submission, with implementation in the raw disk backend.
This method should be called during/after parsing all block IO requests
in the available queue. If the batch submission is not enabled, by
default it does the old way of submitting requests.
Signed-off-by: Bo Chen <bchen@crusoe.ai>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Instead of returning boolean return an struct of completion status
so that it can be cached for batch submission.
Signed-off-by: Bo Chen <bchen@crusoe.ai>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This patch changes the read-only check using acked features bit, which
will help to check more features.
Signed-off-by: Songqian Li <sionli@tencent.com>