When the host-side unix socket peer does shutdown(SHUT_WR), read()
returns 0 on the stream. Previously this was treated as a full
connection closure, sending VSOCK_OP_SHUTDOWN with both SHUTDOWN_RCV
and SHUTDOWN_SEND flags and arming the kill timer. This prevented the
guest from sending any further data back through the still-writable
socket.
Change LocalClosed to carry a bool indicating whether receiving from
the guest is also shut down. On stream EOF, only set SHUTDOWN_SEND
(host won't send anymore) and transition to LocalClosed(false),
keeping the connection alive for guest-to-host data flow. The
connection fully closes when the guest also sends SHUTDOWN with
SHUTDOWN_SEND, or sends RST.
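The half-close handling described above can be sketched roughly as
follows. This is a simplified illustration, not the actual connection
code: `ConnState`, `on_stream_eof`, `on_guest_shutdown` and
`is_fully_closed` are hypothetical names, and only the two flag values
are taken from the virtio vsock specification.

```rust
// Shutdown flag values from the virtio vsock specification.
const VSOCK_FLAGS_SHUTDOWN_RCV: u32 = 1;
const VSOCK_FLAGS_SHUTDOWN_SEND: u32 = 2;

/// Hypothetical, reduced connection state; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConnState {
    Established,
    /// The host has stopped sending. The bool records whether receiving
    /// from the guest is shut down as well.
    LocalClosed(bool),
    Killed,
}

/// The host stream returned EOF. Returns the new state and the flags to
/// put into VSOCK_OP_SHUTDOWN (0 means "send nothing").
fn on_stream_eof(state: ConnState) -> (ConnState, u32) {
    match state {
        // Only the host-to-guest direction is finished.
        ConnState::Established => (ConnState::LocalClosed(false), VSOCK_FLAGS_SHUTDOWN_SEND),
        other => (other, 0),
    }
}

/// The guest sent VSOCK_OP_SHUTDOWN carrying `flags`.
fn on_guest_shutdown(state: ConnState, flags: u32) -> ConnState {
    match state {
        ConnState::LocalClosed(false) if flags & VSOCK_FLAGS_SHUTDOWN_SEND != 0 => {
            ConnState::LocalClosed(true)
        }
        other => other,
    }
}

/// Both directions are done, so the connection can be torn down.
fn is_fully_closed(state: ConnState) -> bool {
    matches!(state, ConnState::LocalClosed(true) | ConnState::Killed)
}
```

With this shape, a host-side EOF alone leaves the connection in
LocalClosed(false), so guest-to-host data keeps flowing.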
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When hotplugging a net device via ch-remote add-net with a tap FD,
the virtio-net queue pair threads trigger seccomp violations for
syscalls 332 (statx), 8 (lseek), and 79 (getcwd). These are issued
by glibc/Rust std during tap FD setup within the thread.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add support for balloon stats tags 10-15, which are Linux kernel
extensions reporting VM memory pressure metrics: OOM kills, allocation
stalls, async/direct page scans, and async/direct page reclaims.
Also downgrade the balloon statistics API request log and the unknown
stats tag warning to debug level to reduce log noise.
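For illustration, a minimal sketch of decoding a stats buffer that
contains the new tags. The tag numbers follow Linux's virtio_balloon.h
and each entry is a packed le16 tag plus le64 value; `parse_stats` and
`tag_name` are hypothetical helpers, not the functions added here.

```rust
/// Size of one packed stats entry: le16 tag followed by le64 value.
const STAT_ENTRY_SIZE: usize = 10;

/// Names for the Linux extension tags 10-15; other tags are not covered
/// by this sketch.
fn tag_name(tag: u16) -> Option<&'static str> {
    match tag {
        10 => Some("oom_kill"),
        11 => Some("alloc_stall"),
        12 => Some("async_scan"),
        13 => Some("direct_scan"),
        14 => Some("async_reclaim"),
        15 => Some("direct_reclaim"),
        _ => None,
    }
}

/// Decode all complete (tag, value) entries; a trailing partial entry is
/// ignored.
fn parse_stats(buf: &[u8]) -> Vec<(u16, u64)> {
    buf.chunks_exact(STAT_ENTRY_SIZE)
        .map(|entry| {
            let tag = u16::from_le_bytes([entry[0], entry[1]]);
            let mut raw = [0u8; 8];
            raw.copy_from_slice(&entry[2..10]);
            (tag, u64::from_le_bytes(raw))
        })
        .collect()
}
```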
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement the virtio balloon statistics virtqueue per the VIRTIO spec,
exposing guest memory statistics through a new /vm.balloon-statistics
API endpoint.
Device layer: add stats virtqueue constants and types, wire up the
stats queue in the balloon handler, implement stats request/response
handlers, and expose a public request_balloon_statistics function.
VMM layer: plumb statistics through device manager, VM, and VMM,
add HTTP and D-Bus API endpoints, add ServiceUnavailable (503) to
api_client, add balloon-statistics subcommand to ch-remote, and add
the endpoint to the OpenAPI spec.
The response includes both guest-reported stats and host-known values
(balloon_actual_bytes, balloon_target_bytes, balloon_total_ram_bytes)
so consumers can understand balloon utilization without querying
/vm.info separately.
Also fixes pre-existing rustfmt issues and adds clippy+rustfmt to the
flake devShell.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The GPU frontend only offered transport-level features
(DEFAULT_VIRTIO_FEATURES) during vhost-user negotiation, causing all
GPU device feature bits (VIRGL, EDID, RESOURCE_BLOB, etc.) to be
zeroed out when intersected with the backend's features. The guest
driver never saw any GPU capabilities.
Add the virtio-gpu feature bit constants and include them in
avail_features, matching the pattern used by the blk and net frontends.
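A small sketch of why the device bits vanished. The GPU bit positions
come from the virtio-gpu specification; `TRANSPORT_FEATURES` is only a
stand-in for DEFAULT_VIRTIO_FEATURES (whose real value is not shown
here), and `negotiate` is a hypothetical helper.

```rust
const VIRTIO_F_VERSION_1: u32 = 32;
const VIRTIO_GPU_F_VIRGL: u32 = 0;
const VIRTIO_GPU_F_EDID: u32 = 1;
const VIRTIO_GPU_F_RESOURCE_BLOB: u32 = 3;

/// Assumed stand-in for the transport-level feature set.
const TRANSPORT_FEATURES: u64 = 1u64 << VIRTIO_F_VERSION_1;

/// Frontend feature set including the GPU device bits.
fn gpu_avail_features() -> u64 {
    TRANSPORT_FEATURES
        | 1u64 << VIRTIO_GPU_F_VIRGL
        | 1u64 << VIRTIO_GPU_F_EDID
        | 1u64 << VIRTIO_GPU_F_RESOURCE_BLOB
}

/// Only bits offered by both sides survive negotiation.
fn negotiate(frontend: u64, backend: u64) -> u64 {
    frontend & backend
}
```

If the frontend offers only `TRANSPORT_FEATURES`, the intersection can
never contain a GPU bit, whatever the backend supports.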
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The fcntl syscall was only allowed for virtio device threads in debug
builds (behind #[cfg(debug_assertions)]), causing seccomp violations
in release builds across rng, balloon, net, gpu, fs, and vsock threads.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a vhost-user frontend for virtio-gpu using the current vhost-user
protocol (GET_SHMEM_CONFIG=44, SHMEM_MAP/UNMAP=9/10, SHMEM=0x0020_0000).
The GPU device queries the backend for shared memory regions via
get_shmem_config(), allocates a PCI BAR for the shmem region, and
implements a BackendReqHandler that handles SHMEM_MAP/UNMAP requests
by mmapping file descriptors from the backend into the shared memory
region.
To support non-contiguous shared memory region IDs (needed for
VIRTIO_GPU_SHM_ID_HOST_VISIBLE=1), VirtioSharedMemoryList.region_list
is changed from Vec<VirtioSharedMemory> to BTreeMap<u8, VirtioSharedMemory>,
and the PCI BAR code uses the map key as the PCI capability shmid.
Includes full VMM plumbing: GpuConfig, --gpu CLI, API, device_manager,
seccomp rules, and hotplug support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Unlike most virtio feature bits, VIRTIO_BLK_F_RO is not optional.
It indicates that the host is refusing to permit write operations, and
the guest must not be allowed to override it.
However, the block device currently does not enforce this. If the guest
does not negotiate VIRTIO_BLK_F_RO, the block device will think the
device is writable and forward write requests to the backend.
This is not a security problem right now because the backing device of a
read-only device is always opened read-only. The kernel will thus
reject the write operations with EBADF. If support is added for
receiving the backing device file descriptor via SCM_RIGHTS (#7704),
it will be possible to have a read-only block device backed by a
writable file descriptor. This would make the bug a genuine security
vulnerability.
Fix the bug by explicitly checking if VIRTIO_BLK_F_RO was offered but
not negotiated. In this case, log a warning and proceed as if the guest
did acknowledge the feature. This always indicates a guest driver bug.
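The check amounts to something like the sketch below. The bit position
of VIRTIO_BLK_F_RO is from the virtio spec; `effective_read_only` is a
hypothetical name and the message text is illustrative only.

```rust
const VIRTIO_BLK_F_RO: u32 = 5;

/// Decide whether the device must be treated as read-only, based on the
/// features the host offered and the features the guest acknowledged.
fn effective_read_only(offered: u64, acked: u64) -> bool {
    let ro = 1u64 << VIRTIO_BLK_F_RO;
    let offered_ro = offered & ro != 0;
    if offered_ro && acked & ro == 0 {
        // Guest driver bug: the guest cannot opt out of read-only.
        eprintln!("guest did not acknowledge VIRTIO_BLK_F_RO, treating device as read-only");
    }
    offered_ro
}
```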
Fixes: #7697
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
OVMF sends FLUSH requests to read-only virtio-block devices. Refusing
these requests prevents OVMF from accessing the EFI System Partition and
therefore makes VMs unable to boot. Accept these requests instead.
Ignoring these requests is possible, but inconsistent with fsync(2)
which honors them.
Fixes: #7698
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Add supports_zero_flag() to DiskFile trait to indicate whether a disk
format can mark clusters/blocks as reading zeros without deallocating
storage.
QCOW2 supports this via the zero flag in L2 entries. VHDX also has
PAYLOAD_BLOCK_ZERO state for this, though it's not yet implemented in
cloud-hypervisor.
This enables DISCARD to be advertised even with sparse=false for formats
with zero-flag support, since they can mark regions as zeros (keeps
storage allocated) instead of requiring full deallocation.
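A rough sketch of the resulting advertisement logic, under stated
assumptions: the trait below is a reduced, hypothetical version of
DiskFile, `supports_sparse_operations` is an assumed name for the
capability query, and `sparse_features` is an illustrative helper. The
feature bit positions are from the virtio-blk spec.

```rust
const VIRTIO_BLK_F_DISCARD: u32 = 13;
const VIRTIO_BLK_F_WRITE_ZEROES: u32 = 14;

/// Reduced stand-in for the DiskFile trait.
trait DiskFile {
    /// Assumed name: backend can punch holes / write zeroes.
    fn supports_sparse_operations(&self) -> bool {
        false
    }
    /// Format can mark clusters as zero without deallocating storage.
    fn supports_zero_flag(&self) -> bool {
        false
    }
}

struct Raw;
struct Qcow2;

impl DiskFile for Raw {
    fn supports_sparse_operations(&self) -> bool {
        true
    }
}

impl DiskFile for Qcow2 {
    fn supports_sparse_operations(&self) -> bool {
        true
    }
    fn supports_zero_flag(&self) -> bool {
        true
    }
}

/// Feature bits to advertise for the given backend and `sparse` setting.
fn sparse_features(disk: &dyn DiskFile, sparse: bool) -> u64 {
    let mut features = 0u64;
    if disk.supports_sparse_operations() {
        // WRITE_ZEROES does not depend on the sparse setting.
        features |= 1u64 << VIRTIO_BLK_F_WRITE_ZEROES;
        // DISCARD needs either deallocation or the zero flag.
        if sparse || disk.supports_zero_flag() {
            features |= 1u64 << VIRTIO_BLK_F_DISCARD;
        }
    }
    features
}
```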
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add sparse boolean configuration option to DiskConfig with a default
value of true to control disk space allocation behavior.
When sparse is true, the disk uses sparse allocation where deallocated
blocks are returned to the filesystem, and the DISCARD feature is
advertised to the guest.
When sparse is false, disk space is kept fully allocated and DISCARD
is not advertised.
WRITE_ZEROES is always advertised when the backend supports it,
regardless of the sparse setting.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add capability query to DiskFile trait to check backend
support for sparse operations (punch hole, write zeroes,
discard). Only advertise VIRTIO_BLK_F_DISCARD and
VIRTIO_BLK_F_WRITE_ZEROES when the backend supports these
operations.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Instead of closing a file descriptor that belongs to the vhost-user
frontend, drop the vu_common_ctrl::VhostUserHandle and the
vhost::vhost_user::Frontend it contains. This causes the destructor to
drop the file descriptor.
This breaks the last DPDK test, so disable it. See #7689.
Fixes: #7163
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Since kernel commit 6693731487a8 ("vsock/virtio: Allocate nonlinear SKBs
for handling large transmit buffers"), a large vsock packet can be split
into multiple descriptors.
If we encounter such TX packets, pull the content into an owned buffer.
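Conceptually, pulling the content into an owned buffer looks like the
sketch below. `gather_payload` is a hypothetical helper that models each
descriptor as a plain byte slice; the real code reads from guest memory
through the descriptor chain.

```rust
/// Concatenate the payload segments of a split packet into one owned
/// buffer, preserving descriptor order.
fn gather_payload(segments: &[&[u8]]) -> Vec<u8> {
    let total: usize = segments.iter().map(|s| s.len()).sum();
    let mut buf = Vec::with_capacity(total);
    for segment in segments {
        buf.extend_from_slice(segment);
    }
    buf
}
```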
Fixes: #7672
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This adds some missing features that are useful. In particular it adds
VIRTIO_F_RING_INDIRECT_DESC which gives a performance improvement.
Signed-off-by: Rob Bradford <rbradford@meta.com>
Reported-by: Daniel Farina <daniel@ubicloud.com>
VsockPacket::hdr holds a raw pointer to the vsock packet header, which
is in guest memory. This opens the door to double-fetch (or TOCTOU) race
conditions: the content behind VsockPacket::hdr can't be trusted, since
the guest can change it arbitrarily at any time.
To mitigate this, we can copy the header content to an array in VMM's
memory that the guest can't modify.
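A minimal sketch of the idea: validate and copy the header once, then
parse only the private copy. Guest memory is modelled as a byte slice
for brevity, whereas the real code goes through guest memory accessors.
`snapshot_header` and `payload_len` are hypothetical names; the header
size (44 bytes) and the offset of the `len` field (24) follow the
virtio vsock header layout.

```rust
const VSOCK_HDR_SIZE: usize = 44;
const HDR_LEN_OFFSET: usize = 24;

/// Copy the packet header out of (modelled) guest memory. Returns None if
/// the header would not fit inside the region.
fn snapshot_header(guest_mem: &[u8], offset: usize) -> Option<[u8; VSOCK_HDR_SIZE]> {
    let end = offset.checked_add(VSOCK_HDR_SIZE)?;
    let src = guest_mem.get(offset..end)?;
    let mut local = [0u8; VSOCK_HDR_SIZE];
    local.copy_from_slice(src);
    Some(local)
}

/// Read the little-endian `len` field from the private copy, so later
/// guest writes cannot change the value between check and use.
fn payload_len(hdr: &[u8; VSOCK_HDR_SIZE]) -> u32 {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(&hdr[HDR_LEN_OFFSET..HDR_LEN_OFFSET + 4]);
    u32::from_le_bytes(raw)
}
```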
Signed-off-by: Thomas Leroy <thomas.leroy.mp@gmail.com>
Based upon the discussion in
https://github.com/rust-vmm/vhost/issues/29#issue-830820820 and the QEMU
behaviour, the get_config offset should be zero. This was not caught by
our integration tests as the vhost-user-blk backend as implemented in
this repository does not use the offset.
Fixes: #7615
Signed-off-by: Rob Bradford <rbradford@meta.com>
It should always succeed and is apparently implicitly called by libc or
some dependency somewhere.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Add basic infrastructure so resize events are
propagated to the underlying disk implementation.
On-behalf-of: SAP thomas.prescher@sap.com
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
This change is a prerequisite for live disk resizing. Before this
commit, the epoll-handler threads just got a copy of the sector
size which we cannot update during runtime.
On-behalf-of: SAP thomas.prescher@sap.com
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
The lock must always correspond to the physical size of the file;
anything else doesn't make sense.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This better reflects the actual usage.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Add fcntl to virtio_block_thread_rules to allow try_clone() on file
descriptors. The try_clone() method uses fcntl(fd, F_DUPFD_CLOEXEC)
to duplicate file descriptors, which is needed for efficient QCOW2
L1 table sync that avoids temporary allocations.
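For reference, a small sketch of the kind of call that needs the rule.
`write_via_clone` is a hypothetical helper, not the QCOW2 sync code; it
only shows that the duplicated descriptor refers to the same open file.

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};

/// Write `data` at offset 0 through a duplicate of `file`'s descriptor.
/// File::try_clone() is the call that ends up in fcntl().
fn write_via_clone(file: &File, data: &[u8]) -> io::Result<()> {
    let mut dup = file.try_clone()?;
    dup.seek(SeekFrom::Start(0))?;
    dup.write_all(data)?;
    dup.sync_all()
}
```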
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
https://github.com/cloud-hypervisor/cloud-hypervisor/pull/7294 adjusted
the checks for read-only requests made to virtio-blk devices and started
rejecting VIRTIO_BLK_T_GET_ID requests. These requests do not perform
any writes and are needed in order to access device serials from within
the guest.
Signed-off-by: Connor Brewster <cbrewster@hey.com>
This is a follow-up of [0].
# Advantages
- This saves dozens of unneeded clone()s across the whole code base
- Makes it much easier to reason about how parameters are used
(often we passed owned Arc/Rc versions without actually needing
ownership)
# Exceptions
For certain code paths, the alternatives would require awkward or overly
complex code, and in some cases the functions are the logical owners of
the values they take. In those cases, I've added
#[allow(clippy::needless_pass_by_value)].
This does not mean that one should not improve this in the future.
[0] 6a86c157af
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This helps to uncover expensive and needless clones in the code base.
For example, I prevented extensive clones in the snapshot path where
(nested) BTreeMaps were cloned over and over again. Further, the lint
helps developers reason much better about the ownership of parameters.
All of these changes have been done manually with the necessary
caution. A few structs that are cheap to clone are now `Copy` so that
this lint won't trigger for them.
I haven't enabled the lint so far as it is a massive rabbit hole and
needs many more fixes. Nevertheless, it is very useful.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This removes cognitive load when reading if statements.
All changes were applied by clippy via `--fix`.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This commit is part of a series of similar commits.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
The granularity has significant implications in typical cloud
deployments with network storage. The Linux kernel will sync advisory
locks to network file systems, but these backends may have different
policies and handle locks differently. For example, NetApp speaks an NFS
API but will treat advisory OFD locks for the whole file as mandatory
locks, whereas byte-range locks for the whole file will remain
advisory [0].
As it is a valid use case to prevent multiple CHV instances from
accessing the same disk but disk management software (e.g., Cinder in
OpenStack) should be able to snapshot disks while VMs are running, we
need special control over the lock granularity. Therefore, it is a valid
use case to lock the whole byte range of a disk image without
technically locking the whole file - to get the best of both worlds.
This also brings CHV's behavior in line with QEMU [1].
Whole-file locks remain a valid use case and could be supported later.
This patch only provides the necessary groundwork; making it
configurable is out of scope for now.
[0] https://kb.netapp.com/on-prem/ontap/da/NAS/NAS-KBs/How_is_Mandatory_Locking_supported_for_NFSv4_on_ONTAP_9
[1] <qemu>/util/osdep.c::qemu_lock_fcntl()
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This should be guaranteed by GuestMemory and GuestMemoryRegion, but
those traits are currently safe, so add checks to guard against
incorrect implementations of them.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Callers of get_host_address_range() rely on it returning a pointer to at
least size bytes of memory. mem.get_host_address() is an overrideable
method of a safe trait, so it is better for safe code to not rely on its
correctness for safety. Instead, use mem.get_slice(), which returns a
VolatileSlice whose invariants guarantee that it points to a sufficient
amount of memory. If mem.check_range() succeeds but mem.get_slice()
returns a slice that is too small, this means that there is either a
logic error or a situation the code cannot support yet, so panic.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
This API passes a u64 to a kernel API that treats the u64 as a userspace
address. Therefore, it should be marked unsafe, but it currently is not
[1]. Wrap the call in an unsafe block to document that invariants must
be upheld to avoid undefined behavior. This causes a compiler warning,
so suppress the warning with #[allow(unused_unsafe)].
[1]: https://github.com/rust-vmm/vfio/issues/100
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
I believe that its only caller used it safely, but it is still better to
mark the code as unsafe. Also add additional validity checks.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
To ensure that struct sizes are the same on 32-bit and 64-bit, various
kernel APIs use __u64 (Rust u64) to represent userspace pointers.
Userspace is expected to cast pointers to __u64 before passing them to
the kernel, and cast kernel-provided __u64 to a pointer before using
them. However, various safe APIs in Cloud Hypervisor took
caller-provided u64 values and passed them to syscalls that interpret
them as userspace addresses. Therefore, passing bad u64 values would
cause memory disclosure or corruption.
Fix the bug by using usize and pointer types as appropriate. To make
soundness of the code easier to reason about, the PCI code gains a new
MmapRegion abstraction that ensures the validity of pointers. The rest
of the code already has an MmapRegion abstraction it can use. To avoid
having to reason about whether something is keeping the MmapRegion
alive, reference counting is added. MmapRegion cannot hold references
to other objects, so the reference counting cannot introduce cycles.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
This better aligns with the rest of the code and makes it clearer
that these tests can run "as is" in a normal hosted environment
without the special test environment.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
A major improvement to the developer experience of clippy in
Cloud Hypervisor.
1. Make `cargo clippy` just work with the same lints we use in CI
2. Simplify adding new lints
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
... by passing the slice along instead.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
This would be a good opportunity to optimize another pointless vector
away, but I don't have a good way to test this at the moment. But
maybe someone else gives it a shot.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
As of Rust 1.90, writes to Unix socket streams use send_with_flags
instead of write, so they issue a sendto syscall instead of write.
Signed-off-by: Matt Moriarity <matt@mattmoriarity.com>
Consuming `&Arc<T>` as an argument is almost always an antipattern as it
hides whether the callee is going to take over (shared) ownership
(by .clone()) or not. Instead, it is better to consume `&dyn T` or
`Arc<dyn T>` to be more explicit. This commit cleans up the code.
The change is very mechanical and was very easy to implement across the
code base.
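An illustrative sketch of the two explicit signatures. `Device`, `Disk`,
`describe` and `Registry` are made-up names for this example and do not
exist in the code base.

```rust
use std::sync::Arc;

trait Device {
    fn name(&self) -> String;
}

struct Disk;

impl Device for Disk {
    fn name(&self) -> String {
        "disk".to_string()
    }
}

/// Only inspects the device, so it borrows the trait object. The caller
/// can see that no shared ownership is taken.
fn describe(dev: &dyn Device) -> String {
    format!("device: {}", dev.name())
}

struct Registry {
    devices: Vec<Arc<dyn Device>>,
}

impl Registry {
    /// Keeps the device around, so it takes an owned Arc. The clone (if
    /// any) happens visibly at the call site.
    fn register(&mut self, dev: Arc<dyn Device>) {
        self.devices.push(dev);
    }
}
```

With `&Arc<T>` in both signatures, a reader could not tell which of the
two functions bumps the reference count.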
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
The virtio spec defines the feature bit range as 4 x 32-bit pages. No
features are currently defined with bits > 63, but Linux has now started
acking features in those higher pages. Since this is harmless (we drop
the acks to those feature pages and quietly return 0 for available
features in those higher pages), this warning can be removed.
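The page handling can be pictured as in the sketch below, which assumes
the device keeps its features in a single u64. `features_page` and
`ack_features_page` are hypothetical helper names.

```rust
/// Return the 32-bit feature page selected by the driver. Pages 2 and 3
/// cover bits 64..127, which a u64 feature set cannot hold, so they read
/// as 0.
fn features_page(avail_features: u64, page: u32) -> u32 {
    match page {
        0 => avail_features as u32,
        1 => (avail_features >> 32) as u32,
        _ => 0,
    }
}

/// Record the driver's ack for one feature page. Acks for pages above 1
/// are dropped silently.
fn ack_features_page(acked: &mut u64, page: u32, value: u32) {
    match page {
        0 => *acked |= u64::from(value),
        1 => *acked |= u64::from(value) << 32,
        _ => {}
    }
}
```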
Fixes: #7414
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Fix clippy warning `uninlined_format_args` reported by rustc
1.89.0 (29483883e 2025-08-04).
```console
warning: variables can be used directly in the `format!` string
   --> block/src/lib.rs:649:17
    |
649 |                 info!("{} failed to create io_uring instance: {}", error_msg, e);
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#uninlined_format_args
    = note: `#[warn(clippy::uninlined_format_args)]` on by default
help: change this to
    |
649 -                 info!("{} failed to create io_uring instance: {}", error_msg, e);
649 +                 info!("{error_msg} failed to create io_uring instance: {e}");
    |
```
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>