Move UserspaceMapping to vm-device to avoid redefinition since
UserspaceMapping is used by both `virtio-devices` and `device`
crate.
Signed-off-by: Songqian Li <sionli@tencent.com>
Many of the workspace members in the Cloud-hypervisor workspace share
common dependencies. Making these workspace dependencies reduces
duplication and improves maintainability.
Signed-off-by: Oliver Anderson <oliver.anderson@cyberus-technology.de>
On-behalf-of: SAP oliver.anderson@sap.com
The read_exact() call was introduced in 82ac114b8 ("virtio-devices:
vsock: handle short read in muxer") to solve a crash when a connection
disconnected without sending any data, but it introduced a problem of
its own: because the socket is non-blocking, read_exact() may read
some data, then return ErrorKind::WouldBlock. In that case, the data
it read will be discarded. So for example if it read "CONNECT ",
and then nothing else was available to read yet, "CONNECT " would be
discarded, and so the next time this function was called, when epoll
triggered again for the socket, only the following data would end up
in command.buf, causing an error due to just a port number being an
invalid command.
Contrary to that commit message, this code was actually designed to
handle short reads just fine — in the case of a short read, it stores
the data it has read in command, and returns
Error::UnixRead(ErrorKind::WouldBlock), which is ignored by the
caller, and the function gets called again when there is more data to
read, building up command potentially over the course of several
reads. The only thing it didn't handle correctly, as far as I can
tell, was a 0-byte read, which happens when a client disconnects from
the socket without writing anything. All that's needed to fix this is
to avoid an invalid subtraction in that case, so this change reverts
82ac114b8, fixing the issue with partial commands being discarded, and
instead handles the 0-byte read by using slice::get, and treating an
empty command as an incomplete command, which of course it is.
Fixes: 82ac114b8 ("virtio-devices: vsock: handle short read in muxer")
Signed-off-by: Alyssa Ross <hi@alyssa.is>
This makes it possible to run cargo test just for the virtio-devices
crate (as long as either KVM or MSHV is specified).
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Recently vfio crates have moved to crates.io, thus we should start
consuming the crate from crates.io instead git url.
This results in better versioning instead of tracking some git commit
sha.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
The changes were mostly automatically applied using the Python
script mentioned in the first commit of this series.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Instead of exiting on IO errors, report the errors to the guest with
VIRTIO_BLK_S_IOERR. For example, the guest kernel will log something
similar to this if the nbd behind /dev/vdc is unexpectedly disconnected:
[ 166.033957] I/O error, dev vdc, sector 264 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 2
[ 166.035083] Aborting journal on device vdc-8.
[ 166.037307] Buffer I/O error on dev vdc, logical block 9, lost sync page write
[ 166.038471] JBD2: I/O error when updating journal superblock for vdc-8.
[...]
[ 174.234470] EXT4-fs (vdc): I/O error while writing superblock
In case the rootfs is not located on the affected block device, this
will not crash the guest.
Fixes: #6995
Signed-off-by: Gauthier Jolly <contact@gjolly.fr>
This was caught by the nightly compiler during cargo fuzz build.
error: lifetime flowing from input to output with different syntax can be confusing
--> /home/runner/work/cloud-hypervisor/cloud-hypervisor/hypervisor/src/arch/x86/emulator/mod.rs:493:26
|
493 | pub fn new(platform: &mut dyn PlatformEmulator<CpuState = T>) -> Emulator<T> {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ----------- the lifetime gets resolved as `'_`
| |
| this lifetime flows to the output
|
= note: `-D mismatched-lifetime-syntaxes` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(mismatched_lifetime_syntaxes)]`
help: one option is to remove the lifetime for references and use the anonymous lifetime for paths
|
493 | pub fn new(platform: &mut dyn PlatformEmulator<CpuState = T>) -> Emulator<'_, T> {
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
As almost every sub crate depends on thiserror, lets upgrade it to a
workspace dependency.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This streamlines the code base to follow best practices for
error handling in Rust: Each error struct implements
std::error::Error (most due via thiserror::Error derive macro)
and sets its source accordingly.
This allows future work that nicely prints the error chains,
for example.
So far, the convention is that each error prints its
sub error as part of its Display::fmt() impl.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This streamlines the code base to follow best practices for
error handling in Rust: Each error struct implements
std::error::Error (most due via thiserror::Error derive macro)
and sets its source accordingly.
This allows future work that nicely prints the error chains,
for example.
So far, the convention is that each error prints its
sub error as part of its Display::fmt() impl.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This streamlines the code base to follow best practices for
error handling in Rust: Each error struct implements
std::error::Error (most due via thiserror::Error derive macro)
and sets its source accordingly.
This allows future work that nicely prints the error chains,
for example.
So far, the convention is that each error prints its
sub error as part of its Display::fmt() impl.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Allow tap interfaces to be configured with an IPv6 address. The change
is fairly straightforward: we need to update the API types and CLI
parsing to accept either an IPv6 or IPv4 and then match on the IP
address type when the tap device is configured.
For IPv6 addresses, the netmask (prefix) must be provided at the same
time as the address itself (in the SIOCSIFADDR ioctl). They cannot be
configured separately. So we remove the separate "set_netmask" function
and convert "set_ip_addr" to also accept a netmask. For IPv4 addresses,
the IP address and netmask were already always set together, so this
should have no functional impact for users of IPv4 addresses.
Signed-off-by: Gregory Anders <ganders@cloudflare.com>
# What
This commit introduces file-based advisory locking for the files backing
up the block devices by using the fcntl() syscall with OFD locks. The
per-open-file-descriptor (OFD) locks are more robust than traditional
POSIX locks (F_SETLK) as they are not tied to process IDs and avoid
common issues in multithreaded or multi-fd scenarios [1]. Therefore,
we don't use `std::fs::File::try_lock()`, which is backed by F_SETLKW.
The locking mechanism is aware of the `readonly` property and allows
`n` readers or `1` writer (exclusive mode).
As the locks are advisory, multiple cloud-hypervisor processes can
prevent themselves from writing to the same file. However, this is not
a system-wide file-system level locking mechanism preventing to open()
a file.
The introduced new locking mechanism does not cover vhost-user devices.
# Why
To prevent misconfiguration and improve safety, it is good practice to
protect disk image files with a locking mechanism. Experience and common
best practices suggest that advisory locks are preferable over mandatory
locks due to better compatibility and fewer pitfalls (in fs space).
The introduced functionality is aligned with the approach taken by
QEMU [0], and is also recommended in [1].
# Implementation Details
We need to ensure that not only normal operation keeps working but also
state save/resume and live-migration. Especially for live migration,
it is crucial that the sender VMM releases the locks when the VM stops
so the receiver VMM can acquire them right after that.
Therefore, the locking and releasing happen directly on the block
device struct. The device manager knows all block devices and can
forward requests to these types.
Last but not least, this commit uses on explicit lock acquiring
but implicit lock releasing (FD close). It only explicitly releases
the locks where this integrates more smoothly into the existing
code.
# Testing
I tested
- normal operation
- state save/resume,
- device hot plugging,
- and live-migration
with read/shared and write/exclusive locks.
One can use the `fcntl-tool` to test if locks are actually acquired
or released [2].
# Links
[0] 825b96dbce/util/osdep.c (L266)
[1] https://apenwarr.ca/log/20101213
[2] https://crates.io/crates/fcntl-tool
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
The Memory Space Enable (MSE) bit from the COMMAND register in the
PCI configuration space controls whether a PCI device responds to memory
space accesses, e.g. read and write cycles to the device MMIO regions
defined by its BARs. The MSE bit is used by the device drivers to ensure
the correctness of BAR reprogramming. A common workflow is, the driver
first clears the MSE bit, then writes new values to the BAR registers,
and finally set the MSE bit to finish the BAR reprogramming.
This patch changes how we handle BAR reprogramming for all PCI
devices (e.g. virtio-pci, vfio, vfio-user, etc.), so that we follow the
same convention, e.g. moving PCI BARs when its MSE bit is set.
Note that some device drivers (such as edk2) only clear and set MSE once
while reprogramming multiple BARs of a single device. To support such
behavior, this patch adds support for multiple pending BAR reprogramming.
See: https://github.com/cloud-hypervisor/cloud-hypervisor/issues/7027#issuecomment-2853642959
Signed-off-by: Bo Chen <bchen@crusoe.ai>
A BAR reprogramming of a PCI device will only happen when the (guest)
kernel write to its PCI config space, e.g. the detection of bar
reprogramming (`detect_bar_repgraomming()`) can be embedded to the PCI
config space write (`write_config_register()`). It simplifies APIs
exposed by the `struct PciConfiguration` and `trait PciDevice`. It also
prepares for easier handling of pending bar reprogramming when the MSE
bit of the COMMAND register is not enabled at the time of changing BAR
registers.
See: https://github.com/cloud-hypervisor/cloud-hypervisor/issues/7027#issuecomment-2853642959
Signed-off-by: Bo Chen <bchen@crusoe.ai>
Rust has a new way of constructing other error and clippy complains if
we are still using the older way to construct error message. Thus,
migrate to the new approach suggested by the clippy.
Warning from beta compiler:
error: this can be `std::io::Error::other(_)`
--> block/src/vhdx/mod.rs:142:17
|
| / std::io::Error::new(
| | std::io::ErrorKind::Other,
| | format!("Failed to update VHDx header: {e}"),
| | )
| |_________________^
|
= help: for further information visit
https://rust-lang.github.io/rust-clippy/master/index.html#io_other_error
help: use `std::io::Error::other`
std::io::Error::other(
format!("Failed to update VHDx header: {e}"),
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
`serde_json` crate is referenced by multiple components, centralize it
to workspace to better manage this crate.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Currently, Cloud Hypervisor does not set a VIRTIO_IOMMU_F_INPUT_RANGE
feature bit for the VirtIO IOMMU device, which, according to spec[1],
means that the guest may use the whole 64-bit address space is for
IOMMU purposes:
>If the feature is not offered, virtual mappings span over the whole
>64-bit address space (start = 0, end = 0xffffffff ffffffff)
As far as I am aware, there are currently no host platforms on
the market capable of addressing the whole 64-bit address space.
For example, I am currently working with a host platform that reports
39-bit address space for IOMMU purposes:
>DMAR: Host address width 39
When running a VFIO pass-through guest on such a platform, NVIDIA
driver in guest gets DMA mapping failures when working with large data,
and this results in Cloud Hypervisor exiting with the following error:
>cloud-hypervisor: 1501.220535s: <__iommu>
>ERROR:virtio-devices/src/thread_helper.rs:53 -- Error running worker:
>HandleEvent(Failed to process request queue : ExternalMapping(Custom
>{ kind: Other, error: "failed to map memory for VFIO container, iova
>0x7fff00000000, gpa 0x24ce25000, size 0x1000: IommuDmaMap(Error(22))"
>}))
Passing "--platform iommu_address_width=39" to Cloud Hypervisor built
with this change fixes this.
[1]: https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/
virtio-v1.3-csd01.html#x1-5420006
Signed-off-by: Nikolay Edigaryev <edigaryev@gmail.com>
Instead of reinventing this mock infrastructure in the upcoming fuzzer,
reuse the one that is already available.
However this change makes Clippy complain that TestBackend and
TestContext don't implement Default. This is just test code, we can
suppress Clippy in this case.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
Remove an erroneous optimisation that used the page size mask to reduce
the range to iterate through on the set of mappings. This doesn't work
as the virtio-iommu ranges are larger than a single page. This may have
worked in the past when the mappings were limited to a single page.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
MSHV's SEV-SNP implementation calls ioeventfds whenever there is an
event.
This change removes the need frequent allocation and deallocation of a
vector, while at the same time makes sure other call sites are
unaffected.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This allows the guest to put in more than one segment per request. It
can improve the throughput of the system.
Introduce a new check to make sure the queue size configured by the user
is large enough to hold at least one segment.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
Replace `map_or()` on false condition with `is_some_and` to provide
better readability, as suggestted by v1.84.0-beta.1 `cargo clippy`.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Asserting on .is_ok()/.is_err() leads to hard to debug failures (as if
the test fails, it will only say "assertion failed: false". We replace
these with `.unwrap()`, which also prints the exact error variant that
was unexpectedly encountered (we can to this these days thanks to
efforts to implement Display and Debug for our error types). If the
assert!((...).is_ok()) was followed by an .unwrap() anyway, we just drop
the assert.
Inspired by and quoted from @roypat.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
By introducing `imports_granularity="Module"` format strategy,
effectively groups imports from the same module into one line or block,
improving maintainability and readability.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Historically the Cloud Hypervisor coding style has been to ensure that
all imports are ordered and placed in a single group. Unfortunately
cargo fmt has no support for ensuring that all imports are in a single
group so if whitespace lines were added as part of the import statements
then they would only be odered correctly in the group.
By adopting "group_imports="StdExternalCrate" we can enforce a style
where imports are placed in at most three groups for std, external
crates and the crate itself. Choosing a style enforceable by the tooling
reduces the reviewer burden.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Modify `Cargo.toml` in each member crate to follow the dependencies
specified in root `Cargo.toml` file.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Rebooting a VM fails with the following error when debug assertions
are enabled:
fatal runtime error: IO Safety violation: owned file descriptor already closed
This happens because FromRawFd::from_raw_fd is used on RawFds stored
in ConsoleInfo every time a VM begins to boot, so the second
time (after a reboot, or if the first attempt to boot via the API
failed), the fd will be closed. Until this assertion is hit, the code
is operating on either closed file descriptors, or new file
descriptors for something completely different. If debug assertions
are disabled, it will just continue doing this with unpredictable
results.
To fix this, and prevent the problem reocurring, ownership of the
console file descriptors needs to be properly tracked, using Rust's
type system, so this commit refactors the console code to do that.
The file descriptors are now passed around with reference counts, so
they won't be closed prematurely. The obvious way to do this would be
to just have each member of ConsoleInfo be an Arc<File>, but we need
to accomodate that serial console file descriptors can also be
sockets. We can't just store an OwnedFd and convert it when it's
used, because we only get a reference from the Arc, so we need to
store the descriptors as their concrete types in an enum. Since this
basically duplicates the ConsoleOutputMode enum from the config, the
ConsoleOutputMode enum is now not used past constructing the
ConsoleInfo.
So that ownership can be represented consistently, the debug console's
tty mode now uses its own stdout descriptor.
I'm still using .try_clone().unwrap() (i.e. dup()) to clone file
descriptors for Endpoint::FilePair and Endpoint::TtyPair, because I
assume there's a reason for them not just to hold a single file
descriptor.
I've also retained the existing behaviour of having serial manager
ignore the tty file descriptor passed to it (which is stdout), and
instead using stdin. It looks a lot weirder now, because it has to
explicitly indicate it's ignoring the fd with an underscore binding.
Fixes: 52eebaf6 ("vmm: refactor DeviceManager to use console_info")
Signed-off-by: Alyssa Ross <hi@alyssa.is>