Support ACPI Generic Initiator Affinity to associate
PCI devices with NUMA proximity domains
Add GenericInitiatorAffinity struct
Add from_pci_bdf() to encode PCI Segment:Bus:Device.Function
Add from_acpi_device() for ACPI device handles (future use)
Generate SRAT Type 5 entries for nodes with device_id
Improve create_slit_table() to check distance symmetry when
forward distance is missing
Track device ID to BDF mappings in DeviceManager
Includes comprehensive unit tests
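As a rough illustration of what encoding a PCI Segment:Bus:Device.Function
address involves, the sketch below packs the fields using the conventional
16-bit BDF layout (bus in bits 15:8, device in bits 7:3, function in bits
2:0). The `PciBdf` type and its methods are hypothetical and are not the
types added by this commit; the exact byte layout of the SRAT device handle
should be taken from the ACPI specification.

```rust
/// Hypothetical helper type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PciBdf {
    pub segment: u16,
    pub bus: u8,
    pub device: u8,   // 5 bits: 0..=31
    pub function: u8, // 3 bits: 0..=7
}

impl PciBdf {
    /// Returns `None` if `device` or `function` does not fit its bit field.
    pub fn new(segment: u16, bus: u8, device: u8, function: u8) -> Option<Self> {
        if device > 0x1f || function > 0x7 {
            return None;
        }
        Some(Self { segment, bus, device, function })
    }

    /// Conventional 16-bit packing: bus[15:8] | device[7:3] | function[2:0].
    pub fn bdf(&self) -> u16 {
        (u16::from(self.bus) << 8) | (u16::from(self.device) << 3) | u16::from(self.function)
    }
}
```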
Signed-off-by: Saravanan D <saravanand@crusoe.ai>
TDX builds its own ACPI tables in `create_acpi_tables_tdx`, so the
standard `create_acpi_tables` function returns None for TDX guests and
the assertion for `rsdp_addr` fails.
Signed-off-by: Zhibin Li <banlu.lzb@antgroup.com>
Recent changes related to arm64 support in MSHV exposed
inconsistencies in the VM initialization and CVM boot paths.
The VM creation flow currently diverges across multiple scenarios,
including regular MSHV, CVM, and arm64, with each path performing
guest initialization steps in a different order.
Certain platform-specific requirements further constrain the ordering
of operations, such as the timing of address space creation,
IGVM loading, interrupt controller setup, and payload loading. In the
CVM case, address space creation must happen after IGVM loading and
PSP measurement. For regular and arm64 guests, this memory initialization
must happen early. On MSHV, vm.init() and sev_snp.init() are called in
a different order depending on run-time and build-time conditional checks.
Additionally, while the KVM initialization path differs slightly
from MSHV, it shares common logic that is currently split across
separate conditional and build-time code paths, contributing to
fragmentation of the overall flow.
This change restructures the VM creation and initialization sequence
to better align shared logic, enforce scenario-specific ordering
constraints, and ensure consistent and correct behavior across all
supported configurations. In doing so, it restores proper CVM boot
behavior and improves the maintainability of the initialization code.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
On MSHV, exposing multithreaded CPU topologies requires setting the
PROCESSORS_PER_SOCKET partition property so that CPUID.0xB reports
correct logical processor counts and topology levels to the guest.
This property must be set after all vCPUs are configured, as the
hypervisor uses the complete vCPU layout to derive and report CPU
topology information.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add basic infrastructure so resize events are
propagated to the underlying disk implementation.
On-behalf-of: SAP thomas.prescher@sap.com
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
In [0] we refactored some Arc<Mutex<T>> parameters to &Mutex<T> to
satisfy clippy's needless_pass_by_value lint. Nevertheless, this is also
not very idiomatic, so as a follow-up, we move the responsibility for
locking objects to the caller (wherever the callee does not strictly
need to lock them itself).
While on it, I also tried to pass vm_config directly into
pre_create_console_devices() which would clean up some code, but then
we have interleaving mutable and immutable borrows of the Vmm, which
are denied by the borrow checker.
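A minimal sketch of the pattern, using a hypothetical `ConsoleConfig` type
and hypothetical function names (none of them are taken from the code base):
the callee takes a plain reference and the caller decides when to lock.

```rust
use std::sync::Mutex;

/// Hypothetical type for illustration only.
#[derive(Debug)]
struct ConsoleConfig {
    rows: u16,
    cols: u16,
}

// Before: the callee receives the mutex and locks it internally, although
// it only needs read access to the value.
fn describe_locking(config: &Mutex<ConsoleConfig>) -> String {
    let config = config.lock().unwrap();
    format!("{}x{}", config.cols, config.rows)
}

// After: the callee takes a plain reference; locking is the caller's job.
fn describe(config: &ConsoleConfig) -> String {
    format!("{}x{}", config.cols, config.rows)
}
```

A caller holding a `Mutex<ConsoleConfig>` would then write
`describe(&shared.lock().unwrap())`.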
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Create HypervisorVmConfig early and pass the
struct to the VM creation API in the vmm crate. This gets
rid of multiple conditional parameters.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
... and nuke some Option<> while I was there. Given that HashMap has a
usable default and we end up passing an empty HashMap anyway, just get
rid of the Option.
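The idea can be sketched as follows; the function names and the map's key
and value types are hypothetical and only serve the illustration.

```rust
use std::collections::HashMap;

// Before: callers and callee both have to handle `None` and "empty".
fn count_before(ids: Option<&HashMap<String, u32>>) -> usize {
    ids.map_or(0, |m| m.len())
}

// After: an empty map already expresses "nothing configured".
fn count_after(ids: &HashMap<String, u32>) -> usize {
    ids.len()
}
```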
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
This is a follow-up of [0].
# Advantages
- This saves dozens of unneeded clone()s across the whole code base
- Makes it much easier to reason about how parameters are used
(often we passed owned Arc/Rc versions without actually needing
ownership)
# Exceptions
For certain code paths, the alternatives would require awkward or overly
complex code, and in some cases the functions are the logical owners of
the values they take. In those cases, I've added
#[allow(clippy::needless_pass_by_value)].
This does not mean that one should not improve this in the future.
[0] 6a86c157af
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This helps to uncover expensive and needless clones in the code base.
For example, I prevented extensive clones in the snapshot path where
(nested) BTreeMap's have been cloned over and over again. Further,
the lint helps devs to much better reason about the ownership of
parameters.
All of these changes have been done manually with the necessary
caution. A few structs that are cheap to clone are now `Copy` so that
this lint won't trigger for them.
I didn't enable the lint so far as it is a massive rabbit hole and
needs much more fixes. Nevertheless, it is very useful.
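A small sketch of the two kinds of changes described above, assuming the
lint in question is clippy's needless_pass_by_value. All type and function
names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Small and cheap to copy, so deriving `Copy` keeps pass-by-value fine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SlotId(u32);

// Flagged by the lint: the map is only read, yet every caller has to give
// up ownership or clone the whole (possibly nested) map.
fn total_by_value(sections: BTreeMap<String, Vec<u8>>) -> usize {
    sections.values().map(Vec::len).sum()
}

// Preferred: borrow the map, no clone needed at the call site.
fn total_by_ref(sections: &BTreeMap<String, Vec<u8>>) -> usize {
    sections.values().map(Vec::len).sum()
}

// `Copy` types can still be taken by value without triggering the lint.
fn next_slot(slot: SlotId) -> SlotId {
    SlotId(slot.0 + 1)
}
```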
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This commit is part of a series of similar commits.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
It takes a pointer to a userspace address that it accesses, so it should
be marked unsafe. This was missed earlier.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Also drop support for building the TDX code for 32-bit targets. All
CPUs with TDX support are 64-bit so supporting 32-bit targets is not
needed.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
To ensure that struct sizes are the same on 32-bit and 64-bit, various
kernel APIs use __u64 (Rust u64) to represent userspace pointers.
Userspace is expected to cast pointers to __u64 before passing them to
the kernel, and cast kernel-provided __u64 to a pointer before using
them. However, various safe APIs in Cloud Hypervisor took
caller-provided u64 values and passed them to syscalls that interpret
them as userspace addresses. Therefore, passing bad u64 values would
cause memory disclosure or corruption.
Fix the bug by using usize and pointer types as appropriate. To make
soundness of the code easier to reason about, the PCI code gains a new
MmapRegion abstraction that ensures the validity of pointers. The rest
of the code already has an MmapRegion abstraction it can use. To avoid
having to reason about whether something is keeping the MmapRegion
alive, reference counting is added. MmapRegion cannot hold references
to other objects, so the reference counting cannot introduce cycles.
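The underlying problem can be sketched like this. `RegionArg`,
`region_from_slice()` and `region_from_raw()` are hypothetical stand-ins,
not the types or functions touched by this commit; in real code the buffer
must also stay alive for as long as the kernel uses the address, which is
what an abstraction such as MmapRegion is for.

```rust
/// Hypothetical stand-in for a kernel ABI struct that carries a userspace
/// pointer as a 64-bit integer.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct RegionArg {
    userspace_addr: u64,
    size: u64,
}

/// The address is derived from a live buffer, so callers cannot smuggle in
/// an arbitrary integer.
fn region_from_slice(buf: &mut [u8]) -> RegionArg {
    RegionArg {
        userspace_addr: buf.as_mut_ptr() as usize as u64,
        size: buf.len() as u64,
    }
}

/// # Safety
/// `addr` must point to `size` bytes that remain valid for as long as the
/// kernel may access them.
unsafe fn region_from_raw(addr: *mut u8, size: usize) -> RegionArg {
    RegionArg {
        userspace_addr: addr as usize as u64,
        size: size as u64,
    }
}
```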
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
To ensure that struct sizes are the same on 32-bit and 64-bit, various
kernel APIs use __u64 (Rust u64) to represent userspace pointers.
Userspace is expected to cast pointers to __u64 before passing them to
the kernel, and cast kernel-provided __u64 to a pointer before using
them. However, various safe APIs in Cloud Hypervisor took
caller-provided u64 values and passed them to syscalls that treat them
as userspace addresses. Therefore, passing bad u64 values would cause
memory disclosure or corruption. The memory region APIs are one example
of this, so mark them as unsafe.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
This better aligns with the rest of the code and makes it clearer
that these tests can run "as is" in a normal hosted environment
without the special test environment.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Consuming `&Arc<T>` as argument is almost always an antipattern as it
hides whether the callee is going to take over (shared) ownership
(by .clone()) or not. Instead, it is better to consume `&dyn T` or
`Arc<dyn T>` to be more explicit. This commit cleans up the code.
The change is very mechanical and was very easy to implement across the
code base.
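A minimal sketch of the difference, with a hypothetical `Device` trait and
hypothetical function names:

```rust
use std::sync::Arc;

trait Device {
    fn name(&self) -> String;
}

struct Serial;

impl Device for Serial {
    fn name(&self) -> String {
        "serial".to_string()
    }
}

// Unclear: will the callee clone the Arc and keep it, or only read from it?
fn name_unclear(dev: &Arc<dyn Device>) -> String {
    dev.name()
}

// Explicit: the callee only borrows the device.
fn name_borrowed(dev: &dyn Device) -> String {
    dev.name()
}

struct Bus {
    devices: Vec<Arc<dyn Device>>,
}

impl Bus {
    // Explicit: the callee takes over (shared) ownership.
    fn attach(&mut self, dev: Arc<dyn Device>) {
        self.devices.push(dev);
    }
}
```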
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This was added in 7be69edf51 to deal with
changes to the KVM bindings that made run() and set_immediate_exit()
take &mut self. Instead adopt a Box<> value in Vcpu allowing the removal
of this internal Mutex.
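The general idea, sketched with hypothetical `VcpuFd`, `FakeFd` and `Vcpu`
types that are not the real hypervisor types: an exclusively owned Box<>
gives `&mut` access directly, so no Mutex is needed to call methods that
take `&mut self`.

```rust
/// Hypothetical trait whose methods require exclusive access.
trait VcpuFd {
    fn run(&mut self) -> u32;
}

struct FakeFd {
    exits: u32,
}

impl VcpuFd for FakeFd {
    fn run(&mut self) -> u32 {
        self.exits += 1;
        self.exits
    }
}

/// The wrapper owns the handle exclusively through a Box<>.
struct Vcpu {
    fd: Box<dyn VcpuFd>,
}

impl Vcpu {
    // `&mut self` on the wrapper is enough to reach `&mut` on the handle.
    fn run(&mut self) -> u32 {
        self.fd.run()
    }
}
```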
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Fix clippy warning `uninlined_format_args` reported by rustc
1.89.0 (29483883e 2025-08-04).
```console
warning: variables can be used directly in the `format!` string
--> block/src/lib.rs:649:17
|
649 | info!("{} failed to create io_uring instance: {}", error_msg, e);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#uninlined_format_args
= note: `#[warn(clippy::uninlined_format_args)]` on by default
help: change this to
|
649 - info!("{} failed to create io_uring instance: {}", error_msg, e);
649 + info!("{error_msg} failed to create io_uring instance: {e}");
|
```
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
On MSHV, customers don't want everything to be set to defaults during
partition creation. For example, nested support and some synthetic
features could be controlled from the CLI through the platform
argument. The create_vm API is getting messy as more flags are added.
This patch introduces a common data struct that is passed
from the vmm crate to the hypervisor crate during partition creation.
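A sketch of the approach with made-up names; `VmCreateConfig`, its fields
and `create_vm()` below are hypothetical and do not reflect the actual
struct or API introduced by this patch.

```rust
/// Hypothetical config struct; new knobs become new fields instead of new
/// function parameters.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct VmCreateConfig {
    nested: bool,
    synthetic_timers: bool,
}

/// Hypothetical creation function taking one struct instead of many flags.
fn create_vm(config: &VmCreateConfig) -> String {
    format!(
        "nested={} synthetic_timers={}",
        config.nested, config.synthetic_timers
    )
}
```

Callers that only care about one setting can rely on `Default` for the
rest, e.g. `VmCreateConfig { nested: true, ..Default::default() }`.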
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This bumps the MSRV to 1.88 (also, Rust edition 2024 is mandatory).
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Raise the max number of supported (v)CPUs on kvm x86_64 hosts
to 8192 (the max allowed value of CONFIG_NR_CPUS in the Linux kernel).
Other platforms keep their existing CPU limits pending further
development and testing.
The change has been tested on Intel and AMD hosts.
Signed-off-by: Barret Rhoden <brho@google.com>
Signed-off-by: Neel Natu <neelnatu@google.com>
Signed-off-by: Ofir Weisse <oweisse@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
This commit removes the SGX support from cloud hypervisor. SGX support
was deprecated in May as part of #7090.
Signed-off-by: Shubham Chakrawar <schakrawar@crusoe.ai>
On aarch64 and RISC-V, calling load_firmware() through load_kernel()
provides no benefit and only duplicates checks already performed in
load_payload(). load_payload() now directly invokes load_firmware() or
load_kernel(), removing unnecessary indirection and redundancy.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Both functions are defined separately for the two architectures with
minor differences.
* `load_firmware()`: calls `arch::uefi::load_uefi`, which is available on
both architectures;
* `load_kernel()`: manually aligns to `arch::layout::KERNEL_START` 2MB
for both architectures (e.g. no-op for `aarch64`);
Signed-off-by: Bo Chen <bchen@crusoe.ai>
Currently, the following scenarios are supported by Cloud Hypervisor to
bootstrap a VM:
1. provide firmware
2. provide kernel
3. provide kernel + cmdline
4. provide kernel + initrd
5. provide kernel + cmdline + initrd
As the difference between `--firmware` and `--kernel` is not very clear
currently, especially as both use/support a Xen PVH entry, adding this
helps to identify the cause of misconfiguration.
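One way such a check could look, assuming that every combination outside
the list above is rejected. `PayloadConfig`, `PayloadError` and
`validate()` are hypothetical names for this sketch and are not taken from
the actual change.

```rust
/// Hypothetical payload description mirroring the options listed above.
#[derive(Debug, Default)]
struct PayloadConfig {
    firmware: Option<String>,
    kernel: Option<String>,
    cmdline: Option<String>,
    initrd: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
enum PayloadError {
    MissingFirmwareOrKernel,
    FirmwareAndKernel,
    FirmwareWithKernelOptions,
}

/// Accepts exactly the five listed scenarios (assumption of this sketch).
fn validate(payload: &PayloadConfig) -> Result<(), PayloadError> {
    let kernel_options = payload.cmdline.is_some() || payload.initrd.is_some();
    match (payload.firmware.is_some(), payload.kernel.is_some()) {
        (false, false) => Err(PayloadError::MissingFirmwareOrKernel),
        (true, true) => Err(PayloadError::FirmwareAndKernel),
        (true, false) if kernel_options => Err(PayloadError::FirmwareWithKernelOptions),
        _ => Ok(()),
    }
}
```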
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This is the second patch in a series intended to let Cloud Hypervisor
support more than 255 vCPUs in guest VMs; the first patch/commit is
https://github.com/cloud-hypervisor/cloud-hypervisor/pull/7231
At the moment, CPU topology in Cloud Hypervisor is using
u8 for components, and somewhat inconsistently:
- struct CpuTopology in vmm/src/vm_config.rs uses four components
(threads_per_core, cores_per_die, dies_per_package, packages);
- when passed around as a tuple, it is a 3-tuple of u8, with
some inconsistency:
- in get_x2apic_id in arch/src/x86_64/mod.rs the three u8
are assumed to be (correctly)
threads_per_core, cores_per_die, and dies_per_package, but
- in get_vcpu_topology() in vmm/src/cpu.rs the three-tuple is
threads_per_core, cores_per_die, and packages (dies_per_package
is assumed to always be one? not clear).
So for consistency, a 4-tuple is always passed around.
In addition, the type of the tuple components is changed from u8 to
u16, as on x86_64 subcomponents can consume up to 16 bits.
Again, config constraints have not been changed, so this patch
is mostly a NOOP.
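To illustrate why the wider type matters, here is a sketch with a
hypothetical `Topology` alias and `total_vcpus()` helper (neither is taken
from the code base). The product is computed in u64 so that it cannot
overflow for any combination of u16 components.

```rust
/// (threads_per_core, cores_per_die, dies_per_package, packages)
type Topology = (u16, u16, u16, u16);

/// Total number of vCPUs described by the 4-tuple.
fn total_vcpus(topology: Topology) -> u64 {
    let (threads, cores, dies, packages) = topology;
    u64::from(threads) * u64::from(cores) * u64::from(dies) * u64::from(packages)
}
```

A layout such as `(2, 64, 2, 2)` describes 512 vCPUs, and a single
component such as 256 cores per die would already not fit into a u8.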
Signed-off-by: Barret Rhoden <brho@google.com>
Signed-off-by: Neel Natu <neelnatu@google.com>
Signed-off-by: Ofir Weisse <oweisse@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
This is the first change to Cloud Hypervisor in a series of changes
intended to increase the max number of supported vCPUs in guest VMs,
which is currently limited to 255 (254 on x86_64).
No user-visible/behavior changes are expected as a result of
applying this patch, as the type of boot_cpus and related
fields in config structs remains u8 for now, and all configuration
validations remain the same.
Signed-off-by: Barret Rhoden <brho@google.com>
Signed-off-by: Neel Natu <neelnatu@google.com>
Signed-off-by: Ofir Weisse <oweisse@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
This allows us to enable/disable the fw_cfg device via the CLI.
We can also now upload files into the guest VM using fw_cfg_items
via the CLI.
Signed-off-by: Alex Orozco <alexorozco@google.com>
The ACPI tables are created in the same place as for the regular
boot flow, except that here we add them to the fw_cfg device to be
measured by the firmware, which then places the ACPI tables into
memory.
Signed-off-by: Alex Orozco <alexorozco@google.com>
The kernel and initramfs are passed to the fw_cfg device as
file references. The cmdline is passed directly.
Signed-off-by: Alex Orozco <alexorozco@google.com>
Here we add the fw_cfg device as a legacy device to the device manager.
It is guarded behind a fw_cfg flag in vmm at creation of the
DeviceManager. In this cl we implement the fw_cfg device with one
function (signature).
Signed-off-by: Alex Orozco <alexorozco@google.com>
Error::UefiLoad is required for load_firmware to propagate the errors
it encounters, so define it for riscv64.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
For a CVM guest, the rsdp address is set to None. Unwrapping it
crashes the VMM. Don't call configure system if the
rsdp address is None.
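The shape of the fix can be sketched as follows. `GuestAddress`,
`configure_system()` and `maybe_configure()` are simplified, hypothetical
stand-ins and do not match the real signatures.

```rust
/// Hypothetical stand-in for a guest physical address.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct GuestAddress(u64);

/// Hypothetical stand-in for the system configuration step.
fn configure_system(rsdp_addr: GuestAddress) -> String {
    format!("rsdp at {:#x}", rsdp_addr.0)
}

/// Only runs the configuration step when an RSDP address exists, instead
/// of unwrapping the Option and panicking on `None`.
fn maybe_configure(rsdp_addr: Option<GuestAddress>) -> Option<String> {
    rsdp_addr.map(configure_system)
}
```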
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The changes were mostly automatically applied using the following
Python script:
```python
import os, re
for root, _, files in os.walk("."):
    for f in files:
        if not f.endswith(".rs"):
            continue
        p = os.path.join(root, f)
        with open(p, "r", encoding="utf-8") as file:
            lines = file.readlines()
        changed = False
        for i in range(len(lines) - 1):
            if re.search(r'#\[error\(".*: \{0[^}]*\}"\)\]', lines[i]) and "#[source]" in lines[i + 1].strip():
                lines[i] = re.sub(r': \{0[^}]*\}"\)\]', '")]', lines[i])
                changed = True
        if changed:
            with open(p, "w", encoding="utf-8") as file:
                file.writelines(lines)
            print("Fixed:", p)
```
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
# Conflicts:
# vmm/src/api/http/mod.rs
This streamlines the code base to follow best practices for
error handling in Rust: Each error struct implements
std::error::Error (mostly via the thiserror::Error derive macro)
and sets its source accordingly.
This allows future work that nicely prints the error chains,
for example.
So far, the convention is that each error prints its
sub error as part of its Display::fmt() impl.
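What printing such an error chain could look like is sketched below with a
hand-written error type, so the example does not depend on thiserror.
`DiskError` and `chain()` are hypothetical. Unlike the convention described
above, the Display impl here deliberately does not print the sub error, so
each cause appears only once in the chain.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type that exposes its cause via `source()`.
#[derive(Debug)]
struct DiskError {
    path: String,
    source: std::io::Error,
}

impl fmt::Display for DiskError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to open disk {}", self.path)
    }
}

impl Error for DiskError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Walks `source()` and joins all messages with ": ".
fn chain(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}
```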
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This streamlines the code base to follow best practices for
error handling in Rust: Each error struct implements
std::error::Error (mostly via the thiserror::Error derive macro)
and sets its source accordingly.
This allows future work that nicely prints the error chains,
for example.
So far, the convention is that each error prints its
sub error as part of its Display::fmt() impl.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
This adds guidance on how to resolve the issue.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
# What
This commit introduces file-based advisory locking for the files backing
the block devices by using the fcntl() syscall with OFD locks. The
per-open-file-descriptor (OFD) locks are more robust than traditional
POSIX locks (F_SETLK) as they are not tied to process IDs and avoid
common issues in multithreaded or multi-fd scenarios [1]. Therefore,
we don't use `std::fs::File::try_lock()`, which is backed by F_SETLKW.
The locking mechanism is aware of the `readonly` property and allows
`n` readers or `1` writer (exclusive mode).
As the locks are advisory, multiple cloud-hypervisor processes can
keep each other from writing to the same file. However, this is not
a system-wide, file-system-level locking mechanism that prevents other
processes from open()ing the file.
The introduced new locking mechanism does not cover vhost-user devices.
# Why
To prevent misconfiguration and improve safety, it is good practice to
protect disk image files with a locking mechanism. Experience and common
best practices suggest that advisory locks are preferable over mandatory
locks due to better compatibility and fewer pitfalls (in fs space).
The introduced functionality is aligned with the approach taken by
QEMU [0], and is also recommended in [1].
# Implementation Details
We need to ensure that not only normal operation keeps working but also
state save/resume and live-migration. Especially for live migration,
it is crucial that the sender VMM releases the locks when the VM stops
so the receiver VMM can acquire them right after that.
Therefore, the locking and releasing happen directly on the block
device struct. The device manager knows all block devices and can
forward requests to these types.
Last but not least, this commit relies on explicit lock acquisition
but implicit lock release (FD close). It only explicitly releases
the locks where this integrates more smoothly into the existing
code.
# Testing
I tested
- normal operation
- state save/resume,
- device hot plugging,
- and live-migration
with read/shared and write/exclusive locks.
One can use the `fcntl-tool` to test if locks are actually acquired
or released [2].
# Links
[0] 825b96dbce/util/osdep.c (L266)
[1] https://apenwarr.ca/log/20101213
[2] https://crates.io/crates/fcntl-tool
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Address allocation has been spread across different files
and different locations for x86 vs ARM. This makes it hard to follow the
code. Thus, unify it in a single location that satisfies all the
requirements.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>