vmsilo
A NixOS VM compartmentalization system inspired by Qubes OS. Runs programs in isolated VMs using crosvm (Chrome OS VMM), displaying their windows natively on the host desktop.
Thanks to Thomas Leonard (@talex5), who wrote the wayland proxy and made qubes-lite, which made this project possible. https://gitlab.com/talex5/qubes-lite
This is a hacky side project. If you need a serious and secure operating system, use Qubes.
Features
- Qubes-style colored window decorations enforced by patched kwin
- Fast graphics
- Supports many wayland protocols for things like HDR, fractional scaling and smooth video playback
- Sound playback (no sound input yet)
- VMs can be configured fully disposable with no state kept between restarts
- Shared directories over virtiofs for easily sharing files between VMs
- PCI passthrough
- Auto shutdown idle VMs
Comparison to Qubes
The main benefits compared to Qubes are:
- Fast, modern graphics. Wayland calls are proxied to the host.
- Better power management. Qubes is based on Xen, whose support for modern laptop power management is significantly worse than Linux's.
- NixOS-based declarative VM config.
The cost for that is security. Qubes is laser-focused on security and hard compartmentalisation. This makes it by far the most secure general-purpose operating system there is.
Ways in which we are less secure than Qubes (list is not even remotely exhaustive):
- The host system is not isolated from the network at all. The user must rely on discipline to avoid accessing untrusted network resources from the host. Even then, handling VM network traffic makes the host attack surface much larger.
- There is no attempt to isolate the host system from hardware peripherals. Qubes segregates USB and network into VMs.
- Currently the clipboard is shared between the host and all VMs. This will be fixed at some point; the plan is to implement a two-level clipboard like the one in Qubes.
- Proxying wayland calls means the attack surface from VM to host is way larger than Qubes' raw framebuffer copy approach.
- Probably a million other things.
If you are trying to defend against a determined, well-resourced attacker targeting you specifically then you should be running Qubes.
Quick Start
Add to your flake inputs:
{
inputs.vmsilo.url = "git+https://git.dsg.is/dsg/vmsilo.git";
}
Import the module and configure VMs in your NixOS configuration:
{ config, pkgs, ... }: {
imports = [ inputs.vmsilo.nixosModules.default ];
# User must have explicit UID for vmsilo
users.users.david.uid = 1000;
programs.vmsilo = {
enable = true;
user = "david";
hostNetworking.nat = {
enable = true;
interface = "eth0";
};
nixosVms = [
{
name = "banking";
memory = 4096;
cpus = 4;
autoShutdown = { enable = true; after = 120; };
network = {
nameservers = [ "1.1.1.1" ];
interfaces.wan = {
type = "tap";
tap.hostAddress = "10.0.0.254/24";
addresses = [ "10.0.0.1/24" ];
routes."0.0.0.0/0" = { via = "10.0.0.254"; };
};
};
guestPrograms = with pkgs; [ firefox konsole ];
}
{
name = "shopping";
memory = 2048;
cpus = 2;
autoShutdown.enable = true;
network = {
nameservers = [ "1.1.1.1" ];
interfaces.wan = {
type = "tap";
tap.hostAddress = "10.0.1.254/24";
addresses = [ "10.0.1.1/24" ];
routes."0.0.0.0/0" = { via = "10.0.1.254"; };
};
};
guestPrograms = with pkgs; [ firefox konsole ];
}
{
name = "personal";
memory = 4096;
cpus = 4;
# No network.interfaces = offline VM
guestPrograms = with pkgs; [ libreoffice ];
}
];
};
}
Configuration Options
programs.vmsilo
| Option | Type | Default | Description |
|---|---|---|---|
| enable | bool | false | Enable vmsilo VM management |
| user | string | required | User who owns TAP interfaces and runs VMs (must have explicit UID) |
| hostNetworking.nat.enable | bool | false | Enable NAT for VM internet access |
| hostNetworking.nat.interface | string | "" | External interface for NAT (required if nat.enable) |
| nixosVms | list of VM configs | [] | List of NixOS-based VMs to create |
| enableBashIntegration | bool | true | Enable bash completion for vm-* commands |
| crosvm.logLevel | string | "info" | Log level for crosvm (error, warn, info, debug, trace) |
| crosvm.extraArgs | list of strings | [] | Extra args passed to crosvm before "run" subcommand |
| crosvm.extraRunArgs | list of strings | [] | Extra args passed to crosvm after "run" subcommand |
| vm-switch.logLevel | string | "info" | Log level for vm-switch daemon (error, warn, info, debug, trace) |
| vm-switch.extraArgs | list of strings | [] | Extra command-line arguments for vm-switch daemon |
| vmsilo-balloond.logLevel | string | "info" | Log level for vmsilo-balloond daemon (error, warn, info, debug, trace) |
| vmsilo-balloond.pollInterval | string | "2s" | Policy evaluation interval (e.g. "2s", "1s", "500ms") |
| vmsilo-balloond.criticalHostPercent | int | 5 | Host critical threshold as percentage of total RAM |
| vmsilo-balloond.guestAvailableBias | string | "300m" | Guest bias term (e.g. "300m", "500m") |
| vmsilo-balloond.extraArgs | list of strings | [] | Extra command-line arguments for vmsilo-balloond daemon |
| isolatedPciDevices | list of strings | [] | PCI devices to isolate with vfio-pci |
VM Configuration (nixosVms items)
| Option | Type | Default | Description |
|---|---|---|---|
| name | string | required | VM name for scripts |
| memory | int | 1024 | Memory allocation in MB |
| cpus | int | 2 | Number of virtual CPUs |
| color | string | "red" | Window decoration color (named color or hex, e.g., "#2ecc71") |
| network.nameservers | list of strings | [] | DNS nameservers for this VM |
| network.interfaces | attrset of interface configs | {} | Network interfaces (keys are guest-visible names) |
| autoShutdown.enable | bool | false | Auto-shutdown when idle (after autoShutdown.after seconds) |
| autoShutdown.after | int | 60 | Seconds to wait before shutdown |
| autoStart | bool | false | Start VM automatically at boot instead of on-demand |
| dependsOn | list of strings | [] | VM names to also start when this VM starts |
| additionalDisks | list of disk configs | [] | Additional disks to attach (see Disk Configuration) |
| rootDisk | disk config or null | null | Custom root disk (defaults to built rootfs) |
| kernel | path or null | null | Custom kernel image |
| initramfs | path or null | null | Custom initramfs |
| rootDiskReadonly | bool | true | Whether root disk is read-only |
| sharedHome | bool or string | true | Share host dir as /home/user via virtiofs (true = /shared/<vmname>, string = custom path, false = disabled) |
| kernelParams | list of strings | [] | Extra kernel command-line parameters |
| gpu | bool or attrset | true | GPU config (false = disabled, true = default, attrset = custom) |
| sound | bool or attrset | true | Sound config (false = disabled, true = default PulseAudio, attrset = custom) |
| sharedDirectories | list of attrsets | [] | Shared directories with path, tag, and optional kv pairs |
| pciDevices | list of attrsets | [] | PCI devices to pass through (path + optional kv pairs) |
| guestPrograms | list of packages | [] | VM-specific packages |
| guestConfig | NixOS module(s) | [] | VM-specific NixOS configuration (module, list of modules, or path) |
| vhostUser | list of attrsets | [] | Manual vhost-user devices (vm-switch entries auto-populated from network.interfaces) |
| crosvm.logLevel | string or null | null | Per-VM log level override (uses global if null) |
| crosvm.extraArgs | list of strings | [] | Per-VM extra args (appended to global crosvm.extraArgs) |
| rootOverlay.type | "qcow2" or "tmpfs" | "qcow2" | Overlay upper layer: disk-backed (qcow2) or RAM-backed (tmpfs) |
| rootOverlay.size | string | "10G" | Max ephemeral disk size (qcow2 only); parsed by qemu-img |
| crosvm.extraRunArgs | list of strings | [] | Per-VM extra run args (appended to global crosvm.extraRunArgs) |
Network Interface Configuration (network.interfaces.<name>)
The network.interfaces option is an attrset where keys become guest-visible interface names (e.g., wan, internal).
| Option | Type | Default | Description |
|---|---|---|---|
| type | "tap" or "vm-switch" | "vm-switch" | Interface type. Use "tap"; "vm-switch" is not ready for use. |
| macAddress | string or null | null | MAC address (auto-generated from vmName-ifName hash if null) |
| tap.name | string or null | null | TAP interface name on host (default: <vmname>-<ifIndex>) |
| tap.hostAddress | string or null | null | Host-side IP with prefix (e.g., "10.0.0.254/24"); mutually exclusive with tap.bridge |
| tap.bridge | string or null | null | Bridge name to add TAP to (via networking.bridges); mutually exclusive with tap.hostAddress |
| vmNetwork.name | string | required for vm-switch | Network name for vm-switch |
| vmNetwork.receiveBroadcast | bool | false | Whether this VM receives broadcast traffic |
| vmNetwork.routes | list of strings | [] | CIDR prefixes for subnet routing (e.g., "192.168.1.0/24") |
| dhcp | bool | false | Enable DHCP for this interface |
| addresses | list of strings | [] | Static IPv4 addresses with prefix |
| routes | attrs | {} | IPv4 routes (destination -> { via = gateway; }) |
| v6Addresses | list of strings | [] | Static IPv6 addresses with prefix |
| v6Routes | attrs | {} | IPv6 routes |
Shared directories
The attributes of the sharedDirectories option are the parameters to crosvm's --shared-dir. For a full list,
see crosvm run --help.
Example to mount /shared/personal on the host, mapping all UID/GID on the vm to the same UID/GID on the host:
sharedDirectories = [ { path = "/shared/personal"; id = "home"; uidmap = "0 0 65535"; gidmap = "0 0 65535"; } ];
To allow only uid/gid 1000:
sharedDirectories = [ { path = "/shared/personal"; id = "home"; uid = 1000; gid = 1000; uidmap = "1000 1000 1"; gidmap = "1000 1000 1"; } ];
Note that both UID and GID of the vm user accessing files need to exist on the host, otherwise you will get an "Invalid argument" error.
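The map strings above can be read as space-separated "guest host count" triples. This is a hypothetical Python sketch of that translation (the function name is illustrative, not a vmsilo helper, and the triple semantics are an assumption based on the examples above):

```python
def map_id(guest_id, map_str):
    """Translate a guest UID/GID through an ID-map string of
    space-separated "guest host count" triples.
    Returns the host ID, or None if the guest ID is unmapped."""
    fields = [int(x) for x in map_str.split()]
    for guest_start, host_start, count in zip(fields[0::3], fields[1::3], fields[2::3]):
        if guest_start <= guest_id < guest_start + count:
            return host_start + (guest_id - guest_start)
    return None

# "0 0 65535" maps every ID in range to itself; "1000 1000 1" maps only ID 1000.
print(map_id(1000, "0 0 65535"))    # 1000
print(map_id(1234, "1000 1000 1"))  # None
```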
Shared Home
By default, each VM's /home/user is shared from the host via virtiofs (sharedHome = true). The host directory is /shared/<vmname>. On first VM start, if the directory doesn't exist, it is initialized by copying /var/lib/vmsilo/home-template. You can seed that template with dotfiles, configs, etc.
- sharedHome = true: use default path /shared/<vmname> (default)
- sharedHome = "/custom/path": use a custom host path
- sharedHome = false: disable; guest /home/user lives on the root overlay
Both /shared and /var/lib/vmsilo/home-template are owned by the configured user.
Disk Configuration (additionalDisks items)
Free-form attrsets passed directly to crosvm --block. The path attribute is required and used as a positional argument.
additionalDisks = [{
path = "/tmp/data.qcow2"; # required, positional
ro = false; # read-only
sparse = true; # enable discard/trim
block-size = 4096; # reported block size
id = "data"; # device identifier
direct = false; # O_DIRECT mode
}];
# Results in: --block /tmp/data.qcow2,ro=false,sparse=true,block-size=4096,id=data,direct=false
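The flattening rule can be sketched in a few lines. This is an illustration of the mapping shown in the comment above, not the module's actual implementation; the lowercase rendering of booleans and the key ordering are assumptions mirroring the example:

```python
def block_flag(disk):
    """Flatten a disk attrset into a crosvm --block argument.
    `path` is positional; remaining keys become key=value pairs,
    with booleans rendered lowercase to match the example above."""
    parts = [str(disk["path"])]
    for key, value in disk.items():
        if key == "path":
            continue
        if isinstance(value, bool):
            value = str(value).lower()
        parts.append(f"{key}={value}")
    return "--block " + ",".join(parts)

disk = {"path": "/tmp/data.qcow2", "ro": False, "sparse": True,
        "block-size": 4096, "id": "data", "direct": False}
print(block_flag(disk))
# --block /tmp/data.qcow2,ro=false,sparse=true,block-size=4096,id=data,direct=false
```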
Wayland Proxy
waylandProxy = "wayland-proxy-virtwl"; # Default: wayland-proxy-virtwl by Thomas Leonard
waylandProxy = "sommelier"; # ChromeOS sommelier
GPU Configuration
gpu = false; # Disabled
gpu = true; # Default: context-types=cross-domain:virgl2
gpu = { context-types = "cross-domain:virgl2"; width = 1920; height = 1080; }; # Custom
Sound Configuration
sound = false; # Disabled
sound = true; # Default PulseAudio config
sound = { backend = "pulse"; capture = true; }; # Custom
PCI Passthrough Configuration
pciDevices = [{
path = "01:00.0"; # BDF format, auto-converted to sysfs path
iommu = "on"; # Optional: enable IOMMU
}];
# Results in: --vfio /sys/bus/pci/devices/0000:01:00.0/,iommu=on
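The BDF expansion can be sketched as below. This illustrates the conversion shown in the comment above (prepending the 0000 PCI domain when absent); the function name is illustrative, not part of vmsilo:

```python
def vfio_arg(bdf, iommu=None):
    """Expand a short PCI BDF like "01:00.0" into the sysfs path
    form crosvm's --vfio expects, prepending domain 0000 if absent."""
    if bdf.count(":") == 1:  # no domain component given
        bdf = "0000:" + bdf
    arg = f"/sys/bus/pci/devices/{bdf}/"
    if iommu is not None:
        arg += f",iommu={iommu}"
    return "--vfio " + arg

print(vfio_arg("01:00.0", iommu="on"))
# --vfio /sys/bus/pci/devices/0000:01:00.0/,iommu=on
```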
vhost-user Devices
vhostUser = [{
type = "net";
socket = "/path/to/socket";
}];
# Results in: --vhost-user type=net,socket=/path/to/socket
Auto-populated from vmNetwork configuration with type = "net".
Window Decoration Colors
Each VM's color option controls its KDE window decoration color, providing a visual indicator of which security domain a window belongs to:
nixosVms = [
{ name = "banking"; color = "#2ecc71"; ... } # Green
{ name = "shopping"; color = "#3498db"; ... } # Blue
{ name = "untrusted"; color = "red"; ... } # Red (default)
];
The color is passed to KWin via the wayland security context. A KWin patch (included in the module) reads the color and applies it to the window's title bar and frame. Server-side decorations are forced for VM windows so the color is always visible. Text color is automatically chosen (black or white) based on the background luminance.
Supported formats: named colors ("red", "green"), hex ("#FF0000"), RGB ("rgb(255,0,0)").
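The black-or-white choice can be sketched as a luminance threshold. The exact formula the kwin patch uses is not specified here; this sketch assumes Rec. 709 luma weights and a midpoint threshold:

```python
def text_color(bg_hex):
    """Pick black or white text for a background color using
    Rec. 709 luma weights (an assumption; the actual kwin patch
    may use a different formula or threshold)."""
    bg_hex = bg_hex.lstrip("#")
    r, g, b = (int(bg_hex[i:i + 2], 16) for i in (0, 2, 4))
    luma = 0.2126 * r + 0.7152 * g + 0.0722 * b
    return "#000000" if luma > 128 else "#ffffff"

print(text_color("#2ecc71"))  # light green background -> black text
print(text_color("#ff0000"))  # pure red is dark in luma -> white text
```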
Commands
After rebuilding NixOS, the following commands are available:
Run command in VM (recommended)
vm-run <name> <command>
Example: vm-run banking firefox
This is the primary way to interact with VMs. The command:
- Connects to the VM's socket at /run/vmsilo/<name>-command.socket
- Triggers socket activation to start the VM if not running
- Sends the command to the guest
Start/Stop VMs
vm-start <name> # Start VM via systemd (uses polkit, no sudo needed)
vm-stop <name> # Stop VM via systemd (uses polkit, no sudo needed)
Start VM for debugging
vm-start-debug <name>
Starts crosvm directly in the foreground (requires sudo), bypassing socket activation. Useful for debugging VM boot issues since crosvm output is visible.
Shell access
vm-shell <name> # Connect to serial console (default)
vm-shell --ssh <name> # SSH into VM as user
vm-shell --ssh --root <name> # SSH into VM as root
The default serial console mode requires no configuration. Press CTRL+] to escape.
SSH mode requires SSH keys configured in per-VM guestConfig (see Advanced Configuration).
Socket activation
VMs run as system services (for PCI passthrough and sandboxing) and start automatically on first access via systemd socket activation:
# Check socket status
systemctl status vmsilo-banking.socket
# Check VM service status
systemctl status vmsilo-banking-vm.service
Sockets are enabled by default and start on boot.
Network Architecture
Interface Types
TAP interfaces (type = "tap"): For host networking and NAT internet access.
- Creates a TAP interface on the host with tap.hostAddress, or adds it to a bridge with tap.bridge
- tap.hostAddress and tap.bridge are mutually exclusive
- Guest uses addresses from the addresses option
- Routes configured via the routes option
VM-switch interfaces (type = "vm-switch"): For VM-to-VM networking.
- Uses vhost-user-net backed by vm-switch daemon
- L3 IP-based forwarding between all peers (any peer can reach any other)
- receiveBroadcast = true enables broadcast traffic delivery for a peer
- routes advertises CIDR prefixes for subnet routing (e.g., a default gateway)
- MAC addresses auto-generated from SHA1("<vmName>-<ifName>")
Note: vm-switch is not currently recommended due to performance issues.
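The SHA1-based MAC scheme mentioned above can be sketched as follows. The exact byte layout vmsilo derives from the digest is an assumption; this sketch takes the first six digest bytes and forces a locally-administered unicast address:

```python
import hashlib

def vm_mac(vm_name, if_name):
    """Derive a stable MAC from SHA1("<vmName>-<ifName>").
    Byte 0 is adjusted to be locally administered (bit 1 set) and
    unicast (bit 0 clear). Illustrative only; the actual layout
    vmsilo uses may differ."""
    digest = hashlib.sha1(f"{vm_name}-{if_name}".encode()).digest()
    octets = bytearray(digest[:6])
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{b:02x}" for b in octets)

# Same inputs always yield the same address, so a VM keeps its MAC
# across restarts without any state on disk.
print(vm_mac("banking", "wan"))
```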
Interface Naming
Interface names are user-specified via network.interfaces attrset keys. Names are passed to the guest via vmsilo.ifname=<name>,<mac> kernel parameters and applied at early boot via udev rules.
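Guest-side parsing of the vmsilo.ifname=<name>,<mac> parameters could look like the sketch below. This is a hypothetical illustration, not the project's actual udev/boot code:

```python
def parse_ifnames(cmdline):
    """Extract vmsilo.ifname=<name>,<mac> entries from a kernel
    command line, returning {mac: name} suitable for udev-style
    rename rules keyed on hardware address."""
    result = {}
    for token in cmdline.split():
        if token.startswith("vmsilo.ifname="):
            name, mac = token[len("vmsilo.ifname="):].split(",", 1)
            result[mac.lower()] = name
    return result

cmdline = "console=ttyS0 vmsilo.ifname=wan,02:AB:CD:12:34:56 root=/dev/vda"
print(parse_ifnames(cmdline))
# {'02:ab:cd:12:34:56': 'wan'}
```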
NAT
When hostNetworking.nat.enable = true, the module configures:
- IP forwarding (net.ipv4.ip_forward = 1)
- NAT masquerading on hostNetworking.nat.interface
- Internal IPs derived from TAP interface addresses
PCI Passthrough
Pass PCI devices (USB controllers, network cards) directly to VMs for hardware isolation.
Configuration
programs.vmsilo = {
# Devices to isolate from host (claimed by vfio-pci)
isolatedPciDevices = [ "01:00.0" "02:00.0" ];
nixosVms = [{
name = "sys-usb";
memory = 1024;
pciDevices = [{ path = "01:00.0"; }]; # USB controller
}
{
name = "sys-net";
memory = 1024;
pciDevices = [{ path = "02:00.0"; }]; # Network card
}];
};
# Recommended: blacklist native drivers for reliability
boot.blacklistedKernelModules = [ "xhci_hcd" ]; # for USB controllers
How It Works
- Early boot: vfio-pci claims isolated devices before other drivers load
- Activation: If devices are already bound, they're rebound to vfio-pci
- VM start: IOMMU groups are validated, then devices are passed via --vfio
Architecture
Each NixOS VM gets:
- A dedicated qcow2 rootfs image with packages baked in
- Overlayfs root (read-only ext4 lower + ephemeral qcow2 upper by default, tmpfs fallback)
- Wayland proxy for GPU passthrough (wayland-proxy-virtwl or sommelier)
- Session setup via vmsilo-session-setup (imports display variables into the user manager, starts graphical-session.target)
- Socket-activated command listener (vsock-cmd.socket + vsock-cmd@.service, user services gated on graphical-session.target)
vsock-cmd.socket+vsock-cmd@.service, user services gated ongraphical-session.target) - Optional idle watchdog for auto-shutdown VMs (queries user service instances)
- Systemd-based init
The host provides:
- Persistent TAP interfaces via NixOS networking
- NAT for internet access (optional)
- Socket activation for commands (/run/vmsilo/<name>-command.socket)
- Console PTY for serial access (/run/vmsilo/<name>-console)
- Polkit rules for the configured user to manage VM services without sudo
- CLI tools: vm-run, vm-start, vm-stop, vm-start-debug, vm-shell
- Desktop integration with .desktop files for guest applications
vm-switch
Note: vm-switch is not recommended. It was an experiment meant to provide a VM-to-VM network path without the host network stack needing to touch the packets. But it performs poorly under load: a busy connection can saturate the buffer and cause high latency for other connections. I tried various ways of integrating FQ-CoDel to fix this, but it still didn't work well.
The vm-switch daemon (vm-switch/ Rust crate) provides L3 IP-based forwarding for VM-to-VM networks. One instance runs per vmNetwork, managed by systemd (vm-switch-<netname>.service).
Process model: The main process watches a config directory for MAC files and forks one child process per VM. Each child is a vhost-user net backend serving a single VM's network interface.
Main Process
(config watch, orchestration)
/ | \
fork / fork | fork \
v v v
Child: router Child: banking Child: shopping
(vhost-user) (vhost-user) (vhost-user)
| | |
[unix socket] [unix socket] [unix socket]
| | |
crosvm crosvm crosvm
(router VM) (banking VM) (shopping VM)
Packet forwarding uses lock-free SPSC ring buffers in shared memory (memfd_create + mmap). When a VM transmits a frame, its child process validates the source MAC/IP and routes by destination IP:
- ARP requests: answered locally with proxy ARP (no forwarding needed)
- Unicast IPv4: classified by 5-tuple flow hash, enqueued into per-peer FQ-CoDel (RFC 8290), dequeued via DRR + CoDel, pushed into the destination's ingress ring buffer
- Broadcast: sent directly to peers marked as broadcast-receiving (bypasses FQ-CoDel)
FQ-CoDel provides per-flow fairness and bufferbloat control. When the ring buffer is full, packets stay queued in FQ-CoDel and are pushed when space becomes available. Ring buffers use atomic head/tail pointers (no locks in the datapath) with eventfd signaling for data and space availability. VIRTIO_F_EVENT_IDX is negotiated for notification suppression.
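The full/empty logic of such an SPSC ring can be sketched in plain Python. This shows only the index arithmetic (monotonic head/tail with a power-of-two mask); the real vm-switch ring lives in a shared memfd with atomic pointers and eventfd signaling, none of which is modeled here:

```python
class SpscRing:
    """Index arithmetic of a single-producer single-consumer ring.
    Capacity must be a power of two so positions wrap with a mask."""
    def __init__(self, capacity):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.buf = [None] * capacity
        self.mask = capacity - 1
        self.head = 0  # advanced only by the producer
        self.tail = 0  # advanced only by the consumer

    def push(self, frame):
        if self.head - self.tail == len(self.buf):
            return False  # full: caller keeps the frame queued (in FQ-CoDel)
        self.buf[self.head & self.mask] = frame
        self.head += 1   # publish after the slot is written
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # empty
        frame = self.buf[self.tail & self.mask]
        self.tail += 1   # free the slot for the producer
        return frame

ring = SpscRing(4)
for i in range(5):
    ring.push(i)                       # the fifth push fails: ring is full
print([ring.pop() for _ in range(4)])  # [0, 1, 2, 3]
```

Keeping head and tail monotonic (masking only on access) is what lets full and empty be distinguished without wasting a slot.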
Buffer exchange protocol: The main process orchestrates buffer setup between children via a control channel (SOCK_SEQPACKET + SCM_RIGHTS for passing memfd/eventfd file descriptors):
- Main tells Child A: "create an ingress buffer for Child B" (GetBuffer)
- Child A creates the ring buffer and returns the FDs: memfd, data eventfd, space eventfd (BufferReady)
- Main forwards those FDs to Child B as an egress target (PutBuffer)
- Child B can now write frames directly into Child A's memory, with no copies through the main process
Sandboxing: The daemon runs in a multi-layer sandbox applied at startup (before any async runtime or threads):
| Layer | Mechanism | Effect |
|---|---|---|
| User namespace | CLONE_NEWUSER | Unprivileged outside, appears as UID 0 inside |
| PID namespace | CLONE_NEWPID | Main is PID 1; children invisible to host |
| Mount namespace | CLONE_NEWNS + pivot_root | Minimal tmpfs root: /config, /dev (null/zero/urandom), /proc, /tmp |
| IPC namespace | CLONE_NEWIPC | Isolated System V IPC |
| Network namespace | CLONE_NEWNET | No interfaces; communication only via inherited FDs |
| Seccomp (main) | BPF whitelist | Allows fork, socket creation, inotify for config watching |
| Seccomp (child) | Tighter BPF whitelist | No fork, no socket creation, no file open; applied after vhost setup |
Seccomp modes: --seccomp-mode=kill (default), trap (SIGSYS for debugging), log, disabled.
Disable sandboxing for debugging with --no-sandbox and --seccomp-mode=disabled.
CLI:
vm-switch [OPTIONS]
Options:
-n, --name <NAME> Network name (used in worker process titles)
-d, --config-dir <PATH> Config/MAC file directory [default: /run/vm-switch]
--log-level <LEVEL> error, warn, info, debug, trace [default: warn]
--buffer-size <SIZE> Ring buffer data region size [default: 256k]
--fq-codel-target <DUR> CoDel target sojourn time [default: 5ms]
--fq-codel-interval <DUR> CoDel measurement interval [default: 100ms]
--fq-codel-limit <SIZE> Hard byte limit per-peer FQ-CoDel queue [default: 2m]
--fq-codel-quantum <SIZE> DRR quantum in bytes per round per flow [default: 1514]
--quiet Suppress all log output
--no-sandbox Disable namespace sandboxing
--seccomp-mode <MODE> kill, trap, log, disabled [default: kill]
Worker processes set their title to vm-switch(<name>): <vm> so they are identifiable in ps and top.