Sandbox virtiofsd services with namespace isolation and hardening

virtiofsd has built-in sandboxing (--sandbox=namespace): it creates
mount/PID/network namespaces, does pivot_root, drops capabilities, and
applies its own seccomp filter. The systemd unit adds non-overlapping
hardening: IPC/UTS namespace isolation, seccomp-based protections, a
capability bounding set as defense-in-depth, and LimitNOFILE=1048576.

Per-instance runtime directories (/run/vmsilo/<vmname>/virtiofs-<tag>/)
replace the shared directory for better isolation.

New VM options: virtiofs.seccompPolicy and virtiofs.disableSandbox.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Davíð Steinn Geirsson 2026-03-25 11:20:30 +00:00
parent 5ccd64c41f
commit f3663e7e66
7 changed files with 152 additions and 10 deletions

View file

@ -114,9 +114,10 @@ See README.md for full usage details and options.
- **Session bind**: GPU-enabled VMs (default) are tied to the desktop session via per-VM systemd user services bound to `graphical-session.target`. For `autoStart` GPU VMs, the session-bind service also starts the VM on login. Non-GPU `autoStart` VMs start at `multi-user.target` (boot).
- **Automatic DNS**: All VMs have `systemd-resolved` enabled by default (guest rootfs). Netvm VMs get `unbound` as a recursive resolver via `guestConfig` injection. Downstream VMs get nameserver kernel params pointing at their netvm's IP via `netvmInjections.nameservers`. VMs with `netvm = "host"` or no netvm need manual DNS config.
- **GPU device backend**: `vmsilo-<name>-gpu` service runs the GPU device backend sandboxed; selectable via `gpu.backend` between `vhost-device-gpu` (default, vhost-device-gpu in rutabaga mode) and `crosvm` (crosvm device gpu). Both crosvm and cloud-hypervisor VMMs attach via vhost-user. `vmsilo-<name>-wayland-seccontext` must start first. GPU is enabled when any capability (`wayland`, `opengl`, `vulkan`) is true; `wayland` defaults true. Set `gpu.wayland = false` to disable.
- **Per-VM runtime dirs**: all sockets under `/run/vmsilo/<vmname>/` subdirectories (not flat).
- **Per-VM runtime dirs**: all sockets under `/run/vmsilo/<vmname>/` subdirectories (not flat). virtiofs instances get per-instance dirs at `/run/vmsilo/<vmname>/virtiofs-<tag>/`.
- **USB passthrough**: usbip-over-vsock on port 5002. Guest runs `usbip-rs client listen`, host runs one `usbip-rs host connect` per device as `vmsilo-<vm>-usb@<devpath>.service`. Works with both crosvm and cloud-hypervisor.
- **CH sandboxing**: CH VMs use NixOS confinement (chroot), PrivateUsers=identity, PrivateNetwork, PrivatePIDs, PrivateIPC, empty CapabilityBoundingSet. TAP FDs passed via `vmsilo-tap-open` + `ch-remote add-net`. All privileged operations in ExecStartPre=+/ExecStartPost=+/ExecStopPost=+. Gated by `cloud-hypervisor.disableSandbox`.
- **virtiofsd sandboxing**: virtiofsd has built-in sandboxing (`--sandbox=namespace`): creates mount/PID/network namespaces, does pivot_root, drops capabilities, and applies its own seccomp filter. The systemd unit adds non-overlapping hardening: IPC/UTS namespace isolation, seccomp-based protections (clock/modules/logs/personality), capability bounding set (as defense-in-depth), and `LimitNOFILE=1048576`. Per-instance runtime dirs at `/run/vmsilo/<vmname>/virtiofs-<tag>/`. Gated by `virtiofs.disableSandbox`; seccomp controlled independently by `virtiofs.seccompPolicy`.
### Gotchas

View file

@ -194,6 +194,8 @@ There are a lot of configuration options but you don't really need to touch most
| `sound.logLevel` | string | `"info"` | RUST_LOG level for the sound device service |
| `sound.seccompPolicy` | `"enforcing"` or `"log"` | `"enforcing"` | Seccomp policy for sound device service. `"enforcing"` blocks unlisted syscalls; `"log"` only logs them. |
| `sharedDirectories` | attrsOf submodule | `{}` | Shared directories via virtiofsd (keys are fs tags, see below) |
| `virtiofs.seccompPolicy` | `"enforcing"` or `"log"` | `"enforcing"` | Seccomp policy for virtiofsd instances. `"enforcing"` blocks unlisted syscalls; `"log"` only logs them. |
| `virtiofs.disableSandbox` | bool | `false` | Disable non-seccomp sandboxing for virtiofsd instances. Useful for debugging. |
| `pciDevices` | list of attrsets | `[]` | PCI devices to passthrough (path + optional kv pairs) |
| `usbDevices` | list of attrsets | `[]` | USB devices to passthrough (vendorId, productId, optional serial) |
| `guestPrograms` | list of packages | `[]` | VM-specific packages |

View file

@ -0,0 +1,94 @@
# virtiofsd Sandboxing Design
## Overview
Add strong sandboxing to virtiofsd services using NixOS confinement, Linux namespaces, capability restrictions, and virtiofsd's built-in seccomp filter. Each virtiofsd instance is confined to only its exported directory and its own runtime directory.
## New Options
### `virtiofs.seccompPolicy`
- Type: enum `"enforcing"` | `"log"`
- Default: `"enforcing"`
- Controls virtiofsd's built-in seccomp filter via the `--seccomp` CLI flag:
- `"enforcing"` → omit flag (virtiofsd defaults to `--seccomp kill`)
- `"log"``--seccomp log` (log violations without killing)
- Independent of `disableSandbox` — always applies.
### `virtiofs.disableSandbox`
- Type: bool
- Default: `false`
- When `false`: full confinement + namespace + capability hardening applied to all virtiofsd instances for this VM.
- When `true`: all systemd-level sandboxing skipped.
- Description: "Disable non-seccomp sandboxing for virtiofsd instances. Seccomp is controlled separately by virtiofs.seccompPolicy."
## Per-Instance Runtime Directories
**Current**: All virtiofsd instances share `/run/vmsilo/<vmname>/virtiofs/`.
**New**: Each instance gets its own directory: `/run/vmsilo/<vmname>/virtiofs-<tag>/`.
### Affected locations
- **Prep service** (`services.nix`): Replace single `install -d -m 0755 /run/vmsilo/${vm.name}/virtiofs` with one `install` per share tag: `install -d -m 0755 /run/vmsilo/${vm.name}/virtiofs-${tag}`.
- **Socket path**: Changes from `/run/vmsilo/<vmname>/virtiofs/<tag>.socket` to `/run/vmsilo/<vmname>/virtiofs-<tag>/<tag>.socket`.
- **ExecStopPost cleanup**: Update path to `rm -f /run/vmsilo/${vm.name}/virtiofs-${tag}/${tag}.socket`.
- **crosvm vhost-user args** (`scripts.nix`): Update socket path in `--vhost-user` args.
- **cloud-hypervisor JSON config** (`vm-config.nix`): Update socket path in `fs` entries.
- **vhostUserSockets list**: Update the map to use new per-instance path.
- **`mkVirtiofsdCmd`** (`services.nix`): Update `--socket-path` argument to new per-instance path. Add `--seccomp` flag based on `seccompPolicy`.
## Sandboxing Configuration
virtiofsd has built-in sandboxing (`--sandbox=namespace`): it creates mount/PID/network namespaces, does `pivot_root` into the shared directory, drops capabilities to a minimal set, and applies its own seccomp filter. The systemd unit adds only non-overlapping hardening, gated by `lib.optionalAttrs (!vm.virtiofs.disableSandbox)`.
virtiofsd runs as root (no `User` directive). It manages its own capability dropping after sandbox setup.
### virtiofsd's built-in sandbox
- **Mount namespace** (`CLONE_NEWNS`) + `pivot_root` into shared directory
- **PID namespace** (`CLONE_NEWPID`)
- **Network namespace** (`CLONE_NEWNET`)
- **Capability dropping** to: `CAP_CHOWN`, `CAP_DAC_OVERRIDE`, `CAP_FOWNER`, `CAP_FSETID`, `CAP_SETGID`, `CAP_SETUID`, `CAP_MKNOD`, `CAP_SETFCAP`, optionally `CAP_DAC_READ_SEARCH`
- **Seccomp filter** (allowlist of ~100 syscalls)
- **`RLIMIT_NOFILE`** raised to 1,000,000
### Systemd-level hardening (non-overlapping)
| Directive | Value | Purpose |
|---|---|---|
| `LimitNOFILE` | `1048576` | Allow virtiofsd to set its desired fd limit |
| `PrivateIPC` | `true` | IPC namespace — virtiofsd doesn't isolate IPC |
| `ProtectHostname` | `true` | UTS namespace — virtiofsd doesn't isolate UTS |
| `ProtectClock` | `true` | Block clock modification syscalls |
| `ProtectKernelModules` | `true` | Block module load/unload |
| `ProtectKernelLogs` | `true` | Block kernel log access |
| `LockPersonality` | `true` | Prevent personality() changes |
| `SystemCallArchitectures` | `"native"` | Block non-native ABI |
| `MemoryDenyWriteExecute` | `true` | Block W+X memory mappings |
| `NoNewPrivileges` | `true` | Prevent privilege escalation via execve |
| `CapabilityBoundingSet` | (see below) | Defense-in-depth ceiling on capabilities |
`CapabilityBoundingSet` includes `CAP_SYS_ADMIN` (needed for `unshare`/`mount`/`pivot_root` during sandbox setup, dropped by virtiofsd after) plus the 9 operational capabilities virtiofsd retains.
Directives NOT applied (virtiofsd handles these internally or they are undone by `pivot_root`):
- No confinement/chroot — virtiofsd does `pivot_root`
- No `PrivateUsers`, `PrivateNetwork`, `PrivatePIDs` — virtiofsd creates these namespaces
- No `PrivateTmp`, `PrivateDevices`, `ProtectKernelTunables`, `ProtectControlGroups` — mount-based, undone by `pivot_root`
- No `RestrictNamespaces` — would block virtiofsd's `unshare()`
- No `BindPaths`/`BindReadOnlyPaths` — undone by `pivot_root`
### sharedHome ExecStartPre
The "home" tag's `createSharedHome` script needs host filesystem access to create the shared home directory from a template. Prefix with `+` so it runs outside the sandbox: `ExecStartPre = [ "+${createSharedHome}" ]`.
## Seccomp
Handled entirely by virtiofsd's built-in seccomp filter (defined in `seccomp.rs`), not by systemd's `SystemCallFilter`.
The `virtiofs.seccompPolicy` option adds `--seccomp <mode>` to the virtiofsd command line:
- `"enforcing"` → omit flag (virtiofsd defaults to `--seccomp kill`)
- `"log"``--seccomp log`
`disableSandbox` has no effect on seccomp — it is always controlled independently via `seccompPolicy`.

View file

@ -148,7 +148,7 @@ let
vhostUserSockets =
lib.optional gpuEnabled "/run/vmsilo/${vm.name}/gpu/gpu.socket"
++ lib.optional soundEnabled "/run/vmsilo/${vm.name}/sound/sound.socket"
++ map (tag: "/run/vmsilo/${vm.name}/virtiofs/${tag}.socket") (
++ map (tag: "/run/vmsilo/${vm.name}/virtiofs-${tag}/${tag}.socket") (
builtins.attrNames effectiveSharedDirs
);
@ -392,7 +392,7 @@ let
chFsEntries = map (tag: {
tag = tag;
socket = "/run/vmsilo/${vm.name}/virtiofs/${tag}.socket";
socket = "/run/vmsilo/${vm.name}/virtiofs-${tag}/${tag}.socket";
}) (builtins.attrNames effectiveSharedDirs);
chDeviceEntries = map (dev: {

View file

@ -501,6 +501,22 @@ let
};
};
virtiofs = {
seccompPolicy = lib.mkOption {
type = lib.types.enum [
"enforcing"
"log"
];
default = "enforcing";
description = "Seccomp policy for virtiofsd instances. Controls virtiofsd's built-in --seccomp flag. 'enforcing' kills on violation; 'log' only logs.";
};
disableSandbox = lib.mkOption {
type = lib.types.bool;
default = false;
description = "Disable non-seccomp sandboxing for virtiofsd instances. Seccomp is controlled separately by virtiofs.seccompPolicy.";
};
};
sharedDirectories = lib.mkOption {
type = lib.types.attrsOf (
lib.types.submodule {

View file

@ -52,7 +52,7 @@ let
# virtiofsd vhost-user socket args
virtiofsDirArgs = lib.concatMapStringsSep " " (
tag: "--vhost-user type=fs,socket=/run/vmsilo/${vm.name}/virtiofs/${tag}.socket"
tag: "--vhost-user type=fs,socket=/run/vmsilo/${vm.name}/virtiofs-${tag}/${tag}.socket"
) (builtins.attrNames c.effectiveSharedDirs);
# Kernel params wrapped with -p for crosvm CLI

View file

@ -83,8 +83,19 @@ let
/run/vmsilo/${vm.name}/gpu \
/run/vmsilo/${vm.name}/gpu/shader-cache \
/run/vmsilo/${vm.name}/sound
${pkgs.coreutils}/bin/install -d -m 0755 \
/run/vmsilo/${vm.name}/virtiofs
${lib.concatMapStringsSep "\n"
(tag: ''
${pkgs.coreutils}/bin/install -d -m 0755 \
/run/vmsilo/${vm.name}/virtiofs-${tag}
'')
(
builtins.attrNames (mkEffectiveSharedDirs {
inherit (vm) sharedDirectories sharedHome;
vmName = vm.name;
inherit userUid userGid;
})
)
}
'';
};
}
@ -308,7 +319,7 @@ let
"${pkgs.virtiofsd}/bin/virtiofsd"
"--shared-dir ${d.path}"
"--tag ${tag}"
"--socket-path /run/vmsilo/${vm.name}/virtiofs/${tag}.socket"
"--socket-path /run/vmsilo/${vm.name}/virtiofs-${tag}/${tag}.socket"
"--thread-pool-size ${toString d.threadPoolSize}"
"--inode-file-handles=${d.inodeFileHandles}"
"--cache ${d.cache}"
@ -327,6 +338,8 @@ let
++ lib.optional (d.gidMap != null) "--gid-map ${d.gidMap}"
++ lib.optional (d.translateUid != null) "--translate-uid ${d.translateUid}"
++ lib.optional (d.translateGid != null) "--translate-gid ${d.translateGid}"
++ lib.optional (vm.virtiofs.seccompPolicy == "log") "--seccomp log"
++ lib.optional vm.virtiofs.disableSandbox "--sandbox none"
);
in
lib.mapAttrsToList (
@ -337,15 +350,32 @@ let
before = [ "vmsilo-${vm.name}-vm.service" ];
requiredBy = [ "vmsilo-${vm.name}-vm.service" ];
bindsTo = [ "vmsilo-${vm.name}-vm.service" ];
# virtiofsd has built-in sandboxing (--sandbox=namespace): creates mount/PID/network
# namespaces, does pivot_root into the shared directory, drops capabilities to a
# minimal set, and applies its own seccomp filter (--seccomp). We only add hardening
# that doesn't overlap: IPC/UTS namespace isolation and seccomp-based protections
# (clock/modules/logs/personality). No CapabilityBoundingSet or NoNewPrivileges —
# virtiofsd manages its own capabilities via capng and both interfere with that.
serviceConfig = {
Type = "simple";
ExecStart = mkVirtiofsdCmd tag dirConfig;
ExecStopPost = pkgs.writeShellScript "cleanup-virtiofsd-${vm.name}-${tag}" ''
rm -f /run/vmsilo/${vm.name}/virtiofs/${tag}.socket
rm -f /run/vmsilo/${vm.name}/virtiofs-${tag}/${tag}.socket
'';
LimitNOFILE = "1048576";
}
// lib.optionalAttrs (tag == "home" && sharedHomeEnabled) {
ExecStartPre = [ "${createSharedHome}" ];
ExecStartPre = [ "+${createSharedHome}" ];
}
// lib.optionalAttrs (!vm.virtiofs.disableSandbox) {
PrivateIPC = true;
ProtectHostname = true;
ProtectClock = true;
ProtectKernelModules = true;
ProtectKernelLogs = true;
LockPersonality = true;
SystemCallArchitectures = "native";
MemoryDenyWriteExecute = true;
};
}
) effectiveSharedDirs
@ -782,7 +812,6 @@ in
"d /run/vmsilo/${vm.name} 0775 root ${config.users.users.${cfg.user}.group} -"
"d /run/vmsilo/${vm.name}/gpu 0755 ${cfg.user} ${config.users.users.${cfg.user}.group} -"
"d /run/vmsilo/${vm.name}/sound 0755 ${cfg.user} ${config.users.users.${cfg.user}.group} -"
"d /run/vmsilo/${vm.name}/virtiofs 0755 root root -"
"d /var/lib/vmsilo/${vm.name} 0755 root root -"
]) allVms
++ lib.optionals anySharedHome [