diff --git a/docs/plans/2026-02-08-vm-switch-sandboxing-design.md b/docs/plans/2026-02-08-vm-switch-sandboxing-design.md new file mode 100644 index 0000000..7e81340 --- /dev/null +++ b/docs/plans/2026-02-08-vm-switch-sandboxing-design.md @@ -0,0 +1,531 @@ +# vm-switch Sandboxing Design + +## Overview + +vm-switch handles untrusted network data from VMs. This design adds strong sandboxing using Linux namespaces and seccomp, with per-VM process isolation. + +## Threat Model + +Protect against: +- **Arbitrary code execution** - Memory corruption in packet parsing leads to host compromise +- **Information disclosure** - Compromised vm-switch reads sensitive host files +- **Lateral movement** - Compromised vm-switch attacks other services + +Defense: Minimal privileges, per-VM isolation, syscall filtering. + +## Architecture + +### Current (single process, threads) + +``` +vm-switch (main) + ├── thread: banking-backend + ├── thread: shopping-backend + └── thread: router-backend + (shared memory via Arc) +``` + +### New (fork per VM, sandboxed) + +``` +vm-switch (main) ─── lighter sandbox + │ + ├── fork → client-A-backend ─── strict sandbox + ├── fork → client-B-backend ─── strict sandbox + └── fork → router-backend ─── strict sandbox (slightly more permissive) +``` + +Each child handles vhost-user protocol for one VM. Children are sandboxed before processing any packets. + +## Inter-Process Communication + +### SPSC Ring Buffers + +Each VM pair gets a dedicated single-producer single-consumer ring buffer. No shared writers, no coordination overhead. 
+ +**1:N topology (router + clients):** +- Each client has one egress buffer (client → router) +- Each client has one ingress buffer (router → client) +- Router has N egress buffers and N ingress buffers + +``` +┌──────────────┐ ┌──────────────┐ +│ Client A │ │ Router │ +├──────────────┤ ├──────────────┤ +│ egress → R │────────▶│ ingress ← A │ +│ ingress ← R │◀────────│ egress → A │ +└──────────────┘ └──────────────┘ +``` + +**Security property:** Each buffer accepts data from only one source. A compromised producer can only read/corrupt its own outgoing frames. + +### Buffer Ownership + +Children create their own egress buffers and share FDs with main: + +```rust +/// Buffer I created and own - I write to this +struct OwnedRingBuffer { + memfd: OwnedFd, + map: MmapMut, +} + +/// Buffer someone else created - I read from this +struct MappedRingBuffer { + fd: OwnedFd, // received via SCM_RIGHTS + map: MmapMut, +} +``` + +**Startup sequence:** +1. Main forks child +2. Child creates egress buffer (memfd), sends FD to main +3. Main forwards FD to destination (router for clients) +4. Destination maps the buffer +5. Return path: destination sends its egress FD back through main + +### Ring Buffer Layout + +Simple layout with head/tail in buffer: + +``` +┌─────────────────────────────────────┐ +│ head: u64 │ tail: u64 │ padding │ +├─────────────────────────────────────┤ +│ slot[0] │ slot[1] │ ... │ +└─────────────────────────────────────┘ +``` + +Both producer and consumer have read+write access. This is safe because each buffer is dedicated to one producer. 
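As a rough illustration of the layout above, the following sketch computes how many bytes a memfd would need to hold one ring. It assumes the 9216-byte slot payload used elsewhere in this design; `RING_SIZE = 256`, `RingHeader`, and `ring_bytes` are hypothetical names and values chosen for the example, not part of the design.

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicU64;

/// Slot payload size assumed from the design (jumbo frames + headroom).
const MAX_FRAME_SIZE: usize = 9216;
/// Hypothetical slot count; the design does not fix a value.
const RING_SIZE: usize = 256;

/// One fixed-size slot: 4-byte length, 4 bytes of padding, then the payload.
#[allow(dead_code)]
#[repr(C)]
struct Slot {
    len: u32,
    _padding: u32,
    data: [u8; MAX_FRAME_SIZE],
}

/// The head/tail words padded out to one 64-byte cache line.
#[allow(dead_code)]
#[repr(C)]
struct RingHeader {
    head: AtomicU64,
    tail: AtomicU64,
    _padding: [u8; 48],
}

/// Bytes to reserve in the memfd: the header followed by RING_SIZE slots.
fn ring_bytes() -> usize {
    size_of::<RingHeader>() + RING_SIZE * size_of::<Slot>()
}
```

With these assumed values each slot occupies 9224 bytes, so one ring needs a little over 2.25 MiB, and a router serving N clients maps 2N such regions.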
+ +### Ring Buffer Details + +**Slot format (fixed-size):** +```rust +use std::cell::UnsafeCell; +use std::sync::atomic::{fence, AtomicU64, Ordering::{Acquire, Relaxed, Release}}; + +const MAX_FRAME_SIZE: usize = 9216; // jumbo frames + headroom + +#[repr(C)] +struct Slot { + len: u32, // frame length (0 = empty/unused) + _padding: u32, // alignment + data: [u8; MAX_FRAME_SIZE], +} + +#[repr(C)] +struct RingBuffer { + head: AtomicU64, // next write position (producer owns) + tail: AtomicU64, // next read position (consumer owns) + _padding: [u8; 48], // pad to 64 bytes (cache line) + slots: [UnsafeCell<Slot>; RING_SIZE], // UnsafeCell: slots are written through &self +} +``` + +**Producer protocol:** +```rust +fn push(&self, frame: &[u8]) -> bool { + if frame.len() > MAX_FRAME_SIZE { + return false; // oversized - drop frame + } + + let head = self.head.load(Relaxed); // only producer writes head + let tail = self.tail.load(Acquire); // sync with consumer + let next = (head + 1) % RING_SIZE as u64; + + if next == tail { + return false; // full - drop frame + } + + // SAFETY: under the SPSC contract the producer is the only writer of slots[head], + // and the consumer does not read it until head advances. + let slot = unsafe { &mut *self.slots[head as usize].get() }; + slot.data[..frame.len()].copy_from_slice(frame); + slot.len = frame.len() as u32; + + fence(Release); // data visible before head update + self.head.store(next, Relaxed); + + // Signal consumer via eventfd if it was empty + if head == tail { + self.eventfd.write(1); + } + true +} +``` + +**Consumer protocol:** +```rust +fn pop(&self) -> Option<Vec<u8>> { + let tail = self.tail.load(Relaxed); // only consumer writes tail + let head = self.head.load(Acquire); // sync with producer + + if head == tail { + return None; // empty + } + + // SAFETY: the producer does not rewrite slots[tail] until tail advances. + let slot = unsafe { &*self.slots[tail as usize].get() }; + // The producer is untrusted: clamp len so a corrupted value cannot index past the slot. + let len = (slot.len as usize).min(MAX_FRAME_SIZE); + let frame = slot.data[..len].to_vec(); + + self.tail.store((tail + 1) % RING_SIZE as u64, Release); + Some(frame) +} +``` + +**Notification (hybrid):** +- Consumer polls in tight loop while processing frames +- When queue empty, consumer blocks on eventfd +- Producer writes to eventfd after pushing to previously-empty queue +- Balances latency (polling when busy) and CPU usage (sleeping when idle) + +**Buffer full behavior:** +- Producer drops frame and returns false +- No backpressure - behaves like a real network switch under congestion +- Logging/metrics can track 
drop rate + +### Control Channel Protocol + +Each child has a unix socketpair with main for buffer coordination. Messages are serde-serialized (e.g., using postcard or bincode). + +Message types are asymmetric to prevent children from influencing MAC address assignment: + +```rust +/// Messages from main to child +#[derive(Serialize, Deserialize)] +enum MainToChild { + /// "Here's a buffer for receiving from peer `name`" + /// Accompanied by 2 FDs via SCM_RIGHTS: [memfd, eventfd] + /// MAC is authoritative - child must use this for filtering + PutBuffer { name: String, mac: Mac }, + + /// "Peer `name` is gone, clean up associated buffers" + RemoveBuffer { name: String }, +} + +/// Messages from child to main +#[derive(Serialize, Deserialize)] +enum ChildToMain { + /// "I need a buffer to receive from peer `name`" + GetBuffer { name: String }, + + /// "Here's my egress buffer for peer `name`" + /// Accompanied by 2 FDs via SCM_RIGHTS: [memfd, eventfd] + /// No MAC field - main uses MAC from config, not from children + BufferReady { name: String }, +} +``` + +Each buffer has an associated eventfd for notification. Producer creates both the memfd and eventfd; both are passed to the consumer. + +**Security note:** Children cannot specify MAC addresses. Main derives MACs from config files only, preventing children from spoofing other peers' identities. 
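The security note above can be illustrated with a small sketch of how main might turn a child's `BufferReady` into the `PutBuffer` it forwards. This is a minimal sketch under stated assumptions: `Mac` is stood in for by `[u8; 6]`, `MainToChild` is trimmed to one variant, FD passing is omitted, and `forward_buffer_ready` is a hypothetical helper, not a function from the design.

```rust
use std::collections::HashMap;

/// Stand-in for the design's `Mac` type (assumption for this sketch).
type Mac = [u8; 6];

/// Trimmed copy of the main-to-child message type.
#[derive(Debug, PartialEq)]
enum MainToChild {
    PutBuffer { name: String, mac: Mac },
}

/// Hypothetical helper in main. `sender` is the peer name main itself
/// associated with the control socket the message arrived on, so nothing
/// the child wrote into the message influences the MAC that is forwarded.
/// Peers that are missing from the config yield `None`.
fn forward_buffer_ready(sender: &str, config: &HashMap<String, Mac>) -> Option<MainToChild> {
    let mac = *config.get(sender)?;
    Some(MainToChild::PutBuffer {
        name: sender.to_string(),
        mac,
    })
}
```

Keying the lookup on the socket's identity rather than on message contents is what keeps a compromised child from claiming another peer's name or MAC.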
+ +**Example flow (client startup):** +``` +Main Client + │ │ + │◄── GetBuffer { "router" } ───────┤ Client requests router's buffer + │ │ + ├─── PutBuffer { "router", mac } ──►│ Main sends buffer + authoritative MAC + │ + FD via SCM_RIGHTS │ + │ │ + │◄── BufferReady { "router" } ─────┤ Client sends its egress to main + │ + FD via SCM_RIGHTS │ (main knows client's MAC from config) +``` + +**Example flow (new peer notification):** +``` +Main Router + │ │ + ├─── PutBuffer { "client-a", mac }─►│ Main pushes client's egress + MAC + │ + FD │ + │ │ + │◄── BufferReady { "client-a" } ───┤ Router sends its egress for client-a + │ + FD │ +``` + +**Cleanup:** +``` +Main Any Child + │ │ + ├─── RemoveBuffer { "peer-x" } ────►│ Peer disconnected, clean up + │ │ +``` + +This protocol is topology-agnostic and can support future modes beyond 1:N. + +### MAC Filtering + +Each ingress buffer has an associated expected source MAC. Receivers validate source MAC to prevent spoofing: + +```rust +struct IngressBuffer { + ring: MappedRingBuffer, + expected_source_mac: Mac, + accept_broadcast: bool, // true for router-like roles +} + +impl IngressBuffer { + fn read_frame(&self) -> Option { + let frame = self.ring.pop()?; + + // Accept if source matches expected peer + if frame.source_mac == self.expected_source_mac { + return Some(frame); + } + + // Router accepts broadcast/multicast destinations + if self.accept_broadcast && frame.dest_mac.is_multicast() { + return Some(frame); + } + + // Drop spoofed frame + log::warn!( + "Dropping frame with unexpected source MAC: got {}, expected {}", + frame.source_mac, self.expected_source_mac + ); + None + } +} +``` + +**MAC is provided via control channel:** When main sends `PutBuffer`, it includes the expected source MAC. The child stores this with the ingress buffer metadata. + +## Privilege Model + +### Unprivileged Operation + +vm-switch runs as an unprivileged user. 
User namespaces provide "fake root" capabilities: + +``` +Unprivileged user (UID 1000) +│ +├─► unshare(CLONE_NEWUSER) +│ └─► Now "root" inside user namespace with CAP_SYS_ADMIN +│ +├─► Write uid_map/gid_map (map UID 1000 → 0 inside) +│ +└─► unshare(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET) + └─► These succeed with user namespace capabilities +``` + +From the host kernel's perspective, vm-switch is always UID 1000. If compromised, the attacker cannot affect host resources. + +## Sandboxing Details + +### Main Process Sandbox + +**Namespaces:** +- User (map real UID to root inside) +- Mount (only config dir visible) +- IPC (isolated) +- Network (empty, unix sockets via inherited FDs) + +**Sequence:** +``` +vm-switch starts (unprivileged) +│ +├─► unshare(USER | MOUNT | IPC | NETWORK) +├─► Write "deny" to /proc/self/setgroups +├─► Write UID/GID mappings +│ +├─► Mount minimal filesystem: +│ tmpfs as new root +│ bind mount --config-dir (read-write) +├─► pivot_root to new root +├─► Apply seccomp filter +│ +└─► Event loop +``` + +**Seccomp whitelist:** +``` +# Child management +fork, clone, waitpid, wait4 + +# Config watching +inotify_init1, inotify_add_watch, inotify_rm_watch + +# Unix sockets (AF_UNIX only) +socket(AF_UNIX), bind, listen, accept4, sendmsg, recvmsg + +# Namespace setup for children +unshare, setns, pivot_root, mount, umount2 +prctl, seccomp +setresuid, setresgid, setgroups + +# Memory +mmap(!PROT_EXEC), munmap, madvise, memfd_create, ftruncate + +# Event loop +epoll_create1, epoll_ctl, epoll_wait, read, write + +# Misc +close, exit_group, clock_gettime, sigaction, rt_sigprocmask +``` + +### Child Process Sandbox + +**Namespaces:** Same as main (inherited or created fresh per child). 
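Whether the namespaces are inherited or created fresh, the user-namespace step reduces to three small writes under `/proc/self` after `unshare(CLONE_NEWUSER)`. The sketch below only builds the file contents, in the order they would be written; `id_map_writes` is a hypothetical helper name, and the actual `unshare` call and file writes are left out.

```rust
/// Returns the (path, contents) pairs to write, in order, after entering a
/// new user namespace. Each map line has the form
/// "<id inside> <id outside> <count>", here mapping one host ID to root.
/// "deny" must reach setgroups before gid_map is written by an
/// unprivileged process.
fn id_map_writes(host_uid: u32, host_gid: u32) -> [(&'static str, String); 3] {
    [
        ("/proc/self/setgroups", "deny".to_string()),
        ("/proc/self/uid_map", format!("0 {} 1", host_uid)),
        ("/proc/self/gid_map", format!("0 {} 1", host_gid)),
    ]
}
```

For the UID 1000 example used in the privilege model, both map files would contain the single line `0 1000 1`.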
+ +**Sandbox timing:** After setup, before processing packets: +``` +fork() + │ + ├─► Create memfd, mmap buffers + ├─► Exchange FDs with main + ├─► Set up vhost-user listener socket + │ + ├─► Apply sandbox (namespaces + seccomp) + │ ─────── security boundary ─────── + │ + └─► Accept vhost-user connection, process packets +``` + +**Base child seccomp whitelist:** +``` +# Vhost-user protocol +accept4, read, write, recvmsg, sendmsg + +# Guest memory mapping +mmap(!PROT_EXEC, MAP_SHARED), munmap, madvise + +# Event loop + ring buffer notification +epoll_create1, epoll_ctl, epoll_wait, eventfd2 + +# Synchronization +futex + +# Misc +close, exit_group, clock_gettime +``` + +**Extended seccomp whitelist (for children that create buffers dynamically, e.g., router):** +``` + + +# Dynamic buffer creation for new peers +memfd_create, ftruncate +``` + +### Seccomp Violation Handling + +- Default: `SCMP_ACT_KILL_PROCESS` (immediate termination) +- Debug mode: `--seccomp-trap` flag uses `SCMP_ACT_TRAP` (SIGSYS signal, allows logging) + +## Implementation + +### Dependencies + +- **seccompiler** - Pure Rust seccomp BPF compilation (used by Firecracker) +- **nix** - Rust bindings for namespace operations, mount, pivot_root + +No C dependencies. Pure Rust implementation. 
+ +### Seccompiler Usage + +```rust +use std::collections::BTreeMap; +use std::convert::TryInto; + +use seccompiler::{ + BpfProgram, SeccompAction, SeccompCmpArgLen as ArgLen, SeccompCmpOp as Cmp, + SeccompCondition, SeccompFilter, SeccompRule, TargetArch, +}; + +let mut rules = BTreeMap::new(); + +// Allow read/write unconditionally (an empty rule list matches on the syscall number alone) +rules.insert(libc::SYS_read, vec![]); +rules.insert(libc::SYS_write, vec![]); + +// Allow socket() only for AF_UNIX +rules.insert(libc::SYS_socket, vec![ + SeccompRule::new(vec![ + SeccompCondition::new(0, ArgLen::Dword, Cmp::Eq, libc::AF_UNIX as u64)?, + ])?, +]); + +// Allow mmap() without PROT_EXEC +rules.insert(libc::SYS_mmap, vec![ + SeccompRule::new(vec![ + SeccompCondition::new(2, ArgLen::Dword, Cmp::MaskedEq(libc::PROT_EXEC as u64), 0)?, + ])?, +]); + +// Action for syscalls that match no rule +let action = if trap_mode { + SeccompAction::Trap +} else { + SeccompAction::KillProcess +}; + +// Arguments: rules, mismatch action, match action, target architecture +let filter = SeccompFilter::new(rules, action, SeccompAction::Allow, TargetArch::x86_64)?; +let program: BpfProgram = filter.try_into()?; +seccompiler::apply_filter(&program)?; +``` + +## Process Lifecycle + +### Main Process Responsibilities + +Main is the coordinator: +- Watches config dir for MAC file changes (inotify) +- Forks children for each VM +- Initiates buffer handoff via `PutBuffer` when new peer appears +- Forwards `BufferReady` messages between children +- Sends `RemoveBuffer` when peer disappears or crashes +- Monitors children via `waitpid()` + +Children are passive responders to main's messages. + +### Child Crash Detection + +Main detects child crashes via standard Unix mechanisms: +1. Kernel sends `SIGCHLD` to main when child exits +2. Main calls `waitpid()` to reap child and get exit status +3. Main sends `RemoveBuffer { name }` to all peers that had buffers with crashed child +4. Main can optionally respawn the child + +### Main Crash Handling + +Children detect main crash via control socket: +1. Control socket returns EOF or error +2. Child exits cleanly +3. Systemd restarts main process +4. Main respawns all children fresh + +Children do not attempt to continue operating without main. 
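The first step of main-crash handling, noticing EOF or an error on the control socket, can be sketched with the standard library alone. This is a minimal illustration, not the design's implementation: `main_is_gone` is a hypothetical helper, the 64-byte buffer is arbitrary, and real code would decode the control message instead of discarding it.

```rust
use std::io::{ErrorKind, Read};
use std::os::unix::net::UnixStream;

/// Hypothetical child-side check: returns true once the control socket
/// shows that main has closed its end or the socket has failed.
fn main_is_gone(control: &mut UnixStream) -> bool {
    let mut buf = [0u8; 64];
    match control.read(&mut buf) {
        // A zero-length read on a stream socket means the peer closed it.
        Ok(0) => true,
        // Data arrived; a real child would decode the control message here.
        Ok(_) => false,
        // Transient conditions are not evidence that main has died.
        Err(e) if matches!(e.kind(), ErrorKind::Interrupted | ErrorKind::WouldBlock) => false,
        // Any other error is treated as loss of main.
        Err(_) => true,
    }
}
```

A child would typically run this check when its event loop reports the control socket as readable, then exit if it returns true.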
+ +### Buffer Cleanup + +**Ownership:** Producer (creator) owns the memfd. + +**Normal cleanup:** +1. Child exits normally +2. Kernel closes child's FDs automatically +3. Peers receive `RemoveBuffer`, unmap and close their FD references +4. When last reference closed, kernel frees the memfd + +**Crash cleanup:** +1. Child crashes +2. Kernel closes child's FDs (same as normal exit) +3. Main detects via `waitpid()`, sends `RemoveBuffer` to peers +4. Same cleanup path as normal + +**RemoveBuffer race conditions:** +- On `RemoveBuffer`, consumer stops reading immediately +- In-flight frames in that buffer are discarded (acceptable) +- Producer may write briefly after consumer unmaps (harmless - just wasted work) +- No synchronization needed between producer and consumer for cleanup + +## Logging + +Children inherit stdout/stderr from main. Systemd captures output and forwards to journald. No special handling needed. + +## Shutdown + +1. Main receives SIGTERM +2. Main closes control sockets to children +3. Children detect EOF, exit cleanly +4. Main calls `waitpid()` with timeout +5. SIGKILL any remaining children after timeout + +## Memory Sharing with crosvm + +vhost-user protocol passes guest memory via `SCM_RIGHTS` FD passing. This works across different UIDs: + +1. crosvm creates memfd for guest memory +2. crosvm sends FD via unix socket +3. vm-switch child receives FD (now has capability to access) +4. vm-switch child mmaps the FD + +The FD transfer grants access regardless of UID differences. + +## Future Considerations + +- Running crosvm unprivileged (each VM as separate user) - compatible with this design +- Additional Landlock filesystem restrictions if needed +- Audit logging of seccomp violations in production