Add vm-switch sandboxing design document

Design for adding strong sandboxing to vm-switch using Linux namespaces and seccomp, with per-VM process isolation.

Key elements:
- Fork per VM instead of threads, each child sandboxed
- SPSC ring buffers for inter-process frame routing
- Unprivileged operation via user namespaces
- seccompiler + nix for pure Rust implementation
- Asymmetric control protocol preventing MAC spoofing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 02:05:01 +00:00 · 2026-02-08 02:05:01 +00:00 · 7b60a0d688
commit 7b60a0d688
parent d1117314eb
1 changed files with 531 additions and 0 deletions
--- a/docs/plans/2026-02-08-vm-switch-sandboxing-design.md
+++ b/docs/plans/2026-02-08-vm-switch-sandboxing-design.md
@@ -0,0 +1,531 @@
+# vm-switch Sandboxing Design
+
+## Overview
+
+vm-switch handles untrusted network data from VMs. This design adds strong sandboxing using Linux namespaces and seccomp, with per-VM process isolation.
+
+## Threat Model
+
+Protect against:
+- **Arbitrary code execution** - Memory corruption in packet parsing leads to host compromise
+- **Information disclosure** - Compromised vm-switch reads sensitive host files
+- **Lateral movement** - Compromised vm-switch attacks other services
+
+Defense: Minimal privileges, per-VM isolation, syscall filtering.
+
+## Architecture
+
+### Current (single process, threads)
+
+```
+vm-switch (main)
+  ├── thread: banking-backend
+  ├── thread: shopping-backend
+  └── thread: router-backend
+      (shared memory via Arc<RwLock>)
+```
+
+### New (fork per VM, sandboxed)
+
+```
+vm-switch (main) ─── lighter sandbox
+  │
+  ├── fork → client-A-backend ─── strict sandbox
+  ├── fork → client-B-backend ─── strict sandbox
+  └── fork → router-backend ─── strict sandbox (slightly more permissive)
+```
+
+Each child handles vhost-user protocol for one VM. Children are sandboxed before processing any packets.
+
+## Inter-Process Communication
+
+### SPSC Ring Buffers
+
+Each VM pair gets two dedicated single-producer single-consumer ring buffers, one per direction. With exactly one writer per buffer, no locking or writer coordination is needed.
+
+**1:N topology (router + clients):**
+- Each client has one egress buffer (client → router)
+- Each client has one ingress buffer (router → client)
+- Router has N egress buffers and N ingress buffers
+
+```
+┌──────────────┐         ┌──────────────┐
+│   Client A   │         │    Router    │
+├──────────────┤         ├──────────────┤
+│ egress → R   │────────▶│ ingress ← A  │
+│ ingress ← R  │◀────────│ egress → A   │
+└──────────────┘         └──────────────┘
+```
+
+**Security property:** Each buffer accepts data from only one source. A compromised producer can only read/corrupt its own outgoing frames.
+
+### Buffer Ownership
+
+Children create their own egress buffers and share FDs with main:
+
+```rust
+/// Buffer I created and own - I write to this
+struct OwnedRingBuffer {
+    memfd: OwnedFd,
+    map: MmapMut,
+}
+
+/// Buffer someone else created - I read from this
+struct MappedRingBuffer {
+    fd: OwnedFd,       // received via SCM_RIGHTS
+    map: MmapMut,
+}
+```
+
+**Startup sequence:**
+1. Main forks child
+2. Child creates egress buffer (memfd), sends FD to main
+3. Main forwards FD to destination (router for clients)
+4. Destination maps the buffer
+5. Return path: destination sends its egress FD back through main
+
+### Ring Buffer Layout
+
+Simple layout with head/tail in buffer:
+
+```
+┌─────────────────────────────────────┐
+│  head: u64  │  tail: u64  │ padding │
+├─────────────────────────────────────┤
+│  slot[0]  │  slot[1]  │  ...        │
+└─────────────────────────────────────┘
+```
+
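+Both producer and consumer map the buffer read+write: the producer updates `head` and the slots, the consumer updates `tail`. The damage a compromised peer can do stays contained because each buffer serves exactly one producer and one consumer, so only the frames on that one link can be corrupted.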
+
+### Ring Buffer Details
+
+**Slot format (fixed-size):**
+```rust
+const MAX_FRAME_SIZE: usize = 9216;  // jumbo frames + headroom
+
+#[repr(C)]
+struct Slot {
+    len: u32,                        // frame length (0 = empty/unused)
+    _padding: u32,                   // alignment
+    data: [u8; MAX_FRAME_SIZE],
+}
+
+#[repr(C)]
+struct RingBuffer {
+    head: AtomicU64,                 // next write position (producer owns)
+    tail: AtomicU64,                 // next read position (consumer owns)
+    _padding: [u8; 48],              // pad to 64 bytes (cache line)
+    slots: [Slot; RING_SIZE],
+}
+```
+
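As a sanity check on the layout above, the following sketch computes the mapping size and slot offsets from the same `#[repr(C)]` definitions. `RING_SIZE = 64`, `HEADER_SIZE`, `ring_buffer_bytes`, and `slot_offset` are assumptions made for illustration; the design does not fix a ring size or name these helpers.

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicU64;

const MAX_FRAME_SIZE: usize = 9216;
const RING_SIZE: usize = 64; // assumed value, not specified by the design

#[repr(C)]
struct Slot {
    len: u32,
    _padding: u32,
    data: [u8; MAX_FRAME_SIZE],
}

#[repr(C)]
struct RingBuffer {
    head: AtomicU64,
    tail: AtomicU64,
    _padding: [u8; 48],
    slots: [Slot; RING_SIZE],
}

/// Bytes occupied by head, tail and padding before the first slot.
const HEADER_SIZE: usize = 64;

// Compile-time check that the header really is one 64-byte cache line
// and that repr(C) adds no hidden padding between slots.
const _: () = assert!(size_of::<RingBuffer>() == HEADER_SIZE + RING_SIZE * size_of::<Slot>());

/// Size the producer would give the memfd (e.g. via ftruncate) before mapping it.
fn ring_buffer_bytes() -> usize {
    size_of::<RingBuffer>()
}

/// Byte offset of slot `i` from the start of the mapping.
fn slot_offset(i: usize) -> usize {
    HEADER_SIZE + i * size_of::<Slot>()
}
```

With these numbers each slot takes 9224 bytes, so a 64-slot ring needs a little under 600 KiB per direction.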
+**Producer protocol:**
+```rust
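+fn push(&mut self, frame: &[u8]) -> bool {
+    if frame.is_empty() || frame.len() > MAX_FRAME_SIZE {
+        return false;                         // len 0 means "empty slot"; oversize cannot fit
+    }
+
+    let head = self.head.load(Relaxed);      // only producer writes head
+    let tail = self.tail.load(Acquire);      // sync with consumer
+    let next = (head + 1) % RING_SIZE as u64;
+
+    if next == tail {
+        return false;                         // full - drop frame
+    }
+
+    // head lives in shared memory, so reduce it before using it as an index
+    let slot = &mut self.slots[head as usize % RING_SIZE];
+    slot.data[..frame.len()].copy_from_slice(frame);
+    slot.len = frame.len() as u32;
+
+    self.head.store(next, Release);           // data visible before head update
+
+    // Signal consumer via eventfd if the queue was empty before this push.
+    // tail is re-read after publishing head: the value loaded above may be
+    // stale, and a consumer that drained the queue since then would otherwise
+    // sleep with a frame pending.
+    fence(SeqCst);
+    if self.tail.load(Relaxed) == head {
+        self.eventfd.write(1);
+    }
+    true
+}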
+```
+
+**Consumer protocol:**
+```rust
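+fn pop(&self) -> Option<Vec<u8>> {
+    let tail = self.tail.load(Relaxed);      // only consumer writes tail
+    let head = self.head.load(Acquire);      // sync with producer
+
+    if head == tail {
+        return None;                          // empty
+    }
+
+    let slot = &self.slots[tail as usize % RING_SIZE];
+    // len is written by the (possibly compromised) producer: read it once
+    // and clamp it so an oversized value cannot cause an out-of-bounds slice
+    let len = (slot.len as usize).min(MAX_FRAME_SIZE);
+    let frame = slot.data[..len].to_vec();
+
+    self.tail.store((tail + 1) % RING_SIZE as u64, Release);
+    Some(frame)
+}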
+```
+
+**Notification (hybrid):**
+- Consumer polls in tight loop while processing frames
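+- When queue empty, consumer re-checks `head` once more, then blocks on eventfd
+- Producer writes to eventfd after pushing to previously-empty queue; the eventfd counter persists, so a write that lands just before the consumer blocks still wakes it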
+- Balances latency (polling when busy) and CPU usage (sleeping when idle)
+
+**Buffer full behavior:**
+- Producer drops frame and returns false
+- No backpressure - behaves like a real network switch under congestion
+- Logging/metrics can track drop rate
+
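The index protocol described above can be tried out in a single process. The following is a minimal sketch, not the shared-memory implementation: it keeps the slots in ordinary heap memory, omits the eventfd, and uses a deliberately tiny ring. The `Ring`, `Producer`, `Consumer`, and `ring()` names are assumptions made for this example.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering::{Acquire, Relaxed, Release}};
use std::sync::Arc;

const RING_SIZE: usize = 4; // tiny on purpose so the "full" case is easy to reach

struct Ring {
    head: AtomicUsize, // next write position, stored only by the producer
    tail: AtomicUsize, // next read position, stored only by the consumer
    slots: [UnsafeCell<Vec<u8>>; RING_SIZE],
}

// Safety: the Producer/Consumer split below ensures a slot is never
// written and read at the same time.
unsafe impl Sync for Ring {}

pub struct Producer(Arc<Ring>);
pub struct Consumer(Arc<Ring>);

/// Creates one ring and hands out its two (non-cloneable) ends.
pub fn ring() -> (Producer, Consumer) {
    let r = Arc::new(Ring {
        head: AtomicUsize::new(0),
        tail: AtomicUsize::new(0),
        slots: std::array::from_fn(|_| UnsafeCell::new(Vec::new())),
    });
    (Producer(r.clone()), Consumer(r))
}

impl Producer {
    /// Returns false when the ring is full; the frame is dropped.
    pub fn push(&mut self, frame: &[u8]) -> bool {
        let r = &*self.0;
        let head = r.head.load(Relaxed);
        let next = (head + 1) % RING_SIZE;
        if next == r.tail.load(Acquire) {
            return false;
        }
        // Safety: slot `head` is not yet published, so the consumer does not touch it.
        unsafe { *r.slots[head].get() = frame.to_vec() };
        r.head.store(next, Release); // publish the slot contents
        true
    }
}

impl Consumer {
    /// Returns None when the ring is empty.
    pub fn pop(&mut self) -> Option<Vec<u8>> {
        let r = &*self.0;
        let tail = r.tail.load(Relaxed);
        if tail == r.head.load(Acquire) {
            return None;
        }
        // Safety: the Acquire load above pairs with the producer's Release store.
        let frame = unsafe { std::mem::take(&mut *r.slots[tail].get()) };
        r.tail.store((tail + 1) % RING_SIZE, Release);
        Some(frame)
    }
}
```

Because `head == tail` means "empty", one slot always stays unused: a ring of `RING_SIZE` slots holds at most `RING_SIZE - 1` frames before `push` starts dropping.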
+### Control Channel Protocol
+
+Each child has a unix socketpair with main for buffer coordination. Messages are serde-serialized (e.g., using postcard or bincode).
+
+Message types are asymmetric to prevent children from influencing MAC address assignment:
+
+```rust
+/// Messages from main to child
+#[derive(Serialize, Deserialize)]
+enum MainToChild {
+    /// "Here's a buffer for receiving from peer `name`"
+    /// Accompanied by 2 FDs via SCM_RIGHTS: [memfd, eventfd]
+    /// MAC is authoritative - child must use this for filtering
+    PutBuffer { name: String, mac: Mac },
+
+    /// "Peer `name` is gone, clean up associated buffers"
+    RemoveBuffer { name: String },
+}
+
+/// Messages from child to main
+#[derive(Serialize, Deserialize)]
+enum ChildToMain {
+    /// "I need a buffer to receive from peer `name`"
+    GetBuffer { name: String },
+
+    /// "Here's my egress buffer for peer `name`"
+    /// Accompanied by 2 FDs via SCM_RIGHTS: [memfd, eventfd]
+    /// No MAC field - main uses MAC from config, not from children
+    BufferReady { name: String },
+}
+```
+
+Each buffer has an associated eventfd for notification. Producer creates both the memfd and eventfd; both are passed to the consumer.
+
+**Security note:** Children cannot specify MAC addresses. Main derives MACs from config files only, preventing children from spoofing other peers' identities.
+
+**Example flow (client startup):**
+```
+Main                              Client
+  │                                  │
+  │◄── GetBuffer { "router" } ───────┤  Client requests router's buffer
+  │                                  │
+  ├─── PutBuffer { "router", mac } ──►│  Main sends buffer + authoritative MAC
+  │    + FDs via SCM_RIGHTS          │
+  │                                  │
+  │◄── BufferReady { "router" } ─────┤  Client sends its egress to main
+  │    + FDs via SCM_RIGHTS          │  (main knows client's MAC from config)
+```
+
+**Example flow (new peer notification):**
+```
+Main                              Router
+  │                                  │
+  ├─── PutBuffer { "client-a", mac }─►│  Main pushes client's egress + MAC
+  │    + FDs                         │
+  │                                  │
+  │◄── BufferReady { "client-a" } ───┤  Router sends its egress for client-a
+  │    + FDs                         │
+```
+
+**Cleanup:**
+```
+Main                              Any Child
+  │                                  │
+  ├─── RemoveBuffer { "peer-x" } ────►│  Peer disconnected, clean up
+  │                                  │
+```
+
+This protocol is topology-agnostic and can support future modes beyond 1:N.
+
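To make the asymmetry concrete, here is a sketch of how main might turn a child's `BufferReady` into the `PutBuffer` it forwards. It leaves out serde and FD passing, and the `Config` struct and `on_buffer_ready` function are hypothetical names, not part of the design.

```rust
use std::collections::HashMap;

type Mac = [u8; 6];

#[derive(Debug, PartialEq)]
enum MainToChild {
    PutBuffer { name: String, mac: Mac },
    RemoveBuffer { name: String },
}

#[derive(Debug, PartialEq)]
enum ChildToMain {
    GetBuffer { name: String },
    BufferReady { name: String },
}

/// MACs loaded from config files, keyed by VM name (assumed structure).
struct Config {
    macs: HashMap<String, Mac>,
}

/// `sender` is the child whose control socket the message arrived on.
/// Returns the destination child and the message to forward to it,
/// or None if the message is not forwarded.
fn on_buffer_ready(
    cfg: &Config,
    sender: &str,
    msg: &ChildToMain,
) -> Option<(String, MainToChild)> {
    match msg {
        ChildToMain::BufferReady { name: dest } => {
            // The MAC is looked up by the sender's identity, which main knows
            // from the socket itself; nothing in the message can override it.
            let mac = *cfg.macs.get(sender)?;
            if !cfg.macs.contains_key(dest) {
                return None; // unknown destination
            }
            Some((
                dest.clone(),
                MainToChild::PutBuffer { name: sender.to_string(), mac },
            ))
        }
        ChildToMain::GetBuffer { .. } => None, // handled elsewhere in this sketch
    }
}
```

The point of the sketch is that `ChildToMain` has no field through which a MAC could travel, so the only source for `PutBuffer.mac` is the config lookup.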
+### MAC Filtering
+
+Each ingress buffer has an associated expected source MAC. Receivers validate source MAC to prevent spoofing:
+
+```rust
+struct IngressBuffer {
+    ring: MappedRingBuffer,
+    expected_source_mac: Mac,
+    accept_broadcast: bool,  // true for router-like roles
+}
+
+impl IngressBuffer {
+    fn read_frame(&self) -> Option<Frame> {
+        let frame = self.ring.pop()?;
+
+        // Accept if source matches expected peer
+        if frame.source_mac == self.expected_source_mac {
+            return Some(frame);
+        }
+
+        // Router accepts broadcast/multicast destinations
+        if self.accept_broadcast && frame.dest_mac.is_multicast() {
+            return Some(frame);
+        }
+
+        // Drop spoofed frame
+        log::warn!(
+            "Dropping frame with unexpected source MAC: got {}, expected {}",
+            frame.source_mac, self.expected_source_mac
+        );
+        None
+    }
+}
+```
+
+**MAC is provided via control channel:** When main sends `PutBuffer`, it includes the expected source MAC. The child stores this with the ingress buffer metadata.
+
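The `frame.source_mac` and `frame.dest_mac` fields used above have to be read out of the raw bytes returned by `pop()`. A minimal sketch of that step, assuming untagged Ethernet II framing, could look like this; the helper names are illustrative only, and the broadcast policy itself is left to `IngressBuffer`.

```rust
use std::convert::TryInto;

type Mac = [u8; 6];

/// Ethernet header: destination MAC (bytes 0..6), source MAC (6..12), EtherType (12..14).
const ETH_HEADER_LEN: usize = 14;

/// Extracts the source MAC, or None if the frame is too short to have a header.
fn source_mac(frame: &[u8]) -> Option<Mac> {
    if frame.len() < ETH_HEADER_LEN {
        return None;
    }
    frame[6..12].try_into().ok()
}

/// True for multicast and broadcast destinations: the group bit is the least
/// significant bit of the first destination byte.
fn dest_is_multicast(frame: &[u8]) -> bool {
    frame.first().map_or(false, |b| b & 1 == 1)
}

/// True only if the frame carries exactly the MAC that main supplied in PutBuffer.
fn source_matches(frame: &[u8], expected: &Mac) -> bool {
    source_mac(frame).as_ref() == Some(expected)
}
```

Truncated frames fail the check rather than panic, which matters because the bytes come from a peer that may be compromised.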
+## Privilege Model
+
+### Unprivileged Operation
+
+vm-switch runs as an unprivileged user. User namespaces provide "fake root" capabilities:
+
+```
+Unprivileged user (UID 1000)
+│
+├─► unshare(CLONE_NEWUSER)
+│   └─► Now "root" inside user namespace with CAP_SYS_ADMIN
+│
+├─► Write uid_map/gid_map (map UID 1000 → 0 inside)
+│
+└─► unshare(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET)
+    └─► These succeed with user namespace capabilities
+```
+
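+From the host kernel's perspective, vm-switch is always UID 1000. If compromised, the attacker gains no host privileges beyond those UID 1000 already has; the "root" capabilities apply only to resources owned by the new namespaces.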
+
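For reference, the mapping step comes down to three small writes under `/proc/self`. The sketch below only builds the strings and their order; it does not call `unshare` or touch `/proc`. The function names are assumptions for illustration.

```rust
/// One line of uid_map or gid_map: "<id inside ns> <id outside ns> <count>".
fn id_map_line(inside: u32, outside: u32, count: u32) -> String {
    format!("{} {} {}\n", inside, outside, count)
}

/// The (path, contents) writes to perform after unshare(CLONE_NEWUSER).
/// "deny" must be written to setgroups before an unprivileged process
/// is allowed to write gid_map.
fn userns_setup_writes(uid: u32, gid: u32) -> Vec<(&'static str, String)> {
    vec![
        ("/proc/self/setgroups", "deny".to_string()),
        ("/proc/self/uid_map", id_map_line(0, uid, 1)),
        ("/proc/self/gid_map", id_map_line(0, gid, 1)),
    ]
}
```

For UID 1000 this yields the map line `0 1000 1`: ID 0 inside the namespace corresponds to ID 1000 outside, for a range of one ID.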
+## Sandboxing Details
+
+### Main Process Sandbox
+
+**Namespaces:**
+- User (map real UID to root inside)
+- Mount (only config dir visible)
+- IPC (isolated)
+- Network (empty, unix sockets via inherited FDs)
+
+**Sequence:**
+```
+vm-switch starts (unprivileged)
+│
+├─► unshare(USER | MOUNT | IPC | NETWORK)
+├─► Write "deny" to /proc/self/setgroups
+├─► Write UID/GID mappings
+│
+├─► Mount minimal filesystem:
+│     tmpfs as new root
+│     bind mount --config-dir (read-write)
+├─► pivot_root to new root
+├─► Apply seccomp filter
+│
+└─► Event loop
+```
+
+**Seccomp whitelist:**
+```
+# Child management (libc waitpid() is implemented with wait4 on x86_64)
+fork, clone, wait4
+
+# Config watching
+inotify_init1, inotify_add_watch, inotify_rm_watch
+
+# Unix sockets (AF_UNIX only)
+socket(AF_UNIX), bind, listen, accept4, sendmsg, recvmsg
+
+# Namespace setup for children
+unshare, setns, pivot_root, mount, umount2
+prctl, seccomp
+setresuid, setresgid, setgroups
+
+# Memory
+mmap(!PROT_EXEC), munmap, madvise, memfd_create, ftruncate
+
+# Event loop
+epoll_create1, epoll_ctl, epoll_wait, read, write
+
+# Misc
+close, exit_group, clock_gettime, rt_sigaction, rt_sigprocmask
+```
+
+### Child Process Sandbox
+
+**Namespaces:** Same as main (inherited or created fresh per child).
+
+**Sandbox timing:** After setup, before processing packets:
+```
+fork()
+  │
+  ├─► Create memfd, mmap buffers
+  ├─► Exchange FDs with main
+  ├─► Set up vhost-user listener socket
+  │
+  ├─► Apply sandbox (namespaces + seccomp)
+  │         ─────── security boundary ───────
+  │
+  └─► Accept vhost-user connection, process packets
+```
+
+**Base child seccomp whitelist:**
+```
+# Vhost-user protocol
+accept4, read, write, recvmsg, sendmsg
+
+# Guest memory mapping
+mmap(!PROT_EXEC, MAP_SHARED), munmap, madvise
+
+# Event loop + ring buffer notification
+epoll_create1, epoll_ctl, epoll_wait, eventfd2
+
+# Synchronization
+futex
+
+# Misc
+close, exit_group, clock_gettime
+```
+
+**Extended seccomp whitelist (for children that create buffers dynamically, e.g., router):**
+```
+<all base child syscalls>
+
+# Dynamic buffer creation for new peers
+memfd_create, ftruncate
+```
+
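One way to keep the base and extended lists from drifting apart is to derive the second from the first. The sketch below works on syscall names only and is independent of seccompiler; `BASE_CHILD`, `DYNAMIC_BUFFERS`, and `child_allowlist` are hypothetical names.

```rust
use std::collections::BTreeSet;

/// Syscalls every child may use (names taken from the base whitelist above).
const BASE_CHILD: &[&str] = &[
    "accept4", "read", "write", "recvmsg", "sendmsg",
    "mmap", "munmap", "madvise",
    "epoll_create1", "epoll_ctl", "epoll_wait", "eventfd2",
    "futex",
    "close", "exit_group", "clock_gettime",
];

/// Extra syscalls for children that create buffers after sandboxing.
const DYNAMIC_BUFFERS: &[&str] = &["memfd_create", "ftruncate"];

/// Builds the allowlist for one child; the router would pass `true`.
fn child_allowlist(creates_buffers_dynamically: bool) -> BTreeSet<&'static str> {
    let mut set: BTreeSet<&'static str> = BASE_CHILD.iter().copied().collect();
    if creates_buffers_dynamically {
        set.extend(DYNAMIC_BUFFERS.iter().copied());
    }
    set
}
```

Argument-level conditions such as `mmap(!PROT_EXEC, MAP_SHARED)` are not modelled here; they would be attached when the names are turned into filter rules.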
+### Seccomp Violation Handling
+
+- Default: `SeccompAction::KillProcess` (kernel action `SECCOMP_RET_KILL_PROCESS`, immediate termination)
+- Debug mode: `--seccomp-trap` flag uses `SeccompAction::Trap` (kernel action `SECCOMP_RET_TRAP`; delivers SIGSYS, allows logging)
+
+## Implementation
+
+### Dependencies
+
+- **seccompiler** - Pure Rust seccomp BPF compilation (used by Firecracker)
+- **nix** - Rust bindings for namespace operations, mount, pivot_root
+
+No C dependencies. Pure Rust implementation.
+
+### Seccompiler Usage
+
+```rust
+use std::collections::BTreeMap;
+
+use seccompiler::{
+    apply_filter, BpfProgram, SeccompAction, SeccompCmpArgLen as ArgLen, SeccompCmpOp as Cmp,
+    SeccompCondition, SeccompFilter, SeccompRule, TargetArch,
+};
+
+let mut rules: BTreeMap<i64, Vec<SeccompRule>> = BTreeMap::new();
+
+// Allow read/write unconditionally: an empty rule list matches every call
+// (SeccompRule::new rejects an empty condition list)
+rules.insert(libc::SYS_read, vec![]);
+rules.insert(libc::SYS_write, vec![]);
+
+// Allow socket() only for AF_UNIX
+rules.insert(libc::SYS_socket, vec![
+    SeccompRule::new(vec![
+        SeccompCondition::new(0, ArgLen::Dword, Cmp::Eq, libc::AF_UNIX as u64)?,
+    ])?,
+]);
+
+// Allow mmap() without PROT_EXEC
+rules.insert(libc::SYS_mmap, vec![
+    SeccompRule::new(vec![
+        SeccompCondition::new(2, ArgLen::Dword, Cmp::MaskedEq(libc::PROT_EXEC as u64), 0)?,
+    ])?,
+]);
+
+// Syscalls with no matching rule get `action`; matching ones are allowed
+let action = if trap_mode {
+    SeccompAction::Trap
+} else {
+    SeccompAction::KillProcess
+};
+
+let filter = SeccompFilter::new(rules, action, SeccompAction::Allow, TargetArch::x86_64)?;
+let program: BpfProgram = filter.try_into()?;
+apply_filter(&program)?;
+```
+
+## Process Lifecycle
+
+### Main Process Responsibilities
+
+Main is the coordinator:
+- Watches config dir for MAC file changes (inotify)
+- Forks children for each VM
+- Initiates buffer handoff via `PutBuffer` when new peer appears
+- Forwards `BufferReady` messages between children
+- Sends `RemoveBuffer` when peer disappears or crashes
+- Monitors children via `waitpid()`
+
+Apart from the initial `GetBuffer` request at startup, children only respond to main's messages.
+
+### Child Crash Detection
+
+Main detects child crashes via standard Unix mechanisms:
+1. Kernel sends `SIGCHLD` to main when child exits
+2. Main calls `waitpid()` to reap child and get exit status
+3. Main sends `RemoveBuffer { name }` to all peers that had buffers with crashed child
+4. Main can optionally respawn the child
+
+### Main Crash Handling
+
+Children detect main crash via control socket:
+1. Control socket returns EOF or error
+2. Child exits cleanly
+3. Systemd restarts main process
+4. Main respawns all children fresh
+
+Children do not attempt to continue operating without main.
+
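EOF detection on the control socket needs nothing beyond a read that returns zero bytes. The sketch below shows this with a plain `read` on a `UnixStream`; the real child would use `recvmsg` to also receive FDs, and `ControlEvent` and `read_control` are names invented for this example.

```rust
use std::io::{ErrorKind, Read};
use std::os::unix::net::UnixStream;

/// What the child learned from one read on its control socket.
#[derive(Debug, PartialEq)]
enum ControlEvent {
    Data(Vec<u8>),
    MainGone,
}

/// Blocks until main sends something or goes away.
fn read_control(sock: &mut UnixStream) -> ControlEvent {
    let mut buf = [0u8; 512];
    loop {
        match sock.read(&mut buf) {
            // A zero-length read on a stream socket means the peer closed it.
            Ok(0) => return ControlEvent::MainGone,
            Ok(n) => return ControlEvent::Data(buf[..n].to_vec()),
            // A signal interrupted the read: retry.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            // Any other error is treated like EOF: the child should exit.
            Err(_) => return ControlEvent::MainGone,
        }
    }
}
```

Because the kernel closes main's end of the socketpair when main dies for any reason, this covers crashes and `SIGKILL` as well as orderly exits.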
+### Buffer Cleanup
+
+**Ownership:** Producer (creator) owns the memfd.
+
+**Normal cleanup:**
+1. Child exits normally
+2. Kernel closes child's FDs automatically
+3. Peers receive `RemoveBuffer`, unmap and close their FD references
+4. When last reference closed, kernel frees the memfd
+
+**Crash cleanup:**
+1. Child crashes
+2. Kernel closes child's FDs (same as normal exit)
+3. Main detects via `waitpid()`, sends `RemoveBuffer` to peers
+4. Same cleanup path as normal
+
+**RemoveBuffer race conditions:**
+- On `RemoveBuffer`, consumer stops reading immediately
+- In-flight frames in that buffer are discarded (acceptable)
+- Producer may write briefly after consumer unmaps (harmless - just wasted work)
+- No synchronization needed between producer and consumer for cleanup
+
+## Logging
+
+Children inherit stdout/stderr from main. Systemd captures output and forwards to journald. No special handling needed.
+
+## Shutdown
+
+1. Main receives SIGTERM
+2. Main closes control sockets to children
+3. Children detect EOF, exit cleanly
+4. Main calls `waitpid()` with timeout
+5. SIGKILL any remaining children after timeout
+
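Steps 4 and 5 can be sketched with the standard library alone. The example assumes the child is held as a `std::process::Child`; since the design forks directly, the real code would more likely use `waitpid` with `WNOHANG` and `kill` from nix, but the control flow would be the same. `reap_with_timeout` is a hypothetical helper.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Waits up to `timeout` for the child to exit on its own, then kills it.
/// Either way the child is reaped before returning.
fn reap_with_timeout(child: &mut Child, timeout: Duration) -> io::Result<ExitStatus> {
    let deadline = Instant::now() + timeout;
    loop {
        // Non-blocking check, comparable to waitpid(..., WNOHANG).
        if let Some(status) = child.try_wait()? {
            return Ok(status);
        }
        if Instant::now() >= deadline {
            child.kill()?;       // sends SIGKILL on Unix
            return child.wait(); // reap so no zombie is left behind
        }
        sleep(Duration::from_millis(10));
    }
}
```

Polling every 10 ms keeps the sketch short; an event loop that already handles `SIGCHLD` could wait on that instead.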
+## Memory Sharing with crosvm
+
+vhost-user protocol passes guest memory via `SCM_RIGHTS` FD passing. This works across different UIDs:
+
+1. crosvm creates memfd for guest memory
+2. crosvm sends FD via unix socket
+3. vm-switch child receives FD (now has capability to access)
+4. vm-switch child mmaps the FD
+
+The FD transfer grants access regardless of UID differences.
+
+## Future Considerations
+
+- Running crosvm unprivileged (each VM as separate user) - compatible with this design
+- Additional Landlock filesystem restrictions if needed
+- Audit logging of seccomp violations in production