Add vm-switch sandboxing design document

Design for adding strong sandboxing to vm-switch using Linux namespaces
and seccomp, with per-VM process isolation. Key elements:

- Fork per VM instead of threads, each child sandboxed
- SPSC ring buffers for inter-process frame routing
- Unprivileged operation via user namespaces
- seccompiler + nix for pure Rust implementation
- Asymmetric control protocol preventing MAC spoofing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Davíð Steinn Geirsson 2026-02-08 02:05:01 +00:00
parent d1117314eb
commit 7b60a0d688


@@ -0,0 +1,531 @@
# vm-switch Sandboxing Design
## Overview
vm-switch handles untrusted network data from VMs. This design adds strong sandboxing using Linux namespaces and seccomp, with per-VM process isolation.
## Threat Model
Protect against:
- **Arbitrary code execution** - Memory corruption in packet parsing leads to host compromise
- **Information disclosure** - Compromised vm-switch reads sensitive host files
- **Lateral movement** - Compromised vm-switch attacks other services
Defense: Minimal privileges, per-VM isolation, syscall filtering.
## Architecture
### Current (single process, threads)
```
vm-switch (main)
├── thread: banking-backend
├── thread: shopping-backend
└── thread: router-backend
(shared memory via Arc<RwLock>)
```
### New (fork per VM, sandboxed)
```
vm-switch (main) ─── lighter sandbox
├── fork → client-A-backend ─── strict sandbox
├── fork → client-B-backend ─── strict sandbox
└── fork → router-backend ─── strict sandbox (slightly more permissive)
```
Each child handles vhost-user protocol for one VM. Children are sandboxed before processing any packets.
## Inter-Process Communication
### SPSC Ring Buffers
Each direction of each VM pair gets a dedicated single-producer single-consumer ring buffer. No shared writers, no coordination overhead.
**1:N topology (router + clients):**
- Each client has one egress buffer (client → router)
- Each client has one ingress buffer (router → client)
- Router has N egress buffers and N ingress buffers
```
┌──────────────┐ ┌──────────────┐
│ Client A │ │ Router │
├──────────────┤ ├──────────────┤
│ egress → R │────────▶│ ingress ← A │
│ ingress ← R │◀────────│ egress → A │
└──────────────┘ └──────────────┘
```
**Security property:** Each buffer accepts data from only one source. A compromised producer can corrupt only the frames it sends itself; it cannot read or modify traffic between other peers.
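To make the buffer count of the 1:N topology concrete, here is a minimal sketch that enumerates the directed buffers for one router and N clients. The `Link` type and `links_for` function are hypothetical names used only for illustration; they are not part of the design.

```rust
/// One directed SPSC buffer: exactly one producer and exactly one consumer.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Link {
    producer: String,
    consumer: String,
}

/// Enumerate the buffers needed for one router and N clients.
/// Each client contributes one egress (client -> router) and one
/// ingress (router -> client) buffer, so the result has 2 * N entries.
fn links_for(router: &str, clients: &[&str]) -> Vec<Link> {
    let mut links = Vec::with_capacity(clients.len() * 2);
    for client in clients {
        links.push(Link {
            producer: client.to_string(),
            consumer: router.to_string(),
        });
        links.push(Link {
            producer: router.to_string(),
            consumer: client.to_string(),
        });
    }
    links
}
```

Because every `Link` names a single producer, the security property can be checked per link: no buffer ever has two writers.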
### Buffer Ownership
Children create their own egress buffers and share FDs with main:
```rust
/// Buffer I created and own - I write to this
struct OwnedRingBuffer {
memfd: OwnedFd,
map: MmapMut,
}
/// Buffer someone else created - I read from this
struct MappedRingBuffer {
fd: OwnedFd, // received via SCM_RIGHTS
map: MmapMut,
}
```
**Startup sequence:**
1. Main forks child
2. Child creates egress buffer (memfd), sends FD to main
3. Main forwards FD to destination (router for clients)
4. Destination maps the buffer
5. Return path: destination sends its egress FD back through main
### Ring Buffer Layout
Simple layout with head/tail in buffer:
```
┌─────────────────────────────────────┐
│ head: u64 │ tail: u64 │ padding │
├─────────────────────────────────────┤
│ slot[0] │ slot[1] │ ... │
└─────────────────────────────────────┘
```
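Both producer and consumer map the buffer read+write. This is acceptable because each buffer carries traffic from exactly one producer, so a misbehaving peer can disrupt only its own link. The consumer must still treat every index and length field in the buffer as untrusted input and bounds-check it before use.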
### Ring Buffer Details
**Slot format (fixed-size):**
```rust
const MAX_FRAME_SIZE: usize = 9216; // jumbo frames + headroom
#[repr(C)]
struct Slot {
len: u32, // frame length (0 = empty/unused)
_padding: u32, // alignment
data: [u8; MAX_FRAME_SIZE],
}
#[repr(C)]
struct RingBuffer {
head: AtomicU64, // next write position (producer owns)
tail: AtomicU64, // next read position (consumer owns)
_padding: [u8; 48], // pad to 64 bytes (cache line)
slots: [Slot; RING_SIZE],
}
```
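As a quick check of the layout above, the following sketch computes how many bytes one ring occupies, which is the size the producer would need when sizing and mapping the memfd. `RING_SIZE = 64` and the `ring_bytes` helper are assumptions for illustration; the design does not fix a ring size.

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicU64;

const MAX_FRAME_SIZE: usize = 9216;
const RING_SIZE: usize = 64; // hypothetical value, not specified by the design

#[repr(C)]
struct Slot {
    len: u32,
    _padding: u32,
    data: [u8; MAX_FRAME_SIZE],
}

#[repr(C)]
struct RingBuffer {
    head: AtomicU64,
    tail: AtomicU64,
    _padding: [u8; 48],
    slots: [Slot; RING_SIZE],
}

/// Total size of one ring in bytes: 64-byte header plus RING_SIZE slots.
const fn ring_bytes() -> usize {
    size_of::<RingBuffer>()
}
```

With these assumed values each slot is 9224 bytes and one ring occupies 64 + 64 × 9224 = 590,400 bytes. Note that `head` and `tail` sit in the same 64-byte line in this layout, so the padding separates the header from the slots rather than the two indices from each other.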
**Producer protocol:**
```rust
fn push(&self, frame: &[u8]) -> bool {
    // len == 0 marks an empty slot, and oversized frames cannot fit in one
    if frame.is_empty() || frame.len() > MAX_FRAME_SIZE {
        return false;
    }
    let head = self.head.load(Relaxed); // only producer writes head
    let tail = self.tail.load(Acquire); // sync with consumer
    let next = (head + 1) % RING_SIZE as u64;
    if next == tail {
        return false; // full - drop frame
    }
    // Sketch: writing a slot through &self needs UnsafeCell or raw pointers in real code
    let slot = &mut self.slots[head as usize];
    slot.data[..frame.len()].copy_from_slice(frame);
    slot.len = frame.len() as u32;
    fence(Release); // data visible before head update
    self.head.store(next, Relaxed);
    // Signal consumer via eventfd if it was empty
    if head == tail {
        self.eventfd.write(1);
    }
    true
}
```
**Consumer protocol:**
```rust
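fn pop(&self) -> Option<Vec<u8>> {
    let tail = self.tail.load(Relaxed); // only consumer writes tail
    let head = self.head.load(Acquire); // sync with producer
    if head == tail {
        return None; // empty
    }
    // tail lives in shared memory too, so never index with it unchecked
    let slot = self.slots.get(tail as usize)?;
    // len is controlled by the producer: clamp it before slicing
    let len = (slot.len as usize).min(MAX_FRAME_SIZE);
    let frame = slot.data[..len].to_vec();
    self.tail.store((tail + 1) % RING_SIZE as u64, Release);
    Some(frame)
}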
```
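The producer and consumer protocols can be exercised in a single process before any shared memory is involved. The sketch below is a self-contained in-process variant under stated assumptions: `Ring`, the `UnsafeCell` slot storage, and `RING_SIZE = 8` are illustrative choices, and a `Release` store on `head` replaces the fence plus relaxed store (both publish the slot contents before the index). It omits the eventfd notification.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU64, Ordering::{Acquire, Relaxed, Release}};

const MAX_FRAME_SIZE: usize = 9216;
const RING_SIZE: usize = 8; // hypothetical small size for the example

struct Slot {
    len: u32,
    data: [u8; MAX_FRAME_SIZE],
}

struct Ring {
    head: AtomicU64, // written only by the producer
    tail: AtomicU64, // written only by the consumer
    slots: Box<[UnsafeCell<Slot>]>,
}

// SAFETY: the head/tail protocol gives each slot one accessor at a time,
// provided there is exactly one producer and one consumer.
unsafe impl Sync for Ring {}

impl Ring {
    fn new() -> Self {
        let slots = (0..RING_SIZE)
            .map(|_| UnsafeCell::new(Slot { len: 0, data: [0; MAX_FRAME_SIZE] }))
            .collect();
        Ring { head: AtomicU64::new(0), tail: AtomicU64::new(0), slots }
    }

    /// Producer side. Returns false if the frame is empty, oversized,
    /// or the ring is full (the frame is dropped in all three cases).
    fn push(&self, frame: &[u8]) -> bool {
        if frame.is_empty() || frame.len() > MAX_FRAME_SIZE {
            return false;
        }
        let head = self.head.load(Relaxed) as usize;
        let tail = self.tail.load(Acquire) as usize;
        let next = (head + 1) % RING_SIZE;
        if next == tail {
            return false;
        }
        // SAFETY: slot `head` is invisible to the consumer until head is published.
        let slot = unsafe { &mut *self.slots[head].get() };
        slot.data[..frame.len()].copy_from_slice(frame);
        slot.len = frame.len() as u32;
        self.head.store(next as u64, Release);
        true
    }

    /// Consumer side. Treats indices and lengths as untrusted, mirroring
    /// what a consumer of a shared mapping would have to do.
    fn pop(&self) -> Option<Vec<u8>> {
        let tail = self.tail.load(Relaxed) as usize;
        let head = self.head.load(Acquire) as usize;
        if head >= RING_SIZE || tail >= RING_SIZE || head == tail {
            return None;
        }
        // SAFETY: the producer does not touch slot `tail` until tail advances.
        let slot = unsafe { &*self.slots[tail].get() };
        let len = slot.len as usize;
        // A frame with an impossible length is dropped, but its slot is still released.
        let frame = if len <= MAX_FRAME_SIZE { Some(slot.data[..len].to_vec()) } else { None };
        self.tail.store(((tail + 1) % RING_SIZE) as u64, Release);
        frame
    }
}
```

One slot always stays empty so that `head == tail` unambiguously means "empty"; a ring of `RING_SIZE` slots therefore holds at most `RING_SIZE - 1` frames.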
**Notification (hybrid):**
- Consumer polls in tight loop while processing frames
- When queue empty, consumer blocks on eventfd
- Producer writes to eventfd after pushing to previously-empty queue
- Balances latency (polling when busy) and CPU usage (sleeping when idle)
**Buffer full behavior:**
- Producer drops frame and returns false
- No backpressure - behaves like a real network switch under congestion
- Logging/metrics can track drop rate
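Drop tracking can be as small as two counters fed by the boolean that `push` returns. The following is a minimal sketch; `DropStats` and its methods are hypothetical names, not part of the design.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Per-buffer counters, updated by the producer after every push attempt.
#[derive(Default)]
struct DropStats {
    sent: AtomicU64,
    dropped: AtomicU64,
}

impl DropStats {
    /// Record the result of one `push` call.
    fn record(&self, pushed: bool) {
        if pushed {
            self.sent.fetch_add(1, Relaxed);
        } else {
            self.dropped.fetch_add(1, Relaxed);
        }
    }

    /// Fraction of offered frames that were dropped (0.0 if none were offered).
    fn drop_rate(&self) -> f64 {
        let sent = self.sent.load(Relaxed);
        let dropped = self.dropped.load(Relaxed);
        let total = sent + dropped;
        if total == 0 {
            0.0
        } else {
            dropped as f64 / total as f64
        }
    }
}
```

Relaxed ordering is sufficient here because the counters are statistics only and do not guard any other data.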
### Control Channel Protocol
Each child has a unix socketpair with main for buffer coordination. Messages are serde-serialized (e.g., using postcard or bincode).
Message types are asymmetric to prevent children from influencing MAC address assignment:
```rust
/// Messages from main to child
#[derive(Serialize, Deserialize)]
enum MainToChild {
/// "Here's a buffer for receiving from peer `name`"
/// Accompanied by 2 FDs via SCM_RIGHTS: [memfd, eventfd]
/// MAC is authoritative - child must use this for filtering
PutBuffer { name: String, mac: Mac },
/// "Peer `name` is gone, clean up associated buffers"
RemoveBuffer { name: String },
}
/// Messages from child to main
#[derive(Serialize, Deserialize)]
enum ChildToMain {
/// "I need a buffer to receive from peer `name`"
GetBuffer { name: String },
/// "Here's my egress buffer for peer `name`"
/// Accompanied by 2 FDs via SCM_RIGHTS: [memfd, eventfd]
/// No MAC field - main uses MAC from config, not from children
BufferReady { name: String },
}
```
Each buffer has an associated eventfd for notification. Producer creates both the memfd and eventfd; both are passed to the consumer.
**Security note:** Children cannot specify MAC addresses. Main derives MACs from config files only, preventing children from spoofing other peers' identities.
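The asymmetry can be illustrated with a sketch of how main might turn a `BufferReady` into the `PutBuffer` it forwards. This is an assumption-laden illustration: `route_buffer_ready`, the `config_macs` map, and representing `Mac` as `[u8; 6]` are hypothetical, and serialization and FD passing are left out. The key point is that the MAC is looked up by the identity of the control socket the message arrived on, never taken from message contents.

```rust
use std::collections::HashMap;

type Mac = [u8; 6];

#[derive(Debug, PartialEq)]
enum MainToChild {
    PutBuffer { name: String, mac: Mac },
    RemoveBuffer { name: String },
}

#[derive(Debug, PartialEq)]
enum ChildToMain {
    GetBuffer { name: String },
    BufferReady { name: String },
}

/// Hypothetical handler in main. `sender` is the child that owns the control
/// socket the message arrived on. Returns the destination child and the
/// message to send to it, or None if the message should not be forwarded.
fn route_buffer_ready(
    config_macs: &HashMap<String, Mac>,
    sender: &str,
    msg: &ChildToMain,
) -> Option<(String, MainToChild)> {
    let dest = match msg {
        ChildToMain::BufferReady { name } => name,
        _ => return None,
    };
    // The MAC comes from config, keyed by the socket's owner.
    let mac = *config_macs.get(sender)?;
    // Destinations that are not configured peers are rejected.
    if !config_macs.contains_key(dest) {
        return None;
    }
    Some((
        dest.clone(),
        MainToChild::PutBuffer { name: sender.to_string(), mac },
    ))
}
```

Since `ChildToMain` has no MAC field at all, a compromised child has no way to express a spoofed address in the protocol.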
**Example flow (client startup):**
```
Main Client
│ │
│◄── GetBuffer { "router" } ───────┤ Client requests router's buffer
│ │
├─── PutBuffer { "router", mac } ──►│ Main sends buffer + authoritative MAC
│ + FD via SCM_RIGHTS │
│ │
│◄── BufferReady { "router" } ─────┤ Client sends its egress to main
│ + FD via SCM_RIGHTS │ (main knows client's MAC from config)
```
**Example flow (new peer notification):**
```
Main Router
│ │
├─── PutBuffer { "client-a", mac }─►│ Main pushes client's egress + MAC
│ + FD │
│ │
│◄── BufferReady { "client-a" } ───┤ Router sends its egress for client-a
│ + FD │
```
**Cleanup:**
```
Main Any Child
│ │
├─── RemoveBuffer { "peer-x" } ────►│ Peer disconnected, clean up
│ │
```
This protocol is topology-agnostic and can support future modes beyond 1:N.
### MAC Filtering
Each ingress buffer has an associated expected source MAC. Receivers validate source MAC to prevent spoofing:
```rust
struct IngressBuffer {
ring: MappedRingBuffer,
expected_source_mac: Mac,
accept_broadcast: bool, // true for router-like roles
}
impl IngressBuffer {
fn read_frame(&self) -> Option<Frame> {
let frame = self.ring.pop()?;
// Accept if source matches expected peer
if frame.source_mac == self.expected_source_mac {
return Some(frame);
}
// Router accepts broadcast/multicast destinations
if self.accept_broadcast && frame.dest_mac.is_multicast() {
return Some(frame);
}
// Drop spoofed frame
log::warn!(
"Dropping frame with unexpected source MAC: got {}, expected {}",
frame.source_mac, self.expected_source_mac
);
None
}
}
```
**MAC is provided via control channel:** When main sends `PutBuffer`, it includes the expected source MAC. The child stores this with the ingress buffer metadata.
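Since the ring buffer hands back raw bytes, the filter has to extract the addresses from the Ethernet header itself. Below is a minimal sketch of that step, mirroring the acceptance rule of `read_frame`; `mac_at`, `is_multicast`, and `accept` are hypothetical helpers and `Mac` is assumed to be `[u8; 6]`.

```rust
type Mac = [u8; 6];

/// Copy six bytes starting at `offset`, or None if the frame is too short.
/// In an Ethernet header the destination MAC is at offset 0, the source at offset 6.
fn mac_at(frame: &[u8], offset: usize) -> Option<Mac> {
    let bytes = frame.get(offset..offset + 6)?;
    let mut mac = [0u8; 6];
    mac.copy_from_slice(bytes);
    Some(mac)
}

/// Group addresses (multicast and broadcast) have the low bit of the first octet set.
fn is_multicast(mac: &Mac) -> bool {
    mac[0] & 1 == 1
}

/// Same rule as `read_frame`: accept a matching source, or a group-addressed
/// frame when `accept_broadcast` is set. Frames too short to carry both
/// addresses are always rejected.
fn accept(frame: &[u8], expected_source: &Mac, accept_broadcast: bool) -> bool {
    match (mac_at(frame, 0), mac_at(frame, 6)) {
        (Some(dst), Some(src)) => {
            src == *expected_source || (accept_broadcast && is_multicast(&dst))
        }
        _ => false,
    }
}
```

Under this rule a group-addressed frame is accepted regardless of its source MAC when `accept_broadcast` is set; the design does not state whether the router role should additionally require a matching source for such frames.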
## Privilege Model
### Unprivileged Operation
vm-switch runs as an unprivileged user. User namespaces provide "fake root" capabilities:
```
Unprivileged user (UID 1000)
├─► unshare(CLONE_NEWUSER)
│ └─► Now "root" inside user namespace with CAP_SYS_ADMIN
├─► Write uid_map/gid_map (map UID 1000 → 0 inside)
└─► unshare(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET)
└─► These succeed with user namespace capabilities
```
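From the host kernel's perspective, vm-switch always runs as UID 1000. A compromised process gains no privileges beyond those of that unprivileged user, and the namespace and seccomp restrictions described below narrow its reach further.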
## Sandboxing Details
### Main Process Sandbox
**Namespaces:**
- User (map real UID to root inside)
- Mount (only config dir visible)
- IPC (isolated)
- Network (empty, unix sockets via inherited FDs)
**Sequence:**
```
vm-switch starts (unprivileged)
├─► unshare(USER | MOUNT | IPC | NETWORK)
├─► Write "deny" to /proc/self/setgroups
├─► Write UID/GID mappings
├─► Mount minimal filesystem:
│ tmpfs as new root
│ bind mount --config-dir (read-write)
├─► pivot_root to new root
├─► Apply seccomp filter
└─► Event loop
```
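The ordering of the `setgroups` and mapping writes in the sequence above matters: an unprivileged process must write `deny` to `/proc/self/setgroups` before the kernel will accept its `gid_map`. The sketch below only builds the strings and their order, so it can run outside a namespace; `id_map_line` and `userns_writes` are hypothetical helpers, and the actual file writes are left to the caller.

```rust
/// One line of uid_map or gid_map: "<id inside ns> <id outside ns> <count>".
/// A single ID is mapped, so the count is always 1.
fn id_map_line(inside: u32, outside: u32) -> String {
    format!("{} {} 1\n", inside, outside)
}

/// The three procfs writes, in the order they should be performed after
/// unshare(CLONE_NEWUSER). setgroups comes before gid_map.
fn userns_writes(host_uid: u32, host_gid: u32) -> Vec<(&'static str, String)> {
    vec![
        ("/proc/self/setgroups", "deny".to_string()),
        ("/proc/self/uid_map", id_map_line(0, host_uid)),
        ("/proc/self/gid_map", id_map_line(0, host_gid)),
    ]
}
```

For the UID 1000 example used earlier, the `uid_map` content is the single line `0 1000 1`.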
**Seccomp whitelist:**
```
# Child management
fork, clone, wait4 (waitpid() is a libc wrapper around wait4 on x86_64)
# Config watching
inotify_init1, inotify_add_watch, inotify_rm_watch
# Unix sockets (AF_UNIX only)
socket(AF_UNIX), bind, listen, accept4, sendmsg, recvmsg
# Namespace setup for children
unshare, setns, pivot_root, mount, umount2
prctl, seccomp
setresuid, setresgid, setgroups
# Memory
mmap(!PROT_EXEC), munmap, madvise, memfd_create, ftruncate
# Event loop
epoll_create1, epoll_ctl, epoll_wait, read, write
# Misc
close, exit_group, clock_gettime, rt_sigaction, rt_sigprocmask
```
### Child Process Sandbox
**Namespaces:** Same as main (inherited or created fresh per child).
**Sandbox timing:** After setup, before processing packets:
```
fork()
├─► Create memfd, mmap buffers
├─► Exchange FDs with main
├─► Set up vhost-user listener socket
├─► Apply sandbox (namespaces + seccomp)
│ ─────── security boundary ───────
└─► Accept vhost-user connection, process packets
```
**Base child seccomp whitelist:**
```
# Vhost-user protocol
accept4, read, write, recvmsg, sendmsg
# Guest memory mapping
mmap(!PROT_EXEC, MAP_SHARED), munmap, madvise
# Event loop + ring buffer notification
epoll_create1, epoll_ctl, epoll_wait, eventfd2
# Synchronization
futex
# Misc
close, exit_group, clock_gettime
```
**Extended seccomp whitelist (for children that create buffers dynamically, e.g., router):**
```
<all base child syscalls>
# Dynamic buffer creation for new peers
memfd_create, ftruncate
```
### Seccomp Violation Handling
- Default: `SeccompAction::KillProcess` (kernel `SECCOMP_RET_KILL_PROCESS`, immediate termination)
- Debug mode: `--seccomp-trap` flag uses `SeccompAction::Trap` (kernel `SECCOMP_RET_TRAP`, delivers SIGSYS, allows logging)
## Implementation
### Dependencies
- **seccompiler** - Pure Rust seccomp BPF compilation (used by Firecracker)
- **nix** - Rust bindings for namespace operations, mount, pivot_root
No C dependencies. Pure Rust implementation.
### Seccompiler Usage
```rust
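use std::collections::BTreeMap;
use seccompiler::{
    BpfProgram, SeccompAction, SeccompCmpArgLen as ArgLen, SeccompCmpOp as Cmp,
    SeccompCondition, SeccompFilter, SeccompRule, TargetArch,
};

let mut rules: BTreeMap<i64, Vec<SeccompRule>> = BTreeMap::new();
// Allow read/write unconditionally: an empty rule list matches on the
// syscall number alone (SeccompRule::new rejects an empty condition list)
rules.insert(libc::SYS_read, vec![]);
rules.insert(libc::SYS_write, vec![]);
// Allow socket() only for AF_UNIX
rules.insert(libc::SYS_socket, vec![
    SeccompRule::new(vec![
        SeccompCondition::new(0, ArgLen::Dword, Cmp::Eq, libc::AF_UNIX as u64)?,
    ])?,
]);
// Allow mmap() only when PROT_EXEC is clear: (prot & PROT_EXEC) == 0
rules.insert(libc::SYS_mmap, vec![
    SeccompRule::new(vec![
        SeccompCondition::new(2, ArgLen::Dword, Cmp::MaskedEq(libc::PROT_EXEC as u64), 0)?,
    ])?,
]);
// Action taken for any syscall that matches no rule
let mismatch_action = if trap_mode {
    SeccompAction::Trap
} else {
    SeccompAction::KillProcess
};
// Arguments: rules, mismatch action, match action, target architecture
let filter = SeccompFilter::new(rules, mismatch_action, SeccompAction::Allow, TargetArch::x86_64)?;
// Compile to BPF, then install on the calling thread
let program: BpfProgram = filter.try_into()?;
seccompiler::apply_filter(&program)?;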
```
## Process Lifecycle
### Main Process Responsibilities
Main is the coordinator:
- Watches config dir for MAC file changes (inotify)
- Forks children for each VM
- Initiates buffer handoff via `PutBuffer` when new peer appears
- Forwards `BufferReady` messages between children
- Sends `RemoveBuffer` when peer disappears or crashes
- Monitors children via `waitpid()`
Apart from their initial `GetBuffer` request, children only respond to main's messages.
### Child Crash Detection
Main detects child crashes via standard Unix mechanisms:
1. Kernel sends `SIGCHLD` to main when child exits
2. Main calls `waitpid()` to reap child and get exit status
3. Main sends `RemoveBuffer { name }` to all peers that had buffers with crashed child
4. Main can optionally respawn the child
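Step 3 requires main to know which peers shared a buffer with the crashed child. A minimal sketch of that lookup follows, assuming main tracks buffers as `(producer, consumer)` name pairs; `peers_to_notify` and that representation are hypothetical.

```rust
use std::collections::BTreeSet;

/// Return every peer that shares at least one buffer with `crashed`.
/// Each of them should receive `RemoveBuffer { name: crashed }` once,
/// even if it is linked to the crashed child in both directions.
fn peers_to_notify(links: &[(String, String)], crashed: &str) -> BTreeSet<String> {
    links
        .iter()
        .filter_map(|(producer, consumer)| {
            if producer == crashed {
                Some(consumer.clone())
            } else if consumer == crashed {
                Some(producer.clone())
            } else {
                None
            }
        })
        .collect()
}
```

Collecting into a set deduplicates peers, so in the 1:N topology a crashed client produces exactly one notification (to the router), while a crashed router produces one per client.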
### Main Crash Handling
Children detect main crash via control socket:
1. Control socket returns EOF or error
2. Child exits cleanly
3. Systemd restarts main process
4. Main respawns all children fresh
Children do not attempt to continue operating without main.
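The EOF check in step 1 can be sketched with the standard library's Unix socket type. This is an illustration only: `ControlEvent` and `read_control` are hypothetical names, and a plain `read` stands in for the `recvmsg` call a real control channel would need in order to receive FDs.

```rust
use std::io::{self, Read};
use std::os::unix::net::UnixStream;

/// Outcome of one read on the control socket.
#[derive(Debug, PartialEq)]
enum ControlEvent {
    /// `n` bytes of message data arrived (0 means "interrupted, retry").
    Data(usize),
    /// Main closed its end or the socket failed: the child should exit.
    MainGone,
}

/// Read once from the control socket. `buf` must be non-empty, otherwise
/// a zero-length read could not be told apart from EOF.
fn read_control(sock: &mut UnixStream, buf: &mut [u8]) -> ControlEvent {
    match sock.read(buf) {
        Ok(0) => ControlEvent::MainGone, // EOF: peer closed the socket
        Ok(n) => ControlEvent::Data(n),
        Err(e) if e.kind() == io::ErrorKind::Interrupted => ControlEvent::Data(0),
        Err(_) => ControlEvent::MainGone,
    }
}
```

Because the kernel closes main's end of the socketpair when main dies for any reason, this path covers crashes as well as the orderly shutdown case.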
### Buffer Cleanup
**Ownership:** Producer (creator) owns the memfd.
**Normal cleanup:**
1. Child exits normally
2. Kernel closes child's FDs automatically
3. Peers receive `RemoveBuffer`, unmap and close their FD references
4. When last reference closed, kernel frees the memfd
**Crash cleanup:**
1. Child crashes
2. Kernel closes child's FDs (same as normal exit)
3. Main detects via `waitpid()`, sends `RemoveBuffer` to peers
4. Same cleanup path as normal
**RemoveBuffer race conditions:**
- On `RemoveBuffer`, consumer stops reading immediately
- In-flight frames in that buffer are discarded (acceptable)
- Producer may write briefly after consumer unmaps (harmless - just wasted work)
- No synchronization needed between producer and consumer for cleanup
## Logging
Children inherit stdout/stderr from main. Systemd captures output and forwards to journald. No special handling needed.
## Shutdown
1. Main receives SIGTERM
2. Main closes control sockets to children
3. Children detect EOF, exit cleanly
4. Main calls `waitpid()` with timeout
5. SIGKILL any remaining children after timeout
## Memory Sharing with crosvm
vhost-user protocol passes guest memory via `SCM_RIGHTS` FD passing. This works across different UIDs:
1. crosvm creates memfd for guest memory
2. crosvm sends FD via unix socket
3. vm-switch child receives FD (now has capability to access)
4. vm-switch child mmaps the FD
The FD transfer grants access regardless of UID differences.
## Future Considerations
- Running crosvm unprivileged (each VM as separate user) - compatible with this design
- Additional Landlock filesystem restrictions if needed
- Audit logging of seccomp violations in production