Add vm-switch sandboxing design document

Design for adding strong sandboxing to vm-switch using Linux namespaces and seccomp, with per-VM process isolation.

Key elements:
- Fork per VM instead of threads, each child sandboxed
- SPSC ring buffers for inter-process frame routing
- Unprivileged operation via user namespaces
- seccompiler + nix for pure Rust implementation
- Asymmetric control protocol preventing MAC spoofing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

parent d1117314eb
commit 7b60a0d688

1 changed file with 531 additions and 0 deletions
docs/plans/2026-02-08-vm-switch-sandboxing-design.md
# vm-switch Sandboxing Design

## Overview

vm-switch handles untrusted network data from VMs. This design adds strong sandboxing using Linux namespaces and seccomp, with per-VM process isolation.

## Threat Model

Protect against:
- **Arbitrary code execution** - Memory corruption in packet parsing leads to host compromise
- **Information disclosure** - Compromised vm-switch reads sensitive host files
- **Lateral movement** - Compromised vm-switch attacks other services

Defense: Minimal privileges, per-VM isolation, syscall filtering.

## Architecture

### Current (single process, threads)

```
vm-switch (main)
├── thread: banking-backend
├── thread: shopping-backend
└── thread: router-backend
    (shared memory via Arc<RwLock>)
```

### New (fork per VM, sandboxed)

```
vm-switch (main) ─── lighter sandbox
    │
    ├── fork → client-A-backend ─── strict sandbox
    ├── fork → client-B-backend ─── strict sandbox
    └── fork → router-backend ─── strict sandbox (slightly more permissive)
```

Each child handles vhost-user protocol for one VM. Children are sandboxed before processing any packets.

## Inter-Process Communication

### SPSC Ring Buffers

Each VM pair gets a dedicated single-producer single-consumer ring buffer. No shared writers, no coordination overhead.

**1:N topology (router + clients):**
- Each client has one egress buffer (client → router)
- Each client has one ingress buffer (router → client)
- Router has N egress buffers and N ingress buffers

```
┌──────────────┐         ┌──────────────┐
│   Client A   │         │    Router    │
├──────────────┤         ├──────────────┤
│ egress → R   │────────▶│ ingress ← A  │
│ ingress ← R  │◀────────│ egress → A   │
└──────────────┘         └──────────────┘
```

**Security property:** Each buffer accepts data from only one source. A compromised producer can only read/corrupt its own outgoing frames.

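The 1:N topology above can be enumerated mechanically. The following is a minimal sketch, not part of the design itself: `RingLink` and `links_for` are hypothetical names, and peers are identified by plain strings for illustration.

```rust
/// One direction of one link: `producer` writes, `consumer` reads.
/// (Hypothetical type, used only for this illustration.)
#[derive(Debug, Clone, PartialEq, Eq)]
struct RingLink {
    producer: String,
    consumer: String,
}

/// Enumerate the SPSC buffers of a 1:N topology: two per client,
/// and none between clients.
fn links_for(router: &str, clients: &[&str]) -> Vec<RingLink> {
    let mut links = Vec::with_capacity(clients.len() * 2);
    for client in clients {
        // Client egress, mapped by the router as ingress.
        links.push(RingLink {
            producer: client.to_string(),
            consumer: router.to_string(),
        });
        // Router egress, mapped by the client as ingress.
        links.push(RingLink {
            producer: router.to_string(),
            consumer: client.to_string(),
        });
    }
    links
}
```

With N clients this yields 2N buffers, each with exactly one writer, which is the property the security argument relies on.
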
### Buffer Ownership

Children create their own egress buffers and share FDs with main:

```rust
use std::os::fd::OwnedFd;
// `MmapMut` is a mutable memory-mapping type, e.g. from the memmap2 crate
// (the crate is not named in this design).

/// Buffer I created and own - I write to this
struct OwnedRingBuffer {
    memfd: OwnedFd,
    map: MmapMut,
}

/// Buffer someone else created - I read from this
struct MappedRingBuffer {
    fd: OwnedFd, // received via SCM_RIGHTS
    map: MmapMut,
}
```

**Startup sequence:**
1. Main forks child
2. Child creates egress buffer (memfd), sends FD to main
3. Main forwards FD to destination (router for clients)
4. Destination maps the buffer
5. Return path: destination sends its egress FD back through main

### Ring Buffer Layout

Simple layout with head/tail in buffer:

```
┌─────────────────────────────────────┐
│ head: u64 │ tail: u64 │ padding     │
├─────────────────────────────────────┤
│ slot[0] │ slot[1] │ ...             │
└─────────────────────────────────────┘
```

Both producer and consumer map the buffer read+write: the producer writes slots and `head`, the consumer writes `tail`. This is acceptable because each buffer carries traffic from exactly one producer to exactly one consumer, so a compromised process can only corrupt its own link. Each side should still treat `head`, `tail`, and slot lengths read from the buffer as untrusted input.

### Ring Buffer Details

**Slot format (fixed-size):**
```rust
use std::sync::atomic::AtomicU64;

const MAX_FRAME_SIZE: usize = 9216; // jumbo frames + headroom
// RING_SIZE (number of slots) is not fixed by this design.

#[repr(C)]
struct Slot {
    len: u32,      // frame length (0 = empty/unused)
    _padding: u32, // alignment
    data: [u8; MAX_FRAME_SIZE],
}

#[repr(C)]
struct RingBuffer {
    head: AtomicU64,    // next write position (producer owns)
    tail: AtomicU64,    // next read position (consumer owns)
    _padding: [u8; 48], // pad to 64 bytes (cache line)
    slots: [Slot; RING_SIZE],
}
```

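To size the memfd, the layout above can be checked at compile time. This is a sketch under one loud assumption: `RING_SIZE = 64` is a placeholder chosen for the arithmetic, since the design does not fix a slot count. The two structs are repeated so the block compiles on its own; `ring_bytes` and `usable_slots` are hypothetical helpers.

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicU64;

const MAX_FRAME_SIZE: usize = 9216;
const RING_SIZE: usize = 64; // ASSUMPTION: placeholder, not specified by the design
const HEADER_BYTES: usize = 64; // head + tail + padding

#[repr(C)]
struct Slot {
    len: u32,
    _padding: u32,
    data: [u8; MAX_FRAME_SIZE],
}

#[repr(C)]
struct RingBuffer {
    head: AtomicU64,
    tail: AtomicU64,
    _padding: [u8; 48],
    slots: [Slot; RING_SIZE],
}

// Compile-time layout checks: a slot is 8 bytes of header plus the frame
// area, and the ring is the 64-byte header followed by the slot array.
const _: () = assert!(size_of::<Slot>() == 8 + MAX_FRAME_SIZE);
const _: () = assert!(size_of::<RingBuffer>() == HEADER_BYTES + RING_SIZE * size_of::<Slot>());

/// Bytes to pass to ftruncate() on the memfd.
fn ring_bytes() -> usize {
    size_of::<RingBuffer>()
}

/// With a `(head + 1) % RING_SIZE == tail` full check, one slot always
/// stays empty, so at most RING_SIZE - 1 frames can be queued.
fn usable_slots() -> usize {
    RING_SIZE - 1
}
```

Under that assumption each slot is 9224 bytes and the whole ring is 64 + 64 × 9224 = 590,400 bytes.
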
**Producer protocol:**
```rust
use std::sync::atomic::Ordering::{Acquire, Relaxed, Release};

// `self` is the producer's handle: the mapped RingBuffer plus its eventfd.
fn push(&mut self, frame: &[u8]) -> bool {
    if frame.len() > MAX_FRAME_SIZE {
        return false; // does not fit in a slot - drop frame
    }

    // Indexes come from shared memory, so reduce them into range before use.
    let head = self.head.load(Relaxed) as usize % RING_SIZE; // only producer writes head
    let tail = self.tail.load(Acquire) as usize % RING_SIZE; // sync with consumer

    if (head + 1) % RING_SIZE == tail {
        return false; // full - drop frame
    }

    let slot = &mut self.slots[head];
    slot.data[..frame.len()].copy_from_slice(frame);
    slot.len = frame.len() as u32;

    // Release store: slot contents are visible before the head update
    self.head.store(((head + 1) % RING_SIZE) as u64, Release);

    // Signal consumer via eventfd if it was empty
    if head == tail {
        self.eventfd.write(1);
    }
    true
}
```

**Consumer protocol:**
```rust
use std::sync::atomic::Ordering::{Acquire, Relaxed, Release};

fn pop(&self) -> Option<Vec<u8>> {
    // Indexes come from shared memory, so reduce them into range before use.
    let tail = self.tail.load(Relaxed) as usize % RING_SIZE; // only consumer writes tail
    let head = self.head.load(Acquire) as usize % RING_SIZE; // sync with producer

    if head == tail {
        return None; // empty
    }

    let slot = &self.slots[tail];
    // `len` is written by the producer, which may be compromised:
    // clamp it so an oversized value cannot cause an out-of-bounds slice.
    let len = (slot.len as usize).min(MAX_FRAME_SIZE);
    let frame = slot.data[..len].to_vec();

    self.tail.store(((tail + 1) % RING_SIZE) as u64, Release);
    Some(frame)
}
```

**Notification (hybrid):**
- Consumer polls in tight loop while processing frames
- When queue empty, consumer blocks on eventfd
- Producer writes to eventfd after pushing to previously-empty queue
- Balances latency (polling when busy) and CPU usage (sleeping when idle)

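The hybrid scheme can be sketched in-process. This is an illustration only, with stand-ins labeled as such: `Event` imitates an eventfd using a `Mutex` and `Condvar`, and `Queue` uses a locked `VecDeque` in place of the shared-memory ring. All names here are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

/// Stand-in for an eventfd: a counter that blocks waiters while it is zero.
struct Event {
    count: Mutex<u64>,
    ready: Condvar,
}

impl Event {
    fn write(&self, n: u64) {
        *self.count.lock().unwrap() += n;
        self.ready.notify_one();
    }

    /// Block until the counter is non-zero, then reset it (like reading an eventfd).
    fn wait(&self) {
        let mut count = self.count.lock().unwrap();
        while *count == 0 {
            count = self.ready.wait(count).unwrap();
        }
        *count = 0;
    }
}

/// Stand-in for one ring buffer plus its eventfd.
struct Queue {
    frames: Mutex<VecDeque<Vec<u8>>>,
    event: Event,
}

impl Queue {
    fn new() -> Self {
        Queue {
            frames: Mutex::new(VecDeque::new()),
            event: Event { count: Mutex::new(0), ready: Condvar::new() },
        }
    }

    /// Producer side: signal only when the queue was empty before this push.
    fn push(&self, frame: Vec<u8>) {
        let was_empty = {
            let mut frames = self.frames.lock().unwrap();
            let was_empty = frames.is_empty();
            frames.push_back(frame);
            was_empty
        };
        if was_empty {
            self.event.write(1);
        }
    }

    fn pop(&self) -> Option<Vec<u8>> {
        self.frames.lock().unwrap().pop_front()
    }
}

/// Consumer side: drain without blocking while frames are available,
/// and block on the event only once the queue is observed empty.
fn consume(queue: &Queue, expected: usize) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    while out.len() < expected {
        while let Some(frame) = queue.pop() {
            out.push(frame);
        }
        if out.len() < expected {
            queue.event.wait();
        }
    }
    out
}
```

Because the counter persists until it is read, a signal sent between the consumer's last empty poll and its `wait` is not lost. The consumer may occasionally wake and find nothing to do, which is harmless.
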
**Buffer full behavior:**
- Producer drops frame and returns false
- No backpressure - behaves like a real network switch under congestion
- Logging/metrics can track drop rate

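Tracking the drop rate needs little more than two counters. A minimal sketch, assuming a hypothetical `DropStats` type that the caller updates with the boolean returned by `push`:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-buffer counters (hypothetical; the design only says that
/// logging/metrics can track the drop rate).
#[derive(Default)]
struct DropStats {
    pushed: AtomicU64,
    dropped: AtomicU64,
}

impl DropStats {
    /// Record the result of one `push` call: `true` = queued, `false` = dropped.
    fn record(&self, queued: bool) {
        let counter = if queued { &self.pushed } else { &self.dropped };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Fraction of frames dropped so far, in the range 0.0..=1.0.
    fn drop_rate(&self) -> f64 {
        let pushed = self.pushed.load(Ordering::Relaxed);
        let dropped = self.dropped.load(Ordering::Relaxed);
        let total = pushed + dropped;
        if total == 0 {
            0.0
        } else {
            dropped as f64 / total as f64
        }
    }
}
```

Relaxed ordering is enough here because the counters are statistics only and do not guard any other memory.
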
### Control Channel Protocol

Each child has a unix socketpair with main for buffer coordination. Messages are serde-serialized (e.g., using postcard or bincode).

Message types are asymmetric to prevent children from influencing MAC address assignment:

```rust
use serde::{Deserialize, Serialize};

/// Messages from main to child
#[derive(Serialize, Deserialize)]
enum MainToChild {
    /// "Here's a buffer for receiving from peer `name`"
    /// Accompanied by 2 FDs via SCM_RIGHTS: [memfd, eventfd]
    /// MAC is authoritative - child must use this for filtering
    PutBuffer { name: String, mac: Mac },

    /// "Peer `name` is gone, clean up associated buffers"
    RemoveBuffer { name: String },
}

/// Messages from child to main
#[derive(Serialize, Deserialize)]
enum ChildToMain {
    /// "I need a buffer to receive from peer `name`"
    GetBuffer { name: String },

    /// "Here's my egress buffer for peer `name`"
    /// Accompanied by 2 FDs via SCM_RIGHTS: [memfd, eventfd]
    /// No MAC field - main uses MAC from config, not from children
    BufferReady { name: String },
}
```

Each buffer has an associated eventfd for notification. Producer creates both the memfd and eventfd; both are passed to the consumer.

**Security note:** Children cannot specify MAC addresses. Main derives MACs from config files only, preventing children from spoofing other peers' identities.

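How main might enforce this when forwarding a `BufferReady` is sketched below. It is illustrative only: `route_buffer_ready` is a hypothetical helper, `Mac` is assumed to be a 6-byte array, the enums are repeated without serde so the block stands alone, and FD forwarding is omitted.

```rust
use std::collections::HashMap;

type Mac = [u8; 6]; // ASSUMPTION: the design does not define `Mac`

#[derive(Debug, PartialEq)]
enum MainToChild {
    PutBuffer { name: String, mac: Mac },
    RemoveBuffer { name: String },
}

#[derive(Debug, PartialEq)]
enum ChildToMain {
    GetBuffer { name: String },
    BufferReady { name: String },
}

/// Turn a `BufferReady` received from child `sender` into the `PutBuffer`
/// that main forwards to the destination peer.
///
/// `sender` is determined by which control socket the message arrived on,
/// never by message contents, and the MAC comes only from `config`.
fn route_buffer_ready(
    config: &HashMap<String, Mac>,
    sender: &str,
    msg: &ChildToMain,
) -> Option<(String, MainToChild)> {
    let dest = match msg {
        ChildToMain::BufferReady { name } => name,
        ChildToMain::GetBuffer { .. } => return None,
    };
    // Authoritative MAC for the sender, looked up from configuration.
    let mac = *config.get(sender)?;
    // Refuse to route to peers that are not configured.
    if !config.contains_key(dest) {
        return None;
    }
    Some((
        dest.clone(),
        MainToChild::PutBuffer { name: sender.to_string(), mac },
    ))
}
```

Since `ChildToMain` has no MAC field, there is nothing a child could send that would change the `mac` the destination receives.
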
**Example flow (client startup):**
```
Main                             Client
│                                   │
│◄── GetBuffer { "router" } ────────┤  Client requests router's buffer
│                                   │
├─── PutBuffer { "router", mac } ──►│  Main sends buffer + authoritative MAC
│    + FD via SCM_RIGHTS            │
│                                   │
│◄── BufferReady { "router" } ──────┤  Client sends its egress to main
│    + FD via SCM_RIGHTS            │  (main knows client's MAC from config)
```

**Example flow (new peer notification):**
```
Main                             Router
│                                   │
├─── PutBuffer { "client-a", mac }─►│  Main pushes client's egress + MAC
│    + FD                           │
│                                   │
│◄── BufferReady { "client-a" } ────┤  Router sends its egress for client-a
│    + FD                           │
```

**Cleanup:**
```
Main                            Any Child
│                                   │
├─── RemoveBuffer { "peer-x" } ────►│  Peer disconnected, clean up
│                                   │
```

This protocol is topology-agnostic and can support future modes beyond 1:N.

### MAC Filtering

Each ingress buffer has an associated expected source MAC. Receivers validate source MAC to prevent spoofing:

```rust
struct IngressBuffer {
    ring: MappedRingBuffer,
    expected_source_mac: Mac,
    accept_broadcast: bool, // true for router-like roles
}

impl IngressBuffer {
    fn read_frame(&self) -> Option<Frame> {
        let frame = self.ring.pop()?;

        // Accept if source matches expected peer
        if frame.source_mac == self.expected_source_mac {
            return Some(frame);
        }

        // Router accepts broadcast/multicast destinations.
        // Note: this branch is only reached when the source MAC did not
        // match, so it admits such frames without a source check.
        if self.accept_broadcast && frame.dest_mac.is_multicast() {
            return Some(frame);
        }

        // Drop spoofed frame
        log::warn!(
            "Dropping frame with unexpected source MAC: got {}, expected {}",
            frame.source_mac, self.expected_source_mac
        );
        None
    }
}
```

**MAC is provided via control channel:** When main sends `PutBuffer`, it includes the expected source MAC. The child stores this with the ingress buffer metadata.

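The filter above compares `source_mac` and `dest_mac`, which have to be extracted from the raw bytes popped off the ring. A minimal sketch of that step, assuming untagged Ethernet II framing; the `Mac` and `Frame` definitions here are hypothetical, since the design does not spell them out:

```rust
/// 6-byte Ethernet address (hypothetical definition for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Mac([u8; 6]);

impl Mac {
    /// The group bit is the least-significant bit of the first octet.
    /// It is set for multicast addresses, including broadcast ff:ff:ff:ff:ff:ff.
    fn is_multicast(&self) -> bool {
        self.0[0] & 0x01 != 0
    }
}

struct Frame {
    dest_mac: Mac,
    source_mac: Mac,
    bytes: Vec<u8>,
}

impl Frame {
    /// Parse the Ethernet header: bytes 0..6 are the destination address,
    /// bytes 6..12 the source address, bytes 12..14 the EtherType.
    /// Frames too short to hold a full header are rejected.
    fn parse(bytes: Vec<u8>) -> Option<Frame> {
        if bytes.len() < 14 {
            return None;
        }
        let mut dest = [0u8; 6];
        let mut source = [0u8; 6];
        dest.copy_from_slice(&bytes[0..6]);
        source.copy_from_slice(&bytes[6..12]);
        Some(Frame {
            dest_mac: Mac(dest),
            source_mac: Mac(source),
            bytes,
        })
    }
}
```

Rejecting short frames before indexing matters here because the bytes come from a peer that may be compromised.
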
## Privilege Model

### Unprivileged Operation

vm-switch runs as an unprivileged user. User namespaces provide "fake root" capabilities:

```
Unprivileged user (UID 1000)
    │
    ├─► unshare(CLONE_NEWUSER)
    │     └─► Now "root" inside user namespace with CAP_SYS_ADMIN
    │
    ├─► Write uid_map/gid_map (map UID 1000 → 0 inside)
    │
    └─► unshare(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET)
          └─► These succeed with user namespace capabilities
```

From the host kernel's perspective, vm-switch is always UID 1000. The "root" capabilities apply only to resources owned by the user namespace, so a compromised vm-switch holds at most the host privileges of UID 1000, further reduced by the namespace and seccomp restrictions below.

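The mapping step consists of three small writes under `/proc/self`. The sketch below only builds the file contents; `id_map_writes` and `apply_id_maps` are hypothetical helpers, and the `unshare(CLONE_NEWUSER)` call that must precede them is not shown.

```rust
use std::io;

/// Files to write after unshare(CLONE_NEWUSER), in order, as (path, contents).
fn id_map_writes(outside_uid: u32, outside_gid: u32) -> Vec<(&'static str, String)> {
    vec![
        // An unprivileged process must deny setgroups() before it is
        // allowed to write gid_map.
        ("/proc/self/setgroups", "deny".to_string()),
        // Line format: "<first ID inside> <first ID outside> <count>".
        // This maps the single outside ID to 0 ("root") inside the namespace.
        ("/proc/self/uid_map", format!("0 {} 1", outside_uid)),
        ("/proc/self/gid_map", format!("0 {} 1", outside_gid)),
    ]
}

/// Perform the writes. Only meaningful inside a freshly created user namespace.
fn apply_id_maps(outside_uid: u32, outside_gid: u32) -> io::Result<()> {
    for (path, contents) in id_map_writes(outside_uid, outside_gid) {
        std::fs::write(path, contents)?;
    }
    Ok(())
}
```

For the UID 1000 example above, `uid_map` receives the single line `0 1000 1`.
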
## Sandboxing Details

### Main Process Sandbox

**Namespaces:**
- User (map real UID to root inside)
- Mount (only config dir visible)
- IPC (isolated)
- Network (empty, unix sockets via inherited FDs)

**Sequence:**
```
vm-switch starts (unprivileged)
    │
    ├─► unshare(USER | MOUNT | IPC | NETWORK)
    ├─► Write "deny" to /proc/self/setgroups
    ├─► Write UID/GID mappings
    │
    ├─► Mount minimal filesystem:
    │     tmpfs as new root
    │     bind mount --config-dir (read-write)
    ├─► pivot_root to new root
    ├─► Apply seccomp filter
    │
    └─► Event loop
```

**Seccomp whitelist:**
```
# Child management (waitpid() is implemented via wait4 on x86_64)
fork, clone, wait4

# Config watching
inotify_init1, inotify_add_watch, inotify_rm_watch

# Unix sockets (AF_UNIX only)
socket(AF_UNIX), bind, listen, accept4, sendmsg, recvmsg

# Namespace setup for children
unshare, setns, pivot_root, mount, umount2
prctl, seccomp
setresuid, setresgid, setgroups

# Memory
mmap(!PROT_EXEC), munmap, madvise, memfd_create, ftruncate

# Event loop
epoll_create1, epoll_ctl, epoll_wait, read, write

# Misc (x86_64 has rt_sigaction, not sigaction)
close, exit_group, clock_gettime, rt_sigaction, rt_sigprocmask
```

### Child Process Sandbox

**Namespaces:** Same as main (inherited or created fresh per child).

**Sandbox timing:** After setup, before processing packets:
```
fork()
    │
    ├─► Create memfd, mmap buffers
    ├─► Exchange FDs with main
    ├─► Set up vhost-user listener socket
    │
    ├─► Apply sandbox (namespaces + seccomp)
    │   ─────── security boundary ───────
    │
    └─► Accept vhost-user connection, process packets
```

**Base child seccomp whitelist:**
```
# Vhost-user protocol
accept4, read, write, recvmsg, sendmsg

# Guest memory mapping
mmap(!PROT_EXEC, MAP_SHARED), munmap, madvise

# Event loop + ring buffer notification
epoll_create1, epoll_ctl, epoll_wait, eventfd2

# Synchronization
futex

# Misc
close, exit_group, clock_gettime
```

**Extended seccomp whitelist (for children that create buffers dynamically, e.g., router):**
```
<all base child syscalls>

# Dynamic buffer creation for new peers
memfd_create, ftruncate
```

### Seccomp Violation Handling

- Default: `SeccompAction::KillProcess` (kernel `SECCOMP_RET_KILL_PROCESS`, immediate termination)
- Debug mode: `--seccomp-trap` flag uses `SeccompAction::Trap` (kernel `SECCOMP_RET_TRAP`, delivers SIGSYS, allows logging)

## Implementation

### Dependencies

- **seccompiler** - Pure Rust seccomp BPF compilation (used by Firecracker)
- **nix** - Rust bindings for namespace operations, mount, pivot_root

No C dependencies. Pure Rust implementation.

### Seccompiler Usage

```rust
use std::collections::BTreeMap;

use seccompiler::{
    apply_filter, BpfProgram, SeccompAction, SeccompCmpArgLen as ArgLen,
    SeccompCmpOp as Cmp, SeccompCondition, SeccompFilter, SeccompRule, TargetArch,
};

let mut rules = BTreeMap::new();

// Allow read/write unconditionally (an empty rule list matches any arguments)
rules.insert(libc::SYS_read, vec![]);
rules.insert(libc::SYS_write, vec![]);

// Allow socket() only for AF_UNIX
rules.insert(libc::SYS_socket, vec![
    SeccompRule::new(vec![
        SeccompCondition::new(0, ArgLen::Dword, Cmp::Eq, libc::AF_UNIX as u64)?,
    ])?,
]);

// Allow mmap() without PROT_EXEC
rules.insert(libc::SYS_mmap, vec![
    SeccompRule::new(vec![
        SeccompCondition::new(2, ArgLen::Dword, Cmp::MaskedEq(libc::PROT_EXEC as u64), 0)?,
    ])?,
]);

let action = if trap_mode {
    SeccompAction::Trap
} else {
    SeccompAction::KillProcess
};

// Arguments: rules, mismatch action, match action, target architecture
let filter = SeccompFilter::new(rules, action, SeccompAction::Allow, TargetArch::x86_64)?;
let program: BpfProgram = filter.try_into()?;
apply_filter(&program)?;
```

## Process Lifecycle

### Main Process Responsibilities

Main is the coordinator:
- Watches config dir for MAC file changes (inotify)
- Forks children for each VM
- Initiates buffer handoff via `PutBuffer` when new peer appears
- Forwards `BufferReady` messages between children
- Sends `RemoveBuffer` when peer disappears or crashes
- Monitors children via `waitpid()`

Children are passive responders to main's messages.

### Child Crash Detection

Main detects child crashes via standard Unix mechanisms:
1. Kernel sends `SIGCHLD` to main when child exits
2. Main calls `waitpid()` to reap child and get exit status
3. Main sends `RemoveBuffer { name }` to all peers that had buffers with crashed child
4. Main can optionally respawn the child

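Step 3 requires main to remember which peers share buffers with each child. A minimal sketch of that bookkeeping, assuming a hypothetical `PeerTable` keyed by peer name; sending the actual `RemoveBuffer` messages is left to the caller:

```rust
use std::collections::{HashMap, HashSet};

/// For each child, the set of peers it currently shares buffers with.
/// (Hypothetical structure; the design does not prescribe one.)
struct PeerTable {
    peers: HashMap<String, HashSet<String>>,
}

impl PeerTable {
    fn new() -> Self {
        PeerTable { peers: HashMap::new() }
    }

    /// Record that `a` and `b` have exchanged buffers.
    fn connect(&mut self, a: &str, b: &str) {
        self.peers.entry(a.to_string()).or_default().insert(b.to_string());
        self.peers.entry(b.to_string()).or_default().insert(a.to_string());
    }

    /// Call after waitpid() reports that `crashed` has exited.
    /// Returns the peers that should receive `RemoveBuffer { name: crashed }`
    /// and forgets the crashed child.
    fn on_child_exit(&mut self, crashed: &str) -> Vec<String> {
        let affected = self.peers.remove(crashed).unwrap_or_default();
        let mut notify: Vec<String> = affected.into_iter().collect();
        notify.sort(); // deterministic order for logging
        for peer in &notify {
            if let Some(set) = self.peers.get_mut(peer) {
                set.remove(crashed);
            }
        }
        notify
    }
}
```

In the 1:N topology a crashed client yields a single notification (to the router), while a crashed router yields one per connected client.
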
### Main Crash Handling

Children detect main crash via control socket:
1. Control socket returns EOF or error
2. Child exits cleanly
3. Systemd restarts main process
4. Main respawns all children fresh

Children do not attempt to continue operating without main.

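Step 1 amounts to interpreting the result of a read on the control socket. A minimal sketch using the standard library's `UnixStream`; `main_is_gone` is a hypothetical helper, and a real child would decode the bytes it reads instead of discarding them:

```rust
use std::io::{ErrorKind, Read};
use std::os::unix::net::UnixStream;

/// Returns true when the control socket shows that main has gone away:
/// a zero-length read (EOF) or a hard error.
fn main_is_gone(control: &mut UnixStream) -> bool {
    let mut buf = [0u8; 64];
    match control.read(&mut buf) {
        Ok(0) => true,  // EOF: main closed its end or exited
        Ok(_) => false, // a message arrived; main is still alive
        // A signal interrupted the read; this says nothing about main.
        Err(e) if e.kind() == ErrorKind::Interrupted => false,
        Err(_) => true,
    }
}
```

The same check covers orderly shutdown, since main closing the control socket also produces EOF on the child's end.
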
### Buffer Cleanup

**Ownership:** Producer (creator) owns the memfd.

**Normal cleanup:**
1. Child exits normally
2. Kernel closes child's FDs automatically
3. Peers receive `RemoveBuffer`, unmap and close their FD references
4. When last reference closed, kernel frees the memfd

**Crash cleanup:**
1. Child crashes
2. Kernel closes child's FDs (same as normal exit)
3. Main detects via `waitpid()`, sends `RemoveBuffer` to peers
4. Same cleanup path as normal

**RemoveBuffer race conditions:**
- On `RemoveBuffer`, consumer stops reading immediately
- In-flight frames in that buffer are discarded (acceptable)
- Producer may write briefly after consumer unmaps (harmless - just wasted work)
- No synchronization needed between producer and consumer for cleanup

## Logging

Children inherit stdout/stderr from main. Systemd captures output and forwards to journald. No special handling needed.

## Shutdown

1. Main receives SIGTERM
2. Main closes control sockets to children
3. Children detect EOF, exit cleanly
4. Main calls `waitpid()` with timeout
5. SIGKILL any remaining children after timeout

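Steps 4 and 5 can be sketched as a polling loop with a deadline. This is an illustration under stated assumptions: the `Reapable` trait is hypothetical and exists so the logic can be exercised without real processes, and the implementation for `std::process::Child` stands in for children that the design actually creates with `fork()`.

```rust
use std::process::Child;
use std::time::{Duration, Instant};

/// The two operations shutdown needs from a child process (hypothetical trait).
trait Reapable {
    /// Non-blocking check; returns true once the child has exited and been reaped.
    fn try_reap(&mut self) -> bool;
    /// Forcefully terminate the child and reap it.
    fn force_kill(&mut self);
}

impl Reapable for Child {
    fn try_reap(&mut self) -> bool {
        matches!(self.try_wait(), Ok(Some(_)))
    }

    fn force_kill(&mut self) {
        // On Unix, Child::kill sends SIGKILL.
        let _ = self.kill();
        let _ = self.wait();
    }
}

/// Wait up to `timeout` for all children to exit, then kill the rest.
/// Returns how many children had to be killed.
fn reap_with_timeout<C: Reapable>(children: &mut Vec<C>, timeout: Duration) -> usize {
    let deadline = Instant::now() + timeout;
    loop {
        // Keep only the children that have not exited yet.
        children.retain_mut(|child| !child.try_reap());
        if children.is_empty() {
            return 0;
        }
        if Instant::now() >= deadline {
            break;
        }
        std::thread::sleep(Duration::from_millis(10));
    }
    let killed = children.len();
    for child in children.iter_mut() {
        child.force_kill();
    }
    children.clear();
    killed
}
```

A non-zero return value is worth logging, since it means a child ignored the EOF on its control socket.
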
## Memory Sharing with crosvm

vhost-user protocol passes guest memory via `SCM_RIGHTS` FD passing. This works across different UIDs:

1. crosvm creates memfd for guest memory
2. crosvm sends FD via unix socket
3. vm-switch child receives FD (now has capability to access)
4. vm-switch child mmaps the FD

The FD transfer grants access regardless of UID differences.

## Future Considerations

- Running crosvm unprivileged (each VM as separate user) - compatible with this design
- Additional Landlock filesystem restrictions if needed
- Audit logging of seccomp violations in production