Isolation models, ranked by what they actually break

The container ecosystem talks about “isolation” as if it were a scalar. It is not. There are at least six qualitatively different mechanisms, each with a different break model: a different answer to “what does it take for the workload to do something the operator did not authorize?” This essay walks through them roughly in order of strength and is honest about what each one fails at.

1. Process boundary

The weakest meaningful isolation: separate Unix processes, no other intervention. The boundary is enforced by the kernel via standard POSIX permissions. A process running as a different user cannot directly read another user’s memory or kill that user’s processes, but anything writable by the target user, anything mapped shared, and anything in /tmp not created with O_TMPFILE is fair game. The OS scheduler shares CPU. The page cache is shared. Side channels are shared. This is the model for “just run it as nobody” and it is approximately the model used by the average shared web host in 2003. The break model is: a single vulnerable setuid binary, a writable shared directory, or a network listener bound to 0.0.0.0. Hundreds of such primitives ship in a default Linux install.
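One way to make “hundreds of such primitives” concrete is to count the setuid binaries on a host. The following is a minimal sketch, not an audit tool: the function names are made up for this illustration, and it only checks the set-user-ID bit on regular files, which is just one of the primitives listed above.

```python
import os
import stat


def has_setuid(mode: int) -> bool:
    """True if `mode` describes a regular file with the set-user-ID bit."""
    return stat.S_ISREG(mode) and bool(mode & stat.S_ISUID)


def find_setuid(root: str) -> list[str]:
    """Walk `root` and return sorted paths of regular setuid files."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat: inspect the entry itself, do not follow symlinks
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if has_setuid(mode):
                hits.append(path)
    return sorted(hits)
```

Calling find_setuid("/usr/bin") on a typical distribution usually returns entries such as the password and su utilities; every path it returns is code that runs with its owner’s privileges no matter who launches it.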

Use when: never, for untrusted code, in 2026.

2. Namespace + cgroup boundary

The default Docker model. The process gets a private PID space, mount table, network stack, and IPC space, plus a remapped user-ID range when user namespaces are enabled (Docker leaves them off by default). Cgroups cap resources. The kernel allocates these as cheap kernel objects; cost is essentially zero. The break model is the entire syscall surface (~350 syscalls), the entire LSM hook surface, the entire procfs/sysfs surface, and any historical bug in the namespace logic itself.
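The “cheap kernel objects” are visible from userspace: Linux exposes each process’s namespaces as symlinks under /proc/<pid>/ns, and two processes share a namespace exactly when the link targets match. A small sketch of that check follows; the helper names are hypothetical, and it returns an empty result on hosts without procfs.

```python
import os


def namespace_ids(pid: str = "self") -> dict[str, str]:
    """Map namespace type to an identifier such as 'pid:[4026531836]'.

    Returns an empty dict when /proc/<pid>/ns is missing or unreadable
    (non-Linux host, or a process the caller may not inspect).
    """
    ns_dir = f"/proc/{pid}/ns"
    ids: dict[str, str] = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return ids
    for name in names:
        try:
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # some entries can be unreadable; skip them
    return ids


def shared_namespaces(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """Namespace types for which two processes hold the same kernel object."""
    return sorted(k for k in a.keys() & b.keys() if a[k] == b[k])
```

Comparing namespace_ids() on the host with namespace_ids(pid) for a containerized process shows which boundaries actually exist. Anything that appears in shared_namespaces is not isolated at all.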

A non-exhaustive list of escape avenues that have shipped as CVEs since 2019: runc’s /proc/self/exe handling (CVE-2019-5736, which lets a container overwrite the runc binary on the host), filesystem context overflows (CVE-2022-0185, a heap overflow reachable from a user namespace), cgroups v1 release_agent (CVE-2022-0492), PID namespace bugs around setns(), bugs in the BPF verifier and JIT, and OverlayFS metadata copy-up bugs.

Use when: the workload is friendly and you just want resource isolation. Not when the workload is hostile, not when the workload is LLM-generated, not when the workload is a third-party plugin.

3. Namespaces + syscall filter (where ZViz lives)

Same primitives as level 2, plus a seccomp-BPF filter that constrains which syscalls can be made. Plus capability drop. Plus an LSM ruleset — Landlock for unprivileged filesystem access control, optionally AppArmor or SELinux on top.

The break model shrinks dramatically. Where level 2’s break model is “any bug in any syscall the kernel exposes,” level 3’s break model is “any bug in any allowed syscall.” For a ZViz default profile, that’s 132 syscalls plus argument-filtered socket(). The kernel attack surface for unshare, ptrace, bpf, mount, init_module, kexec_load, keyctl, userfaultfd, io_uring_setup, perf_event_open — all gone, no code path reachable.
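The allow-list logic can be modeled in a few lines. This is a toy model of the decision only, not BPF and not ZViz’s actual profile: the five-entry allow-list is a hypothetical stand-in for the real one, and the socket() rule (AF_UNIX only) is an assumption made for illustration. A real seccomp filter matches on syscall numbers and raw argument registers and cannot dereference pointers.

```python
AF_UNIX, AF_INET = 1, 2  # Linux address-family constants

DENY = "ERRNO(EPERM)"

# Hypothetical, heavily abbreviated allow-list; a real profile is longer.
ALLOWED = {"read", "write", "mmap", "clock_gettime", "exit_group"}

# Assumed argument rule: socket() is permitted only for these families.
ALLOWED_SOCKET_FAMILIES = {AF_UNIX}


def decide(syscall: str, args: tuple = ()) -> str:
    """Return 'ALLOW' or 'ERRNO(EPERM)' as an allow-list filter would."""
    if syscall == "socket":
        # Argument filtering: inspect the first argument (address family).
        family = args[0] if args else None
        return "ALLOW" if family in ALLOWED_SOCKET_FAMILIES else DENY
    # Default-deny: anything not explicitly listed never reaches its handler.
    return "ALLOW" if syscall in ALLOWED else DENY
```

The design point is the default-deny branch. The dangerous syscalls do not need to be enumerated; bpf, userfaultfd, and whatever the kernel adds next year are rejected because nobody listed them.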

Performance is the headline benefit. The allowed syscalls run at native speed. clock_gettime() uses the kernel vDSO. read()/write() go straight to the VFS. There is no userspace interposition, no context switch tax, no emulation layer. ZViz benchmarks the difference: 20ns clock_gettime (ZViz) vs 4,982ns (gVisor), 212ns vs 4,393ns for read, 211ns vs 1,169ns for write.
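Turning those latencies into ratios makes the gap easier to compare across syscalls. The numbers below are copied from the paragraph above; nothing else is measured here.

```python
# Per-call latencies in nanoseconds: (ZViz, gVisor), from the text above.
BENCH_NS = {
    "clock_gettime": (20, 4982),
    "read": (212, 4393),
    "write": (211, 1169),
}


def slowdown(native_ns: int, emulated_ns: int) -> float:
    """How many times slower the emulated path is than the native one."""
    return emulated_ns / native_ns


for name, (zviz_ns, gvisor_ns) in BENCH_NS.items():
    print(f"{name}: {slowdown(zviz_ns, gvisor_ns):.1f}x")
# prints:
# clock_gettime: 249.1x
# read: 20.7x
# write: 5.5x
```

The spread matters as much as the headline figure: the penalty is largest where the native path is cheapest, because the interposition cost is roughly fixed per call.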

The honest weakness: a kernel CVE in an allowed syscall still works. read() is on the allow-list and always will be. The argument is that the defender’s job is to shrink the reachable surface enough that the residual risk is acceptable for the workload class — and the empirical record on read/write/mmap bugs is much better than the record on bpf and userfaultfd.

Use when: AI-agent code execution, CI sandboxing, multi-tenant code-execution platforms, anything that needs sub-10ms cold starts and a small memory floor.

4. Userspace kernel (gVisor)

A different bet. Instead of constraining which syscalls the workload can call, intercept all of them and re-implement the kernel ABI in userspace. The workload calls read(); the call lands in gVisor’s Sentry process; Sentry runs its own filesystem code; if Sentry needs to actually touch the disk, it makes a much smaller, vetted call to the host kernel (~70 host syscalls total in a typical workload).

Break model: a bug in Sentry’s reimplementation, or a bug in one of the ~70 host syscalls Sentry uses. Both surfaces are smaller than the level-2 surface; the second is smaller than the level-3 surface. The trade-off is paid on every syscall: 50–100% overhead in the steady state, sometimes much more (clock_gettime is 249x slower because Sentry cannot use the kernel vDSO). Cold start is ~200ms because Sentry itself has to initialize. Memory floor is ~50 MB because Sentry is a real userspace program with a runtime.

In exchange, gVisor “allows” things ZViz blocks: ptrace, mount, unshare, nested namespaces. They are sandboxed in Sentry’s virtual environment, which is why Docker-in-Docker works inside gVisor. The container thinks it created a real PID namespace; it actually created an entry in Sentry’s emulated process table. Two valid philosophies, equivalent security outcome on those operations.

Use when: you need Docker-in-Docker, Bazel/Nix internal sandboxing, strace, or the strongest “I cannot trust this workload at all” position with compatibility for legitimate dangerous syscalls.

5. MicroVM (Firecracker, Cloud Hypervisor)

A KVM guest with the minimum possible attack surface. Firecracker’s design is famously paranoid: no PCI, no ACPI, no BIOS, virtio devices only, jailed VMM process. The boundary is enforced by hardware virtualization — the workload runs in ring 3 of a guest CPU whose ring 0 is a separate Linux kernel.

Break model: a VMM bug (Firecracker’s surface is tiny; the audit history is excellent), a hardware bug (Spectre, Meltdown, and MDS are patchable but real), or a virtio device bug. The host kernel’s syscall surface is not directly reachable by the workload; the workload calls into its own guest kernel, which reaches the host only through VM exits handled by KVM and the VMM.

Cost: ~125ms cold start (down from “seconds” in older VM stacks), ~5 MB memory floor for the VMM plus the guest kernel and rootfs. Requires VT-x/AMD-V hardware.
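The hardware requirement is easy to check before committing to this tier. On x86 Linux the CPU flags line in /proc/cpuinfo lists vmx for VT-x and svm for AMD-V. The sketch below parses that text; the function names are hypothetical, it covers x86 only, and a present flag does not guarantee that /dev/kvm is usable (firmware settings or a cloud provider without nested virtualization can still block it).

```python
def hw_virt_flags(cpuinfo_text: str) -> set[str]:
    """Return the subset of {'vmx', 'svm'} found in /proc/cpuinfo text."""
    found: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            found |= {"vmx", "svm"} & set(value.split())
    return found


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the cpuinfo file, returning '' if it is missing or unreadable."""
    try:
        with open(path, encoding="utf-8", errors="replace") as fh:
            return fh.read()
    except OSError:
        return ""
```

An empty result from hw_virt_flags(read_cpuinfo()) means levels 5 and 6 are off the table on that host, whatever the threat model says.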

Use when: function-as-a-service platforms running arbitrary tenant code (Lambda, Fly, Modal), high-tier compliance regimes, anywhere you need a defensible “the boundary is hardware-enforced” claim.

6. Kata Containers / “container that is actually a VM”

A QEMU-or-Cloud-Hypervisor guest dressed in the OCI runtime ABI. The boundary is the same hardware virtualization as a microVM, but the surface area is larger (full virtio, more drivers) and the footprint is heavier: ~1s cold start, ~128 MB memory floor. Compatibility is excellent: Kata runs nearly any container that runc does, because the inside is a near-complete Linux environment.

Use when: you want VM-grade isolation and your workloads are Linux containers that won’t fit into Firecracker’s bare-bones model.

The ranking, ordered by what they break

Model	What it actually fails at	Cold start	Memory
Process	Anything writable by the target user; the entire syscall surface; side channels	0	0
Namespace+cgroup	Kernel CVE in any of ~350 syscalls; namespace logic bugs	~50ms	~0
Seccomp-filtered (ZViz)	Kernel CVE in any of ~132 allowed syscalls	~8ms	~2 MB
Userspace kernel (gVisor)	Bug in Sentry; CVE in any of ~70 host syscalls Sentry uses	~200ms	~50 MB
MicroVM (Firecracker)	VMM bug; hardware side channel; virtio bug	~125ms	~5 MB
Kata Containers	VMM bug; full QEMU surface; full virtio	~1s	~128 MB

Picking honestly

No model is universally correct. The two questions worth asking are: (1) what is the trust level of the workload, and (2) what is the latency/density budget? “AI agents executing snippets, 1000 per second, ten-second TTL” lives at level 3 or 4. “Customer multi-tenant Lambda” lives at level 5. “An internal tool you built, running on your own infra” can live at level 2 and probably should. A single org will operate two or three of these models for different tenants of the same platform.
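The two questions can be turned into a crude filter over the ranking table. This is a sketch under stated assumptions: the cold-start and memory figures are the approximate ones from the table, and using the section number as a “minimum acceptable level” is a simplification introduced here, not a rule from any of the projects named.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    level: int            # section number in this essay (assumed proxy)
    name: str
    cold_start_ms: float  # approximate, from the ranking table
    memory_mb: float      # approximate memory floor, from the table


MODELS = [
    Model(1, "Process", 0, 0),
    Model(2, "Namespace+cgroup", 50, 0),
    Model(3, "Seccomp-filtered (ZViz)", 8, 2),
    Model(4, "Userspace kernel (gVisor)", 200, 50),
    Model(5, "MicroVM (Firecracker)", 125, 5),
    Model(6, "Kata Containers", 1000, 128),
]


def candidates(min_level: int, max_cold_start_ms: float,
               max_memory_mb: float = float("inf")) -> list[str]:
    """Models at or above `min_level` that fit the latency/density budget."""
    return [
        m.name for m in MODELS
        if m.level >= min_level
        and m.cold_start_ms <= max_cold_start_ms
        and m.memory_mb <= max_memory_mb
    ]
```

For example, candidates(3, 10) leaves only the seccomp-filtered model, which matches the sub-10ms use case above, while candidates(5, 100) returns an empty list: a hardware boundary with a 100ms budget is not on the menu, and one of the two requirements has to give.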

The model that does not make this list is “we ran it in Docker and called it sandboxed.” That is a category error. Docker’s job description, since 2013, has been packaging. Isolation has always been a side effect that the marketing department was too generous about. Pick the right layer.