Why namespaces aren't enough for untrusted code

Containers are not a security boundary. They are a packaging boundary that the industry has, over a decade of marketing, convinced itself is a security boundary. The misunderstanding is older than Docker — chroot had the same problem in 1979 — and every few months a fresh runc CVE re-litigates the point. The argument for namespaces-only isolation goes: “the workload can’t see host PIDs, host mounts, host network interfaces, host user IDs, so it can’t attack them.” The argument’s defect is that the workload doesn’t need to see them. It needs to call into the kernel, and the kernel sees everything.

What a namespace actually is

A Linux namespace is a per-process view of a kernel-managed resource. CLONE_NEWPID creates a PID namespace whose first process becomes PID 1. CLONE_NEWNS gives the process a private mount table. CLONE_NEWUSER remaps UID/GID ranges. CLONE_NEWNET produces an isolated network stack. None of these mechanisms adds new authorization checks. They change what the calling process can name, not what the kernel will do on its behalf when correctly asked.
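To make the "naming, not authorization" point concrete, here is a minimal sketch that decodes the namespace bits in a clone()/unshare() flag mask. The flag values are the ones defined in <linux/sched.h>; the helper name namespaces_requested is made up for this illustration.

```python
# Namespace flag values as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "CLONE_NEWNS": 0x00020000,      # mount namespace
    "CLONE_NEWCGROUP": 0x02000000,  # cgroup namespace
    "CLONE_NEWUTS": 0x04000000,     # hostname / domain name
    "CLONE_NEWIPC": 0x08000000,     # System V IPC, POSIX message queues
    "CLONE_NEWUSER": 0x10000000,    # UID/GID mappings
    "CLONE_NEWPID": 0x20000000,     # PID numbering
    "CLONE_NEWNET": 0x40000000,     # network stack
}


def namespaces_requested(flags: int) -> list[str]:
    """Hypothetical helper: list the namespace flags set in a flag mask."""
    return [name for name, bit in CLONE_FLAGS.items() if flags & bit]


mask = (
    CLONE_FLAGS["CLONE_NEWUSER"]
    | CLONE_FLAGS["CLONE_NEWPID"]
    | CLONE_FLAGS["CLONE_NEWNS"]
)
print(namespaces_requested(mask))
```

Every bit in the mask selects a view. None of them names a syscall or a permission check, which is why the reachable kernel surface is the same before and after the call.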

Concretely: a process in a user namespace can still call keyctl(), bpf(), perf_event_open(), io_uring_setup(), userfaultfd(), add_key(), and every other syscall the kernel exports. If any of those code paths contains a memory-safety bug — and they have, repeatedly — the namespace does not stop the exploit. CVE-2022-0185 (filesystem context heap overflow) is reachable from a user namespace. CVE-2022-0492 (cgroups v1 release_agent) is reachable when CAP_SYS_ADMIN is held in the user’s own namespace. The pattern is consistent across the last five years of container-escape CVEs: the namespace was a property the attacker noted on the way past, not a wall they had to break.

The cgroups misunderstanding

Cgroups are a quota mechanism. They cap memory, CPU, PIDs, block-I/O bandwidth. They are essential for multi-tenant systems and they are not a security mechanism. A workload pinned to 512 MB of RAM and 64 PIDs still calls bpf() and unshare() the same way as one with no limits. A fork bomb is capped; a kernel exploit is not.
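A small sketch of what such a quota looks like on disk, assuming cgroup v2 and treating "512 MB" as 512 MiB. The function name cgroup_v2_limits is hypothetical; memory.max and pids.max are the standard cgroup v2 control files.

```python
def cgroup_v2_limits(memory_mib: int, max_pids: int) -> dict[str, str]:
    """Hypothetical helper: map quota settings to cgroup v2 control files."""
    return {
        "memory.max": str(memory_mib * 1024 * 1024),  # limit in bytes
        "pids.max": str(max_pids),                    # max tasks in the cgroup
    }


limits = cgroup_v2_limits(memory_mib=512, max_pids=64)
for filename, value in limits.items():
    print(f"{filename} = {value}")
```

Both values bound how much of a resource the workload may consume. Neither control file has a field for which syscalls the workload may make.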

This matters because the standard “secure container” recipe in production is: namespaces + cgroups + (maybe) a default seccomp profile. Two of those three are not security primitives. The third is, but the default profiles are too permissive by design, because they have to ship to everyone.

What the default seccomp profile leaves open

Docker’s default seccomp profile blocks roughly 44 syscalls. The kernel exposes around 350. The ratio doesn’t matter much in itself, since many syscalls are uninteresting from an exploit perspective, but several extremely interesting ones are allowed, or allowed under conditions, because legitimate workloads need them: clone() (namespace-creating flags are filtered only when the container lacks CAP_SYS_ADMIN), unshare() (allowed whenever the container holds CAP_SYS_ADMIN), and keyctl() (allowed with conditions). bpf() is blocked, but a default profile that blocks bpf() is also a profile that breaks anything wanting eBPF observability. The pattern is unavoidable: a general-purpose profile cannot be a strict profile, because every dangerous syscall is needed by some legitimate workload.

The fix is not a better default profile. The fix is per-workload profiles where the workload is the unit being secured, not the runtime.
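As a rough sketch of what "per-workload" means in practice, the snippet below builds a deny-by-default profile in the JSON layout that Docker's --security-opt seccomp=<file> accepts. The helper build_seccomp_profile and the allow-list are hypothetical; a real web server needs far more syscalls, and the list would normally come from tracing the workload.

```python
import json


def build_seccomp_profile(allowed_syscalls: list[str]) -> dict:
    """Hypothetical helper: deny everything except the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",     # anything unlisted fails
        "architectures": ["SCMP_ARCH_X86_64"],  # assumption: x86-64 only
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Illustrative only: far too short for a real workload.
web_server_allowlist = [
    "read", "write", "close", "accept4", "epoll_wait", "exit_group",
]
profile = build_seccomp_profile(web_server_allowlist)
print(json.dumps(profile, indent=2))
```

The design choice is the direction of the default: the profile starts from deny and adds what this one workload needs, instead of starting from allow and subtracting what nobody is assumed to need.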

Capabilities are necessary, not sufficient

The Linux capability system splits root into 41 finer-grained privileges. Dropping them all from a container is good hygiene, because several escape paths require holding a capability somewhere in the workload’s user namespace. CAP_SYS_ADMIN is the most famous: it’s still the second root. CAP_NET_RAW enables raw socket creation. CAP_BPF (split out from CAP_SYS_ADMIN in Linux 5.8) lets you load BPF programs.

But the capability system pre-dates user namespaces and the interaction is subtle. A process can hold CAP_SYS_ADMIN in its own user namespace and not in the host’s. That’s still enough to do interesting things inside the namespace — including, historically, things that bled across the boundary. CVE-2022-0492 is the canonical example: cgroups v1’s release_agent path didn’t check the namespace the writing process was actually privileged in. Dropping capabilities does not save you from that class of bug; only not creating the user namespace, or not allowing the syscall, does.
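The capabilities named above can be read out of the CapEff line in /proc/<pid>/status, which is a hexadecimal bitmask. The sketch below decodes three of them; the bit numbers come from <linux/capability.h>, while the helper name held and the sample masks are illustrative.

```python
# Bit positions from <linux/capability.h>.
CAPABILITY_BITS = {
    "CAP_NET_RAW": 13,
    "CAP_SYS_ADMIN": 21,
    "CAP_BPF": 39,
}

# With 41 capabilities (bits 0..40), a full set is 41 one-bits.
ALL_41_CAPABILITIES = (1 << 41) - 1


def held(cap_eff_hex: str) -> list[str]:
    """Hypothetical helper: which of the capabilities above are in the mask."""
    mask = int(cap_eff_hex, 16)
    return [name for name, bit in CAPABILITY_BITS.items() if mask & (1 << bit)]


print(held("000001ffffffffff"))  # full set
print(held("00000000a80425fb"))  # example of a reduced set
print(held("0000000000000000"))  # everything dropped
```

Note what the mask does not say: which user namespace the bits apply to. A process that is root inside a fresh user namespace shows a full mask there while holding nothing on the host, which is exactly the subtlety described above.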

The selective-denial argument

ZViz’s argument is straightforward: the only durable way to shrink the attack surface is to shrink the reachable syscall set. Not to emulate syscalls in userspace (that’s gVisor’s bet, and it carries a 5–250x performance tax on every allowed call), but to deny them at the seccomp layer before any kernel code runs.

The ZViz default profile drops all 41 capabilities, applies a Landlock ruleset for filesystem control, mounts /proc, /sys, /dev privately inside the container, and runs the workload as PID 1 of a fresh user/pid/mount/ipc/uts namespace. Then it loads a 124-instruction BPF filter that allows 132 syscalls to reach the host kernel at native speed, denies 24 outright with EPERM, and routes one (socket()) through an argument-filter that lets AF_INET/AF_INET6 through and blocks AF_PACKET, AF_NETLINK, and the rest.
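The socket() argument filter can be modeled in a few lines. This is a plain-Python model of the decision, not ZViz's actual BPF program; the address-family numbers are the Linux values, and returning EPERM for a blocked family is an assumption carried over from the outright denials.

```python
import errno

# Linux address-family numbers.
AF_UNIX, AF_INET, AF_INET6, AF_NETLINK, AF_PACKET = 1, 2, 10, 16, 17

# Per the description above: only IPv4 and IPv6 pass.
ALLOWED_FAMILIES = {AF_INET, AF_INET6}


def filter_socket(family: int) -> int:
    """Return 0 to let socket() reach the kernel, else the errno to fail with.

    Assumption: blocked families fail with EPERM.
    """
    return 0 if family in ALLOWED_FAMILIES else errno.EPERM


for name, family in [("AF_INET", AF_INET), ("AF_PACKET", AF_PACKET)]:
    verdict = "allow" if filter_socket(family) == 0 else "deny"
    print(f"socket({name}) -> {verdict}")
```

A real seccomp filter makes the same comparison on the first syscall argument before any socket code in the kernel runs.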

The 24 blocked syscalls are not chosen by vibes. They are the ones whose mere availability is a foothold: ptrace, mount, unshare, bpf, kexec_load, init_module, finit_module, delete_module, keyctl, add_key, request_key, setns, pivot_root, userfaultfd, perf_event_open, io_uring_setup, process_vm_readv/writev, swapon/swapoff, clock_settime, reboot, chroot, personality(ADDR_NO_RANDOMIZE). Some of these block legitimate workloads — unshare breaks Bazel’s internal sandboxing, ptrace breaks strace. The README is explicit about this trade-off and points those workloads at gVisor instead.
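The list above can be written down as data, which also makes the count checkable. This sketch assumes that personality is denied only when ADDR_NO_RANDOMIZE is set, which is one reading of the notation above; the names DENIED_ALWAYS and is_denied are hypothetical.

```python
# The 23 syscalls denied regardless of arguments, from the list above.
DENIED_ALWAYS = frozenset({
    "ptrace", "mount", "unshare", "bpf", "kexec_load",
    "init_module", "finit_module", "delete_module",
    "keyctl", "add_key", "request_key", "setns", "pivot_root",
    "userfaultfd", "perf_event_open", "io_uring_setup",
    "process_vm_readv", "process_vm_writev",
    "swapon", "swapoff", "clock_settime", "reboot", "chroot",
})

# From <linux/personality.h>: disables address-space layout randomization.
ADDR_NO_RANDOMIZE = 0x0040000


def is_denied(syscall: str, first_arg: int = 0) -> bool:
    """Hypothetical model of the deny decision for one syscall."""
    if syscall == "personality":
        # Assumption: only the ASLR-disabling flag is rejected.
        return bool(first_arg & ADDR_NO_RANDOMIZE)
    return syscall in DENIED_ALWAYS


print(len(DENIED_ALWAYS) + 1)  # 23 unconditional denials plus personality
print(is_denied("unshare"), is_denied("read"))
```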

“But my workload doesn’t need those”

Right. That’s the whole pitch. If your workload is a web server, a database client, a script, an LLM-generated Python program, a build job that doesn’t internally sandbox, an ETL pipeline — you don’t need 350 syscalls. You need roughly 70, plus or minus the libc version, and the rest are pure attack surface. Selective denial is a way of charging zero performance for that observation.

The README cites the comparison directly: 19 out of 19 escape tests blocked, 8 out of 8 live attacks blocked, 98.2% policy match with gVisor across 55 individual checks. The single delta is network egress default (ZViz: deny; gVisor: allow), which is a defensible difference in either direction.
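The percentage follows directly from the two numbers quoted: one differing check out of 55. A quick check of the arithmetic:

```python
total_checks = 55
differing_checks = 1  # the network egress default

match_percent = 100 * (total_checks - differing_checks) / total_checks
print(f"{match_percent:.1f}%")  # prints 98.2%
```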

What this does not solve

ZViz does not solve kernel CVEs in allowed syscall paths. If a flaw in read() lets an attacker pivot to ring 0, ZViz can’t help — read() is on the allow-list, by necessity, in every profile. The honest framing is: ZViz removes the dangerous-by-design syscalls from reach, leaving the dangerous-by-bug surface. That surface is much smaller. It is not zero.

For workloads where it must be effectively zero (a hostile state actor inside the container, the highest-tier compliance regimes), you want a hardware boundary: Firecracker or Kata Containers running a KVM guest. The performance tax is real (cold start in the hundreds of milliseconds, ~5–128 MB memory floor depending on guest), but the boundary is enforced by silicon, not by a BPF filter running inside the same kernel it is meant to protect.

Pick the right boundary for the threat. Namespaces alone are not one.