Virtualize /etc/hosts: always-on, image-seeded, leak-proof shim#69
Merged
Conversation
…'s file
resolve_net_allow used to emit `etc_hosts: Some(...)` only when at least
one `net_allow` rule named a concrete hostname; otherwise the seccomp
openat shim was not registered and `openat("/etc/hosts")` fell through
to the host's on-disk file. That broke the "loopback-only view" the
netlink and procfs virtualization already promise, and made sandbox
behavior depend on whichever distro's `/etc/hosts` happened to be
installed (Ubuntu 20.04 parks `::1` under `ip6-localhost`, not
`localhost`, so AI_ADDRCONFIG localhost lookups silently dropped v6).
Make `etc_hosts` a plain `String` that always carries the loopback base
(`127.0.0.1 localhost` / `::1 localhost`), with `net_allow` host entries
appended on top, and register the openat shim unconditionally. Add a
direct test that reads `/etc/hosts` from inside the sandbox and asserts
the exact synthetic content.
Signed-off-by: Cong Wang <cwang@multikernel.io>
The handler installed in 3278445 only matched `openat(_, "/etc/hosts", ...)`, leaving four bypasses that read the host's real on-disk file and undid the commit's no-leak guarantee: legacy `open(path, ...)` (not registered at all), `openat2(...)` (a distinct syscall number used by some Go and hand-rolled libcs), dirfd-relative spellings like `openat(open("/etc"), "hosts", ...)`, and non-canonical absolutes like `/etc/../etc/hosts`, `//etc/hosts`, or `/etc/./hosts`. Register the shim for `open`, `openat`, and `openat2`, and rewrite `handle_etc_hosts_open` to dispatch on `notif.data.nr` so each ABI hits the right argument slots. Resolve the (dirfd, pathname) pair to an absolute path via `/proc/<pid>/cwd` or `/proc/<pid>/fd/<dirfd>` and lexically normalize it (collapse `.`, `..`, redundant `/`) before comparing to `/etc/hosts`. The BPF notif list now carries `open` and `openat2` unconditionally so callers picking those syscalls actually generate a notification instead of slipping through ALLOW. Add a regression test that exercises each bypass shape (dirfd-relative + three non-canonical spellings) and asserts the synthetic content comes back every time. Signed-off-by: Cong Wang <cwang@multikernel.io>
Last commit closed the literal-path bypass for /etc/hosts but the same 4 shapes (legacy `open`, `openat2`, dirfd-relative, non-canonical) still slipped past every other open-keyed shim and most of them are actual security boundaries: /proc/kcore (and friends in `is_sensitive_proc`) returns physical memory, the per-PID `/proc/<pid>/` filter hides other sandboxes' state from each other, /proc/cpuinfo + /proc/meminfo + /proc/net/* fudge resource accounting, /etc/hostname leaks the host identity, and /dev/urandom bypasses the deterministic seed. Lift the `(dirfd, path)` → normalized absolute path logic out of `handle_etc_hosts_open` into a shared `procfs::resolve_open_target` that dispatches on `notif.data.nr` for open / openat / openat2. Convert `handle_proc_open`, `handle_hostname_open`, and `handle_random_open` to use it, and add `open_family_syscalls()` in the dispatch builder so each handler is registered for all three syscalls in one place. The BPF notif list already carries open and openat2 unconditionally (added by 0fc09b4), so the kernel produces a notification regardless of which spelling the caller picked. Add regression tests for the new coverage: the /proc/kcore deny, the /proc/cpuinfo virtualization, and the /etc/hostname shim each get a test that walks dirfd-relative, `..`-laden, repeated-slash, and `./`-laden spellings and asserts the shim still fires. Signed-off-by: Cong Wang <cwang@multikernel.io>
3278445 always served the same fixed loopback-only /etc/hosts to every sandbox, which lost any entries an image author had baked in (private registries, internal services, etc.). Trade-off acknowledged at the time. This commit picks the alternative direction we discussed: keep the always-on loopback guarantee, but seed the synthetic file from the image's `<chroot>/etc/hosts` when one is present, only injecting the loopback entries the image itself doesn't already cover, and appending net_allow concrete-host entries on top. Split the synthesis: `resolve_net_allow` now returns just the `concrete_host_entries` lines (one per resolved net_allow hostname), and a new `compose_virtual_etc_hosts(chroot_root, concrete_entries)` helper assembles the final file. Without a chroot it returns the fixed loopback base + concrete entries (matches the previous behavior). With a chroot, it reads `<chroot>/etc/hosts` (falling back to the loopback base if absent or unreadable), checks which localhost families the image already covers, and injects only the missing ones — comments are stripped before the localhost-presence scan so an inline `#` in the image doesn't fool the detector. Move the `/etc/hosts` shim registration *above* the chroot dispatch so the synthetic memfd wins in chroot mode: otherwise the chroot handler's `openat2(RESOLVE_IN_ROOT)` opens `<chroot>/etc/hosts` directly and the merge never reaches the sandbox. Tests: 6 new unit tests for `compose_virtual_etc_hosts` (no chroot, image with both / one / no loopback families, missing file, inline comment edge case) and 2 new integration tests that build a rootfs with a custom `/etc/hosts`, run a real sandbox under `--chroot`, and assert the merged content the sandbox actually sees (image entries preserved, loopback injected when missing, no duplicates when present). Signed-off-by: Cong Wang <cwang@multikernel.io>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
/etc/hostsso the sandbox sees a fixed loopback view (127.0.0.1 localhost+::1 localhost) plus anynet_allowhost entries, independent of whatever the host's on-disk file says. Previously the synthetic file was only served when at least onenet_allowrule named a concrete hostname; otherwise the host's/etc/hostsleaked through.handle_etc_hosts_openthat the literal-path match used to miss: legacyopen(path, ...),openat2(...)(used by some Go runtimes), dirfd-relative spellings (openat(open("/etc"), "hosts", ...)), and non-canonical absolutes (/etc/../etc/hosts,//etc/hosts,/etc/./hosts). The newresolve_open_targethelper dispatches onnotif.data.nr, resolves(dirfd, pathname)via/proc/<pid>/{cwd,fd/<fd>}, and lexically normalizes before comparing.is_sensitive_proc(the/proc/kcoredeny — actual security boundary), the per-PID/proc/<pid>/filter,/proc/cpuinfoand friends,/etc/hostname, and/dev/urandom. All handlers now shareresolve_open_targetand are registered foropen+openat+openat2in one place viaopen_family_syscalls(). The BPF notif list carries the same set so the kernel actually delivers notifications./etc/hostsfrom the image in chroot/image mode.compose_virtual_etc_hosts(chroot_root, concrete_entries)reads<chroot>/etc/hosts(when present), injects loopback entries only for any family the image doesn't already cover, and appendsnet_allowhost entries. Without a chroot it returns the fixed loopback base. The/etc/hostsshim is registered above the chroot dispatch so the synthetic memfd wins overopenat2(RESOLVE_IN_ROOT)inside the chroot.Test plan
cargo test --package sandlock-core --lib(289 pass, includes 6 newcompose_virtual_etc_hoststests covering no-chroot, image-with-both / one / no loopback families, missing file, and the inline-comment edge case)cargo test --package sandlock-core --test integration(225 pass, includes 4 new bypass-resistance tests + 2 new chroot image-seed tests + 1 baseline virtualization test)cat /etc/hostsinside a non-chroot sandbox returns exactly the synthetic content; inside a chroot whose image lacks loopback entries, the merged file shows the image's entries plus the injected loopback; with an image that already ships loopback, no duplicates appear.Known follow-up (not in this PR)
The chroot handler's
openat2(RESOLVE_IN_ROOT)still shadowshandle_proc_open,handle_hostname_open, andhandle_random_openbecause those are registered after the chroot dispatch. With/procbind-mounted into the chroot, this lets the kernel's realcpuinfo/kcore/ etc. through, bypassing both virtualization and the sensitive-path deny. Same fix shape (reorder above chroot) — separate change.🤖 Generated with Claude Code