
Virtualize /etc/hosts: always-on, image-seeded, leak-proof shim#69

Merged: congwang-mk merged 4 commits into main from virtualize-etc-hosts on May 27, 2026


@congwang-mk
Contributor

Summary

  • Always virtualize /etc/hosts so the sandbox sees a fixed loopback view (127.0.0.1 localhost + ::1 localhost) plus any net_allow host entries, independent of whatever the host's on-disk file says. Previously the synthetic file was only served when at least one net_allow rule named a concrete hostname; otherwise the host's /etc/hosts leaked through.
  • Close four shim bypasses in handle_etc_hosts_open that the literal-path match used to miss: legacy open(path, ...), openat2(...) (used by some Go runtimes), dirfd-relative spellings (openat(open("/etc"), "hosts", ...)), and non-canonical absolutes (/etc/../etc/hosts, //etc/hosts, /etc/./hosts). The new resolve_open_target helper dispatches on notif.data.nr, resolves (dirfd, pathname) via /proc/<pid>/{cwd,fd/<fd>}, and lexically normalizes before comparing.
  • Generalize the normalization to every path-keyed shim that had the same bug: is_sensitive_proc (the /proc/kcore deny — actual security boundary), the per-PID /proc/<pid>/ filter, /proc/cpuinfo and friends, /etc/hostname, and /dev/urandom. All handlers now share resolve_open_target and are registered for open + openat + openat2 in one place via open_family_syscalls(). The BPF notif list carries the same set so the kernel actually delivers notifications.
  • Seed the synthetic /etc/hosts from the image in chroot/image mode. compose_virtual_etc_hosts(chroot_root, concrete_entries) reads <chroot>/etc/hosts (when present), injects loopback entries only for any family the image doesn't already cover, and appends net_allow host entries. Without a chroot it returns the fixed loopback base. The /etc/hosts shim is registered above the chroot dispatch so the synthetic memfd wins over openat2(RESOLVE_IN_ROOT) inside the chroot.

Test plan

  • Rust unit tests: cargo test --package sandlock-core --lib (289 pass, includes 6 new compose_virtual_etc_hosts tests covering no-chroot, image-with-both / one / no loopback families, missing file, and the inline-comment edge case)
  • Rust integration tests: cargo test --package sandlock-core --test integration (225 pass, includes 4 new bypass-resistance tests + 2 new chroot image-seed tests + 1 baseline virtualization test)
  • Verified end-to-end on a Linux 6.19 host: cat /etc/hosts inside a non-chroot sandbox returns exactly the synthetic content; inside a chroot whose image lacks loopback entries, the merged file shows the image's entries plus the injected loopback; with an image that already ships loopback, no duplicates appear.

Known follow-up (not in this PR)

The chroot handler's openat2(RESOLVE_IN_ROOT) still shadows handle_proc_open, handle_hostname_open, and handle_random_open because those are registered after the chroot dispatch. With /proc bind-mounted into the chroot, this lets the kernel's real cpuinfo / kcore / etc. through, bypassing both virtualization and the sensitive-path deny. Same fix shape (reorder above chroot) — separate change.

🤖 Generated with Claude Code

…'s file

resolve_net_allow used to emit `etc_hosts: Some(...)` only when at least
one `net_allow` rule named a concrete hostname; otherwise the seccomp
openat shim was not registered and `openat("/etc/hosts")` fell through
to the host's on-disk file. That broke the "loopback-only view" the
netlink and procfs virtualization already promise, and made sandbox
behavior depend on whichever distro's `/etc/hosts` happened to be
installed (Ubuntu 20.04 parks `::1` under `ip6-localhost`, not
`localhost`, so AI_ADDRCONFIG localhost lookups silently dropped v6).
Make `etc_hosts` a plain `String` that always carries the loopback base
(`127.0.0.1 localhost` / `::1 localhost`), with `net_allow` host entries
appended on top, and register the openat shim unconditionally. Add a
direct test that reads `/etc/hosts` from inside the sandbox and asserts
the exact synthetic content.

Signed-off-by: Cong Wang <cwang@multikernel.io>
The handler installed in 3278445 only matched
`openat(_, "/etc/hosts", ...)`, leaving four bypasses that read the
host's real on-disk file and undid the commit's no-leak guarantee:
legacy `open(path, ...)` (not registered at all), `openat2(...)` (a
distinct syscall number used by some Go runtimes and hand-rolled libcs),
dirfd-relative spellings like `openat(open("/etc"), "hosts", ...)`,
and non-canonical absolutes like `/etc/../etc/hosts`, `//etc/hosts`,
or `/etc/./hosts`.

Register the shim for `open`, `openat`, and `openat2`, and rewrite
`handle_etc_hosts_open` to dispatch on `notif.data.nr` so each ABI
hits the right argument slots. Resolve the (dirfd, pathname) pair to
an absolute path via `/proc/<pid>/cwd` or `/proc/<pid>/fd/<dirfd>` and
lexically normalize it (collapse `.`, `..`, redundant `/`) before
comparing to `/etc/hosts`. The BPF notif list now carries `open` and
`openat2` unconditionally so callers picking those syscalls actually
generate a notification instead of slipping through ALLOW.
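
The lexical normalization step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `normalize_lexical` and `resolve_against` are hypothetical names, and `base_dir` stands in for whatever `/proc/<pid>/cwd` or `/proc/<pid>/fd/<dirfd>` resolves to. The sketch is purely textual and does not follow symlinks.

```rust
/// Lexically normalize an absolute path: drop `.` and empty components
/// (repeated `/`), and let `..` pop the previous component.
/// Hypothetical helper for illustration; no filesystem access.
fn normalize_lexical(path: &str) -> String {
    let mut parts: Vec<&str> = Vec::new();
    for comp in path.split('/') {
        match comp {
            "" | "." => {}          // redundant slash or current dir
            ".." => {
                parts.pop();        // `..` at the root stays at the root
            }
            other => parts.push(other),
        }
    }
    format!("/{}", parts.join("/"))
}

/// Combine an already-resolved base directory with the caller's pathname.
/// An absolute pathname ignores the base, as openat(2) does.
fn resolve_against(base_dir: &str, pathname: &str) -> String {
    if pathname.starts_with('/') {
        normalize_lexical(pathname)
    } else {
        normalize_lexical(&format!("{}/{}", base_dir, pathname))
    }
}
```

Under these assumptions, every bypass spelling listed above collapses to the same string, so a single comparison against `/etc/hosts` is enough.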

Add a regression test that exercises each bypass shape (dirfd-relative
+ three non-canonical spellings) and asserts the synthetic content
comes back every time.

Signed-off-by: Cong Wang <cwang@multikernel.io>
The last commit closed the literal-path bypass for /etc/hosts, but the same 4
shapes (legacy `open`, `openat2`, dirfd-relative, non-canonical) still
slipped past every other open-keyed shim, and most of them are actual
security boundaries: /proc/kcore (and friends in `is_sensitive_proc`)
returns physical memory, the per-PID `/proc/<pid>/` filter hides other
sandboxes' state from each other, /proc/cpuinfo + /proc/meminfo +
/proc/net/* fudge resource accounting, /etc/hostname leaks the host
identity, and /dev/urandom bypasses the deterministic seed.

Lift the `(dirfd, path)` → normalized absolute path logic out of
`handle_etc_hosts_open` into a shared `procfs::resolve_open_target`
that dispatches on `notif.data.nr` for open / openat / openat2.
Convert `handle_proc_open`, `handle_hostname_open`, and
`handle_random_open` to use it, and add `open_family_syscalls()` in
the dispatch builder so each handler is registered for all three
syscalls in one place. The BPF notif list already carries open and
openat2 unconditionally (added by 0fc09b4), so the kernel produces a
notification regardless of which spelling the caller picked.
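
Why the dispatch on `notif.data.nr` matters: the three syscalls place the pathname in different argument slots. A rough sketch under stated assumptions: the syscall numbers are the x86_64 values, and `OpenArgs`, `OPEN_FAMILY`, and `decode_open_args` are hypothetical names for illustration rather than the identifiers used in sandlock-core.

```rust
// x86_64 syscall numbers (other architectures differ).
const SYS_OPEN: i64 = 2;
const SYS_OPENAT: i64 = 257;
const SYS_OPENAT2: i64 = 437;
const AT_FDCWD: i32 = -100;

/// One list shared by every path-keyed handler (hypothetical shape).
const OPEN_FAMILY: [i64; 3] = [SYS_OPEN, SYS_OPENAT, SYS_OPENAT2];

#[derive(Debug, PartialEq)]
struct OpenArgs {
    dirfd: i32,
    path_ptr: u64, // address of the pathname in the tracee's memory
}

/// Pick the right argument slots for each ABI:
///   open(path, flags, mode)           -> path in args[0], implicit AT_FDCWD
///   openat(dirfd, path, flags, mode)  -> dirfd in args[0], path in args[1]
///   openat2(dirfd, path, how, size)   -> dirfd in args[0], path in args[1]
fn decode_open_args(nr: i64, args: &[u64; 6]) -> Option<OpenArgs> {
    match nr {
        SYS_OPEN => Some(OpenArgs { dirfd: AT_FDCWD, path_ptr: args[0] }),
        SYS_OPENAT | SYS_OPENAT2 => Some(OpenArgs {
            dirfd: args[0] as i32, // truncation recovers negative values like AT_FDCWD
            path_ptr: args[1],
        }),
        _ => None, // not an open-family syscall
    }
}
```

Reading `args[1]` as the pathname for legacy `open` would instead pick up the flags word, which is why a handler written only for `openat` cannot simply be registered for the other two numbers.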

Add regression tests for the new coverage: the /proc/kcore deny, the
/proc/cpuinfo virtualization, and the /etc/hostname shim each get a
test that walks dirfd-relative, `..`-laden, repeated-slash, and
`./`-laden spellings and asserts the shim still fires.

Signed-off-by: Cong Wang <cwang@multikernel.io>
3278445 always served the same fixed loopback-only /etc/hosts to every
sandbox, which lost any entries an image author had baked in (private
registries, internal services, etc.). Trade-off acknowledged at the
time. This commit picks the alternative direction we discussed: keep
the always-on loopback guarantee, but seed the synthetic file from
the image's `<chroot>/etc/hosts` when one is present, only injecting
the loopback entries the image itself doesn't already cover, and
appending net_allow concrete-host entries on top.

Split the synthesis: `resolve_net_allow` now returns just the
`concrete_host_entries` lines (one per resolved net_allow hostname),
and a new `compose_virtual_etc_hosts(chroot_root, concrete_entries)`
helper assembles the final file. Without a chroot it returns the
fixed loopback base + concrete entries (matches the previous
behavior). With a chroot, it reads `<chroot>/etc/hosts` (falling
back to the loopback base if absent or unreadable), checks which
localhost families the image already covers, and injects only the
missing ones — comments are stripped before the localhost-presence
scan so an inline `#` in the image doesn't fool the detector.
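
A minimal sketch of the merge described above, under several assumptions: the function takes the image's file contents directly instead of a chroot root (so it stays filesystem-free), a family counts as covered only when a loopback address carries the exact name `localhost`, and injected loopback lines are placed before the image's entries. The name `compose_hosts_sketch` and the output ordering are illustrative; the real `compose_virtual_etc_hosts` may differ.

```rust
use std::net::IpAddr;

const LOOPBACK_V4: &str = "127.0.0.1 localhost";
const LOOPBACK_V6: &str = "::1 localhost";

/// Hypothetical stand-in for compose_virtual_etc_hosts.
/// `image_hosts` is None when there is no chroot or the file is missing.
fn compose_hosts_sketch(image_hosts: Option<&str>, concrete_entries: &[&str]) -> String {
    let image = image_hosts.unwrap_or("");
    let (mut has_v4, mut has_v6) = (false, false);

    for line in image.lines() {
        // Strip comments first so `127.0.0.1 myhost # localhost` is not
        // mistaken for a localhost mapping.
        let active = line.split('#').next().unwrap_or("");
        let mut fields = active.split_whitespace();
        let addr = match fields.next().and_then(|a| a.parse::<IpAddr>().ok()) {
            Some(addr) => addr,
            None => continue, // blank, comment-only, or malformed line
        };
        if addr.is_loopback() && fields.any(|name| name == "localhost") {
            match addr {
                IpAddr::V4(_) => has_v4 = true,
                IpAddr::V6(_) => has_v6 = true,
            }
        }
    }

    let mut out = String::new();
    // Inject only the loopback families the image does not already cover.
    if !has_v4 {
        out.push_str(LOOPBACK_V4);
        out.push('\n');
    }
    if !has_v6 {
        out.push_str(LOOPBACK_V6);
        out.push('\n');
    }
    out.push_str(image);
    if !image.is_empty() && !image.ends_with('\n') {
        out.push('\n');
    }
    // net_allow host entries go last.
    for entry in concrete_entries {
        out.push_str(entry);
        out.push('\n');
    }
    out
}
```

With an Ubuntu-style image that maps `::1` only to `ip6-localhost`, this sketch treats IPv6 as uncovered and injects `::1 localhost`, which is the case the first commit message calls out.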

Move the `/etc/hosts` shim registration *above* the chroot dispatch
so the synthetic memfd wins in chroot mode: otherwise the chroot
handler's `openat2(RESOLVE_IN_ROOT)` opens `<chroot>/etc/hosts`
directly and the merge never reaches the sandbox.
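
The ordering argument can be illustrated with a toy dispatcher. This assumes a first-match-wins handler chain, which is an inference from the description above rather than a confirmed detail of the dispatch builder; all names and return strings here are hypothetical.

```rust
/// A handler either claims the open (Some) or passes it on (None).
type Handler = fn(&str) -> Option<&'static str>;

fn etc_hosts_shim(path: &str) -> Option<&'static str> {
    // Claims only the normalized /etc/hosts path.
    (path == "/etc/hosts").then_some("synthetic memfd")
}

fn chroot_dispatch(_path: &str) -> Option<&'static str> {
    // In chroot mode this handler claims every path it sees.
    Some("image file via openat2(RESOLVE_IN_ROOT)")
}

/// First handler that returns Some wins; later ones never run.
fn dispatch(chain: &[Handler], path: &str) -> Option<&'static str> {
    chain.iter().find_map(|handler| handler(path))
}
```

In this model, a chain of `[etc_hosts_shim, chroot_dispatch]` serves the synthetic file for `/etc/hosts`, while the reverse order lets the chroot handler answer first. The same shape explains the known follow-up for the handlers still registered after the chroot dispatch.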

Tests: 6 new unit tests for `compose_virtual_etc_hosts` (no chroot,
image with both / one / no loopback families, missing file, inline
comment edge case) and 2 new integration tests that build a rootfs
with a custom `/etc/hosts`, run a real sandbox under `--chroot`, and
assert the merged content the sandbox actually sees (image entries
preserved, loopback injected when missing, no duplicates when present).

Signed-off-by: Cong Wang <cwang@multikernel.io>
@congwang-mk congwang-mk merged commit bd46ed4 into main May 27, 2026
8 checks passed
@congwang-mk congwang-mk deleted the virtualize-etc-hosts branch May 27, 2026 22:41