propagate canonical exit status through exit_group and epoch_kill #1169
rennergade merged 8 commits into main
Three related fixes:
- exit_syscall: do not record per-thread exit status; only exit_group determines the cage's process-level exit code
- exit_group_syscall: use the cage's already-recorded final_exit_status so whichever thread wins the CAS (e.g. faulthandler calling _exit(1)) determines the OS-level exit code; late threads calling exit_group(0) will use the authoritative code
- signal_handler (epoch_kill path): read the cage's final_exit_status instead of hard-coding 0 so the wasmtime exit trampoline receives the correct exit code
Tests that when two threads race to call exit(), the first caller's exit code is what the parent sees via waitpid(). Worker thread calls exit(1) and the main thread calls exit(0) after a spin; parent should observe exit code 1.
The bug was that exit_syscall (SYS_exit, called by glibc start_thread when a non-main pthread returns) called cage_record_exit_status(0). cage_record_exit_status uses first-write-wins semantics, so if SYS_exit(0) fires before exit_group(1), final_exit_status is locked at 0 and cage_finalize reports 0 to the parent. The new test creates a thread that returns NULL, then after pthread_join calls exit(1). glibc's thread exit order is FUTEX_WAKE → EXIT_SYSCALL(0), so pthread_join returns just before EXIT_SYSCALL. Thread A's cage_record_exit_status(0) fires before main's asyncify-rewind path reaches exit(1), locking status=0 on the unfixed branch. After the fix, exit_syscall does not record status, so exit_group(1) writes Some(1) first.
One concern: the |
There are two different exit routines: one for the wasm module exit code (which propagates the exit code via wasm semantics), and one for the waitpid-semantics exit code (which records the exit code in the cage struct). The wasm module exit code currently isn't handled (e.g. the exit code of a forked module is silently dropped). This is like thread exit code vs. process exit code: we currently handle only the process exit code, not the thread exit code. This PR basically removes the recording of the thread exit code (since it previously would overwrite the process exit code) and forces it to 1. This is a temporary solution until we add/fix support for thread exit codes (which will also require implementing pthread_exit). The process exit code recorded on the cage struct remains unchanged.
Problem
When a process exited with a non-zero code (e.g. faulthandler calling _exit(1)), the parent's waitpid() could observe exit code 0 instead. This is the root cause of the Python test worker bug observed in the Python tests.
This was caused by three separate but related issues:
1. exit_syscall recorded the cage exit status
exit_syscall (SYS_exit = 60) is called by glibc's start_thread when a non-main pthread returns from its thread function. It was calling cage_record_exit_status(0) even though a per-thread exit has no bearing on the process-level exit code. cage_record_exit_status uses first-write-wins semantics, so if SYS_exit(0) fired before exit_group_syscall(1), it locked final_exit_status = Some(0) and the subsequent exit_group(1) could not overwrite it. cage_finalize then reported exit code 0 to the parent.
The glibc thread exit sequence is:
FUTEX_WAKE → EXIT_SYSCALL(0)
pthread_join returns right after FUTEX_WAKE, so the joiner's return path (asyncify rewind + multiple library frames) loses the race to thread A's cage_record_exit_status(0), which fires at the top of exit_syscall before any asyncify unwind.
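To make the first-write-wins lock-in concrete, here is a minimal Rust sketch. The `Cage` struct, the `Mutex<Option<i32>>` slot, and the `record_exit_status` / `final_status` methods are hypothetical stand-ins for the real cage code, which may store and guard the status differently; only the "first writer wins" behaviour is taken from the description above.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the cage; only the exit-status slot is modelled.
struct Cage {
    final_exit_status: Mutex<Option<i32>>,
}

impl Cage {
    fn new() -> Self {
        Cage { final_exit_status: Mutex::new(None) }
    }

    /// First-write-wins: store `status` only if nothing was recorded yet.
    /// Returns true when this call's status became the final one.
    fn record_exit_status(&self, status: i32) -> bool {
        let mut slot = self.final_exit_status.lock().unwrap();
        if slot.is_none() {
            *slot = Some(status);
            true
        } else {
            false
        }
    }

    /// Read back whatever status was recorded, if any.
    fn final_status(&self) -> Option<i32> {
        *self.final_exit_status.lock().unwrap()
    }
}

fn main() {
    let cage = Cage::new();
    // Unfixed ordering: thread A's SYS_exit(0) records first...
    assert!(cage.record_exit_status(0));
    // ...so the main thread's exit_group(1) can no longer change the slot.
    assert!(!cage.record_exit_status(1));
    assert_eq!(cage.final_status(), Some(0));
}
```

With this shape, any caller that records a status it does not own (here the per-thread exit) can permanently mask the real process exit code.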
2. exit_group_syscall passed its own status_arg to the exit trampoline
Late threads calling exit_group_syscall(0) after the cage was already marked Exited(1) passed status_arg = 0 to the wasmtime exit trampoline. The last thread's exit code determines the value returned to OnCalledAction::Finish, so a late exit_group(0) could overwrite the intended code.
3. signal_handler hardcoded exit code 0 for epoch-killed threads
When exit_group or a fatal signal called epoch_kill_all, epoch-killed threads entered signal_handler and called ctx.exit_call(caller, 0, 0). This hardcoded 0 was passed to OnCalledAction::Finish regardless of what exit code had been recorded.
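Issues 2 and 3 share one shape: the value handed to the exit trampoline came from the calling thread rather than from the cage. The sketch below contrasts the two choices. Both function names are hypothetical, and falling back to the caller's own argument when nothing has been recorded is an assumption, not something the PR states.

```rust
/// Unfixed shape: each thread forwards whatever it was called with
/// (status_arg in exit_group_syscall, a literal 0 in signal_handler).
fn trampoline_status_unfixed(status_arg: i32) -> i32 {
    status_arg
}

/// Canonical shape: prefer the status already recorded on the cage.
/// The fallback to `status_arg` when nothing is recorded is an assumption.
fn canonical_status(final_exit_status: Option<i32>, status_arg: i32) -> i32 {
    final_exit_status.unwrap_or(status_arg)
}

fn main() {
    // The cage was already marked as exited with code 1 by the first caller.
    let recorded = Some(1);

    // A late exit_group(0), or an epoch-killed thread using 0, hands 0 onward.
    assert_eq!(trampoline_status_unfixed(0), 0);

    // Reading the recorded status keeps the authoritative code.
    assert_eq!(canonical_status(recorded, 0), 1);
}
```

Since the last thread to reach the trampoline decides the value seen by OnCalledAction::Finish, every exiting thread has to agree on the same code for the result to be stable.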
Fix
- exit_syscall: remove the cage_record_exit_status call entirely. Only exit_group_syscall determines the process-level exit code.
- exit_group_syscall: compute canonical_status_arg by reading cage.final_exit_status at the point of calling the exit trampoline. Late threads calling exit_group(0) after the CAS winner recorded Exited(1) will use the authoritative code.
- signal_handler (epoch_kill path): read cage.final_exit_status and pass the recorded exit code to ctx.exit_call instead of hardcoding 0.
Test
tests/unit-tests/process_tests/deterministic/exit_status_first_wins.c
Forks a child that creates a pthread which returns NULL (triggering SYS_exit(0) via glibc start_thread), then calls exit(1) after pthread_join. Because pthread_join returns before EXIT_SYSCALL fires, thread A's cage_record_exit_status(0) races with the main thread's exit_group(1). Thread A wins the race on the unfixed branch (asyncify rewind overhead delays the main thread). Parent must observe exit code 1.
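The ordering this test relies on can be replayed in a small deterministic model. This is only a sketch of the reasoning, not the test itself (which is a C program): the `Event` enum, the `parent_observed_status` function, and the `exit_syscall_records` flag are hypothetical names used to switch between the unfixed and fixed behaviour.

```rust
/// Syscalls as they reach the cage, in arrival order (hypothetical model).
#[derive(Clone, Copy)]
enum Event {
    SysExit(i32),
    ExitGroup(i32),
}

/// Replays `events` against a first-write-wins status slot and returns the
/// status the parent would observe. `exit_syscall_records` selects whether
/// SYS_exit records a status (unfixed branch) or not (fixed branch).
fn parent_observed_status(events: &[Event], exit_syscall_records: bool) -> Option<i32> {
    let mut final_exit_status: Option<i32> = None;
    for event in events {
        // Decide whether this event tries to record a status at all.
        let candidate = match *event {
            Event::SysExit(code) if exit_syscall_records => Some(code),
            Event::SysExit(_) => None,
            Event::ExitGroup(code) => Some(code),
        };
        // First write wins: later candidates are ignored.
        if final_exit_status.is_none() {
            final_exit_status = candidate;
        }
    }
    final_exit_status
}

fn main() {
    // Thread A's SYS_exit(0) arrives before the main thread's exit_group(1).
    let order = [Event::SysExit(0), Event::ExitGroup(1)];

    // Unfixed branch: the per-thread exit locks the status at 0.
    assert_eq!(parent_observed_status(&order, true), Some(0));

    // Fixed branch: SYS_exit records nothing, so exit_group(1) writes first.
    assert_eq!(parent_observed_status(&order, false), Some(1));
}
```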