Releases: facebookresearch/CUTracer
CUTracer v0.2.1 Release 🎉
CUTracer v0.2.1 Release Notes
🎉 Patch Release — 24 commits since v0.2.0
This release brings PyTorch-native Python callstack capture, a new kernel events recording system, cluster-level delay injection for inter-CTA race detection, significant trace format optimizations via instruction table embedding, a critical NVBit 1.8 TMA compatibility fix, and several CI/security hardening improvements. Date range: 2026-04-08 to 2026-04-22.
✨ Highlights
- PyTorch CapturedTraceback Integration — Full Python callstack capture via PyTorch's CapturedTraceback API with zero compile-time dependencies, replacing unsymbolized C++ backtrace frames with readable Python call sites
- Kernel Events Recording — New structured NDJSON logging of all kernel launches (not just instrumented ones) with callstack deduplication for minimal file overhead
- Cluster-Level Delay Mode — New `--delay-mode cluster` that delays one random CTA per cluster to expose missing inter-CTA synchronization bugs
- Instruction Table Embedding — Per-instruction static table with SASS binary encoding embedded in `kernel_metadata`, eliminating redundant per-record SASS strings from JSON traces
- Custom Delay Patterns — New `--delay-patterns` flag for injecting delays at arbitrary SASS instruction types, including a `"*"` wildcard mode
- NVBit 1.8 TMA Fix — Critical fix for a TMA operand extraction regression introduced by NVBit 1.8's new `TMA_PARAM_HANDLE` operand type
🧠 PyTorch CapturedTraceback Callstack Capture
A three-part series replaces the existing backtrace()-based C++ callstack capture with PyTorch's CapturedTraceback API, producing readable Python call stacks that show the actual user code and PyTorch layers that launched a kernel.
- Backtrace refactor (#202) — Extract `capture_cpu_callstack()` into `capture_cpu_callstack_backtrace()` to prepare for alternative backends
- CapturedTraceback module (#203) — New `python_callstack.cpp` (~360 lines) that dynamically resolves Python C API functions via `dlsym(dlopen(NULL))` — zero compile-time Python/PyTorch dependencies. Calls `CapturedTraceback.extract().summary()` when the current thread holds the GIL
- Dynamic mode selection (#204) — New `CpuCallstackMode` enum (AUTO/PYTORCH/BACKTRACE/DISABLED) replacing the old boolean flag. `AUTO` (default) tries PyTorch first, falls back to `backtrace()`. New `cpu_callstack_source` field in kernel_metadata JSON output
- Auto-GIL acquisition — New `auto_gil` mode that re-acquires the GIL via `PyGILState_Ensure()` for Triton kernels where `__triton_launcher` releases the GIL before `cuLaunchKernelEx`. Safe because the Python frame chain is frozen at the Triton `launch()` call site
```shell
# Use PyTorch callstacks (auto-detected by default)
cutracer trace --cpu-callstack auto -- python my_triton_kernel.py

# Force auto_gil mode for Triton kernels
cutracer trace --cpu-callstack auto_gil -- python my_triton_kernel.py
```

📋 Kernel Events Recording
New structured logging of every kernel launch to a dedicated NDJSON file (cutracer_kernel_events_*.ndjson), independent of instrumentation. Key innovation: callstack deduplication using FNV-1a hashing — each unique Python callstack is emitted once as a callstack_def record, with subsequent launches referencing it by callstack_id.
- Three modes: `dedup` (recommended), `full` (inline callstack per launch), `nostack` (metadata only)
- Zero overhead when disabled (default)
- Query integration — The `query` command now handles kernel events files seamlessly: `callstack_def` records are cached and resolved, and `kernel_launch` records get a `caller` field injected with the innermost call site frame
- Recommended query pattern: `--group-by kernel_checksum --count` for launch frequency analysis
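The dedup scheme above can be sketched in Python. The FNV-1a constants are the standard 64-bit parameters, but the record field names (`type`, `callstack_id`, `frames`) and the `emit_launch` helper are illustrative, not the exact CUTracer schema:

```python
FNV_OFFSET, FNV_PRIME = 0xCBF29CE484222325, 0x100000001B3

def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a hash."""
    h = FNV_OFFSET
    for byte in data:
        h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def emit_launch(frames, seen_ids, out):
    """Emit a callstack_def once per unique stack; later launches reference it by id."""
    cs_id = format(fnv1a_64("\n".join(frames).encode()), "016x")
    if cs_id not in seen_ids:
        seen_ids.add(cs_id)
        out.append({"type": "callstack_def", "callstack_id": cs_id, "frames": frames})
    out.append({"type": "kernel_launch", "callstack_id": cs_id})
```

Repeated launches from the same call site then cost one short reference record instead of a full inline stack.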
```shell
# Record kernel events with callstack dedup
cutracer trace --kernel-events dedup -- python my_app.py

# Query kernel launch counts
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count
```

🔀 Cluster-Level Random Delay Mode
New --delay-mode cluster (#207) that delays only one randomly-selected CTA within each cluster while other CTAs proceed at normal speed (~430 lines). This creates timing asymmetry between CTAs in the same cluster, exposing missing inter-CTA synchronization bugs that --delay-mode random (intra-CTA) would not catch.
- Uses `cluster_ctaid`/`cluster_nctaid` PTX registers for CTA selection within each cluster
- `cluster_seed` stored in delay config JSON for deterministic replay
- Host-side cluster dimension detection via `CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION` (including CUDA graph captures), with fallback to `cuFuncGetAttribute`
- One-time-per-kernel diagnostic log of runtime cluster dimensions
- Automatic warning when used against non-cluster-launched kernels (no-op detection)
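The per-cluster selection can be illustrated with a host-side Python sketch. The real selection happens on-device using the `cluster_ctaid`/`cluster_nctaid` registers; the function name here is hypothetical, and only the seeding idea matters:

```python
import random

def delayed_cta_in_cluster(cluster_seed: int, cluster_id: int, ncta_per_cluster: int) -> int:
    """Pick the single CTA in this cluster that receives the injected delay.

    Deriving the RNG state from (cluster_seed, cluster_id) is what makes a
    replay with the same cluster_seed deterministic: every run selects the
    same victim CTA per cluster, while different clusters pick independently.
    """
    rng = random.Random((cluster_seed << 32) ^ cluster_id)
    return rng.randrange(ncta_per_cluster)
```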
```shell
# Expose inter-CTA synchronization bugs
cutracer trace -i random_delay -a random_delay \
  --delay-mode cluster --delay-ns 10000 \
  --kernel-filters my_cluster_kernel -- python test.py
```

🎯 Custom Delay Patterns
New --delay-patterns flag (#211) that lets users specify arbitrary SASS instruction substrings for delay injection, overriding the built-in DELAY_INJECTION_PATTERNS list. This enables targeted testing of specific instruction types without modifying source code.
- Comma-separated patterns: `--delay-patterns "UTMALDG,UTMASTG"`
- Wildcard `--delay-patterns "*"` matches all instructions, injecting delay on a random 50% subset of ALL SASS instructions
- Also adds `SYNCS.EXCH` (mbarrier init) to the built-in delay injection patterns
- Plumbed through the `CUTRACER_DELAY_PATTERNS` environment variable
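The documented matching semantics can be sketched in Python. `select_delay_sites` is a hypothetical helper mirroring the flag's behavior (substring match, or a random 50% sample for `"*"`), not part of the CUTracer API:

```python
import random

def select_delay_sites(sass_instructions, patterns, seed=0):
    """Return indices of instructions that should receive an injected delay.

    patterns: comma-separated SASS substrings, or "*" to sample a random
    50% subset of all instructions (the documented wildcard behavior).
    """
    pats = [p.strip() for p in patterns.split(",") if p.strip()]
    if pats == ["*"]:
        rng = random.Random(seed)  # seeded so a run can be reproduced
        return [i for i in range(len(sass_instructions)) if rng.random() < 0.5]
    # Substring match against each instruction's SASS text
    return [i for i, sass in enumerate(sass_instructions)
            if any(p in sass for p in pats)]
```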
```shell
# Delay only TMA instructions
cutracer trace -i random_delay --delay-ns 5000 \
  --delay-patterns "UTMALDG,UTMASTG" -- python test.py

# Stress-test: random delay on 50% of ALL instructions
cutracer trace -i random_delay --delay-ns 1000 \
  --delay-patterns "*" -- python test.py
```

📦 Instruction Table Embedding & Trace Format Optimization
Three-part series that embeds a per-instruction static table in kernel_metadata and eliminates redundant per-record SASS strings from JSON output:
- Instruction table — Each
kernel_metadatarecord now includes aninstructionsarray indexed byopcode_id, containing SASS disassembly, binary encoding (via NVBit 1.8getSassBinary()API), register indices, and uniform register indices - Schema update —
kernel_metadata.schema.jsonupdated with theinstructionsproperty definition - Per-record SASS removal —
j["sass"]serialization removed from JSON output (text mode unchanged). The PythonTraceReadercaches the instruction table and injectssassinto records on read viaopcode_idlookup. Backward compatible: old traces with inline SASS still work
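The reader-side lookup can be sketched as follows. The record shapes are simplified stand-ins for the real trace schema, and `inject_sass` is an illustrative generator rather than the actual `TraceReader` method:

```python
def inject_sass(records):
    """Stream trace records, caching each kernel_metadata instruction table
    and resolving per-record SASS from it via opcode_id lookup.

    Old traces that still carry inline `sass` pass through unchanged,
    which is what keeps the format backward compatible.
    """
    table = {}
    for rec in records:
        if rec.get("type") == "kernel_metadata":
            table = {i: ins["sass"]
                     for i, ins in enumerate(rec.get("instructions", []))}
        elif "sass" not in rec and "opcode_id" in rec:
            rec["sass"] = table.get(rec["opcode_id"])
        yield rec
```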
🐛 Bug Fixes
- NVBit 1.8 TMA operand extraction regression — NVBit 1.8 changed `UTMALDG`/`UTMASTG`/`UTMAREDG` from two separate `MREF` operands to a single `TMA_PARAM_HANDLE` operand. CUTracer's operand loop had no handler, silently dropping all UR information and breaking the `tma_trace`, `data-race`, `tma`, `mma`, and `dataflow` analysis commands. Added a manual `TMA_PARAM_HANDLE` decoder plus unhandled-operand debug logging for future NVBit changes
- Kernel hash always `0x` — `compute_kernel_checksum()` was only called inside the instrumentation path; now computed at metadata creation time via an idempotent `ensure_kernel_checksum()` helper
- Log truncation on SIGTERM (#206) — Added `flush_log_files()` before `raise(SIGTERM)` in the deadlock detection, kernel timeout, and no-data timeout termination paths
- Cluster warning noise (#210) — Gated the `[CLUSTER]` warning by `should_instrument` so non-matching kernels don't clutter stderr; also fixed the legacy `MEMTRACE:` prefix to `CUTracer:`
- Kernel events writer lifetime — Fixed a multi-context crash where the first context teardown destroyed the shared writer; now cleaned up only when all contexts are gone
- Python callstack lineno — Clamped negative `PyFrame_GetLineNumber()` return values unconditionally (previously only inside the `PyErr_Occurred()` block)
- Kernel events NDJSON enforcement — Force NDJSON format for the kernel events writer regardless of `CUTRACER_TRACE_FORMAT`, with a warning when text mode is active
🖥️ CLI Changes
New Flags
```shell
# Callstack capture mode (replaces boolean --cpu-callstack 0/1)
cutracer trace --cpu-callstack auto|pytorch|backtrace|auto_gil|0|1

# Kernel events recording
cutracer trace --kernel-events dedup|full|nostack

# Cluster delay mode
cutracer trace --delay-mode cluster  # (alongside existing random/fixed)

# Custom delay patterns
cutracer trace --delay-patterns "PATTERN1,PATTERN2"  # or "*" for all
```

Query Command Updates
```shell
# Query kernel events files (auto-detected by file naming)
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count
```

📁 Configuration Changes
Updated Environment Variables
| Variable | Change | Description |
|---|---|---|
| `CUTRACER_CPU_CALLSTACK` | Extended | Now accepts: auto, pytorch, backtrace, auto_gil, 0, 1 (was: 0/1 only) |
| `CUTRACER_KERNEL_EVENTS` | New | Kernel events recording mode: 0 (disabled, default), dedup, full, nostack |
| `CUTRACER_DELAY_MODE` | Extended | New cluster value (alongside existing random/fixed) |
| `CUTRACER_DELAY_PATTERNS` | New | Comma-separated SASS instruction substrings for delay injection; "*" for all instructions |
🔒 Security & CI
- CodeQL fixes (#209) — Replaced `fopen("w")` with `open(O_CREAT|O_WRONLY|O_TRUNC, 0644)` + `fdopen()` for explicit file permissions; added `permissions: contents: read` to CI ...
CUTracer v0.2.0 Release
🎉 Major Release — 114 commits since v0.1.0
CUTracer v0.2.0 brings Blackwell GPU support, a unified CLI experience, advanced data race reduction, and significant improvements to trace infrastructure.
✨ Highlights
- Blackwell (SM100) GPU Support — Tensor core instruction tracing for UTC*MMA, UTMALDG, UTMAREDG, and TMA descriptors
- Unified CLI — New `cutracer trace` subcommand replaces manual `CUDA_INJECTION64_PATH` setup
- Data Race Reducer — DDMin bisection algorithm to automatically find minimal race-triggering configurations
- NVBit 1.8 — Updated from NVBit 1.7.7.1 to 1.8, with a critical fix for `<<<>>>` kernel launch deadlocks
- CPU Call Stack Capture — Per-kernel-launch host-side stack traces for debugging
- Kernel Timeouts & Safety Limits — Configurable execution timeout and trace file size limits
🏗️ Blackwell GPU Support
Full tracing support for NVIDIA Blackwell architecture:
- UTC*MMA tensor core instructions — Trace Blackwell's new warp-group MMA operations (#161)
- UTMAREDG tracing — Support for TMA reduction instructions (#162)
- UTMALDG decoder — Decode TMA load descriptor parameters
- TMA descriptor tracing — Capture and decode TMA descriptor fields for tile configuration analysis (#155)
- TMA descriptor decoding in SASS — Extract descriptor parameters from cubin SASS output
- Tensor memory delay injection — Extend random delay to TMA instructions for data race detection (#189)
🖥️ Unified CLI
The CLI has been completely revamped with a unified cutracer entry point:
cutracer trace — Run and Trace
```shell
# Trace a CUDA application (replaces manual CUDA_INJECTION64_PATH setup)
cutracer trace --instrument opcode_only -- python my_kernel.py

# With cubin dump and output directory
cutracer trace --instrument reg_trace --dump-cubin --output-dir ./traces -- python my_kernel.py

# Shell-style environment variable passthrough
cutracer trace CUTRACER_DELAY_NS=1000 -- python my_kernel.py
```

cutracer query — Query Trace Data

```shell
# Filter and query traces
cutracer query trace.ndjson --filter "warp=24"
cutracer query trace.ndjson --filter "cta=[0,0,0],opcode=LDG"  # Multi-condition AND filter
cutracer query trace.ndjson --output result.ndjson --compress
```

cutracer analyze — Analyze Traces

```shell
# Warp execution summary
cutracer analyze warp-summary trace.ndjson
```

cutracer reduce — Minimize Race Configs

```shell
# Find minimal delay configuration that triggers a race
cutracer reduce --config delay_config.json -- python my_kernel.py
```

cutracer sass — SASS Extraction

```shell
# Extract SASS from cubin files
cutracer sass --cubin kernel.cubin
```

🔍 Data Race Detection Enhancements
DDMin Bisection Reducer (#187)
Automatically reduce a delay configuration to the minimal set of delay points that still trigger a data race, using the delta debugging (ddmin) algorithm:
- Exponentially faster than brute-force elimination
- Produces minimal reproducible configurations
- Integrated via the `cutracer reduce` CLI command
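The core of delta debugging can be sketched in a short Python function, where `triggers_race` stands in for re-running the kernel under a candidate delay configuration. This is a generic ddmin sketch, not CUTracer's implementation:

```python
def ddmin(points, triggers_race):
    """Shrink a list of delay points to a small subset that still triggers
    the race, by alternately testing chunks and their complements and
    increasing granularity only when neither reproduces the failure."""
    n = 2
    while len(points) >= 2:
        chunk = max(1, len(points) // n)
        subsets = [points[i:i + chunk] for i in range(0, len(points), chunk)]
        for sub in subsets:
            if triggers_race(sub):                 # reduce to the failing chunk
                points, n = sub, 2
                break
            complement = [p for p in points if p not in sub]
            if len(subsets) > 2 and triggers_race(complement):
                points, n = complement, max(n - 1, 2)
                break
        else:
            if n >= len(points):                   # already at finest granularity
                break
            n = min(len(points), n * 2)            # otherwise split finer
    return points
```

Because each pass discards large non-essential chunks in one test, this typically needs far fewer kernel re-runs than removing delay points one at a time.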
Per-Thread Random Delay Mode (#186)
- New `CUTRACER_DELAY_MODE=per_thread` for thread-level delay granularity
- Better coverage for detecting fine-grained data races
Delay Config Mutator (#145)
- Programmatic API for manipulating delay configurations
- Enables automated delay sweep workflows
⏱️ Reliability & Safety
- Kernel execution timeout (`CUTRACER_KERNEL_TIMEOUT_S`) — Kill kernels that exceed a time limit (#169)
- No-data timeout — Detect silent hangs when no trace data is produced
- Trace file size limit (`CUTRACER_TRACE_SIZE_LIMIT_MB`) — Prevent runaway disk usage (#169)
- Periodic flush — TraceWriter and log files flush periodically during kernel hangs, ensuring data is available for post-mortem analysis
- Configurable channel buffer size (`CUTRACER_CHANNEL_RECORDS`) — Tune the buffer for hang debugging scenarios
- Fix `<<<>>>` deadlock — Preload flush_channel via fatbin + NVBit tool API to eliminate kernel launch deadlocks (#199)
- Fix CUDA graph handling — Prevent the graph build/capture phase from prematurely executing per-launch side effects
- Fix trace overwrite — Trace file write mode changed from append to overwrite across runs
🔧 Instrumentation Improvements
- Instruction category system — Conditional instrumentation based on instruction categories (#134)
- IPOINT configuration — Configure instrumentation points via environment variables (#183)
- Register indices in trace output — CPU-side static mapping of register operands (#143)
- opcode_only trace writing — Lightweight opcode-only mode now writes structured trace output
- Auto-enable cubin dump — Cubin dump auto-enabled when instrumentation is active (#191)
- Kernel checksum — Robust delay config replay using kernel binary checksums (#133, #141)
- CPU call stack capture — Host-side stack trace for each kernel launch (#172)
- Re-execute compiled kernel — Ensure trace captures both compilation and execution runs
📁 Configuration Changes
Renamed Environment Variables
| Old | New |
|---|---|
| `TRACE_FORMAT_NDJSON` | `CUTRACER_TRACE_FORMAT` |
| `CUTRACER_TRACE_OUTPUT_DIR` | `CUTRACER_OUTPUT_DIR` |
CUTRACER_TRACE_FORMAT now also accepts string names (e.g., ndjson_zst, ndjson, log) in addition to numeric values (#193).
New Environment Variables
| Variable | Description | Default |
|---|---|---|
| `CUTRACER_KERNEL_TIMEOUT_S` | Kernel execution timeout in seconds | (disabled) |
| `CUTRACER_TRACE_SIZE_LIMIT_MB` | Max trace file size in MB | (unlimited) |
| `CUTRACER_CHANNEL_RECORDS` | Channel buffer record count | (default) |
| `CUTRACER_CPU_CALLSTACK` | Enable CPU call stack capture | 0 |
| `CUTRACER_DELAY_MODE` | Delay mode (uniform/per_thread) | uniform |
| `CUTRACER_OUTPUT_DIR` | Unified output directory for all artifacts | . |
| `CUTRACER_IPOINT` | Instrumentation point configuration | (default) |
🔄 Dependency Updates
- NVBit: 1.7.7.1 → 1.7.7.3 → 1.8 (#164, #198)
- nlohmann/json: Updated default to 3.12.0
- Python: CI updated to Python 3.13
- GitHub Actions: Updated to latest versions
- JSON parsing: Migrated to orjson for faster JSON I/O via tritonparse `_json_compat`
- Daily NVBit update check: Automated GitHub Action to detect upstream NVBit releases (#163)
🐍 Python Package Improvements
- CLP archive support — Dump and read CLP compressed log archives (#118, #148)
- Unified logger module — Consistent logging across all Python modules
- Schema validation — Migrated trace validation into the cutracer Python module (#154)
- Query enhancements — Hex filters, `--all-lines` flag, NDJSON output, `--output`, `--compress` (#136)
- Multi-condition AND filter — Filter by multiple fields simultaneously (#139)
- JSON list value filters — Support `cta=[0,0,0]`-style filter expressions
- KernelConfig abstraction — Clean API for trace metadata
- TraceWriter metadata — `write_metadata()` and `kernel_metadata` event support (#153)
- Truncated trace detection — Detect and handle truncated trace files gracefully
- GB200 aarch64 support — Installation scripts updated for GB200 platforms (#159, #173)
📋 Requirements
- CUDA Toolkit: Aligned with NVBit 1.8 requirements
- libzstd: Required for trace compression
- Python 3.10+: For Python package
- NVBit 1.8: Bundled (auto-downloaded during build)
⚠️ Breaking Changes
- `TRACE_FORMAT_NDJSON` renamed to `CUTRACER_TRACE_FORMAT` (#192)
- `CUTRACER_TRACE_OUTPUT_DIR` renamed to `CUTRACER_OUTPUT_DIR` (#167)
- CLI entry point unified to `cutracer` (replaces `cutraceross`)
- `--all` flag renamed to `--all-lines` (#157)
- `analysis` module renamed to `query` (#135)
- `pc` field in trace output changed to hex string format (#137)
🙏 Acknowledgments
CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework and the v1.8 release.
📄 License
- MIT License — Meta Platforms, Inc. contributions
- BSD-3-Clause License — NVIDIA NVBit components
See LICENSE and LICENSE-BSD for details.
📚 Documentation
Full documentation is available in the Wiki.
CUTracer v0.1.0 Release 🎉
🎉 Initial Public Release
CUTracer is an NVBit-based CUDA binary instrumentation tool for GPU kernel analysis and debugging. It enables runtime-level insights without requiring application recompilation.
✨ Highlights
- Zero-modification runtime injection - Attach to any CUDA application via `CUDA_INJECTION64_PATH`
- GPU Hang Detection - Automatic deadlock identification with process termination
- Data Race Detection - Random delay injection with deterministic replay support
- Triton/Proton Integration - Per-warp instruction histograms with IPC calculation
- Efficient Trace Compression - NDJSON + Zstd (~92% space savings)
- Python Analysis Toolkit - Available on PyPI: `pip install cutracer`
🔧 Instrumentation Modes
| Mode | Environment Variable | Description |
|---|---|---|
| `opcode_only` | `CUTRACER_INSTRUMENT=opcode_only` | Lightweight instruction counting |
| `reg_trace` | `CUTRACER_INSTRUMENT=reg_trace` | Register value tracing (R/UREG support) |
| `mem_addr_trace` | `CUTRACER_INSTRUMENT=mem_addr_trace` | Memory address tracing |
| `mem_value_trace` | `CUTRACER_INSTRUMENT=mem_value_trace` | Memory address + value tracing (Global/Shared/Local) |
| `random_delay` | `CUTRACER_INSTRUMENT=random_delay` | Delay injection for race detection |
Multiple modes can be enabled simultaneously with comma-separated values.
📊 Built-in Analyses
Instruction Histogram (proton_instr_histogram)
- Clock-delimited per-warp instruction counting
- CSV output: `warp_id,region_id,instruction,count`
- Integration with Triton Proton for IPC calculation
- Requires kernels to emit clock instructions (e.g., Triton `pl.scope()`)
```shell
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py
```

Deadlock/Hang Detection (deadlock_detection)
- Detects warps stuck in stable PC loops
- Automatic SIGTERM → SIGKILL termination sequence
- Detailed warp state logging for post-mortem analysis
- Auto-enables `reg_trace` instrumentation

```shell
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py
```

Data Race Detection (random_delay)
- Injects delays at synchronization points to expose timing-dependent races
- Target SASS patterns:
  - `SYNCS.PHASECHK.TRANS64.TRYWAIT` (mbarrier try_wait)
  - `SYNCS.ARRIVE.TRANS64.RED.A1T0` (mbarrier arrive)
  - `UTMALDG.2D` (TMA load)
  - `WARPGROUP.DEPBAR.LE` (MMA wait)
Deterministic Replay Support:
- `CUTRACER_DELAY_DUMP_PATH`: Export delay config JSON for recording
- `CUTRACER_DELAY_LOAD_PATH`: Load delay config JSON for exact replay
- Workflow: Discover race with random delays → Reproduce exactly with saved config
```shell
# Record mode
CUTRACER_DELAY_NS=1000 \
CUTRACER_DELAY_DUMP_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py

# Replay mode (deterministic reproduction)
CUTRACER_DELAY_LOAD_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py
```

📁 Output Formats
| Mode | Extension | Description |
|---|---|---|
| 0 | `.log` | Human-readable text format |
| 1 (default) | `.ndjson.zst` | NDJSON + Zstd compressed |
| 2 | `.ndjson` | NDJSON uncompressed |
Set via the `TRACE_FORMAT_NDJSON` environment variable.
Compression level is configurable via `CUTRACER_ZSTD_LEVEL` (1-22, default: 22).
🐍 Python Package
Available on PyPI: `pip install cutracer`

Features
Validation:
- JSON syntax and schema validation
- Text format validation
- Cross-format consistency checking
- Transparent Zstd compression handling
Analysis:
- `TraceReader`: Stream trace records from NDJSON files
- `StreamingGrouper`: Memory-efficient grouped analysis
- `WarpSummary`: Warp execution status for hang analysis (completed/in-progress/missing)
- Multi-format output: table, JSON, CSV
CLI Tools:
```shell
# Validate trace files
cutraceross validate trace.ndjson
cutraceross validate trace.ndjson.zst --verbose

# Analyze trace data
cutraceross analyze trace.ndjson --head 20
cutraceross analyze trace.ndjson --filter "warp=24"
cutraceross analyze trace.ndjson --group-by warp --count
```

⚙️ Configuration Reference
| Variable | Description | Default |
|---|---|---|
| `CUTRACER_INSTRUMENT` | Instrumentation modes (comma-separated) | (none) |
| `CUTRACER_ANALYSIS` | Analysis types (comma-separated) | (none) |
| `KERNEL_FILTERS` | Kernel name filters (substring match) | (none) |
| `INSTR_BEGIN` / `INSTR_END` | Instruction index range filter | 0 / UINT32_MAX |
| `TRACE_FORMAT_NDJSON` | Output format (0/1/2) | 1 |
| `CUTRACER_ZSTD_LEVEL` | Zstd compression level | 22 |
| `CUTRACER_DELAY_NS` | Delay value in nanoseconds | 0 (disabled) |
| `CUTRACER_DELAY_DUMP_PATH` | Export delay config JSON | (none) |
| `CUTRACER_DELAY_LOAD_PATH` | Load delay config for replay | (none) |
| `TOOL_VERBOSE` | Verbosity level (0/1/2) | 0 |
| `CUTRACER_DUMP_CUBIN` | Dump cubin files | 0 |
📋 Requirements
- CUDA Toolkit: Aligned with NVBit requirements
- libzstd: Required for trace compression
- Python 3.10+: For Python package
⚠️ API Stability Notice
This is the initial public release (v0.1.0). APIs and configuration options may change in future versions as we gather feedback and iterate on the design.
Known Limitations
- Instruction histogram requires clock instruction boundaries (e.g., Triton `pl.scope()`)
- Nested regions not supported for instruction histogram analysis
🙏 Acknowledgments
CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework.
📄 License
- MIT License - Meta Platforms, Inc. contributions
- BSD-3-Clause License - NVIDIA NVBit components
See LICENSE and LICENSE-BSD for details.
📚 Documentation
Full documentation is available in the Wiki.