CUTracer is a CUDA binary instrumentation tool built on NVBit. It cleanly separates lightweight data collection (instrumentation) from host-side processing (analysis). Typical workflows include per-warp instruction histograms (delimited by GPU clock reads) and kernel hang detection.
- NVBit-powered, runtime attach via CUDA_INJECTION64_PATH (no app rebuild needed)
- Multiple instrumentation modes: opcode-only, register trace, memory trace, random delay
- Built-in analyses:
  - Instruction Histogram (for Proton/Triton workflows)
  - Deadlock/Hang Detection
  - Data Race Detection
- CUDA Graph and stream-capture aware flows
- Deterministic kernel log file naming and CSV outputs
CUTracer has the same requirements as NVBit.
Additional requirements:
- libzstd: Required for trace compression
- Clone the repository:
cd ~
git clone git@github.com:facebookresearch/CUTracer.git
cd CUTracer
Note for Meta internal users: CUTracer is also available at fbcode/triton/tools/CUTracer/ within fbsource. You can build via buck2 build fbcode//triton/tools/CUTracer:cutracer.so instead of the Makefile workflow.
- Install system dependencies (libzstd static library for self-contained builds):
# Ubuntu/Debian
# On most Ubuntu/Debian systems, libzstd-dev provides both shared and static libs (libzstd.a).
# You can verify this with: dpkg -L libzstd-dev | grep 'libzstd.a'
# If your distribution does not ship the static library in libzstd-dev, you may need to
# build zstd from source or install a distro-specific static libzstd package.
sudo apt-get install libzstd-dev
# CentOS/RHEL/Fedora (static library for portable builds)
sudo dnf install libzstd-static
# If static library is not available, the build will fall back to dynamic linking
# and display a warning. The resulting binary will not be self-contained.
- Download third-party dependencies:
./install_third_party.sh
This will download:
- NVBit (NVIDIA Binary Instrumentation Tool)
- nlohmann/json (JSON library for C++)
- Build the tool:
make -j$(nproc)
- Install the Python CLI:
cd ~/CUTracer/python
pip install .

# Option A: Set CUTRACER_LIB_PATH once (recommended)
export CUTRACER_LIB_PATH=~/CUTracer/lib
cutracer trace -i tma_trace -- ./your_app
# Option B: Specify cutracer.so explicitly
cutracer trace -i tma_trace --cutracer-so ~/CUTracer/lib/cutracer.so -- ./your_app
# Option C: Run from the CUTracer project root (auto-discovers ./lib/cutracer.so)
cd ~/CUTracer
cutracer trace -i tma_trace -- ./your_app
# Option D: Kernel launch logger only (no instrumentation, no trace files)
cutracer trace -- ./your_app

cutracer analyze warp-summary output.ndjson
cutracer query output.ndjson --filter "warp=24"
cutracer validate output.ndjson
Note: You can also use CUTracer without the Python CLI by setting the CUDA_INJECTION64_PATH environment variable directly:
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so ./your_app
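
When the application is launched from a Python harness rather than a shell, the same direct-injection setup can be expressed by building the child environment explicitly. The sketch below is only an illustration: the helper names `cutracer_env` and `run_traced` are hypothetical and not part of CUTracer, and only the environment variable names come from this README.

```python
import os
import subprocess
from pathlib import Path


def cutracer_env(so_path, analyses=(), kernel_filters=(), base_env=None):
    """Return a copy of the environment with CUTracer variables set.

    Hypothetical helper: only the variable names are taken from the README.
    """
    env = dict(os.environ if base_env is None else base_env)
    # The tool is attached at runtime through the CUDA injection variable.
    env["CUDA_INJECTION64_PATH"] = str(Path(so_path).expanduser())
    if analyses:
        env["CUTRACER_ANALYSIS"] = ",".join(analyses)
    if kernel_filters:
        env["KERNEL_FILTERS"] = ",".join(kernel_filters)
    return env


def run_traced(cmd, so_path, **kwargs):
    """Run cmd (a list of arguments) with CUTracer injected; return its exit code."""
    return subprocess.run(cmd, env=cutracer_env(so_path, **kwargs)).returncode
```

A call such as `run_traced(["./your_app"], "~/CUTracer/lib/cutracer.so", analyses=["deadlock_detection"])` would then be equivalent to prefixing the command with the corresponding variables in a shell.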
- CUTRACER_INSTRUMENT: comma-separated modes: opcode_only, reg_trace, mem_trace, random_delay
- CUTRACER_ANALYSIS: comma-separated analyses: proton_instr_histogram, deadlock_detection, random_delay
  - Enabling proton_instr_histogram auto-enables opcode_only
  - Enabling deadlock_detection auto-enables reg_trace
  - Enabling random_delay auto-enables random_delay instrumentation; also requires CUTRACER_DELAY_NS to be set
- KERNEL_FILTERS: comma-separated substrings matching unmangled or mangled kernel names
- INSTR_BEGIN, INSTR_END: static instruction index gate during instrumentation
- TOOL_VERBOSE: 0/1/2
- CUTRACER_TRACE_FORMAT: trace output format. Accepts string names or numeric values (replaces the legacy TRACE_FORMAT_NDJSON env var, which is still accepted for backward compatibility)
  - ndjson (or 2, default): NDJSON uncompressed (.ndjson)
  - text (or 0): Plain text (.log, legacy format, verbose)
  - zstd (or 1): NDJSON+Zstd compressed (.ndjson.zst, ~12x compression, 92% space savings)
  - clp (or 3): CLP Archive (.clp)
- CUTRACER_ZSTD_LEVEL: Zstd compression level (1-22, default 9)
  - Lower values (1-3): Faster compression, slightly larger output
  - Higher values (19-22): Maximum compression, slower but smallest output
  - Default of 9 provides balanced compression speed and ratio
- CUTRACER_DELAY_NS: Max delay value in nanoseconds for random_delay analysis (required when random_delay is enabled)
- CUTRACER_DELAY_MIN_NS: Minimum delay in nanoseconds; acts as the floor for random mode (default: 0). Must be ≤ CUTRACER_DELAY_NS
- CUTRACER_DELAY_MODE: Delay mode: random (per-thread random delay in [min, max], default) or fixed (same delay for all threads, often masks races)
- CUTRACER_DELAY_DUMP_PATH: Output path for delay config JSON file (for recording instrumentation patterns)
- CUTRACER_DELAY_LOAD_PATH: Input path for delay config JSON file (for replay mode, i.e. deterministic reproduction)
- CUTRACER_OUTPUT_DIR: Output directory for all CUTracer files (trace files and log files). Defaults to the current directory. The directory must exist and be writable.
- CUTRACER_CPU_CALLSTACK: CPU call stack capture mode at each kernel launch (default: auto)
  - auto (default): Prefer PyTorch CapturedTraceback for Python frames, fall back to C++ backtrace if Python/PyTorch is unavailable
  - pytorch: Force PyTorch CapturedTraceback only (returns empty if unavailable)
  - backtrace: Force C++ backtrace only (original behavior)
  - 1: Same as auto (backward compatible)
  - 0: Disable call stack capture
  - When enabled, the kernel_metadata trace event includes a cpu_callstack array and a cpu_callstack_source field ("pytorch" or "backtrace") indicating the capture method used
- CUTRACER_KERNEL_TIMEOUT_S: Kernel execution time limit in seconds (default: 0 = disabled)
  - Terminates the process with SIGTERM when a kernel runs longer than this value
  - Acts as a general safety valve, independent of deadlock detection (does not require -a deadlock_detection)
- CUTRACER_NO_DATA_TIMEOUT_S: No-data hang detection timeout in seconds (default: 15)
  - Terminates the process with SIGTERM when no trace data arrives for this duration
  - Acts as a general safety valve, independent of deadlock detection (does not require -a deadlock_detection)
  - Catches "silent" hangs where all warps are blocked on synchronization primitives with zero trace output
  - Works whether the kernel went silent after producing some data, or never produced any data at all
  - When -a deadlock_detection is also active, prints detailed warp status summary before termination
  - Set to 0 to disable
- CUTRACER_TRACE_SIZE_LIMIT_MB: Maximum trace file size in MB (default: 0 = disabled)
  - When any trace file exceeds this limit, tracing is stopped for that kernel; kernel execution continues normally
  - Useful for preventing runaway trace files from filling disk (e.g., during deadlocked kernels)
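
The CUTRACER_TRACE_FORMAT values listed above can be summarized as a small lookup table, which is handy when a post-processing script needs to guess the file extension of a trace. This is a sketch built only from the names, numbers, and extensions in this README; the function name `resolve_trace_format` is hypothetical, and exact-match parsing is an assumption (CUTracer itself may accept other spellings).

```python
# name -> (numeric value, file extension), as listed in the README
TRACE_FORMATS = {
    "text": (0, ".log"),
    "zstd": (1, ".ndjson.zst"),
    "ndjson": (2, ".ndjson"),
    "clp": (3, ".clp"),
}


def resolve_trace_format(value="ndjson"):
    """Return (name, numeric_value, extension) for a CUTRACER_TRACE_FORMAT value."""
    text = str(value).strip()
    for name, (number, extension) in TRACE_FORMATS.items():
        # Both the string name and the numeric form are accepted.
        if text == name or text == str(number):
            return name, number, extension
    raise ValueError(f"unknown trace format: {value!r}")
```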
Notes:
- The tool sets CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 to simplify channel memory handling.
- Multiple analyses can be combined (e.g., CUTRACER_ANALYSIS=proton_instr_histogram,deadlock_detection). Each analysis auto-enables its required instrumentation mode.
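
To make the auto-enable rule concrete, the following sketch computes which instrumentation modes end up active for a given set of environment variables. It mirrors the mapping stated in this README; the function name `effective_instrumentation` and the error handling are assumptions, not CUTracer's actual implementation.

```python
# analysis -> instrumentation mode it auto-enables (from the README)
REQUIRED_MODE = {
    "proton_instr_histogram": "opcode_only",
    "deadlock_detection": "reg_trace",
    "random_delay": "random_delay",
}


def effective_instrumentation(env):
    """Return the instrumentation modes implied by an environment mapping."""

    def split(name):
        return [item.strip() for item in env.get(name, "").split(",") if item.strip()]

    modes = split("CUTRACER_INSTRUMENT")
    for analysis in split("CUTRACER_ANALYSIS"):
        if analysis not in REQUIRED_MODE:
            raise ValueError(f"unknown analysis: {analysis}")
        if analysis == "random_delay" and not env.get("CUTRACER_DELAY_NS"):
            raise ValueError("random_delay requires CUTRACER_DELAY_NS")
        mode = REQUIRED_MODE[analysis]
        if mode not in modes:
            modes.append(mode)  # auto-enable the mode the analysis depends on
    return modes
```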
- Counts SASS instruction mnemonics per warp within regions delimited by clock reads (start/stop model; nested regions not supported)
- Output: one CSV per kernel launch with columns warp_id, region_id, instruction, count
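
As a rough illustration of how such a histogram CSV might be consumed, the sketch below sums the count column by any combination of the other columns. It assumes the file starts with a header row naming the four columns; the sample rows and the function name `histogram_totals` are made up for the example.

```python
import csv
import io
from collections import Counter

# Made-up sample in the documented column layout (header row is an assumption).
SAMPLE = """warp_id,region_id,instruction,count
0,0,IADD3,4
0,0,LDG.E,2
1,0,IADD3,4
1,0,STG.E,1
"""


def histogram_totals(csv_file, keys=("instruction",)):
    """Sum the count column grouped by the given columns."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[tuple(row[key] for key in keys)] += int(row["count"])
    return totals


per_instruction = histogram_totals(io.StringIO(SAMPLE))
per_region = histogram_totals(io.StringIO(SAMPLE), keys=("warp_id", "region_id"))
```

Grouping by (warp_id, region_id) gives the per-region instruction count that an IPC calculation would divide by the region's cycle count.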
Example (Triton/Proton + IPC):
cd ~/CUTracer/tests/proton_tests
# 1) Collect histogram with CUTracer
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py
# 2) Run without CUTracer to generate a clean Chrome trace
python ./vector-add-instrumented.py
# 3) Merge and compute IPC
python ~/CUTracer/scripts/parse_instr_hist_trace.py \
--chrome-trace ./vector.chrome_trace \
--cutracer-trace ./kernel_*_add_kernel_hist.csv \
--cutracer-log ./cutracer_main_*.log \
--output vectoradd_ipc.csv

- Detects sustained hangs by identifying warps stuck in stable PC loops; logs the hang, then sends SIGTERM followed by SIGKILL if it persists
- Requires reg_trace (auto-enabled)
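
One way to picture "stuck in a stable PC loop" is a warp whose recent program counters keep cycling through the same small set of addresses. The heuristic below is an illustration only and is not CUTracer's detector: the function name, the `window` and `repeats` parameters, and the set-comparison rule are all assumptions.

```python
def in_stable_pc_loop(pcs, window=8, repeats=3):
    """Return True if the last `repeats` windows of PCs visit the same addresses.

    pcs is the chronological list of program counters observed for one warp.
    """
    needed = window * repeats
    if len(pcs) < needed:
        return False  # not enough history to call the loop stable
    tail = pcs[-needed:]
    pc_sets = [frozenset(tail[i:i + window]) for i in range(0, needed, window)]
    # Stable loop: every window covers exactly the same set of addresses.
    return all(pc_set == pc_sets[0] for pc_set in pc_sets)


spinning = [0x100, 0x110, 0x120, 0x130] * 6      # same four PCs over and over
progressing = list(range(0x100, 0x100 + 24 * 16, 16))  # always new PCs
```

A long but finite loop looks identical to a hang over a short window, which is why a real detector has to require that the pattern be sustained before acting.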
Example (intentional loop):
cd ~/CUTracer/tests/hang_test
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py

- Data races depend on thread scheduling and timing, so buggy code may appear correct by luck. This analysis exposes hidden races by injecting random delays before synchronization-related SASS instructions (e.g., BAR, MEMBAR, ATOM, RED), disrupting the normal timing and making latent races more likely to manifest as observable failures.
- Each instrumentation point is randomly enabled/disabled (50% probability)
- Two delay modes:
  - random (default): Each thread gets a random delay in [0, CUTRACER_DELAY_NS] (or [CUTRACER_DELAY_MIN_NS, CUTRACER_DELAY_NS] when a minimum is set) using GPU-side xorshift32 PRNG seeded with threadIdx/blockIdx/clock. Creates per-thread timing skew that amplifies data races. Recommended.
  - fixed: All threads get the same delay. Preserves relative timing between threads and often masks races rather than exposing them. Not recommended for race detection.
- Requires CUTRACER_DELAY_NS to be set. The random_delay instrumentation mode is auto-enabled.
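
The per-thread delay in random mode can be sketched on the host as follows. The xorshift32 step uses the standard 13/17/5 shift triple; how CUTracer actually mixes threadIdx, blockIdx, and the clock into a seed is not documented here, so the mixing constants and the function name `thread_delay_ns` are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(state):
    """One step of the standard 32-bit xorshift generator (shifts 13, 17, 5)."""
    state &= MASK32
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state


def thread_delay_ns(thread_idx, block_idx, clock, min_ns, max_ns):
    """Pick a delay in [min_ns, max_ns] for one thread (hypothetical seed mixing)."""
    if not 0 <= min_ns <= max_ns:
        raise ValueError("need 0 <= min_ns <= max_ns")
    # Assumed mixing of the three seed sources; the real kernel code may differ.
    seed = ((thread_idx * 0x9E3779B1) ^ (block_idx * 0x85EBCA6B) ^ clock) & MASK32
    seed = seed or 1  # xorshift32 never leaves state 0, so avoid it
    return min_ns + xorshift32(seed) % (max_ns - min_ns + 1)
```

Because neighbouring threads get different seeds, they receive different delays, which is the timing skew the random mode relies on; in fixed mode every thread would see the same value instead.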
Example:
CUTRACER_DELAY_NS=100000 \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python3 your_kernel.py

CUTracer supports dumping delay configurations to JSON for deterministic reproduction of data races:
- Dump mode: Set CUTRACER_DELAY_DUMP_PATH to save the random instrumentation pattern to a JSON file
- Replay mode: Set CUTRACER_DELAY_LOAD_PATH to load a saved config and reproduce the exact same delay pattern
Note: You cannot use both at the same time.
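
The constraints on the delay settings (dump and replay are mutually exclusive, the minimum must not exceed the maximum, the mode is either random or fixed) can be checked before launching a long run. The validator below is a sketch under the assumption that the random_delay analysis is enabled; the function name and messages are hypothetical, and CUTracer may report these conditions differently.

```python
def validate_delay_env(env):
    """Return a list of problems found in the delay-related settings."""
    problems = []
    if env.get("CUTRACER_DELAY_DUMP_PATH") and env.get("CUTRACER_DELAY_LOAD_PATH"):
        problems.append("set only one of CUTRACER_DELAY_DUMP_PATH and CUTRACER_DELAY_LOAD_PATH")
    mode = env.get("CUTRACER_DELAY_MODE", "random")
    if mode not in ("random", "fixed"):
        problems.append(f"unknown CUTRACER_DELAY_MODE: {mode!r}")
    if "CUTRACER_DELAY_NS" not in env:
        problems.append("CUTRACER_DELAY_NS is required for random_delay")
        return problems
    try:
        max_ns = int(env["CUTRACER_DELAY_NS"])
        min_ns = int(env.get("CUTRACER_DELAY_MIN_NS", "0"))
    except ValueError:
        problems.append("delay values must be integers (nanoseconds)")
        return problems
    if min_ns < 0 or min_ns > max_ns:
        problems.append("need 0 <= CUTRACER_DELAY_MIN_NS <= CUTRACER_DELAY_NS")
    return problems
```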
Workflow:
- Run with CUTRACER_DELAY_DUMP_PATH=/tmp/config.json to record the delay pattern
- When a failure occurs, save the config file
- Replay with CUTRACER_DELAY_LOAD_PATH=/tmp/config.json to reproduce deterministically
- Reduce with cutracer reduce to find the minimal set of delay points (see below)
The reduce subcommand finds the minimal set of delay injection points that trigger a data race. Two strategies:
- linear: Tests each point one by one. O(N) test runs. Simple but slow.
- bisect: ddmin-style bisection. Splits points in half and recursively narrows down. Typically O(log N) iterations. Recommended for large configs.
Use --confidence-runs N (odd number) for majority voting when the race is probabilistic.
# Bisection reduction (fast)
cutracer reduce -c config.json -t ./test_race.sh --strategy bisect --confidence-runs 3
The test script convention follows llvm-reduce: exit 0 = interesting (race occurred), exit 1+ = not interesting (no race).
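
The bisect strategy and the majority vote can be illustrated with a generic ddmin-style reducer. This is a textbook-style sketch and not the code behind cutracer reduce: `reduce_points`, `majority`, and the `test` callback (which returns True when the race reproduces with a given subset of delay points) are hypothetical names, and the sketch assumes the full set of points is already known to trigger the race.

```python
def majority(test, subset, runs=1):
    """True if test(subset) reports the race in more than half of `runs` runs."""
    hits = sum(1 for _ in range(runs) if test(subset))
    return hits * 2 > runs


def reduce_points(points, test, runs=1):
    """Shrink `points` to a smaller subset that still makes `test` succeed."""
    points = list(points)
    n = 2  # current number of chunks
    while len(points) >= 2:
        chunk = -(-len(points) // n)  # ceiling division
        subsets = [points[i:i + chunk] for i in range(0, len(points), chunk)]
        reduced = False
        # First try each chunk on its own.
        for subset in subsets:
            if majority(test, subset, runs):
                points, n, reduced = subset, 2, True
                break
        # Then try dropping one chunk at a time.
        if not reduced:
            for i in range(len(subsets)):
                rest = [p for j, s in enumerate(subsets) if j != i for p in s]
                if rest and majority(test, rest, runs):
                    points, n, reduced = rest, max(n - 1, 2), True
                    break
        if not reduced:
            if n >= len(points):
                break  # already at single-point granularity
            n = min(len(points), n * 2)
    return points
```

With an odd `runs` value the vote can never tie, which matches the recommendation to pass an odd number to --confidence-runs.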
The examples/ directory contains reference trace outputs for common workflows:
- Proton Trace -- sample instruction histogram CSV, CUTracer log, and a README explaining the end-to-end proton instrumentation workflow for a Triton vector-add kernel
- No CSV/log: check CUDA_INJECTION64_PATH, KERNEL_FILTERS, and write permissions
- Empty histogram: ensure kernels emit clock instructions (e.g., Triton pl.scope)
- High overhead: prefer opcode-only; narrow filters; use INSTR_BEGIN/INSTR_END
- CUDA Graph/stream capture: data is flushed at cuGraphLaunch exit; ensure stream sync
- IPC merge issues: resolve warp mismatches and kernel hash ambiguity with parser flags
This repository contains code under the MIT license (Meta) and the BSD-3-Clause license (NVIDIA). See LICENSE and LICENSE-BSD for details.
The full documentation lives in the Wiki. Key topics include Quickstart, Analyses, Post-processing, Configuration, Outputs, API & Data Structures, Developer Guide, and Troubleshooting.