
Releases: facebookresearch/CUTracer

CUTracer v0.2.1 Release 🎉

23 Apr 03:41


CUTracer v0.2.1 Release Notes

🎉 Patch Release — 24 commits since v0.2.0

This release brings PyTorch-native Python callstack capture, a new kernel events recording system, cluster-level delay injection for inter-CTA race detection, significant trace format optimizations via instruction table embedding, a critical NVBit 1.8 TMA compatibility fix, and several CI/security hardening improvements. Date range: 2026-04-08 to 2026-04-22.


✨ Highlights

  • PyTorch CapturedTraceback Integration — Full Python callstack capture via PyTorch's CapturedTraceback API with zero compile-time dependencies, replacing unsymbolized C++ backtrace frames with readable Python call sites
  • Kernel Events Recording — New structured NDJSON logging of all kernel launches (not just instrumented ones) with callstack deduplication for minimal file overhead
  • Cluster-Level Delay Mode — New --delay-mode cluster that delays one random CTA per cluster to expose missing inter-CTA synchronization bugs
  • Instruction Table Embedding — Per-instruction static table with SASS binary encoding embedded in kernel_metadata, eliminating redundant per-record SASS strings from JSON traces
  • Custom Delay Patterns — New --delay-patterns flag for injecting delays at arbitrary SASS instruction types, including a "*" wildcard mode
  • NVBit 1.8 TMA Fix — Critical fix for TMA operand extraction regression introduced by NVBit 1.8's new TMA_PARAM_HANDLE operand type

🧠 PyTorch CapturedTraceback Callstack Capture

A three-part series replaces the existing backtrace()-based C++ callstack capture with PyTorch's CapturedTraceback API, producing readable Python call stacks that show the actual user code and PyTorch layers that launched a kernel.

  • Backtrace refactor (#202) — Extract capture_cpu_callstack() into capture_cpu_callstack_backtrace() to prepare for alternative backends
  • CapturedTraceback module (#203) — New python_callstack.cpp (~360 lines) that dynamically resolves Python C API functions via dlsym(dlopen(NULL)) — zero compile-time Python/PyTorch dependencies. Calls CapturedTraceback.extract().summary() when the current thread holds the GIL
  • Dynamic mode selection (#204) — New CpuCallstackMode enum (AUTO/PYTORCH/BACKTRACE/DISABLED) replacing the old boolean flag. AUTO (default) tries PyTorch first, falls back to backtrace(). New cpu_callstack_source field in kernel_metadata JSON output
  • Auto-GIL acquisition — New auto_gil mode that re-acquires the GIL via PyGILState_Ensure() for Triton kernels where __triton_launcher releases the GIL before cuLaunchKernelEx. Safe because the Python frame chain is frozen at the Triton launch() call site
# Use PyTorch callstacks (auto-detected by default)
cutracer trace --cpu-callstack auto -- python my_triton_kernel.py

# Force auto_gil mode for Triton kernels
cutracer trace --cpu-callstack auto_gil -- python my_triton_kernel.py

📋 Kernel Events Recording

New structured logging of every kernel launch to a dedicated NDJSON file (cutracer_kernel_events_*.ndjson), independent of instrumentation. Key innovation: callstack deduplication using FNV-1a hashing — each unique Python callstack is emitted once as a callstack_def record, with subsequent launches referencing it by callstack_id.

  • Three modes: dedup (recommended), full (inline callstack per launch), nostack (metadata only)
  • Zero overhead when disabled (default)
  • Query integration — The query command now handles kernel events files seamlessly: callstack_def records are cached and resolved, kernel_launch records get a caller field injected with the innermost call site frame
  • Recommended query pattern: --group-by kernel_checksum --count for launch frequency analysis
# Record kernel events with callstack dedup
cutracer trace --kernel-events dedup -- python my_app.py

# Query kernel launch counts
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count
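For intuition, the dedup scheme can be sketched in a few lines of Python. This is a minimal sketch, assuming a 64-bit FNV-1a variant and illustrative record/field names (`type`, `callstack_id`, `frames`) — not CUTracer's exact schema:

```python
def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash (assumed variant; CUTracer may differ)."""
    h = 0xCBF29CE484222325
    for b in data:
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h

def record_launch(out: list, seen: set, kernel: str, frames: list) -> None:
    """Emit a callstack_def record the first time a callstack is seen;
    every launch afterwards references it by callstack_id only."""
    cs_id = fnv1a_64("\n".join(frames).encode())
    if cs_id not in seen:
        seen.add(cs_id)
        out.append({"type": "callstack_def", "callstack_id": cs_id, "frames": frames})
    out.append({"type": "kernel_launch", "kernel": kernel, "callstack_id": cs_id})
```

Repeated launches from the same call site then cost one small kernel_launch record each instead of a full inline callstack.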

🔀 Cluster-Level Random Delay Mode

New --delay-mode cluster (#207) that delays only one randomly-selected CTA within each cluster while other CTAs proceed at normal speed (~430 lines). This creates timing asymmetry between CTAs in the same cluster, exposing missing inter-CTA synchronization bugs that --delay-mode random (intra-CTA) would not catch.

  • Uses cluster_ctaid/cluster_nctaid PTX registers for CTA selection within each cluster
  • cluster_seed stored in delay config JSON for deterministic replay
  • Host-side cluster dimension detection via CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION (including CUDA graph captures), with fallback to cuFuncGetAttribute
  • One-time-per-kernel diagnostic log of runtime cluster dimensions
  • Automatic warning when used against non-cluster-launched kernels (no-op detection)
# Expose inter-CTA synchronization bugs
cutracer trace -i random_delay -a random_delay \
    --delay-mode cluster --delay-ns 10000 \
    --kernel-filters my_cluster_kernel -- python test.py
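Conceptually, the per-cluster CTA selection works like the sketch below. The seeding scheme and function name are assumptions for illustration only — CUTracer's actual on-device RNG differs — but it shows why storing cluster_seed in the delay config makes replay deterministic:

```python
import random

def delayed_cta_rank(cluster_seed: int, cluster_id: int, cluster_size: int) -> int:
    """Pick one CTA rank within a cluster to delay. Mixing the cluster id
    into the seed keeps the choice per-cluster yet reproducible on replay."""
    rng = random.Random((cluster_seed << 32) | cluster_id)
    return rng.randrange(cluster_size)
```

Replaying with the same cluster_seed re-selects the same CTA in every cluster, so a race found once can be reproduced exactly.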

🎯 Custom Delay Patterns

New --delay-patterns flag (#211) that lets users specify arbitrary SASS instruction substrings for delay injection, overriding the built-in DELAY_INJECTION_PATTERNS list. This enables targeted testing of specific instruction types without modifying source code.

  • Comma-separated patterns: --delay-patterns "UTMALDG,UTMASTG"
  • Wildcard mode — --delay-patterns "*" injects delay on a random 50% subset of all SASS instructions
  • Also adds SYNCS.EXCH (mbarrier init) to the built-in delay injection patterns
  • Plumbed through CUTRACER_DELAY_PATTERNS environment variable
# Delay only TMA instructions
cutracer trace -i random_delay --delay-ns 5000 \
    --delay-patterns "UTMALDG,UTMASTG" -- python test.py

# Stress-test: random delay on 50% of ALL instructions
cutracer trace -i random_delay --delay-ns 1000 \
    --delay-patterns "*" -- python test.py
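The pattern matching itself is plain substring matching against the SASS text. A minimal sketch — function names are illustrative, and this ignores the 50% random sampling that applies on top in wildcard mode:

```python
def parse_patterns(env_value: str) -> list:
    """Split a CUTRACER_DELAY_PATTERNS-style comma-separated pattern list."""
    return [p.strip() for p in env_value.split(",") if p.strip()]

def matches_delay_pattern(sass: str, patterns: list) -> bool:
    """A SASS line is a delay candidate if any pattern is a substring of it,
    or if the '*' wildcard is present."""
    return any(p == "*" or p in sass for p in patterns)
```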

📦 Instruction Table Embedding & Trace Format Optimization

Three-part series that embeds a per-instruction static table in kernel_metadata and eliminates redundant per-record SASS strings from JSON output:

  • Instruction table — Each kernel_metadata record now includes an instructions array indexed by opcode_id, containing SASS disassembly, binary encoding (via NVBit 1.8 getSassBinary() API), register indices, and uniform register indices
  • Schema update — kernel_metadata.schema.json updated with the instructions property definition
  • Per-record SASS removal — j["sass"] serialization removed from JSON output (text mode unchanged). The Python TraceReader caches the instruction table and injects sass into records on read via opcode_id lookup. Backward compatible: old traces with inline SASS still work
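The read-side lookup can be sketched as follows; record shapes here are illustrative, not the exact kernel_metadata schema:

```python
import json

def iter_records(ndjson_lines):
    """Stream trace records, re-attaching SASS text from the instruction
    table embedded in kernel_metadata via opcode_id lookup."""
    table = []
    for line in ndjson_lines:
        rec = json.loads(line)
        if rec.get("type") == "kernel_metadata":
            table = rec.get("instructions", [])
        elif "opcode_id" in rec and "sass" not in rec:
            rec["sass"] = table[rec["opcode_id"]]["sass"]  # inject on read
        yield rec
```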

🐛 Bug Fixes

  • NVBit 1.8 TMA operand extraction regression — NVBit 1.8 changed UTMALDG/UTMASTG/UTMAREDG from two separate MREF operands to a single TMA_PARAM_HANDLE operand. CUTracer's operand loop had no handler, silently dropping all UR information and breaking tma_trace, data-race, tma, mma, and dataflow analysis commands. Added a manual TMA_PARAM_HANDLE decoder plus unhandled-operand debug logging for future NVBit changes
  • Kernel hash always 0x0 — compute_kernel_checksum() was only called inside the instrumentation path; now computed at metadata creation time via the idempotent ensure_kernel_checksum() helper
  • Log truncation on SIGTERM (#206) — Added flush_log_files() before raise(SIGTERM) in deadlock detection, kernel timeout, and no-data timeout termination paths
  • Cluster warning noise (#210) — Gated [CLUSTER] warning by should_instrument so non-matching kernels don't clutter stderr; also fixed legacy MEMTRACE: prefix to CUTracer:
  • Kernel events writer lifetime — Fixed multi-context crash where first context teardown destroyed the shared writer; now cleaned up only when all contexts are gone
  • Python callstack lineno — Clamped negative PyFrame_GetLineNumber() return values unconditionally (previously only inside PyErr_Occurred() block)
  • Kernel events NDJSON enforcement — Force NDJSON format for kernel events writer regardless of CUTRACER_TRACE_FORMAT, with warning when text mode is active

🖥️ CLI Changes

New Flags

# Callstack capture mode (replaces boolean --cpu-callstack 0/1)
cutracer trace --cpu-callstack auto|pytorch|backtrace|auto_gil|0|1

# Kernel events recording
cutracer trace --kernel-events dedup|full|nostack

# Cluster delay mode
cutracer trace --delay-mode cluster  # (alongside existing random/fixed)

# Custom delay patterns
cutracer trace --delay-patterns "PATTERN1,PATTERN2"  # or "*" for all

Query Command Updates

# Query kernel events files (auto-detected by file naming)
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count

📁 Configuration Changes

Updated Environment Variables

| Variable | Change | Description |
|---|---|---|
| CUTRACER_CPU_CALLSTACK | Extended | Now accepts: auto, pytorch, backtrace, auto_gil, 0, 1 (was: 0/1 only) |
| CUTRACER_KERNEL_EVENTS | New | Kernel events recording mode: 0 (disabled, default), dedup, full, nostack |
| CUTRACER_DELAY_MODE | Extended | New cluster value (alongside existing random/fixed) |
| CUTRACER_DELAY_PATTERNS | New | Comma-separated SASS instruction substrings for delay injection; "*" for all instructions |

🔒 Security & CI

  • CodeQL fixes (#209) — Replaced fopen("w") with open(O_CREAT|O_WRONLY|O_TRUNC, 0644) + fdopen() for explicit file permissions; added permissions: contents: read to CI ...

CUTracer v0.2.0 Release

08 Apr 23:14


🎉 Major Release — 114 commits since v0.1.0

CUTracer v0.2.0 brings Blackwell GPU support, a unified CLI experience, advanced data race reduction, and significant improvements to trace infrastructure.


✨ Highlights

  • Blackwell (SM100) GPU Support — Tensor core instruction tracing for UTC*MMA, UTMALDG, UTMAREDG, and TMA descriptors
  • Unified CLI — New cutracer trace subcommand replaces manual CUDA_INJECTION64_PATH setup
  • Data Race Reducer — DDMin bisection algorithm to automatically find minimal race-triggering configurations
  • NVBit 1.8 — Updated from NVBit 1.7.7.1 to 1.8, with a critical fix for <<<>>> kernel launch deadlocks
  • CPU Call Stack Capture — Per-kernel-launch host-side stack traces for debugging
  • Kernel Timeouts & Safety Limits — Configurable execution timeout and trace file size limits

🏗️ Blackwell GPU Support

Full tracing support for NVIDIA Blackwell architecture:

  • UTC*MMA tensor core instructions — Trace Blackwell's new warp-group MMA operations (#161)
  • UTMAREDG tracing — Support for TMA reduction instructions (#162)
  • UTMALDG decoder — Decode TMA load descriptor parameters
  • TMA descriptor tracing — Capture and decode TMA descriptor fields for tile configuration analysis (#155)
  • TMA descriptor decoding in SASS — Extract descriptor parameters from cubin SASS output
  • Tensor memory delay injection — Extend random delay to TMA instructions for data race detection (#189)

🖥️ Unified CLI

The CLI has been completely revamped with a unified cutracer entry point:

cutracer trace — Run and Trace

# Trace a CUDA application (replaces manual CUDA_INJECTION64_PATH setup)
cutracer trace --instrument opcode_only -- python my_kernel.py

# With cubin dump and output directory
cutracer trace --instrument reg_trace --dump-cubin --output-dir ./traces -- python my_kernel.py

# Shell-style environment variable passthrough
cutracer trace CUTRACER_DELAY_NS=1000 -- python my_kernel.py

cutracer query — Query Trace Data

# Filter and query traces
cutracer query trace.ndjson --filter "warp=24"
cutracer query trace.ndjson --filter "cta=[0,0,0],opcode=LDG"  # Multi-condition AND filter
cutracer query trace.ndjson --output result.ndjson --compress

cutracer analyze — Analyze Traces

# Warp execution summary
cutracer analyze warp-summary trace.ndjson

cutracer reduce — Minimize Race Configs

# Find minimal delay configuration that triggers a race
cutracer reduce --config delay_config.json -- python my_kernel.py

cutracer sass — SASS Extraction

# Extract SASS from cubin files
cutracer sass --cubin kernel.cubin

🔍 Data Race Detection Enhancements

DDMin Bisection Reducer (#187)

Automatically reduce a delay configuration to the minimal set of delay points that still trigger a data race, using the delta debugging (ddmin) algorithm:

  • Exponentially faster than brute-force elimination
  • Produces minimal reproducible configurations
  • Integrated via cutracer reduce CLI command
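The core of ddmin is easy to sketch. Below is a simplified complement-only variant; the real reducer re-runs the workload under each candidate delay configuration, which `still_races` stands in for here:

```python
def ddmin(delay_points, still_races):
    """Delta-debugging (ddmin, complement-only variant): shrink a delay
    configuration while the race still reproduces. still_races(points)
    re-runs the workload with only those delay points enabled."""
    n = 2
    while len(delay_points) >= 2:
        chunk = max(1, len(delay_points) // n)
        subsets = [delay_points[i:i + chunk]
                   for i in range(0, len(delay_points), chunk)]
        for s in subsets:
            complement = [p for p in delay_points if p not in s]
            if complement and still_races(complement):
                # This chunk was unnecessary: drop it and coarsen.
                delay_points, n = complement, max(n - 1, 2)
                break
        else:
            # No chunk could be removed: refine the granularity.
            if n >= len(delay_points):
                break
            n = min(len(delay_points), n * 2)
    return delay_points
```

Because whole chunks are discarded at once, the number of re-runs grows far more slowly than one-at-a-time elimination.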

Per-Thread Random Delay Mode (#186)

  • New CUTRACER_DELAY_MODE=per_thread for thread-level delay granularity
  • Better coverage for detecting fine-grained data races

Delay Config Mutator (#145)

  • Programmatic API for manipulating delay configurations
  • Enables automated delay sweep workflows

⏱️ Reliability & Safety

  • Kernel execution timeout (CUTRACER_KERNEL_TIMEOUT_S) — Kill kernels that exceed a time limit (#169)
  • No-data timeout — Detect silent hangs when no trace data is produced
  • Trace file size limit (CUTRACER_TRACE_SIZE_LIMIT_MB) — Prevent runaway disk usage (#169)
  • Periodic flush — TraceWriter and log files flush periodically during kernel hangs, ensuring data is available for post-mortem analysis
  • Configurable channel buffer size (CUTRACER_CHANNEL_RECORDS) — Tune buffer for hang debugging scenarios
  • Fix <<<>>> deadlock — Preload flush_channel via fatbin + NVBit tool API to eliminate kernel launch deadlocks (#199)
  • Fix CUDA graph handling — Prevent graph build/capture phase from prematurely executing per-launch side effects
  • Fix trace overwrite — Trace file write mode changed from append to overwrite across runs

🔧 Instrumentation Improvements

  • Instruction category system — Conditional instrumentation based on instruction categories (#134)
  • IPOINT configuration — Configure instrumentation points via environment variables (#183)
  • Register indices in trace output — CPU-side static mapping of register operands (#143)
  • opcode_only trace writing — Lightweight opcode-only mode now writes structured trace output
  • Auto-enable cubin dump — Cubin dump auto-enabled when instrumentation is active (#191)
  • Kernel checksum — Robust delay config replay using kernel binary checksums (#133, #141)
  • CPU call stack capture — Host-side stack trace for each kernel launch (#172)
  • Re-execute compiled kernel — Ensure trace captures both compilation and execution runs

📁 Configuration Changes

Renamed Environment Variables

| Old | New |
|---|---|
| TRACE_FORMAT_NDJSON | CUTRACER_TRACE_FORMAT |
| CUTRACER_TRACE_OUTPUT_DIR | CUTRACER_OUTPUT_DIR |

CUTRACER_TRACE_FORMAT now also accepts string names (e.g., ndjson_zst, ndjson, log) in addition to numeric values (#193).

New Environment Variables

| Variable | Description | Default |
|---|---|---|
| CUTRACER_KERNEL_TIMEOUT_S | Kernel execution timeout in seconds | (disabled) |
| CUTRACER_TRACE_SIZE_LIMIT_MB | Max trace file size in MB | (unlimited) |
| CUTRACER_CHANNEL_RECORDS | Channel buffer record count | (default) |
| CUTRACER_CPU_CALLSTACK | Enable CPU call stack capture | 0 |
| CUTRACER_DELAY_MODE | Delay mode (uniform/per_thread) | uniform |
| CUTRACER_OUTPUT_DIR | Unified output directory for all artifacts | . |
| CUTRACER_IPOINT | Instrumentation point configuration | (default) |

🔄 Dependency Updates

  • NVBit: 1.7.7.1 → 1.7.7.3 → 1.8 (#164, #198)
  • nlohmann/json: Updated default to 3.12.0
  • Python: CI updated to Python 3.13
  • GitHub Actions: Updated to latest versions
  • JSON parsing: Migrated to orjson for faster JSON I/O via tritonparse _json_compat
  • Daily NVBit update check: Automated GitHub Action to detect upstream NVBit releases (#163)

🐍 Python Package Improvements

  • CLP archive support — Dump and read CLP compressed log archives (#118, #148)
  • Unified logger module — Consistent logging across all Python modules
  • Schema validation — Migrated trace validation into the cutracer Python module (#154)
  • Query enhancements — Hex filters, --all-lines flag, NDJSON output, --output, --compress (#136)
  • Multi-condition AND filter — Filter by multiple fields simultaneously (#139)
  • JSON list value filters — Support cta=[0,0,0] style filter expressions
  • KernelConfig abstraction — Clean API for trace metadata
  • TraceWriter metadata — write_metadata() and kernel_metadata event support (#153)
  • Truncated trace detection — Detect and handle truncated trace files gracefully
  • GB200 aarch64 support — Installation scripts updated for GB200 platforms (#159, #173)

📋 Requirements

  • CUDA Toolkit: Aligned with NVBit 1.8 requirements
  • libzstd: Required for trace compression
  • Python 3.10+: For Python package
  • NVBit 1.8: Bundled (auto-downloaded during build)

⚠️ Breaking Changes

  • TRACE_FORMAT_NDJSON renamed to CUTRACER_TRACE_FORMAT (#192)
  • CUTRACER_TRACE_OUTPUT_DIR renamed to CUTRACER_OUTPUT_DIR (#167)
  • CLI entry point unified to cutracer (replaces cutraceross)
  • --all flag renamed to --all-lines (#157)
  • analysis module renamed to query (#135)
  • pc field in trace output changed to hex string format (#137)

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework and the v1.8 release.


📄 License

  • MIT License — Meta Platforms, Inc. contributions
  • BSD-3-Clause License — NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.


📚 Documentation

Full documentation is available in the Wiki.



CUTracer v0.1.0 Release 🎉

06 Feb 03:01


🎉 Initial Public Release

CUTracer is an NVBit-based CUDA binary instrumentation tool for GPU kernel analysis and debugging. It enables runtime-level insights without requiring application recompilation.


✨ Highlights

  • Zero-modification runtime injection - Attach to any CUDA application via CUDA_INJECTION64_PATH
  • GPU Hang Detection - Automatic deadlock identification with process termination
  • Data Race Detection - Random delay injection with deterministic replay support
  • Triton/Proton Integration - Per-warp instruction histograms with IPC calculation
  • Efficient Trace Compression - NDJSON + Zstd (~92% space savings)
  • Python Analysis Toolkit - Available on PyPI: pip install cutracer

🔧 Instrumentation Modes

| Mode | Environment Variable | Description |
|---|---|---|
| opcode_only | CUTRACER_INSTRUMENT=opcode_only | Lightweight instruction counting |
| reg_trace | CUTRACER_INSTRUMENT=reg_trace | Register value tracing (R/UREG support) |
| mem_addr_trace | CUTRACER_INSTRUMENT=mem_addr_trace | Memory address tracing |
| mem_value_trace | CUTRACER_INSTRUMENT=mem_value_trace | Memory address + value tracing (Global/Shared/Local) |
| random_delay | CUTRACER_INSTRUMENT=random_delay | Delay injection for race detection |

Multiple modes can be enabled simultaneously with comma-separated values.


📊 Built-in Analyses

Instruction Histogram (proton_instr_histogram)

  • Clock-delimited per-warp instruction counting
  • CSV output: warp_id,region_id,instruction,count
  • Integration with Triton Proton for IPC calculation
  • Requires kernels to emit clock instructions (e.g., Triton pl.scope())
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py
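The CSV aggregation step can be illustrated with a small sketch. Record fields here are assumptions for illustration; the real histogram is accumulated on-device per clock-delimited region:

```python
import csv
import io
from collections import Counter

def histogram_csv(records) -> str:
    """Aggregate per-warp, per-region instruction counts into the
    warp_id,region_id,instruction,count CSV layout."""
    counts = Counter((r["warp_id"], r["region_id"], r["instruction"])
                     for r in records)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["warp_id", "region_id", "instruction", "count"])
    for key, n in sorted(counts.items()):
        writer.writerow([*key, n])
    return buf.getvalue()
```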

Deadlock/Hang Detection (deadlock_detection)

  • Detects warps stuck in stable PC loops
  • Automatic SIGTERM → SIGKILL termination sequence
  • Detailed warp state logging for post-mortem analysis
  • Auto-enables reg_trace instrumentation
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py

Data Race Detection (random_delay)

  • Injects delays at synchronization points to expose timing-dependent races
  • Target SASS patterns:
    • SYNCS.PHASECHK.TRANS64.TRYWAIT (mbarrier try_wait)
    • SYNCS.ARRIVE.TRANS64.RED.A1T0 (mbarrier arrive)
    • UTMALDG.2D (TMA load)
    • WARPGROUP.DEPBAR.LE (MMA wait)

Deterministic Replay Support:

  • CUTRACER_DELAY_DUMP_PATH: Export delay config JSON for recording
  • CUTRACER_DELAY_LOAD_PATH: Load delay config JSON for exact replay
  • Workflow: Discover race with random delays → Reproduce exactly with saved config
# Record mode
CUTRACER_DELAY_NS=1000 \
CUTRACER_DELAY_DUMP_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py

# Replay mode (deterministic reproduction)
CUTRACER_DELAY_LOAD_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py

📁 Output Formats

| Mode | Extension | Description |
|---|---|---|
| 0 | .log | Human-readable text format |
| 1 (default) | .ndjson.zst | NDJSON + Zstd compressed |
| 2 | .ndjson | NDJSON uncompressed |

Set via TRACE_FORMAT_NDJSON environment variable.

Compression level configurable via CUTRACER_ZSTD_LEVEL (1-22, default: 22).


🐍 Python Package

Available on PyPI:

pip install cutracer

Features

Validation:

  • JSON syntax and schema validation
  • Text format validation
  • Cross-format consistency checking
  • Transparent Zstd compression handling

Analysis:

  • TraceReader: Stream trace records from NDJSON files
  • StreamingGrouper: Memory-efficient grouped analysis
  • WarpSummary: Warp execution status for hang analysis (completed/in-progress/missing)
  • Multi-format output: table, JSON, CSV

CLI Tools:

# Validate trace files
cutraceross validate trace.ndjson
cutraceross validate trace.ndjson.zst --verbose

# Analyze trace data
cutraceross analyze trace.ndjson --head 20
cutraceross analyze trace.ndjson --filter "warp=24"
cutraceross analyze trace.ndjson --group-by warp --count
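Under the hood, --group-by with --count is a streaming aggregation over NDJSON lines. A minimal sketch of the idea (not the actual StreamingGrouper implementation):

```python
import json
from collections import Counter

def group_count(ndjson_lines, field: str) -> Counter:
    """Count trace records per value of `field`, one line at a time,
    so large traces never need to be loaded into memory at once."""
    counts = Counter()
    for line in ndjson_lines:
        rec = json.loads(line)
        if field in rec:
            counts[rec[field]] += 1
    return counts
```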

⚙️ Configuration Reference

| Variable | Description | Default |
|---|---|---|
| CUTRACER_INSTRUMENT | Instrumentation modes (comma-separated) | (none) |
| CUTRACER_ANALYSIS | Analysis types (comma-separated) | (none) |
| KERNEL_FILTERS | Kernel name filters (substring match) | (none) |
| INSTR_BEGIN / INSTR_END | Instruction index range filter | 0 / UINT32_MAX |
| TRACE_FORMAT_NDJSON | Output format (0/1/2) | 1 |
| CUTRACER_ZSTD_LEVEL | Zstd compression level | 22 |
| CUTRACER_DELAY_NS | Delay value in nanoseconds | 0 (disabled) |
| CUTRACER_DELAY_DUMP_PATH | Export delay config JSON | (none) |
| CUTRACER_DELAY_LOAD_PATH | Load delay config for replay | (none) |
| TOOL_VERBOSE | Verbosity level (0/1/2) | 0 |
| CUTRACER_DUMP_CUBIN | Dump cubin files | 0 |

📋 Requirements

  • CUDA Toolkit: Aligned with NVBit requirements
  • libzstd: Required for trace compression
  • Python 3.10+: For Python package

⚠️ API Stability Notice

This is the initial public release (v0.1.0). APIs and configuration options may change in future versions as we gather feedback and iterate on the design.

Known Limitations

  • Instruction histogram requires clock instruction boundaries (e.g., Triton pl.scope())
  • Nested regions not supported for instruction histogram analysis

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework.


📄 License

  • MIT License - Meta Platforms, Inc. contributions
  • BSD-3-Clause License - NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.


📚 Documentation

Full documentation is available in the Wiki.

