
Releases: facebookresearch/CUTracer

CUTracer v0.2.1 Release 🎉

23 Apr 03:41


CUTracer v0.2.1 Release Notes

🎉 Patch Release — 24 commits since v0.2.0

This release brings PyTorch-native Python callstack capture, a new kernel events recording system, cluster-level delay injection for inter-CTA race detection, significant trace format optimizations via instruction table embedding, a critical NVBit 1.8 TMA compatibility fix, and several CI/security hardening improvements. Date range: 2026-04-08 to 2026-04-22.


✨ Highlights

  • PyTorch CapturedTraceback Integration — Full Python callstack capture via PyTorch's CapturedTraceback API with zero compile-time dependencies, replacing unsymbolized C++ backtrace frames with readable Python call sites
  • Kernel Events Recording — New structured NDJSON logging of all kernel launches (not just instrumented ones) with callstack deduplication for minimal file overhead
  • Cluster-Level Delay Mode — New --delay-mode cluster that delays one random CTA per cluster to expose missing inter-CTA synchronization bugs
  • Instruction Table Embedding — Per-instruction static table with SASS binary encoding embedded in kernel_metadata, eliminating redundant per-record SASS strings from JSON traces
  • Custom Delay Patterns — New --delay-patterns flag for injecting delays at arbitrary SASS instruction types, including a "*" wildcard mode
  • NVBit 1.8 TMA Fix — Critical fix for TMA operand extraction regression introduced by NVBit 1.8's new TMA_PARAM_HANDLE operand type

🧠 PyTorch CapturedTraceback Callstack Capture

A three-part series replaces the existing backtrace()-based C++ callstack capture with PyTorch's CapturedTraceback API, producing readable Python call stacks that show the actual user code and PyTorch layers that launched a kernel.

  • Backtrace refactor (#202) — Extract capture_cpu_callstack() into capture_cpu_callstack_backtrace() to prepare for alternative backends
  • CapturedTraceback module (#203) — New python_callstack.cpp (~360 lines) that dynamically resolves Python C API functions via dlsym(dlopen(NULL)) — zero compile-time Python/PyTorch dependencies. Calls CapturedTraceback.extract().summary() when the current thread holds the GIL
  • Dynamic mode selection (#204) — New CpuCallstackMode enum (AUTO/PYTORCH/BACKTRACE/DISABLED) replacing the old boolean flag. AUTO (default) tries PyTorch first, falls back to backtrace(). New cpu_callstack_source field in kernel_metadata JSON output
  • Auto-GIL acquisition — New auto_gil mode that re-acquires the GIL via PyGILState_Ensure() for Triton kernels where __triton_launcher releases the GIL before cuLaunchKernelEx. Safe because the Python frame chain is frozen at the Triton launch() call site
# Use PyTorch callstacks (auto-detected by default)
cutracer trace --cpu-callstack auto -- python my_triton_kernel.py

# Force auto_gil mode for Triton kernels
cutracer trace --cpu-callstack auto_gil -- python my_triton_kernel.py

📋 Kernel Events Recording

New structured logging of every kernel launch to a dedicated NDJSON file (cutracer_kernel_events_*.ndjson), independent of instrumentation. Key innovation: callstack deduplication using FNV-1a hashing — each unique Python callstack is emitted once as a callstack_def record, with subsequent launches referencing it by callstack_id.

  • Three modes: dedup (recommended), full (inline callstack per launch), nostack (metadata only)
  • Zero overhead when disabled (default)
  • Query integration — The query command now handles kernel events files seamlessly: callstack_def records are cached and resolved, kernel_launch records get a caller field injected with the innermost call site frame
  • Recommended query pattern: --group-by kernel_checksum --count for launch frequency analysis
# Record kernel events with callstack dedup
cutracer trace --kernel-events dedup -- python my_app.py

# Query kernel launch counts
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count
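For intuition, the dedup scheme can be sketched in a few lines of Python. This is a minimal sketch, assuming a 64-bit FNV-1a variant and illustrative record/field names (`type`, `callstack_id`, `frames`) — not CUTracer's exact schema:

```python
def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash (assumed variant; CUTracer may differ)."""
    h = 0xCBF29CE484222325
    for b in data:
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h

def record_launch(out: list, seen: set, kernel: str, frames: list) -> None:
    """Emit a callstack_def record the first time a callstack is seen;
    every launch afterwards references it by callstack_id only."""
    cs_id = fnv1a_64("\n".join(frames).encode())
    if cs_id not in seen:
        seen.add(cs_id)
        out.append({"type": "callstack_def", "callstack_id": cs_id, "frames": frames})
    out.append({"type": "kernel_launch", "kernel": kernel, "callstack_id": cs_id})
```

Repeated launches from the same call site then cost one small kernel_launch record each instead of a full inline callstack.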

🔀 Cluster-Level Random Delay Mode

New --delay-mode cluster (#207) that delays only one randomly-selected CTA within each cluster while other CTAs proceed at normal speed (~430 lines). This creates timing asymmetry between CTAs in the same cluster, exposing missing inter-CTA synchronization bugs that --delay-mode random (intra-CTA) would not catch.

  • Uses cluster_ctaid/cluster_nctaid PTX registers for CTA selection within each cluster
  • cluster_seed stored in delay config JSON for deterministic replay
  • Host-side cluster dimension detection via CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION (including CUDA graph captures), with fallback to cuFuncGetAttribute
  • One-time-per-kernel diagnostic log of runtime cluster dimensions
  • Automatic warning when used against non-cluster-launched kernels (no-op detection)
# Expose inter-CTA synchronization bugs
cutracer trace -i random_delay -a random_delay \
    --delay-mode cluster --delay-ns 10000 \
    --kernel-filters my_cluster_kernel -- python test.py
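Conceptually, the per-cluster CTA selection works like the sketch below. The seeding scheme and function name are assumptions for illustration only — CUTracer's actual on-device RNG differs — but it shows why storing cluster_seed in the delay config makes replay deterministic:

```python
import random

def delayed_cta_rank(cluster_seed: int, cluster_id: int, cluster_size: int) -> int:
    """Pick one CTA rank within a cluster to delay. Mixing the cluster id
    into the seed keeps the choice per-cluster yet reproducible on replay."""
    rng = random.Random((cluster_seed << 32) | cluster_id)
    return rng.randrange(cluster_size)
```

Replaying with the same cluster_seed re-selects the same CTA in every cluster, so a race found once can be reproduced exactly.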

🎯 Custom Delay Patterns

New --delay-patterns flag (#211) that lets users specify arbitrary SASS instruction substrings for delay injection, overriding the built-in DELAY_INJECTION_PATTERNS list. This enables targeted testing of specific instruction types without modifying source code.

  • Comma-separated patterns: --delay-patterns "UTMALDG,UTMASTG"
  • Wildcard mode — --delay-patterns "*" injects delay on a random 50% subset of all SASS instructions
  • Also adds SYNCS.EXCH (mbarrier init) to the built-in delay injection patterns
  • Plumbed through CUTRACER_DELAY_PATTERNS environment variable
# Delay only TMA instructions
cutracer trace -i random_delay --delay-ns 5000 \
    --delay-patterns "UTMALDG,UTMASTG" -- python test.py

# Stress-test: random delay on 50% of ALL instructions
cutracer trace -i random_delay --delay-ns 1000 \
    --delay-patterns "*" -- python test.py
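The pattern matching itself is plain substring matching against the SASS text. A minimal sketch — function names are illustrative, and this ignores the 50% random sampling that applies on top in wildcard mode:

```python
def parse_patterns(env_value: str) -> list:
    """Split a CUTRACER_DELAY_PATTERNS-style comma-separated pattern list."""
    return [p.strip() for p in env_value.split(",") if p.strip()]

def matches_delay_pattern(sass: str, patterns: list) -> bool:
    """A SASS line is a delay candidate if any pattern is a substring of it,
    or if the '*' wildcard is present."""
    return any(p == "*" or p in sass for p in patterns)
```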

📦 Instruction Table Embedding & Trace Format Optimization

Three-part series that embeds a per-instruction static table in kernel_metadata and eliminates redundant per-record SASS strings from JSON output:

  • Instruction table — Each kernel_metadata record now includes an instructions array indexed by opcode_id, containing SASS disassembly, binary encoding (via NVBit 1.8 getSassBinary() API), register indices, and uniform register indices
  • Schema update — kernel_metadata.schema.json updated with the instructions property definition
  • Per-record SASS removal — j["sass"] serialization removed from JSON output (text mode unchanged). The Python TraceReader caches the instruction table and injects sass into records on read via opcode_id lookup. Backward compatible: old traces with inline SASS still work
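The read-side lookup can be sketched as follows; record shapes here are illustrative, not the exact kernel_metadata schema:

```python
import json

def iter_records(ndjson_lines):
    """Stream trace records, re-attaching SASS text from the instruction
    table embedded in kernel_metadata via opcode_id lookup."""
    table = []
    for line in ndjson_lines:
        rec = json.loads(line)
        if rec.get("type") == "kernel_metadata":
            table = rec.get("instructions", [])
        elif "opcode_id" in rec and "sass" not in rec:
            rec["sass"] = table[rec["opcode_id"]]["sass"]  # inject on read
        yield rec
```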

🐛 Bug Fixes

  • NVBit 1.8 TMA operand extraction regression — NVBit 1.8 changed UTMALDG/UTMASTG/UTMAREDG from two separate MREF operands to a single TMA_PARAM_HANDLE operand. CUTracer's operand loop had no handler, silently dropping all UR information and breaking tma_trace, data-race, tma, mma, and dataflow analysis commands. Added a manual TMA_PARAM_HANDLE decoder plus unhandled-operand debug logging for future NVBit changes
  • Kernel hash always 0x0 — compute_kernel_checksum() was only called inside the instrumentation path; now computed at metadata creation time via the idempotent ensure_kernel_checksum() helper
  • Log truncation on SIGTERM (#206) — Added flush_log_files() before raise(SIGTERM) in deadlock detection, kernel timeout, and no-data timeout termination paths
  • Cluster warning noise (#210) — Gated [CLUSTER] warning by should_instrument so non-matching kernels don't clutter stderr; also fixed legacy MEMTRACE: prefix to CUTracer:
  • Kernel events writer lifetime — Fixed multi-context crash where first context teardown destroyed the shared writer; now cleaned up only when all contexts are gone
  • Python callstack lineno — Clamped negative PyFrame_GetLineNumber() return values unconditionally (previously only inside PyErr_Occurred() block)
  • Kernel events NDJSON enforcement — Force NDJSON format for kernel events writer regardless of CUTRACER_TRACE_FORMAT, with warning when text mode is active

🖥️ CLI Changes

New Flags

# Callstack capture mode (replaces boolean --cpu-callstack 0/1)
cutracer trace --cpu-callstack auto|pytorch|backtrace|auto_gil|0|1

# Kernel events recording
cutracer trace --kernel-events dedup|full|nostack

# Cluster delay mode
cutracer trace --delay-mode cluster  # (alongside existing random/fixed)

# Custom delay patterns
cutracer trace --delay-patterns "PATTERN1,PATTERN2"  # or "*" for all

Query Command Updates

# Query kernel events files (auto-detected by file naming)
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count

📁 Configuration Changes

Updated Environment Variables

| Variable | Change | Description |
|---|---|---|
| CUTRACER_CPU_CALLSTACK | Extended | Now accepts: auto, pytorch, backtrace, auto_gil, 0, 1 (was: 0/1 only) |
| CUTRACER_KERNEL_EVENTS | New | Kernel events recording mode: 0 (disabled, default), dedup, full, nostack |
| CUTRACER_DELAY_MODE | Extended | New cluster value (alongside existing random/fixed) |
| CUTRACER_DELAY_PATTERNS | New | Comma-separated SASS instruction substrings for delay injection; "*" for all instructions |

🔒 Security & CI

  • CodeQL fixes (#209) — Replaced fopen("w") with open(O_CREAT|O_WRONLY|O_TRUNC, 0644) + fdopen() for explicit file permissions; added permissions: contents: read to CI ...

CUTracer v0.2.0 Release

08 Apr 23:14


🎉 Major Release — 114 commits since v0.1.0

CUTracer v0.2.0 brings Blackwell GPU support, a unified CLI experience, advanced data race reduction, and significant improvements to trace infrastructure.


✨ Highlights

  • Blackwell (SM100) GPU Support — Tensor core instruction tracing for UTC*MMA, UTMALDG, UTMAREDG, and TMA descriptors
  • Unified CLI — New cutracer trace subcommand replaces manual CUDA_INJECTION64_PATH setup
  • Data Race Reducer — DDMin bisection algorithm to automatically find minimal race-triggering configurations
  • NVBit 1.8 — Updated from NVBit 1.7.7.1 to 1.8, with a critical fix for <<<>>> kernel launch deadlocks
  • CPU Call Stack Capture — Per-kernel-launch host-side stack traces for debugging
  • Kernel Timeouts & Safety Limits — Configurable execution timeout and trace file size limits

🏗️ Blackwell GPU Support

Full tracing support for NVIDIA Blackwell architecture:

  • UTC*MMA tensor core instructions — Trace Blackwell's new warp-group MMA operations (#161)
  • UTMAREDG tracing — Support for TMA reduction instructions (#162)
  • UTMALDG decoder — Decode TMA load descriptor parameters
  • TMA descriptor tracing — Capture and decode TMA descriptor fields for tile configuration analysis (#155)
  • TMA descriptor decoding in SASS — Extract descriptor parameters from cubin SASS output
  • Tensor memory delay injection — Extend random delay to TMA instructions for data race detection (#189)

🖥️ Unified CLI

The CLI has been completely revamped with a unified cutracer entry point:

cutracer trace — Run and Trace

# Trace a CUDA application (replaces manual CUDA_INJECTION64_PATH setup)
cutracer trace --instrument opcode_only -- python my_kernel.py

# With cubin dump and output directory
cutracer trace --instrument reg_trace --dump-cubin --output-dir ./traces -- python my_kernel.py

# Shell-style environment variable passthrough
cutracer trace CUTRACER_DELAY_NS=1000 -- python my_kernel.py

cutracer query — Query Trace Data

# Filter and query traces
cutracer query trace.ndjson --filter "warp=24"
cutracer query trace.ndjson --filter "cta=[0,0,0],opcode=LDG"  # Multi-condition AND filter
cutracer query trace.ndjson --output result.ndjson --compress

cutracer analyze — Analyze Traces

# Warp execution summary
cutracer analyze warp-summary trace.ndjson

cutracer reduce — Minimize Race Configs

# Find minimal delay configuration that triggers a race
cutracer reduce --config delay_config.json -- python my_kernel.py

cutracer sass — SASS Extraction

# Extract SASS from cubin files
cutracer sass --cubin kernel.cubin

🔍 Data Race Detection Enhancements

DDMin Bisection Reducer (#187)

Automatically reduce a delay configuration to the minimal set of delay points that still trigger a data race, using the delta debugging (ddmin) algorithm:

  • Exponentially faster than brute-force elimination
  • Produces minimal reproducible configurations
  • Integrated via cutracer reduce CLI command
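The core of ddmin is easy to sketch. Below is a simplified complement-only variant; the real reducer re-runs the workload under each candidate delay configuration, which `still_races` stands in for here:

```python
def ddmin(delay_points, still_races):
    """Delta-debugging (ddmin, complement-only variant): shrink a delay
    configuration while the race still reproduces. still_races(points)
    re-runs the workload with only those delay points enabled."""
    n = 2
    while len(delay_points) >= 2:
        chunk = max(1, len(delay_points) // n)
        subsets = [delay_points[i:i + chunk]
                   for i in range(0, len(delay_points), chunk)]
        for s in subsets:
            complement = [p for p in delay_points if p not in s]
            if complement and still_races(complement):
                # This chunk was unnecessary: drop it and coarsen.
                delay_points, n = complement, max(n - 1, 2)
                break
        else:
            # No chunk could be removed: refine the granularity.
            if n >= len(delay_points):
                break
            n = min(len(delay_points), n * 2)
    return delay_points
```

Because whole chunks are discarded at once, the number of re-runs grows far more slowly than one-at-a-time elimination.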

Per-Thread Random Delay Mode (#186)

  • New CUTRACER_DELAY_MODE=per_thread for thread-level delay granularity
  • Better coverage for detecting fine-grained data races

Delay Config Mutator (#145)

  • Programmatic API for manipulating delay configurations
  • Enables automated delay sweep workflows

⏱️ Reliability & Safety

  • Kernel execution timeout (CUTRACER_KERNEL_TIMEOUT_S) — Kill kernels that exceed a time limit (#169)
  • No-data timeout — Detect silent hangs when no trace data is produced
  • Trace file size limit (CUTRACER_TRACE_SIZE_LIMIT_MB) — Prevent runaway disk usage (#169)
  • Periodic flush — TraceWriter and log files flush periodically during kernel hangs, ensuring data is available for post-mortem analysis
  • Configurable channel buffer size (CUTRACER_CHANNEL_RECORDS) — Tune buffer for hang debugging scenarios
  • Fix <<<>>> deadlock — Preload flush_channel via fatbin + NVBit tool API to eliminate kernel launch deadlocks (#199)
  • Fix CUDA graph handling — Prevent graph build/capture phase from prematurely executing per-launch side effects
  • Fix trace overwrite — Trace file write mode changed from append to overwrite across runs

🔧 Instrumentation Improvements

  • Instruction category system — Conditional instrumentation based on instruction categories (#134)
  • IPOINT configuration — Configure instrumentation points via environment variables (#183)
  • Register indices in trace output — CPU-side static mapping of register operands (#143)
  • opcode_only trace writing — Lightweight opcode-only mode now writes structured trace output
  • Auto-enable cubin dump — Cubin dump auto-enabled when instrumentation is active (#191)
  • Kernel checksum — Robust delay config replay using kernel binary checksums (#133, #141)
  • CPU call stack capture — Host-side stack trace for each kernel launch (#172)
  • Re-execute compiled kernel — Ensure trace captures both compilation and execution runs

📁 Configuration Changes

Renamed Environment Variables

| Old | New |
|---|---|
| TRACE_FORMAT_NDJSON | CUTRACER_TRACE_FORMAT |
| CUTRACER_TRACE_OUTPUT_DIR | CUTRACER_OUTPUT_DIR |

CUTRACER_TRACE_FORMAT now also accepts string names (e.g., ndjson_zst, ndjson, log) in addition to numeric values (#193).

New Environment Variables

| Variable | Description | Default |
|---|---|---|
| CUTRACER_KERNEL_TIMEOUT_S | Kernel execution timeout in seconds | (disabled) |
| CUTRACER_TRACE_SIZE_LIMIT_MB | Max trace file size in MB | (unlimited) |
| CUTRACER_CHANNEL_RECORDS | Channel buffer record count | (default) |
| CUTRACER_CPU_CALLSTACK | Enable CPU call stack capture | 0 |
| CUTRACER_DELAY_MODE | Delay mode (uniform/per_thread) | uniform |
| CUTRACER_OUTPUT_DIR | Unified output directory for all artifacts | . |
| CUTRACER_IPOINT | Instrumentation point configuration | (default) |

🔄 Dependency Updates

  • NVBit: 1.7.7.1 → 1.7.7.3 → 1.8 (#164, #198)
  • nlohmann/json: Updated default to 3.12.0
  • Python: CI updated to Python 3.13
  • GitHub Actions: Updated to latest versions
  • JSON parsing: Migrated to orjson for faster JSON I/O via tritonparse _json_compat
  • Daily NVBit update check: Automated GitHub Action to detect upstream NVBit releases (#163)

🐍 Python Package Improvements

  • CLP archive support — Dump and read CLP compressed log archives (#118, #148)
  • Unified logger module — Consistent logging across all Python modules
  • Schema validation — Migrated trace validation into the cutracer Python module (#154)
  • Query enhancements — Hex filters, --all-lines flag, NDJSON output, --output, --compress (#136)
  • Multi-condition AND filter — Filter by multiple fields simultaneously (#139)
  • JSON list value filters — Support cta=[0,0,0] style filter expressions
  • KernelConfig abstraction — Clean API for trace metadata
  • TraceWriter metadata — write_metadata() and kernel_metadata event support (#153)
  • Truncated trace detection — Detect and handle truncated trace files gracefully
  • GB200 aarch64 support — Installation scripts updated for GB200 platforms (#159, #173)

📋 Requirements

  • CUDA Toolkit: Aligned with NVBit 1.8 requirements
  • libzstd: Required for trace compression
  • Python 3.10+: For Python package
  • NVBit 1.8: Bundled (auto-downloaded during build)

⚠️ Breaking Changes

  • TRACE_FORMAT_NDJSON renamed to CUTRACER_TRACE_FORMAT (#192)
  • CUTRACER_TRACE_OUTPUT_DIR renamed to CUTRACER_OUTPUT_DIR (#167)
  • CLI entry point unified to cutracer (replaces cutraceross)
  • --all flag renamed to --all-lines (#157)
  • analysis module renamed to query (#135)
  • pc field in trace output changed to hex string format (#137)

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework and the v1.8 release.


📄 License

  • MIT License — Meta Platforms, Inc. contributions
  • BSD-3-Clause License — NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.


📚 Documentation

Full documentation is available in the Wiki.



CUTracer v0.1.0 Release 🎉

06 Feb 03:01


🎉 Initial Public Release

CUTracer is an NVBit-based CUDA binary instrumentation tool for GPU kernel analysis and debugging. It enables runtime-level insights without requiring application recompilation.


✨ Highlights

  • Zero-modification runtime injection - Attach to any CUDA application via CUDA_INJECTION64_PATH
  • GPU Hang Detection - Automatic deadlock identification with process termination
  • Data Race Detection - Random delay injection with deterministic replay support
  • Triton/Proton Integration - Per-warp instruction histograms with IPC calculation
  • Efficient Trace Compression - NDJSON + Zstd (~92% space savings)
  • Python Analysis Toolkit - Available on PyPI: pip install cutracer

🔧 Instrumentation Modes

| Mode | Environment Variable | Description |
|---|---|---|
| opcode_only | CUTRACER_INSTRUMENT=opcode_only | Lightweight instruction counting |
| reg_trace | CUTRACER_INSTRUMENT=reg_trace | Register value tracing (R/UREG support) |
| mem_addr_trace | CUTRACER_INSTRUMENT=mem_addr_trace | Memory address tracing |
| mem_value_trace | CUTRACER_INSTRUMENT=mem_value_trace | Memory address + value tracing (Global/Shared/Local) |
| random_delay | CUTRACER_INSTRUMENT=random_delay | Delay injection for race detection |

Multiple modes can be enabled simultaneously with comma-separated values.


📊 Built-in Analyses

Instruction Histogram (proton_instr_histogram)

  • Clock-delimited per-warp instruction counting
  • CSV output: warp_id,region_id,instruction,count
  • Integration with Triton Proton for IPC calculation
  • Requires kernels to emit clock instructions (e.g., Triton pl.scope())
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py
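The CSV aggregation step can be illustrated with a small sketch. Record fields here are assumptions for illustration; the real histogram is accumulated on-device per clock-delimited region:

```python
import csv
import io
from collections import Counter

def histogram_csv(records) -> str:
    """Aggregate per-warp, per-region instruction counts into the
    warp_id,region_id,instruction,count CSV layout."""
    counts = Counter((r["warp_id"], r["region_id"], r["instruction"])
                     for r in records)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["warp_id", "region_id", "instruction", "count"])
    for key, n in sorted(counts.items()):
        writer.writerow([*key, n])
    return buf.getvalue()
```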

Deadlock/Hang Detection (deadlock_detection)

  • Detects warps stuck in stable PC loops
  • Automatic SIGTERM → SIGKILL termination sequence
  • Detailed warp state logging for post-mortem analysis
  • Auto-enables reg_trace instrumentation
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py

Data Race Detection (random_delay)

  • Injects delays at synchronization points to expose timing-dependent races
  • Target SASS patterns:
    • SYNCS.PHASECHK.TRANS64.TRYWAIT (mbarrier try_wait)
    • SYNCS.ARRIVE.TRANS64.RED.A1T0 (mbarrier arrive)
    • UTMALDG.2D (TMA load)
    • WARPGROUP.DEPBAR.LE (MMA wait)

Deterministic Replay Support:

  • CUTRACER_DELAY_DUMP_PATH: Export delay config JSON for recording
  • CUTRACER_DELAY_LOAD_PATH: Load delay config JSON for exact replay
  • Workflow: Discover race with random delays → Reproduce exactly with saved config
# Record mode
CUTRACER_DELAY_NS=1000 \
CUTRACER_DELAY_DUMP_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py

# Replay mode (deterministic reproduction)
CUTRACER_DELAY_LOAD_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py

📁 Output Formats

| Mode | Extension | Description |
|---|---|---|
| 0 | .log | Human-readable text format |
| 1 (default) | .ndjson.zst | NDJSON + Zstd compressed |
| 2 | .ndjson | NDJSON uncompressed |

Set via TRACE_FORMAT_NDJSON environment variable.

Compression level configurable via CUTRACER_ZSTD_LEVEL (1-22, default: 22).


🐍 Python Package

Available on PyPI:

pip install cutracer

Features

Validation:

  • JSON syntax and schema validation
  • Text format validation
  • Cross-format consistency checking
  • Transparent Zstd compression handling

Analysis:

  • TraceReader: Stream trace records from NDJSON files
  • StreamingGrouper: Memory-efficient grouped analysis
  • WarpSummary: Warp execution status for hang analysis (completed/in-progress/missing)
  • Multi-format output: table, JSON, CSV

CLI Tools:

# Validate trace files
cutraceross validate trace.ndjson
cutraceross validate trace.ndjson.zst --verbose

# Analyze trace data
cutraceross analyze trace.ndjson --head 20
cutraceross analyze trace.ndjson --filter "warp=24"
cutraceross analyze trace.ndjson --group-by warp --count
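Under the hood, --group-by with --count is a streaming aggregation over NDJSON lines. A minimal sketch of the idea (not the actual StreamingGrouper implementation):

```python
import json
from collections import Counter

def group_count(ndjson_lines, field: str) -> Counter:
    """Count trace records per value of `field`, one line at a time,
    so large traces never need to be loaded into memory at once."""
    counts = Counter()
    for line in ndjson_lines:
        rec = json.loads(line)
        if field in rec:
            counts[rec[field]] += 1
    return counts
```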

⚙️ Configuration Reference

| Variable | Description | Default |
|---|---|---|
| CUTRACER_INSTRUMENT | Instrumentation modes (comma-separated) | (none) |
| CUTRACER_ANALYSIS | Analysis types (comma-separated) | (none) |
| KERNEL_FILTERS | Kernel name filters (substring match) | (none) |
| INSTR_BEGIN / INSTR_END | Instruction index range filter | 0 / UINT32_MAX |
| TRACE_FORMAT_NDJSON | Output format (0/1/2) | 1 |
| CUTRACER_ZSTD_LEVEL | Zstd compression level | 22 |
| CUTRACER_DELAY_NS | Delay value in nanoseconds | 0 (disabled) |
| CUTRACER_DELAY_DUMP_PATH | Export delay config JSON | (none) |
| CUTRACER_DELAY_LOAD_PATH | Load delay config for replay | (none) |
| TOOL_VERBOSE | Verbosity level (0/1/2) | 0 |
| CUTRACER_DUMP_CUBIN | Dump cubin files | 0 |

📋 Requirements

  • CUDA Toolkit: Aligned with NVBit requirements
  • libzstd: Required for trace compression
  • Python 3.10+: For Python package

⚠️ API Stability Notice

This is the initial public release (v0.1.0). APIs and configuration options may change in future versions as we gather feedback and iterate on the design.

Known Limitations

  • Instruction histogram requires clock instruction boundaries (e.g., Triton pl.scope())
  • Nested regions not supported for instruction histogram analysis

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework.


📄 License

  • MIT License - Meta Platforms, Inc. contributions
  • BSD-3-Clause License - NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.


📚 Documentation

Full documentation is available in the Wiki.

