Troubleshooting#
Debugging asynchronous GPU kernels requires isolating the failure before reading the code. A crash usually means a bad pointer or an uninitialized barrier; a hang means a mismatched synchronization point; and silent corruption means a missing fence or a phase-tracking bug.
This appendix gathers the common failure modes for the kernels built in this book. It builds on the performance model in What Makes a Kernel Fast and the generated-CUDA skeleton in Scaling GEMM with Warp Specialization and Clusters.
Isolating the Environment#
Before assuming the kernel logic is flawed, verify the runtime context. A kernel compiled for Blackwell will immediately fail with an unspecified launch failure on Hopper, and a stale import path can cause you to debug the wrong code entirely.
Confirm the imported package: Run
python -c "import tvm, tvm.tirx; print(tvm.__file__, tvm.__version__)". This rules out importing a stale checkout.Confirm the GPU capability: Run
python -c "import torch; print(torch.cuda.get_device_name(), torch.cuda.get_device_capability())". The kernels in this book target Blackwell (sm_100a); attempting to launch them on older architectures explains most immediate failures.Run the smallest correctness check: Always run the kernel’s own correctness check (e.g.,
run_correctness()) against a small shape before trusting any performance numbers.
Symptom, cause, fix#
The tables below group the common failures. Each row is symptom, likely cause, and first check / fix.
Install and import#
Symptom |
Likely cause |
First check / fix |
|---|---|---|
|
wrong or missing wheel |
|
Imports the wrong code |
a second checkout on |
print |
Launch fails immediately on any kernel |
not a Blackwell GPU |
check command 2; the kernels require |
Compile time#
Symptom |
Likely cause |
First check / fix |
|---|---|---|
“unknown TIRx API” / attribute error |
API name drift vs. the installed wheel |
look the name up in TIRx Language Reference and the installed |
unsupported |
a copy/MMA dispatch the target does not support |
confirm the |
buffer scope mismatch |
a buffer used outside its |
review the scope rules in Special Memory: TMEM |
Crash at run time#
Symptom |
Likely cause |
First check / fix |
|---|---|---|
|
out-of-range index or wrong layout/offset |
shrink to one tile; the CUDA context is now poisoned, so see When to restart Python |
|
a barrier left uninitialized, or a resource over-allocated |
check that every |
Hang / deadlock |
a collective barrier that not all expected threads reach |
check for |
Wrong result (compiles and runs, bad numbers)#
These are the dangerous ones: no error, just a wrong answer. Each links to where the book works through it.
Symptom |
Likely cause |
First check / fix |
|---|---|---|
Wrong after the first K chunk |
a reused mbarrier with the phase not flipped ( |
Building a Tiled GEMM walks through the phase-tracking table |
Hang once warpgroups specialize |
|
use |
Garbage / uninitialized accumulator |
|
|
Corruption near the epilogue / TMEM |
missing |
add the |
TMA store writes stale data |
missing |
fence before the store; see the producer-to-engine handoff in Async Coordination: mbarriers |
Slow (correct but underperforming)#
Symptom |
Likely cause |
First check / fix |
|---|---|---|
Generated CUDA has no TMA |
a copy that did not lower to |
grep the generated CUDA for the TMA intrinsic; check the |
Tensor pipe underutilized |
too-shallow pipeline / poor overlap |
revisit the overlap argument in What Makes a Kernel Fast |
Register spill |
tile or unroll too large for the register budget |
check the |
Inspecting the generated CUDA#
The single most useful debugging move is to read what the compiler actually emitted. Every built module exposes it:
# Default (no argument) prints the lowered source:
print(ex.mod.imports[0].inspect_source())
# Ask for CUDA specifically, and save it so you can search and diff it:
cuda_source = ex.mod.imports[0].inspect_source("cuda")
with open("artifacts/my_kernel.cu", "w") as f:
f.write(cuda_source)
A few patterns are worth recognizing when you scan that output:
In the generated CUDA |
Means |
|---|---|
|
a single-thread guard, e.g. an elected issuer or an |
|
an mbarrier initialization; should be at top level, before any branch |
|
a CTA-wide barrier and shared-memory ordering point |
a |
the Blackwell Tensor Core path was actually generated |
a |
the copy lowered to TMA rather than a thread copy |
Use the correct-kernel skeleton in Scaling GEMM with Warp Specialization and Clusters as the reference
shape, and watch for things that should not appear: every lane issuing an MMA,
a cta_sync() inside a wg_id branch, a missing TMA commit/wait, or a missing
tcgen05 wait before the accumulator is read.
When to restart Python#
A CUDA error does not always clean up after itself. After an illegal memory access or any “CUDA context poisoned” error, the context is in an undefined state
and later unrelated calls (even torch.randn) may keep failing. When that
happens, restart the Python process before drawing any conclusions.
Filing a good issue#
This guide covers the most common patterns, but you may encounter issues not listed here. If you are still stuck, please file an issue on the Apache TVM GitHub repository. To help others reproduce and debug the problem, please include the environment details (the TVM path and GPU capability commands above) and a minimal reproducible example at a small shape.