Debugging Warp-Specialized Kernels#
GEMM Steps 7-9 in Scaling GEMM with Warp Specialization and Clusters overlap TMA load, tcgen05 MMA, and TMEM/SMEM writeback. The same debugging method applies to Flash Attention handoffs: identify the roles, identify the storage each role owns, then verify the generated CUDA against that model.
Do not start by rewriting the kernel. First make sure the run is valid, then inspect the generated CUDA. After environment and compile-time issues are ruled out, runtime failures in these kernels usually reduce to a broken handoff: an uninitialized barrier, the wrong arrival count, a collective hidden inside a role guard, a stale barrier phase, or storage reused before the producer has made its writes visible.
Before Debugging the Kernel#
Rule out the runtime context first:
python -c "import tvm, tvm.tirx; print(tvm.__file__, tvm.__version__)"
python -c "import torch; print(torch.cuda.get_device_name(), torch.cuda.get_device_capability())"
These kernels target Blackwell (sm_100a). If Python imports a stale TVM checkout, or the GPU is not Blackwell-class, fix that before changing the kernel. Then run the kernel’s smallest correctness check, such as run_correctness(), before looking at performance.
Debugging Workflow#
Reproduce the failure at the smallest shape that still fails. If the failure is an illegal memory access, restart Python before the next run.
If compilation fails, check the installed API, target,
dispatch=, and buffer scopes before reading the runtime synchronization code.Save
inspect_source("cuda")output. Search it for role guards,mbarrier_init,tcgen05,cp.async.bulk.tensor, andcta_sync()before reading the Python again.Write the roles / storage / handoff / lifetime table for the kernel path that failed.
Check the generated CUDA against that table: barrier inits before role branches, expected TMA producer, MMA issuer(s), writeback group(s), and no CTA-wide collective inside a warpgroup-only branch.
Classify the run as a deadlock, crash, wrong result, or correct-but-slow run, then use the matching section below.
Change one handoff at a time: init count, arrive/wait phase, role guard, fence, TMA store drain, TMEM alloc/dealloc, or tile-scheduler advance.
Re-run correctness before measuring performance.
What Transfers#
For any asynchronous kernel, make a small worksheet before changing code:
Item |
What to write down |
|---|---|
Roles |
The exact threads, warps, warpgroups, or CTAs that issue each async operation. |
Storage |
The live location of each tile at each step: GMEM, SMEM, TMEM, or registers. |
Handoff |
The producer, the consumer, the signal object, the arrival count, the phase, and the fence or drain that makes the data visible. |
Lifetime |
The earliest point where each storage slot can be reused, read back, or freed. |
Then verify the generated CUDA against the worksheet:
Role guards match the roles table.
Barrier inits appear before guarded role branches.
Collective operations are not accidentally narrowed by lane, warp, or warpgroup guards.
Arrive/wait phases match the handoff table.
TMA store drains, TMEM dealloc, and SMEM reuse happen only after the lifetime table says they are legal.
Use the same worksheet for TMA->MMA->writeback GEMM pipelines and for the score/softmax/value/correction handoffs in Flash Attention.
If Compilation Fails#
Fix compile-time failures before debugging runtime synchronization:
Symptom |
Likely area |
First check |
|---|---|---|
Unknown TIRx API or attribute error |
Installed wheel does not match the tutorial code |
Print |
Unsupported |
The selected target or primitive does not support that path |
Check the |
Buffer scope mismatch |
A buffer is being used through the wrong hardware path |
Check the storage row of the worksheet: TMEM must be accessed through |
Compile succeeds but generated CUDA lacks the expected path |
Dispatch did not lower the way you expected |
Inspect the generated CUDA for |
Inspecting Generated Code#
For any compiled kernel, save the CUDA so you can search and diff it:
from pathlib import Path
cuda_source = ex.mod.imports[0].inspect_source("cuda")
Path("artifacts").mkdir(exist_ok=True)
Path("artifacts/my_kernel.cu").write_text(cuda_source, encoding="utf-8")
print(cuda_source)
The generated code maps TIRx constructs to CUDA like this:
TIRx |
Generated CUDA |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Scan for these strings before reading the full kernel:
Generated CUDA |
Check |
|---|---|
|
Single CTA-thread guard, often barrier initialization |
|
Barrier initialization exists and appears before role branches |
|
The Tensor Core path was generated |
|
The copy lowered to TMA |
|
CTA-wide barrier; it must not sit inside a |
Step 7 Reference Skeleton#
A correctly compiled Step 7 kernel has this top-level shape. The guards below are written with role names for readability; in generated CUDA, search for the corresponding expressions from the table above.
// (1) Barrier inits: top level, CTA thread 0 only
if (threadIdx.x < 1) {
mbarrier_init(tma2mma[0..1], 1);
mbarrier_init(mma2tma[0..1], 1);
mbarrier_init(mma2ld, 1);
mbarrier_init(ld2mma, 128); // arrived by all 128 WG0 threads
}
// (2) TMEM alloc: WG0 warp 0, all lanes of the issuing warp
if (wg_id == 0 && warp_id == 0) tcgen05_alloc(..., 512);
// (3) Fences + cta_sync, then phase init: producer=1, consumer=0
// (4) Warp-specialized loop
if (wg_id == 1 && warp_id == 3 && elect_sync) { /* TMA */ while(valid){ ... next_tile(); } }
if (wg_id == 1 && warp_id == 0 && elect_sync) { /* MMA */ while(valid){ ... next_tile(); } }
if (wg_id == 0) { /* WB */ while(valid){ ... next_tile(); } }
// (5) Cleanup: issuing warp, no lane guard
cta_sync();
if (warp_id == 0) { tcgen05_relinquish_alloc_permit(); tcgen05_dealloc(..., 512); }
Check these before changing the algorithm:
Barrier inits sit at top level, not inside a
wg_idguard.tcgen05_allocandtcgen05_deallochave a warp guard but no lane guard; all lanes of the issuing warp participate.TMA and MMA loops both iterate
K_TILEStimes.Phase init is producer=
1, consumer=0.
Symptom Map#
Start from the symptom, but treat it as a clue rather than a final diagnosis:
Clue |
Likely area |
First check |
|---|---|---|
Kernel hangs, then the runtime reports an unspecified launch failure |
Deadlock |
Barrier init placement, arrival count, |
Illegal memory access, XID, or later unrelated CUDA calls also fail |
Crash / poisoned context |
Restart Python, then check pointer ranges, storage lifetime, and collective participation |
Wrong rows appear in 128-row or tile-sized stripes |
Sync race or tile-index mismatch |
Producer/consumer phases, scheduler advance, and which warpgroup owns each row stripe |
|
Descriptor, operand setup, or uninitialized accumulation |
SMEM/TMEM descriptor setup, swizzle/layout, and accumulator initialization |
Finite but patterned wrong values |
Stale or partially visible data |
Missing fence, missing TMA store drain, or storage reused before the lifetime table allows it |
Correct output but no expected speedup |
Dispatch or resource issue |
Generated CUDA path, pipeline depth, occupancy, and register spill |
When to Restart Python#
A CUDA error does not always clean up after itself. After an illegal memory access, XID, or “CUDA context poisoned” error, later unrelated calls such as torch.randn may keep failing. Restart the Python process before testing the next fix, otherwise you may be debugging the previous crash instead of the current code.
Deadlocks#
Check these in order:
Arrival count does not match init count. Common case:
MBarrier.init(128)butarriveis guarded byif warp_id == 0: if lane_id == 0:, so only 1 thread arrives and the wait never returns.Barrier
init(count)
Who arrives
Arrivals
TMABar(tma->mma)1
TMA engine via
arrive(stage, bytes)1
TCGen05Bar(mma->tma, mma->ld)1
MMA warp via
tcgen05.commit1
MBarrier(ld->mma)128
All WG0 threads via
arrive128
Barrier init nested inside a
wg_idguard..init()lowers toif threadIdx.x < 1:, meaning CTA thread 0. CTA thread 0 lives in WG0, soif wg_id == 1:prevents every thread from running the init. Inits must be at top level;grep mbarrier_initininspect_source()to verify.cta_sync()inside a warpgroup branch.cta_syncis__syncthreads(), which requires all CTA threads. Insideif wg_id == 0:, WG1 never reaches it. UseT.cuda.warpgroup_sync(10)for a single-warpgroup barrier.tile_scheduler.next_tile()skipped by some consumer-warpgroup threads. The scheduler tracks per-thread state; threads that skip it can loop forever.TMA and MMA disagree on K-tile count. If MMA does
K_TILES - 1instead ofK_TILES, barrier phases drift and the second outer tile deadlocks.PipelineStateinitial phase wrong. Producer starts atphase=1so the first wait passes; consumer starts atphase=0so the first wait blocks. If both start from the same phase, the first handoff can deadlock immediately.
Crashes and Context Poisoning#
Common causes:
pool.allocafterpool.commit(). Barrier wrappers callallocinternally. Correct order:tmem_addr -> barrier wrappers -> move_base_to(1024) -> Asmem / Bsmem / Dsmem -> commit().tcgen05.allocortcgen05.deallocwith a lane guard. The issuing warp must participate with all lanes.if lane_id == 0:runs one thread, which is undefined behavior.Missing
cta_sync()beforetcgen05.dealloc. TMEM is freed while writeback is still reading.Out-of-range GMEM or SMEM access. Shrink to one tile, check the scheduler’s
m_idx/n_idx, and check that the current shape is a multiple of the kernel’s tile or cluster tile.
Wrong Results#
Classify wrong output by pattern before guessing. Whole row stripes often point to a producer/consumer phase, tile-index, or role-ownership mismatch. NaN output often points to descriptor setup, operand setup, or uninitialized accumulation. Finite but patterned wrong values often mean the consumer read an old tile, a partially written tile, or data whose store had not drained yet.
tcgen05.commitoutsideelect_sync. All 32 threads create commit groups; the 31 empty groups signal the mbarrier immediately. TMA can overwrite SMEM before MMA reads it.Missing
fence.proxy_async("shared::cta")before TMA store. The TMA engine may not see SMEM writes from threads.Missing
cp_async.bulk.commit_group()pluswait_group(0)after TMA store. The next tile can reuse Dsmem before the store drains.Persistent kernel fails intermittently at small sizes such as 1024x1024. Larger sizes can mask the race with longer K-loops. Re-check phase reset between tiles and the TMA-store commit/wait.
fence.after_thread_sync()is usually not the fix. The MMA-completion mbarrier already carries release-acquire semantics. Steps 8 and 9 add it conservatively on the writeback edge, aftermma2ld.waitand before the firsttcgen05.ld; do not add it routinely on the TMA-to-MMA edge.
Correct but Slow#
If the output is correct but performance is far below expectation, use the same inspection loop:
Clue |
Likely area |
First check |
|---|---|---|
Generated CUDA has no |
Copy did not lower to TMA |
Check |
Generated CUDA has no |
MMA did not lower to Blackwell Tensor Core instructions |
Check |
TMA and MMA do not overlap |
Pipeline too shallow or phases serialize producer/consumer |
Inspect the order of wait/arrive/advance in the generated CUDA |
Good small-shape correctness but poor large-shape speed |
Register spill, occupancy, or staging-buffer pressure |
Check the compiler resource report; reduce tile size, chunk writeback, or lower pipeline depth |
Filing a Good Issue#
If the failure survives the checks above, reduce it before filing an issue on the Apache TVM GitHub repository. Include:
the
tvm.__file__/tvm.__version__output and GPU capability;the smallest shape that reproduces the failure;
whether the failure is compile-time, deadlock, crash, wrong result, or correct-but-slow;
the minimal kernel or notebook cell plus its correctness check;
the saved
inspect_source("cuda")output, or the smallest excerpt that shows the suspicious guard, barrier, or dispatch path.