A 2-CTA Cluster: Cooperative MMA via Cross-CTA SMEM Read
Two CTAs share stored-B row slices across the cluster through distributed shared memory (DSMEM).
CLUSTER: 2 CTAs across SMs, linked by DSMEM
CTA 0 · SM-0
  Asmem (own): A rows 0–127
  Bsmem: B stored rows 0–127
  Output: D[0:128, 0:256]
cross-CTA read ↔
CTA 1 · SM-1
  Asmem (own): A rows 128–255
  Bsmem: B stored rows 128–255
  Output: D[128:256, 0:256]
Cluster output: 256 × 256. Each CTA contributes a 128 × 256 tile, twice the 128 × 128 it could compute from its own stored-B slice alone.
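The tiling in the diagram can be checked with a small CPU model. The following is only a sketch in plain Python, not GPU code: it assumes that "B stored rows" index the output-column dimension (so D = A · B_storedᵀ), and it picks a reduction depth K = 4 because the diagram does not give one. The names `cta_tile`, `run_cluster`, `a_smem`, and `b_smem` are hypothetical, and the cross-CTA DSMEM read is modeled as nothing more than indexing into the peer's list.

```python
# Toy CPU model of the 2-CTA cluster in the diagram (not GPU code).
# Assumption: stored-B rows map to output columns, so D = A @ B_stored^T.

M_TILE = 128   # A rows owned by each CTA
N_SLICE = 128  # stored-B rows held in each CTA's shared memory
K = 4          # reduction depth; hypothetical, the diagram does not give it
NUM_CTAS = 2


def make_matrix(rows, cols, seed):
    """Deterministic small-integer matrix so results compare exactly."""
    return [[(seed + 3 * r + 5 * c) % 7 - 3 for c in range(cols)]
            for r in range(rows)]


def cta_tile(rank, a_smem, b_smem):
    """Compute the D tile produced by one CTA.

    a_smem[rank] is the CTA's own A slice. The B operand is assembled from
    every CTA's stored-B slice; reading b_smem[peer] for peer != rank stands
    in for the cross-CTA (DSMEM) read.
    """
    b_rows = []
    for peer in range(NUM_CTAS):
        b_rows.extend(b_smem[peer])
    return [[sum(a_row[k] * b_row[k] for k in range(K)) for b_row in b_rows]
            for a_row in a_smem[rank]]


def run_cluster():
    a = make_matrix(NUM_CTAS * M_TILE, K, seed=1)   # full A: 256 x K
    b = make_matrix(NUM_CTAS * N_SLICE, K, seed=2)  # full stored B: 256 x K
    # Each CTA's shared memory holds only its own 128-row slices.
    a_smem = [a[r * M_TILE:(r + 1) * M_TILE] for r in range(NUM_CTAS)]
    b_smem = [b[r * N_SLICE:(r + 1) * N_SLICE] for r in range(NUM_CTAS)]
    # CTA 0 yields D[0:128, 0:256]; CTA 1 yields D[128:256, 0:256].
    d = []
    for rank in range(NUM_CTAS):
        d.extend(cta_tile(rank, a_smem, b_smem))
    return a, b, d


a, b, d = run_cluster()
print(len(d), "x", len(d[0]))  # prints "256 x 256"
```

Dropping the peer slice from `b_rows` leaves each CTA with a 128 × 128 tile, which is the single-CTA case the diagram compares against.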