11 comments

  • ashvardanian 1 day ago
    Great article — and several other high-quality deep dives linked at the end! Here's another one on the H100 that I found particularly useful: <https://cudaforfun.substack.com/p/outperforming-cublas-on-h1...>

    I agree with the author that programming GEMM on newer GPUs is a very different experience, though I'm wondering if "newer GPUs are [actually strictly] better"? It seems like there should still be some highly cost-effective use cases for T4 GPUs — aren't there?

    • saagarjha 1 day ago
      No, new GPUs are much cheaper in price/compute.
  • hansvm 1 day ago
    Some related fun facts:

    1. That roofline curve idea applies to multiple processes, computers, and data centers just as well. If you have enough "cache" (disk, RAM, whatever), you can do a distributed matmul and actually effectively use every coprocessor at nearly 100% efficiency.

    2. If you need f32 intermediate precision, you can approximate that with Kahan-like ideas and still take advantage of the f16 core, at somewhere in the 25%-50% efficiency range (still much better than the <10% you get by ignoring the tensor core).
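
    A rough sketch of that split trick (the same decomposition that gives 3xTF32 its name), in plain scalar CUDA C++. Illustrative only: names are mine, range/scaling issues are ignored, and on real hardware the three partial products would be issued through the f16 tensor cores with f32 accumulation rather than as plain f32 multiplies.

        #include <cstdio>
        #include <cuda_fp16.h>

        // Approximate an f32 product using only f16-representable factors:
        // split each operand into an f16 high part plus a remainder, sum
        // three of the four partial products in f32, and drop lo*lo.
        __host__ __device__ float mul_via_f16_parts(float a, float b) {
            float a_hi = __half2float(__float2half(a));  // part f16 can represent
            float b_hi = __half2float(__float2half(b));
            float a_lo = a - a_hi;                       // what f16 lost
            float b_lo = b - b_hi;
            // On the GPU these three products would be three f16 MMAs with
            // f32 accumulation; plain f32 multiplies stand in for them here.
            return a_hi * b_hi + a_hi * b_lo + a_lo * b_hi;
        }

        int main() {
            float a = 3.14159265f, b = 2.71828182f;
            float hi_only = __half2float(__float2half(a)) * __half2float(__float2half(b));
            printf("f16 inputs only : %.9f\n", hi_only);
            printf("three products  : %.9f\n", mul_via_f16_parts(a, b));
            printf("full f32        : %.9f\n", a * b);
            return 0;
        }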

    • saagarjha 1 day ago
      Yep, the "3" in 3xTF32 kind of gives away the performance cost ;)
  • cavisne 1 day ago
    "An H100 GPU has 989 TFLOPs of half-precision matrix multiply compute, and ~60 TFLOPs of “everything else”"

    I always thought there was a lot of crossover between gaming GPUs and DC GPUs (and the volume is why NVIDIA is so far ahead). Are tensor cores somehow related to the pre-tensor-core SMs (like an abstraction on top of SMs)?

    • pests 1 day ago
      The "tensor cores" are just like the previous "cuda cores" in that they exist inside each SM along with all the other required machinery like register files, schedulers, etc. Volta was the first microarchitecture to have the SM include purpose built tensor cores. Before that, they only contained more general CUDA cores.
  • atum47 1 day ago
    In my parallel programming class we used several techniques to speed up matrix multiplication and compared them. I vaguely remember using OpenMP and CUDA. I need to look through my backups to see if I still have that code. Especially the CUDA one; I wonder how similar it is to using tensor cores.
  • light_hue_1 2 days ago
    Closely related to this, if you're interested in the topic, is DeepMind's guide on how to scale your model.

    https://jax-ml.github.io/scaling-book/roofline/

  • gradascent 1 day ago
    Great deep dive. I've learned a lot already and haven't even finished the introduction
  • astee 1 day ago
    If I ever need a fast matmul, you're hired.
  • imtringued 1 day ago
    I'm personally starting to get the impression that the CUDA programming model is actually terrible for things like matrix multiplication.

    The author is constantly trying to work around the mismatch between the software model and the hardware model.

    When doing matrix multiplication you don't want to program at the CUDA core level, you want to program at the streaming multiprocessor level.

    There is no such thing as uncoalesced reads at the streaming multiprocessor level. It just doesn't exist.

    mma.sync.aligned expects you to program at the streaming multiprocessor level. This is because the threads are sharing tensor cores, meaning that the concept of individual threads is really poorly thought out in this context.

    I also have to comment on the Blogpost itself.

    >Kernel 5 - Tune Tile Dimensions

    This section is rather unhelpfully named "tune tile dimensions"; you should say explicitly what you did: you allocated more memory to accumulators. Matrix multiplication is accumulator bound.

    This also plays into the misconception early in the post.

    >To illustrate the best case scenario, imagine that fast memory was large enough to fit A,B and C in their entirety.

    This is not correct, because you can split matrix multiplication along its k (depth) dimension. You only need to keep the accumulator C in fast memory. If you can keep all of C in fast memory, then you only need to keep one vector of dimension n for A and one vector of dimension n for B in fast memory at a time. Obviously in the case of tiles you want a vector of tiles. In the extreme limit the memory for A and B is irrelevant, because each element of those vectors is reused n times. As your matrix gets bigger, the sharing gets more extreme! O(n) memory for the inputs A and B but O(n^2) for C means that arithmetic intensity is inevitable!
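
    A tiny illustration of that k-split (plain host-side CUDA C++, nothing GPU-specific, names mine): keep all of C resident, stream one column of A and one row of B per step, and apply a rank-1 update. Every streamed element is reused n times against the resident accumulator.

        #include <vector>

        // C (n x n) stays in "fast memory" the whole time; each k-step only
        // touches one column of A and one row of B, each of length n.
        void matmul_outer_product(const std::vector<float>& A,   // n x n, row-major
                                  const std::vector<float>& B,   // n x n, row-major
                                  std::vector<float>& C, int n)  // n x n accumulator
        {
            for (int k = 0; k < n; ++k) {          // split along the k dimension
                for (int i = 0; i < n; ++i) {      // rank-1 update C += a_col * b_row
                    float a_ik = A[i * n + k];     // element of column k of A
                    for (int j = 0; j < n; ++j)
                        C[i * n + j] += a_ik * B[k * n + j];  // row k of B
                }
            }
        }

        int main() {
            int n = 256;
            std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);
            matmul_outer_product(A, B, C, n);      // every entry of C ends up equal to n
            return 0;
        }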

    The accumulator bottleneck is the essence of matrix multiplication!

    • alexarmbr 6 hours ago
      author here: thanks so much for the feedback. I agree, 'more memory for accumulators' would be a better title for this section.

      and I also see the misconception you are pointing out in the 'best case' section. re-reading this, I realize that if you are accumulating C using outer products between columns of A and rows of B, you can achieve O(N) intensity while storing all of C, and just a column of A and a row of B in fast memory. Whereas if you are using inner products, you need all of A,B,C in fast memory to achieve O(N) intensity.

      I guess when I wrote this I was just thinking about an inner product, which is too narrow. Thanks I might tweak this section :)

  • saagarjha 1 day ago
    This is a really good post, and a nice successor to the earlier one that I've pointed people at before: https://siboehm.com/articles/22/CUDA-MMM (this is linked in the post, of course). That one is a good introduction, but I found that it doesn't really explain a lot of the choices it makes beyond the basic ones (coalescing=good, bank conflicts=bad). This post goes a lot further into the details that are important when you try to attack cuBLAS performance with faster kernels on more recent hardware.

    In fact I recently had someone go through almost the same steps here, although they were using Ampere GPUs rather than Turing. As the post mentions, the microarchitecture is almost the same, especially when talking about the tensor cores (Hopper has significant changes). One difference is that it supports asynchronous global to shared memory transfers, which is quite useful. And you can get these for quite cheap: an RTX 3060 is a good card to test on. Vast.ai will rent one to you for a fraction of the price the author paid, and you can run Nsight Compute on them if you pick the VM instances (which cost several times more–the Docker instances are actually like a tenth of the price the author paid).

    I don't really have anything to point out about the content here, but I will add that you will want to pipeline shared memory loads (to registers) in addition to those coming from global memory. If you do this it's very possible to beat cuBLAS, which doesn't do that (CUTLASS does). I think around that microarchitecture you want at least a two-deep pipeline, where global memory loads into shared memory two iterations ahead, then shared memory populates registers one iteration ahead, then you do the MMA on the tile. As tensor cores get faster and faster this pipeline gets deeper, and async copies help make your life easier–but the actual benefit there is that you save registers, which are a scarce resource. Async copies don't need to pin registers while waiting for global→shared stores. In fact Blackwell has dedicated "tensor memory" to feed the tensor cores directly without needing to use registers at all. It's actually larger than your registers and shared memory put together, IIRC.
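
    Since the async copies came up, here is a stripped-down sketch of the double-buffered global→shared pipeline shape, using the Ampere __pipeline_* primitives. It only pipelines a dummy reduction over a 1-D array (the MMA, the shared→register stage, and all the tile math are omitted), and the tile size and names are mine:

        #include <cuda_pipeline_primitives.h>

        constexpr int TILE = 128;  // one element per thread; launch with blockDim.x == TILE

        // While the block computes on one shared-memory buffer, the async copy
        // unit is already filling the other one, without pinning registers for
        // the in-flight loads. Needs sm_80+ for truly asynchronous copies.
        __global__ void pipelined_sum(const float* __restrict__ in, float* out, int ntiles) {
            __shared__ float buf[2][TILE];
            float acc = 0.0f;

            // Prologue: start copying tile 0 into buffer 0.
            __pipeline_memcpy_async(&buf[0][threadIdx.x], &in[threadIdx.x], sizeof(float));
            __pipeline_commit();

            for (int t = 0; t < ntiles; ++t) {
                int cur = t & 1, nxt = (t + 1) & 1;
                if (t + 1 < ntiles) {
                    // Kick off the fetch of the *next* tile before touching the current one.
                    __pipeline_memcpy_async(&buf[nxt][threadIdx.x],
                                            &in[(t + 1) * TILE + threadIdx.x], sizeof(float));
                    __pipeline_commit();
                    __pipeline_wait_prior(1);  // current tile has landed; next may be in flight
                } else {
                    __pipeline_wait_prior(0);  // last tile: drain everything
                }
                __syncthreads();
                acc += buf[cur][threadIdx.x];  // stand-in for the MMA on this tile
                __syncthreads();
            }
            atomicAdd(out, acc);
        }

    A real GEMM main loop would add the shared→register stage one more iteration ahead, as described above.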

    Most people have moved on at this point but I did want to say that I think the ldmatrix instruction was a mistake, honestly–it is designed to hide the matrix layout from you, but in doing so it also makes it inevitable that you'll have conflicts. One thing I haven't heard many people talk about is that conflicts actually occur within a transaction, rather than across a whole warp. This also applies to coalescing from global memory. So in both cases, if you use vectorized operations you can fill a transaction with fewer threads, reducing the chance of conflicts (or the number of threads that need to cooperate for coalescing). If all threads perform vectorized loads, you reduce your worst-case conflicts from 32-way to 8-way! It's an easy way to improve your performance if you're having trouble laying out data across the entire warp.
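
    A small sketch of the vectorized-load point (kernel names mine; global memory shown, but the same counting applies to shared memory banks):

        // Scalar: each thread moves 4 bytes, so it takes 32 threads to fill
        // one 128-byte transaction.
        __global__ void copy_scalar(const float* __restrict__ in, float* __restrict__ out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) out[i] = in[i];
        }

        // Vectorized: each thread moves 16 bytes, so 8 threads fill that same
        // 128-byte transaction; applied to shared memory loads, this is what
        // cuts the worst-case conflict from 32-way to 8-way.
        __global__ void copy_vec4(const float4* __restrict__ in, float4* __restrict__ out, int n4) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n4) out[i] = in[i];
        }

    The float4 version assumes 16-byte-aligned pointers and a length that is a multiple of 4, which cudaMalloc'd buffers give you for free.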

    Also, occupancy of GEMM is basically 2-3 blocks per SM, if that. This is mentioned by this post, and Simon does too. But I think those who are new to writing high-performance compute bound kernels don't really internalize it as they should. The model of "more threads, more occupancy, to hide the memory latency" is truly dead and buried. It is hard to feed tensor cores but you can't solve this by wavefront parallelism anymore. Everything is explicitly pipelined, with more and more of it being done in software in the direction of computation, rather than horizontally across warps or SMs.

    • alexarmbr 5 hours ago
      thanks so much for reading!!

      >I think around that microarchitecture you want at least a two-deep pipeline, where global memory loads into shared memory two iterations ahead

      I agree, if I were to spend more time on this I would add another layer of pipelining like you say, and also tweak it so that each threadblock is persistent on a particular SM and processes multiple tiles. This lets you hide the write latency of the epilogue, and arrange the SM<->data layout in a way that maximizes L2 cache locality. (good explanation of this here: https://cudaforfun.substack.com/p/outperforming-cublas-on-h1...)
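
      A bare-bones sketch of that persistent-block structure (names mine, all the tile math elided): launch only as many blocks as the GPU keeps resident and let each one loop over output tiles, so one tile's epilogue overlaps the next tile's loads and the tile→SM assignment can be ordered for L2 reuse.

          // Launch with gridDim.x equal to the number of resident blocks
          // (e.g. #SMs * 2) rather than one block per C tile.
          __global__ void persistent_tiles(int num_tiles, int* tile_owner) {
              for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
                  // Real kernel: map `tile` to a (row, col) tile of C in an order
                  // chosen for L2 reuse, run the MMA main loop, then the epilogue;
                  // the epilogue writes of one tile overlap the loads of the next.
                  if (threadIdx.x == 0) tile_owner[tile] = blockIdx.x;  // stand-in work
              }
          }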

      >I did want to say that I think the ldmatrix instruction was a mistake

      I agree. I found this instruction weird because it hides which thread is reading which data, but it still causes shared memory bank conflicts, so you are left to guess which thread is reading what. I find using the TMA on Hopper much nicer.

      >The model of "more threads, more occupancy, to hide the memory latency" is truly dead and buried

      Well said, it seems like all the intro-to-CUDA textbooks need to be rewritten. For GEMM-like kernels, occupancy now matters very little, and it's more about using dedicated, asynchronous hardware units properly in conjunction with one another. I like this because there is a bit less black magic involved when chasing the long tail of performance. This is well put here:

      https://research.colfax-intl.com/cutlass-tutorial-writing-ge...

      "In historical context, these developments continue a trend of replacing general-purpose computational resources by specialized hardware resources, to both remove bottlenecks and free up those general-purpose resources for other operations. Starting with the Volta architecture, the Tensor Cores divorced GEMM arithmetic operations from the general computational pipeline. Ampere’s asynchronous copy instructions allowed for true pipelining of GEMM mainloops. On Hopper GPUs, the asynchronous, single-threaded TMA and the ability to reallocate registers between warpgroups dramatically reduced the register and thread cost of data movement, and the asynchronous WGMMA allowed for pipelining of MMA with other compute operations. Now, Tensor Memory and UMMA do for MMA just what TMA did for copy, making it a single-threaded, asynchronous operation that does not consume registers. As a result, registers can primarily be used for other tasks like scheduling and fused epilogue operations."

  • westurner 1 day ago
    Multiplication algorithm: https://en.wikipedia.org/wiki/Multiplication_algorithm

    From https://news.ycombinator.com/item?id=40519828 re: LLMs and matrix multiplication with tensors:

    > "You Need to Pay Better Attention" (2024) https://arxiv.org/abs/2403.01643 :

    >> Our first contribution is Optimised Attention, which performs similarly to standard attention, but has 3/4 as many parameters and one matrix multiplication fewer per head. Next, we introduce Efficient Attention, which performs on par with standard attention with only 1/2 as many parameters and two matrix multiplications fewer per head and is up to twice as fast as standard attention. Lastly, we introduce Super Attention, which surpasses standard attention by a significant margin in both vision and natural language processing tasks while having fewer parameters and matrix multiplications.

    From "Transformer is a holographic associative memory" (2025) https://news.ycombinator.com/item?id=43029899 .. https://westurner.github.io/hnlog/#story-43028710 :

    >>> Convolution is in fact multiplication in Fourier space (this is the convolution theorem [1]) which says that Fourier transforms convert convolutions to products.

    From https://news.ycombinator.com/item?id=41322088 :

    > "A carbon-nanotube-based tensor processing unit" (2024)