Understanding GEMM on Blackwell with CuTeDSL
In this post, I want to walk through how a high-performance GEMM is structured on Blackwell, using the official CUTLASS dense GEMM example as a case study. The kernel is written in CuTeDSL, NVIDIA’s Python DSL for authoring CUTLASS kernels, and it packs in nearly every Blackwell-specific optimization available.
