Posts by Tags

CUDA

Understanding GEMM on Blackwell with CuTeDSL

10 minute read

Published:

In this post I want to walk through how a high-performance GEMM is structured on Blackwell, using the official CUTLASS dense GEMM example as a case study. This kernel is written in CuTeDSL — NVIDIA’s Python DSL for authoring CUTLASS kernels — and it packs in nearly every Blackwell-specific optimization available.

Optimizing an NVFP4 Group GEMM Kernel on Blackwell

6 minute read

Published:

Happy Chinese New Year! Besides celebrating CNY this weekend, I also spent some time working on NVFP4 block-scaled group GEMM kernel optimizations on NVIDIA B200. This is from a GPU Mode competition where participants optimize CUDA kernels on Blackwell GPUs. The code is written in CuTeDSL, CUTLASS’s Python DSL for Blackwell kernels.

blog

Understanding GEMM on Blackwell with CuTeDSL

10 minute read

Published:

In this post I want to walk through how a high-performance GEMM is structured on Blackwell, using the official CUTLASS dense GEMM example as a case study. This kernel is written in CuTeDSL — NVIDIA’s Python DSL for authoring CUTLASS kernels — and it packs in nearly every Blackwell-specific optimization available.

Optimizing an NVFP4 Group GEMM Kernel on Blackwell

6 minute read

Published:

Happy Chinese New Year! Besides celebrating CNY this weekend, I also spent some time working on NVFP4 block-scaled group GEMM kernel optimizations on NVIDIA B200. This is from a GPU Mode competition where participants optimize CUDA kernels on Blackwell GPUs. The code is written in CuTeDSL, CUTLASS’s Python DSL for Blackwell kernels.