Posts

2026

Optimizing an NVFP4 Group GEMM Kernel on Blackwell

6 minute read

Published:

Happy Chinese New Year! Besides celebrating CNY this weekend, I also spent some time optimizing an NVFP4 block-scaled group GEMM kernel on the NVIDIA B200. The kernel comes from a GPU Mode competition in which participants optimize CUDA kernels on Blackwell GPUs. The code is written in CuTeDSL, CUTLASS’s Python DSL for writing GPU kernels.