Posts by Tags

Understanding GEMM on Blackwell with CuTeDSL

10 minute read

Published: February 28, 2026

In this post I want to walk through how a high-performance GEMM is structured on Blackwell, using the official CUTLASS dense GEMM example as a case study. The kernel is written in CuTeDSL, NVIDIA’s Python DSL for authoring CUTLASS kernels, and it uses nearly every Blackwell-specific optimization available.

Optimizing an NVFP4 Group GEMM Kernel on Blackwell

6 minute read

Published: February 16, 2026

Happy Chinese New Year! Besides celebrating CNY this weekend, I also spent some time optimizing an NVFP4 block-scaled group GEMM kernel on NVIDIA B200. The problem comes from a GPU Mode competition in which participants optimize CUDA kernels on Blackwell GPUs. The code is written in CuTeDSL, CUTLASS’s Python DSL for Blackwell kernels.

Jingkun Zhang

CUDA

Understanding GEMM on Blackwell with CuTeDSL

Optimizing an NVFP4 Group GEMM Kernel on Blackwell

blog

Understanding GEMM on Blackwell with CuTeDSL

Optimizing an NVFP4 Group GEMM Kernel on Blackwell