Sitemap

A list of all the posts and pages found on the site. For you robots out there is an XML version available for digesting as well.

Pages

Posts

Understanding GEMM on Blackwell with CuTeDSL

10 minute read

Published:

In this post I want to walk through how a high-performance GEMM is structured on Blackwell, using the official CUTLASS dense GEMM example as a case study. This kernel is written in CuTeDSL — NVIDIA’s Python DSL for authoring CUTLASS kernels — and it packs in nearly every Blackwell-specific optimization available.

Optimizing an NVFP4 Group GEMM Kernel on Blackwell

6 minute read

Published:

Happy Chinese New Year! Besides celebrating CNY this weekend, I also spent some time working on NVFP4 block-scaled group GEMM kernel optimizations on NVIDIA B200. This is from a GPU Mode competition where participants optimize CUDA kernels on Blackwell GPUs. The code is written in CuTeDSL, CUTLASS’s Python DSL for Blackwell kernels.

portfolio

publications

talks

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.