Shlok Limbhare

AI/ML Engineer

I'm focused on high-performance LLM inference, working across GPU kernels, attention mechanisms, and system-level optimizations to turn research ideas into efficient, production-ready systems.

About

I'm an AI/ML engineer specializing in high-performance LLM inference optimization. My work spans GPU kernel development, attention mechanism optimization, memory management, and system-level performance tuning. I'm passionate about bridging the gap between cutting-edge research and practical, efficient implementations.

My expertise includes CUDA and HIP kernel programming, FP8 quantization, inference optimization techniques, and profiling tools like NVIDIA Nsight. I focus on making large language models run faster, more efficiently, and at scale.

Featured Projects

Mini-Attention

A personal research repo for high-performance GPU attention kernels. Implements Standard, Ring, and KV-Cached/Paged attention across PyTorch, Triton, and CUDA — with dedicated branches optimized for RTX 3060, NVIDIA Blackwell (B200), and AMD MI300. Benchmarks single- and multi-GPU performance, memory efficiency, and throughput.
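The KV-cached decoding that Mini-Attention implements can be sketched in a few lines. This is a minimal NumPy illustration of the idea only (single head, toy shapes, no paging), not code from the repo: at each decode step the new token's key and value are appended to a cache, and attention runs over the whole cache instead of recomputing past keys and values.

```python
import numpy as np

def attend(q, K, V):
    # Single-head scaled dot-product attention.
    # q: (d,) query for the current token; K, V: (t, d) cached keys/values.
    scores = K @ q / np.sqrt(q.shape[-1])        # (t,) similarity per cached token
    w = np.exp(scores - scores.max())            # numerically stable softmax
    w /= w.sum()
    return w @ V                                 # (d,) weighted value mix

# Toy decode loop: grow the KV cache one token at a time.
d = 8
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outs = []
for step in range(4):
    q = rng.standard_normal(d)
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    K_cache = np.vstack([K_cache, k])   # real kernels write into pre-allocated pages
    V_cache = np.vstack([V_cache, v])
    outs.append(attend(q, K_cache, V_cache))
```

Paged attention keeps the same math but stores the cache in fixed-size blocks so memory can be allocated and reclaimed per page rather than per sequence.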

Flash Attention · KV-Cache · CUDA / Triton · Multi-GPU · AMD MI300
View on GitHub →

WaveBoost

An inference-optimization sandbox that implements CUDA kernels for LLM inference from scratch. It reimplements attention variants including multi-head attention (MHA) and grouped-query attention (GQA), with systematic benchmarking to study low-latency decoding and optimized memory management at scale.
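The GQA variant mentioned above shrinks the KV cache by letting groups of query heads share one key/value head. A minimal NumPy sketch of that sharing pattern (toy shapes, loop over heads for clarity; the actual kernels fuse this, and none of this code is from WaveBoost):

```python
import numpy as np

def gqa(Q, K, V, n_kv_heads):
    # Grouped-query attention.
    # Q: (n_q_heads, t, d); K, V: (n_kv_heads, t, d) with n_kv_heads <= n_q_heads.
    n_q_heads, t, d = Q.shape
    group = n_q_heads // n_kv_heads              # query heads per shared KV head
    out = np.empty_like(Q)
    for h in range(n_q_heads):
        kv = h // group                          # which KV head this query head reads
        scores = Q[h] @ K[kv].T / np.sqrt(d)     # (t, t)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ V[kv]
    return out

rng = np.random.default_rng(1)
Q = rng.standard_normal((8, 4, 16))
Q[1] = Q[0]                         # heads 0 and 1 share KV head 0 and equal queries
K = rng.standard_normal((2, 4, 16))
V = rng.standard_normal((2, 4, 16))
out = gqa(Q, K, V, n_kv_heads=2)
```

With `n_kv_heads == n_q_heads` this reduces to standard MHA, and with `n_kv_heads == 1` it is multi-query attention; the cache footprint scales with the KV head count, which is what makes GQA attractive for low-latency decoding.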

LLM Inference · MHA / GQA · GPU Kernels · CUDA
View on GitHub →

Get in Touch

I'm always interested in discussing GPU optimization, LLM inference, and high-performance computing challenges. Feel free to reach out.