Let's face it - we're in the middle of a GPU shortage crisis. Everyone wants GPUs for their AI projects. The demand is through the roof, and prices are absolutely insane - a single H100 can cost over $30,000, and good luck even finding one in stock.
For most companies and researchers, buying more GPUs simply isn't an option. The only realistic solution? We need to squeeze every bit of performance from the GPUs we already have.
If you've ever tried optimizing CUDA code, you know the pain. It's like solving a massive puzzle where you're constantly tweaking memory access patterns, adjusting thread blocks, and running endless profiling tests. Engineers spend weeks or months on this stuff, and it's honestly exhausting.
Here's where things get interesting. Recent LLMs - think DeepSeek-R1 and OpenAI's o1 - are getting pretty good at writing code. And here's the kicker: CUDA optimization has a crystal-clear reward signal - speed. Your code either runs faster or it doesn't, and that makes it a perfect fit for reinforcement learning.
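To make that reward concrete, here's a minimal sketch of what a speed-based reward could look like, assuming a PyTorch setup on a CUDA device. The function name and timing details are our illustration, not CUDA-L1's exact formula.

```python
import torch

def speedup_reward(candidate_fn, reference_fn, example_inputs, n_iters=100):
    """Toy reward: how much faster is the candidate than the PyTorch reference?
    (Illustrative only -- not CUDA-L1's exact reward formula.)"""
    def avg_ms(fn):
        for _ in range(10):                      # warm-up runs
            fn(*example_inputs)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_iters):
            fn(*example_inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / n_iters  # milliseconds per call

    return avg_ms(reference_fn) / avg_ms(candidate_fn)  # >1.0 means the candidate is faster
```

In a setup like this, a reward above 1.0 simply means the generated kernel beats the baseline.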
Imagine this: instead of you pulling your hair out trying different optimizations, an AI could generate thousands of variations, test them all, and learn what works. It might even discover tricks that humans never thought of!
So we built CUDA-L1, which uses something we call "contrastive reinforcement learning." Think of it like this: instead of just trying random stuff, our AI compares different CUDA versions side-by-side and learns why some are faster than others. It's like having a coach that shows you good vs. bad examples until you get it.
| GPU Model | Average Speedup |
|---|---|
| H100 | 17.8× |
| RTX 3090 | 19.0× |
| L40 | 16.5× |
| H800 | 14.7× |
| H20 | 13.9× |
Here's what blew our minds: CUDA-L1 discovered optimization strategies entirely on its own - we'll walk through a concrete example later in this post.
Ask any AI to write CUDA code and you'll likely get something that doesn't compile, crashes, or runs painfully slow. The reason is simple: these models barely saw any quality CUDA code during training. It's like asking someone who's only read cooking blogs to become a chef.
We built CUDA-L1 with a three-stage pipeline: supervised learning (learn the basics), self-supervised learning (practice until perfect), and contrastive reinforcement learning (compete for speed).
First, we needed to fix the data shortage problem. We took existing CUDA code and created variations of it - expanding the model's exposure to different CUDA patterns. This supervised fine-tuning phase has one goal: make sure the AI can write CUDA code that actually compiles and runs correctly.
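As a sketch of what "compiles and runs correctly" could mean as an automated check - assuming PyTorch's inline extension loader is used to build the generated kernel, with helper names of our own invention:

```python
import torch
from torch.utils.cpp_extension import load_inline

def compiles_and_matches(cuda_source, cpp_source, fn_name, reference_fn, example_inputs):
    """Illustrative check: build the generated CUDA code and compare its output
    against the PyTorch reference on sample inputs."""
    try:
        ext = load_inline(
            name="candidate_kernel",
            cpp_sources=[cpp_source],    # C++ declaration that exposes fn_name
            cuda_sources=[cuda_source],  # the generated kernel plus its wrapper
            functions=[fn_name],
        )
    except Exception:
        return False  # failed to compile or link
    try:
        out = getattr(ext, fn_name)(*example_inputs)
    except Exception:
        return False  # compiled, but crashed at runtime
    return torch.allclose(out, reference_fn(*example_inputs), rtol=1e-3, atol=1e-3)
```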
Next, we let the model generate its own CUDA code, test it, and learn from what works. The model generates thousands of code samples, we automatically test each one, and only the successful implementations get fed back for more training. No speed optimization yet - just making sure the code works reliably.
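One round of that loop might look roughly like this - `generate_candidates`, `passes_tests`, and `fine_tune` are placeholders for the model's sampler, the automatic test harness, and the training step, not real CUDA-L1 APIs:

```python
def self_supervised_round(model, tasks, samples_per_task=16):
    """One illustrative generate-test-retrain round.  generate_candidates,
    passes_tests, and fine_tune are placeholders, not real CUDA-L1 APIs."""
    successes = []
    for task in tasks:
        for code in generate_candidates(model, task.prompt, n=samples_per_task):
            # Keep only code that compiles, runs, and matches the reference output.
            if passes_tests(code, task):
                successes.append((task.prompt, code))
    # Feed the working implementations back as additional training data.
    # No speed objective yet -- this stage is purely about reliability.
    fine_tune(model, successes)
    return model
```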
This is where CUDA-L1 becomes special. Traditional RL would just assign scores to generated code and hope the model figures out why some implementations are faster. That's like grading exams without showing students the correct answers.
Instead, we do something radically different. The prompt we use shows the model several of its earlier CUDA implementations side by side with their measured speedups, and then asks it three critical questions about them.
The magic happens because the AI can directly see and reason about performance differences. It's not guessing in the dark - it's learning from concrete examples of what makes CUDA code fast.
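The actual prompt isn't reproduced here, but the shape of the idea is easy to sketch: show previous kernels with their scores and ask the model to reason before it writes. Everything below - including the placeholder questions - is our illustration, not CUDA-L1's prompt.

```python
def build_contrastive_prompt(task_description, scored_variants):
    """Pair each earlier kernel with its measured speedup so the model can
    compare fast and slow implementations side by side.
    (Our sketch of the idea -- not the actual CUDA-L1 prompt.)"""
    blocks = []
    for i, (code, speedup) in enumerate(sorted(scored_variants, key=lambda v: v[1])):
        blocks.append(f"### Implementation {i + 1} (measured speedup: {speedup:.2f}x)\n{code}")
    # Placeholder questions -- the paper's actual three questions are not reproduced here.
    questions = (
        "1. Why are the faster implementations faster than the slower ones?\n"
        "2. Which of their techniques could be combined?\n"
        "3. Write a new CUDA kernel that should outperform all of the above."
    )
    return f"Task:\n{task_description}\n\n" + "\n\n".join(blocks) + "\n\n" + questions
```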
We tested CUDA-L1 on KernelBench, a comprehensive benchmark suite of 250 PyTorch workloads spread across three difficulty levels. Here's how it did:
| Method | Mean | Max | 75% | 50% | 25% | Success (# out of total) | Speedup (# out of total) |
|---|---|---|---|---|---|---|---|
| All | 17.7× | 449× | 7.08× | 1.81× | 1.22× | 249/250 | 242/250 |
| Level 1 | 12.3× | 166× | 9.28× | 1.65× | 1.15× | 99/100 | 96/100 |
| Level 2 | 6.39× | 111× | 4.42× | 1.61× | 1.24× | 100/100 | 97/100 |
| Level 3 | 50.8× | 449× | 22.9× | 2.66× | 1.58× | 50/50 | 49/50 |
The results? Mind-blowing. CUDA-L1 achieved an average 17.7× speedup across all tasks, with some kernels running up to 449× faster than the baseline PyTorch implementations.
Here's where it gets really interesting: the gains are biggest on the hardest problems. Level 3 tasks averaged a 50.8× speedup - nearly three times the overall mean.
This pattern makes perfect sense - complex ML operations have more room for optimization, and CUDA-L1 excels at finding these opportunities. This is especially exciting for real-world applications like LLM inference, where complex operations dominate the workload.
We trained CUDA-L1 on NVIDIA A100s, but what if you're using a different GPU? Good news: the optimizations transfer remarkably well. We tested the same A100-optimized kernels on:
| GPU Device | Mean | Max | 75% | 50% | 25% | Success (# out of 250) | Speedup (# out of 250) |
|---|---|---|---|---|---|---|---|
| A100 PCIe | 17.7× | 449× | 7.08× | 1.81× | 1.22× | 249 | 242 |
| H100 SXM | 17.8× | 1,001× | 4.02× | 1.63× | 1.16× | 246 | 235 |
| L40 | 16.5× | 365× | 6.17× | 1.61× | 1.15× | 247 | 234 |
| RTX 3090 | 19.0× | 611× | 4.41× | 1.44× | 1.11× | 246 | 227 |
| H800 SXM | 14.7× | 433× | 4.80× | 1.57× | 1.16× | 249 | 243 |
| H20 | 13.9× | 412× | 4.76× | 1.54× | 1.16× | 248 | 236 |
That's right - without any special tuning for these GPUs, CUDA-L1's optimizations work across the board. A consumer GPU like the RTX 3090 even posts the highest mean speedup of the group (19.0×). This suggests that the optimization patterns CUDA-L1 learns are fundamental enough to benefit any modern GPU architecture.
Could we do even better with GPU-specific training? Absolutely. We're planning to release specialized versions of CUDA-L1 for different GPU architectures in future updates.
The future of GPU programming might involve less manual optimization and more collaboration with AI that truly understands what makes kernels fast. And unlike traditional approaches, CUDA-L1 keeps getting better with each training iteration, continuously discovering new optimization patterns.
Let's look at a concrete example to see what CUDA-L1 actually does. We gave it a bidirectional multi-layer GRU (a type of neural network used in language processing) and watched the magic happen.
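For reference, the PyTorch baseline for a task like this looks roughly as follows. The exact sizes KernelBench uses aren't reproduced here, so the numbers below are placeholders:

```python
import torch
import torch.nn as nn

# Placeholder sizes -- KernelBench's exact configuration may differ.
batch, seq_len, input_size, hidden_size, num_layers = 16, 128, 256, 512, 4

# The PyTorch reference: a multi-layer, bidirectional GRU.
gru = nn.GRU(input_size, hidden_size, num_layers,
             batch_first=True, bidirectional=True).cuda()

x = torch.randn(batch, seq_len, input_size, device="cuda")
output, h_n = gru(x)   # output shape: (batch, seq_len, 2 * hidden_size)
```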
CUDA-L1 layered four key optimizations on top of this baseline.
Here's where it gets fascinating. We tested each optimization individually and in combination, and the results completely shattered our expectations. What we found points to a fundamental truth about GPU optimization: speedups don't add, they multiply - an optimization that looks minor on its own can dramatically amplify the others once they're combined.
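A toy calculation with made-up optimizations and numbers (not the four from this case study) shows the difference between adding and multiplying gains:

```python
# Hypothetical per-optimization speedups (made-up numbers, for illustration only).
individual = {"memory layout": 1.3, "kernel fusion": 1.8, "stream overlap": 1.5}

additive_estimate = 1 + sum(s - 1 for s in individual.values())   # 2.6x
multiplicative_estimate = 1.0
for s in individual.values():
    multiplicative_estimate *= s                                    # ~3.5x

print(f"additive: {additive_estimate:.1f}x, multiplicative: {multiplicative_estimate:.1f}x")
```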
The really mind-blowing part? CUDA-L1 discovered these synergistic relationships on its own. Through trial and error during training, it learned that certain optimizations are "gatekeepers" that unlock the potential of others. It figured out optimization principles that even experienced CUDA programmers might miss.
This is the power of Contrastive-RL: by comparing thousands of implementations with their performance scores, CUDA-L1 doesn't just memorize optimization tricks - it develops an intuition for how different techniques interact and amplify each other.
```bibtex
@article{deepreinforce2025cudal1,
  title={CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning},
  author={DeepReinforce Team},
  journal={arXiv preprint arXiv:2507.14111},
  year={2025}
}
```