CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

DeepReinforce Team
July 21, 2025

Introduction

The GPU Crisis and How AI Might Save Us

Let's face it - we're in the middle of a GPU shortage crisis. Everyone wants GPUs for their AI projects. The demand is through the roof, and prices are absolutely insane - a single H100 can cost over $30,000, and good luck even finding one in stock.

For most companies and researchers, buying more GPUs simply isn't an option. The only realistic solution? We need to squeeze every bit of performance from the GPUs we already have.

The Old Way: Manual CUDA Optimization Hell

If you've ever tried optimizing CUDA code, you know the pain. It's like solving a massive puzzle where you're constantly tweaking memory access patterns, adjusting thread blocks, and running endless profiling tests. Engineers spend weeks or months on this stuff, and it's honestly exhausting.

Enter AI: What if LLMs Could Do This For Us?

Here's where things get interesting. Recent LLMs - think DeepSeek-R1 and OpenAI's o1 - are getting pretty good at writing code. And here's the kicker: CUDA optimization has a super clear reward signal - speed! Your code either runs faster or it doesn't. That's a perfect reward for reinforcement learning.

Imagine this: instead of you pulling your hair out trying different optimizations, an AI could generate thousands of variations, test them all, and learn what works. It might even discover tricks that humans never thought of!
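
Concretely, the reward can be as simple as a measured speedup. Here's a minimal sketch of such a harness in PyTorch (the function names are ours, not from the paper):

import torch

def median_time_ms(fn, *args, warmup=5, iters=20):
    """Median GPU wall-clock time of fn(*args), in milliseconds."""
    for _ in range(warmup):                     # warm up caches and lazy init
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(iters):
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()                # wait for the GPU to finish
        times.append(start.elapsed_time(end))
    return sorted(times)[len(times) // 2]

def speedup_reward(reference_fn, candidate_fn, inputs):
    """RL reward: how much faster the candidate runs than the reference."""
    ref = median_time_ms(reference_fn, *inputs)
    cand = median_time_ms(candidate_fn, *inputs)
    return ref / cand                           # >1.0 means the candidate is faster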

Introducing CUDA-L1

So we built CUDA-L1, which uses something we call "contrastive reinforcement learning." Think of it like this: instead of just trying random stuff, our AI compares different CUDA versions side-by-side and learns why some are faster than others. It's like having a coach that shows you good vs. bad examples until you get it.

The Results Are Mind-Blowing

17.7× - average speedup across 250 benchmarks
449× - best-case speedup (not a typo!)

CUDA-L1 works great on different GPUs too:

GPU Model   Average Speedup
H100        17.8×
RTX 3090    19.0×
L40         16.5×
H800        14.7×
H20         13.9×

What Makes This Special?

Here's what blew our minds about what CUDA-L1 learned on its own:

  • It discovered optimization techniques by itself - stuff like memory coalescing, loop unrolling, and operation fusion. Some of these are well-known; others are rarely used.
  • It figures out the perfect combo - like a chef who knows exactly which spices work together, it combines optimizations in ways that maximize performance.
  • It learned the "rules" of CUDA - like how some optimizations multiply each other's effects, or how you need to apply certain "gatekeeper" techniques first before others will work.
  • It spots hidden problems - sometimes it rejects optimizations that look good on paper but actually slow things down due to sneaky issues like CPU-GPU sync overhead (see the sketch below).
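
To make that last point concrete, here's a hypothetical illustration (ours, not a kernel from the paper) of how an innocent-looking pattern quietly forces the CPU to wait on the GPU:

import torch

x = torch.randn(1_000_000, device="cuda")

# Chunked reduction looks like a reasonable "optimization", but .item()
# forces a full CPU-GPU synchronization on every iteration.
total = 0.0
for chunk in x.split(100_000):
    total += chunk.sum().item()   # implicit sync, ten times over

# Staying on-device until the very end pays the sync cost exactly once.
total = x.sum().item()            # one sync, same result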

How CUDA-L1 Works

The Problem: Why Can't Current LLMs Write Good CUDA?

Ask any AI to write CUDA code and you'll likely get something that doesn't compile, crashes, or runs painfully slow. The reason is simple: these models barely saw any quality CUDA code during training. It's like asking someone who's only read cooking blogs to become a chef.

Enter CUDA-L1: A Three-Step Recipe for Success

We built CUDA-L1 with a three-stage pipeline: supervised learning (learn the basics), self-supervised learning (practice until perfect), and contrastive reinforcement learning (compete for speed).

[Figure: the CUDA-L1 three-stage pipeline]
Stage 1: Learning the Basics with Data Augmentation

First, we needed to fix the data shortage problem. We took existing CUDA code and created variations of it - expanding the model's exposure to different CUDA patterns. This supervised fine-tuning phase has one goal: make sure the AI can write CUDA code that actually compiles and runs correctly.

Stage 2: Practice Makes Perfect with Self-Supervised Learning

Next, we let the model generate its own CUDA code, test it, and learn from what works. The model generates thousands of code samples, we automatically test each one, and only the successful implementations get fed back for more training. No speed optimization yet - just making sure the code works reliably.
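
In pseudocode terms, the loop might look like the sketch below, where generate and is_correct are hypothetical stand-ins for the LLM sampler and the automatic test harness:

def self_supervised_round(generate, is_correct, tasks, samples_per_task=100):
    """Keep only generated CUDA modules that compile and match the reference."""
    kept = []
    for task in tasks:
        for _ in range(samples_per_task):
            code = generate(task)           # sample one implementation from the model
            if is_correct(task, code):      # compiles, runs, outputs match reference
                kept.append((task, code))
    return kept                             # fed back in as fine-tuning data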

Stage 3: The Speed Revolution - Contrastive Reinforcement Learning

This is where CUDA-L1 becomes special. Traditional RL would just assign scores to generated code and hope the model figures out why some implementations are faster. That's like grading exams without showing students the correct answers.

Instead, we do something radically different. Look at this actual prompt we use:

We show the AI multiple CUDA implementations WITH their speed scores:
  • • "Here's kernel_v1 that achieves 1.2x speedup"
  • • "Here's kernel_v2 that achieves 2.8x speedup"
  • • "Here's kernel_v3 that achieves 1.5x speedup"

Then we ask three critical questions:

  1. Performance Analysis: "Why is kernel_v2 so much faster? What optimizations did it use that the others didn't?"
  2. Algorithm Design: "Based on this analysis, what optimization strategy would work even better?"
  3. Code Implementation: "Now write a kernel that beats them all."

The magic happens because the AI can directly see and reason about performance differences. It's not guessing in the dark - it's learning from concrete examples of what makes CUDA code fast.
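
As a rough sketch of how such a contrastive prompt could be assembled (the exact template in the paper may differ; this is our illustration):

def build_contrastive_prompt(task, scored_kernels):
    """task: problem description; scored_kernels: list of (code, speedup) pairs."""
    parts = [f"Task: {task}",
             "Previous CUDA implementations with measured speedups:"]
    for i, (code, speedup) in enumerate(scored_kernels, start=1):
        parts.append(f"\n# kernel_v{i} -- {speedup:.1f}x speedup\n{code}")
    parts.append(
        "\n1. Performance analysis: why are the fast kernels faster?\n"
        "2. Algorithm design: what strategy would work even better?\n"
        "3. Code implementation: write a kernel that beats them all."
    )
    return "\n".join(parts)

Each RL iteration then scores the newly generated kernel (e.g., with a timing harness like the one above) and feeds the strongest exemplars back into the next prompt, so the model effectively competes against its own best work.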

Results

Does It Actually Work?

We tested CUDA-L1 on KernelBench, a comprehensive benchmark suite with three difficulty levels:

  • Level 1: Simple operations (like matrix multiply)
  • Level 2: Operator sequences (like attention mechanisms)
  • Level 3: Complex ML tasks (like full transformer layers)
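
For a sense of scale, a Level 1 task is roughly a tiny PyTorch reference module that CUDA-L1 must re-implement faster (the shapes below are illustrative, not the benchmark's exact sizes):

import torch
import torch.nn as nn

class Model(nn.Module):
    """KernelBench-style reference: plain matrix multiplication."""
    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.matmul(a, b)

def get_inputs():
    return [torch.randn(1024, 4096, device="cuda"),
            torch.randn(4096, 2048, device="cuda")]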
Performance of CUDA-L1 on KernelBench

Method    Mean    Max    75%     50%     25%     Success (#/total)   Speedup (#/total)
All       17.7×   449×   7.08×   1.81×   1.22×   249/250             242/250
Level 1   12.3×   166×   9.28×   1.65×   1.15×   99/100              96/100
Level 2   6.39×   111×   4.42×   1.61×   1.24×   100/100             97/100
Level 3   50.8×   449×   22.9×   2.66×   1.58×   50/50               49/50

The results? Mind-blowing. CUDA-L1 achieved an average 17.7× speedup across all tasks, with some kernels running up to 449× faster than the baseline PyTorch implementations.

Here's where it gets really interesting: the most complex tasks see the biggest wins:

  • Level 1: 12.3× average speedup
  • Level 2: 6.4× average speedup
  • Level 3: 50.8× average speedup

This pattern makes perfect sense - complex ML operations have more room for optimization, and CUDA-L1 excels at finding these opportunities. This is especially exciting for real-world applications like LLM inference, where complex operations dominate the workload.

[Figure: Performance Comparison]

But Does It Work on Your GPU?

We trained CUDA-L1 on NVIDIA A100s, but what if you're using a different GPU? Good news: the optimizations transfer remarkably well. We tested the same A100-optimized kernels on:

CUDA-L1 performance on KernelBench across different GPU devices

GPU Device   Mean    Max      75%     50%     25%     Success (#/250)   Speedup (#/250)
A100 PCIe    17.7×   449×     7.08×   1.81×   1.22×   249               242
H100 SXM     17.8×   1,001×   4.02×   1.63×   1.16×   246               235
L40          16.5×   365×     6.17×   1.61×   1.15×   247               234
RTX 3090     19.0×   611×     4.41×   1.44×   1.11×   246               227
H800 SXM     14.7×   433×     4.80×   1.57×   1.16×   249               243
H20          13.9×   412×     4.76×   1.54×   1.16×   248               236

That's right - without any special tuning for these GPUs, CUDA-L1's optimizations work across the board; the consumer-grade RTX 3090 even posts the highest mean speedup of the group. This suggests that the optimization patterns CUDA-L1 learns are fundamental enough to benefit any modern GPU architecture.

Could we do even better with GPU-specific training? Absolutely. We're planning to release specialized versions of CUDA-L1 for different GPU architectures in future updates.

The future of GPU programming might involve less manual optimization and more collaboration with AI that truly understands what makes kernels fast. And unlike traditional approaches, CUDA-L1 keeps getting better with each training iteration, continuously discovering new optimization patterns.

Case Study: Bidirectional GRU (466×)

A Real Example: How CUDA-L1 Optimized a GRU Network

Let's look at a concrete example to see what CUDA-L1 actually does. We gave it a bidirectional multi-layer GRU (a type of neural network used in language processing) and watched the magic happen.

[Figure: GRU Optimization Performance]

CUDA-L1 applied four key optimizations (sketched in code after the list):

  1. CUDA Graphs: Converted the entire sequence of operations into a single "super-kernel"
  2. Stream Management: Isolated execution on dedicated GPU streams
  3. Memory Optimization: Pre-allocated tensors and cached compiled graphs
  4. Reduced Branching: Eliminated unnecessary conditional logic
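
Here's a simplified PyTorch reconstruction of the first three ideas (our sketch, not CUDA-L1's actual output): capture the GRU forward pass into a CUDA graph on a dedicated stream, reusing pre-allocated tensors. The layer sizes are illustrative.

import torch

gru = torch.nn.GRU(input_size=128, hidden_size=256,
                   num_layers=6, bidirectional=True).cuda()
static_x = torch.randn(64, 32, 128, device="cuda")  # pre-allocated input buffer

# Warm up on a dedicated side stream before capture (required by CUDA graphs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_out, _ = gru(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture the entire multi-layer forward pass as one replayable graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out, _ = gru(static_x)

def fast_forward(x):
    static_x.copy_(x)   # reuse the captured buffer instead of reallocating
    graph.replay()      # one launch replays the whole "super-kernel"
    return static_out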

Here's where it gets fascinating. We tested each optimization individually and in combination. The results completely shattered our expectations:

  • CUDA Graphs alone: Only 2× speedup (disappointing!)
  • Stream Management alone: No speedup at all (1×)
  • CUDA Graphs + Stream Management: 260× speedup (!!)
  • All four optimizations: 430× speedup

Wait, what? How does 2× + 1× = 260×?

The answer reveals a fundamental truth about GPU optimization: it's not additive, it's multiplicative. Think of it this way:

  • CUDA Graphs alone is like having a race car stuck in city traffic
  • Stream Management alone is like building a highway but driving a bicycle
  • Together? You get a race car on an open highway

The really mind-blowing part? CUDA-L1 discovered these synergistic relationships on its own. Through trial and error during training, it learned that certain optimizations are "gatekeepers" that unlock the potential of others. It figured out optimization principles that even experienced CUDA programmers might miss.

This is the power of Contrastive-RL: by comparing thousands of implementations with their performance scores, CUDA-L1 doesn't just memorize optimization tricks - it develops an intuition for how different techniques interact and amplify each other.

[Figure: GRU Execution Time Comparison]

BibTeX

@article{deepreinforce2025cudal1,
  title={CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning},
  author={DeepReinforce Team},
  journal={arXiv preprint arXiv:2507.14111},
  year={2025}
}