RLVR Is Dead: Next Gen Reasoning Lives in Pre-Train Space
A new arXiv paper argues that RLVR's gains on reasoning tasks are bounded by the base model's output distribution. The solution: shifting reinforcement learning into the pre-training phase to optimize the marginal distribution P(y) itself.
- An arXiv paper challenges the prevailing RLVR approach, arguing it is bottlenecked by the base model's output distribution.
- The authors propose shifting RL to pre-train space to optimize the marginal distribution P(y), not just the conditional P(y|x).
- This reframes the debate: current reasoning gains are real but finite; the next leap requires rethinking how models learn, not just how they finetune.
- DeepSeek, with its open-weight architecture and aggressive pre-training R&D, is the natural winner here; OpenAI and Anthropic may find their lead on reasoning benchmarks evaporating.
Why Is RLVR Hitting a Wall That No One Is Talking About?
The paper, released on arXiv on April 15, 2026, makes a stark mathematical argument: optimizing P(y|x) through RLVR can only exploit patterns already present in the model's output distribution. The authors prove that the model's exploration capacity is bounded by the support of the base distribution, meaning the set of outputs to which the base model assigns non-zero probability. In plain English: you cannot teach a model to reason about quantum mechanics if its pre-training distribution never included coherent physics reasoning. This is not a finetuning problem; it is a knowledge architecture problem.
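The support argument can be illustrated with a toy calculation. The sketch below is not the paper's proof; it assumes a KL-regularized RL objective, whose well-known closed-form optimum reweights the base distribution by exp(reward / beta). The three-answer setup, the numbers, and the function name `kl_regularized_optimum` are all hypothetical.

```python
import math


def kl_regularized_optimum(base_probs, rewards, beta):
    """Return pi*(y) proportional to base(y) * exp(r(y) / beta).

    This is the closed-form maximizer of E_pi[r] - beta * KL(pi || base)
    over a finite output set (assumed objective, not taken from the paper).
    """
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy setup: three candidate answers to a single prompt.
base_probs = [0.7, 0.3, 0.0]  # the base model never produces answer 2
rewards = [0.0, 1.0, 1.0]     # a verifier accepts answers 1 and 2

tuned = kl_regularized_optimum(base_probs, rewards, beta=0.1)

for idx, (p, q) in enumerate(zip(base_probs, tuned)):
    print(f"answer {idx}: base={p:.3f} tuned={q:.5f}")
```

In this toy case, almost all probability moves to answer 1, while answer 2 stays at exactly zero even though the verifier would accept it: reweighting cannot create mass where the base model has none. Real language models assign a tiny non-zero probability to every sequence, so in practice the bound is about outputs that are too unlikely to ever be sampled rather than strictly impossible ones.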
What Does Shifting RL to Pre-Train Space Actually Change?
Instead of learning passively from static corpora, the paper proposes applying reinforcement actively during pre-training to shape P(y). The model then learns not just to predict the next token but to explore reasoning paths during its initial training. The authors show this preserves broader exploration capacity: the model can discover novel reasoning strategies rather than being constrained to the best paths within a fixed output space. This is a direct challenge to the current orthodoxy at OpenAI and Anthropic, which have invested billions in post-training RL pipelines.
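To make the notation concrete, the two quantities can be written out as follows. This is a sketch under standard definitions that are assumed here rather than taken from the paper: $\mathcal{D}$ is a prompt distribution, $\pi_\theta$ is the model, and $r$ is a verifiable reward. The paper's own formalization may differ.

```latex
% RLVR in its usual form: optimize the conditional policy on sampled prompts.
J_{\mathrm{RLVR}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}\,
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \big[ r(x, y) \big]

% The marginal over outputs, by the law of total probability.
P_\theta(y) = \sum_{x} P(x)\, \pi_\theta(y \mid x)
```

Read this way, RLVR adjusts each conditional $\pi_\theta(\cdot \mid x)$ only for the prompts it is trained on, whereas an objective stated on $P_\theta(y)$ targets what the model is able to produce at all.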

| Dimension | RLVR (Post-Training) | RL in Pre-Train Space |
|---|---|---|
| Optimization Target | P(y\|x) conditional | P(y) marginal |
| Exploration Bound | Base model's output distribution | Potentially unbounded |
| Training Cost | Moderate (finetuning) | Extremely high (pre-training overhaul) |
| Reasoning Ceiling | Fixed by pre-training data | Can surpass pre-training data |
| Key Proponent | OpenAI, Anthropic, Google | DeepSeek, academic labs |
| Verdict | Short-term gain, long-term dead end | Short-term pain, long-term dominance |
Who Wins If This Paper Is Right?
DeepSeek is the clearest winner. The Chinese lab has already demonstrated cost-efficient pre-training and an open-weight philosophy. It can experiment with this approach without the commercial pressure that constrains OpenAI. Google DeepMind, with its massive compute and long-standing RL expertise, is also well-positioned, but its bureaucratic structure may slow adoption. The biggest loser is Anthropic, which has bet its entire safety narrative on post-training RL techniques such as constitutional AI. If RLVR is a dead end, Anthropic's reasoning advantage evaporates.
My thesis: The RLVR era is a temporary plateau, and the next reasoning breakthrough will come from labs that embed reward signals into pre-training, not post-training. In the short term, expect a flurry of replication attempts from major labs. OpenAI will likely claim they've already explored this internally. But the paper's mathematical proof is hard to dismiss: you cannot get blood from a stone.
The long-term consequence is a bifurcation of the AI industry. Labs that can afford to retool their pre-training pipelines will leap ahead; those that cannot will be stuck on the RLVR treadmill, chasing incremental gains on benchmarks that no longer matter.
DeepSeek gains the most because they have the compute, the talent, and the freedom to experiment. I expect DeepSeek to release a model trained with RL in pre-train space by Q1 2027, achieving a 15-20% improvement on MATH and GSM8K over the best RLVR models. Anthropic loses the most: their safety framework is built on RLVR, and this paper undermines its theoretical foundation.
- DeepSeek will release a model trained with RL in pre-train space by Q1 2027, achieving a 15-20% improvement on MATH and GSM8K over the best RLVR models.
- OpenAI will internally pivot away from RLVR by Q3 2026, but will not announce it publicly until Q1 2027.
- The EU AI Office will require disclosure of pre-training reward architectures by 2028, citing the paper as evidence that post-training alignment is insufficient.
- April 2026: arXiv paper published
Paper 'From P(y|x) to P(y)' released, challenging RLVR orthodoxy.
- Q3 2026: Expected OpenAI pivot
Prediction: OpenAI internally shifts focus away from RLVR.
- Q1 2027: Expected DeepSeek release
Prediction: DeepSeek releases model trained with RL in pre-train space.
[Chart: Projected Reasoning Improvement (MATH) by Approach (estimated)]
- RLVR is a temporary fix, not a permanent solution; the paper proves it mathematically.
- The next AI arms race will be in pre-training infrastructure, not finetuning techniques.
- DeepSeek is the dark horse that could leapfrog OpenAI's reasoning lead.
- Anthropic's safety narrative is at risk if RLVR is a dead end.
- Regulators will use this paper to demand more transparency in pre-training.
Source and attribution
arXiv
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space