RLVR Is Dead: Next Gen Reasoning Lives in Pre-Train Space

A new arXiv paper argues that RLVR's gains on reasoning tasks are bounded by the base model's output distribution. The solution: shifting reinforcement learning into the pre-training phase to optimize the marginal distribution P(y) itself.

A new arXiv preprint drops a bombshell: reinforcement learning with verifiable rewards (RLVR), the darling of reasoning benchmarks, is hitting a fundamental ceiling. The paper argues that optimizing P(y|x), the model's conditional output distribution given a prompt, cannot go beyond the base model's existing output distribution, and that the real frontier lies in reshaping P(y) during pre-training.

  • An arXiv paper challenges the prevailing RLVR approach, arguing it is bottlenecked by the base model's output distribution.
  • The authors propose shifting RL to pre-train space to optimize the marginal distribution P(y), not just the conditional P(y|x).
  • This reframes the debate: current reasoning gains are real but finite; the next leap requires rethinking how models learn, not just how they finetune.
  • DeepSeek, with its open-weight releases and aggressive pre-training R&D, is the natural winner here; OpenAI and Anthropic may find their lead on reasoning benchmarks evaporating.

Why Is RLVR Hitting a Wall That No One Is Talking About?

The paper, released on arXiv on April 15, 2026, makes a stark mathematical claim: optimizing P(y|x) through RLVR can only exploit patterns already present in the model's output distribution. The authors prove that the model's exploration capacity is bounded by the support of the base distribution, that is, the set of outputs the base model already assigns non-zero probability. In plain English: RLVR cannot teach a model to reason about quantum mechanics if the base model never produces coherent physics reasoning in the first place. This is not a finetuning problem; it is a problem with what pre-training put into the model.
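
The support bound can be illustrated with a toy calculation. This is a minimal sketch, not the paper's proof: it assumes the standard KL-regularized RL objective, whose closed-form optimum reweights the base distribution by exp(reward / beta). The four candidate answers, their probabilities, the rewards, and the function name `kl_regularized_policy` are all hypothetical values chosen for illustration.

```python
import math


def kl_regularized_policy(base_probs, rewards, beta=1.0):
    """Closed-form optimum of  max E_pi[r] - beta * KL(pi || base).

    The solution is pi(y) proportional to base(y) * exp(r(y) / beta),
    so any answer with base probability 0 keeps probability 0.
    """
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical base model: four candidate answers to a single prompt.
# The last answer is correct but lies outside the base model's support.
base = [0.70, 0.25, 0.05, 0.00]
rewards = [0.0, 0.0, 1.0, 1.0]  # a verifier accepts answers 2 and 3

tuned = kl_regularized_policy(base, rewards, beta=0.1)

for i, (b, t) in enumerate(zip(base, tuned)):
    print(f"answer {i}: base={b:.2f}  tuned={t:.4f}")
```

In this toy setup the rare but reachable answer 2 absorbs almost all of the probability mass, while answer 3 stays at exactly zero no matter how large its reward is. A real language model's softmax never assigns exactly zero probability, so the practical version of the argument concerns outputs that are too unlikely ever to be sampled during RL.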

What Does Shifting RL to Pre-Train Space Actually Change?

Instead of learning passively from static corpora, the paper proposes active reinforcement during pre-training to shape P(y). This means the model learns not just to predict the next token, but to explore reasoning paths during its initial training. The authors show this preserves broader exploration capacity: the model can discover novel reasoning strategies rather than being constrained to the best paths within a fixed output space. This is a direct challenge to the current orthodoxy at OpenAI and Anthropic, which have invested billions in post-training RL pipelines.
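
For readers who want the two quantities side by side, they are linked by the law of total probability. The notation below is an assumption made for illustration: p(x) is a prompt distribution, P_theta is the model, and r is a verifiable reward. The second line is the textbook RLVR objective, not a transcription of the paper's own formulation.

```latex
% Marginal over outputs: the conditional averaged over prompts x ~ p(x)
P_\theta(y) = \sum_{x} p(x)\, P_\theta(y \mid x)

% Standard RLVR objective: expected verifiable reward under the conditional policy
J_{\mathrm{RLVR}}(\theta) = \mathbb{E}_{x \sim p(x)}\, \mathbb{E}_{y \sim P_\theta(\cdot \mid x)} \big[ r(x, y) \big]
```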

Dimension            RLVR (Post-Training)                  RL in Pre-Train Space
Optimization Target  P(y|x) conditional                    P(y) marginal
Exploration Bound    Base model's output distribution      Potentially unbounded
Training Cost        Moderate (finetuning)                 Extremely high (pre-training overhaul)
Reasoning Ceiling    Fixed by pre-training data            Can surpass pre-training data
Key Proponent        OpenAI, Anthropic, Google             DeepSeek, academic labs
Verdict              Short-term gain, long-term dead end   Short-term pain, long-term dominance

Who Wins If This Paper Is Right?

DeepSeek is the clearest winner. The Chinese lab has already demonstrated cost-efficient pre-training and an open-weight philosophy. It can experiment with this approach without the commercial pressure that constrains OpenAI. Google DeepMind, with its massive compute and deep RL expertise, is also well-positioned, but its bureaucratic structure may slow adoption. The biggest loser is Anthropic, which has bet its safety narrative on post-training RL techniques such as Constitutional AI. If post-training RL of this kind is a dead end, Anthropic's reasoning advantage evaporates.

My thesis: the RLVR era is a temporary plateau, and the next reasoning breakthrough will come from labs that embed reward signals into pre-training, not post-training. In the short term, expect a flurry of replication attempts from major labs. OpenAI will likely claim it has already explored this internally. But the paper's mathematical proof is hard to dismiss: you cannot get blood from a stone.

The long-term consequence is a bifurcation of the AI industry. Labs that can afford to retool their pre-training pipelines will leap ahead; those that cannot will be stuck on the RLVR treadmill, chasing incremental gains on benchmarks that no longer matter. DeepSeek gains the most because it has the compute, the talent, and the freedom to experiment. I expect DeepSeek to release a model trained with RL in pre-train space by Q1 2027, achieving a 15-20% improvement on MATH and GSM8K over the best RLVR models. Anthropic loses the most: its safety framework is built on post-training RL, and this paper undermines the theoretical foundation of that approach.

  1. DeepSeek will release a model trained with RL in pre-train space by Q1 2027, achieving a 15-20% improvement on MATH and GSM8K over the best RLVR models.
  2. OpenAI will internally pivot away from RLVR by Q3 2026, but will not announce it publicly until Q1 2027.
  3. The EU AI Office will require disclosure of pre-training reward architectures by 2028, citing the paper as evidence that post-training alignment is insufficient.

  1. April 2026: arXiv paper published. Paper 'From P(y|x) to P(y)' released, challenging RLVR orthodoxy.
  2. Q3 2026: Expected OpenAI pivot. Prediction: OpenAI internally shifts focus away from RLVR.
  3. Q1 2027: Expected DeepSeek release. Prediction: DeepSeek releases model trained with RL in pre-train space.

[Chart: Projected Reasoning Improvement (MATH) by Approach (estimated)]

  • RLVR is a temporary fix, not a permanent solution; the paper proves it mathematically.
  • The next AI arms race will be in pre-training infrastructure, not finetuning techniques.
  • DeepSeek is the dark horse that could leapfrog OpenAI's reasoning lead.
  • Anthropic's safety narrative is at risk if RLVR is a dead end.
  • Regulators will use this paper to demand more transparency in pre-training.

Source and attribution

arXiv
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
