RoPE’s Fixed Rotations: The Hidden Waste in Every Transformer
A new arXiv paper argues that the fixed rotation manifold in Rotary Positional Embeddings (RoPE) is a wasted opportunity for Transformer expressivity. By making both temporal and semantic rotations learnable, the authors claim a new dimension of attention capacity is unlocked.
- A new arXiv paper (April 27, 2026) argues that RoPE's rotation space is a 'largely overlooked second dimension of expressivity' in attention mechanisms.
- The authors propose making both temporal (positional) and semantic (content-based) rotations learnable, rather than using fixed hand-crafted angles.
- If validated, this could render current positional encoding methods obsolete and force a re-architecture of attention across all major Transformer-based models.
What Is the Core Argument of 'Learning to Rotate'?
According to the paper published on arXiv on April 27, 2026, titled 'Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling,' the authors make a startling claim: every Transformer architecture today dedicates enormous capacity to learning rich semantic embeddings, yet the rotation manifold acted upon by Rotary Positional Embeddings (RoPE) has been treated as a fixed, hand-crafted structure populated only by discrete ordinal indices. The authors argue that this rotation space is a 'largely overlooked second dimension of expressivity' in the attention mechanism. In essence, they are saying that current models are leaving a massive amount of potential modeling capacity on the table by not learning what the rotations themselves should be.
The original RoPE paper (Su et al., 2021, arXiv:2104.09864) introduced a clever method to encode positional information: each query and key vector is rotated by angles that depend on the token's position, so the attention score between two tokens depends on their relative offset. But those rotation angles are fixed; they are a function of position only, not of the token's meaning. The new paper proposes to break this constraint by making the rotations learnable, not just for positions (temporal) but also for the semantic content of tokens themselves. This effectively transforms the attention mechanism from operating on a static geometric manifold to a dynamic, learned one.
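To make the fixed scheme concrete, here is a minimal NumPy sketch of standard RoPE as described in Su et al. (2021), not code from the new paper. The function names (`rope_angles`, `rotate`, `rope_score`) and the toy dimension of 8 are illustrative assumptions. It rotates a query and a key by position-dependent angles and checks that the resulting score depends only on the offset between the two positions.

```python
import numpy as np

def rope_angles(position, dim, base=10000.0):
    """Fixed RoPE angles for one position: one angle per pair of dimensions."""
    # Frequency schedule from Su et al. (2021): base^(-2i/dim) for i = 0..dim/2-1.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return position * inv_freq

def rotate(x, angles):
    """Rotate the pairs (x[0], x[1]), (x[2], x[3]), ... by the given angles."""
    x_even, x_odd = x[0::2], x[1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

def rope_score(q, k, pos_q, pos_k):
    """Unnormalized attention score between a rotated query and a rotated key."""
    dim = q.shape[0]
    q_rot = rotate(q, rope_angles(pos_q, dim))
    k_rot = rotate(k, rope_angles(pos_k, dim))
    return float(q_rot @ k_rot)

# Toy vectors standing in for one attention head's query and key (assumption).
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same offset (2 positions apart) at two different absolute locations.
score_near = rope_score(q, k, 3, 1)
score_far = rope_score(q, k, 103, 101)
print(np.isclose(score_near, score_far))  # True: only the offset matters
```

Note that nothing in `rope_angles` looks at the vectors themselves: the angles are computed from the position and a fixed base alone, which is the constraint the paper proposes to relax.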

Why Has This Rotation Space Been Overlooked for So Long?
The oversight is partly historical and partly architectural. When RoPE was introduced in 2021, it was a breakthrough because it provided a way to encode relative position without adding separate positional embeddings. The field adopted it quickly—it is now used in Llama 2, Llama 3, Mistral, and many other models—and the fixed rotation scheme became a de facto standard. According to the paper's analysis, researchers implicitly assumed that the rotation space was a 'neutral' carrier of positional information that did not need to be learned. But the authors argue this assumption is wrong: the rotation manifold itself can encode meaningful semantic relationships that the current static scheme misses.
The paper notes that the rotation space is 'systematically explored' in their work, opening 'a new door for attention-based architecture.' This is not a minor tweak—it is a fundamental rethinking of how attention computes relationships between tokens. The fixed rotations of RoPE essentially treat all tokens at position i as having the same rotational relationship to tokens at position j, regardless of what those tokens mean. By making rotations learnable, the model can discover that certain semantic pairs (e.g., 'king' and 'queen') benefit from different rotational dynamics than other pairs (e.g., 'king' and 'table').
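One way to picture a content-dependent rotation is sketched below. This is a hypothetical illustration, not the paper's parameterization, which is not detailed here: the matrix `w_sem`, the function `content_angles`, and the random vectors standing in for 'king' and 'table' are all assumptions made for the example. The idea shown is simply that the rotation angle becomes the usual positional term plus an offset computed from the token vector.

```python
import numpy as np

def fixed_angles(position, dim, base=10000.0):
    """Standard RoPE: angles depend on position only."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return position * inv_freq

def content_angles(position, token_vec, w_sem, base=10000.0):
    """Hypothetical variant: positional angles plus a content-dependent offset."""
    positional = fixed_angles(position, token_vec.shape[0], base)
    semantic = w_sem @ token_vec  # one offset per rotated pair of dimensions
    return positional + semantic

rng = np.random.default_rng(0)
dim = 8
# w_sem would be trained with the rest of the model; it is random here (assumption).
w_sem = 0.1 * rng.standard_normal((dim // 2, dim))
# Random placeholders for two different tokens occupying the same position.
king = rng.standard_normal(dim)
table = rng.standard_normal(dim)

# Under fixed RoPE both tokens at position 5 get identical angles.
# With the content term, the angles differ between the two tokens.
content_gap = np.abs(
    content_angles(5, king, w_sem) - content_angles(5, table, w_sem)
).max()
print(content_gap > 0)

# Setting w_sem to zero recovers the fixed scheme exactly.
print(np.allclose(content_angles(5, king, np.zeros_like(w_sem)), fixed_angles(5, dim)))
```

In this sketch the extra cost is the `w_sem` matrix (here `dim // 2` by `dim` per head), which is the kind of added parameter and training overhead that fixed RoPE avoids.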
What Does This Mean for Current Transformer Architectures?
If validated, the implications are profound. Many widely used models, including Meta's Llama family and Mistral's models, are built around fixed RoPE, and labs have invested heavily in optimizing that design. A learnable rotation space would require a re-architecture of the attention mechanism, potentially making existing models obsolete for long-context and high-fidelity reasoning tasks. The paper claims this approach can 'open a new door' for attention, suggesting that the current ceiling on Transformer performance may be partially due to this overlooked dimension, not just scale or data.
However, the paper is currently a preprint and has not been peer-reviewed. The authors do not provide benchmark comparisons against standard RoPE on common tasks such as Long Range Arena (LRA) or language modeling perplexity. This is a significant gap: the claim is compelling, but the evidence is incomplete. The key question is whether the additional expressivity of learnable rotations justifies the increased computational cost and training complexity.
Who Gains and Who Loses If This Approach Proves Effective?
If the approach is validated, the winners are clear: any lab that can quickly adopt learnable rotations and gain a performance edge. Startups with flexible architectures (e.g., Mistral AI, AI21 Labs) may be more agile than incumbents with massive sunk costs in existing architectures (e.g., OpenAI's GPT-4, Meta's Llama 3). The losers would be companies that have locked themselves into fixed RoPE and cannot easily adapt, potentially losing the long-context race.
| Dimension | Fixed RoPE (Current) | Learnable RoPE (Proposed) |
|---|---|---|
| Rotation angles | Hand-crafted, fixed | Learned from data |
| Semantic encoding | None (position only) | Yes (content-based) |
| Training complexity | Low (no extra params) | Higher (learnable rotation matrix) |
| Long-context performance | Degrades beyond training length | Potentially superior (extrapolation) |
| Adoption risk | None (proven) | High (unproven) |
| Verdict | Safe but limited | Risky but potentially transformative |
My thesis: The fixed rotation manifold of RoPE is the single largest untapped architectural lever in modern Transformers, and 'Learning to Rotate' correctly identifies it—but the paper's lack of benchmarks makes it an intriguing hypothesis, not a proven result. In the short term, this paper will spark a wave of replication attempts and ablation studies. Expect to see results within 6 months from groups at Stanford, MIT, and DeepMind. In the long term, if the approach works, it will fundamentally change how we think about positional encoding: from a fixed geometric scaffold to a learned semantic space. The biggest gainers will be small, agile labs that can quickly iterate on the idea; the biggest losers will be incumbents with massive infrastructure tied to fixed RoPE. I predict that by Q1 2027, at least one major model release will incorporate a form of learnable rotation, likely from a startup rather than a Big Tech lab.
Predictions
- By Q1 2027, at least one major open-source model (e.g., from Mistral AI or AI21 Labs) will incorporate learnable rotations and show a 5-10% improvement on long-context benchmarks.
- By Q2 2027, Google DeepMind will publish a replication study confirming or refuting the core claims of this paper, given their strong interest in efficient attention mechanisms.
- The EU AI Office will not directly regulate this, but the European Commission's Horizon Europe program will fund a project exploring learnable rotations for multilingual models by 2028.
Timeline
- April 2021: Original RoPE paper published. Su et al. introduce Rotary Positional Embeddings, fixing rotation angles to hand-crafted functions of position.
- 2022-2025: RoPE becomes standard. RoPE is adopted by Llama 2, Llama 3, Mistral, and many other models, becoming the dominant positional encoding method.
- April 27, 2026: 'Learning to Rotate' paper published. The new arXiv paper argues RoPE's fixed rotations are a missed opportunity and proposes learnable temporal and semantic rotations.
- Expected Q1 2027: First major model with learnable rotations. Predicted: a startup or open-source model will incorporate learnable rotations and show long-context improvements.
Article Summary
- The fixed rotation manifold in RoPE is an overlooked source of expressivity, and the paper presents itself as a systematic exploration of it.
- The lack of benchmarks in the paper means the field should treat it as a hypothesis, not a result—replication is critical.
- If validated, this will force a re-architecture of attention, with startups likely to benefit more than incumbents.
- The semantic dimension of rotations could unlock new capabilities in long-context and reasoning tasks, though learnable rotations add some parameters and training complexity compared with fixed RoPE.
- Expect a wave of follow-up work within 6 months; the idea is too compelling to ignore.
Source and attribution
arXiv
Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling