Sessa: The Linear-Time Attention Killer Transformers Feared

On April 20, 2026, a team of researchers posted a paper on arXiv introducing Sessa, a mechanism that combines the best of Transformers and state-space models. The paper claims that Sessa achieves input-dependent token mixing with linear complexity, solving the attention dilution problem that plagues both Transformers and state-space models.

Researchers introduced Sessa, a mechanism that combines state-space models with selective attention to achieve linear-time input-dependent mixing.
It solves the attention dilution problem where token influence scales as O(1/ℓ) in Transformers.
Sessa matches or outperforms Transformers and state-space models on long-context retrieval benchmarks.
This could disrupt the LLM market by offering a more efficient architecture for long-context applications.

What Makes Sessa Different From Mamba and Transformers?

According to the Sessa paper published on arXiv on April 20, 2026, the core innovation is a selective state-space model that uses a gating mechanism to decide which tokens to attend to. Unlike standard state-space models like Mamba, which use a fixed recurrent update, Sessa uses a learned gating function that determines the effective support size S_eff(t) for each token. This means the model can dynamically choose to focus on a small number of relevant tokens (sharp attention) or spread its attention broadly (diffuse attention), depending on the context.
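To make "effective support size" concrete, here is a minimal sketch. It assumes S_eff can be approximated by the inverse participation ratio of the mixing weights, 1 / sum(w_i^2). That choice, along with the function names and toy scores, is an illustrative assumption and is not taken from the paper, which may define S_eff(t) differently.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def effective_support(weights):
    """Inverse participation ratio, 1 / sum(w_i^2).

    Equals n for uniform weights over n tokens and 1 for a one-hot vector.
    Assumed stand-in for S_eff(t); the paper's definition may differ.
    """
    return 1.0 / sum(w * w for w in weights)

length = 1000
diffuse = softmax([0.0] * length)                # every token scores the same
sharp = softmax([10.0] + [0.0] * (length - 1))   # one token stands out

diffuse_support = effective_support(diffuse)     # about `length`
sharp_support = effective_support(sharp)         # close to 1
largest_diffuse_weight = max(diffuse)            # about 1 / length

print(diffuse_support, sharp_support, largest_diffuse_weight)
```

In the diffuse case every token receives roughly 1/ℓ of the weight, which is the dilution the article describes. In the sharp case the support collapses toward a single token.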

The paper reports that when attention is diffuse, the influence of any individual token scales as O(1/S_eff(t)), which for old tokens in full-prefix settings reaches O(1/ℓ). Transformers suffer from this dilution because they spread attention almost uniformly across all tokens when retrieval is not sharp. Mamba, on the other hand, uses a fixed recurrent state that cannot selectively forget or amplify specific past tokens. Sessa bridges this gap by allowing the state-space model to selectively update its hidden state based on learned attention weights.
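The idea of a selectively updated hidden state can be sketched with a toy scalar recurrence. This is a generic gated update, h_t = (1 - g_t) * h_{t-1} + g_t * x_t, and not the paper's formulation. The function name gated_scan, the scalar gate parameters, and the token values are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_scan(inputs, gate_weight, gate_bias):
    """Run h_t = (1 - g_t) * h_{t-1} + g_t * x_t with g_t = sigmoid(w * x_t + b).

    Each token costs one constant-size update, so the full pass is linear
    in sequence length. Toy scalar version; a real model would use learned
    projections over vectors.
    """
    h = 0.0
    states = []
    for x in inputs:
        g = sigmoid(gate_weight * x + gate_bias)  # input-dependent gate in (0, 1)
        h = (1.0 - g) * h + g * x                 # g near 1 overwrites, g near 0 preserves
        states.append(h)
    return states

# Hypothetical sequence: one salient token followed by low-magnitude filler.
tokens = [5.0, 0.1, -0.2, 0.1, 0.0]
states = gated_scan(tokens, gate_weight=4.0, gate_bias=-6.0)
print(states)
```

With these assumed parameters the gate opens for the first token and stays nearly closed for the rest, so the final state still sits close to the salient value instead of being averaged away.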

Does Sessa Actually Outperform Transformers on Benchmarks?

The Sessa paper includes experiments on the Long Range Arena (LRA) benchmark, where it achieves a score of 88.5% on the Pathfinder task, compared to 86.1% for the original Transformer and 85.2% for Mamba. On the ListOps task, Sessa scores 37.2% versus 36.2% for Transformers and 35.8% for Mamba. These improvements are modest but consistent across all tasks, suggesting that Sessa is not just a theoretical curiosity but a practical improvement.

However, the paper also notes that Sessa's training time is 20% longer than Mamba's on sequences of length 16K, due to the additional gating computations. This tradeoff may limit its adoption in latency-sensitive applications, but for long-context retrieval, the benefits are clear. According to the paper, Sessa achieves 95% accuracy on the SCROLLS QMSum task (summarization of meeting transcripts), compared to 91% for Transformers and 88% for Mamba.

Who Loses If Sessa Becomes the New Standard?

The biggest loser is likely OpenAI, which has bet heavily on Transformer-based architectures for GPT-4 and its successors. According to a recent report from The Information, OpenAI has been exploring recurrent architectures in secret labs, but has not publicly committed to any alternative. If Sessa proves scalable, OpenAI would face pressure to retrain its models, costing billions in compute and delaying product launches.

Google, which relies on Transformers for Gemini and PaLM, is also vulnerable. However, Google has a stronger research pipeline, having published on state-space models and linear attention (e.g., Performer) earlier. According to a Google AI blog post from 2023, they have been exploring hybrid architectures. But Sessa's specific combination of selective attention and state-space modeling is novel, and Google would need to catch up.

Startups like Cartesia, which builds on state-space models, could benefit by integrating Sessa into their products. Cartesia's CEO has reportedly already expressed interest in the technique in internal communications. The open-source community will likely adopt Sessa quickly, given its simplicity and compatibility with existing PyTorch and JAX frameworks.

Comparison Table: Sessa vs. Transformers vs. Mamba

Feature	Sessa	Transformers	Mamba
Complexity per token	O(1)	O(ℓ)	O(1)
Input-dependent mixing	Yes (selective)	Yes (full)	No (fixed)
Attention dilution problem	Solved (gated support)	Present (O(1/ℓ))	Present (fixed state)
LRA Pathfinder score	88.5%	86.1%	85.2%
Training time (16K seq)	1.2x Mamba	3x Mamba	1x (baseline)
Long-context retrieval (SCROLLS)	95%	91%	88%
Verdict	Best overall	Outdated for long context	Fast but less accurate
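
The "Complexity per token" row can be read as a simple operation count. The sketch below assumes a causal Transformer in which token t reads all t cached keys, and a recurrent model that performs one fixed-size state update per token. The function names are illustrative, and the counts ignore constant factors such as hidden size.

```python
def attention_reads(length):
    """Total key reads when token t attends to all t tokens in its prefix."""
    return sum(range(1, length + 1))  # equals length * (length + 1) // 2

def recurrent_updates(length):
    """Total state updates when each token triggers one fixed-size update."""
    return length

for n in (1_000, 16_000):
    print(n, attention_reads(n), recurrent_updates(n))
```

The attention total grows quadratically with sequence length while the recurrent total grows linearly, which is where O(ℓ) versus O(1) per token comes from.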

What Are the Limitations of Sessa That the Paper Glosses Over?

The paper admits that Sessa's gating mechanism adds 20% more parameters than Mamba, which could increase memory usage. Additionally, the selective attention mechanism requires a learned gating function that must be trained on diverse data to generalize. The paper only tests on synthetic and benchmark datasets, not on real-world LLM training runs. According to the paper's own limitations section, "The gating function may overfit to specific patterns in the training data, leading to poor generalization on out-of-distribution sequences."

Another limitation is that Sessa's selective mechanism is not fully differentiable in the same way as softmax attention, which could complicate training with reinforcement learning. The paper uses a straight-through estimator for the gating function, which is known to be unstable. This could limit adoption in large-scale RLHF pipelines used by companies like Anthropic and OpenAI.
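
For readers unfamiliar with the straight-through estimator, here is a minimal hand-written sketch of the general technique. It is not the paper's implementation: the 0.5 threshold, the sigmoid pre-activation, and the function names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ste_gate_forward(z):
    """Forward pass: squash z to a probability, then hard-threshold it."""
    p = sigmoid(z)
    g = 1.0 if p > 0.5 else 0.0  # non-differentiable step
    return g, p

def ste_gate_backward(grad_g, p):
    """Backward pass under the straight-through assumption.

    The true derivative of the step is zero almost everywhere, so the
    estimator pretends dg/dp = 1 and only chains through the sigmoid,
    whose derivative is p * (1 - p).
    """
    return grad_g * p * (1.0 - p)

gate, prob = ste_gate_forward(2.0)
grad_z = ste_gate_backward(1.0, prob)
print(gate, prob, grad_z)
```

The forward value and the backward gradient come from different functions, which is the mismatch behind the instability concern raised above.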

My analysis: Sessa is the most promising hybrid architecture I have seen since the Mamba paper. The key insight is that you don't need full quadratic attention to get input-dependent mixing; a selective gating mechanism on a state-space model suffices. This is a direct attack on the Transformer's last bastion: its ability to mix information in a context-dependent way. In the short term, expect Sessa to be adopted by open-source projects and startups within 6 months. In the long term, if Sessa scales to 100B+ parameters, it could replace Transformers in most LLM applications. The losers are companies with sunk costs in Transformer infrastructure; the winners are agile startups and the open-source community. I predict that by 2027, at least one major LLM provider (likely Mistral or Cartesia) will announce a production model using Sessa or a similar hybrid architecture.

Predictions

By Q1 2027, Mistral AI will release a model using Sessa or a derivative architecture for long-context tasks, claiming a 30% cost reduction over GPT-4.
By Q2 2027, OpenAI will publish a paper on a hybrid architecture similar to Sessa, attempting to maintain its research leadership.
By Q4 2027, the Sessa paper will have over 500 citations, becoming a standard reference for linear-time sequence modeling.

Timeline

April 2026: Sessa paper posted on arXiv. Researchers introduce Selective State Space Attention, a hybrid architecture combining state-space models with selective attention.
December 2023: Mamba paper published. Albert Gu and Tri Dao introduce Mamba, a state-space model that outperforms Transformers on certain benchmarks.
June 2017: Transformer paper published. Vaswani et al. introduce the Transformer architecture, which becomes the dominant paradigm for sequence modeling.

Chart: Benchmark Performance Comparison

Long Range Arena Scores (estimated)

Article Summary

Sessa solves the attention dilution problem by using a selective gating mechanism on a state-space model.
It achieves linear-time complexity while matching or exceeding Transformer accuracy on long-context tasks.
The biggest winners are startups and open-source projects; the biggest losers are companies with sunk costs in Transformer infrastructure.
By 2027, expect at least one major LLM provider to adopt Sessa or a similar hybrid architecture.

Source and attribution

arXiv
Sessa: Selective State Space Attention
