Research - Latest News & Updates

Research Desk

15.05.2026 00:18

AdamW Is Dead for Tabular MLPs: Lion and Sophia Win the Benchmark

A rigorous benchmark of 19 optimizers on 45 tabular datasets shows that Lion and Sophia beat AdamW, the default optimizer for tabular MLPs. This paper tells the field: your go-to choice is leaving performance on the table.

14.05.2026 00:19

AD4AD Exposes Autonomous Driving's Blind Spot: Anomaly Detection Failures

AD4AD is not another academic benchmark; it is a wake-up call. It proves that today's anomaly detection models, trained on clean datasets, fail to recognize the very obstacles that cause crashes. This article argues that the industry must pivot to scene-level, risk-aware metrics or accept liability for preventable accidents.

14.05.2026 00:19

MAny: The Paper That Exposes a Hidden MLLM Crisis

The MAny paper identifies a critical blind spot in multimodal instruction tuning: forgetting isn't just about language reasoning, but also about visual perception and parameter stability. The paper's merging approach offers a practical fix, but the real question is who will commercialize it first.

13.05.2026 00:23

LLMs Beat VLMs at Spatial Reasoning: Vision Is Overrated

A new arXiv preprint demonstrates that LLMs can reason about spatial transformations through text alone, challenging the assumption that vision is required for spatial intelligence. This has profound implications for robotics, autonomous systems, and the ongoing debate between pure language models and multimodal approaches.

13.05.2026 00:23

LLM Judges Are Lying: 67% of Evaluations Are Inconsistent

New research reveals that LLM-as-judge frameworks suffer from per-instance inconsistency masked by aggregate metrics. The paper proposes conformal prediction sets as a diagnostic tool, but the findings suggest that current evaluation pipelines are unreliable.

13.05.2026 00:23

Quantum Embeddings Flop in GNN Benchmark—But Expose Deeper Flaw

A controlled benchmark of node embeddings for graph neural networks finds quantum-oriented methods underperform classical baselines. But the paper's real contribution is exposing widespread methodological sloppiness in GNN evaluation that has inflated reported performance for years.

12.05.2026 00:19

Shortest Path Exposes LLM Generalization Fraud

The shortest-path benchmark proves LLMs are pattern matchers, not reasoners. This article names the winners and losers in the coming hybrid-AI shakeout.

12.05.2026 00:19

VAKRA Benchmark: Agents Are Still Failing at Basic Tool Use

The VAKRA benchmark from IBM Research reveals that current AI agents are dangerously unreliable at multi-step tool use and reasoning. This article explains why VAKRA matters, who wins and loses, and what developers must do to build production-ready agents.

11.05.2026 00:16

RLVR Is Dead: Next Gen Reasoning Lives in Pre-Train Space

A new arXiv paper argues that RLVR's gains on reasoning tasks are bounded by the base model's output distribution. The solution: shifting reinforcement learning into the pre-training phase to optimize the marginal distribution P(y) itself.

09.05.2026 00:12

SceneCritic: The End of Vibe-Check AI Evaluation

SceneCritic replaces subjective LLM/VLM judges with a deterministic, symbolic evaluator for 3D indoor scenes. This kills the unreliable 'vibe-check' method, forcing companies like Nvidia and Meta to adopt transparent, reproducible benchmarks or lose credibility.

09.05.2026 00:12

On-Policy Distillation: The Hidden Trap in LLM Post-Training

A systematic investigation into on-policy distillation reveals two critical conditions for success that most labs are ignoring. The paper shows that OPD fails when teacher-student thinking patterns are incompatible or when the teacher offers only marginal score improvements, challenging the dominant post-training paradigm.

09.05.2026 00:12

Introspective Diffusion Kills Autoregressive LLMs

If this paper is real and reproducible, every major LLM company needs to panic. The autoregressive transformer — the architecture behind ChatGPT, Claude, and Gemini — just got a credible challenger that is both more sample-efficient and more controllable.
