Research - Latest News & Updates

Research Desk

08.07.2026 00:19

PEEU: Open-Source GUI Agents Beat GPT-4V on Task Planning

The PEEU method enables small open-source MLLMs to autonomously explore GUI environments and learn from hindsight experience, achieving superior task planning compared to GPT-4V. This shifts the cost-privacy-performance tradeoff in favor of open-source agents.

08.07.2026 00:19

RiVER: RL Without Ground Truth Beats Answer-Key Training

RiVER uses deterministic execution feedback as continuous-valued supervision, enabling group-relative RL on tasks like code optimization and logistics ranking where no ground truth exists. The paper claims this outperforms standard RLVR on score-based benchmarks.
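As a rough illustration of what "group-relative" supervision on continuous scores can look like, the sketch below normalizes execution scores within one group of sampled candidates. It is a generic group-normalization pattern, not RiVER's actual formulation; the function name `group_relative_advantages`, the `eps` constant, and the example scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores, eps=1e-8):
    """Center and scale continuous scores within one sampled group.

    Hypothetical helper: each candidate is judged relative to its
    siblings for the same prompt, so no ground-truth answer is needed.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population std over the group
    return [(s - mu) / (sigma + eps) for s in scores]


# Made-up execution scores (e.g. a measured speedup or a ranking metric)
# for four candidate solutions sampled for a single prompt.
scores = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(scores)
```

Candidates scoring above the group mean receive a positive advantage and those below receive a negative one, which is the signal a policy-gradient update would then weight.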

08.07.2026 00:19

When Likely Answers Are Wrong: LLM Probability-Correctness Gap

The paper demonstrates that sequence probability correlates with correctness only for constrained tasks, and that maximizing probability can actually reduce accuracy on open-ended generation. This forces a re-evaluation of how decoding methods are deployed.

06.07.2026 00:41

Tapered LLMs: The End of Uniform Depth Layers?

The 'Tapered Language Models' paper, posted on arXiv in June 2026, provides evidence that uniform parameter allocation across layers is inefficient. This analysis explores what the evidence supports, who benefits, and what changes are likely in model design.

06.07.2026 00:41

LLMs Fail to Self-Report Adversarial Prefills, Study Finds

The study tested ten open-weight LLMs on four safety benchmarks and found that no model reliably identifies its own compromised outputs. This finding challenges prior work on LLM introspection and suggests that self-report mechanisms are insufficient for safety-critical applications.

05.07.2026 00:22

Program Synthesis Unlocks Attention Head Logic

The paper introduces a novel interpretability technique that uses program synthesis to approximate attention head behavior. Early results suggest promise, but scalability and faithfulness remain open questions.

04.07.2026 00:41

Open Models Fail Agentic Benchmark: Hugging Face Shows Gap

Hugging Face's new 'Is it agentic enough?' benchmark provides a practical tool for evaluating open models on agentic tasks, but the results reveal a clear reliability gap between open and closed models. This analysis explains the benchmark, its implications for developers, and how to choose the right model for production agentic workflows.

04.07.2026 00:19

DeepRubric: Evidence Trees Fix RL Research Agents' Blind Spot

DeepRubric introduces evidence-tree rubric supervision for RL-based research agents, improving report completeness by anchoring rewards to explicit evidence structures. This method outperforms baseline rubric generation but raises questions about scalability and domain dependency.

03.07.2026 00:36

DP-FL's Privacy Cloak Hides Backdoor Attacks

New research reveals that differential privacy in federated learning can inadvertently shield backdoor attacks from detection, turning a presumed defense into an attacker's cloak. The paper provides empirical evidence that compliant DP updates evade current defenses while non-compliant ones are caught.

03.07.2026 00:36

Exact Posterior Score Ends Diffusion Steering's Free Lunch

The paper introduces a mathematical identity that turns a pretrained unconditional denoiser into an exact posterior sampler for linear inverse problems. This removes the need for approximate measurement-matching corrections or task-specific retraining.

03.07.2026 00:17

Phase Dominates Neural Nets: Oppenheim-Lim Test Reveals Hidden Bias

A new arXiv preprint shows that when the phase information of two images is swapped inside a neural network's hidden layers, the classifier's prediction follows the phase donor, not the magnitude donor. This internal Oppenheim-Lim test reveals that phase dominates in PRISM2D, GFNet, and ViT-B/16, challenging standard interpretability approaches.
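For readers unfamiliar with a phase swap, the sketch below shows the basic operation on plain 1D signals: take the Fourier magnitude from one signal and the Fourier phase from another, then invert. This is only an illustration under simplifying assumptions. The study applies the swap to images inside hidden layers, whereas this example uses a naive 1D DFT on raw inputs, and the helper names `dft`, `idft`, and `swap_phase` are hypothetical.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex list."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(X):
    """Naive inverse DFT, matching dft() above."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def swap_phase(mag_donor, phase_donor):
    """Combine the magnitude spectrum of one signal with the phase of another."""
    A = dft(mag_donor)
    B = dft(phase_donor)
    hybrid = [abs(a) * cmath.exp(1j * cmath.phase(b)) for a, b in zip(A, B)]
    # For real inputs the hybrid spectrum is conjugate-symmetric, so the
    # imaginary part of the inverse is numerical noise and is dropped.
    return [v.real for v in idft(hybrid)]


# Made-up example signals.
a = [1.0, 2.0, 0.0, -1.0]
b = [0.0, 1.0, 3.0, 2.0]
hybrid_signal = swap_phase(a, b)
```

Swapping a signal with itself returns the original, and the hybrid keeps the magnitude spectrum of `a` while carrying the phase of `b`.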

02.07.2026 00:08

ClinHallu Exposes Where Medical AI Hallucinations Really Start

ClinHallu provides the first stage-wise hallucination diagnosis for medical MLLMs, revealing that errors originate at different reasoning stages depending on the clinical case. This changes how developers should evaluate and improve models for clinical decision support.
