VAKRA Benchmark: AI Agents Still Fail at Tool Use

IBM Research dropped VAKRA, a benchmark for visual agent reasoning and tool use, and the results are damning. Top models like GPT-4o and Claude 3.5 Sonnet fail roughly half the time on tasks that require chaining API calls and reasoning about tool outputs.

IBM Research's VAKRA benchmark tests agents on visual reasoning and tool use across 1,000+ tasks, revealing failure rates of 47% or higher for every model tested.
GPT-4o scored highest but still failed 47% of tasks; Claude 3.5 Sonnet failed 52%; open-source models like Llama 3.1 failed over 80%.
The key finding: agents fail not because of reasoning alone, but because they cannot recover from tool execution errors—a critical flaw for production systems.
This article argues that the future belongs not to more powerful LLMs, but to agentic frameworks that enforce structured error recovery and safety checks.

What Makes VAKRA Different From Other Benchmarks?

VAKRA (Visual Agent for Knowledge-intensive Reasoning and Action) isn't just another leaderboard. IBM Research designed it to test agents on multi-step tool use with visual reasoning—think of an agent that must read a chart, query a database, then call an API to produce a result. The benchmark includes over 1,000 tasks across domains like finance, healthcare, and logistics. According to the IBM Research blog published on April 15, 2026, the benchmark exposes failure modes that simpler benchmarks like GSM8K or HumanEval miss: tool call ordering errors, hallucinated API parameters, and inability to recover from intermediate failures.
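To make the tool call ordering failure concrete, here is a minimal sketch of a pre-execution check on a planned call sequence. It is not taken from VAKRA: the tool names and the `PREREQUISITES` table are hypothetical, chosen to mirror the chart, database, and API example above.

```python
# Hypothetical tools and prerequisites (illustrative only, not part of VAKRA).
PREREQUISITES = {
    "read_chart": set(),
    "query_database": {"read_chart"},
    "call_pricing_api": {"read_chart", "query_database"},
}


def find_ordering_errors(planned_calls):
    """Return (position, tool, missing prerequisites) for each out-of-order call."""
    completed = set()
    errors = []
    for position, tool in enumerate(planned_calls):
        # Tools absent from the table are treated as having no prerequisites.
        missing = PREREQUISITES.get(tool, set()) - completed
        if missing:
            errors.append((position, tool, sorted(missing)))
        # Mark the tool as seen even when it was out of order, so one
        # misplaced call is reported once rather than cascading.
        completed.add(tool)
    return errors
```

Run against a plan that queries the database before reading the chart, the check flags position 0 and nothing else. A real orchestrator would likely block or reorder the call instead of only reporting it.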

Why Do Agents Fail So Spectacularly on VAKRA?

The results are brutal. GPT-4o achieved the highest accuracy at 53%, meaning it fails nearly half the time. Claude 3.5 Sonnet scored 48%, and Gemini 1.5 Pro hit 45%. Open-source models collapsed: Llama 3.1 70B managed only 19%, and Mistral Large 24.7%. IBM's analysis identifies three core failure modes: tool call hallucination (agents inventing API endpoints), state tracking failures (losing context after a tool returns unexpected data), and inability to retry (agents giving up after a single error). This is damning evidence that current agents lack the robustness for any production workflow that involves more than a single API call.
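Tool call hallucination is the easiest of the three to guard against in code. The sketch below checks a proposed call against a registry of known tools before anything executes. The registry contents, the `ToolCall` type, and `validate_tool_call` are assumptions made for illustration; VAKRA and the agent frameworks mentioned in this article may structure this differently.

```python
from dataclasses import dataclass

# Hypothetical tool registry: names and parameter types are illustrative only.
TOOL_REGISTRY = {
    "query_database": {"required": {"sql": str}, "optional": {"limit": int}},
    "read_chart": {"required": {"image_path": str}, "optional": {}},
}


@dataclass
class ToolCall:
    name: str
    arguments: dict


def validate_tool_call(call: ToolCall) -> list[str]:
    """Return a list of problems; an empty list means the call may run."""
    spec = TOOL_REGISTRY.get(call.name)
    if spec is None:
        # The model named a tool that does not exist.
        return [f"unknown tool: {call.name}"]

    problems = []
    allowed = {**spec["required"], **spec["optional"]}
    for param in spec["required"]:
        if param not in call.arguments:
            problems.append(f"missing required parameter: {param}")
    for param, value in call.arguments.items():
        if param not in allowed:
            problems.append(f"unexpected parameter: {param}")
        elif not isinstance(value, allowed[param]):
            problems.append(
                f"wrong type for {param}: expected {allowed[param].__name__}"
            )
    return problems
```

Returning the list of problems, rather than raising on the first one, lets the orchestrator feed all of them back to the model in a single correction prompt.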


Who Should Be Worried About These Results?

Every developer building agentic workflows should be alarmed. The VAKRA results directly challenge the narrative that LLMs are ready for autonomous operation. Losers include any startup selling "autonomous AI agents" without a safety framework; think of companies like Adept AI, which demoed agents but hasn't published benchmark results. Winners include platform providers that enforce structured execution: Microsoft Copilot (which uses Microsoft Graph grounding), LangChain (which offers built-in error handling), and IBM's own watsonx Orchestrate. These platforms wrap the model in orchestration meant to contain the failure modes VAKRA exposes, rather than leaving recovery to the LLM.

Feature	GPT-4o (OpenAI)	Claude 3.5 Sonnet (Anthropic)	Llama 3.1 70B (Meta)
VAKRA Accuracy	53%	48%	19%
Tool Hallucination Rate	22%	27%	41%
State Tracking Errors	15%	18%	33%
Retry Success Rate	58%	51%	12%
Open Source?	No	No	Yes
Verdict	Best overall, but still unreliable	Strong reasoning, weak tool recovery	Unusable for production

Can Open-Source Models Ever Catch Up on Agentic Tasks?

The VAKRA results suggest a bleak outlook for open-source models in agentic workflows. Llama 3.1 70B's 19% accuracy is not just low; it is catastrophic for any real-world use. IBM's analysis shows that open-source models struggle particularly with tool call ordering, likely because their training data lacks sufficient examples of multi-step API interactions. I predict that open-source models will need specialized fine-tuning on agentic trajectories, similar to how Code Llama was trained on code. Meta should prioritize this by Q4 2026, or risk losing the agent market entirely to closed-source models.

My thesis: VAKRA proves that the agent hype cycle has peaked, and the next frontier is not better LLMs but safer orchestration frameworks. In the short term, this benchmark will cause a backlash against autonomous agents, especially in regulated industries like healthcare and finance. Enterprises that rushed to deploy customer-facing agents will need to add human-in-the-loop checks. In the long term, the winners will be companies that build agentic middleware—tools that handle retries, state management, and tool validation. I expect LangChain to acquire a startup focused on agent safety by Q3 2026, because their current framework still leaves too much to the LLM. IBM gains credibility here by publishing hard data, but they need to productize these insights into watsonx quickly. The losers are any vendor selling agents as a black-box solution—Adept AI, Inflection AI, and even some of Anthropic's enterprise customers will face tough questions.

What Should Developers Do Differently After VAKRA?

Stop trusting agents to be autonomous. VAKRA's failure modes—tool hallucination, state loss, and retry refusal—are not bugs; they are features of current LLM architecture. Developers must: (1) implement explicit retry policies with backoff, (2) validate tool call arguments against a schema before execution, and (3) maintain a separate state tracker outside the LLM's context window. The VAKRA benchmark code is open-source on GitHub; every team building agents should run it against their own stack before going to production.
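As a rough illustration of the retry and external state pieces, here is a minimal sketch, assuming tools are plain Python callables that raise a `ToolError` on recoverable failures. `ToolError`, `RunState`, and `call_with_retry` are hypothetical names invented for this example, not part of VAKRA or any framework named in this article.

```python
import time
from dataclasses import dataclass, field


class ToolError(Exception):
    """Raised by a tool when a call fails in a way that may be retried."""


@dataclass
class RunState:
    """Record of every attempt, kept outside the model's context window."""
    history: list = field(default_factory=list)

    def record(self, tool_name, attempt, outcome):
        self.history.append(
            {"tool": tool_name, "attempt": attempt, "outcome": outcome}
        )


def call_with_retry(tool, tool_name, arguments, state,
                    max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call tool(**arguments), retrying on ToolError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(**arguments)
        except ToolError as exc:
            state.record(tool_name, attempt, f"error: {exc}")
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
        else:
            state.record(tool_name, attempt, "ok")
            return result
```

Because `RunState` lives in the orchestrator rather than in the prompt, the attempt history survives even if the model loses track of earlier steps. The `sleep` parameter exists so tests can replace the real delay with a stub.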

Predictions

By Q1 2027, OpenAI will release a dedicated agent model trained on VAKRA-like data, achieving >70% accuracy, but will require a new pricing tier.
LangChain will acquire an agent safety startup (e.g., Guardrails AI) by Q3 2026, integrating structured error recovery into its core library.
The EU AI Office will cite VAKRA results in its upcoming guidance on high-risk AI agents, mandating human oversight for any agent with >20% failure rate on production tasks.

April 2026
VAKRA benchmark published
IBM Research releases VAKRA benchmark on Hugging Face, testing agents on visual reasoning and tool use.
December 2025
Agent hype peaks
Multiple startups raise funding for autonomous agents, before VAKRA exposes their unreliability.
May to June 2024
GPT-4o and Claude 3.5 Sonnet released
OpenAI releases GPT-4o (May 2024) and Anthropic releases Claude 3.5 Sonnet (June 2024); both are later tested on VAKRA.

Article Summary

VAKRA systematically measures agent failure modes in multi-step tool use, not just reasoning accuracy.
Every model tested fails at least 47% of tasks, and open-source models are virtually unusable for production agentic workflows.
The core problem is not reasoning but execution safety: agents cannot recover from tool errors, hallucinate APIs, and lose state.
Platforms that enforce structured orchestration (Microsoft Copilot, LangChain) will win over pure LLM vendors.
Developers must implement explicit retry and validation layers, and should run VAKRA on their own stacks before deploying agents.