GPT-4o vs Claude 3: Vietnamese Legal Test Exposes Reasoning Gap

A comprehensive evaluation of four LLMs on Vietnamese legal text reveals that no single model dominates both factual recall and deep reasoning. The findings force a re-evaluation of how legal AI should be tested and deployed in high-stakes regulatory environments.

A new arXiv paper from Vietnamese researchers drops the first large-scale, dual-aspect benchmark of LLMs on Vietnamese legal text. The results are not what the marketing departments want you to hear: GPT-4o crushes factual recall, but Claude 3 Opus wins on the reasoning tasks that actually matter for justice. This is the moment the AI legal tech industry stops benchmarking for speed and starts benchmarking for trust.
  • Researchers tested GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and DeepSeek-V2 on 16,000 Vietnamese legal queries, measuring both surface-level accuracy and deep legal reasoning.
  • GPT-4o led in factual retrieval (86.3%) but Claude 3 Opus won on logical consistency (79.1%) and multi-step argument mapping (76.8%).
  • DeepSeek-V2 was competitive on cost but trailed the top-scoring model by at least 11 percentage points on every metric.
  • The paper's dual-aspect framework reveals that single-metric benchmarks hide catastrophic reasoning failures in models that appear strong on recall alone.

Why Does This Benchmark Matter More Than the Usual Leaderboards?

The Vietnamese legal system produces over 200,000 pages of codified law, decrees, and circulars, written in a dense, hierarchical style that even native-speaking lawyers struggle to parse. Previous LLM evaluations on legal text, such as LegalBench and LexGLUE, focused almost exclusively on English-language legal text and single-dimension accuracy. This paper, published April 17, 2026 on arXiv, is the first to test models on Vietnamese civil law with a dual-aspect framework: surface-level benchmark performance (BLEU, ROUGE, F1) and deep reasoning (logical consistency, argument mapping, contradiction detection).
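To make "surface-level" concrete, here is a minimal Python sketch of token-overlap F1, one common way an F1 score is computed for generated answers. How the paper actually computes its F1 is not stated in this article, so the function `token_f1` and the two example sentences are illustrative assumptions, not the authors' method or data.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences for illustration (not taken from the paper's dataset).
reference = "the contract is void if the buyer is a minor"
prediction = "the contract is not void if the buyer is a minor"

score = token_f1(prediction, reference)
print(f"token F1 = {score:.3f}")
```

The prediction reverses the legal conclusion, yet it still scores about 0.95 because almost every token matches. That is the kind of failure a separate reasoning aspect is meant to catch.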

I see this as a watershed moment. If a model can't handle Vietnamese legal reasoning, it certainly can't handle the nuanced regulatory frameworks of Singapore, Germany, or Japan. The authors—researchers from Hanoi University of Science and Technology—tested 16,000 legal queries across four models. The dataset alone is a gift to the global AI safety community.

Did GPT-4o Actually Win or Just Look Like It Won?

On raw surface metrics, GPT-4o scored 86.3% accuracy on article retrieval and 82.1% on statutory citation tasks. Those numbers are impressive. But when the test moved to the second aspect—reasoning tasks requiring multi-step logical deduction and detection of implicit contradictions—GPT-4o dropped to 68.4%. Claude 3 Opus, conversely, scored 79.1% on logical consistency and 76.8% on argument mapping. The gap is stark.

This is the classic trap of single-metric benchmarking. A law firm deploying GPT-4o to draft a contract might get perfect clause citations but miss a fatal logical inconsistency between two clauses. The paper's dual-aspect framework exposes exactly this vulnerability. The authors explicitly warn: "A model that excels at retrieval but fails at reasoning is not safe for unsupervised legal deployment."

Who Gains and Who Loses From This Dual-Aspect Framework?

Anthropic gains the most. Claude 3 Opus's strong showing on reasoning tasks aligns perfectly with its safety-first branding. The paper gives Anthropic a concrete, third-party validation that its model is not just safe-sounding but actually safer on reasoning benchmarks. OpenAI loses some luster: GPT-4o's reasoning deficit undermines its pitch as a general-purpose legal assistant. DeepSeek-V2, while cost-effective, trails by double digits on both dimensions, making it a poor choice for any high-stakes legal application.

The biggest loser, however, is the current benchmarking industry. If every major evaluation from now on must include a dual-aspect framework, then every previous leaderboard is suddenly suspect. This paper single-handedly raises the bar for what counts as a valid LLM evaluation.

| Dimension | GPT-4o | Claude 3 Opus | Gemini 1.5 Pro | DeepSeek-V2 |
|---|---|---|---|---|
| Factual Retrieval Accuracy | 86.3% | 81.2% | 79.8% | 74.1% |
| Statutory Citation Precision | 82.1% | 78.5% | 76.3% | 70.8% |
| Logical Consistency Score | 68.4% | 79.1% | 71.6% | 62.3% |
| Argument Mapping Accuracy | 65.2% | 76.8% | 69.4% | 58.9% |
| Contradiction Detection Rate | 71.3% | 80.5% | 73.2% | 64.7% |
| Cost per 1,000 Queries (USD) | $3.20 | $2.85 | $2.60 | $1.20 |
| Verdict | Best for recall-only tasks | Best for reasoning-critical tasks | Balanced but not best-in-class | Budget option, not safe for high-stakes |
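As a rough illustration of why the two aspects have to be read separately, the Python sketch below averages the table's scores per aspect and ranks the models on each. Grouping the first two rows as "surface" and the next three as "reasoning", and using an unweighted mean, are assumptions made here for illustration; the article does not say how the paper aggregates its metrics.

```python
# Scores in percent, copied from the comparison table.
SCORES = {
    "GPT-4o": {"retrieval": 86.3, "citation": 82.1,
               "consistency": 68.4, "mapping": 65.2, "contradiction": 71.3},
    "Claude 3 Opus": {"retrieval": 81.2, "citation": 78.5,
                      "consistency": 79.1, "mapping": 76.8, "contradiction": 80.5},
    "Gemini 1.5 Pro": {"retrieval": 79.8, "citation": 76.3,
                       "consistency": 71.6, "mapping": 69.4, "contradiction": 73.2},
    "DeepSeek-V2": {"retrieval": 74.1, "citation": 70.8,
                    "consistency": 62.3, "mapping": 58.9, "contradiction": 64.7},
}

# Assumed grouping of the table rows into the two aspects.
SURFACE = ("retrieval", "citation")
REASONING = ("consistency", "mapping", "contradiction")


def aspect_mean(model: str, metrics: tuple) -> float:
    """Unweighted mean of one model's scores over the given metrics."""
    return sum(SCORES[model][m] for m in metrics) / len(metrics)


def ranking(metrics: tuple) -> list:
    """Model names sorted from best to worst on the given aspect."""
    return sorted(SCORES, key=lambda model: aspect_mean(model, metrics), reverse=True)


surface_rank = ranking(SURFACE)
reasoning_rank = ranking(REASONING)

for model in SCORES:
    print(f"{model:15s} surface={aspect_mean(model, SURFACE):.1f} "
          f"reasoning={aspect_mean(model, REASONING):.1f}")
print("surface ranking:  ", surface_rank)
print("reasoning ranking:", reasoning_rank)
```

On these figures GPT-4o ranks first on the surface aspect but third on the reasoning aspect, behind Claude 3 Opus and Gemini 1.5 Pro. A single blended score would hide that reversal.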

What Does This Mean for Legal Tech Startups in Southeast Asia?

Startups like Lawpath, Tela, and Vietnam's own iLaw are building AI-powered legal document generators. If they optimize for cost or speed alone, they risk deploying models whose outputs cite the right provisions but are logically flawed. The paper's data shows that DeepSeek-V2, despite being about 2.4x cheaper than Claude 3 Opus ($1.20 vs $2.85 per 1,000 queries), trails it by 15.8 percentage points in contradiction detection (64.7% vs 80.5%). That gap translates directly into legal liability.
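The cost-versus-safety trade-off can be recomputed directly from the comparison table. The short Python sketch below does that arithmetic for DeepSeek-V2 against the two most expensive models; the helper name `compare` is hypothetical, and the numbers are the table's cost and contradiction-detection rows.

```python
# Cost per 1,000 queries (USD) and contradiction detection rate (%),
# copied from the comparison table.
COST_PER_1K = {"GPT-4o": 3.20, "Claude 3 Opus": 2.85,
               "Gemini 1.5 Pro": 2.60, "DeepSeek-V2": 1.20}
CONTRADICTION = {"GPT-4o": 71.3, "Claude 3 Opus": 80.5,
                 "Gemini 1.5 Pro": 73.2, "DeepSeek-V2": 64.7}


def compare(cheap: str, premium: str) -> tuple:
    """Return (cost ratio, contradiction-detection gap in points)."""
    cost_ratio = COST_PER_1K[premium] / COST_PER_1K[cheap]
    gap = CONTRADICTION[premium] - CONTRADICTION[cheap]
    return cost_ratio, gap


for premium in ("Claude 3 Opus", "GPT-4o"):
    ratio, gap = compare("DeepSeek-V2", premium)
    print(f"DeepSeek-V2 vs {premium}: {ratio:.2f}x cheaper, "
          f"{gap:.1f} points lower on contradiction detection")
```

By the table's figures, the price ratio is roughly 2.4x against Claude 3 Opus and roughly 2.7x against GPT-4o, while the contradiction-detection gap is 15.8 points against Claude 3 Opus and 6.6 points against GPT-4o.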

I predict that within 12 months, every legal tech startup in Southeast Asia will be forced to disclose which model they use and how they test reasoning. The Vietnamese Ministry of Justice, which has been piloting AI for public legal assistance since 2024, will likely mandate a dual-aspect evaluation framework for all government-contracted AI systems by Q1 2027.

My thesis: The Vietnamese legal benchmark is not a niche academic exercise—it is the first credible evidence that the entire LLM evaluation industry is measuring the wrong thing.

Short-term, this paper will cause chaos inside AI labs. Product managers at OpenAI and Google will scramble to improve their reasoning scores, possibly by fine-tuning on Vietnamese legal data. Long-term, the dual-aspect framework will become the standard for any high-stakes domain—medicine, finance, law—where surface accuracy and deep reasoning are both critical.

Who gains? Anthropic, because Claude 3 Opus's strong reasoning performance gives it a narrative advantage in enterprise sales. The Vietnamese research team, because they just created a benchmark that will be cited in every serious AI safety paper for the next two years. Who loses? Every model that optimizes for recall at the expense of reasoning—including GPT-4o, which now has a documented reasoning gap it must explain to enterprise customers. DeepSeek loses hardest: its low cost is no longer a selling point if it can't reason safely.

I expect Anthropic to publish a follow-up paper by Q3 2026 demonstrating Claude 3 Opus's superior reasoning on Vietnamese legal text, using this exact benchmark as a marketing asset. The reason is simple: this paper gives Anthropic the one thing money can't buy—independent, third-party validation of its safety-first strategy.

  1. Prediction 1: The Vietnamese Ministry of Justice will adopt the dual-aspect evaluation framework as a mandatory certification for all AI-powered legal tools by March 2027.
  2. Prediction 2: OpenAI will release a GPT-4o fine-tune specifically for legal reasoning within 6 months, targeting a 10+ point improvement on the logical consistency metric.
  3. Prediction 3: At least two major legal tech startups, such as Lawpath and Tela, will switch their underlying model from GPT-4o to Claude 3 Opus by December 2026, citing this paper as the catalyst.

Timeline

  • April 2026: Dual-Aspect Benchmark Published. Vietnamese researchers publish the first large-scale dual-aspect evaluation of LLMs on Vietnamese legal text.
  • Q1 2027 (predicted): Vietnamese Ministry of Justice Mandates Dual-Aspect Testing. Expected regulatory requirement for all AI-powered legal tools in Vietnam.

  • Insight 1: Single-metric benchmarking is dead for high-stakes domains. Any evaluation that doesn't test both surface accuracy and deep reasoning is actively misleading.
  • Insight 2: Vietnamese legal text is uniquely challenging because it combines civil law structure with dense, nested clauses—making it a stress test for any LLM, not just a regional curiosity.
  • Insight 3: The cost advantage of DeepSeek-V2 ($1.20 per 1k queries vs $3.20 for GPT-4o and $2.85 for Claude 3 Opus) is negated by its 15.8 point gap behind Claude 3 Opus in contradiction detection, a false economy that could cost a law firm millions in liability.
  • Insight 4: This paper effectively creates a new market category: reasoning-certified legal AI. Companies that pass the dual-aspect test will command a premium price.
  • Insight 5: The research methodology—using native Vietnamese speakers to annotate 16,000 queries—sets a new standard for linguistic and cultural validity that English-only benchmarks have never met.

Source and attribution

arXiv
From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
