LLM Judges Are Lying: Up to 67% of Documents Get Inconsistent Judgments
New research reveals that LLM-as-judge frameworks suffer from per-instance inconsistency masked by aggregate metrics. The paper proposes conformal prediction sets as a diagnostic tool, but the findings suggest that current evaluation pipelines are unreliable.
- Researchers applied a two-pronged diagnostic toolkit to SummEval, revealing that 33-67% of documents exhibit at least one transitivity violation (a directed 3-cycle) when judged by LLMs, despite aggregate violation rates of only 0.8-4.1%.
- The paper proposes split conformal prediction sets over 1-5 Likert scores as a per-instance uncertainty quantification method, but the core finding is that LLM judges are far less reliable than their aggregate scores suggest.
- The key tension: organizations deploying LLM judges for automated evaluation are making decisions based on metrics that obscure pervasive per-instance failures, raising serious questions about the validity of any evaluation pipeline that lacks uncertainty quantification.
Why Are Aggregate Metrics Hiding the Real Problem?
The paper's most striking finding is the disconnect between aggregate and per-instance reliability. When the researchers measured transitivity violations (instances where an LLM judge says A > B and B > C, but also A < C), the aggregate violation rate across SummEval documents was only 0.8-4.1%. This looks like a minor issue. But when they examined individual documents, they found that 33-67% of documents contained at least one directed 3-cycle. The two figures measure different things: the aggregate rate pools every comparison, so the many consistent ones dilute the violations, while the per-document figure asks only whether any cycle occurs at all.
This is not Simpson's paradox, where a trend reverses under aggregation, but a unit-of-analysis problem. A violation that is rare per comparison can still touch a large share of documents, because each document contributes many comparisons and a single cycle is enough to count. The low aggregate rate suggests the judge is reliable, but the per-instance analysis reveals that for any given document, there's a non-trivial chance the judge is contradicting itself. The researchers' transitivity analysis is essentially a canary in the coal mine: it exposes a systemic flaw that aggregate metrics hide.
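To see how both numbers can be true at once, here is a minimal sketch with made-up counts. It assumes the aggregate rate is computed over all candidate triples, which this article does not confirm; the function name, the 16 candidates per document, and the cycle counts are all illustrative and not taken from the paper.

```python
def violation_rates(cycles_per_doc, triples_per_doc):
    """Return (aggregate triple-level rate, share of documents with >= 1 cycle).

    cycles_per_doc: number of cyclic triples found in each document (hypothetical).
    triples_per_doc: number of candidate triples per document (assumed equal).
    """
    total_triples = triples_per_doc * len(cycles_per_doc)
    aggregate = sum(cycles_per_doc) / total_triples
    per_document = sum(1 for c in cycles_per_doc if c > 0) / len(cycles_per_doc)
    return aggregate, per_document


# Hypothetical data: 10 documents, 16 candidates each -> C(16, 3) = 560 triples.
cycles = [0, 12, 0, 30, 5, 0, 0, 18, 0, 7]
agg, per_doc = violation_rates(cycles, triples_per_doc=560)

print(f"aggregate: {agg:.1%}")         # aggregate: 1.3%
print(f"per-document: {per_doc:.0%}")  # per-document: 50%
```

Only 72 of 5,600 triples are cyclic here, yet half of the documents are affected, which is the same shape as the gap the paper reports.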
What Does a Directed 3-Cycle Actually Mean for My Evaluation Pipeline?
A directed 3-cycle means the LLM judge violates the transitive property of preferences: it says summary A is better than B, B is better than C, but then ranks A as worse than C. In the real world, this is the equivalent of a human judge saying "this apple is better than this orange, this orange is better than this banana, but this banana is better than this apple." It's not just noise—it's a fundamental inconsistency in the judge's preference structure.
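A check for this pattern can be sketched in a few lines. The sketch assumes the judge's verdicts are available as a set of `(winner, loser)` pairs; that representation and the name `find_3_cycles` are illustrative choices, not the paper's implementation.

```python
from itertools import combinations


def find_3_cycles(wins):
    """Return every item triple that forms a directed 3-cycle.

    wins: set of (winner, loser) pairs produced by the judge.
    """
    items = sorted({item for pair in wins for item in pair})
    cycles = []
    for a, b, c in combinations(items, 3):
        forward = (a, b) in wins and (b, c) in wins and (c, a) in wins
        backward = (b, a) in wins and (c, b) in wins and (a, c) in wins
        if forward or backward:
            cycles.append((a, b, c))
    return cycles


# Hypothetical verdicts for four summaries of one document.
judgments = {("A", "B"), ("B", "C"), ("C", "A"),
             ("A", "D"), ("B", "D"), ("C", "D")}
print(find_3_cycles(judgments))  # [('A', 'B', 'C')]
```

Summary D loses to everything and causes no trouble; the cycle among A, B, and C is what makes "the best summary" undefined for this document.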
For practitioners using LLM judges to compare summaries, code outputs, or chatbot responses, this means that a ranking built from pairwise judgments may not be well defined for an affected document: inside a cycle there is no consistent "best" candidate. If you're using LLM judges to select the best response from a set of candidates, you cannot take the output on trust for any single document without checking for cycles. The paper's finding that 33-67% of documents have at least one cycle suggests that for a third to two thirds of documents, the judge's preferences are at least partially inconsistent.
The practical implication is stark: any evaluation pipeline that relies on LLM judges to produce a single ranking across outputs is building on sand. The researchers propose conformal prediction sets as a complementary tool: instead of a single score, the judge outputs a set of possible scores with a guaranteed coverage probability. But this only mitigates the problem; it doesn't solve the underlying inconsistency.
Who Benefits From This Diagnostic Toolkit?
The immediate winners are researchers and practitioners who are already skeptical of LLM judges. The toolkit provides a concrete, reproducible method to diagnose per-instance reliability. Any team using LLM judges for evaluation should apply this toolkit to their own datasets before trusting the results.
The losers are companies selling LLM-as-judge solutions without transparency about per-instance reliability. Any vendor claiming their LLM judge achieves X% accuracy or correlation without providing per-instance uncertainty quantification is selling a product that may systematically misrank outputs for a significant fraction of inputs.
Long-term, this paper strengthens the case for hybrid evaluation pipelines that combine LLM judges with human oversight or statistical guarantees. The conformal prediction approach is a step in the right direction, but it only works if practitioners actually use it.
Can Conformal Prediction Sets Fix the Underlying Problem?
The paper's second contribution is split conformal prediction sets over 1-5 Likert scores. Instead of a single point estimate, the judge outputs a prediction set that contains the true score with at least a user-specified probability (e.g., 90%), averaged over inputs that resemble the calibration data. This is a principled approach to uncertainty quantification, but it has limitations.
First, conformal prediction does not make the judge any better. The coverage guarantee holds on average across inputs, and only when calibration and test data are exchangeable. If the LLM judge is systematically biased, which the transitivity analysis suggests it is, the sets stay valid mainly by growing wider, and coverage can fail on inputs unlike the calibration set. Second, the prediction sets can be large, especially for ambiguous inputs. A 90% prediction set that spans 3 out of 5 Likert categories is not very useful for fine-grained evaluation.
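For readers who have not seen split conformal prediction, the mechanics can be sketched as follows. This is a generic textbook-style version, not the paper's code: it assumes the nonconformity score is the absolute gap between the judge's score and a human reference score, and the function names and calibration data are hypothetical.

```python
import math


def conformal_quantile(judge_scores, human_scores, alpha=0.1):
    """Calibration step: the ceil((n + 1)(1 - alpha))-th smallest residual."""
    residuals = sorted(abs(j - h) for j, h in zip(judge_scores, human_scores))
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return residuals[k - 1]


def prediction_set(judge_score, q_hat, labels=(1, 2, 3, 4, 5)):
    """Every Likert label within q_hat of the judge's score."""
    return [y for y in labels if abs(y - judge_score) <= q_hat]


# Hypothetical held-out calibration pairs (judge score, human score).
cal_judge = [4, 3, 5, 2, 4, 3, 1, 5, 4, 2, 3, 4, 5, 2, 3, 4, 1, 3, 4, 2]
cal_human = [4, 3, 4, 2, 5, 3, 1, 5, 3, 2, 3, 4, 3, 2, 4, 4, 1, 3, 4, 2]

q_hat = conformal_quantile(cal_judge, cal_human, alpha=0.1)
print(q_hat)                     # 1
print(prediction_set(4, q_hat))  # [3, 4, 5]
```

Even with a judge that matches the human score on 15 of 20 calibration items, the 90% set for a score of 4 already spans three of the five categories, which illustrates the second limitation above.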
The real value of conformal prediction is not as a fix but as a diagnostic. It forces practitioners to confront the uncertainty in their evaluations. If your prediction sets are consistently large, that's a signal that your LLM judge is not reliable for that input domain.
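One way to act on that signal is to track average set size per input domain and flag the domains where sets are wide. The sketch below assumes prediction sets have already been computed and tagged with a domain label; the threshold of 2.0, the domain names, and the helper names are arbitrary illustrations.

```python
from collections import defaultdict


def mean_set_size_by_group(records):
    """records: iterable of (group, prediction_set) -> {group: mean set size}."""
    sizes = defaultdict(list)
    for group, pred_set in records:
        sizes[group].append(len(pred_set))
    return {group: sum(v) / len(v) for group, v in sizes.items()}


def flag_unreliable(records, max_mean_size=2.0):
    """Groups whose average prediction set is wider than the threshold."""
    means = mean_set_size_by_group(records)
    return sorted(g for g, m in means.items() if m > max_mean_size)


# Hypothetical prediction sets for two input domains.
records = [
    ("news", [4]), ("news", [3, 4]), ("news", [4, 5]),
    ("legal", [2, 3, 4]), ("legal", [1, 2, 3, 4]), ("legal", [3, 4, 5]),
]
print(flag_unreliable(records))  # ['legal']
```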
| Metric | Aggregate Value | Per-Instance Value | Implication |
|---|---|---|---|
| Transitivity violation rate | 0.8-4.1% | 33-67% of documents | Aggregate hides widespread inconsistency |
| Directed 3-cycle per document | Not reported | At least 1 per affected doc | Judge contradicts itself on same input |
| Conformal prediction set size | N/A | Depends on coverage | Uncertainty is quantifiable but large |
| Verdict | Misleadingly low | Alarmingly high | LLM judges are unreliable per-instance |
My thesis: The LLM-as-judge paradigm is fundamentally broken for high-stakes evaluation, and this paper provides the evidence that should force a reckoning in the field.
Short-term, this paper will be cited by every practitioner who has ever felt uneasy about LLM judges. It provides a rigorous, reproducible method to expose the problem. I expect to see a wave of follow-up work applying this toolkit to other datasets (e.g., Chatbot Arena, HumanEval) and finding similar patterns.
Long-term, the field will bifurcate. One camp will adopt conformal prediction sets and other uncertainty quantification methods, accepting that LLM judges are noisy tools that require statistical guardrails. The other camp will double down on improving LLM judges through better prompting, fine-tuning, or ensembling, hoping to reduce transitivity violations to zero. I believe the first camp is correct, because the violations are not noise—they reflect a fundamental limitation of LLMs' ability to maintain consistent preference structures across diverse inputs.
The winners are researchers and practitioners who adopt uncertainty quantification. The losers are companies that sell LLM judges as turnkey solutions without transparency. The biggest loser is the entire evaluation pipeline of any organization that currently uses LLM judges without per-instance diagnostics—they are making decisions based on flawed metrics.
I predict that by Q4 2026, at least two major LLM evaluation platforms (e.g., LangChain's evaluation suite or Hugging Face's Evaluate library) will integrate conformal prediction sets as a default feature, because the pressure from this paper and its follow-ups will make it untenable to offer uncalibrated LLM judges.
- LangChain will integrate conformal prediction sets into its LangSmith evaluation suite by Q4 2026, because the paper's methodology is directly applicable to their use case and competitive pressure will force them to address per-instance reliability.
- At least one major benchmark (e.g., Chatbot Arena or HumanEval) will adopt transitivity analysis as a standard diagnostic within the next 12 months, because the paper provides a simple, interpretable metric that exposes LLM judge inconsistency.
- Regulators in the EU AI Office will cite this paper in guidance on automated evaluation for high-risk AI systems, because the finding that aggregate metrics mask per-instance failures has direct implications for conformity assessments under the AI Act.
- Insight 1: The paper's transitivity analysis is more than a diagnostic: it is evidence that LLM judges can lack a consistent preference structure, which is a prerequisite for any reliable ranking system.
- Insight 2: Conformal prediction sets are a band-aid, not a cure. They quantify uncertainty but do not reduce it. The real solution is to invest in evaluation methods that do not rely on LLM judges as the sole arbiter.
- Insight 3: The 33-67% per-document violation rate means that for a significant fraction of inputs, the judge's pairwise preferences cannot be arranged into a single consistent ranking. One cycle does not make a judge random, but this is not a calibration issue either; it points to a reliability ceiling.
- Insight 4: The paper's methodology is model-agnostic and dataset-agnostic, meaning it can be applied to any LLM judge and any dataset. This is both a strength (generalizability) and a weakness (it doesn't tell you how to fix the problem).
- Insight 5: The field needs to move beyond single-score evaluation. The paper's conformal prediction approach points toward a future where evaluation outputs are always accompanied by uncertainty intervals, not point estimates.
Source and attribution
arXiv
Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations