Jepsen Exposes Safety Lies in Distributed AI Systems
Kyle Kingsbury's Jepsen analysis reveals systemic safety failures in distributed systems powering AI infrastructure. The article argues that vendor claims of 'safe' systems are often marketing fabrications, and that the industry must adopt formal verification or face cascading failures.
- Kyle Kingsbury (Jepsen) published a comprehensive critique of safety claims in distributed systems, naming vendors like MongoDB, Cassandra, and Redis as having historically overstated guarantees.
- The analysis argues that AI infrastructure built on these foundations inherits latent safety defects, creating systemic risk for real-time AI applications.
- Kingsbury advocates for formal verification and fault injection as the only reliable paths to safety, a view that positions Jepsen as a critical audit vendor in the AI supply chain.
Why Did Jepsen Call Out Distributed Systems as 'Lies'?
According to Kyle Kingsbury's April 2026 analysis, the core problem is that distributed systems vendors have historically conflated "best-effort" behavior with "guaranteed safety." Kingsbury documented specific cases where MongoDB's claims of 'strong consistency' were falsified by Jepsen tests, and where Cassandra's 'eventual consistency' was actually 'sometimes consistency, sometimes data loss.' The pattern, Kingsbury argues, is not isolated but systemic: every major NoSQL and NewSQL vendor has at some point published safety claims that Jepsen's fault injection tests contradicted.
This matters directly for AI because modern AI infrastructure (vector databases, model registries, feature stores, and agent orchestration layers) is built on these same distributed systems. Kingsbury's thesis is that if the foundation is a lie, the AI applications running on top are operating on borrowed trust.
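To make the fault-injection idea concrete, here is a minimal sketch of the kind of check such a test performs: write values while a fault is active, heal the fault, then compare what the store acknowledged against what it actually holds. `FlakyStore`, `run_set_test`, and the drop behavior are all hypothetical and invented for illustration; this is not Jepsen's code and does not model any real database.

```python
import random


class FlakyStore:
    """Hypothetical toy store: acknowledges every write, but may drop some during a fault."""

    def __init__(self, drop_probability, seed):
        self._rng = random.Random(seed)
        self._drop_probability = drop_probability
        self._data = set()
        self.partitioned = False

    def add(self, value):
        # During the simulated partition, a write can be acknowledged yet never persisted.
        if self.partitioned and self._rng.random() < self._drop_probability:
            return True
        self._data.add(value)
        return True

    def read_all(self):
        return set(self._data)


def run_set_test(store, n_ops=100, fault_start=30, fault_end=60):
    """Add n_ops values, injecting a fault for part of the run; return acknowledged values that went missing."""
    acknowledged = set()
    for i in range(n_ops):
        store.partitioned = fault_start <= i < fault_end
        if store.add(i):
            acknowledged.add(i)
    store.partitioned = False  # heal the fault before the final read
    final = store.read_all()
    return sorted(acknowledged - final)


if __name__ == "__main__":
    lost = run_set_test(FlakyStore(drop_probability=0.5, seed=42))
    print(f"{len(lost)} acknowledged writes missing after the fault healed")
```

A store that keeps its promise returns an empty list here. Any non-empty result is an acknowledged write that vanished, which is the gap between a marketed guarantee and observed behavior that the article describes.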
What Does This Mean for AI Infrastructure Vendors?
Kingsbury's analysis names specific vendors and their documented failures. For example, it cites a 2023 Jepsen analysis of etcd (the key-value store that backs the Kubernetes control plane) as finding that committed writes could be silently lost under network partitions. According to Kingsbury, this is not a theoretical risk but a reproducible defect. For AI platforms that rely on Kubernetes for model serving, this means that a network partition could cause a model registry to return stale or missing versions, potentially deploying the wrong model to production.
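The model-registry scenario can be sketched with a toy simulation. Everything here is an assumption made for illustration: `ToyRegistry`, its two replicas, and its replication behavior are hypothetical and do not describe etcd, Kubernetes, or any real registry. The sketch only shows how a read served by a replica cut off by a partition can return an outdated model version.

```python
class Replica:
    """One copy of the registry's state."""

    def __init__(self, name):
        self.name = name
        self.versions = {}  # model name -> latest version this replica has seen


class ToyRegistry:
    """Hypothetical two-replica model registry with best-effort replication."""

    def __init__(self):
        self.primary = Replica("primary")
        self.follower = Replica("follower")
        self.partitioned = False

    def publish(self, model, version):
        # The write always lands on the primary and is acknowledged to the caller.
        self.primary.versions[model] = version
        # Replication to the follower only succeeds while the network is healthy.
        if not self.partitioned:
            self.follower.versions[model] = version

    def read(self, model, replica):
        return replica.versions.get(model)

    def heal(self):
        # Once the partition ends, the follower catches up with the primary.
        self.partitioned = False
        self.follower.versions.update(self.primary.versions)


def demo():
    registry = ToyRegistry()
    registry.publish("ranker", 1)
    registry.partitioned = True      # the network partition begins
    registry.publish("ranker", 2)    # acknowledged, but the follower never sees it
    stale = registry.read("ranker", registry.follower)
    fresh = registry.read("ranker", registry.primary)
    registry.heal()
    healed = registry.read("ranker", registry.follower)
    return stale, fresh, healed


if __name__ == "__main__":
    print(demo())  # (1, 2, 2)
```

A serving layer that happened to read from the follower during the partition would deploy version 1 even though version 2 had been published and acknowledged.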
The implication is that every AI infrastructure vendor claiming 'five-nines' reliability or 'strong consistency' should be treated as guilty until proven innocent by independent audit. Kingsbury explicitly calls for a new industry norm: publishing Jepsen analysis results as a prerequisite for any safety-critical AI deployment.

Who Are the Winners and Losers in the Safety Verification Market?
| Company/Approach | Safety Claim | Jepsen-Verified? | Risk to AI Workloads | Verdict |
|---|---|---|---|---|
| MongoDB (Atlas) | Strong consistency | No (Jepsen found violations in 2015, 2017) | High (model metadata corruption) | Loser |
| etcd / Kubernetes | Linearizable reads | No (Jepsen found violations in 2023) | High (model registry inconsistency) | Loser |
| FoundationDB | Serializable isolation | Yes (Jepsen confirmed in 2019) | Low | Winner |
| Amazon DynamoDB | Strong consistency option | Partial (Jepsen found edge cases in 2020) | Medium | Neutral |
| Jepsen (Kingsbury's firm) | Independent audit | N/A (auditor) | N/A | Winner (market demand) |
| Formal verification tooling (e.g., TLA+, Verdi) | Mathematical proof | N/A | Low (if adopted) | Winner (long-term) |

Verdict: FoundationDB and formal verification win; MongoDB, etcd, and Cassandra lose in safety-critical AI deployments.
Can AI Systems Survive Without Formal Verification?
Kingsbury's answer is a definitive 'no' for any system that makes safety claims. According to his analysis, the only way to check empirically whether a distributed system honors its guarantees is to test it under controlled faults, which is exactly what Jepsen does. Kingsbury reported that even after vendors fix bugs found by Jepsen, new ones emerge in subsequent releases. This creates a treadmill: vendors fix, Jepsen finds new bugs, vendors fix again. Because testing can reveal bugs but cannot show that none remain, the only exit is formal verification, where mathematical proof replaces empirical testing.
For AI companies, this means that any safety-critical AI application (autonomous vehicles, medical diagnosis, financial trading) must either adopt formally verified infrastructure (the examples given are FoundationDB and systems whose designs are checked with TLA+) or accept that their safety claims are, in Kingsbury's words, 'lies.'
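The difference between sampling behaviors and covering all of them can be shown on a very small example. The sketch below enumerates every interleaving of two clients that each read a shared counter and then write back the value plus one, and checks whether the final value is 2. It is a brute-force illustration in the spirit of explicit-state model checking, not TLA+ itself and not a proof about any real system; the clients, the counter, and the function names are all hypothetical.

```python
from itertools import permutations

STEPS = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]


def interleavings():
    """Yield every ordering of the four steps in which each client reads before it writes."""
    for order in permutations(STEPS):
        a_ok = order.index(("A", "read")) < order.index(("A", "write"))
        b_ok = order.index(("B", "read")) < order.index(("B", "write"))
        if a_ok and b_ok:
            yield order


def run(order):
    """Execute one schedule of unsynchronized read-then-increment and return the final counter."""
    counter = 0
    local = {}  # the value each client last read
    for client, action in order:
        if action == "read":
            local[client] = counter
        else:
            counter = local[client] + 1
    return counter


def check_all():
    """Return (number of schedules explored, number that violate the invariant counter == 2)."""
    schedules = list(interleavings())
    violations = [order for order in schedules if run(order) != 2]
    return len(schedules), len(violations)


if __name__ == "__main__":
    total, bad = check_all()
    print(f"{bad} of {total} schedules lose an increment")  # 4 of 6
```

A test that happens to run only the two fully sequential schedules would pass every time. Exhaustive exploration finds the four schedules that lose an update, which is why the article treats proof-style methods as the way off the fix-and-retest treadmill. Real systems have far too many states for naive enumeration, so this shows only the principle.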
My analysis: The AI industry has a safety debt problem, and Jepsen is the auditor calling it in. The thesis is clear: vendor safety claims are unreliable, and independent verification is the only credible path forward. In the short term (6 to 12 months), I expect at least two major AI platform vendors to commission Jepsen audits of their infrastructure stacks, and those that fail will face pressure to switch to formally verified alternatives. The long-term winner is FoundationDB, which Apple open-sourced and which already passed Jepsen's tests. The loser is any vendor that continues to market 'five-nines' without independent proof; they will be exposed by the next Jepsen analysis. My specific prediction: by Q2 2027, MongoDB will announce a formal verification initiative for its distributed transaction subsystem, or lose at least two Fortune 500 AI customers to FoundationDB.
Predictions
- By Q3 2026, at least one major cloud provider (AWS, GCP, or Azure) will announce a Jepsen audit program for all AI infrastructure services. The reputational risk of being named in a Kingsbury post is too high to ignore.
- By Q2 2027, FoundationDB will be adopted as the default metadata store by at least two of the top five AI model registries (e.g., MLflow, DVC, or Weights & Biases). The Jepsen-verified safety guarantee is a competitive differentiator.
- By Q4 2026, the EU AI Office will include Jepsen-style fault injection testing in its draft technical standards for high-risk AI systems. Regulators will see Kingsbury's work as a ready-made compliance framework.
Article Summary
- Kyle Kingsbury's Jepsen analysis is the most credible independent audit of distributed systems safety, and it exposes systematic overclaiming by vendors.
- AI infrastructure built on unverified distributed systems inherits latent safety defects that can cause silent data corruption, model version mismatches, and inconsistent inference results.
- The only reliable path to safety is formal verification (e.g., TLA+) or infrastructure that has already been independently verified (e.g., FoundationDB), combined with independent fault injection testing (Jepsen).
- Vendors that invest in verification will win the safety-critical AI market; those that don't will face regulatory and customer backlash.
- Regulators are likely to adopt Jepsen-style testing as a de facto standard for high-risk AI systems within 18 months.
Source and attribution
Hacker News
The Future of Everything Is Lies, I Guess: Safety