Antithesis - the last word in autonomous software testing 🎛️

Plus: CEO/Founder Will on why he believes AI-driven dev tools will benefit from rock-solid verification...

CV Deep Dive

Today, we're talking with Will Wilson, Founder and CEO of Antithesis.

Antithesis offers what some might call a radically new paradigm for testing and verifying complex software. By running your applications in a deterministic hypervisor and using an intelligent search to systematically break your code, Antithesis promises to catch the rare and seemingly impossible bugs that slip through standard integration tests or chaos engineering experiments. Will brings extensive experience from FoundationDB (acquired by Apple and recently disclosed to be underpinning all of DeepSeek's infrastructure) and Google's Spanner team, and much of the core Antithesis crew likewise hails from that same background.

Key Takeaways

  • Preempt Production Incidents with Deterministic Testing: Antithesis simulates all the things that could go wrong with your distributed system in a deterministic environment. This eliminates the reproducibility headaches that plague most large-scale system testing.

  • Autonomous Search: Rather than you writing a million test cases, Antithesis actively seeks out new and interesting behaviors in your software on its own, then shows you precisely how they happened.

  • Already Helping Startups & Enterprises Alike: While the initial focus is on larger customers, small companies (like Turso DB) have also used Antithesis to de-risk bold projects like total rewrites.

  • Potential for AI Synergy: With GenAI producing ever more code, often from less expert devs, demand for deeper, more robust testing only grows. Meanwhile, Antithesis sees opportunities to use AI to generate novel test inputs and fix discovered bugs in a closed loop.

  • Future Expansion Beyond Distributed Systems: Though best known for finding fault tolerance bugs, Antithesis's approach can be leveraged to test a wide range of applications, from mobile apps to games to websites.

In this conversation, Will explains how the company's unique technology was inspired by the "impossible" distributed database days at FoundationDB, how intelligent search differs from random chaos testing, and why he believes AI-driven dev tools will benefit from rock-solid verification.

Let's dive in ⚡️

Read time: 8 mins

Our Chat with Will 💬

Hey Will - welcome to Cerebral Valley! Could you give us a bit of an intro on yourself and what led you to found Antithesis?

I'm Will - a software engineer by background who never really planned to start a company, but here I am. A lot of what inspired Antithesis came from my time at FoundationDB (acquired by Apple in 2015). We built a distributed database with ACID transactions and high fault tolerance; people said it was impossible because of the CAP theorem, but we did it. The key was this sophisticated autonomous testing system we created. It simulated arbitrary failure conditions, searched the entire "state space" to trigger weird bugs, and let us deterministically reproduce any situation. That was a total game-changer. After Apple, I worked on Google Spanner and noticed they didn't have anything comparable. We realized there was a huge opportunity to bring that style of testing to the broader world.

A ton of my colleagues here are from FoundationDB, including my co-founder who used to be my boss there. Come to think of it, my boss at Apple now works for me as our VP of Engineering, and another former boss is an investor. I guess I have good relationships with bosses! But seriously, the vision is to free devs from writing endless test cases by letting an intelligent system break their software in a reproducible environment.

What exactly is Antithesis for the uninitiated developer - like an elevator pitch?

We flip the usual approach to testing on its head. Normally, you write tests for specific cases you think might matter, then hope you covered enough ground. In practice, you deploy to production and discover all kinds of insane edge scenarios you never imagined: routers delaying packets, machines shutting down mid-request, user input with 2^8 characters.

Antithesis starts from the end goal: you specify what your software is supposed to do or not do (e.g., "Don't crash"), and we systematically explore how to break that rule. We do this by injecting weird environment faults, bizarre inputs, or exotic usage patterns. Because we run everything in a deterministic environment, once we find a bug, you can re-run that exact scenario. No more "works on my machine, breaks on yours." It's like combining property-based testing with advanced fault injection, but at scale for big real-world apps.
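
To make the property-based-testing flavor of that concrete, here's a minimal sketch in Python using the open-source hypothesis library - purely an illustration of the general idea, not Antithesis's own API, and the KVStore class is a made-up stand-in for whatever system you're testing:

```python
# A minimal sketch of the "state a property, let the machine hunt for a
# counterexample" idea, using the open-source hypothesis library rather than
# Antithesis's own tooling. KVStore is a toy stand-in for the system under test.
from hypothesis import given, strategies as st

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

# The property we assert: after any sequence of writes, reading the key from the
# most recent write returns the value we just wrote. hypothesis generates and
# shrinks input sequences looking for a violation, instead of us enumerating
# cases by hand.
@given(st.lists(st.tuples(st.text(), st.integers()), min_size=1))
def test_last_write_wins(writes):
    store = KVStore()
    for key, value in writes:
        store.put(key, value)
    last_key, last_value = writes[-1]
    assert store.get(last_key) == last_value
```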

You mentioned a deterministic state space that you rigorously test through. What does that mean, exactly?

Conventional software is inherently non-deterministic. In real life, a program can spawn threads that run in arbitrary orders, send and receive network traffic with random delays, check the time at different moments, or generate random numbers. That means bugs often appear sporadically, or only 1 in 1,000 times (but Murphy's Law means it'll happen at the exact wrong moment in prod).

We built a hypervisor that forces your whole system to be deterministic - thread scheduling, network responses, everything. If a thread is about to run, we decide the scheduling in a repeatable way, so if we see a bug, we can replay the entire scenario. That transforms debugging: no more ephemeral "heisenbugs." Our approach systematically covers a huge range of possible interleavings and fault conditions.
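
A deliberately simplified way to picture why determinism enables replay (a toy sketch, not how Antithesis's hypervisor actually works): route every nondeterministic choice through a single seeded source, and the same seed reproduces the same run.

```python
# A toy illustration of seed-driven determinism, not Antithesis's real mechanism:
# if every nondeterministic decision (which task runs next, how long a message is
# delayed) comes from one seeded PRNG, the whole run can be reproduced exactly.
import random

def run_simulation(seed, tasks):
    rng = random.Random(seed)
    log = []
    pending = list(tasks)
    while pending:
        # The "scheduler": pick the next task to run using the seeded RNG.
        task = pending.pop(rng.randrange(len(pending)))
        # The "network": inject a deterministic, seed-derived delay.
        delay_ms = rng.randint(0, 50)
        log.append((task, delay_ms))
    return log

# Same seed, same schedule and same delays - so any bug the run triggers
# replays identically.
assert run_simulation(42, ["a", "b", "c"]) == run_simulation(42, ["a", "b", "c"])
```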

Is your system primarily geared toward enterprise customers, or could smaller shops or individuals use it too?

Right now, we're mostly oriented toward enterprise, because that's where we see the biggest immediate ROI - large companies want to reduce outages and production firefighting. We do have some small startups as well, including pre-seed ones, but the overall product can be quite expensive in its current form. Over time, we plan to make a more polished, self-serve version with a cheaper or free tier. That's definitely on the roadmap.

What are the key output metrics you look for to determine success, and do you have any success stories that show a drastic change in those metrics?

Number one is reducing production incidents and outages. If you're shipping fewer bugs and spending less time on firefighting, that's huge. We had one customer tell us that their "support people were getting bored", and that made my day.

Another is developer productivity: how much time are teams spending on writing and maintaining tests? Or how much are they wasting triaging and investigating weird, non-reproducible issues? The latter task especially tends to fall on the most senior and valuable members of the team. When we free them up to write features instead, that's a huge win.

Then there's the "frontier of what's possible": can devs tackle projects they never dared attempt before? Turso DB is a great example. They wanted to rewrite SQLite from scratch, but initially thought it was too risky - like, how do you test something that big and complicated? After working with us, they felt safe to do it, because we systematically hammered on the new code until it was stable. That's the kind of "frontier" effect we love seeing.

This sounds like a novel approach - who do you see as competition or the "incumbents" in the testing space?

There's not much direct competition doing exactly what we do. The biggest "competitor" is often a homegrown system at large companies, usually some janky fault-injection approach they built internally. Then you have chaos testing, popularized by Netflix, which is basically introducing random disruptions in production. For smaller or stateless things, there's fuzzing or property-based testing, but those rarely scale to big distributed apps.

We do see some teams building their own deterministic simulation, but that's an enormous effort. For most, it's easier to go with a vendor approach. So in short, there's no one else systematically combining deterministic simulation, intelligent search, and environment fault injection the way we do.

You mentioned a moat around the insights you gain from multiple customers' code. Can you share any interesting or surprising insights from that?

Every new customer adds more diversity to our "training corpus." That means we're less likely to overfit on just one type of system. We do see broad patterns - like how pure uniform randomness is actually not that useful. If you simply drop 5% of packets or randomly pick functions to call, you might never hit the weird corner where you call function A 100 times in a row without an intervening call to function B.

Structured randomness is more powerful. We'll do, for instance, periods of complete disconnection, then normal traffic, or sequences that call function A repeatedly in case it contains a memory leak that gets cleaned up by function B. This can reveal deeper bugs that uniform sampling would never touch. Another big theme is that real-world usage is rarely a simple distribution - so we design our search to systematically produce "pockets of chaos" instead of random mild chaos everywhere.
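
As a rough sketch of that contrast - illustrative only, not Antithesis's actual fault model - compare a uniform 5% packet-drop schedule with one that alternates windows of total disconnection and normal traffic:

```python
# Illustrative contrast between uniform and structured randomness, not
# Antithesis's real fault injection. Each schedule is a list of booleans where
# True means "drop this packet".
import random

def uniform_drop_schedule(num_packets, drop_prob=0.05, seed=0):
    # Every packet independently has a small chance of being dropped.
    rng = random.Random(seed)
    return [rng.random() < drop_prob for _ in range(num_packets)]

def pockets_of_chaos_schedule(num_packets, seed=0):
    # Alternate whole windows of disconnection with windows of normal traffic,
    # which is far more likely to exercise timeout and reconnect code paths.
    rng = random.Random(seed)
    schedule, dropped = [], False
    while len(schedule) < num_packets:
        window = rng.randint(5, 50)
        schedule.extend([dropped] * window)
        dropped = not dropped
    return schedule[:num_packets]

# The uniform schedule almost never produces a long outage; the structured one
# guarantees sustained disconnections that stress recovery logic.
print(sum(uniform_drop_schedule(1000)), sum(pockets_of_chaos_schedule(1000)))
```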

Where do you see software development going in five or ten years if this new paradigm of testing takes off?

Ideally, devs can focus more on high-level logic and less on writing a million test permutations. We want a world where you specify your software's constraints - like "it shouldn't crash, it should maintain these invariants, it should respond within X time" - and let an intelligent system handle writing the test cases.

AI dev tools come into play here too. If an AI can generate large amounts of code quickly, there will be more code with potentially more bugs. But also, if we can integrate a strong testing approach behind the scenes, that code can get self-verified. Maybe you ask a language model for new functionality, it writes some code, Antithesis runs a state-space search, finds the broken scenario, then the AI fixes it, and so on. That might let us scale software creation far beyond current limits, because we can systematically ensure correctness.
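
If you sketched that loop as code - with every helper below a hypothetical stand-in invented for illustration, not a real Antithesis or LLM API - it might look something like this:

```python
# A hypothetical sketch of the generate/test/fix loop described above. All three
# helpers are trivial stand-ins, not real Antithesis or LLM APIs.
def generate_code(spec):
    return {"spec": spec, "attempt": 0}              # pretend an LLM drafted code

def run_state_space_search(code):
    # Stand-in for the autonomous search: report a counterexample until the
    # third attempt, then report the code as clean.
    return None if code["attempt"] >= 2 else f"counterexample #{code['attempt']}"

def request_fix(code, counterexample):
    return {**code, "attempt": code["attempt"] + 1}  # pretend the LLM patched it

def build_with_verification(spec, max_rounds=5):
    code = generate_code(spec)
    for _ in range(max_rounds):
        counterexample = run_state_space_search(code)
        if counterexample is None:
            return code                               # no violation found: done
        code = request_fix(code, counterexample)      # feed the failure back, retry
    raise RuntimeError("could not satisfy the spec within the round budget")

print(build_with_verification("don't crash; keep invariants"))
```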

Speaking of AI, do you see generative coding tools shifting Antithesis's roadmap, or is Antithesis already using AI?

In the short term, it's good for us that so much new code is being churned out - some of it by less experienced devs or by LLMs. That means more potential bugs to find. On a deeper level, we see big possibilities in integrating with coding agents: you give them a spec, they produce code, we automatically test it in a closed loop, and only return results once the code passes. That's definitely an idea we're exploring.

We're also experimenting with using AI to generate test harnesses and weird usage scenarios. If it calls an API "incorrectly," well, your program shouldn't crash anyway, so that's beneficial. A higher "temperature" can lead to more interesting test inputs. We're seeing promising early results from letting AI produce synthetic calling code that humans wouldn't typically think of.

Aside from AI, how will Antithesis evolve over the next 12 months?

Our biggest push is reducing the latency of getting results. We used to run in a batch mode, where you'd submit code and a few hours later get a bug report. Now we're moving toward a streaming model, so as soon as we find a bug, you get notified. This makes the feedback loop much tighter and more dev-friendly.

We're also broadening the problem domain. Right now, we're best known for unearthing fault tolerance issues in distributed systems, but the same fundamental approach can test websites, mobile apps, even games. Our architecture is general enough - it just needs a bit more productization. That expansion will open the door to a much larger market.

What can you tell us about the culture at Antithesis, and what roles are you hiring for?

We're extremely collaborative - everyone says that, but we really mean it. We often spend more time talking through design choices than coding, because once we figure out the correct approach, implementation goes faster. We're also an in-person company, located in D.C. That's unusual in the startup world, but it works for us, especially since we do a lot of deep systems work that benefits from real-time interaction.

We hire folks across a broad spectrum - from kernel hackers to front-end/UI people to machine learning researchers. If you're excited about deterministic simulation, advanced testing, or pushing the boundaries of software reliability, we want to talk. D.C. isn't the most common tech hub, but there are plenty of talented engineers here, and we love building a strong presence in the city.

Conclusion

Stay up to date on the latest with Antithesis and learn more about them here.

If you would like us to 'Deep Dive' a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.