Halluminate - Evaluation infrastructure for AI agents
Plus: CEO Jerry Wu on how evaluation models will be 30-40% of all AI inference spend...

CV Deep Dive
Today, we're talking with Jerry Wu, co-founder and CEO of Halluminate.
Halluminate is building evaluation infrastructure for AI agents. Fully autonomous software is the future, yet many companies are prevented from deploying these systems at scale due to the high cost and labor of human evaluation and oversight. Halluminate's "model-driven testing" automates the human review process with evaluation models, saving engineers hundreds of hours with streamlined, scalable ways to validate the reliability and quality of their agentic systems.
The platform is aimed mainly at AI engineers at Series A+ startups and enterprise applied machine learning teams, especially in riskier industries like financial services where robust testing is crucial. As AI agents are increasingly used for complex decision-making, Halluminate helps ensure these systems function as expected while reducing the high costs and manual effort associated with validation.
Backed by Alchemist Accelerator and a crop of top-tier angels, Halluminate is in the midst of scaling up its product with some of the best engineering teams in the world.
In this conversation, Jerry explains how model-driven testing works, shares success stories, and discusses how evaluation models can grow to become 30-40% of all AI inference spend.
Let's dive in ⚡️
Read time: 8 mins
Our Chat with Jerry 💬
Thanks for joining us, Jerry - tell us your backstory and what led you to co-found Halluminate.
Sure! My name's Jerry. I have a background in computer science and machine learning research from Cornell, where I first got into coding. Like a lot of people, I realized that software is just an amazing way to solve real-world problems. I got into machine learning research because it felt like the cutting edge of what technology could achieve, specifically working on model quantization - basically making models smaller while keeping them accurate.
But while doing that, I quickly realized academic research wasn't for me - it felt too slow and constraining. I discovered I enjoyed building products and getting real-world feedback much more. So I started diving into entrepreneurship and product development. During my senior year in college, I worked on my first startup, Code Bozu. It was a platform that aimed to teach kids how to code through short, bite-sized tutorials on TikTok. It didn't take off - creating content is tough and expensive, especially back then - but it was an awesome learning experience.
That led me to Capital One Labs, where I worked on my first AI agent system in 2023, which was designed to automate end-to-end call center tasks. While building these AI systems, I realized that the biggest challenge wasn't in building them - it was in testing them. AI agents are inherently non-deterministic, meaning they produce different outputs for the same chain of inputs. This kind of unpredictability creates a lot of challenges, especially in highly regulated industries like banking. I thought long and hard about this space and realized there were no good solutions.
I felt it was worth leaving to solve the problem full-time, which led to the founding of Halluminate with my co-founder Wyatt. Wyatt and I have been roommates and friends for almost seven years - we've been building together since our first week at Cornell. He also has a background in machine learning, particularly doing research around image generation as a Milstein Scholar, and was working on large-scale machine learning monitoring at a startup before we teamed up to tackle this problem.
What's the high-level on what Halluminate is, for the uninitiated?
At a high level, Halluminate is the testing and evaluation layer for AI agents. Essentially, it's a suite of tools designed for model-driven testing. To break that down, our worldview is that while software has traditionally been deterministic, AI and AI agents are making software non-deterministic. In traditional software development, about 40% of time and cost is spent on testing and quality assurance, but we can't test AI systems the same way we test deterministic software.
For example, most developers are used to writing unit tests or test scripts, but with AI systems - especially large language models (LLMs) - you don't get the same consistent outputs, even from the same inputs. Controlling for things like temperature can force a level of consistency, but that comes at the expense of creativity, which limits the potential of the AI. So, the question is: how do we test non-deterministic software automatically?
We believe the answer is what we call model-driven testing, or evaluation models - using foundation models to grade and assess the outputs produced by an agentic system, an approach also known as LLM-as-a-judge. Right now, it requires a lot of engineering and research resources to get this automated evaluation method to work well, but Halluminate is building a platform of tools to make it accurate, streamlined, and easy to deploy out of the box.
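The LLM-as-a-judge pattern described above can be sketched minimally. This is an illustrative assumption, not Halluminate's actual implementation: here `call_model` is a placeholder for any chat-completion API, and the prompt, scoring scale, and parsing are all hypothetical.

```python
from typing import Callable

# Hypothetical rubric prompt; the criteria and 1-5 scale are illustrative.
JUDGE_PROMPT = """You are grading the output of an AI agent.
Criteria: {criteria}
Agent output: {output}
Reply with a single integer score from 1 (fail) to 5 (excellent)."""

def judge(output: str, criteria: str, call_model: Callable[[str], str]) -> int:
    """Ask an evaluation model to grade one agent output."""
    prompt = JUDGE_PROMPT.format(criteria=criteria, output=output)
    reply = call_model(prompt)
    # Be defensive: pull the first integer out of the model's free-form reply.
    for token in reply.split():
        if token.strip(".").isdigit():
            return int(token.strip("."))
    raise ValueError(f"judge returned no score: {reply!r}")

def passes(output: str, criteria: str,
           call_model: Callable[[str], str], threshold: int = 4) -> bool:
    """A test 'assertion' for non-deterministic output: pass if score >= threshold."""
    return judge(output, criteria, call_model) >= threshold
```

Because `call_model` is just a callable, the same harness runs against any provider, and can be stubbed out in unit tests of the harness itself.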
The really cool thing about evaluation models is that they're extremely versatile. You can use them at various points of the developer life cycle, from CI/CD and testing to real-time checking and quality assurance. We think this second category is where the future will be.
An example architecture diagram from an engineer who used Halluminate
So then who exactly are your customers?
Our customers are primarily everyday AI engineers. This is a new job family that's starting to gain recognition. You have traditional developers who've historically worked on normal software systems, and you have machine learning scientists and data scientists who have built and trained models. But now we're seeing the rise of AI engineers. These are traditional engineers who often have little to no machine learning background but are now responsible for implementing LLMs and other machine learning systems into their code, largely because these technologies have become so accessible through APIs.
Our core early customers are AI engineers working in startups - typically Series A-plus AI startups - and applied machine learning teams in fintech. These engineers are at the point where they're asking, "How do we build production-scale systems?" They've moved past proof of concept and are scaling their products, which is where testing and quality assurance become critical. Once you start reaching product-market fit and scaling, that's when engineers realize they can't rely on traditional test scripts anymore. That's when model-driven testing becomes necessary, and that's when they start turning to tools like Halluminate.
A photo of our most recent Hackathon with a community of AI engineers at The Commons!
Are there any customer success stories that you want to share?
One engineer we work with is building an AI agent for SEC analyst reports and faced the challenge of manually validating their system's outputs across criteria like accuracy, coherence, and regulatory compliance. They were spending hundreds of hours and over $100,000 in human labor recruiting individuals, often MBA graduates, to check their product outputs. They knew this process was not scalable, so what we're currently doing is building automated evaluation models, closely correlated with their custom human evaluation process, that they can deploy within their product.
Our process involves identifying key evaluation criteria and labeling a set of examples to train this automated evaluation model, which is then deployable as a serverless endpoint for automated testing. This drastically cuts their review time, allowing them to focus on the key issues that produce IP rather than manually reviewing everything. While some human oversight remains, our solution reduced their workload by more than 90% and improved efficiency.
At what point does it make sense for developers to start using Halluminate for model testing?
Great question. Ideally, anyone building autonomous AI agents is facing this issue and should adopt model-driven testing because, in programming, we're taught that test-driven development ensures higher quality. That said, in reality, most engineers don't prioritize testing early on - they just want to build something and see if it works.
As for when Halluminate is more essential, I'd start by distinguishing between AI agents and simpler LLM applications. While both benefit from testing, it's far more important for agents. Agents are autonomous systems that chain multiple LLM decisions together, and each decision introduces uncertainty. The more steps involved, the higher the chance of failure, so testing becomes crucial to make sure everything functions properly. In contrast, something like a single retrieval-augmented generation (RAG) solution has only one potential point of failure, making testing simpler but still necessary.
Also, in high-risk industries - like banking or healthcare - where AI agents are making critical decisions, robust testing is absolutely necessary. These are environments that demand reliability, and deploying systems without extensive testing could have serious consequences.
On the flip side, you might have use cases like AI-powered SDRs (sales development reps), where the goal is quantity over quality, and a few mistakes won't really hurt. These don't require as much rigorous testing since the risks are lower and the deployment is often internal. The need for testing really depends on the context and risk level of the application.
Could you define "model-driven testing" for us?
Model-driven testing is simply the act of using one model to check and grade the outputs of another model. You can also think of it as probabilistic or non-deterministic testing. Normally, when you write a function, you get a deterministic output, and you might write an assert that says, "Given this input, I expect this output." But with large language models, their responses are non-deterministic. So, if you ask a model to summarize an article, for example, you can't just check that with a simple assertion.
Traditional natural language metrics like ROUGE or BLEU were fine for previous natural language tasks, but they don't really work for generative systems, because these methodologies often rely on having pre-written ground truth, which is not a scalable approach. The solution is to use foundation models as evaluators. Some people call this "LLM-as-a-judge." You essentially set up evaluation models with rules to grade the outputs of your AI system based on parameters you define.
We believe that in the future, engineers will move away from writing test scripts and instead focus on defining and deploying evaluation models to perform testing at scale. It's a shift in how testing and quality assurance are approached as we build increasingly autonomous software.
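One practical consequence of non-determinism, implied above, is that a single run proves little: a non-deterministic check is better asserted statistically, over many trials. A minimal sketch of that idea - the function names and thresholds are illustrative assumptions, not a Halluminate API:

```python
from typing import Callable

def pass_rate(run_and_grade: Callable[[], bool], trials: int = 20) -> float:
    """Run a non-deterministic check several times and return the pass fraction.

    `run_and_grade` would typically invoke the agent and grade the result
    with an evaluation model; here it is any callable returning pass/fail.
    """
    return sum(run_and_grade() for _ in range(trials)) / trials

def assert_reliability(run_and_grade: Callable[[], bool],
                       threshold: float = 0.9, trials: int = 20) -> None:
    """Statistical replacement for `assert output == expected`:
    fail the suite only if the pass rate over many trials drops below threshold."""
    rate = pass_rate(run_and_grade, trials)
    assert rate >= threshold, f"pass rate {rate:.0%} below {threshold:.0%}"
```

The design choice is simply to move the assertion from one output to a distribution of outputs, which tolerates occasional model variance without hiding systematic failures.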
How significant do you think the market potential for AI evaluation models is, and what impact do they have on the overall cost for companies using AI agents?
We've spoken to several AI agent companies that are already using evaluation models or LLM-as-a-judge systems. What we found is that these evaluation systems can account for up to 40%, and on average 20-30%, of the total inference cost for those companies. That means for every dollar spent on inference, around 20 to 30 cents are going toward evaluation models, whether it's asynchronous evaluations or real-time monitoring powered by LLMs.
This stat shows how big the market could be, especially as companies realize how crucial it is to ensure quality and reliability through automated evaluations. Just like how 40% of software development costs are tied to testing and quality assurance, we see a similar correlation here. This is a key reason why we're so confident in the size and diversity of this market.
Our worldview is that the future of model inference is actually split into two categories: one family of models that "generates" output, and another family that rigorously assesses and grades those outputs. We want to build that second group.
The future of inference ft. @Jerr_Wu
Full episode: youtu.be/F5rQ2CwdKKc?si…
#AI #ArtificialIntelligence #data #AIrebels #AIrevolution #MachineLearning #coding #llm #podcast #aipodcast #podcastshow
- AI Rebels Podcast (@ai_rebels)
12:04 PM • Oct 2, 2024
What do you think needs to change for industries with higher risks, like banking, healthcare, or defense, to fully adopt LLM-based systems for critical tasks?
There are three main factors: cultural, technical, and regulatory. Culturally, people need to become more comfortable with the idea that some systems we use might not be perfect. Many users of ChatGPT are already aware of this non-determinism, but it will take time for everyone to fully accept it.
Technically, systems themselves need to become more reliable. Right now, a lot of agentic systems aren't working as expected, and there's a big push to fix this through better development and infrastructure. Companies like Halluminate, AgentOps, and Toolhouse are all working to make these systems more accurate. Another way to mitigate non-determinism is in how it's presented - either by clearly warning users about potential errors or by using humans in the loop (service-as-software) to validate outcomes, which could be key to scaling these systems.
Lastly, from a regulatory perspective, industries like banking are waiting for clearer guidance. They want to innovate but are hesitant to roll out new products for fear of regulatory backlash. For example, at Capital One, we shifted from customer-facing products to internal tools due to this uncertainty. This is particularly relevant in highly regulated industries like fintech, healthcare, and defense. While Halluminate is working on the technical side, the cultural and regulatory shifts will come in time.
Can you break down the competitive landscape in your space, and how Halluminate is differentiated from other players in the AI Agent management space?
First, a lot of people assume we compete in the observability platform and developer framework space. However, observability platforms focus on collecting signals and helping you understand what's happening in your system. In contrast, Halluminate is focused on producing higher quality signalsâspecifically, the evaluation models themselves.
Many of these platforms allow you to plug in your own custom evaluations, and that's where Halluminate comes in. We focus on helping engineers build those evaluation models. We're not competing with these platforms but complementing them, providing the ecosystem and library of evaluation models needed to create custom evaluations that plug into them. In fact, some users are already doing that with us, integrating our results into platforms like Braintrust and AgentOps.
Second, within this space, there are others starting to build off-the-shelf evaluation models. But when we talk to engineers, about 90% of them need evaluations that are highly specific, custom, and non-generalizable. That's our focus - empowering engineers to build their own models for these custom, unique use cases. We believe this is critical because every software application is different, and no one-size-fits-all solution can predict exactly what an engineer will need.
Third, we're laser-focused on AI agent and multi-agent applications. We want to skate to where the puck is going, and we believe building for fully autonomous applications and architectures is the highest leverage, because this is where software will converge in the next 5 to 10 years. This has really interesting implications for the product as we start thinking about what sorts of evaluations agents may need (decision-making, tool usage, management capabilities, security, etc.).
What is Halluminate's product roadmap for the next 6 to 12 months, and what are your main focus areas for growth?
We're taking a developer-led advocacy approach, and we're focusing on building out a more robust suite of developer-facing tools for our platform. Right now, we've already built a library of LLM-as-a-judge evaluation models, as well as a streamlined "create your own" custom methodology.
In the short term, we're launching two key features. The first is auto-alignment, where engineers can bring good and bad examples (essentially their test script) of what their system should be doing, and we'll automatically train a judge that aligns to those labels and criteria with high accuracy. The second is iterating on that process to improve alignment through better research techniques - making the models cheaper, faster, and more efficient to run.
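The auto-alignment idea can be illustrated with the simplest possible version: given human-labeled good/bad examples and a judge that emits numeric scores, search for the decision threshold that maximizes agreement with the labels. This is a toy sketch under my own assumptions - the real feature presumably adjusts far more than a threshold:

```python
def align_threshold(scores: list[float], labels: list[bool]) -> tuple[float, float]:
    """Pick the judge-score threshold that best agrees with human labels.

    scores: judge scores per labeled example; labels: True = human says "good".
    Returns (best_threshold, agreement_fraction). Toy illustration of
    alignment, not Halluminate's actual auto-alignment method.
    """
    best_t, best_acc = 0.0, 0.0
    for t in sorted(set(scores)):  # candidate thresholds: observed scores
        predictions = [s >= t for s in scores]
        acc = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

Even this toy version captures the workflow in the answer above: the engineer supplies labeled examples, and the system fits the judge to agree with them.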
In the medium to long term, we want to become the place any engineer goes when they think about LLM-as-a-judge evaluation. There's an educational component here as well - helping the community understand how to build these evaluation models effectively. Ultimately, we envision Halluminate as the "Postman for AI agents," making it easier to test AI agent systems the same way Postman streamlined API testing.
Our goal over the next year is to scale the platform, grow the developer base, and keep evolving the product based on feedback from the community.
What was the hardest technical challenge in building Halluminate?
The hardest technical challenge, which we're still working on, is ensuring that the evaluation models used for testing are accurate. The tricky part is making sure they're aligned with human judgment, which is often messy and inconsistent. Accuracy comes down to three steps: defining what accuracy means, collecting data on what's considered accurate, and building a model that can reliably check for that accuracy.
Defining accuracy is a challenge in itself. For instance, if I ask you what makes for a good customer service response, your criteria might be different from what a service agent would consider important. While we can usually find a decision-maker in an enterprise setting to give us a clear definition, getting to that point can involve a lot of back-and-forth.
The second challenge is data collection, which is labor-intensive but not particularly technical. The real technical challenge is the third step: fine-tuning an evaluation model to be both accurate and aligned with human expectations. We're excited about emerging research directions here that can boost accuracy. This can be more expensive but leads to more consistent results.
What are the specific research techniques or directions that you find promising for improving model evaluation and judging accuracy?
Right now, there are three main areas we're exploring.
First, there are a lot of advancements in prompting protocols and orchestration techniques that can improve judge accuracy. One simple method we're testing is the "jury" approach - using multiple judges, averaging out their scores, and normalizing the distribution. We're also looking at chain-of-thought protocols, where the judge self-reflects and revises its judgments; we've already seen a 4-5% accuracy improvement with this method in our internal testing. Another emerging technique is the "debate" protocol, where multiple judges debate the correct output before reaching a final decision. This concept, based on research presented at ICML this year, seems really promising and could offer even higher alignment.
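The jury approach above reduces to a few lines once each judge is a callable that returns a score; the names, signatures, and 3.5 verdict threshold here are illustrative assumptions, not a real API:

```python
from statistics import mean
from typing import Callable

def jury_score(output: str, judges: list[Callable[[str], float]]) -> float:
    """"Jury" evaluation: poll several judge models and average their scores.

    In practice each judge would be a separate LLM-as-a-judge call (possibly
    different models or prompts); averaging smooths out any one judge's noise.
    """
    return mean(judge(output) for judge in judges)

def jury_verdict(output: str, judges: list[Callable[[str], float]],
                 threshold: float = 3.5) -> bool:
    """Pass/fail decision from the averaged jury score."""
    return jury_score(output, judges) >= threshold
```

(The normalization step the answer mentions - rescaling each judge's score distribution before averaging - is omitted here for brevity.)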
Second, improving judging isn't just about prompting - it's about integrating business context and relevant data into the decision-making process. For example, in the case of an AI agent acting as an SEC analyst, the judge needs to incorporate knowledge from actual SEC analysts or specific data sources to make informed decisions. We're looking into how knowledge bases and real-time data can be integrated into the judging process to improve contextual understanding, making these systems almost like applications in themselves. For instance, if you're using a fact-checking AI, you could hook it up to a web API to validate real-time facts, which would significantly enhance judgment accuracy. We're really excited about this and think tool-based judges are a vastly unexplored green space.
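The tool-based fact-checking judge mentioned above could look roughly like this; `lookup` is a hypothetical stand-in for whatever retrieval tool (knowledge base, web API) the judge is hooked up to:

```python
from typing import Callable

def fact_check_judge(claims: list[str], lookup: Callable[[str], bool]) -> float:
    """Tool-based judge: verify each extracted claim against an external source.

    `lookup` returns True if the tool confirms a claim; the judge's score is
    the fraction of claims confirmed. Illustrative sketch only - claim
    extraction and the tool itself are out of scope here.
    """
    if not claims:
        return 1.0  # nothing to contradict
    return sum(lookup(claim) for claim in claims) / len(claims)
```

The point of the design is that the judge's verdict is grounded in data the generating model never saw, rather than in the judge model's own parametric knowledge.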
Lastly, we're exploring fine-tuning models through synthetic data generation, but that's further down our roadmap, since we think the biggest gains will come from the first two approaches. Fine-tuning is becoming a more straightforward but expensive task, so we're more focused on optimizing prompting and data integration right now.
Tell us a little bit about the team at Halluminate. What kind of culture are you building, and what kinds of roles are you hiring for?
We value being in the same space because it helps us riff on ideas more effectively. We organize social events - happy hours, mini golf, walks, or runs together - because we believe in building both intellectual and emotional bonds within the team.
On the engineering side, we really value execution. While research is important, we think the more valuable route is productizing research breakthroughs faster than anyone else. So we're looking for strong engineers with machine learning intuition, not necessarily deep ML scientists. We want people who can read and understand research and help us bring it to market quickly. The great thing about foundation model research today is it's more accessible, especially in the language space, so it's easier for everyday engineers to reproduce than, say, deep neural network research from earlier years.
Ultimately, my north star belief is that the great companies of tomorrow will be built around excellent generalists who can use AI tools - we're seeing the first crop of startups in an era of companies that can be truly AI-native. Execution of entire work streams may soon become commoditized by tools; it's truly an exciting time to build. There's been a lot of talk about when we will see the first one-person billion-dollar company, but before that I think we will see a few 10-person billion-dollar companies. We believe the best way to get there is to hire generalists who understand not just how to build, but what and why.
Conclusion
To stay up to date on the latest with Halluminate, learn more about them here.
Read our past few Deep Dives below:
If you would like us to "Deep Dive" a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.