
Comet is Closing the Loop Between Production and Development for AI Agents

Plus: Comet CEO Gideon Mendels on launching Opik, why agent eval needed its own product, and the future of self-improving agents...

CV Deep Dive

Today, we're talking with Gideon Mendels, Co-Founder and CEO of Comet.

Comet is the company behind Opik, the fastest-growing open-source project for agent observability, testing, and optimization - now used by more than 150,000 builders.

Comet has been building AI developer tools since 2017, originally for teams training their own models. As users shifted toward building on top of LLMs from providers like OpenAI and Anthropic, Comet spun out Opik as a distinct, open-source product purpose-built for the new workflow. Today Opik covers the full agent development lifecycle - tracing, evaluation, test suites, optimization, and production observability - and powers AI teams at Uber, Netflix, Autodesk, NatWest, and more.

Comet has raised $63M across three rounds, including a $50M Series B led by OpenView Venture Partners.

In this conversation, Gideon explains why LLM tooling needed its own product rather than a feature bolted onto MLOps, walks through the workflow Opik enables, and unpacks the new Opik launch - including Test Suites, a new coding harness called Ollie, and an Agent Playground. He also shares where he sees agent development heading next: toward self-improving systems where production failures automatically flow back into a better agent.

Let's dive in ⚡️

Read time: 8 mins

Our Chat with Gideon 💬

Hey Gideon! For folks in our audience who haven't crossed paths with Comet yet, give us the 60-second version. What does Comet do, and who is it for?

We're Comet. We're the company behind Opik, which is the fastest-growing open-source project for agent observability, testing, and optimization. It's designed for anyone who's building agents or, for that matter, any LLM-powered application. It covers the end-to-end workflow from early development, testing, and evaluations through to production observability. Most importantly, our focus is making it really easy to improve your agent based on what you see in production.

Opik is open source, and there are over 150,000 builders using it. On the commercial side, we power some of the best AI teams out there - Uber, Netflix, Autodesk, NatWest, and a bunch more.

The AI tooling landscape has shifted dramatically over the past two years. Your users used to train their own models, now they're building on top of existing LLMs. How did that shift change what Comet needed to be?

Our origins were very much focused on teams building and training their own models - we've been around for almost eight years. We onboarded a lot of amazing model builders. About two and a half years ago, we started having conversations with customers where they were saying, "Hey, for this use case, instead of training our own model, we're going to try to build it on top of OpenAI or Anthropic." They were trying to use the same ML experiment tracking tool to do that, because a lot of things are similar - you're changing a lot of parameters, you want to evaluate it - but you're also missing a bunch of functionality that's more tailored to this new workflow.

At first we shipped some features to try to alleviate some of that pain. Then it became pretty obvious that while there are a lot of synergies and similarities between the workflows, they're also distinctly different in many ways. Trying to pigeonhole it into the MLOps product didn't work, so we decided to focus on a new product - Opik - specifically. We brought in the experience we had doing this, both from the pain points and workflows and from running these systems in production for some of the biggest AI teams in the world. We took everything relevant from the MLOps product and built Opik on top of that.

That trend is continuing. You still see teams building models - self-driving car people, foundation model builders - but the vast majority of use cases today are LLM-API-based.

Was there a specific customer conversation that sparked the shift, or was it more a pattern you were seeing across the market?

It was definitely customer-driven. Customers came to us and said, "Hey, this is what we're using it for - the use case Opik solves today." They needed to see tracing and prompts, which didn't exist in the same way in MLOps. The first thing we shipped was a way to view input prompts and outputs. Then they'd say, "Okay, but now I want to run this evaluation, and I'm no longer looking at hyperparameters, though I still have a dataset."

Every time we got another feature request, it started to feel like: this product was designed to do something very similar, but different. At some point you look at it and say, "This is distinct enough that we want a clean codebase to build it in the right way."

What problems does Opik solve that existing tools weren't, and why did you build it as a distinct open-source product rather than bolting it onto Comet?

When we looked at what was out there, there were a few tools focused on observability for LLMs, and a few focused on evaluations. A lot of this stems from our experience with ML models, but these concepts - observability and evaluations - aren't as useful if you don't know how to tie them together.

With observability, you have a dashboard with millions of traces. What do you actually do with it? In software with APM, you see errors - that's a good way to start, because you see where your backend is throwing exceptions. But here, the failures are often very silent. It's hard to make sense of massive amounts of traces. With the evaluation products that existed back then, they'd tell you, "Here's how well your LLM application is performing on this metric." It doesn't really tell you how to fix something. This prompt gets you 5% better on this - okay, what do I do with that? It's not actionable.

Our goal was to really tie these concepts together in a simple, easy-to-use, but very powerful workflow. At a high level, the workflow is: you automatically collect failures from production. Those get automatically tagged by online evals, which aren't perfect but are good enough to surface issues. Those get added to a dataset. Then you run an optimization to fix the failures and release a new version. It's a flywheel - the more you do it, the better your agent becomes.
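The flywheel Gideon describes can be pictured as a small loop. This is a minimal sketch, assuming stand-in functions throughout - `online_eval`, `collect_failures`, and `optimize` are hypothetical names, not the Opik API, and the judge here is a trivial heuristic standing in for an LLM-based one:

```python
# Hypothetical sketch of the production -> development flywheel described
# above. All names are illustrative; this is not the actual Opik API.

def online_eval(trace: dict) -> bool:
    """Stand-in online eval: flag traces whose output came back empty."""
    return trace["output"].strip() == ""

def collect_failures(traces: list[dict]) -> list[dict]:
    """Tag production traces with the online eval and keep the failures."""
    return [t for t in traces if online_eval(t)]

def optimize(agent_prompt: str, failures: list[dict]) -> str:
    """Stand-in optimization step that releases a new agent version."""
    return agent_prompt + "\nAlways produce a non-empty answer."

traces = [
    {"input": "What is 2+2?", "output": "4"},
    {"input": "Summarize this doc", "output": ""},  # a silent failure
]
dataset = collect_failures(traces)                  # failures -> dataset
new_prompt = optimize("You are a helpful agent.", dataset)
print(len(dataset))  # 1 failure flowed into the dataset
```

Each pass through the loop - collect, tag, optimize, release - is one turn of the flywheel; the real system replaces each stub with production tracing, online LLM-as-a-judge evals, and an optimizer.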

On open source, it was an easy choice for us. The ML product we built before wasn't open source at its core, though we shipped a bunch of open-source packages. We've always been big fans, and we've benefited significantly from using a ton of open-source tools and libraries. In one sense, this was a huge opportunity to give back to the community.

From a strategic perspective, this market is moving faster than anything else we've ever experienced - every single day something big happens. I don't think any one proprietary software vendor, with 20 or 300 engineers, can ship fast enough. By going open source, we really benefit from contributions from the community - PRs, issues, conversations. It's pushing the product much faster than what our team could deliver on its own. These contributions come from a wide range of people, from indie hackers building stuff on the weekend to the biggest companies in the world running Opik. It's a big reason why Opik has been growing so fast, both in adoption and in scope, quality, and product surface area. It's like having superpowers - we have all these people behind us helping.

If I'm an engineer shipping an LLM-powered feature tomorrow, what does Opik give me that I wouldn't have otherwise?

The baseline workflow, if you're not using Opik or something like it, is that you have an agent or LLM-powered feature in production or in test, and you're somehow made aware of an issue. Whether it's a tester or someone complaining about it, it's usually a person flagging it. You try to reproduce the issue - which isn't always trivial - try to fix it, and ship a new version.

The problem is, these systems are so complicated, with so many moving parts and a bit of stochasticity in there, that if you fixed an issue today, you don't know if you broke 10 other things. There's no testing whatsoever. On top of that, you're only fixing the things you were made aware of. Everyone walks around with the feeling: I'm sure there's a bunch of stuff that's broken that I'm not aware of. It's hard to make meaningful progress that way - you've got a whack-a-mole problem, and you don't have a good idea of where it's failing or succeeding.

With Opik, it's much easier. You get full traces of every interaction with full granularity: every tool call that was made, every API call, every LLM response - with all the details you need: inputs, outputs, context, timing.

If you have a conversational agent, every turn is a trace, and every trace can contain multiple spans. Each span is a tool call or an LLM call. Those traces get automatically tagged by online LLM-as-a-judge. We have some built-in ones, but you typically want to configure them for your use case.
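As a rough mental model of that turn/trace/span hierarchy - plain data classes with assumed field names, not Opik's actual schema, and a deterministic stub standing in for the online judge:

```python
from dataclasses import dataclass, field

# Illustrative data model for the structure described above: one trace per
# conversational turn, multiple spans per trace. Field names are assumptions.

@dataclass
class Span:
    kind: str    # "tool" or "llm"
    name: str
    input: str
    output: str

@dataclass
class Trace:
    input: str   # one conversational turn
    output: str
    spans: list = field(default_factory=list)
    tags: list = field(default_factory=list)

def judge_tone(trace: Trace) -> None:
    """Stub online LLM-as-a-judge that tags a trace after the fact."""
    if "sorry" in trace.output.lower():
        trace.tags.append("apologetic")

turn = Trace(input="Where is my order?", output="Sorry, I couldn't find it.")
turn.spans.append(Span("tool", "order_lookup", "order_id=42", "not found"))
judge_tone(turn)
print(turn.tags)  # ['apologetic']
```

Because tags live on the trace, queries like "show me every trace tagged this way" fall out naturally - which is what makes the Spanish-language discovery described below possible.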

Because you have all this data, it also solves the reproducibility problem. If you want to generate the same output - imagine the agent has context, a bunch of stuff happened before - it's really hard to get back to the exact same point. You get that for free, because we have the full trace and can feed it into your agent step by step exactly as it happened.

Because everything is tagged, you can easily say, "This didn't happen just once - there are 10 other times this happened, and this is an error I'm seeing in so many other ways." One of our customers is one of the biggest pharma companies in the world. They have an English-based conversational agent for their customers, and they realized they were starting to get a bunch of questions in Spanish. You'd never think about that, but they realized there was actually a business opportunity in Spain. That's a simple example - it's very easy to determine English versus Spanish - but they just hadn't known before.

Then the way you go about fixing the issue is different. We have our own coding harness specifically designed for building agents called Ollie. It's similar to Claude Code - running on top of LLMs from top providers - and it can drive code changes, prompt changes, everything Claude Code can do. The difference is that, unlike Codex or Claude Code, it has full access to everything in Opik: all the trace data, any historical changes that happened before, really everything. Fixing issues is much easier, because it's not flying blind and creating more problems when it tries to fix something.

The first thing it does is create a test automatically - an assertion like, "The agent should have pulled this data from the production DB versus the BI DB." Then it comes up with a fix, verifies the test passes, and - since this is probably not the first or last issue you're going to fix - it reruns the entire test suite every time. You're confident you don't have regressions. You don't have to manually create tests or evaluation datasets and label them. You just fix stuff like you normally do, with the confidence that you didn't break something you fixed before. It's a magical experience, similar to coding with unit tests - when you make a change, you know you didn't blow up the whole thing. You can move a lot faster.

Agent Optimizer, Guardrails, online evaluations - you've shipped a lot of surface area quickly. What's driving the roadmap? How much is community pull versus your own conviction about where the market needs to go?

It's a combination, but I'd say 80/20 toward community pull. Community input, PRs, and customer feedback tend to focus on more incremental improvements: "This feature is great, I wish it could do that too," or, "Can you add a feature that helps us with this use case?" That's super powerful, and a good chunk of what we ship - if you look at our release history - is that. At the end of the day, that's what makes the product great.

There's a component, especially when you talk about Ollie and how we think about agent optimization, that comes from our own conviction. This is brand new - no one was building agents two years ago, and I'm being conservative. When you look at the workflow and some of the approaches other tools take, it's a mishmash of, "This is how you do it with ML, this is how you do it with software engineering," and it doesn't always fit. We try to think a year, a year and a half from now - how is this going to look?

One insight that led us to build optimization: I have some way to tell how well my agent's doing - tests, evaluations. My agent is a combination of code, system prompts, tool descriptions, API calls. As a human, I go in and try to change that configuration to get a better result. Why can't we learn that? Why manually do all these things? From an ML lens, it's nuts - it's like manually wiring neural network neurons instead of using SGD. That's not something customers asked us to build. They said, "How do we make this agent better?" We came up, through a lot of research and work, with our own approach.

You've got a big launch this week. Tell us what you're releasing, and why this is a significant moment for Comet.

This is a big launch for us - we've been working on it for a long time. It all stems from being very close to the people building agents over the last year and a half, and identifying not just where there are bugs or friction in the product, but where there are opportunities to do 10 times - if not 100 times - better.

The first thing is Test Suites. Evaluations aren't a new concept - we've been doing them for a long time. What we found is that almost no one creates evaluation datasets. They're really hard to create. You have to figure out the input, which is the trace with all the context and steps before it. You have to come up with the expected answer. You have to come up with a way to measure the difference. Some of our bigger customers have teams with subject-matter experts to do this, and when you do it, it works well. Most people don't.

Test Suites is a very different approach. When you fix something - which you're already going to do - we create a test for it, and then we can run regression tests for you for free. You don't have to do any additional work. The more you do it, the more confidence you have in your application. If you want to switch to a different LLM provider tomorrow, you run your test suite and you immediately know if you pass or not. There's some squishiness, because the assertion is free text and whether it passes is determined by an LLM-as-a-judge, but we have functionality for multiple runs to make sure you're not hitting some sensitivity with the LLM provider. Test Suites is fully open source.
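The multiple-run mechanism Gideon mentions can be sketched as a majority vote over repeated judge calls. Everything here is a stand-in: `llm_judge` is a deterministic stub where the real system would call an LLM-as-a-judge on the free-text assertion, and the function names are hypothetical:

```python
from collections import Counter

def llm_judge(assertion: str, agent_output: str) -> bool:
    """Stub judge. In practice, an LLM decides whether the free-text
    assertion holds for the agent's output."""
    return "production DB" in agent_output

def run_test(assertion: str, agent_output: str, runs: int = 5) -> bool:
    """Majority vote across several judge runs to smooth out judge noise."""
    votes = Counter(llm_judge(assertion, agent_output) for _ in range(runs))
    return votes[True] > votes[False]

passed = run_test(
    "The agent should have pulled this data from the production DB",
    "Fetched revenue figures from the production DB.",
)
print(passed)  # True
```

With a real (stochastic) judge, repeating the call and taking the majority is one simple way to keep a borderline verdict from flipping a test run to run.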

The second component is Ollie, our coding harness. Ollie is based on Pi, which is a great coding harness. We run it in a fully sandboxed environment. It's deeply connected to everything in Opik, so it can fetch all the information - but not only fetch. It can control the UI as well. If you say, "Are there any other traces that fail the same way?" it won't just print a list. It will literally render the UI with those traces. The experience is really smooth, baked in with a lot of things useful for building agents - skills on how to write prompts, documentation for top agent frameworks like LangGraph and ADK.

It's a unique approach. You have the agent in the UI, but then you run a small local executable on your dev machine that can do tool calls, file reads, file edits, and so on.

The last thing is our agent playground. Often the interaction layer with these agents is Slack, or it's headless - it's not trivial to get to a point where you can query or trigger it. We provide full support for that. You add some instrumentation to your function entry point, and we automatically create an endpoint for it. You can query it from the UI, see inputs and outputs. It's very easy to interact with.
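The instrumentation pattern he describes - mark a function entry point, get a queryable endpoint - could look something like the following. The decorator name and registry are hypothetical, purely to illustrate the idea, not Opik's API:

```python
# Illustrative sketch: registering an agent's entry point so a UI (or test
# harness) can trigger it by name, instead of going through Slack or a
# headless job. Names are assumptions, not the actual Opik instrumentation.

ENDPOINTS = {}

def playground_endpoint(fn):
    """Register a function so the playground can query it by name."""
    ENDPOINTS[fn.__name__] = fn
    return fn

@playground_endpoint
def support_agent(question: str) -> str:
    # The real agent logic (tool calls, LLM calls) would live here.
    return f"Answering: {question}"

# The playground can now trigger the agent directly and show input/output:
print(ENDPOINTS["support_agent"]("Where is my order?"))
```

The point of the pattern is that the interaction layer becomes a lookup plus a call, so inputs and outputs are trivially visible from a UI.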

What was the hardest part of this to build, or the most rewarding moment?

We tried for a very long time to give people value while working around code changes. We said, "Okay, we can control the prompts, we can control the tool descriptions, we can do all these things." The reality is, too many of the issues - fixing failures in production - require real coding changes. We explored a lot of options, including doing it as a Claude Code plugin, and went back and forth on how to do it.

Someone on the team did a POC of our own coding harness - whipped it up in 24 hours, a while ago. It's one of those things where you start using it and go, "Holy shit, this is it." As soon as everyone got on it, we put a team on it. We also had to go back and change a bunch of stuff - redo optimizations to work with code changes, all of that. A lot of the power obviously comes from the underlying LLM, but seeing how it navigates the UI, makes code changes, reruns them, tells you how it looks - it's a magical experience. Once you see it, you know.

Was there a moment like that when you showed it to the first users?

We had design partners who were part of the whole process, but then we released it in production with a small canary deployment for a very limited set of users. We use Opik to manage our own Ollie agent, so you go in and see how people are using it. This wasn't that long ago - maybe a week - and you see the curve of how they're using it and how much. You go user by user, and suddenly someone has 20 tests, and you're just watching them asking questions, changing stuff, adding tests.

What challenges will this launch address for AI developers that other tools don't solve today?

A few things. First, you can build and improve your agent without worrying about breaking 10 other things. You can do it while knowing what the most common failures are - instead of randomly fixing the one that the loudest person yells about. You can do it in an experience where you don't have to constantly copy-paste things into Claude Code. It's a really natural workflow.

How does this launch fit into the longer arc of where Opik is heading?

Our goal is to close the loop between production and development to the point where you have fully self-improving agents. For use cases like OpenClaw, where you're both the user and the owner, it's a little bit easier - it has memories, it remembers things. If it's a commercial agent, you don't want your end users baking things into memories that will impact other users.

Our goal is to get to where things that happen in production automatically funnel into a better agent moving forward. This launch is a massive step in that direction. There's still a human in the loop for a lot of things, but it does get us one step closer.

How does agentic AI being the topic of the moment change the tools developers need to build production-ready AI? What's the concrete difference between debugging a single LLM call and a multi-step agent?

When you have an application with a single LLM call, the LLM response has some stochasticity, but other than that, your workflow is mostly deterministic. There are going to be failures, but they're pretty contained. You'll know immediately if a failure came from the LLM or something else.

When it's truly an agentic system, where the agent decides which tool calls to make and how to respond, the more agency it has, the more complexity, the more failures, and the harder it is to fix these things.

Where do you see the agent development space heading over the next 12–18 months? What will the best AI teams be doing that most teams aren't doing yet?

The agents of the future will be self-improving, self-evolving. There will still be times and places where you want a human in the loop, a developer going in and changing things. But as time goes by, that's going to happen in fewer and fewer places. Today, as a developer, you're still doing a lot of things manually that can be automated.

If we think about what a production agent looks like a year or two from now, it's going to be a lot more autonomous. Not necessarily in the sense that it can do things it can't do today, but in terms of dealing with issues and failures - versus having a human review some logs and fix them.

The fact that people go in and change prompts or some code to get a result we already know should happen - that seems wrong to me. An analogy: in the MLOps/ML world, all of our customers retrain their models on new production data on a weekly or monthly basis. That's the baseline. The more data you have, the better your model gets. But with these agents, you could have one user or a billion users, and it doesn't get better automatically. That's the future - and all this manual work a bunch of us are doing is something we're going to look back on and think, "I can't believe we actually did this manually."

If someone's reading this and wants to try Opik today, what's the fastest path to value?

The easiest is to use the hosted product. Go to comet.com - it's completely free. Sign in with GitHub and you'll be up and running in about 10 seconds. If you want to self-host, go to our GitHub repo, and you can deploy it on your laptop in about five minutes. Those are the two easiest paths.

Any final takeaways for the CV audience?

If you're building an agent and you're not doing evaluations - because it's hard, because it's complicated, because you didn't see value in it - try Test Suites. It's a much more productive way to get very similar outcomes. And it's easy.

Try Opik today at comet.com or explore the open-source repo on GitHub.


If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.
