HoneyHive, when AI Eval is mission critical
Plus: Co-founders Mohak and Dhruv on their recent $7.4M seed and pre-seed funding round...

CV Deep Dive
Today, we're talking with Mohak Sharma (CEO) and Dhruv Singh (CTO), co-founders of HoneyHive.
They met as roommates at Columbia, each forging their own path through data science, software engineering, and AI. Mohak spent time building out entire data and AI platforms at Templafy, while Dhruv explored everything from observability in Office365 to early AI prototypes at Microsoft. Years later, they joined forces to tackle one of the industry's biggest questions: How do you move AI from toy prototypes to robust, enterprise-grade production?
Their answer is HoneyHive: a new "DevOps stack" for AI that focuses especially on agentic workflows and the complex, multi-step logic that generative applications often require.
Key Takeaways
DevOps for AI: HoneyHive provides an evaluation and observability platform bridging development and production for AI applications. They aim to close the loop between building an LLM-based pipeline and monitoring it live, so teams can iterate systematically.
Agentic Evals: Modern AI products often incorporate "agents" that call multiple tools, retrieve data from vector databases, and combine logic over multiple steps. HoneyHive is designed to track and evaluate these multi-step pipelines.
Enterprise-Grade Deployments: While HoneyHive started with startups, the team now also works with Fortune 100s that demand on-prem, air-gapped, or single-tenant deployments for compliance.
Scaling with Big Context: Large context windows mean bigger logs, bigger traces, and more complex debugging. HoneyHive re-architected observability to handle huge volumes of AI "events," including multi-megabyte spans.
Future-Proofing: Mohak and Dhruv assume the AI world will keep changing at a rapid pace, so they built HoneyHive to adapt to new modalities (voice, advanced tool usage) and new deployment models.
Today, HoneyHive announced they've raised a $7.4M seed and pre-seed from Insight Partners, Zero Prime Ventures, and AIX Ventures.

Let's dive in ⚡️
Read time: 8 mins
Our Chat with Mohak and Dhruv 💬
Mohak and Dhruv - welcome to Cerebral Valley! Let's start off with the journey that led you to co-founding HoneyHive.
Mohak: I studied operations research at Columbia, focusing on data science and math. Dhruv was my roommate and studied computer science, so we were both heavily into building and coding. At Columbia, we always talked about starting a company, but it never quite happened. After graduation, I landed at Templafy in New York, building what they called a document automation platform. By the time I left, I had ownership of the data platform, the ML platform, and an AI assistant that I helped spin up. They went from Series B to D, so I got to see quite a bit of growth, but eventually I got bored. I really wanted to do something more product-focused and creative around AI.
Dhruv: I was majoring in CS and getting curious about AI. At first, I wasn't sure if it was real or just hype: would we actually get to AGI in my lifetime, or was it just a meme? Out of college, I joined Microsoft in their Office365 observability team. It was stable to the point that I questioned what value I was adding. Later, I moved over to an AI innovation group that was working with OpenAI in 2021, way before generative AI blew up in the mainstream. There I built a shell copilot tool using GPT Codex, kind of an early version of what we see everywhere now. It opened my eyes to how drastically app development could change with LLMs. Meanwhile, Mohak was itching to start something. We decided to finally join forces and build around these new paradigms we were each exploring.
Could you give us a HoneyHive elevator pitch?
Mohak: We call it an "evaluation and observability platform for AI applications," but that might not fully capture what we do. If you're building an AI pipeline (maybe you have retrieval from a vector database, some business logic that decomposes user queries, then a large language model call, then you combine the outputs), there's uncertainty at each step. We realized teams have almost no visibility into what's really happening in these multi-step processes. They do quick tests or "vibe checks" in a playground, but that doesn't scale. So we built HoneyHive to help them test thoroughly pre-production and observe systematically in production. The goal is to unify these environments so you can catch regressions and weird failure cases quickly, then fix them.
Dhruv: It's helpful to think of DevOps for AI. Traditional DevOps has clear ways to check logs, monitor metrics, do canary deployments, or run automated tests. But AI pipelines are new, with random or "stochastic" models that might behave differently from run to run. People do retrieval augmented generation (RAG), they orchestrate agent logic, they store intermediate outputs, and they might generate multiple steps that feed each other. Observing all that requires a specialized approach. That's where HoneyHive's design started: bridging the gap between typical software observability and the "ML mindset" of versioning, drift detection, and data set feedback.
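To make that concrete, here is a minimal sketch of the kind of multi-step pipeline Mohak and Dhruv describe, with every step traced so failures can be pinned to a specific stage. The helper names (trace_step, decompose_query, and so on) are illustrative placeholders, not HoneyHive's actual SDK:

```python
# Hypothetical sketch of a multi-step RAG pipeline where every step is traced.
# These helpers are illustrative placeholders, not HoneyHive's actual SDK.
import time
from typing import Any, Callable

TRACE: list[dict[str, Any]] = []  # in production, these events would ship to an observability backend

def trace_step(name: str, fn: Callable[..., Any], *args: Any) -> Any:
    """Run one pipeline step and record its inputs, output, and latency."""
    start = time.time()
    output = fn(*args)
    TRACE.append({"step": name, "inputs": args, "output": output,
                  "latency_s": round(time.time() - start, 4)})
    return output

# Placeholder step implementations; a real app would call a vector DB and an LLM here.
def decompose_query(query: str) -> list[str]:
    return [query]  # naive: no real decomposition

def retrieve_documents(sub_queries: list[str]) -> list[str]:
    return [f"doc relevant to: {q}" for q in sub_queries]

def merge_context(docs: list[str]) -> str:
    return "\n".join(docs)

def call_llm(query: str, context: str) -> str:
    return f"Answer to '{query}' grounded in {len(context)} chars of context"

def answer_question(query: str) -> str:
    subs = trace_step("decompose", decompose_query, query)
    docs = trace_step("retrieve", retrieve_documents, subs)
    ctx = trace_step("merge_context", merge_context, docs)
    return trace_step("generate", call_llm, query, ctx)

if __name__ == "__main__":
    print(answer_question("What does my policy cover for water damage?"))
    print(f"{len(TRACE)} steps traced")
```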
Which kinds of teams tend to adopt HoneyHive? Are they mostly small startups, or bigger enterprises too?
Mohak: When we started, it was definitely smaller AI startups. We'd get two-person teams with no funding that just liked the approach, or bigger ones around Series B and C. But around 2024 or so, enterprises realized they needed real ROI from AI; no more toy projects. Suddenly, we got calls from Fortune 100 insurance companies and big banks that wanted a reliable way to deploy generative AI in production. They needed to do large-scale experimentation, run load tests, watch for weird behaviors, and remain compliant with internal security policies.
Dhruv: One particular Fortune 100 insurance company is using HoneyHive across different divisions. They have an enterprise-wide push to incorporate AI into their workflows, but as soon as they tried to build out agentic or multi-step applications, they realized it's quite messy. There are no clear best practices for guaranteeing that no misleading or risky content ships to customers. That means they need to continuously evaluate the large, multi-modal, and often sensitive trace data that their applications produce to catch issues and collect better datasets. We help them unify all that behind a single platform. Meanwhile, we still support smaller teams, but we do see that big customers drive more sophisticated requirements. That in turn has shaped our roadmap.
How exactly do customers use HoneyHive to improve their AI systems?
Mohak: Let's say you're building a chat application for an insurance use case. Your pipeline might have a retrieval step from a vector DB, some re-ranking, a step that merges context, and a final LLM call. Each step is a potential failure mode. Maybe your question decomposition is off, or your RAG logic misses relevant documents. Or maybe the final model ignores context and goes off on a tangent.
What people do with HoneyHive is track how each step is performing. We let them define evaluation criteria and custom metrics. For example, they might want to see if a decomposition step is correct at least 80 percent of the time, or if the final LLM output remains faithful to the user's initial query. We unify these metrics so teams can see, "I changed the retrieval approach from naive nearest neighbor to advanced query decomposition; did that actually help?" Historically, they'd do a quick check in a playground or read a few sample outputs. That's not enough for real production scale.
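As a rough illustration of what such step-level criteria might look like in code, here is a sketch with placeholder checks. The dataset shape, the decomposition-accuracy metric, and the 80 percent threshold are assumptions drawn from the example above, not HoneyHive's API:

```python
# Hypothetical sketch of step-level evaluation criteria; not HoneyHive's API.
# Checks a decomposition step and final-answer faithfulness over a labeled dataset.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    query: str
    expected_sub_queries: list[str]
    answer: str  # final pipeline output for this query

def decomposition_is_correct(predicted: list[str], expected: list[str]) -> bool:
    return set(predicted) == set(expected)

def answer_is_faithful(answer: str, query: str) -> bool:
    # Crude placeholder; in practice this is often an LLM-as-judge or grounding check.
    return any(word.lower() in answer.lower() for word in query.split())

def evaluate(dataset: list[Example], decompose: Callable[[str], list[str]]) -> dict[str, float]:
    n = max(len(dataset), 1)
    decomp_hits = sum(decomposition_is_correct(decompose(ex.query), ex.expected_sub_queries)
                      for ex in dataset)
    faithful_hits = sum(answer_is_faithful(ex.answer, ex.query) for ex in dataset)
    return {"decomposition_accuracy": decomp_hits / n,
            "faithfulness_rate": faithful_hits / n}

# Gate a change on the 80 percent threshold mentioned above:
# metrics = evaluate(dataset, decompose_query)
# assert metrics["decomposition_accuracy"] >= 0.80
```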

Dhruv: Over time, we've noticed that as soon as people see that they can measure things systematically, they start trying more experiments. It's like a DevOps or data science mindset: when you have metrics, you iterate faster. They might do automated regression tests on a nightly basis with thousands of user queries, or sample real production queries and re-run them on new code branches. That drastically speeds up the feedback loop. By bridging dev and production in a single platform, we help them make iteration a normal part of building AI, not a big guess-and-check process every time.
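A hypothetical version of that nightly regression loop, with placeholder function names rather than HoneyHive's SDK, might look like this:

```python
# Hypothetical sketch of a nightly regression run: sample real production queries,
# re-run them through a candidate pipeline, and compare pass rates with the baseline.
# All callables here are placeholders, not HoneyHive's SDK.
import random
from typing import Callable

def sample_production_queries(production_log: list[str], k: int = 1000) -> list[str]:
    return random.sample(production_log, min(k, len(production_log)))

def pass_rate(pipeline: Callable[[str], str], queries: list[str],
              passes: Callable[[str, str], bool]) -> float:
    """Fraction of queries whose output passes a project-specific check."""
    results = [passes(q, pipeline(q)) for q in queries]
    return sum(results) / max(len(results), 1)

def regression_report(baseline: Callable[[str], str], candidate: Callable[[str], str],
                      queries: list[str], passes: Callable[[str, str], bool]) -> dict[str, float]:
    return {"baseline_pass_rate": pass_rate(baseline, queries, passes),
            "candidate_pass_rate": pass_rate(candidate, queries, passes)}

# Example nightly run (all names below are placeholders):
# queries = sample_production_queries(production_log)
# report = regression_report(current_pipeline, new_branch_pipeline, queries, output_passes)
# if report["candidate_pass_rate"] < report["baseline_pass_rate"]:
#     raise SystemExit("Regression detected; do not promote the new branch.")
```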

AI observability is an emerging space and getting more crowded. Could you give us an idea of the competitive landscape and how you stand out from the pack?
Mohak: We see four broad categories. The first is prompt management, which deals with versioning and editing single prompts. That was popular early on when it was all about chat completions. Then there are tools for non-technical domain experts, letting them do lightweight Evals or annotate outputs. A third category is purely developer-centric, often CLI-based, which might help you log your prompts or collect some metrics, but it doesn't necessarily tie in with production. Finally, we see established MLOps providers for older predictive analytics, who are now pivoting to generative AI.
HoneyHive's approach is more integrated. We focus on AI engineers building complex pipelines. We want them to debug multi-step logic, do systematic tests, and then track production to feed real errors back into dev. We didn't want to be just a single-phase tool for either pre-production or post-production monitoring. We unify them because you only get real improvement if you can connect those dots.
Crucially, HoneyHive stands out by being truly enterprise-ready. The vast majority of other startups in this space simply were not built to meet the complex, rigorous requirements of Fortune 500 companies.
Dhruv: We also realized that typical software observability, like Datadog or New Relic, didn't anticipate huge text spans. We're sometimes dealing with multi-megabyte logs if context windows are 100K or more. That means we had to design a new data pipeline for how we store and retrieve these "events." Another big difference is that we treat each step in an agent's workflow as a first-class entity. We let you define custom metrics or checks for any step, which is more advanced than the typical one-and-done approach of simply storing the final pipeline output.
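To illustrate what treating each step as a first-class entity could look like, here is a sketch of a per-step event record. The field names are assumptions for illustration only, not HoneyHive's actual schema:

```python
# Hypothetical sketch of a per-step "event" record, where each agent step is a
# first-class entity carrying its own metrics. Field names are illustrative only,
# not HoneyHive's actual schema.
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class StepEvent:
    trace_id: str                 # groups all steps of one request or agent run
    step_name: str                # e.g. "retrieve", "rerank", "generate"
    parent_step: Optional[str]    # lets steps nest into a tree
    inputs: dict[str, Any]
    output: Any                   # can be very large with 100K+ token contexts
    metrics: dict[str, float] = field(default_factory=dict)  # per-step checks attach here

event = StepEvent(
    trace_id="req-123",
    step_name="retrieve",
    parent_step=None,
    inputs={"query": "water damage coverage"},
    output=["policy_section_4.pdf"],
    metrics={"recall_at_5": 1.0},
)
```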
You mentioned that HoneyHive stands out by actually being enterprise-ready. What does that mean?
Mohak: Enterprise clients care about security, compliance, and data boundaries. Some want standard SaaS but with enterprise features. Others want single-tenant SaaS, an air-gapped approach where we stand up a dedicated cluster that only they can use. And some want a full on-prem deployment in their own VPC, behind their own firewalls, for the most sensitive data.
We built HoneyHive to handle all those modes because we saw early on that large insurance or financial orgs have multiple teams building AI apps across multiple regions with heavy internal restrictions. You can't just have them sign up for a hosted multi-tenant environment. We run on a robust pipeline with ClickHouse. We replicate data if needed, encrypt it at rest, handle data egress constraints, and so on. Doing that well is not trivial, but it's basically table stakes if you want to do real enterprise AI.
Dhruv: We also realized that in big companies, no one wants to figure out cluster configurations or handle network security boundaries for each new team. We do that once at the organizational level, then every app team can quickly connect to HoneyHive. That means the org can scale AI across different departments without a huge overhead.
With agents being a focus in 2025, how do you see that affecting HoneyHive's product roadmap over the next 12 months?
Dhruv: Agents are the natural path to big ROI. A lot of generative AI apps so far have been single-step or short-step, like chat or summarization, but that's not worth billions on its own. If you want an agent to do multi-step tool usage (maybe logging into your bank, scanning transactions, booking something, pulling info from a knowledge base), that means you have multi-step logic that can fail at any point. The complexity balloons.
We saw early on that evaluating an entire sequence is different from single-step prompts. For instance, if an agent is supposed to buy something online, you can't just confirm the first step looked right. It might fail on the second or third step, or produce a partial success that's tricky to detect if you only glance at the final output. That means we need to track the entire trajectory. Another factor is context window expansions. If you're feeding the entire conversation up to step 15, that can be enormous. Traditional logs or debugging tools can't handle it gracefully.
Mohak: We're thinking about how more advanced agent systems may start resembling an autonomous vehicle scenario, where you need "simulations" to systematically test the agent's decision-making. We're not there yet, but we foresee that as a future. If you want an agent that can handle complex tasks autonomously, you'll want to simulate many user interactions or environment states. Our architecture is built to incorporate that. Another piece is that some teams are already analyzing entire multi-step "trajectories" with a specialized model just for evaluation, which can be another LLM that checks for correctness or consistency. That's another reason why we handle large context windows in the logs.
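A rough sketch of that trajectory-level evaluation idea, where another LLM judges the whole sequence of steps rather than just the final output, might look like the following. Here judge_llm is a placeholder for whatever judge model a team plugs in:

```python
# Hypothetical sketch of trajectory-level evaluation: render the whole sequence of agent
# steps and have a judge model score it, instead of checking only the final output.
# judge_llm is a placeholder for whatever model a team uses as the judge.
from typing import Any, Callable

def render_trajectory(steps: list[dict[str, Any]]) -> str:
    lines = [f"Step {i} [{s['step']}]: input={s['inputs']!r} -> output={s['output']!r}"
             for i, s in enumerate(steps, start=1)]
    return "\n".join(lines)

def evaluate_trajectory(steps: list[dict[str, Any]], goal: str,
                        judge_llm: Callable[[str], str]) -> bool:
    prompt = (
        "You are evaluating an AI agent.\n"
        f"Goal: {goal}\n"
        f"Trajectory:\n{render_trajectory(steps)}\n"
        "Did the agent achieve the goal without unsafe or contradictory steps? Answer YES or NO."
    )
    return judge_llm(prompt).strip().upper().startswith("YES")

# Usage (judge_llm would wrap a real model call):
# ok = evaluate_trajectory(TRACE, goal="Book the cheapest refundable flight", judge_llm=my_judge)
```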
Retrieval is a critical part of a multi-step AI pipeline, agentic or not. How do you guys think about where retrieval fits in?
Dhruv: People talk about multi-step agents, but a huge piece of that puzzle is how you retrieve data or knowledge for the model to act on. If your retrieval is missing key documents or if it's returning the wrong ones, everything else falls apart. We often see teams trying different vector DBs or adding re-ranking, but they need to measure if those changes are truly improving final answers or causing random regressions.
Mohak: The second half is that once you have bigger context, the model might not respect all of it. So you might do everything "correctly" in retrieval but still get nonsense in the output, or partial nonsense. If you only look at final correctness, you can't tell whether retrieval or generation is at fault. Our system breaks it down: you know if the right docs were found, if the doc was appended properly, if the model used that doc or ignored it. Observability at each step is crucial.
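As a simple illustration of that breakdown, here is a hypothetical set of checks that attribute a failure to retrieval, context assembly, or generation. The heuristics are placeholders; real systems often use LLM-based groundedness checks instead:

```python
# Hypothetical checks that attribute a failure to retrieval, context assembly, or generation.
# The heuristics are placeholders; real systems often use LLM-based groundedness checks.
def retrieval_found_gold_doc(retrieved_ids: list[str], gold_ids: list[str]) -> bool:
    return any(doc_id in retrieved_ids for doc_id in gold_ids)

def docs_in_prompt(prompt: str, retrieved_texts: list[str]) -> bool:
    return all(text in prompt for text in retrieved_texts)

def answer_uses_context(answer: str, retrieved_texts: list[str]) -> bool:
    # Crude proxy: the answer shares a leading snippet with at least one retrieved passage.
    return any(passage[:50].lower() in answer.lower() for passage in retrieved_texts)

def attribute_failure(retrieved_ids: list[str], gold_ids: list[str], prompt: str,
                      retrieved_texts: list[str], answer: str) -> str:
    if not retrieval_found_gold_doc(retrieved_ids, gold_ids):
        return "retrieval_miss"           # the right documents were never found
    if not docs_in_prompt(prompt, retrieved_texts):
        return "context_assembly_error"   # docs were found but never reached the model
    if not answer_uses_context(answer, retrieved_texts):
        return "generation_ignored_context"
    return "ok"
```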
What about your own culture and hiring priorities? Who do you need to expand the team right now?
Mohak: Currently we're about seven people, but we want to hire three key roles soon. One is product engineering, building front-end features, user flows, dashboards, as well as some of the logic that orchestrates multi-step evaluations. Another is systems and infrastructure engineering, because behind the scenes, we're dealing with large, complex data. We have to handle multi-megabyte spans, on-prem installations, and more.
We also want to hire a DevRel lead. We've been in stealth, so we haven't spent time marketing. Now we want to share what we're doing with the community and show that you don't have to be an enterprise to use HoneyHive. We do see smaller teams getting value, like if you have a few employees but want to iterate rapidly and not break production.
Dhruv: In terms of culture, we want people who can ship quickly and think from first principles. AI engineering changes so fast that there's no single script to follow. We're essentially combining software observability, ML analytics, domain-specific knowledge, and multi-step logic. If you come in with a super narrow lens, you'll be frustrated. We need open-mindedness, comfort with ambiguous territory, and a willingness to build new product concepts. We also put a premium on "intellectual honesty." We spend a lot of time discussing how to design a feature so it's flexible enough for future agent paradigms, like voice-based interactions or even more advanced tool usage. That means we talk a lot about what the space might look like in six to twelve months, and we try not to get stuck in the patterns of the moment.
What do you wish more people knew about HoneyHive?
Mohak: One point is that we're not just for agentic startups. We also handle the big enterprise demands. We're seeing Fortune 100 insurers or banks wanting to spin up entire AI solutions across multiple departments. They have complicated data boundaries or need air-gapped deployments. We're comfortable in that environment.
Beyond that, we want to emphasize that bridging dev and production is critical. If you only do pre-production tests, you won't catch the long-tail user queries that happen in real life. If you only do production logs, you won't have an easy way to feed those logs back into a test harness and systematically fix issues. Tying those steps together is our guiding philosophy.
Dhruv: Also, we strongly believe agent-based AI is the future. Summaries and single-step chats can be neat, but for massive ROI, you need advanced logic and multi-step operations, whether that's booking flights, updating your finances, scheduling complex tasks, or even controlling hardware. That's a fundamentally different development process from traditional software or older ML. If you share that vision, or you're actively building those use cases, we'd love to chat.
Mohak: Exactly. We see HoneyHive as the next-gen DevOps for AI. If you're reading this and you're dealing with complicated LLM pipelines, or you want to get from toy experiments to serious production, you should reach out. Or if you're an engineer or DevRel person who wants to build the "observability for the AI era," come join us. We think the real wave of AI products is only just beginning, and we're excited to help shape it.
Conclusion
Stay up to date on the latest with HoneyHive; learn more about them here.
Read our past few Deep Dives below:
If you would like us to "Deep Dive" a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.