Cerebral Valley
Cleric - Your AI SRE teammate 🌐

Plus: CEO Shahram on why AI agents can handle the complexity of today's production environments...

November 27, 2024

CV Deep Dive

Today, we’re talking with Shahram Anver, Co-Founder and CEO of Cleric.

Earlier this year, Cleric launched the first autonomous AI Site Reliability Engineer (SRE), sparking comparisons to Devin. Engineering teams at scale often spend half their engineering bandwidth on production support - investigating alerts, troubleshooting issues, and maintaining infrastructure.

Cleric aims to change this by operating like an experienced SRE teammate, autonomously investigating and diagnosing issues in complex production environments. By integrating with existing tools, Cleric helps engineering teams shift from reactive firefighting back to building software.

Their AI SRE is already live with enterprises and processing thousands of issues every day, with investigation times averaging under 2 minutes. In this conversation, Shahram shares how Cleric is laying the foundation for self-healing infrastructure, and why AI agents are uniquely suited to handle the complexity of modern production environments.

Let’s dive in ⚡️

Read time: 8 mins

Our Chat with Shahram

Shahram - welcome to Cerebral Valley! First off, give us a bit about your background and what led you to co-found Cleric?

Hey, I'm Shahram. Before Cleric, I led engineering for MLOps, DevOps, and FinOps platforms at Gojek, Southeast Asia's super-app handling millions of rides daily. My co-founder Willem created Feast, the open source feature store, and has worked with high scale ML infrastructure teams at Google, Apple, Twitter and Cloudflare. We’ve both seen how painful production operations are today and felt strongly that it’s an obvious area to solve with AI agents.

At large companies like Gojek, Conway's law is a big factor - as teams grow, infrastructure complexity grows exponentially. When you're processing millions of transactions daily across hundreds of microservices, adding more engineers or tools doesn't solve the underlying problem. If anything, it made things worse through coordination overhead. We were spending most of our time maintaining systems instead of building new capabilities.

What's interesting is that both of us had worked at the intersection of infrastructure and ML for years. When we saw advances in LLMs and agentic workflows, we realized the technology was finally ready to tackle infrastructure complexity. We started Cleric because we believe engineers want to build software, not spend their nights and weekends fighting production fires.

How would you describe Cleric to an AI engineer or developer who isn’t as familiar?

Think of Cleric as an AI SRE teammate that helps investigate issues across your production environments. It connects to your existing tools and systems, understands how they work together, and when something goes wrong, it investigates autonomously - pulling data, correlating events, and building clear evidence chains.

For example, when latency spikes in production, Cleric immediately starts investigating across your stack - checking recent changes, resource metrics, downstream dependencies - and typically diagnoses root cause in under two minutes. It's doing the same investigative work an engineer would, just much faster.
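To make the "checking recent changes" step concrete, here is a minimal Python sketch of one way to shortlist changes that landed shortly before a latency spike. It is an illustration only, not Cleric's implementation: the function name `changes_before`, the 30-minute window, and the sample change names are all hypothetical.

```python
from datetime import datetime, timedelta

def changes_before(spike_at, changes, window=timedelta(minutes=30)):
    """Return names of changes that landed within `window` before the spike,
    most recent first. `changes` is a list of (name, timestamp) pairs."""
    recent = [
        (ts, name)
        for name, ts in changes
        # Keep only changes at or before the spike and inside the window.
        if timedelta(0) <= spike_at - ts <= window
    ]
    return [name for ts, name in sorted(recent, reverse=True)]

# Hypothetical sample data for illustration.
spike = datetime(2024, 11, 27, 12, 0)
changes = [
    ("deploy-a", datetime(2024, 11, 27, 11, 50)),
    ("config-b", datetime(2024, 11, 27, 11, 0)),   # too old for the window
    ("deploy-c", datetime(2024, 11, 27, 11, 58)),
    ("deploy-d", datetime(2024, 11, 27, 12, 5)),   # landed after the spike
]
suspects = changes_before(spike, changes)
```

A real investigation would weigh this shortlist against resource metrics and downstream dependencies rather than rely on timing alone.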

Right now, we're focused on making investigation and diagnosis reliable. The agent has read only access and works alongside your teams, helping them understand what's happening in production faster. Every investigation improves its understanding of your specific environment.

What's interesting from an AI perspective is that we're not building a general-purpose agent. We've focused entirely on making our agent work reliably in production infrastructure - understanding system relationships, reasoning about cause and effect, and providing clear evidence for its findings. It's a focused application of AI that delivers value today while laying groundwork for more autonomous operations.

Talk to us about your users today - who’s finding the most value in what you’re building with Cleric?

We work with teams that spend the majority of their time supporting systems in production. If your engineering team spent the last sprint troubleshooting production issues instead of shipping features, this is when a product like Cleric becomes valuable.

We see this most often in companies with 100+ engineers, where engineering time is valuable and system reliability directly impacts revenue. These teams have already invested in observability tools and cloud infrastructure. They have good practices in place, but the complexity of their systems means there's constant operational work - from investigating performance issues to troubleshooting production incidents.

But they've hit that point where throwing more engineers at operations doesn't scale. Whether they have dedicated platform teams or engineers handling their own services, they need a way to investigate and resolve issues without interrupting development work.

We're getting to the scale where infra is really starting to matter. PromptLayer powers real production apps— so uptime & latency is really important. Super excited for agent-based infra automation.. keeping an eye on the AI SRE
— Jared Zoneraich (@imjaredz)
3:08 PM • Nov 27, 2024

Any customer or design-partner success stories you’d like to share?

We've seen strong results with teams managing critical systems where they can't ignore potential issues. One company runs a data platform used by dozens of application teams. Every alert needs investigation because small problems can cascade into customer facing incidents. They've found significant value in having Cleric respond to issues because it gives them immediate context - often diagnosing root causes before they even see the alert.

Another company's platform engineering team uses Cleric to investigate their Kubernetes deployments. What used to be a constant stream of interruptions has become much more manageable. They now catch resource constraints and configuration issues before they impact application teams.

Our focus has been on catching and resolving smaller issues before they compound into bigger problems. Engineering teams particularly value getting clear evidence about whether an issue needs immediate attention or if it's expected behavior, letting them stay focused on development while maintaining system reliability.

We're on the road to autonomous infrastructure
— Willem Pienaar (@willpienaar)
7:01 PM • Nov 27, 2024

What has been the hardest technical challenge around building Cleric into the product it is today?

There have definitely been thorny technical challenges, but the biggest challenge for us has really been about how to resist the temptation to build all these enticing, fancy features and just focus.

For example, we could’ve easily spent three or four months developing a deep, intricate knowledge graph mapping out all of the infrastructure and how every piece connects. But what we discovered is that a small portion of your stack—maybe 20%—tends to cause 80% of the issues. So, we decided to focus on just that core kernel.

I know that might not sound like a purely technical answer, but it’s critical when you’re building agents. This technology feels like magic, and the temptation to build all sorts of flashy stuff is always there. But staying disciplined is key.

How have your users evolved from the very beginning of Cleric? Have you found them to be receptive to Cleric’s integrations across the board?

From the start, we’ve been selective with our design partners - we knew that picking the right initial partners makes a material difference in the product. We looked for companies that had sufficient scale (100+ engineers) and that saw the future as human-machine augmentation, like we did. The scale means we’re solving a real problem, and the shared vision has helped a lot to keep energy levels high. It’s not every day you hear clients say “I’m so excited”.

In terms of tactical learnings, designing AI-human interaction has been fascinating. From a tech perspective, it sounds cool to say we're building an AI SRE, but to have it actively engaging in Slack channels with engineers was unexpectedly hard. We had to be thoughtful about timing our responses. For example, if a channel was really busy, we'd sometimes wait five minutes before responding to avoid interrupting an engineer who was already handling an issue. We put a lot of effort into figuring out when to engage versus when to stay quiet, especially if two people were already deep in conversation.
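The engage-or-stay-quiet decision described above can be sketched as a simple delay rule. This is a hypothetical illustration, not Cleric's actual logic: the `ChannelState` fields and the busy threshold are assumptions, and only the five-minute wait comes from the interview.

```python
from dataclasses import dataclass

@dataclass
class ChannelState:
    messages_last_5_min: int       # recent human messages in the channel
    active_human_responders: int   # distinct engineers currently replying
    seconds_since_alert: float     # time elapsed since the alert fired

BUSY_MESSAGE_THRESHOLD = 10  # assumed cutoff for a "really busy" channel
BUSY_DELAY_SECONDS = 300.0   # the five-minute wait mentioned above

def response_delay(state: ChannelState) -> float:
    """Return how many more seconds the agent should wait before posting."""
    busy = (
        state.messages_last_5_min >= BUSY_MESSAGE_THRESHOLD
        or state.active_human_responders >= 2  # two people already deep in conversation
    )
    target = BUSY_DELAY_SECONDS if busy else 0.0
    # Never return a negative wait once the target delay has passed.
    return max(0.0, target - state.seconds_since_alert)
```

In a quiet channel the function returns zero and the agent can respond immediately; in a busy one it holds back until five minutes have passed since the alert.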

Another thing that stood out was how framing affects usage. Since we frame Cleric as an AI agent, almost like a human teammate, it's perceived differently from traditional tools. With typical platforms, you're just configuring settings. But with Cleric, people ask things like, “How do we teach Cleric?” or use terms like “Cleric said X”. It's very personalized, which is exciting and shows the “AI as teammate” paradigm can work.

We’re also seeing sharp growth in the use of AI coding assistants. Engineers are now shipping code faster than ever to production which is also increasing the pressure on production systems and on-call teams. Engineering leaders we speak to are increasingly concerned about this gap - they see the productivity gains from AI, but worry about who's watching the underlying systems.

How do you plan on Cleric progressing over the next 6-12 months? Anything specific on your roadmap that new or existing customers should be excited for?

We're taking a measured approach to autonomous operations. Today, we've built the foundation - accurate multi step investigations. This includes the collaboration layer - how Cleric learns from engineers, integrates with existing workflows, and builds trust through clear evidence and reasoning.

The next phase expands our supervised resolution capabilities. We're already identifying potential fixes for common issues, but we keep engineers in control of implementation. As diagnosis accuracy increases for specific use cases, we'll expand this automated resolution with continued engineer oversight. We're seeing strong results already where Cleric diagnoses complex issues in minutes where an engineer would spend 15-20 minutes doing the same thing.

The long term goal is closed loop autonomous operations. But this requires proving reliability at each step. We expand automation gradually, use case by use case, environment by environment. Teams control this progression - some might want full automation for certain scenarios while keeping others under review. AI agents still struggle with certain complex scenarios, so we're deliberate about where we automate. But even just reducing the search space from hundreds of potential causes to a few likely ones already saves engineers hours per incident.

What's important is that teams already see value today through faster diagnosis and reduced interruptions. Each capability we add builds on this foundation, making systems more self healing over time. It's a pragmatic path to autonomous infrastructure.

Agents have become a hugely exciting part of generative AI, and have received a lot of interest. What sets Cleric apart from others in the space?

Infrastructure is uniquely suited for agents, but it's also a harder problem than it appears. Unlike code generation where you have clear test cases, infrastructure problems are open ended and require understanding complex causal relationships across systems. You need to correlate metrics across multiple time series, connect failures across services, and understand how different systems interact.

We've spent over a year building the foundations - not just applying general agents to infrastructure, but creating the domain-specific integrations and abstractions that make agents reliable in this space. A lot of our work has focused on the unglamorous but critical parts: evaluation frameworks that ensure reliability, production feedback loops that improve performance, and making the product self-serve for engineering teams.

Our approach is pragmatic. We're building something that delivers clear value today while positioning for where foundation models are heading. This means focusing on specific use cases where we can be highly reliable, building deep tool integrations, and ensuring our systems learn and improve from production usage.

Our team also has deep experience in both infrastructure and AI - we've built and operated ML platforms and infrastructure at scale. We've felt this pain personally, which is why we're so singularly focused on solving it.

How does AI safety play a role in your product vision for Cleric?

AI safety in production environments requires a much more serious approach than other AI applications. When you're dealing with business critical systems, there's no room for uncertainty or unexpected behavior.

Our approach to safety focuses on strong foundational guardrails:

Start with read only access, enforced at the infrastructure level through cloud provider and Kubernetes RBAC
Deploy entirely within customer VPCs, keeping sensitive data within their security boundaries
Built in PII detection and redaction layers to protect sensitive information
End-to-end encryption and strict data privacy controls
Human approval required for any system changes
Clear boundaries on agent capabilities and actions

We use AI for what it's good at - reasoning about complex systems and suggesting solutions - while maintaining strict controls on what it can actually do. This conservative approach aligns with how the best engineering teams operate their critical infrastructure.
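As a rough illustration of the "read only by default, human approval for changes" guardrails listed above, here is a minimal application-level policy check in Python. It is a sketch under stated assumptions: the interview says read-only access is enforced at the infrastructure level through cloud provider and Kubernetes RBAC, so a check like this would only complement those controls. The names `ActionKind`, `ProposedAction`, and `is_allowed` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class ActionKind(Enum):
    READ = "read"    # e.g. fetching logs or metrics
    WRITE = "write"  # e.g. restarting a service or changing config

@dataclass(frozen=True)
class ProposedAction:
    description: str
    kind: ActionKind

def is_allowed(action: ProposedAction, human_approved: bool = False) -> bool:
    """Reads are always permitted; any system change needs explicit human approval."""
    if action.kind is ActionKind.READ:
        return True
    return human_approved

# Hypothetical examples of actions an agent might propose.
fetch_logs = ProposedAction("fetch pod logs", ActionKind.READ)
restart = ProposedAction("restart deployment", ActionKind.WRITE)
```

The default of `human_approved=False` means a change is rejected unless an engineer has signed off, which mirrors the conservative posture described in the list.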

been watching @cleric_io's approach to building their AI SRE - one of the few teams building guardrails/safety into the core of their product - this will def become the norm
— dex (@dexhorthy)
11:57 PM • Nov 26, 2024

Lastly, tell us a little bit about the team and culture at Cleric. How big is the company now, and what do you look for in prospective team members that are joining?

My co-founder and I have known each other for a decade and have both built and sold companies before. Most of our team are engineers we've worked with over the years - people who've built and operated systems at scale.

Cleric team debating agent architectures in Tahoe (or hiking routes - hard to tell from the diagrams)

We're six engineers now, with backgrounds spanning observability platforms like Splunk to ML and data infrastructure at companies like Tecton and Next Data. What makes this team different is that we've all lived this problem from different angles - building platforms, operating production systems, and developing AI systems.

Our culture is direct and iterative. We work in person, move fast, and focus on technical excellence. We look for troublemakers - engineers who see broken systems and can't help but fix them. People who are comfortable with ambiguity, have strong technical depth, and challenge the status quo.

Anything else you’d like our readers to know about Cleric?

What's fascinating is how differently AI agents approach operations compared to humans. Just as AlphaGo used new ways of thinking about Go that grandmasters had never considered, we're seeing agents find patterns and correlations that experienced SREs might miss. All our existing tools - dashboards, runbooks, documentation - were built for human operators. We're at the beginning of a fundamental shift in how systems are operated.

We'll be sharing more customer success stories soon. We think the community will find real production examples valuable, especially teams dealing with similar operational challenges.

We're expanding our team in San Francisco. If you're a builder with infrastructure experience and want to tackle this problem, reach out. And if your team is drowning in operational support, let's talk.

Conclusion

To stay up to date on the latest with Cleric, learn more about them here.


If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email or DM us on Twitter or LinkedIn.
