
Galileo's GenAI Studio is taking model evals to the next level 📈

Plus: Founder-CEO Vikram on BERT, RAG analytics, and more...

CV Deep Dive

Today, we’re talking with Vikram Chatterji, Founder and CEO of Galileo. 

Galileo is an end-to-end evaluation stack for GenAI, purpose-built for AI teams. Founded by Vikram, Yash Sheth, and Atin Sanyal in 2021 after working with Google’s Transformers team and Uber’s AI team, the startup’s mission is to provide the platform and research-backed evaluation models for AI teams to build trustworthy and accurate GenAI applications.

Galileo’s flagship product, GenAI Studio, helps enterprise AI teams evaluate, observe, and protect their AI solutions to build trustworthy GenAI, powered by a suite of Evaluation Foundation Models built by Galileo’s research arm. This lets AI teams move away from ‘asking GPT’ or throwing humans at the evaluation problem, and instead use Galileo’s evaluation models to detect hallucinations, security threats, data privacy breaches, and more. 

Today, Galileo has dozens of enterprise AI teams using its product for model evaluations, including numerous Fortune 50 companies. In 2022, the startup announced an $18 million Series A round led by Battery Ventures with participation from The Factory, Walden Catalyst, FPV Ventures, Kaggle co-founder Anthony Goldbloom and other angel investors.

In this discussion, Vikram walks us through the founding premise of Galileo, their suite of evaluation-focused models and products across the development and production lifecycle, and how his time working with the Transformers team at Google AI shaped his approach to generative AI.  

Let’s dive in!

Read time: 8 mins

Our Chat with Vikram 💬

Vikram - welcome to Cerebral Valley. Introduce yourself and give us a bit of background on yourself and Galileo. What led you to start Galileo? 

Hey there! My name is Vikram and I’m the co-founder and CEO of Galileo. Prior to starting Galileo in early 2021, I was heading up product management at Google AI where my team worked with Google’s ‘OG’ large language model, BERT, as well as the Transformers team. That was a great time to be at Google, because BERT had ushered in this new wave of NLP that everybody, from startups to large enterprises, was interested in taking part in. 

At Google, my team was focused on taking language models and building production-grade applications for large enterprises across financial services, retail, healthcare, and contact centers. We spent a ton of time on evaluation of model inputs and outputs, and after many months, I realized a few critical things: 

  1. There’s a ton of unstructured data in the enterprise – roughly 80% of the data across the organizations I was working with was in the form of docs, images, audio, and video.

  2. Looking outside of Google in 2020, ‘ML Ops’ was becoming a thing, but almost exclusively focused on the 20% of enterprise data – structured features in a tabular format.

  3. It was clear natural language would only continue to grow in popularity.

After realizing there wasn’t an effective evaluation solution for AI teams working with unstructured data, my co-founders and I started Galileo in Feb 2021. We started with language models and focused on measuring data quality by painstakingly building our own metrics for data quality quantification (we called it Data Error Potential), semantic drift detection and more – things that didn’t exist back then but would’ve saved my team at Google hundreds of hours a month. 

This laid the foundation for our research-first approach towards evaluation in the generative AI era.

How would you describe Galileo to the uninitiated developer or ML researcher? 

Galileo is an end-to-end genAI evaluation platform designed to help AI teams go from development to production. We help AI teams evaluate, monitor, and protect GenAI applications across the development lifecycle using our proprietary Evaluation Foundation Models.  

To take a step back, what we’re hearing from our customers is that GenAI application development requires building and shipping an increasingly complex AI system – involving prompts, LLMs, training data, context data, embedding models, vector stores, and more. This means you’re always focused on answering a multitude of questions: how do you know if you’re working with the right model or prompt? What’s the quality of my retriever? Are the chunks I have the right chunks for the response? Is my model hallucinating? There’s no real evaluation layer in the GenAI stack today, and that’s what we’re solving. 
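Questions like "am I using the right model or prompt?" boil down to running a candidate through a test set and scoring it with a metric. Below is a deliberately minimal, hypothetical sketch of that loop – the function names and toy model are illustrative assumptions, not Galileo's actual SDK:

```python
# Hypothetical evaluation-harness sketch. The names and the toy model
# below are illustrative assumptions, NOT Galileo's actual SDK.
import re

def evaluate_prompt(model, prompt_template, test_set, metric):
    """Average a metric over (input, expected) pairs for one prompt template."""
    scores = [
        metric(model(prompt_template.format(input=x)), expected)
        for x, expected in test_set
    ]
    return sum(scores) / len(scores)

def best_prompt(model, candidates, test_set, metric):
    """Pick the candidate prompt template with the highest average score."""
    return max(candidates, key=lambda p: evaluate_prompt(model, p, test_set, metric))

def toy_model(prompt):
    # Stub standing in for an LLM API call: answers simple additions.
    m = re.search(r"(\d+)\+(\d+)", prompt)
    return str(int(m.group(1)) + int(m.group(2))) if m else "unknown"

def exact_match(output, expected):
    return float(output == expected)

test_set = [("2+2", "4"), ("3+3", "6")]
candidates = ["Q: {input}\nA:", "Answer concisely."]
```

In practice the stub model becomes a real LLM call and the exact-match metric becomes something like a hallucination or adherence score, but the shape of the comparison stays the same.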

Today, Galileo offers 3 products, all available through a powerful Python SDK, REST APIs and UI: 

  1. Evaluate: built for 0 to 1 iteration across the various parts of the GenAI system.

  2. Observe: built for real-time monitoring, alerting and root-cause analysis of model hallucinations, security attack vectors and data privacy exposures.

  3. Protect: built for real-time interception of a user’s queries and the model’s responses to proactively avoid harm to users and a company’s brand.

Our enterprise customers have expressed the desire for a balance between embracing the operational efficiencies that come from Generative AI while preventing the risk of hallucinations, data privacy lapses, and security hacks. So, we think of ourselves as the GenAI Evaluation stack that helps enterprise AI teams rapidly evaluate their GenAI systems, monitor live traffic, and protect their systems and users. 

As GenAI allows enterprises to unlock the 80% unstructured data that was latent for so long, we at Galileo aim to be their evaluation partner across the AI development lifecycle.

Who are your current users (roles, titles)? Who finds the most value in Galileo’s platform on a weekly or monthly basis? 

Broadly, we obsess over ‘AI teams’ shipping business-critical GenAI-powered applications. Within these teams, we’ve built our product to address the needs of three personas - The AI Engineer, the Subject Matter Expert, and the Annotator:

  1. The ‘AI Engineer’ – Brings together various parts of the system, typically in a Python notebook. The AI Engineer uses Galileo’s Python SDK or REST API to quickly evaluate a Run (model inference) using dozens of proprietary metrics to measure hallucinations, RAG system quality, security attack risks, data privacy, and more.

  2. The ‘Subject Matter Expert’ (SME) – For some customers, the AI Engineer doesn't have enough domain expertise to create high-quality test sets or evaluate outputs. For example, an AI Engineer can’t determine whether a medical recommendation is accurate. In these cases, Galileo has an intuitive UI, enriched with Galileo’s evaluation metrics, to help SMEs (e.g., lawyers, doctors, accountants, PMs) more quickly and accurately review responses.

  3. The ‘Annotator’ – Lastly, we also have an easy-to-use interface that helps annotators label and rate high-quality test sets to accelerate the evaluation process. 

There are a number of teams approaching MLOps, especially in the realms of observability and evals. What is Galileo doing differently from a technical perspective? 

This is a great question! We're seeing three buckets in the market: (1) players pivoting from tabular data into the GenAI evaluation space, (2) companies approaching MLOps from an observability perspective, and (3) newer prompt playground platforms. The two key differences I’d highlight are the following:

1. Evaluation-First. Research-First.

We are coming at MLOps from an evaluation-first perspective. We started Galileo in Feb 2021 and were, for the longest time, among the only companies doing language model evaluations. We believe you can’t fix what you can’t measure, and we’ve always felt that AI teams don’t have the right tools to effectively measure their AI systems. BLEU and ROUGE are not enough - innovation in the Evaluation Stack requires innovation in the Metrics Stack.

In fact, our first-ever hire was an AI researcher to build evaluation metrics. We’ve always obsessed over helping AI teams quantifiably answer key questions about their GenAI systems: “Which model responses are hallucinatory? Why?”, “What is the quality of my retriever?”, “Which documents and associated chunks were really used by my model?”, “What data is my model finding hard to train on?” – high-precision metrics for each of these questions cannot be built overnight and have been core to our roadmap since Day 1. These metrics have been super high ROI for our customers. We publish papers (e.g., ChainPoll for hallucination detection without the need for any ground-truth data, as well as a first-of-its-kind Hallucination Index) to make sure there is transparency around our research techniques, experiments, and results.
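The core idea behind ChainPoll, as described in Galileo's paper, is to poll a judge LLM several times with a chain-of-thought prompt asking whether a response is hallucinatory, then average the votes – no ground-truth answer required. A minimal sketch, with a deterministic stub standing in for the judge LLM:

```python
# Sketch of the ChainPoll idea: poll a judge several times and average
# the boolean votes. The judge here is a stub; in practice it would be
# an LLM call with a chain-of-thought "does this response hallucinate?"
# prompt, and runs would vary due to sampling.

def chainpoll_score(judge, question, response, n_polls=5):
    """Fraction of judge runs that flag the response as hallucinatory.

    Needs no ground-truth reference answer: the score comes entirely
    from repeated judgments over the (question, response) pair.
    """
    votes = [judge(question, response) for _ in range(n_polls)]
    return sum(votes) / n_polls

def stub_judge(question, response):
    # Illustrative stand-in: flags one known-false claim.
    return "moon is made of cheese" in response.lower()
```

With a sampled LLM judge, the averaged score behaves like a soft confidence rather than a single brittle yes/no verdict.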

2. End-to-End AI System

The second thing that sets us apart is our focus on supporting AI Teams’ eval needs across the entire development lifecycle - from developing, testing, and experimentation in pre-production to monitoring and interception in post-production. 

GenAI, as we all know, is non-deterministic - and as such, AI is an iterative sport. Over time, we have seen the aspects of iterative AI development that used to be tedious become increasingly automated – model hyperparameter tuning, data labeling, etc. With GenAI Studio, we have built an ‘Evaluation Stack’ that solves for each part of the AI development lifecycle from 0 to 1 iteration, to fine tuning, to production monitoring and now real-time response interception. The beauty of this approach is two-fold. 

  1. Enterprises get a unified evaluation solution that spans development and production, which increases developer efficiency and product velocity, and ensures more accurate genAI applications.

  2. This enables automated genAI system optimization. AI teams can now use Galileo Observe to automatically detect hallucinatory responses in production. These instances are then processed by Galileo Evaluate, where automated testing and prompt optimization across models refine the GenAI system for reintegration into production. This seamless support through every phase of development allows Galileo to uniquely support the entire genAI lifecycle.

How much growth are you seeing user-wise? Any user stories you’d like to share? 

We launched our flagship evaluation platform, GenAI Studio, in the middle of last year, focused exclusively on large enterprise AI teams. Since then, we’ve seen ARR 3x quarter-over-quarter. We’ve also seen inbound demand 10x since then. We’ve actually had to 2x our team to support this demand!

In terms of user stories, one of our customers - a Fortune 50 CPG brand - was working on a RAG-based application to generate appealing product descriptions faster for their inventory. Given the user-facing nature of this application, they needed to minimize hallucinations - however, after months of iterations they still found a host of products with wildly inaccurate descriptions. They started using GenAI Studio and built out this incredible system with RAG - immediately, they could transparently see which parts of their stack were problematic. 

Within a few days, a single engineer was able to revise and deploy a new retriever and embedding model, and push their product description tool to production. This would have taken them weeks and multiple engineers to resolve in the past.

Any specific features of Galileo’s platform that you’d like to highlight? Give us a comprehensive overview of each of these. 

There are 2 products I’d love to highlight here, and each of these was borne out of listening to our customers’ pain points for months at a time:

  1. Galileo RAG Analytics: RAG has emerged as the de facto method for quickly getting started and productionizing GenAI applications. However, we heard from customers that they were often unsure about the quality of their RAG system – for example, the chunking strategy, the quality of the retrieved chunks, the completeness of the response, the response’s adherence to context, and a lot more. 

    We took a first-principles approach and built 5 new high-accuracy metrics, each designed to make the RAG system transparent and actionable. With evaluation platforms, users don’t just want a trace of their run in a fancy UI – that’s easy. They want the platform to point out blind spots and fundamentally improve the model’s accuracy. That’s what RAG Analytics provides.

  2. Galileo Protect: We just launched Protect. When we launched Observe, our real-time monitoring product, customers loved its hallucination, security, and data-privacy metrics. We kept getting asked if it was possible to intercept their LLM’s API calls and proactively block harmful responses from reaching the user. This meant we needed to be in the application layer!

To do this well, we’ve reduced our Evaluation Foundation Models’ latency to milliseconds and brought their cost down to ~$0. We’re excited to finally launch Galileo Protect, and the early customer results have been tremendous! This required a close collaboration between our ML research and infra teams. I am super proud of the team. 
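To make the RAG Analytics idea concrete, here is a deliberately naive sketch of one such check – a token-overlap proxy for "adherence to context." Galileo's actual metrics are model-based and far more precise; this only illustrates the shape of the measurement:

```python
# Naive, illustrative proxy for a context-adherence metric: the share of
# response sentences whose words all appear in the retrieved chunks.
# This is an assumption-laden toy, not Galileo's actual metric, which is
# model-based rather than lexical.
import re

def context_adherence(response, chunks):
    """Fraction of response sentences fully covered by context vocabulary."""
    context_words = set(re.findall(r"\w+", " ".join(chunks).lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    supported = sum(
        1 for s in sentences
        if set(re.findall(r"\w+", s.lower())) <= context_words
    )
    return supported / len(sentences) if sentences else 0.0
```

A sentence introducing words absent from every retrieved chunk drags the score down, flagging likely unsupported claims for review.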

How do you see Galileo’s core product offering evolving over the next 6-12 months? 

Currently, we’re very focused on evolving Galileo across 2 distinct axes:

1) Breadth. As genAI becomes more complex and varied (e.g., agents, multi-modality), we’re making sure Galileo’s platform can quickly bring the same level of evaluation and actionability to these new paradigms and modalities. For example, we’re excited about the increasing interest in multi-modality and already have native support for computer vision. 

2) Depth. Working towards our eventual goal of automating genAI evaluations across the AI development lifecycle, we have a long roadmap of new evaluation models we’re excited to ship to our users.

What’s Galileo’s internal approach to choosing what to build? How does your team balance research vs. product development, given so much of the value in AI is still in the research phase and early in product development? 

I’d say we’re extremely customer-obsessed and are thankful to have dozens of enterprise customers to build for and with. Our roadmap is informed by customer feedback and recurring pains we observe from customer usage. We strive to deliver ‘magical’ experiences for our users which often requires extensive efforts from our AI research team. A great example of this is our RAG and Agent analytics I mentioned above, which required extensive R&D in order to save users from many hours of manual eyeballing.

Overall, I’d say it’s a super exciting time to be building in AI! We keep very close tabs on academia (which has been a huge source for the building blocks of our innovative approach internally) and we talk to dozens of prospects every week, which helps us identify trends and user pain points early. Internally as a team, we over-communicate these learnings across engineering, research, product, sales, and marketing, so that we’re all pushing full steam ahead together and sharing ideas. You should see our Slack channel! 

How would you describe the culture at Galileo? Are you hiring? What do you look for in prospective team members? 

My co-founders and I have always focused on creating a culture that’s highly collaborative, high-ownership and customer-obsessed. We have a fairly flat structure and a lot of open communication - a lot of this is likely borrowed from Google given how long we had worked there! 

We’re always looking for great folks to join our team across engineering, AI research, sales, and marketing. When hiring, apart from obviously looking for intelligence and role relevance, we also do the airport test – meaning, if we were both stuck waiting for a plane at an airport for a few hours, could we get along really well? This has helped us foster a genuinely supportive team as we move at 1,000 miles per hour! We have some exciting roles open right now!

Anything else you’d like readers to know?

The GenAI landscape moves at lightning speed and it truly feels like things are just getting started. It’s been super encouraging to collaborate and partner with other brands and communities in this space like Cerebral Valley. Thanks for all you folks do for the community!

Conclusion

To stay up to date on the latest with Galileo, follow them on X and learn more about them at Galileo.

If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.