Galileo's GenAI Studio is taking model evals to the next level
Plus: Founder-CEO Vikram on BERT, RAG analytics, and more...

CV Deep Dive
Today, we're talking with Vikram Chatterji, Founder and CEO of Galileo.
Galileo is an end-to-end evaluation stack for GenAI, purpose-built for AI teams. Founded by Vikram, Yash Sheth, and Atin Sanyal in 2021 after working with Google's Transformers team and Uber's AI team, the startup's mission is to provide the platform and research-backed evaluation models for AI teams to build trustworthy and accurate GenAI applications.
Galileo's flagship product, GenAI Studio, helps enterprise AI teams evaluate, observe, and protect their AI solutions to build trustworthy GenAI, powered by a suite of Evaluation Foundation Models built by Galileo's research arm. This helps AI teams move away from "asking GPT" or throwing humans at the evaluation problem, and instead use Galileo's evaluation models to detect hallucinations, security threats, data privacy breaches, and more.
Today, Galileo has dozens of enterprise AI teams using its product for model evaluations, including numerous Fortune 50 companies. In 2022, the startup announced an $18 million Series A round led by Battery Ventures with participation from The Factory, Walden Catalyst, FPV Ventures, Kaggle co-founder Anthony Goldbloom and other angel investors.
In this discussion, Vikram walks us through the founding premise of Galileo, their suite of evaluation-focused models and products across the development and production lifecycle, and how his time working with the Transformers team at Google AI shaped his approach to generative AI.
Let's dive in!
Read time: 8 mins
Our Chat with Vikram
Vikram - welcome to Cerebral Valley. Introduce yourself and give us a bit of background on yourself and Galileo. What led you to start Galileo?
Hey there! My name is Vikram and I'm the co-founder and CEO of Galileo. Prior to starting Galileo in early 2021, I was heading up product management at Google AI, where my team worked with Google's "OG" large language model, BERT, as well as the Transformers team. That was a great time to be at Google, because BERT had ushered in this new wave of NLP that everybody, from startups to large enterprises, was interested in taking part in.
At Google, my team was focused on taking language models and building production-grade applications for large enterprises across financial services, retail, healthcare, and contact centers. We spent a ton of time on evaluation of model inputs and outputs, and after many months I realized a few critical things:
There's a ton of unstructured data in the enterprise - roughly 80% of the data across the organizations I was working with was in the form of docs, images, audio, and video.
Looking outside of Google in 2020, "ML Ops" was becoming a thing, but it was almost exclusively focused on the other 20% of enterprise data - structured features in a tabular format.
It was clear that natural language would only continue to grow in popularity.
After realizing there wasn't an effective evaluation solution for AI teams working with unstructured data, my co-founders and I started Galileo in Feb 2021. We started with language models and focused on measuring data quality by painstakingly building our own metrics for data quality quantification (we called it Data Error Potential), semantic drift detection, and more - things that didn't exist back then but would've saved my team at Google hundreds of hours a month.
This laid the foundation for our research-first approach towards evaluation in the generative AI era.
How would you describe Galileo to the uninitiated developer or ML researcher?
Galileo is an end-to-end genAI evaluation platform designed to help AI teams go from development to production. We help AI teams evaluate, monitor, and protect GenAI applications across the development lifecycle using our proprietary Evaluation Foundation Models.
To take a step back, what we're hearing from our customers is that GenAI application development requires building and shipping an increasingly complex AI system - involving prompts, LLMs, training data, context data, embedding models, vector stores, and more. This means you're always answering a multitude of questions: am I working with the right model or prompt? What's the quality of my retriever? Are the chunks I retrieved the right chunks for the response? Is my model hallucinating? There's no real evaluation layer in the GenAI stack today, and that's what we're solving.
Today, Galileo offers 3 products, all available through a powerful Python SDK, REST APIs, and a UI:
Evaluate: built for 0-to-1 iteration across the various parts of the GenAI system.
Observe: built for real-time monitoring, alerting, and root-cause analysis of model hallucinations, security attack vectors, and data privacy exposures.
Protect: built for real-time interception of a user's queries and the model's response to proactively avoid harm to users and a company's brand.
Our enterprise customers have expressed the desire for a balance between embracing the operational efficiencies that come from Generative AI and preventing the risk of hallucinations, data privacy lapses, and security hacks. So, we think of ourselves as the GenAI Evaluation stack that helps enterprise AI teams rapidly evaluate their GenAI systems, monitor live traffic, and protect their systems and users.
As GenAI allows enterprises to unlock the 80% unstructured data that was latent for so long, we at Galileo aim to be their evaluation partner across the AI development lifecycle.
Who are your current users (roles, titles)? Who finds the most value in Galileo's platform on a weekly or monthly basis?
Broadly, we obsess over "AI teams" shipping business-critical GenAI-powered applications. Within these teams, we've built our product to address the needs of three personas - the AI Engineer, the Subject Matter Expert, and the Annotator:
The "AI Engineer" - Brings together the various parts of the system, typically in a Python notebook. The AI Engineer uses Galileo's Python SDK or REST API to quickly evaluate a Run (model inference) using dozens of proprietary metrics that measure hallucinations, RAG system quality, security attack risks, data privacy, and more.
The "Subject Matter Expert" (SME) - For some customers, the AI Engineer doesn't have enough domain expertise to create high-quality test sets or evaluate outputs. For example, an AI Engineer can't determine whether a medical recommendation is accurate. In these cases, Galileo provides an intuitive UI, enriched with Galileo's evaluation metrics, to help SMEs (e.g., lawyers, doctors, accountants, PMs) review responses more quickly and accurately.
The "Annotator" - Lastly, we also have an easy-to-use interface that helps annotators label and rate high-quality test sets to accelerate the evaluation process.
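The AI Engineer workflow described above - score every response in a run with a battery of metrics and flag the outliers for review - can be sketched in plain Python. This is an illustrative toy, not Galileo's actual SDK: the `EvalRecord` shape, the `context_overlap` metric, and the 0.5 threshold are all assumptions made for the example.

```python
# Illustrative sketch of an evaluate-a-run loop; NOT Galileo's real SDK.
from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    prompt: str
    response: str
    context: str
    metrics: dict = field(default_factory=dict)

def context_overlap(response: str, context: str) -> float:
    """Toy proxy metric: fraction of response tokens that appear in the context."""
    resp_tokens = response.lower().split()
    ctx_tokens = set(context.lower().split())
    if not resp_tokens:
        return 0.0
    return sum(t in ctx_tokens for t in resp_tokens) / len(resp_tokens)

def evaluate_run(records: list[EvalRecord], threshold: float = 0.5) -> list[EvalRecord]:
    """Score every record in a run and flag likely hallucinations for review."""
    for r in records:
        score = context_overlap(r.response, r.context)
        r.metrics["context_overlap"] = score
        r.metrics["flagged"] = score < threshold
    return records

records = evaluate_run([
    EvalRecord(prompt="Return policy?",
               response="Returns are accepted within 30 days",
               context="Returns are accepted within 30 days of purchase."),
    EvalRecord(prompt="Warranty?",
               response="Lifetime warranty on all items",
               context="A one-year limited warranty applies."),
])
for r in records:
    print(r.prompt, r.metrics)
```

A real platform would swap the toy token-overlap heuristic for learned evaluation models, but the shape of the loop - batch inference, per-record metrics, flagged outliers routed to an SME or annotator - is the same.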
There are a number of teams approaching MLOps, especially in the realms of observability and evals. What is Galileo doing differently from a technical perspective?
This is a great question! We're seeing three buckets in the market: (1) players pivoting from tabular data into the GenAI evaluation space, (2) companies approaching MLOps from an observability perspective, and (3) newer prompt playground platforms. The two key differences I'd highlight are the following:
1. Evaluation-First. Research-First.
We are coming at MLOps from an evaluation-first perspective. We started Galileo in Feb 2021 and were, for the longest time, among the only companies doing language model evaluations. We believe you can't fix what you can't measure, and we've always felt that AI teams don't have the right tools to effectively measure their AI systems. BLEU and ROUGE are not enough - innovation in the Evaluation Stack requires innovation in the Metrics Stack.
In fact, our first-ever hire was an AI researcher to build evaluation metrics. We've always obsessed over helping AI teams quantifiably answer key questions about their GenAI systems: "Which model responses are hallucinatory? Why?", "What is the quality of my retriever?", "Which documents and associated chunks were really used by my model?", "What data is my model finding hard to train on?" - high-precision metrics for each of these questions cannot be built overnight and have been core to our roadmap since Day 1. These metrics have been super high-ROI for our customers. We publish papers (e.g., ChainPoll for hallucination detection without the need for any ground-truth data, as well as a first-of-its-kind Hallucination Index) to make sure there is transparency around our research techniques, experiments, and results.
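For readers curious about the mechanics of a ChainPoll-style check, here is a heavily simplified sketch: poll a judge LLM several times with a chain-of-thought prompt and average the binary "hallucinated" votes. The prompt wording, the `judge` callable interface, and the deterministic stub judge are stand-ins for illustration, not the paper's exact setup.

```python
# Simplified ChainPoll-style score: poll a judge model n times and average votes.
# Prompt wording and judge interface are illustrative assumptions.

def make_judge_prompt(question: str, answer: str) -> str:
    return (
        "Does the answer contain claims unsupported by the question's context? "
        "Think step by step, then conclude yes or no.\n"
        f"Question: {question}\nAnswer: {answer}"
    )

def chainpoll_score(question: str, answer: str, judge, n_polls: int = 5) -> float:
    """Fraction of judge calls voting 'hallucinated' (0.0 = clean, 1.0 = bad)."""
    prompt = make_judge_prompt(question, answer)
    votes = [judge(prompt) for _ in range(n_polls)]  # each vote: True = hallucinated
    return sum(votes) / n_polls

# Deterministic stub judge for demonstration; in practice this would be an LLM call
# whose sampled chain-of-thought answers can differ from poll to poll.
def stub_judge(prompt: str) -> bool:
    return "Lifetime warranty" in prompt

score = chainpoll_score("What is the warranty?", "Lifetime warranty on everything.", stub_judge)
print(score)  # 1.0 with this deterministic stub
```

The appeal of this family of techniques is that it needs no ground-truth labels: agreement across repeated judged samples stands in for a reference answer.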
2. End-to-End AI System
The second thing that sets us apart is our focus on supporting AI Teams' eval needs across the entire development lifecycle - from development, testing, and experimentation in pre-production to monitoring and interception in post-production.
GenAI, as we all know, is non-deterministic - and as such, AI is an iterative sport. Over time, we have seen the aspects of iterative AI development that used to be tedious become increasingly automated - model hyperparameter tuning, data labeling, etc. With GenAI Studio, we have built an "Evaluation Stack" that solves for each part of the AI development lifecycle, from 0-to-1 iteration, to fine-tuning, to production monitoring, and now real-time response interception. The beauty of this approach is two-fold.
Enterprises get a unified evaluation solution that spans development and production, which increases developer efficiency and product velocity and ensures more accurate genAI applications.
This enables automated genAI system optimization. AI teams can now use Galileo Observe to automatically detect hallucinatory responses in production. These instances are then processed by Galileo Evaluate, where automated testing and prompt optimization across models refine the GenAI system for reintegration into production. This seamless coverage of every phase of development allows Galileo to uniquely support the entire genAI lifecycle.
How much growth are you seeing user-wise? Any user stories you'd like to share?
We launched our flagship evaluation platform, GenAI Studio, in the middle of last year, focused exclusively on large enterprise AI teams. Since then, we've seen ARR 3x quarter-over-quarter and inbound demand 10x. We've actually had to 2x our team to support this demand!
In terms of user stories, one of our customers - a Fortune 50 CPG brand - was working on a RAG-based application to generate appealing product descriptions faster for their inventory. Given the user-facing nature of this application, they needed to minimize hallucinations - however, after months of iteration they still found a host of products with wildly inaccurate descriptions. They started using GenAI Studio to evaluate their RAG system, and immediately they could transparently see which parts of their stack were problematic.
Within a few days, a single engineer was able to revise and deploy a new retriever and embedding model, and push their product description tool to production. This would have taken them weeks and multiple engineers to resolve in the past.
Any specific features of Galileo's platform that you'd like to highlight? Give us a comprehensive overview of each of these.
There are 2 products I'd love to highlight here, and each of these was borne out of listening to our customers' pain points for months at a time:
Galileo RAG Analytics: RAG has emerged as the de-facto method for quickly getting started and productionizing GenAI applications. However, we heard from customers that they were often unsure about the quality of their RAG system - for example, the chunking strategy, the quality of the retrieved chunks, the completeness of the response, the response's adherence to context, and a lot more.
We took a first-principles approach and built 5 new high-accuracy metrics, each designed to make the RAG system transparent and actionable. With evaluation platforms, users don't just want a trace of their run in a fancy UI - that's easy. They want the platform to point out blind spots and fundamentally improve the model's accuracy. That's what RAG Analytics provides.
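To make one of those RAG questions concrete - which retrieved chunks does a response actually draw on? - a toy chunk-attribution heuristic can be sketched as below. This is an illustration only, not Galileo's metric; the token-overlap rule and the 0.3 threshold are assumptions for the example.

```python
# Toy chunk-attribution heuristic (illustration, not Galileo's metric):
# a retrieved chunk counts as "used" if enough of its tokens appear in the response.

def attribute_chunks(response: str, chunks: list[str], min_overlap: float = 0.3) -> list[bool]:
    resp_tokens = set(response.lower().split())
    used = []
    for chunk in chunks:
        tokens = chunk.lower().split()
        overlap = sum(t in resp_tokens for t in tokens) / max(len(tokens), 1)
        used.append(overlap >= min_overlap)
    return used

chunks = [
    "shipping is free for orders over 50 dollars",
    "our headquarters are located in san francisco",
]
response = "Shipping is free for orders over 50 dollars."
print(attribute_chunks(response, chunks))  # [True, False]
```

Surfacing per-chunk attribution like this is what lets a team see that, say, half their retrieved context is never used, which points directly at the retriever or chunking strategy rather than the model.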
Galileo Protect: We just launched Protect. When we launched Observe, our real-time monitoring product, customers loved its hallucination, security, and privacy metrics. We kept getting asked if it was possible to intercept their LLM's API calls and proactively block harmful responses from reaching the user. This meant we needed to be in the application layer!
To do this well, we've reduced our Evaluation Foundation Model latency to milliseconds and brought the cost down to ~$0. We're excited to finally launch Galileo Protect, and the early customer results have been tremendous! This required close collaboration between our ML research and infra teams. I am super proud of the team.
How do you see Galileo's core product offering evolving over the next 6-12 months?
Currently, we're very focused on evolving Galileo across 2 distinct axes:
1) Breadth. As genAI becomes more complex and varied (e.g., agents, multi-modality), we're making sure Galileo's platform can quickly bring the same level of evaluation and actionability to these new paradigms and modalities. For example, we're excited about the increasing interest in multi-modality and already have native support for computer vision.
2) Depth. Working towards our eventual goal of automating genAI evaluations across the AI development lifecycle, we have a long roadmap of new evaluation models we're excited to ship to our users.
What's Galileo's internal approach to choosing what to build? How does your team balance research vs. product development, given so much of the value in AI is still in the research phase and early in product development?
I'd say we're extremely customer-obsessed and are thankful to have dozens of enterprise customers to build for and with. Our roadmap is informed by customer feedback and recurring pains we observe from customer usage. We strive to deliver "magical" experiences for our users, which often requires extensive effort from our AI research team. A great example of this is the RAG and Agent analytics I mentioned above, which required extensive R&D in order to save users from many hours of manual eyeballing.
Overall, I'd say it's a super exciting time to be building in AI! We keep very close tabs on academia (which has been a huge source of building blocks for our approach), and we talk to dozens of prospects every week, which helps us identify trends and user pain points early. Internally as a team, we over-communicate these learnings across engineering, research, product, sales, and marketing, so that we're all pushing full steam ahead together and sharing ideas. You should see our Slack channel!
How would you describe the culture at Galileo? Are you hiring? What do you look for in prospective team members?
My co-founders and I have always focused on creating a culture that's highly collaborative, high-ownership, and customer-obsessed. We have a fairly flat structure and a lot of open communication - a lot of this is likely borrowed from Google, given how long we had worked there!
We're always looking for great folks to join our team across engineering, AI research, sales, and marketing. When hiring, apart from obviously looking for intelligence and role relevance, we also apply the airport test - meaning, if we were both stuck waiting for a plane at an airport for a few hours, could we get along really well? This has helped us foster a genuinely supportive team as we move at 1,000 miles per hour! We have some exciting roles open right now!
Anything else you'd like readers to know?
The GenAI landscape moves at lightning speed, and it truly feels like things are just getting started. It's been super encouraging to collaborate and partner with other brands and communities in this space, like Cerebral Valley. Thanks for all you folks do for the community!
Conclusion
To stay up to date on the latest with Galileo, follow them on X and learn more about them at Galileo.
Read our past few Deep Dives below:
If you would like us to "Deep Dive" a founder, team, or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.