
Rime - Authentic AI Voices for Real-time Conversations at Scale 🗣️

Plus: Founder/CEO Lily Clifford on what the future holds for voice AI in business applications...

CV Deep Dive

Today, we’re talking with Lily Clifford, Founder and CEO of Rime.

Rime is building next-generation speech synthesis models designed for high-volume, real-time enterprise applications. Unlike traditional text-to-speech solutions, Rime’s models are built for dynamic, customer-facing interactions, powering millions of automated phone calls, drive-thru orders, and enterprise voice applications every day. Founded in 2023, Rime emerged as businesses across industries, from fast food and retail to healthcare and telecom, began demanding more lifelike, customizable AI voices that could operate at scale.

Today, Rime’s technology powers everything from phone ordering at major brands like Domino’s and Wingstop, to backend automation in healthcare and enterprise customer support. With a strong emphasis on personalization, the company is also exploring how AI voices can be tailored to different customer demographics, brands, and user experiences. Rime prioritizes pronunciation accuracy, real-time responsiveness, and voice adaptability, helping businesses not only automate interactions but also improve customer engagement and conversion rates.

In this conversation, Lily shares how Rime built its foundation models, the complexities of training AI for speech synthesis at scale, and what the future holds for voice AI in business applications.

Let’s dive in ⚡️

Read time: 8 mins

Our Chat with Lily 💬

Lily, welcome to Cerebral Valley! First off, introduce yourself and give us a bit of background on you and Rime. What led you to co-found Rime?

Hey there! I’m Lily, Founder and CEO of Rime. Prior to founding Rime in 2023, I was a PhD student at Stanford studying Computational Linguistics. I ended up dropping out because I wanted to hack on speech synthesis models—specifically for customer support. This was pre-ChatGPT, before OpenAI was doing anything in voice, and I was deep into sociophonetics—the study of how social and demographic factors influence speech. People in Texas sound different from people in California, and we all pick up on these cues, consciously or not.

At the same time, I was obsessed with the multi-speaker speech synthesis research happening around 2019-2020. End-to-end attention-based models were getting so good that real-time, near-human speech was starting to feel within reach. After months of moonlighting on what would eventually become Rime, I dropped out and convinced two friends to join me as co-founders—Brooke, who was at Amazon Alexa, and Ares, who was at UC San Francisco working on brain-computer interfaces for people who had lost the ability to speak.

We started by setting up a recording studio in San Francisco’s Mid-Market and collecting an insanely large dataset of full-duplex, speech-to-speech interactions. At the time, we didn’t even realize how ahead of the curve we were—people weren’t really training end-to-end speech-to-speech models yet. We just knew this dataset was going to be valuable. On the basis of this ever-growing dataset, we started training speech synthesis models—while speech-to-speech demos are attracting a ton of attention, TTS is still an incredibly impactful technology. I always joke that it’s the OG generative AI.

Before we dive in further, how would you describe Rime to the uninitiated developer or AI team?

Rime trains foundation models for generating AI voices that sound natural and relatable. These voices help businesses connect with customers in real-world scenarios—whether over the phone, at a drive-thru, or anywhere voice applications are used.

Talk to us about that a bit more - how does Rime position itself with TTS within the broader context of the generative AI revolution of the past few years?

Text-to-speech has been around for decades solving business problems, so there was already an existing market. We found our first design partners through personal networks, startup communities, social media, and a lot of inbound interest. This was 2023, when consumer expectations for voice applications were leveling up fast. The legacy solutions from hyperscalers—Google, Microsoft, Amazon—suddenly felt outdated, kind of like how CGI from the ’90s looks today.

We spent a lot of time talking with customers, design partners, and potential users to figure out what they actually needed—beyond just better voices. Traditional TTS problems were still there, but now, with large language models, entirely new possibilities for voice AI were opening up. What surprised me was how quickly businesses started scaling with Rime. We’re processing over a million requests a day now, with 20 million outbound sales calls this quarter alone. None of this would have been possible before LLMs—there was always an uncanny valley problem. But once you get past that, it suddenly becomes much easier to build these applications.

At this point, 99.9% of the traffic going through our API is generated by LLMs—almost none of it is scripted. At the same time, some of the hardest problems in real-time, enterprise-scale conversational AI have actually become even harder to solve. So we focused on where we knew we could win—not just building the fastest, most lifelike speech synthesis models, but also developing the right tooling to support high-volume, real-time AI voice applications at scale.

If you’re building a phone ordering application for a restaurant chain like Sbarro—which we don’t work with yet, but they have 600 locations worldwide—you simply can’t use most of the other text-to-speech products on the market. They can’t even pronounce "Sbarro" correctly. That’s the kind of problem we’re solving. Some people might call it a last-mile problem, but if a model can’t get the name of the business right, that sounds like a first-mile problem to me.

Riding this wave of innovation has taken us in directions I didn’t anticipate, especially when it comes to personalization. Two years ago, if you were building a voice application, you’d pick the one or two North American English voices from Google or Microsoft that your customers hated the least, put them into production, and just hope they pronounced everything correctly. That’s not an exciting user experience. Now, the next big push for us is figuring out how to let customers use what they know about their users to personalize voice interactions in meaningful ways.

Who are your users today? Who is finding the most value in what you're building at Rime? 

We work with a ton of really exciting startups building in back-office healthcare and revenue cycle management, with some really disruptive use cases there. One of our partners, ConverseNow, powers about 80% of all Wingstop and Domino’s phone orders in North America. So there’s a four-in-five chance that if you call Domino’s or Wingstop in North America, you’re hearing our voices.

The interesting thing about our users is that they tend to be creating high-volume applications where they need scalability and a really fast model. But also, if you’re deploying phone ordering for Domino’s, Wingstop, Blake’s Lotaburger, or Jet’s Pizza, you need to be sure the bot is going to pronounce all those menu items correctly.

I’m not kidding—our ideal customer, and really our median customer right now, is making somewhere between 30,000 and 100,000 phone calls a day or receiving that many inbound calls. The problems that come up at that scale are just different. It’s not that we’re uniquely able to solve them, but we’ve built capabilities into our models that give developers the tools to handle these first-mile issues themselves.

In the healthcare use case, for example, a model needs to be able to pronounce things like a member ID number in the same way a human would. That’s still a really hard problem for these models. So we’ve been heavily focused on the data collection effort to support more typical enterprise calling, where 80% of the call is just confirming information—like, "Let me make sure I’ve got this right. Your name is Patrick, spelled P-A-T-R-I-C-K." Simple problems, but solving them well makes a huge difference for businesses handling hundreds of thousands of calls a day.
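To make that concrete, here is a minimal sketch of the kind of application-side text normalization a developer might do before synthesis, so IDs and name spellings get read the way a human agent would say them. The helper names are hypothetical and this is not Rime's pipeline, just an illustration of the problem.

```python
# Hypothetical pre-processing a developer might do before sending text to a
# TTS engine, so IDs and spellings are read the way a human agent would say
# them. These helpers are illustrative only, not part of Rime's API.

def spell_out(word: str) -> str:
    """Render a name letter by letter, e.g. 'Patrick' -> 'P-A-T-R-I-C-K'."""
    return "-".join(word.upper())

def speak_member_id(member_id: str, group: int = 3) -> str:
    """Chunk an ID into human-sized groups, e.g. 'XQ4931872' -> 'X Q 4, 9 3 1, 8 7 2'."""
    chars = list(member_id.upper())
    groups = [" ".join(chars[i:i + group]) for i in range(0, len(chars), group)]
    return ", ".join(groups)

if __name__ == "__main__":
    name = "Patrick"
    member_id = "XQ4931872"  # made-up example ID
    prompt = (
        f"Let me make sure I've got this right. "
        f"Your name is {name}, spelled {spell_out(name)}, "
        f"and your member ID is {speak_member_id(member_id)}."
    )
    print(prompt)
```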

Walk us through Rime’s platform. What use-case should customers experiment with first, and how easy is it for them to get started? 

Getting started is super easy. The interesting thing about any model served by an API is that, while there are some interoperability concerns, being an API-first and model-first business means people can plug and play pretty easily—especially if they’re using an orchestration platform like LiveKit or Daily, which a lot of our users rely on.
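For context, a typical API-first TTS integration looks something like the sketch below: one HTTP call that turns text into audio, which an orchestration layer then streams into the call. The endpoint, headers, and parameters here are placeholders, not Rime's documented API, so treat it as a shape rather than a contract.

```python
# A minimal sketch of how an API-first TTS service typically slots into a
# voice stack. The endpoint, auth header, and request fields below are
# hypothetical placeholders; check the provider's docs for the real contract.
import os
import requests

TTS_URL = "https://api.example-tts.com/v1/synthesize"  # placeholder endpoint

def synthesize(text: str, voice: str = "default") -> bytes:
    """Send text to the TTS endpoint and return raw audio bytes."""
    resp = requests.post(
        TTS_URL,
        headers={"Authorization": f"Bearer {os.environ['TTS_API_KEY']}"},
        json={"text": text, "voice": voice, "format": "wav"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.content

if __name__ == "__main__":
    audio = synthesize("Thanks for calling! What can I get started for you?")
    with open("greeting.wav", "wb") as f:
        f.write(audio)
```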

Typically, what brings a customer to us over a competitor is that they’ve run into issues where they need more customizability than they’re getting elsewhere. And for a lot of these use cases, Google, Microsoft, and Amazon aren’t even options because their voices just aren’t realistic enough.

Our sweet spot is someone who has concerns around scale. Maybe they’re already using another startup TTS product but need more control—especially when they’re managing hundreds of thousands of calls in production. A big part of our story is figuring out how to build more tooling into the model itself so that when a user calls one of our customers—like Wingstop—the system knows which voice should be served to create the most compelling experience for that specific user.

Speed, quality, and accuracy will continue improving across the board, whether it’s Google, Microsoft, Amazon, or us. So the question becomes: how do you keep delivering value in a world where it’s getting easier and easier to train these models? That’s the curve we’re staying ahead of.

Which existing use-case for Rime would you say has fit the best so far? 

Our unique advantage is around calls that are customer-facing, where brand becomes really important. When someone calls Wingstop, they want a voice that resonates with their brand. And at the same time, if they’re trying to personalize that experience, the question we ask is: why? What does personalization get you? Does it increase automation success? Does it improve the likelihood of handling the call fully? Does it increase the chance of upsell during that automated interaction? If the answer is yes, then that’s where we come in—helping businesses choose the voices that drive real business impact.

Let me put it this way: 99.99999% of our volume happens over the phone or at the drive-thru. Our users generally aren’t making static audio content for podcasts, TikTok, or YouTube. Some are, and we’re in close contact with them, but the model you build for telephony looks very different from the one you build for audiobook narration. We’re building these models specifically for enterprise calling.

How do you think about testing and measuring the impact that you’re creating for your customers? 

It's often hard to measure when there are multiple contributing factors to an improvement in a KPI, but just as an example—when one of our customers switches from Microsoft to Rime text-to-speech for a phone ordering application for fast food, they immediately see 20% more of those phone calls being automated where they weren’t before. That represents millions in revenue.

The reason why is that what our customers are trying to maximize is simply the willingness to talk to the bot. There are still a lot of frustrating things about talking to a bot, even with all the improvements in intelligence over the last two years. If it doesn’t understand your name and then says, “Oh, and by the way, is your name this?” and mispronounces it, that’s a problem that impacts customer experience.

So right out of the gate, having a model that’s faster to respond and sounds not just human, but like it’s actually having a conversation with you, increases the willingness to engage without pressing 0, hanging up, or asking to be transferred to a human. Typically, our customers have a really strong idea of what call success looks like, which allows us to measure it in the pilot process. We tell them, “Put 5% of your volume on Rime and you’ll see if it improves the business metrics you care about.” And honestly, we’ve had a 100% success rate in running these pilots.

Could you share a little bit about Rime’s technical architecture? How does it work under the hood?  

The first version of the model we trained with our design partners wasn’t very good. When you’re working with a large amount of proprietary conversational speech data, extemporaneous speech is really hard to model. The way someone speaks in an audiobook is completely different from how they talk in a conversation, and that impacts the architectural decisions you make around speech synthesis. We had to develop a really firm understanding of how to model that effectively.

Then there’s the question of how to build tooling that lets developers customize pronunciation without needing to know the International Phonetic Alphabet. Anyone can claim their model has custom pronunciation, but if the only way to edit it is through IPA, how useful is that really?

So much of what we do falls into two buckets. One is what I’d call "linguistics as a service"—making it easy for developers to adjust pronunciation and speech patterns without needing deep phonetics knowledge. The other is "demographics as a service"—because most of our users, myself included, aren’t voice casting directors. The old approach to text-to-speech was, "Which of these 10 voices do I hate the least?" 

Now, with infinite voice options, the question is, "Which voice do I choose to maximize the business outcome I care about?" And the people making these decisions aren’t brand marketers; they’re VPs of Engineering or AI teams who don’t necessarily have experience thinking about voice in this way. So a big part of what we solve is not just making high-quality voices, but actually guiding customers on which ones to use in the first place.
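To make the "linguistics as a service" side of that concrete, here is a minimal sketch of respelling-based pronunciation overrides applied on the application side, assuming the developer keeps a small lexicon of brand and menu terms. The lexicon and helper are hypothetical illustrations of the idea, not Rime's actual tooling, which may expose this natively in the request.

```python
# A minimal sketch of respelling-based pronunciation overrides, assuming the
# application maintains its own lexicon and rewrites text before synthesis.
# Illustrative only; not Rime's "linguistics as a service" implementation.

# Brand and menu terms mapped to plain-English respellings (no IPA required).
PRONUNCIATION_LEXICON = {
    "Sbarro": "sbar-oh",
    "Lotaburger": "lot-uh-burger",
    "gyro": "yee-roh",
}

def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Replace known terms with respellings the TTS model will read correctly."""
    for term, respelling in lexicon.items():
        text = text.replace(term, respelling)
    return text

if __name__ == "__main__":
    line = "Welcome to Sbarro, would you like to add a gyro to your order?"
    print(apply_lexicon(line, PRONUNCIATION_LEXICON))
```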

How do you see Rime evolving over the next 6-12 months? Any specific product developments that your customers should be excited about? 

We're lucky that we have customers pushing us into new linguistic territory. We’re working with major telco providers in India and Sri Lanka to develop high-quality text-to-speech models for underserved linguistic markets, where most business interactions still happen over the phone. That’s an exciting challenge for us.

At the same time, we’re focused on building a platform that allows users to A/B test voices and measure their impact on business outcomes. Our core customer base includes startups making a high volume of calls and large enterprises upgrading their outdated IVR systems. The same product that powers the phone tree for a top-five North American bank is also used by a five-person startup in healthcare operations, helping them iterate and improve their application over time.
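As an illustration of what voice A/B testing could look like on the customer's side, here is a minimal sketch that deterministically assigns each caller a voice variant and logs whether the call was fully automated. The voice IDs, bucketing scheme, and metric are hypothetical placeholders, not a description of Rime's platform.

```python
# A minimal sketch of one way to A/B test voices against call outcomes:
# deterministic assignment per caller, then log the outcome per voice.
import hashlib
from collections import defaultdict

VOICE_VARIANTS = ["voice_a", "voice_b"]  # placeholder voice IDs

def assign_voice(caller_id: str) -> str:
    """Deterministically bucket a caller so they always hear the same voice."""
    bucket = int(hashlib.sha256(caller_id.encode()).hexdigest(), 16) % len(VOICE_VARIANTS)
    return VOICE_VARIANTS[bucket]

results: dict[str, list[bool]] = defaultdict(list)

def log_call(caller_id: str, fully_automated: bool) -> None:
    """Record whether the call completed without escalating to a human."""
    results[assign_voice(caller_id)].append(fully_automated)

if __name__ == "__main__":
    for caller, ok in [("+15550001", True), ("+15550002", False), ("+15550003", True)]:
        log_call(caller, ok)
    for voice, outcomes in results.items():
        rate = sum(outcomes) / len(outcomes)
        print(f"{voice}: {rate:.0%} automation rate over {len(outcomes)} calls")
```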

I was recently asked if Rime would ever build an agent—a framework for orchestrating interactions. I think my answer is no. Our priority is enabling customers to use our models for things they couldn’t do otherwise, not creating orchestration tools. There are already strong players in that space like Daily and LiveKit. We trust our customers to integrate the best solutions available and help shape our product roadmap in the process.

We're rolling out self-serve enterprise voice cloning through our API, so customers won’t have to go through professional services to get it set up. That’s a big deal for platforms that want to offer voice cloning but don’t want their users bouncing to a third party.

Latency is always a focus, and on-prem is going to be massive for us this year. By the end of the year, I’d guess 90% of our volume will be running on-prem. A lot of that comes down to who our customers are—highly regulated industries that need tight security and super low latency. Also, keep an eye out for our platform’s personalization tools!

Lastly, tell us a bit about the team - how would you describe the culture at Rime? Are you hiring, and what do you look for in prospective team members joining Rime? 

There are nine of us, and Rime is powering tens of millions of phone calls and drive-thru orders every month. So each Rime employee has a huge impact—we’re each helping to power over a million phone calls every month.

My hunch on culture is that we’re very kind, down-to-earth people who really care about solving particular problems. From a product perspective, we don’t care how we get there. From a modeling perspective, people always ask, “What’s your calling card?” Like, so-and-so has Transformers, this other company has Diffusion. Honestly, text-to-speech systems are Frankensteins to begin with. One part might be a Transformer, another part might be Diffusion, some other part might be a state-space model.

I always try to instill this sense in our team that we should be product-focused, not just innovation-focused. Innovation drives the product forward, sure, but you don’t want to put the cart before the horse.

We work in person in San Francisco, and we’re hiring ML and product engineers right now to build out some of this platform-level stuff. If anyone’s looking for really unique opportunities to harness speech synthesis models to do things they were never able to do before, that’s the kind of person we’d be looking for.

Conclusion

Stay up to date on the latest with Rime, and learn more about them here.


If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.