
Any voice, any way — LMNT is leading the future of AI speech 🗣

Plus: Founder Sharvil on his desire to make human-machine interaction seamless...

CV Deep Dive

Today, we’re talking with Sharvil Nanavati, Founder of LMNT.

LMNT is leading the way in creating lifelike AI speech. Driven by a longstanding desire to make human-machine interaction seamless, LMNT focuses on creating AI voices that sound authentic, making interactions with AI much more engaging and effective. Plus, their quick response times mean smoother, more natural conversations without awkward pauses.

They've nailed down some big partnerships, like with Khan Academy, and are making waves in education, content creation, and gaming. Whether it’s giving an AI tutor a natural-sounding voice or helping content creators find their voice, LMNT is making AI speech more relatable for everyone.

LMNT raised a seed round in mid-2023, led by Elad Gil and Sarah Guo from Conviction. In this conversation, Sharvil shares his journey of founding LMNT, as well as some challenges with speech synthesis and forecasts for what’s next.

Let’s dive in ⚡️

Read time: 8 mins

Our Chat with Sharvil 💬

Sharvil - welcome to Cerebral Valley! Tell us about your journey leading up to founding the AI voice startup LMNT (pronounced “element”).

Hey, I'm Sharvil! I've been building things my whole life – from my first startup, an early mobile music streaming platform back in 2003, to originating and leading the team behind Google Glass. After working across consumer electronics and machine learning, plus a brief stint in venture capital, I decided to return to building from the ground up. Along the way, I found a lot of problems with state-of-the-art speech synthesis technology as well as some hidden gems that I knew could help solve them. That’s when I started LMNT.

I've always been passionate about human-machine interaction, especially through speech. Sci-fi often shows amazing tech but really unnatural AI speech, and we're here to change that. I've been researching this space since 2019, and now with LMNT, our team is tackling the hardest problems of speech production head-on. We’re building technology that’s more approachable and capable of both understanding and exhibiting human emotions. There’s no time more exciting than now to be doing this work, and we’re loving the process of leading it.

(As part of the founding team in 2010, Sharvil helped bring Google Glass, the groundbreaking augmented reality device, to life.)

Give us a top-level overview of LMNT as it stands today. How would you describe the startup and its mission to those who are maybe less familiar with you? 

LMNT is on a mission to break down communication barriers with AI.

We build multimodal models that produce human-like speech in any language, voice, style, and emotion. Our goal is to make speech production more engaging and accessible for everyone. Our state-of-the-art models are built to produce incredibly realistic speech (we bring the human “LMNT” to our models!), with response times as fast as natural conversation (<150ms latency), and zero hallucinations.

We provide developers an API to give their products a voice. This can be applied to various product categories, including entertainment, media production, and customer service agents.
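As a rough sketch of what "giving a product a voice" through an API like this tends to look like for a developer (note: this is not LMNT's actual API; the endpoint, field names, and credential below are hypothetical placeholders), the integration usually reduces to a single request that returns audio bytes:

```python
# Hypothetical sketch of a developer-facing TTS call — NOT LMNT's actual API.
# The endpoint, JSON fields, and auth header are illustrative assumptions only.
import requests

API_URL = "https://api.example-tts.com/v1/synthesize"  # placeholder endpoint
API_KEY = "your-api-key-here"                          # placeholder credential

def synthesize(text: str, voice: str = "narrator", out_path: str = "speech.wav") -> str:
    """Send text to a TTS endpoint and write the returned audio to disk."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text, "voice": voice, "format": "wav"},
        timeout=30,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)  # in this sketch the response body is raw audio bytes
    return out_path

if __name__ == "__main__":
    print(synthesize("Welcome to class! Let's pick up where we left off."))
```

In practice, low-latency conversational use cases stream audio chunks back as they are generated rather than waiting for a complete file, but the shape of the call is the same.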

Whether you need a voice to power your new teaching tool or provide (tuneable) commentary on a sports match you’re watching, we’ve got it covered! Some use cases that we are particularly interested in are:

  • Education: we’re the voice behind Khanmigo, Khan Academy’s massively popular AI tutor.

  • Gaming: we’re changing the gaming space with interactive, streaming speech for studios like Aviar Labs and for many indie developers.

  • Content creation: for creators who want to scale up their existing productions, expand their reach into other language markets, or experiment with new content forms enabled by AI tools.

What's the most challenging technical problem you're facing at LMNT right now, and how are you tackling it?

It is very difficult to quantify or benchmark speech quality. What sounds like a perfect voice to you might sound horrendous to someone else, and vice versa. It all depends on the context of the speech and individual preferences. Speech quality is inherently qualitative, and current algorithms struggle to quantify it. That isn't the case for text-based benchmarks like MMLU, HumanEval, or MATH, because those measure accuracy rather than how human something sounds. It was great to see Artificial Analysis launch a TTS leaderboard – it goes a long way in helping us, and everyone working in the speech domain, improve.

Broadly speaking, there are many more challenges than just making speech sound "human." For example, think about code-switching: the natural way multilingual people seamlessly switch between languages during conversation, often without conscious awareness. It's so ingrained in many multilingual families; if I stopped doing it, my family would think something was wrong. How do you capture this with AI speech? We’re working on it!

Another challenge is replicating the intricacies of human thought and conversation. We hesitate, we pause, we restart sentences. These are all natural parts of communication that current systems often struggle to understand and respond to appropriately. Imagine if you paused to gather your thoughts, and the system immediately jumped in, interrupting you. That's the kind of problem we're tackling head-on.
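To make the interruption problem concrete, here is a minimal sketch (not LMNT's approach; the threshold and frame values are assumptions) of the naive end-of-turn logic many systems use: treat any silence longer than a fixed cutoff as "the user is done." A thinking pause easily exceeds that cutoff, which is exactly when the agent barges in.

```python
# Illustrative sketch of a naive silence-based end-of-turn detector.
# SILENCE_THRESHOLD_S is an assumed cutoff; real thinking pauses often run 1-2 s.

SILENCE_THRESHOLD_S = 0.6

def naive_end_of_turn(frames, frame_duration_s=0.02, energy_floor=1e-4):
    """Return True once trailing silence exceeds the fixed threshold.

    `frames` is a list of per-frame RMS energies from the microphone stream.
    """
    silent_run = 0.0
    for energy in reversed(frames):
        if energy < energy_floor:
            silent_run += frame_duration_s
        else:
            break
    return silent_run >= SILENCE_THRESHOLD_S

# A 0.8 s pause to gather your thoughts already trips the detector:
frames = [0.02] * 100 + [0.0] * 40   # 2 s of speech, then 0.8 s of silence
print(naive_end_of_turn(frames))     # True -> the system would interrupt here
```

A system that actually understands conversational behavior needs to weigh what was said and how it was said, not just how long the microphone has been quiet.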

Essentially, our biggest challenge is creating technology that truly understands and mimics human conversational behavior. It's a complex, ongoing adventure, but we're committed to pushing the boundaries of what's possible.

For those using AI speech in their projects, what are some complexities they should be aware of?

The nuances in AI text-to-speech go far beyond the basics. Every language and region has its own intricacies. For example, the use of filler words like "um" or "ah" varies significantly across languages – what works in English might sound odd in Mandarin. The same goes for regional dialects. In India alone, there are over 3,000 languages and dialects, with accents changing drastically even over short distances. Capturing all that richness and nuance is incredibly difficult but essential for diversity and representation.

From a representation standpoint, you also want to ensure you're reflecting the diversity of your users in many contexts. The voices should adapt to different tones: a customer service agent should sound empathic and calm, while an AI tutor in education should sound supportive and confident, but not authoritative.

This means we need to pay close attention to understanding and replicating the subtle variations in human communication—specific emotions and contextual nuances—for different scenarios. We naturally express ourselves differently when chatting with friends versus delivering a formal presentation, and these subtle differences are key to creating authentic and relatable AI voices.

Voice cloning presents another unique challenge. Your own recorded voice often sounds strange to your own ears. Even more fascinating, the same person can sound distinct when speaking different languages. My voice in Hindi or Gujarati sounds quite different from my English voice – it’s lower pitched with bigger swings in intonation.

How would you say LMNT is architecturally different from the tools that other players in the space are putting out there? What’s unique about your approach? 

Speech is interesting because it sits between images and text. Text is sequential and low-dimensional, while images are high-dimensional and random access. Speech combines elements of both: it's high-dimensional and sequential.

This dual nature presents specific challenges. We need architectures that handle high-dimensional data efficiently and support fast inference. At the same time, we need more structured output to avoid issues like repetition or hallucinations, which can occur in LLMs.

We’ve brought together ideas from both the image synthesis and LLM worlds to find a sweet spot of efficient, low-latency inference without hallucinations. We’ve also managed to get significant gains in performance through deep technical work: from cloud infrastructure all the way down to GPU-specific optimizations.
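A back-of-the-envelope comparison helps show why speech sits between the two modalities. The numbers below are typical assumed values (token rate, mel-spectrogram resolution, image size), not LMNT figures:

```python
# Rough, assumed per-sample data rates for each modality, illustrating why speech
# is both sequential (like text) and high-dimensional (like images).

TOKENS_PER_SECOND_OF_SPEECH = 3          # roughly 150-180 spoken words per minute
MEL_FRAMES_PER_SECOND = 86               # e.g. 24 kHz audio with a ~11.6 ms hop
MEL_BINS = 80                            # common mel-spectrogram height
IMAGE_PIXELS = 512 * 512 * 3             # one RGB image at 512x512

text_values_per_sec = TOKENS_PER_SECOND_OF_SPEECH          # one value per token
speech_values_per_sec = MEL_FRAMES_PER_SECOND * MEL_BINS   # ~6,880 values per second
image_values = IMAGE_PIXELS                                # ~786,432 values, no time axis

print(f"text:   ~{text_values_per_sec} values/second, strictly sequential")
print(f"speech: ~{speech_values_per_sec} values/second, sequential AND high-dimensional")
print(f"image:  ~{image_values} values/sample, high-dimensional, no ordering in time")
```

A model that emits thousands of values per second, in order, under a real-time latency budget is pulled in both directions at once, which is why ideas from image synthesis and from LLMs both end up in the stack.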

How do you see LMNT evolving in the next 6-12 months?

We think it’s critical to build "zero-iteration" products. Currently, getting the right result from an LLM often involves multiple iterations—tweaking prompts and editing responses. But in real-time conversations, you can't afford this back-and-forth. You need to get it right on the first try.

We're also putting a lot of energy into letting users generate voices from simple inputs. For example, if you're running an ad in Australia and need a middle-aged Australian male voice, you should be able to simply ask for that and get it. You should also be able to modify your voice. When you're playing FIFA, it could be fun to make your commentator more enthusiastic, use an unusual accent, or speak in a different language.

Can you tell us about the culture at LMNT? Are you hiring, and what do you look for in prospective team members?

We're a cracked team of seven based in Palo Alto coming together from Google, Microsoft, Stanford, Meta, and TikTok. We're all about building that old-school Silicon Valley vibe, similar to what I experienced at Google[x]—work hard, play hard, and tackle crazy ambitious problems to build next-gen tech.

We’re growing our AI team and are looking for stellar Product/Full-Stack Engineers. Come join us if you’re a builder with a high learning rate and drive. We have job postings here, and if you’re looking for a role that’s not posted, email us at [email protected].

Conclusion

To stay up to date on the latest from LMNT, follow them on X and learn more on the LMNT website.


If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.