Aqua Voice (YC W24) wants you to never type again 🗣️
Plus: CEO Finn on what's next for voice-driven technology...
CV Deep Dive
Today, we're talking with Finn Brown, founder and CEO of Aqua Voice (YC W24).
Aqua Voice aims to redefine how people interact with technology through voice, using a modular system of transcription, language, and routing models. By orchestrating tools like Whisper and custom phonetic models, Aqua delivers dictation that feels more like working with a human scribe: accurate, real-time, and capable of handling complex edits on the fly.
Stemming from his own trial and error with existing dictation tools as a dyslexic student, Finn has built Aqua Voice to make voice input intuitive, contextual, and as natural as speaking to another person.
Key Takeaways:
Real-Time Voice Dictation: Unlike competitors that rely on asynchronous transcription, Aqua Voice updates text live, allowing users to correct errors and issue commands in real-time.
Tailored AI Pipeline: Aqua integrates transcription, phonetics, user dictionaries, and session-specific context for highly personalized and accurate voice input.
Visionary Goal: Aqua Voice is building toward a future as the universal voice input layer for all devices and applications.
Currently, the team operates out of a hacker house in San Francisco, living and working together as they refine their product and prepare to tackle broader use cases.
In this conversation, Finn shares how Aqua Voice's orchestrated-model approach beats competitors, why real-time interaction is key, and what's next for voice-driven technology.
Let's dive in ⚡️
Read time: 8 mins
Our Chat with Finn 💬
Finn, could you introduce yourself and give us a bit of background on what led you to found Aqua Voice?
Absolutely. I'm Finn, and I've always been kind of an "ideas person" who loves books, philosophy, and exploring different concepts. I'm also dyslexic, which played a big role in shaping my interest in voice technology. Back in sixth grade, when I was about 11 or 12, I started using Dragon Dictation to write my papers because typing them out was a nightmare with all the spelling mistakes. It was clunky and not very intuitive. I had to learn all these weird commands, and then go back and check every single word. It never felt natural, like talking to a real person or a good scribe.
That experience stuck with me as I went through school. I dictated all my college papers, too, but the workflow never improved much. It felt like something needed to be done to make talking to your computer more like talking to a human, and less like wrangling complicated software.
I grew up in Boise, Idaho, spending a lot of time outdoors: hiking, skiing, all that. Eventually, I ended up at Harvard, which was a big change. I mostly studied philosophy there, reading Russian novels and diving into old, somewhat forgotten philosophers. I only really started coding junior year, more on a whim than anything else. But that combination of my personal struggle with voice workflows, my philosophical love for exploring ideas, and eventually picking up coding led me to found Aqua Voice. I wanted to build something that felt as natural as talking to another person, because if I wanted it for myself, I figured a lot of other people might want it, too.
For those unfamiliar, could you give us a short explanation of what Aqua Voice is?
Aqua Voice is dictation that truly understands human commands, letting you fully control text editing with natural speech. It's like having someone at the keyboard who "gets" what you mean, whether it's clarifying a spelling on the fly or rephrasing a sentence, so talking to your computer feels more like having a real human scribe at your side, accurately capturing your words and intent.
INTRODUCING: Aqua Voice Desktop.
A universal voice layer on your desktop - that lets you dictate using natural language.
- Aqua Voice (@aquavoice_)
3:28 PM • Dec 11, 2024
You've been dealing with dictation since sixth grade, waiting for a better solution. When did you realize you could finally solve that problem yourself?
The pivotal moment was reading the Whisper paper. When Whisper came out, I dug in and realized they weren't just scaling up speech models, but also starting to show these emergent behaviors. For instance, Whisper would often remove "ums" and "uhs" automatically. I saw that as a sign that the decoder block, essentially a language model in its own right, was getting more contextually aware. It wasn't just brute-force transcription anymore; it understood patterns and could clean things up.
Before that, I'd been stuck with Dragon Dictation and various cloud services (Google, Amazon, Rev), constantly benchmarking them with my own Python scripts, hoping one would outperform Dragon. But none ever really stood out. They all made too many mistakes. Whisper changed that because it hinted that with a model this good as a foundation, we could build dictation that not only got the words right, but understood the context and intent behind them. That's when I realized this wasn't just a pipe dream anymore.
You mentioned Whisper and these various models you're using under the hood. Could you walk us through how Aqua Voice leverages multiple tools, like Whisper and LLMs, to make the magic happen?
We basically take an all-of-the-above approach. We've orchestrated a bunch of different components (Whisper or Whisper-like models, several LLMs, and a router model) to figure out what works best given the user's context and what they're trying to do. We don't rely on just vanilla Whisper or a single model. Instead, we string together multiple pieces so that, to the user, it all feels seamless.
For instance, Whisper is asynchronous and processes audio in chunks. You can tweak it to seem real-time, but it's not naturally suited to that. We also use real-time models, plus a few different language models. We have a router model that decides which tool to use at any given time, based on the context and what the user's saying.
I won't go into every engineering detail, but the main idea is we don't just feed everything into one big fused model and hope for the best. Sometimes raw audio understanding is helpful, sometimes it's not. Pure transcription might be enough most of the time, and going beyond that, like integrating audio cues directly into a language model, can be expensive, slow, or just unnecessary.
In short, we blend specialized models for transcription, reasoning, and routing. This lets us be tactical: we get maximum accuracy and reliability without wasting resources or complicating the entire pipeline. It's not just throwing everything into one monolithic system; we're more strategic, ensuring every piece does what it's best at.
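To make the routing idea concrete, here is a minimal sketch of what such a dispatcher could look like. It is purely illustrative: the prompt, the `gpt-4o-mini` model name, and the two labels are stand-ins we chose for the example, not Aqua's actual internals.

```python
# A minimal sketch of the "router" idea: a small model decides whether an
# utterance is ordinary dictation or a command about text already written.
# Prompt, model name, and labels are illustrative assumptions, not Aqua's.
from openai import OpenAI

client = OpenAI()

def route(utterance: str) -> str:
    """Return 'dictate' or 'edit' for one transcribed utterance."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for a dedicated router model
        messages=[
            {"role": "system",
             "content": ("Classify the utterance as 'dictate' (speech to write "
                         "down verbatim) or 'edit' (an instruction about existing "
                         "text). Reply with exactly one word.")},
            {"role": "user", "content": utterance},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

print(route("The meeting is on Tuesday at noon."))     # -> dictate
print(route("Actually, change Tuesday to Thursday."))  # -> edit
```

In a real system the router would presumably also see the current document and recent audio, but even this text-only toy shows the payoff: a cheap classification step keeps the heavier models out of the common path.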
What's a surprising technical observation you've made while building Aqua Voice?
A year ago, people thought a single huge multimodal model would outperform specialized tools at everything, including transcription. But in practice, adding audio into a big language model often isn't as good as using a top-tier transcription model first. We discovered this too, so we're selective about when we use which component. Sometimes the user's intention isn't clear from text alone, so we leverage the audio embeddings or hints from the Whisper encoder's output, giving the language model a richer input than just text but without fully merging everything into one giant model.
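The open-source whisper package exposes its encoder directly, so one crude way to experiment with this kind of "richer than text" input is to pull out the encoder's audio features alongside the transcript. How Aqua actually fuses audio hints with its language models isn't public; the file name and mean-pooling choice below are our own.

```python
# Sketch: extract Whisper's encoder features alongside the transcript, so a
# downstream model could see audio-derived hints, not just text.
# Requires: pip install openai-whisper torch. "clip.wav" is a placeholder file.
import torch
import whisper

model = whisper.load_model("base")

audio = whisper.load_audio("clip.wav")          # 16 kHz mono waveform
audio = whisper.pad_or_trim(audio)              # pad/trim to 30 seconds
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    features = model.encoder(mel.unsqueeze(0))  # (1, n_frames, d_model)

text = model.transcribe("clip.wav")["text"]     # plain transcription
audio_hint = features.mean(dim=1)               # one pooled vector per clip

# A language model could consume `text` plus `audio_hint` (projected into its
# embedding space) instead of text alone, without fusing the two systems.
print(text, audio_hint.shape)
```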
You've mentioned how Aqua Voice is like talking to your own personal scribe, with natural language corrections. So it uses async transcription (like Whisper), multiple language models, and a routing model to understand user intent and even fix mistakes on the fly. Can you explain how these pieces fit together?
So under the hood, we've orchestrated a combination of different models, each good at different tasks. Whisper (or a Whisper-like system) handles the raw transcription from audio chunks, while separate language models step in for understanding context, making transformations, and interpreting commands, like when you say "delete everything after" a certain point. Then there's a router model that quickly decides which tool to use based on what the user's doing at that moment.
For example, if we sense the user is giving a command like "Actually, scratch that, go back and fix this part," we might hand that request off to a model known to be reliable at editing. If the user's just talking normally, a straightforward transcription might be all we need. But if we detect subtle audio cues or something that's not clear from text alone, we might bring in additional context or a "fusion" approach, where we integrate extra hints from the audio, to improve accuracy. All this happens behind the scenes. The user just sees the words adjusting as they speak, even correcting initial misunderstandings once there's enough context to realize the first guess was off.
It's not just about one big model doing everything. Instead, we're using specialized models for specific jobs, and a router to pick the right approach on the fly. This modular system lets the text on the screen constantly refine itself, self-heal errors, and handle complex voice commands, all without the user having to manually prompt corrections.
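For the editing side, the simplest public analogue is to hand the current text and the spoken instruction to a language model and ask for the revised text back. The function, prompt, and model below are our illustration of that step, not Aqua's actual edit model.

```python
# Sketch: apply a spoken edit command ("delete everything after X",
# "change Tuesday to Thursday") to the live buffer with an LLM rewrite.
# Prompt and model name are illustrative, not Aqua's.
from openai import OpenAI

client = OpenAI()

def apply_edit(buffer: str, command: str) -> str:
    """Return the buffer rewritten according to a spoken instruction."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": ("You are a text editor. Apply the user's instruction to "
                         "the document and return only the full revised document.")},
            {"role": "user",
             "content": f"Document:\n{buffer}\n\nInstruction: {command}"},
        ],
    )
    return resp.choices[0].message.content

doc = "Let's meet Tuesday at noon to review the draft."
print(apply_edit(doc, "Change Tuesday to Thursday."))
# Expected: "Let's meet Thursday at noon to review the draft."
```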
Some people might wonder why they need voice dictation if they already type well. What are the most effective use cases you've seen?
Our main unlocked scenario is what I'd call "stream-of-consciousness writing." Traditionally, dictation systems required you to think through exactly what you wanted to say before you said it. It was like reciting lines rather than freely expressing yourself. But with Aqua, you can throw out partial sentences and incomplete ideas, just talk it out, and the system makes it sound intentional. It's more like having a conversation than dictating a memo.
We see two main groups benefiting. First, people who need dictation for physical reasons, maybe an injury, or because they're on the move without a keyboard; this lets them get their thoughts down without pre-planning every word. It's less mentally draining since they don't have to "compress" their ideas into a perfect sentence before speaking. Second, people who can type just fine but find it easier and more natural to speak out their ideas and refine them verbally. For them, talking something through, even if it's messy at first, helps them arrive at the right phrasing more organically.
The dictation space seems to be heating up. Can you give us an idea of the competitive landscape, and how Aqua Voice stands out?
Aqua's key distinction is that we operate fully in real-time and handle complex edits smoothly, unlike many competitors. The elephant in the room is Flow from Wispr, which entered the space after us. I think a lot of these async competitors are essentially a basic Whisper + LLM pipeline and only show you a finalized output after you've finished speaking. It's like using a pen that writes in "invisible ink" until you've ended your paragraph: by the time you see it, any inaccuracies or needed edits force you to backtrack and fix everything at once.
With Aqua, there's no "invisible ink." We update text as you speak, and if you say, "Actually, scratch that" or "Change Tuesday to Thursday," we can handle it on the fly. We're not just doing simple transcription; we're orchestrating multiple models and a router to pick the right tool at the right time, ensuring accuracy and responsiveness. We invest more compute and complexity under the hood so you don't have to settle for rough first drafts.
For anyone writing important content, be it technical documentation, investor updates, or academic papers, Aqua's the obvious choice. Rather than reading through and cleaning up a rough transcript later, you get a real-time, human-like collaboration that's more accurate and intuitive than what Wispr Flow, Rev, Google, or Amazon currently offer.
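The "no invisible ink" behaviour boils down to showing partial hypotheses immediately and letting later, better hypotheses overwrite them on screen. Here is a toy illustration of that display loop; the partial transcripts are hard-coded, whereas a real system would stream them from a recognizer.

```python
# Toy illustration of real-time display: print each partial hypothesis as it
# arrives and let later ones revise what is already on screen. The partials
# are faked here; a streaming recognizer would supply them in practice.
import sys
import time

partials = [
    "lets meet on",
    "Let's meet on Tuesday",
    "Let's meet on Tuesday at noon",
    "Let's meet on Thursday at noon",  # revised once "change Tuesday..." is heard
]

shown = ""
for hyp in partials:
    # Erase the previously shown text, then print the newer hypothesis.
    sys.stdout.write("\r" + " " * len(shown) + "\r" + hyp)
    sys.stdout.flush()
    shown = hyp
    time.sleep(0.5)
print()
```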
So you first released a web app and a Chrome extension, and now it's evolved into a desktop layer. What are the improvements you're focusing on over the next 6-12 months?
The desktop app's a big visible step, but it's also tied to major backend upgrades. One huge improvement relates to context handling. Most competitors do clean-up after the fact with async models, but we're real-time, so we needed a different approach. Now, we're leveraging a user dictionary and integrating context right into the decoder block at the acoustic level. Basically, each session can have its own custom acoustic model, so to speak; no one else does that.
Looking ahead, we think we can push this even further. Eventually, the more you use Aqua, the more we can tailor the entire inference stack, weights and all, to your preferences. It won't just be about changing prompts; we'll be customizing the models themselves to reflect individual user habits. We're still refining this internally, but our goal is to solve remaining edge cases and get even closer to giving you the feeling of a human scribe who "gets" you, not just a generic AI tool.
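Injecting the dictionary and session context at the acoustic level goes beyond anything in the open-source stack. The closest public approximation is Whisper's `initial_prompt`, which biases the decoder toward particular names and spellings; a sketch of that much cruder technique is below, with the vocabulary list being our own example.

```python
# Crude public approximation of per-session context: bias Whisper's decoder
# toward session-specific names and terms via initial_prompt. This is far
# shallower than conditioning the acoustic model itself, but shows the idea.
# Requires: pip install openai-whisper. "memo.wav" is a placeholder file.
import whisper

model = whisper.load_model("base")

session_terms = "Aqua Voice, Wispr Flow, Noe Valley, Finn Brown"  # example vocabulary

result = model.transcribe(
    "memo.wav",
    initial_prompt=f"Vocabulary for this session: {session_terms}.",
)
print(result["text"])
```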
When you say "acoustic model", what do you mean? How does that fit into Aqua Voice's pipeline, especially regarding user dictionaries and session context?
Whisper (OpenAI's model) is purely deep learning on a spectrogram image of the audio, directly predicting text tokens. It doesn't break things down into sounds or phonetic units; it's basically: "Here's the audio chunk, now guess the words." That's great for asynchronous transcription, but for real-time interaction, it's not as effective.
At Aqua, our real-time model takes a different approach. Instead of jumping straight from audio to text, we insert a phonetic step. We first identify the sounds, the acoustic features, and then map those sounds into words. This phonetic layer helps us handle real-time transcription more smoothly and gives us flexibility to integrate user-specific customization. For example, your personal dictionary and the context of your current session can modify how the model decodes sounds into words. If you're dictating a paper that cites a particular author, Aqua can "acoustically" learn that author's name, so when you say it again, it transcribes it correctly on the first try. This isn't a post-hoc correction; it's integrated into the acoustic-to-text pipeline itself, ensuring more accurate, context-aware output right from the start.
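To see why an explicit phonetic step makes per-session vocabulary easy, here is a toy decoder that maps phoneme sequences to words with a greedy longest-match lookup, and picks up a new author's name just by adding one lexicon entry. The phoneme spellings and names are invented for the example; Aqua's real models are learned, not table lookups.

```python
# Toy illustration: once recognition goes sounds -> words, adding a word is
# just a new lexicon entry, so a session can "acoustically" learn a name.
# Phoneme spellings and names here are invented for the example.

LEXICON = {
    ("K", "AY", "T"): "cite",
    ("DH", "AH"): "the",
    ("P", "EY", "P", "ER"): "paper",
    ("B", "AY"): "by",
}

def decode(phonemes, lexicon):
    """Greedy longest-match mapping from a phoneme sequence to words."""
    words, i = [], 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):  # try longest span first
            chunk = tuple(phonemes[i:i + length])
            if chunk in lexicon:
                words.append(lexicon[chunk])
                i += length
                break
        else:
            words.append("<unk>")  # sound sequence not in the lexicon yet
            i += 1
    return " ".join(words)

utterance = ["K", "AY", "T", "DH", "AH", "P", "EY", "P", "ER", "B", "AY",
             "ZH", "IY", "ZH", "EH", "K"]

print(decode(utterance, LEXICON))                  # unknown author -> <unk> tokens
LEXICON[("ZH", "IY", "ZH", "EH", "K")] = "Žižek"   # one session-specific entry
print(decode(utterance, LEXICON))                  # now decoded on the first try
```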
Looking ahead, what's the long-term vision for Aqua Voice? Where does all of this go?
Our end game is to become the universal voice input layer for everything. We don't envision people just tossing Whisper on top of their apps and calling it a day. Instead, we want Aqua Voice to be the go-to solution for any scenario where you need to put words into a system, whether you're writing prose, interacting with a coding agent, or using some advanced interface. Over time, we'll build enough context and personalization that our voice layer "knows" how you speak (your pronunciations, your style) and can adapt seamlessly. That's why you need a universal voice layer like Aqua, instead of every app just having an independent Whisper bolted on top.
In other words, we're aiming for a future where Aqua Voice is integrated into all your devices, all your AI agents, and all your apps. Instead of a clunky add-on, we'll be a deeply personalized, ever-present voice interface that makes talking to technology feel as natural and rich as talking to another human.
Before we wrap up, could you tell us about the team behind Aqua Voice: how many of you are there, what's the culture like, and if you're hiring, what kind of people would you be looking for?
Right now, it's just three of us living and working together in a hacker house in Noe Valley. We put in long hours, have intense debates, and it's very much a full-contact "hothouse of rare and choice plants" kind of environment. We're not actively hiring at this exact moment, but that could change in the future.
If and when we do hire, we're looking for people who genuinely care about voice. It's not just another way to make money; it's something we're all personally invested in. Maybe you need voice because typing is tough for you, or you love the idea of dictating your next novel while walking through the woods, or you dream of commanding technology like Captain Kirk on a starship bridge. Whatever the reason, if voice matters deeply to you, that's the kind of personal motivation we want. It's less about a specific skill set and more about sharing our passion for making voice input as natural and powerful as it can be.
Great! Anything else you'd like people to know about Aqua or AI dictation?
Just that writing is an iterative process; it's never one-and-done. While we've made huge strides, I don't think anyone, including us (though we're closer than most), has fully cracked the code on iterative voice editing. To truly become a universal voice input layer, you need to replicate what people do with a keyboard: backspace, select text, and revise naturally. Doing all that seamlessly by voice is still an unsolved challenge, but it's an essential goal we're working toward.
Conclusion
To stay up to date on the latest with Aqua Voice, learn more about them here.
Read our past few Deep Dives below:
If you would like us to "Deep Dive" a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.