Aqua Voice (YC W24) wants you to never type again 🗣️
Plus: CEO Finn on what's next for voice-driven technology...
CV Deep Dive
Today, we're talking with Finn Brown, founder and CEO of Aqua Voice (YC W24).
Aqua Voice aims to redefine how people interact with technology through voice, using a modular system of transcription, language, and routing models. By orchestrating tools like Whisper and custom phonetic models, Aqua delivers dictation that feels more like working with a human scribe: accurate, real-time, and capable of handling complex edits on the fly.
Stemming from his own trial and error with existing dictation tools as a dyslexic student, Finn has built Aqua Voice to make voice input intuitive, contextual, and as natural as speaking to another person.
Key Takeaways:
Real-Time Voice Dictation: Unlike competitors that rely on asynchronous transcription, Aqua Voice updates text live, allowing users to correct errors and issue commands in real-time.
Tailored AI Pipeline: Aqua integrates transcription, phonetics, user dictionaries, and session-specific context for highly personalized and accurate voice input.
Visionary Goal: Aqua Voice is building toward a future as the universal voice input layer for all devices and applications.
Currently, the team operates out of a hacker house in San Francisco, living and working together as they refine their product and prepare to tackle broader use cases.
In this conversation, Finn shares how Aqua Voice's orchestrated-model approach beats competitors, why real-time interaction is key, and what's next for voice-driven technology.
Let's dive in ⚡️
Read time: 8 mins
Our Chat with Finn 💬
Finn, could you introduce yourself and give us a bit of background on what led you to found Aqua Voice?
Absolutely. I'm Finn, and I've always been kind of an "ideas person" who loves books, philosophy, and exploring different concepts. I'm also dyslexic, which played a big role in shaping my interest in voice technology. Back in sixth grade, when I was about 11 or 12, I started using Dragon Dictation to write my papers because typing them out was a nightmare with all the spelling mistakes. It was clunky and not very intuitive. I had to learn all these weird commands, and then go back and check every single word. It never felt natural, like talking to a real person or a good scribe.
That experience stuck with me as I went through school. I dictated all my college papers, too, but the workflow never improved much. It felt like something needed to be done to make talking to your computer more like talking to a human, and less like wrangling complicated software.
I grew up in Boise, Idaho, spending a lot of time outdoors: hiking, skiing, all that. Eventually, I ended up at Harvard, which was a big change. I mostly studied philosophy there, reading Russian novels and diving into old, somewhat forgotten philosophers. I only really started coding junior year, more on a whim than anything else. But that combination of my personal struggle with voice workflows, my philosophical love for exploring ideas, and eventually picking up coding led me to found Aqua Voice. I wanted to build something that felt as natural as talking to another person, because if I wanted it for myself, I figured a lot of other people might want it, too.
For those unfamiliar, could you give us a short explanation of what Aqua Voice is?
Aqua Voice is dictation that truly understands human commands, letting you fully control text editing with natural speech. It's like having someone at the keyboard who "gets" what you mean, whether it's clarifying a spelling on the fly or rephrasing a sentence, so talking to your computer feels more like having a real human scribe at your side, accurately capturing your words and intent.
INTRODUCING: Aqua Voice Desktop.
A universal voice layer on your desktop - that lets you dictate using natural language.
- Aqua Voice (@aquavoice_)
3:28 PM • Dec 11, 2024
You've been dealing with dictation since sixth grade, waiting for a better solution. When did you realize you could finally solve that problem yourself?
The pivotal moment was reading the Whisper paper. When Whisper came out, I dug in and realized they weren't just scaling up speech models, but also starting to show these emergent behaviors. For instance, Whisper would often remove "ums" and "uhs" automatically. I saw that as a sign that the decoder block, essentially a language model in its own right, was getting more contextually aware. It wasn't just brute-force transcription anymore; it understood patterns and could clean things up.
Before that, I'd been stuck with Dragon Dictation and various cloud services (Google, Amazon, Rev), constantly benchmarking them with my own Python scripts, hoping one would outperform Dragon. But none ever really stood out. They all made too many mistakes. Whisper changed that because it hinted that with a model this good as a foundation, we could build dictation that not only got the words right, but understood the context and intent behind them. That's when I realized this wasn't just a pipe dream anymore.
You mentioned Whisper and these various models you're using under the hood. Could you walk us through how Aqua Voice leverages multiple tools, like Whisper and LLMs, to make the magic happen?
We basically take an all-of-the-above approach. We've orchestrated a bunch of different components (Whisper or Whisper-like models, several LLMs, and a router model) to figure out what works best given the user's context and what they're trying to do. We don't rely on just vanilla Whisper or a single model. Instead, we string together multiple pieces so that, to the user, it all feels seamless.
For instance, Whisper is asynchronous and processes audio in chunks. You can tweak it to seem real-time, but it's not naturally suited to that. We also use real-time models, plus a few different language models. We have a router model that decides which tool to use at any given time, based on the context and what the user's saying.
I won't go into every engineering detail, but the main idea is we don't just feed everything into one big fused model and hope for the best. Sometimes raw audio understanding is helpful, sometimes it's not. Pure transcription might be enough most of the time, and going beyond that, like integrating audio cues directly into a language model, can be expensive, slow, or just unnecessary.
In short, we blend specialized models for transcription, reasoning, and routing. This lets us be tactical: we get maximum accuracy and reliability without wasting resources or complicating the entire pipeline. It's not just throwing everything into one monolithic system; we're more strategic, ensuring every piece does what it's best at.
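To make the routing idea concrete, here is a minimal sketch of what such a dispatcher could look like. It is purely illustrative: the prompt, the `gpt-4o-mini` model name, and the two labels are stand-ins we chose for the example, not Aqua's actual internals.

```python
# A minimal sketch of the "router" idea: a small model decides whether an
# utterance is ordinary dictation or a command about text already written.
# Prompt, model name, and labels are illustrative assumptions, not Aqua's.
from openai import OpenAI

client = OpenAI()

def route(utterance: str) -> str:
    """Return 'dictate' or 'edit' for one transcribed utterance."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for a dedicated router model
        messages=[
            {"role": "system",
             "content": ("Classify the utterance as 'dictate' (speech to write "
                         "down verbatim) or 'edit' (an instruction about existing "
                         "text). Reply with exactly one word.")},
            {"role": "user", "content": utterance},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

print(route("The meeting is on Tuesday at noon."))     # -> dictate
print(route("Actually, change Tuesday to Thursday."))  # -> edit
```

In a real system the router would presumably also see the current document and recent audio, but even this text-only toy shows the payoff: a cheap classification step keeps the heavier models out of the common path.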
What's a surprising technical observation you've made while building Aqua Voice?
A year ago, people thought a single huge multimodal model would outperform specialized tools at everything, including transcription. But in practice, adding audio into a big language model often isn't as good as using a top-tier transcription model first. We discovered this too, so we're selective about when we use which component. Sometimes the user's intention isn't clear from text alone, so we leverage the audio embeddings or hints from the Whisper encoder's output, giving the language model a richer input than just text but without fully merging everything into one giant model.
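The open-source whisper package exposes its encoder directly, so one crude way to experiment with this kind of "richer than text" input is to pull out the encoder's audio features alongside the transcript. How Aqua actually fuses audio hints with its language models isn't public; the file name and mean-pooling choice below are our own.

```python
# Sketch: extract Whisper's encoder features alongside the transcript, so a
# downstream model could see audio-derived hints, not just text.
# Requires: pip install openai-whisper torch. "clip.wav" is a placeholder file.
import torch
import whisper

model = whisper.load_model("base")

audio = whisper.load_audio("clip.wav")          # 16 kHz mono waveform
audio = whisper.pad_or_trim(audio)              # pad/trim to 30 seconds
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    features = model.encoder(mel.unsqueeze(0))  # (1, n_frames, d_model)

text = model.transcribe("clip.wav")["text"]     # plain transcription
audio_hint = features.mean(dim=1)               # one pooled vector per clip

# A language model could consume `text` plus `audio_hint` (projected into its
# embedding space) instead of text alone, without fusing the two systems.
print(text, audio_hint.shape)
```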
You've mentioned how Aqua Voice is like talking to your own personal scribe, with natural language corrections. So it uses async transcription (like Whisper), multiple language models, and a routing model to understand user intent and even fix mistakes on the fly. Can you explain how these pieces fit together?
So under the hood, we've orchestrated a combination of different models, each good at different tasks. Whisper (or a Whisper-like system) handles the raw transcription from audio chunks, while separate language models step in for understanding context, making transformations, and interpreting commands, like when you say "delete everything after" a certain point. Then there's a router model that quickly decides which tool to use based on what the user's doing at that moment.
For example, if we sense the user is giving a command like "Actually, scratch that, go back and fix this part," we might hand that request off to a model known to be reliable at editing. If the user's just talking normally, a straightforward transcription might be all we need. But if we detect subtle audio cues or something that's not clear from text alone, we might bring in additional context or a "fusion" approach, where we integrate extra hints from the audio, to improve accuracy. All this happens behind the scenes. The user just sees the words adjusting as they speak, even correcting initial misunderstandings once there's enough context to realize the first guess was off.
It's not just about one big model doing everything. Instead, we're using specialized models for specific jobs, and a router to pick the right approach on the fly. This modular system lets the text on the screen constantly refine itself, self-heal errors, and handle complex voice commands, all without the user having to manually prompt corrections.
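For the editing side, the simplest public analogue is to hand the current text and the spoken instruction to a language model and ask for the revised text back. The function, prompt, and model below are our illustration of that step, not Aqua's actual edit model.

```python
# Sketch: apply a spoken edit command ("delete everything after X",
# "change Tuesday to Thursday") to the live buffer with an LLM rewrite.
# Prompt and model name are illustrative, not Aqua's.
from openai import OpenAI

client = OpenAI()

def apply_edit(buffer: str, command: str) -> str:
    """Return the buffer rewritten according to a spoken instruction."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": ("You are a text editor. Apply the user's instruction to "
                         "the document and return only the full revised document.")},
            {"role": "user",
             "content": f"Document:\n{buffer}\n\nInstruction: {command}"},
        ],
    )
    return resp.choices[0].message.content

doc = "Let's meet Tuesday at noon to review the draft."
print(apply_edit(doc, "Change Tuesday to Thursday."))
# Expected: "Let's meet Thursday at noon to review the draft."
```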
Some people might wonder why they need voice dictation if they already type well. What are the most effective use cases you've seen?
Our main unlocked scenario is what I'd call "stream-of-consciousness writing." Traditionally, dictation systems required you to think through exactly what you wanted to say before you said it. It was like reciting lines rather than freely expressing yourself. But with Aqua, you can throw out partial sentences and incomplete ideas, just talk it out, and the system makes it sound intentional. It's more like having a conversation than dictating a memo.
We see two main groups benefiting. First, people who need dictation for physical reasons, maybe an injury, or because they're on the move without a keyboard; this lets them get their thoughts down without pre-planning every word. It's less mentally draining since they don't have to "compress" their ideas into a perfect sentence before speaking. Second, people who can type just fine but find it easier and more natural to speak out their ideas and refine them verbally. For them, talking something through, even if it's messy at first, helps them arrive at the right phrasing more organically.
The dictation space seems to be heating up. Can you give us an idea of the competitive landscape, and how Aqua Voice stands out?
Aqua's key distinction is that we operate fully in real-time and handle complex edits smoothly, unlike many competitors. The elephant in the room is Flow from Wispr, which entered the space after us. I think a lot of these async competitors are essentially a basic Whisper + LLM pipeline and only show you a finalized output after you've finished speaking. It's like using a pen that writes in "invisible ink" until you've ended your paragraph: by the time you see it, any inaccuracies or needed edits force you to backtrack and fix everything at once.
With Aqua, there's no "invisible ink." We update text as you speak, and if you say, "Actually, scratch that" or "Change Tuesday to Thursday," we can handle it on the fly. We're not just doing simple transcription; we're orchestrating multiple models and a router to pick the right tool at the right time, ensuring accuracy and responsiveness. We invest more compute and complexity under the hood so you don't have to settle for rough first drafts.
For anyone writing important content, be it technical documentation, investor updates, or academic papers, Aqua's the obvious choice. Rather than reading through and cleaning up a rough transcript later, you get a real-time, human-like collaboration that's more accurate and intuitive than what Wispr Flow, Rev, Google, or Amazon currently offer.
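The "no invisible ink" behaviour boils down to showing partial hypotheses immediately and letting later, better hypotheses overwrite them on screen. Here is a toy illustration of that display loop; the partial transcripts are hard-coded, whereas a real system would stream them from a recognizer.

```python
# Toy illustration of real-time display: print each partial hypothesis as it
# arrives and let later ones revise what is already on screen. The partials
# are faked here; a streaming recognizer would supply them in practice.
import sys
import time

partials = [
    "lets meet on",
    "Let's meet on Tuesday",
    "Let's meet on Tuesday at noon",
    "Let's meet on Thursday at noon",  # revised once "change Tuesday..." is heard
]

shown = ""
for hyp in partials:
    # Erase the previously shown text, then print the newer hypothesis.
    sys.stdout.write("\r" + " " * len(shown) + "\r" + hyp)
    sys.stdout.flush()
    shown = hyp
    time.sleep(0.5)
print()
```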
So you first released a web app and a Chrome extension, and now it's evolved into a desktop layer. What are the improvements you're focusing on over the next 6-12 months?
The desktop app's a big visible step, but it's also tied to major backend upgrades. One huge improvement relates to context handling. Most competitors do clean-up after the fact with async models, but we're real-time, so we needed a different approach. Now, we're leveraging a user dictionary and integrating context right into the decoder block at the acoustic level. Basically, each session can have its own custom acoustic model, so to speak; no one else does that.
Looking ahead, we think we can push this even further. Eventually, the more you use Aqua, the more we can tailor the entire inference stack, weights and all, to your preferences. It won't just be about changing prompts; we'll be customizing the models themselves to reflect individual user habits. We're still refining this internally, but our goal is to solve remaining edge cases and get even closer to giving you the feeling of a human scribe who "gets" you, not just a generic AI tool.
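Injecting the dictionary and session context at the acoustic level goes beyond anything in the open-source stack. The closest public approximation is Whisper's `initial_prompt`, which biases the decoder toward particular names and spellings; a sketch of that much cruder technique is below, with the vocabulary list being our own example.

```python
# Crude public approximation of per-session context: bias Whisper's decoder
# toward session-specific names and terms via initial_prompt. This is far
# shallower than conditioning the acoustic model itself, but shows the idea.
# Requires: pip install openai-whisper. "memo.wav" is a placeholder file.
import whisper

model = whisper.load_model("base")

session_terms = "Aqua Voice, Wispr Flow, Noe Valley, Finn Brown"  # example vocabulary

result = model.transcribe(
    "memo.wav",
    initial_prompt=f"Vocabulary for this session: {session_terms}.",
)
print(result["text"])
```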
When you say "acoustic model", what do you mean? How does that fit into Aqua Voice's pipeline, especially regarding user dictionaries and session context?
Whisper (OpenAI's model) is purely deep learning on a spectrogram image of the audio, directly predicting text tokens. It doesn't break things down into sounds or phonetic units; it's basically: "Here's the audio chunk, now guess the words." That's great for asynchronous transcription, but for real-time interaction, it's not as effective.
At Aqua, our real-time model takes a different approach. Instead of jumping straight from audio to text, we insert a phonetic step. We first identify the sounds, the acoustic features, and then map those sounds into words. This phonetic layer helps us handle real-time transcription more smoothly and gives us flexibility to integrate user-specific customization. For example, your personal dictionary and the context of your current session can modify how the model decodes sounds into words. If you're dictating a paper that cites a particular author, Aqua can "acoustically" learn that author's name, so when you say it again, it transcribes it correctly on the first try. This isn't a post-hoc correction; it's integrated into the acoustic-to-text pipeline itself, ensuring more accurate, context-aware output right from the start.
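To see why an explicit phonetic step makes per-session vocabulary easy, here is a toy decoder that maps phoneme sequences to words with a greedy longest-match lookup, and picks up a new author's name just by adding one lexicon entry. The phoneme spellings and names are invented for the example; Aqua's real models are learned, not table lookups.

```python
# Toy illustration: once recognition goes sounds -> words, adding a word is
# just a new lexicon entry, so a session can "acoustically" learn a name.
# Phoneme spellings and names here are invented for the example.

LEXICON = {
    ("K", "AY", "T"): "cite",
    ("DH", "AH"): "the",
    ("P", "EY", "P", "ER"): "paper",
    ("B", "AY"): "by",
}

def decode(phonemes, lexicon):
    """Greedy longest-match mapping from a phoneme sequence to words."""
    words, i = [], 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):  # try longest span first
            chunk = tuple(phonemes[i:i + length])
            if chunk in lexicon:
                words.append(lexicon[chunk])
                i += length
                break
        else:
            words.append("<unk>")  # sound sequence not in the lexicon yet
            i += 1
    return " ".join(words)

utterance = ["K", "AY", "T", "DH", "AH", "P", "EY", "P", "ER", "B", "AY",
             "ZH", "IY", "ZH", "EH", "K"]

print(decode(utterance, LEXICON))                  # unknown author -> <unk> tokens
LEXICON[("ZH", "IY", "ZH", "EH", "K")] = "Žižek"   # one session-specific entry
print(decode(utterance, LEXICON))                  # now decoded on the first try
```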
Looking ahead, what's the long-term vision for Aqua Voice? Where does all of this go?
Our end game is to become the universal voice input layer for everything. We don't envision people just tossing Whisper on top of their apps and calling it a day. Instead, we want Aqua Voice to be the go-to solution for any scenario where you need to put words into a system, whether you're writing prose, interacting with a coding agent, or using some advanced interface. Over time, we'll build enough context and personalization that our voice layer "knows" how you speak (your pronunciations, your style) and can adapt seamlessly. That's why you need a universal voice layer like Aqua, instead of every app just having an independent Whisper bolted on top.
In other words, we're aiming for a future where Aqua Voice is integrated into all your devices, all your AI agents, and all your apps. Instead of a clunky add-on, we'll be a deeply personalized, ever-present voice interface that makes talking to technology feel as natural and rich as talking to another human.
Before we wrap up, could you tell us about the team behind Aqua Voice: how many of you are there, what's the culture like, and if you're hiring, what kind of people would you be looking for?
Right now, it's just three of us living and working together in a hacker house in Noe Valley. We put in long hours, have intense debates, and it's very much a full-contact "hothouse of rare and choice plants" kind of environment. We're not actively hiring at this exact moment, but that could change in the future.
If and when we do hire, we're looking for people who genuinely care about voice. It's not just another way to make money; it's something we're all personally invested in. Maybe you need voice because typing is tough for you, or you love the idea of dictating your next novel while walking through the woods, or you dream of commanding technology like Captain Kirk on a starship bridge. Whatever the reason, if voice matters deeply to you, that's the kind of personal motivation we want. It's less about a specific skill set and more about sharing our passion for making voice input as natural and powerful as it can be.
Great! Anything else you'd like people to know about Aqua or AI dictation?
Just that writing is an iterative process; it's never one-and-done. While we've made huge strides, I don't think anyone, including us (though we're closer than most), has fully cracked the code on iterative voice editing. To truly become a universal voice input layer, you need to replicate what people do with a keyboard: backspace, select text, and revise naturally. Doing all that seamlessly by voice is still an unsolved challenge, but it's an essential goal we're working toward.
Conclusion
To stay up to date on the latest with Aqua Voice, learn more about them here.
Read our past few Deep Dives below:
If you would like us to "Deep Dive" a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.