Show HN: Aqua Voice 2 – Fast Voice Input for Mac and Windows

89 comments

April 9, 2025

Hey HN - It’s Finn and Jack from Aqua Voice (https://withaqua.com). Aqua is fast AI dictation for your desktop and our attempt to make voice a first-class input method.

Video: https://withaqua.com/watch

Try it here: https://withaqua.com/sandbox

Finn is uber dyslexic and has been using dictation software since sixth grade. For over a decade, he’s been chasing a dream that never quite worked — using your voice instead of a keyboard.

Our last post (https://news.ycombinator.com/item?id=39828686) about this seemed to resonate with the community - though it turned out that version of Aqua was a better demo than product. But it gave us (and others) a lot of good ideas about what should come next.

Since then, we’ve remade Aqua from scratch for speed and usability. It now lives on your desktop, and it lets you talk into any text field -- Cursor, Gmail, Slack, even your terminal.

It starts up in under 50ms, inserts text in about a second (sometimes as fast as 450ms), and has state-of-the-art accuracy. It does a lot more, but that’s the core. We’d love your feedback — and if you’ve got ideas for what voice should do next, let’s hear them!

idk1

I’ve been using this for some time and I have to say it is fantastic. I’m intentionally writing this by hand rather than with Aqua, and it is taking so much longer. This, to me, feels like what Apple Intelligence could be; it is so much better than the stuff all of the big tech companies are doing. For example, if you tell Siri voice dictation to go back and delete something, all Siri will do is write out “go back and delete something”. Likewise, if you tell Siri to go back and spell a name differently, all Siri will do is write out the letters you said. Honestly, for voice dictation software it feels like travelling to another planet in terms of improvement.

niel

Real-time text output à la Apple Dictation with the accuracy of Whisper is something I've been looking for recently - I'll definitely give Aqua a spin.

MacWhisper [0] (the app I settled on) is conspicuously missing from your benchmarks [1]. How does it compare?

[0]: https://goodsnooze.gumroad.com/l/macwhisper

[1]: https://withaqua.com/blog/benchmark-nov-2024

the_king

We're more accurate and much faster than MacWhisper, even its strongest model (whisper.cpp Large V3).

For that benchmarking table, you can use Whisper Large V3 as a stand-in for MacWhisper and Superwhisper accuracy.

pbowyer

I've been using Superwhisper for over a year and I get nothing like the error level your comparison table suggests. Which model were you using with it?

Aqua looks good and I will be testing it, but I do like that with superwhisper nothing leaves my computer unless I add AI integrations.

aylmao

This is super impressive, great job!

Side-comment of something this made me think of (again): tech builds too much for tech. I've lived in the Bay before, so I know why this happens. When you're there, everyone around you is in tech, your girlfriend is in tech, you go to parties and everyone invariably ends up talking about work, which is tech. Your frustrations are with tech tools and so are your peers', so you're constantly thinking about tech solutions applicable to tech's problems.

This seems very much marketed to SF people doing SF things ("Cursor, Gmail, Slack, even your terminal"). I wonder how much effort has gone into making this work with code editors or the terminal, even though I doubt this would be a big use case for this software if it ever became generally popular. I'd imagine the market here is much larger in education, journalism, film, accessibility, even government. Those are much more exciting demos.

the_king

thanks!

I share the same sentiment. I remember thinking in college how annoying it was that I was reading low-resolution, marked-up, skewed, b&w scans of a book using Adobe Acrobat while CS concentrators were doing everything in VS Code (then brand new).

But we do think voice is actually great with Cursor. It’s also really useful in the terminal for certain things -- checking out or creating branches, for example.

fxtentacle

This looks like it'll slurp up all your data and upload it into a cloud. Thanks, no. I want privacy, offline mode and source code for something as crucial to system security as an input method.

"we also collect and process your voice inputs [..] We leverage this data for improvements and development [..] Sharing of your information [..] service providers [..] OpenAI" https://withaqua.com/privacy

FloatArtifact

Local inference only is an absolute requirement. It's not really all that accessible if it's online-only. I can say this as someone who has logged over 20,000 hours of voice dictation and computer control.

canada_dry

First thing I looked for and read: the FAQ.

No mention of privacy (or on prem) - so assumed it's 100% cloud.

Non-starter for me. Accuracy is important, but privacy is more so.

Hopefully a service with these capabilities will emerge where the user first completes a brief training session, sends it to the cloud to tailor the recognition parameters for their voice and mannerisms... then loads the result locally.

oulipo

A similar but offline tool is VoiceInk, it's also open-source so you can extend it

pokstad

This should be on the FAQ. I was trying to find out if it was 100% processed locally.

jmcintire1

fair point. offline + local would be ideal, but as it stands we can't run ASR and an LLM locally at the speed required to provide the level of service we want.

given that we need the cloud, we offer zero data retention -- you can see this in the app. your concern is as much about UX and communications as it is privacy

fxtentacle

The problem, if you actually need the cloud, is that it kind of destroys your business model. OpenAI is bleeding money every month because they massively subsidize the hosting cost of their models. But eventually they will have to post a profit. And if they know that your product is completely dependent on their API, they can milk you until there's no profit left for you.

And self-hosting real-time streaming LLMs will probably also come out at around 50 cents per hour. Arguing for a $120/month price for power users is probably going to be very difficult, especially if there are free open-source alternatives.

mrtesthah

MacWhisper does realtime system-wide dictation on your local machine (among other things). Just a one-time fee for an app you download -- the way shareware is supposed to be. Of course it doesn't use MoE transcription with 6 models like Aqua Voice, but if you guys expect to be acquired by Apple (that is your exit strategy, right?), you're going to need better guarantees of privacy than "we don't log".

shinycode

I downloaded the turbo Whisper model optimized for Mac and created a Python script that gets the mic input and pastes the result. The script is LLM-generated and works with a push-to-talk key. That's 80% of the functionality, for free and done locally.
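For anyone curious, a minimal sketch of that kind of push-to-talk script. This is hypothetical, not shinycode's actual code: the F8 hotkey, clipboard hand-off, and the third-party packages (openai-whisper, sounddevice, pynput, pyperclip) are all illustrative assumptions.

```python
# Hypothetical push-to-talk dictation sketch: hold F8 to record from the
# mic, release to transcribe locally with Whisper and copy the text to
# the clipboard.
import io
import wave

SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio


def pcm16_to_wav(pcm: bytes, rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw 16-bit mono PCM samples in a WAV container (stdlib only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm)
    return buf.getvalue()


def run():
    # Third-party deps: pip install openai-whisper sounddevice pynput pyperclip
    import tempfile
    import sounddevice as sd
    import whisper
    import pyperclip
    from pynput import keyboard

    model = whisper.load_model("turbo")  # "turbo" = large-v3-turbo
    frames: list[bytes] = []
    stream = sd.RawInputStream(
        samplerate=SAMPLE_RATE, channels=1, dtype="int16",
        callback=lambda data, n, t, s: frames.append(bytes(data)),
    )

    def on_press(key):
        if key == keyboard.Key.f8:
            frames.clear()
            stream.start()

    def on_release(key):
        if key == keyboard.Key.f8:
            stream.stop()
            with tempfile.NamedTemporaryFile(suffix=".wav") as f:
                f.write(pcm16_to_wav(b"".join(frames)))
                f.flush()
                text = model.transcribe(f.name)["text"].strip()
            pyperclip.copy(text)  # then paste with Cmd+V

    with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
        listener.join()
```

Copying to the clipboard rather than synthesizing keystrokes sidesteps most of the accessibility-permission headaches on macOS, at the cost of one manual Cmd+V.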

toddmorey

And man, it's another monthly subscription. I'm not mad at them for finding a gap in the market and building a business around it. I'm mad at Apple for leaving that gap... hopefully built-in voice dictation improves quickly.

FireBeyond

Is there a gap in the market? It's being rapidly filled with the likes of MacWhisper, etc., which offer local-only, one-off pricing.

pablopeniche

"hopefully built in voice dictation improves quickly." I would not hold my breath on that one lol

jackthetab

Agreed.

This is where I bounce (out of this discussion).

thmsmlr

I totally agree, I created BetterDictation (.com) exactly because of that. Offline was a super important requirement for me.

jrvarela56

Feedback: I use MacWhisper, and the tiny WhisperKit model (English-only) is way faster than any cloud service on my M1 MacBook Pro.

I’d say local is necessary for a delightful product experience, and the added bonus is that it ticks the privacy box.

brianjking

How much RAM is in your M1?

jrvarela56

16gb

marcogarces

Peasant! (just joking); mine is 96GB!

alxlu

I’ve been using this for a while now and I really enjoy it. I ran into a semi-obscure bug and emailed them and they basically fixed it the same day.

I do wish there was a mobile app though (or maybe an iOS keyboard). It would also be nice to be able to have a separate hotkey you can set up to send the output to a specific app (instead of just the active one).

the_king

thanks! We're working on iOS, but it's tough to get the ergos right given all of Apple's restrictions and neglected APIs.

polishdude20

Android app please!

rkagerer

You mentioned it "lives on your desktop". How does licensing work, and can you install and use it on a machine without internet access?

rickydroll

I've been using Aqua since it was announced on HN. I've survived the teething pains by using a mixture of Aqua and Dragon, depending on what I was doing. With this new Windows app, I've given up using Dragon for anything.

Things I've learned are:

1. It works better if you're connected by Ethernet than by Wi-Fi.

2. It needs to have a longer recognition history because sometimes you hit the wrong key to end a recognition session, and it loses everything.

3. Besides the longer history, a debugging mode that records all the characters sent to the dictation box would be useful. Sometimes I see one set of words, blink, and then it's replaced with a new recognition result. Capturing that would be useful for describing what went wrong.

4. There should be a way to tell us when a new version is running. Occasionally, I've run into problems where I'm getting errors, and I can't tell if it's my speaking, my audio chain, my computer, the network, or the app.

5. Grammarly is a great add-on because it helps me correct mis-speakings and odd little errors, like too many spaces caused by starting and stopping recognition.

When Dragon Systems went through bankruptcy court, a public benefits corporation bid for the core technology because it recognized that Dragon was a critical tool for people with disabilities to function in a digital world.

In my opinion, Aqua has reached a similar status as an essential tool. Well, it doesn't fully replace Dragon for those who need command and control (yet). The recognition accuracy and smoothness are so amazing that I can't envision returning to Dragon Systems without much pain. The only thing worse would be going back to a keyboard.

Aqua Guys, don't fuck it up.

replete

Product/UI looks good. Nice job. I would pay for a completely offline version of this; cloud voice data is a non-starter for me though, unfortunately.

voltaireodactyl

Check out MacWhisper which is one time payment and does this among many other things.

willwade

Your real market, the one you need to go hard on, is the assistive tech market. You know the biggest companies in this space are those solving problems for dyslexia, where government grants (e.g. in the UK) fund pretty much all their work? I had an Access to Work assessment and they recommend stuff from TextHelp like sweets. It’s then paid for by the government following these assessments. But it’s crap. It literally is a crap tool for ADHD or dyslexia, because these users literally CAN’T remember or deal with barriers like learning how to dictate correctly. Aqua Voice solves this. I’m your biggest fan. I recommend it in my AT assessments all the time :)

waveringana

yes, I really hope a lot of these ML startups check out the history of ML tech a bit more, because so many accessibility tools were built via ML but they've been abandoned

adamesque

I was very delighted by Aqua v1, which felt like magic at first.

But I’ve noticed/learned that I can’t dictate written content. My brain just does not work that way at all — as I write I am constantly pausing to think, to revise, etc and it feels like a completely different part of my brain is engaged. Everything I dictated with Aqua I had to throw away and rewrite.

Has anyone had similar problems, and if so, had any success retraining themselves toward dictation? There are fleeting moments where it truly feels like it would be much faster.

SCdF

I use my (work) computer entirely with my voice, and it takes a lot of effort to work out what to actually write and to not ramble. Like you I've found that it's better to throw out words in sort of half sentence chunks, to give your brain time to work out what the next chunk is.

It's very hard, and I wouldn't do it if I didn't have to.

(which is why I'm always perplexed by these apps which allow voice dictation or voice control, but not as a complete accessibility package. I wouldn't be using my voice if my hands worked!)

It's also critically important (and after 3-4 years of this I still regularly fail at this) to actually read what you've written, and edit it before send, because those chunks don't always line up into something that I'd consider acceptably coherent. Even for a one sentence slack message.

(also, I have a kiwi accent, and the dictation software I use is not always perfect at getting what I wanted to say on the page)

e12e

Curious about your current setup, and whether adding a macro/function to clean up input via an LLM would help?

In my experience, LLMs can be quite forgiving when given unfinished input and asked to expand/clean it up.
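A sketch of what such a cleanup macro could look like. This is entirely hypothetical: the prompt wording, the `gpt-4o-mini` model name, and the use of the openai client are assumptions, and any local chat model exposing the same message format would slot in the same way.

```python
# Hypothetical post-processing step for raw dictation output: build a
# cleanup prompt and hand it to an LLM.

CLEANUP_SYSTEM_PROMPT = (
    "You are a dictation post-processor. Fix punctuation, remove filler "
    "words and false starts, and apply spoken edits like 'strike that', "
    "but do not change the speaker's meaning or add new content."
)


def build_cleanup_messages(raw_transcript: str) -> list[dict]:
    """Pure helper: chat-style messages for the cleanup request."""
    return [
        {"role": "system", "content": CLEANUP_SYSTEM_PROMPT},
        {"role": "user", "content": raw_transcript},
    ]


def clean_up(raw_transcript: str) -> str:
    # Assumes `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=build_cleanup_messages(raw_transcript),
    )
    return resp.choices[0].message.content
```

Keeping the prompt construction separate from the API call makes it easy to swap the cloud model for a local one, which matters for the privacy concerns raised elsewhere in this thread.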

noahjk

Same here. My two biggest hurdles are:

1. like you mentioned, the second I start talking about something, I totally forget where I'm going, have to pause, it's like my thoughts aren't coming to me. Probably some sort of mental feedback loop plus, like you mentioned, different method of thinking.

2. in the back of my mind, I'm always self-conscious that someone is listening, so it's a privacy / being judged / being overheard feeling which adds a layer of mental feedback.

There also aren't great audio cues for handling on-the-fly editing. I've tried saying "parentheses word parentheses" and it just gets written out. I've tried saying "strike that" and it gets written out. These interfaces are very 'happy path' and don't do a lot of processing (on iOS, I can say "period" and get a '.' (or '?', '!'), but that's about the extent of it).
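As a toy illustration of the missing processing: handling spoken commands like "period" and "strike that" is, at its simplest, token rewriting before insertion. This is a hypothetical sketch, not how any of the apps in this thread actually work.

```python
# Toy spoken-command post-processor: map trailing spoken tokens to
# punctuation and honor a simple "strike that" undo command.
SPOKEN_PUNCTUATION = {"period": ".", "comma": ",", "question mark": "?"}


def apply_spoken_commands(words: list[str]) -> str:
    out: list[str] = []
    i = 0
    while i < len(words):
        two = " ".join(words[i:i + 2]).lower()  # two-word commands first
        one = words[i].lower()
        if two == "strike that":
            if out:
                out.pop()  # drop the previous word
            i += 2
        elif two in SPOKEN_PUNCTUATION:         # e.g. "question mark"
            if out:
                out[-1] += SPOKEN_PUNCTUATION[two]
            i += 2
        elif one in SPOKEN_PUNCTUATION:         # e.g. "period", "comma"
            if out:
                out[-1] += SPOKEN_PUNCTUATION[one]
            i += 1
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

Real systems push this into the model itself (which is roughly what the LLM-backed tools here do), but even a rule table like this covers the iOS-style "happy path" cases.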

I have had some success with long-form recording sessions which are transcribed afterwards. After getting over the short initial hump, I can brain-dump to the recording, and then trust an app like Voice Notes or Superwhisper to transcribe, and then clean up after.

The main issue I run into there, though, is that I either forget to record something (ex. a conversation that I want to review later) or there is too much friction / I don't record often enough to launch it quickly or even remember to use that workflow.

I get the same feeling with smart home stuff - it was awesome for a while to turn lights on and off with voice, but lately there's the added overhead of "did it hear me? do I need to repeat myself? What's the least amount of words I can say? Why can't I just think something into existence instead? Or have a perfect contextual interface on a physical device?"

the_king

I think Aqua v1 had two problems:

1. The models weren't ready.

2. The interactions were often strained. Not every edit/change is easy to articulate with your voice.

If 1 had been our only problem, we might have had a hit. In reality, I think optimizing model errors allowed us to ignore some fundamental awkwardness in the experience. We've tried to rectify this with v2 by putting less emphasis on streaming for every interaction and less emphasis on commands, replacing it with context.

Hopefully it can become a tool in the toolbox.

adamesque

Looking forward to giving it another try!

jmcintire1

Imo it's a question of the right tool for the right job, adjusted for differences between people. For me, the use case that made our product click was prompting Cursor while coding. Then I wanted to use it whenever I talked to ChatGPT -- it's much faster to talk and then read, and repeat.

Voice is great for whenever the limiting factor to thought is speed of typing.

cloogshicer

I'm exactly the same. Aqua is so incredible and I really tried to like it, but I just can't get my brain to think of what I want to say first, I have to pause to think constantly.

SCdF

I currently use Talon, which I note is not in your benchmarks.

I can't find any documentation on how Aqua works, or how it compares, so I'm not sure it's meant to be a replacement / competitor to Talon? What are you configuring? How are you telling it that you like "genz" style in Slack? Can I create custom configurations / macros?

One thing I like about Talon is that it's not magic. Which maybe is not what you're going for. But I am giving it explicit commands that I know it will understand (if it understands my accent, obvs), as opposed to guessing, constructing a vague human-language sentence, and hoping that an LLM will work it out. Which means it feels like something I can actually become fast with, and build up muscle memory of.

Also that it's completely offline, so I can actually run it on a work computer without my security folks freaking out.

the_king

We're building something different, but there is some overlap. Aqua is built for max speed, while keeping accuracy high. To achieve that, inference runs in a datacenter (for now).

You can customize Aqua using custom instructions, similar to ChatGPT custom instructions, and get some Talon functionality from it:

In my own custom instructions, I have:

1. Break paragraphs every three or four sentences.

2. Don't start a sentence with "and".

3. Use lowercase in Slack and iMessage.

4. Here are some common terminal commands...

willwade

Aqua voice is nothing like talon. I wouldn’t bother trying to compare. It’s a dictation tool. Just entry. Not commands. But it’s bloody impressive. You don’t need to learn anything - you just talk like you would talk to someone across the way from you

SCdF

Oh, from the video I got the impression it was more than that, based on it recognising app contexts and the like. I guess that's mostly just icing on the cake for the core dictation part.

pablopeniche

>recognizing app contexts

Users have different preferences on the text format they input into different apps. Aqua is able to pick up on these explicit and implicit preferences across apps – but no "open XYZ app" commands, yea