
Alterego: Thought to Text

88 comments

September 8, 2025

giveita

Very impressive, but the use case is narrow. You're on a train and don't have time to make a note, OK. But most of the time we're in places private enough that speech recognition is fine and would be convenient, maybe more so than learning to speak without speaking.

As a disability speech aid, though, maybe it would be amazing?

stevage

The great thing about a product like this is that it's so easy to fake in a video.

I don't really buy that typing speed is a bottleneck for most people. We can't actually think all that fast. And I suspect AI is doing a lot of filling in the gaps here.

It might have some niche use cases, like being able to use your phone while cycling.

Bjartr

Personal anecdote: I do find typing to be a bottleneck in situations where typing speed is valuable (so notes in meetings, not when coding).

I can break 100 wpm, especially if I accept typos. It's still much, much slower to type than I can think.

stevage

My experience with taking notes in meetings is definitely that the brain is the bottleneck, not the fingers. There are times when I literally just type what the person is saying, dictation style (i.e., recording a client's exact words, often helpful for reference, even later in the meeting). I can usually keep up. But if I'm trying to formulate original thoughts, or synthesise what I've heard, or find a way to summarise what they have been saying - that's where I fall behind, even though the total number of words I need to write is actually much smaller.

So this definitely wouldn't help me here. Realistically, though, there ought to be better solutions, like something that just listens to the meeting and takes notes automatically.

robofanatic

> notes in meetings

That’s already solved by AI, if you let AI listen to your meetings.

Feathercrown

I haven't found that to be very accurate. I suspect the internal idiosyncrasies of a company are an issue, as the AI doesn't have the necessary context.

j45

Speech to text can be 130-200 wpm.

Also, keybr.com helps speed up typing if you were thinking about it.

jimkleiber

Most people, I think, type very slowly on computers, and even more slowly on phones. I've had many, many people remark on how fast I type on both platforms, and it still confuses me; I think it's easy for me to overlook how slowly most people type.

dllthomas

Typing speed is very much a bottleneck when I'm washing dishes, at least.

prerok

How about when baking?

https://xkcd.com/341/

lennxa

talk then

noduerme

Plausible deniability is the ticket. I see the killer app here being communication with people in comas. Or corpses.

w00ds

It's possible the demo is faked, and I'm skeptical. But I also don't think speed is really the point of a device like this. Getting a device out, pressing keys or tapping on it, putting it away again - those attentional costs are the real friction. I know something like basic notetaking would feel really different to me if I could just do the thing in the demo at high accuracy instead. That's a big if, though: the accuracy would have to be high for it to really be useful, and the video is probably best-case clips.

com2kid

Pulling out my phone, unlocking it, opening my notes app, creating a new note, that is a bottleneck.

Pulling out my phone, unlocking it, remembering what the hotkey is today for starting Google/Gemini, is a bottleneck. Damned if I can remember what random gesture lets me ask Gemini to take a note today (presumably Gemini has notes support now; IIRC the original release didn't).

Finding where Google stashes todo items is also a bottleneck. Of course, that entails getting my phone out and navigating to whatever notes app they are shoved into (for a while, todos/notes lived inside a separate Google search app!).

My Palm Pilot from 2000 had more usability than a modern smartphone.

This device can solve all of those issues.

soulofmischief

> We currently have a working prototype that, after training with user-specific example data, demonstrates over 90% accuracy on an application-specific vocabulary. The system is currently user-dependent and requires individual training. We are currently working on iterations that would not require any personalization.

https://www.media.mit.edu/projects/alterego/frequently-asked...

andymatuschak

That text was written about the Media Lab-era prototype in 2019: https://web.archive.org/web/20190102110930/https://www.media...

I wonder how far they've gotten past it.

com2kid

I am surprised no one here has noted that a device like this almost completely negates the need for literacy. That is huge. Right now people still need to interact with written words, both typing and reading. Realistically, a quiet vocal-based input device like this could have a UX built around it that doesn't require users to be literate at all.

giveita

Hey Google make a note to pack my hiking boots. Done.

aDyslecticCrow

How convenient! Literacy has always been a thorn in the side of an efficient society, as books too easily spread dangerous heretical propaganda. Now we can directly filter the quality of information and increase cultural unification. /j

com2kid

That is my fear, yeah: a continued dumbing down of society.

Literacy rates in the US are already garbage, and this device may just make them worse. If people never have to read or write, why would they bother learning how?

jussaying2

Not to mention the support it brings for people with disabilities! (speech, hands/fingers)

aDyslecticCrow

Amazing for paralysis and other severe physical disabilities. Similar tech has been widely researched for years.

But I'm sceptical about this specific company, given the lack of technical details.

boznz

I spent all of last year writing a techno-thriller about mind-reading. I'm sure this is about as factual, and of course nothing nefarious could possibly happen if this ever became real.

wcrossbow

This is the stuff nightmares are made of. We already live in a "you have nothing to hide" society. Now imagine one where megacorps and the government have access to every thought you have. No worries, you've got nothing to hide, right? What would that do to our thought process and how we articulate our inner selves? What do we allow ourselves to even think? At some point it won't even matter, because we will have trained ourselves to suppress any deviant thought. I'd rather not keep going, because the ramifications of this technology make me truly sick to my stomach.

deekshith13

You probably thought about some nefarious stuff that could happen. Mind sharing some interesting examples?

balamatom

Well, for starters, there's the one where social consensus decides to define whether a subvocalization is "intentional" by whether the interface responded to it.

vunderba

From the article:

> Alterego only responds to intentional, silent speech.

What exactly do they mean by this? Some kind of equivalent to subvocalization [1]?

[1] https://en.wikipedia.org/wiki/Subvocalization

hyperadvanced

Oh god we’re about to have the “I don’t have an inner monologue” debate again, aren’t we?

balamatom

I got a whole inner panel discussion!

ipsum2

Yes. The paper the company is based on uses EMG (muscle activity) to convert silent speech into text.

synapsomorphy

Accuracy is going to be the real make-or-break for this. In a paper from 2018 they reported 92% word accuracy [1]. That's a lifetime ago for ML, but they were also using five facial electrodes, whereas now it looks confined to around the ears. If the accuracy were great today, they would report it. In actual use I can see even 99% being pretty annoying and 95% being almost unusable (for people who can speak normally). A quick back-of-the-envelope below shows why.

[1] https://www.media.mit.edu/publications/alterego-IUI/
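
Assuming independent word errors (my own simplification - real errors cluster, so this is optimistic), per-word accuracy compounds fast at the sentence level:

    # Chance a sentence comes out error-free at a given per-word accuracy.
    # Assumes independent errors -- optimistic, since real errors cluster.
    for acc in (0.92, 0.95, 0.99):
        for n_words in (10, 20):
            p_clean = acc ** n_words
            print(f"{acc:.0%} per word, {n_words} words: {p_clean:.0%} clean")

Even at 99% per-word accuracy, roughly one in five 20-word sentences still contains an error you'd have to go back and fix.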

ivape

Why do you say that? I often vocalize near-gibberish and the LLM fixes it for me, mostly getting what I meant.

pedalpete

I'd love to get a better understanding of the technology this is built with (without sitting through an exceedingly long video).

I suspect it's EMG through muscles near the ear and the jawbone, but that seems too rudimentary.

The TED talk describes a system that includes sensors on the chin, across the jawbone, but the demo has obviously dropped that sensor.

jackthetab

Thirteen minutes is an "exceedingly long video"?! Man, I thought I was jaded for complaining about 20-minute videos! :-)

What I want to know is: what are they connected to? A laptop? An AS/400? An old Cray they have lying around? I'd have thought doing the demo while walking would be de rigueur.

Anyway, très cool!

esafak

These guys were not born when Crays roamed the earth.

balamatom

Their investor had one in the garage that they didn't know what to do with.

ilaksh

Maybe they have combined an LLM with the speech-detection convolution layers (or whatever they were doing), similar to the way JSON schemas constrain the set of available tokens for structured outputs, except here the token set comes from the top 3-5 words that the first analysis network decided are most likely. With that smarter system they could get by with fewer electrodes, in a smaller area at the base of the skull where the cranial nerves for the face and tongue emerge from the brainstem.
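
A minimal sketch of that idea - everything here is invented (the candidate lists, the lm_score callable), since nothing about their actual pipeline is public. The EMG stage proposes a few candidate words per position, and a language model rescores the combinations:

    # Hypothetical two-stage decoder; all names are made up for illustration.
    # Brute-force enumeration is fine for short phrases and small k;
    # a real system would use beam search over the candidate lattice.
    from itertools import product

    def decode(emg_candidates, lm_score):
        # emg_candidates: per position, a list of (word, probability) pairs.
        # lm_score: callable returning a fluency score for a full sentence.
        best, best_score = None, float("-inf")
        for combo in product(*emg_candidates):
            sentence = " ".join(word for word, _ in combo)
            emg_evidence = sum(p for _, p in combo)
            score = emg_evidence + lm_score(sentence)
            if score > best_score:
                best, best_score = sentence, score
        return best

    candidates = [[("I", 0.9), ("eye", 0.1)],
                  [("need", 0.6), ("knead", 0.4)],
                  [("coffee", 0.8), ("toffee", 0.2)]]
    toy_lm = lambda s: 1.0 if s.startswith("I need") else 0.0
    print(decode(candidates, toy_lm))  # -> "I need coffee"

The design point is that the EMG classifier only has to narrow each position down to a handful of candidates; the language model supplies the context the electrodes can't.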

moezd

Great, now if I find myself in a weird dialogue and murmur under my breath, this company can store exactly what I called those people. They can also sell that data to whoever pays the most. Tremendous job, you guys, as if the ad industry weren't annoying and intrusive enough already!

socalgal2

I just imagine this going really wrong. My chain of thought would be something like: "Let's see, I need to rotate this image so I need to loop over rows then columns ... gawd, fuck, this code base is shit-designed, there are no units on these fields, this could be so much cleaner ... for each row ... I wonder what's for lunch today? I hope it's good ... for each column ... Dang, that response on HN really pissed me off, I'd better go check it ... read pixel from source ... tonight I'm meeting up with a friend, I'd better remember to confirm ... write pixel to dest ...."

gcanyon

For those thinking about speed: an average human talks anywhere from 120-240 words per minute. An average touch typist is probably 1/3 to 1/2 as fast as that (call it 40-120 wpm), while an average phone typist is probably around 1/5 as fast (25-50 wpm).

But for me, speed isn't even the issue. I can dictate to Siri at near-regular-speech speed -- and then spend another 200% of that time fixing what it got wrong. I have reasonable diction and enunciation, and speech-to-text is just that bad while walking down the street. If this is as accurate as they're showing, it would be worth it for the accuracy alone.

keleftheriou

I agree, but I think LLM-based voice input is already a lot better. I'm using OpenAI's realtime API for my Apple Watch app, and it works wonders; even editing can be as simple as "add a heart emoji at the end", and it just works.

https://x.com/keleftheriou/status/1963399069646426341