Show HN: I open-sourced my AI toy company that runs on ESP32 and OpenAI realtime

67 comments · April 22, 2025

Hi HN! Last year the project I launched here got a lot of good feedback on creating speech-to-speech AI on the ESP32. Recently I revamped the whole stack, iterated on that feedback, and made our project fully open source: all of the client, hardware, and firmware code.

This GitHub repo turns an ESP32-S3 into a realtime AI speech companion using the OpenAI Realtime API, Arduino WebSockets, Deno Edge Functions, and a full-stack web interface. You can talk to your own custom AI character, and it responds instantly.

I couldn't find a resource that helped set up a reliable, secure WebSocket (WSS) AI speech-to-speech service. While there are several useful Text-To-Speech (TTS) and Speech-To-Text (STT) repos out there, I believe none gets speech-to-speech right. OpenAI launched an embedded repo late last year that sets up WebRTC with ESP-IDF. However, it's not beginner-friendly and doesn't have a server-side component for business logic.

This repo is an attempt at solving the above pains and creating a great speech-to-speech experience on Arduino with secure WebSockets, using edge servers (Deno/Supabase Edge Functions) for fast global connectivity and low latency.
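
For readers skimming the architecture, here is a minimal sketch of the edge-relay idea: a Deno/Supabase edge function accepts the device's secure WebSocket and proxies frames to and from the OpenAI Realtime API. The endpoint URL, model name, and the omission of auth and buffering are assumptions for illustration, not the repo's actual code.

```
// Minimal relay sketch: accept a WSS connection from the device and
// proxy frames to/from the OpenAI Realtime API. URL, model name, and
// auth handling are illustrative assumptions.
Deno.serve((req) => {
  if (req.headers.get("upgrade") !== "websocket") {
    return new Response("Expected a WebSocket request", { status: 400 });
  }
  const { socket: device, response } = Deno.upgradeWebSocket(req);

  // Connecting to OpenAI server-side keeps the API key off the toy.
  // (Authentication, whether header or subprotocol based, is omitted;
  // a real implementation would also buffer frames until this opens.)
  const upstream = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  );

  // Relay frames in both directions and tear down together.
  device.onmessage = (e) => {
    if (upstream.readyState === WebSocket.OPEN) upstream.send(e.data);
  };
  upstream.onmessage = (e) => {
    if (device.readyState === WebSocket.OPEN) device.send(e.data);
  };
  device.onclose = () => upstream.close();
  upstream.onclose = () => device.close();

  return response;
});
```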

Sean-Der

This is wonderful, really great job on this! For me, physical devices are when it really starts to feel magical. My pre-schooler never engaged with the speech-to-speech examples I showed her on a screen. However, when I showed her a reindeer toy [1] on my desk that tells jokes, that is when it became real. It is the same joy/wonder I felt playing Myst for the first time.

----

If anyone is trying to build physical devices with the Realtime API, I would love to help. I work at OpenAI on the Realtime API, worked on [0] (it was upstreamed), and I really believe in this space. I want to see this all built with open/interoperable standards so we don't have vendor lock-in and developers can build the best thing possible :)

[0] https://github.com/openai/openai-realtime-embedded

[1] https://youtu.be/14leJ1fg4Pw?t=804

StefMyb

I would love to chat further with you about this. I am working on building an educational conversational toy. The toy will tell stories and sing, but the conversational aspect is the only thing at this stage that requires AI. The whole idea came from my daughter, who was in Kinder at the time.

Sean-Der

sean @ pion.ly please email me any time.

The offer is open to anyone. If you need help with WebRTC, the Realtime API, or embedded work, I am here to help. I have an open meeting link on my website.

drakenot

Something that really kills the 'effect' of most of the Voice > AI demos that I see is the cold start / latency.

The OpenAI "Voice Mode" is closer, but when we can have near instantaneous and natural back and forth voice mode, that will be a big in terms of it feeling magical. Today, it is say something, awkwardly wait N seconds then listen to the reply and sometimes awkwardly interrupt it.

Even if the models were no smarter than they are today, if we could crack that "conversational" piece and the performance piece, it would make a big difference in my opinion.

akadeb

Yeah, the way I am handling this is turn detection, which feels unnatural. I like how LiveKit handles turn detection with a small model [0][1].

[0] https://www.youtube.com/watch?v=EYDrSSEP0h0

[1] https://docs.livekit.io/agents/build/turns/turn-detector/

```
turn_detection: {
  type: "server_vad",
  threshold: 0.4,
  prefix_padding_ms: 400,
  silence_duration_ms: 1000,
},
```
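
For context, a block like the one above is typically sent as part of a session.update client event once the Realtime API WebSocket is open. Here is a rough sketch of that, where `ws` is an assumed, already-connected WebSocket rather than anything from this repo's firmware.

```
// Sketch: enable server-side VAD turn detection on an open Realtime
// API WebSocket (`ws` is assumed to exist and be authenticated).
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: {
      type: "server_vad",        // the server decides when a turn ends
      threshold: 0.4,            // VAD sensitivity
      prefix_padding_ms: 400,    // audio kept from just before speech starts
      silence_duration_ms: 1000, // pause length that ends the user's turn
    },
  },
}));
```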

Sean-Der

I think it will always feel unnatural as long as 'AI speech' is turn based. Right now developers use Voice Activity Detection to detect when the user has stopped talking.

What would be REALLY cool is if we had something that would interrupt you during conversation like talking with a real human.

conductr

I can see how interruptions would prove even more unnatural and annoying pretty quickly. There's a lot of nuance in knowing how to interrupt properly, and often people who interrupt only do so briefly, then yield, let the other person finish, then resume. It's very situational, with tons of nuance. Otherwise, at the current level of sophistication, you'd just have the AI talking over you the entire time, not letting you complete your thoughts/questions/commands/etc., and people would quickly get frustrated and just turn it off.

hoppp

It's great, lovely. But in the long run, do these toys rely on subscription payments?

Both the Supabase API and OpenAI billing are per API call.

So the lovely talking toys can die if the company stops being profitable.

I would love to see a version with decent hardware that runs a local model, that could have a long lifespan and work offline.

xp84

> lovely talking toys can die if the company stops being profitable.

This is a good point to me as a parent -- in a world where this becomes a precious toy, it would be a serious risk of emotional pain if the child experienced this scenario like the death of a pet or friend.

> version with decent hardware that runs a local model

I feel like something small and efficient enough to meet that (today) would be dumb as a post. Like Siri-level dumb.

Personally, I'd prefer a toy which was tethered to a home device. Without a cloud (and thus commercial) dependency, the toy wouldn't be 'smart' outside of Wi-fi range, but I'd design it so that it got 'sleepy' when away from Wi-fi, able to be "woken up" and, in that state, to respond to a few phrases with canned, Siri-like answers. Perhaps new content could be made up for it daily and downloaded to local storage while at home, so that it could still "tell me a story" offline etc.

scottmcf

> This is a good point to me as a parent -- in a world where this becomes a precious toy, it would be a serious risk of emotional pain if the child experienced this scenario like the death of a pet or friend.

We've already seen this exact scenario play out with "Moxie" a few months ago:

https://www.axios.com/2024/12/10/moxie-kids-robot-shuts-down

supermatt

This looks like so much fun! I have recently gotten into working with electronics, so it seems like a nice little project to undertake.

I noticed that it depends on OpenAI's Realtime API, so it got me wondering what open alternatives there are, as I would love a more realtime Alexa-like device in my home that doesn't contact the cloud. I have only played with software, but the existing solutions have never felt realtime to me.

I could only find <https://github.com/fixie-ai/ultravox>, which seems like it would really work as realtime. It appears to be a model that wires up Llama and Whisper somehow, rather than treating them as separate steps, which is common with other projects.

What other options are available for this kind of real-time behaviour?

Sean-Der

My plan is that Espressif's WebRTC code [0] will hook up to Pipecat [1], which gets you the freedom to do whatever you want.

The design of OpenAI + WebRTC was to lean on WebRTC as much as possible to make it easier for users.

[0] https://github.com/espressif/esp-webrtc-solution

[1] https://github.com/pipecat-ai/pipecat

supermatt

Fantastic! This will save a ton of work

_neil

Not on-device, but for the local network I've been looking at Speaches [0]. Haven't tried it yet, but I have been running kokoro-web [1] and the quality and speed are really good.

[0] https://speaches.ai/

[1] https://huggingface.co/spaces/Xenova/kokoro-web

3D30497420

Maybe take inspiration from how Home Assistant can do local speech-to-text and vice versa? https://www.home-assistant.io/voice_control/voice_remote_loc...

Pretty sure you'd need to host this on something more robust than an ESP32 though.

supermatt

Yeah, I was looking at Home Assistant as well, but it doesn't feel real-time, likely because its transcription stage is separate from the inference.

behnamoh

Am I the only one who finds the unnecessarily positive vibes of the OpenAI realtime voices unrealistic, too much, and borderline creepy?

mickael-kerjean

Yep, and having it in a child's toy is way beyond the border of creepy.

3np

More so from the consent and privacy angle.

justanotheratom

This is quite cool. Two questions:

- Why do you need the Next.js frontend for what looks like a headless use case?
- How much would the OpenAI bill be if there is 15 minutes of usage per day?

irq-1

> This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output.

https://openai.com/index/introducing-the-realtime-api/
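
Back-of-the-envelope from those quoted rates, assuming the worst case of continuous audio in both directions for the 15 minutes asked about above: 15 × $0.06 = $0.90 of input plus 15 × $0.24 = $3.60 of output, so roughly $4.50 per session as a ceiling. Sessions only bill for audio actually streamed, so a normal back-and-forth conversation should land well below that.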

About the Next.js site, I was thinking maybe it's difficult to have Supabase hold long connections, or route the response? I'm curious too.

akadeb

The long connections are ultimately handled by Deno Edge, so the site isn't used there. The Next.js frontend (which could also be an iOS/Android app) provides an interface to select a character, create AI characters, set the ESP32 volume, and view conversation history.

akadeb

Thank you! The Next.js frontend is for setting things like device volume, selecting which character you are interacting with, viewing conversation history, etc. I just tried it, and a 15-minute chat is roughly 20 cents. Roughly 570 input tokens.

JKCalhoun

And I am wondering, why use an ESP32 if you don't need the WiFi? (And, please, no WiFi in a toy!)

akadeb

Currently we connect to a Wi-Fi network to reach the Deno edge server. Some popular toys doing this: Yoto, Toniebox.


ianbicking

What's been your experience with the Realtime API? I've been doing LLM-with-voice work, but haven't really given it a try; the price is so high, and it feels like it's much harder to control. Specifically, you just get one system prompt and then the model takes over entirely. (Though looking at the API, I see you can inject text and do some other things to play around with the session.)
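
For what it's worth, the session is a bit more steerable than a single up-front prompt: over the WebSocket you can send client events mid-conversation, such as conversation.item.create to inject text and response.create to ask for a reply. A small sketch, with `ws` assumed to be an open, authenticated Realtime API WebSocket:

```
// Sketch: inject a user text message into an ongoing Realtime session,
// then ask the model to respond to it.
ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [{ type: "input_text", text: "Ask me about my day next." }],
  },
}));
ws.send(JSON.stringify({ type: "response.create" }));
```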

vunderba

I remember when LLMs started getting mass traction and the first thing everyone wanted to build was AG Talking Bear + ChatGPT.

https://en.wikipedia.org/wiki/AG_Bear

With regard to this project, using an ESP32 makes a lot of sense. I used an Espressif ESP32-S3 Box to build a smart speaker along with the Willow inference server, and it worked very well. The ESP speech recognition framework helps with wake word / far-field audio processing.

dayvid

Really interesting. Also more powerful if integrated with animatronic movement. Reminds me of Furby. Doesn't even have to be full AI, just augmented with slightly smarter and more flexible capabilities

stavros

This is great, thank you! I can learn a lot from this.
