
Show HN: Lemon Slice Live – Have a video call with a transformer model

56 comments · April 24, 2025

Hey HN, this is Lina, Andrew, and Sidney from Lemon Slice. We’ve trained a custom diffusion transformer (DiT) model that achieves video streaming at 25fps and wrapped it into a demo that allows anyone to turn a photo into a real-time, talking avatar. Here’s an example conversation from co-founder Andrew: https://www.youtube.com/watch?v=CeYp5xQMFZY. Try it for yourself at: https://lemonslice.com/live.

(Btw, we used to be called Infinity AI and did a Show HN under that name last year: https://news.ycombinator.com/item?id=41467704.)

Unlike existing avatar video chat platforms like HeyGen, Tolan, or Apple Memoji filters, we do not require training custom models, rigging a character ahead of time, or having a human drive the avatar. Our tech allows users to create and immediately video-call a custom character by uploading a single image. The character image can be any style - from photorealistic to cartoons, paintings, and more.

To achieve this demo, we had to do the following (among other things! but these were the hardest):

1. Training a fast DiT model. To make our video generation fast, we had to both design a model that made the right trade-offs between speed and quality, and use standard distillation approaches. We first trained a custom video diffusion transformer (DiT) from scratch that achieves excellent lip and facial expression sync to audio. To further optimize the model for speed, we applied teacher-student distillation. The distilled model achieves 25fps video generation at 256-px resolution. Purpose-built transformer ASICs will eventually allow us to stream our video model at 4k resolution.

2. Solving the infinite video problem. Most video DiT models (Sora, Runway, Kling) generate 5-second chunks. They can iteratively extend a clip by another 5 seconds by feeding the end of the first chunk into the start of the second in an autoregressive manner. Unfortunately, the models suffer quality degradation after multiple extensions due to accumulated generation errors. We developed a temporal consistency preservation technique that maintains visual coherence across long sequences. Our technique significantly reduces artifact accumulation and allows us to generate indefinitely long videos.

3. A complex streaming architecture with minimal latency. Enabling an end-to-end avatar video call requires several building blocks in addition to video generation, including voice transcription, LLM inference, and text-to-speech. We use Deepgram as our AI voice partner, Modal as the end-to-end compute platform, and Daily.co and Pipecat to build a parallel processing pipeline that orchestrates everything via continuously streaming chunks. Our system achieves end-to-end latency of 3-6 seconds from user input to avatar response; our target is <2 seconds.
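
To make the shape of that pipeline concrete, here is a toy asyncio sketch of the chunked hand-off between stages. This is not our actual Pipecat/Daily.co code; the four stage functions are placeholders standing in for Deepgram STT, the LLM, TTS, and our video model.

```python
import asyncio

# Placeholder stages (hypothetical; the real services are Deepgram STT,
# an LLM, a TTS engine, and the video DiT).
async def transcribe(chunk):
    yield f"text({chunk})"

async def llm_reply(text):
    yield f"reply({text})"

async def tts(reply):
    yield f"speech({reply})"

async def video_frames(speech):
    yield f"frames({speech})"

async def stage(in_q, out_q, fn):
    """Pull chunks from in_q and stream partial results to out_q as soon as they exist."""
    while True:
        item = await in_q.get()
        if item is None:            # sentinel: shut the pipeline down in order
            await out_q.put(None)
            return
        async for out in fn(item):
            await out_q.put(out)

async def main():
    # audio -> text -> reply -> speech -> video
    qs = [asyncio.Queue() for _ in range(5)]
    fns = [transcribe, llm_reply, tts, video_frames]
    tasks = [asyncio.create_task(stage(qs[i], qs[i + 1], fn))
             for i, fn in enumerate(fns)]
    for chunk in ["hello", None]:   # one user utterance, then stop
        await qs[0].put(chunk)
    while (out := await qs[-1].get()) is not None:
        print(out)                  # frames(speech(reply(text(hello))))
    await asyncio.gather(*tasks)

asyncio.run(main())
```

The point is that every stage streams partial results downstream as soon as they exist, so transcription, the LLM reply, speech synthesis, and video generation overlap rather than running strictly in sequence.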

More technical details here: https://lemonslice.com/live/technical-report.

Current limitations that we want to solve include: (1) enabling whole-body and background motions (we’re training a next-gen model for this), (2) reducing delays and improving resolution (purpose-built ASICs will help), (3) training a model on dyadic conversations so that avatars learn to listen naturally, and (4) allowing the character to “see you” and respond to what they see to create a more natural and engaging conversation.

We believe that generative video will usher in a new media type centered around interactivity: TV shows, movies, ads, and online courses will stop and talk to us. Our entertainment will be a mixture of passive and active experiences depending on what we’re in the mood for. Well, prediction is hard, especially about the future, but that’s how we see it anyway!

We’d love for you to try out the demo and let us know what you think! Post your characters and/or conversation recordings below.

zebomon

This is impressive. The video chat works well. It is just a hair away from a very comfortable conversation. I'm excited to see where you have it a year from now, if it turns out to be financially viable. Good luck!

lcolucci

Thank you! Very much agree that we need to improve speed to make the conversation more comfortable. Our target is <2sec latency (as measured by time to first byte). The other building blocks of the stack (like interruption handling) will get better in the coming months as well. In a year, things should feel like a Zoom conversation with another human.

srameshc

I am very much fascinated by this virtual avatar talking thing. I tried video-retalking (https://github.com/OpenTalker/video-retalking) just to see how far I could get with making a talking avatar, but it is tremendously difficult. This holds tremendous possibilities, though, and I hope running such models eventually becomes cheaper. I know this is far superior and probably quite different, but I hope to someday find open-source solutions like Lemon Slice that I can experiment with.

sid-the-kid

Nice! Thanks for sharing. I hadn't seen that paper before. Looks like they take in a real-world video and then re-generate the mouth to get lip sync. In our solution, we take in an image and then generate the entire video.

I am sure there will be open-source solutions for fully-generated real-time video within the next year. We also plan to provide an API for our solution at some point.

dang

lcolucci

haha this is amazing! Just made him a featured character. Folks can chat with him by searching for "Devil"

lostmsu

This is very impressive. Any details about model architecture and size? Input and output representation?

How does voice work? You mentioned Deepgram. Does it mean you do Speech-to-Text-to-Speech?

sid-the-kid

For the input, we pass the model: 1) embedded audio and 2) a single image (encoded with a causal VAE). The model outputs the final RGB video directly.

The key technical unlock was getting the model to generate a video faster than real-time. This allows us to stream video directly to the user. We do this by recursively generating the video, always using the last few frames of the previous output to condition the next output. We have some tricks to make sure the video stays relatively stable even with recursion.
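
Roughly, the loop looks like this (a toy sketch with made-up shapes and a placeholder in place of our real model):

```python
import torch

CHUNK, OVERLAP = 16, 4                           # frames per chunk / frames carried over

def fake_model(image, audio, context):           # placeholder for the real video DiT
    return torch.randn(CHUNK, 3, 256, 256)       # one chunk of RGB frames

def stream_video(image, audio_chunks, model=fake_model):
    context = None                               # no previous frames at the very start
    for audio in audio_chunks:
        frames = model(image, audio, context)    # generate the next chunk
        context = frames[-OVERLAP:]              # condition the next chunk on the tail
        yield frames                             # stream frames out as they are made

image = torch.randn(3, 256, 256)
for frames in stream_video(image, audio_chunks=[torch.randn(16, 128)] * 3):
    print(frames.shape)                          # torch.Size([16, 3, 256, 256])
```

The overlap frames are what tie consecutive chunks together; the tricks mentioned above are about keeping that recursion stable over long sequences.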

tough

I'm not at that level, but it reminded me of https://news.ycombinator.com/item?id=43736193

sid-the-kid

Nice find! I hadn't seen this before (and will take a deeper look later). It looks like an approach to better utilize GPU memory. We would probably benefit from this to get more of a speed-up, which would also help us improve video quality.

I do not think they are running in real time though. From the website: "Personal RTX 4090 generates at speed 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache)." That means it would take them 37.5s to generate 1 second of video, which is fast for video but way slower than real time.
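
Spelling out that arithmetic:

```python
seconds_per_frame = 1.5                  # the "teacache" number quoted above
frames_per_second_of_video = 25          # real-time playback rate
print(seconds_per_frame * frames_per_second_of_video)  # 37.5 s of compute per 1 s of video
```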

tony_cannistra

Nice. This is also how recent advances in ML weather forecasting work. Weather forecasting really is just "video generation" but in higher dimensions.

dheera

Nice! What infra do you use for inference? I'm wondering what the cost-effective platforms are for projects like this. GPUs on AWS and Azure are incredibly expensive for personal use.

sid-the-kid

We use Modal (https://modal.com/). They give us GPUs on-demand, which is critical so we only pay for what we actually use. Pricing is about $2/hr per GPU as a baseline for the costs. Long story short, things get VERY expensive quickly.
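
For anyone curious, the general shape of an on-demand GPU function on Modal looks something like this (an illustrative sketch only, not our actual deployment; the app name, GPU type, and function body are made up):

```python
import modal

app = modal.App("avatar-inference-sketch")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="A100", image=image, timeout=600)
def generate_chunk(audio_chunk: bytes) -> bytes:
    # ...run the video model on the GPU and return encoded frames...
    return b"frames"

@app.local_entrypoint()
def main():
    # Calling .remote() spins up a GPU container on demand.
    print(len(generate_chunk.remote(b"audio")))
```

You launch it with `modal run`; containers start when the function is called and billing stops when they exit.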

lcolucci

thank you! We have an architecture diagram and some more details in the tech report here: https://lemonslice.com/live/technical-report

And yes, exactly. In between each character interaction we need to do speech-to-text, LLM, text-to-speech, and then our video model. All of it happens in a continuously streaming pipeline.

gitroom

honestly this feels kinda huge - stuff like this is moving so fast, it's insane seeing it go real-time

sid-the-kid

IMO, most video models will be fully real-time within 2 years. You will be able to pick a model, imagine any world, and then be fully immersed in it. Walk around any city interacting with people, play first-person shooter games on any map with crazy monsters, or just let the model auto-pilot an adventure for you.

lcolucci

thanks so much for the kind words! we agree that the leap to real-time feels huge. so excited to share this with you all

elternal_love

Hmm, plug this together with an app that collects photos and chats with a deceased loved one and you have a working Malachim. Might be worth a shot.

Impressive technology - impressive demo! Sadly, the conversation seems to be a little bit overplayed. Might be worth plugging ChatGPT or some better LLM into the logic section.

andrew-w

Thanks for the feedback. Optimizing for speed meant we had fewer LLMs to choose from. OpenAI had surprisingly high variance in latency, which made it unusable for this demo. I think we could probably do a better job with prompting for some of the characters.

benob

Very nice. Are you planning a paper?

lcolucci

thank you! No concrete paper plans yet as we're focused on shipping product features. Anything specific you'd want to read about?

aorloff

Max Headroom lives!

andrew-w

Just added as a public character :)

consumer451

Really wish his trademark glitching head nod was there, but I can imagine how that might not be possible.

Super cool product in any case.

sid-the-kid

Does he? I can't find him.

sid-the-kid

Looked it up. Cool reference.

sid-the-kid

The system just crashed. Sorry! Working on getting things live again as fast as we can!

sid-the-kid

We are live again folks! Sorry about that. We ran out of storage space.

PUSH_AX

Ah the ole HN soak test.

sid-the-kid

Ya. You always think you've crossed your Ts. But the law always holds.

lcolucci

haha one of the reasons launching on HN is great!

andrewstuart

A really compelling experience.

It seems clumsy to use copyrighted characters in your demos.

Seems to me this will be a standard way to interact with LLMs and even companies - like a receptionist/customer service/salesperson.

Obviously games could use this.

bigyabai

> reducing delays and improving resolution (purpose-built ASICs will help)

How can you be sure? Investing in an ASIC seems like one of the most expensive and complicated solutions.

lcolucci

We wouldn't build it ourselves, but there are several companies like Etched, Groq, and Cerebras working on purpose-built hardware for transformer models. Here's more: https://www.etched.com/announcing-etched