Helix: A vision-language-action model for generalist humanoid control

177 comments

·February 20, 2025

porphyra

It seems that end to end neural networks for robotics are really taking off. Can someone point me towards where to learn about these, what the state of the art architectures look like, etc? Do they just convert the video into a stream of tokens, run it through a transformer, and output a stream of tokens?

vessenes

I was reading their site, and I too have some questions about this architecture.

I'd be very interested to see what the output of their 'big model' is that feeds into the small model. I presume the small model gets a bunch of environmental input, and some input from the big model, and we know that the big model input only updates every 30 or 40 frames in terms of small model.

Like, do they just output random control tokens from big model and embed those in small model and do gradient descent to find a good control 'language'? Do they train the small model on english tokens and have the big model output those? Custom coordinates tokens? (probably). Lots of interesting possibilities here.

By the way, the dataset they describe was generated by a large (much larger presumably) vision model tasked with creating tasks from successful videos.

So the pipeline is:

* Video of robot doing something

* (o1 or some other high end model) "describe very precisely the task the robot was given"

* o1 output -> 7B model -> small model -> loss

yurimo

I don't know, there has been so many overhyped and faked demos in humanoid robotics space over the last couple years, it is difficult to believe what is clearly a demo release for shareholders. Would love to see some demonstration in a less controlled environment.

falcor84

I suppose the next big milestone is Wozniak's Coffee Test: A robot is to enter a random home and figure out how to make coffee with whatever they have.

UltraSane

That could still be decades away.

kilroy123

I don't know... I'm starting to seriously think that is only 5-10 years away.

ge96

Imagine they bring one out to a construction site and they treat the robot as a new rookie guy, go pick up those pipes. That would be an ultimate on the fly test to me.

ortsa

Picking up a bundle of loose pipes actually seems like a great benchmark for humanoid robots. Especially if they're not in a perfect pile. A full test could be something like grabbing all the pipes, from the floor, and putting them into a truck bed, in some (hopefully) sane fashion

sayamqazi

I have my personal multimodal benchmark for physical robots.

You put a keyring with bunch of different keys in front of a robot and then instruct it pick it up and open a lock while you are describing which key is the correct one. Something like "Use the key with black plastic head and you need to put it in teeths facing down"

I have low hopes of this being possibe in the next 20 years. I hope I am still alive to witness if it ever happens.

m0llusk

pick up that can, heh heh heh

causal

I'm always wondering at the safety measures on these things. How much force is in those motors?

This is basically safety-critical stuff but with LLMs. Hallucinating wrong answers in text is bad, hallucinating that your chest is a drawer to pull open is very bad.

silentwanderer

In terms of low-level safety, they can probably back out forces on the robot from current or torque measurement and detect collisions. The challenge comes with faster motions carrying lots of inertia and behavioral safety (e.g. don't pour oil on the stove)

rtkwe

That's actually more of a solved problem. Robot arms that can track the force they're applying and where to avoid injuring humans have been kicking around for 10-15 years. It let them go out of the mega safety cells into the same space as people and even do things like letting the operator pose the robot to teach it positions instead of having to do it in a computer program or with a remote control.

The term I see a lot is co-robotics or corobots. At least that's what Kuka calls them.

Symmetry

That's fine for wheeled robots or robots bolted to the floor but for legged robots, especially bipeds, the hard question is how to prevent them from falling over on things. These don't look heavy enough to be too dangerous for a standing adult but you've still got pets/children to worry about.

UltraSane

You can have dedicated controllers for the motors that limit their max torque.

imtringued

That's not enough. When a robot link is in motion and hits an object, the change in momentum creates an impulse over the duration of deceleration. The faster the robot moves, the faster it has to decelerate, the higher the instantaneous braking force at the impact point.

UltraSane

Then limit max velocity.

cess11

Not a big deal on the battlefield.

causal

I'd say a very big deal when munitions and targeting are involved

cess11

Why do you think that?

Battle fields are violent and exhausting, people get the shakes, make mistakes, hurt themselves and each other all the time. Munitions are generally designed to not explode from a bump in the road or getting dropped or squeezed, and targeting systems commonly support automatic tracking or similar, for this specific reason.

rizky05

[dead]

mmh0000

The thing in the video moves slower than the sloth in Zootopia. If you die by that robot, you probably deserve it.

throwaway0123_5

As a sibling comment implies though, there's also danger from it being stupid while unsupervised. For example, I'd be very nervous having it do something autonomously in my kitchen for fear of it burning down my house by accident.

mikehollinger

From a different robot (Boston Dynamics' new Atlas) - the system moves at a "reasonable" speed. But watch at 1m20s in this video[1]. You can see it bump and then move VERY quickly -- with speed that would certainly damage something, or hurt someone.

[1] https://www.youtube.com/watch?v=F_7IPm7f1vI

charlie0

Especially if holding a knife or something sharp.

exe34

or if you're old, injured, groggy from medication, distracted by something/someone else, blind, deaf or any number of things.

it's easy to take your able body for granted, but reality comes to meet all of us eventually.

dr_kiszonka

They are designed to penetrate Holtzman shields, surely.

causal

Are you saying it cannot move faster than they because of some kind of governor?

UltraSane

That is how I would design it. It is common in safety critical PLC systems to have 1 or more separate safety PLCs that try to prevent bad things from happening.

Symmetry

A governor, the firmware in the motor controllers, something like that. Certainly not the neural network though.

Symmetry

So, there's no way you can have fully actuated control of every finger joint with just 35 degrees of freedom. Which is very reasonable! Humans can't individually control each of our finger joints either. But I'm curious how their hand setups work, which parts are actuated and which are compliant. In the videos I'm not seeing any in-hand manipulation other than just grasping, releasing, and maintaining the orientation of the object relative to the hand and I'm curious how much it can do / they plan to have it be able to do. Do they have any plans to try to mimic OpenAI's one handed rubics cube demo?

wwwtyro

Until we get robots with really good hands, something I'd love in the interim is a system that uses _me_ as the hands. When it's time to put groceries away, I don't want to have to think about how to organize everything. Just figure out which grocery items I have, what storage I have available, come up with an optimized organization solution, then tell me where to put things, one at a time. I'm cautiously optimistic this will be doable in the near term with a combination of AR and AI.

camjw

Maybe I don't understand exactly what you're describing but why would anyone pay for this? When I bring home the shopping I just... chuck stuff in the cupboards. I already know where it all goes. Maybe you can explain more?

loudmax

One use case I imagine is skilled workmanship. For example, putting on a pair of AR glasses and having the equivalent of an experienced plumber telling me exactly where to look for that leak and how to fix it. Or how to replace my brake pads or install a new kitchen sink.

When I hire a plumber or a mechanic or an electrician, I'm not just paying for muscle. Most of the value these professionals bring is experience and understanding. If a video-capable AI model is able to assume that experience, then either I can do the job myself or hire some 20 year old kid at roughly minimum wage. If capabilities like this come about, it will be very disruptive, for better and for worse.

hulahoof

Sounds like what Hololens was designed to solve, more in the AR space than AI though

semi-extrinsic

This is called "watching YouTube tutorials". We've had it for decades.

__MatrixMan__

It would be nice to be able to select a recipe and have it populate your shopping list based on what is currently in your cupboards. If you just chuck stuff in the cupboards then you have to be home to know what they contain.

Or you could wear it while you cook and it could give you nutrition information for whatever it is you cooked. Armed with that it could make recommendations about what nutrients you're likely deficient in based on your recent meals and suggest recipes to remedy the gap--recipes based on what it knows is already in the cupboard.

gopher_space

Maybe I’m showing my age, but isn’t this a home ec class?

mistercheph

[flagged]

luma

> why would anyone pay for this?

Presumably, they won't as this is still a tech demo. One can take this simple demonstration and think about some future use cases that aren't too different. How far away is something that'll do the dishes, cook a meal, or fold the laundry, etc? That's a very different value prop, and one that might attract a few buyers.

Philip-J-Fry

The person you're replying to is referring to the GP. The GP asks for an AI that tells them where to put their shopping. Why would anyone pay for THAT? Since we already know where everything goes without needing an AI to tell us. An AI isn't going to speed that up.

SoftTalker

Yes it's pretty amazing how so many people seem to overcomplicate simple household tasks by introducing unnecessary technology.

bear141

Maybe some people just assume there is a “best” or “optimal” way to do everything and AI will tell us what that is. Some things are just preference and I don’t mind the tiny amount of energy that goes into doing small things the way I like.

jayd16

Maybe they're imagining more complex tasks like working on an engine.

Philpax

Sounds like what's described in Manna: https://marshallbrain.com/manna1

sho_hn

Dunno, I would not want to bleed my mental faculties for doing even simple planning work like this by outsourcing it to AI. Reliance on crutches like this would seem like a pathway to early-onset dementia.

meowkit

Already playing out, anecdotally to my experience.

Its similar to losing callouses on our hands if you don’t labor/go to the gym.

mistercheph

[flagged]

RedNifre

I fully agree, building something like this is somewhere in my back log.

I think the key point why this "reverse cyborg" idea is not as dystopian as, say, being a worker drone in a large warehouse where the AI does not let you go to the toilet is that the AI is under your own control, so you decide on the high level goal "sort the stuff away", the AI does the intermediate planning and you do the execution.

We already have systems like that, every time you use you tell your navi where you want to go, it plans the route and gives you primitive commands like "on the next intersection, turn right", so why not have those for cooking, doing the laundry, etc.?

Heck, even a paper calendar is already kinda this, as in separating the planning phase from the execution phase.

Jarwain

I'm quite slowly working on something like this, but for time.

For "stuff" I think a bigger draw is having it so it can let me know "hey you already have 3 of those spices at locations x, y, and z, so don't get another" or "hey you won't be able to fit that in your freezer"

falcor84

This is almost literally the first chapter in Marshall Brain's "Manna" [0], being the first step towards world-controlling AGI:

> Manna told employees what to do simply by talking to them. Employees each put on a headset when they punched in. Manna had a voice synthesizer, and with its synthesized voice Manna told everyone exactly what to do through their headsets. Constantly. Manna micro-managed minimum wage employees to create perfect performance.

[0] https://marshallbrain.com/manna1

__MatrixMan__

I imagine a something like a headlamp except it's a projector and a camera so it can just light up where it wants you to pick something up in one color or where it wants you to put it down in another color. It can learn from what it sees of my hands how the eventual robot should handle the space (e.g. not putting heavy things on top of fragile things and such).

I'd totally use that to clean my garage so that later I can ask it where the heck I put the thing or ask it if I already have something before I buy one...

lynx97

A good AI fridge would be already a great starting point. With a checkin procedure that makes sure to actually know whats in the fridge. Complete with expiry tracking and recipe suggestions based on personal preferences combined with product expiry. I am totally unimpressed with almost everything I see in home automation these days, but I'd immediately buy the AI fridge if it really worked smoothly.

cactusplant7374

Your solution sounds like the worst cognitive load for getting home from the grocery store and wanting it all to be over.

ziofill

There’s nothing I want more than a robot that does house chores. That’s the real 10x multiplier for humans to do what they do best.

01100011

I'd pay $2k for something that folds my laundry reliably. It doesn't need arms or legs, just like my dishwasher doesn't need arms or legs. It just needs to let me dump in a load of clean laundry and output stacks of neatly folded or hung clothing.

mistercheph

There are many services that for ~$3/lb will pick up, wash, dry, fold/hang, and deliver 10lb's of laundry every week for $1,500/yr.

null

[deleted]

imtringued

Laundry folding machines already exist. You can find cheap ones on AliExpress.

https://www.aliexpress.com/w/wholesale-clothes-folding-machi...

mmh0000

It's called a "house cleaner" and they only cost ~$150 (area and all varies) bi-weekly. I'll shit a brick (and then have the robot clean it up) if a robot is ever cheaper than ~$4000/yr.

vessenes

A robot will definitely cost less than $30/hr eventually. But you'll be running it a lot more than a few hours every other week.

ziofill

Yeah, but a robot will work 24/7, not 2h byweekly -_-

mclau156

do you need a robot to work in your house 24/7?

dartos

Hopefully in the next decade we’ll get there.

Vision+language multimodal models seem to solve some of the hard problems.

cess11

To me this is such a weird wish. Why would you not want to care for your home and the people living there? Why would you want to have a slave taking these activities from you?

I'd rather have less waged labour and more time for chores with the family.

int_19h

Because it's boring and tedious. And no, a robot is not a slave.

abraxas

Yeah, except that future doesn't need us. By us I mean those of us who don't have $1B to their name.

Do you really expect the oligarchs to put up with the environmental degradation of 8 billion humans when they can have a pristine planet to themselves with their whims served by the AI and these robots?

I fully anticipate that when these things mature enough we'll see an "accidental" pandemic sweep and kill off 90% of us. At least 90%.

ben_w

I'd expect Musk and Bezos to know about von Neumann replicators; factories that make these robots staffed entirely by these robots all the way to the mines digging minerals out of the ground… rapid and literally exponential growth until they hit whatever the limiting factor is, but they've both got big orbial rockets now, so the limit isn't necessarily 6e24 kg.

ewjt

Oligarchs would use the robots to kill people instead of a pandemic. A virus carries too much risk of affecting the original creators.

Fortunately, robotic capability like that basically becomes the equivalent of Nuclear MAD.

Unfortunately, the virus approach probably looks fantastic to extremist bad actors with visions of an afterlife.

siavosh

What do humans do best?

jayd16

Everything everything else is worse at.

ein0p

Browse Instagram, apparently.

ziofill

I mean to use their time to pursue their passions and interests, not cleaning up the kitchen or making the bed or doing laundry...

hooverd

Given time to "pursue their passions and interests", most people chose to turn their brain to soup on social media.

null

[deleted]

plipt

The demo is quite interesting but I am mostly intrigued by the claim that it is running totally local to each robot. It seems to use some agentic decision making but the article doesn't touch on that. What possible combo of model types are they stringing together? Or is this something novel?

The article mentions that the system in each robot uses two ai models.

    S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data

and the other

    S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level [motor?] control.

It feels like although the article is quite openly technical they are leaving out the secret sauce? So they use an open source VLM to identify the objects on the counter. And another model to generate the mechanical motions of the robot.

What part of this system understands 3 dimensional space of that kitchen?

How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?

    Figure robots, each equipped with dual low-power-consumption embedded GPUs

Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

liuliu

It looks pretty obvious (I think):

1. S2 is a 7B VLM, it is responsible for taken in camera streams (from however many of them), run through prompt guided text generation, and before the lm_head (or a few layers leading to it), directly take the latent encoding;

2. S1 is where they collected a few hundreds hours of teleoperating data, retrospectively come up with prompt for 1, then train from the scratch;

Whether S2 finetuned with S1 or not is an open question, at least there is a MLP adapter that is finetuned, but could be the whole 7B VLM is finetuned too.

It looks plausible, but I am still skeptical about the generalization claim given it is all fine-tuned with household tasks. But nowadays, it is really difficult to understand how these models generalize.

tim_ai_robotics

I'm very skeptical. I'm quite familiar with VLAs and this seems like an unbelievable leap forward based on their claims.

bbor

I'm very far from an expert, but:

  What part of this system understands 3 dimensional space of that kitchen?

The visual model "understands" it most readily, I'd say -- like a traditional Waymo CNN "understands" the 3D space of the road. I don't think they've explicitly given the models a pre-generated pointcloud of the space, if that's what you're asking. But maybe I'm misunderstanding?

  How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

It appears that the robot is being fed plain english instructions, just like any VLM would -- instead of the very common `text+av => text` paradigm (classifiers, perception models, etc), or the less common `text+av => av` paradigm (segmenters, art generators, etc.), this is `text+av => movements`.

Feeding the robots the appropriate instructions at the appropriate time is a higher-level task than is covered by this demo, but I think is pretty clearly doable with existing AI techniques (/a loop).

  How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?

If your question is "where's the GPUs", their "AI" marketing page[1] pretty clearly implies that compute is offloaded, and that only images and instructions are meaningfully "on board" each robot. I could see this violating the understanding of "totally local" that you mentioned up top, but IMHO those claims are just clarifying that the individual figures aren't controlled as one robot -- even if they ultimately employ the same hardware. Each period (7Hz?) two sets of instructions are generated.

[1] https://www.figure.ai/ai

  What possible combo of model types are they stringing together? Or is this something novel?

Again, I don't work in robotics at all, but have spent quite a while cataloguing all the available foundational models, and I wouldn't describe anything here as "totally novel" on the model level. Certainly impressive, but not, like, a theoretical breakthrough. Would love for an expert to correct me if I'm wrong, tho!

EDIT: Oh and finally:

  Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

Surely they are downplaying the difficulties of getting this setup perfectly, and don't show us how many bad runs it took to get these flawless clips.

They are seeking to raise their valuation from ~$3B to ~$40B this month, sooooooo take that as you will ;)

https://www.reuters.com/technology/artificial-intelligence/r...

plipt

    their "AI" marketing page[1] pretty clearly implies that compute is offloaded

I think that answers most of my questions.

I am also not in robotics, so this demo does seem quite impressive to me but I think they could have been more clear on exactly what technologies they are demonstrating. Overall still very cool.

Thanks for your reply

verytrivial

Are they claiming these robots are also silent? They seem to have "crinkle" sounds handling packaging, which if added in post seems needlessly smoke-and-mirror for what was a very impressive demonstration (of robots impersonating an extreme stoned human.)

bilsbie

This is amazing but it also made me realize I just don’t trust these videos. Is it sped up? How much is preprogrammed?

I now they claim there’s no special coding but did they practice this task? Special training?

Even if this video is totally legit I’m but burned out by all the hype videos in general.

turnsout

They appear to be realtime, based on the robot's movements with the human in the scene. If you believe the article, it's zero shot (no preprogramming, practice or special training).

ge96

they seem slow to me, I was thinking they're slow for safety

null

[deleted]

aerodog

Interesting timing - same day MSFT releases https://microsoft.github.io/Magma/

Current discussion: https://news.ycombinator.com/item?id=43110265

pr337h4m

Goal 2 has been achieved, at least as a proof of concept (and not by OpenAI): https://openai.com/index/openai-technical-goals/

Symmetry

They can put away clutter but if they could chop a carrot or dust a vase they'd have shown videos demonstrating that sort of capability.

EDIT: Let alone chop an onion. Let me tell you having a robot manipulate onions is the worst. Dealing with loose onion skins is very hard.

j-krieger

Sure. But if you showed this video to someone 5 or 10 years ago, they'd say it's fiction.

Symmetry

Telling a robot verbally "Put the cup on the counter" and having it figure out what the cup is, what the counter is in its field of view would have seemed like science fiction. The object manipulation itself is still well behind what we saw in the 2015 DARPA Grand Challenge though.

squigz

There's something hilarious to me about the idea of chopping onions being a sort of benchmark for robots.

HN

Helix: A vision-language-action model for generalist humanoid control

Helix: A vision-language-action model for generalist humanoid control