
Robot Dexterity Still Seems Hard


26 comments

April 26, 2025

MisterTea

You need feedback. I started with industrial robotics in the 90s and have since done a bunch of CNC and motion control: positioning is easy. The big problem to solve is enabling the robot to feel what it's doing and understand how that relates to the coordinate space. That's why we're dexterous: we can close our eyes and feel our hands in 3D space instead of just knowing a position in some coordinate system. We can put on a pair of gloves by feel alone, without looking. I picture a robot arm as being like your arm when it goes numb from sleeping on it. You can see it, but it's dead. That's how a robot feels.

sashank_1509

Recently I had a chance to listen to a set of talks on the technology powering Waymo. I think the average academic roboticist would be shocked by the complete lack of end-to-end deep learning models, or even large models, powering Waymo. It's interesting to me that the only working self-driving car on the market right now has basically painstakingly listed every possible road obstacle, coded in every piece of driving logic, and manually addressed every edge case. Maybe Tesla's end-to-end approach will work and will be the way forward, but the real world seems to provide an almost limitless number of edge cases that neural networks don't seem great at handling. In fact, if Waymo's approach is proven to be the right one, the winning approach to humanoids might be listing every possible item a humanoid can see in an environment, detecting them, and then planning around them.

boulos

(Disclosure: I work for Waymo)

While there is plenty of classical robotics code in our planner, I wouldn't want people to assume that we don't use neural networks for planning.

Just because we don't deploy end-to-end models (e.g., sensors to controls), but instead have separate perception and planning components, doesn't mean there isn't ML in each part. Having the components separate means we can train and update each individually, test them individually, inject overrides as needed, and so on. On the flip side, it's true that because it's not learned end-to-end today, there might exist a vastly simpler or higher-quality system.

So we do a lot of research in this area, like EMMA (https://waymo.com/research/emma/), but don't assume that our planning isn't heavily ML-based. A lot of our progress in the last couple of years has been driven by increasing the amount of ML used for planning, especially for behavior prediction (e.g., https://waymo.com/research/wayformer/).

marcosdumay

> basically has painstakingly listed every possible road obstacle, has coded every possible driving logic to it, and ... addressed every edge case

Removed that "manually" world so now it describes exactly the what you would have to do to train an end to end neural network.

NNs don't get information from nothing; you would have to subject them to the exact same obstacles, geometries, and behaviors you coded into the manual version.

Zigurd

Not every edge case, but enough that the vehicle can correctly determine it doesn't know how to proceed and must ask a human to choose from among a menu of choices. This is how Waymo described how supervision works. Nobody actually drives the vehicle remotely. They just make a decision the on-board intelligence has decided it can't make.

One good bet based on Waymo's decision to expand is that the amount of supervision each robotaxi needs keeps going down, so supervision is not tightly coupled to fleet size.

huevosabio

I think it would be the other way around, academic roboticists are very well aware of how damn hard the physical world is.

sho_hn

I suppose the (crummy) analog is that a human's "models" are equally not entirely general; we have evolved a particular architecture that is baked into our hardware and perpetuated via our DNA.

It's fuzzy and plastic and complex, but the brain has functional areas, there is intelligence more local to specific sensors, pipelines where fusion happens, governors and supervisors, specific numeric limits to certain tasks, etc.

This is a bit akin to your "listing every possible item", in a way, in the sense that there are definitely finite structures tuned toward the application of being human.

This interplay between our supposed "AGI" and what is "cached" in our hardware (which is also not static, but evolving) is really one of the most fascinating aspects of biology.

iandanforth

For some perspective, we have not yet scaled robot training. The amount of data Pi is using to train their impressively capable robots is in the range of thousands of hours. In contrast, language models are trained over trillions of tokens comprising the entirety of human knowledge. So if you're saying things like "this still seems hard," just remember we have yet to hit this with the data hammer. Simulation is proving a great way to augment and bootstrap robot dexterity, but it still pales in comparison to data from the real world. So, as the author points out, we may get capability scaling like Waymo's, where one company painstakingly collects real data over a decade, but we may also see rapid progress in simulators and simulator speed overtake that for practical household / industrial tasks. My bet is on the latter.

deeThrow94

> In contrast language models are trained over trillions of tokens comprising the entirety of human knowledge.

Not even close! At best it's a small subset of the internet + published books. The vast majority of human knowledge isn't even in the training sets yet.

I would question the use of a model fed everything, though.

hahaxdxd123

Correct me if I'm wrong, but I haven't seen any simulator progress in years (e.g., MuJoCo hasn't changed in 5 years but is still SOTA in accuracy).

erwincoumans

MuJoCo, Drake, Pinocchio (and other simulators) are still improving (adding more accurate collision detection, better solvers etc).

frainfreeze

How do they compare to PyBullet?

levocardia

Surprised that there isn't any explicit discussion of why dexterity is so hard beyond sensory perception. One of the root causes (IMHO the biggest one) is that modeling contact, i.e., the static and dynamic friction between two objects, is extremely complicated. There are various modeling strategies, but their results are highly sensitive to various tuning parameters, which makes it very hard to learn in simulation. From what I remember, the OpenAI Rubik's Cube solver basically learned across a giant set of worlds with many different possible tuning parameters for the contact models and was able to generalize okay to the real world in various situations.

It seems most likely that this sort of boring domain randomization will be what works, or works well enough, for handling contact in this generation of robotics, but it would be much more exciting if someone figured out a better way to learn contact models (or a latent representation of them) in real time.
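
For anyone who hasn't seen it, the mechanics of that kind of randomization are pretty mundane. A minimal sketch using the open-source MuJoCo Python bindings (a toy scene I made up, with illustrative friction ranges, not anything OpenAI actually used):

```python
import numpy as np
import mujoco

# Toy scene: a box resting on a plane. Real setups randomize far more
# (damping, masses, actuator gains, latency), but friction is the classic one.
XML = """
<mujoco>
  <worldbody>
    <geom name="floor" type="plane" size="1 1 0.1"/>
    <body name="box" pos="0 0 0.1">
      <freejoint/>
      <geom name="box" type="box" size="0.05 0.05 0.05" mass="0.1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)
rng = np.random.default_rng(0)

def randomize_contact(model, rng):
    # geom_friction is (ngeom, 3): sliding, torsional, rolling friction.
    # Sample a new "world" per episode; the policy never sees the true values.
    model.geom_friction[:, 0] = rng.uniform(0.3, 1.5, model.ngeom)      # sliding
    model.geom_friction[:, 1] = rng.uniform(0.001, 0.01, model.ngeom)   # torsional
    model.geom_friction[:, 2] = rng.uniform(0.0001, 0.001, model.ngeom) # rolling

for episode in range(3):
    randomize_contact(model, rng)
    mujoco.mj_resetData(model, data)
    for _ in range(1000):
        mujoco.mj_step(model, data)   # a policy would pick actions here
    print(episode, data.qpos[:3])     # box position after the rollout
```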

beau_g

On a freestanding humanoid robot, you have an inverse kinematic chain running all the way from the touch point to the ground, with many actuators in between, each of which to some degree squares the complexity of the problem. The parent article mentions a Fanuc or Kuka bot, which let's say is 6-axis: they are incredibly stiff and strong, in many cases orders of magnitude stronger than they need to be for the job they are tasked with; they do not move; and modeling things like clashing with the environment or themselves is much simpler because they are placed in 100% controlled environments. Remove all of those qualifiers (a weak robot because it needs to be light, a dynamic environment, and count the DOF between the robot's fingers and its ankles) and you get a clearer picture than the article offers of why all this stuff is difficult. You can't take much of a divide-and-conquer approach like you can in other domains.
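
To make the chain-length point concrete, here's a toy sketch (my own illustration, not from the article) of planar forward kinematics: each joint's rotation composes into the fingertip pose, so small per-joint errors that a stiff 6-axis arm can shrug off add up fast over a long ankle-to-fingertip chain:

```python
import numpy as np

def fk_planar(joint_angles, link_lengths):
    """Forward kinematics of a planar serial chain: compose one rotation and
    one translation per joint, return the end-effector (x, y, heading)."""
    x, y, theta = 0.0, 0.0, 0.0
    for q, l in zip(joint_angles, link_lengths):
        theta += q              # each joint adds its rotation...
        x += l * np.cos(theta)  # ...and moves the tip along the new heading
        y += l * np.sin(theta)
    return x, y, theta

# A 6-joint "arm" vs. a long ankle-to-fingertip chain: tiny per-joint errors
# (here 0.5 degrees of backlash/flex each, a made-up number) compound at the tip.
rng = np.random.default_rng(1)
for n_joints in (6, 30):
    lengths = np.full(n_joints, 0.1)                       # 10 cm links
    q = np.zeros(n_joints)
    noise = np.deg2rad(0.5) * rng.standard_normal(n_joints)
    tip = np.array(fk_planar(q, lengths)[:2])
    tip_noisy = np.array(fk_planar(q + noise, lengths)[:2])
    print(n_joints, "joints, tip error:", np.linalg.norm(tip - tip_noisy))
```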

sho_hn

This rings super plausible to me. I dabbled a bit in hobby electronics making DIY walkers, and the more time you spend on junior stuff like that (trying to model a good response to servo load feedback that works in every situation, etc.) the more it dawns on you that what humans and other animals do with the sensor feedback they get from their limbs is so rich in "magic" and intelligence.

Figuring out physical interaction with the environment and traversal is truly one of the most stunning early achievements of life.

rapjr9

Dexterity is also hard because, at least in humans, it relies on knowing something of the nature of an object _before_ manipulating it. Is it light or heavy? Soft or rigid? Is it a bag of popcorn, popcorn kernels, a bag of powder, or a pillow? How tightly is it packed in the bag? Fabric or cardboard? Attached to other objects or not? Is the USB plug the right type and oriented correctly? (Even humans have trouble with this one.) Does it have a slippery surface or a grippy surface? To be immediately successful in manipulation, pre-knowledge based on sensing and identification is usually required. Possibly it would be ok if a robot took several tries to figure this out based on some general principles, but it will seem clumsy and be slower. It seems there is an ontology problem here, which requires understanding a lot about the world in order to be able to successfully manipulate it.

More generally, continuous learning in real-time is something current models don't do well. Retraining an entire LLM every time something new is encountered is not scalable. Temporary learning does not easily transfer to long term knowledge. Continuous learning still seems in its infancy.

pixl97

Also, when we don't know the properties of an object we are about to manipulate, we'll approach it cautiously and learn them before we apply too much force. This tends to happen transparently and quickly for adults, but with infants you can watch it play out more slowly.
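
Roughly, the control-loop version of that caution might look like the sketch below; read_force() and all the thresholds are hypothetical placeholders, not any real robot's API:

```python
import random

# Hypothetical stand-in for a fingertip force sensor; a real robot would
# read this from hardware instead of simulating a linear spring + noise.
def read_force(grip, stiffness):
    return max(0.0, stiffness * grip + random.gauss(0.0, 0.02))

def cautious_grasp(stiffness, step=0.1, max_force=5.0):
    """Close the gripper in small increments, estimate object stiffness from
    force feedback, and stop once contact is firm but still gentle."""
    grip, estimate = 0.0, None
    while True:
        grip += step
        f = read_force(grip, stiffness)
        if f > 0.1:                 # contact detected
            estimate = f / grip     # crude stiffness estimate
            step = 0.05             # slow down once touching
        if f >= 1.0 or grip * (estimate or 0.0) >= max_force:
            return grip, estimate, f

# A soft bag of popcorn vs. a rigid mug: same loop, very different final grips.
for name, k in [("soft bag", 0.2), ("rigid mug", 5.0)]:
    grip, k_est, f = cautious_grasp(k)
    print(f"{name}: grip={grip:.2f}, est. stiffness={k_est:.2f}, force={f:.2f}")
```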

marcosdumay

My guess is that it helps a lot that we have flexible cushioned fingertips that are highly sensitive to pressure. That's a hardware feature that robots mostly lack.

hahaxdxd123

Here's an interesting blog post on the limitations of domain randomization for OpenAI's results: https://www.alexirpan.com/2019/10/29/openai-rubiks.html

Basically the solve rate was much lower without the use of a Bluetooth sensor, and they did a bunch of other things that made the result less impressive. Still a long way to go here.

m3kw9

Just today I noticed that, without looking, I could tell by feel that there were two objects in a bag instead of one. That tells me we likely have 1000s of different types of sensors and we combine them all to form meaning, and dexterity goes hand in hand with that.

Zigurd

There are half a dozen successful commercially available surgical robot products out there. None try to mimic a surgeon's hands.

Even if biomimicry turns out to be a useful strategy in designing general purpose robots, I would bet against humans being the right shape to mimic. And that's assuming general purpose robots will ever be more useful than robots designed or configured for specific tasks.

michaelt

The reason people keep working on human-like hands for robots is: The world is absolutely full of things adapted to be operated with human hands.

Handling heavy boxes? Baking a cake? Operating a circular saw? Assembling a PC? Performing surgery? Loading a ream of paper into a printer? Playing a violin? Opening a door? You can do it all with two five-fingered hands.

ndileas

I think it's important to note that many individual humans are adapted to only a few of these tasks. A construction worker's hands and a magician's have very different muscles, skin thickness, grip strength, dexterity, etc., even though they can both wash a dish and open a door.

beefnugs

It's because, after they saw how big a sucker everyone is for "AI", of course they can sell dumbasses a $60k vaguely human-shaped thing that still won't be able to do laundry or dishes or answer the door or screw in a screw or step over a puppy.

DGAP

Do these challenges apply to surgical robots? There's a lot of interest in essentially creating automated da Vincis, for which there is a great deal of training data and for which the robots are pre-positioned.

Maybe all this setup means that completing surgical tasks doesn't count as dexterity.