Gemini Robotics brings AI into the physical world

fusslo

I'm a firmware engineer who's been working in consumer electronics, and I feel very bleak about my future. I feel so left behind. I have extremely limited robotics and computer vision experience. I have no ML experience. The only math I know has to do with basic signal processing.

When I see open roles at these companies I think the projects I'm going to work on in the future will be more and more irrelevant to society as a whole.

Anyway, this is amazing. Please delete/remove my post if it seems like this adds nothing to the conversation

daemonologist

There's one shot that stood out to me, right at the end of the main video, where the robot puts a round belt on a pulley: https://youtu.be/4MvGnmmP3c0?si=f9dOIbgq58EUz-PW&t=163 . Of course there are probably many examples of this exact action in its training data, but it felt very intuitive in a way the shirt-folding and object-sorting tasks in these demos usually don't.

(Also there seems to be some kind of video auto-play/pause/scroll thing going on with the page? Whatever it is, it's broken.)

krunck

That stood out for me as well. But only because the humans seemed to be inept.

metayrnc

I am not sure whether the videos are representative of real-life performance or a marketing stunt, but it sure looks impressive. Reminds me of the robot arm in Iron Man 1.

ksynwa

AI demos and even live presentations have exacerbated my trust issues. The tech has great uses, but there is no modesty from the proprietors.

whereismyacc

I thought it was really cool when it picked up the grapes by the vine.

edit: it didn't.

yorwba

Here it looks like it's squeezing a grape instead: https://www.youtube.com/watch?v=HyQs2OAIf-I&t=43s . A bit hard to tell whether it remained intact.

whereismyacc

Welp, I guess I should get my sight checked.

saberience

[flagged]

nomel

This is, nearly exactly, like saying you've seen screens slowly display text before, so you're not impressed with LLMs.

How it's doing it is the impressive part.

KoolKat23

For the most part that's been on known objects; these are objects it has not seen.

mkagenius

Not specifically trained on, but most likely the vision models have seen it. Vision models like Gemini Flash/Pro are already good at vision tasks on phones[1], like clicking on UI elements and scrolling to find stuff, etc. The planning of what steps to perform is also quite good with the Pro model (slightly worse than GPT-4o in my opinion).

1. A framework to control your phone using Gemini - https://github.com/BandarLabs/clickclickclick
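
For a rough sense of the general pattern such tools use, here is a minimal, hypothetical sketch with the google-generativeai Python SDK. The model name, prompt, screenshot path, and output format are placeholders, not taken from clickclickclick itself:

    # Hypothetical sketch: ask a Gemini vision model where to tap on a phone screenshot.
    # Model name, prompt, and file path are illustrative; adapt to your own setup.
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")            # assumption: key supplied by you
    model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

    screenshot = Image.open("screenshot.png")          # a phone screenshot
    prompt = (
        "You control an Android phone. Return the (x, y) pixel coordinate to tap "
        "in order to open the Settings app, as JSON: {\"x\": ..., \"y\": ...}"
    )

    response = model.generate_content([prompt, screenshot])
    print(response.text)  # e.g. a JSON coordinate you could parse and dispatch via adb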

asadm

The difference is the dynamic nature of things here.

Current arms and their workspaces are calibrated to the millimeter. Here it's messier.

Older algorithms are more brittle than having a model do it.

beklein

Here's the link to the full playlist with 20 video demonstrations (around 1min each) on YouTube: https://www.youtube.com/watch?v=4MvGnmmP3c0&list=PLqYmG7hTra...

sgerenser

So have labels like "Autonomous 1x" actually been a thing Google has used before, or is it meant as an "inside joke" jab at Tesla's previous videos, which had small labels indicating the video was sped up and/or human controlled?

sgillen

Videos like these are so often sped up or teleoperated that I don't think it's really a jab at anyone specifically; it's just making clear that this video shows an autonomous agent without any speedup.

gatinsama

The problem with Google is that their ad business brings so much revenue that no other product makes sense. They will use whatever they learn with robots to raise their ad revenue, somehow.

Viliam1234

Probably will use the robots to spy on their users in real life, and then sell the information to the advertisers.

Animats

I'd like to see more about what the Gemini system actually tells the robot. Eventually, it comes down to motor commands. It's not clear how they get there.

lquist

How does this compare to what Physical Intelligence is up to?

lenerdenator

> To further assess the societal implications of our work, we collaborate with experts in our Responsible Development and Innovation team as well as our Responsibility and Safety Council, an internal review group committed to ensuring we develop AI applications responsibly. We also consult with external specialists on particular challenges and opportunities presented by embodied AI in robotics applications.

Well, for now, at least.

I know who will be the first shown the door when the next round of layoffs comes: the guy saying "you can't make money that way."

fbn79

I suspect that if a nuclear war brought humans to extinction tomorrow, this project could be looked at by hypothetical aliens visiting our planet in the future as the "Antikythera mechanism" of our times. (Well... if we can trust the video.)

fusionadvocate

Robotics has been trying the same ideas for who knows how many years. The field still believes it will work now, somehow.

Perhaps it escapes the brightest minds at Google that people can grasp things with their eyes closed, that we don't need to see to grasp. But designing good robots with tactile sensors is too much for our top researchers.

FL33TW00D

Everything is an abject failure... until it works.

All the best ideas are tried repeatedly until the input technologies are ripe enough.

sjkelly

This is a lack of impulse response data, usually broken by motor control paradigms. I reread Cybernetics by Norbert Wiener recently, and this is one of the fundamental insights he had. Once we go from Position/Velocity/Torque down to encoder ticks, resolver ADCs, and PWM, we will have proprioception as you'd expect. This also requires cycle times several orders of magnitude faster, plus variable-rate controllers.
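
A toy sketch of what that shift might look like, with an invented gain, loop rate, and motor model standing in for real encoder/PWM hardware (nothing here is from an actual drive stack):

    # Toy sketch: a fast inner loop that works directly on raw encoder ticks and a
    # PWM duty cycle, instead of handing a black-box drive a high-level
    # Position/Velocity/Torque setpoint. The "motor" below is a made-up simulation.
    KP = 0.004          # illustrative proportional gain (duty per tick of error)
    DT = 0.0005         # 2 kHz loop; the argument is that real gains need far faster cycles
    MAX_SPEED = 8000.0  # toy model: full duty moves the shaft 8000 ticks/s

    def step_motor(ticks, duty):
        """Stand-in for real hardware: integrate the commanded duty into encoder ticks."""
        return ticks + duty * MAX_SPEED * DT

    def servo_to(target_ticks, steps=20000):
        ticks = 0.0
        for _ in range(steps):
            err = target_ticks - ticks             # error in raw encoder ticks
            duty = max(-1.0, min(1.0, KP * err))   # proportional law -> clamped PWM duty
            ticks = step_motor(ticks, duty)        # would be write_pwm()/read_encoder() in firmware
        return ticks

    print(round(servo_to(4096)))  # settles at the 4096-tick target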

intalentive

Tactile input is a nice-to-have but unnecessary. A human can pilot a robot through image sensors alone.

fusionadvocate

I think this is correct, to an extent. But consider handling an egg while your arm is numb. It would be difficult.

But perhaps a great benefit of tactile input is its simplicity. Instead of processing thousands of pixels, which are susceptible to interference from changing light conditions, one only has to process perhaps a few dozen tactile inputs.
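
A toy sketch of that simplicity argument; the taxel count, force threshold, and simulated contact model are all made up for illustration:

    # Toy sketch: close a gripper using a dozen tactile readings instead of a camera.
    import random

    NUM_TAXELS = 12    # "a few dozen tactile inputs" vs. thousands of pixels
    GRIP_FORCE = 2.5   # stop squeezing once total contact force reaches this (arbitrary units)

    def read_taxels(closure):
        """Stand-in for a real tactile pad: contact force grows as the gripper closes."""
        return [max(0.0, closure - 0.6) * random.uniform(0.8, 1.2) for _ in range(NUM_TAXELS)]

    def grasp():
        closure = 0.0                        # 0 = open, 1 = fully closed
        while closure < 1.0:
            forces = read_taxels(closure)
            if sum(forces) >= GRIP_FORCE:    # enough contact -> hold, don't crush
                return closure, sum(forces)
            closure += 0.02                  # keep closing a little
        return closure, sum(read_taxels(closure))

    print(grasp())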

refulgentis

I'm a bit confused.

Ex-Googler so maybe I'm just spoiled by access to non-public information?

But I'm fairly sure there's plenty of public material of Google robots gripping.

Is it a play on words?

Like, "we don't need to see to grasp", but obviously that isn't what you meant. We just don't need to if we saw it previously, and it hadn't moved.

EDIT: It does look like the video demonstrates this, including why you can't forgo vision (changing conditions, see 1m02s https://youtu.be/4MvGnmmP3c0?t=62)

DoingIsLearning

I think the point GP is raising is that most robotics development in the past several decades has focused on motion control and perception through visual servoing.

Those are realistically the 'natural' developments in the domain knowledge of Robotics/Computer Science.

However, what GP (I think) is raising is the blind spot that robotics currently has on proprioception and tactile sensing at the end-effector as well as along the kinematic chain.

As in, you can accomplish this with just kinematic position and force feedback plus visual servoing. But if you think of any dexterous primate, they will handle an object and perceive texture, compliance, brittleness, etc. in a much richer way than any state-of-the-art robotic end-effector.

Unless you devote significant research to creating miniaturized sensors that give a robot an approximation of the information-rich sources in human skin, connective tissue, muscle, and joints (tactile sensors, tensile sensors, vibration sensors, force sensors), that blind spot remains.

osigurdson

I think plumbers are safe for a while.