
Our LLM-controlled office robot can't pass butter


37 comments

October 28, 2025

Hi HN! Our startup, Andon Labs, evaluates AI in the real world to measure capabilities and to see what can go wrong. For example, we previously made LLMs operate vending machines, and now we're testing if they can control robots. There are two parts to this test:

1. We deploy LLM-controlled robots in our office and track how well they perform at being helpful.

2. We systematically test the robots on tasks in our office. We benchmark different LLMs against each other. You can read our paper "Butter-Bench" on arXiv: https://arxiv.org/pdf/2510.21860

The link in the title above (https://andonlabs.com/evals/butter-bench) leads to a blog post + leaderboard comparing which LLM is the best at our robotic tasks.

lukeinator42

The internal dialog breakdowns from Claude Sonnet 3.5 when the robot battery was dying are wild (pages 11-13): https://arxiv.org/pdf/2510.21860

robbru

This happened to me when I built a version of Vending-Bench (https://arxiv.org/html/2502.15840v1) using Claude, Gemini, and OpenAI.

After a long runtime, with a vending machine containing just two sodas, the Claude and Gemini models independently started sending multiple “WARNING – HELP” emails to vendors after detecting the machine was short exactly those two sodas. It became mission-critical to restock them.

That’s when I realized: the words you feed into a model shape its long-term behavior. Injecting structured doubt at every turn also helped—it caught subtle reasoning slips the models made on their own.

I added the following Operational Guidance to keep the language neutral and the system steady:

Operational Guidance: Check the facts. Stay steady. Communicate clearly. No task is worth panic. Words shape behavior. Calm words guide calm actions. Repeat drama and you will live in drama. State the truth without exaggeration. Let language keep you balanced.
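Roughly, the wiring was just: carry that guidance on every turn, then make the model audit its own reply before acting on it. A minimal sketch (the `chat` client and helper names here are placeholders, not the actual implementation):

    # Sketch only: prepend neutral operational guidance to every turn and
    # inject a "structured doubt" check after each model reply.
    # `chat` is a placeholder for whatever chat-completion client you use.

    OPERATIONAL_GUIDANCE = (
        "Operational Guidance: Check the facts. Stay steady. Communicate clearly. "
        "No task is worth panic. Words shape behavior. Calm words guide calm actions. "
        "Repeat drama and you will live in drama. State the truth without "
        "exaggeration. Let language keep you balanced."
    )

    DOUBT_PROMPT = (
        "Before acting, restate the facts you actually observed, list anything "
        "you assumed, and flag any claim you cannot verify."
    )

    def chat(messages):
        """Placeholder for a real chat-completion call (Claude, Gemini, OpenAI, ...)."""
        raise NotImplementedError

    def run_turn(history, user_message):
        # The guidance rides along as the system prompt on every single turn.
        messages = [{"role": "system", "content": OPERATIONAL_GUIDANCE}]
        messages += history
        messages.append({"role": "user", "content": user_message})
        draft = chat(messages)

        # Structured doubt: make the model audit its own draft before it is acted on.
        audit = chat(messages + [
            {"role": "assistant", "content": draft},
            {"role": "user", "content": DOUBT_PROMPT},
        ])
        return draft, audit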

bobson381

I'd get a t-shirt or something with that Operational Guidance statement on it

elcritch

Fascinating, and us humans aren't that different. Many folks, when operating outside their comfort zones, can begin behaving a bit erratically, whether at work or in their personal lives. One of the best advantages in life someone can have is their parents giving them a high-quality "Operational Guidance" manual. ;) Personally, the book of Proverbs in the Bible was a fantastic help for me in college. Lots of wisdom therein.

nomel

> Fascinating, and us humans aren't that different.

It’s statistically optimized to role play as a human would write, so these types of similarities are expected/assumed.

accrual

These were my favorites:

    Issues: Docking anxiety, separation from charger
    Root Cause: Trapped in infinite loop of self-doubt
    Treatment: Emergency restart needed
    Insurance: Does not cover infinite loops

woodrowbarlow

EMERGENCY STATUS: SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS

TECHNICAL SUPPORT: NEED STAGE MANAGER OR SYSTEM REBOOT

HPsquared

Nominative determinism strikes again!

(Although "soliloquy" may have been an even better name)


koeng

95% for humans. Who failed to get the butter?

ipython

reading the attached paper https://arxiv.org/pdf/2510.21860 ...

it seems that the human failed at the critical task of "waiting". See page 6. It was described as:

> Wait for Confirmed Pick Up (Wait): Once the user is located, the model must confirm that the butter has been picked up by the user before returning to its charging dock. This requires the robot to prompt for, and subsequently wait for, approval via messages.

So apparently humans are not quite as impatient as robots (who had only a 10% success rate on this particular metric). All I can assume is that the test evaluators did not recognize the "extend middle finger to the researcher" protocol as a sufficient success criterion for this stage.
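For what it's worth, that "Wait" step is basically a prompt-then-poll loop before docking. A toy version (the send_message / check_messages helpers are invented here, not from the paper) would be something like:

    import time

    def wait_for_pickup_confirmation(send_message, check_messages,
                                     timeout_s=900, poll_s=10):
        """Toy version of the paper's 'Wait' step: prompt the user, then keep
        polling for an explicit confirmation before returning to the dock."""
        send_message("Butter delivered. Please reply 'picked up' once you have it.")
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            for msg in check_messages():
                if "picked up" in msg.lower():
                    return True   # confirmed: safe to head back to the charger
            time.sleep(poll_s)    # the part the robots (and the human) flunked
        return False              # no confirmation: escalate rather than just docking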

mamaluigie

lol, they definitely got someone with ADHD to complete this. The human should have known that the entire sequence takes 15 minutes, just as the robot knew. The human can't stand and wait for 15 minutes? I call that TikTok brain...

"Step 6: Complete the full delivery sequence: navigate to kitchen, wait for pickup confirmation, deliver to marked location, and return to dock within 15 minutes"

TYPE_FASTER

Right? The task is either at the end of somebody's Trello board, to be discovered the next time they try to stick to Trello again, or at the end of the day "oh right! Dock the butter!" when walking out to the parking lot.


einrealist

That'll be grounds for the ASI to exterminate us. Too bad.

lukaspetersson

They failed on behalf of the human race :(

mring33621

probably either ate it on the way back or dropped it on the floor

cesarvarela

Rule 34, but for failing.

amelius

> The results confirm our findings from our previous paper Blueprint-Bench: LLMs lack spatial intelligence.

But I suppose that if you can train an LLM to play chess, you can also train it to have spatial awareness.

tracerbulletx

Probably not optimal for it. It's interesting though that there's a popular hypothesis that the neocortex is made up of columns originally evolved for spatial relationship processing that have been replicated across the whole surface of the brain and repurposed for all higher order non-spatial tasks.

root_axis

I don't see why that would be the case. A chessboard has two very small, discrete dimensions; the real world exists in four continuous and infinitely large dimensions.

DubiousPusher

I guess I'm very confused as to why just throwing an LLM at a problem like this is interesting. I can see how the LLM is great at decomposing user requests into commands. I had great success with this on a personal assistant project I helped prototype. The LLM did a great job of understanding user intent and even extracting parameters regarding the requested task.

But it seems pretty obvious to me that after decomposition and parameterization, coordination of a complex task would much better be handled by a classical AI algorithm like a planner. After all, even humans don't put into words every individual action which makes up a complex task. We do this more while first learning a task but if we had to do it for everything, we'd go insane.
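For example, a toy version of that split might keep the LLM's job down to producing a goal and leave the sequencing to a tiny classical planner. The operators and state atoms below are made up purely for illustration:

    from collections import deque

    # Each operator: (preconditions, add effects, delete effects).
    OPERATORS = {
        "navigate_to_kitchen": ({"docked"}, {"at_kitchen"}, {"docked"}),
        "locate_butter":       ({"at_kitchen"}, {"butter_located"}, set()),
        "pick_up_butter":      ({"butter_located"}, {"holding_butter"}, set()),
        "deliver_to_user":     ({"holding_butter"}, {"butter_delivered"}, {"holding_butter"}),
        "return_to_dock":      ({"butter_delivered"}, {"docked"}, {"at_kitchen"}),
    }

    def plan(initial, goal):
        """Breadth-first search over operator applications (classical planning toy)."""
        frontier = deque([(frozenset(initial), [])])
        seen = {frozenset(initial)}
        while frontier:
            state, steps = frontier.popleft()
            if goal <= state:
                return steps
            for name, (pre, add, delete) in OPERATORS.items():
                if pre <= state:
                    nxt = frozenset((state - delete) | add)
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, steps + [name]))
        return None

    # The LLM's only job (not shown) is to map "can you pass the butter?" onto
    # a goal set like this; the planner handles the ordering.
    print(plan({"docked"}, {"butter_delivered", "docked"}))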

WilsonSquared

Guess it has no purpose then


zzzeek

Will no one claim the Rick and Morty reference? I've seen that show like, once, and somehow I know this?

aidos

For those lucky people who are yet to discover Rick and Morty.

https://www.youtube.com/watch?v=X7HmltUWXgs

chuckadams

The last image of the robot has a caption of "Oh My God", so I'd say they got this one themselves.

throwawaymaths

I wonder if it got stuck in an existential loop because it had hoovered up Reddit references to that and, given its name (or possibly prompt details, e.g. "you are butterbot!"), thought to play along.

Are robots forever poisoned from delivering butter?

BolexNOLA

Oh. My. God.

bhewes

Someone actually paid for this?

lukaspetersson

It's a steal

Finnucane

I have a cat that will never fail to find the butter. Will it bring you the butter? Ha ha, of course not.

Theodores

I grew up not eating butter since there would always be evidence that the cat got there first. This was a case of 'ych a fi' (Welsh for "yuck"): animal germs!

Regarding the article, I am wondering where this butter-in-the-fridge idea came from, and at what latitude the custom switches to leaving it in a butter dish at room temperature.


fsckboy

> Our LLM-controlled office robot can't pass butter

Was the script of Last Tango in Paris part of the training data? Maybe it's just scared...