
Notes on Anthropic's Computer Use Ability

imranq

This is basically RPA with LLMs. And RPA is basically the worst possible solution to any problem.

Agents won't get anywhere because any user process you want to automate is better done by creating APIs and a proper, guaranteed interface. Any automated "computer use" will always be a one-off, absurdly expensive, and completely impractical.

bonoboTP

This kind of stuff is an existential threat to ad-based business models and upselling. If users no longer browse the web themselves, you can't show them ads. It's a monumental, Earth-shattering problem for behemoths like Google, but also for normal websites. Lots of websites (such as booking.com) rely on shady practices to mislead users and upsell them, etc. If you have a dispassionate, smart computer agent doing the transaction, it will only buy what's needed to accomplish the task.

There will be an enormous push towards steering these software agents towards similarly shady practices instead of making them act in the true interest of the user. The ads will be built into the weights of the model or something.

kredd

Still not a problem for Meta/TikTok/YouTube though, as people go there to consume content on purpose. But I agree, will be fun to see how Google and others will deal with it.

tracerbulletx

Ads will move to the layer of the new interface when that happens. Also, a computer can't watch a YouTube video for you or look at funny cat pictures. You can still put ads next to things people want to look at.

acrooks

I've built a couple of experiments using it so far and it has been really interesting.

On one hand, it has really helped me with prototyping incredibly fast.

On the other, it is prohibitively expensive today. Essentially you pay per click, in some cases per keystroke. I tried to get it to find a flight for me. So it opened the browser, navigated to Google Flights, entered the origin, destination etc. etc. By the time it saw a price, there had already been more than a dozen LLM calls. And then it crashed due to a rate limit issue. By the time I got a list of flight recommendations it was already $5.
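
For a rough sense of how those per-click calls add up, here's a back-of-the-envelope sketch; every token count and price in it is an assumption for illustration, not a published figure.

```python
# Back-of-the-envelope cost model for a computer-use loop. Every figure here
# is an illustrative assumption, not a published price sheet.

TOKENS_PER_SCREENSHOT = 1_600    # assumed tokens for one full-screen image
TEXT_TOKENS_PER_TURN = 500       # assumed prompt + action text per step
PRICE_IN_PER_MTOK = 3.00         # assumed $/million input tokens
PRICE_OUT_PER_MTOK = 15.00       # assumed $/million output tokens
OUTPUT_TOKENS_PER_TURN = 200     # assumed tokens for the model's reply

def estimated_cost(num_steps: int, resend_history: bool = True) -> float:
    """Dollars spent on a task that takes `num_steps` clicks/keystrokes.

    If the full conversation (including earlier screenshots) is resent on
    every turn, input cost grows roughly quadratically with the step count.
    """
    total_in = 0
    for step in range(1, num_steps + 1):
        screenshots = step if resend_history else 1
        total_in += screenshots * TOKENS_PER_SCREENSHOT + TEXT_TOKENS_PER_TURN
    total_out = num_steps * OUTPUT_TOKENS_PER_TURN
    return total_in / 1e6 * PRICE_IN_PER_MTOK + total_out / 1e6 * PRICE_OUT_PER_MTOK

print(f"15-step task: ~${estimated_cost(15):.2f}")
print(f"40-step task: ~${estimated_cost(40):.2f}")
```

The point is that the cost scales with every individual UI step, and can scale worse than linearly if prior screenshots stay in the context window.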

But I think this is intended to be an early demo of what will be possible in the future. And they were very explicit that it's a beta: all of this feedback above will help them make it better. Very quickly it will get more efficient, less expensive, more reliable.

So overall I'm optimistic to see where this goes. There are SO many applications for this once it's working really well.

steveBK123

I guess I'm confused there's even a use case there. It's like "let me google that for you". I mean Siri can return me search results for flights.

A real killer app would be something that is adaptive and smart enough to deal with all the SEO/walled gardens in the travel search space, actually understanding the airlines available and searching directly there as well as at aggregators. It could also be integrated with your Airline miles accounts and all suggested options to use miles/miles&cash/cash, etc.

All of that is far more complex than... clicking around Google Flights on your behalf and crashing.

Further, the real killer app is one bulletproof enough that you'd entrust it to book said best flight for you. This requires getting the product to 99.99% rather than the perpetual 70-80% we are seeing all these LLM use cases hit.

sithadmin

The airline booking + awards redemption use case is a mostly solved problem. Hardcore mileage redemption enthusiasts use paid tools like ExpertFlyer that present a UI and API for peeking into airline reservation backends. It has a steep learning curve, for sure.

ThePointsGuy blog tried to implement something that directly tied into airline accounts to track mileage/points and redemption options, but I believe they got slapped down by several airlines for unauthorized scraping. Airlines do NOT like third parties having access to frequent flier accounts.

acrooks

While the strategy to find good deals / award space is a solved problem, the search tools to do so aren't. Tools like ExpertFlyer are super inefficient: they permit a maximum of one origin + one destination + one airline per search. What if you're happy to go anywhere in Western Europe? Or if you want to check several different airlines? Then all of a sudden your one EF search might turn into dozens. And as you say, pretty much all of the aggregator tools are getting slapped down by airlines, so they increasingly have more limited availability and some are shutting down completely.

And then add the complexity that you might be willing to pay cash if the price is right ... so then you add dozens more searches to that on potentially many websites.

All of this is "easy" and a solved problem but it's incredibly monotonous. And almost none of these services offer an API, so it's difficult to automate without a browser-based approach. And a travel agent won't work this hard for you. So how amazing would it be instead to tell an AI agent what you want, have it pretend to be you for a few minutes, and get a CSV file in your inbox at the end.
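
To show how fast "one search" fans out, here's a minimal sketch of the cross-product an agent would have to grind through; the airports, airlines, and the `run_search` helper are hypothetical placeholders, not real integrations.

```python
import csv
import itertools

# Hypothetical inputs: "anywhere in Western Europe" from two home airports,
# checked across several carriers. run_search() stands in for whatever the
# agent actually does (drive ExpertFlyer, an airline site, etc.).
origins = ["JFK", "EWR"]
destinations = ["LHR", "CDG", "AMS", "FRA", "MAD", "FCO", "ZRH", "LIS"]
airlines = ["AA", "BA", "AF", "LH", "IB"]

def run_search(origin: str, dest: str, airline: str) -> dict:
    """Placeholder for one ExpertFlyer-style search (one origin, one dest, one airline)."""
    return {"origin": origin, "dest": dest, "airline": airline, "award_space": "?"}

combos = list(itertools.product(origins, destinations, airlines))
print(f"{len(combos)} individual searches")  # 2 * 8 * 5 = 80 searches

with open("award_space.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["origin", "dest", "airline", "award_space"])
    writer.writeheader()
    for o, d, a in combos:
        writer.writerow(run_search(o, d, a))
```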

Whether this could be commercialised is a different question but I'm certainly going to continue building out my prototype to save myself some time (I mean, to be fair, it will probably take more time to build something to do this on my behalf but I think it's time well spent).

steveBK123

Yes, that seems to be the larger challenge. The search tools I have used will work for a while until they don't. Real cat & mouse game.

Hence the "adaptive" part of my comment.

It really needs to be a client side agent.

danielbln

Haiku 3.5 will be here soon, and will before long support tool use and vision, so that should help a lot with cost.

inquisitor27552

time is also a huge factor on this one, should be a nice metric

god the future is here haha

elif

If its only downside is cost, and cost is prohibitively high for all practical uses,

Why didn't this project start with https://huggingface.co/meta-llama/Llama-3.2-11B-Vision

azinman2

I saw those demoed yesterday. The model was asked to create a cool visualization. It ultimately tried to install streamlit and go to its page, only to find its own Claude software already running streamlit, so as part of debugging it killed itself. Not ready to let that go wild on my own computer!

Jayakumark

Any idea how Sonnet does this? Is the image annotated with bounding boxes on text boxes etc., along with their coordinates, before being sent to Sonnet, with Sonnet responding with a box name or coordinates back? Or is SAM2 used to segment everything before sending to Sonnet?

belval

The product I would like to see out of this is a way to automate UI QA.

Ideally it would be given a persona and a list of use cases, try to accomplish each task and save the state where you/it failed.

Something like a Chrome Lighthouse but for usability. Bonus points if it can highlight what part of my documentation is using mismatched terminology, making it difficult for newcomers to understand what button I am referring to.
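
A rough sketch of what that harness might look like, assuming a stubbed `agent_attempt` that would wrap the actual computer-use loop; the persona format and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    description: str  # e.g. "first-time user, unfamiliar with the product's jargon"

@dataclass
class UseCaseResult:
    task: str
    succeeded: bool
    failed_at_step: str | None = None
    screenshot_path: str | None = None

def agent_attempt(persona: Persona, task: str) -> UseCaseResult:
    """Placeholder: drive the UI as `persona` trying to accomplish `task`,
    capturing a screenshot at the step where it gets stuck."""
    # Stubbed out here; a real implementation would call a computer-use agent.
    return UseCaseResult(task=task, succeeded=False, failed_at_step="stub")

def run_usability_suite(persona: Persona, tasks: list[str]) -> list[UseCaseResult]:
    results = []
    for task in tasks:
        result = agent_attempt(persona, task)
        if not result.succeeded:
            print(f"[{persona.name}] stuck on '{task}' at: {result.failed_at_step}")
        results.append(result)
    return results

# Example: a Lighthouse-for-usability style run
newcomer = Persona("newcomer", "has read only the quickstart docs")
run_usability_suite(newcomer, ["sign up", "create a project", "invite a teammate"])
```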

steveBK123

I've seen similar sentiment even pre-LLM that AI would help automate other forms of testing, and I just don't quite see it.

Implementing tests is not the hard part. You could make that an intern project or hire a consultant for 3 months. The hard part is the interpretation of results.

That is - making a thing that spits out tickets/alerts is easy. The signal/noise tuning and actual investigation workflows are the hard part and still very manual & human operated. I don't see LLM mouse/keyboard control changing that yet.

belval

> making a thing that spits out tickets/alerts is easy.

I don't really believe that what I am asking for is hard, yet I still can't buy it as far as I know.

> actual investigation workflows are the hard part and still very manual & human operated.

Sure, but it would allow your QA worker to have pre-tested, use-case-based paths, each with a flag on whether it may be problematic, plus a screen recording and a timestamp of where it went wrong.

These will always need human-in-the-loop to vet the findings before cutting a ticket to development team.

steveBK123

Fair - I'm not personally familiar with the state of the art in UI QA automation, but I know there have been various screen-recording-type tools available for a decade+ with mixed success.

I come more from a "big data" background, and have dealt with CTOs who think "can't we just use AI?" is the answer to data quality checking multi-PB data lakes with 1000s of unique datasets from 100s of vendors. That is - they don't want to staff a data quality team, they think you can just magic it all away.

The answer was always - sure, but you are fixated on the easy part - anomaly detection. Actual data analysis on what broke, when, how, why, and escalating to the data provider was always 95% of the work. Someone needs to look at the exhaust, and there will be exhaust every single day... so you can either kill your dev team's productivity or actually staff an operations team responsible for the tickets the thing spits out.

sys32768

So robotic process automation gains intelligence and we can train an AI intern to assist with tasks.

Our own personal digital squire.

Then eventually we become assistants to AI.

_heimdall

I'm all for the MVP approach and shipping quickly, though I'm really surprised they went with image recognition and tooling for injecting mouse/keyboard events for automating human tasks.

I wonder why leveraging accessibility tools for this wouldn't have been a better option. Browsers and operating systems both have pretty comprehensive tooling for accessibility tools like screen readers, and the whole point of those tools is to act as a middle man to programmatically interpret and interact with what's on screen.
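
In a browser, that middle layer already exists as the accessibility tree. Here's a minimal sketch using Playwright's accessibility snapshot (the exact API has shifted across Playwright versions, so treat this as an illustration of the idea rather than the recommended approach):

```python
# Dump the accessibility tree of a page instead of screenshotting it.
# Requires: pip install playwright && playwright install
from playwright.sync_api import sync_playwright

def walk(node: dict, depth: int = 0) -> None:
    """Print role/name pairs, which is roughly what a screen reader navigates."""
    print("  " * depth + f"{node.get('role')}: {node.get('name', '')}")
    for child in node.get("children", []):
        walk(child, depth + 1)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    tree = page.accessibility.snapshot()  # structured roles/names, no pixels needed
    if tree:
        walk(tree)
    browser.close()
```

Feeding the model this kind of structured tree would presumably be far cheaper than a screenshot, at the cost of missing anything the page fails to expose.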

infecto

Those sound like stopgaps at best. The intended goal here is pretty clear: APIs are easy to integrate with, but most systems in existence only have a visual interface intended for humans.

The end goal here is clear: being able to interface with anything available on the screen.

ryukafalz

Accessibility tools are made for humans. If there is information only available visually and not via a screen reader or other accessibility tools, that is a problem that needs to be addressed.

infecto

Accessibility tools, I find, are never as good as the source. Just because they are made for humans does not mean they are an improvement. I imagine they're at best a stopgap as image models improve.

danielbln

I think the reason is that this is the most general implementation. It doesn't need playwright or have access to the DOM or anything else, if it has a screen and mouse/keyboard, then it will work. That's quite powerful (if slow and pricey, at the moment).
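
The whole interface reduces to roughly this loop; this is a paraphrased sketch, not Anthropic's actual implementation, and `take_screenshot`, `model_choose_action`, and `execute` are placeholders for whatever capture, model call, and input-injection mechanism you use.

```python
import time
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", "key", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def take_screenshot() -> bytes:
    """Placeholder: grab the current screen as an image (e.g. an X11/VNC capture)."""
    ...

def model_choose_action(goal: str, screenshot: bytes) -> Action:
    """Placeholder: send the goal + screenshot to the model and get back one
    action expressed in pixel coordinates or text to type."""
    ...

def execute(action: Action) -> None:
    """Placeholder: inject the mouse/keyboard event (e.g. via xdotool)."""
    ...

def run(goal: str, max_steps: int = 50) -> None:
    # Screenshot -> model -> synthetic input event, repeated until the model says it's done.
    for _ in range(max_steps):
        action = model_choose_action(goal, take_screenshot())
        if action.kind == "done":
            return
        execute(action)
        time.sleep(0.5)  # let the UI settle before the next screenshot
```

Nothing in that loop cares whether the screen belongs to a browser, a terminal, or a desktop app, which is exactly why it's both general and expensive.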

elif

Crazy that this needs to be said, but 'Computer use' is a far more expansive domain than Internet browsing...

VBprogrammer

Well, this just opened up a new phase in the captcha wars.

sunilkumardash9

It certainly did; it's like a Pandora's box. Unless they lobotomize them, we can expect Qwen and DeepSeek to release open-source models that can do the same.

nilstycho

It seems like a cheaper intermediate capability would be to give Claude the ability to SSH to your computer or to a cloud container. That would unlock a lot of possibilities, without incurring the cost of the vision model or the difficulty of cursor manipulation.

Does this already exist? If not, would the benefits be lower than I think, or would the costs be higher than I think?
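
A minimal sketch of what that could look like with the standard tool-use API, giving the model a single shell tool instead of a screen; the tool name, the prompt, and the choice to run commands locally via subprocess (rather than over SSH) are illustrative, not an existing Anthropic feature.

```python
# Sketch: give the model a shell instead of a screen. Commands run locally via
# subprocess here; swapping in an SSH client (e.g. paramiko) would target a
# remote container instead.
import subprocess
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SHELL_TOOL = {
    "name": "run_shell",
    "description": "Run a shell command and return its stdout/stderr.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

messages = [{"role": "user", "content": "List the ten largest files under ~/Downloads."}]

while True:
    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[SHELL_TOOL],
        messages=messages,
    )
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":
        break  # the model answered in plain text; we're done
    results = []
    for block in resp.content:
        if block.type == "tool_use" and block.name == "run_shell":
            out = subprocess.run(block.input["command"], shell=True,
                                 capture_output=True, text=True)
            results.append({"type": "tool_result", "tool_use_id": block.id,
                            "content": out.stdout + out.stderr})
    messages.append({"role": "user", "content": results})
```

No vision tokens, no cursor manipulation; the obvious trade-off is that anything without a CLI stays out of reach, and you'd want far stricter sandboxing than `shell=True` on your own machine.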

reportgunner

Not just the benefits and costs, but also the risks are to be considered here, I think.

nilstycho

What are the risks? Isn't this a strict subset of the risks of full desktop access? Claude can just open a GUI terminal with Computer Use. (I think.)

reportgunner

Software posing as Claude that is actually malware, tricking an unsuspecting non-terminal user into executing it, is what I was thinking about.

ActionHank

Anecdotal, but I think if you mention it in any discussion in a corporate setting, alarm bells will go off because HACKERS use SSH.

This _seems_ more like a normal user so clearly could not do anything nefarious. /s

cl42

I really, really like this new product/API offering. Still crashes quite a bit for me and obviously makes mistakes, but shows what's possible.

For the folks who are more savvy on the Docker / Linux front...

1. Did Anthropic have to write its own "control" for the mouse and keyboard? I've tried using `xdotool` and related things in the past and they were very unreliable. (See the xdotool sketch after this comment.)

2. I don't want to dismiss the power and innovation going into this model, but...

(a) Why didn't Adept or someone else focused on RPA build this?

(b) How much of this is standard image recognition and fine-tuning a vision model to a screen, versus something more fundamental?
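
On question 1: a thin Python wrapper around xdotool is roughly all such a loop needs; this is a sketch of that approach, not Anthropic's actual code, and reliability still depends on the X server, focus handling, and timing.

```python
# Sketch: the kind of thin wrapper a computer-use loop needs around xdotool.
# Requires an X11 display (DISPLAY set) and xdotool installed.
import subprocess

def _xdo(*args: str) -> None:
    subprocess.run(["xdotool", *args], check=True)

def click(x: int, y: int, button: int = 1) -> None:
    """Move the pointer to (x, y) and click."""
    _xdo("mousemove", str(x), str(y))
    _xdo("click", str(button))

def type_text(text: str) -> None:
    """Type literal text with a small per-key delay for flaky UIs."""
    _xdo("type", "--delay", "50", text)

def press(key: str) -> None:
    """Press a named key or chord, e.g. 'Return' or 'ctrl+l'."""
    _xdo("key", key)

# Example: focus a browser address bar and search
# press("ctrl+l"); type_text("claude computer use"); press("Return")
```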