
Notes on Anthropic's Computer Use Ability

imranq

This is basically RPA with LLMs. And RPA is basically the worst possible solution to any problem.

Agents won't get anywhere because any user process you want to automate is better done by creating APIs and a proper, guaranteed interface. Any automated "computer use" will always be a one-off, absurdly expensive, and completely impractical.

bonoboTP

This kind of stuff is an existential threat to ad-based business models and upselling. If users no longer browse the web themselves, you can't show them ads. It's a monumental, Earth-shattering problem for behemoths like Google, but also for normal websites. Lots of websites (such as booking.com) rely on shady practices to mislead users and upsell them, etc. If you have a dispassionate, smart computer agent doing the transaction, it will only buy what's needed to accomplish the task.

There will be an enormous push towards steering these software agents towards similarly shady practices instead of making them act in the true interest of the user. The ads will be built into the weights of the model or something.

kredd

Still not a problem for Meta/TikTok/YouTube though, as people go there to consume content on purpose. But I agree, will be fun to see how Google and others will deal with it.

tracerbulletx

Ads will move to the layer of the new interface when that happens. Also, a computer can't watch a YouTube video for you or look at funny cat pictures. You can still put ads next to things people want to look at.

acrooks

I've built a couple of experiments using it so far and it has been really interesting.

On one hand, it has really helped me with prototyping incredibly fast.

On the other, it is prohibitively expensive today. Essentially you pay per click, in some cases per keystroke. I tried to get it to find a flight for me. So it opened the browser, navigated to Google Flights, entered the origin, destination etc. etc. By the time it saw a price, there had already been more than a dozen LLM calls. And then it crashed due to a rate limit issue. By the time I got a list of flight recommendations it was already $5.
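
For a rough sense of how those per-click calls add up, here's a back-of-the-envelope sketch; every token count and price in it is an assumption for illustration, not a published figure.

```python
# Back-of-the-envelope cost model for a computer-use loop. Every figure here
# is an illustrative assumption, not a published price sheet.

TOKENS_PER_SCREENSHOT = 1_600    # assumed tokens for one full-screen image
TEXT_TOKENS_PER_TURN = 500       # assumed prompt + action text per step
PRICE_IN_PER_MTOK = 3.00         # assumed $/million input tokens
PRICE_OUT_PER_MTOK = 15.00       # assumed $/million output tokens
OUTPUT_TOKENS_PER_TURN = 200     # assumed tokens for the model's reply

def estimated_cost(num_steps: int, resend_history: bool = True) -> float:
    """Dollars spent on a task that takes `num_steps` clicks/keystrokes.

    If the full conversation (including earlier screenshots) is resent on
    every turn, input cost grows roughly quadratically with the step count.
    """
    total_in = 0
    for step in range(1, num_steps + 1):
        screenshots = step if resend_history else 1
        total_in += screenshots * TOKENS_PER_SCREENSHOT + TEXT_TOKENS_PER_TURN
    total_out = num_steps * OUTPUT_TOKENS_PER_TURN
    return total_in / 1e6 * PRICE_IN_PER_MTOK + total_out / 1e6 * PRICE_OUT_PER_MTOK

print(f"15-step task: ~${estimated_cost(15):.2f}")
print(f"40-step task: ~${estimated_cost(40):.2f}")
```

The point is that the cost scales with every individual UI step, and can scale worse than linearly if prior screenshots stay in the context window.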

But I think this is intended to be an early demo of what will be possible in the future. And they were very explicit that it's a beta: all of this feedback above will help them make it better. Very quickly it will get more efficient, less expensive, more reliable.

So overall I'm optimistic to see where this goes. There are SO many applications for this once it's working really well.

steveBK123

I guess I'm confused there's even a use case there. It's like "let me google that for you". I mean Siri can return me search results for flights.

A real killer app would be something that is adaptive and smart enough to deal with all the SEO/walled gardens in the travel search space, actually understanding the airlines available and searching directly there as well as at aggregators. It could also be integrated with your Airline miles accounts and all suggested options to use miles/miles&cash/cash, etc.

All of that is far more complex than... clicking around Google Flights on your behalf and crashing.

Further, the real killer app is one bulletproof enough that you'd entrust it to book said best flight for you. This requires getting the product to 99.99% rather than the perpetual 70-80% we are seeing all these LLM use cases hit.

sithadmin

The airline booking + awards redemption use case is a mostly solved problem. Hardcore mileage redemption enthusiasts use paid tools like ExpertFlyer that present a UI and API for peeking into airline reservation backends. It has a steep learning curve, for sure.

ThePointsGuy blog tried to implement something that directly tied into airline accounts to track mileage/points and redemption options, but I believe they got slapped down by several airlines for unauthorized scraping. Airlines do NOT like third parties having access to frequent flier accounts.

acrooks

While the strategy to find good deals / award space is a solved problem, the search tools to do so aren't. Tools like ExpertFlyer are super inefficient: they permit a maximum of one origin + one destination + one airline per search. What if you're happy to go anywhere in Western Europe? Or if you want to check several different airlines? Then all of a sudden your one EF search might turn into dozens. And as you say, pretty much all of the aggregator tools are getting slapped down by airlines, so they increasingly have more limited availability and some are shutting down completely.

And then add the complexity that you might be willing to pay cash if the price is right ... so then you add dozens more searches to that on potentially many websites.

All of this is "easy" and a solved problem but it's incredibly monotonous. And almost none of these services offer an API, so it's difficult to automate without a browser-based approach. And a travel agent won't work this hard for you. So how amazing would it be instead to tell an AI agent what you want, have it pretend to be you for a few minutes, and get a CSV file in your inbox at the end.
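
To show how fast "one search" fans out, here's a minimal sketch of the cross-product an agent would have to grind through; the airports, airlines, and the `run_search` helper are hypothetical placeholders, not real integrations.

```python
import csv
import itertools

# Hypothetical inputs: "anywhere in Western Europe" from two home airports,
# checked across several carriers. run_search() stands in for whatever the
# agent actually does (drive ExpertFlyer, an airline site, etc.).
origins = ["JFK", "EWR"]
destinations = ["LHR", "CDG", "AMS", "FRA", "MAD", "FCO", "ZRH", "LIS"]
airlines = ["AA", "BA", "AF", "LH", "IB"]

def run_search(origin: str, dest: str, airline: str) -> dict:
    """Placeholder for one ExpertFlyer-style search (one origin, one dest, one airline)."""
    return {"origin": origin, "dest": dest, "airline": airline, "award_space": "?"}

combos = list(itertools.product(origins, destinations, airlines))
print(f"{len(combos)} individual searches")  # 2 * 8 * 5 = 80 searches

with open("award_space.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["origin", "dest", "airline", "award_space"])
    writer.writeheader()
    for o, d, a in combos:
        writer.writerow(run_search(o, d, a))
```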

Whether this could be commercialised is a different question but I'm certainly going to continue building out my prototype to save myself some time (I mean, to be fair, it will probably take more time to build something to do this on my behalf but I think it's time well spent).

steveBK123

Yes, that seems to be the larger challenge. The search tools I have used will work for a while until they don't. Real cat & mouse game.

Hence the "adaptive" part of my comment.

It really needs to be a client side agent.

danielbln

Haiku 3.5 will be here soon, and will before long support tool use and vision, so that should help a lot with cost.

inquisitor27552

time is also a huge factor on this one, should be a nice metric

god the future is here haha

elif

If its only downside is cost, and cost is prohibitively high for all practical uses,

Why didn't this project start with https://huggingface.co/meta-llama/Llama-3.2-11B-Vision

azinman2

I saw those demoed yesterday. The model was asked to create a cool visualization. It ultimately tried to install streamlit and go to its page, only to find its own Claude software already running streamlit, so as part of debugging it killed itself. Not ready to let that go wild on my own computer!

Jayakumark

Any idea how Sonnet does this? Is the image annotated with bounding boxes on text boxes etc., along with their coordinates, before being sent to Sonnet, with Sonnet responding with a box name or coordinates back? Or is SAM2 used to segment everything before sending to Sonnet?

belval

The product I would like to see out of this is a way to automate UI QA.

Ideally it would be given a persona and a list of use cases, try to accomplish each task and save the state where you/it failed.

Something like a Chrome Lighthouse but for usability. Bonus points if it can highlight what part of my documentation is using mismatched terminology, making it difficult for newcomers to understand what button I am referring to.
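
A rough sketch of what that harness might look like, assuming a stubbed `agent_attempt` that would wrap the actual computer-use loop; the persona format and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    description: str  # e.g. "first-time user, unfamiliar with the product's jargon"

@dataclass
class UseCaseResult:
    task: str
    succeeded: bool
    failed_at_step: str | None = None
    screenshot_path: str | None = None

def agent_attempt(persona: Persona, task: str) -> UseCaseResult:
    """Placeholder: drive the UI as `persona` trying to accomplish `task`,
    capturing a screenshot at the step where it gets stuck."""
    # Stubbed out here; a real implementation would call a computer-use agent.
    return UseCaseResult(task=task, succeeded=False, failed_at_step="stub")

def run_usability_suite(persona: Persona, tasks: list[str]) -> list[UseCaseResult]:
    results = []
    for task in tasks:
        result = agent_attempt(persona, task)
        if not result.succeeded:
            print(f"[{persona.name}] stuck on '{task}' at: {result.failed_at_step}")
        results.append(result)
    return results

# Example: a Lighthouse-for-usability style run
newcomer = Persona("newcomer", "has read only the quickstart docs")
run_usability_suite(newcomer, ["sign up", "create a project", "invite a teammate"])
```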

steveBK123

I've seen similar sentiment even pre-LLM that AI would help automate other forms of testing, and I just don't quite see it.

Implementing tests is not the hard part. You could make that an intern project or hire a consultant for 3 months. The hard part is the interpretation of results.

That is - making a thing that spits out tickets/alerts is easy. The signal/noise tuning and actual investigation workflows are the hard part and still very manual & human operated. I don't see LLM mouse/keyboard control changing that yet.

belval

> making a thing that spits out tickets/alerts is easy.

I don't really believe that what I am asking for is hard, yet I still can't buy it as far as I know.

> actual investigation workflows are the hard part and still very manual & human operated.

Sure, but it would allow your QA worker to have pre-tested, use-case-based paths, each with a flag on whether it may be problematic, plus a screen recording and a timestamp of where it went wrong.

These will always need human-in-the-loop to vet the findings before cutting a ticket to development team.

steveBK123

Fair - I'm not personally familiar with the state of the art in UI QA automation, but I know there have been various screen-recording-type tools available for a decade+ with mixed success.

I come more from a "big data" background, and have dealt with CTOs who think "can't we just use AI?" is the answer to data quality checking multi-PB data lakes with 1000s of unique datasets from 100s of vendors. That is - they don't want to staff a data quality team, they think you can just magic it all away.

The answer was always - sure, but you are fixated on the easy part - anomaly detection. Actual data analysis on what broke, when, how, why, and escalating to the data provider was always 95% of the work. Someone needs to look at the exhaust, and there will be exhaust every single day... so you can either kill your dev team's productivity or actually staff an operations team responsible for the tickets the thing spits out.

sys32768

So robotic process automation gains intelligence and we can train an AI intern to assist with tasks.

Our own personal digital squire.

Then eventually we become assistants to AI.

_heimdall

I'm all for the MVP approach and shipping quickly, though I'm really surprised they went with image recognition and tooling for injecting mouse/keyboard events for automating human tasks.

I wonder why leveraging accessibility tools for this wouldn't have been a better option. Browsers and operating systems both have pretty comprehensive tooling for accessibility tools like screen readers, and the whole point of those tools is to act as a middle man to programmatically interpret and interact with what's on screen.
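
In a browser, that middle layer already exists as the accessibility tree. Here's a minimal sketch using Playwright's accessibility snapshot (the exact API has shifted across Playwright versions, so treat this as an illustration of the idea rather than the recommended approach):

```python
# Dump the accessibility tree of a page instead of screenshotting it.
# Requires: pip install playwright && playwright install
from playwright.sync_api import sync_playwright

def walk(node: dict, depth: int = 0) -> None:
    """Print role/name pairs, which is roughly what a screen reader navigates."""
    print("  " * depth + f"{node.get('role')}: {node.get('name', '')}")
    for child in node.get("children", []):
        walk(child, depth + 1)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    tree = page.accessibility.snapshot()  # structured roles/names, no pixels needed
    if tree:
        walk(tree)
    browser.close()
```

Feeding the model this kind of structured tree would presumably be far cheaper than a screenshot, at the cost of missing anything the page fails to expose.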

infecto

Those sound like stopgaps at best. The intended goal here is pretty clear: APIs are easy to integrate with, but most systems in existence only have a visual interface intended for humans.

The end goal here is clear: being able to interface with anything available on the screen.

ryukafalz

Accessibility tools are made for humans. If there is information only available visually and not via a screen reader or other accessibility tools, that is a problem that needs to be addressed.

infecto

Accessibility tools, I find, are never as good as the source. Just because they are made for humans does not mean they are an improvement. I imagine they're at best a stopgap as image models improve.

danielbln

I think the reason is that this is the most general implementation. It doesn't need playwright or have access to the DOM or anything else, if it has a screen and mouse/keyboard, then it will work. That's quite powerful (if slow and pricey, at the moment).
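
The whole interface reduces to roughly this loop; this is a paraphrased sketch, not Anthropic's actual implementation, and `take_screenshot`, `model_choose_action`, and `execute` are placeholders for whatever capture, model call, and input-injection mechanism you use.

```python
import time
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", "key", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def take_screenshot() -> bytes:
    """Placeholder: grab the current screen as an image (e.g. an X11/VNC capture)."""
    ...

def model_choose_action(goal: str, screenshot: bytes) -> Action:
    """Placeholder: send the goal + screenshot to the model and get back one
    action expressed in pixel coordinates or text to type."""
    ...

def execute(action: Action) -> None:
    """Placeholder: inject the mouse/keyboard event (e.g. via xdotool)."""
    ...

def run(goal: str, max_steps: int = 50) -> None:
    # Screenshot -> model -> synthetic input event, repeated until the model says it's done.
    for _ in range(max_steps):
        action = model_choose_action(goal, take_screenshot())
        if action.kind == "done":
            return
        execute(action)
        time.sleep(0.5)  # let the UI settle before the next screenshot
```

Nothing in that loop cares whether the screen belongs to a browser, a terminal, or a desktop app, which is exactly why it's both general and expensive.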

elif

Crazy that this needs to be said, but 'Computer use' is a far more expansive domain than Internet browsing...

VBprogrammer

Well, this just opened up a new phase in the captcha wars.

sunilkumardash9

It certainly did; it's like a Pandora's box. Unless they lobotomize them, we can expect Qwen and DeepSeek to release open-source models that can do the same.

nilstycho

It seems like a cheaper intermediate capability would be to give Claude the ability to SSH to your computer or to a cloud container. That would unlock a lot of possibilities, without incurring the cost of the vision model or the difficulty of cursor manipulation.

Does this already exist? If not, would the benefits be lower than I think, or would the costs be higher than I think?
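
A minimal sketch of what that could look like with the standard tool-use API, giving the model a single shell tool instead of a screen; the tool name, the prompt, and the choice to run commands locally via subprocess (rather than over SSH) are illustrative, not an existing Anthropic feature.

```python
# Sketch: give the model a shell instead of a screen. Commands run locally via
# subprocess here; swapping in an SSH client (e.g. paramiko) would target a
# remote container instead.
import subprocess
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SHELL_TOOL = {
    "name": "run_shell",
    "description": "Run a shell command and return its stdout/stderr.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}

messages = [{"role": "user", "content": "List the ten largest files under ~/Downloads."}]

while True:
    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[SHELL_TOOL],
        messages=messages,
    )
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":
        break  # the model answered in plain text; we're done
    results = []
    for block in resp.content:
        if block.type == "tool_use" and block.name == "run_shell":
            out = subprocess.run(block.input["command"], shell=True,
                                 capture_output=True, text=True)
            results.append({"type": "tool_result", "tool_use_id": block.id,
                            "content": out.stdout + out.stderr})
    messages.append({"role": "user", "content": results})
```

No vision tokens, no cursor manipulation; the obvious trade-off is that anything without a CLI stays out of reach, and you'd want far stricter sandboxing than `shell=True` on your own machine.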

reportgunner

Not just the benefits and costs, but also the risks are to be considered here, I think.

nilstycho

What are the risks? Isn't this a strict subset of the risks of full desktop access? Claude can just open a GUI terminal with Computer Use. (I think.)

reportgunner

Software posing as Claude that is actually malware, tricking an unsuspecting non-terminal user into executing it, is what I was thinking about.

ActionHank

Anecdotal, but I think if you mention it in any discussion in a corporate setting, alarm bells will go off because HACKERS use SSH.

This _seems_ more like a normal user so clearly could not do anything nefarious. /s

cl42

I really, really like this new product/API offering. Still crashes quite a bit for me and obviously makes mistakes, but shows what's possible.

For the folks who are more savvy on the Docker / Linux front...

1. Did Anthropic have to write its own "control" for the mouse and keyboard? I've tried using `xdotool` and related things in the past and they were very unreliable. (See the xdotool sketch after this comment.)

2. I don't want to dismiss the power and innovation going into this model, but...

(a) Why didn't Adept or someone else focused on RPA build this?

(b) How much of this is standard image recognition and fine-tuning a vision model to a screen, versus something more fundamental?
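
On question 1: a thin Python wrapper around xdotool is roughly all such a loop needs; this is a sketch of that approach, not Anthropic's actual code, and reliability still depends on the X server, focus handling, and timing.

```python
# Sketch: the kind of thin wrapper a computer-use loop needs around xdotool.
# Requires an X11 display (DISPLAY set) and xdotool installed.
import subprocess

def _xdo(*args: str) -> None:
    subprocess.run(["xdotool", *args], check=True)

def click(x: int, y: int, button: int = 1) -> None:
    """Move the pointer to (x, y) and click."""
    _xdo("mousemove", str(x), str(y))
    _xdo("click", str(button))

def type_text(text: str) -> None:
    """Type literal text with a small per-key delay for flaky UIs."""
    _xdo("type", "--delay", "50", text)

def press(key: str) -> None:
    """Press a named key or chord, e.g. 'Return' or 'ctrl+l'."""
    _xdo("key", key)

# Example: focus a browser address bar and search
# press("ctrl+l"); type_text("claude computer use"); press("Return")
```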