R1 Computer Use

60 comments

·February 6, 2025

Hey HN,

We are working to apply the ideas of R1 to computer use. The primary struggle is creating reliable neural reward models since hard-verification rewards are not available at scale in GUI interactions.

Our team is currently deep in the weeds of collecting reasoning annotation data for GUI interfaces to train a reliable reward model.

We would love all thoughts, feedback, and collaborations!

Visit

falcor84

> @software{r1_computer_use, title = {R1-Computer-Use: Reasoning-First Computer Interaction}, author = {Barker, Patrick}, year = {2025}, url = {https://github.com/agentsea/r1-computer-use}, }

Sorry to be a party-pooper, but does it really make sense to add a citation when you don't have fully working code yet, let alone a paper about it?

mountainriver

Fair, in the process of opening more of it up, so this seems a bit odd right now

soheil

And who is "we"? Is he using the royal we?

"We aim to replace hard-coded.."

sva_

Probably their company "AgentSea"?

llama-mini

It seems a placeholder for now? No content? Right?

dimitry12

No content, no code. "Roadmap" and "Training pipeline" in README are summaries of Section 2.3 of "DeepSeek-R1"-paper.

Sad.

Redoubts

Many such cases

mountainriver

Sharing ideas early is not a bad thing and very much encouraged by YC, we are gauging interest in collaboration on the topic. Our company has open sourced almost our entire compute use stack already https://github.com/agentsea

mountainriver

We are in the process of collecting the data right now which is fairly involved, we are going to be opening up that platform for others as well shortly.

throwaway314155

It's not my intention to be dismissive because the project idea seems really cool. I'm just curious, why not wait until the source code is ready to post it on HN?

mountainriver

Because gauging interest early and finding other people interested in building is a good idea and frankly very inline with YC thinking. We have already open sourced an enormous amount of code and datasets for computer use https://github.com/agentsea

mountainriver

Hey HN,

We are working to apply the ideas of R1 to computer use. The primary struggle is creating reliable neural reward models since hard-verification rewards are not available at scale in GUI interactions.

Our team is currently deep in the weeds of collecting reasoning annotation data for GUI interfaces to train a reliable reward model.

We would love all thoughts, feedback, and collaborations!

mkagenius

Training a base model just for computer use seems like an overkill as normal reasoning model like o3 for planning + a vision model like gemini-flash is good enough[1] without being trained specifically for computer use.

But if you still want to try out this path, Google has made the screenQA dataset(rico) available[2] along with bounding boxes.

1. A framework to use local/hosted models for android use/control - https://github.com/BandarLabs/clickclickclick

2. https://github.com/google-research-datasets/screen_qa

mountainriver

They aren't good enough to be reliable, and these agents aren't much use without being reliable. We initially started that way, and tried just about everything before focusing on training a base model.

refulgentis

Free advice (though, worth less than free, because A) it's unsolicited B) it's saying "don't do it")

TL;DR:

- Turns out that if you do UXR, even if computer use is 100% successful in the action execution, and there's no latency, people don't use it. (interesting to me is, the core demo was buying airline tickets, and so is OpenAI's. no one would defer to a computer on that, for humanist / design reasons)

- You're not going to be able out-do model companies on building models, they have too much funds.

- Try writing GUI-based integration tests. Then imagine an LLM, miraculously, always chooses the right route. Does the UX look good?

- Note the reasoning models are worse at tool calling. It's very, very, VERY stark when you have Claude next to o1/4o. OpenAI also owns up to this in the o3-mini paper, though its not under a blaring red line headline or phrased that straightforwardly.

- Why is that? You're fighting against the current when you're trying to teach the next token predictor to throw a bunch of text out there to <think>, then generate perfectly correct JSON/python/whatever given N tools.

CLI, though....

moffkalast

LLM computer use is like that inflatable autopilot from Airplane, it makes no sense for something that can interface directly with the underlying system to have to interact with the GUI that's only there because we are apes who like brightly coloured rectangles and clicking things.

mountainriver

Unless you can't interface directly with the underlying system, which turns out is... most apps

nick__m

If you had a real super-AGI it could directly interact with any computer program by injecting itself into the process.

It is possible for an human to inject features into notepad.exe without having access to the source code (I learn that in cracking tutorials from +HCU almost 30years ago...) it should be possible for an AI to do so.

mountainriver

> Turns out that if you do UXR, even if computer use is 100% successful in the action execution, and there's no latency, people don't use it. (interesting to me is, the core demo was buying airline tickets, and so is OpenAI's. no one would defer to a computer on that, for humanist / design reasons)

I would never buy plane tickets that way, we built it because there are tons of things we couldn't automate and this was the only way to do it

> - You're not going to be able out-do model companies on building models, they have too much funds.

There are plenty of people edging out the model companies all over the place. We aren't powerless to them

> Note the reasoning models are worse at tool calling. It's very, very, VERY stark when you have Claude next to o1/4o. OpenAI also owns up to this in the o3-mini paper, though its not under a blaring red line headline or phrased that straightforwardly. Why is that? You're fighting against the current when you're trying to teach the next token predictor to throw a bunch of text out there to <think>, then generate perfectly correct JSON/python/whatever given N tools.

The reason why they are bad at tool calling is they aren't trained on it. The current reasoning models require hard-verification reward models, we don't have those for tool calling, whereas that data is easy to get for math/code. Reasoning will improve tool calling, OpenAI just talked about it recently as being the answer to autonomous agents

refulgentis

> The reason why they are bad at tool calling is they aren't trained on it.

Yes they are.*

> OpenAI just talked about it recently as being the answer to autonomous agents

You nailed it.

It's key to autonomous agents, OpenAI says so out loud, and yet, we're 6 months into reasoning models and performance is regressing, which OpenAI also says out loud but in the fine print of a model card.

I know this is a flaming hot take because current thing is reasoning. But it's completely mistaken that reasoning models help with tool-use, both in theory and practice, which puts them in quite a situation.

I'm sure they'll figure it out, but I'm also sure on a long enough timeline the LLM is a computer.

* I have a bad habit of dismissing without evidence that which is asserted without evidence, and 'not even wrong', in the Pauli sense. It's a cheap way to avoid confrontation. But it makes me look petulant. We can observe, inter alia, every release post September 2024 (i.e. o1-mini and o1-preview) can call tools. (i.e. o1, o3-mini)

mountainriver

Please cite the source for o1 being trained heavily on tool usage.

Reasoning will obviously improve agentic workflows, reasoning is the main problem we see today with autonomous agents. It seems to be likely the _last_ issue we have with them

Are people concerned about the privacy implications of computer use at all? This is why I haven’t been using Claude computer use personally. Somehow the idea of sending everything I do on my computer to a random third party seems creepy. There are a lot of applications of AI (rewind comes to mind) that I simply cannot accept the idea of sharing my screen with

ben_w

I share this feeling every time a popup asks me to accept coookies for a website and its 1243 "trusted partners" — which, in this context, feels like a Ghengis Khan scale harem rather than any sane business relationship.

satvikpendem

That is why open source and local hostable models are so important. The privacy considerations are what are paramount, not just the ability to have unlimited token generation.

nyrikki

We have more than a quarter of a century of the normalization of zero privacy, and this is obviously anti Chinese AI company propaganda because the reality is that RTB is so bad that the security concerns are simply not new so why should people to even care?

Google, Meta, Microsoft and other RTB firms send RTB data about people in the U.S. to Russia and China and anyone else who signs up...many people don't see the difference. In fact for many people, the CCP having your data is far less of a risk vector than the thousands of others who get your data every single time you hit a webpage, visit the local store etc...

When Google, Meta, Microsoft etc... are selling your data thousands of times a day, and companies aggregate that to sell even sensitive information completely based on even national security sensitive categories.

https://www.iccl.ie/wp-content/uploads/2023/11/Americas-hidd...

https://www.eff.org/deeplinks/2025/01/online-behavioral-ads-...

n144q

* you can always choose to self a small model, although it probably doesn't work as well

* it's not a "random third party". You know to whom the data is being sent, and at least according to service agreements, most services don't use your data for training. If you don't trust Claude, you could trust AWS hosted version, or GPT/Deepseek hosted on Azure. Well, if you think Amazon/Microsoft is not trustworthy and they may misuse your data in these cloud services (not some random consumer facing service where you are the product), you might as well give up your digital life.

anon373839

> most services don't use your data for training

This is a claim that really irks me (when companies make it). It’s a non-denial denial. “We don’t train on user data” is NOT the same as:

- We don’t retain user data

- We don’t share user data with business partners

- We don’t mine user data for business ideas or ways to compete with our users’ products

- Etc.

ATechGuy

This. I'm surprised that even a lot of smart and tech savvy folks are not able to see these pitfalls.

seanmcdirmid

You can run AIs locally. I have a laptop that can DeepSeek's thinking <= 70b versions of qwen and llama locally, and they are a blast to play with even on an airplane without an internet connection.

sva_

It does not appear that this aims to use an API, but rather to train/finetune their own model using insights from R1 ([mostly] pure RL approach towards bootstrapping reasoning in an LLM)

tokioyoyo

Outside of tech circles? No, not really. The past decade is showed that if anything goes out of the window first, it'll be privacy if it helps in terms of speed, convenience or money.

likium

I’m concerned. Not just privacy but AI request forgery will also be a thing.

Running local models don’t protect you from prompt injection attacks or hallucinations.

There are some startups building capabilities apis to limit that but most websites/apps either don’t have the resources or aren’t willing to expose those capabilities.

And as some others have mentioned, users have a track record of giving up privacy for convenience. I’m not convinced educating non-technical users about the risks involved will ward them off.

ATechGuy

> There are some startups building capabilities apis to limit that but most websites/apps either don’t have the resources or aren’t willing to expose those capabilities.

Care to elaborate with example(s) of a startups doing this?

iiJDSii

What does your perception look like, are you using raw screenshots? GUI snapshots? Vision is very difficult for these, and snapshots are incomplete, is what I've found in some earlier experiments.

mountainriver

Perception is just 1-2 screenshots. A number of recent VLM models have a lot more pretraining data on GUI interactions, which helps.

iiJDSii

Such as? Are they able to recognize arbitrary GUI elements from various desktop programs, web browsers, etc?

mountainriver

Qwen2.5-vl seems to be the best right now by our tests.

UI-TARS by bytedance also has a good amount of pretraining.

Molmo is also very good at coordinates.

emregucerr

i wonder how good is R1 at counting pixels from a screenshot. what enabled claude and OAI's CUA to develop computer use was being able to precisely give x-y coordinates of a click location.

also, how big of a gain to have reasoning for computer use? i feel like reasoning unlocks a lot when there is a single complex question but not so much better at taking actions in a long term plan.

mountainriver

Yep, coordinate grounding is key, we use Ai2's pixmo for a lot of that https://huggingface.co/datasets/allenai/pixmo-points

We had previously created https://huggingface.co/datasets/agentsea/wave-ui but that was superseded by pixmo as it contains over a million data points.

fkyoureadthedoc

This is the type of post some VP at my company sees and starts telling people that R1 can use computer and then I have to be like "well actually" to 25 people.

Computer use is pretty exciting stuff in general though, good luck

mountainriver

Ha! Sorry about that!

crazygringo

I can't wait for something like this to be built.

People have tons of workflows that involve a lot of clicks and typing in response to data that are too difficult or one-off to automate with fragile macros.

But if my computer can quickly realize that I'm deleting every odd-numbered page of a PDF, or renaming every file to add a prefix, or following each link on a website and saving an image... and then just instantly automate the next 100 times... that's going to be huge!

n144q

> But if my computer can quickly realize that I'm deleting every odd-numbered page of a PDF, or renaming every file to add a prefix, or following each link on a website and saving an image... and then just instantly automate the next 100 times... that's going to be huge!

The first two tasks could be easily done by asking ChatGPT to write a script for you. Scraping a website can be a bit more tricky. Still, I don't see why you have to rely on "computer use" for these tasks -- there are much more efficient and reliable approaches to the tasks.

crazygringo

Those are just simple examples. Most of the clicking I do on my computer doesn't have a command-line equivalent. Nor do I want to have to type out a request to ChatGPT, even if there is one.

There's a gigantic area of productivity improvement around repetitive actions that aren't easily scriptable or no scripting interface exists. But where an AI assistant that interfaces with your screen, pointer and keyboard would be a huge help.

likium

The places where automation is most needed are for non-technical folks. To “write a script” is a huge hurdle.

null

[deleted]

null

[deleted]

soheil

Check out AutoGPT you don't have to wait, it's already built.

https://github.com/Significant-Gravitas/AutoGPT

crazygringo

I don't think that's what I'm describing.

That's about manually setting up agents, that run on a server, that seem to interact largely with the web (from the examples).

I'm talking about not manually setting up anything -- I'm talking about an AI that simply observes the repetitive actions you're taking on your computer, infers patterns from them, and then offers to take over and finish the job.

eichin

There was something like this on Macs in the mid 1990s, that would watch you work and suggest timed automations - there was a bit of a developer panic the first time it told someone "I see you launch <popular desktop game> around 4:30pm every Friday, would you like to do that automatically?"

mcny

As a start, I want to see if the agent can figure out how I play a clicker style game such as adventure capitalist on the computer. I am thinking I have a certain style of playing. I still don't understand how an agent can somehow figure out valid gameplay (earth, moon, mars, events) AND figure out a valid gameplay much less play the game in my own style.

I think we should start with something simple, repeatable, and does little to no harm if/when things go wrong.

Edit: repetitive -> repeatable

ai-christianson

Does this do "Computer Use" in that it looks at the screen, controls the mouse, keyboard (e.g. how Anthropic computer use does?)

Philpax

AutoGPT is an unserious project.