
Introducing Operator

218 comments

·January 23, 2025

mrdependable

A lot of people here seem to think this is somehow for their benefit, or that OpenAI and friends are trying to make something useful for the average person. They aren't spending billions of dollars to give everyone a personal assistant. They are spending billions now to save even more in wages later, and we are paying for the privilege of training their AI to do it. By the time this thing is useful enough to actually be a personal assistant, they will have released that capability in a model that is far too expensive for the average person.

reissbaker

This seems unreasonably pessimistic (or unreasonably optimistic in OpenAI's moat?). There are so, so many companies competing in this space. The cost will reflect the price of the hardware needed to run it: if it doesn't, they'll just lose to one of their many competitors who offer something similar for cheaper, e.g. whatever DeepSeek or Meta releases in the same space, with the cost driven to the bottom by commoditized inference companies like Together and Fireworks. And hardware cost goes down over time: even if it's unaffordable at launch, it won't be in five years.

They're not even the first movers here: Anthropic's been doing this with Claude for a few months now. They're just the first to combine it with a reasoning-style model, and I'd expect Anthropic to launch a similar model within the next few months if not sooner, especially now that there's been open-source replication of o1-style reasoning with DeepSeek R1 and the various R1-distills on top of Llama and Qwen.

franktankbank

The data is the moat.

mplewis

And none of the competitors can make this technology profitable, either.

random3

I think it's less a problem of cost for the average person and more a problem of setting the market price for them at a fraction of the current one. This has such a deflationary impact that it's unlikely to be captured, or even conceived of, by current economic models.

There's a problem of "target fixation" on the capabilities, and it captures most of the conversation, when in fact most public focus should be on public policy and on ensuring this has the impact that society wants.

IMO whether things are going to be good or bad depends on having a shared understanding, thinking, discussion and decisions around what's going to happen next.

fraboniface

Exactly. Every country should urgently have a public debate on how best to use this technology and make sure it's beneficial to society as a whole. Social media is a good example of how a technology can have a net negative impact if we don't deploy it carefully.

Night_Thastus

Don't worry, it'll never be good enough to actually be a personal assistant.

4ndrewl

Not this version, but in 3 years time. Promise.

Just keep sending us money...

minimaxir

Overall, Operator seems the same as Claude's Computer Use demo from a few months ago, including an architecture requiring the user to launch a VM and a tendency to be incorrect: https://news.ycombinator.com/item?id=41914989

Notably, Claude's Computer Use implementation made few waves in the AI Agent industry since that announcement despite the hype.

og_kalu

Big jumps in benchmarks from Claude's Computer Use though.

87% vs 56% on Webvoyager

58.1% vs 36.2% on WebArena

38.1% vs 22% on OsWorld

These are next-gen improvements, so the fact that Claude didn't make any waves doesn't really mean anything. (Of course, there's no guarantee this will either.)

timabdulla

OpenAI is merely matching SOTA in browser tasks as compared to existing browser-use agents. It is a big improvement over Claude Computer Use, but it is more of the same in the specific domain of browser tasks when comparing against browser-use agents (which can use the DOM, browser-specific APIs, and so on.)

The truth is that while 87% on WebVoyager is impressive, most of the tasks are quite simple. I've played with some browser-use agents that are SOTA, and they can still get very easily confused by more complex tasks or unfamiliar interfaces.

You can see some of the examples in OpenAI's blog post. They need to quite carefully write the prompts in some instances to get the thing to work. The truth is that needing to iterate to get the prompt just right really negates a lot of the value of delegating a one-off task to an agent.

og_kalu

Well that's fair. I wasn't saying that this was necessarily at a level of competence to be useful, simply that it seemed to be a lot better than Claude.

gregpr07

Yeah, and Browser Use already has 89% on WebVoyager https://browser-use.com/posts/sota-technical-report

cubefox

> OpenAI is merely matching SOTA in browser tasks as compared to existing browser-use agents.

No. It's not matching them, it's clearly exceeding them. The previous post provided the numbers.

YetAnotherNick

Gemini is 90.5% in Webvoyager[1] compared to 87% for OpenAI.

[1]: https://deepmind.google/technologies/project-mariner/

bko

I thought Claude Computer Use was through the API, and I remember hearing about a high number of queries and charges.

This looks like it's in-browser through the standard $20 Pro fee, which is huge. (EDIT: it's the $200-a-month plan, so less of a slam dunk, but it still might be worth it.)

Is there any open source or cheap ways to automate things on your computer? For instance I was thinking about a workflow like:

1. Use web to search for [companies] with conditions

2. Use LinkedIn Sales Navigator to identify people at specific companies, with a loose search on job title or summary/experience

3. Collect the names for review

Or, LinkedIn only: look at the leads provided, identify any companies they previously worked for, and find similar people with that job title

It doesn't have to be computer use, but given that it relies on my LinkedIn login, it would have to be.

gregpr07

If you are worried about costs you can use Browser Use with deepseek which becomes super cheap! https://github.com/browser-use/browser-use

usaar333

38% on osworld vs 22% for Claude. That seems like a jump

achierius

But of course, after all the benchmark issues we've had thus far -- memorization, conflicts of interest, and just plainly low-quality questions -- I think it's fair to be suspicious of the extent to which these numbers will actually map to usability in the real world.

minimaxir

Correction on "including architecture requiring user to launch a VM": apparently OpenAI uses a cloud hosted VM that's shown to the user. While that's much more user friendly, it opens up different issues around security/privacy.

fsndz

This is mainly to reclaim mindshare from DeepSeek, which has done incredible launches recently. R1 in particular was a strong demonstration of what a cracked team of former quants can do. The demo of Operator was nice, but I still feel like R1 is the big moment in the AI space so far. https://open.substack.com/pub/transitions/p/openai-launches-...

karmasimida

R1 is a fundamental blow to their value proposition right now; the uniqueness is gone, and forever open-sourced. Unless o3 is the game changer of game changers, I don't see them getting the narrative back soon.

MagMueller

You can use browser-use as an open-source alternative to Operator

ninininino

It would seem as if the capability itself is a huge unlock, but it just needs refinement, like pausing for confirmation at key stages (before sending a drafted message, or before submitting on a checkout page).

So the workflow for the human is to ask the AI to do several things, then, in the meantime between issuing new instructions, look at paused AI operator/agent flows stemming from prior instructions and unblock/approve them.

Like a general instructing an army.

easterncalculus

From the slide deck on the livestream:

"[Operator safety risks and mitigations] Harmful tasks: User is misaligned"

Looking forward to seeing some more of the examples for when openai considers their users as "misaligned", whatever that actually even means anymore.

darioush

As the storyline unfolds "AI" seems to be code for "machine learning based censorship".

Soon we will have home appliances and vehicles telling you about how aligned you are, and whether you need to improve your alignment score before you can open your fridge.

It is only a matter of time before this will apply to your financial transactions as well.

mattstir

I can sympathize with vague notions of AI dystopia, but this might be stretching the concept a bit too far. This kind of service is extremely abusable ("Operator, go to Wikipedia and start mass-vandalizing articles" or "Go to this website and try these people's email addresses with random passwords until it locks their accounts") and building some alignment goals into it doesn't seem like a terribly draconian idea.

Also, if you were under the impression that machine-learned (or otherwise) restrictions aren't already applied to purchases made with your cards, you're in for an unfortunate bit of news there as well.

darioush

You can also write a python script to achieve the same goals.

Except it's not python's responsibility to interpret the intent of your script, just as it's not your phone's responsibility to interpret the contents of your conversation.

So our tools are not our morality police. We have a legal system that can operate within the bounds of law and due process. I am well aware of the already applied levels of machine learning policing, I am just not very excited that society has decided that "this is the way now", and also doesn't seem to be bothered by the environmental costs of building and running all these GPUs (which does seem to be the case when they are used for censorship resistant transactions), or the ethical concerns about a non-profit becoming a for-profit etc.

gloosx

I don't think webmasters will sit down and hope that this won't be abused. It's unlikely these kinds of agents will be allowed to produce content of any kind automatically (i.e., not via their APIs) at all, or AI slop will just overwhelm the internet exponentially.

The same neural networks are ready to detect certain fingerprints and deny them entry.

A4ET8a8uTh0_v2

<< whether you need to improve your alignment score before you can open your fridge.

Did you not eat enough already? Come to think of it, do you not think you had enough internet for today Darious? You need to rest so that you can give 110% at <insert employer>. Proper food alignment is very important to a human.

93po

drink verification can

tedsanders

I assume here it means complying with requests that could harm other people. It's pretty common for businesses to tell their employees not to assist customers doing bad things, so I'm not surprised to see AIs trained not to assist customers doing bad things.

Examples:

- "operator, please sign up for 100 fake Reddit accounts and have them regularly make posts praising product X."

- "operator, please order the components needed to make a high-yield bomb."

- "operator, please go harass my ex on Instagram"

swatcoder

It's pretty troubling and illiberal to use the same word for a software tool being constrained by its manufacturer's moral framework and for a human user being constrained to that manufacturer's moral framework.

While you can see how the word is formally valid and analogous in both cases, the connotation is that the user is being judged by the moral standards of a commercial vendor, which is about as Cyberpunk Dystopian as you can get.

easterncalculus

This is putting it in better words than I came up with myself.

hammock

Isn't that reddit/home depot/instagram's problem? Not a job for the guy you hired to do a thing

jsheard

It's OpenAI's problem if sites start throttling/challenging/blocking their agent traffic in response to abuse.

bilbo0s

If it makes you feel any better, law enforcement makes sure reddit, Home Depot, and instagram are "aligned" as well.

Don't worry though, it's all on the up and up. No backdoors or google-like search facilities or anything like that. It's not at all automated in that sort of unseemly fashion. They always go to court. Where they talk to a judge, that they totally don't go golfing with, and ask them for a warrant for the data they found on the instagram/home depot/reddit systems.

Oh wait, no, I mean, a warrant to try to find data on the instagram/home depot/reddit systems.

/s

madeofpalk

"operator, please perform this computationally expensive action on my competitors website 1000000 times"

jfengel

I appreciate that they all say please.

fassssst

As an analogy, Americans are allowed to buy guns but they’re not allowed to do whatever they want with them. An agent on the internet could be used for more harm than a gun.


moffkalast

OAI has decided to stop aligning models and focus on aligning the users instead.

TeMPOraL

"Society is fixed, biology is mutable", but taken to the extreme?

incognito124

First time hearing about it, nice read

gordon_freeman

What is fascinating about this announcement is that if you look into the future, after considerable improvements in the product and the model, we will just be chatting with ChatGPT to book dinner tables, book flights, buy groceries, and do all sorts of the mundane and hugely boring things we do on the web, just by talking to the agents. I'd definitely love that.

TeMPOraL

I don't. Chat interface sucks; for most of these things, a more direct interface could be much more ergonomic, and easier to operate and integrate. The only reason we don't have those interfaces is because neither restaurants, nor airlines, nor online stores, nor any other businesses actually want us to have them. To a business, the user interface isn't there to help the user achieve their goals - it's a platform for milking the users as much as possible. To a lesser or greater extent, almost every site actively defeats attempts at interoperability.

Denying interoperability is so culturally ingrained at this point, that it got pretty much baked into entire web stack. The only force currently countering this is accessibility - screen readers are pretty much an interoperability backdoor with legal backing in some situations, so not every company gets to ignore it.

No, we'll have to settle for "chat agents" powered by multimodal LLMs working as general-purpose web scrapers, because those models are the ultimate form of adversarial interoperability, and chat agents are the cheapest, least-effort way to let users operate them.

sky2224

I think the chat interface is bad, but for certain things it could honestly streamline a lot of mundane tasks, as the poster you're replying to stated.

For example, McDonald's has heavily shifted away from cashiers taking orders and instead is using the kiosks to have customers order. The downside of this is 1) it's incredibly unsanitary and 2) customers are so goddamn slow at tapping on that god awful screen. An AI agent could actually take orders with surprisingly good accuracy.

Now, whether we want that in the world is a whole different debate.

krapp

McDonald's already tried having AI take orders and stopped when the AI did things like randomly add $250 of McNuggets or mistake ketchup for butter.

Note - because this is something which needs to be pointed out in any discussion of AI now - even though human beings also make mistakes this is still markedly less accurate than the average human employee.

gordon_freeman

I also do not like the chat interface. What I meant by the above comment was actually talking and having natural conversations with the Operator agent while driving a car, or just going for a walk, or whenever and wherever something comes to mind that requires me to go to a browser and fill out forms, etc. That would get us closer to using ChatGPT as a universal AI agent to get those things done. (This is what Siri was supposed to be one day when Steve Jobs introduced it on that stage, but unfortunately that day never arrived.)

TeMPOraL

> This is what Siri was supposed to be one day when Steve Jobs introduced it on that stage but unfortunately that day never arrived.

The irony is, the reason neither Siri nor Alexa nor Google Assistant/Now/${whatever they call it these days} nor Cortana achieved this isn't the voice side of the equation. That one sucks too, when you realize that 20 years ago Microsoft Speech API could do better, fully locally, on cheap consumer hardware, but the real problem is the integration approach. Doing interop by agreements between vendors only ever led to commercial entities exposing minimal, trivial functionality of their services, which were activated by voice commands in the form of "{Brand Wake word}, {verb} {Brand 1} to {verb} {Brand 2}" etc.

This is not an ergonomic user interface, it's merely making people constantly read ads themselves. "Okay Google, play some Taylor Swift on Spotify" is literally three brand ads in eight words you just spoke out loud.

No, all the magical voice experience you describe is enabled[0] by having multimodal LLMs that can be sicced on any website and beat it into submission, whether the website vendor likes it or not. Hopefully they won't screw it up (again[1]) trying to commercialize it by offering third parties control over what LLMs can do. If, in this new reality, I have to utter the word "Spotify" to have my phone start playing music, this is going to be a double regression relative to MS Speech API in the mid 2000s.

--

[0] - Actually, it was possible ever since OpenAI added function calling, which was like over a good year ago - if you exposed stuff you care about as functions on your own. As it is, currently the smartphone voice assistant that's closest to Star Trek experience is actually free and easy to set up - it's Home Assistant with its mobile app (for the phone assistant side) and server-side integrations (mostly, but not limited to, IoT hardware).

[1] - Like OpenAI did with "GPTs". They've tried to package a system prompt and function call configuration into a digital product and build a marketplace around it. This delayed their release of the functionality to the official ChatGPT app/website for about half a year, leading to an absurd situation where, for those 6+ months, anyone with API access could use a much better implementation of "GPTs" via third-party frontends like TypingMind.

windowlessmonad

Are our attention spans so shot that we consider booking a reservation at a restaurant or buying groceries "hugely boring"? And do we value convenience so much that we're willing to sacrifice a huge breadth of options for whatever sponsor du jour OpenAI wants to serve us just to save less than 10 minutes?

And would this company spend billions of dollars for this infinitesimally small increase in convenience? No, of course not; you are not the real customer here. Consider reading between the lines and thinking about what you are sacrificing just for the sake of minor convenience.

dougb5

I'm reminded of Kurt Vonnegut's famous story about buying postage stamps: https://www.insidehook.com/wellness/kurt-vonnegut-advice

"I stamp the envelope and mail it in a mailbox in front of the post office, and I go home. And I’ve had a hell of a good time. And I tell you, we are here on Earth to fart around, and don’t let anybody tell you any different...How beautiful it is to get up and go do something."

0_____0

I love this so much. It really encapsulates what I've been feeling about tech and life generally. Society, and especially tech, seems so efficiency-minded that I sometimes feel like a crazy person for going to do my groceries at the store.

openrisk

The fact that you are downvoted despite pointing out the obvious tells you about the odds of the tech industry adopting a different path. Fleecing the ignorant is the name of the game.

snakeyjake

The potential of x-Models (x=ll, transformer, tts, etc), which are not AI, to perfect the flooding of social media with bullshit to increase the sales of drop-shipped garbage to hundreds of millions of people is so great that there is a near-infinite stream of money available to be spent on useless shit like this.

Talking to an x-Model (still not AI), just like talking to a human, has never been, is not now, and will never be faster than looking at an information-dense table of data.

x-Models (will never be AI) will eat the world though, long after the dream of talking to a computer to reserve a table has died, because they are so good at flooding social media with bullshit to facilitate the sales of drop-shipped garbage to hundreds of millions of people.

That being said, it is highly likely that there is an extremely large group of people who are so braindead that they need a robot to click through TripAdvisor links for them to create a boring, sterile, assembly-line one-day tour of Rome.

Whether or not those people have enough money to be extracted from them to make running such a service profitable remains to be seen.

CaptainFever

I would really love for Apple Knowledge Navigator to be real: https://www.youtube.com/watch?v=umJsITGzXd0

and I'm surprised that people don't bring this visualisation up more often.

insane_dreamer

> and even creating memes.

important work. glad to hear they're investing $500B in this space instead of stuff like, I don't know, making the planet livable for our grandkids

aerostable_slug

"Operator, I need to purchase 78,000 widgets for my company. Please find the best deal among suppliers who ship using carriers and ports who meet or exceed US EPA guidelines. Please ensure at least 50% of the product is sourced from post-consumer waste, and order your responses by price per unit."

patrickmcnamara

I wonder why they didn't put that in the press release. Huh.

gowld

"Low-cost slave-labor factory located. Enjoy your widgets!"

aerostable_slug

Then add criteria for worker welfare, factory safety standards, relative corruption level of the host nation, and/or whatever else turns your propeller.

The point is that this kind of tool is potentially a real labor-saver for those who are trying to act responsibly within their sphere of influence.

itskarad

I think this opens a new direction in terms of UI for companies like Instacart or Doordash — they can now optimise marketing for LLMs in place of humans, so they can just give benchmarks or quantized results for a product so the LLM can make a decision, instead of presenting the highest converting products first.

If the operator is told to find the most nutritious eggs for weight gain, the agent can refer to the nutrient labels (provided by Instacart) and then make a decision.

aerostable_slug

This reminds me of a scene in the latest entry to the Alien film franchise where the protagonists traverse a passage designated for 'artificial human' use only (it's dark and rather claustrophobic).

In the future we might well stumble into those kind of spaces on the net accidentally, look around briefly, then excuse ourselves back to the well-lit spaces meant for real people.

dataviz1000

I've been developing automated browser agents since 2017.

One approach involves driving a headless browser with a framework like Playwright. We targeted around ten different websites, managing tens of millions of dollars in inventory. This was necessary because Fortune 500 companies are often slow to develop APIs, even for their largest customers. The solution worked well, and a few years later, the company eventually released an API that met the needs of its biggest customers.

Another approach is building Chrome Extensions, which keep a human in the loop. This offloads a significant amount of work but requires human acknowledgment for any sensitive actions. A Chrome Extension can automate almost every aspect of a web browser, except for a few actions such as playing audio, entering fullscreen mode, or handling event listeners that check if an event is "trusted"—meaning the user must physically press a button or tap the screen. However, an exception exists when using the Chrome DevTools Protocol (CDP) with the appropriate permissions, which allows programmatic control over nearly all browser functions. A Chrome Extension can communicate with CDP in the same way a Node.js script in Playwright does, sending commands to override browser restrictions.

Despite these limitations, they are often not a major obstacle. We don't need complete control of the browser to automate tasks such as making credit card payments. It's sufficient to automate everything up to the final step and then prompt the user to press a button to trigger a "trusted" event, allowing the purchase to proceed.
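The gating pattern described above can be sketched roughly as follows. This is a toy illustration with hypothetical step names, not real browser code; an actual implementation would drive the browser via Playwright or CDP, and the final confirmation would be a physical button press producing a "trusted" event:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CheckoutFlow:
    """Automate every step up to the purchase, then require a human."""
    log: list = field(default_factory=list)

    def fill_cart(self, items):
        # Steps like these can be fully automated; no trusted event needed.
        self.log.append(f"added {len(items)} items")

    def enter_payment_details(self, card_last4):
        self.log.append(f"entered card ending {card_last4}")

    def submit(self, confirm: Callable[[], bool]) -> bool:
        # The final click is gated on human confirmation (a trusted event).
        if not confirm():
            self.log.append("purchase aborted: no human confirmation")
            return False
        self.log.append("purchase submitted")
        return True

flow = CheckoutFlow()
flow.fill_cart(["widget-a", "widget-b"])
flow.enter_payment_details("4242")
submitted = flow.submit(confirm=lambda: True)  # in practice: a real button press
```

The key design point is that the automation never fakes the trusted event; it does everything up to it and then hands control back to the human.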

Using LLMs is a new vector of attack, and the current built-in browser security protections need to be heeded. Imagine creating an item on Grubhub with a prompt injection attack. What is the equivalent of Robert'); DROP TABLE Students;-- for LLMs?

brap

I don't know why, but the approach where "agents" accomplish things by using a mouse and keyboard and looking at pixels always seemed off to me.

I understand that in theory it's more flexible, but I always imagined some sort of standard, where apps and services can expose a set of pre-approved actions on the user's behalf. And the user can add/revoke privileges from agents at any point. Kind of like OAuth scopes.

Imagine having "app stores" where you "install" apps like Gmail or Uber or whatever on your agent of choice, define the privileges you wish the agent to have on those apps, and bam, it now has new capabilities. No browser clicks needed. You can configure it at any time. You can audit when it took action on your behalf. You can see exactly how app devs instructed the agent to use it (hell, you can even customize it). And, it's probably much faster, cheaper, and less brittle (since it doesn't need to understand any pixels).

Seems like better UX to me. But probably more difficult to get app developers on board.
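A minimal sketch of that OAuth-scope-style model (all names hypothetical: `AgentCapability`, the `gmail` scopes, and the action strings are made up for illustration):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentCapability:
    """A scope-style grant: the app exposes named actions, the user decides
    which ones the agent may invoke, and every call is audited."""
    app: str
    granted_scopes: set
    audit_log: list = field(default_factory=list)

    def invoke(self, action: str, **params):
        if action not in self.granted_scopes:
            raise PermissionError(f"{self.app}: scope '{action}' not granted")
        # Every agent action leaves an auditable trail for the user.
        self.audit_log.append((datetime.now(timezone.utc), action, params))
        return f"{self.app}.{action} executed"

# The user "installs" gmail on their agent with read-only scopes:
gmail = AgentCapability(app="gmail", granted_scopes={"read_inbox", "search"})
gmail.invoke("search", query="flight confirmation")  # allowed, audited
# gmail.invoke("send_mail", to="...")                # raises PermissionError
```

Revoking a privilege is then just removing a scope from the set; auditing is reading the log back.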

madeofpalk

> But probably more difficult to get app developers on board.

That's it. The problem is getting Postmates to agree to give away control of their UI: giving away their ability to upsell you and push whatever makes them more money. It's never going to happen. Netflix still isn't integrated with Apple TV properly because they don't want to give away that access.

I'm not convinced this is the path forward for computers either though.

Nevermark

This is classic disruption vulnerability creation in real time.

AIs are (just) starting to devalue the moat benefits of human-only interfaces. New entrants that preemptively give up on human-only "security" or moats have a clear new opening at the low end, especially with development costs dropping (specifics of the product or service permitting).

As for the problem of machine attacks on machine-friendly APIs:

Sometimes, the only defense against attacks by machines will be some kind of micropayment system: payments too small to be relevant to anyone getting value, but which don't scale for anyone trying to externalize costs onto their target (which is what all attacks essentially are).

jsheard

> I'm not convinced this is the path forward for computers either though.

With this approach they'll have to contend with the agent running into all the anti-bot measures that sites have implemented to deal with abuse. CAPTCHAs, flagging or blocking datacenter IP addresses, etc.

Maybe deals could be struck to allow agents to be whitelisted, but that assumes the agents won't also be used for abuse. If you could get ChatGPT to spam Reddit[1] then Reddit probably wouldn't cooperate.

[1] https://gizmodo.com/oh-no-this-startup-is-using-ai-agents-to...

xnx

> With this approach they'll have to contend with the agent running into all the anti-bot measures that sites have implemented to deal with abuse

I expect many more sites to adopt login requirements. This has the added benefit of more tracking/marketing data.

TeMPOraL

The solution is simple, and it's what's already done with search by proprietary LLMs: reasoning happens on the LLM vendor's servers, tool use happens client-side. Whether for search or "computer use", the websites will register activity coming from the user's machine, as it should be, because LLMs act as User Agents here.

Of course, already with LLM-powered search we see a growing number of people doing the selfish/idiotic thing and blocking or poisoning user-initiated LLM interactions[0]; hopefully LLM tools following the practice above will spread quickly enough to beat this idea out of peoples' heads.

--

[0] - As opposed to LLM company crawlers that scrape the web for training data - blocking those is fine and follows the cultural best practices on the web, which have been holding for decades now. But guess what, LLM crawlers tend to obey robots.txt. The "bots" that don't are usually the ones performing specific query on behalf of users; such bots act as User Agents, neither have nor ever had any obligation to obey robots.txt.
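The split described above (remote reasoning, local tool use) can be sketched as a toy loop. Everything here is a stand-in: `plan_next_step` fakes the hosted LLM, and `fetch_page` fakes a request made from the user's own machine and session:

```python
# Toy agent loop: the "server" (LLM vendor) only plans; the "client"
# (user's machine) performs the web requests, so sites see the user's
# own traffic -- the LLM acts as a User Agent.

def plan_next_step(goal, observations):
    """Stand-in for the hosted LLM: returns a tool call or a final answer."""
    if not observations:
        return {"tool": "fetch_page", "args": {"url": "https://example.com"}}
    return {"final": f"Done: {goal} (saw {len(observations)} page(s))"}

def fetch_page(url):
    """Runs client-side; real code would use the user's browser session."""
    return f"<html>contents of {url}</html>"

TOOLS = {"fetch_page": fetch_page}

def run_agent(goal):
    observations = []
    while True:
        step = plan_next_step(goal, observations)    # remote reasoning
        if "final" in step:
            return step["final"]
        # Local tool execution: the request originates from the user's IP.
        observations.append(TOOLS[step["tool"]](**step["args"]))

result = run_agent("check a page")
```

This mirrors how function calling already works with proprietary LLM APIs: the model emits tool calls, and whoever runs the loop decides where the tools actually execute.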

Analemma_

And it's why you can't have a single messaging app that acts as a unified inbox for all the various services out there. XMPP could've been that but it died, and Microsoft tried to have it on Windows Phone but the messaging apps told them to get fucked.

Open API interoperability is the dream but it's clear it will never happen unless it's forced by law.

thrtythreeforty

APIs have an MxN problem. N tools each need to implement M different APIs.

In nearly every case (that an end user cares about), an API will also have a GUI frontend. The GUI is discoverable, able to be authenticated against, definitely exists, and is generally usable by the lowest common denominator. Teaching the AI to use this generically solves the same problem as implementing support for a bunch of APIs, without the discoverability and existence problems. In many ways this is horrific compute waste, but it's also a generic MxN solution.

ItsMattyG

But if you have an AI, then all that's needed to implement an API is documentation

skydhash

> I always imagined some sort of standard, where apps and services can expose a set of pre-approved actions on the user's behalf

OS specific, but Apple has the Scripting Support API [0] and Shortcut API for their app. Works great.

[0]: https://developer.apple.com/documentation/foundation/scripti...

cosmic_cheese

AppleScript support has sadly become more rare over time though, as more and more companies dig moats around their castles in an effort to control and/or charge for interoperability. Phoned-in cross-platform ports suffer this problem too.

susodapop

Yep, and on Windows this is exposed through the COM API.

alach11

> the approach where "agents" accomplish things by using the browser/desktop always seemed off to me

It's certainly a much more difficult approach, but it scales so much better. There's such a long-tail of small websites and apps that people will want to integrate with. There's no way OpenAI is going to negotiate a partnership/integration with <legacy business software X>, let alone internal software at medium to large size corporations. If OpenAI (or Anthropic) can solve the general problem, "do arbitrary work task at computer", the size of the prize is enormous.

samvher

A bit like humanoid robotics - not the most efficient, cheapest, easiest etc, but highly compatible with existing environments designed for humans and hence can be integrated very generically

brap

This is true, but what would make sense to me is if "Operator" were just another app on this platform, kind of like Safari is just another app on your iPhone that lets you use services that don't have iOS apps.

When iPhones first came out I had to use Safari all the time. Now almost everything has an app. The long tail is getting shorter.

You can even have several Operator-y apps to choose from! And they can work across different LLMs!

_rupertius

That's specifically what I'm working on at Unternet [1], based on observing the same issue while working at Adept. It seems absurd that in the future we'll have developers building full GUI apps that users never see, because they're being used by GPU-crunching vision models, which then in turn create their own interfaces for end-users.

Instead we need apps that have a human interface for users, and a machine interface for models. I've been building web applets [2] as a lightweight protocol on top of the web to achieve this. It's in early stages, but I'm inviting the first projects to start building with it & accepting contributions.

[1]: https://unternet.co/

[2]: https://github.com/unternet-co/web-applets/

archiepeach

You could make a similar argument for self-driving cars. We would have got there quicker if the roads were built from the ground up for automation. You can try to get the world on board to change how they do roads. Or make the computers adapt to any kind of road.

kccqzy

If there are pre-approved standardized actions, it would just be a plain old API; it would not be AGI. It's clear the AI companies are aiming for general computer use, not just coding against pre-approved APIs.

brap

Naturally a "capability" is really just API + prompt.

If your product has a well documented OpenAPI endpoint (not to be confused with OpenAI), then you're basically done as a developer. Just add that endpoint to the "app store", choose your logo, and add your bank account for $$.
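As a rough sketch of that idea: an agent platform could flatten an OpenAPI document into a list of invokable "capabilities." The spec below is entirely made up for illustration - it's not a real service or a real OpenAI API.

```python
# Hedged sketch: turning an OpenAPI document into "capabilities" an agent
# platform could list and invoke. The "Pizza Orders" spec is hypothetical.
spec = {
    "openapi": "3.0.0",
    "info": {"title": "Pizza Orders", "version": "1.0"},
    "paths": {
        "/orders": {
            "post": {
                "operationId": "createOrder",
                "summary": "Place a pizza order",
            }
        },
        "/orders/{id}": {
            "get": {
                "operationId": "getOrder",
                "summary": "Check order status",
            }
        },
    },
}

def capabilities(openapi_spec: dict) -> list[dict]:
    """Flatten an OpenAPI spec's paths into a list of callable operations."""
    caps = []
    for path, methods in openapi_spec.get("paths", {}).items():
        for method, op in methods.items():
            caps.append({
                "method": method.upper(),
                "path": path,
                "operationId": op.get("operationId"),
                "summary": op.get("summary", ""),
            })
    return caps

for cap in capabilities(spec):
    print(f"{cap['method']} {cap['path']}: {cap['summary']}")
```

In this model, "add your endpoint to the app store" amounts to registering the spec; the prompt layer describes when each operation should be used.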

maxwells-daemon

Maybe there's a middle ground: a site that wants to work as well as possible for agents could present a stripped-down standardized page depending on the user agent string, while the agent tries to work well even for pages that haven't implemented that interface?

(or, perhaps, agents could use web accessibility tools if they're set up, incentivizing developers to make better use of them)
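A server-side branch like the one described might look roughly like this. The agent markers and markup below are hypothetical - there's no standard User-Agent convention for AI agents yet.

```python
# Hypothetical sketch: serve a stripped-down, machine-friendly view when
# the client's User-Agent looks like an AI agent. The marker strings are
# invented for illustration, not real agent identifiers.
AGENT_MARKERS = ("operator", "claude-computer-use", "ai-agent")

def render_page(user_agent: str, product: dict) -> str:
    """Return full HTML for humans, or a minimal structured view for agents."""
    ua = user_agent.lower()
    name, price = product["name"], product["price"]
    if any(marker in ua for marker in AGENT_MARKERS):
        # Minimal markup: no ads, no layout chrome, stable data attributes.
        return (
            "<main data-agent-view>"
            f"<h1>{name}</h1>"
            f"<p data-price>{price}</p>"
            '<button data-action="add-to-cart">Add to cart</button>'
            "</main>"
        )
    # Humans get the normal page (template rendering elided here).
    return f"<html><body>...full page for {name}...</body></html>"

page = render_page(
    "Mozilla/5.0 (compatible; ai-agent/1.0)",
    {"name": "Espresso Beans", "price": "$14"},
)
```

The accessibility-tree idea has the same appeal without a second code path: well-labeled ARIA roles already give agents a structured view of the page.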

alach11

I don't know if I'm ready to hand over my grocery shopping (or date night planning) to an agent. But if pricing is reasonable, this could be a powerful alternative to normal RPA.

Instead of hardcoding some automation using Selenium, this would be a great option for automating repetitive tasks with legacy business software, which often lacks modern APIs.

celestialcheese

Locked behind their $200/mo plan - definitely too much for me with the accuracy they're showing.

mynameisvlad

For now, as a research preview. It isn't a stretch to think that it'll slowly be rolled out to their other plans.

ks2048

How are online advertising companies (including Google) going to react if more and more internet browsing is done by AI agents?

janwilmake

I strongly believe we need to use Open APIs for agents. OpenAPI is the perfect specification standard that would allow for an open world and an open internet for agents.

When OpenAI first came out with their first version of GPTs, it was all based on open APIs.

Now they are moving away from it more and more. This means they want to control the market because they don't want to base it on an open standard.

It's such a shame!

nycdatasci

Models will eventually be interface agnostic and they will cover all interfaces that are commonly used by individuals and organizations. It won't matter whether you have a nicely documented public API, a traditional website, or a phone interface to customer support.

WA

It will never happen. Same reason why we post screenshots from social network A in social network B. Many don’t even want to put in the simplest of all APIs: a simple link to an external website.

As long as people make money from meatspace eyeballs looking at banners, these agents will be actively blocked or restricted just like all other scrapers.

_jayhack_

Unfortunately a lot of the things we want agents to interact with don't expose neat APIs. Computer use and, eventually, physical locomotion are necessary for unlocking agent interactivity with the real world.

OoTheNigerian

I'm surprised folks on Hackernews are always critical of V1s.

In 18 months, apps will have APIs for "agentic browsing" ™OoTheNigerian ;)

And you will not need to give anything control over your browser. You will merely connect your app to OpenAI or any other client.

minimaxir

OpenAI is a $50B company that should be releasing serious products, the "scrappy hacker releasing a beta product that doesn't do much" as a defense doesn't apply.

darioush

Yeah, I also wonder how web scraping came to be so vilified in every ToS - but I guess if you spend a lot of energy on GPUs and pay OpenAI, then it's legit.

ActorNightly

When 4o came out with its chain of thought, people thought this was it. And today, nobody really cares. It's just another LLM.

Same thing with this.

The other day I was writing some code to compute some geometric angles, and I was getting two different results for what I thought was the same angle, but in fact I didn't realize that these angles should not be equivalent. No LLM was able to tell me the issue; they just said to double-check my work.
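(A hypothetical illustration of that kind of trap - not the commenter's actual code: two common ways to compute "the angle" between vectors look equivalent but aren't, because one is unsigned and the other is signed.)

```python
import math

# acos of the normalized dot product always returns an unsigned angle in
# [0, pi], while atan2 of the cross and dot products returns a signed
# angle in (-pi, pi]. For a clockwise turn they disagree in sign.

def angle_acos(ax, ay, bx, by):
    dot = ax * bx + ay * by
    return math.acos(dot / (math.hypot(ax, ay) * math.hypot(bx, by)))

def angle_atan2(ax, ay, bx, by):
    cross = ax * by - ay * bx
    dot = ax * bx + ay * by
    return math.atan2(cross, dot)

a, b = (1, 0), (0, -1)  # b is a quarter-turn clockwise from a
print(angle_acos(*a, *b))   # pi/2
print(angle_atan2(*a, *b))  # -pi/2
```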

willmarch

4o models don't have chain of thought - are you thinking of o1, perhaps?