ChatGPT agent: bridging research and action
294 comments · July 17, 2025 · twalkz
Aurornis
> how it normally takes him 4 to 8 hours to put together complicated, data-heavy reports. Now he fires off an agent request, goes to walk his dog, and comes back to a downloadable spreadsheet of dense data, which he pulls up and says "I think it got 98% of the information correct...
This is where the AI hype bites people.
A great use of AI in this situation would be to automate the collection and checking of data. Search all of the data sources and aggregate links to them in an easy place. Use AI to search the data sources again and compare against the spreadsheet, flagging any numbers that appear to disagree.
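A rough sketch of that compare-and-flag step (assuming both the agent's report and the independently re-checked figures land in CSVs sharing a key column; the file names, columns, and tolerance here are made up for illustration):

```python
import pandas as pd

# Hypothetical inputs: the agent-built report and an independent
# re-extraction of the same figures, keyed by the same identifier.
report = pd.read_csv("agent_report.csv")     # columns: metric_id, value
recheck = pd.read_csv("source_recheck.csv")  # columns: metric_id, value

merged = report.merge(recheck, on="metric_id", suffixes=("_report", "_source"))

# Flag rows where the two numbers disagree beyond a small tolerance,
# so a human reviews only the suspicious rows instead of everything.
tolerance = 1e-6
merged["mismatch"] = (merged["value_report"] - merged["value_source"]).abs() > tolerance

print(merged[merged["mismatch"]])
```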
Yet the AI hype train takes this all the way to the extreme conclusion of having AI do all the work for them. The quip about 98% correct should be a red flag for anyone familiar with spreadsheets, because it’s rarely simple to identify which 2% is incorrect without reviewing everything.
This same problem extends to code. People who use AI as a force multiplier, reviewing each step as they go and dropping back to manual work when it’s more appropriate, get much better results. The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.
jfarmer
From John Dewey's Human Nature and Conduct:
“The fallacy in these versions of the same idea is perhaps the most pervasive of all fallacies in philosophy. So common is it that one questions whether it might not be called the philosophical fallacy. It consists in the supposition that whatever is found true under certain conditions may forthwith be asserted universally or without limits and conditions. Because a thirsty man gets satisfaction in drinking water, bliss consists in being drowned. Because the success of any particular struggle is measured by reaching a point of frictionless action, therefore there is such a thing as an all-inclusive end of effortless smooth activity endlessly maintained.
It is forgotten that success is success of a specific effort, and satisfaction the fulfillment of a specific demand, so that success and satisfaction become meaningless when severed from the wants and struggles whose consummations they are, or when taken universally.”
slg
The proper use of these systems is to treat them like an intern or new grad hire. You can give them the work that none of the mid-tier or senior people want to do, thereby speeding up the team. But you will have to review their work thoroughly because there is a good chance they have no idea what they are actually doing. If you give them mission-critical work that demands accuracy or just let them have free rein without keeping an eye on them, there is a good chance you are going to regret it.
dimitri-vs
An overly eager intern with short term memory loss, sure.
OtherShrezzing
I’ve never experienced an intern who was remotely as mediocre and incapable of growth as an LLM.
chatmasta
Yeah, people complaining about accuracy of AI-generated code should be examining their code review procedures. It shouldn’t matter if the code was generated by a senior employee, an intern, or an LLM wielded by either of them. If your review process isn’t catching mistakes, then the review process needs to be fixed.
This is especially true in open source where contributions aren’t limited to employees who passed a hiring screen.
ivape
”The people who YOLO it with prompting cycles until the code passes tests and then submit a PR are causing problems almost as fast as they’re developing new features in non-trivial codebases.”
This might as well be the new definition of “script kiddie”, and it’s the kids that are literally going to be the ones birthed into this lifestyle. The “craft” of programming may not be carried by these coming generations and possibly will need to be rediscovered at some point in the future. The Lost Art of Programming is a book that’s going to need to be written soon.
NortySpock
Oh come on, people have been writing code with bad, incomplete, flaky, or absent tests since automated testing was invented (possibly before).
It's having a good, useful and reliable test suite that separates the sheep from the goats.*
Would you rather play whack-a-mole with regressions and Heisenbugs, or ship features?
* (Or you use some absurdly good programming language that is hard to get into knots with. I've been liking Elixir. Gleam looks even better...)
maxlin
The act of trying to make that 2% appear minimal and dismissible seems, at times, almost like a mass psychosis in the AI world.
A few comparisons:
> Pressing the button: $1
> Knowing which button to press: $9,999

Those 2% copy-paste changes are the $9,999 and might take as long to find as the rest of the work.
Also: SCE to AUX.
ricardobayes
Of course, the Pareto principle is at work here. In an adjacent field, self-driving, they have been working on the last "20%" for almost a decade now. It feels kind of odd that almost no one is talking about self-driving now, compared to how hot a topic it used to be, with a lot of deep, moral, almost philosophical discussions.
satvikpendem
> The first 90 percent of the code accounts for the first 90 percent of the development time. The remaining 10 percent of the code accounts for the other 90 percent of the development time.
— Tom Cargill, Bell Labs
stpedgwdgfhgdd
In my experience with enterprise software engineering, at this stage we are able to shrink coding time by ~20%, depending on the kind of code/tests.
However, CI/CD remains tricky. In fact, when AI agents start building autonomously, merge trains become a necessity…
simantel
> It feels kind of odd that almost no one is talking about self-driving now, compared to how hot of a topic it used to be
Probably because it's just here now? More people take Waymo than Lyft each day in SF.
imiric
It's "here" if you live in a handful of cities around the world, and travel within specific areas in those cities.
Getting this tech deployed globally will take another decade or two, optimistically speaking.
joe_the_user
Well, if we say these systems are here, it still took 10+ years between prototype and operational system.
And as I understand it: these are systems, not individual cars that are intelligent and just decide how to drive from immediate input. These systems still require some number of human wranglers and worst-case drivers, and there's a lot of special-purpose code rather than nothing-but-neural-network, etc.
Which is to say, "AI"/neural nets are important technology that can achieve things, but they can give the illusion of doing everything instantly by magic, and they generally don't.
danny_codes
It’s past the hype curve and into the trough of disillusionment. Over the next 5, 10, 15 years (who can say?) the tech will mature out of the trough into general adoption.
GenAI is the exciting new tech currently riding the initial hype spike. This will die down into the trough of disillusionment as well, probably sometime next year. Like self-driving, people will continue to innovate in the space and the tech will be developed towards general adoption.
We saw the same during crypto hype, though that could be construed as more of a snake oil type event.
ameliaquining
The Gartner hype cycle assumes a single fundamental technical breakthrough, and describes the process of the market figuring out what it is and isn't good for. This isn't straightforwardly applicable to LLMs because the question of what they're good for is a moving target; the foundation models are actually getting more capable every few months, which wasn't true of cryptocurrency or self-driving cars. At least some people who overestimate what current LLMs can do won't have the chance to find out that they're wrong, because by the time they would have reached the trough of disillusionment, LLM capabilities will have caught up to their expectations.
If and when LLM scaling stalls out, then you'd expect a Gartner hype cycle to occur from there (because people won't realize right away that there won't be further capability gains), but that hasn't happened yet (or if it has, it's too recent to be visible yet) and I see no reason to be confident that it will happen at any particular time in the medium term.
If scaling doesn't stall out soon, then I honestly have no idea what to expect the visibility curve to look like. Is there any historical precedent for a technology's scope of potential applications expanding this much this fast?
bugbuddy
Liquidity in search of the biggest holes in the ground. Whoever can dig the biggest holes wins. Why or what you get out of digging the holes? Who cares.
dingnuts
The critics of the current AI buzz certainly have been drawing comparisons to self driving cars as LLMs inch along with their logarithmic curve of improvement that's been clear since the GPT-2 days.
Whenever someone tells me how these models are going to make white-collar professions obsolete in five years, I remind them that the people making these predictions 1) said we'd have self-driving cars "in a few years" back in 2015, and 2) started making the predictions about white-collar professions in 2022, so five years from when?
n2d4
> said we'd have self driving cars "in a few years" back in 2015
And they wouldn't have been too far off! Waymo became L4 self-driving in 2021, and has been transporting people in the SF Bay Area without human supervision ever since. There are still barriers — cost, policies, trust — but the technology certainly is here.
ishita159
I think people don't realize how much models have to extrapolate still, which causes hallucinations. We are still not great at giving all the context in our brain to LLMs.
There's still a lot of tooling to be built before it can start completely replacing anyone.
doctorpangloss
Okay, but the experts saying self driving cars were 50 years out in 2015 were wrong too. Lots of people were there for those speeches, and yet, even the most cynical take on Waymo, Cruise and Zoox’s limitations would concede that the vehicles are autonomous most of the time in a technologically important way.
There’s more to this than “predictions are hard.” There are very powerful incentives to eliminate driving and bloated administrative workforces. This is why we don’t have flying cars: lack of demand. But for “not driving?” Nobody wants to drive!
samtp
This is the exact same issue that I've had trying to use LLMs for anything that needs to be precise such as multi-step data pipelines. The code it produces will look correct and produce a result that seems correct. But when you do quality checks on the end data, you'll notice that things are not adding up.
So then you have to dig into all this overly verbose code to identify the 3-4 subtle flaws with how it transformed/joined the data. And these flaws take as much time to identify and correct as just writing the whole pipeline yourself.
torginus
I'll get into hot water with this, but I still think LLMs do not think like humans do - as in, the code is not the result of trying to recreate a correct thought process in a programming language, but some sort of statistically most likely string that matches the input requirements.
I used to have a non-technical manager like this - he'd watch out for the words I (and other engineers) said and in what context, and would repeat them back mostly in accurate word contexts. He sounded remarkably like he knew what he was talking about, but would occasionally make a baffling mistake - like mixing up CDN and CSS.
LLMs are like this. I often see Cursor with Claude making the same kind of strange mistake, only to catch itself in the act and fix the code (but what happens when it doesn't?).
vidarh
I think that if people say LLMs can never be made to think, that is bordering on a religious belief - it'd require humans to exceed the Turing computable (note also that saying they never can is very different from believing current architectures never will - it's entirely reasonable to believe it will take architectural advances to make it practically feasible).
But saying they aren't thinking yet or like humans is entirely uncontroversial.
Even most maximalists would agree at least with the latter, and the former largely depends on definitions.
As someone who uses Claude extensively, I think of it almost as a slightly dumb alien intelligence - it can speak like a human adult, but makes mistakes a human adult generally wouldn't, and that combination breaks the heuristics we use to judge competency and often leads people to overestimate these models.
Claude writes about half of my code now, so I'm overall bullish on LLMs, but it saves me less than half of my time.
The savings improve as I learn how to better judge what it is competent at, and where it merely sounds competent and needs serious guardrails and oversight, but there's certainly a long way to go before it'd make sense to argue they think like humans.
marcellus23
I don't think you'll get into hot water for that. Anthropomorphizing LLMs is an easy way to describe and think about them, but anyone serious about using LLMs for productivity is aware they don't actually think like people, and run into exactly the sort of things you're describing.
MattSayar
I just wrote a post on my site where the LLM had trouble with 1) clicking a button, 2) taking a screenshot, 3) repeat. The non-deterministic nature of LLMs is both a feature and a bug. That said, read/correct can sometimes be a preferable workflow to create/debug, especially if you don't know where to start with creating.
nemomarx
I think it's basically equivalent to giving that prompt to a low-paid contract coder and hoping their solution works out. At least the turnaround time is faster?
But normally you would want a more hands-on back and forth to ensure the requirements actually capture everything, plus validation that the results are good, layers of review, and so on, right?
samtp
It seems to be a mix between hiring an offshore/low level contractor and playing a slot machine. And by that I mean at least with the contractor you can pretty quickly understand their limitations and see a pattern in the mistakes they make. While an LLM is obviously faster, the mistakes are seemingly random so you have to examine the result much more than you would with a contractor (if you are working on something that needs to be exact).
sethops1
If the contractor is producing unusable code, they won't be my contractor anymore.
stpedgwdgfhgdd
In my experience, using small steps and a lot of automated tests works very well with CC. Don’t go for these huge prompts that have a complete feature in them.
Remember the title “attention is all you need”? Well you need to pay a lot of attention to CC during these small steps and have a solid mental model of what it is building.
Fomite
98% correct spreadsheets are going to get so many papers retracted.
chairmansteve
Yes. Any success I have had with LLMs has been by micromanaging them. Lots of very simple instructions, look at the results, correct them if necessary, then next step.
2oMg3YWV26eKIs
The security risks with this sound scary. Let's say you give it access to your email and calendar. Now it knows all of your deepest secrets. The linked article acknowledges that prompt injection is a risk for the agent:
> Prompt injections are attempts by third parties to manipulate its behavior through malicious instructions that ChatGPT agent may encounter on the web while completing a task. For example, a malicious prompt hidden in a webpage, such as in invisible elements or metadata, could trick the agent into taking unintended actions, like sharing private data from a connector with the attacker, or taking a harmful action on a site the user has logged into.
A malicious website could trick the agent into divulging your deepest secrets!
I am curious about one thing -- the article mentions the agent will ask for permission before doing consequential actions:
> Explicit user confirmation: ChatGPT is trained to explicitly ask for your permission before taking actions with real-world consequences, like making a purchase.
How does the agent know a task is consequential? Could it mistakenly make a purchase without first asking for permission? I assume it's AI all the way down, so I assume mistakes like this are possible.
threecheese
Many of us have been partitioning our “computing” life into public and private segments, for example for social media, job search, or blogging. Maybe it’s time for another segment somewhere in the middle?
Something like lower risk private data, which could contain things like redacted calendar entries, de-identified, anonymized, or obfuscated email, or even low-risk thoughts, journals, and research.
I am worried; I barely use ChatGPT for anything that could come back to hurt me later, like medical or psychological questions. I hear that lots of folks are finding utility here but I’m reticent.
DanHulton
There is almost guaranteed going to be an attack along the lines of prompt-injecting a calendar invite. Those things are millions of lines long already, with tons of auto-generated text that nobody reads. Embed your injection in the middle of boring text describing the meeting prerequisites and it's as good as written in a transparent font. Then enjoy exfiltrating your victim's entire calendar and who knows what else.
WXLCKNO
In the system I'm building the main agent doesn't have access to tools and must call scoped down subagents who have one or two tools at most and always in the same category (so no mixed fetch and calendar tools). They must also return structured data to the main agent.
I think that kind of isolation is necessary even though it's a bit more costly. However since the subagents have simple tasks I can use super cheap models.
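A minimal sketch of that isolation pattern, with stubbed tools and a hard-coded orchestrator standing in for the real model calls (all names here are made up, not any particular framework's API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubAgent:
    """A narrowly scoped agent: one or two tools from a single category."""
    name: str
    tools: dict[str, Callable[[str], str]]

    def run(self, task: str) -> dict:
        # In the real system a cheap model would pick the tool and arguments;
        # here we just call the only tool to keep the sketch short.
        tool_name, tool = next(iter(self.tools.items()))
        # Only structured data goes back to the orchestrator, never tool access.
        return {"agent": self.name, "tool": tool_name, "result": tool(task)}

# Stub tools: each subagent sees only its own category (no mixing fetch + calendar).
def fetch_url(query: str) -> str:
    return f"<fetched page for {query!r}>"

def read_calendar(query: str) -> str:
    return f"<calendar entries matching {query!r}>"

fetch_agent = SubAgent("fetcher", {"fetch_url": fetch_url})
calendar_agent = SubAgent("calendar", {"read_calendar": read_calendar})

# The main agent holds no tools; it only routes tasks and reads structured results.
def main_agent(task: str) -> list[dict]:
    return [fetch_agent.run(task), calendar_agent.run(task)]

print(main_agent("dinner options Friday"))
```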
pradn
I can't imagine voluntarily giving access to my data and also being "scared". Maybe a tad concerned, but not "scared".
crowcroft
Almost anyone can add something to people's calendars as well (of course people don't accept random invites but they can appear).
If this kind of agent becomes widespread, hackers would be silly not to send out phishing email invites that simply contain the prompts they want to inject.
0xDEAFBEAD
Anthropic found the simulated blackmail rate of GPT-4.1 in a test scenario was 0.8
https://www.anthropic.com/research/agentic-misalignment
"Agentic misalignment makes it possible for models to act similarly to an insider threat, behaving like a previously-trusted coworker or employee who suddenly begins to operate at odds with a company’s objectives."
FergusArgyll
I agree with the scariness etc. Just one possibly comforting point.
I assume (hope?) they use more traditional classifiers for determining importance (in addition to the model's judgment). Those are much more reliable than LLMs and much cheaper to run, so I assume they run many of them.
AgentMatrixAI
I'm not so optimistic, as someone who works on agents for businesses and builds tools for them. The leap from the low 90s to 99% is a classic last-mile problem for LLM agents. The more generic and broad an agent is (can-do-it-all), the more likely it will fail and disappoint.
Can't help but feel many are optimizing happy paths in their demos and hiding the true reality. Doesn't mean there isn't a place for agents but rather how we view them and their potential impact needs to be separated from those that benefit from hype.
just my two cents
lairv
In general, most of the previous AI "breakthroughs" in the last decade were backed by proper scientific research and ideas:
- AlphaGo/AlphaZero (MCTS)
- OpenAI Five (PPO)
- GPT 1/2/3 (Transformers)
- Dall-e 1/2, Stable Diffusion (CLIP, Diffusion)
- ChatGPT (RLHF)
- SORA (Diffusion Transformers)
"Agents" is a marketing term and isn't backed by anything. There is little data available, so it's hard to have generally capable agents in the sense that LLMs are generally capable
mumbisChungo
My personal framing of "Agents" is that they're more like software robots than they are an atomic unit of technology. Composed of many individual breakthroughs, but ultimately a feat of design and engineering to make them useful for a particular task.
lossolo
Yep. Agents are only powered by clever use of training data, nothing more. There hasn't been a real breakthrough in a long time.
skywhopper
Not even well-optimized. The demos in the related sit-down chat livestream video showed an every-baseball-park-trip planner report that drew a map with seemingly random lines that missed the east coast entirely, leapt into the Gulf of Mexico, and was generally complete nonsense. This was a pre-recorded demo being live-streamed with Sam Altman in the room, and that’s what they chose to show.
risyachka
>> many are optimizing happy paths in their demos and hiding the true reality
Yep. This is literally what every AI company does nowadays.
ankit219
Seen this happen many times with current agent implementations. With RL (and provided you have enough use-case data) you can get to high accuracy on many of these shortcomings. Most problems arise from the fact that prompting is not the most reliable mechanism and is brittle. Teaching a model on specific tasks helps negate those issues and overall results in a better automation outcome, without devs having to make so much effort to go from 90% to 99%. Another way to do it is parallel generation and then identifying at runtime which answer seems most correct (majority voting or LLM as a judge).
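A sketch of that parallel-generation idea with simple majority voting (the generate() function is a placeholder for whatever model call is in use; free-form outputs would need an LLM-as-a-judge step instead of exact-match counting):

```python
import collections

def generate(prompt: str, attempt: int) -> str:
    # Placeholder: call your model here, e.g. with a different seed
    # or temperature per attempt.
    raise NotImplementedError

def majority_vote(prompt: str, n: int = 5) -> str:
    """Run the same task n times and keep the most common answer.

    Assumes answers are short and comparable as strings (a number, a label,
    a SQL snippet after normalization)."""
    answers = [generate(prompt, attempt=i) for i in range(n)]
    best, _count = collections.Counter(answers).most_common(1)[0]
    return best
```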
I agree with you on the hype part. Unfortunately, that is the reality of current Silicon Valley. Hype gets you noticed, and gets you users. Hype propels companies forward, so it is here to stay.
wslh
> Can't help but feel many are optimizing happy paths in their demos and hiding the true reality.
Even with the best intentions, this feels similar to when a developer hands off code directly to the customer without any review, or QA, etc. We all know that what a developer considers "done" often differs significantly from what the customer expects.
BolexNOLA
>The more generic and spread an agent is (can-do-it-all) the more likely it will fail and disappoint.
To your point - the most impressive AI tool (not an LLM, but bear with me) I have used to date, and I loathe giving Adobe any credit, is Adobe's Audio Enhance tool. It has brought back audio that I previously would have thrown out or, if the client was lucky, charged thousands of dollars and spent weeks repairing to get it half as good as what that thing spits out in minutes. Not only is it good at salvaging terrible audio, it can make mediocre Zoom audio sound almost like it was recorded in a proper studio. It is truly magic to me.
Warning: don't feed it music lol it tries to make the sounds into words. That being said, you can get some wild effects when you do it!
bredren
This solves a big issue for existing CLI agents, which is session persistence for users working from their own machines.
With claude code, you usually start it from your own local terminal. Then you have access to all the code bases and other context you need and can provide that to the AI.
But when you shut your laptop, or have network availability changes the show stops.
I've solved this somewhat on MacOS using the app Amphetamine which allows the machine to go about its business with the laptop fully closed. But there are a variety of problems with this, including heat and wasted battery when put away for travel.
Another option is to just spin up a cloud instance and pull the same repos to there and run claude from there. Then connect via tmux and let loose.
But there are (perhaps easy to overcome) UX issues with getting context up to it that you just don't have if it is running locally.
The sandboxing maybe offers some sense of security (again, something that could possibly be handled by executing claude with a specially permissioned user role), which someone with John's use case in the video might want.
---
I think it's interesting to see OpenAI trying to crack the agent UX, possibly for a user type (non-developer) that would appreciate its capabilities just as much but not need the ability to install any Python package on the fly.
htrp
Run dev on an actual server somewhere that doesn't shut down
threecheese
Any thoughts on using Mosh here, for client connection persistence? Could Claude Code (et al.) be orchestrated via SSH?
twosdai
You know, normally I am against doing this, but for Claude Code that is a very good use case.
The latency used to really bother me, but if Claude does 99% of the typing, it's a good idea.
ddp26
Predicted by the AI 2027 team in early April:
> Mid 2025: Stumbling Agents. The world sees its first glimpse of AI agents.
> Advertisements for computer-using agents emphasize the term “personal assistant”: you can prompt them with tasks like “order me a burrito on DoorDash” or “open my budget spreadsheet and sum this month’s expenses.” They will check in with you as needed: for example, to ask you to confirm purchases. Though more advanced than previous iterations like Operator, they struggle to get widespread usage.
superconduct123
Predicting 4 months into the future is not really that impressive.
OtherShrezzing
Especially when the author personally knows the engineers working on the features, and routinely goes to parties with them. And when you consider that Altman said last year that “2025 will be the agentic year”
bigyabai
It was common knowledge that big corps were working on agent-type products when that report was written. Hardly much of a prediction, let alone any sort of technical revolution.
divan
And I'm still waiting for the simple feature – the ability to edit documents in projects.
I use projects for working on different documents - articles, research, scripts, etc. - and would absolutely love to write them paragraph by paragraph with ChatGPT's help for phrasing, drawing on the project knowledge. Or using voice mode - i.e. on a walk: "Hey, where did we finish that document - let's continue. Read the last two paragraphs to me... Okay, I want to elaborate on ...".
I feel like AI agents for coding are advancing at a breakneck speed, but assistance in writing is still limited to copy-pasting.
BolexNOLA
>I feel like AI agents for coding are advancing at a breakneck speed, but assistance in writing is still limited to copy-pasting.
Man, I was talking about this with a colleague 30 min ago. Half the time I can't be bothered to open ChatGPT and do the copy/paste dance. I know that sounds ridiculous, but roundtripping gets old and breaks my flow. Working in NLEs with plug-ins, VTTs, etc. has spoiled me.
msgodel
It's crazy. Aider has been able to do this forever using free models but none of these companies will even let you pay for it in a phone/web app. I almost feel like I should start building my own service but I know any day now they'd offer it and I'd have wasted all that effort.
pants2
I've been using OpenAI operator for some time - but more and more websites are blocking it, such as LinkedIn and Amazon. That's two key use-cases gone (applying to jobs and online shopping).
Operator is pretty low-key, but once Agent starts getting popular, more sites will block it. They'll need to allow a proxy configuration or something like that.
bijant
THIS is the main problem. I was listening the whole time for them to announce a way to run it locally, or at least proxy through your local devices. Alas, the DeepSeek R1 distillation experience they went through (a bit like when Steve Jobs was fuming at Google for getting Android to market so quickly) made them wary of showing too many intermediate results, tricks, etc.

Even in the very beginning, Operator v1 was unable to access many sites that blocked data-center IPs, and while I went through the effort of patching in a hacky proxy setup to actually test real-world performance, they later locked it down even further without improving performance at all. Even when it's working, it's basically useless, and it's not working now and only getting worse. Either they make some kind of deal with eastdakota (which he is probably too savvy to agree to) or they can basically forget about doing web browsing directly from their servers.

Considering that all non-web applications of "computer use" greatly benefit from local files and software (which you already have the license for!), the whole concept appears to be on the road to failure. Having their remote computer-use agent perform most stuff via the CLI is actually really funny when you remember that computer-use advocates used to claim the whole point was NOT to rely on "outdated" pre-GUI interfaces.
burningion
This is why an on device browser is coming.
It'll let the AI platforms get around any other platform blocks by hijacking the consumer's browser.
And it makes total sense, but hopefully everyone else has done the game theory at least a step or two beyond that.
ghm2180
You mean like Claude Code's integration with Playwright?
achrono
In typical SV style, this is just to throw it out there and let second order effects build up. At some point I expect OpenAI to simply form a partnership with LinkedIn and Amazon.
In fact, I suspect LinkedIn might even create a new tier that you'd have to use if you want to use LinkedIn via OpenAI.
gitgud
Why would platforms like LinkedIn want this? Bots have never been good for social media…
mountainriver
We have a similar tool that can get around any of this, we built a custom desktop that runs on residential proxies. You can also train the agents to get better at computer tasks https://www.agenttutor.com/
FergusArgyll
If people will actually pay for stuff (food, clothing, flights, whatever) through this agent or operator, I see no reason Amazon etc would continue to block them.
pants2
I was buying plenty of stuff through Amazon before they blocked Operator. Now I sometimes buy through other sites that allow it.
The most useful for me was: "here's a picture of a thing I need a new one of, find the best deal and order it for me. Check coupon websites to make sure any relevant discounts are applied."
To be honest, if Amazon continues to block "Agent Mode" and Walmart or another competitor allows it, I will be canceling Prime and moving to that competitor.
FergusArgyll
Right, but there were so few people using Operator to buy stuff that it's easier to just block ~all data center IP addresses. If this becomes a "thing" (remains to be seen, for sure) then that becomes a significant revenue stream you're giving up on. Companies don't block bots because they're speciesist; it's because bots usually cost them money - if that changes, I assume they'll allow known ChatGPT agent IP addresses.
modeless
Agents respecting robots.txt is clearly going to end soon. Users will be installing browser extensions or full browsers that run the actions on their local computer with the user's own cookie jar, IP address, etc.
pants2
I hope agents.txt becomes standard and websites actually start to build agent-specific interfaces (or just have API docs in their agents.txt). In my mind it's different from "robots", which is meant to apply rules to broad web-scraping tools.
modeless
I hope they don't build agent-specific interfaces. I want my agent to have the same interface I do. And even more importantly, I want to have the same interface my agent does. It would be a bad future if the capabilities of human and agent interfaces drift apart and certain things are only possible to do in the agent interface.
tomashubelbauer
I wonder how many people will think they are being clever by using the Playwright MCP or browser extensions to bypass robots.txt on the sites blocking the direct use of ChatGPT Agent and will end up with their primary Google/LinkedIn/whatever accounts blocked for robotic activity.
falcor84
I don't know how others are using it, but when I ask Claude to use playwright, it's for ad-hoc tasks which look nothing like old school scraping, and I don't see why it should bother anyone.
arkmm
Automating applying to jobs makes sense to me, but what sorts of things were you hoping to use Operator on Amazon for?
pants2
Finding, comparing, and ordering products -- I'd ask it to find 5 options on Amazon and create a structured table comparing key features I care about along with price. Then ask it to order one of them.
Topfi
Whilst we have seen other implementations of this (providing a VPS to an LLM), this does have a distinct edge over the others in the way it presents itself. The UI shown, with the text overlay, readable mouse and tailored UI components, looks very visually appealing and lends itself well to keeping users informed on what is happening and why at every stage. I have to tip my hat to OpenAI's UI team here; this is a really great implementation, and I always get rather fascinated whenever I see LLMs being implemented in a visually informative and distinctive manner that goes beyond established metaphors.
Comparing it to the Claude+XFCE solutions we have seen by some providers, I see little in the way of a functional edge OpenAI has at the moment, but the presentation is so well thought out that I can see this being more pleasant to use purely due to that. Many times with the mentioned implementations, I struggled with readability. Not afraid to admit that I may borrow some of their ideas for a personal project.
alach11
It's very hard for me to imagine the current level of agents serving a useful purpose in my personal life. If I ask this to plan a date night with my wife this weekend, it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc. This is a lot of stuff it has to get right, and it requires a lot of trust!
I'm excited that this capability is getting close, but I think the current level of performance mostly makes for a good demo and isn't quite something I'm ready to adopt into daily life. Also, OpenAI faces a huge uphill battle with all the integrations required to make stuff like this useful. Apple and Microsoft are in much better spots to make a truly useful agent, if they can figure out the tech.
levocardia
Maybe this is the "bitter lesson of agentic decisions": hard things in your life are hard because they involve deeply personal values and complex interpersonal dynamics, not because they are difficult in an operational sense. Calling a restaurant to make a reservation is trivial. Deciding what restaurant to take your wife to for your wedding anniversary is the hard part (Does ChatGPT know that your first date was at a burger-and-shake place? Does it know your wife got food poisoning the last time she ate sushi?). Even a highly paid human concierge couldn't do it for you. The Navier–Stokes smoothness problem will be solved before "plan a birthday party for my daughter."
sponnath
I would even argue the hard parts of being human don't even need to be automated. Why are we all in a rush to automate everything, including what makes us human?
nemomarx
Well, people do have personal assistants and concierges, so it can be done? But I think they need a lot of time and personal attention from you to get that useful. They need to remember everything you've mentioned offhand or take little corrections consistently.
It seems to me like you have to reset the context window on LLMs way more often than would be practical for that.
jacooper
I think it's doable with the current context windows we have; the issue is the LLM needs to listen passively to a lot of things in our lives, and we have to trust the providers with such an insane amount of data.
I think Google will excel at this because their ad targeting does this already; they just need to adapt it so an LLM can use that data as well.
jstummbillig
> hard things in your life are hard because they involve deeply personal values and complex interpersonal dynamics, not because they are difficult in an operational sense
Beautiful
thewebguyd
> It's very hard for me to imagine the current level of agents serving a useful purpose in my personal life. If I ask this to plan a date night with my wife this weekend, it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc. This is a lot of stuff it has to get right, and it requires a lot of trust!
This would be my ideal "vision" for agents, for personal use, and why I'm so disappointed in Apple's AI flop because this is basically what they promised at last year's WWDC. I even tried out a Pixel 9 pro for a while with Gemini and Google was no further ahead on this level of integration either.
But like you said, trust is definitely going to be a barrier to this level of agent behavior. LLMs still get too much wrong, and are too confident in their wrong answers. They are so frequently wrong to the point where even if it could, I wouldn't want it to take all of those actions autonomously out of fear for what it might actually say when it messages people, who it might add to the calendar invites, etc.
ActorNightly
Agents are nothing more than the core chat model with a system prompt, a wrapper that parses responses, executes actions, and puts the results back into the prompt, and a system instruction that lets the model know what it can do.
Nothing is really that advanced yet with agents themselves - no real reasoning going on.
That being said, you can build your own agents fairly straightforwardly. The key is designing the wrapper and the system instructions. For example, you can have a guided chat where it builds up the functionality of looking at your calendar, Google location history, and babysitter booking, and integrates all of that into automatic actions.
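A minimal sketch of that wrapper loop (the chat() function, tool names, and JSON protocol are placeholders, not any particular vendor's API):

```python
import json

def chat(messages: list[dict]) -> str:
    # Placeholder: send the running conversation to whatever chat model you use.
    raise NotImplementedError

# Actions the wrapper is willing to execute on the model's behalf.
def get_calendar(day: str) -> str:
    return f"<events for {day}>"

TOOLS = {"get_calendar": get_calendar}

SYSTEM = (
    "You may request an action by replying with JSON: "
    '{"tool": "<name>", "args": {...}}. '
    "Available tools: get_calendar(day). "
    "Reply with plain text when you have the final answer."
)

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)           # model asked for an action
        except ValueError:
            return reply                       # plain text: final answer
        result = TOOLS[call["tool"]](**call["args"])
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Step limit reached without a final answer."
```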
miles_matthias
I think what's interesting here is that it's a super cheap version of what many busy people already do -- hire a person to help do this. Why? Because the interface is easier and often less disruptive to our life. Instead of hopping from website to website, I'm just responding to a targeted imessage question from my human assistant "I think you should go with this <sitter,restaurant>, that work?" The next time I need to plan a date night, my assistant already knows what I like.
Replying "yes, book it" is way easier than clicking through a ton of UIs on disparate websites.
My opinion is that agents looking to "one-shot" tasks is the wrong UX. It's the async, single simple interface that is way easier to integrate into your life that's attractive IMO.
bGl2YW5j
Yes! I’ve been thinking along similar lines: agents and LLMs are exposing the worst parts of the ergonomics of our current interfaces and tools (eg programming languages, frameworks).
I reckon there’s a lot to be said for fixing or tweaking the underlying UX of things, as opposed to brute forcing things with an expensive LLM.
base698
Similar to what was shown in the video: when I make a large purchase like a home or car, I usually obsess for a couple of years and make a huge spreadsheet to evaluate my decisions. Having an agent get all the spreadsheet data would be a big win. I had some success recently trying that with Manus.
benjaminclauss
This problem particularly interests me.
One of my favorite use cases for these tools is travel where I can get recommendations for what to do and see without SEO content.
This workflow is nice because you can ask specific questions about a destination (e.g., historical significance, benchmark against other places).
ChatGPT struggles with:
- my current location
- the current time
- the weather
- booking attractions and excursions (payments, scheduling, etc.)
There is probably friction here but I think it would be really cool for an agent to serve as a personalized (or group) travel agent.
kenjackson
It has to earn that trust and that takes time. But there are a lot of personal use cases like yours that I can imagine.
For example, I suddenly need to reserve a dinner for 8 tomorrow night. That's a pain for me to do, but if I could give it some basic parameters, I'm good with an agent doing this. Let them make the maybe 10-15 calls or queries needed to find a restaurant that fits my constraints and get a reservation.
macNchz
I see restaurant reservations as an example of an AI agent-appropriate task fairly often, but I feel like it's something that's neither difficult (two or three clicks on OpenTable and I see dozens of options I can book in one more click), nor especially compelling to outsource (if I'm booking something for a group, choosing the place is kind of personal and social—I'm taking everything I know about everybody in the group into account, and I'd likely spend more time downloading that nuance to the agent than I would just scrolling past a few places I know wouldn't work).
brap
>it needs to consult my calendar to pick the best night, pick a bar and restaurant we like (how would it know?), book a babysitter (can it learn who we use and text them on my behalf?), etc
This (and not model quality) is why I’m betting on Google.
bryanhogan
On the one hand this is super cool and maybe very beneficial, something I definitely want to try out.
On the other, LLMs always make mistakes, and when they're this deeply integrated into other systems I wonder how severe those mistakes will be, since they are bound to happen.
gordon_freeman
This.
Recently I uploaded a screenshot of movie showtimes at a specific theatre and asked ChatGPT to find the optimal time for me to watch the movie based on my schedule.
It did confidently find the perfect time and even accounted for factors such as movies in theatres starting 20 mins late due to trailers and ads being shown before the movie starts. The only problem: it grabbed the times from the screenshot totally incorrectly, which messed up all its output. I tried and tried to get it to extract the times accurately, but it didn't, and ultimately, after getting frustrated, I lost trust in its ability. This keeps happening again and again with LLMs.
barbazoo
And this is actually a great use of agents, because they can go and use the movie theater's website to more reliably figure out when movies start. I don't think they're going to feed screenshots into the LLM.
tootyskooty
Honestly might be more indicative of how far behind vision is than anything.
Despite the fact that CV was the first real deep-learning breakthrough, VLMs have been really disappointing. I'm guessing it's in part due to basic interleaved web text+image next-token prediction being a weak signal for developing good image reasoning.
polytely
Is anyone trying to solve OCR? I often think of that annas-archive blog post about how we basically just have to keep shadow libraries alive long enough until the conversion from PDF to plaintext is solved.
https://annas-archive.org/blog/critical-window.html
I hope one of these days one of these incredibly rich LLM companies accidentally solves this or something; it would be infinitely more beneficial to mankind than the awful LLM products they are trying to make.
kurtis_reed
This... what?
SlavikCA
That is the problem. LLMs can't be trusted.
I was searching on HuggingFace for a model that can fit in my system RAM + VRAM. And the way HuggingFace shows the models - a bunch of files, with a size for each file, but no total. I copy-pasted that page to an LLM and asked it to compute the total. Some of the LLMs counted correctly, and some confidently gave me a totally wrong number.
And that's not that complicated question.
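For what it's worth, that particular question has a deterministic answer a short script can compute without any LLM; a sketch using the huggingface_hub client (the repo id is a placeholder):

```python
from huggingface_hub import HfApi

# files_metadata=True asks the Hub to include per-file sizes.
info = HfApi().model_info("some-org/some-model", files_metadata=True)

# Sum the sizes of all files in the repo (size may be None for some entries).
total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"{total_bytes / 1e9:.2f} GB across {len(info.siblings)} files")
```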
ActorNightly
I'm currently working on a way to basically make the LLM spit out any data-processing answer as code, which is then automatically executed and verified, with additional context. So things like hallucinations are reduced pretty much to zero, given that the wrapper will say the model could not determine a real answer.
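A rough sketch of that generate-execute-verify idea (generate_code() is a placeholder for the model call; a real system would sandbox the execution):

```python
import subprocess
import sys
import tempfile

def generate_code(question: str) -> str:
    # Placeholder: prompt the model to answer with a standalone Python script
    # that prints its result, instead of answering in prose.
    raise NotImplementedError

def answer_via_code(question: str) -> str | None:
    code = generate_code(question)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # Execute the generated script; a failure means "no real answer determined".
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return None
    return proc.stdout.strip()
```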
seydor
Also, LLMs' mistakes tend to pile up, multiplying like probabilities. I wonder how scrambled a computer will be after some hours of use.
tomjen3
Based on the live stream, so does OpenAI.
But of course humans make a multitude of mistakes too.
dcre
Very slightly impressed by their emphasis on the gigantic (my word, not theirs) risk of giving the thing access to real creds and sensitive info.
edoloughlin
I'm amazed that I had to scroll this far to find a comment on this. Then again, I don't live in the US.
serjester
It's smart that they're pivoting to using the user's computer directly - managing passwords, access control and not getting blocked was the biggest issue with their operator release. Especially as the web becomes more and more locked down.
> ChatGPT agent's output is comparable to or better than that of humans in roughly half the cases across a range of task completion times, while significantly outperforming o3 and o4-mini.
Hard to know how this will perform in real life, but this could very well be a feel the AGI moment for the broader population.
xnx
Doesn't the very first line say the opposite?
"ChatGPT can now do work for you using its own computer"
The "spreadsheet" example video is kind of funny: guy talks about how it normally takes him 4 to 8 hours to put together complicated, data-heavy reports. Now he fires off an agent request, goes to walk his dog, and comes back to a downloadable spreadsheet of dense data, which he pulls up and says "I think it got 98% of the information correct... I just needed to copy / paste a few things. If it can do 90 - 95% of the time consuming work, that will save you a ton of time"
It feels like either finding that 2% that's off (or dealing with 2% error) will be the time consuming part in a lot of cases. I mean, this is nothing new with LLMs, but as these use cases encourage users to input more complex tasks, that are more integrated with our personal data (and at times money, as hinted at by all the "do task X and buy me Y" examples), "almost right" seems like it has the potential to cause a lot of headaches. Especially when the 2% error is subtle and buried in step 3 of 46 of some complex agentic flow.