Deep Research
189 comments
· February 3, 2025
hi_hi
This is terrifying. Even though they acknowledge the issues with hallucinations/errors, that is going to be completely overlooked by everyone using this, who will then inject the outputs into their own PowerPoints.
Management Consulting was bad enough before the ability to mass produce these graphs and stats on a whim. At least there was some understanding behind the scenes of where the numbers came from, and sources would/could be provided.
The more powerful these tools become, the more prevalent this effect of seepage will become.
autoconfig
Either you care about being correct or you don't. If you don't care then it doesn't matter whether you made it up or the AI did. If you care then you'll fact check before publishing. I don't see why this changes.
azinman2
When things are easy, you’re going to take the easy path even if it means quality goes down. It’s about trade-offs. If you had to do it yourself, perhaps quality would have been higher because you had no other choice.
Lots of kids don’t want to do homework. Previously, many would do it anyway because there wasn’t another choice. But now they can just ask ChatGPT for the answers and write them down verbatim, with zero learning taking place.
Caring isn’t binary, and it doesn’t work in isolation.
hi_hi
Because maybe you want to, but you have a boss breathing down your neck and KPIs to meet and you haven't slept properly in days and just need a win, so you get the AI to put together some impressive-looking graphs and stats for that client showcase that's due in a few hours.
Things aren't quite so black and white in reality.
dauhak
I mean, those same conditions already lead humans to cut corners and make stuff up themselves. You're describing a problem where bad incentives/conditions lead to sloppy work; that happens with or without AI.
Catching errors/validating work is obviously a different process when the output comes from an AI vs. a human, but I don't see how it's fundamentally different here. If the outputs are heavily cited, that might go some way toward making slip-ups easier to catch and correct.
spaceywilly
I think a lot about how differentiating facts and quality content is like differentiating signal from noise in electronics. The signal-to-noise ratio on many online platforms was already quite low. Tools like this will absolutely add more noise, and arguably the nature of the tools themselves makes it harder to separate the noise.
I think this is a real problem for these AI tools. If you can’t separate the signal from the noise, it doesn’t provide any real value, like an out of range FM radio station.
WOTERMEON
Not only that: by publishing noise, you’re lowering the signal/noise ratio.
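A toy illustration of that point, with every number made up: hold the amount of trustworthy content fixed, let generated noise grow, and the ratio collapses fast.

    # Toy model (all numbers assumed): fixed yearly output of trustworthy
    # "signal" documents vs. exponentially growing generated "noise".
    signal = 1_000   # trustworthy documents per year (assumed)
    noise = 10_000   # noise documents in year 0 (assumed)

    for year in range(5):
        print(f"year {year}: SNR = {signal / noise:.5f}")
        noise *= 3   # assume generated content triples each year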
ADeerAppeared
> If you care then you'll fact check before publishing.
Doing a proper fact check is as much work as doing the entire research by hand, and therefore this system is useless to anyone who cares about the result being correct.
> I don't see why this changes.
And because of the above this system should not exist.
layer8
People are much less scrupulous using LLM output than making up stuff themselves, because then they can blame the LLM.
sbarre
How hard it is to produce credible-looking bullshit makes a really big difference in these scenarios.
Consultants aren't the ones doing the fact-checking; that falls to the client, who ironically tends to assume the consultants did it.
scarab92
Think of it like a vaccine.
The majority of human-written consultant reports are already complete rubbish. Low accuracy, low signal-to-noise, generic platitudes in a quantity-over-quality format.
LLMs are inoculating people against this kind of low-information-value content.
People who produce LLM-quality output are now being accused of using LLMs, and can no longer pretend to be adding value.
The result of this is going to be higher quality expectations for consultants and a shaking-out of people who produce word vomit rather than accurate, insightful, contextually relevant information.
layer8
This has been downvoted, but I think there’s actually a chance it might become true (until AGI comes along at least).
tmnvdb
> At least there was some understanding behind the scenes of where the numbers came from, and sources would/could be provided.
Oh, sweet summer child.
cyanydeez
[flagged]
DigitalSea
Not sure if people picked up on it, but this is being powered by the unreleased o3 model, which might explain why it leaps ahead in benchmarks considerably and aligns with the claims that o3 is too expensive to release publicly. It seems to be quite an impressive model, and ahead of the competing efforts from Google, DeepSeek, and Perplexity.
lordofgibbons
> Which might explain why it leaps ahead in benchmarks considerably and aligns with the claims o3 is too expensive to release publicly
It's the only tool/system (I won't call it an LLM) in their released benchmarks that has access to tools and the web. So, I'd wager the performance gains are strictly due to that.
If an LLM (o3) is too expensive to be released to the public, why would you use it in a tool that has to make hundreds of inference calls to answer a single question? You'd use a much cheaper model: most likely o3-mini or o1-mini, combined with 4o-mini for some tasks.
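A rough sketch of that cost intuition, with every figure assumed purely for illustration (these are not OpenAI's actual prices or call counts):

    # Back-of-the-envelope: a research run that fans out into many inference
    # calls multiplies the per-token price, so model cost dominates.
    calls_per_question = 200   # assumed number of inference calls
    tokens_per_call = 5_000    # assumed tokens in+out per call
    prices = {                 # assumed $/token, for illustration only
        "frontier model": 60 / 1_000_000,
        "small model": 1 / 1_000_000,
    }

    for name, price in prices.items():
        cost = calls_per_question * tokens_per_call * price
        print(f"{name}: ${cost:.2f} per research question")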
xbmcuser
It was expensive because they wanted to charge more for it, but DeepSeek has forced their hand.
willy_k
They’ve only released o3-mini, which is a powerful model but not the full o3 that is being claimed as too expensive to release. That being said, DeepSeek for sure forced their hand to release o3-mini to the public.
Sparkyte
Rightfully so, some models are getting super efficient.
mistercheph
I'm sure o3 will be a generation ahead of whatever deepseek, google and meta are doing today when it launches in 10 months, super impressive stuff.
petesergeant
I’m not sure if you’re implying this subtly in your comment or not, as it’s early here, but it does of course need to be a generation ahead of where its competitors will be after 10 months of forward progress, too. Nobody is standing still.
ai-christianson
Has anyone here tried it out yet?
maroonblazer
Per the below, seems it's not available to many yet.
bbor
Interesting, thanks for highlighting! Did not pick up on that. Re:"leading", tho:
Effectiveness in this task environment depends on far more than the specific model involved, no? Plus, they'd be fools (IMHO) to use only one size of model for every step in a research task -- sure, o3 might be an advantage when synthesizing a final answer or choosing between conflicting sources, but there are many, many steps required to get to that point.
pazimzadeh
> In Nature journal's Scientific Reports conference proceedings from 2012, in the article that did not mention plasmons or plasmonics, what nano-compound is studied?
Isn't there more than one article in Scientific Reports from 2012 that didn't mention plasmons or plasmonics?
Also, did they pay for access to all journal contents? That would be useful.
nicce
Maybe that is the only one with open access
spyckie2
Is this ability really a prerequisite to AGI and ASI?
Reasoning, problem solving, research validation - fundamentally, it is all refinement of thinking.
Research is one of those areas where I remain skeptical it is that important because the only valid proof is in the execution outcome, not the compiled answer.
For instance, you can research all you want about the best vacuum on the internet, but until you try it out yourself you are going to be caught between marketing, fake reviews, influencers, etc. Maybe the science fields are shielded from this (by being boring), but imagine pharma companies realizing that they can get whatever paper to say whatever they want by flooding the internet with curated blog articles containing advanced medical “research findings”. At some point you cannot trust the internet at all, and I imagine that might be soon.
I worry especially, given the rapidly growing amount of generated text on the internet, that research will lose a lot of value due to massive amounts of information garbage.
It will be a thing we used to do when the internet was still “real”.
observationist
It's a direction in a vast landscape, not a feature in itself - being better at different tasks, like search generally, and research in conjunction with reasoning, gets the model closer to AGI. An AGI will be able to do these tasks - so the point of the research is to have more Venn diagrams of capabilities like these, to help narrow down the view on what might actually be the fundamental mechanisms involved in AGI.
Moravec detailed the idea of a landscape of human capabilities slowly being submerged by AI capabilities; at the point at which AI can do anything a human can, in practice or in principle, we'll know for certain we've reached truly general AI. This idea includes things like feeling pain and pleasure, planning, complex social, moral, and ethical dynamics, and anything else you can possibly think of as relevant to human intelligence. Deep Research is just another island being slowly submerged by the relentless and relentlessly accelerating flood.
numba888
> things like feeling pain and pleasure
Can machines feel? Without that, there is no AGI according to the definition above.
And a second question: are animals "GI"? They don't have language and don't solve math problems; they've never heard of NP-complete.
YmiYugy
If I understood the graphs correctly, it only achieves a 20% pass rate on their internal tests. So I have to wait 30 minutes and pay a lot of money just to sift through walls of most likely incorrect text? Unless the possibility of hallucinations is negligible, this is just way too much content to review at once. The process probably needs to be a lot more iterative.
itkovian_
Here's an example of the type of question it is achieving 20% on:
The set of natural transformations between two functors F, G : C → D can be expressed as the end Nat(F, G) ≅ ∫_A Hom_D(F(A), G(A)).
Define the set of natural cotransformations from F to G to be the coend CoNat(F, G) ≅ ∫^A Hom_D(F(A), G(A)).
Let:
- F = B∙(Σ4)∗/ be the under ∞-category of the nerve of the delooping of the symmetric group Σ4 on 4 letters, under the unique 0-simplex ∗ of B∙Σ4.
- G = B∙(Σ7)∗/ be the under ∞-category of the nerve of the delooping of the symmetric group Σ7 on 7 letters, under the unique 0-simplex ∗ of B∙Σ7.
How many natural cotransformations are there between F and G?
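For readability, here are the two formulas in LaTeX (a transcription of the question above, not a solution; the end is the subscripted integral, the coend the superscripted one):

    \mathrm{Nat}(F,G)   \cong \int_{A} \mathrm{Hom}_{\mathcal{D}}(F(A),\, G(A))
    \mathrm{CoNat}(F,G) \cong \int^{A} \mathrm{Hom}_{\mathcal{D}}(F(A),\, G(A))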
Davidzheng
BTW, isn't this question at least really badly worded (and maybe incorrect)? The definitions they give for F and G are categories, not functors... (and both categories are in fact one object with a contractible space of morphisms...)
rizky05
[dead]
brokensegue
26.6% on Humanity's Last Exam is actually impressive.
Pass rate really only matters in the context of the difficulty of the tasks.
tmnvdb
Only if you are asking questions at the level of a cutting-edge benchmark.
rvnx
This is one of the actual questions:
> In Greek mythology, who was Jason's maternal great-grandfather?
https://www.google.com/search?q=In+Greek+mythology%2C+who+wa...
pama
No it is not an actual question on this exam. From the paper: “To ensure question quality and integrity, we enforce strict submission criteria. Questions should be precise, unambiguous, solvable, and non-searchable, ensuring models cannot rely on memorization or simple retrieval methods. All submissions must be original work or non-trivial syntheses of published information, though contributions from unpublished research are acceptable. Questions typically require graduate-level expertise or test knowledge of highly specific topics (e.g., precise historical details, trivia, local customs) and have specific, unambiguous answers…”. (Emphasis mine)
elicksaur
In Greek mythology, Jason's maternal great-grandfather was Einstein.
tmnvdb
This is a hard question for language models since it targets one of their known weaknesses.
roenxi
Maybe. Not enough data to say. Say it does a day's worth of work in a query; it is sensible to use if it takes less than a day to review ~5 days' worth of work. I don't know if we're near that threshold yet, but conceptually this would work well for actual research, where the amount of preparation is large compared to the amount of output written.
And eyeballing the benchmarks, it'll probably reach a >50% pass rate per query by the end of the year. It seems to double every model or two.
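As a toy version of that trade-off, with every number assumed just to make the break-even concrete:

    # Toy break-even model (all numbers assumed for illustration).
    success_rate = 0.20   # fraction of reports that survive review (assumed)
    write_days = 5.0      # days to produce the same report by hand (assumed)
    review_days = 0.5     # days to review one AI-generated report (assumed)

    expected_attempts = 1 / success_rate        # ~5 runs per usable report
    ai_cost = expected_attempts * review_days   # human days per usable report
    print(f"AI route: {ai_cost:.1f} days vs. manual: {write_days:.1f} days")
    # Worth it whenever expected_attempts * review_days < write_days.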
random_cynic
[dead]
jmount
I had no idea there was a market for "Compile a research report on how the retail industry has changed in the last 3 years. Use bullets and tables where necessary for clarity." I imagine reading such a result is pure torture.
ejang0
Can anyone confirm if this is available in Canada and other countries? This site says "We are still working on bringing access to users in the United Kingdom, Switzerland, and the European Economic Area." But I'm not sure about other countries. I don't have Pro currently, only Plus.
carbocation
I don't even see it in the US right now.
getnormality
The demo on global e-commerce trends seems less useful than a Google search, where the AI answer will at least give you links to the claimed information.
adriand
Feels like only a matter of time before these crawlers are blocked from large swathes of the internet. I understand that they’re already prohibited from Reddit and YouTube. If that spreads, this approach might be in trouble.
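For what it's worth, OpenAI documents a crawler user-agent (GPTBot) that sites can already opt out of via robots.txt; a minimal example of a full block:

    User-agent: GPTBot
    Disallow: /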
scarab92
I doubt those crawler rules will be honoured for long.
I wouldn’t even be surprised if a law is passed requiring sites to provide equal access whether a human visits directly or via these models.
It’s too important an innovation to stall, especially considering that the US’s competitors (China) won’t respect robots.txt either.
crazylogger
This is trivially bypassed by OpenAI asking the user to take control of their computer (or a sandboxed browser within it); then, for all intents and purposes, it’s the user themselves accessing your site (with some productivity/accessibility aid from OAI).
drcode
I suppose there is an equilibrium, where sites that penalize these types of crawlers will also get less traffic from people reading AI citations, so for many sites the upsides of allowing it will be greater than the downsides.
cj
Anyone selling anything would want to remain crawlable if people use this to research something that could lead to a purchase.
reaperman
Not necessarily. Southwest Airlines doesn't allow itself on price comparison sites or Google Flights.
Amazon listings are blocked from Google Shopping and other price comparison sites.
bbor
TBF, OpenAI in particular bought access to Reddit. Otherwise, yeah, this is my main confusion with all of these products, Perplexity being the biggest -- how do you get around the status quo of refusing access to bots? To start with, there is no Google Search API, and they work hard to make sure headless browsers can't access the normal service.
They do say "Currently, deep research can access the open web...", so maybe "open" there implies something significant. Like, "websites that have agreements with OpenAI and/or do not enforce norobot policies".
wahnfrieden
Client-side browsers that crawl for users (and prompt for logins or captcha as needed) won't be as easily blockable
optimalsolver
Big Tech Podcast listener?
cye131
Does anyone actually have access to this? It says it's available for Pro users on the website today - I have Pro via my employer but see no "deep research" option in the message composer.
fosterfriends
I have pro, in US, not seeing yet
chachamatcha
also US based, have pro and still no access.
snewman
Two different people I know with pro subscriptions report not having access yet.
energy123
Are you all in Europe?
> "We are still working on bringing access to users in the United Kingdom, Switzerland, and the European Economic Area."
greatpostman
Have pro, can’t see it yet
fizx
same same
6gvONxR4sf7o
There are some people in the blogosphere who are known experts in their niche or even niche-famous because they write popular useful stuff. And there are a ton more people who write useful stuff because they want that 'exposure.' At least, they do in the very broadest sense of writing it for another human to read it. I wonder if these people will keep writing when their readership is all bots. Dead internet here we come.
seanmcdirmid
I'm all for writing just for the bots, if I can figure it out. A lot of academic papers aren't really read anyways, just briefly glanced at so they can be cited together, large publications like journal pubs or dissertations even less so. But the ability to add to a world of knowledge that is very easy to access by people who want to use it...that is very appealing to me as an author. No more trudging through a bunch of papers with titles that might be relevant to what I want to know about...and no more trudging through my papers, I'm OK with that.
Gemini has had this for a month or two, also named "Deep Research" https://blog.google/products/gemini/google-gemini-deep-resea...
Meta question: what's with all of the naming overlap in the AI world? Triton (Nvidia, OpenAI) and Gro{k,q} (X.ai, groq, OpenAI) all come to mind