
The Deep Research problem

15 comments · February 21, 2025

smusamashah

Watched some recent Viva La Dirt League videos on how trailers lie and make false promises. Now I see the LLM as that marketing guy. Even if he knows everything, he can't help lying. You can't trust anything he says no matter how authoritative he sounds; even if he is telling the truth, you have no way of knowing.

These deep research things are a waste of time if you can't trust the output. Code you can run and verify. How do you verify this?

tptacek

I did a trial run with Deep Research this weekend to do a comparative analysis of the comp packages for Village Managers in suburbs around Chicagoland (it's election season, and our VM's comp had become an issue).

I have a decent idea of where to look to find comp information for a given municipality. But there are a lot of Chicagoland suburbs and tracking documents down for all of them would have been a chore.

Deep Research was valuable. But it only did about 60% of the work (which, of course, it presented as if it were 100%). It found interesting sources I was unaware of, and it assembled lots of easy-to-get public data that would have been annoying for me to collect (for instance, basic stuff like the name of every suburban Village Manager), which made spot-checking easier. But I still had to spot-check everything myself.

The premise of this post seems to be that material errors in Deep Research results negate the value of the product. I can't speak to how OpenAI is selling this; if the claim is "subscribe to Deep Research and it will generate reliable research reports for you", well, obviously, no. But as with most AI things, if you get past the hype, it's plain to see the value it's actually generating.

submeta

Deep Research is in its "ChatGPT 2.0" phase. It will improve, dramatically. And to the naysayers: when OpenAI released its first models, many doubted they would be good at coding. Now, two years later, look at Cursor, aider, and all the LLMs powering them, and at what you can do with a few prompts and iterations.

Deep research will dramatically improve as it’s a process that can be replicated and automated.

amelius

This is like saying: y = e^(-x) + 1 will soon be 0, because look at how fast it went through y = 2!
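
To spell out the math behind the joke (standard calculus, not anything from the thread): the curve starts at 2 and drops quickly, but it flattens out at an asymptote of 1 and never gets anywhere near 0:

    \[
      y = e^{-x} + 1, \qquad y(0) = 2, \qquad \lim_{x \to \infty} y = 1 \neq 0.
    \]

Rapid early progress is perfectly compatible with never reaching the target.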

rollinDyno

Everyone who has been working on RAG knows how important control over sources is. Simply directing your agent to fetch keyword-matching documents will lead to inaccurate claims.

The reality is that for now it is not possible to leave the human out of research, so I think the best an LLM can do is help curate sources and synthesize them; it cannot reliably write sound conclusions.

Edit: this is something elicit.com recognized quite early. But even when I was using it, I wished I had more control over the space the tool was searching over.
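
A minimal sketch of the failure mode described above, with invented documents and a naive keyword-overlap retriever (nothing here comes from a real RAG library; it is purely illustrative):

    from collections import Counter

    def keyword_score(query: str, doc: str) -> int:
        # Count how often each query word occurs in the document.
        query_words = set(query.lower().split())
        doc_words = Counter(doc.lower().split())
        return sum(doc_words[w] for w in query_words)

    def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
        # Return the k documents with the highest keyword overlap.
        return sorted(docs, key=lambda d: keyword_score(query, d), reverse=True)[:k]

    docs = [
        "Peer-reviewed trial: the drug reduced mortality by 3 percent.",
        "SEO spam: drug drug drug! This amazing drug reduced mortality "
        "by 50 percent, click here to learn how the drug works!",
    ]

    # Keyword stuffing makes the spam page outrank the trial report, so an
    # agent that trusts whatever it retrieved will cite the wrong number.
    print(retrieve("how much did the drug reduce mortality", docs))

Real retrievers are smarter than this, but the underlying issue is the same: topical match is not source quality, and that judgment still has to come from a human.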

jppope

These days I'm feeling like GenAI has an accuracy rate of 95%, maybe 96%. Great at boilerplate, great at stuff you want an intern to do or maybe to outsource... but it really struggles with the valuable stuff. The errors are almost always in the most inconvenient places, and they are hard to see. So I agree with Ben Evans on this one: what is one to do? The further you lean on it, the worse your skills and specializations get. It is invaluable for some kinds of work, greatly speeding you up, but then some of the things you would have caught take you down a rabbit hole that wastes so much time. The tradeoffs here aren't great.

bakari500

Yeah, but a 4 to 6% error rate isn't good, even from a dumb computer.

lsy

Research skills involve not just combining multiple pieces of data, but also being able to apply very subtle skills to determine whether a source is trustworthy, to cross-check numbers where their accuracy is important (and to determine when it's "important"), and to engage in some back and forth to determine which data actually applies to the research question being asked. In this sense, "deep research" is a misleading term, since the output is really more akin to a probabilistic "search" over the training data where the result may or may not be accurate and requires you to spot-check every fact. It is probably useful for surfacing new sources or making syntactic conjectures about how two pieces of data may fit together, but checking all of those sources for existence, let alone validity, still needs to be done by a person, and the output, as it stands in its polished form today, doesn't compel users to take sufficient responsibility for its factuality.

Lws803

I've always wondered: if deep research has an X% chance of producing errors in its report, and you have to double-check everything, visit every source, and potentially correct it yourself, then does it really save time in helping you get research done (outside of coding and marketing)?
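
One way to make that question concrete (my framing; the variables are invented for illustration): with N claims in the report, a per-claim verification cost t_v, an error rate p, and a per-error fix cost t_f, the tool only saves time when

    \[
      t_{\mathrm{gen}} + N t_v + p N t_f < T_{\mathrm{manual}},
    \]

so a high error rate or expensive verification can erase the savings entirely.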

ImaCake

It might depend on how much you struggle with writer's block. An LLM essay with sources is probably a better starting point than a blank page. But it will vary between people.

baxtr

I urge anyone to do the following: take a subject you know really, really well, feed it into one of the deep research tools, and check the results.

You might be amazed, but most probably you'll be shocked.

ilrwbwrkhv

Yup, none of these tools are anywhere close to AGI or "research". They are still a much better search engine and, of course, a spam generator.

iandanforth

I'll share my recipe for using these products on the off chance it helps someone.

1. Only run searches that produce results easily verifiable against non-AI sources.

2. Always perform the search in multiple products (Gemini 1.5 Deep Research, Gemini 2.0 Pro, ChatGPT o3-mini-high, Claude 3.7 w/ extended thinking, Perplexity)

With these two rules I have found the current round of LLMs useful for "researchy" queries. Collecting the results across tools and then throwing out the 65-75% slop results in genuinely useful information that would have taken me much longer to find.
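
As a rough sketch of what rule 2 amounts to in practice (the canned data below stands in for the tools' real outputs, which would first need to be boiled down to discrete claims):

    from collections import Counter

    # Canned answers standing in for real products; in practice each set
    # would come from one tool's output, reduced to discrete claims.
    answers = {
        "product_a": {"claim 1", "claim 2", "claim 3"},
        "product_b": {"claim 1", "claim 3", "claim 4"},
        "product_c": {"claim 1", "claim 5"},
    }

    def cross_check(answers: dict[str, set[str]], quorum: int = 2) -> set[str]:
        # Keep only claims that at least `quorum` products agree on.
        counts = Counter()
        for claims in answers.values():
            counts.update(claims)
        return {claim for claim, n in counts.items() if n >= quorum}

    # Only "claim 1" and "claim 3" survive; the rest is the slop.
    print(sorted(cross_check(answers)))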

Now, the above could be seen as a harsh critique of these tools, as in "the kiddie pool is great as long as you're wearing full hazmat gear", but I still derive regular and increasing value from them.

theGnuMe

One other existential question is Simpson's paradox, which I believe politicians exploit to support different policies from the same underlying data. I see this as a problem for government, especially if we have liberal- or conservative-trained LLMs. We expect the computer to give us the correct answer, but when the underlying model is trained one way, by RLHF or by systemic/weighted bias in its source documents (imagine training a libertarian AI on Cato papers), you could have highly confident pseudo-intellectual junk. Economists already deal with this problem daily, since their field was heavily politicized. Law is another one.
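
For anyone who hasn't seen the paradox in action, here is the classic numeric illustration (the figures follow the well-known kidney-stone study; treat them as illustrative): treatment A wins inside every subgroup, yet B wins in the aggregate, because A was given far more of the hard cases:

    # Successes / attempts for two treatments across two subgroups.
    data = {
        "A": {"mild": (81, 87), "severe": (192, 263)},
        "B": {"mild": (234, 270), "severe": (55, 80)},
    }

    for group in ("mild", "severe"):
        a_s, a_n = data["A"][group]
        b_s, b_n = data["B"][group]
        # A wins in BOTH subgroups: 93% vs 87%, then 73% vs 69%.
        print(f"{group}: A={a_s / a_n:.0%} vs B={b_s / b_n:.0%}")

    for t in ("A", "B"):
        s = sum(x for x, _ in data[t].values())
        n = sum(y for _, y in data[t].values())
        # Aggregation flips the result: A=78% vs B=83% overall.
        print(f"overall {t}: {s / n:.0%}")

The same table honestly supports both "A is better" and "B is better" depending on whether you condition on severity, which is exactly the lever the comment above worries about.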

ImaCake

I've never thought of Simpson's Paradox as a political problem before, thanks for sharing this!

Arguably this applies just as well to Bayesian vs. frequentist statisticians, or molecular vs. biochemical biologists.