AI assisted search-based research works now

simonw

I think it's important to keep tabs on things that LLM systems fail at (or don't do well enough on) and try to notice when their performance rises above that bar.

Gemini 2.5 Pro and o3/o4-mini seem to have crossed a threshold for a bunch of things (at least for me) in the last few weeks.

Tasteful, effective use of the search tool for o3/o4-mini is one of those. Being able to "reason" effectively over long context inputs (particularly useful for understanding and debugging larger volumes of code) is another.

skydhash

One issue I find with this workflow is tunnel vision: making ill-informed decisions because of a lack of surrounding information. I often skim books because even if I don't retain the content, I build a mental map that helps me find further information when I need it. I wouldn't try to construct a complete answer to a question with just that amount of information, but I will use the map to quickly locate the source and gather more information to synthesize the answer.

One could use the above workflow in the same way and argue that natural language search is more intuitive than keyword-based search. But I don't think that brings any meaningful productivity improvement.

> Being able to "reason" effectively over long context inputs (particularly useful for understanding and debugging larger volumes of code) is another.

Any time I see this "wish" pop up, my suggestion is to try a disassembler to reverse engineer some binary, to really understand the problem of coming up with a theory of a program (based on Naur's definition). Individual statements are always clear (programming languages are formal and have no ambiguity). The issue is grouping them, unambiguously defining the semantics of those groups, and finding the links between them, recursively.

Once that's done, what you'll have is a domain. And you could have skipped the whole thing by just learning the domain from a domain expert. So the only reason to do this is that the code doesn't really implement the domain (bugs) or that it's hidden purposefully. The most productive workflow, then, is to learn the domain first, either to find discrepancies (the first case) or to focus on the missing parts (the second case). In the first case, the easiest approach is writing tests, and the more complete one is formal verification of the software.
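
To make that first case concrete, here is a minimal sketch of a domain rule encoded as a test; the invoicing module and the rule itself are hypothetical, purely for illustration:

    # Hypothetical domain rule: an order total is the sum of its line items
    # minus the discount, and never negative. Encoding the rule as a test
    # surfaces any implementation that has drifted from the domain.
    from invoicing import compute_total  # assumed module under test

    def test_total_follows_domain_rule():
        line_items = [100.0, 50.0, 25.0]
        discount = 30.0
        expected = max(sum(line_items) - discount, 0.0)  # the domain's rule
        assert compute_total(line_items, discount) == expected

If the assertion fails, the code and the domain disagree, and that discrepancy is exactly what you're hunting for.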

jsemrau

My main observations here are:

1. Technically it might be possible to search the Internet, but it might not surface correct and/or useful information.

2. High-value information that would make a research report valuable is rarely public or free. This holds especially true in capital-intensive or regulated industries.

simonw

I fully expect one of the AI-related business models going forward to be charging subscriptions for LLM search tool access to those kinds of archives.

ChatGPT plus an extra $30/month for search access to a specific archive would make sense to me.

sshine

Kagi is $10/mo. for search and +$15/mo. for premium LLMs with agentic access to search.

intended

I find that these conversations on HN end up covering similar positions constantly.

I believe that most positions are resolved if

1) You accept that these are fundamentally narrative tools. They build stories, in whatever style you wish. Stories of code, stories of project reports, stories of conversations.

2) This is balanced by the idea that the core of everything in our shared information economy is Verification.

The reason experts get use out of these tools is that they can verify when the output is close enough to be indistinguishable from expert effort.

Domain experts also do another level of verification (hopefully), which is to check whether the generated content is actually correct, based on their mental model of their domain.

I would predict that LLMs are deadly in the hands of people who can’t gauge the output and will end up driving themselves off a cliff, while experts will be able to use them effectively on tasks where verifying the output takes comparatively less effort than creating it.

gh0stcat

You've perfectly captured my experience as well. I typically only trust LLMs and have good experiences with them when I have enough domain expertise to be at least 95% confident the output is correct (specific to my domain of work; I don't always need "perfect"). I can also mostly use them as a first pass for getting an idea of where to begin research; after that I lose confidence that the more detailed and advanced content they give me is accurate. There is a gray area, though, where a domain expert might have a false sense of confidence and over time experience "skill drift": losing expertise because they are only ever verifying a lossy compression of information rather than resetting their context with real-world information. I am mostly concerned with that last bit.

saulpw

I tried it recently. I asked for videochat services like the one I use (WB) with 2 specific features that the most commonly used services don't have. It asked some clarifying questions and seemed to understand the mission, then went off for 10 minutes after which it returned 5 results in a table.

The first result was WB, which I gave to it as the first example and am already using. Results 2 and 3 were the mainstream services which it helpfully marked in the table as not having the features I need. Result 4 looked promising but was discontinued 3 years ago. Result 5 was an actual option which I'm trying out (but may not work for other reasons).

So, 1/5 usable results. That was mildly helpful I guess, but it appeared a lot more helpful on the surface than it was. And I don't seem to have the ability to say "nice try but dig deeper".

Gracana

You can tell it to try again. It took me a couple rounds with the tool before I noticed that your conversation after the initial research isn't limited to just chatting: if you select the "deep research" button on your message, it will run the search process in its response.

simonw

That sounds like a Deep Research query, was that with OpenAI or Gemini?

saulpw

This was OpenAI.

sshine

The article doesn’t mention Kagi: The Assistant, a search-powered LLM frontend that came out of closed beta around the beginning of the year and has been included in all paid plans since yesterday.

It really is a game changer when the search engine

I find that an AI performing multiple searches on variations of keywords, and aggregating the top results across those keywords, does a more extensive job than most people, myself included, would do.

I had luck once asking what its search queries were. It usually provides the references.
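
A rough sketch of that fan-out-and-merge pattern, assuming a hypothetical search(query) helper that returns a ranked list of URLs (any search API would do here):

    from collections import Counter

    def aggregate_search(queries, search, top_n=10):
        # search(query) is a hypothetical helper returning a ranked list of URLs.
        counts = Counter()
        for query in queries:
            for rank, url in enumerate(search(query)):
                # Earlier hits for a query weigh a little more than later ones.
                counts[url] += 1.0 / (rank + 1)
        # URLs surfaced by several keyword variations float to the top.
        return [url for url, _ in counts.most_common(top_n)]

    # Example: the kind of keyword variations an LLM generates from one question.
    # aggregate_search(["kagi assistant review", "kagi llm search",
    #                   "kagi agentic search"], search=my_search_api)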

simonw

I haven't tried Kagi's product here yet. Do you know which LLM it uses under the hood?

Edit: from https://help.kagi.com/kagi/ai/assistant.html it looks like the answer is "all of them":

> Access to the latest and most performant large language models from OpenAI, Anthropic, Meta, Google, Mistral, Amazon, Alibaba and DeepSeek

dcre

Yep, a regular paid Kagi sub comes with cheap models for free: GPT-4o-mini, Gemini 2.5 Flash, etc. If you pay extra you can get the SOTA models, though IMO Flash is good enough for most stuff if the search result context is good.

baq

> I can feel my usage of Google search taking a nosedive already.

Conveniently, Gemini is the best frontier model for everything else; they’re very interested, and well positioned (if not best placed?), to also be the best at deep research. Let’s check back in 3-6 months.

throwup238

IMO they’re already the best. Not only is the rate limit much higher (20/day instead of OpenAI’s 10/month) but Gemini is capable of looking at far more sources, on the order of 10x.

I just had a research report last night that looked at 400 sources when I asked it to help identify a first edition Origin of Species (it did a great job too, correctly explaining how to identify a true first edition from chimeral ones).

btbuildem

It's a relevant question about the economic model for the web. On one hand, replacing search with an LLM-based approach threatens the existing, advertising-based model. On the other hand, the advertising model has produced so much harm: literally irreparable damage to attention spans, outrage-driven "engagement", and the general enshittification of the internet, to name just a few. I find it a bit hard to imagine that whatever succeeds it will be worse for us collectively.

My question is: how do you reproduce this level of functionality locally, in a "home lab" type setting? I fully expect the various AI companies to follow the exact same business model as any other VC-funded tech outfit: free service (you're the product) -> paid service (you're still the product) -> paid service with advertising baked in (now you're unabashedly the product).

I fear that with LLM-based offerings, the advertising will be increasingly inseparable, and eventually undetectable, from the actual useful information we seek. I'd like to get a "clean" capsule of the world's compendium of knowledge with this amazing ability to self-reason, before it's truly corrupted.

fzzzy

You need a copy of R1 and enough RAM to run it, plus a web search tool, or a RAG database over your personal data store.
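
A rough sketch of how those pieces could fit together, assuming a local Ollama instance serving a distilled R1 model and a hypothetical web_search() helper (a self-hosted SearXNG instance would do):

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

    def ask_local_model(prompt, model="deepseek-r1:14b"):
        # Non-streaming completion from the locally served model.
        resp = requests.post(OLLAMA_URL, json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        })
        return resp.json()["response"]

    def research(question, web_search):
        # web_search(q) is a hypothetical helper returning dicts with
        # "url" and "snippet" keys, e.g. backed by self-hosted SearXNG.
        sources = web_search(question)
        context = "\n\n".join(f"{s['url']}\n{s['snippet']}" for s in sources[:5])
        return ask_local_model(
            "Using only these sources, answer the question and cite URLs.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}"
        )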

energy123

> The user-facing Google Gemini app can search too, but it doesn’t show me what it’s searching for.

Gemini 2.5 Pro is also capable of searching as part of its chain of thought; it needs light prodding to show URLs, but it'll do so and is good at it.
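
For what it's worth, the API surfaces this more directly than the app. A sketch of a grounded call via the REST endpoint; the google_search tool and the grounding-metadata field names are my assumptions about the current API shape, so double-check them:

    import os, requests

    # Assumed API shape: google_search tool plus groundingMetadata in the response.
    url = ("https://generativelanguage.googleapis.com/v1beta/models/"
           "gemini-2.5-pro:generateContent?key=" + os.environ["GEMINI_API_KEY"])
    body = {
        "contents": [{"parts": [{"text": "What changed in EU AI regulation this month? Cite sources."}]}],
        "tools": [{"google_search": {}}],  # enable search grounding
    }
    data = requests.post(url, json=body).json()
    meta = data["candidates"][0].get("groundingMetadata", {})
    print(meta.get("webSearchQueries", []))               # the queries it ran
    for chunk in meta.get("groundingChunks", []):
        print(chunk.get("web", {}).get("uri"))            # source URLs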

Unrelated point, but I'm going to keep saying this anywhere Google engineers may be reading: the main problem with Gemini is its horrendous web app, riddled with five annoying bugs that I identified as a casual user within a week. I assume it's in such a bad state because they don't actually use the app, they use the API, but come on. You solved the hard problem of making the world's best overall model but are squandering it on the world's worst user interface.

loufe

There must be some form of memory leak in AI Studio, as I have to close and open a new tab after about two hours as it slowly grinds my slower computers to a halt. Its inability to create a markdown file without escaping the markdown itself (including code snippets) is definitely my first suggestion for them to fix.

It's a great tool, but sometimes frustrating.

oulipo

The main "real-world" use cases for AI use for now have been:

- shooting buildings in Gaza https://apnews.com/article/israel-palestinians-ai-weapons-43...

- compiling a list of information on Government workers in US https://www.msn.com/en-us/news/politics/elon-musk-s-doge-usi...

- creating a few lousy music videos

I'd argue we'd be better off SLOWING DOWN with that shit

qwertox

I feel like the benefit that AI gives us programmers is limited. It can be an extremely advanced, accelerating, and helpful assistant, but we're limited to just that: architecting and developing software.

Biologists, mathematicians, physicists, philosophers and the like seem to have an open-ended benefit from the research which AI is now starting to enable. I kind of envy them.

Unless one moves into AI research?

bluefirebrand

I don't think AI is trustworthy or accurate enough to be valuable for anyone trying to do real science

That doesn't mean they won't try though. I think the replication crisis has illustrated how many researchers actually care about correctness versus just publishing papers

simonw

If you're a skilled researcher I expect you should be able to get great results out of unreliable AI assistants already.

Scientists are meant to be good at verifying and double-checking results - similar to how journalists have to learn to derive the truth from unreliable sources.

These are skills that turn out to be crucial when working with LLMs.

bluefirebrand

> Scientists are meant to be good at verifying and double-checking results

Verifying and double-checking results requires replicating experiments, doesn't it?

> similar to how journalists have to learn to derive the truth from unreliable sources

I think maybe you are giving journalists too much credit here, or you have a very low standard for "truth"

You cannot, no matter how good you are, derive truth from faulty data

barbazoo

Same with using AI for coding. I can’t imagine someone expecting to use the LLM output verbatim, but maybe I’m just not good enough at prompting.

parodysbird

Biologists, mathematicians, physicists, and philosophers are already the experts who produce the text in their domain that the LLMs might have been trained on...