The Humans Building AI Scientists
23 comments
·March 20, 2025
the_snooze
> Or to think about this another way, imagine a PhD student who was never allowed to talk to other people, attend conferences etc., and could only read papers and try things in lab. But they can read papers extremely fast. Would they be successful?
For anyone who thinks this is sufficient, be aware that papers only tell you what was successful (for some definition of successful). Being part of a research community gives you access to the other side: hallway conversations at conferences and informal collaborative networks are far more candid. That's where people will tell you what they've tried that didn't work, or what resources they'd need for an ambitious study just out of reach of their current budget. It's also where a lot of new ideas and collaborations come from, as people with matching problems and solutions come together around interesting new questions.
I'm not sure how an AI is supposed to help with this, as research is ultimately a very social activity from my perspective.
kevinventullo
This comment made me realize something. There’s always been grumbling in the academic community about how negative or less exciting results aren’t publishable. As a result, there is quite a bit of knowledge that is essentially “lost” as no one ever bothers to write it down. Part of the reason this happens is that publishing unexciting results is unhelpful professionally, but I suspect another aspect here is that researchers know that no one would read such results; it is hard enough to keep up with the positive results in the field, much less negative results. So, in that sense they are not “contributing to the sum total of human knowledge”, which I think is a major part of many scientists’ motivations.
So then, in a world where the outcomes of all experiments could reasonably be fed into an AI model, it seems there could be a great deal of value in having scientists publish these “low value” negative results, even in a relatively informal format (e.g. not worrying about formatting, perhaps skipping peer review, etc.). That way, even if a human never reads the paper, at the very least an AI “scientist” model would pick it up.
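To make that concrete, even something as lightweight as the sketch below would be machine-readable enough for such a model to ingest (the field names here are purely hypothetical, not any existing standard):

    # Minimal sketch of an informal, machine-readable negative-result record
    # that an AI "scientist" could ingest later. Field names are hypothetical.
    import json

    negative_result = {
        "hypothesis": "Additive X improves the yield of reaction Y",
        "method_summary": "Three runs at 60 C with 5 mol% X, otherwise standard protocol",
        "outcome": "No significant change in yield (within measurement error)",
        "suspected_reasons": ["additive may decompose above 50 C"],
        "raw_data": None,  # optional link to raw data
        "peer_reviewed": False,
    }

    print(json.dumps(negative_result, indent=2))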
I don’t think you can professionally incentivize this. Rather, scientists would need to do it out of the desire to contribute to the sum total of human knowledge, which could then be embodied, and not forgotten, by these scientist AIs.
ehsanu1
There seem to be a couple of field-specific journals of negative results for similar purposes. It seems like there should be value in citing negative results to inform current research. Perhaps if there were more journals dedicated to this, or a single one not limited to specific fields, there would still be some incentive to publish there, provided the effort required was low enough (another area where AI might be applied: writing it up).
owenpalmer
> imagine a PhD student who was never allowed to talk to other people, attend conferences etc., and could only read papers and try things in lab. But they can read papers extremely fast. Would they be successful?
A lot of scientific knowledge cannot be communicated through papers. This is especially true in wet labs, where there's no procedural standardization. Keoni Gandall wrote an excellent post on this topic as it applies to synthetic bio [0]. I've experienced this firsthand as a student participating in chemistry labs. Even when you're given a step-by-step procedure, it's impossible to predict the spatial logistics and inefficiencies you'll run into when you actually try to execute it, regardless of your analytical preparation.
The other type of knowledge that is rarely communicated through papers is the informal exploratory thought process of the researcher, and their embarrassing failures/mistakes.
If I have my own lab someday, I think it would be cool if everyone wore bodycams showing their first-person view. Publishing the raw footage along with the paper/code would hopefully help with reproducibility.
ludicrousdispla
>> I’m optimistic that an AI scientist will help with reproducibility overall. Did you do the experiment that you said you did? Did you record all the variables in a way that you can report it in the way you did it?
Interesting to think of the potential long term impact for science. Reminds me of the reform in the early 20th century that focused on ensuring the contents of canned goods matched their labeled ingredients.
analog31
That's still a thing, in the drug industry.
dr_dshiv
There are massive opportunities for accelerating science through AI-human collaboration. However, it requires new management and a new set of standards. If AI can write a paper that was good for publication a year ago, what does that mean for science?
It’s really unclear.
I'm particularly interested in AI-assisted education research. I think we need to keep an eye on empirical methods for developing smarter humans.
uptownfunk
I think more reliable in silico experimentation will yield much better results in the long run, but it's probably akin to a SpaceX- or Tesla-scale investment and 1-2 orders of magnitude more compute intensive.
aithrowawaycomm
I will never stop being amazed at AI folks' childish views of animal cognition:
> A lot of your tools reference crows. What’s up with that?
> White: When I got started in this space around October 2022, I was red-teaming with GPT-4. Around the same time, a paper called “Language Models are Stochastic Parrots” was circulating, and people were debating whether these models were just regurgitating their training data or truly reasoning. The analogy is appealing, and parrots are definitely known for mimicking speech. But what we saw was that pairing these language models with external tools made them much more accurate — a bit like crows, which can use tools to solve puzzles.
> In the work that led to ChemCrow, for instance, we found that giving the large language model access to calculators or chemistry software made its answers much better. So we kind of retconned a little bit to make “Crows” be agents that can interact with tools using natural language.
This is incredibly insulting to crows, who can spontaneously create tools and use bizarre man-made tools with no training. And when crows use tools for problem solving in the lab, the tools are not "solve the problem for me" like a calculator, they require much more creative thinking. What White really means - whether he knows it or not - is that crows are known for being intelligent and he wants to use this for marketing purposes.
I don't think anyone alive today will live to see an AI as smart as a crow, in no small part because AI researchers and investors refuse to take animal intelligence seriously.
odyssey7
Western awareness still hasn't recovered from Descartes.
mistrial9
Yes, agreed, and then some. It's thinking detached from the actual real-world creatures we call crows. To add some pep to the situation: the crow lineage dates from an age before the rise of humans.
light_hue_1
I'm so incredibly tired of all of the BS claims. (I'm an AI/ML researcher)
> has enabled open-source LLMs “to exceed human-level performance on two more of the lab-bench tasks: doing scientific literature research and reasoning about DNA constructs” with only “modest compute budgets.”
No. They did not. They just ran a crappy experiment and came up with an absurd result.
As a community we need to invest much more effort into benchmarking as a science. Our space is full of garbage claims like this and it isn't doing us any favors.
Eventually the hype will die down and people will realize that a lot of the claims were obvious falsehoods. Then we'll all get collectively punished for it.
nsagent
I'm also an AI researcher and I'm not hopeful that the hype will die down anytime soon. People have been touting insignificant results through shoddy science for a long time now. They've noticed it works well enough because current ML is still pretty much alchemy (Ali Rahimi's 2017 NIPS Test of Time talk [1] still resonates today), so people rarely spend the effort to effectively refute the bogus claims.
As a result, I've opted out of the system and am working on ambitious ideas to upend the current paradigm of training on big data, which is truly insane (by age 4, even the most erudite of children have probably been exposed to only about 45 million words [2], yet they exhibit vastly more understanding and fluency than any language model trained on a similar amount of data).
[1]: https://www.youtube.com/watch?v=Qi1Yry33TQE
[2]: https://www.aft.org/sites/default/files/media/2014/TheEarlyC...
kordlessagain
The problem is with the reporting, if you want to call it that. This is more like a promotional piece. They write something someone said, and assume it’s true or useful to get attention. There’s not even a name attributed to the article.
jaggederest
I'm reminded of the time I saw some A/B test results that didn't make much sense, but were highly significant [1].
I asked how many A/B tests they were running... hundreds. Overlapping. At least they had a holdout group (which they mostly ignored, and which indicated that all the A/B tests more or less made no difference).
[1] P < 0.001 with a large effect size. No, your A/B test probably didn't break the laws of economics - you probably messed up your data.
MostlyStable
If they were running many concurrent, overlapping A/B tests, then they didn't necessarily mess up their data. You are likely to get that result, honestly and truthfully, purely by chance, if you run enough tests.
Unless "running concurrent tests and not correcting your significance level" is what you meant by messing up their data, in which case, yeah.
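To make the "purely by chance" point concrete, here's a quick simulation sketch (the number of tests, sample sizes, and threshold are made up for illustration, not taken from the parent comment):

    # Minimal sketch: run many A/B tests on pure noise (no true effect anywhere)
    # and count how many come out "significant". With hundreds of tests, some
    # will, purely by chance, unless you correct the significance level.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_tests, n_per_arm, alpha = 300, 1_000, 0.05

    false_positives = 0
    for _ in range(n_tests):
        a = rng.normal(size=n_per_arm)  # control arm, no true effect
        b = rng.normal(size=n_per_arm)  # treatment arm, no true effect
        _, p = stats.ttest_ind(a, b)
        false_positives += int(p < alpha)

    print(f"{false_positives} of {n_tests} null tests crossed p < {alpha}")
    # Even at p < 0.001, 300 independent null tests have a ~26% chance of at
    # least one hit; a Bonferroni-style fix would require p < alpha / n_tests.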
mskar
Do you have ideas for what would make a better experiment? The methodology for a literature search comparison, while simple, is the best I could come up with. We developed ~250 multiple-choice questions which require a deep dive into a paper to answer, ideally with very convincing distractor answers. Then we gave 9 evaluators (post-docs and grad students in biology) a week to answer 40 questions each, without any limitations on their search. The evaluators were incentivized with a base pay per question completed, plus a 50-100% bonus if they got enough questions correct.
Under those circumstances, the evaluators had an answer precision of 73.8%, and the AI system (PaperQA2) was 85.2%. Both the evaluators and PaperQA2 could choose not to answer on a particular question. If you look at accuracy, which takes into account not answering a question, evaluators were 67.7% and PaperQA2 was 66%. So in terms of overall accuracy -- humans still did a touch better. But when actually answering, the AI was more precise.
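To spell out the precision/accuracy distinction when abstaining is allowed, here's a small sketch (the counts below are rough, illustrative numbers, not the actual question tallies):

    # Minimal sketch of precision vs. accuracy when a system may decline to answer.
    # Precision = correct / answered; accuracy = correct / all questions asked,
    # so abstentions hurt accuracy but leave precision untouched.

    def precision(correct: int, answered: int) -> float:
        return correct / answered

    def accuracy(correct: int, total: int) -> float:
        return correct / total

    # Illustrative: out of 100 questions, a system that answers 78 and gets 66
    # right has ~84.6% precision but only 66% accuracy.
    print(f"precision: {precision(66, 78):.1%}, accuracy: {accuracy(66, 100):.1%}")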
In terms of the literature synthesis comparison, I think the methodology was pretty solid too, but would love more feedback. We had PaperQA2 write cited articles for ~19k human genes, of which there are (non-stub) Wikipedia articles for ~3.9k. It's worth noting that this is a particularly technical subset of Wikipedia articles. We sampled 300 articles that were in both sources, then extracted 500 statements from each (basically a paragraph block). These statements could be compound, or even span multiple sentences. They were shuffled and obfuscated such that the origin could not be determined from the statement alone.
The statements were given to a team of 4 evaluators, who were each asked to evaluate whether the information was correct as cited, i.e. did the source actually support the statement. So they had to access (if they could) and actually read all the sources. After we got the evaluator gradings back, we could compile and map each statement back to its origin for comparison. Under these circumstances, the PaperQA2-written articles were 83% cited and supported, while the Wikipedia articles were 61.5% cited and supported. Wikipedia had comparatively more uncited claims, so if we eliminate those and focus only on the cited claims themselves, then PaperQA2 had 86.1% of claims supported by the source and Wikipedia had 71.2%. We did an analysis of every single unsupported claim, and on Wikipedia, claims are often attributed to arbitrary or really broad sources, like the landing page of a database.
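And here's a rough sketch of the blinding/grading bookkeeping described above (the record layout is simplified for illustration; it's not our actual pipeline code):

    # Minimal sketch: blind grading of shuffled statements, then mapping grades
    # back to their origin (PaperQA2 vs. Wikipedia). Record layout is illustrative.
    import random

    statements = [
        {"id": 0, "origin": "paperqa2", "text": "...", "source": "doi:..."},
        {"id": 1, "origin": "wikipedia", "text": "...", "source": "https://..."},
        # ... one record per extracted statement
    ]

    blinded = [{"id": s["id"], "text": s["text"], "source": s["source"]} for s in statements]
    random.shuffle(blinded)  # evaluators never see which corpus a statement came from

    # Grades come back keyed by statement id: did the cited source support the claim?
    grades = {0: True, 1: False}

    by_origin: dict[str, list[bool]] = {}
    for s in statements:
        by_origin.setdefault(s["origin"], []).append(grades[s["id"]])

    for origin, supported in by_origin.items():
        print(origin, f"{100 * sum(supported) / len(supported):.1f}% supported")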
(here's the paper fwiw: https://arxiv.org/abs/2409.13740)
plaidfuji
> Rodriques: Many people assume we’re focused on wet lab automation. There are certainly opportunities there and we are exploring them, but the biggest opportunities are actually on the cognitive side.
Wet lab automation is very difficult and capital intensive. And once you build your lab you are constraining yourself to answering questions within a certain domain for which you have the relevant sample prep and characterization equipment. Your equipment in essence defines your design space, and thus your potential solution space, which places a bound on your TAM.
So of course automating the thinking part of science is more approachable with current AI - but is that what people want? It’s certainly an attractive proposition for management: automate away the highly paid SMEs and turn R&D into more of a factory environment with replaceable lab techs. But actually implementing this depends on where the power lies in an org. My theory is that in many orgs the “cognitive” folks hold the true power (= the unwritten expertise about what works and what doesn’t, when to work around your existing setup, how much to trust each number an instrument produces). They’ll resist this change to their last breath.
You may gain some short-term efficiency by accelerating the experiments of today, but in the long run you lose the expertise to break out of local minima imposed by your equipment and training data.
Or to think about this another way, imagine a PhD student who was never allowed to talk to other people, attend conferences etc., and could only read papers and try things in lab. But they can read papers extremely fast. Would they be successful?