Deep learning gets the glory, deep fact checking gets ignored
167 comments
June 3, 2025
b0a04gl
Man, I’ve been there. Tried throwing BERT at enzyme data once—looked fine in eval, totally flopped in the wild. Classic overfit-on-vibes scenario.
Honestly, for straight-up classification? I’d pick SVM or logistic any day. Transformers are cool, but unless your data’s super clean, they just hallucinate confidently. Like giving GPT a multiple-choice test on gibberish—it will pick something, and say it with its chest.
Lately, I just steal embeddings from big models and slap a dumb classifier on top. Works better, runs faster, less drama.
Appreciate this post. Needed that reality check before I fine-tune something stupid again.
ErigmolCt
Transformers will ace your test set, then faceplant the second they meet reality. I've also done the "wow, 92% accuracy!" dance only to realize later I just built a very confident pattern-matcher for my dataset quirks.
disgruntledphd2
Honestly, if your accuracy/performance metrics are too good, that's almost a sure sign that something has gone wrong.
Source: bitter, bitter experience. I once predicted the placebo effect perfectly using a random forest (just got lucky with the train/test split). Although I'd left academia at that point, I often wonder if I'd have dug in deeper if I'd needed a high impact paper to keep my job.
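For anyone who wants to see how easy it is to fool yourself this way, here is a minimal, self-contained sketch (pure-noise data, so features and labels are unrelated by construction; any apparent skill on a single split is luck):

    # Pure-noise data: any "signal" a single train/test split finds is luck.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 200))    # few samples, many features: easy to overfit
    y = rng.integers(0, 2, size=60)   # labels independent of X by construction

    # A single "lucky" split can report a flattering number...
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=7)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print("single split accuracy:", accuracy_score(y_te, clf.predict(X_te)))

    # ...while repeated splits show accuracy hovering around chance (~0.5).
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
    print("10-fold CV: %.2f +/- %.2f" % (scores.mean(), scores.std()))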
dvfjsdhgfv
I believe it's very common. At one point I thought about publishing a paper analyzing some studies with good results (published in journals) and showing where the problem with each lies, but eventually I just gave up. I figured I would only make the original authors unhappy, and everybody else wouldn't care.
stevenae
> Lately, I just steal embeddings from big models and slap a dumb classifier on top. Works better, runs faster, less drama.
You may know this but many don't -- this is broadly known as "transfer learning".
TeMPOraL
Is it, even when applied to trivial classifiers (possibly "classical" ones)?
I feel that we're wrong to be focusing so much on the conversational/inference aspect of LLMs. The way I see it, the true "magic" hides in the model itself. It's effectively a computational representation of understanding. I feel there's a lot of unrealized value hidden in the structure of the latent space itself. We need to spend more time studying it, make more diverse and hands-on tools to explore it, and mine it for all kinds of insights.
stevenae
For this and sibling -- yes. Essentially, using the output of any model as an input to another model is transfer learning.
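For anyone who hasn't tried it, the whole pattern fits in a few lines. A minimal sketch, using a sentence-transformers text encoder purely as an example (for proteins you would swap in embeddings from a protein language model) and placeholder data:

    # "Frozen embeddings + simple classifier": the big model is only a feature
    # extractor and is never fine-tuned.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    texts = ["example input 1", "example input 2", "example input 3"]  # placeholder data
    labels = [0, 1, 0]                                                 # placeholder labels

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any pretrained encoder works
    X = encoder.encode(texts)                          # shape: (n_samples, embedding_dim)

    head = LogisticRegression(max_iter=1000).fit(X, labels)
    print(head.predict(encoder.encode(["a new, unseen example"])))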
b0a04gl
ohhh yeah that's the interpretability game. not just crank model size and pray it grows a brain. everyone's hyped on scale but barely anyone's thinking glue. anthropic saw it early. their interpretability crew? scary smart folks, some I know personally. zero chill, just pure signal.
if you wanna peek where their heads are at, start here https://www.anthropic.com/research/mapping-mind-language-mod... not just another ai blog. actual systems brain behind it.
lamename
I agree. Isn't this just utilizing the representation learning that's happened under the hood of the LLM?
ActivePattern
Ironically, this comment reads like it was generated from a Transformer (ChatGPT to be specific)
b0a04gl
oh yes i recently became a transformer too
tough
it's the em dashes?
sebzim4500
>Lately, I just steal embeddings from big models and slap a dumb classifier on top. Works better, runs faster, less drama.
Sure but this is still indirectly using transformers.
davidclark
I’m not sure anyone I know could make an em dash with their keyboard off the top of their head.
[meta] Here’s where I wish I could personally flag HN accounts.
dathinab
a lot of applications auto convert -- to an em dash
and a bunch of phone/tablet keyboards do so, too
I like em dashes. I had considered installing a plugin to reliably turn -- into an em dash in the past; if I hadn't discarded that idea, you would have seen some in this post ;)
And I think I have seen at least one spell-checking browser plugin which does stuff like that.
Oh, and some people use 3rd-party interfaces to interact with HN, some of which auto-convert consecutive dashes to em dashes.
In the places where I have been using AI from time to time, it's also not super common to see em dashes.
So IMHO an em dash isn't a telltale sign of something being AI-written.
But then, wrt. the OP comment, I think you might be right anyway. Its writing style is ... strange. Like a writing style taken from a novel, and not just any writing style, but one which over-exaggerates that a story is currently being told inside a story, yet filled with the semantics of an HN comment. Like what you might get if you ask an LLM to "tell a story" from your set of bullet points.
But this opens a question: if the story still comes from a human, isn't it fine? Or is it offensive that they didn't just give us compact bullet points?
Putting that aside, there is always the option that the author is just very well read, maybe a book author, maybe a hobby author, and picked up such a writing style.
T0Bi
A lot of phones do this automatically when you type a double dash (-- becomes —)
KTibow
The Android client I use, Harmonic, has a shortcut to report a user, although it just prefills an email to hn@ycombinator.com.
yig
option-shift-minus on a Mac (option-minus for an en dash).
reaperducer
> I’m not sure anyone I know could make an em dash with their keyboard off the top of their head.
I have endash bound to ⇧⌥⌘0, and emdash bound to ⇧⌥⌘=.
saagarjha
What kind of data did you run this on?
teruakohatu
> Like giving GPT a multiple-choice test on gibberish—it will pick something, and say it with its chest.
If I gave a classroom of undergrad students a multiple-choice test where no answers were correct, I can almost guarantee almost all the tests would be filled out.
Should GPT and other LLMs refuse to take a test?
In my experience it will answer with the closest answer, even if none of the options are even remotely correct.
shafyy
Not refuse, but remark that none of the answers seems correct. After all, we are only 2 days away from an AGI pro gamer researcher AI according to experts[1], so I would expect this behavior at least.
1: People who have a financial stake in the AI hype
bluGill
In multiple choice, if you don't know, then a random guess is your best answer in most cases. In a few tests a blank is scored better than a wrong answer, but that is rare and the professors will tell you.
As such I would expect students to put in something. However, after class they would talk about how bad they think they did, because they are all self-aware enough to know where they guessed.
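For concreteness: on a four-option question with no penalty, a blind guess is worth 1/4 of a point in expectation versus 0 for a blank, so guessing always wins; only with negative marking around -1/3 per wrong answer does the expectation fall to 0.25*1 + 0.75*(-1/3) = 0 and guessing stop paying off.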
jeremyjh
I would love to see someone try this. I would guess 85-90% of undergrads would fill out the whole test, but not everyone. There are some people who believe the things that they know.
ofjcihen
I think the issue is the confidence with which it lies to you.
A good analogy would be if someone claimed to be a doctor and when I asked if I should eat lead or tin for my health they said “Tin because it’s good for your complexion”.
6stringmerc
Yes, it should refuse.
Humans have made progress by admitting when they don’t know something.
Believing an LLM should be exempt from this boundary of “responsible knowledge” is an untenable path.
As in, if you trust an ignorant LLM then by proxy you must trust a heart surgeon to perform your hip replacement.
ijk
Just on a practical level, adding a way for the LLM to bail if it can detect that things are going wrong saves a lot of trouble. Especially if you are constraining the inference. You still get some false negatives and false positives, of course, but giving it the option to say "something else" and explain can save you a lot of headaches when you accidentally send it down the wrong path entirely.
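Concretely, the escape hatch can be as simple as one extra label plus a rule for routing it. A rough sketch, where ask_model is a stand-in for whatever constrained-decoding or structured-output call is actually being used and the label names are made up:

    # Sketch: constrained classification with an explicit escape hatch.
    # ask_model() is a placeholder for the LLM call; the widened label set is the point.
    ALLOWED = ["kinase", "protease", "transporter", "SOMETHING_ELSE"]

    def classify(description: str, ask_model) -> tuple[str, str | None]:
        answer = ask_model(
            f"Classify this protein: {description}\n"
            f"Answer with exactly one of {ALLOWED}. If none fit, answer "
            f"SOMETHING_ELSE and briefly explain why."
        )
        label = next((a for a in ALLOWED if a in answer), "SOMETHING_ELSE")
        note = answer if label == "SOMETHING_ELSE" else None
        return label, note  # route SOMETHING_ELSE cases to a human instead of forcing a guess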
amelius
Before making AI do research, perhaps we should first let it __reproduce__ research. For example, give it a paper on some deep learning technique and make it produce an implementation of that paper. Until it can do that, I have no hope that it can produce novel ideas.
ErigmolCt
Reproducibility is the baseline. Until models can consistently read, understand, and implement existing research correctly, "AI scientist" talk is mostly just branding.
slewis
OpenAI created a benchmark for this: https://openai.com/index/paperbench/
suddenlybananas
Still has data contamination though.
Szpadel
Still, LLMs cannot beat it, so it's good enough for a start.
patagurbon
You would have to have a very complete audit trail for the LLM and ensure the paper shows up nowhere in the dataset.
We have rare but not unheard-of issues with academic fraud. LLMs fake data and lie at the drop of a hat.
TeMPOraL
> You would have to have a very complete audit trail for the LLM and ensure the paper shows up nowhere in the dataset.
We can do both known and novel reproductions. Like with both LLM training process and human learning, it's valuable to take it in two broad steps:
1) Internalize fully-worked examples, then learn to reproduce them from memory;
2) Train on solving problems for which you know the results but have to work out intermediate steps yourself (looking at the solution before solving the task)
And eventually:
3) Train on solving problems you don't know the answer to, have your solution evaluated by a teacher/judge (that knows the actual answers).
Even parroting existing papers is very valuable, especially early on, when the model is learning what papers and research look like.
6stringmerc
…because there are no consequences for AI. Humans understand shame, pain, and punishment. Until AI models develop this conditional reasoning as part of their process, to me, they’re grossly overestimated in capability and reliability.
ojosilva
I thought you were going to say "give AI the first part of a paper (prompt) and let it finish it (completion)" as validation that AI can produce science on par with the original research results. Until it can do that, I have no hope that it can produce novel ideas.
DrScientist
I once had a university assignment where they provided the figures from a paper and we had to write the paper around just the given figures.
A bit like how you might write a paper yourself - starting with the data.
As it turned out, I thought the figures looked like data that might be from a paper referenced in a different lecturer's set of lectures (just the conclusion, he hadn't shown the figures) - so I went down to the library (this is in the days of non-digitized content - you had to physically walk the stacks) and looked it up - found the original paper and then a follow-up paper by the same authors....
I like to think I was just doing my background research properly.
I told a friend about the paper and before you know it the whole class knew - and I had to admit to the lecturer that I'd found the original paper when he wondered why the whole class had done so well.
Obviously this would be trivial today with an electronic search.
tbrownaw
> For example, give it a paper of some deep learning technique and make it produce an implementation of that paper.
Or maybe give it a paper full of statistics about some experimental observations, and have it reproduce the raw data?
bee_rider
Like, have the AI do the experiment? That could be interesting. Although I guess it would be limited to experiments that could be done on a computer.
darkoob12
Reproducibility was never a serious issue in the AI research community. I think one of the main reasons for the explosive progress in AI was the open community, where people could easily reproduce other people's research. If you look at top-tier conferences you see that they share everything: paper, LaTeX, code, data, lecture videos, etc.
After ChatGPT, big corporations stopped sharing their main research, but it still happens in academia.
mnky9800n
I think what I would rather like to see is the reproduction of results from experiments that the AI didn't see but are well known. Not reproducing AI papers. For example, assuming a human can build it, would an AI, not knowing anything except what was known at the time, be able to design the Millikan oil drop experiment? Or would it be able to design a Taylor-Couette setup for exploring turbulence? Would it be able to design a linear particle accelerator or a triaxial compression experiment? I think an interesting line of reasoning would be to restrict the training data to what was known before a seminal paper was produced. Like take Lorenz's atmospheric circulation paper and train an AI only on data that comes from before that paper was published. Does the AI produce the same equations in the paper and the same description of chaos that Lorenz arrived at?
raxxorraxor
You probably have to fight quite a few battles because many people know their papers aren't reproducible. More politics than science really.
It would be the biggest boon to science since sci-hub though.
And since a large set of studies won't be reproducible, you need human supervision as well, at least at first.
YossarianFrPrez
Seconded, as not only is this an interesting idea, it might also help solve the issue of checking for reproducibility. Yet even then human evaluators would need to go over the AI-reproduced research with a fine-toothed comb.
Practically speaking, I think there are roles for current LLMs in research. One is in the peer review process. LLMs can assist in evaluating the data-processing code used by scientists. Another is for brainstorming and the first pass at lit reviews.
Kiyo-Lynn
I once met a researcher who spent six months verifying the results of a published paper. In the end, all he received was a simple “thanks for pointing that out.” He said quietly, “Some work matters not because it’s seen, but because it keeps others from going wrong.”
I believe that if we’re not even willing to carefully confirm whether our predictions match reality, then no matter how impressive the technology looks, it’s only a fleeting illusion.
jajko
While that will not land him a Nobel prize, it's miles ahead in terms of achievement and added value to mankind compared to most corporate employees. We wish we could say something like that about our past decade of work.
eru
People typically get paid as a thank you for their corporate work, not just a lukewarm 'thanks for pointing that out.'
Kiyo-Lynn
A lot of the time, the work we do doesn't get much recognition, and barely gets seen. But maybe it still helped in some small way. Thinking about that makes it feel a little less disappointing.
boxed
Oh look, just what I've been predicting: https://news.ycombinator.com/context?id=44041114 https://news.ycombinator.com/context?id=41786908
It's the same as "AI can code". It gets caught failing spectacularly, over and over again, whenever the problem isn't in the training set, and people are surprised every time.
kmacdough
With "AI can code", though, we can get pretty far by working around the problem. Use it to augment the workflow of a real SWE and supply it with guardrails like linters, tests, etc. It doesn't do the hard bits like architecture, design and review, but it can take huge amounts of the repetitive "solved" bits that dominate most SWEs time. Very possible to 2-5x productivity without quality loss (because the human does all work to guarantee quality).
But yes, unmanaged and unchecked it absolutely cannot to the full job of really any human. It's not close.
godelski
> although later investigation suggests there may have been data leakage
I think this point is often forgotten. Everyone should assume data leakage until it is strongly evidenced otherwise. It is not on the reader/skeptic to prove that there is data leakage; the authors have the burden of proof. It is easy to have data leakage on small datasets, datasets where you can look at everything. Data leakage is really easy to introduce, and you often do it unknowingly. Subtle things easily spoil data.
Now, we're talking about gigantic datasets where there's no chance anyone can manually look through it all. We know the filter methods are imperfect, so how do we come to believe that there is no leakage? You can say you filtered it, but you cannot say there's no leakage.
Beyond that, we are constantly finding spoilage in the datasets we do have access to. So there's frequent evidence that it is happening.
So why do we continue to assume there's no spoilage? Hype? Honestly, it just sounds like a lie we tell ourselves because we want to believe. But we can't fix these problems if we lie to ourselves about them.
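Even a crude check beats assuming the best. A minimal sketch of the exact n-gram overlap test often used to flag contamination between an evaluation set and a training corpus (n and the whitespace tokenization are arbitrary choices here, and exact matching still misses paraphrased or reformatted leakage):

    # Crude contamination check: flag eval examples sharing long n-grams with training text.
    def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def flag_contaminated(eval_examples: list[str], train_corpus: list[str], n: int = 13) -> list[int]:
        train_grams: set[tuple[str, ...]] = set()
        for doc in train_corpus:
            train_grams |= ngrams(doc, n)
        # indices of eval examples sharing at least one n-gram with the training corpus
        return [i for i, ex in enumerate(eval_examples) if ngrams(ex, n) & train_grams]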
SamuelAdams
Every system has problems. The better question is: what is the acceptable threshold?
For example, Medicare and Medicaid had a fraud rate of 7.66% [1]. Yes, that is a lot of billions, and there is room for improvement, but that doesn't mean the entire system is failing: the remaining ~92% of cases are being covered as intended.
The same could be said of these models. If the spoilage rate is 10%, does that mean the whole system is bad? Or is it at a tolerable threshold?
[1]: https://www.cms.gov/newsroom/fact-sheets/fiscal-year-2024-im...
fastaguy88
In the protein annotation world, which is largely driven by inferring common ancestry between a protein of unknown function and one of known function, common error thresholds range from FDR of 0.001 to 10^-6. Even a 1% error rate would be considered abysmal. This is in part because it is trivial to get 95% accuracy in prediction; the challenging problem is to get some large fraction of the non-trivial 5% correct.
"Acceptable" thresholds are problem specific. For AI to make a meaningful contribution to protein function prediction, it must do substantially better than current methods, not just better than some arbitrary threshold.
mike_hearn
I think it's worth being highly skeptical about fraud rates that are stated to two decimal places of precision. Fraud is by design hard to accurately detect. It would be more accurate to say, Medicare decides 7.66% of its cases are fraudulent according to its own policies and procedures, which are likely conservative, and cannot take into account undetected fraud. The true rate is likely higher, perhaps much higher.
There's also the problem of false negatives vs. false positives. If your goal is to cover 100% of legitimate cases, you can achieve that easily by just never denying a claim; of course, that would let a stratospheric amount of fraud through. You have to understand both the FN rate (cost of missed fraud) and the FP rate (cost of fraud fighting) and then balance them.
The same applies with using models in science to make predictions.
alwa
The number seems to come from Medicare’s CERT program [0]. At a hurried glance they seem to have published data right up to present, but their most recent interpretive report I could find with error margins was from 2016. That one [1] put the CIs on those fraud rates in the +/-2% range per subtype and around +/-0.9% overall. Bearing out your point.
CERT’s annual assessments do seem to involve a large-scale, rigorous analysis of an independent sample of 50,000 cases, though. And those case audits seem, at least on paper and to a layperson, to apply rather more thorough scrutiny than Medicare’s day-to-day policies and procedures.
As @patio11 says, and to your point, “the optimal amount of fraud is non-zero”… [2]
[0] https://www.cms.gov/data-research/monitoring-programs/improp...
[1] https://www.cms.gov/research-statistics-data-and-systems/mon...
[2] https://www.bitsaboutmoney.com/archive/optimal-amount-of-fra...
godelski
> The better question is: what is the acceptable threshold?
Currently we are unable to answer that question. AND THAT'S THE PROBLEM. I'd be fine if we could. Well, at least far less annoyed. I'm not sure what the threshold should be, but we should always try to minimize it. At least error bounds would do a lot of good at making this happen. But right now we have no clue, and that's why this is such a big question that people keep bringing up. It's not that we avoid pointing out specific levels of error because they are small and we don't want you looking at them; rather, we don't point them out because nobody has a fucking clue.
And until someone has a clue, you shouldn't trust that the error rate is low. The burden of proof is on the one making the claim of performance, not the one asking for evidence for that claim (i.e. skeptics).
Btw, I'd be careful with percentages, especially when the absolute numbers are very high. E.g. LLMs are being trained on trillions of tokens. 10% of 1 trillion is 100 bn. The entire works of Shakespeare are about 1.2M tokens... A 10% error rate would be big enough to spoil any dataset. The bitter truth is that as the absolute number increases, the threshold for acceptable spoilage (in terms of percentage) needs to decrease.
dgb23
There's also the question of "what is it failing at?"
I'm fine with 5% failure if my soup is a bit too salty. Not fine with 0.1% failure if it contains poison.
wavemode
Data leakage is an eval problem, not an accuracy problem.
That is, the problem is not that the AI is wrong X% of the time. The problem is that, in the presence of a data leak, there is no way of knowing what the value of X even is.
This problem is recursive - in the presence of a data leak, you also cannot know for sure the quantity of data that has leaked.
antithesizer
The supposed location of the burden of proof is really not the definitive guide to what you ought to believe, despite what people online seem to think.
mathgeek
Can you elaborate? You've made a claim, but I really think there'd be value in continuing on to what you actually mean.
NormLolBob
They mean "vet your sources and don't blindly follow the internet hive-mind," or similar; the burden of proof is not what the internet thinks it is.
They tacked their actual point onto the end of a copy-paste of the OP comment's context and ended up writing something barely grammatically correct.
In doing so they prove exactly why not to listen to the internet. So they have that going for them.
tbrownaw
What is the relevance of this generic statement to the discussion at hand?
kenjackson
"And for most deep learning papers I read, domain experts have not gone through the results with a fine-tooth comb inspecting the quality of the output. How many other seemingly-impressive papers would not stand up to scrutiny?"
Is this really not the case? I've read some of the AI papers in my field, and I know many other domain experts have as well. That said I do think that CS/software based work is generally easier to check than biology (or it may just be because I know very little bio).
a_bonobo
Validation of biological labels easily takes years - in the OP's example it was a 'lucky' (huge!) coincidence that somebody already had spent years on one of the predicted proteins' labels. Nobody is going to stake 3-5 years of their career on validating some random model's predictions.
knowaveragejoe
Just curious, could you expand on what about that process takes years?
a_bonobo
Bioinformatically, you could compare your protein with known proteins and infer function from there, but OP's paper is specifically for the use case where we know nothing in our databases.
Time-wise it depends where in the process you start!
Do you know what your target protein even is? I've seen entire PhDs trying to purify a single protein - every protein is purified differently, there are dozens of methods and some work and some don't. If you can purify it you can run a barrage of tests on the protein - is it a kinase, how does it bind to different assays etc. That gives you a fairly broad idea in which area of activity your protein sits.
If you know what it is, you can clone it into an expression vector for a host like E. coli. Then E. coli will hopefully express it. That's a few weeks/months of work, depending on how much you want to double-check.
You can then use fluorescent tags like GFP to show you where in the cell your protein is located. Is it in the cell-wall? is it in the nucleus? that might give you an indication to function. But you only have the location at this point.
If your protein is in an easily kept organism like mice, you can run knock-out experiments, where you use different approaches to either turn off or delete the gene that produces the protein. That takes a few months too - and chances are nothing in your phenotype will change once the gene is knocked out, protein-protein networks are resilient and there might be another protein jumping in to do the job.
if you have an idea of what your protein does, you can confirm using protein-protein binding studies - I think yeast two-hybrid is still very popular for this? It tests whether two specific proteins - your candidate and another protein - interact or bind.
None of those tests will tell you 'this is definitely a nicotinamide adenine dinucleotide binding protein', every test (and there are many more!) will add one piece to the puzzle.
Edit: of course it gets extra-annoying when all these puzzle pieces contradict each other. In my past life I've done some work with the ndh genes that sit on plant chloroplasts and are lost in orchids and some other groups of plants (including my babies), so it's interesting to see what they actually do and why they can be lost. It's called ndh because it was initially named NADH-dehydrogenase-like, because by sequence it kind of looks like a NADH dehydrogenase.
There's a guy in Japan (Toshiharu Shikanai) who worked on it most of his career and worked out that it certainly is NOT a NADH dehydrogenase and is instead a Fd-dependent plastoquinone reductase. https://www.sciencedirect.com/science/article/pii/S000527281...
Knockout experiments with ndh are annoying because it seems to be only really important in stress conditions - under regular conditions our ndh- plants behaved the same.
Again, this is only one protein, and since it's in chloroplasts it's ultra-common - most likely one of the more abundant proteins on earth (it's not in algae either). And we still call it ndh even though it is a Ferredoxin-plastoquinone reductase.
yorwba
Reading a paper is not the same as verifying the results is not the same as certifying their correctness. I read a lot of papers, but I typically only look at the underlying data when I intend to repurpose it for something else, and when I do, I tend to notice errors in the ground truth labels fairly quickly. Of course most models don't perform well enough for this to influence results much...
suddenlybananas
My impression with linguistics is that people do go over the papers that use these techniques carefully and come up with criticisms of them, but people don't take linguists seriously so people from other related disciplines ignore the criticisms.
croemer
Don't call "Nature Communications" "Nature". The prestige is totally different. Also, altmetrics aren't that relevant, except maybe if you want to measure public hype.
croemer
Update: It seems the author read this and fixed it. Thanks!
8bitsrule
Fits my limited experience with LLMs (as a researcher). Very impressive apparent written-language comprehension and written expression. But when it comes to getting to the -best possible answer- (particularly on unresolved questions), the nearly-instant responses (e.g. to questions that one might spend a half-day on without resolution) are seldom satisfactory. Complicated questions take time to explore, and IME an LLM's lack of resolution (because of its inability) is, so far, set aside in favor of confident-sounding (even if completely wrong) responses.
ErigmolCt
This really nails one of the core problems in current AI hype cycles: we're optimizing for attention, not accuracy. And this isn't just about biology. You see similar patterns in ML applications across fields: climate science, law, even medicine
slt2021
Fantastic article by Rachel Thomas!
This is basically another argument that deep learning works only as [generative] information retrieval - i.e. a stochastic parrot - due to the fact that the training data is a very lossy representation of the underlying domain.
Because the data/labels of genes do not always represent the underlying domain (biology) perfectly, the output can be false/invalid/nonsensical.
In cases where it works very well, there is data leakage, because by design LLMs are information retrieval tools. From an information theory standpoint, this is a fundamental "unknown unknown" for any model.
My takeaway is that it's not the fault of the algorithm; it's more the fault of the training dataset.
We humans operate fluidly in the domain of natural language, and even a kid can read a text and evaluate whether it makes sense or not - this explains the success of models trained on NLP.
But in domains where the training data represents the underlying domain with losses, it will be imperfect.
ffwd
This to me is the paradox of modern LLMs, in that it doesn't represent the underlying domain directly, but it can represent whatever information can be presented in text. So it does represent _some_ information but it is not always clear what it is or how.
The embedding space can represent relationships between words, sentences and paragraphs, and since those things can encode information about the underlying domain, you can query those relationships with text and get reasonable responses. The problem is it's not always clear what is being represented in those relationships as text is a messy encoding scheme.
But another weakness is that, as you say, it is generative. To make it generative, instead of hardcoding all possible questions and all possible answers into a database, we offload some of the data to an algorithm (next-token prediction), which gives us the possibility of an imprecise, probabilistic question/prompt (which is useful because then you can ask anything).
But the problem is that no single algorithm can ever encode all possible answers to all possible questions in a domain-accurate way, and so you lose some precision in the information. Or at least this is how I see LLMs atm.
dathinab
> works only as [generative] information retrieval
But even if, for the sake of argument, we assume that is true without question, LLMs are still here to stay.
Think about how junior devs with (in programming) average or less skill work: they "retrieve" the information about how to solve the problem from Stack Overflow, tutorials, etc.
So giving all your devs some reasonably well-done AI automation tools (not just a chat prompt!!) is like giving each of them a junior dev to delegate all the tedious simple tasks to, without having to worry about those tasks not allowing the junior dev to grow and learn. And to top it off, if there is enough tooling (static code analysis, tests, etc.) in place, the AI tooling will do the write code -> run tools -> fix issues loop just fine. And the price for that tool is, what, 1/30th of that of a junior dev? That means more time to focus on the things which matter, including teaching your actual junior devs ;)
And while I would argue AI isn't fully there yet, I think the current foundation models _might_ already be good enough to get there with the right ways of wiring them up and combining them.
slt2021
Programming languages are created by humans and the training dataset is complete enough to train LLMs with good results. Most importantly, natural language is the native domain of programming code.
Whereas in biology, the natural domain is the physical/chemical/biological reactions occurring between organisms and molecules. The laws of interaction are not created by humans, but by the Creator(tm), and so the training dataset barely captures a tiny fraction of the richness of the domain and its interactions. Because of this, any model will be inadequate.
vixen99
I wonder to what extent the thought processes that led to the situation described by Rachel Thomas are active in other areas. Important article, by the way, I agree!
softwaredoug
We also love deep cherry picking. Working hard to find that one awesome time some ML / AI thing worked beautifully and shouting its praises to the high heavens. Nevermind the dozens of other times we tried and failed...
ErigmolCt
Yup, the survivorship bias is strong. It's like academic slot machines
r3trohack3r
Dude. I just asked my computer to write [ad lib basic utility script] and it spit out a syntactically correct C program that does it with instructions for compiling it.
And then I asked it for [ad lib cocktail request] and got back thorough instructions.
We did that with sand. That we got from the ground. And taught it to talk. And write C programs.
Never mind what? That I had to ask twice? Or five times?
What maximum number of requests do you feel like the talking sand needs to adequately answer your question in before you are impressed by the talking sand?
rsfern
This is all awesome, but a bit off topic for the thread, which focuses on AI for science.
The disconnect here is that the cost of iteration is low and it's relatively easy to verify the quality of a generated C program (does the compiler issue warnings or errors? Does it pass a test suite?) or a recipe (basic experience is probably enough to tell if an ingredient seems out of place or proportions are wildly off).
In science, verifying a prediction is often super difficult and/or expensive, because at prediction time we're trying to shortcut around an expensive or intractable measurement or simulation. Unreliable models can really change the tradeoff point of whether AI accelerates science or just massively inflates the burn rate.
halpow
Crows and parrots are amazing talkers too, but there's a hard limit to how much sense they make. Do you want those birds to teach your kids and serve you medicine?
wtetzner
I don't think it has anything to do with being impressed or not. It's about being careful not to put too much trust in something so fallible. Because it is so amazing, people overestimate where it can be reliably used.
dgb23
First of all, I appreciate your comment. Yes, it's fucking amazing. (I usually imagine it being "light" and not "sand" though. "Sand" is much more poignant!)
But I think people aren't arguing about how amazing it is, but about specific applicability. There's also a lot of toxic hype and FUD going around, which can be tiring and frustrating.
TeMPOraL
Even more so, we also love deep stochastic parroting. Working hard to ignore direct experience, growing amount of reports, and to avoid reasoning from first principles, in order to confidently deny the already obvious utility of LLMs, and backing that position with some tired memes.
imiric
It's interesting to see this article in juxtaposition to the one shared recently[1], where AI skeptics were labeled as "nuts", and hallucinations were "(more or less) a solved problem".
This seems to be exactly the kind of results we would expect from a system that hallucinates, has no semantic understanding of the content, and is little more than a probabilistic text generator. This doesn't mean that it can't be useful when placed in the right hands, but it's also unsurprising that human non-experts would use it to cut corners in search of money, power, and glory, or worse—actively delude, scam, and harm others. Considering that the latter group is much larger, it's concerning how little thought and resources are put into implementing _actual_ safety measures, and not just ones that look good in PR statements.
JackC
The difference in fields is key here: AI models are going to have a very different impact in fields where ground truth is available instantly (does the generated code have the expected output?) or takes years of manual verification.
(Not a binary -- ground truth is available enough for AI to be useful to lots of programmers.)
weatherlite
> does the generated code have the expected output?
That's many times not easy to verify at all ...
dathinab
you can easily verify a lot like:
- correct syntax
- passes lints
- type checking passes
- fast test suite passes
- full test suite passes
and every time a check fails, you feed the output back into the LLM, automatically, in a loop, without your involvement (roughly as sketched below).
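A rough sketch of that loop (the check commands are just examples, and generate_patch is a placeholder for however the LLM is invoked and its edits applied):

    # Sketch of an automated generate -> check -> feed-errors-back loop.
    import subprocess

    CHECKS = [
        ["ruff", "check", "."],   # lints
        ["mypy", "."],            # type checking
        ["pytest", "-x", "-q"],   # test suite
    ]

    def run_checks() -> str | None:
        """Return the first failing check's output, or None if everything passes."""
        for cmd in CHECKS:
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                return result.stdout + result.stderr
        return None

    def fix_loop(task: str, generate_patch, max_rounds: int = 5) -> bool:
        feedback = None
        for _ in range(max_rounds):
            generate_patch(task, feedback)  # placeholder: LLM writes/edits files
            feedback = run_checks()
            if feedback is None:
                return True                 # all checks green
        return False                        # give up and hand back to a human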
The results are often -- sadly -- too good to not slowly start using AI.
I say sadly because IMHO the IT industry has gone somewhere very wrong due to growing too fast, moving too fast, and getting so much money that the companies spearheading it could just throw more people at problems instead of fixing underlying issues. There is also a huge divergence between the science of development, programming, application composition, etc. (not to be confused with the science of, idk, data structures and fundamental algorithms) and what the industry uses and how it advances.
Now I think normally the industry would auto-correct at some point, but I fear with LLMs we might get even further away from any fundamental improvements, as we find even more ways to just go along and continue the mess we have.
Worse, the performance of LLM coding is highly dependent on how much very similar languages are represented in its dataset, so new languages with any breakthrough or huge improvement will work less well with LLMs. If that trend continues, it would lock us in with very mid solutions long term.
BlueTemplar
Heh, reminds me of cryptocurrencies...
Or even of the Internet in general.
I guess it's a common pitfall with information or communication technologies ?
(Heck, or with technologies in general, but non-information or communication ones rarely scale as explosively...)
imiric
It's a common symptom of the Gartner hype cycle.
This doesn't mean that there aren't very valid use cases for these technologies that can benefit humanity in many ways (and I mean this for both digital currencies and machine learning), but unfortunately those get drowned out by the opportunity seekers and charlatans that give the others the same bad reputation.
As usual, it's best to be highly critical of opinions on both extreme sides of the spectrum until (and if) we start climbing the Slope of Enlightenment.