
AI tools are spotting errors in research papers

crazygringo

This actually feels like an amazing step in the right direction.

If AI can help spot obvious errors in published papers, it can do it as part of the review process. And if it can do it as part of the review process, authors can run it on their own work before submitting. It could massively raise the quality level of a lot of papers.

What's important here is that it's part of a process involving the experts themselves -- the authors, the peer reviewers. They can easily dismiss false positives, and crucially they get warnings about statistical mistakes and other aspects of the paper that aren't their primary area of expertise but can contain gotchas.

yojo

Relatedly: unethical researchers could run it on their own work before submitting. It could massively raise the plausibility of fraudulent papers.

I hope your version of the world wins out. I’m still trying to figure out what a post-trust future looks like.

rs186

Students and researchers send their own papers to plagiarism checkers to look for "real" and unintended flags before actually submitting, and make revisions accordingly. This is a known, standard practice that is widely accepted.

And let's say someone modifies their faked lab results so that no AI can detect any evidence of photoshopped images. Their results get published. Well, nobody will be able to reproduce their work (unless other people also publish fraudulent work building on it), and fellow researchers will raise questions, like, a lot of them. Also, guess what: even today, badly photoshopped results often don't get caught for a few years, and in hindsight it's just some low-effort image manipulation -- copying part of an image and pasting it elsewhere.

I doubt any of this changes anything. There is a lot of competition in academia, and depending on the field, things may move very fast. Evading AI detection of fraudulent work likely doesn't give anyone enough of an advantage to survive in a competitive field.

azan_

>Their results get published. Well, nobody will be able to reproduce their work (unless other people also publish fraudulent work from there), and fellow researchers will raise questions, like, a lot of them.

Sadly, you seem to underestimate how widespread fraud is in academia and overestimate how severe the punishment is. In the worst case, when someone finds you guilty of fraud, you will get a slap on the wrist. In the usual case, absolutely nothing will happen and you will be free to keep publishing fraud.

BurningFrog

I'm not in academia, but what I hear is that very few results are ever attempted to be reproduced.

So if you publish an unreproducible paper, you can probably have a full career without anyone noticing.

mezyt

> Well, nobody will be able to reproduce their work (unless other people also publish fraudulent work from there)

In theory, yes; in practice, the original results identifying amyloid beta protein as the main cause of Alzheimer's were faked, and it wasn't caught for 16 years. A member of my family took medication based on it and died in the meantime.

cycomanic

Which researchers are using plagiarism detectors? I'm not aware that this is a known and widely accepted practice. They are used by students and teachers for student papers (in courses etc.), but nobody I know would use them when submitting research. I also don't see the point of even unethical researchers using them; it wouldn't increase your acceptance chances dramatically.

owl_vision

Unless documented and reproducible, it does not exist. That was the minimum standard when I worked with researchers.

I plus 1 your doubt in the last paragraph.

dccsillag

I've never seen this done in a research setting. Not sure about how much of a standard practice it is.

abirch

You're right that this won't change the incentives for dishonest researchers. Unfortunately there's no equivalent of "short sellers" in research -- people who are incentivized to find fraud.

AI is definitely a good thing (TM) for those honest researchers.

t_mann

AI is fundamentally much more of a danger to the fraudsters, because they can only calibrate their obfuscation to today's tools, while the publications are set in stone and can be analyzed by tomorrow's tools. There are already startups going through old papers with modern tools to detect manipulation [0].

[0] https://imagetwin.ai/

dgfitz

Training a language model on non-verified publications seems… unproductive.

kkylin

Every tool cuts both ways. This won't remove the need for people to be good, but hopefully reduces the scale of the problems to the point where good people (and better systems) can manage.

FWIW, while fraud gets the headlines, unintentional errors and plain crappy writing are much more common and, I think, bigger problems. As a reviewer and editor I often feel I'm the first one (counting the authors) to ever read the paper from beginning to end: inconsistent notation & terminology, unnecessary repetitions, unexplained background material, etc.

pinko

Normally I'm an AI skeptic, but in this case there's a good analogy to post-quantum crypto: even if the current state of the art allows fraudulent researchers to evade detection by today's AI by using today's AI, their results, once published, will remain unchanged as the AI improves, and tomorrow's AI will catch them...

mike_hearn

Doesn't matter. Lots of bad papers get caught the moment they're published and read by someone, but there's no follow-up. The institutions don't care if they publish auto-generated spam that can be detected on literally a single read-through; they aren't going to deploy advanced AI on their archives of papers to create consequences a decade later:

https://www.nature.com/articles/d41586-021-02134-0

tmpz22

I think it’s not always a world-scale problem, as scientific niches tend to be small communities. The challenge is to get these small communities to police themselves.

For the rarer world-scale papers, we can dedicate more resources to vetting them.

atrettel

Based on my own experience as a peer reviewer and scientist, the issue is not necessarily in detecting plagiarism or fraud. It is in getting editors to care after a paper is already published.

During peer review, this could be great. It could stop a fraudulent paper before it causes any damage. But in my experience, I have never gotten a journal editor to retract an already-published paper that had obvious plagiarism in it (very obvious plagiarism in one case!). They have no incentive to do extra work after the fact with no obvious benefit to themselves. They choose to ignore it instead. I wish it wasn't true, but that has been my experience.

callc

Humans are already capable of “post-truth”. This is enabled by instant global communication and social media (not dismissing the massive benefits these can bring), and led by dictators who want fealty over independent rational thinking.

The limitations of slow news cycles and slow information transmission lent themselves to slow, careful thinking. Especially compared to social media.

No AI needed.

hunter2_

The communication enabled by the internet is incredible, but this aspect of it is so frustrating. The cat is out of the bag, and I struggle to identify a solution.

The other day I saw a Facebook post of a national park announcing they'd be closed until further notice. Thousands of comments, 99% of which were divisive political banter assuming this was the result of a top-down order. A very easy-to-miss 1% of the comments were people explaining that the closure was due to a burst pipe or something to that effect. It's reminiscent of the "tragedy of the commons" concept. We are overusing our right to spew nonsense to the point that it's masking the truth.

How do we fix this? Guiding people away from the writings of random nobodies in favor of mainstream authorities doesn't feel entirely proper.

brookst

Both will happen. But the world has been post-trust for millennia.

GuestFAUniverse

Maybe raise the "accountability" part?

It baffles me that somebody can be a professor, director, whatever -- meaning they take the place of somebody _really_ qualified -- and not get dragged through court after falsifying a publication until nothing is left of that betrayer.

It's not only the damage to society from false, misleading claims. If those publications decide who gets tenure, a research grant, etc., then the careers of others are massively damaged too.

dsabanin

Maybe at least in some cases these checkers will help them actually find and fix their mistakes and they will end up publishing something useful.

Salgat

My hope is that ML can be used to point out real-world things you can't fake or work around, such as why an idea is considered novel, why the methodology isn't just gaming results, or why the statistics were done wrong.

nickpsecurity

I thought about this a while back. My concept was using RLHF to train an LLM to extract key points and their premises, and to generate counter-questions. A human could filter the questions, and that feedback becomes training material.

Once the models get better with numbers, maybe have one spot statistical errors. I think a constantly updated, field-specific checklist for human reviewers makes more sense for that, though.

For a data source, I thought OpenReview.net would be a nice start.
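
As a rough sketch only, the extraction and counter-question step could look something like the following, assuming the OpenAI Python client; the model name, prompt wording, and function are illustrative placeholders, not part of the idea above:

```python
# Hypothetical sketch: extract key claims and premises from a paper excerpt and
# generate counter-questions for a human reviewer to keep or discard. The kept
# (excerpt, question) pairs could then become the training feedback described above.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACTION_PROMPT = (
    "Read the following paper excerpt. List its key claims, the premises each "
    "claim rests on, and for each premise one counter-question a skeptical "
    "reviewer might ask."
)

def generate_counter_questions(paper_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": paper_text},
        ],
    )
    return response.choices[0].message.content
```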

Groxx

I very much suspect this will fall into the same behaviors as AI-submitted bug reports in software.

Obviously it's useful when desired; they can find real issues. But it's also absolutely riddled with unchecked "CVE 11 fix now!!!" spam that isn't even correct, exhausting maintainers. Some of those are legitimate accidents, but many are just karma-farming for some other purpose, trying to look like legitimate effort by throwing plausible-looking work onto other people.

econ

The current review mechanism is based on how expensive it is to do the review. If it can be done cheaply, it can be replaced with a continuous review system. With each discovery, previous works at least need adjusted wording. What starts out as an educated guess or an invitation for future research can be replaced with, or directly linked to, newer findings. An entire body of work can simultaneously drift sideways and offer a new way to measure impact.

Taylor_OD

In another world of reviews... Copilot can now be added as a PR reviewer if a company allows/pays for it. I've started doing it right before adding any of my actual peers. It's only been a week or so, and it did catch one small thing for me last week.

This type of LLM use feels like spell check, except for basic logic. As long as we still have people who know what they are doing reviewing stuff AFTER the AI review, I don't see any downsides.

mmooss

I agree it should be part of the composition and review processes.

> It could massively raise the quality level of a lot of papers.

Is there an indication that the difference is 'massive'? Reading the OP, it wasn't clear to me how significant these errors are. For example, maybe they are simple factual errors such as the wrong year on a citation.

> They can easily dismiss false positives

That may not be the case - it is possible that the error reports are not worth the time to triage. Based on the OP's reporting on accuracy, it doesn't seem like they are, but it could vary by field, type of content (quantitative, etc.), and so on.

ok_computer

Or it could become a gameable review step like first line resume review.

I think the only structural way to change research publication quality en masse is to change the incentives of the publishers, grant recipients, tenure-track requirements, and grad or post-doc researcher empowerment/funding.

That is a tall order, so I suspect we'll get more of the same, and now there will be 100-page articles that score 100%, just like there are 4-5 page top-ranked resumes. Whereas a dumb human can tell you that a one-page resume or a 2,000-word article should suffice to get the idea across (barring tenuous proofs or explanations of methods).

Edit: the incentives of anonymous reviewers as well, who can occupy an insular sub-industry to prop up colleagues or discredit research that contradicts theirs.

flenserboy

So long as they don't build the models to rely on earlier papers, it might work. Fraudulent or mistaken earlier work, taken as correct, could easily lead to newer papers that disagree with it, or that don't use the older data, being flagged as wrong/mistaken. This sort of checking needs to drill down as far as possible.

YeGoblynQueenne

Needs more work.

>> Right now, the YesNoError website contains many false positives, says Nick Brown, a researcher in scientific integrity at Linnaeus University. Among 40 papers flagged as having issues, he found 14 false positives (for example, the model stating that a figure referred to in the text did not appear in the paper, when it did). “The vast majority of the problems they’re finding appear to be writing issues,” and a lot of the detections are wrong, he says.

>> Brown is wary that the effort will create a flood for the scientific community to clear up, as well as fuss about minor errors such as typos, many of which should be spotted during peer review (both projects largely look at papers in preprint repositories). Unless the technology drastically improves, "this is going to generate huge amounts of work for no obvious benefit", says Brown. "It strikes me as extraordinarily naive."

LiamPowell

> for example, the model stating that a figure referred to in the text did not appear in the paper, when it did

This shouldn't even be possible for most journals, where cross-references with links are required, as LaTeX or similar will emit an error.

jraph

I've never seen this. Usually you don't have the LaTeX source of a paper you cite, so you wouldn't know which label to use for the reference -- when the cited paper is written in LaTeX at all. Or something has changed quite a bit in recent years.

Can you link to another paper's Figure 2.2 now, and have LaTeX error out if the link is broken? How does that work?

LiamPowell

I assume they're referring to internal references. It does not look like they feed cited papers into their tool.

dadadad100

Much like scanning tools looking for CVEs: there are thousands of devs right this moment chasing alleged vulns. It is early days for all of these tools. Giving papers a look-over is an unqualified good, as it is for code. I like the approach of keeping it private until the researcher can respond.

Lerc

There are two different projects being discussed here: one open-source effort and one "AI entrepreneur" effort. YesNoError is the latter project.

AI, like cryptocurrency, faces a lot of criticism because of the snake oil and the varying levels of poor applications, ranging from the fanciful to outright fraud. It bothers me a bit how much of that critique spreads onto the field as a whole. The phrase "snake oil" originated as a touted medical treatment, and medicine is a field that has charlatans deceiving people to this day. In years past I would have thought it a given that people would not consider a wholesale rejection of healthcare as a field because of the presence of fraud. Post-pandemic, with the abundance of conspiracies, I have some doubts.

I guess the point I'm making is judge each thing on their individual merits. It might not all be bathwater.

sfink

Don't forget that this is driven by present-day AI. Which means people will assume that it's checking for fraud and incorrect logic, when actually it's checking for self-consistency and consistency with training data. So it should be great for typos, misleading phrasing, and cross-checking facts and diagrams, but I would expect it to do little for manufactured data, plausible but incorrect conclusions, and garden variety bullshit (claiming X because Y, when Y only implies X because you have a reasonable-sounding argument that it ought to).

Not all of that is out of reach. Making the AI evaluate a paper in the context of a cluster of related papers might enable spotting some "too good to be true" things.

Hey, here's an idea: use AI to map out the influence of papers that were later retracted (whether for fraud or error, it doesn't matter). Not just via citation, but have it try to identify the no-longer-supported conclusions from a retracted paper and see where they show up in downstream papers. (Cheap "downstream" is when a paper, or a paper in a family of papers by the same team, ever cited the upstream paper, even in preprints. The more expensive version is doing downstream detection without citations.)
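
As a rough sketch of the cheap, citation-based version, assuming you already have a citation edge list from somewhere like OpenAlex or Semantic Scholar (the library and names here are illustrative, not a real pipeline):

```python
# Hypothetical sketch: given citation edges and a set of retracted papers,
# find every paper that cites a retracted one directly or transitively.
import networkx as nx

def downstream_of_retracted(citations: list[tuple[str, str]],
                            retracted: set[str]) -> dict[str, set[str]]:
    """citations: (citing_id, cited_id) pairs; retracted: ids of retracted papers.
    Returns, per retracted paper, the set of papers downstream of it."""
    graph = nx.DiGraph()
    # Point edges from the cited paper to the citing paper, so that walking
    # descendants follows the flow of influence downstream.
    graph.add_edges_from((cited, citing) for citing, cited in citations)
    return {p: nx.descendants(graph, p) for p in retracted if p in graph}
```

The more expensive version -- spotting no-longer-supported conclusions in papers that never cite the retraction -- would need an LLM pass over full texts rather than a graph traversal.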

ForTheKidz

> people will assume that it's checking for fraud and incorrect logic, when actually it's checking for self-consistency and consistency with training data.

TBF, this also applies to all humans.

lucianbr

No, no it does not. Are you actually claiming with a straight face that not a single human can check for fraud or incorrect logic?

Let's just claim any absurd thing in defense of the AI hype now.

ForTheKidz

> Are you actually claiming with a straight face that not a single human can check for fraud or incorrect logic?

No, of course not. I was pointing out that we largely check "for self-consistency and consistency with training data" as well. Our checking of the coherency of other people's work is presumably an extension of this.

Regardless, computers already check for fraud and incorrect logic as well, albeit in different contexts. Neither humans nor computers can do this with general competency, i.e. without specific training to do so.

nxobject

To be fair, at least humans get to have collaborators from multiple perspectives and skillsets; a lot of the discussion about AI in research has assumed that a research team is one hive mind, when the best collaborations aren’t.

Groxx

There is a clear difference in capability, even though they share many failures.

raincole

If you can check for manufactured data, it means you know more about what the real data looks like than the author.

If there were an AI that could detect manufactured data, science would be a solved problem.

SwtCyber

It's probably not going to detect a well-disguised but fundamentally flawed argument.

timewizard

They spent trillions of dollars to create a lame spell check.

RainyDayTmrw

Perhaps our collective memories are too short? Did we forget what curl just went through with AI confabulated bug reports[1]?

[1]: https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...

simonw

"YesNoError is planning to let holders of its cryptocurrency dictate which papers get scrutinized first."

Sigh.

jamestimmins

Shocking how often I see a seemingly good idea end with "... and it'll all be on chain", causing me to immediately lose faith in the original concept.

__MatrixMan__

I don't think it's shocking. People who are attracted to radically different approaches are often attracted to more than one.

Sometimes it feels like crypto is the only sector left with any optimism. If they end up doing anything useful it won't be because their tech is better, but just because they believed they could.

Whether it makes more sense to be shackled to investors looking for a return or to some tokenization scheme depends on the problem you're trying to solve. Best is to dispense with either, but that's hard unless you're starting from a hefty bank account.

brookst

Why sigh? This sounds like shareholders setting corporate direction.

jancsika

Oh wow, you've got 10,000 HN points and you are asking why someone would sigh upon seeing that some technical tool has a close association with a cryptocurrency.

Even people working reputable mom-and-pops retail jobs know the reputation of retail due to very real high-pressure sales techniques (esp. at car dealerships). Those techniques are undeniably "sigh-able," and reputable retail shops spend a lot of time and energy to distinguish themselves to their potential customers and distance themselves from that ick.

Crypto also has an ick from its rich history of scams. I feel silly even explicitly writing that it has a history rich in scams, because everyone on HN knows this.

I could at least understand (though not agree) if you raised a question due to your knowledge of a specific cryptocurrency. But "Why sigh" for general crypto tie-in?

I feel compelled to quote Tim and Eric: "Do you live in a hole, or boat?"

Edit: clarification

loufe

Apart from the actual meat of the discussion, which is whether the GP's sigh is actually warranted, it's just frustrating to see everyone engage in such shallow expression. The one-word comment could charitably be interpreted as thoughtful, in the sense that a lot of readers would take the time to understand the commenter's viewpoint, but I still think it should be discouraged, as they could take some time to explain their thoughts more clearly. There shouldn't need to be a discussion on what they intended to convey.

That said, your "you're that experienced here and you didn't understand that" line really cheapens the quality of discourse here, too. It certainly doesn't live up to the HN guidelines (https://news.ycombinator.com/newsguidelines.html). You don't have to demean parent's question to deconstruct and disagree with it.

weebull

Exactly. That's why sigh.

roywiggins

yeah, but without all those pesky "securities laws" and so on.

ForTheKidz

Yes, exactly.

delusional

The nice thing about crypto plays is that you know they won't get anywhere, so you can safely ignore them. It's all going to collapse soon enough.

surferbayarea

Here are two examples from the Black Spatula project where we were able to detect major errors:

- https://github.com/The-Black-Spatula-Project/black-spatula-p...
- https://github.com/The-Black-Spatula-Project/black-spatula-p...

Some things to note: this didn't even require a complex multi-agent pipeline; single-shot prompting was able to detect these errors.
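
For flavor, a single-shot check is roughly of this shape -- a hedged sketch only, where the prompt text and model name are placeholders rather than the project's actual code (which is in the repos above):

```python
# Hypothetical single-shot error check: ask one capable model to recompute derived
# numbers and cross-check the text against tables/figures, then report suspected errors.
from openai import OpenAI

client = OpenAI()

CHECK_PROMPT = (
    "You are checking a research paper for errors. Recompute derived numbers "
    "(totals, percentages, test statistics), check that the text matches the "
    "tables and figures, and list each suspected error with the sentence it "
    "appears in. If you find none, say so."
)

def check_paper(paper_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable model
        messages=[
            {"role": "system", "content": CHECK_PROMPT},
            {"role": "user", "content": paper_text},
        ],
    )
    return response.choices[0].message.content
```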

topaz0

This is such a bad idea. Skip the first section and read the "false positives" section.

camdenreslink

Aren't false positives acceptable in this situation? I'm assuming a human (paper author, journal editor, peer reviewer, etc) is reviewing the errors these tools are identifying. If there is a 10% false positive rate, then the only cost is the wasted time of whoever needs to identify it's a false positive.

I guess this is a bad idea if these tools replace peer reviewers altogether, and papers get published if they can get past the error checker. But I haven't seen that proposed.

csa

> I'm assuming a human (paper author, journal editor, peer reviewer, etc) is reviewing the errors these tools are identifying.

This made me laugh so hard that I was almost crying.

For a specific journal, editor, or reviewer, maybe. For most journals, editors, or reviewers… I would bet money against it.

karaterobot

You'd win that bet. Most journal reviewers don't do more than check that data exists as part of the peer review process—the equivalent of typing `ls` and looking at the directory metadata. They pretty much never run their own analyses to double check the paper. When I say "pretty much never", I mean that when I interviewed reviewers and asked them if they had ever done it, none of them said yes, and when I interviewed journal editors—from significant journals—only one of them said their policy was to even ask reviewers to do it, and that it was still optional. He said he couldn't remember if anyone had ever claimed to do it during his tenure. So yeah, if you get good odds on it, take that bet!

RainyDayTmrw

That screams "moral hazard"[1] to me. See also the incident with curl and AI confabulated bug reports[2].

[1]: Maybe not in the strict original sense of the phrase. More like an incentive to misbehave and cause downstream harm to others.

[2]: https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...

xeonmc

Let me tell you about this thing called Turnitin and how it was a purely advisory screening tool…

topaz0

Note that the section with that heading also discusses several other negative features.

The only false positive rate mentioned in the article is more like 30%, the true positives in that sample were mostly trivial mistakes (as in, having no effect on the validity of the message), and that is in preprints that have not been peer reviewed, so one would expect the ratio to be much worse after peer review (the true positives would decrease, the false positives remain).

And every indication both from the rhetoric of the people developing this and from recent history is that it would almost never be applied in good faith, and instead would empower ideologically motivated bad actors to claim that facts they disapprove of are inadequately supported, or that people they disapprove of should be punished. That kind of user does not care if the "errors" are false positives or trivial.

Other comments have made good points about some of the other downsides.

nxobject

> is reviewing the errors these tools are identifying.

Unfortunately, no one has the incentives or the resources to do doubly, triply thorough fine-tooth combing: no reviewer or editor is getting paid, and tenure-track researchers who need the service-to-the-discipline check mark in their tenure portfolios also need to churn out research…

rainonmoon

People keep offering this hypothetical 10% acceptable false positive rate, but the article says it’s more like 35%. Imagine if your workplace implemented AI and it created 35% more unfruitful work for you. It might not seem like an “unqualified good” as it’s been referred to elsewhere.

afarah1

I can see its usefulness as a screening tool, though I can also see downsides similar to what maintainers face with AI vulnerability reporting. It's an imperfect tool attempting to tackle a difficult and important problem. I suppose its value will be determined by how well it's used and how well it evolves.

aeturnum

Being able to have a machine double check your work for problems that you fix or dismiss as false seems great? If the bad part is "AI knows best" - I agree with that! Properly deployed, this would be another tool in line with peer review that helps the scientific community judge the value of new work.

rs186

I don't see this as a worse idea than an AI code reviewer. If it spits out irrelevant advice and only gets 1 out of 10 points right, I consider it a win, since the cost is so low and many humans can't catch subtle issues in code.

userbinator

> since the cost is so low

As someone who has had to deal with the output of absolutely stupid "AI code reviewers", I can safely say that the cost of being flooded with useless advice is real, and I will simply ignore them unless I want a reminder of how my job will not be automated away by anyone who wants real quality. I don't care if it's right 1 in 10 times; the other 9 times are more than enough to be of negative value.

Ditto for those flooding GitHub with LLM-generated "fix" PRs.

> and many humans can't catch subtle issues in code.

That itself is a problem, but pushing the responsibility onto an unaccountable AI is not a solution. The humans are going to get even worse that way.

dartos

You’re missing the bit where humans can be held responsible and improve over time with specific feedback.

AI models only improve through training and good luck convincing any given LLM provider to improve their models for your specific use case unless you have deep pockets…

roywiggins

And people's willingness to outsource their judgement to a computer. If a computer says it, for some people, it's the end of the matter.

zulban

There's also a ton of false positives with spellcheck on scientific papers, but it's obviously a useful tool. Humans review the results.

whatever1

Just consider it an additional mean reviewer who is most likely wrong. There is still value in debunking their false claims.

LasEspuelas

Deploying this on already published work is probably a bad idea. But what is wrong with working with such tools on submission and review?

gusgus01

I'm extremely skeptical of the value in this. I've already seen wasted hours spent responding to baseless claims that are lent credence by AI "reviews" of open-source codebases. The claims would have happened before, but these text generators know how to hallucinate in the correct verbiage to convince lay people and amateurs, and they are more annoying to deal with.

lifeisstillgood

It’s a nice idea, and I would love to be able to use it for my own company reports (spotting my obvious errors before sending them to my boss's boss).

But the first thing I noticed was the two approaches highlighted: one a small-scale effort that does not publish first but approaches the authors privately, and the other one that publishes first, has no human review, and has its own cryptocurrency.

I don’t think anything speaks more clearly to the current state of the world and the choices in our political space.

lfsh

I am using JetBrains' AI to do code analysis (find errors).

While it sometimes spots something I missed, it also gives a lot of confident 'advice' that is just wrong or not useful.

Current AI tools are still sophisticated search engines. They cannot reason or think.

So while I think it could spot some errors in research papers, I am still very sceptical that it is useful as a trusted source.

noiv

I'm not a member of the scientific community, but I fear this project or another will go beyond math errors and eventually establish some kind of incontrovertible AI entity giving a go/no-go on papers, ending all science in the process, because publishers will love it.