Show HN: AI Peer Reviewer – Multiagent system for scientific manuscript analysis
91 comments
May 31, 2025 · rjakob
NOTE: We've received a bunch of submissions from you all (which is awesome — thank you!).
We're working through them and will send out reports asap!
Since we're currently covering the model costs for you, we'd appreciate any feedback via this short form in return: https://docs.google.com/forms/d/1EhQvw-HdGRqfL01jZaayoaiTWLS...
Thanks again for testing!
atrettel
I'm a PhD and researcher who has worked in various fields, including at a national lab.
I think AI systems like this could greatly help with peer review, especially as a first check before submitting a manuscript to a journal.
That said, this particular system appears to focus on the wrong issues with peer review, in my opinion. I'll ignore the fact that an AI system is not a peer since another person already brought that up [1]. Even if this kind of system were a peer, it appears to check superficial issues and not the deeper issues that many peer reviewers/referees care about. I'll also ignore any security risks (other posts discuss that too).
A previous advisor of mine said that a good peer review needs to think about one major/deep question when reviewing a manuscript: Does the manuscript present any novel theories, novel experiments, or novel simulations; or does it serve as a useful literature review?
Papers with more novelty are inherently more publishable. This system does not address this major question and focuses on superficial aspects like writing quality, as if peer review is mere distributed editing and not something deeper. It is possible for even a well-written manuscript to lack any novelty, and novelty is what makes it worthy of publication. Moreover, many manuscripts have at best superficial literature reviews that name drop important papers and often mischaracterize their importance.
It takes deep expertise in a subject to see how a work is novel and fits into the larger picture of a given field. This system does nothing to aid in that. Does it help identify what parts of a paper you should emphasize to prove its novelty? That is, does it help you find the "holes" in the field that need patching? Does it help show what parts of your literature review are lacking?
A lot of peer review is kinda performative, but if we are going to create automated systems to help with peer review, I would like them to focus on the most important task of peer review: assessing the novelty of the work.
(I will note that I have not tried out this particular system myself. I am basing my comments here on the documentation I looked at on GitHub and the information in this thread.)
yusina
I would say the exact opposite. Let the machine do the easy/simple/boring stuff. Let the human peer reviewer do the big question. That's what the human is good at and excited about and it's what the machine will not be good at. The question is a philosophical one: Is this a good idea? Is it relevant? Is it important? This is highly subjective and needs folks in the field to build consensus about. Back in my PhD days, I'd have loved if a machine could have taken care of the simple stuff so humans could focus entirely on the big questions.
(A machine could point to similar work though.)
atrettel
You raise a good point overall. I was just trying to respond to the idea of it replacing a human entirely, as if the authors submit it to the system and a journal editor has to make the decision to publish it or not. I would love to focus more on the big picture stuff, but in my experience most peer reviews amount to "Could you phrase this differently?" rather than "Is this a good idea?". I think the latter is a much better question to ask.
rjakob
Thanks for the thoughtful feedback. That’s very helpful.
We didn’t think too deeply about the term “AI peer reviewer” and didn’t mean to imply it’s equivalent to human peer review. Based on your comments, we’ll stick to using “AI reviewer” going forward.
Regarding security: there is an open-source version for those who want full control. The free cloud version is mainly for convenience and faster iteration. We don’t store manuscript files longer than necessary to generate feedback (https://www.rigorous.company/privacy), and we have no intention of using manuscripts for anything beyond testing the AI reviewer.
On novelty: totally agree, it's a core part of good peer review. The current version actually includes agents evaluating originality, contribution, impact, and significance. It's still v1, of course, but we want to improve it, and we'd love for critical thinkers like you to help shape it. If you're open to testing it with a preprint and sharing your thoughts on the feedback, that would be extremely valuable to us.
Thanks again for engaging, we really appreciate it.
atrettel
No worries, I appreciate that you took the time to read and respond!
When I first read "Originality and Contribution" at [1], I actually assumed it was a plagiarism check. It did not occur to me until now that you were referring to novelty with that. Similarly, I assumed "Impact and Significance" referred to more about whether the subject was appropriate for a given journal or not (would readers of this journal find this information significant/relevant/impactful or should it be published elsewhere?). That's a question that many journals do ask of referees, independent of overall novelty, but I see how you mean a different aspect of novelty instead.
I'm not opposed to testing your system with a manuscript of my own, but currently the one manuscript that I have approaching publication is still in the internal review stage at my national lab, and I don't know precisely when it will be ready for external submission. But I'll keep it in mind whenever it passes all of the internal checks.
[1] https://github.com/robertjakob/rigorous/blob/main/Agent1_Pee...
SubiculumCode
My view, and how I conduct my peer reviews, is that I do not care to make decisions of whether a question is important/interesting or not. I feel like my job is to judge the paper on rigor, and whether it fails (purposely or from ignorance) to address/acknowledge the relevant literature.
rjakob
We also provide feedback on rigor across 7 different categories: https://github.com/robertjakob/rigorous/tree/main/Agent1_Pee...
godelski
I'm also a PhD[0] and researcher who has worked in various fields, including national labs too (You DOE?)
I mostly share the same sentiment, and I see a similar issue with the product. The current system is not in its current poor state due to lack of reviewers, it is due to lack of quality reviewers and arbitrary notions of "good enough for this venue." So I wanted to express a difference of opinion about what peer review should be (I think you'll likely agree).
I don't think we are doing the scientific community any service by doing our current Conference/Journal based "peer review". The truth is that you cannot verify a paper by reading it. You can falsify it, but even that is difficult. The ability to determine novelty and utility is also a crapshoot, where we have a long history illustrating how bad we are at this. Several Nobel prize worthy works have been rejected multiple times due to "obviousness", "lack of novelty", and "clearly wrong." All three apply to the paper that led to the 2001 Nobel Prize in Economics[1]!
The truth of the matter is that peer review is done in the lab. It is done through replication, reproduction, and the further development of ideas. What we saw around LK-99[2] was more quality and impactful peer review than what any reader for a venue could provide. The impact existed long before any of those works were published in venues.
I think this came down to forgetting the purpose of journals. They were there when we didn't have tools like arXiv, OpenReview, or even GitHub: journals were primarily focused on solving the logistical problem of distribution. So I consider all those technical works, "preprints", and blog posts around the LK-99 replications as much of a publication as anything else. The point is that we are communicating with our peers. There has always been prestige around certain venues, but most people did not publish in them. The other venues checked for plagiarism, factual errors, and any obvious mistakes; otherwise, they proceeded with publication.
This silly notion of acceptance rates just creates a positive feedback loop which is overburdening the system (highly apparent in ML conferences). The notions of novelty and impact are highly noisy (as demonstrated in multiple NeurIPS studies and elsewhere), making the process far more random than acceptable. I don't think this is all that surprising. It is quite easy to poke flaws in any work you come across. It does not take a genius to figure out limitations of works (often they're explicitly stated!).
The result of this is obvious, and is what most researchers end up doing: resubmit elsewhere and try your luck again. Maybe the papers are improved, maybe they aren't, mostly the latter. The only thing this accomplishes is an exponentially increasing number of paper submissions and slowing down of research progress as we spend time reformatting and resubmitting which should instead be spent researching. The distribution of quality review comments seems to have high variance, but I can say that early in my PhD they mainly resulted in me making my works worse as I chased their comments rather than just trying to re-roll and move on.
In this sense, I don't think there's a "lack of reviewers" problem so much as an acceptance-threshold problem built on an arbitrary metric. I think we should check for critical errors, check for plagiarism, and then just make sure the work is properly communicated. The rest is far more open to interpretation, and not even us experts are that good at it.
[0] Well my defense is in a week...
atrettel
I worked at LANL until very recently, so yes, I was associated with the DOE.
I actually agree with your point that "the ability to determine novelty ... is a crapshoot". My point was that the AI system should at least try to provide some sense of how novel the content is (and what parts are more novel than others, etc.). This is important for other review processes like patent examination and is certainly very important for journal editors when determining whether a manuscript is "worthy" of publication. For these reasons, I personally have a low bar as to what qualifies as "novel" in my own reviews.
Most of my advisors in graduate school were also journal editors, and they instilled in me a focus on novelty during peer reviews because that is what they cared about most when making a decision about a manuscript. Editors focus on novelty because journal space is a scarce resource. You see the same issue in the news in general [1]. This is one of the reasons why I have a low bar when evaluating novelty: a study can be well done and cover new ground without having an unambiguous conclusion or "story being told" (which is something editors might want).
I originally discussed this briefly in my post but edited it out immediately after posting this. I'll post it again but add more detail. I think that a lot of peer review as practiced today is theater. It doesn't really serve any purpose other than providing some semblance of oversight and review. I agree with your point about the journal/conference being the wrong place to do peer review. It is too late to change things by then. The right time is "in the lab", as you say.
I wholeheartedly agree that reproduction/replication is the standard that we should seek to achieve but rarely ever do. Perhaps the only "original" ideas that I have had in my career came from trying to replicate what other people did and finding out something during that process.
godelski
Nice, I never went to LANL but have a few friends in HPC over there.
You're right, it is theater. But a lot of people think it isn't...
I think it is important to be explicit in why novelty is a crapshoot.
Novelty depends on:
- how well you read the work
  - High-level reading means you will think x is actually y
- how well read you are
  - If you're too well read, every x is just y
  - If you're not well read, everything is novel
- how clear the writing was
  - If it is too clear, it is obvious, therefore not novel
If any process encourages us to be less clear in writing, we should reject it. I've seen a lot of this happening more and more, and it is terrible for science. You shouldn't have to mask your contributions, oversell, or hide related work. Everything becomes "incremental", and "novelty" ends up being a measurement of the reader's ego.
What I've seen is that the old guard lost sight of what was important: communicating. I don't think anyone is malicious here or even had bad intentions. In fact, I think everyone had and still has good intentions. But good intentions don't create good outcomes. They're slowly boiled frogs, with steadily increasing dependence on metrics. They can look back and say "it worked for me", which blinds them to how things have changed.
> I agree with your point about the journal/conference being the wrong place to do peer review. It is too late to change things by then. The right time is "in the lab", as you say.
I disagree a bit (again, I think you'll agree lol). You're right that some of it should be happening in the lab. But there is a hierarchy: the next level is outside the lab, then outside research. Peer review is an ongoing process that never stops. To define it as 3-4 people quickly reading a paper is just laughable. They have every incentive to reject a work: no one questions you when you reject, but they do when you accept. Acceptance rates sure don't help, and this is the weirdest metric to define "impact" by. I don't even know how one could claim that rejection rate correlates with scientific impact. Maybe only through the confounding variable of prestige and that it is what people target? But then arXiv should have the highest impact lol.
> Perhaps the only "original" ideas that I have had in my career came from trying to replicate what other people did and finding out something during that process.
Same! I don't think it is a coincidence either. Science requires us to be a bit antiauthoritarian. "Trust, but verify" is a powerful tool. We need to verify in different environments, with methods that should be similar, and all that. Finding those little holes is critical. At worst, replication makes you come up with ideas, at least if you keep asking "why did they do this?" or "why does that happen?"
I think in a process where we're pushed to publish quickly we do not take the time to chase these rabbit holes. Far too often there's a wealth of information down them. But I'm definitely also biased from my poor experience in grad school lol.
hirenj
It is a real shame that peer review reports only started being published relatively recently. These would have provided valuable training information about what peer review actually does. Unfortunately, I now fully expect that public peer review reports will be poorer in quality, and oftentimes superficial.
On this tool, I fully expect that it will not capture high level conceptual peer review, but could very much serve a role in identifying errors of omission from a manuscript as a checklist to improve quality (as long as this remains an author controlled process).
I will be interested to throw in some of my own published papers to see if it catches all the things I know I would have liked to improve in my papers.
rjakob
Thanks for the feedback. Totally agree. It’s a real shame we don’t have more historical peer review data. It would be great if research was fully transparent.
We did find a few datasets that offer a starting point:
https://arxiv.org/abs/2212.04972
https://arxiv.org/abs/2211.06651
https://arxiv.org/abs/1804.09635
There’s also interesting potential in comparing preprints to their final published versions to reverse-engineer the kinds of changes peer review typically drives.
A growing number of journals and publishers, like PLOS, Nature Communications, and BMJ, now publish peer review reports openly, which could be valuable as training data.
That said, while this kind of data might help generate feedback to improve publication odds (by surfacing common reviewer demands early), I am not fully convinced it would lead to the best feedback. In our experience, reviewer comments can be inconsistent or even unreasonable, yet authors often comply anyway to get past the gate.
We're also working on a pre-submission screening tool that checks whether a manuscript meets hard requirements like formatting or scope for specific journals and conferences, hoping this will save a lot of time.
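As a rough illustration of what we mean by hard requirements (a hypothetical sketch; the rule names and limits below are made up, and the real ones would come from each venue's author guidelines):

    import re

    # Made-up examples of hard, mechanically checkable requirements.
    REQUIREMENTS = {
        "has_abstract": lambda text: "abstract" in text.lower(),
        "word_count_under_8000": lambda text: len(text.split()) <= 8000,
        "numbered_references": lambda text: bool(re.search(r"\[\d+\]", text)),
    }

    def screen(manuscript_text: str) -> dict[str, bool]:
        # Run every hard requirement and report pass/fail before submission.
        return {name: check(manuscript_text) for name, check in REQUIREMENTS.items()}

    if __name__ == "__main__":
        with open("manuscript.txt") as f:
            report = screen(f.read())
        for requirement, passed in report.items():
            print(f"{'PASS' if passed else 'FAIL'}: {requirement}")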
Would love to hear your take on what kind of feedback you find useful, what feels like nonsense, and what you would want in an ideal review report... via this questionnaire https://docs.google.com/forms/d/1EhQvw-HdGRqfL01jZaayoaiTWLS...
isoprophlex
Submitting your original, important, unpublished research to some random entity? I would be VERY surprised if more than 2% of academics think this is a good idea.
rjakob
I wish my own manuscripts were that important...
Regarding security concerns: there is an open-source version for those who want full control. The free cloud version is mainly for convenience and faster iteration. We don’t store manuscript files longer than necessary to generate feedback (https://www.rigorous.company/privacy), and we have no intention of using manuscripts for anything beyond testing the AI reviewer.
karencarits
I guess the paper would be complete enough to publish as a preprint at the stage where this specific service is most useful
eterm
If an AI agent is a "Peer", does that mean you want papers written by AI agents to review?
spankibalt
The AI pseudoagent[1], unless sentient and proficient in the chosen field of expertise, is not a peer. It's just a simulacrum of one. As such, it can only manifest simulacra of concepts such as "biases", "fairness", "accountability", etc.
The way I see it, it can function, at best, as a tool for text analysis, e.g. as part of augmented analytics engines in a CAQDAS.
1. Agents are defined as having agency, with sentience as an obvious prerequisite.
nathan_compton
Do you really think sentience is a prerequisite for agency? That doesn't seem to follow. Plants are agents in a fairly meaningful sense, and yet I doubt they are sentient. I mean, AIs accept information and can be made to make decisions and act in the world. That seems like a very reasonable definition of agency to me.
rjakob
We honestly didn’t think much about the term “AI peer reviewer” and didn’t mean to imply it’s equivalent to human peer review. We’ll stick to using “AI reviewer” going forward.
brookst
Does peer review typically select for the same demographics as the author?
I was joking, but probably so, to the extent that 80% of peer reviewers are men and 80% of authors of peer-reviewed articles are men [0].
0. https://www.jhsgo.org/article/S2589-5141%2820%2930046-3/full...
rjakob
I think the ideal scenario would include a fair, merit-based AI reviewer working alongside human experts. AI could also act as a "reviewer of the reviewers," flagging potential flaws or biases in human evaluations to help ensure fairness, accountability and consistency.
yusina
> a fair, merit-based AI reviewer
That's a dream which is unlikely to come true.
One reason being that the training data is not unbiased and it's very hard to make it less biased, let alone unbiased.
The other issue being that the AI companies behind the models are not interested in this. You can see the Grok saga playing out in plain sight, but the competitors are not much better. They patch a few things over, but don't solve it at the root. And they don't have incentives to do that.
Muller20
Peer review has nothing to do with demographics. It's about expertise in the research area.
yusina
You don't have much experience with it, do you? The real-world peer review process could not be further from what you are describing.
Source: I've personally been involved in peer reviewing in fields as diverse as computer science, quantum physics and applied animal biology. I've recently left those fields in part because of how terrible some of the real-world practices are.
brookst
And of course anyone with sufficient knowledge in a domain is freed from human foibles like racism, sexism, etc. Just dry expertise, applied without bias or agenda.
falcor84
As they say: in theory, theory and practice are identical; in practice, they aren't.
etrautmann
Russ Poldrack just did a deep dive on reviewing papers with AI, finding serious issues with the results: https://substack.com/home/post/p-164416431
rjakob
Thanks for sharing. We’ll take a closer look. There’s definitely something we can learn from it.
davidcbc
This seems to just be a wrapper around a bunch of LLM prompts. What value is being added in the (eventual) pay version?
As a free GitHub project it seems... I don't know; it's not peer review and shouldn't be advertised as such, but as a basic review I guess it's fine. Why would someone pay you for a handful of LLM prompts, though?
If your business can be completely replicated by leaked system prompts I think you're going to have issues
rjakob
System prompts / review criteria cannot be "leaked" because they are open-source (full transparency). Focusing heavily on monetization at this stage seems shortsighted... this tool is a small (but long-term important) step in a larger plan.
8organicbits
One to two days is certainly better than eight months, but I'm curious about that delay. Can you explain why working days factor into the turnaround time?
rjakob
Right now, the core workflow takes about 8 minutes locally, mostly because we haven't optimized anything yet. There's plenty of low-hanging fruit: parallelization, smarter batching, etc. With a bit of tuning, we could bring it down to 1–2 minutes. That said, we'll probably want to add more agents for deeper feedback and quality control, so the final runtime might go up again.
At this stage, we're figuring out what's actually useful, what isn't, what additional signals we should look at, and what the right format and level of detail for feedback should be. The cloud version includes a manual review step, partly to control costs, partly because we're still discovering edge cases. So the current 1–2 day turnaround is really just a safety net while we iterate. If we decide to continue with this, the goal is to deliver meaningful feedback in minutes.
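For the curious, the parallelization point is roughly the following (a simplified sketch, not our actual code; the agent names, prompts, and model choice are placeholders): run the criterion-specific agents concurrently instead of one after another, so wall-clock time approaches that of the slowest single agent.

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    # Placeholder agents: each one is a criterion-specific prompt plus a model call.
    AGENT_PROMPTS = {
        "methodology": "Review the methodology of this manuscript for rigor.",
        "clarity": "Review the writing quality and clarity of this manuscript.",
        "novelty": "Assess the originality and contribution of this manuscript.",
    }

    async def run_agent(name: str, system_prompt: str, manuscript: str) -> tuple[str, str]:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": manuscript},
            ],
        )
        return name, response.choices[0].message.content

    async def review(manuscript: str) -> dict[str, str]:
        # Sequential calls dominate the runtime; gathering them runs all agents concurrently.
        results = await asyncio.gather(
            *(run_agent(n, p, manuscript) for n, p in AGENT_PROMPTS.items())
        )
        return dict(results)

    if __name__ == "__main__":
        feedback = asyncio.run(review("...manuscript text..."))
        for section, comments in feedback.items():
            print(f"## {section}\n{comments}\n")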
anticensor
That eight-month delay is caused by backlog; a human reviewing an individual paper takes about a day.
deepdarkforest
For easy/format stuff for specific journals it will be useful. But please, please, for the love of god, don't try to give actual feedback. We have enough GPT-generated reviews on OpenReview as it is. The point of reviews is to get deep insights/comments from industry experts who have knowledge ABOVE the LLMs. The bar is very low, I know, but we should do better as the research community.
rjakob
Totally agree! The tool is designed to provide early-stage feedback, so experts can focus their attention on the most relevant points at later review stages.
yusina
Please convince us that these are not just words. As much as I'd want to believe you, the sweet VC money is in the feedback that many people here advise you against. It will be hard to stay away from it.
howon92
This is a great idea. Can you share more about what "24 specialized agents" means in this context? I assume each agent is not simply an LLM with a specific prompt (e.g. "You're the world's best biologist. Review this biology research paper.") but is a lot more sophisticated. I am trying to learn how sophisticated it is.
rjakob
Here is a description of how it works: https://github.com/robertjakob/rigorous/tree/main/Agent1_Pee...
eddythompson80
I was talking with a friend/coworker this week. I came to the realization "Code Reviews Are Dead".
They were already on life support. The need to "move fast", "there is no time", "we have a 79-file PR with 7k line changes that we have been working on for 6 weeks. Can you please review it quickly? We wanna demo at tomorrow's GTM meeting". Management found zero value in code reviews. You still can't catch everything, so what's the point? They can't measure the value of such a process.
Now? Now every junior dev is pushing 12 PRs a week, each adding 37 new files and thousands of auto-generated lines with patterns and themes that are all over the place, and you expect anyone to keep up?
Just merge it. I have seen people go from:
> “asking who is best to review changes in area X? I have a couple of questions to make sure I’m doing things right”
To
> “this seems to work fine. Can I get a quick review? Trying to push it out and see how it works”
To
> “need 2 required approvals on this PR please?”
yusina
> I came to the realization “Code Reviews Are Dead”.
If that's how it works at your company then run as fast as you can. There are many reasonable alternatives that won't push this AI-generated BS on you.
That is, if you care. If you don't then please stay where you are so reasonable places don't need to fight in-house pressure to move in that direction.
mikojan
Oh my god.. The horror.. Please do not let this be my future..
eddythompson80
The horror indeed, but I don't really see a way out of this. I was mainly curious to see how it would affect something like "Peer Review", though I suspect the incentives there are different, so the processes might only share the word "Review" without much bearing on each other.
Regarding code reviews, I can't see a way out, unfortunately. We already have GitHub (and other) agents/features where you write an issue on a repo and kick off an agent to "implement it and send a PR for the repo". As it exists today, every repo has 100X more issues, discussions, and comments than it has PRs. Now imagine if the barrier to opening a PR is basically: open an issue + click a "Have a go at it, GitHub" button. Who has the time or bandwidth to review that? That wouldn't make any sense either.
rjakob
Based on my experience, many reviewers are already using AI extensively. I recently ran reviewer feedback from a top CS conference through an AI detector, and two out of three responses were clearly flagged as AI-generated.
In my view, the peer-review process is flawed. Reviewers have little incentive to engage meaningfully. There’s no financial compensation, and often no way to even get credit for it. It would be cool to have something like a Google Scholar page for reviewers to showcase their contributions and signal expertise.
rjakob
wild times
karencarits
I'll hopefully get to test it soon. To me, LLMs have so far been great for proofreading and getting suggestions for alternative - perhaps more fluent - phrasings. One thing that immediately struck me, though: having 'company' in the URL makes me think corporate and made me much more skeptical than a more generic name would.
yusina
IMO that's what this should focus on. Language. That's what LLMs excel at. Perhaps branch out to providing localized papers for markets like China or France (hah, sorry).
Judging the actual contents may feel like the holy grail but is unlikely to be taken well by the actual academic research community. At least the part that cares about progressing human knowledge instead of performative paper milling.
rjakob
Haha, fair point. The domain name was a 5-second, "what's available for $6" kind of decision. Definitely not trying to go full corporate just yet.
karencarits
Great! Also, checking journal author guidelines is usually very boring and time-consuming, so that would be a nice addition! Like pasting the guidelines in full and getting notified if I'm not following some specs.
rjakob
We are already looking into that: https://github.com/robertjakob/rigorous/tree/main/Agent2_Out...
Would be great to see contributions from the community!
After waiting 8 months for a journal response or two months for co-author feedback that consisted of "looks good" and a single comma change, we built an AI-powered peer review system that helps researchers improve their manuscripts rapidly before submission.
The system uses multiple specialized agents to analyze different aspects of scientific papers, from methodology to writing quality.
Key features:
- 24 specialized agents analyzing sections, scientific rigor, and writing quality
- Detailed feedback with actionable recommendations
- PDF report generation
- Support for custom review criteria and target journals
Two ways to use it:
1. Cloud version (free during testing): https://www.rigorous.company
- Upload your manuscript
- Get a comprehensive PDF report within 1–2 working days
- No setup required
2. Self-hosted version (GitHub): https://github.com/robertjakob/rigorous
- Use your own OpenAI API keys
- Full control over the review process
- Customize agents and criteria
- MIT licensed
The system is particularly useful for researchers preparing manuscripts before submission to co-authors or target journals.
Would love to get feedback from the HN community, especially from PhDs and researchers across all academic fields. The project is open source and we welcome contributions!
GitHub: https://github.com/robertjakob/rigorous
Cloud version: https://www.rigorous.company