Introduction

Consider this scenario. A researcher, under pressure and not particularly scrupulous, uses a language model to draft a manuscript from scratch. A reviewer, overwhelmed by the fifth review request of the month, feeds the manuscript to the same type of model and submits the generated critique. The editor is pleased with the turnaround time. The article is accepted, enters the literature, gets read, gets cited, and eventually informs clinical decisions or funding allocations. Yet at no point did any human genuinely engage with the scientific content.

Does that seem far-fetched? Recent data suggest it is already happening.

On the writing side, large-scale analyses of PubMed abstracts suggest that at least 13.5% of biomedical abstracts published in 2024 carry detectable traces of LLM processing (Kobak et al., 2025).

On the reviewing side, an analysis of the International Conference on Learning Representations (ICLR) found that approximately 21% of reviews submitted for ICLR 2026 were entirely AI-generated, and more than half showed some degree of AI involvement (Naddaf, 2025b). A survey of 1,600 researchers reports that more than 50% have used AI tools during peer review (Naddaf, 2026).

What makes this situation particularly concerning is not that AI is being used. It is that both sides of the quality control mechanism of scientific production can be automated simultaneously, with no framework in place to assess the consequences.

Note

This article references recent publications, some of which have not yet undergone peer review. It should be read primarily as an invitation to debate and as an ethical and epistemological reflection.

Key point

The risk is not that AI is entering scientific publishing — it already has. The risk is that a closed loop of writing → evaluation → bibliography → writing becomes entrenched, with no human genuinely engaging with the science at any step. And that no one notices.

On the writing side: how much of the literature is already AI-assisted?

The question is no longer whether researchers use LLMs for scientific writing. It is how many, in what ways, and to what extent.

The lexical signature

Kobak et al. (2025, Science Advances) took an instructive approach. Rather than attempting to detect AI-generated text at the level of individual articles — an unreliable task — they tracked vocabulary shifts across more than 15 million PubMed abstracts between 2010 and 2024. Starting in 2023, certain stylistic choices showed a sudden and unprecedented increase in frequency. Not content words tied to a new research topic (as happened with COVID terminology in 2020), but style words: verbs and adjectives that LLMs systematically prefer over their synonyms.

These words existed in scientific writing before 2023, of course, but at stable and low frequencies. Their simultaneous and abrupt rise across all biomedical subfields in 2023–2024 is without precedent in the history of PubMed. Even the COVID surge, with its flood of articles, produced no comparable lexical shock.

By quantifying the excess frequency of these markers, the authors estimated that at least 13.5% of 2024 abstracts were processed by an LLM. This figure varied substantially across subsets of the literature, reaching 40% among certain publishers, countries, and subdisciplines. MDPI and Frontiers journals showed the highest rates.
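To make the logic of that estimate concrete, here is a minimal sketch of an excess-vocabulary calculation in Python. The marker words, the baseline rule, and the data structure are illustrative assumptions, not the word list or statistical model actually used by Kobak et al.

```python
# Illustrative LLM "style word" markers; the real study derives hundreds of
# excess words from the corpus itself rather than fixing them in advance.
MARKERS = {"delves", "pivotal", "intricate", "showcasing", "underscores", "notably"}

def marker_fraction(abstracts):
    """Fraction of abstracts containing at least one marker word."""
    hits = sum(1 for text in abstracts if MARKERS & set(text.lower().split()))
    return hits / len(abstracts)

def excess_by_year(abstracts_by_year, llm_year=2023):
    """Crude lower bound on the share of LLM-processed abstracts per year.

    Marker frequency above the highest pre-LLM baseline counts as excess;
    LLM edits that avoid marker words stay invisible, hence "lower bound".
    """
    baseline = max(
        marker_fraction(texts)
        for year, texts in abstracts_by_year.items() if year < llm_year
    )
    return {
        year: max(0.0, marker_fraction(texts) - baseline)
        for year, texts in abstracts_by_year.items() if year >= llm_year
    }

# abstracts_by_year = {2019: [...], ..., 2024: [...]}  # lists of abstract strings
# print(excess_by_year(abstracts_by_year))
```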

A parallel analysis by Liang et al. (2025) across nearly one million articles from arXiv, bioRxiv, and the Nature portfolio confirmed the trend, with the fastest growth observed in computer science papers.

Using an LLM as an aid. Is that cheating?

These estimates only capture the detectable end of the spectrum. A researcher who uses an LLM to restructure a paragraph, smooth transitions, or rephrase a sentence will leave fewer traces than one who generates entire sections. The boundary between "linguistic editing" and "content generation" is blurry in practice.

This is also what makes policy enforcement so difficult. The NIH took a strong position in 2025: grant applications "substantially developed by AI" will not be considered, and the agency now deploys AI detection tools to identify them (NIH, NOT-OD-25-132). But what does "substantially developed" mean? If a postdoc writes an outline, submits it to Claude, edits the output, and submits the final version — where does the human contribution end and the AI contribution begin?

At the moment, there is no reliable way to draw that line, and the NIH does not have one either.

It would also be counterproductive to ban all LLM use outright. Scientific writing relies on a lingua franca: a standardized, highly conventionalized form of English. A few decades ago, articles were shorter, and their prose was more descriptive and less argumentative than today's. A paper no longer merely reports findings; it must persuade. And persuasion requires conforming to a set of codes. For non-anglophone researchers, who are at a disadvantage in a publishing system dominated by English, LLM-assisted writing is a genuine equity tool for mastering those codes. This is a fully legitimate use that does not degrade scientific quality; if anything, it improves it.

On the reviewing side: the mirror problem

Inappropriate LLM use on the writing side is a concern. On the reviewing side, it is a vulnerability.

The numbers

In early 2025, ecologist Timothée Poisot received a review containing the phrase: "Here is a revised version of your review with improved clarity and structure" — a characteristic artifact of a reviewer who had pasted an LLM output directly without reading it (Naddaf, 2025a). This anecdote, reported in Nature, is merely the visible tip of what appears to be a much larger phenomenon.

Latona et al. (2024) analyzed the 28,028 reviews submitted to ICLR 2024. Using GPTZero, they estimated that at least 15.8% were AI-assisted — approximately 4,400 reviews for a single conference in a single year.

For ICLR 2026, the situation had deteriorated further. Pangram Labs analyzed 19,490 submissions and 75,800 reviews. The result: approximately 21% of reviews were entirely AI-generated, and more than half showed some degree of AI involvement (Naddaf, 2025b). The ICLR program chairs publicly acknowledged the problem and committed to addressing it, while noting that AI detection tools alone cannot support editorial decisions because of false positive rates.

More recently, Shen & Wang (2026) trained a detection model on historical review data and applied it to ICLR and Nature Communications reviews. Results: approximately 20% of ICLR reviews and 12% of Nature Communications reviews were classified as AI-generated in 2025, with the sharpest increase in late 2024.

A Frontiers survey of approximately 1,600 academics across 111 countries, reported in Nature, found that more than 50% of respondents had used AI tools during peer review (Naddaf, 2026). A separate survey of over 5,000 researchers (also reported by Naddaf, 2026) found that only 5% considered it "appropriate" for a reviewer to use AI output as the basis of a review without disclosure, while 52% considered such use "inappropriate under any circumstances."

There is, then, a substantial gap between practice and declared norms.

The bias problem

AI-assisted reviews are not merely a transparency issue. They can introduce bias. Latona et al. (2024) showed that AI-assisted reviews at ICLR 2024 were associated with significantly higher recommendation scores than human reviews assigned to the same papers.

This matters. If AI reviews systematically inflate scores, and if they are disproportionately assigned to borderline papers, the net effect will be a lowering of the quality threshold for publication. The peer review system does not simply filter papers in or out: it shapes the probability distribution of what gets published.
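A back-of-the-envelope simulation makes the mechanism visible. The score scale, the size of the inflation, and the acceptance threshold below are invented for illustration; they are not parameters taken from the ICLR data.

```python
import random

def acceptance_rates(n_papers=10_000, n_reviews=3, ai_share=0.2,
                     inflation=1.0, threshold=6.0, seed=0):
    """Toy model: each paper has a latent quality, each review is a noisy
    reading of it, and a fraction of reviews gets a systematic score bump.
    Returns acceptance rates without and with the bump."""
    rng = random.Random(seed)
    accepted_plain = accepted_bumped = 0
    for _ in range(n_papers):
        quality = rng.gauss(5.0, 1.5)
        scores = [quality + rng.gauss(0.0, 1.0) for _ in range(n_reviews)]
        bumped = [s + inflation if rng.random() < ai_share else s for s in scores]
        accepted_plain += sum(scores) / n_reviews >= threshold
        accepted_bumped += sum(bumped) / n_reviews >= threshold
    return accepted_plain / n_papers, accepted_bumped / n_papers

# The second rate is consistently higher, and the extra acceptances are
# concentrated among papers sitting just below the threshold.
print(acceptance_rates())
```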

The complete loop: when AI evaluates AI

In March 2025, Sakana AI announced that a paper entirely generated by its AI Scientist-v2 system had passed the peer review process of an ICLR 2025 workshop (Lu et al., 2026). The system autonomously generated hypotheses, wrote code, ran experiments, analyzed results, and produced a manuscript titled "Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization." The paper received scores of 6, 7, and 6, placing it above the acceptance threshold and in the top 45% of submissions.

The experiment was conducted transparently, with ICLR organizer cooperation and IRB approval. Sakana immediately withdrew the paper after acceptance. The full body of work (AI Scientist v1 and v2) is now published in Nature (Lu et al., 2026).

Several caveats apply. The paper was accepted to a workshop track, which has higher acceptance rates than the main conference. None of the Sakana AI-generated papers crossed the team's internal bar for main track submission. The AI system made citation errors that a human researcher would likely not have made. And workshop acceptance is not equivalent to high-impact publication.

But the point stands: a fully AI-generated paper, evaluated under standard double-blind conditions, was judged acceptable by human reviewers.

Now combine this with the ICLR data showing that 21% of reviews at the same conference were themselves AI-generated. The uncomfortable question becomes: in how many cases was an AI-written submission evaluated by an AI-generated review, with neither authors nor reviewers disclosing LLM involvement?

Nobody knows. And currently, nobody can know.

What AI cannot evaluate — and why this is critical in biology

The stakes of this debate differ across disciplines. In machine learning, where most detection research has been conducted, the consequences of a failed review are primarily academic: a weak paper enters the literature. In biomedical research, the consequences can eventually become clinical.

Contextualized biological judgment

Peer review in biology is not primarily about statistical verification or editorial clarity. It is about asking: does this result make biological sense?

Can this cell type plausibly express this gene at this level? Does this differential expression reflect biology, or is it a tissue dissociation artifact? Does the reported protein-mRNA discordance point to genuine post-transcriptional regulation, or does it reflect a batch effect between transcriptomic and proteomic datasets?

These questions require deep background knowledge: expertise accumulated through years of experimental work, familiarity with the specific biological system under study, and the ability to recognize when data "doesn't add up" even when statistical tests pass. This is precisely the type of evaluation for which LLMs are least suited at the time of writing.

An LLM can summarize a paper adequately. It can check whether standard statistical tests were applied. It can flag missing references. But it cannot tell you that the stress response signature in your scRNA-seq tumor dataset is more likely a dissociation artifact than a genuine tumor phenotype. It cannot warn you that your "multi-omics" Venn diagram masks a discordance that invalidates your conclusion. These are judgments that depend on domain expertise that no language model currently possesses.

Hallucinations in high-stakes domains

LLMs hallucinate. This is a well-characterized limitation, but its consequences vary by domain. In a machine learning paper, a hallucinated citation is at worst embarrassing. In a clinical genomics paper, a hallucinated gene-disease association could propagate through the literature and eventually influence diagnostic decisions.

Lu et al. (2026) note that AI Scientist-v2 made citation errors, including attributing a method to the wrong paper. In a field where the distinction between a pathogenic variant and a benign polymorphism often rests on a chain of bibliographic evidence, this type of error could carry significant consequences.

The problem of correlated errors

The most underestimated risk of AI-by-AI evaluation may be the problem of correlated errors. Human reviewers have diverse biases: different training backgrounds, different reading habits, different blind spots. This diversity is not a flaw — it is a feature. The probability that two independent human reviewers both miss a major flaw in a paper is relatively low.

The probability that two LLMs miss the same problem could be considerably higher. They share training data, architectural biases, and similar failure modes. A paper generated by AI that avoids the known weaknesses of LLMs is precisely the type of paper that an AI-generated review is most likely not to criticize. Errors are correlated, not independent. The safety margin that independent review is supposed to provide collapses.
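A toy calculation, with invented miss probabilities, shows how quickly that safety margin erodes once the two judgments are correlated.

```python
from math import sqrt

def p_both_miss(p1, p2, rho=0.0):
    """Probability that two reviewers both miss the same flaw, modelling the
    two miss events as correlated Bernoulli variables:
    P(both) = p1*p2 + rho * sqrt(p1*(1-p1) * p2*(1-p2))."""
    return p1 * p2 + rho * sqrt(p1 * (1 - p1) * p2 * (1 - p2))

# Two independent human experts, each missing a given flaw 30% of the time:
print(p_both_miss(0.3, 0.3, rho=0.0))   # 0.09
# Two LLM-based reviews sharing training data and blind spots:
print(p_both_miss(0.3, 0.3, rho=0.7))   # about 0.24: the margin largely vanishes
```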

Key point

The value of peer review rests on the independence and diversity of expert judgment. AI-by-AI evaluation undermines both: judgments are neither independent (correlated training data) nor expert (no domain-specific experiential knowledge).

Why this is happening: the incentive structure

It would be tempting to frame excessive AI-assisted reviewing as laziness or fraud. That would miss the point.

The peer review system has been under strain for decades. Submission volumes to journals, major conferences, and funding calls have increased year after year. Review deadlines are tight, and the work is unpaid and professionally unrewarded. The workload has grown; the incentives have not.

It is also fair to note that human peer review was producing poor evaluations long before LLMs existed. Three-paragraph reviews without any re-analysis of the data, contradictory critiques from reviewers of the same paper, six-month delays for superficial feedback: all of this existed well before ChatGPT. The reproducibility crisis, which unfolded despite peer review, is arguably a symptom of the same underlying dysfunction. AI is not so much degrading a system that worked well as flooding into the cracks of a system already under pressure.

When a reviewer is asked to evaluate their fifth paper of the week, without compensation and without direct career benefit, the temptation to submit the manuscript to an LLM and edit the output is not surprising. The same logic applies to grant writing: when a PI submits six applications per cycle, each requiring 30 pages of persuasive scientific prose, using LLMs is a rational response to an irrational system.

The NIH implicitly acknowledged this by capping applications at six per PI per calendar year starting in 2025, a measure adopted after observing that some investigators were submitting unusually high numbers of applications, possibly AI-assisted (NIH, NOT-OD-25-132). The cap addresses the symptom. The underlying condition — a funding ecosystem that appears to reward volume over substance — remains.

Worth reflecting on

Before condemning AI-assisted reviewing, we should ask whether a system that demands five reviews in two weeks, for free, with no professional recognition, was ever designed to produce genuine engagement with science.

What to do?

Before discussing solutions, it is worth acknowledging what AI genuinely contributes to the scientific process. The study by Liang et al. (2024, NEJM AI) showed that 57% of surveyed researchers found GPT-4-generated feedback useful, and 82% judged it more relevant than feedback from some human reviewers. An LLM that identifies missing references, flags a statistical inconsistency, or catches a poorly labeled figure provides a genuine service — provided it serves as a starting point for the reviewer rather than a substitute for their judgment. The question is therefore not "should AI be banned from scientific publishing?" It is "where is the line between assistance and delegation?"

There is no definitive answer. But there are better and worse ones.

Detection has limits

To draw a line, you first need reliable quantification. AI text detection tools exist and are improving. Pangram Labs, GPTZero, and academic classifiers have been applied to large corpora with useful results at the population level. But at the level of an individual review, detection remains unreliable. False positive rates are not zero, and accusing a reviewer of AI fraud on the basis of a probabilistic classifier carries real reputational risk.
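A short Bayes calculation illustrates the point. The detector characteristics below are assumed numbers for illustration, not the published performance of Pangram, GPTZero, or any other tool.

```python
def p_flag_is_correct(prevalence, sensitivity, false_positive_rate):
    """Posterior probability that a flagged review really is AI-generated,
    by Bayes' rule over the assumed detector characteristics."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# 20% of reviews AI-generated, 90% of those detected, 5% false positive rate:
print(p_flag_is_correct(0.20, 0.90, 0.05))   # about 0.82
# Nearly one flag in five would fall on a review a human actually wrote.
```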

Detection should be used to monitor trends, not to enforce individual sanctions.

For this article, I chose to use an LLM for spell-checking, syntactic correction, and making certain phrases more idiomatic. Interestingly, both the LLM-edited version and the original unedited version were classified by detectors as partially AI-assisted. Part of this is the reliability issue noted above. But there may be another explanation. Has scientific writing become too formulaic to be distinguished from AI output? Or have we already internalized LLM language patterns and begun, without realizing it, to replicate them? If so, the detection question itself may lose all meaning.

Rethinking the reviewing model

The fundamental problem is not AI: it is the mismatch between the volume of scientific production and the available reviewing capacity. Solutions that address this mismatch directly are more likely to work than those that attempt aggressive regulation of AI use.

Compensating reviewers, financially or through formal career recognition. Reducing submission volumes through caps (as the NIH has done) or pre-screening. Separating verification tasks from evaluative tasks: statistical checks, reproducibility audits, and reference verification are tasks where AI assistance could be legitimate and valuable — provided they are clearly distinguished from the evaluative judgment that remains human.

Defining what AI review can and cannot be

Not all uses of AI in reviewing are equivalent. There is a meaningful difference between using an LLM to check whether a paper cites the relevant prior work (potentially useful), using it to verify statistical claims against reported data (potentially useful, with limits), using it to generate a complete evaluative critique of a manuscript (problematic), and submitting an LLM-generated review as one's own expert judgment (unacceptable).

The field needs a framework of legitimate and illegitimate uses, developed through community consensus rather than through binding decisions that cannot be enforced.

The sensitive question: should authors adapt their writing to this new paradigm?

Researchers have always adapted how they present their results to shifts in editorial policies and reviewer expectations. It has always been part of the job. The question now is whether the form of a manuscript should account for the fact that part of the evaluation will likely be performed by an LLM.

This is a question our team now addresses systematically when assisting with article writing and grant preparation. We work to make authors aware of the new opportunities that LLM prompt engineering creates for adapting writing to emerging modes of evaluation.

Concretely, this could mean: structuring hypotheses in explicitly numbered form to facilitate automated verification. Isolating limitations in a dedicated section rather than dispersing them through the discussion — an LLM will more easily miss a limitation buried in an argumentative paragraph than a human reviewer familiar with the domain. Ensuring that statistical claims are formulated in a line-by-line verifiable way, since this is precisely the type of task where an LLM-reviewer would perform best. Making the logical chain from data to results to conclusions as explicit as possible, since an LLM does not "read between the lines."

Conclusion

The parallel with the subjects covered in our previous articles is direct. In batch effect correction, the danger is not that the algorithm fails — it is that it succeeds, producing a clean UMAP that conceals a confounded design. In "multi-omics" integration, the danger is not that the Venn diagram is wrong — it is that it creates the illusion of integration where none occurred.

The same logic applies here. The danger of AI in scientific publishing is not that it produces manifestly bad papers or manifestly deficient reviews. It is that it produces plausible papers and plausible reviews — fluent, well-structured, correctly formatted, but entirely superficial. The output will look like science. It will pass the filters designed to stop bad science. And it will enter the literature as though a human expert had evaluated it.

Biomedical papers already show traces of LLM processing, and a fully AI-generated article has passed peer review.

None of this means AI has no place in scientific publishing. Used transparently, for tasks it can genuinely perform — literature search, statistical checking, linguistic editing — it could meaningfully improve the process. But this requires a clear understanding of what AI can and cannot do. The danger lies in moving toward a massive delegation of the cognitive work that expert humans have always done.

Scientific publishing is, fundamentally, a system of trust. Authors trust reviewers to engage seriously with their work. Reviewers trust authors to have actually done the work they describe. Readers trust the process to have quality-checked that work. When both sides of this exchange are fully automated, the scientific enterprise becomes an empty process.

The question going forward is not whether AI will be part of scientific publishing. It already is. The question is whether the community will build its own framework for using it responsibly — or whether we will slide toward a system where machines speak to machines, and everyone pretends a human was listening.

Take-home message

AI is entering scientific publishing — including on the reviewing side. The risk is not that it fails visibly, but that it succeeds superficially: producing a plausible text that passes a plausible review, with no human genuinely engaging with the science. A clean review, like a clean UMAP, is not a guarantee. Sometimes, it is exactly the problem.

References

  1. Kobak D, González-Márquez R, Horvát E-Á, Lause J. Delving into LLM-assisted writing in biomedical publications through excess vocabulary. Science Advances 11, eadt3813 (2025). https://doi.org/10.1126/sciadv.adt3813
  2. Liang W et al. Quantifying large language model usage in scientific papers. Nature Human Behaviour (2025). https://doi.org/10.1038/s41562-025-02273-8
  3. Liang W et al. Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis. NEJM AI 1(8) (2024). https://doi.org/10.1056/AIoa2400196
  4. Latona GR, Ribeiro MH, Davidson TR, Veselovsky V, West R. The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Paper Scores and Acceptance Rates. arXiv:2405.02150 (2024).
  5. Liang W et al. Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews. Proceedings of the 41st International Conference on Machine Learning (ICML) (2024). arXiv:2403.07183
  6. Naddaf M. AI is transforming peer review — and many scientists are worried. Nature 639, 852–854 (2025a). https://doi.org/10.1038/d41586-025-00894-7
  7. Naddaf M. Major AI conference flooded with peer reviews written fully by AI. Nature 648, 256–257 (2025b). https://doi.org/10.1038/d41586-025-03506-6
  8. Naddaf M. More than half of researchers now use AI for peer review — often against guidance. Nature 649, 273–274 (2026). https://doi.org/10.1038/d41586-025-04066-5
  9. Shen S, Wang K. Detecting AI-Generated Content in Academic Peer Reviews. arXiv:2602.00319 (2026).
  10. Lu C, Lu C, Lange RT et al. Towards end-to-end automation of AI research. Nature 651, 914–919 (2026). https://doi.org/10.1038/s41586-026-10265-5
  11. NIH. The Use of Generative Artificial Intelligence Technologies is Prohibited for the NIH Peer Review Process. NOT-OD-23-149 (2023). https://grants.nih.gov/grants/guide/notice-files/NOT-OD-23-149.html
  12. NIH. Supporting Fairness and Originality in NIH Research Applications. NOT-OD-25-132 (2025). https://grants.nih.gov/grants/guide/notice-files/NOT-OD-25-132.html