At Least 146,000 AI Hallucinated Citations In Papers Published In 2025, Finds Paper

The growing use of AI to write research papers is leaving footprints across major repositories.

A new large-scale audit has found that at least 146,932 citations in scientific papers published in 2025 point to sources that simply do not exist — fabricated by large language models and left unchecked through layers of peer review.

The study, authored by researchers at Cornell, UCLA, UC Berkeley, and Tsinghua University, examined 111 million references across 2.5 million papers on arXiv, bioRxiv, SSRN, and PubMed Central. It is one of the most comprehensive attempts yet to quantify the real-world scale of AI hallucinations in knowledge work.

The Numbers

Citation hallucination rates have climbed sharply since mid-2024 — roughly 18 months after ChatGPT’s launch — reaching 0.39% on arXiv, 1.91% on SSRN, 0.27% on PubMed Central, and 0.21% on bioRxiv as of August 2025. The steepest acceleration coincides not with ChatGPT’s debut but with the rise of AI-powered search and agentic research tools that automate citation generation from live web content.

Monthly estimates for hallucinated citations in August 2025 alone: 3,353 (arXiv), 8,140 (PubMed Central), 767 (SSRN), and 478 (bioRxiv). The researchers note these figures are almost certainly a lower bound, since their verification method focused only on whether reference titles correspond to real publications, not on subtler errors like real citations used to support claims they don’t actually make.
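To see why a title-only check yields a lower bound, consider a minimal sketch of such a check. This is not the researchers' pipeline: the function names, the normalization, the 0.9 similarity threshold, and the tiny in-memory index are all assumptions made for illustration, and the third reference title is invented for the example.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation so trivial formatting differences are ignored."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_exists(title: str, known_titles: Iterable[str], threshold: float = 0.9) -> bool:
    """Return True if the title closely matches any entry in the index (threshold is an assumption)."""
    target = normalize(title)
    return any(
        SequenceMatcher(None, target, normalize(known)).ratio() >= threshold
        for known in known_titles
    )


def unmatched_rate(references: list, known_titles: list) -> float:
    """Share of references whose titles match nothing in the index."""
    if not references:
        return 0.0
    unmatched = [ref for ref in references if not title_exists(ref, known_titles)]
    return len(unmatched) / len(references)


# Toy index standing in for a real bibliographic database.
KNOWN = [
    "Attention Is All You Need",
    "Deep Residual Learning for Image Recognition",
]

# Two real titles with formatting noise, plus one title made up for this example.
REFS = [
    "Attention is all you need.",
    "Deep residual learning for image recognition",
    "Quantum Gradient Folding for Sparse Transformers",
]

rate = unmatched_rate(REFS, KNOWN)
print(f"unmatched share: {rate:.2f}")
```

A check like this only asks whether a title exists. A real paper cited for a claim it never makes would pass, which is why the reported counts understate the problem.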

This isn’t the only recent warning. A January 2026 analysis found at least 100 hallucinated citations in papers at NeurIPS 2025, the top machine learning conference, even though each paper was reviewed by three to five expert researchers. The problem, it appears, extends across prestige tiers.

Not a Few Bad Apples

A natural assumption would be that the contamination is concentrated in a small number of deeply flawed papers. The data rejects this. Rather than a handful of manuscripts loaded with fake citations, the pattern is one of diffuse contamination: many papers with a modest number of hallucinated references scattered among legitimate ones. The fraction of papers with more than 50% unmatched references has barely moved; the fraction with just a few has grown substantially.

This matches what experimental studies have found about how LLMs generate citations — a mix of real and fabricated references, with authors apparently adopting suggestions without verifying each one.
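The difference between concentrated and diffuse contamination can be made concrete with a small sketch. It assumes each paper is summarized as a pair of counts (unmatched references, total references) and uses the 50% cutoff mentioned above; the function name, bucket labels, and sample numbers are hypothetical and are not data from the study.

```python
def contamination_profile(papers):
    """Bucket papers by how much of their reference list is unmatched.

    papers: list of (unmatched_refs, total_refs) tuples.
    Returns the share of papers that are clean, have a few unmatched
    references (at most 50%), or have a majority unmatched (over 50%).
    """
    counts = {"clean": 0, "few": 0, "majority": 0}
    for unmatched, total in papers:
        if total == 0 or unmatched == 0:
            counts["clean"] += 1
        elif unmatched / total > 0.5:
            counts["majority"] += 1
        else:
            counts["few"] += 1
    n = len(papers)
    if n == 0:
        return {bucket: 0.0 for bucket in counts}
    return {bucket: count / n for bucket, count in counts.items()}


# Made-up sample: most affected papers have only one or two unmatched references.
sample = [(0, 40), (1, 35), (2, 50), (0, 20), (30, 40)]
profile = contamination_profile(sample)
print(profile)
```

Under diffuse contamination, the "few" share rises over time while the "majority" share stays roughly flat, which is the pattern the audit describes.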

Who’s Doing It, and Who Benefits

The research identifies a clear profile for “hallucination citers.” Compared to a matched control group, these authors have substantially fewer prior publications: 62% fewer on arXiv and 73% fewer on SSRN. More than half of hallucination citers on SSRN have zero prior publications. The pattern is consistent with junior or early-career researchers adopting AI tools quickly while being less familiar with the existing literature, and so less able to spot errors.

The irony is that LLMs appear to be boosting these researchers’ output significantly. The productivity gap between hallucination citers and their matched controls has largely closed by 2025, with hallucination citers increasing their output 1.3x to 3.1x faster than controls across the four datasets. In other words, the authors most likely to introduce hallucinated citations are now producing manuscripts at a higher rate.

The beneficiaries of these hallucinations follow a predictable pattern. When fabricated citations do name real scientists, they disproportionately credit already-prominent, high-citation researchers: the credited authors have 68.8% more prior publications and 58.3% more citations than the average target of a real citation. They also skew male, with a 6.4 percentage point bias toward male-named authors. The paper notes this may reinforce existing inequities in scientific recognition, channeling credit toward those who already have the most of it.

The Safeguards Aren’t Working

The researchers examined arXiv’s moderation process and found that while rejected manuscripts have a hallucination rate about 4.5 times higher than accepted ones, around 78.8% of non-existent citations still make it onto the platform. The filter leans the right way, but volume overwhelms it.
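The two figures above can be reconciled with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes "4.5 times higher" means a rate ratio of 4.5, treats the 78.8% figure as the share of hallucinated citations that sit in accepted papers, and uses hypothetical function names and reference counts that do not come from the study.

```python
def pass_through_share(accepted_refs, rejected_refs, accepted_rate, rate_ratio):
    """Share of all hallucinated citations that end up in accepted papers.

    rate_ratio is the rejected-paper hallucination rate divided by the
    accepted-paper rate (assumed here to be 4.5).
    """
    accepted_fake = accepted_refs * accepted_rate
    rejected_fake = rejected_refs * accepted_rate * rate_ratio
    return accepted_fake / (accepted_fake + rejected_fake)


def implied_rejected_fraction(pass_share, rate_ratio):
    """Solve pass_share = A / (A + rate_ratio * R) for R / A.

    A and R are the reference volumes in accepted and rejected papers.
    """
    return (1 - pass_share) / (pass_share * rate_ratio)


# Reference volume in rejected papers relative to accepted papers,
# implied by the two reported figures under the assumptions above.
ratio = implied_rejected_fraction(0.788, 4.5)
print(f"rejected/accepted reference volume: {ratio:.3f}")

# Round trip with an arbitrary accepted volume and base rate (both cancel out).
check = pass_through_share(1_000_000, 1_000_000 * ratio, 0.004, 4.5)
print(f"share passing moderation: {check:.3f}")
```

Under these assumptions, rejected submissions would account for only about 6 references for every 100 in accepted papers. Even a much higher hallucination rate among rejected papers cannot offset that imbalance, which is one way to read the claim that volume overwhelms the filter.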

Publication doesn’t clean things up either. Tracing 2,241 bioRxiv preprints with unmatched references to their published versions in PubMed Central, the researchers found that 85.3% of hallucinations in preprints survived into the published record. Journal peer review, it turns out, is not catching them.

And it’s not just low-impact journals. While the lowest-impact journals show higher hallucination rates and the highest-impact journals show lower ones, there is no clean gradient in between. Hallucinated citations appear across the full spectrum of journal prestige.

A Compounding Problem

The deeper risk is what happens once these citations enter the record. Researchers and AI systems both build on existing reference lists, and the paper’s Google Scholar analysis shows a growing number of hallucinated citations that now appear as standalone bibliographic entries — citations to works that don’t exist, treated by databases as if they do. As AI tools increasingly assist with research, a researcher consulting those databases could unknowingly recycle a hallucinated source into new work.

There is also a training data feedback loop to consider. LLMs are frequently trained on the same open-access corpora examined in this study. Hallucinations that enter the published record today become training data for tomorrow’s models.

The Wider Stakes

The authors frame science as the best-case scenario for detecting this kind of problem. It has rich indexing systems, strong citation norms, large bibliographic databases, and an emerging toolkit of automated verifiers. Even so, hallucinated content is entering the record, persisting from preprint through publication, and compounding through citation networks.

If that’s true of science, domains with less verification infrastructure — government reports, legal filings, clinical documentation, journalism — face a harder problem with fewer tools to address it. The paper closes with a blunt observation: science is the most measurable instance of what is likely a far broader phenomenon, one that has already outpaced the safeguards designed to contain it.
