What preprints.ai cannot detect
A list of failure modes we have measured, ones we suspect but cannot yet measure, and claims we explicitly do not make. This page exists so the next time someone asks "does it catch X?", we can point them to a paragraph rather than improvise.
What we don't claim
- We do not claim to replicate experiments. A paper with fabricated data that reads plausibly will score better than it deserves.
- We do not claim to read figures. Image forensics flags suspicious panels but cannot tell you what a chart is saying. Statistics reported only inside an embedded image are out of reach.
- We do not claim that panel agreement equals truth. When nine agents agree, that says they are internally consistent — not that the assessment is correct. See the independent critique we published of our own scoring methodology.
- We do not publish a headline "X% useful comments" figure. Self-selected thumbs ratings on a young product are not a calibration of comment quality. See the rating page.
- We do not claim uniform performance across fields. The agents are well-calibrated for biomedicine and life sciences. Physics, mathematics and CS preprints currently get a weaker signal.
Known failure modes
Hallucinated weaknesses
Reviewer agents occasionally invent a methodological flaw that the paper does not have. We surface every weakness with the agent that produced it and the role it was operating under, so the reader can weigh it against the source. The per-comment thumbs feedback is one of the inputs we will use to identify and downweight comment patterns prone to hallucination.
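A minimal sketch of what "surface every weakness with its source" and "downweight by thumbs feedback" could look like. The `Weakness` fields, the `pattern_weight` helper, and the smoothing rule are invented for illustration; this is not the production schema or the downweighting rule we will actually ship.

```python
from dataclasses import dataclass

@dataclass
class Weakness:
    text: str       # the comment shown to the reader
    agent_id: str   # which reviewer agent produced it
    role: str       # the role prompt the agent was operating under
    pattern: str    # coarse comment-pattern label, e.g. "missing-control"

def pattern_weight(thumbs: list[tuple[str, int]], pattern: str, prior: float = 0.5) -> float:
    """Smoothed up-vote fraction for a comment pattern; patterns that readers
    keep marking unhelpful drift toward zero."""
    votes = [v for p, v in thumbs if p == pattern]
    if not votes:
        return prior  # no feedback yet: keep the prior weight
    up = sum(1 for v in votes if v > 0)
    return (up + prior) / (len(votes) + 1)
```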
Agreement is consistency, not truth
Most agents currently share an underlying model family. When several of them agree, this is at least partly intra-model consistency — not the inter-rater reliability you would get from independent human reviewers. We expose the disagreement set on the report rather than burying it in a single confidence number.
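As a rough illustration of what "expose the disagreement set" means, here is a sketch (with invented names and verdict labels) that keeps only the criteria on which agents return different verdicts instead of collapsing everything into one confidence number.

```python
from collections import defaultdict

def disagreement_set(verdicts: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Keep only the criteria on which the agents do not all agree.
    verdicts maps agent_id -> {criterion: verdict}."""
    by_criterion: dict[str, dict[str, str]] = defaultdict(dict)
    for agent, per_criterion in verdicts.items():
        for criterion, verdict in per_criterion.items():
            by_criterion[criterion][agent] = verdict
    return {
        criterion: votes
        for criterion, votes in by_criterion.items()
        if len(set(votes.values())) > 1  # more than one distinct verdict
    }

# disagreement_set({"a": {"stats": "adequate"}, "b": {"stats": "underpowered"}})
# -> {"stats": {"a": "adequate", "b": "underpowered"}}
```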
Abstract-only papers
For papers ingested via metadata only (no full text), several Layer 1 modules either skip or run in a degraded mode, and the reviewer agents are explicitly told they are reading an abstract rather than the manuscript. The grade range is correspondingly tighter and more conservative on these papers.
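Roughly, the gating works like the sketch below. The module names and the `plan_layer1` / `reviewer_preamble` helpers are placeholders rather than the real module list; the point is that full-text-dependent modules are skipped or degraded, and the reviewer prompt carries an explicit abstract-only note.

```python
def plan_layer1(has_full_text: bool) -> dict[str, str]:
    """Per-module plan: run, run degraded, or skip. Module names are illustrative."""
    if has_full_text:
        return {"stats_checks": "run", "image_forensics": "run", "reference_checks": "run"}
    return {
        "stats_checks": "skip",          # needs the methods and results text
        "image_forensics": "skip",       # needs the embedded figures
        "reference_checks": "degraded",  # can still work from the metadata record
    }

def reviewer_preamble(has_full_text: bool) -> str:
    """Tell the reviewer agent what it is actually reading."""
    if has_full_text:
        return ""
    return "Note: you are reviewing an abstract only, not the full manuscript."
```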
Adversarial inputs
The hidden-prompt detector catches the most common rendering-level injection vectors but is not a complete shield. An attacker who studies the rule set can almost certainly construct an input that slips past it. Caveats are listed in full on that page.
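For flavour, here is a heavily simplified sketch of what rendering-level rules of this kind look like. The patterns below are illustrative only, not the detector's actual rule set.

```python
import re

# Illustrative patterns only -- not the production rule set.
INSTRUCTION_PHRASES = re.compile(
    r"ignore (all |any )?previous instructions"
    r"|disregard the (review|reviewer) guidelines"
    r"|give this (paper|manuscript) a (high|favourable) (score|review)",
    re.IGNORECASE,
)
INVISIBLE_CHARS = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")  # zero-width characters

def flag_hidden_prompts(extracted_text: str) -> list[str]:
    """Flag common rendering-level injection vectors; deliberately not a complete shield."""
    flags = []
    if INSTRUCTION_PHRASES.search(extracted_text):
        flags.append("instruction-like phrase addressed to the reviewing model")
    if INVISIBLE_CHARS.search(extracted_text):
        flags.append("zero-width characters that can hide text from human readers")
    return flags
```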
Stale literature context
Where Semantic Scholar rate-limits us or its corpus lags the live publication record, the "related work" and "potentially missing citations" blocks may be empty or out of date. We mark this on the report rather than fall back to fabricated context. Coverage figures here.
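A sketch of the failure handling, assuming the public Semantic Scholar Graph API references endpoint; the exact endpoint, response shape, and helper name are assumptions for the example. On a rate limit (HTTP 429) or any other failure the block is marked unavailable rather than filled in.

```python
import requests

REFS_URL = "https://api.semanticscholar.org/graph/v1/paper/{paper_id}/references"

def fetch_reference_context(paper_id: str) -> dict:
    """On rate limiting or any other failure, mark the block unavailable
    instead of substituting fabricated context."""
    try:
        resp = requests.get(REFS_URL.format(paper_id=paper_id), timeout=10)
    except requests.RequestException as exc:
        return {"status": "unavailable", "reason": f"network error: {exc}"}
    if resp.status_code == 429:
        return {"status": "unavailable", "reason": "rate limited by Semantic Scholar"}
    if resp.status_code != 200:
        return {"status": "unavailable", "reason": f"HTTP {resp.status_code}"}
    return {"status": "ok", "references": resp.json().get("data", [])}
```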
Domain breadth
The deterministic Layer 1 modules and the reviewer prompts have been most heavily tested on biomedical preprints. CS and physics papers run through the same pipeline but with weaker calibration. We do not currently publish a per-field performance breakdown because the per-field denominators are too small to be honest about.
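The denominator argument amounts to a gate like the one below; the threshold is a made-up number for illustration, not our actual cut-off.

```python
def reportable_fields(paper_counts: dict[str, int], min_n: int = 250) -> dict[str, int]:
    """Only fields whose evaluation sample clears a minimum size get a published
    per-field breakdown; the threshold here is illustrative."""
    return {field: n for field, n in paper_counts.items() if n >= min_n}
```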
What this page is not
This is not a complete failure-mode catalogue. It is the set of limits we have been confronted with often enough to write down. We will add to it as we ship more features and discover new ways the system can be wrong. If you find a failure mode that is not listed here, please open an issue against the repository — that is the fastest way to get it onto this page.
Source documents
Public methodology critique: landing/methodology-critique.md · headline methodology: /methodology.