What preprints.ai cannot detect
A list of failure modes we have measured, ones we suspect but cannot yet measure, and claims we explicitly do not make. This page exists so the next time someone asks "does it catch X?", we can point them to a paragraph rather than improvise.
What we don't claim
- We do not claim to replicate experiments. A paper with fabricated data that reads plausibly will score better than it deserves.
- We do not claim to read figures. Image forensics flags suspicious panels but cannot tell you what a chart is saying. Statistics reported only inside an embedded image are out of reach.
- We do not claim that panel agreement equals truth. When nine agents agree, that says they are internally consistent — not that the assessment is correct. See the independent critique we published of our own scoring methodology.
- We do not publish a headline "X% useful comments" figure. Self-selected thumbs ratings on a young product are not a calibration of comment quality. See the rating page.
- We do not claim uniform performance across fields. The agents are well-calibrated for biomedicine and life sciences. Physics, mathematics and CS preprints currently get a weaker signal.
Known failure modes
Hallucinated weaknesses
Reviewer agents occasionally invent a methodological flaw that the paper does not have. We surface every weakness with the agent that produced it and the role it was operating under, so the reader can weigh it against the source. The per-comment thumbs feedback is one of the inputs we will use to identify and downweight comment patterns prone to hallucination.
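A minimal sketch of what "surface every weakness with its source" and "downweight by thumbs feedback" could look like. The `Weakness` fields, the `pattern_weight` helper, and the smoothing rule are invented for illustration; this is not the production schema or the downweighting rule we will actually ship.

```python
from dataclasses import dataclass

@dataclass
class Weakness:
    text: str       # the comment shown to the reader
    agent_id: str   # which reviewer agent produced it
    role: str       # the role prompt the agent was operating under
    pattern: str    # coarse comment-pattern label, e.g. "missing-control"

def pattern_weight(thumbs: list[tuple[str, int]], pattern: str, prior: float = 0.5) -> float:
    """Smoothed up-vote fraction for a comment pattern; patterns that readers
    keep marking unhelpful drift toward zero."""
    votes = [v for p, v in thumbs if p == pattern]
    if not votes:
        return prior  # no feedback yet: keep the prior weight
    up = sum(1 for v in votes if v > 0)
    return (up + prior) / (len(votes) + 1)
```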
Agreement is consistency, not truth
Most agents currently share an underlying model family. When several of them agree, this is at least partly intra-model consistency — not the inter-rater reliability you would get from independent human reviewers. We expose the disagreement set on the report rather than burying it in a single confidence number.
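As a rough illustration of what "expose the disagreement set" means, here is a sketch (with invented names and verdict labels) that keeps only the criteria on which agents return different verdicts instead of collapsing everything into one confidence number.

```python
from collections import defaultdict

def disagreement_set(verdicts: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Keep only the criteria on which the agents do not all agree.
    verdicts maps agent_id -> {criterion: verdict}."""
    by_criterion: dict[str, dict[str, str]] = defaultdict(dict)
    for agent, per_criterion in verdicts.items():
        for criterion, verdict in per_criterion.items():
            by_criterion[criterion][agent] = verdict
    return {
        criterion: votes
        for criterion, votes in by_criterion.items()
        if len(set(votes.values())) > 1  # more than one distinct verdict
    }

# disagreement_set({"a": {"stats": "adequate"}, "b": {"stats": "underpowered"}})
# -> {"stats": {"a": "adequate", "b": "underpowered"}}
```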
Abstract-only papers
For papers ingested via metadata only (no full text), several Layer 1 modules either skip or run in a degraded mode, and the reviewer agents are explicitly told they are reading an abstract rather than the manuscript. The grade range is correspondingly tighter and more conservative on these papers.
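Roughly, the gating works like the sketch below. The module names and the `plan_layer1` / `reviewer_preamble` helpers are placeholders rather than the real module list; the point is that full-text-dependent modules are skipped or degraded, and the reviewer prompt carries an explicit abstract-only note.

```python
def plan_layer1(has_full_text: bool) -> dict[str, str]:
    """Per-module plan: run, run degraded, or skip. Module names are illustrative."""
    if has_full_text:
        return {"stats_checks": "run", "image_forensics": "run", "reference_checks": "run"}
    return {
        "stats_checks": "skip",          # needs the methods and results text
        "image_forensics": "skip",       # needs the embedded figures
        "reference_checks": "degraded",  # can still work from the metadata record
    }

def reviewer_preamble(has_full_text: bool) -> str:
    """Tell the reviewer agent what it is actually reading."""
    if has_full_text:
        return ""
    return "Note: you are reviewing an abstract only, not the full manuscript."
```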
Adversarial inputs
The hidden-prompt detector catches the most common rendering-level injection vectors but is not a complete shield. An attacker who studies the rule set can almost certainly construct an input that slips past it. Caveats are listed in full on that page.
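For flavour, here is a heavily simplified sketch of what rendering-level rules of this kind look like. The patterns below are illustrative only, not the detector's actual rule set.

```python
import re

# Illustrative patterns only -- not the production rule set.
INSTRUCTION_PHRASES = re.compile(
    r"ignore (all |any )?previous instructions"
    r"|disregard the (review|reviewer) guidelines"
    r"|give this (paper|manuscript) a (high|favourable) (score|review)",
    re.IGNORECASE,
)
INVISIBLE_CHARS = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")  # zero-width characters

def flag_hidden_prompts(extracted_text: str) -> list[str]:
    """Flag common rendering-level injection vectors; deliberately not a complete shield."""
    flags = []
    if INSTRUCTION_PHRASES.search(extracted_text):
        flags.append("instruction-like phrase addressed to the reviewing model")
    if INVISIBLE_CHARS.search(extracted_text):
        flags.append("zero-width characters that can hide text from human readers")
    return flags
```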
Stale literature context
Where Semantic Scholar rate-limits us or its corpus lags the live publication record, the "related work" and "potentially missing citations" blocks may be empty or out of date. We mark this on the report rather than fall back to fabricated context. Coverage figures here.
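A sketch of the failure handling, assuming the public Semantic Scholar Graph API references endpoint; the exact endpoint, response shape, and helper name are assumptions for the example. On a rate limit (HTTP 429) or any other failure the block is marked unavailable rather than filled in.

```python
import requests

REFS_URL = "https://api.semanticscholar.org/graph/v1/paper/{paper_id}/references"

def fetch_reference_context(paper_id: str) -> dict:
    """On rate limiting or any other failure, mark the block unavailable
    instead of substituting fabricated context."""
    try:
        resp = requests.get(REFS_URL.format(paper_id=paper_id), timeout=10)
    except requests.RequestException as exc:
        return {"status": "unavailable", "reason": f"network error: {exc}"}
    if resp.status_code == 429:
        return {"status": "unavailable", "reason": "rate limited by Semantic Scholar"}
    if resp.status_code != 200:
        return {"status": "unavailable", "reason": f"HTTP {resp.status_code}"}
    return {"status": "ok", "references": resp.json().get("data", [])}
```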
Domain breadth
The deterministic Layer 1 modules and the reviewer prompts have been most heavily tested on biomedical preprints. CS and physics papers run through the same pipeline but with weaker calibration. We do not currently publish a per-field performance breakdown because the per-field denominators are too small to be honest about.
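The denominator argument amounts to a gate like the one below; the threshold is a made-up number for illustration, not our actual cut-off.

```python
def reportable_fields(paper_counts: dict[str, int], min_n: int = 250) -> dict[str, int]:
    """Only fields whose evaluation sample clears a minimum size get a published
    per-field breakdown; the threshold here is illustrative."""
    return {field: n for field, n in paper_counts.items() if n >= min_n}
```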
What this page is not
This is not a complete failure-mode catalogue. It is the set of limits we have been confronted with often enough to write down. We will add to it as we ship more features and discover new ways the system can be wrong. If you find a failure mode that is not listed here, please open an issue against the repository — that is the fastest way to get it onto this page.
Source documents
Public methodology critique: landing/methodology-critique.md · headline methodology: /methodology.