Preprints.ai
Honest methodology

What preprints.ai cannot detect

A list of failure modes we have measured, ones we suspect but cannot yet measure, and claims we explicitly do not make. This page exists so the next time someone asks "does it catch X?", we can point them to a paragraph rather than improvise.

Limits enumerated · No benchmark cherry-picking


Known failure modes

Hallucinated weaknesses

Reviewer agents occasionally invent a methodological flaw that the paper does not have. We surface every weakness with the agent that produced it and the role it was operating under, so the reader can weigh it against the source. The per-comment thumbs feedback is one of the inputs we will use to identify and downweight comment patterns prone to hallucination.
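As a sketch of the kind of downweighting we have in mind (the function name, data shapes, and thresholds here are all illustrative, not the shipped implementation):

```python
def downweight(comments, feedback, min_votes=10):
    """Scale each comment pattern's weight by its observed thumbs-up rate.

    `feedback` maps a comment-pattern id to (thumbs_up, thumbs_down) counts
    gathered from per-comment reader feedback. Patterns with too few votes
    keep a neutral weight of 1.0 rather than being judged on noise.
    """
    weights = {}
    for pattern, (up, down) in feedback.items():
        total = up + down
        if total < min_votes:
            weights[pattern] = 1.0  # not enough signal yet
        else:
            weights[pattern] = up / total  # crude: fraction judged accurate
    # Pair each comment with the weight of the pattern that produced it.
    return [(c, weights.get(c["pattern"], 1.0)) for c in comments]
```

The point of the minimum-vote guard is that a pattern should not be downweighted on one or two unhappy readers; only sustained negative feedback moves the weight.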

Agreement is consistency, not truth

Most agents currently share an underlying model family. When several of them agree, this is at least partly intra-model consistency — not the inter-rater reliability you would get from independent human reviewers. We expose the disagreement set on the report rather than burying it in a single confidence number.
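To make the distinction concrete, here is a toy simulation (not our pipeline; every number in it is invented for illustration): agents that share a common bias agree unanimously far more often than their individual accuracy would justify.

```python
import random

def simulate_agreement(n_papers=2000, n_agents=4, shared_bias=0.8, seed=0):
    """Illustrative only: agents sharing a model family err together.

    With probability `shared_bias`, all agents copy one shared (possibly
    wrong) judgment; otherwise each judges independently at 70% accuracy.
    Returns (unanimous-agreement rate, per-agent accuracy).
    """
    rng = random.Random(seed)
    unanimous = correct = 0
    for _ in range(n_papers):
        truth = rng.random() < 0.5
        if rng.random() < shared_bias:
            # Correlated case: one judgment, echoed by every agent.
            shared = truth if rng.random() < 0.7 else not truth
            votes = [shared] * n_agents
        else:
            # Independent case: each agent judges on its own.
            votes = [truth if rng.random() < 0.7 else not truth
                     for _ in range(n_agents)]
        unanimous += len(set(votes)) == 1
        correct += votes[0] == truth
    return unanimous / n_papers, correct / n_papers
```

With these illustrative numbers, four truly independent 70%-accurate raters would all agree only about a quarter of the time; the correlated agents agree well over 80% of the time at the same 70% accuracy. That gap is why we show the disagreement set instead of treating consensus as confidence.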

Abstract-only papers

For papers ingested via metadata only (no full text), several Layer 1 modules either skip or run in a degraded mode, and the reviewer agents are explicitly told they are reading an abstract rather than the manuscript. The grade range is correspondingly tighter and more conservative on these papers.

Adversarial inputs

The hidden-prompt detector catches the most common rendering-level injection vectors but is not a complete shield. An attacker who studies the rule set can almost certainly construct an input that slips past it. Caveats are listed in full on that page.
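A sketch of why rule-based detection is inherently evadable (these rules are invented for illustration and are not our detector's actual rule set): each rule matches one known way of hiding text in a rendered document, so anything one step outside the rules passes.

```python
import re

# Illustrative hiding rules only -- not the shipped detector's rule set.
HIDDEN_TEXT_RULES = [
    re.compile(r"font-size:\s*0"),          # zero-size text
    re.compile(r"color:\s*#?fff(fff)?\b"),  # white-on-white text
    re.compile(r"opacity:\s*0(\.0+)?\b"),   # fully transparent text
]

def flags_hidden_text(style: str) -> bool:
    """Return True if an inline style matches any known hiding rule."""
    return any(rule.search(style) for rule in HIDDEN_TEXT_RULES)
```

Note the obvious evasion: `color: #fefefe` is visually indistinguishable from white but matches no rule. That asymmetry, where the defender enumerates and the attacker only needs one gap, is the caveat the paragraph above is making.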

Stale literature context

Where Semantic Scholar rate-limits us or its corpus lags the live publication record, the "related work" and "potentially missing citations" blocks may be empty or out of date. We mark this on the report rather than falling back on fabricated context. Coverage figures here.

Domain breadth

The deterministic Layer 1 modules and the reviewer prompts have been most heavily tested on biomedical preprints. CS and physics papers run through the same pipeline but with weaker calibration. We do not currently publish a per-field performance breakdown because the per-field denominators are too small to report honestly.

What this page is not

This is not a complete failure-mode catalogue. It is the set of limits we have been confronted with often enough to write down. We will add to it as we ship more features and discover new ways the system can be wrong. If you find a failure mode that is not listed here, please open an issue against the repository — that is the fastest way to get it onto this page.

Why this exists. Most AI peer-review platforms publish a feature list. We publish a feature list and a failure-mode list, on the theory that the reader needs both to make a useful judgement about whether to trust an individual report.

Source documents

Public methodology critique: landing/methodology-critique.md · headline methodology: /methodology.