Paper-mill detection — Evidence

Methodology

The module flags papers whose surface features cluster with patterns characteristic of paper-mill output. Signals include:

Template language. N-gram overlap with known paper-mill template phrasings (boilerplate introductions, formulaic methods sentences).
Reference-list fingerprints. Unusually high overlap with reference lists from previously flagged papers; suspicious bursts of references to a small set of journals.
Metadata anomalies. Author affiliations, ORCID coverage, and email domain patterns that are over-represented in the paper-mill literature.
Image-stock matching. When figures are present, they are compared against a small index of stock images reused across known paper-mill submissions.
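The first two signals can be illustrated with a minimal sketch. It is not the code in checks/layer1/paper_mill_detection.py: the function names, the whitespace tokenisation, the trigram size, and the use of Jaccard similarity for reference lists are all assumptions made for illustration.

```python
from typing import Iterable


def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`.

    Assumption: plain whitespace tokenisation with no punctuation handling.
    """
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def template_overlap(text: str, templates: Iterable[str], n: int = 3) -> float:
    """Fraction of the paper's n-grams that also occur in any template phrase."""
    paper = word_ngrams(text, n)
    if not paper:
        return 0.0
    known: set[tuple[str, ...]] = set()
    for phrase in templates:
        known |= word_ngrams(phrase, n)
    return len(paper & known) / len(paper)


def reference_jaccard(refs_a: Iterable[str], refs_b: Iterable[str]) -> float:
    """Jaccard similarity between two reference lists, compared as sets of IDs."""
    a, b = set(refs_a), set(refs_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Both functions return a score between 0 and 1. Where to put the flagging threshold is a separate decision that this sketch does not make.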

Results

We do not yet publish a precision or recall figure for this module. The reason is straightforward: there is no widely agreed, public, labelled corpus of academic paper-mill preprints to evaluate against. Constructing one with sufficient coverage and an honest negative class is non-trivial work that we have not finished.

When that work is finished, we will publish:

The size and source of the labelled set.
Precision and recall at the threshold the production module uses.
A confusion matrix broken down by signal type, so readers can see which heuristics are doing the work.
The agreement between this module and the eventual reviewer-agent verdict on the same papers.
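As a sketch of how the precision, recall, and per-signal breakdown could be computed once a labelled set exists, the following assumes a hypothetical `LabelledPaper` record holding a ground-truth label and the set of signals that fired at the production threshold. The record shape and the signal names are illustrative assumptions, not the module's actual data model, and no real results are implied.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LabelledPaper:
    """Hypothetical record: one paper from a labelled evaluation set."""
    is_paper_mill: bool            # ground-truth label
    fired_signals: frozenset[str]  # signals that exceeded the threshold


def confusion(papers: Iterable[LabelledPaper], signal: Optional[str] = None) -> Counter:
    """Count tp/fp/fn/tn.

    With `signal=None` a paper counts as flagged if any signal fired;
    otherwise only the named signal is considered.
    """
    counts = Counter(tp=0, fp=0, fn=0, tn=0)
    for p in papers:
        flagged = bool(p.fired_signals) if signal is None else signal in p.fired_signals
        correct = flagged == p.is_paper_mill
        counts[("t" if correct else "f") + ("p" if flagged else "n")] += 1
    return counts


def precision_recall(counts: Counter) -> tuple[Optional[float], Optional[float]]:
    """Return (precision, recall); None where the denominator is zero."""
    flagged = counts["tp"] + counts["fp"]
    positives = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged if flagged else None
    recall = counts["tp"] / positives if positives else None
    return precision, recall
```

Calling `confusion(papers, signal="template_language")` for each signal in turn gives the per-signal breakdown. Calling it with no signal gives the module-level figures.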

Why no number yet. Publishing a precision figure off an unlabelled production stream would be self-graded homework. We would rather leave this page empty of metrics than seed it with a number that cannot be defended.

Caveats — what this doesn't measure

The module flags surface-level signals. A paper whose results are fabricated but whose prose is bespoke will not trip it.
Template-language detection can penalise non-native English writers, whose prose can superficially resemble template output. To limit that harm, the module emits findings as flags, not blockers, and downstream agents are explicitly told not to weight prose fluency.
The fingerprint set was assembled from public reporting on retracted paper-mill output up to early 2025. Newer mills using different templates may pass through unflagged.
Without a labelled set we have no calibrated false-positive rate to share. Treat findings here as "worth a closer look", not "guilty".
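The "flags, not blockers" behaviour described above might look like the following sketch. The `Severity` enum, the `Finding` record, and both helper functions are hypothetical names introduced here for illustration; the actual output format of the module is not documented on this page.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    FLAG = "flag"        # advisory: worth a closer look
    BLOCKER = "blocker"  # would halt processing; this sketch never emits it


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record emitted by a detection signal."""
    signal: str
    detail: str
    severity: Severity = Severity.FLAG  # advisory by default


def needs_closer_look(findings: list[Finding]) -> bool:
    """True if any advisory flag was raised."""
    return any(f.severity is Severity.FLAG for f in findings)


def is_blocked(findings: list[Finding]) -> bool:
    """True only if some finding is an explicit blocker."""
    return any(f.severity is Severity.BLOCKER for f in findings)
```

Under this design a paper with several flags is routed for closer inspection but is never rejected on the strength of those flags alone.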

Code

Module: checks/layer1/paper_mill_detection.py · related fabrication heuristics: checks/layer1/fabrication_detector.py.