Paper-mill detection
Heuristic detection of template, fingerprint, and metadata patterns associated with paper-mill output. We describe how it works, what it cannot do, and where the gaps in our evaluation are.
Methodology
The module flags papers whose surface features cluster with patterns characteristic of paper-mill output. Signals include:
- Template language. N-gram overlap with known paper-mill template phrasings (boilerplate introductions, formulaic methods sentences).
- Reference-list fingerprints. Unusually high overlap with reference lists from previously flagged papers; suspicious bursts of references to a small set of journals.
- Metadata anomalies. Author affiliations, ORCID coverage, and email domain patterns that are over-represented in the paper-mill literature.
- Image-stock matching. When figures are present, comparison against a small index of stock-image reuse seen across known paper-mill submissions.
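As a rough illustration of the first two signals, the sketch below scores template language by word n-gram overlap and reference-list fingerprints by Jaccard similarity. It is not the code in checks/layer1/paper_mill_detection.py: the function names, the choice of trigrams, and the use of a plain set of reference identifiers are all assumptions made for the example.

```python
import re


def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def template_overlap(text: str, templates: list[str], n: int = 3) -> float:
    """Fraction of the paper's n-grams that also occur in any known template.

    `templates` stands in for a hypothetical list of known template phrasings.
    """
    paper = word_ngrams(text, n)
    if not paper:
        return 0.0  # text shorter than n words: nothing to compare
    known: set[tuple[str, ...]] = set()
    for template in templates:
        known |= word_ngrams(template, n)
    return len(paper & known) / len(paper)


def reference_jaccard(refs_a: set[str], refs_b: set[str]) -> float:
    """Jaccard similarity between two reference lists (e.g. sets of DOIs)."""
    union = refs_a | refs_b
    if not union:
        return 0.0  # two empty lists carry no evidence of overlap
    return len(refs_a & refs_b) / len(union)
```

Both functions return a score between 0 and 1. How such scores are combined and thresholded into a flag is not specified here, so the sketch stops at the raw scores.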
Results
We do not yet publish a precision or recall figure for this module. The reason is straightforward: there is no widely agreed, publicly available labelled corpus of academic paper-mill preprints to evaluate against. Constructing one with sufficient coverage and an honest negative class is non-trivial work that we have not finished.
When that work is finished, we will publish:
- The size and source of the labelled set.
- Precision and recall at the threshold the production module uses.
- A confusion matrix broken down by signal type, so readers can see which heuristics are doing the work.
- The agreement between this module and the eventual reviewer-agent verdict on the same papers.
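For readers who want the definitions behind those figures, here is a minimal sketch of how precision, recall, and raw agreement could be computed once a labelled set exists. The names `Confusion`, `confusion_at_threshold`, and `agreement_rate` are hypothetical, and the rule "flag when score >= threshold" is an assumption about how the module's scores would be thresholded.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int = 0  # flagged, labelled paper-mill
    fp: int = 0  # flagged, labelled legitimate
    fn: int = 0  # not flagged, labelled paper-mill
    tn: int = 0  # not flagged, labelled legitimate

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        # Precision is undefined when nothing is flagged; 0.0 is a convention.
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        positives = self.tp + self.fn
        # Recall is undefined without positives; 0.0 is a convention.
        return self.tp / positives if positives else 0.0


def confusion_at_threshold(scores: list[float], labels: list[bool],
                           threshold: float) -> Confusion:
    """Count outcomes when a paper is flagged iff its score >= threshold."""
    c = Confusion()
    for score, is_mill in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_mill:
            c.tp += 1
        elif flagged:
            c.fp += 1
        elif is_mill:
            c.fn += 1
        else:
            c.tn += 1
    return c


def agreement_rate(module_flags: list[bool], reviewer_flags: list[bool]) -> float:
    """Fraction of papers on which the module and the reviewer verdict agree."""
    if not module_flags:
        return 0.0
    matches = sum(m == r for m, r in zip(module_flags, reviewer_flags))
    return matches / len(module_flags)
```

A per-signal breakdown would follow by calling `confusion_at_threshold` once per signal, passing only that signal's scores. Raw agreement does not correct for chance; a chance-corrected statistic may be the better choice for the published figure.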
Caveats — what this doesn't measure
- The module flags surface-level signals. A paper whose results are fabricated but whose prose is bespoke will not trip it.
- Template-language detection can penalise non-native English writers whose prose superficially resembles template output. To limit that harm, the module emits findings as flags, not blockers, and downstream agents are explicitly told not to weight prose fluency.
- The fingerprint set was assembled from public reporting on retracted paper-mill output up to early 2025. Newer mills using different templates may pass through unflagged.
- Without a labelled set we have no calibrated false-positive rate to share. Treat findings here as "worth a closer look", not "guilty".
Code
Module: checks/layer1/paper_mill_detection.py · related fabrication heuristics: checks/layer1/fabrication_detector.py.