Methodology

How Preprints.ai works

Every assessment runs a four-phase pipeline: deterministic integrity checks, a nine-agent AI peer review panel, eLife-style multi-model deliberation, and synthesis into an A5–E1 grade. The system is built on peer-reviewed open source tools and designed so that each phase can be independently audited.

Important: All grades are machine-generated indicators. They are designed as a first-pass filter to help researchers, editors, and LLMs prioritise their attention — not to replace human expert review. A grade of A5 does not mean a paper is correct; a grade of E1 does not mean it should be suppressed.

The assessment pipeline

Papers flow through four sequential phases. Each phase can independently cap the final grade — a paper mill detection hit in Layer 1 sets a hard E ceiling regardless of later scores.

Layer 1 — Deterministic integrity checks
11 rule-based modules run in parallel: paper mill fingerprints, p-value recalculation, image forensics, open data detection, software citation, blinding/randomisation reporting, retraction watch, and more. Fast, reproducible, auditable.
~2–5 seconds · All checks parallelised
Layer 2 — 9-agent AI peer review panel
Nine specialist AI reviewers — each grounded in Layer 1 findings — independently assess the paper. Models are rotated across providers (Claude, GPT-4o, Gemini) to reduce correlated bias.
~40–80 seconds · Agents run concurrently
Layer 3 — eLife-style deliberation (borderlines)
For papers where agents disagree (panel agreement < 0.75), three senior models (Claude, GPT-4o, Gemini) independently review and then negotiate a consensus, following eLife's structured assessment format.
~30–60 seconds · Only for borderline papers
Layer 4 — Grade synthesis
Integrity score (from L1 penalties + agent panel) and novelty score (from agent panel + domain context) are mapped to the A–E and 1–5 axes respectively. The final grade is the intersection.
<1 second
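The grade-ceiling rule can be sketched in a few lines. Everything below is illustrative: the function name, grade encoding, and ceiling logic are our own simplification, not the production pipeline.

```python
# Minimal sketch of hard grade ceilings, assuming letter grades A (best) to E
# (worst). Names and logic here are illustrative, not the production code.
LETTERS = "ABCDE"  # A = strongest evidence, E = weakest

def apply_ceiling(grade: str, ceiling: str) -> str:
    """Return the worse of the computed letter grade and the phase's ceiling,
    keeping the significance digit unchanged."""
    letter, digit = grade[0], grade[1:]
    capped = max(letter, ceiling, key=LETTERS.index)  # later in LETTERS = worse
    return capped + digit

# A paper-mill hit in Layer 1 forces an E ceiling regardless of later scores:
print(apply_ceiling("B4", "E"))  # E4
print(apply_ceiling("E2", "C"))  # stays E2
```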

Layer 1: Deterministic integrity checks

The first phase runs a battery of rule-based checks that don't rely on language models. These checks are fast, reproducible, and independently verifiable. Their results are passed as structured context to the agent panel in Layer 2.

What Layer 1 checks

  • Paper mill detection: 8,000+ tortured phrases, SCIgen/Mathgen fingerprints, 39 citejacked journals (PPS, Cabanac et al.)
  • Statistical verification: P-value recalculation, GRIM tests, impossible-result detection (statcheck, Nuijten et al.)
  • Image forensics: Duplicate panel detection, copy-move analysis, Western blot signature checking (ELIS)
  • Open data detection: Repository mentions, accession numbers, data availability statement presence (ODDPub)
  • Software citation: Full-paper detection of software tools + version/RRID completeness (SoftCite methodology)
  • Rigor criteria (ScreenIT): Blinding reporting, randomisation description, power calculation, inclusion/exclusion criteria
  • Reference verification: Cross-check against Retraction Watch database, detect citejacked journals
  • Fabrication detection: Benford's law analysis, terminal digit distribution, SPRITE-style consistency
  • Dataset classification: Dataset mention extraction and type classification (DataSeer-ML)
  • AI content detection: LLM-generation signatures and machine translation indicators
  • Trust markers: ORCID presence, ethics approval numbers, COI disclosure, funding statement
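As a concrete example of how cheap and deterministic these checks are, here is a minimal GRIM-style mean-consistency check (part of the statistical verification bullet above). This is a simplified sketch, not the exact Layer 1 implementation.

```python
# GRIM-style check: for n integer-valued responses, a mean reported to two
# decimals must equal some k/n with k an integer. Simplified illustration only.
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    k = round(reported_mean * n)  # nearest achievable integer sum
    return round(k / n, decimals) == round(reported_mean, decimals)

print(grim_consistent(3.48, 25))  # True: 87/25 = 3.48 exactly
print(grim_consistent(3.49, 25))  # False: no integer sum over 25 gives 3.49
```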

Layer 2: The 9-agent AI panel

Nine specialist AI reviewers assess the paper concurrently, each grounded in Layer 1 findings. A critical design principle: each agent has a scope boundary enforcing genuine independence. The ensemble benefit only materialises when agents detect different things — not when nine models converge on the same observations.

  • Methodologist (claude-sonnet-4-20250514): Experimental design, controls, sample size, confounding variables
  • Statistician (gpt-4o-2024-11-20): Statistical tests, p-values, effect sizes, multiple comparisons
  • Domain Expert (gemini-2.0-flash-001): Field-specific novelty, missing citations, context
  • Reproducibility (claude-haiku-4-5-20251001): Replicability, data/code availability, reporting standards
  • Ethics & Transparency (gpt-4o-2024-11-20): IRB, COI, funding, pre-registration, publication ethics
  • Scientific Validity (claude-sonnet-4-20250514): Pseudoscience gate — scientific plausibility and coherence
  • Domain Primary (claude-sonnet-4-20250514): Comprehensive primary peer review (eLife Reviewer #1 depth)
  • Methods Specialist (gpt-4o-2024-11-20): Specific techniques, reagents, protocols, controls
  • Translational Relevance (gemini-2.0-flash-001): Clinical significance, effect sizes vs. practical relevance
Multi-provider design: Agents are distributed across Claude, OpenAI, and Gemini to reduce correlated model bias. When all three providers agree, confidence is substantially higher than with a single-model consensus. Panel agreement is reported as a 0–1 score alongside the grade.
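The 0–1 panel agreement score admits several formulations. One plausible sketch, the fraction of reviewer pairs whose letter grades match exactly, is shown below; this formulation is an assumption for illustration, not the documented metric.

```python
from itertools import combinations

# Illustrative pairwise-agreement score: the fraction of reviewer pairs whose
# letter grades match exactly. The production metric may be defined differently.
def panel_agreement(letter_grades: list) -> float:
    pairs = list(combinations(letter_grades, 2))
    matches = sum(a == b for a, b in pairs)
    return matches / len(pairs)

print(panel_agreement(["B"] * 9))                        # 1.0 (unanimous panel)
print(round(panel_agreement(["B"] * 6 + ["C"] * 3), 2))  # 0.5 (6-3 split)
```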

Model version pinning

Model version strings are pinned as constants (following Eckmann et al. 2026, PLoS ONE) to prevent silent calibration drift. Version updates are explicit, documented decisions accompanied by re-validation against our calibration test suite.

Layer 3: eLife-style deliberation

For borderline papers (panel agreement < 0.75 or any agent flagging a critical concern), three senior models engage in a structured two-round deliberation modelled on eLife's open peer review process:

  1. Each model independently drafts an eLife-format assessment
  2. Models read each other's assessments and identify key disagreements
  3. A structured consensus is negotiated, with explicit reasoning for any resolved disagreements
  4. Final assessment captures both the consensus grade and the nature of disagreement
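The shape of this protocol can be shown with a toy consensus step. Real deliberation is driven by model-generated text; in the sketch below, "drafts" are reduced to letter grades and the negotiation rule (take the median draft, record whether drafts diverged) is our own simplification.

```python
# Toy sketch of the consensus step: drafts are letter grades, negotiation is
# "take the median", and divergence is recorded. Illustrative only.
LETTERS = "ABCDE"

def negotiate_consensus(drafts: list) -> dict:
    ranks = sorted(LETTERS.index(g) for g in drafts)
    median = LETTERS[ranks[len(ranks) // 2]]
    disagreement = ranks[-1] != ranks[0]  # did any two drafts differ?
    return {"consensus": median, "disagreement": disagreement, "drafts": drafts}

result = negotiate_consensus(["B", "C", "C"])
print(result["consensus"], result["disagreement"])  # C True
```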

The A5–E1 Grading Matrix

Every paper receives a two-character grade: a letter (A–E) representing strength of evidence, and a number (1–5) representing significance of findings. The matrix is inspired by eLife's assessment vocabulary and OpenAI FrontierScience's rubric-based evaluation methodology.

                 5 Landmark   4 Fundamental   3 Important   2 Valuable   1 Useful
A  Compelling       A5            A4              A3            A2           A1
B  Convincing       B5            B4              B3            B2           B1
C  Solid            C5            C4              C3            C2           C1
D  Incomplete       D5            D4              D3            D2           D1
E  Inadequate       E5            E4              E3            E2           E1

The significance axis (1–5) reflects the importance of the research question — not the publication venue. A D5 paper asks a landmark question but lacks the evidence to support its claims.
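The Layer 4 synthesis step can be illustrated as a pure mapping from two scores to a cell of the matrix. The cut-points and score ranges below are invented for illustration; the real thresholds and score construction are internal.

```python
# Illustrative mapping from two 0-1 scores to an A5-E1 grade. The equal-width
# cut-points here are made up, not the production thresholds.
def synthesise_grade(integrity: float, novelty: float) -> str:
    letters = "EDCBA"                      # low integrity -> E, high -> A
    letter = letters[min(int(integrity * 5), 4)]
    number = min(int(novelty * 5), 4) + 1  # low novelty -> 1, high -> 5
    return f"{letter}{number}"

print(synthesise_grade(0.92, 0.95))  # A5
print(synthesise_grade(0.55, 0.85))  # C5
```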

Grade descriptions

  • A (Compelling evidence): Rigorous methodology, appropriate controls, no statistical anomalies, full data sharing. Ready for high-tier publication.
  • B (Convincing evidence): Sound methodology with minor gaps. Would benefit from revision but claims are well-supported.
  • C (Solid but incomplete): Meaningful work with significant methodological gaps requiring major revision before publication.
  • D (Incomplete evidence): Interesting question but insufficient evidence. Not submission-ready.
  • E (Inadequate): Serious methodological, statistical, or integrity concerns. Not ready for peer review.

Inter-rater reliability

We measure two forms of agreement: intra-model consistency (same model, same paper, different runs) and inter-model agreement (different models, same paper). The panel agreement score reported with each grade reflects inter-model agreement.

On our validation set of 300 bioRxiv papers with known peer review outcomes, the system achieves a Spearman correlation of ~0.68 with eventual journal acceptance decisions and ~0.72 with editor-assigned quality scores. Individual agents disagree on grade by ≥1 letter in approximately 23% of papers — these are flagged for deliberation.
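For reference, Spearman rank correlation can be computed from scratch with average ranks for ties. The data below is synthetic and for illustration only; it is not our validation set or validation code.

```python
# From-scratch Spearman rho with average ranks for ties. Synthetic data only.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                 # extend run of tied values
        avg = (i + j) / 2 + 1      # average rank for the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(round(spearman([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]), 2))  # 0.8
```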

Open source tools we build on

Preprints.ai is deliberately built on publicly auditable, peer-reviewed tools. We do not black-box our integrity checks.

PPS Fingerprints Paper Mill
Cabanac et al., IRIT · github.com/gcabanac/pps
8,000+ tortured phrase substitutions, SCIgen/Mathgen signatures, 39 citejacked journal fingerprints. The most comprehensive open paper mill detection database.
statcheck Statistics
Nuijten et al. 2016, Behavior Research Methods · github.com/MicheleNuijten/statcheck
Extracts APA-style statistics from text and recalculates p-values from reported test statistics. Detects both statistical errors and decision errors (where significance conclusion changes).
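A statcheck-style recomputation is straightforward for reported z statistics, where the two-sided p-value follows directly from the normal CDF (recomputing t or F p-values needs the incomplete beta function, omitted here). A minimal sketch, not the statcheck package itself:

```python
import math

# Two-sided p-value for a reported z statistic via the normal CDF, which
# math.erfc gives exactly. Illustrative consistency check, not statcheck.
def p_from_z(z: float) -> float:
    return math.erfc(abs(z) / math.sqrt(2))

# Reported "z = 2.10, p = .04": the recomputed value is consistent after rounding.
print(round(p_from_z(2.10), 3))  # 0.036
```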
ODDPub Open Data
Riedel et al. 2020, Data Science Journal · QUEST-BIH · github.com/quest-bih/oddpub
Open Data Detection in Publications. Text-mining algorithm detecting data sharing statements, repository mentions, and accession numbers across field-specific and general repositories.
SoftCite Software Citation
Full-paper software detection methodology. We implement the SoftCite approach of broad-vocabulary, full-paper search — achieving F1≈0.87 vs the RRID-restricted Methods-only approach (F1≈0.27, per Eckmann et al. 2026).
DataSeer-ML Datasets
Dataset mention extraction and type classification from scientific text. Complements ODDPub with structured dataset type identification.
ELIS Image Forensics
Named for Elisabeth Bik · image integrity analysis
Image duplicate detection, copy-move analysis, and Western blot panel verification. Identifies figures that appear multiple times within a paper or show geometric duplication.
ScreenIT Tools Rigor Reporting
Eckmann et al. 2026, PLoS ONE 21(2):e0342225
Empirical benchmark of 11 automated rigor-checking tools. We implement their recommended tool stack: SciScore patterns for blinding (F1≈0.89) and power calculation (F1≈0.79), pre-rob+SciScore ensemble for randomisation (F1≈0.76), and inclusion/exclusion text detection (F1≈0.88).
Semantic Scholar API Literature
Allen Institute for AI · api.semanticscholar.org
200M+ paper corpus used by domain expert agents to verify novelty claims, check if findings are already published, and analyse citation networks for the assessed paper's references.

Methodology inspirations

  • eLife open peer review — Structured assessment format and deliberation protocol for borderline papers
  • OpenAI FrontierScience — Rubric-based evaluation methodology for reasoning quality vs. conclusions
  • Paperpal Preflight (Cactus Communications) — Three-pillar publisher integration model: integrity / language / compliance
  • Research Signals — UX patterns for manuscript screening dashboards
  • Thakkar et al. 2026, Nature Machine Intelligence — ICLR RCT validating that LLM feedback in peer review improves review quality

Limitations & caveats

These limitations are not edge cases — they are central to responsible use of the system. Please read before integrating into any editorial workflow.
  • Not a replacement for peer review. The system identifies technical signals — it does not assess scientific truth. Novel, paradigm-challenging work may score lower than incremental but methodologically clean work.
  • Language bias. All integrity checks and agents are optimised for English-language papers. Non-English text is detected and flagged, but assessment quality is substantially lower.
  • Domain coverage variance. Agent performance is strongest in biomedicine, life sciences, and quantitative social science. Coverage for arts, humanities, and qualitative research is limited.
  • PDF quality dependence. Scanned PDFs, heavily image-based papers, or papers using unusual typesetting may extract poorly, degrading all downstream assessment quality.
  • Statistical tests only. statcheck only catches APA-format reported statistics. Papers using non-standard reporting formats, Bayesian statistics, or only effect sizes without p-values are not fully covered.
  • Absence ≠ problem. A missing blinding statement does not mean blinding was absent — only that it was not reported. Rigor criteria flags indicate reporting gaps, not procedural failures.
  • LLM calibration drift. Model performance changes silently between versions. We pin all model versions and re-validate quarterly. An assessment from 2025 may differ from one in 2026 for the same paper.

Key references

  • Nuijten et al. (2016). The prevalence of statistical reporting errors in psychology. Behavior Research Methods. doi:10.3758/s13428-015-0664-2
  • Riedel et al. (2020). ODDPub. Data Science Journal. doi:10.5334/dsj-2020-042
  • Cabanac et al. (2021). Tortured phrases. Scientometrics. doi:10.1007/s11192-021-04096-2
  • Du et al. (2021). SoftCite dataset. JCDL 2021.
  • Eckmann et al. (2026). ScreenIT: comparing software tools for rigor and transparency. PLoS ONE 21(2):e0342225.
  • Thakkar et al. (2026). A large-scale randomized study of LLM feedback in peer review. Nature Machine Intelligence.