Methodology
How Preprints.ai works
Every assessment runs a four-phase pipeline: deterministic integrity checks, a nine-agent AI peer review panel, eLife-style multi-model deliberation, and synthesis into an A5–E1 grade. The system is built on peer-reviewed open source tools and designed so that each phase can be independently audited.
The assessment pipeline
Papers flow through four sequential phases. Each phase can independently cap the final grade — a paper mill detection hit in Layer 1 sets a hard E ceiling regardless of later scores.
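A minimal sketch of that ceiling mechanism, assuming a simple helper (`apply_ceiling`) that is not part of the production code:

```python
# Minimal sketch of the grade-ceiling mechanism; apply_ceiling is an
# illustrative helper, not the production pipeline code.
EVIDENCE_ORDER = "ABCDE"  # A = strongest evidence, E = weakest

def apply_ceiling(proposed: str, ceiling: str | None) -> str:
    """Return whichever of the proposed letter and the hard ceiling is weaker."""
    if ceiling is None:
        return proposed
    return max(proposed, ceiling, key=EVIDENCE_ORDER.index)

# A paper-mill hit in Layer 1 sets ceiling = "E"; even if the panel later
# proposes "B", the final evidence letter stays at "E".
assert apply_ceiling("B", "E") == "E"
assert apply_ceiling("B", None) == "B"
assert apply_ceiling("D", "C") == "D"  # a ceiling can never improve a grade
```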
Layer 1: Deterministic integrity checks
The first phase runs a battery of rule-based checks that don't rely on language models. These checks are fast, reproducible, and independently verifiable. Their results are passed as structured context to the agent panel in Layer 2.
What Layer 1 checks
- Paper mill detection: 8,000+ tortured phrases, SCIgen/Mathgen fingerprints, 39 citejacked journals (PPS, Cabanac et al.)
- Statistical verification: P-value recalculation, GRIM tests, impossible-result detection (statcheck, Nuijten et al.; a minimal GRIM sketch follows this list)
- Image forensics: Duplicate panel detection, copy-move analysis, Western blot signature checking (ELIS)
- Open data detection: Repository mentions, accession numbers, data availability statement presence (ODDPub)
- Software citation: Full-paper detection of software tools + version/RRID completeness (SoftCite methodology)
- Rigor criteria (ScreenIT): Blinding reporting, randomisation description, power calculation, inclusion/exclusion criteria
- Reference verification: Cross-check against Retraction Watch database, detect citejacked journals
- Fabrication detection: Benford's law analysis, terminal digit distribution, SPRITE-style consistency
- Dataset classification: Dataset mention extraction and type classification (DataSeer-ML)
- AI content detection: LLM-generation signatures and machine translation indicators
- Trust markers: ORCID presence, ethics approval numbers, COI disclosure, funding statement
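To make one of these checks concrete, here is a minimal GRIM consistency check in the spirit of Brown and Heathers (2017). The function name and the two-candidate rounding strategy are illustrative simplifications, not the production implementation, and the check assumes integer-valued responses.

```python
# Minimal GRIM consistency check in the spirit of Brown & Heathers (2017);
# the two-candidate rounding strategy below is an illustrative simplification.
def grim_consistent(reported_mean: str, n: int) -> bool:
    """True if the printed mean (e.g. "3.48") is achievable as the mean of
    n integer-valued responses at the reported precision."""
    decimals = len(reported_mean.split(".")[1]) if "." in reported_mean else 0
    mean = float(reported_mean)
    # Try the integer totals closest to mean * n and see whether either
    # rounds back to the reported value.
    for total in (int(mean * n), int(mean * n) + 1):
        if round(total / n, decimals) == round(mean, decimals):
            return True
    return False

print(grim_consistent("3.48", 25))  # True: 87 / 25 = 3.48
print(grim_consistent("3.47", 25))  # False: no integer total over 25 gives 3.47
```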
Layer 2: The 9-agent AI panel
Nine specialist AI reviewers assess the paper concurrently, each grounded in Layer 1 findings. A critical design principle: each agent has a scope boundary enforcing genuine independence. The ensemble benefit only materialises when agents detect different things — not when nine models converge on the same observations.
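A sketch of what scope-bounded, concurrent dispatch can look like. The agent names, scope strings, and the `review_paper` stub are placeholders standing in for pinned-model LLM calls; they are not the production panel.

```python
# Sketch of scope-bounded, concurrent agent dispatch; the agent names, scope
# strings, and review_paper stub are placeholders, not the production panel.
import asyncio

AGENT_SCOPES = {
    "statistics": "Assess only statistical reporting and analysis choices.",
    "methods":    "Assess only study design, controls, and procedures.",
    "images":     "Assess only figure-integrity signals surfaced by Layer 1.",
    # ... six further agents, each with a disjoint scope ...
}

async def review_paper(agent: str, scope: str, paper_text: str, layer1: dict) -> dict:
    # Stand-in for a pinned-model LLM call; the scope string is injected into
    # the prompt so each agent reports only on its own slice of the paper.
    await asyncio.sleep(0)  # placeholder for network latency
    return {"agent": agent, "scope": scope, "findings": []}

async def run_panel(paper_text: str, layer1: dict) -> list[dict]:
    tasks = [review_paper(name, scope, paper_text, layer1)
             for name, scope in AGENT_SCOPES.items()]
    return await asyncio.gather(*tasks)  # all agents run concurrently

print(asyncio.run(run_panel("full paper text ...", {"paper_mill_hit": False})))
```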
Model version pinning
Model version strings are pinned as constants (following Eckmann et al. 2026, PLoS ONE) to prevent silent calibration drift. Version updates are explicit, documented decisions accompanied by re-validation against our calibration test suite.
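A sketch of the pinning pattern, with placeholder identifiers rather than the actual pinned versions or calibration suite:

```python
# Illustrative version-pinning pattern; the identifiers below are placeholders,
# not the models or calibration suite actually in use.
PINNED_MODELS = {
    "panel_agent":  "example-model-2025-06-01",
    "deliberation": "example-model-large-2025-06-01",
}
CALIBRATION_SUITE = "calibration/v3"  # hypothetical re-validation suite tag

def get_model(role: str) -> str:
    # A version bump is a reviewed code change shipped with fresh calibration
    # results against CALIBRATION_SUITE, never an implicit upgrade.
    return PINNED_MODELS[role]
```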
Layer 3: eLife-style deliberation
For borderline papers (panel agreement < 0.75 or any agent flagging a critical concern), three senior models engage in a structured two-round deliberation modelled on eLife's open peer review process:
- Each model independently drafts an eLife-format assessment
- Models read each other's assessments and identify key disagreements
- A structured consensus is negotiated, with explicit reasoning for any resolved disagreements
- Final assessment captures both the consensus grade and the nature of disagreement
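The escalation trigger itself is simple enough to sketch directly; the thresholds follow the description above, while the argument shapes are illustrative.

```python
# Sketch of the Layer 3 escalation rule; thresholds follow the description
# above, the argument shapes are illustrative.
def needs_deliberation(agreement_score: float, critical_flags: list[bool]) -> bool:
    """Escalate when panel agreement is low or any agent raised a critical concern."""
    return agreement_score < 0.75 or any(critical_flags)

print(needs_deliberation(0.81, [False] * 9))           # False: clear consensus
print(needs_deliberation(0.68, [False] * 9))           # True: low agreement
print(needs_deliberation(0.90, [True] + [False] * 8))  # True: critical flag
```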
The A5–E1 Grading Matrix
Every paper receives a two-character grade: a letter (A–E) representing strength of evidence, and a number (1–5) representing significance of findings. The matrix is inspired by eLife's assessment vocabulary and OpenAI FrontierScience's rubric-based evaluation methodology.
| | 5 · Landmark | 4 · Fundamental | 3 · Important | 2 · Valuable | 1 · Useful |
|---|---|---|---|---|---|
| A · Compelling | A5 | A4 | A3 | A2 | A1 |
| B · Convincing | B5 | B4 | B3 | B2 | B1 |
| C · Solid | C5 | C4 | C3 | C2 | C1 |
| D · Incomplete | D5 | D4 | D3 | D2 | D1 |
| E · Inadequate | E5 | E4 | E3 | E2 | E1 |
The significance axis (1–5) reflects the importance of the research question — not the publication venue. A D5 asks a landmark question but lacks evidence to support the claims.
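A small sketch of how the two axes compose, using the labels from the matrix above; the `describe` helper is illustrative rather than part of the system.

```python
# How the two axes compose into a grade, using the labels from the matrix
# above; the describe helper is illustrative, not part of the system.
EVIDENCE = {"A": "Compelling", "B": "Convincing", "C": "Solid",
            "D": "Incomplete", "E": "Inadequate"}
SIGNIFICANCE = {5: "Landmark", 4: "Fundamental", 3: "Important",
                2: "Valuable", 1: "Useful"}

def describe(grade: str) -> str:
    letter, number = grade[0], int(grade[1])
    return f"{EVIDENCE[letter]} evidence for a {SIGNIFICANCE[number].lower()} question"

print(describe("D5"))  # Incomplete evidence for a landmark question
print(describe("B2"))  # Convincing evidence for a valuable question
```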
Grade descriptions
- Compelling evidence: Rigorous methodology, appropriate controls, no statistical anomalies, full data sharing. Ready for high-tier publication.
- Convincing evidence: Sound methodology with minor gaps. Would benefit from revision but claims are well-supported.
- Solid evidence: Meaningful work with significant methodological gaps requiring major revision before publication.
- Incomplete evidence: Interesting question but insufficient evidence. Not submission-ready.
- Inadequate evidence: Serious methodological, statistical, or integrity concerns. Not ready for peer review.
Inter-rater reliability
We measure two forms of agreement: intra-model consistency (same model, same paper, different runs) and inter-model agreement (different models, same paper). The panel agreement score reported with each grade reflects inter-model agreement.
On our validation set of 300 bioRxiv papers with known peer review outcomes, the system achieves a Spearman correlation of ~0.68 with eventual journal acceptance decisions and ~0.72 with editor-assigned quality scores. Individual agents disagree on grade by ≥1 letter in approximately 23% of papers — these are flagged for deliberation.
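The agreement formula is not spelled out here, so the sketch below shows just one plausible way to turn nine evidence letters into a score in [0, 1], as mean pairwise closeness on the A–E scale; the production score may be computed differently.

```python
# One plausible way to turn nine evidence letters into an agreement score in
# [0, 1]; the production formula is not specified here, so this is illustrative.
from itertools import combinations

EVIDENCE_ORDER = "ABCDE"

def panel_agreement(letters: list[str]) -> float:
    """Mean pairwise closeness: 1.0 means unanimous, lower as grades spread."""
    max_gap = len(EVIDENCE_ORDER) - 1
    gaps = [abs(EVIDENCE_ORDER.index(a) - EVIDENCE_ORDER.index(b))
            for a, b in combinations(letters, 2)]
    return 1.0 - (sum(gaps) / len(gaps)) / max_gap

print(round(panel_agreement(["B"] * 9), 2))               # 1.0: unanimous panel
print(round(panel_agreement(["B"] * 7 + ["C", "D"]), 2))  # 0.85: mild spread
print(round(panel_agreement(list("ABBCCCDDE")), 2))       # 0.64: below 0.75, escalated
```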
Open source tools we build on
Preprints.ai is deliberately built on publicly auditable, peer-reviewed tools. We do not black-box our integrity checks.
Methodology inspirations
- eLife open peer review — Structured assessment format and deliberation protocol for borderline papers
- OpenAI FrontierScience — Rubric-based evaluation methodology for reasoning quality vs. conclusions
- Paperpal Preflight (Cactus Communications) — Three-pillar publisher integration model: integrity / language / compliance
- Research Signals — UX patterns for manuscript screening dashboards
- Thakkar et al. 2026, Nature Machine Intelligence — A randomised controlled trial at ICLR showing that LLM-generated feedback improves peer review quality
Limitations & caveats
- Not a replacement for peer review. The system identifies technical signals — it does not assess scientific truth. Novel, paradigm-challenging work may score lower than incremental but methodologically clean work.
- Language bias. All integrity checks and agents are optimised for English-language papers. Non-English text is detected and flagged, but assessment quality is substantially lower.
- Domain coverage variance. Agent performance is strongest in biomedicine, life sciences, and quantitative social science. Coverage for arts, humanities, and qualitative research is limited.
- PDF quality dependence. Scanned PDFs, heavily image-based papers, or papers using unusual typesetting may extract poorly, degrading all downstream assessment quality.
- Partial statistical coverage. statcheck only catches APA-format reported statistics. Papers using non-standard reporting formats, Bayesian statistics, or only effect sizes without p-values are not fully covered.
- Absence ≠ problem. A missing blinding statement does not mean blinding was absent — only that it was not reported. Rigor criteria flags indicate reporting gaps, not procedural failures.
- LLM calibration drift. Model performance changes silently between versions. We pin all model versions and re-validate quarterly. An assessment from 2025 may differ from one in 2026 for the same paper.
Key references
- Nuijten et al. (2016). The prevalence of statistical reporting errors in psychology. Behavior Research Methods. doi:10.3758/s13428-015-0664-2
- Riedel et al. (2020). ODDPub. Data Science Journal. doi:10.5334/dsj-2020-042
- Cabanac et al. (2021). Tortured phrases. Scientometrics. doi:10.1007/s11192-021-04096-2
- Du et al. (2021). SoftCite dataset. JCDL 2021.
- Eckmann et al. (2026). ScreenIT: comparing software tools for rigor and transparency. PLoS ONE 21(2):e0342225.
- Thakkar et al. (2026). A large-scale randomized study of LLM feedback in peer review. Nature Machine Intelligence.