Methodology
A transparent grade signal for preprints — produced by a four-layer pipeline, not a black box.
Every paper on preprints.ai flows through the same pipeline. The output is a single grade on a 25-cell matrix: an integrity letter (A–E) and a novelty number (1–5). The step from agent assessments to the grade is deterministic: identical agent labels always yield the same grade. Two runs over the same paper can therefore differ only where the language models themselves return different labels.
- A: Exceptional
- B: Compelling
- C: Solid
- D: Incomplete
- E: Inadequate
- 5: Landmark
- 4: Fundamental
- 3: Important
- 2: Valuable
- 1: Useful
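As a minimal sketch, the 25-cell matrix above can be modelled as the product of the two scales. The class name `Grade` and the compact "B4" notation are illustrative assumptions, not the site's actual data model.

```python
from dataclasses import dataclass

# Scale names are taken from the lists above.
INTEGRITY = {"A": "Exceptional", "B": "Compelling", "C": "Solid",
             "D": "Incomplete", "E": "Inadequate"}
NOVELTY = {5: "Landmark", 4: "Fundamental", 3: "Important",
           2: "Valuable", 1: "Useful"}


@dataclass(frozen=True)
class Grade:
    """One cell of the matrix (hypothetical representation)."""
    integrity: str  # letter A-E
    novelty: int    # number 1-5

    def __post_init__(self):
        if self.integrity not in INTEGRITY:
            raise ValueError(f"unknown integrity letter: {self.integrity!r}")
        if self.novelty not in NOVELTY:
            raise ValueError(f"unknown novelty number: {self.novelty!r}")

    def code(self) -> str:
        # Compact notation such as "B4" is an assumption for illustration.
        return f"{self.integrity}{self.novelty}"

    def describe(self) -> str:
        return f"{INTEGRITY[self.integrity]} / {NOVELTY[self.novelty]}"


# Every combination of letter and number: 5 x 5 = 25 cells.
ALL_CELLS = [Grade(letter, number) for letter in INTEGRITY for number in NOVELTY]
```

For example, `Grade("B", 4).describe()` yields "Compelling / Fundamental".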
The four layers
Deterministic checks
Before any language model runs, the paper is checked against a battery of open, peer-reviewed tools. Failures here don't fail the paper automatically — they're surfaced as flags that downstream agents and readers can weigh.
- Paper-mill signals — cross-reference against known paper-mill templates and fingerprint databases.
- Retraction watch — match the DOI against the Retraction Watch database.
- Statistical consistency — GRIM, statcheck, and p-value consistency tests.
- Data / code availability — ODDPub + DataSeer-ML parse the manuscript for open data statements.
- Image integrity — ELIS-style duplicate / manipulation scan where full text is available.
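To illustrate what one of these checks does, here is a hedged sketch of the GRIM idea: for integer-valued data, a mean reported to a given number of decimals must be reachable as some integer sum divided by the sample size. The function name and the rounding tolerance are assumptions; this is not the pipeline's own implementation.

```python
import math


def grim_consistent(mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if `mean` could arise from n integer-valued observations.

    Only the two integer sums closest to mean * n need checking; any other
    sum gives a mean that is further away from the reported value.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    half_unit = 0.5 * 10 ** -decimals  # largest possible rounding error
    target = mean * n
    for total in (math.floor(target), math.ceil(target)):
        # Small epsilon absorbs floating-point noise at the boundary.
        if abs(total / n - mean) <= half_unit + 1e-9:
            return True
    return False
```

With n = 28, a reported mean of 5.18 is reachable (145 / 28 ≈ 5.179), while 5.19 is not, so the latter would be surfaced as a flag rather than treated as proof of error.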
Nine-agent review
Nine independent agents read the paper and produce a structured review. Each agent has a specific role and is asked to stay in its lane to preserve independence:
- Methodologist — experimental design, controls, confounders.
- Statistician — tests used, effect sizes, multiple testing correction, sample size.
- Scientific validity — are the claims coherent with the evidence presented?
- Ethics & transparency — ethics statement, funding, conflicts of interest, author contributions.
- Domain primary — the main specialist voice, identified per-paper from subject / keywords.
- Domain clinical — translational / clinical read where relevant.
- Domain methods — field-specific methods, code, and computational reproducibility.
- Domain literature — prior art, novelty claims, contradictions with published work.
- Domain reproducibility — ARRIVE, CONSORT, MIQE and similar field-specific reporting guidelines.
Each agent emits an evidence_strength label (one of five) and a significance label (one of five), along with strengths, concerns, and questions for authors. Models are mixed — some agents run on Claude Sonnet, others on local Ollama (Gemma 3/4, Qwen 2.5) — chosen per role for cost and specialization.
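The structured review described above could be represented roughly as follows. This is a sketch only: the field names other than `evidence_strength` and `significance`, and the assumption that the five labels on each axis are the same words as the grade names, are not confirmed by the text.

```python
import json
from dataclasses import asdict, dataclass, field

# Assumed label vocabularies, borrowed from the grade names.
EVIDENCE_LABELS = ("exceptional", "compelling", "solid", "incomplete", "inadequate")
SIGNIFICANCE_LABELS = ("landmark", "fundamental", "important", "valuable", "useful")


@dataclass
class AgentReview:
    """One agent's output (hypothetical schema)."""
    agent: str
    evidence_strength: str
    significance: str
    strengths: list[str] = field(default_factory=list)
    concerns: list[str] = field(default_factory=list)
    questions_for_authors: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject free-text labels so downstream lookups cannot miss.
        if self.evidence_strength not in EVIDENCE_LABELS:
            raise ValueError(f"bad evidence_strength: {self.evidence_strength!r}")
        if self.significance not in SIGNIFICANCE_LABELS:
            raise ValueError(f"bad significance: {self.significance!r}")

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)
```

Restricting both labels to a closed vocabulary is what lets a later step map them to grades without involving a model.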
Deterministic grade derivation
Grades are not computed by an LLM. Each agent's evidence_strength label maps to an integrity letter, and each significance label maps to a novelty number, using a fixed lookup table verified at 99.8% agreement (Cohen's κ = 0.998) against 800 historical eLife peer reviews.
The panel's grades are aggregated using the mode; ties break toward the lower grade (conservative). This makes the output auditable — you can trace every grade back to the text labels that produced it.
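The lookup and aggregation can be sketched as below. The table contents assume the labels are the same words as the grade names, and the sketch assumes the mode is taken separately for each axis; the published lookup linked from the docs is the authority on both points.

```python
from collections import Counter

# Assumed lookup tables (illustrative; see the linked docs for the real ones).
EVIDENCE_TO_LETTER = {"exceptional": "A", "compelling": "B", "solid": "C",
                      "incomplete": "D", "inadequate": "E"}
SIGNIFICANCE_TO_NUMBER = {"landmark": 5, "fundamental": 4, "important": 3,
                          "valuable": 2, "useful": 1}


def conservative_mode(values, lowest):
    """Most common value; `lowest` picks the lower grade among tied values."""
    counts = Counter(values)
    top = max(counts.values())
    tied = [value for value, count in counts.items() if count == top]
    return lowest(tied)


def derive_grade(labels):
    """labels: iterable of (evidence_strength, significance) pairs, one per agent."""
    labels = list(labels)
    letters = [EVIDENCE_TO_LETTER[evidence.lower()] for evidence, _ in labels]
    numbers = [SIGNIFICANCE_TO_NUMBER[sig.lower()] for _, sig in labels]
    # Letters: E is the lowest grade, so the alphabetically last letter wins a tie.
    letter = conservative_mode(letters, lowest=max)
    # Numbers: 1 is the lowest grade, so the smallest number wins a tie.
    number = conservative_mode(numbers, lowest=min)
    return letter, number
```

With four agents at "compelling", four at "solid", and one at "exceptional", the letter tie between B and C resolves to C.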
Opus arbitration
When agents disagree substantially (agreement < 0.65) or the derived grade sits on a boundary, Claude Opus is called as a senior advisor. It sees the full panel output and either confirms the consensus or moves it up/down by one step with a short written rationale.
Advisor adjustments are logged and visible on the paper report under the summary, prefixed with "Advisor note".
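A hedged sketch of the arbitration trigger follows. The text does not define how agreement is measured, so this assumes it is the share of agents whose grade equals the modal grade; the boundary condition is likewise undefined and is left out. Clamping the advisor's proposal to one step is also an assumption about how the one-step limit is enforced.

```python
from collections import Counter

LETTERS_LOW_TO_HIGH = "EDCBA"  # E is the lowest integrity grade


def panel_agreement(grades) -> float:
    """Assumed metric: fraction of agents that match the most common grade."""
    grades = list(grades)
    if not grades:
        raise ValueError("no grades supplied")
    return Counter(grades).most_common(1)[0][1] / len(grades)


def needs_arbitration(grades, threshold: float = 0.65) -> bool:
    """True when agreement falls below the threshold stated in the text."""
    return panel_agreement(grades) < threshold


def apply_advisor_step(consensus: str, proposed: str,
                       scale: str = LETTERS_LOW_TO_HIGH) -> str:
    """Move at most one step from the consensus toward the advisor's proposal."""
    current = scale.index(consensus)
    wanted = scale.index(proposed)
    step = max(-1, min(1, wanted - current))
    return scale[current + step]
```

Under this reading, a nine-agent panel with five matching grades has agreement 5/9 ≈ 0.56 and is escalated, while six matching grades (≈ 0.67) is not.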
Calibration
Grades are continuously calibrated against two signals:
- Historical eLife reviews — 800 graded papers used as the gold standard for the evidence-to-letter and significance-to-number lookups.
- Publication outcomes — where a preprint is later published in a peer-reviewed journal, the journal's tier is used as a rough novelty proxy. Large systematic disagreements feed back into prompt calibration.
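For readers who want to check a calibration figure themselves, here is a minimal sketch of raw agreement and Cohen's κ between derived grades and a gold standard. It uses the standard definition κ = (p_o − p_e) / (1 − p_e); whether the site computes its figures exactly this way is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    rater_a, rater_b = list(rater_a), list(rater_b)
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Because κ discounts agreement expected by chance, it is generally not the same number as the raw agreement rate.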
Limitations
- This is a machine-generated indicator, not peer review. It complements human review; it does not replace it.
- The pipeline cannot replicate experiments. A paper with fabricated data that reads plausibly will score higher than it should.
- Domain agents are well-calibrated for life sciences. Physics, maths, and CS preprints get a weaker signal.
- Grades are based on the preprint as posted. Revisions are re-scored.
Source code & reproducibility
The pipeline is open to inspection. Prompts, scoring rules, and the deterministic grade lookup are all linked from the docs page. Every assessment is reproducible — the paper text, model versions, and agent responses are logged.
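One way such a log could be made checkable is to fingerprint each run from the three logged inputs. The function and field names below are hypothetical and only illustrate the idea; they do not describe the site's actual log format.

```python
import hashlib
import json


def run_fingerprint(paper_text: str, model_versions: dict, agent_responses: dict) -> str:
    """SHA-256 digest over a canonical JSON record of one assessment run."""
    record = {
        # Hash the paper text so the record stays small.
        "paper_sha256": hashlib.sha256(paper_text.encode("utf-8")).hexdigest(),
        "model_versions": model_versions,
        "agent_responses": agent_responses,
    }
    # Sorted keys and fixed separators make the serialization order-independent.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs with the same paper text, model versions, and agent responses produce the same digest; any difference in one of them changes it.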