Preprints.ai
Methodology

A transparent grade signal for preprints — produced by a four-layer pipeline, not a black box.

Every paper on preprints.ai flows through the same pipeline. The output is a single grade on a 25-cell (5 × 5) matrix: an integrity letter (A–E) and a novelty number (1–5). The step from agent assessments to grade is deterministic, so the same assessments always produce the same grade. Two runs over the same paper can still differ when the underlying models return different assessments.

Integrity letter
  • A: Exceptional
  • B: Compelling
  • C: Solid
  • D: Incomplete
  • E: Inadequate
Novelty number
  • 5: Landmark
  • 4: Fundamental
  • 3: Important
  • 2: Valuable
  • 1: Useful
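
As a rough illustration, the two scales above can be modelled as a small value type. This is a minimal sketch, not the site's own code: the `Grade` class, its field names, and the compact "B4" rendering are assumptions made for the example. Only the letters, numbers, and label names come from the lists above.

```python
from dataclasses import dataclass

# Labels as listed on this page; the dictionary names are illustrative.
INTEGRITY = {"A": "Exceptional", "B": "Compelling", "C": "Solid",
             "D": "Incomplete", "E": "Inadequate"}
NOVELTY = {5: "Landmark", 4: "Fundamental", 3: "Important",
           2: "Valuable", 1: "Useful"}


@dataclass(frozen=True)
class Grade:
    """One cell of the matrix (hypothetical type, for illustration only)."""
    integrity: str  # "A" (best) to "E" (worst)
    novelty: int    # 5 (best) to 1 (worst)

    def __post_init__(self):
        if self.integrity not in INTEGRITY:
            raise ValueError(f"unknown integrity letter: {self.integrity!r}")
        if self.novelty not in NOVELTY:
            raise ValueError(f"unknown novelty number: {self.novelty!r}")

    def describe(self) -> str:
        # The compact "B4" form is an assumed rendering, not a documented one.
        labels = f"{INTEGRITY[self.integrity]}, {NOVELTY[self.novelty]}"
        return f"{self.integrity}{self.novelty} ({labels})"


# Every cell of the 5 x 5 matrix.
ALL_GRADES = [Grade(letter, number) for letter in INTEGRITY for number in NOVELTY]
```

Keeping the two axes as separate fields means each one can be traced and aggregated on its own.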

The four layers

01

Deterministic checks

Non-LLM · under 1 second

Before any language model runs, the paper is checked against a battery of open, peer-reviewed tools. Failures here don't fail the paper automatically — they're surfaced as flags that downstream agents and readers can weigh.

  • Paper-mill signals — cross-reference against known paper-mill templates and fingerprint databases.
  • Retraction watch — match the DOI against the Retraction Watch database.
  • Statistical consistency — GRIM, statcheck, and p-value consistency tests.
  • Data / code availability — ODDPub + DataSeer-ML parse the manuscript for open data statements.
  • Image integrity — ELIS-style duplicate / manipulation scan where full text is available.
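
To give a flavour of what a statistical consistency check does, here is a minimal sketch of the idea behind GRIM: a mean of n integer-valued responses must equal some whole-number sum divided by n, so certain reported means are impossible for a given sample size. The function name and signature are assumptions for illustration. This is not the pipeline's implementation, and it only applies to data made of integers (for example Likert items).

```python
def grim_consistent(mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if `mean`, reported to `decimals` places, could be the
    average of `n` integer-valued observations."""
    if n <= 0:
        raise ValueError("n must be positive")
    # The integer sum closest to mean * n is the best possible candidate:
    # if it cannot reproduce the reported mean, no other integer sum can.
    nearest_sum = round(mean * n)
    # Accept anything that rounds to the reported value in either
    # rounding convention (half-up or half-even).
    half_unit = 0.5 * 10 ** -decimals
    return abs(nearest_sum / n - mean) <= half_unit + 1e-9


print(grim_consistent(5.18, 28))  # 145 / 28 = 5.1786, which rounds to 5.18
print(grim_consistent(5.19, 28))  # no integer sum over 28 items gives 5.19
```

A failed check of this kind would surface as a flag for agents and readers to weigh, in line with the note above that failures do not fail the paper automatically.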
02

Nine-agent review

Parallel · 3–8 minutes

Nine independent agents read the paper and produce a structured review. Each agent has a specific role and is asked to stay in its lane to preserve independence:

  • Methodologist — experimental design, controls, confounders.
  • Statistician — tests used, effect sizes, multiple testing correction, sample size.
  • Scientific validity — are the claims coherent with the evidence presented?
  • Ethics & transparency — ethics statement, funding, conflicts of interest, author contributions.
  • Domain primary — the main specialist voice, identified per-paper from subject / keywords.
  • Domain clinical — translational / clinical read where relevant.
  • Domain methods — field-specific methods, code, and computational reproducibility.
  • Domain literature — prior art, novelty claims, contradictions with published work.
  • Domain reproducibility — ARRIVE, CONSORT, MIQE and similar field-specific reporting guidelines.

Each agent emits an evidence_strength label (one of five) and a significance label (one of five), along with strengths, concerns, and questions for authors. Models are mixed: some agents run on Claude Sonnet, others on local models served through Ollama (Gemma 3/4, Qwen 2.5). The model is chosen per role for cost and specialization.
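
The shape of one agent's output might look like the sketch below. Only `evidence_strength`, `significance`, and the three free-text sections come from the description above. The class name, the other field names, and the exact label strings (assumed here to be lowercase forms of the grade names) are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed label vocabularies: lowercase forms of the grade names on this page.
EVIDENCE_LABELS = {"exceptional", "compelling", "solid", "incomplete", "inadequate"}
SIGNIFICANCE_LABELS = {"landmark", "fundamental", "important", "valuable", "useful"}


@dataclass
class AgentReview:
    """Structured review from a single agent (illustrative schema)."""
    agent: str                 # role name, e.g. "statistician"
    evidence_strength: str     # one of EVIDENCE_LABELS
    significance: str          # one of SIGNIFICANCE_LABELS
    strengths: list[str] = field(default_factory=list)
    concerns: list[str] = field(default_factory=list)
    questions_for_authors: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject free-form labels early so the later lookup cannot silently miss.
        if self.evidence_strength not in EVIDENCE_LABELS:
            raise ValueError(f"bad evidence_strength: {self.evidence_strength!r}")
        if self.significance not in SIGNIFICANCE_LABELS:
            raise ValueError(f"bad significance: {self.significance!r}")


# Made-up example values, for illustration only.
review = AgentReview(
    agent="statistician",
    evidence_strength="solid",
    significance="important",
    concerns=["No correction for multiple comparisons is reported."],
)
```

Restricting both labels to a closed vocabulary is what lets a later stage treat them as lookup keys rather than free text.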

03

Deterministic grade derivation

Pure lookup · under 1 ms

Grades are not computed by an LLM. Each agent's evidence_strength label maps to an integrity letter, and each significance label maps to a novelty number, using a fixed lookup table verified at 99.8% agreement (Cohen's κ = 0.998) against 800 historical eLife peer reviews.
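
A lookup of this kind can be sketched in a few lines. The table contents below are an assumption: they pair each grade name listed on this page with its letter or number, and the real table is the one linked from the docs. The function name is hypothetical.

```python
# Assumed mapping, built from the grade names shown on this page.
EVIDENCE_TO_INTEGRITY = {
    "exceptional": "A", "compelling": "B", "solid": "C",
    "incomplete": "D", "inadequate": "E",
}
SIGNIFICANCE_TO_NOVELTY = {
    "landmark": 5, "fundamental": 4, "important": 3,
    "valuable": 2, "useful": 1,
}


def derive_grade(evidence_strength: str, significance: str) -> tuple[str, int]:
    """Map one agent's two labels to (integrity letter, novelty number)."""
    evidence_key = evidence_strength.strip().lower()
    significance_key = significance.strip().lower()
    # Unknown labels raise instead of being guessed, which keeps the step auditable.
    if evidence_key not in EVIDENCE_TO_INTEGRITY:
        raise ValueError(f"unknown evidence_strength: {evidence_strength!r}")
    if significance_key not in SIGNIFICANCE_TO_NOVELTY:
        raise ValueError(f"unknown significance: {significance!r}")
    return EVIDENCE_TO_INTEGRITY[evidence_key], SIGNIFICANCE_TO_NOVELTY[significance_key]
```

For instance, `derive_grade("Compelling", "important")` returns `("B", 3)` under this assumed table. Because the function has no state and no randomness, the same labels always yield the same pair.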

The panel's grades are aggregated by taking the mode, the grade assigned most often; ties break toward the lower grade, the conservative choice. This makes the output auditable: you can trace every grade back to the text labels that produced it.
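
The aggregation rule can be sketched as follows. One assumption is made loudly here: the letter and the number are aggregated independently, which the text does not state. Function and variable names are illustrative.

```python
from collections import Counter

# Higher rank means a better grade; E is the lowest letter, 1 the lowest number.
LETTER_RANK = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}


def conservative_mode(values, rank):
    """Most frequent value; on a tie, the one with the lowest rank wins."""
    if not values:
        raise ValueError("cannot aggregate an empty panel")
    counts = Counter(values)
    top = max(counts.values())
    tied = [value for value, count in counts.items() if count == top]
    return min(tied, key=rank)


def aggregate(panel):
    """panel: list of (letter, number) pairs, one per agent.
    Assumption: each axis is aggregated on its own."""
    letters = [letter for letter, _ in panel]
    numbers = [number for _, number in panel]
    return (
        conservative_mode(letters, LETTER_RANK.__getitem__),
        conservative_mode(numbers, lambda number: number),
    )


panel = [("B", 3), ("B", 3), ("B", 3), ("B", 3), ("C", 3),
         ("C", 4), ("C", 4), ("C", 2), ("A", 2)]
print(aggregate(panel))  # letters tie 4-4 between B and C, so C is chosen
```

In the example panel, four agents say B and four say C, so the tie resolves to C; the novelty mode is 3 with five votes.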

04

Opus arbitration

Borderline cases only

When agents disagree substantially (agreement < 0.65) or the derived grade sits on a boundary, Claude Opus is called as a senior advisor. It sees the full panel output and either confirms the consensus or moves it up/down by one step with a short written rationale.
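
The trigger and the one-step limit might be expressed as below. Only the 0.65 threshold and the "one step up or down" rule come from the text. The agreement metric (share of agents choosing the most common label) is an assumption, as are the function names. How a boundary case is detected is not specified, so it is passed in as a flag.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 0.65  # from the text above
LETTERS = "ABCDE"           # best to worst


def agreement(labels):
    """Assumed metric: fraction of agents that chose the most common label."""
    if not labels:
        raise ValueError("no labels to compare")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def needs_arbitration(labels, on_boundary=False):
    """True when the panel disagrees substantially or the grade is borderline.
    `on_boundary` is supplied by the caller; its definition is not given here."""
    return on_boundary or agreement(labels) < AGREEMENT_THRESHOLD


def apply_adjustment(letter, step):
    """step: +1 moves one grade up (toward A), -1 one grade down, 0 confirms.
    Anything larger than one step is rejected; results are clamped to A..E."""
    if step not in (-1, 0, 1):
        raise ValueError("the advisor may move the grade by at most one step")
    index = LETTERS.index(letter) - step
    return LETTERS[min(max(index, 0), len(LETTERS) - 1)]
```

With nine agents under this assumed metric, six matching labels give 0.67 and pass the threshold, while five matching labels give 0.56 and trigger arbitration.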

Advisor adjustments are logged and visible on the paper report under the summary, prefixed with "Advisor note".

Calibration

Grades are continuously calibrated against two signals:

Limitations

Source code & reproducibility

The pipeline is open to inspection. Prompts, scoring rules, and the deterministic grade lookup are all linked from the docs page. Every assessment is reproducible — the paper text, model versions, and agent responses are logged.