Preprints.ai
Methodology

A transparent grade signal for preprints — three independent axes from a four-layer pipeline, not a single black-box score.

Every paper flows through the same pipeline and comes out with three independent grades plus a synthesised verdict. The three axes are deliberately separate — they answer different questions, and a paper can score high on one and low on another. Reading them as a single “overall mark” is the most common mistake.

The three axes

1 · Trust grade (A–E) — the headline letter

A deterministic transparency score: does the paper publish the open-science markers a trustworthy paper should? It is not a measure of how good the science is — a brilliant study with no data-availability statement scores low here, and a mediocre study that ticks every box scores high. No language model decides this letter; it is computed by a fixed rule from markers detected in the manuscript text, so it is fully reproducible. Every detected marker is shown on the report with its supporting quote and line number.

  • A Excellent — ticks nearly every applicable integrity box
  • B Strong — most boxes ticked
  • C Moderate — some boxes ticked
  • D Limited — few boxes ticked
  • E None — essentially no integrity markers

Boxes weighed: data availability, code availability, conflict-of-interest disclosure, preregistration, funding statement, author contributions, ORCID, and, only when the work involves human or animal subjects, an ethics statement. Downward caps: the deterministic fraud / error checks (paper-mill and fabrication scans, statcheck-style statistical-consistency checks, retracted-reference detection) can cap the grade regardless of how many boxes are ticked.
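As a rough illustration of how a fixed rule of this kind can work, the sketch below grades a paper by the fraction of applicable boxes ticked and then applies a downward cap. The marker names, the equal weighting, the cut-offs and the `cap` argument are all assumptions made for the example; they are not taken from the production rules.

```python
from typing import Optional, Set

# Hypothetical marker names; the real identifiers may differ.
CORE_MARKERS = (
    "data_availability", "code_availability", "coi_disclosure",
    "preregistration", "funding_statement", "author_contributions", "orcid",
)
GRADES = "ABCDE"  # ordered best to worst

# Assumed cut-offs on the fraction of applicable boxes ticked.
CUTOFFS = ((0.85, "A"), (0.65, "B"), (0.40, "C"), (0.15, "D"))


def trust_grade(found: Set[str], needs_ethics: bool,
                cap: Optional[str] = None) -> str:
    """Grade from detected markers, then apply an optional downward cap."""
    applicable = list(CORE_MARKERS)
    if needs_ethics:
        # The ethics box only counts for human or animal subjects work.
        applicable.append("ethics_statement")
    fraction = sum(marker in found for marker in applicable) / len(applicable)
    grade = next((g for cutoff, g in CUTOFFS if fraction >= cutoff), "E")
    # A cap can only lower the grade, never raise it.
    if cap is not None and GRADES.index(grade) < GRADES.index(cap):
        grade = cap
    return grade
```

Because the function depends only on its inputs, the same detected markers always produce the same letter, which is the reproducibility property described above.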

2 · Novelty (1–10) — lower = more novel

How novel the contribution is, on a 1–10 scale where lower means more novel. Scores are percentile-binned across the whole corpus into a bell curve centred on 5 and are synthesised from the panel's significance read. Novelty is independent of the trust grade: a paper can be highly novel but opaque (E2) or fully transparent but incremental (A8).
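Percentile-binning can be sketched as follows: rank the paper's significance against the corpus, then look up which bin of a bell-shaped distribution that rank falls into. The bin shares below are hypothetical (chosen only so that the curve peaks at 5 and the median lands there), as are the function and argument names.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Sequence

# Hypothetical share of the corpus (in percent) that lands in scores 1..10.
BIN_SHARES = (2, 5, 10, 17, 22, 17, 12, 8, 5, 2)
# Upper percentile edge of each score: [2, 7, 17, 34, 56, 73, 85, 93, 98, 100]
CUMULATIVE = list(accumulate(BIN_SHARES))


def novelty_score(significance: float, corpus: Sequence[float]) -> int:
    """Map a significance value to 1..10, where 1 is the most novel."""
    if not corpus:
        raise ValueError("corpus must not be empty")
    # Percentage of corpus papers that are strictly more significant.
    more_significant = sum(1 for s in corpus if s > significance)
    top_percentile = 100.0 * more_significant / len(corpus)
    # Find the first bin whose upper edge lies above that percentile.
    return min(bisect_right(CUMULATIVE, top_percentile) + 1, 10)
```

With shares like these, only the top 2% of the corpus would receive a 1, and a paper at the corpus median would receive a 5.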

  • 1–3: Landmark · Major breakthrough · Fundamental
  • 4–6: Significant · Important (median) · Notable
  • 7–10: Valuable · Incremental · Marginal · Minimal

3 · Evidence strength — methodology quality

The panel's read on how good the science is — methodology, controls, and whether the data support the claims. Shown as a word (Exceptional → Inadequate) plus a continuous 0–1 confidence-weighted score. This is the “is the work sound?” axis that readers often expect the headline letter to be; here it is kept separate from transparency so neither masks the other.
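One plausible reading of a "confidence-weighted score" is a weighted mean in which each agent's score counts in proportion to its stated confidence. The sketch below assumes that reading; the `(score, confidence)` input shape and the function name are illustrative, and the real aggregation may differ.

```python
from typing import Iterable, Optional, Tuple


def evidence_strength(reviews: Iterable[Tuple[float, float]]) -> Optional[float]:
    """Confidence-weighted mean of per-agent scores, all in the range 0..1.

    Each review is a (score, confidence) pair. Returns None when no agent
    expressed any confidence, so that no score is invented.
    """
    pairs = list(reviews)
    total_confidence = sum(confidence for _, confidence in pairs)
    if total_confidence == 0:
        return None
    weighted = sum(score * confidence for score, confidence in pairs)
    return weighted / total_confidence
```

Under this scheme a confident reviewer moves the result more than a hesitant one: a score of 0.8 at confidence 1.0 combined with 0.4 at confidence 0.5 gives roughly 0.67, not the plain average of 0.6.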

The verdict

The verdict is a single editorial decision (Accept → Reject): what an editorial board would likely decide. It is derived deterministically from two of the axes, evidence strength × novelty. A finer-grained Publication Fit tier (1–10, anchored to journal exemplars: tier 1 ≈ Nature, tier 5 ≈ PLOS One median) expresses the same call as a number. Both use the same inputs; the editorial decision is the headline and Publication Fit the detail.
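A deterministic evidence × novelty rule is naturally expressed as a lookup grid. The sketch below is only an illustration of that shape: the evidence cut-offs, the intermediate decision labels and every cell of the grid are assumptions, and only the novelty bands (1–3, 4–6, 7–10) are taken from the scale above.

```python
# Rows: evidence band, strongest first. Columns: novelty 1-3, 4-6, 7-10.
# Every cell is hypothetical; the real lookup may differ.
GRID = (
    ("Accept",         "Accept",         "Minor revision"),
    ("Accept",         "Minor revision", "Major revision"),
    ("Major revision", "Major revision", "Reject"),
    ("Reject",         "Reject",         "Reject"),
)


def evidence_band(score: float) -> int:
    """Assumed cut-offs on the 0..1 evidence score (0 is the strongest band)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("evidence score must be in 0..1")
    if score >= 0.75:
        return 0
    if score >= 0.50:
        return 1
    if score >= 0.25:
        return 2
    return 3


def novelty_band(novelty: int) -> int:
    """Bands from the novelty scale: 1-3, 4-6, 7-10."""
    if not 1 <= novelty <= 10:
        raise ValueError("novelty must be in 1..10")
    return 0 if novelty <= 3 else 1 if novelty <= 6 else 2


def editorial_decision(evidence: float, novelty: int) -> str:
    return GRID[evidence_band(evidence)][novelty_band(novelty)]
```

Keeping the rule as a table means a reader can audit any decision by finding the one cell that produced it.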

The pipeline

01

Deterministic checks (Layer 1)

Non-LLM · seconds

Before any language model runs, the paper is checked against a battery of open tools. These both detect the trust markers (with provenance) and run the fraud / error checks that can cap the trust grade.

  • Trust-marker detection — data / code availability, COI, ethics, funding, author contributions, preregistration, ORCID — each captured with the exact quote and line number.
  • Paper-mill signals — template / fingerprint cross-reference.
  • Retracted references — every cited DOI is matched against retraction records; ≥3 retracted citations caps the grade.
  • Statistical consistency — GRIM and statcheck-style p-value checks.
  • Text-capture gate — a paper is only reviewed if extraction reached the references section, so “marker absent” claims are made only where we have actually read the relevant part of the manuscript. Papers below the bar are re-queued, not graded on a partial text.
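Two of these checks are simple enough to sketch. The first is a GRIM-style test, which asks whether a reported mean of integer-valued data is arithmetically possible for the stated sample size. The second counts cited DOIs that appear in a set of retracted DOIs. Function names and the input format are illustrative assumptions; only the threshold of three retracted citations comes from the list above.

```python
import math
from typing import Iterable, List

RETRACTED_CITATION_LIMIT = 3  # from the text: three or more caps the grade


def grim_consistent(mean: float, n: int, decimals: int = 2) -> bool:
    """True if some integer total divided by n rounds to the reported mean.

    Only meaningful for integer-valued data (for example Likert items).
    Half-way cases are accepted under either rounding direction.
    """
    tolerance = 0.5 * 10 ** -decimals + 1e-9
    approx_total = mean * n
    # Try the integer totals closest to mean * n.
    for total in range(math.floor(approx_total) - 1, math.ceil(approx_total) + 2):
        if abs(total / n - mean) <= tolerance:
            return True
    return False


def retracted_citations(cited_dois: Iterable[str],
                        retracted_dois: Iterable[str]) -> List[str]:
    """Cited DOIs found among retracted DOIs (DOIs are case-insensitive)."""
    retracted = {doi.strip().lower() for doi in retracted_dois}
    cited = {doi.strip().lower() for doi in cited_dois}
    return sorted(cited & retracted)


def retraction_cap_triggered(cited_dois: Iterable[str],
                             retracted_dois: Iterable[str]) -> bool:
    return len(retracted_citations(cited_dois, retracted_dois)) >= RETRACTED_CITATION_LIMIT
```

For example, a mean of 5.19 reported for 28 integer responses fails the GRIM test, because the nearest achievable means are 145/28 ≈ 5.18 and 146/28 ≈ 5.21.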

02

Eleven-agent review (Layer 2)

Parallel · minutes

Eleven agents read the full text and produce structured reviews. Each has a role and stays in its lane:

  • Methodologist — design, controls, confounders.
  • Statistician — tests, effect sizes, multiple-testing correction, sample size.
  • Scientific validity — do the claims follow from the evidence?
  • Ethics & transparency — ethics, funding, COI, contributions.
  • Domain primary / clinical / methods / literature / reproducibility — five specialist voices, selected per paper.
  • Devil's advocate — adversarial critic stressing the consensus.
  • Positive evidence — counter-weight that surfaces what the paper does well, against a 10-axis rubric.

Models are mixed per role for cost and specialization — some agents run on Claude (Sonnet / Haiku), others on local Ollama (Gemma, Qwen). Each agent emits structured strengths, weaknesses, scores and a recommendation.
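A structured review of this kind could be represented by a small record type. The sketch below is hypothetical: the field names follow the four outputs named above (strengths, weaknesses, scores, recommendation), while `role`, `model` and the JSON helper are additions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class AgentReview:
    """One agent's structured output (hypothetical schema)."""
    role: str    # for example "statistician"
    model: str   # which model produced this review
    strengths: List[str] = field(default_factory=list)
    weaknesses: List[str] = field(default_factory=list)
    scores: Dict[str, float] = field(default_factory=dict)
    recommendation: str = ""


def to_json(review: AgentReview) -> str:
    """Serialise a review so it can be stored alongside the assessment."""
    return json.dumps(asdict(review), sort_keys=True)
```

A fixed schema lets the later rule-based layers consume every agent's output the same way, whichever model produced it.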

A note on independence: AI reviewers overlap with each other far more than human reviewers do, so panel agreement reflects shared reasoning as much as corroboration. We therefore show how many distinct agents independently raised each concern (a concern echoed by several agents is more reliable), and flag papers where the panel was split.
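Counting distinct agents per concern, and flagging a split panel, might look like the sketch below. It assumes concerns have already been normalised to comparable labels, which is the hard part in practice; the 0.7 majority threshold and the function names are likewise assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def concern_support(concerns_by_agent: Dict[str, Iterable[str]]) -> Counter:
    """Number of distinct agents that raised each concern."""
    support: Counter = Counter()
    for concerns in concerns_by_agent.values():
        # set() ensures an agent repeating itself is counted once.
        support.update(set(concerns))
    return support


def panel_is_split(recommendations: Sequence[str],
                   majority_threshold: float = 0.7) -> bool:
    """True when no single recommendation reaches the assumed threshold."""
    if not recommendations:
        raise ValueError("no recommendations to compare")
    top_count = Counter(recommendations).most_common(1)[0][1]
    return top_count / len(recommendations) < majority_threshold
```

Counting agents, not mentions, keeps one verbose reviewer from inflating the apparent support for a concern.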

03

Deterministic derivation (Layer 3)

Pure rules · no LLM

The grades are computed by fixed rules, not by a model, so they are reproducible and auditable:

  • Trust grade from the Layer-1 markers + fraud caps (core/integrity_markers.py).
  • Novelty 1–10 by percentile-binning the panel's significance into the corpus bell curve (core/novelty_scale.py).
  • Evidence strength by aggregating the panel's confidence-weighted scores.
  • Editorial decision & Publication Fit from evidence × novelty (core/publication_tier.py).
  • Major claims are extracted with source quotes and each given a deterministic verdict (supported / partial / unsupported) by matching the panel's concerns against the claim.

04

Arbitration (Layer 4)

Borderline cases only

When the panel disagrees or the grade sits on a boundary, a senior arbiter (DeepSeek-R1, with Claude Opus as fallback) reviews the full panel output and confirms or nudges the consensus with a written rationale, logged on the report. A scientific-validity cap requires at least two agents to concur before it can force a paper down — a single contrarian reviewer cannot veto the consensus.
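The control flow described here can be sketched in a few lines: a trigger for arbitration, the two-agent requirement for the scientific-validity cap, and a primary arbiter with a fallback. The arbiters are passed in as plain callables, so no real model API is assumed; all names are hypothetical.

```python
from typing import Callable, Iterable

MIN_CONCURRING_AGENTS = 2  # from the text: at least two agents must concur


def needs_arbitration(panel_split: bool, on_grade_boundary: bool) -> bool:
    """Arbitration runs only for borderline cases."""
    return panel_split or on_grade_boundary


def validity_cap_applies(agents_flagging: Iterable[str]) -> bool:
    """The cap needs two distinct agents; one contrarian cannot veto."""
    return len(set(agents_flagging)) >= MIN_CONCURRING_AGENTS


def arbitrate(panel_output: dict,
              primary: Callable[[dict], str],
              fallback: Callable[[dict], str]) -> str:
    """Ask the primary arbiter, falling back if it fails for any reason."""
    try:
        return primary(panel_output)
    except Exception:
        return fallback(panel_output)
```

Deduplicating agent names before counting means a single agent that flags the same problem twice still cannot trigger the cap on its own.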

Calibration & validation

Limitations

Source code & reproducibility

The deterministic grade lookups, trust-marker rules and prompts are open to inspection (linked from the docs page), and the per-check evidence behind each signal is shown on the Evidence pages. Every assessment logs its paper text, model versions and per-agent responses, so any grade can be traced back to the inputs that produced it.