Preprints.ai
How it works

Methodology

A transparent grade signal for preprints — three independent axes from a four-layer pipeline, not a single black-box score.

Every paper flows through the same pipeline and comes out with a three-part grade plus a synthesised verdict. The format is [Evidence][Trust][Novelty]: for example, BC6 means compelling evidence, moderate trust and notable novelty. The three axes answer different questions and are deliberately independent, so a paper can score high on one and low on another.
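As an illustration of the grade format, the sketch below splits a grade string into its three axes and spells it out in words. The function names and the `Grade` type are hypothetical, and the one-label-per-score novelty mapping is an assumption read off the novelty bands further down this page.

```python
import re
from typing import NamedTuple

# Labels taken from the axis descriptions on this page. Assigning one novelty
# label to each score (1..10) is an assumption based on the three bands.
EVIDENCE_LABELS = {"A": "Exceptional", "B": "Compelling", "C": "Solid",
                   "D": "Incomplete", "E": "Inadequate"}
TRUST_LABELS = {"A": "Excellent", "B": "Strong", "C": "Moderate",
                "D": "Limited", "E": "Minimal"}
NOVELTY_LABELS = {1: "Landmark", 2: "Major breakthrough", 3: "Fundamental",
                  4: "Significant", 5: "Important", 6: "Notable",
                  7: "Valuable", 8: "Incremental", 9: "Marginal", 10: "Minimal"}


class Grade(NamedTuple):
    evidence: str   # letter A-E
    trust: str      # letter A-E
    novelty: int    # 1 (most novel) to 10 (least novel)


# Two letters followed by a number from 1 to 10 (so "EE10" has four characters).
_GRADE_RE = re.compile(r"([A-E])([A-E])(10|[1-9])")


def parse_grade(code: str) -> Grade:
    """Split a grade such as 'BC6' into its three axes."""
    match = _GRADE_RE.fullmatch(code.strip().upper())
    if match is None:
        raise ValueError(f"not a valid grade: {code!r}")
    return Grade(match.group(1), match.group(2), int(match.group(3)))


def describe(grade: Grade) -> str:
    """Render a grade in words, using the labels above."""
    return (f"{EVIDENCE_LABELS[grade.evidence]} evidence, "
            f"{TRUST_LABELS[grade.trust]} trust, "
            f"{NOVELTY_LABELS[grade.novelty]} novelty")


print(describe(parse_grade("BC6")))
```

Because the novelty score can be 10, the worst grade (EE10) is four characters long, which is why the pattern accepts either one digit or "10".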

The three axes

1 · Evidence (A–E) — is the science sound?

A calibrated AI judge reads the full manuscript and rates methodology quality, internal consistency, and how well the data support the claims. This is the “is this good science?” axis — the first letter of the grade.

  • A Exceptional — rigorous design, strong controls, claims fully supported
  • B Compelling — solid work with minor limitations
  • C Solid — adequate but with notable gaps
  • D Incomplete — significant methodological weaknesses
  • E Inadequate — fundamental flaws undermine the conclusions

2 · Trust (A–E) — can we verify and reproduce it?

A deterministic compliance score: how many of the verifiable open-science checks expected of a trustworthy paper does this paper meet? No language model decides this letter. It is computed by a fixed rule from checks on the manuscript text, so it is fully reproducible. Each check is counted only when it applies.

  • A Excellent — meets nearly every applicable check
  • B Strong — most checks met
  • C Moderate — some checks met
  • D Limited — few checks met
  • E Minimal — essentially no trust signals

What it weighs (each counted only when it applies): transparency markers — data & code availability, conflict-of-interest disclosure, funding, author contributions, and (for human/animal-subject work) an ethics statement — plus reproducibility, statistical rigor, and citation health. Downward caps: the fraud / error checks (paper-mill and fabrication scans, statcheck statistical-consistency, retracted-reference detection) force the grade down regardless of the rest.
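The rule described above can be sketched roughly as follows. This is not the production rule: the unweighted fraction, the cut-offs in `THRESHOLDS` and the function name are all assumptions chosen for illustration. What the sketch does show is the two properties stated above, namely that non-applicable checks are left out of the denominator and that a cap can only move the letter down.

```python
from typing import Mapping, Optional, Sequence

GRADES = "ABCDE"  # ordered best to worst

# Hypothetical cut-offs on the fraction of applicable checks that are met.
THRESHOLDS = [(0.90, "A"), (0.70, "B"), (0.40, "C"), (0.15, "D")]


def trust_grade(checks: Mapping[str, Optional[bool]],
                caps: Sequence[str] = ()) -> str:
    """Compute a trust letter from check results.

    checks maps a check name to True (met), False (not met) or
    None (does not apply to this paper, so it is not counted).
    caps lists the letters imposed by fraud / error checks.
    """
    applicable = [met for met in checks.values() if met is not None]
    fraction = sum(applicable) / len(applicable) if applicable else 0.0

    grade = "E"
    for cutoff, letter in THRESHOLDS:
        if fraction >= cutoff:
            grade = letter
            break

    # A cap can only lower the grade, never raise it.
    for cap in caps:
        if GRADES.index(cap) > GRADES.index(grade):
            grade = cap
    return grade


paper = {
    "data_availability": True,
    "code_availability": True,
    "coi_disclosure": True,
    "funding": True,
    "author_contributions": True,
    "ethics_statement": None,   # no human or animal subjects: not applicable
}
print(trust_grade(paper))               # all applicable checks met
print(trust_grade(paper, caps=["D"]))   # same paper with a fraud / error cap
```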

3 · Novelty (1–10) — is this new?

Novelty is rated on a 1–10 scale where 1 is the most novel and 10 the least. Scores are synthesised from the panel's significance read and percentile-binned across the whole corpus into a bell curve centred on 5. The axis is independent of the other two: a paper can be novel but poorly reported (AE1) or transparent and incremental (AA8).

  • 1–3: Landmark · Major breakthrough · Fundamental
  • 4–6: Significant · Important (median) · Notable
  • 7–10: Valuable · Incremental · Marginal · Minimal
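Percentile binning of this kind can be sketched as below. The bin shares in `BIN_SHARES` are hypothetical (the real shares are not given on this page); they are chosen only so that the distribution peaks at 5 and the median paper lands there. The function name and the raw significance scale are likewise assumptions.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical share of the corpus (in percent) assigned to scores 1..10.
BIN_SHARES = [1, 3, 8, 15, 23, 20, 14, 9, 5, 2]
CUMULATIVE = list(accumulate(BIN_SHARES))  # [1, 4, 12, 27, 50, 70, 84, 93, 98, 100]


def novelty_score(significance: float, corpus: list[float]) -> int:
    """Map a raw significance value to a 1-10 novelty score.

    Higher significance gives a smaller number (1 is the most novel).
    """
    if not corpus:
        raise ValueError("corpus must not be empty")
    ranked = sorted(corpus)
    # Percentage of corpus papers that are strictly more significant.
    n_higher = len(ranked) - bisect_right(ranked, significance)
    top_percent = 100.0 * n_higher / len(ranked)
    # Index of the first bin whose cumulative share exceeds top_percent.
    return min(10, bisect_right(CUMULATIVE, top_percent) + 1)


corpus = [float(x) for x in range(100)]   # toy corpus of significance values
for value in (99.0, 50.0, 0.0):
    print(value, "->", novelty_score(value, corpus))
```

Binning against the corpus rather than against fixed thresholds keeps the distribution of scores stable even if the panel's raw significance values drift over time.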

The verdict

The verdict is a single editorial decision, an estimate of what an editorial board would likely decide, on a scale from Accept to Reject. It is derived deterministically from evidence × novelty. A finer-grained Publication Fit tier (1–10, anchored to journal exemplars: tier 1 ≈ Nature, tier 5 ≈ PLOS One median) expresses the same call as a number. AA1 is the best possible grade (exceptional evidence, excellent trust, landmark novelty); EE10 is the worst.

The pipeline

01

Deterministic checks (Layer 1)

Non-LLM · seconds

Before any language model runs, the paper is checked against a battery of open tools. These both detect the trust markers (with provenance) and run the fraud / error checks that can cap the trust grade.

  • Trust-marker detection — data / code availability, COI, ethics, funding, author contributions, preregistration, ORCID — each captured with the exact quote and line number.
  • Paper-mill signals — template / fingerprint cross-reference.
  • Retracted references — every cited DOI is matched against retraction records; three or more retracted citations cap the trust grade.
  • Statistical consistency — GRIM and statcheck-style p-value checks.
  • Text-capture gate — a paper is only reviewed if extraction reached the references section, so “marker absent” claims are made only where we have actually read the relevant part of the manuscript. Papers below the bar are re-queued, not graded on a partial text.
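To make one of the checks above concrete, here is a minimal sketch of the GRIM test. GRIM applies to integer-valued data (for example Likert responses): with n participants, the sum of the responses is an integer, so only certain means are arithmetically possible. The function name is hypothetical and the half-up rounding convention is an assumption; a production check would typically also accept other rounding conventions.

```python
from decimal import Decimal, ROUND_HALF_UP


def grim_consistent(reported_mean: str, n: int) -> bool:
    """Return True if a mean of n integer values could round to reported_mean.

    reported_mean is passed as a string so that the number of reported
    decimals (e.g. "3.50" has two) is preserved.
    """
    mean = Decimal(reported_mean)
    decimals = max(0, -mean.as_tuple().exponent)
    quantum = Decimal(1).scaleb(-decimals)      # e.g. 0.01 for two decimals

    # The true integer sum must lie next to mean * n; try its neighbours.
    centre = int(mean * n)
    for total in range(centre - 1, centre + 3):
        candidate = (Decimal(total) / n).quantize(quantum, rounding=ROUND_HALF_UP)
        if candidate == mean:
            return True
    return False


print(grim_consistent("5.18", 28))   # 145 / 28 rounds to 5.18
print(grim_consistent("5.19", 28))   # no integer sum gives 5.19
```

The test only carries information when n is small relative to the reported precision; with two decimals and n above 100, every mean is attainable.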

02

Eleven-agent review (Layer 2)

Parallel · minutes

Eleven agents read the full text and produce structured reviews. Each has a role and stays in its lane:

  • Methodologist — design, controls, confounders.
  • Statistician — tests, effect sizes, multiple-testing correction, sample size.
  • Scientific validity — do the claims follow from the evidence?
  • Ethics & transparency — ethics, funding, COI, contributions.
  • Domain primary / clinical / methods / literature / reproducibility — five specialist voices, selected per paper.
  • Devil's advocate — adversarial critic stressing the consensus.
  • Positive evidence — counter-weight that surfaces what the paper does well, against a 10-axis rubric.

Models are mixed per role for cost and specialisation: some agents run on Claude (Sonnet / Haiku), others on local Ollama models (Gemma, Qwen). Each agent emits structured strengths, weaknesses, scores and a recommendation.

A note on independence: AI reviewers overlap with each other far more than human reviewers do, so panel agreement reflects shared reasoning as much as corroboration. We therefore show how many distinct agents independently raised each concern (a concern echoed by several agents is more reliable), and flag papers where the panel was split.
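The two signals described above can be sketched as follows. The sketch assumes that each agent's concerns have already been normalised to shared keys (the hard part, not shown here), and the 60% majority threshold used to call a panel "split" is a hypothetical value. All names are illustrative.

```python
from collections import Counter, defaultdict


def concern_support(reviews: dict[str, list[str]]) -> dict[str, int]:
    """Count how many distinct agents raised each concern.

    reviews maps an agent name to the concern keys it raised. An agent that
    repeats the same concern is still counted once.
    """
    agents_by_concern: dict[str, set[str]] = defaultdict(set)
    for agent, concerns in reviews.items():
        for concern in concerns:
            agents_by_concern[concern].add(agent)
    return {concern: len(agents) for concern, agents in agents_by_concern.items()}


def panel_is_split(recommendations: dict[str, str],
                   majority_share: float = 0.6) -> bool:
    """Flag a panel whose most common recommendation lacks a clear majority."""
    if not recommendations:
        return False
    top_count = Counter(recommendations.values()).most_common(1)[0][1]
    return top_count / len(recommendations) < majority_share


reviews = {
    "methodologist": ["no_control_group", "small_sample"],
    "statistician": ["small_sample", "small_sample"],
    "devils_advocate": ["small_sample"],
}
print(concern_support(reviews))
print(panel_is_split({"a": "accept", "b": "accept", "c": "reject", "d": "reject"}))
```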

03

Deterministic derivation (Layer 3)

Pure rules · no LLM

The grades are computed by fixed rules, not by a model, so they are reproducible and auditable:

  • Trust grade from all applicable Layer-1 checks — transparency, reproducibility, statistical rigor, citation health — with fraud/error caps (core/trust_v2.py).
  • Novelty 1–10 by percentile-binning the panel's significance into the corpus bell curve (core/novelty_scale.py).
  • Evidence strength by aggregating the panel's confidence-weighted scores.
  • Editorial decision & Publication Fit from evidence × novelty (core/publication_tier.py).
  • Major claims are extracted with source quotes and each given a deterministic verdict (supported / partial / unsupported) by matching the panel's concerns against the claim.
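As an example of one rule from the list above, a confidence-weighted aggregation of panel scores might look like the sketch below. The 0–10 score scale, the 0–1 confidence scale and the letter cut-offs are assumptions made for illustration, not values from the pipeline.

```python
# Hypothetical cut-offs mapping a 0-10 weighted mean to an evidence letter.
EVIDENCE_CUTOFFS = [(8.5, "A"), (7.0, "B"), (5.0, "C"), (3.0, "D")]


def weighted_mean(scores: list[tuple[float, float]]) -> float:
    """Average (score, confidence) pairs, weighting each score by confidence."""
    total_weight = sum(confidence for _, confidence in scores)
    if total_weight <= 0:
        raise ValueError("at least one score needs a positive confidence")
    return sum(score * confidence for score, confidence in scores) / total_weight


def evidence_letter(scores: list[tuple[float, float]]) -> str:
    """Turn the panel's confidence-weighted mean into an evidence letter."""
    mean = weighted_mean(scores)
    for cutoff, letter in EVIDENCE_CUTOFFS:
        if mean >= cutoff:
            return letter
    return "E"


# A confident 8 and a hesitant 6: the confident agent dominates the mean.
panel = [(8.0, 1.0), (6.0, 0.5)]
print(round(weighted_mean(panel), 2), evidence_letter(panel))
```

Weighting by confidence means an agent that is unsure of its score pulls the aggregate less than one that is sure, while the mapping to a letter stays a fixed lookup.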

04

Arbitration (Layer 4)

Borderline cases only

When the panel disagrees or the grade sits on a boundary, a senior arbiter (DeepSeek-R1, with Claude Opus as fallback) reviews the full panel output and confirms or nudges the consensus with a written rationale, logged on the report. A scientific-validity cap requires at least two agents to concur before it can force a paper down — a single contrarian reviewer cannot veto the consensus.
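The two-agent concurrence rule can be illustrated with a short sketch. It assumes each agent that wants a cap proposes the best letter the paper should be allowed to receive, and that the cap acts on a single A–E letter; both are assumptions, as is the function name. An agent proposing a strict cap is treated as also supporting any milder cap, so the agreed cap is the strictest one that at least two agents back.

```python
GRADES = "ABCDE"  # ordered best to worst


def apply_validity_cap(grade: str,
                       proposed_caps: dict[str, str],
                       min_concurring: int = 2) -> str:
    """Lower grade to the strictest cap backed by at least min_concurring agents.

    proposed_caps maps an agent name to the best letter it would allow.
    """
    if len(proposed_caps) < min_concurring:
        return grade  # a lone reviewer cannot force the grade down

    # Sort proposals strictest first (E before D before C ...).
    strictest_first = sorted(proposed_caps.values(), key=GRADES.index, reverse=True)
    agreed_cap = strictest_first[min_concurring - 1]

    # The cap only ever lowers the grade: keep whichever letter is worse.
    return max(grade, agreed_cap, key=GRADES.index)


print(apply_validity_cap("B", {"devils_advocate": "E"}))
print(apply_validity_cap("B", {"devils_advocate": "E", "statistician": "D"}))
```

In the second call the devil's advocate wants E but only D has two supporters, so the grade drops to D rather than E.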

Calibration & validation

Limitations

Source code & reproducibility

The deterministic grade lookups, trust-marker rules and prompts are open to inspection (linked from the docs page), and the per-check evidence behind each signal is shown on the Evidence pages. Every assessment logs its paper text, model versions and per-agent responses, so any grade can be traced back to the inputs that produced it.