Methodology
How Preprints.ai works
Every assessment runs a four-phase pipeline: deterministic integrity checks, a nine-agent AI peer review panel, eLife-style multi-model deliberation, and synthesis into an A5–E1 grade. The system is built on peer-reviewed open source tools and designed so that each phase can be independently audited.
The assessment pipeline
Papers flow through four sequential phases. Each phase can independently cap the final grade — a paper mill detection hit in Layer 1 sets a hard E ceiling regardless of later scores.
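A minimal sketch of that ceiling mechanism, assuming a simple helper (`apply_ceiling`) that is not part of the production code:

```python
# Minimal sketch of the grade-ceiling mechanism; apply_ceiling is an
# illustrative helper, not the production pipeline code.
EVIDENCE_ORDER = "ABCDE"  # A = strongest evidence, E = weakest

def apply_ceiling(proposed: str, ceiling: str | None) -> str:
    """Return whichever of the proposed letter and the hard ceiling is weaker."""
    if ceiling is None:
        return proposed
    return max(proposed, ceiling, key=EVIDENCE_ORDER.index)

# A paper-mill hit in Layer 1 sets ceiling = "E"; even if the panel later
# proposes "B", the final evidence letter stays at "E".
assert apply_ceiling("B", "E") == "E"
assert apply_ceiling("B", None) == "B"
assert apply_ceiling("D", "C") == "D"  # a ceiling can never improve a grade
```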
Layer 1: Deterministic integrity checks
The first phase runs a battery of rule-based checks that don't rely on language models. These checks are fast, reproducible, and independently verifiable. Their results are passed as structured context to the agent panel in Layer 2.
What Layer 1 checks
- Paper mill detection: 8,000+ tortured phrases, SCIgen/Mathgen fingerprints, 39 citejacked journals (PPS, Cabanac et al.)
- Statistical verification: P-value recalculation, GRIM tests, impossible-result detection (statcheck, Nuijten et al.; a minimal GRIM sketch follows this list)
- Image forensics: Duplicate panel detection, copy-move analysis, Western blot signature checking (ELIS)
- Open data detection: Repository mentions, accession numbers, data availability statement presence (ODDPub)
- Software citation: Full-paper detection of software tools + version/RRID completeness (SoftCite methodology)
- Rigor criteria (ScreenIT): Blinding reporting, randomisation description, power calculation, inclusion/exclusion criteria
- Reference verification: Cross-check against Retraction Watch database, detect citejacked journals
- Fabrication detection: Benford's law analysis, terminal digit distribution, SPRITE-style consistency
- Dataset classification: Dataset mention extraction and type classification (DataSeer-ML)
- AI content detection: LLM-generation signatures and machine translation indicators
- Trust markers: ORCID presence, ethics approval numbers, COI disclosure, funding statement
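To make one of these checks concrete, here is a minimal GRIM consistency check in the spirit of Brown and Heathers (2017). The function name and the two-candidate rounding strategy are illustrative simplifications, not the production implementation, and the check assumes integer-valued responses.

```python
# Minimal GRIM consistency check in the spirit of Brown & Heathers (2017);
# the two-candidate rounding strategy below is an illustrative simplification.
def grim_consistent(reported_mean: str, n: int) -> bool:
    """True if the printed mean (e.g. "3.48") is achievable as the mean of
    n integer-valued responses at the reported precision."""
    decimals = len(reported_mean.split(".")[1]) if "." in reported_mean else 0
    mean = float(reported_mean)
    # Try the integer totals closest to mean * n and see whether either
    # rounds back to the reported value.
    for total in (int(mean * n), int(mean * n) + 1):
        if round(total / n, decimals) == round(mean, decimals):
            return True
    return False

print(grim_consistent("3.48", 25))  # True: 87 / 25 = 3.48
print(grim_consistent("3.47", 25))  # False: no integer total over 25 gives 3.47
```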
Layer 2: The 9-agent AI panel
Nine specialist AI reviewers assess the paper concurrently, each grounded in Layer 1 findings. A critical design principle: each agent has a scope boundary enforcing genuine independence. The ensemble benefit only materialises when agents detect different things — not when nine models converge on the same observations.
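A sketch of what scope-bounded, concurrent dispatch can look like. The agent names, scope strings, and the `review_paper` stub are placeholders standing in for pinned-model LLM calls; they are not the production panel.

```python
# Sketch of scope-bounded, concurrent agent dispatch; the agent names, scope
# strings, and review_paper stub are placeholders, not the production panel.
import asyncio

AGENT_SCOPES = {
    "statistics": "Assess only statistical reporting and analysis choices.",
    "methods":    "Assess only study design, controls, and procedures.",
    "images":     "Assess only figure-integrity signals surfaced by Layer 1.",
    # ... six further agents, each with a disjoint scope ...
}

async def review_paper(agent: str, scope: str, paper_text: str, layer1: dict) -> dict:
    # Stand-in for a pinned-model LLM call; the scope string is injected into
    # the prompt so each agent reports only on its own slice of the paper.
    await asyncio.sleep(0)  # placeholder for network latency
    return {"agent": agent, "scope": scope, "findings": []}

async def run_panel(paper_text: str, layer1: dict) -> list[dict]:
    tasks = [review_paper(name, scope, paper_text, layer1)
             for name, scope in AGENT_SCOPES.items()]
    return await asyncio.gather(*tasks)  # all agents run concurrently

print(asyncio.run(run_panel("full paper text ...", {"paper_mill_hit": False})))
```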
Model version pinning
Model version strings are pinned as constants (following Eckmann et al. 2026, PLoS ONE) to prevent silent calibration drift. Version updates are explicit, documented decisions accompanied by re-validation against our calibration test suite.
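A sketch of the pinning pattern, with placeholder identifiers rather than the actual pinned versions or calibration suite:

```python
# Illustrative version-pinning pattern; the identifiers below are placeholders,
# not the models or calibration suite actually in use.
PINNED_MODELS = {
    "panel_agent":  "example-model-2025-06-01",
    "deliberation": "example-model-large-2025-06-01",
}
CALIBRATION_SUITE = "calibration/v3"  # hypothetical re-validation suite tag

def get_model(role: str) -> str:
    # A version bump is a reviewed code change shipped with fresh calibration
    # results against CALIBRATION_SUITE, never an implicit upgrade.
    return PINNED_MODELS[role]
```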
Layer 3: eLife-style deliberation
For borderline papers (panel agreement < 0.75 or any agent flagging a critical concern), three senior models engage in a structured two-round deliberation modelled on eLife's open peer review process:
- Each model independently drafts an eLife-format assessment
- Models read each other's assessments and identify key disagreements
- A structured consensus is negotiated, with explicit reasoning for any resolved disagreements
- Final assessment captures both the consensus grade and the nature of disagreement
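The escalation trigger itself is simple enough to sketch directly; the thresholds follow the description above, while the argument shapes are illustrative.

```python
# Sketch of the Layer 3 escalation rule; thresholds follow the description
# above, the argument shapes are illustrative.
def needs_deliberation(agreement_score: float, critical_flags: list[bool]) -> bool:
    """Escalate when panel agreement is low or any agent raised a critical concern."""
    return agreement_score < 0.75 or any(critical_flags)

print(needs_deliberation(0.81, [False] * 9))           # False: clear consensus
print(needs_deliberation(0.68, [False] * 9))           # True: low agreement
print(needs_deliberation(0.90, [True] + [False] * 8))  # True: critical flag
```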
The A5–E1 Grading Matrix
Every paper receives a two-character grade: a letter (A–E) representing strength of evidence, and a number (1–5) representing significance of findings. The matrix is inspired by eLife's assessment vocabulary and OpenAI FrontierScience's rubric-based evaluation methodology.
| | 5 · Landmark | 4 · Fundamental | 3 · Important | 2 · Valuable | 1 · Useful |
|---|---|---|---|---|---|
| A · Compelling | A5 | A4 | A3 | A2 | A1 |
| B · Convincing | B5 | B4 | B3 | B2 | B1 |
| C · Solid | C5 | C4 | C3 | C2 | C1 |
| D · Incomplete | D5 | D4 | D3 | D2 | D1 |
| E · Inadequate | E5 | E4 | E3 | E2 | E1 |
The significance axis (1–5) reflects the importance of the research question — not the publication venue. A D5 asks a landmark question but lacks evidence to support the claims.
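A small sketch of how the two axes compose, using the labels from the matrix above; the `describe` helper is illustrative rather than part of the system.

```python
# How the two axes compose into a grade, using the labels from the matrix
# above; the describe helper is illustrative, not part of the system.
EVIDENCE = {"A": "Compelling", "B": "Convincing", "C": "Solid",
            "D": "Incomplete", "E": "Inadequate"}
SIGNIFICANCE = {5: "Landmark", 4: "Fundamental", 3: "Important",
                2: "Valuable", 1: "Useful"}

def describe(grade: str) -> str:
    letter, number = grade[0], int(grade[1])
    return f"{EVIDENCE[letter]} evidence for a {SIGNIFICANCE[number].lower()} question"

print(describe("D5"))  # Incomplete evidence for a landmark question
print(describe("B2"))  # Convincing evidence for a valuable question
```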
Grade descriptions
- Compelling evidence: Rigorous methodology, appropriate controls, no statistical anomalies, full data sharing. Ready for high-tier publication.
- Convincing evidence: Sound methodology with minor gaps. Would benefit from revision but claims are well-supported.
- Solid evidence: Meaningful work with significant methodological gaps requiring major revision before publication.
- Incomplete evidence: Interesting question but insufficient evidence. Not submission-ready.
- Inadequate evidence: Serious methodological, statistical, or integrity concerns. Not ready for peer review.
Inter-rater reliability
We measure two forms of agreement: intra-model consistency (same model, same paper, different runs) and inter-model agreement (different models, same paper). The panel agreement score reported with each grade reflects inter-model agreement.
On our validation set of 300 bioRxiv papers with known peer review outcomes, the system achieves a Spearman correlation of ~0.68 with eventual journal acceptance decisions and ~0.72 with editor-assigned quality scores. Individual agents disagree on grade by ≥1 letter in approximately 23% of papers — these are flagged for deliberation.
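The agreement formula is not spelled out here, so the sketch below shows just one plausible way to turn nine evidence letters into a score in [0, 1], as mean pairwise closeness on the A–E scale; the production score may be computed differently.

```python
# One plausible way to turn nine evidence letters into an agreement score in
# [0, 1]; the production formula is not specified here, so this is illustrative.
from itertools import combinations

EVIDENCE_ORDER = "ABCDE"

def panel_agreement(letters: list[str]) -> float:
    """Mean pairwise closeness: 1.0 means unanimous, lower as grades spread."""
    max_gap = len(EVIDENCE_ORDER) - 1
    gaps = [abs(EVIDENCE_ORDER.index(a) - EVIDENCE_ORDER.index(b))
            for a, b in combinations(letters, 2)]
    return 1.0 - (sum(gaps) / len(gaps)) / max_gap

print(round(panel_agreement(["B"] * 9), 2))               # 1.0: unanimous panel
print(round(panel_agreement(["B"] * 7 + ["C", "D"]), 2))  # 0.85: mild spread
print(round(panel_agreement(list("ABBCCCDDE")), 2))       # 0.64: below 0.75, escalated
```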
Open source tools we build on
Preprints.ai is deliberately built on publicly auditable, peer-reviewed tools. We do not black-box our integrity checks.
Methodology inspirations
- eLife open peer review — Structured assessment format and deliberation protocol for borderline papers
- OpenAI FrontierScience — Rubric-based evaluation methodology for reasoning quality vs. conclusions
- Paperpal Preflight (Cactus Communications) — Three-pillar publisher integration model: integrity / language / compliance
- Research Signals — UX patterns for manuscript screening dashboards
- Thakkar et al. 2026, Nature Machine Intelligence — A randomised controlled trial at ICLR showing that LLM-generated feedback improves peer review quality
Limitations & caveats
- Not a replacement for peer review. The system identifies technical signals — it does not assess scientific truth. Novel, paradigm-challenging work may score lower than incremental but methodologically clean work.
- Language bias. All integrity checks and agents are optimised for English-language papers. Non-English text is detected and flagged, but assessment quality is substantially lower.
- Domain coverage variance. Agent performance is strongest in biomedicine, life sciences, and quantitative social science. Coverage for arts, humanities, and qualitative research is limited.
- PDF quality dependence. Scanned PDFs, heavily image-based papers, or papers using unusual typesetting may extract poorly, degrading all downstream assessment quality.
- Partial statistical coverage. statcheck only catches APA-format reported statistics. Papers using non-standard reporting formats, Bayesian statistics, or only effect sizes without p-values are not fully covered.
- Absence ≠ problem. A missing blinding statement does not mean blinding was absent — only that it was not reported. Rigor criteria flags indicate reporting gaps, not procedural failures.
- LLM calibration drift. Model performance changes silently between versions. We pin all model versions and re-validate quarterly. An assessment from 2025 may differ from one in 2026 for the same paper.
Key references
- Nuijten et al. (2016). The prevalence of statistical reporting errors in psychology. Behavior Research Methods. doi:10.3758/s13428-015-0664-2
- Riedel et al. (2020). ODDPub. Data Science Journal. doi:10.5334/dsj-2020-042
- Cabanac et al. (2021). Tortured phrases. Scientometrics. doi:10.1007/s11192-021-04096-2
- Du et al. (2021). SoftCite dataset. JCDL 2021.
- Eckmann et al. (2026). ScreenIT: comparing software tools for rigor and transparency. PLoS ONE 21(2):e0342225.
- Thakkar et al. (2026). A large-scale randomized study of LLM feedback in peer review. Nature Machine Intelligence.