Multi-Agent AI Peer Review

Peer review at scale

Nine specialist AI agents review every preprint independently — then deliberate to produce a consensus assessment graded on integrity and novelty.

900+ papers assessed across 317,000+ indexed preprints from bioRxiv and medRxiv. Calibrated against 6,000+ publication outcomes.

All assessments are machine-generated and require human expert review.

Try: 10.1101/2025.01.15.633214 or paste any bioRxiv URL

Live Pipeline

Real-time assessment pipeline — auto-refreshes every 30 seconds

Dashboard counters: Total Papers, Assessed, In Queue, Full Texts.
Grade distributions: Strength of Evidence (A–E), Significance of Findings (5–1).

Grade Heatmap

Evidence (A–E) × Significance (5–1)


Recent Assessments

Columns: Grade, Title, Source, Subject, Time, Basis, Report

How It Works

Four review layers plus a quality gate and a calibration loop: nine agents, one consensus grade

1. Deterministic Checks (Layer 1)

Paper mill detection (8,000+ tortured phrases), statcheck p-value verification, GRIM tests, data availability checks, retraction cross-referencing — all before any LLM runs.
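
Two of these checks are simple enough to sketch in a few lines. Below is a minimal, illustrative version of a GRIM consistency check and a statcheck-style p-value recomputation; the function names, tolerance, and example numbers are ours, not the pipeline's actual code.

```python
from scipy import stats

def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM test: can a mean reported to `decimals` places arise from
    n integer-valued responses?"""
    total = reported_mean * n
    candidates = (int(total), int(total) + 1)  # nearest achievable integer sums
    return any(round(c / n, decimals) == round(reported_mean, decimals)
               for c in candidates)

def statcheck_t(t_value: float, df: int, reported_p: float, tol: float = 0.005) -> bool:
    """statcheck-style check: does the reported two-tailed p match the
    p-value recomputed from the test statistic and degrees of freedom?"""
    recomputed = 2 * stats.t.sf(abs(t_value), df)
    return abs(recomputed - reported_p) <= tol

print(grim_consistent(5.19, n=28))               # False: impossible from 28 integer responses
print(statcheck_t(2.2, df=30, reported_p=0.04))  # True: recomputed p ≈ 0.036
```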

2. Nine-Agent Review (Layer 2)

Four integrity agents (methodologist, statistician, ethics reviewer, scientific validity) plus five domain agents review every paper independently using Claude Sonnet and Claude Haiku.
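
As a rough sketch of the fan-out, each agent is a separate, stateless model call with its own role prompt and no view of the other agents' output. The roles below follow the description above; the prompt wording, model ID placeholder, and return format are assumptions, not the project's actual code.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SONNET_MODEL_ID = "..."  # placeholder: whichever Claude Sonnet version the pipeline pins

INTEGRITY_ROLES = ["methodologist", "statistician", "ethics reviewer", "scientific validity"]
# ...plus five domain roles selected from the paper's subject area.

def independent_review(role: str, paper_text: str, model: str = SONNET_MODEL_ID) -> str:
    """One agent's review. Agents run independently and never see each other's output."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=(f"You are the {role} reviewer on a preprint assessment panel. "
                "List critical, major, and minor concerns, then give a provisional grade."),
        messages=[{"role": "user", "content": paper_text}],
    )
    return response.content[0].text

# reviews = {role: independent_review(role, paper_text) for role in INTEGRITY_ROLES}
```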

3. Adversarial Verification (Layer 3)

Critical concerns from any agent are challenged by an adversarial verifier. Claims must survive cross-examination before affecting the final grade.
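
One way to picture the gate: a concern is a structured object, and critical concerns only reach the grading step if a second, adversarially prompted model fails to overturn them. The data shape and the `challenge` callable below are illustrative assumptions, not the system's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Concern:
    agent: str
    severity: str   # "critical" | "major" | "minor"
    claim: str

def surviving_concerns(concerns: list[Concern],
                       challenge: Callable[[Concern], bool]) -> list[Concern]:
    """Non-critical concerns pass through; critical ones must be upheld by the
    adversarial verifier (challenge returns True) before they affect the grade."""
    return [c for c in concerns if c.severity != "critical" or challenge(c)]
```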

4. Multi-Model Deliberation (Layer 4)

For borderline papers, three models (Claude, GPT-4o, Gemini) independently review, consult, and reconcile — inspired by eLife's consultative peer review model.
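
A sketch of how that consultation round could be wired, assuming each reviewer is a callable wrapping one provider's API and returning a dict with a letter grade. The two-pass structure and the median tie-break are our assumptions about the reconciliation step, not a specification of it.

```python
GRADE_ORDER = "EDCBA"  # worst to best

def deliberate(paper_text: str, reviewers: list) -> str:
    """Independent pass, then a consult pass where each model sees the others'
    drafts; unresolved disagreement falls back to the median grade."""
    drafts = [review(paper_text, context=None) for review in reviewers]
    revised = [review(paper_text, context=drafts) for review in reviewers]
    grades = [r["grade"] for r in revised]
    if len(set(grades)) == 1:
        return grades[0]
    ranks = sorted(GRADE_ORDER.index(g) for g in grades)
    return GRADE_ORDER[ranks[len(ranks) // 2]]  # median of the three grades
```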

5. Recalibration Rules (Quality Gate)

Hard overrides prevent positivity bias: 2+ critical weaknesses enforce a D floor, 4+ major concerns cap at C, reject recommendations cap at D.
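
These overrides are pure threshold logic, so they can be written out directly. A minimal sketch on an A-to-E scale where A is best; the thresholds come from the description above, while the grade representation and function names are illustrative.

```python
GRADES = ["A", "B", "C", "D", "E"]  # best to worst

def no_better_than(grade: str, ceiling: str) -> str:
    """Return the worse of the two grades: an override can only lower a score."""
    return max(grade, ceiling, key=GRADES.index)

def recalibrate(grade: str, n_critical: int, n_major: int, recommends_reject: bool) -> str:
    if n_critical >= 2:
        grade = no_better_than(grade, "D")   # 2+ critical weaknesses: D floor
    if n_major >= 4:
        grade = no_better_than(grade, "C")   # 4+ major concerns: cap at C
    if recommends_reject:
        grade = no_better_than(grade, "D")   # reject recommendation: cap at D
    return grade

print(recalibrate("A", n_critical=2, n_major=0, recommends_reject=False))  # "D"
```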

6. Publication Calibration (Feedback Loop)

Scores are compared against where papers actually get published. Journal tier data from OpenAlex, Europe PMC, and Semantic Scholar drives automatic threshold adjustment.
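
A back-of-the-envelope sketch of the feedback signal: map each paper's assigned grade and its eventual journal tier onto the same ordinal scale and use the mean gap to nudge the grading thresholds. The tier mapping, scale, and learning rate are assumptions; only the idea of comparing grades against publication outcomes comes from the text above.

```python
GRADE_RANK = {"A": 4, "B": 3, "C": 2, "D": 1, "E": 0}
TIER_RANK = {"top": 4, "strong": 3, "solid": 2, "weak": 1, "unpublished": 0}

def threshold_shift(outcomes: list[tuple[str, str]], lr: float = 0.1) -> float:
    """outcomes: (assigned_grade, eventual_journal_tier) pairs.
    Positive shift -> the system graded too harshly on average, so loosen;
    negative -> it graded too leniently, so tighten."""
    gaps = [TIER_RANK[tier] - GRADE_RANK[grade] for grade, tier in outcomes]
    return lr * sum(gaps) / len(gaps)

print(threshold_shift([("C", "top"), ("B", "strong"), ("D", "weak")]))  # ≈ 0.067
```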

Why This Matters

Peer review is structurally unable to keep up

10,000+ preprints per week

Posted to bioRxiv and medRxiv alone. The reviewer pool hasn't grown to match. Average review turnaround now exceeds 100 days at major journals.

50% contain statistical errors

At least one statistical inconsistency per paper on average. Most go undetected because reviewers don't have time to recheck the maths.

2–5% from paper mills

Fabricated studies with tortured phrases, recycled figures, and fake data. Traditional review catches some. Automated detection catches more.

No quality signal on preprints

Groundbreaking research sits alongside methodologically flawed studies with no way to tell them apart until months later — if ever.

Built on Open Science Infrastructure

Every finding is traceable to a peer-reviewed tool. No proprietary black boxes.