Deterministic pre-screening runs first: paper-mill detection (8,000+ tortured phrases), statcheck p-value verification, GRIM tests, data-availability checks, and retraction cross-referencing, all before any LLM runs.
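The GRIM test mentioned above checks whether a reported mean is arithmetically possible given the sample size, since integer-valued data can only produce certain means. A minimal sketch, assuming integer responses and a reported precision of two decimals (the function name and tolerance are illustrative, not the production code):

```python
def grim_consistent(mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM check: can a mean reported to `decimals` places arise
    from n integer-valued observations? Assumes n > 0."""
    total = round(mean * n)  # nearest integer sum implied by the reported mean
    # Allow for off-by-one rounding in the reported mean itself
    for candidate in (total - 1, total, total + 1):
        if candidate >= 0 and round(candidate / n, decimals) == round(mean, decimals):
            return True
    return False
```

For example, a mean of 3.51 from n = 20 integer ratings is impossible (means must be multiples of 0.05), while 3.55 is achievable.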
Multi-Agent AI Peer Review
Peer review at scale
Nine specialist AI agents review every preprint independently — then deliberate to produce a consensus assessment graded on integrity and novelty.
900+ papers assessed across 317,000+ indexed preprints from bioRxiv and medRxiv. Calibrated against 6,000+ publication outcomes.
All assessments are machine-generated and require human expert review.
Try: 10.1101/2025.01.15.633214 or paste any bioRxiv URL
Live Pipeline
Real-time assessment pipeline — auto-refreshes every 30 seconds
Grade Heatmap
Evidence (A–E) × Significance (5–1)
|  | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|
| A | - | - | - | - | - |
| B | - | - | - | - | - |
| C | - | - | - | - | - |
| D | - | - | - | - | - |
| E | - | - | - | - | - |
Recent Assessments
View all →

| Grade | Title | Source | Subject | Time | Basis | Report |
|---|---|---|---|---|---|---|
How It Works
Four phases, nine agents, one consensus grade
Four integrity agents (methodologist, statistician, ethics reviewer, scientific validity) plus five domain agents review every paper independently using Claude Sonnet and Claude Haiku.
Critical concerns from any agent are challenged by an adversarial verifier. Claims must survive cross-examination before affecting the final grade.
For borderline papers, three models (Claude, GPT-4o, Gemini) independently review, consult, and reconcile — inspired by eLife's consultative peer review model.
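The exact reconciliation protocol between the three models isn't specified here; one simple rule consistent with the description is to take the median of the independent letter grades after consultation. A hypothetical sketch (model names and the median rule are illustrative assumptions):

```python
def reconcile(grades: dict[str, str]) -> str:
    """Reconcile independent letter grades (A best .. E worst) from
    multiple model reviewers by taking the median grade."""
    scale = "ABCDE"
    scores = sorted(scale.index(g) for g in grades.values())
    return scale[scores[len(scores) // 2]]  # middle grade of the panel
```

With three reviewers, the median ignores a single outlier in either direction, which is one way a consultative panel resists any one model's bias.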
Hard overrides prevent positivity bias: 2+ critical weaknesses enforce a D floor, 4+ major concerns cap at C, reject recommendations cap at D.
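The override logic above can be sketched as a clamp that only ever moves a grade toward the worse end of the scale, using the thresholds stated in the text (function and parameter names are illustrative):

```python
def apply_overrides(grade: str, critical: int, major: int,
                    recommends_reject: bool) -> str:
    """Clamp an optimistic grade using hard integrity floors.
    Grades run A (best) to E (worst); overrides never improve a grade."""
    order = "ABCDE"

    def worse(a: str, b: str) -> str:
        return a if order.index(a) > order.index(b) else b

    if critical >= 2:
        grade = worse(grade, "D")  # 2+ critical weaknesses: D floor
    if major >= 4:
        grade = worse(grade, "C")  # 4+ major concerns: cap at C
    if recommends_reject:
        grade = worse(grade, "D")  # any reject recommendation: cap at D
    return grade
```

Because each rule is a one-way clamp, no amount of positive agent commentary can lift a paper past its integrity ceiling.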
Scores are compared against where papers actually get published. Journal tier data from OpenAlex, Europe PMC, and Semantic Scholar drives automatic threshold adjustment.
Why This Matters
Peer review is structurally unable to keep up
10,000+ preprints per week
Posted to bioRxiv and medRxiv alone. The reviewer pool hasn't grown to match. Average review turnaround now exceeds 100 days at major journals.
50% contain statistical errors
At least one statistical inconsistency per paper on average. Most go undetected because reviewers don't have time to recheck the maths.
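Rechecking this kind of inconsistency is mechanical: recompute the p-value from the reported test statistic and compare it to the reported p. statcheck does this for APA-style t, F, chi-square, and correlation reports; the sketch below is a simplified version for a two-sided z statistic only (function name and tolerance are assumptions):

```python
import math

def reported_p_consistent(z: float, reported_p: float,
                          tol: float = 0.005) -> bool:
    """Recompute a two-sided p-value from a reported z statistic
    and flag reports that disagree beyond a rounding tolerance."""
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p under N(0, 1)
    return abs(p - reported_p) <= tol
```

For instance, z = 1.96 with p = .05 checks out, while z = 1.96 with p = .02 would be flagged as inconsistent.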
2–5% from paper mills
Fabricated studies with tortured phrases, recycled figures, and fake data. Traditional review catches some. Automated detection catches more.
No quality signal on preprints
Groundbreaking research sits alongside methodologically flawed studies with no way to tell them apart until months later — if ever.
Built on Open Science Infrastructure
Every finding is traceable to a peer-reviewed tool. No proprietary black boxes.