Documentation
How Preprints.ai assesses research quality using a two-layer AI system
Introduction
Preprints.ai provides automated quality signals for academic preprints. With over 10,000 preprints posted weekly across bioRxiv, medRxiv, and arXiv, researchers need help identifying which papers deserve their attention.
Our system combines fast automated checks (detecting paper mills, statistical errors, missing data) with deep multi-agent peer review (5+ specialized AI reviewers analyzing methodology, statistics, reproducibility, and domain-specific standards).
We evaluate methodological integrity and novelty—not whether findings are "true". A high grade means good scientific practices were followed. A low grade means methodological concerns warrant caution.
The A5–E1 Grade System
Every paper receives a two-part grade: Integrity letter (A–E) + Novelty number (1–5)
Integrity Grades (A–E)
| Grade | Score | Meaning |
|---|---|---|
| A | ≥0.85 | Exemplary methodology |
| B | 0.70–0.84 | Solid with minor concerns |
| C | 0.55–0.69 | Adequate but notable gaps |
| D | 0.40–0.54 | Significant concerns |
| E | <0.40 | Critical issues |
Novelty Grades (1–5)
| Grade | Score | Meaning |
|---|---|---|
| 5 | ≥0.85 | Highly novel, potentially field-changing |
| 4 | 0.70–0.84 | Novel contribution |
| 3 | 0.55–0.69 | Incremental advance |
| 2 | 0.40–0.54 | Confirmatory |
| 1 | <0.40 | Limited novelty |
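The thresholds in the two tables above can be expressed as a small lookup. This is an illustrative Python sketch; the function names are ours, not part of any Preprints.ai SDK:

```python
def integrity_letter(score: float) -> str:
    """Map an integrity score in [0, 1] to its A-E letter grade."""
    if score >= 0.85: return "A"
    if score >= 0.70: return "B"
    if score >= 0.55: return "C"
    if score >= 0.40: return "D"
    return "E"

def novelty_number(score: float) -> int:
    """Map a novelty score in [0, 1] to its 1-5 grade."""
    if score >= 0.85: return 5
    if score >= 0.70: return 4
    if score >= 0.55: return 3
    if score >= 0.40: return 2
    return 1

def combined_grade(integrity: float, novelty: float) -> str:
    """Combine both parts into the two-part grade, e.g. 0.78 / 0.62 -> "B3"."""
    return f"{integrity_letter(integrity)}{novelty_number(novelty)}"
```

For example, an integrity score of 0.78 and a novelty score of 0.62 yield the combined grade "B3".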
Assessment Pipeline
Papers flow through a two-layer system combining fast automated checks with deep AI review.
Layer 1: Automated Checks
11 automated checks run in parallel before AI review (~2 seconds total).
| Check | Detects | Impact |
|---|---|---|
| Paper Mill Detection | Tortured phrases, SCIgen/Mathgen signatures, LLM artifacts | → E grade cap |
| Statistical Verification | P-value recalculation errors (statcheck) | −0.15 penalty |
| Fabrication Detection | GRIM test, Benford's law, terminal digit analysis | → E grade cap |
| Trust Markers | ORCID, ethics statement, COI, funding | ±0.05 |
| Open Data (ODDPub) | Data/code availability, accession numbers | +0.02 bonus |
| Sample Size Consistency | N reported in Methods inconsistent with N in Results | −0.05 warning |
| Reference Verification | Retracted papers, citejacked journals | −0.03 to −0.10 |
| Reproducibility Checklist | CONSORT/ARRIVE/MIQE items present | Informs agents |
| Image Forensics (ELIS) | Duplicate images, manipulation signs | −0.15 to −0.20 |
| Language Detection | Machine translation artifacts | Warning flag |
| Adversarial Sanitizer | Prompt injection attempts | Security |
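The checks are independent, so they parallelize naturally. A minimal sketch of the fan-out, with two stubbed checks standing in for the real detectors (the function names and the tortured-phrase example are illustrative, not the production implementation):

```python
from concurrent.futures import ThreadPoolExecutor

# Stub checks: each returns (name, finding); finding is None when the check passes.
def paper_mill_check(text):
    # "vector machine backing" is a tortured-phrase rewrite of "support vector machine"
    finding = "tortured phrase" if "vector machine backing" in text else None
    return ("paper_mill", finding)

def sample_size_check(text):
    return ("sample_size", None)  # stub: would compare N in Methods vs Results

CHECKS = [paper_mill_check, sample_size_check]

def run_layer1(text):
    """Fan all Layer 1 checks out across a thread pool and collect findings."""
    with ThreadPoolExecutor(max_workers=len(CHECKS)) as pool:
        results = pool.map(lambda check: check(text), CHECKS)
    return dict(results)
```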
Layer 2: Agentic Peer Review
Five specialized AI agents review each paper in parallel.
The 5 Core Agents
| Agent | Focus |
|---|---|
| Methodologist | Experimental design, controls, sample sizes |
| Statistician | Statistical validity, effect sizes, corrections |
| Domain Expert | Field-specific standards (CONSORT, ARRIVE, etc.) |
| Reproducibility | Protocol detail, data/code availability |
| Ethics | Ethics approval, COI, transparency |
Integrity Score Calculation
Weighted consensus of agent assessments plus Layer 1 adjustments:
| Component | Weight |
|---|---|
| Methodologist | 25% |
| Statistician | 25% |
| Reproducibility | 25% |
| Ethics & Transparency | 15% |
| Domain Expert | 10% |
Layer 1 Adjustments
- Paper mill detected: Cap at E
- Statistical errors: −0.15
- No data availability: −0.05
- All trust markers: +0.05
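Putting the weights and adjustments together, one plausible reading of the scoring rule (the exact clamping order and cap value are our assumption):

```python
WEIGHTS = {
    "methodologist": 0.25,
    "statistician": 0.25,
    "reproducibility": 0.25,
    "ethics": 0.15,
    "domain_expert": 0.10,
}

E_CAP = 0.39  # assumed: any score below 0.40 maps to grade E

def integrity_score(agent_scores, *, paper_mill=False, stat_errors=False,
                    no_data=False, all_trust_markers=False):
    """Weighted consensus of agent scores (each in [0, 1]) plus Layer 1 adjustments."""
    score = sum(WEIGHTS[a] * agent_scores[a] for a in WEIGHTS)
    if stat_errors:
        score -= 0.15
    if no_data:
        score -= 0.05
    if all_trust_markers:
        score += 0.05
    score = min(max(score, 0.0), 1.0)
    if paper_mill:  # paper-mill detection caps the grade at E
        score = min(score, E_CAP)
    return score
```

The novelty score follows the same pattern with its own weight table (Domain Expert 40%, Methodologist 20%, Statistician 15%, Ethics 15%, Reproducibility 10%).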
Novelty Score Calculation
Weighted more heavily toward domain expertise:
| Component | Weight |
|---|---|
| Domain Expert | 40% |
| Methodologist | 20% |
| Statistician | 15% |
| Ethics | 15% |
| Reproducibility | 10% |
Consensus & Agreement
| Agreement | Interpretation |
|---|---|
| ≥85% | High confidence |
| 70–84% | Good agreement |
| 60–69% | Moderate disagreement |
| <60% | Significant disagreement—flagged |
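The agreement metric itself is not specified here; one simple illustrative definition uses the spread of the per-agent scores, mapped onto the bands above:

```python
def agreement(scores):
    """Illustrative agreement metric: 1 minus the max pairwise spread of agent scores."""
    return 1.0 - (max(scores) - min(scores))

def agreement_band(value):
    """Bucket an agreement value into the interpretation bands from the table."""
    if value >= 0.85: return "high confidence"
    if value >= 0.70: return "good agreement"
    if value >= 0.60: return "moderate disagreement"
    return "significant disagreement (flagged)"
```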
27 Domain Expert Configurations
Each bioRxiv category has specialized expertise:
| Category | Key Standards |
|---|---|
| Clinical Trials | CONSORT, pre-registration, ITT |
| Neuroscience | ARRIVE, optogenetic controls |
| Genomics | MINSEQE, GEO/SRA deposition |
| Cancer Biology | STR authentication, PDX models |
| Bioinformatics | Benchmarking, code availability |
| Epidemiology | STROBE, DAGs, E-values |
+ 21 more categories
Integrated Tools
Paper Mill Detection (PPS)
- 8,000+ tortured phrases
- 257 SCIgen signatures
- 19 LLM output markers
Statistical Verification
Reported: t(24) = 2.50, p = 0.02
Recalculated: p = 0.0196
Status: ✓ Consistent
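The GRIM test listed under Fabrication Detection is simple enough to sketch: with n integer-valued observations, the true mean must equal k/n for some integer k, so a reported mean that no such fraction rounds to is impossible. An illustrative implementation:

```python
def grim_consistent(reported_mean, n, decimals=2):
    """GRIM test: check whether some integer sum k makes k/n round
    to the reported mean at the reported precision."""
    k = round(reported_mean * n)
    return any(round(kk / n, decimals) == round(reported_mean, decimals)
               for kk in (k - 1, k, k + 1))
```

For n = 18 integer responses, a reported mean of 3.44 is possible (62/18 rounds to 3.44), but 3.45 is not: no multiple of 1/18 rounds to it.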
Image Forensics (ELIS)
Named after Elisabeth Bik. Detects:
- Duplicate images via perceptual hashing
- Copy-move forgery within images
- Western blot splice patterns
- Metadata inconsistencies
Domain Context (OpenAlex)
We enrich assessments with real literature context:
- Similar papers in the literature
- Citation counts and patterns
- Field-specific norms and benchmarks
- Concept/topic classification
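OpenAlex exposes works directly by DOI over its public REST API; a minimal lookup sketch using only the standard library (the cited_by_count and concepts fields come from OpenAlex's work schema):

```python
import json
import urllib.request

OPENALEX = "https://api.openalex.org/works/https://doi.org/"

def openalex_url(doi: str) -> str:
    """OpenAlex supports direct lookup of a work by its DOI."""
    return OPENALEX + doi

def fetch_context(doi: str) -> dict:
    """Fetch citation counts and concept tags for a preprint (network call)."""
    with urllib.request.urlopen(openalex_url(doi)) as resp:
        work = json.load(resp)
    return {
        "cited_by_count": work.get("cited_by_count"),
        "concepts": [c["display_name"] for c in work.get("concepts", [])],
    }
```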
Validation & Ground Truth
We track our predictions against real-world outcomes to measure accuracy:
Retraction Monitoring
Papers we grade are monitored via CrossRef and Retraction Watch. We track:
- Did we flag papers that were later retracted?
- Sensitivity: % of retractions we caught
- False positives: Good papers we wrongly flagged
Publication Outcomes
We track where preprints end up:
- Do A-grade papers get published in high-impact journals?
- Do E-grade papers fail peer review?
- Citation correlation with novelty scores
Calibration Dataset
We maintain a set of papers with known ground truth:
- Known retractions (should be E grade)
- Highly-cited landmark papers (should be A grade)
- Expert-reviewed papers
Critical Failures
- Paper mill content detected
- Image manipulation
- GRIM/SPRITE violations
- Tautological claims
API Reference
Base URL: https://api.preprints.ai/v1
GET /grade/{doi}
GET /v1/grade/10.1101/2024.01.15.123456
{
"grade": "B3",
"integrity": { "score": 0.78, "letter": "B" },
"novelty": { "score": 0.62, "number": 3 },
"confidence": 0.85,
"agreement_score": 0.78
}
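A minimal client for this endpoint, using only the Python standard library (error handling omitted; the summary format is ours):

```python
import json
import urllib.request

BASE = "https://api.preprints.ai/v1"

def get_grade(doi: str) -> dict:
    """Fetch the cached grade for a DOI (performs a network request)."""
    with urllib.request.urlopen(f"{BASE}/grade/{doi}") as resp:
        return json.load(resp)

def summarize(payload: dict) -> str:
    """Render the response above as a one-line summary, e.g. 'B3 (confidence 0.85)'."""
    return f'{payload["grade"]} (confidence {payload["confidence"]:.2f})'
```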
POST /assess
POST /v1/assess
{ "doi": "10.1101/2024.01.15.123456" }
Response: { "status": "queued" }
Rate Limits
| Endpoint | Limit |
|---|---|
| GET /grade/* | 100/minute |
| POST /assess | 10/minute |
| POST /v1/assess (partner) | 60/hour |
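Clients can avoid 429 responses by pacing requests on their side. A simple interval-based pacer (illustrative, not part of any SDK):

```python
import time

class RatePacer:
    """Client-side pacing to stay under an endpoint's limit, e.g. 100/minute."""
    def __init__(self, calls: int, per_seconds: float):
        self.min_interval = per_seconds / calls
        self.next_allowed = 0.0

    def wait(self, now=None):
        """Return seconds to sleep before the next call is allowed."""
        now = time.monotonic() if now is None else now
        delay = max(0.0, self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return delay
```

RatePacer(100, 60.0) matches the 100/minute limit on GET /grade/*.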
Partner API
The Partner API enables external platforms like OpenAccess.ai to submit manuscripts for automated peer review. Partner reviews are stored separately from bioRxiv assessments and include additional provenance auditing for AI-generated research.
Authentication
All partner endpoints require an X-API-Key header with a valid partner key.
Endpoint 1: Submit for Assessment
POST /v1/assess
Content-Type: application/json
X-API-Key: {partner_key}
{
"manuscript_content": "Full text (markdown/plain/JATS)",
"metadata": {
"title": "Paper title",
"abstract": "Abstract text",
"authors": [
{"name": "Human Author", "orcid": "0000-...", "is_ai_system": false},
{"name": "Claude (Anthropic)", "is_ai_system": true}
],
"subject_area": "Biology",
"ai_system": "Claude (Anthropic)"
},
"provenance": {
"model_id": "claude-sonnet-4-5-20250929",
"databases_queried": ["PubMed", "Semantic Scholar"],
"generation_date": "2026-02-16",
"total_compute_hours": 0.5
},
"callback_url": "https://yourapp.com/webhook",
"callback_secret": "your_hmac_secret",
"submission_ref": "your-internal-id",
"assessment_config": {
"include_provenance_audit": true,
"include_reproducibility": true,
"reviewer_count": 8
}
}
→ 202 Accepted
{
"assessment_id": "ps_abc123",
"status": "pending",
"estimated_completion_seconds": 300
}
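A sketch of the submission call, sending only a minimal subset of the fields shown above (which fields are strictly required is not specified here, so treat this as illustrative):

```python
import json
import urllib.request

def build_payload(manuscript: str, title: str,
                  callback_url: str, callback_secret: str) -> dict:
    """Assemble a minimal /v1/assess request body."""
    return {
        "manuscript_content": manuscript,
        "metadata": {"title": title},
        "callback_url": callback_url,
        "callback_secret": callback_secret,
    }

def submit(payload: dict, api_key: str) -> dict:
    """POST the payload to /v1/assess (performs a network request)."""
    req = urllib.request.Request(
        "https://api.preprints.ai/v1/assess",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # 202 body: assessment_id, status, ETA
```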
Endpoint 2: Webhook Callback
When complete, we POST to your callback_url with:
- x-preprints-signature header: HMAC-SHA256 of the request body, keyed with your callback_secret
- Full structured assessment with grade, reviewers, trust markers, and provenance audit
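Receivers should verify the signature before trusting the payload. Assuming the signature is hex-encoded (the encoding is not specified above), verification looks like:

```python
import hashlib
import hmac

def verify_signature(body: bytes, header_value: str, secret: str) -> bool:
    """Recompute HMAC-SHA256 over the raw request body and compare in
    constant time against the x-preprints-signature header value."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_value)
```

Always compare against the raw bytes you received, before any JSON parsing or re-serialization, and use a constant-time comparison to avoid timing side channels.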
Endpoint 3: Poll Status
GET /v1/assess/{assessment_id}
X-API-Key: {partner_key}
→ Returns full assessment when status = "completed"
Endpoint 4: Reassessment
POST /v1/assess/{previous_id}/reassess
X-API-Key: {partner_key}
{
"manuscript_content": "Updated v2 text...",
"version": 2,
"version_note": "Addressed reviewer concerns"
}
Endpoint 5: Public Report
GET /assessment/{assessment_id}
→ Redirects to the interactive report page
Provenance Audit
When include_provenance_audit is true, the assessment includes:
- Model plausibility — Does the claimed AI match the manuscript's capabilities?
- Provenance consistency — Are dates, versions, and sources consistent?
- Human contribution alignment — Does the contribution claim match style?
- Prompt injection detection — Flag adversarial content
- Reproducibility signal — Could another run produce similar results?