Calibration corpus & review-12b
A corpus of 9,279 reviews assembled from eLife, preprints.ai's own generated reviews, and PREreview, used to train a calibration model that maps the 9-agent panel output to a final grade. The model is still in training and is not yet wired into the production pipeline.
Pipeline today
The production pipeline that ships today is fully described on the methodology page. In summary: 9 specialist agents read the paper, an Opus advisor arbitrates borderline cases, and a deterministic lookup converts the panel's evidence-strength and significance labels into the integrity letter and novelty number on the report.
The deterministic lookup was originally calibrated against 800 historical eLife peer reviews used as a gold standard.
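The lookup step described above can be sketched as a pair of static tables. The label vocabulary and grade values below are illustrative assumptions (loosely modeled on eLife's public assessment terms), not the actual production tables:

```python
# Hypothetical sketch of the deterministic label-to-grade lookup.
# Label names and grade values are assumptions for illustration only;
# the real tables live in the production pipeline.

INTEGRITY_LOOKUP = {
    # evidence-strength label -> integrity letter
    "compelling": "A",
    "solid": "B",
    "incomplete": "C",
    "inadequate": "D",
}

NOVELTY_LOOKUP = {
    # significance label -> novelty number
    "landmark": 5,
    "fundamental": 4,
    "important": 3,
    "valuable": 2,
    "useful": 1,
}

def grade(evidence_strength: str, significance: str) -> tuple[str, int]:
    """Map the panel's two labels to an (integrity, novelty) pair."""
    return INTEGRITY_LOOKUP[evidence_strength], NOVELTY_LOOKUP[significance]
```

Because the mapping is a plain table, the same panel labels always yield the same grade; that determinism is what the original 800-review eLife calibration tuned.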
What review-12b is
review-12b is an in-training calibration model whose only job will be to map the structured output of the 9-agent panel to the same grade scale, learned end-to-end against the 9,279-review corpus rather than via the hand-tuned lookup. It is a shape-matching layer, not a replacement reviewer. The training corpus breaks down as follows:
- 6,411 eLife reviews. Public, structured, with explicit evidence-strength and significance labels. The largest single source.
- 1,570 preprints.ai reviews. Used for self-consistency only. We never train a calibration target on its own previous outputs in a way that would reinforce its own biases.
- 1,298 PREreview reviews. Harvested from the Zenodo community prereview-reviews (1,593 total records; 1,298 retained after dedup and quality filtering). Released under CC-BY 4.0.
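The "shape-matching, not replacement reviewer" idea can be illustrated with the simplest possible learned analogue: instead of a hand-tuned table, learn the label-pair-to-grade mapping from corpus examples. The real review-12b model and its feature set are not described on this page; the majority-vote scheme below is purely an assumed stand-in:

```python
from collections import Counter, defaultdict

# Hypothetical sketch: learn a label-pair -> grade mapping from a
# corpus of (panel labels, human grade) examples, replacing the
# hand-tuned lookup. review-12b's actual architecture is not public
# here; grades and labels below are illustrative.

def fit_calibration(corpus):
    """corpus: iterable of ((evidence, significance), grade) pairs.
    Returns a dict mapping each label pair to its majority grade."""
    votes = defaultdict(Counter)
    for labels, observed_grade in corpus:
        votes[labels][observed_grade] += 1
    return {labels: counter.most_common(1)[0][0]
            for labels, counter in votes.items()}

corpus = [
    (("solid", "important"), "B3"),
    (("solid", "important"), "B3"),
    (("solid", "important"), "B2"),
    (("incomplete", "useful"), "C1"),
]
model = fit_calibration(corpus)
```

The point of the sketch: the learned layer consumes exactly the same structured panel output the deterministic lookup does, and emits grades on the same scale, so swapping it in changes calibration, not interface.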
Status
review-12b is not yet integrated into the live grade pipeline. Until it is, every grade you see on a report is produced by the deterministic lookup described on the methodology page. We will publish the diff metrics — agreement, calibration error, and grade-shift histogram — on this page when the model is wired in for real traffic.
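The three diff metrics named above can be computed from paired grades on the same papers. The grade values and the rank-distance proxy for calibration error below are assumptions for illustration, not the published metric definitions:

```python
from collections import Counter

# Hypothetical sketch of the diff metrics: agreement rate, a simple
# calibration-error proxy (mean absolute shift in grade rank), and a
# signed grade-shift histogram. Grade scale is illustrative.

def diff_metrics(lookup_grades, model_grades, grade_order):
    idx = {g: i for i, g in enumerate(grade_order)}
    pairs = list(zip(lookup_grades, model_grades))
    agreement = sum(a == b for a, b in pairs) / len(pairs)
    # Mean absolute rank shift, used here as a calibration-error proxy.
    mae = sum(abs(idx[a] - idx[b]) for a, b in pairs) / len(pairs)
    # Histogram of signed shifts (model rank minus lookup rank).
    shifts = Counter(idx[b] - idx[a] for a, b in pairs)
    return agreement, mae, dict(shifts)

agreement, mae, shifts = diff_metrics(
    ["A", "B", "B", "C"], ["A", "B", "C", "C"], ["A", "B", "C", "D"])
```

A grade-shift histogram of mostly zeros with a thin symmetric tail would indicate the learned model tracks the deterministic lookup closely; a skewed tail would indicate a systematic grade drift worth publishing.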
Caveats — what this doesn't measure
- The eLife slice is biased toward life-sciences manuscripts that survived editorial triage. It is not a random sample of the preprint universe.
- PREreview reviews are CC-BY 4.0, but reviewers self-select onto the platform; their reviews may differ systematically from journal peer review.
- The 1,570 self-generated reviews exist only to test whether the model can reproduce its own consensus stably across repeats. They are excluded from any held-out evaluation set.
- Agreement with eLife's public-review labels is not the same as agreement with the eLife reviewers' underlying judgement about quality. We measure label match, not truth.
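The self-consistency check mentioned in the caveats (reproducing its own consensus stably across repeats) can be sketched as a modal-agreement score over repeated gradings of the same paper. The repeat count and any pass threshold are assumptions, not documented values:

```python
from collections import Counter

# Hypothetical sketch of a self-consistency check: grade one paper
# several times and measure how often the repeats match their own
# modal grade. Grades and repeat counts are illustrative.

def repeat_stability(repeat_grades):
    """repeat_grades: grades from independent repeats on one paper.
    Returns the fraction of repeats matching the modal grade."""
    mode, count = Counter(repeat_grades).most_common(1)[0]
    return count / len(repeat_grades)

stability = repeat_stability(["B3", "B3", "B3", "B2", "B3"])
```

Because the 1,570 self-generated reviews exist only for this kind of stability measurement, they never appear in held-out evaluation, keeping the corpus from scoring the model against its own outputs.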
Code & attribution
Pipeline orchestration: agents/agentic_review.py. PREreview reviews are reproduced under CC-BY 4.0; eLife public reviews are reproduced under their open licensing.