Calibration corpus & review-12b
A corpus of 9,279 reviews assembled from eLife, preprints.ai's own generated reviews, and PREreview, used to train a calibration model that maps the 9-agent panel output to a final grade. The model is still in training and is not yet wired into the production pipeline.
Pipeline today
The production pipeline that ships today is fully described on the methodology page. In summary: 9 specialist agents read the paper, an Opus advisor arbitrates borderline cases, and a deterministic lookup converts the panel's evidence-strength and significance labels into the integrity letter and novelty number on the report.
The deterministic lookup was originally calibrated against 800 historical eLife peer reviews used as a gold standard.
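The lookup step described above can be sketched as a pair of static tables. The label vocabulary and grade values below are illustrative assumptions (loosely modeled on eLife's public assessment terms), not the actual production tables:

```python
# Hypothetical sketch of the deterministic label-to-grade lookup.
# Label names and grade values are assumptions for illustration only;
# the real tables live in the production pipeline.

INTEGRITY_LOOKUP = {
    # evidence-strength label -> integrity letter
    "compelling": "A",
    "solid": "B",
    "incomplete": "C",
    "inadequate": "D",
}

NOVELTY_LOOKUP = {
    # significance label -> novelty number
    "landmark": 5,
    "fundamental": 4,
    "important": 3,
    "valuable": 2,
    "useful": 1,
}

def grade(evidence_strength: str, significance: str) -> tuple[str, int]:
    """Map the panel's two labels to an (integrity, novelty) pair."""
    return INTEGRITY_LOOKUP[evidence_strength], NOVELTY_LOOKUP[significance]
```

Because the mapping is a plain table, the same panel labels always yield the same grade; that determinism is what the original 800-review eLife calibration tuned.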
What review-12b is
review-12b is an in-training calibration model whose only job will be to map the structured output of the 9-agent panel to the same grade scale, learned end-to-end against the 9,279-review corpus rather than via the hand-tuned lookup. It is a shape-matching layer, not a replacement reviewer. The training corpus breaks down as follows:
- 6,411 eLife reviews. Public, structured, with explicit evidence-strength and significance labels. The largest single source.
- 1,570 preprints.ai reviews. Used for self-consistency only. We never train a calibration target on its own previous outputs in a way that would reinforce its own biases.
- 1,298 PREreview reviews. Harvested from the Zenodo community prereview-reviews (1,593 total records; 1,298 retained after dedup and quality filtering). Released under CC-BY 4.0.
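The "shape-matching, not replacement reviewer" idea can be illustrated with the simplest possible learned analogue: instead of a hand-tuned table, learn the label-pair-to-grade mapping from corpus examples. The real review-12b model and its feature set are not described on this page; the majority-vote scheme below is purely an assumed stand-in:

```python
from collections import Counter, defaultdict

# Hypothetical sketch: learn a label-pair -> grade mapping from a
# corpus of (panel labels, human grade) examples, replacing the
# hand-tuned lookup. review-12b's actual architecture is not public
# here; grades and labels below are illustrative.

def fit_calibration(corpus):
    """corpus: iterable of ((evidence, significance), grade) pairs.
    Returns a dict mapping each label pair to its majority grade."""
    votes = defaultdict(Counter)
    for labels, observed_grade in corpus:
        votes[labels][observed_grade] += 1
    return {labels: counter.most_common(1)[0][0]
            for labels, counter in votes.items()}

corpus = [
    (("solid", "important"), "B3"),
    (("solid", "important"), "B3"),
    (("solid", "important"), "B2"),
    (("incomplete", "useful"), "C1"),
]
model = fit_calibration(corpus)
```

The point of the sketch: the learned layer consumes exactly the same structured panel output the deterministic lookup does, and emits grades on the same scale, so swapping it in changes calibration, not interface.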
Status
review-12b is not yet integrated into the live grade pipeline. Until it is, every grade you see on a report is produced by the deterministic lookup described on the methodology page. We will publish the diff metrics — agreement, calibration error, and grade-shift histogram — on this page when the model is wired in for real traffic.
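The three diff metrics named above can be computed from paired grades on the same papers. The grade values and the rank-distance proxy for calibration error below are assumptions for illustration, not the published metric definitions:

```python
from collections import Counter

# Hypothetical sketch of the diff metrics: agreement rate, a simple
# calibration-error proxy (mean absolute shift in grade rank), and a
# signed grade-shift histogram. Grade scale is illustrative.

def diff_metrics(lookup_grades, model_grades, grade_order):
    idx = {g: i for i, g in enumerate(grade_order)}
    pairs = list(zip(lookup_grades, model_grades))
    agreement = sum(a == b for a, b in pairs) / len(pairs)
    # Mean absolute rank shift, used here as a calibration-error proxy.
    mae = sum(abs(idx[a] - idx[b]) for a, b in pairs) / len(pairs)
    # Histogram of signed shifts (model rank minus lookup rank).
    shifts = Counter(idx[b] - idx[a] for a, b in pairs)
    return agreement, mae, dict(shifts)

agreement, mae, shifts = diff_metrics(
    ["A", "B", "B", "C"], ["A", "B", "C", "C"], ["A", "B", "C", "D"])
```

A grade-shift histogram of mostly zeros with a thin symmetric tail would indicate the learned model tracks the deterministic lookup closely; a skewed tail would indicate a systematic grade drift worth publishing.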
Caveats — what this doesn't measure
- The eLife slice is biased toward life-sciences manuscripts that survived editorial triage. It is not a random sample of the preprint universe.
- PREreview reviews are CC-BY 4.0, but reviewers self-select onto the platform; their reviews may differ systematically from journal peer review.
- The 1,570 self-generated reviews exist only to test whether the model can reproduce its own consensus stably across repeats. They are excluded from any held-out evaluation set.
- Agreement with eLife's public-review labels is not the same as agreement with the eLife reviewers' underlying judgement about quality. We measure label match, not truth.
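The self-consistency check mentioned in the caveats (reproducing its own consensus stably across repeats) can be sketched as a modal-agreement score over repeated gradings of the same paper. The repeat count and any pass threshold are assumptions, not documented values:

```python
from collections import Counter

# Hypothetical sketch of a self-consistency check: grade one paper
# several times and measure how often the repeats match their own
# modal grade. Grades and repeat counts are illustrative.

def repeat_stability(repeat_grades):
    """repeat_grades: grades from independent repeats on one paper.
    Returns the fraction of repeats matching the modal grade."""
    mode, count = Counter(repeat_grades).most_common(1)[0]
    return count / len(repeat_grades)

stability = repeat_stability(["B3", "B3", "B3", "B2", "B3"])
```

Because the 1,570 self-generated reviews exist only for this kind of stability measurement, they never appear in held-out evaluation, keeping the corpus from scoring the model against its own outputs.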
Code & attribution
Pipeline orchestration: agents/agentic_review.py. PREreview reviews are reproduced under CC-BY 4.0; eLife public reviews are reproduced under their open licensing.