Literature context
For every paper we score, we pull the most-related published work from Semantic Scholar and surface plausible missing citations. When the upstream API rate-limits us, we mark the gap explicitly rather than synthesise context from thin air.
Methodology
When a paper is queued for assessment, the worker calls Semantic Scholar's /paper/search and /paper/{id}/references endpoints to assemble three pieces of context that the reviewer agents then consume:
- Related work. Up to 10 papers ranked by S2's relevance score against the manuscript title and abstract. The agents see titles, venues, and brief excerpts — not full text.
- Potentially missing citations. Highly cited papers (top decile within the field, by S2 citation count) that match the manuscript's claims but are absent from its reference list.
- Citation graph context. First-degree forward and backward citations, used to flag when a paper is making claims about prior work that look isolated from the citation graph.
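A minimal sketch of how these requests could be shaped against the Semantic Scholar Graph API, assuming its public endpoint layout; the helper names, field lists, and the decile cut-off function are illustrative, not the production worker's code:

```python
from urllib.parse import urlencode

S2_BASE = "https://api.semanticscholar.org/graph/v1"  # Graph API base URL

def related_work_request(title: str, abstract: str, limit: int = 10) -> str:
    """Build the /paper/search request used to rank related work.

    The query combines title and abstract text; S2 orders results by its
    own relevance score, so the caller simply caps at `limit`.
    """
    params = {
        "query": f"{title} {abstract}"[:300],  # truncation length is an assumption
        "limit": limit,
        "fields": "title,venue,abstract",  # agents see excerpts, not full text
    }
    return f"{S2_BASE}/paper/search?{urlencode(params)}"

def references_request(paper_id: str) -> str:
    """Build the /paper/{id}/references request for citation-graph context."""
    params = {"fields": "title,citationCount"}
    return f"{S2_BASE}/paper/{paper_id}/references?{urlencode(params)}"

def top_decile_threshold(citation_counts: list[int]) -> int:
    """Citation count at the 90th percentile: one way to realise the
    'top decile within the field' cut for missing-citation candidates."""
    ranked = sorted(citation_counts)
    return ranked[int(0.9 * (len(ranked) - 1))]
```

The field names (`title`, `venue`, `abstract`, `citationCount`) follow the Graph API's `fields` parameter convention; any paper whose count clears `top_decile_threshold` and is absent from the manuscript's reference list becomes a missing-citation candidate.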
Example output line from a recent assessment in the daily run: 10 related works fetched, 2 potentially missing citations flagged, 17 references resolved against S2 corpus, 3 references unresolved. The DOI for that example: 10.1101/2024.06.14.598985.
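The summary line is assembled from four counters; a trivial sketch (the function name and parameter names are illustrative):

```python
def summary_line(related: int, missing: int, resolved: int, unresolved: int) -> str:
    """Render the per-assessment literature-context summary line."""
    return (f"{related} related works fetched, "
            f"{missing} potentially missing citations flagged, "
            f"{resolved} references resolved against S2 corpus, "
            f"{unresolved} references unresolved")
```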
Honest coverage
The Semantic Scholar API rate-limits unauthenticated callers and returns 429 Too Many Requests when our worker sends bursts of requests. Two consequences follow:
- Roughly 58.2% of current production assessments carry a populated literature_context block. The remainder display an explicit "literature context unavailable" line on the report rather than empty space.
- The 06:07 UTC backfill cron retries up to 500 papers per night, prioritising those with the highest reader traffic. Coverage rises with the cron rather than on the next page load.
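The 429 handling can be sketched as a small retry loop that backs off and, on exhaustion, returns nothing so the report renders the explicit "unavailable" line instead of a fabricated context. This is a minimal illustration, not the production worker's retry policy; the function name, signature, and attempt counts are assumptions:

```python
import time

def fetch_with_backoff(fetch, attempts: int = 3, base_delay: float = 1.0):
    """Call `fetch()` (returning an (http_status, body) pair) and back off
    exponentially on 429.

    Returns the context body on success, or None on exhaustion — the report
    then shows "literature context unavailable" and the nightly backfill
    cron retries the paper later.
    """
    for attempt in range(attempts):
        status, body = fetch()
        if status == 200:
            return body
        if status == 429:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
            continue
        break  # any other error: give up immediately, mark the gap
    return None
```

Returning None rather than a partial or guessed context is what keeps the coverage number honest: the gap is visible on the report and closed by the cron, not papered over.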
Caveats — what this doesn't measure
- Semantic Scholar's coverage is itself uneven across fields. Biomedicine and CS are well-served; humanities and some engineering subfields are sparse.
- "Potentially missing citation" is a heuristic, not a verdict. The reviewer agent must still judge whether the suggested paper actually belongs in the manuscript's reference list.
- Citation counts on S2 lag the live publication record by weeks. A genuinely landmark recent paper may not yet show as "highly cited".
- Pre-Apr-13 papers in our corpus were assessed before the Semantic Scholar integration shipped; they carry no literature_context. The daily backfill cron is closing this gap.
Code & attribution
S2 client and worker integration: agents/. The Semantic Scholar API and citation graph are the property of the Allen Institute for AI and are used in accordance with their terms.