bioRxiv API Reference

Overview

The bioRxiv API provides programmatic access to preprint metadata from the bioRxiv server. The API returns JSON-formatted data with comprehensive metadata about life sciences preprints.

Base URL

https://api.biorxiv.org

Rate Limiting

Keep request volume low to avoid overloading the API:

  • Add delays between requests (minimum 0.5 seconds recommended)
  • Use appropriate User-Agent headers
  • Cache results when possible
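The delay recommendation above can be enforced in code rather than left to scattered sleep calls. The following is a minimal sketch, not part of the API: the `Throttle` class and the `HEADERS` value are hypothetical names, and the 0.5 second default simply mirrors the recommendation above.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        """Sleep just long enough to honor min_interval; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


# Hypothetical User-Agent value; replace with your own tool name and contact.
HEADERS = {"User-Agent": "my-biorxiv-client/0.1 (mailto:you@example.org)"}
```

Calling `throttle.wait()` immediately before every request keeps the spacing consistent even when requests are issued from several places in a script.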

API Endpoints

1. Details by Date Range

Retrieve preprints posted within a specific date range.

Endpoint:

GET /details/biorxiv/{start_date}/{end_date}
GET /details/biorxiv/{start_date}/{end_date}/{category}

Parameters:

  • start_date: Start date in YYYY-MM-DD format
  • end_date: End date in YYYY-MM-DD format
  • category (optional): Filter by subject category

Example:

GET https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-01-31
GET https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-01-31/neuroscience
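
Building these URLs by hand invites malformed dates. A small helper can validate the inputs first; this is a sketch under the path layout documented above, and `details_url` is a hypothetical name rather than a function from any library.

```python
from datetime import date

BASE_URL = "https://api.biorxiv.org"


def details_url(start_date, end_date, category=None):
    """Build a date-range URL using the path layout documented above."""
    start = date.fromisoformat(start_date)  # raises ValueError on bad input
    end = date.fromisoformat(end_date)
    if start > end:
        raise ValueError("start_date must not be after end_date")
    url = f"{BASE_URL}/details/biorxiv/{start.isoformat()}/{end.isoformat()}"
    if category:
        # Optional category segment, as in the second example above.
        url += f"/{category}"
    return url


print(details_url("2024-01-01", "2024-01-31", "neuroscience"))
```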

Response:

{
  "messages": [
    {
      "status": "ok",
      "count": 150,
      "total": 150
    }
  ],
  "collection": [
    {
      "doi": "10.1101/2024.01.15.123456",
      "title": "Example Paper Title",
      "authors": "Smith J, Doe J, Johnson A",
      "author_corresponding": "Smith J",
      "author_corresponding_institution": "University Example",
      "date": "2024-01-15",
      "version": "1",
      "type": "new results",
      "license": "cc_by",
      "category": "neuroscience",
      "jatsxml": "https://www.biorxiv.org/content/...",
      "abstract": "This is the abstract...",
      "published": ""
    }
  ]
}
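
To illustrate how the two top-level keys are typically consumed, the sketch below parses a trimmed copy of the sample response (one record, so `count` and `total` are set to 1 here) and prints one line per paper. The `summarize` helper is a hypothetical name.

```python
import json

# Trimmed copy of the sample response above, reduced to a single record.
SAMPLE_RESPONSE = """
{
  "messages": [{"status": "ok", "count": 1, "total": 1}],
  "collection": [
    {
      "doi": "10.1101/2024.01.15.123456",
      "title": "Example Paper Title",
      "date": "2024-01-15",
      "version": "1",
      "category": "neuroscience",
      "published": ""
    }
  ]
}
"""


def summarize(payload):
    """Return the first messages entry and one summary line per paper."""
    message = payload["messages"][0]
    lines = [
        f'{p["date"]}  {p["doi"]}v{p["version"]}  {p["title"]}'
        for p in payload["collection"]
    ]
    return message, lines


message, lines = summarize(json.loads(SAMPLE_RESPONSE))
print(f'{message["count"]} of {message["total"]} records returned')
for line in lines:
    print(line)
```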

2. Details by DOI

Retrieve details for a specific preprint by DOI.

Endpoint:

GET /details/biorxiv/{doi}

Parameters:

  • doi: The DOI of the preprint (e.g., 10.1101/2024.01.15.123456)

Example:

GET https://api.biorxiv.org/details/biorxiv/10.1101/2024.01.15.123456
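
DOIs are often pasted as full links, so it can help to normalize them before building the request. This is a sketch only: `doi_details_url` is a hypothetical helper, and the regular expression is a loose sanity check on general DOI shape, not a rule published by bioRxiv.

```python
import re

BASE_URL = "https://api.biorxiv.org"

# Loose shape check (assumption): "10.", a numeric registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_details_url(doi):
    """Strip common link prefixes from a DOI and build the details URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {doi!r}")
    # The slash inside the DOI is kept as-is, matching the example above.
    return f"{BASE_URL}/details/biorxiv/{doi}"


print(doi_details_url("https://doi.org/10.1101/2024.01.15.123456"))
```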

3. Publications by Interval

Retrieve recent publications from a time interval.

Endpoint:

GET /pubs/biorxiv/{interval}/{cursor}/{format}

Parameters:

  • interval: Number of days back to search (e.g., 1 for last 24 hours)
  • cursor: Pagination cursor (0 for first page, increment by 100 for subsequent pages)
  • format: Response format (json or xml)

Example:

GET https://api.biorxiv.org/pubs/biorxiv/1/0/json

The response reports pagination state (count, total, cursor) in the messages entry:

{
  "messages": [
    {
      "status": "ok",
      "count": 100,
      "total": 250,
      "cursor": 100
    }
  ],
  "collection": [...]
}
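
Walking every page means advancing the cursor by 100 until `total` is reached. The generator below sketches that loop. It takes the HTTP call as a parameter: `fetch_json` is a hypothetical callable that receives a path and returns the decoded JSON, which keeps the paging logic independent of any HTTP library.

```python
def iter_pubs(fetch_json, interval, page_size=100):
    """Yield every record for an interval, requesting one page at a time."""
    cursor = 0
    while True:
        payload = fetch_json(f"/pubs/biorxiv/{interval}/{cursor}/json")
        papers = payload.get("collection", [])
        yield from papers
        message = (payload.get("messages") or [{}])[0]
        total = int(message.get("total", 0))
        cursor += page_size  # the cursor grows by 100 per page, as described above
        if not papers or cursor >= total:
            break


def fake_fetch(path):
    """Stand-in for a real HTTP call: serves 250 fake records in pages of 100."""
    cursor = int(path.split("/")[4])
    page = [{"doi": f"10.1101/fake.{i}"} for i in range(cursor, min(cursor + 100, 250))]
    return {
        "messages": [{"status": "ok", "count": len(page), "total": 250}],
        "collection": page,
    }


print(len(list(iter_pubs(fake_fetch, interval=1))))
```

In real use, `fetch_json` would wrap an HTTP client and apply the delay recommended under Rate Limiting between pages.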

Valid Categories

bioRxiv organizes preprints into the following categories:

  • animal-behavior-and-cognition
  • biochemistry
  • bioengineering
  • bioinformatics
  • biophysics
  • cancer-biology
  • cell-biology
  • clinical-trials
  • developmental-biology
  • ecology
  • epidemiology
  • evolutionary-biology
  • genetics
  • genomics
  • immunology
  • microbiology
  • molecular-biology
  • neuroscience
  • paleontology
  • pathology
  • pharmacology-and-toxicology
  • physiology
  • plant-biology
  • scientific-communication-and-education
  • synthetic-biology
  • systems-biology
  • zoology
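
User-supplied category names rarely arrive in this exact hyphenated form. A small normalizer, sketched here with the hypothetical name `normalize_category`, can map input such as "Cell Biology" onto the list above and reject anything else before a request is made.

```python
# The category slugs listed above.
VALID_CATEGORIES = frozenset({
    "animal-behavior-and-cognition", "biochemistry", "bioengineering",
    "bioinformatics", "biophysics", "cancer-biology", "cell-biology",
    "clinical-trials", "developmental-biology", "ecology", "epidemiology",
    "evolutionary-biology", "genetics", "genomics", "immunology",
    "microbiology", "molecular-biology", "neuroscience", "paleontology",
    "pathology", "pharmacology-and-toxicology", "physiology", "plant-biology",
    "scientific-communication-and-education", "synthetic-biology",
    "systems-biology", "zoology",
})


def normalize_category(name):
    """Convert free-form input to a hyphenated slug and validate it."""
    words = name.strip().lower().replace("_", " ").replace("-", " ").split()
    slug = "-".join(words)
    if slug not in VALID_CATEGORIES:
        raise ValueError(f"unknown bioRxiv category: {name!r}")
    return slug


print(normalize_category("Cell Biology"))
```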

Paper Metadata Fields

Each paper in the collection array contains:

| Field | Description | Type |
| --- | --- | --- |
| doi | Digital Object Identifier | string |
| title | Paper title | string |
| authors | Comma-separated author list | string |
| author_corresponding | Corresponding author name | string |
| author_corresponding_institution | Corresponding author's institution | string |
| date | Publication date (YYYY-MM-DD) | string |
| version | Version number | string |
| type | Type of submission (e.g., "new results") | string |
| license | License type (e.g., "cc_by") | string |
| category | Subject category | string |
| jatsxml | URL to JATS XML | string |
| abstract | Paper abstract | string |
| published | Journal publication info (if published) | string |
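
Because every field arrives as a string, a thin typed wrapper can make downstream code easier to read. The `Paper` dataclass below is a hypothetical sketch that mirrors the fields listed above; filling missing keys with empty strings is an assumption made for convenience, not documented API behavior.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Paper:
    doi: str
    title: str
    authors: str
    author_corresponding: str
    author_corresponding_institution: str
    date: str
    version: str
    type: str
    license: str
    category: str
    jatsxml: str
    abstract: str
    published: str

    @classmethod
    def from_record(cls, record):
        """Build a Paper from one collection entry, ignoring unknown keys."""
        names = [f.name for f in fields(cls)]
        return cls(**{name: str(record.get(name, "")) for name in names})

    def author_list(self):
        """Split the comma-separated authors string into individual names."""
        return [a.strip() for a in self.authors.split(",") if a.strip()]


paper = Paper.from_record(
    {"doi": "10.1101/2024.01.15.123456", "authors": "Smith J, Doe J, Johnson A", "version": "1"}
)
print(paper.author_list())
```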

Downloading Full Papers

PDF Download

PDFs are downloaded directly from the bioRxiv website rather than through the API:

https://www.biorxiv.org/content/{doi}v{version}.full.pdf

Example:

https://www.biorxiv.org/content/10.1101/2024.01.15.123456v1.full.pdf

HTML Version

https://www.biorxiv.org/content/{doi}v{version}
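
Both URL templates differ only in their suffix, so they can share one helper. A minimal sketch, with `html_url` and `pdf_url` as hypothetical names; the version is accepted as either an integer or the string form the API returns.

```python
CONTENT_BASE = "https://www.biorxiv.org/content"


def html_url(doi, version=1):
    """Landing-page URL for a given DOI and version, per the template above."""
    number = int(version)  # the API returns version as a string such as "1"
    if number < 1:
        raise ValueError("version must be 1 or greater")
    return f"{CONTENT_BASE}/{doi}v{number}"


def pdf_url(doi, version=1):
    """Full-text PDF URL: the landing-page URL plus the .full.pdf suffix."""
    return html_url(doi, version) + ".full.pdf"


print(pdf_url("10.1101/2024.01.15.123456", "1"))
```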

JATS XML

Full structured XML is available via the jatsxml field in the API response.

Common Search Patterns

Author Search

  1. Get papers from date range
  2. Filter by author name (case-insensitive substring match in authors field)

Keyword Search

  1. Get papers from date range (optionally filtered by category)
  2. Search in title, abstract, or both fields
  3. Filter papers containing keywords (case-insensitive)
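
Both patterns above filter client-side, after the date-range results have been fetched. They can be sketched as two small functions. The names are hypothetical, and matching a paper when any one keyword appears (rather than all of them) is an assumption, since the steps above do not specify it.

```python
def filter_by_author(papers, author_name):
    """Case-insensitive substring match against the authors field."""
    needle = author_name.casefold()
    return [p for p in papers if needle in p.get("authors", "").casefold()]


def filter_by_keywords(papers, keywords, search_fields=("title", "abstract")):
    """Keep papers where any keyword occurs in the chosen fields."""
    needles = [k.casefold() for k in keywords]
    matches = []
    for paper in papers:
        haystack = " ".join(paper.get(f, "") for f in search_fields).casefold()
        if any(n in haystack for n in needles):
            matches.append(paper)
    return matches


papers = [
    {"title": "CRISPR screens in neurons", "abstract": "We edit genes.", "authors": "Smith J, Doe J"},
    {"title": "Soil microbiomes", "abstract": "A field survey.", "authors": "Johnson A"},
]
print([p["title"] for p in filter_by_author(papers, "smith")])
print([p["title"] for p in filter_by_keywords(papers, ["crispr", "gene editing"])])
```

Passing `search_fields=("title",)` restricts the keyword search to titles only.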

Recent Papers by Category

  1. Use /pubs/biorxiv/{interval}/0/json endpoint
  2. Filter by category if needed
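
These two steps can be sketched as follows. The helper names are hypothetical, and the sketch assumes each returned record carries a `category` field like the one shown in the sample response earlier.

```python
from collections import Counter


def recent_pubs_path(interval_days):
    """First-page path for the last interval_days days (cursor 0, JSON)."""
    if interval_days < 1:
        raise ValueError("interval_days must be 1 or greater")
    return f"/pubs/biorxiv/{interval_days}/0/json"


def in_category(papers, category):
    """Keep only records whose category matches, ignoring case and padding."""
    wanted = category.strip().lower()
    return [p for p in papers if p.get("category", "").strip().lower() == wanted]


def category_counts(papers):
    """Tally records per category, useful for deciding what to filter on."""
    return Counter(p.get("category", "").strip().lower() for p in papers)


recent = [
    {"doi": "10.1101/a", "category": "neuroscience"},
    {"doi": "10.1101/b", "category": "genomics"},
    {"doi": "10.1101/c", "category": "Neuroscience"},
]
print(recent_pubs_path(7))
print(category_counts(recent).most_common())
```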

Error Handling

Common HTTP status codes:

  • 200: Success
  • 404: Resource not found
  • 500: Server error

A 200 response can still describe a failed query, so always check the status field in the messages array:

{
  "messages": [
    {
      "status": "ok",
      "count": 100
    }
  ]
}
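
Both checks, the HTTP status code and the `messages` status, can live in one place. A minimal sketch: `BioRxivAPIError` and `check_response` are hypothetical names, and treating any status other than "ok" as a failure is an assumption based on the sample above.

```python
class BioRxivAPIError(RuntimeError):
    """Raised when a response does not describe a successful query."""


def check_response(http_status, payload):
    """Validate a decoded response and return its first messages entry."""
    if http_status != 200:
        raise BioRxivAPIError(f"HTTP {http_status}")
    messages = payload.get("messages") or []
    if not messages:
        raise BioRxivAPIError("response has no messages entry")
    message = messages[0]
    if message.get("status") != "ok":
        raise BioRxivAPIError(f"API status: {message.get('status')!r}")
    return message


info = check_response(200, {"messages": [{"status": "ok", "count": 100}]})
print(info["count"])
```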

Best Practices

  1. Cache results: Store retrieved papers to avoid repeated API calls
  2. Use appropriate date ranges: Smaller date ranges return faster
  3. Filter by category: Reduces data transfer and processing time
  4. Batch processing: When downloading multiple PDFs, add delays between requests
  5. Error handling: Always check response status and handle errors gracefully
  6. Version tracking: Note that papers can have multiple versions
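
As an illustration of the version-tracking point, a date-range query may return several versions of the same preprint. The sketch below keeps only the highest version per DOI; `latest_versions` is a hypothetical helper and assumes `version` is a numeric string, as in the sample response.

```python
def latest_versions(papers):
    """Collapse multiple versions of a preprint to the newest one per DOI."""
    latest = {}
    for paper in papers:
        doi = paper["doi"]
        current = latest.get(doi)
        # Compare as integers so that version "10" ranks above version "9".
        if current is None or int(paper["version"]) > int(current["version"]):
            latest[doi] = paper
    return list(latest.values())


records = [
    {"doi": "10.1101/a", "version": "1"},
    {"doi": "10.1101/b", "version": "9"},
    {"doi": "10.1101/a", "version": "2"},
    {"doi": "10.1101/b", "version": "10"},
]
print(latest_versions(records))
```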

Python Usage Example

from biorxiv_search import BioRxivSearcher

searcher = BioRxivSearcher(verbose=True)

# Search by keywords
papers = searcher.search_by_keywords(
    keywords=["CRISPR", "gene editing"],
    start_date="2024-01-01",
    end_date="2024-12-31",
    category="genomics"
)

# Search by author
papers = searcher.search_by_author(
    author_name="Smith",
    start_date="2023-01-01",
    end_date="2024-12-31"
)

# Get specific paper
paper = searcher.get_paper_details("10.1101/2024.01.15.123456")

# Download PDF
searcher.download_pdf("10.1101/2024.01.15.123456", "paper.pdf")

External Resources