# bioRxiv API Reference

## Overview

The bioRxiv API provides programmatic access to preprint metadata from the bioRxiv server. The API returns JSON-formatted data with comprehensive metadata about life sciences preprints.

## Base URL

```
https://api.biorxiv.org
```

## Rate Limiting

Be respectful of the API:

- Add delays between requests (minimum 0.5 seconds recommended)
- Use appropriate User-Agent headers
- Cache results when possible

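
One way to enforce the recommended delay is a small throttle object that is called before every request. This is a minimal sketch, not part of the API: the `Throttle` class and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; 0.5 is the recommended minimum
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each HTTP request keeps a loop of requests at or below two per second.
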
## API Endpoints

### 1. Details by Date Range

Retrieve preprints posted within a specific date range.

**Endpoint:**
```
GET /details/biorxiv/{start_date}/{end_date}
GET /details/biorxiv/{start_date}/{end_date}/{category}
```

**Parameters:**
- `start_date`: Start date in YYYY-MM-DD format
- `end_date`: End date in YYYY-MM-DD format
- `category` (optional): Filter by subject category

**Example:**
```
GET https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-01-31
GET https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-01-31/neuroscience
```

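
The URL pattern above can be assembled with a small helper. The sketch below is illustrative only: `details_url` and `BASE_URL` are hypothetical names, and the optional category is appended as a path segment exactly as this reference describes it.

```python
from datetime import date

BASE_URL = "https://api.biorxiv.org"


def details_url(start_date, end_date, category=None):
    """Build a date-range URL for the details endpoint described above."""
    # fromisoformat() raises ValueError on malformed dates;
    # isoformat() re-emits the value as YYYY-MM-DD.
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    if start > end:
        raise ValueError("start_date must not be after end_date")
    url = f"{BASE_URL}/details/biorxiv/{start.isoformat()}/{end.isoformat()}"
    if category:
        url += f"/{category}"
    return url


print(details_url("2024-01-01", "2024-01-31", "neuroscience"))
```
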
**Response:**
```json
{
  "messages": [
    {
      "status": "ok",
      "count": 150,
      "total": 150
    }
  ],
  "collection": [
    {
      "doi": "10.1101/2024.01.15.123456",
      "title": "Example Paper Title",
      "authors": "Smith J, Doe J, Johnson A",
      "author_corresponding": "Smith J",
      "author_corresponding_institution": "University Example",
      "date": "2024-01-15",
      "version": "1",
      "type": "new results",
      "license": "cc_by",
      "category": "neuroscience",
      "jatsxml": "https://www.biorxiv.org/content/...",
      "abstract": "This is the abstract...",
      "published": ""
    }
  ]
}
```

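
As a rough illustration of how a response with this shape might be unpacked, the sketch below reads the first entry of `messages` and pulls the DOI and title out of each record. The `summarize` function is a hypothetical helper, and the payload is a trimmed copy of the sample above rather than real API output.

```python
def summarize(payload):
    """Return (status, count, [(doi, title), ...]) from a details response."""
    # Fall back to an empty message so a malformed payload does not raise.
    first = (payload.get("messages") or [{}])[0]
    papers = [(p.get("doi"), p.get("title")) for p in payload.get("collection", [])]
    return first.get("status"), first.get("count"), papers


sample = {
    "messages": [{"status": "ok", "count": 1, "total": 1}],
    "collection": [
        {"doi": "10.1101/2024.01.15.123456", "title": "Example Paper Title"}
    ],
}

status, count, papers = summarize(sample)
print(status, count, papers)
```
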
### 2. Details by DOI

Retrieve details for a specific preprint by DOI.

**Endpoint:**
```
GET /details/biorxiv/{doi}
```

**Parameters:**
- `doi`: The DOI of the preprint (e.g., `10.1101/2024.01.15.123456`)

**Example:**
```
GET https://api.biorxiv.org/details/biorxiv/10.1101/2024.01.15.123456
```

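
DOIs are often copied with a resolver prefix such as `https://doi.org/`. A minimal sketch of a helper that strips such prefixes before building the lookup URL is shown below; `doi_details_url` is a hypothetical name, and the prefix list is an assumption about common input forms, not something the API defines.

```python
def doi_details_url(doi):
    """Build the DOI lookup URL, accepting a bare DOI or a doi.org link."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return f"https://api.biorxiv.org/details/biorxiv/{doi}"


print(doi_details_url("https://doi.org/10.1101/2024.01.15.123456"))
```
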
### 3. Publications by Interval

Retrieve recent publications from a time interval.

**Endpoint:**
```
GET /pubs/biorxiv/{interval}/{cursor}/{format}
```

**Parameters:**
- `interval`: Number of days back to search (e.g., `1` for last 24 hours)
- `cursor`: Pagination cursor (0 for first page, increment by 100 for subsequent pages)
- `format`: Response format (`json` or `xml`)

**Example:**
```
GET https://api.biorxiv.org/pubs/biorxiv/1/0/json
```

**Response includes pagination:**
```json
{
  "messages": [
    {
      "status": "ok",
      "count": 100,
      "total": 250,
      "cursor": 100
    }
  ],
  "collection": [...]
}
```

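
The cursor scheme above lends itself to a simple loop: request a page, yield its records, advance the cursor by 100, and stop once the cursor reaches `total`. The sketch below assumes the response shape shown in this section; `iter_collection` and `fetch_pubs_page` are hypothetical names, and the page fetcher is passed in so the loop can be exercised without network access.

```python
import json
from urllib.request import urlopen


def fetch_pubs_page(cursor, interval=1):
    """Fetch one JSON page from the pubs endpoint (performs a network call)."""
    url = f"https://api.biorxiv.org/pubs/biorxiv/{interval}/{cursor}/json"
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_collection(fetch_page, page_size=100):
    """Yield every record, advancing the cursor until total is reached."""
    cursor = 0
    while True:
        payload = fetch_page(cursor)
        records = payload.get("collection", [])
        yield from records
        message = (payload.get("messages") or [{}])[0]
        # int() tolerates a total that arrives as a numeric string.
        total = int(message.get("total", 0))
        cursor += page_size
        if not records or cursor >= total:
            break
```

In real use, `for record in iter_collection(fetch_pubs_page): ...` would walk all pages; adding a delay inside the fetcher keeps the loop within the rate-limiting guidance.
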
## Valid Categories

bioRxiv organizes preprints into the following categories:

- `animal-behavior-and-cognition`
- `biochemistry`
- `bioengineering`
- `bioinformatics`
- `biophysics`
- `cancer-biology`
- `cell-biology`
- `clinical-trials`
- `developmental-biology`
- `ecology`
- `epidemiology`
- `evolutionary-biology`
- `genetics`
- `genomics`
- `immunology`
- `microbiology`
- `molecular-biology`
- `neuroscience`
- `paleontology`
- `pathology`
- `pharmacology-and-toxicology`
- `physiology`
- `plant-biology`
- `scientific-communication-and-education`
- `synthetic-biology`
- `systems-biology`
- `zoology`

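
User-supplied category names rarely arrive in slug form. The following sketch normalizes free-form input to the hyphenated slugs listed above and rejects anything not in the list. `normalize_category` and `VALID_CATEGORIES` are hypothetical names, and the sketch assumes the hyphenated spelling in this list is what a request should carry.

```python
VALID_CATEGORIES = frozenset({
    "animal-behavior-and-cognition", "biochemistry", "bioengineering",
    "bioinformatics", "biophysics", "cancer-biology", "cell-biology",
    "clinical-trials", "developmental-biology", "ecology", "epidemiology",
    "evolutionary-biology", "genetics", "genomics", "immunology",
    "microbiology", "molecular-biology", "neuroscience", "paleontology",
    "pathology", "pharmacology-and-toxicology", "physiology", "plant-biology",
    "scientific-communication-and-education", "synthetic-biology",
    "systems-biology", "zoology",
})


def normalize_category(name):
    """Turn 'Cell Biology' or 'cell_biology' into 'cell-biology', or raise."""
    slug = "-".join(name.strip().lower().replace("_", " ").split())
    if slug not in VALID_CATEGORIES:
        raise ValueError(f"unknown bioRxiv category: {name!r}")
    return slug


print(normalize_category("Cell Biology"))
```
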
## Paper Metadata Fields

Each paper in the `collection` array contains:

| Field | Description | Type |
|-------|-------------|------|
| `doi` | Digital Object Identifier | string |
| `title` | Paper title | string |
| `authors` | Comma-separated author list | string |
| `author_corresponding` | Corresponding author name | string |
| `author_corresponding_institution` | Corresponding author's institution | string |
| `date` | Publication date (YYYY-MM-DD) | string |
| `version` | Version number | string |
| `type` | Type of submission (e.g., "new results") | string |
| `license` | License type (e.g., "cc_by") | string |
| `category` | Subject category | string |
| `jatsxml` | URL to JATS XML | string |
| `abstract` | Paper abstract | string |
| `published` | Journal publication info (if published) | string |

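
If typed access is preferable to raw dictionaries, the fields in the table map naturally onto a dataclass. This is a sketch under the assumption that every field is a string, as the table states; the `Paper` class, `from_record`, and `author_list` are hypothetical names, and unknown keys are silently ignored.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Paper:
    """One entry of the collection array; all fields default to ''."""
    doi: str = ""
    title: str = ""
    authors: str = ""
    author_corresponding: str = ""
    author_corresponding_institution: str = ""
    date: str = ""
    version: str = ""
    type: str = ""
    license: str = ""
    category: str = ""
    jatsxml: str = ""
    abstract: str = ""
    published: str = ""

    @classmethod
    def from_record(cls, record):
        """Build a Paper from a response dict, dropping unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: str(v) for k, v in record.items() if k in known})

    @property
    def author_list(self):
        """Split the comma-separated authors string into individual names."""
        return [a.strip() for a in self.authors.split(",") if a.strip()]
```
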
## Downloading Full Papers

### PDF Download

PDFs are downloaded directly from the bioRxiv website, not through the API:

```
https://www.biorxiv.org/content/{doi}v{version}.full.pdf
```

Example:
```
https://www.biorxiv.org/content/10.1101/2024.01.15.123456v1.full.pdf
```

### HTML Version

```
https://www.biorxiv.org/content/{doi}v{version}
```

### JATS XML

Full structured XML is available via the `jatsxml` field in the API response.

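
Both URL patterns above derive from the same DOI and version pair, so they can be produced together. The helper name `content_urls` below is hypothetical; the patterns themselves are the ones given in this section.

```python
def content_urls(doi, version):
    """Return the HTML and PDF URLs for one version of a preprint."""
    base = f"https://www.biorxiv.org/content/{doi}v{version}"
    return {"html": base, "pdf": f"{base}.full.pdf"}


urls = content_urls("10.1101/2024.01.15.123456", "1")
print(urls["pdf"])
```
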
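## Common Search Patterns

The API has no search endpoint for authors or keywords, so these patterns fetch a date range first and filter the results on the client.

### Author Search

1. Get papers from date range
2. Filter by author name (case-insensitive substring match in `authors` field)
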
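
The filtering step of the author search can be sketched as a one-line comprehension over records that have already been fetched. `filter_by_author` is a hypothetical name; because it is a plain substring match, a query such as "Smith" also matches "Smithson".

```python
def filter_by_author(papers, author_name):
    """Keep papers whose authors string contains author_name, ignoring case."""
    needle = author_name.casefold()
    return [p for p in papers if needle in p.get("authors", "").casefold()]


papers = [
    {"doi": "10.1101/a", "authors": "Smith J, Doe J"},
    {"doi": "10.1101/b", "authors": "Johnson A"},
]
print(filter_by_author(papers, "smith"))
```
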
### Keyword Search

1. Get papers from date range (optionally filtered by category)
2. Search in title, abstract, or both fields
3. Filter papers containing keywords (case-insensitive)

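
The keyword steps above can be sketched as a filter that joins the chosen fields and tests each keyword against the result. `filter_by_keywords` and its `match_all` switch are hypothetical; the steps do not say whether one keyword or all keywords must match, so the sketch defaults to any.

```python
def filter_by_keywords(papers, keywords, fields=("title", "abstract"), match_all=False):
    """Keep papers whose chosen fields contain the keywords, ignoring case."""
    needles = [k.casefold() for k in keywords]
    combine = all if match_all else any
    matches = []
    for paper in papers:
        # Join the selected fields so one pass covers title and abstract.
        haystack = " ".join(paper.get(f, "") for f in fields).casefold()
        if combine(n in haystack for n in needles):
            matches.append(paper)
    return matches


papers = [
    {"doi": "10.1101/a", "title": "CRISPR screens", "abstract": "Gene editing at scale."},
    {"doi": "10.1101/b", "title": "Bird song", "abstract": "A CRISPR-free study."},
]
print([p["doi"] for p in filter_by_keywords(papers, ["crispr"], fields=("title",))])
```
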
### Recent Papers by Category

1. Use `/pubs/biorxiv/{interval}/0/json` endpoint
2. Filter by category if needed

## Error Handling

Common HTTP status codes:
- `200`: Success
- `404`: Resource not found
- `500`: Server error

Always check the `status` value in the `messages` array of the response as well as the HTTP status code:
```json
{
  "messages": [
    {
      "status": "ok",
      "count": 100
    }
  ]
}
```

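
Both checks can be combined in one function that raises on any failure. In this sketch `BioRxivAPIError` and `check_response` are hypothetical names, and treating every `status` other than `"ok"` as an error is an assumption, since only the success value is shown here.

```python
class BioRxivAPIError(Exception):
    """Raised when the HTTP status or the messages array signals a problem."""


def check_response(status_code, payload):
    """Validate a response and return its first message entry."""
    if status_code != 200:
        raise BioRxivAPIError(f"HTTP {status_code}")
    messages = payload.get("messages") or []
    if not messages:
        raise BioRxivAPIError("response has no messages array")
    status = messages[0].get("status")
    if status != "ok":
        raise BioRxivAPIError(f"API status: {status!r}")
    return messages[0]


print(check_response(200, {"messages": [{"status": "ok", "count": 100}]}))
```
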
## Best Practices

1. **Cache results**: Store retrieved papers to avoid repeated API calls
2. **Use appropriate date ranges**: Smaller date ranges return faster
3. **Filter by category**: Reduces data transfer and processing time
4. **Batch processing**: When downloading multiple PDFs, add delays between requests
5. **Error handling**: Always check response status and handle errors gracefully
6. **Version tracking**: Note that papers can have multiple versions

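
For the version-tracking point, one possible approach is to collapse a result set to the highest version seen for each DOI. The sketch assumes `version` is a numeric string, as in the sample responses; `latest_versions` is a hypothetical name.

```python
def latest_versions(papers):
    """Keep only the highest-numbered version of each DOI."""
    latest = {}
    for paper in papers:
        doi = paper["doi"]
        if doi not in latest or int(paper["version"]) > int(latest[doi]["version"]):
            latest[doi] = paper
    return list(latest.values())


papers = [
    {"doi": "10.1101/a", "version": "1"},
    {"doi": "10.1101/a", "version": "3"},
    {"doi": "10.1101/b", "version": "1"},
    {"doi": "10.1101/a", "version": "2"},
]
print(latest_versions(papers))
```
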
## Python Usage Example

```python
from biorxiv_search import BioRxivSearcher

searcher = BioRxivSearcher(verbose=True)

# Search by keywords
papers = searcher.search_by_keywords(
    keywords=["CRISPR", "gene editing"],
    start_date="2024-01-01",
    end_date="2024-12-31",
    category="genomics"
)

# Search by author
papers = searcher.search_by_author(
    author_name="Smith",
    start_date="2023-01-01",
    end_date="2024-12-31"
)

# Get specific paper
paper = searcher.get_paper_details("10.1101/2024.01.15.123456")

# Download PDF
searcher.download_pdf("10.1101/2024.01.15.123456", "paper.pdf")
```

## External Resources

- bioRxiv homepage: https://www.biorxiv.org/
- API documentation: https://api.biorxiv.org/
- JATS XML specification: https://jats.nlm.nih.gov/