Initial commit
477 skills/biorxiv-database/SKILL.md Normal file
@@ -0,0 +1,477 @@
---
name: biorxiv-database
description: Efficient database search tool for the bioRxiv preprint server. Use this skill when searching for life sciences preprints by keywords, authors, date ranges, or categories, retrieving paper metadata, downloading PDFs, or conducting literature reviews.
---

# bioRxiv Database

## Overview

This skill provides efficient Python-based tools for searching and retrieving preprints from the bioRxiv database. It enables comprehensive searches by keywords, authors, date ranges, and categories, returning structured JSON metadata that includes titles, abstracts, DOIs, and citation information. The skill also supports PDF downloads for full-text analysis.

## When to Use This Skill

Use this skill when:
- Searching for recent preprints in specific research areas
- Tracking publications by particular authors
- Conducting systematic literature reviews
- Analyzing research trends over time periods
- Retrieving metadata for citation management
- Downloading preprint PDFs for analysis
- Filtering papers by bioRxiv subject categories

## Core Search Capabilities

### 1. Keyword Search

Search for preprints containing specific keywords in titles, abstracts, or author lists.

**Basic Usage:**
```bash
python scripts/biorxiv_search.py \
    --keywords "CRISPR" "gene editing" \
    --start-date 2024-01-01 \
    --end-date 2024-12-31 \
    --output results.json
```

**With Category Filter:**
```bash
python scripts/biorxiv_search.py \
    --keywords "neural networks" "deep learning" \
    --days-back 180 \
    --category neuroscience \
    --output recent_neuroscience.json
```

**Search Fields:**
By default, keywords are searched in both title and abstract. Customize with `--search-fields`:
```bash
python scripts/biorxiv_search.py \
    --keywords "AlphaFold" \
    --search-fields title \
    --days-back 365
```

### 2. Author Search

Find all papers by a specific author within a date range.

**Basic Usage:**
```bash
python scripts/biorxiv_search.py \
    --author "Smith" \
    --start-date 2023-01-01 \
    --end-date 2024-12-31 \
    --output smith_papers.json
```

**Recent Publications:**
```bash
# Defaults to the last 3 years if no dates are specified
python scripts/biorxiv_search.py \
    --author "Johnson" \
    --output johnson_recent.json
```

### 3. Date Range Search

Retrieve all preprints posted within a specific date range.

**Basic Usage:**
```bash
python scripts/biorxiv_search.py \
    --start-date 2024-01-01 \
    --end-date 2024-01-31 \
    --output january_2024.json
```

**With Category Filter:**
```bash
python scripts/biorxiv_search.py \
    --start-date 2024-06-01 \
    --end-date 2024-06-30 \
    --category genomics \
    --output genomics_june.json
```

**Days Back Shortcut:**
```bash
# Last 30 days
python scripts/biorxiv_search.py \
    --days-back 30 \
    --output last_month.json
```

### 4. Paper Details by DOI

Retrieve detailed metadata for a specific preprint.

**Basic Usage:**
```bash
python scripts/biorxiv_search.py \
    --doi "10.1101/2024.01.15.123456" \
    --output paper_details.json
```

**Full DOI URLs Accepted:**
```bash
python scripts/biorxiv_search.py \
    --doi "https://doi.org/10.1101/2024.01.15.123456"
```

### 5. PDF Downloads

Download the full-text PDF of any preprint.

**Basic Usage:**
```bash
python scripts/biorxiv_search.py \
    --doi "10.1101/2024.01.15.123456" \
    --download-pdf paper.pdf
```

**Batch Processing:**
For multiple PDFs, extract DOIs from a search result JSON and download each paper:
```python
import json
import os
import time

from biorxiv_search import BioRxivSearcher

# Load search results
with open('results.json') as f:
    data = json.load(f)

searcher = BioRxivSearcher(verbose=True)
os.makedirs("papers", exist_ok=True)

# Download each paper, pausing between requests
for i, paper in enumerate(data['results'][:10]):  # First 10 papers
    doi = paper['doi']
    searcher.download_pdf(doi, f"papers/paper_{i+1}.pdf")
    time.sleep(1)
```

## Valid Categories

Filter searches by bioRxiv subject categories:

- `animal-behavior-and-cognition`
- `biochemistry`
- `bioengineering`
- `bioinformatics`
- `biophysics`
- `cancer-biology`
- `cell-biology`
- `clinical-trials`
- `developmental-biology`
- `ecology`
- `epidemiology`
- `evolutionary-biology`
- `genetics`
- `genomics`
- `immunology`
- `microbiology`
- `molecular-biology`
- `neuroscience`
- `paleontology`
- `pathology`
- `pharmacology-and-toxicology`
- `physiology`
- `plant-biology`
- `scientific-communication-and-education`
- `synthetic-biology`
- `systems-biology`
- `zoology`

## Output Format

All searches return structured JSON with the following format:

```json
{
  "query": {
    "keywords": ["CRISPR"],
    "start_date": "2024-01-01",
    "end_date": "2024-12-31",
    "category": "genomics"
  },
  "result_count": 42,
  "results": [
    {
      "doi": "10.1101/2024.01.15.123456",
      "title": "Paper Title Here",
      "authors": "Smith J, Doe J, Johnson A",
      "author_corresponding": "Smith J",
      "author_corresponding_institution": "University Example",
      "date": "2024-01-15",
      "version": "1",
      "type": "new results",
      "license": "cc_by",
      "category": "genomics",
      "abstract": "Full abstract text...",
      "pdf_url": "https://www.biorxiv.org/content/10.1101/2024.01.15.123456v1.full.pdf",
      "html_url": "https://www.biorxiv.org/content/10.1101/2024.01.15.123456v1",
      "jatsxml": "https://www.biorxiv.org/content/...",
      "published": ""
    }
  ]
}
```

## Common Usage Patterns

### Literature Review Workflow

1. **Broad keyword search:**
```bash
python scripts/biorxiv_search.py \
    --keywords "organoids" "tissue engineering" \
    --start-date 2023-01-01 \
    --end-date 2024-12-31 \
    --category bioengineering \
    --output organoid_papers.json
```

2. **Extract and review results:**
```python
import json

with open('organoid_papers.json') as f:
    data = json.load(f)

print(f"Found {data['result_count']} papers")

for paper in data['results'][:5]:
    print(f"\nTitle: {paper['title']}")
    print(f"Authors: {paper['authors']}")
    print(f"Date: {paper['date']}")
    print(f"DOI: {paper['doi']}")
```

3. **Download selected papers:**
```python
import os

from biorxiv_search import BioRxivSearcher

searcher = BioRxivSearcher()
selected_dois = ["10.1101/2024.01.15.123456", "10.1101/2024.02.20.789012"]
os.makedirs("papers", exist_ok=True)

for doi in selected_dois:
    filename = doi.replace("/", "_").replace(".", "_") + ".pdf"
    searcher.download_pdf(doi, f"papers/{filename}")
```

### Trend Analysis

Track research trends by analyzing publication frequencies over time:

```bash
python scripts/biorxiv_search.py \
    --keywords "machine learning" \
    --start-date 2020-01-01 \
    --end-date 2024-12-31 \
    --category bioinformatics \
    --output ml_trends.json
```

Then analyze the temporal distribution in the results.
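
For example, a minimal sketch that buckets the saved results by month using the `date` field from the output format above:

```python
import json
from collections import Counter

with open('ml_trends.json') as f:
    data = json.load(f)

# Dates are YYYY-MM-DD strings; bucket by YYYY-MM
per_month = Counter(paper['date'][:7] for paper in data['results'])

for month in sorted(per_month):
    print(f"{month}: {per_month[month]} papers")
```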

### Author Tracking

Monitor specific researchers' preprints by looping over the CLI:

```bash
# Track multiple authors
for author in Smith Johnson Williams; do
    python scripts/biorxiv_search.py \
        --author "$author" \
        --days-back 365 \
        --output "${author}_papers.json"
done
```

## Python API Usage

For more complex workflows, import and use the `BioRxivSearcher` class directly:

```python
from scripts.biorxiv_search import BioRxivSearcher

# Initialize
searcher = BioRxivSearcher(verbose=True)

# Multiple search operations
keywords_papers = searcher.search_by_keywords(
    keywords=["CRISPR", "gene editing"],
    start_date="2024-01-01",
    end_date="2024-12-31",
    category="genomics"
)

author_papers = searcher.search_by_author(
    author_name="Smith",
    start_date="2023-01-01",
    end_date="2024-12-31"
)

# Get specific paper details
paper = searcher.get_paper_details("10.1101/2024.01.15.123456")

# Download PDF
success = searcher.download_pdf(
    doi="10.1101/2024.01.15.123456",
    output_path="paper.pdf"
)

# Format results consistently
formatted = searcher.format_result(paper, include_abstract=True)
```

## Best Practices

1. **Use appropriate date ranges**: Smaller date ranges return faster. For keyword searches over long periods, consider splitting into multiple queries.

2. **Filter by category**: When possible, use `--category` to reduce data transfer and improve search precision.

3. **Respect rate limits**: The script includes automatic delays (0.5 s between requests). For large-scale data collection, add additional delays.

4. **Cache results**: Save search results to JSON files to avoid repeated API calls.

5. **Track versions**: Preprints can have multiple versions. The `version` field indicates which version is returned, and PDF URLs include the version number.

6. **Handle errors gracefully**: Check the `result_count` in the output JSON; empty results may indicate date range issues or API connectivity problems (see the sketch after this list).

7. **Use verbose mode for debugging**: Pass the `--verbose` flag to see detailed logging of API requests and responses.
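
A minimal sketch combining points 5 and 6, assuming a saved `results.json`:

```python
import json

with open('results.json') as f:
    data = json.load(f)

if data['result_count'] == 0:
    # Could be a too-narrow date range or an API connectivity problem
    print("No results; widen the date range or retry later.")
else:
    for paper in data['results']:
        # PDF URLs are version-specific (the v1, v2, ... suffix)
        print(paper['version'], paper['pdf_url'])
```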

## Advanced Features

### Custom Date Range Logic

Compute the date range in Python and pass it to the CLI:

```python
import subprocess
from datetime import datetime, timedelta

# Last quarter
end_date = datetime.now()
start_date = end_date - timedelta(days=90)

subprocess.run([
    "python", "scripts/biorxiv_search.py",
    "--start-date", start_date.strftime("%Y-%m-%d"),
    "--end-date", end_date.strftime("%Y-%m-%d"),
])
```

### Result Limiting

Limit the number of results returned:

```bash
python scripts/biorxiv_search.py \
    --keywords "COVID-19" \
    --days-back 30 \
    --limit 50 \
    --output covid_top50.json
```

### Exclude Abstracts for Speed

When only metadata is needed:

```python
# Note: abstract inclusion is controlled in the Python API
from scripts.biorxiv_search import BioRxivSearcher

searcher = BioRxivSearcher()
# With no dates given, the search defaults to the last year
papers = searcher.search_by_keywords(keywords=["AI"])
formatted = [searcher.format_result(p, include_abstract=False) for p in papers]
```

## Programmatic Integration

Integrate search results into downstream analysis pipelines:

```python
import json
import pandas as pd

# Load results
with open('results.json') as f:
    data = json.load(f)

# Convert to DataFrame for analysis
df = pd.DataFrame(data['results'])

# Analyze
print(f"Total papers: {len(df)}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print("\nTop authors by paper count:")
print(df['authors'].str.split(',').explode().str.strip().value_counts().head(10))

# Filter and export
recent = df[df['date'] >= '2024-06-01']
recent.to_csv('recent_papers.csv', index=False)
```

## Testing the Skill

To verify that the bioRxiv database skill is working correctly, run the comprehensive test suite.

**Prerequisites:**
```bash
uv pip install requests
```

**Run tests:**
```bash
python tests/test_biorxiv_search.py
```

The test suite validates:
- **Initialization**: `BioRxivSearcher` class instantiation
- **Date Range Search**: Retrieving papers within specific date ranges
- **Category Filtering**: Filtering papers by bioRxiv categories
- **Keyword Search**: Finding papers containing specific keywords
- **DOI Lookup**: Retrieving specific papers by DOI
- **Result Formatting**: Proper formatting of paper metadata
- **Interval Search**: Fetching recent papers by time intervals

**Expected Output:**
```
🧬 bioRxiv Database Search Skill Test Suite
======================================================================

🧪 Test 1: Initialization
✅ BioRxivSearcher initialized successfully

🧪 Test 2: Date Range Search
✅ Found 150 papers between 2024-01-01 and 2024-01-07
   First paper: Novel CRISPR-based approach for genome editing...

[... additional tests ...]

======================================================================
📊 Test Summary
======================================================================
✅ PASS: Initialization
✅ PASS: Date Range Search
✅ PASS: Category Filtering
✅ PASS: Keyword Search
✅ PASS: DOI Lookup
✅ PASS: Result Formatting
✅ PASS: Interval Search
======================================================================
Results: 7/7 tests passed (100%)
======================================================================

🎉 All tests passed! The bioRxiv database skill is working correctly.
```

**Note:** Some tests may show warnings if no papers are found in specific date ranges or categories. This is normal and does not indicate a failure.

## Reference Documentation

For detailed API specifications, endpoint documentation, and response schemas, refer to:
- `references/api_reference.md` - Complete bioRxiv API documentation

The reference file includes:
- Full API endpoint specifications
- Response format details
- Error handling patterns
- Rate limiting guidelines
- Advanced search patterns

280 skills/biorxiv-database/references/api_reference.md Normal file
@@ -0,0 +1,280 @@
# bioRxiv API Reference

## Overview

The bioRxiv API provides programmatic access to preprint metadata from the bioRxiv server. The API returns JSON-formatted data with comprehensive metadata about life sciences preprints.

## Base URL

```
https://api.biorxiv.org
```

## Rate Limiting

Be respectful of the API (a minimal client sketch follows this list):
- Add delays between requests (minimum 0.5 seconds recommended)
- Use appropriate User-Agent headers
- Cache results when possible
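
A minimal sketch of a polite client that mirrors these guidelines (the User-Agent and delay match the bundled script):

```python
import time
import requests

session = requests.Session()
session.headers.update({"User-Agent": "BioRxiv-Search-Tool/1.0"})

def polite_get(url: str) -> dict:
    """GET a bioRxiv API URL, pausing 0.5 s afterwards to respect rate limits."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    time.sleep(0.5)
    return response.json()
```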

## API Endpoints

### 1. Details by Date Range

Retrieve preprints posted within a specific date range.

**Endpoint:**
```
GET /details/biorxiv/{start_date}/{end_date}
GET /details/biorxiv/{start_date}/{end_date}/{category}
```

**Parameters:**
- `start_date`: Start date in YYYY-MM-DD format
- `end_date`: End date in YYYY-MM-DD format
- `category` (optional): Filter by subject category

**Example:**
```
GET https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-01-31
GET https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-01-31/neuroscience
```

**Response:**
```json
{
  "messages": [
    {
      "status": "ok",
      "count": 150,
      "total": 150
    }
  ],
  "collection": [
    {
      "doi": "10.1101/2024.01.15.123456",
      "title": "Example Paper Title",
      "authors": "Smith J, Doe J, Johnson A",
      "author_corresponding": "Smith J",
      "author_corresponding_institution": "University Example",
      "date": "2024-01-15",
      "version": "1",
      "type": "new results",
      "license": "cc_by",
      "category": "neuroscience",
      "jatsxml": "https://www.biorxiv.org/content/...",
      "abstract": "This is the abstract...",
      "published": ""
    }
  ]
}
```
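
A minimal sketch of calling this endpoint with `requests` (the dates are illustrative):

```python
import requests

url = "https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-01-31"
response = requests.get(url, timeout=30)
response.raise_for_status()

data = response.json()
for paper in data["collection"]:
    print(paper["date"], paper["title"])
```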

### 2. Details by DOI

Retrieve details for a specific preprint by DOI.

**Endpoint:**
```
GET /details/biorxiv/{doi}
```

**Parameters:**
- `doi`: The DOI of the preprint (e.g., `10.1101/2024.01.15.123456`)

**Example:**
```
GET https://api.biorxiv.org/details/biorxiv/10.1101/2024.01.15.123456
```

### 3. Publications by Interval

Retrieve recent publications from a time interval.

**Endpoint:**
```
GET /pubs/biorxiv/{interval}/{cursor}/{format}
```

**Parameters:**
- `interval`: Number of days back to search (e.g., `1` for last 24 hours)
- `cursor`: Pagination cursor (0 for first page, increment by 100 for subsequent pages)
- `format`: Response format (`json` or `xml`)

**Example:**
```
GET https://api.biorxiv.org/pubs/biorxiv/1/0/json
```

**Response includes pagination:**
```json
{
  "messages": [
    {
      "status": "ok",
      "count": 100,
      "total": 250,
      "cursor": 100
    }
  ],
  "collection": [...]
}
```
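
A minimal pagination sketch, stepping the cursor by 100 until the reported `total` has been fetched:

```python
import requests

def fetch_interval(interval: int = 1) -> list:
    """Collect all papers from the last `interval` days, page by page."""
    papers, cursor = [], 0
    while True:
        url = f"https://api.biorxiv.org/pubs/biorxiv/{interval}/{cursor}/json"
        data = requests.get(url, timeout=30).json()
        message = data["messages"][0]
        papers.extend(data.get("collection", []))
        # Stop once this page reaches the reported total
        if cursor + message["count"] >= message["total"]:
            break
        cursor += 100  # pages are 100 items long
    return papers
```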

## Valid Categories

bioRxiv organizes preprints into the following categories:

- `animal-behavior-and-cognition`
- `biochemistry`
- `bioengineering`
- `bioinformatics`
- `biophysics`
- `cancer-biology`
- `cell-biology`
- `clinical-trials`
- `developmental-biology`
- `ecology`
- `epidemiology`
- `evolutionary-biology`
- `genetics`
- `genomics`
- `immunology`
- `microbiology`
- `molecular-biology`
- `neuroscience`
- `paleontology`
- `pathology`
- `pharmacology-and-toxicology`
- `physiology`
- `plant-biology`
- `scientific-communication-and-education`
- `synthetic-biology`
- `systems-biology`
- `zoology`

## Paper Metadata Fields

Each paper in the `collection` array contains:

| Field | Description | Type |
|-------|-------------|------|
| `doi` | Digital Object Identifier | string |
| `title` | Paper title | string |
| `authors` | Comma-separated author list | string |
| `author_corresponding` | Corresponding author name | string |
| `author_corresponding_institution` | Corresponding author's institution | string |
| `date` | Publication date (YYYY-MM-DD) | string |
| `version` | Version number | string |
| `type` | Type of submission (e.g., "new results") | string |
| `license` | License type (e.g., "cc_by") | string |
| `category` | Subject category | string |
| `jatsxml` | URL to JATS XML | string |
| `abstract` | Paper abstract | string |
| `published` | Journal publication info (if published) | string |

## Downloading Full Papers

### PDF Download

PDFs can be downloaded directly from the website (not through the API):

```
https://www.biorxiv.org/content/{doi}v{version}.full.pdf
```

Example:
```
https://www.biorxiv.org/content/10.1101/2024.01.15.123456v1.full.pdf
```
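
A minimal download sketch using this URL pattern (the DOI and version are illustrative):

```python
import requests

doi, version = "10.1101/2024.01.15.123456", 1
pdf_url = f"https://www.biorxiv.org/content/{doi}v{version}.full.pdf"

response = requests.get(pdf_url, timeout=60)
response.raise_for_status()

with open("paper.pdf", "wb") as f:
    f.write(response.content)
```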

### HTML Version

```
https://www.biorxiv.org/content/{doi}v{version}
```

### JATS XML

Full structured XML is available via the `jatsxml` field in the API response.

## Common Search Patterns

### Author Search

1. Get papers from date range
2. Filter by author name (case-insensitive substring match in `authors` field)

### Keyword Search

1. Get papers from date range (optionally filtered by category)
2. Search in title, abstract, or both fields
3. Filter papers containing keywords (case-insensitive), as in the sketch below
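
A minimal sketch of the client-side keyword filter these steps describe:

```python
def filter_by_keywords(papers: list, keywords: list, fields=("title", "abstract")) -> list:
    """Keep papers whose chosen fields contain any keyword (case-insensitive)."""
    keywords = [k.lower() for k in keywords]
    matches = []
    for paper in papers:
        text = " ".join(str(paper.get(f, "")) for f in fields).lower()
        if any(k in text for k in keywords):
            matches.append(paper)
    return matches
```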

### Recent Papers by Category

1. Use the `/pubs/biorxiv/{interval}/0/json` endpoint
2. Filter by category if needed

## Error Handling

Common HTTP status codes:
- `200`: Success
- `404`: Resource not found
- `500`: Server error

Always check the `messages` array in the response:
```json
{
  "messages": [
    {
      "status": "ok",
      "count": 100
    }
  ]
}
```
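
A minimal status check, assuming `data` is a parsed API response:

```python
def check_response(data: dict) -> int:
    """Return the record count, raising if the API reported an error status."""
    message = data["messages"][0]
    if message.get("status") != "ok":
        raise RuntimeError(f"bioRxiv API error: {message}")
    return message.get("count", 0)
```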

## Best Practices

1. **Cache results**: Store retrieved papers to avoid repeated API calls
2. **Use appropriate date ranges**: Smaller date ranges return faster
3. **Filter by category**: Reduces data transfer and processing time
4. **Batch processing**: When downloading multiple PDFs, add delays between requests
5. **Error handling**: Always check response status and handle errors gracefully
6. **Version tracking**: Note that papers can have multiple versions

## Python Usage Example

```python
from biorxiv_search import BioRxivSearcher

searcher = BioRxivSearcher(verbose=True)

# Search by keywords
papers = searcher.search_by_keywords(
    keywords=["CRISPR", "gene editing"],
    start_date="2024-01-01",
    end_date="2024-12-31",
    category="genomics"
)

# Search by author
papers = searcher.search_by_author(
    author_name="Smith",
    start_date="2023-01-01",
    end_date="2024-12-31"
)

# Get specific paper
paper = searcher.get_paper_details("10.1101/2024.01.15.123456")

# Download PDF
searcher.download_pdf("10.1101/2024.01.15.123456", "paper.pdf")
```

## External Resources

- bioRxiv homepage: https://www.biorxiv.org/
- API documentation: https://api.biorxiv.org/
- JATS XML specification: https://jats.nlm.nih.gov/

445 skills/biorxiv-database/scripts/biorxiv_search.py Executable file
@@ -0,0 +1,445 @@
#!/usr/bin/env python3
"""
bioRxiv Search Tool

A comprehensive Python tool for searching and retrieving preprints from bioRxiv.
Supports keyword search, author search, date filtering, category filtering, and more.

Note: This tool is focused exclusively on bioRxiv (life sciences preprints).
"""

import argparse
import json
import sys
import time
from datetime import datetime, timedelta
from typing import Dict, List, Optional

import requests


class BioRxivSearcher:
    """Efficient search interface for bioRxiv preprints."""

    BASE_URL = "https://api.biorxiv.org"

    # Valid bioRxiv categories
    CATEGORIES = [
        "animal-behavior-and-cognition", "biochemistry", "bioengineering",
        "bioinformatics", "biophysics", "cancer-biology", "cell-biology",
        "clinical-trials", "developmental-biology", "ecology", "epidemiology",
        "evolutionary-biology", "genetics", "genomics", "immunology",
        "microbiology", "molecular-biology", "neuroscience", "paleontology",
        "pathology", "pharmacology-and-toxicology", "physiology",
        "plant-biology", "scientific-communication-and-education",
        "synthetic-biology", "systems-biology", "zoology"
    ]

    def __init__(self, verbose: bool = False):
        """Initialize the searcher."""
        self.verbose = verbose
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'BioRxiv-Search-Tool/1.0'
        })

    def _log(self, message: str):
        """Print verbose logging messages."""
        if self.verbose:
            print(f"[INFO] {message}", file=sys.stderr)

    def _make_request(self, endpoint: str, params: Optional[Dict] = None) -> Dict:
        """Make an API request with error handling and rate limiting."""
        url = f"{self.BASE_URL}/{endpoint}"
        self._log(f"Requesting: {url}")

        try:
            response = self.session.get(url, params=params, timeout=30)
            response.raise_for_status()

            # Rate limiting - be respectful to the API
            time.sleep(0.5)

            return response.json()
        except requests.exceptions.RequestException as e:
            self._log(f"Error making request: {e}")
            return {"messages": [{"status": "error", "message": str(e)}], "collection": []}

    def search_by_date_range(
        self,
        start_date: str,
        end_date: str,
        category: Optional[str] = None
    ) -> List[Dict]:
        """
        Search for preprints within a date range.

        Args:
            start_date: Start date in YYYY-MM-DD format
            end_date: End date in YYYY-MM-DD format
            category: Optional category filter (e.g., 'neuroscience')

        Returns:
            List of preprint dictionaries
        """
        self._log(f"Searching bioRxiv from {start_date} to {end_date}")

        if category:
            endpoint = f"details/biorxiv/{start_date}/{end_date}/{category}"
        else:
            endpoint = f"details/biorxiv/{start_date}/{end_date}"

        data = self._make_request(endpoint)

        if "collection" in data:
            self._log(f"Found {len(data['collection'])} preprints")
            return data["collection"]

        return []

    def search_by_interval(
        self,
        interval: str = "1",
        cursor: int = 0,
        format: str = "json"
    ) -> Dict:
        """
        Retrieve preprints from a specific time interval.

        Args:
            interval: Number of days back to search
            cursor: Pagination cursor (0 for first page, then use returned cursor)
            format: Response format ('json' or 'xml')

        Returns:
            Dictionary with collection and pagination info
        """
        endpoint = f"pubs/biorxiv/{interval}/{cursor}/{format}"
        return self._make_request(endpoint)

    def get_paper_details(self, doi: str) -> Dict:
        """
        Get detailed information about a specific paper by DOI.

        Args:
            doi: The DOI of the paper (e.g., '10.1101/2021.01.01.123456')

        Returns:
            Dictionary with paper details
        """
        # Clean DOI if a full URL was provided
        if 'doi.org' in doi:
            doi = doi.split('doi.org/')[-1]

        self._log(f"Fetching details for DOI: {doi}")
        endpoint = f"details/biorxiv/{doi}"

        data = self._make_request(endpoint)

        if "collection" in data and len(data["collection"]) > 0:
            return data["collection"][0]

        return {}

    def search_by_author(
        self,
        author_name: str,
        start_date: Optional[str] = None,
        end_date: Optional[str] = None
    ) -> List[Dict]:
        """
        Search for papers by author name.

        Args:
            author_name: Author name to search for
            start_date: Optional start date (YYYY-MM-DD)
            end_date: Optional end date (YYYY-MM-DD)

        Returns:
            List of matching preprints
        """
        # If no date range specified, search the last 3 years
        if not start_date:
            start_date = (datetime.now() - timedelta(days=1095)).strftime("%Y-%m-%d")
        if not end_date:
            end_date = datetime.now().strftime("%Y-%m-%d")

        self._log(f"Searching for author: {author_name}")

        # Get all papers in the date range
        papers = self.search_by_date_range(start_date, end_date)

        # Filter by author name (case-insensitive)
        author_lower = author_name.lower()
        matching_papers = []

        for paper in papers:
            authors = paper.get("authors", "")
            if author_lower in authors.lower():
                matching_papers.append(paper)

        self._log(f"Found {len(matching_papers)} papers by {author_name}")
        return matching_papers

    def search_by_keywords(
        self,
        keywords: List[str],
        start_date: Optional[str] = None,
        end_date: Optional[str] = None,
        category: Optional[str] = None,
        search_fields: Optional[List[str]] = None
    ) -> List[Dict]:
        """
        Search for papers containing specific keywords.

        Args:
            keywords: List of keywords to search for
            start_date: Optional start date (YYYY-MM-DD)
            end_date: Optional end date (YYYY-MM-DD)
            category: Optional category filter
            search_fields: Fields to search in (title, abstract, authors);
                defaults to title and abstract

        Returns:
            List of matching preprints
        """
        # Avoid a mutable default argument
        if search_fields is None:
            search_fields = ["title", "abstract"]

        # If no date range specified, search the last year
        if not start_date:
            start_date = (datetime.now() - timedelta(days=365)).strftime("%Y-%m-%d")
        if not end_date:
            end_date = datetime.now().strftime("%Y-%m-%d")

        self._log(f"Searching for keywords: {keywords}")

        # Get all papers in the date range
        papers = self.search_by_date_range(start_date, end_date, category)

        # Filter by keywords
        matching_papers = []
        keywords_lower = [k.lower() for k in keywords]

        for paper in papers:
            # Build search text from the specified fields
            search_text = ""
            for field in search_fields:
                if field in paper:
                    search_text += " " + str(paper[field]).lower()

            # Check if any keyword matches
            if any(keyword in search_text for keyword in keywords_lower):
                matching_papers.append(paper)

        self._log(f"Found {len(matching_papers)} papers matching keywords")
        return matching_papers

    def download_pdf(self, doi: str, output_path: str) -> bool:
        """
        Download the PDF of a paper.

        Args:
            doi: The DOI of the paper
            output_path: Path where the PDF should be saved

        Returns:
            True if the download succeeded, False otherwise
        """
        # Clean DOI
        if 'doi.org' in doi:
            doi = doi.split('doi.org/')[-1]

        # Construct the PDF URL (assumes version 1; adjust the suffix for later versions)
        pdf_url = f"https://www.biorxiv.org/content/{doi}v1.full.pdf"

        self._log(f"Downloading PDF from: {pdf_url}")

        try:
            response = self.session.get(pdf_url, timeout=60)
            response.raise_for_status()

            with open(output_path, 'wb') as f:
                f.write(response.content)

            self._log(f"PDF saved to: {output_path}")
            return True
        except Exception as e:
            self._log(f"Error downloading PDF: {e}")
            return False

    def format_result(self, paper: Dict, include_abstract: bool = True) -> Dict:
        """
        Format a paper result with standardized fields.

        Args:
            paper: Raw paper dictionary from the API
            include_abstract: Whether to include the abstract

        Returns:
            Formatted paper dictionary
        """
        result = {
            "doi": paper.get("doi", ""),
            "title": paper.get("title", ""),
            "authors": paper.get("authors", ""),
            "author_corresponding": paper.get("author_corresponding", ""),
            "author_corresponding_institution": paper.get("author_corresponding_institution", ""),
            "date": paper.get("date", ""),
            "version": paper.get("version", ""),
            "type": paper.get("type", ""),
            "license": paper.get("license", ""),
            "category": paper.get("category", ""),
            "jatsxml": paper.get("jatsxml", ""),
            "published": paper.get("published", "")
        }

        if include_abstract:
            result["abstract"] = paper.get("abstract", "")

        # Add PDF and HTML URLs
        if result["doi"]:
            result["pdf_url"] = f"https://www.biorxiv.org/content/{result['doi']}v{result['version']}.full.pdf"
            result["html_url"] = f"https://www.biorxiv.org/content/{result['doi']}v{result['version']}"

        return result


def main():
    """Command-line interface for bioRxiv search."""
    parser = argparse.ArgumentParser(
        description="Search bioRxiv preprints efficiently",
        formatter_class=argparse.RawDescriptionHelpFormatter
    )

    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose logging")

    # Search type arguments
    search_group = parser.add_argument_group("Search options")
    search_group.add_argument("--keywords", "-k", nargs="+",
                              help="Keywords to search for")
    search_group.add_argument("--author", "-a",
                              help="Author name to search for")
    search_group.add_argument("--doi",
                              help="Get details for a specific DOI")

    # Date range arguments
    date_group = parser.add_argument_group("Date range options")
    date_group.add_argument("--start-date",
                            help="Start date (YYYY-MM-DD)")
    date_group.add_argument("--end-date",
                            help="End date (YYYY-MM-DD)")
    date_group.add_argument("--days-back", type=int,
                            help="Search N days back from today")

    # Filter arguments
    filter_group = parser.add_argument_group("Filter options")
    filter_group.add_argument("--category", "-c",
                              choices=BioRxivSearcher.CATEGORIES,
                              help="Filter by category")
    filter_group.add_argument("--search-fields", nargs="+",
                              default=["title", "abstract"],
                              choices=["title", "abstract", "authors"],
                              help="Fields to search in for keywords")

    # Output arguments
    output_group = parser.add_argument_group("Output options")
    output_group.add_argument("--output", "-o",
                              help="Output file (default: stdout)")
    # Abstracts are always included from the CLI; use the Python API
    # (format_result with include_abstract=False) to exclude them.
    output_group.add_argument("--include-abstract", action="store_true",
                              default=True, help="Include abstracts in output")
    output_group.add_argument("--download-pdf",
                              help="Download PDF to the specified path (requires --doi)")
    output_group.add_argument("--limit", type=int,
                              help="Limit the number of results")

    args = parser.parse_args()

    # Initialize the searcher
    searcher = BioRxivSearcher(verbose=args.verbose)

    # Handle the date range
    end_date = args.end_date or datetime.now().strftime("%Y-%m-%d")
    if args.days_back:
        start_date = (datetime.now() - timedelta(days=args.days_back)).strftime("%Y-%m-%d")
    else:
        start_date = args.start_date

    # Execute the search based on the arguments
    results = []

    if args.download_pdf:
        if not args.doi:
            print("Error: --doi required with --download-pdf", file=sys.stderr)
            return 1

        success = searcher.download_pdf(args.doi, args.download_pdf)
        return 0 if success else 1

    elif args.doi:
        # Get a specific paper by DOI
        paper = searcher.get_paper_details(args.doi)
        if paper:
            results = [paper]

    elif args.author:
        # Search by author
        results = searcher.search_by_author(
            args.author, start_date, end_date
        )

    elif args.keywords:
        # Search by keywords
        if not start_date:
            print("Error: --start-date or --days-back required for keyword search",
                  file=sys.stderr)
            return 1

        results = searcher.search_by_keywords(
            args.keywords, start_date, end_date,
            args.category, args.search_fields
        )

    else:
        # Plain date range search
        if not start_date:
            print("Error: must specify search criteria (--keywords, --author, or --doi) "
                  "or a date range (--start-date/--days-back)",
                  file=sys.stderr)
            return 1

        results = searcher.search_by_date_range(
            start_date, end_date, args.category
        )

    # Apply the limit
    if args.limit:
        results = results[:args.limit]

    # Format the results
    formatted_results = [
        searcher.format_result(paper, args.include_abstract)
        for paper in results
    ]

    # Output the results
    output_data = {
        "query": {
            "keywords": args.keywords,
            "author": args.author,
            "doi": args.doi,
            "start_date": start_date,
            "end_date": end_date,
            "category": args.category
        },
        "result_count": len(formatted_results),
        "results": formatted_results
    }

    output_json = json.dumps(output_data, indent=2)

    if args.output:
        with open(args.output, 'w') as f:
            f.write(output_json)
        print(f"Results written to {args.output}", file=sys.stderr)
    else:
        print(output_json)

    return 0


if __name__ == "__main__":
    sys.exit(main())