Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

# OpenAlex API Complete Guide
## Base Information
**Base URL:** `https://api.openalex.org`
**Authentication:** None required
**Rate Limits:**
- Default: 1 request/second, 100k requests/day
- Polite pool (with email): 10 requests/second, 100k requests/day
## Critical Best Practices
### ✅ DO: Use `?sample` parameter for random sampling
```
https://api.openalex.org/works?sample=20&seed=123
```
For large samples (10k+), use multiple seeds and deduplicate.
### ❌ DON'T: Use random page numbers for sampling
Incorrect: `?page=5`, `?page=17` - This biases results!
### ✅ DO: Use two-step lookup for entity filtering
```
1. Find entity ID: /authors?search=einstein
2. Use ID: /works?filter=authorships.author.id:A5023888391
```
### ❌ DON'T: Filter by entity names directly
Incorrect: `/works?filter=author_name:Einstein` - Names are ambiguous!
### ✅ DO: Use maximum page size for bulk extraction
```
?per-page=200
```
Each request returns 8x more results than the default of 25.
### ❌ DON'T: Use default page sizes
Default is only 25 results per page.
### ✅ DO: Use OR filter (pipe |) for batch lookups
```
/works?filter=doi:10.1/abc|10.2/def|10.3/ghi
```
Up to 50 values per filter.
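Because of the 50-value cap, longer ID lists need chunking. A minimal sketch (the `batch_filter` helper is illustrative, not part of any OpenAlex client library):

```python
def batch_filter(values, chunk_size=50):
    """Yield pipe-joined filter strings, at most chunk_size values each."""
    for i in range(0, len(values), chunk_size):
        yield "|".join(values[i:i + chunk_size])

if __name__ == "__main__":
    import requests
    dois = ["10.1371/journal.pone.0266781", "10.1371/journal.pone.0267149"]
    for chunk in batch_filter(dois):
        resp = requests.get(
            "https://api.openalex.org/works",
            params={"filter": f"doi:{chunk}", "per-page": 50,
                    "mailto": "you@example.edu"},
        )
        for work in resp.json()["results"]:
            print(work["id"], work["display_name"])
```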
### ❌ DON'T: Make sequential API calls for lists
Making 100 separate calls when you can batch them is inefficient.
### ✅ DO: Implement exponential backoff for retries
```python
import time
import requests

for attempt in range(max_retries):
    try:
        response = requests.get(url, timeout=30)
        if response.status_code == 200:
            return response.json()
    except requests.exceptions.RequestException:
        pass  # network error: fall through to backoff
    time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
```
### ✅ DO: Add email for 10x rate limit boost
```
?mailto=yourname@example.edu
```
Increases from 1 req/sec → 10 req/sec.
## Entity Endpoints
- `/works` - 240M+ scholarly documents
- `/authors` - Researcher profiles
- `/sources` - Journals, repositories, conferences
- `/institutions` - Universities, research organizations
- `/topics` - Subject classifications (3-level hierarchy)
- `/publishers` - Publishing organizations
- `/funders` - Funding agencies
- `/text` - Tag your own text with topics/keywords (POST)
## Essential Query Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| `filter=` | Filter results | `?filter=publication_year:2020` |
| `search=` | Full-text search | `?search=machine+learning` |
| `sort=` | Sort results | `?sort=cited_by_count:desc` |
| `per-page=` | Results per page (max 200) | `?per-page=200` |
| `page=` | Page number | `?page=2` |
| `sample=` | Random results | `?sample=50&seed=42` |
| `select=` | Limit fields | `?select=id,title` |
| `group_by=` | Aggregate by field | `?group_by=publication_year` |
| `mailto=` | Email for polite pool | `?mailto=you@example.edu` |
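These parameters combine freely in one query string. A quick sketch of assembling one with the standard library (any HTTP client's `params` encoding produces the same result):

```python
from urllib.parse import urlencode

BASE = "https://api.openalex.org"

params = {
    "filter": "publication_year:2020,is_oa:true",
    "sort": "cited_by_count:desc",
    "per-page": 200,
    "select": "id,title,cited_by_count",
    "mailto": "you@example.edu",
}
url = f"{BASE}/works?{urlencode(params)}"
```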
## Filter Syntax
### Basic Filtering
```
Single filter: ?filter=publication_year:2020
Multiple (AND): ?filter=publication_year:2020,is_oa:true
Values (OR): ?filter=type:article|book
Negation: ?filter=type:!article
```
### Comparison Operators
```
Greater than: ?filter=cited_by_count:>100
Less than: ?filter=publication_year:<2020
Range: ?filter=publication_year:2020-2023
```
### Multiple Values in Same Attribute
```
Repeat filter: ?filter=institutions.country_code:us,institutions.country_code:gb
Use + symbol: ?filter=institutions.country_code:us+gb
```
Both mean: works with at least one author from the US AND at least one author from GB.
### OR Queries
```
Any of these: ?filter=institutions.country_code:us|gb|ca
Batch IDs: ?filter=doi:10.1/abc|10.2/def
```
Up to 50 values with pipes.
## Common Query Patterns
### Get Random Sample
```bash
# Small sample
https://api.openalex.org/works?sample=20&seed=42
# Large sample (10k+) - make multiple requests
https://api.openalex.org/works?sample=1000&seed=1
https://api.openalex.org/works?sample=1000&seed=2
# Then deduplicate by ID
```
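The deduplication step can be as simple as keeping the first occurrence of each OpenAlex ID (the `dedupe_by_id` helper below is a sketch, not part of the API):

```python
def dedupe_by_id(batches):
    """Merge several result batches, keeping the first copy of each ID."""
    seen = {}
    for batch in batches:
        for work in batch:
            seen.setdefault(work["id"], work)
    return list(seen.values())

if __name__ == "__main__":
    import requests
    batches = []
    for seed in (1, 2, 3):
        resp = requests.get(
            "https://api.openalex.org/works",
            params={"sample": 200, "seed": seed, "per-page": 200,
                    "mailto": "you@example.edu"},
        )
        batches.append(resp.json()["results"])
    print(f"{len(dedupe_by_id(batches))} unique works")
```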
### Search Works
```bash
# Simple search
https://api.openalex.org/works?search=machine+learning
# Search specific field
https://api.openalex.org/works?filter=title.search:CRISPR
# Search + filter
https://api.openalex.org/works?search=climate&filter=publication_year:2023
```
### Find Works by Author (Two-Step)
```bash
# Step 1: Get author ID
https://api.openalex.org/authors?search=Heather+Piwowar
# Returns: "id": "https://openalex.org/A5023888391"
# Step 2: Get their works
https://api.openalex.org/works?filter=authorships.author.id:A5023888391
```
### Find Works by Institution (Two-Step)
```bash
# Step 1: Get institution ID
https://api.openalex.org/institutions?search=MIT
# Returns: "id": "https://openalex.org/I136199984"
# Step 2: Get their works
https://api.openalex.org/works?filter=authorships.institutions.id:I136199984
```
### Highly Cited Recent Papers
```bash
https://api.openalex.org/works?filter=publication_year:>2020&sort=cited_by_count:desc&per-page=200
```
### Open Access Works
```bash
# All OA
https://api.openalex.org/works?filter=is_oa:true
# Gold OA only
https://api.openalex.org/works?filter=open_access.oa_status:gold
```
### Multiple Criteria
```bash
# Recent OA works about COVID from top institutions
https://api.openalex.org/works?filter=publication_year:2022,is_oa:true,title.search:covid,authorships.institutions.id:I136199984|I27837315
```
### Bulk DOI Lookup
```bash
# Get specific works by DOI (up to 50 per request)
https://api.openalex.org/works?filter=doi:https://doi.org/10.1371/journal.pone.0266781|https://doi.org/10.1371/journal.pone.0267149&per-page=50
```
### Aggregate Data
```bash
# Top topics
https://api.openalex.org/works?group_by=topics.id
# Papers per year
https://api.openalex.org/works?group_by=publication_year
# Most prolific institutions
https://api.openalex.org/works?group_by=authorships.institutions.id
```
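A `group_by` response carries its buckets in a `group_by` array rather than `results`. A small sketch of flattening it into labeled counts (the helper name is illustrative):

```python
def summarize_groups(payload):
    """Turn a group_by response into (label, count) pairs, largest first."""
    pairs = [(g["key_display_name"], g["count"]) for g in payload["group_by"]]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

if __name__ == "__main__":
    import requests
    resp = requests.get(
        "https://api.openalex.org/works",
        params={"group_by": "publication_year", "mailto": "you@example.edu"},
    )
    for label, count in summarize_groups(resp.json())[:10]:
        print(f"{label}: {count}")
```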
### Pagination
```bash
# First page
https://api.openalex.org/works?filter=publication_year:2023&per-page=200
# Next pages
https://api.openalex.org/works?filter=publication_year:2023&per-page=200&page=2
```
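The pattern above can be wrapped in a loop that advances `page` until results run out. This is a sketch; note that basic page numbers only reach the first 10,000 results of a query, beyond which OpenAlex's cursor paging (`cursor=*`) is needed. The `get` parameter exists so the function can be exercised without network access:

```python
def fetch_all_pages(params, max_results=2000, mailto="you@example.edu", get=None):
    """Collect results across pages with per-page=200 until exhausted."""
    if get is None:
        import requests
        get = requests.get
    results, page = [], 1
    while len(results) < max_results:
        resp = get(
            "https://api.openalex.org/works",
            params={**params, "per-page": 200, "page": page, "mailto": mailto},
        )
        batch = resp.json()["results"]
        if not batch:
            break
        results.extend(batch)
        page += 1
    return results[:max_results]
```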
## Response Structure
### List Endpoints
```json
{
  "meta": {
    "count": 240523418,
    "db_response_time_ms": 42,
    "page": 1,
    "per_page": 25
  },
  "results": [
    { /* entity object */ }
  ]
}
```
### Single Entity
```
https://api.openalex.org/works/W2741809807
→ Returns Work object directly (no meta/results wrapper)
```
### Group By
```json
{
  "meta": { "count": 100 },
  "group_by": [
    {
      "key": "https://openalex.org/T10001",
      "key_display_name": "Artificial Intelligence",
      "count": 15234
    }
  ]
}
```
## Works Filters (Most Common)
| Filter | Description | Example |
|--------|-------------|---------|
| `authorships.author.id` | Author's OpenAlex ID | `A5023888391` |
| `authorships.institutions.id` | Institution's ID | `I136199984` |
| `cited_by_count` | Citation count | `>100` |
| `is_oa` | Is open access | `true/false` |
| `publication_year` | Year published | `2020`, `>2020`, `2018-2022` |
| `primary_location.source.id` | Source (journal) ID | `S137773608` |
| `topics.id` | Topic ID | `T10001` |
| `type` | Document type | `article`, `book`, `dataset` |
| `has_doi` | Has DOI | `true/false` |
| `has_fulltext` | Has fulltext | `true/false` |
## Authors Filters
| Filter | Description |
|--------|-------------|
| `last_known_institution.id` | Current/last institution |
| `works_count` | Number of works |
| `cited_by_count` | Total citations |
| `orcid` | ORCID identifier |
## External ID Support
### Works
```
DOI: /works/https://doi.org/10.7717/peerj.4375
PMID: /works/pmid:29844763
```
### Authors
```
ORCID: /authors/https://orcid.org/0000-0003-1613-5981
```
### Institutions
```
ROR: /institutions/https://ror.org/02y3ad647
```
### Sources
```
ISSN: /sources/issn:0028-0836
```
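All of these resolve through the same URL shape, so one small helper covers every external ID (a sketch; the `get` parameter is only there so the function can be tested without network access):

```python
def get_by_external_id(entity, ext_id, mailto="you@example.edu", get=None):
    """Fetch a single entity by DOI, PMID, ORCID, ROR, or ISSN."""
    if get is None:
        import requests
        get = requests.get
    resp = get(f"https://api.openalex.org/{entity}/{ext_id}",
               params={"mailto": mailto})
    return resp.json()

if __name__ == "__main__":
    work = get_by_external_id("works", "https://doi.org/10.7717/peerj.4375")
    print(work["display_name"])
```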
## Performance Tips
1. **Use maximum page size**: `?per-page=200` (8x fewer calls)
2. **Batch ID lookups**: Use pipe operator for up to 50 IDs
3. **Select only needed fields**: `?select=id,title,publication_year`
4. **Use concurrent requests**: With rate limiting (10 req/sec with email)
5. **Add email**: `?mailto=you@example.edu` for 10x speed boost
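Tips 1, 3, and 5 fit in a single request. A sketch of the parameter set they produce (`lean_params` is a hypothetical helper name):

```python
def lean_params(filter_expr, fields="id,title,publication_year,cited_by_count",
                mailto="you@example.edu"):
    """Build a request applying max page size, field selection, and email."""
    return {
        "filter": filter_expr,
        "per-page": 200,
        "select": fields,
        "mailto": mailto,
    }

if __name__ == "__main__":
    import requests
    resp = requests.get("https://api.openalex.org/works",
                        params=lean_params("publication_year:2023"))
    print(len(resp.json()["results"]), "results in one call")
```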
## Error Handling
### HTTP Status Codes
- `200` - Success
- `400` - Bad request (check filter syntax)
- `403` - Rate limit exceeded (implement backoff)
- `404` - Entity doesn't exist
- `500` - Server error (retry with backoff)
### Exponential Backoff
```python
import time
import requests

def fetch_with_retry(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                return response.json()
            elif response.status_code in (403, 500, 502, 503, 504):
                time.sleep(2 ** attempt)
            else:
                response.raise_for_status()
        except requests.exceptions.Timeout:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                raise
    raise Exception(f"Failed after {max_retries} retries")
```
## Rate Limiting
### Without Email (Default Pool)
- 1 request/second
- 100,000 requests/day
### With Email (Polite Pool)
- 10 requests/second
- 100,000 requests/day
- **Always use for production**
### Concurrent Request Strategy
1. Track requests per second globally
2. Use semaphore or rate limiter across threads
3. Monitor for 403 responses
4. Back off if limits hit
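Steps 1-2 can be sketched as a thread-safe limiter that spaces acquisitions evenly (a minimal illustration, assuming the 10 req/sec polite-pool limit; call `acquire()` before every request, from any thread):

```python
import threading
import time

class RateLimiter:
    """Allow at most `rate` acquisitions per second across all threads."""
    def __init__(self, rate=10):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_time = 0.0

    def acquire(self):
        with self.lock:
            now = time.monotonic()
            wait = self.next_time - now
            # Reserve the next slot before sleeping, so other threads queue up.
            self.next_time = max(now, self.next_time) + self.interval
        if wait > 0:
            time.sleep(wait)

limiter = RateLimiter(rate=10)
# Before each requests.get(...), call: limiter.acquire()
```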
## Common Mistakes to Avoid
1. ❌ Using page numbers for sampling → ✅ Use `?sample=`
2. ❌ Filtering by entity names → ✅ Get IDs first
3. ❌ Default page size → ✅ Use `per-page=200`
4. ❌ Sequential ID lookups → ✅ Batch with pipe operator
5. ❌ No error handling → ✅ Implement retry with backoff
6. ❌ Ignoring rate limits → ✅ Global rate limiting
7. ❌ Not including email → ✅ Add `mailto=`
8. ❌ Fetching all fields → ✅ Use `select=`
## Additional Resources
- Full documentation: https://docs.openalex.org
- API Overview: https://docs.openalex.org/how-to-use-the-api/api-overview
- Entity schemas: https://docs.openalex.org/api-entities
- Help: https://openalex.org/help
- User group: https://groups.google.com/g/openalex-users

# Common OpenAlex Query Examples
This document provides practical examples for common research queries using OpenAlex.
## Finding Papers by Author
**User query**: "Find papers by Albert Einstein"
**Approach**: Two-step pattern
1. Search for author to get ID
2. Filter works by author ID
**Python example**:
```python
from scripts.openalex_client import OpenAlexClient
from scripts.query_helpers import find_author_works
client = OpenAlexClient(email="your-email@example.edu")
works = find_author_works("Albert Einstein", client, limit=100)
for work in works:
    print(f"{work['title']} ({work['publication_year']})")
```
## Finding Papers from an Institution
**User query**: "What papers has MIT published in the last year?"
**Approach**: Two-step pattern with date filter
1. Search for institution to get ID
2. Filter works by institution ID and year
**Python example**:
```python
from scripts.query_helpers import find_institution_works
works = find_institution_works("MIT", client, limit=200)
# Filter for recent papers
import datetime
current_year = datetime.datetime.now().year
recent_works = [w for w in works if w['publication_year'] == current_year]
```
## Highly Cited Papers on a Topic
**User query**: "Find the most cited papers on CRISPR from the last 5 years"
**Approach**: Search + filter + sort
**Python example**:
```python
works = client.search_works(
    search="CRISPR",
    filter_params={
        "publication_year": ">2019"
    },
    sort="cited_by_count:desc",
    per_page=100
)
for work in works['results']:
    title = work['title']
    citations = work['cited_by_count']
    year = work['publication_year']
    print(f"{title} ({year}): {citations} citations")
```
## Open Access Papers on a Topic
**User query**: "Find open access papers about climate change"
**Approach**: Search + OA filter
**Python example**:
```python
from scripts.query_helpers import get_open_access_papers
papers = get_open_access_papers(
    search_term="climate change",
    client=client,
    oa_status="any",  # or "gold", "green", "hybrid", "bronze"
    limit=200
)
for paper in papers:
    print(f"{paper['title']}")
    print(f" OA Status: {paper['open_access']['oa_status']}")
    print(f" URL: {paper['open_access']['oa_url']}")
```
## Publication Trends Analysis
**User query**: "Show me publication trends for machine learning over the years"
**Approach**: Use group_by to aggregate by year
**Python example**:
```python
from scripts.query_helpers import get_publication_trends
trends = get_publication_trends(
    search_term="machine learning",
    client=client
)
# Sort by year
trends_sorted = sorted(trends, key=lambda x: x['key'])
for trend in trends_sorted[-10:]:  # Last 10 years
    year = trend['key']
    count = trend['count']
    print(f"{year}: {count} publications")
```
## Analyzing Research Output
**User query**: "Analyze the research output of Stanford University from 2020-2024"
**Approach**: Multiple aggregations for comprehensive analysis
**Python example**:
```python
from scripts.query_helpers import analyze_research_output
analysis = analyze_research_output(
    entity_type='institution',
    entity_name='Stanford University',
    client=client,
    years='2020-2024'
)
print(f"Institution: {analysis['entity_name']}")
print(f"Total works: {analysis['total_works']}")
print(f"Open access: {analysis['open_access_percentage']}%")
print("\nTop topics:")
for topic in analysis['top_topics'][:5]:
    print(f" - {topic['key_display_name']}: {topic['count']} works")
```
## Finding Papers by DOI (Batch)
**User query**: "Get information for these 10 DOIs: ..."
**Approach**: Batch lookup with pipe separator
**Python example**:
```python
dois = [
    "https://doi.org/10.1371/journal.pone.0266781",
    "https://doi.org/10.1371/journal.pone.0267149",
    "https://doi.org/10.1038/s41586-021-03819-2",
    # ... up to 50 DOIs
]
works = client.batch_lookup(
    entity_type='works',
    ids=dois,
    id_field='doi'
)
for work in works:
    print(f"{work['title']} - {work['publication_year']}")
```
## Random Sample of Papers
**User query**: "Give me 50 random papers from 2023"
**Approach**: Use sample parameter with seed for reproducibility
**Python example**:
```python
works = client.sample_works(
    sample_size=50,
    seed=42,  # For reproducibility
    filter_params={
        "publication_year": "2023",
        "is_oa": "true"
    }
)
print(f"Got {len(works)} random papers from 2023")
```
## Papers from Multiple Institutions
**User query**: "Find papers with authors from both MIT and Stanford"
**Approach**: Use + operator for AND within same attribute
**Python example**:
```python
# First, get institution IDs
mit_response = client._make_request(
    '/institutions',
    params={'search': 'MIT', 'per-page': 1}
)
mit_id = mit_response['results'][0]['id'].split('/')[-1]
stanford_response = client._make_request(
    '/institutions',
    params={'search': 'Stanford', 'per-page': 1}
)
stanford_id = stanford_response['results'][0]['id'].split('/')[-1]
# Find works with authors from both institutions
works = client.search_works(
    filter_params={
        "authorships.institutions.id": f"{mit_id}+{stanford_id}"
    },
    per_page=100
)
print(f"Found {works['meta']['count']} collaborative papers")
```
## Papers in a Specific Journal
**User query**: "Get all papers from Nature published in 2023"
**Approach**: Two-step - find journal ID, then filter works
**Python example**:
```python
# Step 1: Find journal source ID
source_response = client._make_request(
    '/sources',
    params={'search': 'Nature', 'per-page': 1}
)
source = source_response['results'][0]
source_id = source['id'].split('/')[-1]
print(f"Found journal: {source['display_name']} (ID: {source_id})")
# Step 2: Get works from that source
works = client.search_works(
    filter_params={
        "primary_location.source.id": source_id,
        "publication_year": "2023"
    },
    per_page=200
)
print(f"Found {works['meta']['count']} papers from Nature in 2023")
```
## Topic Analysis by Institution
**User query**: "What topics does MIT research most?"
**Approach**: Filter by institution, group by topics
**Python example**:
```python
# Get MIT ID
inst_response = client._make_request(
    '/institutions',
    params={'search': 'MIT', 'per-page': 1}
)
mit_id = inst_response['results'][0]['id'].split('/')[-1]
# Group by topics
topics = client.group_by(
    entity_type='works',
    group_field='topics.id',
    filter_params={
        "authorships.institutions.id": mit_id,
        "publication_year": ">2020"
    }
)
print("Top research topics at MIT (2020+):")
for i, topic in enumerate(topics[:10], 1):
    print(f"{i}. {topic['key_display_name']}: {topic['count']} works")
```
## Citation Analysis
**User query**: "Find papers that cite this specific DOI"
**Approach**: Get work by DOI, then use cited_by_api_url
**Python example**:
```python
# Get the work
doi = "https://doi.org/10.1038/s41586-021-03819-2"
work = client.get_entity('works', doi)
# Get papers that cite it
cited_by_url = work['cited_by_api_url']
# Extract just the query part and use it
import requests
response = requests.get(cited_by_url, params={'mailto': client.email})
citing_works = response.json()
print(f"{work['title']}")
print(f"Total citations: {work['cited_by_count']}")
print(f"\nRecent citing papers:")
for citing_work in citing_works['results'][:5]:
    print(f" - {citing_work['title']} ({citing_work['publication_year']})")
```
## Large-Scale Data Extraction
**User query**: "Get all papers on quantum computing from the last 3 years"
**Approach**: Paginate through all results
**Python example**:
```python
all_papers = client.paginate_all(
    endpoint='/works',
    params={
        'search': 'quantum computing',
        'filter': 'publication_year:2022-2024'
    },
    max_results=10000  # Limit to prevent excessive API calls
)
print(f"Retrieved {len(all_papers)} papers")
# Save to CSV
import csv
with open('quantum_papers.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Year', 'Citations', 'DOI', 'OA Status'])
    for paper in all_papers:
        writer.writerow([
            paper['title'],
            paper['publication_year'],
            paper['cited_by_count'],
            paper.get('doi', 'N/A'),
            paper['open_access']['oa_status']
        ])
```
## Complex Multi-Filter Query
**User query**: "Find recent, highly-cited, open access papers on AI from top institutions"
**Approach**: Combine multiple filters
**Python example**:
```python
# Get IDs for top institutions
top_institutions = ['MIT', 'Stanford', 'Oxford']
inst_ids = []
for inst_name in top_institutions:
    response = client._make_request(
        '/institutions',
        params={'search': inst_name, 'per-page': 1}
    )
    if response['results']:
        inst_id = response['results'][0]['id'].split('/')[-1]
        inst_ids.append(inst_id)
# Combine with pipe for OR
inst_filter = '|'.join(inst_ids)
# Complex query
works = client.search_works(
    search="artificial intelligence",
    filter_params={
        "publication_year": ">2022",
        "cited_by_count": ">50",
        "is_oa": "true",
        "authorships.institutions.id": inst_filter
    },
    sort="cited_by_count:desc",
    per_page=200
)
print(f"Found {works['meta']['count']} papers matching criteria")
for work in works['results'][:10]:
    print(f"{work['title']}")
    print(f" Citations: {work['cited_by_count']}, Year: {work['publication_year']}")
```