Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions


@@ -0,0 +1,189 @@
---
name: uniprot-database
description: "Direct REST API access to UniProt. Protein searches, FASTA retrieval, ID mapping, Swiss-Prot/TrEMBL. For Python workflows with multiple databases, prefer bioservices (unified interface to 40+ services). Use this for direct HTTP/REST work or UniProt-specific control."
---
# UniProt Database
## Overview
UniProt is a comprehensive resource for protein sequences and functional annotation. This skill uses its REST API to search proteins by name, gene, or accession, retrieve sequences in FASTA format, map identifiers across databases, and access Swiss-Prot/TrEMBL annotations for protein analysis.
## When to Use This Skill
This skill should be used when:
- Searching for protein entries by name, gene symbol, accession, or organism
- Retrieving protein sequences in FASTA or other formats
- Mapping identifiers between UniProt and external databases (Ensembl, RefSeq, PDB, etc.)
- Accessing protein annotations including GO terms, domains, and functional descriptions
- Batch retrieving multiple protein entries efficiently
- Querying reviewed (Swiss-Prot) vs. unreviewed (TrEMBL) protein data
- Streaming large protein datasets
- Building custom queries with field-specific search syntax
## Core Capabilities
### 1. Searching for Proteins
Search UniProt using free-text queries or structured, field-specific search syntax.
**Common search patterns:**
```python
# Search by protein name
query = "insulin AND organism_name:\"Homo sapiens\""
# Search by gene name
query = "gene:BRCA1 AND reviewed:true"
# Search by accession
query = "accession:P12345"
# Search by sequence length
query = "length:[100 TO 500]"
# Search by taxonomy
query = "taxonomy_id:9606" # Human proteins
# Search by GO term
query = "go:0005515" # Protein binding
```
Use the API search endpoint: `https://rest.uniprot.org/uniprotkb/search?query={query}&format={format}`
**Supported formats:** JSON, TSV, Excel, XML, FASTA, RDF, TXT
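Queries contain spaces, colons, and quotes, so they need URL encoding before being placed in the endpoint above. A minimal sketch using only the standard library; `build_search_url` and `SEARCH_URL` are illustrative names, not part of the skill's helper script:

```python
from urllib.parse import urlencode

SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fmt="json", size=None):
    """Return a fully encoded search URL for the given query string."""
    params = {"query": query, "format": fmt}
    if size is not None:
        params["size"] = size  # optional page size
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url('insulin AND organism_name:"Homo sapiens"', fmt="tsv", size=5)
print(url)
```

When using `requests`, passing the same dictionary as `params=` performs this encoding automatically.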
### 2. Retrieving Individual Protein Entries
Retrieve specific protein entries by accession number.
**Accession number formats:**
- Classic: P12345, Q1AAA9, O15530 (6 characters: letter + 5 alphanumeric)
- Extended: A0A022YWF9 (10 characters for newer entries)
**Retrieve endpoint:** `https://rest.uniprot.org/uniprotkb/{accession}.{format}`
Example: `https://rest.uniprot.org/uniprotkb/P12345.fasta`
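Validating an accession before building the URL avoids a wasted request. The sketch below uses the accession pattern as recalled from UniProt's accession-number help page, so it is worth checking against that page; `is_valid_accession` and `entry_url` are hypothetical helper names:

```python
import re

# Accession pattern recalled from UniProt's help pages (verify before relying on it).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(accession):
    """Return True if the whole string looks like a 6- or 10-character accession."""
    return ACCESSION_RE.fullmatch(accession) is not None


def entry_url(accession, fmt="fasta"):
    """Build the retrieve URL for one entry, rejecting malformed accessions."""
    if not is_valid_accession(accession):
        raise ValueError(f"not a UniProtKB accession: {accession!r}")
    return f"https://rest.uniprot.org/uniprotkb/{accession}.{fmt}"


print(entry_url("P12345"))
```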
### 3. Batch Retrieval and ID Mapping
Map protein identifiers between different database systems and retrieve multiple entries efficiently.
**ID Mapping workflow:**
1. Submit mapping job to: `https://rest.uniprot.org/idmapping/run`
2. Check job status: `https://rest.uniprot.org/idmapping/status/{jobId}`
3. Retrieve results: `https://rest.uniprot.org/idmapping/results/{jobId}`
**Supported databases for mapping:**
- UniProtKB AC/ID
- Gene names
- Ensembl, RefSeq, EMBL
- PDB, AlphaFoldDB
- KEGG, GO terms
- And many more (see `/references/id_mapping_databases.md`)
**Limitations:**
- Maximum 100,000 IDs per job
- Results stored for 7 days
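Because of the per-job limit, a long ID list has to be split into several jobs. One possible way to prepare the batches, assuming the form fields `from`, `to`, and `ids` used elsewhere in this skill; `chunk_ids` and `job_payload` are illustrative names:

```python
def chunk_ids(ids, max_per_job=100_000):
    """Yield de-duplicated IDs in lists no longer than max_per_job."""
    unique = list(dict.fromkeys(ids))  # drop duplicates, keep first-seen order
    for start in range(0, len(unique), max_per_job):
        yield unique[start:start + max_per_job]


def job_payload(batch, from_db, to_db):
    """Build the form data for one mapping job."""
    return {"from": from_db, "to": to_db, "ids": ",".join(batch)}


# Tiny limit used here only to make the splitting visible.
batches = list(chunk_ids(["P01308", "P04637", "P01308", "P12345"], max_per_job=2))
print(batches)
```

Each payload can then be posted to the `idmapping/run` endpoint as a separate job.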
### 4. Streaming Large Result Sets
For large queries that exceed pagination limits, use the stream endpoint:
`https://rest.uniprot.org/uniprotkb/stream?query={query}&format={format}`
The stream endpoint returns all results without pagination, suitable for downloading complete datasets.
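When the streamed format is FASTA, the download can be processed record by record instead of being held in memory. A minimal sketch of such a parser; `iter_fasta` is a hypothetical helper, and the sample records are made-up placeholders rather than real UniProt entries:

```python
def iter_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # flush the final record


# Placeholder input for illustration only.
sample = [">sp|X00001|DEMO_HUMAN Demo protein", "MKT", "AAG", "", ">sp|X00002|TEST_HUMAN Test", "MV"]
records = list(iter_fasta(sample))
print(records)
```

With `requests`, the same function could probably be fed from `response.iter_lines(decode_unicode=True)` on a response opened with `stream=True`.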
### 5. Customizing Retrieved Fields
Specify exactly which fields to retrieve for efficient data transfer.
**Common fields:**
- `accession` - UniProt accession number
- `id` - Entry name
- `gene_names` - Gene name(s)
- `organism_name` - Organism
- `protein_name` - Protein names
- `sequence` - Amino acid sequence
- `length` - Sequence length
- `go_*` - Gene Ontology annotations
- `cc_*` - Comment fields (function, interaction, etc.)
- `ft_*` - Feature annotations (domains, sites, etc.)
**Example:** `https://rest.uniprot.org/uniprotkb/search?query=insulin&fields=accession,gene_names,organism_name,length,sequence&format=tsv`
See `/references/api_fields.md` for complete field list.
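A TSV response with selected fields can be turned into dictionaries keyed by column header using the standard library. This is a sketch only; `parse_tsv` is an illustrative name, and the sample text below is invented (real column headers are whatever the API returns for the requested fields):

```python
import csv
import io


def parse_tsv(text):
    """Parse a TSV response body into a list of dicts keyed by column header."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in annotations literally
    )
    return list(reader)


# Invented sample; not real UniProt data.
sample = "Entry\tGene Names\tLength\nX00001\tDEMO1 DEMO\t110\n"
rows = parse_tsv(sample)
print(rows[0]["Entry"], rows[0]["Length"])
```

All values arrive as strings, so numeric columns such as length need an explicit `int()` conversion.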
## Python Implementation
For programmatic access, use the provided helper script `scripts/uniprot_client.py` which implements:
- `search_proteins(query, format)` - Search UniProt with any query
- `get_protein(accession, format)` - Retrieve single protein entry
- `map_ids(ids, from_db, to_db)` - Map between identifier types
- `batch_retrieve(accessions, format)` - Retrieve multiple entries
- `stream_results(query, format)` - Stream large result sets
**Alternative Python packages:**
- **Unipressed**: Modern, typed Python client for UniProt REST API
- **bioservices**: Comprehensive bioinformatics web services client
## Query Syntax Examples
**Boolean operators:**
```
kinase AND organism_name:human
(diabetes OR insulin) AND reviewed:true
cancer NOT lung
```
**Field-specific searches:**
```
gene:BRCA1
accession:P12345
organism_id:9606
taxonomy_name:"Homo sapiens"
annotation:(type:signal)
```
**Range queries:**
```
length:[100 TO 500]
mass:[50000 TO 100000]
```
**Wildcards:**
```
gene:BRCA*
protein_name:kinase*
```
See `/references/query_syntax.md` for comprehensive syntax documentation.
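Queries like the ones above can be assembled programmatically so that quoting and grouping stay consistent. A small sketch; `field_clause` and `all_of` are hypothetical helpers, and range values such as `[100 TO 500]` should be written directly rather than passed through `field_clause`:

```python
def field_clause(field, value):
    """Return field:value, quoting values that contain whitespace."""
    text = str(value)
    if any(ch.isspace() for ch in text):
        text = f'"{text}"'
    return f"{field}:{text}"


def all_of(*clauses):
    """Join clauses with AND, parenthesising any clause that contains OR."""
    parts = [f"({c})" if " OR " in c else c for c in clauses]
    return " AND ".join(parts)


query = all_of(
    "diabetes OR insulin",
    field_clause("organism_name", "Homo sapiens"),
    field_clause("reviewed", "true"),
)
print(query)
```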
## Best Practices
1. **Use reviewed entries when possible**: Filter with `reviewed:true` for Swiss-Prot (manually curated) entries
2. **Specify format explicitly**: Choose the most appropriate format (FASTA for sequences, TSV for tabular data, JSON for programmatic parsing)
3. **Use field selection**: Only request fields you need to reduce bandwidth and processing time
4. **Handle pagination**: For large result sets, implement proper pagination or use the stream endpoint
5. **Cache results**: Store frequently accessed data locally to minimize API calls
6. **Rate limiting**: Be respectful of API resources; implement delays for large batch operations
7. **Check data quality**: TrEMBL entries are automatically annotated and unreviewed; Swiss-Prot entries are manually curated
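Practices 5 and 6 can be combined in a thin wrapper around whatever function performs the HTTP call. The sketch below is one possible design, not the behaviour of `scripts/uniprot_client.py`; `CachedFetcher` and its parameters are illustrative:

```python
import time


class CachedFetcher:
    """Wrap a fetch(url) callable with an in-memory cache and a minimum delay."""

    def __init__(self, fetch, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache = {}
        self._last_call = None

    def get(self, url):
        if url in self._cache:
            return self._cache[url]  # cache hit: no network call, no delay
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)  # space out consecutive uncached requests
        self._last_call = self._clock()
        self._cache[url] = self._fetch(url)
        return self._cache[url]


# Demonstration with a fake fetch function instead of a real HTTP call.
calls = []
fetcher = CachedFetcher(lambda u: calls.append(u) or f"body:{u}", min_interval=0)
first = fetcher.get("a")
fetcher.get("a")
fetcher.get("b")
print(calls)
```

In real use the fetch callable might be something like `lambda url: requests.get(url, timeout=30).text`.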
## Resources
### scripts/
`uniprot_client.py` - Python client with helper functions for common UniProt operations including search, retrieval, ID mapping, and streaming.
### references/
- `api_fields.md` - Complete list of available fields for customizing queries
- `id_mapping_databases.md` - Supported databases for ID mapping operations
- `query_syntax.md` - Comprehensive query syntax with advanced examples
- `api_examples.md` - Code examples in multiple languages (Python, curl, R)
## Additional Resources
- **API Documentation**: https://www.uniprot.org/help/api
- **Interactive API Explorer**: https://www.uniprot.org/api-documentation
- **REST Tutorial**: https://www.uniprot.org/help/uniprot_rest_tutorial
- **Query Syntax Help**: https://www.uniprot.org/help/query-fields
- **SPARQL Endpoint**: https://sparql.uniprot.org/ (for advanced graph queries)


@@ -0,0 +1,413 @@
# UniProt API Examples
Practical code examples for interacting with the UniProt REST API in multiple languages.
## Python Examples
### Example 1: Basic Search
```python
import requests
# Search for human insulin proteins
url = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "insulin AND organism_id:9606 AND reviewed:true",
    "format": "json",
    "size": 10
}
response = requests.get(url, params=params)
data = response.json()
for result in data['results']:
    print(f"{result['primaryAccession']}: {result['proteinDescription']['recommendedName']['fullName']['value']}")
```
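The nested lookup in Example 1 works for reviewed entries but raises `KeyError` for any entry without a `recommendedName`. A more defensive accessor might look like the sketch below; the `submissionNames` key is an assumption based on typical responses for unreviewed entries, and `protein_name` is a hypothetical helper:

```python
def protein_name(entry):
    """Return the best available protein name from a UniProtKB JSON entry, or None."""
    desc = entry.get("proteinDescription", {})
    recommended = desc.get("recommendedName")
    if recommended:
        return recommended["fullName"]["value"]
    # Assumed fallback: unreviewed entries commonly carry submitted names instead.
    for submitted in desc.get("submissionNames", []):
        return submitted["fullName"]["value"]
    return None


# Hand-written stand-ins for API results.
reviewed = {"proteinDescription": {"recommendedName": {"fullName": {"value": "Insulin"}}}}
unreviewed = {"proteinDescription": {"submissionNames": [{"fullName": {"value": "Demo protein"}}]}}
print(protein_name(reviewed), protein_name(unreviewed), protein_name({}))
```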
### Example 2: Retrieve Protein Sequence
```python
import requests
# Get human insulin sequence in FASTA format
accession = "P01308"
url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
response = requests.get(url)
print(response.text)
```
### Example 3: Custom Fields
```python
import requests
# Get specific fields only
url = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "gene:BRCA1 AND reviewed:true",
    "format": "tsv",
    "fields": "accession,gene_names,organism_name,length,cc_function"
}
response = requests.get(url, params=params)
print(response.text)
```
### Example 4: ID Mapping
```python
import requests
import time
def map_uniprot_ids(ids, from_db, to_db):
    # Submit job
    submit_url = "https://rest.uniprot.org/idmapping/run"
    data = {
        "from": from_db,
        "to": to_db,
        "ids": ",".join(ids)
    }
    response = requests.post(submit_url, data=data)
    job_id = response.json()["jobId"]
    # Poll for completion
    status_url = f"https://rest.uniprot.org/idmapping/status/{job_id}"
    while True:
        response = requests.get(status_url)
        status = response.json()
        if "results" in status or "failedIds" in status:
            break
        time.sleep(3)
    # Get results
    results_url = f"https://rest.uniprot.org/idmapping/results/{job_id}"
    response = requests.get(results_url)
    return response.json()
# Map UniProt IDs to PDB
ids = ["P01308", "P04637"]
mapping = map_uniprot_ids(ids, "UniProtKB_AC-ID", "PDB")
print(mapping)
```
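The polling loop in Example 4 never exits if a job fails. A stricter check could inspect a `jobStatus` field in the status payload; the status values used below (`NEW`, `RUNNING`, `FINISHED`) are assumptions drawn from UniProt's published polling example and should be verified against the current API. `job_is_ready` is a hypothetical helper:

```python
def job_is_ready(status):
    """Return True when results are available, False while the job is pending.

    Raises RuntimeError for any other reported job status.
    """
    job_status = status.get("jobStatus")
    if job_status is None:
        # No jobStatus key: the response already contains the results payload.
        return "results" in status or "failedIds" in status
    if job_status in ("NEW", "RUNNING"):  # assumed pending states
        return False
    if job_status == "FINISHED":  # assumed success state
        return True
    raise RuntimeError(f"ID mapping job ended with status {job_status!r}")


print(job_is_ready({"jobStatus": "RUNNING"}), job_is_ready({"results": []}))
```

Combining this check with a maximum number of polling attempts keeps a stalled job from blocking the caller indefinitely.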
### Example 5: Stream Large Results
```python
import requests
# Stream all reviewed human proteins
url = "https://rest.uniprot.org/uniprotkb/stream"
params = {
    "query": "organism_id:9606 AND reviewed:true",
    "format": "fasta"
}
response = requests.get(url, params=params, stream=True)
# Process in chunks
with open("human_proteins.fasta", "w") as f:
    for chunk in response.iter_content(chunk_size=8192, decode_unicode=True):
        if chunk:
            f.write(chunk)
```
### Example 6: Pagination
```python
import requests
def get_all_results(query, fields=None):
    """Get all results by following the pagination links"""
    url = "https://rest.uniprot.org/uniprotkb/search"
    all_results = []
    params = {
        "query": query,
        "format": "json",
        "size": 500  # Max size per page
    }
    if fields:
        params["fields"] = ",".join(fields)
    while url:
        response = requests.get(url, params=params)
        response.raise_for_status()
        data = response.json()
        all_results.extend(data['results'])
        # The next page is advertised in the HTTP Link header, not in the JSON body
        url = response.links.get("next", {}).get("url")
        # The next-page URL already carries the query and cursor
        params = None
    return all_results
# Get all human kinases
results = get_all_results(
    "protein_name:kinase AND organism_id:9606 AND reviewed:true",
    fields=["accession", "gene_names", "protein_name"]
)
print(f"Found {len(results)} proteins")
```
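UniProt's search endpoint reports the next page through an HTTP `Link` header with `rel="next"`. For clients that do not parse that header automatically, a regular expression is enough. A minimal sketch; `next_page_url` is an illustrative name and the cursor value in the demo is invented:

```python
import re

_NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(headers):
    """Return the URL of the next page from a Link header, or None on the last page."""
    match = _NEXT_LINK.search(headers.get("Link", ""))
    return match.group(1) if match else None


# Invented header value for illustration.
demo_headers = {
    "Link": '<https://rest.uniprot.org/uniprotkb/search?cursor=abc&size=500>; rel="next"'
}
print(next_page_url(demo_headers))
```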
## cURL Examples
### Example 1: Simple Search
```bash
# Search for insulin proteins
curl "https://rest.uniprot.org/uniprotkb/search?query=insulin&format=json&size=5"
```
### Example 2: Get Protein Entry
```bash
# Get human insulin in FASTA format
curl "https://rest.uniprot.org/uniprotkb/P01308.fasta"
```
### Example 3: Custom Fields
```bash
# Get specific fields in TSV format
curl "https://rest.uniprot.org/uniprotkb/search?query=gene:BRCA1&format=tsv&fields=accession,gene_names,length"
```
### Example 4: ID Mapping - Submit Job
```bash
# Submit mapping job
curl -X POST "https://rest.uniprot.org/idmapping/run" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "from=UniProtKB_AC-ID&to=PDB&ids=P01308,P04637"
```
### Example 5: ID Mapping - Get Results
```bash
# Get mapping results (replace JOB_ID)
curl "https://rest.uniprot.org/idmapping/results/JOB_ID"
```
### Example 6: Download All Results
```bash
# Download all human reviewed proteins
curl "https://rest.uniprot.org/uniprotkb/stream?query=organism_id:9606+AND+reviewed:true&format=fasta" \
-o human_proteins.fasta
```
## R Examples
### Example 1: Basic Search
```r
library(httr)
library(jsonlite)
# Search for insulin proteins
url <- "https://rest.uniprot.org/uniprotkb/search"
query_params <- list(
  query = "insulin AND organism_id:9606",
  format = "json",
  size = 10
)
response <- GET(url, query = query_params)
data <- fromJSON(content(response, "text"))
# Extract accessions and names
proteins <- data$results[, c("primaryAccession", "proteinDescription")]
print(proteins)
```
### Example 2: Get Sequences
```r
library(httr)
# Get protein sequence
accession <- "P01308"
url <- paste0("https://rest.uniprot.org/uniprotkb/", accession, ".fasta")
response <- GET(url)
sequence <- content(response, "text")
cat(sequence)
```
### Example 3: Download to Data Frame
```r
library(httr)
library(readr)
# Get data as TSV
url <- "https://rest.uniprot.org/uniprotkb/search"
query_params <- list(
  query = "gene:BRCA1 AND reviewed:true",
  format = "tsv",
  fields = "accession,gene_names,organism_name,length"
)
response <- GET(url, query = query_params)
data <- read_tsv(content(response, "text"))
print(data)
```
## JavaScript Examples
### Example 1: Fetch API
```javascript
// Search for proteins
async function searchUniProt(query) {
  const url = `https://rest.uniprot.org/uniprotkb/search?query=${encodeURIComponent(query)}&format=json&size=10`;
  const response = await fetch(url);
  const data = await response.json();
  return data.results;
}
// Usage
searchUniProt("insulin AND organism_id:9606")
  .then(results => console.log(results));
```
### Example 2: Get Protein Entry
```javascript
async function getProtein(accession, format = "json") {
  const url = `https://rest.uniprot.org/uniprotkb/${accession}.${format}`;
  const response = await fetch(url);
  if (format === "json") {
    return await response.json();
  } else {
    return await response.text();
  }
}
// Usage
getProtein("P01308", "fasta")
  .then(sequence => console.log(sequence));
```
### Example 3: ID Mapping
```javascript
async function mapIds(ids, fromDb, toDb) {
  // Submit job
  const submitUrl = "https://rest.uniprot.org/idmapping/run";
  const formData = new URLSearchParams({
    from: fromDb,
    to: toDb,
    ids: ids.join(",")
  });
  const submitResponse = await fetch(submitUrl, {
    method: "POST",
    body: formData
  });
  const { jobId } = await submitResponse.json();
  // Poll for completion
  const statusUrl = `https://rest.uniprot.org/idmapping/status/${jobId}`;
  while (true) {
    const statusResponse = await fetch(statusUrl);
    const status = await statusResponse.json();
    if ("results" in status || "failedIds" in status) {
      break;
    }
    await new Promise(resolve => setTimeout(resolve, 3000));
  }
  // Get results
  const resultsUrl = `https://rest.uniprot.org/idmapping/results/${jobId}`;
  const resultsResponse = await fetch(resultsUrl);
  return await resultsResponse.json();
}
// Usage
mapIds(["P01308", "P04637"], "UniProtKB_AC-ID", "PDB")
  .then(mapping => console.log(mapping));
```
## Advanced Examples
### Example: Batch Processing with Rate Limiting
```python
import requests
import time
from typing import List, Dict
class UniProtClient:
    def __init__(self, rate_limit=1.0):
        self.base_url = "https://rest.uniprot.org"
        self.rate_limit = rate_limit
        self.last_request = 0

    def _rate_limit(self):
        """Enforce rate limiting"""
        elapsed = time.time() - self.last_request
        if elapsed < self.rate_limit:
            time.sleep(self.rate_limit - elapsed)
        self.last_request = time.time()

    def batch_get_proteins(self, accessions: List[str],
                           batch_size: int = 100) -> List[Dict]:
        """Get proteins in batches"""
        results = []
        for i in range(0, len(accessions), batch_size):
            batch = accessions[i:i + batch_size]
            query = " OR ".join([f"accession:{acc}" for acc in batch])
            self._rate_limit()
            response = requests.get(
                f"{self.base_url}/uniprotkb/search",
                params={
                    "query": query,
                    "format": "json",
                    "size": batch_size
                }
            )
            if response.ok:
                data = response.json()
                results.extend(data.get('results', []))
            else:
                print(f"Error in batch {i//batch_size}: {response.status_code}")
        return results
# Usage
client = UniProtClient(rate_limit=0.5)
accessions = ["P01308", "P04637", "P12345", "Q9Y6K9"]
proteins = client.batch_get_proteins(accessions)
```
### Example: Download with Progress Bar
```python
import requests
from tqdm import tqdm
def download_with_progress(query, output_file, format="fasta"):
    """Download results with progress bar"""
    url = "https://rest.uniprot.org/uniprotkb/stream"
    params = {
        "query": query,
        "format": format
    }
    response = requests.get(url, params=params, stream=True)
    response.raise_for_status()
    # Streamed responses may omit Content-Length; tqdm then shows a running byte count only
    total_size = int(response.headers.get('content-length', 0)) or None
    with open(output_file, 'wb') as f, \
            tqdm(total=total_size, unit='B', unit_scale=True) as pbar:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
            pbar.update(len(chunk))
# Usage
download_with_progress(
    "organism_id:9606 AND reviewed:true",
    "human_proteome.fasta"
)
```
## Resources
- API Documentation: https://www.uniprot.org/help/api
- Interactive API Explorer: https://www.uniprot.org/api-documentation
- Python client (Unipressed): https://github.com/multimeric/Unipressed
- Bioservices package: https://bioservices.readthedocs.io/


@@ -0,0 +1,275 @@
# UniProt API Fields Reference
Complete list of available fields for customizing UniProt API queries. Use these fields with the `fields` parameter to retrieve only the data you need.
## Usage
Add fields parameter to your query:
```
https://rest.uniprot.org/uniprotkb/search?query=insulin&fields=accession,gene_names,organism_name,length
```
Multiple fields are comma-separated. No spaces after commas.
## Core Fields
### Identification
- `accession` - Primary accession number (e.g., P12345)
- `id` - Entry name (e.g., INSR_HUMAN)
- `uniprotkb_id` - Same as id
- `entryType` - REVIEWED (Swiss-Prot) or UNREVIEWED (TrEMBL)
### Protein Names
- `protein_name` - Recommended and alternative protein names
- `gene_names` - Gene name(s)
- `gene_primary` - Primary gene name
- `gene_synonym` - Gene synonyms
- `gene_oln` - Ordered locus names
- `gene_orf` - ORF names
### Organism Information
- `organism_name` - Organism scientific name
- `organism_id` - NCBI taxonomy identifier
- `lineage` - Taxonomic lineage
- `virus_hosts` - Virus host organisms (for viral proteins)
### Sequence Information
- `sequence` - Amino acid sequence
- `length` - Sequence length
- `mass` - Molecular mass (Daltons)
- `fragment` - Whether entry is a fragment
- `checksum` - Sequence CRC64 checksum
## Annotation Fields
### Function and Biology
- `cc_function` - Function description
- `cc_catalytic_activity` - Catalytic activity
- `cc_activity_regulation` - Activity regulation
- `cc_pathway` - Metabolic pathway information
- `cc_cofactor` - Cofactor information
### Interaction and Localization
- `cc_interaction` - Protein-protein interactions
- `cc_subunit` - Subunit structure
- `cc_subcellular_location` - Subcellular location
- `cc_tissue_specificity` - Tissue specificity
- `cc_developmental_stage` - Developmental stage expression
### Disease and Phenotype
- `cc_disease` - Disease associations
- `cc_disruption_phenotype` - Disruption phenotype
- `cc_allergen` - Allergen information
- `cc_toxic_dose` - Toxic dose information
### Post-translational Modifications
- `cc_ptm` - Post-translational modifications
- `cc_mass_spectrometry` - Mass spectrometry data
### Other Comments
- `cc_alternative_products` - Alternative products (isoforms)
- `cc_polymorphism` - Polymorphism information
- `cc_rna_editing` - RNA editing
- `cc_caution` - Caution notes
- `cc_miscellaneous` - Miscellaneous information
- `cc_similarity` - Sequence similarities
- `cc_sequence_caution` - Sequence caution
- `cc_web_resource` - Web resources
## Feature Fields (ft_)
### Molecular Processing
- `ft_signal` - Signal peptide
- `ft_transit` - Transit peptide
- `ft_init_met` - Initiator methionine
- `ft_propep` - Propeptide
- `ft_chain` - Chain (mature protein)
- `ft_peptide` - Peptide
### Regions and Sites
- `ft_domain` - Domain
- `ft_repeat` - Repeat
- `ft_ca_bind` - Calcium binding
- `ft_zn_fing` - Zinc finger
- `ft_dna_bind` - DNA binding
- `ft_np_bind` - Nucleotide binding
- `ft_region` - Region of interest
- `ft_coiled` - Coiled coil
- `ft_motif` - Short sequence motif
- `ft_compbias` - Compositional bias
### Sites and Modifications
- `ft_act_site` - Active site
- `ft_metal` - Metal binding
- `ft_binding` - Binding site
- `ft_site` - Site
- `ft_mod_res` - Modified residue
- `ft_lipid` - Lipidation
- `ft_carbohyd` - Glycosylation
- `ft_disulfid` - Disulfide bond
- `ft_crosslnk` - Cross-link
### Structural Features
- `ft_helix` - Helix
- `ft_strand` - Beta strand
- `ft_turn` - Turn
- `ft_transmem` - Transmembrane region
- `ft_intramem` - Intramembrane region
- `ft_topo_dom` - Topological domain
### Variation and Conflict
- `ft_variant` - Natural variant
- `ft_var_seq` - Alternative sequence
- `ft_mutagen` - Mutagenesis
- `ft_unsure` - Unsure residue
- `ft_conflict` - Sequence conflict
- `ft_non_cons` - Non-consecutive residues
- `ft_non_ter` - Non-terminal residue
- `ft_non_std` - Non-standard residue
## Gene Ontology (GO)
- `go` - All GO terms
- `go_p` - Biological process
- `go_c` - Cellular component
- `go_f` - Molecular function
- `go_id` - GO term identifiers
## Cross-References (xref_)
### Sequence Databases
- `xref_embl` - EMBL/GenBank/DDBJ
- `xref_refseq` - RefSeq
- `xref_ccds` - CCDS
- `xref_pir` - PIR
### 3D Structure Databases
- `xref_pdb` - Protein Data Bank
- `xref_pcddb` - Protein Circular Dichroism Data Bank (PCDDB)
- `xref_alphafolddb` - AlphaFold database
- `xref_smr` - SWISS-MODEL Repository
### Protein Family/Domain Databases
- `xref_interpro` - InterPro
- `xref_pfam` - Pfam
- `xref_prosite` - PROSITE
- `xref_smart` - SMART
### Genome Databases
- `xref_ensembl` - Ensembl
- `xref_ensemblgenomes` - Ensembl Genomes
- `xref_geneid` - Entrez Gene
- `xref_kegg` - KEGG
### Organism-Specific Databases
- `xref_mgi` - MGI (mouse)
- `xref_rgd` - RGD (rat)
- `xref_flybase` - FlyBase (fly)
- `xref_wormbase` - WormBase (worm)
- `xref_xenbase` - Xenbase (frog)
- `xref_zfin` - ZFIN (zebrafish)
### Pathway Databases
- `xref_reactome` - Reactome
- `xref_signor` - SIGNOR
- `xref_signalink` - SignaLink
### Disease Databases
- `xref_disgenet` - DisGeNET
- `xref_malacards` - MalaCards
- `xref_omim` - OMIM
- `xref_orphanet` - Orphanet
### Drug Databases
- `xref_chembl` - ChEMBL
- `xref_drugbank` - DrugBank
- `xref_guidetopharmacology` - Guide to Pharmacology
### Expression Databases
- `xref_bgee` - Bgee
- `xref_expressionatlas` - Expression Atlas
- `xref_genevisible` - Genevisible
## Metadata Fields
### Dates
- `date_created` - Entry creation date
- `date_modified` - Last modification date
- `date_sequence_modified` - Last sequence modification date
### Evidence and Quality
- `annotation_score` - Annotation score (1-5)
- `protein_existence` - Protein existence level
- `reviewed` - Whether entry is reviewed (Swiss-Prot)
### Literature
- `lit_pubmed_id` - PubMed identifiers
- `lit_doi` - DOI identifiers
### Proteomics
- `proteome` - Proteome identifier
- `tools` - Tools used for annotation
## Retrieving Available Fields Programmatically
Use the configuration endpoint to get all available fields:
```bash
curl https://rest.uniprot.org/configure/uniprotkb/result-fields
```
Or in Python:
```python
import requests
response = requests.get("https://rest.uniprot.org/configure/uniprotkb/result-fields")
fields = response.json()
```
## Common Field Combinations
### Basic protein information
```
fields=accession,id,protein_name,gene_names,organism_name,length
```
### Sequence and structure
```
fields=accession,sequence,length,mass,xref_pdb,xref_alphafolddb
```
### Functional annotation
```
fields=accession,protein_name,cc_function,cc_catalytic_activity,cc_pathway,go
```
### Disease information
```
fields=accession,protein_name,gene_names,cc_disease,xref_omim,xref_malacards
```
### Expression patterns
```
fields=accession,gene_names,cc_tissue_specificity,cc_developmental_stage,xref_bgee
```
### Complete annotation
```
fields=accession,id,protein_name,gene_names,organism_name,sequence,length,cc_*,ft_*,go,xref_pdb
```
## Notes
1. **Wildcards**: Some fields support wildcards (e.g., `cc_*` for all comment fields, `ft_*` for all features)
2. **Performance**: Requesting fewer fields improves response time and reduces bandwidth
3. **Format dependency**: Some fields may be formatted differently depending on output format (JSON vs TSV)
4. **Null values**: Fields without data may be omitted from response (JSON) or empty (TSV)
5. **Arrays vs strings**: In JSON format, many fields return arrays of objects rather than simple strings
## Resources
- Interactive field explorer: https://www.uniprot.org/api-documentation
- API fields endpoint: https://rest.uniprot.org/configure/uniprotkb/result-fields
- Return fields documentation: https://www.uniprot.org/help/return_fields


@@ -0,0 +1,285 @@
# UniProt ID Mapping Databases
Complete list of databases supported by the UniProt ID Mapping service. Use these database names when calling the ID mapping API.
## Retrieving Database List Programmatically
```python
import requests
response = requests.get("https://rest.uniprot.org/configure/idmapping/fields")
databases = response.json()
```
## UniProt Databases
### UniProtKB
- `UniProtKB_AC-ID` - UniProt accession and ID
- `UniProtKB` - UniProt Knowledgebase
- `UniProtKB-Swiss-Prot` - Reviewed (Swiss-Prot)
- `UniProtKB-TrEMBL` - Unreviewed (TrEMBL)
- `UniParc` - UniProt Archive
- `UniRef50` - UniRef 50% identity clusters
- `UniRef90` - UniRef 90% identity clusters
- `UniRef100` - UniRef 100% identity clusters
## Sequence Databases
### Nucleotide Sequence
- `EMBL` - EMBL/GenBank/DDBJ
- `EMBL-CDS` - EMBL coding sequences
- `RefSeq_Nucleotide` - RefSeq nucleotide sequences
- `CCDS` - Consensus CDS
### Protein Sequence
- `RefSeq_Protein` - RefSeq protein sequences
- `PIR` - Protein Information Resource
## Gene Databases
- `GeneID` - Entrez Gene
- `Gene_Name` - Gene name
- `Gene_Synonym` - Gene synonym
- `Gene_OrderedLocusName` - Ordered locus name
- `Gene_ORFName` - ORF name
## Genome Databases
### General
- `Ensembl` - Ensembl
- `EnsemblGenomes` - Ensembl Genomes
- `EnsemblGenomes_PRO` - Ensembl Genomes protein
- `EnsemblGenomes_TRS` - Ensembl Genomes transcript
- `Ensembl_PRO` - Ensembl protein
- `Ensembl_TRS` - Ensembl transcript
### Organism-Specific
- `KEGG` - KEGG Genes
- `PATRIC` - PATRIC
- `UCSC` - UCSC Genome Browser
- `VectorBase` - VectorBase
- `WBParaSite` - WormBase ParaSite
## Structure Databases
- `PDB` - Protein Data Bank
- `AlphaFoldDB` - AlphaFold Database
- `BMRB` - Biological Magnetic Resonance Data Bank
- `PDBsum` - PDB summary
- `SASBDB` - Small Angle Scattering Biological Data Bank
- `SMR` - SWISS-MODEL Repository
## Protein Family and Domain Databases
- `InterPro` - InterPro
- `Pfam` - Pfam protein families
- `PROSITE` - PROSITE
- `SMART` - SMART domains
- `CDD` - Conserved Domain Database
- `HAMAP` - HAMAP
- `PANTHER` - PANTHER
- `PRINTS` - PRINTS
- `ProDom` - ProDom
- `SFLD` - Structure-Function Linkage Database
- `SUPFAM` - SUPERFAMILY
- `TIGRFAMs` - TIGRFAMs
## Organism-Specific Databases
### Model Organisms
- `MGI` - Mouse Genome Informatics
- `RGD` - Rat Genome Database
- `FlyBase` - FlyBase (Drosophila)
- `WormBase` - WormBase (C. elegans)
- `Xenbase` - Xenbase (Xenopus)
- `ZFIN` - Zebrafish Information Network
- `dictyBase` - dictyBase (Dictyostelium)
- `EcoGene` - EcoGene (E. coli)
- `SGD` - Saccharomyces Genome Database (yeast)
- `PomBase` - PomBase (S. pombe)
- `TAIR` - The Arabidopsis Information Resource
### Human-Specific
- `HGNC` - HUGO Gene Nomenclature Committee
- `CCDS` - Consensus Coding Sequence Database
## Pathway Databases
- `Reactome` - Reactome
- `BioCyc` - BioCyc
- `PlantReactome` - Plant Reactome
- `SIGNOR` - SIGNOR
- `SignaLink` - SignaLink
## Enzyme and Metabolism
- `EC` - Enzyme Commission number
- `BRENDA` - BRENDA enzyme database
- `SABIO-RK` - SABIO-RK (biochemical reactions)
- `MetaCyc` - MetaCyc
## Disease and Phenotype Databases
- `OMIM` - Online Mendelian Inheritance in Man
- `MIM` - MIM (same as OMIM)
- `OrphaNet` - Orphanet (rare diseases)
- `DisGeNET` - DisGeNET
- `MalaCards` - MalaCards
- `CTD` - Comparative Toxicogenomics Database
- `OpenTargets` - Open Targets
## Drug and Chemical Databases
- `ChEMBL` - ChEMBL
- `DrugBank` - DrugBank
- `DrugCentral` - DrugCentral
- `GuidetoPHARMACOLOGY` - Guide to Pharmacology
- `SwissLipids` - SwissLipids
## Gene Expression Databases
- `Bgee` - Bgee gene expression
- `ExpressionAtlas` - Expression Atlas
- `Genevisible` - Genevisible
- `CleanEx` - CleanEx
## Proteomics Databases
- `PRIDE` - PRIDE proteomics
- `PeptideAtlas` - PeptideAtlas
- `ProteomicsDB` - ProteomicsDB
- `CPTAC` - CPTAC
- `jPOST` - jPOST
- `MassIVE` - MassIVE
- `MaxQB` - MaxQB
- `PaxDb` - PaxDb
- `TopDownProteomics` - Top Down Proteomics
## Protein-Protein Interaction
- `STRING` - STRING
- `BioGRID` - BioGRID
- `IntAct` - IntAct
- `MINT` - MINT
- `DIP` - Database of Interacting Proteins
- `ComplexPortal` - Complex Portal
## Ontologies
- `GO` - Gene Ontology
- `GeneTree` - Ensembl GeneTree
- `HOGENOM` - HOGENOM
- `HOVERGEN` - HOVERGEN
- `KO` - KEGG Orthology
- `OMA` - OMA orthology
- `OrthoDB` - OrthoDB
- `TreeFam` - TreeFam
## Other Specialized Databases
### Glycosylation
- `GlyConnect` - GlyConnect
- `GlyGen` - GlyGen
### Protein Modifications
- `PhosphoSitePlus` - PhosphoSitePlus
- `iPTMnet` - iPTMnet
### Antibodies
- `Antibodypedia` - Antibodypedia
- `DNASU` - DNASU
### Protein Localization
- `COMPARTMENTS` - COMPARTMENTS
- `NeXtProt` - NeXtProt (human proteins)
### Evolution and Phylogeny
- `eggNOG` - eggNOG
- `GeneTree` - Ensembl GeneTree
- `InParanoid` - InParanoid
### Technical Resources
- `PRO` - Protein Ontology
- `GenomeRNAi` - GenomeRNAi
- `PubMed` - PubMed literature references
## Common Mapping Scenarios
### Example 1: UniProt to PDB
```python
from_db = "UniProtKB_AC-ID"
to_db = "PDB"
ids = ["P01308", "P04637"]
```
### Example 2: Gene Name to UniProt
```python
from_db = "Gene_Name"
to_db = "UniProtKB"
ids = ["BRCA1", "TP53", "INSR"]
```
### Example 3: UniProt to Ensembl
```python
from_db = "UniProtKB_AC-ID"
to_db = "Ensembl"
ids = ["P12345"]
```
### Example 4: RefSeq to UniProt
```python
from_db = "RefSeq_Protein"
to_db = "UniProtKB"
ids = ["NP_000207.1"]
```
### Example 5: UniProt to GO Terms
```python
from_db = "UniProtKB_AC-ID"
to_db = "GO"
ids = ["P01308"]
```
## Usage Notes
1. **Database names are case-sensitive**: Use exact names as listed
2. **Many-to-many mappings**: One ID may map to multiple target IDs
3. **Failed mappings**: Some IDs may not have mappings; check the `failedIds` field in results
4. **Batch size limit**: Maximum 100,000 IDs per job
5. **Result expiration**: Results are stored for 7 days
6. **Bidirectional mapping**: Most databases support mapping in both directions
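Notes 2 and 3 suggest grouping the result pairs by source ID and keeping the failed IDs separate. The sketch below assumes the results payload is a list of `{"from": ..., "to": ...}` pairs plus an optional `failedIds` list, with `to` being a plain string (when the target is UniProtKB, `to` may be a full entry object instead). `group_mappings` is a hypothetical helper and the target IDs in the demo are placeholders:

```python
from collections import defaultdict


def group_mappings(payload):
    """Return ({source_id: [target, ...]}, [failed_id, ...]) from a results payload."""
    grouped = defaultdict(list)
    for pair in payload.get("results", []):
        grouped[pair["from"]].append(pair["to"])
    return dict(grouped), list(payload.get("failedIds", []))


# Placeholder payload for illustration; the target IDs are not real.
demo = {
    "results": [
        {"from": "P01308", "to": "STRUCT_A"},
        {"from": "P01308", "to": "STRUCT_B"},
        {"from": "P04637", "to": "STRUCT_C"},
    ],
    "failedIds": ["NOT_AN_ID"],
}
mapped, failed = group_mappings(demo)
print(mapped, failed)
```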
## API Endpoints
### Get available databases
```
GET https://rest.uniprot.org/configure/idmapping/fields
```
### Submit mapping job
```
POST https://rest.uniprot.org/idmapping/run
Content-Type: application/x-www-form-urlencoded
from={from_db}&to={to_db}&ids={comma_separated_ids}
```
### Check job status
```
GET https://rest.uniprot.org/idmapping/status/{jobId}
```
### Get results
```
GET https://rest.uniprot.org/idmapping/results/{jobId}
```
## Resources
- ID Mapping tool: https://www.uniprot.org/id-mapping
- API documentation: https://www.uniprot.org/help/id_mapping
- Programmatic access: https://www.uniprot.org/help/api_idmapping


@@ -0,0 +1,256 @@
# UniProt Query Syntax Reference
Comprehensive guide to UniProt search query syntax for constructing complex searches.
## Basic Syntax
### Simple Queries
```
insulin
kinase
```
### Field-Specific Searches
```
gene:BRCA1
accession:P12345
organism_name:human
protein_name:kinase
```
## Boolean Operators
### AND (both terms must be present)
```
insulin AND diabetes
kinase AND human
gene:BRCA1 AND reviewed:true
```
### OR (either term can be present)
```
diabetes OR insulin
(cancer OR tumor) AND human
```
### NOT (exclude terms)
```
kinase NOT human
protein_name:kinase NOT organism_name:mouse
```
### Grouping with Parentheses
```
(diabetes OR insulin) AND reviewed:true
(gene:BRCA1 OR gene:BRCA2) AND organism_id:9606
```
## Common Search Fields
### Identification
- `accession:P12345` - UniProt accession number
- `id:INSR_HUMAN` - Entry name
- `gene:BRCA1` - Gene name
- `gene_exact:BRCA1` - Exact gene name match
### Organism/Taxonomy
- `organism_name:human` - Organism name
- `organism_name:"Homo sapiens"` - Exact organism name (use quotes for multi-word)
- `organism_id:9606` - NCBI taxonomy ID
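- `taxonomy_id:9606` - NCBI taxonomy ID of a taxon; also matches entries from its descendant taxa, whereas `organism_id` matches only the exact organism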
- `taxonomy_name:"Homo sapiens"` - Taxonomy name
### Protein Information
- `protein_name:insulin` - Protein name
- `protein_name:"insulin receptor"` - Exact protein name
- `reviewed:true` - Only Swiss-Prot (reviewed) entries
- `reviewed:false` - Only TrEMBL (unreviewed) entries
### Sequence Properties
- `length:[100 TO 500]` - Sequence length range
- `mass:[50000 TO 100000]` - Molecular mass in Daltons
- `sequence:MVLSPADKTNVK` - Exact sequence match
- `fragment:false` - Exclude fragment sequences
### Gene Ontology (GO)
- `go:0005515` - GO term ID (0005515 = protein binding)
- `go_f:*` - Any molecular function
- `go_p:*` - Any biological process
- `go_c:*` - Any cellular component
### Annotations
- `ft_signal:*` - Has signal peptide annotation
- `ft_transmem:*` - Has transmembrane region
- `cc_function:*` - Has function comment
- `cc_interaction:*` - Has interaction comment
- `ft_domain:*` - Has domain feature
### Database Cross-References
- `xref:pdb` - Has PDB structure
- `xref:ensembl` - Has Ensembl reference
- `database:pdb` - Same as xref
- `database:(type:pdb)` - Alternative syntax
### Protein Families and Domains
- `family:"protein kinase"` - Protein family
- `keyword:"Protein kinase"` - Keyword annotation
- `cc_similarity:*` - Has similarity comment
## Range Queries
### Numeric Ranges
```
length:[100 TO 500]              # Between 100 and 500 (inclusive)
mass:[* TO 50000]                # Less than or equal to 50000
date_created:[2023-01-01 TO *]   # Created on or after Jan 1, 2023
```
### Date Ranges
```
date_created:[2023-01-01 TO 2023-12-31]
date_modified:[2024-01-01 TO *]
```
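
Range clauses follow one pattern, `field:[low TO high]`, with `*` for an open end. A small sketch of a formatter is shown below; `range_clause` is a hypothetical helper, and it assumes dates should be written in ISO `YYYY-MM-DD` form as in the examples above.

```python
from datetime import date
from typing import Optional, Union

Bound = Optional[Union[int, date]]


def _fmt(bound: Bound) -> str:
    """Render one end of a range; None means an open end."""
    if bound is None:
        return "*"
    if isinstance(bound, date):
        return bound.isoformat()
    return str(bound)


def range_clause(field: str, low: Bound = None, high: Bound = None) -> str:
    """Build an inclusive range clause for the given field."""
    return f"{field}:[{_fmt(low)} TO {_fmt(high)}]"


print(range_clause("length", 100, 500))   # length:[100 TO 500]
print(range_clause("mass", high=50000))   # mass:[* TO 50000]
```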
## Wildcards
### Single Character (?)
```
gene:BRCA? # Matches BRCA1, BRCA2, etc.
```
### Multiple Characters (*)
```
gene:BRCA* # Matches BRCA1, BRCA2, BRCA1P1, etc.
protein_name:kinase*
organism_name:Homo*
```
## Advanced Searches
### Existence Queries
```
cc_function:* # Has any function annotation
ft_domain:* # Has any domain feature
xref:pdb # Has PDB structure
```
### Combined Complex Queries
```
# Human reviewed kinases with PDB structure
(protein_name:kinase OR family:kinase) AND organism_id:9606 AND reviewed:true AND xref:pdb
# Cancer-related proteins excluding mice
(disease:cancer OR keyword:cancer) NOT organism_name:mouse
# Membrane proteins with signal peptides
ft_transmem:* AND ft_signal:* AND reviewed:true
# Recently updated human proteins
organism_id:9606 AND date_modified:[2024-01-01 TO *] AND reviewed:true
```
## Field-Specific Examples
### Protein Names
```
protein_name:"insulin receptor" # Exact phrase
protein_name:insulin* # Starts with insulin
recommended_name:insulin # Recommended name only
alternative_name:insulin # Alternative names only
```
### Genes
```
gene:BRCA1 # Gene symbol
gene_exact:BRCA1 # Exact gene match
olnName:BRCA1 # Ordered locus name
orfName:BRCA1 # ORF name
```
### Organisms
```
organism_name:human # Common name
organism_name:"Homo sapiens" # Scientific name
organism_id:9606 # Taxonomy ID
lineage:primates # Taxonomic lineage
```
### Features
```
ft_signal:* # Signal peptide
ft_transmem:* # Transmembrane region
ft_domain:"Protein kinase" # Specific domain
ft_binding:* # Binding site
ft_site:* # Any site
```
### Comments (cc_)
```
cc_function:* # Function description
cc_catalytic_activity:* # Catalytic activity
cc_pathway:* # Pathway involvement
cc_interaction:* # Protein interactions
cc_subcellular_location:* # Subcellular location
cc_tissue_specificity:* # Tissue specificity
cc_disease:cancer # Disease association
```
## Tips and Best Practices
1. **Use quotes for exact phrases**: `organism_name:"Homo sapiens"` not `organism_name:Homo sapiens`
2. **Filter by review status**: Add `AND reviewed:true` for high-quality Swiss-Prot entries
3. **Combine wildcards carefully**: `*kinase*` may be too broad; `kinase*` is more specific
4. **Use parentheses for complex logic**: `(A OR B) AND (C OR D)` states the grouping explicitly, whereas `A OR B AND C OR D` leaves it to operator precedence
5. **Numeric ranges are inclusive**: `length:[100 TO 500]` includes both 100 and 500
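6. **Field prefixes**: Learn common prefixes:
   - `cc_` = Comments
   - `ft_` = Features
   - `go_` = Gene Ontology
   - `xref_` = Cross-references (used in return-field names such as `xref_pdb`; queries use `xref:` or `database:`)
7. **Check field names**: Use the API's `/configure/uniprotkb/search-fields` endpoint to list query fields and `/configure/uniprotkb/result-fields` to list return fields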
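
Tip 1 is easy to forget when values come from user input. A minimal sketch of a quoting helper follows; `field_clause` is a hypothetical name, and the code assumes the value itself contains no double quotes (it rejects such values instead of guessing an escaping rule).

```python
def field_clause(field: str, value: str) -> str:
    """Build `field:value`, quoting the value when it contains whitespace."""
    if '"' in value:
        # Assumption: escaping rules are not covered here, so refuse such input
        raise ValueError("value must not contain double quotes")
    if any(ch.isspace() for ch in value):
        return f'{field}:"{value}"'
    return f"{field}:{value}"


print(field_clause("organism_name", "Homo sapiens"))  # organism_name:"Homo sapiens"
print(field_clause("gene", "BRCA1"))                  # gene:BRCA1
```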
## Query Validation
Test queries using:
- **Web interface**: https://www.uniprot.org/uniprotkb
- **API**: https://rest.uniprot.org/uniprotkb/search?query=YOUR_QUERY
- **API documentation**: https://www.uniprot.org/help/query-fields
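
Queries contain colons, spaces, brackets, and quotes, so they need URL encoding before they are placed in the API URL. A minimal sketch using only the standard library is shown below; `search_url` is a hypothetical helper, and `size=1` is simply a way to keep a validation request small.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://rest.uniprot.org/uniprotkb/search"


def search_url(query: str, fmt: str = "json", size: int = 1) -> str:
    """Return a fully encoded search URL for the given query."""
    return SEARCH_ENDPOINT + "?" + urlencode({"query": query, "format": fmt, "size": size})


print(search_url("gene:BRCA1 AND reviewed:true"))
```

Opening the printed URL in a browser or with an HTTP client is a quick way to see whether the server accepts the query.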
## Common Patterns
### Find well-characterized proteins
```
reviewed:true AND xref:pdb AND cc_function:*
```
### Find disease-associated proteins
```
cc_disease:* AND organism_id:9606 AND reviewed:true
```
### Find proteins with experimental evidence
```
existence:"Evidence at protein level" AND reviewed:true
```
### Find secreted proteins
```
cc_subcellular_location:secreted AND reviewed:true
```
### Find drug targets
```
keyword:"Pharmaceutical" OR keyword:"Drug target"
```
## Resources
- Full query field reference: https://www.uniprot.org/help/query-fields
- API query documentation: https://www.uniprot.org/help/api_queries
- Text search documentation: https://www.uniprot.org/help/text-search

View File

@@ -0,0 +1,256 @@
#!/usr/bin/env python3
"""
UniProt REST API Client

A Python client for interacting with the UniProt REST API.
Provides helper functions for common operations including search,
retrieval, ID mapping, and streaming.

Usage examples:
    # Search for proteins
    results = search_proteins("insulin AND organism_name:human", format="json")

    # Get a single protein
    protein = get_protein("P12345", format="fasta")

    # Map IDs
    mapped = map_ids(["P12345", "P04637"], from_db="UniProtKB_AC-ID", to_db="PDB")

    # Stream large results
    for batch in stream_results("taxonomy_id:9606 AND reviewed:true", format="fasta"):
        process(batch)
"""

import json
import time
from typing import Dict, Generator, List, Optional

import requests

BASE_URL = "https://rest.uniprot.org"
POLLING_INTERVAL = 3  # seconds between ID-mapping status checks


def search_proteins(query: str, format: str = "json",
                    fields: Optional[List[str]] = None,
                    size: int = 25) -> Dict:
    """
    Search UniProt database with a query.

    Args:
        query: Search query (e.g., "insulin AND organism_name:human")
        format: Response format (json, tsv, xlsx, xml, fasta, txt, rdf)
        fields: List of fields to return (e.g., ["accession", "gene_names", "organism_name"])
        size: Number of results per page (default 25, max 500)

    Returns:
        Parsed JSON (dict) when format is "json"; otherwise the raw response text
    """
    endpoint = f"{BASE_URL}/uniprotkb/search"
    params = {
        "query": query,
        "format": format,
        "size": size
    }
    if fields:
        params["fields"] = ",".join(fields)

    response = requests.get(endpoint, params=params)
    response.raise_for_status()

    if format == "json":
        return response.json()
    else:
        return response.text


def get_protein(accession: str, format: str = "json") -> str:
    """
    Retrieve a single protein entry by accession number.

    Args:
        accession: UniProt accession number (e.g., "P12345")
        format: Response format (json, txt, xml, fasta, gff, rdf)

    Returns:
        Parsed JSON (dict) when format is "json"; otherwise the raw response text
    """
    endpoint = f"{BASE_URL}/uniprotkb/{accession}.{format}"
    response = requests.get(endpoint)
    response.raise_for_status()

    if format == "json":
        return response.json()
    else:
        return response.text


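def batch_retrieve(accessions: List[str], format: str = "json",
                   fields: Optional[List[str]] = None) -> str:
    """
    Retrieve multiple protein entries in a single search request.

    Args:
        accessions: List of UniProt accession numbers (1 to 500)
        format: Response format
        fields: List of fields to return

    Returns:
        Combined results in requested format
    """
    # The search endpoint returns one page of at most 500 entries,
    # so larger lists must be split by the caller.
    if not accessions:
        raise ValueError("At least one accession is required")
    if len(accessions) > 500:
        raise ValueError("At most 500 accessions can be retrieved per call")

    query = " OR ".join(f"accession:{acc}" for acc in accessions)
    return search_proteins(query, format=format, fields=fields, size=len(accessions))

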
def stream_results(query: str, format: str = "fasta",
                   fields: Optional[List[str]] = None,
                   chunk_size: int = 8192) -> Generator[str, None, None]:
    """
    Stream large result sets without pagination.

    Args:
        query: Search query
        format: Response format
        fields: List of fields to return
        chunk_size: Size of chunks to yield

    Yields:
        Chunks of response data
    """
    endpoint = f"{BASE_URL}/uniprotkb/stream"
    params = {
        "query": query,
        "format": format
    }
    if fields:
        params["fields"] = ",".join(fields)

    response = requests.get(endpoint, params=params, stream=True)
    response.raise_for_status()

    # Without a known encoding, iter_content yields bytes even when
    # decode_unicode=True, so fall back to UTF-8.
    if response.encoding is None:
        response.encoding = "utf-8"

    for chunk in response.iter_content(chunk_size=chunk_size, decode_unicode=True):
        if chunk:
            yield chunk


def map_ids(ids: List[str], from_db: str, to_db: str,
            format: str = "json") -> Dict:
    """
    Map protein identifiers between different database systems.

    Args:
        ids: List of identifiers to map (max 100,000)
        from_db: Source database (e.g., "UniProtKB_AC-ID", "Gene_Name")
        to_db: Target database (e.g., "PDB", "Ensembl", "RefSeq_Protein")
        format: Response format

    Returns:
        Mapping results

    Note:
        - Maximum 100,000 IDs per job
        - Results stored for 7 days
        - See id_mapping_databases.md for all supported databases
    """
    if len(ids) > 100000:
        raise ValueError("Maximum 100,000 IDs allowed per mapping job")

    # Step 1: Submit job
    submit_endpoint = f"{BASE_URL}/idmapping/run"
    data = {
        "from": from_db,
        "to": to_db,
        "ids": ",".join(ids)
    }
    response = requests.post(submit_endpoint, data=data)
    response.raise_for_status()
    job_id = response.json()["jobId"]

    # Step 2: Poll for completion
    status_endpoint = f"{BASE_URL}/idmapping/status/{job_id}"
    while True:
        response = requests.get(status_endpoint)
        response.raise_for_status()
        status = response.json()
        # A finished job answers with the results payload itself
        if "results" in status or "failedIds" in status:
            break
        # Assumption: pending jobs report jobStatus NEW or RUNNING;
        # anything else other than FINISHED is treated as a failure
        # so that the loop cannot spin forever.
        job_status = status.get("jobStatus")
        if job_status == "FINISHED":
            break
        if job_status not in ("NEW", "RUNNING"):
            raise RuntimeError(f"ID mapping job {job_id} ended with status {job_status!r}")
        time.sleep(POLLING_INTERVAL)

    # Step 3: Retrieve results
    results_endpoint = f"{BASE_URL}/idmapping/results/{job_id}"
    params = {"format": format}
    response = requests.get(results_endpoint, params=params)
    response.raise_for_status()

    if format == "json":
        return response.json()
    else:
        return response.text


def get_available_fields() -> List[Dict]:
    """
    Get list of all available return fields (values accepted by the
    `fields` parameter of search and stream requests).

    Returns:
        List of field definitions with names and descriptions
    """
    endpoint = f"{BASE_URL}/configure/uniprotkb/result-fields"
    response = requests.get(endpoint)
    response.raise_for_status()
    return response.json()


def get_id_mapping_databases() -> Dict:
    """
    Get list of all supported databases for ID mapping.

    Returns:
        Dictionary of database groups and their supported databases
    """
    endpoint = f"{BASE_URL}/configure/idmapping/fields"
    response = requests.get(endpoint)
    response.raise_for_status()
    return response.json()


# Example usage
if __name__ == "__main__":
    # Example 1: Search for human insulin proteins
    print("Searching for human insulin proteins...")
    results = search_proteins(
        "insulin AND organism_name:human AND reviewed:true",
        format="json",
        fields=["accession", "id", "gene_names", "protein_name"],
        size=5
    )
    print(json.dumps(results, indent=2))

    # Example 2: Get a specific protein in FASTA format
    print("\nRetrieving protein P01308 (human insulin)...")
    protein = get_protein("P01308", format="fasta")
    print(protein)

    # Example 3: Map UniProt IDs to PDB IDs
    print("\nMapping UniProt IDs to PDB...")
    mapping = map_ids(
        ["P01308", "P04637"],
        from_db="UniProtKB_AC-ID",
        to_db="PDB"
    )
    print(json.dumps(mapping, indent=2))