# UniProt Query Syntax Reference

Comprehensive guide to UniProt search query syntax for constructing complex searches.

## Basic Syntax

### Simple Queries
```
insulin
kinase
```

### Field-Specific Searches
```
gene:BRCA1
accession:P12345
organism_name:human
protein_name:kinase
```
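When queries are assembled in code, a small helper can produce the `field:value` form used above. The sketch below is only an illustration: `field_term` is a hypothetical name, not part of any UniProt client, and it assumes that a value containing whitespace should be wrapped in double quotes.

```python
def field_term(field: str, value: str) -> str:
    """Return a `field:value` search term.

    Assumption: values containing whitespace are wrapped in double quotes
    so the whole phrase stays attached to the field.
    """
    if any(ch.isspace() for ch in value):
        escaped = value.replace('"', '\\"')  # keep embedded quotes literal
        return f'{field}:"{escaped}"'
    return f"{field}:{value}"


print(field_term("gene", "BRCA1"))
print(field_term("organism_name", "Homo sapiens"))
```

The first call yields a bare term and the second a quoted phrase, so callers do not have to decide on quoting themselves.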

## Boolean Operators

### AND (both terms must be present)
```
insulin AND diabetes
kinase AND human
gene:BRCA1 AND reviewed:true
```

### OR (either term can be present)
```
diabetes OR insulin
(cancer OR tumor) AND human
```

### NOT (exclude terms)
```
kinase NOT human
protein_name:kinase NOT organism_name:mouse
```

### Grouping with Parentheses
```
(diabetes OR insulin) AND reviewed:true
(gene:BRCA1 OR gene:BRCA2) AND organism_id:9606
```
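The grouping rule can be built into query-composing code so that an OR group is always parenthesised before it is combined with AND. This is a minimal sketch; `all_of` and `any_of` are hypothetical helper names introduced here for illustration.

```python
def all_of(*terms: str) -> str:
    """Join terms with AND."""
    return " AND ".join(terms)


def any_of(*terms: str) -> str:
    """Join terms with OR and wrap the group in parentheses,
    so it can be embedded safely inside an AND expression."""
    return "(" + " OR ".join(terms) + ")"


query = all_of(any_of("gene:BRCA1", "gene:BRCA2"), "organism_id:9606")
print(query)  # (gene:BRCA1 OR gene:BRCA2) AND organism_id:9606
```

Always parenthesising OR groups costs a few characters but removes any reliance on how the search engine orders AND and OR.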

## Common Search Fields

### Identification
- `accession:P12345` - UniProt accession number
- `id:INSR_HUMAN` - Entry name
- `gene:BRCA1` - Gene name
- `gene_exact:BRCA1` - Exact gene name match

### Organism/Taxonomy
- `organism_name:human` - Organism name
- `organism_name:"Homo sapiens"` - Exact organism name (use quotes for multi-word)
- `organism_id:9606` - NCBI taxonomy ID of the organism itself
- `taxonomy_id:9606` - NCBI taxonomy ID of a taxon; also matches organisms below it in the lineage
- `taxonomy_name:"Homo sapiens"` - Taxonomy name

### Protein Information
- `protein_name:insulin` - Protein name
- `protein_name:"insulin receptor"` - Exact protein name
- `reviewed:true` - Only Swiss-Prot (reviewed) entries
- `reviewed:false` - Only TrEMBL (unreviewed) entries

### Sequence Properties
- `length:[100 TO 500]` - Sequence length range
- `mass:[50000 TO 100000]` - Molecular mass in Daltons
- `sequence:MVLSPADKTNVK` - Exact sequence match
- `fragment:false` - Exclude fragment sequences

### Gene Ontology (GO)
- `go:0005515` - GO term ID (0005515 = protein binding)
- `go_f:*` - Any molecular function
- `go_p:*` - Any biological process
- `go_c:*` - Any cellular component

### Annotations
- `ft_signal:*` - Has signal peptide annotation
- `ft_transmem:*` - Has transmembrane region
- `cc_function:*` - Has function comment
- `cc_interaction:*` - Has interaction comment
- `ft_domain:*` - Has domain feature

### Database Cross-References
- `xref:pdb` - Has PDB structure
- `xref:ensembl` - Has Ensembl reference
- `database:pdb` - Same as xref
- `database:(type:pdb)` - Alternative syntax

### Protein Families and Domains
- `family:"protein kinase"` - Protein family
- `keyword:"Protein kinase"` - Keyword annotation
- `cc_similarity:*` - Has similarity comment

## Range Queries

### Numeric Ranges
```
length:[100 TO 500]  # Between 100 and 500
mass:[* TO 50000]  # Less than or equal to 50000
```

### Date Ranges
```
date_created:[2023-01-01 TO *]  # Created on or after Jan 1, 2023
date_created:[2023-01-01 TO 2023-12-31]
date_modified:[2024-01-01 TO *]
```
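Range terms follow a fixed `field:[low TO high]` shape, with `*` standing in for an open end. The sketch below formats such terms; `range_term` is a hypothetical helper, and treating `None` as an open bound is a convention chosen here, not something defined by UniProt.

```python
from typing import Optional, Union

Bound = Optional[Union[int, str]]


def range_term(field: str, low: Bound = None, high: Bound = None) -> str:
    """Format an inclusive range term; a bound of None becomes `*` (open-ended)."""
    lo = "*" if low is None else str(low)
    hi = "*" if high is None else str(high)
    return f"{field}:[{lo} TO {hi}]"


print(range_term("length", 100, 500))   # length:[100 TO 500]
print(range_term("mass", high=50000))   # mass:[* TO 50000]
```

Because bounds are converted with `str`, the same helper works for ISO-formatted date strings as well as integers.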

## Wildcards

### Single Character (?)
```
gene:BRCA?  # Matches BRCA1, BRCA2, etc.
```

### Multiple Characters (*)
```
gene:BRCA*  # Matches BRCA1, BRCA2, BRCA1P1, etc.
protein_name:kinase*
organism_name:Homo*
```
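To get a feel for the difference between `?` and `*`, the patterns can be tried locally against a few names. The sketch below uses Python's shell-style `fnmatch` module purely as an analogy for the wildcard semantics described above; it does not call UniProt, and the sample names are illustrative only.

```python
from fnmatch import fnmatchcase

# Illustrative gene-like names, not fetched from UniProt
names = ["BRCA1", "BRCA2", "BRCA1P1", "BRC"]

# `?` stands for exactly one character
single = [n for n in names if fnmatchcase(n, "BRCA?")]

# `*` stands for any run of characters, including none
multi = [n for n in names if fnmatchcase(n, "BRCA*")]

print(single)  # ['BRCA1', 'BRCA2']
print(multi)   # ['BRCA1', 'BRCA2', 'BRCA1P1']
```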

## Advanced Searches

### Existence Queries
```
cc_function:*  # Has any function annotation
ft_domain:*  # Has any domain feature
xref:pdb  # Has PDB structure
```

### Combined Complex Queries
```
# Human reviewed kinases with PDB structure
(protein_name:kinase OR family:kinase) AND organism_id:9606 AND reviewed:true AND xref:pdb

# Cancer-related proteins excluding mice
(disease:cancer OR keyword:cancer) NOT organism_name:mouse

# Transmembrane proteins with signal peptides
ft_transmem:* AND ft_signal:* AND reviewed:true

# Recently updated human proteins
organism_id:9606 AND date_modified:[2024-01-01 TO *] AND reviewed:true
```
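Queries like these contain spaces, colons, and parentheses, so they need URL encoding before being sent to the REST search endpoint. The following sketch builds such a URL with the standard library only. `search_url` is a hypothetical helper, and the `format` and `size` parameters are assumptions not covered in this reference; check the API documentation before relying on them.

```python
from urllib.parse import urlencode

BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def search_url(query: str, fmt: str = "json", size: int = 25) -> str:
    """Build a search URL; urlencode percent-encodes the query string.

    Assumption: `format` and `size` are accepted as query parameters.
    """
    params = {"query": query, "format": fmt, "size": size}
    return f"{BASE_URL}?{urlencode(params)}"


url = search_url("organism_id:9606 AND reviewed:true AND xref:pdb")
print(url)
```

The resulting string can be passed to any HTTP client; no request is made by the sketch itself.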

## Field-Specific Examples

### Protein Names
```
protein_name:"insulin receptor"  # Exact phrase
protein_name:insulin*  # Starts with insulin
recommended_name:insulin  # Recommended name only
alternative_name:insulin  # Alternative names only
```

### Genes
```
gene:BRCA1  # Gene symbol
gene_exact:BRCA1  # Exact gene match
olnName:BRCA1  # Ordered locus name
orfName:BRCA1  # ORF name
```

### Organisms
```
organism_name:human  # Common name
organism_name:"Homo sapiens"  # Scientific name
organism_id:9606  # Taxonomy ID
lineage:primates  # Taxonomic lineage
```

### Features
```
ft_signal:*  # Signal peptide
ft_transmem:*  # Transmembrane region
ft_domain:"Protein kinase"  # Specific domain
ft_binding:*  # Binding site
ft_site:*  # Any site
```

### Comments (cc_)
```
cc_function:*  # Function description
cc_catalytic_activity:*  # Catalytic activity
cc_pathway:*  # Pathway involvement
cc_interaction:*  # Protein interactions
cc_subcellular_location:*  # Subcellular location
cc_tissue_specificity:*  # Tissue specificity
cc_disease:cancer  # Disease association
```

## Tips and Best Practices

1. **Use quotes for exact phrases**: `organism_name:"Homo sapiens"` not `organism_name:Homo sapiens`, since without quotes only the first word is tied to the field

2. **Filter by review status**: Add `AND reviewed:true` for high-quality Swiss-Prot entries

3. **Combine wildcards carefully**: `*kinase*` may be too broad; `kinase*` is more specific

4. **Use parentheses for complex logic**: `(A OR B) AND (C OR D)` states the grouping explicitly, whereas the result of `A OR B AND C OR D` depends on operator precedence

5. **Numeric ranges are inclusive**: `length:[100 TO 500]` includes both 100 and 500

6. **Field prefixes**: Learn common prefixes:
   - `cc_` = Comments
   - `ft_` = Features
   - `go_` = Gene Ontology
   - `xref_` = Cross-references

7. **Check field names**: Use the API's `/configure/uniprotkb/search-fields` endpoint to see the available query fields, and `/configure/uniprotkb/result-fields` to see the fields that can be returned

## Query Validation

Test queries using:
- **Web interface**: https://www.uniprot.org/uniprotkb
- **API**: https://rest.uniprot.org/uniprotkb/search?query=YOUR_QUERY (URL-encode the query string)
- **Query field documentation**: https://www.uniprot.org/help/query-fields

## Common Patterns

### Find well-characterized proteins
```
reviewed:true AND xref:pdb AND cc_function:*
```

### Find disease-associated proteins
```
cc_disease:* AND organism_id:9606 AND reviewed:true
```

### Find proteins with experimental evidence
```
existence:"Evidence at protein level" AND reviewed:true
```

### Find secreted proteins
```
cc_subcellular_location:secreted AND reviewed:true
```

### Find drug targets
```
keyword:"Pharmaceutical" OR keyword:"Drug target"
```
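Patterns like these are convenient to keep as named constants and narrow further when needed. The sketch below stores two of them and adds an organism filter; `PATTERNS` and `restrict_to_organism` are hypothetical names. The pattern is wrapped in parentheses first, which matters for the drug-target query because it contains an OR.

```python
# Two of the patterns above, stored under illustrative names
PATTERNS = {
    "well_characterized": "reviewed:true AND xref:pdb AND cc_function:*",
    "drug_targets": 'keyword:"Pharmaceutical" OR keyword:"Drug target"',
}


def restrict_to_organism(query: str, taxon_id: int) -> str:
    """Parenthesise the existing query, then AND it with an organism filter."""
    return f"({query}) AND organism_id:{taxon_id}"


human_targets = restrict_to_organism(PATTERNS["drug_targets"], 9606)
print(human_targets)
# (keyword:"Pharmaceutical" OR keyword:"Drug target") AND organism_id:9606
```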

## Resources

- Full query field reference: https://www.uniprot.org/help/query-fields
- API query documentation: https://www.uniprot.org/help/api_queries
- Text search documentation: https://www.uniprot.org/help/text-search