276 lines
7.5 KiB
Markdown
276 lines
7.5 KiB
Markdown
# UniProt API Fields Reference
|
|
|
|
Complete list of available fields for customizing UniProt API queries. Use these fields with the `fields` parameter to retrieve only the data you need.
|
|
|
|
## Usage
|
|
|
|
Add fields parameter to your query:
|
|
```
|
|
https://rest.uniprot.org/uniprotkb/search?query=insulin&fields=accession,gene_names,organism_name,length
|
|
```
|
|
|
|
Multiple fields are comma-separated. No spaces after commas.
|
|
|
|
## Core Fields
|
|
|
|
### Identification
|
|
- `accession` - Primary accession number (e.g., P12345)
|
|
- `id` - Entry name (e.g., INSR_HUMAN)
|
|
- `uniprotkb_id` - Same as id
|
|
- `entryType` - REVIEWED (Swiss-Prot) or UNREVIEWED (TrEMBL)
|
|
|
|
### Protein Names
|
|
- `protein_name` - Recommended and alternative protein names
|
|
- `gene_names` - Gene name(s)
|
|
- `gene_primary` - Primary gene name
|
|
- `gene_synonym` - Gene synonyms
|
|
- `gene_oln` - Ordered locus names
|
|
- `gene_orf` - ORF names
|
|
|
|
### Organism Information
|
|
- `organism_name` - Organism scientific name
|
|
- `organism_id` - NCBI taxonomy identifier
|
|
- `lineage` - Taxonomic lineage
|
|
- `virus_hosts` - Virus host organisms (for viral proteins)
|
|
|
|
### Sequence Information
|
|
- `sequence` - Amino acid sequence
|
|
- `length` - Sequence length
|
|
- `mass` - Molecular mass (Daltons)
|
|
- `fragment` - Whether entry is a fragment
|
|
- `checksum` - Sequence CRC64 checksum
|
|
|
|
## Annotation Fields
|
|
|
|
### Function and Biology
|
|
- `cc_function` - Function description
|
|
- `cc_catalytic_activity` - Catalytic activity
|
|
- `cc_activity_regulation` - Activity regulation
|
|
- `cc_pathway` - Metabolic pathway information
|
|
- `cc_cofactor` - Cofactor information
|
|
|
|
### Interaction and Localization
|
|
- `cc_interaction` - Protein-protein interactions
|
|
- `cc_subunit` - Subunit structure
|
|
- `cc_subcellular_location` - Subcellular location
|
|
- `cc_tissue_specificity` - Tissue specificity
|
|
- `cc_developmental_stage` - Developmental stage expression
|
|
|
|
### Disease and Phenotype
|
|
- `cc_disease` - Disease associations
|
|
- `cc_disruption_phenotype` - Disruption phenotype
|
|
- `cc_allergen` - Allergen information
|
|
- `cc_toxic_dose` - Toxic dose information
|
|
|
|
### Post-translational Modifications
|
|
- `cc_ptm` - Post-translational modifications
|
|
- `cc_mass_spectrometry` - Mass spectrometry data
|
|
|
|
### Other Comments
|
|
- `cc_alternative_products` - Alternative products (isoforms)
|
|
- `cc_polymorphism` - Polymorphism information
|
|
- `cc_rna_editing` - RNA editing
|
|
- `cc_caution` - Caution notes
|
|
- `cc_miscellaneous` - Miscellaneous information
|
|
- `cc_similarity` - Sequence similarities
|
|
- `cc_sequence_caution` - Sequence caution
|
|
- `cc_web_resource` - Web resources
|
|
|
|
## Feature Fields (ft_)
|
|
|
|
### Molecular Processing
|
|
- `ft_signal` - Signal peptide
|
|
- `ft_transit` - Transit peptide
|
|
- `ft_init_met` - Initiator methionine
|
|
- `ft_propep` - Propeptide
|
|
- `ft_chain` - Chain (mature protein)
|
|
- `ft_peptide` - Peptide
|
|
|
|
### Regions and Sites
|
|
- `ft_domain` - Domain
|
|
- `ft_repeat` - Repeat
|
|
- `ft_ca_bind` - Calcium binding
|
|
- `ft_zn_fing` - Zinc finger
|
|
- `ft_dna_bind` - DNA binding
|
|
- `ft_np_bind` - Nucleotide binding
|
|
- `ft_region` - Region of interest
|
|
- `ft_coiled` - Coiled coil
|
|
- `ft_motif` - Short sequence motif
|
|
- `ft_compbias` - Compositional bias
|
|
|
|
### Sites and Modifications
|
|
- `ft_act_site` - Active site
|
|
- `ft_metal` - Metal binding
|
|
- `ft_binding` - Binding site
|
|
- `ft_site` - Site
|
|
- `ft_mod_res` - Modified residue
|
|
- `ft_lipid` - Lipidation
|
|
- `ft_carbohyd` - Glycosylation
|
|
- `ft_disulfid` - Disulfide bond
|
|
- `ft_crosslnk` - Cross-link
|
|
|
|
### Structural Features
|
|
- `ft_helix` - Helix
|
|
- `ft_strand` - Beta strand
|
|
- `ft_turn` - Turn
|
|
- `ft_transmem` - Transmembrane region
|
|
- `ft_intramem` - Intramembrane region
|
|
- `ft_topo_dom` - Topological domain
|
|
|
|
### Variation and Conflict
|
|
- `ft_variant` - Natural variant
|
|
- `ft_var_seq` - Alternative sequence
|
|
- `ft_mutagen` - Mutagenesis
|
|
- `ft_unsure` - Unsure residue
|
|
- `ft_conflict` - Sequence conflict
|
|
- `ft_non_cons` - Non-consecutive residues
|
|
- `ft_non_ter` - Non-terminal residue
|
|
- `ft_non_std` - Non-standard residue
|
|
|
|
## Gene Ontology (GO)
|
|
|
|
- `go` - All GO terms
|
|
- `go_p` - Biological process
|
|
- `go_c` - Cellular component
|
|
- `go_f` - Molecular function
|
|
- `go_id` - GO term identifiers
|
|
|
|
## Cross-References (xref_)
|
|
|
|
### Sequence Databases
|
|
- `xref_embl` - EMBL/GenBank/DDBJ
|
|
- `xref_refseq` - RefSeq
|
|
- `xref_ccds` - CCDS
|
|
- `xref_pir` - PIR
|
|
|
|
### 3D Structure Databases
|
|
- `xref_pdb` - Protein Data Bank
|
|
- `xref_pcddb` - PCD database
|
|
- `xref_alphafolddb` - AlphaFold database
|
|
- `xref_smr` - SWISS-MODEL Repository
|
|
|
|
### Protein Family/Domain Databases
|
|
- `xref_interpro` - InterPro
|
|
- `xref_pfam` - Pfam
|
|
- `xref_prosite` - PROSITE
|
|
- `xref_smart` - SMART
|
|
|
|
### Genome Databases
|
|
- `xref_ensembl` - Ensembl
|
|
- `xref_ensemblgenomes` - Ensembl Genomes
|
|
- `xref_geneid` - Entrez Gene
|
|
- `xref_kegg` - KEGG
|
|
|
|
### Organism-Specific Databases
|
|
- `xref_mgi` - MGI (mouse)
|
|
- `xref_rgd` - RGD (rat)
|
|
- `xref_flybase` - FlyBase (fly)
|
|
- `xref_wormbase` - WormBase (worm)
|
|
- `xref_xenbase` - Xenbase (frog)
|
|
- `xref_zfin` - ZFIN (zebrafish)
|
|
|
|
### Pathway Databases
|
|
- `xref_reactome` - Reactome
|
|
- `xref_signor` - SIGNOR
|
|
- `xref_signalink` - SignaLink
|
|
|
|
### Disease Databases
|
|
- `xref_disgenet` - DisGeNET
|
|
- `xref_malacards` - MalaCards
|
|
- `xref_omim` - OMIM
|
|
- `xref_orphanet` - Orphanet
|
|
|
|
### Drug Databases
|
|
- `xref_chembl` - ChEMBL
|
|
- `xref_drugbank` - DrugBank
|
|
- `xref_guidetopharmacology` - Guide to Pharmacology
|
|
|
|
### Expression Databases
|
|
- `xref_bgee` - Bgee
|
|
- `xref_expressionetatlas` - Expression Atlas
|
|
- `xref_genevisible` - Genevisible
|
|
|
|
## Metadata Fields
|
|
|
|
### Dates
|
|
- `date_created` - Entry creation date
|
|
- `date_modified` - Last modification date
|
|
- `date_sequence_modified` - Last sequence modification date
|
|
|
|
### Evidence and Quality
|
|
- `annotation_score` - Annotation score (1-5)
|
|
- `protein_existence` - Protein existence level
|
|
- `reviewed` - Whether entry is reviewed (Swiss-Prot)
|
|
|
|
### Literature
|
|
- `lit_pubmed_id` - PubMed identifiers
|
|
- `lit_doi` - DOI identifiers
|
|
|
|
### Proteomics
|
|
- `proteome` - Proteome identifier
|
|
- `tools` - Tools used for annotation
|
|
|
|
## Retrieving Available Fields Programmatically
|
|
|
|
Use the configuration endpoint to get all available fields:
|
|
```bash
|
|
curl https://rest.uniprot.org/configure/uniprotkb/result-fields
|
|
```
|
|
|
|
Or in Python:
|
|
```python
|
|
import requests
|
|
response = requests.get("https://rest.uniprot.org/configure/uniprotkb/result-fields")
|
|
fields = response.json()
|
|
```
|
|
|
|
## Common Field Combinations
|
|
|
|
### Basic protein information
|
|
```
|
|
fields=accession,id,protein_name,gene_names,organism_name,length
|
|
```
|
|
|
|
### Sequence and structure
|
|
```
|
|
fields=accession,sequence,length,mass,xref_pdb,xref_alphafolddb
|
|
```
|
|
|
|
### Functional annotation
|
|
```
|
|
fields=accession,protein_name,cc_function,cc_catalytic_activity,cc_pathway,go
|
|
```
|
|
|
|
### Disease information
|
|
```
|
|
fields=accession,protein_name,gene_names,cc_disease,xref_omim,xref_malacards
|
|
```
|
|
|
|
### Expression patterns
|
|
```
|
|
fields=accession,gene_names,cc_tissue_specificity,cc_developmental_stage,xref_bgee
|
|
```
|
|
|
|
### Complete annotation
|
|
```
|
|
fields=accession,id,protein_name,gene_names,organism_name,sequence,length,cc_*,ft_*,go,xref_pdb
|
|
```
|
|
|
|
## Notes
|
|
|
|
1. **Wildcards**: Some fields support wildcards (e.g., `cc_*` for all comment fields, `ft_*` for all features)
|
|
|
|
2. **Performance**: Requesting fewer fields improves response time and reduces bandwidth
|
|
|
|
3. **Format dependency**: Some fields may be formatted differently depending on output format (JSON vs TSV)
|
|
|
|
4. **Null values**: Fields without data may be omitted from response (JSON) or empty (TSV)
|
|
|
|
5. **Arrays vs strings**: In JSON format, many fields return arrays of objects rather than simple strings
|
|
|
|
## Resources
|
|
|
|
- Interactive field explorer: https://www.uniprot.org/api-documentation
|
|
- API fields endpoint: https://rest.uniprot.org/configure/uniprotkb/result-fields
|
|
- Return fields documentation: https://www.uniprot.org/help/return_fields
|