12 KiB
AlphaFold Database API Reference
This document provides comprehensive technical documentation for programmatic access to the AlphaFold Protein Structure Database.
Table of Contents
- REST API Endpoints
- File Access Patterns
- Data Schemas
- Google Cloud Access
- BigQuery Schema
- Best Practices
- Error Handling
- Rate Limiting
REST API Endpoints
Base URL
https://alphafold.ebi.ac.uk/api/
1. Get Prediction by UniProt Accession
Endpoint: /prediction/{uniprot_id}
Method: GET
Description: Retrieve AlphaFold prediction metadata for a given UniProt accession.
Parameters:
uniprot_id(required): UniProt accession (e.g., "P00520")
Example Request:
curl https://alphafold.ebi.ac.uk/api/prediction/P00520
Example Response:
[
{
"entryId": "AF-P00520-F1",
"gene": "ABL1",
"uniprotAccession": "P00520",
"uniprotId": "ABL1_HUMAN",
"uniprotDescription": "Tyrosine-protein kinase ABL1",
"taxId": 9606,
"organismScientificName": "Homo sapiens",
"uniprotStart": 1,
"uniprotEnd": 1130,
"uniprotSequence": "MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPSNYITPVNSLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTASDGKLYVSSESRFNTLAELVHHHSTVADGLITTLHYPAPKRNKPTVYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHRDLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEIATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQESSISDEVEKELGKQGVRGAVSTLLQAPELPTKTRTSRRAAEHRDTTDVPEMPHSKGQGESDPLDHEPAVSPLLPRKERGPPEGGLNEDERLLPKDKKTNLFSALIKKKKKTAPTPPKRSSSFREMDGQPERRGAGEEEGRDISNGALAFTPLDTADPAKSPKPSNGAGVPNGALRESGGSGFRSPHLWKKSSTLTSSRLATGEEEGGGSSSKRFLRSCSASCVPHGAKDTEWRSVTLPRDLQSTGRQFDSSTFGGHKSEKPALPRKRAGENRSDQVTRGTVTPPPRLVKKNEEAADEVFKDIMESSPGSSPPNLTPKPLRRQVTVAPASGLPHKEEAGKGSALGTPAAAEPVTPTSKAGSGAPGGTSKGPAEESRVRRHKHSSESPGRDKGKLSRLKPAPPPPPAASAGKAGGKPSQSPSQEAAGEAVLGAKTKATSLVDAVNSDAAKPSQPGEGLKKPVLPATPKPQSAKPSGTPISPAPVPSTLPSASSALAGDQPSSTAFIPLISTRVSLRKTRQPPERIASGAITKGVVLDSTEALCLAISRNSEQMASHSAVLEAGKNLYTFCVSYVDSIQQMRNKFAFREAINKLENNLRELQICPATAGSGPAATQDFSKLLSSVKEISDIVQR",
"modelCreatedDate": "2021-07-01",
"latestVersion": 4,
"allVersions": [1, 2, 3, 4],
"cifUrl": "https://alphafold.ebi.ac.uk/files/AF-P00520-F1-model_v4.cif",
"bcifUrl": "https://alphafold.ebi.ac.uk/files/AF-P00520-F1-model_v4.bcif",
"pdbUrl": "https://alphafold.ebi.ac.uk/files/AF-P00520-F1-model_v4.pdb",
"paeImageUrl": "https://alphafold.ebi.ac.uk/files/AF-P00520-F1-predicted_aligned_error_v4.png",
"paeDocUrl": "https://alphafold.ebi.ac.uk/files/AF-P00520-F1-predicted_aligned_error_v4.json"
}
]
Response Fields:
entryId: AlphaFold internal identifier (format: AF-{uniprot}-F{fragment})gene: Gene symboluniprotAccession: UniProt accessionuniprotId: UniProt entry nameuniprotDescription: Protein descriptiontaxId: NCBI taxonomy identifierorganismScientificName: Species scientific nameuniprotStart/uniprotEnd: Residue range covereduniprotSequence: Full protein sequencemodelCreatedDate: Initial prediction datelatestVersion: Current model version numberallVersions: List of available versionscifUrl/bcifUrl/pdbUrl: Structure file download URLspaeImageUrl: PAE visualization image URLpaeDocUrl: PAE data JSON URL
2. 3D-Beacons Integration
AlphaFold is integrated into the 3D-Beacons network for federated structure access.
Endpoint: https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json
Example:
import requests
uniprot_id = "P00520"
url = f"https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
response = requests.get(url)
data = response.json()
# Filter for AlphaFold structures
alphafold_structures = [
s for s in data['structures']
if s['provider'] == 'AlphaFold DB'
]
File Access Patterns
Direct File Downloads
All AlphaFold files are accessible via direct URLs without authentication.
URL Pattern:
https://alphafold.ebi.ac.uk/files/{alphafold_id}-{file_type}_{version}.{extension}
Components:
{alphafold_id}: Entry identifier (e.g., "AF-P00520-F1"){file_type}: Type of file (see below){version}: Database version (e.g., "v4"){extension}: File format extension
Available File Types
1. Model Coordinates
mmCIF Format (Recommended):
https://alphafold.ebi.ac.uk/files/AF-P00520-F1-model_v4.cif
- Standard crystallographic format
- Contains full metadata
- Supports large structures
- File size: Variable (100KB - 10MB typical)
Binary CIF Format:
https://alphafold.ebi.ac.uk/files/AF-P00520-F1-model_v4.bcif
- Compressed binary version of mmCIF
- Smaller file size (~70% reduction)
- Faster parsing
- Requires specialized parser
PDB Format (Legacy):
https://alphafold.ebi.ac.uk/files/AF-P00520-F1-model_v4.pdb
- Traditional PDB text format
- Limited to 99,999 atoms
- Widely supported by older tools
- File size: Similar to mmCIF
2. Confidence Metrics
Per-Residue Confidence (JSON):
https://alphafold.ebi.ac.uk/files/AF-P00520-F1-confidence_v4.json
Structure:
{
"confidenceScore": [87.5, 91.2, 93.8, ...],
"confidenceCategory": ["high", "very_high", "very_high", ...]
}
Fields:
confidenceScore: Array of pLDDT values (0-100) for each residueconfidenceCategory: Categorical classification (very_low, low, high, very_high)
3. Predicted Aligned Error (JSON)
https://alphafold.ebi.ac.uk/files/AF-P00520-F1-predicted_aligned_error_v4.json
Structure:
{
"distance": [[0, 2.3, 4.5, ...], [2.3, 0, 3.1, ...], ...],
"max_predicted_aligned_error": 31.75
}
Fields:
distance: N×N matrix of PAE values in Ångströmsmax_predicted_aligned_error: Maximum PAE value in the matrix
4. PAE Visualization (PNG)
https://alphafold.ebi.ac.uk/files/AF-P00520-F1-predicted_aligned_error_v4.png
- Pre-rendered PAE heatmap
- Useful for quick visual assessment
- Resolution: Variable based on protein size
Batch Download Strategy
For downloading multiple files efficiently, use concurrent downloads with proper error handling and rate limiting to respect server resources.
Data Schemas
Coordinate File (mmCIF) Schema
AlphaFold mmCIF files contain:
Key Data Categories:
_entry: Entry-level metadata_struct: Structure title and description_entity: Molecular entity information_atom_site: Atomic coordinates and properties_pdbx_struct_assembly: Biological assembly info
Important Fields in _atom_site:
group_PDB: "ATOM" for all recordsid: Atom serial numberlabel_atom_id: Atom name (e.g., "CA", "N", "C")label_comp_id: Residue name (e.g., "ALA", "GLY")label_seq_id: Residue sequence numberCartn_x/y/z: Cartesian coordinates (Ångströms)B_iso_or_equiv: B-factor (contains pLDDT score)
pLDDT in B-factor Column: AlphaFold stores per-residue confidence (pLDDT) in the B-factor field. This allows standard structure viewers to color by confidence automatically.
Confidence JSON Schema
{
"confidenceScore": [
87.5, // Residue 1 pLDDT
91.2, // Residue 2 pLDDT
93.8 // Residue 3 pLDDT
// ... one value per residue
],
"confidenceCategory": [
"high", // Residue 1 category
"very_high", // Residue 2 category
"very_high" // Residue 3 category
// ... one category per residue
]
}
Confidence Categories:
very_high: pLDDT > 90high: 70 < pLDDT ≤ 90low: 50 < pLDDT ≤ 70very_low: pLDDT ≤ 50
PAE JSON Schema
{
"distance": [
[0.0, 2.3, 4.5, ...], // PAE from residue 1 to all residues
[2.3, 0.0, 3.1, ...], // PAE from residue 2 to all residues
[4.5, 3.1, 0.0, ...] // PAE from residue 3 to all residues
// ... N×N matrix for N residues
],
"max_predicted_aligned_error": 31.75
}
Interpretation:
distance[i][j]: Expected position error (Ångströms) of residue j if the predicted and true structures were aligned on residue i- Lower values indicate more confident relative positioning
- Diagonal is always 0 (residue aligned to itself)
- Matrix is not symmetric: distance[i][j] ≠ distance[j][i]
Google Cloud Access
AlphaFold DB is hosted on Google Cloud Platform for bulk access.
Cloud Storage Bucket
Bucket: gs://public-datasets-deepmind-alphafold-v4
Directory Structure:
gs://public-datasets-deepmind-alphafold-v4/
├── accession_ids.csv # Index of all entries (13.5 GB)
├── sequences.fasta # All protein sequences (16.5 GB)
└── proteomes/ # Grouped by species (1M+ archives)
Installing gsutil
# Using pip
pip install gsutil
# Or install Google Cloud SDK
curl https://sdk.cloud.google.com | bash
Downloading Proteomes
By Taxonomy ID:
# Download all archives for a species
TAX_ID=9606 # Human
gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-${TAX_ID}-*_v4.tar .
BigQuery Schema
AlphaFold metadata is available in BigQuery for SQL-based queries.
Dataset: bigquery-public-data.deepmind_alphafold
Table: metadata
Key Fields
| Field | Type | Description |
|---|---|---|
entryId |
STRING | AlphaFold entry ID |
uniprotAccession |
STRING | UniProt accession |
gene |
STRING | Gene symbol |
organismScientificName |
STRING | Species scientific name |
taxId |
INTEGER | NCBI taxonomy ID |
globalMetricValue |
FLOAT | Overall quality metric |
fractionPlddtVeryHigh |
FLOAT | Fraction with pLDDT ≥ 90 |
isReviewed |
BOOLEAN | Swiss-Prot reviewed status |
sequenceLength |
INTEGER | Protein sequence length |
Example Query
SELECT
entryId,
uniprotAccession,
gene,
fractionPlddtVeryHigh
FROM `bigquery-public-data.deepmind_alphafold.metadata`
WHERE
taxId = 9606 -- Homo sapiens
AND fractionPlddtVeryHigh > 0.8
AND isReviewed = TRUE
ORDER BY fractionPlddtVeryHigh DESC
LIMIT 100;
Best Practices
1. Caching Strategy
Always cache downloaded files locally to avoid repeated downloads.
2. Error Handling
Implement robust error handling for API requests with retry logic for transient failures.
3. Bulk Processing
For processing many proteins, use concurrent downloads with appropriate rate limiting.
4. Version Management
Always specify and track database versions in your code (current: v4).
Error Handling
Common HTTP Status Codes
| Code | Meaning | Action |
|---|---|---|
| 200 | Success | Process response normally |
| 404 | Not Found | No AlphaFold prediction for this UniProt ID |
| 429 | Too Many Requests | Implement rate limiting and retry with backoff |
| 500 | Server Error | Retry with exponential backoff |
| 503 | Service Unavailable | Wait and retry later |
Rate Limiting
Recommendations
- Limit to 10 concurrent requests maximum
- Add 100-200ms delay between sequential requests
- Use Google Cloud for bulk downloads instead of REST API
- Cache all downloaded data locally
Additional Resources
- AlphaFold GitHub: https://github.com/google-deepmind/alphafold
- Google Cloud Documentation: https://cloud.google.com/datasets/alphafold
- 3D-Beacons Documentation: https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/docs
- Biopython Tutorial: https://biopython.org/wiki/AlphaFold
Version History
- v1 (2021): Initial release with ~350K structures
- v2 (2022): Expanded to 200M+ structures
- v3 (2023): Updated models and expanded coverage
- v4 (2024): Current version with improved confidence metrics
Citation
When using AlphaFold DB in publications, cite:
- Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
- Varadi, M. et al. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 52, D368–D375 (2024).