# STRING Database API Reference

## Overview

STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a comprehensive database of known and predicted protein-protein interactions integrating data from over 40 sources.

**Database Statistics (v12.0+):**
- Coverage: 5000+ genomes
- Proteins: ~59.3 million
- Interactions: 20+ billion
- Data types: Physical interactions, functional associations, co-expression, co-occurrence, text-mining, databases

**Core Data Resource:** Designated by Global Biodata Coalition and ELIXIR

## API Base URLs

- **Current version**: https://string-db.org/api
- **Version-specific**: https://version-12-0.string-db.org/api (for reproducibility)
- **API documentation**: https://string-db.org/help/api/

## Best Practices

1. **Identifier Mapping**: Always map identifiers first using `get_string_ids` for faster subsequent queries
2. **Use STRING IDs**: Prefer STRING identifiers (e.g., `9606.ENSP00000269305`) over gene names
3. **Specify Species**: For networks with >10 proteins, always specify species NCBI taxon ID
4. **Rate Limiting**: Wait 1 second between API calls to avoid server overload
5. **Versioned URLs**: Use version-specific URLs for reproducible research
6. **POST over GET**: Use POST requests for large protein lists
7. **Caller Identity**: Include `caller_identity` parameter for tracking (e.g., your application name)

## API Methods

### 1. Identifier Mapping (`get_string_ids`)

**Purpose**: Maps common protein names, gene symbols, UniProt IDs, and other identifiers to STRING identifiers.
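
As a rough illustration of the best practices above (POST for lists, a `caller_identity`, mapping identifiers first), a mapping request could be assembled as follows. This is a minimal sketch using only the standard library; the helper names `build_mapping_request` and `map_identifiers` and the default `caller_identity` value `my_app` are hypothetical, and the parameters are the ones listed for `get_string_ids` in this section.

```python
import urllib.parse
import urllib.request

# Base URL from "API Base URLs"; swap in a version-specific host for reproducibility.
STRING_API = "https://string-db.org/api"


def build_mapping_request(identifiers, species, caller_identity="my_app", limit=1):
    """Return (url, encoded_body) for a POST to get_string_ids.

    The function name and the default caller_identity are illustrative only.
    """
    params = {
        # Identifiers are joined with a carriage return, which urlencode
        # turns into %0D in the request body.
        "identifiers": "\r".join(identifiers),
        "species": species,
        "limit": limit,
        "echo_query": 1,
        "caller_identity": caller_identity,
    }
    url = f"{STRING_API}/tsv/get_string_ids"
    body = urllib.parse.urlencode(params).encode("ascii")
    return url, body


def map_identifiers(identifiers, species, **kwargs):
    """Send the mapping request and return the raw TSV text (network call)."""
    url, body = build_mapping_request(identifiers, species, **kwargs)
    request = urllib.request.Request(url, data=body)  # data= makes this a POST
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Keeping the request construction separate from the network call makes it possible to inspect or log the encoded body without contacting the server.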

**Endpoint**: `/api/tsv/get_string_ids`

**Parameters**:
- `identifiers` (required): Protein names/IDs separated by carriage returns (URL-encoded as `%0d`)
- `species` (required): NCBI taxon ID
- `limit`: Number of matches per identifier (default: 1)
- `echo_query`: Include query term in output (1 or 0)
- `caller_identity`: Application identifier

**Output Format**: TSV with columns:
- `queryItem`: Original query
- `queryIndex`: Query position
- `stringId`: STRING identifier
- `ncbiTaxonId`: Species taxon ID
- `taxonName`: Species name
- `preferredName`: Preferred gene name
- `annotation`: Protein description

**Example**:
```
identifiers=TP53%0dBRCA1&species=9606&limit=1
```

**Use cases**:
- Converting gene symbols to STRING IDs
- Validating protein identifiers
- Finding canonical protein names

### 2. Network Data (`network`)

**Purpose**: Retrieves protein-protein interaction network data in tabular format.

**Endpoint**: `/api/tsv/network`

**Parameters**:
- `identifiers` (required): Protein IDs separated by `%0d`
- `species`: NCBI taxon ID
- `required_score`: Confidence threshold 0-1000 (default: 400)
  - 150: low confidence
  - 400: medium confidence
  - 700: high confidence
  - 900: highest confidence
- `network_type`: `functional` (default) or `physical`
- `add_nodes`: Add N interacting proteins (0-10)
- `caller_identity`: Application identifier

**Output Format**: TSV with columns:
- `stringId_A`, `stringId_B`: Interacting proteins
- `preferredName_A`, `preferredName_B`: Gene names
- `ncbiTaxonId`: Species
- `score`: Combined interaction score (0-1000)
- `nscore`: Neighborhood score
- `fscore`: Fusion score
- `pscore`: Phylogenetic profile score
- `ascore`: Coexpression score
- `escore`: Experimental score
- `dscore`: Database score
- `tscore`: Text-mining score

**Network Types**:
- **Functional**: All interaction evidence types (recommended for most analyses)
- **Physical**: Only direct physical binding evidence

**Example**:
```
identifiers=9606.ENSP00000269305%0d9606.ENSP00000275493&required_score=700
```

### 3. Network Image (`image/network`)

**Purpose**: Generates a visual network representation as a PNG image.

**Endpoint**: `/api/image/network`

**Parameters**:
- `identifiers` (required): Protein IDs separated by `%0d`
- `species`: NCBI taxon ID
- `required_score`: Confidence threshold 0-1000
- `network_flavor`: Visualization style
  - `evidence`: Show evidence types as colored lines
  - `confidence`: Show confidence as line thickness
  - `actions`: Show activating/inhibiting interactions
- `add_nodes`: Add N interacting proteins (0-10)
- `caller_identity`: Application identifier

**Output**: PNG image (binary data)

**Image Specifications**:
- Format: PNG
- Size: Automatically scaled based on network size
- High-resolution option available (add `?highres=1`)

**Example**:
```
identifiers=TP53%0dMDM2&species=9606&network_flavor=evidence
```

### 4. Interaction Partners (`interaction_partners`)

**Purpose**: Retrieves all STRING interaction partners for given protein(s).

**Endpoint**: `/api/tsv/interaction_partners`

**Parameters**:
- `identifiers` (required): Protein IDs
- `species`: NCBI taxon ID
- `required_score`: Confidence threshold 0-1000
- `limit`: Maximum number of partners (default: 10)
- `caller_identity`: Application identifier

**Output Format**: TSV with same columns as `network` method

**Use cases**:
- Finding hub proteins
- Expanding networks
- Discovery of novel interactions

**Example**:
```
identifiers=TP53&species=9606&limit=20&required_score=700
```

### 5. Functional Enrichment (`enrichment`)

**Purpose**: Performs functional enrichment analysis for a set of proteins across multiple annotation databases.
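
Once the TSV response from the `enrichment` method is in hand, a common next step is to keep only terms below an FDR cutoff. The sketch below assumes the response carries a header row containing the `category`, `term`, `description`, and `fdr` columns; the function name `significant_terms` and the 0.05 default are illustrative choices, not part of the STRING API.

```python
import csv
import io


def significant_terms(tsv_text, max_fdr=0.05):
    """Return enrichment rows with fdr <= max_fdr, most significant first.

    tsv_text is the raw text of an enrichment response that includes a
    header row; only a few columns are kept to keep the result compact.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        fdr = float(row["fdr"])
        if fdr <= max_fdr:
            kept.append(
                {
                    "category": row["category"],
                    "term": row["term"],
                    "description": row["description"],
                    "fdr": fdr,
                }
            )
    # Sort ascending so the strongest enrichments come first.
    return sorted(kept, key=lambda item: item["fdr"])
```

Filtering on `fdr` rather than `p_value` is a deliberate choice here, since `fdr` is the multiple-testing-corrected value.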

**Endpoint**: `/api/tsv/enrichment`

**Parameters**:
- `identifiers` (required): List of protein IDs
- `species` (required): NCBI taxon ID
- `caller_identity`: Application identifier

**Enrichment Categories**:
- **Gene Ontology**: Biological Process, Molecular Function, Cellular Component
- **KEGG Pathways**: Metabolic and signaling pathways
- **Pfam**: Protein domains
- **InterPro**: Protein families and domains
- **SMART**: Domain architecture
- **UniProt Keywords**: Curated functional keywords

**Output Format**: TSV with columns:
- `category`: Annotation category
- `term`: Term ID
- `description`: Term description
- `number_of_genes`: Genes in input with this term
- `number_of_genes_in_background`: Total genes with this term
- `ncbiTaxonId`: Species
- `inputGenes`: Comma-separated gene list
- `preferredNames`: Comma-separated gene names
- `p_value`: Enrichment p-value (uncorrected)
- `fdr`: False discovery rate (corrected p-value)

**Statistical Method**: Fisher's exact test with Benjamini-Hochberg FDR correction

**Example**:
```
identifiers=TP53%0dMDM2%0dATM%0dCHEK2&species=9606
```

### 6. PPI Enrichment (`ppi_enrichment`)

**Purpose**: Tests if a network has significantly more interactions than expected by chance.

**Endpoint**: `/api/json/ppi_enrichment`

**Parameters**:
- `identifiers` (required): List of protein IDs
- `species`: NCBI taxon ID
- `required_score`: Confidence threshold
- `caller_identity`: Application identifier

**Output Format**: JSON with fields:
- `number_of_nodes`: Proteins in network
- `number_of_edges`: Interactions observed
- `expected_number_of_edges`: Expected interactions (random)
- `p_value`: Statistical significance

**Interpretation**:
- p-value < 0.05: Network is significantly enriched
- Low p-value indicates proteins form functional module

**Example**:
```
identifiers=TP53%0dMDM2%0dATM%0dCHEK2&species=9606
```

### 7. Homology Scores (`homology`)

**Purpose**: Retrieves protein similarity/homology scores.

**Endpoint**: `/api/tsv/homology`

**Parameters**:
- `identifiers` (required): Protein IDs
- `species`: NCBI taxon ID
- `caller_identity`: Application identifier

**Output Format**: TSV with homology scores between proteins

**Use cases**:
- Identifying protein families
- Paralog analysis
- Cross-species comparisons

### 8. Version Information (`version`)

**Purpose**: Returns current STRING database version.

**Endpoint**: `/api/tsv/version`

**Output**: Version string (e.g., "12.0")

## Common Species NCBI Taxon IDs

| Organism | Common Name | Taxon ID |
|----------|-------------|----------|
| Homo sapiens | Human | 9606 |
| Mus musculus | Mouse | 10090 |
| Rattus norvegicus | Rat | 10116 |
| Drosophila melanogaster | Fruit fly | 7227 |
| Caenorhabditis elegans | C. elegans | 6239 |
| Saccharomyces cerevisiae | Yeast | 4932 |
| Arabidopsis thaliana | Thale cress | 3702 |
| Escherichia coli K-12 | E. coli | 511145 |
| Danio rerio | Zebrafish | 7955 |
| Gallus gallus | Chicken | 9031 |

Full list: https://string-db.org/cgi/input?input_page_active_form=organisms

## STRING Identifier Format

STRING uses Ensembl protein IDs with taxon prefix:
- Format: `{taxonId}.{ensemblProteinId}`
- Example: `9606.ENSP00000269305` (human TP53)

**ID Components**:
- **Taxon ID**: NCBI taxonomy identifier
- **Protein ID**: Usually Ensembl protein ID (ENSP...)

## Interaction Confidence Scores

STRING provides combined confidence scores (0-1000) based on multiple evidence channels:

### Evidence Channels

1. **Neighborhood (nscore)**: Conserved genomic neighborhood
2. **Fusion (fscore)**: Gene fusion events across species
3. **Phylogenetic Profile (pscore)**: Co-occurrence across species
4. **Coexpression (ascore)**: RNA expression correlation
5. **Experimental (escore)**: Biochemical/genetic experiments
6. **Database (dscore)**: Curated pathway/complex databases
7. **Text-mining (tscore)**: Literature co-occurrence

### Recommended Thresholds

- **150**: Low confidence (exploratory analysis)
- **400**: Medium confidence (standard analysis)
- **700**: High confidence (conservative analysis)
- **900**: Highest confidence (very stringent)

## Output Formats

### Available Formats

1. **TSV**: Tab-separated values (default, best for data processing)
2. **JSON**: JavaScript Object Notation (structured data)
3. **XML**: Extensible Markup Language
4. **PSI-MI**: Proteomics Standards Initiative format
5. **PSI-MITAB**: Tab-delimited PSI-MI format
6. **PNG**: Image format (for network visualizations)
7. **SVG**: Scalable vector graphics (for network visualizations)

### Format Selection

Replace `/tsv/` in URL with desired format:
- `/json/network` - JSON format
- `/xml/network` - XML format
- `/image/network` - PNG image

## Error Handling

### HTTP Status Codes

- **200 OK**: Successful request
- **400 Bad Request**: Invalid parameters or syntax
- **404 Not Found**: Protein/species not found
- **500 Internal Server Error**: Server error

### Common Errors

1. **"No proteins found"**: Invalid identifiers or species mismatch
2. **"Species required"**: Missing species parameter for large networks
3. **Empty results**: No interactions above score threshold
4. **Timeout**: Network too large, reduce protein count

## Advanced Features

### Bulk Network Upload

For complete proteome analysis:
1. Navigate to https://string-db.org/
2. Select "Upload proteome" option
3. Upload FASTA file
4. STRING generates complete interaction network and predicts functions

### Values/Ranks Enrichment API

For differential expression/proteomics data:

1. **Get API Key**:
   ```
   /api/json/get_api_key
   ```
2. **Submit Data**: Tab-separated protein ID and value pairs
3. **Check Status**:
   ```
   /api/json/valuesranks_enrichment_status?job_id={id}
   ```
4. **Retrieve Results**: Access enrichment tables and figures

**Requirements**:
- Complete protein set (no filtering)
- Numeric values for each protein
- Proper species identifier

### Network Customization

**Network Size Control**:
- `add_nodes=N`: Adds N most connected proteins
- `limit`: Controls partner retrieval

**Confidence Filtering**:
- Adjust `required_score` based on analysis goals
- Higher scores = fewer false positives, more false negatives

**Network Type Selection**:
- `functional`: All evidence (recommended for pathway analysis)
- `physical`: Direct binding only (recommended for structural studies)

## Integration with Other Tools

### Python Libraries

**requests** (recommended):
```python
import requests

url = "https://string-db.org/api/tsv/network"
params = {"identifiers": "TP53", "species": 9606}
response = requests.get(url, params=params, timeout=30)
response.raise_for_status()  # fail loudly on 4xx/5xx responses
print(response.text)  # TSV text, one interaction per row
```

**urllib** (standard library):
```python
import urllib.request

url = "https://string-db.org/api/tsv/network?identifiers=TP53&species=9606"
with urllib.request.urlopen(url, timeout=30) as response:
    tsv_text = response.read().decode("utf-8")  # TSV text, one interaction per row
```

### R Integration

**STRINGdb Bioconductor package**:
```R
library(STRINGdb)
string_db <- STRINGdb$new(version="12", species=9606)
```

### Cytoscape

STRING networks can be imported into Cytoscape for visualization and analysis:
1. Use stringApp plugin
2. Import TSV network data
3. Apply layouts and styling

## Data License

STRING data is freely available under **Creative Commons BY 4.0** license:
- ✓ Free to use for academic and commercial purposes
- ✓ Attribution required
- ✓ Modifications allowed
- ✓ Redistribution allowed

**Citation**: Szklarczyk et al. (latest publication)

## Rate Limits and Usage

- **Rate limiting**: No strict limit, but avoid rapid-fire requests
- **Recommendation**: Wait 1 second between calls
- **Large datasets**: Use bulk download from https://string-db.org/cgi/download
- **Proteome-scale**: Use web upload feature instead of API

## Related Resources

- **STRING website**: https://string-db.org
- **Download page**: https://string-db.org/cgi/download
- **Help center**: https://string-db.org/help/
- **API documentation**: https://string-db.org/help/api/
- **Publications**: https://string-db.org/cgi/about

## Troubleshooting

**No results returned**:
- Verify species parameter matches identifiers
- Check identifier format
- Lower confidence threshold
- Use identifier mapping first

**Timeout errors**:
- Reduce number of input proteins
- Split large queries into batches
- Use bulk download for proteome-scale analyses

**Version inconsistencies**:
- Use version-specific URLs
- Check STRING version with `/version` endpoint
- Update identifiers if using old IDs
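
The advice to split large queries into batches and to pause about one second between calls can be sketched as follows. The helper names `chunk_identifiers` and `run_in_batches` and the default batch size of 100 are hypothetical; `fetch` stands for any caller-supplied function that sends one batch to STRING and returns its result.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunk_identifiers(identifiers: Sequence[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size identifiers."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(identifiers), batch_size):
        yield list(identifiers[start:start + batch_size])


def run_in_batches(
    identifiers: Sequence[str],
    fetch: Callable[[List[str]], object],
    batch_size: int = 100,
    delay_seconds: float = 1.0,
) -> list:
    """Call fetch once per batch, sleeping between consecutive calls.

    The 1-second default follows the rate-limit recommendation in this
    document; the batch size is an arbitrary illustrative value.
    """
    results = []
    for index, batch in enumerate(chunk_identifiers(identifiers, batch_size)):
        if index > 0:
            time.sleep(delay_seconds)  # pause only between calls, not before the first
        results.append(fetch(batch))
    return results
```

Passing the request function in as `fetch` keeps the pacing logic independent of which endpoint is being queried, so the same loop can drive `network`, `enrichment`, or `get_string_ids` calls.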