Initial commit
This commit is contained in:
455
skills/string-database/references/string_reference.md
Normal file
455
skills/string-database/references/string_reference.md
Normal file
@@ -0,0 +1,455 @@
|
||||
# STRING Database API Reference
|
||||
|
||||
## Overview
|
||||
|
||||
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a comprehensive database of known and predicted protein-protein interactions integrating data from over 40 sources.
|
||||
|
||||
**Database Statistics (v12.0+):**
|
||||
- Coverage: 5000+ genomes
|
||||
- Proteins: ~59.3 million
|
||||
- Interactions: 20+ billion
|
||||
- Data types: Physical interactions, functional associations, co-expression, co-occurrence, text-mining, databases
|
||||
|
||||
**Core Data Resource:** Designated by Global Biodata Coalition and ELIXIR
|
||||
|
||||
## API Base URLs
|
||||
|
||||
- **Current version**: https://string-db.org/api
|
||||
- **Version-specific**: https://version-12-0.string-db.org/api (for reproducibility)
|
||||
- **API documentation**: https://string-db.org/help/api/
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Identifier Mapping**: Always map identifiers first using `get_string_ids` for faster subsequent queries
|
||||
2. **Use STRING IDs**: Prefer STRING identifiers (e.g., `9606.ENSP00000269305`) over gene names
|
||||
3. **Specify Species**: For networks with >10 proteins, always specify species NCBI taxon ID
|
||||
4. **Rate Limiting**: Wait 1 second between API calls to avoid server overload
|
||||
5. **Versioned URLs**: Use version-specific URLs for reproducible research
|
||||
6. **POST over GET**: Use POST requests for large protein lists
|
||||
7. **Caller Identity**: Include `caller_identity` parameter for tracking (e.g., your application name)
|
||||
|
||||
## API Methods
|
||||
|
||||
### 1. Identifier Mapping (`get_string_ids`)
|
||||
|
||||
**Purpose**: Maps common protein names, gene symbols, UniProt IDs, and other identifiers to STRING identifiers.
|
||||
|
||||
**Endpoint**: `/api/tsv/get_string_ids`
|
||||
|
||||
**Parameters**:
|
||||
- `identifiers` (required): Protein names/IDs separated by newlines (`%0d`)
|
||||
- `species` (required): NCBI taxon ID
|
||||
- `limit`: Number of matches per identifier (default: 1)
|
||||
- `echo_query`: Include query term in output (1 or 0)
|
||||
- `caller_identity`: Application identifier
|
||||
|
||||
**Output Format**: TSV with columns:
|
||||
- `queryItem`: Original query
|
||||
- `queryIndex`: Query position
|
||||
- `stringId`: STRING identifier
|
||||
- `ncbiTaxonId`: Species taxon ID
|
||||
- `taxonName`: Species name
|
||||
- `preferredName`: Preferred gene name
|
||||
- `annotation`: Protein description
|
||||
|
||||
**Example**:
|
||||
```
|
||||
identifiers=TP53%0dBRCA1&species=9606&limit=1
|
||||
```
|
||||
|
||||
**Use cases**:
|
||||
- Converting gene symbols to STRING IDs
|
||||
- Validating protein identifiers
|
||||
- Finding canonical protein names
|
||||
|
||||
### 2. Network Data (`network`)
|
||||
|
||||
**Purpose**: Retrieves protein-protein interaction network data in tabular format.
|
||||
|
||||
**Endpoint**: `/api/tsv/network`
|
||||
|
||||
**Parameters**:
|
||||
- `identifiers` (required): Protein IDs separated by `%0d`
|
||||
- `species`: NCBI taxon ID
|
||||
- `required_score`: Confidence threshold 0-1000 (default: 400)
|
||||
- 150: low confidence
|
||||
- 400: medium confidence
|
||||
- 700: high confidence
|
||||
- 900: highest confidence
|
||||
- `network_type`: `functional` (default) or `physical`
|
||||
- `add_nodes`: Add N interacting proteins (0-10)
|
||||
- `caller_identity`: Application identifier
|
||||
|
||||
**Output Format**: TSV with columns:
|
||||
- `stringId_A`, `stringId_B`: Interacting proteins
|
||||
- `preferredName_A`, `preferredName_B`: Gene names
|
||||
- `ncbiTaxonId`: Species
|
||||
- `score`: Combined interaction score (0-1000)
|
||||
- `nscore`: Neighborhood score
|
||||
- `fscore`: Fusion score
|
||||
- `pscore`: Phylogenetic profile score
|
||||
- `ascore`: Coexpression score
|
||||
- `escore`: Experimental score
|
||||
- `dscore`: Database score
|
||||
- `tscore`: Text-mining score
|
||||
|
||||
**Network Types**:
|
||||
- **Functional**: All interaction evidence types (recommended for most analyses)
|
||||
- **Physical**: Only direct physical binding evidence
|
||||
|
||||
**Example**:
|
||||
```
|
||||
identifiers=9606.ENSP00000269305%0d9606.ENSP00000275493&required_score=700
|
||||
```
|
||||
|
||||
### 3. Network Image (`image/network`)
|
||||
|
||||
**Purpose**: Generates visual network representation as PNG image.
|
||||
|
||||
**Endpoint**: `/api/image/network`
|
||||
|
||||
**Parameters**:
|
||||
- `identifiers` (required): Protein IDs separated by `%0d`
|
||||
- `species`: NCBI taxon ID
|
||||
- `required_score`: Confidence threshold 0-1000
|
||||
- `network_flavor`: Visualization style
|
||||
- `evidence`: Show evidence types as colored lines
|
||||
- `confidence`: Show confidence as line thickness
|
||||
- `actions`: Show activating/inhibiting interactions
|
||||
- `add_nodes`: Add N interacting proteins (0-10)
|
||||
- `caller_identity`: Application identifier
|
||||
|
||||
**Output**: PNG image (binary data)
|
||||
|
||||
**Image Specifications**:
|
||||
- Format: PNG
|
||||
- Size: Automatically scaled based on network size
|
||||
- High-resolution option available (add `?highres=1`)
|
||||
|
||||
**Example**:
|
||||
```
|
||||
identifiers=TP53%0dMDM2&species=9606&network_flavor=evidence
|
||||
```
|
||||
|
||||
### 4. Interaction Partners (`interaction_partners`)
|
||||
|
||||
**Purpose**: Retrieves all STRING interaction partners for given protein(s).
|
||||
|
||||
**Endpoint**: `/api/tsv/interaction_partners`
|
||||
|
||||
**Parameters**:
|
||||
- `identifiers` (required): Protein IDs
|
||||
- `species`: NCBI taxon ID
|
||||
- `required_score`: Confidence threshold 0-1000
|
||||
- `limit`: Maximum number of partners (default: 10)
|
||||
- `caller_identity`: Application identifier
|
||||
|
||||
**Output Format**: TSV with same columns as `network` method
|
||||
|
||||
**Use cases**:
|
||||
- Finding hub proteins
|
||||
- Expanding networks
|
||||
- Discovery of novel interactions
|
||||
|
||||
**Example**:
|
||||
```
|
||||
identifiers=TP53&species=9606&limit=20&required_score=700
|
||||
```
|
||||
|
||||
### 5. Functional Enrichment (`enrichment`)
|
||||
|
||||
**Purpose**: Performs functional enrichment analysis for a set of proteins across multiple annotation databases.
|
||||
|
||||
**Endpoint**: `/api/tsv/enrichment`
|
||||
|
||||
**Parameters**:
|
||||
- `identifiers` (required): List of protein IDs
|
||||
- `species` (required): NCBI taxon ID
|
||||
- `caller_identity`: Application identifier
|
||||
|
||||
**Enrichment Categories**:
|
||||
- **Gene Ontology**: Biological Process, Molecular Function, Cellular Component
|
||||
- **KEGG Pathways**: Metabolic and signaling pathways
|
||||
- **Pfam**: Protein domains
|
||||
- **InterPro**: Protein families and domains
|
||||
- **SMART**: Domain architecture
|
||||
- **UniProt Keywords**: Curated functional keywords
|
||||
|
||||
**Output Format**: TSV with columns:
|
||||
- `category`: Annotation category
|
||||
- `term`: Term ID
|
||||
- `description`: Term description
|
||||
- `number_of_genes`: Genes in input with this term
|
||||
- `number_of_genes_in_background`: Total genes with this term
|
||||
- `ncbiTaxonId`: Species
|
||||
- `inputGenes`: Comma-separated gene list
|
||||
- `preferredNames`: Comma-separated gene names
|
||||
- `p_value`: Enrichment p-value (uncorrected)
|
||||
- `fdr`: False discovery rate (corrected p-value)
|
||||
|
||||
**Statistical Method**: Fisher's exact test with Benjamini-Hochberg FDR correction
|
||||
|
||||
**Example**:
|
||||
```
|
||||
identifiers=TP53%0dMDM2%0dATM%0dCHEK2&species=9606
|
||||
```
|
||||
|
||||
### 6. PPI Enrichment (`ppi_enrichment`)
|
||||
|
||||
**Purpose**: Tests if a network has significantly more interactions than expected by chance.
|
||||
|
||||
**Endpoint**: `/api/json/ppi_enrichment`
|
||||
|
||||
**Parameters**:
|
||||
- `identifiers` (required): List of protein IDs
|
||||
- `species`: NCBI taxon ID
|
||||
- `required_score`: Confidence threshold
|
||||
- `caller_identity`: Application identifier
|
||||
|
||||
**Output Format**: JSON with fields:
|
||||
- `number_of_nodes`: Proteins in network
|
||||
- `number_of_edges`: Interactions observed
|
||||
- `expected_number_of_edges`: Expected interactions (random)
|
||||
- `p_value`: Statistical significance
|
||||
|
||||
**Interpretation**:
|
||||
- p-value < 0.05: Network is significantly enriched
|
||||
- Low p-value indicates proteins form functional module
|
||||
|
||||
**Example**:
|
||||
```
|
||||
identifiers=TP53%0dMDM2%0dATM%0dCHEK2&species=9606
|
||||
```
|
||||
|
||||
### 7. Homology Scores (`homology`)
|
||||
|
||||
**Purpose**: Retrieves protein similarity/homology scores.
|
||||
|
||||
**Endpoint**: `/api/tsv/homology`
|
||||
|
||||
**Parameters**:
|
||||
- `identifiers` (required): Protein IDs
|
||||
- `species`: NCBI taxon ID
|
||||
- `caller_identity`: Application identifier
|
||||
|
||||
**Output Format**: TSV with homology scores between proteins
|
||||
|
||||
**Use cases**:
|
||||
- Identifying protein families
|
||||
- Paralog analysis
|
||||
- Cross-species comparisons
|
||||
|
||||
### 8. Version Information (`version`)
|
||||
|
||||
**Purpose**: Returns current STRING database version.
|
||||
|
||||
**Endpoint**: `/api/tsv/version`
|
||||
|
||||
**Output**: Version string (e.g., "12.0")
|
||||
|
||||
## Common Species NCBI Taxon IDs
|
||||
|
||||
| Organism | Common Name | Taxon ID |
|
||||
|----------|-------------|----------|
|
||||
| Homo sapiens | Human | 9606 |
|
||||
| Mus musculus | Mouse | 10090 |
|
||||
| Rattus norvegicus | Rat | 10116 |
|
||||
| Drosophila melanogaster | Fruit fly | 7227 |
|
||||
| Caenorhabditis elegans | C. elegans | 6239 |
|
||||
| Saccharomyces cerevisiae | Yeast | 4932 |
|
||||
| Arabidopsis thaliana | Thale cress | 3702 |
|
||||
| Escherichia coli K-12 | E. coli | 511145 |
|
||||
| Danio rerio | Zebrafish | 7955 |
|
||||
| Gallus gallus | Chicken | 9031 |
|
||||
|
||||
Full list: https://string-db.org/cgi/input?input_page_active_form=organisms
|
||||
|
||||
## STRING Identifier Format
|
||||
|
||||
STRING uses Ensembl protein IDs with taxon prefix:
|
||||
- Format: `{taxonId}.{ensemblProteinId}`
|
||||
- Example: `9606.ENSP00000269305` (human TP53)
|
||||
|
||||
**ID Components**:
|
||||
- **Taxon ID**: NCBI taxonomy identifier
|
||||
- **Protein ID**: Usually Ensembl protein ID (ENSP...)
|
||||
|
||||
## Interaction Confidence Scores
|
||||
|
||||
STRING provides combined confidence scores (0-1000) based on multiple evidence channels:
|
||||
|
||||
### Evidence Channels
|
||||
|
||||
1. **Neighborhood (nscore)**: Gene fusion and conserved genomic neighborhood
|
||||
2. **Fusion (fscore)**: Gene fusion events across species
|
||||
3. **Phylogenetic Profile (pscore)**: Co-occurrence across species
|
||||
4. **Coexpression (ascore)**: RNA expression correlation
|
||||
5. **Experimental (escore)**: Biochemical/genetic experiments
|
||||
6. **Database (dscore)**: Curated pathway/complex databases
|
||||
7. **Text-mining (tscore)**: Literature co-occurrence
|
||||
|
||||
### Recommended Thresholds
|
||||
|
||||
- **150**: Low confidence (exploratory analysis)
|
||||
- **400**: Medium confidence (standard analysis)
|
||||
- **700**: High confidence (conservative analysis)
|
||||
- **900**: Highest confidence (very stringent)
|
||||
|
||||
## Output Formats
|
||||
|
||||
### Available Formats
|
||||
|
||||
1. **TSV**: Tab-separated values (default, best for data processing)
|
||||
2. **JSON**: JavaScript Object Notation (structured data)
|
||||
3. **XML**: Extensible Markup Language
|
||||
4. **PSI-MI**: Proteomics Standards Initiative format
|
||||
5. **PSI-MITAB**: Tab-delimited PSI-MI format
|
||||
6. **PNG**: Image format (for network visualizations)
|
||||
7. **SVG**: Scalable vector graphics (for network visualizations)
|
||||
|
||||
### Format Selection
|
||||
|
||||
Replace `/tsv/` in URL with desired format:
|
||||
- `/json/network` - JSON format
|
||||
- `/xml/network` - XML format
|
||||
- `/image/network` - PNG image
|
||||
|
||||
## Error Handling
|
||||
|
||||
### HTTP Status Codes
|
||||
|
||||
- **200 OK**: Successful request
|
||||
- **400 Bad Request**: Invalid parameters or syntax
|
||||
- **404 Not Found**: Protein/species not found
|
||||
- **500 Internal Server Error**: Server error
|
||||
|
||||
### Common Errors
|
||||
|
||||
1. **"No proteins found"**: Invalid identifiers or species mismatch
|
||||
2. **"Species required"**: Missing species parameter for large networks
|
||||
3. **Empty results**: No interactions above score threshold
|
||||
4. **Timeout**: Network too large, reduce protein count
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### Bulk Network Upload
|
||||
|
||||
For complete proteome analysis:
|
||||
1. Navigate to https://string-db.org/
|
||||
2. Select "Upload proteome" option
|
||||
3. Upload FASTA file
|
||||
4. STRING generates complete interaction network and predicts functions
|
||||
|
||||
### Values/Ranks Enrichment API
|
||||
|
||||
For differential expression/proteomics data:
|
||||
|
||||
1. **Get API Key**:
|
||||
```
|
||||
/api/json/get_api_key
|
||||
```
|
||||
|
||||
2. **Submit Data**: Tab-separated protein ID and value pairs
|
||||
|
||||
3. **Check Status**:
|
||||
```
|
||||
/api/json/valuesranks_enrichment_status?job_id={id}
|
||||
```
|
||||
|
||||
4. **Retrieve Results**: Access enrichment tables and figures
|
||||
|
||||
**Requirements**:
|
||||
- Complete protein set (no filtering)
|
||||
- Numeric values for each protein
|
||||
- Proper species identifier
|
||||
|
||||
### Network Customization
|
||||
|
||||
**Network Size Control**:
|
||||
- `add_nodes=N`: Adds N most connected proteins
|
||||
- `limit`: Controls partner retrieval
|
||||
|
||||
**Confidence Filtering**:
|
||||
- Adjust `required_score` based on analysis goals
|
||||
- Higher scores = fewer false positives, more false negatives
|
||||
|
||||
**Network Type Selection**:
|
||||
- `functional`: All evidence (recommended for pathway analysis)
|
||||
- `physical`: Direct binding only (recommended for structural studies)
|
||||
|
||||
## Integration with Other Tools
|
||||
|
||||
### Python Libraries
|
||||
|
||||
**requests** (recommended):
|
||||
```python
|
||||
import requests
|
||||
url = "https://string-db.org/api/tsv/network"
|
||||
params = {"identifiers": "TP53", "species": 9606}
|
||||
response = requests.get(url, params=params)
|
||||
```
|
||||
|
||||
**urllib** (standard library):
|
||||
```python
|
||||
import urllib.request
|
||||
url = "https://string-db.org/api/tsv/network?identifiers=TP53&species=9606"
|
||||
response = urllib.request.urlopen(url)
|
||||
```
|
||||
|
||||
### R Integration
|
||||
|
||||
**STRINGdb Bioconductor package**:
|
||||
```R
|
||||
library(STRINGdb)
|
||||
string_db <- STRINGdb$new(version="12", species=9606)
|
||||
```
|
||||
|
||||
### Cytoscape
|
||||
|
||||
STRING networks can be imported into Cytoscape for visualization and analysis:
|
||||
1. Use stringApp plugin
|
||||
2. Import TSV network data
|
||||
3. Apply layouts and styling
|
||||
|
||||
## Data License
|
||||
|
||||
STRING data is freely available under **Creative Commons BY 4.0** license:
|
||||
- ✓ Free to use for academic and commercial purposes
|
||||
- ✓ Attribution required
|
||||
- ✓ Modifications allowed
|
||||
- ✓ Redistribution allowed
|
||||
|
||||
**Citation**: Szklarczyk et al. (latest publication)
|
||||
|
||||
## Rate Limits and Usage
|
||||
|
||||
- **Rate limiting**: No strict limit, but avoid rapid-fire requests
|
||||
- **Recommendation**: Wait 1 second between calls
|
||||
- **Large datasets**: Use bulk download from https://string-db.org/cgi/download
|
||||
- **Proteome-scale**: Use web upload feature instead of API
|
||||
|
||||
## Related Resources
|
||||
|
||||
- **STRING website**: https://string-db.org
|
||||
- **Download page**: https://string-db.org/cgi/download
|
||||
- **Help center**: https://string-db.org/help/
|
||||
- **API documentation**: https://string-db.org/help/api/
|
||||
- **Publications**: https://string-db.org/cgi/about
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**No results returned**:
|
||||
- Verify species parameter matches identifiers
|
||||
- Check identifier format
|
||||
- Lower confidence threshold
|
||||
- Use identifier mapping first
|
||||
|
||||
**Timeout errors**:
|
||||
- Reduce number of input proteins
|
||||
- Split large queries into batches
|
||||
- Use bulk download for proteome-scale analyses
|
||||
|
||||
**Version inconsistencies**:
|
||||
- Use version-specific URLs
|
||||
- Check STRING version with `/version` endpoint
|
||||
- Update identifiers if using old IDs
|
||||
Reference in New Issue
Block a user