603 lines
20 KiB
Markdown
603 lines
20 KiB
Markdown
---
|
||
name: gwas-database
|
||
description: "Query NHGRI-EBI GWAS Catalog for SNP-trait associations. Search variants by rs ID, disease/trait, gene, retrieve p-values and summary statistics, for genetic epidemiology and polygenic risk scores."
|
||
---
|
||
|
||
# GWAS Catalog Database
|
||
|
||
## Overview
|
||
|
||
The GWAS Catalog is a comprehensive repository of published genome-wide association studies maintained by the National Human Genome Research Institute (NHGRI) and the European Bioinformatics Institute (EBI). The catalog contains curated SNP-trait associations from thousands of GWAS publications, including genetic variants, associated traits and diseases, p-values, effect sizes, and full summary statistics for many studies.
|
||
|
||
## When to Use This Skill
|
||
|
||
This skill should be used when queries involve:
|
||
|
||
- **Genetic variant associations**: Finding SNPs associated with diseases or traits
|
||
- **SNP lookups**: Retrieving information about specific genetic variants (rs IDs)
|
||
- **Trait/disease searches**: Discovering genetic associations for phenotypes
|
||
- **Gene associations**: Finding variants in or near specific genes
|
||
- **GWAS summary statistics**: Accessing complete genome-wide association data
|
||
- **Study metadata**: Retrieving publication and cohort information
|
||
- **Population genetics**: Exploring ancestry-specific associations
|
||
- **Polygenic risk scores**: Identifying variants for risk prediction models
|
||
- **Functional genomics**: Understanding variant effects and genomic context
|
||
- **Systematic reviews**: Comprehensive literature synthesis of genetic associations
|
||
|
||
## Core Capabilities
|
||
|
||
### 1. Understanding GWAS Catalog Data Structure
|
||
|
||
The GWAS Catalog is organized around four core entities:
|
||
|
||
- **Studies**: GWAS publications with metadata (PMID, author, cohort details)
|
||
- **Associations**: SNP-trait associations with statistical evidence (p ≤ 5×10⁻⁸)
|
||
- **Variants**: Genetic markers (SNPs) with genomic coordinates and alleles
|
||
- **Traits**: Phenotypes and diseases (mapped to EFO ontology terms)
|
||
|
||
**Key Identifiers:**
|
||
- Study accessions: `GCST` IDs (e.g., GCST001234)
|
||
- Variant IDs: `rs` numbers (e.g., rs7903146) or `variant_id` format
|
||
- Trait IDs: EFO terms (e.g., EFO_0001360 for type 2 diabetes)
|
||
- Gene symbols: HGNC approved names (e.g., TCF7L2)
|
||
|
||
### 2. Web Interface Searches
|
||
|
||
The web interface at https://www.ebi.ac.uk/gwas/ supports multiple search modes:
|
||
|
||
**By Variant (rs ID):**
|
||
```
|
||
rs7903146
|
||
```
|
||
Returns all trait associations for this SNP.
|
||
|
||
**By Disease/Trait:**
|
||
```
|
||
type 2 diabetes
|
||
Parkinson disease
|
||
body mass index
|
||
```
|
||
Returns all associated genetic variants.
|
||
|
||
**By Gene:**
|
||
```
|
||
APOE
|
||
TCF7L2
|
||
```
|
||
Returns variants in or near the gene region.
|
||
|
||
**By Chromosomal Region:**
|
||
```
|
||
10:114000000-115000000
|
||
```
|
||
Returns variants in the specified genomic interval.
|
||
|
||
**By Publication:**
|
||
```
|
||
PMID:20581827
|
||
Author: McCarthy MI
|
||
GCST001234
|
||
```
|
||
Returns study details and all reported associations.
|
||
|
||
### 3. REST API Access
|
||
|
||
The GWAS Catalog provides two REST APIs for programmatic access:
|
||
|
||
**Base URLs:**
|
||
- GWAS Catalog API: `https://www.ebi.ac.uk/gwas/rest/api`
|
||
- Summary Statistics API: `https://www.ebi.ac.uk/gwas/summary-statistics/api`
|
||
|
||
**API Documentation:**
|
||
- Main API docs: https://www.ebi.ac.uk/gwas/rest/docs/api
|
||
- Summary stats docs: https://www.ebi.ac.uk/gwas/summary-statistics/docs/
|
||
|
||
**Core Endpoints:**
|
||
|
||
1. **Studies endpoint** - `/studies/{accessionID}`
|
||
```python
|
||
import requests
|
||
|
||
# Get a specific study
|
||
url = "https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001795"
|
||
response = requests.get(url, headers={"Content-Type": "application/json"})
|
||
study = response.json()
|
||
```
|
||
|
||
2. **Associations endpoint** - `/associations`
|
||
```python
|
||
# Find associations for a variant
|
||
variant = "rs7903146"
|
||
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{variant}/associations"
|
||
params = {"projection": "associationBySnp"}
|
||
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
|
||
associations = response.json()
|
||
```
|
||
|
||
3. **Variants endpoint** - `/singleNucleotidePolymorphisms/{rsID}`
|
||
```python
|
||
# Get variant details
|
||
url = "https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/rs7903146"
|
||
response = requests.get(url, headers={"Content-Type": "application/json"})
|
||
variant_info = response.json()
|
||
```
|
||
|
||
4. **Traits endpoint** - `/efoTraits/{efoID}`
|
||
```python
|
||
# Get trait information
|
||
url = "https://www.ebi.ac.uk/gwas/rest/api/efoTraits/EFO_0001360"
|
||
response = requests.get(url, headers={"Content-Type": "application/json"})
|
||
trait_info = response.json()
|
||
```
|
||
|
||
### 4. Query Examples and Patterns
|
||
|
||
**Example 1: Find all associations for a disease**
|
||
```python
|
||
import requests
|
||
|
||
trait = "EFO_0001360" # Type 2 diabetes
|
||
base_url = "https://www.ebi.ac.uk/gwas/rest/api"
|
||
|
||
# Query associations for this trait
|
||
url = f"{base_url}/efoTraits/{trait}/associations"
|
||
response = requests.get(url, headers={"Content-Type": "application/json"})
|
||
associations = response.json()
|
||
|
||
# Process results
|
||
for assoc in associations.get('_embedded', {}).get('associations', []):
|
||
variant = assoc.get('rsId')
|
||
pvalue = assoc.get('pvalue')
|
||
risk_allele = assoc.get('strongestAllele')
|
||
print(f"{variant}: p={pvalue}, risk allele={risk_allele}")
|
||
```
|
||
|
||
**Example 2: Get variant information and all trait associations**
|
||
```python
|
||
import requests
|
||
|
||
variant = "rs7903146"
|
||
base_url = "https://www.ebi.ac.uk/gwas/rest/api"
|
||
|
||
# Get variant details
|
||
url = f"{base_url}/singleNucleotidePolymorphisms/{variant}"
|
||
response = requests.get(url, headers={"Content-Type": "application/json"})
|
||
variant_data = response.json()
|
||
|
||
# Get all associations for this variant
|
||
url = f"{base_url}/singleNucleotidePolymorphisms/{variant}/associations"
|
||
params = {"projection": "associationBySnp"}
|
||
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
|
||
associations = response.json()
|
||
|
||
# Extract trait names and p-values
|
||
for assoc in associations.get('_embedded', {}).get('associations', []):
|
||
trait = assoc.get('efoTrait')
|
||
pvalue = assoc.get('pvalue')
|
||
print(f"Trait: {trait}, p-value: {pvalue}")
|
||
```
|
||
|
||
**Example 3: Access summary statistics**
|
||
```python
|
||
import requests
|
||
|
||
# Query summary statistics API
|
||
base_url = "https://www.ebi.ac.uk/gwas/summary-statistics/api"
|
||
|
||
# Find associations by trait with p-value threshold
|
||
trait = "EFO_0001360" # Type 2 diabetes
|
||
p_upper = "0.000000001" # p < 1e-9
|
||
url = f"{base_url}/traits/{trait}/associations"
|
||
params = {
|
||
"p_upper": p_upper,
|
||
"size": 100 # Number of results
|
||
}
|
||
response = requests.get(url, params=params)
|
||
results = response.json()
|
||
|
||
# Process genome-wide significant hits
|
||
for hit in results.get('_embedded', {}).get('associations', []):
|
||
variant_id = hit.get('variant_id')
|
||
chromosome = hit.get('chromosome')
|
||
position = hit.get('base_pair_location')
|
||
pvalue = hit.get('p_value')
|
||
print(f"{chromosome}:{position} ({variant_id}): p={pvalue}")
|
||
```
|
||
|
||
**Example 4: Query by chromosomal region**
|
||
```python
|
||
import requests
|
||
|
||
# Find variants in a specific genomic region
|
||
chromosome = "10"
|
||
start_pos = 114000000
|
||
end_pos = 115000000
|
||
|
||
base_url = "https://www.ebi.ac.uk/gwas/rest/api"
|
||
url = f"{base_url}/singleNucleotidePolymorphisms/search/findByChromBpLocationRange"
|
||
params = {
|
||
"chrom": chromosome,
|
||
"bpStart": start_pos,
|
||
"bpEnd": end_pos
|
||
}
|
||
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
|
||
variants_in_region = response.json()
|
||
```
|
||
|
||
### 5. Working with Summary Statistics
|
||
|
||
The GWAS Catalog hosts full summary statistics for many studies, providing access to all tested variants (not just genome-wide significant hits).
|
||
|
||
**Access Methods:**
|
||
1. **FTP download**: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/
|
||
2. **REST API**: Query-based access to summary statistics
|
||
3. **Web interface**: Browse and download via the website
|
||
|
||
**Summary Statistics API Features:**
|
||
- Filter by chromosome, position, p-value
|
||
- Query specific variants across studies
|
||
- Retrieve effect sizes and allele frequencies
|
||
- Access harmonized and standardized data
|
||
|
||
**Example: Download summary statistics for a study**
|
||
```python
|
||
import requests
|
||
import gzip
|
||
|
||
# Get available summary statistics
|
||
base_url = "https://www.ebi.ac.uk/gwas/summary-statistics/api"
|
||
url = f"{base_url}/studies/GCST001234"
|
||
response = requests.get(url)
|
||
study_info = response.json()
|
||
|
||
# Download link is provided in the response
|
||
# Alternatively, use FTP:
|
||
# ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCSTXXXXXX/
|
||
```
|
||
|
||
### 6. Data Integration and Cross-referencing
|
||
|
||
The GWAS Catalog provides links to external resources:
|
||
|
||
**Genomic Databases:**
|
||
- Ensembl: Gene annotations and variant consequences
|
||
- dbSNP: Variant identifiers and population frequencies
|
||
- gnomAD: Population allele frequencies
|
||
|
||
**Functional Resources:**
|
||
- Open Targets: Target-disease associations
|
||
- PGS Catalog: Polygenic risk scores
|
||
- UCSC Genome Browser: Genomic context
|
||
|
||
**Phenotype Resources:**
|
||
- EFO (Experimental Factor Ontology): Standardized trait terms
|
||
- OMIM: Disease gene relationships
|
||
- Disease Ontology: Disease hierarchies
|
||
|
||
**Following Links in API Responses:**
|
||
```python
|
||
import requests
|
||
|
||
# API responses include _links for related resources
|
||
response = requests.get("https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001234")
|
||
study = response.json()
|
||
|
||
# Follow link to associations
|
||
associations_url = study['_links']['associations']['href']
|
||
associations_response = requests.get(associations_url)
|
||
```
|
||
|
||
## Query Workflows
|
||
|
||
### Workflow 1: Exploring Genetic Associations for a Disease
|
||
|
||
1. **Identify the trait** using EFO terms or free text:
|
||
- Search web interface for disease name
|
||
- Note the EFO ID (e.g., EFO_0001360 for type 2 diabetes)
|
||
|
||
2. **Query associations via API:**
|
||
```python
|
||
url = f"https://www.ebi.ac.uk/gwas/rest/api/efoTraits/{efo_id}/associations"
|
||
```
|
||
|
||
3. **Filter by significance and population:**
|
||
- Check p-values (genome-wide significant: p ≤ 5×10⁻⁸)
|
||
- Review ancestry information in study metadata
|
||
- Filter by sample size or discovery/replication status
|
||
|
||
4. **Extract variant details:**
|
||
- rs IDs for each association
|
||
- Effect alleles and directions
|
||
- Effect sizes (odds ratios, beta coefficients)
|
||
- Population allele frequencies
|
||
|
||
5. **Cross-reference with other databases:**
|
||
- Look up variant consequences in Ensembl
|
||
- Check population frequencies in gnomAD
|
||
- Explore gene function and pathways
|
||
|
||
### Workflow 2: Investigating a Specific Genetic Variant
|
||
|
||
1. **Query the variant:**
|
||
```python
|
||
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}"
|
||
```
|
||
|
||
2. **Retrieve all trait associations:**
|
||
```python
|
||
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}/associations"
|
||
```
|
||
|
||
3. **Analyze pleiotropy:**
|
||
- Identify all traits associated with this variant
|
||
- Review effect directions across traits
|
||
- Look for shared biological pathways
|
||
|
||
4. **Check genomic context:**
|
||
- Determine nearby genes
|
||
- Identify if variant is in coding/regulatory regions
|
||
- Review linkage disequilibrium with other variants
|
||
|
||
### Workflow 3: Gene-Centric Association Analysis
|
||
|
||
1. **Search by gene symbol** in web interface or:
|
||
```python
|
||
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/search/findByGene"
|
||
params = {"geneName": gene_symbol}
|
||
```
|
||
|
||
2. **Retrieve variants in gene region:**
|
||
- Get chromosomal coordinates for gene
|
||
- Query variants in region
|
||
- Include promoter and regulatory regions (extend boundaries)
|
||
|
||
3. **Analyze association patterns:**
|
||
- Identify traits associated with variants in this gene
|
||
- Look for consistent associations across studies
|
||
- Review effect sizes and directions
|
||
|
||
4. **Functional interpretation:**
|
||
- Determine variant consequences (missense, regulatory, etc.)
|
||
- Check expression QTL (eQTL) data
|
||
- Review pathway and network context
|
||
|
||
### Workflow 4: Systematic Review of Genetic Evidence
|
||
|
||
1. **Define research question:**
|
||
- Specific trait or disease of interest
|
||
- Population considerations
|
||
- Study design requirements
|
||
|
||
2. **Comprehensive variant extraction:**
|
||
- Query all associations for trait
|
||
- Set significance threshold
|
||
- Note discovery and replication studies
|
||
|
||
3. **Quality assessment:**
|
||
- Review study sample sizes
|
||
- Check for population diversity
|
||
- Assess heterogeneity across studies
|
||
- Identify potential biases
|
||
|
||
4. **Data synthesis:**
|
||
- Aggregate associations across studies
|
||
- Perform meta-analysis if applicable
|
||
- Create summary tables
|
||
- Generate Manhattan or forest plots
|
||
|
||
5. **Export and documentation:**
|
||
- Download full association data
|
||
- Export summary statistics if needed
|
||
- Document search strategy and date
|
||
- Create reproducible analysis scripts
|
||
|
||
### Workflow 5: Accessing and Analyzing Summary Statistics
|
||
|
||
1. **Identify studies with summary statistics:**
|
||
- Browse summary statistics portal
|
||
- Check FTP directory listings
|
||
- Query API for available studies
|
||
|
||
2. **Download summary statistics:**
|
||
```bash
|
||
# Via FTP
|
||
wget ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCSTXXXXXX/harmonised/GCSTXXXXXX-harmonised.tsv.gz
|
||
```
|
||
|
||
3. **Query via API for specific variants:**
|
||
```python
|
||
url = f"https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/{chrom}/associations"
|
||
params = {"start": start_pos, "end": end_pos}
|
||
```
|
||
|
||
4. **Process and analyze:**
|
||
- Filter by p-value thresholds
|
||
- Extract effect sizes and confidence intervals
|
||
- Perform downstream analyses (fine-mapping, colocalization, etc.)
|
||
|
||
## Response Formats and Data Fields
|
||
|
||
**Key Fields in Association Records:**
|
||
- `rsId`: Variant identifier (rs number)
|
||
- `strongestAllele`: Risk allele for the association
|
||
- `pvalue`: Association p-value
|
||
- `pvalueText`: P-value as text (may include inequality)
|
||
- `orPerCopyNum`: Odds ratio or beta coefficient
|
||
- `betaNum`: Effect size (for quantitative traits)
|
||
- `betaUnit`: Unit of measurement for beta
|
||
- `range`: Confidence interval
|
||
- `efoTrait`: Associated trait name
|
||
- `mappedLabel`: EFO-mapped trait term
|
||
|
||
**Study Metadata Fields:**
|
||
- `accessionId`: GCST study identifier
|
||
- `pubmedId`: PubMed ID
|
||
- `author`: First author
|
||
- `publicationDate`: Publication date
|
||
- `ancestryInitial`: Discovery population ancestry
|
||
- `ancestryReplication`: Replication population ancestry
|
||
- `sampleSize`: Total sample size
|
||
|
||
**Pagination:**
|
||
Results are paginated (default 20 items per page). Navigate using:
|
||
- `size` parameter: Number of results per page
|
||
- `page` parameter: Page number (0-indexed)
|
||
- `_links` in response: URLs for next/previous pages
|
||
|
||
## Best Practices
|
||
|
||
### Query Strategy
|
||
- Start with web interface to identify relevant EFO terms and study accessions
|
||
- Use API for bulk data extraction and automated analyses
|
||
- Implement pagination handling for large result sets
|
||
- Cache API responses to minimize redundant requests
|
||
|
||
### Data Interpretation
|
||
- Always check p-value thresholds (genome-wide: 5×10⁻⁸)
|
||
- Review ancestry information for population applicability
|
||
- Consider sample size when assessing evidence strength
|
||
- Check for replication across independent studies
|
||
- Be aware of winner's curse in effect size estimates
|
||
|
||
### Rate Limiting and Ethics
|
||
- Respect API usage guidelines (no excessive requests)
|
||
- Use summary statistics downloads for genome-wide analyses
|
||
- Implement appropriate delays between API calls
|
||
- Cache results locally when performing iterative analyses
|
||
- Cite the GWAS Catalog in publications
|
||
|
||
### Data Quality Considerations
|
||
- GWAS Catalog curates published associations (may contain inconsistencies)
|
||
- Effect sizes reported as published (may need harmonization)
|
||
- Some studies report conditional or joint associations
|
||
- Check for study overlap when combining results
|
||
- Be aware of ascertainment and selection biases
|
||
|
||
## Python Integration Example
|
||
|
||
Complete workflow for querying and analyzing GWAS data:
|
||
|
||
```python
|
||
import requests
|
||
import pandas as pd
|
||
from time import sleep
|
||
|
||
def query_gwas_catalog(trait_id, p_threshold=5e-8):
|
||
"""
|
||
Query GWAS Catalog for trait associations
|
||
|
||
Args:
|
||
trait_id: EFO trait identifier (e.g., 'EFO_0001360')
|
||
p_threshold: P-value threshold for filtering
|
||
|
||
Returns:
|
||
pandas DataFrame with association results
|
||
"""
|
||
base_url = "https://www.ebi.ac.uk/gwas/rest/api"
|
||
url = f"{base_url}/efoTraits/{trait_id}/associations"
|
||
|
||
headers = {"Content-Type": "application/json"}
|
||
results = []
|
||
page = 0
|
||
|
||
while True:
|
||
params = {"page": page, "size": 100}
|
||
response = requests.get(url, params=params, headers=headers)
|
||
|
||
if response.status_code != 200:
|
||
break
|
||
|
||
data = response.json()
|
||
associations = data.get('_embedded', {}).get('associations', [])
|
||
|
||
if not associations:
|
||
break
|
||
|
||
for assoc in associations:
|
||
pvalue = assoc.get('pvalue')
|
||
if pvalue and float(pvalue) <= p_threshold:
|
||
results.append({
|
||
'variant': assoc.get('rsId'),
|
||
'pvalue': pvalue,
|
||
'risk_allele': assoc.get('strongestAllele'),
|
||
'or_beta': assoc.get('orPerCopyNum') or assoc.get('betaNum'),
|
||
'trait': assoc.get('efoTrait'),
|
||
'pubmed_id': assoc.get('pubmedId')
|
||
})
|
||
|
||
page += 1
|
||
sleep(0.1) # Rate limiting
|
||
|
||
return pd.DataFrame(results)
|
||
|
||
# Example usage
|
||
df = query_gwas_catalog('EFO_0001360') # Type 2 diabetes
|
||
print(df.head())
|
||
print(f"\nTotal associations: {len(df)}")
|
||
print(f"Unique variants: {df['variant'].nunique()}")
|
||
```
|
||
|
||
## Resources
|
||
|
||
### references/api_reference.md
|
||
|
||
Comprehensive API documentation including:
|
||
- Detailed endpoint specifications for both APIs
|
||
- Complete list of query parameters and filters
|
||
- Response format specifications and field descriptions
|
||
- Advanced query examples and patterns
|
||
- Error handling and troubleshooting
|
||
- Integration with external databases
|
||
|
||
Consult this reference when:
|
||
- Constructing complex API queries
|
||
- Understanding response structures
|
||
- Implementing pagination or batch operations
|
||
- Troubleshooting API errors
|
||
- Exploring advanced filtering options
|
||
|
||
### Training Materials
|
||
|
||
The GWAS Catalog team provides workshop materials:
|
||
- GitHub repository: https://github.com/EBISPOT/GWAS_Catalog-workshop
|
||
- Jupyter notebooks with example queries
|
||
- Google Colab integration for cloud execution
|
||
|
||
## Important Notes
|
||
|
||
### Data Updates
|
||
- The GWAS Catalog is updated regularly with new publications
|
||
- Re-run queries periodically for comprehensive coverage
|
||
- Summary statistics are added as studies release data
|
||
- EFO mappings may be updated over time
|
||
|
||
### Citation Requirements
|
||
When using GWAS Catalog data, cite:
|
||
- Sollis E, et al. (2023) The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research. PMID: 37953337
|
||
- Include access date and version when available
|
||
- Cite original studies when discussing specific findings
|
||
|
||
### Limitations
|
||
- Not all GWAS publications are included (curation criteria apply)
|
||
- Full summary statistics available for subset of studies
|
||
- Effect sizes may require harmonization across studies
|
||
- Population diversity is growing but historically limited
|
||
- Some associations represent conditional or joint effects
|
||
|
||
### Data Access
|
||
- Web interface: Free, no registration required
|
||
- REST APIs: Free, no API key needed
|
||
- FTP downloads: Open access
|
||
- Rate limiting applies to API (be respectful)
|
||
|
||
## Additional Resources
|
||
|
||
- **GWAS Catalog website**: https://www.ebi.ac.uk/gwas/
|
||
- **Documentation**: https://www.ebi.ac.uk/gwas/docs
|
||
- **API documentation**: https://www.ebi.ac.uk/gwas/rest/docs/api
|
||
- **Summary Statistics API**: https://www.ebi.ac.uk/gwas/summary-statistics/docs/
|
||
- **FTP site**: http://ftp.ebi.ac.uk/pub/databases/gwas/
|
||
- **Training materials**: https://github.com/EBISPOT/GWAS_Catalog-workshop
|
||
- **PGS Catalog** (polygenic scores): https://www.pgscatalog.org/
|
||
- **Help and support**: gwas-info@ebi.ac.uk
|