gh-k-dense-ai-claude-scient…/skills/gwas-database/SKILL.md

---
name: gwas-database
description: "Query NHGRI-EBI GWAS Catalog for SNP-trait associations. Search variants by rs ID, disease/trait, gene, retrieve p-values and summary statistics, for genetic epidemiology and polygenic risk scores."
---

# GWAS Catalog Database

## Overview

The GWAS Catalog is a comprehensive repository of published genome-wide association studies maintained by the National Human Genome Research Institute (NHGRI) and the European Bioinformatics Institute (EBI). The catalog contains curated SNP-trait associations from thousands of GWAS publications, including genetic variants, associated traits and diseases, p-values, effect sizes, and full summary statistics for many studies.

## When to Use This Skill

This skill should be used when queries involve:

- **Genetic variant associations**: Finding SNPs associated with diseases or traits
- **SNP lookups**: Retrieving information about specific genetic variants (rs IDs)
- **Trait/disease searches**: Discovering genetic associations for phenotypes
- **Gene associations**: Finding variants in or near specific genes
- **GWAS summary statistics**: Accessing complete genome-wide association data
- **Study metadata**: Retrieving publication and cohort information
- **Population genetics**: Exploring ancestry-specific associations
- **Polygenic risk scores**: Identifying variants for risk prediction models
- **Functional genomics**: Understanding variant effects and genomic context
- **Systematic reviews**: Comprehensive literature synthesis of genetic associations

## Core Capabilities

### 1. Understanding GWAS Catalog Data Structure

The GWAS Catalog is organized around four core entities:

- **Studies**: GWAS publications with metadata (PMID, author, cohort details)
- **Associations**: SNP-trait associations with statistical evidence (p ≤ 5×10⁻⁸)
- **Variants**: Genetic markers (SNPs) with genomic coordinates and alleles
- **Traits**: Phenotypes and diseases (mapped to EFO ontology terms)

**Key Identifiers:**
- Study accessions: `GCST` IDs (e.g., GCST001234)
- Variant IDs: `rs` numbers (e.g., rs7903146) or `variant_id` format
- Trait IDs: EFO terms (e.g., EFO_0001360 for type 2 diabetes)
- Gene symbols: HGNC approved names (e.g., TCF7L2)

### 2. Web Interface Searches

The web interface at https://www.ebi.ac.uk/gwas/ supports multiple search modes:

**By Variant (rs ID):**
```
rs7903146
```
Returns all trait associations for this SNP.

**By Disease/Trait:**
```
type 2 diabetes
Parkinson disease
body mass index
```
Returns all associated genetic variants.

**By Gene:**
```
APOE
TCF7L2
```
Returns variants in or near the gene region.

**By Chromosomal Region:**
```
10:114000000-115000000
```
Returns variants in the specified genomic interval.

**By Publication:**
```
PMID:20581827
Author: McCarthy MI
GCST001234
```
Returns study details and all reported associations.

### 3. REST API Access

The GWAS Catalog provides two REST APIs for programmatic access:

**Base URLs:**
- GWAS Catalog API: `https://www.ebi.ac.uk/gwas/rest/api`
- Summary Statistics API: `https://www.ebi.ac.uk/gwas/summary-statistics/api`

**API Documentation:**
- Main API docs: https://www.ebi.ac.uk/gwas/rest/docs/api
- Summary stats docs: https://www.ebi.ac.uk/gwas/summary-statistics/docs/

**Core Endpoints:**

1. **Studies endpoint** - `/studies/{accessionID}`
   ```python
   import requests

   # Get a specific study
   url = "https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001795"
   response = requests.get(url, headers={"Content-Type": "application/json"})
   study = response.json()
   ```

2. **Associations endpoint** - `/associations`
   ```python
   # Find associations for a variant
   variant = "rs7903146"
   url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{variant}/associations"
   params = {"projection": "associationBySnp"}
   response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
   associations = response.json()
   ```

3. **Variants endpoint** - `/singleNucleotidePolymorphisms/{rsID}`
   ```python
   # Get variant details
   url = "https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/rs7903146"
   response = requests.get(url, headers={"Content-Type": "application/json"})
   variant_info = response.json()
   ```

4. **Traits endpoint** - `/efoTraits/{efoID}`
   ```python
   # Get trait information
   url = "https://www.ebi.ac.uk/gwas/rest/api/efoTraits/EFO_0001360"
   response = requests.get(url, headers={"Content-Type": "application/json"})
   trait_info = response.json()
   ```

### 4. Query Examples and Patterns

**Example 1: Find all associations for a disease**
```python
import requests

trait = "EFO_0001360"  # Type 2 diabetes
base_url = "https://www.ebi.ac.uk/gwas/rest/api"

# Query associations for this trait
url = f"{base_url}/efoTraits/{trait}/associations"
response = requests.get(url, headers={"Content-Type": "application/json"})
associations = response.json()

# Process results
for assoc in associations.get('_embedded', {}).get('associations', []):
    variant = assoc.get('rsId')
    pvalue = assoc.get('pvalue')
    risk_allele = assoc.get('strongestAllele')
    print(f"{variant}: p={pvalue}, risk allele={risk_allele}")
```

**Example 2: Get variant information and all trait associations**
```python
import requests

variant = "rs7903146"
base_url = "https://www.ebi.ac.uk/gwas/rest/api"

# Get variant details
url = f"{base_url}/singleNucleotidePolymorphisms/{variant}"
response = requests.get(url, headers={"Content-Type": "application/json"})
variant_data = response.json()

# Get all associations for this variant
url = f"{base_url}/singleNucleotidePolymorphisms/{variant}/associations"
params = {"projection": "associationBySnp"}
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
associations = response.json()

# Extract trait names and p-values
for assoc in associations.get('_embedded', {}).get('associations', []):
    trait = assoc.get('efoTrait')
    pvalue = assoc.get('pvalue')
    print(f"Trait: {trait}, p-value: {pvalue}")
```

**Example 3: Access summary statistics**
```python
import requests

# Query summary statistics API
base_url = "https://www.ebi.ac.uk/gwas/summary-statistics/api"

# Find associations by trait with p-value threshold
trait = "EFO_0001360"  # Type 2 diabetes
p_upper = "0.000000001"  # p < 1e-9
url = f"{base_url}/traits/{trait}/associations"
params = {
    "p_upper": p_upper,
    "size": 100  # Number of results
}
response = requests.get(url, params=params)
results = response.json()

# Process genome-wide significant hits
for hit in results.get('_embedded', {}).get('associations', []):
    variant_id = hit.get('variant_id')
    chromosome = hit.get('chromosome')
    position = hit.get('base_pair_location')
    pvalue = hit.get('p_value')
    print(f"{chromosome}:{position} ({variant_id}): p={pvalue}")
```

**Example 4: Query by chromosomal region**
```python
import requests

# Find variants in a specific genomic region
chromosome = "10"
start_pos = 114000000
end_pos = 115000000

base_url = "https://www.ebi.ac.uk/gwas/rest/api"
url = f"{base_url}/singleNucleotidePolymorphisms/search/findByChromBpLocationRange"
params = {
    "chrom": chromosome,
    "bpStart": start_pos,
    "bpEnd": end_pos
}
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
variants_in_region = response.json()
```

### 5. Working with Summary Statistics

The GWAS Catalog hosts full summary statistics for many studies, providing access to all tested variants (not just genome-wide significant hits).

**Access Methods:**
1. **FTP download**: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/
2. **REST API**: Query-based access to summary statistics
3. **Web interface**: Browse and download via the website

**Summary Statistics API Features:**
- Filter by chromosome, position, p-value
- Query specific variants across studies
- Retrieve effect sizes and allele frequencies
- Access harmonized and standardized data

**Example: Download summary statistics for a study**
```python
import requests
import gzip

# Get available summary statistics
base_url = "https://www.ebi.ac.uk/gwas/summary-statistics/api"
url = f"{base_url}/studies/GCST001234"
response = requests.get(url)
study_info = response.json()

# Download link is provided in the response
# Alternatively, use FTP:
# ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCSTXXXXXX/
```

### 6. Data Integration and Cross-referencing

The GWAS Catalog provides links to external resources:

**Genomic Databases:**
- Ensembl: Gene annotations and variant consequences
- dbSNP: Variant identifiers and population frequencies
- gnomAD: Population allele frequencies

**Functional Resources:**
- Open Targets: Target-disease associations
- PGS Catalog: Polygenic risk scores
- UCSC Genome Browser: Genomic context

**Phenotype Resources:**
- EFO (Experimental Factor Ontology): Standardized trait terms
- OMIM: Disease gene relationships
- Disease Ontology: Disease hierarchies

**Following Links in API Responses:**
```python
import requests

# API responses include _links for related resources
response = requests.get("https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001234")
study = response.json()

# Follow link to associations
associations_url = study['_links']['associations']['href']
associations_response = requests.get(associations_url)
```

## Query Workflows

### Workflow 1: Exploring Genetic Associations for a Disease

1. **Identify the trait** using EFO terms or free text:
   - Search web interface for disease name
   - Note the EFO ID (e.g., EFO_0001360 for type 2 diabetes)

2. **Query associations via API:**
   ```python
   url = f"https://www.ebi.ac.uk/gwas/rest/api/efoTraits/{efo_id}/associations"
   ```

3. **Filter by significance and population:**
   - Check p-values (genome-wide significant: p ≤ 5×10⁻⁸)
   - Review ancestry information in study metadata
   - Filter by sample size or discovery/replication status

4. **Extract variant details:**
   - rs IDs for each association
   - Effect alleles and directions
   - Effect sizes (odds ratios, beta coefficients)
   - Population allele frequencies

5. **Cross-reference with other databases:**
   - Look up variant consequences in Ensembl
   - Check population frequencies in gnomAD
   - Explore gene function and pathways

### Workflow 2: Investigating a Specific Genetic Variant

1. **Query the variant:**
   ```python
   url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}"
   ```

2. **Retrieve all trait associations:**
   ```python
   url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}/associations"
   ```

3. **Analyze pleiotropy:**
   - Identify all traits associated with this variant
   - Review effect directions across traits
   - Look for shared biological pathways

4. **Check genomic context:**
   - Determine nearby genes
   - Identify if variant is in coding/regulatory regions
   - Review linkage disequilibrium with other variants

### Workflow 3: Gene-Centric Association Analysis

1. **Search by gene symbol** in web interface or:
   ```python
   url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/search/findByGene"
   params = {"geneName": gene_symbol}
   ```

2. **Retrieve variants in gene region:**
   - Get chromosomal coordinates for gene
   - Query variants in region
   - Include promoter and regulatory regions (extend boundaries)

3. **Analyze association patterns:**
   - Identify traits associated with variants in this gene
   - Look for consistent associations across studies
   - Review effect sizes and directions

4. **Functional interpretation:**
   - Determine variant consequences (missense, regulatory, etc.)
   - Check expression QTL (eQTL) data
   - Review pathway and network context

### Workflow 4: Systematic Review of Genetic Evidence

1. **Define research question:**
   - Specific trait or disease of interest
   - Population considerations
   - Study design requirements

2. **Comprehensive variant extraction:**
   - Query all associations for trait
   - Set significance threshold
   - Note discovery and replication studies

3. **Quality assessment:**
   - Review study sample sizes
   - Check for population diversity
   - Assess heterogeneity across studies
   - Identify potential biases

4. **Data synthesis:**
   - Aggregate associations across studies
   - Perform meta-analysis if applicable
   - Create summary tables
   - Generate Manhattan or forest plots

5. **Export and documentation:**
   - Download full association data
   - Export summary statistics if needed
   - Document search strategy and date
   - Create reproducible analysis scripts

### Workflow 5: Accessing and Analyzing Summary Statistics

1. **Identify studies with summary statistics:**
   - Browse summary statistics portal
   - Check FTP directory listings
   - Query API for available studies

2. **Download summary statistics:**
   ```bash
   # Via FTP
   wget ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCSTXXXXXX/harmonised/GCSTXXXXXX-harmonised.tsv.gz
   ```

3. **Query via API for specific variants:**
   ```python
   url = f"https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/{chrom}/associations"
   params = {"start": start_pos, "end": end_pos}
   ```

4. **Process and analyze:**
   - Filter by p-value thresholds
   - Extract effect sizes and confidence intervals
   - Perform downstream analyses (fine-mapping, colocalization, etc.)

## Response Formats and Data Fields

**Key Fields in Association Records:**
- `rsId`: Variant identifier (rs number)
- `strongestAllele`: Risk allele for the association
- `pvalue`: Association p-value
- `pvalueText`: P-value as text (may include inequality)
- `orPerCopyNum`: Odds ratio or beta coefficient
- `betaNum`: Effect size (for quantitative traits)
- `betaUnit`: Unit of measurement for beta
- `range`: Confidence interval
- `efoTrait`: Associated trait name
- `mappedLabel`: EFO-mapped trait term

**Study Metadata Fields:**
- `accessionId`: GCST study identifier
- `pubmedId`: PubMed ID
- `author`: First author
- `publicationDate`: Publication date
- `ancestryInitial`: Discovery population ancestry
- `ancestryReplication`: Replication population ancestry
- `sampleSize`: Total sample size

**Pagination:**
Results are paginated (default 20 items per page). Navigate using:
- `size` parameter: Number of results per page
- `page` parameter: Page number (0-indexed)
- `_links` in response: URLs for next/previous pages

## Best Practices

### Query Strategy
- Start with web interface to identify relevant EFO terms and study accessions
- Use API for bulk data extraction and automated analyses
- Implement pagination handling for large result sets
- Cache API responses to minimize redundant requests

### Data Interpretation
- Always check p-value thresholds (genome-wide: 5×10⁻⁸)
- Review ancestry information for population applicability
- Consider sample size when assessing evidence strength
- Check for replication across independent studies
- Be aware of winner's curse in effect size estimates

### Rate Limiting and Ethics
- Respect API usage guidelines (no excessive requests)
- Use summary statistics downloads for genome-wide analyses
- Implement appropriate delays between API calls
- Cache results locally when performing iterative analyses
- Cite the GWAS Catalog in publications

### Data Quality Considerations
- GWAS Catalog curates published associations (may contain inconsistencies)
- Effect sizes reported as published (may need harmonization)
- Some studies report conditional or joint associations
- Check for study overlap when combining results
- Be aware of ascertainment and selection biases

## Python Integration Example

Complete workflow for querying and analyzing GWAS data:

```python
import requests
import pandas as pd
from time import sleep

def query_gwas_catalog(trait_id, p_threshold=5e-8):
    """
    Query GWAS Catalog for trait associations

    Args:
        trait_id: EFO trait identifier (e.g., 'EFO_0001360')
        p_threshold: P-value threshold for filtering

    Returns:
        pandas DataFrame with association results
    """
    base_url = "https://www.ebi.ac.uk/gwas/rest/api"
    url = f"{base_url}/efoTraits/{trait_id}/associations"

    headers = {"Content-Type": "application/json"}
    results = []
    page = 0

    while True:
        params = {"page": page, "size": 100}
        response = requests.get(url, params=params, headers=headers)

        if response.status_code != 200:
            break

        data = response.json()
        associations = data.get('_embedded', {}).get('associations', [])

        if not associations:
            break

        for assoc in associations:
            pvalue = assoc.get('pvalue')
            if pvalue and float(pvalue) <= p_threshold:
                results.append({
                    'variant': assoc.get('rsId'),
                    'pvalue': pvalue,
                    'risk_allele': assoc.get('strongestAllele'),
                    'or_beta': assoc.get('orPerCopyNum') or assoc.get('betaNum'),
                    'trait': assoc.get('efoTrait'),
                    'pubmed_id': assoc.get('pubmedId')
                })

        page += 1
        sleep(0.1)  # Rate limiting

    return pd.DataFrame(results)

# Example usage
df = query_gwas_catalog('EFO_0001360')  # Type 2 diabetes
print(df.head())
print(f"\nTotal associations: {len(df)}")
print(f"Unique variants: {df['variant'].nunique()}")
```

## Resources

### references/api_reference.md

Comprehensive API documentation including:
- Detailed endpoint specifications for both APIs
- Complete list of query parameters and filters
- Response format specifications and field descriptions
- Advanced query examples and patterns
- Error handling and troubleshooting
- Integration with external databases

Consult this reference when:
- Constructing complex API queries
- Understanding response structures
- Implementing pagination or batch operations
- Troubleshooting API errors
- Exploring advanced filtering options

### Training Materials

The GWAS Catalog team provides workshop materials:
- GitHub repository: https://github.com/EBISPOT/GWAS_Catalog-workshop
- Jupyter notebooks with example queries
- Google Colab integration for cloud execution

## Important Notes

### Data Updates
- The GWAS Catalog is updated regularly with new publications
- Re-run queries periodically for comprehensive coverage
- Summary statistics are added as studies release data
- EFO mappings may be updated over time

### Citation Requirements
When using GWAS Catalog data, cite:
- Sollis E, et al. (2023) The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research. PMID: 37953337
- Include access date and version when available
- Cite original studies when discussing specific findings

### Limitations
- Not all GWAS publications are included (curation criteria apply)
- Full summary statistics available for subset of studies
- Effect sizes may require harmonization across studies
- Population diversity is growing but historically limited
- Some associations represent conditional or joint effects

### Data Access
- Web interface: Free, no registration required
- REST APIs: Free, no API key needed
- FTP downloads: Open access
- Rate limiting applies to API (be respectful)

## Additional Resources

- **GWAS Catalog website**: https://www.ebi.ac.uk/gwas/
- **Documentation**: https://www.ebi.ac.uk/gwas/docs
- **API documentation**: https://www.ebi.ac.uk/gwas/rest/docs/api
- **Summary Statistics API**: https://www.ebi.ac.uk/gwas/summary-statistics/docs/
- **FTP site**: http://ftp.ebi.ac.uk/pub/databases/gwas/
- **Training materials**: https://github.com/EBISPOT/GWAS_Catalog-workshop
- **PGS Catalog** (polygenic scores): https://www.pgscatalog.org/
- **Help and support**: gwas-info@ebi.ac.uk