Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions


@@ -0,0 +1,227 @@
# ClinVar API and Data Access Reference
## Overview
ClinVar provides multiple methods for programmatic data access:
- **E-utilities** - NCBI's REST API for searching and retrieving data
- **Entrez Direct** - Command-line tools for UNIX environments
- **FTP Downloads** - Bulk data files in XML, VCF, and tab-delimited formats
- **Submission API** - REST API for submitting variant interpretations
## E-utilities API
### Base URL
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
```
### Supported Operations
#### 1. esearch - Search for Records
Search ClinVar using the same query syntax as the web interface.
**Endpoint:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
```
**Parameters:**
- `db=clinvar` - Database name (required)
- `term=<query>` - Search query (required)
- `retmax=<N>` - Maximum records to return (default: 20)
- `retmode=json` - Return format (json or xml)
- `usehistory=y` - Store results on server for large datasets
**Example Query:**
```bash
# Search for BRCA1 pathogenic variants
# -g (--globoff) stops curl from treating the [ ] in the query as URL globbing
curl -g "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=BRCA1[gene]+AND+pathogenic[CLNSIG]&retmode=json&retmax=100"
```
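When the request is built in code rather than typed into a shell, the square brackets and spaces in the query are safest percent-encoded. The following is a minimal sketch using only the Python standard library; the helper name `build_esearch_url` is illustrative and not part of any NCBI client.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_esearch_url(term, retmax=20, retmode="json", api_key=None):
    """Return an esearch URL for ClinVar with the query safely encoded."""
    params = {"db": "clinvar", "term": term, "retmax": retmax, "retmode": retmode}
    if api_key:  # optional NCBI API key
        params["api_key"] = api_key
    # urlencode turns spaces into '+' and brackets into %5B / %5D
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)

url = build_esearch_url("BRCA1[gene] AND pathogenic[CLNSIG]", retmax=100)
print(url)
```

The resulting URL can be passed to any HTTP client without further escaping.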
**Common Search Fields:**
- `[gene]` - Gene symbol
- `[CLNSIG]` - Clinical significance (pathogenic, benign, etc.)
- `[disorder]` - Disease/condition name
- `[variant name]` - HGVS expression or variant identifier
- `[chr]` - Chromosome number
- `[Assembly]` - GRCh37 or GRCh38
#### 2. esummary - Retrieve Record Summaries
Get summary information for specific ClinVar records.
**Endpoint:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi
```
**Parameters:**
- `db=clinvar` - Database name (required)
- `id=<UIDs>` - Comma-separated list of ClinVar UIDs
- `retmode=json` - Return format (json or xml)
- `version=2.0` - Requests the version 2.0 ESummary format (`2.0` is the only supported value)
**Example:**
```bash
# Get summary for specific variant
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=12345&retmode=json&version=2.0"
```
**esummary Output Includes:**
- Accession (RCV/VCV)
- Clinical significance
- Review status
- Gene symbols
- Variant type
- Genomic locations (GRCh37 and GRCh38)
- Associated conditions
- Allele origin (germline/somatic)
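As a rough illustration of reading these summaries, E-utilities JSON replies place the records under `result`, with a `uids` list naming the per-record keys. The sketch below walks that structure; the field names inside the sample record (`accession`, `title`) and the sample values are assumptions for illustration, so inspect a live response before relying on them.

```python
import json

def iter_summaries(esummary_json):
    """Yield (uid, record) pairs from an esummary JSON response body."""
    result = json.loads(esummary_json).get("result", {})
    for uid in result.get("uids", []):
        yield uid, result[uid]

# Hypothetical, heavily trimmed response used only to exercise the function.
sample = json.dumps({
    "result": {
        "uids": ["12345"],
        "12345": {"uid": "12345", "accession": "VCV000012345", "title": "example variant"},
    }
})

for uid, rec in iter_summaries(sample):
    print(uid, rec.get("accession"), rec.get("title"))
```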
#### 3. efetch - Retrieve Full Records
Download complete XML records for detailed analysis.
**Endpoint:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
```
**Parameters:**
- `db=clinvar` - Database name (required)
- `id=<UIDs>` - Comma-separated ClinVar UIDs
- `rettype=vcv` or `rettype=rcv` - Record type
**Example:**
```bash
# Fetch full VCV record
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&id=12345&rettype=vcv"
```
#### 4. elink - Find Related Records
Link ClinVar records to other NCBI databases.
**Endpoint:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi
```
**Available Links:**
- clinvar_pubmed - Link to PubMed citations
- clinvar_gene - Link to Gene database
- clinvar_medgen - Link to MedGen (conditions)
- clinvar_snp - Link to dbSNP
**Example:**
```bash
# Find PubMed articles for a variant
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=clinvar&db=pubmed&id=12345"
```
### Workflow Example: Complete Search and Retrieval
```bash
# Step 1: Search for variants
SEARCH_URL="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=CFTR[gene]+AND+pathogenic[CLNSIG]&retmode=json&retmax=10"
# Step 2: Parse IDs from search results
# (Extract id list from JSON response)
# Step 3: Retrieve summaries
SUMMARY_URL="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=<ids>&retmode=json&version=2.0"
# Step 4: Fetch full records if needed
FETCH_URL="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&id=<ids>&rettype=vcv"
```
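Steps 2 and 3 of the workflow above can be sketched in Python as follows. This assumes the search was made with `retmode=json`, in which case the IDs sit under `esearchresult.idlist`; the function names and the sample response body are illustrative only.

```python
import json
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def extract_ids(esearch_json):
    """Step 2: pull the UID list out of an esearch JSON response body."""
    return json.loads(esearch_json)["esearchresult"]["idlist"]

def build_esummary_url(ids):
    """Step 3: build one batched esummary request for all UIDs."""
    params = {"db": "clinvar", "id": ",".join(ids), "retmode": "json"}
    # safe="," keeps the comma-separated ID list readable in the URL
    return EUTILS_BASE + "esummary.fcgi?" + urlencode(params, safe=",")

# Hypothetical esearch response body, trimmed to the fields used here.
sample = '{"esearchresult": {"count": "2", "idlist": ["111", "222"]}}'
ids = extract_ids(sample)
print(build_esummary_url(ids))
```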
## Entrez Direct (Command-Line)
Install Entrez Direct for command-line access:
```bash
sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
```
### Common Commands
**Search:**
```bash
esearch -db clinvar -query "BRCA1[gene] AND pathogenic[CLNSIG]"
```
**Pipeline Search to Summary:**
```bash
esearch -db clinvar -query "TP53[gene]" | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element AccessionVersion Title
```
**Count Results:**
```bash
# esearch prints an ENTREZ_DIRECT summary that includes the hit count
esearch -db clinvar -query "breast cancer[disorder]" | \
xtract -pattern ENTREZ_DIRECT -element Count
```
## Rate Limits and Best Practices
### Rate Limits
- **Without API Key:** 3 requests/second
- **With API Key:** 10 requests/second
- Large datasets: Use `usehistory=y` to avoid repeated queries
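One simple way to stay under these limits is to enforce a minimum gap between requests. The class below is a sketch of that idea, not an official NCBI mechanism; the name `Throttle` and its injectable `clock` and `sleep` parameters are choices made here so the logic can be tested without waiting.

```python
import time

class Throttle:
    """Space out calls so they stay under a requests-per-second limit."""

    def __init__(self, has_api_key=False, clock=time.monotonic, sleep=time.sleep):
        # 3 requests/second without a key, 10 with one (limits listed above)
        self.min_interval = 1.0 / (10 if has_api_key else 3)
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(has_api_key=False)
for _ in range(2):
    throttle.wait()  # call this immediately before each HTTP request
```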
### API Key Setup
1. Register for NCBI account at https://www.ncbi.nlm.nih.gov/account/
2. Generate API key in account settings
3. Add `&api_key=<YOUR_KEY>` to all requests
### Best Practices
- Test queries on web interface before automation
- Use `usehistory` for large result sets (>500 records)
- Implement exponential backoff for rate limit errors
- Cache results when appropriate
- Use batch requests instead of individual queries
- Respect NCBI servers - run large jobs on weekends or between 9:00 PM and 5:00 AM US Eastern time on weekdays
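Two of the practices above, batching and exponential backoff, can be sketched with small helpers. The batch size of 200 and the retry count are arbitrary assumptions, and the broad `except Exception` is a simplification; a real client would retry only on rate-limit or server errors.

```python
import time

def chunked(ids, size=200):
    """Split a list of UIDs into batches, one request per batch."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def with_backoff(request, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call request() and retry with doubling delays when it raises."""
    for attempt in range(max_tries):
        try:
            return request()
        except Exception:
            if attempt == max_tries - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

print(chunked(["1", "2", "3", "4", "5"], size=2))
```

Here `request` is any zero-argument callable that performs one HTTP call, so the retry logic stays independent of the HTTP library in use.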
## Python Example with Biopython
```python
from Bio import Entrez

# Set email (required by NCBI)
Entrez.email = "your.email@example.com"

# Search ClinVar
def search_clinvar(query, retmax=100):
    handle = Entrez.esearch(db="clinvar", term=query, retmax=retmax)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

# Get summaries
def get_summaries(id_list):
    ids = ",".join(id_list)
    # Entrez.read parses XML only, so keep the default XML output
    handle = Entrez.esummary(db="clinvar", id=ids)
    # validate=False: ClinVar summaries may use tags that Biopython's bundled DTDs do not cover
    record = Entrez.read(handle, validate=False)
    handle.close()
    return record

# Example usage
variant_ids = search_clinvar("BRCA2[gene] AND pathogenic[CLNSIG]")
summaries = get_summaries(variant_ids)
```
## Error Handling
### Common HTTP Status Codes
- `200` - Success
- `400` - Bad request (check query syntax)
- `429` - Too many requests (rate limited)
- `500` - Server error (retry with exponential backoff)
### Error Response Example
```xml
<ERROR>Empty id list - nothing to do</ERROR>
```
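Because an error message like this arrives inside the response body, it can be worth scanning XML replies for `<ERROR>` elements before processing them. A minimal sketch with the standard library follows; the wrapping `eSummaryResult` element in the sample is included for illustration.

```python
import xml.etree.ElementTree as ET

def find_errors(xml_text):
    """Return the text of every <ERROR> element in an E-utilities XML reply."""
    root = ET.fromstring(xml_text)
    # iter() also matches the root itself, so a bare <ERROR> document works too
    return [(e.text or "").strip() for e in root.iter("ERROR")]

reply = "<eSummaryResult><ERROR>Empty id list - nothing to do</ERROR></eSummaryResult>"
print(find_errors(reply))
```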
## Additional Resources
- NCBI E-utilities documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/
- ClinVar web services: https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/
- Entrez Direct cookbook: https://www.ncbi.nlm.nih.gov/books/NBK179288/


@@ -0,0 +1,218 @@
# ClinVar Clinical Significance Interpretation Guide
## Overview
ClinVar uses standardized terminology to describe the clinical significance of genetic variants. Understanding these classifications is critical for interpreting variant reports and making informed research or clinical decisions.
## Important Disclaimer
**ClinVar data is NOT intended for direct diagnostic use or medical decision-making without review by a genetics professional.** The interpretations in ClinVar represent submitted data from various sources and should be evaluated in the context of the specific patient and clinical scenario.
## Three Classification Categories
ClinVar represents three distinct types of variant classifications:
1. **Germline variants** - Inherited variants related to Mendelian diseases and drug responses
2. **Somatic variants (Clinical Impact)** - Acquired variants with therapeutic implications
3. **Somatic variants (Oncogenicity)** - Acquired variants related to cancer development
## Germline Variant Classifications
### Standard ACMG/AMP Terms
These are the five core terms recommended by the American College of Medical Genetics and Genomics (ACMG) and Association for Molecular Pathology (AMP):
| Term | Abbreviation | Meaning | Probability |
|------|--------------|---------|-------------|
| **Pathogenic** | P | Variant causes disease | ~99% |
| **Likely Pathogenic** | LP | Variant likely causes disease | ~90% |
| **Uncertain Significance** | VUS | Insufficient evidence to classify | N/A |
| **Likely Benign** | LB | Variant likely does not cause disease | ~90% non-pathogenic |
| **Benign** | B | Variant does not cause disease | ~99% non-pathogenic |
### Low-Penetrance and Risk Allele Terms
ClinGen recommends additional terms for variants with incomplete penetrance or risk associations:
- **Pathogenic, low penetrance** - Disease-causing but not all carriers develop disease
- **Likely pathogenic, low penetrance** - Probably disease-causing with incomplete penetrance
- **Established risk allele** - Confirmed association with increased disease risk
- **Likely risk allele** - Probable association with increased disease risk
- **Uncertain risk allele** - Unclear risk association
### Additional Classification Terms
- **Drug response** - Variants affecting medication efficacy or metabolism
- **Association** - Statistical association with trait/disease
- **Protective** - Variants that reduce disease risk
- **Affects** - Variants that affect a biological function
- **Other** - Classifications that don't fit standard categories
- **Not provided** - No classification submitted
### Special Considerations
**Recessive Disorders:**
A disease-causing variant for an autosomal recessive disorder should be classified as "Pathogenic," even though heterozygous carriers will not develop disease. The classification describes the variant's effect, not the carrier status.
**Compound Heterozygotes:**
Each variant is classified independently. Two "Likely Pathogenic" variants in trans can together cause recessive disease, but each maintains its individual classification.
## Somatic Variant Classifications
### Clinical Impact (AMP/ASCO/CAP Tiers)
Based on guidelines from the Association for Molecular Pathology (AMP), American Society of Clinical Oncology (ASCO), and College of American Pathologists (CAP):
| Tier | Meaning |
|------|---------|
| **Tier I - Strong** | Variants with strong clinical significance - FDA-approved therapies or professional guidelines |
| **Tier II - Potential** | Variants with potential clinical actionability - emerging evidence |
| **Tier III - Uncertain** | Variants of unknown clinical significance |
| **Tier IV - Benign/Likely Benign** | Variants with no therapeutic implications |
### Oncogenicity (ClinGen/CGC/VICC)
Based on standards from ClinGen, Cancer Genomics Consortium (CGC), and Variant Interpretation for Cancer Consortium (VICC):
| Term | Meaning |
|------|---------|
| **Oncogenic** | Variant drives cancer development |
| **Likely Oncogenic** | Variant probably drives cancer development |
| **Uncertain Significance** | Insufficient evidence for oncogenicity |
| **Likely Benign** | Variant probably does not drive cancer |
| **Benign** | Variant does not drive cancer |
## Review Status and Star Ratings
ClinVar assigns review status ratings to indicate the strength of evidence behind classifications:
| Stars | Review Status | Description | Weight |
|-------|---------------|-------------|--------|
| ★★★★ | **Practice Guideline** | Reviewed by expert panel with published guidelines | Highest |
| ★★★ | **Expert Panel Review** | Reviewed by expert panel (e.g., ClinGen) | High |
| ★★ | **Multiple Submitters, No Conflicts** | ≥2 submitters with same classification | Moderate |
| ★ | **Criteria Provided, Single Submitter** | One submitter with supporting evidence | Standard |
| ☆ | **No Assertion Criteria** | Classification without documented criteria | Lowest |
| ☆ | **No Assertion Provided** | No classification submitted | None |
### What the Stars Mean
- **4 stars**: Highest confidence - vetted by expert panels, used in clinical practice guidelines
- **3 stars**: High confidence - expert panel review (e.g., ClinGen Variant Curation Expert Panel)
- **2 stars**: Moderate confidence - consensus among multiple independent submitters
- **1 star**: Single submitter with evidence - quality depends on submitter expertise
- **0 stars**: Low confidence - insufficient evidence or no criteria provided
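For pipelines, the star levels above can be encoded as a lookup table. The sketch below assumes the review-status spellings commonly seen in ClinVar data files; they are not quoted from this guide, so verify them against the release in use. Statuses outside the table (for example, conflicting submissions) return `None`.

```python
# Assumed review-status strings; check them against your ClinVar release.
STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
    "no assertion provided": 0,
}

def star_rating(review_status):
    """Map a review-status string to 0-4 stars; None if unrecognized."""
    # VCF files write the status with underscores instead of spaces
    key = review_status.replace("_", " ").strip().lower()
    return STARS.get(key)

print(star_rating("reviewed_by_expert_panel"))
```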
## Conflicting Interpretations
### What Constitutes a Conflict?
As of June 2022, conflicts are reported between:
- Pathogenic/likely pathogenic **vs.** Uncertain significance
- Pathogenic/likely pathogenic **vs.** Benign/likely benign
- Uncertain significance **vs.** Benign/likely benign
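The three pairings above amount to saying that a conflict exists whenever submissions fall into more than one of three groups (P/LP, VUS, B/LB). A small sketch of that rule follows; ignoring terms outside the five core classifications is an assumption made here, not documented ClinVar behavior.

```python
GROUPS = {
    "pathogenic": "P/LP",
    "likely pathogenic": "P/LP",
    "uncertain significance": "VUS",
    "likely benign": "B/LB",
    "benign": "B/LB",
}

def has_conflict(classifications):
    """True if the submitted terms fall into more than one of the three groups."""
    normalized = (c.strip().lower() for c in classifications)
    groups = {GROUPS[c] for c in normalized if c in GROUPS}
    return len(groups) > 1

print(has_conflict(["Pathogenic", "Likely pathogenic"]))         # same group
print(has_conflict(["Likely benign", "Uncertain significance"]))  # different groups
```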
### Conflict Resolution
When conflicts exist, ClinVar reports:
- **"Conflicting classifications of pathogenicity"** (labeled "Conflicting interpretations of pathogenicity" before ClinVar's 2024 terminology update) - Disagreement on clinical significance
- Individual submissions are displayed so users can evaluate evidence
- Higher review status (more stars) carries more weight
- More recent submissions may reflect updated evidence
### Handling Conflicts in Research
When encountering conflicts:
1. Check the review status (star rating) of each interpretation
2. Examine the evidence and criteria provided by each submitter
3. Consider the date of submission (more recent may reflect new data)
4. Review population frequency data and functional studies
5. Consult expert panel classifications when available
## Aggregate Classifications
ClinVar calculates an aggregate classification when multiple submitters provide interpretations:
### No Conflicts
When all submitters agree (within the same category):
- Display: Single classification term
- Confidence: Higher with more submitters
### With Conflicts
When submitters disagree:
- Display: "Conflicting classifications of pathogenicity" (formerly "Conflicting interpretations of pathogenicity")
- Details: All individual submissions shown
- Resolution: Users must evaluate evidence themselves
## Interpretation Best Practices
### For Researchers
1. **Always check review status** - Prefer variants with ★★★ or ★★★★ ratings
2. **Review submission details** - Examine evidence supporting classification
3. **Consider publication date** - Newer classifications may incorporate recent data
4. **Check assertion criteria** - Variants with ACMG criteria are more reliable
5. **Verify in context** - Population, ethnicity, and phenotype matter
6. **Follow up on conflicts** - Investigate discrepancies before making conclusions
### For Variant Annotation Pipelines
1. Prioritize higher review status classifications
2. Flag conflicting interpretations for manual review
3. Track classification changes over time
4. Include population frequency data alongside ClinVar classifications
5. Document ClinVar version and access date
### Red Flags
Be cautious with variants that have:
- Zero or one star rating
- Conflicting interpretations without resolution
- Classification as VUS (uncertain significance)
- Very old submission dates without updates
- Classification based on in silico predictions alone
## Common Query Patterns
### Search for High-Confidence Pathogenic Variants
```
BRCA1[gene] AND pathogenic[CLNSIG] AND practice guideline[RVSTAT]
```
### Filter by Review Status
```
TP53[gene] AND (reviewed by expert panel[RVSTAT] OR practice guideline[RVSTAT])
```
### Exclude Conflicting Interpretations
```
CFTR[gene] AND pathogenic[CLNSIG] NOT conflicting[RVSTAT]
```
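If these patterns are generated programmatically, a small builder keeps the syntax consistent. The sketch below reuses the field tags exactly as they appear in this guide (`[gene]`, `[CLNSIG]`, `[RVSTAT]`) without verifying them against the ClinVar search index; the function name and parameters are illustrative.

```python
def build_query(gene, significance=None, review_statuses=(), exclude_conflicting=False):
    """Compose a ClinVar search term from the field tags used in this guide."""
    parts = [f"{gene}[gene]"]
    if significance:
        parts.append(f"{significance}[CLNSIG]")
    if review_statuses:
        alternatives = " OR ".join(f"{s}[RVSTAT]" for s in review_statuses)
        # Parenthesize only when there is more than one alternative
        parts.append(f"({alternatives})" if len(review_statuses) > 1 else alternatives)
    query = " AND ".join(parts)
    if exclude_conflicting:
        query += " NOT conflicting[RVSTAT]"
    return query

print(build_query("TP53", review_statuses=["reviewed by expert panel", "practice guideline"]))
```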
## Updates and Reclassifications
### Why Classifications Change
Variants may be reclassified due to:
- New functional studies
- Additional population data (e.g., gnomAD)
- Updated ACMG guidelines
- Clinical evidence from more patients
- Segregation data from families
### Tracking Changes
- ClinVar maintains submission history
- Version-controlled VCV/RCV accessions
- Monthly updates to classifications
- Reclassifications can go in either direction (upgrade or downgrade)
## Key Resources
- ACMG/AMP Variant Interpretation Guidelines: Richards et al., 2015
- ClinGen Sequence Variant Interpretation Working Group: https://clinicalgenome.org/
- ClinVar Clinical Significance Documentation: https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/
- Review Status Documentation: https://www.ncbi.nlm.nih.gov/clinvar/docs/review_status/


@@ -0,0 +1,358 @@
# ClinVar Data Formats and FTP Access
## Overview
ClinVar provides bulk data downloads in multiple formats to support different research workflows. Data is distributed via FTP and updated on regular schedules.
## FTP Access
### Base URL
```
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/
```
### Update Schedule
- **Monthly Releases**: First Thursday of each month
  - Complete dataset with comprehensive documentation
  - Archived indefinitely for reproducibility
  - Includes release notes
- **Weekly Updates**: Every Monday
  - Incremental updates to monthly release
  - Retained until next monthly release
  - Allows synchronization with ClinVar website
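For scheduling a download job, the nominal monthly release date can be computed from the "first Thursday" rule above. This is a sketch of the calendar arithmetic only; actual release dates may shift, so treat the result as an estimate.

```python
from datetime import date, timedelta

def first_thursday(year, month):
    """Return the date of the first Thursday in the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, so Thursday is 3
    offset = (3 - first.weekday()) % 7
    return first + timedelta(days=offset)

print(first_thursday(2025, 11))
```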
### Directory Structure
```
pub/clinvar/
├── xml/                        # XML data files
│   ├── clinvar_variation/      # VCV files (variant-centric)
│   │   ├── weekly_release/     # Weekly updates
│   │   └── archive/            # Monthly archives
│   └── RCV/                    # RCV files (variant-condition pairs)
│       ├── weekly_release/
│       └── archive/
├── vcf_GRCh37/                 # VCF files (GRCh37/hg19)
├── vcf_GRCh38/                 # VCF files (GRCh38/hg38)
├── tab_delimited/              # Tab-delimited summary files
│   ├── variant_summary.txt.gz
│   ├── var_citations.txt.gz
│   └── cross_references.txt.gz
└── README.txt                  # Format documentation
```
## Data Formats
### 1. XML Format (Primary Distribution)
XML provides the most comprehensive data with full submission details, evidence, and metadata.
#### VCV (Variation) Files
- **Purpose**: Variant-centric aggregation
- **Location**: `xml/clinvar_variation/`
- **Accession format**: VCV000000001.1
- **Best for**: Queries focused on specific variants regardless of condition
- **File naming**: `ClinVarVariationRelease_YYYY-MM.xml.gz` for monthly releases; `ClinVarVariationRelease_00-latest.xml.gz` points to the newest file
**VCV Record Structure:**
```xml
<VariationArchive VariationID="12345" VariationType="Deletion">
  <VariationName>NM_000059.3(BRCA2):c.1310_1313del (p.Lys437fs)</VariationName>
  <InterpretedRecord>
    <Interpretations>
      <InterpretedConditionList>
        <InterpretedCondition>Breast-ovarian cancer, familial 2</InterpretedCondition>
      </InterpretedConditionList>
      <ClinicalSignificance>Pathogenic</ClinicalSignificance>
      <ReviewStatus>reviewed by expert panel</ReviewStatus>
    </Interpretations>
  </InterpretedRecord>
  <ClinicalAssertionList>
    <!-- Individual submissions -->
  </ClinicalAssertionList>
</VariationArchive>
```
#### RCV (Record) Files
- **Purpose**: Variant-condition pair aggregation
- **Location**: `xml/RCV/`
- **Accession format**: RCV000000001.1
- **Best for**: Queries focused on variant-disease relationships
- **File naming**: `ClinVarRCVRelease_YYYY-MM-DD.xml.gz`
**Key differences from VCV:**
- One RCV per variant-condition combination
- A single variant may have multiple RCV records (different conditions)
- More focused on clinical interpretation per disease
#### SCV (Submission) Records
- **Format**: Individual submissions within VCV/RCV records
- **Accession format**: SCV000000001.1
- **Content**: Submitter-specific interpretations and evidence
### 2. VCF Format
Variant Call Format files for genomic analysis pipelines.
#### Locations
- **GRCh37/hg19**: `vcf_GRCh37/clinvar.vcf.gz`
- **GRCh38/hg38**: `vcf_GRCh38/clinvar.vcf.gz`
#### Content Limitations
- **Included**: Simple alleles with precise genomic coordinates
- **Excluded**:
  - Variants >10 kb
  - Cytogenetic variants
  - Complex structural variants
  - Variants without precise breakpoints
#### VCF INFO Fields
Key INFO fields in ClinVar VCF:
| Field | Description |
|-------|-------------|
| **ALLELEID** | ClinVar allele identifier |
| **CLNSIG** | Clinical significance |
| **CLNREVSTAT** | Review status |
| **CLNDN** | Condition name(s) |
| **CLNVC** | Variant type (SNV, deletion, etc.) |
| **CLNVCSO** | Sequence ontology term |
| **GENEINFO** | Gene symbol:gene ID |
| **MC** | Molecular consequence |
| **RS** | dbSNP rsID |
| **AF_ESP** | Allele frequency (ESP) |
| **AF_EXAC** | Allele frequency (ExAC) |
| **AF_TGP** | Allele frequency (1000 Genomes) |
#### Example VCF Line
```
#CHROM POS ID REF ALT QUAL FILTER INFO
13 32339912 rs80357382 A G . . ALLELEID=38447;CLNDN=Breast-ovarian_cancer,_familial_2;CLNSIG=Pathogenic;CLNREVSTAT=reviewed_by_expert_panel;GENEINFO=BRCA2:675
```
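For quick inspection without a VCF library, the INFO column of a line like the one above can be split into key-value pairs. The sketch below uses only string operations and takes the example line as input, written with real tab separators.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

line = ("13\t32339912\trs80357382\tA\tG\t.\t.\t"
        "ALLELEID=38447;CLNDN=Breast-ovarian_cancer,_familial_2;"
        "CLNSIG=Pathogenic;CLNREVSTAT=reviewed_by_expert_panel;GENEINFO=BRCA2:675")
info = parse_info(line.split("\t")[7])  # INFO is the eighth column
print(info["CLNSIG"], info["GENEINFO"].split(":")[0])
```

Splitting on `;` first matters because values such as `CLNDN` can themselves contain commas.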
### 3. Tab-Delimited Format
Summary files for quick analysis and database loading.
#### variant_summary.txt
Primary summary file with selected metadata for all genome-mapped variants.
**Key Columns:**
- `VariationID` - ClinVar variation identifier
- `Type` - Variant type (SNV, indel, CNV, etc.)
- `Name` - Variant name (typically HGVS)
- `GeneID` - NCBI Gene ID
- `GeneSymbol` - Gene symbol
- `ClinicalSignificance` - Classification
- `ReviewStatus` - Star rating level
- `LastEvaluated` - Date of last review
- `RS# (dbSNP)` - dbSNP rsID if available
- `Chromosome` - Chromosome
- `PositionVCF` - Position in VCF convention, on the assembly named in the row's `Assembly` column
- `ReferenceAlleleVCF` - Reference allele
- `AlternateAlleleVCF` - Alternate allele
- `Assembly` - Reference assembly (GRCh37/GRCh38)
- `PhenotypeIDS` - MedGen/OMIM/Orphanet IDs
- `Origin` - Germline, somatic, de novo, etc.
- `SubmitterCategories` - Submitter types (clinical, research, etc.)
**Example Usage:**
```bash
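# Extract all pathogenic BRCA1 variants.
# Columns are looked up by header name so the filter does not depend on column positions.
zcat variant_summary.txt.gz | \
  awk -F'\t' -v OFS='\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["GeneSymbol"]) == "BRCA1" && $(col["ClinicalSignificance"]) ~ /Pathogenic/ {
      print $(col["VariationID"]), $(col["GeneSymbol"]), $(col["ClinicalSignificance"]), $(col["ReviewStatus"])
    }'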
```
#### var_citations.txt
Cross-references to PubMed articles, dbSNP, and dbVar.
**Columns:**
- `AlleleID` - ClinVar allele ID
- `VariationID` - ClinVar variation ID
- `rs` - dbSNP rsID
- `nsv/esv` - dbVar IDs
- `citation_source` - Source of the citation (e.g., PubMed)
- `citation_id` - Identifier within that source (e.g., the PubMed ID)
#### cross_references.txt
Database cross-references with modification dates.
**Columns:**
- `VariationID`
- `Database` (OMIM, UniProtKB, GTR, etc.)
- `Identifier`
- `DateLastModified`
## Choosing the Right Format
### Use XML when:
- Need complete submission details
- Want to track evidence and criteria
- Building comprehensive variant databases
- Require full metadata and relationships
### Use VCF when:
- Integrating with genomic analysis pipelines
- Annotating variant calls from sequencing
- Need genomic coordinates for overlap analysis
- Working with standard bioinformatics tools
### Use Tab-Delimited when:
- Quick database queries and filters
- Loading into spreadsheets or databases
- Simple data extraction and statistics
- Don't need full evidence details
## Accession Types and Identifiers
### VCV (Variation Archive)
- **Format**: VCV000012345.6 (ID.version)
- **Scope**: Aggregates all data for a single variant
- **Versioning**: Increments when variant data changes
### RCV (Record)
- **Format**: RCV000056789.4
- **Scope**: One variant-condition interpretation
- **Versioning**: Increments when interpretation changes
### SCV (Submission)
- **Format**: SCV000098765.2
- **Scope**: Individual submitter's interpretation
- **Versioning**: Increments when submission updates
### Other Identifiers
- **VariationID**: Stable numeric identifier for variants
- **AlleleID**: Stable numeric identifier for alleles
- **dbSNP rsID**: Cross-reference to dbSNP (when available)
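The accession formats above can be validated and split with a regular expression. This sketch assumes a three-letter prefix followed by nine digits and an optional version, as in the examples shown here; the pattern is inferred from those examples rather than taken from a formal specification.

```python
import re

# Prefix, nine-digit ID, optional ".version" (assumed from the examples above)
ACCESSION_RE = re.compile(r"^(VCV|RCV|SCV)(\d{9})(?:\.(\d+))?$")

def parse_accession(accession):
    """Return (type, numeric_id, version) or raise ValueError."""
    m = ACCESSION_RE.match(accession.strip())
    if not m:
        raise ValueError(f"not a ClinVar accession: {accession!r}")
    kind, digits, version = m.groups()
    return kind, int(digits), int(version) if version else None

print(parse_accession("VCV000012345.6"))
```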
## File Processing Tips
### XML Processing
**Python with xml.etree:**
```python
import gzip
import xml.etree.ElementTree as ET

# Stream the file so the whole release never has to be held in memory
with gzip.open('ClinVarVariationRelease.xml.gz', 'rt') as f:
    for event, elem in ET.iterparse(f, events=('end',)):
        if elem.tag == 'VariationArchive':
            # Process variant
            variation_id = elem.attrib.get('VariationID')
            # Extract data
            elem.clear()  # Free memory
```
**Command-line with xmllint:**
```bash
# Extract pathogenic variants
zcat ClinVarVariationRelease.xml.gz | \
xmllint --xpath "//VariationArchive[.//ClinicalSignificance[text()='Pathogenic']]" -
```
### VCF Processing
**Using bcftools:**
```bash
# Filter by clinical significance
bcftools view -i 'INFO/CLNSIG~"Pathogenic"' clinvar.vcf.gz
# Extract specific genes
bcftools view -i 'INFO/GENEINFO~"BRCA"' clinvar.vcf.gz
# Annotate your VCF
bcftools annotate -a clinvar.vcf.gz -c INFO your_variants.vcf
```
**Using PyVCF:**
```python
import vcf

vcf_reader = vcf.Reader(filename='clinvar.vcf.gz')
for record in vcf_reader:
    # CLNSIG is returned as a list of values
    clnsig = record.INFO.get('CLNSIG', [])
    if 'Pathogenic' in clnsig:
        print(f"{record.CHROM}:{record.POS} - {clnsig}")
```
### Tab-Delimited Processing
**Using pandas:**
```python
import pandas as pd
# Read variant summary
df = pd.read_csv('variant_summary.txt.gz', sep='\t', compression='gzip')
# Filter pathogenic variants
pathogenic = df[df['ClinicalSignificance'].str.contains('Pathogenic', na=False)]
# Group by gene
gene_counts = pathogenic.groupby('GeneSymbol').size().sort_values(ascending=False)
```
## Data Quality Considerations
### Known Limitations
1. **VCF files exclude large variants** - Variants >10 kb not included
2. **Historical data may be less accurate** - Older submissions had fewer standardization requirements
3. **Conflicting interpretations exist** - Multiple submitters may disagree
4. **Not all variants have genomic coordinates** - Some HGVS expressions can't be mapped
### Validation Recommendations
- Cross-reference multiple data formats when possible
- Check review status (prefer ★★★ or ★★★★ ratings)
- Verify genomic coordinates against current genome builds
- Consider population frequency data (gnomAD) for context
- Review submission dates - newer data may be more accurate
## Bulk Download Scripts
### Download Latest Monthly Release
```bash
#!/bin/bash
# Download latest ClinVar monthly XML release
BASE_URL="ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation"
# Get latest file (grep -P requires GNU grep; YYYY-MM names sort chronologically)
LATEST=$(curl -s "${BASE_URL}/" | \
  grep -oP 'ClinVarVariationRelease_\d{4}-\d{2}\.xml\.gz' | \
  sort -u | tail -1)
# Download
wget "${BASE_URL}/${LATEST}"
```
### Download All Formats
```bash
#!/bin/bash
# Download ClinVar in all formats
FTP_BASE="ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar"
# XML
wget ${FTP_BASE}/xml/clinvar_variation/ClinVarVariationRelease_00-latest.xml.gz
# VCF (both assemblies)
wget ${FTP_BASE}/vcf_GRCh37/clinvar.vcf.gz
wget ${FTP_BASE}/vcf_GRCh38/clinvar.vcf.gz
# Tab-delimited
wget ${FTP_BASE}/tab_delimited/variant_summary.txt.gz
wget ${FTP_BASE}/tab_delimited/var_citations.txt.gz
```
## Additional Resources
- ClinVar FTP Primer: https://www.ncbi.nlm.nih.gov/clinvar/docs/ftp_primer/
- XML Schema Documentation: https://www.ncbi.nlm.nih.gov/clinvar/docs/xml_schemas/
- VCF Specification: https://samtools.github.io/hts-specs/VCFv4.3.pdf
- Release Notes: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/README.txt