ClinVar Data Formats and FTP Access
Overview
ClinVar provides bulk data downloads in multiple formats to support different research workflows. Data is distributed via FTP and updated on regular schedules.
FTP Access
Base URL
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/
Update Schedule
- Monthly Releases: First Thursday of each month
  - Complete dataset with comprehensive documentation
  - Archived indefinitely for reproducibility
  - Includes release notes
- Weekly Updates: Every Monday
  - Incremental updates to monthly release
  - Retained until next monthly release
  - Allows synchronization with ClinVar website
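To plan a sync job around the monthly schedule, the nominal release date can be computed from the calendar. This is a minimal sketch that only encodes the "first Thursday" rule stated above; the function name `first_thursday` is illustrative, and actual release dates may shift, so treat the result as an estimate.

```python
import calendar
from datetime import date


def first_thursday(year: int, month: int) -> date:
    """Return the first Thursday of the given month."""
    first = date(year, month, 1)
    # date.weekday() counts Monday as 0, so calendar.THURSDAY is 3.
    offset = (calendar.THURSDAY - first.weekday()) % 7
    return date(year, month, 1 + offset)


print(first_thursday(2024, 1))
```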
Directory Structure
pub/clinvar/
├── xml/ # XML data files
│ ├── clinvar_variation/ # VCV files (variant-centric)
│ │ ├── weekly_release/ # Weekly updates
│ │ └── archive/ # Monthly archives
│ └── RCV/ # RCV files (variant-condition pairs)
│ ├── weekly_release/
│ └── archive/
├── vcf_GRCh37/ # VCF files (GRCh37/hg19)
├── vcf_GRCh38/ # VCF files (GRCh38/hg38)
├── tab_delimited/ # Tab-delimited summary files
│ ├── variant_summary.txt.gz
│ ├── var_citations.txt.gz
│ └── cross_references.txt.gz
└── README.txt # Format documentation
Data Formats
1. XML Format (Primary Distribution)
XML provides the most comprehensive data with full submission details, evidence, and metadata.
VCV (Variation) Files
- Purpose: Variant-centric aggregation
- Location: xml/clinvar_variation/
- Accession format: VCV000000001.1
- Best for: Queries focused on specific variants regardless of condition
- File naming: ClinVarVariationRelease_YYYY-MM-DD.xml.gz
VCV Record Structure:
<VariationArchive VariationID="12345" VariationType="Deletion">
  <VariationName>NM_000059.3(BRCA2):c.1310_1313del (p.Lys437fs)</VariationName>
  <InterpretedRecord>
    <Interpretations>
      <InterpretedConditionList>
        <InterpretedCondition>Breast-ovarian cancer, familial 2</InterpretedCondition>
      </InterpretedConditionList>
      <ClinicalSignificance>Pathogenic</ClinicalSignificance>
      <ReviewStatus>reviewed by expert panel</ReviewStatus>
    </Interpretations>
  </InterpretedRecord>
  <ClinicalAssertionList>
    <!-- Individual submissions -->
  </ClinicalAssertionList>
</VariationArchive>
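As a rough illustration of reading such a record, the sketch below parses a trimmed copy of the example with Python's standard `xml.etree.ElementTree`. The element names are taken from the simplified example above. Real release files follow the published ClinVar schema and may nest or name these elements differently, so the paths here are assumptions to verify against an actual file. The helper name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the illustrative record shown above.
SAMPLE = """
<VariationArchive VariationID="12345">
  <VariationName>NM_000059.3(BRCA2):c.1310_1313del (p.Lys437fs)</VariationName>
  <InterpretedRecord>
    <Interpretations>
      <InterpretedConditionList>
        <InterpretedCondition>Breast-ovarian cancer, familial 2</InterpretedCondition>
      </InterpretedConditionList>
      <ClinicalSignificance>Pathogenic</ClinicalSignificance>
      <ReviewStatus>reviewed by expert panel</ReviewStatus>
    </Interpretations>
  </InterpretedRecord>
</VariationArchive>
"""


def summarize(xml_text: str) -> dict:
    """Pull the aggregate fields out of one VariationArchive element."""
    root = ET.fromstring(xml_text.strip())
    return {
        "variation_id": root.get("VariationID"),
        "name": root.findtext("VariationName"),
        # ".//" searches at any depth below the record root.
        "significance": root.findtext(".//ClinicalSignificance"),
        "review_status": root.findtext(".//ReviewStatus"),
        "conditions": [c.text for c in root.iter("InterpretedCondition")],
    }


print(summarize(SAMPLE))
```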
RCV (Record) Files
- Purpose: Variant-condition pair aggregation
- Location: xml/RCV/
- Accession format: RCV000000001.1
- Best for: Queries focused on variant-disease relationships
- File naming: ClinVarRCVRelease_YYYY-MM-DD.xml.gz
Key differences from VCV:
- One RCV per variant-condition combination
- A single variant may have multiple RCV records (different conditions)
- More focused on clinical interpretation per disease
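The one-to-many relationship between a variant and its RCV records can be pictured as a simple grouping. The rows below are made-up placeholders, not real ClinVar records, and `rcvs_by_variant` is a hypothetical helper. The sketch only shows the shape of the data.

```python
from collections import defaultdict

# Hypothetical (VariationID, RCV accession, condition) rows for illustration only.
ROWS = [
    (12345, "RCV000000001.1", "Condition A"),
    (12345, "RCV000000002.1", "Condition B"),
    (67890, "RCV000000003.1", "Condition A"),
]


def rcvs_by_variant(rows):
    """Group RCV records under the variant they belong to."""
    grouped = defaultdict(list)
    for variation_id, rcv, condition in rows:
        grouped[variation_id].append((rcv, condition))
    return dict(grouped)


for variation_id, records in rcvs_by_variant(ROWS).items():
    print(variation_id, records)
```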
SCV (Submission) Records
- Format: Individual submissions within VCV/RCV records
- Accession format: SCV000000001.1
- Content: Submitter-specific interpretations and evidence
2. VCF Format
Variant Call Format files for genomic analysis pipelines.
Locations
- GRCh37/hg19: vcf_GRCh37/clinvar.vcf.gz
- GRCh38/hg38: vcf_GRCh38/clinvar.vcf.gz
Content Limitations
- Included: Simple alleles with precise genomic coordinates
- Excluded:
  - Variants >10 kb
  - Cytogenetic variants
  - Complex structural variants
  - Variants without precise breakpoints
VCF INFO Fields
Key INFO fields in ClinVar VCF:
| Field | Description |
|---|---|
| ALLELEID | ClinVar allele identifier |
| CLNSIG | Clinical significance |
| CLNREVSTAT | Review status |
| CLNDN | Condition name(s) |
| CLNVC | Variant type (SNV, deletion, etc.) |
| CLNVCSO | Sequence ontology term |
| GENEINFO | Gene symbol:gene ID |
| MC | Molecular consequence |
| RS | dbSNP rsID |
| AF_ESP | Allele frequency (ESP) |
| AF_EXAC | Allele frequency (ExAC) |
| AF_TGP | Allele frequency (1000 Genomes) |
Example VCF Line
#CHROM POS ID REF ALT QUAL FILTER INFO
13 32339912 rs80357382 A G . . ALLELEID=38447;CLNDN=Breast-ovarian_cancer,_familial_2;CLNSIG=Pathogenic;CLNREVSTAT=reviewed_by_expert_panel;GENEINFO=BRCA2:675
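For quick inspection without a VCF library, the INFO column can be split by hand. The sketch below parses the example line above using only the standard library. The function names are illustrative. It assumes INFO entries are separated by `;`, that GENEINFO pairs are `symbol:id` joined by `|` when several genes are listed, and that underscores in CLNDN stand in for spaces.

```python
SAMPLE_LINE = "\t".join([
    "13", "32339912", "rs80357382", "A", "G", ".", ".",
    "ALLELEID=38447;CLNDN=Breast-ovarian_cancer,_familial_2;"
    "CLNSIG=Pathogenic;CLNREVSTAT=reviewed_by_expert_panel;GENEINFO=BRCA2:675",
])


def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a key/value dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_geneinfo(value: str) -> list:
    """Turn 'SYMBOL:ID|SYMBOL:ID' into a list of (symbol, gene_id) tuples."""
    genes = []
    for entry in value.split("|"):
        symbol, _, gene_id = entry.partition(":")
        genes.append((symbol, int(gene_id)))
    return genes


def parse_record(line: str) -> dict:
    """Parse one tab-separated VCF data line into a small dict."""
    cols = line.rstrip("\n").split("\t")
    return {
        "chrom": cols[0],
        "pos": int(cols[1]),
        "id": cols[2],
        "ref": cols[3],
        "alt": cols[4],
        "info": parse_info(cols[7]),
    }


record = parse_record(SAMPLE_LINE)
print(record["info"]["CLNSIG"], parse_geneinfo(record["info"]["GENEINFO"]))
print(record["info"]["CLNDN"].replace("_", " "))
```

For production pipelines a dedicated VCF parser is the safer choice, since it also handles header metadata and value escaping.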
3. Tab-Delimited Format
Summary files for quick analysis and database loading.
variant_summary.txt
Primary summary file with selected metadata for all genome-mapped variants.
Key Columns:
- VariationID: ClinVar variation identifier
- Type: Variant type (SNV, indel, CNV, etc.)
- Name: Variant name (typically HGVS)
- GeneID: NCBI Gene ID
- GeneSymbol: Gene symbol
- ClinicalSignificance: Classification
- ReviewStatus: Review status (corresponds to the star rating level)
- LastEvaluated: Date of last review
- RS# (dbSNP): dbSNP rsID if available
- Chromosome: Chromosome
- PositionVCF: VCF-style position on the assembly named in the Assembly column
- ReferenceAlleleVCF: Reference allele
- AlternateAlleleVCF: Alternate allele
- Assembly: Reference assembly (GRCh37/GRCh38)
- PhenotypeIDS: MedGen/OMIM/Orphanet IDs
- Origin: Germline, somatic, de novo, etc.
- SubmitterCategories: Submitter types (clinical, research, etc.)
Example Usage:
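# Extract all pathogenic BRCA1 variants.
# Columns are located by header name because hard-coded positions break when the layout changes.
zcat variant_summary.txt.gz | \
awk -F'\t' -v OFS='\t' '
NR==1 { for (i=1; i<=NF; i++) col[$i]=i; next }
$(col["GeneSymbol"])=="BRCA1" && $(col["ClinicalSignificance"])~/Pathogenic/ {
print $(col["VariationID"]), $(col["GeneSymbol"]), $(col["ClinicalSignificance"]), $(col["ReviewStatus"]) }'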
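The same kind of filter can be sketched in Python with `csv.DictReader`, which addresses columns by header name instead of position. The sample rows below are invented and carry only four of the file's columns. The extra `Assembly` check is an assumption about usage: because each row is tied to one assembly, filtering on it avoids counting a variant once per genome build. The helper name `pathogenic_rows` is illustrative.

```python
import csv
import io

# Invented sample with a subset of the real columns, for illustration only.
SAMPLE = (
    "VariationID\tGeneSymbol\tClinicalSignificance\tAssembly\n"
    "1001\tBRCA1\tPathogenic\tGRCh37\n"
    "1001\tBRCA1\tPathogenic\tGRCh38\n"
    "1002\tBRCA1\tBenign\tGRCh38\n"
    "1003\tBRCA2\tPathogenic\tGRCh38\n"
)


def pathogenic_rows(handle, gene, assembly="GRCh38"):
    """Yield rows for one gene and assembly whose classification mentions 'Pathogenic'."""
    # QUOTE_NONE keeps stray quote characters in free-text fields from confusing the parser.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if (row["GeneSymbol"] == gene
                and "Pathogenic" in row["ClinicalSignificance"]
                and row["Assembly"] == assembly):
            yield row


hits = list(pathogenic_rows(io.StringIO(SAMPLE), "BRCA1"))
print([row["VariationID"] for row in hits])
```

For the real file, pass `gzip.open("variant_summary.txt.gz", "rt", newline="")` as the handle instead of the in-memory sample.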
var_citations.txt
Cross-references to PubMed articles, dbSNP, and dbVar.
Columns:
- AlleleID: ClinVar allele ID
- VariationID: ClinVar variation ID
- rs: dbSNP rsID
- nsv/esv: dbVar IDs
- PubMedID: PubMed citation
cross_references.txt
Database cross-references with modification dates.
Columns:
- VariationID
- Database (OMIM, UniProtKB, GTR, etc.)
- Identifier
- DateLastModified
Choosing the Right Format
Use XML when:
- Need complete submission details
- Want to track evidence and criteria
- Building comprehensive variant databases
- Require full metadata and relationships
Use VCF when:
- Integrating with genomic analysis pipelines
- Annotating variant calls from sequencing
- Need genomic coordinates for overlap analysis
- Working with standard bioinformatics tools
Use Tab-Delimited when:
- Quick database queries and filters
- Loading into spreadsheets or databases
- Simple data extraction and statistics
- Don't need full evidence details
Accession Types and Identifiers
VCV (Variation Archive)
- Format: VCV000012345.6 (ID.version)
- Scope: Aggregates all data for a single variant
- Versioning: Increments when variant data changes
RCV (Record)
- Format: RCV000056789.4
- Scope: One variant-condition interpretation
- Versioning: Increments when interpretation changes
SCV (Submission)
- Format: SCV000098765.2
- Scope: Individual submitter's interpretation
- Versioning: Increments when submission updates
Other Identifiers
- VariationID: Stable numeric identifier for variants
- AlleleID: Stable numeric identifier for alleles
- dbSNP rsID: Cross-reference to dbSNP (when available)
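When accessions arrive as plain strings, it can help to split them into prefix, numeric ID, and version before comparing or storing them. The sketch below assumes the layout shown in the examples above: a three-letter prefix, nine zero-padded digits, a dot, and an integer version. The `Accession` type and `parse_accession` function are illustrative names.

```python
import re
from typing import NamedTuple


class Accession(NamedTuple):
    kind: str      # "VCV", "RCV", or "SCV"
    number: int    # numeric ID with zero padding removed
    version: int   # increments when the record changes


_ACCESSION_RE = re.compile(r"^(VCV|RCV|SCV)(\d{9})\.(\d+)$")


def parse_accession(text: str) -> Accession:
    """Split an accession such as 'VCV000012345.6' into its parts."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a versioned ClinVar accession: {text!r}")
    kind, number, version = match.groups()
    return Accession(kind, int(number), int(version))


print(parse_accession("VCV000012345.6"))
```

Keeping the version as a separate integer makes it straightforward to detect when a stored record is older than the one in a new release.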
File Processing Tips
XML Processing
Python with xml.etree:
import gzip
import xml.etree.ElementTree as ET

# Stream the file so the full release never has to fit in memory
with gzip.open('ClinVarVariationRelease.xml.gz', 'rb') as f:
    for event, elem in ET.iterparse(f, events=('end',)):
        if elem.tag == 'VariationArchive':
            # Process one complete variant record
            variation_id = elem.attrib.get('VariationID')
            # Extract data
            elem.clear()  # Free memory only after the whole record is handled
Command-line with xmllint:
# Extract pathogenic variants
zcat ClinVarVariationRelease.xml.gz | \
xmllint --xpath "//VariationArchive[.//ClinicalSignificance[text()='Pathogenic']]" -
VCF Processing
Using bcftools:
# Filter by clinical significance
bcftools view -i 'INFO/CLNSIG~"Pathogenic"' clinvar.vcf.gz
# Extract specific genes
bcftools view -i 'INFO/GENEINFO~"BRCA"' clinvar.vcf.gz
# Annotate your VCF
bcftools annotate -a clinvar.vcf.gz -c INFO your_variants.vcf
Using PyVCF:
import vcf

vcf_reader = vcf.Reader(filename='clinvar.vcf.gz')
for record in vcf_reader:
    clnsig = record.INFO.get('CLNSIG', [])
    if 'Pathogenic' in clnsig:
        print(f"{record.CHROM}:{record.POS} - {clnsig}")
Tab-Delimited Processing
Using pandas:
import pandas as pd
# Read variant summary
df = pd.read_csv('variant_summary.txt.gz', sep='\t', compression='gzip')
# Filter pathogenic variants
pathogenic = df[df['ClinicalSignificance'].str.contains('Pathogenic', na=False)]
# Group by gene
gene_counts = pathogenic.groupby('GeneSymbol').size().sort_values(ascending=False)
Data Quality Considerations
Known Limitations
- VCF files exclude large variants - Variants >10 kb not included
- Historical data may be less accurate - Older submissions had fewer standardization requirements
- Conflicting interpretations exist - Multiple submitters may disagree
- Not all variants have genomic coordinates - Some HGVS expressions can't be mapped
Validation Recommendations
- Cross-reference multiple data formats when possible
- Check review status (prefer ★★★ or ★★★★ ratings)
- Verify genomic coordinates against current genome builds
- Consider population frequency data (gnomAD) for context
- Review submission dates - newer data may be more accurate
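The review-status check can be automated by mapping status strings to star counts. The table below reflects ClinVar's published review-status levels as best understood here. Exact wording has changed across releases (for example "interpretations" versus "classifications"), so treat the keys as assumptions to verify against the data in hand. The names `REVIEW_STARS`, `stars`, and `is_high_confidence` are illustrative, and unknown strings return `None` rather than a guessed value.

```python
from typing import Optional

# Assumed mapping of review status text to star rating; verify against current ClinVar docs.
REVIEW_STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, conflicting interpretations": 1,
    "criteria provided, conflicting classifications": 1,
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
    "no assertion provided": 0,
}


def stars(review_status: str) -> Optional[int]:
    """Return the star rating for a review status, or None if it is not recognized."""
    # VCF files encode spaces as underscores, so normalize both spellings.
    key = review_status.replace("_", " ").strip().lower()
    return REVIEW_STARS.get(key)


def is_high_confidence(review_status: str, minimum: int = 3) -> bool:
    """True when the status is recognized and meets the minimum star rating."""
    rating = stars(review_status)
    return rating is not None and rating >= minimum


print(stars("reviewed_by_expert_panel"), is_high_confidence("criteria provided, single submitter"))
```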
Bulk Download Scripts
Download Latest Monthly Release
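#!/bin/bash
# Download latest ClinVar monthly XML release
BASE_URL="ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation"
# Get latest file: list the directory, keep release file names, take the newest.
# The pattern expects a YYYY-MM date stamp; adjust it if the listing shows a different format.
LATEST=$(curl -s "${BASE_URL}/" | \
grep -oP 'ClinVarVariationRelease_\d{4}-\d{2}\.xml\.gz' | \
sort -u | tail -1)
# Stop if nothing matched instead of calling wget with an empty name
[ -n "${LATEST}" ] || { echo "No release file found" >&2; exit 1; }
# Download
wget "${BASE_URL}/${LATEST}"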
Download All Formats
#!/bin/bash
# Download ClinVar in all formats
FTP_BASE="ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar"
# XML
wget ${FTP_BASE}/xml/clinvar_variation/ClinVarVariationRelease_00-latest.xml.gz
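# VCF (both assemblies); distinct output names keep the second file from colliding with the first
wget -O clinvar_GRCh37.vcf.gz ${FTP_BASE}/vcf_GRCh37/clinvar.vcf.gz
wget -O clinvar_GRCh38.vcf.gz ${FTP_BASE}/vcf_GRCh38/clinvar.vcf.gz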
# Tab-delimited
wget ${FTP_BASE}/tab_delimited/variant_summary.txt.gz
wget ${FTP_BASE}/tab_delimited/var_citations.txt.gz
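A Python equivalent of the script above might look like the following sketch. It assumes the same files are reachable over HTTPS on the same host, as the README link in Additional Resources suggests, and it mirrors the server's directory layout locally so the two `clinvar.vcf.gz` files stay separate. `plan_downloads` and `download_all` are illustrative names. The download call is left commented out because the files are large.

```python
import urllib.request
from pathlib import Path

# Assumption: the FTP tree is also served over HTTPS at this base.
BASE = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar"

RELATIVE_PATHS = [
    "xml/clinvar_variation/ClinVarVariationRelease_00-latest.xml.gz",
    "vcf_GRCh37/clinvar.vcf.gz",
    "vcf_GRCh38/clinvar.vcf.gz",
    "tab_delimited/variant_summary.txt.gz",
    "tab_delimited/var_citations.txt.gz",
]


def plan_downloads(dest):
    """Return (url, local_path) pairs without touching the network."""
    dest = Path(dest)
    # Keeping the remote sub-directories avoids name clashes between assemblies.
    return [(f"{BASE}/{rel}", dest / rel) for rel in RELATIVE_PATHS]


def download_all(dest):
    """Fetch every planned file, creating local directories as needed."""
    for url, target in plan_downloads(dest):
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(target))


for url, target in plan_downloads("clinvar_data"):
    print(url, "->", target)
# download_all("clinvar_data")  # uncomment to actually fetch the files
```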
Additional Resources
- ClinVar FTP Primer: https://www.ncbi.nlm.nih.gov/clinvar/docs/ftp_primer/
- XML Schema Documentation: https://www.ncbi.nlm.nih.gov/clinvar/docs/xml_schemas/
- VCF Specification: https://samtools.github.io/hts-specs/VCFv4.3.pdf
- Release Notes: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/README.txt