ClinVar Data Formats and FTP Access

Overview

ClinVar provides bulk data downloads in multiple formats to support different research workflows. Data is distributed via FTP and updated on regular schedules.

FTP Access

Base URL

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/

Update Schedule

  • Monthly Releases: First Thursday of each month

    • Complete dataset with comprehensive documentation
    • Archived indefinitely for reproducibility
    • Includes release notes
  • Weekly Updates: Every Monday

    • Incremental updates to monthly release
    • Retained until next monthly release
    • Allows synchronization with ClinVar website
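
The schedule above can be turned into dates with a few lines of Python. This is a minimal sketch: `first_thursday` and `latest_monthly_release` are hypothetical helper names, and the result is the scheduled date only. Actual publication may shift, so treat it as a starting point for locating a release, not a guarantee.

```python
from datetime import date, timedelta


def first_thursday(year: int, month: int) -> date:
    """Return the first Thursday of the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday == 0, ..., Thursday == 3
    offset = (3 - first.weekday()) % 7
    return first + timedelta(days=offset)


def latest_monthly_release(today: date) -> date:
    """Most recent scheduled monthly release date on or before `today`."""
    candidate = first_thursday(today.year, today.month)
    if candidate <= today:
        return candidate
    # Before this month's first Thursday: fall back to the previous month.
    if today.month == 1:
        return first_thursday(today.year - 1, 12)
    return first_thursday(today.year, today.month - 1)


print(latest_monthly_release(date.today()))
```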

Directory Structure

pub/clinvar/
├── xml/                          # XML data files
│   ├── clinvar_variation/       # VCV files (variant-centric)
│   │   ├── weekly_release/      # Weekly updates
│   │   └── archive/             # Monthly archives
│   └── RCV/                     # RCV files (variant-condition pairs)
│       ├── weekly_release/
│       └── archive/
├── vcf_GRCh37/                  # VCF files (GRCh37/hg19)
├── vcf_GRCh38/                  # VCF files (GRCh38/hg38)
├── tab_delimited/               # Tab-delimited summary files
│   ├── variant_summary.txt.gz
│   ├── var_citations.txt.gz
│   └── cross_references.txt.gz
└── README.txt                   # Format documentation
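
As a rough illustration of how the tree maps to download URLs, the sketch below joins a base URL with relative paths named in this document. The `clinvar_url` helper and the `CLINVAR_PATHS` keys are hypothetical, and the HTTPS base is an assumption that is not stated in this document (NCBI serves the same tree over HTTPS, which helps where FTP is blocked).

```python
from urllib.parse import urljoin

# Assumption (not from this document): the same tree is reachable over HTTPS.
HTTPS_BASE = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/"
FTP_BASE = "ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/"

# Relative paths as listed in this document.
CLINVAR_PATHS = {
    "vcf_grch37": "vcf_GRCh37/clinvar.vcf.gz",
    "vcf_grch38": "vcf_GRCh38/clinvar.vcf.gz",
    "variant_summary": "tab_delimited/variant_summary.txt.gz",
    "var_citations": "tab_delimited/var_citations.txt.gz",
    "cross_references": "tab_delimited/cross_references.txt.gz",
}


def clinvar_url(key: str, base: str = HTTPS_BASE) -> str:
    """Build the full URL for one of the files listed above."""
    return urljoin(base, CLINVAR_PATHS[key])


print(clinvar_url("variant_summary"))
```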

Data Formats

1. XML Format (Primary Distribution)

XML provides the most comprehensive data with full submission details, evidence, and metadata.

VCV (Variation) Files

  • Purpose: Variant-centric aggregation
  • Location: xml/clinvar_variation/
  • Accession format: VCV000000001.1
  • Best for: Queries focused on specific variants regardless of condition
  • File naming: ClinVarVariationRelease_YYYY-MM-DD.xml.gz

VCV Record Structure (simplified; element names in the published XML depend on the schema version):

<VariationArchive VariationID="12345" VariationType="Deletion">
  <VariationName>NM_000059.3(BRCA2):c.1310_1313del (p.Lys437fs)</VariationName>
  <InterpretedRecord>
    <Interpretations>
      <InterpretedConditionList>
        <InterpretedCondition>Breast-ovarian cancer, familial 2</InterpretedCondition>
      </InterpretedConditionList>
      <ClinicalSignificance>Pathogenic</ClinicalSignificance>
      <ReviewStatus>reviewed by expert panel</ReviewStatus>
    </Interpretations>
  </InterpretedRecord>
  <ClinicalAssertionList>
    <!-- Individual submissions -->
  </ClinicalAssertionList>
</VariationArchive>
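
To show how such a record might be read, here is a small sketch that parses a record shaped like the simplified structure above with `xml.etree.ElementTree`. The `summarize` helper is hypothetical, and the element names come from the simplified example, so paths may need adjusting for the schema of a real release.

```python
import xml.etree.ElementTree as ET

# A record shaped like the simplified structure above (not a real release excerpt).
SAMPLE = """<VariationArchive VariationID="12345" VariationType="Deletion">
  <VariationName>NM_000059.3(BRCA2):c.1310_1313del (p.Lys437fs)</VariationName>
  <InterpretedRecord>
    <Interpretations>
      <InterpretedConditionList>
        <InterpretedCondition>Breast-ovarian cancer, familial 2</InterpretedCondition>
      </InterpretedConditionList>
      <ClinicalSignificance>Pathogenic</ClinicalSignificance>
      <ReviewStatus>reviewed by expert panel</ReviewStatus>
    </Interpretations>
  </InterpretedRecord>
</VariationArchive>"""


def summarize(archive: ET.Element) -> dict:
    """Collect the aggregate fields of one VariationArchive element."""
    return {
        "variation_id": archive.get("VariationID"),
        "name": archive.findtext("VariationName"),
        "significance": archive.findtext(".//ClinicalSignificance"),
        "review_status": archive.findtext(".//ReviewStatus"),
        "conditions": [c.text for c in archive.iter("InterpretedCondition")],
    }


record = summarize(ET.fromstring(SAMPLE))
print(record)
```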

RCV (Record) Files

  • Purpose: Variant-condition pair aggregation
  • Location: xml/RCV/
  • Accession format: RCV000000001.1
  • Best for: Queries focused on variant-disease relationships
  • File naming: ClinVarRCVRelease_YYYY-MM-DD.xml.gz

Key differences from VCV:

  • One RCV per variant-condition combination
  • A single variant may have multiple RCV records (different conditions)
  • More focused on clinical interpretation per disease

SCV (Submission) Records

  • Format: Individual submissions within VCV/RCV records
  • Accession format: SCV000000001.1
  • Content: Submitter-specific interpretations and evidence

2. VCF Format

Variant Call Format files for genomic analysis pipelines.

Locations

  • GRCh37/hg19: vcf_GRCh37/clinvar.vcf.gz
  • GRCh38/hg38: vcf_GRCh38/clinvar.vcf.gz

Content Limitations

  • Included: Simple alleles with precise genomic coordinates
  • Excluded:
    • Variants >10 kb
    • Cytogenetic variants
    • Complex structural variants
    • Variants without precise breakpoints

VCF INFO Fields

Key INFO fields in ClinVar VCF:

| Field | Description |
|-------|-------------|
| ALLELEID | ClinVar allele identifier |
| CLNSIG | Clinical significance |
| CLNREVSTAT | Review status |
| CLNDN | Condition name(s) |
| CLNVC | Variant type (SNV, deletion, etc.) |
| CLNVCSO | Sequence ontology term |
| GENEINFO | Gene symbol:gene ID |
| MC | Molecular consequence |
| RS | dbSNP rsID |
| AF_ESP | Allele frequency (ESP) |
| AF_EXAC | Allele frequency (ExAC) |
| AF_TGP | Allele frequency (1000 Genomes) |

Example VCF Line

#CHROM  POS     ID      REF  ALT  QUAL  FILTER  INFO
13      32339912  rs80357382  A    G    .     .     ALLELEID=38447;CLNDN=Breast-ovarian_cancer,_familial_2;CLNSIG=Pathogenic;CLNREVSTAT=reviewed_by_expert_panel;GENEINFO=BRCA2:675
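
For quick inspection without a VCF library, the INFO column of a line like the one above can be split by hand. This is a sketch only: `parse_info` is a hypothetical helper, values are deliberately not split on commas because ClinVar condition names contain them, and the assumption that multiple genes in GENEINFO are separated by `|` should be checked against the header of the file you download.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# The example line above, with real tab separators.
LINE = (
    "13\t32339912\trs80357382\tA\tG\t.\t.\t"
    "ALLELEID=38447;CLNDN=Breast-ovarian_cancer,_familial_2;"
    "CLNSIG=Pathogenic;CLNREVSTAT=reviewed_by_expert_panel;GENEINFO=BRCA2:675"
)

chrom, pos, variant_id, ref, alt, qual, flt, info = LINE.split("\t")
fields = parse_info(info)

# Assumption: multiple genes are "|"-separated, each as SYMBOL:GENE_ID.
genes = [tuple(g.split(":")) for g in fields["GENEINFO"].split("|")]
print(chrom, pos, fields["CLNSIG"], genes)
```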

3. Tab-Delimited Format

Summary files for quick analysis and database loading.

variant_summary.txt

Primary summary file with selected metadata for all genome-mapped variants.

Key Columns:

  • VariationID - ClinVar variation identifier
  • Type - Variant type (SNV, indel, CNV, etc.)
  • Name - Variant name (typically HGVS)
  • GeneID - NCBI Gene ID
  • GeneSymbol - Gene symbol
  • ClinicalSignificance - Classification
  • ReviewStatus - Review status text, which maps to the star rating
  • LastEvaluated - Date of last review
  • RS# (dbSNP) - dbSNP rsID if available
  • Chromosome - Chromosome
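  • PositionVCF - Position in VCF convention, on the assembly named in the Assembly column
  • ReferenceAlleleVCF - Reference allele
  • AlternateAlleleVCF - Alternate allele
  • Assembly - Reference assembly of the row (GRCh37 or GRCh38); a mapped variant can appear once per assembly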
  • PhenotypeIDS - MedGen/OMIM/Orphanet IDs
  • Origin - Germline, somatic, de novo, etc.
  • SubmitterCategories - Submitter types (clinical, research, etc.)

Example Usage:

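# Extract all pathogenic BRCA1 variants.
# Columns are resolved from the header row rather than hard-coded positions.
zcat variant_summary.txt.gz | \
  awk -F'\t' -v OFS='\t' '
    NR==1 { for (i=1; i<=NF; i++) col[$i]=i; next }
    $(col["GeneSymbol"])=="BRCA1" && $(col["ClinicalSignificance"])~/Pathogenic/ {
      print $(col["VariationID"]), $(col["GeneSymbol"]), $(col["ClinicalSignificance"]), $(col["ReviewStatus"]) }'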
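
The same filter can be sketched in Python with the standard `csv` module, selecting columns by header name so the code does not depend on column order. `pathogenic_rows` is a hypothetical helper, and the in-memory sample uses made-up rows with only a few of the real columns.

```python
import csv
import io


def pathogenic_rows(handle, gene: str):
    """Yield rows for `gene` whose ClinicalSignificance contains 'Pathogenic'."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["GeneSymbol"] == gene and "Pathogenic" in row["ClinicalSignificance"]:
            yield row


# Made-up rows standing in for variant_summary.txt (illustration only).
SAMPLE = (
    "VariationID\tGeneSymbol\tClinicalSignificance\tReviewStatus\n"
    "1\tBRCA1\tPathogenic\treviewed by expert panel\n"
    "2\tBRCA1\tBenign\tcriteria provided, single submitter\n"
    "3\tBRCA2\tPathogenic\tcriteria provided, single submitter\n"
)

hits = list(pathogenic_rows(io.StringIO(SAMPLE), "BRCA1"))
for row in hits:
    print(row["VariationID"], row["ClinicalSignificance"], row["ReviewStatus"])
```

For the real file, pass a handle opened with `gzip.open("variant_summary.txt.gz", "rt", encoding="utf-8", newline="")` in place of the `StringIO` object.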

var_citations.txt

Cross-references to PubMed articles, dbSNP, and dbVar.

Columns:

  • AlleleID - ClinVar allele ID
  • VariationID - ClinVar variation ID
  • rs - dbSNP rsID
  • nsv/esv - dbVar IDs
  • PubMedID - PubMed citation

cross_references.txt

Database cross-references with modification dates.

Columns:

  • VariationID
  • Database (OMIM, UniProtKB, GTR, etc.)
  • Identifier
  • DateLastModified

Choosing the Right Format

Use XML when you:

  • Need complete submission details
  • Want to track evidence and criteria
  • Are building comprehensive variant databases
  • Require full metadata and relationships

Use VCF when you:

  • Are integrating with genomic analysis pipelines
  • Are annotating variant calls from sequencing
  • Need genomic coordinates for overlap analysis
  • Work with standard bioinformatics tools

Use Tab-Delimited when you:

  • Want quick database queries and filters
  • Are loading data into spreadsheets or databases
  • Need simple data extraction and statistics
  • Don't need full evidence details

Accession Types and Identifiers

VCV (Variation Archive)

  • Format: VCV000012345.6 (ID.version)
  • Scope: Aggregates all data for a single variant
  • Versioning: Increments when variant data changes

RCV (Record)

  • Format: RCV000056789.4
  • Scope: One variant-condition interpretation
  • Versioning: Increments when interpretation changes

SCV (Submission)

  • Format: SCV000098765.2
  • Scope: Individual submitter's interpretation
  • Versioning: Increments when submission updates

Other Identifiers

  • VariationID: Stable numeric identifier for variants
  • AlleleID: Stable numeric identifier for alleles
  • dbSNP rsID: Cross-reference to dbSNP (when available)
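
The ID.version pattern shared by the three accession types lends itself to a small parser. The sketch below is illustrative: `Accession` and `parse_accession` are hypothetical names, and the regular expression accepts any digit count rather than enforcing the nine digits seen in the examples.

```python
import re
from typing import NamedTuple, Optional


class Accession(NamedTuple):
    kind: str                # "VCV", "RCV", or "SCV"
    number: int              # numeric part without zero padding
    version: Optional[int]   # None when no ".N" suffix is present


_ACCESSION_RE = re.compile(r"(VCV|RCV|SCV)(\d+)(?:\.(\d+))?")


def parse_accession(text: str) -> Accession:
    """Split an accession such as 'VCV000012345.6' into its parts."""
    match = _ACCESSION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a ClinVar accession: {text!r}")
    kind, number, version = match.groups()
    return Accession(kind, int(number), int(version) if version else None)


print(parse_accession("VCV000012345.6"))
```

Keeping the version optional lets the same function handle unversioned accessions, which can be useful when you want the current record regardless of version.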

File Processing Tips

XML Processing

Python with xml.etree:

import gzip
import xml.etree.ElementTree as ET

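# Open in binary mode so the XML parser handles the encoding declaration itself
with gzip.open('ClinVarVariationRelease.xml.gz', 'rb') as f:
    for event, elem in ET.iterparse(f, events=('end',)):
        if elem.tag == 'VariationArchive':
            # Process one complete variant record
            variation_id = elem.attrib.get('VariationID')
            # Extract data
            elem.clear()  # Release this record's children and text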
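
One caveat with this pattern is that the document root still holds a reference to every cleared record, so memory can grow slowly over a full release. A common workaround, sketched here under the assumption that records are direct children of the root, is to capture the root from the first `start` event and clear it after each record. `iter_variation_ids` is a hypothetical helper, and the `Release` wrapper element in the sample is only a stand-in.

```python
import io
import xml.etree.ElementTree as ET


def iter_variation_ids(source):
    """Stream VariationID values while keeping memory use roughly constant."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the document root
    for event, elem in context:
        if event == "end" and elem.tag == "VariationArchive":
            yield elem.get("VariationID")
            root.clear()  # drop processed records still referenced by the root


# Stand-in document; a real file would be opened with gzip.open(path, "rb").
SAMPLE = (
    b"<Release>"
    b'<VariationArchive VariationID="1"><VariationName>a</VariationName></VariationArchive>'
    b'<VariationArchive VariationID="2"/>'
    b"</Release>"
)

ids = list(iter_variation_ids(io.BytesIO(SAMPLE)))
print(ids)
```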

Command-line with xmllint:

# Extract pathogenic variants
# Note: xmllint --xpath loads the whole document into memory, so use it on a small extract
zcat ClinVarVariationRelease.xml.gz | \
  xmllint --xpath "//VariationArchive[.//ClinicalSignificance[text()='Pathogenic']]" -

VCF Processing

Using bcftools:

# Filter by clinical significance
bcftools view -i 'INFO/CLNSIG~"Pathogenic"' clinvar.vcf.gz

# Extract specific genes
bcftools view -i 'INFO/GENEINFO~"BRCA"' clinvar.vcf.gz

# Annotate your VCF
bcftools annotate -a clinvar.vcf.gz -c INFO your_variants.vcf

Using PyVCF:

import vcf

vcf_reader = vcf.Reader(filename='clinvar.vcf.gz')
for record in vcf_reader:
    clnsig = record.INFO.get('CLNSIG', [])
    if 'Pathogenic' in clnsig:
        print(f"{record.CHROM}:{record.POS} - {clnsig}")

Tab-Delimited Processing

Using pandas:

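import pandas as pd

# Read variant summary (low_memory=False avoids mixed-dtype warnings on this large file)
df = pd.read_csv('variant_summary.txt.gz', sep='\t', compression='gzip', low_memory=False)

# Keep one assembly so each variant is counted once
df = df[df['Assembly'] == 'GRCh38']

# Filter pathogenic variants (case-sensitive match)
pathogenic = df[df['ClinicalSignificance'].str.contains('Pathogenic', na=False)]

# Count pathogenic variants per gene
gene_counts = pathogenic.groupby('GeneSymbol').size().sort_values(ascending=False)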

Data Quality Considerations

Known Limitations

  1. VCF files exclude large variants - Variants >10 kb not included
  2. Historical data may be less accurate - Older submissions had fewer standardization requirements
  3. Conflicting interpretations exist - Multiple submitters may disagree
  4. Not all variants have genomic coordinates - Some HGVS expressions can't be mapped

Validation Recommendations

  • Cross-reference multiple data formats when possible
  • Check review status (prefer ★★★ or ★★★★ ratings)
  • Verify genomic coordinates against current genome builds
  • Consider population frequency data (gnomAD) for context
  • Review submission dates - newer data may be more accurate
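
The review-status check can be automated by mapping status text to a star count. The mapping below reflects ClinVar's review-status levels as commonly documented, including some older wordings. Treat it as an assumption to verify against current ClinVar documentation. `review_stars` and `is_high_confidence` are hypothetical helpers, and unknown statuses default to zero stars so they are never treated as high confidence.

```python
# Assumed mapping from review status text to star count (verify before relying on it).
REVIEW_STARS = {
    "practice guideline": 4,
    "reviewed by expert panel": 3,
    "criteria provided, multiple submitters, no conflicts": 2,
    "criteria provided, conflicting classifications": 1,
    "criteria provided, conflicting interpretations": 1,  # older wording
    "criteria provided, single submitter": 1,
    "no assertion criteria provided": 0,
    "no classification provided": 0,
    "no assertion provided": 0,  # older wording
}


def review_stars(status: str) -> int:
    """Return the star count for a review status; unknown text counts as 0."""
    # VCF INFO values use underscores where the other formats use spaces.
    normalized = status.strip().replace("_", " ").lower()
    return REVIEW_STARS.get(normalized, 0)


def is_high_confidence(status: str, minimum: int = 3) -> bool:
    """True when the status reaches the preferred three- or four-star level."""
    return review_stars(status) >= minimum


print(review_stars("reviewed_by_expert_panel"))
```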

Bulk Download Scripts

Download Latest Monthly Release

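#!/bin/bash
# Download latest ClinVar monthly XML release

BASE_URL="ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation"

# Get latest file: accept YYYY-MM or YYYY-MM-DD names and sort so the newest is last
LATEST=$(curl -s "${BASE_URL}/" | \
         grep -oP 'ClinVarVariationRelease_\d{4}-\d{2}(-\d{2})?\.xml\.gz' | \
         sort -u | tail -1)

# Stop if the listing contained no matching file
[ -n "${LATEST}" ] || { echo "No release file found" >&2; exit 1; }

# Download
wget "${BASE_URL}/${LATEST}"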
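
The file-selection step can also be written in Python, which makes the "newest dated name" logic easy to test in isolation. This is a sketch: `latest_release` is a hypothetical helper that works on listing text you have already fetched, it accepts both YYYY-MM and YYYY-MM-DD names, and it relies on ISO-style dates sorting correctly as strings.

```python
import re
from typing import Optional

RELEASE_RE = re.compile(r"ClinVarVariationRelease_(\d{4}-\d{2}(?:-\d{2})?)\.xml\.gz")


def latest_release(listing: str) -> Optional[str]:
    """Return the newest dated release file name in a directory listing, if any."""
    # Map file name -> date string; duplicates (e.g. from .md5 entries) collapse.
    dated = {m.group(0): m.group(1) for m in RELEASE_RE.finditer(listing)}
    if not dated:
        return None
    return max(dated, key=dated.get)


# Made-up listing text for illustration; entries are deliberately out of order.
LISTING = """\
ClinVarVariationRelease_2025-11.xml.gz
ClinVarVariationRelease_2025-09.xml.gz
ClinVarVariationRelease_00-latest.xml.gz
ClinVarVariationRelease_2025-11.xml.gz.md5
"""

print(latest_release(LISTING))
```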

Download All Formats

#!/bin/bash
# Download ClinVar in all formats

FTP_BASE="ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar"

# XML
wget ${FTP_BASE}/xml/clinvar_variation/ClinVarVariationRelease_00-latest.xml.gz

# VCF (both assemblies)
wget ${FTP_BASE}/vcf_GRCh37/clinvar.vcf.gz
wget ${FTP_BASE}/vcf_GRCh38/clinvar.vcf.gz

# Tab-delimited
wget ${FTP_BASE}/tab_delimited/variant_summary.txt.gz
wget ${FTP_BASE}/tab_delimited/var_citations.txt.gz

Additional Resources