Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/clinvar-database/references/api_reference.md
+++ b/skills/clinvar-database/references/api_reference.md
@@ -0,0 +1,227 @@
+# ClinVar API and Data Access Reference
+
+## Overview
+
+ClinVar provides multiple methods for programmatic data access:
+- **E-utilities** - NCBI's REST API for searching and retrieving data
+- **Entrez Direct** - Command-line tools for UNIX environments
+- **FTP Downloads** - Bulk data files in XML, VCF, and tab-delimited formats
+- **Submission API** - REST API for submitting variant interpretations
+
+## E-utilities API
+
+### Base URL
+```
+https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
+```
+
+### Supported Operations
+
+#### 1. esearch - Search for Records
+Search ClinVar using the same query syntax as the web interface.
+
+**Endpoint:**
+```
+https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
+```
+
+**Parameters:**
+- `db=clinvar` - Database name (required)
+- `term=<query>` - Search query (required)
+- `retmax=<N>` - Maximum records to return (default: 20)
+- `retmode=json` - Return format (json or xml)
+- `usehistory=y` - Store results on server for large datasets
+
+**Example Query:**
+```bash
+# Search for BRCA1 pathogenic variants
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=BRCA1[gene]+AND+pathogenic[CLNSIG]&retmode=json&retmax=100"
+```
+
+**Common Search Fields:**
+- `[gene]` - Gene symbol
+- `[CLNSIG]` - Clinical significance (pathogenic, benign, etc.)
+- `[disorder]` - Disease/condition name
+- `[variant name]` - HGVS expression or variant identifier
+- `[chr]` - Chromosome number
+- `[Assembly]` - GRCh37 or GRCh38
+
+#### 2. esummary - Retrieve Record Summaries
+Get summary information for specific ClinVar records.
+
+**Endpoint:**
+```
+https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi
+```
+
+**Parameters:**
+- `db=clinvar` - Database name (required)
+- `id=<UIDs>` - Comma-separated list of ClinVar UIDs
+- `retmode=json` - Return format (json or xml)
+- `version=2.0` - API version (recommended for JSON)
+
+**Example:**
+```bash
+# Get summary for specific variant
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=12345&retmode=json&version=2.0"
+```
+
+**esummary Output Includes:**
+- Accession (RCV/VCV)
+- Clinical significance
+- Review status
+- Gene symbols
+- Variant type
+- Genomic locations (GRCh37 and GRCh38)
+- Associated conditions
+- Allele origin (germline/somatic)
+
+#### 3. efetch - Retrieve Full Records
+Download complete XML records for detailed analysis.
+
+**Endpoint:**
+```
+https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
+```
+
+**Parameters:**
+- `db=clinvar` - Database name (required)
+- `id=<UIDs>` - Comma-separated ClinVar UIDs
+- `rettype=vcv` or `rettype=rcv` - Record type
+
+**Example:**
+```bash
+# Fetch full VCV record
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&id=12345&rettype=vcv"
+```
+
+#### 4. elink - Find Related Records
+Link ClinVar records to other NCBI databases.
+
+**Endpoint:**
+```
+https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi
+```
+
+**Available Links:**
+- clinvar_pubmed - Link to PubMed citations
+- clinvar_gene - Link to Gene database
+- clinvar_medgen - Link to MedGen (conditions)
+- clinvar_snp - Link to dbSNP
+
+**Example:**
+```bash
+# Find PubMed articles for a variant
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=clinvar&db=pubmed&id=12345"
+```
+
+### Workflow Example: Complete Search and Retrieval
+
+```bash
+# Step 1: Search for variants
+SEARCH_URL="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=CFTR[gene]+AND+pathogenic[CLNSIG]&retmode=json&retmax=10"
+
+# Step 2: Parse IDs from search results
+# (Extract id list from JSON response)
+
+# Step 3: Retrieve summaries
+SUMMARY_URL="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=<ids>&retmode=json&version=2.0"
+
+# Step 4: Fetch full records if needed
+FETCH_URL="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&id=<ids>&rettype=vcv"
+```
+
+## Entrez Direct (Command-Line)
+
+Install Entrez Direct for command-line access:
+```bash
+sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
+```
+
+### Common Commands
+
+**Search:**
+```bash
+esearch -db clinvar -query "BRCA1[gene] AND pathogenic[CLNSIG]"
+```
+
+**Pipeline Search to Summary:**
+```bash
+esearch -db clinvar -query "TP53[gene]" | \
+  efetch -format docsum | \
+  xtract -pattern DocumentSummary -element AccessionVersion Title
+```
+
+**Count Results:**
+```bash
+esearch -db clinvar -query "breast cancer[disorder]" | \
+  efilter -status reviewed | \
+  efetch -format docsum
+```
+
+## Rate Limits and Best Practices
+
+### Rate Limits
+- **Without API Key:** 3 requests/second
+- **With API Key:** 10 requests/second
+- Large datasets: Use `usehistory=y` to avoid repeated queries
+
+### API Key Setup
+1. Register for NCBI account at https://www.ncbi.nlm.nih.gov/account/
+2. Generate API key in account settings
+3. Add `&api_key=<YOUR_KEY>` to all requests
+
+### Best Practices
+- Test queries on web interface before automation
+- Use `usehistory` for large result sets (>500 records)
+- Implement exponential backoff for rate limit errors
+- Cache results when appropriate
+- Use batch requests instead of individual queries
+- Respect NCBI servers - don't submit large jobs during peak US hours
+
+## Python Example with Biopython
+
+```python
+from Bio import Entrez
+
+# Set email (required by NCBI)
+Entrez.email = "your.email@example.com"
+
+# Search ClinVar
+def search_clinvar(query, retmax=100):
+    handle = Entrez.esearch(db="clinvar", term=query, retmax=retmax)
+    record = Entrez.read(handle)
+    handle.close()
+    return record["IdList"]
+
+# Get summaries
+def get_summaries(id_list):
+    ids = ",".join(id_list)
+    handle = Entrez.esummary(db="clinvar", id=ids, retmode="json")
+    record = Entrez.read(handle)
+    handle.close()
+    return record
+
+# Example usage
+variant_ids = search_clinvar("BRCA2[gene] AND pathogenic[CLNSIG]")
+summaries = get_summaries(variant_ids)
+```
+
+## Error Handling
+
+### Common HTTP Status Codes
+- `200` - Success
+- `400` - Bad request (check query syntax)
+- `429` - Too many requests (rate limited)
+- `500` - Server error (retry with exponential backoff)
+
+### Error Response Example
+```xml
+<ERROR>Empty id list - nothing to do</ERROR>
+```
+
+## Additional Resources
+
+- NCBI E-utilities documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/
+- ClinVar web services: https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/
+- Entrez Direct cookbook: https://www.ncbi.nlm.nih.gov/books/NBK179288/
--- a/skills/clinvar-database/references/clinical_significance.md
+++ b/skills/clinvar-database/references/clinical_significance.md
@@ -0,0 +1,218 @@
+# ClinVar Clinical Significance Interpretation Guide
+
+## Overview
+
+ClinVar uses standardized terminology to describe the clinical significance of genetic variants. Understanding these classifications is critical for interpreting variant reports and making informed research or clinical decisions.
+
+## Important Disclaimer
+
+**ClinVar data is NOT intended for direct diagnostic use or medical decision-making without review by a genetics professional.** The interpretations in ClinVar represent submitted data from various sources and should be evaluated in the context of the specific patient and clinical scenario.
+
+## Three Classification Categories
+
+ClinVar represents three distinct types of variant classifications:
+
+1. **Germline variants** - Inherited variants related to Mendelian diseases and drug responses
+2. **Somatic variants (Clinical Impact)** - Acquired variants with therapeutic implications
+3. **Somatic variants (Oncogenicity)** - Acquired variants related to cancer development
+
+## Germline Variant Classifications
+
+### Standard ACMG/AMP Terms
+
+These are the five core terms recommended by the American College of Medical Genetics and Genomics (ACMG) and Association for Molecular Pathology (AMP):
+
+| Term | Abbreviation | Meaning | Probability |
+|------|--------------|---------|-------------|
+| **Pathogenic** | P | Variant causes disease | ~99% |
+| **Likely Pathogenic** | LP | Variant likely causes disease | ~90% |
+| **Uncertain Significance** | VUS | Insufficient evidence to classify | N/A |
+| **Likely Benign** | LB | Variant likely does not cause disease | ~90% non-pathogenic |
+| **Benign** | B | Variant does not cause disease | ~99% non-pathogenic |
+
+### Low-Penetrance and Risk Allele Terms
+
+ClinGen recommends additional terms for variants with incomplete penetrance or risk associations:
+
+- **Pathogenic, low penetrance** - Disease-causing but not all carriers develop disease
+- **Likely pathogenic, low penetrance** - Probably disease-causing with incomplete penetrance
+- **Established risk allele** - Confirmed association with increased disease risk
+- **Likely risk allele** - Probable association with increased disease risk
+- **Uncertain risk allele** - Unclear risk association
+
+### Additional Classification Terms
+
+- **Drug response** - Variants affecting medication efficacy or metabolism
+- **Association** - Statistical association with trait/disease
+- **Protective** - Variants that reduce disease risk
+- **Affects** - Variants that affect a biological function
+- **Other** - Classifications that don't fit standard categories
+- **Not provided** - No classification submitted
+
+### Special Considerations
+
+**Recessive Disorders:**
+A disease-causing variant for an autosomal recessive disorder should be classified as "Pathogenic," even though heterozygous carriers will not develop disease. The classification describes the variant's effect, not the carrier status.
+
+**Compound Heterozygotes:**
+Each variant is classified independently. Two "Likely Pathogenic" variants in trans can together cause recessive disease, but each maintains its individual classification.
+
+## Somatic Variant Classifications
+
+### Clinical Impact (AMP/ASCO/CAP Tiers)
+
+Based on guidelines from the Association for Molecular Pathology (AMP), American Society of Clinical Oncology (ASCO), and College of American Pathologists (CAP):
+
+| Tier | Meaning |
+|------|---------|
+| **Tier I - Strong** | Variants with strong clinical significance - FDA-approved therapies or professional guidelines |
+| **Tier II - Potential** | Variants with potential clinical actionability - emerging evidence |
+| **Tier III - Uncertain** | Variants of unknown clinical significance |
+| **Tier IV - Benign/Likely Benign** | Variants with no therapeutic implications |
+
+### Oncogenicity (ClinGen/CGC/VICC)
+
+Based on standards from ClinGen, Cancer Genomics Consortium (CGC), and Variant Interpretation for Cancer Consortium (VICC):
+
+| Term | Meaning |
+|------|---------|
+| **Oncogenic** | Variant drives cancer development |
+| **Likely Oncogenic** | Variant probably drives cancer development |
+| **Uncertain Significance** | Insufficient evidence for oncogenicity |
+| **Likely Benign** | Variant probably does not drive cancer |
+| **Benign** | Variant does not drive cancer |
+
+## Review Status and Star Ratings
+
+ClinVar assigns review status ratings to indicate the strength of evidence behind classifications:
+
+| Stars | Review Status | Description | Weight |
+|-------|---------------|-------------|--------|
+| ★★★★ | **Practice Guideline** | Reviewed by expert panel with published guidelines | Highest |
+| ★★★ | **Expert Panel Review** | Reviewed by expert panel (e.g., ClinGen) | High |
+| ★★ | **Multiple Submitters, No Conflicts** | ≥2 submitters with same classification | Moderate |
+| ★ | **Criteria Provided, Single Submitter** | One submitter with supporting evidence | Standard |
+| ☆ | **No Assertion Criteria** | Classification without documented criteria | Lowest |
+| ☆ | **No Assertion Provided** | No classification submitted | None |
+
+### What the Stars Mean
+
+- **4 stars**: Highest confidence - vetted by expert panels, used in clinical practice guidelines
+- **3 stars**: High confidence - expert panel review (e.g., ClinGen Variant Curation Expert Panel)
+- **2 stars**: Moderate confidence - consensus among multiple independent submitters
+- **1 star**: Single submitter with evidence - quality depends on submitter expertise
+- **0 stars**: Low confidence - insufficient evidence or no criteria provided
+
+## Conflicting Interpretations
+
+### What Constitutes a Conflict?
+
+As of June 2022, conflicts are reported between:
+- Pathogenic/likely pathogenic **vs.** Uncertain significance
+- Pathogenic/likely pathogenic **vs.** Benign/likely benign
+- Uncertain significance **vs.** Benign/likely benign
+
+### Conflict Resolution
+
+When conflicts exist, ClinVar reports:
+- **"Conflicting interpretations of pathogenicity"** - Disagreement on clinical significance
+- Individual submissions are displayed so users can evaluate evidence
+- Higher review status (more stars) carries more weight
+- More recent submissions may reflect updated evidence
+
+### Handling Conflicts in Research
+
+When encountering conflicts:
+1. Check the review status (star rating) of each interpretation
+2. Examine the evidence and criteria provided by each submitter
+3. Consider the date of submission (more recent may reflect new data)
+4. Review population frequency data and functional studies
+5. Consult expert panel classifications when available
+
+## Aggregate Classifications
+
+ClinVar calculates an aggregate classification when multiple submitters provide interpretations:
+
+### No Conflicts
+When all submitters agree (within the same category):
+- Display: Single classification term
+- Confidence: Higher with more submitters
+
+### With Conflicts
+When submitters disagree:
+- Display: "Conflicting interpretations of pathogenicity"
+- Details: All individual submissions shown
+- Resolution: Users must evaluate evidence themselves
+
+## Interpretation Best Practices
+
+### For Researchers
+
+1. **Always check review status** - Prefer variants with ★★★ or ★★★★ ratings
+2. **Review submission details** - Examine evidence supporting classification
+3. **Consider publication date** - Newer classifications may incorporate recent data
+4. **Check assertion criteria** - Variants with ACMG criteria are more reliable
+5. **Verify in context** - Population, ethnicity, and phenotype matter
+6. **Follow up on conflicts** - Investigate discrepancies before making conclusions
+
+### For Variant Annotation Pipelines
+
+1. Prioritize higher review status classifications
+2. Flag conflicting interpretations for manual review
+3. Track classification changes over time
+4. Include population frequency data alongside ClinVar classifications
+5. Document ClinVar version and access date
+
+### Red Flags
+
+Be cautious with variants that have:
+- Zero or one star rating
+- Conflicting interpretations without resolution
+- Classification as VUS (uncertain significance)
+- Very old submission dates without updates
+- Classification based on in silico predictions alone
+
+## Common Query Patterns
+
+### Search for High-Confidence Pathogenic Variants
+
+```
+BRCA1[gene] AND pathogenic[CLNSIG] AND practice guideline[RVSTAT]
+```
+
+### Filter by Review Status
+
+```
+TP53[gene] AND (reviewed by expert panel[RVSTAT] OR practice guideline[RVSTAT])
+```
+
+### Exclude Conflicting Interpretations
+
+```
+CFTR[gene] AND pathogenic[CLNSIG] NOT conflicting[RVSTAT]
+```
+
+## Updates and Reclassifications
+
+### Why Classifications Change
+
+Variants may be reclassified due to:
+- New functional studies
+- Additional population data (e.g., gnomAD)
+- Updated ACMG guidelines
+- Clinical evidence from more patients
+- Segregation data from families
+
+### Tracking Changes
+
+- ClinVar maintains submission history
+- Version-controlled VCV/RCV accessions
+- Monthly updates to classifications
+- Reclassifications can go in either direction (upgrade or downgrade)
+
+## Key Resources
+
+- ACMG/AMP Variant Interpretation Guidelines: Richards et al., 2015
+- ClinGen Sequence Variant Interpretation Working Group: https://clinicalgenome.org/
+- ClinVar Clinical Significance Documentation: https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/
+- Review Status Documentation: https://www.ncbi.nlm.nih.gov/clinvar/docs/review_status/
--- a/skills/clinvar-database/references/data_formats.md
+++ b/skills/clinvar-database/references/data_formats.md
@@ -0,0 +1,358 @@
+# ClinVar Data Formats and FTP Access
+
+## Overview
+
+ClinVar provides bulk data downloads in multiple formats to support different research workflows. Data is distributed via FTP and updated on regular schedules.
+
+## FTP Access
+
+### Base URL
+```
+ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/
+```
+
+### Update Schedule
+
+- **Monthly Releases**: First Thursday of each month
+  - Complete dataset with comprehensive documentation
+  - Archived indefinitely for reproducibility
+  - Includes release notes
+
+- **Weekly Updates**: Every Monday
+  - Incremental updates to monthly release
+  - Retained until next monthly release
+  - Allows synchronization with ClinVar website
+
+### Directory Structure
+
+```
+pub/clinvar/
+├── xml/                          # XML data files
+│   ├── clinvar_variation/       # VCV files (variant-centric)
+│   │   ├── weekly_release/      # Weekly updates
+│   │   └── archive/             # Monthly archives
+│   └── RCV/                     # RCV files (variant-condition pairs)
+│       ├── weekly_release/
+│       └── archive/
+├── vcf_GRCh37/                  # VCF files (GRCh37/hg19)
+├── vcf_GRCh38/                  # VCF files (GRCh38/hg38)
+├── tab_delimited/               # Tab-delimited summary files
+│   ├── variant_summary.txt.gz
+│   ├── var_citations.txt.gz
+│   └── cross_references.txt.gz
+└── README.txt                   # Format documentation
+```
+
+## Data Formats
+
+### 1. XML Format (Primary Distribution)
+
+XML provides the most comprehensive data with full submission details, evidence, and metadata.
+
+#### VCV (Variation) Files
+- **Purpose**: Variant-centric aggregation
+- **Location**: `xml/clinvar_variation/`
+- **Accession format**: VCV000000001.1
+- **Best for**: Queries focused on specific variants regardless of condition
+- **File naming**: `ClinVarVariationRelease_YYYY-MM-DD.xml.gz`
+
+**VCV Record Structure:**
+```xml
+<VariationArchive VariationID="12345" VariationType="single nucleotide variant">
+  <VariationName>NM_000059.3(BRCA2):c.1310_1313del (p.Lys437fs)</VariationName>
+  <InterpretedRecord>
+    <Interpretations>
+      <InterpretedConditionList>
+        <InterpretedCondition>Breast-ovarian cancer, familial 2</InterpretedCondition>
+      </InterpretedConditionList>
+      <ClinicalSignificance>Pathogenic</ClinicalSignificance>
+      <ReviewStatus>reviewed by expert panel</ReviewStatus>
+    </Interpretations>
+  </InterpretedRecord>
+  <ClinicalAssertionList>
+    <!-- Individual submissions -->
+  </ClinicalAssertionList>
+</VariationArchive>
+```
+
+#### RCV (Record) Files
+- **Purpose**: Variant-condition pair aggregation
+- **Location**: `xml/RCV/`
+- **Accession format**: RCV000000001.1
+- **Best for**: Queries focused on variant-disease relationships
+- **File naming**: `ClinVarRCVRelease_YYYY-MM-DD.xml.gz`
+
+**Key differences from VCV:**
+- One RCV per variant-condition combination
+- A single variant may have multiple RCV records (different conditions)
+- More focused on clinical interpretation per disease
+
+#### SCV (Submission) Records
+- **Format**: Individual submissions within VCV/RCV records
+- **Accession format**: SCV000000001.1
+- **Content**: Submitter-specific interpretations and evidence
+
+### 2. VCF Format
+
+Variant Call Format files for genomic analysis pipelines.
+
+#### Locations
+- **GRCh37/hg19**: `vcf_GRCh37/clinvar.vcf.gz`
+- **GRCh38/hg38**: `vcf_GRCh38/clinvar.vcf.gz`
+
+#### Content Limitations
+- **Included**: Simple alleles with precise genomic coordinates
+- **Excluded**:
+  - Variants >10 kb
+  - Cytogenetic variants
+  - Complex structural variants
+  - Variants without precise breakpoints
+
+#### VCF INFO Fields
+
+Key INFO fields in ClinVar VCF:
+
+| Field | Description |
+|-------|-------------|
+| **ALLELEID** | ClinVar allele identifier |
+| **CLNSIG** | Clinical significance |
+| **CLNREVSTAT** | Review status |
+| **CLNDN** | Condition name(s) |
+| **CLNVC** | Variant type (SNV, deletion, etc.) |
+| **CLNVCSO** | Sequence ontology term |
+| **GENEINFO** | Gene symbol:gene ID |
+| **MC** | Molecular consequence |
+| **RS** | dbSNP rsID |
+| **AF_ESP** | Allele frequency (ESP) |
+| **AF_EXAC** | Allele frequency (ExAC) |
+| **AF_TGP** | Allele frequency (1000 Genomes) |
+
+#### Example VCF Line
+```
+#CHROM  POS     ID      REF  ALT  QUAL  FILTER  INFO
+13      32339912  rs80357382  A    G    .     .     ALLELEID=38447;CLNDN=Breast-ovarian_cancer,_familial_2;CLNSIG=Pathogenic;CLNREVSTAT=reviewed_by_expert_panel;GENEINFO=BRCA2:675
+```
+
+### 3. Tab-Delimited Format
+
+Summary files for quick analysis and database loading.
+
+#### variant_summary.txt
+Primary summary file with selected metadata for all genome-mapped variants.
+
+**Key Columns:**
+- `VariationID` - ClinVar variation identifier
+- `Type` - Variant type (SNV, indel, CNV, etc.)
+- `Name` - Variant name (typically HGVS)
+- `GeneID` - NCBI Gene ID
+- `GeneSymbol` - Gene symbol
+- `ClinicalSignificance` - Classification
+- `ReviewStatus` - Star rating level
+- `LastEvaluated` - Date of last review
+- `RS# (dbSNP)` - dbSNP rsID if available
+- `Chromosome` - Chromosome
+- `PositionVCF` - Position (GRCh38)
+- `ReferenceAlleleVCF` - Reference allele
+- `AlternateAlleleVCF` - Alternate allele
+- `Assembly` - Reference assembly (GRCh37/GRCh38)
+- `PhenotypeIDS` - MedGen/OMIM/Orphanet IDs
+- `Origin` - Germline, somatic, de novo, etc.
+- `SubmitterCategories` - Submitter types (clinical, research, etc.)
+
+**Example Usage:**
+```bash
+# Extract all pathogenic BRCA1 variants
+zcat variant_summary.txt.gz | \
+  awk -F'\t' '$7=="BRCA1" && $13~"Pathogenic"' | \
+  cut -f1,7,13,14
+```
+
+#### var_citations.txt
+Cross-references to PubMed articles, dbSNP, and dbVar.
+
+**Columns:**
+- `AlleleID` - ClinVar allele ID
+- `VariationID` - ClinVar variation ID
+- `rs` - dbSNP rsID
+- `nsv/esv` - dbVar IDs
+- `PubMedID` - PubMed citation
+
+#### cross_references.txt
+Database cross-references with modification dates.
+
+**Columns:**
+- `VariationID`
+- `Database` (OMIM, UniProtKB, GTR, etc.)
+- `Identifier`
+- `DateLastModified`
+
+## Choosing the Right Format
+
+### Use XML when:
+- Need complete submission details
+- Want to track evidence and criteria
+- Building comprehensive variant databases
+- Require full metadata and relationships
+
+### Use VCF when:
+- Integrating with genomic analysis pipelines
+- Annotating variant calls from sequencing
+- Need genomic coordinates for overlap analysis
+- Working with standard bioinformatics tools
+
+### Use Tab-Delimited when:
+- Quick database queries and filters
+- Loading into spreadsheets or databases
+- Simple data extraction and statistics
+- Don't need full evidence details
+
+## Accession Types and Identifiers
+
+### VCV (Variation Archive)
+- **Format**: VCV000012345.6 (ID.version)
+- **Scope**: Aggregates all data for a single variant
+- **Versioning**: Increments when variant data changes
+
+### RCV (Record)
+- **Format**: RCV000056789.4
+- **Scope**: One variant-condition interpretation
+- **Versioning**: Increments when interpretation changes
+
+### SCV (Submission)
+- **Format**: SCV000098765.2
+- **Scope**: Individual submitter's interpretation
+- **Versioning**: Increments when submission updates
+
+### Other Identifiers
+- **VariationID**: Stable numeric identifier for variants
+- **AlleleID**: Stable numeric identifier for alleles
+- **dbSNP rsID**: Cross-reference to dbSNP (when available)
+
+## File Processing Tips
+
+### XML Processing
+
+**Python with xml.etree:**
+```python
+import gzip
+import xml.etree.ElementTree as ET
+
+with gzip.open('ClinVarVariationRelease.xml.gz', 'rt') as f:
+    for event, elem in ET.iterparse(f, events=('end',)):
+        if elem.tag == 'VariationArchive':
+            # Process variant
+            variation_id = elem.attrib.get('VariationID')
+            # Extract data
+            elem.clear()  # Free memory
+```
+
+**Command-line with xmllint:**
+```bash
+# Extract pathogenic variants
+zcat ClinVarVariationRelease.xml.gz | \
+  xmllint --xpath "//VariationArchive[.//ClinicalSignificance[text()='Pathogenic']]" -
+```
+
+### VCF Processing
+
+**Using bcftools:**
+```bash
+# Filter by clinical significance
+bcftools view -i 'INFO/CLNSIG~"Pathogenic"' clinvar.vcf.gz
+
+# Extract specific genes
+bcftools view -i 'INFO/GENEINFO~"BRCA"' clinvar.vcf.gz
+
+# Annotate your VCF
+bcftools annotate -a clinvar.vcf.gz -c INFO your_variants.vcf
+```
+
+**Using PyVCF:**
+```python
+import vcf
+
+vcf_reader = vcf.Reader(filename='clinvar.vcf.gz')
+for record in vcf_reader:
+    clnsig = record.INFO.get('CLNSIG', [])
+    if 'Pathogenic' in clnsig:
+        print(f"{record.CHROM}:{record.POS} - {clnsig}")
+```
+
+### Tab-Delimited Processing
+
+**Using pandas:**
+```python
+import pandas as pd
+
+# Read variant summary
+df = pd.read_csv('variant_summary.txt.gz', sep='\t', compression='gzip')
+
+# Filter pathogenic variants
+pathogenic = df[df['ClinicalSignificance'].str.contains('Pathogenic', na=False)]
+
+# Group by gene
+gene_counts = pathogenic.groupby('GeneSymbol').size().sort_values(ascending=False)
+```
+
+## Data Quality Considerations
+
+### Known Limitations
+
+1. **VCF files exclude large variants** - Variants >10 kb not included
+2. **Historical data may be less accurate** - Older submissions had fewer standardization requirements
+3. **Conflicting interpretations exist** - Multiple submitters may disagree
+4. **Not all variants have genomic coordinates** - Some HGVS expressions can't be mapped
+
+### Validation Recommendations
+
+- Cross-reference multiple data formats when possible
+- Check review status (prefer ★★★ or ★★★★ ratings)
+- Verify genomic coordinates against current genome builds
+- Consider population frequency data (gnomAD) for context
+- Review submission dates - newer data may be more accurate
+
+## Bulk Download Scripts
+
+### Download Latest Monthly Release
+
+```bash
+#!/bin/bash
+# Download latest ClinVar monthly XML release
+
+BASE_URL="ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation"
+
+# Get latest file
+LATEST=$(curl -s ${BASE_URL}/ | \
+         grep -oP 'ClinVarVariationRelease_\d{4}-\d{2}\.xml\.gz' | \
+         tail -1)
+
+# Download
+wget ${BASE_URL}/${LATEST}
+```
+
+### Download All Formats
+
+```bash
+#!/bin/bash
+# Download ClinVar in all formats
+
+FTP_BASE="ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar"
+
+# XML
+wget ${FTP_BASE}/xml/clinvar_variation/ClinVarVariationRelease_00-latest.xml.gz
+
+# VCF (both assemblies)
+wget ${FTP_BASE}/vcf_GRCh37/clinvar.vcf.gz
+wget ${FTP_BASE}/vcf_GRCh38/clinvar.vcf.gz
+
+# Tab-delimited
+wget ${FTP_BASE}/tab_delimited/variant_summary.txt.gz
+wget ${FTP_BASE}/tab_delimited/var_citations.txt.gz
+```
+
+## Additional Resources
+
+- ClinVar FTP Primer: https://www.ncbi.nlm.nih.gov/clinvar/docs/ftp_primer/
+- XML Schema Documentation: https://www.ncbi.nlm.nih.gov/clinvar/docs/xml_schemas/
+- VCF Specification: https://samtools.github.io/hts-specs/VCFv4.3.pdf
+- Release Notes: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/README.txt