Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions


@@ -0,0 +1,809 @@
---
name: geo-database
description: "Access NCBI GEO for gene expression/genomics data. Search/download microarray and RNA-seq datasets (GSE, GSM, GPL), retrieve SOFT/Matrix files, for transcriptomics and expression analysis."
---
# GEO Database
## Overview
The Gene Expression Omnibus (GEO) is NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments.
## When to Use This Skill
This skill should be used when searching for gene expression datasets, retrieving experimental data, downloading raw and processed files, querying expression profiles, or integrating GEO data into computational analysis workflows.
## Core Capabilities
### 1. Understanding GEO Data Organization
GEO organizes data hierarchically using different accession types:
**Series (GSE):** A complete experiment with a set of related samples
- Example: GSE123456
- Contains experimental design, samples, and overall study information
- Largest organizational unit in GEO
- Current count: 264,928+ series
**Sample (GSM):** A single experimental sample or biological replicate
- Example: GSM987654
- Contains individual sample data, protocols, and metadata
- Linked to platforms and series
- Current count: 8,068,632+ samples
**Platform (GPL):** The microarray or sequencing platform used
- Example: GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array)
- Describes the technology and probe/feature annotations
- Shared across multiple experiments
- Current count: 27,739+ platforms
**DataSet (GDS):** Curated collections with consistent formatting
- Example: GDS5678
- Experimentally comparable samples organized by study design
- Processed for differential analysis
- Subset of GEO data (4,348 curated datasets)
- Ideal for quick comparative analyses
**Profiles:** Gene-specific expression data linked to sequence features
- Queryable by gene name or annotation
- Cross-references to Entrez Gene
- Enables gene-centric searches across all studies
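Since every workflow below starts from an accession, it can help to dispatch on the prefix programmatically. A minimal sketch (the helper name and error handling are illustrative; the prefix-to-type mapping mirrors the list above):
```python
# Map a GEO accession prefix to its record type
GEO_PREFIXES = {
    "GSE": "Series",    # complete experiment
    "GSM": "Sample",    # single sample
    "GPL": "Platform",  # array/sequencer description
    "GDS": "DataSet",   # curated, analysis-ready collection
}

def classify_accession(accession):
    """Return the GEO record type for an accession like 'GSE123456'."""
    prefix = accession[:3].upper()
    if prefix not in GEO_PREFIXES or not accession[3:].isdigit():
        raise ValueError(f"Not a recognized GEO accession: {accession}")
    return GEO_PREFIXES[prefix]

print(classify_accession("GSE123456"))  # Series
print(classify_accession("GPL570"))     # Platform
```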
### 2. Searching GEO Data
**GEO DataSets Search:**
Search for studies by keywords, organism, or experimental conditions:
```python
from Bio import Entrez
# Configure Entrez (required)
Entrez.email = "your.email@example.com"
# Search for datasets
def search_geo_datasets(query, retmax=20):
"""Search GEO DataSets database"""
handle = Entrez.esearch(
db="gds",
term=query,
retmax=retmax,
usehistory="y"
)
results = Entrez.read(handle)
handle.close()
return results
# Example searches
results = search_geo_datasets("breast cancer[MeSH] AND Homo sapiens[Organism]")
print(f"Found {results['Count']} datasets")
# Search by specific platform
results = search_geo_datasets("GPL570[Accession]")
# Search by study type
results = search_geo_datasets("expression profiling by array[DataSet Type]")
```
**GEO Profiles Search:**
Find gene-specific expression patterns:
```python
# Search for gene expression profiles
def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100):
"""Search GEO Profiles for a specific gene"""
query = f"{gene_name}[Gene Name] AND {organism}[Organism]"
handle = Entrez.esearch(
db="geoprofiles",
term=query,
retmax=retmax
)
results = Entrez.read(handle)
handle.close()
return results
# Find TP53 expression across studies
tp53_results = search_geo_profiles("TP53", organism="Homo sapiens")
print(f"Found {tp53_results['Count']} expression profiles for TP53")
```
**Advanced Search Patterns:**
```python
# Combine multiple search terms
def advanced_geo_search(terms, operator="AND"):
"""Build complex search queries"""
query = f" {operator} ".join(terms)
return search_geo_datasets(query)
# Find recent high-throughput studies
search_terms = [
"RNA-seq[DataSet Type]",
"Homo sapiens[Organism]",
"2024[Publication Date]"
]
results = advanced_geo_search(search_terms)
# Search by author and condition
search_terms = [
"Smith[Author]",
"diabetes[Disease]"
]
results = advanced_geo_search(search_terms)
```
### 3. Retrieving GEO Data with GEOparse (Recommended)
**GEOparse** is the primary Python library for accessing GEO data:
**Installation:**
```bash
uv pip install GEOparse
```
**Basic Usage:**
```python
import GEOparse
# Download and parse a GEO Series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
# Access series metadata
print(gse.metadata['title'])
print(gse.metadata['summary'])
print(gse.metadata['overall_design'])
# Access sample information
for gsm_name, gsm in gse.gsms.items():
print(f"Sample: {gsm_name}")
print(f" Title: {gsm.metadata['title'][0]}")
print(f" Source: {gsm.metadata['source_name_ch1'][0]}")
print(f" Characteristics: {gsm.metadata.get('characteristics_ch1', [])}")
# Access platform information
for gpl_name, gpl in gse.gpls.items():
print(f"Platform: {gpl_name}")
print(f" Title: {gpl.metadata['title'][0]}")
print(f" Organism: {gpl.metadata['organism'][0]}")
```
**Working with Expression Data:**
```python
import GEOparse
import pandas as pd
# Get expression data from series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
# Extract expression matrix
# Method 1: From series matrix file (fastest)
if hasattr(gse, 'pivot_samples'):
    expression_df = gse.pivot_samples('VALUE')
    print(expression_df.shape)  # genes x samples

# Method 2: From individual samples
expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table'):
        expression_data[gsm_name] = gsm.table['VALUE']
expression_df = pd.DataFrame(expression_data)
print(f"Expression matrix: {expression_df.shape}")
```
**Accessing Supplementary Files:**
```python
import GEOparse
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
# Download supplementary files
gse.download_supplementary_files(
    directory="./data/GSE123456_suppl",
    download_sra=False  # Set to True to download SRA files
)

# List available supplementary files
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'supplementary_files'):
        print(f"Sample {gsm_name}:")
        for file_url in gsm.metadata.get('supplementary_file', []):
            print(f"  {file_url}")
```
**Filtering and Subsetting Data:**
```python
import GEOparse
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
# Filter samples by metadata
control_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'control' in gsm.metadata.get('title', [''])[0].lower()
]
treatment_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'treatment' in gsm.metadata.get('title', [''])[0].lower()
]
print(f"Control samples: {len(control_samples)}")
print(f"Treatment samples: {len(treatment_samples)}")
# Extract subset expression matrix
expression_df = gse.pivot_samples('VALUE')
control_expr = expression_df[control_samples]
treatment_expr = expression_df[treatment_samples]
```
### 4. Using NCBI E-utilities for GEO Access
**E-utilities** provide lower-level programmatic access to GEO metadata:
**Basic E-utilities Workflow:**
```python
from Bio import Entrez
import time
Entrez.email = "your.email@example.com"
# Step 1: Search for GEO entries
def search_geo(query, db="gds", retmax=100):
"""Search GEO using E-utilities"""
handle = Entrez.esearch(
db=db,
term=query,
retmax=retmax,
usehistory="y"
)
results = Entrez.read(handle)
handle.close()
return results
# Step 2: Fetch summaries
def fetch_geo_summaries(id_list, db="gds"):
"""Fetch document summaries for GEO entries"""
ids = ",".join(id_list)
handle = Entrez.esummary(db=db, id=ids)
summaries = Entrez.read(handle)
handle.close()
return summaries
# Step 3: Fetch full records
def fetch_geo_records(id_list, db="gds"):
"""Fetch full GEO records"""
ids = ",".join(id_list)
handle = Entrez.efetch(db=db, id=ids, retmode="xml")
records = Entrez.read(handle)
handle.close()
return records
# Example workflow
search_results = search_geo("breast cancer AND Homo sapiens")
id_list = search_results['IdList'][:5]
summaries = fetch_geo_summaries(id_list)
for summary in summaries:
print(f"GDS: {summary.get('Accession', 'N/A')}")
print(f"Title: {summary.get('title', 'N/A')}")
print(f"Samples: {summary.get('n_samples', 'N/A')}")
print()
```
**Batch Processing with E-utilities:**
```python
from Bio import Entrez
import time
Entrez.email = "your.email@example.com"
def batch_fetch_geo_metadata(accessions, batch_size=100):
"""Fetch metadata for multiple GEO accessions"""
results = {}
for i in range(0, len(accessions), batch_size):
batch = accessions[i:i + batch_size]
# Search for each accession
for accession in batch:
try:
query = f"{accession}[Accession]"
search_handle = Entrez.esearch(db="gds", term=query)
search_results = Entrez.read(search_handle)
search_handle.close()
if search_results['IdList']:
# Fetch summary
summary_handle = Entrez.esummary(
db="gds",
id=search_results['IdList'][0]
)
summary = Entrez.read(summary_handle)
summary_handle.close()
results[accession] = summary[0]
# Be polite to NCBI servers
time.sleep(0.34) # Max 3 requests per second
except Exception as e:
print(f"Error fetching {accession}: {e}")
return results
# Fetch metadata for multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
metadata = batch_fetch_geo_metadata(gse_list)
```
### 5. Direct FTP Access for Data Files
**FTP URLs for GEO Data:**
GEO data can be downloaded directly via FTP:
```python
import ftplib
import os
def download_geo_ftp(accession, file_type="matrix", dest_dir="./data"):
    """Download GEO files via FTP"""
    # Construct FTP path based on accession type
    if not accession.startswith("GSE"):
        raise ValueError("This helper handles series (GSE) accessions only")
    # Series files live under /geo/series/GSE{nnn}nnn/{accession}/<subdir>/
    gse_num = accession[3:]
    base_num = gse_num[:-3] + "nnn"
    base_path = f"/geo/series/GSE{base_num}/{accession}/"
    if file_type == "matrix":
        ftp_path = base_path + "matrix/"
        filename = f"{accession}_series_matrix.txt.gz"
    elif file_type == "soft":
        ftp_path = base_path + "soft/"
        filename = f"{accession}_family.soft.gz"
    elif file_type == "miniml":
        ftp_path = base_path + "miniml/"
        filename = f"{accession}_family.xml.tgz"
    else:
        raise ValueError(f"Unknown file_type: {file_type}")

    # Connect to FTP server
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd(ftp_path)

    # Download file
    os.makedirs(dest_dir, exist_ok=True)
    local_file = os.path.join(dest_dir, filename)
    with open(local_file, 'wb') as f:
        ftp.retrbinary(f'RETR {filename}', f.write)
    ftp.quit()

    print(f"Downloaded: {local_file}")
    return local_file
# Download series matrix file
download_geo_ftp("GSE123456", file_type="matrix")
# Download SOFT format file
download_geo_ftp("GSE123456", file_type="soft")
```
**Using wget or curl for Downloads:**
```bash
# Download series matrix file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz
# Download all supplementary files for a series
wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/
# Download SOFT format family file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/soft/GSE123456_family.soft.gz
```
### 6. Analyzing GEO Data
**Quality Control and Preprocessing:**
```python
import GEOparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load dataset
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')
# Check for missing values
print(f"Missing values: {expression_df.isnull().sum().sum()}")
# Log transformation (if needed)
if expression_df.min().min() > 0:  # log2 needs strictly positive values
    if expression_df.max().max() > 100:  # a large maximum suggests values are not yet log-scaled
        expression_df = np.log2(expression_df + 1)
        print("Applied log2 transformation")
# Distribution plots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
expression_df.plot.box(ax=plt.gca())
plt.title("Expression Distribution per Sample")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
expression_df.mean(axis=1).hist(bins=50)
plt.title("Gene Expression Distribution")
plt.xlabel("Average Expression")
plt.tight_layout()
plt.savefig("geo_qc.png", dpi=300, bbox_inches='tight')
```
**Differential Expression Analysis:**
```python
import GEOparse
import pandas as pd
import numpy as np
from scipy import stats
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')
# Define sample groups
control_samples = ["GSM1", "GSM2", "GSM3"]
treatment_samples = ["GSM4", "GSM5", "GSM6"]
# Calculate fold changes and p-values
# (assumes expression values are already on a log2 scale, so a difference
# of means is a log2 fold change)
results = []
for gene in expression_df.index:
    control_expr = expression_df.loc[gene, control_samples]
    treatment_expr = expression_df.loc[gene, treatment_samples]
    # Calculate statistics
    fold_change = treatment_expr.mean() - control_expr.mean()
    t_stat, p_value = stats.ttest_ind(treatment_expr, control_expr)
    results.append({
        'gene': gene,
        'log2_fold_change': fold_change,
        'p_value': p_value,
        'control_mean': control_expr.mean(),
        'treatment_mean': treatment_expr.mean()
    })

# Create results DataFrame
de_results = pd.DataFrame(results)

# Multiple testing correction (Benjamini-Hochberg)
from statsmodels.stats.multitest import multipletests
_, de_results['q_value'], _, _ = multipletests(
    de_results['p_value'],
    method='fdr_bh'
)

# Filter significant genes
significant_genes = de_results[
    (de_results['q_value'] < 0.05) &
    (abs(de_results['log2_fold_change']) > 1)
]
print(f"Significant genes: {len(significant_genes)}")
significant_genes.to_csv("de_results.csv", index=False)
```
**Correlation and Clustering Analysis:**
```python
import GEOparse
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')
# Sample correlation heatmap
sample_corr = expression_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(sample_corr, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title("Sample Correlation Matrix")
plt.tight_layout()
plt.savefig("sample_correlation.png", dpi=300, bbox_inches='tight')
# Hierarchical clustering
distances = pdist(expression_df.T, metric='correlation')
linkage = hierarchy.linkage(distances, method='average')
plt.figure(figsize=(12, 6))
hierarchy.dendrogram(linkage, labels=expression_df.columns)
plt.title("Hierarchical Clustering of Samples")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("sample_clustering.png", dpi=300, bbox_inches='tight')
```
### 7. Batch Processing Multiple Datasets
**Download and Process Multiple Series:**
```python
import GEOparse
import pandas as pd
import os
def batch_download_geo(gse_list, destdir="./geo_data"):
    """Download multiple GEO series"""
    os.makedirs(destdir, exist_ok=True)  # make sure the output directory exists
    results = {}
    for gse_id in gse_list:
        try:
            print(f"Processing {gse_id}...")
            gse = GEOparse.get_GEO(geo=gse_id, destdir=destdir)
            # Extract key information
            results[gse_id] = {
                'title': gse.metadata.get('title', ['N/A'])[0],
                'organism': gse.metadata.get('organism', ['N/A'])[0],
                'platform': list(gse.gpls.keys())[0] if gse.gpls else 'N/A',
                'num_samples': len(gse.gsms),
                'submission_date': gse.metadata.get('submission_date', ['N/A'])[0]
            }
            # Save expression data
            if hasattr(gse, 'pivot_samples'):
                expr_df = gse.pivot_samples('VALUE')
                expr_df.to_csv(f"{destdir}/{gse_id}_expression.csv")
                results[gse_id]['num_genes'] = len(expr_df)
        except Exception as e:
            print(f"Error processing {gse_id}: {e}")
            results[gse_id] = {'error': str(e)}
    # Save summary
    summary_df = pd.DataFrame(results).T
    summary_df.to_csv(f"{destdir}/batch_summary.csv")
    return results
# Process multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
results = batch_download_geo(gse_list)
```
**Meta-Analysis Across Studies:**
```python
import GEOparse
import pandas as pd
import numpy as np
def meta_analysis_geo(gse_list, gene_of_interest):
    """Perform meta-analysis of gene expression across studies"""
    results = []
    for gse_id in gse_list:
        try:
            gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")
            # Get platform annotation
            gpl = list(gse.gpls.values())[0]
            # Find gene in platform
            if hasattr(gpl, 'table'):
                gene_probes = gpl.table[
                    gpl.table['Gene Symbol'].str.contains(
                        gene_of_interest,
                        case=False,
                        na=False
                    )
                ]
                if not gene_probes.empty:
                    expr_df = gse.pivot_samples('VALUE')
                    for probe_id in gene_probes['ID']:
                        if probe_id in expr_df.index:
                            expr_values = expr_df.loc[probe_id]
                            results.append({
                                'study': gse_id,
                                'probe': probe_id,
                                'mean_expression': expr_values.mean(),
                                'std_expression': expr_values.std(),
                                'num_samples': len(expr_values)
                            })
        except Exception as e:
            print(f"Error in {gse_id}: {e}")
    return pd.DataFrame(results)
# Meta-analysis for TP53
gse_studies = ["GSE100001", "GSE100002", "GSE100003"]
meta_results = meta_analysis_geo(gse_studies, "TP53")
print(meta_results)
```
## Installation and Setup
### Python Libraries
```bash
# Primary GEO access library (recommended)
uv pip install GEOparse
# For E-utilities and programmatic NCBI access
uv pip install biopython
# For data analysis
uv pip install pandas numpy scipy
# For visualization
uv pip install matplotlib seaborn
# For statistical analysis
uv pip install statsmodels scikit-learn
```
### Configuration
Set up NCBI E-utilities access:
```python
from Bio import Entrez
# Always set your email (required by NCBI)
Entrez.email = "your.email@example.com"
# Optional: Set API key for increased rate limits
# Get your API key from: https://www.ncbi.nlm.nih.gov/account/
Entrez.api_key = "your_api_key_here"
# With API key: 10 requests/second
# Without API key: 3 requests/second
```
## Common Use Cases
### Transcriptomics Research
- Download gene expression data for specific conditions
- Compare expression profiles across studies
- Identify differentially expressed genes
- Perform meta-analyses across multiple datasets
### Drug Response Studies
- Analyze gene expression changes after drug treatment
- Identify biomarkers for drug response
- Compare drug effects across cell lines or patients
- Build predictive models for drug sensitivity
### Disease Biology
- Study gene expression in disease vs. normal tissues
- Identify disease-associated expression signatures
- Compare patient subgroups and disease stages
- Correlate expression with clinical outcomes
### Biomarker Discovery
- Screen for diagnostic or prognostic markers
- Validate biomarkers across independent cohorts
- Compare marker performance across platforms
- Integrate expression with clinical data
## Key Concepts
**SOFT (Simple Omnibus Format in Text):** GEO's primary text-based format containing metadata and data tables. Easily parsed by GEOparse.
**MINiML (MIAME Notation in Markup Language):** XML format for GEO data, used for programmatic access and data exchange.
**Series Matrix:** Tab-delimited expression matrix with samples as columns and genes/probes as rows. Fastest format for getting expression data.
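The series matrix layout is simple enough to read directly. A minimal sketch, assuming a standard `GSE*_series_matrix.txt.gz` file (GEOparse handles this for you; the table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` marker lines):
```python
import gzip
import pandas as pd

def read_series_matrix(path):
    """Read the expression table from a series matrix file."""
    rows = []
    in_table = False
    with gzip.open(path, "rt") as f:
        for line in f:
            if line.startswith("!series_matrix_table_begin"):
                in_table = True
            elif line.startswith("!series_matrix_table_end"):
                in_table = False
            elif in_table:
                # Fields are tab-separated; identifiers are double-quoted
                rows.append([c.strip('"') for c in line.rstrip("\n").split("\t")])
    # First row holds sample accessions; first column holds probe IDs
    df = pd.DataFrame(rows[1:], columns=rows[0]).set_index(rows[0][0])
    return df.apply(pd.to_numeric, errors="coerce")

# expr = read_series_matrix("GSE123456_series_matrix.txt.gz")
```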
**MIAME Compliance:** Minimum Information About a Microarray Experiment - standardized annotation that GEO enforces for all submissions.
**Expression Value Types:** Different types of expression measurements (raw signal, normalized, log-transformed). Always check platform and processing methods.
**Platform Annotation:** Maps probe/feature IDs to genes. Essential for biological interpretation of expression data.
## GEO2R Web Tool
For quick analysis without coding, use GEO2R:
- Web-based statistical analysis tool integrated into GEO
- Accessible at: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSExxxxx
- Performs differential expression analysis
- Generates R scripts for reproducibility
- Useful for exploratory analysis before downloading data
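Because the URL is parameterized by accession, GEO2R links can be generated programmatically, e.g. when producing analysis reports (function name illustrative):
```python
def geo2r_url(accession):
    """Build the GEO2R link for a series accession (URL pattern shown above)."""
    return f"https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc={accession}"

print(geo2r_url("GSE123456"))
# https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE123456
```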
## Rate Limiting and Best Practices
**NCBI E-utilities Rate Limits:**
- Without API key: 3 requests per second
- With API key: 10 requests per second
- Implement delays between requests: `time.sleep(0.34)` (no API key) or `time.sleep(0.1)` (with API key)
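A small helper can pick the delay from these limits automatically. A sketch using Biopython's Entrez module (the wrapper name is illustrative):
```python
import time
from Bio import Entrez

Entrez.email = "your.email@example.com"
# Entrez.api_key = "your_api_key_here"  # uncomment if you have one

# 10 requests/second with an API key, 3 without
DELAY = 0.1 if getattr(Entrez, "api_key", None) else 0.34

def polite_esearch(term, db="gds"):
    """esearch followed by a delay that respects NCBI's rate limits."""
    handle = Entrez.esearch(db=db, term=term)
    results = Entrez.read(handle)
    handle.close()
    time.sleep(DELAY)
    return results
```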
**FTP Access:**
- No rate limits for FTP downloads
- Preferred method for bulk downloads
- Can download entire directories with wget -r
**GEOparse Caching:**
- GEOparse automatically caches downloaded files in destdir
- Subsequent calls use cached data
- Clean cache periodically to save disk space
**Optimal Practices:**
- Use GEOparse for series-level access (easiest)
- Use E-utilities for metadata searching and batch queries
- Use FTP for direct file downloads and bulk operations
- Cache data locally to avoid repeated downloads
- Always set Entrez.email when using Biopython
## Resources
### references/geo_reference.md
Comprehensive reference documentation covering:
- Detailed E-utilities API specifications and endpoints
- Complete SOFT and MINiML file format documentation
- Advanced GEOparse usage patterns and examples
- FTP directory structure and file naming conventions
- Data processing pipelines and normalization methods
- Troubleshooting common issues and error handling
- Platform-specific considerations and quirks
Consult this reference for in-depth technical details, complex query patterns, or when working with uncommon data formats.
## Important Notes
### Data Quality Considerations
- GEO accepts user-submitted data with varying quality standards
- Always check platform annotation and processing methods
- Verify sample metadata and experimental design
- Be cautious with batch effects across studies
- Consider reprocessing raw data for consistency
### File Size Warnings
- Series matrix files can be large (>1 GB for large studies)
- Supplementary files (e.g., CEL files) can be very large
- Plan for adequate disk space before downloading
- Consider downloading samples incrementally
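One way to plan ahead is to ask the FTP server for a file's size before committing to the download. A sketch with the standard library (helper name illustrative; the path layout follows the FTP section above):
```python
import ftplib

def remote_size_mb(ftp_dir, filename):
    """Return the size in MB of a file on the GEO FTP server."""
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd(ftp_dir)
    ftp.voidcmd("TYPE I")  # SIZE requires binary transfer mode
    size = ftp.size(filename)
    ftp.quit()
    return size / 1e6

# remote_size_mb("/geo/series/GSE123nnn/GSE123456/matrix/",
#                "GSE123456_series_matrix.txt.gz")
```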
### Data Usage and Citation
- GEO data is freely available for research use
- Always cite original studies when using GEO data
- Cite GEO database: Barrett et al. (2013) Nucleic Acids Research
- Check individual dataset usage restrictions (if any)
- Follow NCBI guidelines for programmatic access
### Common Pitfalls
- Different platforms use different probe IDs (requires annotation mapping)
- Expression values may be raw, normalized, or log-transformed (check metadata)
- Sample metadata can be inconsistently formatted across studies
- Not all series have series matrix files (older submissions)
- Platform annotations may be outdated (genes renamed, IDs deprecated)
## Additional Resources
- **GEO Website:** https://www.ncbi.nlm.nih.gov/geo/
- **GEO Submission Guidelines:** https://www.ncbi.nlm.nih.gov/geo/info/submission.html
- **GEOparse Documentation:** https://geoparse.readthedocs.io/
- **E-utilities Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25501/
- **GEO FTP Site:** ftp://ftp.ncbi.nlm.nih.gov/geo/
- **GEO2R Tool:** https://www.ncbi.nlm.nih.gov/geo/geo2r/
- **NCBI API Keys:** https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
- **Biopython Tutorial:** https://biopython.org/DIST/docs/tutorial/Tutorial.html


@@ -0,0 +1,829 @@
# GEO Database Reference Documentation
## Complete E-utilities API Specifications
### Overview
The NCBI Entrez Programming Utilities (E-utilities) provide programmatic access to GEO metadata through a set of nine server-side programs. All E-utilities return results in XML format by default.
### Base URL
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
```
### Core E-utility Programs
#### eSearch - Text Query to ID List
**Purpose:** Search a database and return a list of UIDs matching the query.
**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
```
**Parameters:**
- `db` (required): Database to search (e.g., "gds", "geoprofiles")
- `term` (required): Search query string
- `retmax`: Maximum number of UIDs to return (default: 20, max: 10000)
- `retstart`: Starting position in result set (for pagination)
- `usehistory`: Set to "y" to store results on history server
- `sort`: Sort order (e.g., "relevance", "pub_date")
- `field`: Limit search to specific field
- `datetype`: Type of date to limit by
- `reldate`: Limit to items within N days of today
- `mindate`, `maxdate`: Date range limits (YYYY/MM/DD)
**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"
# Basic search
handle = Entrez.esearch(
    db="gds",
    term="breast cancer AND Homo sapiens",
    retmax=100,
    usehistory="y"
)
results = Entrez.read(handle)
handle.close()
# Results contain:
# - Count: Total number of matches
# - RetMax: Number of UIDs returned
# - RetStart: Starting position
# - IdList: List of UIDs
# - QueryKey: Key for history server (if usehistory="y")
# - WebEnv: Web environment string (if usehistory="y")
```
#### eSummary - Document Summaries
**Purpose:** Retrieve document summaries for a list of UIDs.
**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi
```
**Parameters:**
- `db` (required): Database
- `id` (required): Comma-separated list of UIDs or query_key+WebEnv
- `retmode`: Return format ("xml" or "json")
- `version`: Summary version ("2.0" recommended)
**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"
# Get summaries for multiple IDs
handle = Entrez.esummary(
    db="gds",
    id="200000001,200000002",
    retmode="xml",
    version="2.0"
)
summaries = Entrez.read(handle)
handle.close()
# Summary fields for GEO DataSets:
# - Accession: GDS accession
# - title: Dataset title
# - summary: Dataset description
# - PDAT: Publication date
# - n_samples: Number of samples
# - Organism: Source organism
# - PubMedIds: Associated PubMed IDs
```
#### eFetch - Full Records
**Purpose:** Retrieve full records for a list of UIDs.
**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
```
**Parameters:**
- `db` (required): Database
- `id` (required): Comma-separated list of UIDs
- `retmode`: Return format ("xml", "text")
- `rettype`: Record type (database-specific)
**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"
# Fetch full records
handle = Entrez.efetch(
    db="gds",
    id="200000001",
    retmode="xml"
)
records = Entrez.read(handle)
handle.close()
```
#### eLink - Cross-Database Linking
**Purpose:** Find related records in same or different databases.
**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi
```
**Parameters:**
- `dbfrom` (required): Source database
- `db` (required): Target database
- `id` (required): UID from source database
- `cmd`: Link command type
- "neighbor": Return linked UIDs (default)
- "neighbor_score": Return scored links
- "acheck": Check for links
- "ncheck": Count links
- "llinks": Return URLs to LinkOut resources
**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"
# Find PubMed articles linked to a GEO dataset
handle = Entrez.elink(
    dbfrom="gds",
    db="pubmed",
    id="200000001"
)
links = Entrez.read(handle)
handle.close()
```
#### ePost - Upload UID List
**Purpose:** Upload a list of UIDs to the history server for use in subsequent requests.
**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi
```
**Parameters:**
- `db` (required): Database
- `id` (required): Comma-separated list of UIDs
**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"
# Post large list of IDs
large_id_list = [str(i) for i in range(200000001, 200000101)]
handle = Entrez.epost(db="gds", id=",".join(large_id_list))
result = Entrez.read(handle)
handle.close()
# Use returned QueryKey and WebEnv in subsequent calls
query_key = result["QueryKey"]
webenv = result["WebEnv"]
```
#### eInfo - Database Information
**Purpose:** Get information about available databases and their fields.
**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
```
**Parameters:**
- `db`: Database name (omit to get list of all databases)
- `version`: Set to "2.0" for detailed field information
**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"
# Get information about gds database
handle = Entrez.einfo(db="gds", version="2.0")
info = Entrez.read(handle)
handle.close()
# Returns:
# - Database description
# - Last update date
# - Record count
# - Available search fields
# - Link information
```
### Search Field Qualifiers for GEO
Common search fields for building targeted queries:
**General Fields:**
- `[Accession]`: GEO accession number
- `[Title]`: Dataset title
- `[Author]`: Author name
- `[Organism]`: Source organism
- `[Entry Type]`: Type of entry (e.g., "Expression profiling by array")
- `[Platform]`: Platform accession or name
- `[PubMed ID]`: Associated PubMed ID
**Date Fields:**
- `[Publication Date]`: Publication date (YYYY or YYYY/MM/DD)
- `[Submission Date]`: Submission date
- `[Modification Date]`: Last modification date
**MeSH Terms:**
- `[MeSH Terms]`: Medical Subject Headings
- `[MeSH Major Topic]`: Major MeSH topics
**Study Type Fields:**
- `[DataSet Type]`: Type of study (e.g., "RNA-seq", "ChIP-seq")
- `[Sample Type]`: Sample type
**Example Complex Query:**
```python
query = """
(breast cancer[MeSH] OR breast neoplasms[Title]) AND
Homo sapiens[Organism] AND
expression profiling by array[Entry Type] AND
2020:2024[Publication Date] AND
GPL570[Platform]
"""
```
## SOFT File Format Specification
### Overview
SOFT (Simple Omnibus Format in Text) is GEO's primary data exchange format. Files are structured as key-value pairs with data tables.
### File Types
**Family SOFT Files:**
- Filename: `GSExxxxx_family.soft.gz`
- Contains: Complete series with all samples and platforms
- Size: Can be very large (100s of MB compressed)
- Use: Complete data extraction
**Series Matrix Files:**
- Filename: `GSExxxxx_series_matrix.txt.gz`
- Contains: Expression matrix with minimal metadata
- Size: Smaller than family files
- Use: Quick access to expression data
**Platform SOFT Files:**
- Filename: `GPLxxxxx.soft`
- Contains: Platform annotation and probe information
- Use: Mapping probes to genes
### SOFT File Structure
```
^DATABASE = GeoMiame
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_web_link = http://www.ncbi.nlm.nih.gov/geo
!Database_email = geo@ncbi.nlm.nih.gov
^SERIES = GSExxxxx
!Series_title = Study Title Here
!Series_summary = Study description and background...
!Series_overall_design = Experimental design...
!Series_type = Expression profiling by array
!Series_pubmed_id = 12345678
!Series_submission_date = Jan 01 2024
!Series_last_update_date = Jan 15 2024
!Series_contributor = John,Doe
!Series_contributor = Jane,Smith
!Series_sample_id = GSMxxxxxx
!Series_sample_id = GSMxxxxxx
^PLATFORM = GPLxxxxx
!Platform_title = Platform Name
!Platform_distribution = commercial or custom
!Platform_organism = Homo sapiens
!Platform_manufacturer = Affymetrix
!Platform_technology = in situ oligonucleotide
!Platform_data_row_count = 54675
#ID = Probe ID
#GB_ACC = GenBank accession
#SPOT_ID = Spot identifier
#Gene Symbol = Gene symbol
#Gene Title = Gene title
!platform_table_begin
ID GB_ACC SPOT_ID Gene Symbol Gene Title
1007_s_at U48705 - DDR1 discoidin domain receptor...
1053_at M87338 - RFC2 replication factor C...
!platform_table_end
^SAMPLE = GSMxxxxxx
!Sample_title = Sample name
!Sample_source_name_ch1 = cell line XYZ
!Sample_organism_ch1 = Homo sapiens
!Sample_characteristics_ch1 = cell type: epithelial
!Sample_characteristics_ch1 = treatment: control
!Sample_molecule_ch1 = total RNA
!Sample_label_ch1 = biotin
!Sample_platform_id = GPLxxxxx
!Sample_data_processing = normalization method
#ID_REF = Probe identifier
#VALUE = Expression value
!sample_table_begin
ID_REF VALUE
1007_s_at 8.456
1053_at 7.234
!sample_table_end
```
### Parsing SOFT Files
**With GEOparse:**
```python
import GEOparse
# Parse series
gse = GEOparse.get_GEO(filepath="GSE123456_family.soft.gz")
# Access metadata
metadata = gse.metadata
phenotype_data = gse.phenotype_data
# Access samples
for gsm_name, gsm in gse.gsms.items():
    sample_data = gsm.table
    sample_metadata = gsm.metadata

# Access platforms
for gpl_name, gpl in gse.gpls.items():
    platform_table = gpl.table
    platform_metadata = gpl.metadata
```
**Manual Parsing:**
```python
import gzip
def parse_soft_file(filename):
    """Basic SOFT file parser"""
    sections = {}
    current_section = None
    current_metadata = {}
    current_table = []
    in_table = False

    with gzip.open(filename, 'rt') as f:
        for line in f:
            line = line.strip()
            # New section (^SERIES, ^PLATFORM, ^SAMPLE, ...)
            if line.startswith('^'):
                if current_section:
                    sections[current_section] = {
                        'metadata': current_metadata,
                        'table': current_table
                    }
                parts = line[1:].split(' = ')
                current_section = parts[1] if len(parts) > 1 else parts[0]
                current_metadata = {}
                current_table = []
                in_table = False
            # Table delimiters (!..._table_begin / !..._table_end)
            elif line.endswith('_table_begin'):
                in_table = True
            elif line.endswith('_table_end'):
                in_table = False
            # Metadata (key = value; repeated keys become lists)
            elif line.startswith('!'):
                key_value = line[1:].split(' = ', 1)
                if len(key_value) == 2:
                    key, value = key_value
                    if key in current_metadata:
                        if isinstance(current_metadata[key], list):
                            current_metadata[key].append(value)
                        else:
                            current_metadata[key] = [current_metadata[key], value]
                    else:
                        current_metadata[key] = value
            # Table data (column definitions and rows)
            elif line.startswith('#') or in_table:
                current_table.append(line)

    # Flush the final section
    if current_section:
        sections[current_section] = {
            'metadata': current_metadata,
            'table': current_table
        }
    return sections
```
## MINiML File Format
### Overview
MINiML (MIAME Notation in Markup Language) is GEO's XML-based format for data exchange.
### File Structure
```xml
<?xml version="1.0" encoding="UTF-8"?>
<MINiML xmlns="http://www.ncbi.nlm.nih.gov/geo/info/MINiML"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

  <Series iid="GSE123">
    <Status>
      <Submission-Date>2024-01-01</Submission-Date>
      <Release-Date>2024-01-15</Release-Date>
      <Last-Update-Date>2024-01-15</Last-Update-Date>
    </Status>
    <Title>Study Title</Title>
    <Summary>Study description...</Summary>
    <Overall-Design>Experimental design...</Overall-Design>
    <Type>Expression profiling by array</Type>
    <Contributor>
      <Person>
        <First>John</First>
        <Last>Doe</Last>
      </Person>
    </Contributor>
  </Series>

  <Platform iid="GPL123">
    <Title>Platform Name</Title>
    <Distribution>commercial</Distribution>
    <Technology>in situ oligonucleotide</Technology>
    <Organism taxid="9606">Homo sapiens</Organism>
    <Data-Table>
      <Column position="1">
        <Name>ID</Name>
        <Description>Probe identifier</Description>
      </Column>
      <Data>
        <Row>
          <Cell column="1">1007_s_at</Cell>
          <Cell column="2">U48705</Cell>
        </Row>
      </Data>
    </Data-Table>
  </Platform>

  <Sample iid="GSM123">
    <Title>Sample name</Title>
    <Source>cell line XYZ</Source>
    <Organism taxid="9606">Homo sapiens</Organism>
    <Characteristics tag="cell type">epithelial</Characteristics>
    <Characteristics tag="treatment">control</Characteristics>
    <Platform-Ref ref="GPL123"/>
    <Data-Table>
      <Column position="1">
        <Name>ID_REF</Name>
      </Column>
      <Column position="2">
        <Name>VALUE</Name>
      </Column>
      <Data>
        <Row>
          <Cell column="1">1007_s_at</Cell>
          <Cell column="2">8.456</Cell>
        </Row>
      </Data>
    </Data-Table>
  </Sample>

</MINiML>
```
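Because MINiML is namespaced XML, the standard library can parse it directly as long as the namespace URI from the header above is supplied. A minimal sketch (assumes the family `.tgz` has been extracted to a plain `.xml` file):
```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.ncbi.nlm.nih.gov/geo/info/MINiML"}

tree = ET.parse("GSE123456_family.xml")
root = tree.getroot()

for series in root.findall("m:Series", NS):
    print("Series:", series.get("iid"), "-", series.findtext("m:Title", namespaces=NS))

for sample in root.findall("m:Sample", NS):
    # Characteristics carry a 'tag' attribute and a text value
    chars = {c.get("tag"): (c.text or "").strip()
             for c in sample.findall("m:Characteristics", NS)}
    print("Sample:", sample.get("iid"), chars)
```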
## FTP Directory Structure
### Series Files
**Pattern:**
```
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE{nnn}nnn/GSE{xxxxx}/
```
Here `GSE{nnn}nnn` stands for the accession with its last three digits replaced by "nnn", and `GSE{xxxxx}` is the full accession.
**Example:**
- GSE123456 → `/geo/series/GSE123nnn/GSE123456/`
- GSE1234 → `/geo/series/GSE1nnn/GSE1234/`
- GSE100001 → `/geo/series/GSE100nnn/GSE100001/`
**Subdirectories:**
- `/matrix/` - Series matrix files
- `/soft/` - Family SOFT files
- `/miniml/` - MINiML XML files
- `/suppl/` - Supplementary files
**File Types:**
```
matrix/
└── GSE123456_series_matrix.txt.gz
soft/
└── GSE123456_family.soft.gz
miniml/
└── GSE123456_family.xml.tgz
suppl/
├── GSE123456_RAW.tar
├── filelist.txt
└── [various supplementary files]
```
### Sample Files
**Pattern:**
```
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM{nnn}nnn/GSM{xxxxx}/
```
**Subdirectories:**
- `/suppl/` - Sample-specific supplementary files
### Platform Files
**Pattern:**
```
ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL{nnn}nnn/GPL{xxxxx}/
```
**File Types:**
```
soft/
└── GPL570.soft.gz
miniml/
└── GPL570.xml
annot/
└── GPL570.annot.gz # Enhanced annotation (if available)
```
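The `{nnn}` replacement rule is mechanical, so the base directory for any accession can be built with a few lines (function name illustrative):
```python
def geo_ftp_dir(accession):
    """Build the FTP directory for a GSE/GSM/GPL accession per the pattern above."""
    kind = {"GSE": "series", "GSM": "samples", "GPL": "platforms"}[accession[:3]]
    digits = accession[3:]
    stub = digits[:-3] + "nnn"  # last three digits become "nnn"
    return f"ftp://ftp.ncbi.nlm.nih.gov/geo/{kind}/{accession[:3]}{stub}/{accession}/"

print(geo_ftp_dir("GSE123456"))  # .../geo/series/GSE123nnn/GSE123456/
print(geo_ftp_dir("GSE1234"))    # .../geo/series/GSE1nnn/GSE1234/
print(geo_ftp_dir("GPL570"))     # .../geo/platforms/GPLnnn/GPL570/
```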
## Advanced GEOparse Usage
### Custom Parsing Options
```python
import GEOparse
# Parse with custom options
gse = GEOparse.get_GEO(
    geo="GSE123456",
    destdir="./data",
    silent=False,       # Show progress
    how="full",         # Parse mode: "full", "quick", "brief"
    annotate_gpl=True,  # Include platform annotation
    geotype="GSE"       # Explicit type
)

# Access specific sample
gsm = gse.gsms['GSM1234567']

# Get expression values for specific probe
probe_id = "1007_s_at"
if hasattr(gsm, 'table'):
    probe_data = gsm.table[gsm.table['ID_REF'] == probe_id]

# Get all characteristics
characteristics = {}
for key, values in gsm.metadata.items():
    if key.startswith('characteristics'):
        for value in (values if isinstance(values, list) else [values]):
            if ':' in value:
                char_key, char_value = value.split(':', 1)
                characteristics[char_key.strip()] = char_value.strip()
```
### Working with Platform Annotations
```python
import GEOparse
import pandas as pd
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
# Get platform
gpl = list(gse.gpls.values())[0]
# Extract annotation table
if hasattr(gpl, 'table'):
    annotation = gpl.table
    # Common annotation columns:
    # - ID: Probe identifier
    # - Gene Symbol: Gene symbol
    # - Gene Title: Gene description
    # - GB_ACC: GenBank accession
    # - Gene ID: Entrez Gene ID
    # - RefSeq: RefSeq accession
    # - UniGene: UniGene cluster

    # Map probes to genes
    probe_to_gene = dict(zip(
        annotation['ID'],
        annotation['Gene Symbol']
    ))

    # Handle multiple probes per gene
    gene_to_probes = {}
    for probe, gene in probe_to_gene.items():
        if gene and gene != '---':
            if gene not in gene_to_probes:
                gene_to_probes[gene] = []
            gene_to_probes[gene].append(probe)
```
### Handling Large Datasets
```python
import GEOparse
import pandas as pd
import numpy as np
def process_large_gse(gse_id, chunk_size=1000):
    """Process large GEO series in chunks"""
    gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")

    # Get sample list
    sample_list = list(gse.gsms.keys())

    # Process in chunks
    for i in range(0, len(sample_list), chunk_size):
        chunk_samples = sample_list[i:i+chunk_size]

        # Extract data for chunk
        chunk_data = {}
        for gsm_id in chunk_samples:
            gsm = gse.gsms[gsm_id]
            if hasattr(gsm, 'table'):
                chunk_data[gsm_id] = gsm.table['VALUE']

        # Process chunk
        chunk_df = pd.DataFrame(chunk_data)

        # Save chunk results
        chunk_df.to_csv(f"chunk_{i//chunk_size}.csv")
        print(f"Processed {i+len(chunk_samples)}/{len(sample_list)} samples")
```
## Troubleshooting Common Issues
### Issue: GEOparse Fails to Download
**Symptoms:** Timeout errors, connection failures
**Solutions:**
1. Check internet connection
2. Try downloading directly via FTP first
3. Parse local files:
```python
gse = GEOparse.get_GEO(filepath="./local/GSE123456_family.soft.gz")
```
4. Increase timeout (modify GEOparse source if needed)
### Issue: Missing Expression Data
**Symptoms:** `pivot_samples()` fails or returns empty
**Cause:** Not all series have series matrix files (older submissions)
**Solution:** Parse individual sample tables:
```python
import pandas as pd

expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table') and 'VALUE' in gsm.table.columns:
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']
expression_df = pd.DataFrame(expression_data)
```
### Issue: Inconsistent Probe IDs
**Symptoms:** Probe IDs don't match between samples
**Cause:** Different platform versions or sample processing
**Solution:** Standardize using platform annotation:
```python
# Get common probe set
all_probes = set()
for gsm in gse.gsms.values():
    if hasattr(gsm, 'table'):
        all_probes.update(gsm.table['ID_REF'].values)
probe_index = sorted(all_probes)  # reindex needs an ordered, list-like index

# Create standardized matrix
standardized_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table'):
        sample_data = gsm.table.set_index('ID_REF')['VALUE']
        standardized_data[gsm_name] = sample_data.reindex(probe_index)
expression_df = pd.DataFrame(standardized_data)
```
### Issue: E-utilities Rate Limiting
**Symptoms:** HTTP 429 errors, slow responses
**Solution:**
1. Get an API key from NCBI
2. Implement rate limiting:
```python
import time
from functools import wraps
from Bio import Entrez

def rate_limit(calls_per_second=3):
    min_interval = 1.0 / calls_per_second
    def decorator(func):
        last_called = [0.0]
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait_time = min_interval - elapsed
            if wait_time > 0:
                time.sleep(wait_time)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_second=3)
def safe_esearch(query):
    handle = Entrez.esearch(db="gds", term=query)
    results = Entrez.read(handle)
    handle.close()
    return results
```
### Issue: Memory Errors with Large Datasets
**Symptoms:** MemoryError, system slowdown
**Solution:**
1. Process data in chunks
2. Use sparse matrices for expression data
3. Load only necessary columns
4. Use memory-efficient data types:
```python
import pandas as pd
import numpy as np

# Read with specific dtypes
expression_df = pd.read_csv(
    "expression_matrix.csv",
    dtype={'ID': str, 'GSM1': np.float32}  # Use float32 instead of float64
)

# Or use sparse format for mostly-zero data (numeric columns only)
import scipy.sparse as sp
sparse_matrix = sp.csr_matrix(expression_df.select_dtypes(include=[np.number]).values)
```
## Platform-Specific Considerations
### Affymetrix Arrays
- Probe IDs format: `1007_s_at`, `1053_at`
- Multiple probe sets per gene common
- Check for `_at`, `_s_at`, `_x_at` suffixes
- May need RMA or MAS5 normalization
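The multiple-probe-sets-per-gene issue usually has to be resolved before gene-level analysis. One common (not universal) convention is to keep the probe set with the highest mean intensity per gene; a sketch assuming a probes-by-samples `expression_df` and the `probe_to_gene` mapping built in the annotation section above:
```python
import pandas as pd

def collapse_to_genes(expression_df, probe_to_gene):
    """Collapse probe-level rows to one row per gene (highest-mean probe set)."""
    df = expression_df.copy()
    df["__gene"] = [probe_to_gene.get(p) for p in df.index]
    df = df[df["__gene"].notna() & (df["__gene"] != "---")]
    # Rank probe sets within each gene by mean expression, keep the top one
    df["__mean"] = df.drop(columns="__gene").mean(axis=1)
    top = df.sort_values("__mean", ascending=False).drop_duplicates("__gene")
    return top.set_index("__gene").drop(columns="__mean").sort_index()
```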
### Illumina Arrays
- Probe IDs format: `ILMN_1234567`
- Watch for duplicate probes
- BeadChip-specific processing may be needed
### RNA-seq
- May not have traditional "probes"
- Check for gene IDs (Ensembl, Entrez)
- Counts vs. FPKM/TPM values
- May need separate count files
### Two-Channel Arrays
- Look for `_ch1` and `_ch2` suffixes in metadata
- VALUE_ch1, VALUE_ch2 columns
- May need ratio or intensity values
- Check dye-swap experiments
## Best Practices Summary
1. **Always set Entrez.email** before using E-utilities
2. **Use API key** for better rate limits
3. **Cache downloaded files** locally
4. **Check data quality** before analysis
5. **Verify platform annotations** are current
6. **Document data processing** steps
7. **Cite original studies** when using data
8. **Check for batch effects** in meta-analyses
9. **Validate results** with independent datasets
10. **Follow NCBI usage guidelines**