---
name: geo-database
description: "Access NCBI GEO for gene expression/genomics data. Search/download microarray and RNA-seq datasets (GSE, GSM, GPL) and retrieve SOFT/Matrix files for transcriptomics and expression analysis."
---

# GEO Database

## Overview

The Gene Expression Omnibus (GEO) is NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments.

## When to Use This Skill

This skill should be used when searching for gene expression datasets, retrieving experimental data, downloading raw and processed files, querying expression profiles, or integrating GEO data into computational analysis workflows.

## Core Capabilities

### 1. Understanding GEO Data Organization

GEO organizes data hierarchically using different accession types:

**Series (GSE):** A complete experiment with a set of related samples
- Example: GSE123456
- Contains experimental design, samples, and overall study information
- Largest organizational unit in GEO
- Current count: 264,928+ series

**Sample (GSM):** A single experimental sample or biological replicate
- Example: GSM987654
- Contains individual sample data, protocols, and metadata
- Linked to platforms and series
- Current count: 8,068,632+ samples

**Platform (GPL):** The microarray or sequencing platform used
- Example: GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array)
- Describes the technology and probe/feature annotations
- Shared across multiple experiments
- Current count: 27,739+ platforms

**DataSet (GDS):** Curated collections with consistent formatting
- Example: GDS5678
- Experimentally comparable samples organized by study design
- Processed for differential analysis
- Subset of GEO data (4,348 curated datasets)
- Ideal for quick comparative analyses

**Profiles:** Gene-specific expression data linked to sequence features
- Queryable by gene name or annotation
- Cross-references to Entrez Gene
- Enables gene-centric searches across all studies

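The four accession prefixes can be distinguished mechanically. A small helper along these lines (an illustrative sketch, not part of GEOparse or any NCBI library) is handy when routing mixed accession lists to the right download logic:

```python
import re

# Map accession prefixes to GEO record types
ACCESSION_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "DataSet",
}

def classify_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = re.fullmatch(r"(GSE|GSM|GPL|GDS)\d+", accession.strip())
    return ACCESSION_TYPES[match.group(1)] if match else None
```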
### 2. Searching GEO Data

**GEO DataSets Search:**

Search for studies by keywords, organism, or experimental conditions:

```python
from Bio import Entrez

# Configure Entrez (required)
Entrez.email = "your.email@example.com"

# Search for datasets
def search_geo_datasets(query, retmax=20):
    """Search the GEO DataSets database."""
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Example searches
results = search_geo_datasets("breast cancer[MeSH] AND Homo sapiens[Organism]")
print(f"Found {results['Count']} datasets")

# Search by specific platform
results = search_geo_datasets("GPL570[Accession]")

# Search by study type
results = search_geo_datasets("expression profiling by array[DataSet Type]")
```

**GEO Profiles Search:**

Find gene-specific expression patterns:

```python
# Search for gene expression profiles
def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100):
    """Search GEO Profiles for a specific gene."""
    query = f"{gene_name}[Gene Name] AND {organism}[Organism]"
    handle = Entrez.esearch(
        db="geoprofiles",
        term=query,
        retmax=retmax
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Find TP53 expression across studies
tp53_results = search_geo_profiles("TP53", organism="Homo sapiens")
print(f"Found {tp53_results['Count']} expression profiles for TP53")
```

**Advanced Search Patterns:**

```python
# Combine multiple search terms
def advanced_geo_search(terms, operator="AND"):
    """Build complex search queries."""
    query = f" {operator} ".join(terms)
    return search_geo_datasets(query)

# Find recent high-throughput sequencing studies
search_terms = [
    "expression profiling by high throughput sequencing[DataSet Type]",
    "Homo sapiens[Organism]",
    "2024[Publication Date]"
]
results = advanced_geo_search(search_terms)

# Search by author and condition
search_terms = [
    "Smith[Author]",
    "diabetes[Disease]"
]
results = advanced_geo_search(search_terms)
```

### 3. Retrieving GEO Data with GEOparse (Recommended)

**GEOparse** is the primary Python library for accessing GEO data:

**Installation:**
```bash
uv pip install GEOparse
```

**Basic Usage:**

```python
import GEOparse

# Download and parse a GEO Series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Access series metadata (metadata values are lists of strings)
print(gse.metadata['title'])
print(gse.metadata['summary'])
print(gse.metadata['overall_design'])

# Access sample information
for gsm_name, gsm in gse.gsms.items():
    print(f"Sample: {gsm_name}")
    print(f"  Title: {gsm.metadata['title'][0]}")
    print(f"  Source: {gsm.metadata['source_name_ch1'][0]}")
    print(f"  Characteristics: {gsm.metadata.get('characteristics_ch1', [])}")

# Access platform information
for gpl_name, gpl in gse.gpls.items():
    print(f"Platform: {gpl_name}")
    print(f"  Title: {gpl.metadata['title'][0]}")
    print(f"  Organism: {gpl.metadata['organism'][0]}")
```

**Working with Expression Data:**

```python
import GEOparse
import pandas as pd

# Get expression data from series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Extract expression matrix
# Method 1: Pivot all sample tables at once (most convenient)
expression_df = gse.pivot_samples('VALUE')
print(expression_df.shape)  # probes/genes x samples

# Method 2: Build the matrix from individual samples
expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table'):
        # Index by probe ID so rows align across samples
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']

expression_df = pd.DataFrame(expression_data)
print(f"Expression matrix: {expression_df.shape}")
```

**Accessing Supplementary Files:**

```python
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Download supplementary files
gse.download_supplementary_files(
    directory="./data/GSE123456_suppl",
    download_sra=False  # Set to True to also download SRA files
)

# List available supplementary files (URLs are stored in sample metadata)
for gsm_name, gsm in gse.gsms.items():
    file_urls = gsm.metadata.get('supplementary_file', [])
    if file_urls:
        print(f"Sample {gsm_name}:")
        for file_url in file_urls:
            print(f"  {file_url}")
```

**Filtering and Subsetting Data:**

```python
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Filter samples by metadata
control_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'control' in gsm.metadata.get('title', [''])[0].lower()
]

treatment_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'treatment' in gsm.metadata.get('title', [''])[0].lower()
]

print(f"Control samples: {len(control_samples)}")
print(f"Treatment samples: {len(treatment_samples)}")

# Extract subset expression matrices
expression_df = gse.pivot_samples('VALUE')
control_expr = expression_df[control_samples]
treatment_expr = expression_df[treatment_samples]
```

### 4. Using NCBI E-utilities for GEO Access

**E-utilities** provide lower-level programmatic access to GEO metadata:

**Basic E-utilities Workflow:**

```python
from Bio import Entrez

Entrez.email = "your.email@example.com"

# Step 1: Search for GEO entries
def search_geo(query, db="gds", retmax=100):
    """Search GEO using E-utilities."""
    handle = Entrez.esearch(
        db=db,
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Step 2: Fetch summaries
def fetch_geo_summaries(id_list, db="gds"):
    """Fetch document summaries for GEO entries."""
    ids = ",".join(id_list)
    handle = Entrez.esummary(db=db, id=ids)
    summaries = Entrez.read(handle)
    handle.close()
    return summaries

# Step 3: Fetch full records
def fetch_geo_records(id_list, db="gds"):
    """Fetch full GEO records."""
    ids = ",".join(id_list)
    handle = Entrez.efetch(db=db, id=ids, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    return records

# Example workflow
search_results = search_geo("breast cancer AND Homo sapiens")
id_list = search_results['IdList'][:5]

summaries = fetch_geo_summaries(id_list)
for summary in summaries:
    print(f"GDS: {summary.get('Accession', 'N/A')}")
    print(f"Title: {summary.get('title', 'N/A')}")
    print(f"Samples: {summary.get('n_samples', 'N/A')}")
    print()
```

**Batch Processing with E-utilities:**

```python
from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

def batch_fetch_geo_metadata(accessions, batch_size=100):
    """Fetch metadata for multiple GEO accessions."""
    results = {}

    for i in range(0, len(accessions), batch_size):
        batch = accessions[i:i + batch_size]

        # Search for each accession
        for accession in batch:
            try:
                query = f"{accession}[Accession]"
                search_handle = Entrez.esearch(db="gds", term=query)
                search_results = Entrez.read(search_handle)
                search_handle.close()

                if search_results['IdList']:
                    # Fetch summary
                    summary_handle = Entrez.esummary(
                        db="gds",
                        id=search_results['IdList'][0]
                    )
                    summary = Entrez.read(summary_handle)
                    summary_handle.close()
                    results[accession] = summary[0]

                # Be polite to NCBI servers: max 3 requests/second
                # without an API key, so sleep after every accession
                time.sleep(0.34)

            except Exception as e:
                print(f"Error fetching {accession}: {e}")

    return results

# Fetch metadata for multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
metadata = batch_fetch_geo_metadata(gse_list)
```

### 5. Direct FTP Access for Data Files

**FTP URLs for GEO Data:**

GEO data can be downloaded directly via FTP:

```python
import ftplib
import os

def download_geo_ftp(accession, file_type="matrix", dest_dir="./data"):
    """Download GEO files via FTP."""
    if not accession.startswith("GSE"):
        raise ValueError("This helper only handles GSE (series) accessions")

    # Construct the FTP path: series are grouped into directories named
    # after the accession with its last three digits replaced by 'nnn'
    # (e.g. GSE123456 lives under /geo/series/GSE123nnn/GSE123456/)
    gse_num = accession[3:]
    base_num = gse_num[:-3] + "nnn"

    # Each file type lives in its own subdirectory
    if file_type == "matrix":
        subdir, filename = "matrix", f"{accession}_series_matrix.txt.gz"
    elif file_type == "soft":
        subdir, filename = "soft", f"{accession}_family.soft.gz"
    elif file_type == "miniml":
        subdir, filename = "miniml", f"{accession}_family.xml.tgz"
    else:
        raise ValueError(f"Unknown file_type: {file_type}")

    ftp_path = f"/geo/series/GSE{base_num}/{accession}/{subdir}/"

    # Connect to FTP server
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd(ftp_path)

    # Download file
    os.makedirs(dest_dir, exist_ok=True)
    local_file = os.path.join(dest_dir, filename)

    with open(local_file, 'wb') as f:
        ftp.retrbinary(f'RETR {filename}', f.write)

    ftp.quit()
    print(f"Downloaded: {local_file}")
    return local_file

# Download series matrix file
download_geo_ftp("GSE123456", file_type="matrix")

# Download SOFT format file
download_geo_ftp("GSE123456", file_type="soft")
```

**Using wget or curl for Downloads:**

```bash
# Download series matrix file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz

# Download all supplementary files for a series
wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/

# Download SOFT format family file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/soft/GSE123456_family.soft.gz
```

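The `GSE123nnn` bucket in these paths follows a simple rule: replace the last three digits of the series number with `nnn`. A tiny pure function (an illustrative sketch; the same directory tree is also served over HTTPS at https://ftp.ncbi.nlm.nih.gov/) keeps that rule in one place:

```python
def geo_series_matrix_url(accession):
    """Build the download URL for a GEO series matrix file.

    GEO groups series into 'buckets' named after the accession with its
    last three digits replaced by 'nnn' (GSE123456 -> GSE123nnn; series
    with three or fewer digits fall into the GSEnnn bucket).
    """
    if not accession.startswith("GSE"):
        raise ValueError("expected a GSE accession")
    number = accession[3:]
    bucket = f"GSE{number[:-3]}nnn"
    return (
        "https://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{bucket}/{accession}/matrix/{accession}_series_matrix.txt.gz"
    )
```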
### 6. Analyzing GEO Data

**Quality Control and Preprocessing:**

```python
import GEOparse
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Check for missing values
print(f"Missing values: {expression_df.isnull().sum().sum()}")

# Log transformation (heuristic: strictly positive values with a large
# dynamic range suggest the data is not yet log-transformed)
if expression_df.min().min() > 0 and expression_df.max().max() > 100:
    expression_df = np.log2(expression_df + 1)
    print("Applied log2 transformation")

# Distribution plots
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
expression_df.plot.box(ax=plt.gca())
plt.title("Expression Distribution per Sample")
plt.xticks(rotation=90)

plt.subplot(1, 2, 2)
expression_df.mean(axis=1).hist(bins=50)
plt.title("Gene Expression Distribution")
plt.xlabel("Average Expression")

plt.tight_layout()
plt.savefig("geo_qc.png", dpi=300, bbox_inches='tight')
```

**Differential Expression Analysis:**

```python
import GEOparse
import pandas as pd
from scipy import stats

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Define sample groups
control_samples = ["GSM1", "GSM2", "GSM3"]
treatment_samples = ["GSM4", "GSM5", "GSM6"]

# Calculate fold changes and p-values
results = []
for gene in expression_df.index:
    control_expr = expression_df.loc[gene, control_samples]
    treatment_expr = expression_df.loc[gene, treatment_samples]

    # The difference of means equals the log2 fold change only if the
    # expression values are already on a log2 scale
    fold_change = treatment_expr.mean() - control_expr.mean()
    t_stat, p_value = stats.ttest_ind(treatment_expr, control_expr)

    results.append({
        'gene': gene,
        'log2_fold_change': fold_change,
        'p_value': p_value,
        'control_mean': control_expr.mean(),
        'treatment_mean': treatment_expr.mean()
    })

# Create results DataFrame
de_results = pd.DataFrame(results)

# Multiple testing correction (Benjamini-Hochberg)
from statsmodels.stats.multitest import multipletests
_, de_results['q_value'], _, _ = multipletests(
    de_results['p_value'],
    method='fdr_bh'
)

# Filter significant genes
significant_genes = de_results[
    (de_results['q_value'] < 0.05) &
    (abs(de_results['log2_fold_change']) > 1)
]

print(f"Significant genes: {len(significant_genes)}")
significant_genes.to_csv("de_results.csv", index=False)
```

**Correlation and Clustering Analysis:**

```python
import GEOparse
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Sample correlation heatmap
sample_corr = expression_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(sample_corr, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title("Sample Correlation Matrix")
plt.tight_layout()
plt.savefig("sample_correlation.png", dpi=300, bbox_inches='tight')

# Hierarchical clustering
distances = pdist(expression_df.T, metric='correlation')
linkage = hierarchy.linkage(distances, method='average')

plt.figure(figsize=(12, 6))
hierarchy.dendrogram(linkage, labels=expression_df.columns)
plt.title("Hierarchical Clustering of Samples")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("sample_clustering.png", dpi=300, bbox_inches='tight')
```

### 7. Batch Processing Multiple Datasets

**Download and Process Multiple Series:**

```python
import GEOparse
import pandas as pd

def batch_download_geo(gse_list, destdir="./geo_data"):
    """Download multiple GEO series."""
    results = {}

    for gse_id in gse_list:
        try:
            print(f"Processing {gse_id}...")
            gse = GEOparse.get_GEO(geo=gse_id, destdir=destdir)

            # Extract key information
            results[gse_id] = {
                'title': gse.metadata.get('title', ['N/A'])[0],
                'organism': gse.metadata.get('organism', ['N/A'])[0],
                'platform': list(gse.gpls.keys())[0] if gse.gpls else 'N/A',
                'num_samples': len(gse.gsms),
                'submission_date': gse.metadata.get('submission_date', ['N/A'])[0]
            }

            # Save expression data
            expr_df = gse.pivot_samples('VALUE')
            expr_df.to_csv(f"{destdir}/{gse_id}_expression.csv")
            results[gse_id]['num_genes'] = len(expr_df)

        except Exception as e:
            print(f"Error processing {gse_id}: {e}")
            results[gse_id] = {'error': str(e)}

    # Save summary
    summary_df = pd.DataFrame(results).T
    summary_df.to_csv(f"{destdir}/batch_summary.csv")

    return results

# Process multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
results = batch_download_geo(gse_list)
```

**Meta-Analysis Across Studies:**

```python
import GEOparse
import pandas as pd

def meta_analysis_geo(gse_list, gene_of_interest):
    """Perform meta-analysis of gene expression across studies."""
    results = []

    for gse_id in gse_list:
        try:
            gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")

            # Get platform annotation
            gpl = list(gse.gpls.values())[0]

            # Find the gene in the platform annotation table
            # (column names such as 'Gene Symbol' and 'ID' vary by platform)
            if hasattr(gpl, 'table'):
                gene_probes = gpl.table[
                    gpl.table['Gene Symbol'].str.contains(
                        gene_of_interest,
                        case=False,
                        na=False
                    )
                ]

                if not gene_probes.empty:
                    expr_df = gse.pivot_samples('VALUE')

                    for probe_id in gene_probes['ID']:
                        if probe_id in expr_df.index:
                            expr_values = expr_df.loc[probe_id]

                            results.append({
                                'study': gse_id,
                                'probe': probe_id,
                                'mean_expression': expr_values.mean(),
                                'std_expression': expr_values.std(),
                                'num_samples': len(expr_values)
                            })

        except Exception as e:
            print(f"Error in {gse_id}: {e}")

    return pd.DataFrame(results)

# Meta-analysis for TP53
gse_studies = ["GSE100001", "GSE100002", "GSE100003"]
meta_results = meta_analysis_geo(gse_studies, "TP53")
print(meta_results)
```

## Installation and Setup

### Python Libraries

```bash
# Primary GEO access library (recommended)
uv pip install GEOparse

# For E-utilities and programmatic NCBI access
uv pip install biopython

# For data analysis
uv pip install pandas numpy scipy

# For visualization
uv pip install matplotlib seaborn

# For statistical analysis
uv pip install statsmodels scikit-learn
```

### Configuration

Set up NCBI E-utilities access:

```python
from Bio import Entrez

# Always set your email (required by NCBI)
Entrez.email = "your.email@example.com"

# Optional: Set an API key for increased rate limits
# Get your API key from: https://www.ncbi.nlm.nih.gov/account/
Entrez.api_key = "your_api_key_here"

# With API key: 10 requests/second
# Without API key: 3 requests/second
```

## Common Use Cases

### Transcriptomics Research
- Download gene expression data for specific conditions
- Compare expression profiles across studies
- Identify differentially expressed genes
- Perform meta-analyses across multiple datasets

### Drug Response Studies
- Analyze gene expression changes after drug treatment
- Identify biomarkers for drug response
- Compare drug effects across cell lines or patients
- Build predictive models for drug sensitivity

### Disease Biology
- Study gene expression in disease vs. normal tissues
- Identify disease-associated expression signatures
- Compare patient subgroups and disease stages
- Correlate expression with clinical outcomes

### Biomarker Discovery
- Screen for diagnostic or prognostic markers
- Validate biomarkers across independent cohorts
- Compare marker performance across platforms
- Integrate expression with clinical data

## Key Concepts

**SOFT (Simple Omnibus Format in Text):** GEO's primary text-based format containing metadata and data tables. Easily parsed by GEOparse.

**MINiML (MIAME Notation in Markup Language):** XML format for GEO data, used for programmatic access and data exchange.

**Series Matrix:** Tab-delimited expression matrix with samples as columns and genes/probes as rows. The fastest format for getting processed expression data.

**MIAME Compliance:** Minimum Information About a Microarray Experiment, the standardized annotation that GEO enforces for all submissions.

**Expression Value Types:** Expression measurements come in different forms (raw signal, normalized, log-transformed). Always check the platform and processing methods.

**Platform Annotation:** Maps probe/feature IDs to genes. Essential for biological interpretation of expression data.

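To make the Series Matrix layout concrete: metadata lines begin with `!`, and the expression table is bracketed by `!series_matrix_table_begin` and `!series_matrix_table_end` markers, so the file can be split with a short parser. The sketch below is illustrative (GEOparse handles this for you) and assumes a well-formed file:

```python
import gzip
import io
import pandas as pd

def read_series_matrix(path):
    """Parse a GEO series matrix file into (metadata lines, expression DataFrame).

    Lines starting with '!' are metadata; the tab-delimited expression
    table sits between the table_begin/table_end markers.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    meta, table_lines, in_table = [], [], False
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("!series_matrix_table_begin"):
                in_table = True
            elif line.startswith("!series_matrix_table_end"):
                in_table = False
            elif in_table:
                table_lines.append(line)
            elif line.startswith("!"):
                meta.append(line.rstrip("\n"))
    df = pd.read_csv(io.StringIO("".join(table_lines)), sep="\t", index_col=0)
    return meta, df
```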
## GEO2R Web Tool

For quick analysis without coding, use GEO2R:

- Web-based statistical analysis tool integrated into GEO
- Accessible at: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSExxxxx
- Performs differential expression analysis
- Generates R scripts for reproducibility
- Useful for exploratory analysis before downloading data

## Rate Limiting and Best Practices

**NCBI E-utilities Rate Limits:**
- Without API key: 3 requests per second
- With API key: 10 requests per second
- Implement delays between requests: `time.sleep(0.34)` (no API key) or `time.sleep(0.1)` (with API key)

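Rather than sprinkling `time.sleep` calls through the code, the delay can be centralized in a small client-side limiter. The class below is an illustrative sketch (not an NCBI or Biopython API) that enforces a minimum interval between successive requests:

```python
import time

class Throttle:
    """Minimal client-side rate limiter: enforce a minimum interval
    between successive calls to wait()."""

    def __init__(self, requests_per_second):
        self.min_interval = 1.0 / requests_per_second
        self._last = 0.0

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# Usage sketch:
# throttle = Throttle(3)   # 3 requests/second without an API key
# for accession in accessions:
#     throttle.wait()
#     ...call Entrez.esearch / Entrez.esummary...
```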
**FTP Access:**
- No enforced per-request rate limits for FTP downloads
- Preferred method for bulk downloads
- Can download entire directories with `wget -r`

**GEOparse Caching:**
- GEOparse automatically caches downloaded files in `destdir`
- Subsequent calls reuse the cached data
- Clean the cache periodically to save disk space

**Optimal Practices:**
- Use GEOparse for series-level access (easiest)
- Use E-utilities for metadata searching and batch queries
- Use FTP for direct file downloads and bulk operations
- Cache data locally to avoid repeated downloads
- Always set `Entrez.email` when using Biopython

## Resources

### references/geo_reference.md

Comprehensive reference documentation covering:
- Detailed E-utilities API specifications and endpoints
- Complete SOFT and MINiML file format documentation
- Advanced GEOparse usage patterns and examples
- FTP directory structure and file naming conventions
- Data processing pipelines and normalization methods
- Troubleshooting common issues and error handling
- Platform-specific considerations and quirks

Consult this reference for in-depth technical details, complex query patterns, or when working with uncommon data formats.

## Important Notes

### Data Quality Considerations

- GEO accepts user-submitted data with varying quality standards
- Always check platform annotation and processing methods
- Verify sample metadata and experimental design
- Be cautious with batch effects across studies
- Consider reprocessing raw data for consistency

### File Size Warnings

- Series matrix files can be large (>1 GB for large studies)
- Supplementary files (e.g., CEL files) can be very large
- Plan for adequate disk space before downloading
- Consider downloading samples incrementally

### Data Usage and Citation

- GEO data is freely available for research use
- Always cite the original studies when using GEO data
- Cite the GEO database itself: Barrett et al. (2013), Nucleic Acids Research
- Check individual datasets for usage restrictions (if any)
- Follow NCBI guidelines for programmatic access

### Common Pitfalls

- Different platforms use different probe IDs (requires annotation mapping)
- Expression values may be raw, normalized, or log-transformed (check metadata)
- Sample metadata can be inconsistently formatted across studies
- Not all series have series matrix files (older submissions)
- Platform annotations may be outdated (genes renamed, IDs deprecated)

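The first pitfall, probe-to-gene mapping, comes up whenever results are compared across platforms. A collapse step along these lines is a common pattern; this is an illustrative sketch (the probe-to-gene dictionary would be built from the GPL annotation table, and the aggregation rule, here a plain per-gene maximum, is a judgment call):

```python
import pandas as pd

def collapse_probes_to_genes(expression_df, probe_to_gene, agg="max"):
    """Collapse a probe-level expression matrix to gene level.

    probe_to_gene maps probe IDs to gene symbols; probes without a
    mapping are dropped, and multiple probes per gene are aggregated
    with `agg` (any pandas aggregation name).
    """
    genes = expression_df.index.map(probe_to_gene.get)
    keep = genes.notna()
    return expression_df[keep].groupby(genes[keep]).agg(agg)
```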
## Additional Resources

- **GEO Website:** https://www.ncbi.nlm.nih.gov/geo/
- **GEO Submission Guidelines:** https://www.ncbi.nlm.nih.gov/geo/info/submission.html
- **GEOparse Documentation:** https://geoparse.readthedocs.io/
- **E-utilities Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25501/
- **GEO FTP Site:** ftp://ftp.ncbi.nlm.nih.gov/geo/
- **GEO2R Tool:** https://www.ncbi.nlm.nih.gov/geo/geo2r/
- **NCBI API Keys:** https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
- **Biopython Tutorial:** https://biopython.org/DIST/docs/tutorial/Tutorial.html
# GEO Database Reference Documentation

## Complete E-utilities API Specifications

### Overview

The NCBI Entrez Programming Utilities (E-utilities) provide programmatic access to GEO metadata through a set of nine server-side programs. All E-utilities return results in XML format by default.

### Base URL

```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
```

### Core E-utility Programs

#### eSearch - Text Query to ID List

**Purpose:** Search a database and return a list of UIDs matching the query.

**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
```

**Parameters:**
- `db` (required): Database to search (e.g., "gds", "geoprofiles")
- `term` (required): Search query string
- `retmax`: Maximum number of UIDs to return (default: 20, max: 10000)
- `retstart`: Starting position in the result set (for pagination)
- `usehistory`: Set to "y" to store results on the history server
- `sort`: Sort order (e.g., "relevance", "pub_date")
- `field`: Limit the search to a specific field
- `datetype`: Type of date to limit by
- `reldate`: Limit to items within N days of today
- `mindate`, `maxdate`: Date range limits (YYYY/MM/DD)

**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"

# Basic search
handle = Entrez.esearch(
    db="gds",
    term="breast cancer AND Homo sapiens",
    retmax=100,
    usehistory="y"
)
results = Entrez.read(handle)
handle.close()

# Results contain:
# - Count: Total number of matches
# - RetMax: Number of UIDs returned
# - RetStart: Starting position
# - IdList: List of UIDs
# - QueryKey: Key for history server (if usehistory="y")
# - WebEnv: Web environment string (if usehistory="y")
```

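Because `retmax` caps each response, paging through a result set larger than one response means repeating the request with increasing `retstart`. The window arithmetic is easy to get wrong, so a tiny helper (illustrative, not part of Biopython) keeps it in one place:

```python
def pagination_windows(total, retmax):
    """Yield (retstart, retmax) pairs that cover `total` results.

    Each pair corresponds to one paged E-utilities call; the final
    window is truncated so the window sizes sum to `total`.
    """
    for start in range(0, total, retmax):
        yield start, min(retmax, total - start)
```

For example, a search reporting `Count` of 25 paged with `retmax=100` per request capped at 10 yields the windows (0, 10), (10, 10), (20, 5).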
#### eSummary - Document Summaries

**Purpose:** Retrieve document summaries for a list of UIDs.

**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi
```

**Parameters:**
- `db` (required): Database
- `id` (required): Comma-separated list of UIDs, or query_key + WebEnv
- `retmode`: Return format ("xml" or "json")
- `version`: Summary version ("2.0" recommended)

**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"

# Get summaries for multiple IDs
handle = Entrez.esummary(
    db="gds",
    id="200000001,200000002",
    retmode="xml",
    version="2.0"
)
summaries = Entrez.read(handle)
handle.close()

# Summary fields for GEO DataSets:
# - Accession: GDS accession
# - title: Dataset title
# - summary: Dataset description
# - PDAT: Publication date
# - n_samples: Number of samples
# - Organism: Source organism
# - PubMedIds: Associated PubMed IDs
```

#### eFetch - Full Records

**Purpose:** Retrieve full records for a list of UIDs.

**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
```

**Parameters:**
- `db` (required): Database
- `id` (required): Comma-separated list of UIDs
- `retmode`: Return format ("xml", "text")
- `rettype`: Record type (database-specific)

**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"

# Fetch full records
handle = Entrez.efetch(
    db="gds",
    id="200000001",
    retmode="xml"
)
records = Entrez.read(handle)
handle.close()
```

#### eLink - Cross-Database Linking
|
||||
|
||||
**Purpose:** Find related records in same or different databases.
|
||||
|
||||
**URL Pattern:**
|
||||
```
|
||||
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `dbfrom` (required): Source database
|
||||
- `db` (required): Target database
|
||||
- `id` (required): UID from source database
|
||||
- `cmd`: Link command type
|
||||
- "neighbor": Return linked UIDs (default)
|
||||
- "neighbor_score": Return scored links
|
||||
- "acheck": Check for links
|
||||
- "ncheck": Count links
|
||||
- "llinks": Return URLs to LinkOut resources
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
from Bio import Entrez
|
||||
Entrez.email = "your@email.com"
|
||||
|
||||
# Find PubMed articles linked to a GEO dataset
|
||||
handle = Entrez.elink(
|
||||
dbfrom="gds",
|
||||
db="pubmed",
|
||||
id="200000001"
|
||||
)
|
||||
links = Entrez.read(handle)
|
||||
handle.close()
|
||||
```
|
||||
|
||||
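The parsed eLink result is a list of link sets, each holding one entry per target database. The nested shape below (`LinkSetDb` → `Link` → `Id`) matches what Biopython's `Entrez.read()` returns for elink, but verify it against a live response; `extract_linked_ids` is a hypothetical helper:

```python
# Sketch: pull linked UIDs (e.g. PubMed IDs) out of a parsed eLink result.
# The nested keys are assumptions based on elink's XML schema.

def extract_linked_ids(links, db_to="pubmed"):
    """Collect UIDs linked into `db_to` from a parsed eLink result."""
    ids = []
    for linkset in links:
        for linkdb in linkset.get("LinkSetDb", []):
            if linkdb.get("DbTo") == db_to:
                ids.extend(link["Id"] for link in linkdb.get("Link", []))
    return ids

# Offline check with a hand-built structure shaped like an eLink response:
sample = [{"DbFrom": "gds",
           "LinkSetDb": [{"DbTo": "pubmed", "LinkName": "gds_pubmed",
                          "Link": [{"Id": "12345678"}, {"Id": "23456789"}]}]}]
print(extract_linked_ids(sample))  # ['12345678', '23456789']
```
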
#### ePost - Upload UID List

**Purpose:** Upload a list of UIDs to the history server for use in subsequent requests.

**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi
```

**Parameters:**
- `db` (required): Database
- `id` (required): Comma-separated list of UIDs

**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"

# Post a large list of IDs
large_id_list = [str(i) for i in range(200000001, 200000101)]
handle = Entrez.epost(db="gds", id=",".join(large_id_list))
result = Entrez.read(handle)
handle.close()

# Use the returned QueryKey and WebEnv in subsequent calls
query_key = result["QueryKey"]
webenv = result["WebEnv"]
```

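After ePost, large result sets are typically retrieved in batches using `retstart`/`retmax` together with the returned `QueryKey`/`WebEnv`. The helper below just computes the batch windows; the commented call sketches how they would combine with the history keys (parameter names follow Biopython's tutorial usage — treat the end-to-end call as unverified):

```python
# Sketch: batch windows for paging through a posted UID list.

def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs covering `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)

# Assumed usage with the QueryKey/WebEnv returned by ePost:
# for retstart, retmax in batch_windows(total=250, batch_size=100):
#     handle = Entrez.esummary(db="gds", query_key=query_key,
#                              webenv=webenv,
#                              retstart=retstart, retmax=retmax)

print(list(batch_windows(250, 100)))  # [(0, 100), (100, 100), (200, 50)]
```
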
#### eInfo - Database Information

**Purpose:** Get information about available databases and their fields.

**URL Pattern:**
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
```

**Parameters:**
- `db`: Database name (omit to get a list of all databases)
- `version`: Set to "2.0" for detailed field information

**Example:**
```python
from Bio import Entrez
Entrez.email = "your@email.com"

# Get information about the gds database
handle = Entrez.einfo(db="gds", version="2.0")
info = Entrez.read(handle)
handle.close()

# Returns:
# - Database description
# - Last update date
# - Record count
# - Available search fields
# - Link information
```

### Search Field Qualifiers for GEO

Common search fields for building targeted queries:

**General Fields:**
- `[Accession]`: GEO accession number
- `[Title]`: Dataset title
- `[Author]`: Author name
- `[Organism]`: Source organism
- `[Entry Type]`: Type of entry (e.g., "Expression profiling by array")
- `[Platform]`: Platform accession or name
- `[PubMed ID]`: Associated PubMed ID

**Date Fields:**
- `[Publication Date]`: Publication date (YYYY or YYYY/MM/DD)
- `[Submission Date]`: Submission date
- `[Modification Date]`: Last modification date

**MeSH Terms:**
- `[MeSH Terms]`: Medical Subject Headings
- `[MeSH Major Topic]`: Major MeSH topics

**Study Type Fields:**
- `[DataSet Type]`: Type of study (e.g., "RNA-seq", "ChIP-seq")
- `[Sample Type]`: Sample type

**Example Complex Query:**
```python
query = (
    "(breast cancer[MeSH] OR breast neoplasms[Title]) AND "
    "Homo sapiens[Organism] AND "
    "expression profiling by array[Entry Type] AND "
    "2020:2024[Publication Date] AND "
    "GPL570[Platform]"
)
```

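Field-qualified terms like those above compose mechanically, so a small helper can build them from keyword arguments. `build_geo_query` is a hypothetical convenience function (not part of Biopython); underscores in keyword names stand in for spaces in field names:

```python
# Sketch: compose a GEO query string from field-qualified terms joined
# with AND. Use explicit strings (as in the example above) for OR groups.

def build_geo_query(**fields):
    """Build an Entrez query; keys use underscores for spaces in field names."""
    terms = []
    for field, value in fields.items():
        terms.append(f"{value}[{field.replace('_', ' ')}]")
    return " AND ".join(terms)

query = build_geo_query(Organism="Homo sapiens",
                        Entry_Type="expression profiling by array")
print(query)
# Homo sapiens[Organism] AND expression profiling by array[Entry Type]
```
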
## SOFT File Format Specification

### Overview

SOFT (Simple Omnibus Format in Text) is GEO's primary data exchange format. Files are structured as key-value pairs with embedded data tables.

### File Types

**Family SOFT Files:**
- Filename: `GSExxxxx_family.soft.gz`
- Contains: Complete series with all samples and platforms
- Size: Can be very large (hundreds of MB compressed)
- Use: Complete data extraction

**Series Matrix Files:**
- Filename: `GSExxxxx_series_matrix.txt.gz`
- Contains: Expression matrix with minimal metadata
- Size: Smaller than family files
- Use: Quick access to expression data

**Platform SOFT Files:**
- Filename: `GPLxxxxx.soft`
- Contains: Platform annotation and probe information
- Use: Mapping probes to genes

### SOFT File Structure

```
^DATABASE = GeoMiame
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_web_link = http://www.ncbi.nlm.nih.gov/geo
!Database_email = geo@ncbi.nlm.nih.gov

^SERIES = GSExxxxx
!Series_title = Study Title Here
!Series_summary = Study description and background...
!Series_overall_design = Experimental design...
!Series_type = Expression profiling by array
!Series_pubmed_id = 12345678
!Series_submission_date = Jan 01 2024
!Series_last_update_date = Jan 15 2024
!Series_contributor = John,Doe
!Series_contributor = Jane,Smith
!Series_sample_id = GSMxxxxxx
!Series_sample_id = GSMxxxxxx

^PLATFORM = GPLxxxxx
!Platform_title = Platform Name
!Platform_distribution = commercial or custom
!Platform_organism = Homo sapiens
!Platform_manufacturer = Affymetrix
!Platform_technology = in situ oligonucleotide
!Platform_data_row_count = 54675
#ID = Probe ID
#GB_ACC = GenBank accession
#SPOT_ID = Spot identifier
#Gene Symbol = Gene symbol
#Gene Title = Gene title
!platform_table_begin
ID         GB_ACC    SPOT_ID    Gene Symbol    Gene Title
1007_s_at  U48705    -          DDR1           discoidin domain receptor...
1053_at    M87338    -          RFC2           replication factor C...
!platform_table_end

^SAMPLE = GSMxxxxxx
!Sample_title = Sample name
!Sample_source_name_ch1 = cell line XYZ
!Sample_organism_ch1 = Homo sapiens
!Sample_characteristics_ch1 = cell type: epithelial
!Sample_characteristics_ch1 = treatment: control
!Sample_molecule_ch1 = total RNA
!Sample_label_ch1 = biotin
!Sample_platform_id = GPLxxxxx
!Sample_data_processing = normalization method
#ID_REF = Probe identifier
#VALUE = Expression value
!sample_table_begin
ID_REF     VALUE
1007_s_at  8.456
1053_at    7.234
!sample_table_end
```

### Parsing SOFT Files

**With GEOparse:**
```python
import GEOparse

# Parse series
gse = GEOparse.get_GEO(filepath="GSE123456_family.soft.gz")

# Access metadata
metadata = gse.metadata
phenotype_data = gse.phenotype_data

# Access samples
for gsm_name, gsm in gse.gsms.items():
    sample_data = gsm.table
    sample_metadata = gsm.metadata

# Access platforms
for gpl_name, gpl in gse.gpls.items():
    platform_table = gpl.table
    platform_metadata = gpl.metadata
```

**Manual Parsing:**
```python
import gzip

def parse_soft_file(filename):
    """Basic SOFT file parser (a sketch; GEOparse is more robust)."""
    sections = {}
    current_section = None
    current_metadata = {}
    current_table = []
    in_table = False

    with gzip.open(filename, 'rt') as f:
        for line in f:
            line = line.rstrip('\n')

            # New section (^SERIES, ^PLATFORM, ^SAMPLE, ...)
            if line.startswith('^'):
                if current_section:
                    sections[current_section] = {
                        'metadata': current_metadata,
                        'table': current_table
                    }
                parts = line[1:].split(' = ')
                current_section = parts[1] if len(parts) > 1 else parts[0]
                current_metadata = {}
                current_table = []
                in_table = False

            # Metadata lines; table begin/end markers also start with '!'
            elif line.startswith('!'):
                if line.endswith('_table_begin'):
                    in_table = True
                    continue
                if line.endswith('_table_end'):
                    in_table = False
                    continue
                key_value = line[1:].split(' = ', 1)
                if len(key_value) == 2:
                    key, value = key_value
                    if key in current_metadata:
                        # Repeated keys (e.g. !Series_sample_id) become lists
                        if isinstance(current_metadata[key], list):
                            current_metadata[key].append(value)
                        else:
                            current_metadata[key] = [current_metadata[key], value]
                    else:
                        current_metadata[key] = value

            # Column descriptions ('#...') and table rows
            elif line.startswith('#') or in_table:
                current_table.append(line)

    # Save the last section in the file
    if current_section:
        sections[current_section] = {
            'metadata': current_metadata,
            'table': current_table
        }

    return sections
```

## MINiML File Format

### Overview

MINiML (MIAME Notation in Markup Language) is GEO's XML-based format for data exchange.

### File Structure

```xml
<?xml version="1.0" encoding="UTF-8"?>
<MINiML xmlns="http://www.ncbi.nlm.nih.gov/geo/info/MINiML"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Series iid="GSE123">
    <Status>
      <Submission-Date>2024-01-01</Submission-Date>
      <Release-Date>2024-01-15</Release-Date>
      <Last-Update-Date>2024-01-15</Last-Update-Date>
    </Status>
    <Title>Study Title</Title>
    <Summary>Study description...</Summary>
    <Overall-Design>Experimental design...</Overall-Design>
    <Type>Expression profiling by array</Type>
    <Contributor>
      <Person>
        <First>John</First>
        <Last>Doe</Last>
      </Person>
    </Contributor>
  </Series>

  <Platform iid="GPL123">
    <Title>Platform Name</Title>
    <Distribution>commercial</Distribution>
    <Technology>in situ oligonucleotide</Technology>
    <Organism taxid="9606">Homo sapiens</Organism>
    <Data-Table>
      <Column position="1">
        <Name>ID</Name>
        <Description>Probe identifier</Description>
      </Column>
      <Data>
        <Row>
          <Cell column="1">1007_s_at</Cell>
          <Cell column="2">U48705</Cell>
        </Row>
      </Data>
    </Data-Table>
  </Platform>

  <Sample iid="GSM123">
    <Title>Sample name</Title>
    <Source>cell line XYZ</Source>
    <Organism taxid="9606">Homo sapiens</Organism>
    <Characteristics tag="cell type">epithelial</Characteristics>
    <Characteristics tag="treatment">control</Characteristics>
    <Platform-Ref ref="GPL123"/>
    <Data-Table>
      <Column position="1">
        <Name>ID_REF</Name>
      </Column>
      <Column position="2">
        <Name>VALUE</Name>
      </Column>
      <Data>
        <Row>
          <Cell column="1">1007_s_at</Cell>
          <Cell column="2">8.456</Cell>
        </Row>
      </Data>
    </Data-Table>
  </Sample>
</MINiML>
```

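MINiML files are plain XML, so the standard-library parser handles them; the only catch is the default namespace, which must be supplied in every tag lookup. A minimal sketch against a trimmed document like the one above:

```python
# Sketch: read Sample elements from a MINiML document with ElementTree,
# mapping the default namespace explicitly.

import xml.etree.ElementTree as ET

NS = {"m": "http://www.ncbi.nlm.nih.gov/geo/info/MINiML"}

xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<MINiML xmlns="http://www.ncbi.nlm.nih.gov/geo/info/MINiML">
  <Sample iid="GSM123">
    <Title>Sample name</Title>
    <Characteristics tag="cell type">epithelial</Characteristics>
  </Sample>
</MINiML>"""

root = ET.fromstring(xml_text)
for sample in root.findall("m:Sample", NS):
    title = sample.findtext("m:Title", namespaces=NS)
    chars = {c.get("tag"): (c.text or "").strip()
             for c in sample.findall("m:Characteristics", NS)}
    print(sample.get("iid"), title, chars)
# GSM123 Sample name {'cell type': 'epithelial'}
```

For real files, swap `ET.fromstring` for `ET.parse` (or `iterparse` on very large documents).
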
## FTP Directory Structure

### Series Files

**Pattern:**
```
ftp://ftp.ncbi.nlm.nih.gov/geo/series/<range>/<accession>/
```

where `<accession>` is the full accession and `<range>` is the accession with its last three digits replaced by the literal string "nnn".

**Examples:**
- GSE123456 → `/geo/series/GSE123nnn/GSE123456/`
- GSE1234 → `/geo/series/GSE1nnn/GSE1234/`
- GSE100001 → `/geo/series/GSE100nnn/GSE100001/`

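The "nnn" convention is easy to mechanize. `geo_ftp_dir` below is a hypothetical helper (not part of any GEO library) that builds the directory path for series, sample, and platform accessions alike:

```python
# Sketch: build the GEO FTP directory path for a GSE/GSM/GPL accession
# by replacing the last three digits with the literal "nnn".

def geo_ftp_dir(accession):
    """Return the FTP directory path for a GSE, GSM, or GPL accession."""
    prefix = accession[:3].upper()                    # GSE, GSM, or GPL
    digits = accession[3:]
    stub = digits[:-3] if len(digits) > 3 else ""     # e.g. "123456" -> "123"
    kind = {"GSE": "series", "GSM": "samples", "GPL": "platforms"}[prefix]
    return f"/geo/{kind}/{prefix}{stub}nnn/{accession}/"

print(geo_ftp_dir("GSE123456"))  # /geo/series/GSE123nnn/GSE123456/
print(geo_ftp_dir("GPL570"))     # /geo/platforms/GPLnnn/GPL570/
```

Prepend `ftp://ftp.ncbi.nlm.nih.gov` (or the HTTPS mirror `https://ftp.ncbi.nlm.nih.gov`) to get a full URL.
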
**Subdirectories:**
- `/matrix/` - Series matrix files
- `/soft/` - Family SOFT files
- `/miniml/` - MINiML XML files
- `/suppl/` - Supplementary files

**File Types:**
```
matrix/
└── GSE123456_series_matrix.txt.gz

soft/
└── GSE123456_family.soft.gz

miniml/
└── GSE123456_family.xml.tgz

suppl/
├── GSE123456_RAW.tar
├── filelist.txt
└── [various supplementary files]
```

### Sample Files

**Pattern:**
```
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMxxxnnn/GSMxxxxxx/
```
(same "nnn" range convention as series directories)

**Subdirectories:**
- `/suppl/` - Sample-specific supplementary files

### Platform Files

**Pattern:**
```
ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPLxxnnn/GPLxxxxx/
```

**File Types:**
```
soft/
└── GPL570.soft.gz

miniml/
└── GPL570.xml

annot/
└── GPL570.annot.gz  # Enhanced annotation (if available)
```

## Advanced GEOparse Usage

### Custom Parsing Options

```python
import GEOparse

# Parse with custom options
gse = GEOparse.get_GEO(
    geo="GSE123456",
    destdir="./data",
    silent=False,         # Show progress
    how="full",           # Parse mode: "full", "quick", "brief"
    annotate_gpl=True,    # Include platform annotation
    geotype="GSE"         # Explicit type
)

# Access a specific sample
gsm = gse.gsms['GSM1234567']

# Get expression values for a specific probe
probe_id = "1007_s_at"
if hasattr(gsm, 'table'):
    probe_data = gsm.table[gsm.table['ID_REF'] == probe_id]

# Collect all sample characteristics into a dict
characteristics = {}
for key, values in gsm.metadata.items():
    if key.startswith('characteristics'):
        for value in (values if isinstance(values, list) else [values]):
            if ':' in value:
                char_key, char_value = value.split(':', 1)
                characteristics[char_key.strip()] = char_value.strip()
```

### Working with Platform Annotations

```python
import GEOparse
import pandas as pd

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Get the platform
gpl = list(gse.gpls.values())[0]

# Extract the annotation table
if hasattr(gpl, 'table'):
    annotation = gpl.table

    # Common annotation columns:
    # - ID: Probe identifier
    # - Gene Symbol: Gene symbol
    # - Gene Title: Gene description
    # - GB_ACC: GenBank accession
    # - Gene ID: Entrez Gene ID
    # - RefSeq: RefSeq accession
    # - UniGene: UniGene cluster

    # Map probes to genes
    probe_to_gene = dict(zip(
        annotation['ID'],
        annotation['Gene Symbol']
    ))

    # Handle multiple probes per gene
    gene_to_probes = {}
    for probe, gene in probe_to_gene.items():
        if gene and gene != '---':
            gene_to_probes.setdefault(gene, []).append(probe)
```

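When several probes map to one gene, downstream analyses usually collapse the expression matrix to one row per gene. A common heuristic — one of several; median or maximum variance are also used — keeps the probe with the highest mean signal. `collapse_to_genes` is a hypothetical helper sketching that approach:

```python
# Sketch: collapse a probes-x-samples matrix to genes, keeping the probe
# with the highest mean expression per gene.

import pandas as pd

def collapse_to_genes(expr, probe_to_gene):
    """Collapse probe rows to gene rows by maximum mean expression."""
    df = expr.copy()
    df["gene"] = df.index.map(probe_to_gene)
    df = df.dropna(subset=["gene"])
    df["mean_expr"] = df.drop(columns="gene").mean(axis=1)
    # Sort ascending, then keep the last (highest-mean) probe per gene
    best = df.sort_values("mean_expr").groupby("gene").tail(1)
    return best.set_index("gene").drop(columns="mean_expr")

# Toy data: two probes for DDR1, one for RFC2
expr = pd.DataFrame({"GSM1": [8.4, 7.2, 9.1], "GSM2": [8.0, 7.5, 9.3]},
                    index=["1007_s_at", "1053_at", "1007_x_at"])
mapping = {"1007_s_at": "DDR1", "1007_x_at": "DDR1", "1053_at": "RFC2"}
print(collapse_to_genes(expr, mapping))
```

The dictionary built as `probe_to_gene` in the annotation example above plugs in directly as the mapping.
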
### Handling Large Datasets

```python
import GEOparse
import pandas as pd

def process_large_gse(gse_id, chunk_size=1000):
    """Process a large GEO series in chunks of samples."""
    gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")

    # Get the sample list
    sample_list = list(gse.gsms.keys())

    # Process in chunks
    for i in range(0, len(sample_list), chunk_size):
        chunk_samples = sample_list[i:i+chunk_size]

        # Extract data for this chunk
        chunk_data = {}
        for gsm_id in chunk_samples:
            gsm = gse.gsms[gsm_id]
            if hasattr(gsm, 'table'):
                chunk_data[gsm_id] = gsm.table['VALUE']

        # Assemble and save chunk results
        chunk_df = pd.DataFrame(chunk_data)
        chunk_df.to_csv(f"chunk_{i//chunk_size}.csv")

        print(f"Processed {i+len(chunk_samples)}/{len(sample_list)} samples")
```

## Troubleshooting Common Issues

### Issue: GEOparse Fails to Download

**Symptoms:** Timeout errors, connection failures

**Solutions:**
1. Check the internet connection
2. Try downloading directly via FTP first
3. Parse local files:
   ```python
   gse = GEOparse.get_GEO(filepath="./local/GSE123456_family.soft.gz")
   ```
4. Increase the timeout (modify the GEOparse source if needed)

### Issue: Missing Expression Data

**Symptoms:** `pivot_samples()` fails or returns empty

**Cause:** Not all series have series matrix files (especially older submissions)

**Solution:** Parse the individual sample tables:
```python
import pandas as pd

expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table') and 'VALUE' in gsm.table.columns:
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']

expression_df = pd.DataFrame(expression_data)
```

### Issue: Inconsistent Probe IDs

**Symptoms:** Probe IDs don't match between samples

**Cause:** Different platform versions or sample processing

**Solution:** Standardize on the union of probes across samples:
```python
import pandas as pd

# Get the union of all probes
all_probes = set()
for gsm in gse.gsms.values():
    if hasattr(gsm, 'table'):
        all_probes.update(gsm.table['ID_REF'].values)
all_probes = sorted(all_probes)  # reindex needs an ordered, list-like index

# Create a standardized matrix (missing probes become NaN)
standardized_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table'):
        sample_data = gsm.table.set_index('ID_REF')['VALUE']
        standardized_data[gsm_name] = sample_data.reindex(all_probes)

expression_df = pd.DataFrame(standardized_data)
```

### Issue: E-utilities Rate Limiting

**Symptoms:** HTTP 429 errors, slow responses

**Solution:**
1. Get an API key from NCBI (raises the limit from 3 to 10 requests per second)
2. Implement client-side rate limiting:
```python
import time
from functools import wraps

from Bio import Entrez

Entrez.email = "your@email.com"

def rate_limit(calls_per_second=3):
    min_interval = 1.0 / calls_per_second

    def decorator(func):
        last_called = [0.0]

        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait_time = min_interval - elapsed
            if wait_time > 0:
                time.sleep(wait_time)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_second=3)
def safe_esearch(query):
    handle = Entrez.esearch(db="gds", term=query)
    results = Entrez.read(handle)
    handle.close()
    return results
```

### Issue: Memory Errors with Large Datasets

**Symptoms:** MemoryError, system slowdown

**Solution:**
1. Process data in chunks
2. Use sparse matrices for expression data
3. Load only the necessary columns
4. Use memory-efficient data types:
```python
import numpy as np
import pandas as pd

# Read with specific dtypes
expression_df = pd.read_csv(
    "expression_matrix.csv",
    dtype={'ID': str, 'GSM1': np.float32}  # float32 halves memory vs. float64
)

# Or use a sparse format for mostly-zero data (numeric columns only)
import scipy.sparse as sp
sparse_matrix = sp.csr_matrix(expression_df.drop(columns='ID').values)
```

## Platform-Specific Considerations

### Affymetrix Arrays

- Probe ID format: `1007_s_at`, `1053_at`
- Multiple probe sets per gene are common
- Check for `_at`, `_s_at`, `_x_at` suffixes
- May need RMA or MAS5 normalization

### Illumina Arrays

- Probe ID format: `ILMN_1234567`
- Watch for duplicate probes
- BeadChip-specific processing may be needed

### RNA-seq

- May not have traditional "probes"
- Check for gene IDs (Ensembl, Entrez)
- Counts vs. FPKM/TPM values
- May need separate count files

### Two-Channel Arrays

- Look for `_ch1` and `_ch2` suffixes in metadata
- VALUE_ch1, VALUE_ch2 columns
- May need ratio or intensity values
- Check for dye-swap experiments

## Best Practices Summary

1. **Always set Entrez.email** before using E-utilities
2. **Use an API key** for better rate limits
3. **Cache downloaded files** locally
4. **Check data quality** before analysis
5. **Verify platform annotations** are current
6. **Document data processing** steps
7. **Cite original studies** when using data
8. **Check for batch effects** in meta-analyses
9. **Validate results** with independent datasets
10. **Follow NCBI usage guidelines**