GEO Database Reference Documentation
Complete E-utilities API Specifications
Overview
The NCBI Entrez Programming Utilities (E-utilities) provide programmatic access to GEO metadata through a set of nine server-side programs. All E-utilities return results in XML format by default.
Base URL
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
Core E-utility Programs
eSearch - Text Query to ID List
Purpose: Search a database and return a list of UIDs matching the query.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
Parameters:
- db (required): Database to search (e.g., "gds", "geoprofiles")
- term (required): Search query string
- retmax: Maximum number of UIDs to return (default: 20, max: 10000)
- retstart: Starting position in the result set (for pagination)
- usehistory: Set to "y" to store results on the history server
- sort: Sort order (e.g., "relevance", "pub_date")
- field: Limit search to a specific field
- datetype: Type of date to limit by
- reldate: Limit to items within N days of today
- mindate, maxdate: Date range limits (YYYY/MM/DD)
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Basic search
handle = Entrez.esearch(
    db="gds",
    term="breast cancer AND Homo sapiens",
    retmax=100,
    usehistory="y"
)
results = Entrez.read(handle)
handle.close()
# Results contain:
# - Count: Total number of matches
# - RetMax: Number of UIDs returned
# - RetStart: Starting position
# - IdList: List of UIDs
# - QueryKey: Key for history server (if usehistory="y")
# - WebEnv: Web environment string (if usehistory="y")
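Because retmax caps a single response, larger result sets are usually collected with a retstart loop. A minimal pagination sketch (the helper name esearch_all_ids is illustrative; note that eSearch cannot page past 10,000 records without the history server):
from Bio import Entrez

Entrez.email = "your@email.com"

def esearch_all_ids(term, db="gds", batch_size=500):
    """Collect all matching UIDs by paging with retstart/retmax."""
    ids = []
    retstart = 0
    while True:
        handle = Entrez.esearch(db=db, term=term,
                                retstart=retstart, retmax=batch_size)
        results = Entrez.read(handle)
        handle.close()
        ids.extend(results["IdList"])
        retstart += batch_size
        if retstart >= int(results["Count"]):
            break
    return ids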
eSummary - Document Summaries
Purpose: Retrieve document summaries for a list of UIDs.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi
Parameters:
- db (required): Database
- id (required): Comma-separated list of UIDs, or query_key + WebEnv from the history server
- retmode: Return format ("xml" or "json")
- version: Summary version ("2.0" recommended)
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Get summaries for multiple IDs
handle = Entrez.esummary(
    db="gds",
    id="200000001,200000002",
    retmode="xml",
    version="2.0"
)
summaries = Entrez.read(handle)
handle.close()
# Summary fields for GEO DataSets:
# - Accession: GDS accession
# - title: Dataset title
# - summary: Dataset description
# - PDAT: Publication date
# - n_samples: Number of samples
# - Organism: Source organism
# - PubMedIds: Associated PubMed IDs
eFetch - Full Records
Purpose: Retrieve full records for a list of UIDs.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
Parameters:
- db (required): Database
- id (required): Comma-separated list of UIDs
- retmode: Return format ("xml", "text")
- rettype: Record type (database-specific)
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Fetch records (the gds database only supports plain-text summaries via eFetch)
handle = Entrez.efetch(
    db="gds",
    id="200000001",
    rettype="summary",
    retmode="text"
)
record_text = handle.read()
handle.close()
eLink - Cross-Database Linking
Purpose: Find related records in same or different databases.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi
Parameters:
- dbfrom (required): Source database
- db (required): Target database
- id (required): UID from the source database
- cmd: Link command type
  - "neighbor": Return linked UIDs (default)
  - "neighbor_score": Return scored links
  - "acheck": Check for links
  - "ncheck": Count links
  - "llinks": Return URLs to LinkOut resources
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Find PubMed articles linked to a GEO dataset
handle = Entrez.elink(
    dbfrom="gds",
    db="pubmed",
    id="200000001"
)
links = Entrez.read(handle)
handle.close()
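The parsed result is a nested structure; linked UIDs can be extracted like this (a sketch matching the LinkSet layout that Entrez.read returns):
# Extract linked PubMed IDs from the nested LinkSet structure
pmids = [
    link["Id"]
    for linkset in links
    for linksetdb in linkset.get("LinkSetDb", [])
    for link in linksetdb["Link"]
]
print(pmids)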
ePost - Upload UID List
Purpose: Upload a list of UIDs to the history server for use in subsequent requests.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi
Parameters:
- db (required): Database
- id (required): Comma-separated list of UIDs
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Post large list of IDs
large_id_list = [str(i) for i in range(200000001, 200000101)]
handle = Entrez.epost(db="gds", id=",".join(large_id_list))
result = Entrez.read(handle)
handle.close()
# Use returned QueryKey and WebEnv in subsequent calls
query_key = result["QueryKey"]
webenv = result["WebEnv"]
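The stored set can then be referenced in later calls instead of re-sending the IDs; for example, continuing from the code above:
# Retrieve summaries for the posted IDs via the history server
handle = Entrez.esummary(
    db="gds",
    query_key=query_key,
    webenv=webenv,
    retmax=100
)
summaries = Entrez.read(handle)
handle.close()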
eInfo - Database Information
Purpose: Get information about available databases and their fields.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
Parameters:
- db: Database name (omit to get a list of all databases)
- version: Set to "2.0" for detailed field information
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Get information about gds database
handle = Entrez.einfo(db="gds", version="2.0")
info = Entrez.read(handle)
handle.close()
# Returns:
# - Database description
# - Last update date
# - Record count
# - Available search fields
# - Link information
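The parsed result is a nested dictionary; the searchable fields, for instance, can be listed like this (a short sketch based on the structure Entrez.read returns for eInfo):
# List the searchable fields reported by eInfo
for field in info["DbInfo"]["FieldList"]:
    print(field["Name"], "-", field["FullName"])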
Search Field Qualifiers for GEO
Common search fields for building targeted queries:
General Fields:
- [Accession]: GEO accession number
- [Title]: Dataset title
- [Author]: Author name
- [Organism]: Source organism
- [Entry Type]: Entry type (e.g., "gse", "gsm", "gpl", "gds")
- [Platform]: Platform accession or name
- [PubMed ID]: Associated PubMed ID
Date Fields:
- [Publication Date]: Publication date (YYYY or YYYY/MM/DD)
- [Submission Date]: Submission date
- [Modification Date]: Last modification date
MeSH Terms:
- [MeSH Terms]: Medical Subject Headings
- [MeSH Major Topic]: Major MeSH topics
Study Type Fields:
- [DataSet Type]: Type of study (e.g., "RNA-seq", "ChIP-seq")
- [Sample Type]: Sample type
Example Complex Query:
query = """
("breast cancer"[MeSH Terms] OR "breast neoplasms"[Title]) AND
"Homo sapiens"[Organism] AND
"expression profiling by array"[DataSet Type] AND
2020:2024[Publication Date] AND
GPL570[Platform]
"""
SOFT File Format Specification
Overview
SOFT (Simple Omnibus Format in Text) is GEO's primary data exchange format. Files are structured as key-value pairs with data tables.
File Types
Family SOFT Files:
- Filename: GSExxxxx_family.soft.gz
- Contains: Complete series with all samples and platforms
- Size: Can be very large (hundreds of MB compressed)
- Use: Complete data extraction

Series Matrix Files:
- Filename: GSExxxxx_series_matrix.txt.gz
- Contains: Expression matrix with minimal metadata
- Size: Smaller than family files
- Use: Quick access to expression data

Platform SOFT Files:
- Filename: GPLxxxxx.soft
- Contains: Platform annotation and probe information
- Use: Mapping probes to genes
SOFT File Structure
^DATABASE = GeoMiame
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_web_link = http://www.ncbi.nlm.nih.gov/geo
!Database_email = geo@ncbi.nlm.nih.gov
^SERIES = GSExxxxx
!Series_title = Study Title Here
!Series_summary = Study description and background...
!Series_overall_design = Experimental design...
!Series_type = Expression profiling by array
!Series_pubmed_id = 12345678
!Series_submission_date = Jan 01 2024
!Series_last_update_date = Jan 15 2024
!Series_contributor = John,Doe
!Series_contributor = Jane,Smith
!Series_sample_id = GSMxxxxxx
!Series_sample_id = GSMxxxxxx
^PLATFORM = GPLxxxxx
!Platform_title = Platform Name
!Platform_distribution = commercial or custom
!Platform_organism = Homo sapiens
!Platform_manufacturer = Affymetrix
!Platform_technology = in situ oligonucleotide
!Platform_data_row_count = 54675
#ID = Probe ID
#GB_ACC = GenBank accession
#SPOT_ID = Spot identifier
#Gene Symbol = Gene symbol
#Gene Title = Gene title
!platform_table_begin
ID GB_ACC SPOT_ID Gene Symbol Gene Title
1007_s_at U48705 - DDR1 discoidin domain receptor...
1053_at M87338 - RFC2 replication factor C...
!platform_table_end
^SAMPLE = GSMxxxxxx
!Sample_title = Sample name
!Sample_source_name_ch1 = cell line XYZ
!Sample_organism_ch1 = Homo sapiens
!Sample_characteristics_ch1 = cell type: epithelial
!Sample_characteristics_ch1 = treatment: control
!Sample_molecule_ch1 = total RNA
!Sample_label_ch1 = biotin
!Sample_platform_id = GPLxxxxx
!Sample_data_processing = normalization method
#ID_REF = Probe identifier
#VALUE = Expression value
!sample_table_begin
ID_REF VALUE
1007_s_at 8.456
1053_at 7.234
!sample_table_end
Parsing SOFT Files
With GEOparse:
import GEOparse
# Parse series
gse = GEOparse.get_GEO(filepath="GSE123456_family.soft.gz")
# Access metadata
metadata = gse.metadata
phenotype_data = gse.phenotype_data
# Access samples
for gsm_name, gsm in gse.gsms.items():
    sample_data = gsm.table
    sample_metadata = gsm.metadata

# Access platforms
for gpl_name, gpl in gse.gpls.items():
    platform_table = gpl.table
    platform_metadata = gpl.metadata
Manual Parsing:
import gzip
def parse_soft_file(filename):
    """Basic SOFT file parser."""
    sections = {}
    current_section = None
    current_metadata = {}
    current_table = []
    in_table = False
    with gzip.open(filename, 'rt') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # New entity section (^DATABASE, ^SERIES, ^PLATFORM, ^SAMPLE)
            if line.startswith('^'):
                if current_section:
                    sections[current_section] = {
                        'metadata': current_metadata,
                        'table': current_table
                    }
                parts = line[1:].split(' = ')
                current_section = parts[1] if len(parts) > 1 else parts[0]
                current_metadata = {}
                current_table = []
                in_table = False
            # Metadata lines, including the table begin/end markers
            elif line.startswith('!'):
                if line.endswith('_table_begin'):
                    in_table = True
                elif line.endswith('_table_end'):
                    in_table = False
                else:
                    key_value = line[1:].split(' = ', 1)
                    if len(key_value) == 2:
                        key, value = key_value
                        # Repeated keys (e.g. !Series_sample_id) become lists
                        if key in current_metadata:
                            if isinstance(current_metadata[key], list):
                                current_metadata[key].append(value)
                            else:
                                current_metadata[key] = [current_metadata[key], value]
                        else:
                            current_metadata[key] = value
            # Column descriptions (#) and table rows
            elif line.startswith('#') or in_table:
                current_table.append(line)
    # Flush the final section
    if current_section:
        sections[current_section] = {
            'metadata': current_metadata,
            'table': current_table
        }
    return sections
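Usage (the filename is illustrative):
sections = parse_soft_file("GSE123456_family.soft.gz")
print(list(sections.keys()))  # e.g. series, platform, and sample identifiers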
MINiML File Format
Overview
MINiML (MIAME Notation in Markup Language) is GEO's XML-based format for data exchange.
File Structure
<?xml version="1.0" encoding="UTF-8"?>
<MINiML xmlns="http://www.ncbi.nlm.nih.gov/geo/info/MINiML"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Series iid="GSE123">
<Status>
<Submission-Date>2024-01-01</Submission-Date>
<Release-Date>2024-01-15</Release-Date>
<Last-Update-Date>2024-01-15</Last-Update-Date>
</Status>
<Title>Study Title</Title>
<Summary>Study description...</Summary>
<Overall-Design>Experimental design...</Overall-Design>
<Type>Expression profiling by array</Type>
<Contributor>
<Person>
<First>John</First>
<Last>Doe</Last>
</Person>
</Contributor>
</Series>
<Platform iid="GPL123">
<Title>Platform Name</Title>
<Distribution>commercial</Distribution>
<Technology>in situ oligonucleotide</Technology>
<Organism taxid="9606">Homo sapiens</Organism>
<Data-Table>
<Column position="1">
<Name>ID</Name>
<Description>Probe identifier</Description>
</Column>
<Data>
<Row>
<Cell column="1">1007_s_at</Cell>
<Cell column="2">U48705</Cell>
</Row>
</Data>
</Data-Table>
</Platform>
<Sample iid="GSM123">
<Title>Sample name</Title>
<Source>cell line XYZ</Source>
<Organism taxid="9606">Homo sapiens</Organism>
<Characteristics tag="cell type">epithelial</Characteristics>
<Characteristics tag="treatment">control</Characteristics>
<Platform-Ref ref="GPL123"/>
<Data-Table>
<Column position="1">
<Name>ID_REF</Name>
</Column>
<Column position="2">
<Name>VALUE</Name>
</Column>
<Data>
<Row>
<Cell column="1">1007_s_at</Cell>
<Cell column="2">8.456</Cell>
</Row>
</Data>
</Data-Table>
</Sample>
</MINiML>
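MINiML files can be read with the standard library's ElementTree; a minimal sketch, assuming the XML has already been extracted from the .tgz archive and using the namespace declared above:
import xml.etree.ElementTree as ET

NS = {"m": "http://www.ncbi.nlm.nih.gov/geo/info/MINiML"}

def parse_miniml(xml_path):
    """Pull basic series and sample metadata out of a MINiML file."""
    root = ET.parse(xml_path).getroot()
    series = root.find("m:Series", NS)
    samples = []
    for sample in root.findall("m:Sample", NS):
        samples.append({
            "iid": sample.get("iid"),
            "title": sample.findtext("m:Title", default="", namespaces=NS),
            # Characteristics carry a tag attribute (e.g. "cell type")
            "characteristics": {
                c.get("tag"): (c.text or "").strip()
                for c in sample.findall("m:Characteristics", NS)
            },
        })
    return {
        "series_title": series.findtext("m:Title", default="", namespaces=NS)
        if series is not None else None,
        "samples": samples,
    }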
FTP Directory Structure
Series Files
Pattern:
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE{nnn}nnn/GSE{xxxxx}/
Here GSE{nnn}nnn stands for the accession with its last three digits replaced by the literal string "nnn", and GSE{xxxxx} is the full accession.
Example:
- GSE123456 → /geo/series/GSE123nnn/GSE123456/
- GSE1234 → /geo/series/GSE1nnn/GSE1234/
- GSE100001 → /geo/series/GSE100nnn/GSE100001/
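These paths can be derived mechanically from an accession; a small sketch (the function name is illustrative):
def geo_series_ftp_dir(accession):
    """Return the FTP directory for a series accession,
    replacing its last three digits with 'nnn'."""
    stub = accession[:-3] + "nnn"
    return f"ftp://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{accession}/"

# geo_series_ftp_dir("GSE123456")
# -> "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/"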
Subdirectories:
- /matrix/ - Series matrix files
- /soft/ - Family SOFT files
- /miniml/ - MINiML XML files
- /suppl/ - Supplementary files
File Types:
matrix/
└── GSE123456_series_matrix.txt.gz
soft/
└── GSE123456_family.soft.gz
miniml/
└── GSE123456_family.xml.tgz
suppl/
├── GSE123456_RAW.tar
├── filelist.txt
└── [various supplementary files]
Sample Files
Pattern:
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM{nnn}nnn/GSM{xxxxx}/
Subdirectories:
- /suppl/ - Sample-specific supplementary files
Platform Files
Pattern:
ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL{nnn}nnn/GPL{xxxxx}/
File Types:
soft/
└── GPL570.soft.gz
miniml/
└── GPL570.xml
annot/
└── GPL570.annot.gz # Enhanced annotation (if available)
Advanced GEOparse Usage
Custom Parsing Options
import GEOparse
# Parse with custom options
gse = GEOparse.get_GEO(
    geo="GSE123456",
    destdir="./data",
    silent=False,       # Show progress
    how="full",         # Parse mode: "full", "quick", "brief"
    annotate_gpl=True,  # Include platform annotation
    geotype="GSE"       # Explicit type
)
# Access specific sample
gsm = gse.gsms['GSM1234567']
# Get expression values for specific probe
probe_id = "1007_s_at"
if hasattr(gsm, 'table'):
    probe_data = gsm.table[gsm.table['ID_REF'] == probe_id]

# Get all characteristics
characteristics = {}
for key, values in gsm.metadata.items():
    if key.startswith('characteristics'):
        for value in (values if isinstance(values, list) else [values]):
            if ':' in value:
                char_key, char_value = value.split(':', 1)
                characteristics[char_key.strip()] = char_value.strip()
Working with Platform Annotations
import GEOparse
import pandas as pd
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
# Get platform
gpl = list(gse.gpls.values())[0]
# Extract annotation table
if hasattr(gpl, 'table'):
    annotation = gpl.table
    # Common annotation columns:
    # - ID: Probe identifier
    # - Gene Symbol: Gene symbol
    # - Gene Title: Gene description
    # - GB_ACC: GenBank accession
    # - Gene ID: Entrez Gene ID
    # - RefSeq: RefSeq accession
    # - UniGene: UniGene cluster

    # Map probes to genes
    probe_to_gene = dict(zip(
        annotation['ID'],
        annotation['Gene Symbol']
    ))

    # Handle multiple probes per gene
    gene_to_probes = {}
    for probe, gene in probe_to_gene.items():
        # Skip missing annotations ('---' and NaN placeholders)
        if isinstance(gene, str) and gene != '---':
            gene_to_probes.setdefault(gene, []).append(probe)
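With the probe-to-gene map in hand, a common next step is to collapse probe-level values to gene level, for example by averaging. A sketch, assuming expression_df comes from gse.pivot_samples("VALUE") and annotation is the platform table from above:
# Average probe-level values per gene symbol
expression_df = gse.pivot_samples("VALUE")
gene_symbols = annotation.set_index("ID")["Gene Symbol"].rename("gene")
gene_expression = (
    expression_df.join(gene_symbols)
    .dropna(subset=["gene"])
    .query("gene != '---'")
    .groupby("gene")
    .mean()
)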
Handling Large Datasets
import GEOparse
import pandas as pd
import numpy as np
def process_large_gse(gse_id, chunk_size=1000):
    """Process a large GEO series in chunks of samples."""
    gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")
    # Get sample list
    sample_list = list(gse.gsms.keys())
    # Process in chunks
    for i in range(0, len(sample_list), chunk_size):
        chunk_samples = sample_list[i:i + chunk_size]
        # Extract data for the chunk, aligned on probe ID
        chunk_data = {}
        for gsm_id in chunk_samples:
            gsm = gse.gsms[gsm_id]
            if hasattr(gsm, 'table'):
                chunk_data[gsm_id] = gsm.table.set_index('ID_REF')['VALUE']
        # Process chunk
        chunk_df = pd.DataFrame(chunk_data)
        # Save chunk results
        chunk_df.to_csv(f"chunk_{i // chunk_size}.csv")
        print(f"Processed {i + len(chunk_samples)}/{len(sample_list)} samples")
Troubleshooting Common Issues
Issue: GEOparse Fails to Download
Symptoms: Timeout errors, connection failures
Solutions:
- Check internet connection
- Try downloading directly via FTP first
- Parse local files:
gse = GEOparse.get_GEO(filepath="./local/GSE123456_family.soft.gz")
- Increase timeout (modify GEOparse source if needed)
Issue: Missing Expression Data
Symptoms: pivot_samples() fails or returns empty
Cause: Not all series have series matrix files (older submissions)
Solution: Parse individual sample tables:
import pandas as pd

expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table') and 'VALUE' in gsm.table.columns:
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']
expression_df = pd.DataFrame(expression_data)
Issue: Inconsistent Probe IDs
Symptoms: Probe IDs don't match between samples
Cause: Different platform versions or sample processing
Solution: Standardize using platform annotation:
# Get the union of probes across all samples
all_probes = set()
for gsm in gse.gsms.values():
    if hasattr(gsm, 'table'):
        all_probes.update(gsm.table['ID_REF'].values)

# Create a standardized matrix (reindex needs an ordered sequence, not a set)
probe_index = sorted(all_probes)
standardized_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table'):
        sample_data = gsm.table.set_index('ID_REF')['VALUE']
        standardized_data[gsm_name] = sample_data.reindex(probe_index)
expression_df = pd.DataFrame(standardized_data)
Issue: E-utilities Rate Limiting
Symptoms: HTTP 429 errors, slow responses
Solution:
- Get an API key from NCBI
- Implement rate limiting:
import time
from functools import wraps

from Bio import Entrez

Entrez.email = "your@email.com"

def rate_limit(calls_per_second=3):
    """Decorator that enforces a minimum interval between calls."""
    min_interval = 1.0 / calls_per_second

    def decorator(func):
        last_called = [0.0]

        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait_time = min_interval - elapsed
            if wait_time > 0:
                time.sleep(wait_time)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_second=3)
def safe_esearch(query):
    handle = Entrez.esearch(db="gds", term=query)
    results = Entrez.read(handle)
    handle.close()
    return results
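An API key raises NCBI's limit from 3 to 10 requests per second; in Biopython it is set once alongside the email:
from Bio import Entrez

Entrez.email = "your@email.com"
Entrez.api_key = "YOUR_NCBI_API_KEY"  # obtained from your NCBI account settings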
Issue: Memory Errors with Large Datasets
Symptoms: MemoryError, system slowdown
Solution:
- Process data in chunks
- Use sparse matrices for expression data
- Load only necessary columns
- Use memory-efficient data types:
import numpy as np
import pandas as pd

# Read with specific dtypes and the probe ID as the index
expression_df = pd.read_csv(
    "expression_matrix.csv",
    dtype={'ID': str, 'GSM1': np.float32},  # Use float32 instead of float64
    index_col='ID'
)

# Or use a sparse format for mostly-zero data
import scipy.sparse as sp
sparse_matrix = sp.csr_matrix(expression_df.values)
Platform-Specific Considerations
Affymetrix Arrays
- Probe ID format: 1007_s_at, 1053_at
- Multiple probe sets per gene are common
- Check for _at, _s_at, _x_at suffixes (see the sketch after this list)
- May need RMA or MAS5 normalization
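When ambiguity matters, cross-hybridizing probe sets can be filtered by suffix; a sketch assuming the annotation table from the GEOparse examples above:
# Drop _x_at probe sets, which may cross-hybridize to multiple genes
specific = annotation[~annotation['ID'].astype(str).str.endswith('_x_at')]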
Illumina Arrays
- Probe ID format: ILMN_1234567
- Watch for duplicate probes
- BeadChip-specific processing may be needed
RNA-seq
- May not have traditional "probes"
- Check for gene IDs (Ensembl, Entrez)
- Counts vs. FPKM/TPM values
- May need separate count files
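For sequencing series, expression values often arrive as a supplementary count table rather than GSM data tables; a minimal loading sketch (the filename is hypothetical):
import pandas as pd

# Load a supplementary gene-level count matrix (genes x samples)
counts = pd.read_csv("GSE123456_raw_counts.txt.gz", sep="\t", index_col=0)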
Two-Channel Arrays
- Look for _ch1 and _ch2 suffixes in metadata
- VALUE_ch1, VALUE_ch2 columns
- May need ratio or intensity values
- Check dye-swap experiments
Best Practices Summary
- Always set Entrez.email before using E-utilities
- Use API key for better rate limits
- Cache downloaded files locally
- Check data quality before analysis
- Verify platform annotations are current
- Document data processing steps
- Cite original studies when using data
- Check for batch effects in meta-analyses
- Validate results with independent datasets
- Follow NCBI usage guidelines