GEO Database Reference Documentation
Complete E-utilities API Specifications
Overview
The NCBI Entrez Programming Utilities (E-utilities) provide programmatic access to GEO metadata through a set of nine server-side programs. All E-utilities return results in XML format by default.
Base URL
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
Core E-utility Programs
eSearch - Text Query to ID List
Purpose: Search a database and return a list of UIDs matching the query.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
Parameters:
- db (required): Database to search (e.g., "gds", "geoprofiles")
- term (required): Search query string
- retmax: Maximum number of UIDs to return (default: 20, max: 10000)
- retstart: Starting position in the result set (for pagination)
- usehistory: Set to "y" to store results on the history server
- sort: Sort order (e.g., "relevance", "pub_date")
- field: Limit search to a specific field
- datetype: Type of date to limit by
- reldate: Limit to items within N days of today
- mindate, maxdate: Date range limits (YYYY/MM/DD)
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Basic search
handle = Entrez.esearch(
    db="gds",
    term="breast cancer AND Homo sapiens",
    retmax=100,
    usehistory="y"
)
results = Entrez.read(handle)
handle.close()
# Results contain:
# - Count: Total number of matches
# - RetMax: Number of UIDs returned
# - RetStart: Starting position
# - IdList: List of UIDs
# - QueryKey: Key for history server (if usehistory="y")
# - WebEnv: Web environment string (if usehistory="y")
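Because retmax caps a single response, larger result sets are usually collected with a retstart loop. A minimal pagination sketch (the helper name esearch_all_ids is illustrative; note that eSearch cannot page past 10,000 records without the history server):
from Bio import Entrez

Entrez.email = "your@email.com"

def esearch_all_ids(term, db="gds", batch_size=500):
    """Collect all matching UIDs by paging with retstart/retmax."""
    ids = []
    retstart = 0
    while True:
        handle = Entrez.esearch(db=db, term=term,
                                retstart=retstart, retmax=batch_size)
        results = Entrez.read(handle)
        handle.close()
        ids.extend(results["IdList"])
        retstart += batch_size
        if retstart >= int(results["Count"]):
            break
    return ids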
eSummary - Document Summaries
Purpose: Retrieve document summaries for a list of UIDs.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi
Parameters:
- db (required): Database
- id (required): Comma-separated list of UIDs, or query_key + WebEnv from the history server
- retmode: Return format ("xml" or "json")
- version: Summary version ("2.0" recommended)
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Get summaries for multiple IDs
handle = Entrez.esummary(
    db="gds",
    id="200000001,200000002",
    retmode="xml",
    version="2.0"
)
summaries = Entrez.read(handle)
handle.close()
# Summary fields for GEO DataSets:
# - Accession: GDS accession
# - title: Dataset title
# - summary: Dataset description
# - PDAT: Publication date
# - n_samples: Number of samples
# - Organism: Source organism
# - PubMedIds: Associated PubMed IDs
eFetch - Full Records
Purpose: Retrieve full records for a list of UIDs.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
Parameters:
- db (required): Database
- id (required): Comma-separated list of UIDs
- retmode: Return format ("xml", "text")
- rettype: Record type (database-specific)
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Fetch records (the gds database only supports plain-text summaries via eFetch)
handle = Entrez.efetch(
    db="gds",
    id="200000001",
    rettype="summary",
    retmode="text"
)
record_text = handle.read()
handle.close()
eLink - Cross-Database Linking
Purpose: Find related records in same or different databases.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi
Parameters:
- dbfrom (required): Source database
- db (required): Target database
- id (required): UID from the source database
- cmd: Link command type
  - "neighbor": Return linked UIDs (default)
  - "neighbor_score": Return scored links
  - "acheck": Check for links
  - "ncheck": Count links
  - "llinks": Return URLs to LinkOut resources
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Find PubMed articles linked to a GEO dataset
handle = Entrez.elink(
    dbfrom="gds",
    db="pubmed",
    id="200000001"
)
links = Entrez.read(handle)
handle.close()
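The parsed result is a nested structure; linked UIDs can be extracted like this (a sketch matching the LinkSet layout that Entrez.read returns):
# Extract linked PubMed IDs from the nested LinkSet structure
pmids = [
    link["Id"]
    for linkset in links
    for linksetdb in linkset.get("LinkSetDb", [])
    for link in linksetdb["Link"]
]
print(pmids)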
ePost - Upload UID List
Purpose: Upload a list of UIDs to the history server for use in subsequent requests.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi
Parameters:
- db (required): Database
- id (required): Comma-separated list of UIDs
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Post large list of IDs
large_id_list = [str(i) for i in range(200000001, 200000101)]
handle = Entrez.epost(db="gds", id=",".join(large_id_list))
result = Entrez.read(handle)
handle.close()
# Use returned QueryKey and WebEnv in subsequent calls
query_key = result["QueryKey"]
webenv = result["WebEnv"]
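The stored set can then be referenced in later calls instead of re-sending the IDs; for example, continuing from the code above:
# Retrieve summaries for the posted IDs via the history server
handle = Entrez.esummary(
    db="gds",
    query_key=query_key,
    webenv=webenv,
    retmax=100
)
summaries = Entrez.read(handle)
handle.close()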
eInfo - Database Information
Purpose: Get information about available databases and their fields.
URL Pattern:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
Parameters:
- db: Database name (omit to get a list of all databases)
- version: Set to "2.0" for detailed field information
Example:
from Bio import Entrez
Entrez.email = "your@email.com"
# Get information about gds database
handle = Entrez.einfo(db="gds", version="2.0")
info = Entrez.read(handle)
handle.close()
# Returns:
# - Database description
# - Last update date
# - Record count
# - Available search fields
# - Link information
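The parsed result is a nested dictionary; the searchable fields, for instance, can be listed like this (a short sketch based on the structure Entrez.read returns for eInfo):
# List the searchable fields reported by eInfo
for field in info["DbInfo"]["FieldList"]:
    print(field["Name"], "-", field["FullName"])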
Search Field Qualifiers for GEO
Common search fields for building targeted queries:
General Fields:
- [Accession]: GEO accession number
- [Title]: Dataset title
- [Author]: Author name
- [Organism]: Source organism
- [Entry Type]: Entry type (e.g., "gse", "gsm", "gpl", "gds")
- [Platform]: Platform accession or name
- [PubMed ID]: Associated PubMed ID
Date Fields:
- [Publication Date]: Publication date (YYYY or YYYY/MM/DD)
- [Submission Date]: Submission date
- [Modification Date]: Last modification date
MeSH Terms:
- [MeSH Terms]: Medical Subject Headings
- [MeSH Major Topic]: Major MeSH topics
Study Type Fields:
- [DataSet Type]: Type of study (e.g., "RNA-seq", "ChIP-seq")
- [Sample Type]: Sample type
Example Complex Query:
query = """
("breast cancer"[MeSH Terms] OR "breast neoplasms"[Title]) AND
"Homo sapiens"[Organism] AND
"expression profiling by array"[DataSet Type] AND
2020:2024[Publication Date] AND
GPL570[Platform]
"""
SOFT File Format Specification
Overview
SOFT (Simple Omnibus Format in Text) is GEO's primary data exchange format. Files are structured as key-value pairs with data tables.
File Types
Family SOFT Files:
- Filename: GSExxxxx_family.soft.gz
- Contains: Complete series with all samples and platforms
- Size: Can be very large (hundreds of MB compressed)
- Use: Complete data extraction

Series Matrix Files:
- Filename: GSExxxxx_series_matrix.txt.gz
- Contains: Expression matrix with minimal metadata
- Size: Smaller than family files
- Use: Quick access to expression data

Platform SOFT Files:
- Filename: GPLxxxxx.soft
- Contains: Platform annotation and probe information
- Use: Mapping probes to genes
SOFT File Structure
^DATABASE = GeoMiame
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_web_link = http://www.ncbi.nlm.nih.gov/geo
!Database_email = geo@ncbi.nlm.nih.gov
^SERIES = GSExxxxx
!Series_title = Study Title Here
!Series_summary = Study description and background...
!Series_overall_design = Experimental design...
!Series_type = Expression profiling by array
!Series_pubmed_id = 12345678
!Series_submission_date = Jan 01 2024
!Series_last_update_date = Jan 15 2024
!Series_contributor = John,Doe
!Series_contributor = Jane,Smith
!Series_sample_id = GSMxxxxxx
!Series_sample_id = GSMxxxxxx
^PLATFORM = GPLxxxxx
!Platform_title = Platform Name
!Platform_distribution = commercial or custom
!Platform_organism = Homo sapiens
!Platform_manufacturer = Affymetrix
!Platform_technology = in situ oligonucleotide
!Platform_data_row_count = 54675
#ID = Probe ID
#GB_ACC = GenBank accession
#SPOT_ID = Spot identifier
#Gene Symbol = Gene symbol
#Gene Title = Gene title
!platform_table_begin
ID GB_ACC SPOT_ID Gene Symbol Gene Title
1007_s_at U48705 - DDR1 discoidin domain receptor...
1053_at M87338 - RFC2 replication factor C...
!platform_table_end
^SAMPLE = GSMxxxxxx
!Sample_title = Sample name
!Sample_source_name_ch1 = cell line XYZ
!Sample_organism_ch1 = Homo sapiens
!Sample_characteristics_ch1 = cell type: epithelial
!Sample_characteristics_ch1 = treatment: control
!Sample_molecule_ch1 = total RNA
!Sample_label_ch1 = biotin
!Sample_platform_id = GPLxxxxx
!Sample_data_processing = normalization method
#ID_REF = Probe identifier
#VALUE = Expression value
!sample_table_begin
ID_REF VALUE
1007_s_at 8.456
1053_at 7.234
!sample_table_end
Parsing SOFT Files
With GEOparse:
import GEOparse
# Parse series
gse = GEOparse.get_GEO(filepath="GSE123456_family.soft.gz")
# Access metadata
metadata = gse.metadata
phenotype_data = gse.phenotype_data
# Access samples
for gsm_name, gsm in gse.gsms.items():
    sample_data = gsm.table
    sample_metadata = gsm.metadata

# Access platforms
for gpl_name, gpl in gse.gpls.items():
    platform_table = gpl.table
    platform_metadata = gpl.metadata
Manual Parsing:
import gzip
def parse_soft_file(filename):
    """Basic SOFT file parser."""
    sections = {}
    current_section = None
    current_metadata = {}
    current_table = []
    in_table = False
    with gzip.open(filename, 'rt') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # New entity section (^DATABASE, ^SERIES, ^PLATFORM, ^SAMPLE)
            if line.startswith('^'):
                if current_section:
                    sections[current_section] = {
                        'metadata': current_metadata,
                        'table': current_table
                    }
                parts = line[1:].split(' = ')
                current_section = parts[1] if len(parts) > 1 else parts[0]
                current_metadata = {}
                current_table = []
                in_table = False
            # Metadata lines, including the table begin/end markers
            elif line.startswith('!'):
                if line.endswith('_table_begin'):
                    in_table = True
                elif line.endswith('_table_end'):
                    in_table = False
                else:
                    key_value = line[1:].split(' = ', 1)
                    if len(key_value) == 2:
                        key, value = key_value
                        # Repeated keys (e.g. !Series_sample_id) become lists
                        if key in current_metadata:
                            if isinstance(current_metadata[key], list):
                                current_metadata[key].append(value)
                            else:
                                current_metadata[key] = [current_metadata[key], value]
                        else:
                            current_metadata[key] = value
            # Column descriptions (#) and table rows
            elif line.startswith('#') or in_table:
                current_table.append(line)
    # Flush the final section
    if current_section:
        sections[current_section] = {
            'metadata': current_metadata,
            'table': current_table
        }
    return sections
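Usage (the filename is illustrative):
sections = parse_soft_file("GSE123456_family.soft.gz")
print(list(sections.keys()))  # e.g. series, platform, and sample identifiers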
MINiML File Format
Overview
MINiML (MIAME Notation in Markup Language) is GEO's XML-based format for data exchange.
File Structure
<?xml version="1.0" encoding="UTF-8"?>
<MINiML xmlns="http://www.ncbi.nlm.nih.gov/geo/info/MINiML"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Series iid="GSE123">
<Status>
<Submission-Date>2024-01-01</Submission-Date>
<Release-Date>2024-01-15</Release-Date>
<Last-Update-Date>2024-01-15</Last-Update-Date>
</Status>
<Title>Study Title</Title>
<Summary>Study description...</Summary>
<Overall-Design>Experimental design...</Overall-Design>
<Type>Expression profiling by array</Type>
<Contributor>
<Person>
<First>John</First>
<Last>Doe</Last>
</Person>
</Contributor>
</Series>
<Platform iid="GPL123">
<Title>Platform Name</Title>
<Distribution>commercial</Distribution>
<Technology>in situ oligonucleotide</Technology>
<Organism taxid="9606">Homo sapiens</Organism>
<Data-Table>
<Column position="1">
<Name>ID</Name>
<Description>Probe identifier</Description>
</Column>
<Data>
<Row>
<Cell column="1">1007_s_at</Cell>
<Cell column="2">U48705</Cell>
</Row>
</Data>
</Data-Table>
</Platform>
<Sample iid="GSM123">
<Title>Sample name</Title>
<Source>cell line XYZ</Source>
<Organism taxid="9606">Homo sapiens</Organism>
<Characteristics tag="cell type">epithelial</Characteristics>
<Characteristics tag="treatment">control</Characteristics>
<Platform-Ref ref="GPL123"/>
<Data-Table>
<Column position="1">
<Name>ID_REF</Name>
</Column>
<Column position="2">
<Name>VALUE</Name>
</Column>
<Data>
<Row>
<Cell column="1">1007_s_at</Cell>
<Cell column="2">8.456</Cell>
</Row>
</Data>
</Data-Table>
</Sample>
</MINiML>
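MINiML files can be read with the standard library's ElementTree; a minimal sketch, assuming the XML has already been extracted from the .tgz archive and using the namespace declared above:
import xml.etree.ElementTree as ET

NS = {"m": "http://www.ncbi.nlm.nih.gov/geo/info/MINiML"}

def parse_miniml(xml_path):
    """Pull basic series and sample metadata out of a MINiML file."""
    root = ET.parse(xml_path).getroot()
    series = root.find("m:Series", NS)
    samples = []
    for sample in root.findall("m:Sample", NS):
        samples.append({
            "iid": sample.get("iid"),
            "title": sample.findtext("m:Title", default="", namespaces=NS),
            # Characteristics carry a tag attribute (e.g. "cell type")
            "characteristics": {
                c.get("tag"): (c.text or "").strip()
                for c in sample.findall("m:Characteristics", NS)
            },
        })
    return {
        "series_title": series.findtext("m:Title", default="", namespaces=NS)
        if series is not None else None,
        "samples": samples,
    }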
FTP Directory Structure
Series Files
Pattern:
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE{nnn}nnn/GSE{xxxxx}/
Here GSE{nnn}nnn stands for the accession with its last three digits replaced by the literal string "nnn", and GSE{xxxxx} is the full accession.
Example:
- GSE123456 → /geo/series/GSE123nnn/GSE123456/
- GSE1234 → /geo/series/GSE1nnn/GSE1234/
- GSE100001 → /geo/series/GSE100nnn/GSE100001/
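These paths can be derived mechanically from an accession; a small sketch (the function name is illustrative):
def geo_series_ftp_dir(accession):
    """Return the FTP directory for a series accession,
    replacing its last three digits with 'nnn'."""
    stub = accession[:-3] + "nnn"
    return f"ftp://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{accession}/"

# geo_series_ftp_dir("GSE123456")
# -> "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/"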
Subdirectories:
- /matrix/ - Series matrix files
- /soft/ - Family SOFT files
- /miniml/ - MINiML XML files
- /suppl/ - Supplementary files
File Types:
matrix/
└── GSE123456_series_matrix.txt.gz
soft/
└── GSE123456_family.soft.gz
miniml/
└── GSE123456_family.xml.tgz
suppl/
├── GSE123456_RAW.tar
├── filelist.txt
└── [various supplementary files]
Sample Files
Pattern:
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM{nnn}nnn/GSM{xxxxx}/
Subdirectories:
- /suppl/ - Sample-specific supplementary files
Platform Files
Pattern:
ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL{nnn}nnn/GPL{xxxxx}/
File Types:
soft/
└── GPL570.soft.gz
miniml/
└── GPL570.xml
annot/
└── GPL570.annot.gz # Enhanced annotation (if available)
Advanced GEOparse Usage
Custom Parsing Options
import GEOparse
# Parse with custom options
gse = GEOparse.get_GEO(
    geo="GSE123456",
    destdir="./data",
    silent=False,       # Show progress
    how="full",         # Parse mode: "full", "quick", "brief"
    annotate_gpl=True,  # Include platform annotation
    geotype="GSE"       # Explicit type
)
# Access specific sample
gsm = gse.gsms['GSM1234567']
# Get expression values for specific probe
probe_id = "1007_s_at"
if hasattr(gsm, 'table'):
    probe_data = gsm.table[gsm.table['ID_REF'] == probe_id]

# Get all characteristics
characteristics = {}
for key, values in gsm.metadata.items():
    if key.startswith('characteristics'):
        for value in (values if isinstance(values, list) else [values]):
            if ':' in value:
                char_key, char_value = value.split(':', 1)
                characteristics[char_key.strip()] = char_value.strip()
Working with Platform Annotations
import GEOparse
import pandas as pd
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
# Get platform
gpl = list(gse.gpls.values())[0]
# Extract annotation table
if hasattr(gpl, 'table'):
    annotation = gpl.table
    # Common annotation columns:
    # - ID: Probe identifier
    # - Gene Symbol: Gene symbol
    # - Gene Title: Gene description
    # - GB_ACC: GenBank accession
    # - Gene ID: Entrez Gene ID
    # - RefSeq: RefSeq accession
    # - UniGene: UniGene cluster

    # Map probes to genes
    probe_to_gene = dict(zip(
        annotation['ID'],
        annotation['Gene Symbol']
    ))

    # Handle multiple probes per gene
    gene_to_probes = {}
    for probe, gene in probe_to_gene.items():
        # Skip missing annotations ('---' and NaN placeholders)
        if isinstance(gene, str) and gene != '---':
            gene_to_probes.setdefault(gene, []).append(probe)
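With the probe-to-gene map in hand, a common next step is to collapse probe-level values to gene level, for example by averaging. A sketch, assuming expression_df comes from gse.pivot_samples("VALUE") and annotation is the platform table from above:
# Average probe-level values per gene symbol
expression_df = gse.pivot_samples("VALUE")
gene_symbols = annotation.set_index("ID")["Gene Symbol"].rename("gene")
gene_expression = (
    expression_df.join(gene_symbols)
    .dropna(subset=["gene"])
    .query("gene != '---'")
    .groupby("gene")
    .mean()
)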
Handling Large Datasets
import GEOparse
import pandas as pd
import numpy as np
def process_large_gse(gse_id, chunk_size=1000):
    """Process a large GEO series in chunks of samples."""
    gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")
    # Get sample list
    sample_list = list(gse.gsms.keys())
    # Process in chunks
    for i in range(0, len(sample_list), chunk_size):
        chunk_samples = sample_list[i:i + chunk_size]
        # Extract data for the chunk, aligned on probe ID
        chunk_data = {}
        for gsm_id in chunk_samples:
            gsm = gse.gsms[gsm_id]
            if hasattr(gsm, 'table'):
                chunk_data[gsm_id] = gsm.table.set_index('ID_REF')['VALUE']
        # Process chunk
        chunk_df = pd.DataFrame(chunk_data)
        # Save chunk results
        chunk_df.to_csv(f"chunk_{i // chunk_size}.csv")
        print(f"Processed {i + len(chunk_samples)}/{len(sample_list)} samples")
Troubleshooting Common Issues
Issue: GEOparse Fails to Download
Symptoms: Timeout errors, connection failures
Solutions:
- Check internet connection
- Try downloading directly via FTP first
- Parse local files:
gse = GEOparse.get_GEO(filepath="./local/GSE123456_family.soft.gz")
- Increase timeout (modify GEOparse source if needed)
Issue: Missing Expression Data
Symptoms: pivot_samples() fails or returns empty
Cause: Not all series have series matrix files (older submissions)
Solution: Parse individual sample tables:
import pandas as pd

expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table') and 'VALUE' in gsm.table.columns:
        expression_data[gsm_name] = gsm.table.set_index('ID_REF')['VALUE']
expression_df = pd.DataFrame(expression_data)
Issue: Inconsistent Probe IDs
Symptoms: Probe IDs don't match between samples
Cause: Different platform versions or sample processing
Solution: Standardize using platform annotation:
# Get the union of probes across all samples
all_probes = set()
for gsm in gse.gsms.values():
    if hasattr(gsm, 'table'):
        all_probes.update(gsm.table['ID_REF'].values)

# Create a standardized matrix (reindex needs an ordered sequence, not a set)
probe_index = sorted(all_probes)
standardized_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table'):
        sample_data = gsm.table.set_index('ID_REF')['VALUE']
        standardized_data[gsm_name] = sample_data.reindex(probe_index)
expression_df = pd.DataFrame(standardized_data)
Issue: E-utilities Rate Limiting
Symptoms: HTTP 429 errors, slow responses
Solution:
- Get an API key from NCBI
- Implement rate limiting:
import time
from functools import wraps

from Bio import Entrez

Entrez.email = "your@email.com"

def rate_limit(calls_per_second=3):
    """Decorator that enforces a minimum interval between calls."""
    min_interval = 1.0 / calls_per_second

    def decorator(func):
        last_called = [0.0]

        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait_time = min_interval - elapsed
            if wait_time > 0:
                time.sleep(wait_time)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_second=3)
def safe_esearch(query):
    handle = Entrez.esearch(db="gds", term=query)
    results = Entrez.read(handle)
    handle.close()
    return results
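An API key raises NCBI's limit from 3 to 10 requests per second; in Biopython it is set once alongside the email:
from Bio import Entrez

Entrez.email = "your@email.com"
Entrez.api_key = "YOUR_NCBI_API_KEY"  # obtained from your NCBI account settings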
Issue: Memory Errors with Large Datasets
Symptoms: MemoryError, system slowdown
Solution:
- Process data in chunks
- Use sparse matrices for expression data
- Load only necessary columns
- Use memory-efficient data types:
import numpy as np
import pandas as pd

# Read with specific dtypes and the probe ID as the index
expression_df = pd.read_csv(
    "expression_matrix.csv",
    dtype={'ID': str, 'GSM1': np.float32},  # Use float32 instead of float64
    index_col='ID'
)

# Or use a sparse format for mostly-zero data
import scipy.sparse as sp
sparse_matrix = sp.csr_matrix(expression_df.values)
Platform-Specific Considerations
Affymetrix Arrays
- Probe ID format: 1007_s_at, 1053_at
- Multiple probe sets per gene are common
- Check for _at, _s_at, _x_at suffixes (see the sketch after this list)
- May need RMA or MAS5 normalization
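When ambiguity matters, cross-hybridizing probe sets can be filtered by suffix; a sketch assuming the annotation table from the GEOparse examples above:
# Drop _x_at probe sets, which may cross-hybridize to multiple genes
specific = annotation[~annotation['ID'].astype(str).str.endswith('_x_at')]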
Illumina Arrays
- Probe ID format: ILMN_1234567
- Watch for duplicate probes
- BeadChip-specific processing may be needed
RNA-seq
- May not have traditional "probes"
- Check for gene IDs (Ensembl, Entrez)
- Counts vs. FPKM/TPM values
- May need separate count files
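For sequencing series, expression values often arrive as a supplementary count table rather than GSM data tables; a minimal loading sketch (the filename is hypothetical):
import pandas as pd

# Load a supplementary gene-level count matrix (genes x samples)
counts = pd.read_csv("GSE123456_raw_counts.txt.gz", sep="\t", index_col=0)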
Two-Channel Arrays
- Look for _ch1 and _ch2 suffixes in metadata
- VALUE_ch1, VALUE_ch2 columns
- May need ratio or intensity values
- Check dye-swap experiments
Best Practices Summary
- Always set Entrez.email before using E-utilities
- Use API key for better rate limits
- Cache downloaded files locally
- Check data quality before analysis
- Verify platform annotations are current
- Document data processing steps
- Cite original studies when using data
- Check for batch effects in meta-analyses
- Validate results with independent datasets
- Follow NCBI usage guidelines