Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/chembl-database/SKILL.md
+++ b/skills/chembl-database/SKILL.md
@@ -0,0 +1,383 @@
+---
+name: chembl-database
+description: "Query ChEMBL's bioactive molecules and drug discovery data. Search compounds by structure/properties, retrieve bioactivity data (IC50, Ki), find inhibitors, perform SAR studies, for medicinal chemistry."
+---
+
+# ChEMBL Database
+
+## Overview
+
+ChEMBL is a manually curated database of bioactive molecules maintained by the European Bioinformatics Institute (EBI), containing over 2 million compounds, 19 million bioactivity measurements, 13,000+ drug targets, and data on approved drugs and clinical candidates. Access and query this data programmatically using the ChEMBL Python client for drug discovery and medicinal chemistry research.
+
+## When to Use This Skill
+
+This skill should be used when:
+
+- **Compound searches**: Finding molecules by name, structure, or properties
+- **Target information**: Retrieving data about proteins, enzymes, or biological targets
+- **Bioactivity data**: Querying IC50, Ki, EC50, or other activity measurements
+- **Drug information**: Looking up approved drugs, mechanisms, or indications
+- **Structure searches**: Performing similarity or substructure searches
+- **Cheminformatics**: Analyzing molecular properties and drug-likeness
+- **Target-ligand relationships**: Exploring compound-target interactions
+- **Drug discovery**: Identifying inhibitors, agonists, or bioactive molecules
+
+## Installation and Setup
+
+### Python Client
+
+The ChEMBL Python client is required for programmatic access:
+
+```bash
+uv pip install chembl_webresource_client
+```
+
+### Basic Usage Pattern
+
+```python
+from chembl_webresource_client.new_client import new_client
+
+# Access different endpoints
+molecule = new_client.molecule
+target = new_client.target
+activity = new_client.activity
+drug = new_client.drug
+```
+
+## Core Capabilities
+
+### 1. Molecule Queries
+
+**Retrieve by ChEMBL ID:**
+```python
+molecule = new_client.molecule
+aspirin = molecule.get('CHEMBL25')
+```
+
+**Search by name:**
+```python
+results = molecule.filter(pref_name__icontains='aspirin')
+```
+
+**Filter by properties:**
+```python
+# Find small molecules (MW <= 500) with favorable LogP
+results = molecule.filter(
+    molecule_properties__mw_freebase__lte=500,
+    molecule_properties__alogp__lte=5
+)
+```
+
+### 2. Target Queries
+
+**Retrieve target information:**
+```python
+target = new_client.target
+egfr = target.get('CHEMBL203')
+```
+
+**Search for specific target types:**
+```python
+# Find all kinase targets
+kinases = target.filter(
+    target_type='SINGLE PROTEIN',
+    pref_name__icontains='kinase'
+)
+```
+
+### 3. Bioactivity Data
+
+**Query activities for a target:**
+```python
+activity = new_client.activity
+# Find potent EGFR inhibitors
+results = activity.filter(
+    target_chembl_id='CHEMBL203',
+    standard_type='IC50',
+    standard_value__lte=100,
+    standard_units='nM'
+)
+```
+
+**Get all activities for a compound:**
+```python
+compound_activities = activity.filter(
+    molecule_chembl_id='CHEMBL25',
+    pchembl_value__isnull=False
+)
+```
+
+### 4. Structure-Based Searches
+
+**Similarity search:**
+```python
+similarity = new_client.similarity
+# Find compounds similar to aspirin
+similar = similarity.filter(
+    smiles='CC(=O)Oc1ccccc1C(=O)O',
+    similarity=85  # 85% similarity threshold
+)
+```
+
+**Substructure search:**
+```python
+substructure = new_client.substructure
+# Find compounds containing benzene ring
+results = substructure.filter(smiles='c1ccccc1')
+```
+
+### 5. Drug Information
+
+**Retrieve drug data:**
+```python
+drug = new_client.drug
+drug_info = drug.get('CHEMBL25')
+```
+
+**Get mechanisms of action:**
+```python
+mechanism = new_client.mechanism
+mechanisms = mechanism.filter(molecule_chembl_id='CHEMBL25')
+```
+
+**Query drug indications:**
+```python
+drug_indication = new_client.drug_indication
+indications = drug_indication.filter(molecule_chembl_id='CHEMBL25')
+```
+
+## Query Workflow
+
+### Workflow 1: Finding Inhibitors for a Target
+
+1. **Identify the target** by searching by name:
+   ```python
+   targets = new_client.target.filter(pref_name__icontains='EGFR')
+   target_id = targets[0]['target_chembl_id']
+   ```
+
+2. **Query bioactivity data** for that target:
+   ```python
+   activities = new_client.activity.filter(
+       target_chembl_id=target_id,
+       standard_type='IC50',
+       standard_value__lte=100
+   )
+   ```
+
+3. **Extract compound IDs** and retrieve details:
+   ```python
+   compound_ids = [act['molecule_chembl_id'] for act in activities]
+   compounds = [new_client.molecule.get(cid) for cid in compound_ids]
+   ```
+
+### Workflow 2: Analyzing a Known Drug
+
+1. **Get drug information**:
+   ```python
+   drug_info = new_client.drug.get('CHEMBL1234')
+   ```
+
+2. **Retrieve mechanisms**:
+   ```python
+   mechanisms = new_client.mechanism.filter(molecule_chembl_id='CHEMBL1234')
+   ```
+
+3. **Find all bioactivities**:
+   ```python
+   activities = new_client.activity.filter(molecule_chembl_id='CHEMBL1234')
+   ```
+
+### Workflow 3: Structure-Activity Relationship (SAR) Study
+
+1. **Find similar compounds**:
+   ```python
+   similar = new_client.similarity.filter(smiles='query_smiles', similarity=80)
+   ```
+
+2. **Get activities for each compound**:
+   ```python
+   for compound in similar:
+       activities = new_client.activity.filter(
+           molecule_chembl_id=compound['molecule_chembl_id']
+       )
+   ```
+
+3. **Analyze property-activity relationships** using molecular properties from results.
+
+## Filter Operators
+
+ChEMBL supports Django-style query filters:
+
+- `__exact` - Exact match
+- `__iexact` - Case-insensitive exact match
+- `__contains` / `__icontains` - Substring matching
+- `__startswith` / `__endswith` - Prefix/suffix matching
+- `__gt`, `__gte`, `__lt`, `__lte` - Numeric comparisons
+- `__range` - Value in range
+- `__in` - Value in list
+- `__isnull` - Null/not null check
+
+## Data Export and Analysis
+
+Convert results to pandas DataFrame for analysis:
+
+```python
+import pandas as pd
+
+activities = new_client.activity.filter(target_chembl_id='CHEMBL203')
+df = pd.DataFrame(list(activities))
+
+# Analyze results
+print(df['standard_value'].describe())
+print(df.groupby('standard_type').size())
+```
+
+## Performance Optimization
+
+### Caching
+
+The client automatically caches results for 24 hours. Configure caching:
+
+```python
+from chembl_webresource_client.settings import Settings
+
+# Disable caching
+Settings.Instance().CACHING = False
+
+# Adjust cache expiration (seconds)
+Settings.Instance().CACHE_EXPIRE = 86400
+```
+
+### Lazy Evaluation
+
+Queries execute only when data is accessed. Convert to list to force execution:
+
+```python
+# Query is not executed yet
+results = molecule.filter(pref_name__icontains='aspirin')
+
+# Force execution
+results_list = list(results)
+```
+
+### Pagination
+
+Results are paginated automatically. Iterate through all results:
+
+```python
+for activity in new_client.activity.filter(target_chembl_id='CHEMBL203'):
+    # Process each activity
+    print(activity['molecule_chembl_id'])
+```
+
+## Common Use Cases
+
+### Find Kinase Inhibitors
+
+```python
+# Identify kinase targets
+kinases = new_client.target.filter(
+    target_type='SINGLE PROTEIN',
+    pref_name__icontains='kinase'
+)
+
+# Get potent inhibitors
+for kinase in kinases[:5]:  # First 5 kinases
+    activities = new_client.activity.filter(
+        target_chembl_id=kinase['target_chembl_id'],
+        standard_type='IC50',
+        standard_value__lte=50
+    )
+```
+
+### Explore Drug Repurposing
+
+```python
+# Get approved drugs
+drugs = new_client.drug.filter()
+
+# For each drug, find all targets
+for drug in drugs[:10]:
+    mechanisms = new_client.mechanism.filter(
+        molecule_chembl_id=drug['molecule_chembl_id']
+    )
+```
+
+### Virtual Screening
+
+```python
+# Find compounds with desired properties
+candidates = new_client.molecule.filter(
+    molecule_properties__mw_freebase__range=[300, 500],
+    molecule_properties__alogp__lte=5,
+    molecule_properties__hba__lte=10,
+    molecule_properties__hbd__lte=5
+)
+```
+
+## Resources
+
+### scripts/example_queries.py
+
+Ready-to-use Python functions demonstrating common ChEMBL query patterns:
+
+- `get_molecule_info()` - Retrieve molecule details by ID
+- `search_molecules_by_name()` - Name-based molecule search
+- `find_molecules_by_properties()` - Property-based filtering
+- `get_bioactivity_data()` - Query bioactivities for targets
+- `find_similar_compounds()` - Similarity searching
+- `substructure_search()` - Substructure matching
+- `get_drug_info()` - Retrieve drug information
+- `find_kinase_inhibitors()` - Specialized kinase inhibitor search
+- `export_to_dataframe()` - Convert results to pandas DataFrame
+
+Consult this script for implementation details and usage examples.
+
+### references/api_reference.md
+
+Comprehensive API documentation including:
+
+- Complete endpoint listing (molecule, target, activity, assay, drug, etc.)
+- All filter operators and query patterns
+- Molecular properties and bioactivity fields
+- Advanced query examples
+- Configuration and performance tuning
+- Error handling and rate limiting
+
+Refer to this document when detailed API information is needed or when troubleshooting queries.
+
+## Important Notes
+
+### Data Reliability
+
+- ChEMBL data is manually curated but may contain inconsistencies
+- Always check `data_validity_comment` field in activity records
+- Be aware of `potential_duplicate` flags
+
+### Units and Standards
+
+- Bioactivity values use standard units (nM, uM, etc.)
+- `pchembl_value` provides normalized activity (-log scale)
+- Check `standard_type` to understand measurement type (IC50, Ki, EC50, etc.)
+
+### Rate Limiting
+
+- Respect ChEMBL's fair usage policies
+- Use caching to minimize repeated requests
+- Consider bulk downloads for large datasets
+- Avoid hammering the API with rapid consecutive requests
+
+### Chemical Structure Formats
+
+- SMILES strings are the primary structure format
+- InChI keys available for compounds
+- SVG images can be generated via the image endpoint
+
+## Additional Resources
+
+- ChEMBL website: https://www.ebi.ac.uk/chembl/
+- API documentation: https://www.ebi.ac.uk/chembl/api/data/docs
+- Python client GitHub: https://github.com/chembl/chembl_webresource_client
+- Interface documentation: https://chembl.gitbook.io/chembl-interface-documentation/
+- Example notebooks: https://github.com/chembl/notebooks