Initial commit
skills/dnanexus-integration/SKILL.md
@@ -0,0 +1,376 @@
---
name: dnanexus-integration
description: "DNAnexus cloud genomics platform. Build apps/applets, manage data (upload/download), dxpy Python SDK, run workflows, FASTQ/BAM/VCF, for genomics pipeline development and execution."
---

# DNAnexus Integration

## Overview

DNAnexus is a cloud platform for biomedical data analysis and genomics. Build and deploy apps/applets, manage data objects, run workflows, and use the dxpy Python SDK for genomics pipeline development and execution.

## When to Use This Skill

This skill should be used when:
- Creating, building, or modifying DNAnexus apps/applets
- Uploading, downloading, searching, or organizing files and records
- Running analyses, monitoring jobs, creating workflows
- Writing scripts using dxpy to interact with the platform
- Setting up dxapp.json, managing dependencies, using Docker
- Processing FASTQ, BAM, VCF, or other bioinformatics files
- Managing projects, permissions, or platform resources

## Core Capabilities

The skill is organized into five main areas, each with detailed reference documentation:

### 1. App Development

**Purpose**: Create executable programs (apps/applets) that run on the DNAnexus platform.

**Key Operations**:
- Generate app skeleton with `dx-app-wizard`
- Write Python or Bash apps with proper entry points
- Handle input/output data objects
- Deploy with `dx build` or `dx build --app`
- Test apps on the platform

**Common Use Cases**:
- Bioinformatics pipelines (alignment, variant calling)
- Data processing workflows
- Quality control and filtering
- Format conversion tools

**Reference**: See `references/app-development.md` for:
- Complete app structure and patterns
- Python entry point decorators
- Input/output handling with dxpy
- Development best practices
- Common issues and solutions

### 2. Data Operations

**Purpose**: Manage files, records, and other data objects on the platform.

**Key Operations**:
- Upload/download files with `dxpy.upload_local_file()` and `dxpy.download_dxfile()`
- Create and manage records with metadata
- Search for data objects by name, properties, or type
- Clone data between projects
- Manage project folders and permissions

**Common Use Cases**:
- Uploading sequencing data (FASTQ files)
- Organizing analysis results
- Searching for specific samples or experiments
- Backing up data across projects
- Managing reference genomes and annotations

**Reference**: See `references/data-operations.md` for:
- Complete file and record operations
- Data object lifecycle (open/closed states)
- Search and discovery patterns
- Project management
- Batch operations
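
Data objects are passed between calls as link dictionaries, and a link can take two shapes: a bare ID (`{"$dnanexus_link": "file-xxxx"}`) or an ID qualified by project (`{"$dnanexus_link": {"project": "project-xxxx", "id": "file-xxxx"}}`). The sketch below is a minimal, standalone illustration of reading both shapes; `link_ids` is a hypothetical helper name, not part of dxpy. dxpy's own `get_dxlink_ids` helper should cover the same ground, so treat this only as a picture of the data layout.

```python
def link_ids(link):
    """Return (object_id, project_id) for a data object link.

    Hypothetical helper for illustration; project_id is None when the
    link carries only a bare object ID.
    """
    target = link["$dnanexus_link"]
    if isinstance(target, str):
        # Bare form: {"$dnanexus_link": "file-xxxx"}
        return target, None
    if "id" in target:
        # Qualified form: {"$dnanexus_link": {"project": ..., "id": ...}}
        return target["id"], target.get("project")
    # Anything else (e.g. a job-based output reference) is not a data object link
    raise ValueError("not a data object link: %r" % (link,))


bare = {"$dnanexus_link": "file-xxxx"}
qualified = {"$dnanexus_link": {"project": "project-xxxx", "id": "file-xxxx"}}
print(link_ids(bare))
print(link_ids(qualified))
```

Handling both shapes matters when reading job outputs, since code that assumes the bare form will break on a project-qualified link.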

### 3. Job Execution

**Purpose**: Run analyses, monitor execution, and orchestrate workflows.

**Key Operations**:
- Launch jobs with `applet.run()` or `app.run()`
- Monitor job status and logs
- Create subjobs for parallel processing
- Build and run multi-step workflows
- Chain jobs with output references

**Common Use Cases**:
- Running genomics analyses on sequencing data
- Parallel processing of multiple samples
- Multi-step analysis pipelines
- Monitoring long-running computations
- Debugging failed jobs

**Reference**: See `references/job-execution.md` for:
- Complete job lifecycle and states
- Workflow creation and orchestration
- Parallel execution patterns
- Job monitoring and debugging
- Resource management
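
Monitoring a job usually comes down to polling its `state` until it reaches a terminal value (`done`, `failed`, or `terminated`). The following is a minimal sketch of that loop, not the SDK's implementation: `wait_for_terminal_state` and its parameters are hypothetical names, and `describe` is assumed to be any zero-argument callable returning a dict with a `"state"` key, such as the bound `describe` method of a `dxpy.DXJob`.

```python
import time

# Job states after which no further transitions are expected
TERMINAL_STATES = {"done", "failed", "terminated"}


def wait_for_terminal_state(describe, poll_seconds=10.0, max_polls=360, sleep=time.sleep):
    """Poll describe() until the job reaches a terminal state and return it.

    Raises TimeoutError if max_polls is exhausted first.
    """
    for attempt in range(max_polls):
        state = describe()["state"]
        if state in TERMINAL_STATES:
            return state
        if attempt < max_polls - 1:
            # Only sleep when another poll will follow
            sleep(poll_seconds)
    raise TimeoutError("job did not finish within %d polls" % max_polls)
```

With a real job this might be called as `wait_for_terminal_state(job.describe)`. Unlike `job.wait_on_done()`, which raises when the job fails, this version returns the final state so the caller can decide how to react.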

### 4. Python SDK (dxpy)

**Purpose**: Programmatic access to the DNAnexus platform through Python.

**Key Operations**:
- Work with data object handlers (DXFile, DXRecord, DXApplet, etc.)
- Use high-level functions for common tasks
- Make direct API calls for advanced operations
- Create links and references between objects
- Search and discover platform resources

**Common Use Cases**:
- Automation scripts for data management
- Custom analysis pipelines
- Batch processing workflows
- Integration with external tools
- Data migration and organization

**Reference**: See `references/python-sdk.md` for:
- Complete dxpy class reference
- High-level utility functions
- API method documentation
- Error handling patterns
- Common code patterns

### 5. Configuration and Dependencies

**Purpose**: Configure app metadata and manage dependencies.

**Key Operations**:
- Write dxapp.json with inputs, outputs, and run specs
- Install system packages (execDepends)
- Bundle custom tools and resources
- Use assets for shared dependencies
- Integrate Docker containers
- Configure instance types and timeouts

**Common Use Cases**:
- Defining app input/output specifications
- Installing bioinformatics tools (samtools, bwa, etc.)
- Managing Python package dependencies
- Using Docker images for complex environments
- Selecting computational resources

**Reference**: See `references/configuration.md` for:
- Complete dxapp.json specification
- Dependency management strategies
- Docker integration patterns
- Regional and resource configuration
- Example configurations
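
To make the configuration pieces concrete, here is a rough sketch that assembles a dxapp.json for a quality-filter applet as a Python dictionary and serializes it. The field names follow the dxapp.json format, but the specific values (app name, Ubuntu release, region, instance type, timeout, and the `samtools` dependency) are illustrative assumptions, and `build_dxapp` is a hypothetical helper. Check `references/configuration.md` before relying on any of them.

```python
import json


def build_dxapp(name, instance_type="mem1_ssd1_v2_x4"):
    """Return a dxapp.json-style dict (illustrative values, not a template to copy blindly)."""
    return {
        "name": name,
        "title": "Quality Filter",
        "summary": "Filter FASTQ reads by a quality threshold",
        "dxapi": "1.0.0",
        "version": "0.1.0",
        "inputSpec": [
            {"name": "input_file", "class": "file", "patterns": ["*.fastq", "*.fastq.gz"]},
            {"name": "quality_threshold", "class": "int", "optional": True, "default": 30},
        ],
        "outputSpec": [
            {"name": "filtered_reads", "class": "file"},
        ],
        "runSpec": {
            "interpreter": "python3",
            "file": "src/my-app.py",
            # Assumed execution environment; pick a release the platform currently supports
            "distribution": "Ubuntu",
            "release": "24.04",
            "version": "0",
            # System packages installed before the entry point runs
            "execDepends": [{"name": "samtools"}],
            # Kill any entry point that runs longer than this
            "timeoutPolicy": {"*": {"hours": 12}},
        },
        "regionalOptions": {
            # Assumed region; instance type applies to all entry points ("*")
            "aws:us-east-1": {
                "systemRequirements": {"*": {"instanceType": instance_type}}
            }
        },
    }


dxapp = build_dxapp("quality-filter")
dxapp_text = json.dumps(dxapp, indent=2)
print(dxapp_text)
```

Generating the file from code is optional; the same structure can be written by hand. The input and output names must match the parameters and return keys of the entry point function in `src/my-app.py`.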

## Quick Start Examples

### Upload and Analyze Data

```python
import dxpy

# Upload input file
input_file = dxpy.upload_local_file("sample.fastq", project="project-xxxx")

# Run analysis
job = dxpy.DXApplet("applet-xxxx").run({
    "reads": dxpy.dxlink(input_file.get_id())
})

# Wait for completion
job.wait_on_done()

# Download results
output_id = job.describe()["output"]["aligned_reads"]["$dnanexus_link"]
dxpy.download_dxfile(output_id, "aligned.bam")
```

### Search and Download Files

```python
import dxpy

# Find BAM files from a specific experiment.
# name_mode="glob" is needed for the wildcard; the default is an exact match.
files = dxpy.find_data_objects(
    classname="file",
    name="*.bam",
    name_mode="glob",
    properties={"experiment": "exp001"},
    project="project-xxxx"
)

# Download each file
for file_result in files:
    file_obj = dxpy.DXFile(file_result["id"])
    filename = file_obj.describe()["name"]
    dxpy.download_dxfile(file_result["id"], filename)
```

### Create Simple App

```python
# src/my-app.py
import dxpy
import subprocess

@dxpy.entry_point('main')
def main(input_file, quality_threshold=30):
    # Download input
    dxpy.download_dxfile(input_file["$dnanexus_link"], "input.fastq")

    # Process
    subprocess.check_call([
        "quality_filter",
        "--input", "input.fastq",
        "--output", "filtered.fastq",
        "--threshold", str(quality_threshold)
    ])

    # Upload output
    output_file = dxpy.upload_local_file("filtered.fastq")

    return {
        "filtered_reads": dxpy.dxlink(output_file)
    }

dxpy.run()
```

## Workflow Decision Tree

When working with DNAnexus, follow this decision tree:

1. **Need to create a new executable?**
   - Yes → Use **App Development** (references/app-development.md)
   - No → Continue to step 2

2. **Need to manage files or data?**
   - Yes → Use **Data Operations** (references/data-operations.md)
   - No → Continue to step 3

3. **Need to run an analysis or workflow?**
   - Yes → Use **Job Execution** (references/job-execution.md)
   - No → Continue to step 4

4. **Writing Python scripts for automation?**
   - Yes → Use **Python SDK** (references/python-sdk.md)
   - No → Continue to step 5

5. **Configuring app settings or dependencies?**
   - Yes → Use **Configuration** (references/configuration.md)

Often you'll need multiple capabilities together (e.g., app development + configuration, or data operations + job execution).
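
Because several capabilities often apply at once, the decision tree can also be read as a lookup that returns every matching reference rather than stopping at the first "Yes". The sketch below illustrates that reading; the need labels (`create_executable`, `manage_data`, and so on) and the `references_for` function are hypothetical names introduced only for this example.

```python
# Ordered as in the decision tree: (need label, reference file)
REFERENCES = [
    ("create_executable", "references/app-development.md"),
    ("manage_data", "references/data-operations.md"),
    ("run_analysis", "references/job-execution.md"),
    ("automate_with_python", "references/python-sdk.md"),
    ("configure_app", "references/configuration.md"),
]


def references_for(needs):
    """Return the reference files matching the given needs, in decision-tree order."""
    return [path for need, path in REFERENCES if need in needs]


# A task that uploads data and then launches an analysis needs two references
print(references_for({"manage_data", "run_analysis"}))
```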

## Installation and Authentication

### Install dxpy

```bash
uv pip install dxpy
```

### Login to DNAnexus

```bash
dx login
```

This authenticates your session and sets up access to projects and data.

### Verify Installation

```bash
dx --version
dx whoami
```

## Common Patterns

### Pattern 1: Batch Processing

Process multiple files with the same analysis:

```python
import dxpy

# Find all FASTQ files (name_mode="glob" enables the wildcard match)
files = dxpy.find_data_objects(
    classname="file",
    name="*.fastq",
    name_mode="glob",
    project="project-xxxx"
)

# Launch parallel jobs
jobs = []
for file_result in files:
    job = dxpy.DXApplet("applet-xxxx").run({
        "input": dxpy.dxlink(file_result["id"])
    })
    jobs.append(job)

# Wait for all completions
for job in jobs:
    job.wait_on_done()
```

### Pattern 2: Multi-Step Pipeline

Chain multiple analyses together:

```python
import dxpy

# Placeholder IDs: substitute your own applets and input file
qc_applet = dxpy.DXApplet("applet-xxxx")
align_applet = dxpy.DXApplet("applet-yyyy")
variant_applet = dxpy.DXApplet("applet-zzzz")
input_file = dxpy.dxlink("file-xxxx")

# Step 1: Quality control
qc_job = qc_applet.run({"reads": input_file})

# Step 2: Alignment (uses QC output)
align_job = align_applet.run({
    "reads": qc_job.get_output_ref("filtered_reads")
})

# Step 3: Variant calling (uses alignment output)
variant_job = variant_applet.run({
    "bam": align_job.get_output_ref("aligned_bam")
})
```
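
Chaining works because `get_output_ref` returns a job-based object reference instead of a concrete file ID, so a downstream job can be launched immediately and the platform holds it until the upstream output exists. The standalone sketch below shows the shape of such a reference and how the upstream dependencies could be read back out of a job input; `output_ref` and `upstream_jobs` are hypothetical helpers for illustration, not dxpy functions.

```python
def output_ref(job_id, field):
    """Build a job-based object reference with the same shape get_output_ref produces."""
    return {"$dnanexus_link": {"job": job_id, "field": field}}


def upstream_jobs(value):
    """Collect the job IDs referenced anywhere in a (possibly nested) job input."""
    if isinstance(value, dict):
        target = value.get("$dnanexus_link")
        if isinstance(target, dict) and "job" in target:
            return {target["job"]}
        # Plain dict (or a data object link): look inside its values
        return set().union(*(upstream_jobs(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(upstream_jobs(v) for v in value))
    return set()


align_input = {
    "reads": output_ref("job-aaaa", "filtered_reads"),   # resolved when job-aaaa finishes
    "reference": {"$dnanexus_link": "file-xxxx"},        # ordinary data object link
}
print(upstream_jobs(align_input))
```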

### Pattern 3: Data Organization

Organize analysis results systematically:

```python
import dxpy

# Create organized folder structure
dxpy.api.project_new_folder(
    "project-xxxx",
    {"folder": "/experiments/exp001/results", "parents": True}
)

# Upload with metadata
result_file = dxpy.upload_local_file(
    "results.txt",
    project="project-xxxx",
    folder="/experiments/exp001/results",
    properties={
        "experiment": "exp001",
        "sample": "sample1",
        "analysis_date": "2025-10-20"
    },
    tags=["validated", "published"]
)
```

## Best Practices

1. **Error Handling**: Always wrap API calls in try-except blocks
2. **Resource Management**: Choose appropriate instance types for workloads
3. **Data Organization**: Use consistent folder structures and metadata
4. **Cost Optimization**: Archive old data, use appropriate storage classes
5. **Documentation**: Include clear descriptions in dxapp.json
6. **Testing**: Test apps with various input types before production use
7. **Version Control**: Use semantic versioning for apps
8. **Security**: Never hardcode credentials in source code
9. **Logging**: Include informative log messages for debugging
10. **Cleanup**: Remove temporary files and failed jobs
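
As one way to apply the error-handling advice, the sketch below wraps a call in a try-except with a bounded retry and exponential backoff. It is a generic pattern under stated assumptions rather than anything dxpy provides: `call_with_retries` and its parameters are hypothetical, and with dxpy you would likely pass a narrow exception class such as `dxpy.exceptions.DXAPIError` for `retry_on`, and only for calls that are safe to repeat.

```python
import time


def call_with_retries(fn, retry_on=(Exception,), attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff.

    The last failure is re-raised so callers still see the original error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A call might look like `call_with_retries(lambda: dxpy.DXFile("file-xxxx").describe(), retry_on=(dxpy.exceptions.DXAPIError,))`. Keeping `retry_on` narrow avoids masking programming errors such as a misspelled field name.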

## Resources

This skill includes detailed reference documentation:

### references/

- **app-development.md** - Complete guide to building and deploying apps/applets
- **data-operations.md** - File management, records, search, and project operations
- **job-execution.md** - Running jobs, workflows, monitoring, and parallel processing
- **python-sdk.md** - Comprehensive dxpy library reference with all classes and functions
- **configuration.md** - dxapp.json specification and dependency management

Load these references when you need detailed information about specific operations or when working on complex tasks.

## Getting Help

- Official documentation: https://documentation.dnanexus.com/
- API reference: http://autodoc.dnanexus.com/
- GitHub repository: https://github.com/dnanexus/dx-toolkit
- Support: support@dnanexus.com