---
name: dnanexus-integration
description: "DNAnexus cloud genomics platform. Build apps/applets, manage data (upload/download), dxpy Python SDK, run workflows, FASTQ/BAM/VCF, for genomics pipeline development and execution."
---

# DNAnexus Integration

## Overview

DNAnexus is a cloud platform for biomedical data analysis and genomics. Build and deploy apps/applets, manage data objects, run workflows, and use the dxpy Python SDK for genomics pipeline development and execution.

## When to Use This Skill

This skill should be used when:

- Creating, building, or modifying DNAnexus apps/applets
- Uploading, downloading, searching, or organizing files and records
- Running analyses, monitoring jobs, or creating workflows
- Writing scripts that use dxpy to interact with the platform
- Setting up dxapp.json, managing dependencies, or using Docker
- Processing FASTQ, BAM, VCF, or other bioinformatics files
- Managing projects, permissions, or platform resources

## Core Capabilities

The skill is organized into five main areas, each with detailed reference documentation:

### 1. App Development

**Purpose**: Create executable programs (apps/applets) that run on the DNAnexus platform.

**Key Operations**:
- Generate an app skeleton with `dx-app-wizard`
- Write Python or Bash apps with proper entry points
- Handle input/output data objects
- Deploy with `dx build` (applets) or `dx build --app` (apps)
- Test apps on the platform

**Common Use Cases**:
- Bioinformatics pipelines (alignment, variant calling)
- Data processing workflows
- Quality control and filtering
- Format conversion tools

**Reference**: See `references/app-development.md` for:
- Complete app structure and patterns
- Python entry point decorators
- Input/output handling with dxpy
- Development best practices
- Common issues and solutions

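Apps receive and return data objects as DNAnexus links: small JSON objects keyed by `$dnanexus_link`. A minimal sketch of that shape (the helper below only mirrors what `dxpy.dxlink()` builds; inside a real app, call `dxpy.dxlink()` directly):

```python
def make_dxlink(object_id, project_id=None):
    """Build the JSON link shape DNAnexus uses to reference data objects.

    Mirrors dxpy.dxlink(): a bare link when only an object ID is given,
    or an extended link when a project is also specified.
    """
    if project_id is None:
        return {"$dnanexus_link": object_id}
    return {"$dnanexus_link": {"project": project_id, "id": object_id}}

# An applet output field referencing an uploaded file:
output = {"filtered_reads": make_dxlink("file-xxxx")}
```

This is the shape an entry point's `return` dict must contain for every file-class output field.
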
### 2. Data Operations

**Purpose**: Manage files, records, and other data objects on the platform.

**Key Operations**:
- Upload/download files with `dxpy.upload_local_file()` and `dxpy.download_dxfile()`
- Create and manage records with metadata
- Search for data objects by name, properties, or type
- Clone data between projects
- Manage project folders and permissions

**Common Use Cases**:
- Uploading sequencing data (FASTQ files)
- Organizing analysis results
- Searching for specific samples or experiments
- Backing up data across projects
- Managing reference genomes and annotations

**Reference**: See `references/data-operations.md` for:
- Complete file and record operations
- Data object lifecycle (open/closed states)
- Search and discovery patterns
- Project management
- Batch operations

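Property-based search only works if uploads carry consistent metadata. A small sketch of assembling a property dictionary before upload (the `experiment` and `sample` keys are illustrative; DNAnexus properties are arbitrary string-to-string pairs, so non-string values must be converted):

```python
def build_properties(experiment, sample, **extra):
    """Assemble a DNAnexus property dict; all values must be strings."""
    props = {"experiment": experiment, "sample": sample}
    for key, value in extra.items():
        props[key] = str(value)  # platform properties are string-valued
    return props

# Pass the result to an upload, e.g.:
#   dxpy.upload_local_file("reads.fastq", properties=build_properties("exp001", "sample1", lane=3))
props = build_properties("exp001", "sample1", lane=3)
```

Keeping property keys uniform across a project is what makes `dxpy.find_data_objects(properties=...)` queries reliable later.
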
### 3. Job Execution

**Purpose**: Run analyses, monitor execution, and orchestrate workflows.

**Key Operations**:
- Launch jobs with `applet.run()` or `app.run()`
- Monitor job status and logs
- Create subjobs for parallel processing
- Build and run multi-step workflows
- Chain jobs with output references

**Common Use Cases**:
- Running genomics analyses on sequencing data
- Parallel processing of multiple samples
- Multi-step analysis pipelines
- Monitoring long-running computations
- Debugging failed jobs

**Reference**: See `references/job-execution.md` for:
- Complete job lifecycle and states
- Workflow creation and orchestration
- Parallel execution patterns
- Job monitoring and debugging
- Resource management

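Jobs move through a fixed lifecycle, and polling code mostly just needs to distinguish terminal states from everything else. A sketch of that check (the three terminal state names match the platform's job lifecycle; in practice `DXJob.wait_on_done()` does this waiting for you):

```python
# States a job can never leave once reached
TERMINAL_STATES = {"done", "failed", "terminated"}

def is_finished(job_state):
    """True once a job can no longer change state."""
    return job_state in TERMINAL_STATES

def summarize(job_states):
    """Count finished vs. still-pending jobs from a list of state strings."""
    finished = sum(1 for state in job_states if is_finished(state))
    return {"finished": finished, "pending": len(job_states) - finished}
```

A monitoring loop would fetch each state with `dxpy.DXJob(job_id).describe()["state"]` and stop once every state is terminal.
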
### 4. Python SDK (dxpy)

**Purpose**: Programmatic access to the DNAnexus platform through Python.

**Key Operations**:
- Work with data object handlers (DXFile, DXRecord, DXApplet, etc.)
- Use high-level functions for common tasks
- Make direct API calls for advanced operations
- Create links and references between objects
- Search and discover platform resources

**Common Use Cases**:
- Automation scripts for data management
- Custom analysis pipelines
- Batch processing workflows
- Integration with external tools
- Data migration and organization

**Reference**: See `references/python-sdk.md` for:
- Complete dxpy class reference
- High-level utility functions
- API method documentation
- Error handling patterns
- Common code patterns

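Every platform object carries a class-prefixed ID (`file-…`, `record-…`, `applet-…`, `job-…`, `project-…`), which is how dxpy picks the right handler class. A sketch of splitting the class out of an ID (illustrative only; `dxpy.get_handler()` does the real dispatch and returns the matching handler object):

```python
def id_classname(dx_id):
    """Return the object class encoded in a DNAnexus ID, e.g. 'file' for 'file-xxxx'."""
    classname, sep, suffix = dx_id.partition("-")
    if not sep or not suffix:
        raise ValueError(f"not a DNAnexus ID: {dx_id!r}")
    return classname
```

This convention is why scripts can accept a bare ID and decide at runtime whether to wrap it in `DXFile`, `DXRecord`, or `DXJob`.
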
### 5. Configuration and Dependencies

**Purpose**: Configure app metadata and manage dependencies.

**Key Operations**:
- Write dxapp.json with inputs, outputs, and run specs
- Install system packages (execDepends)
- Bundle custom tools and resources
- Use assets for shared dependencies
- Integrate Docker containers
- Configure instance types and timeouts

**Common Use Cases**:
- Defining app input/output specifications
- Installing bioinformatics tools (samtools, bwa, etc.)
- Managing Python package dependencies
- Using Docker images for complex environments
- Selecting computational resources

**Reference**: See `references/configuration.md` for:
- Complete dxapp.json specification
- Dependency management strategies
- Docker integration patterns
- Regional and resource configuration
- Example configurations

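A minimal `dxapp.json` sketch tying these pieces together (field names follow the dxapp.json specification; the applet name, input/output fields, Ubuntu release, and instance type are illustrative choices, not requirements):

```json
{
  "name": "quality-filter",
  "title": "Quality Filter",
  "dxapi": "1.0.0",
  "version": "0.0.1",
  "inputSpec": [
    {"name": "reads", "class": "file", "patterns": ["*.fastq", "*.fastq.gz"]},
    {"name": "quality_threshold", "class": "int", "default": 30, "optional": true}
  ],
  "outputSpec": [
    {"name": "filtered_reads", "class": "file"}
  ],
  "runSpec": {
    "interpreter": "python3",
    "file": "src/quality-filter.py",
    "distribution": "Ubuntu",
    "release": "20.04",
    "execDepends": [{"name": "samtools"}]
  },
  "regionalOptions": {
    "aws:us-east-1": {
      "systemRequirements": {"*": {"instanceType": "mem1_ssd1_v2_x4"}}
    }
  }
}
```

`inputSpec`/`outputSpec` field names must match the entry point's parameters and return keys, and `execDepends` packages are installed before the job's entry point runs.
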
## Quick Start Examples

### Upload and Analyze Data

```python
import dxpy

# Upload the input file; wait for it to close so it is ready to use as job input
input_file = dxpy.upload_local_file(
    "sample.fastq", project="project-xxxx", wait_on_close=True
)

# Run the analysis
job = dxpy.DXApplet("applet-xxxx").run({
    "reads": dxpy.dxlink(input_file.get_id())
})

# Wait for completion
job.wait_on_done()

# Download the results
output_id = job.describe()["output"]["aligned_reads"]["$dnanexus_link"]
dxpy.download_dxfile(output_id, "aligned.bam")
```

### Search and Download Files

```python
import dxpy

# Find BAM files from a specific experiment.
# name_mode="glob" is required for wildcard matching;
# the default name_mode matches names exactly.
files = dxpy.find_data_objects(
    classname="file",
    name="*.bam",
    name_mode="glob",
    properties={"experiment": "exp001"},
    project="project-xxxx"
)

# Download each file
for file_result in files:
    file_obj = dxpy.DXFile(file_result["id"], project=file_result["project"])
    filename = file_obj.describe()["name"]
    dxpy.download_dxfile(file_result["id"], filename)
```

### Create a Simple App

```python
# src/my-app.py
import dxpy
import subprocess

@dxpy.entry_point('main')
def main(input_file, quality_threshold=30):
    # Download the input (input_file arrives as a $dnanexus_link dict)
    dxpy.download_dxfile(input_file["$dnanexus_link"], "input.fastq")

    # Process (quality_filter stands in for a tool bundled with the app)
    subprocess.check_call([
        "quality_filter",
        "--input", "input.fastq",
        "--output", "filtered.fastq",
        "--threshold", str(quality_threshold)
    ])

    # Upload the output
    output_file = dxpy.upload_local_file("filtered.fastq")

    return {
        "filtered_reads": dxpy.dxlink(output_file)
    }

dxpy.run()
```

## Workflow Decision Tree

When working with DNAnexus, follow this decision tree:

1. **Need to create a new executable?**
   - Yes → Use **App Development** (references/app-development.md)
   - No → Continue to step 2

2. **Need to manage files or data?**
   - Yes → Use **Data Operations** (references/data-operations.md)
   - No → Continue to step 3

3. **Need to run an analysis or workflow?**
   - Yes → Use **Job Execution** (references/job-execution.md)
   - No → Continue to step 4

4. **Writing Python scripts for automation?**
   - Yes → Use **Python SDK** (references/python-sdk.md)
   - No → Continue to step 5

5. **Configuring app settings or dependencies?**
   - Yes → Use **Configuration** (references/configuration.md)

Often you'll need multiple capabilities together (e.g., app development + configuration, or data operations + job execution).

## Installation and Authentication

### Install dxpy

```bash
uv pip install dxpy
```

### Log in to DNAnexus

```bash
dx login
```

This authenticates your session and sets up access to projects and data.

### Verify Installation

```bash
dx --version
dx whoami
```

## Common Patterns

### Pattern 1: Batch Processing

Process multiple files with the same analysis:

```python
# Find all FASTQ files (name_mode="glob" enables wildcard matching)
files = dxpy.find_data_objects(
    classname="file",
    name="*.fastq",
    name_mode="glob",
    project="project-xxxx"
)

# Launch parallel jobs
jobs = []
for file_result in files:
    job = dxpy.DXApplet("applet-xxxx").run({
        "input": dxpy.dxlink(file_result["id"])
    })
    jobs.append(job)

# Wait for all jobs to finish
for job in jobs:
    job.wait_on_done()
```

### Pattern 2: Multi-Step Pipeline

Chain multiple analyses together:

```python
# Step 1: Quality control
qc_job = qc_applet.run({"reads": input_file})

# Step 2: Alignment (uses the QC output)
align_job = align_applet.run({
    "reads": qc_job.get_output_ref("filtered_reads")
})

# Step 3: Variant calling (uses the alignment output)
variant_job = variant_applet.run({
    "bam": align_job.get_output_ref("aligned_bam")
})
```

### Pattern 3: Data Organization

Organize analysis results systematically:

```python
# Create an organized folder structure
dxpy.api.project_new_folder(
    "project-xxxx",
    {"folder": "/experiments/exp001/results", "parents": True}
)

# Upload with metadata
result_file = dxpy.upload_local_file(
    "results.txt",
    project="project-xxxx",
    folder="/experiments/exp001/results",
    properties={
        "experiment": "exp001",
        "sample": "sample1",
        "analysis_date": "2025-10-20"
    },
    tags=["validated", "published"]
)
```

## Best Practices

1. **Error Handling**: Always wrap API calls in try-except blocks
2. **Resource Management**: Choose appropriate instance types for workloads
3. **Data Organization**: Use consistent folder structures and metadata
4. **Cost Optimization**: Archive old data, use appropriate storage classes
5. **Documentation**: Include clear descriptions in dxapp.json
6. **Testing**: Test apps with various input types before production use
7. **Version Control**: Use semantic versioning for apps
8. **Security**: Never hardcode credentials in source code
9. **Logging**: Include informative log messages for debugging
10. **Cleanup**: Remove temporary files and failed jobs

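For practice 1, transient API errors (throttling, network blips) are usually worth retrying with backoff rather than failing outright. A generic sketch of that pattern (in real dxpy code the exception to catch would be `dxpy.exceptions.DXAPIError`; a plain `Exception` tuple stands in here so the sketch is self-contained):

```python
import time

def with_retries(call, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Run call(); on a listed exception, retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch:
#   with_retries(lambda: dxpy.DXFile("file-xxxx").describe())
```

Permanent errors (bad input, missing permissions) should not be retried, so keep `retry_on` narrow in production code.
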
## Resources

This skill includes detailed reference documentation:

### references/

- **app-development.md** - Complete guide to building and deploying apps/applets
- **data-operations.md** - File management, records, search, and project operations
- **job-execution.md** - Running jobs, workflows, monitoring, and parallel processing
- **python-sdk.md** - Comprehensive dxpy library reference with all classes and functions
- **configuration.md** - dxapp.json specification and dependency management

Load these references when you need detailed information about specific operations or when working on complex tasks.

## Getting Help

- Official documentation: https://documentation.dnanexus.com/
- API reference: http://autodoc.dnanexus.com/
- GitHub repository: https://github.com/dnanexus/dx-toolkit
- Support: support@dnanexus.com