# DNAnexus Job Execution and Workflows

## Overview

Jobs are the fundamental execution units on DNAnexus. When an applet or app runs, a job is created and executed on a worker node in an isolated Linux environment with constant API access.

## Job Types

### Origin Jobs

Created directly by a user or an automated system (for example, via `dx run` or an API call) rather than by another job.

### Master Jobs

The top-level job created when an executable (app/applet) is launched directly; any subjobs it spawns belong to its job tree.

### Child Jobs

Spawned by parent jobs for parallel processing or sub-workflows.

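A job's place in this hierarchy can be read from its describe output: origin jobs have no `parentJob`, while subjobs record both their parent and the origin of the tree. A minimal sketch (the job ID is a placeholder):

```python
import dxpy

# Inspect how a job was launched; the job ID is a placeholder
desc = dxpy.DXJob("job-xxxx").describe()

if desc.get("parentJob") is None:
    print("Origin job: launched directly by a user or automated system")
else:
    print(f"Child job spawned by {desc['parentJob']}")
    print(f"Origin of this job tree: {desc.get('originJob')}")
```
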
## Running Jobs

### Running an Applet

**Basic execution**:
```python
import dxpy

# Run an applet
job = dxpy.DXApplet("applet-xxxx").run({
    "input1": {"$dnanexus_link": "file-yyyy"},
    "input2": "parameter_value"
})

print(f"Job ID: {job.get_id()}")
```

**Using command line**:
```bash
dx run applet-xxxx -i input1=file-yyyy -i input2="value"
```

### Running an App

```python
# Run an app by name
job = dxpy.DXApp(name="my-app").run({
    "reads": {"$dnanexus_link": "file-xxxx"},
    "quality_threshold": 30
})
```

### Specifying Execution Parameters

```python
job = dxpy.DXApplet("applet-xxxx").run(
    applet_input={
        "input_file": {"$dnanexus_link": "file-yyyy"}
    },
    project="project-zzzz",        # Output project
    folder="/results",             # Output folder
    name="My Analysis Job",        # Job name
    instance_type="mem2_hdd2_x4",  # Override instance type
    priority="high"                # Job priority
)
```

## Job Monitoring

### Checking Job Status

```python
job = dxpy.DXJob("job-xxxx")
state = job.describe()["state"]

# States: idle, waiting_on_input, runnable, running, done, failed, terminated
print(f"Job state: {state}")
```

**Using command line**:
```bash
dx watch job-xxxx
```

### Waiting for Job Completion

```python
job = dxpy.DXJob("job-xxxx")

# Block until the job completes
# (wait_on_done raises DXJobFailureError if the job does not finish successfully)
try:
    job.wait_on_done()
    output = job.describe()["output"]
    print(f"Job completed: {output}")
except dxpy.exceptions.DXJobFailureError:
    print("Job failed")
```

### Getting Job Output

```python
job = dxpy.DXJob("job-xxxx")

# Wait for completion
job.wait_on_done()

# Get outputs
output = job.describe()["output"]
output_file_id = output["result_file"]["$dnanexus_link"]

# Download result
dxpy.download_dxfile(output_file_id, "result.txt")
```

### Job Output References

Create references to job outputs before they complete:

```python
# Launch first job
job1 = dxpy.DXApplet("applet-1").run({"input": "..."})

# Launch second job using an output reference; get_output_ref() returns a
# job-based link that resolves once job1 finishes
job2 = dxpy.DXApplet("applet-2").run({
    "input": job1.get_output_ref("output_name")
})
```

## Job Logs

### Viewing Logs

**Command line**:
```bash
dx watch job-xxxx --get-streams
```

**Programmatically** (job logs are streamed by the platform; a simple approach is to capture the output of the `dx` CLI):
```python
import subprocess

# Capture the log of a job by shelling out to `dx watch`
result = subprocess.run(
    ["dx", "watch", "job-xxxx", "--get-streams"],
    capture_output=True, text=True, check=True
)

for log_line in result.stdout.splitlines():
    print(log_line)
```

## Parallel Execution

### Creating Subjobs

```python
@dxpy.entry_point('main')
def main(input_files):
    # Create subjobs for parallel processing
    subjobs = []

    for input_file in input_files:
        subjob = dxpy.new_dxjob(
            fn_input={"file": input_file},
            fn_name="process_file"
        )
        subjobs.append(subjob)

    # Collect results
    results = []
    for subjob in subjobs:
        result = subjob.get_output_ref("processed_file")
        results.append(result)

    return {"all_results": results}


@dxpy.entry_point('process_file')
def process_file(file):
    # Process single file
    # ...
    return {"processed_file": output_file}
```

### Scatter-Gather Pattern

```python
# Scatter: Process items in parallel
scatter_jobs = []
for item in items:
    job = dxpy.new_dxjob(
        fn_input={"item": item},
        fn_name="process_item"
    )
    scatter_jobs.append(job)

# Gather: Combine results
gather_job = dxpy.new_dxjob(
    fn_input={
        "results": [job.get_output_ref("result") for job in scatter_jobs]
    },
    fn_name="combine_results"
)
```

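The gather step above hands the scattered output references to a `combine_results` entry point. A minimal sketch of what that entry point might look like (the field names mirror the scatter code and are illustrative):

```python
@dxpy.entry_point('combine_results')
def combine_results(results):
    # By the time this subjob runs, the platform has resolved the
    # job-based references in "results" into concrete values
    combined = list(results)
    # ... merge or summarize the per-item outputs here ...
    return {"combined": combined}
```
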
## Workflows

Workflows combine multiple apps/applets into multi-step pipelines.

### Creating a Workflow

```python
# Create workflow
workflow = dxpy.new_dxworkflow(
    name="My Analysis Pipeline",
    project="project-xxxx"
)

# Add stages (add_stage returns the new stage's ID)
stage1 = workflow.add_stage(
    dxpy.DXApplet("applet-1"),
    name="Quality Control",
    folder="/qc"
)

stage2 = workflow.add_stage(
    dxpy.DXApplet("applet-2"),
    name="Alignment",
    folder="/alignment"
)

# Connect stages: feed stage 1's "filtered_reads" output into stage 2's "reads" input
workflow.update_stage(
    stage2,
    stage_input={
        "reads": {"$dnanexus_link": {"stage": stage1, "outputField": "filtered_reads"}}
    }
)

# Close workflow
workflow.close()
```

### Running a Workflow

```python
# Run workflow
analysis = workflow.run({
    "stage-xxxx.input1": {"$dnanexus_link": "file-yyyy"}
})

# Monitor analysis (collection of jobs)
analysis.wait_on_done()

# Get workflow outputs
outputs = analysis.describe()["output"]
```

**Using command line**:
```bash
dx run workflow-xxxx -i stage-xxxx.input1=file-yyyy
```

## Job Permissions and Context

### Workspace Context

Jobs run in a temporary workspace project into which input data is cloned:
- Jobs require `CONTRIBUTE` permission to their workspace
- Jobs need `VIEW` access to source projects
- All charges accumulate to the originating project

### Data Requirements

Jobs cannot start until:
1. All input data objects are in the `closed` state
2. Required permissions are available
3. Resources are allocated

Output objects must reach the `closed` state before the workspace is cleaned up.

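A launch script can make the input requirement explicit. A minimal sketch, assuming placeholder file and applet IDs, that checks the input's state and uses the `depends_on` run parameter to defer execution until dependencies are ready:

```python
import dxpy

input_file = dxpy.DXFile("file-yyyy")

# A job cannot consume this input until it reaches the "closed" state
state = input_file.describe()["state"]
print(f"Input state: {state}")

# depends_on defers the job until the listed objects (or jobs) are ready
job = dxpy.DXApplet("applet-xxxx").run(
    {"input_file": dxpy.dxlink(input_file)},
    depends_on=[input_file.get_id()]
)
```
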
## Job Lifecycle

```
Created → Waiting on Input → Runnable → Running → Done/Failed
```

**States**:
- `idle`: Job created but not yet queued
- `waiting_on_input`: Waiting for input data objects to close
- `runnable`: Ready to run, waiting for resources
- `running`: Currently executing
- `done`: Completed successfully
- `failed`: Execution failed
- `terminated`: Manually stopped

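For long-running jobs, an alternative to blocking on `wait_on_done()` is to poll the state until it reaches a terminal value, for example:

```python
import time
import dxpy

job = dxpy.DXJob("job-xxxx")

# Poll until the job reaches a terminal state
while True:
    state = job.describe()["state"]
    print(f"Current state: {state}")
    if state in ("done", "failed", "terminated"):
        break
    time.sleep(30)  # check every 30 seconds
```
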
## Error Handling

### Job Failure

```python
job = dxpy.DXJob("job-xxxx")

# wait_on_done() raises DXJobFailureError if the job does not finish successfully
try:
    job.wait_on_done()
except dxpy.exceptions.DXJobFailureError:
    desc = job.describe()
    print(f"Job failed: {desc.get('failureReason', 'Unknown')}")
    print(f"Failure message: {desc.get('failureMessage', '')}")
```

### Retry Failed Jobs

```python
# Rerun a failed job with its original inputs
desc = dxpy.DXJob("job-xxxx").describe()

new_job = dxpy.DXApplet(desc["applet"]).run(
    desc["originalInput"],
    project=desc["project"]
)
```

### Terminating Jobs

```python
# Stop a running job
job = dxpy.DXJob("job-xxxx")
job.terminate()
```

**Using command line**:
```bash
dx terminate job-xxxx
```

## Resource Management

### Instance Types

Specify computational resources:

```python
# Run with specific instance type
job = dxpy.DXApplet("applet-xxxx").run(
    {"input": "..."},
    instance_type="mem3_ssd1_v2_x8"  # 8 cores, high memory, SSD
)
```

Common instance types:
- `mem1_ssd1_v2_x4` - 4 cores, standard memory
- `mem2_ssd1_v2_x8` - 8 cores, high memory
- `mem3_ssd1_v2_x16` - 16 cores, very high memory
- `mem1_ssd1_v2_x36` - 36 cores for parallel workloads

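Instance selection can also be made data-driven. A minimal sketch that picks a larger instance for larger inputs (the size threshold and instance names are illustrative):

```python
import dxpy

input_file = dxpy.DXFile("file-yyyy")
size_gb = input_file.describe()["size"] / (1024 ** 3)

# Choose a bigger machine for bigger inputs (threshold is illustrative)
instance = "mem1_ssd1_v2_x4" if size_gb < 10 else "mem2_ssd1_v2_x8"

job = dxpy.DXApplet("applet-xxxx").run(
    {"input_file": dxpy.dxlink(input_file)},
    instance_type=instance
)
```
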
### Timeout Settings

Set maximum execution time:

```python
job = dxpy.DXApplet("applet-xxxx").run(
    {"input": "..."},
    timeout="24h"  # Maximum runtime
)
```

## Job Tagging and Metadata

### Add Job Tags

```python
job = dxpy.DXApplet("applet-xxxx").run(
    {"input": "..."},
    tags=["experiment1", "batch2", "production"]
)
```

### Add Job Properties

```python
job = dxpy.DXApplet("applet-xxxx").run(
    {"input": "..."},
    properties={
        "experiment": "exp001",
        "sample": "sample1",
        "batch": "batch2"
    }
)
```

### Finding Jobs

```python
# Find jobs by tag
jobs = dxpy.find_jobs(
    project="project-xxxx",
    tags=["experiment1"],
    describe=True
)

for job in jobs:
    print(f"{job['describe']['name']}: {job['id']}")
```

## Best Practices

1. **Job Naming**: Use descriptive names for easier tracking
2. **Tags and Properties**: Tag jobs for organization and searchability
3. **Resource Selection**: Choose appropriate instance types for the workload
4. **Error Handling**: Check job state and handle failures gracefully
5. **Parallel Processing**: Use subjobs for independent parallel tasks
6. **Workflows**: Use workflows for complex multi-step analyses
7. **Monitoring**: Monitor long-running jobs and check logs for issues
8. **Cost Management**: Use appropriate instance types to balance cost/performance
9. **Timeouts**: Set reasonable timeouts to prevent runaway jobs
10. **Cleanup**: Remove failed or obsolete jobs

## Debugging Tips

1. **Check Logs**: Always review job logs for error messages
2. **Verify Inputs**: Ensure input files are closed and accessible
3. **Test Locally**: Test logic locally before deploying to the platform
4. **Start Small**: Test with small datasets before scaling up
5. **Monitor Resources**: Check if the job is running out of memory or disk space
6. **Instance Type**: Try a larger instance if the job fails due to resource limits