Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/dnanexus-integration/references/data-operations.md
+++ b/skills/dnanexus-integration/references/data-operations.md
@@ -0,0 +1,400 @@
+# DNAnexus Data Operations
+
+## Overview
+
+DNAnexus provides comprehensive data management capabilities for files, records, databases, and other data objects. All data operations can be performed via the Python SDK (dxpy) or command-line interface (dx).
+
+## Data Object Types
+
+### Files
+Binary or text data stored on the platform.
+
+### Records
+Structured data objects with arbitrary JSON details and metadata.
+
+### Databases
+Structured database objects for relational data.
+
+### Applets and Apps
+Executable programs (covered in app-development.md).
+
+### Workflows
+Multi-step analysis pipelines.
+
+## Data Object Lifecycle
+
+### States
+
+**Open State**: Data can be modified
+- Files: Contents can be written
+- Records: Details can be updated
+- Applets: Created in closed state by default
+
+**Closed State**: Data becomes immutable
+- File contents are fixed
+- Metadata fields are locked (types, details, links, visibility)
+- Objects are ready for sharing and analysis
+
+### Transitions
+
+```
+Create (open) → Modify → Close (immutable)
+```
+
+Most objects start open and require explicit closure:
+```python
+# Close a file
+file_obj.close()
+```
+
+Some objects can be created and closed in one operation:
+```python
+# Create closed record
+record = dxpy.new_dxrecord(details={...}, close=True)
+```
+
+## File Operations
+
+### Uploading Files
+
+**From local file**:
+```python
+import dxpy
+
+# Upload a file
+file_obj = dxpy.upload_local_file("data.txt", project="project-xxxx")
+print(f"Uploaded: {file_obj.get_id()}")
+```
+
+**With metadata**:
+```python
+file_obj = dxpy.upload_local_file(
+    "data.txt",
+    name="my_data",
+    project="project-xxxx",
+    folder="/results",
+    properties={"sample": "sample1", "type": "raw"},
+    tags=["experiment1", "batch2"]
+)
+```
+
+**Streaming upload**:
+```python
+# For large files or generated data
+file_obj = dxpy.new_dxfile(project="project-xxxx", name="output.txt")
+file_obj.write("Line 1\n")
+file_obj.write("Line 2\n")
+file_obj.close()
+```
+
+### Downloading Files
+
+**To local file**:
+```python
+# Download by ID
+dxpy.download_dxfile("file-xxxx", "local_output.txt")
+
+# Download using handler
+file_obj = dxpy.DXFile("file-xxxx")
+dxpy.download_dxfile(file_obj.get_id(), "local_output.txt")
+```
+
+**Read file contents**:
+```python
+file_obj = dxpy.DXFile("file-xxxx")
+with file_obj.open_file() as f:
+    contents = f.read()
+```
+
+**Download to specific directory**:
+```python
+dxpy.download_dxfile("file-xxxx", "/path/to/directory/filename.txt")
+```
+
+### File Metadata
+
+**Get file information**:
+```python
+file_obj = dxpy.DXFile("file-xxxx")
+describe = file_obj.describe()
+
+print(f"Name: {describe['name']}")
+print(f"Size: {describe['size']} bytes")
+print(f"State: {describe['state']}")
+print(f"Created: {describe['created']}")
+```
+
+**Update file metadata**:
+```python
+file_obj.set_properties({"experiment": "exp1", "version": "v2"})
+file_obj.add_tags(["validated", "published"])
+file_obj.rename("new_name.txt")
+```
+
+## Record Operations
+
+Records store structured metadata with arbitrary JSON.
+
+### Creating Records
+
+```python
+# Create a record
+record = dxpy.new_dxrecord(
+    name="sample_metadata",
+    types=["SampleMetadata"],
+    details={
+        "sample_id": "S001",
+        "tissue": "blood",
+        "age": 45,
+        "conditions": ["diabetes"]
+    },
+    project="project-xxxx",
+    close=True
+)
+```
+
+### Reading Records
+
+```python
+record = dxpy.DXRecord("record-xxxx")
+describe = record.describe()
+
+# Access details
+details = record.get_details()
+sample_id = details["sample_id"]
+tissue = details["tissue"]
+```
+
+### Updating Records
+
+```python
+# Record must be open to update
+record = dxpy.DXRecord("record-xxxx")
+details = record.get_details()
+details["processed"] = True
+record.set_details(details)
+record.close()
+```
+
+## Search and Discovery
+
+### Finding Data Objects
+
+**Search by name**:
+```python
+results = dxpy.find_data_objects(
+    name="*.fastq",
+    project="project-xxxx",
+    folder="/raw_data"
+)
+
+for result in results:
+    print(f"{result['describe']['name']}: {result['id']}")
+```
+
+**Search by properties**:
+```python
+results = dxpy.find_data_objects(
+    classname="file",
+    properties={"sample": "sample1", "type": "processed"},
+    project="project-xxxx"
+)
+```
+
+**Search by type**:
+```python
+# Find all records of specific type
+results = dxpy.find_data_objects(
+    classname="record",
+    typename="SampleMetadata",
+    project="project-xxxx"
+)
+```
+
+**Search with state filter**:
+```python
+# Find only closed files
+results = dxpy.find_data_objects(
+    classname="file",
+    state="closed",
+    project="project-xxxx"
+)
+```
+
+### System-wide Search
+
+```python
+# Search across all accessible projects
+results = dxpy.find_data_objects(
+    name="important_data.txt",
+    describe=True  # Include full descriptions
+)
+```
+
+## Cloning and Copying
+
+### Clone Data Between Projects
+
+```python
+# Clone file to another project
+new_file = dxpy.DXFile("file-xxxx").clone(
+    project="project-yyyy",
+    folder="/imported_data"
+)
+```
+
+### Clone Multiple Objects
+
+```python
+# Clone folder contents
+files = dxpy.find_data_objects(
+    classname="file",
+    project="project-xxxx",
+    folder="/results"
+)
+
+for file in files:
+    file_obj = dxpy.DXFile(file['id'])
+    file_obj.clone(project="project-yyyy", folder="/backup")
+```
+
+## Project Management
+
+### Creating Projects
+
+```python
+# Create a new project
+project = dxpy.api.project_new({
+    "name": "My Analysis Project",
+    "description": "RNA-seq analysis for experiment X"
+})
+
+project_id = project['id']
+```
+
+### Project Permissions
+
+```python
+# Invite user to project
+dxpy.api.project_invite(
+    project_id,
+    {
+        "invitee": "user-xxxx",
+        "level": "CONTRIBUTE"  # VIEW, UPLOAD, CONTRIBUTE, ADMINISTER
+    }
+)
+```
+
+### List Projects
+
+```python
+# List accessible projects
+projects = dxpy.find_projects(describe=True)
+
+for proj in projects:
+    desc = proj['describe']
+    print(f"{desc['name']}: {proj['id']}")
+```
+
+## Folder Operations
+
+### Creating Folders
+
+```python
+# Create nested folders
+dxpy.api.project_new_folder(
+    "project-xxxx",
+    {"folder": "/analysis/batch1/results", "parents": True}
+)
+```
+
+### Moving Objects
+
+```python
+# Move file to different folder
+file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
+file_obj.move("/new_location")
+```
+
+### Removing Objects
+
+```python
+# Remove file from project (not permanent deletion)
+dxpy.api.project_remove_objects(
+    "project-xxxx",
+    {"objects": ["file-xxxx"]}
+)
+
+# Permanent deletion
+file_obj = dxpy.DXFile("file-xxxx")
+file_obj.remove()
+```
+
+## Archival
+
+### Archive Data
+
+Archived data is moved to cheaper long-term storage:
+
+```python
+# Archive a file
+dxpy.api.project_archive(
+    "project-xxxx",
+    {"files": ["file-xxxx"]}
+)
+```
+
+### Unarchive Data
+
+```python
+# Unarchive when needed
+dxpy.api.project_unarchive(
+    "project-xxxx",
+    {"files": ["file-xxxx"]}
+)
+```
+
+## Batch Operations
+
+### Upload Multiple Files
+
+```python
+import os
+
+# Upload all files in directory
+for filename in os.listdir("./data"):
+    filepath = os.path.join("./data", filename)
+    if os.path.isfile(filepath):
+        dxpy.upload_local_file(
+            filepath,
+            project="project-xxxx",
+            folder="/batch_upload"
+        )
+```
+
+### Download Multiple Files
+
+```python
+# Download all files from folder
+files = dxpy.find_data_objects(
+    classname="file",
+    project="project-xxxx",
+    folder="/results"
+)
+
+for file in files:
+    file_obj = dxpy.DXFile(file['id'])
+    filename = file_obj.describe()['name']
+    dxpy.download_dxfile(file['id'], f"./downloads/{filename}")
+```
+
+## Best Practices
+
+1. **Close Files**: Always close files after writing to make them accessible
+2. **Use Properties**: Tag data with meaningful properties for easier discovery
+3. **Organize Folders**: Use logical folder structures
+4. **Clean Up**: Remove temporary or obsolete data
+5. **Batch Operations**: Group operations when processing many objects
+6. **Error Handling**: Check object states before operations
+7. **Permissions**: Verify project permissions before data operations
+8. **Archive Old Data**: Use archival for long-term storage cost savings