DNAnexus Data Operations

Overview

DNAnexus manages files, records, databases, and other data objects on the platform. All data operations can be performed through the Python SDK (dxpy) or the command-line interface (dx).

Data Object Types

Files

Binary or text data stored on the platform.

Records

Structured data objects with arbitrary JSON details and metadata.

Databases

Structured database objects for relational data.

Applets and Apps

Executable programs (covered in app-development.md).

Workflows

Multi-step analysis pipelines.

Data Object Lifecycle

States

Open State: Data can be modified

  • Files: Contents can be written
  • Records: Details can be updated
  • Applets: An exception; they are created in the closed state by default

Closed State: Data becomes immutable

  • File contents are fixed
  • Metadata fields are locked (types, details, links, visibility)
  • Objects are ready for sharing and analysis

Transitions

Create (open) → Modify → Close (immutable)

Most objects start open and require explicit closure:

# Close a file
file_obj.close()

Some objects can be created and closed in one operation:

# Create closed record
record = dxpy.new_dxrecord(details={...}, close=True)
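
The lifecycle above suggests checking an object's state before trying to modify it. The following is a minimal sketch that works on the dictionary returned by describe(). The helper name can_modify is hypothetical, and the state strings ("open", "closing", "closed") are assumed to match what the platform reports in the state field.

```python
def can_modify(describe: dict) -> bool:
    """Return True only when the object is still in the open state."""
    # Anything other than "open" (for example "closing" or "closed")
    # is treated as read-only in this sketch.
    return describe.get("state") == "open"


# Stand-in describe() results; real ones would come from a handler's describe()
open_file = {"id": "file-xxxx", "state": "open"}
closed_file = {"id": "file-yyyy", "state": "closed"}

for desc in (open_file, closed_file):
    action = "can be written" if can_modify(desc) else "is read-only"
    print(f"{desc['id']} {action}")
```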

File Operations

Uploading Files

From local file:

import dxpy

# Upload a file
file_obj = dxpy.upload_local_file("data.txt", project="project-xxxx")
print(f"Uploaded: {file_obj.get_id()}")

With metadata:

file_obj = dxpy.upload_local_file(
    "data.txt",
    name="my_data",
    project="project-xxxx",
    folder="/results",
    properties={"sample": "sample1", "type": "raw"},
    tags=["experiment1", "batch2"]
)

Streaming upload:

# For large files or generated data
file_obj = dxpy.new_dxfile(project="project-xxxx", name="output.txt")
file_obj.write("Line 1\n")
file_obj.write("Line 2\n")
file_obj.close()

Downloading Files

To local file:

# Download by ID
dxpy.download_dxfile("file-xxxx", "local_output.txt")

# Download using handler
file_obj = dxpy.DXFile("file-xxxx")
dxpy.download_dxfile(file_obj.get_id(), "local_output.txt")

Read file contents:

# Open the remote file for reading; the handler is released when the block exits
with dxpy.open_dxfile("file-xxxx") as f:
    contents = f.read()

Download to specific directory:

dxpy.download_dxfile("file-xxxx", "/path/to/directory/filename.txt")

File Metadata

Get file information:

file_obj = dxpy.DXFile("file-xxxx")
describe = file_obj.describe()

print(f"Name: {describe['name']}")
print(f"Size: {describe['size']} bytes")
print(f"State: {describe['state']}")
print(f"Created: {describe['created']}")
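
The raw describe() values are easier to read after some formatting. This sketch assumes that created is reported in milliseconds since the Unix epoch and size in bytes; the helper names format_created and format_size are hypothetical.

```python
from datetime import datetime, timezone


def format_created(created_ms: int) -> str:
    """Convert a millisecond timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_ms / 1000, tz=timezone.utc).isoformat()


def format_size(num_bytes: int) -> str:
    """Render a byte count using binary units (KiB, MiB, ...)."""
    if num_bytes < 1024:
        return f"{num_bytes} bytes"
    size = float(num_bytes)
    for unit in ("KiB", "MiB", "GiB", "TiB"):
        size /= 1024
        # Stop at the first unit that fits, or at the largest unit available
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"


# Stand-in for file_obj.describe(); the values are illustrative only
describe = {"name": "data.txt", "size": 1536, "created": 1_700_000_000_000}
print(f"Size: {format_size(describe['size'])}")
print(f"Created: {format_created(describe['created'])}")
```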

Update file metadata:

file_obj.set_properties({"experiment": "exp1", "version": "v2"})
file_obj.add_tags(["validated", "published"])
file_obj.rename("new_name.txt")

Record Operations

Records store structured metadata with arbitrary JSON.

Creating Records

# Create a record
record = dxpy.new_dxrecord(
    name="sample_metadata",
    types=["SampleMetadata"],
    details={
        "sample_id": "S001",
        "tissue": "blood",
        "age": 45,
        "conditions": ["diabetes"]
    },
    project="project-xxxx",
    close=True
)

Reading Records

record = dxpy.DXRecord("record-xxxx")
describe = record.describe()

# Access details
details = record.get_details()
sample_id = details["sample_id"]
tissue = details["tissue"]

Updating Records

# Record must be open to update
record = dxpy.DXRecord("record-xxxx")
details = record.get_details()
details["processed"] = True
record.set_details(details)
record.close()

Search and Discovery

Finding Data Objects

Search by name:

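results = dxpy.find_data_objects(
    name="*.fastq",
    name_mode="glob",  # names are matched exactly unless a mode is given
    project="project-xxxx",
    folder="/raw_data",
    describe=True  # required for result['describe'] in the loop below
)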

for result in results:
    print(f"{result['describe']['name']}: {result['id']}")

Search by properties:

results = dxpy.find_data_objects(
    classname="file",
    properties={"sample": "sample1", "type": "processed"},
    project="project-xxxx"
)

Search by type:

# Find all records of specific type
results = dxpy.find_data_objects(
    classname="record",
    typename="SampleMetadata",
    project="project-xxxx"
)

Search with state filter:

# Find only closed files
results = dxpy.find_data_objects(
    classname="file",
    state="closed",
    project="project-xxxx"
)

Search across projects:

# Search across all accessible projects
results = dxpy.find_data_objects(
    name="important_data.txt",
    describe=True  # Include full descriptions
)
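
Search results arrive as an iterable of dictionaries, so they can be summarized in a single pass. The sketch below assumes each result carries a describe dictionary (that is, the search was run with describe=True) and that file descriptions include a size in bytes. The summarize helper is hypothetical.

```python
def summarize(results):
    """Return (object count, total size in bytes) for an iterable of results."""
    count = 0
    total_bytes = 0
    for result in results:
        count += 1
        # Objects without a size (for example records) contribute zero
        total_bytes += result["describe"].get("size", 0)
    return count, total_bytes


# Stand-in results in the same shape as find_data_objects(..., describe=True)
sample_results = [
    {"id": "file-aaaa", "describe": {"name": "a.fastq", "size": 1000}},
    {"id": "file-bbbb", "describe": {"name": "b.fastq", "size": 2000}},
    {"id": "record-cccc", "describe": {"name": "sample_metadata"}},
]

count, total_bytes = summarize(sample_results)
print(f"{count} objects, {total_bytes} bytes")
```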

Cloning and Copying

Clone Data Between Projects

# Clone file to another project
new_file = dxpy.DXFile("file-xxxx").clone(
    project="project-yyyy",
    folder="/imported_data"
)

Clone Multiple Objects

# Clone folder contents
files = dxpy.find_data_objects(
    classname="file",
    project="project-xxxx",
    folder="/results"
)

for file in files:
    file_obj = dxpy.DXFile(file['id'])
    file_obj.clone(project="project-yyyy", folder="/backup")
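
Cloning one object per request adds up for large folders. A possible alternative is to send one request for the whole set. This sketch assumes that the project clone API route accepts a list of object IDs under "objects", the destination project under "project", and the target folder under "destination". build_clone_request is a hypothetical helper, and the call that would send the request is left as a comment.

```python
def build_clone_request(object_ids, dest_project, dest_folder="/"):
    """Build the input for a single clone request covering many objects."""
    if not object_ids:
        raise ValueError("object_ids must not be empty")
    return {
        "objects": list(object_ids),
        "project": dest_project,
        "destination": dest_folder,
    }


request = build_clone_request(["file-xxxx", "file-yyyy"], "project-yyyy", "/backup")
print(request)

# Assumed call, not executed here:
# dxpy.api.project_clone("project-xxxx", request)
```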

Project Management

Creating Projects

# Create a new project
project = dxpy.api.project_new({
    "name": "My Analysis Project",
    "description": "RNA-seq analysis for experiment X"
})

project_id = project['id']

Project Permissions

# Invite user to project
dxpy.api.project_invite(
    project_id,
    {
        "invitee": "user-xxxx",
        "level": "CONTRIBUTE"  # VIEW, UPLOAD, CONTRIBUTE, ADMINISTER
    }
)
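
The four levels in the comment above form an ordered scale. A small comparison helper can guard operations that need a minimum level. This sketch assumes each level includes the rights of the levels before it; has_at_least is a hypothetical helper, and the current level would typically be read from the project's description.

```python
# Permission levels in increasing order, as listed in the invite example
LEVELS = ["VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]


def has_at_least(current: str, required: str) -> bool:
    """Return True when `current` grants at least the rights of `required`."""
    return LEVELS.index(current) >= LEVELS.index(required)


# Example: a contributor may upload, but may not administer the project
print(has_at_least("CONTRIBUTE", "UPLOAD"))
print(has_at_least("CONTRIBUTE", "ADMINISTER"))
```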

List Projects

# List accessible projects
projects = dxpy.find_projects(describe=True)

for proj in projects:
    desc = proj['describe']
    print(f"{desc['name']}: {proj['id']}")

Folder Operations

Creating Folders

# Create nested folders
dxpy.api.project_new_folder(
    "project-xxxx",
    {"folder": "/analysis/batch1/results", "parents": True}
)

Moving Objects

# Move file to different folder
file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
file_obj.move("/new_location")

Removing Objects

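# Remove a file from the project through the API route.
# The underlying data is deleted only once no other project holds a copy.
dxpy.api.project_remove_objects(
    "project-xxxx",
    {"objects": ["file-xxxx"]}
)

# Equivalent removal through the handler
file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
file_obj.remove()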

Archival

Archive Data

Archived data is moved to cheaper long-term storage:

# Archive a file
dxpy.api.project_archive(
    "project-xxxx",
    {"files": ["file-xxxx"]}
)

Unarchive Data

# Unarchive when needed
dxpy.api.project_unarchive(
    "project-xxxx",
    {"files": ["file-xxxx"]}
)
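
Archived files are not immediately usable, so it can help to find out which ones need unarchiving before starting an analysis. This sketch assumes that a file description includes an archivalState field with the value "live" for files that are ready to use; both the field handling and the helper name files_needing_unarchive are assumptions to verify against the platform documentation.

```python
def files_needing_unarchive(describes):
    """Return the IDs of files whose archival state is anything but live."""
    return [
        desc["id"]
        for desc in describes
        # A missing field is treated as live in this sketch
        if desc.get("archivalState", "live") != "live"
    ]


# Stand-in describe() results; the values are illustrative only
describes = [
    {"id": "file-aaaa", "archivalState": "live"},
    {"id": "file-bbbb", "archivalState": "archived"},
    {"id": "file-cccc"},
]
print(files_needing_unarchive(describes))
```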

Batch Operations

Upload Multiple Files

import os

# Upload all files in directory
for filename in os.listdir("./data"):
    filepath = os.path.join("./data", filename)
    if os.path.isfile(filepath):
        dxpy.upload_local_file(
            filepath,
            project="project-xxxx",
            folder="/batch_upload"
        )
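
The loop above uploads only the top level of ./data. To keep subdirectories, the upload can be planned first as pairs of local path and target folder. This is a sketch under stated assumptions: plan_uploads is a hypothetical helper, and the target folders are assumed to exist already or to be created beforehand (see Creating Folders). The upload call itself is left as a comment.

```python
import os
import posixpath


def plan_uploads(local_root, remote_root):
    """Yield (local_path, remote_folder) for every file under local_root."""
    for dirpath, _dirnames, filenames in os.walk(local_root):
        rel = os.path.relpath(dirpath, local_root)
        if rel == ".":
            remote_folder = remote_root
        else:
            # Platform folders use forward slashes regardless of the local OS
            remote_folder = posixpath.join(remote_root, *rel.split(os.sep))
        for filename in sorted(filenames):
            yield os.path.join(dirpath, filename), remote_folder


for local_path, remote_folder in plan_uploads("./data", "/batch_upload"):
    print(f"{local_path} -> {remote_folder}")
    # dxpy.upload_local_file(local_path, project="project-xxxx", folder=remote_folder)
```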

Download Multiple Files

import os

# Download all files from folder
os.makedirs("./downloads", exist_ok=True)  # make sure the target directory exists

files = dxpy.find_data_objects(
    classname="file",
    project="project-xxxx",
    folder="/results",
    describe=True  # return names with the results instead of one describe call per file
)

for file in files:
    filename = file['describe']['name']
    dxpy.download_dxfile(file['id'], f"./downloads/{filename}")
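
Downloading by name overwrites earlier files whenever two objects in the folder share a name. One way to guard against that, assuming object names are not guaranteed to be unique within a folder, is to prefix repeated names with the object ID. local_names is a hypothetical helper.

```python
from collections import Counter


def local_names(results):
    """Map each file ID to a local filename, disambiguating repeated names."""
    results = list(results)
    counts = Counter(r["describe"]["name"] for r in results)
    names = {}
    for r in results:
        name = r["describe"]["name"]
        if counts[name] > 1:
            # Prefix with the object ID so repeated names do not collide
            name = f"{r['id']}_{name}"
        names[r["id"]] = name
    return names


# Stand-in results in the same shape as find_data_objects(..., describe=True)
results = [
    {"id": "file-aaaa", "describe": {"name": "out.txt"}},
    {"id": "file-bbbb", "describe": {"name": "out.txt"}},
    {"id": "file-cccc", "describe": {"name": "log.txt"}},
]
for file_id, name in local_names(results).items():
    print(f"{file_id} -> ./downloads/{name}")
```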

Best Practices

  1. Close Files: Always close files after writing to make them accessible
  2. Use Properties: Tag data with meaningful properties for easier discovery
  3. Organize Folders: Use logical folder structures
  4. Clean Up: Remove temporary or obsolete data
  5. Batch Operations: Group operations when processing many objects
  6. Error Handling: Check object states before operations
  7. Permissions: Verify project permissions before data operations
  8. Archive Old Data: Use archival for long-term storage cost savings