DNAnexus Data Operations
Overview
DNAnexus manages files, records, databases, and other data objects. Every operation described here can be performed through the Python SDK (dxpy) or the command-line interface (dx).
Data Object Types
Files
Binary or text data stored on the platform.
Records
Structured data objects with arbitrary JSON details and metadata.
Databases
Structured database objects for relational data.
Applets and Apps
Executable programs (covered in app-development.md).
Workflows
Multi-step analysis pipelines.
Data Object Lifecycle
States
Open State: Data can be modified
- Files: Contents can be written
- Records: Details can be updated
- Applets: An exception; created in the closed state by default
Closed State: Data becomes immutable
- File contents are fixed
- Metadata fields are locked (types, details, links, visibility)
- Objects are ready for sharing and analysis
Transitions
Create (open) → Modify → Close (immutable)
Most objects start open and require explicit closure:
# Close a file
file_obj.close()
Some objects can be created and closed in one operation:
# Create closed record
record = dxpy.new_dxrecord(details={...}, close=True)
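The open/closed check can be wrapped in a small helper. This is a minimal sketch, assuming a handler that exposes describe() and close() the way dxpy handlers such as DXFile and DXRecord do; close_if_open is a hypothetical name, not part of dxpy.

```python
def close_if_open(handler):
    """Close `handler` only if its state is "open".

    Returns the state observed before the call. `handler` is assumed to
    expose describe() and close(), as dxpy data object handlers do.
    """
    state = handler.describe()["state"]
    if state == "open":
        handler.close()
    return state
```

Calling it on a handler that is already closed leaves the object untouched and simply reports the state.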
File Operations
Uploading Files
From local file:
import dxpy
# Upload a file
file_obj = dxpy.upload_local_file("data.txt", project="project-xxxx")
print(f"Uploaded: {file_obj.get_id()}")
With metadata:
file_obj = dxpy.upload_local_file(
    "data.txt",
    name="my_data",
    project="project-xxxx",
    folder="/results",
    properties={"sample": "sample1", "type": "raw"},
    tags=["experiment1", "batch2"]
)
Streaming upload:
# For large files or generated data
file_obj = dxpy.new_dxfile(project="project-xxxx", name="output.txt")
file_obj.write("Line 1\n")
file_obj.write("Line 2\n")
file_obj.close()
Downloading Files
To local file:
# Download by ID
dxpy.download_dxfile("file-xxxx", "local_output.txt")
# Download using handler
file_obj = dxpy.DXFile("file-xxxx")
dxpy.download_dxfile(file_obj.get_id(), "local_output.txt")
Read file contents:
# Open the remote file for reading and load its contents
with dxpy.open_dxfile("file-xxxx") as f:
    contents = f.read()
Download to specific directory:
dxpy.download_dxfile("file-xxxx", "/path/to/directory/filename.txt")
File Metadata
Get file information:
file_obj = dxpy.DXFile("file-xxxx")
describe = file_obj.describe()
print(f"Name: {describe['name']}")
print(f"Size: {describe['size']} bytes")
print(f"State: {describe['state']}")
print(f"Created: {describe['created']}")
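The created field is a timestamp in milliseconds since the Unix epoch, so it usually needs converting before display. A minimal sketch, assuming a describe dictionary with the name, size, state, and created keys used above; summarize_describe is a hypothetical helper.

```python
from datetime import datetime, timezone


def summarize_describe(desc):
    """Return a one-line summary of a file's describe() output.

    Assumes `desc` has 'name', 'size', 'state', and 'created' keys, with
    'created' in milliseconds since the Unix epoch.
    """
    created = datetime.fromtimestamp(desc["created"] / 1000, tz=timezone.utc)
    return (
        f"{desc['name']} ({desc['size']} bytes, {desc['state']}, "
        f"created {created:%Y-%m-%d %H:%M:%S} UTC)"
    )
```

Typical use would be print(summarize_describe(file_obj.describe())).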
Update file metadata:
file_obj.set_properties({"experiment": "exp1", "version": "v2"})
file_obj.add_tags(["validated", "published"])
file_obj.rename("new_name.txt")
Record Operations
Records store structured metadata with arbitrary JSON.
Creating Records
# Create a record
record = dxpy.new_dxrecord(
    name="sample_metadata",
    types=["SampleMetadata"],
    details={
        "sample_id": "S001",
        "tissue": "blood",
        "age": 45,
        "conditions": ["diabetes"]
    },
    project="project-xxxx",
    close=True
)
Reading Records
record = dxpy.DXRecord("record-xxxx")
describe = record.describe()
# Access details
details = record.get_details()
sample_id = details["sample_id"]
tissue = details["tissue"]
Updating Records
# Record must be open to update
record = dxpy.DXRecord("record-xxxx")
details = record.get_details()
details["processed"] = True
record.set_details(details)
record.close()
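Since details are locked once a record is closed, an update helper can check the state first and fail with a clear message. A minimal sketch, assuming the record's details are a JSON object (a dict) and that the handler exposes describe(), get_details(), and set_details() as DXRecord does; merge_details is a hypothetical name.

```python
def merge_details(record, updates):
    """Merge `updates` into a record's details and return the new details.

    Raises ValueError when the record is not open, because details
    cannot be changed after closing.
    """
    state = record.describe()["state"]
    if state != "open":
        raise ValueError(
            f"record is {state}; details can only be changed while it is open"
        )
    details = record.get_details()
    details.update(updates)
    record.set_details(details)
    return details
```

The helper leaves closing to the caller, so several updates can be applied before record.close().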
Search and Discovery
Finding Data Objects
Search by name:
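results = dxpy.find_data_objects(
    name="*.fastq",
    name_mode="glob",  # match the wildcard; the default mode is exact
    project="project-xxxx",
    folder="/raw_data",
    describe=True  # needed for result['describe'] below
)
for result in results:
    print(f"{result['describe']['name']}: {result['id']}")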
Search by properties:
results = dxpy.find_data_objects(
    classname="file",
    properties={"sample": "sample1", "type": "processed"},
    project="project-xxxx"
)
Search by type:
# Find all records of specific type
results = dxpy.find_data_objects(
    classname="record",
    typename="SampleMetadata",
    project="project-xxxx"
)
Search with state filter:
# Find only closed files
results = dxpy.find_data_objects(
    classname="file",
    state="closed",
    project="project-xxxx"
)
System-wide Search
# Search across all accessible projects
results = dxpy.find_data_objects(
    name="important_data.txt",
    describe=True  # Include full descriptions
)
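The filters above can be combined in one call. The sketch below builds the keyword arguments for dxpy.find_data_objects from optional filters and switches name_mode to "glob" when the name contains a wildcard; build_find_kwargs is a hypothetical helper, and treating * and ? as the wildcard characters is an assumption.

```python
def build_find_kwargs(project=None, folder=None, classname=None, name=None,
                      properties=None, typename=None, state=None,
                      describe=False):
    """Build keyword arguments for dxpy.find_data_objects.

    Only filters that were actually supplied are included. When `name`
    contains '*' or '?', name_mode is set to "glob" (assumed wildcard
    characters); otherwise it is "exact".
    """
    kwargs = {"describe": describe}
    if name is not None:
        kwargs["name"] = name
        kwargs["name_mode"] = "glob" if any(c in name for c in "*?") else "exact"
    optional = {
        "project": project,
        "folder": folder,
        "classname": classname,
        "properties": properties,
        "typename": typename,
        "state": state,
    }
    for key, value in optional.items():
        if value is not None:
            kwargs[key] = value
    return kwargs
```

A call would then look like dxpy.find_data_objects(**build_find_kwargs(project="project-xxxx", classname="file", name="*.fastq", state="closed")).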
Cloning and Copying
Clone Data Between Projects
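# Clone file to another project; name the source project explicitly
# rather than relying on the current workspace default
new_file = dxpy.DXFile("file-xxxx", project="project-xxxx").clone(
    project="project-yyyy",
    folder="/imported_data"
)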
Clone Multiple Objects
# Clone folder contents
files = dxpy.find_data_objects(
    classname="file",
    project="project-xxxx",
    folder="/results"
)
for file in files:
    # Bind each handler to its source project before cloning
    file_obj = dxpy.DXFile(file['id'], project=file['project'])
    file_obj.clone(project="project-yyyy", folder="/backup")
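Cloning one object per call can be slow for large folders. A possible alternative is the project-level clone route, which accepts a list of object IDs. The sketch below only builds the request payloads in batches; clone_payloads is a hypothetical helper, the batch size of 100 is an arbitrary assumption, and the "objects", "project", and "destination" keys should be checked against the platform's API documentation.

```python
def clone_payloads(object_ids, dest_project, dest_folder, batch_size=100):
    """Split object IDs into batched payloads for a project-level clone call.

    Each payload names the destination project and folder. The key names
    are assumed to match the project clone API route.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = list(object_ids)
    return [
        {
            "objects": ids[start:start + batch_size],
            "project": dest_project,
            "destination": dest_folder,
        }
        for start in range(0, len(ids), batch_size)
    ]
```

Each payload would then be sent with dxpy.api.project_clone("project-xxxx", payload), where the first argument is the source project.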
Project Management
Creating Projects
# Create a new project
project = dxpy.api.project_new({
    "name": "My Analysis Project",
    "description": "RNA-seq analysis for experiment X"
})
project_id = project['id']
Project Permissions
# Invite user to project
dxpy.api.project_invite(
    project_id,
    {
        "invitee": "user-xxxx",
        "level": "CONTRIBUTE"  # VIEW, UPLOAD, CONTRIBUTE, ADMINISTER
    }
)
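The four levels are ordered from least to most privileged, which makes it easy to validate a level before sending an invite. A minimal sketch; PERMISSION_LEVELS, level_allows, and invite_payload are hypothetical names introduced for illustration.

```python
# Levels listed from least to most privileged
PERMISSION_LEVELS = ("VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER")


def level_allows(granted, required):
    """Return True when `granted` is at least as privileged as `required`."""
    return PERMISSION_LEVELS.index(granted) >= PERMISSION_LEVELS.index(required)


def invite_payload(invitee, level):
    """Build the input for a project invite call, rejecting unknown levels."""
    if level not in PERMISSION_LEVELS:
        raise ValueError(f"unknown permission level: {level!r}")
    return {"invitee": invitee, "level": level}
```

The payload can be passed as the second argument of dxpy.api.project_invite.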
List Projects
# List accessible projects
projects = dxpy.find_projects(describe=True)
for proj in projects:
    desc = proj['describe']
    print(f"{desc['name']}: {proj['id']}")
Folder Operations
Creating Folders
# Create nested folders
dxpy.api.project_new_folder(
    "project-xxxx",
    {"folder": "/analysis/batch1/results", "parents": True}
)
Moving Objects
# Move file to different folder
file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
file_obj.move("/new_location")
Removing Objects
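# Remove the file from this project through the API route.
# Copies cloned into other projects are not affected.
dxpy.api.project_remove_objects(
    "project-xxxx",
    {"objects": ["file-xxxx"]}
)
# Same removal through a handler; name the project explicitly
# rather than relying on the current workspace default
file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
file_obj.remove()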
Archival
Archive Data
Archived data is moved to cheaper long-term storage:
# Archive a file
dxpy.api.project_archive(
    "project-xxxx",
    {"files": ["file-xxxx"]}
)
Unarchive Data
# Unarchive when needed
dxpy.api.project_unarchive(
    "project-xxxx",
    {"files": ["file-xxxx"]}
)
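Before requesting an unarchive, it can be useful to see which files are actually archived. The sketch below groups search results by archival state; it assumes each result was fetched with describe=True and that the describe output carries an archivalState field with values such as "live" and "archived", which should be verified against the platform documentation. group_by_archival_state is a hypothetical helper.

```python
def group_by_archival_state(results):
    """Group object IDs by describe['archivalState'].

    `results` is assumed to be an iterable of dicts with 'id' and
    'describe' keys. Objects without the field land under "unknown".
    """
    groups = {}
    for result in results:
        state = result["describe"].get("archivalState", "unknown")
        groups.setdefault(state, []).append(result["id"])
    return groups
```

The IDs under groups.get("archived", []) could then be passed as the "files" list of an unarchive request.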
Batch Operations
Upload Multiple Files
import os
# Upload all files in directory
for filename in os.listdir("./data"):
    filepath = os.path.join("./data", filename)
    if os.path.isfile(filepath):
        dxpy.upload_local_file(
            filepath,
            project="project-xxxx",
            folder="/batch_upload"
        )
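The same loop can be split into a listing step and an upload step, which makes it easier to test and to collect the returned handlers. A minimal sketch in which the upload function is passed in as an argument; list_regular_files and upload_all are hypothetical helpers.

```python
import os


def list_regular_files(directory):
    """Return sorted paths of regular files directly inside `directory`."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip subdirectories
            paths.append(path)
    return paths


def upload_all(directory, upload, **upload_kwargs):
    """Call `upload(path, **upload_kwargs)` for each file; return the results.

    `upload` is any callable with that shape, for example
    dxpy.upload_local_file.
    """
    return [upload(path, **upload_kwargs) for path in list_regular_files(directory)]
```

For example, upload_all("./data", dxpy.upload_local_file, project="project-xxxx", folder="/batch_upload") would return one handler per uploaded file.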
Download Multiple Files
import os

# Download all files from folder
os.makedirs("./downloads", exist_ok=True)  # make sure the target directory exists
files = dxpy.find_data_objects(
    classname="file",
    project="project-xxxx",
    folder="/results",
    describe=True  # include names so no extra describe() call is needed
)
for file in files:
    filename = file['describe']['name']
    dxpy.download_dxfile(file['id'], f"./downloads/{filename}")
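A platform folder can hold several objects with the same name, so downloading by name alone may overwrite local files. The sketch below plans local paths and prefixes duplicated names with the object ID; plan_downloads is a hypothetical helper, and it assumes results fetched with describe=True.

```python
import os
from collections import Counter


def plan_downloads(results, dest_dir):
    """Return (object_id, local_path) pairs with collision-free file names.

    Names that occur more than once are prefixed with the object ID so
    that no download overwrites another.
    """
    results = list(results)
    name_counts = Counter(r["describe"]["name"] for r in results)
    plan = []
    for r in results:
        name = r["describe"]["name"]
        if name_counts[name] > 1:
            name = f"{r['id']}_{name}"
        plan.append((r["id"], os.path.join(dest_dir, name)))
    return plan
```

Each pair can then be passed to dxpy.download_dxfile(object_id, local_path).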
Best Practices
- Close Files: Always close files after writing to make them accessible
- Use Properties: Tag data with meaningful properties for easier discovery
- Organize Folders: Use logical folder structures
- Clean Up: Remove temporary or obsolete data
- Batch Operations: Group operations when processing many objects
- Error Handling: Check object states before operations
- Permissions: Verify project permissions before data operations
- Archive Old Data: Use archival for long-term storage cost savings