# DNAnexus Data Operations

## Overview

DNAnexus provides comprehensive data management capabilities for files, records, databases, and other data objects. All data operations can be performed via the Python SDK (dxpy) or command-line interface (dx).

## Data Object Types

### Files
Binary or text data stored on the platform.

### Records
Structured data objects with arbitrary JSON details and metadata.

### Databases
Structured database objects for relational data.

### Applets and Apps
Executable programs (covered in app-development.md).

### Workflows
Multi-step analysis pipelines.

## Data Object Lifecycle

### States

**Open State**: Data can be modified
- Files: Contents can be written
- Records: Details can be updated
- Applets: The exception; they are created in the closed state by default

**Closed State**: Data becomes immutable
- File contents are fixed
- Metadata fields are locked (types, details, links, visibility); name, properties, and tags stay editable
- Objects are ready for sharing and analysis

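The open/closed rules above can be sketched as a small check against the `state` field of a `describe()` result. This is a simplified model, not SDK code: `LOCKED_WHEN_CLOSED`, `can_modify`, and `require_open` are hypothetical names, and any field outside the listed set is assumed to stay editable.

```python
# Fields the section above lists as fixed once an object is closed.
# This set is an illustrative assumption, not an exhaustive platform list.
LOCKED_WHEN_CLOSED = {"contents", "types", "details", "links", "visibility"}


def can_modify(describe, field):
    """Return True if `field` may still be changed, given describe() output."""
    if field not in LOCKED_WHEN_CLOSED:
        # Assumed editable in any state (for example name, properties, tags).
        return True
    # Anything other than "open" (such as "closing" or "closed") is read-only.
    return describe.get("state") == "open"


def require_open(describe):
    """Raise ValueError unless the described object is in the open state."""
    state = describe.get("state")
    if state != "open":
        raise ValueError(f"object is {state!r}, expected 'open'")
```

A caller might run `require_open(file_obj.describe())` before writing, so that a closed object fails early with a clear message.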
### Transitions

```
Create (open) → Modify → Close (immutable)
```

Most objects start open and require explicit closure:
```python
# Close a file
file_obj.close()
```

Some objects can be created and closed in one operation:
```python
# Create closed record
record = dxpy.new_dxrecord(details={...}, close=True)
```

## File Operations

### Uploading Files

**From local file**:
```python
import dxpy

# Upload a file
file_obj = dxpy.upload_local_file("data.txt", project="project-xxxx")
print(f"Uploaded: {file_obj.get_id()}")
```

**With metadata**:
```python
file_obj = dxpy.upload_local_file(
    "data.txt",
    name="my_data",
    project="project-xxxx",
    folder="/results",
    properties={"sample": "sample1", "type": "raw"},
    tags=["experiment1", "batch2"]
)
```

**Streaming upload**:
```python
# For large files or generated data
file_obj = dxpy.new_dxfile(project="project-xxxx", name="output.txt")
file_obj.write("Line 1\n")
file_obj.write("Line 2\n")
file_obj.close()
```

### Downloading Files

**To local file**:
```python
# Download by ID
dxpy.download_dxfile("file-xxxx", "local_output.txt")

# Download using handler
file_obj = dxpy.DXFile("file-xxxx")
dxpy.download_dxfile(file_obj.get_id(), "local_output.txt")
```

**Read file contents**:
```python
# open_dxfile returns a file-like DXFile handler
with dxpy.open_dxfile("file-xxxx") as f:
    contents = f.read()
```

**Download to specific directory**:
```python
dxpy.download_dxfile("file-xxxx", "/path/to/directory/filename.txt")
```

### File Metadata

**Get file information**:
```python
file_obj = dxpy.DXFile("file-xxxx")
describe = file_obj.describe()

print(f"Name: {describe['name']}")
print(f"Size: {describe['size']} bytes")
print(f"State: {describe['state']}")
print(f"Created: {describe['created']}")
```

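The `created` value printed above is a timestamp in milliseconds since the Unix epoch, so it is hard to read as-is. A minimal sketch of a formatter for the `describe()` dictionary follows; `summarize_describe` is a hypothetical helper, and treating `size` as optional is an assumption made here for files that are not yet closed.

```python
from datetime import datetime, timezone


def summarize_describe(desc):
    """Build a readable one-line summary from a describe() dictionary."""
    # 'created' is in milliseconds since the Unix epoch; convert to seconds.
    created = datetime.fromtimestamp(desc["created"] / 1000, tz=timezone.utc)
    # Assumption: 'size' may be absent (for example while a file is open).
    size = desc.get("size")
    size_text = f"{size} bytes" if size is not None else "size unknown"
    return (
        f"{desc['name']} ({desc['state']}, {size_text}, "
        f"created {created:%Y-%m-%d %H:%M} UTC)"
    )
```

It could be called as `print(summarize_describe(file_obj.describe()))`.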
**Update file metadata**:
```python
file_obj.set_properties({"experiment": "exp1", "version": "v2"})
file_obj.add_tags(["validated", "published"])
file_obj.rename("new_name.txt")
```

## Record Operations

Records store structured metadata with arbitrary JSON.

### Creating Records

```python
# Create a record
record = dxpy.new_dxrecord(
    name="sample_metadata",
    types=["SampleMetadata"],
    details={
        "sample_id": "S001",
        "tissue": "blood",
        "age": 45,
        "conditions": ["diabetes"]
    },
    project="project-xxxx",
    close=True
)
```

### Reading Records

```python
record = dxpy.DXRecord("record-xxxx")
describe = record.describe()

# Access details
details = record.get_details()
sample_id = details["sample_id"]
tissue = details["tissue"]
```

### Updating Records

```python
# Record must be open to update
record = dxpy.DXRecord("record-xxxx")
details = record.get_details()
details["processed"] = True
record.set_details(details)
record.close()
```

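Because details can only be set while a record is open, it may help to check the state first. The sketch below is one possible guard, not part of dxpy: `update_details` is a hypothetical helper that accepts any object exposing `describe()`, `get_details()`, `set_details()`, and `close()`, such as a `dxpy.DXRecord` handler.

```python
def update_details(record, changes, close=False):
    """Merge `changes` into the details of an open record.

    Raises ValueError if the record is not in the open state.
    """
    state = record.describe()["state"]
    if state != "open":
        raise ValueError(f"record is {state}; details can only be set while open")
    details = record.get_details()
    details.update(changes)
    record.set_details(details)
    if close:
        # Closing makes the updated details immutable.
        record.close()
    return details
```

For example, `update_details(dxpy.DXRecord("record-xxxx"), {"processed": True}, close=True)` would mirror the snippet above while failing early on a closed record.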
## Search and Discovery

### Finding Data Objects

**Search by name**:
```python
results = dxpy.find_data_objects(
    name="*.fastq",
    name_mode="glob",  # the default "exact" mode does not expand wildcards
    project="project-xxxx",
    folder="/raw_data",
    describe=True  # required for result['describe'] below
)

for result in results:
    print(f"{result['describe']['name']}: {result['id']}")
```

**Search by properties**:
```python
results = dxpy.find_data_objects(
    classname="file",
    properties={"sample": "sample1", "type": "processed"},
    project="project-xxxx"
)
```

**Search by type**:
```python
# Find all records of specific type
results = dxpy.find_data_objects(
    classname="record",
    typename="SampleMetadata",
    project="project-xxxx"
)
```

**Search with state filter**:
```python
# Find only closed files
results = dxpy.find_data_objects(
    classname="file",
    state="closed",
    project="project-xxxx"
)
```

### System-wide Search

```python
# Search across all accessible projects
results = dxpy.find_data_objects(
    name="important_data.txt",
    describe=True  # Include full descriptions
)
```

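A search without a `project` argument can return hits from many projects, and an object that has been cloned may appear once per project. As a rough sketch, the results can be grouped by the `project` key of each result; `group_by_project` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_project(results):
    """Map each project ID to the list of object IDs found in it."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result["project"]].append(result["id"])
    return dict(grouped)
```

Passing the `results` generator from the search above to `group_by_project` gives a dictionary that is easier to report on than a flat stream of hits.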
## Cloning and Copying

### Clone Data Between Projects

```python
# Clone file to another project; the handler names the source project explicitly
new_file = dxpy.DXFile("file-xxxx", project="project-xxxx").clone(
    project="project-yyyy",
    folder="/imported_data"
)
```

### Clone Multiple Objects

```python
# Clone folder contents
files = dxpy.find_data_objects(
    classname="file",
    project="project-xxxx",
    folder="/results"
)

for file in files:
    # Pass the source project so clone() knows where to copy from
    file_obj = dxpy.DXFile(file['id'], project=file['project'])
    file_obj.clone(project="project-yyyy", folder="/backup")
```

## Project Management

### Creating Projects

```python
# Create a new project
project = dxpy.api.project_new({
    "name": "My Analysis Project",
    "description": "RNA-seq analysis for experiment X"
})

project_id = project['id']
```

### Project Permissions

```python
# Invite user to project
dxpy.api.project_invite(
    project_id,
    {
        "invitee": "user-xxxx",
        "level": "CONTRIBUTE"  # VIEW, UPLOAD, CONTRIBUTE, ADMINISTER
    }
)
```

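The four access levels in the comment above run from least to most privileged, so a script can compare them by position. The following is a minimal sketch; `LEVELS` and `has_at_least` are hypothetical names, and the `level` argument is assumed to come from wherever the caller obtains it (for instance the `level` field of a project search result).

```python
# Access levels ordered from least to most privileged.
LEVELS = ["VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]


def has_at_least(level, required):
    """Return True if `level` grants at least the access of `required`."""
    # list.index raises ValueError for an unknown level name.
    return LEVELS.index(level) >= LEVELS.index(required)
```

A script might call `has_at_least(level, "CONTRIBUTE")` before attempting to write to a project.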
### List Projects

```python
# List accessible projects
projects = dxpy.find_projects(describe=True)

for proj in projects:
    desc = proj['describe']
    print(f"{desc['name']}: {proj['id']}")
```

## Folder Operations

### Creating Folders

```python
# Create nested folders
dxpy.api.project_new_folder(
    "project-xxxx",
    {"folder": "/analysis/batch1/results", "parents": True}
)
```

### Moving Objects

```python
# Move file to different folder
file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
file_obj.move("/new_location")
```

### Removing Objects

```python
# Remove file from one project; copies cloned into other projects are unaffected
dxpy.api.project_remove_objects(
    "project-xxxx",
    {"objects": ["file-xxxx"]}
)

# Same removal through a handler; the data is permanently deleted
# once no project holds a copy
file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
file_obj.remove()
```

## Archival

### Archive Data

Archived data is moved to cheaper long-term storage:

```python
# Archive a file
dxpy.api.project_archive(
    "project-xxxx",
    {"files": ["file-xxxx"]}
)
```

### Unarchive Data

```python
# Unarchive when needed
dxpy.api.project_unarchive(
    "project-xxxx",
    {"files": ["file-xxxx"]}
)
```

## Batch Operations

### Upload Multiple Files

```python
import os

import dxpy

# Upload all files in directory (top level only; subdirectories are skipped)
for filename in os.listdir("./data"):
    filepath = os.path.join("./data", filename)
    if os.path.isfile(filepath):
        dxpy.upload_local_file(
            filepath,
            project="project-xxxx",
            folder="/batch_upload"
        )
```

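The loop above uploads only the top level of a directory. A rough sketch of a recursive variant follows, which mirrors local subdirectories as platform folders. `plan_uploads` and `upload_tree` are hypothetical helpers, and `parents=True` is assumed to be forwarded by `dxpy.upload_local_file` so that missing folders are created.

```python
import os
import posixpath


def plan_uploads(local_root, remote_root):
    """Return sorted (local_path, remote_folder) pairs for files under local_root."""
    plan = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        rel = os.path.relpath(dirpath, local_root)
        # Platform folders use forward slashes regardless of the local OS.
        parts = [] if rel == "." else rel.split(os.sep)
        remote_folder = posixpath.join(remote_root, *parts)
        for filename in filenames:
            plan.append((os.path.join(dirpath, filename), remote_folder))
    return sorted(plan)


def upload_tree(local_root, project, remote_root, upload=None):
    """Upload every planned file; `upload` defaults to dxpy.upload_local_file."""
    if upload is None:
        import dxpy  # imported lazily so plan_uploads works without the SDK

        upload = dxpy.upload_local_file
    return [
        # parents=True is an assumption: it should create missing folders.
        upload(path, project=project, folder=folder, parents=True)
        for path, folder in plan_uploads(local_root, remote_root)
    ]
```

Calling `upload_tree("./data", "project-xxxx", "/batch_upload")` would place `./data/run1/a.txt` in `/batch_upload/run1`. Keeping the planning step separate makes it possible to print the plan as a dry run before anything is transferred.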
### Download Multiple Files

```python
import os

# Download all files from folder
files = dxpy.find_data_objects(
    classname="file",
    project="project-xxxx",
    folder="/results",
    describe=True  # return names with the results, avoiding one describe() call per file
)

os.makedirs("./downloads", exist_ok=True)
for file in files:
    filename = file['describe']['name']
    dxpy.download_dxfile(file['id'], f"./downloads/{filename}")
```

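The platform allows several objects with the same name in one folder, so downloading by name alone can silently overwrite an earlier file. One possible workaround, sketched below, is to prefix repeated names with the object ID. `unique_local_paths` is a hypothetical helper, and it expects results that were fetched with `describe=True`.

```python
import os


def unique_local_paths(results, dest_dir):
    """Map each file ID to a distinct local path under dest_dir."""
    seen = set()
    paths = {}
    for result in results:
        name = result["describe"]["name"]
        if name in seen:
            # Repeated name: prefix the (unique) object ID to disambiguate.
            name = f"{result['id']}_{name}"
        seen.add(name)
        paths[result["id"]] = os.path.join(dest_dir, name)
    return paths
```

Each `(file_id, path)` pair from the returned dictionary can then be passed to `dxpy.download_dxfile(file_id, path)`.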
## Best Practices

1. **Close Files**: Always close files after writing to make them accessible
2. **Use Properties**: Tag data with meaningful properties for easier discovery
3. **Organize Folders**: Use logical folder structures
4. **Clean Up**: Remove temporary or obsolete data
5. **Batch Operations**: Group operations when processing many objects
6. **Error Handling**: Check object states before operations
7. **Permissions**: Verify project permissions before data operations
8. **Archive Old Data**: Use archival for long-term storage cost savings
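For the batching advice in item 5, API routes that accept a list of IDs (such as `project_remove_objects` with its `objects` list) can be called once per group rather than once per object. A minimal sketch of the grouping step follows; `chunked` is a hypothetical helper, and a batch size of 100 in the usage line is an arbitrary illustration, not a documented platform limit.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

A cleanup script could then loop with `for batch in chunked(file_ids, 100):` and pass `{"objects": batch}` to `dxpy.api.project_remove_objects("project-xxxx", ...)`.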