Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

View File

@@ -0,0 +1,400 @@
# DNAnexus Data Operations
## Overview
DNAnexus provides comprehensive data management capabilities for files, records, databases, and other data objects. All data operations can be performed via the Python SDK (dxpy) or command-line interface (dx).
## Data Object Types
### Files
Binary or text data stored on the platform.
### Records
Structured data objects with arbitrary JSON details and metadata.
### Databases
Structured database objects for relational data.
### Applets and Apps
Executable programs (covered in app-development.md).
### Workflows
Multi-step analysis pipelines.
## Data Object Lifecycle
### States
**Open State**: Data can be modified
- Files: Contents can be written
- Records: Details can be updated
- Applets: Created in closed state by default
**Closed State**: Data becomes immutable
- File contents are fixed
- Metadata fields are locked (types, details, links, visibility)
- Objects are ready for sharing and analysis
### Transitions
```
Create (open) → Modify → Close (immutable)
```
Most objects start open and require explicit closure:
```python
# Close a file
file_obj.close()
```
Some objects can be created and closed in one operation:
```python
# Create closed record
record = dxpy.new_dxrecord(details={...}, close=True)
```
## File Operations
### Uploading Files
**From local file**:
```python
import dxpy
# Upload a file
file_obj = dxpy.upload_local_file("data.txt", project="project-xxxx")
print(f"Uploaded: {file_obj.get_id()}")
```
**With metadata**:
```python
file_obj = dxpy.upload_local_file(
"data.txt",
name="my_data",
project="project-xxxx",
folder="/results",
properties={"sample": "sample1", "type": "raw"},
tags=["experiment1", "batch2"]
)
```
**Streaming upload**:
```python
# For large files or generated data
file_obj = dxpy.new_dxfile(project="project-xxxx", name="output.txt")
file_obj.write("Line 1\n")
file_obj.write("Line 2\n")
file_obj.close()
```
### Downloading Files
**To local file**:
```python
# Download by ID
dxpy.download_dxfile("file-xxxx", "local_output.txt")
# Download using handler
file_obj = dxpy.DXFile("file-xxxx")
dxpy.download_dxfile(file_obj.get_id(), "local_output.txt")
```
**Read file contents**:
```python
file_obj = dxpy.DXFile("file-xxxx")
with file_obj.open_file() as f:
contents = f.read()
```
**Download to specific directory**:
```python
dxpy.download_dxfile("file-xxxx", "/path/to/directory/filename.txt")
```
### File Metadata
**Get file information**:
```python
file_obj = dxpy.DXFile("file-xxxx")
describe = file_obj.describe()
print(f"Name: {describe['name']}")
print(f"Size: {describe['size']} bytes")
print(f"State: {describe['state']}")
print(f"Created: {describe['created']}")
```
**Update file metadata**:
```python
file_obj.set_properties({"experiment": "exp1", "version": "v2"})
file_obj.add_tags(["validated", "published"])
file_obj.rename("new_name.txt")
```
## Record Operations
Records store structured metadata with arbitrary JSON.
### Creating Records
```python
# Create a record
record = dxpy.new_dxrecord(
name="sample_metadata",
types=["SampleMetadata"],
details={
"sample_id": "S001",
"tissue": "blood",
"age": 45,
"conditions": ["diabetes"]
},
project="project-xxxx",
close=True
)
```
### Reading Records
```python
record = dxpy.DXRecord("record-xxxx")
describe = record.describe()
# Access details
details = record.get_details()
sample_id = details["sample_id"]
tissue = details["tissue"]
```
### Updating Records
```python
# Record must be open to update
record = dxpy.DXRecord("record-xxxx")
details = record.get_details()
details["processed"] = True
record.set_details(details)
record.close()
```
## Search and Discovery
### Finding Data Objects
**Search by name**:
```python
results = dxpy.find_data_objects(
name="*.fastq",
project="project-xxxx",
folder="/raw_data"
)
for result in results:
print(f"{result['describe']['name']}: {result['id']}")
```
**Search by properties**:
```python
results = dxpy.find_data_objects(
classname="file",
properties={"sample": "sample1", "type": "processed"},
project="project-xxxx"
)
```
**Search by type**:
```python
# Find all records of specific type
results = dxpy.find_data_objects(
classname="record",
typename="SampleMetadata",
project="project-xxxx"
)
```
**Search with state filter**:
```python
# Find only closed files
results = dxpy.find_data_objects(
classname="file",
state="closed",
project="project-xxxx"
)
```
### System-wide Search
```python
# Search across all accessible projects
results = dxpy.find_data_objects(
name="important_data.txt",
describe=True # Include full descriptions
)
```
## Cloning and Copying
### Clone Data Between Projects
```python
# Clone file to another project
new_file = dxpy.DXFile("file-xxxx").clone(
project="project-yyyy",
folder="/imported_data"
)
```
### Clone Multiple Objects
```python
# Clone folder contents
files = dxpy.find_data_objects(
classname="file",
project="project-xxxx",
folder="/results"
)
for file in files:
file_obj = dxpy.DXFile(file['id'])
file_obj.clone(project="project-yyyy", folder="/backup")
```
## Project Management
### Creating Projects
```python
# Create a new project
project = dxpy.api.project_new({
"name": "My Analysis Project",
"description": "RNA-seq analysis for experiment X"
})
project_id = project['id']
```
### Project Permissions
```python
# Invite user to project
dxpy.api.project_invite(
project_id,
{
"invitee": "user-xxxx",
"level": "CONTRIBUTE" # VIEW, UPLOAD, CONTRIBUTE, ADMINISTER
}
)
```
### List Projects
```python
# List accessible projects
projects = dxpy.find_projects(describe=True)
for proj in projects:
desc = proj['describe']
print(f"{desc['name']}: {proj['id']}")
```
## Folder Operations
### Creating Folders
```python
# Create nested folders
dxpy.api.project_new_folder(
"project-xxxx",
{"folder": "/analysis/batch1/results", "parents": True}
)
```
### Moving Objects
```python
# Move file to different folder
file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
file_obj.move("/new_location")
```
### Removing Objects
```python
# Remove file from project (not permanent deletion)
dxpy.api.project_remove_objects(
"project-xxxx",
{"objects": ["file-xxxx"]}
)
# Permanent deletion
file_obj = dxpy.DXFile("file-xxxx")
file_obj.remove()
```
## Archival
### Archive Data
Archived data is moved to cheaper long-term storage:
```python
# Archive a file
dxpy.api.project_archive(
"project-xxxx",
{"files": ["file-xxxx"]}
)
```
### Unarchive Data
```python
# Unarchive when needed
dxpy.api.project_unarchive(
"project-xxxx",
{"files": ["file-xxxx"]}
)
```
## Batch Operations
### Upload Multiple Files
```python
import os
# Upload all files in directory
for filename in os.listdir("./data"):
filepath = os.path.join("./data", filename)
if os.path.isfile(filepath):
dxpy.upload_local_file(
filepath,
project="project-xxxx",
folder="/batch_upload"
)
```
### Download Multiple Files
```python
# Download all files from folder
files = dxpy.find_data_objects(
classname="file",
project="project-xxxx",
folder="/results"
)
for file in files:
file_obj = dxpy.DXFile(file['id'])
filename = file_obj.describe()['name']
dxpy.download_dxfile(file['id'], f"./downloads/{filename}")
```
## Best Practices
1. **Close Files**: Always close files after writing to make them accessible
2. **Use Properties**: Tag data with meaningful properties for easier discovery
3. **Organize Folders**: Use logical folder structures
4. **Clean Up**: Remove temporary or obsolete data
5. **Batch Operations**: Group operations when processing many objects
6. **Error Handling**: Check object states before operations
7. **Permissions**: Verify project permissions before data operations
8. **Archive Old Data**: Use archival for long-term storage cost savings