# DNAnexus Data Operations

## Overview

DNAnexus provides comprehensive data management capabilities for files, records, databases, and other data objects. All data operations can be performed via the Python SDK (dxpy) or command-line interface (dx).

## Data Object Types

### Files
Binary or text data stored on the platform.

### Records
Structured data objects with arbitrary JSON details and metadata.

### Databases
Structured database objects for relational data.

### Applets and Apps
Executable programs (covered in app-development.md).

### Workflows
Multi-step analysis pipelines.

## Data Object Lifecycle

### States

**Open State**: Data can be modified
- Files: Contents can be written
- Records: Details can be updated
- Applets: Created in closed state by default

**Closed State**: Data becomes immutable
- File contents are fixed
- Metadata fields are locked (types, details, links, visibility)
- Objects are ready for sharing and analysis

### Transitions

```
Create (open) → Modify → Close (immutable)
```

Most objects start open and require explicit closure:

```python
# Close a file
file_obj.close()
```

Some objects can be created and closed in one operation:

```python
# Create closed record
record = dxpy.new_dxrecord(details={...}, close=True)
```

## File Operations

### Uploading Files

**From local file**:
```python
import dxpy

# Upload a file
file_obj = dxpy.upload_local_file("data.txt", project="project-xxxx")
print(f"Uploaded: {file_obj.get_id()}")
```

**With metadata**:
```python
file_obj = dxpy.upload_local_file(
    "data.txt",
    name="my_data",
    project="project-xxxx",
    folder="/results",
    properties={"sample": "sample1", "type": "raw"},
    tags=["experiment1", "batch2"]
)
```

**Streaming upload**:
```python
# For large files or generated data
file_obj = dxpy.new_dxfile(project="project-xxxx", name="output.txt")
file_obj.write("Line 1\n")
file_obj.write("Line 2\n")
file_obj.close()
```

### Downloading Files

**To local file**:
```python
# Download by ID
dxpy.download_dxfile("file-xxxx", "local_output.txt")

# Download using handler
file_obj = dxpy.DXFile("file-xxxx")
dxpy.download_dxfile(file_obj.get_id(), "local_output.txt")
```

**Read file contents**:
```python
# Open the remote file as a file-like object and read it into memory
with dxpy.open_dxfile("file-xxxx") as f:
    contents = f.read()
```

**Download to specific directory**:
```python
dxpy.download_dxfile("file-xxxx", "/path/to/directory/filename.txt")
```

### File Metadata

**Get file information**:
```python
file_obj = dxpy.DXFile("file-xxxx")
describe = file_obj.describe()

print(f"Name: {describe['name']}")
print(f"Size: {describe['size']} bytes")
print(f"State: {describe['state']}")
print(f"Created: {describe['created']}")
```

**Update file metadata**:
```python
file_obj.set_properties({"experiment": "exp1", "version": "v2"})
file_obj.add_tags(["validated", "published"])
file_obj.rename("new_name.txt")
```

## Record Operations

Records store structured metadata with arbitrary JSON.

### Creating Records

```python
# Create a record
record = dxpy.new_dxrecord(
    name="sample_metadata",
    types=["SampleMetadata"],
    details={
        "sample_id": "S001",
        "tissue": "blood",
        "age": 45,
        "conditions": ["diabetes"]
    },
    project="project-xxxx",
    close=True
)
```

### Reading Records

```python
record = dxpy.DXRecord("record-xxxx")
describe = record.describe()

# Access details
details = record.get_details()
sample_id = details["sample_id"]
tissue = details["tissue"]
```

### Updating Records

```python
# Record must be open to update
record = dxpy.DXRecord("record-xxxx")
details = record.get_details()
details["processed"] = True
record.set_details(details)
record.close()
```

## Search and Discovery

### Finding Data Objects

**Search by name**:
```python
results = dxpy.find_data_objects(
    name="*.fastq",
    name_mode="glob",  # the default mode is an exact match, so wildcards need glob mode
    project="project-xxxx",
    folder="/raw_data",
    describe=True  # needed for result['describe'] below
)

for result in results:
    print(f"{result['describe']['name']}: {result['id']}")
```

**Search by properties**:
```python
results = dxpy.find_data_objects(
    classname="file",
    properties={"sample": "sample1", "type": "processed"},
    project="project-xxxx"
)
```

**Search by type**:
```python
# Find all records of specific type
results = dxpy.find_data_objects(
    classname="record",
    typename="SampleMetadata",
    project="project-xxxx"
)
```

**Search with state filter**:
```python
# Find only closed files
results = dxpy.find_data_objects(
    classname="file",
    state="closed",
    project="project-xxxx"
)
```

### System-wide Search

```python
# Search across all accessible projects
results = dxpy.find_data_objects(
    name="important_data.txt",
    describe=True  # Include full descriptions
)
```

## Cloning and Copying

### Clone Data Between Projects

```python
# Clone file to another project
# (name the source project explicitly; otherwise the handler falls back to the current workspace)
new_file = dxpy.DXFile("file-xxxx", project="project-xxxx").clone(
    project="project-yyyy",
    folder="/imported_data"
)
```

### Clone Multiple Objects

```python
# Clone folder contents
files = dxpy.find_data_objects(
    classname="file",
    project="project-xxxx",
    folder="/results"
)

for file in files:
    # Each result carries the ID of the project it was found in
    file_obj = dxpy.DXFile(file['id'], project=file['project'])
    file_obj.clone(project="project-yyyy", folder="/backup")
```

## Project Management

### Creating Projects

```python
# Create a new project
project = dxpy.api.project_new({
    "name": "My Analysis Project",
    "description": "RNA-seq analysis for experiment X"
})

project_id = project['id']
```

### Project Permissions

```python
# Invite user to project
dxpy.api.project_invite(
    project_id,
    {
        "invitee": "user-xxxx",
        "level": "CONTRIBUTE"  # VIEW, UPLOAD, CONTRIBUTE, ADMINISTER
    }
)
```

### List Projects

```python
# List accessible projects
projects = dxpy.find_projects(describe=True)

for proj in projects:
    desc = proj['describe']
    print(f"{desc['name']}: {proj['id']}")
```

## Folder Operations

### Creating Folders

```python
# Create nested folders
dxpy.api.project_new_folder(
    "project-xxxx",
    {"folder": "/analysis/batch1/results", "parents": True}
)
```

### Moving Objects

```python
# Move file to different folder
file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
file_obj.move("/new_location")
```

### Removing Objects

```python
# Remove file from this project; the underlying data is deleted
# only once no other project still holds a clone of it
dxpy.api.project_remove_objects(
    "project-xxxx",
    {"objects": ["file-xxxx"]}
)

# Same removal through the handler (acts on the handler's project)
file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
file_obj.remove()
```

## Archival

### Archive Data

Archived data is moved to cheaper long-term storage:

```python
# Archive a file
dxpy.api.project_archive(
    "project-xxxx",
    {"files": ["file-xxxx"]}
)
```

### Unarchive Data

```python
# Unarchive when needed
dxpy.api.project_unarchive(
    "project-xxxx",
    {"files": ["file-xxxx"]}
)
```

## Batch Operations

### Upload Multiple Files

```python
import os

# Upload all files in directory
for filename in os.listdir("./data"):
    filepath = os.path.join("./data", filename)
    if os.path.isfile(filepath):
        dxpy.upload_local_file(
            filepath,
            project="project-xxxx",
            folder="/batch_upload"
        )
```

### Download Multiple Files

```python
import os

# Download all files from folder
os.makedirs("./downloads", exist_ok=True)  # the target directory must exist

files = dxpy.find_data_objects(
    classname="file",
    project="project-xxxx",
    folder="/results",
    describe=True  # return names with the results, avoiding one describe call per file
)

for file in files:
    filename = file['describe']['name']
    dxpy.download_dxfile(file['id'], f"./downloads/{filename}")
```

## Best Practices

1. **Close Files**: Always close files after writing to make them accessible
2. **Use Properties**: Tag data with meaningful properties for easier discovery
3. **Organize Folders**: Use logical folder structures
4. **Clean Up**: Remove temporary or obsolete data
5. **Batch Operations**: Group operations when processing many objects
6. **Error Handling**: Check object states before operations
7. **Permissions**: Verify project permissions before data operations
8. **Archive Old Data**: Use archival for long-term storage cost savings
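
### Checking Object State Before Operations

The state check recommended in item 6 can be sketched as a small pre-flight filter. This is a minimal illustration, not part of dxpy: `check_file_ready` and `partition_by_readiness` are hypothetical helper names, and the sketch assumes each describe dictionary carries an `id`, a `state` field (`closed` when the file is finalized), and an `archivalState` field (`live` when the file is not archived). Treating a missing `archivalState` as `live` is also an assumption.

```python
def check_file_ready(desc):
    """Return (ok, reason) for a describe-style dict of a file.

    Hypothetical helper: a file counts as ready only when it is
    closed and its archival state is live.
    """
    state = desc.get("state")
    if state != "closed":
        return False, f"state is {state!r}, expected 'closed'"
    # Assumption: a missing archivalState is treated as live
    archival_state = desc.get("archivalState", "live")
    if archival_state != "live":
        return False, f"archivalState is {archival_state!r}, expected 'live'"
    return True, "ready"


def partition_by_readiness(descriptions):
    """Split describe dicts into ready IDs and (id, reason) pairs to skip."""
    ready, skipped = [], []
    for desc in descriptions:
        ok, reason = check_file_ready(desc)
        if ok:
            ready.append(desc["id"])
        else:
            skipped.append((desc["id"], reason))
    return ready, skipped


# Hand-written placeholder dicts for illustration only (not real platform output)
sample_descriptions = [
    {"id": "file-aaaa", "state": "closed", "archivalState": "live"},
    {"id": "file-bbbb", "state": "open", "archivalState": "live"},
    {"id": "file-cccc", "state": "closed", "archivalState": "archived"},
]

ready, skipped = partition_by_readiness(sample_descriptions)
print("Ready:", ready)
for file_id, reason in skipped:
    print(f"Skipped {file_id}: {reason}")
```

In practice the dictionaries would likely come from `file_obj.describe()` or from the `describe` entry of each `dxpy.find_data_objects(..., describe=True)` result, so a batch download or clone could run over the ready IDs only and report the rest.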