# LaminDB Setup & Deployment
This document covers installation, configuration, instance management, storage options, and deployment strategies for LaminDB.
## Installation
### Basic Installation
```bash
# Install LaminDB
pip install lamindb
# Or with pip3
pip3 install lamindb
```
### Installation with Extras
Install optional dependencies for specific functionality:
```bash
# Google Cloud Platform support
pip install 'lamindb[gcp]'
# Flow cytometry formats
pip install 'lamindb[fcs]'
# Array storage and streaming (Zarr support)
pip install 'lamindb[zarr]'
# AWS S3 support (usually included by default)
pip install 'lamindb[aws]'
# Multiple extras
pip install 'lamindb[gcp,zarr,fcs]'
```
### Module Plugins
```bash
# Biological ontologies (Bionty)
pip install bionty
# Wet lab functionality
pip install lamindb-wetlab
# Clinical data (OMOP CDM)
pip install lamindb-clinical
```
### Verify Installation
```python
import lamindb as ln
print(ln.__version__)
# Check available modules
import bionty as bt
print(bt.__version__)
```
## Authentication
### Creating an Account
1. Visit https://lamin.ai
2. Sign up for a free account
3. Navigate to account settings to generate an API key
### Logging In
```bash
# Login with API key
lamin login
# You'll be prompted to enter your API key
# API key is stored locally at ~/.lamin/
```
### Authentication Details
**Data Privacy:** LaminDB authentication only collects basic metadata (email, user information). Your actual data remains private and is not sent to LaminDB servers.
**Local vs Cloud:** Authentication is required even for local-only usage to enable collaboration features and instance management.
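After logging in, you can confirm which account the client is using; this reads the same locally stored settings object that appears under Settings Management below:
```python
import lamindb as ln

# Inspect the locally stored user settings written by `lamin login`
user = ln.setup.settings.user
print(user.handle, user.email)
```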
## Instance Initialization
### Local SQLite Instance
For local development and small datasets:
```bash
# Initialize in current directory
lamin init --storage ./mydata
# Initialize in specific directory
lamin init --storage /path/to/data
# Initialize with specific modules
lamin init --storage ./mydata --modules bionty
# Initialize with multiple modules
lamin init --storage ./mydata --modules bionty,wetlab
```
### Cloud Storage with SQLite
Use cloud storage but local SQLite database:
```bash
# AWS S3
lamin init --storage s3://my-bucket/path
# Google Cloud Storage
lamin init --storage gs://my-bucket/path
# S3-compatible (MinIO, Cloudflare R2)
lamin init --storage 's3://bucket?endpoint_url=http://endpoint:9000'
```
### Cloud Storage with PostgreSQL
For production deployments:
```bash
# S3 + PostgreSQL
lamin init --storage s3://my-bucket/path \
--db postgresql://user:password@hostname:5432/dbname \
--modules bionty
# GCS + PostgreSQL
lamin init --storage gs://my-bucket/path \
--db postgresql://user:password@hostname:5432/dbname \
--modules bionty
```
### Instance Naming
```bash
# Specify instance name
lamin init --storage ./mydata --name my-project
# Default name uses directory name
lamin init --storage ./mydata # Instance name: "mydata"
```
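Initialization can also be done from Python, which is handy in provisioning scripts. A minimal sketch, assuming the `ln.setup.init()` entry point mirrors the CLI flags:
```python
import lamindb as ln

# Programmatic equivalent of `lamin init --storage ./mydata --modules bionty --name my-project`
# (keyword names mirror the CLI flags; treat them as assumptions)
ln.setup.init(storage="./mydata", modules="bionty", name="my-project")
```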
## Connecting to Instances
### Connect to Your Own Instance
```bash
# By name
lamin connect my-project
# By full path
lamin connect account_handle/my-project
```
### Connect to Shared Instance
```bash
# Connect to someone else's instance
lamin connect other-user/their-project
# Requires appropriate permissions
```
### Switching Between Instances
```bash
# List available instances
lamin ls
# Switch instance
lamin connect another-instance
# Close current instance
lamin close
```
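Connecting works programmatically as well, which is convenient in scripts and notebooks; a sketch assuming `ln.connect()` accepts the same identifier as the CLI:
```python
import lamindb as ln

# Programmatic equivalent of `lamin connect account_handle/my-project`
ln.connect("account_handle/my-project")
```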
## Storage Configuration
### Local Storage
**Advantages:**
- Fast access
- No internet required
- Simple setup
**Setup:**
```bash
lamin init --storage ./data
```
### AWS S3 Storage
**Advantages:**
- Scalable
- Collaborative
- Durable
**Setup:**
```bash
# Set credentials
export AWS_ACCESS_KEY_ID=your_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
# Initialize
lamin init --storage s3://my-bucket/project-data \
--db postgresql://user:pwd@host:5432/db
```
**S3 Permissions Required:**
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket/*",
        "arn:aws:s3:::my-bucket"
      ]
    }
  ]
}
```
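Before initializing against the bucket, it can help to verify that your credentials actually grant access; a quick sketch, assuming `boto3` is installed:
```python
import boto3

# Raises a ClientError if credentials or bucket permissions are wrong
s3 = boto3.client("s3")
s3.head_bucket(Bucket="my-bucket")
print("✓ Bucket reachable with current credentials")
```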
### Google Cloud Storage
**Setup:**
```bash
# Authenticate
gcloud auth application-default login
# Or use service account
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
# Initialize
lamin init --storage gs://my-bucket/project-data \
--db postgresql://user:pwd@host:5432/db
```
### S3-Compatible Storage
For MinIO, Cloudflare R2, or other S3-compatible services:
```bash
# MinIO example
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
lamin init --storage 's3://my-bucket?endpoint_url=http://minio.example.com:9000'
# Cloudflare R2 example
export AWS_ACCESS_KEY_ID=your_r2_access_key
export AWS_SECRET_ACCESS_KEY=your_r2_secret_key
lamin init --storage 's3://bucket?endpoint_url=https://account-id.r2.cloudflarestorage.com'
```
## Database Configuration
### SQLite (Default)
**Advantages:**
- No separate database server
- Simple setup
- Good for development
**Limitations:**
- Not suitable for concurrent writes
- Limited scalability
**Setup:**
```bash
# SQLite is default
lamin init --storage ./data
# Database stored at ./data/.lamindb/
```
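To confirm which database a connected instance uses, the instance settings expose the connection string; the `.db` attribute here is an assumption based on the instance settings object shown under Settings Management below:
```python
import lamindb as ln

# For a SQLite instance, this points into ./data/.lamindb/
print(ln.setup.settings.instance.db)
```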
### PostgreSQL
**Advantages:**
- Production-ready
- Concurrent access
- Better performance at scale
**Setup:**
```bash
# Full connection string
lamin init --storage s3://bucket/path \
--db postgresql://username:password@hostname:5432/database
# With SSL
lamin init --storage s3://bucket/path \
--db "postgresql://user:pwd@host:5432/db?sslmode=require"
```
**PostgreSQL Versions:** Compatible with PostgreSQL 12+
### Database Schema Management
```bash
# Check current schema version
lamin migrate check
# Upgrade schema
lamin migrate deploy
# View migration history
lamin migrate history
```
## Cache Configuration
### Cache Directory
LaminDB maintains a local cache for cloud files:
```python
import lamindb as ln
# View cache location
print(ln.settings.cache_dir)
```
### Configure Cache Location
```bash
# Set cache directory
lamin cache set /path/to/cache
# View current cache settings
lamin cache get
```
### System-Wide Cache (Multi-User)
For shared systems with multiple users:
```bash
# Create system settings file
sudo mkdir -p /system/settings
sudo nano /system/settings/system.env
```
Add to `system.env`:
```bash
lamindb_cache_path=/shared/cache/lamindb
```
Create the cache directory and set permissions:
```bash
sudo mkdir -p /shared/cache/lamindb
sudo chown -R shared-user:shared-group /shared/cache/lamindb
sudo chmod 755 /shared/cache/lamindb
```
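Once the system-wide path is in place, each user can verify that it took effect:
```python
import lamindb as ln

# Should now print /shared/cache/lamindb
print(ln.settings.cache_dir)
```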
### Cache Management
```python
import shutil

import lamindb as ln

# Clear the cache for a specific artifact
artifact = ln.Artifact.get(key="data.h5ad")
artifact.delete_cache()

# Check whether the artifact is currently cached
if artifact.is_cached():
    print("Already cached")

# Manually clear the entire cache
shutil.rmtree(ln.settings.cache_dir)
```
## Settings Management
### View Current Settings
```python
import lamindb as ln
# User settings
print(ln.setup.settings.user)
# User(handle='username', email='user@email.com', name='Full Name')
# Instance settings
print(ln.setup.settings.instance)
# Instance(name='my-project', storage='s3://bucket/path')
```
### Configure Settings
```bash
# Set development directory for relative keys
lamin settings set dev-dir /path/to/project
# Configure git sync
lamin settings set sync-git-repo https://github.com/user/repo.git
# View all settings
lamin settings
```
### Environment Variables
```bash
# Cache directory
export LAMIN_CACHE_DIR=/path/to/cache
# Settings directory
export LAMIN_SETTINGS_DIR=/path/to/settings
# Git sync
export LAMINDB_SYNC_GIT_REPO=https://github.com/user/repo.git
```
## Instance Management
### Viewing Instance Information
```bash
# Current instance info
lamin info
# List all instances
lamin ls
# View instance details
lamin instance details
```
### Instance Collaboration
```bash
# Set instance visibility (requires LaminHub)
lamin instance set-visibility public
lamin instance set-visibility private
# Invite collaborators (requires LaminHub)
lamin instance invite user@email.com
```
### Instance Migration
```bash
# Backup instance
lamin backup create
# Restore from backup
lamin backup restore backup_id
# Export instance metadata
lamin export instance-metadata.json
```
### Deleting Instances
```bash
# Delete instance (preserves data, removes metadata)
lamin delete --force instance-name
# This only removes the LaminDB metadata
# Actual data in storage location remains
```
## Production Deployment Patterns
### Pattern 1: Local Development → Cloud Production
**Development:**
```bash
# Local development
lamin init --storage ./dev-data --modules bionty
```
**Production:**
```bash
# Cloud production
lamin init --storage s3://prod-bucket/data \
--db postgresql://user:pwd@db-host:5432/prod-db \
--modules bionty \
--name production
```
**Migration:** Export artifacts from dev, then re-register them in prod:
```python
import shutil
from pathlib import Path

import lamindb as ln

# Export from dev: download each artifact into a local staging folder
# (uses artifact.cache() to materialize a local copy)
export_dir = Path("/tmp/export")
export_dir.mkdir(parents=True, exist_ok=True)
for artifact in ln.Artifact.filter():
    local_path = artifact.cache()
    shutil.copy(local_path, export_dir / Path(local_path).name)

# Switch to prod in the shell (`lamin connect production`),
# then import the staged files in a fresh session:
for file in export_dir.glob("*"):
    ln.Artifact(file, key=file.name).save()
```
### Pattern 2: Multi-Region Deployment
Deploy instances in multiple regions for data sovereignty:
```bash
# US instance
lamin init --storage s3://us-bucket/data \
--db postgresql://user:pwd@us-db:5432/db \
--name us-production
# EU instance
lamin init --storage s3://eu-bucket/data \
--db postgresql://user:pwd@eu-db:5432/db \
--name eu-production
```
### Pattern 3: Shared Storage, Personal Instances
Multiple users, shared data:
```bash
# Shared storage with user-specific DB
lamin init --storage s3://shared-bucket/data \
--db postgresql://user1:pwd@host:5432/user1_db \
--name user1-workspace
lamin init --storage s3://shared-bucket/data \
--db postgresql://user2:pwd@host:5432/user2_db \
--name user2-workspace
```
## Performance Optimization
### Database Performance
- Use connection pooling for PostgreSQL; configure it on the database server (for example with PgBouncer).
- Rely on indexes for query speed; LaminDB creates indexes automatically for common queries.
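On the client side, the main lever is pushing work into the database rather than materializing records in Python, for example:
```python
import lamindb as ln

# COUNT(*) runs in the database; no rows are loaded into Python
n_h5ad = ln.Artifact.filter(suffix=".h5ad").count()

# ORDER BY and LIMIT also execute in the database
recent = ln.Artifact.filter().order_by("-created_at")[:10]
```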
### Storage Performance
```bash
# Use appropriate storage classes
# S3: STANDARD for frequent access, INTELLIGENT_TIERING for mixed access
# Tune AWS CLI multipart transfer settings for large files
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.max_concurrent_requests 20
```
### Cache Optimization
```python
import lamindb as ln

# Pre-cache frequently used artifacts
artifacts = ln.Artifact.filter(key__startswith="reference/")
for artifact in artifacts:
    artifact.cache()  # download to the local cache

# Use backed mode for large arrays: stream from storage instead of loading into memory
adata = artifact.backed()
```
## Security Best Practices
1. **Credentials Management:**
- Use environment variables, not hardcoded credentials (see the sketch after this list)
- Use IAM roles on AWS/GCP instead of access keys
- Rotate credentials regularly
2. **Access Control:**
- Use PostgreSQL for multi-user access control
- Configure storage bucket policies
- Enable audit logging
3. **Network Security:**
- Use SSL/TLS for database connections
- Use VPCs for cloud deployments
- Restrict IP addresses when possible
4. **Data Protection:**
- Enable encryption at rest (S3, GCS)
- Use encryption in transit (HTTPS, SSL)
- Implement backup strategies
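A minimal sketch of the credentials point above: assemble the database connection string from environment variables instead of hardcoding it (the variable names are examples, not a LaminDB convention):
```python
import os

# Example variable names; supply them via your secrets manager
db_url = (
    f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:5432/{os.environ['DB_NAME']}?sslmode=require"
)
# pass db_url to `lamin init --db ...` from your provisioning tooling
```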
## Monitoring and Maintenance
### Health Checks
```python
from pathlib import Path

import lamindb as ln

# Check database connection
try:
    ln.Artifact.filter().count()
    print("✓ Database connected")
except Exception as e:
    print(f"✗ Database error: {e}")

# Check storage access by round-tripping a small file
try:
    Path("test.txt").write_text("healthcheck")
    test_artifact = ln.Artifact("test.txt", key="healthcheck.txt").save()
    test_artifact.delete(permanent=True)
    print("✓ Storage accessible")
except Exception as e:
    print(f"✗ Storage error: {e}")
```
### Logging
```python
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)
# LaminDB operations will produce detailed logs
```
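LaminDB also exposes its own verbosity setting, which is often more readable than raw debug logs; a sketch assuming the `ln.settings.verbosity` field:
```python
import lamindb as ln

# Raise LaminDB's own log level ("hint" is chattier than the default;
# the exact level names are an assumption)
ln.settings.verbosity = "hint"
```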
### Backup Strategy
```bash
# Regular database backups (PostgreSQL)
pg_dump -h hostname -U username -d database > backup_$(date +%Y%m%d).sql
# Storage backups (S3 versioning)
aws s3api put-bucket-versioning \
--bucket my-bucket \
--versioning-configuration Status=Enabled
# Metadata export
lamin export metadata_backup.json
```
## Troubleshooting
### Common Issues
**Issue: Cannot connect to instance**
```bash
# Check instance exists
lamin ls
# Verify authentication
lamin login
# Re-connect
lamin connect instance-name
```
**Issue: Storage permissions denied**
```bash
# Check AWS credentials
aws s3 ls s3://your-bucket/
# Check GCS credentials
gsutil ls gs://your-bucket/
# Verify IAM permissions
```
**Issue: Database connection error**
```bash
# Test PostgreSQL connection
psql postgresql://user:pwd@host:5432/db
# Check database version compatibility
lamin migrate check
```
**Issue: Cache full**
```python
# Clear the entire cache
import lamindb as ln
import shutil

shutil.rmtree(ln.settings.cache_dir)
```
Then point the cache at a larger disk:
```bash
lamin cache set /larger/disk/cache
```
## Upgrade and Migration
### Upgrading LaminDB
```bash
# Upgrade to latest version
pip install --upgrade lamindb
# Upgrade database schema
lamin migrate deploy
```
### Schema Compatibility
Check the compatibility matrix to ensure your database schema version is compatible with your installed LaminDB version.
### Breaking Changes
Major version upgrades may require migration:
```bash
# Check for breaking changes
lamin migrate check
# Review migration plan
lamin migrate plan
# Execute migration
lamin migrate deploy
```
## Best Practices
1. **Start local, scale cloud**: Develop locally, deploy to cloud for production
2. **Use PostgreSQL for production**: SQLite is only for development
3. **Configure appropriate cache**: Size cache based on working set
4. **Enable versioning**: Use S3/GCS versioning for data protection
5. **Monitor costs**: Track storage and compute costs in cloud deployments
6. **Document configuration**: Keep infrastructure-as-code for reproducibility
7. **Test backups**: Regularly verify backup and restore procedures
8. **Set up monitoring**: Implement health checks and alerting
9. **Use modules strategically**: Only install needed plugins to reduce complexity
10. **Plan for scale**: Consider concurrent users and data growth