13 KiB
Architecture Decisions
Documentation of all technical decisions made for Tailscale SSH Sync Agent.
Tool Selection
Selected Tool: sshsync
Justification:
✅ Advantages:
- Ready-to-use: Available via
pip install sshsync - Group management: Built-in support for organizing hosts into groups
- Integration: Works with existing SSH config (
~/.ssh/config) - Simple API: Easy-to-wrap CLI interface
- Parallel execution: Commands run concurrently across hosts
- File operations: Push/pull with recursive support
- Timeout handling: Per-command timeouts for reliability
- Active maintenance: Regular updates and bug fixes
- Python-based: Easy to extend and integrate
✅ Coverage:
- All SSH-accessible hosts
- Works with any SSH server (Linux, macOS, BSD, etc.)
- Platform-agnostic (runs on any OS with Python)
✅ Cost:
- Free and open-source
- No API keys or subscriptions required
- No rate limits
✅ Documentation:
- Clear command-line interface
- PyPI documentation available
- GitHub repository with examples
Alternatives Considered:
❌ Fabric (Python library)
- Pros: Pure Python, very flexible
- Cons: Requires writing more code, no built-in group management
- Rejected because: sshsync provides ready-made functionality
❌ Ansible
- Pros: Industry standard, very powerful
- Cons: Requires learning YAML playbooks, overkill for simple operations
- Rejected because: Too heavyweight for ad-hoc commands and file transfers
❌ pssh (parallel-ssh)
- Pros: Simple parallel SSH
- Cons: No group management, no file transfer built-in, less actively maintained
- Rejected because: sshsync has better group management and file operations
❌ Custom SSH wrapper
- Pros: Full control
- Cons: Reinventing the wheel, maintaining parallel execution logic
- Rejected because: sshsync already provides what we need
Conclusion:
sshsync is the best tool for this use case because it:
- Provides group-based host management out of the box
- Handles parallel execution automatically
- Integrates with existing SSH configuration
- Supports both command execution and file transfers
- Requires minimal wrapper code
Integration: Tailscale
Decision: Integrate with Tailscale for network connectivity
Justification:
✅ Why Tailscale:
- Zero-config VPN: No manual firewall/NAT configuration
- Secure by default: WireGuard encryption
- Works everywhere: Coffee shop, home, office, cloud
- MagicDNS: Easy addressing (machine-name.tailnet.ts.net)
- Standard SSH: Works with all SSH tools including sshsync
- No overhead: Uses regular SSH protocol over Tailscale network
✅ Integration approach:
- Tailscale provides the network layer
- Standard SSH works over Tailscale
- sshsync operates normally using Tailscale hostnames/IPs
- No Tailscale-specific code needed in core operations
- Tailscale status checking for diagnostics
Alternatives:
❌ Direct public internet + port forwarding
- Cons: Complex firewall setup, security risks, doesn't work on mobile/restricted networks
- Rejected because: Requires too much configuration and has security concerns
❌ Other VPNs (WireGuard, OpenVPN, ZeroTier)
- Cons: More manual configuration, less zero-config
- Rejected because: Tailscale is easier to set up and use
Conclusion:
Tailscale + standard SSH is the optimal combination:
- Secure connectivity without configuration
- Works with existing SSH tools
- No vendor lock-in (can use other VPNs if needed)
Architecture
Structure: Modular Scripts + Utilities
Decision: Separate concerns into focused modules
scripts/
├── sshsync_wrapper.py # sshsync CLI interface
├── tailscale_manager.py # Tailscale operations
├── load_balancer.py # Task distribution logic
├── workflow_executor.py # Common workflows
└── utils/
├── helpers.py # Formatting, parsing
└── validators/ # Input validation
Justification:
✅ Modularity:
- Each script has single responsibility
- Easy to test independently
- Easy to extend without breaking others
✅ Reusability:
- Helpers used across all scripts
- Validators prevent duplicate validation logic
- Workflows compose lower-level operations
✅ Maintainability:
- Clear file organization
- Easy to locate specific functionality
- Separation of concerns
Alternatives:
❌ Monolithic single script
- Cons: Hard to test, hard to maintain, becomes too large
- Rejected because: Doesn't scale well
❌ Over-engineered class hierarchy
- Cons: Unnecessary complexity for this use case
- Rejected because: Simple functions are sufficient
Conclusion:
Modular functional approach provides good balance of simplicity and maintainability.
Validation Strategy: Multi-Layer
Decision: Validate at multiple layers
Layers:
-
Parameter validation (
parameter_validator.py)- Validates user inputs before any operations
- Prevents invalid hosts, groups, paths, etc.
-
Host validation (
host_validator.py)- Validates SSH configuration exists
- Checks host reachability
- Validates group membership
-
Connection validation (
connection_validator.py)- Tests actual SSH connectivity
- Verifies Tailscale status
- Checks SSH key authentication
Justification:
✅ Early failure:
- Catch errors before expensive operations
- Clear error messages at each layer
✅ Comprehensive:
- Multiple validation points catch different issues
- Reduces runtime failures
✅ User-friendly:
- Helpful error messages with suggestions
- Clear indication of what went wrong
Conclusion:
Multi-layer validation provides robust error handling and great user experience.
Load Balancing Strategy
Decision: Simple Composite Score
Formula:
score = (cpu_pct * 0.4) + (mem_pct * 0.3) + (disk_pct * 0.3)
Weights:
- CPU: 40% (most important for compute tasks)
- Memory: 30% (important for data processing)
- Disk: 30% (important for I/O operations)
Justification:
✅ Simple and effective:
- Easy to understand
- Fast to calculate
- Works well for most workloads
✅ Balanced:
- Considers multiple resource types
- No single metric dominates
Alternatives:
❌ CPU only
- Cons: Ignores memory-bound and I/O-bound tasks
- Rejected because: Too narrow
❌ Complex ML-based prediction
- Cons: Overkill, slow, requires training data
- Rejected because: Unnecessary complexity
❌ Fixed round-robin
- Cons: Doesn't consider actual load
- Rejected because: Can overload already-busy hosts
Conclusion:
Simple weighted score provides good balance without complexity.
Error Handling Philosophy
Decision: Graceful Degradation + Clear Messages
Principles:
- Fail early with validation: Catch errors before operations
- Isolate failures: One host failure doesn't stop others
- Clear messages: Tell user exactly what went wrong and how to fix
- Automatic retry: Retry transient errors (network, timeout)
- Dry-run support: Preview operations before execution
Implementation:
# Example error handling pattern
try:
validate_host(host)
validate_ssh_connection(host)
result = execute_command(host, command)
except ValidationError as e:
return {'error': str(e), 'suggestion': 'Fix: ...'}
except ConnectionError as e:
return {'error': str(e), 'diagnostics': get_diagnostics(host)}
Justification:
✅ Better UX:
- Users know exactly what's wrong
- Suggestions help fix issues quickly
✅ Reliability:
- Automatic retry handles transient issues
- Dry-run prevents mistakes
✅ Debugging:
- Clear error messages speed up troubleshooting
- Diagnostics provide actionable information
Conclusion:
Graceful degradation with helpful messages creates better user experience.
Caching Strategy
Decision: Minimal caching for real-time accuracy
What we cache:
- Nothing (v1.0.0)
Why no caching:
- Host status changes frequently
- Load metrics change constantly
- Operations need real-time data
- Cache invalidation is complex
Future consideration (v2.0):
- Cache Tailscale status (60s TTL)
- Cache group configuration (5min TTL)
- Cache SSH config parsing (5min TTL)
Justification:
✅ Simplicity:
- No cache invalidation logic needed
- No stale data issues
✅ Accuracy:
- Always get current state
- No surprises from cached data
Trade-off:
- Slightly slower repeated operations
- More network calls
Conclusion:
For v1.0.0, simplicity and accuracy outweigh performance concerns. Real-time data is more valuable than speed.
Testing Strategy
Decision: Comprehensive Unit + Integration Tests
Coverage:
- 29 tests total:
- 11 integration tests (end-to-end workflows)
- 11 helper tests (formatting, parsing, calculations)
- 7 validation tests (input validation, safety checks)
Test Philosophy:
- Test real functionality: Integration tests use actual functions
- Test edge cases: Validation tests cover error conditions
- Test helpers: Ensure formatting/parsing works correctly
- Fast execution: All tests run in < 10 seconds
- No external dependencies: Tests don't require Tailscale or sshsync to be running
Justification:
✅ Confidence:
- Tests verify code works as expected
- Catches regressions when modifying code
✅ Documentation:
- Tests show how to use functions
- Examples of expected behavior
✅ Reliability:
- Production-ready code from v1.0.0
Conclusion:
Comprehensive testing ensures reliable code from the start.
Performance Considerations
Parallel Execution
Decision: Leverage sshsync's built-in parallelization
- sshsync runs commands concurrently across hosts automatically
- No need to implement custom threading/multiprocessing
- Timeout applies per-host independently
Trade-offs:
✅ Pros:
- Simple to use
- Fast for large host groups
- No concurrency bugs
⚠️ Cons:
- Less control over parallelism level
- Can overwhelm network with too many concurrent connections
Conclusion:
Built-in parallelization is sufficient for most use cases. Custom control can be added in v2.0 if needed.
Security Considerations
SSH Key Authentication
Decision: Require SSH keys (no password auth)
Justification:
✅ Security:
- Keys are more secure than passwords
- Can't be brute-forced
- Can be revoked per-host
✅ Automation:
- Non-interactive (no password prompts)
- Works in scripts and CI/CD
Implementation:
- Validators check SSH key auth works
- Clear error messages guide users to set up keys
- Documentation explains SSH key setup
Command Safety
Decision: Validate dangerous commands
Dangerous patterns blocked:
rm -rf /(root deletion)mkfs.*(filesystem formatting)dd.*of=/dev/(direct disk writes)- Fork bombs
- Direct disk writes
Override: Use allow_dangerous=True to bypass
Justification:
✅ Safety:
- Prevents accidental destructive operations
- Dry-run provides preview
✅ Flexibility:
- Can still run dangerous commands if explicitly allowed
Conclusion:
Safety by default with escape hatch for advanced users.
Decisions Summary
| Decision | Choice | Rationale |
|---|---|---|
| CLI Tool | sshsync | Best balance of features, ease of use, and maintenance |
| Network | Tailscale | Zero-config secure VPN, works everywhere |
| Architecture | Modular scripts | Clear separation of concerns, maintainable |
| Validation | Multi-layer | Catch errors early with helpful messages |
| Load Balancing | Composite score | Simple, effective, considers multiple resources |
| Caching | None (v1.0) | Simplicity and real-time accuracy |
| Testing | 29 tests | Comprehensive coverage for reliability |
| Security | SSH keys + validation | Secure and automation-friendly |
Trade-offs Accepted
- No caching → Slightly slower, but always accurate
- sshsync dependency → External tool, but saves development time
- SSH key requirement → Setup needed, but more secure
- Simple load balancing → Less sophisticated, but fast and easy to understand
- Terminal UI only → No web dashboard, but simpler to develop and maintain
Future Improvements
v2.0 Considerations
- Add caching for frequently-accessed data (Tailscale status, groups)
- Web dashboard for visualization and monitoring
- Operation history database for audit trail
- Advanced load balancing with custom metrics
- Automated SSH key distribution across hosts
- Integration with config management tools (Ansible, Terraform)
- Container support via SSH to Docker containers
- Custom validation plugins for domain-specific checks
All decisions prioritize simplicity, security, and maintainability for v1.0.0.