zhongwei/gh-human-frontier-labs-inc-human-frontier-labs-marketplace-plugins-tailscale-sshsync-agent

Zhongwei Li 14c678ceac Initial commit

2025-11-29 18:47:40 +08:00

Architecture Decisions

Documentation of all technical decisions made for the Tailscale SSH Sync Agent.

Tool Selection

Selected Tool: sshsync

Justification:

✅ Advantages:

Ready-to-use: Available via pip install sshsync
Group management: Built-in support for organizing hosts into groups
Integration: Works with existing SSH config (~/.ssh/config)
Simple API: Easy-to-wrap CLI interface
Parallel execution: Commands run concurrently across hosts
File operations: Push/pull with recursive support
Timeout handling: Per-command timeouts for reliability
Active maintenance: Regular updates and bug fixes
Python-based: Easy to extend and integrate

✅ Coverage:

All SSH-accessible hosts
Works with any SSH server (Linux, macOS, BSD, etc.)
Platform-agnostic (runs on any OS with Python)

✅ Cost:

Free and open-source
No API keys or subscriptions required
No rate limits

✅ Documentation:

Clear command-line interface
PyPI documentation available
GitHub repository with examples

Alternatives Considered:

❌ Fabric (Python library)

Pros: Pure Python, very flexible
Cons: Requires writing more code, no built-in group management
Rejected because: sshsync provides ready-made functionality

❌ Ansible

Pros: Industry standard, very powerful
Cons: Requires learning YAML playbooks, overkill for simple operations
Rejected because: Too heavyweight for ad-hoc commands and file transfers

❌ pssh (parallel-ssh)

Pros: Simple parallel SSH
Cons: No group management, no file transfer built-in, less actively maintained
Rejected because: sshsync has better group management and file operations

❌ Custom SSH wrapper

Pros: Full control
Cons: Reinventing the wheel, maintaining parallel execution logic
Rejected because: sshsync already provides what we need

Conclusion:

sshsync is the best tool for this use case because it:

Provides group-based host management out of the box
Handles parallel execution automatically
Integrates with existing SSH configuration
Supports both command execution and file transfers
Requires minimal wrapper code
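
To illustrate what "minimal wrapper code" around a CLI can look like, here is a rough sketch of a generic subprocess wrapper. The function name `run_cli` and the shape of the returned dict are assumptions made for this example, not the project's actual `sshsync_wrapper.py`. The sshsync command line itself is not shown because its exact arguments are not documented in this file.

```python
import subprocess
import sys


def run_cli(argv, timeout=30):
    """Run an external command and return a plain dict describing the outcome."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        # The executable is not installed or not on PATH.
        return {"ok": False, "error": f"command not found: {argv[0]}"}
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": f"timed out after {timeout}s"}
    return {
        "ok": proc.returncode == 0,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


# Demonstrated with the Python interpreter so the sketch runs anywhere;
# a real wrapper would pass an sshsync command line instead.
result = run_cli([sys.executable, "-c", "print('hello')"])
print(result["ok"], result["stdout"].strip())
```

Returning a dict instead of raising keeps the calling code simple when many hosts or commands are processed in a loop.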

Integration: Tailscale

Decision: Integrate with Tailscale for network connectivity

Justification:

✅ Why Tailscale:

Zero-config VPN: No manual firewall/NAT configuration
Secure by default: WireGuard encryption
Works everywhere: Coffee shop, home, office, cloud
MagicDNS: Easy addressing (machine-name.tailnet.ts.net)
Standard SSH: Works with all SSH tools including sshsync
No extra tooling: Uses the regular SSH protocol over the Tailscale network

✅ Integration approach:

Tailscale provides the network layer
Standard SSH works over Tailscale
sshsync operates normally using Tailscale hostnames/IPs
No Tailscale-specific code needed in core operations
Tailscale status checking for diagnostics
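
As a sketch of the status checking mentioned above: `tailscale status --json` prints a JSON document, and a diagnostic helper could summarize its peers. The helper name `summarize_peers` is hypothetical, and the field names used here (`Peer`, `HostName`, `TailscaleIPs`, `Online`) reflect that JSON output as best understood; verify them against your Tailscale version before relying on them.

```python
import json


def summarize_peers(status):
    """Return (hostname, first_ip, online) tuples from parsed status JSON."""
    peers = []
    for peer in (status.get("Peer") or {}).values():
        ips = peer.get("TailscaleIPs") or []
        peers.append((peer.get("HostName", "?"), ips[0] if ips else None, bool(peer.get("Online"))))
    return sorted(peers, key=lambda entry: entry[0])


# Hand-written sample shaped like the JSON output; not captured from a real tailnet.
sample = json.loads("""
{"Peer": {
  "key1": {"HostName": "web-01", "TailscaleIPs": ["100.64.0.2"], "Online": true},
  "key2": {"HostName": "db-01", "TailscaleIPs": ["100.64.0.3"], "Online": false}
}}
""")
for name, ip, online in summarize_peers(sample):
    print(f"{name:8} {str(ip):15} {'online' if online else 'offline'}")
```

Because the function takes already-parsed JSON, it can be exercised without Tailscale installed.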

Alternatives:

❌ Direct public internet + port forwarding

Cons: Complex firewall setup, security risks, doesn't work on mobile/restricted networks
Rejected because: Requires too much configuration and has security concerns

❌ Other VPNs (WireGuard, OpenVPN, ZeroTier)

Cons: More manual configuration, less zero-config
Rejected because: Tailscale is easier to set up and use

Conclusion:

Tailscale + standard SSH is the optimal combination:

Secure connectivity without configuration
Works with existing SSH tools
No vendor lock-in (can use other VPNs if needed)

Architecture

Structure: Modular Scripts + Utilities

Decision: Separate concerns into focused modules

scripts/
├── sshsync_wrapper.py         # sshsync CLI interface
├── tailscale_manager.py       # Tailscale operations
├── load_balancer.py           # Task distribution logic
├── workflow_executor.py       # Common workflows
└── utils/
    ├── helpers.py             # Formatting, parsing
    └── validators/            # Input validation

Justification:

✅ Modularity:

Each script has single responsibility
Easy to test independently
Easy to extend without breaking others

✅ Reusability:

Helpers used across all scripts
Validators prevent duplicate validation logic
Workflows compose lower-level operations

✅ Maintainability:

Clear file organization
Easy to locate specific functionality
Separation of concerns

Alternatives:

❌ Monolithic single script

Cons: Hard to test, hard to maintain, becomes too large
Rejected because: Doesn't scale well

❌ Over-engineered class hierarchy

Cons: Unnecessary complexity for this use case
Rejected because: Simple functions are sufficient

Conclusion:

The modular functional approach provides a good balance of simplicity and maintainability.

Validation Strategy: Multi-Layer

Decision: Validate at multiple layers

Layers:

1. Parameter validation (parameter_validator.py)
   - Validates user inputs before any operations
   - Prevents invalid hosts, groups, paths, etc.
2. Host validation (host_validator.py)
   - Validates SSH configuration exists
   - Checks host reachability
   - Validates group membership
3. Connection validation (connection_validator.py)
   - Tests actual SSH connectivity
   - Verifies Tailscale status
   - Checks SSH key authentication
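
The layering can be sketched as a chain that stops at the first failing layer. Everything below is illustrative: the `ValidationError` class, the two layer functions, and `run_layers` are simplified stand-ins, not the contents of the validator modules. The connection layer is left out because it needs a live network.

```python
class ValidationError(Exception):
    """Raised by a validation layer; carries a user-facing suggestion."""

    def __init__(self, message, suggestion=""):
        super().__init__(message)
        self.suggestion = suggestion


def validate_parameters(host, known_hosts):
    # Layer 1: reject malformed input before touching any configuration.
    if not host or " " in host:
        raise ValidationError(f"invalid host name: {host!r}", "Use a name without spaces")


def validate_host(host, known_hosts):
    # Layer 2: the host must be known (here: present in a given set).
    if host not in known_hosts:
        raise ValidationError(f"unknown host: {host}", "Add it to ~/.ssh/config")


def run_layers(host, known_hosts, layers=(validate_parameters, validate_host)):
    """Run each layer in order and stop at the first failure."""
    for layer in layers:
        try:
            layer(host, known_hosts)
        except ValidationError as exc:
            return {"ok": False, "layer": layer.__name__,
                    "error": str(exc), "suggestion": exc.suggestion}
    return {"ok": True}


print(run_layers("db-01", {"web-01"}))
```

Cheap checks run first, so a typo in a host name is reported without waiting on a network timeout.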

Justification:

✅ Early failure:

Catch errors before expensive operations
Clear error messages at each layer

✅ Comprehensive:

Multiple validation points catch different issues
Reduces runtime failures

✅ User-friendly:

Helpful error messages with suggestions
Clear indication of what went wrong

Conclusion:

Multi-layer validation provides robust error handling and a great user experience.

Load Balancing Strategy

Decision: Simple Composite Score

Formula:

score = (cpu_pct * 0.4) + (mem_pct * 0.3) + (disk_pct * 0.3)

Weights:

CPU: 40% (most important for compute tasks)
Memory: 30% (important for data processing)
Disk: 30% (important for I/O operations)
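
As a worked example of the formula, a host at 80% CPU, 40% memory, and 20% disk scores 0.4·80 + 0.3·40 + 0.3·20 = 50. The sketch below computes the score and picks a host, assuming that a lower score means a less loaded host (the formula implies this, but the selection rule is an assumption here). The function names and sample metrics are made up for illustration.

```python
def composite_score(cpu_pct, mem_pct, disk_pct):
    """Weighted load score on a 0-100 scale using the weights listed above."""
    return cpu_pct * 0.4 + mem_pct * 0.3 + disk_pct * 0.3


def least_loaded(metrics):
    """Return the host whose composite score is lowest."""
    return min(metrics, key=lambda host: composite_score(*metrics[host]))


# Hypothetical (cpu, mem, disk) percentages per host.
metrics = {
    "web-01": (80, 40, 20),
    "web-02": (20, 50, 50),
}
print(least_loaded(metrics))
```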

Justification:

✅ Simple and effective:

Easy to understand
Fast to calculate
Works well for most workloads

✅ Balanced:

Considers multiple resource types
No single metric dominates

Alternatives:

❌ CPU only

Cons: Ignores memory-bound and I/O-bound tasks
Rejected because: Too narrow

❌ Complex ML-based prediction

Cons: Overkill, slow, requires training data
Rejected because: Unnecessary complexity

❌ Fixed round-robin

Cons: Doesn't consider actual load
Rejected because: Can overload already-busy hosts

Conclusion:

A simple weighted score provides a good balance without complexity.

Error Handling Philosophy

Decision: Graceful Degradation + Clear Messages

Principles:

Fail early with validation: Catch errors before operations
Isolate failures: One host failure doesn't stop others
Clear messages: Tell the user exactly what went wrong and how to fix it
Automatic retry: Retry transient errors (network, timeout)
Dry-run support: Preview operations before execution

Implementation:

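# Example error handling pattern
# (validate_host, validate_ssh_connection, execute_command, get_diagnostics,
# and ValidationError are assumed to be defined elsewhere in the project)
def run_on_host(host, command):
    try:
        validate_host(host)
        validate_ssh_connection(host)
        return execute_command(host, command)
    except ValidationError as e:
        # Bad input or configuration: tell the user how to fix it
        return {'error': str(e), 'suggestion': 'Fix: ...'}
    except ConnectionError as e:
        # Host unreachable: attach diagnostics for troubleshooting
        return {'error': str(e), 'diagnostics': get_diagnostics(host)}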
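
Two of the principles above, automatic retry and failure isolation, can be sketched together. This is a simplified illustration, not the project's implementation: `with_retry` and `run_on_hosts` are hypothetical names, the choice of which exceptions count as transient is an assumption, and the loop is sequential for clarity.

```python
import time


def with_retry(operation, attempts=3, delay=0.0, transient=(ConnectionError, TimeoutError)):
    """Call operation(), retrying only on errors treated as transient."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except transient:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff


def run_on_hosts(hosts, operation):
    """Run operation(host) for every host; a failure is recorded, not propagated."""
    results = {}
    for host in hosts:
        try:
            results[host] = {"ok": True, "value": with_retry(lambda: operation(host))}
        except Exception as exc:
            results[host] = {"ok": False, "error": str(exc)}
    return results
```

Each host gets its own entry in the result, so one unreachable machine shows up as an error record while the others still report their output.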

Justification:

✅ Better UX:

Users know exactly what's wrong
Suggestions help fix issues quickly

✅ Reliability:

Automatic retry handles transient issues
Dry-run prevents mistakes

✅ Debugging:

Clear error messages speed up troubleshooting
Diagnostics provide actionable information

Conclusion:

Graceful degradation with helpful messages creates a better user experience.

Caching Strategy

Decision: Minimal caching for real-time accuracy

What we cache:

Nothing (v1.0.0)

Why no caching:

Host status changes frequently
Load metrics change constantly
Operations need real-time data
Cache invalidation is complex

Future consideration (v2.0):

Cache Tailscale status (60s TTL)
Cache group configuration (5min TTL)
Cache SSH config parsing (5min TTL)
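
For reference, a TTL cache of the kind considered for v2.0 could be as small as the sketch below. Nothing like this exists in v1.0.0; the class name `TTLCache` and its interface are assumptions for illustration only.

```python
import time


class TTLCache:
    """Tiny time-based cache: entries expire ttl seconds after being stored."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so tests can control time
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh
        value = compute()
        self._store[key] = (now, value)
        return value


# Example: cache an expensive lookup for 60 seconds.
status_cache = TTLCache(ttl=60)
print(status_cache.get_or_compute("tailscale_status", lambda: {"peers": 3}))
```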

Justification:

✅ Simplicity:

No cache invalidation logic needed
No stale data issues

✅ Accuracy:

Always get current state
No surprises from cached data

Trade-off:

Slightly slower repeated operations
More network calls

Conclusion:

For v1.0.0, simplicity and accuracy outweigh performance concerns. Real-time data is more valuable than speed.

Testing Strategy

Decision: Comprehensive Unit + Integration Tests

Coverage:

29 tests total:
- 11 integration tests (end-to-end workflows)
- 11 helper tests (formatting, parsing, calculations)
- 7 validation tests (input validation, safety checks)

Test Philosophy:

Test real functionality: Integration tests use actual functions
Test edge cases: Validation tests cover error conditions
Test helpers: Ensure formatting/parsing works correctly
Fast execution: All tests run in < 10 seconds
No external dependencies: Tests don't require Tailscale or sshsync to be running
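
One common way to keep tests free of external dependencies is to replace `subprocess.run` with a mock. The sketch below shows the idea; `tailscale_running` is a hypothetical helper (not necessarily one of the project's functions), and it assumes that `tailscale status` exits with code 0 when Tailscale is up.

```python
import subprocess
from unittest import mock


def tailscale_running():
    """Return True if `tailscale status` exits with code 0."""
    try:
        proc = subprocess.run(["tailscale", "status"],
                              capture_output=True, text=True, timeout=10)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


def test_running_without_tailscale_installed():
    fake = subprocess.CompletedProcess(args=["tailscale", "status"],
                                       returncode=0, stdout="", stderr="")
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert tailscale_running() is True
    run.assert_called_once()


def test_missing_binary_is_reported_as_not_running():
    with mock.patch("subprocess.run", side_effect=FileNotFoundError):
        assert tailscale_running() is False
```

Both tests pass on a machine with no Tailscale binary, because the real command is never executed.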

Justification:

✅ Confidence:

Tests verify code works as expected
Catches regressions when modifying code

✅ Documentation:

Tests show how to use functions
Examples of expected behavior

✅ Reliability:

Code is production-ready from v1.0.0

Conclusion:

Comprehensive testing ensures reliable code from the start.

Performance Considerations

Parallel Execution

Decision: Leverage sshsync's built-in parallelization

sshsync runs commands concurrently across hosts automatically
No need to implement custom threading/multiprocessing
Timeout applies per-host independently

Trade-offs:

✅ Pros:

Simple to use
Fast for large host groups
No custom concurrency code to write or debug

⚠️ Cons:

Less control over parallelism level
Can overwhelm network with too many concurrent connections
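
If the number of concurrent connections ever becomes a problem, one simple mitigation is to split a large host list into batches and run them one batch at a time. This is not part of v1.0.0; the helper below is a hypothetical sketch of the batching step only.

```python
def batches(hosts, size):
    """Yield consecutive slices of hosts containing at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


for group in batches(["a", "b", "c", "d", "e"], 2):
    print(group)
```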

Conclusion:

Built-in parallelization is sufficient for most use cases. Custom control can be added in v2.0 if needed.

Security Considerations

SSH Key Authentication

Decision: Require SSH keys (no password auth)

Justification:

✅ Security:

Keys are more secure than passwords
Not practical to brute-force
Can be revoked per-host

✅ Automation:

Non-interactive (no password prompts)
Works in scripts and CI/CD

Implementation:

Validators check SSH key auth works
Clear error messages guide users to set up keys
Documentation explains SSH key setup
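
A key-authentication check can be approximated with OpenSSH's `BatchMode=yes` option, which disables password and passphrase prompts so the login fails instead of hanging. The sketch below is an assumption about how such a validator might look, not the project's code; the function names are hypothetical. A success means some non-interactive method worked, which in practice is usually a key.

```python
import subprocess


def key_auth_command(host, connect_timeout=5):
    """Build an ssh command line that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={connect_timeout}",
        host,
        "true",  # trivial remote command; only the exit status matters
    ]


def key_auth_works(host, connect_timeout=5):
    """Return True if a non-interactive login to host succeeds."""
    try:
        proc = subprocess.run(key_auth_command(host, connect_timeout),
                              capture_output=True, text=True,
                              timeout=connect_timeout + 5)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0


print(" ".join(key_auth_command("web-01")))
```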

Command Safety

Decision: Validate dangerous commands

Dangerous patterns blocked:

rm -rf / (root deletion)
mkfs.* (filesystem formatting)
dd.*of=/dev/ (direct disk writes)
Fork bombs

Override: Use allow_dangerous=True to bypass
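
A pattern-based check along these lines might look like the sketch below. The regular expressions are illustrative approximations of the patterns listed above, and `check_command` is a hypothetical name; only the `allow_dangerous` flag comes from this document. Pattern matching of this kind is a guard against accidents, not a security boundary, since variants such as `rm -fr /` slip past it.

```python
import re

# Illustrative patterns only; the project's real list may differ.
DANGEROUS_PATTERNS = [
    (re.compile(r"\brm\s+-rf\s+/(\s|$)"), "root deletion"),
    (re.compile(r"\bmkfs(\.\w+)?\b"), "filesystem formatting"),
    (re.compile(r"\bdd\b.*\bof=/dev/"), "direct disk write"),
    (re.compile(r":\(\)\s*\{\s*:\s*\|\s*:\s*&\s*\}\s*;\s*:"), "fork bomb"),
]


def check_command(command, allow_dangerous=False):
    """Return (ok, reason); reason names the matched pattern when blocked."""
    if allow_dangerous:
        return True, None
    for pattern, reason in DANGEROUS_PATTERNS:
        if pattern.search(command):
            return False, reason
    return True, None


print(check_command("rm -rf /"))
print(check_command("rm -rf /tmp/build"))
```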

Justification:

✅ Safety:

Prevents accidental destructive operations
Dry-run provides preview

✅ Flexibility:

Can still run dangerous commands if explicitly allowed

Conclusion:

Safety by default, with an escape hatch for advanced users.

Decisions Summary

Decision	Choice	Rationale
CLI Tool	sshsync	Best balance of features, ease of use, and maintenance
Network	Tailscale	Zero-config secure VPN, works everywhere
Architecture	Modular scripts	Clear separation of concerns, maintainable
Validation	Multi-layer	Catch errors early with helpful messages
Load Balancing	Composite score	Simple, effective, considers multiple resources
Caching	None (v1.0)	Simplicity and real-time accuracy
Testing	29 tests	Comprehensive coverage for reliability
Security	SSH keys + validation	Secure and automation-friendly

Trade-offs Accepted

No caching → Slightly slower, but always accurate
sshsync dependency → External tool, but saves development time
SSH key requirement → Setup needed, but more secure
Simple load balancing → Less sophisticated, but fast and easy to understand
Terminal UI only → No web dashboard, but simpler to develop and maintain

Future Improvements

v2.0 Considerations

Add caching for frequently-accessed data (Tailscale status, groups)
Web dashboard for visualization and monitoring
Operation history database for audit trail
Advanced load balancing with custom metrics
Automated SSH key distribution across hosts
Integration with config management tools (Ansible, Terraform)
Container support via SSH to Docker containers
Custom validation plugins for domain-specific checks

All decisions prioritize simplicity, security, and maintainability for v1.0.0.
