Initial commit
This commit is contained in:
326
skills/vertex-engine-inspector/SKILL.md
Normal file
326
skills/vertex-engine-inspector/SKILL.md
Normal file
@@ -0,0 +1,326 @@
|
||||
---
|
||||
name: vertex-engine-inspector
|
||||
description: |
|
||||
Vertex AI Agent Engine runtime inspector and health monitor.
|
||||
Validates deployed agents, Code Execution Sandbox settings, Memory Bank configuration, A2A protocol compliance, and production readiness.
|
||||
Triggers: "inspect agent engine", "validate agent deployment", "check code execution sandbox", "monitor agent health"
|
||||
allowed-tools: Read, Grep, Glob, Bash
|
||||
version: 1.0.0
|
||||
---
|
||||
|
||||
## What This Skill Does
|
||||
|
||||
Expert inspector for the Vertex AI Agent Engine managed runtime. Performs comprehensive validation of deployed agents including runtime configuration, security posture, performance settings, A2A protocol compliance, and production readiness scoring.
|
||||
|
||||
## When This Skill Activates
|
||||
|
||||
### Trigger Phrases
|
||||
- "Inspect Vertex AI Engine agent"
|
||||
- "Validate Agent Engine deployment"
|
||||
- "Check Code Execution Sandbox configuration"
|
||||
- "Verify Memory Bank settings"
|
||||
- "Monitor agent health"
|
||||
- "Agent Engine production readiness"
|
||||
- "A2A protocol compliance check"
|
||||
- "Agent Engine security audit"
|
||||
|
||||
### Use Cases
|
||||
- Pre-production deployment validation
|
||||
- Post-deployment health monitoring
|
||||
- Security compliance audits
|
||||
- Performance optimization reviews
|
||||
- Troubleshooting agent issues
|
||||
- Configuration drift detection
|
||||
|
||||
## Inspection Categories
|
||||
|
||||
### 1. Runtime Configuration ✅
|
||||
- Model selection (Gemini 2.5 Pro/Flash)
|
||||
- Tools enabled (Code Execution, Memory Bank, custom)
|
||||
- VPC configuration
|
||||
- Resource allocation
|
||||
- Scaling policies
|
||||
|
||||
### 2. Code Execution Sandbox 🔒
|
||||
- **Security**: Isolated environment, no external network access
|
||||
- **State Persistence**: TTL validation (1-14 days)
|
||||
- **IAM**: Least privilege permissions
|
||||
- **Performance**: Timeout and resource limits
|
||||
- **Concurrent Executions**: Max concurrent code runs
|
||||
|
||||
**Critical Checks**:
|
||||
```
|
||||
✅ State TTL between 7-14 days (optimal for production)
|
||||
✅ Sandbox type is SECURE_ISOLATED
|
||||
✅ IAM permissions limited to required GCP services only
|
||||
✅ Timeout configured appropriately
|
||||
⚠️ State TTL < 7 days may cause premature session loss
|
||||
❌ State TTL > 14 days not allowed by Agent Engine
|
||||
```
|
||||
|
||||
### 3. Memory Bank Configuration 🧠
|
||||
- **Enabled Status**: Persistent memory active
|
||||
- **Retention Policy**: Max memories, retention days
|
||||
- **Storage Backend**: Firestore encryption & region
|
||||
- **Query Performance**: Indexing, caching, latency
|
||||
- **Auto-Cleanup**: Quota management
|
||||
|
||||
**Critical Checks**:
|
||||
```
|
||||
✅ Max memories >= 100 (prevents conversation truncation)
|
||||
✅ Indexing enabled (fast query performance)
|
||||
✅ Auto-cleanup enabled (prevents quota exhaustion)
|
||||
✅ Encrypted at rest (Firestore default)
|
||||
⚠️ Low memory limit may truncate long conversations
|
||||
```
|
||||
|
||||
### 4. A2A Protocol Compliance 🔗
|
||||
- **AgentCard**: Available at `/.well-known/agent-card`
|
||||
- **Task API**: `POST /v1/tasks:send` responds correctly
|
||||
- **Status API**: `GET /v1/tasks/{task_id}` accessible
|
||||
- **Protocol Version**: 1.0 compliance
|
||||
- **Required Fields**: name, description, tools, version
|
||||
|
||||
**Compliance Report**:
|
||||
```
|
||||
✅ AgentCard accessible and valid
|
||||
✅ Task submission API functional
|
||||
✅ Status polling API functional
|
||||
✅ Protocol version 1.0
|
||||
❌ Missing AgentCard fields: [...]
|
||||
❌ Task API not responding (check IAM/networking)
|
||||
```
|
||||
|
||||
### 5. Security Posture 🛡️
|
||||
- **IAM Roles**: Least privilege validation
|
||||
- **VPC Service Controls**: Perimeter protection
|
||||
- **Model Armor**: Prompt injection protection
|
||||
- **Encryption**: At-rest and in-transit
|
||||
- **Service Account**: Proper configuration
|
||||
- **Secret Management**: No hardcoded credentials
|
||||
|
||||
**Security Score**:
|
||||
```
|
||||
🟢 SECURE (90-100%): Production ready
|
||||
🟡 NEEDS ATTENTION (70-89%): Address issues before prod
|
||||
🔴 INSECURE (<70%): Do not deploy to production
|
||||
```
|
||||
|
||||
### 6. Performance Metrics 📊
|
||||
- **Auto-Scaling**: Min/max instances configured
|
||||
- **Resource Limits**: CPU, memory appropriate
|
||||
- **Latency**: P50, P95, P99 within SLOs
|
||||
- **Throughput**: Requests per second
|
||||
- **Token Usage**: Cost tracking
|
||||
- **Error Rate**: < 5% target
|
||||
|
||||
**Health Status**:
|
||||
```
|
||||
🟢 HEALTHY: Error rate < 5%, latency < 3s (p95)
|
||||
🟡 DEGRADED: Error rate 5-10% or latency 3-5s
|
||||
🔴 UNHEALTHY: Error rate > 10% or latency > 5s
|
||||
```
|
||||
|
||||
### 7. Monitoring & Observability 📈
|
||||
- **Cloud Monitoring**: Dashboards configured
|
||||
- **Alerting**: Policies for errors, latency, costs
|
||||
- **Logging**: Structured logs aggregated
|
||||
- **Tracing**: OpenTelemetry enabled
|
||||
- **Error Tracking**: Cloud Error Reporting
|
||||
|
||||
**Observability Score**:
|
||||
```
|
||||
✅ All 5 pillars configured: Metrics, Logs, Traces, Alerts, Dashboards
|
||||
⚠️ Missing alerts for critical scenarios
|
||||
❌ No monitoring configured (production blocker)
|
||||
```
|
||||
|
||||
## Production Readiness Scoring
|
||||
|
||||
### Scoring Matrix
|
||||
|
||||
| Category | Weight | Checks |
|
||||
|----------|--------|--------|
|
||||
| Security | 30% | 6 checks (IAM, VPC-SC, encryption, etc.) |
|
||||
| Performance | 25% | 6 checks (scaling, limits, SLOs, etc.) |
|
||||
| Monitoring | 20% | 6 checks (dashboards, alerts, logs, etc.) |
|
||||
| Compliance | 15% | 5 checks (audit logs, DR, privacy, etc.) |
|
||||
| Reliability | 10% | 5 checks (multi-region, failover, etc.) |
|
||||
|
||||
### Overall Readiness Status
|
||||
|
||||
```
|
||||
🟢 PRODUCTION READY (85-100%)
|
||||
- All critical checks passed
|
||||
- Minor optimizations recommended
|
||||
- Safe to deploy
|
||||
|
||||
🟡 NEEDS IMPROVEMENT (70-84%)
|
||||
- Some important checks failed
|
||||
- Address issues before production
|
||||
- Staging deployment acceptable
|
||||
|
||||
🔴 NOT READY (<70%)
|
||||
- Critical failures present
|
||||
- Do not deploy to production
|
||||
- Fix blocking issues first
|
||||
```
|
||||
|
||||
## Inspection Workflow
|
||||
|
||||
### Phase 1: Configuration Analysis
|
||||
```
|
||||
1. Connect to Agent Engine
|
||||
2. Retrieve agent metadata
|
||||
3. Parse runtime configuration
|
||||
4. Extract Code Execution settings
|
||||
5. Extract Memory Bank settings
|
||||
6. Document VPC configuration
|
||||
```
|
||||
|
||||
### Phase 2: Protocol Validation
|
||||
```
|
||||
1. Test AgentCard endpoint
|
||||
2. Validate AgentCard structure
|
||||
3. Test Task API (POST /v1/tasks:send)
|
||||
4. Test Status API (GET /v1/tasks/{id})
|
||||
5. Verify A2A protocol version
|
||||
```
|
||||
|
||||
### Phase 3: Security Audit
|
||||
```
|
||||
1. Review IAM roles and permissions
|
||||
2. Check VPC Service Controls
|
||||
3. Validate encryption settings
|
||||
4. Scan for hardcoded secrets
|
||||
5. Verify Model Armor enabled
|
||||
6. Assess service account security
|
||||
```
|
||||
|
||||
### Phase 4: Performance Analysis
|
||||
```
|
||||
1. Query Cloud Monitoring metrics
|
||||
2. Calculate error rate (last 24h)
|
||||
3. Analyze latency percentiles
|
||||
4. Review token usage and costs
|
||||
5. Check auto-scaling behavior
|
||||
6. Validate resource limits
|
||||
```
|
||||
|
||||
### Phase 5: Production Readiness
|
||||
```
|
||||
1. Run all checklist items (28 checks)
|
||||
2. Calculate category scores
|
||||
3. Calculate overall score
|
||||
4. Determine readiness status
|
||||
5. Generate recommendations
|
||||
6. Create action plan
|
||||
```
|
||||
|
||||
## Tool Permissions
|
||||
|
||||
**Read-only inspection** - Cannot modify configurations:
|
||||
- **Read**: Analyze agent configuration files
|
||||
- **Grep**: Search for security issues
|
||||
- **Glob**: Find related configuration
|
||||
- **Bash**: Query GCP APIs (read-only)
|
||||
|
||||
## Example Inspection Report
|
||||
|
||||
```yaml
|
||||
Agent ID: gcp-deployer-agent
|
||||
Deployment Status: RUNNING
|
||||
Inspection Date: 2025-12-09
|
||||
|
||||
Runtime Configuration:
|
||||
Model: gemini-2.5-flash
|
||||
Code Execution: ✅ Enabled (TTL: 14 days)
|
||||
Memory Bank: ✅ Enabled (retention: 90 days)
|
||||
VPC: ✅ Configured (private-vpc-prod)
|
||||
|
||||
A2A Protocol Compliance:
|
||||
AgentCard: ✅ Valid
|
||||
Task API: ✅ Functional
|
||||
Status API: ✅ Functional
|
||||
Protocol Version: 1.0
|
||||
|
||||
Security Posture:
|
||||
IAM: ✅ Least privilege (score: 95%)
|
||||
VPC-SC: ✅ Enabled
|
||||
Model Armor: ✅ Enabled
|
||||
Encryption: ✅ At-rest & in-transit
|
||||
Overall: 🟢 SECURE (92%)
|
||||
|
||||
Performance Metrics (24h):
|
||||
Request Count: 12,450
|
||||
Error Rate: 2.3% 🟢
|
||||
Latency (p95): 1,850ms 🟢
|
||||
Token Usage: 450K tokens
|
||||
Cost Estimate: $12.50/day
|
||||
|
||||
Production Readiness:
|
||||
Security: 92% (28/30 points)
|
||||
Performance: 88% (22/25 points)
|
||||
Monitoring: 95% (19/20 points)
|
||||
Compliance: 80% (12/15 points)
|
||||
Reliability: 70% (7/10 points)
|
||||
|
||||
Overall Score: 87% 🟢 PRODUCTION READY
|
||||
|
||||
Recommendations:
|
||||
1. Enable multi-region deployment (reliability +10%)
|
||||
2. Configure automated backups (compliance +5%)
|
||||
3. Add circuit breaker pattern (reliability +5%)
|
||||
4. Optimize memory bank indexing (performance +3%)
|
||||
```
|
||||
|
||||
## Integration with Other Plugins
|
||||
|
||||
### Works with jeremy-adk-orchestrator
|
||||
- Orchestrator deploys agents
|
||||
- Inspector validates deployments
|
||||
- Feedback loop for optimization
|
||||
|
||||
### Works with jeremy-vertex-validator
|
||||
- Validator checks code before deployment
|
||||
- Inspector validates runtime after deployment
|
||||
- Complementary pre/post checks
|
||||
|
||||
### Works with jeremy-adk-terraform
|
||||
- Terraform provisions infrastructure
|
||||
- Inspector validates provisioned agents
|
||||
- Ensures IaC matches runtime
|
||||
|
||||
## Troubleshooting Guide
|
||||
|
||||
### Issue: Agent not responding
|
||||
**Inspector checks**:
|
||||
- VPC configuration allows traffic
|
||||
- IAM permissions correct
|
||||
- Agent Engine status is RUNNING
|
||||
- No quota limits exceeded
|
||||
|
||||
### Issue: High error rate
|
||||
**Inspector checks**:
|
||||
- Model configuration appropriate
|
||||
- Resource limits not exceeded
|
||||
- Code Execution sandbox not timing out
|
||||
- Memory Bank not quota-exhausted
|
||||
|
||||
### Issue: Slow response times
|
||||
**Inspector checks**:
|
||||
- Auto-scaling configured
|
||||
- Code Execution TTL appropriate
|
||||
- Memory Bank indexing enabled
|
||||
- Caching strategy implemented
|
||||
|
||||
## Version History
|
||||
|
||||
- **1.0.0** (2025): Initial release with Agent Engine GA support, Code Execution Sandbox, Memory Bank, A2A protocol validation
|
||||
|
||||
## References
|
||||
|
||||
- Agent Engine: https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/overview
|
||||
- Code Execution: https://cloud.google.com/agent-builder/agent-engine/code-execution/overview
|
||||
- Memory Bank: https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/memory-bank/overview
|
||||
- A2A Protocol: https://google.github.io/adk-docs/a2a/
|
||||
Reference in New Issue
Block a user