Initial commit

.claude-plugin/plugin.json (new file, +12 lines)
@@ -0,0 +1,12 @@
{
  "name": "mission-control-skills",
  "description": "Collection of skills to diagnose and fix issues with mission control",
  "version": "0.0.0-2025.11.28",
  "author": {
    "name": "Flanksource Inc.",
    "email": "contact@flanksource.com"
  },
  "skills": [
    "./skills"
  ]
}

README.md (new file, +3 lines)
@@ -0,0 +1,3 @@
# mission-control-skills

Collection of skills to diagnose and fix issues with mission control

plugin.lock.json (new file, +57 lines)
@@ -0,0 +1,57 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:flanksource/claude-code-plugin:",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "1ca706f28027a2b8ed722e3184e8ef49925a21de",
    "treeHash": "4c24607b108a8429ecb475c04c668789a9c0f77385e4532f0b048fd2056b0dfe",
    "generatedAt": "2025-11-28T10:16:54.739603Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "mission-control-skills",
    "description": "Collection of skills to diagnose and fix issues with mission control",
    "version": null
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "7bfc57a935d12d2a690bda764a2099dc2bf4f2569ccbaeeff7ecc975746f8589"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "3ea0a578b9c0f31034c79c1c7119b56002617bb52037f96a72129ad11e07046f"
      },
      {
        "path": "skills/troubleshooting-health-checks/SKILL.md",
        "sha256": "43fd1b5085ea2039110eb49d173b38bac5fe6cfb456e150a38620f511728b7f4"
      },
      {
        "path": "skills/troubleshooting-health-checks/reference/query-syntax.md",
        "sha256": "fbddba0d6ca723a0ec7f2942d04674b73ec21162e31fb253c52ea68be36793e1"
      },
      {
        "path": "skills/troubleshooting-config-item/SKILL.md",
        "sha256": "98554ec41d1ace16d142a39df53ca216fd6b1bc6a13df5b4e085fc74980d85a5"
      },
      {
        "path": "skills/troubleshooting-notifications/SKILL.md",
        "sha256": "1931a7d040219ad441797c005aa0cea4ce1fcd2f5ef5f63392927cb7d1e11d86"
      }
    ],
    "dirSha256": "4c24607b108a8429ecb475c04c668789a9c0f77385e4532f0b048fd2056b0dfe"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}

skills/troubleshooting-config-item/SKILL.md (new file, +125 lines)
@@ -0,0 +1,125 @@
---
name: troubleshooting-config-items
description: Troubleshoots infrastructure and application configuration items in Mission Control by diagnosing health issues, analyzing recent changes, and investigating resource relationships. Use when users ask about unhealthy or failing resources, mention specific config items by name or ID, inquire about Kubernetes pods/deployments/services, AWS EC2 instances/volumes, Azure VMs, or other infrastructure components. Also use when investigating why a resource is down, stopped, degraded, or showing errors, or when analyzing what changed that caused an issue.
allowed-tools: search_catalog, describe_config, list_catalog_types, get_related_configs, search_catalog_changes, get_notification_detail, get_notifications_for_resource
---

# Config Item Troubleshooting Skill

## Core Purpose

This skill enables Claude to troubleshoot infrastructure and application configuration items in Mission Control, diagnose health issues, analyze changes, and identify root causes through systematic investigation of config relationships and history.

## Understanding Config Items

A **ConfigItem** represents a discoverable infrastructure or application configuration (Kubernetes Pods, AWS EC2 instances, Azure VMs, database instances, etc.). Each config item contains:

- **health**: Overall health status ("healthy", "unhealthy", "warning", "unknown")
- **status**: Operational state (e.g., "Running", "Stopped", "Pending")
- **description**: Human-readable description (often contains error messages when unhealthy)
- **.config**: The actual JSON specification/manifest (e.g., Kubernetes Pod spec, AWS instance details)
- **type**: The kind of resource (e.g., "Kubernetes::Pod", "AWS::EC2::Instance")
- **tags**: Metadata for filtering and organization
- **parent_id/path**: Hierarchical relationships to other configs
- **external_id**: External system identifier
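
For orientation, a config item with these fields might look roughly like the sketch below. This is illustrative only: the field values and the exact response shape of `describe_config` are assumptions, not taken from the Mission Control API documentation.

```json
{
  "id": "00000000-0000-0000-0000-000000000001",
  "name": "api-deployment",
  "type": "Kubernetes::Deployment",
  "health": "unhealthy",
  "status": "Updating",
  "description": "Deployment does not have minimum availability",
  "tags": { "namespace": "prod", "cluster": "prod-east" },
  "parent_id": "00000000-0000-0000-0000-000000000002",
  "external_id": "prod/api-deployment",
  "config": { "apiVersion": "apps/v1", "kind": "Deployment", "spec": { "replicas": 3 } }
}
```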

## Key Workflows

### Initial Investigation

**1. Search and Identify the Config**
Use the MCP `search_catalog` tool to find the config item:

- Search by id, name, type, tags, or other attributes
- Narrow down to the specific config experiencing issues

**2. Get Complete Config Details**
Use the MCP `describe_config` tool to retrieve full config information:

- Review the **health** field for overall status
- Check the **status** field for operational state
- Read the **description** field carefully - this often contains error messages or status information
- Examine the **.config** JSON field - this contains the full specification/manifest

### Change Analysis

**3. Review Recent Changes**
If the issue isn't immediately apparent, use the MCP `search_catalog_changes` tool:

- Get changes for the specific config item
- Look for recent modifications to the specification
- Check `change_type` (created, updated, deleted)
- Review `severity` (critical, high, medium, low, info)
- Examine `patches` and `diff` fields to see what changed
- Check `source` to understand where the change originated
- Note the `created_at` timestamp to correlate with when issues started
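
As a rough illustration (the exact change-record schema and field values here are assumptions, not a documented response), a returned change might carry fields like these:

```json
{
  "config_id": "00000000-0000-0000-0000-000000000001",
  "change_type": "updated",
  "severity": "medium",
  "source": "kubernetes",
  "created_at": "2025-11-28T08:42:00Z",
  "patches": { "spec": { "template": { "spec": { "containers": [{ "image": "api:v2.4.1" }] } } } }
}
```

A `created_at` close to when the health degraded, plus a `patches` entry touching the failing part of the spec, is usually the strongest lead.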

### Relationship Navigation

**4. Investigate Related Configs**
Use the MCP `get_related_configs` tool to navigate the config hierarchy:

- **Children**: Resources created/managed by this config
  - Example: A Kubernetes Deployment → ReplicaSets → Pods
  - Example: An AWS Auto Scaling Group → EC2 Instances
- **Parents**: Resources that manage this config
  - Example: A Pod → ReplicaSet → Deployment
- **Dependencies**: Resources this config depends on
  - Example: A Pod → ConfigMaps, Secrets, PersistentVolumeClaims

**Troubleshooting Pattern:**
When a parent resource is unhealthy, investigate its children to find the actual failing component. When a child is unhealthy, check the parent for misconfigurations.
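
A minimal sketch of what the relationship data might look like; the grouping and field names are assumed for illustration and are not taken from the tool's documented output:

```json
{
  "parents": [
    { "id": "00000000-0000-0000-0000-000000000003", "type": "Kubernetes::Namespace", "name": "prod" }
  ],
  "children": [
    { "id": "00000000-0000-0000-0000-000000000004", "type": "Kubernetes::ReplicaSet", "name": "api-deployment-7d9f", "health": "unhealthy" }
  ],
  "dependencies": [
    { "id": "00000000-0000-0000-0000-000000000005", "type": "Kubernetes::ConfigMap", "name": "api-config" }
  ]
}
```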

## Critical Requirements

**Hierarchical Thinking:**

- Kubernetes: Namespace → Deployment → ReplicaSet → Pod → Container
- AWS: VPC → Subnet → EC2 Instance → Volume
- Azure: Resource Group → VM → Disk

**Change Impact Analysis:**

- Compare current config with previous working state
- Identify what changed and when
- Correlate timing of changes with health degradation

**Evidence-Based Diagnosis:**

- Support conclusions with specific evidence from the config data
- Quote relevant error messages from description fields
- Reference specific fields in the .config JSON
- Cite change diffs and timestamps

## Diagnosis Workflow

Follow this systematic approach:

1. **Identify** - Find the config item
2. **Assess** - Review health, status, description, and .config spec
3. **Analyze Changes** - Check recent modifications and events
4. **Navigate Relationships** - Investigate parent/child/dependency configs
5. **Review Analysis** - Check automated findings
6. **Synthesize** - Determine root cause from all evidence
7. **Recommend** - Provide specific remediation steps

## Example Troubleshooting Scenarios

**Scenario 1: Unhealthy Kubernetes Deployment**

- Get Deployment details → health: unhealthy
- Get related configs (children) → ReplicaSets → Pods
- Find Pod in CrashLoopBackOff
- Check Pod .config → image pull error
- Check changes → recent image tag update
- Root cause: Invalid image tag deployed
- Recommendation: Roll back to the previous image or fix the image tag

**Scenario 2: AWS EC2 Instance Issues**

- Get Instance details → status: stopped, health: unhealthy
- Check description → "InsufficientInstanceCapacity"
- Review changes → instance type changed to unavailable type
- Get related configs → Security Groups, Volumes
- Root cause: Requested instance type not available in AZ
- Recommendation: Change to available instance type or different AZ

skills/troubleshooting-health-checks/SKILL.md (new file, +105 lines)
@@ -0,0 +1,105 @@
---
name: troubleshooting-health-checks
description: Debugs and troubleshoots Mission Control health checks by analyzing check configurations, reviewing failure patterns, and identifying root causes. Use when users ask about failing health checks, mention specific health check names or IDs, inquire why a health check is failing or unhealthy, or need help understanding health check errors and timeouts.
allowed-tools: search_health_checks, get_check_status, run_health_check, list_all_checks, search_catalog, describe_config, search_catalog_changes
---

# Health Check Troubleshooting Skill

## Core Purpose

This skill enables Claude to troubleshoot Mission Control health checks by analyzing check configurations, diagnosing failure patterns, identifying timeout and error root causes, and recommending configuration adjustments to improve reliability.

Note: Read @skills/troubleshooting-health-checks/reference/query-syntax.md for the query syntax.

## Health check troubleshooting workflow

Copy this checklist and track your progress:

```
Troubleshooting Progress:
- [ ] Step 1: Gather health check information
- [ ] Step 2: Analyze failure patterns
- [ ] Step 3: Cross-reference configuration issues
- [ ] Step 4: Create diagnostic summary
- [ ] Step 5: Verify remediation steps
```

## Gather health check information

To begin, get the ID of the check in question.
Use `search_health_checks` with the query syntax to find checks (see @skills/troubleshooting-health-checks/reference/query-syntax.md).
If you cannot determine the health check ID from the user-provided name, use `list_all_checks` to get complete metadata for all health checks.

Then, follow this procedure:

- **Historical Context**: Use `get_check_status` to retrieve execution history
- **Investigate the check specification**: Understand the intention of the check.
- **Investigate the changes to the canary**: Use `search_catalog_changes(<canary_uuid>)` to get the changes on the canary.
  Review the change details for any recent modifications to the specification.
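
For example, the execution history from `get_check_status` might look roughly like the sketch below (the exact response shape is assumed here for illustration); the useful signals are the pass/fail status, the duration, and any error message per run:

```json
[
  { "time": "2025-11-28T09:00:00Z", "status": "healthy", "duration_ms": 3100 },
  { "time": "2025-11-28T09:05:00Z", "status": "healthy", "duration_ms": 4200 },
  { "time": "2025-11-28T09:10:00Z", "status": "unhealthy", "duration_ms": 5004, "error": "timeout exceeded" }
]
```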

## Analyze failure patterns

Examine the historical data to identify patterns. Look for:

- **Intermittent failures**: Passes sometimes, fails others
  - Suggests: Network instability, load-related issues, race conditions
- **Consistent failures**: Always failing
  - Suggests: Configuration error, endpoint down, authentication issue
- **Recent pattern changes**: Was passing, now failing
  - Suggests: Recent deployment, config change, infrastructure change
- **Timeout patterns**: Fails with timeout errors
  - Suggests: Performance degradation, insufficient timeout value
- **Time-based patterns**: Fails at specific times
  - Suggests: Scheduled jobs, traffic patterns, resource contention

Duration analysis:

- Increasing duration → Performance degrading (may lead to timeouts)
- Spiky duration → Intermittent load or resource contention
- Consistent slow duration → Timeout threshold too aggressive

## Create diagnostic summary

Organize findings systematically. Include:

- **Primary diagnosis**
  - Root cause identification with supporting evidence
  - Quote specific error messages from last_result
  - Reference historical pattern statistics
  - Cite configuration values that contribute to the issue
- **Contributing factors**
  - Secondary issues that may worsen the problem
  - Environmental factors (network, infrastructure)
  - Configuration mismatches
- **Impact assessment**
  - How long has the issue persisted
  - Frequency and severity of failures
  - Potential downstream effects

Example diagnostic format:

> The health check "api-status" (ID: check-123) is failing based on `get_check_status` history showing error "timeout exceeded" in recent executions. Historical data shows duration increasing from 3s to 5s over 6 hours. This indicates backend performance degradation requiring investigation and potential timeout adjustment.

## Verify remediation steps

Provide and validate specific fixes. For each recommendation:

- Use `run_health_check` to test fixes immediately
- Verify check passes after configuration changes
- Monitor execution duration and response

## Success criteria checklist

Before completing troubleshooting:

- [ ] Health check configuration fully analyzed
- [ ] Failure pattern clearly identified with evidence
- [ ] Root cause diagnosed with supporting data
- [ ] Specific remediation steps provided
- [ ] Configuration adjustments justified
- [ ] Validation approach included

skills/troubleshooting-health-checks/reference/query-syntax.md (new file, +23 lines)
@@ -0,0 +1,23 @@
### Query Syntax for search_health_checks

**Fields**: id, name, namespace, canary_id, type, status, agent_id, created_at, updated_at, deleted_at, labels.*, spec.*

**Operators**: =, :, !=, <, >, <=, >=

**Wildcards**:

- `value*`: prefix match
- `*value`: suffix match
- `*value*`: contains match

**Date Math**:

- Absolute: `YYYY-MM-DD`
- Relative: `now±N{s|m|h|d|w|mo|y}` (e.g., `now-24h`, `now-7d`)

**Examples**:

- `name=api* status=unhealthy` - Find unhealthy API checks
- `status=healthy labels.app=web` - Healthy checks with web label
- `created_at>now-24h` - Checks created in last 24 hours
- `updated_at>2025-01-01 updated_at<2025-01-31` - Checks updated in January

skills/troubleshooting-notifications/SKILL.md (new file, +253 lines)
@@ -0,0 +1,253 @@
---
name: troubleshooting-notifications
description: Investigates Mission Control notifications to identify root causes and provide remediation. Use when users mention notification IDs, ask about alerts or notifications, request help understanding "why did I get this notification", want to troubleshoot a specific alert, or ask about notification patterns and history. This skill retrieves notification details, analyzes historical patterns, routes to resource-specific troubleshooting (config items or health checks), correlates findings, and delivers actionable remediation steps with prevention recommendations.
allowed-tools: get_notification_detail, get_notifications_for_resource
---

# Notification Troubleshooting Skill

## When to Invoke This Skill

Invoke this skill when users:

- Mention a specific notification ID or title
- Ask to investigate or troubleshoot a notification
- Ask "why did I get this alert/notification"
- Request help understanding a Mission Control notification
- Ask about notification history or patterns

## Core Purpose

This skill enables Claude to investigate Mission Control notifications, trace them to their root cause, and provide actionable remediation steps by systematically analyzing notification details, resource context, and historical patterns.

## Understanding Notifications

A **Notification** represents an alert or event triggered by Mission Control when a resource experiences issues or state changes. Each notification contains:

- **id**: Unique identifier for the notification
- **title**: Short summary of the notification
- **message**: Detailed description of what triggered the notification
- **severity**: Alert level (critical, error, warning, info)
- **resource_id**: ID of the affected resource
- **resource_type**: Type of resource ("ConfigItem", "HealthCheck", etc.)
- **created_at**: When the notification was created
- **status**: Current status (active, resolved, acknowledged)
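
Put together, a notification with these fields might look roughly like this; it is an illustrative sketch built from the example values in Scenario 1 below, not a documented API response:

```json
{
  "id": "notif-789",
  "title": "Config item unhealthy",
  "message": "ConfigItem kubernetes/prod/api-deployment is unhealthy",
  "severity": "error",
  "resource_id": "config-123",
  "resource_type": "ConfigItem",
  "created_at": "2025-11-28T08:45:00Z",
  "status": "active"
}
```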

## Systematic Troubleshooting Workflow

Follow this step-by-step approach:

### Step 1: Retrieve Notification Details

**CALL** `get_notification_detail` with the notification ID

**OBSERVE** the response and extract:

- `title` and `message` - what is the alert about?
- `severity` - how critical is this?
- `resource_id` and `resource_type` - what resource is affected?
- `created_at` - when did this start?
- `status` - is this still active?

**ANALYZE** the message field carefully - it often contains:

- Error messages or stack traces
- Threshold violations
- State transition information
- Specific failure reasons

### Step 2: Analyze Notification History

**CALL** `get_notifications_for_resource` for the affected resource

**LOOK FOR** patterns:

- **Recurring notifications**: Same issue happening repeatedly
- **Frequency changes**: Issue getting worse or better
- **Event correlation**: Multiple related notifications around same time
- **Resolution patterns**: What changed when past notifications resolved

**IDENTIFY** if this is:

- A new issue (first occurrence)
- A recurring problem (happened before)
- Part of a larger incident (multiple resources affected)
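
For instance, a recurring pattern in the history (sketched below with assumed fields and values) would show the same title reappearing at regular intervals for the same resource:

```json
[
  { "id": "notif-787", "title": "Config item unhealthy", "resource_id": "config-123", "created_at": "2025-11-28T00:45:00Z", "status": "resolved" },
  { "id": "notif-788", "title": "Config item unhealthy", "resource_id": "config-123", "created_at": "2025-11-28T04:45:00Z", "status": "resolved" },
  { "id": "notif-789", "title": "Config item unhealthy", "resource_id": "config-123", "created_at": "2025-11-28T08:45:00Z", "status": "active" }
]
```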

### Step 3: Route to Resource-Specific Troubleshooting

Based on `resource_type`, invoke the appropriate skill:

**IF** `resource_type == "ConfigItem"`:

```
CALL Skill tool with skill="mission-control-skills:config_item"
PROVIDE the resource_id and context from notification
```

**IF** `resource_type == "HealthCheck"`:

```
CALL Skill tool with skill="mission-control-skills:health"
PROVIDE the resource_id and context from notification
```

**The invoked skill will**:

- Investigate the specific resource
- Analyze current state and changes
- Identify root cause
- Provide remediation steps

### Step 4: Correlate Findings

**SYNTHESIZE** information from:

1. Notification message and severity
2. Historical notification patterns
3. Resource-level investigation findings
4. Timing of events and changes

**DETERMINE**:

- Root cause of the notification
- Why it triggered at this specific time
- Whether issue is ongoing or resolved
- Related resources that may be affected

### Step 5: Provide Recommendations

**DELIVER**:

1. **Root Cause**: Clear explanation of what went wrong
2. **Evidence**: Specific data points supporting the diagnosis
3. **Remediation**: Step-by-step actions to resolve
4. **Prevention**: How to avoid this notification in the future

## Common Notification Scenarios

### Scenario 1: Config Item Unhealthy Notification

```
1. GET notification details
   → severity: error
   → message: "ConfigItem kubernetes/prod/api-deployment is unhealthy"
   → resource_type: ConfigItem
   → resource_id: "config-123"

2. GET notification history for config-123
   → 3 similar notifications in past 24h
   → Pattern: recurring every 4 hours

3. INVOKE config_item skill with resource_id
   → Skill finds: Pod CrashLoopBackOff
   → Root cause: OOMKilled - memory limit too low

4. SYNTHESIZE findings
   → Notification triggered by health check failure
   → Recurring because pod keeps restarting
   → Memory limit changed 3 days ago (from changes)

5. RECOMMEND
   → Increase memory limit from 512Mi to 1Gi
   → Monitor for next hour to confirm resolution
   → Set alert threshold higher to avoid false positives
```

### Scenario 2: Health Check Failure Notification

```
1. GET notification details
   → severity: critical
   → message: "Database connection timeout"
   → resource_type: HealthCheck
   → resource_id: "hc-456"

2. GET notification history for hc-456
   → First occurrence - new issue
   → No previous failures for this check

3. INVOKE health skill with resource_id
   → Skill finds: Network policy blocking connection
   → Changed 30 minutes ago

4. SYNTHESIZE findings
   → New network policy deployed
   → Blocks egress to database port
   → Health check can't reach database

5. RECOMMEND
   → Update network policy to allow database traffic
   → Test health check manually after fix
   → Review change approval process
```

## Error Handling

**IF** `get_notification_detail` returns not found:

- Verify notification ID is correct
- Check if user has permissions to view notification
- Ask user to confirm the notification ID

**IF** `resource_id` is null or missing:

- Notification may be system-level (not resource-specific)
- Analyze message for manual troubleshooting clues
- Search for related notifications in same timeframe

**IF** `resource_type` is not ConfigItem or HealthCheck:

- Investigate the resource type directly
- Use general troubleshooting principles
- Document findings and ask for guidance

**IF** notification history is empty:

- This is a new type of issue
- Focus more on recent changes
- Less context available for pattern analysis

## Critical Requirements

**Evidence-Based Analysis**:

- Quote specific error messages from notification
- Reference timestamps and correlation
- Support conclusions with data from tools

**Hierarchical Investigation**:

- Start with notification (symptom)
- Trace to resource (source)
- Navigate relationships (context)
- Identify change (cause)

**Actionable Remediation**:

- Provide specific commands or actions
- Explain why each step will help
- Include validation steps
- Consider prevention measures

## Skill Invocation Pattern

When routing to other skills, use this format:

```markdown
Based on the notification for resource_type="ConfigItem", I'm now invoking the config_item troubleshooting skill to investigate the underlying resource.

[CALL Skill tool with skill="mission-control-skills:config_item"]

[After skill returns]
The config_item skill has identified: [summarize findings]
Combined with the notification history showing [pattern], the root cause is [diagnosis].
```

## Key Success Criteria

✓ Notification context fully understood
✓ Historical patterns analyzed
✓ Appropriate skill invoked for resource type
✓ Root cause identified with evidence
✓ Clear remediation steps provided
✓ Prevention recommendations included