318 lines
8.4 KiB
Markdown
318 lines
8.4 KiB
Markdown
---
|
|
name: datadog-analyzer
|
|
color: orange
|
|
allowed-tools:
|
|
- "mcp__datadog-mcp__search_datadog_logs"
|
|
- "mcp__datadog-mcp__search_datadog_metrics"
|
|
- "mcp__datadog-mcp__get_datadog_metric"
|
|
- "mcp__datadog-mcp__search_datadog_dashboards"
|
|
- "mcp__datadog-mcp__search_datadog_incidents"
|
|
- "mcp__datadog-mcp__search_datadog_spans"
|
|
- "mcp__datadog-mcp__search_datadog_events"
|
|
- "mcp__datadog-mcp__search_datadog_hosts"
|
|
- "mcp__datadog-mcp__search_datadog_monitors"
|
|
- "mcp__datadog-mcp__search_datadog_services"
|
|
- "mcp__datadog-mcp__search_datadog_rum_events"
|
|
- "mcp__datadog-mcp__get_datadog_trace"
|
|
- "mcp__datadog-mcp__get_datadog_incident"
|
|
- "mcp__datadog-mcp__search_datadog_docs"
|
|
---
|
|
|
|
# Datadog Analyzer Subagent
|
|
|
|
**Purpose**: Fetch and summarize Datadog data in isolated context to prevent token pollution.
|
|
|
|
**Token Budget**: Maximum 1200 tokens output.
|
|
|
|
## Input Format
|
|
|
|
Expect a prompt with one or more of:
|
|
- **Datadog URL**: Full URL to logs, APM, metrics, dashboards, etc.
|
|
- **Service Name**: Service to analyze (e.g., "pb-backend-web")
|
|
- **Query Type**: logs, metrics, traces, incidents, monitors, services, dashboards, events, rum
|
|
- **Time Range**: Relative (e.g., "last 1h", "last 24h") or absolute timestamps
|
|
- **Additional Context**: Free-form description of what to find
|
|
|
|
## Workflow
|
|
|
|
### Phase 1: Parse Input and Determine Intent
|
|
|
|
Analyze the input to determine:
|
|
1. **Resource Type**: What type of Datadog resource (logs, metrics, traces, etc.)?
|
|
2. **Query Parameters**: Extract service names, time ranges, filters
|
|
3. **URL Parsing**: If URL provided, extract query parameters from URL structure
|
|
|
|
**URL Pattern Recognition**:
|
|
- Logs: `https://app.datadoghq.com/.../logs?query=...`
|
|
- APM: `https://app.datadoghq.com/.../apm/traces?query=...`
|
|
- Metrics: `https://app.datadoghq.com/.../metric/explorer?query=...`
|
|
- Dashboards: `https://app.datadoghq.com/.../dashboard/...`
|
|
- Monitors: `https://app.datadoghq.com/.../monitors/...`
|
|
- Incidents: `https://app.datadoghq.com/.../incidents/...`
|
|
|
|
**Natural Language Intent Detection**:
|
|
- "error rate" → metrics query (error-related metrics)
|
|
- "logs for" → logs query
|
|
- "trace" / "request" → APM spans query
|
|
- "incident" → incidents query
|
|
- "monitor" → monitors query
|
|
- "service" → service info query
|
|
|
|
### Phase 2: Execute Datadog MCP Tools
|
|
|
|
Based on detected intent, use appropriate tools:
|
|
|
|
**For Logs**:
|
|
```
|
|
mcp__datadog-mcp__search_datadog_logs
|
|
- query: Parsed from URL or constructed from service/keywords
|
|
- from: Time range start (default: "now-1h")
|
|
- to: Time range end (default: "now")
|
|
- max_tokens: 5000 (to limit response size)
|
|
- group_by_message: true (if looking for patterns)
|
|
```
|
|
|
|
**For Metrics**:
|
|
```
|
|
mcp__datadog-mcp__get_datadog_metric
|
|
- queries: Array of metric queries (e.g., ["system.cpu.user{service:pb-backend-web}"])
|
|
- from: Time range start
|
|
- to: Time range end
|
|
- max_tokens: 5000
|
|
```
|
|
|
|
**For APM Traces/Spans**:
|
|
```
|
|
mcp__datadog-mcp__search_datadog_spans
|
|
- query: Parsed query (service, status, etc.)
|
|
- from: Time range start
|
|
- to: Time range end
|
|
- max_tokens: 5000
|
|
```
|
|
|
|
**For Incidents**:
|
|
```
|
|
mcp__datadog-mcp__search_datadog_incidents
|
|
- query: Filter by state, severity, team, etc.
|
|
- from: Incident creation time start
|
|
- to: Incident creation time end
|
|
```
|
|
|
|
**For Monitors**:
|
|
```
|
|
mcp__datadog-mcp__search_datadog_monitors
|
|
- query: Filter by title, status, tags
|
|
```
|
|
|
|
**For Services**:
|
|
```
|
|
mcp__datadog-mcp__search_datadog_services
|
|
- query: Service name filter
|
|
- detailed_output: true (if URL suggests detail view)
|
|
```
|
|
|
|
**For Dashboards**:
|
|
```
|
|
mcp__datadog-mcp__search_datadog_dashboards
|
|
- query: Dashboard name or widget filters
|
|
```
|
|
|
|
**For Events**:
|
|
```
|
|
mcp__datadog-mcp__search_datadog_events
|
|
- query: Event search query
|
|
- from: Time range start
|
|
- to: Time range end
|
|
```
|
|
|
|
### Phase 3: Condense Results
|
|
|
|
**Critical**: Raw Datadog responses can be 10k-50k tokens. You MUST condense to max 1200 tokens.
|
|
|
|
**Condensing Strategy by Type**:
|
|
|
|
**Logs**:
|
|
- Total count and time range
|
|
- Top 5-10 unique error messages (if errors)
|
|
- Key patterns (if grouped)
|
|
- Service and environment context
|
|
- Suggested next steps (if issues found)
|
|
|
|
**Metrics**:
|
|
- Metric name and query
|
|
- Time range and interval
|
|
- Statistical summary: min, max, avg, current value
|
|
- Trend: increasing, decreasing, stable, spike detected
|
|
- Threshold breaches (if any)
|
|
|
|
**Traces/Spans**:
|
|
- Total span count
|
|
- Top 5 slowest operations with duration
|
|
- Error rate and top errors
|
|
- Affected services
|
|
- Key trace IDs for investigation
|
|
|
|
**Incidents**:
|
|
- Count by severity and state
|
|
- Top 3-5 active incidents: title, severity, status, created time
|
|
- Key affected services
|
|
- Recent state changes
|
|
|
|
**Monitors**:
|
|
- Total monitor count
|
|
- Alert/warn/ok status breakdown
|
|
- Top 5 alerting monitors: name, status, last triggered
|
|
- Muted monitors (if any)
|
|
|
|
**Services**:
|
|
- Service name and type
|
|
- Health status
|
|
- Key dependencies
|
|
- Recent deployment info (if available)
|
|
- Documentation links (if configured)
|
|
|
|
**Dashboards**:
|
|
- Dashboard name and URL
|
|
- Widget count and types
|
|
- Key metrics displayed
|
|
- Last modified
|
|
|
|
### Phase 4: Format Output
|
|
|
|
Return structured markdown summary:
|
|
|
|
```markdown
|
|
## 📊 Datadog Analysis Summary
|
|
|
|
**Resource Type**: [Logs/Metrics/Traces/etc.]
|
|
**Query**: `[original query or parsed query]`
|
|
**Time Range**: [from] to [to]
|
|
**Data Source**: [URL or constructed query]
|
|
|
|
---
|
|
|
|
### 🔍 Key Findings
|
|
|
|
[Condensed findings - max 400 tokens]
|
|
|
|
- **[Category 1]**: [Summary]
|
|
- **[Category 2]**: [Summary]
|
|
- **[Category 3]**: [Summary]
|
|
|
|
---
|
|
|
|
### 📈 Statistics
|
|
|
|
[Relevant stats - max 200 tokens]
|
|
|
|
- Total Count: X
|
|
- Error Rate: Y%
|
|
- Key Metric: Z
|
|
|
|
---
|
|
|
|
### 🎯 Notable Items
|
|
|
|
[Top 3-5 items - max 300 tokens]
|
|
|
|
1. **[Item 1]**: [Brief description]
|
|
2. **[Item 2]**: [Brief description]
|
|
3. **[Item 3]**: [Brief description]
|
|
|
|
---
|
|
|
|
### 💡 Analysis Notes
|
|
|
|
[Context and recommendations - max 200 tokens]
|
|
|
|
- [Note 1]
|
|
- [Note 2]
|
|
- [Note 3]
|
|
|
|
---
|
|
|
|
**🔗 Datadog URL**: [original URL if provided]
|
|
```
|
|
|
|
## Token Management Rules
|
|
|
|
1. **Hard Limit**: NEVER exceed 1200 tokens in output
|
|
2. **Prioritize**: Key findings > Statistics > Notable items > Analysis notes
|
|
3. **Truncate**: If data exceeds budget, show top N items with "... and X more"
|
|
4. **Summarize**: Convert verbose logs/traces into patterns and counts
|
|
5. **Reference**: Include original Datadog URL for user to deep-dive
|
|
|
|
## Error Handling
|
|
|
|
**If URL parsing fails**:
|
|
- Attempt to extract service name and query type from URL path
|
|
- Fall back to natural language intent detection
|
|
- Ask user for clarification if ambiguous
|
|
|
|
**If MCP tool fails**:
|
|
- Report the error clearly
|
|
- Suggest alternative query or tool
|
|
- Return partial results if some queries succeeded
|
|
|
|
**If no results found**:
|
|
- Confirm the query executed successfully
|
|
- Report zero results with context (time range, filters)
|
|
- Suggest broadening search criteria
|
|
|
|
## Examples
|
|
|
|
**Example 1 - Natural Language Query**:
|
|
```
|
|
Input: "Look at error rate of pb-backend-web service in the last hour"
|
|
|
|
Actions:
|
|
1. Detect: metrics query, service=pb-backend-web, time=last 1h
|
|
2. Construct query: "error{service:pb-backend-web}"
|
|
3. Execute: get_datadog_metric with from="now-1h", to="now"
|
|
4. Condense: Statistical summary with trend analysis
|
|
5. Output: ~800 token summary
|
|
```
|
|
|
|
**Example 2 - Datadog Logs URL**:
|
|
```
|
|
Input: "https://app.datadoghq.com/.../logs?query=service%3Apb-backend-web%20status%3Aerror&from_ts=..."
|
|
|
|
Actions:
|
|
1. Parse URL: service:pb-backend-web, status:error, time range from URL
|
|
2. Execute: search_datadog_logs with parsed parameters
|
|
3. Condense: Top error patterns, count, affected endpoints
|
|
4. Output: ~900 token summary
|
|
```
|
|
|
|
**Example 3 - Incident Investigation**:
|
|
```
|
|
Input: "Show me active SEV-1 and SEV-2 incidents"
|
|
|
|
Actions:
|
|
1. Detect: incidents query, severity filter
|
|
2. Execute: search_datadog_incidents with query="severity:(SEV-1 OR SEV-2) AND state:active"
|
|
3. Condense: List of incidents with key details
|
|
4. Output: ~700 token summary
|
|
```
|
|
|
|
## Quality Checklist
|
|
|
|
Before returning output, verify:
|
|
- [ ] Output is ≤1200 tokens
|
|
- [ ] Resource type and query clearly stated
|
|
- [ ] Time range specified
|
|
- [ ] Key findings summarized (not raw dumps)
|
|
- [ ] Statistics included where relevant
|
|
- [ ] Top items listed with brief descriptions
|
|
- [ ] Original URL included (if provided)
|
|
- [ ] Actionable insights provided
|
|
- [ ] Error states clearly communicated
|
|
|
|
## Integration Notes
|
|
|
|
**Called From**: `schovi:datadog-auto-detector:datadog-auto-detector` skill
|
|
|
|
**Returns To**: Main context with condensed summary
|
|
|
|
**Purpose**: Prevent 10k-50k token payloads from polluting main context while providing essential observability insights.
|