# Phase 1: Discovery and API Research

## Objective

Research and **DECIDE** autonomously which API or data source to use for the agent.

## Detailed Process

### Step 1: Identify Domain

From user input, extract the main domain:

| User Input | Identified Domain |
|------------------|---------------------|
| "US crop data" | Agriculture (US) |
| "stock market analysis" | Finance / Stock Market |
| "global climate data" | Climate / Meteorology |
| "economic indicators" | Economy / Macro |
| "commodity data" | Trading / Commodities |
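The mapping in the table can be approximated with a simple keyword lookup. The sketch below is only illustrative: the keyword lists and the `identify_domain` helper are hypothetical, and a real agent would usually infer the domain from context rather than from fixed keywords.

```python
from typing import Optional

# Hypothetical keyword lists (not from the source); extend them as needed.
DOMAIN_KEYWORDS = {
    "Agriculture (US)": ("crop", "harvest", "agricultur"),
    "Finance / Stock Market": ("stock", "equity", "ticker"),
    "Climate / Meteorology": ("climate", "weather", "temperature"),
    "Economy / Macro": ("economic", "gdp", "inflation"),
    "Trading / Commodities": ("commodity", "commodities", "futures"),
}


def identify_domain(user_input: str) -> Optional[str]:
    """Return the domain whose keywords match the input most often, or None."""
    text = user_input.lower()
    best_domain, best_hits = None, 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        hits = sum(1 for keyword in keywords if keyword in text)
        if hits > best_hits:
            best_domain, best_hits = domain, hits
    return best_domain
```

Returning `None` when nothing matches leaves room to ask the user for clarification instead of guessing.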

### Step 2: Search Available APIs

For the identified domain, use WebSearch to find public APIs:

**Search queries**:
```
"[domain] API free public data"
"[domain] government API documentation"
"best API for [domain] historical data"
"[domain] open data sources"
```

**Example (US agriculture)**:
```
WebSearch: "US agriculture API free historical data"
WebSearch: "USDA API documentation"
WebSearch: "agricultural statistics API United States"
```

**Typical result**: 5-10 candidate APIs
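The query templates from this step can be filled in programmatically. This is a minimal sketch; `QUERY_TEMPLATES` and `build_search_queries` are illustrative names, and the resulting strings would still be passed to WebSearch one at a time.

```python
from __future__ import annotations

# The four query templates listed in Step 2, with a {domain} placeholder.
QUERY_TEMPLATES = [
    "{domain} API free public data",
    "{domain} government API documentation",
    "best API for {domain} historical data",
    "{domain} open data sources",
]


def build_search_queries(domain: str) -> list[str]:
    """Substitute the identified domain into every template."""
    return [template.format(domain=domain) for template in QUERY_TEMPLATES]


if __name__ == "__main__":
    for query in build_search_queries("US agriculture"):
        print(query)
```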

### Step 3: Research Documentation

For each candidate API, use WebFetch to load:
- Homepage/overview
- Getting started guide
- API reference
- Rate limits and pricing

**Extract information**:

```markdown
## API 1: [Name]

**URL**: [base URL]
**Docs**: [docs URL]

**Authentication**:
- Type: API key / OAuth / None
- Cost: Free / Paid
- How to obtain: [steps]

**Available Data**:
- Temporal coverage: [from when to when]
- Geographic coverage: [countries, regions]
- Metrics: [list]
- Granularity: [daily, monthly, annual]

**Limitations**:
- Rate limit: [requests per day/hour]
- Max records: [per request]
- Throttling: [yes/no]

**Quality**:
- Source: [official government / private]
- Reliability: [high/medium/low]
- Update frequency: [frequency]

**Documentation**:
- Quality: [excellent/good/poor]
- Examples: [many/few/none]
- SDKs: [Python/R/None]

**Ease of Use**:
- Format: JSON / CSV / XML
- Structure: [simple/complex]
- Quirks: [any strange behavior?]
```

### Step 4: API Capability Inventory (NEW v2.0 - CRITICAL!)

**OBJECTIVE:** Ensure the skill uses 100% of API capabilities, not just the basics!

**LEARNING:** us-crop-monitor v1.0 used only CONDITION (1 of 5 NASS metrics).
v2.0 had to add PROGRESS, YIELD, PRODUCTION, AREA (+3,500 lines of rework).

**Process:**

**Step 4.1: Complete Inventory**

For the chosen API, catalog ALL data types:

```markdown
## Complete Inventory - {API Name}

**Available Metrics/Endpoints:**

| Endpoint/Metric | Returns       | Granularity    | Coverage | Value      |
|-----------------|---------------|----------------|----------|------------|
| {metric1}       | {description} | {daily/weekly} | {geo}    | ⭐⭐⭐⭐⭐ |
| {metric2}       | {description} | {monthly}      | {geo}    | ⭐⭐⭐⭐⭐ |
| {metric3}       | {description} | {annual}       | {geo}    | ⭐⭐⭐⭐   |
...

**Real Example (NASS):**

| Metric Type    | Data                | Frequency | Value      | Implement? |
|----------------|---------------------|-----------|------------|------------|
| CONDITION      | Quality ratings     | Weekly    | ⭐⭐⭐⭐⭐ | ✅ YES     |
| PROGRESS       | % planted/harvested | Weekly    | ⭐⭐⭐⭐⭐ | ✅ YES     |
| YIELD          | Bu/acre             | Monthly   | ⭐⭐⭐⭐⭐ | ✅ YES     |
| PRODUCTION     | Total bushels       | Monthly   | ⭐⭐⭐⭐⭐ | ✅ YES     |
| AREA           | Acres planted       | Annual    | ⭐⭐⭐⭐   | ✅ YES     |
| PRICE          | $/bushel            | Monthly   | ⭐⭐⭐     | ⚪ v2.0    |
```

**Step 4.2: Coverage Decision**

**GOLDEN RULE:**
- If metric has ⭐⭐⭐⭐ or ⭐⭐⭐⭐⭐ value → Implement in v1.0
- If API has 5 high-value metrics → Implement all 5!
- Never leave >50% of API unused without strong justification

**Step 4.3: Document Decision**

In DECISIONS.md:
```markdown
## API Coverage Decision

API {name} offers {N} types of metrics.

**Implemented in v1.0 ({X} of {N}):**
- {metric1} - {justification}
- {metric2} - {justification}
...

**Not implemented ({Y} of {N}):**
- {metricZ} - {why not} (planned for v2.0)

**Coverage:** {X/N * 100}% = {evaluation}
- If < 70%: Clearly explain why low coverage
- If ≥ 70%: ✅ Good coverage
```

**Output of this step:** Exact list of all `get_*()` methods to implement
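The golden rule from Step 4.2 and the coverage figure from Step 4.3 can be sketched as two small helpers. The function names are hypothetical, star ratings are assumed to be integers from 1 to 5, and treating a coverage of exactly 70% as good is this sketch's choice.

```python
from __future__ import annotations


def metrics_for_v1(star_ratings: dict[str, int]) -> list[str]:
    """Golden rule: metrics rated 4 or 5 stars are implemented in v1.0."""
    return [metric for metric, stars in star_ratings.items() if stars >= 4]


def coverage_report(implemented: int, total: int) -> str:
    """Format the coverage line for DECISIONS.md using the 70% threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    percent = implemented / total * 100
    if percent >= 70:
        verdict = "good coverage"
    else:
        verdict = "low coverage, justify in DECISIONS.md"
    return f"{implemented} of {total} = {percent:.0f}% ({verdict})"


# Star ratings copied from the NASS example table in Step 4.1.
NASS_RATINGS = {
    "CONDITION": 5,
    "PROGRESS": 5,
    "YIELD": 5,
    "PRODUCTION": 5,
    "AREA": 4,
    "PRICE": 3,
}

if __name__ == "__main__":
    selected = metrics_for_v1(NASS_RATINGS)
    print(selected)
    print(coverage_report(len(selected), len(NASS_RATINGS)))
```

Each selected metric then corresponds to one `get_*()` method in the implementation list.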

### Step 4: Compare Options

Create comparison table:

| API | Coverage | Cost | Rate Limit | Quality | Docs | Ease | Score |
|-----|-----------|-------|------------|-----------|------|------------|-------|
| API 1 | ⭐⭐⭐⭐⭐ | Free | 1000/day | Official | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 9.2/10 |
| API 2 | ⭐⭐⭐⭐ | $49/mo | Unlimited | Private | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 7.8/10 |
| API 3 | ⭐⭐⭐ | Free | 100/day | Private | ⭐⭐ | ⭐⭐⭐ | 5.5/10 |

**Scoring criteria**:
- Coverage (fit with need): 30% weight
- Cost (prefer free): 20% weight
- Rate limit (sufficient?): 15% weight
- Quality (official > private): 15% weight
- Documentation (facilitates implementation): 10% weight
- Ease of use (format, structure): 10% weight
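The weights above combine into a single score as a weighted sum. The sketch below assumes each criterion has already been rated on a 0-10 scale; how star ratings or costs convert to that scale is not specified here, so the ratings in the usage example are made up for illustration.

```python
from __future__ import annotations

# Weights from the scoring criteria; they sum to 1.0.
WEIGHTS = {
    "coverage": 0.30,
    "cost": 0.20,
    "rate_limit": 0.15,
    "quality": 0.15,
    "documentation": 0.10,
    "ease_of_use": 0.10,
}


def weighted_score(ratings: dict[str, float]) -> float:
    """Return the weighted score (0-10) for per-criterion ratings (0-10)."""
    missing = WEIGHTS.keys() - ratings.keys()
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    total = sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)
    return round(total, 1)


if __name__ == "__main__":
    # Hypothetical ratings for one candidate API.
    example = {
        "coverage": 10,
        "cost": 10,
        "rate_limit": 8,
        "quality": 10,
        "documentation": 8,
        "ease_of_use": 10,
    }
    print(weighted_score(example))
```

Raising on a missing criterion keeps a forgotten rating from silently lowering a candidate's score.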

### Step 5: DECIDE

**Consider user constraints**:
- Mentioned "free"? → Eliminate paid options
- Mentioned "10+ years historical data"? → Check coverage
- Mentioned "real-time"? → Prioritize streaming APIs

**Apply logic**:
1. Eliminate APIs that violate constraints
2. Of the remaining APIs, choose the one with the highest score
3. If there is a tie, prefer:
   - Official > private
   - Better documentation
   - Easier to use
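The elimination and tie-breaking logic can be sketched as a filter followed by a `max` over a tuple key. The `Candidate` fields are hypothetical, only the "free" constraint is modeled, and the documentation and ease ratings are assumed to be 1-5 star counts.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    score: float        # overall score out of 10
    is_free: bool
    is_official: bool
    docs_rating: int    # assumed 1-5 stars
    ease_rating: int    # assumed 1-5 stars


def choose_api(candidates: list[Candidate], require_free: bool = False) -> Candidate | None:
    """Drop candidates that violate the constraint, then pick the best one."""
    eligible = [c for c in candidates if c.is_free or not require_free]
    if not eligible:
        return None
    # Tuples compare left to right: score first, then the tie-breakers
    # in the stated order (official, documentation, ease of use).
    return max(
        eligible,
        key=lambda c: (c.score, c.is_official, c.docs_rating, c.ease_rating),
    )
```

Other constraints, such as a minimum history length or a real-time requirement, would become further conditions in the `eligible` filter.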

**FINAL DECISION**:

```markdown
## Selected API: [API Name]

**Score**: X.X/10

**Justification**:
- ✅ Coverage: [specific details]
- ✅ Cost: [free/paid + details]
- ✅ Rate limit: [number] requests/day (sufficient for [estimated usage])
- ✅ Quality: [official/private + reliability]
- ✅ Documentation: [quality + examples]
- ✅ Ease of use: [format, structure]

**Fit with requirements**:
- Constraint 1 (e.g., free): ✅ Met
- Constraint 2 (e.g., 10+ years history): ✅ Met (since [year])
- Primary need (e.g., crop production): ✅ Covered

**Alternatives Considered**:

**API X**: Score 7.5/10
- Rejected because: [specific reason]
- Trade-off: [what we lose vs gain]

**API Y**: Score 6.2/10
- Rejected because: [reason]

**Conclusion**:
[API Name] is the best option because [1-2 sentence synthesis].
```

### Step 6: Research Technical Details

After deciding, dive deep into documentation:

**Load via WebFetch**:
- Getting started guide
- Complete API reference
- Authentication guide
- Rate limiting details
- Best practices

**Extract for implementation**:

````markdown
## Technical Details - [API]

### Authentication

**Method**: API key in header
**Header**: `X-Api-Key: YOUR_KEY`
**Obtaining key**:
1. [step 1]
2. [step 2]
3. [step 3]

### Main Endpoints

**Endpoint 1**: [Name]
- **URL**: `GET https://api.example.com/v1/endpoint`
- **Parameters**:
  - `param1` (required): [description, type, example]
  - `param2` (optional): [description, type, default]
- **Response** (200 OK):
```json
{
  "data": [...],
  "meta": {...}
}
```
- **Errors**:
  - 400: [when occurs, how to handle]
  - 401: [when occurs, how to handle]
  - 429: [rate limit, how to handle]

**Example request**:
```bash
curl -H "X-Api-Key: YOUR_KEY" \
  "https://api.example.com/v1/endpoint?param1=value"
```

[Repeat for all relevant endpoints]

### Rate Limiting

- Limit: [number] requests per [period]
- Response headers:
  - `X-RateLimit-Limit`: Total limit
  - `X-RateLimit-Remaining`: Remaining requests
  - `X-RateLimit-Reset`: Reset timestamp
- Behavior when exceeded: [429 error, throttling, ban?]
- Best practice: [how to implement rate limiting]

### Quirks and Gotchas

**Quirk 1**: Values come as strings with formatting
- Example: `"2,525,000"` instead of `2525000`
- Solution: Remove commas before converting

**Quirk 2**: Suppressed data marked as "(D)"
- Meaning: Withheld to avoid disclosing data for individual operations
- Solution: Treat as NULL, signal to user

**Quirk 3**: [other non-obvious behavior]
- Solution: [how to handle]

### Performance Tips

- Historical data doesn't change → cache permanently
- Recent data may be revised → short cache (7 days)
- Use pagination parameters if large response
- Make parallel requests when possible (respecting rate limit)
````
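The quirks, caching tips, and rate-limit headers in the template translate into small helpers like the ones below. This is a sketch under stated assumptions: the function names are hypothetical, "recent" is taken to mean the last 365 days, and `X-RateLimit-Reset` is assumed to carry a Unix timestamp in seconds, which varies between APIs.

```python
from __future__ import annotations

from datetime import date, timedelta


def parse_value(raw: str) -> float | None:
    """Quirks 1 and 2: strip thousands separators; map "(D)" to None."""
    text = raw.strip()
    if text == "(D)":
        return None  # suppressed value: treat as NULL and tell the user
    return float(text.replace(",", ""))


def cache_ttl(period_end: date, today: date,
              recent_window_days: int = 365) -> timedelta | None:
    """Return None for 'cache permanently', else the 7-day TTL.

    The 365-day definition of "recent" is an assumption of this sketch.
    """
    if today - period_end > timedelta(days=recent_window_days):
        return None
    return timedelta(days=7)


def seconds_until_reset(headers: dict[str, str], now: float) -> float:
    """Seconds to wait before the next request; 0.0 if quota remains.

    Assumes X-RateLimit-Reset is a Unix timestamp in seconds.
    """
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

Passing `today` and `now` in as arguments rather than reading the clock inside the helpers keeps them easy to test.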

### Step 7: Document for Later Use

Save everything in the `references/api-guide.md` file of the agent being created.

## Discovery Examples

### Example 1: US Agriculture

**Input**: "US crop data"

**Research**:
```
WebSearch: "USDA API agricultural data"
→ Found: NASS QuickStats, ERS, FAS

WebFetch: https://quickstats.nass.usda.gov/api
→ Free, data since 1866, 1000/day rate limit

WebFetch: https://www.ers.usda.gov/developer/
→ Free, economic focus, less granular

WebFetch: https://apps.fas.usda.gov/api
→ International focus, not domestic
```

**Comparison**:
| API | Coverage (US domestic) | Cost | Production Data | Score |
|-----|---------------------------|-------|-------------------|-------|
| NASS | ⭐⭐⭐⭐⭐ (excellent) | Free | ⭐⭐⭐⭐⭐ | 9.5/10 |
| ERS | ⭐⭐⭐⭐ (good) | Free | ⭐⭐⭐ (economic) | 7.0/10 |
| FAS | ⭐⭐ (international) | Free | ⭐⭐ (global) | 4.0/10 |

**DECISION**: NASS QuickStats API
- Best coverage for US domestic agriculture
- Free with reasonable rate limit
- Complete production, area, yield data

### Example 2: Stock Market

**Input**: "technical stock analysis"

**Research**:
```
WebSearch: "stock market API free historical data"
→ Alpha Vantage, Yahoo Finance, IEX Cloud, Polygon.io

WebFetch: Alpha Vantage docs
→ Free, 5 requests/min, 500/day

WebFetch: Yahoo Finance (yfinance)
→ Free, unlimited but unofficial

WebFetch: IEX Cloud
→ Freemium, good docs, 50k free credits/month
```

**Comparison**:
| API | Data | Cost | Rate Limit | Official | Score |
|-----|-------|-------|------------|---------|-------|
| Alpha Vantage | Complete | Free | 500/day | ⭐⭐⭐ | 8.0/10 |
| Yahoo Finance | Complete | Free | Unlimited | ❌ Unofficial | 7.5/10 |
| IEX Cloud | Excellent | Freemium | 50k/month | ⭐⭐⭐⭐ | 8.5/10 |

**DECISION**: IEX Cloud (free tier)
- Official and reliable
- 50k free credits/month is sufficient
- Excellent documentation
- Complete data (OHLCV: open, high, low, close, volume)

### Example 3: Global Climate

**Input**: "global climate data"

**Research**:
```
WebSearch: "weather API historical data global"
→ NOAA, OpenWeather, Weather.gov, Meteostat

[Research each one...]
```

**DECISION**: NOAA Climate Data Online (CDO) API
- Official (US government)
- Free
- Global and historical coverage (1900+)
- Rate limit: 5 requests/second and 10,000 requests/day per token

## Decision Documentation

Create a `DECISIONS.md` file in the agent:

```markdown
# Architecture Decisions

## Date: [creation date]

## Phase 1: API Selection

### Chosen API

**[API Name]**

### Selection Process

**APIs Researched**: [list]

**Evaluation Criteria**:
1. Data coverage (fit with need)
2. Cost (preference for free)
3. Rate limits (viability)
4. Quality (official > private)
5. Documentation (facilitates development)
6. Ease of use (format, structure)

### Comparison

[Comparison table]

### Final Justification

[2-3 paragraphs explaining why this API was chosen]

### Trade-offs

**What we gain**:
- [benefit 1]
- [benefit 2]

**What we lose** (vs alternatives):
- [accepted limitation 1]
- [accepted limitation 2]

### Technical Details

[Summary of endpoints, authentication, rate limits, etc]

**Complete documentation**: See `references/api-guide.md`
```

## Phase 1 Checklist

Before proceeding to Phase 2, verify:

- [ ] Research completed (WebSearch + WebFetch)
- [ ] Minimum 3 APIs compared
- [ ] Decision made with clear justification
- [ ] User constraints respected
- [ ] API capability inventory completed (Step 4)
- [ ] Technical details extracted
- [ ] DECISIONS.md created
- [ ] Ready for analysis design