# Phase 1: Discovery and API Research

## Objective

Research and DECIDE autonomously which API or data source to use for the agent.

## Detailed Process

### Step 1: Identify Domain

From user input, extract the main domain:

| User Input | Identified Domain |
|------------|-------------------|
| "US crop data" | Agriculture (US) |
| "stock market analysis" | Finance / Stock Market |
| "global climate data" | Climate / Meteorology |
| "economic indicators" | Economy / Macro |
| "commodity data" | Trading / Commodities |

### Step 2: Search Available APIs

For the identified domain, use WebSearch to find public APIs:

**Search queries**:

```
"[domain] API free public data"
"[domain] government API documentation"
"best API for [domain] historical data"
"[domain] open data sources"
```

**Example (US agriculture)**:

```
WebSearch: "US agriculture API free historical data"
WebSearch: "USDA API documentation"
WebSearch: "agricultural statistics API United States"
```

Typical result: 5-10 candidate APIs

### Step 3: Research Documentation

For each candidate API, use WebFetch to load:

- Homepage/overview
- Getting started guide
- API reference
- Rate limits and pricing

Extract information:

```markdown
## API 1: [Name]

**URL**: [base URL]
**Docs**: [docs URL]

**Authentication**:
- Type: API key / OAuth / None
- Cost: Free / Paid
- How to obtain: [steps]

**Available Data**:
- Temporal coverage: [from when to when]
- Geographic coverage: [countries, regions]
- Metrics: [list]
- Granularity: [daily, monthly, annual]

**Limitations**:
- Rate limit: [requests per day/hour]
- Max records: [per request]
- Throttling: [yes/no]

**Quality**:
- Source: [official government / private]
- Reliability: [high/medium/low]
- Update frequency: [frequency]

**Documentation**:
- Quality: [excellent/good/poor]
- Examples: [many/few/none]
- SDKs: [Python/R/None]

**Ease of Use**:
- Format: JSON / CSV / XML
- Structure: [simple/complex]
- Quirks: [any strange behavior?]
```

### Step 4: API Capability Inventory (NEW v2.0 - CRITICAL!)

**OBJECTIVE:** Ensure the skill uses 100% of API capabilities, not just the basics!

**LEARNING:** us-crop-monitor v1.0 used only CONDITION (1 of 5 NASS metrics).
v2.0 had to add PROGRESS, YIELD, PRODUCTION, AREA (+3,500 lines of rework).

**Process:**

**Step 4.1: Complete Inventory**

For the chosen API, catalog ALL data types:

```markdown
## Complete Inventory - {API Name}

**Available Metrics/Endpoints:**

| Endpoint/Metric | Returns | Granularity | Coverage | Value |
|-----------------|---------------|---------------|-----------|-------|
| {metric1}       | {description}   | {daily/weekly}| {geo}     | ⭐⭐⭐⭐⭐ |
| {metric2}       | {description}   | {monthly}     | {geo}     | ⭐⭐⭐⭐⭐ |
| {metric3}       | {description}   | {annual}      | {geo}     | ⭐⭐⭐⭐  |
...
```

**Real Example (NASS):**

| Metric Type    | Data               | Frequency | Value    | Implement? |
|----------------|--------------------| ----------|----------|------------|
| CONDITION      | Quality ratings    | Weekly    | ⭐⭐⭐⭐⭐ | ✅ YES     |
| PROGRESS       | % planted/harvested| Weekly    | ⭐⭐⭐⭐⭐ | ✅ YES     |
| YIELD          | Bu/acre            | Monthly   | ⭐⭐⭐⭐⭐ | ✅ YES     |
| PRODUCTION     | Total bushels      | Monthly   | ⭐⭐⭐⭐⭐ | ✅ YES     |
| AREA           | Acres planted      | Annual    | ⭐⭐⭐⭐  | ✅ YES     |
| PRICE          | $/bushel           | Monthly   | ⭐⭐⭐    | ⚪ v2.0    |

**Step 4.2: Coverage Decision**

**GOLDEN RULE:**

- If a metric is rated ⭐⭐⭐⭐ or ⭐⭐⭐⭐⭐ → implement it in v1.0
- If the API has 5 high-value metrics → implement all 5!
- Never leave >50% of the API unused without strong justification

**Step 4.3: Document Decision**

In `DECISIONS.md`:

```markdown
## API Coverage Decision

API {name} offers {N} types of metrics.

**Implemented in v1.0 ({X} of {N}):**
- {metric1} - {justification}
- {metric2} - {justification}
...

**Not implemented ({Y} of {N}):**
- {metricZ} - {why not} (planned for v2.0)

**Coverage:** {X/N * 100}% = {evaluation}
- If < 70%: clearly explain why coverage is low
- If > 70%: ✅ Good coverage
```

**Output of this phase**: the exact list of all `get_*()` methods to implement.
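
As a quick self-check, the coverage arithmetic above is easy to script. A minimal Python sketch (the metric sets reuse the NASS example from Step 4.1 and are illustrative, not output of any real client):

```python
# Hypothetical coverage check for the Phase 1 output (illustrative names).
implemented = {"CONDITION", "PROGRESS", "YIELD", "PRODUCTION", "AREA"}
available = implemented | {"PRICE"}  # everything the API offers

coverage = len(implemented) / len(available) * 100
print(f"Coverage: {coverage:.0f}% ({len(implemented)} of {len(available)})")
if coverage < 70:
    print("⚠️ Below 70% - explain the low coverage in DECISIONS.md")
else:
    print("✅ Good coverage")
```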

### Step 5: Compare Options

Create comparison table:

| API | Coverage | Cost | Rate Limit | Quality | Docs | Ease | Score |
|-----|-----------|-------|------------|-----------|------|------------|-------|
| API 1 | ⭐⭐⭐⭐⭐ | Free | 1000/day | Official | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 9.2/10 |
| API 2 | ⭐⭐⭐⭐ | $49/mo | Unlimited | Private | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 7.8/10 |
| API 3 | ⭐⭐⭐ | Free | 100/day | Private | ⭐⭐ | ⭐⭐⭐ | 5.5/10 |

**Scoring criteria**:
- Coverage (fit with need): 30% weight
- Cost (prefer free): 20% weight
- Rate limit (sufficient?): 15% weight
- Quality (official > private): 15% weight
- Documentation (facilitates implementation): 10% weight
- Ease of use (format, structure): 10% weight
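
To keep scoring consistent across candidates, the weights can be applied mechanically. A minimal sketch (the 0-10 sub-score scale is an assumption; the text above only fixes the weights):

```python
# Weights from the scoring criteria above.
WEIGHTS = {
    "coverage": 0.30, "cost": 0.20, "rate_limit": 0.15,
    "quality": 0.15, "docs": 0.10, "ease": 0.10,
}

def score_api(ratings: dict[str, float]) -> float:
    """Combine 0-10 sub-scores into a weighted total out of 10."""
    return round(sum(WEIGHTS[k] * ratings[k] for k in WEIGHTS), 1)

# Illustrative sub-scores for a candidate like "API 1" in the table above.
print(score_api({"coverage": 10, "cost": 10, "rate_limit": 8,
                 "quality": 9, "docs": 8, "ease": 10}))  # -> 9.3
```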

### Step 6: DECIDE

**Consider user constraints**:
- Mentioned "free"? → Eliminate paid options
- Mentioned "10+ years historical data"? → Check coverage
- Mentioned "real-time"? → Prioritize streaming APIs

**Apply logic**:
1. Eliminate APIs that violate constraints
2. Of remaining, choose highest score
3. If tie, prefer:
   - Official > private
   - Better documentation
   - Easier to use
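
This eliminate-then-rank logic is mechanical enough to sketch. A minimal Python illustration (the candidate records and their fields are hypothetical):

```python
# Hypothetical candidates assembled during Steps 3-5.
candidates = [
    {"name": "API 1", "score": 9.2, "free": True,  "official": True},
    {"name": "API 2", "score": 7.8, "free": False, "official": False},
    {"name": "API 3", "score": 5.5, "free": True,  "official": False},
]

# 1. Eliminate APIs that violate user constraints (here: must be free).
viable = [c for c in candidates if c["free"]]

# 2-3. Highest score wins; official status breaks ties.
winner = max(viable, key=lambda c: (c["score"], c["official"]))
print(winner["name"])  # -> API 1
```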

**FINAL DECISION**:

```markdown
## Selected API: [API Name]

**Score**: X.X/10

**Justification**:
- ✅ Coverage: [specific details]
- ✅ Cost: [free/paid + details]
- ✅ Rate limit: [number] requests/day (sufficient for [estimated usage])
- ✅ Quality: [official/private + reliability]
- ✅ Documentation: [quality + examples]
- ✅ Ease of use: [format, structure]

**Fit with requirements**:
- Constraint 1 (e.g., free): ✅ Met
- Constraint 2 (e.g., 10+ years history): ✅ Met (since [year])
- Primary need (e.g., crop production): ✅ Covered

**Alternatives Considered**:

**API X**: Score 7.5/10
- Rejected because: [specific reason]
- Trade-off: [what we lose vs gain]

**API Y**: Score 6.2/10
- Rejected because: [reason]

**Conclusion**:
[API Name] is the best option because [1-2 sentence synthesis].
```

### Step 7: Research Technical Details

After deciding, dive deep into documentation:

Load via WebFetch:

- Getting started guide
- Complete API reference
- Authentication guide
- Rate limiting details
- Best practices

Extract for implementation:

## Technical Details - [API]

### Authentication

**Method**: API key in header
**Header**: `X-Api-Key: YOUR_KEY`
**Obtaining key**:
1. [step 1]
2. [step 2]
3. [step 3]

### Main Endpoints

**Endpoint 1**: [Name]
- **URL**: `GET https://api.example.com/v1/endpoint`
- **Parameters**:
  - `param1` (required): [description, type, example]
  - `param2` (optional): [description, type, default]
- **Response** (200 OK):
```json
{
  "data": [...],
  "meta": {...}
}
```
- **Errors**:
  - 400: [when it occurs, how to handle]
  - 401: [when it occurs, how to handle]
  - 429: [rate limit, how to handle]

**Example request**:

```bash
curl -H "X-Api-Key: YOUR_KEY" \
  "https://api.example.com/v1/endpoint?param1=value"
```

[Repeat for all relevant endpoints]

### Rate Limiting

- Limit: [number] requests per [period]
- Response headers:
  - `X-RateLimit-Limit`: total limit
  - `X-RateLimit-Remaining`: remaining requests
  - `X-RateLimit-Reset`: reset timestamp
- Behavior when exceeded: [429 error, throttling, ban?]
- Best practice: [how to implement rate limiting] (see the sketch below)
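
A minimal sketch of header-aware rate limiting, assuming the API actually returns the `X-RateLimit-*` and `Retry-After` headers listed above (not all do; verify against the docs you fetched):

```python
import time

import requests  # common third-party HTTP client

def get_with_rate_limit(url: str, headers: dict) -> requests.Response:
    """GET that backs off on 429 and slows down as the quota runs out."""
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 429:
        # Retry-After (in seconds) is a common convention, not universal.
        time.sleep(int(resp.headers.get("Retry-After", 60)))
        resp = requests.get(url, headers=headers, timeout=30)
    remaining = resp.headers.get("X-RateLimit-Remaining")
    if remaining is not None and int(remaining) < 10:
        time.sleep(1)  # be gentle when close to the limit
    return resp
```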

### Quirks and Gotchas

**Quirk 1**: Values come as strings with formatting

- Example: `"2,525,000"` instead of `2525000`
- Solution: remove commas before converting

**Quirk 2**: Suppressed data marked as `"(D)"`

- Meaning: withheld to avoid disclosing data
- Solution: treat as NULL and flag it to the user (both quirks are handled in the sketch below)

**Quirk 3**: [other non-obvious behavior]

- Solution: [how to handle]
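
A minimal parsing helper covering both quirks (a sketch; the marker and number format follow the examples above):

```python
def parse_value(raw: str) -> float | None:
    """Handle comma-formatted numbers and the "(D)" suppression marker."""
    cleaned = raw.strip()
    if cleaned == "(D)":  # suppressed value - surface as NULL, flag to user
        return None
    return float(cleaned.replace(",", ""))

assert parse_value("2,525,000") == 2525000.0
assert parse_value("(D)") is None
```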

### Performance Tips

- Historical data doesn't change → cache it permanently
- Recent data may be revised → short cache (7 days); see the sketch below
- Use pagination parameters for large responses
- Make parallel requests when possible (respecting the rate limit)
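
A minimal in-memory sketch of that two-tier policy (a real agent would likely persist the cache to disk; `fetch` is any zero-argument callable you supply):

```python
import time

_cache: dict[str, tuple[float, object]] = {}
RECENT_TTL = 7 * 24 * 3600  # 7 days, for data that may still be revised

def cached_fetch(key: str, fetch, historical: bool):
    """Return a cached value: permanent for historical data, 7-day TTL otherwise."""
    now = time.time()
    if key in _cache:
        ts, value = _cache[key]
        if historical or now - ts < RECENT_TTL:
            return value
    value = fetch()
    _cache[key] = (now, value)
    return value
```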

### Step 8: Document for Later Use

Save everything in `references/api-guide.md` of the agent being created.

## Discovery Examples

### Example 1: US Agriculture

**Input**: "US crop data"

**Research**:

```
WebSearch: "USDA API agricultural data" → Found: NASS QuickStats, ERS, FAS
WebFetch: https://quickstats.nass.usda.gov/api → Free, data since 1866, 1000/day rate limit
WebFetch: https://www.ers.usda.gov/developer/ → Free, economic focus, less granular
WebFetch: https://apps.fas.usda.gov/api → International focus, not domestic
```


**Comparison**:
| API | Coverage (US domestic) | Cost | Production Data | Score |
|-----|---------------------------|-------|-------------------|-------|
| NASS | ⭐⭐⭐⭐⭐ (excellent) | Free | ⭐⭐⭐⭐⭐ | 9.5/10 |
| ERS | ⭐⭐⭐⭐ (good) | Free | ⭐⭐⭐ (economic) | 7.0/10 |
| FAS | ⭐⭐ (international) | Free | ⭐⭐ (global) | 4.0/10 |

**DECISION**: NASS QuickStats API
- Best coverage for US domestic agriculture
- Free with reasonable rate limit
- Complete production, area, yield data

### Example 2: Stock Market

**Input**: "technical stock analysis"

**Research**:

```
WebSearch: "stock market API free historical data" → Alpha Vantage, Yahoo Finance, IEX Cloud, Polygon.io
WebFetch: Alpha Vantage docs → Free, 5 requests/min, 500/day
WebFetch: Yahoo Finance (yfinance) → Free, unlimited but unofficial
WebFetch: IEX Cloud → Freemium, good docs, 50k free credits/month
```


**Comparison**:
| API | Data | Cost | Rate Limit | Official | Score |
|-----|-------|-------|------------|---------|-------|
| Alpha Vantage | Complete | Free | 500/day | ⭐⭐⭐ | 8.0/10 |
| Yahoo Finance | Complete | Free | Unlimited | ❌ Unofficial | 7.5/10 |
| IEX Cloud | Excellent | Freemium | 50k/month | ⭐⭐⭐⭐ | 8.5/10 |

**DECISION**: IEX Cloud (free tier)
- Official and reliable
- 50k requests/month sufficient
- Excellent documentation
- Complete data (OHLCV + volume)

### Example 3: Global Climate

**Input**: "global climate data"

**Research**:

```
WebSearch: "weather API historical data global" → NOAA, OpenWeather, Weather.gov, Meteostat
```

[Research each one...]


**DECISION**: NOAA Climate Data Online (CDO) API
- Official (US government)
- Free
- Global and historical coverage (1900+)
- Rate limit: 1000/day

## Decision Documentation

Create `DECISIONS.md` file in agent:

```markdown
# Architecture Decisions

## Date: [creation date]

## Phase 1: API Selection

### Chosen API

**[API Name]**

### Selection Process

**APIs Researched**: [list]

**Evaluation Criteria**:
1. Data coverage (fit with need)
2. Cost (preference for free)
3. Rate limits (viability)
4. Quality (official > private)
5. Documentation (facilitates development)

### Comparison

[Comparison table]

### Final Justification

[2-3 paragraphs explaining why this API was chosen]

### Trade-offs

**What we gain**:
- [benefit 1]
- [benefit 2]

**What we lose** (vs alternatives):
- [accepted limitation 1]
- [accepted limitation 2]

### Technical Details

[Summary of endpoints, authentication, rate limits, etc]

**Complete documentation**: See `references/api-guide.md`
```

## Phase 1 Checklist

Before proceeding to Phase 2, verify:

- [ ] Research completed (WebSearch + WebFetch)
- [ ] Minimum 3 APIs compared
- [ ] Decision made with clear justification
- [ ] User constraints respected
- [ ] Technical details extracted
- [ ] DECISIONS.md created
- [ ] Ready for analysis design