# Data Extraction

Extract specific information from unstructured or semi-structured data, completely and accurately.

## Common Patterns

| Type | Pattern | Validation |
|------|---------|------------|
| Email | `user@domain.ext` | Contains `@`, with a `.` in the domain part after it |
| URL | `http(s)://domain...` | Valid protocol and domain |
| Date | ISO, US, EU, timestamp | Valid ranges (month 1-12, day valid for that month) |
| Phone | Various formats | 7-15 digits after stripping formatting |
| IP | IPv4: `x.x.x.x`, IPv6 | IPv4 octets 0-255; IPv6 groups are valid hex |
| Key-Value | `key=value`, `key: value` | Handle quoted/nested |
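
As a rough illustration of the table above, the sketch below pairs a deliberately loose candidate pattern with a separate validation step. The regexes and the helper names (`find_emails`, `find_ipv4`, `is_valid_ip`, `is_valid_phone`) are assumptions made for this example, not a complete or authoritative grammar for any of these types.

```python
import ipaddress
import re

# Loose candidate patterns (illustrative only); validation is a second step.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IPV4_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def is_valid_ip(candidate: str) -> bool:
    """Return True if the candidate parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(candidate)
        return True
    except ValueError:
        return False


def is_valid_phone(candidate: str) -> bool:
    """Apply the 7-15 digit rule after stripping all non-digit formatting."""
    digits = re.sub(r"\D", "", candidate)
    return 7 <= len(digits) <= 15


def find_emails(text: str) -> list[str]:
    """Return every email-like match; the pattern requires a dot after the @."""
    return EMAIL_RE.findall(text)


def find_ipv4(text: str) -> list[str]:
    """Return dotted-quad matches whose octets are all in range."""
    return [m for m in IPV4_RE.findall(text) if is_valid_ip(m)]
```

Keeping the pattern loose and the validator strict means a near miss such as `999.1.1.1` is still seen as a candidate, so it can be reported instead of silently skipped.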

## Process

1. **Analyze:** Format, delimiters, variations, headers to skip
2. **Extract:** Match all instances, capture context, handle partial matches
3. **Clean:** Trim, normalize (dates to ISO, phones to digits), validate
4. **Format:** Consistent fields, proper escaping, sort/dedupe if needed

## Output Formats

**JSON:** `{"results": [...], "summary": {"total": N, "unique": N}}`

**CSV:** Headers + rows

**Markdown:** Table with headers

**Plain:** Bullet list
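
One possible way to render the same list of extracted values in each of these formats is sketched below. The function names and the single-column layout are assumptions for illustration; the JSON shape follows the `results`/`summary` structure shown above.

```python
import csv
import io
import json


def to_json(values: list[str]) -> str:
    """Wrap results with a summary of total and unique counts."""
    summary = {"total": len(values), "unique": len(set(values))}
    return json.dumps({"results": values, "summary": summary})


def to_csv(values: list[str], header: str = "value") -> str:
    """Write a header row plus one row per value; csv handles quoting."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([header])
    writer.writerows([v] for v in values)
    return buf.getvalue()


def to_markdown(values: list[str], header: str = "Value") -> str:
    """Build a one-column table, escaping pipes so cells stay intact."""
    lines = [f"| {header} |", "| --- |"]
    for v in values:
        escaped = v.replace("|", "\\|")
        lines.append(f"| {escaped} |")
    return "\n".join(lines)


def to_plain(values: list[str]) -> str:
    """Render a simple bullet list."""
    return "\n".join(f"- {v}" for v in values)
```

Delegating CSV quoting to the `csv` module and escaping `|` in Markdown cells covers the escaping that values containing delimiters need.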

## Principles

- **Complete:** Extract ALL matches; don't stop early
- **Accurate:** Preserve exact values, maintain case
- **Handle edge cases:** Missing → null, malformed → flag, duplicates → note
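
The edge-case rules might look like this in practice. This is a minimal sketch under two assumptions: malformed values are reported as issues and left out of the results, and the validator is supplied by the caller. The `collect` function and its return shape are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional


def collect(
    candidates: list[Optional[str]],
    is_valid: Callable[[str], bool],
) -> tuple[list[Optional[str]], list[str], list[str]]:
    """Return (results, issues, notes) for candidates in source order."""
    results: list[Optional[str]] = []
    issues: list[str] = []
    for position, value in enumerate(candidates, start=1):
        if value is None or value == "":
            results.append(None)  # missing -> null
        elif not is_valid(value):
            # malformed -> flag, preserving the exact original text
            issues.append(f"Item {position}: malformed value {value!r}")
        else:
            results.append(value)  # keep exact value and case
    # duplicates -> note (values stay in results; nothing is dropped)
    counts = Counter(v for v in results if v is not None)
    notes = [
        f"Duplicate: {v!r} appears {n} times"
        for v, n in counts.items()
        if n > 1
    ]
    return results, issues, notes
```

Duplicates are only noted here, never removed, so the caller can still decide whether to dedupe at the formatting stage.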

## Output Structure

```
[Extracted data]

## Summary
- Total: X
- Unique: Y
- Issues: Z

## Notes
- Line 42: Partial match "user@" (missing domain)
```