# Data Extraction
Extract specific information from unstructured/semi-structured data with completeness and accuracy.
## Common Patterns
| Type | Pattern | Validation |
|------|---------|------------|
| Email | `user@domain.ext` | Has `@` and a `.` in the domain part |
| URL | `http(s)://domain...` | Valid protocol and domain |
| Date | ISO, US, EU, timestamp | Valid ranges (month 1-12) |
| Phone | Various formats | 7-15 digits |
| IP | IPv4: `x.x.x.x`, IPv6 | IPv4: four octets, each 0-255 |
| Key-Value | `key=value`, `key: value` | Handle quoted/nested |
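
The validation column can be applied as a second step after a deliberately loose candidate match. The following is a minimal sketch using Python's standard `re` module; the names `EMAIL_CANDIDATE`, `IPV4_CANDIDATE`, `is_valid_email`, `is_valid_ipv4`, and `is_valid_phone` are hypothetical helpers introduced here for illustration, and the patterns are simplified rather than standards-complete.

```python
import re

# Loose candidate patterns (assumed for illustration); strict checks follow.
EMAIL_CANDIDATE = re.compile(r"[^\s@<>,;]+@[^\s@<>,;]+")
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def is_valid_email(value: str) -> bool:
    """Table rule: has '@' and a '.' inside the part after the '@'."""
    local, sep, domain = value.partition("@")
    # Strip leading/trailing dots so "user@domain." does not pass.
    return bool(local) and bool(sep) and "." in domain.strip(".")


def is_valid_ipv4(value: str) -> bool:
    """Table rule: four octets, each in the range 0-255."""
    parts = value.split(".")
    return len(parts) == 4 and all(
        p.isascii() and p.isdigit() and 0 <= int(p) <= 255 for p in parts
    )


def is_valid_phone(value: str) -> bool:
    """Table rule: 7-15 digits once formatting characters are removed."""
    digits = re.sub(r"\D", "", value)
    return 7 <= len(digits) <= 15


text = "Hosts 10.0.0.1 and 999.1.1.1; contact alice@example.com"
valid_ips = [ip for ip in IPV4_CANDIDATE.findall(text) if is_valid_ipv4(ip)]
valid_emails = [e for e in EMAIL_CANDIDATE.findall(text) if is_valid_email(e)]
```

Splitting the candidate match from validation keeps rejected values such as `999.1.1.1` available for flagging instead of silently dropping them.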
## Process
1. **Analyze:** Format, delimiters, variations, headers to skip
2. **Extract:** Match all instances, capture context, handle partial matches
3. **Clean:** Trim, normalize (dates to ISO, phones to digits), validate
4. **Format:** Consistent fields, proper escaping, sort/dedupe if needed
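
The Clean step above can be sketched as follows. This is an illustrative example, not a fixed implementation: the helper names (`normalize_date`, `normalize_phone`, `clean_phones`) and the three entries in `DATE_FORMATS` are assumptions chosen for the example, and ambiguous dates such as `01/02/2025` would need an explicit locale decision in real input.

```python
import re
from datetime import datetime
from typing import Iterable, List, Optional

# Assumed input conventions: ISO (YYYY-MM-DD), US (MM/DD/YYYY), EU (DD.MM.YYYY).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_date(raw: str) -> Optional[str]:
    """Trim and convert a date string to ISO 8601, or return None if invalid."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            # strptime rejects out-of-range values such as month 13.
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_phone(raw: str) -> Optional[str]:
    """Reduce a phone number to digits; None if outside 7-15 digits."""
    digits = re.sub(r"\D", "", raw)
    return digits if 7 <= len(digits) <= 15 else None


def clean_phones(raw_values: Iterable[str]) -> List[str]:
    """Normalize, drop invalid values, and dedupe while preserving order."""
    cleaned = (normalize_phone(v) for v in raw_values)
    valid = [c for c in cleaned if c is not None]
    return list(dict.fromkeys(valid))
```

For example, `normalize_date("11/30/2025")` and `normalize_date("30.11.2025")` both yield `"2025-11-30"`, while `normalize_date("13/45/2025")` yields `None` so the caller can flag it.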
## Output Formats
- **JSON:** `{"results": [...], "summary": {"total": N, "unique": N}}`
- **CSV:** Headers + rows
- **Markdown:** Table with headers
- **Plain:** Bullet list
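
As a rough sketch of the JSON and CSV formats, the example below uses Python's standard `json`, `csv`, and `io` modules. The function names `to_json` and `to_csv` and the `type`/`value` column names are hypothetical, chosen only for illustration; the JSON envelope follows the shape shown above.

```python
import csv
import io
import json
from typing import Dict, List


def to_json(values: List[str]) -> str:
    """Wrap extracted values in the results/summary envelope."""
    payload = {
        "results": values,
        # set() assumes hashable values such as strings.
        "summary": {"total": len(values), "unique": len(set(values))},
    }
    return json.dumps(payload, indent=2)


def to_csv(rows: List[Dict[str, str]], headers: List[str]) -> str:
    """Write a header line plus rows; the csv module handles quoting."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=headers, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


report = to_json(["a@x.org", "b@y.org", "a@x.org"])
table = to_csv([{"type": "email", "value": "a,b@x.org"}], ["type", "value"])
```

Delegating escaping to the `csv` and `json` modules avoids hand-rolled quoting bugs, for instance when a value contains a comma or a double quote.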
## Principles
- **Complete:** Extract ALL matches, don't stop early
- **Accurate:** Preserve exact values, maintain case
- **Handle edge cases:** Missing → null, malformed → flag, duplicates → note
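
The edge-case rule above can be illustrated with a small sketch. It assumes input records are dictionaries with an optional `email` key; `extract_emails` and `EMAIL_SHAPE` are hypothetical names, and the shape check is intentionally simplified.

```python
import re
from collections import Counter

# Simplified shape check, assumed for illustration only.
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def extract_emails(records):
    """Return (results, issues, notes) following the edge-case rules."""
    results, issues = [], []
    for line_no, record in enumerate(records, start=1):
        raw = (record.get("email") or "").strip()
        if not raw:
            results.append(None)  # Missing -> null
        elif EMAIL_SHAPE.fullmatch(raw):
            results.append(raw)  # Keep the exact value, case preserved
        else:
            issues.append(f'Line {line_no}: malformed "{raw}"')  # Malformed -> flag
    counts = Counter(v for v in results if v is not None)
    # Duplicates -> note (values are kept, only reported)
    notes = [f'Duplicate "{v}" ({n} occurrences)' for v, n in counts.items() if n > 1]
    return results, issues, notes


records = [{"email": "A@x.org"}, {}, {"email": "user@"}, {"email": "A@x.org"}]
results, issues, notes = extract_emails(records)
```

Here every record is processed rather than stopping at the first failure, and duplicates are reported instead of silently removed, which leaves the dedupe decision to the formatting step.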
## Output Structure
```
[Extracted data]
## Summary
- Total: X
- Unique: Y
- Issues: Z
## Notes
- Line 42: Partial match "user@" (missing domain)
```