Data Extraction
Extract specific information from unstructured/semi-structured data with completeness and accuracy.
Common Patterns
| Type | Pattern | Validation |
|---|---|---|
| Email | user@domain.ext | Has @ and . after @ |
| URL | http(s)://domain... | Valid protocol and domain |
| Date | ISO, US, EU, timestamp | Valid ranges (month 1-12) |
| Phone | Various formats | 7-15 digits |
| IP | IPv4: x.x.x.x, IPv6 | Octets 0-255 |
| Key-Value | key=value, key: value | Handle quoted/nested |
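
The validation column above can be sketched as a few small checks. This is a minimal illustration, not a complete validator: the function names and the `EMAIL_RE` pattern are assumptions made for this example, and the email check is deliberately as loose as the table's rule (an @ with a dot after it).

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Hypothetical pattern for illustration: local part, @, then a dotted domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def is_valid_email(value: str) -> bool:
    """Table rule: has an @, and a dot inside the part after the @."""
    local, sep, domain = value.partition("@")
    return bool(local) and sep == "@" and "." in domain.strip(".")


def is_valid_url(value: str) -> bool:
    """Table rule: valid protocol (http or https) and a non-empty domain."""
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_valid_phone(value: str) -> bool:
    """Table rule: 7-15 digits once all formatting characters are removed."""
    digits = re.sub(r"\D", "", value)
    return 7 <= len(digits) <= 15


def is_valid_ip(value: str) -> bool:
    """Accepts IPv4 (each octet 0-255) and IPv6 via the standard library."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False
```

Delegating the IP check to `ipaddress` avoids hand-written octet arithmetic and covers IPv6 without a second pattern.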
Process
- Analyze: Format, delimiters, variations, headers to skip
- Extract: Match all instances, capture context, handle partial matches
- Clean: Trim, normalize (dates to ISO, phones to digits), validate
- Format: Consistent fields, proper escaping, sort/dedupe if needed
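
As a rough sketch of the Extract and Clean steps, the code below matches every instance line by line, keeps the line number and surrounding text as context, and normalizes phones to digits and dates to ISO. The names `PHONE_RE`, `DATE_FORMATS`, `extract_with_context`, `normalize_phone`, and `normalize_date` are hypothetical. Treating slash-separated dates as US (month first) and dot-separated dates as EU (day first) is an assumption, since the source does not say how to resolve ambiguous dates.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical phone pattern: optional +, a digit, common separators, a final digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{5,}\d")

# Assumed formats: ISO, US (month/day/year), EU (day.month.year).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def extract_with_context(text: str, pattern: "re.Pattern[str]") -> list:
    """Extract: collect ALL matches, recording the line number and line text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append({"raw": match.group(), "line": lineno, "context": line.strip()})
    return hits


def normalize_phone(raw: str) -> str:
    """Clean: reduce a phone number to its digits only."""
    return re.sub(r"\D", "", raw)


def normalize_date(raw: str) -> Optional[str]:
    """Clean: convert a date to ISO 8601, or return None if no format fits."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # out-of-range month/day or wrong layout; try the next format
    return None
```

Returning `None` for an unparseable date leaves the decision to flag or drop the value to the caller rather than guessing a format.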
Output Formats
- JSON: {"results": [...], "summary": {"total": N, "unique": N}}
- CSV: Headers + rows
- Markdown: Table with headers
- Plain: Bullet list
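
One possible way to produce the four formats listed above with the Python standard library is sketched here. The function names are illustrative assumptions. `to_json` assumes the extracted values are hashable (for example, strings) so that uniqueness can be counted with a set.

```python
import csv
import io
import json


def to_json(values: list) -> str:
    """JSON: results plus a summary with total and unique counts."""
    summary = {"total": len(values), "unique": len(set(values))}
    return json.dumps({"results": values, "summary": summary})


def to_csv(rows: list, headers: list) -> str:
    """CSV: header line plus one row per record; the csv module handles escaping."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=headers, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_markdown(rows: list, headers: list) -> str:
    """Markdown: table with headers; literal pipes in cells are escaped."""
    lines = ["| " + " | ".join(headers) + " |", "|" + "---|" * len(headers)]
    for row in rows:
        cells = [str(row.get(h, "")).replace("|", "\\|") for h in headers]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


def to_plain(values: list) -> str:
    """Plain: one bullet per extracted value."""
    return "\n".join(f"- {value}" for value in values)
```

Using `csv.DictWriter` rather than joining on commas keeps values that contain commas or quotes properly escaped.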
Principles
- Complete: Extract ALL matches, don't stop early
- Accurate: Preserve exact values, maintain case
- Handle edge cases: Missing → null, malformed → flag, duplicates → note
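
The edge-case rules above might be applied as in the sketch below. The function name `triage` and the note wording are assumptions for illustration. Values are compared exactly, so case is preserved and `Bob@Example.com` is not treated as a duplicate of `bob@example.com`.

```python
from typing import Callable, Optional


def triage(values: list, is_valid: Callable[[str], bool]) -> tuple:
    """Apply: missing -> None, malformed -> flagged, duplicates -> noted.

    Nothing is dropped, so the output stays complete; problems are
    reported in the returned notes instead.
    """
    kept: list = []
    notes: list = []
    first_seen: dict = {}
    for index, value in enumerate(values, start=1):
        if value is None or value == "":
            kept.append(None)  # missing -> null
            continue
        kept.append(value)  # preserve the exact value, including case
        if not is_valid(value):
            notes.append(f"Item {index}: malformed value {value!r}")
        elif value in first_seen:
            notes.append(f"Item {index}: duplicate of item {first_seen[value]}")
        else:
            first_seen[value] = index
    return kept, notes
```

For example, calling `triage` on `["a@b.co", "", "user@", "a@b.co"]` with a check that requires a dot after the @ keeps all four positions, turns the empty entry into `None`, and records one malformed note and one duplicate note.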
Output Structure
[Extracted data]
## Summary
- Total: X
- Unique: Y
- Issues: Z
## Notes
- Line 42: Partial match "user@" (missing domain)
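
A minimal sketch of rendering this output structure follows, assuming the extracted values are strings (with `None` for missing entries) and the notes are already formatted. `render_report` is a hypothetical helper name, and showing the extracted data as a bullet list is only one of the formats the section allows.

```python
def render_report(values: list, notes: list) -> str:
    """Render extracted data, then the Summary and Notes sections."""
    present = [value for value in values if value is not None]
    lines = [f"- {value}" for value in present]
    lines += [
        "## Summary",
        f"- Total: {len(present)}",
        f"- Unique: {len(set(present))}",
        f"- Issues: {len(notes)}",
    ]
    if notes:  # omit the Notes section when there is nothing to report
        lines.append("## Notes")
        lines += [f"- {note}" for note in notes]
    return "\n".join(lines)
```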