Initial commit
skills/python-json-parsing/SKILL.md
@@ -0,0 +1,231 @@
---
name: python-json-parsing
description: >
  Python JSON parsing best practices covering performance optimization (orjson/msgspec),
  handling large files (streaming/JSONL), security (injection prevention),
  and advanced querying (JSONPath/JMESPath).
  Use when working with JSON data, parsing APIs, handling large JSON files,
  or optimizing JSON performance.
---

# Python JSON Parsing Best Practices

A guide to JSON parsing in Python with a focus on performance, security, and scalability.

## Quick Start

### Basic JSON Parsing

```python
import json

# Parse JSON string
data = json.loads('{"name": "Alice", "age": 30}')

# Parse JSON file
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Write JSON file
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)
```

**Key Rule:** Always specify `encoding="utf-8"` when reading/writing files.

## When to Use This Skill

Use this skill when:

- Working with JSON APIs or data interchange
- Optimizing JSON performance in high-throughput applications
- Handling large JSON files (> 100MB)
- Securing applications against JSON injection
- Extracting data from complex nested JSON structures

## Performance: Choose the Right Library

### Library Comparison (10,000 records benchmark)

| Library | Serialize (s) | Deserialize (s) | Best For |
|---------|---------------|-----------------|----------|
| **orjson** | 0.42 | 1.27 | FastAPI, web APIs (3.9x faster serialization than stdlib) |
| **msgspec** | 0.49 | 0.93 | Maximum performance (1.7x faster deserialization than stdlib) |
| **json** (stdlib) | 1.62 | 1.62 | Universal compatibility |
| **ujson** | 1.41 | 1.85 | Near drop-in replacement (about 1.15x faster serialization, slower deserialization in this benchmark) |

**Recommendation:**

- Use **orjson** for FastAPI/web APIs (native support, fastest serialization)
- Use **msgspec** for data pipelines (fastest overall, typed validation)
- Use **json** when compatibility is critical
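
One way to follow this recommendation without giving up compatibility is to prefer orjson when it is installed and fall back to the standard library otherwise. The sketch below is only an illustration: the helper names `dumps_bytes` and `loads_any` are hypothetical. Note that `orjson.dumps()` returns `bytes` rather than `str`, so the fallback encodes to UTF-8 to keep the return type consistent.

```python
import json

try:
    import orjson  # optional third-party dependency
except ImportError:
    orjson = None


def dumps_bytes(obj) -> bytes:
    """Serialize obj to compact UTF-8 JSON bytes (hypothetical helper)."""
    if orjson is not None:
        return orjson.dumps(obj)
    # Approximate orjson's compact output with the stdlib encoder
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


def loads_any(data):
    """Parse JSON from str or bytes (hypothetical helper)."""
    if orjson is not None:
        return orjson.loads(data)
    return json.loads(data)


payload = dumps_bytes({"name": "Alice", "age": 30})
record = loads_any(payload)
```

Keeping the choice of library behind two small functions means the rest of the code does not need to know which backend is active.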

### Installation

```bash
# High-performance libraries
pip install orjson msgspec ujson

# Advanced querying
pip install jsonpath-ng jmespath

# Streaming large files
pip install ijson

# Schema validation
pip install jsonschema
```

## Large Files: Streaming Strategies

For files > 100MB, avoid loading the whole file into memory.

### Strategy 1: JSONL (JSON Lines)

Convert large JSON arrays to line-delimited format (one JSON object per line), then process the file one line at a time:

```python
import json

# Stream process JSONL
with open("large.jsonl", "r", encoding="utf-8") as infile, \
        open("output.jsonl", "w", encoding="utf-8") as outfile:
    for line in infile:
        obj = json.loads(line)
        obj["processed"] = True
        outfile.write(json.dumps(obj) + "\n")
```
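
The conversion step itself can be sketched with two small helpers. This is a minimal illustration using only the standard library: `write_jsonl` and `iter_jsonl` are hypothetical names, and the demo assumes the records are already available as an iterable. For a source array that does not fit in memory, the records would need to come from a streaming parser instead.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write an iterable of JSON-serializable records, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def iter_jsonl(path):
    """Yield one parsed record at a time, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Small demonstration with a temporary file
products = [{"name": "Apple", "price": 12.88}, {"name": "Peach", "price": 27.25}]
path = os.path.join(tempfile.mkdtemp(), "products.jsonl")
write_jsonl(products, path)
names = [item["name"] for item in iter_jsonl(path)]
```

Because `iter_jsonl` is a generator, only one record is held in memory at a time regardless of file size.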

### Strategy 2: Streaming with ijson

The prefix `products.item` selects each element of the top-level `products` array in turn.

```python
import ijson

# Process large JSON without loading into memory
with open("huge.json", "rb") as f:
    for item in ijson.items(f, "products.item"):
        process(item)  # Placeholder: handle one item at a time
```

See: `patterns/streaming-large-json.md`

## Security: Prevent JSON Injection

**Critical Rules:**

1. Always use `json.loads()`, never `eval()`
2. Validate input with `jsonschema`
3. Sanitize user input before serialization
4. Escape special characters (`"` and `\`) by serializing with `json.dumps()`, never by hand
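
As an illustration of the first rule, the sketch below parses untrusted text with `json.loads()` and rejects anything that is not valid JSON. The helper name `parse_untrusted` and the `MAX_BYTES` limit are assumptions made for this example, not part of any library.

```python
import json

MAX_BYTES = 1_000_000  # assumed size limit, chosen for illustration


def parse_untrusted(text):
    """Return parsed JSON, or None if the input is oversized or not valid JSON."""
    if len(text.encode("utf-8")) > MAX_BYTES:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


ok = parse_untrusted('{"user": "alice"}')
# A Python expression is not JSON, so it is rejected rather than executed
rejected = parse_untrusted("__import__('os').getcwd()")
```

Unlike `eval()`, the parser never executes the input. Code-like strings simply fail to parse.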

**Vulnerable Code:**

```python
# NEVER DO THIS
username = request.GET['username']  # User input: admin", "role": "admin
json_string = f'{{"user":"{username}","role":"user"}}'
# Result: {"user":"admin", "role": "admin","role":"user"}
# The injected "role" key can lead to privilege escalation, depending on
# how the consumer resolves duplicate keys
```

**Secure Code:**

```python
import json

# Use json.dumps for serialization
data = {"user": username, "role": "user"}
json_string = json.dumps(data)  # Properly escaped
```
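
The difference between the two approaches can be checked directly. The sketch below reuses the attacker-controlled value from the vulnerable example and uses `object_pairs_hook=list` only to make the duplicate key visible. Python's `json.loads()` keeps the last duplicate by default, but other parsers may keep the first.

```python
import json

username = 'admin", "role": "admin'  # attacker-controlled value

# String formatting lets the input inject an extra "role" key
unsafe = f'{{"user":"{username}","role":"user"}}'
unsafe_pairs = json.loads(unsafe, object_pairs_hook=list)

# json.dumps escapes the quotes, so the input stays a single string value
safe = json.dumps({"user": username, "role": "user"})
safe_obj = json.loads(safe)
```

Here `unsafe_pairs` contains two `role` entries, while `safe_obj` has exactly the two intended keys and the full malicious string as the `user` value.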

See: `anti-patterns/security-json-injection.md`, `anti-patterns/eval-usage.md`

## Advanced: JSONPath for Complex Queries

Extract data from nested JSON without complex loops:

```python
# Filter expressions require the extended parser in jsonpath_ng.ext
from jsonpath_ng.ext import parse

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25}
    ]
}

# Filter by price
query = parse("products[?price>20].name")
results = [match.value for match in query.find(data)]
# Output: ["Peach"]
```

**Key Operators:**

- `$` - Root selector
- `..` - Recursive descendant
- `*` - Wildcard
- `[?<predicate>]` - Filter (e.g., `[?price > 20]`)
- `[start:end:step]` - Array slicing

See: `patterns/jsonpath-querying.md`

## Custom Objects: Serialization

Handle datetime, UUID, Decimal, and custom classes:

```python
from datetime import datetime
import json

class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, set):
            return list(obj)
        return super().default(obj)

# Usage
data = {"timestamp": datetime.now(), "tags": {"python", "json"}}
json_str = json.dumps(data, cls=CustomEncoder)
```
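
The same idea extends to UUID, Decimal, and custom classes. The sketch below uses the `default=` argument of `json.dumps()` instead of an encoder subclass. The `Order` dataclass and the `encode_extra` function are hypothetical names for illustration, and encoding `Decimal` as a string is one possible choice, not the only one.

```python
import dataclasses
import json
import uuid
from decimal import Decimal


@dataclasses.dataclass
class Order:  # hypothetical class for illustration
    id: uuid.UUID
    total: Decimal


def encode_extra(obj):
    """Fallback passed to json.dumps(default=...) for unsupported types."""
    if isinstance(obj, uuid.UUID):
        return str(obj)
    if isinstance(obj, Decimal):
        return str(obj)  # a string keeps exact precision; float() would not
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return dataclasses.asdict(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


order = Order(id=uuid.UUID("12345678-1234-5678-1234-567812345678"), total=Decimal("19.99"))
json_str = json.dumps(order, default=encode_extra)
```

The encoder calls `encode_extra` again for the `UUID` and `Decimal` values inside the converted dataclass, so nested unsupported types are handled without extra code.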

See: `patterns/custom-object-serialization.md`

## Performance Checklist

- [ ] Use orjson/msgspec for high-throughput applications
- [ ] Specify UTF-8 encoding when reading/writing files
- [ ] Use streaming (ijson/JSONL) for files > 100MB
- [ ] Minify JSON for production (`separators=(',', ':')`)
- [ ] Pretty-print for development (`indent=2`)
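
The last two items can be compared side by side with the standard library. The sample data below is made up for illustration.

```python
import json

data = {"name": "Alice", "tags": ["python", "json"]}

# Minified: no whitespace after separators
compact = json.dumps(data, separators=(",", ":"))
# → {"name":"Alice","tags":["python","json"]}

# Pretty-printed: one element per line, indented by two spaces
pretty = json.dumps(data, indent=2)
```

Both strings parse back to the same object. Only the size and readability differ.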

## Security Checklist

- [ ] Never use `eval()` for JSON parsing
- [ ] Validate input with `jsonschema`
- [ ] Sanitize user input before serialization
- [ ] Use `json.dumps()` to prevent injection
- [ ] Escape special characters in user data

## Reference Documentation

**Performance:**

- `reference/python-json-parsing-best-practices-2025.md` - Comprehensive research with benchmarks

**Patterns:**

- `patterns/streaming-large-json.md` - ijson and JSONL strategies
- `patterns/custom-object-serialization.md` - Handle datetime, UUID, custom classes
- `patterns/jsonpath-querying.md` - Advanced nested data extraction

**Security:**

- `anti-patterns/security-json-injection.md` - Prevent injection attacks
- `anti-patterns/eval-usage.md` - Why never to use eval()

**Examples:**

- `examples/high-performance-parsing.py` - orjson and msgspec code
- `examples/large-file-streaming.py` - Streaming with ijson
- `examples/secure-validation.py` - jsonschema validation

**Tools:**

- `tools/json-performance-benchmark.py` - Benchmark different libraries