gh-basher83-lunar-claude-pl…/skills/python-json-parsing/SKILL.md
2025-11-29 18:00:18 +08:00


---
name: python-json-parsing
description: >
  Python JSON parsing best practices covering performance optimization (orjson/msgspec),
  handling large files (streaming/JSONL), security (injection prevention),
  and advanced querying (JSONPath/JMESPath).
  Use when working with JSON data, parsing APIs, handling large JSON files,
  or optimizing JSON performance.
---
# Python JSON Parsing Best Practices
A guide to JSON parsing in Python, with a focus on performance, security, and scalability.
## Quick Start
### Basic JSON Parsing
```python
import json

# Parse JSON string
data = json.loads('{"name": "Alice", "age": 30}')

# Parse JSON file
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Write JSON file
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)
```
**Key Rule:** Always specify `encoding="utf-8"` when reading/writing files.
## When to Use This Skill
Use this skill when:
- Working with JSON APIs or data interchange
- Optimizing JSON performance in high-throughput applications
- Handling large JSON files (> 100MB)
- Securing applications against JSON injection
- Extracting data from complex nested JSON structures
## Performance: Choose the Right Library
### Library Comparison (10,000 records benchmark)
| Library | Serialize (s) | Deserialize (s) | Best For |
|---------|---------------|-----------------|----------|
| **orjson** | 0.42 | 1.27 | FastAPI, web APIs (3.9x faster serialization than stdlib) |
| **msgspec** | 0.49 | 0.93 | Maximum performance (1.7x faster deserialization than stdlib) |
| **json** (stdlib) | 1.62 | 1.62 | Universal compatibility |
| **ujson** | 1.41 | 1.85 | Drop-in replacement (about 1.15x faster serialization, slower deserialization in this benchmark) |
**Recommendation:**
- Use **orjson** for FastAPI/web APIs (native support, fastest serialization)
- Use **msgspec** for data pipelines (fastest overall, typed validation)
- Use **json** when compatibility is critical
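
One way to act on this recommendation without making orjson a hard dependency is to prefer it when it is importable and fall back to the standard library otherwise. The sketch below assumes that approach. The helper names `dumps_bytes` and `loads_any` are hypothetical, not part of either library.

```python
import json

try:
    import orjson  # third-party; assumed installed via `pip install orjson`
except ImportError:
    orjson = None


def dumps_bytes(obj) -> bytes:
    """Serialize to UTF-8 bytes, preferring orjson when it is importable."""
    if orjson is not None:
        return orjson.dumps(obj)  # orjson returns bytes, not str
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def loads_any(payload):
    """Parse str or bytes with whichever backend is available."""
    if orjson is not None:
        return orjson.loads(payload)
    return json.loads(payload)  # json.loads also accepts bytes


record = {"name": "Alice", "age": 30}
raw = dumps_bytes(record)
restored = loads_any(raw)
```

Returning bytes from both branches keeps callers independent of the backend, since `orjson.dumps()` produces bytes while `json.dumps()` produces `str`.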
### Installation
```bash
# High-performance libraries
pip install orjson msgspec ujson
# Advanced querying
pip install jsonpath-ng jmespath
# Streaming large files
pip install ijson
# Schema validation
pip install jsonschema
```
## Large Files: Streaming Strategies
For files > 100MB, avoid loading the entire document into memory at once.
### Strategy 1: JSONL (JSON Lines)
Convert large JSON arrays to line-delimited format (one JSON document per line) so each record can be read, processed, and written independently:
```python
import json

# Stream process JSONL: only one line is held in memory at a time
with open("large.jsonl", "r", encoding="utf-8") as infile, \
        open("output.jsonl", "w", encoding="utf-8") as outfile:
    for line in infile:
        obj = json.loads(line)
        obj["processed"] = True
        outfile.write(json.dumps(obj) + "\n")
```
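
The conversion step itself can be sketched with the standard library alone. The helper names `write_jsonl` and `read_jsonl` are hypothetical. `write_jsonl` accepts any iterable, so for a source file that is too large to load, the records could come from a streaming parser such as the ijson approach in Strategy 2 rather than from an in-memory list.

```python
import json
from typing import Any, Iterable, Iterator


def write_jsonl(records: Iterable[Any], path: str) -> int:
    """Write one JSON document per line; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            # Without indent, json.dumps never emits raw newlines,
            # so each record stays on a single line.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: str) -> Iterator[Any]:
    """Yield records lazily so only one line is parsed at a time."""
    with open(path, "r", encoding="utf-8") as src:
        for line in src:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)
```

Because `read_jsonl` is a generator, a caller can stop early or chain it into further generators without ever materializing the full dataset.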
### Strategy 2: Streaming with ijson
```python
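import ijson


def process(item):
    """Placeholder for application-specific handling."""
    print(item)


# Stream each element of the top-level "products" array without loading the file into memory
with open("huge.json", "rb") as f:
    for item in ijson.items(f, "products.item"):
        process(item)  # Handle one item at a time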
```
See: `patterns/streaming-large-json.md`
## Security: Prevent JSON Injection
**Critical Rules:**
1. Always use `json.loads()`, never `eval()`
2. Validate input with `jsonschema`
3. Sanitize user input before serialization
4. Escape special characters (`"` and `\`)
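
Rule 1 can be illustrated with a minimal sketch (the payload string is a made-up example). `json.loads()` treats its input purely as data, so a string that `eval()` would execute as Python is simply rejected as malformed JSON.

```python
import json

# Valid Python that eval() would execute, but not valid JSON.
malicious = "__import__('os').getcwd()"

try:
    json.loads(malicious)
    rejected = False
except json.JSONDecodeError:
    rejected = True  # the parser refuses the input instead of running it
```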
**Vulnerable Code:**
```python
# NEVER DO THIS
username = request.GET['username'] # User input: admin", "role": "admin
json_string = f'{{"user":"{username}","role":"user"}}'
# Result: privilege escalation
```
**Secure Code:**
```python
# Use json.dumps for serialization
data = {"user": username, "role": "user"}
json_string = json.dumps(data) # Properly escaped
```
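
The difference between the two snippets can be checked directly. This sketch reuses the payload from the vulnerable example and passes `object_pairs_hook=list` to `json.loads()` so that duplicate keys stay visible. Which duplicate wins is parser-dependent: Python's `json` keeps the last one, while other consumers may keep the first.

```python
import json

username = 'admin", "role": "admin'  # attacker-controlled input

# Vulnerable: interpolation lets the payload close the string and add a key.
unsafe = f'{{"user":"{username}","role":"user"}}'
pairs = json.loads(unsafe, object_pairs_hook=list)
# pairs now contains an injected ("role", "admin") entry

# Safe: json.dumps escapes the embedded quotes, so the payload stays one string.
safe = json.dumps({"user": username, "role": "user"})
safe_parsed = json.loads(safe)
```

With the safe version, the whole payload round-trips as the value of `user`, and the document contains exactly one `role` key.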
See: `anti-patterns/security-json-injection.md`, `anti-patterns/eval-usage.md`
## Advanced: JSONPath for Complex Queries
Extract data from nested JSON without complex loops:
```python
import jsonpath_ng.ext as jp  # filter expressions ([?...]) need the extended parser

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25}
    ]
}

# Filter by price
query = jp.parse("products[?price>20].name")
results = [match.value for match in query.find(data)]
# Output: ["Peach"]
```
**Key Operators:**
- `$` - Root selector
- `..` - Recursive descendant
- `*` - Wildcard
- `[?<predicate>]` - Filter (e.g., `[?price > 20]`)
- `[start:end:step]` - Array slicing
See: `patterns/jsonpath-querying.md`
## Custom Objects: Serialization
Handle datetime, UUID, Decimal, and custom classes:
```python
from datetime import datetime
import json


class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, set):
            return list(obj)
        return super().default(obj)


# Usage
data = {"timestamp": datetime.now(), "tags": {"python", "json"}}
json_str = json.dumps(data, cls=CustomEncoder)
```
See: `patterns/custom-object-serialization.md`
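
UUID and Decimal values can be handled the same way. As an alternative to subclassing `json.JSONEncoder`, this sketch passes a plain function through the `default=` parameter of `json.dumps()`. The function name `to_jsonable` and the sample `order` record are hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone
from decimal import Decimal


def to_jsonable(obj):
    """Fallback for json.dumps(default=...) covering types json cannot encode."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    if isinstance(obj, uuid.UUID):
        return str(obj)
    if isinstance(obj, Decimal):
        return str(obj)  # a string keeps the exact value; float() could lose precision
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


order = {
    "id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
    "total": Decimal("19.99"),
    "created": datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
}
json_str = json.dumps(order, default=to_jsonable)
```

Encoding `Decimal` as a string is a design choice: the consumer has to convert it back with `Decimal(value)`, but no precision is lost along the way.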
## Performance Checklist
- [ ] Use orjson/msgspec for high-throughput applications
- [ ] Specify UTF-8 encoding when reading/writing files
- [ ] Use streaming (ijson/JSONL) for files > 100MB
- [ ] Minify JSON for production (`separators=(',', ':')`)
- [ ] Pretty-print for development (`indent=2`)
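
The last two checklist items differ only in the arguments passed to `json.dumps()`. A minimal sketch with made-up sample data:

```python
import json

data = {"name": "Alice", "tags": ["python", "json"]}

# Production: drop the spaces after "," and ":" to shrink the payload
compact = json.dumps(data, separators=(",", ":"))
# → {"name":"Alice","tags":["python","json"]}

# Development: indent for readability
pretty = json.dumps(data, indent=2)
```

Both strings parse back to the same object; only the whitespace differs.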
## Security Checklist
- [ ] Never use `eval()` for JSON parsing
- [ ] Validate input with `jsonschema`
- [ ] Sanitize user input before serialization
- [ ] Use `json.dumps()` to prevent injection
- [ ] Escape special characters in user data
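
Schema validation might look like the sketch below, assuming the `jsonschema` package from the installation step. The schema and the helper name `is_valid_user` are hypothetical examples, not taken from the reference files.

```python
from jsonschema import ValidationError, validate  # third-party: `pip install jsonschema`

# Hypothetical schema: exactly two known fields, nothing else allowed.
USER_SCHEMA = {
    "type": "object",
    "properties": {
        "user": {"type": "string", "maxLength": 64},
        "role": {"enum": ["user", "admin"]},
    },
    "required": ["user", "role"],
    "additionalProperties": False,
}


def is_valid_user(payload) -> bool:
    """Return True only if the parsed payload matches USER_SCHEMA."""
    try:
        validate(instance=payload, schema=USER_SCHEMA)
    except ValidationError:
        return False
    return True
```

Setting `additionalProperties` to `False` rejects unexpected keys, which limits what an attacker can smuggle into an otherwise well-formed document.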
## Reference Documentation
**Performance:**
- `reference/python-json-parsing-best-practices-2025.md` - Comprehensive research with benchmarks
**Patterns:**
- `patterns/streaming-large-json.md` - ijson and JSONL strategies
- `patterns/custom-object-serialization.md` - Handle datetime, UUID, custom classes
- `patterns/jsonpath-querying.md` - Advanced nested data extraction
**Security:**
- `anti-patterns/security-json-injection.md` - Prevent injection attacks
- `anti-patterns/eval-usage.md` - Why never to use eval()
**Examples:**
- `examples/high-performance-parsing.py` - orjson and msgspec code
- `examples/large-file-streaming.py` - Streaming with ijson
- `examples/secure-validation.py` - jsonschema validation
**Tools:**
- `tools/json-performance-benchmark.py` - Benchmark different libraries