232 lines
5.9 KiB
Markdown
232 lines
5.9 KiB
Markdown
---
|
|
name: python-json-parsing
|
|
description: >
|
|
Python JSON parsing best practices covering performance optimization (orjson/msgspec),
|
|
handling large files (streaming/JSONL), security (injection prevention),
|
|
and advanced querying (JSONPath/JMESPath).
|
|
Use when working with JSON data, parsing APIs, handling large JSON files,
|
|
or optimizing JSON performance.
|
|
---
|
|
|
|
# Python JSON Parsing Best Practices
|
|
|
|
Comprehensive guide to JSON parsing in Python with focus on performance, security, and scalability.
|
|
|
|
## Quick Start
|
|
|
|
### Basic JSON Parsing
|
|
|
|
```python
|
|
import json
|
|
|
|
# Parse JSON string
|
|
data = json.loads('{"name": "Alice", "age": 30}')
|
|
|
|
# Parse JSON file
|
|
with open("data.json", "r", encoding="utf-8") as f:
|
|
data = json.load(f)
|
|
|
|
# Write JSON file
|
|
with open("output.json", "w", encoding="utf-8") as f:
|
|
json.dump(data, f, indent=2)
|
|
```
|
|
|
|
**Key Rule:** Always specify `encoding="utf-8"` when reading/writing files.
|
|
|
|
## When to Use This Skill
|
|
|
|
Use this skill when:
|
|
|
|
- Working with JSON APIs or data interchange
|
|
- Optimizing JSON performance in high-throughput applications
|
|
- Handling large JSON files (> 100MB)
|
|
- Securing applications against JSON injection
|
|
- Extracting data from complex nested JSON structures
|
|
|
|
## Performance: Choose the Right Library
|
|
|
|
### Library Comparison (10,000 records benchmark)
|
|
|
|
| Library | Serialize (s) | Deserialize (s) | Best For |
|
|
|---------|---------------|-----------------|----------|
|
|
| **orjson** | 0.42 | 1.27 | FastAPI, web APIs (3.9x faster) |
|
|
| **msgspec** | 0.49 | 0.93 | Maximum performance (1.7x faster deserialization) |
|
|
| **json** (stdlib) | 1.62 | 1.62 | Universal compatibility |
|
|
| **ujson** | 1.41 | 1.85 | Drop-in replacement (2x faster) |
|
|
|
|
**Recommendation:**
|
|
|
|
- Use **orjson** for FastAPI/web APIs (native support, fastest serialization)
|
|
- Use **msgspec** for data pipelines (fastest overall, typed validation)
|
|
- Use **json** when compatibility is critical
|
|
|
|
### Installation
|
|
|
|
```bash
|
|
# High-performance libraries
|
|
pip install orjson msgspec ujson
|
|
|
|
# Advanced querying
|
|
pip install jsonpath-ng jmespath
|
|
|
|
# Streaming large files
|
|
pip install ijson
|
|
|
|
# Schema validation
|
|
pip install jsonschema
|
|
```
|
|
|
|
## Large Files: Streaming Strategies
|
|
|
|
For files > 100MB, avoid loading into memory.
|
|
|
|
### Strategy 1: JSONL (JSON Lines)
|
|
|
|
Convert large JSON arrays to line-delimited format:
|
|
|
|
```python
|
|
# Stream process JSONL
|
|
with open("large.jsonl", "r") as infile, open("output.jsonl", "w") as outfile:
|
|
for line in infile:
|
|
obj = json.loads(line)
|
|
obj["processed"] = True
|
|
outfile.write(json.dumps(obj) + "\n")
|
|
```
|
|
|
|
### Strategy 2: Streaming with ijson
|
|
|
|
```python
|
|
import ijson
|
|
|
|
# Process large JSON without loading into memory
|
|
with open("huge.json", "rb") as f:
|
|
for item in ijson.items(f, "products.item"):
|
|
process(item) # Handle one item at a time
|
|
```
|
|
|
|
See: `patterns/streaming-large-json.md`
|
|
|
|
## Security: Prevent JSON Injection
|
|
|
|
**Critical Rules:**
|
|
|
|
1. Always use `json.loads()`, never `eval()`
|
|
2. Validate input with `jsonschema`
|
|
3. Sanitize user input before serialization
|
|
4. Escape special characters (`"` and `\`)
|
|
|
|
**Vulnerable Code:**
|
|
|
|
```python
|
|
# NEVER DO THIS
|
|
username = request.GET['username'] # User input: admin", "role": "admin
|
|
json_string = f'{{"user":"{username}","role":"user"}}'
|
|
# Result: privilege escalation
|
|
```
|
|
|
|
**Secure Code:**
|
|
|
|
```python
|
|
# Use json.dumps for serialization
|
|
data = {"user": username, "role": "user"}
|
|
json_string = json.dumps(data) # Properly escaped
|
|
```
|
|
|
|
See: `anti-patterns/security-json-injection.md`, `anti-patterns/eval-usage.md`
|
|
|
|
## Advanced: JSONPath for Complex Queries
|
|
|
|
Extract data from nested JSON without complex loops:
|
|
|
|
```python
|
|
import jsonpath_ng as jp
|
|
|
|
data = {
|
|
"products": [
|
|
{"name": "Apple", "price": 12.88},
|
|
{"name": "Peach", "price": 27.25}
|
|
]
|
|
}
|
|
|
|
# Filter by price
|
|
query = jp.parse("products[?price>20].name")
|
|
results = [match.value for match in query.find(data)]
|
|
# Output: ["Peach"]
|
|
```
|
|
|
|
**Key Operators:**
|
|
|
|
- `$` - Root selector
|
|
- `..` - Recursive descendant
|
|
- `*` - Wildcard
|
|
- `[?<predicate>]` - Filter (e.g., `[?price > 20]`)
|
|
- `[start:end:step]` - Array slicing
|
|
|
|
See: `patterns/jsonpath-querying.md`
|
|
|
|
## Custom Objects: Serialization
|
|
|
|
Handle datetime, UUID, Decimal, and custom classes:
|
|
|
|
```python
|
|
from datetime import datetime
|
|
import json
|
|
|
|
class CustomEncoder(json.JSONEncoder):
|
|
def default(self, obj):
|
|
if isinstance(obj, datetime):
|
|
return obj.isoformat()
|
|
if isinstance(obj, set):
|
|
return list(obj)
|
|
return super().default(obj)
|
|
|
|
# Usage
|
|
data = {"timestamp": datetime.now(), "tags": {"python", "json"}}
|
|
json_str = json.dumps(data, cls=CustomEncoder)
|
|
```
|
|
|
|
See: `patterns/custom-object-serialization.md`
|
|
|
|
## Performance Checklist
|
|
|
|
- [ ] Use orjson/msgspec for high-throughput applications
|
|
- [ ] Specify UTF-8 encoding when reading/writing files
|
|
- [ ] Use streaming (ijson/JSONL) for files > 100MB
|
|
- [ ] Minify JSON for production (`separators=(',', ':')`)
|
|
- [ ] Pretty-print for development (`indent=2`)
|
|
|
|
## Security Checklist
|
|
|
|
- [ ] Never use `eval()` for JSON parsing
|
|
- [ ] Validate input with `jsonschema`
|
|
- [ ] Sanitize user input before serialization
|
|
- [ ] Use `json.dumps()` to prevent injection
|
|
- [ ] Escape special characters in user data
|
|
|
|
## Reference Documentation
|
|
|
|
**Performance:**
|
|
|
|
- `reference/python-json-parsing-best-practices-2025.md` - Comprehensive research with benchmarks
|
|
|
|
**Patterns:**
|
|
|
|
- `patterns/streaming-large-json.md` - ijson and JSONL strategies
|
|
- `patterns/custom-object-serialization.md` - Handle datetime, UUID, custom classes
|
|
- `patterns/jsonpath-querying.md` - Advanced nested data extraction
|
|
|
|
**Security:**
|
|
|
|
- `anti-patterns/security-json-injection.md` - Prevent injection attacks
|
|
- `anti-patterns/eval-usage.md` - Why never to use eval()
|
|
|
|
**Examples:**
|
|
|
|
- `examples/high-performance-parsing.py` - orjson and msgspec code
|
|
- `examples/large-file-streaming.py` - Streaming with ijson
|
|
- `examples/secure-validation.py` - jsonschema validation
|
|
|
|
**Tools:**
|
|
|
|
- `tools/json-performance-benchmark.py` - Benchmark different libraries
|