--- name: python-json-parsing description: > Python JSON parsing best practices covering performance optimization (orjson/msgspec), handling large files (streaming/JSONL), security (injection prevention), and advanced querying (JSONPath/JMESPath). Use when working with JSON data, parsing APIs, handling large JSON files, or optimizing JSON performance. --- # Python JSON Parsing Best Practices Comprehensive guide to JSON parsing in Python with focus on performance, security, and scalability. ## Quick Start ### Basic JSON Parsing ```python import json # Parse JSON string data = json.loads('{"name": "Alice", "age": 30}') # Parse JSON file with open("data.json", "r", encoding="utf-8") as f: data = json.load(f) # Write JSON file with open("output.json", "w", encoding="utf-8") as f: json.dump(data, f, indent=2) ``` **Key Rule:** Always specify `encoding="utf-8"` when reading/writing files. ## When to Use This Skill Use this skill when: - Working with JSON APIs or data interchange - Optimizing JSON performance in high-throughput applications - Handling large JSON files (> 100MB) - Securing applications against JSON injection - Extracting data from complex nested JSON structures ## Performance: Choose the Right Library ### Library Comparison (10,000 records benchmark) | Library | Serialize (s) | Deserialize (s) | Best For | |---------|---------------|-----------------|----------| | **orjson** | 0.42 | 1.27 | FastAPI, web APIs (3.9x faster) | | **msgspec** | 0.49 | 0.93 | Maximum performance (1.7x faster deserialization) | | **json** (stdlib) | 1.62 | 1.62 | Universal compatibility | | **ujson** | 1.41 | 1.85 | Drop-in replacement (2x faster) | **Recommendation:** - Use **orjson** for FastAPI/web APIs (native support, fastest serialization) - Use **msgspec** for data pipelines (fastest overall, typed validation) - Use **json** when compatibility is critical ### Installation ```bash # High-performance libraries pip install orjson msgspec ujson # Advanced querying pip install jsonpath-ng jmespath # Streaming large files pip install ijson # Schema validation pip install jsonschema ``` ## Large Files: Streaming Strategies For files > 100MB, avoid loading into memory. ### Strategy 1: JSONL (JSON Lines) Convert large JSON arrays to line-delimited format: ```python # Stream process JSONL with open("large.jsonl", "r") as infile, open("output.jsonl", "w") as outfile: for line in infile: obj = json.loads(line) obj["processed"] = True outfile.write(json.dumps(obj) + "\n") ``` ### Strategy 2: Streaming with ijson ```python import ijson # Process large JSON without loading into memory with open("huge.json", "rb") as f: for item in ijson.items(f, "products.item"): process(item) # Handle one item at a time ``` See: `patterns/streaming-large-json.md` ## Security: Prevent JSON Injection **Critical Rules:** 1. Always use `json.loads()`, never `eval()` 2. Validate input with `jsonschema` 3. Sanitize user input before serialization 4. Escape special characters (`"` and `\`) **Vulnerable Code:** ```python # NEVER DO THIS username = request.GET['username'] # User input: admin", "role": "admin json_string = f'{{"user":"{username}","role":"user"}}' # Result: privilege escalation ``` **Secure Code:** ```python # Use json.dumps for serialization data = {"user": username, "role": "user"} json_string = json.dumps(data) # Properly escaped ``` See: `anti-patterns/security-json-injection.md`, `anti-patterns/eval-usage.md` ## Advanced: JSONPath for Complex Queries Extract data from nested JSON without complex loops: ```python import jsonpath_ng as jp data = { "products": [ {"name": "Apple", "price": 12.88}, {"name": "Peach", "price": 27.25} ] } # Filter by price query = jp.parse("products[?price>20].name") results = [match.value for match in query.find(data)] # Output: ["Peach"] ``` **Key Operators:** - `$` - Root selector - `..` - Recursive descendant - `*` - Wildcard - `[?]` - Filter (e.g., `[?price > 20]`) - `[start:end:step]` - Array slicing See: `patterns/jsonpath-querying.md` ## Custom Objects: Serialization Handle datetime, UUID, Decimal, and custom classes: ```python from datetime import datetime import json class CustomEncoder(json.JSONEncoder): def default(self, obj): if isinstance(obj, datetime): return obj.isoformat() if isinstance(obj, set): return list(obj) return super().default(obj) # Usage data = {"timestamp": datetime.now(), "tags": {"python", "json"}} json_str = json.dumps(data, cls=CustomEncoder) ``` See: `patterns/custom-object-serialization.md` ## Performance Checklist - [ ] Use orjson/msgspec for high-throughput applications - [ ] Specify UTF-8 encoding when reading/writing files - [ ] Use streaming (ijson/JSONL) for files > 100MB - [ ] Minify JSON for production (`separators=(',', ':')`) - [ ] Pretty-print for development (`indent=2`) ## Security Checklist - [ ] Never use `eval()` for JSON parsing - [ ] Validate input with `jsonschema` - [ ] Sanitize user input before serialization - [ ] Use `json.dumps()` to prevent injection - [ ] Escape special characters in user data ## Reference Documentation **Performance:** - `reference/python-json-parsing-best-practices-2025.md` - Comprehensive research with benchmarks **Patterns:** - `patterns/streaming-large-json.md` - ijson and JSONL strategies - `patterns/custom-object-serialization.md` - Handle datetime, UUID, custom classes - `patterns/jsonpath-querying.md` - Advanced nested data extraction **Security:** - `anti-patterns/security-json-injection.md` - Prevent injection attacks - `anti-patterns/eval-usage.md` - Why never to use eval() **Examples:** - `examples/high-performance-parsing.py` - orjson and msgspec code - `examples/large-file-streaming.py` - Streaming with ijson - `examples/secure-validation.py` - jsonschema validation **Tools:** - `tools/json-performance-benchmark.py` - Benchmark different libraries