# Python JSON Parsing Best Practices (2025)

**Research Date:** October 31, 2025
**Query:** Best practices for parsing JSON in Python 2025

## Executive Summary

JSON parsing in Python has evolved significantly with performance-optimized libraries and enhanced security practices. This research identifies critical best practices for developers working with JSON data in 2025, covering library selection, performance optimization, security considerations, and handling large-scale datasets.

---

## 1. Core Library Selection & Performance

### Standard Library (`json` module)

The built-in `json` module remains the baseline for JSON operations in Python:

- **Serialization**: `json.dumps()` converts Python objects to JSON strings
- **Deserialization**: `json.loads()` parses JSON strings to Python objects
- **File Operations**: `json.dump()` and `json.load()` for direct file I/O
- **Performance**: Adequate for most use cases but slower than alternatives

**Key Insight**: Always specify an encoding when working with files; UTF-8 is the standard for JSON interchange per RFC 8259.

```python
import json

# Best practice: always specify the encoding explicitly
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)
```

**Source**: [Real Python - Working With JSON Data in Python](https://realpython.com/python-json)

### High-Performance Alternatives (2025 Benchmarks)

Based on benchmarking 10,000 records over 10 runs:

| Library | Serialization (s) | Deserialization (s) | Key Features |
|---------|-------------------|---------------------|--------------|
| **orjson** | 0.417962 | 1.272813 | Rust-based, fastest serialization, built-in FastAPI support |
| **msgspec** | 0.489964 | 0.930834 | Ultra-fast, typed structs, supports YAML/TOML |
| **json** (stdlib) | 1.616786 | 1.616203 | Universal compatibility, stable |
| **ujson** | 1.413367 | 1.853332 | C-based, drop-in replacement |
| **rapidjson** | 2.044958 | 1.717067 | C++ wrapper, flexible |

**Recommendation**:

- Use **orjson** for web APIs (FastAPI-native support, 3.9x faster serialization than stdlib)
- Use **msgspec** for maximum performance across all operations (1.7x faster deserialization than stdlib)
- Stick with **json** for compatibility-critical applications
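
To make the trade-off concrete, here is a minimal sketch of what each swap looks like. The `payload` dict and `Product` struct are illustrative; the one drop-in difference to note is that `orjson.dumps()` returns `bytes` rather than `str`.

```python
import orjson
import msgspec

payload = {"id": 1, "name": "Widget", "tags": ["a", "b"]}

raw = orjson.dumps(payload)  # returns bytes, not str
data = orjson.loads(raw)     # accepts bytes or str

# msgspec can decode straight into a typed struct, validating as it parses
class Product(msgspec.Struct):
    id: int
    name: str
    tags: list[str]

product = msgspec.json.decode(raw, type=Product)
```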

**Source**: [DEV Community - Benchmarking Python JSON Libraries](https://dev.to/kanakos01/benchmarking-python-json-libraries-33bb)

---

## 2. Handling Large JSON Files

### Problem: Memory Constraints

Loading multi-million-line JSON files with `json.load()` pulls the entire document into memory at once and can exhaust available RAM.

### Solution Strategies

#### Strategy 1: JSON Lines (JSONL) Format

Convert large JSON arrays to line-delimited format for streaming processing:

```python
import json

# Streaming read + process + write, one record per line
with open("big.jsonl", "r", encoding="utf-8") as infile, \
     open("new.jsonl", "w", encoding="utf-8") as outfile:
    for line in infile:
        obj = json.loads(line)
        obj["status"] = "processed"
        outfile.write(json.dumps(obj) + "\n")
```

**Benefits**:

- Easy appending of new records
- Line-by-line updates without rewriting the entire file
- Native support in pandas, Spark, and `jq` (see the pandas sketch below)
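
As a quick illustration of the pandas support mentioned above (a sketch reusing the hypothetical `big.jsonl` file):

```python
import pandas as pd

# JSONL loads directly into a DataFrame
df = pd.read_json("big.jsonl", lines=True)

# Or bound memory by iterating in chunks (chunksize requires lines=True)
for chunk in pd.read_json("big.jsonl", lines=True, chunksize=10_000):
    ...  # each chunk is a DataFrame
```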

**Source**: [DEV Community - Handling Large JSON Files in Python](https://dev.to/lovestaco/handling-large-json-files-in-python-efficient-read-write-and-update-strategies-3jgg)

#### Strategy 2: Incremental Parsing with `ijson`

For true JSON arrays/objects, use a streaming parser:

```python
import ijson

# Process a large file without loading it into memory;
# "products.item" yields each element of the top-level "products" array
with open("huge.json", "rb") as f:
    for item in ijson.items(f, "products.item"):
        process(item)  # handle one item at a time
```

#### Strategy 3: Database Migration

For frequently queried or updated data, migrate from JSON files to a database (a SQLite sketch follows this list):

- **SQLite**: Lightweight, file-based
- **PostgreSQL/MongoDB**: Scalable solutions

**Critical Decision Matrix**:

- Append-only additions → JSONL format
- Batch updates → Stream read + rewrite
- Frequent random updates → Database
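
A minimal sketch of the SQLite route, assuming a build with the JSON1 functions (standard in modern SQLite); the table layout is illustrative:

```python
import json
import sqlite3

con = sqlite3.connect("data.db")
con.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, doc TEXT)")

record = {"name": "Peach", "price": 27.25}
con.execute("INSERT INTO products (doc) VALUES (?)", (json.dumps(record),))
con.commit()

# JSON1 functions can query inside the stored document
rows = con.execute(
    "SELECT json_extract(doc, '$.name') FROM products"
    " WHERE json_extract(doc, '$.price') > 20"
).fetchall()
# [('Peach',)]
```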

**Source**: [DEV Community - Handling Large JSON Files](https://dev.to/lovestaco/handling-large-json-files-in-python-efficient-read-write-and-update-strategies-3jgg)

---

## 3. Advanced Parsing with JSONPath & JMESPath

### JSONPath: XPath for JSON

Use JSONPath for nested data extraction with complex queries:

```python
from jsonpath_ng.ext import parse  # filter expressions require the ext parser

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25},
    ]
}

# Filter by price
query = parse("$.products[?price > 20].name")
results = [match.value for match in query.find(data)]
# Output: ["Peach"]
```

**Key Operators**:

- `$` - Root selector
- `..` - Recursive descent
- `*` - Wildcard
- `[?<predicate>]` - Filter (e.g., `[?price > 20 & price < 100]`)
- `[start:end:step]` - Array slicing

**Use Cases**:

- Web scraping hidden JSON data in `<script>` tags
- Extracting nested API response data
- Complex filtering across multiple levels

**Source**: [ScrapFly - Introduction to Parsing JSON with Python JSONPath](https://scrapfly.io/blog/posts/parse-json-jsonpath-python)

### JMESPath Alternative

JMESPath offers easier dataset mutation and filtering for predictable structures, while JSONPath excels at extracting deeply nested data.
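
For comparison, the same hypothetical product data queried with the `jmespath` package; note that JMESPath wraps numeric literals in backticks:

```python
import jmespath

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25},
    ]
}

names = jmespath.search("products[?price > `20`].name", data)
# ["Peach"]
```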

---

## 4. Security Best Practices

### Critical Vulnerabilities

#### JSON Injection Attacks

**Server-side injection** occurs when unsanitized user input is interpolated directly into a JSON string:

```python
# VULNERABLE CODE: building JSON via string interpolation
username = request.GET["username"]  # attacker input: admin", "role": "administrator
json_string = f'{{"role": "user", "user": "{username}"}}'
# Result: {"role": "user", "user": "admin", "role": "administrator"}
# Parsers that keep the last duplicate key (including Python's json module)
# resolve role to "administrator" → privilege escalation
```
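
The remedy is to let the serializer build the document. A sketch mirroring the vulnerable example above; `json.dumps()` escapes the quotes, so the injected text stays inside the string value:

```python
import json

# SAFE: serialize a dict instead of formatting a string
username = request.GET["username"]
json_string = json.dumps({"role": "user", "user": username})
# -> {"role": "user", "user": "admin\", \"role\": \"administrator"}
```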

**Client-side injection** via `eval()`:

```python
# NEVER DO THIS
data = eval("(" + json_response + ")")  # arbitrary code execution risk!

# CORRECT APPROACH
data = json.loads(json_response)  # safe parsing
```

**Source**: [Comparitech - JSON Injection Guide](https://www.comparitech.com/net-admin/json-injection-guide)

### Defense Strategies

1. **Input Sanitization**: Validate and escape all user input before serialization
2. **Never Use `eval()`**: Always use `json.loads()` (or `JSON.parse()` in JavaScript)
3. **Schema Validation**: Use the `jsonschema` library to enforce data contracts, as shown below
4. **Content Security Policy (CSP)**: Blocks `eval()` in browsers by default
5. **Escape Special Characters**: Properly escape `"` and `\` in user data, or better, build JSON with `json.dumps()` rather than string formatting

```python
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "required": ["id", "name", "email"],
    "properties": {
        "id": {"type": "integer", "minimum": 1},
        # note: "format" is only enforced when a FormatChecker is passed
        "email": {"type": "string", "format": "email"},
    },
}

try:
    validate(instance=user_data, schema=schema)  # user_data: parsed, untrusted input
except ValidationError as e:
    print(f"Invalid data: {e.message}")
```

**Source**: [Better Stack - Working With JSON Data in Python](https://betterstack.com/community/guides/scaling-nodejs/json-data-in-python)

---

## 5. Type Handling & Custom Objects

### Python ↔ JSON Type Mapping

| Python type | Serializes to (JSON) | JSON type | Deserializes to (Python) |
|-------------|----------------------|-----------|--------------------------|
| dict | object | object | dict |
| list, tuple | array | array | list ⚠️ |
| str | string | string | str |
| int, float | number | number | int/float |
| True/False | true/false | true/false | True/False |
| None | null | null | None |

**⚠️ Gotcha**: Tuples serialize to arrays but deserialize back to lists, so the tuple type does not survive a round trip.
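
A short demonstration of the round-trip loss:

```python
import json

point = (3, 5)
restored = json.loads(json.dumps(point))
print(restored)          # [3, 5] - a list, not a tuple
point = tuple(restored)  # restore the type explicitly if needed
```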

### Custom Object Serialization

```python
from datetime import datetime
import json

class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        # Called only for objects the base encoder cannot serialize
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, set):
            return list(obj)
        return super().default(obj)  # raises TypeError for unsupported types

# Usage
data = {"timestamp": datetime.now(), "tags": {"python", "json"}}
json_str = json.dumps(data, cls=CustomEncoder)
```

**Advanced Alternative**: Use **Pydantic** or **msgspec** for typed validation and automatic serialization.
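
A minimal Pydantic sketch (v2 API, hypothetical `Event` model) showing the typed round trip that replaces the hand-written encoder:

```python
from datetime import datetime
from pydantic import BaseModel

class Event(BaseModel):
    name: str
    timestamp: datetime
    tags: set[str]

event = Event(name="deploy", timestamp=datetime.now(), tags={"ci", "prod"})
json_str = event.model_dump_json()            # datetimes and sets handled automatically
parsed = Event.model_validate_json(json_str)  # parsing validates against the model
```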

**Source**: [Better Stack Community Guide](https://betterstack.com/community/guides/scaling-nodejs/json-data-in-python)

---

## 6. Formatting & Debugging

### Pretty Printing

```python
import json

# Readable output with indentation (data: any serializable object)
json_str = json.dumps(data, indent=2, sort_keys=True)
```

```bash
# Command-line validation and formatting

# Validate a JSON file
python -m json.tool config.json

# Pretty-print to a new file
python -m json.tool input.json output.json --indent 2
```

### Minification for Production

```python
# Remove optional whitespace for minimal size
minified = json.dumps(data, separators=(",", ":"))
```

```bash
# Command line
python -m json.tool --compact input.json output.json
```

**Performance Impact**: Pretty-printed JSON can be roughly 2x larger than minified output (308 bytes → 645 bytes in one benchmark).

**Source**: [Real Python - Working With JSON](https://realpython.com/python-json)

---

## 7. Web Scraping: Handling Non-Standard JSON

### ChompJS for JavaScript Objects

Many websites embed data in JavaScript object literals that are not valid JSON:

```python
import chompjs

# These are valid JavaScript but invalid JSON:
js_objects = [
    "{'a': 'b'}",       # single quotes
    "{a: 'b'}",         # unquoted keys
    '{"a": [1,2,3,]}',  # trailing comma
    '{"price": .99}',   # missing leading zero
]

# chompjs handles all of these
for js in js_objects:
    python_dict = chompjs.parse_js_object(js)
```

**Use Case**: Extracting hidden web data from `<script>` tags containing JavaScript initializers.

**Source**: [Zyte - JSON Parsing with Python](https://www.zyte.com/blog/json-parsing-with-python)

---

## 8. Production Optimization Checklist

### High-Performance Applications

Based on LinkedIn engineering insights for APIs serving millions of requests:

1. **Library Selection**:
   - FastAPI apps → `orjson` (native support, ~4x faster serialization)
   - Data pipelines → `msgspec` (fastest overall)
   - General use → `ujson` (C-based drop-in replacement; gains over stdlib are modest per the benchmarks in Section 1)

2. **Streaming for Scale**:
   - Use `ijson` for files > 100 MB
   - Convert to JSONL for append-heavy workloads
   - Consider Protocol Buffers for ultra-high performance

3. **Buffer Optimization**:
   - Profile with `cProfile` to identify bottlenecks
   - Tune I/O buffer sizes for your workload
   - Use async I/O for concurrent request handling

4. **Monitoring**:
   - Track JSON processing time metrics
   - Set up alerts for parsing errors
   - Continuously benchmark against SLAs

**Source**: [LinkedIn - Optimizing JSON Parsing and Serialization](https://linkedin.com/pulse/optimizing-json-parsing-serialization-applications-amit-jindal-1g0tf)

---

## 9. Logging Best Practices

### Structured JSON Logging

```python
import logging
import json

# Configure JSON logging from the start
class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "user_id": getattr(record, "user_id", None),
            "session_id": getattr(record, "session_id", None),
        }
        return json.dumps(log_data)
```

**Benefits**:

- Easy parsing and searching
- Structured database storage
- Better correlation across services
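
A minimal wiring sketch for the formatter above; `extra=` is the stdlib mechanism that populates the custom attributes read via `getattr()`:

```python
import logging

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("login succeeded", extra={"user_id": 42, "session_id": "abc123"})
```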

**Schema Design Tips**:

- Use consistent key naming (snake_case recommended)
- Flatten structures when possible by concatenating nested keys with a separator (sketch below)
- Keep data types uniform per field
- Parse stack traces into hierarchical attributes
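
One way to implement the flattening tip; the `flatten` helper is illustrative, not from the source:

```python
def flatten(d: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts: {"a": {"b": 1}} -> {"a.b": 1}."""
    out = {}
    for key, value in d.items():
        full_key = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            out.update(flatten(value, full_key, sep))
        else:
            out[full_key] = value
    return out
```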

**Source**: [Graylog - What To Know About Parsing JSON](https://graylog.org/post/what-to-know-parsing-json)

---

## 10. Key Takeaways for Developers

### Must-Do Practices

1. ✅ **Always use `json.loads()`, never `eval()`**, for security
2. ✅ **Specify UTF-8 encoding** when reading/writing files
3. ✅ **Validate input** with `jsonschema` before processing
4. ✅ **Choose a performance library** based on workload (orjson/msgspec)
5. ✅ **Use JSONL format** for large, append-heavy datasets
6. ✅ **Implement streaming** for files > 100 MB
7. ✅ **Pretty-print for development**, minify for production
8. ✅ **Add structured logging** from project start

### Common Pitfalls to Avoid

1. ❌ Loading entire large files into memory
2. ❌ Using `eval()` for JSON parsing
3. ❌ Skipping input validation on user data
4. ❌ Ignoring type conversions (tuple → list)
5. ❌ Not handling parsing exceptions properly (see the sketch below)
6. ❌ Over-logging during parsing (performance impact)
7. ❌ Using sequential IDs in exposed payloads (enumeration risk; use UUIDs/GUIDs)
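
On pitfall 5, a minimal sketch of explicit error handling (`parse_payload` is an illustrative helper); `json.JSONDecodeError` carries line and column information for diagnostics:

```python
import json
import logging

def parse_payload(raw: str):
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        logging.warning("Bad JSON at line %d col %d: %s", e.lineno, e.colno, e.msg)
        return None
```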

---

## References

1. [Real Python - Working With JSON Data in Python](https://realpython.com/python-json) - Comprehensive guide to the `json` module, Aug 2025
2. [Better Stack Community - JSON Data in Python](https://betterstack.com/community/guides/scaling-nodejs/json-data-in-python) - Advanced techniques and validation, Apr 2025
3. [DEV Community - Handling Large JSON Files](https://dev.to/lovestaco/handling-large-json-files-in-python-efficient-read-write-and-update-strategies-3jgg) - Strategies for massive datasets, Oct 2025
4. [ScrapFly - JSONPath in Python](https://scrapfly.io/blog/posts/parse-json-jsonpath-python) - Advanced querying techniques, Sep 2025
5. [DEV Community - Benchmarking JSON Libraries](https://dev.to/kanakos01/benchmarking-python-json-libraries-33bb) - Performance comparison, Jul 2025
6. [LinkedIn - Optimizing JSON for High-Performance](https://linkedin.com/pulse/optimizing-json-parsing-serialization-applications-amit-jindal-1g0tf) - Enterprise optimization, Mar 2025
7. [Graylog - Parsing JSON](https://graylog.org/post/what-to-know-parsing-json) - Logging best practices, Mar 2025
8. [Comparitech - JSON Injection Guide](https://www.comparitech.com/net-admin/json-injection-guide) - Security vulnerabilities, Nov 2024
9. [Zyte - JSON Parsing with Python](https://www.zyte.com/blog/json-parsing-with-python) - Practical guide, Dec 2024

---

**Document Version:** 1.0
**Last Updated:** October 31, 2025
**Maintained By:** Lunar Claude Research Team