Initial commit

2025-11-29 18:00:18 +08:00
commit 765529cd13
69 changed files with 18291 additions and 0 deletions
--- a/skills/python-json-parsing/SKILL.md
+++ b/skills/python-json-parsing/SKILL.md
@@ -0,0 +1,231 @@
+---
+name: python-json-parsing
+description: >
+  Python JSON parsing best practices covering performance optimization (orjson/msgspec),
+  handling large files (streaming/JSONL), security (injection prevention),
+  and advanced querying (JSONPath/JMESPath).
+  Use when working with JSON data, parsing APIs, handling large JSON files,
+  or optimizing JSON performance.
+---
+
+# Python JSON Parsing Best Practices
+
+Comprehensive guide to JSON parsing in Python with a focus on performance, security, and scalability.
+
+## Quick Start
+
+### Basic JSON Parsing
+
+```python
+import json
+
+# Parse JSON string
+data = json.loads('{"name": "Alice", "age": 30}')
+
+# Parse JSON file
+with open("data.json", "r", encoding="utf-8") as f:
+    data = json.load(f)
+
+# Write JSON file
+with open("output.json", "w", encoding="utf-8") as f:
+    json.dump(data, f, indent=2)
+```
+
+**Key Rule:** Always specify `encoding="utf-8"` when reading/writing files.
+
+## When to Use This Skill
+
+Use this skill when:
+
+- Working with JSON APIs or data interchange
+- Optimizing JSON performance in high-throughput applications
+- Handling large JSON files (> 100MB)
+- Securing applications against JSON injection
+- Extracting data from complex nested JSON structures
+
+## Performance: Choose the Right Library
+
+### Library Comparison (10,000 records benchmark)
+
+| Library | Serialize (s) | Deserialize (s) | Best For |
+|---------|---------------|-----------------|----------|
+| **orjson** | 0.42 | 1.27 | FastAPI, web APIs (3.9x faster serialization than stdlib) |
+| **msgspec** | 0.49 | 0.93 | Maximum performance (1.7x faster deserialization than stdlib) |
+| **json** (stdlib) | 1.62 | 1.62 | Universal compatibility (baseline) |
+| **ujson** | 1.41 | 1.85 | Drop-in replacement (about 1.15x faster serialization; slower deserialization in this benchmark) |
+
+**Recommendation:**
+
+- Use **orjson** for FastAPI/web APIs (native support, fastest serialization)
+- Use **msgspec** for data pipelines (fastest overall, typed validation)
+- Use **json** when compatibility is critical
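+
+A minimal sketch of adopting orjson without giving up stdlib compatibility, assuming orjson may or may not be installed. The `dumps`/`loads` wrapper names are illustrative and not part of either library. `orjson.dumps()` returns `bytes` rather than `str`, so the fallback encodes its output to match:
+
+```python
+import json
+
+try:
+    import orjson  # optional third-party dependency
+
+    def dumps(obj) -> bytes:
+        return orjson.dumps(obj)
+
+    def loads(data):
+        return orjson.loads(data)
+
+except ImportError:
+    # Fallback: stdlib json, encoded to bytes so callers see one return type
+    def dumps(obj) -> bytes:
+        return json.dumps(obj, separators=(",", ":")).encode("utf-8")
+
+    def loads(data):
+        return json.loads(data)
+
+payload = dumps({"name": "Alice", "age": 30})
+print(loads(payload))
+```
+
+Callers that need a `str` can decode the result with `payload.decode("utf-8")`.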
+
+### Installation
+
+```bash
+# High-performance libraries
+pip install orjson msgspec ujson
+
+# Advanced querying
+pip install jsonpath-ng jmespath
+
+# Streaming large files
+pip install ijson
+
+# Schema validation
+pip install jsonschema
+```
+
+## Large Files: Streaming Strategies
+
+For files > 100 MB, avoid loading the whole document into memory.
+
+### Strategy 1: JSONL (JSON Lines)
+
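+Store large datasets as line-delimited JSON (one object per line) so each record can be parsed on its own:
+
+```python
+import json
+
+# Stream-process JSONL one record at a time
+with open("large.jsonl", "r", encoding="utf-8") as infile, \
+     open("output.jsonl", "w", encoding="utf-8") as outfile:
+    for line in infile:
+        if not line.strip():
+            continue  # skip blank lines
+        obj = json.loads(line)
+        obj["processed"] = True
+        outfile.write(json.dumps(obj) + "\n")
+```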
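+
+The same idea can be wrapped in reusable helpers. This is a sketch using only the standard library; `iter_jsonl` and `write_jsonl` are hypothetical names chosen for illustration. The reader reports the line number of a malformed record, which helps when a single bad line appears in a multi-gigabyte file:
+
+```python
+import json
+from typing import Any, Iterable, Iterator
+
+def iter_jsonl(path: str) -> Iterator[Any]:
+    """Yield one parsed record per non-blank line of a JSONL file."""
+    with open(path, "r", encoding="utf-8") as f:
+        for line_number, line in enumerate(f, start=1):
+            if not line.strip():
+                continue  # tolerate blank lines
+            try:
+                yield json.loads(line)
+            except json.JSONDecodeError as exc:
+                raise ValueError(
+                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
+                ) from exc
+
+def write_jsonl(path: str, records: Iterable[Any]) -> None:
+    """Write each record as one JSON document per line."""
+    with open(path, "w", encoding="utf-8") as f:
+        for record in records:
+            f.write(json.dumps(record, ensure_ascii=False) + "\n")
+```
+
+Because `iter_jsonl` is a generator, only one record is held in memory at a time.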
+
+### Strategy 2: Streaming with ijson
+
+```python
+import ijson
+
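+# Process large JSON without loading it into memory.
+# "products.item" selects each element of the top-level "products" array.
+# Note: ijson yields Decimal for non-integer numbers unless use_float=True is passed.
+with open("huge.json", "rb") as f:
+    for item in ijson.items(f, "products.item"):
+        process(item)  # placeholder: your own per-item handler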
+```
+
+See: `patterns/streaming-large-json.md`
+
+## Security: Prevent JSON Injection
+
+**Critical Rules:**
+
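+1. Always parse with `json.loads()`, never `eval()`
+2. Validate parsed input against a schema with `jsonschema`
+3. Sanitize user input (type, length, allowed values) before serialization
+4. Let `json.dumps()` escape special characters (`"` and `\`); never hand-build or hand-escape JSON strings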
+
+**Vulnerable Code:**
+
+```python
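+# NEVER DO THIS
+username = request.GET['username']  # User input: admin", "role": "admin
+json_string = f'{{"user":"{username}","role":"user"}}'
+# Result: {"user":"admin", "role": "admin","role":"user"}
+# The injected duplicate "role" key can escalate privileges in parsers that
+# keep the first value (Python's json module keeps the last one).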
+```
+
+**Secure Code:**
+
+```python
+import json
+
+# Use json.dumps for serialization
+data = {"user": username, "role": "user"}
+json_string = json.dumps(data)  # Quotes and backslashes in username are escaped
+```
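+
+The difference can be checked with a self-contained sketch that uses only the standard library. The `username` payload mirrors the example above, and `reject_duplicates` is a hypothetical helper showing one optional way to harden parsing of untrusted input through `object_pairs_hook`:
+
+```python
+import json
+
+# Hypothetical attacker-controlled value, mirroring the example above
+username = 'admin", "role": "admin'
+
+# Hand-built JSON: the payload closes the string and injects a second key
+unsafe = f'{{"user":"{username}","role":"user"}}'
+
+# json.dumps escapes the quotes, so the payload stays inside the "user" value
+safe = json.dumps({"user": username, "role": "user"})
+parsed = json.loads(safe)
+assert parsed["user"] == username
+assert parsed["role"] == "user"
+
+def reject_duplicates(pairs):
+    """Raise instead of silently keeping one of several duplicate keys."""
+    keys = [key for key, _ in pairs]
+    if len(keys) != len(set(keys)):
+        raise ValueError("duplicate key in JSON object")
+    return dict(pairs)
+
+try:
+    json.loads(unsafe, object_pairs_hook=reject_duplicates)
+except ValueError as exc:
+    print(f"rejected: {exc}")
+```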
+
+See: `anti-patterns/security-json-injection.md`, `anti-patterns/eval-usage.md`
+
+## Advanced: JSONPath for Complex Queries
+
+Extract data from nested JSON without complex loops:
+
+```python
+# Filter expressions ([?...]) require the extended parser in jsonpath_ng.ext
+from jsonpath_ng.ext import parse
+
+data = {
+    "products": [
+        {"name": "Apple", "price": 12.88},
+        {"name": "Peach", "price": 27.25}
+    ]
+}
+
+# Filter by price
+query = parse("products[?price>20].name")
+results = [match.value for match in query.find(data)]
+# Output: ["Peach"]
+```
+
+**Key Operators:**
+
+- `$` - Root selector
+- `..` - Recursive descendant
+- `*` - Wildcard
+- `[?<predicate>]` - Filter (e.g., `[?price > 20]`)
+- `[start:end:step]` - Array slicing
+
+See: `patterns/jsonpath-querying.md`
+
+## Custom Objects: Serialization
+
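+Handle types the stdlib encoder rejects, such as datetime, UUID, Decimal, and set:
+
+```python
+import json
+from datetime import datetime
+from decimal import Decimal
+from uuid import UUID
+
+class CustomEncoder(json.JSONEncoder):
+    def default(self, obj):
+        if isinstance(obj, datetime):
+            return obj.isoformat()
+        if isinstance(obj, (UUID, Decimal)):
+            return str(obj)  # strings preserve the exact value
+        if isinstance(obj, set):
+            return list(obj)  # element order is not guaranteed
+        return super().default(obj)  # raises TypeError for unknown types
+
+# Usage
+data = {"timestamp": datetime.now(), "tags": {"python", "json"}}
+json_str = json.dumps(data, cls=CustomEncoder)
+```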
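+
+Custom classes can follow the same `default()` pattern. The sketch below assumes the class is a dataclass; `Product` and `DataclassEncoder` are hypothetical names used only for illustration. Decoding is done by passing the parsed dict back to the constructor:
+
+```python
+import json
+from dataclasses import asdict, dataclass, is_dataclass
+
+@dataclass
+class Product:  # hypothetical example class
+    name: str
+    price: float
+
+class DataclassEncoder(json.JSONEncoder):
+    def default(self, obj):
+        # is_dataclass() is also true for the class itself, so exclude types
+        if is_dataclass(obj) and not isinstance(obj, type):
+            return asdict(obj)
+        return super().default(obj)
+
+encoded = json.dumps(Product("Apple", 12.88), cls=DataclassEncoder)
+decoded = Product(**json.loads(encoded))
+```
+
+This round trip works for flat dataclasses; nested dataclasses would need their own reconstruction step on decode.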
+
+See: `patterns/custom-object-serialization.md`
+
+## Performance Checklist
+
+- [ ] Use orjson/msgspec for high-throughput applications
+- [ ] Specify UTF-8 encoding when reading/writing files
+- [ ] Use streaming (ijson/JSONL) for files > 100MB
+- [ ] Minify JSON for production (`separators=(',', ':')`)
+- [ ] Pretty-print for development (`indent=2`)
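+
+The last two items differ only in the arguments passed to `json.dumps()`. A small sketch with an illustrative `record` shows that both forms carry the same data while the compact one drops all optional whitespace:
+
+```python
+import json
+
+record = {"name": "Alice", "tags": ["python", "json"], "active": True}
+
+compact = json.dumps(record, separators=(",", ":"))  # production: no spaces
+pretty = json.dumps(record, indent=2)                # development: readable
+
+print(len(compact), len(pretty))
+assert json.loads(compact) == json.loads(pretty) == record
+```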
+
+## Security Checklist
+
+- [ ] Never use `eval()` for JSON parsing
+- [ ] Validate input with `jsonschema`
+- [ ] Sanitize user input before serialization
+- [ ] Use `json.dumps()` to prevent injection
+- [ ] Let the serializer escape special characters; never hand-build JSON strings from user data
+
+## Reference Documentation
+
+**Performance:**
+
+- `reference/python-json-parsing-best-practices-2025.md` - Comprehensive research with benchmarks
+
+**Patterns:**
+
+- `patterns/streaming-large-json.md` - ijson and JSONL strategies
+- `patterns/custom-object-serialization.md` - Handle datetime, UUID, custom classes
+- `patterns/jsonpath-querying.md` - Advanced nested data extraction
+
+**Security:**
+
+- `anti-patterns/security-json-injection.md` - Prevent injection attacks
+- `anti-patterns/eval-usage.md` - Why never to use eval()
+
+**Examples:**
+
+- `examples/high-performance-parsing.py` - orjson and msgspec code
+- `examples/large-file-streaming.py` - Streaming with ijson
+- `examples/secure-validation.py` - jsonschema validation
+
+**Tools:**
+
+- `tools/json-performance-benchmark.py` - Benchmark different libraries