Initial commit

2025-11-30 08:35:51 +08:00
commit d6a8021492
6 changed files with 740 additions and 0 deletions
--- a/skills/xml-element-extractor/reference.md
+++ b/skills/xml-element-extractor/reference.md
@@ -0,0 +1,260 @@
+# XML Element Extraction - Python Implementation
+
+This reference document explains the XML element extraction process used in the Python-based extraction script and provides troubleshooting guidance. The Python implementation provides better compatibility and handles special characters more reliably than the previous sed-based approach.
+
+## Core XML Processing Logic
+
+### Python ElementTree Parsing
+
+The Python script uses the standard library `xml.etree.ElementTree` for robust XML parsing:
+
+```python
+import xml.etree.ElementTree as ET
+
+# Parse the XML file
+tree = ET.parse(source_file)
+root = tree.getroot()
+```
+
+**Key advantages:**
+- Proper XML parsing that understands structure and encoding
+- Handles special characters, quotes, and XML entities correctly
+- Provides structured access to XML elements and attributes
+- Better error handling and validation
+- Cross-platform compatibility using Python standard library
+
+### Tag Matching Algorithm
+
+The script implements sophisticated tag matching to find exact element matches:
+
+```python
+def find_element_by_tag_string(tree_root, element_tag):
+    tag_name = extract_tag_name(element_tag)
+
+    # Find all elements with the matching tag name
+    for element in tree_root.iter(tag_name):
+        # Reconstruct the opening tag string with attributes
+        if element.attrib:
+            attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(element.attrib.items()))
+            constructed_tag = f'<{tag_name} {attrs}>'
+        else:
+            constructed_tag = f'<{tag_name}>'
+
+        # Compare normalized tags for case-insensitive attribute order matching
+        if normalize_tag(constructed_tag) == normalize_tag(element_tag):
+            return element
+```
+
+### Tag Name Extraction
+
+The script extracts tag names using regular expressions:
+
+```python
+def extract_tag_name(element_tag):
+    match = re.match(r'<\s*([^>\s]+)', element_tag)
+    if match:
+        return match.group(1)
+    return None
+```
+
+**Examples:**
+- `<InstrumentVector Id="0">` → `InstrumentVector`
+- `<MidiTrack>` → `MidiTrack`
+- `<ComplexTag attr1="value1" attr2="value2">` → `ComplexTag`
+
+### Tag Normalization
+
+To handle different attribute orders and whitespace variations:
+
+```python
+def normalize_tag(tag):
+    # Parse the tag to normalize it
+    match = re.match(r'<\s*([^>\s]+)(.*)>', tag)
+    if not match:
+        return tag
+
+    tag_name = match.group(1)
+    attrs_str = match.group(2).strip()
+
+    # Parse attributes and sort them for consistent comparison
+    attrs = {}
+    for attr_match in re.finditer(r'(\w+)\s*=\s*"([^"]*)"', attrs_str):
+        attr_name = attr_match.group(1)
+        attr_value = attr_match.group(2)
+        attrs[attr_name] = attr_value
+
+    # Reconstruct with sorted attributes
+    if attrs:
+        sorted_attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(attrs.items()))
+        return f'<{tag_name} {sorted_attrs}>'
+    else:
+        return f'<{tag_name}>'
+```
+
+## Troubleshooting Guide
+
+### Common Issues and Solutions
+
+#### 1. No Matching Element Found
+
+**Symptoms:**
+- Empty destination file
+- "No matching element found or extraction failed" error
+
+**Possible Causes:**
+- Incorrect tag spelling or case sensitivity
+- Missing or extra whitespace in tag
+- Attributes don't match exactly
+- Tag contains special characters needing escaping
+
+**Solutions:**
+- Verify exact tag spelling and case
+- Check for exact attribute matching
+- Use quotes around special characters in attributes
+- Validate source file contains the expected element
+
+#### 2. Multiple Elements Extracted
+
+**Symptoms:**
+- Output contains more than one element
+- Unexpected content in destination file
+
+**Possible Causes:**
+- Source file has nested identical elements
+- Closing tag matching is ambiguous
+
+**Solutions:**
+- The script should handle this with the "first element only" pattern
+- If issue persists, check source file structure
+- Consider using more specific attributes in opening tag
+
+#### 3. Special Characters in Tags
+
+**Symptoms:**
+- Sed syntax errors
+- Failed pattern matching
+
+**Common Special Characters:**
+- Quotes (single or double)
+- Ampersands (&)
+- Greater/less than signs within attributes
+- Unicode characters
+
+**Solutions:**
+- Properly quote the element tag when calling the script
+- Escape special characters if needed
+- Use exact character encoding from source file
+
+#### 4. File Permission Issues
+
+**Symptoms:**
+- "Source file does not exist" error
+- "Source file is not readable" error
+
+**Solutions:**
+- Verify file path is correct
+- Check file permissions: `ls -la source.xml`
+- Ensure read permissions: `chmod +r source.xml`
+- Check directory permissions if file is in subdirectory
+
+### Debugging Tips
+
+#### Test XML Patterns Manually
+
+Before using the script, test XML patterns:
+
+```bash
+# Test opening tag detection
+grep -n "<InstrumentVector" source.xml
+
+# Test element extraction with Python (single line test)
+python3 -c "
+import xml.etree.ElementTree as ET
+tree = ET.parse('source.xml')
+for elem in tree.iter('InstrumentVector'):
+    if elem.get('Id') == '0':
+        print('Found element with Id=0')
+        break
+"
+
+# Test tag name extraction
+echo '<InstrumentVector Id="0">' | python3 -c "
+import sys, re
+line = sys.stdin.read().strip()
+match = re.match(r'<\s*([^>\s]+)', line)
+if match: print(match.group(1))
+"
+```
+
+#### Validate XML Structure
+
+```bash
+# Check if XML is well-formed
+xmllint --noout source.xml
+
+# Pretty-print XML to understand structure
+xmllint --format source.xml | head -50
+```
+
+#### Check Element Count
+
+```bash
+# Count occurrences of specific element
+grep -c "<InstrumentVector" source.xml
+grep -c "</InstrumentVector>" source.xml
+```
+
+## Advanced Usage
+
+### Customizing for Specific XML Structures
+
+For complex XML structures, you can extend the Python script or use additional Python logic:
+
+**Nested Elements:**
+```python
+# Extract only direct children, not nested ones
+import xml.etree.ElementTree as ET
+
+tree = ET.parse('source.xml')
+parent = tree.find('.//Parent')
+if parent:
+    for child in parent:
+        if child.tag == 'Child':
+            print(ET.tostring(child).decode())
+```
+
+**Multiple Attributes:**
+```python
+# More specific matching with multiple attributes
+tree = ET.parse('source.xml')
+for elem in tree.iter('InstrumentVector'):
+    if elem.get('Id') == '0' and elem.get('Type') == 'Audio':
+        print(ET.tostring(elem).decode())
+        break
+```
+
+**Conditional Extraction:**
+```python
+# Extract only if element contains specific content
+tree = ET.parse('source.xml')
+for elem in tree.iter('InstrumentVector'):
+    if elem.findtext('.//SpecificContent') is not None:
+        print(ET.tostring(elem).decode())
+        break
+```
+
+### Performance Considerations
+
+For large XML files:
+- Consider using XML-specific tools like `xmlstarlet`
+- Process files in chunks if memory is limited
+- Use more specific patterns to reduce processing time
+
+## Alternative Tools
+
+For more complex XML processing needs:
+- `xmlstarlet`: XML-specific command-line tool with XPath support
+- `xmllint`: More robust XML processing and validation
+- Python with `lxml`: For advanced XML manipulation and XPath
+- `xpath`: Command-line XPath-based extraction
+- Python `BeautifulSoup`: For HTML/XML parsing with tolerance for malformed documents