Initial commit
skills/xml-element-extractor/reference.md
# XML Element Extraction - Python Implementation

This reference document explains the XML element extraction process used in the Python-based extraction script and provides troubleshooting guidance. The Python implementation provides better compatibility and handles special characters more reliably than the previous sed-based approach.

## Core XML Processing Logic

### Python ElementTree Parsing

The Python script uses the standard library `xml.etree.ElementTree` for robust XML parsing:

```python
import xml.etree.ElementTree as ET

# Parse the XML file
tree = ET.parse(source_file)
root = tree.getroot()
```

**Key advantages:**
- Proper XML parsing that understands structure and encoding
- Handles special characters, quotes, and XML entities correctly
- Provides structured access to XML elements and attributes
- Better error handling and validation
- Cross-platform compatibility using Python standard library
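
As a rough illustration of the entity-handling and error-handling points above, the sketch below parses one well-formed and one malformed string. The helper name `parse_or_report` and the sample XML are assumptions made for this example; they are not taken from the extraction script.

```python
import xml.etree.ElementTree as ET

def parse_or_report(xml_text):
    """Return the root element, or None after reporting where parsing failed.

    Hypothetical helper for illustration only.
    """
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple
        line, column = err.position
        print(f"Not well-formed at line {line}, column {column}: {err}")
        return None

# The entity &amp; is decoded to a plain ampersand by the parser
root = parse_or_report('<Track Name="Drums &amp; Bass"><Clip/></Track>')
print(root.get('Name'))

# Mismatched tags raise ParseError instead of producing partial output
broken = parse_or_report('<Track><Clip></Track>')
```

A text-based approach would see the literal `&amp;` in the first string, whereas the parser hands back the decoded value.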

### Tag Matching Algorithm

The script finds the target element by rebuilding each candidate's opening tag from its parsed attributes and comparing it, after normalization, with the tag string passed in. Attribute order and whitespace are ignored; tag names, attribute names, and attribute values must otherwise match exactly, including letter case. The first match in document order is returned:

```python
def find_element_by_tag_string(tree_root, element_tag):
    tag_name = extract_tag_name(element_tag)

    # Find all elements with the matching tag name
    for element in tree_root.iter(tag_name):
        # Reconstruct the opening tag string with attributes
        if element.attrib:
            attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(element.attrib.items()))
            constructed_tag = f'<{tag_name} {attrs}>'
        else:
            constructed_tag = f'<{tag_name}>'

        # Compare normalized tags so that attribute order and whitespace do not matter
        if normalize_tag(constructed_tag) == normalize_tag(element_tag):
            return element

    # No element matched
    return None
```

### Tag Name Extraction

The script extracts the tag name from the opening tag string with a regular expression:

```python
import re

def extract_tag_name(element_tag):
    # Skip '<' and optional whitespace, then capture up to the next space or '>'
    match = re.match(r'<\s*([^>\s]+)', element_tag)
    if match:
        return match.group(1)
    return None
```

**Examples:**
- `<InstrumentVector Id="0">` → `InstrumentVector`
- `<MidiTrack>` → `MidiTrack`
- `<ComplexTag attr1="value1" attr2="value2">` → `ComplexTag`
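
The examples above cover ordinary opening tags. The following sketch runs the same regular expression on a few edge cases to show where it stops being reliable. The function body is a condensed restatement of `extract_tag_name`, and the sample inputs are made up for illustration.

```python
import re

def extract_tag_name(element_tag):
    # Condensed restatement of the function shown above
    match = re.match(r'<\s*([^>\s]+)', element_tag)
    return match.group(1) if match else None

samples = [
    '<InstrumentVector Id="0">',  # ordinary opening tag
    '< MidiTrack >',              # whitespace after '<' is skipped
    '<MidiTrack/>',               # self-closing form keeps the trailing '/'
    'MidiTrack',                  # no leading '<', so nothing matches
]

results = [extract_tag_name(sample) for sample in samples]
for sample, name in zip(samples, results):
    print(repr(sample), '->', repr(name))
```

The third case suggests passing the opening tag without a trailing `/`, since the slash would otherwise become part of the extracted name.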

### Tag Normalization

To tolerate differences in attribute order and whitespace, both tags are normalized before they are compared:

```python
import re

def normalize_tag(tag):
    # Parse the tag to normalize it
    match = re.match(r'<\s*([^>\s]+)(.*)>', tag)
    if not match:
        return tag

    tag_name = match.group(1)
    attrs_str = match.group(2).strip()

    # Parse attributes and sort them for consistent comparison
    # (\w+ only matches attribute names made of letters, digits, and underscores)
    attrs = {}
    for attr_match in re.finditer(r'(\w+)\s*=\s*"([^"]*)"', attrs_str):
        attr_name = attr_match.group(1)
        attr_value = attr_match.group(2)
        attrs[attr_name] = attr_value

    # Reconstruct with sorted attributes
    if attrs:
        sorted_attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(attrs.items()))
        return f'<{tag_name} {sorted_attrs}>'
    else:
        return f'<{tag_name}>'
```
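
To make the effect of normalization concrete, the sketch below feeds three spellings of the same tag through a condensed restatement of `normalize_tag`. The `Type` attribute and the sample tags are assumptions chosen for the example.

```python
import re

ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def normalize_tag(tag):
    # Condensed restatement of the function shown above
    match = re.match(r'<\s*([^>\s]+)(.*)>', tag)
    if not match:
        return tag
    attrs = dict(ATTR_RE.findall(match.group(2)))
    if not attrs:
        return f'<{match.group(1)}>'
    joined = ' '.join(f'{k}="{v}"' for k, v in sorted(attrs.items()))
    return f'<{match.group(1)} {joined}>'

# Different attribute order and extra whitespace
a = normalize_tag('<InstrumentVector Type="Audio"   Id="0">')
b = normalize_tag('<InstrumentVector Id="0" Type="Audio">')
# Different letter case in an attribute name
c = normalize_tag('<InstrumentVector id="0" Type="Audio">')

print(a == b)  # order and whitespace are normalized away
print(a == c)  # letter case is not
```

The first two inputs normalize to the same string, while the third stays distinct because nothing in the normalization changes letter case.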

## Troubleshooting Guide

### Common Issues and Solutions

#### 1. No Matching Element Found

**Symptoms:**
- Empty destination file
- "No matching element found or extraction failed" error

**Possible Causes:**
- Incorrect tag spelling or case sensitivity
- Missing or extra whitespace in tag
- Attributes don't match exactly
- Tag contains special characters needing escaping

**Solutions:**
- Verify exact tag spelling and case
- Check for exact attribute matching
- Use quotes around special characters in attributes
- Validate source file contains the expected element
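
One way to check spelling and attributes is to print the opening tags the parser actually sees and compare them with the tag being passed to the script. The helper `list_candidate_tags` below is a hypothetical sketch, not part of the script. It rebuilds tags the same way the matching code in this document does, and the sample XML is made up.

```python
import xml.etree.ElementTree as ET

def list_candidate_tags(root, tag_name):
    """Return the reconstructed opening tag of every element named tag_name.

    Hypothetical diagnostic helper for illustration only.
    """
    tags = []
    for element in root.iter(tag_name):
        attrs = ' '.join(f'{k}="{v}"' for k, v in sorted(element.attrib.items()))
        tags.append(f'<{tag_name} {attrs}>' if attrs else f'<{tag_name}>')
    return tags

root = ET.fromstring(
    '<Set>'
    '<InstrumentVector Id="0"/>'
    '<InstrumentVector Id="1" Type="Audio"/>'
    '<instrumentvector Id="2"/>'  # different case, so it is a different tag
    '</Set>'
)

candidates = list_candidate_tags(root, 'InstrumentVector')
for tag in candidates:
    print(tag)
```

If the tag given to the script does not appear in this list, even allowing for attribute order, the mismatch is in the name, the letter case, or an attribute value.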

#### 2. Multiple Elements Extracted

**Symptoms:**
- Output contains more than one element
- Unexpected content in destination file

**Possible Causes:**
- Source file has nested identical elements
- Closing tag matching is ambiguous

**Solutions:**
- The script should handle this with the "first element only" pattern
- If issue persists, check source file structure
- Consider using more specific attributes in opening tag
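
To see whether a tag is ambiguous, it can help to count how many elements carry the same name and attributes, including nested ones. The helper `count_matches` below is a hypothetical sketch and the `Group` elements are made-up sample data. It checks only the attributes listed, which is looser than the exact comparison the script performs.

```python
import xml.etree.ElementTree as ET

def count_matches(root, tag_name, required_attrs):
    """Count elements named tag_name whose attributes include required_attrs.

    Hypothetical diagnostic helper for illustration only.
    """
    return sum(
        1
        for element in root.iter(tag_name)  # iter() also visits nested elements
        if all(element.get(k) == v for k, v in required_attrs.items())
    )

root = ET.fromstring(
    '<Set>'
    '<Group Id="0"><Group Id="0"/></Group>'  # nested identical element
    '<Group Id="1"/>'
    '</Set>'
)

ambiguous = count_matches(root, 'Group', {'Id': '0'})
unique = count_matches(root, 'Group', {'Id': '1'})
print(ambiguous, unique)
```

A count above one means the opening tag alone does not identify a single element, so adding a distinguishing attribute is worth trying.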

#### 3. Special Characters in Tags

**Symptoms:**
- Sed syntax errors (only with the previous sed-based approach)
- Failed pattern matching

**Common Special Characters:**
- Quotes (single or double)
- Ampersands (&)
- Greater/less than signs within attributes
- Unicode characters

**Solutions:**
- Properly quote the element tag when calling the script
- Escape special characters if needed
- Use exact character encoding from source file
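
One detail worth checking with ampersands is that `ElementTree` returns attribute values with entities already decoded. Judging from the matching code shown earlier in this document, the reconstructed tag therefore contains `&` rather than `&amp;`. The sketch below shows both forms; the `Track` element is made-up sample data, and `quoteattr` is used only to show what the escaped source form looks like.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

root = ET.fromstring('<Set><Track Name="Drums &amp; Bass"/></Set>')
track = root.find('Track')

# The parser returns the decoded value
decoded = track.get('Name')

# Tag as rebuilt from parsed attributes (decoded form)
decoded_tag = f'<Track Name="{decoded}">'

# Tag as it appears in the raw file (escaped form)
escaped_tag = f'<Track Name={quoteattr(decoded)}>'

print(decoded_tag)
print(escaped_tag)
```

If a tag copied verbatim from the source file fails to match, trying the decoded form of its attribute values may be a useful next step.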

#### 4. File Permission Issues

**Symptoms:**
- "Source file does not exist" error
- "Source file is not readable" error

**Solutions:**
- Verify file path is correct
- Check file permissions: `ls -la source.xml`
- Ensure read permissions: `chmod +r source.xml`
- Check directory permissions if file is in subdirectory
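
The same checks can be run from Python before calling the script. The function `check_source_file` below is a hypothetical sketch. Its messages mirror the two errors listed above, but how the script itself performs these checks is not shown in this document.

```python
import os

def check_source_file(path):
    """Return None if path is a readable file, otherwise an error message.

    Hypothetical pre-flight check for illustration only.
    """
    if not os.path.isfile(path):
        return f"Source file does not exist: {path}"
    if not os.access(path, os.R_OK):
        return f"Source file is not readable: {path}"
    return None

problem = check_source_file('source.xml')
print(problem or 'source.xml is readable')
```

Checking existence first keeps the two failure modes apart, since a missing file would also fail the readability test.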

### Debugging Tips

#### Test XML Patterns Manually

Before using the script, test XML patterns:

```bash
# Test opening tag detection
grep -n "<InstrumentVector" source.xml

# Test element lookup with Python
python3 -c "
import xml.etree.ElementTree as ET
tree = ET.parse('source.xml')
for elem in tree.iter('InstrumentVector'):
    if elem.get('Id') == '0':
        print('Found element with Id=0')
        break
"

# Test tag name extraction
echo '<InstrumentVector Id="0">' | python3 -c "
import sys, re
line = sys.stdin.read().strip()
match = re.match(r'<\s*([^>\s]+)', line)
if match: print(match.group(1))
"
```

#### Validate XML Structure

```bash
# Check if XML is well-formed
xmllint --noout source.xml

# Pretty-print XML to understand structure
xmllint --format source.xml | head -50
```

#### Check Element Count

```bash
# Count lines containing the opening and closing tags
# (grep -c counts matching lines, not individual occurrences)
grep -c "<InstrumentVector" source.xml
grep -c "</InstrumentVector>" source.xml
```

## Advanced Usage

### Customizing for Specific XML Structures

For complex XML structures, you can extend the Python script or use additional Python logic:

**Nested Elements:**
```python
# Extract only direct children, not nested ones
import xml.etree.ElementTree as ET

tree = ET.parse('source.xml')
parent = tree.find('.//Parent')
# Compare with None: an Element with no children evaluates as false
if parent is not None:
    for child in parent:
        if child.tag == 'Child':
            print(ET.tostring(child).decode())
```

**Multiple Attributes:**
```python
# More specific matching with multiple attributes
import xml.etree.ElementTree as ET

tree = ET.parse('source.xml')
for elem in tree.iter('InstrumentVector'):
    if elem.get('Id') == '0' and elem.get('Type') == 'Audio':
        print(ET.tostring(elem).decode())
        break
```

**Conditional Extraction:**
```python
# Extract only if element contains specific content
import xml.etree.ElementTree as ET

tree = ET.parse('source.xml')
for elem in tree.iter('InstrumentVector'):
    # findtext returns None when no SpecificContent descendant exists
    if elem.findtext('.//SpecificContent') is not None:
        print(ET.tostring(elem).decode())
        break
```

### Performance Considerations

For large XML files:
- Consider using XML-specific tools like `xmlstarlet`
- Process files in chunks if memory is limited
- Use more specific patterns to reduce processing time
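
Within the standard library, one way to process a file incrementally is `xml.etree.ElementTree.iterparse`, which yields elements as they are read instead of building the whole tree first. The function `extract_first_streaming` below is a hypothetical sketch with made-up sample data, not part of the extraction script. It matches on a subset of attributes rather than on the full normalized tag.

```python
import io
import xml.etree.ElementTree as ET

def extract_first_streaming(source, tag_name, required_attrs):
    """Return the first matching element as a string, parsing incrementally.

    Hypothetical helper for illustration only.
    """
    target = None
    for event, element in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            # Attributes are already available when the start tag has been read
            if target is None and element.tag == tag_name and all(
                element.get(k) == v for k, v in required_attrs.items()
            ):
                target = element
        elif element is target:
            # The matching element is now complete, children included
            return ET.tostring(element, encoding='unicode')
        elif target is None:
            # Finished element outside any match: drop its content to save memory
            element.clear()
    return None

xml_bytes = (
    b'<Set>'
    b'<InstrumentVector Id="1"><Clip/></InstrumentVector>'
    b'<InstrumentVector Id="0"><Clip Name="a"/></InstrumentVector>'
    b'</Set>'
)
result = extract_first_streaming(io.BytesIO(xml_bytes), 'InstrumentVector', {'Id': '0'})
print(result)
```

Elements are cleared only while no match is in progress, so the children of the matching element are still intact when it is serialized. Parsing also stops as soon as the match is complete.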

## Alternative Tools

For more complex XML processing needs:
- `xmlstarlet`: XML-specific command-line tool with XPath support
- `xmllint`: More robust XML processing and validation
- Python with `lxml`: For advanced XML manipulation and XPath
- `xpath`: Command-line XPath-based extraction
- Python `BeautifulSoup`: For HTML/XML parsing with tolerance for malformed documents