| name | description |
|---|---|
| hybrid-unif-merge-stats-creator | Generate professional HTML/PDF merge statistics report from ID unification results for Snowflake or Databricks with expert analysis and visualizations |
ID Unification Merge Statistics Report Generator
Overview
I'll generate a comprehensive, professional HTML report analyzing your ID unification merge statistics with:
- 📊 Executive Summary with key performance indicators
- 📈 Identity Resolution Performance analysis and deduplication rates
- 🎯 Merge Distribution patterns and complexity analysis
- 👥 Top Merged Profiles highlighting complex identity resolutions
- ✅ Data Quality Metrics with coverage percentages
- 🚀 Convergence Analysis showing iteration performance
- 💡 Expert Recommendations for optimization and next steps
Platform Support:
- ✅ Snowflake (using Snowflake MCP tools)
- ✅ Databricks (using Databricks MCP tools)
Output Format:
- Beautiful HTML report with charts, tables, and visualizations
- PDF-ready (print to PDF from browser)
- Consistent formatting every time
- Platform-agnostic design
What You Need to Provide
1. Platform Selection
- Snowflake: For Snowflake-based ID unification
- Databricks: For Databricks-based ID unification
2. Database/Catalog Configuration
For Snowflake:
- Database Name: Where your unification tables are stored (e.g., INDRESH_TEST, CUSTOMER_CDP)
- Schema Name: Schema containing tables (e.g., PUBLIC, ID_UNIFICATION)
For Databricks:
- Catalog Name: Unity Catalog name (e.g., customer_data, cdp_prod)
- Schema Name: Schema containing tables (e.g., id_unification, unified_profiles)
3. Canonical ID Configuration
- Canonical ID Name: Name used for your unified ID (e.g., td_id, unified_customer_id, master_id)
  - This is used to find the correct tables: {canonical_id}_lookup, {canonical_id}_master_table, etc.
4. Output Configuration (Optional)
- Output File Path: Where to save the HTML report (default: id_unification_report.html)
- Report Title: Custom title for the report (default: "ID Unification Merge Statistics Report")
What I'll Do
Step 1: Platform Detection and Validation
Snowflake:
1. Verify Snowflake MCP tools are available
2. Test connection to specified database.schema
3. Validate canonical ID tables exist:
- {database}.{schema}.{canonical_id}_lookup
- {database}.{schema}.{canonical_id}_master_table
- {database}.{schema}.{canonical_id}_source_key_stats
- {database}.{schema}.{canonical_id}_result_key_stats
4. Confirm access permissions
Databricks:
1. Verify Databricks MCP tools are available (or use Snowflake fallback)
2. Test connection to specified catalog.schema
3. Validate canonical ID tables exist
4. Confirm access permissions
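The table check in step 3 can be sketched in Python as follows. This is a minimal illustration, not the command's actual implementation: the helper names `required_tables` and `missing_tables` are hypothetical, and the list of existing tables is assumed to come from whatever metadata query the platform's MCP tool returns. Names are compared case-insensitively as a simplifying assumption.

```python
# Suffixes of the four required tables listed in Step 1.
REQUIRED_SUFFIXES = ("lookup", "master_table", "source_key_stats", "result_key_stats")


def required_tables(namespace: str, schema: str, canonical_id: str) -> list[str]:
    """Build fully qualified names of the required tables.

    `namespace` is the Snowflake database or the Databricks catalog.
    """
    return [f"{namespace}.{schema}.{canonical_id}_{suffix}" for suffix in REQUIRED_SUFFIXES]


def missing_tables(expected: list[str], existing: list[str]) -> list[str]:
    """Return the expected tables that are absent (case-insensitive comparison)."""
    present = {name.lower() for name in existing}
    return [name for name in expected if name.lower() not in present]


expected = required_tables("INDRESH_TEST", "PUBLIC", "td_id")
# Hypothetical result of a table listing in which two tables are absent.
found = ["INDRESH_TEST.PUBLIC.TD_ID_LOOKUP", "INDRESH_TEST.PUBLIC.TD_ID_MASTER_TABLE"]
print(missing_tables(expected, found))
```

An empty result means all four tables are present; anything else can be reported back as a "Tables not found" error.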
Step 2: Data Collection with Expert Analysis
I'll execute 16 specialized queries to collect comprehensive statistics:
Core Statistics Queries:
- Source Key Statistics
  - Pre-unification identity counts
  - Distinct values per key type (customer_id, email, phone, etc.)
  - Per-table breakdowns
- Result Key Statistics
  - Post-unification canonical ID counts
  - Distribution histograms
  - Coverage per key type
- Canonical ID Metrics
  - Total identities processed
  - Unique canonical IDs created
  - Merge ratio calculation
- Top Merged Profiles
  - Top 10 most complex merges
  - Identity count per canonical ID
  - Merge complexity scoring
- Merge Distribution Analysis
  - Categorization (2, 3-5, 6-10, 10+ identities)
  - Percentage distribution
  - Pattern analysis
- Key Type Distribution
  - Identity breakdown by type
  - Namespace analysis
  - Cross-key coverage
- Master Table Quality Metrics
  - Attribute coverage percentages
  - Data completeness analysis
  - Sample record extraction
- Configuration Metadata
  - Unification settings
  - Column mappings
  - Validation rules
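As an illustration of the Merge Distribution Analysis item above, here is a small Python sketch of the categorization. It is an assumption-laden example, not the command's actual query: the function names are hypothetical, a count of exactly 10 is assumed to fall into "6-10" (so "10+" means more than 10), and profiles with a single identity get their own "1" category.

```python
from collections import Counter


def merge_category(identity_count: int) -> str:
    """Map the number of identities behind one canonical ID to a category label."""
    if identity_count <= 1:
        return "1"
    if identity_count == 2:
        return "2"
    if identity_count <= 5:
        return "3-5"
    if identity_count <= 10:
        return "6-10"
    return "10+"


def merge_distribution(identity_counts: list[int]) -> dict[str, float]:
    """Percentage of canonical IDs per category, rounded to one decimal place."""
    if not identity_counts:
        return {}
    totals = Counter(merge_category(n) for n in identity_counts)
    return {cat: round(100 * k / len(identity_counts), 1) for cat, k in totals.items()}


# Hypothetical identity counts for eight canonical IDs.
print(merge_distribution([2, 3, 4, 5, 7, 10, 11, 38]))
```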
Platform-Specific SQL Adaptation:
For Snowflake:
SELECT COUNT(*) as total_identities,
COUNT(DISTINCT canonical_id) as unique_canonical_ids
FROM {database}.{schema}.{canonical_id}_lookup;
For Databricks:
SELECT COUNT(*) as total_identities,
COUNT(DISTINCT canonical_id) as unique_canonical_ids
FROM {catalog}.{schema}.{canonical_id}_lookup;
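Because the two queries differ only in the first part of the table name, the adaptation can be sketched as a single Python helper. This is a hedged illustration: `lookup_count_query` is a hypothetical name, and the identifier check is an added safety assumption (plain unquoted identifiers only) rather than something the command is documented to do.

```python
import re

# Accept only simple unquoted identifiers so user input cannot inject SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def lookup_count_query(namespace: str, schema: str, canonical_id: str) -> str:
    """Build the identity-count query for the lookup table.

    `namespace` is the database on Snowflake and the catalog on Databricks.
    """
    for part in (namespace, schema, canonical_id):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return (
        "SELECT COUNT(*) AS total_identities,\n"
        "       COUNT(DISTINCT canonical_id) AS unique_canonical_ids\n"
        f"FROM {namespace}.{schema}.{canonical_id}_lookup;"
    )


print(lookup_count_query("customer_cdp", "id_unification", "unified_customer_id"))
```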
Step 3: Statistical Analysis and Calculations
I'll perform expert-level calculations:
Deduplication Rates:
For each key type:
- Source distinct count (pre-unification)
- Final canonical IDs (post-unification)
- Deduplication % = (source - final) / source * 100
Merge Ratios:
- Average identities per customer = total_identities / unique_canonical_ids
- Distribution across categories
- Outlier detection (10+ merges)
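The two formulas above translate directly into code. The sketch below uses hypothetical function names and checks them against the customer_id and lookup-table figures quoted in the Snowflake example in this document (6,489 source values, 19,512 identities, 4,940 canonical IDs).

```python
def deduplication_rate(source_distinct: int, final_canonical_ids: int) -> float:
    """Deduplication % = (source - final) / source * 100."""
    if source_distinct <= 0:
        raise ValueError("source_distinct must be positive")
    return (source_distinct - final_canonical_ids) / source_distinct * 100


def merge_ratio(total_identities: int, unique_canonical_ids: int) -> float:
    """Average identities per customer = total_identities / unique_canonical_ids."""
    if unique_canonical_ids <= 0:
        raise ValueError("unique_canonical_ids must be positive")
    return total_identities / unique_canonical_ids


print(round(deduplication_rate(6489, 4940), 1))  # 23.9
print(round(merge_ratio(19512, 4940), 2))        # 3.95
```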
Convergence Analysis:
- Parse from execution logs if available
- Calculate from iteration metadata tables
- Estimate convergence quality
Data Quality Scores:
- Coverage % for each attribute
- Completeness assessment
- Quality grading (Excellent, Good, Needs Improvement)
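A coverage percentage and its grade could be computed along these lines. The grade labels come from the list above, but the thresholds (95% and 80%) are purely illustrative assumptions, as are the function names; the document does not state the cut-offs the command uses.

```python
def coverage_pct(non_null: int, total: int) -> float:
    """Share of master-table rows with a value for one attribute, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return non_null / total * 100


def quality_grade(pct: float, excellent: float = 95.0, good: float = 80.0) -> str:
    """Grade a coverage percentage; thresholds are hypothetical defaults."""
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    if pct >= excellent:
        return "Excellent"
    if pct >= good:
        return "Good"
    return "Needs Improvement"


# Hypothetical input: 4,910 of 4,940 master rows have a phone value.
phone = coverage_pct(4910, 4940)
print(round(phone, 2), quality_grade(phone))
```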
Step 4: HTML Report Generation
I'll generate a pixel-perfect HTML report with:
Design Features:
- ✨ Modern gradient design (purple theme)
- 📊 Interactive visualizations (progress bars, horizontal bar charts)
- 🎨 Color-coded badges and status indicators
- 📱 Responsive layout (works on all devices)
- 🖨️ Print-optimized CSS for PDF export
Report Structure:
<!DOCTYPE html>
<html>
<head>
- Professional CSS styling
- Chart/visualization styles
- Print media queries
</head>
<body>
<header>
- Report title
- Executive tagline
</header>
<metadata-bar>
- Database/Catalog info
- Canonical ID name
- Generation timestamp
- Platform indicator
</metadata-bar>
<section: Executive Summary>
- 4 KPI metric cards
- Key findings insight box
</section>
<section: Identity Resolution Performance>
- Source vs result comparison table
- Deduplication rate analysis
- Horizontal bar charts
- Expert insights
</section>
<section: Merge Distribution Analysis>
- Category breakdown table
- Distribution visualizations
- Pattern analysis insights
</section>
<section: Top Merged Profiles>
- Top 10 ranked table
- Complexity badges
- Investigation recommendations
</section>
<section: Source Table Configuration>
- Column mapping table
- Source contributions
- Multi-key strategy analysis
</section>
<section: Master Table Data Quality>
- 6 coverage cards with progress bars
- Sample records table
- Quality assessment
</section>
<section: Convergence Performance>
- Iteration breakdown table
- Convergence progression chart
- Efficiency analysis
</section>
<section: Expert Recommendations>
- 4 recommendation cards
- Strategic next steps
- Downstream activation ideas
</section>
<section: Summary Statistics>
- Comprehensive metrics table
- All key numbers documented
</section>
<footer>
- Generation metadata
- Platform information
- Report description
</footer>
</body>
</html>
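To make the outline above more concrete, here is a minimal Python sketch of how one section, the Executive Summary with its KPI cards, might be rendered. The function names, the `id`, and the CSS class names are hypothetical; the real report's markup is not shown in this document. Values are HTML-escaped so that table or ID names cannot break the page.

```python
from html import escape


def metric_card(label: str, value: str) -> str:
    """Render one KPI card; class names are illustrative only."""
    return (
        '<div class="metric-card">'
        f'<div class="metric-value">{escape(value)}</div>'
        f'<div class="metric-label">{escape(label)}</div>'
        "</div>"
    )


def executive_summary(kpis: dict[str, str]) -> str:
    """Render the Executive Summary section from label -> value pairs."""
    cards = "\n".join(metric_card(label, value) for label, value in kpis.items())
    return f'<section id="executive-summary">\n<h2>Executive Summary</h2>\n{cards}\n</section>'


print(executive_summary({
    "Unified profiles": "4,940",
    "Total identities": "19,512",
    "Merge ratio": "3.95:1",
    "Fragmentation reduction": "74.7%",
}))
```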
Step 5: Quality Validation and Output
Pre-Output Validation:
1. Verify all sections have data
2. Check calculations are correct
3. Validate percentages sum properly
4. Ensure no missing values
5. Confirm HTML is well-formed
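Two of these checks, the percentage sum and HTML well-formedness, can be sketched with the Python standard library. This is an illustrative approach under stated assumptions: the names are hypothetical, the 0.5-point rounding tolerance is arbitrary, and the tag checker is stricter than the HTML specification because it requires every non-void element to be closed explicitly.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Track open tags and record closing tags that do not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def is_well_formed(html: str) -> bool:
    """True if every non-void tag is closed in the right order."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return not checker.errors and not checker.stack


def percentages_sum_ok(values: list[float], tolerance: float = 0.5) -> bool:
    """True if the values add up to 100 within a rounding tolerance."""
    return abs(sum(values) - 100.0) <= tolerance


print(is_well_formed("<html><body><p>ok</p><br></body></html>"))
print(percentages_sum_ok([60.0, 30.0, 10.0]))
```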
File Output:
1. Write HTML to specified path
2. Create backup if file exists
3. Set proper file permissions
4. Verify file was written successfully
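The four file-output steps might look like this in Python. It is a sketch under assumptions: `write_report` is a hypothetical helper, the backup is a sibling file with a `.bak` suffix, and `0o644` stands in for "proper file permissions", none of which is specified by this document.

```python
import shutil
from pathlib import Path


def write_report(html: str, output_path: str) -> Path:
    """Write the report, keeping a backup of any existing file."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    # Back up an existing report before overwriting it.
    if path.exists():
        shutil.copy2(path, path.with_name(path.name + ".bak"))

    path.write_text(html, encoding="utf-8")
    path.chmod(0o644)  # owner read/write, others read-only (assumed policy)

    # Verify the write by reading the file back.
    if path.read_text(encoding="utf-8") != html:
        raise OSError(f"verification failed for {path}")
    return path
```

Calling `write_report(html, "id_unification_report.html")` twice leaves the newest report in place and the previous one in `id_unification_report.html.bak`.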
Report Summary:
✓ Report generated: {file_path}
✓ File size: {size} KB
✓ Sections included: 9
✓ Statistics queries: 16
✓ Data quality score: {score}%
✓ Ready for: Browser viewing, PDF export, sharing
Example Workflow
Snowflake Example
User Input:
Platform: Snowflake
Database: INDRESH_TEST
Schema: PUBLIC
Canonical ID: td_id
Output: snowflake_merge_report.html
Process:
✓ Connected to Snowflake via MCP
✓ Database: INDRESH_TEST.PUBLIC validated
✓ Tables found:
- td_id_lookup (19,512 records)
- td_id_master_table (4,940 records)
- td_id_source_key_stats (4 records)
- td_id_result_key_stats (4 records)
Executing queries:
✓ Query 1: Source statistics retrieved
✓ Query 2: Result statistics retrieved
✓ Query 3: Canonical ID counts (19,512 → 4,940)
✓ Query 4: Top 10 merged profiles identified
✓ Query 5: Merge distribution calculated
✓ Query 6: Key type distribution analyzed
✓ Query 7: Master table coverage (100% email, 99.39% phone)
✓ Query 8: Sample records extracted
✓ Query 9-11: Metadata retrieved
Calculating metrics:
✓ Merge ratio: 3.95:1
✓ Fragmentation reduction: 74.7%
✓ Deduplication rates:
- customer_id: 23.9%
- email: 32.0%
- phone: 14.8%
✓ Data quality score: 99.7%
Generating HTML report:
✓ Executive summary section
✓ Performance analysis section
✓ Merge distribution section
✓ Top profiles section
✓ Source configuration section
✓ Data quality section
✓ Convergence section
✓ Recommendations section
✓ Summary statistics section
✓ Report saved: snowflake_merge_report.html (142 KB)
✓ Open in browser to view
✓ Print to PDF for distribution
Generated Report Contents:
Executive Summary:
- 4,940 unified profiles
- 19,512 total identities
- 3.95:1 merge ratio
- 74.7% fragmentation reduction
Identity Resolution:
- customer_id: 6,489 → 4,940 (23.9% reduction)
- email: 7,261 → 4,940 (32.0% reduction)
- phone: 5,762 → 4,910 (14.8% reduction)
Merge Distribution:
- 89.0% profiles: 3-5 identities (normal)
- 8.1% profiles: 6-10 identities (high engagement)
- 2.3% profiles: 10+ identities (complex)
Top Merged Profile:
- mS9ssBEh4EsN: 38 identities merged
Data Quality:
- Email: 100% coverage
- Phone: 99.39% coverage
- Names: 100% coverage
- Location: 100% coverage
Expert Recommendations:
- Implement incremental processing
- Monitor profiles with 20+ merges
- Enable downstream activation
- Set up quality monitoring
Databricks Example
User Input:
Platform: Databricks
Catalog: customer_cdp
Schema: id_unification
Canonical ID: unified_customer_id
Output: databricks_merge_report.html
Process:
✓ Connected to Databricks (or using Snowflake MCP fallback)
✓ Catalog: customer_cdp.id_unification validated
✓ Tables found:
- unified_customer_id_lookup
- unified_customer_id_master_table
- unified_customer_id_source_key_stats
- unified_customer_id_result_key_stats
[Same query execution and report generation as Snowflake]
✓ Report saved: databricks_merge_report.html
Key Features
🎯 Consistency Guarantee
- Same report every time: Deterministic HTML generation
- Platform-agnostic design: Works identically on Snowflake and Databricks
- Version controlled: Report structure is fixed and versioned
🔍 Expert Analysis
- 16 specialized queries: Comprehensive data collection
- Calculated metrics: Deduplication rates, merge ratios, quality scores
- Pattern detection: Identify anomalies and outliers
- Strategic insights: Actionable recommendations
📊 Professional Visualizations
- KPI metric cards: Large, colorful summary metrics
- Progress bars: Coverage percentages with animations
- Horizontal bar charts: Distribution comparisons
- Color-coded badges: Status indicators (Excellent, Good, Needs Review)
- Tables with hover effects: Interactive data exploration
🌍 Platform Flexibility
- Snowflake: Uses mcp__snowflake__execute_query tool
- Databricks: Uses Databricks MCP tools (with fallback options)
- Automatic SQL adaptation: Platform-specific query generation
- Table name resolution: Handles catalog vs database differences
📋 Comprehensive Coverage
9 Report Sections:
- Executive Summary (4 KPIs + findings)
- Identity Resolution Performance (deduplication analysis)
- Merge Distribution Analysis (categorized breakdown)
- Top Merged Profiles (complexity ranking)
- Source Table Configuration (mappings)
- Master Table Data Quality (coverage metrics)
- Convergence Performance (iteration analysis)
- Expert Recommendations (strategic guidance)
- Summary Statistics (complete metrics)
16 Statistical Queries:
- Source/result key statistics
- Canonical ID counts and distributions
- Merge pattern analysis
- Quality coverage metrics
- Configuration metadata
Table Naming Conventions
The command automatically finds tables based on your canonical ID name:
Required Tables
For canonical ID = {canonical_id}:
- Lookup Table: {canonical_id}_lookup
  - Contains: canonical_id, id, id_key_type
  - Used for: Merge ratio, distribution, top profiles
- Master Table: {canonical_id}_master_table
  - Contains: {canonical_id}, best_* attributes
  - Used for: Data quality coverage
- Source Stats: {canonical_id}_source_key_stats
  - Contains: from_table, total_distinct, distinct_*
  - Used for: Pre-unification baseline
- Result Stats: {canonical_id}_result_key_stats
  - Contains: from_table, total_distinct, histogram_*
  - Used for: Post-unification results
Optional Tables
- Unification Metadata: unification_metadata
  - Contains: canonical_id_name, canonical_id_type
  - Used for: Configuration documentation
- Column Lookup: column_lookup
  - Contains: table_name, column_name, key_name
  - Used for: Source table mappings
- Filter Lookup: filter_lookup
  - Contains: key_name, invalid_texts, valid_regexp
  - Used for: Validation rules
All tables, required and optional, must be in the same database.schema (Snowflake) or catalog.schema (Databricks).
Output Format
HTML Report Features
Styling:
- Gradient purple theme (#667eea to #764ba2)
- Modern typography (system fonts)
- Responsive grid layouts
- Smooth hover animations
- Print-optimized media queries
Sections:
- Header with gradient background
- Metadata bar with key info
- 9 content sections with analysis
- Footer with generation details
Visualizations:
- Metric cards (4 in executive summary)
- Progress bars (6 in data quality)
- Horizontal bar charts (3 throughout report)
- Tables with sorting and hover effects
- Insight boxes with recommendations
Interactivity:
- Hover effects on cards and tables
- Animated progress bars
- Expandable insight boxes
- Responsive layout adapts to screen size
PDF Export
To create a PDF from the HTML report:
- Open HTML file in browser
- Press Ctrl+P (Windows) or Cmd+P (Mac)
- Select "Save as PDF"
- Choose landscape orientation for better chart visibility
- Enable background graphics for full styling
Error Handling
Common Issues and Solutions
Issue: "Tables not found"
Solution:
1. Verify canonical ID name is correct
2. Check database/catalog and schema names
3. Ensure unification workflow completed successfully
4. Confirm table naming: {canonical_id}_lookup, {canonical_id}_master_table, etc.
Issue: "MCP tools not available"
Solution:
1. For Snowflake: Verify Snowflake MCP server is configured
2. For Databricks: Fall back to Snowflake MCP with proper connection string
3. Check network connectivity
4. Validate credentials
Issue: "No data in statistics tables"
Solution:
1. Verify unification workflow ran completely
2. Check that statistics SQL files were executed
3. Confirm data exists in lookup and master tables
4. Re-run the unification workflow if needed
Issue: "Permission denied"
Solution:
1. Verify READ access to all tables
2. For Snowflake: Grant SELECT on schema
3. For Databricks: Grant USE CATALOG, USE SCHEMA, SELECT
4. Check role/user permissions
Success Criteria
Generated report will:
- ✅ Open successfully in all modern browsers (Chrome, Firefox, Safari, Edge)
- ✅ Display all 9 sections with complete data
- ✅ Show accurate calculations for all metrics
- ✅ Include visualizations (charts, progress bars, tables)
- ✅ Render consistently every time it's generated
- ✅ Export cleanly to PDF with proper formatting
- ✅ Match the reference design (same HTML/CSS structure)
- ✅ Contain expert insights and recommendations
- ✅ Be production-ready for stakeholder distribution
Usage Examples
Quick Start (Snowflake)
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator
> Platform: Snowflake
> Database: PROD_CDP
> Schema: ID_UNIFICATION
> Canonical ID: master_customer_id
> Output: (press Enter for default)
✓ Report generated: id_unification_report.html
Custom Output Path
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator
> Platform: Databricks
> Catalog: analytics_prod
> Schema: unified_ids
> Canonical ID: td_id
> Output: /reports/weekly/td_id_stats_2025-10-15.html
✓ Report generated: /reports/weekly/td_id_stats_2025-10-15.html
Multiple Environments
Generate reports for different environments:
# Production
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator
Platform: Snowflake
Database: PROD_CDP
Output: prod_merge_stats.html
# Staging
/cdp-hybrid-idu:hybrid-unif-merge-stats-creator
Platform: Snowflake
Database: STAGING_CDP
Output: staging_merge_stats.html
# Compare metrics across environments
Best Practices
Regular Reporting
- Weekly Reports: Track merge performance over time
- Post-Workflow Reports: Generate after each unification run
- Quality Audits: Monthly deep-dive analysis
- Stakeholder Updates: Executive-friendly format
Comparative Analysis
Generate reports at different stages:
- After initial unification setup
- After incremental updates
- After data quality improvements
- Across different customer segments
Archive and Versioning
reports/
2025-10-15_td_id_merge_stats.html
2025-10-08_td_id_merge_stats.html
2025-10-01_td_id_merge_stats.html
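File names in this layout follow a date-first pattern, which keeps the archive sorted chronologically. A small Python helper for producing them could look like the sketch below; `archive_name` is a hypothetical name and the pattern is inferred from the listing above.

```python
from datetime import date


def archive_name(canonical_id: str, run_date: date) -> str:
    """Build an archive file name such as 2025-10-15_td_id_merge_stats.html."""
    return f"{run_date.isoformat()}_{canonical_id}_merge_stats.html"


print(archive_name("td_id", date(2025, 10, 15)))  # 2025-10-15_td_id_merge_stats.html
```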
Track improvements over time by comparing:
- Merge ratios
- Data quality scores
- Convergence iterations
- Deduplication rates
Getting Started
Ready to generate your merge statistics report?
Please provide:
- Platform: Snowflake or Databricks?
- Database/Catalog: Where are your unification tables?
- Schema: Which schema contains the tables?
- Canonical ID: What's the name of your unified ID? (e.g., td_id)
- Output Path (optional): Where to save the report?
Example:
I want to generate a merge statistics report for:
Platform: Snowflake
Database: INDRESH_TEST
Schema: PUBLIC
Canonical ID: td_id
Output: my_unification_report.html
I'll analyze your ID unification results and create a comprehensive, beautiful HTML report with expert insights!