13 KiB
13 KiB
Data Visualization
This reference covers Vaex's visualization capabilities for creating plots, heatmaps, and interactive visualizations of large datasets.
Overview
Vaex excels at visualizing datasets with billions of rows through efficient binning and aggregation. The visualization system works directly with large data without sampling, providing accurate representations of the entire dataset.
Key features:
- Visualize billion-row datasets interactively
- No sampling required - uses all data
- Automatic binning and aggregation
- Integration with matplotlib
- Interactive widgets for Jupyter
Basic Plotting
1D Histograms
import vaex
import matplotlib.pyplot as plt
df = vaex.open('data.hdf5')
# Simple histogram
df.plot1d(df.age)
# With customization
df.plot1d(df.age,
limits=[0, 100],
shape=50, # Number of bins
figsize=(10, 6),
xlabel='Age',
ylabel='Count')
plt.show()
2D Density Plots (Heatmaps)
# Basic 2D plot
df.plot(df.x, df.y)
# With limits
df.plot(df.x, df.y, limits=[[0, 10], [0, 10]])
# Auto limits using percentiles
df.plot(df.x, df.y, limits='99.7%') # 3-sigma limits
# Customize shape (resolution)
df.plot(df.x, df.y, shape=(512, 512))
# Logarithmic color scale
df.plot(df.x, df.y, f='log')
Scatter Plots (Small Data)
# For smaller datasets or samples
df_sample = df.sample(n=1000)
df_sample.scatter(df_sample.x, df_sample.y,
alpha=0.5,
s=10) # Point size
plt.show()
Advanced Visualization Options
Color Scales and Normalization
# Linear scale (default)
df.plot(df.x, df.y, f='identity')
# Logarithmic scale
df.plot(df.x, df.y, f='log')
df.plot(df.x, df.y, f='log10')
# Square root scale
df.plot(df.x, df.y, f='sqrt')
# Custom colormap
df.plot(df.x, df.y, colormap='viridis')
df.plot(df.x, df.y, colormap='plasma')
df.plot(df.x, df.y, colormap='hot')
Limits and Ranges
# Manual limits
df.plot(df.x, df.y, limits=[[xmin, xmax], [ymin, ymax]])
# Percentile-based limits
df.plot(df.x, df.y, limits='99.7%') # 3-sigma
df.plot(df.x, df.y, limits='95%')
df.plot(df.x, df.y, limits='minmax') # Full range
# Mixed limits
df.plot(df.x, df.y, limits=[[0, 100], 'minmax'])
Resolution Control
# Higher resolution (more bins)
df.plot(df.x, df.y, shape=(1024, 1024))
# Lower resolution (faster)
df.plot(df.x, df.y, shape=(128, 128))
# Different resolutions per axis
df.plot(df.x, df.y, shape=(512, 256))
Statistical Visualizations
Visualizing Aggregations
# Mean on a grid
df.plot(df.x, df.y, what=df.z.mean(),
limits=[[0, 10], [0, 10]],
shape=(100, 100),
colormap='viridis')
# Standard deviation
df.plot(df.x, df.y, what=df.z.std())
# Sum
df.plot(df.x, df.y, what=df.z.sum())
# Count (default)
df.plot(df.x, df.y, what='count')
Multiple Statistics
# Create figure with subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Count
df.plot(df.x, df.y, what='count',
ax=axes[0, 0], show=False)
axes[0, 0].set_title('Count')
# Mean
df.plot(df.x, df.y, what=df.z.mean(),
ax=axes[0, 1], show=False)
axes[0, 1].set_title('Mean of z')
# Std
df.plot(df.x, df.y, what=df.z.std(),
ax=axes[1, 0], show=False)
axes[1, 0].set_title('Std of z')
# Min
df.plot(df.x, df.y, what=df.z.min(),
ax=axes[1, 1], show=False)
axes[1, 1].set_title('Min of z')
plt.tight_layout()
plt.show()
Working with Selections
Visualize different segments of data simultaneously:
import vaex
import matplotlib.pyplot as plt
df = vaex.open('data.hdf5')
# Create selections
df.select(df.category == 'A', name='group_a')
df.select(df.category == 'B', name='group_b')
# Plot both selections
df.plot1d(df.value, selection='group_a', label='Group A')
df.plot1d(df.value, selection='group_b', label='Group B')
plt.legend()
plt.show()
# 2D plot with selection
df.plot(df.x, df.y, selection='group_a')
Overlay Multiple Selections
# Create base plot
fig, ax = plt.subplots(figsize=(10, 8))
# Plot each selection with different colors
df.plot(df.x, df.y, selection='group_a',
ax=ax, show=False, colormap='Reds', alpha=0.5)
df.plot(df.x, df.y, selection='group_b',
ax=ax, show=False, colormap='Blues', alpha=0.5)
ax.set_title('Overlaid Selections')
plt.show()
Subplots and Layouts
Creating Multiple Plots
import matplotlib.pyplot as plt
# Create subplot grid
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# Plot different variables
variables = ['x', 'y', 'z', 'a', 'b', 'c']
for idx, var in enumerate(variables):
row = idx // 3
col = idx % 3
df.plot1d(df[var], ax=axes[row, col], show=False)
axes[row, col].set_title(f'Distribution of {var}')
plt.tight_layout()
plt.show()
Faceted Plots
# Plot by category
categories = df.category.unique()
fig, axes = plt.subplots(1, len(categories), figsize=(15, 5))
for idx, cat in enumerate(categories):
df_cat = df[df.category == cat]
df_cat.plot(df_cat.x, df_cat.y,
ax=axes[idx], show=False)
axes[idx].set_title(f'Category {cat}')
plt.tight_layout()
plt.show()
Interactive Widgets (Jupyter)
Create interactive visualizations in Jupyter notebooks:
Selection Widget
# Interactive selection
df.widget.selection_expression()
Histogram Widget
# Interactive histogram with selection
df.plot_widget(df.x, df.y)
Scatter Widget
# Interactive scatter plot
df.scatter_widget(df.x, df.y)
Customization
Styling Plots
import matplotlib.pyplot as plt
# Create plot with custom styling
fig, ax = plt.subplots(figsize=(12, 8))
df.plot(df.x, df.y,
limits='99%',
shape=(256, 256),
colormap='plasma',
ax=ax,
show=False)
# Customize axes
ax.set_xlabel('X Variable', fontsize=14, fontweight='bold')
ax.set_ylabel('Y Variable', fontsize=14, fontweight='bold')
ax.set_title('Custom Density Plot', fontsize=16, fontweight='bold')
ax.grid(alpha=0.3)
# Add colorbar
plt.colorbar(ax.collections[0], ax=ax, label='Density')
plt.tight_layout()
plt.show()
Figure Size and DPI
# High-resolution plot
df.plot(df.x, df.y,
figsize=(12, 10),
dpi=300)
Specialized Visualizations
Hexbin Plots
# Alternative to heatmap using hexagonal bins
plt.figure(figsize=(10, 8))
plt.hexbin(df.x.values[:100000], df.y.values[:100000],
gridsize=50, cmap='viridis')
plt.colorbar(label='Count')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
Contour Plots
import numpy as np
# Get 2D histogram data
counts = df.count(binby=[df.x, df.y],
limits=[[0, 10], [0, 10]],
shape=(100, 100))
# Create contour plot
x = np.linspace(0, 10, 100)
y = np.linspace(0, 10, 100)
plt.contourf(x, y, counts.T, levels=20, cmap='viridis')
plt.colorbar(label='Count')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Contour Plot')
plt.show()
Vector Field Overlay
# Compute mean vectors on grid
mean_vx = df.mean(df.vx, binby=[df.x, df.y],
limits=[[0, 10], [0, 10]],
shape=(20, 20))
mean_vy = df.mean(df.vy, binby=[df.x, df.y],
limits=[[0, 10], [0, 10]],
shape=(20, 20))
# Create grid
x = np.linspace(0, 10, 20)
y = np.linspace(0, 10, 20)
X, Y = np.meshgrid(x, y)
# Plot
fig, ax = plt.subplots(figsize=(10, 8))
# Base heatmap
df.plot(df.x, df.y, ax=ax, show=False)
# Vector overlay
ax.quiver(X, Y, mean_vx.T, mean_vy.T, alpha=0.7, color='white')
plt.show()
Performance Considerations
Optimizing Large Visualizations
# For very large datasets, reduce shape
df.plot(df.x, df.y, shape=(256, 256)) # Fast
# For publication quality
df.plot(df.x, df.y, shape=(1024, 1024)) # Higher quality
# Balance quality and performance
df.plot(df.x, df.y, shape=(512, 512)) # Good balance
Caching Visualization Data
# Compute once, plot multiple times
counts = df.count(binby=[df.x, df.y],
limits=[[0, 10], [0, 10]],
shape=(512, 512))
# Use in different plots
plt.figure()
plt.imshow(counts.T, origin='lower', cmap='viridis')
plt.colorbar()
plt.show()
plt.figure()
plt.imshow(np.log10(counts.T + 1), origin='lower', cmap='plasma')
plt.colorbar()
plt.show()
Export and Saving
Saving Figures
# Save as PNG
df.plot(df.x, df.y)
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
# Save as PDF (vector)
plt.savefig('plot.pdf', bbox_inches='tight')
# Save as SVG
plt.savefig('plot.svg', bbox_inches='tight')
Batch Plotting
# Generate multiple plots
variables = ['x', 'y', 'z']
for var in variables:
plt.figure(figsize=(10, 6))
df.plot1d(df[var])
plt.title(f'Distribution of {var}')
plt.savefig(f'plot_{var}.png', dpi=300, bbox_inches='tight')
plt.close()
Common Patterns
Pattern: Exploratory Data Analysis
import matplotlib.pyplot as plt
# Create comprehensive visualization
fig = plt.figure(figsize=(16, 12))
# 1D histograms
ax1 = plt.subplot(3, 3, 1)
df.plot1d(df.x, ax=ax1, show=False)
ax1.set_title('X Distribution')
ax2 = plt.subplot(3, 3, 2)
df.plot1d(df.y, ax=ax2, show=False)
ax2.set_title('Y Distribution')
ax3 = plt.subplot(3, 3, 3)
df.plot1d(df.z, ax=ax3, show=False)
ax3.set_title('Z Distribution')
# 2D plots
ax4 = plt.subplot(3, 3, 4)
df.plot(df.x, df.y, ax=ax4, show=False)
ax4.set_title('X vs Y')
ax5 = plt.subplot(3, 3, 5)
df.plot(df.x, df.z, ax=ax5, show=False)
ax5.set_title('X vs Z')
ax6 = plt.subplot(3, 3, 6)
df.plot(df.y, df.z, ax=ax6, show=False)
ax6.set_title('Y vs Z')
# Statistics on grids
ax7 = plt.subplot(3, 3, 7)
df.plot(df.x, df.y, what=df.z.mean(), ax=ax7, show=False)
ax7.set_title('Mean Z on X-Y grid')
plt.tight_layout()
plt.savefig('eda_summary.png', dpi=300, bbox_inches='tight')
plt.show()
Pattern: Comparison Across Groups
# Compare distributions by category
categories = df.category.unique()
fig, axes = plt.subplots(len(categories), 2,
figsize=(12, 4 * len(categories)))
for idx, cat in enumerate(categories):
df.select(df.category == cat, name=f'cat_{cat}')
# 1D histogram
df.plot1d(df.value, selection=f'cat_{cat}',
ax=axes[idx, 0], show=False)
axes[idx, 0].set_title(f'Category {cat} - Distribution')
# 2D plot
df.plot(df.x, df.y, selection=f'cat_{cat}',
ax=axes[idx, 1], show=False)
axes[idx, 1].set_title(f'Category {cat} - X vs Y')
plt.tight_layout()
plt.show()
Pattern: Time Series Visualization
# Aggregate by time bins
df['year'] = df.timestamp.dt.year
df['month'] = df.timestamp.dt.month
# Plot time series
monthly_sales = df.groupby(['year', 'month']).agg({'sales': 'sum'})
plt.figure(figsize=(14, 6))
plt.plot(range(len(monthly_sales)), monthly_sales['sales'])
plt.xlabel('Time Period')
plt.ylabel('Sales')
plt.title('Sales Over Time')
plt.grid(alpha=0.3)
plt.show()
Integration with Other Libraries
Plotly for Interactivity
import plotly.graph_objects as go
# Get data from Vaex
counts = df.count(binby=[df.x, df.y], shape=(100, 100))
# Create plotly figure
fig = go.Figure(data=go.Heatmap(z=counts.T))
fig.update_layout(title='Interactive Heatmap')
fig.show()
Seaborn Style
import seaborn as sns
import matplotlib.pyplot as plt
# Use seaborn styling
sns.set_style('darkgrid')
sns.set_palette('husl')
df.plot1d(df.value)
plt.show()
Best Practices
- Use appropriate shape - Balance resolution and performance (256-512 for exploration, 1024+ for publication)
- Apply sensible limits - Use percentile-based limits ('99%', '99.7%') to handle outliers
- Choose color scales wisely - Log scale for wide-ranging counts, linear for uniform data
- Leverage selections - Compare subsets without creating new DataFrames
- Cache aggregations - Compute once if creating multiple similar plots
- Use vector formats for publication - Save as PDF or SVG for scalable figures
- Avoid sampling - Vaex visualizations use all data, no sampling needed
Troubleshooting
Issue: Empty or Sparse Plot
# Problem: Limits don't match data range
df.plot(df.x, df.y, limits=[[0, 10], [0, 10]])
# Solution: Use automatic limits
df.plot(df.x, df.y, limits='minmax')
df.plot(df.x, df.y, limits='99%')
Issue: Plot Too Slow
# Problem: Too high resolution
df.plot(df.x, df.y, shape=(2048, 2048))
# Solution: Reduce shape
df.plot(df.x, df.y, shape=(512, 512))
Issue: Can't See Low-Density Regions
# Problem: Linear scale overwhelmed by high-density areas
df.plot(df.x, df.y, f='identity')
# Solution: Use logarithmic scale
df.plot(df.x, df.y, f='log')
Related Resources
- For data aggregation: See
data_processing.md - For performance optimization: See
performance.md - For DataFrame basics: See
core_dataframes.md