614 lines
13 KiB
Markdown
614 lines
13 KiB
Markdown
# Data Visualization
|
|
|
|
This reference covers Vaex's visualization capabilities for creating plots, heatmaps, and interactive visualizations of large datasets.
|
|
|
|
## Overview
|
|
|
|
Vaex excels at visualizing datasets with billions of rows through efficient binning and aggregation. The visualization system works directly with large data without sampling, providing accurate representations of the entire dataset.
|
|
|
|
**Key features:**
|
|
- Visualize billion-row datasets interactively
|
|
- No sampling required - uses all data
|
|
- Automatic binning and aggregation
|
|
- Integration with matplotlib
|
|
- Interactive widgets for Jupyter
|
|
|
|
## Basic Plotting
|
|
|
|
### 1D Histograms
|
|
|
|
```python
|
|
import vaex
|
|
import matplotlib.pyplot as plt
|
|
|
|
df = vaex.open('data.hdf5')
|
|
|
|
# Simple histogram
|
|
df.plot1d(df.age)
|
|
|
|
# With customization
|
|
df.plot1d(df.age,
|
|
limits=[0, 100],
|
|
shape=50, # Number of bins
|
|
figsize=(10, 6),
|
|
xlabel='Age',
|
|
ylabel='Count')
|
|
|
|
plt.show()
|
|
```
|
|
|
|
### 2D Density Plots (Heatmaps)
|
|
|
|
```python
|
|
# Basic 2D plot
|
|
df.plot(df.x, df.y)
|
|
|
|
# With limits
|
|
df.plot(df.x, df.y, limits=[[0, 10], [0, 10]])
|
|
|
|
# Auto limits using percentiles
|
|
df.plot(df.x, df.y, limits='99.7%') # 3-sigma limits
|
|
|
|
# Customize shape (resolution)
|
|
df.plot(df.x, df.y, shape=(512, 512))
|
|
|
|
# Logarithmic color scale
|
|
df.plot(df.x, df.y, f='log')
|
|
```
|
|
|
|
### Scatter Plots (Small Data)
|
|
|
|
```python
|
|
# For smaller datasets or samples
|
|
df_sample = df.sample(n=1000)
|
|
|
|
df_sample.scatter(df_sample.x, df_sample.y,
|
|
alpha=0.5,
|
|
s=10) # Point size
|
|
|
|
plt.show()
|
|
```
|
|
|
|
## Advanced Visualization Options
|
|
|
|
### Color Scales and Normalization
|
|
|
|
```python
|
|
# Linear scale (default)
|
|
df.plot(df.x, df.y, f='identity')
|
|
|
|
# Logarithmic scale
|
|
df.plot(df.x, df.y, f='log')
|
|
df.plot(df.x, df.y, f='log10')
|
|
|
|
# Square root scale
|
|
df.plot(df.x, df.y, f='sqrt')
|
|
|
|
# Custom colormap
|
|
df.plot(df.x, df.y, colormap='viridis')
|
|
df.plot(df.x, df.y, colormap='plasma')
|
|
df.plot(df.x, df.y, colormap='hot')
|
|
```
|
|
|
|
### Limits and Ranges
|
|
|
|
```python
|
|
# Manual limits
|
|
df.plot(df.x, df.y, limits=[[xmin, xmax], [ymin, ymax]])
|
|
|
|
# Percentile-based limits
|
|
df.plot(df.x, df.y, limits='99.7%') # 3-sigma
|
|
df.plot(df.x, df.y, limits='95%')
|
|
df.plot(df.x, df.y, limits='minmax') # Full range
|
|
|
|
# Mixed limits
|
|
df.plot(df.x, df.y, limits=[[0, 100], 'minmax'])
|
|
```
|
|
|
|
### Resolution Control
|
|
|
|
```python
|
|
# Higher resolution (more bins)
|
|
df.plot(df.x, df.y, shape=(1024, 1024))
|
|
|
|
# Lower resolution (faster)
|
|
df.plot(df.x, df.y, shape=(128, 128))
|
|
|
|
# Different resolutions per axis
|
|
df.plot(df.x, df.y, shape=(512, 256))
|
|
```
|
|
|
|
## Statistical Visualizations
|
|
|
|
### Visualizing Aggregations
|
|
|
|
```python
|
|
# Mean on a grid
|
|
df.plot(df.x, df.y, what=df.z.mean(),
|
|
limits=[[0, 10], [0, 10]],
|
|
shape=(100, 100),
|
|
colormap='viridis')
|
|
|
|
# Standard deviation
|
|
df.plot(df.x, df.y, what=df.z.std())
|
|
|
|
# Sum
|
|
df.plot(df.x, df.y, what=df.z.sum())
|
|
|
|
# Count (default)
|
|
df.plot(df.x, df.y, what='count')
|
|
```
|
|
|
|
### Multiple Statistics
|
|
|
|
```python
|
|
# Create figure with subplots
|
|
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
|
|
|
|
# Count
|
|
df.plot(df.x, df.y, what='count',
|
|
ax=axes[0, 0], show=False)
|
|
axes[0, 0].set_title('Count')
|
|
|
|
# Mean
|
|
df.plot(df.x, df.y, what=df.z.mean(),
|
|
ax=axes[0, 1], show=False)
|
|
axes[0, 1].set_title('Mean of z')
|
|
|
|
# Std
|
|
df.plot(df.x, df.y, what=df.z.std(),
|
|
ax=axes[1, 0], show=False)
|
|
axes[1, 0].set_title('Std of z')
|
|
|
|
# Min
|
|
df.plot(df.x, df.y, what=df.z.min(),
|
|
ax=axes[1, 1], show=False)
|
|
axes[1, 1].set_title('Min of z')
|
|
|
|
plt.tight_layout()
|
|
plt.show()
|
|
```
|
|
|
|
## Working with Selections
|
|
|
|
Visualize different segments of data simultaneously:
|
|
|
|
```python
|
|
import vaex
|
|
import matplotlib.pyplot as plt
|
|
|
|
df = vaex.open('data.hdf5')
|
|
|
|
# Create selections
|
|
df.select(df.category == 'A', name='group_a')
|
|
df.select(df.category == 'B', name='group_b')
|
|
|
|
# Plot both selections
|
|
df.plot1d(df.value, selection='group_a', label='Group A')
|
|
df.plot1d(df.value, selection='group_b', label='Group B')
|
|
plt.legend()
|
|
plt.show()
|
|
|
|
# 2D plot with selection
|
|
df.plot(df.x, df.y, selection='group_a')
|
|
```
|
|
|
|
### Overlay Multiple Selections
|
|
|
|
```python
|
|
# Create base plot
|
|
fig, ax = plt.subplots(figsize=(10, 8))
|
|
|
|
# Plot each selection with different colors
|
|
df.plot(df.x, df.y, selection='group_a',
|
|
ax=ax, show=False, colormap='Reds', alpha=0.5)
|
|
df.plot(df.x, df.y, selection='group_b',
|
|
ax=ax, show=False, colormap='Blues', alpha=0.5)
|
|
|
|
ax.set_title('Overlaid Selections')
|
|
plt.show()
|
|
```
|
|
|
|
## Subplots and Layouts
|
|
|
|
### Creating Multiple Plots
|
|
|
|
```python
|
|
import matplotlib.pyplot as plt
|
|
|
|
# Create subplot grid
|
|
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
|
|
|
|
# Plot different variables
|
|
variables = ['x', 'y', 'z', 'a', 'b', 'c']
|
|
for idx, var in enumerate(variables):
|
|
row = idx // 3
|
|
col = idx % 3
|
|
df.plot1d(df[var], ax=axes[row, col], show=False)
|
|
axes[row, col].set_title(f'Distribution of {var}')
|
|
|
|
plt.tight_layout()
|
|
plt.show()
|
|
```
|
|
|
|
### Faceted Plots
|
|
|
|
```python
|
|
# Plot by category
|
|
categories = df.category.unique()
|
|
|
|
fig, axes = plt.subplots(1, len(categories), figsize=(15, 5))
|
|
|
|
for idx, cat in enumerate(categories):
|
|
df_cat = df[df.category == cat]
|
|
df_cat.plot(df_cat.x, df_cat.y,
|
|
ax=axes[idx], show=False)
|
|
axes[idx].set_title(f'Category {cat}')
|
|
|
|
plt.tight_layout()
|
|
plt.show()
|
|
```
|
|
|
|
## Interactive Widgets (Jupyter)
|
|
|
|
Create interactive visualizations in Jupyter notebooks:
|
|
|
|
### Selection Widget
|
|
|
|
```python
|
|
# Interactive selection
|
|
df.widget.selection_expression()
|
|
```
|
|
|
|
### Histogram Widget
|
|
|
|
```python
|
|
# Interactive histogram with selection
|
|
df.plot_widget(df.x, df.y)
|
|
```
|
|
|
|
### Scatter Widget
|
|
|
|
```python
|
|
# Interactive scatter plot
|
|
df.scatter_widget(df.x, df.y)
|
|
```
|
|
|
|
## Customization
|
|
|
|
### Styling Plots
|
|
|
|
```python
|
|
import matplotlib.pyplot as plt
|
|
|
|
# Create plot with custom styling
|
|
fig, ax = plt.subplots(figsize=(12, 8))
|
|
|
|
df.plot(df.x, df.y,
|
|
limits='99%',
|
|
shape=(256, 256),
|
|
colormap='plasma',
|
|
ax=ax,
|
|
show=False)
|
|
|
|
# Customize axes
|
|
ax.set_xlabel('X Variable', fontsize=14, fontweight='bold')
|
|
ax.set_ylabel('Y Variable', fontsize=14, fontweight='bold')
|
|
ax.set_title('Custom Density Plot', fontsize=16, fontweight='bold')
|
|
ax.grid(alpha=0.3)
|
|
|
|
# Add colorbar
|
|
plt.colorbar(ax.collections[0], ax=ax, label='Density')
|
|
|
|
plt.tight_layout()
|
|
plt.show()
|
|
```
|
|
|
|
### Figure Size and DPI
|
|
|
|
```python
|
|
# High-resolution plot
|
|
df.plot(df.x, df.y,
|
|
figsize=(12, 10),
|
|
dpi=300)
|
|
```
|
|
|
|
## Specialized Visualizations
|
|
|
|
### Hexbin Plots
|
|
|
|
```python
|
|
# Alternative to heatmap using hexagonal bins
|
|
plt.figure(figsize=(10, 8))
|
|
plt.hexbin(df.x.values[:100000], df.y.values[:100000],
|
|
gridsize=50, cmap='viridis')
|
|
plt.colorbar(label='Count')
|
|
plt.xlabel('X')
|
|
plt.ylabel('Y')
|
|
plt.show()
|
|
```
|
|
|
|
### Contour Plots
|
|
|
|
```python
|
|
import numpy as np
|
|
|
|
# Get 2D histogram data
|
|
counts = df.count(binby=[df.x, df.y],
|
|
limits=[[0, 10], [0, 10]],
|
|
shape=(100, 100))
|
|
|
|
# Create contour plot
|
|
x = np.linspace(0, 10, 100)
|
|
y = np.linspace(0, 10, 100)
|
|
plt.contourf(x, y, counts.T, levels=20, cmap='viridis')
|
|
plt.colorbar(label='Count')
|
|
plt.xlabel('X')
|
|
plt.ylabel('Y')
|
|
plt.title('Contour Plot')
|
|
plt.show()
|
|
```
|
|
|
|
### Vector Field Overlay
|
|
|
|
```python
|
|
# Compute mean vectors on grid
|
|
mean_vx = df.mean(df.vx, binby=[df.x, df.y],
|
|
limits=[[0, 10], [0, 10]],
|
|
shape=(20, 20))
|
|
mean_vy = df.mean(df.vy, binby=[df.x, df.y],
|
|
limits=[[0, 10], [0, 10]],
|
|
shape=(20, 20))
|
|
|
|
# Create grid
|
|
x = np.linspace(0, 10, 20)
|
|
y = np.linspace(0, 10, 20)
|
|
X, Y = np.meshgrid(x, y)
|
|
|
|
# Plot
|
|
fig, ax = plt.subplots(figsize=(10, 8))
|
|
|
|
# Base heatmap
|
|
df.plot(df.x, df.y, ax=ax, show=False)
|
|
|
|
# Vector overlay
|
|
ax.quiver(X, Y, mean_vx.T, mean_vy.T, alpha=0.7, color='white')
|
|
|
|
plt.show()
|
|
```
|
|
|
|
## Performance Considerations
|
|
|
|
### Optimizing Large Visualizations
|
|
|
|
```python
|
|
# For very large datasets, reduce shape
|
|
df.plot(df.x, df.y, shape=(256, 256)) # Fast
|
|
|
|
# For publication quality
|
|
df.plot(df.x, df.y, shape=(1024, 1024)) # Higher quality
|
|
|
|
# Balance quality and performance
|
|
df.plot(df.x, df.y, shape=(512, 512)) # Good balance
|
|
```
|
|
|
|
### Caching Visualization Data
|
|
|
|
```python
|
|
# Compute once, plot multiple times
|
|
counts = df.count(binby=[df.x, df.y],
|
|
limits=[[0, 10], [0, 10]],
|
|
shape=(512, 512))
|
|
|
|
# Use in different plots
|
|
plt.figure()
|
|
plt.imshow(counts.T, origin='lower', cmap='viridis')
|
|
plt.colorbar()
|
|
plt.show()
|
|
|
|
plt.figure()
|
|
plt.imshow(np.log10(counts.T + 1), origin='lower', cmap='plasma')
|
|
plt.colorbar()
|
|
plt.show()
|
|
```
|
|
|
|
## Export and Saving
|
|
|
|
### Saving Figures
|
|
|
|
```python
|
|
# Save as PNG
|
|
df.plot(df.x, df.y)
|
|
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
|
|
|
|
# Save as PDF (vector)
|
|
plt.savefig('plot.pdf', bbox_inches='tight')
|
|
|
|
# Save as SVG
|
|
plt.savefig('plot.svg', bbox_inches='tight')
|
|
```
|
|
|
|
### Batch Plotting
|
|
|
|
```python
|
|
# Generate multiple plots
|
|
variables = ['x', 'y', 'z']
|
|
|
|
for var in variables:
|
|
plt.figure(figsize=(10, 6))
|
|
df.plot1d(df[var])
|
|
plt.title(f'Distribution of {var}')
|
|
plt.savefig(f'plot_{var}.png', dpi=300, bbox_inches='tight')
|
|
plt.close()
|
|
```
|
|
|
|
## Common Patterns
|
|
|
|
### Pattern: Exploratory Data Analysis
|
|
|
|
```python
|
|
import matplotlib.pyplot as plt
|
|
|
|
# Create comprehensive visualization
|
|
fig = plt.figure(figsize=(16, 12))
|
|
|
|
# 1D histograms
|
|
ax1 = plt.subplot(3, 3, 1)
|
|
df.plot1d(df.x, ax=ax1, show=False)
|
|
ax1.set_title('X Distribution')
|
|
|
|
ax2 = plt.subplot(3, 3, 2)
|
|
df.plot1d(df.y, ax=ax2, show=False)
|
|
ax2.set_title('Y Distribution')
|
|
|
|
ax3 = plt.subplot(3, 3, 3)
|
|
df.plot1d(df.z, ax=ax3, show=False)
|
|
ax3.set_title('Z Distribution')
|
|
|
|
# 2D plots
|
|
ax4 = plt.subplot(3, 3, 4)
|
|
df.plot(df.x, df.y, ax=ax4, show=False)
|
|
ax4.set_title('X vs Y')
|
|
|
|
ax5 = plt.subplot(3, 3, 5)
|
|
df.plot(df.x, df.z, ax=ax5, show=False)
|
|
ax5.set_title('X vs Z')
|
|
|
|
ax6 = plt.subplot(3, 3, 6)
|
|
df.plot(df.y, df.z, ax=ax6, show=False)
|
|
ax6.set_title('Y vs Z')
|
|
|
|
# Statistics on grids
|
|
ax7 = plt.subplot(3, 3, 7)
|
|
df.plot(df.x, df.y, what=df.z.mean(), ax=ax7, show=False)
|
|
ax7.set_title('Mean Z on X-Y grid')
|
|
|
|
plt.tight_layout()
|
|
plt.savefig('eda_summary.png', dpi=300, bbox_inches='tight')
|
|
plt.show()
|
|
```
|
|
|
|
### Pattern: Comparison Across Groups
|
|
|
|
```python
|
|
# Compare distributions by category
|
|
categories = df.category.unique()
|
|
|
|
fig, axes = plt.subplots(len(categories), 2,
|
|
figsize=(12, 4 * len(categories)))
|
|
|
|
for idx, cat in enumerate(categories):
|
|
df.select(df.category == cat, name=f'cat_{cat}')
|
|
|
|
# 1D histogram
|
|
df.plot1d(df.value, selection=f'cat_{cat}',
|
|
ax=axes[idx, 0], show=False)
|
|
axes[idx, 0].set_title(f'Category {cat} - Distribution')
|
|
|
|
# 2D plot
|
|
df.plot(df.x, df.y, selection=f'cat_{cat}',
|
|
ax=axes[idx, 1], show=False)
|
|
axes[idx, 1].set_title(f'Category {cat} - X vs Y')
|
|
|
|
plt.tight_layout()
|
|
plt.show()
|
|
```
|
|
|
|
### Pattern: Time Series Visualization
|
|
|
|
```python
|
|
# Aggregate by time bins
|
|
df['year'] = df.timestamp.dt.year
|
|
df['month'] = df.timestamp.dt.month
|
|
|
|
# Plot time series
|
|
monthly_sales = df.groupby(['year', 'month']).agg({'sales': 'sum'})
|
|
|
|
plt.figure(figsize=(14, 6))
|
|
plt.plot(range(len(monthly_sales)), monthly_sales['sales'])
|
|
plt.xlabel('Time Period')
|
|
plt.ylabel('Sales')
|
|
plt.title('Sales Over Time')
|
|
plt.grid(alpha=0.3)
|
|
plt.show()
|
|
```
|
|
|
|
## Integration with Other Libraries
|
|
|
|
### Plotly for Interactivity
|
|
|
|
```python
|
|
import plotly.graph_objects as go
|
|
|
|
# Get data from Vaex
|
|
counts = df.count(binby=[df.x, df.y], shape=(100, 100))
|
|
|
|
# Create plotly figure
|
|
fig = go.Figure(data=go.Heatmap(z=counts.T))
|
|
fig.update_layout(title='Interactive Heatmap')
|
|
fig.show()
|
|
```
|
|
|
|
### Seaborn Style
|
|
|
|
```python
|
|
import seaborn as sns
|
|
import matplotlib.pyplot as plt
|
|
|
|
# Use seaborn styling
|
|
sns.set_style('darkgrid')
|
|
sns.set_palette('husl')
|
|
|
|
df.plot1d(df.value)
|
|
plt.show()
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
1. **Use appropriate shape** - Balance resolution and performance (256-512 for exploration, 1024+ for publication)
|
|
2. **Apply sensible limits** - Use percentile-based limits ('99%', '99.7%') to handle outliers
|
|
3. **Choose color scales wisely** - Log scale for wide-ranging counts, linear for uniform data
|
|
4. **Leverage selections** - Compare subsets without creating new DataFrames
|
|
5. **Cache aggregations** - Compute once if creating multiple similar plots
|
|
6. **Use vector formats for publication** - Save as PDF or SVG for scalable figures
|
|
7. **Avoid sampling** - Vaex visualizations use all data, no sampling needed
|
|
|
|
## Troubleshooting
|
|
|
|
### Issue: Empty or Sparse Plot
|
|
|
|
```python
|
|
# Problem: Limits don't match data range
|
|
df.plot(df.x, df.y, limits=[[0, 10], [0, 10]])
|
|
|
|
# Solution: Use automatic limits
|
|
df.plot(df.x, df.y, limits='minmax')
|
|
df.plot(df.x, df.y, limits='99%')
|
|
```
|
|
|
|
### Issue: Plot Too Slow
|
|
|
|
```python
|
|
# Problem: Too high resolution
|
|
df.plot(df.x, df.y, shape=(2048, 2048))
|
|
|
|
# Solution: Reduce shape
|
|
df.plot(df.x, df.y, shape=(512, 512))
|
|
```
|
|
|
|
### Issue: Can't See Low-Density Regions
|
|
|
|
```python
|
|
# Problem: Linear scale overwhelmed by high-density areas
|
|
df.plot(df.x, df.y, f='identity')
|
|
|
|
# Solution: Use logarithmic scale
|
|
df.plot(df.x, df.y, f='log')
|
|
```
|
|
|
|
## Related Resources
|
|
|
|
- For data aggregation: See `data_processing.md`
|
|
- For performance optimization: See `performance.md`
|
|
- For DataFrame basics: See `core_dataframes.md`
|