# Core DataFrames and Data Loading
This reference covers Vaex DataFrame basics, loading data from various sources, and understanding the DataFrame structure.
## DataFrame Fundamentals
A Vaex DataFrame is the central data structure for working with large tabular datasets. Unlike pandas, Vaex DataFrames:
- Use **lazy evaluation** - operations are not executed until needed
- Work **out-of-core** - data doesn't need to fit in RAM
- Support **virtual columns** - computed columns with no memory overhead
- Enable **billion-row-per-second** processing through an optimized C++ backend
## Opening Existing Files
### Primary Method: `vaex.open()`
The most common way to load data:
```python
import vaex
# Works with multiple formats
df = vaex.open('data.hdf5') # HDF5 (recommended)
df = vaex.open('data.arrow') # Apache Arrow (recommended)
df = vaex.open('data.parquet') # Parquet
df = vaex.open('data.csv') # CSV (slower for large files)
df = vaex.open('data.fits') # FITS (astronomy)
# Can open multiple files as one DataFrame
df = vaex.open('data_*.hdf5') # Wildcards supported
```
**Key characteristics:**
- **Instant for HDF5/Arrow** - Memory-maps files, no loading time
- **Handles large CSVs** - Automatically chunks large CSV files
- **Returns immediately** - Lazy evaluation means no computation until needed
### Format-Specific Loaders
```python
# CSV with options
df = vaex.from_csv(
'large_file.csv',
chunk_size=5_000_000, # Process in chunks
convert=True, # Convert to HDF5 automatically
copy_index=False # Don't copy pandas index if present
)
# Apache Arrow
df = vaex.open('data.arrow') # Native support, very fast
# HDF5 (optimal format)
df = vaex.open('data.hdf5') # Instant loading via memory mapping
```
## Creating DataFrames from Other Sources
### From Pandas
```python
import pandas as pd
import vaex
# Convert pandas DataFrame
pdf = pd.read_csv('data.csv')
df = vaex.from_pandas(pdf, copy_index=False)
# Warning: This loads entire pandas DataFrame into memory
# For large data, prefer vaex.from_csv() directly
```
### From NumPy Arrays
```python
import numpy as np
import vaex
# Single array
x = np.random.rand(1_000_000)
df = vaex.from_arrays(x=x)
# Multiple arrays
x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)
df = vaex.from_arrays(x=x, y=y)
```
### From Dictionaries
```python
import vaex
# Dictionary of lists/arrays
data = {
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'salary': [50000, 60000, 70000]
}
df = vaex.from_dict(data)
```
### From Arrow Tables
```python
import pyarrow as pa
import vaex
# From Arrow Table
arrow_table = pa.table({
'x': [1, 2, 3],
'y': [4, 5, 6]
})
df = vaex.from_arrow_table(arrow_table)
```
## Example Datasets
Vaex provides built-in example datasets for testing:
```python
import vaex
# Built-in example dataset (simulated galaxy data, ~330,000 rows)
df = vaex.example()
# Smaller datasets
df = vaex.datasets.titanic()
df = vaex.datasets.iris()
```
## Inspecting DataFrames
### Basic Information
```python
# Display first and last rows
print(df)
# Shape (rows, columns)
print(df.shape) # Returns (row_count, column_count)
print(len(df)) # Row count
# Column names
print(df.column_names)
# Underlying columns (dict-like mapping of name to data)
print(df.columns)
# Data types
print(df.dtypes)
# Memory usage (for materialized columns)
df.byte_size()
```
### Statistical Summary
```python
# Quick statistics for all numeric columns
df.describe()
# Single column statistics
df.x.mean()
df.x.std()
df.x.min()
df.x.max()
df.x.sum()
df.x.count()
# Approximate percentiles (note: percentages, not fractions)
df.percentile_approx(df.x, 50) # Median
df.percentile_approx(df.x, [25, 50, 75]) # Multiple percentiles
```
### Viewing Data
```python
# First/last rows (returns a small DataFrame for display)
df.head(10)
df.tail(10)
# Random sample
df.sample(n=100)
# Convert to pandas (careful with large data!)
pdf = df.to_pandas_df()
# Convert specific columns only
pdf = df[['x', 'y']].to_pandas_df()
```
## DataFrame Structure
### Columns
```python
# Access columns as expressions
x_column = df.x
y_column = df['y']
# Column operations return expressions (lazy)
sum_column = df.x + df.y # Not computed yet
# List all columns
print(df.get_column_names())
# Check column types
print(df.dtypes)
# Virtual vs materialized columns
print(df.get_column_names(virtual=False)) # Materialized only
print(df.get_column_names(virtual=True)) # All columns
```
### Rows
```python
# Row count
row_count = len(df)
row_count = df.count()
# Single row (returns dict)
row = df.row(0)
print(row['column_name'])
# Note: Iterating over rows is NOT recommended in Vaex
# Use vectorized operations instead
```
## Working with Expressions
Expressions are Vaex's way of representing computations that haven't been executed yet:
```python
# Create expressions (no computation)
expr = df.x ** 2 + df.y
# Expressions can be used in many contexts
mean_of_expr = expr.mean() # Aggregation (this triggers computation)
df['new_col'] = expr # Virtual column
filtered = df[expr > 10] # Selection
# Force evaluation
result = expr.values # Returns NumPy array (use carefully!)
```
## DataFrame Operations
### Copying
```python
# Copy a DataFrame (lightweight: the underlying data is shared)
df_copy = df.copy()
# Copies are cheap since column data is not duplicated;
# virtual columns and selections are carried over
```
### Trimming/Slicing
```python
# Select row range
df_subset = df[1000:2000] # Rows 1000-1999 (end is exclusive)
df_subset = df[:1000] # First 1000 rows
df_subset = df[-1000:] # Last 1000 rows
# Note: This creates a view, not a copy (efficient)
```
### Concatenating
```python
# Vertical concatenation (combine rows)
df_combined = vaex.concat([df1, df2, df3])
# Horizontal concatenation (combine columns)
# Use df.join(other_df), or assign evaluated values of matching length
df['new_col'] = other_df.some_column.values
```
## Best Practices
1. **Prefer HDF5 or Arrow formats** - Instant loading, optimal performance
2. **Convert large CSVs to HDF5** - One-time conversion for repeated use
3. **Avoid `.to_pandas_df()` on large data** - Defeats Vaex's purpose
4. **Use expressions instead of `.values`** - Keep operations lazy
5. **Check data types** - Ensure numeric columns aren't string type
6. **Use virtual columns** - Zero memory overhead for derived data
## Common Patterns
### Pattern: One-time CSV to HDF5 Conversion
```python
# Initial conversion (do once)
df = vaex.from_csv('large_data.csv', convert='large_data.hdf5')
# Future loads (instant)
df = vaex.open('large_data.hdf5')
```
### Pattern: Inspecting Large Datasets
```python
import vaex
df = vaex.open('large_file.hdf5')
# Quick overview
print(df) # First/last rows
print(df.shape) # Dimensions
print(df.describe()) # Statistics
# Sample for detailed inspection
sample = df.sample(1000).to_pandas_df()
print(sample.head())
```
### Pattern: Loading Multiple Files
```python
# Load multiple files as one DataFrame
df = vaex.open('data_part*.hdf5')
# Or explicitly concatenate
df1 = vaex.open('data_2020.hdf5')
df2 = vaex.open('data_2021.hdf5')
df_all = vaex.concat([df1, df2])
```
## Common Issues and Solutions
### Issue: CSV Loading is Slow
```python
# Solution: Convert to HDF5 first
df = vaex.from_csv('large.csv', convert='large.hdf5')
# Future loads: df = vaex.open('large.hdf5')
```
### Issue: Column Shows as String Type
```python
# Check type
print(df.dtypes)
# Convert to numeric (creates virtual column)
df['age_numeric'] = df.age.astype('int64')
```
### Issue: Out of Memory on Small Operations
```python
# Likely using .values or .to_pandas_df()
# Solution: Use lazy operations
# Bad (loads into memory)
array = df.x.values
# Good (stays lazy)
mean = df.x.mean()
filtered = df[df.x > 10]
```
## Related Resources
- For data manipulation and filtering: See `data_processing.md`
- For performance optimization: See `performance.md`
- For file format details: See `io_operations.md`