I/O Operations

This reference covers file input/output operations, format conversions, export strategies, and working with various data formats in Vaex.

Overview

Vaex supports multiple file formats with varying performance characteristics. The choice of format significantly impacts loading speed, memory usage, and overall performance.

Format recommendations:

  • HDF5 - Best for most use cases (instant loading, memory-mapped)
  • Apache Arrow - Best for interoperability (instant loading, columnar)
  • Parquet - Good for distributed systems (compressed, columnar)
  • CSV - Avoid for large datasets (slow loading, not memory-mapped)
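
As a quick sketch of how these recommendations combine in practice (file names are placeholders), a CSV is parsed once and every later run opens the HDF5 copy instantly:

import vaex

# One-time: parse the CSV and write an HDF5 copy for future runs
df = vaex.from_csv('events.csv', convert='events.hdf5')

# Later runs: open the HDF5 copy instantly (memory-mapped, no parsing)
df = vaex.open('events.hdf5')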

Reading Data

HDF5 Files

import vaex

# Open HDF5 file (instant, memory-mapped)
df = vaex.open('data.hdf5')

# Multiple files as one DataFrame
df = vaex.open('data_part*.hdf5')
df = vaex.open(['data_2020.hdf5', 'data_2021.hdf5', 'data_2022.hdf5'])

Advantages:

  • Instant loading (memory-mapped, no data read into RAM)
  • Optimal performance for Vaex operations
  • Supports compression
  • Random access patterns
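
A minimal sketch of that random-access behavior, assuming a file data.hdf5 with a numeric column x and at least a few million rows:

import vaex

df = vaex.open('data.hdf5')  # memory-mapped, nothing read yet

# Slices are lazy; only the rows you actually evaluate are read from disk
middle = df[2_000_000:2_000_100]
print(middle.x.values)  # materializes just these 100 rows of column x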

Apache Arrow Files

# Open Arrow file (instant, memory-mapped)
df = vaex.open('data.arrow')
df = vaex.open('data.feather')  # Feather is Arrow format

# Multiple Arrow files
df = vaex.open('data_*.arrow')

Advantages:

  • Instant loading (memory-mapped)
  • Language-agnostic format
  • Excellent for data sharing
  • Zero-copy integration with Arrow ecosystem

Parquet Files

# Open Parquet file
df = vaex.open('data.parquet')

# Multiple Parquet files
df = vaex.open('data_*.parquet')

# From cloud storage
df = vaex.open('s3://bucket/data.parquet')
df = vaex.open('gs://bucket/data.parquet')

Advantages:

  • Compressed by default
  • Columnar format
  • Wide ecosystem support
  • Good for distributed systems

Considerations:

  • Slower than HDF5/Arrow for local files
  • May require full file read for some operations
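
Because Parquet is columnar, one way to keep reads small is to work with only the columns an operation needs; a sketch with placeholder column names:

import vaex

df = vaex.open('transactions.parquet')

# Work with a column subset so only those columns are read from the file
df_small = df[['user_id', 'amount']]
print(df_small.amount.sum())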

CSV Files

# Simple CSV
df = vaex.from_csv('data.csv')

# Large CSV: convert to HDF5 in chunks (chunk_size controls peak memory use)
df = vaex.from_csv('large_data.csv', convert=True, chunk_size=5_000_000)

# CSV with conversion to HDF5
df = vaex.from_csv('large_data.csv', convert='large_data.hdf5')
# Creates HDF5 file for future fast loading

# CSV with options
df = vaex.from_csv(
    'data.csv',
    sep=',',
    header=0,
    names=['col1', 'col2', 'col3'],
    dtype={'col1': 'int64', 'col2': 'float64'},
    usecols=['col1', 'col2'],  # Only load specific columns
    nrows=100000  # Limit number of rows
)

Recommendations:

  • Always convert large CSVs to HDF5 for repeated use
  • Use convert parameter to create HDF5 automatically
  • CSV loading can take significant time for large files

FITS Files (Astronomy)

# Open FITS file
df = vaex.open('astronomical_data.fits')

# Multiple FITS files
df = vaex.open('survey_*.fits')

Writing/Exporting Data

Export to HDF5

# Export to HDF5 (recommended for Vaex)
df.export_hdf5('output.hdf5')

# With progress bar
df.export_hdf5('output.hdf5', progress=True)

# Export subset of columns
df[['col1', 'col2', 'col3']].export_hdf5('subset.hdf5')

# Export with compression (support varies by Vaex version; compressed HDF5
# cannot be memory-mapped)
df.export_hdf5('compressed.hdf5', compression='gzip')

Export to Arrow

# Export to Arrow format
df.export_arrow('output.arrow')

# Export to Feather (Arrow format)
df.export_feather('output.feather')

Export to Parquet

# Export to Parquet
df.export_parquet('output.parquet')

# With compression
df.export_parquet('output.parquet', compression='snappy')
df.export_parquet('output.parquet', compression='gzip')

Export to CSV

# Export to CSV (not recommended for large data)
df.export_csv('output.csv')

# With options
df.export_csv(
    'output.csv',
    sep=',',
    header=True,
    index=False,
    chunk_size=1_000_000
)

# Export subset
df[df.age > 25].export_csv('filtered_output.csv')

Format Conversion

CSV to HDF5 (Most Common)

import vaex

# Method 1: Automatic conversion during read
df = vaex.from_csv('large.csv', convert='large.hdf5')
# Creates large.hdf5, returns DataFrame pointing to it

# Method 2: Explicit conversion
df = vaex.from_csv('large.csv')
df.export_hdf5('large.hdf5')

# Future loads (instant)
df = vaex.open('large.hdf5')

HDF5 to Arrow

# Load HDF5
df = vaex.open('data.hdf5')

# Export to Arrow
df.export_arrow('data.arrow')

Parquet to HDF5

# Load Parquet
df = vaex.open('data.parquet')

# Export to HDF5
df.export_hdf5('data.hdf5')

Multiple CSV Files to Single HDF5

import vaex
import glob

# Find all CSV files
csv_files = glob.glob('data_*.csv')

# Load and concatenate
dfs = [vaex.from_csv(f) for f in csv_files]
df_combined = vaex.concat(dfs)

# Export as single HDF5
df_combined.export_hdf5('combined_data.hdf5')

Incremental/Chunked I/O

Processing Large CSV in Chunks

import vaex

# Process a huge CSV in chunks; with chunk_size (and no convert),
# from_csv yields one DataFrame per chunk
chunk_size = 1_000_000
part_files = []

for i, df_chunk in enumerate(vaex.from_csv('huge.csv', chunk_size=chunk_size)):
    # Process chunk
    df_chunk['new_col'] = df_chunk.x + df_chunk.y

    # Write each chunk to its own HDF5 part file
    part_file = f'processed_part_{i:04d}.hdf5'
    df_chunk.export_hdf5(part_file)
    part_files.append(part_file)

# Open all parts as a single DataFrame (and optionally export one combined file)
df = vaex.open('processed_part_*.hdf5')

Exporting in Chunks

# export_csv already streams the DataFrame to disk in chunks internally,
# so no manual loop is needed; chunk_size controls how many rows are
# converted at a time
df.export_csv('large_output.csv', chunk_size=1_000_000)

Pandas Integration

From Pandas to Vaex

import pandas as pd
import vaex

# Read with pandas
pdf = pd.read_csv('data.csv')

# Convert to Vaex
df = vaex.from_pandas(pdf, copy_index=False)

# For better performance: Use Vaex directly
df = vaex.from_csv('data.csv')  # Preferred

From Vaex to Pandas

# Full conversion (careful with large data!)
pdf = df.to_pandas_df()

# Convert subset
pdf = df[['col1', 'col2']].to_pandas_df()
pdf = df[:10000].to_pandas_df()  # First 10k rows
pdf = df[df.age > 25].to_pandas_df()  # Filtered

# Sample for exploration
pdf_sample = df.sample(n=10000).to_pandas_df()

Arrow Integration

From Arrow to Vaex

import pyarrow as pa
import vaex

# From Arrow Table
arrow_table = pa.table({
    'a': [1, 2, 3],
    'b': [4, 5, 6]
})
df = vaex.from_arrow_table(arrow_table)

# From Arrow file
arrow_table = pa.ipc.open_file('data.arrow').read_all()
df = vaex.from_arrow_table(arrow_table)

From Vaex to Arrow

# Convert to Arrow Table
arrow_table = df.to_arrow_table()

# Write Arrow file
import pyarrow as pa
with pa.ipc.new_file('output.arrow', arrow_table.schema) as writer:
    writer.write_table(arrow_table)

# Or use Vaex export
df.export_arrow('output.arrow')

Remote and Cloud Storage

Reading from S3

import vaex

# Read from S3 (requires s3fs)
df = vaex.open('s3://bucket-name/data.parquet')
df = vaex.open('s3://bucket-name/data.hdf5')

# With credentials (passed through fs_options)
df = vaex.open('s3://bucket-name/data.parquet',
               fs_options={'access_key': 'access_key', 'secret_key': 'secret_key'})

Reading from Google Cloud Storage

# Read from GCS (requires gcsfs)
df = vaex.open('gs://bucket-name/data.parquet')

# With credentials (passed through fs_options)
df = vaex.open('gs://bucket-name/data.parquet',
               fs_options={'token': 'path/to/credentials.json'})

Reading from Azure

# Read from Azure Blob Storage (requires adlfs)
df = vaex.open('az://container-name/data.parquet')

Writing to Cloud Storage

# Export to S3
df.export_parquet('s3://bucket-name/output.parquet')
df.export_hdf5('s3://bucket-name/output.hdf5')

# Export to GCS
df.export_parquet('gs://bucket-name/output.parquet')

Database Integration

Reading from SQL Databases

import vaex
import pandas as pd
from sqlalchemy import create_engine

# Read with pandas, convert to Vaex
engine = create_engine('postgresql://user:password@host:port/database')
pdf = pd.read_sql('SELECT * FROM table', engine)
df = vaex.from_pandas(pdf)

# For large tables: Read in chunks
chunks = []
for chunk in pd.read_sql('SELECT * FROM large_table', engine, chunksize=100000):
    chunks.append(vaex.from_pandas(chunk))
df = vaex.concat(chunks)

# Better: Export from database to CSV/Parquet, then load with Vaex
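
One way to follow that last suggestion, sketched with placeholder connection details and table names: dump the query to Parquet parts once, then let Vaex open all the parts lazily:

import pandas as pd
import vaex
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@host:port/database')

# Dump the table to Parquet parts once, outside of Vaex
for i, chunk in enumerate(pd.read_sql('SELECT * FROM large_table', engine,
                                      chunksize=100_000)):
    chunk.to_parquet(f'large_table_part_{i:04d}.parquet', index=False)

# Open all parts as a single, lazily-loaded DataFrame
df = vaex.open('large_table_part_*.parquet')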

Writing to SQL Databases

# Convert to pandas, then write
pdf = df.to_pandas_df()
pdf.to_sql('table_name', engine, if_exists='replace', index=False)

# For large data: Write in chunks
chunk_size = 100000
for i in range(0, len(df), chunk_size):
    chunk = df[i:i+chunk_size].to_pandas_df()
    chunk.to_sql('table_name', engine,
                 if_exists='append' if i > 0 else 'replace',
                 index=False)

Memory-Mapped Files

Understanding Memory Mapping

# HDF5 and Arrow files are memory-mapped by default
df = vaex.open('data.hdf5')  # No data loaded into RAM

# Data is read from disk on-demand
mean = df.x.mean()  # Streams through data, minimal memory

# is_local reports whether the DataFrame is local (backed by files/memory)
# rather than a remote (server) DataFrame
print(df.is_local())

Forcing Data into Memory

# If needed, load data into memory
df_in_memory = df.copy()
for col in df.get_column_names():
    df_in_memory[col] = df[col].values  # Materializes in memory

File Compression

HDF5 Compression

# Export with compression (support varies by Vaex version; compressed HDF5
# cannot be memory-mapped, so prefer uncompressed HDF5 for active work)
df.export_hdf5('compressed.hdf5', compression='gzip')
df.export_hdf5('compressed.hdf5', compression='lzf')
df.export_hdf5('compressed.hdf5', compression='blosc')

# Trade-off: smaller file size, slower I/O

Parquet Compression

# Parquet is compressed by default
df.export_parquet('data.parquet', compression='snappy')  # Fast
df.export_parquet('data.parquet', compression='gzip')    # Better compression
df.export_parquet('data.parquet', compression='brotli')  # Best compression

Vaex Server (Remote Data)

Starting Vaex Server

# Start server
vaex-server data.hdf5 --host 0.0.0.0 --port 9000

Connecting to Remote Server

import vaex

# Connect to remote Vaex server
df = vaex.open('ws://hostname:9000/data')

# Operations work transparently
mean = df.x.mean()  # Computed on server

State Files

Saving DataFrame State

# Save state (includes virtual columns, selections, etc.)
df.state_write('state.json')

# Includes:
# - Virtual column definitions
# - Active selections
# - Variables
# - Transformations (scalers, encoders, models)

Loading DataFrame State

# Load data
df = vaex.open('data.hdf5')

# Apply saved state
df.state_load('state.json')

# All virtual columns, selections, and transformations restored
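
A small end-to-end sketch (column names are placeholders) showing a virtual column and a selection surviving the state round trip:

import vaex

df = vaex.open('data.hdf5')
df['speed'] = df.distance / df.time   # virtual column
df.select(df.speed > 10)              # active selection
df.state_write('state.json')

# Later, possibly in another process, with the same data file
df2 = vaex.open('data.hdf5')
df2.state_load('state.json')
print(df2.speed.mean(selection=True))  # virtual column and selection restored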

Best Practices

1. Choose the Right Format

# For local work: HDF5
df.export_hdf5('data.hdf5')

# For sharing/interoperability: Arrow
df.export_arrow('data.arrow')

# For distributed systems: Parquet
df.export_parquet('data.parquet')

# Avoid CSV for large data

2. Convert CSV Once

# One-time conversion
df = vaex.from_csv('large.csv', convert='large.hdf5')

# All future loads
df = vaex.open('large.hdf5')  # Instant!

3. Materialize Before Export

# If DataFrame has many virtual columns
df_materialized = df.materialize()
df_materialized.export_hdf5('output.hdf5')

# Faster exports and future loads

4. Use Compression Wisely

# For archival or infrequently accessed data
df.export_hdf5('archived.hdf5', compression='gzip')

# For active work (faster I/O)
df.export_hdf5('working.hdf5')  # No compression

5. Checkpoint Long Pipelines

# After expensive preprocessing
df_preprocessed = preprocess(df)
df_preprocessed.export_hdf5('checkpoint_preprocessed.hdf5')

# After feature engineering
df_features = engineer_features(df_preprocessed)
df_features.export_hdf5('checkpoint_features.hdf5')

# Enables resuming from checkpoints

Performance Comparisons

Format Loading Speed

import time
import vaex

# CSV (slowest)
start = time.time()
df_csv = vaex.from_csv('data.csv')
csv_time = time.time() - start

# HDF5 (instant)
start = time.time()
df_hdf5 = vaex.open('data.hdf5')
hdf5_time = time.time() - start

# Arrow (instant)
start = time.time()
df_arrow = vaex.open('data.arrow')
arrow_time = time.time() - start

print(f"CSV: {csv_time:.2f}s")
print(f"HDF5: {hdf5_time:.4f}s")
print(f"Arrow: {arrow_time:.4f}s")

Common Patterns

Pattern: Production Data Pipeline

import vaex

# Read from source (CSV, database export, etc.)
df = vaex.from_csv('raw_data.csv')

# Process
df['cleaned'] = clean(df.raw_column)
df['feature'] = engineer_feature(df)

# Export for production use
df.export_hdf5('production_data.hdf5')
df.state_write('production_state.json')

# In production: Fast loading
df_prod = vaex.open('production_data.hdf5')
df_prod.state_load('production_state.json')

Pattern: Archiving with Compression

# Archive old data with compression
df_2020 = vaex.open('data_2020.hdf5')
df_2020.export_hdf5('archive_2020.hdf5', compression='gzip')

# Remove uncompressed original
import os
os.remove('data_2020.hdf5')

Pattern: Multi-Source Data Loading

import vaex

# Load from multiple sources
df_csv = vaex.from_csv('data.csv')
df_hdf5 = vaex.open('data.hdf5')
df_parquet = vaex.open('data.parquet')

# Concatenate
df_all = vaex.concat([df_csv, df_hdf5, df_parquet])

# Export unified format
df_all.export_hdf5('unified.hdf5')

Troubleshooting

Issue: CSV Loading Too Slow

# Solution: Convert to HDF5
df = vaex.from_csv('large.csv', convert='large.hdf5')
# Future: df = vaex.open('large.hdf5')

Issue: Out of Memory on Export

# Solution: Export in chunks or materialize first
df_materialized = df.materialize()
df_materialized.export_hdf5('output.hdf5')

Issue: Can't Read File from Cloud

# Install required libraries
# pip install s3fs gcsfs adlfs

# Verify credentials
import s3fs
fs = s3fs.S3FileSystem()
fs.ls('s3://bucket-name/')

Format Feature Matrix

Feature         HDF5            Arrow       Parquet       CSV
Load Speed      Instant         Instant     Fast          Slow
Memory-mapped   Yes             Yes         No            No
Compression     Optional        No          Yes           No
Columnar        Yes             Yes         Yes           No
Portability     Good            Excellent   Excellent     Excellent
File Size       Medium          Medium      Small         Large
Best For        Vaex workflows  Interop     Distributed   Exchange

  • For DataFrame creation: See core_dataframes.md
  • For performance optimization: See performance.md
  • For data processing: See data_processing.md