# Core DataFrames and Data Loading This reference covers Vaex DataFrame basics, loading data from various sources, and understanding the DataFrame structure. ## DataFrame Fundamentals A Vaex DataFrame is the central data structure for working with large tabular datasets. Unlike pandas, Vaex DataFrames: - Use **lazy evaluation** - operations are not executed until needed - Work **out-of-core** - data doesn't need to fit in RAM - Support **virtual columns** - computed columns with no memory overhead - Enable **billion-row-per-second** processing through optimized C++ backend ## Opening Existing Files ### Primary Method: `vaex.open()` The most common way to load data: ```python import vaex # Works with multiple formats df = vaex.open('data.hdf5') # HDF5 (recommended) df = vaex.open('data.arrow') # Apache Arrow (recommended) df = vaex.open('data.parquet') # Parquet df = vaex.open('data.csv') # CSV (slower for large files) df = vaex.open('data.fits') # FITS (astronomy) # Can open multiple files as one DataFrame df = vaex.open('data_*.hdf5') # Wildcards supported ``` **Key characteristics:** - **Instant for HDF5/Arrow** - Memory-maps files, no loading time - **Handles large CSVs** - Automatically chunks large CSV files - **Returns immediately** - Lazy evaluation means no computation until needed ### Format-Specific Loaders ```python # CSV with options df = vaex.from_csv( 'large_file.csv', chunk_size=5_000_000, # Process in chunks convert=True, # Convert to HDF5 automatically copy_index=False # Don't copy pandas index if present ) # Apache Arrow df = vaex.open('data.arrow') # Native support, very fast # HDF5 (optimal format) df = vaex.open('data.hdf5') # Instant loading via memory mapping ``` ## Creating DataFrames from Other Sources ### From Pandas ```python import pandas as pd import vaex # Convert pandas DataFrame pdf = pd.read_csv('data.csv') df = vaex.from_pandas(pdf, copy_index=False) # Warning: This loads entire pandas DataFrame into memory # For large data, prefer vaex.from_csv() directly ``` ### From NumPy Arrays ```python import numpy as np import vaex # Single array x = np.random.rand(1_000_000) df = vaex.from_arrays(x=x) # Multiple arrays x = np.random.rand(1_000_000) y = np.random.rand(1_000_000) df = vaex.from_arrays(x=x, y=y) ``` ### From Dictionaries ```python import vaex # Dictionary of lists/arrays data = { 'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'salary': [50000, 60000, 70000] } df = vaex.from_dict(data) ``` ### From Arrow Tables ```python import pyarrow as pa import vaex # From Arrow Table arrow_table = pa.table({ 'x': [1, 2, 3], 'y': [4, 5, 6] }) df = vaex.from_arrow_table(arrow_table) ``` ## Example Datasets Vaex provides built-in example datasets for testing: ```python import vaex # NYC taxi dataset (~1GB, 11 million rows) df = vaex.example() # Smaller datasets df = vaex.datasets.titanic() df = vaex.datasets.iris() ``` ## Inspecting DataFrames ### Basic Information ```python # Display first and last rows print(df) # Shape (rows, columns) print(df.shape) # Returns (row_count, column_count) print(len(df)) # Row count # Column names print(df.columns) print(df.column_names) # Data types print(df.dtypes) # Memory usage (for materialized columns) df.byte_size() ``` ### Statistical Summary ```python # Quick statistics for all numeric columns df.describe() # Single column statistics df.x.mean() df.x.std() df.x.min() df.x.max() df.x.sum() df.x.count() # Quantiles df.x.quantile(0.5) # Median df.x.quantile([0.25, 0.5, 0.75]) # Multiple quantiles ``` ### Viewing Data ```python # First/last rows (returns pandas DataFrame) df.head(10) df.tail(10) # Random sample df.sample(n=100) # Convert to pandas (careful with large data!) pdf = df.to_pandas_df() # Convert specific columns only pdf = df[['x', 'y']].to_pandas_df() ``` ## DataFrame Structure ### Columns ```python # Access columns as expressions x_column = df.x y_column = df['y'] # Column operations return expressions (lazy) sum_column = df.x + df.y # Not computed yet # List all columns print(df.get_column_names()) # Check column types print(df.dtypes) # Virtual vs materialized columns print(df.get_column_names(virtual=False)) # Materialized only print(df.get_column_names(virtual=True)) # All columns ``` ### Rows ```python # Row count row_count = len(df) row_count = df.count() # Single row (returns dict) row = df.row(0) print(row['column_name']) # Note: Iterating over rows is NOT recommended in Vaex # Use vectorized operations instead ``` ## Working with Expressions Expressions are Vaex's way of representing computations that haven't been executed yet: ```python # Create expressions (no computation) expr = df.x ** 2 + df.y # Expressions can be used in many contexts mean_of_expr = expr.mean() # Still lazy df['new_col'] = expr # Virtual column filtered = df[expr > 10] # Selection # Force evaluation result = expr.values # Returns NumPy array (use carefully!) ``` ## DataFrame Operations ### Copying ```python # Shallow copy (shares data) df_copy = df.copy() # Deep copy (independent data) df_deep = df.copy(deep=True) ``` ### Trimming/Slicing ```python # Select row range df_subset = df[1000:2000] # Rows 1000-2000 df_subset = df[:1000] # First 1000 rows df_subset = df[-1000:] # Last 1000 rows # Note: This creates a view, not a copy (efficient) ``` ### Concatenating ```python # Vertical concatenation (combine rows) df_combined = vaex.concat([df1, df2, df3]) # Horizontal concatenation (combine columns) # Use join or simply assign columns df['new_col'] = other_df.some_column ``` ## Best Practices 1. **Prefer HDF5 or Arrow formats** - Instant loading, optimal performance 2. **Convert large CSVs to HDF5** - One-time conversion for repeated use 3. **Avoid `.to_pandas_df()` on large data** - Defeats Vaex's purpose 4. **Use expressions instead of `.values`** - Keep operations lazy 5. **Check data types** - Ensure numeric columns aren't string type 6. **Use virtual columns** - Zero memory overhead for derived data ## Common Patterns ### Pattern: One-time CSV to HDF5 Conversion ```python # Initial conversion (do once) df = vaex.from_csv('large_data.csv', convert='large_data.hdf5') # Future loads (instant) df = vaex.open('large_data.hdf5') ``` ### Pattern: Inspecting Large Datasets ```python import vaex df = vaex.open('large_file.hdf5') # Quick overview print(df) # First/last rows print(df.shape) # Dimensions print(df.describe()) # Statistics # Sample for detailed inspection sample = df.sample(1000).to_pandas_df() print(sample.head()) ``` ### Pattern: Loading Multiple Files ```python # Load multiple files as one DataFrame df = vaex.open('data_part*.hdf5') # Or explicitly concatenate df1 = vaex.open('data_2020.hdf5') df2 = vaex.open('data_2021.hdf5') df_all = vaex.concat([df1, df2]) ``` ## Common Issues and Solutions ### Issue: CSV Loading is Slow ```python # Solution: Convert to HDF5 first df = vaex.from_csv('large.csv', convert='large.hdf5') # Future loads: df = vaex.open('large.hdf5') ``` ### Issue: Column Shows as String Type ```python # Check type print(df.dtypes) # Convert to numeric (creates virtual column) df['age_numeric'] = df.age.astype('int64') ``` ### Issue: Out of Memory on Small Operations ```python # Likely using .values or .to_pandas_df() # Solution: Use lazy operations # Bad (loads into memory) array = df.x.values # Good (stays lazy) mean = df.x.mean() filtered = df[df.x > 10] ``` ## Related Resources - For data manipulation and filtering: See `data_processing.md` - For performance optimization: See `performance.md` - For file format details: See `io_operations.md`