Scientific Computing Foundations

Overview

Core Principle: Vectorize operations, avoid loops. NumPy and pandas are built on C/Fortran code that's orders of magnitude faster than Python loops. The biggest performance gains come from eliminating iteration over rows/elements.

Scientific computing in Python centers on NumPy (arrays) and pandas (dataframes). These libraries enable fast numerical computation on large datasets through vectorized operations and efficient memory layouts. The most common mistake: using Python loops when vectorized operations exist.

When to Use

Use this skill when:

  • "NumPy operations"
  • "Pandas DataFrame slow"
  • "Vectorization"
  • "How to avoid loops?"
  • "DataFrame iteration"
  • "Array performance"
  • "Memory usage too high"
  • "Large dataset processing"

Don't use when:

  • Setting up project (use project-structure-and-tooling)
  • Profiling needed first (use debugging-and-profiling)
  • ML pipeline orchestration (use ml-engineering-workflows)

Symptoms triggering this skill:

  • Slow DataFrame operations
  • High memory usage with arrays
  • Using loops over DataFrame rows
  • Need to process large datasets efficiently

NumPy Fundamentals

Array Creation and Types

import numpy as np

# ❌ WRONG: Creating arrays from Python lists in loop
data = []
for i in range(1000000):
    data.append(i * 2)
arr = np.array(data)

# ✅ CORRECT: Use NumPy functions
arr = np.arange(1000000) * 2

# ✅ ACCEPTABLE: Pre-allocate for known size (only when vectorization is impossible)
arr = np.empty(1000000, dtype=np.int64)
for i in range(1000000):
    arr[i] = i * 2  # Still a slow Python loop, but avoids list-append overhead

# ✅ BETTER: Fully vectorized
arr = np.arange(1000000, dtype=np.int64) * 2

# ✅ CORRECT: Specify dtype for memory efficiency
# float64 (default): 8 bytes per element
# float32: 4 bytes per element
large_arr = np.zeros(1000000, dtype=np.float32)  # Half the memory

# Why this matters: dtype affects both memory usage and performance
# Use smallest dtype that fits your data

Vectorized Operations

# ❌ WRONG: Loop over array elements
arr = np.arange(1000000)
result = np.empty(1000000)
for i in range(len(arr)):
    result[i] = arr[i] ** 2 + 2 * arr[i] + 1

# ✅ CORRECT: Vectorized operations
arr = np.arange(1000000)
result = arr ** 2 + 2 * arr + 1

# Speed difference: ~100x faster with vectorization

# ❌ WRONG: Element-wise comparison in loop
matches = []
for val in arr:
    if val > 100:
        matches.append(val)
result = np.array(matches)

# ✅ CORRECT: Boolean indexing
result = arr[arr > 100]

# ✅ CORRECT: Complex conditions
result = arr[(arr > 100) & (arr < 200)]  # Note: & not 'and'
result = arr[(arr < 50) | (arr > 150)]   # Note: | not 'or'

Why this matters: Vectorized operations run in C, avoiding Python interpreter overhead. 10-100x speedup typical.
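
To verify the speedup on your own workload, a quick timeit comparison is enough. A minimal sketch (absolute numbers depend on array size and hardware; the ratio is what matters):

import timeit
import numpy as np

arr = np.arange(1_000_000)

def loop_version():
    result = np.empty(len(arr))
    for i in range(len(arr)):
        result[i] = arr[i] ** 2 + 2 * arr[i] + 1
    return result

def vectorized_version():
    return arr ** 2 + 2 * arr + 1

# Each callable is run 3 times; compare the totals
print("loop:      ", timeit.timeit(loop_version, number=3))
print("vectorized:", timeit.timeit(vectorized_version, number=3))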

Broadcasting

# Broadcasting: Operating on arrays of different shapes

# ✅ CORRECT: Scalar broadcasting
arr = np.array([1, 2, 3, 4])
result = arr + 10  # [11, 12, 13, 14]

# ✅ CORRECT: 1D array broadcast to 2D
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

row_vector = np.array([10, 20, 30])
result = matrix + row_vector
# [[11, 22, 33],
#  [14, 25, 36],
#  [17, 28, 39]]

# ✅ CORRECT: Column vector broadcasting
col_vector = np.array([[10],
                       [20],
                       [30]])
result = matrix + col_vector
# [[11, 12, 13],
#  [24, 25, 26],
#  [37, 38, 39]]

# ✅ CORRECT: Add axis for broadcasting
row = np.array([1, 2, 3])
col = row[:, np.newaxis]  # Convert to column vector
# col shape: (3, 1)

# Outer product via broadcasting
outer = row[np.newaxis, :] * col
# [[1, 2, 3],
#  [2, 4, 6],
#  [3, 6, 9]]

# ❌ WRONG: Manual broadcasting with loops
result = np.empty_like(matrix)
for i in range(matrix.shape[0]):
    for j in range(matrix.shape[1]):
        result[i, j] = matrix[i, j] + row_vector[j]

# Why this matters: Broadcasting eliminates loops and is much faster
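
A common practical use of broadcasting is standardizing the columns of a 2D data matrix without any loops. A minimal sketch (shapes are illustrative):

data = np.random.rand(1000, 5)             # shape (1000, 5)
col_means = data.mean(axis=0)              # shape (5,)
col_stds = data.std(axis=0)                # shape (5,)

# (1000, 5) minus (5,) broadcasts the row vector across every row
standardized = (data - col_means) / col_stds
print(standardized.mean(axis=0).round(3))  # approximately 0 for every column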

Memory-Efficient Operations

# ❌ WRONG: Creating unnecessary copies
large_arr = np.random.rand(10000, 10000)  # ~800MB
result1 = large_arr + 1  # Creates new 800MB array
result2 = result1 * 2    # Creates another 800MB array
# Total: 2.4GB memory usage

# ✅ CORRECT: In-place operations
large_arr = np.random.rand(10000, 10000)
large_arr += 1  # Modifies in-place, no copy
large_arr *= 2  # Modifies in-place, no copy
# Total: 800MB memory usage

# ✅ CORRECT: Use 'out' parameter
result = np.empty_like(large_arr)
np.add(large_arr, 1, out=result)
np.multiply(result, 2, out=result)

# ❌ WRONG: Copying when a view would do
arr = np.arange(1000000)
subset = arr[::2].copy()  # Doubles memory; only copy if you need an independent array
subset[0] = 999  # Doesn't affect arr

# ✅ CORRECT: Views avoid copies (when possible)
arr = np.arange(1000000)
view = arr[::2]  # View, not copy (shares memory)
view[0] = 999  # Modifies arr too!

# Check whether an array is a view or owns its memory
print(view.base is arr)   # True: view shares arr's memory
print(arr.base is None)   # True: arr owns its own memory

Why this matters: Large arrays consume lots of memory. In-place operations and views avoid copies, reducing memory usage significantly.
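
To see what an operation actually costs, check nbytes and whether two arrays share memory. A small sketch using np.shares_memory:

arr = np.random.rand(10000, 10000)
print(arr.nbytes / 1e9)             # ~0.8 GB (float64)

view = arr[::2]
copy = arr[::2].copy()
print(np.shares_memory(arr, view))  # True: no extra allocation
print(np.shares_memory(arr, copy))  # False: a second block of memory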

Aggregations and Reductions

# ✅ CORRECT: Axis-aware aggregations
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Sum all elements
total = matrix.sum()  # 45

# Sum along axis 0 (columns)
col_sums = matrix.sum(axis=0)  # [12, 15, 18]

# Sum along axis 1 (rows)
row_sums = matrix.sum(axis=1)  # [6, 15, 24]

# ❌ WRONG: Manual aggregation
total = 0
for row in matrix:
    for val in row:
        total += val

# ✅ CORRECT: Multiple aggregations
matrix.mean()
matrix.std()
matrix.min()
matrix.max()
matrix.argmin()  # Index of minimum
matrix.argmax()  # Index of maximum

# ✅ CORRECT: Conditional aggregations
# Sum only positive values
positive_sum = matrix[matrix > 0].sum()

# Count elements > 5
count = (matrix > 5).sum()

# Percentage > 5
percentage = (matrix > 5).mean() * 100

pandas Fundamentals

DataFrame Creation

import pandas as pd

# ❌ WRONG: Building DataFrame row by row
df = pd.DataFrame()
for i in range(10000):
    df = pd.concat([df, pd.DataFrame({'a': [i], 'b': [i*2]})], ignore_index=True)
# Extremely slow: O(n²) complexity

# ✅ CORRECT: Create from dict of lists
data = {
    'a': list(range(10000)),
    'b': [i * 2 for i in range(10000)]
}
df = pd.DataFrame(data)

# ✅ BETTER: Use NumPy arrays
df = pd.DataFrame({
    'a': np.arange(10000),
    'b': np.arange(10000) * 2
})

# ✅ CORRECT: From records
records = [{'a': i, 'b': i*2} for i in range(10000)]
df = pd.DataFrame.from_records(records)

The Iteration Anti-Pattern

# ❌ WRONG: iterrows() - THE MOST COMMON MISTAKE
df = pd.DataFrame({
    'value': np.random.rand(100000),
    'category': np.random.choice(['A', 'B', 'C'], 100000)
})

result = []
for idx, row in df.iterrows():  # VERY SLOW
    if row['value'] > 0.5:
        result.append(row['value'] * 2)

# ✅ CORRECT: Vectorized operations
mask = df['value'] > 0.5
result = df.loc[mask, 'value'] * 2

# Speed difference: ~100x faster

# ❌ WRONG: apply() on axis=1 (still row-by-row)
df['result'] = df.apply(
    lambda row: row['value'] * 2 if row['value'] > 0.5 else 0,
    axis=1
)
# Still slow: applies Python function to each row

# ✅ CORRECT: Vectorized with np.where
df['result'] = np.where(df['value'] > 0.5, df['value'] * 2, 0)

# ✅ CORRECT: Boolean indexing + assignment
df['result'] = 0
df.loc[df['value'] > 0.5, 'result'] = df['value'] * 2

Why this matters: iterrows() is the single biggest DataFrame performance killer. ALWAYS look for vectorized alternatives.
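
When the logic has more than two branches, nesting np.where gets unreadable; np.select keeps it vectorized. A sketch reusing the df above:

# First matching condition wins, top to bottom
conditions = [
    df['value'] > 0.8,
    df['value'] > 0.5,
]
choices = [
    df['value'] * 3,
    df['value'] * 2,
]
df['result'] = np.select(conditions, choices, default=0)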

Efficient Filtering and Selection

df = pd.DataFrame({
    'A': np.random.rand(100000),
    'B': np.random.rand(100000),
    'C': np.random.choice(['X', 'Y', 'Z'], 100000)
})

# ❌ WRONG: Chaining filters inefficiently
df_filtered = df[df['A'] > 0.5]
df_filtered = df_filtered[df_filtered['B'] < 0.3]
df_filtered = df_filtered[df_filtered['C'] == 'X']

# ✅ CORRECT: Single boolean mask
mask = (df['A'] > 0.5) & (df['B'] < 0.3) & (df['C'] == 'X')
df_filtered = df[mask]

# ✅ CORRECT: query() for complex filters (cleaner syntax)
df_filtered = df.query('A > 0.5 and B < 0.3 and C == "X"')

# ✅ CORRECT: isin() for multiple values
df_filtered = df[df['C'].isin(['X', 'Y'])]

# ❌ WRONG: String matching in loop
matches = []
for val in df['C']:
    if 'X' in val:
        matches.append(True)
    else:
        matches.append(False)
df_filtered = df[matches]

# ✅ CORRECT: Vectorized string operations
df_filtered = df[df['C'].str.contains('X')]

GroupBy Operations

df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C'], 100000),
    'value': np.random.rand(100000),
    'count': np.random.randint(1, 100, 100000)
})

# ❌ WRONG: Manual grouping
groups = {}
for idx, row in df.iterrows():
    cat = row['category']
    if cat not in groups:
        groups[cat] = []
    groups[cat].append(row['value'])

results = {cat: sum(vals) / len(vals) for cat, vals in groups.items()}

# ✅ CORRECT: GroupBy
results = df.groupby('category')['value'].mean()

# ✅ CORRECT: Multiple aggregations
results = df.groupby('category').agg({
    'value': ['mean', 'std', 'min', 'max'],
    'count': 'sum'
})

# ✅ CORRECT: Named aggregations (pandas 0.25+)
results = df.groupby('category').agg(
    mean_value=('value', 'mean'),
    std_value=('value', 'std'),
    total_count=('count', 'sum')
)

# ✅ CORRECT: Custom aggregation function
def range_func(x):
    return x.max() - x.min()

results = df.groupby('category')['value'].agg(range_func)

# ✅ CORRECT: Transform (keeps original shape)
df['value_centered'] = df.groupby('category')['value'].transform(
    lambda x: x - x.mean()
)

Why this matters: GroupBy is highly optimized. Much faster than manual grouping. Use built-in aggregations when possible.
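
Built-in aggregation names dispatch to optimized code paths, while Python lambdas do not, so prefer the string form where one exists. The centering example above can be rewritten as:

# Same result as the lambda version, but uses the optimized built-in 'mean'
group_means = df.groupby('category')['value'].transform('mean')
df['value_centered'] = df['value'] - group_means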

Performance Anti-Patterns

Anti-Pattern 1: DataFrame Iteration

# ❌ WRONG: Iterating over rows
for idx, row in df.iterrows():
    df.at[idx, 'new_col'] = row['a'] + row['b']

# ✅ CORRECT: Vectorized column operation
df['new_col'] = df['a'] + df['b']

# ❌ WRONG: itertuples() (faster than iterrows, but still a Python-level loop)
for row in df.itertuples():
    ...  # Process row via attribute access, e.g. row.a

# ✅ CORRECT: Use vectorized operations, or apply on whole columns

Anti-Pattern 2: Repeated Concatenation

# ❌ WRONG: Growing DataFrame in loop
df = pd.DataFrame()
for i in range(10000):
    new_row_df = pd.DataFrame({'a': [i], 'b': [i * 2]})
    df = pd.concat([df, new_row_df], ignore_index=True)
# O(n²) complexity, extremely slow

# ✅ CORRECT: Collect data, then create DataFrame
data = []
for i in range(10000):
    data.append({'a': i, 'b': i*2})
df = pd.DataFrame(data)

# ✅ CORRECT: Pre-allocate NumPy array
arr = np.empty((10000, 2))
for i in range(10000):
    arr[i] = [i, i*2]
df = pd.DataFrame(arr, columns=['a', 'b'])

Anti-Pattern 3: Using apply When Vectorized Exists

# ❌ WRONG: apply() for simple operations
df['result'] = df['value'].apply(lambda x: x * 2)

# ✅ CORRECT: Direct vectorized operation
df['result'] = df['value'] * 2

# ❌ WRONG: apply() for conditions
df['category'] = df['value'].apply(lambda x: 'high' if x > 0.5 else 'low')

# ✅ CORRECT: np.where or pd.cut
df['category'] = np.where(df['value'] > 0.5, 'high', 'low')

# ✅ CORRECT: pd.cut for binning
df['category'] = pd.cut(df['value'], bins=[0, 0.5, 1.0], labels=['low', 'high'])

# When apply IS appropriate:
# - Complex logic not vectorizable
# - Need to call external function per row
# But verify vectorization truly impossible first
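
As an example of a genuinely non-vectorizable case, parsing a column with an external function has to run per element; the column and payload below are hypothetical:

import json

raw = pd.DataFrame({'payload': ['{"a": 1}', '{"a": 2}', '{"a": 3}']})
raw['parsed'] = raw['payload'].map(json.loads)   # per-element call is unavoidable here
raw['a'] = raw['parsed'].map(lambda d: d['a'])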

Anti-Pattern 4: Not Using Categorical Data

# ❌ WRONG: String columns for repeated values
df = pd.DataFrame({
    'category': ['A'] * 10000 + ['B'] * 10000 + ['C'] * 10000
})
# Memory: ~240KB just for the object pointers (one Python reference per row)

# ✅ CORRECT: Categorical type
df['category'] = pd.Categorical(df['category'])
# Memory: ~30KB (integers + small string table)

# ✅ CORRECT: Define categories at creation
df = pd.DataFrame({
    'category': pd.Categorical(
        ['A'] * 10000 + ['B'] * 10000,
        categories=['A', 'B', 'C']
    )
})

# When to use categorical:
# - Limited number of unique values (< 50% of rows)
# - Repeated string/object values
# - Memory constraints
# - Faster groupby operations
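
A quick way to apply these criteria is to check the cardinality ratio before converting. A minimal helper sketch (the 0.5 threshold mirrors the rule of thumb above):

def maybe_categorical(s: pd.Series, max_unique_ratio: float = 0.5) -> pd.Series:
    """Convert to categorical only if the column repeats values heavily."""
    if s.dtype == object and s.nunique() / len(s) < max_unique_ratio:
        return s.astype('category')
    return s

df['category'] = maybe_categorical(df['category'])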

Memory Optimization

Choosing Appropriate dtypes

# ❌ WRONG: Default dtypes waste memory
df = pd.DataFrame({
    'int_col': [1, 2, 3, 4, 5],              # int64 by default
    'float_col': [1.0, 2.0, 3.0, 4.0, 5.0],  # float64 by default
    'str_col': ['a', 'b', 'c', 'd', 'e']     # object dtype
})

print(df.memory_usage(deep=True))

# ✅ CORRECT: Optimize dtypes
df = pd.DataFrame({
    'int_col': pd.array([1, 2, 3, 4, 5], dtype='int8'),  # -128 to 127
    'float_col': pd.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype='float32'),
    'str_col': pd.Categorical(['a', 'b', 'c', 'd', 'e'])
})

# ✅ CORRECT: Downcast after loading
df = pd.read_csv('data.csv')
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')

# Integer dtype ranges:
# int8: -128 to 127
# int16: -32,768 to 32,767
# int32: -2.1B to 2.1B
# int64: -9.2E18 to 9.2E18

# Float dtype precision:
# float16: ~3 decimal digits (rarely used)
# float32: ~7 decimal digits
# float64: ~15 decimal digits

Chunked Processing for Large Files

# ❌ WRONG: Loading entire file into memory
df = pd.read_csv('huge_file.csv')  # 10GB file, OOM!
df_processed = process_dataframe(df)

# ✅ CORRECT: Process in chunks
chunk_size = 100000
results = []

for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    processed = process_dataframe(chunk)
    results.append(processed)

df_final = pd.concat(results, ignore_index=True)

# ✅ CORRECT: Streaming aggregation
totals = {'A': 0, 'B': 0, 'C': 0}

for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    for col in totals:
        totals[col] += chunk[col].sum()

# ✅ CORRECT: Only load needed columns
df = pd.read_csv('huge_file.csv', usecols=['col1', 'col2', 'col3'])

Using Sparse Data Structures

# ❌ WRONG: Dense array for sparse data
# Data with 99% zeros
dense = np.zeros(1000000)
dense[::100] = 1  # Only 1% non-zero
# Memory: 8MB (float64 * 1M)

# ✅ CORRECT: Sparse array
from scipy.sparse import csr_matrix
sparse = csr_matrix(dense)
# Memory: ~120KB (stores only the non-zero values plus their indices)

# ✅ CORRECT: Sparse DataFrame
df = pd.DataFrame({
    'A': pd.arrays.SparseArray([0] * 100 + [1] + [0] * 100),
    'B': pd.arrays.SparseArray([0] * 50 + [2] + [0] * 150)
})

Data Pipeline Patterns

Method Chaining

# ❌ WRONG: Many intermediate variables
df = pd.read_csv('data.csv')
df = df[df['value'] > 0]
df = df.groupby('category')['value'].mean()
df = df.reset_index()
df = df.rename(columns={'value': 'mean_value'})

# ✅ CORRECT: Method chaining
df = (
    pd.read_csv('data.csv')
    .query('value > 0')
    .groupby('category')['value']
    .mean()
    .reset_index()
    .rename(columns={'value': 'mean_value'})
)

# ✅ CORRECT: Pipe for custom functions
def remove_outliers(df, column, n_std=3):
    mean = df[column].mean()
    std = df[column].std()
    return df[
        (df[column] > mean - n_std * std) &
        (df[column] < mean + n_std * std)
    ]

df = (
    pd.read_csv('data.csv')
    .pipe(remove_outliers, 'value', n_std=2)
    .groupby('category')['value']
    .mean()
)

Efficient Merges and Joins

# ❌ WRONG: Multiple small merges
for small_df in list_of_dfs:
    main_df = main_df.merge(small_df, on='key')
# Inefficient: creates many intermediate copies

# ✅ CORRECT: Join all at once via a shared index
# DataFrame.join accepts a list of frames (non-overlapping column names required)
main_df = main_df.set_index('key').join(
    [d.set_index('key') for d in list_of_dfs]
).reset_index()

# ✅ CORRECT: Optimize merge with sorted/indexed data
df1 = df1.set_index('key').sort_index()
df2 = df2.set_index('key').sort_index()
result = df1.merge(df2, left_index=True, right_index=True)

# ✅ CORRECT: Use indicator to track merge sources
result = df1.merge(df2, on='key', how='outer', indicator=True)
print(result['_merge'].value_counts())
# Shows: left_only, right_only, both

# ❌ WRONG: Accidental row explosion from a many-to-many merge
# df1: 1000 rows, df2: 1000 rows, 'wrong_key' has duplicate values on both sides
result = df1.merge(df2, on='wrong_key')
# Up to 1,000,000 rows if every row shares a single key value!

# ✅ CORRECT: Validate merge
result = df1.merge(df2, on='key', validate='1:1')
# Raises error if not one-to-one relationship

Handling Missing Data

# ❌ WRONG: Dropping all rows with any NaN
df_clean = df.dropna()  # Might lose most of data

# ✅ CORRECT: Drop rows with NaN in specific columns
df_clean = df.dropna(subset=['important_col1', 'important_col2'])

# ✅ CORRECT: Fill NaN with appropriate values
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].mean())
df['category_col'] = df['category_col'].fillna('Unknown')

# ✅ CORRECT: Forward/backward fill for time series
df['value'] = df['value'].ffill()  # Forward fill (fillna(method=...) is deprecated)

# ✅ CORRECT: Interpolation
df['value'] = df['value'].interpolate(method='linear')

# ❌ WRONG: Aggregating without checking how much data is missing
result = df['value'].mean()  # pandas skips NaN silently; the mean may rest on few values

# ✅ CORRECT: Check missingness, then handle NaN explicitly
missing_frac = df['value'].isna().mean()  # Fraction of missing values
result = df['value'].mean(skipna=True)    # Default behaviour, but explicit is clearer

Advanced NumPy Techniques

Universal Functions (ufuncs)

# ✅ CORRECT: Using built-in ufuncs
arr = np.random.rand(1000000)

# Trigonometric
result = np.sin(arr)
result = np.cos(arr)

# Exponential
result = np.exp(arr)
result = np.log(arr)

# Comparison
result = np.maximum(arr, 0.5)  # Element-wise max with scalar
result = np.minimum(arr, 0.5)

# ✅ CORRECT: Custom ufunc with numba's @vectorize (third-party JIT, separate install)
from numba import vectorize

@vectorize
def custom_func(x):
    if x > 0.5:
        return x ** 2
    else:
        return x ** 3

result = custom_func(arr)  # Runs at C speed

Advanced Indexing

# ✅ CORRECT: Fancy indexing
arr = np.arange(100)
indices = [0, 5, 10, 15, 20]
result = arr[indices]  # Select specific indices

# ✅ CORRECT: Boolean indexing with multiple conditions
arr = np.random.rand(1000000)
mask = (arr > 0.3) & (arr < 0.7)
result = arr[mask]

# ✅ CORRECT: np.where for conditional replacement
arr = np.random.rand(1000)
result = np.where(arr > 0.5, arr, 0)  # Replace values <= 0.5 with 0

# ✅ CORRECT: Multi-dimensional indexing
matrix = np.random.rand(100, 100)
rows = [0, 10, 20]
cols = [5, 15, 25]
result = matrix[rows, cols]  # Select specific elements

# Get diagonal
diagonal = matrix[np.arange(100), np.arange(100)]
# Or use np.diag
diagonal = np.diag(matrix)

Linear Algebra Operations

# ✅ CORRECT: Matrix multiplication
A = np.random.rand(1000, 500)
B = np.random.rand(500, 200)
C = A @ B  # Python 3.5+ matrix multiply operator

# Or
C = np.dot(A, B)
C = np.matmul(A, B)

# ✅ CORRECT: Solve linear system Ax = b
A = np.random.rand(100, 100)
b = np.random.rand(100)
x = np.linalg.solve(A, b)

# ✅ CORRECT: Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

# ✅ CORRECT: SVD (Singular Value Decomposition)
U, s, Vt = np.linalg.svd(A)

# ✅ CORRECT: Inverse
A_inv = np.linalg.inv(A)

# ❌ WRONG: Using inverse for solving Ax = b
x = np.linalg.inv(A) @ b  # Slower and less numerically stable

# ✅ CORRECT: Use solve directly
x = np.linalg.solve(A, b)

Type Hints for NumPy and pandas

NumPy Type Hints

import numpy as np
from numpy.typing import NDArray

# ✅ CORRECT: Type hint for NumPy arrays
def process_array(arr: NDArray[np.float64]) -> NDArray[np.float64]:
    return arr * 2

# ✅ CORRECT: Generic array type
def normalize(arr: NDArray) -> NDArray:
    return (arr - arr.mean()) / arr.std()

# ✅ CORRECT: Semantic aliases for intent (typing.TypeAlias requires Python 3.10+)
from typing import TypeAlias

Vector: TypeAlias = NDArray[np.float64]  # intended as a 1D array (shape not enforced)
Matrix: TypeAlias = NDArray[np.float64]  # intended as a 2D array (shape not enforced)

def matrix_multiply(A: Matrix, B: Matrix) -> Matrix:
    return A @ B

pandas Type Hints

import pandas as pd

# ✅ CORRECT: Type hints for Series and DataFrame
def process_series(s: pd.Series) -> pd.Series:
    return s * 2

def process_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    return df[df['value'] > 0]

# ✅ CORRECT: More specific DataFrame types (using Protocols)
from typing import Protocol

class DataFrameWithColumns(Protocol):
    """DataFrame with specific columns."""
    def __getitem__(self, key: str) -> pd.Series: ...

def analyze_data(df: DataFrameWithColumns) -> float:
    return df['value'].mean()

Real-World Patterns

Time Series Operations

# ✅ CORRECT: Efficient time series resampling
df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=1000000, freq='1s'),
    'value': np.random.rand(1000000)
})

df = df.set_index('timestamp')

# Resample to 1-minute intervals
df_resampled = df.resample('1min').agg({
    'value': ['mean', 'std', 'min', 'max']
})

# ✅ CORRECT: Rolling window operations
df['rolling_mean'] = df['value'].rolling(window=60).mean()
df['rolling_std'] = df['value'].rolling(window=60).std()

# ✅ CORRECT: Lag features
df['value_lag1'] = df['value'].shift(1)
df['value_lag60'] = df['value'].shift(60)

# ✅ CORRECT: Difference for stationarity
df['value_diff'] = df['value'].diff()

Multi-Index Operations

# ✅ CORRECT: Creating multi-index DataFrame
df = pd.DataFrame({
    'country': ['USA', 'USA', 'UK', 'UK'],
    'city': ['NYC', 'LA', 'London', 'Manchester'],
    'value': [100, 200, 150, 175]
})

df = df.set_index(['country', 'city'])

# Accessing with multi-index
df.loc['USA']  # All USA cities
df.loc[('USA', 'NYC')]  # Specific city

# ✅ CORRECT: Cross-section
df.xs('USA', level='country')
df.xs('London', level='city')

# ✅ CORRECT: GroupBy with multi-index
df.groupby(level='country').mean()

Parallel Processing with Dask

# For datasets larger than memory, use Dask
import dask.dataframe as dd

# ✅ CORRECT: Dask for out-of-core processing
df = dd.read_csv('huge_file.csv')
result = df.groupby('category')['value'].mean().compute()

# Dask uses same API as pandas, but lazy evaluation
# Only computes when .compute() is called
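
Because evaluation is lazy, several results can be defined first and computed in one pass. A sketch assuming the same huge_file.csv:

import dask

mean_by_cat = df.groupby('category')['value'].mean()  # lazy
row_count = df.shape[0]                                # lazy scalar

# One call executes both task graphs, sharing the CSV scan
mean_result, n_rows = dask.compute(mean_by_cat, row_count)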

Anti-Pattern Summary

Top 5 Performance Killers

  1. iterrows() - Use vectorized operations
  2. Growing DataFrame in loop - Collect data, then create DataFrame
  3. apply() for simple operations - Use vectorized alternatives
  4. Not using categorical for strings - Convert to categorical
  5. Loading entire file when chunking works - Use chunksize parameter

Memory Usage Mistakes

  1. Using float64 when float32 sufficient - Halves memory
  2. Not using categorical for repeated strings - 10x memory savings
  3. Creating unnecessary copies - Use in-place operations
  4. Loading all columns when few needed - Use usecols parameter

Decision Trees

Should I Use NumPy or pandas?

What's your data structure?
├─ Homogeneous numeric array → NumPy
├─ Heterogeneous tabular data → pandas
├─ Time series → pandas
└─ Linear algebra → NumPy

How to Optimize DataFrame Operation?

Can I vectorize?
├─ Yes → Use vectorized pandas/NumPy operations
└─ No → Can I use groupby?
    ├─ Yes → Use groupby with built-in aggregations
    └─ No → Can I use apply on columns (not rows)?
        ├─ Yes → Use apply on Series
        └─ No → Use itertuples (last resort)

Memory Optimization Strategy

Is memory usage high?
├─ Yes → Check dtypes (downcast if possible)
│   └─ Still high? → Use categorical for strings
│       └─ Still high? → Process in chunks
└─ No → Continue with current approach

Integration with Other Skills

After using this skill:

  • If need ML pipelines → See @ml-engineering-workflows for experiment tracking
  • If performance issues persist → See @debugging-and-profiling for profiling
  • If type hints needed → See @modern-syntax-and-types for advanced typing

Before using this skill:

  • If unsure if slow → Use @debugging-and-profiling to profile first
  • If setting up project → Use @project-structure-and-tooling for dependencies

Quick Reference

NumPy Quick Wins

# Vectorization
result = arr ** 2 + 2 * arr + 1  # Not: for loop

# Boolean indexing
result = arr[arr > 0]  # Not: list comprehension

# Broadcasting
result = matrix + row_vector  # Not: loop over rows

# In-place operations
arr += 1  # Not: arr = arr + 1

pandas Quick Wins

# Never iterrows
df['new'] = df['a'] + df['b']  # Not: iterrows

# Vectorized conditions
df['category'] = np.where(df['value'] > 0.5, 'high', 'low')

# Categorical for strings
df['category'] = pd.Categorical(df['category'])

# Query for complex filters
df.query('A > 0.5 and B < 0.3')  # Not: multiple []

Memory Optimization Checklist

  • Use smallest dtype that fits data
  • Convert repeated strings to categorical
  • Use chunking for files > available RAM
  • Avoid unnecessary copies (use views or in-place ops)
  • Only load needed columns (usecols in read_csv)
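
The checklist can be folded into one small helper run right after loading. A hedged sketch (the function name and thresholds are illustrative, not a pandas API):

def shrink_dataframe(df: pd.DataFrame, cat_ratio: float = 0.5) -> pd.DataFrame:
    """Downcast numeric columns and convert low-cardinality strings to categorical."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='float')
        elif out[col].dtype == object and out[col].nunique() / len(out) < cat_ratio:
            out[col] = out[col].astype('category')
    return out

before = df.memory_usage(deep=True).sum()
df = shrink_dataframe(df)
after = df.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")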