# Performance Optimization

## Performance Tool Selection Guide

### Profiling Tools Decision Matrix

| Tool | Use When | Don't Use When | What It Shows |
|------|----------|----------------|---------------|
| **`profvis`** | Complex code, unknown bottlenecks | Simple functions, known issues | Time per line, call stack |
| **`bench::mark()`** | Comparing alternatives | Single approach | Relative performance, memory |
| **`system.time()`** | Quick checks | Detailed analysis | Total runtime only |
| **`Rprof()`** | Base-R-only environments | When profvis is available | Raw profiling data |

### Step-by-Step Performance Workflow

```r
# 1. Profile first - find the actual bottlenecks
library(profvis)
profvis({
  # Your slow code here
})

# 2. Focus on the slowest parts (80/20 rule)
# Don't optimize until you know where time is spent

# 3. Benchmark alternatives for hot spots
library(bench)
bench::mark(
  current = current_approach(data),
  vectorized = vectorized_approach(data),
  parallel = map(data, in_parallel(func))
)

# 4. Consider tool trade-offs based on bottleneck type
```

## When Each Tool Helps vs Hurts

### Parallel Processing (`in_parallel()`)

```r
# Helps when:
# ✓ CPU-intensive computations
# ✓ Embarrassingly parallel problems
# ✓ Large datasets with independent operations
# ✓ I/O-bound operations (file reading, API calls)

# Hurts when:
# ✗ Simple, fast operations (overhead > benefit)
# ✗ Memory-intensive operations (may cause thrashing)
# ✗ Operations requiring shared state
# ✗ Small datasets

# Example decision point:
expensive_func <- function(x) Sys.sleep(0.1)   # 100ms per call
fast_func <- function(x) x^2                   # microseconds per call

# Good for parallel
map(1:100, in_parallel(expensive_func))   # ~10s -> ~2.5s on 4 cores

# Bad for parallel (overhead > benefit)
map(1:100, in_parallel(fast_func))        # ~100μs of work -> ~50ms (500x slower!)
```

### vctrs Backend Tools

```r
# Use vctrs when:
# ✓ Type safety matters more than raw speed
# ✓ Building reusable package functions
# ✓ Complex coercion/combination logic
# ✓ Consistent behavior across edge cases

# Avoid vctrs when:
# ✗ One-off scripts where speed matters most
# ✗ Simple operations where base R is sufficient
# ✗ Memory is extremely constrained

# Decision point:
simple_combine <- function(x, y) c(x, y)       # Fast, simple
robust_combine <- function(x, y) vec_c(x, y)   # Safer, slight overhead

# Use simple for hot loops, robust for package APIs
```

### Data Backend Selection

```r
# Use data.table when:
# ✓ Very large datasets (>1GB)
# ✓ Complex grouping operations
# ✓ Reference semantics desired
# ✓ Maximum performance is critical

# Use dplyr when:
# ✓ Readability and maintainability are the priority
# ✓ Complex joins and window functions
# ✓ Team familiarity with the tidyverse
# ✓ Moderate-sized data (<100MB)

# Use dtplyr (dplyr syntax, data.table backend) when:
# ✓ Want dplyr syntax with data.table performance
# ✓ Large data but the team prefers the tidyverse
# ✓ Lazy evaluation desired

# Use base R when:
# ✓ No dependencies allowed
# ✓ Simple operations
# ✓ Teaching/learning contexts
```

## Profiling Best Practices

```r
# 1. Profile realistic data sizes
profvis({
  # Use actual data size, not toy examples
  real_data |> your_analysis()
})

# 2. Profile multiple runs for stability
bench::mark(
  your_function(data),
  min_iterations = 10,   # Multiple runs
  max_iterations = 100
)

# 3. Check memory usage too
bench::mark(
  approach1 = method1(data),
  approach2 = method2(data),
  check = FALSE,      # If outputs differ slightly
  filter_gc = FALSE   # Include GC time
)

# 4. Profile with realistic usage patterns
# Not just isolated function calls
```
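The decision matrix above also lists `system.time()` and `Rprof()`, which never appear in the examples. A minimal sketch of both for quick checks and base-R-only environments; `slow_task()` is a hypothetical stand-in for your own code:

```r
# Hypothetical CPU-bound stand-in for your own slow code
slow_task <- function(n = 5e6) {
  x <- stats::rnorm(n)
  sum(sqrt(abs(x)))
}

# Quick wall-clock check of a single expression
system.time(slow_task())   # user / system / elapsed seconds

# Base R sampling profiler for environments without profvis
prof_file <- tempfile()
Rprof(prof_file)                   # start writing stack samples to a file
replicate(10, slow_task())
Rprof(NULL)                        # stop profiling
summaryRprof(prof_file)$by.self    # time attributed to each function
```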
## Performance Anti-Patterns to Avoid

```r
# Don't optimize without measuring
# ✗ "This looks slow" -> immediately rewrite
# ✓ Profile first, optimize bottlenecks

# Don't over-engineer for performance
# ✗ Complex optimizations for 1% gains
# ✓ Focus on algorithmic improvements

# Don't assume - measure
# ✗ "for loops are always slow in R"
# ✓ Benchmark your specific use case

# Don't ignore readability costs
# ✗ Unreadable code for minor speedups
# ✓ Readable code with targeted optimizations

# Don't grow objects in loops
# ✗ result <- c(); for (i in 1:n) result <- c(result, x[i])
# ✓ result <- vector("list", n); for (i in 1:n) result[[i]] <- x[i]
```

## Modern purrr Patterns for Performance

Use modern purrr 1.0+ patterns:

```r
library(purrr)
library(ggplot2)

# Modern data frame row binding (purrr 1.0+)
models <- data_splits |>
  map(\(split) train_model(split)) |>
  list_rbind()   # Replaces map_dfr()

# Column binding
summaries <- data_list |>
  map(\(df) get_summary_stats(df)) |>
  list_cbind()   # Replaces map_dfc()

# Side effects with walk2() - used for the files it writes, not a return value
walk2(data_list, plot_names, \(df, name) {
  p <- ggplot(df, aes(x, y)) + geom_point()
  ggsave(name, p)
})

# Parallel processing (purrr 1.1.0+)
library(mirai)
daemons(4)
results <- large_datasets |>
  map(in_parallel(expensive_computation))
daemons(0)
```

## Vectorization

```r
# Good - vectorized operations
result <- x + y

# Good - type-stable purrr functions
map_dbl(data, mean)    # always returns double
map_chr(data, class)   # always returns character

# Avoid - type-unstable base functions
sapply(data, mean)     # might return list or vector

# Avoid - explicit loops for simple operations
result <- numeric(length(x))
for (i in seq_along(x)) {
  result[i] <- x[i] + y[i]
}
```

## Using dtplyr for Large Data

For large datasets, use dtplyr to get data.table performance with dplyr syntax:

```r
library(dtplyr)
library(dplyr)

# Convert to a lazy data.table
large_data_dt <- lazy_dt(large_data)

# Use dplyr syntax as normal
query <- large_data_dt |>
  filter(year >= 2020) |>
  group_by(category) |>
  summarise(
    total = sum(value),
    avg = mean(value)
  )

# See the generated data.table code
query |> show_query()

# Collect the result back into a tibble
result <- query |> as_tibble()
```

## Memory Optimization

```r
# Pre-allocate vectors
result <- vector("numeric", n)

# Use appropriate data types
# integer instead of double when possible
x <- 1:1000                 # integer
y <- seq(1, 1000, by = 1)   # double

# Remove large objects when done
rm(large_object)
gc()   # Force garbage collection if needed

# Use data.table for large data
library(data.table)
dt <- as.data.table(large_df)
dt[, new_col := old_col * 2]   # Modifies in place
```

## String Manipulation Performance

Use stringr over base R for consistency and performance:

```r
library(stringr)

# Good - stringr (consistent, pipe-friendly)
text |>
  str_to_lower() |>
  str_trim() |>
  str_replace_all("pattern", "replacement") |>
  str_extract("\\d+")

# Common patterns
str_detect(text, "pattern")       # vs grepl("pattern", text)
str_extract(text, "pattern")      # vs convoluted regmatches() calls
str_replace_all(text, "a", "b")   # vs gsub("a", "b", text)
str_split(text, ",")              # vs strsplit(text, ",")
str_length(text)                  # vs nchar(text)
str_sub(text, 1, 5)               # vs substr(text, 1, 5)
```

## When to Use vctrs

### Core Benefits

- **Type stability** - Predictable output types regardless of input values (see the sketch below)
- **Size stability** - Predictable output sizes from input sizes
- **Consistent coercion rules** - Single set of rules applied everywhere
- **Robust class design** - Proper S3 vector infrastructure
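A small illustration of the type-stability and coercion points, assuming the vctrs package is installed (the values are arbitrary):

```r
library(vctrs)

# Base R silently promotes to the widest type it can find
c(FALSE, "a")            # becomes c("FALSE", "a")

# vctrs applies one documented set of coercion rules
vec_c(1L, 2.5)           # double, as expected
try(vec_c(FALSE, "a"))   # error: logical and character have no common type
```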
### Use vctrs when:

```r
# Type-Stable Functions in Packages
my_function <- function(x, y) {
  result <- x + y
  # Always returns double, regardless of input values
  vec_cast(result, double())
}

# Consistent Coercion/Casting
vec_cast(x, double())       # Clear intent, predictable behavior
vec_ptype_common(x, y, z)   # Finds the richest compatible type

# Size/Length Stability
vec_c(x, y)           # size = vec_size(x) + vec_size(y)
vec_rbind(df1, df2)   # size = sum of input sizes
```

### Don't use vctrs when:

- **Simple one-off analyses** - Base R is sufficient
- **No custom classes needed** - Standard types work fine
- **Performance-critical, simple operations** - Base R may be faster (see the benchmark sketch below)
- **External API constraints** - Must return base R types

The key insight: **vctrs is most valuable in package development, where type safety, consistency, and extensibility matter more than raw speed for simple operations.**
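To check that speed trade-off on your own machine, a minimal benchmark sketch (the inputs are arbitrary):

```r
library(bench)
library(vctrs)

x <- runif(10)
y <- runif(10)

# For tiny inputs, base c() usually wins on raw speed;
# vec_c() buys stricter, more predictable coercion instead.
bench::mark(
  base  = c(x, y),
  vctrs = vec_c(x, y)
)
```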