Performance Optimization

Performance Tool Selection Guide

Profiling Tools Decision Matrix

| Tool | Use When | Don't Use When | What It Shows |
| --- | --- | --- | --- |
| profvis | Complex code, unknown bottlenecks | Simple functions, known issues | Time per line, call stack |
| bench::mark() | Comparing alternatives | Single approach | Relative performance, memory |
| system.time() | Quick checks | Detailed analysis | Total runtime only |
| Rprof() | Base-R-only environments | When profvis is available | Raw profiling data |
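
For the quick-check rows of the table, base R alone is enough. A minimal sketch, reusing the placeholder real_data and your_analysis() names that appear in the profiling section below:

# Quick check - total runtime only
system.time({
  result <- real_data |> your_analysis()
})

# Base-R-only profiling when profvis is unavailable
Rprof("profile.out")
result <- real_data |> your_analysis()
Rprof(NULL)
summaryRprof("profile.out")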

Step-by-Step Performance Workflow

# 1. Profile first - find the actual bottlenecks
library(profvis)
profvis({
  # Your slow code here
})

# 2. Focus on the slowest parts (80/20 rule)
# Don't optimize until you know where time is spent

# 3. Benchmark alternatives for hot spots
library(bench)
bench::mark(
  current = current_approach(data),
  vectorized = vectorized_approach(data),
  parallel = map(data, in_parallel(func))
)

# 4. Consider tool trade-offs based on bottleneck type

When Each Tool Helps vs Hurts

Parallel Processing (in_parallel())

# Helps when:
# - CPU-intensive computations
# - Embarrassingly parallel problems
# - Large datasets with independent operations
# - I/O bound operations (file reading, API calls)

# Hurts when:
# - Simple, fast operations (overhead > benefit)
# - Memory-intensive operations (may cause thrashing)
# - Operations requiring shared state
# - Small datasets

# Example decision point:
expensive_func <- function(x) Sys.sleep(0.1) # 100ms per call
fast_func <- function(x) x^2                 # microseconds per call

# Good for parallel
map(1:100, in_parallel(expensive_func))  # ~10s -> ~2.5s on 4 cores

# Bad for parallel (overhead > benefit)  
map(1:100, in_parallel(fast_func))       # 100μs -> 50ms (500x slower!)
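
Note that in_parallel() runs on mirai daemons, which must be started before the calls above will parallelize, and (as of purrr 1.1.0) the functions it wraps generally need to be self-contained, since they run on the workers rather than in the calling session. A minimal sketch of the setup assumed above; the same daemons() pattern appears in the purrr section below:

library(purrr)
library(mirai)

daemons(4)   # start 4 background workers
results <- map(1:100, in_parallel(\(x) {
  Sys.sleep(0.1)  # stand-in for an expensive computation
  x^2
}))
daemons(0)   # shut the workers down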

vctrs Backend Tools

# Use vctrs when:
# - Type safety matters more than raw speed
# - Building reusable package functions
# - Complex coercion/combination logic
# - Consistent behavior across edge cases

# Avoid vctrs when:
# - One-off scripts where speed matters most
# - Simple operations where base R is sufficient
# - Memory is extremely constrained

# Decision point:
simple_combine <- function(x, y) c(x, y)           # Fast, simple
robust_combine <- function(x, y) vec_c(x, y)      # Safer, slight overhead

# Use simple for hot loops, robust for package APIs
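
To see what "safer" buys you, compare how the two handle mixed types; a small illustration (vctrs attached for vec_c()):

library(vctrs)

c(1, "2")      # base R silently coerces everything to character: "1" "2"
vec_c(1, "2")  # vctrs errors: can't combine <double> and <character>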

Data Backend Selection

# Use data.table when:
# - Very large datasets (>1GB)
# - Complex grouping operations
# - Reference semantics desired
# - Maximum performance is critical

# Use dplyr when:
# - Readability and maintainability are the priority
# - Complex joins and window functions
# - Team familiarity with the tidyverse
# - Moderately sized data (<100MB)

# Use dtplyr (dplyr with data.table backend) when:
# - You want dplyr syntax with data.table performance
# - Large data but the team prefers the tidyverse
# - Lazy evaluation desired

# Use base R when:
# - No dependencies allowed
# - Simple operations
# - Teaching/learning contexts
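
The same grouped summary in each backend, assuming a data frame named sales with columns region and amount (hypothetical names):

library(dplyr)
library(data.table)

# dplyr - readable, fine for moderate-sized data
sales |>
  group_by(region) |>
  summarise(total = sum(amount))

# data.table - large data, reference semantics
sales_dt <- as.data.table(sales)
sales_dt[, .(total = sum(amount)), by = region]

# base R - no dependencies
aggregate(amount ~ region, data = sales, FUN = sum)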

Profiling Best Practices

# 1. Profile realistic data sizes
profvis({
  # Use actual data size, not toy examples
  real_data |> your_analysis()
})

# 2. Profile multiple runs for stability
bench::mark(
  your_function(data),
  min_iterations = 10,  # Multiple runs
  max_iterations = 100
)

# 3. Check memory usage too
bench::mark(
  approach1 = method1(data), 
  approach2 = method2(data),
  check = FALSE,  # If outputs differ slightly
  filter_gc = FALSE  # Include GC time
)

# 4. Profile with realistic usage patterns
# Not just isolated function calls
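
For point 4, profile the pipeline end to end rather than one function at a time; a sketch with hypothetical read_raw(), clean_data(), and fit_model() steps:

profvis({
  raw     <- read_raw("data/input.csv")  # hypothetical I/O step
  cleaned <- clean_data(raw)             # hypothetical transformation
  model   <- fit_model(cleaned)          # hypothetical modelling step
})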

Performance Anti-Patterns to Avoid

# Don't optimize without measuring
# ✗ "This looks slow" -> immediately rewrite
# ✓ Profile first, optimize bottlenecks

# Don't over-engineer for performance  
# ✗ Complex optimizations for 1% gains
# ✓ Focus on algorithmic improvements

# Don't assume - measure
# ✗ "for loops are always slow in R"
# ✓ Benchmark your specific use case

# Don't ignore readability costs
# ✗ Unreadable code for minor speedups
# ✓ Readable code with targeted optimizations

# Don't grow objects in loops
# ✗ result <- c(); for(i in 1:n) result <- c(result, x[i])
# ✓ result <- vector("list", n); for(i in 1:n) result[[i]] <- x[i]
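
The cost of growing objects is easy to measure directly; a quick benchmark sketch (absolute timings vary, but the gap widens as n grows):

library(bench)

n <- 10000
x <- rnorm(n)

bench::mark(
  grow = {
    result <- c()
    for (i in 1:n) result <- c(result, x[i])
    result
  },
  prealloc = {
    result <- numeric(n)
    for (i in 1:n) result[i] <- x[i]
    result
  },
  vectorized = x[1:n]
)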

Modern purrr Patterns for Performance

Use modern purrr 1.0+ patterns:

# Modern data frame row binding (purrr 1.0+)
models <- data_splits |> 
  map(\(split) train_model(split)) |>
  list_rbind()  # Replaces map_dfr()

# Column binding  
summaries <- data_list |> 
  map(\(df) get_summary_stats(df)) |>
  list_cbind()  # Replaces map_dfc()

# Side effects with walk2() - save one plot per dataset
# walk()/walk2() are called for their side effects and return the input
# invisibly, so there is no need to capture the result
walk2(data_list, plot_names, \(df, name) {
  p <- ggplot(df, aes(x, y)) + geom_point()
  ggsave(name, p)
})

# Parallel processing (purrr 1.1.0+)
library(mirai)
daemons(4)
results <- large_datasets |> 
  map(in_parallel(expensive_computation))
daemons(0)

Vectorization

# Good - vectorized operations
result <- x + y

# Good - Type-stable purrr functions
map_dbl(data, mean)    # always returns double
map_chr(data, class)   # always returns character

# Avoid - Type-unstable base functions
sapply(data, mean)     # might return list or vector

# Avoid - explicit loops for simple operations
result <- numeric(length(x))
for(i in seq_along(x)) {
  result[i] <- x[i] + y[i]
}

Using dtplyr for Large Data

For large datasets, use dtplyr to get data.table performance with dplyr syntax:

library(dtplyr)

# Convert to lazy data.table
large_data_dt <- lazy_dt(large_data)

# Use dplyr syntax as normal
result <- large_data_dt |>
  filter(year >= 2020) |>
  group_by(category) |>
  summarise(
    total = sum(value),
    avg = mean(value)
  )

# See the generated data.table code (inspect the lazy pipeline before collecting)
result |> show_query()

# Collect into a tibble when you need the materialised result
result |> as_tibble()

Memory Optimization

# Pre-allocate vectors
result <- vector("numeric", n)

# Use appropriate data types
# integer instead of double when possible
x <- 1:1000  # integer
y <- seq(1, 1000, by = 1)  # double

# Remove large objects when done
rm(large_object)
gc()  # Force garbage collection if needed

# Use data.table for large data
library(data.table)
dt <- as.data.table(large_df)
dt[, new_col := old_col * 2]  # Modifies in place
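
A quick way to see the integer/double difference (sizes are approximate and platform-dependent):

x_int <- sample.int(1e6)    # materialized integer vector
x_dbl <- as.numeric(x_int)  # same values stored as doubles

object.size(x_int)  # ~4 MB: 4 bytes per integer
object.size(x_dbl)  # ~8 MB: 8 bytes per double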

String Manipulation Performance

Use stringr over base R for consistent, pipe-friendly string handling; it is often faster as well, but benchmark your own case (a sketch follows the patterns below):

# Good - stringr (consistent, pipe-friendly)
text |>
  str_to_lower() |>
  str_trim() |>
  str_replace_all("pattern", "replacement") |>
  str_extract("\\d+")

# Common patterns
str_detect(text, "pattern")     # vs grepl("pattern", text)
str_extract(text, "pattern")    # vs complex regmatches()
str_replace_all(text, "a", "b") # vs gsub("a", "b", text)
str_split(text, ",")            # vs strsplit(text, ",")
str_length(text)                # vs nchar(text)
str_sub(text, 1, 5)             # vs substr(text, 1, 5)
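
As the anti-patterns section says, measure rather than assume; a benchmark sketch for one stringr/base pair (results depend on string length, pattern, and locale):

library(bench)
library(stringr)

text <- rep(c("alpha-123", "beta", "gamma-456"), 10000)

bench::mark(
  stringr = str_detect(text, "\\d+"),
  base    = grepl("\\d+", text)
)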

When to Use vctrs

Core Benefits

  • Type stability - Predictable output types regardless of input values
  • Size stability - Predictable output sizes from input sizes
  • Consistent coercion rules - Single set of rules applied everywhere
  • Robust class design - Proper S3 vector infrastructure
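
A small illustration of why a single set of coercion rules matters: base c() dispatches on its first argument, so the result depends on argument order, while vec_c() applies the same rules regardless of order (vctrs attached):

library(vctrs)

c(Sys.Date(), 1)      # two Dates: the number is silently treated as a date
c(1, Sys.Date())      # two doubles: the Date silently loses its class
vec_c(Sys.Date(), 1)  # errors: <date> and <double> are not compatible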

Use vctrs when:

# Type-Stable Functions in Packages
library(vctrs)

my_function <- function(x, y) {
  result <- x + y  # placeholder computation
  # Always returns double, regardless of input values
  vec_cast(result, double())
}

# Consistent Coercion/Casting
vec_cast(x, double())  # Clear intent, predictable behavior
vec_ptype_common(x, y, z)  # Finds richest compatible type

# Size/Length Stability
vec_c(x, y)  # size = vec_size(x) + vec_size(y)
vec_rbind(df1, df2)  # size = sum of input sizes

Don't Use vctrs When:

  • Simple one-off analyses - Base R is sufficient
  • No custom classes needed - Standard types work fine
  • Performance critical + simple operations - Base R may be faster
  • External API constraints - Must return base R types

The key insight: vctrs is most valuable in package development where type safety, consistency, and extensibility matter more than raw speed for simple operations.