Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 18:15:04 +08:00
commit ec0d1b5905
19 changed files with 5696 additions and 0 deletions

@@ -0,0 +1,214 @@
---
name: r-development
description: Modern R development practices emphasizing tidyverse patterns (dplyr 1.1 and later, native pipe, join_by, .by grouping), rlang metaprogramming, performance optimization, and package development. Use when Claude needs to write R code, create R packages, optimize R performance, or provide R programming guidance.
---
# R Development
This skill provides comprehensive guidance for modern R development, emphasizing current best practices with tidyverse, performance optimization, and professional package development.
## Core Principles
1. **Use modern tidyverse patterns** - Prioritize dplyr 1.1+ features, native pipe, and current APIs
2. **Profile before optimizing** - Use profvis and bench to identify real bottlenecks
3. **Write readable code first** - Optimize only when necessary and after profiling
4. **Follow tidyverse style guide** - Consistent naming, spacing, and structure
## Modern Tidyverse Essentials
### Native Pipe (`|>` not `%>%`)
Always use native pipe `|>` instead of magrittr `%>%` (R 4.1+):
```r
# Modern
data |>
  filter(year >= 2020) |>
  summarise(mean_value = mean(value))
# Avoid legacy pipe
data %>% filter(year >= 2020)
```
### Join Syntax (dplyr 1.1+)
Use `join_by()` for all joins:
```r
# Modern join syntax with equality
transactions |>
  inner_join(companies, by = join_by(company == id))
# Inequality joins
transactions |>
  inner_join(companies, join_by(company == id, year >= since))
# Rolling joins (closest match)
transactions |>
  inner_join(companies, join_by(company == id, closest(year >= since)))
```
Control match behavior:
```r
# Expect 1:1 matches (dplyr 1.1.1+; `multiple = "error"` is deprecated)
inner_join(x, y, by = join_by(id), relationship = "one-to-one")
# Ensure all rows match
inner_join(x, y, by = join_by(id), unmatched = "error")
```
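The checks above can be tried on toy data. The following is a minimal sketch, assuming dplyr 1.1.1 or later (where the `relationship` argument is available); the `orders`, `customers`, and `orders_bad` tibbles are made up for illustration.

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration
orders <- tibble(order_id = 1:3, customer = c(10, 10, 20))
customers <- tibble(id = c(10, 20), name = c("Acme", "Globex"))

# Each order may match at most one customer
joined <- orders |>
  inner_join(
    customers,
    by = join_by(customer == id),
    relationship = "many-to-one"
  )

# A key with no partner raises an error instead of silently dropping rows
orders_bad <- tibble(order_id = 4, customer = 30)
res <- try(
  inner_join(orders_bad, customers, by = join_by(customer == id), unmatched = "error"),
  silent = TRUE
)
inherits(res, "try-error")
```

Making the expectation explicit turns a silent row-count change into an immediate, descriptive error.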
### Per-Operation Grouping with `.by`
Use `.by` instead of `group_by() |> ... |> ungroup()`:
```r
# Modern approach (always returns ungrouped)
data |>
  summarise(mean_value = mean(value), .by = category)
# Multiple grouping variables
data |>
  summarise(total = sum(revenue), .by = c(company, year))
```
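As a rough illustration of the difference, the sketch below runs both styles on a made-up `sales` tibble (the table and its column names are assumptions, not part of the original guide) and checks whether the result is still grouped.

```r
library(dplyr)

# Hypothetical data for illustration
sales <- tibble(
  company = c("a", "a", "b"),
  year = c(2020, 2021, 2020),
  revenue = c(10, 20, 5)
)

# .by groups for this one call only
by_result <- sales |>
  summarise(total = sum(revenue), .by = company)

# Grouping by two variables with group_by() leaves one level of grouping behind
grouped_result <- sales |>
  group_by(company, year) |>
  summarise(total = sum(revenue))

is_grouped_df(by_result)
is_grouped_df(grouped_result)
```

Note that `.by` keeps groups in order of first appearance, whereas `group_by()` sorts them by key.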
### Column Operations
Use modern column selection and transformation functions:
```r
# pick() for column selection in data-masking contexts
data |>
  summarise(
    n_x_cols = ncol(pick(starts_with("x"))),
    n_y_cols = ncol(pick(starts_with("y")))
  )
# across() for applying functions to multiple columns
data |>
  summarise(across(where(is.numeric), mean, .names = "mean_{.col}"), .by = group)
# reframe() for multi-row results per group
data |>
  reframe(quantiles = quantile(x, c(0.25, 0.5, 0.75)), .by = group)
```
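To make the `reframe()` case concrete, here is a small sketch on a made-up `scores` tibble (a hypothetical name). It returns three rows per group, which `summarise()` is not designed to do.

```r
library(dplyr)

# Hypothetical data for illustration
scores <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 10, 20, 30)
)

# One row per requested quantile, per group
quartiles <- scores |>
  reframe(
    prob = c(0.25, 0.5, 0.75),
    value = quantile(x, prob),
    .by = group
  )
```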
## rlang Metaprogramming
For comprehensive rlang patterns, see [references/rlang-patterns.md](references/rlang-patterns.md).
### Quick Reference
- **`{{}}`** - Forward function arguments to data-masking functions
- **`!!`** - Inject single expressions or values
- **`!!!`** - Inject multiple arguments from a list
- **`.data[[]]`** - Access columns by name (character vectors)
- **`pick()`** - Select columns inside data-masking functions
Example function with embracing:
```r
my_summary <- function(data, group_var, summary_var) {
  data |>
    summarise(mean_val = mean({{ summary_var }}), .by = {{ group_var }})
}
```
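A short usage sketch for the function above, using the built-in `mtcars` data. It assumes dplyr 1.1.0 or later is attached, and shows that the embraced `.by` argument accepts either one column or a `c()` selection.

```r
library(dplyr)

my_summary <- function(data, group_var, summary_var) {
  data |>
    summarise(mean_val = mean({{ summary_var }}), .by = {{ group_var }})
}

# Bare column names are forwarded as written by the caller
by_cyl <- my_summary(mtcars, cyl, mpg)

# .by uses tidy selection, so several columns can be passed with c()
by_cyl_am <- my_summary(mtcars, c(cyl, am), mpg)
```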
## Performance Optimization
For detailed performance guidance, see [references/performance.md](references/performance.md).
### Key Strategies
1. **Profile first**: Use `profvis::profvis()` and `bench::mark()`
2. **Vectorize operations**: Avoid loops when vectorized alternatives exist
3. **Use dtplyr**: For large data operations (lazy evaluation with data.table backend)
4. **Parallel processing**: Use `furrr::future_map()` for parallelizable work
5. **Memory efficiency**: Pre-allocate, use appropriate data types
Quick example:
```r
# Profile code
profvis::profvis({
  result <- data |>
    complex_operation() |>
    another_operation()
})
# Benchmark alternatives
bench::mark(
  approach_1 = method1(data),
  approach_2 = method2(data),
  check = FALSE
)
```
## Package Development
For complete package development guidance, see [references/package-development.md](references/package-development.md).
### Quick Guidelines
**API Design:**
- Use `.by` parameter for per-operation grouping
- Use `{{}}` for column arguments
- Return tibbles consistently
- Validate user-facing function inputs thoroughly
**Dependencies:**
- Add dependencies for significant functionality gains
- Core tidyverse packages usually worth including: dplyr, purrr, stringr, tidyr
- Minimize dependencies for widely-used packages
**Testing:**
- Unit tests for individual functions
- Integration tests for workflows
- Test edge cases and error conditions
**Documentation:**
- Document all exported functions
- Provide usage examples
- Explain non-obvious parameter interactions
## Common Migration Patterns
### Base R → Tidyverse
```r
# Data manipulation
subset(data, condition)      -> filter(data, condition)
data[order(data$x), ]        -> arrange(data, x)
aggregate(x ~ y, data, mean) -> summarise(data, mean(x), .by = y)
# Functional programming
sapply(x, f)                 -> map(x, f) # type-stable: always a list (map_dbl() etc. for atomic output)
lapply(x, f)                 -> map(x, f)
# Strings
grepl("pattern", text)       -> str_detect(text, "pattern")
gsub("old", "new", text)     -> str_replace_all(text, "old", "new")
```
### Old → New Tidyverse
```r
# Pipes
%>%                            -> |>
# Grouping
group_by() |> ... |> ungroup() -> summarise(..., .by = x)
# Joins
by = c("a" = "b")              -> by = join_by(a == b)
# Reshaping
gather()/spread()              -> pivot_longer()/pivot_wider()
```
## Additional Resources
- **rlang patterns**: See [references/rlang-patterns.md](references/rlang-patterns.md) for comprehensive data-masking and metaprogramming guidance
- **Performance optimization**: See [references/performance.md](references/performance.md) for profiling, benchmarking, and optimization strategies
- **Package development**: See [references/package-development.md](references/package-development.md) for complete package creation guidance
- **Object systems**: See [references/object-systems.md](references/object-systems.md) for S3, S4, S7, R6, and vctrs guidance

@@ -0,0 +1,310 @@
# Object-Oriented Programming in R
## S7: Modern OOP for New Projects
S7 combines S3 simplicity with S4 structure:
- Formal class definitions with automatic validation
- Compatible with existing S3 code
- Better error messages and discoverability
```r
library(S7)
# S7 class definition
Range <- new_class("Range",
  properties = list(
    start = class_double,
    end = class_double
  ),
  validator = function(self) {
    if (self@end < self@start) {
      "@end must be >= @start"
    }
  }
)
# Usage - constructor and property access
x <- Range(start = 1, end = 10)
x@start # 1
x@end <- 20 # automatic validation
# Methods
inside <- new_generic("inside", "x")
method(inside, Range) <- function(x, y) {
  y >= x@start & y <= x@end
}
```
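To see the class, validator, and generic working together, the following self-contained sketch repeats the definitions above and then exercises them. It assumes the S7 package is installed; the test values are arbitrary.

```r
library(S7)

Range <- new_class("Range",
  properties = list(start = class_double, end = class_double),
  validator = function(self) {
    if (self@end < self@start) "@end must be >= @start"
  }
)

inside <- new_generic("inside", "x")
method(inside, Range) <- function(x, y) {
  y >= x@start & y <= x@end
}

r <- Range(start = 1, end = 10)

# The method is vectorised over y because it only uses >= and <=
hits <- inside(r, c(0, 5, 10, 11))

# Constructing an invalid object fails in the validator
bad <- try(Range(start = 5, end = 1), silent = TRUE)
inherits(bad, "try-error")
```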
## OOP System Decision Matrix
### Decision Tree: What Are You Building?
#### 1. Vector-like Objects
**Use vctrs when:**
- ✓ Need data frame integration (columns/rows)
- ✓ Want type-stable vector operations
- ✓ Building factor-like, date-like, or numeric-like classes
- ✓ Need consistent coercion/casting behavior
- ✓ Working with existing tidyverse infrastructure
**Examples:** custom date classes, units, categorical data
```r
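library(vctrs)
# Minimal constructor for illustration (full version under "vctrs for Vector Classes")
percent <- function(x = double()) new_vctr(x, class = "percentage")
# Vector-like behavior in data frames
data.frame(x = 1:3, pct = percent(c(0.1, 0.2, 0.3))) # works seamlessly
# Type-stable operations
vec_c(percent(0.1), percent(0.2)) # predictable behavior
vec_cast(0.5, percent()) # explicit, safe casting (requires a vec_cast method for the class)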
```
#### 2. General Objects (Complex Data Structures)
**Use S7 when:**
- ✓ NEW projects that need formal classes
- ✓ Want property validation and safe property access (@)
- ✓ Need multiple dispatch (beyond S3's double dispatch)
- ✓ Converting from S3 and want better structure
- ✓ Building class hierarchies with inheritance
- ✓ Want better error messages and discoverability
```r
# Complex validation needs
Range <- new_class("Range",
properties = list(start = class_double, end = class_double),
validator = function(self) {
if (self@end < self@start) "@end must be >= @start"
}
)
# Multiple dispatch needs
method(generic, list(ClassA, ClassB)) <- function(x, y) ...
# Class hierarchies with clear inheritance
Child <- new_class("Child", parent = Parent)
```
**Use S3 when:**
- ✓ Simple classes with minimal structure needs
- ✓ Maximum compatibility and minimal dependencies
- ✓ Quick prototyping or internal classes
- ✓ Contributing to existing S3-based ecosystems
- ✓ Performance is absolutely critical (minimal overhead)
```r
# Simple classes without complex needs
new_simple <- function(x) structure(x, class = "simple")
print.simple <- function(x, ...) cat("Simple:", x)
```
**Use S4 when:**
- ✓ Working in Bioconductor ecosystem
- ✓ Need complex multiple inheritance (S7 doesn't support this)
- ✓ Existing S4 codebase that works well
**Use R6 when:**
- ✓ Need reference semantics (mutable objects)
- ✓ Building stateful objects
- ✓ Coming from OOP languages like Python/Java
- ✓ Need encapsulation and private methods
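Reference semantics are the main practical difference from the other systems. The sketch below uses a hypothetical `Counter` class to show that assigning an R6 object to a second name does not copy it. It assumes the R6 package is installed.

```r
library(R6)

# Hypothetical stateful class for illustration
Counter <- R6Class("Counter",
  public = list(
    count = 0,
    increment = function() {
      self$count <- self$count + 1
      invisible(self) # returning self allows method chaining
    }
  )
)

a <- Counter$new()
b <- a # b refers to the same object; no copy is made
b$increment()$increment()
a$count # 2, because a and b share state
```

Use `a$clone()` when an independent copy is needed.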
## Detailed S7 vs S3 Comparison
| Feature | S3 | S7 | When S7 wins |
|---------|----|----|---------------|
| **Class definition** | Informal (convention) | Formal (`new_class()`) | Need guaranteed structure |
| **Property access** | `$` or `attr()` (unsafe) | `@` (safe, validated) | Property validation matters |
| **Validation** | Manual, inconsistent | Built-in validators | Data integrity important |
| **Method discovery** | Hard to find methods | Clear method printing | Developer experience matters |
| **Multiple dispatch** | Limited (base generics) | Full multiple dispatch | Complex method dispatch needed |
| **Inheritance** | Informal, `NextMethod()` | Explicit `super()` | Predictable inheritance needed |
| **Migration cost** | - | Low (1-2 hours) | Want better structure |
| **Performance** | Fastest | ~Same as S3 | Performance difference negligible |
| **Compatibility** | Full S3 | Full S3 + S7 | Need both old and new patterns |
## vctrs for Vector Classes
### Basic Vector Class
```r
# Constructor (low-level)
new_percent <- function(x = double()) {
vec_assert(x, double())
new_vctr(x, class = "pkg_percent")
}
# Helper (user-facing)
percent <- function(x = double()) {
x <- vec_cast(x, double())
new_percent(x)
}
# Format method
format.pkg_percent <- function(x, ...) {
paste0(vec_data(x) * 100, "%")
}
```
### Coercion Methods
```r
# Self-coercion
vec_ptype2.pkg_percent.pkg_percent <- function(x, y, ...) {
new_percent()
}
# With double
vec_ptype2.pkg_percent.double <- function(x, y, ...) double()
vec_ptype2.double.pkg_percent <- function(x, y, ...) double()
# Casting
vec_cast.pkg_percent.double <- function(x, to, ...) {
new_percent(x)
}
vec_cast.double.pkg_percent <- function(x, to, ...) {
vec_data(x)
}
```
## S3 Basics
### Creating S3 Classes
```r
# Constructor
new_myclass <- function(x, y) {
structure(
list(x = x, y = y),
class = "myclass"
)
}
# Methods
print.myclass <- function(x, ...) {
cat("myclass object\n")
cat("x:", x$x, "\n")
cat("y:", x$y, "\n")
}
summary.myclass <- function(object, ...) {
list(x = object$x, y = object$y)
}
```
### Generic Functions
```r
# Create generic
my_generic <- function(x, ...) {
UseMethod("my_generic")
}
# Default method
my_generic.default <- function(x, ...) {
stop("No method for class ", class(x))
}
# Specific method
my_generic.myclass <- function(x, ...) {
# Implementation
}
```
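As a usage sketch of the same pattern, the example below defines a hypothetical `area()` generic with two methods and a default that fails clearly. All names are invented for illustration and only base R is used.

```r
# Generic and methods
area <- function(shape, ...) UseMethod("area")
area.default <- function(shape, ...) {
  stop("No area() method for class ", class(shape)[1], call. = FALSE)
}
area.circle <- function(shape, ...) pi * shape$r^2
area.square <- function(shape, ...) shape$side^2

# Low-level constructors
new_circle <- function(r) structure(list(r = r), class = "circle")
new_square <- function(side) structure(list(side = side), class = "square")

# Dispatch picks the method from the class attribute
areas <- vapply(list(new_circle(1), new_square(2)), area, numeric(1))

# Unknown classes fall through to the default method
unknown <- try(area(structure(list(), class = "triangle")), silent = TRUE)
```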
## R6 Classes
### Basic R6 Class
```r
library(R6)
MyClass <- R6Class("MyClass",
  public = list(
    x = NULL,
    y = NULL,
    initialize = function(x, y) {
      self$x <- x
      self$y <- y
    },
    add = function() {
      self$x + self$y
    }
  ),
  private = list(
    internal_value = NULL
  )
)
# Usage
obj <- MyClass$new(1, 2)
obj$add() # 3
```
## Migration Strategy
### S3 → S7
Usually 1-2 hours of work for a small class, and the result stays compatible with existing S3 code:
```r
# S3 version
new_range <- function(start, end) {
structure(
list(start = start, end = end),
class = "range"
)
}
# S7 version
Range <- new_class("Range",
properties = list(
start = class_double,
end = class_double
)
)
```
### S4 → S7
More complex; evaluate whether the S4 features are actually needed before migrating.
### Base R → vctrs
For vector-like classes, vctrs brings significant benefits in type stability and data frame integration.
### Combining Approaches
S7 classes can use vctrs principles internally for vector-like properties.
## When to Use Each System
### Use S7 for:
- New projects needing formal OOP
- Class validation and type safety
- Multiple dispatch
- Better developer experience
### Use vctrs for:
- Vector-like classes
- Data frame columns
- Type-stable operations
- Tidyverse integration
### Use S3 for:
- Simple classes
- Maximum compatibility
- Existing S3 ecosystems
- Quick prototypes
### Use S4 for:
- Bioconductor packages
- Complex multiple inheritance
- Existing S4 codebases
### Use R6 for:
- Mutable state
- Reference semantics
- Encapsulation needs
- Coming from OOP languages

@@ -0,0 +1,393 @@
# Package Development
## Dependency Strategy
### When to Add Dependencies vs Base R
```r
# Add dependency when:
# - Significant functionality gain
# - Maintenance burden reduction
# - User experience improvement
# - Complex implementation (regex, dates, web)
# Use base R when:
# - Simple utility functions
# - Package will be widely used (minimize deps)
# - Dependency is large for small benefit
# - Base R solution is straightforward
# Example decisions:
str_detect(x, "pattern") # Worth stringr dependency
length(x) > 0 # Don't need purrr for this
ymd(x) # Worth lubridate dependency
x + 1 # Don't need dplyr for this
```
### Tidyverse Dependency Guidelines
```r
# Core tidyverse (usually worth it):
dplyr # Complex data manipulation
purrr # Functional programming, parallel
stringr # String manipulation
tidyr # Data reshaping
# Specialized tidyverse (evaluate carefully):
lubridate # If heavy date manipulation
forcats # If many categorical operations
readr # If specific file reading needs
ggplot2 # If package creates visualizations
# Heavy dependencies (use sparingly):
tidyverse # Meta-package, very heavy
shiny # Only for interactive apps
```
## API Design Patterns
### Function Design Strategy
```r
# Modern tidyverse API patterns
# 1. Use .by for per-operation grouping
my_summarise <- function(.data, ..., .by = NULL) {
# Support modern grouped operations
}
# 2. Use {{ }} for user-provided columns
my_select <- function(.data, cols) {
.data |> select({{ cols }})
}
# 3. Use ... for flexible arguments
my_mutate <- function(.data, ..., .by = NULL) {
.data |> mutate(..., .by = {{ .by }})
}
# 4. Return consistent types (tibbles, not data.frames)
my_function <- function(.data) {
  result <- .data # placeholder for the real computation
  result |> tibble::as_tibble()
}
```
### Input Validation Strategy
```r
# Validation level by function type:
# User-facing functions - comprehensive validation
user_function <- function(x, threshold = 0.5) {
# Check all inputs thoroughly
if (!is.numeric(x)) stop("x must be numeric")
if (!is.numeric(threshold) || length(threshold) != 1) {
stop("threshold must be a single number")
}
# ... function body
}
# Internal functions - minimal validation
.internal_function <- function(x, threshold) {
# Assume inputs are valid (document assumptions)
# Only check critical invariants
# ... function body
}
# Package functions with vctrs - type-stable validation
safe_function <- function(x, y) {
x <- vec_cast(x, double())
y <- vec_cast(y, double())
# Automatic type checking and coercion
}
```
## Error Handling Patterns
```r
# Good error messages - specific and actionable
if (length(x) == 0) {
  # Bullets must be passed as one named character vector
  cli::cli_abort(c(
    "Input {.arg x} cannot be empty.",
    "i" = "Provide a non-empty vector."
  ))
}
# Include function name in errors
validate_input <- function(x, call = rlang::caller_env()) {
  if (!is.numeric(x)) {
    cli::cli_abort("Input must be numeric", call = call)
  }
}
# Use consistent error styling
# cli package for user-friendly messages
# rlang for developer tools
```
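These pieces can be combined into a reusable checker. The sketch below is one possible arrangement, not a prescribed API: `check_non_empty()`, `my_mean()`, and the error class `mypkg_error_empty_input` are hypothetical names. It assumes cli and rlang are installed.

```r
library(cli)

# Hypothetical input checker; reports the caller's argument name and call
check_non_empty <- function(x,
                            arg = rlang::caller_arg(x),
                            call = rlang::caller_env()) {
  if (length(x) == 0) {
    cli_abort(
      c(
        "{.arg {arg}} cannot be empty.",
        "i" = "Provide a vector with at least one element."
      ),
      class = "mypkg_error_empty_input", # hypothetical class name
      call = call
    )
  }
  invisible(x)
}

my_mean <- function(values) {
  check_non_empty(values)
  mean(values)
}

# A classed error can be caught selectively by callers and by tests
err <- tryCatch(
  my_mean(numeric()),
  mypkg_error_empty_input = function(e) e
)
```

Giving the condition a class also lets tests use `expect_error(..., class = "mypkg_error_empty_input")` rather than matching on message text.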
## When to Create Internal vs Exported Functions
### Export Function When:
```r
# - Users will call it directly
# - Other packages might want to extend it
# - Part of the core package functionality
# - Stable API that won't change often
# Example: main data processing functions
export_these <- function(.data, ...) {
# Comprehensive input validation
# Full documentation required
# Stable API contract
}
```
### Keep Function Internal When:
```r
# - Implementation detail that may change
# - Only used within package
# - Complex implementation helpers
# - Would clutter user-facing API
# Example: helper functions
.internal_helper <- function(x, y) {
# Minimal documentation
# Can change without breaking users
# Assume inputs are pre-validated
}
```
## Testing and Documentation Strategy
### Testing Levels
```r
# Unit tests - individual functions
test_that("function handles edge cases", {
expect_equal(my_func(c()), expected_empty_result)
expect_error(my_func(NULL), class = "my_error_class")
})
# Integration tests - workflow combinations
test_that("pipeline works end-to-end", {
result <- data |>
step1() |>
step2() |>
step3()
expect_s3_class(result, "expected_class")
})
# Property-based tests for package functions
test_that("function properties hold", {
# Test invariants across many inputs
})
```
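The property-based test above is left as a stub. One way to fill it in, without extra packages, is to loop over randomly generated inputs and assert an invariant. This is only a sketch: `rescale01()` is a hypothetical function under test, and the seed and input ranges are arbitrary choices.

```r
library(testthat)

# Hypothetical function under test: rescale a vector to [0, 1]
rescale01 <- function(x) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}

test_that("rescale01() output stays within [0, 1] and keeps its length", {
  set.seed(1) # reproducible random inputs
  for (i in 1:50) {
    x <- rnorm(sample(2:20, 1), sd = runif(1, 0.1, 100))
    out <- rescale01(x)
    expect_true(all(out >= 0 & out <= 1))
    expect_equal(length(out), length(x))
  }
})
```

Dedicated packages exist for generating inputs, but a seeded loop like this is often enough to catch edge cases that hand-picked examples miss.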
### Testing rlang Functions
```r
# Test data-masking behavior
test_that("function supports data masking", {
result <- my_function(mtcars, cyl)
expect_equal(names(result), "mean_cyl")
# Test with expressions
result2 <- my_function(mtcars, cyl * 2)
expect_true("mean_cyl * 2" %in% names(result2))
})
# Test injection behavior
test_that("function supports injection", {
var <- "cyl"
result <- my_function(mtcars, !!sym(var))
expect_true(nrow(result) > 0)
})
```
### Documentation Priorities
```r
# Must document:
# - All exported functions
# - Complex algorithms or formulas
# - Non-obvious parameter interactions
# - Examples of typical usage
# Can skip documentation:
# - Simple internal helpers
# - Obvious parameter meanings
# - Functions that just call other functions
```
### Documentation Tags for rlang
```r
#' @param var <[`data-masked`][dplyr::dplyr_data_masking]> Column to summarize
#' @param ... <[`dynamic-dots`][rlang::dyn-dots]> Additional grouping variables
#' @param cols <[`tidy-select`][dplyr::dplyr_tidy_select]> Columns to select
```
## Package Structure
### DESCRIPTION File
```r
Package: mypackage
Title: What the Package Does (One Line, Title Case)
Version: 0.1.0
Authors@R: person("First", "Last", email = "email@example.com", role = c("aut", "cre"))
Description: What the package does (one paragraph).
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
Imports:
    dplyr (>= 1.1.0),
    rlang (>= 1.1.0),
    cli
Suggests:
    testthat (>= 3.0.0)
Config/testthat/edition: 3
```
### NAMESPACE Management
Use roxygen2 for NAMESPACE management:
```r
# Import specific functions
#' @importFrom rlang := enquo enquos
#' @importFrom dplyr mutate filter
# Or import entire packages (use sparingly)
#' @import dplyr
```
### rlang Import Strategy
```r
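# In DESCRIPTION:
# Imports: rlang
# In NAMESPACE (generated by roxygen2), import specific functions:
# importFrom(rlang, ":=", enquo, enquos, expr)
# `!!` and `!!!` are handled by tidy evaluation and need no import
# Generate the directive with a roxygen2 tag:
#' @importFrom rlang := enquo enquos expr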
```
## Naming Conventions
```r
# Good naming: snake_case for variables/functions
calculate_mean_score <- function(data, score_col) {
# Function body
}
# Prefix non-standard arguments with .
my_function <- function(.data, ...) {
# Reduces argument conflicts
}
# Internal functions start with .
.internal_helper <- function(x, y) {
# Not exported
}
```
## Style Guide Essentials
### Object Names
- Use snake_case for all names
- Variable names = nouns, function names = verbs
- Avoid dots except for S3 methods
```r
# Good
day_one
calculate_mean
user_data
# Avoid
DayOne
calculate.mean
userData
```
### Spacing and Layout
```r
# Good spacing
x[, 1]
mean(x, na.rm = TRUE)
if (condition) {
action()
}
# Pipe formatting
data |>
filter(year >= 2020) |>
group_by(category) |>
summarise(
mean_value = mean(value),
count = n()
)
```
## Package Development Workflow
1. **Setup**: Use `usethis::create_package()`
2. **Add functions**: Place in `R/` directory
3. **Document**: Use roxygen2 comments
4. **Test**: Write tests in `tests/testthat/`
5. **Check**: Run `devtools::check()`
6. **Build**: Use `devtools::build()`
7. **Install**: Use `devtools::install()`
### Key usethis Functions
```r
# Initial setup
usethis::create_package("mypackage")
usethis::use_git()
usethis::use_mit_license()
# Add dependencies
usethis::use_package("dplyr")
usethis::use_package("testthat", "Suggests")
# Add infrastructure
usethis::use_readme_md()
usethis::use_news_md()
usethis::use_testthat()
# Add files
usethis::use_r("my_function")
usethis::use_test("my_function")
usethis::use_vignette("introduction")
```
## Common Pitfalls
### What to Avoid
```r
# Don't use library() in packages
# Use Imports in DESCRIPTION instead
# Don't use source()
# Use proper function dependencies
# Don't use attach()
# Always use explicit :: notation
# Don't modify global options without restoring
old <- options(stringsAsFactors = FALSE)
on.exit(options(old), add = TRUE)
# Don't use setwd()
# Use here::here() or relative paths
```
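The options pattern only works inside a function, because `on.exit()` runs when the enclosing function returns. A minimal base R sketch follows; `format_rounded()` is a hypothetical helper.

```r
# Hypothetical helper that temporarily changes a global option
format_rounded <- function(x) {
  old <- options(digits = 3) # options() returns the previous values
  on.exit(options(old), add = TRUE) # restored even if an error occurs
  format(x)
}

before <- getOption("digits")
out <- format_rounded(pi)
after <- getOption("digits")

identical(before, after) # the caller's setting is untouched
```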

@@ -0,0 +1,311 @@
# Performance Optimization
## Performance Tool Selection Guide
### Profiling Tools Decision Matrix
| Tool | Use When | Don't Use When | What It Shows |
|------|----------|----------------|---------------|
| **`profvis`** | Complex code, unknown bottlenecks | Simple functions, known issues | Time per line, call stack |
| **`bench::mark()`** | Comparing alternatives | Single approach | Relative performance, memory |
| **`system.time()`** | Quick checks | Detailed analysis | Total runtime only |
| **`Rprof()`** | Base R only environments | When profvis available | Raw profiling data |
### Step-by-Step Performance Workflow
```r
# 1. Profile first - find the actual bottlenecks
library(profvis)
profvis({
# Your slow code here
})
# 2. Focus on the slowest parts (80/20 rule)
# Don't optimize until you know where time is spent
# 3. Benchmark alternatives for hot spots
library(bench)
bench::mark(
current = current_approach(data),
vectorized = vectorized_approach(data),
parallel = map(data, in_parallel(\(x) func(x), func = func)) # needs mirai::daemons() set up
)
# 4. Consider tool trade-offs based on bottleneck type
```
## When Each Tool Helps vs Hurts
### Parallel Processing (`in_parallel()`)
```r
# Helps when:
# - CPU-intensive computations
# - Embarrassingly parallel problems
# - Large datasets with independent operations
# - I/O bound operations (file reading, API calls)
# Hurts when:
# - Simple, fast operations (overhead > benefit)
# - Memory-intensive operations (may cause thrashing)
# - Operations requiring shared state
# - Small datasets
# Example decision point:
expensive_func <- function(x) Sys.sleep(0.1) # 100ms per call
fast_func <- function(x) x^2 # microseconds per call
# in_parallel() expects a self-contained function; pass helpers by name
# Good for parallel
map(1:100, in_parallel(\(x) expensive_func(x), expensive_func = expensive_func)) # ~10s -> ~2.5s on 4 cores
# Bad for parallel (overhead > benefit)
map(1:100, in_parallel(\(x) fast_func(x), fast_func = fast_func)) # 100μs -> 50ms (500x slower!)
```
### vctrs Backend Tools
```r
# Use vctrs when:
# - Type safety matters more than raw speed
# - Building reusable package functions
# - Complex coercion/combination logic
# - Consistent behavior across edge cases
# Avoid vctrs when:
# - One-off scripts where speed matters most
# - Simple operations where base R is sufficient
# - Memory is extremely constrained
# Decision point:
simple_combine <- function(x, y) c(x, y) # Fast, simple
robust_combine <- function(x, y) vec_c(x, y) # Safer, slight overhead
# Use simple for hot loops, robust for package APIs
```
### Data Backend Selection
```r
# Use data.table when:
# - Very large datasets (>1GB)
# - Complex grouping operations
# - Reference semantics desired
# - Maximum performance critical
# Use dplyr when:
# - Readability and maintainability priority
# - Complex joins and window functions
# - Team familiarity with tidyverse
# - Moderate sized data (<100MB)
# Use dtplyr (dplyr with data.table backend) when:
# - Want dplyr syntax with data.table performance
# - Large data but team prefers tidyverse
# - Lazy evaluation desired
# Use base R when:
# - No dependencies allowed
# - Simple operations
# - Teaching/learning contexts
```
## Profiling Best Practices
```r
# 1. Profile realistic data sizes
profvis({
# Use actual data size, not toy examples
real_data |> your_analysis()
})
# 2. Profile multiple runs for stability
bench::mark(
your_function(data),
min_iterations = 10, # Multiple runs
max_iterations = 100
)
# 3. Check memory usage too
bench::mark(
approach1 = method1(data),
approach2 = method2(data),
check = FALSE, # If outputs differ slightly
filter_gc = FALSE # Include GC time
)
# 4. Profile with realistic usage patterns
# Not just isolated function calls
```
## Performance Anti-Patterns to Avoid
```r
# Don't optimize without measuring
# ✗ "This looks slow" -> immediately rewrite
# ✓ Profile first, optimize bottlenecks
# Don't over-engineer for performance
# ✗ Complex optimizations for 1% gains
# ✓ Focus on algorithmic improvements
# Don't assume - measure
# ✗ "for loops are always slow in R"
# ✓ Benchmark your specific use case
# Don't ignore readability costs
# ✗ Unreadable code for minor speedups
# ✓ Readable code with targeted optimizations
# Don't grow objects in loops
# ✗ result <- c(); for(i in 1:n) result <- c(result, x[i])
# ✓ result <- vector("list", n); for(i in 1:n) result[[i]] <- x[i]
```
## Modern purrr Patterns for Performance
Use modern purrr 1.0+ patterns:
```r
# Modern data frame row binding (purrr 1.0+)
models <- data_splits |>
map(\(split) train_model(split)) |>
list_rbind() # Replaces map_dfr()
# Column binding
summaries <- data_list |>
map(\(df) get_summary_stats(df)) |>
list_cbind() # Replaces map_dfc()
# Side effects with walk()
# walk2() returns its first input invisibly, so there is nothing useful to assign
walk2(data_list, plot_names, \(df, name) {
  p <- ggplot(df, aes(x, y)) + geom_point()
  ggsave(name, p)
})
# Parallel processing (purrr 1.1.0+)
library(mirai)
daemons(4)
results <- large_datasets |>
  map(in_parallel(
    \(x) expensive_computation(x),
    expensive_computation = expensive_computation
  ))
daemons(0)
```
## Vectorization
```r
# Good - vectorized operations
result <- x + y
# Good - Type-stable purrr functions
map_dbl(data, mean) # always returns double
map_chr(data, class) # always returns character
# Avoid - Type-unstable base functions
sapply(data, mean) # might return list or vector
# Avoid - explicit loops for simple operations
result <- numeric(length(x))
for(i in seq_along(x)) {
result[i] <- x[i] + y[i]
}
```
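The type instability of `sapply()` is easiest to see with an empty input. The base R sketch below uses arbitrary example lists and shows the return type changing with the data, while `vapply()` (the base counterpart of `map_dbl()`) keeps it fixed.

```r
full <- list(a = 1:3, b = 4:6)
empty <- list()

# Same call, different return types depending on the input
full_result <- sapply(full, mean) # numeric vector
empty_result <- sapply(empty, mean) # empty list, not numeric(0)

# vapply() declares the expected type up front
stable_result <- vapply(empty, mean, numeric(1))

is.list(empty_result)
is.numeric(stable_result)
```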
## Using dtplyr for Large Data
For large datasets, use dtplyr to get data.table performance with dplyr syntax:
```r
library(dplyr)
library(dtplyr)
# Convert to lazy data.table
large_data_dt <- lazy_dt(large_data)
# Use dplyr syntax as normal; nothing is computed yet
query <- large_data_dt |>
  filter(year >= 2020) |>
  group_by(category) |>
  summarise(
    total = sum(value),
    avg = mean(value)
  )
# See generated data.table code (call this on the lazy query, not the tibble)
query |> show_query()
# Compute and convert back to tibble
result <- query |> as_tibble()
```
## Memory Optimization
```r
# Pre-allocate vectors
result <- vector("numeric", n)
# Use appropriate data types
# integer instead of double when possible
x <- 1:1000 # integer
y <- seq(1, 1000, by = 1) # double
# Remove large objects when done
rm(large_object)
gc() # Force garbage collection if needed
# Use data.table for large data
library(data.table)
dt <- as.data.table(large_df)
dt[, new_col := old_col * 2] # Modifies in place
```
## String Manipulation Performance
Prefer stringr over base R for a consistent, pipe-friendly interface; benchmark before assuming it is faster for your workload:
```r
# Good - stringr (consistent, pipe-friendly)
text |>
str_to_lower() |>
str_trim() |>
str_replace_all("pattern", "replacement") |>
str_extract("\\d+")
# Common patterns
str_detect(text, "pattern") # vs grepl("pattern", text)
str_extract(text, "pattern") # vs complex regmatches()
str_replace_all(text, "a", "b") # vs gsub("a", "b", text)
str_split(text, ",") # vs strsplit(text, ",")
str_length(text) # vs nchar(text)
str_sub(text, 1, 5) # vs substr(text, 1, 5)
```
## When to Use vctrs
### Core Benefits
- **Type stability** - Predictable output types regardless of input values
- **Size stability** - Predictable output sizes from input sizes
- **Consistent coercion rules** - Single set of rules applied everywhere
- **Robust class design** - Proper S3 vector infrastructure
### Use vctrs when:
```r
# Type-Stable Functions in Packages
my_function <- function(x, y) {
  result <- x + y # placeholder computation
  # Always returns double, regardless of input values
  vec_cast(result, double())
}
# Consistent Coercion/Casting
vec_cast(x, double()) # Clear intent, predictable behavior
vec_ptype_common(x, y, z) # Finds richest compatible type
# Size/Length Stability
vec_c(x, y) # size = vec_size(x) + vec_size(y)
vec_rbind(df1, df2) # size = sum of input sizes
```
### Don't Use vctrs When:
- Simple one-off analyses - Base R is sufficient
- No custom classes needed - Standard types work fine
- Performance critical + simple operations - Base R may be faster
- External API constraints - Must return base R types
The key insight: **vctrs is most valuable in package development where type safety, consistency, and extensibility matter more than raw speed for simple operations.**
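A small sketch of the trade-off, assuming the vctrs package is installed: base `c()` silently coerces mixed inputs, while `vec_c()` refuses incompatible types and still combines compatible ones.

```r
library(vctrs)

# Base c() coerces silently to the most general type
base_result <- c(1, "a")
typeof(base_result) # "character"

# vec_c() raises an error instead of guessing
vctrs_result <- try(vec_c(1, "a"), silent = TRUE)
inherits(vctrs_result, "try-error")

# Compatible types still combine under vctrs' coercion rules
mixed_numeric <- vec_c(1L, 2.5)
typeof(mixed_numeric) # "double"
```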

@@ -0,0 +1,247 @@
# rlang Patterns for Data-Masking
## Core Concepts
**Data-masking** allows R expressions to refer to data frame columns as if they were variables in the environment. rlang provides the metaprogramming framework that powers tidyverse data-masking.
### Key rlang Tools
- **Embracing `{{}}`** - Forward function arguments to data-masking functions
- **Injection `!!`** - Inject single expressions or values
- **Splicing `!!!`** - Inject multiple arguments from a list
- **Dynamic dots** - Programmable `...` with injection support
- **Pronouns `.data`/`.env`** - Explicit disambiguation between data and environment variables
## Function Argument Patterns
### Forwarding with `{{}}`
Use `{{}}` to forward function arguments to data-masking functions:
```r
# Single argument forwarding
my_summarise <- function(data, var) {
data |> dplyr::summarise(mean = mean({{ var }}))
}
# Works with any data-masking expression
mtcars |> my_summarise(cyl)
mtcars |> my_summarise(cyl * am)
mtcars |> my_summarise(.data$cyl) # pronoun syntax supported
```
### Forwarding `...`
No special syntax needed for dots forwarding:
```r
# Simple dots forwarding
my_group_by <- function(.data, ...) {
.data |> dplyr::group_by(...)
}
# Works with tidy selections too
my_select <- function(.data, ...) {
.data |> dplyr::select(...)
}
# For single-argument tidy selections, wrap in c()
my_pivot_longer <- function(.data, ...) {
.data |> tidyr::pivot_longer(c(...))
}
```
### Names Patterns with `.data`
Use `.data` pronoun for programmatic column access:
```r
# Single column by name
my_mean <- function(data, var) {
data |> dplyr::summarise(mean = mean(.data[[var]]))
}
# Usage - completely insulated from data-masking
mtcars |> my_mean("cyl") # No ambiguity, works like regular function
# Multiple columns with all_of()
my_select_vars <- function(data, vars) {
data |> dplyr::select(all_of(vars))
}
mtcars |> my_select_vars(c("cyl", "am"))
```
## Injection Operators
### When to Use Each Operator
| Operator | Use Case | Example |
|----------|----------|---------|
| `{{ }}` | Forward function arguments | `summarise(mean = mean({{ var }}))` |
| `!!` | Inject single expression/value | `summarise(mean = mean(!!sym(var)))` |
| `!!!` | Inject multiple arguments | `group_by(!!!syms(vars))` |
| `.data[[]]` | Access columns by name | `mean(.data[[var]])` |
### Advanced Injection with `!!`
```r
# Create symbols from strings
var <- "cyl"
mtcars |> dplyr::summarise(mean = mean(!!sym(var)))
# Inject values to avoid name collisions
df <- data.frame(x = 1:3)
x <- 100
df |> dplyr::mutate(scaled = x / !!x) # Uses both data and env x
# Use data_sym() for tidyeval contexts (more robust)
mtcars |> dplyr::summarise(mean = mean(!!data_sym(var)))
```
### Splicing with `!!!`
```r
# Multiple symbols from character vector
vars <- c("cyl", "am")
mtcars |> dplyr::group_by(!!!syms(vars))
# Or use data_syms() for tidy contexts
mtcars |> dplyr::group_by(!!!data_syms(vars))
# Splice lists of arguments
args <- list(na.rm = TRUE, trim = 0.1)
mtcars |> dplyr::summarise(mean = mean(cyl, !!!args))
```
## Dynamic Dots Patterns
### Using `list2()` for Dynamic Dots Support
```r
my_function <- function(...) {
# Collect with list2() instead of list() for dynamic features
dots <- list2(...)
# Process dots...
}
# Enables these features:
my_function(a = 1, b = 2) # Normal usage
my_function(!!!list(a = 1, b = 2)) # Splice a list
my_function("{name}" := value) # Name injection
my_function(a = 1, ) # Trailing commas OK
```
### Name Injection with Glue Syntax
```r
# Basic name injection
name <- "result"
list2("{name}" := 1) # Creates list(result = 1)
# In function arguments with {{
my_mean <- function(data, var) {
data |> dplyr::summarise("mean_{{ var }}" := mean({{ var }}))
}
mtcars |> my_mean(cyl) # Creates column "mean_cyl"
mtcars |> my_mean(cyl * am) # Creates column "mean_cyl * am"
# Allow custom names with englue()
my_mean <- function(data, var, name = englue("mean_{{ var }}")) {
data |> dplyr::summarise("{name}" := mean({{ var }}))
}
# User can override default
mtcars |> my_mean(cyl, name = "cylinder_mean")
```
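The following runnable sketch exercises the `englue()` version on `mtcars`, assuming dplyr and rlang (1.0 or later) are attached, and checks the resulting column names.

```r
library(dplyr)
library(rlang)

my_mean <- function(data, var, name = englue("mean_{{ var }}")) {
  data |> summarise("{name}" := mean({{ var }}))
}

# Default name is derived from the expression the caller typed
default_name <- mtcars |> my_mean(cyl)

# The caller can override it with a plain string
custom_name <- mtcars |> my_mean(cyl, name = "cylinder_mean")

names(default_name)
names(custom_name)
```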
## Pronouns for Disambiguation
### `.data` and `.env` Best Practices
```r
# Explicit disambiguation prevents masking issues
cyl <- 1000 # Environment variable
mtcars |> dplyr::summarise(
data_cyl = mean(.data$cyl), # Data frame column
env_cyl = mean(.env$cyl), # Environment variable
ambiguous = mean(cyl) # Data column wins when both exist; the environment is used only as a fallback
)
# Use in loops and programmatic contexts
vars <- c("cyl", "am")
for (var in vars) {
result <- mtcars |> dplyr::summarise(mean = mean(.data[[var]]))
print(result)
}
```
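The pronouns matter most when a function argument shares its name with a column. In the sketch below, `keep_heavy()` is a hypothetical helper whose `wt` argument collides with the `wt` column of `mtcars`. Without the pronouns, the comparison would silently become column against column.

```r
library(dplyr)

# Hypothetical helper: `wt` is both an argument and a column name
keep_heavy <- function(data, wt) {
  data |> filter(.data$wt > .env$wt)
}

# Without pronouns, both sides resolve to the column and no row passes
keep_heavy_buggy <- function(data, wt) {
  data |> filter(wt > wt)
}

heavy <- keep_heavy(mtcars, 5)
none <- keep_heavy_buggy(mtcars, 5)
```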
## Programming Patterns
### Bridge Patterns
Converting between data-masking and tidy selection behaviors:
```r
# across() as selection-to-data-mask bridge
my_group_by <- function(data, vars) {
data |> dplyr::group_by(across({{ vars }}))
}
# Works with tidy selection
mtcars |> my_group_by(starts_with("c"))
# across(all_of()) as names-to-data-mask bridge
my_group_by <- function(data, vars) {
data |> dplyr::group_by(across(all_of(vars)))
}
mtcars |> my_group_by(c("cyl", "am"))
```
### Transformation Patterns
```r
# Transform single arguments by wrapping
my_mean <- function(data, var) {
data |> dplyr::summarise(mean = mean({{ var }}, na.rm = TRUE))
}
# Transform dots with across()
my_means <- function(data, ...) {
data |> dplyr::summarise(across(c(...), ~ mean(.x, na.rm = TRUE)))
}
# Manual transformation (advanced)
my_means_manual <- function(.data, ...) {
vars <- enquos(..., .named = TRUE)
vars <- purrr::map(vars, ~ expr(mean(!!.x, na.rm = TRUE)))
.data |> dplyr::summarise(!!!vars)
}
```
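A usage sketch for the manual transformation, assuming dplyr, rlang, and purrr are installed. It repeats the definition with an R 4.1 lambda in place of the formula, then checks that each captured argument becomes one summary column named after it.

```r
library(dplyr)
library(rlang)

my_means_manual <- function(.data, ...) {
  # Capture each argument as a named quosure
  vars <- enquos(..., .named = TRUE)
  # Wrap every captured expression in mean(..., na.rm = TRUE)
  vars <- purrr::map(vars, \(q) expr(mean(!!q, na.rm = TRUE)))
  # Splice the rewritten expressions back into summarise()
  .data |> summarise(!!!vars)
}

res <- mtcars |> my_means_manual(cyl, mpg)
names(res)
```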
## Common Patterns Summary
### When to Use What
**Use `{{}}` when:**
- Forwarding user-provided column references
- Building wrapper functions around dplyr/tidyr
- Need to support both bare names and expressions
**Use `.data[[]]` when:**
- Working with character vector column names
- Iterating over column names programmatically
- Need complete insulation from data-masking
**Use `!!` when:**
- Need to inject computed expressions
- Converting strings to symbols with `sym()`
- Avoiding variable name collisions
**Use `!!!` when:**
- Injecting multiple arguments from a list
- Working with variable numbers of columns
- Splicing named arguments