Initial commit

2025-11-29 18:15:04 +08:00
commit ec0d1b5905
19 changed files with 5696 additions and 0 deletions
--- a/skills/r-development/SKILL.md
+++ b/skills/r-development/SKILL.md
@@ -0,0 +1,214 @@
+---
+name: r-development
+description: Modern R development practices emphasizing tidyverse patterns (dplyr 1.1 and later, native pipe, join_by, .by grouping), rlang metaprogramming, performance optimization, and package development. Use when Claude needs to write R code, create R packages, optimize R performance, or provide R programming guidance.
+---
+
+# R Development
+
+This skill provides comprehensive guidance for modern R development, emphasizing current best practices with tidyverse, performance optimization, and professional package development.
+
+## Core Principles
+
+1. **Use modern tidyverse patterns** - Prioritize dplyr 1.1+ features, native pipe, and current APIs
+2. **Profile before optimizing** - Use profvis and bench to identify real bottlenecks
+3. **Write readable code first** - Optimize only when necessary and after profiling
+4. **Follow tidyverse style guide** - Consistent naming, spacing, and structure
+
+## Modern Tidyverse Essentials
+
+### Native Pipe (`|>` not `%>%`)
+
+Always use native pipe `|>` instead of magrittr `%>%` (R 4.1+):
+
+```r
+# Modern
+data |> 
+  filter(year >= 2020) |>
+  summarise(mean_value = mean(value))
+
+# Avoid legacy pipe
+data %>% filter(year >= 2020)
+```
+
+### Join Syntax (dplyr 1.1+)
+
+Use `join_by()` for all joins:
+
+```r
+# Modern join syntax with equality
+transactions |> 
+  inner_join(companies, by = join_by(company == id))
+
+# Inequality joins
+transactions |>
+  inner_join(companies, join_by(company == id, year >= since))
+
+# Rolling joins (closest match)
+transactions |>
+  inner_join(companies, join_by(company == id, closest(year >= since)))
+```
+
+Control match behavior:
+
+```r
+# Expect 1:1 matches (dplyr 1.1.1+; `multiple = "error"` is deprecated)
+inner_join(x, y, by = join_by(id), relationship = "one-to-one")
+
+# Ensure all rows match
+inner_join(x, y, by = join_by(id), unmatched = "error")
+```
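+
+As a minimal sketch of how the rolling join above resolves matches, the toy tables below are hypothetical (the `ceo` and `revenue` columns and all values are invented for illustration). For each transaction, `closest(year >= since)` keeps only the most recent company record whose `since` is not after `year`:
+
+```r
+library(dplyr)
+
+# Hypothetical data: company "A" has two records with different start years
+companies <- tibble(
+  id    = c("A", "A", "B"),
+  since = c(2000, 2015, 2010),
+  ceo   = c("Ann", "Bo", "Cy")
+)
+transactions <- tibble(
+  company = c("A", "A", "B"),
+  year    = c(2010, 2020, 2012),
+  revenue = c(10, 20, 30)
+)
+
+# Each transaction is matched to the latest company record with since <= year
+rolled <- transactions |>
+  inner_join(companies, by = join_by(company == id, closest(year >= since)))
+
+rolled$ceo
+```
+
+Without `closest()`, the 2020 transaction for company "A" would match both of that company's records, which is the usual reason to reach for a rolling join.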
+
+### Per-Operation Grouping with `.by`
+
+Use `.by` instead of `group_by() |> ... |> ungroup()`:
+
+```r
+# Modern approach (always returns ungrouped)
+data |>
+  summarise(mean_value = mean(value), .by = category)
+
+# Multiple grouping variables
+data |>
+  summarise(total = sum(revenue), .by = c(company, year))
+```
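+
+The difference from `group_by()` can be checked directly. This is a small sketch on the built-in `mtcars` data (the variable names `by_result` and `gb_result` are illustrative only): `.by` returns an ungrouped result, while `group_by() |> summarise()` with several grouping columns leaves the result grouped by all but the last one.
+
+```r
+library(dplyr)
+
+# Per-operation grouping: the result carries no grouping
+by_result <- mtcars |>
+  summarise(mean_mpg = mean(mpg), .by = c(cyl, am))
+
+# Persistent grouping: summarise() peels off only the last group (am)
+gb_result <- mtcars |>
+  group_by(cyl, am) |>
+  summarise(mean_mpg = mean(mpg), .groups = "drop_last")
+
+is_grouped_df(by_result)
+is_grouped_df(gb_result)
+group_vars(gb_result)
+```
+
+Note also that `.by` keeps groups in order of first appearance, whereas `group_by()` sorts them by the grouping keys.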
+
+### Column Operations
+
+Use modern column selection and transformation functions:
+
+```r
+# pick() for column selection in data-masking contexts
+data |>
+  summarise(
+    n_x_cols = ncol(pick(starts_with("x"))),
+    n_y_cols = ncol(pick(starts_with("y")))
+  )
+
+# across() for applying functions to multiple columns
+data |>
+  summarise(across(where(is.numeric), mean, .names = "mean_{.col}"), .by = group)
+
+# reframe() for multi-row results per group
+data |>
+  reframe(quantiles = quantile(x, c(0.25, 0.5, 0.75)), .by = group)
+```
+
+## rlang Metaprogramming
+
+For comprehensive rlang patterns, see [references/rlang-patterns.md](references/rlang-patterns.md).
+
+### Quick Reference
+
+- **`{{}}`** - Forward function arguments to data-masking functions
+- **`!!`** - Inject single expressions or values
+- **`!!!`** - Inject multiple arguments from a list
+- **`.data[[]]`** - Access columns by name (character vectors)
+- **`pick()`** - Select columns inside data-masking functions
+
+Example function with embracing:
+
+```r
+my_summary <- function(data, group_var, summary_var) {
+  data |>
+    summarise(mean_val = mean({{ summary_var }}), .by = {{ group_var }})
+}
+```
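+
+A short usage sketch for the function above, run on the built-in `mtcars` data. The string-based variant `my_summary_chr()` is a hypothetical helper added here for illustration; it shows the `.data[[]]` entry from the quick reference for callers who hold column names as character vectors.
+
+```r
+library(dplyr)
+
+my_summary <- function(data, group_var, summary_var) {
+  data |>
+    summarise(mean_val = mean({{ summary_var }}), .by = {{ group_var }})
+}
+
+# Bare column names are forwarded through {{ }}
+by_cyl <- my_summary(mtcars, cyl, mpg)
+
+# .by is a tidy-select argument, so several columns can be passed with c()
+by_cyl_am <- my_summary(mtcars, c(cyl, am), mpg)
+
+# Hypothetical variant taking column names as strings
+my_summary_chr <- function(data, group_name, summary_name) {
+  data |>
+    summarise(mean_val = mean(.data[[summary_name]]), .by = all_of(group_name))
+}
+by_cyl_chr <- my_summary_chr(mtcars, "cyl", "mpg")
+```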
+
+## Performance Optimization
+
+For detailed performance guidance, see [references/performance.md](references/performance.md).
+
+### Key Strategies
+
+1. **Profile first**: Use `profvis::profvis()` and `bench::mark()`
+2. **Vectorize operations**: Avoid loops when vectorized alternatives exist
+3. **Use dtplyr**: For large data operations (lazy evaluation with data.table backend)
+4. **Parallel processing**: Use `furrr::future_map()` for parallelizable work
+5. **Memory efficiency**: Pre-allocate, use appropriate data types
+
+Quick example:
+
+```r
+# Profile code
+profvis::profvis({
+  result <- data |> 
+    complex_operation() |>
+    another_operation()
+})
+
+# Benchmark alternatives
+bench::mark(
+  approach_1 = method1(data),
+  approach_2 = method2(data),
+  check = FALSE
+)
+```
+
+## Package Development
+
+For complete package development guidance, see [references/package-development.md](references/package-development.md).
+
+### Quick Guidelines
+
+**API Design:**
+- Use `.by` parameter for per-operation grouping
+- Use `{{}}` for column arguments
+- Return tibbles consistently
+- Validate user-facing function inputs thoroughly
+
+**Dependencies:**
+- Add dependencies for significant functionality gains
+- Core tidyverse packages usually worth including: dplyr, purrr, stringr, tidyr
+- Minimize dependencies for widely-used packages
+
+**Testing:**
+- Unit tests for individual functions
+- Integration tests for workflows
+- Test edge cases and error conditions
+
+**Documentation:**
+- Document all exported functions
+- Provide usage examples
+- Explain non-obvious parameter interactions
+
+## Common Migration Patterns
+
+### Base R → Tidyverse
+
+```r
+# Data manipulation
+subset(data, condition)         → filter(data, condition)
+data[order(data$x), ]          → arrange(data, x)
+aggregate(x ~ y, data, mean)   → summarise(data, mean(x), .by = y)
+
+# Functional programming
+sapply(x, f)                   → map_dbl(x, f)  # type-stable; pick map_chr(), map_lgl(), ... by output type

+lapply(x, f)                   → map(x, f)
+
+# Strings
+grepl("pattern", text)         → str_detect(text, "pattern")
+gsub("old", "new", text)       → str_replace_all(text, "old", "new")
+```
+
+### Old → New Tidyverse
+
+```r
+# Pipes
+%>%                            → |>
+
+# Grouping
+group_by() |> ... |> ungroup() → summarise(..., .by = x)
+
+# Joins
+by = c("a" = "b")             → by = join_by(a == b)
+
+# Reshaping
+gather()/spread()              → pivot_longer()/pivot_wider()
+```
+
+## Additional Resources
+
+- **rlang patterns**: See [references/rlang-patterns.md](references/rlang-patterns.md) for comprehensive data-masking and metaprogramming guidance
+- **Performance optimization**: See [references/performance.md](references/performance.md) for profiling, benchmarking, and optimization strategies
+- **Package development**: See [references/package-development.md](references/package-development.md) for complete package creation guidance
+- **Object systems**: See [references/object-systems.md](references/object-systems.md) for S3, S4, S7, R6, and vctrs guidance
--- a/skills/r-development/references/object-systems.md
+++ b/skills/r-development/references/object-systems.md
@@ -0,0 +1,310 @@
+# Object-Oriented Programming in R
+
+## S7: Modern OOP for New Projects
+
+S7 combines S3 simplicity with S4 structure:
+- Formal class definitions with automatic validation
+- Compatible with existing S3 code
+- Better error messages and discoverability
+
+```r
+library(S7)
+
+# S7 class definition
+Range <- new_class("Range",
+  properties = list(
+    start = class_double,
+    end = class_double
+  ),
+  validator = function(self) {
+    if (self@end < self@start) {
+      "@end must be >= @start"
+    }
+  }
+)
+
+# Usage - constructor and property access
+x <- Range(start = 1, end = 10)
+x@start  # 1
+x@end <- 20  # automatic validation
+
+# Methods
+inside <- new_generic("inside", "x")
+method(inside, Range) <- function(x, y) {
+  y >= x@start & y <= x@end
+}
+```
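+
+To see the validator and the generic in action, here is a self-contained sketch that repeats the `Range` definition and then exercises it. The names `hits`, `bad_new`, and `bad_set` are illustrative; the exact wording of the error messages comes from S7 and is not reproduced here.
+
+```r
+library(S7)
+
+Range <- new_class("Range",
+  properties = list(start = class_double, end = class_double),
+  validator = function(self) {
+    if (self@end < self@start) "@end must be >= @start"
+  }
+)
+
+inside <- new_generic("inside", "x")
+method(inside, Range) <- function(x, y) y >= x@start & y <= x@end
+
+r <- Range(start = 1, end = 10)
+hits <- inside(r, c(0, 5, 11))
+
+# Invalid objects are rejected at construction and on property assignment
+bad_new <- tryCatch(
+  Range(start = 10, end = 1),
+  error = function(e) conditionMessage(e)
+)
+bad_set <- tryCatch(
+  {
+    r@end <- 0
+    "no error"
+  },
+  error = function(e) conditionMessage(e)
+)
+```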
+
+## OOP System Decision Matrix
+
+### Decision Tree: What Are You Building?
+
+#### 1. Vector-like Objects
+
+**Use vctrs when:**
+- ✓ Need data frame integration (columns/rows)
+- ✓ Want type-stable vector operations  
+- ✓ Building factor-like, date-like, or numeric-like classes
+- ✓ Need consistent coercion/casting behavior
+- ✓ Working with existing tidyverse infrastructure
+
+**Examples:** custom date classes, units, categorical data
+
+```r
+# Vector-like behavior in data frames
+percent <- function(x = double()) new_vctr(x, class = "percentage")
+data.frame(x = 1:3, pct = percent(c(0.1, 0.2, 0.3)))  # works seamlessly
+
+# Type-stable operations
+vec_c(percent(0.1), percent(0.2))  # predictable behavior
+vec_cast(0.5, percent())          # explicit, safe casting (needs a vec_cast() method for the class)
+```
+
+#### 2. General Objects (Complex Data Structures)
+
+**Use S7 when:**
+- ✓ NEW projects that need formal classes
+- ✓ Want property validation and safe property access (@)
+- ✓ Need multiple dispatch (beyond S3's double dispatch)
+- ✓ Converting from S3 and want better structure
+- ✓ Building class hierarchies with inheritance
+- ✓ Want better error messages and discoverability
+
+```r
+# Complex validation needs
+Range <- new_class("Range",
+  properties = list(start = class_double, end = class_double),
+  validator = function(self) {
+    if (self@end < self@start) "@end must be >= @start"
+  }
+)
+
+# Multiple dispatch needs  
+method(generic, list(ClassA, ClassB)) <- function(x, y) ...
+
+# Class hierarchies with clear inheritance
+Child <- new_class("Child", parent = Parent)
+```
+
+**Use S3 when:**
+- ✓ Simple classes with minimal structure needs
+- ✓ Maximum compatibility and minimal dependencies  
+- ✓ Quick prototyping or internal classes
+- ✓ Contributing to existing S3-based ecosystems
+- ✓ Performance is absolutely critical (minimal overhead)
+
+```r
+# Simple classes without complex needs
+new_simple <- function(x) structure(x, class = "simple")
+print.simple <- function(x, ...) cat("Simple:", x)
+```
+
+**Use S4 when:**
+- ✓ Working in Bioconductor ecosystem
+- ✓ Need complex multiple inheritance (S7 doesn't support this)
+- ✓ Existing S4 codebase that works well
+
+**Use R6 when:**
+- ✓ Need reference semantics (mutable objects)
+- ✓ Building stateful objects
+- ✓ Coming from OOP languages like Python/Java
+- ✓ Need encapsulation and private methods
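+
+Reference semantics are the main practical difference from the other systems. A minimal sketch, using a hypothetical `Counter` class that is not part of the original text: assigning an R6 object to a second name does not copy it, so a change made through one name is visible through the other.
+
+```r
+library(R6)
+
+# Hypothetical class used only to illustrate mutable state
+Counter <- R6Class("Counter",
+  public = list(
+    count = 0,
+    add = function(n = 1) {
+      self$count <- self$count + n
+      invisible(self)  # returning self allows method chaining
+    }
+  )
+)
+
+a <- Counter$new()
+b <- a        # b refers to the same object, not a copy
+b$add(2)
+a$count       # changed through b
+
+# An explicit copy is independent of the original
+d <- a$clone()
+d$add(5)
+```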
+
+## Detailed S7 vs S3 Comparison
+
+| Feature | S3 | S7 | When S7 wins |
+|---------|----|----|---------------|
+| **Class definition** | Informal (convention) | Formal (`new_class()`) | Need guaranteed structure |
+| **Property access** | `$` or `attr()` (unsafe) | `@` (safe, validated) | Property validation matters |
+| **Validation** | Manual, inconsistent | Built-in validators | Data integrity important |
+| **Method discovery** | Hard to find methods | Clear method printing | Developer experience matters |
+| **Multiple dispatch** | Limited (base generics) | Full multiple dispatch | Complex method dispatch needed |
+| **Inheritance** | Informal, `NextMethod()` | Explicit `super()` | Predictable inheritance needed |
+| **Migration cost** | - | Low (1-2 hours) | Want better structure |
+| **Performance** | Fastest | ~Same as S3 | Performance difference negligible |
+| **Compatibility** | Full S3 | Full S3 + S7 | Need both old and new patterns |
+
+## vctrs for Vector Classes
+
+### Basic Vector Class
+
+```r
+# Constructor (low-level)
+new_percent <- function(x = double()) {
+  vec_assert(x, double())
+  new_vctr(x, class = "pkg_percent")
+}
+
+# Helper (user-facing)
+percent <- function(x = double()) {
+  x <- vec_cast(x, double())
+  new_percent(x)
+}
+
+# Format method
+format.pkg_percent <- function(x, ...) {
+  paste0(vec_data(x) * 100, "%")
+}
+```
+
+### Coercion Methods
+
+```r
+# Self-coercion
+vec_ptype2.pkg_percent.pkg_percent <- function(x, y, ...) {
+  new_percent()
+}
+
+# With double
+vec_ptype2.pkg_percent.double <- function(x, y, ...) double()
+vec_ptype2.double.pkg_percent <- function(x, y, ...) double()
+
+# Casting
+vec_cast.pkg_percent.double <- function(x, to, ...) {
+  new_percent(x)
+}
+vec_cast.double.pkg_percent <- function(x, to, ...) {
+  vec_data(x)
+}
+```
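+
+Put together, the constructor and the coercion methods give the following behavior. This is a sketch that assumes the methods are defined at the top level of a script (in a package they would be registered through roxygen2 `@export` tags instead); the same-class `vec_cast()` method is an addition not shown above.
+
+```r
+library(vctrs)
+
+new_percent <- function(x = double()) {
+  stopifnot(is.double(x))
+  new_vctr(x, class = "pkg_percent")
+}
+percent <- function(x = double()) new_percent(vec_cast(x, double()))
+format.pkg_percent <- function(x, ...) paste0(vec_data(x) * 100, "%")
+
+# Common type: percent + percent stays percent, percent + double becomes double
+vec_ptype2.pkg_percent.pkg_percent <- function(x, y, ...) new_percent()
+vec_ptype2.pkg_percent.double <- function(x, y, ...) double()
+vec_ptype2.double.pkg_percent <- function(x, y, ...) double()
+
+# Casts needed to reach those common types
+vec_cast.pkg_percent.pkg_percent <- function(x, to, ...) x
+vec_cast.pkg_percent.double <- function(x, to, ...) new_percent(x)
+vec_cast.double.pkg_percent <- function(x, to, ...) vec_data(x)
+
+same <- vec_c(percent(0.1), percent(0.2))  # still a pkg_percent vector
+mixed <- vec_c(percent(0.1), 0.5)          # falls back to plain double
+```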
+
+## S3 Basics
+
+### Creating S3 Classes
+
+```r
+# Constructor
+new_myclass <- function(x, y) {
+  structure(
+    list(x = x, y = y),
+    class = "myclass"
+  )
+}
+
+# Methods
+print.myclass <- function(x, ...) {
+  cat("myclass object\n")
+  cat("x:", x$x, "\n")
+  cat("y:", x$y, "\n")
+}
+
+summary.myclass <- function(object, ...) {
+  list(x = object$x, y = object$y)
+}
+```
+
+### Generic Functions
+
+```r
+# Create generic
+my_generic <- function(x, ...) {
+  UseMethod("my_generic")
+}
+
+# Default method
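+my_generic.default <- function(x, ...) {
+  # class(x) can have length > 1, so collapse it into one string
+  stop("No method for class ", paste(class(x), collapse = "/"), call. = FALSE)
+}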
+
+# Specific method
+my_generic.myclass <- function(x, ...) {
+  # Implementation
+}
+```
+
+## R6 Classes
+
+### Basic R6 Class
+
+```r
+library(R6)
+
+MyClass <- R6Class("MyClass",
+  public = list(
+    x = NULL,
+    y = NULL,
+    
+    initialize = function(x, y) {
+      self$x <- x
+      self$y <- y
+    },
+    
+    add = function() {
+      self$x + self$y
+    }
+  ),
+  
+  private = list(
+    internal_value = NULL
+  )
+)
+
+# Usage
+obj <- MyClass$new(1, 2)
+obj$add()  # 3
+```
+
+## Migration Strategy
+
+### S3 → S7
+
+Usually 1-2 hours work, keeps full compatibility:
+
+```r
+# S3 version
+new_range <- function(start, end) {
+  structure(
+    list(start = start, end = end),
+    class = "range"
+  )
+}
+
+# S7 version
+Range <- new_class("Range",
+  properties = list(
+    start = class_double,
+    end = class_double
+  )
+)
+```
+
+### S4 → S7
+
+More complex, evaluate if S4 features are actually needed.
+
+### Base R → vctrs
+
+For vector-like classes, significant benefits in type stability and data frame integration.
+
+### Combining Approaches
+
+S7 classes can use vctrs principles internally for vector-like properties.
+
+## When to Use Each System
+
+### Use S7 for:
+- New projects needing formal OOP
+- Class validation and type safety
+- Multiple dispatch
+- Better developer experience
+
+### Use vctrs for:
+- Vector-like classes
+- Data frame columns
+- Type-stable operations
+- Tidyverse integration
+
+### Use S3 for:
+- Simple classes
+- Maximum compatibility
+- Existing S3 ecosystems
+- Quick prototypes
+
+### Use S4 for:
+- Bioconductor packages
+- Complex multiple inheritance
+- Existing S4 codebases
+
+### Use R6 for:
+- Mutable state
+- Reference semantics
+- Encapsulation needs
+- Coming from OOP languages
--- a/skills/r-development/references/package-development.md
+++ b/skills/r-development/references/package-development.md
@@ -0,0 +1,393 @@
+# Package Development
+
+## Dependency Strategy
+
+### When to Add Dependencies vs Base R
+
+```r
+# Add dependency when:
+✓ Significant functionality gain
+✓ Maintenance burden reduction
+✓ User experience improvement
+✓ Complex implementation (regex, dates, web)
+
+# Use base R when:
+✓ Simple utility functions
+✓ Package will be widely used (minimize deps)
+✓ Dependency is large for small benefit
+✓ Base R solution is straightforward
+
+# Example decisions:
+str_detect(x, "pattern")    # Worth stringr dependency
+length(x) > 0              # Don't need purrr for this
+ymd(x)                     # Worth lubridate dependency  
+x + 1                      # Don't need dplyr for this
+```
+
+### Tidyverse Dependency Guidelines
+
+```r
+# Core tidyverse (usually worth it):
+dplyr     # Complex data manipulation
+purrr     # Functional programming, parallel
+stringr   # String manipulation
+tidyr     # Data reshaping
+
+# Specialized tidyverse (evaluate carefully):
+lubridate # If heavy date manipulation
+forcats   # If many categorical operations  
+readr     # If specific file reading needs
+ggplot2   # If package creates visualizations
+
+# Heavy dependencies (use sparingly):
+tidyverse # Meta-package, very heavy
+shiny     # Only for interactive apps
+```
+
+## API Design Patterns
+
+### Function Design Strategy
+
+```r
+# Modern tidyverse API patterns
+
+# 1. Use .by for per-operation grouping
+my_summarise <- function(.data, ..., .by = NULL) {
+  # Support modern grouped operations
+}
+
+# 2. Use {{ }} for user-provided columns  
+my_select <- function(.data, cols) {
+  .data |> select({{ cols }})
+}
+
+# 3. Use ... for flexible arguments
+my_mutate <- function(.data, ..., .by = NULL) {
+  .data |> mutate(..., .by = {{ .by }})
+}
+
+# 4. Return consistent types (tibbles, not data.frames)
+my_function <- function(.data) {
+  .data |> tibble::as_tibble()
+}
+```
+
+### Input Validation Strategy
+
+```r
+# Validation level by function type:
+
+# User-facing functions - comprehensive validation
+user_function <- function(x, threshold = 0.5) {
+  # Check all inputs thoroughly
+  if (!is.numeric(x)) stop("x must be numeric")
+  if (!is.numeric(threshold) || length(threshold) != 1) {
+    stop("threshold must be a single number")
+  }
+  # ... function body
+}
+
+# Internal functions - minimal validation  
+.internal_function <- function(x, threshold) {
+  # Assume inputs are valid (document assumptions)
+  # Only check critical invariants
+  # ... function body
+}
+
+# Package functions with vctrs - type-stable validation
+safe_function <- function(x, y) {
+  x <- vec_cast(x, double())
+  y <- vec_cast(y, double())
+  # Automatic type checking and coercion
+}
+```
+
+## Error Handling Patterns
+
+```r
+# Good error messages - specific and actionable
+if (length(x) == 0) {
+  cli::cli_abort(c(
+    "Input {.arg x} cannot be empty.",
+    "i" = "Provide a non-empty vector."
+  ))
+}
+
+# Report the calling function in errors
+validate_input <- function(x, call = rlang::caller_env()) {
+  if (!is.numeric(x)) {
+    cli::cli_abort("Input must be numeric", call = call)
+  }
+}
+
+# Use consistent error styling
+# cli package for user-friendly messages
+# rlang for developer tools
+```
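+
+The `call` pattern above is easiest to understand with a caller in the picture. In this sketch, `check_numeric()` and `scale_values()` are hypothetical names chosen for illustration; `rlang::caller_arg()` is used so the message names the argument as the user wrote it, and the error is attributed to `scale_values()` rather than to the internal checker.
+
+```r
+# Hypothetical reusable checker for package internals
+check_numeric <- function(x,
+                          arg = rlang::caller_arg(x),
+                          call = rlang::caller_env()) {
+  if (!is.numeric(x)) {
+    cli::cli_abort(
+      c(
+        "{.arg {arg}} must be numeric.",
+        "x" = "You supplied {.cls {class(x)}}."
+      ),
+      call = call
+    )
+  }
+  invisible(x)
+}
+
+# Hypothetical user-facing function that relies on the checker
+scale_values <- function(values) {
+  check_numeric(values)
+  values / max(values)
+}
+
+# Capture the error instead of stopping, to inspect it
+err <- tryCatch(scale_values("a"), error = function(e) e)
+conditionMessage(err)
+```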
+
+## When to Create Internal vs Exported Functions
+
+### Export Function When:
+
+```r
+✓ Users will call it directly
+✓ Other packages might want to extend it
+✓ Part of the core package functionality
+✓ Stable API that won't change often
+
+# Example: main data processing functions
+export_these <- function(.data, ...) {
+  # Comprehensive input validation
+  # Full documentation required
+  # Stable API contract
+}
+```
+
+### Keep Function Internal When:
+
+```r
+✓ Implementation detail that may change
+✓ Only used within package
+✓ Complex implementation helpers
+✓ Would clutter user-facing API
+
+# Example: helper functions
+.internal_helper <- function(x, y) {
+  # Minimal documentation
+  # Can change without breaking users
+  # Assume inputs are pre-validated
+}
+```
+
+## Testing and Documentation Strategy
+
+### Testing Levels
+
+```r
+# Unit tests - individual functions
+test_that("function handles edge cases", {
+  expect_equal(my_func(numeric()), expected_empty_result)
+  expect_error(my_func(NULL), class = "my_error_class")
+})
+
+# Integration tests - workflow combinations  
+test_that("pipeline works end-to-end", {
+  result <- data |> 
+    step1() |> 
+    step2() |>
+    step3()
+  expect_s3_class(result, "expected_class")
+})
+
+# Property-based tests for package functions
+test_that("function properties hold", {
+  # Test invariants across many inputs
+})
+```
+
+### Testing rlang Functions
+
+```r
+# Test data-masking behavior
+test_that("function supports data masking", {
+  result <- my_function(mtcars, cyl)
+  expect_equal(names(result), "mean_cyl")
+  
+  # Test with expressions
+  result2 <- my_function(mtcars, cyl * 2)
+  expect_true("mean_cyl * 2" %in% names(result2))
+})
+
+# Test injection behavior
+test_that("function supports injection", {
+  var <- "cyl"
+  result <- my_function(mtcars, !!sym(var))
+  expect_true(nrow(result) > 0)
+})
+```
+
+### Documentation Priorities
+
+```r
+# Must document:
+✓ All exported functions
+✓ Complex algorithms or formulas
+✓ Non-obvious parameter interactions
+✓ Examples of typical usage
+
+# Can skip documentation:
+✗ Simple internal helpers
+✗ Obvious parameter meanings
+✗ Functions that just call other functions
+```
+
+### Documentation Tags for rlang
+
+```r
+#' @param var <[`data-masking`][rlang::args_data_masking]> Column to summarize
+#' @param ... <[`dynamic-dots`][rlang::dyn-dots]> Additional grouping variables  
+#' @param cols <[`tidy-select`][dplyr::dplyr_tidy_select]> Columns to select
+```
+
+## Package Structure
+
+### DESCRIPTION File
+
+```r
+Package: mypackage
+Title: What the Package Does (One Line, Title Case)
+Version: 0.1.0
+Authors@R: person("First", "Last", email = "email@example.com", role = c("aut", "cre"))
+Description: What the package does (one paragraph).
+License: MIT + file LICENSE
+Encoding: UTF-8
+Roxygen: list(markdown = TRUE)
+RoxygenNote: 7.2.3
+Imports:
+    dplyr (>= 1.1.0),
+    rlang (>= 1.1.0),
+    cli
+Suggests:
+    testthat (>= 3.0.0)
+Config/testthat/edition: 3
+```
+
+### NAMESPACE Management
+
+Use roxygen2 for NAMESPACE management:
+
+```r
+# Import specific functions
+#' @importFrom rlang := enquo enquos
+#' @importFrom dplyr mutate filter
+
+# Or import entire packages (use sparingly)
+#' @import dplyr
+```
+
+### rlang Import Strategy
+
+```r
+# In DESCRIPTION:
+Imports: rlang
+
+# Declare imports with roxygen tags; `!!` and `!!!` need no import:
+#' @importFrom rlang := enquo enquos expr
+
+# roxygen2 then writes the NAMESPACE entries, e.g.:
+importFrom(rlang, ":=")
+importFrom(rlang, enquo)
+```
+
+## Naming Conventions
+
+```r
+# Good naming: snake_case for variables/functions
+calculate_mean_score <- function(data, score_col) {
+  # Function body
+}
+
+# Prefix non-standard arguments with .
+my_function <- function(.data, ...) {
+  # Reduces argument conflicts
+}
+
+# Internal functions start with .
+.internal_helper <- function(x, y) {
+  # Not exported
+}
+```
+
+## Style Guide Essentials
+
+### Object Names
+
+- Use snake_case for all names
+- Variable names = nouns, function names = verbs
+- Avoid dots except for S3 methods
+
+```r
+# Good
+day_one
+calculate_mean  
+user_data
+
+# Avoid
+DayOne
+calculate.mean
+userData
+```
+
+### Spacing and Layout
+
+```r
+# Good spacing
+x[, 1]
+mean(x, na.rm = TRUE)
+if (condition) {
+  action()
+}
+
+# Pipe formatting
+data |>
+  filter(year >= 2020) |>
+  group_by(category) |>
+  summarise(
+    mean_value = mean(value),
+    count = n()
+  )
+```
+
+## Package Development Workflow
+
+1. **Setup**: Use `usethis::create_package()`
+2. **Add functions**: Place in `R/` directory
+3. **Document**: Use roxygen2 comments
+4. **Test**: Write tests in `tests/testthat/`
+5. **Check**: Run `devtools::check()`
+6. **Build**: Use `devtools::build()`
+7. **Install**: Use `devtools::install()`
+
+### Key usethis Functions
+
+```r
+# Initial setup
+usethis::create_package("mypackage")
+usethis::use_git()
+usethis::use_mit_license()
+
+# Add dependencies
+usethis::use_package("dplyr")
+usethis::use_package("testthat", "Suggests")
+
+# Add infrastructure
+usethis::use_readme_md()
+usethis::use_news_md()
+usethis::use_testthat()
+
+# Add files
+usethis::use_r("my_function")
+usethis::use_test("my_function")
+usethis::use_vignette("introduction")
+```
+
+## Common Pitfalls
+
+### What to Avoid
+
+```r
+# Don't use library() in packages
+# Use Imports in DESCRIPTION instead
+
+# Don't use source()
+# Use proper function dependencies
+
+# Don't use attach()
+# Always use explicit :: notation
+
+# Don't modify global options without restoring (inside a function)
+old <- options(digits = 4)
+on.exit(options(old), add = TRUE)
+
+# Don't use setwd()
+# Use here::here() or relative paths
+```
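+
+The option-restoring pattern only works inside a function, because `on.exit()` runs when the enclosing function returns. A minimal base R sketch, with a hypothetical helper name `format_short()`:
+
+```r
+# Hypothetical helper: temporarily lowers print precision, then restores it
+format_short <- function(x) {
+  old <- options(digits = 3)          # options() returns the previous values
+  on.exit(options(old), add = TRUE)   # restored even if format() errors
+  format(x)
+}
+
+before <- getOption("digits")
+out <- format_short(pi)
+after <- getOption("digits")
+
+out
+identical(before, after)
+```
+
+Packages that already depend on withr often use `withr::local_options()` for the same purpose.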
--- a/skills/r-development/references/performance.md
+++ b/skills/r-development/references/performance.md
@@ -0,0 +1,311 @@
+# Performance Optimization
+
+## Performance Tool Selection Guide
+
+### Profiling Tools Decision Matrix
+
+| Tool | Use When | Don't Use When | What It Shows |
+|------|----------|----------------|---------------|
+| **`profvis`** | Complex code, unknown bottlenecks | Simple functions, known issues | Time per line, call stack |
+| **`bench::mark()`** | Comparing alternatives | Single approach | Relative performance, memory |
+| **`system.time()`** | Quick checks | Detailed analysis | Total runtime only |
+| **`Rprof()`** | Base R only environments | When profvis available | Raw profiling data |
+
+### Step-by-Step Performance Workflow
+
+```r
+# 1. Profile first - find the actual bottlenecks
+library(profvis)
+profvis({
+  # Your slow code here
+})
+
+# 2. Focus on the slowest parts (80/20 rule)
+# Don't optimize until you know where time is spent
+
+# 3. Benchmark alternatives for hot spots
+library(bench)
+bench::mark(
+  current = current_approach(data),
+  vectorized = vectorized_approach(data),
+  parallel = map(data, in_parallel(\(x) func(x), func = func))  # needs mirai::daemons() set up
+)
+
+# 4. Consider tool trade-offs based on bottleneck type
+```
+
+## When Each Tool Helps vs Hurts
+
+### Parallel Processing (`in_parallel()`)
+
+```r
+# Helps when:
+✓ CPU-intensive computations
+✓ Embarrassingly parallel problems  
+✓ Large datasets with independent operations
+✓ I/O bound operations (file reading, API calls)
+
+# Hurts when:
+✗ Simple, fast operations (overhead > benefit)
+✗ Memory-intensive operations (may cause thrashing)
+✗ Operations requiring shared state
+✗ Small datasets
+
+# Example decision point:
+expensive_func <- function(x) Sys.sleep(0.1) # 100ms per call
+fast_func <- function(x) x^2                 # microseconds per call
+
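+# Good for parallel (needs mirai::daemons(4); helpers are passed to in_parallel() by name)
+map(1:100, in_parallel(\(x) expensive_func(x), expensive_func = expensive_func))  # ~10s -> ~2.5s on 4 cores
+
+# Bad for parallel (overhead > benefit)  
+map(1:100, in_parallel(\(x) x^2))       # 100μs -> 50ms (500x slower!)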
+```
+
+### vctrs Backend Tools
+
+```r
+# Use vctrs when:
+✓ Type safety matters more than raw speed
+✓ Building reusable package functions
+✓ Complex coercion/combination logic
+✓ Consistent behavior across edge cases
+
+# Avoid vctrs when:
+✗ One-off scripts where speed matters most
+✗ Simple operations where base R is sufficient  
+✗ Memory is extremely constrained
+
+# Decision point:
+simple_combine <- function(x, y) c(x, y)           # Fast, simple
+robust_combine <- function(x, y) vec_c(x, y)      # Safer, slight overhead
+
+# Use simple for hot loops, robust for package APIs
+```
+
+### Data Backend Selection
+
+```r
+# Use data.table when:
+✓ Very large datasets (>1GB)
+✓ Complex grouping operations
+✓ Reference semantics desired
+✓ Maximum performance critical
+
+# Use dplyr when:
+✓ Readability and maintainability priority
+✓ Complex joins and window functions
+✓ Team familiarity with tidyverse
+✓ Moderate sized data (<100MB)
+
+# Use dtplyr (dplyr with data.table backend) when:
+✓ Want dplyr syntax with data.table performance
+✓ Large data but team prefers tidyverse
+✓ Lazy evaluation desired
+
+# Use base R when:
+✓ No dependencies allowed
+✓ Simple operations
+✓ Teaching/learning contexts
+```
+
+## Profiling Best Practices
+
+```r
+# 1. Profile realistic data sizes
+profvis({
+  # Use actual data size, not toy examples
+  real_data |> your_analysis()
+})
+
+# 2. Profile multiple runs for stability
+bench::mark(
+  your_function(data),
+  min_iterations = 10,  # Multiple runs
+  max_iterations = 100
+)
+
+# 3. Check memory usage too
+bench::mark(
+  approach1 = method1(data), 
+  approach2 = method2(data),
+  check = FALSE,  # If outputs differ slightly
+  filter_gc = FALSE  # Include GC time
+)
+
+# 4. Profile with realistic usage patterns
+# Not just isolated function calls
+```
+
+## Performance Anti-Patterns to Avoid
+
+```r
+# Don't optimize without measuring
+# ✗ "This looks slow" -> immediately rewrite
+# ✓ Profile first, optimize bottlenecks
+
+# Don't over-engineer for performance  
+# ✗ Complex optimizations for 1% gains
+# ✓ Focus on algorithmic improvements
+
+# Don't assume - measure
+# ✗ "for loops are always slow in R"
+# ✓ Benchmark your specific use case
+
+# Don't ignore readability costs
+# ✗ Unreadable code for minor speedups
+# ✓ Readable code with targeted optimizations
+
+# Don't grow objects in loops
+# ✗ result <- c(); for (i in seq_len(n)) result <- c(result, x[i])
+# ✓ result <- vector("list", n); for (i in seq_len(n)) result[[i]] <- x[i]
+```
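+
+The last anti-pattern can be measured rather than assumed. The sketch below uses only base R; the function names `grow()`, `prealloc()`, and `vectorized()` are illustrative. All three produce the same result, and the timings (which vary by machine, so none are quoted here) show how the growing version scales worse as `n` increases.
+
+```r
+# Grows the vector on every iteration (repeated copying)
+grow <- function(n) {
+  out <- c()
+  for (i in seq_len(n)) out <- c(out, i^2)
+  out
+}
+
+# Allocates once, then fills in place
+prealloc <- function(n) {
+  out <- numeric(n)
+  for (i in seq_len(n)) out[i] <- i^2
+  out
+}
+
+# No explicit loop at all
+vectorized <- function(n) seq_len(n)^2
+
+n <- 1e4
+same_result <- identical(grow(n), prealloc(n)) &&
+  identical(prealloc(n), vectorized(n))
+
+# Compare elapsed times on your own machine
+timings <- c(
+  grow = system.time(grow(n))[["elapsed"]],
+  prealloc = system.time(prealloc(n))[["elapsed"]],
+  vectorized = system.time(vectorized(n))[["elapsed"]]
+)
+```
+
+For a more careful comparison, `bench::mark()` runs each expression repeatedly and also reports memory allocation.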
+
+## Modern purrr Patterns for Performance
+
+Use modern purrr 1.0+ patterns:
+
+```r
+# Modern data frame row binding (purrr 1.0+)
+models <- data_splits |> 
+  map(\(split) train_model(split)) |>
+  list_rbind()  # Replaces map_dfr()
+
+# Column binding  
+summaries <- data_list |> 
+  map(\(df) get_summary_stats(df)) |>
+  list_cbind()  # Replaces map_dfc()
+
+# Side effects with walk() (returns its input invisibly, so no assignment is needed)
+walk2(data_list, plot_names, \(df, name) {
+  p <- ggplot(df, aes(x, y)) + geom_point()
+  ggsave(name, p)
+})
+
+# Parallel processing (purrr 1.1.0+)
+library(mirai)
+daemons(4)
+results <- large_datasets |> 
+  map(in_parallel(
+    \(x) expensive_computation(x),
+    expensive_computation = expensive_computation  # helpers must be passed explicitly
+  ))
+daemons(0)
+```
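+
+As a concrete instance of the `map() |> list_rbind()` pattern, the sketch below summarises the built-in `mtcars` data per cylinder group. The summary columns `n` and `mean_mpg` are illustrative; `names_to` stores the list names in a column, which covers the former `.id` argument of `map_dfr()`.
+
+```r
+library(purrr)
+
+per_cyl <- split(mtcars, mtcars$cyl) |>
+  map(\(df) data.frame(n = nrow(df), mean_mpg = mean(df$mpg))) |>
+  list_rbind(names_to = "cyl")  # list names "4", "6", "8" become a column
+
+per_cyl
+```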
+
+## Vectorization
+
+```r
+# Good - vectorized operations
+result <- x + y
+
+# Good - Type-stable purrr functions
+map_dbl(data, mean)    # always returns double
+map_chr(data, class)   # always returns character
+
+# Avoid - Type-unstable base functions
+sapply(data, mean)     # might return list or vector
+
+# Avoid - explicit loops for simple operations
+result <- numeric(length(x))
+for(i in seq_along(x)) {
+  result[i] <- x[i] + y[i]
+}
+```
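+
+The type instability of `sapply()` shows up most clearly with empty input. A minimal sketch (variable names are illustrative):
+
+```r
+library(purrr)
+
+empty <- list()
+
+unstable <- sapply(empty, length)   # an empty list, not an integer vector
+stable <- map_int(empty, length)    # always an integer vector
+
+class(unstable)
+class(stable)
+
+# vapply() is the base R way to get the same guarantee
+stable_base <- vapply(empty, length, integer(1))
+```
+
+Code downstream of `sapply()` that assumes a numeric vector can therefore fail only when the input happens to be empty, which is hard to catch in testing.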
+
+## Using dtplyr for Large Data
+
+For large datasets, use dtplyr to get data.table performance with dplyr syntax:
+
+```r
+library(dtplyr)
+
+# Convert to lazy data.table
+large_data_dt <- lazy_dt(large_data)
+
+# Use dplyr syntax as normal (nothing is computed yet)
+query <- large_data_dt |>
+  filter(year >= 2020) |>
+  group_by(category) |>
+  summarise(
+    total = sum(value),
+    avg = mean(value)
+  )
+
+# See generated data.table code (call this on the lazy query, before collecting)
+query |> show_query()
+
+# Convert back to tibble, which runs the query
+result <- query |> as_tibble()
+```
+
+## Memory Optimization
+
+```r
+# Pre-allocate vectors
+result <- vector("numeric", n)
+
+# Use appropriate data types
+# integer instead of double when possible
+x <- 1:1000  # integer
+y <- seq(1, 1000, by = 1)  # double
+
+# Remove large objects when done
+rm(large_object)
+gc()  # Force garbage collection if needed
+
+# Use data.table for large data
+library(data.table)
+dt <- as.data.table(large_df)
+dt[, new_col := old_col * 2]  # Modifies in place
+```
+
+## String Manipulation Performance
+
+Use stringr over base R for consistent, pipe-friendly string handling (benchmark before assuming it is faster):
+
+```r
+# Good - stringr (consistent, pipe-friendly)
+text |>
+  str_to_lower() |>
+  str_trim() |>
+  str_replace_all("pattern", "replacement") |>
+  str_extract("\\d+")
+
+# Common patterns
+str_detect(text, "pattern")     # vs grepl("pattern", text)
+str_extract(text, "pattern")    # vs complex regmatches()
+str_replace_all(text, "a", "b") # vs gsub("a", "b", text)
+str_split(text, ",")            # vs strsplit(text, ",")
+str_length(text)                # vs nchar(text)
+str_sub(text, 1, 5)             # vs substr(text, 1, 5)
+```
+
+## When to Use vctrs
+
+### Core Benefits
+- **Type stability** - Predictable output types regardless of input values
+- **Size stability** - Predictable output sizes from input sizes
+- **Consistent coercion rules** - Single set of rules applied everywhere
+- **Robust class design** - Proper S3 vector infrastructure
+
+### Use vctrs when:
+
+```r
+# Type-Stable Functions in Packages
+my_function <- function(x, y) {
+  result <- x + y
+  # Always returns double, regardless of input values
+  vec_cast(result, double())
+}
+
+# Consistent Coercion/Casting
+vec_cast(x, double())  # Clear intent, predictable behavior
+vec_ptype_common(x, y, z)  # Finds richest compatible type
+
+# Size/Length Stability
+vec_c(x, y)  # size = vec_size(x) + vec_size(y)
+vec_rbind(df1, df2)  # size = sum of input sizes
+```
+
+### Don't Use vctrs When:
+- Simple one-off analyses - Base R is sufficient
+- No custom classes needed - Standard types work fine  
+- Performance critical + simple operations - Base R may be faster
+- External API constraints - Must return base R types
+
+The key insight: **vctrs is most valuable in package development where type safety, consistency, and extensibility matter more than raw speed for simple operations.**
--- a/skills/r-development/references/rlang-patterns.md
+++ b/skills/r-development/references/rlang-patterns.md
@@ -0,0 +1,247 @@
+# rlang Patterns for Data-Masking
+
+## Core Concepts
+
+**Data-masking** allows R expressions to refer to data frame columns as if they were variables in the environment. rlang provides the metaprogramming framework that powers tidyverse data-masking.
+
+### Key rlang Tools
+
+- **Embracing `{{}}`** - Forward function arguments to data-masking functions
+- **Injection `!!`** - Inject single expressions or values
+- **Splicing `!!!`** - Inject multiple arguments from a list
+- **Dynamic dots** - Programmable `...` with injection support
+- **Pronouns `.data`/`.env`** - Explicit disambiguation between data and environment variables
+
+## Function Argument Patterns
+
+### Forwarding with `{{}}`
+
+Use `{{}}` to forward function arguments to data-masking functions:
+
+```r
+# Single argument forwarding
+my_summarise <- function(data, var) {
+  data |> dplyr::summarise(mean = mean({{ var }}))
+}
+
+# Works with any data-masking expression
+mtcars |> my_summarise(cyl)
+mtcars |> my_summarise(cyl * am)
+mtcars |> my_summarise(.data$cyl)  # pronoun syntax supported
+```
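+
+The same forwarding works for more than one argument. A minimal sketch, assuming dplyr 1.1 or later for `.by`; the wrapper name `mean_by()` is hypothetical:
+
+```r
+library(dplyr)
+
+# Hypothetical wrapper: both arguments are embraced so callers pass bare names
+mean_by <- function(data, group, var) {
+  data |>
+    summarise(mean = mean({{ var }}), .by = {{ group }})
+}
+
+# One row per distinct value of cyl, with the mean of mpg in each group
+result <- mtcars |> mean_by(cyl, mpg)
+result
+```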
+
+### Forwarding `...`
+
+No special syntax needed for dots forwarding:
+
+```r
+# Simple dots forwarding
+my_group_by <- function(.data, ...) {
+  .data |> dplyr::group_by(...)
+}
+
+# Works with tidy selections too
+my_select <- function(.data, ...) {
+  .data |> dplyr::select(...)
+}
+
+# For single-argument tidy selections, wrap in c()
+my_pivot_longer <- function(.data, ...) {
+  .data |> tidyr::pivot_longer(c(...))
+}
+```
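+
+A short usage sketch for the dots wrappers above, assuming dplyr and tidyr are attached. The wrappers are redefined here so the block runs on its own:
+
+```r
+library(dplyr)
+library(tidyr)
+
+my_group_by <- function(.data, ...) {
+  .data |> group_by(...)
+}
+
+my_pivot_longer <- function(.data, ...) {
+  .data |> pivot_longer(c(...))
+}
+
+# Dots are passed through untouched, so bare column names work
+grouped <- mtcars |> my_group_by(cyl, am)
+group_vars(grouped)
+
+# c(...) turns several dots into the single `cols` selection
+long <- mtcars |> my_pivot_longer(mpg, disp)
+names(long)
+```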
+
+### Names Patterns with `.data`
+
+Use `.data` pronoun for programmatic column access:
+
+```r
+# Single column by name
+my_mean <- function(data, var) {
+  data |> dplyr::summarise(mean = mean(.data[[var]]))
+}
+
+# Usage - completely insulated from data-masking
+mtcars |> my_mean("cyl")  # No ambiguity, works like regular function
+
+# Multiple columns with all_of()
+my_select_vars <- function(data, vars) {
+  data |> dplyr::select(all_of(vars))
+}
+
+mtcars |> my_select_vars(c("cyl", "am"))
+```
+
+## Injection Operators
+
+### When to Use Each Operator
+
+| Operator | Use Case | Example |
+|----------|----------|---------|
+| `{{ }}` | Forward function arguments | `summarise(mean = mean({{ var }}))` |
+| `!!` | Inject single expression/value | `summarise(mean = mean(!!sym(var)))` |
+| `!!!` | Inject multiple arguments | `group_by(!!!syms(vars))` |
+| `.data[[]]` | Access columns by name | `mean(.data[[var]])` |
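+
+To make the table concrete, the sketch below computes the same summary three ways and shows `!!!` producing the same grouping as bare names. It assumes dplyr and rlang are attached; `embrace_mean()` is a hypothetical helper.
+
+```r
+library(dplyr)
+library(rlang)
+
+var <- "cyl"
+
+# {{ }}: forward a bare column name through a function argument
+embrace_mean <- function(data, col) {
+  data |> summarise(mean = mean({{ col }}))
+}
+via_embrace <- mtcars |> embrace_mean(cyl)
+
+# !!: turn a string into a symbol and inject it
+via_inject <- mtcars |> summarise(mean = mean(!!sym(var)))
+
+# .data[[]]: look the column up by name
+via_pronoun <- mtcars |> summarise(mean = mean(.data[[var]]))
+
+# !!!: splice several symbols as separate arguments
+vars <- c("cyl", "am")
+via_splice <- mtcars |> group_by(!!!syms(vars))
+group_vars(via_splice)
+```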
+
+### Advanced Injection with `!!`
+
+```r
+library(rlang)  # provides sym() and data_sym()
+
+# Create symbols from strings
+var <- "cyl"
+mtcars |> dplyr::summarise(mean = mean(!!sym(var)))
+
+# Inject values to avoid name collisions
+df <- data.frame(x = 1:3)
+x <- 100
+df |> dplyr::mutate(scaled = x / !!x)  # Uses both data and env x
+
+# data_sym() builds `.data$cyl`, so the lookup cannot fall through to an environment variable
+mtcars |> dplyr::summarise(mean = mean(!!data_sym(var)))
+```
+
+### Splicing with `!!!`
+
+```r
+# Multiple symbols from character vector
+vars <- c("cyl", "am")
+mtcars |> dplyr::group_by(!!!syms(vars))
+
+# Or use data_syms() to build explicit `.data$col` references
+mtcars |> dplyr::group_by(!!!data_syms(vars))
+
+# Splice lists of arguments
+args <- list(na.rm = TRUE, trim = 0.1)
+mtcars |> dplyr::summarise(mean = mean(cyl, !!!args))
+```
+
+## Dynamic Dots Patterns
+
+### Using `list2()` for Dynamic Dots Support
+
+```r
+my_function <- function(...) {
+  # Collect with list2() instead of list() for dynamic features
+  dots <- list2(...)
+  # Process dots...
+}
+
+# Enables these features:
+my_function(a = 1, b = 2)           # Normal usage
+my_function(!!!list(a = 1, b = 2))  # Splice a list
+my_function("{name}" := value)      # Name injection
+my_function(a = 1, )                # Trailing commas OK
+```
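+
+A concrete version of the placeholder above, as a minimal sketch. `collect_args()` and `key` are hypothetical names; name injection with `"{key}" :=` is assumed to need the glue package installed alongside rlang.
+
+```r
+library(rlang)
+
+# Hypothetical helper that collects its arguments into a named list
+collect_args <- function(...) {
+  list2(...)
+}
+
+key <- "alpha"
+
+# Splicing, name injection, and a trailing comma in one call
+out <- collect_args(!!!list(a = 1, b = 2), "{key}" := 3, )
+names(out)
+```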
+
+### Name Injection with Glue Syntax
+
+```r
+# Basic name injection
+name <- "result"
+list2("{name}" := 1)  # Creates list(result = 1)
+
+# In function arguments with {{
+my_mean <- function(data, var) {
+  data |> dplyr::summarise("mean_{{ var }}" := mean({{ var }}))
+}
+
+mtcars |> my_mean(cyl)        # Creates column "mean_cyl"
+mtcars |> my_mean(cyl * am)   # Creates column "mean_cyl * am"
+
+# Allow custom names with englue()
+my_mean <- function(data, var, name = englue("mean_{{ var }}")) {
+  data |> dplyr::summarise("{name}" := mean({{ var }}))
+}
+
+# User can override default
+mtcars |> my_mean(cyl, name = "cylinder_mean")
+```
+
+## Pronouns for Disambiguation
+
+### `.data` and `.env` Best Practices
+
+```r
+# Explicit disambiguation prevents masking issues
+cyl <- 1000  # Environment variable
+
+mtcars |> dplyr::summarise(
+  data_cyl = mean(.data$cyl),    # Data frame column
+  env_cyl = mean(.env$cyl),      # Environment variable
+  ambiguous = mean(cyl)          # Unqualified: the data column masks the environment variable
+)
+
+# Use in loops and programmatic contexts
+vars <- c("cyl", "am")
+for (var in vars) {
+  result <- mtcars |> dplyr::summarise(mean = mean(.data[[var]]))
+  print(result)
+}
+```
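+
+The pronouns matter most inside functions, where an argument can share a name with a column. A minimal sketch, assuming dplyr is attached; `filter_cyl()` is a hypothetical helper:
+
+```r
+library(dplyr)
+
+# Hypothetical helper: `cyl` is both a column in mtcars and a function argument
+filter_cyl <- function(data, cyl) {
+  # .data$cyl is the column, .env$cyl is the argument value
+  data |> filter(.data$cyl == .env$cyl)
+}
+
+six <- mtcars |> filter_cyl(6)
+unique(six$cyl)
+```
+
+Without the pronouns, `filter(cyl == cyl)` would compare the column with itself and keep every row.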
+
+## Programming Patterns
+
+### Bridge Patterns
+
+Converting between data-masking and tidy selection behaviors:
+
+```r
+# across() as selection-to-data-mask bridge
+my_group_by <- function(data, vars) {
+  data |> dplyr::group_by(across({{ vars }}))
+}
+
+# Works with tidy selection
+mtcars |> my_group_by(starts_with("c"))
+
+# across(all_of()) as names-to-data-mask bridge
+my_group_by <- function(data, vars) {
+  data |> dplyr::group_by(across(all_of(vars)))
+}
+
+mtcars |> my_group_by(c("cyl", "am"))
+```
+
+### Transformation Patterns
+
+```r
+# Transform single arguments by wrapping
+my_mean <- function(data, var) {
+  data |> dplyr::summarise(mean = mean({{ var }}, na.rm = TRUE))
+}
+
+# Transform dots with across()
+my_means <- function(data, ...) {
+  data |> dplyr::summarise(across(c(...), ~ mean(.x, na.rm = TRUE)))
+}
+
+# Manual transformation (advanced); enquos() and expr() come from rlang
+my_means_manual <- function(.data, ...) {
+  vars <- enquos(..., .named = TRUE)
+  vars <- purrr::map(vars, ~ expr(mean(!!.x, na.rm = TRUE)))
+  .data |> dplyr::summarise(!!!vars)
+}
+```
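+
+A usage sketch comparing the `across()` version with the manual one, assuming dplyr, rlang, and purrr are installed. The functions are repeated so the block runs on its own:
+
+```r
+library(dplyr)
+library(rlang)
+
+my_means <- function(data, ...) {
+  data |> summarise(across(c(...), ~ mean(.x, na.rm = TRUE)))
+}
+
+my_means_manual <- function(.data, ...) {
+  # Defuse the dots and name them after their expressions
+  vars <- enquos(..., .named = TRUE)
+  # Wrap each defused expression in a call to mean()
+  vars <- purrr::map(vars, ~ expr(mean(!!.x, na.rm = TRUE)))
+  # Splice the new expressions back into summarise()
+  .data |> summarise(!!!vars)
+}
+
+res_across <- mtcars |> my_means(cyl, mpg)
+res_manual <- mtcars |> my_means_manual(cyl, mpg)
+
+# Both keep the original column names
+names(res_across)
+names(res_manual)
+```
+
+The practical difference is in what the dots accept: the `across()` version takes tidy selections such as `starts_with("c")`, while the manual version takes data-masking expressions.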
+
+## Common Patterns Summary
+
+### When to Use What
+
+**Use `{{}}` when:**
+- Forwarding user-provided column references
+- Building wrapper functions around dplyr/tidyr
+- Need to support both bare names and expressions
+
+**Use `.data[[]]` when:**
+- Working with character vector column names
+- Iterating over column names programmatically
+- Need complete insulation from data-masking
+
+**Use `!!` when:**
+- Need to inject computed expressions
+- Converting strings to symbols with `sym()`
+- Avoiding variable name collisions
+
+**Use `!!!` when:**
+- Injecting multiple arguments from a list
+- Working with variable numbers of columns
+- Splicing named arguments
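+
+The guidelines above often combine in a single function. A minimal sketch, assuming dplyr 1.1 or later; `summarise_by()` is a hypothetical wrapper that takes grouping columns as strings and the summary variable as a bare name:
+
+```r
+library(dplyr)
+
+# Hypothetical wrapper: string column names for grouping, an embraced
+# expression for the summary, and a glue-style name for the result column
+summarise_by <- function(data, group_cols, var) {
+  data |>
+    summarise(
+      "mean_{{ var }}" := mean({{ var }}, na.rm = TRUE),
+      .by = all_of(group_cols)
+    )
+}
+
+out <- mtcars |> summarise_by(c("cyl", "am"), mpg)
+names(out)
+```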