This appendix documents all datasets and R packages used throughout the book. All datasets are simulated with fixed random seeds for full reproducibility — every reader running the code will obtain identical results. All datasets are saved as .rds files in the RDS/ directory of the book’s companion repository and can be loaded directly into R without re-running the simulation code.
Each dataset can be loaded directly from the repository once you have cloned it locally:
# Load any dataset by namesystolic<-readRDS("RDS/systolic.rds")gxe<-readRDS("RDS/gxe.rds")cardiac_recovery<-readRDS("RDS/cardiac_recovery.rds")cardiac_recovery_wide<-readRDS("RDS/cardiac_recovery_wide.rds")grazing<-readRDS("RDS/grazing.rds")irrigation<-readRDS("RDS/irrigation.rds")physio<-readRDS("RDS/physio.rds")birds<-readRDS("RDS/birds.rds")
Or regenerate any dataset from scratch by running the corresponding simulation chunk. All chunks use set.seed(42) for reproducibility.
B.1.2 Dataset 1: Antihypertensive Drug Trial
File:RDS/systolic.rds
Used in: Chapter 3 (One-Way ANOVA), Chapter 4 (Multiple Comparisons)
Description: A simulated randomised controlled trial investigating the effect of antihypertensive treatment on systolic blood pressure in patients with mild hypertension. Patients are randomly assigned to one of three groups: placebo, low-dose drug, or high-dose drug, and systolic blood pressure (mmHg) is measured after eight weeks of treatment.
Design: One-way completely randomised design. Three treatment groups, 20 patients per group, \(N = 60\) observations total.
Variables:
Variable
Type
Description
bp
Numeric
Systolic blood pressure (mmHg)
group
Factor
Treatment group (Placebo / Low dose / High dose)
True parameters: group means 150, 140, and 128 mmHg; common within-group SD = 12 mmHg.
set.seed(42)n<-20groups<-c("Placebo", "Low dose", "High dose")means<-c(150, 140, 128)sd_common<-12group<-factor(rep(groups, each =n), levels =groups)bp<-c(rnorm(n, mean =means[1], sd =sd_common),rnorm(n, mean =means[2], sd =sd_common),rnorm(n, mean =means[3], sd =sd_common))systolic<-data.frame(bp, group)saveRDS(systolic, "RDS/systolic.rds")summary(systolic)
bp group
Min. : 92.08 Placebo :20
1st Qu.:129.61 Low dose :20
Median :136.62 High dose:20
Mean :139.01
3rd Qu.:148.76
Max. :177.44
Used in: Chapter 5 (Two-Way ANOVA and Factorial Designs)
Description: A simulated multi-environment wheat trial designed to demonstrate genotype × environment interaction (G×E). Four wheat varieties are grown across three rainfall environments (dry, moderate, wet), with five replicate plots per variety-environment combination. Two varieties are environment-specific specialists (Var B thrives in wet conditions; Var C in dry conditions), while two are stable generalists (Var A and Var D), producing a clear crossover interaction.
variety env rep yield
Var A:15 Dry :20 Min. :1 Min. :26.03
Var B:15 Moderate:20 1st Qu.:2 1st Qu.:45.60
Var C:15 Wet :20 Median :3 Median :52.71
Var D:15 Mean :3 Mean :51.28
3rd Qu.:4 3rd Qu.:56.95
Max. :5 Max. :67.16
Description: A simulated longitudinal clinical trial following 40 patients recovering from cardiac surgery. Patients are randomly assigned to either a standard rehabilitation programme (control) or an intensive cardiac rehabilitation programme (intervention). Resting heart rate (bpm) is measured at baseline and at weeks 4, 8, and 12. The intensive programme produces a steeper decline in heart rate, creating a significant group-by-time interaction.
Design: Mixed repeated measures design. Two between-subjects groups (20 patients per group) crossed with four within-subject time points. Total: 160 observations in long format.
library(MASS)set.seed(42)n_group<-20k<-4times<-c(0, 4, 8, 12)groups<-c("Control", "Intervention")means_control<-c(82, 80, 78, 77)means_intervention<-c(82, 76, 71, 68)sigma_within<-matrix(c(16,8,6,4, 8,16,8,6,6,8,16,8, 4,6,8,16), nrow =4)df_control<-as.data.frame(mvrnorm(n_group, mu =means_control, Sigma =sigma_within))df_intervention<-as.data.frame(mvrnorm(n_group, mu =means_intervention, Sigma =sigma_within))names(df_control)<-names(df_intervention)<-paste0("t", 1:k)df_wide<-rbind(df_control, df_intervention)df_wide$subject<-factor(1:(2*n_group))df_wide$group<-factor(rep(groups, each =n_group), levels =groups)cardiac_recovery_wide<-df_widecardiac_recovery<-reshape(df_wide, varying =paste0("t", 1:k), v.names ="hr", timevar ="time", times =times, direction ="long")cardiac_recovery$time<-factor(cardiac_recovery$time)cardiac_recovery$subject<-factor(cardiac_recovery$subject)saveRDS(cardiac_recovery, "RDS/cardiac_recovery.rds")saveRDS(cardiac_recovery_wide, "RDS/cardiac_recovery_wide.rds")summary(cardiac_recovery)
subject group time hr id
1 : 4 Control :80 0 :40 Min. :58.80 Min. : 1.00
2 : 4 Intervention:80 4 :40 1st Qu.:71.66 1st Qu.:10.75
3 : 4 8 :40 Median :77.09 Median :20.50
4 : 4 12:40 Mean :76.32 Mean :20.50
5 : 4 3rd Qu.:80.38 3rd Qu.:30.25
6 : 4 Max. :93.61 Max. :40.00
(Other):136
B.1.5 Dataset 4: Grazing Effects on Plant Species Richness
File:RDS/grazing.rds
Used in: Chapter 7 (Split-Plot and Nested Designs)
Description: A simulated ecological experiment investigating the effect of grazing intensity on plant species richness in a grassland. Three grazing treatments (ungrazed, light, heavy) are applied to six large paddocks (two per treatment). Within each paddock, species richness is counted in four quadrats, creating a nested structure where quadrats are subsamples within paddocks and paddocks are the true experimental units.
Design: Nested design with two levels. Three treatments × 2 paddocks per treatment × 4 quadrats per paddock = 24 observations.
Variables:
Variable
Type
Description
richness
Integer
Plant species richness per quadrat
treatment
Factor
Grazing intensity (Ungrazed / Light / Heavy)
paddock
Factor
Paddock identifier (U1, U2, L1, L2, H1, H2)
True parameters: treatment means 12, 9, and 6 species per quadrat (ungrazed to heavy); paddock SD = 2; quadrat SD = 1.
Used in: Chapter 7 (Split-Plot and Nested Designs)
Description: A simulated agricultural split-plot experiment testing the effects of irrigation regime and fertiliser type on wheat yield. Irrigation (rainfed vs irrigated) is the whole-plot factor applied to entire fields; fertiliser type (none, nitrogen, phosphorus) is the sub-plot factor applied to individual plots within each field. The design correctly reflects the practical constraint that irrigation infrastructure cannot be changed within a field, creating two different error terms for the two factors.
Design: Split-plot design. Two irrigation treatments × 3 fields per treatment × 3 fertiliser types per field = 18 observations.
Variables:
Variable
Type
Description
yield
Numeric
Wheat yield (g/plot)
irrigation
Factor
Irrigation regime (Rainfed / Irrigated)
fertiliser
Factor
Fertiliser type (None / Nitrogen / Phosphorus)
field
Factor
Field identifier (R1—R3 rainfed, I1—I3 irrigated)
True parameters: cell means range from 30 (rainfed, no fertiliser) to 55 (irrigated, nitrogen); field SD = 4; plot SD = 2.
Description: A simulated national health study investigating the effect of a new physiotherapy protocol on functional recovery scores following knee replacement surgery. Patients are recruited from 15 hospitals distributed across 3 regions, with 8-12 patients per hospital. The three-level nested structure (patients within hospitals within regions) requires a mixed-effects model with two nested random intercepts to produce valid inference about the treatment effect.
Design: Three-level nested observational study. 3 regions × 5 hospitals per region × 8-12 patients per hospital. Total: 148 patients.
recovery treatment hospital region
Min. :39.19 Control :78 H14 :12 R1:52
1st Qu.:57.59 Treatment:70 H23 :12 R2:45
Median :64.79 H34 :12 R3:51
Mean :64.38 H12 :11
3rd Qu.:72.37 H13 :11
Max. :95.10 H32 :11
(Other):79
B.1.8 Dataset 7: Bird Abundance in Forest Landscapes
File:RDS/birds.rds
Used in: Chapter 11 (Generalised Linear Mixed Models)
Description: A simulated ecological survey of bird abundance across three landscape types (agricultural matrix, fragmented forest, and continuous forest). The focal species is a forest interior specialist that is genuinely absent from many agricultural sites (structural zeros), creating a zero-inflated count response that motivates the use of negative binomial and zero-inflated GLMMs. Five transects are surveyed within each of 24 sites (8 per landscape type).
Design: Hierarchical count survey. 3 landscape types × 8 sites per landscape × 5 transects per site = 120 observations.
Variables:
Variable
Type
Description
count
Integer
Number of individuals detected per transect
landscape
Factor
Landscape type (Agricultural / Fragmented / Continuous)
site
Factor
Site identifier (Agr1—Agr8, Fra1—Fra8, Con1—Con8)
transect
Factor
Transect number within site (1—5)
True parameters: structural absence probabilities of 0.70 (agricultural), 0.30 (fragmented), and 0.00 (continuous); expected log-counts of 1.2, 2.2, and 3.2 given presence; negative binomial size = 2; site SD = 0.5 on log scale.
All analyses in this book were conducted in R v3.6.0 (R2026?). The following packages are required to reproduce all code in the book. Install them all at once with:
packages<-c(# Core modelling"lme4", "lmerTest", "nlme", "glmmTMB", "ordinal",# Model diagnostics and comparison"DHARMa", "MuMIn", "pbkrtest", "car",# Post-hoc comparisons and effect sizes"emmeans", "effectsize", "multcomp",# Power analysis"pwr", "simr",# Non-parametric and robust methods"WRS2", "PMCMRplus", "dunn.test",# Data manipulation and visualisation"ggplot2", "patchwork", "ggsignif",# Parallel computation"future", "furrr", "purrr",# Utilities"here", "MASS", "scales")
The following references correspond to the R packages listed in the table above. Citations were generated automatically using knitr::write_bib() and reflect the package versions installed at the time of writing.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using Lme4.”Journal of Statistical Software 67 (October): 1–48. https://doi.org/10.18637/jss.v067.i01.
Ben-Shachar, Mattan S., Daniel Lüdecke, and Dominique Makowski. 2020. “Effectsize: Estimation of Effect Size Indices and Standardized Parameters.”Journal of Open Source Software 5 (56): 2815. https://doi.org/10.21105/joss.02815.
Brooks, Mollie E., Kasper Kristensen, Koen J. van Benthem, Arni Magnusson, Casper W. Berg, Anders Nielsen, Hans J. Skaug, Martin Mächler, and Benjamin M. Bolker. 2017. “glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling.”The R Journal 9 (2): 378–400. https://doi.org/10.32614/RJ-2017-066.
Green, Peter, and Catriona J. MacLeod. 2016. “SIMR: An R Package for Power Analysis of Generalized Linear Mixed Models by Simulation.”Methods in Ecology and Evolution 7 (4): 493–98. https://doi.org/10.1111/2041-210X.12504.
Kuznetsova, Alexandra, Per B. Brockhoff, and Rune H. B. Christensen. 2017. “lmerTest Package: Tests in Linear Mixed Effects Models.”Journal of Statistical Software 82 (December): 1–26. https://doi.org/10.18637/jss.v082.i13.
Mair, Patrick, and Rand Wilcox. 2020. “Robust Statistical Methods in R Using the WRS2 Package.”Behavior Research Methods 52 (2): 464–88. https://doi.org/10.3758/s13428-019-01246-w.
Pinheiro, J., and D. Bates. 2000. Mixed-Effects Models in S and s-PLUS. Statistics and Computing. Springer New York.