Appendix B — Datasets and R Packages

This appendix documents all datasets and R packages used throughout the book. All datasets are simulated with fixed random seeds for full reproducibility — every reader running the code will obtain identical results. All datasets are saved as .rds files in the RDS/ directory of the book’s companion repository and can be loaded directly into R without re-running the simulation code.

The companion repository is available at: https://github.com/Ebedthan/maov_data

B.1 B.1 Datasets

B.1.1 How to Load the Datasets

Each dataset can be loaded directly from the repository once you have cloned it locally:

# Load any dataset by name
systolic <- readRDS("RDS/systolic.rds")
gxe <- readRDS("RDS/gxe.rds")
cardiac_recovery <- readRDS("RDS/cardiac_recovery.rds")
cardiac_recovery_wide <- readRDS("RDS/cardiac_recovery_wide.rds")
grazing <- readRDS("RDS/grazing.rds")
irrigation <- readRDS("RDS/irrigation.rds")
physio <- readRDS("RDS/physio.rds")
birds <- readRDS("RDS/birds.rds")

Or regenerate any dataset from scratch by running the corresponding simulation chunk. All chunks use set.seed(42) for reproducibility.

B.1.2 Dataset 1: Antihypertensive Drug Trial

File: RDS/systolic.rds

Used in: Chapter 3 (One-Way ANOVA), Chapter 4 (Multiple Comparisons)

Description: A simulated randomised controlled trial investigating the effect of antihypertensive treatment on systolic blood pressure in patients with mild hypertension. Patients are randomly assigned to one of three groups: placebo, low-dose drug, or high-dose drug, and systolic blood pressure (mmHg) is measured after eight weeks of treatment.

Design: One-way completely randomised design. Three treatment groups, 20 patients per group, \(N = 60\) observations total.

Variables:

Variable	Type	Description
`bp`	Numeric	Systolic blood pressure (mmHg)
`group`	Factor	Treatment group (Placebo / Low dose / High dose)

True parameters: group means 150, 140, and 128 mmHg; common within-group SD = 12 mmHg.

set.seed(42)

n         <- 20
groups    <- c("Placebo", "Low dose", "High dose")
means     <- c(150, 140, 128)
sd_common <- 12

group    <- factor(rep(groups, each = n), levels = groups)
bp       <- c(rnorm(n, mean = means[1], sd = sd_common),
              rnorm(n, mean = means[2], sd = sd_common),
              rnorm(n, mean = means[3], sd = sd_common))
systolic <- data.frame(bp, group)

saveRDS(systolic, "RDS/systolic.rds")
summary(systolic)

       bp               group   
 Min.   : 92.08   Placebo  :20  
 1st Qu.:129.61   Low dose :20  
 Median :136.62   High dose:20  
 Mean   :139.01                 
 3rd Qu.:148.76                 
 Max.   :177.44

B.1.3 Dataset 2: Genotype × Environment Wheat Trial

File: RDS/gxe.rds

Used in: Chapter 5 (Two-Way ANOVA and Factorial Designs)

Description: A simulated multi-environment wheat trial designed to demonstrate genotype × environment interaction (G×E). Four wheat varieties are grown across three rainfall environments (dry, moderate, wet), with five replicate plots per variety-environment combination. Two varieties are environment-specific specialists (Var B thrives in wet conditions; Var C in dry conditions), while two are stable generalists (Var A and Var D), producing a clear crossover interaction.

Design: Two-way factorial design (variety × environment) with replication. \(4 \times 3 \times 5 = 60\) observations total.

Variables:

Variable	Type	Description
`variety`	Factor	Wheat variety (Var A / Var B / Var C / Var D)
`env`	Factor	Growing environment (Dry / Moderate / Wet)
`rep`	Integer	Replicate number within each cell (1—5)
`yield`	Numeric	Plot yield (g/plot)

True parameters: cell means range from 40 (Var B in dry) to 65 (Var B in wet); within-cell SD = 5 g/plot.

set.seed(42)

n_rep     <- 5
varieties <- c("Var A", "Var B", "Var C", "Var D")
envs      <- c("Dry", "Moderate", "Wet")

cell_means <- matrix(
  c(50, 52, 51,
    40, 52, 65,
    62, 53, 41,
    50, 51, 50),
  nrow = 4, byrow = TRUE,
  dimnames = list(varieties, envs)
)

gxe        <- expand.grid(variety = varieties,
                           env     = envs,
                           rep     = seq_len(n_rep))
gxe$yield  <- mapply(function(v, e) {
  rnorm(1, mean = cell_means[v, e], sd = 5)
}, as.character(gxe$variety), as.character(gxe$env))
gxe$variety <- factor(gxe$variety, levels = varieties)
gxe$env     <- factor(gxe$env,     levels = envs)

saveRDS(gxe, "RDS/gxe.rds")
summary(gxe)

  variety         env          rep        yield      
 Var A:15   Dry     :20   Min.   :1   Min.   :26.03  
 Var B:15   Moderate:20   1st Qu.:2   1st Qu.:45.60  
 Var C:15   Wet     :20   Median :3   Median :52.71  
 Var D:15                 Mean   :3   Mean   :51.28  
                          3rd Qu.:4   3rd Qu.:56.95  
                          Max.   :5   Max.   :67.16

B.1.4 Dataset 3: Cardiac Rehabilitation Trial

Files: RDS/cardiac_recovery.rds (long format), RDS/cardiac_recovery_wide.rds (wide format) Used in: Chapter 6 (Repeated Measures ANOVA), Chapter 9 (Mixed-Effects Models)

Description: A simulated longitudinal clinical trial following 40 patients recovering from cardiac surgery. Patients are randomly assigned to either a standard rehabilitation programme (control) or an intensive cardiac rehabilitation programme (intervention). Resting heart rate (bpm) is measured at baseline and at weeks 4, 8, and 12. The intensive programme produces a steeper decline in heart rate, creating a significant group-by-time interaction.

Design: Mixed repeated measures design. Two between-subjects groups (20 patients per group) crossed with four within-subject time points. Total: 160 observations in long format.

Variables (long format):

Variable	Type	Description
`hr`	Numeric	Resting heart rate (bpm)
`group`	Factor	Rehabilitation programme (Control / Intervention)
`time`	Factor	Measurement time point (0 / 4 / 8 / 12 weeks)
`subject`	Factor	Patient identifier (1—40)

True parameters: control trajectory (82, 80, 78, 77 bpm); intervention trajectory (82, 76, 71, 68 bpm); within-subject covariance with moderate temporal correlation.

library(MASS)
set.seed(42)

n_group            <- 20
k                  <- 4
times              <- c(0, 4, 8, 12)
groups             <- c("Control", "Intervention")
means_control      <- c(82, 80, 78, 77)
means_intervention <- c(82, 76, 71, 68)

sigma_within <- matrix(c(16,8,6,4, 8,16,8,6,
                          6,8,16,8, 4,6,8,16), nrow = 4)

df_control      <- as.data.frame(
  mvrnorm(n_group, mu = means_control,      Sigma = sigma_within))
df_intervention <- as.data.frame(
  mvrnorm(n_group, mu = means_intervention, Sigma = sigma_within))
names(df_control) <- names(df_intervention) <- paste0("t", 1:k)

df_wide          <- rbind(df_control, df_intervention)
df_wide$subject  <- factor(1:(2 * n_group))
df_wide$group    <- factor(rep(groups, each = n_group),
                            levels = groups)

cardiac_recovery_wide <- df_wide
cardiac_recovery      <- reshape(
  df_wide, varying = paste0("t", 1:k),
  v.names = "hr", timevar = "time", times = times,
  direction = "long")
cardiac_recovery$time    <- factor(cardiac_recovery$time)
cardiac_recovery$subject <- factor(cardiac_recovery$subject)

saveRDS(cardiac_recovery,      "RDS/cardiac_recovery.rds")
saveRDS(cardiac_recovery_wide, "RDS/cardiac_recovery_wide.rds")
summary(cardiac_recovery)

    subject             group    time          hr              id       
 1      :  4   Control     :80   0 :40   Min.   :58.80   Min.   : 1.00  
 2      :  4   Intervention:80   4 :40   1st Qu.:71.66   1st Qu.:10.75  
 3      :  4                     8 :40   Median :77.09   Median :20.50  
 4      :  4                     12:40   Mean   :76.32   Mean   :20.50  
 5      :  4                             3rd Qu.:80.38   3rd Qu.:30.25  
 6      :  4                             Max.   :93.61   Max.   :40.00  
 (Other):136

B.1.5 Dataset 4: Grazing Effects on Plant Species Richness

File: RDS/grazing.rds

Used in: Chapter 7 (Split-Plot and Nested Designs)

Description: A simulated ecological experiment investigating the effect of grazing intensity on plant species richness in a grassland. Three grazing treatments (ungrazed, light, heavy) are applied to six large paddocks (two per treatment). Within each paddock, species richness is counted in four quadrats, creating a nested structure where quadrats are subsamples within paddocks and paddocks are the true experimental units.

Design: Nested design with two levels. Three treatments × 2 paddocks per treatment × 4 quadrats per paddock = 24 observations.

Variables:

Variable	Type	Description
`richness`	Integer	Plant species richness per quadrat
`treatment`	Factor	Grazing intensity (Ungrazed / Light / Heavy)
`paddock`	Factor	Paddock identifier (U1, U2, L1, L2, H1, H2)

True parameters: treatment means 12, 9, and 6 species per quadrat (ungrazed to heavy); paddock SD = 2; quadrat SD = 1.

set.seed(42)

n_treatments    <- 3
n_paddocks      <- 2
n_quadrats      <- 4
treatment_means <- c(12, 9, 6)
paddock_sd      <- 2
quadrat_sd      <- 1

grazing <- do.call(rbind, lapply(seq_len(n_treatments), function(t) {
  do.call(rbind, lapply(seq_len(n_paddocks), function(p) {
    paddock_effect <- rnorm(1, 0, paddock_sd)
    data.frame(
      richness  = round(rnorm(n_quadrats,
                              mean = treatment_means[t] +
                                     paddock_effect,
                              sd   = quadrat_sd)),
      treatment = factor(c("Ungrazed", "Light", "Heavy")[t],
                         levels = c("Ungrazed", "Light", "Heavy")),
      paddock   = factor(paste0(c("U","L","H")[t], p))
    )
  }))
}))

saveRDS(grazing, "RDS/grazing.rds")
summary(grazing)

    richness         treatment paddock
 Min.   : 3.000   Ungrazed:8   U1:4   
 1st Qu.: 6.750   Light   :8   U2:4   
 Median :10.500   Heavy   :8   L1:4   
 Mean   : 9.792                L2:4   
 3rd Qu.:13.250                H1:4   
 Max.   :15.000                H2:4

B.1.6 Dataset 5: Irrigation × Fertiliser Split-Plot Trial

File: RDS/irrigation.rds

Used in: Chapter 7 (Split-Plot and Nested Designs)

Description: A simulated agricultural split-plot experiment testing the effects of irrigation regime and fertiliser type on wheat yield. Irrigation (rainfed vs irrigated) is the whole-plot factor applied to entire fields; fertiliser type (none, nitrogen, phosphorus) is the sub-plot factor applied to individual plots within each field. The design correctly reflects the practical constraint that irrigation infrastructure cannot be changed within a field, creating two different error terms for the two factors.

Design: Split-plot design. Two irrigation treatments × 3 fields per treatment × 3 fertiliser types per field = 18 observations.

Variables:

Variable	Type	Description
`yield`	Numeric	Wheat yield (g/plot)
`irrigation`	Factor	Irrigation regime (Rainfed / Irrigated)
`fertiliser`	Factor	Fertiliser type (None / Nitrogen / Phosphorus)
`field`	Factor	Field identifier (R1—R3 rainfed, I1—I3 irrigated)

True parameters: cell means range from 30 (rainfed, no fertiliser) to 55 (irrigated, nitrogen); field SD = 4; plot SD = 2.

set.seed(42)

irrigation <- c("Rainfed", "Irrigated")
fertiliser <- c("None", "Nitrogen", "Phosphorus")
n_fields   <- 3

cell_means <- matrix(
  c(30, 38, 35, 42, 55, 50), nrow = 2, byrow = TRUE,
  dimnames = list(irrigation, fertiliser))

df_agr <- do.call(rbind, lapply(seq_along(irrigation), function(i) {
  do.call(rbind, lapply(seq_len(n_fields), function(f) {
    field_effect <- rnorm(1, 0, 4)
    data.frame(
      yield      = sapply(fertiliser, function(fert)
        rnorm(1, mean = cell_means[i, fert] + field_effect, sd = 2)),
      irrigation = irrigation[i],
      fertiliser = fertiliser,
      field      = factor(paste0(substr(irrigation[i], 1, 1), f))
    )
  }))
}))

df_agr$irrigation <- factor(df_agr$irrigation,
                              levels = c("Rainfed", "Irrigated"))
df_agr$fertiliser <- factor(df_agr$fertiliser, levels = fertiliser)

saveRDS(df_agr, "RDS/irrigation.rds")
summary(df_agr)

     yield           irrigation      fertiliser field 
 Min.   :31.40   Rainfed  :9    None      :6    R1:3  
 1st Qu.:36.62   Irrigated:9    Nitrogen  :6    R2:3  
 Median :43.43                  Phosphorus:6    R3:3  
 Mean   :42.98                                  I1:3  
 3rd Qu.:48.91                                  I2:3  
 Max.   :53.43                                  I3:3

B.1.7 Dataset 6: Physiotherapy Protocol: Three-Level Nested Trial

File: RDS/physio.rds

Used in: Chapter 9 (Mixed-Effects Models)

Description: A simulated national health study investigating the effect of a new physiotherapy protocol on functional recovery scores following knee replacement surgery. Patients are recruited from 15 hospitals distributed across 3 regions, with 8-12 patients per hospital. The three-level nested structure (patients within hospitals within regions) requires a mixed-effects model with two nested random intercepts to produce valid inference about the treatment effect.

Design: Three-level nested observational study. 3 regions × 5 hospitals per region × 8-12 patients per hospital. Total: 148 patients.

Variables:

Variable	Type	Description
`recovery`	Numeric	Functional recovery score (0—100)
`treatment`	Factor	Protocol (Control / Treatment)
`hospital`	Factor	Hospital identifier (H11—H35)
`region`	Factor	Region identifier (R1 / R2 / R3)

True parameters: treatment effect = 7 points; region SD = 4; hospital-within-region SD = 3; patient-level SD = 8.

set.seed(42)

n_regions        <- 3
n_hospitals      <- 5
n_patients       <- c(8, 9, 10, 11, 12)
treatment_effect <- 7

df_nested <- do.call(rbind, lapply(seq_len(n_regions), function(r) {
  region_effect <- rnorm(1, 0, 4)
  do.call(rbind, lapply(seq_len(n_hospitals), function(h) {
    hospital_effect <- rnorm(1, 0, 3)
    n_pat   <- sample(n_patients, 1)
    treatment <- rep(c("Control", "Treatment"), length.out = n_pat)
    data.frame(
      recovery  = rnorm(n_pat,
                        mean = 60 + region_effect +
                               hospital_effect +
                               treatment_effect *
                               (treatment == "Treatment"),
                        sd   = 8),
      treatment = factor(treatment,
                         levels = c("Control", "Treatment")),
      hospital  = factor(paste0("H", r, h)),
      region    = factor(paste0("R", r))
    )
  }))
}))

saveRDS(df_nested, "RDS/physio.rds")
summary(df_nested)

    recovery         treatment     hospital  region 
 Min.   :39.19   Control  :78   H14    :12   R1:52  
 1st Qu.:57.59   Treatment:70   H23    :12   R2:45  
 Median :64.79                  H34    :12   R3:51  
 Mean   :64.38                  H12    :11          
 3rd Qu.:72.37                  H13    :11          
 Max.   :95.10                  H32    :11          
                                (Other):79

B.1.8 Dataset 7: Bird Abundance in Forest Landscapes

File: RDS/birds.rds

Used in: Chapter 11 (Generalised Linear Mixed Models)

Description: A simulated ecological survey of bird abundance across three landscape types (agricultural matrix, fragmented forest, and continuous forest). The focal species is a forest interior specialist that is genuinely absent from many agricultural sites (structural zeros), creating a zero-inflated count response that motivates the use of negative binomial and zero-inflated GLMMs. Five transects are surveyed within each of 24 sites (8 per landscape type).

Design: Hierarchical count survey. 3 landscape types × 8 sites per landscape × 5 transects per site = 120 observations.

Variables:

Variable	Type	Description
`count`	Integer	Number of individuals detected per transect
`landscape`	Factor	Landscape type (Agricultural / Fragmented / Continuous)
`site`	Factor	Site identifier (Agr1—Agr8, Fra1—Fra8, Con1—Con8)
`transect`	Factor	Transect number within site (1—5)

True parameters: structural absence probabilities of 0.70 (agricultural), 0.30 (fragmented), and 0.00 (continuous); expected log-counts of 1.2, 2.2, and 3.2 given presence; negative binomial size = 2; site SD = 0.5 on log scale.

set.seed(42)

n_landscapes    <- 3
n_sites         <- 8
n_transects     <- 5
landscape_names <- c("Agricultural", "Fragmented", "Continuous")
zi_prob         <- c(0.70, 0.30, 0.00)
lambda_log      <- c(1.2, 2.2, 3.2)
site_sd         <- 0.5

df_birds <- do.call(rbind, lapply(
  seq_len(n_landscapes), function(l) {
  do.call(rbind, lapply(seq_len(n_sites), function(s) {
    site_effect  <- rnorm(1, 0, site_sd)
    site_absent  <- rbinom(1, 1, prob = zi_prob[l])
    counts <- if (site_absent == 1) {
      rep(0, n_transects)
    } else {
      rnbinom(n_transects,
              mu   = exp(lambda_log[l] + site_effect),
              size = 2)
    }
    data.frame(
      count     = counts,
      landscape = factor(landscape_names[l],
                         levels = landscape_names),
      site      = factor(paste0(
        substr(landscape_names[l], 1, 3), s)),
      transect  = factor(1:n_transects)
    )
  }))
}))

saveRDS(df_birds, "RDS/birds.rds")
summary(df_birds)

     count               landscape       site    transect
 Min.   : 0.000   Agricultural:40   Agr1   : 5   1:24    
 1st Qu.: 0.000   Fragmented  :40   Agr2   : 5   2:24    
 Median : 5.000   Continuous  :40   Agr3   : 5   3:24    
 Mean   : 8.925                     Agr4   : 5   4:24    
 3rd Qu.:14.000                     Agr5   : 5   5:24    
 Max.   :47.000                     Agr6   : 5           
                                    (Other):90

B.1.9 Dataset Summary Table

Dataset	File	Design	\(N\)	Response	Chapter
Antihypertensive trial	`systolic.rds`	One-way CRD	60	Blood pressure (mmHg)	3, 4
G×E wheat trial	`gxe.rds`	Two-way factorial	60	Yield (g/plot)	5
Cardiac rehabilitation	`cardiac_recovery.rds`	Repeated measures	160	Heart rate (bpm)	6, 9
Grazing richness	`grazing.rds`	Nested	24	Species richness	7
Irrigation × fertiliser	`irrigation.rds`	Split-plot	18	Yield (g/plot)	7
Physiotherapy trial	`physio.rds`	Three-level nested	148	Recovery score	9
Bird abundance	`birds.rds`	Count survey	120	Individual count	11

B.2 B.2 R Packages

All analyses in this book were conducted in R v3.6.0 (R2026?). The following packages are required to reproduce all code in the book. Install them all at once with:

packages <- c(
  # Core modelling
  "lme4", "lmerTest", "nlme", "glmmTMB", "ordinal",
  # Model diagnostics and comparison
  "DHARMa", "MuMIn", "pbkrtest", "car",
  # Post-hoc comparisons and effect sizes
  "emmeans", "effectsize", "multcomp",
  # Power analysis
  "pwr", "simr",
  # Non-parametric and robust methods
  "WRS2", "PMCMRplus", "dunn.test",
  # Data manipulation and visualisation
  "ggplot2", "patchwork", "ggsignif",
  # Parallel computation
  "future", "furrr", "purrr",
  # Utilities
  "here", "MASS", "scales"
)

install.packages(packages)

The table below lists each package with its primary use in the book, its current version on CRAN, and the citation to use when reporting analyses.

Package	Primary use	Citation
`lme4`	Linear mixed models	Bates et al. (2015)
`lmerTest`	P-values for mixed models (Satterthwaite/KR)	Kuznetsova, Brockhoff, and Christensen (2017)
`nlme`	Mixed models with covariance structures	Pinheiro and Bates (2000)
`glmmTMB`	GLMMs, zero-inflation, negative binomial	Brooks et al. (2017)
`ordinal`	Cumulative link mixed models	(Christensen2019?)
`DHARMa`	Simulation-based GLMM diagnostics	Hartig (2026)
`MuMIn`	Model selection, AIC, \(R^2\) for mixed models	(Barton2023?)
`pbkrtest`	Kenward-Roger degrees of freedom	(Halekoh2014?)
`car`	Type II/III ANOVA, Levene’s test	(Fox2019?)
`emmeans`	Estimated marginal means, post-hoc tests	(Lenth2023?)
`effectsize`	\(\eta^2\), \(\omega^2\), Cohen’s \(f\)	Ben-Shachar, Lüdecke, and Makowski (2020)
`multcomp`	Planned contrasts, general linear hypotheses	(Hothorn2008?)
`pwr`	Analytical power analysis	Cohen (1988)
`simr`	Simulation-based power for mixed models	Green and MacLeod (2016)
`WRS2`	Robust ANOVA methods	Mair and Wilcox (2020)
`PMCMRplus`	Friedman post-hoc tests	(Pohlert2022?)
`dunn.test`	Dunn’s post-hoc test after Kruskal-Wallis	(Dinno2017?)
`ggplot2`	Data visualisation	(Wickham2016?)
`patchwork`	Combining ggplot2 panels	(Pedersen2020?)
`ggsignif`	Significance annotations on plots	(Ahlmann2021?)
`future`	Parallel computation backend	(Bengtsson2021?)
`furrr`	Parallel `purrr`-style mapping	(Vaughan2022?)
`purrr`	Functional programming tools	(Wickham2023?)
`here`	Reproducible file paths	(Müller2020?)
`MASS`	Multivariate normal simulation	(Venables2002?)

B.2.1 Checking Your Package Versions

To verify that all required packages are installed and check their versions:

# Check all required packages
pkg_status <- sapply(packages, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    as.character(packageVersion(p))
  } else {
    "NOT INSTALLED"
  }
})

pkg_df <- data.frame(
  Package = names(pkg_status),
  Version = pkg_status,
  row.names = NULL
)
print(pkg_df)

      Package    Version
1        lme4      2.0.1
2    lmerTest      3.2.1
3        nlme    3.1.169
4     glmmTMB     1.1.14
5     ordinal 2025.12.29
6      DHARMa      0.5.0
7       MuMIn    1.48.19
8    pbkrtest      0.5.5
9         car      3.1.5
10    emmeans      2.0.3
11 effectsize      1.0.2
12   multcomp     1.4.30
13        pwr      1.3.0
14       simr      1.0.9
15       WRS2      1.1.7
16  PMCMRplus     1.9.12
17  dunn.test      1.4.1
18    ggplot2      4.0.3
19  patchwork      1.3.2
20   ggsignif      0.6.4
21     future     1.70.0
22      furrr      0.4.0
23      purrr      1.2.2
24       here      1.0.2
25       MASS     7.3.65
26     scales      1.4.0

B.2.2 Generating Package Citations

B.3 Package References

The following references correspond to the R packages listed in the table above. Citations were generated automatically using knitr::write_bib() and reflect the package versions installed at the time of writing.