7 Split-Plot and Nested Designs

Every ANOVA design considered so far has shared a common simplifying assumption: that the experimental units receiving different treatments are all of the same type, at the same level of organisation, and subject to the same sources of variation. A plant is a plant, a patient is a patient, a plot is a plot.

Biological reality is rarely so flat. Ecosystems are organised in layers: individual organisms live within plots, plots within fields, fields within regions. Clinical trials recruit patients from hospitals, hospitals from healthcare systems. Agricultural experiments apply whole-farm treatments to fields and within-field treatments to plots. In all of these cases, the experimental units exist at multiple levels of a hierarchy, and treatments are applied, and therefore replicated, at different levels of that hierarchy.

Ignoring this hierarchical structure leads directly to the pseudoreplication problem introduced in Chapter 2. Accounting for it correctly requires designs and analyses that explicitly represent the nested structure of the data. This chapter introduces the two most important such designs in biological research: the split-plot and the nested ANOVA.

7.1 Hierarchy in Biological Data: Why It Cannot Be Ignored

7.1.1 Everything Is Nested in Something

Before introducing any new design, it is worth pausing to build a clear intuition for what hierarchical structure means and why it demands special treatment.

Consider a simple example. A researcher wants to study the effect of two soil treatments (amended vs unamended) and two plant varieties (A vs B) on crop yield. She has access to four large fields. Two fields receive the soil amendment; two do not. Within each field, she establishes four plots: two planted with Variety A and two with Variety B. She measures yield from five plants within each plot.

The data have three levels of organisation:

\[\text{Field} \supset \text{Plot} \supset \text{Plant}\]

This is a hierarchy. Fields are nested within soil treatment. Plots are nested within fields. Plants are nested within plots. Each level has its own sources of variation: fields differ from each other in ways that have nothing to do with soil treatment (drainage, history, microclimate); plots within a field differ from each other in ways that have nothing to do with variety (position, local soil variation); plants within a plot differ from each other due to individual biological variation.

The critical point is that the treatment is applied and replicated at the field level, not the plot level and not the plant level. The soil amendment was not randomly assigned to individual plants: it was assigned to whole fields. The field is therefore the unit of replication for the soil treatment comparison, and there are only two replicates per treatment (\(n = 2\) fields per treatment), not \(2 \times 4 \times 5 = 40\).

7.1.2 What Goes Wrong When You Ignore the Hierarchy

If the researcher analyses this experiment by running a standard ANOVA on all \(4 \times 4 \times 5 = 80\) plant measurements, treating each plant as an independent observation, she is pseudoreplicating at two levels simultaneously. The within-field and within-plot correlation structures are ignored, and the error term used to test the soil treatment effect is the plant-level residual, which is far too small, because it does not account for the field-to-field variation that is the appropriate noise against which the treatment should be judged.

The simulation below demonstrates the inflation of the false positive rate that results:

Code

library(future)
library(furrr)
library(purrr)

plan(multisession)

set.seed(42)
n_sim <- 5000
alpha <- 0.05
n_fields <- 4      # 2 per treatment
n_plots <- 4      # per field
n_plants <- 5      # per plot
field_sd <- 3      # field-to-field variation
plot_sd <- 2      # plot-to-plot variation within field
plant_sd <- 1      # plant-to-plant variation within plot

results <- future_map_dfr(seq_len(n_sim), function(i) {
  
  treatment   <- rep(c("Control", "Amended"), each = n_fields / 2)
  field_effect <- rnorm(n_fields, 0, field_sd)
  
  df <- do.call(rbind, lapply(seq_len(n_fields), function(f) {
    do.call(rbind, lapply(seq_len(n_plots), function(p) {
      plot_effect <- rnorm(1, 0, plot_sd)
      data.frame(
        yield = rnorm(n_plants, mean = 50 + field_effect[f] + plot_effect, sd = plant_sd),
        treatment = treatment[f],
        field = factor(f),
        plot = factor(paste(f, p, sep = "_"))
      )
    }))
  }))
  
  # Wrong: treat all plants as independent
  p_wrong <- summary(
    aov(yield ~ treatment, data = df)
  )[[1]][["Pr(>F)"]][1]
  
  # Correct: use field means as the unit of analysis
  field_means <- aggregate(yield ~ treatment + field,
                           data = df, FUN = mean)
  p_correct <- summary(
    aov(yield ~ treatment, data = field_means)
  )[[1]][["Pr(>F)"]][1]
  
  data.frame(p_wrong = p_wrong, p_correct = p_correct)
  
}, .options = furrr_options(seed = TRUE))

plan(sequential)

# View the simulated data
head(results, n = 10)

        p_wrong p_correct
1  1.713975e-02 0.4803663
2  1.914645e-05 0.3266718
3  9.265631e-01 0.9875177
4  7.418683e-09 0.2185384
5  7.697033e-05 0.4926903
6  8.143981e-01 0.9355019
7  2.013291e-01 0.8170871
8  1.935473e-01 0.8389504
9  4.284297e-20 0.1088934
10 4.178010e-01 0.8922692

cat(
  "False positive rate ignoring hierarchy:",
  round(mean(results$p_wrong   < alpha), 3),
  "\nFalse positive rate correct analysis:",
  round(mean(results$p_correct < alpha), 3), "\n"
  )

False positive rate ignoring hierarchy: 0.681 
False positive rate correct analysis: 0.05

The false positive rate from the naive analysis is far above 5% (0.681), while the correct analysis, using field means as the unit of analysis for the field-level treatment, stays at the nominal level. This is the pseudoreplication problem from Chapter 2, now seen in a hierarchical context.

7.1.3 Visualising the Hierarchy

It helps to draw the structure of a hierarchical experiment explicitly before deciding how to analyse it. For the field-plot-plant example:

flowchart TD
    subgraph L1["Level 1: Experiment"]
        A[Experiment]
    end

    subgraph L2["Level 2: Field\ntreatment unit"]
        B[Control\nField 1 and 2]
        C[Amended\nField 3 and 4]
    end

    subgraph L3["\n\n\nLevel 3: Plot\nvariety unit"]
        D[Plot 1\nVariety A]
        E[Plot 2\nVariety B]
        F[Plot 3\nVariety A]
        G[Plot 4\nVariety B]
    end

    subgraph L4["\n\n\n\nLevel 4: Plant - observation"]
        H[Plants\nn obs]
        I[Plants\nn obs]
        J[Plants\nn obs]
        K[Plants\nn obs]
    end

    A --> B
    A --> C
    B --> D
    B --> E
    C --> F
    C --> G
    D --> H
    E --> I
    F --> J
    G --> K

    style A fill:#555555,color:#ffffff
    style B fill:#2E86AB,color:#ffffff
    style C fill:#E84855,color:#ffffff
    style D fill:#2E86AB,color:#ffffff
    style E fill:#2E86AB,color:#ffffff
    style F fill:#E84855,color:#ffffff
    style G fill:#E84855,color:#ffffff
    style H fill:#aaaaaa,color:#ffffff
    style I fill:#aaaaaa,color:#ffffff
    style J fill:#aaaaaa,color:#ffffff
    style K fill:#aaaaaa,color:#ffffff
    style L1 fill:#f8f8f8,stroke:#cccccc,color:#555555
    style L2 fill:#f8f8f8,stroke:#cccccc,color:#555555
    style L3 fill:#f8f8f8,stroke:#cccccc,color:#555555
    style L4 fill:#f8f8f8,stroke:#cccccc,color:#555555

Figure 7.1: Hierarchical structure of the field experiment. Soil treatment is applied at the field level and replicated across fields. Variety is applied at the plot level within each field. Plants are the observational unit at the bottom of the hierarchy.

7.1.4 The Golden Rule of Hierarchical Analysis

The rule that prevents most hierarchical analysis errors can be stated simply:

The error term used to test a treatment effect must come from the same level of the hierarchy at which that treatment was applied and randomised.

If soil amendment was randomised across fields, the field-to-field variation is the correct error term for testing the amendment effect. If variety was randomised across plots within fields, the plot-to-plot variation within fields is the correct error term for testing the variety effect. Using plant-level variation to test a field-level treatment is like using a ruler calibrated in millimetres to measure a distance that varies in kilometres, the instrument is simply measuring the wrong thing.

This rule is the conceptual foundation of both split-plot and nested ANOVA designs, which formalise it into a statistical framework.

7.2 Pseudoreplication: The Silent Epidemic in Biology

7.2.1 A Pervasive Problem

In their landmark 1984 paper, Hurlbert (1984) documented that pseudoreplication, the treatment of non-independent observations as independent replicates, was present in a substantial fraction of published ecological field experiments. Subsequent reviews have found similar rates across ecology, agriculture, and biomedical research. The problem has not gone away.

Pseudoreplication takes several distinct forms in practice, and being able to recognise each is an essential skill.

Simple pseudoreplication occurs when multiple measurements are taken from a single experimental unit and treated as independent replicates. Six measurements of soil nitrogen from the same plot do not give you six independent replicates, they give you six estimates of the nitrogen level in one plot. The plot is still \(n = 1\).

Sacrificial pseudoreplication occurs when the analysis is conducted at the wrong level of the hierarchy, as in the field example above. The name reflects the fact that the correct, but lower-powered, analysis is sacrificed in favour of an inflated, incorrect one.

Temporal pseudoreplication occurs when the same unit is measured repeatedly over time and the measurements are treated as independent. This is the repeated measures problem from Chapter 6 seen in a field ecology context: a plot measured in spring, summer, and autumn provides three measurements of the same plot, not three independent plots.

7.2.2 How to Avoid It

The antidote to pseudoreplication is to identify the experimental unit, the smallest unit that was independently assigned to a treatment, before the analysis begins, and to ensure that the unit of analysis matches the unit of replication. When subsampling within units is necessary (for precision, or because the response must be measured at a finer scale than the treatment was applied), the subsamples should be averaged to the level of the experimental unit before analysis, or the hierarchical structure should be modelled explicitly.

7.3 Split-Plot Design: Structure and Correct Error Terms

7.3.1 When Does a Split-Plot Design Arise?

A split-plot design arises naturally when two factors must be tested but one is much more difficult or expensive to change than the other. The classic agricultural example: irrigation regime (irrigated vs rainfed) must be applied to whole large fields because irrigation infrastructure cannot be changed plot by plot, while fertiliser type can be applied to much smaller individual plots within each field.

The experiment therefore applies the hard-to-change factor (whole-plot factor) to large units (whole plots), then subdivides each whole plot into smaller units (sub-plots) and applies the easy-to-change factor (sub-plot factor) to those.

This structure is not merely a practical compromise, it is a deliberate design choice that must be reflected in the analysis. The two factors are tested against different error terms, because they were randomised at different levels of the hierarchy.

7.3.2 Structure of the Split-Plot Design

With \(a\) levels of the whole-plot factor \(A\), \(b\) levels of the sub-plot factor \(B\), and \(r\) whole-plot replicates per level of \(A\):

There are \(a \times r\) whole plots in total.
Each whole plot is divided into \(b\) sub-plots.
Total observations: \(a \times r \times b\).

The two error terms are:

Whole-plot error (\(MS_{WP}\)): the variation among whole plots within the same level of \(A\). This is the correct error term for testing the main effect of \(A\).
Sub-plot error (\(MS_{SP}\)): the variation among sub-plots within the same whole plot. This is the correct error term for testing the main effect of \(B\) and the \(A \times B\) interaction.

The sub-plot error is typically smaller than the whole-plot error, which is why the sub-plot factor and the interaction are tested with more power than the whole-plot factor, a direct consequence of the design structure.

7.3.3 The Split-Plot Model

\[y_{ijk} = \mu + \alpha_i + \delta_{ij} + \beta_k + (\alpha\beta)_{ik} + \varepsilon_{ijk}\]

where \(\alpha_i\) is the whole-plot treatment effect, \(\delta_{ij}\) is the whole-plot error (variation among whole plots within treatment \(i\), the error term for \(A\)), \(\beta_k\) is the sub-plot treatment effect, \((\alpha\beta)_{ik}\) is the interaction, and \(\varepsilon_{ijk}\) is the sub-plot error (the error term for \(B\) and \(A \times B\)).

In R, the two error terms are specified explicitly using the Error() term in aov():

# whole.plot: the unit receiving the whole-plot treatment
# A: whole-plot factor
# B: sub-plot factor

fit_sp <- aov(y ~ A * B + Error(whole.plot/B), data = df)
summary(fit_sp)

7.4 Nested ANOVA: Subsampling vs True Replication

7.4.1 The Nested Structure

A nested design arises when the levels of one factor exist entirely within the levels of another, and the levels of the inner factor are not the same across levels of the outer factor. Fields contain plots, but the plots in Field 1 are different from the plots in Field 2 as they are nested within their respective fields, not crossed with them.

This is distinct from a factorial design, where every level of every factor is combined with every level of every other factor. In a factorial design, Variety A appears in both the irrigated and rainfed fields. In a nested design, the plots in Field 1 are unique to Field 1.

The nested ANOVA model for two levels of nesting:

\[y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{ijk}\]

where \(\alpha_i\) is the effect of the outer factor \(A\) at level \(i\), \(\beta_{j(i)}\) is the effect of the inner factor \(B\) at level \(j\) nested within level \(i\) of \(A\) (denoted \(j(i)\)), and \(\varepsilon_{ijk}\) is the residual.

In R:

# B is nested within A
fit_nested <- aov(y ~ A + B %in% A, data = df)
# Equivalently:
fit_nested <- aov(y ~ A/B, data = df)
summary(fit_nested)

7.4.2 Subsampling vs True Replication

The most important practical distinction in nested designs is between true replication and subsampling. True replicates are independent experimental units that each received a treatment independently. Subsamples are multiple measurements taken from the same experimental unit.

	True replicates	Subsamples
Definition	Independently treated units	Multiple measures of one unit
Example	Four separately irrigated fields	Five soil cores from one field
Contribution to \(n\)	Each counts as one replicate	Together count as one replicate
Error term	Between-unit residual	Within-unit residual

Treating subsamples as true replicates is the most common form of sacrificial pseudoreplication. The nested ANOVA correctly partitions the variation into between-unit and within-unit components and uses the appropriate error term for each.

7.5 Example 1: Grazing Effects on Plant Diversity

7.5.1 The Study

An ecologist is investigating the effect of grazing intensity (ungrazed, light grazing, heavy grazing) on plant species richness in a grassland (Section B.1.5). Six large paddocks are available, two assigned to each grazing treatment. Within each paddock, four quadrats are established and plant species richness is counted in each. Three random soil cores are taken from each quadrat for soil nutrient analysis, but species richness is the response of interest.

The structure is:

\[\text{Grazing treatment} \supset \text{Paddock} \supset \text{Quadrat}\]

Grazing treatment is the fixed factor of interest. Paddock is nested within grazing treatment and is the unit of replication for the treatment comparison. Quadrat is nested within paddock and provides subsamples within the true replicates.

# Show the simulated data
summary(grazing)

    richness         treatment paddock
 Min.   : 3.000   Ungrazed:8   U1:4   
 1st Qu.: 6.750   Light   :8   U2:4   
 Median :10.500   Heavy   :8   L1:4   
 Mean   : 9.792                L2:4   
 3rd Qu.:13.250                H1:4   
 Max.   :15.000                H2:4

7.5.2 Correct Analysis: Paddock as the Error Term

# Nested ANOVA: paddock nested within treatment
fit_eco <- aov(richness ~ treatment + Error(paddock/treatment), data = grazing)
summary(fit_eco)


Error: paddock
          Df Sum Sq Mean Sq F value Pr(>F)  
treatment  2 303.58  151.79   23.81 0.0144 *
Residuals  3  19.13    6.38                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 18  35.25   1.958

The output is split into two strata, each telling a different part of the story.

The Error: paddock stratum is where the treatment test lives. Grazing treatment has a significant effect on plant species richness (\(F_{2,3} = 23.81\), \(p = 0.014\)). The residual mean square in this stratum (6.38) estimates the paddock-to-paddock variation within treatments i.e. the genuine noise against which the treatment effect is judged, since paddocks are the units that were independently assigned to grazing treatments. This is the correct error term.

The Error: Within stratum captures the quadrat-to-quadrat variation within paddocks, with a residual mean square of 1.96. Notice that this is substantially smaller than the paddock-level residual (6.38) meaning paddocks differ from each other more than quadrats within the same paddock differ from each other. This is the typical pattern in nested designs, and it is precisely why using quadrat-level variation to test the treatment effect would be wrong: the within-paddock noise floor is too low, and testing against it would make the treatment appear far more significant than the actual replication warrants.

With only two paddocks per treatment, the degrees of freedom for the treatment test are low (\(df_{\text{treatment}} = 2\), \(df_{\text{error}} = 3\)). This correctly reflects the limited replication at the treatment level. No matter how many quadrats are sampled within each paddock, the experiment has only six independent units at the treatment level, and the analysis honestly acknowledges this. The significant result (\(p = 0.014\)) is therefore a genuine finding achieved with sparse replication, not an artefact of inflated degrees of freedom.

# Visualise
library(ggplot2)

paddock_means <- aggregate(richness ~ treatment + paddock, data = grazing, FUN = mean)

ggplot(grazing, aes(x = treatment, y = richness, colour = treatment)) +
  geom_jitter(width = 0.15, alpha = 0.4, size = 2) +
  geom_point(data = paddock_means, aes(y = richness, group = paddock), size = 4, shape = 18) +
  stat_summary(fun = mean, geom = "crossbar", width = 0.4, colour = "grey30", linewidth = 0.8) +
  scale_colour_manual(values = c("#2E86AB", "#F6AE2D", "#E84855")) +
  labs(x = "Grazing treatment", y = "Species richness", colour = NULL) +
  theme_bw() +
  theme(legend.position = "none")

The analysis uses paddock as the correct error term for testing the grazing treatment effect. With only two paddocks per treatment, the degrees of freedom for the treatment test are low (\(df = 2\) for the treatment, \(df = 3\) for the whole-plot error), correctly reflecting the fact that the experiment has limited replication at the treatment level, regardless of how many quadrats were sampled within each paddock.

7.5.3 What Happens if You Ignore the Nesting

# Wrong: treat all quadrats as independent
fit_eco_wrong <- aov(richness ~ treatment, data = grazing)
summary(fit_eco_wrong)

            Df Sum Sq Mean Sq F value   Pr(>F)    
treatment    2 303.58  151.79   58.62 2.55e-09 ***
Residuals   21  54.38    2.59                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cat(
  "\nCorrect df for treatment: 2, error df: 3\nWrong df for treatment: 2, error df:",
  df.residual(fit_eco_wrong), "\n"
  )


Correct df for treatment: 2, error df: 3
Wrong df for treatment: 2, error df: 21

The naive analysis has \(df_{\text{error}} = 21\) instead of 3, artificially inflating the power of the test by treating quadrat-level variation as if it were independent replication of the treatment.

7.6 Example 2: Irrigation and Fertiliser in a Split-Plot Design

7.6.1 The Study

An agronomist is testing two irrigation regimes (rainfed vs irrigated) and three fertiliser types (none, nitrogen, phosphorus) on wheat yield (Section B.1.6). Irrigation is applied to whole fields (it cannot be changed within a field) while fertiliser can be applied to individual plots within each field. Six fields are available, three assigned to each irrigation treatment. Each field is divided into three plots, one for each fertiliser type.

This is a split-plot design:

Whole-plot factor: irrigation (2 levels), replicated across 6 fields (3 per treatment).
Sub-plot factor: fertiliser (3 levels), applied to plots within each field.
Total observations: \(6 \times 3 = 18\) plots.

summary(irrig)

     yield           irrigation      fertiliser field 
 Min.   :31.40   Rainfed  :9    None      :6    R1:3  
 1st Qu.:36.62   Irrigated:9    Nitrogen  :6    R2:3  
 Median :43.43                  Phosphorus:6    R3:3  
 Mean   :42.98                                  I1:3  
 3rd Qu.:48.91                                  I2:3  
 Max.   :53.43                                  I3:3

7.6.2 Fitting the Split-Plot Model

# Correct split-plot analysis
# field is the whole-plot unit; fertiliser is the sub-plot factor
fit_sp <- aov(yield ~ irrigation * fertiliser + Error(field/fertiliser), data = irrig)
summary(fit_sp)


Error: field
           Df Sum Sq Mean Sq F value Pr(>F)
irrigation  1  105.6  105.58   3.647  0.129
Residuals   4  115.8   28.95               

Error: field:fertiliser
                      Df Sum Sq Mean Sq F value   Pr(>F)    
fertiliser             2  532.6  266.29  89.801 3.31e-06 ***
irrigation:fertiliser  2   26.6   13.28   4.479   0.0495 *  
Residuals              8   23.7    2.97                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output separates into two error strata, each testing a different set of effects against the appropriate level of variation.

The Error: field stratum tests the whole-plot factor, irrigation, against the field-to-field variation within irrigation treatments (\(MS_{\text{field}} = 28.95\)). The effect of irrigation is not significant (\(F_{1,4} = 3.65\), \(p = 0.129\)). This result should be interpreted carefully: the non-significance does not mean irrigation has no effect on yield (the interaction below suggests it does) but rather that the main effect of irrigation, averaged across all three fertiliser types, cannot be distinguished from field-to-field background variation with only three fields per irrigation treatment. This is the honest cost of having limited whole-plot replication. The wide confidence interval around the irrigation effect, not the \(p\)-value alone, reflects the true uncertainty here.

The Error: field:fertiliser stratum tests the sub-plot factor and the interaction, both against the plot-to-plot variation within fields (\(MS_{\text{subplot}} = 2.97\)). This residual is nearly ten times smaller than the whole-plot residual (28.95), which is the typical pattern in split-plot designs and explains why sub-plot effects are tested with considerably more power than whole-plot effects.

The main effect of fertiliser is highly significant (\(F_{2,8} = 89.80\), \(p < 0.001\)), confirming that fertiliser type has a strong effect on yield regardless of irrigation regime. More interestingly, the irrigation-by-fertiliser interaction is significant (\(F_{2,8} = 4.48\), \(p = 0.049\)), indicating that the yield advantage of different fertiliser types is not the same under rainfed and irrigated conditions. Nitrogen fertiliser, in particular, produces a larger absolute yield gain under irrigation than under rainfed conditions which is a biologically sensible result, since nitrogen uptake is enhanced by water availability.

This pattern of results is instructive for two reasons. First, it illustrates why the split-plot design is worth the added analytical complexity: had this been run as a simple two-way factorial with independent plots for all combinations, the irrigation effect and the interaction would have been tested against the same error term, concealing the different precision available for whole-plot and sub-plot comparisons. Second, it demonstrates a common outcome in agricultural split-plot experiments: the whole-plot factor (here irrigation, which is expensive to manipulate and therefore sparsely replicated) is underpowered relative to the sub-plot factor (fertiliser, which is cheap to vary and therefore well replicated). Recognising this asymmetry at the design stage, and planning accordingly, is one of the practical skills that distinguishes experienced from novice experimenters.

7.6.3 Visualising the Results

library(ggplot2)

summ_agr <- aggregate(yield ~ irrigation + fertiliser, data = irrig, FUN = mean)

ggplot(irrig, aes(x = fertiliser, y = yield, colour = irrigation, group = irrigation)) +
  geom_jitter(width = 0.08, alpha = 0.5, size = 2.5, aes(colour = irrigation)) +
  geom_line(data = summ_agr, aes(y = yield), linewidth = 1.2) +
  geom_point(data = summ_agr, aes(y = yield), size = 4) +
  scale_colour_manual(values = c("#E84855", "#2E86AB")) +
  labs(x = "Fertiliser", y = "Yield (g/plot)", colour = "Irrigation") +
  theme_bw() +
  theme(legend.position = "inside", legend.position.inside = c(0.15, 0.85))

Figure 7.2: Mean wheat yield by irrigation regime and fertiliser type in a split-plot design. Points show individual plot yields. Lines connect the means of each irrigation group across fertiliser types. The larger spread of the rainfed points reflects the greater whole-plot variability in unirrigated fields.

7.6.4 Why the Split-Plot Analysis Matters

# Correct split-plot analysis vs naive ANOVA

# Correct
cat("=== Correct split-plot analysis ===\n")

=== Correct split-plot analysis ===

print(summary(fit_sp))


Error: field
           Df Sum Sq Mean Sq F value Pr(>F)
irrigation  1  105.6  105.58   3.647  0.129
Residuals   4  115.8   28.95               

Error: field:fertiliser
                      Df Sum Sq Mean Sq F value   Pr(>F)    
fertiliser             2  532.6  266.29  89.801 3.31e-06 ***
irrigation:fertiliser  2   26.6   13.28   4.479   0.0495 *  
Residuals              8   23.7    2.97                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Wrong: treat all plots as independent — ignores field structure
cat("\n=== Naive ANOVA (ignoring field structure) ===\n")


=== Naive ANOVA (ignoring field structure) ===

summary(aov(yield ~ irrigation * fertiliser, data = irrig))

                      Df Sum Sq Mean Sq F value   Pr(>F)    
irrigation             1  105.6  105.58   9.080   0.0108 *  
fertiliser             2  532.6  266.29  22.902 8.01e-05 ***
irrigation:fertiliser  2   26.6   13.28   1.142   0.3515    
Residuals             12  139.5   11.63                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing the two analyses reveals precisely what goes wrong when the split-plot structure is ignored, and the contrast need all our attention.

The most consequential difference concerns the irrigation effect. In the correct split-plot analysis, irrigation is non-significant (\(F_{1,4} = 3.65\), \(p = 0.129\)), tested against the whole-plot residual (\(MS = 28.95\)) which is the field-to-field variation that is the genuine source of uncertainty for a treatment applied at the field level. In the naive analysis, irrigation appears significant (\(F_{1,12} = 9.08\), \(p = 0.011\)), tested against the pooled residual (\(MS = 11.63\)) which combines whole-plot and sub-plot variation indiscriminately. The naive analysis has produced a false positive: it declares irrigation significant not because the evidence supports this conclusion, but because it used an error term that is far too small for a field-level treatment. The residual degrees of freedom jump from 4 to 12, not because more information is available, but because sub-plot variation has been illegitimately pressed into service as a stand-in for whole-plot variation.

The interaction tells an equally instructive story, but in the opposite direction. In the correct analysis the irrigation-by-fertiliser interaction is significant (\(F_{2,8} = 4.48\), \(p = 0.049\)), tested against the sub-plot residual (\(MS = 2.97\)) which is the appropriate error term for an effect at the plot level. In the naive analysis the same interaction is non-significant (\(F_{2,12} = 1.14\), \(p = 0.352\)), because the pooled residual inflates the error term relative to the true sub-plot noise. Here the naive analysis produces a false negative: a real interaction that the correct analysis detects is hidden when the error terms are conflated.

The fertiliser main effect is significant in both analyses and reaches a similar F value (\(F_{2,8} = 89.80\) vs \(F_{2,12} = 22.90\)), though the naive analysis underestimates the precision available for sub-plot comparisons by using a larger pooled residual. This consistency for the sub-plot factor is not grounds for complacency: the simultaneous false positive for irrigation and false negative for the interaction show that the naive analysis gives the right answer for the right factor for the wrong reasons.

The lesson is general: in a split-plot design, using a single pooled error term simultaneously makes whole-plot tests too liberal and sub-plot tests too conservative. The only correct solution is to use the two separate error terms that reflect the two levels at which randomisation actually occurred.