14 Reproducible Analysis

Science is only as trustworthy as its reproducibility. A statistical analysis that cannot be recreated, because the code was lost, the random seed was not set, the software versions were not recorded, or the steps were not documented, is not a scientific contribution in any meaningful sense. It is an anecdote with a p-value attached.

Reproducibility has become a central concern in biology and medicine over the past decade, driven by mounting evidence that a substantial fraction of published results cannot be independently verified. The causes are multiple (underpowered studies, publication bias, p-hacking, selective reporting), but one of the most tractable is poor computational practice. An analysis that is scripted, versioned, seeded, and documented can be reproduced exactly, by anyone, at any time. This chapter shows you how to achieve that standard.

14.1 What Reproducibility Means and Why It Matters

14.1.1 Three Levels of Reproducibility

Reproducibility exists on a spectrum. It is useful to distinguish three levels, each building on the previous:

Computational reproducibility: given the same data and the same code, a different analyst on a different computer obtains identical numerical results. This is the minimum standard and the one this chapter focuses on. It requires fixed random seeds, recorded software versions, and self-contained code.
Analytical reproducibility: given the same data, a different analyst using a different but reasonable analysis obtains the same scientific conclusions. This is a stronger standard that requires transparent reporting of analytical choices: for example, why this model rather than that one, or why this correction rather than another.
Empirical reproducibility: a different research team collecting new data from the same population obtains the same results. This is the gold standard and depends on adequate sample sizes, pre-registered hypotheses, and honest reporting.

This chapter addresses computational reproducibility directly and analytical reproducibility through transparent reporting practices. Empirical reproducibility depends on all of the above plus good experimental design, which Chapter 12 covered.
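
As a minimal illustration of computational reproducibility, the sketch below draws the same random sample twice. The seed value 42 and the object names are arbitrary choices for this example; the point is only that resetting the seed before a random draw makes the draw repeatable.

```r
# Draw five standard-normal values after fixing the seed
set.seed(42)
first_run <- rnorm(5)

# Reset the seed to the same value and draw again
set.seed(42)
second_run <- rnorm(5)

# The two draws match exactly because the generator restarted from the same state
identical(first_run, second_run)
```

Without the second set.seed() call the two vectors would differ, which is exactly the situation a reader faces when a script never sets a seed at all.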

14.1.2 The Reproducibility Checklist

Before submitting any paper or sharing any analysis, verify that:

Every random number generation call has a fixed set.seed().
All software versions (R, packages) are recorded.
The code runs from top to bottom without errors on a fresh R session.
All file paths are relative, not absolute.
The data are available or can be regenerated from documented code.
All analytical choices are documented with justifications.
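
Some checklist items can be partly automated. The sketch below is one possible heuristic for the file-path item: it flags lines of code containing a quoted string that looks like an absolute path. The function name find_absolute_paths() and the regular expression are assumptions made for this example, not part of any package, and the pattern will miss some cases (for instance Windows paths written with backslashes).

```r
# Heuristic check: flag code lines containing a quoted absolute path.
# Matches strings that start with "/x", "~/", or a drive letter such as "C:/".
find_absolute_paths <- function(code_lines) {
  pattern <- "[\"'](/[A-Za-z]|~/|[A-Za-z]:/)"
  hits <- grepl(pattern, code_lines)
  data.frame(line = which(hits),
             code = code_lines[hits],
             stringsAsFactors = FALSE)
}

# Three illustrative lines of code; only the first and third should be flagged
example_code <- c(
  'x <- readRDS("/home/user/Documents/folder/data/systolic.rds")',
  'y <- readRDS(here::here("RDS", "systolic.rds"))',
  'z <- read.csv("C:/Users/me/trial.csv")'
)
flagged <- find_absolute_paths(example_code)
flagged$line
```

In practice the input would come from readLines() on each script in the project; the check is a prompt for manual review rather than a guarantee.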

14.2 Project Structure

A well-organised project directory is the foundation of reproducibility. When files are scattered across a desktop, a Downloads folder, and three different project directories, analyses cannot be reliably reproduced, even by the original author six months later.

14.2.1 The Recommended Structure for a Quarto Book

Your book already follows a good structure. For a standalone analysis project, the following layout is recommended:

my-analysis/
├── _quarto.yml          # or analysis.Rmd / analysis.qmd
├── data/
│   ├── raw/             # original data, never modified
│   └── processed/       # cleaned data produced by code
├── R/
│   └── utils.R          # shared functions
├── figures/             # generated figures
├── outputs/             # tables, model summaries
├── references.bib       # bibliography
└── README.md            # project description and instructions
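
The layout above can be created from R rather than by hand. The following is a sketch under stated assumptions: it builds the skeleton inside a temporary directory so that it is safe to run anywhere, and it creates empty placeholder files only. For a real project, project_root would be the folder you actually want to work in.

```r
# Build the project skeleton in a temporary location (assumption for the demo)
project_root <- file.path(tempdir(), "my-analysis")
subdirs <- c("data/raw", "data/processed", "R", "figures", "outputs")

for (d in subdirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# Empty placeholder files to be filled in later
placeholders <- file.path(project_root,
                          c("README.md", "references.bib", "R/utils.R"))
invisible(file.create(placeholders))

# Inspect the result
list.files(project_root, recursive = TRUE, include.dirs = TRUE)
```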

14.2.2 The `here` Package for Portable Paths

Never use absolute file paths like /home/user/Documents/folder/data/systolic.rds. These paths work only on your computer and break the moment anyone else tries to reproduce the analysis. Use the here package instead:

library(here)

# here() always resolves relative to the project root
# regardless of which subdirectory the script is run from
systolic <- readRDS(here("RDS", "systolic.rds"))

# The same approach works when sourcing scripts
source(here("R", "utils.R"))

here() finds the project root by looking for a .Rproj file, a _quarto.yml file, or a .here file. As long as one of these exists, paths constructed with here() work on any machine.

14.2.3 The `renv` Package for Package Version Control

Different versions of R packages can produce different results. renv creates a project-specific package library with recorded version numbers, so that anyone reproducing the analysis uses exactly the same package versions:

# Initialize renv in your project (run once)
# install.packages("renv")
renv::init()

# After installing or updating packages, snapshot the state
renv::snapshot()

# This creates renv.lock; commit this file to git

# Anyone reproducing the analysis runs:
renv::restore()
# This installs exactly the recorded package versions

The renv.lock file records every package name, version, and source. Commit it to your git repository alongside your code.

# Check which packages are out of sync with the lockfile
renv::status()

# Update the lockfile after adding new packages
renv::snapshot()

For more, see the renv guide from Posit and the “Introduction to renv” vignette.

14.3 Literate Programming with Quarto

The entire book you are reading was written using Quarto, which implements the principle of literate programming: code, output, and prose are woven together in a single document. This is the gold standard for reproducible analysis because the document itself is the analysis: there is no separate step of copying results from R into a Word document, which is where transcription errors enter.

14.3.1 The Anti-Pattern: Copy-Paste Reporting

The most common reproducibility failure in biological papers is the copy-paste workflow:

Run analysis in R.
Read the numbers from the console.
Type them into a Word document.
Revise the analysis.
Forget to update one of the numbers in the document.

This produces papers where the numbers in the text do not match the numbers in the tables, or where changing the dataset requires manually hunting down and updating dozens of values scattered through the manuscript.

14.3.2 The Solution: Inline Code

Every number in a Quarto document should come directly from R, not from the keyboard. This is achieved with inline code:

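For example, suppose the R objects df1, df2, f_val, p_val and om hold \(df_1\) = 2, \(df_2\) = 57, \(F\) = 15.7886084, a \(p\)-value so small that it prints as 0, and \(\omega^2\) = 0.3301868. The following R code builds the reporting string, switching to \(p < 0.001\) for very small p-values rather than printing a misleading \(p = 0.000\):

# Report very small p-values as an inequality, never as exactly zero
p_txt <- if (p_val < 0.001) "p < 0.001" else sprintf("p = %.3f", p_val)
sprintf("$F_{%d,%d} = %.2f$, $%s$, $\\omega^2 = %.2f$",
        df1, df2, f_val, p_txt, om)

which will render (in LaTeX) as:

Treatment had a significant effect on blood pressure
($F_{2,57} = 15.79$, $p < 0.001$, $\omega^2 = 0.33$).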

When the analysis changes (a new patient is added, an outlier is removed, the model specification is updated), every number in the document updates automatically on the next render. Nothing can fall out of sync because the document and the analysis are the same thing.
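
One way to keep inline reporting tidy is to wrap the formatting in a small helper and call it from the prose. The sketch below is an illustration, not a fixed recipe: the function name report_f() is invented for this example, and the p-value passed in the demonstration call is a placeholder below 0.001 rather than a value taken from the analysis. In the document text, the call would be embedded with knitr's inline syntax, for example `` `r report_f(df1, df2, f_val, p_val, om)` ``.

```r
# Format an F test for inline reporting.
# Very small p-values are reported as "p < 0.001" instead of "p = 0.000".
report_f <- function(df1, df2, f_val, p_val, om) {
  p_txt <- if (p_val < 0.001) "p < 0.001" else sprintf("p = %.3f", p_val)
  sprintf("$F_{%d,%d} = %.2f$, $%s$, $\\omega^2 = %.2f$",
          as.integer(df1), as.integer(df2), f_val, p_txt, om)
}

# Demonstration call; 1e-6 is a placeholder p-value for illustration only
reported <- report_f(2, 57, 15.7886084, 1e-6, 0.3301868)
reported
```

Because the helper takes its inputs from the fitted objects, a change to the data or model changes the reported sentence without any retyping.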

14.3.3 Chunk Options for Clean Documents

For a polished manuscript, use these chunk options to control what appears in the output:

# In _quarto.yml, apply globally
knitr:
  opts_chunk:
    echo: false       # hide code in output
    warning: false    # hide warnings
    message: false    # hide messages
    fig.align: center
    out.width: "90%"

Override locally for specific chunks:

#| echo: true
library(lme4)
physio <- readRDS(here::here("RDS", "physio.rds"))
fit <- lmer(recovery ~ treatment + (1 | hospital), data = physio)
summary(fit)

Linear mixed model fit by REML ['lmerMod']
Formula: recovery ~ treatment + (1 | hospital)
   Data: physio

REML criterion at convergence: 1054.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.01213 -0.66150  0.06456  0.63362  2.33489 

Random effects:
 Groups   Name        Variance Std.Dev.
 hospital (Intercept) 29.18    5.401   
 Residual             64.09    8.006   
Number of obs: 148, groups:  hospital, 15

Fixed effects:
                   Estimate Std. Error t value
(Intercept)          61.333      1.666  36.822
treatmentTreatment    6.090      1.320   4.615

Correlation of Fixed Effects:
            (Intr)
trtmntTrtmn -0.374

14.4 Version Control with Git

14.4.1 Why Every Analysis Needs Version Control

Version control is not just for software developers. For a statistician or data analyst, git provides:

A complete history of every change made to the analysis, with timestamps and commit messages explaining why each change was made.
The ability to revert to any previous state of the analysis, which is essential when a “simplification” turns out to be a mistake.
A backup that is automatically off-site when pushed to GitHub.
Collaboration with co-authors who can suggest changes through pull requests without overwriting each other’s work.

14.4.2 The Essential Git Workflow for an Analysis Project

# Initial setup (once per project)
git init
git add -A
git commit -m "Initial project structure"
git branch -M main                  # ensure the branch is called main
git remote add origin https://github.com/username/project.git
git push -u origin main

# Daily workflow
git status                          # what has changed?
git diff                            # what exactly changed?
git add analysis.qmd R/utils.R      # stage specific files
git commit -m "Add power analysis for nested model"
git push origin main                # backup to GitHub

14.4.3 Writing Good Commit Messages

A commit message should complete the sentence “If applied, this commit will…”:

# Bad: tells you nothing
git commit -m "update"
git commit -m "fix"
git commit -m "changes"

# Good: tells you what changed and why
git commit -m "Fix overdispersion check in GLMM chapter"
git commit -m "Add Kenward-Roger df to mixed model reporting template"
git commit -m "Rewrite blocking example with simulated field data"

14.4.4 What to Include and Exclude from Git

# .gitignore: files that should NOT be committed
# (git treats # as a comment only at the start of a line)
*.Rhistory
.RData
.Rproj.user/
# Quarto cache (large, reproducible)
*_cache/
# Quarto intermediate files
*_files/
# built book (if using CI/CD)
docs/
# renv package library (large, reproducible)
renv/library/
*.log
*.aux

# Files that SHOULD be committed (a checklist, not .gitignore entries):
#   *.qmd         source documents
#   *.R           R scripts
#   *.bib         bibliography
#   renv.lock     package versions
#   _quarto.yml   project configuration
#   utils.R       shared functions
#   RDS/          saved datasets (if small enough)

For an in-depth explanation of version control, git, and its use with R, see Happy Git and GitHub for the useR by Jenny Bryan.

14.5 Pre-Registration and the Open Science Framework

Pre-registration is the practice of specifying your hypotheses, design, and analysis plan before collecting data, and depositing that specification in a public registry. It is one of the most effective interventions against p-hacking and HARKing (Hypothesising After the Results are Known).

14.5.1 What to Pre-Register

A complete pre-registration for an ANOVA-based study includes:

# Pre-registration: [Study Title]
## Hypotheses
- Primary: [State H0 and H1 precisely]
- Secondary: [List any secondary hypotheses]

## Design
- Design type: [one-way CRD / factorial / split-plot / etc.]
- Sample size and justification: [power analysis results]
- Randomisation procedure: [how treatments will be assigned]
- Blinding: [who is blinded to treatment allocation]

## Primary Analysis
- Model formula: `lmer(response ~ treatment + (1 | site), data = df)`
- Degrees of freedom method: Kenward-Roger
- Significance threshold: alpha = 0.05
- Effect size: partial omega-squared

## Multiple Comparisons
- Post-hoc test: Tukey HSD
- Planned contrasts (if any): [specify exactly]

## Assumption Checks
- Independence: verified by design [describe]
- Homoscedasticity: Levene's test + residual plots
- Normality: Q-Q plot + Shapiro-Wilk

## Deviations from Plan
- Any deviation from this plan will be reported explicitly
  in the paper with justification

14.5.2 Registering on the Open Science Framework

The Open Science Framework (OSF, https://osf.io) provides free pre-registration for biological and medical research:

Create an account at https://osf.io
Create a new project for your study
Click Registrations => New Registration
Choose a registration template (AsPredicted or OSF Standard)
Complete the form and submit; this creates a time-stamped, publicly accessible record

The OSF also hosts data, code, and materials, making it possible to create a single URL that links to everything needed to reproduce a study.

14.6 Sharing Data and Code

14.6.1 Data Sharing

Raw data should be shared in an open, non-proprietary format:

# Save data in multiple formats for maximum accessibility

# CSV: universally readable
write.csv(systolic, "data/systolic.csv", row.names = FALSE)

# RDS: preserves R factor levels and attributes
saveRDS(systolic, "data/systolic.rds")

# Arrow/Parquet: efficient for large datasets
# install.packages("arrow")
arrow::write_parquet(systolic, "data/systolic.parquet")
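
The comment about factor levels can be made concrete with a round trip. The sketch below uses a three-row stand-in data frame invented for this example (it is not the book's systolic dataset) and writes to temporary files. It shows that the level order survives the RDS round trip, whereas reading the CSV back as factors yields alphabetical levels.

```r
# Small stand-in data frame with a deliberately non-alphabetical level order
systolic_demo <- data.frame(
  bp    = c(120, 135, 150),
  group = factor(c("Placebo", "Low dose", "High dose"),
                 levels = c("Placebo", "Low dose", "High dose"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(systolic_demo, csv_path, row.names = FALSE)
saveRDS(systolic_demo, rds_path)

# CSV stores only text, so the original level order is lost on reading
from_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
levels(from_csv$group)

# RDS stores the R object itself, so the level order is preserved
from_rds <- readRDS(rds_path)
levels(from_rds$group)
```

This is one reason to share a data dictionary alongside a CSV: the intended reference level has to be documented somewhere, because the file format cannot carry it.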

Always include a data dictionary: a plain-text or Markdown file describing every variable.

# Data Dictionary: systolic.csv

## Variables

| Variable | Type | Unit | Description | Values |
|----------|------|------|-------------|--------|
| bp | numeric | mmHg | Systolic blood pressure | 100-200 |
| group | factor | - | Treatment group | Placebo, Low dose, High dose |

## Notes
- Data simulated with set.seed(42)
- 20 patients per group, N = 60 total
- See Chapter 3 for full simulation code
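
A data dictionary is easier to keep complete if its skeleton is generated from the data. The following is a sketch, assuming a small stand-in data frame invented for this example: it lists every variable with its R class and leaves the unit and description columns blank to be filled in by hand.

```r
# Stand-in data frame for illustration (not the book's dataset)
systolic_demo <- data.frame(
  bp    = c(118, 142, 131),
  group = factor(c("Placebo", "Low dose", "High dose"))
)

# One row per variable; Unit and Description are completed manually afterwards
dictionary <- data.frame(
  Variable    = names(systolic_demo),
  Type        = unname(vapply(systolic_demo, function(x) class(x)[1],
                              character(1))),
  Unit        = "",
  Description = "",
  stringsAsFactors = FALSE
)
dictionary
```

The resulting table can be written out with write.csv() or pasted into a Markdown file next to the data.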

14.6.2 Code Sharing

Share your analysis code as a self-contained script or Quarto document that runs from top to bottom on a fresh R session:

# At the top of every shared analysis script
# 1. Record the R version
cat("R version:", R.version$version.string, "\n")

# 2. Record package versions
packages_used <- c("lme4", "lmerTest", "glmmTMB", "emmeans", "effectsize", "car")
pkg_versions  <- sapply(packages_used, function(p) as.character(packageVersion(p)))
print(data.frame(Package = packages_used,
                 Version = pkg_versions,
                 row.names = NULL))

# 3. Set the random seed
set.seed(42)

# 4. Use here() for all file paths
library(here)

14.6.3 Session Information

Always include a session information block at the end of every analysis document. In Quarto, add this as the last chunk:

sessionInfo()

R version 4.6.1 (2026-06-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Africa/Abidjan
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lme4_2.0-1       Matrix_1.7-5     effectsize_1.0.2

loaded via a namespace (and not attached):
 [1] jsonlite_2.0.0     compiler_4.6.1     Rcpp_1.1.1-1.1     splines_4.6.1     
 [5] boot_1.3-32        yaml_2.3.12        fastmap_1.2.0      lattice_0.22-9    
 [9] coda_0.19-4.1      TH.data_1.1-5      knitr_1.51         rbibutils_2.4.1   
[13] htmlwidgets_1.6.4  MASS_7.3-65        nloptr_2.2.1       insight_1.5.1     
[17] minqa_1.2.8        rlang_1.2.0        multcomp_1.4-30    xfun_0.59         
[21] parameters_0.29.1  datawizard_1.3.1   otel_0.2.0         estimability_1.5.1
[25] cli_3.6.6          Rdpack_2.6.6       emmeans_2.0.3      digest_0.6.39     
[29] grid_4.6.1         rstudioapi_0.19.0  mvtnorm_1.4-1      xtable_1.8-4      
[33] sandwich_3.1-1     nlme_3.1-169       reformulas_0.4.4   evaluate_1.0.5    
[37] codetools_0.2-20   zoo_1.8-15         survival_3.8-6     bayestestR_0.18.1 
[41] rmarkdown_2.31     tools_4.6.1        htmltools_0.5.9

Or use the more compact sessioninfo package:

# install.packages("sessioninfo")
sessioninfo::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.6.1 (2026-06-24)
 os       Ubuntu 24.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Africa/Abidjan
 date     2026-06-29
 pandoc   3.8.3 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
 quarto   1.5.56 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version   date (UTC) lib source
 bayestestR     0.18.1    2026-05-24 [1] CRAN (R 4.6.0)
 boot           1.3-32    2025-08-29 [3] RSPM (R 4.5.0)
 cli            3.6.6     2026-04-09 [3] CRAN (R 4.5.3)
 coda           0.19-4.1  2024-01-31 [3] RSPM (R 4.4.0)
 codetools      0.2-20    2024-03-31 [3] RSPM (R 4.4.0)
 datawizard     1.3.1     2026-04-26 [1] CRAN (R 4.6.0)
 digest         0.6.39    2025-11-19 [3] CRAN (R 4.5.2)
 effectsize   * 1.0.2     2026-03-11 [1] CRAN (R 4.6.0)
 emmeans        2.0.3     2026-04-09 [1] CRAN (R 4.6.0)
 estimability   1.5.1     2024-05-12 [1] CRAN (R 4.6.0)
 evaluate       1.0.5     2025-08-27 [3] RSPM (R 4.5.0)
 fastmap        1.2.0     2024-05-15 [3] RSPM (R 4.4.2)
 htmltools      0.5.9     2025-12-04 [3] CRAN (R 4.5.2)
 htmlwidgets    1.6.4     2023-12-06 [3] CRAN (R 4.3.2)
 insight        1.5.1     2026-05-21 [1] CRAN (R 4.6.0)
 jsonlite       2.0.0     2025-03-27 [3] RSPM (R 4.4.0)
 knitr          1.51      2025-12-20 [3] CRAN (R 4.5.2)
 lattice        0.22-9    2026-02-09 [3] CRAN (R 4.5.2)
 lme4         * 2.0-1     2026-03-05 [3] CRAN (R 4.5.2)
 MASS           7.3-65    2025-02-28 [3] RSPM (R 4.4.0)
 Matrix       * 1.7-5     2026-03-21 [3] CRAN (R 4.5.3)
 minqa          1.2.8     2024-08-17 [3] RSPM (R 4.4.0)
 multcomp       1.4-30    2026-03-09 [3] CRAN (R 4.5.2)
 mvtnorm        1.4-1     2026-06-06 [3] CRAN (R 4.6.0)
 nlme           3.1-169   2026-03-27 [4] CRAN (R 4.5.3)
 nloptr         2.2.1     2025-03-17 [3] RSPM (R 4.4.0)
 otel           0.2.0     2025-08-29 [3] RSPM (R 4.5.0)
 parameters     0.29.1    2026-05-24 [1] CRAN (R 4.6.0)
 rbibutils      2.4.1     2026-01-21 [3] CRAN (R 4.5.2)
 Rcpp           1.1.1-1.1 2026-04-24 [3] CRAN (R 4.5.3)
 Rdpack         2.6.6     2026-02-08 [3] CRAN (R 4.5.2)
 reformulas     0.4.4     2026-02-02 [3] CRAN (R 4.5.2)
 rlang          1.2.0     2026-04-06 [3] CRAN (R 4.6.0)
 rmarkdown      2.31      2026-03-26 [3] CRAN (R 4.5.3)
 rstudioapi     0.19.0    2026-06-11 [3] CRAN (R 4.6.0)
 sandwich       3.1-1     2024-09-15 [3] RSPM (R 4.4.0)
 sessioninfo    1.2.4     2026-06-04 [3] CRAN (R 4.6.0)
 survival       3.8-6     2026-01-16 [3] CRAN (R 4.5.2)
 TH.data        1.1-5     2025-11-17 [3] CRAN (R 4.5.2)
 xfun           0.59      2026-06-19 [3] CRAN (R 4.6.0)
 xtable         1.8-4     2019-04-21 [3] CRAN (R 4.0.1)
 yaml           2.3.12    2025-12-10 [3] CRAN (R 4.6.0)
 zoo            1.8-15    2025-12-15 [3] CRAN (R 4.5.2)

 [1] /home/ediman/R/x86_64-pc-linux-gnu-library/4.6
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library
 * ── Packages attached to the search path.

──────────────────────────────────────────────────────────────────────────────

The session information records R version, operating system, and every loaded package with its version number. A reader trying to reproduce your analysis can use this to identify any version discrepancies that might explain different results.
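
The session information can also be saved as a plain-text file to ship with the supplementary materials. The sketch below writes to a temporary directory so it can be run anywhere; in a real project the target would more plausibly sit under the outputs/ folder, which is an assumption about your layout.

```r
# Capture the printed session information and write it to a text file
session_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# Read back the first few lines to confirm the file was written
head(readLines(session_file), 3)
```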

14.7 A Reproducibility Workflow for Biological Papers

Putting everything together, here is the complete workflow for a reproducible ANOVA-based biological paper from design to publication:

14.7.1 Before Data Collection

# 1. Pre-register the study on OSF
# 2. Conduct power analysis and record it
library(simr)
library(pwr)

# Document the power analysis
power_analysis <- list(
  method = "simulation via simr",
  effect_size = "5 recovery points",
  between_sd = 3,
  residual_sd = 8,
  target_power = 0.80,
  alpha = 0.05,
  n_simulations = 1000,
  result = "n = 10 hospitals x 10 patients = 100 total",
  date = Sys.Date()
)
saveRDS(power_analysis, here::here("outputs", "power_analysis.rds"))

# 3. Write the analysis script before seeing any data
# 4. Initialize git and renv
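
A quick cross-check of the recorded numbers is possible with base R. The sketch below is deliberately crude: it treats the design as a simple two-group comparison, uses only the residual SD of 8 and the effect of 5 recovery points from the record above, and ignores the hospital-level clustering entirely. It is a sanity check on the order of magnitude, not a substitute for the simulation-based power analysis.

```r
# Two-sample t-test power calculation using the effect size and residual SD
# recorded above; clustering by hospital is ignored (simplifying assumption)
quick_check <- power.t.test(delta = 5, sd = 8,
                            sig.level = 0.05, power = 0.80)

# Required patients per treatment arm under this simplified model
n_per_arm <- ceiling(quick_check$n)
n_per_arm
```

If this rough figure and the simulated result disagree wildly, that is a signal to recheck the inputs before data collection starts.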

14.7.2 During Data Collection

# Commit the pre-registration and analysis plan before the first data arrive
git add pre-registration.md analysis-plan.qmd
git commit -m "Add pre-registration and analysis plan before data collection"
git push origin main

# Once pushed, the commit timestamp is evidence that the plan predates the data

14.7.3 After Data Collection

# Always import raw data without modification
raw_data <- read.csv(here::here("data", "raw", "trial_data.csv"))

# Document any data cleaning steps explicitly
cleaned_data <- raw_data |>
  # Remove patients who withdrew consent
  dplyr::filter(consent == "yes") |>
  # Convert treatment to factor with reference level
  dplyr::mutate(
    treatment = factor(treatment,
                       levels = c("Control", "Treatment"))
  )

# Record exclusions
cat("Original N:", nrow(raw_data), "\n")
cat("After exclusions:", nrow(cleaned_data), "\n")
cat("Excluded:", nrow(raw_data) - nrow(cleaned_data), "\n")

# Save cleaned data
saveRDS(cleaned_data, here::here("data", "processed", "cleaned_data.rds"))
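
Exclusion counts are easier to report if they are stored as data rather than only printed. The sketch below uses a six-row data frame invented purely for illustration and base R subsetting; the object names raw_demo, cleaned_demo and exclusion_log are hypothetical.

```r
# Hypothetical raw data, invented for this example
raw_demo <- data.frame(
  id       = 1:6,
  consent  = c("yes", "yes", "no", "yes", "no", "yes"),
  recovery = c(61, 70, 58, 66, 59, 72)
)

# Apply the exclusion rule
cleaned_demo <- raw_demo[raw_demo$consent == "yes", ]

# Record the exclusion as a one-row table that can be saved or printed
exclusion_log <- data.frame(
  step       = "withdrew consent",
  n_before   = nrow(raw_demo),
  n_after    = nrow(cleaned_demo),
  n_excluded = nrow(raw_demo) - nrow(cleaned_demo)
)
exclusion_log
```

With several exclusion rules, one row per rule can be appended with rbind(), giving a table that maps directly onto a participant flow diagram.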

14.7.4 Analysis

# Run the pre-specified primary analysis exactly as planned
fit_primary <- lmerTest::lmer(
  recovery ~ treatment + (1 | hospital),
  data = cleaned_data,
  REML = TRUE
)

# Any deviation from the pre-registered analysis must be flagged
# as exploratory and reported as such

14.7.5 Reporting Deviations

If the analysis deviates from the pre-registered plan (a different model was needed, an additional covariate was added, the primary outcome was changed), document this explicitly:

::: {.callout-important}
## Deviation from Pre-Registered Analysis

The pre-registered analysis specified a Poisson GLMM for the count outcome. Inspection of the data revealed substantial overdispersion (ratio = 4.2), so a negative binomial GLMM was fitted instead. This decision was made before examining the treatment effects and is consistent with the pre-registered assumption check procedure. The Poisson model results are reported in Supplementary Table S1 for comparison.
:::

14.7.6 Final Checks Before Submission

# 1. Restart R and run the entire analysis from scratch
# Session > Restart R and Run All Chunks (in RStudio)

# 2. Verify the renv lockfile is up to date
renv::status()
renv::snapshot()

# 3. Check that all file paths work
# (no absolute paths, all here() calls resolve correctly)

# 4. Render the final document
# quarto render --to html
# quarto render --to pdf

# 5. Commit and tag the final version (run in the terminal, not in R)
git add -A
git commit -m "Final analysis for submission to [journal]"
git tag -a "submission-v1" -m "Version submitted to journal"
git push origin main --tags

The git tag creates a permanent named marker in the repository history, so you can always return to exactly the version that was submitted, even after subsequent revisions.

14.8 Tools and Resources

14.8.1 Essential Tools

| Tool | Purpose | Where to get it |
|------|---------|-----------------|
| Quarto | Literate programming | quarto.org |
| git | Version control | git-scm.com |
| GitHub | Remote repository | github.com |
| `renv` | Package version management | CRAN |
| `here` | Portable file paths | CRAN |
| OSF | Pre-registration and data sharing | osf.io |
| Zenodo | Permanent data archiving with DOI | zenodo.org |

14.8.2 Getting a DOI for Your Code and Data

For published papers, code and data should have a permanent identifier (DOI) that can be cited. Zenodo integrates directly with GitHub:

Go to https://zenodo.org and log in with your GitHub account.
Enable the repository under GitHub => Enabled Repositories.
Create a release on GitHub; Zenodo then automatically archives it and assigns a DOI.
Cite the DOI in your paper’s Data Availability statement.

14.8.3 Checklist for a Reproducible Paper

## Reproducibility Checklist

### Before submission
- [ ] set.seed() in every chunk that uses random numbers
- [ ] All file paths use here()
- [ ] renv.lock committed to repository
- [ ] Analysis runs from scratch in a fresh R session
- [ ] Session information included in supplementary materials
- [ ] All deviations from pre-registration documented
- [ ] Data dictionary provided for all datasets
- [ ] Code deposited on GitHub/Zenodo with DOI

### In the Methods section
- [ ] R version stated
- [ ] All package names and versions cited
- [ ] Full model formula reported
- [ ] Random seed stated if simulation was used
- [ ] Pre-registration URL cited (if applicable)
- [ ] Data availability statement included

### In the Results section
- [ ] All test statistics with degrees of freedom
- [ ] Exact p-values (not just < 0.05)
- [ ] Effect sizes with confidence intervals
- [ ] Sample sizes at every relevant level
- [ ] Post-hoc method named and justified

14.9 A Note on Artificial Intelligence Tools

Large language models and AI coding assistants are increasingly used in statistical analysis, for writing code, checking syntax, and explaining output. Their use is not inherently problematic, but it introduces specific reproducibility risks worth being explicit about.

AI-generated code should be treated as a first draft, not a final analysis. AI tools can produce plausible-looking but incorrect code, choose the wrong model for the data structure, or suggest analyses that are inconsistent with the pre-registered plan. Every line of AI-generated code should be understood, verified, and tested before being included in a published analysis.
The analyst remains responsible for the analysis. Citing an AI tool as the source of an analytical decision, “the model was selected by ChatGPT”, is not a valid justification in a scientific paper. The analyst must understand and defend every choice.
Disclose AI tool use in accordance with the target journal’s policy. Many journals now require explicit disclosure of AI assistance in methods or acknowledgements sections.

The most defensible position is to use AI tools for syntax help and code explanation, the same role that Stack Overflow has played for well over a decade, while making all analytical decisions (model choice, assumption checks, interpretation) independently and documenting the reasoning explicitly.

14.10 Closing Remarks

Reproducibility is not a constraint on scientific creativity; it is the condition that makes scientific creativity trustworthy. An elegant analysis that cannot be reproduced is worthless; a pedestrian analysis that is fully documented, versioned, and shared is a permanent contribution to knowledge.

The tools and practices covered in this chapter (Quarto, git, renv, here, pre-registration, open data) require an upfront investment that pays compound returns. The first time you return to a project after six months and find that everything still runs correctly, or the first time a reviewer asks for a minor revision and you can update every number in the paper by changing two lines of code and re-rendering, the value of these practices becomes immediately apparent.

The book began with Fisher at Rothamsted, separating signal from noise in agricultural field trials. A century later, the tools have changed beyond recognition, but the core obligation has not: to be honest about what the data show, transparent about how they were analysed, and open enough that others can verify the work. Reproducibility is how statistics keeps its promises.