Version Control, Reproducibility, and Sources of Error

Will Styler - CSS Bootcamp


Today’s Plan


Version Control


~/myanalysis $: ls -a
plots/
analysis.py
data.csv
data_ed.csv
final_analysis.ipynb
final_analysis_v2.ipynb
new_analysis.ipynb
new_new_analysis.ipynb
rawdata.csv
rawdata_ed_final.csv
really_final_analysis_v2.ipynb
submitted_final_analysis.ipynb

Version Control in a nutshell


~/myanalysis $: ls -a
.git/
plots/
analysis.ipynb
rawdata.csv

Git is the best version control system in common use


The git workflow


Let’s look at a git repo


A note on git commit names

commit 5b0ab0a6229d4267ea6ce87a9981e9ffa48d134f
Author: Will Styler <will@savethevowels.org>
Date:   Fri Aug 19 12:47:41 2022 -0700
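
A 40-character name like the one above is a SHA-1 hash of the object's contents (in repositories using git's default object format), so identical content always gets the same name. As a rough sketch of the idea, the snippet below computes the name git gives to a file's contents (a "blob"). The function name `git_blob_hash` is made up for illustration, and commits are hashed the same general way but with a different header and body.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 name git assigns to a blob with these contents."""
    # git hashes a short header ("blob <size>\0") followed by the raw bytes
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical file contents, for illustration only
print(git_blob_hash(b"participant,rt\n1,350\n"))
```

The result should match what `git hash-object` reports for a file with the same bytes.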

A note on git/github nomenclature


You can also work with other repos


Git is ridiculously deep


Benefits of Version Control


Downsides of Version Control


Important Note - Version Control is not backup!!


Version Control is not backup!


Use version control


Reproducible Research and Analysis


Reproducible Analysis vs. Replicable Research


Why do replication and reproducibility matter?


What do you need to replicate an experiment?


What do you need to reproduce an analysis?


The Data


The Tools


The Code


YOU are the primary person reproducing your research


Allowing others to reproduce your research is gravy!


What are the downsides of making your research open and reproducible?


Common Sources of Error


For each of these sources of error ask…


Changes in package versions/functionality
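
One possible guard here is to record the versions you actually ran with, next to the results. The sketch below uses the standard library's `importlib.metadata`; the helper name `record_versions` and the package list are assumptions, so swap in whatever your analysis imports.

```python
import json
import platform
from importlib import metadata

def record_versions(packages):
    """Collect the Python version and installed versions of the named packages."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions

# Hypothetical package list; replace with the packages your analysis uses
print(json.dumps(record_versions(["numpy", "pandas"]), indent=2))
```

Saving this output alongside the plots gives future-you a starting point when results stop matching.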


Human data correction without logging the correction
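
One way to keep hand corrections visible is to make them code rather than edits to the data file. This is a minimal sketch, assuming rows keyed by an `id` column; the `CORRECTIONS` list, its fields, and `apply_corrections` are all hypothetical names.

```python
import copy

# Hypothetical corrections log: each entry records what changed and why
CORRECTIONS = [
    {"id": "p03", "column": "age", "value": "34",
     "reason": "typo in intake form (was 344)"},
]

def apply_corrections(rows, corrections):
    """Return corrected copies of rows; the raw rows are left untouched."""
    fixed = copy.deepcopy(rows)
    by_id = {row["id"]: row for row in fixed}
    for c in corrections:
        by_id[c["id"]][c["column"]] = c["value"]
    return fixed

# Made-up raw data for illustration
raw = [{"id": "p03", "age": "344"}, {"id": "p04", "age": "29"}]
clean = apply_corrections(raw, CORRECTIONS)
```

The raw data stays as collected, and the corrections (with reasons) live in version control with the rest of the analysis.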


Manually running code or setting variables outside of the script


Only editing some elements of your copy-pasted code


Copy-pasting the same code five times, seeing an error, and fixing all four
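
Both copy-paste problems tend to shrink when the repeated step lives in one function that gets called for each case. A minimal sketch, with made-up condition names and values:

```python
from statistics import mean

def summarize(values):
    """One definition of the summary, so a fix here applies everywhere."""
    return {"n": len(values), "mean": round(mean(values), 2)}

# Hypothetical conditions and measurements
conditions = {
    "control": [1.0, 2.0, 3.0],
    "treatment": [2.0, 4.0, 6.0],
}

# Same code path for every condition; no pasted copies to keep in sync
summaries = {name: summarize(values) for name, values in conditions.items()}
```

Fixing a bug in `summarize` fixes it for every condition at once, including the fifth.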


Saving your plots/data with the wrong filename


Re-making plots ‘for publication’ which don’t show the same things as the original plots


Running the wrong version of the analysis


Reused variable names


Forgetting to subset data where needed


Dumb join errors
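
Many join errors come from duplicated or missing keys that silently multiply or drop rows. As a sketch of the defensive idea, here is a plain-Python one-to-one join that fails loudly instead; `join_one_to_one` and the example tables are hypothetical.

```python
def join_one_to_one(left, right, key):
    """Join two lists of dicts on key, raising on duplicate or missing keys."""
    right_by_key = {}
    for row in right:
        if row[key] in right_by_key:
            raise ValueError(f"duplicate key in right table: {row[key]!r}")
        right_by_key[row[key]] = row
    seen = set()
    joined = []
    for row in left:
        k = row[key]
        if k in seen:
            raise ValueError(f"duplicate key in left table: {k!r}")
        if k not in right_by_key:
            raise ValueError(f"key missing from right table: {k!r}")
        seen.add(k)
        joined.append({**row, **right_by_key[k]})
    return joined

# Hypothetical participant and score tables
people = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
scores = [{"id": 1, "score": 10}, {"id": 2, "score": 12}]
merged = join_one_to_one(people, scores, "id")
```

If you work in pandas, the `validate` argument to `merge` (for example `validate="one_to_one"`) performs a similar check.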


Dumb math errors


Dumb code errors


Re-using old versions of data in new analyses


Stateless Analysis


Will believes that your analyses should be ‘stateless’


Don’t store intermediate steps!
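
One way to read this advice, sketched under assumptions: every run starts from the raw data and recomputes each step, rather than loading a saved, half-processed file. The function names and the inline stand-in for `rawdata.csv` are hypothetical.

```python
import csv
import io

# Stand-in for rawdata.csv so the sketch is self-contained
RAW_CSV = "id,rt\n1,350\n2,\n3,410\n"

def load_raw(text):
    """Parse the raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Drop rows with missing reaction times and convert the rest to numbers."""
    return [{"id": r["id"], "rt": float(r["rt"])} for r in rows if r["rt"]]

def analyze(rows):
    """Mean reaction time across the cleaned rows."""
    return sum(r["rt"] for r in rows) / len(rows)

def run_analysis(text):
    """Raw data in, result out; nothing is written to disk in between."""
    return analyze(clean(load_raw(text)))

result = run_analysis(RAW_CSV)
```

Because nothing is saved between steps, there is no stale `data_ed.csv` to accidentally load next month.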


Benefits of Stateless Analysis


Downsides of Stateless Analysis


Key Takeaways


Playing with Version Control