Version Control
Reproducibility
Sources of Error
~/myanalysis $: ls -a
plots/
analysis.py
data.csv
data_ed.csv
final_analysis.ipynb
final_analysis_v2.ipynb
new_analysis.ipynb
new_new_analysis.ipynb
rawdata.csv
rawdata_ed_final.csv
really_final_analysis_v2.ipynb
submitted_final_analysis.ipynb
“Rather than manually saving different versions, let’s just track what changed between checkpoints”
“Rather than storing every single change, let’s let the user ‘commit’ all of the changes when they reach a meaningful checkpoint”
“Let’s store those changes as changes, rather than storing copies of the files, wherever we can.”
“Let’s create a hidden folder (dotfile) which contains everything you’d need to reconstruct the data at each commit, so users can always revert to an older version”
~/myanalysis $: ls -a
.git/
plots/
analysis.ipynb
rawdata.csv
git is the best version control system in common use
Written in 2005 by Linus Torvalds (yes, the ‘Linux’ Linus) to control source code
Free and open source software, available for most systems
This is the basis of http://github.com
Stores all the changes in a repository (‘repo’), which in practice is just your folder plus a hidden .git/ directory inside it
It nearly completely cleared the field of other version control programs, because it works
git workflow
git init a new repo, or git clone an existing one
git add changed files, git mv to move files or folders, git rm to delete them
git commit your changes with an -m message
git log shows all the commits in a repository
git revert undoes all changes in existing commit(s)
git checkout updates the ‘working copy’ to match a certain commit
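Putting those together, a minimal session might look like this (the file names come from the example folder above; the commit message is just an illustration):
~/myanalysis $: git init                                # start a new repo in this folder
~/myanalysis $: git add analysis.ipynb rawdata.csv      # stage the files you want tracked
~/myanalysis $: git commit -m "First working version"   # checkpoint them, with a message
~/myanalysis $: git log                                 # see the commit you just made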
Open your terminal in datahub and cd into CSSBootCamp
Now run git log
The long commit names you’ll see are cryptographic hashes, used as checksums
They turn large amounts of data into (effectively) unique strings
The commit name is the hash of the commit information
commit 5b0ab0a6229d4267ea6ce87a9981e9ffa48d134f
Author: Will Styler <will@savethevowels.org>
Date: Fri Aug 19 12:47:41 2022 -0700
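Because the commit name is its hash, you can point git at a commit using any unambiguous prefix of that hash. A quick example using the commit above:
~/CSSBootCamp $: git show 5b0ab0a      # show the metadata and changes for that commit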
git at one point used ‘master’ as the name of the default branch of the repo
This has been changed to main by default on GitHub and in most newer setups
git pull origin master is now git pull origin main
You’ll need to update some tutorials mentally
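If you have an older local repo of your own that still uses ‘master’, one way to bring it in line is to rename the branch (just a sketch; a remote’s default branch, e.g. on GitHub, has to be renamed separately in its settings):
~/myanalysis $: git branch -m master main      # rename your local 'master' branch to 'main'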
You can fetch, pull, and push data to and from other repos and/or GitHub
You can merge two repos together, even with different commit states
git will find any conflicts and let the user resolve them
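A sketch of that collaboration loop, assuming the usual remote name ‘origin’ and branch ‘main’ (the commit message is made up):
~/myanalysis $: git pull origin main               # fetch and merge your collaborators' commits
~/myanalysis $: git add analysis.ipynb             # stage your own changes
~/myanalysis $: git commit -m "Update the plots"   # commit them locally
~/myanalysis $: git push origin main               # send your commits back to the shared repo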
Multiple people can commit to the same repo
It is used by some of the biggest software projects out there
It has a function to do damned near anything
We’re barely scratching the surface
… but this is enough to teach you that version control is important
Easy* access to prior versions of analyses, plots, etc
Easy* ability to compare prior versions to the current one
Ability to track the timing and content of changes
Ability to revert changes, or identify differences
Ability to collaborate on a codebase without clobbering each others’ changes
You store all the changes to all the files, not just the changes that ‘matter’
Some binary types of content (e.g. pictures, audio) don’t allow for ‘deltas’ and take up more room
You need to remember to commit changes at meaningful times
git is not the most user-friendly software
Jupyter Notebooks don’t play super well with git, and changes are hard to interpret
You’ll have a plan in case you make mistakes and need to go back
You can collaborate more easily on large analyses
You can retrieve things from the distant past
It’s much more space efficient than copying the folder periodically
When you know exactly what you’ve done, it’s easier to do…
Replication vs. Reanalysis
“Can I re-run this experiment and get the same results?” == Replication
“Can I start from their data and arrive at the same conclusions wrt their hypothesis?” == Reproducible analysis
Similar experimental methods
Similar circumstances and subjects (or not!)
Similar tools (or not!)
Similar analysis (at least in part!)
The Data
The Tools
The Code
Which data?
What has happened between initial collection and the start of analysis?
Can the data actually be released in the necessary form for reproduction?
How are the data being posted and released?
What kind of software was used to do the analysis?
Is the software publicly available? Is it free?
What versions of the software were used? On what platform?
Are there other degrees of freedom?
What dependencies and libraries were used?
Is there any code missing (e.g. pre- or post-processing files)?
Can it run on other computers? Are important elements hard-coded?
Is it literate, well explained, and commented?
Are there set seeds allowing us to do exactly what they did?
Are all the hyperparameters fixed and explicit?
Is the version of the code they provide the same as yielded the ‘final product’ or paper?
“Oh no, my computer shut down”
“Oh no I closed that tab”
“New laptop who dis?”
“We finally got reviews back over two years later, now we need to compute D’…”
Saves people suffering
Allows your time and effort to do more good
It makes for better science!
That Will doesn’t do ever, at all
Nope, not even once. Never.
Why is this a problem?
How does this happen?
How can we avoid this?
‘Oh, I’ll just set C to 0.1 in this cell for testing and then delete the cell’, then forgetting that that’d happened
Set variables, explicitly, and if it doesn’t matter, set to default
Write reasonable filenames, and change them before you run them
Re-run the whole analysis before you stop
final_analysis_2_final_final_review_final.ipynb
Sanity check your code
Write tests (‘This should be the same as that’)
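Even a one-line check run from the terminal can catch this kind of thing early (this assumes pandas is installed; the expected row count of 240 is purely hypothetical, so use whatever invariant your data should satisfy):
~/myanalysis $: python -c "import pandas as pd; assert len(pd.read_csv('rawdata.csv')) == 240"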
Each time you work on the data, you should run the script to go from your rawest data to your final analysis
You should be able to reproduce your analysis exactly from the raw data with just one set of commands (see the sketch below)
Do all of your subject dropping, column adding, merging, and data-fixing within the script
The distance between ‘Raw data’ and ‘Journal-ready analysis’ should be ‘Reset and Run All Cells’ and going for a cup of coffee
Generate and save the plots each time your scripts run
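In terminal terms, the ideal is one command that rebuilds everything from the raw file (analysis.py and what it writes are hypothetical here; ‘Restart and Run All Cells’ on a single notebook plays the same role):
~/myanalysis $: python analysis.py     # reads rawdata.csv, does all the dropping/merging/fixing, and re-saves plots/ and the final stats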
You should be wary of storing intermediate representations of your data/analysis
Any stored representations should be reproducible, from scratch, with a single click!
Be very careful with .rdata files, pickles, and other saved state
Any time you make changes to the data or early parts of the analysis, re-do your stored work
‘Do you want to save your workspace?’ ‘Hell no!’
You can shut down/stop your analysis at any time and lose nothing
Your work isn’t affected by lost computers (because you use a 3-2-1 backup strategy)
You always know every single thing which happened to your data since collection
You have one data version, one analysis version, and one output version
If you messed up early in the process, you change only that one line and just re-run
Computationally inefficient
Requires discipline
You’ll end up with a lot of code
New runs with changes clobber your old analysis/plots
Use version control
Reproducibility is generally a good thing
Error creeps into your analysis in many ways
Stateless analysis is the best approach if you can make it happen!