Version Control
Reproducibility
Sources of Error
~/myanalysis $: ls -a
plots/
analysis.py
data.csv
data_ed.csv
final_analysis.ipynb
final_analysis_v2.ipynb
new_analysis.ipynb
new_new_analysis.ipynb
rawdata.csv
rawdata_ed_final.csv
really_final_analysis_v2.ipynb
submitted_final_analysis.ipynb
“Rather than manually saving different versions, let’s just track what changed between checkpoints”
“Rather than storing every single change, let’s let the user ‘commit’ all of the changes when they reach a meaningful checkpoint”
“Let’s store those changes as changes, rather than storing copies of the files, wherever we can.”
“Let’s create a hidden folder (dotfile) which contains everything you’d need to reconstruct the data at each commit, so users can always revert to an older version”
~/myanalysis $: ls -a
.git/
plots/
analysis.ipynb
rawdata.csv
git is the best version control system in common use
Written in 2005 by Linus Torvalds (yes, the ‘Linux’ Linus) to control source code
Free and open source software, available for most systems
This is the basis of http://github.com
Stores all the changes in a repository (‘repo’), which in practice is just your folder plus a hidden .git/ directory inside it
It nearly completely cleared the field of other version control programs, because it works
git workflow
git init a new repo, or git clone an existing one
git add changed files, git mv to move files or folders, git rm to delete them
git commit your changes with an -m message
git log shows all the commits in a repository
git revert undoes all changes in existing commit(s)
git checkout updates the ‘working copy’ to match a certain commit
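Putting those together, a minimal session might look like this (the file names come from the example folder above; the commit message is just an illustration):
~/myanalysis $: git init                                # start a new repo in this folder
~/myanalysis $: git add analysis.ipynb rawdata.csv      # stage the files you want tracked
~/myanalysis $: git commit -m "First working version"   # checkpoint them, with a message
~/myanalysis $: git log                                 # see the commit you just made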
Open your terminal in datahub and cd into CSSBootCamp
Now run git log
The long commit names you’ll see are cryptographic hashes, used as checksums
They turn large amounts of data into (effectively) unique strings
The commit name is the hash of the commit information
commit 5b0ab0a6229d4267ea6ce87a9981e9ffa48d134f
Author: Will Styler <will@savethevowels.org>
Date: Fri Aug 19 12:47:41 2022 -0700
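Because the commit name is its hash, you can point git at a commit using any unambiguous prefix of that hash. A quick example using the commit above:
~/CSSBootCamp $: git show 5b0ab0a      # show the metadata and changes for that commit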
git at one point used ‘master’ as the name of the default branch of the repo
This has been changed to main by default on GitHub and in most newer setups
git pull origin master is now git pull origin main
You’ll need to update some tutorials mentally
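If you have an older local repo of your own that still uses ‘master’, one way to bring it in line is to rename the branch (just a sketch; a remote’s default branch, e.g. on GitHub, has to be renamed separately in its settings):
~/myanalysis $: git branch -m master main      # rename your local 'master' branch to 'main'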
You can fetch, pull, and push data to and from other repos and/or GitHub
You can merge two repos together, even with different commit states
git will find any conflicts and let the user resolve them
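A sketch of that collaboration loop, assuming the usual remote name ‘origin’ and branch ‘main’ (the commit message is made up):
~/myanalysis $: git pull origin main               # fetch and merge your collaborators' commits
~/myanalysis $: git add analysis.ipynb             # stage your own changes
~/myanalysis $: git commit -m "Update the plots"   # commit them locally
~/myanalysis $: git push origin main               # send your commits back to the shared repo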
Multiple people can commit to the same repo
It is used by some of the biggest software projects out there
It has a function to do damned near anything
We’re barely scratching the surface
… but this is enough to teach you that version control is important
Easy* access to prior versions of analyses, plots, etc
Easy* ability to compare prior versions to the current one
Ability to track the timing and content of changes
Ability to revert changes, or identify differences
Ability to collaborate on a codebase without clobbering each others’ changes
You store all the changes to all the files, not just the changes that ‘matter’
Some binary types of content (e.g. pictures, audio) don’t allow for ‘deltas’ and take up more room
You need to remember to commit changes at meaningful times
git is not the most user-friendly software
Jupyter Notebooks don’t play super well with git, and changes are hard to interpret
You’ll have a plan in case you make mistakes and need to go back
You can collaborate more easily on large analyses
You can retrieve things from the distant past
It’s much more space efficient than copying the folder periodically
When you know exactly what you’ve done, it’s easier to do…
Replication vs. Reanalysis
“Can I re-run this experiment and get the same results?” == Replication
“Can I start from their data and arrive at the same conclusions wrt their hypothesis?” == Reproducible analysis
Similar experimental methods
Similar circumstances and subjects (or not!)
Similar tools (or not!)
Similar analysis (at least in part!)
The Data
The Tools
The Code
Which data?
What has happened between initial collection and the start of analysis?
Can the data actually be released in the necessary form for reproduction?
How are the data being posted and released?
What kind of software was used to do the analysis?
Is the software publicly available? Is it free?
What versions of the software were used? On what platform?
Are there other degrees of freedom?
What dependencies and libraries were used?
Is there any code missing (e.g. pre- or post-processing files)?
Can it run on other computers? Are important elements hard-coded?
Is it literate, well explained, and commented?
Are there set seeds allowing us to do exactly what they did?
Are all the hyperparameters fixed and explicit?
Is the version of the code they provide the same as yielded the ‘final product’ or paper?
“Oh no, my computer shut down”
“Oh no I closed that tab”
“New laptop who dis?”
“We finally got reviews back over two years later, now we need to compute D’…”
Saves people suffering
Allows your time and effort to do more good
It makes for better science!
That Will doesn’t do ever, at all
Nope, not even once. Never.
Why is this a problem?
How does this happen?
How can we avoid this?
‘Oh, I’ll just set C to 0.1 in this cell for testing and then delete the cell’, then forgetting that that’d happened
Set variables, explicitly, and if it doesn’t matter, set to default
Write reasonable filenames, and change them before you run them
Re-run the whole analysis before you stop
final_analysis_2_final_final_review_final.ipynb
Sanity check your code
Write tests (‘This should be the same as that’)
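Even a one-line check run from the terminal can catch this kind of thing early (this assumes pandas is installed; the expected row count of 240 is purely hypothetical, so use whatever invariant your data should satisfy):
~/myanalysis $: python -c "import pandas as pd; assert len(pd.read_csv('rawdata.csv')) == 240"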
Each time you work on the data, you should run the script to go from your rawest data to your final analysis
You should be able to reproduce your analysis exactly from the raw data with just one set of commands (see the sketch below)
Do all of your subject dropping, column adding, merging, and data-fixing within the script
The distance between ‘Raw data’ and ‘Journal-ready analysis’ should be ‘Reset and Run All Cells’ and going for a cup of coffee
Generate and save the plots each time your scripts run
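In terminal terms, the ideal is one command that rebuilds everything from the raw file (analysis.py and what it writes are hypothetical here; ‘Restart and Run All Cells’ on a single notebook plays the same role):
~/myanalysis $: python analysis.py     # reads rawdata.csv, does all the dropping/merging/fixing, and re-saves plots/ and the final stats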
You should be wary of storing intermediate representations of your data/analysis
Any stored representations should be reproducible, from scratch, with a single click!
Be very careful with .rdata files, pickles, and other saved state
Any time you make changes to the data or early parts of the analysis, re-do your stored work
‘Do you want to save your workspace?’ ‘Hell no!’
You can shut down/stop your analysis at any time and lose nothing
Your work isn’t affected by lost computers (because you use a 3-2-1 backup strategy)
You always know every single thing which happened to your data since collection
You have one data version, one analysis version, and one output version
If you messed up early in the process, you change only that one line and just re-run
Computationally inefficient
Requires discipline
You’ll end up with a lot of code
New runs with changes clobber your old analysis/plots
Use version control
Reproducibility is generally a good thing
Error creeps into your analysis in many ways
Stateless analysis is the best approach if you can make it happen!