CSS Bootcamp - Version Control, Reproducibility, and Sources of Error

# Version Control, Reproducibility, and Sources of Error

### Will Styler - CSS Bootcamp

---

### Today's Plan

- Version Control

- Reproducibility

- Sources of Error

---

## Version Control

---

```bash
~/myanalysis $: ls -a
plots/
analysis.py
data.csv
data_ed.csv
final_analysis.ipynb
final_analysis_v2.ipynb
new_analysis.ipynb
new_new_analysis.ipynb
rawdata.csv
rawdata_ed_final.csv
really_final_analysis_v2.ipynb
submitted_final_analysis.ipynb
```

---

### Version Control in a nutshell

- "Rather than manually saving different versions, let's just track what changed between checkpoints"

- "Rather than storing *every single change*, let's let the user 'commit' all of the changes when they reach a meaningful checkpoint"

- "Let's store those changes as changes, rather than storing copies of the files, wherever we can."

- "Let's create a hidden folder (`.git/`) which contains everything you'd need to reconstruct the data at each commit, so users can always revert to an older version"

---

```bash
~/myanalysis $: ls -a
.git/
plots/
analysis.ipynb
rawdata.csv
```
---

### `git` is the best version control system in common use

- Written in 2005 by Linus Torvalds (yes, the 'Linux' Linus) to manage the Linux kernel's source code

- Free and open source software, available for most systems

- This is the basis of <http://github.com>

- Stores all the changes in a repository ('repo'), which is just a folder with a hidden `.git/` directory inside it

- It nearly completely cleared the field of other version control programs, because it *works*

---

### The `git` workflow

- `git init` a new repo, or `git clone` an existing one

- `git add` changed files, `git mv` to move or rename files, `git rm` to delete

- `git commit` your changes with an `-m` message

- `git log` shows all the commits in a repository

- `git revert` makes a new commit which undoes the changes from existing commit(s)

- `git checkout` updates the 'working copy' to match a certain commit

---

### Let's look at a git repo

- Open your terminal in datahub and `cd` into `CSSBootCamp`

- Now `git log`

---

### A note on git commit names

- These are cryptographic hashes (SHA-1 by default), used as checksums

- They turn any amount of data into short strings which are, for practical purposes, unique

- The commit name is the hash of the commit information

```
commit 5b0ab0a6229d4267ea6ce87a9981e9ffa48d134f
Author: Will Styler <will@savethevowels.org>
Date:   Fri Aug 19 12:47:41 2022 -0700
```
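
As a rough sketch of what 'the hash of the commit information' means: git prepends a short header (the object type and its length in bytes) to an object's content and takes the SHA-1 of the result. The function name `git_object_hash` is made up for illustration, and this assumes a repo using git's default SHA-1 object format.

```python
import hashlib

def git_object_hash(obj_type: str, content: bytes) -> str:
    """Hash content the way git names its objects (assumes SHA-1 repos)."""
    # git hashes a header like b"blob 6\0" followed by the raw content
    header = f"{obj_type} {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# An empty file gets the same name in every SHA-1 git repo
print(git_object_hash("blob", b""))

# Changing even one byte gives a completely different name
print(git_object_hash("blob", b"x = 1\n"))
print(git_object_hash("blob", b"x = 2\n"))
```

A commit is hashed the same way, with type `commit` and content made up of the tree, parent commit(s), author, date, and message, which is why changing any of those changes the commit name.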

---

### A note on git/github nomenclature

- `git` traditionally used `master` as the name of the default branch

- GitHub and many other tools now use `main` by default

- `git pull origin master` is now `git pull origin main`

- You'll need to update some tutorials mentally

---

### You can also work with other repos

- You can `fetch`, `pull` and `push` data to and from other repos and/or github

- You can `merge` work from two branches or repos together, even when their commit histories have diverged

- `git` will find any conflicts and let the user resolve them

- Multiple people can commit to the same repo

---

### Git is ridiculously deep

- It is used by some of the biggest software projects out there

- It has a function to do damned near anything

- We're barely scratching the surface

- ... but this is enough to teach you that version control is important

- ... and to give you basic abilities to work with it

---

### Benefits of Version Control

- Easy* access to prior versions of analyses, plots, etc

- Easy* ability to compare prior versions to the current one

- Ability to track the timing and content of changes

- Ability to revert changes, or identify differences

- Ability to collaborate on a codebase without clobbering each other's changes

---

### Downsides of Version Control

- You store all the changes to all the files, not just the changes that 'matter'

- Some binary types of content (e.g. pictures, audio) don't produce useful 'deltas' and take up more room

- You need to remember to commit changes at meaningful times

- git is not the most user-friendly software

- github desktop is a nice GUI for it

- Jupyter Notebooks don't play super well with `git` and changes are hard to interpret

- It's reliable and you can always pull back a notebook from an earlier commit, but seeing what changed isn't as easy.

---

### Important Note - Version Control is not backup!!

- Why?

---

### Version Control is not backup!

---

### Use version control

- You'll have a plan in case you make mistakes and need to go back

- You can collaborate more easily on large analyses

- You can retrieve things from the distant past

- It's much more space efficient than copying the folder periodically

- ... and there's only one 'myanalysis.py' floating around

- When you know exactly what you've done, it's easier to do...

---

## Reproducible Research and Analysis

---

### Reproducible analysis vs. Replicable Research

- Replication vs. Reanalysis

- "Can I re-run this experiment and get the same results?"  == Replication

- "Can I start from their data and arrive at the same conclusions wrt their hypothesis?" == Reproducible analysis

---

### Why do replication and reproducibility matter?

---

### What do you need to replicate an experiment?

- Similar experimental methods

- Similar circumstances and subjects (or not!)

- Similar tools (or not!)

- Similar analysis (at least in part!)

---

### What do you need to reproduce an analysis?

- The Data

- The Tools

- The Code

---

### The Data

- Which data?

- What has happened between initial collection and the start of analysis?

- Was data removed? Redacted? Fabricated? Filled in?

- Can the data actually be released in the necessary form for reproduction?

- Participant privacy? Proprietary data? Size concerns? Ongoing work?

- How are the data being posted and released?

- Where? What format? How durable?

---

### The Tools

- What kinds of software were used to do the analysis?

- Is the software publicly available?  Is it free?

- What versions of the software were used?  On what platform?

- Are there other degrees of freedom?

---

### The Code

- What dependencies and libraries were used?

- Is there any code missing (e.g. pre- or post-processing scripts)?

- Can it run on other computers?  Are important elements hard-coded?

- Is it literate, well explained, and commented?

- Are there set seeds allowing us to do *exactly* what they did?

- Are all the hyperparameters fixed and explicit?

- Is the version of the code they provide the same one that yielded the 'final product' or paper?
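
A minimal sketch of what a set seed buys you, using only Python's standard library. The seed value, the data, and the `bootstrap_mean` helper are all made up for illustration.

```python
import random
import statistics

SEED = 2022  # stated explicitly so anyone can rerun this exactly

def bootstrap_mean(values, n_resamples, seed):
    """Average of resampled means, using a private, seeded generator."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=len(values))
        means.append(statistics.mean(resample))
    return statistics.mean(means)

reaction_times = [512, 480, 530, 610, 495, 505]  # made-up data

first_run = bootstrap_mean(reaction_times, n_resamples=1000, seed=SEED)
second_run = bootstrap_mean(reaction_times, n_resamples=1000, seed=SEED)
print(first_run == second_run)  # same seed, same answer
```

Passing the seed in as an argument, rather than seeding once globally, keeps the result from depending on whatever other random calls happened earlier in the session.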

---

### YOU are the primary person reproducing your research

- "Oh no, my computer shut down"

- "Oh no I closed that tab"

- "New laptop who dis?"

- "We finally got reviews back over two years later, now we need to compute D'..."

---

### Allowing others to reproduce your research is gravy!

- Saves people suffering

- Allows your time and effort to do more good

- It makes for better science!

---

### What are the downsides of making your research open and reproducible?

---

## Common Sources of Error

- That Will doesn't ever make, at all

- *Nope, not even once. Never.*

---

### For each of these sources of error ask...

- Why is this a problem?

- How does this happen?

- How can we avoid this?

---

### Changes in package versions/functionality

- Take note of versions and use an environment
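
One way to 'take note of versions' is to have the analysis itself print them on every run. This is a sketch using the standard library's `importlib.metadata`; the `describe_environment` helper and the package names are just examples.

```python
import platform
import sys
from importlib import metadata

def describe_environment(packages):
    """Collect the Python, OS, and package versions used for this run."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info

# Package names here are just examples; list whatever your analysis imports
for key, value in describe_environment(["numpy", "pandas"]).items():
    print(f"{key}: {value}")
```

Saving this output next to your results, or running `pip freeze > requirements.txt` inside your environment, gives future-you something to rebuild from.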

---

### Human data correction *without logging the correction*

- Do your corrections and outlier removal in the analysis script
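
A small sketch of outlier removal that logs itself. The data, the thresholds, and the reasons given for them are invented for illustration; the point is that the rule lives in the script and the dropped rows are reported.

```python
# Made-up trial data: (subject, reaction time in ms)
raw_trials = [
    ("s01", 512), ("s01", 498), ("s02", 530),
    ("s02", 12), ("s03", 505), ("s03", 9800),
]

# Every exclusion rule lives here, in the script, with its reason
MIN_RT_MS = 100    # faster than this is treated as an anticipation error
MAX_RT_MS = 3000   # slower than this is treated as inattention

def remove_outliers(trials, low, high):
    """Split trials into kept and dropped, so nothing vanishes silently."""
    kept = [t for t in trials if low <= t[1] <= high]
    dropped = [t for t in trials if not (low <= t[1] <= high)]
    return kept, dropped

clean_trials, dropped_trials = remove_outliers(raw_trials, MIN_RT_MS, MAX_RT_MS)
print(f"kept {len(clean_trials)} of {len(raw_trials)} trials")
for subject, rt in dropped_trials:
    print(f"dropped {subject}: {rt} ms")
```

The raw data is never edited, so changing a threshold later is a one-line change followed by a re-run.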

---

### Manually running code or setting variables outside of the script

- 'Oh, I'll just set C to 0.1 in this cell for testing and then delete the cell', and then forgetting that you ever did

- Set variables explicitly in the script, and if the value doesn't matter, set it to the default

---

### Only editing some elements of your copypasted code

- Paste in filler code which can't run, rather than chunks from earlier code

---

### Copy-pasting the same code five times, seeing an error, and fixing all four

- Write functions and call them multiple times
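
A tiny illustration, with made-up data and a hypothetical `summarize` helper: the logic exists in exactly one place, so a fix there reaches every condition.

```python
import statistics

def summarize(label, values):
    """Return one summary line; fix a bug here and every call is fixed."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return f"{label}: mean={mean:.1f}, sd={sd:.1f}, n={len(values)}"

# Made-up measurements for three conditions
conditions = {
    "control": [10.1, 9.8, 10.4, 10.0],
    "low_dose": [11.2, 10.9, 11.5, 11.1],
    "high_dose": [12.8, 13.1, 12.5, 13.0],
}

for label, values in conditions.items():
    print(summarize(label, values))
```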

---

### Saving your plots/data with the wrong filename

- Write reasonable filenames, and change them before you run the code that saves them

- Re-run the whole analysis before you stop

---

### Re-making plots 'for publication' which don't show the same things as the original plots

- Compare plots that should be the same and make sure they are

---

### Running the wrong version of the analysis

- Use version control, not `final_analysis_2_final_final_review_final.ipynb`

---

### Reused variable names

- Change variable names where you can

---

### Forgetting to subset data where needed

- Ask yourself about subsets, and double check

---

### Dumb join errors

- Check `data.shape` before and after merging dataframes
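
A sketch of the shape check, assuming `pandas` is installed. The dataframes are made up, and `checked_merge` is a hypothetical helper, not a pandas function. Here a duplicated key in one table quietly adds a row to the merged result.

```python
import pandas as pd

subjects = pd.DataFrame({"subject": ["s1", "s2", "s3"], "age": [21, 34, 28]})
# "s2" appears twice by mistake
scores = pd.DataFrame({"subject": ["s1", "s2", "s2"], "score": [0.9, 0.7, 0.8]})

def checked_merge(left, right, on, how="left"):
    """Merge two dataframes and report the shape before and after."""
    merged = left.merge(right, on=on, how=how)
    print(f"left {left.shape}, right {right.shape} -> merged {merged.shape}")
    return merged

merged = checked_merge(subjects, scores, on="subject")

# pandas can also refuse the merge outright if the keys aren't unique
try:
    subjects.merge(scores, on="subject", how="left", validate="one_to_one")
except pd.errors.MergeError as err:
    print("merge refused:", err)
```

Three subjects went in and four rows came out, which is exactly the kind of thing that is easy to miss without looking at the shape.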

---

### Dumb math errors

- Sanity check your math

---

### Dumb code errors

- Sanity check your code

- Write tests ('This should be the same as that')
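
A sketch of a 'this should be the same as that' test using plain `assert` statements. The `z_scores` function is a stand-in for whatever piece of your analysis you want to check.

```python
import math
import statistics

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def test_z_scores():
    scores = z_scores([2.0, 4.0, 6.0, 8.0])
    # 'This should be the same as that'
    assert len(scores) == 4
    assert math.isclose(statistics.mean(scores), 0.0, abs_tol=1e-9)
    assert math.isclose(statistics.pstdev(scores), 1.0, abs_tol=1e-9)
    # Order should be preserved: the smallest input stays the smallest
    assert scores == sorted(scores)

test_z_scores()
print("all checks passed")
```

Checks like these can sit right in the notebook, or go in a separate file for a test runner such as `pytest` to pick up.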

---

### Re-using old versions of data in new analyses

---

## Stateless Analysis

---

### Will believes that your analyses should be 'stateless'

- Each time you work on the data, you should run the script to go from your rawest data to your final analysis

- You should be able to reproduce your analysis exactly from the raw data with just one set of commands

- Do all of your subject dropping, column adding, merging, and data-fixing within the script

- The distance between 'Raw data' and 'Journal-ready analysis' should be 'Reset and Run All Cells' and going for a cup of coffee

- Generate and save the plots each time your scripts run
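
Here is a toy version of that idea, using only the standard library. The file names, column names, the excluded subject, and the `run_analysis` function are all hypothetical; a real analysis would have more steps, but the shape is the same: raw data in, saved outputs out, one call.

```python
import csv
import statistics
import tempfile
from pathlib import Path

EXCLUDED_SUBJECTS = {"s03"}  # hypothetical: dropped for failing a screening task

def run_analysis(raw_path, out_dir):
    """Go from the raw CSV to saved results in one call, with no manual steps."""
    with open(raw_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # All subject dropping and column fixing happens here, never by hand
    rows = [r for r in rows if r["subject"] not in EXCLUDED_SUBJECTS]
    scores = [float(r["score"]) for r in rows]

    result = {"n": len(scores), "mean_score": statistics.mean(scores)}

    # Outputs are regenerated on every run
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "summary.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(result))
        writer.writeheader()
        writer.writerow(result)
    return result

# Demo with a throwaway raw file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "rawdata.csv"
    raw.write_text("subject,score\ns01,0.5\ns02,1.0\ns03,9.0\n")
    demo_result = run_analysis(raw, Path(tmp) / "output")
    print(demo_result)
```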

---

### Don't store intermediate steps!

- You should be wary of storing intermediate representations of your data/analysis

- Only when it's wildly computationally expensive

- Any stored representations should be reproducible, from scratch, with a single click!

- 'if this file exists, skip this portion of the code, else, run it to recreate'

- Be very careful with `.RData` files, pickles, and other saved workspaces

- Any time you make changes to the data or early parts of the analysis, re-do your stored work

- 'Do you want to save your workspace?'  '**Hell no!**'
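
The 'if this file exists, skip, else recreate' pattern can be sketched as below. `expensive_model_fit` is a stand-in for a genuinely slow step, and `load_or_compute` and the file name are hypothetical. The demo also shows how a stored result goes stale when the data changes.

```python
import json
import tempfile
from pathlib import Path

def expensive_model_fit(data):
    """Stand-in for a slow computation (hypothetical)."""
    return {"total": sum(data), "n": len(data)}

def load_or_compute(cache_path, data, force=False):
    """Reuse the stored result if it exists; otherwise recreate it from scratch."""
    cache_path = Path(cache_path)
    if cache_path.exists() and not force:
        return json.loads(cache_path.read_text())
    result = expensive_model_fit(data)
    cache_path.write_text(json.dumps(result))
    return result

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "model_fit.json"
    first = load_or_compute(cache, [1, 2, 3])          # computes and stores
    second = load_or_compute(cache, [1, 2, 3])         # loads from disk
    stale = load_or_compute(cache, [1, 2, 3, 4])       # data changed, cache did not!
    fresh = load_or_compute(cache, [1, 2, 3, 4], force=True)  # re-do stored work
    print(first == second, stale == first, fresh)
```

The `stale` result is the trap: the file existed, so the new data was ignored. Deleting the cache (or forcing a recompute) whenever anything upstream changes is what keeps this safe.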

---

### Benefits of Stateless Analysis

- You can shut down/stop your analysis at any time and lose nothing

- Your work isn't affected by lost computers (because you use a 3-2-1 backup strategy)

- You always know *every single thing which happened to your data since collection*

- You have *one data version*, *one analysis version*, and *one output version*

- If you messed up **early** in the process, you change *only that one line* and just re-run

---

### Downsides of Stateless Analysis

- Computationally inefficient

- You'll run the same models *over and over again* each time you start

- Requires discipline

- "I can't just go into Excel and fix this"

- You'll end up with a lot of code

- ... but no more than you would've needed anyways

- New runs with changes clobber your old analysis/plots

- Version control!

---

### Key Takeaways

- Use version control

- Reproducibility is generally a good thing

- Error creeps into your analysis in many ways

- Stateless analysis is the best approach if you can make it happen!

---

### Playing with Version Control