# Version Control, Reproducibility, and Sources of Error

### Will Styler - CSS Bootcamp

---

### Today's Plan

- Version Control
- Reproducibility
- Sources of Error

---

## Version Control

---

```bash
~/myanalysis $: ls -a
plots/  analysis.py  data.csv  data_ed.csv
final_analysis.ipynb  final_analysis_v2.ipynb
new_analysis.ipynb  new_new_analysis.ipynb
rawdata.csv  rawdata_ed_final.csv
really_final_analysis_v2.ipynb  submitted_final_analysis.ipynb
```

---

### Version Control in a nutshell

- "Rather than manually saving different versions, let's just track what changed between checkpoints"
- "Rather than storing *every single change*, let's let the user 'commit' all of the changes when they reach a meaningful checkpoint"
- "Let's store those changes as changes, rather than storing copies of the files, wherever we can."
- "Let's create a hidden folder (dotfile) which contains everything you'd need to reconstruct the data at each commit, so users can always revert to an older version"

---

```bash
~/myanalysis $: ls -a
.git/  plots/  analysis.ipynb  rawdata.csv
```

---

### `git` is the best version control system in common use

- Written in 2005 by Linus Torvalds (yes, the 'Linux' Linus) to control source code
- Free and open source software, available for most systems
- This is the basis of
- Stores all the changes in a repository ('repo'), also known as a 'folder'
- It nearly completely cleared the field of other version control programs, because it *works*

---

### The `git` workflow

- `git init` a new repo, or `git clone` an existing one
- `git add` changed files, `git mv` to move or rename files and folders, `git rm` to delete
- `git commit` your changes with an `-m` message
- `git log` shows all the commits in a repository
- `git revert` adds a new commit which undoes all changes in existing commit(s)
- `git checkout` updates the 'working copy' to match a certain commit

---

### Let's look at a git repo

- Open your terminal in datahub and `cd` into `CSSBootCamp`
- Now `git log`

---

### A note on git commit names

- These are cryptographic hashes (SHA-1 by default), used as checksums
- They turn any amount of data into a short, effectively unique string
- The commit name is the hash of the commit information

```
commit 5b0ab0a6229d4267ea6ce87a9981e9ffa48d134f
Author: Will Styler
Date: Fri Aug 19 12:47:41 2022 -0700
```

---

### A note on git/github nomenclature

- `git` at one point used 'master' as the name of the default branch
- GitHub and most newly created repos now use `main` by default
- `git pull origin master` is now `git pull origin main`
- You'll need to update some tutorials mentally

---

### You can also work with other repos

- You can `fetch`, `pull` and `push` data to and from other repos and/or github
- You can `merge` two repos together, even with different commit states
- `git` will find any conflicts and let the user resolve them
- Multiple people can commit to the same repo

---

### Git is ridiculously deep

- It is used by some of the biggest software projects out there
- It has a function to do damned near anything
- We're barely scratching the surface
- ... but this is enough to teach you that version control is important
- ... and to give you basic abilities to work with it

---

### Benefits of Version Control

- Easy* access to prior versions of analyses, plots, etc
- Easy* ability to compare prior versions to the current one
- Ability to track the timing and content of changes
- Ability to revert changes, or identify differences
- Ability to collaborate on a codebase without clobbering each other's changes

---

### Downsides of Version Control

- You store all the changes to all the files, not just the changes that 'matter'
- Some binary types of content (e.g. pictures, audio) don't allow for 'deltas' and take up more room
- You need to remember to commit changes at meaningful times
- git is not the most user-friendly software
    - GitHub Desktop is a nice GUI for it
- Jupyter Notebooks don't play super well with `git` and changes are hard to interpret
    - It's reliable and you can always pull back a notebook from an earlier commit, but seeing what changed isn't as easy.

---

### Important Note

- Version Control is not backup!!
- Why?

---

### Version Control is not backup!
---

### Use version control

- You'll have a plan in case you make mistakes and need to go back
- You can collaborate more easily on large analyses
- You can retrieve things from the distant past
- It's much more space efficient than copying the folder periodically
    - ... and there's only one 'myanalysis.py' floating around
- When you know exactly what you've done, it's easier to do...

---

## Reproducible Research and Analysis

---

### Reproducible analysis vs. Replicable Research

- Replication vs. Reanalysis
- "Can I re-run this experiment and get the same results?" == Replication
- "Can I start from their data and arrive at the same conclusions wrt their hypothesis?" == Reproducible analysis

---

### Why do replication and reproducibility matter?

---

### What do you need to replicate an experiment?

- Similar experimental methods
- Similar circumstances and subjects (or not!)
- Similar tools (or not!)
- Similar analysis (at least in part!)

---

### What do you need to reproduce an analysis?

- The Data
- The Tools
- The Code

---

### The Data

- Which data?
    - What has happened between initial collection and the start of analysis?
    - Were data removed? Redacted? Fabricated? Filled in?
- Can the data actually be released in the necessary form for reproduction?
    - Participant privacy? Proprietary data? Size concerns? Ongoing work?
- How are the data being posted and released?
    - Where? What format? How durable?

---

### The Tools

- What kinds of software were used to do the analysis?
- Is the software publicly available? Is it free?
- What versions of the software were used? On what platform?
- Are there other degrees of freedom?

---

### The Code

- What dependencies and libraries were used?
- Is there any code missing (e.g. pre-/post-processing files)?
- Can it run on other computers? Are important elements hard-coded?
- Is it literate, well explained, and commented?
- Are there set seeds allowing us to do *exactly* what they did?
- Are all the hyperparameters fixed and explicit?
- Is the version of the code they provide the same one that yielded the 'final product' or paper?

---

### YOU are the primary person reproducing your research

- "Oh no, my computer shut down"
- "Oh no I closed that tab"
- "New laptop who dis?"
- "We finally got reviews back over two years later, now we need to compute D'..."

---

### Allowing others to reproduce your research is gravy!

- Saves people suffering
- Allows your time and effort to do more good
- It makes for better science!

---

### What are the downsides of making your research open and reproducible?

---

## Common Sources of Error

- That Will doesn't do ever, at all
    - *Nope, not even once. Never.*

---

### For each of these sources of error ask...

- Why is this a problem?
- How does this happen?
- How can we avoid this?

---

### Changes in package versions/functionality

- Take note of versions and use an environment

---

### Human data correction *without logging the correction*

- Do your corrections and outlier removal in the analysis script

---

### Manually running code or setting variables outside of the script

- 'Oh, I'll just set C to 0.1 in this cell for testing and then delete the cell', then forgetting that that'd happened
- Set variables explicitly, and if a value doesn't matter, set it to the default

---

### Only editing some elements of your copypasted code

- Paste in filler code which can't run, rather than chunks from earlier code

---

### Copy-pasting the same code five times, seeing an error, and fixing all four

- Write functions and call them multiple times

---

### Saving your plots/data with the wrong filename

- Write reasonable filenames, and update them before you re-run the code
- Re-run the whole analysis before you stop

---

### Re-making plots 'for publication' which don't show the same things as the original plots

- Compare plots that should be the same and make sure they are

---

### Running the wrong version of the analysis

- Use version control, not `final_analysis_2_final_final_review_final.ipynb`

---

###
Reused variable names

- Use new, distinct variable names where you can

---

### Forgetting to subset data where needed

- Ask yourself about subsets, and double check

---

### Dumb join errors

- Check `data.shape` before and after merging dataframes

---

### Dumb math errors

- Sanity check your math

---

### Dumb code errors

- Sanity check your code
- Write tests ('This should be the same as that')

---

### Re-using old versions of data in new analyses

---

## Stateless Analysis

---

### Will believes that your analyses should be 'stateless'

- Each time you work on the data, you should run the script to go from your rawest data to your final analysis
- You should be able to reproduce your analysis exactly from the raw data with just one set of commands
- Do all of your subject dropping, column adding, merging, and data-fixing within the script
- The distance between 'Raw data' and 'Journal-ready analysis' should be 'Reset and Run All Cells' and going for a cup of coffee
- Generate and save the plots each time your scripts run

---

### Don't store intermediate steps!

- You should be wary of storing intermediate representations of your data/analysis
    - Only when it's wildly computationally expensive
- Any stored representations should be reproducible, from scratch, with a single click!
    - 'if this file exists, skip this portion of the code, else, run it to recreate'
- Be very careful with .rdata, pickles, and the like
- Any time you make changes to the data or early parts of the analysis, re-do your stored work
- 'Do you want to save your workspace?'
'**Hell no!**'

---

### Benefits of Stateless Analysis

- You can shut down/stop your analysis at any time and lose nothing
- Your work isn't affected by lost computers (because you use a 3-2-1 backup strategy)
- You always know *every single thing which happened to your data since collection*
- You have *one data version*, *one analysis version*, and *one output version*
- If you messed up **early** in the process, you change *only that one line* and just re-run

---

### Downsides of Stateless Analysis

- Computationally inefficient
    - You'll run the same models *over and over again* each time you start
- Requires discipline
    - "I can't just go into Excel and fix this"
- You'll end up with a lot of code
    - ... but no more than you would've needed anyways
- New runs with changes clobber your old analysis/plots
    - Version control!

---

### Key Takeaways

- Use version control
- Reproducibility is generally a good thing
- Error creeps into your analysis in many ways
- Stateless analysis is the best approach if you can make it happen!

---

### Playing with Version Control
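
As a starting point for playing, here is a minimal sketch of the basic workflow from earlier (`git init`, `git add`, `git commit`, `git log`, `git revert`), run in a throwaway folder. The file name, its contents, and the `Demo User` identity are made up for illustration, so swap in your own.

```bash
# Make a throwaway repo in a temporary folder (hypothetical demo, safe to delete)
demo="$(mktemp -d)"
cd "$demo"
git init --quiet

# Identity for this repo only; the name and email are placeholders
git config user.name "Demo User"
git config user.email "demo@example.com"

# First checkpoint: create a file, stage it, and commit it
printf 'subject,rt\n1,350\n' > rawdata.csv
git add rawdata.csv
git commit --quiet -m "Add raw data"

# Second checkpoint: change the file and commit again
printf '2,410\n' >> rawdata.csv
git add rawdata.csv
git commit --quiet -m "Add subject 2"

# One line per commit, newest first
git log --oneline

# Undo the latest commit by adding a new commit which reverses it
git revert --no-edit HEAD
git log --oneline
```

After the revert, `rawdata.csv` is back to its first-commit contents, but the log still shows all three commits: nothing is erased, so you could still get the 'Add subject 2' version back later. Note that the sketch leaves your terminal inside the temporary folder.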