# Software for Analysis, Package Managers, and Environments

### Will Styler - CSS Bootcamp
---

### Today's Plan

- Your computer as a tool
- What makes good software good?
- Open vs. Proprietary Software
- Packages, 'Apps', and Package Managers
- Development Environments
- How to run code
- Other useful software

---

## Statement of Bias

- I am a free-and-open-source software zealot with substantial trust issues
- Controversial statements will be labeled with "🤡"
- I am also a Unix-style thinker, and tend to use that philosophy
- I am purposefully being extreme, to show you the other end of the thought process
- Your mileage may vary

---

## Your Computer as a Tool

---

### As a computational social scientist, your computer is a tool

- In some cases it's your most important tool
- What characteristics do we want a tool to have?

---

### A good computing tool should be...

- Reliable
- Sufficient for your needs
- Up-to-date
- Secure
- In a known state
- Backed up
- Under your control

---

### Reliable Computers

- Mechanically functioning, with reliable power
- 'Bleeding edge' software implies that there will be blood
- Use a reliable operating system
    - Consider a 'long term supported' operating system
- Know how to get support
    - IT department, 'genius bar', DIY repairs, or otherwise

---

### 'Sufficient for your needs'

- Enough processing power to work in reasonable time
- Enough RAM to fit the data you need
- Compatible with needed packages (e.g. Nvidia's CUDA)
- Can you access your data/StackExchange/remote servers?

---

### Up-to-date

- Make sure that your system has the latest security patches
- Make sure that you're updating the packages you use where smart
- ...
but do so sanely
- Pain is inverse normally distributed with software age
    - Too old or too new both cause trouble

---

### Secure

- Encrypt your hard drives, always
- Consider the varied sensitivity of your data
- Use a password manager
- Don't use USB drives you find on the street
- Keep your machine up to date
- Limit the 'attack surface' by avoiding unneeded, untrusted software
- Vary your security based on the threat profile and value of your data

---

### Aside: Security threat profiles

- Your children
- Computer thieves
- Opportunistic hackers
- Targeted threats (e.g. IP theft, personal information, ransomware) and Evil Maids
- Nation-State Threats (e.g. NSA, Mossad, GRU, Military Cyber-Attack groups)
- Computer Manufacturers

---

### *You don't have to outrun the bear, just the other campers*

---

### An important security question: "Who am I trusting right now?"

- People with physical access to the machine
- Other users on the machine?
- People developing software which you run?
- People running 'administration' software from your work/institution?
- People developing your operating system?
- People making your hardware?

---

### *The Cloud is just Somebody Else's Computer*

---

### Security is onion-shaped
---

### In a known and specified state

- What kind of software is on your computer?
- Can the programs interfere with one another?
- If your computer were stolen today, could you rebuild and re-run the same analysis the same way tomorrow?

---

### Backed up

- Two is one, one is none.
- [I lost a good portion of my dissertation data to a corrupted MacBookPro Hard Drive](https://wstyler.ucsd.edu/posts/lost_dissertation_files.html)
- Consider a 3-2-1 Backup Scheme
    - Three copies, Two different media and methods, one offsite
- Think of everything you need to back up
    - Documents, media, settings, lists of installed packages, passwords, and more
- Don't upload unencrypted and sensitive data to Google/iCloud/Dropbox 🤡

---

### Under your control

- Other people should not be able to modify your system's function without your consent
- User accounts should be separated, and you should be the only administrator on your machine 🤡
- You should be the one to initiate any changes which could break your machine
- Your employers may demand some level of control, but this should be a part of your threat model 🤡
- 'Security Monitoring' software is a virus your employer controls, and should reduce trust in your machine 🤡

---

### A good computing tool should be...

- Reliable
- Sufficient for your needs
- Up-to-date
- Secure
- In a known state
- Backed up
- Under your control

---

## What makes good software good?

---

### The same things!

- Reliable
- Sufficient for your needs
- Up-to-date
- Secure
- In a known state
- Backed up
- Under your control
- *Interoperable with Open formats*

---

### Reliable Software

- Does what you need, when you need it to
- Doesn't crash more than needed
- Can be trusted to function in the same way, every time
- Will reliably be accessible in many settings (e.g.
offline, new computer, different country, etc)
- Subscription-based Software is unreliable by definition 🤡

---

### Sufficient for your needs

- Software should do what you need it to
- Beware swatting flies with hand grenades
- Remember that you might need to fight the IT people to install every single package

---

### Up-to-date

- Be wary of software solutions which require old versions of the OS or Python
- Software which hasn't been updated for a while is either *very* solid or *very* precarious

---

### Secure

- Untrusted code deserves fewer privileges
- All code is untrusted 🤡
- Be wary of 'pickles' and python binary formats
    - Use .safetensors and .gguf files for Neural Network models
- Make sure you know who wrote what you're running and what it does
- Make sure you understand why you're running the commands you are before you run something from StackExchange

---

### In a Known State

- Software which randomly updates and changes features is problematic
- You should be able to control the version(s) you're running
- Software companies [can and will remove features at will if it makes them more money](https://hackaday.com/2022/08/12/local-simulation-feature-to-be-removed-from-all-autodesk-fusion-360-versions/)

---

### Backed up

- Your data should be accessible and able to be backed up independent of the program
- If your data exists only in the vendor's cloud, it's on borrowed time 🤡
- There should be a clear way to export your settings, data, and analysis for transfer to a new device

---

### Under your control

- Be skeptical of additional software you're required to run
- If something could be done on your own computer, run it 'in the cloud' only with good reason 🤡
    - What happens when Google's servers are down? Or you're on a plane?
- You should ask questions if they ask for your password 🤡

---

### Open Formats and Interoperability

- Proprietary formats and closed formats are Bad.
🤡
- You should be able to get your data from saved to useful without a company's help or a license
- If a program saves your data in a format only it can understand, this is a means of exerting control over you 🤡
- Do not put yourself in a position where you **must** pay for the upgrade 🤡
- Anything you create should be transferable to a different or competing software package
- 'Never walk into a room you don't know how to walk out of' 🤡
- This also makes it easier to pipeline software (e.g. do one analysis in one package, and then transfer to a different one)
- Plaintext is Durable

---

### The same things!

- Reliable
- Sufficient for your needs
- Up-to-date
- Secure
- In a known state
- Backed up
- Under your control
- *Interoperable with Open formats*

---

## Pains and benefits in Open Source Software

---

# 🤡 !!!

---

### What is free and open source software?

- Free as in beer ('gratis')
    - Autodesk Fusion is Free for home users, but under very restrictive terms
    - ArcGIS is free for UCSD students, temporarily, with limits
- Free as in freedom ('libre')
    - Some projects charge for easy-to-run-binaries, or have an open core (e.g. VSCode, Android)
- Libre and Gratis Software
    - Free to download, and you're free to use it in any way compatible with the license
- Often called FOSS (Free and Open Source Software) or FLOSS (Free and Libre Open Source Software)

---

### Free software licenses

- Many different versions, more and less restrictive
- Common licenses are MIT, GPL, Apache, Creative Commons
- Common restrictions on re-sale, use-without-acknowledgement, inclusion in non-free software packages
- 'Copyleft' licenses prohibit making open code closed and 'proprietarization'
- Viral licenses exist, which try to make free and open anything which contains code from them
- This is a **whole situation**
- ...
but the upshot is that there are many free and open packages that are amazing

---

### Open Source Projects we're using

- Python, NumPy, Pandas, MatPlotLib, Seaborn, SciKit Learn, R
- Linux on Datahub is open-source too
- LibreOffice is a free Microsoft Office Replacement
- This slideshow is created entirely with FOSS
- We'll be teaching you more open source tools as time goes on

---

### Closed-source Tools we're using

- Windows/iOS/macOS*
- Android*
- Google Docs/Sheets
- Zoom
- ArcGIS
- Cisco AnyConnect VPN Client

---

### Free and Open Source Advantages

- The code **can** be inspected, modified, and improved by anybody, even you!
- Community ownership means development will often* favor the community
- The project can be 'forked', resulting in two different source code bases
- No Digital Rights Management, licensing fees, or otherwise, allowing greater equity and reproducibility

---

### Free and Open Source Advantages (cont.)

- Often easier to download and install and available on more platforms
- The software cannot 'disappear', be discontinued, or be changed unfavorably without recourse
- Problems can be solved without the developer's cooperation*

---

### Free and Open Source Disadvantages

- Not everybody inspects or improves the code
- Maintainer burnout is real [and can be weaponized](https://en.wikipedia.org/wiki/XZ_Utils_backdoor)
- Customer requests are at the whim of the developer, and these can be dictatorships
- Users must rely on the community or developer's generosity for support
- Websites, documentation, and resources are all community-provided
- You can't build copyleft license code into proprietary projects
    - Some companies are terrified of AGPL code

---

### Proprietary, Paid Software Advantages

- Maintainers are paid for their work (reducing burnout and incentivizing development)
- Customers have the ability to demand changes to the code or functionality
- Support is often offered as a part of the process, often more cheaply than expert
hires/consultants
- 'Cloud' components involving recurring cost are possible
- Security through obscurity
- Guarantees are offered with financial backing

---

### Proprietary, Paid Software Drawbacks

- Upfront and ongoing costs are out of (your) control
    - "Oh, no, your permanent license is now a subscription."
- DRM and Licensing become a complication (e.g. 'how many people are using this?' and 'Is this computer authorized to play this?')
- No ability to 'do it yourself' if there's a bug that needs fixing
- Security through obscurity is not security
    - ... and you can't even tell if there's a problem

---

### 🤡 ["Embrace, Extend, Extinguish"](https://en.wikipedia.org/wiki/Embrace,_extend,_and_extinguish) 🤡

- Adopt an open standard, data type, project, or approach
- Add additional patent-encumbered or IP-restricted features to your version without relicensing
    - Also incorporating third-party services (e.g. AI, Messaging) which can't be added in free versions
- Use your market position and ability to lose money at first to drive other projects into disrepair or dissolution
- [Enshittification](https://en.wikipedia.org/wiki/Enshittification) and money/data extraction commence

---

### The incentives of proprietary software developers and customers are very rarely aligned 🤡

- Their optimal price is the absolute most you're willing to pay before you switch to a competing product
- Locking you into their product with proprietary formats helps increase that number
- Monopolies (e.g. ArcGIS, Adobe Creative Suite, MS Office, iOS) increase that number further
- Support is a cost-center, and often similarly 'optimized'
- For data science, Proprietary Software is seldom a good idea.
---

### Open Software Supports Open Science

- It allows you to support more people, students, and experiments
- People can reproduce your analyses more readily
- People can replicate your approach without funding or site licenses
- Your approach is more likely to 'catch on' if anybody can do it
- Proprietary software can opaquely break your results for future runs
    - "Oh that functionality was removed in 1.7, and 1.6 is deprecated and your license code no longer supports it"

---

### Analysis involving closed-source software is not reproducible 🤡

- [Here's a nice article on that topic](https://www.r-bloggers.com/2022/11/open-source-is-a-hard-requirement-for-reproducibility/)

---

## 🤡🤡 Think Open Source First 🤡🤡

- Spend the money you would've spent on the proprietary software on supporting open software developers
- ... then, only if no possible open alternative exists, start paying people for proprietary software
- Advocate for your company to use and contribute to open products, where possible!

---

## Questions?

---

## Technical Debt

---

### Every software choice has costs

- Sometimes it forces an operating system or hardware choice
- Sometimes it forces a specific programming language (version?)
- Sometimes it forces you to use other specific software
- Sometimes it introduces speed bottlenecks
- Sometimes it locks you into a specific version of a package
- Sometimes it requires a *lot* of extra code

---

### These costs grow over time

- Hardware/Software change over time
- Programming languages change over time
- Software packages constantly update, change, and die
- Speed costs add up over time and with increased use
- Additional code has maintenance costs

---

### These things are forms of 'technical debt'

---

## Technical Debt

- The cost (in time or money) of re-working your code to keep it functioning

---

### Example Technical Debts

- "New GPUs no longer support this API, and my GPU broke"
- "In order to keep running this, I need to update it to work with..."
- "We've moved most of our shop to Linux, but ArcGIS needs Windows 10"
- "This model works great for 10 queries per minute, but we have 10,000 per minute now..."
- "The bug is someplace in that undocumented mess of code which Will wrote, but he's not here to fix it"
- "The maintainer of that package died in 2019, so it no longer works with new versions of Pandas"

---

## Minimizing Technical Debt in your analysis

- "Don't build on things that could collapse, and if you must, build smartly!"

---

### Target Stable Platforms

- Write cross-platform code, using cross-platform packages
- Look at which platforms still run working code from five years ago
- Avoid dependency on any particular hardware or manufacturer
- If the OS or major elements of the platform are on the way out, write for the replacement

---

### Target Stable Software

- 'Flavor of the week' software and languages are a gamble
- You want robust software, with a long history of boring updates
- You want software which is well-maintained, and getting maintenance releases
- How likely is the software to still exist and be affordable in X years?
- Look out for capricious 'breaking' changes

---

### Minimize Dependencies

- Prefer functions from the 'base' over fancy overlay packages
    - 'Tidyverse' in R is a nice example of a very brittle dependency which could break a *lot* of code
- Every library you call or package you use could stop being maintained or get broken by an update
- Balance 'necessity to do the work' against 'likelihood of eventual destruction'
- The best dependencies are both important and stable

---

### Modularize your code

- Write functions which call dependencies, so you can replace the dependency without rewriting the code
    - 'Ah damn, np.doathing() is deprecated, I'll rewrite it within the function or replace it with pd.doathing() and it'll be fine'
- Give people functions to maintain, rather than 'the whole thing'
- Anything which needs to run a *lot* should be a function
    - This makes it easier for the CS nerds to optimize later

---

### Do it 'right' the first time

- Document your code and write comments as you do the work
- Write code which you expect to scale well
- Once you know you'll need that code, optimize it *right then*

---

### You'll still have technical debt

- It is inevitable
- Thinking about all this should guide your choices of platforms, software, and packages
- Considering this will be part of your life

---

### Example: Python's version change

- Python 2 to Python 3 was a *breaking* change
- Things as simple as `print "my name is Will"` changed to `print("my name is Will")`
- `<>` no longer meant `!=`
- Core list and dictionary functionality changed
- Lots of old methods, approaches, and libraries broke in half
    - So did Will's Python knowledge

---

### Many programs still use Python 2.7

- They're deep in technical debt
- So, there's a chance you'll need to run both Python3 and Python2 for your work.

---

### There is great pain here

- Yep! You'll need...

---

## Packages, 'Apps', and Package Managers

---

### There are many ways to install software!
- "Go to our website and download the .exe installer"
- "Click the button to download Zoom.app"
- "Open the Apple/Microsoft/Steam/iOS App Store and search for..."
- These are fine for 'big', monolithic software, but don't work well for python packages and other tools

---

### Software Dependencies

- It's dumb to reinvent the wheel, so smart developers use other (free and open) software instead
    - "Why rewrite code which pulls information from the web when `curl` exists?"
    - "Seaborn is built on MatPlotLib, so without that code installed, it won't do anything"
- Every piece of software has dependencies!
    - Sometimes they're packaged alongside the software, sometimes they're external

---

### Enter the 'Package Manager'

- Software which is designed to install, uninstall, upgrade, and configure other software
- Software is retrieved from 'repositories' containing software, source code, and otherwise
- Different from 'App Stores' in that they handle dependencies (and tend to be open to the world)
- Often used for command-line applications, but you can use them to install GUI applications too!

---

### Common Package Managers

- Examples are [Homebrew](https://brew.sh/) for Mac/Linux, `apt/rpm/dnf/pacman/emerge/nix` on Linux
- `pip` or `conda` for Python
- Some software includes package managers (e.g. R, Ruby, JavaScript, Matlab)

---

### Using a package manager: A play

- "Hey, computer, I want to install `rsync`"
    - `brew install rsync`

---

### "Oh the human wants `rsync`!"

- 'Great! I've checked our repository ('repo') and `rsync` is there.'
- 'Even better, there's a binary for x86_64 which is what they're using, so they don't have to build from source'
- '...But if I install it, what else does it need?'

---

### 'OK, so first I'll install the dependencies'

```
~ % brew deps rsync
ca-certificates
lz4
openssl@1.1
popt
xxhash
xz
zstd
```

---

### 'Oh, great, they already have `openssl` 1.1, I can skip that one!'

- I'll just install the rest
- 'Oh no, `popt` has dependencies too!
I'll install those first!'
- 'OK, all deps installed!'

---

### 'OK, that all happened successfully!'

- 'Now, I'll install `rsync`, and move it to wherever on the drive applications belong!'
- 'Woohoo! Mission accomplished!'

---

### You will need to deal with package management

- There are many amazing tools which don't have GUI (graphical user interface) installers or 'websites'
- You'll use this extensively on non-Windows machines to install command-line packages
- Oftentimes much faster to use than conventional 'hunt down the file' installs

---

```bash
sudo dnf install ImageMagick R apfs-fuse awscli bzip2 cmus ffmpeg gimp kitty w3m ncurses neovim pandoc phoronix-test-suite rclone syncthing restic rstudio-desktop vim zsh texlive yt-dlp texlive-scheme-full google-roboto-fonts openscad prusa-slicer espanso steam lutris mullvad audacity gocryptfs
```

---

### Python Package Management

- For python, `pip` is the most common package manager
    - `pip install -U scikit-learn`
    - This will install `scikit-learn` and all of its dependencies

---

### (Ana)conda

- Anaconda is a ready-made Python Distribution
    - A 'distribution' is generally a key piece of software bundled with related packages
- It contains a large selection of packages for data science, statistics, and scholarly communication, as well as `conda` for package management
- This is a cheap way to get all the python packages you're likely to need
    - Alongside some other goodies

---

## Environments

---

### 'Your software should be in a known and specified state' - Will, earlier

- Why does it matter what packages and versions are involved?
- Some packages don't play well together!
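
---

### Aside: writing down what's installed

One way to keep an environment in a 'known and specified state' is to record exactly which packages and versions it holds. This is a minimal sketch using only the standard library's `importlib.metadata`; the function name `installed_packages` is made up for illustration, and the `name==version` format simply mimics the pinned style that `pip freeze` prints. It assumes every installed distribution has well-formed metadata.

```python
from importlib import metadata


def installed_packages():
    """Return sorted 'name==version' strings for the current environment."""
    pins = set()
    for dist in metadata.distributions():
        # Each installed distribution records its own name and version
        pins.add(f"{dist.metadata['Name']}=={dist.version}")
    # Sort case-insensitively so the list is stable and easy to compare
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for pin in installed_packages():
        print(pin)
```

Saving this output next to an analysis gives you (and anybody reproducing your work) a record of what was actually installed when the results were produced.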
---

### Some programs allow you to specify your environment

- `conda` or `venv` allow you to create specific 'environments' which specify
    - Installed packages
    - Installed package versions
    - Installed dependencies and dep versions
- These environments are *isolated from one another*
    - Installing a package in one environment does not change another
    - You can have conflicting versions and packages 'installed', but isolated

---

### How to use `conda` environments

- You specify the configuration of an environment, then 'switch' when you switch projects
- You can always recreate a `conda` environment later on when you need it
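
---

### Aside: an isolated environment, sketched with `venv`

The slides above describe environments mostly in terms of `conda`. As a rough illustration of the same isolation idea, here is a minimal sketch using Python's built-in `venv` module instead. The helper name `make_environment` and the `.venv` folder name are assumptions chosen for illustration, not something `conda` or this course prescribes.

```python
import tempfile
import venv
from pathlib import Path


def make_environment(project_dir):
    """Create an isolated environment in project_dir/.venv and return its path."""
    env_dir = Path(project_dir) / ".venv"
    # with_pip=False keeps this quick and offline; use True to get pip inside
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    # Build a throwaway environment and show the config file which defines it
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = make_environment(tmp)
        print((env_dir / "pyvenv.cfg").read_text())
```

Each project gets its own folder of packages, so installing something for one project leaves the others alone. With `conda` itself, the equivalent steps happen in the shell, using commands such as `conda create`, `conda activate`, and `conda env export`.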
---

### There are multiple environments on your machine already

- The System level Python
- User Level python
- Program bundled Pythons

---

### You can also use software which *declaratively* specifies environments

- "Make my computer always have this software on it"
- Ansible is a great example, as is nix/NixOS.
- You can specify packages, versions, and file locations

---

### Why is this helpful?

---

### A Note on Docker, Podman, Distrobox, or 'Containers'

- This is another way to do 'environments'
- These containers have their libraries, software, and config files built into a tiny container with an OS
- OS-level virtualization is used to keep those programs from interacting with things outside the container except as specified
- This allows you to pull down a tool and all its dependencies and configurations at once
    - "Start up this second OS to run this tool"

---

## What's the best way to edit code?

---

### Coding Text Editors

- **NOT MICROSOFT WORD**
- Good text editors don't do formatting, and only write text
- vim/emacs/nano/neovim on a CLI
    - Choice among these is the subject of a long-running Holy War
    - 🤡 neovim is best 🤡
- [Notepad++](https://notepad-plus-plus.org/)/[TextMate](https://macromates.com/)/[SublimeText](https://www.sublimetext.com/) have GUIs
- [VSCode](https://code.visualstudio.com/) is increasingly popular
    - This is an open source, non-free product as shipped
    - [VSCodium](https://github.com/VSCodium/vscodium) is a version which is built solely from Free/Libre Components

---

### IDEs ('Integrated Development Environments')

- Programs designed to let you work with a specific programming language
- Often includes niceties to work with that language (e.g. syntax highlighting, debugging, etc)
- This includes programs like Eclipse, PyCharm, Netbeans, VSCode, vim or emacs (with plugins), Android Studio, XCode, and more
- Jupyter Notebooks are effectively an IDE
- Not necessary, but sometimes very nice!
---

### Aside: jupyter notebooks are great

- They do a lot of hard things easily
    - Good syntax highlighting
    - Integrating code and documentation
    - Output can be captured and shared widely
- It allows for '[literate programming](https://en.wikipedia.org/wiki/Literate_programming)'
- They're just more awkward to run remotely

---

### You don't need to run python within Jupyter notebooks

- You can put the code in a .py file and just run it with `python3 yourcode.py`
    - Just make sure you're saving the output in meaningful places
- This is easy to 'run and walk away'
- This doesn't require a fancy interface
- It's easier to submit jobs this way to remote machines

---

## Other Valuable Software

---

### Backup Programs

- You want to back your work up (3-2-1!)
- **Mac:** Time Machine, rsync, CarbonCopyCloner, restic/borg
- **Windows:** Windows Backup, restic
- **Linux:** file system snapshots, rsync, restic/borg
- **Cloud Backup Services:** iCloud/OneDrive Backup, Backblaze, Crashplan, SpiderOak One
    - Encryption is *important* for cloud backup sites

---

### Spreadsheet Applications

- [LibreOffice](https://www.libreoffice.org/)
- Microsoft Excel
- Google Sheets
- OpenOffice

---

### Synchronization Tools

- SyncThing (Libre and gratis)
- `rsync` (unidirectional)
- Dropbox/Google Drive/OneDrive with rclone to make it zero-knowledge
- Working on a remote server

---

### Great free applications for getting work done

- `ffmpeg` for audio and video file processing
- `imagemagick` for batch processing of images
- `pandoc` for changing document formats around
- Audacity for editing audio

---

### Wrapping up

- The things that make good tools reliable are universal to computers *and* software
- Technical debt is a problem
- Open Source Software is a good idea
- Package managers help you get software
- Development Environments keep you sane
- Code can be run in many ways
- Other useful software