Will Styler

Associate Teaching Professor of Linguistics at UC San Diego

Director of UCSD's Computational Social Science Program

Using Unix for Linguistic Research

Will Styler - Winter 2023 - Version 0.6

This document is designed to serve as a tutorial for linguists who are using Unix-based machines for corpus and other linguistic work. It is not a comprehensive tutorial, nor will you come away from it completely competent in all aspects of Unix work. But it’s designed to touch on the concepts, commands, and techniques most useful to linguists.

If you find errors, see issues, or use it for your course, please drop me an email.

Using Unix for Linguistic Research is completely free, and is available for use, both privately and as a part of classes. All I ask is that you share the link to this guide (rather than sharing the HTML file itself) so that people end up with the latest version.

Also, if you’re using the guide in your class or project, I’d love it if you’d send me an email (will at savethevowels dot org), so I can see where my work is going, and so I can get ideas as to how to improve it for the future.

Using Unix for Linguistic Research is licensed under a Creative Commons Attribution ShareAlike (CC BY-SA) License. More information on this license is available here, but in short, this license means that I’d appreciate it if you’d attribute my work to me, and that any derivative works must also be Creative Commons licensed. Free should stay free.

Any questions, comments, concerns or corrections should be sent to me by email - will (at) savethevowels (dot) org.

Logging into a Unix-based remote server

Although on many operating systems you’ll be able to work locally, you’ll often need to be able to log into a remote computer using a Unix-style command line.

You’ll log into the server using ssh, short for ‘secure shell’. Unfortunately, this process looks a bit different depending on your operating system.

UCSD Only: Getting your Username, Password, and Server Address

No matter your operating system, you’ll need to have login credentials for the computer you’re logging into. For most applications, you’ll already know your username and password, as it’s been given to you by your local SysAdmin, or perhaps you set it yourself. But regardless, to connect using any option, you’ll need to know the Username, Password, and Server Address.

For LIGN 6 at UCSD, to get your course-specific Username and Password, go to https://sdacs.ucsd.edu/~icc/index.php, then under Tools click “Account Lookup”. This will allow you to enter your username and PID to look up your login and set a password. There’s a chance you’ll see multiple accounts; the one you want here starts with ‘ln6’. You’ll need to log in and ‘Reset your password’. Make sure to write down your username and password, as you’ll need them later! The server address for LIGN 6 is ieng6.ucsd.edu, and if your SSH client requires you to specify a ‘Port’, it’s 22.

If you’re using LinuxCloud, you’ll also need to enroll that account in Duo https://duo-registration.ucsd.edu/ by inputting your new username and password and connecting to your mobile device.

Logging in via ssh on MacOS and Linux

On the Mac, you’ll go to /Applications/Utilities, and then double click “Terminal”. This puts you in a fully-functioning Unix command line, and will allow you to either work locally (accessing the files on your own computer), or to enter an ssh command (as described below) and connect to the campus server.

If you’re using Linux, there’s a good chance you know how to open a Terminal. For the most part, any terminal (GNOME Terminal, Konsole, xterm, Eterm) will work fine. And again, once it’s launched, this puts you in a fully-functioning Unix command line, and will allow you to either work locally (accessing the files on your own computer), or to enter an ssh command (as described below) and connect to the campus server.

Once you’ve got a terminal opened in either OS, you can either work locally (as you’re already at a unix prompt), or type ssh yourusername@ieng6.ucsd.edu at the prompt to connect to a remote server.

The first time you connect, you’ll see a message talking about the Server’s “RSA Key”. Just type “yes”. Then, you’ll enter your password. Nothing will show up as you’re entering your password, but don’t worry, it still works.

Once that happens, the server will automatically display an annoying =====NOTICE===== message with privacy information (if you’re at UCSD). Once you hit q to exit that, you’ll be logged in.

Once you’re logged in, you’ll see a line ending with a $. This is the ‘prompt’, and now, you move on to the next steps.

Logging in via ssh on Windows

Unfortunately, Windows users cannot do Unix work locally out of the box (as Windows is not a Unix-based OS), and have some additional work to do to be able to ssh into our server.

As of Windows 10, there is an SSH client built into Windows, which allows you to log in to remote servers. To use it:

  1. Open Windows Terminal
  2. Type ssh yourusername@ieng6.ucsd.edu at the prompt to connect to our remote server.
  3. The first time you connect, you’ll see a message talking about the Server’s “RSA Key”. Just type “yes”. Then, you’ll enter your password. Nothing will show up as you’re entering your password, but don’t worry, it still works. Then, continue below.

If this isn’t working, if you’re using an older version of Windows, or if you’d like a slightly more robust terminal client, you can use PuTTY:

  1. Download PuTTY from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html, using the MSI ‘Windows Installer’ option on that website.
  2. Install PuTTY from the downloaded .msi file.
  3. After installation, find the PuTTY program in the Start Menu.
  4. PuTTY will open a new window. Enter the below settings…
    • Under ‘Host Name (or IP address)’ enter your username followed by @ieng6.ucsd.edu (e.g. wstyler@ieng6.ucsd.edu).
    • Under ‘Port’, keep the default value of “22”.
    • Under ‘Connection Type’, select “SSH”.
    • Note, this information will differ if you’re connecting to a different server than the UCSD LIGN 6 server
  5. To save these settings (so that you don’t have to re-enter them later)…
    • Enter a name under ‘Saved Sessions’
    • Click ‘Save’
    • To re-use this information later on, just double click that name under ‘Saved Sessions’
  6. Click ‘Open’ to start your SSH connection.
  7. The very first time you connect to a new server, a pop-up will ask you to save the key. Choose ‘Yes’.
  8. The command window will open, and ask for your password. Type your password and press enter.
    • Nothing will show up as you’re entering your password, but don’t worry, it still works.

Once you’re connected through either method, the server will automatically display an annoying =====NOTICE===== message with privacy information (if you’re at UCSD). Once you hit q to exit that, you’ll be logged in.

Once you’re logged in, you’ll see a line ending with a $. This is the ‘prompt’, and now, you move on to the next steps.

For what it’s worth, Windows now also has the optional ‘WSL’ (Windows Subsystem for Linux) tooling (which I would argue is actually a Linux subsystem for Windows). This effectively installs a Unix-like virtual machine on your Windows machine, giving you some of the power of Unix applications and tools within Windows. Note that some tools won’t work or may perform strangely. It remains the rough equivalent of using a Lamborghini engine to run the generator in your 1970s-era camper van, and if you’re serious about using these tools in the longer term, you’ll probably want to either log into a remote server running a real Unix-based OS, or install a Linux distribution alongside (or instead of) Windows.

Logging in via ssh on iOS or Android or ChromeOS

You can actually do some Terminal work from iOS or Android, using an ‘ssh’ client like Prompt (for iOS) or ConnectBot (for Android). It’ll be unpleasant if you don’t have access to a real keyboard, but you can pull it off.

On ChromeOS, you can use the Secure Shell App for Chrome to connect to a server.

Enter the server name (ieng6.ucsd.edu), your username, and password, in the relevant places.

The first time you connect, you’ll see a message talking about the Server’s “RSA Key”. Just type “yes”.

Once that happens, the server will automatically display an annoying =====NOTICE===== message with privacy information (if you’re at UCSD). Once you hit q to exit that, you’ll be logged in.

Once you’re logged in, you’ll see a line ending with a $. This is the ‘prompt’, and now, you move on to the next steps.

UCSD Only: Logging in via LinuxCloud

If you’re a LIGN 6 Student, you can also log in at https://linuxcloud.ucsd.edu/# using your ieng6 account above. If you’re using LinuxCloud, you’ll also need to enroll that account in Duo https://duo-registration.ucsd.edu/ by inputting your new username and password and connecting to your mobile device.

Once you’ve enrolled in Duo and logged in, you’ll have an option under ‘All Connections’ for ‘Linux SSH Terminals’. Double click that and you’ll be on a command line, and ready to get started.

Basic Unix Theory

Unix doesn’t refer to a specific operating system, but a family of related (and largely interoperable) operating systems. Members of this family include MacOS, iOS, Android, Linux, BSD, Solaris, and others. All of them share some low-level basics, and focus on using simple tools which can be combined together to do more complex things in a program called a ‘shell’.

John Lawler has written a full Linguistic description of the Unix language family, which is amazing both in its depth, and its sense of humor. It should be required reading for any linguist interested in using Unix.

Let’s start by talking about some of the basics of Unix-compatible machines.

The Unix Shell

Every Unix server, when you log in, will put you into a ‘shell’ program. bash is probably the most popular option these days, but you’ll occasionally see systems still using csh or sh. There are also other shells out there in the world which offer more modern features. I personally prefer zsh because it has a few nice tools (particularly an expanded set of wildcard functions), but I’m perfectly happy in bash, and this tutorial is written with bash in mind.

The Prompt

When you log in to a Unix machine (either via a Terminal on your local machine, or via SSH), you’ll have the ability to enter text commands at the Prompt (usually a $), which the local operating system will then evaluate. All of your interaction is text-based, and although you may be able to move, click, and select text in the terminal, your mouse does nothing here.

When you log into the UCSD server, your prompt will look like the below:

[yourusername@ieng6-201]:~:100$ 

This includes information about how you’re logged in (username at which specific node of the cluster), then your current directory (in this case, home), then a number indicating how many commands you’ve previously entered. On your local machine, your prompt will look different. But it will likely still end with a $ if you’re in the bash shell.

Any time that you see this line, ending with $, you can enter a command. But before we talk about commands, we should talk about where you are and the files around you.

In Unix, you’re always ‘in’ a folder while you’re logged in, and your presence in the system moves from folder to folder as you navigate. At any given moment, you’re ‘in’ a directory, and commands refer to the items in that folder unless you tell them otherwise. Think of this as being kind of like human conversational pragmatics: If you’re in a room, and there’s a mug on the table, and the other person in the room asks “Hey, hand me the mug”, you’re going to grab the one on the table, rather than asking “There are lots of mugs in the world, which do you want?”.

This means that you need to be sure what’s in the folder you’re in, so that your commands will work as expected, and you’re not trying to ask for a mug that isn’t in the current room!

But how do we describe where we are?

File and folder paths

All files and folders have a ‘path’, that is, the place in the file system where you can find them. We can refer to a file or folder’s location in two ways, either using the ‘local path’ (“the path to this file from here”, often called a ‘relative path’), or the absolute path (“the path to get to that folder from the root of the system”).

If you have a file called system.txt, and it exists in the folder /home/linux/ieng6/ln6w/ln6w, then the absolute path to that file is /home/linux/ieng6/ln6w/ln6w/system.txt. If you’re already ‘in’ that folder, you can refer to it simply as system.txt. If you’re in that folder, the paths above are equivalent.

In Unix, there are four locations that have a dedicated symbol:

. represents the current folder. You don’t need to use it very often, except when doing something like ‘move this file into the current folder’ (e.g. mv /folder/over/there/file.txt .).

.. represents the folder which contains the current folder. So, to move file.txt from the folder where you are to the folder which contains it, you’d use mv file.txt .. (with the two dots as the destination). Similarly, cd .. will always take you back up a level.

~ represents your home folder, and you can always use that in a path name, wherever you are. So, the file samplefile.txt in your home folder can always be called ~/samplefile.txt no matter where you are.

/ represents the root of the filesystem (‘the folder which contains all folders’). Any path which starts with / is what’s called an ‘absolute’ path, and will be valid no matter what folder you’re in on that machine.
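To make these symbols concrete, here is a small sketch you can paste into a bash prompt. It builds a throwaway folder tree using mktemp (the folder and file names are invented for illustration), then reaches the same file through a local path, an absolute path, and the . and .. symbols.

```shell
# Build a small scratch tree to experiment in (all names are hypothetical).
scratch=$(mktemp -d)
mkdir -p "$scratch/project/data"
echo "hello" > "$scratch/project/data/system.txt"

cd "$scratch/project/data"

# Local path: resolved starting from the folder you're in.
cat system.txt

# Absolute path: starts at the root (/), so it works from any folder.
cat "$scratch/project/data/system.txt"

# . is the current folder, .. is the folder which contains it.
cat ./system.txt
ls ..        # lists the contents of 'project', which is just: data

# ~ expands to your home folder.
echo ~
```

All three cat commands print the same line, because they are three different ways of pointing at the same file.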

Now, you’ve got paths. How do we actually do things?

The * Wildcard

In most Unix shells, you can use the * ‘wildcard’ to refer to groups of files. Used alone, it means “all items in the current folder”. So, for instance, if you’re in a folder with four other folders, running ls will list the contents of the folder you’re in, but running ls * will list the contents of all the folders inside that folder.

You can also use the wildcard to match partial filenames. So, something like cat ling* will display the contents of every file starting with ling (“ling101_syllabus.txt”,“lingering_ghosts.txt”, “linguistics_is_awesome.md”). But cat ling*.txt would not match a .md file.

This is amazingly powerful. Simple commands in Unix can replace huge amounts of human work moving files between folders. Imagine that you’ve got a folder with .wav sound files and .txt transcripts. mkdir wavdir && mv *.wav wavdir will instantly move every wav file (but not the .txt files) into a new folder called ‘wavdir’.

Because the Unix command line is generally quite fast, you can work with thousands of files at once, according to simple rules. Imagine you had many thousands of text snippets from various sources, labeled by filename, with files from twitter prefixed with “twi_” and files from facebook prefixed with “face_” and so forth. Using bash, you could instantly combine every file from twitter (but not every file from Facebook) into a single text file using just the command cat twi_* > twitterfiles.txt.
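Here is a miniature version of that scenario, as a sketch you can run safely in a scratch folder. The three file names and their contents are made up for the example; only the twi_ and face_ prefixes come from the description above.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Three tiny 'corpus' files: two from one source, one from another.
echo "first tweet"     > twi_001.txt
echo "second tweet"    > twi_002.txt
echo "a facebook post" > face_001.txt

# The shell expands twi_* into the matching filenames before cat ever runs.
echo twi_*        # prints: twi_001.txt twi_002.txt

# Combine only the twi_ files into one file; face_001.txt is left out.
cat twi_* > twitterfiles.txt
cat twitterfiles.txt
```

Running echo with a wildcard first is a cheap way to check what a pattern will match before you hand it to a command that changes anything.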

Be careful, though, as it’s easy to underestimate the scope of your wildcards, and forget what you’re affecting. You should be very careful using wildcards with rm or other destructive commands. rm -r * will instantly delete every (non-hidden) file and folder in the current folder.

Dotfiles

In Unix, files which start with a period/dot . are ‘hidden’ from general view, requiring the use of a special flag (ls -a) to see them. They tend to be configuration files (see the ls -la output in the section on flags below), but any file can be made into a dotfile. They’re still usable with other commands, though (so cat .bashrc will still output the contents of that file to the terminal). For this reason, you’ll want to be very careful with periods at the start of filenames.
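A quick way to see this behavior for yourself is sketched below, using two invented file names in a scratch folder. It assumes bash with its default settings, where the * wildcard also skips dotfiles.

```shell
scratch=$(mktemp -d)
cd "$scratch"

touch visible.txt .hidden.txt

ls        # shows only visible.txt
ls -a     # also shows . and .. and .hidden.txt

# With bash's default settings, the * wildcard skips dotfiles too.
echo *    # prints: visible.txt

# But naming the dotfile explicitly works like any other file.
echo "some settings" > .hidden.txt
cat .hidden.txt
```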

Tab Autocomplete

In Unix, hitting <tab> will always try to complete your command. So, if you’re in a folder with three files, ‘filea.txt’, ‘fileb.txt’, ‘otherfile.txt’, and you type o then hit <tab>, it will automatically complete the command to otherfile.txt.

If you type f then <tab>, it will complete to file (as there are two files which both start with that much), allowing you to then hit a or b to specify which of the options you want, and <tab> again to fully complete the file’s name.

This is helpful, and can save a lot of time.

Command History

In Unix, hitting the up arrow at the command line will automatically scroll through your command history. So if you want to re-run a command, just hit ‘up arrow’ then <enter>.

Killing a running process

If you’ve got a runaway process, or something that you ran which is taking much longer than expected, you can hit Ctrl+c to kill the process and return to a prompt.

Unix has no ‘undo’

One crucial thing to understand is that unlike in most operating systems, there’s not an ‘undo’ functionality. If you delete a folder, unless you have a sympathetic sysadmin with backups, it’s gone. This is why you should be extremely careful when using destructive commands (like rm), make sure you’re not overwriting files, and never hesitate to make copies of important files before you work on them.

On Unix Commands

To do things in Unix, we use textual commands, which are specific (and generally compact) programs that serve a small function. You’ll enter individual commands, followed by ‘flags’ which specify command behavior as well as ‘arguments’ which are the ‘objects’ in the linguistic sense.

Note that the exact complement of commands may differ from system to system, and that additional commands can be added by the user (using a package manager like apt or portage on Linux, ‘pkg’ on BSD, or Homebrew for MacOS). The commands we’ll be using here are all fairly basic, and should be found on any Unix-compliant system, although the exact flags and arguments taken may vary slightly depending on version.

As it is a verb-initial language, the basic syntactic structure of a unix command is command [-flags] [arguments].

Unix Command Arguments

Unix commands can take arguments, and depending on the command, these arguments will be optional or obligatory, and ordered in certain ways.

Flags/Options/Switches

Additionally, many commands take ‘flags’ (also known as ‘options’ or ‘switches’) which slightly modify their function, which are generally prefaced with a -. The ls (‘list directory contents’) command will simply list file and folder names if called using ls /home/linux/ieng6/ln6w/ln6w:

$ ls /home/linux/ieng6/ln6w/ln6w
dirs.txt  duolingo.txt  perl5

If you use the -a flag, it will show ‘hidden’ files too:

$ ls -a /home/linux/ieng6/ln6w/ln6w
.   .bash_profile  .cache   .cshrc  .local  .modulesbegenv  .profile   .zshenv  dirs.txt      perl5
..  .bashrc        .config  .kshrc  .login  .procmailrc     .zprofile  .zshrc   duolingo.txt

If you use the -l flag it will give more details:

$ ls -l /home/linux/ieng6/ln6w/ln6w
total 588
-rw-r----- 1 ln6w ieng6_ln6w 592096 Jan 27 10:27 dirs.txt
-rw-r----- 1 ln6w ieng6_ln6w      0 Jan 27 10:01 duolingo.txt
drwxr-xr-x 2 ln6w ieng6_ln6w   4096 Jan 27 08:21 perl5

And you can combine these flags, so ls -la will produce detailed output about all files including hidden ones:

$ ls -la /home/linux/ieng6/ln6w/ln6w
total 652
drwx------  6 ln6w ieng6_ln6w   4096 Jan 27 10:27 .
drwxr-sr-x 51 ln6w ieng6_ln6w   4096 Jan 27 09:53 ..
-rwxr-x---  1 ln6w adm           975 Jan 18 14:56 .bash_profile
-rw-r-----  1 ln6w ieng6_ln6w 592096 Jan 27 10:27 dirs.txt
-rw-r-----  1 ln6w ieng6_ln6w      0 Jan 27 10:01 duolingo.txt
drwxr-xr-x  2 ln6w ieng6_ln6w   4096 Jan 27 08:21 perl5

Occasionally, flags will demand an additional argument. This guidebook is rendered into html using the following unix command (using the pandoc command, which is not standard Unix, but is among humanity’s greatest accomplishments):

pandoc -s --toc -N -o unix/index.html unix/index.md

In this case, the -s --toc -N flags set program options, and the -o takes an obligatory filename, which is the desired location of the command’s output.

Some commands can take numerical flags, too. head file.txt, used without a flag, prints the first 10 lines of a file. But head -76 file.txt prints the first 76 lines (the more portable spelling of the same thing is head -n 76 file.txt).
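As a rough check of how head’s numerical flag behaves, the sketch below builds a 100-line file and counts what comes back. It assumes the seq command is installed (it is on most Linux and MacOS systems), and uses the -n spelling of the flag, plus a pipe into wc -l to do the counting.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Make a 100-line file, one number per line.
seq 1 100 > file.txt

head file.txt | wc -l          # 10 lines by default
head -n 76 file.txt | wc -l    # 76 lines
head -n 3 file.txt             # just the first three lines: 1, 2, 3
```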

Not all commands require flags, and not all flags require arguments, but it’s important to understand what they are when they do show up. To learn the flags and arguments for a given command, use the man command to read the manual.

Basic Unix Commands

exit - ‘exit this environment’

If you’re logged into a Unix machine and don’t want to be logged in anymore, exit is your friend. This will close an ssh connection, or if you’re working locally, close your terminal window. But remember, if a program’s gone off the rails, you’ll want to use Ctrl+c instead.

pwd - ‘print working directory’

pwd tells you where you are.

$ pwd
/home/linux/ieng6/ln6w/ln6w

This means you’re in the folder /home/linux/ieng6/ln6w/ln6w

whoami - ‘who am i’

whoami tells you who you are, that is, it returns your username.

$ whoami
ln6w

Useful in case of identity crises, or if you forget which machine you’re logged into.

date - ‘what time is it?’

The date command simply returns today’s date, time, and timezone.

$ date
Sun Jan 27 12:56:57 PST 2019

This isn’t super useful when you’re just doing work, but it can be quite helpful, particularly with >>, for dating individual files or versions of files.
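Here is one way that might look in practice, as a sketch with an invented log file name. The >> operator appends to a file rather than replacing it. The last two lines go slightly beyond what this guide has covered so far: $(...) inserts a command’s output into another command, and date’s +FORMAT argument controls how the date is printed.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Append a timestamp, then a note, to a running log file.
date >> worklog.txt
echo "Finished cleaning the transcripts" >> worklog.txt

# Later on, append again; the earlier lines are kept.
date >> worklog.txt
echo "Started tagging" >> worklog.txt

cat worklog.txt

# date can also help build a dated filename, e.g. notes_2023-01-27.txt.
touch "notes_$(date +%Y-%m-%d).txt"
ls
```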

echo - ‘Display text in terminal’

echo simply repeats whatever text you give it back into the terminal.

$ echo "I want this text displayed in a terminal"
I want this text displayed in a terminal

This isn’t as useful in interactive use, but it’s a nice way to add information to a file (using something like echo "I want this text displayed in a terminal" > ~/Downloads/tmp.txt) or to display information in a shell script.

ls - ‘list directory contents’

ls tells you what’s in a directory. Although you can include a path as an argument (‘What’s in this folder?’), the default is to return what’s in the current directory.

$ ls
perl5  samplefile.txt  samplefile2.txt  samplefolder  samplefolder2

You can use the -l flag to give more information (long form), and the -a flag to show all files, including ‘hidden’ files.

$ ls -l
total 12
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile2.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder2

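The -l view also gives you information about the files’ types and permissions (drwxr-xr-x, where a leading d marks a directory and a leading - marks a regular file), link counts (2 or 1), owners (ln6w), groups (ieng6_ln6w), sizes in bytes (4096), and modification dates, as seen above.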

Note also that ll is a commonly available shortcut (usually a shell alias) for ls -l.

cd - ‘change directory’

cd lets you move into a new directory, specified as an argument. You can specify the directory name in one of two ways. Let’s say you’re in /home/linux/ieng6/ln6w/ln6w, and you want to move into ‘samplefolder’ which is inside that directory. You can do this using the ‘local path’ (“the path to this file from here”). Here’s the command, with a pair of pwd commands to either side so you can see the change. We specify the folder by just giving its name, and because it’s in the current directory, we go there:

$ pwd
/home/linux/ieng6/ln6w/ln6w
$ cd samplefolder
$ pwd
/home/linux/ieng6/ln6w/ln6w/samplefolder

We can also do the same exact thing, specifying the ‘absolute path’, the path to get to that folder from the root of the system. This cd command would get you to that exact folder, from anywhere.

$ pwd 
/home/linux/ieng6/ln6w/ln6w
$ cd /home/linux/ieng6/ln6w/ln6w/samplefolder
$ pwd
/home/linux/ieng6/ln6w/ln6w/samplefolder

Note that absolute paths will always start with /, which is the ‘root’ of the server.

Also, it’s worth noting that cd, when used without any ‘arguments’, will take you to your home folder (which is also given the symbol ~). And cd .. will always take you back up a level.
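The sketch below walks through those two shortcuts in a scratch folder (the nested ‘inner’ folder is invented for the example), with pwd after each step so you can watch your location change.

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/samplefolder/inner"

cd "$scratch/samplefolder/inner"
pwd      # ends in samplefolder/inner

cd ..    # up one level
pwd      # ends in samplefolder

cd       # no argument: back to your home folder
pwd
```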

mkdir - ‘make directory’

mkdir creates a directory, whose name is given as an argument, in the current folder. Here’s an example, making a new directory called ‘samplefolder3’ with an ls -l to either side to illustrate the change:

$ ls -l
total 12
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile2.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder2
$ mkdir samplefolder3
$ ls -l
total 16
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile2.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder2
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:01 samplefolder3

touch - ‘create or update modification date for a file’

The touch command does two related things for us. First, it can be used on an existing file to update the modification date of that file. See below example, where I’ll update the modification time for samplefile.txt:

$ ls -l
total 16
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile2.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder2
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:01 samplefolder3
$ touch samplefile.txt 
$ ls -l
total 16
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 09:04 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile2.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder2
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:01 samplefolder3

More usefully though, it can create a new file out of thin air, just by giving the filename as an argument:

$ ls -l
total 16
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 09:04 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile2.txt
$ touch samplefile3.txt
$ ls -l
total 16
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 09:04 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 08:30 samplefile2.txt
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 09:06 samplefile3.txt

The created file will be empty, but you can then open it with a text editor.

mv - ‘move or rename file or folder’

mv moves files. You give two arguments, one is the file’s current location and name, and the other is the file’s new location and name. You can do this to move files between folders, as below (note that I’ve thrown in a cd and ls so you can see what’s going on). Here, I’ll move samplefile.txt into the subfolder ‘samplefolder’:

$ ls -l
total 16
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 09:04 samplefile.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
$ mv samplefile.txt samplefolder/samplefile.txt
$ cd samplefolder/
$ ls -l
total 1
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:45 samplefile.txt

You can also use mv to rename a file or folder, by simply ‘moving’ the file from the original filename to the new one:

$ ls -l
total 1
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:45 samplefile.txt
$ mv samplefile.txt newsamplefile.txt
$ ls -l
total 0
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:45 newsamplefile.txt

This can be done within a folder, or while you’re moving a file to a new location outside the folder (e.g. mv myfile.txt myfolder/mynewfile.txt will both rename the file and move it into the ‘myfolder’ folder).

Note that if you ‘move’ a file to the same name and location as an existing file, it will overwrite the existing file! Unix has no ‘undo’ functionality. Be sure.
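To see how quietly this overwriting happens, here is a small sketch with two invented files. It takes a backup first using cp (covered in the next section), so that nothing is actually lost. If you’d rather be asked before anything is replaced, mv also accepts an -i flag which prompts for confirmation.

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "draft"    > notes.txt
echo "precious" > final.txt

# Make a backup copy before doing anything risky.
cp final.txt final_backup.txt

# This silently replaces final.txt with the contents of notes.txt.
mv notes.txt final.txt
cat final.txt           # prints: draft

# The original text survives only because of the backup.
cat final_backup.txt    # prints: precious
```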

cp - ‘copy file or folder’

cp copies files. Like mv, it takes two arguments, the starting name and the new name, but instead of moving the file, it creates a new copy. This command, for example, will create a copy of ‘unilingo.txt’ called ‘duolingo.txt’:

$ ls -l
total 8
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 09:59 unilingo.txt
$ cp unilingo.txt duolingo.txt
$ ls -l
total 8
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 10:01 duolingo.txt
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 09:59 unilingo.txt

To copy a directory (and all the files inside it), you’ll need to use cp -r, which makes the command ‘recursive’:

$ ls -l
total 8
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 09:59 unilingo.txt
$ cp -r samplefolder/ samplefolder2
$ ls -l
total 12
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 10:04 samplefolder2
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 09:59 unilingo.txt

rm - ‘remove file or folder’

rm removes files, or with the -r flag, entire directories.

$ ls -l
total 12
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 10:01 duolingo.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 10:04 samplefolder2
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 09:59 unilingo.txt
$ rm -r samplefolder2
$ rm unilingo.txt
$ ls -l
total 8
-rw-r----- 1 ln6w ieng6_ln6w    0 Jan 27 10:01 duolingo.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder

BE VERY CAREFUL: In Unix rm, there is no ‘trash can’, there is no ‘undo’. It’s just gone. If you accidentally rm -r your home folder, it’s over. It’s a good practice to use the -i flag, which makes you hand-confirm deletions as they happen, while you’re first learning. And be especially cautious with the * wildcard, which will delete everything in a folder.
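One cautious habit, sketched below with invented file names, is to preview a wildcard with ls before handing the very same pattern to rm. This is just one possible workflow, not the only safe one.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch keep.txt old1.tmp old2.tmp

# Step 1: preview exactly which files the wildcard matches.
ls *.tmp

# Step 2: only then delete them, using the same pattern.
rm *.tmp

ls    # only keep.txt remains
```

At an interactive prompt, rm -i *.tmp would instead ask about each file in turn.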

man - ‘read the manual’

The most important command you’ll find is man, which explains to you how another command works. To get information about a command, for instance, the rm command, just type man rm, then read through the document using <space> to advance and q to quit.

Combining commands

One of the nicest elements of the Unix philosophy is the ability to combine many small components into bigger workflows, either through shell scripting (that is, writing a file which contains many sequential commands), or using operators like the ones below.

| - ‘Use the output of this command as the input to the next’

The pipe operator is simple and extremely powerful. Put simply, it takes the output of one command and passes it into the next command as input. For instance, ls -l | tail -1 will take the output of the ls -l command, and pass it to tail -1. This will take the last line of the ls -l output, and thus, give you the last file (alphabetically) in a folder.

Or perhaps you’re wanting to concatenate many small files to view them all at once, but the resulting file is too big to comfortably read in the terminal using cat. So if you have readme1.txt, readme2.txt, and readme3.txt, running cat readme* | less will combine the files, then pass that output to the less command for more comfortable reading. Or you could pass to the word count (wc) utility to check the length, using cat readme* | wc.

You’ll likely find yourself regularly using | to string small processes together, doing very complex things from very simple chunks of code. It is the most powerful tool in Unix, in the author’s opinion, and is one you’ll want to know.
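The examples above can be tried out with a sketch like this one, which creates three small readme files (their contents are made up) and then pipes them through wc and tail.

```shell
scratch=$(mktemp -d)
cd "$scratch"

printf 'one\ntwo\n'   > readme1.txt
printf 'three\n'      > readme2.txt
printf 'four\nfive\n' > readme3.txt

# Concatenate all three files and count the total number of lines.
cat readme* | wc -l      # 5 lines in total

# List the folder and keep only the last line of the listing,
# which is the last file alphabetically.
ls | tail -n 1           # prints: readme3.txt

# Pipes can be chained: concatenate, keep the last two lines, count them.
cat readme* | tail -n 2 | wc -l
```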

< - ‘Use this file as input’

The < operator tells a command to take a given file as the input. So, for instance, the tr (translate characters) command does not take a file as an argument directly, and instead, will take an inputted file only using the < operator. So, to change every instance of ‘c’ to ‘k’ in a file called ‘unsorted.txt’, you’d use tr c k < unsorted.txt.

Note that many commands will take input either directly as an argument (e.g. cat myfile.txt) or using < (e.g. cat < myfile.txt). But when a command doesn’t accept a ‘file’ argument at all, as with tr, the < approach will still work.

> - ‘Redirect the output to a file’

The > operator, when placed after a command, will save the output of the command (that is, what would normally appear in the terminal) as a designated file. So, for instance, if you wanted to list the contents of a folder, but wanted to save the list to file for later use, you’d run ls -l myfolder > folderlist.txt. Although nothing will appear in the terminal, you’ll find that a new file has been created called ‘folderlist.txt’, and that that file contains what would otherwise have been printed there.

For text analysis and processing, this is a really powerful trick. Early in the tutorial, I gave the example of using a wildcard to concatenate (that is, append together into a single file) every file in a folder starting with ‘twi_’. Although you could just run cat twi_* to view it all in the terminal, it’s likely better to save to a single file for later use or easier reading, so cat twi_* > twitterfile.txt will create a new file.

This is occasionally useful when you’ve got a command which will take a while to run and produce a lot of text. For instance, the command find / -mtime -0.5 -ls > past12hrs.txt will return a listing of every single file which has changed on the entire system in the last 12 hours (0.5 days). This will create many entries, and take a while to process, so it’s better to run this directly into a file, here, past12hrs.txt.

Finally, remember that redirecting a command with > will overwrite the target file if it already exists! If you want to add the output to an existing file, you’ll want …

>> - ‘Append the output to a file’

>> is identical to >, except that instead of directing the output to replace a designated file, it will append the output to an existing file. So, if you’re planning to, for instance, list the contents of 10-12 directories, across a few different folders, into a file, you’ll want to run ls -l dir1 > output.txt (to create a clean file for the first one), then ls -l dir2 >> output.txt to append to the existing file.

This difference between > and >> is subtle, but absolutely crucial, and mixing the two up can cause a lot of missing data, or a lot of messy files.
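If you’d like to see the difference for yourself without risking real data, here is a small sketch. The filename output.txt and the echoed text are just placeholders, and it all runs in a temporary folder.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "first run" > output.txt     # creates output.txt
echo "second run" > output.txt    # overwrites it: "first run" is gone
echo "third run" >> output.txt    # appends after "second run"

# Show what survived, and count the lines.
cat output.txt
line_count=$(wc -l < output.txt)
echo "Lines in file: $line_count"
```

The file ends up with only two lines, because the second > silently threw away the first one.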

& and && - Running Multiple Commands

In Unix, you’ll occasionally want to run multiple commands at once. There are two ways to do so:

The & operator says “Do this in the background”, allowing you to run multiple commands in parallel. So, if you want to run a long task (e.g. find and list every file on the hard drive whose name contains “twi_”), but don’t want your terminal frozen while it runs, you would run find / -name "*twi_*" -ls > twifiles.txt &, where the final & just says ‘Run this, and let me know when it finishes’. This will return a process ID immediately, and when it eventually finishes, will return an ‘exit’ message to the terminal. But in the meantime, you can keep working.

&& allows you to run multiple commands in series, meaning that the second command won’t be run until after the first one has completed, and only if the first one succeeded. For instance, mkdir newdir2 && cd newdir2 will create a new directory, then enter it, in a single command. mkdir txtfiles && mv subfolder/*.txt txtfiles && rm -r subfolder will create a new folder called ‘txtfiles’, move all the text files out of a subfolder into it, then remove that subfolder. && is particularly useful when you expect a command to take a while to run, but you still have a few small tasks which need to be done afterwards, and you’d just as soon they happen automatically.

In practice, you should be careful using these operators at first. && only checks that the previous command exited without an error, not that it did what you intended, and you don’t want to delete a folder before you’ve confirmed that your first command worked properly. The time savings of typing command && nextcommand also aren’t all that large compared to “command, enter, nextcommand, enter”. But they’re out there, and useful.
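Here is a minimal sketch of how && behaves when the first command does and doesn’t succeed. The folder and file names are hypothetical, and it runs in a temporary folder.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# mkdir succeeds here, so the cd after && runs too.
mkdir newdir2 && cd newdir2
pwd

# Back up a level and try again. This time mkdir fails (newdir2 already
# exists), so the command after && is never run.
cd "$workdir" || exit 1
mkdir newdir2 2>/dev/null && touch should_not_exist.txt

if [ -e should_not_exist.txt ]; then
    result="second command ran"
else
    result="second command skipped"
fi
echo "$result"
```

This is why chaining something destructive like rm -r after && is safer than running it blindly, although it still only protects you from commands that fail loudly.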

Editing Text Interactively in Unix

On a Unix system, there are usually a variety of programs usable for editing text interactively inside a terminal interface.

nano

nano is very user-friendly, and allows more or less anybody to edit text in Unix. The only ‘trick’ you need to know is that the commands at the bottom of the window (listed as ^X) are triggered using the <control> key and the listed letter.

To edit a file in nano (or create a new file with this name), you’ll type nano thefilename.txt.

So, when you open nano, you can start typing inside the box. When you’re finished, to save your edited file and exit, you’ll type CTRL+O (‘Write out’), specify a filename (if you haven’t done so when you started the program), then <Return>, then CTRL+X (‘Exit’).

vim

vim (or its progenitor, vi) is an unquestionably powerful, fast, free, and rather weird editor. It’s a ‘modal’ editor, which means that there are two different ‘modes’ for interacting with the document, with one transforming the keyboard keys into commands for moving rapidly around the document, and the other mode allowing text entry. This seems silly at first, but in practice, it’s extremely powerful for moving quickly around a document, and there are few faster ways to edit text.

emacs

emacs is another very powerful and free editor, which is not modal (in its default configuration), but has keystrokes, commands, and packages allowing you to do nearly anything without leaving the editor. This makes emacs bulky and occasionally a bit slow, but extensible and powerful, and it is a world-class plaintext editor.

Which editor is best?

Whoa there, are you trying to start a holy war or something? :)

One’s choice of unix text editor is intensely personal, and depends as much on the person’s outlook on the world in general as it does on the relative merits of the programs. Anybody who’s spent any amount of time programming or scripting will have opinions, and it’s useful to be familiar with one of either emacs or vim for just this kind of work, as it’s likely that any given server will have one of them to work with.

Additionally, remember that many people use other editors (e.g. BBEdit, SublimeText, TextMate, etc.) which aren’t available on Unix per se, but may be able to access remote files on a unix machine. If you’re working on a Mac, Windows machine, or Linux Desktop, there are more options, although emacs and vim still have a lot to offer even there. In fact, this guide is being written in emacs on the author’s Mac, as that’s the author’s editor of choice, but he spent many years in vim.

The most important thing, ultimately, is that you try a number of options, then choose the one that makes the most sense with your brain and workflow. Remember, there’s no correct answer.

Basic Text Reading Tools

As linguists, you’ll be doing a fair amount of text reading and manipulation.

cat - ‘output the file into the terminal’

cat offers you a simple way to read a short file. Just use cat and the filename, and it’ll print the contents directly into the terminal. So, if you want to read the file willsfavoritepoem.txt, just run cat willsfavoritepoem.txt.

You can also use the * wildcard to use cat to concatenate, that is, append end-to-end, many files together. So if you have readme1.txt, readme2.txt, and readme3.txt, running cat readme* will print all three files’ content into the terminal. cat is particularly useful when combined with the | and > operators. cat readme* > readme_combined.txt will combine them into one new file called ‘readme_combined.txt’, and cat readme* | wc will give you a wordcount of the three files combined, just to give a pair of brief examples.

The problem with cat is that big files are often too big for your terminal to scroll back and read. So, for big files, you’ll want to use…

less - ‘read a long file’

less offers a better experience for reading long files, allowing you to open a file using less willsfavoritepoem.txt, then read through the document using <space> to advance, u to back up, and q to quit.

If you don’t have less on your system, you’re likely to have more, which does the same thing, but without the ability to scroll back. (Yes, in this case, more is actually less)

head and tail - ‘Display the start/end of a file’

The head and tail commands exist to do just one thing: to display the edges of a file. Calling the head command with a file as an argument (e.g. head reallylongfile.txt) will display the first 10 lines of that file. Calling tail reallylongfile.txt will display the final 10 lines.

Note that both head and tail can take a numerical argument as well indicating the number of lines to display, so head -40 reallylongfile.txt will display the first 40 lines.

These commands are useful for previewing files, but it’s also often handy to grab chunks of files by number and similar tricks. For instance, head -60 myfile.txt | tail -10 will give you lines 51-60 of a file (there are other ways to do this using sed and others, but these commands are often handy).
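A quick way to convince yourself which lines a head and tail combination returns is to try it on a file where every line contains its own line number. This sketch builds such a file (the name numbered.txt is made up) using seq.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# seq writes the numbers 1 to 100, one per line, so each line's
# content is also its line number.
seq 1 100 > numbered.txt

# Keep the first 60 lines, then the last 10 of those: lines 51 to 60.
chunk=$(head -n 60 numbered.txt | tail -n 10)
echo "$chunk"

# The same range pulled out with sed, for comparison.
sed_chunk=$(sed -n '51,60p' numbered.txt)
```

Note that -n 60 is the more explicit spelling of -60, and both forms are widely supported.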

Unix Tools for Corpus Work

As you’re doing natural language processing or corpus work, some tools are particularly useful above and beyond file reading tools like cat.

egrep is a simple regular expression search program, and an incredibly useful tool for working with corpora. For the remainder of this explanation, we’re going to pretend that we’re located in the ‘enronsent’ folder which contains a working copy of the EnronSent corpus, and we’re wanting to examine the data there.

egrep takes two arguments, the expression, and the file to search.

To perform a basic search for the word ‘corruption’ across all the files in the folder, you’d type egrep 'corruption' * and hit enter. Because EnronSent is split across 44 text files, we’ll use the * wildcard in all of these searches, but you could just as easily search an individual file (e.g. egrep 'corruption' enronsent33 would only search for the word corruption in file ‘enronsent33’).

$ egrep 'corruption' *
enronsent08:enlighten you on the degree of corruption in Nigeria.
enronsent13:courts in Brazil which are generally reliable and free of corruption (e.g., 
enronsent17:??N_POTISME ET CORRUPTION??Le n,potisme et la corruption sont deux des prin=
enronsent18:electoral corruption and fraud has taken place, a more balanced Central 
enronsent20:by corruption, endless beuacracy, and cost of delays.  These "entry hurdles" 
enronsent20:Turkish military to expose and eliminate corruption in the Turkish energy=
enronsent21:              employees, and corruption.  The EBRD is pushing for progress
enronsent21:              government has alleged that corruption occurred when the PPA
enronsent22:good, safe and clean fun?  didn't add to your corruption score?  is that the 
enronsent22:how did you do on the corruption test?
enronsent29:>  > > Paul Wilcher - Attorney investigating corruption at Mena Airport with
enronsent32:free from corruption.  In particular, foreign parties have been treated
enronsent35:* Protect your code from hackers or unintentional corruption
enronsent37:corruption. They are not saints =01) the Indonesian government removed the=
enronsent37:Turkish military to expose and eliminate corruption in the Turkish energy=
enronsent41:system. E-mail may be susceptible to data corruption, interception and
enronsent41:corruption, interception or amendment or the consequences thereof.

This gives us every line with the word ‘corruption’ in it from the entire corpus, labeled by file. Note that this is giving us the entire line, not just the word itself, which makes interpreting the words much easier.

To make the search results case insensitive (e.g. returning ‘Corruption’, ‘corruption’, or ‘CoRrUpTiOn’ above), you’ll want to use the -i flag.

Importantly, egrep uses regular expressions to search. This allows you to perform more intelligent searches by specifying your search pattern in a more complex and nuanced way. Although the internet does not require yet another RegEx tutorial, so I won’t belabor them here, they are remarkably useful. One particularly useful resource is https://regexr.com/, which provides an interface to check your regular expressions in real time and make sure they’re matching what you hope they will.

To give an example of this, to extract all instances of the word ‘cat’ in both singular and plural form, as well as “catnip”, you could use egrep -w 'cat(s|nip)?' *, which would return all three forms. The ? makes the ending optional, so that bare ‘cat’ matches too, and the -w flag restricts matches to whole words, so that ‘category’ doesn’t sneak in. Provided you’re not looking for more detailed syntactic, part-of-speech, or verb sense information, you can get a remarkable amount of information from a few careful greps including inflectional markers.

It’s worth noting that you can use egrep -n to also get line numbers, which can then be combined with sed -n to extract a range of lines. If, for instance, you get an interesting hit from egrep at line 29076 of file enronsent22, you can then follow up with sed -n '29070,29080 p' enronsent22 to extract more context (from line 29070 to line 29080).
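As a rough sketch of that ‘find a hit, then pull context’ workflow, here is a version that runs on a tiny made-up file (sample_corpus.txt) rather than the real corpus. It uses grep -E, which is the same thing as egrep on modern systems, and the one-line context window is an arbitrary choice.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny hypothetical stand-in for a corpus file.
printf '%s\n' \
    'Thanks for the update.' \
    'The audit found no corruption.' \
    'See you at the meeting.' \
    'Lunch on Friday?' > sample_corpus.txt

# -n prefixes each hit with "linenumber:". head keeps the first hit,
# and cut keeps only the number before the colon.
hit_line=$(grep -E -n 'corruption' sample_corpus.txt | head -n 1 | cut -d: -f1)

# Build a window of one line either side, without going above line 1.
start=$(( hit_line > 1 ? hit_line - 1 : 1 ))
end=$(( hit_line + 1 ))

# Print just that range of lines.
context=$(sed -n "${start},${end}p" sample_corpus.txt)
echo "$context"
```

When searching several files at once, grep also prefixes the filename, so the line number would be the second colon-separated field rather than the first.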

You’ll also want to read this tutorial from Nikolaj Lindberg which offers some similar unix information.

wc - ‘word count’

wc, much like its British namesake, is a very useful thing to have around when you need it, as it provides a simple way to count words, lines, and characters in a file. By default, wc will output the number of lines, words, and characters (assuming single-byte characters) in the input file(s).

$ wc enronsent22
   50000  294705 1771571 enronsent22

Very often, you’ll want to use wc on the output of another command. For instance, to count the number of lines, words, and characters in the EnronSent corpus, you could…

$ cat enronsent* | wc
 2205910 13810266 88171505

You can also use the -l, -w, or -c operators to only return the number of lines, words, or characters, if that’s all you need.

One of my most common uses of the command is something like egrep 'searchterm' * | wc -l, which, because egrep returns one line per matching line, will instantly return the number of lines containing ‘searchterm’ in the dataset (a line with two hits still only counts once). You’ll want to confirm that your search term is actually doing what you want, so it’s not unreasonable to think of doing egrep 'searchterm' * > searchterm_results.txt && wc -l searchterm_results.txt instead.
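It’s worth knowing exactly what this pipeline is counting. The sketch below, on a made-up three-line file, compares counting matching lines against counting individual matches using grep’s -o flag, which prints each match on its own line. grep -E is used here as the modern equivalent of egrep.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: the first line contains 'cat' twice.
printf '%s\n' \
    'the cat saw the other cat' \
    'no felines here' \
    'one more cat' > animals.txt

# Counts lines that contain at least one match.
matching_lines=$(grep -E 'cat' animals.txt | wc -l)

# -o prints every match on its own line, so this counts occurrences.
occurrences=$(grep -E -o 'cat' animals.txt | wc -l)

echo "Matching lines: $matching_lines"
echo "Occurrences:    $occurrences"
```

For frequency work, the second number is usually the one you actually want.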

One completely unrelated use of wc is with the -c operator, which prints out the number of bytes in a file. This may sound silly at first, but then you realize that find . -mtime -365 -type f | xargs wc -c adds up the size of every file modified in the last year, in one single line. Which… omg. (Note the use here of ‘find’ to identify files, and the piping to xargs, which handles large numbers of arguments more gracefully.)

sort - ‘sort the file by contents’

Sometimes, you’ll have a file containing a list of items, and you’ll need to sort it. The sort command does just that. Take this list of animals Will finds cute:

$ cat unsorted.txt 
cats
chickadees
velociraptors
dogs
hamsters
squirrels
jumping spiders
lizards
owls

We can use the sort command to display the file in sorted order:


$ sort unsorted.txt 
cats
chickadees
dogs
hamsters
jumping spiders
lizards
owls
squirrels
velociraptors

Note that this hasn’t modified the file, just displayed it in sorted order. To save the sorted output, you’d want to use sort unsorted.txt > sorted.txt. And of course, you’ll want to check man sort for all of the various options, including reversed order -r and other alternative orderings.

tr - ‘translate characters’

tr is a textbook unix ‘small utility’ which can be quite helpful, transforming single characters into other ones (e.g. turn all instances of ‘c’ into ‘k’). tr takes the file argument using <, so the above transformation would look like tr 'c' 'k' < unsorted.txt.

Perhaps the most useful case of this is transforming newline characters \n into spaces, to bring a multi-line file into a single line file: tr '\n' ' ' < unsorted.txt. It’s a perfect example of a stupid little Unix command which could save a person hours of manual work, or a trip into a heavier text editor for find-and-replace. Speaking of which…
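Both of those tr transformations can be tried safely with a sketch like this one, which uses a short made-up list rather than your real files.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A short hypothetical list, one item per line.
printf 'cats\nchickadees\ndogs\n' > animals.txt

# Swap every 'c' for 'k'. tr reads from the file via <.
swapped=$(tr 'c' 'k' < animals.txt)
echo "$swapped"

# Replace each newline with a space, putting everything on one line.
one_line=$(tr '\n' ' ' < animals.txt)
echo "$one_line"
```

Note that the final newline also becomes a space, so the one-line version ends with a trailing space.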

sed - ‘find and replace’

The sed command (‘stream edit’) is a beautiful tool for modifying text files. To use it, you’ll format a request as sed 's/oldtext/newtext/g' inputfile.txt. To give an example, we’ll use this sample text, from the EnronSent corpus:

$ cat sampletxt.txt 
Hi Elyse - I just spoke with Ramesh Rao (associate dean).  He has not yet
made the Excellence Fund awards, which are also part of the Enron endowment.
He would like to have the dinner in the fall, when all of the awardees have
been chosen.  He felt that a dinner the end of April would conflict with a
major E-commerce conference we're having here, and the Enron awards would
get the attention they deserve in the fall.  The other advantage of a fall
dinner is that the new students will be here, and if we have a big

Now, we can replace every instance of ‘the’ with ‘gebleeble’ using sed 's/the/gebleeble/g' sampletxt.txt:

$ sed 's/the/gebleeble/g' sampletxt.txt
Hi Elyse - I just spoke with Ramesh Rao (associate dean).  He has not yet
made gebleeble Excellence Fund awards, which are also part of gebleeble Enron endowment.
He would like to have gebleeble dinner in gebleeble fall, when all of gebleeble awardees have
been chosen.  He felt that a dinner gebleeble end of April would conflict with a
major E-commerce conference we're having here, and gebleeble Enron awards would
get gebleeble attention gebleebley deserve in gebleeble fall.  The ogebleebler advantage of a fall
dinner is that gebleeble new students will be here, and if we have a big

Note, though, that it also matched word-internal ‘the’ sequences (e.g. ‘they’ became ‘gebleebley’). This can be fixed by using spaces to either side: sed 's/ the / gebleeble /g' sampletxt.txt. Also note that this has just printed the output to the terminal; you’ll want to use > to save the output.

sed is also perfectly happy to take regular expressions, although the exact syntax, frustratingly, varies among operating systems. On MacOS, you’ll want to use the -E flag to mark ‘extended’ regex syntax for some things, but you could easily replace multiple words, as in:

$ sed -E 's/ (the|that|he|they|we) / gebleeble /g' sampletxt.txt
Hi Elyse - I just spoke with Ramesh Rao (associate dean).  He has not yet
made gebleeble Excellence Fund awards, which are also part of gebleeble Enron endowment.
He would like to have gebleeble dinner in gebleeble fall, when all of gebleeble awardees have
been chosen.  He felt gebleeble a dinner gebleeble end of April would conflict with a
major E-commerce conference we're having here, and gebleeble Enron awards would
get gebleeble attention gebleeble deserve in gebleeble fall.  The other advantage of a fall
dinner is gebleeble the new students will be here, and if gebleeble have a big

Note that this remains case sensitive (so ‘the’ was replaced, but ‘The’ wasn’t). sed is very powerful, but due in part to the many implementations of it, you’ll likely find some frustration using it at first, and as always, Google and StackExchange are your friends. Also note that awk exists and does similar things, but with a bit more extensibility at the cost of a bit more opacity.
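Padding the search term with spaces misses words at the start or end of a line, or next to punctuation. One possible workaround, sketched below on a made-up sentence, is to match ‘any non-letter’ on either side and put those neighbouring characters back in the replacement. This is only one approach, and the file names are hypothetical.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'they put the other book on the table\n' > sentence.txt

# (^|[^[:alpha:]]) matches the start of the line or any non-letter.
# ([^[:alpha:]]|$) matches any non-letter or the end of the line.
# \1 and \2 put those neighbouring characters back around the new word.
sed -E 's/(^|[^[:alpha:]])the([^[:alpha:]]|$)/\1gebleeble\2/g' sentence.txt > replaced.txt

cat replaced.txt
```

Here ‘they’ and ‘other’ are left alone. One remaining weakness is that two targets separated by a single character (as in ‘the the’) share that character, so only the first gets replaced in a single pass.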

cut - ‘Extract a delimited column’

You’ll often be given tab or comma delimited data in your life. The cut command allows you to extract individual columns by number, without opening them in R or Google Sheets or a similar spreadsheet interface. For instance, in the CELEX corpus, the data are organized by columns, and delimited by <tab>:

$ head tc_celex.txt 
1   a   413887  2   P   '1  VV  1   P   @   V   @
2   a   422366  1   P   '1  VV  1               
3   a   8448    1   P   '1  VV  1               
4   A   422334  1   P   '1  VV  1               
5   AA  52  1   P   """1-'1"    VVVV    11              
6   AA  95  1   P   """1-'1"    VVVV    11              
7   aback   59  1   P   @-'b{k  VCVC    @b{k                
8   abacus  8   1   P   '{-b@-k@s   VCVCVC  {b@k@s              
9   abaft   0   1   P   @-'b#ft VCVVCC  @bAft               
10  abaft   2   1   P   @-'b#ft VCVVCC  @bAft   

To extract only the second column containing words, we would run cut -f2 tc_celex.txt (or the alternative below, which just does this for the first 10 lines):

$ cut -f2 tc_celex.txt | head
a
a
a
A
AA
AA
aback
abacus
abaft
abaft

You could extract columns 2 and 3 using cut -f2-3 tc_celex.txt. If you’ve got a different delimiter, you can use the -d flag to specify. The opposite of cut is paste, which combines files containing single “columns” of data into a single tab-delimited file.

As an illustration of power, imagine you have a tab delimited class roster (with lots of other information) including ‘Last, First’ student names, and you want a single document in the format “Will Styler, Rebecca Scarborough, Pam Beddor, Andries Coetzee, Jelena Krivokapic, …”. You could first extract the name column using cut, then extract the last and first names individually using cut -d "," -f 1 and cut -d "," -f 2, then paste them together in the reverse order using paste and a single space delimiter. Then finally, tr could turn every newline into a comma, and you have your list. This seems like a silly example, but imagine the amount of pain this would be if you had hundreds of thousands of names.
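The roster workflow could be sketched roughly as follows. The roster file, its column layout, and the intermediate file names are all invented for illustration, and the two small sed calls (to strip the space left after the comma and to tidy the final separators) are my own additions to the steps described.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical roster: ID <tab> "Last, First" <tab> department
printf '101\tStyler, Will\tLinguistics\n' > roster.tsv
printf '102\tScarborough, Rebecca\tLinguistics\n' >> roster.tsv
printf '103\tBeddor, Pam\tLinguistics\n' >> roster.tsv

# Pull out the name column (column 2; tab is cut's default delimiter).
cut -f2 roster.tsv > names.txt

# Split on the comma. The first name keeps a leading space, so strip it.
cut -d ',' -f1 names.txt > last.txt
cut -d ',' -f2 names.txt | sed 's/^ //' > first.txt

# Paste first and last back together, separated by a single space.
paste -d ' ' first.txt last.txt > full.txt

# Turn newlines into commas, drop the trailing comma, and add spaces.
name_list=$(tr '\n' ',' < full.txt | sed -e 's/,$//' -e 's/,/, /g')
echo "$name_list"
```

Intermediate files make each step easy to inspect with head, which is worth the clutter when you’re building a pipeline like this for the first time.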

uniq - ‘remove duplicate lines’

uniq is a nice little tool designed to remove (and with the -c flag, count) adjacent duplicated lines in a file. Note carefully the ‘adjacent’ part: this means that if two lines aren’t right next to each other, it won’t remove them. But it does have its uses.

Among other things, it makes for a cheap and easy little word counter, when combined with tr and sort. Let’s say we wanted to count the number of times each word occurred in the above ‘sampletxt.txt’. We could write the below command, which would first open the file, then transform spaces into newlines (putting each word onto a line), then sort (so that duplicates are adjacent), then count the duplicates:

$ cat sampletxt.txt | tr ' ' '\n' | sort | uniq -c
   3 
   1 (associate
   1 -
   1 April
   1 E-commerce
   1 Elyse
   2 Enron
   1 Excellence
   1 Fund
   3 He
   1 Hi
   1 I
   1 Ramesh
   1 Rao
   1 The
   4 a
   1 advantage
   1 all
   ...

Of course, this is not the best way to get word counts (it’s better to use NLTK’s built-in tokenizer and ngrams() function), but it’s a nice illustration of the power of stringing together simple commands to do complex tasks.
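A small extension of the same idea is to sort the counts so the most frequent words come first. This sketch uses a made-up one-line file, and adds a second sort with -r (reverse) and -n (numeric) onto the end of the pipeline.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'the cat and the dog and the bird\n' > tiny.txt

# One word per line, sort so duplicates are adjacent, count them,
# then sort by count with the largest numbers first.
tr ' ' '\n' < tiny.txt | sort | uniq -c | sort -rn > counts.txt
cat counts.txt

# The first line is the most frequent word.
top_word=$(head -n 1 counts.txt)
echo "Most frequent: $top_word"
```

Add | head -n 20 to the end of a pipeline like this on a real corpus and you have a quick-and-dirty top-20 list.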

tgrep - ‘Tree Grep’

If you’re dealing with existing syntactic trees or POS-tagged data, you might want a specialized program for searching them. tgrep is exactly that. It’s not basic to any flavor of Unix, installation isn’t trivial, and it requires specifically formatted trees, but if you’re asking questions like “How often does leverage get used as a verb in these tagged data?”, there’s nothing better.

Unix Shell Scripting

Although the shell is in many ways designed to be run interactively, there’s nothing stopping you from building a list of commands, to be run one after the other, to do a more complex task. To do this, you’ll simply create a new text file, preface it with #!/bin/bash (if you’re in the bash shell), then add the commands, one after the other, and including if you’d like lines starting with # which are human-readable comments not interpreted by the computer. Here’s a basic shell script that I use to do routine maintenance on my system, called ‘cleanup.bash’:

#!/bin/bash
# Clear the screen
clear
# Clear DNS caches
echo "Clearing DNS Caches"
dscacheutil -flushcache
sudo killall -HUP mDNSResponder
# Clear out the Quicklook cache
echo "Clearing Quicklook Caches"
qlmanage -r
# Remove log files
echo "Clearing Logs"
sudo rm -rv /Users/wstyler/Library/Logs/*
sudo rm -rv /private/var/log/*
sudo rm -rv /var/log/*
sudo rm -rv /Library/Logs/*
# Update my packages with Homebrew
echo "Updating packages"
brew update
brew upgrade
# Move back to the home folder
cd

Once this is done, you can either run it with bash (bash cleanup.bash), or you can make it executable (chmod +x cleanup.bash) and run it directly from the folder (./cleanup.bash).
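If you’d rather practice on something less dramatic than deleting logs, here is a sketch that writes a tiny script and runs it both ways. The script name count_lines.bash and its contents are made up for this example.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write a small script to a file. The quoted 'EOF' keeps the shell
# from expanding $1 and $(...) while the file is being written.
cat > count_lines.bash << 'EOF'
#!/bin/bash
# Report how many lines are in the file given as the first argument.
echo "Lines: $(wc -l < "$1")"
EOF

# Something for the script to count.
printf 'a\nb\nc\n' > letters.txt

# Option 1: hand the script to bash explicitly.
bash count_lines.bash letters.txt

# Option 2: mark it executable, then run it by its path.
chmod +x count_lines.bash
script_output=$(./count_lines.bash letters.txt)
echo "$script_output"
```

The ./ matters: the shell only looks for bare command names in the folders listed in your PATH, and the current folder usually isn’t one of them.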

You don’t always need to shell script

With that said, it’s often the case that the best way to write a script to do things on a Unix system is to do it in Python, and then use one of the many internal-to-python means to execute the bash lines you need. You often get more flexibility this way, and put simply, python is a more kind place for complex scripting with various control structures, string manipulation and parsing, and other such things. Unless all you’re doing is running sequential bash commands like the above, or making minimal use of conditional logic, your life is often better doing what you need in Python with occasional subprocess.run bash calls.

Advanced Unix Commands

These commands aren’t ‘necessary’ to most everyday work, but can make life substantially easier. If you’re just getting started (e.g. you’re in Will’s LIGN 6), you can skip this section, but the tools are absolutely helpful, and may change your world someday.

tee - Write to file and terminal

tee is one of those commands that’s seldom useful, but really useful when needed. If you’re wanting to write something to a file as well as writing it to the terminal, the tee command is designed to do just that. So, for instance, if you wanted to combine three files (all ending with _log.txt), save the combined file, and print them to the terminal, you could run cat *_log.txt | tee logscombined.txt.

Practically speaking, the most common use for tee is to replace > in a chain of piped commands. For instance, cat *_log.txt | tee logscombined.txt | sort | tee logsorted.txt | wc -l would produce two files, one containing the combined unsorted output, one containing sorted output, and because it also continues the pipe, we get a line count at the end.

You can accomplish the same things with, e.g., cat *_log.txt > logscombined.txt && sort logscombined.txt > logsorted.txt && wc -l logsorted.txt, but tee just feels way better.
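Here is the tee pipeline from above as a runnable sketch, using two tiny made-up log files so you can inspect what lands where.

```shell
# Work in a temporary folder so we don't clutter anything real.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hypothetical log files.
printf 'pear\napple\n' > a_log.txt
printf 'fig\n' > b_log.txt

# Each tee saves a copy of whatever flows past it, then passes the
# text along to the next command in the pipe.
line_total=$(cat *_log.txt | tee logscombined.txt | sort | tee logsorted.txt | wc -l)
echo "Total lines: $line_total"

# The first file holds the unsorted text, the second the sorted text.
cat logscombined.txt
cat logsorted.txt
```

Note that the output filenames were chosen so they don’t match the *_log.txt pattern themselves, which avoids feeding a file back into its own pipeline on a second run.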

screen - Terminal multiplexer

screen is a much more advanced command, but as you start running larger jobs, it’s absolutely crucial. Let’s say that you’re about to start a massive job, for instance, a large R script, or a server migration using rsync. In a perfect world, you’d be able to stay connected to the server, with a terminal open, over the course of the minutes/hours/days that the job will run, so that you see all of the output in one, compact place. But particularly in the laptop era, we often move from network to network, sleep our machines, etc. This is where screen comes in.

When you know you’re about to start a long process on a remote server, first type screen. This will ‘clear’ your terminal, but more importantly, you’ve just entered a ‘second’ nested terminal. All the settings and commands will be the same, but you’re inside a layer of abstraction.

Now, you’ll run your ‘huge’ command which you expect to take some time. Once you’re satisfied that it’s running as expected, the magic happens: press Ctrl+A and then d. You’ll get a new message reading something like…

[detached from 8798.pts-0.yourserver]

… and you’ll be kicked back out to another prompt. And in fact, you can even log out of the server. But the big long command is still running on the machine!

Once you’re back at home, nestled into your couch with a cup of tea, you can check back in on your process by reconnecting to the server, then running screen -r. This will ‘resume’ your screen session, and you’ll find yourself ‘back inside’ the terminal where the big command is running, with its output presented there. And you can exit back out again, if you’d like, again with Ctrl+A then d.

Once you’re done in that window, Ctrl+A and then k will kill that subscreen, as will ‘exit’.

There is much more to screen than I’ve covered here, and there’s a very nice manual for the command here, but this covers my most common use case, and will make huge computes (which don’t merit being pushed off to a cluster/job-scheduling system) much more tractable in your daily life.

Other Command-Line tools I can’t live without

These tools are not likely to be pre-installed on systems you’re working with, but are absolutely worth considering.

ansible - Automated Machine Configuration

Your system isn’t properly backed up if you can’t reproduce what’s installed, and a big part of reproducible science is knowing how every aspect of the work was done. Ansible exists to allow you to specify how your system should look in terms of what’s installed, versions, and configurations. Once you’ve written up the playbook used for your system, you can, ideally, reinstall and reconfigure everything in ‘run a command and then go get coffee’ sorts of ways. Because everything’s ‘idempotent’, meaning that running multiple times just ensures everything is as expected multiple times, you can even use it to keep two machines in shared state. It is an incredible tool, and merits the learning curve.

say - MacOS Text-to-Speech Tool

If you’re in a terminal on MacOS, you can use the say command to directly access the MacOS Text-to-speech engine. So, to have your Mac read you some Hamlet:

say "To be or not to be, that is the question"

You can combine this with the -o flag to save the results directly into a .aiff or .mp4 file. This isn’t the most useful thing ever, but it’s a great way to have some fun with a reasonable, modern TTS system.

Pro-Tip: If you happen to have ssh access to a friend’s Mac, you can make their computer start talking to them with nothing visible on the screen to indicate the source of the speech. Bonus points if you can convince them AI has finally emerged.

The say command will not be available on systems other than MacOS.

rsync - Folder synchronization tool

Although it’s nominally meant for synchronizing folders, rsync is a tool that’s amazing for many things. The simplest use for rsync is to synchronize a pair of folders, such that one overwrites the other. If you have, for instance, a large corpus of data in two places, and you’ve made a few small tweaks to it, but you’d rather not copy the whole thing over again, a simple rsync command can copy over only the files that have changed, bringing folder 2 to parity with folder 1.

You can also use rsync to copy folders from one computer to another, again without copying unchanged files. To update my website, I run…

rsync -azvhCL --exclude=.DS_Store --exclude='*.md' --exclude='*/.DS_Store' --delete ~/Documents/web/ wstyler@dss-sites.ucsd.edu:/home/websites/wstyler/

The various flags do things like excluding filetypes and specifying behaviors. --delete, for instance, means that files removed locally are also removed on the destination.

One “off-label” use of rsync is for copying huge amounts of data from one folder to another on a different system, or with a poor connection. I recently had to move around 1.75 TB of video and text files on a remote server via SAMBA (as I couldn’t log in directly), and much to my frustration, the destination server kept having connectivity issues. Using cp -r didn’t work, as every time the connectivity dropped, the transfer would fail mid-stream, and couldn’t ‘start over from where it got cut off’. rsync worked beautifully, though, as every time it cut out, I could just re-run the rsync command, it would figure out what had already been synced, and start from the file which was in progress. Eventually, after 5-10 restarts, I was able to copy the data, and life was good.

pandoc - Document conversion tool

pandoc is a treasure of humanity. From their website:

If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Pandoc can convert documents in (several dialects of) Markdown, reStructuredText, textile, HTML, DocBook, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Vimwiki markup, roff man, OPML, Emacs Org-Mode, Emacs Muse, txt2tags, Microsoft Word docx, LibreOffice ODT, EPUB, or Haddock markup to (a whole bunch of formats)

This document is written in Markdown, then transformed into HTML (with a custom header and footer) by pandoc, using the command below:

$ pandoc -B includes/header -A includes/footer -s --toc -N -o unix/index.html unix/index.md

Similarly, my dissertation was written using a combination of Markdown and LaTeX and rendered with pandoc. Pandoc is the glue which holds together my textual life.
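For a sense of what a smaller conversion looks like, here's a minimal sketch that turns a Markdown file into a standalone HTML page and a Word document. The file names (notes.md and friends) and the title are made up for the example; pandoc guesses the input and output formats from the file extensions.

```shell
# Make a tiny Markdown file to convert (hypothetical file name)
demo=$(mktemp -d)
printf '# Vowels\n\nSome *very* important notes about vowels.\n' > "$demo/notes.md"

# -s makes a standalone document (with a proper header), -o names the output file
pandoc -s --metadata title="Vowel notes" -o "$demo/notes.html" "$demo/notes.md"

# Same source, different extension, different format
pandoc -o "$demo/notes.docx" "$demo/notes.md"
```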

yt-dlp - YouTube Downloader

Sometimes, you need offline access to a YouTube video. This may be for your lecture slides, or to collect speech data, or otherwise. yt-dlp (a modern and, as of Feb. 2022, more performant fork of the better-known youtube-dl) is a great tool for pulling down videos, their audio, and more. But to get them into usable formats, you may need…

ffmpeg - Audio and Video conversion tool

When dealing with non-textual data, including audio and video, there are few tools more useful than ffmpeg. It allows you to capture and extract frames, convert from format to format, and even combine multiple videos into a single signal. And it’s completely free.
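As a small, concrete illustration, the sketch below uses ffmpeg to convert a recording to mono audio at a 16 kHz sampling rate. The input is a one-second test tone that ffmpeg generates itself, so the example doesn't depend on any real recording; the file names are placeholders.

```shell
demo=$(mktemp -d)

# Generate a one-second 440 Hz test tone to stand in for a real recording
ffmpeg -nostdin -loglevel error -f lavfi -i "sine=frequency=440:duration=1" "$demo/original.wav"

# Convert to mono (-ac 1) at a 16 kHz sampling rate (-ar 16000)
ffmpeg -nostdin -loglevel error -i "$demo/original.wav" -ac 1 -ar 16000 "$demo/mono16k.wav"

# To pull just the audio out of a video, -vn drops the video stream:
# ffmpeg -i lecture.mp4 -vn -ac 1 -ar 16000 lecture.wav
```

ffmpeg picks the output format from the file extension, so changing .wav to .flac or .mp3 is usually all it takes to get a different format.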

imagemagick - Image processing tool

There are few image-related tasks that ImageMagick cannot accomplish. It is a swiss army knife for anything image-related, and does amazing work.
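A quick sketch of a typical job, resizing and converting a figure. The main command is called magick in ImageMagick 7 and convert in older versions, so the snippet checks which one is present. The image names are placeholders, and the test image is generated by ImageMagick itself.

```shell
demo=$(mktemp -d)

# Use 'magick' if it exists (ImageMagick 7), otherwise fall back to 'convert'
im=$(command -v magick || command -v convert)

# Make a plain 400x300 white test image to stand in for a real figure
"$im" -size 400x300 xc:white "$demo/figure.png"

# Shrink it to half size and save it as a JPEG
"$im" "$demo/figure.png" -resize 50% "$demo/figure_small.jpg"
```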

R - Statistical Computing Environment

It’s worth noting that the same R Statistical Computing environment you’re used to using on the desktop works just fine via the Unix command line. Once installed, you can call R and enter the R console, running commands as usual. You can also run R scripts from the Unix command line using Rscript, allowing you to compose in a tool like RStudio, and run on somebody else’s (much more powerful) machine.
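A minimal sketch of that workflow: the snippet writes a tiny R script (the file name and the vowel durations are invented purely for illustration) and runs it non-interactively with Rscript, saving the output to a text file.

```shell
demo=$(mktemp -d)

# Write a tiny R script (in real life, you'd compose this in RStudio)
cat > "$demo/durations.R" <<'EOF'
# Made-up vowel durations in milliseconds
durations <- c(112, 98, 140, 125)
cat("Mean duration:", mean(durations), "\n")
EOF

# Run it from the command line and save the output
Rscript "$demo/durations.R" > "$demo/results.txt"
cat "$demo/results.txt"
```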

NLTK - Natural Language Toolkit

NLTK isn’t straight Unix, as it’s a library for Python, but it’s the best place to start in legacy approaches to Natural Language Processing, and in many cases, the best place to finish. I use it for everything from POS tagging, to tokenizing, to N-Gramming, all the way to drawing syntax trees.

jdupes

How do you handle the situation in which you have two folders which have 50,000 and 49,500 files, which are mostly identical? Well, you could go through manually, or you could use a tool like jdupes, remove all files in folder B which are identical to one in folder A, and then worry only about what’s different. It is one of those tasks that eats years of your life if done by hand, but goes very quickly with the right tool.
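To make the idea concrete without guessing at jdupes's options, here is a rough sketch of the same logic using only checksums: list every file in folder B whose contents match some file in folder A. This is a stand-in for illustration, not a description of how jdupes itself works, and the folder and file names are invented for the demo. It assumes sha256sum (part of GNU coreutils; on macOS, shasum -a 256 is the usual equivalent), and it only prints the duplicates, deleting nothing.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/A" "$demo/B"
echo "the same sentence" > "$demo/A/one.txt"
echo "the same sentence" > "$demo/B/copy_of_one.txt"
echo "something different" > "$demo/B/unique.txt"

# Checksum everything in folder A, keeping just the hashes
find "$demo/A" -type f -exec sha256sum {} + | awk '{print $1}' | sort -u > "$demo/hashes_A.txt"

# Keep the files in folder B whose hash also appears in folder A,
# then strip the hash so only the file path is left
find "$demo/B" -type f -exec sha256sum {} + \
    | grep -F -f "$demo/hashes_A.txt" \
    | sed 's/^[0-9a-f]*  //' > "$demo/duplicates_in_B.txt"
cat "$demo/duplicates_in_B.txt"
```

Once you trust the list, what remains in folder B is the part that actually needs your attention.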

Scraping Web Data to make Corpora

One common process in NLP research is pulling down web data in nicely formatted chunks, for later use in corpus work.

First, you’ll figure out your desired data. In some cases, you might be interested in a noisy-but-large amount of data. To do this, scraping Google’s search results might be the best approach. In other cases, you’ll want to use a specific subset of URLs that you’ve hand-picked (although remember, you want to balance your corpus for any relevant features). Regardless, we’ll assume that you’re starting with a file containing a bunch of URLs (scraping_urls.txt).

Then, your next step will be to scrape those URLs onto your local machine. You’ll want to remove as much of the ‘noise’ as possible, things like navigation bars, ‘Like on Facebook!’, links to other stories, and more. To this end, Justext is your best friend. It does both the downloading and the cleaning, and will output just the raw text of the website. Here it is, folded into a cute little chunk of bash shell script:

# Start the file counter at zero
count=0
# Read one URL per line (-r and IFS= keep backslashes and spaces intact)
while IFS= read -r p; do
    # Increment the counter
    count=$((count + 1))
    # Give the URL
    echo "$p"
    # Set a suffix
    suffix=".txt"
    # Turn that into a filename
    ffinal="scraped_url_${count}${suffix}"
    # Drop the URL into justext
    python -m justext -s English -o "$ffinal" "$p"
    # Add the URL to the start of the file
    printf 'URL:%s\n%s\n' "$p" "$(cat "$ffinal")" > "$ffinal"
done < ../scraping_urls.txt

The end result of running this will be a series of files, inside the working directory, each containing a URL and a big chunk of text. Enjoy!
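Once the loop finishes, it's worth a quick look at what you actually got, since some pages will come back empty or nearly so. Here's a small sketch, assuming the scraped_url_N.txt naming from the script above. The two sample files and the name empty_scrapes.txt are fabricated here so the commands have something to chew on.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'URL:http://example.com/a\nA reasonably long chunk of scraped text.\n' > scraped_url_1.txt
printf 'URL:http://example.com/b\n' > scraped_url_2.txt

# Word counts per file, smallest first, to spot scrapes that came back thin
wc -w scraped_url_*.txt | sort -n

# List the files that contain nothing beyond the URL line
for f in scraped_url_*.txt; do
    lines=$(wc -l < "$f")
    if [ "$((lines))" -le 1 ]; then
        echo "$f"
    fi
done > empty_scrapes.txt
cat empty_scrapes.txt
```

Anything that shows up in empty_scrapes.txt is a candidate for re-scraping or for dropping from the corpus.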

Version History

To Do