Associate Teaching Professor of Linguistics at UC San Diego
Director of UCSD's Computational Social Science Program
Using Unix for Linguistic Research
This document is designed to serve as a tutorial for Linguists who
are using Unix-based machines for corpus and other linguistic work. It
is not a comprehensive tutorial, nor will you leave this completely
competent in all aspects of Unix work. But it’s designed to
touch
on the concepts, commands, and techniques most useful
to linguists.
If you find errors, see issues, or use it for your course, please drop me an email.
Using Unix for Linguistic Research is completely free, and is available for use, both privately and as a part of classes. All I ask is that you share the guide by sharing this link (rather than the HTML file itself), so that people end up with the latest version.
Also, if you’re using the guide in your class or project, I’d love it if you’d send me an email (will at savethevowels dot org), so I can see where my work is going, and so I can get ideas as to how to improve it for the future.
Using Unix for Linguistic Research is licensed under a Creative Commons Attribution ShareAlike (CC BY-SA) License. More information on this license is available here, but in short, this license means that I’d appreciate if you’d attribute my work to me, and that any derivative works must also be creative commons licensed. Free should stay free.
Any questions, comments, concerns or corrections should be sent to me by email - will (at) savethevowels (dot) org.
Logging into a Unix-based remote server
Although on many operating systems, you’ll be able to work locally, you’ll often need to be able to log into a remote computer using a UNIX-style command line.
You’ll log into the server using ssh
, short for ‘secure
shell’. Unfortunately, this process looks a bit different depending on
your operating system.
UCSD Only: Getting your Username, Password, and Server Address
No matter your operating system, you’ll need to have login credentials for the computer you’re logging into. For most applications, you’ll already know your username and password, as it’s been given to you by your local SysAdmin, or perhaps you set it yourself. But regardless, to connect using any option, you’ll need to know the Username, Password, and Server Address.
For LIGN 6 at UCSD, to get your course-specific Username and
Password, you’ll want to go to https://sdacs.ucsd.edu/~icc/index.php, then under
Tools click “Account Lookup”. This will allow you to enter your
username and PID to get your login and set a password. There’s a chance it’ll list multiple accounts; the one you want here starts with ‘ln6’. You’ll need to log in and ‘Reset your password’, making sure to write down your username and password; you’ll need them later!
The server address for LIGN 6 is ieng6.ucsd.edu
, and if
your SSH client requires you to specify a ‘Port’, it’s 22.
If you’re using LinuxCloud, you’ll also need to enroll that account in Duo https://duo-registration.ucsd.edu/ by inputting your new username and password and connecting to your mobile device.
Logging in via ssh on MacOS and Linux
On the Mac, you’ll go to /Applications/Utilities, and then double click “Terminal”. This puts you in a fully-functioning Unix command line, and will allow you to either work locally (accessing the files on your own computer), or to enter an ssh command (as described below) and connect to the campus server.
If you’re using Linux, there’s a good chance you know how to open a Terminal. For the most part, any terminal (Gnome Terminal, Konsole, xTerm, eTerm) will work fine. And again, once it’s launched, this puts you in a fully-functioning Unix command line, and will allow you to either work locally (accessing the files on your own computer), or to enter an ssh command (as described below) and connect to the campus server.
Once you’ve got a terminal opened in either OS, you can either work
locally (as you’re already at a unix prompt), or type
ssh yourusername@ieng6.ucsd.edu
at the prompt to connect to
a remote server.
The first time you connect, you’ll see a message talking about the Server’s “RSA Key”. Just type “yes”. Then, you’ll enter your password. Nothing will show up as you’re entering your password, but don’t worry, it still works.
Once that happens, the server will automatically display an annoying
=====NOTICE=====
message with privacy information (if
you’re at UCSD). Once you hit q
to exit that, you’ll be
logged in.
Once you’re logged in, you’ll see a line ending with a
$
. This is the ‘prompt’, and now, you move on to the next
steps.
Logging in via ssh on Windows
Unfortunately, Windows users cannot do UNIX work locally (as Windows is not a Unix-based OS), and have some additional work to do to be able to ssh into our server.
As of Windows 10, there is an SSH client built into Windows, which allows you to log in to remote servers. To use it:
- Open Windows Terminal
- Type ssh yourusername@ieng6.ucsd.edu at the prompt to connect to our remote server.
- The first time you connect, you’ll see a message talking about the Server’s “RSA Key”. Just type “yes”. Then, you’ll enter your password. Nothing will show up as you’re entering your password, but don’t worry, it still works. Then, continue below.
If this isn’t working, if you’re using an older version of Windows, or if you’d like a slightly more robust terminal client, you can use PuTTY:
- Download PuTTY from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html, using the MSI ‘Windows Installer’ option on that website.
- Install PuTTY from the downloaded .exe file.
- After installation, find the PuTTY program in the Start Menu.
- PuTTY will open a new window. Enter the below settings…
- Under ‘Host Name (or IP address)’ enter your username followed by @ieng6.ucsd.edu (e.g. wstyler@ieng6.ucsd.edu).
- Under ‘Port’, keep the default value of “22”.
- Under ‘Connection Type’, select “SSH”.
- Note, this information will differ if you’re connecting to a different server than the UCSD LIGN 6 server
- To save these settings (so that you don’t have to reenter them
later)…
- Enter a name under ‘Saved Sessions’
- Click ‘Save’
- To re-use this information later on, just double click that name under ‘Saved Sessions’
- Click ‘Open’ to start your SSH connection.
- The very first time you connect to a new server, a popup will ask you to save the key. Choose ‘Yes’.
- The command window will open, and ask for your password. Type your
password and press enter.
- Nothing will show up as you’re entering your password, but don’t worry, it still works.
Once you’re connected through either method, the server will
automatically display an annoying =====NOTICE=====
message
with privacy information (if you’re at UCSD). Once you hit
q
to exit that, you’ll be logged in.
Once you’re logged in, you’ll see a line ending with a
$
. This is the ‘prompt’, and now, you move on to the next
steps.
For what it’s worth, Windows now also has the optional ‘WSL’ (Windows Subsystem for Linux) tooling (which I would argue is actually a Linux subsystem for Windows). This effectively installs a Unix-like virtual machine on your Windows machine, giving you some of the power of Unix applications and tools within Windows. Note that some tools won’t work or may perform strangely. It remains the rough equivalent of using a Lamborghini engine to run the generator in your 1970s-era camper van, and if you’re serious about using these tools in the longer term, you’ll probably want to either log into a remote server running a real Unix-based OS, or install a Linux distribution alongside (or instead of) Windows.
Logging in via ssh on iOS or Android or ChromeOS
You can actually do some Terminal work from iOS or Android, using an ‘ssh’ client like Prompt (for iOS) or ConnectBot (for Android). It’ll be unpleasant if you don’t have access to a real keyboard, but you can pull it off.
On ChromeOS, you can use the Secure Shell App for Chrome to connect to a server.
Enter the server name (ieng6.ucsd.edu), your username, and password, in the relevant places.
The first time you connect, you’ll see a message talking about the Server’s “RSA Key”. Just type “yes”.
Once that happens, the server will automatically display an annoying
=====NOTICE=====
message with privacy information (if
you’re at UCSD). Once you hit q
to exit that, you’ll be
logged in.
Once you’re logged in, you’ll see a line ending with a
$
. This is the ‘prompt’, and now, you move on to the next
steps.
UCSD Only: Logging in via LinuxCloud
If you’re a LIGN 6 Student, you can also log in at https://linuxcloud.ucsd.edu/# using your ieng6 account above.
Once you’ve enrolled in Duo and logged in, you’ll have an option under ‘All Connections’ for ‘Linux SSH Terminals’. Double click that and you’ll be on a command line, and ready to get started.
Basic Unix Theory
Unix doesn’t refer to a specific operating system, but to a family of related (and largely interoperable) operating systems. Members of this family include MacOS, iOS, Android, Linux, BSD, Solaris, and others. All of them share some low-level basics, and focus on using simple tools which can be combined together to do more complex things in a program called a ‘shell’.
John Lawler has written a full Linguistic description of the Unix language family, which is amazing both in its depth, and its sense of humor. It should be required reading for any linguist interested in using Unix.
Let’s start by talking about some of the basics of Unix-compatible machines.
The Unix Shell
Every Unix server, when you log in, will put you into a ‘shell’
program. bash
is probably the most popular option these
days, but you’ll occasionally see systems still using csh
or sh
. There are also other shells out there in the world
which offer more modern features. I personally prefer zsh
because it has a few nice tools (particularly an expanded set of
wildcard functions), but I’m perfectly happy in bash
, and
this tutorial is written with bash in mind.
The Prompt
When you log in to a Unix machine (either via a Terminal on your local machine, or via SSH), you’ll have the ability to enter text commands at the Prompt (usually a $), which the local operating system will then evaluate. All of your interaction is text-based, and although you may be able to move, click, and select text in the terminal, your mouse does nothing here.
When you log into the UCSD server, your prompt will look like the below:
[yourusername@ieng6-201]:~:100$
This includes information about how you’re logged in (username at
which specific node of the cluster), then your current directory (in
this case, home), then a number indicating how many commands you’ve
previously entered. On your local machine, your prompt will look
different. But it will likely still end with a $
if you’re
in the bash
shell.
Any time that you see this line, ending with $
, you can
enter a command. But before we talk about commands, we should talk about
where you are and the files around you.
Navigating the system
While you’re logged into a Unix machine, you’re always ‘in’ a folder. Your presence in the system moves from folder to folder as you navigate. At any given moment, you’re ‘in’ a directory, and all commands refer to the items in that folder. Think of this as being kind of like human conversational pragmatics: If you’re in a room, and there’s a mug on the table, and the other person in the room asks “Hey, hand me the mug”, you’re going to grab the one on the table, rather than asking “There are lots of mugs in the world, which do you want?”.
This means that you need to be sure what’s in the folder you’re in, so that your commands will work as expected, and you’re not trying to ask for a mug that isn’t in the current room!
But how do we describe where we are?
File and folder paths
All files and folders have a ‘path’, that is, the place in the file system where you can find them. We can refer to a file or folder’s location in two ways: either using the ‘local path’ (“the path to this file from here”), or the ‘absolute path’ (“the path to get to that folder from the root of the system”).
If you have a file called system.txt
, and it exists in
the folder /home/linux/ieng6/ln6w/ln6w
, then the absolute
path to that file is
/home/linux/ieng6/ln6w/ln6w/system.txt
. If you’re already
‘in’ that folder, you can refer to it simply as system.txt
.
If you’re in that folder, the paths above are equivalent.
In Unix, there are four locations that have a dedicated symbol:
.
represents the current folder. You don’t need to use
it very often, except when doing something like ‘move this file into the
current folder’
(e.g. mv /folder/over/there/file.txt .
).
..
represents the folder which contains the current
folder. So, to move file.txt from the folder where you are to the folder
which contains it, mv file.txt ..
. Similarly,
cd ..
will always take you back up a level.
~
represents your home folder, and you can always use
that in a path name, wherever you are. So, the file
samplefile.txt
in your home folder can always be called
~/samplefile.txt
no matter where you are.
/
represents the root of the filesystem (‘the folder
which contains all folders’). Any path which starts with /
is what’s called an ‘absolute’ path, and will be valid no matter
what folder you’re in on that machine.
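Here’s a short session you can run yourself to see all four symbols in action. The folder names (outer, inner) and filename are invented for the example, and mktemp -d just hands us a throwaway scratch folder to experiment in:

```shell
# A quick tour of the four special path symbols, in a scratch folder
cd "$(mktemp -d)"        # start somewhere harmless (a fresh temporary folder)
mkdir -p outer/inner     # make a nested pair of folders to play with
cd outer/inner
pwd                      # the path printed ends in /outer/inner
touch file.txt
mv file.txt ..           # .. is the containing folder: file.txt lands in outer/
cd ..                    # move up a level, into outer/
ls .                     # . is the current folder: shows file.txt and inner
ls ~                     # ~ is your home folder, reachable from anywhere
ls /                     # / is the root, the folder which contains all folders
```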
Now, you’ve got paths. How do we actually do things?
The * Wildcard
In most Unix shells, you can use the *
‘wildcard’ to
refer to groups of files. Used alone, it means “all items in the current
folder”. So, for instance, if you’re in a folder with four other
folders, running ls
will list the contents of the folder
you’re in, but running ls *
will list the contents of all
the folders inside that folder.
You can also use the wildcard to match partial filenames. So,
something like cat ling*
will display the contents of every
file starting with ling (“ling101_syllabus.txt”,“lingering_ghosts.txt”,
“linguistics_is_awesome.md”). But cat ling*.txt
would not
match a .md file.
This is amazingly powerful. Simple commands in Unix can replace huge amounts of human work moving files between folders. Imagine that you’ve got a folder with .wav sound files and .txt transcripts.
mkdir wavdir && mv *.wav wavdir
will instantly move
every wav file (but not the .txt files) into a new folder called
‘wavdir’.
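Here’s that wav-sorting example as a runnable sketch, in a scratch folder. The specific filenames (vowel1.wav, notes.txt, and so on) are made up for the demonstration:

```shell
# Sorting .wav files into their own folder with a wildcard
cd "$(mktemp -d)"
touch vowel1.wav vowel2.wav notes.txt transcript.txt
mkdir wavdir && mv *.wav wavdir   # moves every .wav file, and nothing else
ls wavdir                         # the two .wav files, now in wavdir
ls *.txt                          # the .txt files stayed put
```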
Because the Unix command line is generally quite fast, you can work
with thousands of files at once, according to simple rules. Imagine you
had many thousands of text snippets from various sources, labeled by
filename, with files from twitter prefixed with “twi_” and files from
facebook prefixed with “face_” and so forth. Using bash, you could
instantly combine every file from twitter (but not every file
from Facebook) into a single text file using just the command
cat twi_* > twitterfiles.txt
.
Be careful, though, as it’s easy to underestimate the scope of your
wildcards, and forget what you’re affecting. You should be very
careful using wildcards with rm
or other destructive
commands. rm -r *
will instantly delete every file and
folder in the current folder.
Dotfiles
In Unix, files which start with a period/dot . are ‘hidden’ from general view, requiring the use of a special command (ls -a) to see them. They tend to be configuration files (see the ls -la output later in this guide), but any file can be made into a dotfile. They’re still usable with other commands, though (so cat .bashrc will still output the contents of that file to the terminal). For this reason, you’ll want to be very careful with periods in filenames.
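A quick sketch of this hidden-file behavior, in a scratch folder (the two filenames are invented for the example):

```shell
# Dotfiles are hidden from plain ls, but visible with ls -a
cd "$(mktemp -d)"
touch visible.txt .hidden.txt
ls                 # shows only visible.txt
ls -a              # also shows ., .., and .hidden.txt
cat .hidden.txt    # dotfiles still work with ordinary commands (here, an empty file)
```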
Tab Autocomplete
In Unix, hitting <tab>
will always try to complete
your command. So, if you’re in a folder with three files, ‘filea.txt’,
‘fileb.txt’, ‘otherfile.txt’, and you type o
then hit
<tab>
, it will automatically complete the command to
otherfile.txt
.
If you type f then <tab>, it will complete to file (as there are two files which both start with that much), allowing you to then hit a or b to specify which of the options you want, and <tab> again to fully complete the file’s name.
This is helpful, and can save a lot of time.
Command History
In Unix, hitting the up arrow at the command line will automatically
scroll through your command history. So if you want to re-run a command,
just hit ‘up arrow’ then <enter>
.
Killing a running process
If you’ve got a runaway process, or something that you ran which is taking much longer than expected, you can hit Ctrl+c to kill the process and return to a prompt.
Unix has no ‘undo’
One crucial thing to understand is that unlike in most operating
systems, there’s not an ‘undo’ functionality. If you delete a folder,
unless you have a sympathetic sysadmin with backups, it’s gone. This is
why you should be extremely careful when using destructive commands
(like rm
), you should make sure you’re not overwriting
files, and never hesitate to make copies of important files before you
work on them.
On Unix Commands
To do things in Unix, we use textual commands, which are specific (and generally compact) programs that serve a small function. You’ll enter individual commands, followed by ‘flags’ which specify command behavior as well as ‘arguments’ which are the ‘objects’ in the linguistic sense.
Note that the exact complement of commands may differ from system to system, and that additional commands can be added by the user (using a package manager like apt or portage on Linux, pkg on BSD, or Homebrew for MacOS). The commands we’ll be using here are all fairly basic, and should likely be found on any Unix-compliant system, although the exact flags and arguments taken may vary slightly depending on version.
As it is a verb-initial language, the basic syntactic structure of a
unix command is command [-flags] [arguments]
.
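To make that syntax concrete, here’s one command picked apart (using /tmp, a folder present on essentially every Unix system):

```shell
# command [-flags] [arguments], illustrated:
#   'ls'   is the command (the verb)
#   '-l'   is a flag, modifying the command's behavior
#   '/tmp' is the argument (the object being acted on)
ls -l /tmp
```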
Unix Command Arguments
Unix commands can take arguments, and depending on the command, these arguments will be optional or obligatory, and ordered in certain ways.
- Some commands take no required arguments, like date, which just prints the current date and time, or pwd, which only prints your working location.
- Some commands take one argument, like the cd (‘change directory’) command. cd myfolder will move your current location to be ‘inside’ a folder called myfolder.
- Some commands take multiple arguments, like the cp (‘copy item’) command. cp myfile.txt mycopy.txt will make a copy of the existing file myfile.txt called mycopy.txt.
Flags/Options/Switches
Additionally, many commands take ‘flags’ (also known as ‘options’ or
‘switches’) which slightly modify their function, which are generally
prefaced with a -
. The ls
(‘list directory
contents’) command will simply list file and folder names if called
using ls /home/linux/ieng6/ln6w/ln6w
:
$ ls /home/linux/ieng6/ln6w/ln6w
dirs.txt duolingo.txt perl5
If you use the -a
flag, it will show ‘hidden’ files
too:
$ ls -a /home/linux/ieng6/ln6w/ln6w
. .bash_profile .cache .cshrc .local .modulesbegenv .profile .zshenv dirs.txt perl5
.. .bashrc .config .kshrc .login .procmailrc .zprofile .zshrc duolingo.txt
If you use the -l
flag it will give more details:
$ ls -l /home/linux/ieng6/ln6w/ln6w
total 588
-rw-r----- 1 ln6w ieng6_ln6w 592096 Jan 27 10:27 dirs.txt
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 10:01 duolingo.txt
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
And you can combine these flags, so ls -la
will produce
detailed output about all files including hidden ones:
$ ls -la /home/linux/ieng6/ln6w/ln6w
total 652
drwx------ 6 ln6w ieng6_ln6w 4096 Jan 27 10:27 .
drwxr-sr-x 51 ln6w ieng6_ln6w 4096 Jan 27 09:53 ..
-rwxr-x--- 1 ln6w adm 975 Jan 18 14:56 .bash_profile
-rw-r----- 1 ln6w ieng6_ln6w 592096 Jan 27 10:27 dirs.txt
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 10:01 duolingo.txt
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
Occasionally, flags will demand an additional argument. This
guidebook is rendered into html using the following unix command (using
the pandoc
command, which
is not standard Unix, but is among humanity’s greatest
accomplishments):
pandoc -s --toc -N -o unix/index.html unix/index.md
In this case, the -s --toc -N
flags set program options,
and the -o
takes an obligatory filename, which is the
desired location of the command’s output.
Some commands can take numerical flags, too.
head file.txt
used without a flag, prints the first 10
lines of a file. But head -76 file.txt
prints the first 76
lines.
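You can see this for yourself on a throwaway file (the filename is invented, and seq is a common utility that prints a sequence of numbers, one per line):

```shell
# head's default of 10 lines, versus a numerical flag
cd "$(mktemp -d)"
seq 1 20 > numbers.txt    # a file containing the numbers 1 through 20
head numbers.txt          # prints the first 10 lines
head -3 numbers.txt       # prints just the first 3 lines
```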
Not all commands require flags, and not all flags require arguments,
but it’s important to understand what they are when they do show up. To
learn the flags and arguments for a given command, use the
man
command to read the manual.
Basic Unix Commands
exit - ‘exit this environment’
If you’re logged into a Unix machine and don’t want to be logged in
anymore, exit
is your friend. This will close an ssh
connection, or if you’re working locally, close your terminal window.
But remember, if a program’s gone off the rails, you’ll want to use
Ctrl+c instead.
pwd - ‘print working directory’
pwd tells you where you are.
$ pwd
/home/linux/ieng6/ln6w/ln6w
This means you’re in the folder /home/linux/ieng6/ln6w/ln6w
whoami - ‘who am i’
whoami
tells you who you are, that is, it returns your
username.
$ whoami
ln6w
Useful in case of identity crises, or if you forget which machine you’re logged into.
date - ‘what time is it?’
The date
command simply returns today’s date, time, and
timezone.
$ date
Sun Jan 27 12:56:57 PST 2019
This isn’t super useful when you’re just doing work, but it can be
quite helpful, particularly with >>
, for dating
individual files or versions of files.
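For instance, here’s one way you might keep a dated work log using date and >> (the filename and note are made up; >> appends to a file rather than overwriting it, as discussed later in this guide):

```shell
# Stamping a running log with date and append redirection
cd "$(mktemp -d)"
date >> worklog.txt                        # append the current date and time
echo "cleaned the vowel data" >> worklog.txt
date >> worklog.txt                        # stamp the end of the session too
cat worklog.txt                            # timestamp, note, timestamp
```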
echo - ‘Display text in terminal’
echo
simply repeats whatever text you give it back into
the terminal.
$ echo "I want this text displayed in a terminal"
I want this text displayed in a terminal
This isn’t as useful in interactive use, but it’s a nice way to add
information to a file (using something like
echo "I want this text displayed in a terminal" > ~/Downloads/tmp.txt
)
or to display information in a shell script.
ls - ‘list directory contents’
ls
tells you what’s in a directory. Although you can
include a path as an argument (‘What’s in this folder?’), the default is
to return what’s in the current directory.
$ ls
perl5 samplefile.txt samplefile2.txt samplefolder samplefolder2
You can use the -l
flag to give more information (long
form), and the -a
flag to show all files, including
‘hidden’ files.
$ ls -l
total 12
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile2.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder2
The -l view also gives you information about the files’ permissions (drwxr-xr-x), number of links (2 or 1), owners (ln6w), sizes in bytes (4096), and modification dates, as seen above.
Note also that ll is a commonly available shortcut for ls -l.
cd - ‘change directory’
cd lets you move into a new directory, specified as an argument. You can specify the directory name in one of two ways. Let’s say you’re in /home/linux/ieng6/ln6w/ln6w, and you want to move into ‘samplefolder’ which is inside that directory. You can do this using the ‘local path’ (“the path to this file from here”). Here’s the command, with a pair of pwd commands to either side so you can see the change, using a local path. We specify the folder by just giving its name, and because it’s in the current directory, we go there:
$ pwd
/home/linux/ieng6/ln6w/ln6w
$ cd samplefolder
$ pwd
/home/linux/ieng6/ln6w/ln6w/samplefolder
We can also do the same exact thing, specifying the
‘absolute path’, the path to get to that folder from the root of the
system. This cd
command would get you to that exact folder,
from anywhere.
$ pwd
/home/linux/ieng6/ln6w/ln6w
$ cd /home/linux/ieng6/ln6w/ln6w/samplefolder
$ pwd
/home/linux/ieng6/ln6w/ln6w/samplefolder
Note that absolute paths will always start with /
, which
is the ‘root’ of the server.
Also, it’s worth noting that cd, when used without any ‘arguments’, will take you to your home folder (which is also given the symbol ~). And cd .. will always take you back up a level.
mkdir - ‘make directory’
mkdir creates a directory, whose name is given as an argument, in the current folder. Here’s an example, making a new directory called ‘samplefolder3’ with an ls -l to either side to illustrate the change:
$ ls -l
total 12
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile2.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder2
$ mkdir samplefolder3
$ ls -l
total 16
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile2.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder2
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:01 samplefolder3
touch - ‘create or update modification date for a file’
The touch
command does two related things for us. First,
it can be used on an existing file to update the modification date of
that file. See below example, where I’ll update the modification time
for samplefile.txt:
$ ls -l
total 16
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile2.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder2
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:01 samplefolder3
$ touch samplefile.txt
$ ls -l
total 16
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:04 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile2.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder2
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:01 samplefolder3
More usefully though, it can create a new file out of thin air, just by giving the filename as an argument:
$ ls -l
total 16
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:04 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile2.txt
$ touch samplefile3.txt
$ ls -l
total 16
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:04 samplefile.txt
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 08:30 samplefile2.txt
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:06 samplefile3.txt
The created file will be empty, but you can then open it with a text editor.
mv - ‘move or rename file or folder’
mv
moves files. You give two arguments, one is
the file’s current location and name, and the other is the file’s new
location and name. You can do this to move files between folders, as
below (note that I’ve thrown in a cd and ls so you can see what’s going
on). Here, I’ll move samplefile.txt into the subfolder
‘samplefolder’:
$ ls -l
total 16
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:04 samplefile.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 08:30 samplefolder
$ mv samplefile.txt samplefolder/samplefile.txt
$ cd samplefolder/
$ ls -l
total 1
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:45 samplefile.txt
You can also use mv
to rename a file or folder, by
simply ‘moving’ the file from the original filename to the new one:
$ ls -l
total 1
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:45 samplefile.txt
$ mv samplefile.txt newsamplefile.txt
$ ls -l
total 0
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:45 newsamplefile.txt
This can be done within a folder, or while you’re moving a file to a new location outside the folder (e.g. mv myfile.txt myfolder/mynewfile.txt will both rename the file and move it into the ‘myfolder’ folder).
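Here’s that rename-and-move pattern as a runnable sketch, in a scratch folder:

```shell
# Renaming and relocating a file in a single mv
cd "$(mktemp -d)"
mkdir myfolder
touch myfile.txt
mv myfile.txt myfolder/mynewfile.txt   # renames the file and moves it into myfolder
ls myfolder                            # mynewfile.txt
```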
Note that if you ‘move’ a file to the same name and location as an existing file, it will overwrite the existing file! Unix has no ‘undo’ functionality. Be sure.
cp - ‘copy file or folder’
cp
copies files. Like mv, it takes two arguments, the
starting name, and the new name, but instead of moving the file, it will
create a new copy of the file. Like this command, which will create a
copy of ‘unilingo.txt’ called ‘duolingo.txt’:
$ ls -l
total 8
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:59 unilingo.txt
$ cp unilingo.txt duolingo.txt
$ ls -l
total 8
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 10:01 duolingo.txt
drwxr-xr-x 2 ln6w ieng6_ln6w 4096 Jan 27 08:21 perl5
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:59 unilingo.txt
To copy a directory (and all the files inside it), you’ll need to use cp -r, which makes the command ‘recursive’:
$ ls -l
total 8
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:59 unilingo.txt
$ cp -r samplefolder/ samplefolder2
$ ls -l
total 12
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 10:04 samplefolder2
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:59 unilingo.txt
rm - ‘remove file or folder’
rm removes files, or with the -r flag, entire directories.
$ ls -l
total 12
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 10:01 duolingo.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 10:04 samplefolder2
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 09:59 unilingo.txt
$ rm -r samplefolder2
$ rm unilingo.txt
$ ls -l
total 8
-rw-r----- 1 ln6w ieng6_ln6w 0 Jan 27 10:01 duolingo.txt
drwxr-x--- 2 ln6w ieng6_ln6w 4096 Jan 27 09:53 samplefolder
BE VERY CAREFUL: In Unix rm, there is no ‘trash can’, there is no ‘undo’. It’s just gone. If you accidentally rm -r your home folder, it’s over. As you’re first learning, it’s a good practice to use the -i flag, which makes you hand-confirm deletions as they happen. And be especially cautious with the * wildcard, which, combined with rm, will delete everything in a folder.
man - ‘read the manual’
The most important command you’ll find is man
, which
explains to you how another command works. To get information about a
command, for instance, the rm
command, just type
man rm
, then read through the document using
<space>
to advance and q
to quit.
Combining commands
One of the nicest elements of the Unix philosophy is the ability to combine many small components into bigger workflows, either through shell scripting (that is, writing a file which contains many sequential commands), or using operators like the ones below.
| - ‘Use the output of this command as the input to the next’
The pipe
operator is simple and extremely
powerful. Put simply, it takes the output of one command and passes it
into the next command as input. For instance,
ls -l | tail -1
will take the output of the
ls -l
command, and pass it to tail -1
. This
will take the last line of the ls -l
output, and thus, give
you the last file (alphabetically) in a folder.
Or perhaps you’re wanting to concatenate many small files to view
them all at once, but the resulting file is too big to comfortably read
in the terminal using cat
. So if you have readme1.txt,
readme2.txt, and readme3.txt, running cat readme* | less
will combine the files, then pass that output to the less
command for more comfortable reading. Or you could pass to the word
count (wc
) utility to check the length, using
cat readme* | wc
.
You’ll likely find yourself regularly using |
to string
small processes together, doing very complex things from very simple
chunks of code. It is the most powerful tool in Unix, in the author’s
opinion, and is one you’ll want to know.
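To make this concrete, here’s a small runnable sketch (the /tmp/pipedemo folder and its files are invented for illustration):

```shell
# Build a throwaway folder with three files, then inspect it with pipes
rm -rf /tmp/pipedemo && mkdir /tmp/pipedemo
touch /tmp/pipedemo/a.txt /tmp/pipedemo/b.txt /tmp/pipedemo/c.txt
ls /tmp/pipedemo | wc -l    # ls prints one name per line when piped; wc -l counts them: 3
ls /tmp/pipedemo | tail -1  # the last file alphabetically: c.txt
```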
< - ‘Use this file as input’
The <
operator tells a command to take a given file
as the input. So, for instance, the tr
(translate
characters) command does not take a file as an argument directly, and
instead, will take an inputted file only using the <
operator. So, to change every instance of ‘c’ to ‘k’ in a file called
‘unsorted.txt’, you’d use tr c k < unsorted.txt
.
Note that many commands will take input directly as an argument
(e.g. cat myfile.txt
) or using <
(e.g. cat < myfile.txt
). But for commands which don’t take a file argument, this
approach will work.
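As a quick runnable sketch (the /tmp file name is invented for illustration):

```shell
# Feed a file to tr via <, since tr reads only from standard input
printf 'cats\n' > /tmp/trdemo.txt
tr 'c' 'k' < /tmp/trdemo.txt   # prints: kats
```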
> - ‘Redirect the output to a file’
The >
operator, when placed after a command, will
save the output of the command (that is, what would normally appear in
the terminal) as a designated file. So, for instance, if you wanted to
list the contents of a folder, but wanted to save the list to file for
later use, you’d run ls -l myfolder > folderlist.txt
.
Although nothing will happen in the terminal, you’ll find that a new
file has been created called ‘folderlist.txt’, and that that file
contains what would’ve happened there.
For text analysis and processing, this is a really powerful
trick. Early in the tutorial, I gave the example of using a wildcard to
concatenate (that is, append together into a single file) every file in
a folder starting with ‘twi_’. Although you could just run
cat twi_*
to view it all in the terminal, it’s likely
better to save to a single file for later use or easier reading, so
cat twi_* > twitterfile.txt
will create a new file.
This is occasionally useful when you’ve got a command which will take
a while to run and produce a lot of text. For instance, the command
find / -mtime -0.5 -ls > past12hrs.txt
will return a
listing of every single file which has changed on the entire
system in the last 12 hours (0.5 days). This will create many entries,
and take a while to process, so it’s better to run this directly into a
file, here, past12hrs.txt
.
Finally, remember that redirecting a command with
>
will overwrite the target file if it already
exists! If you want to add the output to an existing file,
you’ll want …
>> - ‘Append the output to a file’
>>
is identical to >
, except that
instead of directing the output to replace a designated file,
it will append the output to an existing file. So, if you’re
planning to, for instance, list the contents of 10-12 directories,
across a few different folders, into a file, you’ll want to run
ls -l dir1 > output.txt
(to create a clean file for the
first one), then ls -l dir2 >> output.txt
to append
to the existing file.
This difference between >
and >>
is subtle, but absolutely crucial, and mixing the two up can cause a lot
of missing data, or a lot of messy files.
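The difference is easy to see with a throwaway file (the /tmp name here is invented for illustration):

```shell
# > replaces the target file; >> adds to it
echo "first"  > /tmp/redirdemo.txt    # file contains: first
echo "second" >> /tmp/redirdemo.txt   # file now contains: first, second
echo "third"  > /tmp/redirdemo.txt    # overwritten! file contains only: third
cat /tmp/redirdemo.txt                # prints: third
```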
& and && - Running Multiple Commands
In Unix, you’ll occasionally want to run multiple commands at once. There are two ways to do so:
The &
command says “Do this in the background”,
allowing you to run multiple commands in parallel. So, if you
want to run a long task (e.g. find and list every file on the hard drive
whose name contains “twi_”), but don’t want your terminal frozen while it
runs, you would run
find / -name "*twi_*" -ls > twifiles.txt &
, where the
final &
just says ‘Run this, and let me know when it
finishes’. This will return a process ID immediately, and when it
eventually finishes, will return an ‘exit’ message to the terminal. But
in the mean time, you can keep working.
&&
allows you to run multiple commands in
series, meaning that the second command won’t be run until after
the first one is completed. For instance,
mkdir newdir2 && cd newdir2
will create a new
directory, then enter it, in a single command.
mkdir txtfiles && mv subfolder/*.txt txtfiles && rm -r subfolder
will create a new folder called ‘txtfiles’, move all the text files out
of a subfolder into it, then remove that subfolder.
&&
is particularly useful when you expect a command
to take a while to run, but you still have a few small tasks which need
to be done afterwards, and you’d just as soon they happen
automatically.
In practice, you should be careful using these commands at first; you
don’t want to, say, delete a folder before you’ve confirmed that your
first command worked properly, and the time savings of typing
command && nextcommand
isn’t all that large
compared to “command, enter, nextcommand, enter”. But they’re out there,
and useful.
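A minimal runnable sketch of the && behavior (all /tmp paths invented for illustration):

```shell
# && runs the second command only if the first one succeeded
rm -rf /tmp/seqdemo
mkdir /tmp/seqdemo && touch /tmp/seqdemo/done.txt   # both run; done.txt is created

# When the first command fails, the chain stops: the echo never runs here
mkdir /nonexistent/impossible 2>/dev/null && echo "never printed"
ls /tmp/seqdemo   # prints: done.txt
```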
Editing Text Interactively in Unix
On a Unix system, there are usually a variety of programs usable for editing text interactively inside a terminal interface.
nano
nano
is very user-friendly, and allows more or less
anybody to edit text in Unix. The only ‘trick’ you need to know is that
the commands at the bottom of the window (listed as ^X
) are
triggered using the <control>
and the listed
letter.
To edit a file in nano
(or create a new file with this
name), you’ll type nano thefilename.txt
.
So, when you open nano, you can start typing inside the box. When
you’re finished, to save your edited file and exit, you’ll type CTRL+O
(‘Write out’), specify a filename (if you haven’t done so when you
started the program), then <Return>
, then CTRL+X
(‘Exit’).
If you’re unsure which editor to use, start here. It’s good to know a more powerful editor, but this is a great starting point.
vim
vim
(or its progenitor, vi
) is
unquestionably powerful, fast, free, and rather weird. It’s a ‘modal’
editor, which means that there are two different ‘modes’ for interacting
with the document, with one transforming the keyboard keys into commands
for moving rapidly around the document, and the other mode allowing text
entry. This seems silly at first, but in practice, it’s extremely
powerful for moving quickly around a document, and there are few faster
ways to edit text.
vim
has an extraordinarily steep learning curve, so be prepared to have a tutorial open in the other window the first few times you use it.
emacs
emacs
is another very powerful and free editor, which is
not modal (in its default configuration), but has keystrokes, commands,
and packages allowing you to do nearly anything without leaving the
editor. This makes emacs
bulky and occasionally a bit slow,
but extensible and powerful, and it remains a world-class plaintext editor.
emacs
has a learning curve, so be prepared to have a tutorial open in the other window the first few times you use it.
Which editor is best?
Whoa there, are you trying to start a holy war or something? :)
One’s choice of unix text editor is intensely personal, and depends as much on the person’s outlook on the world in general as it does on the relative merits of the programs. Anybody who’s spent any amount of time programming or scripting will have opinions, and it’s useful to be familiar with one of either emacs or vim for just this kind of work, as it’s likely that any given server will have one of them to work with.
Additionally, remember that many people use other editors
(e.g. BBEdit, SublimeText, TextMate, etc.) which aren’t available on
Unix per se, but may be able to access remote files on a unix machine.
If you’re working on a Mac, Windows machine, or Linux Desktop, there are
more options, although emacs
and vim
still
have a lot to offer even there. In fact, this guide is being written in
emacs
on the author’s Mac, as that’s the author’s editor of
choice, but he spent many years in vim
.
The most important thing, ultimately, is that you try a number of options, then choose the one that made the most sense with your brain and workflow. Remember, there’s no correct answer.
Basic Text Reading Tools
As linguists, you’ll be doing a fair amount of text reading and manipulation.
cat - ‘output the file into the terminal’
cat
offers you a simple way to read a short file. Just
use cat
and the filename, and it’ll print the contents
directly into the terminal. So, if you want to read the file
willsfavoritepoem.txt
, just run
cat willsfavoritepoem.txt
.
You can also use the *
wildcard to use cat
to concatenate, that is, append end-to-end, many files together. So if
you have readme1.txt, readme2.txt, and readme3.txt, running
cat readme*
will print all three files’ content into the
terminal. cat
is particularly useful when combined with the
|
and >
operators.
cat readme* > readme_combined.txt
will combine them into
one new file called ‘readme_combined.txt’, and
cat readme* | wc
will give you a wordcount of the three
files combined, just to give a pair of brief examples.
The problem with cat is that big files are often too big for your terminal to scroll back and read. So, for big files, you’ll want to use…
less - ‘read a long file’
less
offers a better experience for reading long files,
allowing you to open a file using
less willsfavoritepoem.txt
, then read through the document
using <space>
to advance, u
to back up,
and q
to quit.
If you don’t have less
on your system, you’re likely to
have more
, which does the same thing, but without the
ability to scroll back. (Yes, in this case, more
is
actually less.)
head and tail - ‘Display the start/end of a file’
The head
and tail
commands exist to do just
one thing: to display the edges of a file. Calling the head
command with a file as an argument
(e.g. head reallylongfile.txt
) will display the first 10
lines of that file. Calling tail reallylongfile.txt
will
display the final 10 lines.
Note that both head
and tail
can take a
numerical argument as well indicating the number of lines to display, so
head -40 reallylongfile.txt
will display the first 40
lines.
These commands are useful for previewing files, but it’s also often
handy to grab chunks of files by number and similar tricks. For
instance, head -60 myfile.txt | tail -10
 will give you
lines 51-60 of a file (there are other ways to do this using
sed
 and others, but these commands are often handy).
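Here’s that trick as a runnable sketch, using seq to generate a numbered test file (the /tmp path is invented for illustration):

```shell
# Pull out lines 51-60 of a numbered file
seq 1 100 > /tmp/lines.txt          # 100 lines: 1, 2, ..., 100
head -60 /tmp/lines.txt | tail -10  # first 60 lines, then the last 10 of those: 51-60
```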
Unix Tools for Corpus Work
As you’re doing natural language processing or corpus work, some
tools are particularly useful above and beyond file reading tools like
cat
.
egrep - ‘regular expression search’
egrep
is a simple regular expression search program, and
an incredibly useful tool for working with corpora. For the remainder of
this explanation, we’re going to pretend that we’re located in the
‘enronsent’ folder which contains a working copy of the EnronSent corpus,
and we’re wanting to examine the data there.
egrep
takes two arguments, the expression, and the file
to search.
To perform a basic search for the word ‘corruption’ across all the
files in the folder, you’d type egrep 'corruption' *
and
hit enter. Because EnronSent is split across 44 text files, we’ll use
the * wildcard in all of these searches, but you could just as easily
search an individual file
(e.g. egrep 'corruption' enronsent33
 would only search for
the word corruption in file ‘enronsent33’).
$ egrep 'corruption' *
enronsent08:enlighten you on the degree of corruption in Nigeria.
enronsent13:courts in Brazil which are generally reliable and free of corruption (e.g.,
enronsent17:??N_POTISME ET CORRUPTION??Le n,potisme et la corruption sont deux des prin=
enronsent18:electoral corruption and fraud has taken place, a more balanced Central
enronsent20:by corruption, endless beuacracy, and cost of delays. These "entry hurdles"
enronsent20:Turkish military to expose and eliminate corruption in the Turkish energy=
enronsent21: employees, and corruption. The EBRD is pushing for progress
enronsent21: government has alleged that corruption occurred when the PPA
enronsent22:good, safe and clean fun? didn't add to your corruption score? is that the
enronsent22:how did you do on the corruption test?
enronsent29:> > > Paul Wilcher - Attorney investigating corruption at Mena Airport with
enronsent32:free from corruption. In particular, foreign parties have been treated
enronsent35:* Protect your code from hackers or unintentional corruption
enronsent37:corruption. They are not saints =01) the Indonesian government removed the=
enronsent37:Turkish military to expose and eliminate corruption in the Turkish energy=
enronsent41:system. E-mail may be susceptible to data corruption, interception and
enronsent41:corruption, interception or amendment or the consequences thereof.
This gives us every line with the word ‘corruption’ in it from the entire corpus, labeled by file. Note that this is giving us the entire line, not just the word itself, which makes interpreting the words much easier.
To make the search results case insensitive (e.g. returning
‘Corruption’, ‘corruption’, or ‘CoRrUpTiOn’ above), you’ll want to use
the -i
flag.
Importantly, egrep
uses regular expressions to search.
This allows you to perform more intelligent searches by specifying your
search pattern in a more complex and nuanced way. The
internet does not require yet another RegEx tutorial, so I won’t
belabor regexes here, but they are remarkably useful. One particularly useful
resource is https://regexr.com/, which provides an interface to
check your regular expressions in real time and make sure they’re
matching what you hope they will.
To give an example of this, to extract all instances of the word
‘cat’ in both singular and plural form, as well as “catnip”, you could
use egrep 'cat(s|nip)' *
, which would return all three
forms. Provided you’re not looking for more detailed syntactic,
part-of-speech, or verb sense information, you can get a remarkable
amount of information from a few careful greps including inflectional
markers.
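You can try this kind of search on a toy file (the /tmp file and its contents are invented for illustration):

```shell
# A tiny toy corpus, then a regular-expression search over it
printf 'the cat sat\nthe cats sat\nsome catnip\nno match here\n' > /tmp/toy.txt
egrep 'cat(s|nip)?' /tmp/toy.txt     # prints the three matching lines
egrep -c 'cat(s|nip)?' /tmp/toy.txt  # -c counts matching lines instead: 3
```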
It’s worth noting that you can use egrep -n
to also get
line numbers, which can then be combined with sed -n
to
extract a range of lines. If, for instance, you get an interesting hit
from egrep
at line 29076 of file enronsent22, you can then
follow up with sed -n '29070,29080 p' enronsent22
to
extract more context (from line 29070 to line 29080).
You’ll also want to read this tutorial from Nikolaj Lindberg which offers some similar unix information.
wc - ‘word count’
wc
, much like its British namesake, is a very useful
thing to have around when you need it, as it provides a simple way to
count words, lines, and characters in a file. By default,
wc
will output the number of lines, words, and characters
(assuming single-byte characters) in the input file(s).
$ wc enronsent22
50000 294705 1771571 enronsent22
Very often, you’ll want to use wc
on the output of
another command. For instance, to count the number of lines, words, and
characters in the EnronSent corpus, you could…
$ cat enronsent* | wc
2205910 13810266 88171505
You can also use the -l
, -w
, or
-c
operators to only return the number of lines, words, or
characters, if that’s all you need.
One of my most common uses of the command is something like
egrep 'searchterm' * | wc -l
, which, because egrep returns
one line per match, will instantly return the number of occurrences of
‘searchterm’ in the dataset. You’ll want to confirm that your search
term is actually doing what you want (so, it’s not unreasonable to think
of doing
egrep 'searchterm' * > searchterm_results.txt && wc -l searchterm_results.txt
instead.)
One completely unrelated use of wc
is with the
-c
operator, which prints out the number of bytes in a
file. This may sound silly at first, but then you realize that
find . -mtime -365 -type f | xargs wc -c
adds up the size
of every file modified in the last year, in one single line. Which… omg.
(Note the use here of ‘find’ to identify files, and the piping to xargs,
which handles large numbers of arguments more gracefully.)
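A miniature version of that find-and-total trick, with two invented files of known size:

```shell
# Add up bytes across files with find + xargs + wc -c
rm -rf /tmp/sizedemo && mkdir /tmp/sizedemo
printf '12345' > /tmp/sizedemo/a.txt       # 5 bytes
printf '1234567890' > /tmp/sizedemo/b.txt  # 10 bytes
find /tmp/sizedemo -type f | xargs wc -c   # per-file counts, then a 'total' line: 15
```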
sort - ‘sort the file by contents’
Sometimes, you’ll have a file containing a list of items, and you’ll
need to sort it. The sort
command does just that. Take this
list of animals Will finds cute:
$ cat unsorted.txt
cats
chickadees
velociraptors
dogs
hamsters
squirrels
jumping spiders
lizards
owls
We can use the sort
command to display the file in
sorted order:
$ sort unsorted.txt
cats
chickadees
dogs
hamsters
jumping spiders
lizards
owls
squirrels
velociraptors
Note that this hasn’t modified the file, just displayed it in sorted
order. To save the sorted output, you’d want to use
sort unsorted.txt > sorted.txt
. And of course, you’ll
want to check man sort
for all of the various options,
including reversed order -r
and other alternative
orderings.
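A quick runnable sketch, including the reversed ordering (the /tmp file is invented for illustration):

```shell
# Sort a small list forwards, then reversed with -r
printf 'dogs\ncats\nowls\n' > /tmp/animals.txt
sort /tmp/animals.txt      # prints: cats, dogs, owls
sort -r /tmp/animals.txt   # prints: owls, dogs, cats
```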
tr - ‘translate characters’
tr
is a textbook unix ‘small utility’ which can be quite
helpful, transforming single characters into other ones (e.g. turn all
instances of ‘c’ into ‘k’). tr
takes the file argument
using <
, so the above transformation would look like
tr 'c' 'k' < unsorted.txt
.
Perhaps the most useful case of this is transforming newline
characters \n
into spaces, to bring a multi-line file into
a single line file: tr '\n' ' ' < unsorted.txt
. It’s a
perfect example of a stupid little Unix command which could save a
person hours of manual work, or a trip into a heavier text editor for
find-and-replace. Speaking of which…
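That newline-flattening trick, as a runnable sketch (the /tmp file is invented for illustration):

```shell
# Flatten a multi-line file onto one line by turning newlines into spaces
printf 'cats\ndogs\nowls\n' > /tmp/multi.txt
tr '\n' ' ' < /tmp/multi.txt   # prints: cats dogs owls
```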
sed - ‘find and replace’
The sed
command (‘stream edit’) is a beautiful tool for
modifying text files. To use it, you’ll format a request as
sed 's/oldtext/newtext/g' inputfile.txt
. To give an
example, we’ll use this sample text, from the EnronSent corpus:
$ cat sampletxt.txt
Hi Elyse - I just spoke with Ramesh Rao (associate dean). He has not yet
made the Excellence Fund awards, which are also part of the Enron endowment.
He would like to have the dinner in the fall, when all of the awardees have
been chosen. He felt that a dinner the end of April would conflict with a
major E-commerce conference we're having here, and the Enron awards would
get the attention they deserve in the fall. The other advantage of a fall
dinner is that the new students will be here, and if we have a big
Now, we can replace every instance of ‘the’ with ‘gebleeble’ using
sed 's/the/gebleeble/g' sampletxt.txt
:
$ sed 's/the/gebleeble/g' sampletxt.txt
Hi Elyse - I just spoke with Ramesh Rao (associate dean). He has not yet
made gebleeble Excellence Fund awards, which are also part of gebleeble Enron endowment.
He would like to have gebleeble dinner in gebleeble fall, when all of gebleeble awardees have
been chosen. He felt that a dinner gebleeble end of April would conflict with a
major E-commerce conference we're having here, and gebleeble Enron awards would
get gebleeble attention gebleebley deserve in gebleeble fall. The ogebleebler advantage of a fall
dinner is that gebleeble new students will be here, and if we have a big
Note, though, that it also matched word-internal ‘the’ sequences (e.g. ‘they’
became ‘gebleebley’). This can be fixed by putting spaces on either side:
sed 's/ the / gebleeble /g' sampletxt.txt
. Also note that
this has just printed the output to the terminal, you’ll want to use
>
to save the output.
sed
is also perfectly happy to take regular expressions,
although the exact syntax, frustratingly, varies among operating
systems. On MacOS, you’ll want to use the -E
flag to mark
‘extended’ regex syntax for some things, but you could easily replace
multiple words, as in:
$ sed -E 's/ (the|that|he|they|we) / gebleeble /g' sampletxt.txt
Hi Elyse - I just spoke with Ramesh Rao (associate dean). He has not yet
made gebleeble Excellence Fund awards, which are also part of gebleeble Enron endowment.
He would like to have gebleeble dinner in gebleeble fall, when all of gebleeble awardees have
been chosen. He felt gebleeble a dinner gebleeble end of April would conflict with a
major E-commerce conference we're having here, and gebleeble Enron awards would
get gebleeble attention gebleeble deserve in gebleeble fall. The other advantage of gebleeble fall
dinner is gebleeble new students will be here, and if gebleeble have gebleeble big
Note that this remains case sensitive (so ‘the’ was replaced, but
‘The’ wasn’t). sed
is very powerful, but due in part to the
many implementations of it, you’ll likely find some frustration using it
at first, and as always, Google and StackExchange are your friends. Also
note that awk
exists and does similar things, but with a
bit more extensibility at the cost of a bit more opacity.
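A small runnable sketch of the space-delimited replacement, saved with > (the /tmp files and sentence are invented for illustration):

```shell
# Whole-word replacement using spaces as boundaries, saved to a new file
printf 'the cat saw the other cat\n' > /tmp/sedin.txt
sed 's/ the / gebleeble /g' /tmp/sedin.txt > /tmp/sedout.txt
cat /tmp/sedout.txt   # prints: the cat saw gebleeble other cat
```

Note that the sentence-initial ‘the’ survives, since it has no leading space; that’s the usual weakness of the space-delimiter trick.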
cut - ‘Extract a delimited column’
You’ll often be given tab or comma delimited data in your life. The
cut
command allows you to extract individual columns by
number, without opening them in R or Google Sheets or a similar
spreadsheet interface. For instance, in the CELEX corpus, the data
are organized by columns, and delimited by <tab>
:
$ head tc_celex.txt
1 a 413887 2 P '1 VV 1 P @ V @
2 a 422366 1 P '1 VV 1
3 a 8448 1 P '1 VV 1
4 A 422334 1 P '1 VV 1
5 AA 52 1 P """1-'1" VVVV 11
6 AA 95 1 P """1-'1" VVVV 11
7 aback 59 1 P @-'b{k VCVC @b{k
8 abacus 8 1 P '{-b@-k@s VCVCVC {b@k@s
9 abaft 0 1 P @-'b#ft VCVVCC @bAft
10 abaft 2 1 P @-'b#ft VCVVCC @bAft
To extract only the second column containing words, we would run
cut -f2 tc_celex.txt
(or the alternative below, which just
does this for the first 10 lines):
$ cut -f2 tc_celex.txt | head
a
a
a
A
AA
AA
aback
abacus
abaft
abaft
You could extract columns 2 and 3 using
cut -f2-3 tc_celex.txt
. If you’ve got a different
delimiter, you can use the -d
flag to specify. The opposite
of cut
is paste
, which combines files
containing single “columns” of data into a single tab-delimited
file.
As an illustration of power, imagine you have a tab-delimited class
roster (with lots of other information) including ‘Last, First’ student
names, and you want a single document in the format “Will Styler, Rebecca
Scarborough, Pam Beddor, Andries Coetzee, Jelena Krivokapic, …”. You
could first extract the name column using cut
, then
extract the first and last name columns individually using
cut -d "," -f 1
 and cut -d "," -f 2
, then
paste them together in the reverse order using paste
and
a single space delimiter. Then finally, tr
could turn every
newline into a comma, and you have your list. This seems like a silly
example, but imagine if you have hundreds of thousands of names, the
amount of pain this would be.
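That pipeline, sketched with a two-line invented roster (a real hundreds-of-thousands-of-names file works identically):

```shell
# Column 2 holds 'Last, First' names in a tab-delimited roster
printf 'a1\tStyler, Will\tx\na2\tBeddor, Pam\ty\n' > /tmp/roster.txt
cut -f2 /tmp/roster.txt > /tmp/names.txt                        # Styler, Will / Beddor, Pam
cut -d ',' -f2 /tmp/names.txt | sed 's/^ //' > /tmp/first.txt   # Will / Pam (leading space stripped)
cut -d ',' -f1 /tmp/names.txt > /tmp/last.txt                   # Styler / Beddor
paste -d ' ' /tmp/first.txt /tmp/last.txt | tr '\n' ','         # Will Styler,Pam Beddor,
```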
uniq - ‘remove duplicate lines’
uniq
is a nice little tool designed to remove (and with
the -c
flag, count) adjacent duplicated lines in a
file. Note carefully the ‘adjacent’ part: this means that if two lines
aren’t right next to each other, it won’t remove them. But it does have
its uses.
Among other things, it makes for a cheap and easy little word
counter, when combined with tr
and sort
. Let’s
say we wanted to count the number of times each word occurred in the
above ‘sampletxt.txt’. We could write the below command, which would
first open the file, then transform spaces into newlines (putting each
word onto a line), then sort (so that duplicates are adjacent), then
count the duplicates:
$ cat sampletxt.txt | tr ' ' '\n' | sort | uniq -c
3
1 (associate
1 -
1 April
1 E-commerce
1 Elyse
2 Enron
1 Excellence
1 Fund
3 He
1 Hi
1 I
1 Ramesh
1 Rao
1 The
4 a
1 advantage
1 all
...
Of course, this is not the best way to get word counts (it’s better to use NLTK’s built-in tokenizer and ngrams() function), but it’s a nice illustration of the power of stringing together simple commands to do complex tasks.
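One common refinement of that pipeline (shown here on an invented six-word input) is to pipe the counts through sort -rn, putting the most frequent words first:

```shell
# Word counts sorted by frequency, most frequent first
printf 'a b a c a b\n' | tr ' ' '\n' | sort | uniq -c | sort -rn
# counts come out left-padded by uniq -c: 3 a, then 2 b, then 1 c
```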
tgrep - ‘Tree Grep’
If you’re dealing with existing syntactic trees or POS-tagged data, you might want a specialized program for searching them. tgrep is exactly that. It’s not basic to any flavor of Unix, installation isn’t trivial, and it requires specifically formatted trees, but if you’re asking questions like “How often does leverage get used as a verb in these tagged data?”, there’s nothing better.
Unix Shell Scripting
Although the shell is in many ways designed to be run interactively,
there’s nothing stopping you from building a list of commands, to be run
one after the other, to do a more complex task. To do this, you’ll
simply create a new text file, preface it with #!/bin/bash
(if you’re in the bash
shell), then add the commands, one
after the other, and including if you’d like lines starting with
#
which are human-readable comments not interpreted by the
computer. Here’s a basic shell script that I use to do routine
maintenance on my system, called ‘cleanup.bash’:
#!/bin/bash
# Clear the screen
clear
# Clear DNS caches
echo "Clearing DNS Caches"
dscacheutil -flushcache
sudo killall -HUP mDNSResponder
# Clear out the Quicklook cache
echo "Clearing Quicklook Caches"
qlmanage -r
# Remove log files
echo "Clearing Logs"
sudo rm -rv /Users/wstyler/Library/Logs/*
sudo rm -rv /private/var/log/*
sudo rm -rv /var/log/*
sudo rm -rv /Library/Logs/*
# Update my packages with Homebrew
echo "Updating packages"
brew update
brew upgrade --all
# Move back to the home folder
cd
Once this is done, you can either run it with bash directly
(bash cleanup.bash
), or you can make it executable (chmod +x cleanup.bash
) and run it by path from within the folder (./cleanup.bash
).
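A minimal, runnable version of that workflow, with an invented script name and message:

```shell
# Write a tiny script, then run it both ways
printf '#!/bin/bash\necho "Hello from a script"\n' > /tmp/hello.bash
bash /tmp/hello.bash      # run it via bash explicitly
chmod +x /tmp/hello.bash  # mark it executable...
/tmp/hello.bash           # ...and run it by path; both print: Hello from a script
```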
You don’t always need to shell script
With that said, it’s often the case that the best way to write a
script to do things on a Unix system is to do it in Python, and then use
one of the many internal-to-python means to execute the bash lines you
need. You often get more flexibility this way, and put simply, Python is
a kinder place for complex scripting with various control structures,
string manipulation and parsing, and other such things. Unless all
you’re doing is running sequential bash commands like the above, or
making minimal use of conditional logic, your life is often better doing
what you need in Python with occasional subprocess.run
bash
calls.
Advanced Unix Commands
These commands aren’t ‘necessary’ to most everyday work, but can make life substantially easier. If you’re just getting started (e.g. you’re in Will’s LIGN 6), you can skip this section, but the tools are absolutely helpful, and may change your world someday.
tee - Write to file and terminal
tee
is one of those commands that’s seldom useful, but
really useful when needed. If you’re wanting to write something
to a file as well as writing it to the terminal, the
tee
command is designed to do exactly that. So, for instance,
if you wanted to combine three files (all ending with
_log.txt
), save the combined file, and print them
to the terminal, you could run
cat *_log.txt | tee logscombined.txt
.
Practically speaking, the most common use for tee
is to
replace >
in a chain of piped commands. For instance,
cat *_log.txt | tee logscombined.txt | sort | tee logsorted.txt | wc -l
would produce two files, one containing the combined unsorted output,
one containing sorted output, and because it also continues the pipe, we
get a line count at the end.
You can accomplish the same things with, e.g.,
cat *_log.txt > logscombined.txt && sort logscombined.txt > logsorted.txt && wc -l logsorted.txt
,
but tee
just feels way better.
screen - Terminal multiplexer
screen
is a much more advanced command, but as you start
running larger jobs, it’s absolutely crucial. Let’s say that you’re
about to start a massive job, for instance, a large R script, or a
server migration using rsync
. In a perfect world, you’d be
able to stay connected to the server, with a terminal open, over the
course of the minutes/hours/days that the job will run, so that you see
all of the output in one, compact place. But particularly in the laptop
era, we often move from network to network, sleep our machines, etc.
This is where screen
comes in.
When you know you’re about to start a long process on a remote
server, first type screen
. This will ‘clear’ your terminal,
but more importantly, you’ve just entered a ‘second’ nested terminal.
All the settings and commands will be the same, but you’re inside a
layer of abstraction.
Now, you’ll run your ‘huge’ command which you expect to take some
time. Once you’re satisfied that it’s running as expected, the magic
happens: press Ctrl+A
and then d
. You’ll get a
new message reading something like…
[detached from 8798.pts-0.yourserver]
… and you’ll be kicked back out to another prompt. And in fact, you can even log out of the server. But the big long command is still running on the machine!
Once you’re back at home, nestled into your couch with a cup of tea,
you can check back in on your process by reconnecting to the server,
then running screen -r
. This will ‘resume’ your
screen
session, and you’ll find yourself ‘back inside’ the
terminal where the big command is running, with its output presented
there. And you can exit back out again, if you’d like, again with
Ctrl+A
then d
.
Once you’re done in that window, Ctrl+A
and then
K
will kill that subscreen, as will ‘exit’.
There is much more to screen
than I’ve covered here, and
there’s
a very nice manual for the command here, but this covers my most
common use case, and will make huge computes (which don’t merit being
pushed off to a cluster/job-scheduling system) much more tractable in
your daily life.
Other Command-Line tools I can’t live without
These tools are not likely to be pre-installed on systems you’re working with, but are absolutely worth considering.
ansible - Automated Machine Configuration
Your system isn’t properly backed up if you can’t reproduce what’s installed, and a big part of reproducible science is knowing how every aspect of the work was done. Ansible exists to allow you to specify how your system should look in terms of what’s installed, versions, and configurations. Once you’ve written up the playbook used for your system, you can, ideally, reinstall and reconfigure everything in ‘run a command and then go get coffee’ sorts of ways. Because everything’s ‘idempotent’, meaning that running multiple times just ensures everything is as expected multiple times, you can even use it to keep two machines in shared state. It is an incredible tool, and merits the learning curve.
say - MacOS Text-to-Speech Tool
If you’re in a terminal on MacOS, you can use the say
command to directly access the MacOS Text-to-speech engine. So, to have
your Mac read you some Hamlet:
say "To be or not to be, that is the question"
You can combine this with the -o
flag to save the
results directly into a .aiff or .mp4 file. This isn’t the most useful
thing ever, but it’s a great way to have some fun with a
reasonable, modern TTS system.
Pro-Tip: If you happen to have ssh
access to a friend’s
Mac, you can make their computer start talking to them with nothing
visible on the screen to indicate the source of the speech. Bonus points
if you can convince them AI has finally emerged.
The say command will not be available on systems other than MacOS.
rsync - Folder synchronization tool
Although it’s nominally meant for synchronizing folders,
rsync
is a tool that’s amazing for many things. The
simplest use for rsync is to synchronize a pair of folders, such that
one overwrites the other. If you have, for instance, a large corpus of
data in two places, and you’ve made a few small tweaks to it, but you’d
rather not copy the whole thing over again, a simple rsync
command can copy over only the files that have changed, bringing folder
2 to parity with folder 1.
You can also use rsync to copy folders from one computer to another, again without copying unchanged files. To update my website, I run…
rsync -azvhCL --exclude=.DS_Store --exclude='*.md' --exclude='*/.DS_Store' --delete ~/Documents/web/ wstyler@dss-sites.ucsd.edu:/home/websites/wstyler/
The various flags do things like excluding filetypes and specifying
behaviors; --delete
 means that files removed locally are also removed on the
destination.
One “off-label” use of rsync is for copying huge amounts of data from one folder to another on a different system, or over a poor connection. I recently had to move around 1.75 TB of video and text files on a remote server via SAMBA (as I couldn’t log in directly), and much to my frustration, the destination server kept having connectivity issues. Using cp -r didn’t work, as every time the connectivity dropped, the transfer would fail mid-stream, and couldn’t pick up where it got cut off. rsync worked beautifully, though: every time the connection cut out, I could just re-run the rsync command, it would figure out what had already been synced, and it would resume with the file that was in progress. Eventually, after 5-10 restarts, I was able to copy the data, and life was good.
pandoc - Document conversion tool
pandoc is a treasure of humanity. From their website:
If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Pandoc can convert documents in (several dialects of) Markdown, reStructuredText, textile, HTML, DocBook, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Vimwiki markup, roff man, OPML, Emacs Org-Mode, Emacs Muse, txt2tags, Microsoft Word docx, LibreOffice ODT, EPUB, or Haddock markup to (a whole bunch of formats)
This document is written in Markdown, then transformed into HTML (with a custom header and footer) by pandoc, using the command below:
$ pandoc -B includes/header -A includes/footer -s --toc -N -o unix/index.html unix/index.md
Similarly, my dissertation was written using a combination of Markdown and LaTeX and rendered with pandoc. Pandoc is the glue which holds together my textual life.
yt-dlp - YouTube Downloader
Sometimes, you need offline access to a YouTube video, whether for your lecture slides, for collecting speech data, or otherwise. yt-dlp (a more modern and, as of Feb. 2022, more performant fork of the better-known youtube-dl) is a great tool for pulling down videos, their audio, and more. But to get them into usable formats, you may need…
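For instance (a sketch with a placeholder URL; -x and --audio-format are real yt-dlp flags, and the WAV conversion assumes ffmpeg is installed alongside it):

```shell
# Download just the audio of a video and convert it to WAV
# (-x extracts the audio track; --audio-format wav converts it via ffmpeg)
yt-dlp -x --audio-format wav "https://www.youtube.com/watch?v=VIDEO_ID"
```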
ffmpeg - Audio and Video conversion tool
When dealing with non-textual data, including audio and video, there are few tools more useful than ffmpeg. It allows you to capture and extract frames, convert from format to format, and even combine multiple videos into a single signal. And it’s completely free.
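As one common sketch (filenames here are hypothetical), pulling analysis-ready audio out of a video recording might look like:

```shell
# Extract the audio track as a mono, 16 kHz WAV, ready for phonetic analysis
# (-i names the input, -vn drops the video stream, -ac 1 mixes down to mono,
# and -ar 16000 resamples to 16 kHz)
ffmpeg -i interview.mp4 -vn -ac 1 -ar 16000 interview.wav
```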
imagemagick - Image processing tool
There are few image-related tasks that ImageMagick cannot accomplish. It is a swiss army knife for anything image-related, and does amazing work.
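For example (filenames hypothetical; note that on ImageMagick 7 the classic convert command is invoked as magick instead):

```shell
# Resize a spectrogram to 800 pixels wide (height scales to match)
# and convert it from PNG to JPEG in one step
convert spectrogram.png -resize 800x spectrogram_small.jpg
```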
R - Statistical Computing Environment
It’s worth noting that the same R Statistical Computing environment you’re used to using on the desktop works just fine via the Unix command line. Once installed, you can call R and enter the R console, running commands as usual. You can also run R scripts from the Unix command line using Rscript, allowing you to compose in a tool like RStudio, and run on somebody else’s (much more powerful) machine.
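As a tiny sketch (the script name is arbitrary), you can write a script locally and then run it non-interactively with Rscript:

```shell
# Write a one-line R script, then run it without opening the R console
echo 'cat(mean(c(2, 4, 6)), "\n")' > quick_mean.R
Rscript quick_mean.R
```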
NLTK - Natural Language Toolkit
NLTK isn’t straight Unix, as it’s a library for Python, but it’s the best place to start with legacy approaches to Natural Language Processing, and in many cases, the best place to finish. I use it for everything from POS tagging to tokenizing to n-gramming, all the way to drawing syntax trees.
jdupes - Duplicate file finder
How do you handle the situation in which you have two folders with 50,000 and 49,500 files, which are mostly identical? Well, you could go through manually, or you could use a tool like jdupes to remove all files in folder B which are identical to one in folder A, and then worry only about what’s different. It’s one of those tasks that takes years of your life done manually, but can move very quickly with the right script.
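A sketch of that workflow (folder names hypothetical; -r, -d, and -N are real jdupes flags, but check the docs on which copy gets kept before deleting anything):

```shell
# First, just list the sets of byte-identical files across both folders
jdupes -r folderA folderB

# Once you trust the output, delete the duplicates without prompting
# (-d deletes, -N keeps the first file in each set and removes the rest)
# jdupes -r -d -N folderA folderB
```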
Scraping Web Data to make Corpora
One common process in NLP research is pulling down web data in nicely formatted chunks, for later use in corpus work.
First, you’ll figure out your desired data. In some cases, you might be interested in a noisy-but-large amount of data; there, scraping Google’s search results might be the best approach. In other cases, you’ll want to use a specific subset of URLs that you’ve hand-picked (although remember, you want to balance your corpus for any relevant features). Regardless, we’ll assume that you’re starting with a file containing a bunch of URLs (scraping_urls.txt).
Then, your next step will be to scrape those URLs onto your local machine. You’ll want to remove as much of the ‘noise’ as possible, things like navigation bars, ‘Like on Facebook!’, links to other stories, and more. To this end, Justext is your best friend. It does both the downloading and the cleaning, and will output just the raw text of the website. Here it is, folded into a cute little chunk of bash shell script:
# Initialize the file counter
count=0
while read p; do
    # Increment the counter
    count=$(($count + 1))
    # Print the URL being processed
    echo "$p"
    # Set a suffix
    suffix=".txt"
    # Turn that into a filename
    ffinal="scraped_url_"$count$suffix
    # Drop the URL into justext
    python -m justext -s English -o "$ffinal" "$p"
    # Add the URL to the start of the file
    echo -e "URL:$p\n$(cat "$ffinal")" > "$ffinal"
done < ../scraping_urls.txt
The end result of running this will be a series of files, inside the working directory, each containing a URL and a big chunk of text. Enjoy!
Version History
- 0.6 - December 2023 - Updated some bits about NLTK, added a section nudging folks towards using Python for many scripting tasks. Also added ansible and jdupes to the CLI utils page.
- 0.5.9 - May 2022 - Added reference to Windows’ built-in SSH client, and a note on WSL
- 0.5.8 - Feb 2022 - Fixed a few links, added links to yt-dlp to other great CLI tools
- 0.5.7 - April 2020 - Added an amazing secondary use of wc
- 0.5.6 - Feb 2020 - Fixed some bugs, updated the login guide for LIGN 6
- 0.5.5 - Feb 2019 - Added ‘Advanced Unix Commands’ section, with discussion of tee and screen. Also added mention of case insensitivity to egrep and specified the port for campus login.
- 0.5.1 - Feb 2019 - Added information about scraping web data
- 0.5 - January 2019 - Initial (buggy) public release
To Do
- File permissions basics and chmod
- More egrep examples