---
# Making Words Work For You ### Will Styler - World of Words --- ## There's a *lot* of language data out there. * 1.3 billion active websites
* Mayo Clinic enters 298 million patient records per year
* 500 million Tweets per day
* 294 billion emails sent daily
... and that's just the digital stuff --- # That's a LOT of words --- ## Today we're going to talk about putting those words to work --- ## NLP - Natural Language Processing Teaching computers to “understand” human language by building language models - We're going to mostly focus on word-based approaches --- ### NLP has two main goals - Research-focused NLP - Using computers to tell us about language - NLP Engineering - Using computers that know language to solve problems - We're going to see some of each today --- # Using NLP to Learn about Language --- ### Words are used differently in different situations - ... and we as linguists want to know about that --- > “... by using these existing standards, we hope to be able to leverage new technologies during processing.” --- ### A Question: - Was Will really talking like a corporate drone when he used “leverage” as a verb, or do lots of people do that outside the boardroom? - For that matter, how is “leverage” usually used? --- ### How can we answer this? - Make linguists go through huge amounts of text, counting words and finding patterns --- ## The Problem: Humans are inefficient and expensive. ---
## The Solution --- ## Let's search some corpora! --- ### **Corpus (pl. Corpora)**: A collection of written or spoken texts assembled for the purpose of studying language --- ### There are many different corpora out there. --- ## The Brown Corpus ~1,000,000 words of English fiction, books, humor, textbooks, reporting, and gov’t docs --- ## The CallHome Corpus Transcripts of 120 phone conversations (18.3 hours of speech) --- ## The Switchboard Corpus 2,430 conversations (~3,000,000 words of text) from phone calls --- ## The Broadcast News Corpus ~1,243,526 words transcribed from various broadcast news sources --- ## The EnronSent Corpus 96,106 email messages (~13,000,000 words) from the Enron corporate email servers --- ## The Google Books Corpus All the text from every book in the “Google Books” service --- ## The Google Corpus The entire internet. At your fingertips --- ## The LENA Corpus Thousands of hours of recorded child and child-directed speech --- ## The Penn Treebank A large corpus of syntactically marked data (showing the tree structure of sentences) --- ## The CallHome Speech Corpus This corpus actually contains sound files, useful for speech geeks like myself --- ### (and many, many, *many* more) --- ### So, that’s a bunch of data. How do you actually ask your question? --- ## How to run a corpus search: - 1) Figure out where you want to look (which corpora) - 2) Figure out exactly what you want to search for - 3) Find where and when it occurs - 4) Sift through the results --- ## So, to find out if leveraging is corporate-speak, I want to look... - 1) Someplace businessy (EnronSent) - 2) Someplace not businessy (CallHome) - **Hypothesis: 'Leverage' will occur more often in business-related corpora** --- ## What would I need to look for? - leverage? - leveraged? - leverages - leveraging --- ## How would I ask the computer to find that? - Get your geek on by logging into a corpus server - “Find any lines in which EITHER ‘leverages’ OR ‘leveraging’ occurs” - `egrep "leverages|leveraging" yourcorpus` - (This all looks complicated, but it gets easier quickly, and there’s usually someone around to help; a runnable sketch of this search appears a few slides down.) --- ## This gives you numbers! - CallHome Corpus: 0 results - EnronSent Corpus: 61 results - By doing statistical analysis on your searches, you can measure whether the data really shows what you claim it does - ... and move your hypotheses beyond opinion --- > "Leverage was used as a verb 61 times in the Enron corpus, and not at all in the CallHome corpus." ### Case Closed. Booyeah. --- Counting word frequency is a very powerful tool!
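--- ### That search, as a script Here's the “leverage” search from above as a minimal Python sketch. The filename `yourcorpus.txt` is a placeholder, assuming a plain-text corpus with one utterance per line:

~~~~python
import re

# Match the verb-y forms we searched for with egrep: leverages/leveraging
PATTERN = re.compile(r"\bleverag(?:es|ing)\b", re.IGNORECASE)

def count_hits(path):
    """Count pattern matches across every line of a plain-text corpus."""
    hits = 0
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            hits += len(PATTERN.findall(line))
    return hits

# Run once per corpus, then compare the counts, just like the egrep search.
print(count_hits("yourcorpus.txt"))
~~~~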
--- # n-grams --- ### What is an n-gram? - An n-gram is a sequence of words from a sample of text or speech - 1 word is a 'unigram', 2 is a 'bigram', 3 is a 'trigram'... - Counting words is relatively inexpensive - Doesn't require as much detailed processing or any human work --- ### We just did a unigram search - "How often does word X occur alone in corpus Y?" - These can be very powerful with many corpora - What if we could search through history with a corpus search to see how words are used over time?
--- ## Enter Google Ngrams
https://books.google.com/ngrams
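--- ### Scripting Ngrams The Ngram Viewer is a web tool with no official API, but the page itself fetches its plot data from a JSON endpoint, which you can query directly. A minimal sketch; the endpoint is undocumented, so treat the parameters (which mirror the ones visible in the Viewer's own URL) as assumptions that may change without notice:

~~~~python
import json
import urllib.parse
import urllib.request

# Query the undocumented JSON endpoint behind the Ngram Viewer page.
# Omitted parameters (like which corpus to use) fall back to the defaults.
params = urllib.parse.urlencode({
    "content": "leverage",
    "year_start": 1900,
    "year_end": 2008,
    "smoothing": 3,
})
url = "https://books.google.com/ngrams/json?" + params
with urllib.request.urlopen(url) as response:
    data = json.load(response)

# Each entry pairs an n-gram with its per-year relative frequencies.
for series in data:
    print(series["ngram"], series["timeseries"][:5])
~~~~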
--- ### Some things never change ---
### Eat ---
### Sleep ---
### Walk ---
### Eat/Sleep/Walk --- ### Some words, you might expect to change over time ---
### Automobile ---
### Computer ---
### Laptop ---
### Download ---
### Google ---
### Confederacy --- ### Some words are falling out of use ---
### Bilious ---
### Blackguard ---
### Retarded --- ### Society is represented in distributions ---
### Nazi ---
### War ---
### Color Terms ---
### Sex --- ### Let's play a game! --- Clue: Type of person (belonging to a certain group or culture)
- Hippy --- Clue: Home/Office Technology
- Typewriter --- Clue: Country
- USSR --- Clue: Military Technology
- Atomic Bomb --- Clue: Transportation Technology
- Rocket --- Clue: Food Product
- Spam --- ### Warring Words ---
### Vitriol vs. Sulfuric Acid ---
### Aeroplane vs. Airplane ---
### VHS vs. DVD ---
### Handicapped vs. Disabled ---
### Flammable vs. Inflammable
--- ### Unigrams are interesting! - ... and they can give us lots of good information -
But bigrams and trigrams give much more interesting information about language
--- ### Why? - Counting how often words occur together in a corpus gives us more than just counts; it lets us estimate probabilities -
"What is the probability of word X following word Y?"
--- ### These probabilities tell us about language - "You are" is more likely than "You is" - "Two princesses" is more likely than "Two princess" - "Would have" is more likely than "Would did" - "By and" is, by and large, followed by "large" - **n-grams provide a very simple *language model* from which we can draw inferences** --- ### These probabilities tell us about the world - We all have probability models for language - "Let's go _______" - "Take a ______" - "______ pizza" - "Chinese food" or "Chinese manufacturing" are often discussed - But "Chinese snow" or "Chinese ceiling" not so much - **Probable pairings tell us something about how the world works** --- ### Sociolinguistic n-gramming - "How often is word X used to describe black athletes vs. white athletes?" - "Is unigram frequency of these words predicted by subject race?" - Words like "Aggressive", "Angry", "Unstoppable" and "Ferocious" are preferentially applied to black athletes - Work is ongoing - cf. Wright 2017, The Reflection and Reification of Racialized Language in Popular Media - Mastro et al. 2011, Characterizations of Criminal Athletes: Systematic Examination of Sports News Depictions of Race and Crime --- ### What words are used most regularly with white vs. black athletes? - *Preliminary Data alert!* - Work from Kelly Wright in the LING Department - White: Husband, Wins, Dating, Favorite, Baby - But also things like 'Rolex', 'Invitational', 'Rain', 'Saddle' - Black: Color, Media, Role (model), Violence, Dunk, HIV - But also things like 'Teammate', 'MVP', 'Harlem', 'Jamaican' - Machine learning algorithms can predict athlete race from word counts alone! --- ### n-grams are *really* useful - Provide some grammatical information - "What word forms regularly occur together?" - Provide some real-world information - "What are people most commonly talking about?" - Tell us about the intersection of language and the real world - "How do people talk about other people?" - Inexpensive to generate --- ## ... but can they actually fix any problems? --- # Using NLP to solve problems --- ## What can n-grams actually do for us? --- ### Typo detection and autocorrect - "I made a bog mistake" - "She got lost in a peat big" - "She sent flowers because I love the flowers' scent" - (a toy sketch of this idea appears a few slides down) --- ### Predictive Text - Yeah that's what I was gonna was the last night you were gonna today in a bit I got a little piece for my birthday today in my pocket I just wanna is that I got a nice little guy that loves it to a lot and he has to do a good thing and I think I might try it but I'll try and see what he says about ... --- ### Disambiguating speech recognition - "I need a walk for exercise" - "I need a wok for stir fry" - "I love n-grams. Their great at quantifying words and they're effect." --- ### Sentiment Analysis - How often do "Toyota" and "sucks" co-occur relative to "Nissan" and "sucks"? - "Do we see otherwise improbable n-grams here associated with angry commenting?" --- ### Understanding basic conceptual relationships - Related concepts co-occur frequently in text - 'broom handle' happens much more often than 'broom wheel' - 'car wheel' happens much more often than 'car handle' - So, there must be a link between brooms and handles, cars and wheels - 'car parking' and 'car crash', though! --- ### What else do you think n-grams can do?
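--- ### One answer: the autocorrect idea, sketched Here's the "bog mistake" example from a few slides back as a toy bigram-based typo detector. The counts are invented for illustration, not taken from a real corpus:

~~~~python
# Invented bigram counts standing in for a real corpus-derived table.
BIGRAM_COUNTS = {
    ("big", "mistake"): 9200,
    ("bog", "mistake"): 3,
}

def pick(word, next_word, candidates):
    """Return whichever word makes the likelier bigram with next_word."""
    return max([word] + candidates,
               key=lambda w: BIGRAM_COUNTS.get((w, next_word), 0))

# "I made a bog mistake": every word is spelled correctly, so unigram
# spellcheck passes it, but bigram counts favor the alternative "big".
print(pick("bog", "mistake", ["big"]))  # -> big
~~~~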
and "Tall animals like giraffes" will both hit on "Tall/Giraffe" - This builds not just counts, but associations - "What words are often used *in the vicinity of* the word "black"?" --- ### Latent Semantic Analysis Find the words which are regularly grouped within the same document from a greater corpus - "Interception", "Touchdown", "Tackle" might group in some documents while "Rocket", "Booster", "Satellite" group in others - No clue *why*, just that they happen in the same documents --- ## We can get all of this from counting words which often occur together! - ... but more complex language models can do more for us --- # More Complex NLP Techniques --- ### Fancier tools - n-grams are a good, cheap starting point, but serious NLP moves beyond that - NLP often uses more sophisticated techniques and language models to describe corpora - "Characterize the data in a variety of ways, then use machine learning to learn the patterns you're after" --- ## Machine Learning in ∞ easy steps: - 1) Get a corpus of data and annotate it to tell the computer what’s going on in one specific domain - 2) “Train” the computer by letting it analyze that corpus using a number of tools - 3) Give it a different corpus, have it try to guess what’s going on and answer questions - 4) Refine the programming, then get another corpus and annotate it to tell the computer what’s going on.. - (Repeat) --- ## What kinds of other tools are used? --- ### Tokenizer Breaks sentences into individual words - (breaks,sentences,into,individual,words) --- ### Part of Speech Taggers Labels words with their grammatical functions - Labels|VBZ, words|NNS, with|IN, their|PRP$, grammatical|JJ, functions|NNS --- ### Syntactic Parser This turns sentences into syntactic representations for analysis. ~~~~(ROOT (S (NP (DT This)) (VP (VBZ turns) (NP (NNS sentences)) (PP (IN into) (NP (NP (JJ syntactic) (NNS representations)) (PP (IN for) (NP (NN analysis)))))) (. .)))~~~~ --- ### Semantic Frame Annotations "John boldly threw the stick at the polar bear." - Thrower: John - Object Thrown: Stick - Target: Polar Bear - Manner: Boldly --- ### Coreference/Anaphora "John boldly threw the stick at the polar bear. The beast cast it aside then enjoyed a snack." - "John[1] boldly threw the stick[2] at the polar bear[3]. The beast[3] cast it[2] aside then enjoyed a snack[1]." --- ### (and many, many more) --- ## What can we do with serious NLP? --- ### Analyzing Speech Data "Ask people why they're calling, and connect them to the right department based on their answer." "Flag all tech support conversations where the customer mentions a competitor" "Redirect all angry-sounding customers to higher-tier support workers" (Speech emotion detection) "Are the two people in this skype call flirting, arguing, expressing love, or sadness? Target post-session ads accordingly." "I want to talk to... billing?" (Uncertainty analysis) "Yeah, I really like going to Applebees." 
--- ## What can we do with serious NLP? --- ### Analyzing Speech Data "Ask people why they're calling, and connect them to the right department based on their answer." "Flag all tech support conversations where the customer mentions a competitor" "Redirect all angry-sounding customers to higher-tier support workers" (Speech emotion detection) "Are the two people in this Skype call flirting, arguing, expressing love, or sadness? Target post-session ads accordingly." "I want to talk to... billing?" (Uncertainty analysis) "Yeah, I really like going to Applebee's." (Spot-the-sarcasm) --- ### Data Aggregation “Watch Twitter and give me the locations of wildfires, floods, etc., and provide information about damage, shelters and resources in an easy-to-read format” (EPIC) “Read every news article about the Ukrainian Revolution and present the information on a cohesive timeline, with sources labeled.” (RED) “Collect all case law involving reverse mortgages in the state of Florida in which the plaintiff's children filed suit against the mortgage company” --- ### Authorship attribution and stylistic analysis “Examine these two written passages/books and tell me whether they were both written by the same person” (Authorship Attribution Analysis) "Examine these negative reviews and tell me what demographic the authors likely represent based on the language used." "Are these critical forum posts all written by the same person?" --- ### Predictive analysis of text “Look for any information in the newswire which will predict a change in this company's stock price, then buy or sell stock automatically.” “Based on this person's Facebook post history, how likely is he to click an ad for weight-loss pills?” "Based on all the political posts and tweets in Saginaw compared to those in Ann Arbor, how likely is this senator to lose in a recall election?" - “Given this large sample of a child’s speech, is the child likely to be autistic?” (Current research at the LENA Foundation in Boulder) --- ### Sentiment Analysis “How often, in this corpus of blogs, do people say nice or awful things about product X?” "We've just leaked a picture of our next supercar. How do people on Twitter like the design?" "What are people saying about our leaked $199.99 price point?" "How do people on these forums feel about 9/11?" - "This guy talks about guns a lot on Facebook. Should we show him ads for firearms, or ads for gun-control organizations?" --- ### Language Pattern Detection - "Is this an inflammatory, hateful, angry, or trollish comment?" (YouTube) - "Scan online forums for anything which looks like a threat against the President" (The US Secret Service) - “Watch these websites being used by radical groups and look for specific language usage patterns that predict violent behavior.” (All sorts of defense department grants) - “Read every email, looking for threats or discussion of terrorist attacks on American soil.”
--- ... or my personal favorite NLP task... --- ## Temporal Analysis and Event Discovery --- ### “The patient developed a mild post-surgical rash, which was treated with hydrocortisone at the follow-up” - Sequence of events: - 1) Surgery - 2) Mild rash - 3) Hydrocortisone, follow-up (overlapping) - 4) No more rash --- ### If a computer can be taught to interpret time in medical records, we can ask... - "I have 30 seconds to learn this patient's history. Go." - “How often do patients have heart attacks within 2 years of starting Vioxx?” - “How many people who have a facelift develop persistent facial numbness?” - “How long do patients usually live following diagnosis of glioblastoma?” - “Is there a correlation between the administration of vaccines and the development of autism?” - **[(No, damnit.)](http://www.webmd.com/brain/autism/news/20110105/bmj-wakefield-autism-faq?print=true)** --- ## Computers with complex understanding of language can solve many problems - ... and can do many tasks that humans can't (or don't want to) --- So, we're all doomed
--- ### ... well, not so fast --- ### Language is still really hard "Gold covered the miner's hands"/"Gold paid for the miner's education" “The Queen of England’s hat was purple” “We gave the monkeys the bananas because they were ripe” “We gave the monkeys the bananas because they were hungry” “Time flies like an arrow, fruit flies like a banana” “The old man returned to his house was happy” ---
## Hooray! --- # Final Conclusion - You can get a lot of great information just by counting words in chunks of text - This information can tell us a lot about language - More complex analysis gets you even more information - ... but NLP is really, really hard. --- ### This presentation is available online at:
http://savethevowels.org/talks/ngrams_nlp_2018.html
--- ### Thank you! ---