Please help me out by completing a Mid-Quarter Teaching Feedback form!

http://savethevowels.org/feedback


N-Gram Language Models

Will Styler - LIGN 6


The Plan


What is an N-gram?


How do we find N-Gram counts?


Tokenization

The language-specific process of separating natural language text into component units, and throwing away needless punctuation and noise.


Tokenization can be quite easy

Margot went to the park with Talisha and Yuan last week.

Tokenization can also be awful.

Although we aren’t sure why John-Paul O’Rourke left on the 22nd, we’re sure that he would’ve had his Tekashi 6ix9ine CD, co-authored manuscript (dated 8-15-1985), and at least $150 million in cash-money in his back pack if he’d planned to leave for New York University.


Tokenization Problems


Tokenization can be done automatically
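
As a rough illustration of automatic tokenization, here is a minimal sketch using a single regular expression. The pattern and the name `simple_tokenize` are assumptions made for this example, not the tokenizer used in the course; real tokenizers handle many more cases. The second test sentence is a shortened version of the "awful" example above.

```python
import re

# Hypothetical pattern: a run of word characters, optionally joined by
# internal hyphens or apostrophes (straight or curly). Everything else,
# including punctuation and currency symbols, is thrown away.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*")

def simple_tokenize(text):
    """Return the word-like tokens in text, discarding punctuation."""
    return TOKEN_PATTERN.findall(text)

easy = "Margot went to the park with Talisha and Yuan last week."
print(simple_tokenize(easy))

# Shortened from the harder example sentence.
hard = "John-Paul O’Rourke would’ve had at least $150 million."
print(simple_tokenize(hard))
```

The easy sentence splits cleanly into its eleven words. The harder one keeps "John-Paul" and "would’ve" together but silently drops the dollar sign, which is exactly the sort of language-specific decision a tokenizer has to make.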


How do we find N-Gram counts?

Choose a (large) corpus of text

Tokenize the words


Let’s try it on our data!



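#!/usr/bin/env python

# word_tokenize needs NLTK's tokenizer data, installable via nltk.download()
from nltk import word_tokenize
from nltk.util import ngrams

# Read the whole corpus into memory; the with-block closes the file afterwards
with open('enronsent_all.txt', 'r') as es:
    text = es.read()

# Split the raw text into word and punctuation tokens
token = word_tokenize(text)

# ngrams() returns a lazy, single-pass generator of tuples;
# wrap a result in list() if it needs to be read more than once
unigrams = ngrams(token, 1)
bigrams = ngrams(token, 2)
trigrams = ngrams(token, 3)
fourgrams = ngrams(token, 4)
fivegrams = ngrams(token, 5)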
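
The script above builds the n-gram sequences but stops short of counting them. One way to get counts is sketched below in plain Python with `collections.Counter`, so it runs without NLTK. The helper name `ngram_counts` and the toy sentence are made up for illustration and are not taken from the Enron corpus.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n adjacent tokens, each stored as a tuple."""
    # zip over n staggered copies of the token list yields the n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Toy token list (invented for this sketch)
tokens = "please let me know if you have any questions please let me know".split()

bigram_counts = ngram_counts(tokens, 2)
for gram, count in bigram_counts.most_common(3):
    print(" ".join(gram), count)
```

The same `Counter` approach should work on the generators returned by `nltk.util.ngrams`, since a `Counter` accepts any iterable of hashable items.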


Unigrams


Bigrams


Trigrams


Four-Grams


Five-Grams


Note that the frequencies of occurrence dropped as N rose
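
The drop in frequency can be checked on even a tiny text. The sketch below, which uses an invented toy text and a hypothetical helper `top_count`, finds the count of the single most frequent n-gram for N from 1 to 5. Every occurrence of a longer n-gram contains an occurrence of its shorter prefix, so the top count can never rise as N grows.

```python
from collections import Counter

def top_count(tokens, n):
    """Return the count of the most frequent n-gram in tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(1)[0][1]

# Toy text (invented for this sketch), with periods kept as tokens
tokens = ("the meeting is on monday . the meeting is at noon . "
          "the report is on my desk .").split()

top_counts = [top_count(tokens, n) for n in range(1, 6)]
print(top_counts)
```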


OK, Great.


N-Grams give us more than just counts


N-Grams can give us a language model
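
One common way to turn counts into a model is to estimate the probability of a word given the word before it as count(first second) / count(first). The sketch below assumes that simple relative-frequency estimate; the function name `bigram_probability` and the toy sentence are invented for illustration.

```python
from collections import Counter

def bigram_probability(tokens, first, second):
    """Estimate P(second | first) as count(first second) / count(first)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count `first` only where another token follows it, so the
    # probabilities of all possible next words sum to 1.
    firsts = Counter(tokens[:-1])
    if firsts[first] == 0:
        return 0.0
    return bigrams[(first, second)] / firsts[first]

# Toy token list (invented for this sketch)
tokens = "i like tea and i like coffee but i drink tea".split()

p_like_after_i = bigram_probability(tokens, "i", "like")
p_tea_after_like = bigram_probability(tokens, "like", "tea")
print(p_like_after_i, p_tea_after_like)
```

Here "i" is followed by "like" in two of its three occurrences, and "like" is followed by "tea" in one of its two.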


These probabilities tell us about Grammar


These probabilities tell us about meaning


The Distributional Hypothesis

“A word is characterized by the company it keeps” - John Rupert Firth
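
A rough way to see "the company a word keeps" is to count each word's immediate neighbours and compare those counts. The sketch below assumes a one-word window and cosine similarity as the comparison. Both are choices made for this example, and the four toy sentences are invented.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            left = sentence[max(0, i - window):i]
            right = sentence[i + 1:i + 1 + window]
            vectors[word].update(left + right)
    return vectors

def cosine(a, b):
    """Cosine similarity between two Counters treated as sparse vectors."""
    dot = sum(a[key] * b[key] for key in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy sentences (invented for this sketch)
sentences = [s.split() for s in [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
]]

vectors = context_vectors(sentences)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_milk = cosine(vectors["cat"], vectors["milk"])
print(cat_dog, cat_milk)
```

In this toy data "cat" and "dog" appear next to exactly the same words, so they come out far more similar to each other than "cat" is to "milk".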


These probabilities tell us about the world


Even Unigram counts are interesting, in the right context


Enter Google Ngrams

https://books.google.com/ngrams


Some things never change


Eat


Sleep


Walk


Eat/Sleep/Walk


Some words, you might expect to change over time


Automobile


Computer


Laptop


Download


Google


Confederacy


Some words are falling out of use


Bilious


Blackguard


Retarded


Society is represented in distributions


Nazi


War


Color Terms


Sex


Let’s play a game!


Clue: Type of person (belonging to a certain group or culture)


Clue: Home/Office Technology


Clue: Country


Clue: Military Technology


Clue: Transportation Technology


Clue: Food Product


Warring Words


Vitriol vs. Sulfuric Acid


Aeroplane vs. Airplane


VHS vs. DVD


Handicapped vs. Disabled


Flammable vs. Inflammable


N-Gram models are really useful


N-Gram uses in the real world


… and all of this comes from counting words


N-Gram Modeling Strengths


N-Gram Modeling is relatively simple


N-Gram Modeling is easily scalable


N-Gram Modeling Weaknesses


They only work with strict juxtaposition


Very poor at handling uncommon or unattested N-Grams
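
The problem with unattested N-Grams shows up as soon as probabilities are multiplied together. The sketch below assumes an unsmoothed bigram model built from relative frequencies. The names `mle` and `sentence_probability` and the toy data are invented for illustration.

```python
from collections import Counter

# Toy training text (invented for this sketch)
tokens = "i like tea and i like coffee".split()
bigrams = Counter(zip(tokens, tokens[1:]))
firsts = Counter(tokens[:-1])

def mle(first, second):
    """Unsmoothed estimate of P(second | first) from the toy counts."""
    if firsts[first] == 0:
        return 0.0
    return bigrams[(first, second)] / firsts[first]

def sentence_probability(words):
    """Multiply the bigram probabilities along a sequence of words."""
    p = 1.0
    for first, second in zip(words, words[1:]):
        p *= mle(first, second)
    return p

seen = sentence_probability("i like tea".split())
unseen = sentence_probability("i like cocoa".split())
print(seen, unseen)
```

"i like cocoa" is a perfectly reasonable sentence, but because "like cocoa" never occurs in the training text, the whole sequence is scored as impossible.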


N-Gram models are missing information


N-Grams aren’t the solution to every problem


Improvements on N-Gram Models


Skip-Grams
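
Assuming "skip-gram" is meant here in the sense of an n-gram that may skip over intervening words, a minimal sketch for word pairs looks like this. The function name `skip_bigrams` and the parameter `max_skip` are invented for this example.

```python
def skip_bigrams(tokens, max_skip):
    """Return ordered word pairs with at most max_skip tokens between them."""
    pairs = []
    for i, first in enumerate(tokens):
        # The partner may sit up to max_skip + 1 positions to the right
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs

tokens = "the cat sat on".split()

plain = skip_bigrams(tokens, 0)      # no skipping: ordinary bigrams
with_skip = skip_bigrams(tokens, 1)  # also pairs one word apart
print(plain)
print(with_skip)
```

With `max_skip` set to 0 this reduces to ordinary bigrams; allowing one skipped word adds pairs such as ("the", "sat"), which relaxes the strict juxtaposition requirement.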


Word Embeddings/Word2Vec


… but still, N-Gram modeling forms the core!


Wrapping up


Thank you!