Text Data and Language Modeling

Will Styler - CSS Bootcamp


Today’s plan


There’s a lot of natural language data out there.


Why do we care about natural language data?


Why do we want natural language data at all?


Natural language data tell us about the world


Natural language data provide valuable (🤑) information


Things you’d want to know from natural language data


Natural language data tell us about how language is used.


Natural language data allow us to build language models


Language Model

A probabilistic model that can predict a given word, construction, or sentence in a given type of language, and quantify its probability


Let’s be language models


Every element of natural language understanding depends on good language models


Language models are created by analyzing large amounts of text


Calculating Probability (well) requires large amounts of data!


Building a Corpus


A corpus isn’t super complicated


Corpora have a bunch of language data


(We have access to many more corpora; just talk to Will!)


The format needs to be non-awful


You want to minimize non-language stuff



Everything needs to be in one place


You might want metadata or annotations, too!


Document information


Linguistic information


Semantic information


All of this information combined makes a successful corpus


Creating a corpus is a straightforward process


… but you don’t need to build a corpus for everything …


There are also a huge number of pre-made corpora


Choosing a corpus


Why do we have multiple corpora?


Every type of text is unique


Balance is important


What kind of corpus would you use, and how would you annotate it?


What kind of corpus would you use, and how would you annotate it?


So, you’ve got a corpus, what do you do?


Using Corpora


Many levels of analysis


Reading the corpus


Gentlemen, Attached is an electronic version of the “proposed” First Amendment to ISDA Master Agreement, which was directed by FED EX to Gareth Krauss @ Merced on October 11, 2001. On November 5th, Gareth mentioned to me that their lawyer would be contacting Sara Shackleton (ENA-Legal) with any comments to the proposed First Amendment. Let me know if I may be of further assistance.

Regards, Susan S. Bailey Senior Legal Specialist


Searching the Corpus for specific terms


enronsent08:17021:enlighten you on the degree of corruption in Nigeria.

enronsent13:20442:courts in Brazil which are generally reliable and free of corruption (e.g.,

enronsent17:45199:??N_POTISME ET CORRUPTION??Le n,potisme et la corruption sont deux des prin=

enronsent18:26272:electoral corruption and fraud has taken place, a more balanced Central

enronsent20:3642:by corruption, endless beuacracy, and cost of delays. These “entry hurdles”

enronsent20:23272:Turkish military to expose and eliminate corruption in the Turkish energy=

enronsent21:2159: employees, and corruption. The EBRD is pushing for progress

enronsent21:2292: government has alleged that corruption occurred when the PPA

enronsent22:30087:how did you do on the corruption test?


Searching the corpus for specific patterns


“How often do you see the ‘needs fixed’ construction in corporate emails?”

enronsent02:41843:ation’s energy needs analyzed and streamlined, Enron could do the job. If y=

enronsent11:22173:Let me know if anything needs changed or corrected.

enronsent30:46927:Means broken and needs fixed - like your Mercedes.

enronsent43:7591:Two quick questions that Doug Leach needs answered ASAP to get the oil ordered:


“How often is ‘leverage’ used as a verb?” (70 times)

enronsent27:34968:? SK-Enron has several assets that can be leveraged into an internet play=

enronsent27:36353: leveraging our respective strengths

enronsent35:777:> Well, I know that you were leveraged too

enronsent36:2066:enhanced leveraged product is indeed what is under consideration.

enronsent37:10220:finance and origination skills would be best leveraged. I am very interested

enronsent37:15725:Overall, we’re leveraging our hedge fund relationships to generate more

enronsent41:38104:I believe this division of responsibilities leverages off everyone expertise


Classifying documents


Information extraction


So, how does any of this work?


Conditional Probability

‘What is the probability of this event, given that this other event occurred?’
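
For instance, here’s a toy sketch (made-up word counts, not from the slides) of estimating the probability of a word given the word before it:

from collections import Counter

# A tiny made-up 'corpus': estimate P(next word | previous word) from bigram counts
words = "i want coffee i want tea i need coffee".split()

bigram_counts = Counter(zip(words, words[1:]))
unigram_counts = Counter(words)

# P('coffee' | 'want') = count('want coffee') / count('want')
print(bigram_counts[('want', 'coffee')] / unigram_counts['want'])  # 0.5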


Probabilities are often conditional on other events


Differences in conditional probabilities are information!


Differences in conditional probability let us model language!


How can we get these probabilities cheaply?


N-Gram Language Models


What is an N-gram?


How do we find N-Gram counts?


Tokenization

The language-specific process of separating natural language text into component units, and throwing away needless punctuation and noise.


Tokenization can be quite easy

Margot went to the park with Talisha and Yuan last week.

Tokenization can also be awful.

Although we aren’t sure why John-Paul O’Rourke left on the 22nd, we’re sure that he would’ve had his Tekashi 6ix9ine CD, co-authored manuscript (dated 8-15-1985), and at least $150 million in cash-money in his back pack if he’d planned to leave for New York University.
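
If you want to see how a real tokenizer handles a sentence like that, here’s a quick sketch using NLTK’s word_tokenize (assumes the ‘punkt’ tokenizer data has been downloaded):

from nltk import word_tokenize

sentence = ("Although we aren't sure why John-Paul O'Rourke left on the 22nd, "
            "we're sure that he would've had his Tekashi 6ix9ine CD.")

print(word_tokenize(sentence))
# Contractions are split apart ('are', "n't"), internal hyphens and apostrophes
# in names are kept, and the final period becomes its own token.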


Tokenization Problems


Tokenization can be done automatically


How do we find N-Gram counts?

Choose a (large) corpus of text

Tokenize the words


Let’s try it in our data!



#!/usr/bin/env python

import nltk
from nltk import word_tokenize
from nltk.util import ngrams

# nltk.download('punkt')  # needed once, to fetch the tokenizer models

# Read the whole corpus into a single string
with open('enronsent_all.txt', 'r') as es:
    text = es.read()

# Split the running text into word tokens
tokens = word_tokenize(text)

# Build n-gram generators of each size (these are lazy, single-use iterators)
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
fourgrams = ngrams(tokens, 4)
fivegrams = ngrams(tokens, 5)
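
A likely next step (a sketch of my own, continuing from the variables above) is to count how often each n-gram occurs:

from collections import Counter

# The ngrams() generators above are single-use, so rebuild one and count it
bigram_counts = Counter(ngrams(tokens, 2))

# The ten most frequent bigrams in the corpus
for gram, count in bigram_counts.most_common(10):
    print(count, ' '.join(gram))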


Unigrams


Bigrams


Trigrams


Four-Grams


Five-Grams


Note that the frequencies of occurrence dropped as N rose


OK, Great.


N-Grams give us more than just counts


N-Grams can give us a language model
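
Here’s a minimal sketch (my own, not the slides’ code) of how bigram counts become a crude language model: estimate P(word | previous word), then multiply those probabilities along a phrase.

from collections import Counter

def bigram_model(tokens):
    """Return a function estimating P(word | prev) from a token list."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    def prob(prev, word):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

def phrase_probability(phrase, prob):
    """Multiply the bigram probabilities across a whitespace-split phrase."""
    words = phrase.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= prob(prev, word)
    return p

# With the Enron tokens from earlier, a natural phrase should score far higher
# than the same words scrambled:
# prob = bigram_model(tokens)
# phrase_probability('let me know if', prob) vs. phrase_probability('know if let me', prob)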


These probabilities tell us about Grammar


These probabilities tell us about meaning


The Distributional Hypothesis

“You shall know a word by the company it keeps” - John Rupert Firth


These probabilities tell us about the world


N-Gram Modeling Strengths and Weaknesses?


N-Gram Modeling is relatively simple


N-Gram Modeling is easily scalable


N-Gram Modeling Weaknesses


They only work with strict juxtaposition


Long distance context

I want to tell you the story of the least reliable car I ever bought. This piece of crap was seemingly assembled from spit and popsicle sticks, with bits of foil added in, all for $3000 per part plus labor. Every moment I drove it was offset with two in the shop, paying a Master Technician a masterful wage. Yet, despite a high price tag and decades of amazing reputation, the car was a Mercedes.


Very poor at handling uncommon or unattested N-Grams


N-Gram models are missing information


N-Grams aren’t the solution to every problem


Improvements on N-Gram Models


Skip-Grams
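
One way to get these is NLTK’s skipgrams helper, which generates n-grams that allow a limited number of skipped words in between (a small example of my own):

from nltk.util import skipgrams

# 2-grams allowing up to 2 skipped words between the members of each pair
sent = "I really really like this corpus".split()
print(list(skipgrams(sent, 2, 2)))
# This includes pairs like ('I', 'like'), which a plain bigram model would never see.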


Bag of Words


How do you turn all this into a feature set for machine learning?


Unigram Frequencies are features!


Every text snippet is a row
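
scikit-learn’s CountVectorizer builds exactly this kind of table: one row per snippet, one column per word, counts in the cells. A sketch with made-up snippets (assumes scikit-learn 1.0+ for get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

snippets = [
    "I love this travel clock",
    "this clock is a joke",
    "love love love it",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(snippets)     # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # the columns (vocabulary)
print(X.toarray())                         # one row of counts per snippet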


Bag of Words is dumb


When Bag-of-Words Fails
Case Study: The Hodinkee Travel Clock


The easy approach


How this algorithm reads posts


“Wow, that’s a lot of interest!”


This algorithm has one tiny problem


Treating these as mentions would be dumb


Sentiment Analysis can help!


Sentiment Analysis is hard



How might sentiment analysis work?
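
One off-the-shelf option (not necessarily what’s meant here) is NLTK’s VADER, a lexicon-plus-rules scorer; it needs the vader_lexicon data downloaded first:

from nltk.sentiment import SentimentIntensityAnalyzer

# Requires: import nltk; nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

print(sia.polarity_scores("This travel clock is gorgeous and I want one."))
print(sia.polarity_scores("That much money for a travel clock? What a ripoff."))
# Each call returns neg/neu/pos proportions plus a 'compound' score in [-1, 1].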


What can these basic word-counting approaches handle?


What can’t these basic word-counting approaches handle?


Try a basic bag-of-words analysis first!


You can go a bit more complex without going fully neural


Word Vectors


Distributed Representations


We want movement in this space to represent semantic relationships
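
One common way to learn such a space (not necessarily what’s intended here) is word2vec; a sketch using gensim, assuming version 4+ (older releases use size instead of vector_size):

from gensim.models import Word2Vec

# A tiny made-up corpus; in practice you'd feed in many thousands of tokenized sentences
sentences = [
    ['the', 'meeting', 'is', 'scheduled', 'for', 'tuesday'],
    ['the', 'call', 'is', 'scheduled', 'for', 'monday'],
]

# Each word becomes a 100-dimensional vector learned from the contexts it appears in
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=20)

# Words used in similar contexts should end up near each other in the space
print(model.wv.most_similar('meeting', topn=5))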



Visualizing Word Vectors


Word Vectors offer more power than N-Grams


Limitations of Word Vectors


Other Text Analysis Methods


Text as Data


TF-IDF (Term Frequency-Inverse Document Frequency)
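
The intuition: a term gets a big weight in a document when it’s frequent there but rare across the rest of the collection. A concrete sketch with made-up documents (my example, using scikit-learn):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the gas contract was signed yesterday",
    "the gas pipeline needs repairs",
    "my fantasy football team needs a quarterback",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)            # rows = documents, columns = terms
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))              # distinctive terms get larger weights than common ones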


LSA (Latent Semantic Analysis)
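
In practice, LSA amounts to running a truncated SVD over a TF-IDF (or count) matrix, compressing thousands of word dimensions down to a handful of latent ones. A minimal sketch (my own, via scikit-learn):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["gas contract signed", "pipeline gas repairs",
        "fantasy football quarterback", "football season tickets"]

X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2)       # keep 2 latent dimensions
doc_vectors = lsa.fit_transform(X)       # each document as a 2-dimensional vector
print(doc_vectors.round(2))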


LDA (Latent Dirichlet Allocation)


LDA is wildly powerful
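
For reference, a minimal sketch of fitting LDA with scikit-learn (made-up documents; other toolchains like gensim work too):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["gas contract signed yesterday", "pipeline gas repairs scheduled",
        "fantasy football quarterback trade", "football season starts soon"]

counts = CountVectorizer()
X = counts.fit_transform(docs)           # LDA works from raw word counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)        # per-document topic proportions
print(doc_topics.round(2))

# The words that matter most for each inferred topic
terms = counts.get_feature_names_out()
for topic in lda.components_:
    print([terms[i] for i in topic.argsort()[-3:]])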


LDA still has weaknesses


We need a better model!



Other questions about text analysis?