“I know the project is due today but I’m just getting started and I have questions…”


Automatic Part-of-Speech Tagging

Will Styler - LIGN 6


Today’s Plan


We’ve talked about parts of speech already


Lexical Categories


… but these are linguistic, human categories


We also gave you ‘tests’ to use


… but a computer can’t use any of these tests


So, we can’t teach computers to do POS tagging in the same way that we teach humans to!


Preparing for POS Tagging


Before we can automate it, we need to do it with humans


Determining the best tagset


For English…

(Table from Jurafsky and Martin ‘Speech and Language Processing’ 3e)


Annotating a corpus for POS tags


On/IN an/DT exceptionally/RB hot/JJ evening/NN early/RB in/IN July/NNP a/DT young/JJ man/NN came/VBD out/RP of/IN the/DT garret/NN in/IN which/WDT he/PRP lodged/VBN and/CC walked/VBD slowly/RB ,/, as/RB though/IN in/IN hesitation/NN ,/, towards/IN a/DT bridge/NN ./.
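The word/TAG notation above is easy to read back into a program. The following is a minimal sketch, assuming tokens are separated by whitespace and that the tag is whatever follows the last slash; the function name `parse_tagged` is made up for illustration.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash, so a word that itself contains '/' stays intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


# The opening words of the annotated sentence above.
sample = "On/IN an/DT exceptionally/RB hot/JJ evening/NN early/RB in/IN July/NNP"
pairs = parse_tagged(sample)
print(pairs[:3])  # [('On', 'IN'), ('an', 'DT'), ('exceptionally', 'RB')]
```

Punctuation tokens such as `,/,` and `./.` come out as `(',', ',')` and `('.', '.')`, because the punctuation mark serves as its own tag.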


All example tagging from today comes from the Stanford Parser


There are many tagged corpora already out there


Once you have a tagset and a corpus, you can use…


Automatic POS Tagging


POS Ambiguity

How much uncertainty there is about the part of speech of a given word
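One possible way to put a number on this uncertainty is to look at how a word's occurrences are spread across tags in a tagged corpus. The sketch below uses Shannon entropy for that, which is an assumption of this example rather than something the slides specify; the toy data is invented.

```python
import math
from collections import Counter, defaultdict


def tag_distributions(tagged_words):
    """Map each lowercased word to a Counter of the tags it was seen with."""
    dist = defaultdict(Counter)
    for word, tag in tagged_words:
        dist[word.lower()][tag] += 1
    return dist


def tag_entropy(tag_counts):
    """Entropy (in bits) of one word's tag distribution; 0 means no ambiguity."""
    total = sum(tag_counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in tag_counts.values())


# Toy data, invented for illustration only.
toy = [("the", "DT"), ("the", "DT"),
       ("saw", "VBD"), ("saw", "NN"),
       ("back", "RB"), ("back", "NN"), ("back", "VB"), ("back", "JJ")]
dist = tag_distributions(toy)
for word in ("the", "saw", "back"):
    print(word, len(dist[word]), "tag(s)", tag_entropy(dist[word]), "bits")
```

In this toy data 'the' has one tag and zero entropy, 'saw' is split evenly between two tags (1 bit), and 'back' is split evenly between four (2 bits).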


Some words are certain in terms of POS


Some words are only a bit ambiguous in POS


Some words are very ambiguous in POS


Some words have many parts of speech


POS tagging is about resolving this ambiguity


The Stupid Approach: ‘Most Frequent Tag’


Most Frequent Tag Accuracy
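The baseline and its accuracy can be sketched in a few lines. This is an illustrative sketch, not the course's implementation: the function names are made up, the training data is a toy, and falling back to NN for unseen words is an assumption.

```python
from collections import Counter, defaultdict


def train_most_frequent(tagged_words):
    """Remember, for every word, the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_most_frequent(words, model, default_tag="NN"):
    """Tag each word with its most frequent tag, ignoring context entirely."""
    # default_tag for unseen words is an assumption of this sketch.
    return [(w, model.get(w.lower(), default_tag)) for w in words]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)


# Toy training data, invented for illustration: 'saw' is mostly a verb here.
train = [("I", "PRP"), ("saw", "VBD"), ("saw", "VBD"), ("saw", "NN"),
         ("bought", "VBD"), ("a", "DT"), ("the", "DT")]
model = train_most_frequent(train)

gold = [("I", "PRP"), ("bought", "VBD"), ("a", "DT"), ("saw", "NN")]
predicted = tag_most_frequent([w for w, _ in gold], model)
print(accuracy(predicted, gold))  # 0.75
```

The one error is 'saw': the baseline always picks VBD because that was the more frequent tag in training, even when the word follows a determiner.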


Slightly more intelligent: Word form features
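Word form features describe the shape of a word rather than its identity. The particular features below (suffixes, capitalization, digits, hyphens) are common choices but are assumptions of this sketch, and `word_form_features` is a made-up name.

```python
def word_form_features(word):
    """Describe a word by its surface form, without looking at its neighbors."""
    return {
        "lower": word.lower(),
        "suffix2": word[-2:].lower(),   # e.g. 'ly' often signals an adverb
        "suffix3": word[-3:].lower(),   # e.g. 'ing' often signals a verb form
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
    }


print(word_form_features("slowly")["suffix2"])       # ly
print(word_form_features("July")["is_capitalized"])  # True
```

A classifier could be trained on features like these, which lets it make a reasonable guess even for a word it has never seen.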


… but words come in sequences. We should use that!


HMM-based POS Tagging


Hidden Markov Model

A machine learning process which models a series of observations, with the assumption that there’s some ‘hidden’ state which helps to predict the observations


One major assumption of HMMs


HMMs for POS Tagging


How do we use HMMs for POS tagging?


We need to know two types of probabilities


To get observation probabilities…
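A minimal sketch of one way to estimate observation probabilities, assuming the usual relative-frequency estimate from a tagged corpus: P(word | tag) = count(tag, word) / count(tag). The function name and the toy corpus are made up for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def emission_probs(tagged_sentences):
    """Estimate P(word | tag) as count(tag, word) / count(tag)."""
    tag_counts = Counter()
    pair_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[tag] += 1
            pair_counts[tag][word.lower()] += 1
    return {
        tag: {word: c / tag_counts[tag] for word, c in words.items()}
        for tag, words in pair_counts.items()
    }


# Toy corpus, invented for illustration only.
corpus = [
    [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("sign", "NN")],
    [("I", "PRP"), ("bought", "VBD"), ("a", "DT"), ("saw", "NN")],
]
emit = emission_probs(corpus)
print(emit["VBD"]["saw"])  # 0.5
print(emit["NN"]["saw"])   # 0.5
```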


Observation probability gets at the idea of ‘POS Ambiguity’


To get transition probabilities…
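Transition probabilities can be estimated from the same kind of tagged corpus. The sketch below assumes a bigram model, where each tag depends only on the tag before it: P(tag | previous tag) = count(previous, tag) / count(previous). The `<s>` start marker, the function name, and the toy corpus are assumptions of this example.

```python
from collections import Counter, defaultdict


def transition_probs(tagged_sentences, start="<s>"):
    """Estimate P(tag | previous tag) from tag bigram counts."""
    prev_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = start  # pseudo-tag marking the beginning of a sentence
        for _, tag in sentence:
            prev_counts[prev] += 1
            bigram_counts[prev][tag] += 1
            prev = tag
    return {
        prev: {tag: c / prev_counts[prev] for tag, c in following.items()}
        for prev, following in bigram_counts.items()
    }


# Toy corpus, invented for illustration only.
corpus = [
    [("the", "DT"), ("cats", "NNS"), ("sit", "VBP")],
    [("the", "DT"), ("cute", "JJ"), ("cats", "NNS"), ("sit", "VBP")],
]
trans = transition_probs(corpus)
print(trans["DT"])  # {'NNS': 0.5, 'JJ': 0.5}
```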


Transition probabilities get at the idea that syntax involves sequences of word types


Now we know the probabilities!


We decode the HMM


HMM Decoding: The Basic Idea
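Decoding means finding the tag sequence that makes the observed words most probable. The sketch below assumes the Viterbi algorithm, the standard dynamic-programming decoder for HMMs, which the slides do not name explicitly. All probabilities in the toy model are invented, and the small `floor` value standing in for unseen events is an assumption, not a proper smoothing method.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    if not words:
        return []

    def lp(table, outer, inner):
        # Log-probability, with a tiny floor so unseen events do not give log(0).
        return math.log(table.get(outer, {}).get(inner, floor))

    # best[i][t]: log-probability of the best tag path ending in tag t at word i
    best = [{t: math.log(start_p.get(t, floor)) + lp(emit_p, t, words[0])
             for t in tags}]
    back = [{}]  # back[i][t]: the previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p, p, t)) for p in tags),
                key=lambda item: item[1],
            )
            best[i][t] = score + lp(emit_p, t, words[i])
            back[i][t] = prev

    # Start from the best final tag and follow the back-pointers.
    path = [max(best[-1], key=best[-1].get)]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))


# Toy model; every number here is invented for illustration.
tags = ["PRP", "VBD", "DT", "NN"]
start_p = {"PRP": 0.6, "DT": 0.4}
trans_p = {"PRP": {"VBD": 0.9, "NN": 0.1}, "VBD": {"DT": 0.8, "NN": 0.2},
           "DT": {"NN": 1.0}, "NN": {"VBD": 0.5, "NN": 0.5}}
emit_p = {"PRP": {"i": 1.0}, "VBD": {"saw": 0.5, "bought": 0.5},
          "DT": {"the": 0.7, "a": 0.3}, "NN": {"saw": 0.4, "sign": 0.6}}

print(viterbi(["i", "bought", "a", "saw"], tags, start_p, trans_p, emit_p))
# ['PRP', 'VBD', 'DT', 'NN']
```

Although 'saw' is more likely to be a verb in isolation in this toy model, the transition from DT makes the noun reading win, which is exactly the context that the most-frequent-tag approach cannot use.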



So, we have the most likely set of POS tags


One consequence of HMM-based tagging


the/DT three/CD cute/JJ cats/NNS made/VBN will/MD sit/VB back/RP in/IN awe/NN


How does HMM-based POS tagging perform?


… Why only 97% accuracy?


POS Tagging is hard


Use-mention distinctions


Not every word that appears in a sentence is being used; some are only being mentioned


‘She said “bear” was her favorite word.’


‘Roger texted me “back”’


‘I bought the The Pianist DVD’


Ambiguous Sentences


Some sentences are actually ambiguous in POS tagging


‘Maria was entertaining last night’


‘I saw the official take from the store.’


‘You should ask a Smith.’


‘I hate bridging gaps.’


Rare or Unknown words


Rare or unknown words


‘yeet’


‘yeeting’


‘yeeted’


‘I yeet when I throw empty cans’
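For a word the tagger has never seen, one fallback is to guess from its ending. The suffix rules below are hypothetical and deliberately crude (for example, '-ed' could equally be VBN); they use Penn Treebank tag names as in the examples earlier in the deck.

```python
# Hypothetical suffix rules, checked in order; invented for illustration.
SUFFIX_GUESSES = [("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("s", "NNS")]


def guess_unknown_tag(word, default_tag="NN"):
    """Guess a tag for an unseen word from its suffix alone."""
    lower = word.lower()
    for suffix, tag in SUFFIX_GUESSES:
        # Require some stem before the suffix, so 'ed' itself is not matched.
        if lower.endswith(suffix) and len(lower) > len(suffix):
            return tag
    return default_tag  # assumption: unknown words are most often nouns


for word in ("yeet", "yeeting", "yeeted"):
    print(word, guess_unknown_tag(word))
```

This guesses VBG for 'yeeting' and VBD for 'yeeted', but falls back to NN for bare 'yeet', which is wrong in 'I yeet when I throw empty cans'. Word form alone is not enough there; the surrounding sequence is what signals a verb.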


‘lit’


‘That phonetics lab meeting was lit’


‘I’m studying English Lit’


‘They lit the beacon of Amon Din to summon the Rohirrim’


Homonyms


Homonyms are (always) a problem


‘I saw the sign’


‘I saw the sign whenever I need to test the cutting feel of a new blade’


‘I bought a saw’


POS Tagging is crucial


POS Tagging is very helpful


Wrapping up


For Next Time


Thank you!