## Ask An Algorithm: Using Machine Learning to study Human Speech

### Will Styler

University of Michigan Department of Linguistics

wstyler@umich.edu

---

## Humans are necessary for linguistic research

- Any hypothesis about human language must be tested with human speakers
- ... but testing with human subjects is a painful process
    - IRBs are required
    - It's time consuming
    - It's expensive
    - Studies can be difficult to design
    - Each participant has a different language background

---

So, even though we need humans to test our hypotheses and theories...

- **Any information we can get *without actually involving humans* is great**
- Today, we're going to talk about collecting this kind of information using...

---

# Machine Learning

---

### Machine Learning

- Asking computer algorithms to take data and *independently* build a model of it
    - "I don't know what this pattern is. You figure it out and apply it".
- Heavily statistical
- Many different algorithms
    - Each with strengths, weaknesses, and use cases

---

### We'll be talking about classifiers

- Classifiers read in features, then sort and classify data into multiple categories
- Usable for known data (to learn more about the patterns)
- Usable for predicting new data

---

### Supervised Machine Classification 101

* Select a large corpus of data, and manually assign each observation to a group
    * "Good mail" vs. "Spam" or "/i/" vs. "/u/"
* **Training:** Feed this labeled data into an algorithm so it can learn the patterns
    - Using a defined set of features of the data
* **Testing:** Give the trained algorithm new data without labels, and check the accuracy of its classifications
* Better accuracy often indicates more useful information was given to the classifier!

---

### Machine Classification is *everywhere*

- "Is this email spam, or not?"
- "Should we lend this person money?"
- "Is this handwritten symbol "1" or "2" or "3" or...?"
- "Is this word a noun, or a verb, or an adjective, or...?"
- So, these algorithms are well-studied and optimized
    - ... but they're usually used for engineering and problem-solving, not research

---

### They have some serious advantages for studying language!

* Their decisions are easier to quantify than humans'
* They'll (often) tell you *how* they made the decision they did
* They have no knowledge that you don't give to them
* They make all decisions independently
* They don't require payment or scheduling
* They're available 24/7

---

## Today's Talk

- **How can machine learning be used to complement humans in linguistic research?**
- Two very different domains, problems and goals
    * **Part 1: Speech Perception**: What are the acoustic cues to vowel nasality?
    * **Part 2: Speech Production**: How can we identify specific articulatory gestures or postures in connected speech?
- Both show strong evidence that algorithms can model human judgement and perception in tough problems

---

# Part 1: Machine Learning and Speech Perception

---

## Vowel Nasality

- Vowel Nasality is the opening of the Velopharyngeal port during the vowel
‘Cat’
[kæt]
‘Can't’
[kæ̃nt]
- **What are the acoustic cues used for perceiving vowel nasality in English?**
- The biggest problem with this question is that...

---

## The signal is incredibly rich

---

### 29 Potential Cues for evaluation

* All spectral or temporal features in the signal
* Some absolute, some relative
* Features like...
    * Formant Frequencies and Bandwidths
    * Spectral Relationships (like A1-P0 or A3-P0)
    * Nasal Resonances
    * Spectral Tilt
    * Vowel Duration
    * ... and more!

---

### Testing the perception of 29 features is *not* trivial

- Each feature requires many stimuli to be generated
- Each stimulus might need several repetitions
- The task is deeply boring
- You'll run out of participant endurance long before you run out of features
- So, instead of asking humans to evaluate all 29, let's use...

---

## Machine Speech Perception

---

### The Basic Idea

Human speech perception is just classifying sounds based on acoustical features

* **Computers can do that too!**
* Give the acoustic feature information to a classifier and ask for oral vs. nasal judgements
* Greater accuracy means a feature or grouping is more useful and informative!

---

### The Plan

- 1: Collect a corpus of oral and nasal words, and measure each feature
- 2: Give each feature to a Machine Learning Algorithm individually
    - The most informative features should be the most accurate
* 3: Find the best group of features
    * Find the balance between "few features" and "good accuracy"
* 4: Test *those* features with expensive and difficult humans

---

### Labeling and Training

* I recorded 12 English speakers making words with oral and nasal(ized) vowels
    * "Oral" vowels were in CVC contexts, and "Nasal" were in CVN/NVC/NVN contexts
    * This resulted in 3823 words
* Then, I measured each of the 29 features at two timepoints per vowel
    * All measurement was done automatically by Praat Script
* Then I handed them to a Support Vector Machine as training data

---

### Support Vector Machines

* A very common, very accurate machine learning algorithm
* Look at all the data in a multi-dimensional space
    * As many dimensions as features
* Try to find a line or hyperplane that optimally separates the classes
* Classification is just seeing where the new data is relative to that line

---

## **So how does it perform with nasality?**

---

### Single-Feature testing

* Are any features good enough *on their own* to allow recognition of nasal vowels?
* 29 separate models (one per feature) classifying datapoints as "oral" or "nasal"
* Each model outputs accuracy figures, which we can compare

---

### Single-feature findings

* F1's Bandwidth is the most useful and informative feature
    * 67.6% SVM accuracy
* A1-P0, a measure of relative spectral prominence, gets second place
    * 64.7% SVM accuracy
* The worst feature performed at 51.23% accuracy
* *None of the features are good enough on their own!*

---

## What *group of features* provides the best information?

---

### Multi-feature modeling

* Tested 10 *a priori* feature groupings
    * Selected from various outputs of the machine learning and statistics
* Compared the accuracy *in light of the number of features*
* The winning model gets the best performance from the fewest features

---

### Multi-feature Results

* SVMs with all features worked best (29 features)
    * 84.7% accuracy
* Formant Frequency and Bandwidth, Spectral Tilt, A1-P0, and Vowel Duration was the best subgroup (5 features)
    * 82.2% accuracy
* **We only lose 2.5% accuracy when we reduce our feature set by 69%!**
* That's a promising grouping!

---

### Overall Machine Learning Results

* **Formant Bandwidth** was the most useful single feature for English (62.5% accuracy)
* ... and we've got a multi-feature grouping with very good accuracy (82.2% accuracy)!
    * Formant Bandwidth, Formant Frequency, Spectral Tilt, A1-P0, and Duration
* **So, let's test those five features with actual humans!**

---

# Human Perception

---

### Methods

* English listeners can use vowel nasality to identify missing nasal consonants
    * ba_ could be "bad" or "ban"
* **Let's add or remove features from vowels to see what indicates "nasality"!**
* If adding or removing a feature changes perception, or makes them react more slowly, it's important!

---

### The Modifications

Use signal processing to simulate the oral-to-nasal change (or vice versa) in...

* 1) A1-P0 (or vice versa)
* 2) Duration
* 3) Spectral Tilt
* 4) Formant Bandwidth and Frequency
    * Combined
* 5) Modify *all five features at once!*

---

### The Experiment

* Recruited 42 normal-hearing Native English speakers from a department subject pool
* Each listened to 400 words with different modifications
* Analyzed both confusion and reaction time associated with stimulus changes

---
bad
ban
---
bomb
bob

---

### Human Perception Summary

* Only **formant modification** had a significant effect on perception
    * Formant modification caused listeners to respond more slowly
    * Formant modification made oral vowels sound "nasal"
* F1's bandwidth is probably the cue
    * This makes sense acoustically, and Hawkins and Stevens (1985) also point in that direction

---

### Score one for the Machine!

- The machine learning models predicted F1's bandwidth as the most useful feature...
* ... and the humans agreed!
* ### How similar *are* the SVMs and the humans?

---

# Humans vs. Machines

---

*Let's give the computer the same experimental task as the humans, using the same altered stimuli, and see how they compare!*

---

### Testing Humans vs. Machines

* 1) Train an SVM on all of the English Data
* 2) Extract acoustic features from the stimuli used in the experiment
* 3) Test those SVMs using the experimental stimuli data
    * Again classifying "oral" or "nasal"
* 4) Compare the by-condition confusion results to the humans

---

### Confusion by Condition
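
Step 4 of the procedure comes down to tallying, for each stimulus condition, how often a listener (human or SVM) gave a response other than the expected one. Here is a minimal Python sketch of that tally. It assumes each trial is stored as a `(condition, expected, response)` tuple; the function name, the condition labels, and the toy trials are hypothetical and are not data from the study.

```python
from collections import defaultdict


def confusion_by_condition(trials):
    """Return {condition: proportion of trials where response != expected}."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for condition, expected, response in trials:
        totals[condition] += 1
        if response != expected:
            errors[condition] += 1
    return {cond: errors[cond] / totals[cond] for cond in totals}


# Made-up toy trials, purely to show the shape of the comparison
human_trials = [
    ("control", "oral", "oral"),
    ("control", "nasal", "nasal"),
    ("formants", "oral", "nasal"),
    ("formants", "oral", "oral"),
]
svm_trials = [
    ("control", "oral", "oral"),
    ("control", "nasal", "oral"),
    ("formants", "oral", "nasal"),
    ("formants", "oral", "nasal"),
]

human_rates = confusion_by_condition(human_trials)
svm_rates = confusion_by_condition(svm_trials)
for condition in human_rates:
    print(condition, human_rates[condition], svm_rates[condition])
```

Putting the two dictionaries side by side shows whether the conditions that confuse the humans are also the ones that confuse the classifier.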

---

### Humans vs. Machines Summary

* Humans and machines *did* show similar patterns
* Modifications that were difficult for humans were difficult for SVMs
* Humans are still more accurate overall

---

### The SVMs didn't model the humans exactly!

* SVMs predicted gradient usefulness of the features
    * Humans based their decisions entirely on F1's Bandwidth
* SVMs showed greater accuracy when all features were available
    * Humans weren't meaningfully affected by the additional three features
* So, SVMs can show relative informativeness of features
    * ... but they can't show what humans actually do use

---

## Part 1: Conclusion

* The SVM studies very effectively narrowed the field
* The SVM studies and the humans both agreed on the best feature
* Trained SVMs were able to perform the same experiment, with similar results
* **Modeling human speech perception using machine learning is helpful!**

---

# Part 2: Machine Learning and Speech Production

---

## Pause Postures

- Articulation during pauses has been studied previously
    - Gick et al. 2005
    - Wilson and Gick 2014
    - Ramanarayanan et al. 2013
    - Usually in the context of "articulatory setting"
- Pause postures *per se* first described by Katsika et al. (2014) in Greek
    - Specific configurations of the articulators at strong prosodic boundaries
    - Moving articulator(s) to a separate and different target during a pause

---

### Pause Postures in Lip Movement

---

### Pause postures

- These postures seem to violate our usual tendency towards economy of effort
- Do these pause postures occur in English?
    - If so, under what conditions?
- Jelena Krivokapic, Ben Parrell, Jiseung Kim, and I are working to find this out.
- But this is very new research
- So first, we need to know...
- ***Can we reproducibly detect, measure, and label these pause postures?***

---

### Electromagnetic Articulography (EMA)

Carstens AG501 System

---

### EMA: The Basics

Small wired sensors are glued to the articulators

---

### EMA Sensor Placement (Tongue)

---

### EMA Sensor Placement (Lips)

---

### EMA: The Basics

- Machine pulses different magnetic fields from different coils oriented around the head
- Sensors get strength of each of these fields via electromagnetic induction
- The relative strengths of each field are used to calculate the sensor's position in space
- Accurate to 0.1mm, at up to 1250 measurements per second

---

### EMA Data

---

## The Data

- EMA Data collected from 7 speakers at USC (by Jelena Krivokapic and Ben Parrell)
- 1891 curves total
- Sentences like...
    - "I don't know about MIma. Mini does, though."
    - "Does she know about MIma? Mini discovered MIma!"
    - "I know about biBU, mini, and the rest of the gang."
- Now being analyzed at University of Michigan

---

## The main problem

- Pause postures are newly discovered
- They're based on unusual patterns of curvature
- Nobody has directly measured these before
- **Can we reliably identify and measure pause postures at all?**

---

## Why measuring them might be hard

- These articulatory data are incredibly rich
- Pause postures are seen in changes *over time*
- Speakers differ in articulation and absolute positions
- Relatively fewer tokens seem to show pause postures at all
    - Only 30% of current data seem to contain PPs
- There are clear "Yes" and "No" tokens, but also "Maybe"

---

### Some pause postures are quite clear

---

### Some pause postures are quite clear

---

### Some clearly lack Pause Postures

---

### Some clearly lack Pause Postures

---

### Some tokens are less certain

---

### Some tokens are less certain

---

### Our questions

- Are there measurable, reproducible patterns associated with pause postures in these data?
- Can we empirically capture the gradience and uncertainty of these pause postures?
- Can we identify pause postures without human intervention?

---

## Methods

- Human annotator marks pause boundaries
    - End of prior gesture to start of following C
- Human annotator classifies each as "Yes" or "No" Pause Posture based on Lip Aperture
    - With secondary marking as "Yes", "Maybe", "Unlikely", and "No"
- Train SVM Classifiers to find PPs using the annotator's Yes/No judgement
- Test on new data to gauge accuracy

---

### Machine Learning can address our questions

- Is the pattern measurable?
    - **If the SVM can find PPs based on mathematical features, then YES!**
- Can we capture the gradience of these pause postures?
    - **If the SVM can differentiate "Yes", "Maybe", "Unlikely" and "No" tokens, then YES!**
- Can we identify pause postures without human intervention?
    - **If the SVM shows high agreement with the human, then YES!**

---

### Choosing the features

- Simple features like means or straight lines don't work
    - The data are curved, and scales differ for each speaker
- Even more complex features like "area under the curve" or "deviation from straight line" fail
    - Lots of tokens deviate from a straight line, but this is more specific
- Instead, we're interested in specifically timed changes in lip trajectory
    - These specific movements represent not just noise or interpolation, but gesture
- ... but how do we describe specific types of curvature?

---

## Functional Principal Component Analysis

- "Feed all the curves in, then observe the dominant *patterns* of variation"
- Each pattern is called a "component"
    - Each component is *orthogonal* to the others, representing an independent type of difference
- Each pause is given a score for each component
- We'll do two Principal Component Analyses

---

### PCA #1: Trajectory Modeling

- Model the trajectory itself
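
To make "components" and "scores" concrete, here is a rough Python sketch. It is not the functional PCA used in this work: it runs ordinary PCA (by power iteration) on curves that are assumed to have already been resampled to the same number of points, which is only a simplified stand-in. The function names and the toy "bump" curves are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pca_scores(curves, n_components=1, iterations=100):
    """Plain PCA via power iteration on equal-length sampled curves.

    Returns (components, scores), where scores[i][k] is curve i's
    score on component k.
    """
    n, length = len(curves), len(curves[0])
    mean = [sum(c[t] for c in curves) / n for t in range(length)]
    residual = [[c[t] - mean[t] for t in range(length)] for c in curves]
    components = []
    scores = [[] for _ in curves]
    for _ in range(n_components):
        # Start from the residual curve with the most remaining variation
        start = max(residual, key=lambda row: dot(row, row))
        norm = math.sqrt(dot(start, start))
        if norm < 1e-12:
            break  # no variation left to describe
        v = [x / norm for x in start]
        for _ in range(iterations):
            # One power-iteration step: v <- X^T (X v), then renormalize
            proj = [dot(row, v) for row in residual]
            w = [sum(p * row[t] for p, row in zip(proj, residual))
                 for t in range(length)]
            w_norm = math.sqrt(dot(w, w))
            v = [x / w_norm for x in w]
        components.append(v)
        comp_scores = [dot(row, v) for row in residual]
        for i, s in enumerate(comp_scores):
            scores[i].append(s)
        # Deflate: remove this component so the next one is orthogonal to it
        residual = [[row[t] - s * v[t] for t in range(length)]
                    for row, s in zip(residual, comp_scores)]
    return components, scores


# Toy curves: a flat baseline plus varying amounts of one "bump" shape
bump = [0.0, 1.0, 2.0, 1.0, 0.0]
curves = [[5.0 + amount * b for b in bump] for amount in (0, 1, 2, 3)]
components, scores = pca_scores(curves)
```

In the toy data the only way the curves differ is in how much "bump" they contain, so the first component recovers the bump shape (up to sign) and each curve's score reflects how much of it that curve has.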

---

### PCA #1 (Trajectory Modeling): Components

---

### PCA #2: Difference Modeling

- Model the difference between the trajectory and the interpolation
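
As a hedged illustration of the input to this second analysis, the sketch below subtracts a straight line from a sampled trajectory. It assumes "the interpolation" means a straight line drawn between the first and last samples of the pause, which is an assumption rather than something the slides spell out; the function name and the toy values are hypothetical.

```python
def difference_from_interpolation(trajectory):
    """Subtract the straight line joining the first and last samples."""
    n = len(trajectory)
    if n < 2:
        raise ValueError("need at least two samples")
    start, end = trajectory[0], trajectory[-1]
    step = (end - start) / (n - 1)
    return [y - (start + step * i) for i, y in enumerate(trajectory)]


# A trajectory that simply moves from start to end leaves no residue...
straight = difference_from_interpolation([10.0, 12.0, 14.0, 16.0])
# ...while a detour partway through shows up as a nonzero difference
bumped = difference_from_interpolation([10.0, 15.0, 14.0, 16.0])
print(straight, bumped)
```

The resulting difference curves, rather than the raw trajectories, would then be what is fed into the second PCA.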

---

### PCA #2 (Difference Modeling): Components

---

### The PCA Scores

- Each token has 12 scores, six from trajectory models, six from difference models
- High or low scores represent the presence of a specific shape and timing of curvature in a given pause

---

## The Model

- Randomly split the data into 80% for training, 20% for testing
    - This is very conservative, but still gives a reasonable estimation of accuracy
- The SVM is trained using these curvature-based measures
    - Twelve PC scores per pause, along with the "Yes" or "No" Pause Posture judgement
- Class weights are adjusted to compensate for the rarity of pause postures
    - "Give more weight to the pause posture tokens, as they're important"
- Returns classification accuracy and Cohen's Kappa to measure human-to-computer agreement
    - Kappa measures agreement with the human *accounting for chance*

---

### The World's Dumbest Model

"Ignore the data, guess that every item is not a PP."

| | Actual No | Actual Yes |
|---|---|---|
| **Predicted No** | 268 | 107 |
| **Predicted Yes** | 0 | 0 |

- Prediction Accuracy: 71.4%
- Cohen's Kappa: 0
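
Both numbers can be recomputed from the four cell counts. The Python sketch below uses the standard definitions (accuracy is the share of items on the diagonal; kappa compares that to the agreement expected by chance from the row and column totals). The function and argument names are hypothetical.

```python
def accuracy_and_kappa(pred_no_act_no, pred_no_act_yes,
                       pred_yes_act_no, pred_yes_act_yes):
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion table."""
    total = (pred_no_act_no + pred_no_act_yes
             + pred_yes_act_no + pred_yes_act_yes)
    observed = (pred_no_act_no + pred_yes_act_yes) / total

    pred_no = pred_no_act_no + pred_no_act_yes
    pred_yes = pred_yes_act_no + pred_yes_act_yes
    act_no = pred_no_act_no + pred_yes_act_no
    act_yes = pred_no_act_yes + pred_yes_act_yes
    # Agreement expected if predictions and labels were independent
    expected = (pred_no * act_no + pred_yes * act_yes) / total ** 2

    if expected == 1.0:
        # Kappa is undefined here; this sketch simply reports 0.0
        return observed, 0.0
    return observed, (observed - expected) / (1 - expected)


# Cell counts from the "World's Dumbest Model" table
accuracy, kappa = accuracy_and_kappa(268, 107, 0, 0)
print(f"accuracy={accuracy:.3f}  kappa={kappa:.3f}")
```

Because the model always guesses "No", its observed agreement is exactly what chance alone would produce, which is why kappa is 0 even though accuracy looks respectable.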

---

### PCA #1 ONLY Model: Trajectory Modeling

Only using the 6 PCs per item that are based on actual trajectories

| | Actual No | Actual Yes |
|---|---|---|
| **Predicted No** | 254 | 8 |
| **Predicted Yes** | 14 | 99 |

- Prediction Accuracy: 94.1%
- Cohen's Kappa: 0.85

---

### PCA #1 AND #2: 12 Feature Model

6 PCs per item based on actual trajectories, 6 PCs based on differences

| | Actual No | Actual Yes |
|---|---|---|
| **Predicted No** | 258 | 7 |
| **Predicted Yes** | 10 | 100 |

- Prediction Accuracy: 95.4%
- Cohen's Kappa: 0.889

---

## Let's return to our questions

---

### Are there measurable, reproducible patterns in these data?

- The SVM finds the same patterns as the annotator, with good accuracy, using only curvature measurements
- Pause postures are readily identifiable using PC2 and PC3 from PCA #1 (Trajectory), and PC1 from PCA #2 (Difference)
- Using those three features **alone** offers 92% accuracy, with a kappa of 0.832

---

### Are there measurable, reproducible patterns associated with pause postures in these data?

- The SVM finds the same patterns as the annotator, with good accuracy, using only curvature measurements
- Pause postures are readily identifiable using PC2 and PC3 from PCA #1 (Trajectory), and PC1 from PCA #2 (Difference)
- Using those three features **alone** offers 92% accuracy, with a kappa of 0.832
- ## Yes!

---

### Can we empirically capture "degree" of pause posture?

- We had "Yes", "Maybe", "Unlikely", "No" judgements too
- The SVM only saw "Yes" and "No"
- The SVM returns a "probability" judgement for each item
- Let's see how that aligns with our human's judgements

---

### Can we empirically capture "degree" of pause posture?

- SVM probability judgements reflect human judgements nicely
- So, the gradience is now directly measurable
- ## Yes!

---

### Can we identify pause postures without human intervention?

- Our best-performing model finds PPs in novel data with 95.4% accuracy
    - Out of 375 unknown items, it made 17 mistakes
- It's slightly more prone to false positives
    - This could be viewed as a feature or a bug
- This is not ideal, but respectable
    - ... and human-to-machine agreement (0.889) is an acceptable IAA in the field
- So, depending on your needs...
- ## Maybe!

---

### So, Pause Postures are empirically findable in the data

- We can now characterize their nature more clearly
    - ... and we can measure their degree reliably
- Our major methodological concern is cleared up
- **So we can move on to studying their distribution in English in the future**

---

## Part 2: Conclusion

- Machine learning was able to very effectively model annotator judgements
- These sorts of curve-based phenomena can be effectively captured with functional PCA
- SVM probability judgements can address the gradience of the task
- Automatic annotation of pause postures is imperfect, but promising

---

# Final Conclusions

---

### Machine learning is a valuable tool in Linguistic research

- Great for identifying subtle differences among classes or items
    - Detecting multi-feature differences in vowel nasality, or differences in curvature which seem like pause postures
- Great for working with curves and nuances, where conventional statistical models often struggle
    - Using fPCA with machine learning helped identify a very nuanced pattern of curvature
- Great for evaluating feature meaningfulness, particularly with large sets of opaque features
    - Narrowing the field of features in both cases
- Great for roughly simulating human decisions, perception, or annotation
    - Predicting human listeners' perceptions (in Part 1)
    - Predicting human annotators' judgements (in Part 2)

---

### There is much left to do with Machine Learning and Speech

- Using Machine Learning to explore the relationship between articulation, acoustics, and speaker variation in nasality
- Simulating the time course of perception based on information accessibility
- Studying articulation and acoustics with regard to a feature's functional and informational load

---

## Any hypothesis about human language needs to be tested with human speakers

- ... but sometimes, it's a good idea to trust the machines!

---

### (Just be careful)

---

## Acknowledgements

- My collaborators in Part 2, Jelena Krivokapic, Ben Parrell, and Jiseung Kim
- University of Michigan Phonetics Laboratory (and Phondi group)
- NSF Grant BCS-1348150 to Patrice Speeter Beddor and Andries W. Coetzee
- NSF Grant 1551513 to Goldstein, Katsika, Krivokapic, Nam, Saltzman
- NIH Grant DC003172 to Dani Byrd, and DC002717 to Doug Whalen
- The University of Colorado at Boulder and Rebecca Scarborough
- The speakers and listeners who participated in the studies
- The great many electrons inconvenienced in the process of building these SVMs

---

### References

[Gick et al., 2005] Gick, B., Wilson, I., Koch, K., and Cook, C. (2005). Language-specific articulatory settings: Evidence from inter-utterance rest position. Phonetica, 61(4):220–233.

[Hawkins and Stevens, 1985] Hawkins, S. and Stevens, K. N. (1985). Acoustic and perceptual correlates of the non-nasal–nasal distinction for vowels. The Journal of the Acoustical Society of America, 77(4):1560–1575.

[Katsika et al., 2014] Katsika, A., Krivokapić, J., Mooshammer, C., Tiede, M., and Goldstein, L. (2014). The coordination of boundary tones and its interaction with prominence. Journal of Phonetics, 44:62–82.

[Ramanarayanan et al., 2013] Ramanarayanan, V., Goldstein, L., Byrd, D., and Narayanan, S. S. (2013). An investigation of articulatory setting using real-time magnetic resonance imaging. The Journal of the Acoustical Society of America, 134(1):510–519.

[Shaw and Kawahara, 2017] Shaw, J. and Kawahara, S. (2017). Modelling articulatory dynamics in the frequency domain. Poster presented at the Workshop on Dynamic Modeling in Phonetics and Phonology 2017, hosted by the Chicago Linguistic Society.

[Styler, 2015] Styler, W. (2015). On the Acoustical and Perceptual Features of Vowel Nasality. PhD thesis, University of Colorado at Boulder.

[Wilson and Gick, 2014] Wilson, I. and Gick, B. (2014). Bilinguals use language-specific articulatory settings. Journal of Speech, Language, and Hearing Research, 57(2):361–373.

---

Title: Ask an Algorithm: Using Machine Learning to study Human Speech

Abstract: Machine learning, the use of nuanced computer models to analyze and predict data, has a long history in speech recognition and natural language processing, but has largely been limited to more applied, engineering tasks. This talk will describe two more research-focused applications of machine learning in the study of speech perception and production. For speech perception, we'll examine the difficult problem of identifying acoustic cues to a complex phonetic contrast, in this case, vowel nasality. Here, by training machine learning algorithms on acoustic measurements, we can more directly measure the informativeness of the various acoustic features to the contrast. This by-feature informativeness data was then used to create hypotheses about human cue usage, and then, to model the observed human patterns of perception, showing that these models were able to predict not only the utilized cue, but the subtle patterns of perception arising from less informative changes. For speech production, we'll focus on data from Electromagnetic Articulography (EMA), which provides position data for the articulators with high temporal and spatial resolution, and discuss our ongoing efforts to identify and characterize pause postures (specific vocal tract configurations at prosodic boundaries, cf. Katsika et al. 2014) in the speech of 7 speakers of American English. Here, the lip aperture trajectories of 800+ individual pauses were gold-standard annotated by a member of the research team, and then subjected to principal component analysis.
These analyses were then used to train a support vector machine (SVM) classifier, which achieved a 96% classification accuracy in cross-validation tests, with a Cohen's Kappa showing machine-to-annotator agreement of 0.79, suggesting the potential for improvements in speed, consistency, and objective characterization of gestures. These methods of modeling feature importance and classifying curves using machine learning both demonstrate concrete methods which are potentially useful and applicable to a variety of questions in phonetics, and potentially, in linguistics in general.