# Chalk Talk

## Will Styler

UC San Diego

---

### The linguistic signal is incredibly complex

- Natural language text is complex
- Speech is complex
- It reflects and describes a complex world
- ... yet, somehow, we’re able to consistently get relevant information from person to person using language

---

### The signal must be carrying contrast

- Some feature (or group of features) allows us to recover the difference between...
  - Multiple phonemes
  - Different temporal orderings
  - Many possible social identities
  - Two closely related meanings
- **There are many possible features which could be signalling any given contrast**

---

### My research has focused on this problem

- "I’m drowning in possible features that could signal what I’m studying."
- "**How can we figure out which ones humans are actually using?**"
- (Without requiring 15 years and a series of improbably large grants)

---

### There are many approaches to this problem

- 15 years and a series of improbably large grants (to test them all with humans)
- Educated guesswork to narrow the field
- Conventional statistical analysis
- ... and my personal favorite:
  - **Machine Learning, *then* humans!**

---

### Today's Case Study: Vowel Nasality

* Vowel nasality is the opening of the velopharyngeal port during the vowel
  * ‘Cat’ [kæt]
  * ‘Can't’ [kæ̃nt]
* What are the acoustic cues used for perceiving vowel nasality in English?

---

## The Problem
*(Figure from Styler 2017)*
---

### Feature selection is half the battle

- In a perfect world, you would test all possible features as potential cues
- Some *a priori* choices must be made to reduce the feature space
- ... but every time you exclude a feature, you run the risk of excluding *the* feature

---

### 29 Potential Cues for evaluation

* All spectral or temporal features in the signal
* Some absolute, some relative
* Features like...
  * Formant Frequencies and Bandwidths
  * Spectral Relationships (like A1-P0 or A3-P0)
  * Nasal Peaks and Zeroes
  * Spectral Tilt
  * Vowel Duration
  * ... and more!

---

### The *other* problem

---

### Humans are troublesome

* Human responses aren't very transparent
  * Subtle changes often produce subtle differences in response
* Observations are not independent
* Participants have different language and knowledge backgrounds
* Participants have limited endurance
  * Especially with boring tasks
* Running human experiments is... non-trivial
* ... and really, really expensive

---

- So, instead of asking humans to evaluate all 29, let's use...

---

## Machine Speech Perception

---

### The Basic Idea

Human speech perception is just classifying sounds based on acoustical features

* **Computers can do that too!**
* Give the acoustic feature information to a classifier and ask for oral vs. nasal judgements
* Greater accuracy means a feature or grouping is more useful and informative!
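---

As a toy sketch of that idea (assuming a hypothetical `measurements.csv` of per-token acoustic measurements; the column names are illustrative, not the actual analysis code):

```python
# Minimal sketch: classify vowel tokens as "oral" vs. "nasal" from
# acoustic features. measurements.csv and its columns are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("measurements.csv")            # one row per vowel token
X = df[["a1_p0", "f1_bandwidth", "duration"]]   # illustrative cue columns
y = df["nasality"]                              # "oral" or "nasal" labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

# Held-out accuracy: higher means the features carry more useful information
print("Accuracy:", clf.score(X_test, y_test))
```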
---

### Computers are not humans

---

### They have some serious advantages for studying language!

* Their decisions are easier to quantify than humans'
* They'll (often) tell you *how* they made the decision they did
* They have no knowledge that you don't give to them
* They make all decisions independently
* They don't require payment or scheduling
* They're available 24/7

---

### Supervised Machine Classification 101

* Select a large corpus of data, and manually assign each observation to a group
* **Training:** Feed this labeled data into an algorithm so it can learn the patterns
* **Testing:** Give the trained algorithm new data without labels, and check the accuracy of its classifications
* Better accuracy often indicates more useful information was given to the classifier!

---

### Machine Classification is *everywhere*

---

### My approach

* 1) Collect a corpus of oral and nasal words, and measure each feature
* 2) Give each feature to a machine learning algorithm individually
  * The most informative features should be the most accurate
* 3) Find the best group of features
  * Find the balance between "few features" and "good accuracy"
* 4) Test *those* features with expensive and difficult humans

---

### Labeling and Training

* Recorded 12 English speakers making words with oral and nasal(ized) vowels
  * "Oral" vowels were in CVC contexts, and "Nasal" were in CVN/NVC/NVN contexts
  * This resulted in 3823 words
* Then, I measured each of the 29 features at two timepoints per vowel
  * All measurement was done automatically by Praat script
* Then I handed them to a Support Vector Machine as training data

---

### Support Vector Machines

* A very common, very accurate machine learning algorithm
* Look at all the data in a multi-dimensional space
  * As many dimensions as features
* Try to find a line or hyperplane that optimally separates the classes
* Classification is just seeing where the new data is relative to that line

---

(There are other algorithms that can work well, too!)

---

## **So how does it perform with nasality?**

---

### Single-Feature testing

* Are any features good enough *on their own* to allow nasal perception?
* 29 separate models (one per feature) classifying datapoints as "oral" or "nasal"
* Each model outputs accuracy figures, which we can compare! (A sketch of this loop follows)
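---

A sketch of that single-feature loop, under the same hypothetical `measurements.csv` assumption (cross-validation stands in for a single train/test split):

```python
# One SVM per feature: how well does each cue classify "oral" vs. "nasal"
# on its own? Column names are hypothetical stand-ins for the 29 cues.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("measurements.csv")
cues = [c for c in df.columns if c != "nasality"]   # the candidate features

accuracy = {}
for cue in cues:
    model = make_pipeline(StandardScaler(), SVC())
    # A one-column feature matrix isolates this cue's informativeness
    accuracy[cue] = cross_val_score(model, df[[cue]], df["nasality"], cv=5).mean()

# Rank the cues from most to least informative
for cue, acc in sorted(accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{cue}: {acc:.1%}")
```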
---

### Single-feature findings

* F1's bandwidth is the most useful and informative feature
  * 67.6% SVM accuracy
* A1-P0, a measure of relative spectral prominence, takes second place
  * 64.7% SVM accuracy
* The worst feature performed at 51.23% accuracy
* *None of the features are good enough on their own!*

---

## What *group of features* provides the best information?

---

### Multi-feature modeling

* Tested 10 *a priori* feature groupings
  * Selected from various outputs of the machine learning and statistics
* Compared the accuracy *in light of the number of features*
* The winning model gets the best performance from the fewest features

---

### Multi-feature Results

* SVMs with all features worked best (29 features)
  * 84.7% accuracy
* Formant Frequency and Bandwidth, Spectral Tilt, A1-P0, and Vowel Duration formed the best subgroup (5 features)
  * 82.2% accuracy
* **We only lose 2.5% accuracy when we reduce our feature set by 69%!**

---

### Overall Machine Learning Results

* **Formant Bandwidth** was the most useful single feature for English (62.5% accuracy)
* ... and we've got a multi-feature grouping with very good accuracy (82.2% accuracy)!
  * Formant Bandwidth, Formant Frequency, Spectral Tilt, A1-P0, and Duration
* **So, let's test those five features with actual humans!**

---

# Human Perception

---

### Methods

* English listeners can use vowel nasality to identify missing nasal consonants
  * ba_ could be "bad" or "ban"
* **Let's add or remove features from vowels to see what indicates "nasality"!**
* If adding or removing a feature changes perception, or makes them react more slowly, it's important!

---

### The Modifications

Use signal processing to simulate the oral-to-nasal change (or vice versa) in...

* 1) A1-P0
* 2) Duration
* 3) Spectral Tilt
* 4) Formant Bandwidth and Frequency
  * Combined
* 5) Modify *all five features at once!*

(A toy example of one such manipulation appears after the next slide)

---

### The Experiment

* Recruited 42 normal-hearing native English speakers from a department subject pool
* Each listened to 400 words with different modifications
* Analyzed both confusion and reaction time associated with stimulus changes
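---

For flavor, a toy example of one such manipulation (duration), using the praat-parselmouth library; the filename, pitch range, and lengthening factor are illustrative, and this is not the actual stimulus-construction pipeline:

```python
# Toy duration manipulation: PSOLA-style lengthening of a recorded token,
# leaving pitch intact. All values here are illustrative.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("bad.wav")  # hypothetical oral-vowel recording

# Praat's "Lengthen (overlap-add)": args are min pitch (Hz), max pitch (Hz),
# and the duration factor (1.15 = 15% longer, a more nasal-like duration)
longer = call(snd, "Lengthen (overlap-add)", 75, 600, 1.15)
longer.save("bad_lengthened.wav", "WAV")
```

---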
bad

ban

---

bomb

bob
---

### Human Perception Summary

* Only **formant modification** had a significant effect on perception
  * Formant modification caused listeners to respond more slowly
  * Formant modification made oral vowels sound "nasal"
* F1's bandwidth is probably the cue
  * This makes sense acoustically, and Hawkins and Stevens (1985) also point in that direction

---

### Score one for the Machine!

- The machine learning models predicted F1's bandwidth as the most useful feature...
- ... and the humans agreed!

---

### How similar *are* the SVMs and the humans?

---

*Let's give the computer the same experimental task as the humans, using the same altered stimuli, and see how they compare!*

---

### Testing Humans vs. Machines

* 1) Train an SVM on all of the English data
* 2) Extract acoustic features from the stimuli used in the experiment
* 3) Test those SVMs using the experimental stimulus data
  * Again classifying "oral" or "nasal"
* 4) Compare the by-condition confusion results to the humans (a sketch of this pipeline follows)

---

### Confusion by Condition
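---

A sketch of that comparison pipeline, again assuming hypothetical CSVs (`measurements.csv` for the training corpus, `stimulus_features.csv` for features measured from the experimental stimuli, with an illustrative `condition` column):

```python
# Train on the full corpus, then classify the experiment's modified stimuli
# and tabulate the SVM's "oral"/"nasal" answers by stimulus condition,
# for comparison with the human listeners' confusion patterns.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

train = pd.read_csv("measurements.csv")         # full English training data
stimuli = pd.read_csv("stimulus_features.csv")  # features from the stimuli
cues = [c for c in train.columns if c != "nasality"]

svm = make_pipeline(StandardScaler(), SVC())
svm.fit(train[cues], train["nasality"])

stimuli["svm_answer"] = svm.predict(stimuli[cues])

# Proportion of "oral" vs. "nasal" answers within each modification condition
print(stimuli.groupby("condition")["svm_answer"].value_counts(normalize=True))
```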
---

### Humans vs. Machines Summary

* Humans and machines *did* show similar patterns
  * Modifications that were difficult for humans were difficult for SVMs
* Humans are still more accurate overall

---

### The SVMs didn't model the humans exactly!

* SVMs predicted gradient usefulness of the features
  * Humans based their decisions entirely on F1's bandwidth
* SVMs showed greater accuracy when all features were available
  * Humans weren't meaningfully affected by the additional three features
* So, SVMs can show the relative informativeness of features
  * ... but they can't show what humans actually do use

---

## Conclusion

---

### What did using machine learning win us?

* The SVM studies very effectively narrowed the field
* The SVM studies and the humans both agreed on the best feature
* Trained SVMs were able to perform the same experiment, with similar results
* **Modeling human language using machine learning is helpful!**

---

### Machine learning is widely applicable

- Computational Linguistics loves it
- Modeling gestural data to identify discrete gestures rather than interpolation
- Modeling the time course of speech perception with machine learning
- Neural networks for finding tongue shapes in ultrasound data (Jian Zhu, cf. ASA 2018)
- Classifying athletes as 'white' or 'black' based on media portrayal (Kelly Wright)
- ... and there are always more rich signals

---

## Any hypothesis about human language needs to be tested with human speakers

- ... but sometimes, it's a good idea to trust the machines!

---

### (Just be careful)
---

# Let's talk!

---

## Acknowledgements

* The speakers and listeners who participated in the study
* The great many electrons inconvenienced in the process of building these SVMs
* The University of Colorado at Boulder and Dr. Rebecca Scarborough
* The University of Michigan for the support and training, and the Michigan Phondi Group

---

### References

* Chen, M. Y. (1997). Acoustic correlates of English and French nasalized vowels. The Journal of the Acoustical Society of America, 102(4):2350–2370.
* Hawkins, S. and Stevens, K. N. (1985). Acoustic and perceptual correlates of the non-nasal–nasal distinction for vowels. The Journal of the Acoustical Society of America, 77(4):1560–1575.
* Styler, W. (2015). On the Acoustical and Perceptual Features of Vowel Nasality. PhD thesis, University of Colorado at Boulder.

---

# Additional Information

---

### Feature List
---

### Single-Feature Models

---

### Multi-feature Machine Learning Results

---

### Feature Importance

---

### Feature Addition (oral-made-nasal) Findings
* *Modifying formants (or all features together) resulted in more confusion!*
  * People called oral vowels "nasal" more often with modified formants
  * The All-Modified stimuli showed a statistically similar pattern

---
* *Modifying formants (or all features together) resulted in slower reaction times!*
  * People were slower to call vowels "oral" or "nasal" with modified formants

---

## Feature Reduction (Nasal-made-Oral) Findings

---
* Confusion wasn't affected by modification for nasal-to-oral stimuli!
  * **We never changed "nasal" to "oral" by modifying features**

---
* *Modifying formants (or all features) resulted in slower reaction times!*
  * People were slower to call vowels "oral" or "nasal" with modified formants

---