---

# Modeling human speech perception using machine learning

### Will Styler

University of Michigan, Department of Linguistics

---

### A Case Study: Vowel Nasality

* Vowel nasality is the opening of the velopharyngeal port during the vowel
‘Cat’
[kæt]
‘Can't’
[kæ̃nt]
* What are the acoustic cues used for perceiving vowel nasality in English?
* The biggest problem with this question is that...

---

## The signal is incredibly rich
*(Figure from Chen, 1997)*
---

### Feature selection is half the battle

* Some *a priori* choices must be made to reduce the feature space
* ... but every time you exclude a feature, you run the risk of excluding *the* feature
* You want to test as many features as you can without injuring your participant
* **So, I narrowed the field...**
* (kind of)

---

### 29 Potential Cues for evaluation

* All spectral or temporal features in the signal
* Some absolute, some relative
* Features like...
    * Formant Frequencies and Bandwidths
    * Spectral Relationships (like A1-P0 or A3-P0)
    * Nasal Peaks and Zeroes
    * Spectral Tilt
    * Vowel Duration
    * ... and more!

---

Testing 29 features is pretty problematic because...

---

### Humans are troublesome

* Human responses aren't very transparent
    * You just get linguistic decisions and timing
* Subtle changes often produce subtle differences in response
    * So you need a large n to effectively evaluate anything
* They have limited endurance
    * So you can't have a large n
* Observations are not independent
    * Fatigue, learning, and real-world knowledge all play a role
* Running human experiments is... non-trivial
    * ... and really, really expensive

---

### I needed a better way to test these 29 features...

* A way to obtain more detailed information about judgements
* A way to avoid bias and real-world knowledge
* A way to let each trial and feature be independent
* ... and most importantly...
* A way to do all of the above without ever leaving my apartment

---

# Machine Speech Perception

---

### The Basic Idea

Human speech perception is just classifying sounds based on acoustic features

* **Computers can do that too!**
* Give the feature information to a classifier and ask for oral vs. nasal judgements
* Greater accuracy means a feature or grouping is more useful and informative!

---

### Machines have some advantages for studying speech perception!

* Their decisions are easier to quantify
* They'll tell you *how* they made the decision they did
* They have no knowledge that you don't give to them
* They make all decisions independently
* They don't require payment or scheduling
* ... and most importantly...
* They live in my apartment!

---

### Supervised Machine Classification 101

* "Find the patterns in this training data, then use them to predict which group this new datapoint belongs to!"
* Select a large corpus of data, and assign each observation to a group
    * "Oral" vs. "Nasal", or "Good mail" vs. "Spam"
* **Training:** Feed this labeled data into an algorithm so it can learn the patterns
* **Testing:** Give the algorithm new data without labels, and check the accuracy of its classifications
* Better accuracy often indicates more useful information!
* (a toy code sketch follows in two slides)

---

### Machine Classification is used *everywhere*

* "Is this email spam, or not?"
* "Is this handwritten symbol a '1' or a '2' or a '3' or...?"
* "Is the car in front of me slamming on the brakes?"
* "Is this word a noun, or a verb, or an adjective, or...?"
* So, these algorithms are well-studied and optimized!
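---

### A toy supervised classifier

A minimal, hypothetical sketch of the train-then-test loop in Python with scikit-learn; the feature values and labels below are synthetic stand-ins, not the study's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: one row per vowel token, one column per
# acoustic measurement; labels are 0 ("oral") or 1 ("nasal")
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Training: learn the patterns from labeled data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)

# Testing: classify held-out data and check the accuracy
print(f"Accuracy: {clf.score(X_test, y_test):.1%}")
```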
---

### The Plan

* 1) Collect a corpus of oral and nasal words, and measure each feature
* 2) Give each feature to a machine learning algorithm individually
    * The most informative features should be the most accurate
* 3) Find the best group of features
    * Find the balance between "few features" and "good accuracy"
* 4) Test *those* features with expensive and difficult humans

---

### Labeling and Training

* I recorded 12 English speakers producing words with oral and nasal(ized) vowels
    * "Oral" vowels were in CVC contexts, and "Nasal" were in CVN/NVC/NVN contexts
    * This resulted in 3823 words
* Then, I measured each of the 29 features at two timepoints per vowel
    * All measurement was done automatically by Praat script
* Then I handed them to a Support Vector Machine as training data
    * (I also tried some other algorithms; see Styler (2015))

---

### Support Vector Machines

* A very common, very accurate machine learning algorithm
    * A specialized statistical model, tuned for prediction
* Look at all the data in an *n*-dimensional space
    * *n* is the number of features
* Try to find a hyperplane of some kind that provides the best separation
    * Classification is just seeing where the new data falls relative to that boundary
    * This doesn't have to be linear
    * ... and mine weren't, but those are details

---

## **So how does it perform with nasality?**

---

### Single-Feature testing

* Are any features good enough *on their own* to allow nasal perception?
* 29 separate models (one per feature) classifying datapoints as "oral" or "nasal"
* Each model outputs accuracy figures, which we can compare!
* (a sketch of this per-feature loop follows on the next slide)
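---

### Sketch: one SVM per feature

A rough illustration of the per-feature testing loop (scikit-learn again); the data and feature names here are placeholders for the real 29-cue table:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-ins for the real table: 29 measured cues per token, oral/nasal labels
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 29))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)
feature_names = [f"feature_{i}" for i in range(29)]

# One model per feature: cross-validated accuracy stands in for informativeness
scores = {}
for i, name in enumerate(feature_names):
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    scores[name] = cross_val_score(model, X[:, [i]], y, cv=10).mean()

# Report the most informative single features
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{name}: {acc:.1%}")
```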
---

### Single-feature findings

* F1's Bandwidth is the most useful and informative feature
    * 67.6% SVM accuracy
* A1-P0, a measure of relative spectral prominence, takes second place
    * 64.7% SVM accuracy
* F1's Amplitude takes third place
    * 62.5% SVM accuracy
* The worst feature performed at 51.23% accuracy
* *None of the features are good enough on their own!*

---

## What *group of features* provides the best information?

---

### Multi-feature modeling

* Tested 10 *a priori* feature groupings
    * Selected from various outputs of the machine learning and statistics
* Compared the accuracy *in light of the number of features*
* The winning model gets the best performance from the fewest features
* (a sketch of this subset comparison appears in the Additional Information)

---

### Multi-feature Results

* SVMs with all features worked best (29 features)
    * 84.7% accuracy
* Formant Frequency and Bandwidth, Spectral Tilt, A1-P0, and Vowel Duration formed the best subgroup (5 features)
    * 82.2% accuracy
* **We only lose 2.5% accuracy when we reduce our feature set by 69%!**
* That's a promising grouping!

---

### Overall Machine Learning Results

* **Formant Bandwidth** was the most useful single feature for English (67.6% accuracy)
* ... and we've got a multi-feature grouping with very good accuracy (82.2%)!
    * Formant Bandwidth, Formant Frequency, Spectral Tilt, A1-P0, and Duration
* **So, let's test those five features with actual humans!**

---

# Human Perception

---

### Methods

* English listeners can use vowel nasality to identify missing nasal consonants
    * ba_ could be "bad" or "ban"
* **Let's add or remove features from vowels to see what indicates "nasality"!**
* If adding or removing a feature changes perception, or makes listeners react more slowly, it's important!

---

### The Modifications

Use signal processing to simulate the oral-to-nasal change (or vice versa) in...

* 1) A1-P0
* 2) Duration
* 3) Spectral Tilt
* 4) Formant Bandwidth and Frequency (combined)
* 5) *All five features at once!*
* (a small resynthesis sketch follows the example stimuli)

---

### The Experiment

* Recruited 42 normal-hearing native English speakers from a department subject pool
* Each listened to 400 words with different modifications
* Analyzed both confusion and reaction time associated with stimulus changes

---
bad
ban
---
bomb
bob
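---

### Sketch: modifying a stimulus

The actual stimuli were built with custom signal processing; as a hedged illustration of just the duration manipulation, one might stretch a token with Praat's overlap-add resynthesis via the parselmouth package (the filename and stretch factor here are hypothetical):

```python
import parselmouth
from parselmouth.praat import call

# Hypothetical stimulus file; nasal(ized) vowels tend to be longer,
# so stretching a token simulates the duration cue
snd = parselmouth.Sound("ban.wav")

# Praat's "Lengthen (overlap-add)": min pitch, max pitch, duration factor
longer = call(snd, "Lengthen (overlap-add)", 75, 600, 1.25)
longer.save("ban_long.wav", "WAV")
```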
---

### Human Perception Summary

* Only **formant modification** had a significant effect on perception
    * Formant modification caused listeners to respond more slowly
    * Formant modification made oral vowels sound "nasal"
* F1's bandwidth is probably the cue
    * Hawkins and Stevens (1985) also point in that direction
* Please see Styler (2015) for more details!

---

So, the machine learning models predicted F1's bandwidth as the most useful feature...

* ... and the humans agreed!

### How similar *are* the SVMs and the humans?

---

# Humans vs. Machines
---

*Let's give the computer the same experimental task as the humans, using the same altered stimuli, and see how they compare!*

---

### Testing Humans vs. Machines

* 1) Train SVMs on the original data
    * All English data, English CVC/CVN/NVC words, and English *and* French data combined
* 2) Measure the stimuli used in the experiment
* 3) Test those SVMs using the experimental stimuli data
    * Again classifying "oral" or "nasal"
* 4) Compare the by-condition confusion results to the humans (a tallying sketch follows below)

---

### Confusion by Condition

*(Figure: confusion rates by modification condition, humans vs. SVMs)*
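A minimal sketch of the tally behind a comparison like this, assuming per-stimulus SVM labels, original classes, and modification conditions from the steps above (the example values are placeholders):

```python
from collections import Counter, defaultdict

# Placeholder data: SVM label, original class, and condition per stimulus
preds = ["nasal", "oral", "nasal", "oral"]
truth = ["oral", "oral", "nasal", "nasal"]
conds = ["formants", "duration", "formants", "duration"]

# A "confusion" is any response that differs from the unmodified class
confusions = defaultdict(Counter)
for pred, orig, cond in zip(preds, truth, conds):
    confusions[cond][pred != orig] += 1

for cond, counts in confusions.items():
    rate = counts[True] / (counts[True] + counts[False])
    print(f"{cond}: {rate:.0%} confused")
```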
---

### Humans vs. Machines Summary

* Humans and machines *did* show similar patterns
    * Modifications that were difficult for humans were difficult for SVMs
* The Generic English model showed the most similarity
    * Adding in French training data was a **bad** idea
* Humans are still more accurate overall

---

### The SVMs didn't model the humans exactly!

* SVMs predicted gradient usefulness of the features
    * Humans based their decisions entirely on F1's Bandwidth
* SVMs showed greater accuracy when all features were available
    * Humans weren't meaningfully affected by the additional three features
* So, SVMs can show the relative informativeness of features
* **... but I guess we still need humans to study human speech perception**

---

# Conclusions

---

### Modeling human speech perception using machine learning is helpful!

* The SVM studies very effectively narrowed the field
    * They eliminated the worst features quickly
* The SVM studies and the humans both agreed on the best feature
    * F1 Bandwidth won for both
* Trained SVMs can run the same experiments as humans, with similar results
    * SVMs resembled the humans in relative confusion across features
* Other algorithms can provide different information
    * RandomForests can give the relative importance of features in groups (a sketch appears in the Additional Information)

---

### ... and machine learning may work for you, too!

* Great for identifying acoustic correlates and features
* Great for identifying subtle differences between classes
* Great for testing for learnable biases in stimuli
* Great for "piloting" a perception study or testing a crazy idea
* ... and the data comes free with other statistical analysis
* Most good algorithms are available as R or Python packages!

---

So, although human speech perception must be studied in humans...

### Sometimes, it's a good idea to trust the machine!

---

### (Although be careful)
---

## Acknowledgements

* The University of Colorado at Boulder and Dr. Rebecca Scarborough
* The speakers and listeners who participated in the study
* The great many electrons inconvenienced in the process of building these SVMs
* The University of Michigan

---

### References

* Chen, M. Y. (1997). Acoustic correlates of English and French nasalized vowels. The Journal of the Acoustical Society of America, 102(4):2350–2370.
* Hawkins, S. and Stevens, K. N. (1985). Acoustic and perceptual correlates of the non-nasal–nasal distinction for vowels. The Journal of the Acoustical Society of America, 77(4):1560–1575.
* Styler, W. (2015). On the Acoustical and Perceptual Features of Vowel Nasality. PhD thesis, University of Colorado at Boulder.

---

# Additional Information

---

### Feature List

*(Table: the full 29-feature set; see Styler (2015) for details)*
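To make one cue on the list concrete, here is a rough numpy analogue of an A1-P0 measurement; the study's measurements were made by Praat script, and the f0 and F1 values here are assumed to have been estimated elsewhere:

```python
import numpy as np
from scipy.io import wavfile

def a1_p0(path, t_mid, f0, f1, win_ms=30):
    """Rough A1-P0: dB of the harmonic nearest F1, minus the dB of the
    low-frequency 'nasal peak' (strongest of the first two harmonics)."""
    rate, sig = wavfile.read(path)
    sig = sig.astype(float)
    if sig.ndim > 1:                      # mix down if stereo
        sig = sig.mean(axis=1)
    half = int(rate * win_ms / 2000)
    mid = int(t_mid * rate)
    frame = sig[mid - half:mid + half] * np.hanning(2 * half)
    spec_db = 20 * np.log10(np.abs(np.fft.rfft(frame)) + 1e-12)
    freqs = np.fft.rfftfreq(len(frame), 1 / rate)

    def harmonic_db(h):                   # dB at the bin nearest harmonic h
        return spec_db[np.argmin(np.abs(freqs - h * f0))]

    a1 = harmonic_db(max(1, round(f1 / f0)))
    p0 = max(harmonic_db(1), harmonic_db(2))
    return a1 - p0

# e.g., a1_p0("ban.wav", t_mid=0.18, f0=120, f1=650)  # hypothetical values
```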
---

### Multi-feature Machine Learning Results

*(Table: SVM accuracy for each of the 10 feature groupings)*
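A hedged sketch of how the grouping comparison can be run; the groupings, column indices, and data below are placeholders, not the actual *a priori* sets:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature table and labels (see the earlier sketches)
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 29))
y = (X[:, :5].sum(axis=1) + rng.normal(size=500) > 0).astype(int)

# Candidate groupings as column indices into the 29-column table
groupings = {
    "all 29 features": list(range(29)),
    "5-feature subset": [0, 1, 2, 3, 4],
}

for name, cols in groupings.items():
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    acc = cross_val_score(model, X[:, cols], y, cv=10).mean()
    print(f"{name} ({len(cols)} features): {acc:.1%}")
```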
---

### Feature Importance
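As noted in the conclusions, RandomForests can rank features evaluated together rather than one at a time; a minimal sketch on placeholder data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder table: in practice, the 29 measured cues and oral/nasal labels
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 29))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)
names = [f"feature_{i}" for i in range(29)]

# Importances reflect each feature's contribution with the others present,
# complementing the one-at-a-time SVM accuracies
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranked = sorted(zip(names, forest.feature_importances_), key=lambda kv: -kv[1])
for name, imp in ranked[:5]:
    print(f"{name}: {imp:.3f}")
```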
---

### Feature Addition (oral-made-nasal) Findings
* *Modifying formants (or all features together) resulted in more confusion!*
    * People called oral vowels "nasal" more often with modified formants
    * The pattern of the All-Modified stimuli was statistically similar

---
* *Modifying formants (or all features together) resulted in slower reaction times!*
    * People were slower to call vowels "oral" or "nasal" with modified formants

---

### Feature Reduction (Nasal-made-Oral) Findings

---
* Confusion wasn't affected by modification for nasal-to-oral stimuli!
* **We never changed "nasal" to "oral" by modifying features**

---
* *Modifying formants (or all features) resulted in slower reaction times!*
    * People were slower to call vowels "oral" or "nasal" with modified formants