# Chalk Talk

## Will Styler

UC San Diego

---

### The linguistic signal is incredibly complex

- Natural language text is complex
- Speech is complex
- It reflects and describes a complex world
- ... yet, somehow, we’re able to consistently get relevant information from person to person using language

---

### The signal must be carrying contrast

- Some feature (or group of features) allows us to recover the difference between...
  - Multiple phonemes
  - Different temporal orderings
  - Many possible social identities
  - Two closely related meanings
- **There are many possible features which could be signalling any given contrast**

---

### My research has focused on this problem

- "I’m drowning in possible features that could signal what I’m studying."
- "**How can we figure out which ones humans are actually using?**"
- (Without requiring 15 years and a series of improbably large grants)

---

### There are many approaches to this problem

- 15 years and a series of improbably large grants (to test them all with humans)
- Educated guesswork to narrow the field
- Conventional statistical analysis
- ... and my personal favorite:
  - **Machine Learning, *then* humans!**

---

### Today's Case Study: Vowel Nasality

* Vowel nasality is the opening of the velopharyngeal port during the vowel
  * ‘Cat’ [kæt]
  * ‘Can't’ [kæ̃nt]
* What are the acoustic cues used for perceiving vowel nasality in English?

---

## The Problem
*(Figure from Styler 2017)*
---

### Feature selection is half the battle

- In a perfect world, you would test all possible features as potential cues
- Some *a priori* choices must be made to reduce the feature space
- ... but every time you exclude a feature, you run the risk of excluding *the* feature

---

### 29 Potential Cues for evaluation

* All spectral or temporal features in the signal
* Some absolute, some relative
* Features like...
  * Formant Frequencies and Bandwidths
  * Spectral Relationships (like A1-P0 or A3-P0)
  * Nasal Peaks and Zeroes
  * Spectral Tilt
  * Vowel Duration
  * ... and more!

---

### The *other* problem

---

### Humans are troublesome

* Human responses aren't very transparent
  * Subtle changes often produce subtle differences in response
* Observations are not independent
* Participants have different language and knowledge backgrounds
* Participants have limited endurance
  * Especially with boring tasks
* Running human experiments is... non-trivial
* ... and really, really expensive

---

- So, instead of asking humans to evaluate all 29, let's use...

---

## Machine Speech Perception

---

### The Basic Idea

Human speech perception is just classifying sounds based on acoustical features

* **Computers can do that too!**
* Give the acoustic feature information to a classifier and ask for oral vs. nasal judgements
* Greater accuracy means a feature or grouping is more useful and informative!
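---

As a toy sketch of that idea (assuming a hypothetical `measurements.csv` of per-token acoustic measurements; the column names are illustrative, not the actual analysis code):

```python
# Minimal sketch: classify vowel tokens as "oral" vs. "nasal" from
# acoustic features. measurements.csv and its columns are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("measurements.csv")            # one row per vowel token
X = df[["a1_p0", "f1_bandwidth", "duration"]]   # illustrative cue columns
y = df["nasality"]                              # "oral" or "nasal" labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

# Held-out accuracy: higher means the features carry more useful information
print("Accuracy:", clf.score(X_test, y_test))
```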
---

### Computers are not humans

---

### They have some serious advantages for studying language!

* Their decisions are easier to quantify than humans'
* They'll (often) tell you *how* they made the decision they did
* They have no knowledge that you don't give to them
* They make all decisions independently
* They don't require payment or scheduling
* They're available 24/7

---

### Supervised Machine Classification 101

* Select a large corpus of data, and manually assign each observation to a group
* **Training:** Feed this labeled data into an algorithm so it can learn the patterns
* **Testing:** Give the trained algorithm new data without labels, and check the accuracy of its classifications
* Better accuracy often indicates more useful information was given to the classifier!

---

### Machine Classification is *everywhere*

---

### My approach

* 1) Collect a corpus of oral and nasal words, and measure each feature
* 2) Give each feature to a machine learning algorithm individually
  * The most informative features should be the most accurate
* 3) Find the best group of features
  * Find the balance between "few features" and "good accuracy"
* 4) Test *those* features with expensive and difficult humans

---

### Labeling and Training

* Recorded 12 English speakers making words with oral and nasal(ized) vowels
  * "Oral" vowels were in CVC contexts, and "Nasal" were in CVN/NVC/NVN contexts
  * This resulted in 3823 words
* Then, I measured each of the 29 features at two timepoints per vowel
  * All measurement was done automatically by Praat script
* Then I handed them to a Support Vector Machine as training data

---

### Support Vector Machines

* A very common, very accurate machine learning algorithm
* Look at all the data in a multi-dimensional space
  * As many dimensions as features
* Try to find a line or hyperplane that optimally separates the classes
* Classification is just seeing where the new data is relative to that line

---

(There are other algorithms that can work well, too!)

---

## **So how does it perform with nasality?**

---

### Single-Feature testing

* Are any features good enough *on their own* to allow nasal perception?
* 29 separate models (one per feature) classifying datapoints as "oral" or "nasal"
* Each model outputs accuracy figures, which we can compare! (A sketch of this loop follows)
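---

A sketch of that single-feature loop, under the same hypothetical `measurements.csv` assumption (cross-validation stands in for a single train/test split):

```python
# One SVM per feature: how well does each cue classify "oral" vs. "nasal"
# on its own? Column names are hypothetical stand-ins for the 29 cues.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("measurements.csv")
cues = [c for c in df.columns if c != "nasality"]   # the candidate features

accuracy = {}
for cue in cues:
    model = make_pipeline(StandardScaler(), SVC())
    # A one-column feature matrix isolates this cue's informativeness
    accuracy[cue] = cross_val_score(model, df[[cue]], df["nasality"], cv=5).mean()

# Rank the cues from most to least informative
for cue, acc in sorted(accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{cue}: {acc:.1%}")
```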
---

### Single-feature findings

* F1's bandwidth is the most useful and informative feature
  * 67.6% SVM accuracy
* A1-P0, a measure of relative spectral prominence, takes second place
  * 64.7% SVM accuracy
* The worst feature performed at 51.23% accuracy
* *None of the features are good enough on their own!*

---

## What *group of features* provides the best information?

---

### Multi-feature modeling

* Tested 10 *a priori* feature groupings
  * Selected from various outputs of the machine learning and statistics
* Compared the accuracy *in light of the number of features*
* The winning model gets the best performance from the fewest features

---

### Multi-feature Results

* SVMs with all features worked best (29 features)
  * 84.7% accuracy
* Formant Frequency and Bandwidth, Spectral Tilt, A1-P0, and Vowel Duration formed the best subgroup (5 features)
  * 82.2% accuracy
* **We only lose 2.5% accuracy when we reduce our feature set by 69%!**

---

### Overall Machine Learning Results

* **Formant Bandwidth** was the most useful single feature for English (62.5% accuracy)
* ... and we've got a multi-feature grouping with very good accuracy (82.2% accuracy)!
  * Formant Bandwidth, Formant Frequency, Spectral Tilt, A1-P0, and Duration
* **So, let's test those five features with actual humans!**

---

# Human Perception

---

### Methods

* English listeners can use vowel nasality to identify missing nasal consonants
  * ba_ could be "bad" or "ban"
* **Let's add or remove features from vowels to see what indicates "nasality"!**
* If adding or removing a feature changes perception, or makes them react more slowly, it's important!

---

### The Modifications

Use signal processing to simulate the oral-to-nasal change (or vice versa) in...

* 1) A1-P0
* 2) Duration
* 3) Spectral Tilt
* 4) Formant Bandwidth and Frequency
  * Combined
* 5) Modify *all five features at once!*

(A toy example of one such manipulation appears after the next slide)

---

### The Experiment

* Recruited 42 normal-hearing native English speakers from a department subject pool
* Each listened to 400 words with different modifications
* Analyzed both confusion and reaction time associated with stimulus changes
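---

For flavor, a toy example of one such manipulation (duration), using the praat-parselmouth library; the filename, pitch range, and lengthening factor are illustrative, and this is not the actual stimulus-construction pipeline:

```python
# Toy duration manipulation: PSOLA-style lengthening of a recorded token,
# leaving pitch intact. All values here are illustrative.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("bad.wav")  # hypothetical oral-vowel recording

# Praat's "Lengthen (overlap-add)": args are min pitch (Hz), max pitch (Hz),
# and the duration factor (1.15 = 15% longer, a more nasal-like duration)
longer = call(snd, "Lengthen (overlap-add)", 75, 600, 1.15)
longer.save("bad_lengthened.wav", "WAV")
```

---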
bad

ban

---

bomb

bob
---

### Human Perception Summary

* Only **formant modification** had a significant effect on perception
  * Formant modification caused listeners to respond more slowly
  * Formant modification made oral vowels sound "nasal"
* F1's bandwidth is probably the cue
  * This makes sense acoustically, and Hawkins and Stevens (1985) also point in that direction

---

### Score one for the Machine!

- The machine learning models predicted F1's bandwidth as the most useful feature...
- ... and the humans agreed!

---

### How similar *are* the SVMs and the humans?

---

*Let's give the computer the same experimental task as the humans, using the same altered stimuli, and see how they compare!*

---

### Testing Humans vs. Machines

* 1) Train an SVM on all of the English data
* 2) Extract acoustic features from the stimuli used in the experiment
* 3) Test those SVMs using the experimental stimulus data
  * Again classifying "oral" or "nasal"
* 4) Compare the by-condition confusion results to the humans (a sketch of this pipeline follows)

---

### Confusion by Condition
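---

A sketch of that comparison pipeline, again assuming hypothetical CSVs (`measurements.csv` for the training corpus, `stimulus_features.csv` for features measured from the experimental stimuli, with an illustrative `condition` column):

```python
# Train on the full corpus, then classify the experiment's modified stimuli
# and tabulate the SVM's "oral"/"nasal" answers by stimulus condition,
# for comparison with the human listeners' confusion patterns.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

train = pd.read_csv("measurements.csv")         # full English training data
stimuli = pd.read_csv("stimulus_features.csv")  # features from the stimuli
cues = [c for c in train.columns if c != "nasality"]

svm = make_pipeline(StandardScaler(), SVC())
svm.fit(train[cues], train["nasality"])

stimuli["svm_answer"] = svm.predict(stimuli[cues])

# Proportion of "oral" vs. "nasal" answers within each modification condition
print(stimuli.groupby("condition")["svm_answer"].value_counts(normalize=True))
```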
---

### Humans vs. Machines Summary

* Humans and machines *did* show similar patterns
  * Modifications that were difficult for humans were difficult for SVMs
* Humans are still more accurate overall

---

### The SVMs didn't model the humans exactly!

* SVMs predicted gradient usefulness of the features
  * Humans based their decisions entirely on F1's bandwidth
* SVMs showed greater accuracy when all features were available
  * Humans weren't meaningfully affected by the additional three features
* So, SVMs can show the relative informativeness of features
  * ... but they can't show what humans actually do use

---

## Conclusion

---

### What did using machine learning win us?

* The SVM studies very effectively narrowed the field
* The SVM studies and the humans both agreed on the best feature
* Trained SVMs were able to perform the same experiment, with similar results
* **Modeling human language using machine learning is helpful!**

---

### Machine learning is widely applicable

- Computational Linguistics loves it
- Modeling gestural data to identify discrete gestures rather than interpolation
- Modeling the time course of speech perception with machine learning
- Neural networks for finding tongue shapes in ultrasound data (Jian Zhu, cf. ASA 2018)
- Classifying athletes as 'white' or 'black' based on media portrayal (Kelly Wright)
- ... and there are always more rich signals

---

## Any hypothesis about human language needs to be tested with human speakers

- ... but sometimes, it's a good idea to trust the machines!

---

### (Just be careful)
---

# Let's talk!

---

## Acknowledgements

* The speakers and listeners who participated in the study
* The great many electrons inconvenienced in the process of building these SVMs
* The University of Colorado at Boulder and Dr. Rebecca Scarborough
* The University of Michigan for the support and training, and the Michigan Phondi Group

---

### References

* Chen, M. Y. (1997). Acoustic correlates of English and French nasalized vowels. The Journal of the Acoustical Society of America, 102(4):2350–2370.
* Hawkins, S. and Stevens, K. N. (1985). Acoustic and perceptual correlates of the non-nasal–nasal distinction for vowels. The Journal of the Acoustical Society of America, 77(4):1560–1575.
* Styler, W. (2015). On the Acoustical and Perceptual Features of Vowel Nasality. PhD thesis, University of Colorado at Boulder.

---

# Additional Information

---

### Feature List
---

### Single-Feature Models

---

### Multi-feature Machine Learning Results

---

### Feature Importance

---

### Feature Addition (oral-made-nasal) Findings
* *Modifying formants (or all features together) resulted in more confusion!*
  * People called oral vowels "nasal" more often with modified formants
  * The All-Modified stimuli showed a statistically similar pattern

---
* *Modifying formants (or all features together) resulted in slower reaction times!*
  * People were slower to call vowels "oral" or "nasal" with modified formants

---

## Feature Reduction (Nasal-made-Oral) Findings

---
* Confusion wasn't affected by modification for nasal-to-oral stimuli!
  * **We never changed "nasal" to "oral" by modifying features**

---
* *Modifying formants (or all features) resulted in slower reaction times!*
  * People were slower to call vowels "oral" or "nasal" with modified formants

---