---
### First, the elephant in the room
- I'm not actually in Cognitive Science!
---
### Cognitive Science wasn't a separate program where I grew up
- I have my BA, MA and Doctorate in Linguistics from the [University of Colorado at Boulder](https://www.colorado.edu/linguistics/)
- CU Boulder had the [Institute of Cognitive Science](https://www.colorado.edu/ics/)
- Collaborations between faculty in CS, Philosophy, Linguistics, Psych, Education, and more
- I went to their talks and was advised by an affiliate, but there was not a COGS major or Ph.D. specialization
- ... but I am bothered by COGS-flavored questions
---
### I'm a Computational Phonetician
- This means I study human speech perception and production
- ... using computational methods and models
- This involves a mix of experiments, data analysis, recordings, and instrumental measurements
- I collaborate lots with [Dr. Sarah Creel](https://quote.ucsd.edu/lasr/)
---
### I'm also Director of [Computational Social Science](https://css.ucsd.edu) at UCSD
- We have [lots of Cognitive Scientists](https://css.ucsd.edu/people/faculty.html#Cognitive-Science)!
- Including [Dr. Sean Tro]
- So, this means I'm always thinking about computers as a tool for understanding humans!
---
## What is Speech Technology?
---
### We're getting very used to speech technology
- Siri/Alexa/GoogleAssistant
- ChatGPT Voice Mode
- Speech-to-Text Keyboards
- Text-to-Speech (e.g. in Twitch streams)
---
### There are many kinds of speech technology
- Voice Activity Detection
- Automatic Noise Filtering
- Voice Compression and Encryption
- Forced Alignment and Timestamping
- Automatic Speech Recognition (ASR)
- Speech Synthesis or Text-to-Speech (TTS)
---
### These are really interesting tools!
- They allow new kinds of human-computer interactions
- They are incredible tools for accessibility
- They're great for processing large amounts of data
- ... but the most interesting part of these tools?
---
### The most interesting part about them is that they work at all!
---
### Today's Plan
- Why is speech so hard to produce?
- How do computers produce speech?
- Why is speech so hard to perceive?
- How do computers perceive speech?
---
## Speech is Hard
---
### Human Speech is incredibly difficult
- This is an incredibly intricate gestural dance in your mind and mouth
- Let's try it
---
### "A Linguistics Major goes very well with Cognitive Science"
- First, focus on your jaw
- Now, on your tongue
- Now, feel the vibes
---
### Ultrasound of Speech
---
### Ultrasound of Speech
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
From the University of Michigan Phonetics Lab
---
### Many paths to the same sound
---
### Many paths to the same sound
---
### Speech is *hard*
- Speech is flapping bits of meat around in your head and throat while you expel air.
- This creates tiny vibrations in the air, carrying your message to listeners
---
### So, what do these vibrations look like?
---
---
---
### There is incredible complexity in this process
- Fluid movement of your mouth and tongue
- Careful planning of air and breathing
- Control of pitch, gestures, and other aspects
---
### ... and we want to do *this* with software?!?
- 'Speech Synthesis' or 'Text-to-Speech' (TTS)
- *How do we do that?*
---
### The Task
- "A linguistics major goes very well with Cognitive Science"
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Text Analysis
- "OK, the human gave me text, what do they actually want me to say?"
- This part is usually done in Python
- This is actually hard
- '1997' could be "nineteen ninety-seven" (a year), "one thousand nine hundred ninety-seven" (a quantity), or "one nine nine seven" (a PIN); see the sketch below
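A minimal sketch of the kind of disambiguation the front end has to do (the rules and function name here are invented for illustration, not taken from any real TTS system):

```python
# Toy TTS text normalization: the same digit string has to be verbalized
# differently depending on what it "is". Rules and names are invented
# purely for illustration.

def verbalize_1997(context: str) -> str:
    """Pick a spoken form for '1997' based on a (hand-waved) context label."""
    forms = {
        "year":     "nineteen ninety seven",
        "quantity": "one thousand nine hundred ninety seven",
        "digits":   "one nine nine seven",   # e.g. a PIN or serial number
    }
    return forms[context]

print(verbalize_1997("year"))      # "in 1997"       -> nineteen ninety seven
print(verbalize_1997("quantity"))  # "1997 widgets"  -> one thousand ...
print(verbalize_1997("digits"))    # "PIN code 1997" -> one nine nine seven
```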
---
### "PG&E will file schedules on April 20."
- (Thanks to Julia Hirschberg for this annotated chunk)
---
### Then, we turn that into audio
- "I know what needs to be said, now, give me a wave I can play back for the humans"
---
### For a long time, we cheated using humans!
- **Concatenative or 'Unit Selection' TTS** chops up bits and pieces of existing speech to create new speech
- You record a huge database of speech from a voice actor, with optimum 'coverage'
- You update as new words emerge (e.g. COVID, rawdogging, skibidi)
- **You then combine these words into sentences to match the text**
- ... and you use fancy algorithms to smooth the results out.
---
### This isn't easy
- You have to choose the best recorded token
- You might have 500 recordings of 'went'
- Context matters a lot
- "park" can sound very different next to different words and in different positions (see the cost sketch below)
- You can't get full coverage
- "Ruaridh", "Krivokapic", "simp", "La Jolla"
---
### The result can be imperfect
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### ... and you've only got one voice
- Which means that each new person needs a new collection of data
---
### Then, Artificial Neural Networks arrived, and everything changed
---
### The World's Worst Introduction to Neural Networks
---
### Take COGS 181 to actually understand this!
---
### Neural TTS is quite powerful
- Train a neural network with text and corresponding audio
- Make it output something which can be made into a wave very readily
- Either make the wave directly, or make an intermediate representation which can be turned into a wave
---
### [Tacotron 2](https://arxiv.org/pdf/1712.05884) is a relatively simple, open system
- It takes text, and generates spectrograms, chunk-by-chunk, which can be turned into a waveform
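If you want to poke at this yourself, torchaudio ships pretrained Tacotron 2 pipelines. A rough sketch of what using one looks like (bundle names and exact signatures shift between torchaudio versions, so treat this as an outline and check the current docs):

```python
# Sketch: text -> mel spectrogram (Tacotron 2) -> waveform (vocoder), using a
# pretrained torchaudio pipeline bundle. Names/signatures may vary by version.
import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()   # text -> token IDs
tacotron2 = bundle.get_tacotron2()        # tokens -> mel spectrogram
vocoder = bundle.get_vocoder()            # mel spectrogram -> waveform

text = "A linguistics major goes very well with Cognitive Science."
with torch.inference_mode():
    tokens, lengths = processor(text)
    spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)
    waveform, _ = vocoder(spec, spec_lengths)

torchaudio.save("tts_output.wav", waveform.cpu(), vocoder.sample_rate)
```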
---
### Tacotron 2
---
### This allows us to go from text to speech!
- We feed in text, and we get back a wave, with no humans involved past making training data!
- The results are getting very, very good.
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### The State of the Art is Advanced, but closed
- Current state of the art models from ElevenLabs, OpenAI, Google, and Amazon are all closed and proprietary
- If you want the best TTS in the world, it has to happen on somebody else's computer
- Details are often not published and considered "trade secrets"
- They may well be built on open-source models with changes and tweaks
- It's not currently possible to teach the state of the art in TTS!
- ... and this should disturb us as a society
---
### Neural TTS can be trained using *any* voice
- You can build a model from the ground up using any voice you'd like
- [Except Scarlett Johansson](https://www.npr.org/2024/05/20/1252495087/openai-pulls-ai-voice-that-was-compared-to-scarlett-johansson-in-the-movie-her)
- If all your training data are from a bored Bostonian, you'll end up with a bored Bostonian TTS voice
- This is very expensive, though, and doesn't scale well at all
- You also need *lots* of data from the new speaker
---
> All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### ... but isn't linguistic information separate from talker information?
- Why can't we just adapt to new voices?
---
### We can think of all speech as having 'content' and 'style'
- Voices express linguistic 'content'
- Phonemes, their ordering, and whatever tone/prosody is necessary for comprehension
- This is 'all we need' to understand the utterances
- 'Style' is everything else we've been talking about
- 'Speaker' identity
- Social components
- Emotional content
- Plus prosodic factors (e.g. speed, emphasis, prosodic 'tunes', sarcasm)
---
### Couldn't we just abstract out the 'style' component and apply it to whatever Linguistic content we'd like?
- Yes!
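A cartoon of the idea: the network consumes a sequence of phoneme embeddings (the 'content') plus a single speaker/style vector (the 'style'); swap the style vector and you swap the voice without touching the content. All module names and dimensions below are invented for illustration.

```python
# Cartoon of content/style conditioning in neural TTS. The decoder sees the
# phoneme sequence ("content") concatenated with a speaker/style vector
# ("style"); different style vectors yield different voices for the same text.
# Names and dimensions are invented for illustration.
import torch
import torch.nn as nn

class ToyStyleConditionedTTS(nn.Module):
    def __init__(self, n_phonemes=50, content_dim=64, style_dim=16, n_mels=80):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, content_dim)
        self.decoder = nn.GRU(content_dim + style_dim, 128, batch_first=True)
        self.to_mel = nn.Linear(128, n_mels)   # one mel-spectrogram frame per step

    def forward(self, phoneme_ids, style_vector):
        content = self.phoneme_embedding(phoneme_ids)             # (B, T, content_dim)
        style = style_vector.unsqueeze(1).expand(-1, content.size(1), -1)
        frames, _ = self.decoder(torch.cat([content, style], dim=-1))
        return self.to_mel(frames)                                # (B, T, n_mels)

model = ToyStyleConditionedTTS()
phonemes = torch.randint(0, 50, (1, 12))              # same "content"...
alice, bob = torch.randn(1, 16), torch.randn(1, 16)   # ...two different "styles"
print(model(phonemes, alice).shape, model(phonemes, bob).shape)
```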
---
### Here's a Multi-Speaker Version of Tacotron 2
---
### The results of this are... terrifying
- ... and have given rise to 'Deepfake' voices
---
### Neural Style Transfer
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Tacotron 2)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(ElevenLabs)
(Credit to Erick Amaro and Mia Khattar!)
---
### Multilingual Examples
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(English)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(French)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Spanish)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Mandarin)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Italian)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Russian)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Japanese)
---
### This system isn't perfect
> Adenocarcinoma in Tubovillious Adenoma bona fide certiorari de jure collusion RICO ex post facto CVN AWACS Escapement Tourbillion Remontoir de Egalite
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Prosody is still hard
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### ... but OMG, this thing can do arbitrary speech, in an arbitrary voice
- ... and it has never had a tongue, never had phonics training, and doesn't actually know anything at all about mouths
- Arguably, it doesn't know anything about English
- ... although some systems use a language model too
- *This is amazing!*
- ... but it can model more complexity still
---
### Code Switching
It's like sometimes mezclo un poco de español con my English, cuando me siento particularmente spicy, y tengo curiosidad to know cómo la TTS handles it.
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Wow.
- Not only can exposure to data allow a deep neural network to learn to map written language into speech in one language
- ... but it can do it for two languages
- ... at once
- ... with clear mixing of the two
---
# This shouldn't work
- Yet, here we are
- Espicy!
---
# What about Speech Perception?
---
### Speech Perception is *hard*
- Speech is flapping bits of meat around in your head and throat while you expel air.
- This creates tiny vibrations in the air
- **Speech perception is turning the resulting vibrations in the air back into language**
---
### For computers, this is 'Automatic Speech Recognition' (ASR)
- The task is to turn speech into equivalent text
- This is *really, really hard*
---
### Let's focus on one of the really hard problems
---
## Vowel Perception
---
### What is a vowel?
What kind of vowels are we talking about?
---
---
---
### Review: What is a vowel?
* A vowel is voicing passing through (and resonating in) an unobstructed vocal tract!
* If we change the position of the tongue, we change the resonances
---
---
### Review: What is a vowel?
A vowel is voicing passing through (and resonating in) an unobstructed vocal tract!
If we change the position of the tongue, we change the resonances
* Different resonances *filter* the sound differently and determine the vowel quality
* **Different tongue shapes create different resonances, and different vowels!**
---
---
### What do vowels sound like?
* We talk about vowel quality in terms of "formants"
* These are bands of the spectrum where the energy is strongest
* The frequencies of these formants are our primary cues
---
---
---
### Vowel formants
* F1 and F2 are generally considered to be the most important
* F3 is good for rounding and rhoticity
---
### Formants alone can be enough for some perception!
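(The removed audio demos below did roughly this: synthesize pure tones at formant-like frequencies and then mix them. Here's a sketch; the formant values are approximate textbook values for an /i/-like vowel, not measurements from the original demo.)

```python
# Rough reconstruction of the demo: three sine waves at formant-like
# frequencies, written out separately and then summed. Formant values are
# approximate textbook values for an /i/-like vowel.
import numpy as np
from scipy.io import wavfile

sr = 16000                          # sample rate (Hz)
t = np.arange(int(0.5 * sr)) / sr   # half a second of time samples
formants = [280, 2250, 2900]        # roughly F1, F2, F3 for /i/

tones = [np.sin(2 * np.pi * f * t) for f in formants]
for f, tone in zip(formants, tones):
    wavfile.write(f"tone_{f}Hz.wav", sr, (0.3 * tone).astype(np.float32))

# "Now let's play all three at once!"
mix = sum(tones) / len(tones)
wavfile.write("all_three_formants.wav", sr, (0.3 * mix).astype(np.float32))
```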
---
### Let's listen to some sounds
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Let's listen to some sounds
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
### Now let's play all three at once!
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Let's listen to some sounds
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
### Now let's play all three at once!
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
### Does this help?
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### So, vowels are basically formant patterns
---
Different American English vowels, as spoken by a male speaker
---
### ... and vowel formants map to articulation!
---
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
## Speaker Variation!
---
### Speaker Vowel Space Variation
* Different speakers produce different resonances, even for the “same” vowels
* Vocal tracts can be shorter, longer, wider...
---
---
### Speaker Vowel Space Variation
Different speakers produce different resonances, even for the “same” vowels
* Speakers can have colds or allergies, or more nasal voices...
* Sociolinguistic factors galore
* Every person has a different set of basic vowel formant positions
* This is called the speaker’s “vowel space”
---
---
---
### Moment-to-moment Vowel Variation
* Even the same speaker will have variation from moment to moment
* Sometimes we misarticulate, accidentally making the wrong vowel quality
* Or we talk with food in our mouths, producing different resonances
* Or sometimes, we’re just plain lazy
* This leads to constant and massive changes in vowel production
---
---
---
---
---
### Every person you've ever talked with has had different vowel formant patterns
* ... and yet, we understand each other, somehow
---
---
### There are a few ways this might work
---
### Speaker-intrinsic vowel space normalization
* Normalization is a process that “happens”
* You meet somebody, you create a model of their vowel space, and you move on
* These models of speaker vowels are maintained in memory
* One model per person, and a new model each time!
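One concrete, widely used version of "build a model of their vowel space" is Lobanov normalization: z-score each formant within a single speaker's own vowels, so different speakers' spaces land on a comparable scale. A minimal sketch (the formant values are invented):

```python
# Lobanov-style speaker-intrinsic normalization: z-score each formant within
# ONE speaker's vowels so that different speakers become comparable.
# The (F1, F2) values below are invented for illustration.
import numpy as np

def lobanov(formants):
    """formants: array of shape (n_tokens, n_formants) for one speaker."""
    return (formants - formants.mean(axis=0)) / formants.std(axis=0)

speaker_a = np.array([[300, 2300], [700, 1200], [500, 1000]], dtype=float)
speaker_b = np.array([[380, 2700], [850, 1500], [620, 1250]], dtype=float)

print(lobanov(speaker_a))
print(lobanov(speaker_b))   # different raw Hz, similar normalized layout
```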
---
### Direct Realism
- We're using our senses to form a model of reality, including inside the mouth
- We don't really care about the acoustics per se, just estimating the gestures
- "Based on everything I'm hearing, this seems like she's making the same tongue shape I hear in /i/"
- This includes lots of adjustments 'for free'
---
### Speaker-extrinsic vowel space normalization
* We store information from *every vowel we hear*!
* Normalization is then just bulk comparison and probability
* Vowel identities are probabilistically determined
* Perhaps we also segment by speaker, dialect, language, etc
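As a cartoon of the extrinsic story: pool every vowel token you've ever heard, fit a distribution per vowel category, and label a new token by whichever category makes it most probable. All numbers below are invented.

```python
# Cartoon of speaker-extrinsic normalization: keep all vowel tokens ever heard,
# fit one Gaussian per vowel category over the pooled data, and classify a new
# token by maximum likelihood. (F1, F2) values are invented for illustration.
import numpy as np
from scipy.stats import multivariate_normal

memory = {   # pooled (F1, F2) tokens from many speakers, grouped by category
    "i": np.array([[300, 2300], [320, 2500], [280, 2250], [350, 2600]], dtype=float),
    "a": np.array([[700, 1200], [750, 1300], [820, 1150], [680, 1250]], dtype=float),
    "u": np.array([[320, 800],  [350, 920],  [300, 780],  [370, 890]],  dtype=float),
}

def classify(token):
    scores = {
        vowel: multivariate_normal(toks.mean(axis=0), np.cov(toks.T)).logpdf(token)
        for vowel, toks in memory.items()
    }
    return max(scores, key=scores.get)

print(classify([310, 2400]))   # -> 'i'
print(classify([780, 1220]))   # -> 'a'
```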
---
---
### Humans are able to do this process
- ... but what about ASR?
---
## ASR
---
### ASR builds mappings from audio to text
- We feed the system lots of text, and lots of corresponding audio
- It learns the patterns of sound associated with a given text
- Some use language models to give better predictions
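From the user's side, running a modern trained ASR model is refreshingly boring. A minimal sketch using the open-source `openai-whisper` package (which we'll meet properly in a moment); the model size and the audio filename here are placeholders:

```python
# Minimal ASR sketch with the open-source `openai-whisper` package:
# load a pretrained model, hand it an audio file, get text back.
# Assumes `pip install openai-whisper`; the filename is a placeholder.
import whisper

model = whisper.load_model("base")             # small pretrained checkpoint
result = model.transcribe("lecture_clip.wav")  # audio in...
print(result["text"])                          # ...text out
```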
---
### Vintage ASR used to require explicit speaker normalization
- In the HMM days, ASR software required personalization and 'training'
- Setup began with "Read these texts aloud"
- It would then process for a little while as it 'customized' to your voice
- The model *simply wouldn't work* without this level of customization
---
### ... but when Neural Networks happened, things changed
---
### Whisper's architecture is complicated
---
### ... but it is wildly effective
- It works relatively quickly
- On relatively low-end hardware
- ... and most amazing of all...
### Whisper can get human-like performance in speech transcription*
---
### Wait, what was that asterisk?
---