---
### Introductions
- I'm an Associate Teaching Professor of Linguistics
- I'm a Computational Phonetician
- I study human speech perception and production using computers
- This involves a mix of experiments, data analysis, recordings, and instrumental measurements
- I've also done Natural Language Processing work
- I work to develop new tools for measurement and teaching
---
### I'm also Director of [Computational Social Science](https://css.ucsd.edu) at UCSD
- We're an interdisciplinary program uniting people across Social Science who work with and care about computation and data
- We believe that **data matters because people matter**
- I'm here because I'm always thinking about computers as a tool for understanding humans!
---
### But there remains an elephant in this CSSA-filled room
- I'm not actually in Cognitive Science!
---
### Cognitive Science wasn't a separate program where I grew up
- I have my BA, MA and Doctorate in Linguistics from the [University of Colorado at Boulder](https://www.colorado.edu/linguistics/)
- CU Boulder had the [Institute of Cognitive Science](https://www.colorado.edu/ics/)
- Collaborations between faculty in CS, Philosophy, Linguistics, Psych, Education, and more
- I went to their talks and was advised by an affiliate, but there was no COGS major or Ph.D. specialization
- ... but I am bothered by many COGS-flavored questions
---
### But I do have a lot of COGS ties
- I collaborate a lot with [Dr. Sarah Creel](https://quote.ucsd.edu/lasr/) in COGS
- Who is an incredible person to talk to if you're interested in speech in COGS
- CSS has [lots of Cognitive Scientists](https://css.ucsd.edu/people/faculty.html#Cognitive-Science)!
- Including [Dr. Sean Trott](https://seantrott.github.io/), one of our Core Faculty
- And [Dr. Shannon Ellis](https://shannon-ellis.com/) is on our steering committee
- ... and I know that Language folks over here are just as awesome and effective
---
### No matter where my office is, Language is the best
- Speech is even cooler than Language
- ... and all the more awesome when we put computers into the process
---
## Speech Technology
---
### Speech Technology is pervasive in the US
- Siri/Alexa/GoogleAssistant
- ChatGPT Voice Mode
- Speech-to-Text Keyboards
- Text-to-Speech (e.g. in GPS or Twitch streams)
- ... and much more!
---
### Speech technology is absolutely fascinating
- **... but the most interesting part is that it works at all!**
---
### Today's Plan
- Why is speech so hard to produce?
- How can computers produce speech?
- Why is speech so hard to perceive?
- How can computers perceive speech?
---
## Speech is Hard
---
### Human Speech is incredibly difficult
- This is an incredibly intricate gestural dance in your mind and mouth
- Let's try it
---
### "A Linguistics Major goes very well with Cognitive Science"
- First, focus on your jaw
- Now, on your tongue
- Now, feel the vibes
---
### Speech is *hard*
- Fluid movement of your mouth and tongue
- Careful planning of air and breathing
- Control of pitch, gestures, and other aspects
- All to create tiny pressure variations in the air
---
### ... and we want to do *this* with software?!?
- 'Speech Synthesis' or 'Text-to-Speech' (TTS)
- *How do we do that?*
---
### The Task
"A Linguistics major goes very well with Cognitive Science"
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Historically, the steps were simple
- Analyze what the text needs to sound like ('Text Analysis')
- Jelena saw 1985 listings in La Jolla CA for over $2 million
- /jɛlɛnə sɑ najntin hʌndɹɪd . . . /
- Now, transform that into a wave we can play back for the humans
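That first 'Text Analysis' step can be sketched in code. Here's a toy numeral expander (not any real system's implementation) that spells out a four-digit number in pairs, the way the IPA transcription above reads it:

```python
# Toy TTS text normalization: expand a four-digit numeral into words,
# read "in pairs" ("nineteen hundred eighty-five"), as in the example above.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n: int) -> str:
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def expand_numeral(n: int) -> str:
    """Spell out a four-digit number as 'NN hundred NN'."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    return two_digits(high) + " hundred " + two_digits(low)

print(expand_numeral(1985))  # nineteen hundred eighty-five
```

Real text analysis is much messier: "$2 million", "CA", and "Jelena" each need their own rules.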
---
### For a long time, we cheated using humans!
- **Concatenative or 'Unit Selection' TTS** chops up bits and pieces of existing speech to create new speech
- You record a huge database of speech from a voice actor, with optimum 'coverage'
- You update as new words emerge (e.g. COVID, rizz, skibidi)
- **You then combine these words into sentences to match the text**
- ... and you use fancy algorithms to smooth the results out.
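The simplest of those smoothing tricks is a crossfade at each seam. A minimal sketch (nothing like production unit selection, which also matches pitch and spectra):

```python
import numpy as np

# Join two recorded "units" with a linear crossfade so the seam doesn't click.

def crossfade_join(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Concatenate unit `a` and unit `b`, blending `overlap` samples."""
    fade = np.linspace(1.0, 0.0, overlap)            # fade-out ramp for `a`
    blended = a[-overlap:] * fade + b[:overlap] * (1.0 - fade)
    return np.concatenate([a[:-overlap], blended, b[overlap:]])

unit_a = np.ones(100)      # stand-ins for two recorded snippets
unit_b = np.zeros(100)
joined = crossfade_join(unit_a, unit_b, overlap=20)
print(len(joined))  # 180 samples: 100 + 100 - 20 of overlap
```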
---
### The result can be imperfect
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Then, Artificial Neural Networks arrived, and everything changed
---
### The World's Worst Introduction to Neural Networks
---
### For today, Neural Networks learn to transform input data into a desired output data
- Training involves presenting the network with paired inputs and desired outputs
- Then we adjust the weights inside the network to bring its output closer to the desired output
- Then you feed in new input, and get new output
- They are wildly complex, and wildly powerful
- **Take COGS 181 to actually understand this!**
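The training loop described above, in miniature (a toy with one weight, not a real neural network): a "network" learns to map input x to output y = 2x by nudging its weight to shrink the error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x                          # the desired outputs

w = 0.0                              # the "network" between input and output
lr = 0.5                             # learning rate
for step in range(100):
    pred = w * x                         # present the input, get an output
    grad = np.mean(2 * (pred - y) * x)   # how the error changes with w
    w -= lr * grad                       # adjust toward the desired output

print(round(w, 3))  # ≈ 2.0: the network learned the mapping
```

Real networks do exactly this, but with millions of weights and far richer transformations between input and output.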
---
### [Tacotron 2](https://arxiv.org/pdf/1712.05884) is a relatively simple neural TTS system
- For now, we input text, we get speech
- It's trained using speech with paired text
- It takes text, and generates spectrograms, chunk-by-chunk, which can be turned into a waveform
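The chunk-by-chunk idea can be sketched as a loop: the decoder emits one spectrogram frame at a time, each conditioned on the previous frame, and halts when its "stop" output fires. The decoder below is a made-up stand-in, not Tacotron 2's actual network:

```python
import numpy as np

def toy_decoder(prev_frame: np.ndarray, step: int):
    """Stand-in decoder: returns (next_frame, stop_probability)."""
    next_frame = 0.9 * prev_frame + 0.1    # pretend spectral dynamics
    stop_prob = step / 50.0                # pretend stop-token output
    return next_frame, stop_prob

frame = np.zeros(80)                 # 80 mel bins, as Tacotron 2 uses
frames = []
for step in range(1000):
    frame, stop = toy_decoder(frame, step)
    frames.append(frame)
    if stop > 0.5:                   # the real model thresholds similarly
        break

spectrogram = np.stack(frames)       # (time, mel): the vocoder's input
print(spectrogram.shape)
```

A separate vocoder network then turns that spectrogram into the actual waveform.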
---
### Tacotron 2
---
### This allows us to go from text to speech!
- We feed in text, and we get back a wave, with no humans involved past making training data!
- State-of-the-art models are getting very, very good!
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### The State of the Art is Advanced, but closed
- Current state of the art models from ElevenLabs, OpenAI, Google, and Amazon are all closed and proprietary
- If you want the best TTS in the world, it has to happen on somebody else's computer
- Details are often not published and considered "trade secrets"
- It's not currently possible to teach the state of the art in TTS!
- ... and this should disturb us as a society
---
### Neural TTS can be trained using *any* voice
- You can build a model from the ground up using any voice you'd like
- If all your training data are from a bored Bostonian, you'll end up with a bored Bostonian TTS voice
- Yet, we might want different voices...
---
> All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Humans know that content and style are different
- Speech expresses linguistic 'content'
- Phonemes, with ordering, and necessary pitch and timing for comprehension
- This is 'all we need' to understand the utterances
- There's also 'Style', which gives us lots of other details
- 'Speaker' identity
- Social components
- Emotional content
- Plus things like speed, emphasis, pitch 'tunes', sarcasm
---
### Couldn't we just abstract out the 'style' component and apply it to whatever Linguistic content we'd like?
- Yes!
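The core trick can be sketched in a few lines: the same 'content' vector is paired with different learned speaker embeddings, so one model produces different voices. Everything below is a stand-in for trained components, just to show the conditioning idea:

```python
import numpy as np

rng = np.random.default_rng(1)
content = rng.normal(size=16)                 # "what to say"
speakers = {"alice": rng.normal(size=8),      # hypothetical learned
            "bob": rng.normal(size=8)}        # speaker embeddings
W = rng.normal(size=(4, 16 + 8))              # stand-in decoder weights

def decode(content, speaker_embedding):
    """Condition the decoder on content + style, as multi-speaker TTS does."""
    return W @ np.concatenate([content, speaker_embedding])

out_a = decode(content, speakers["alice"])
out_b = decode(content, speakers["bob"])
print(np.allclose(out_a, out_b))  # False: same content, different style
```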
---
### Here's a Multi-Speaker Version of Tacotron 2
---
### The results of this are... terrifying
- ... and have given rise to 'Deepfake' voices
---
### Neural Style Transfer
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Tacotron 2)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(ElevenLabs)
(Credit to Erick Amaro and Mia Khattar!)
---
### Multilingual Examples
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(English)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(French)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Spanish)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Mandarin)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Italian)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Russian)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Japanese)
---
### Prosody is still hard
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### ... but OMG, this thing can do arbitrary speech, in an arbitrary voice
- ... and it's never had a tongue or phonics training, and doesn't actually know anything at all about mouths
- Arguably, it doesn't know anything about English
- ... although some systems use a language model too
- *This is amazing!*
- ... but it can model more complexity still
---
### Code Switching
It's like sometimes mezclo un poco de español con my English, cuando me siento particularmente spicy, y tengo curiosidad to know cómo la TTS handles it.
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Wow.
- Not only can exposure to data allow a deep neural network to learn to map written language into speech in one language
- ... but it can do it for two languages
- ... at once
- ... with clear mixing of the two
- **Espicy!**
---
# What about Speech Perception?
---
### Speech Perception is *hard*
- Speech is flapping bits of meat around in your head and throat while you expel air.
- This creates tiny vibrations in the air
- **Speech perception is turning the resulting vibrations in the air back into language**
---
### Let's focus on one of the really hard problems
---
## Vowel Perception
---
### What is a vowel?
* A vowel is letting the voice resonate in the vocal tract while you move the tongue
* If we change the position of the tongue, we change the resonances
---
### English has lots of vowels
/i/ - beet, see, seen, sear, seal
/ɪ/ - bit, sit, tin, sill
/ɛ/ - bet, set, sent, fair, sell
/æ/ - bat, sat, pant, pal
/ʌ/ - but, sun, pun, lull (ə in sofa, amount)
/əɹ/ - bird, purr, earl, butter, clamor (this is often broken into two vowels!)
/ɑ/ - bot, saw, star, Paul, pawn, (cot*)
/ɔ/ - corn /kɔɹn/, boy /bɔj/ (caught*)
/ʊ/ - book, hood, puss
/u/ - boot, who’d, loose, lure, loon
---
### Diphthongs, too!
/ɔj/ - boy, soy, toy, join, oil, Roy
/aj/ - buy, right, try, sigh, die, fire
/ej/ - play, bay, may, ray, lay, trail
/ow/ - boat, oat, wrote, pope, toll
/aw/ - how, now, brown, cow, prow, louse
---
### What do vowels sound like?
* We talk about vowel quality in terms of "formants"
* These are bands of the spectrum where the energy is strongest
* The frequencies of these formants are how we distinguish vowels
---
### So, different vowels are basically different formant patterns
---
Different American English vowels, as spoken by a male speaker
---
### Speaker Vowel Space Variation
* Different speakers produce different resonances, even for the “same” vowels
* Vocal tracts can be shorter, longer, wider...
---
### Here's the weird part!
- Different speakers have different formants, even for the “same” vowels!
* Every person has a different set of basic vowel formant positions
* This is called the speaker’s “vowel space”
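If every vowel were just a fixed formant pattern, perception would be easy: measure (F1, F2), pick the nearest reference vowel. Here's that naive classifier, using approximate adult-male averages (after Peterson & Barney, 1952):

```python
REFERENCE = {            # vowel: (F1 Hz, F2 Hz), approximate male averages
    "i": (270, 2290), "ɪ": (390, 1990), "ɛ": (530, 1840),
    "æ": (660, 1720), "ɑ": (730, 1090), "ɔ": (570, 840),
    "ʊ": (440, 1020), "u": (300, 870),  "ʌ": (640, 1190),
}

def classify(f1: float, f2: float) -> str:
    """Return the reference vowel closest in (F1, F2) space."""
    return min(REFERENCE,
               key=lambda v: (REFERENCE[v][0] - f1) ** 2
                           + (REFERENCE[v][1] - f2) ** 2)

print(classify(280, 2250))  # i
print(classify(700, 1100))  # ɑ
```

The catch, as the next slides show: a fixed table like this fails the moment the speaker's vowel space differs from the reference speaker's.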
---
### 'Idealized' Formants
---
### Speaker Average Formants
---
### Moment-to-moment Vowel Variation
* Even the same speaker will have variation from moment to moment
* We often move our tongues differently, changing the vowel's quality
* For many, many reasons
* This leads to constant and massive changes in vowel production
---
### Speaker Average Formants
---
### Individual Token Formants
---
### Individual Token Formants
---
### Every person you've ever talked with has had different vowel formant patterns
* ... and yet, we understand each other, somehow
---
### Humans are able to cope with this easily, somehow
- ... but what about computers?
---
## Automatic Speech Recognition
---
### ASR builds mappings from audio to text
- We feed the system lots of text, and lots of corresponding audio
- It learns the patterns of sound associated with a given text
- Some use language models to give better predictions
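Here's a toy of how a language model sharpens those predictions: combine each candidate's acoustic score with a language-model score and pick the best total. All the probabilities here are made up for illustration:

```python
import math

acoustic = {            # P(audio | words): the two phrases sound alike
    "recognize speech": 0.45,
    "wreck a nice beach": 0.55,
}
language_model = {      # P(words): which phrase is more plausible as text
    "recognize speech": 0.30,
    "wreck a nice beach": 0.001,
}

def best_hypothesis():
    """Pick the argmax of log P(audio|words) + log P(words)."""
    return max(acoustic, key=lambda words:
               math.log(acoustic[words]) + math.log(language_model[words]))

print(best_hypothesis())  # recognize speech
```

The acoustics slightly favor the wrong parse, but the language model rescues it.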
---
### Vintage ASR used to require explicit speaker adaptation
- Around the turn of the century, ASR software required personalization and 'training'
- Setup began with "Read these texts aloud"
- It would then process for a little while as it 'customized' to your voice
- The model *simply wouldn't work* without this level of customization
---
### Then, Artificial Neural Networks arrived, and everything changed (again)
---
### Whisper is OpenAI's Neural ASR Tool
---
### ... and it is wildly effective
- It works relatively quickly
- On relatively low-end hardware
- ... and most amazing of all...
---
### Whisper can get human-like performance in speech transcription*
---
### Wow.
- These ASR tools just 'listened' to a bunch of audio with texts
- They built representations of speech
- They combined it with some knowledge of how text usually looks
- ... and suddenly, it approaches human ability in speech perception*
---
### Wait, what were those asterisks?
- About that....
---
### How many of you have had great experiences with speech-to-text?
- How many think it's OK?
- How many think it works terribly?
---
### These tools are great at recognizing speech *for the dialects that they were trained on*
- ... but they're substantially weaker at adapting to different dialects
- Many people are working on that
- ... including your very own Oishani Bandopadhyay!
- **So, humans still win!**
---
### Hooray!
---
Yet, there's still one incredible fact...
---
### Neural networks can be as good as humans at speech
- ... without tongues, ears, grammatical knowledge, or human brains
- Just like LLMs, they're the second thing *ever* that can do human language
- *All it takes is lots of data and the right architecture*
- Take LIGN 168 to understand more about speech tools and follow the progress!
---
### We're still figuring out what this means for Language
- ... but one thing is for sure ...
---
### It's a fascinating time to study the Cognitive Science of Language!
- ... and to have a Linguistics double major
- (Sorry, couldn't resist)
---
Thank you!
---
### Bonus Content: Formants are enough for speech perception in humans
---
### Let's listen to some sounds
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Now let's play all three at once!
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Does this help?
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
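This demo is "sine-wave speech" (Remez et al., 1981): a sentence reduced to three sinusoids tracking its first three formants. Alone, each sounds like a whistle; summed, the mixture starts to sound like speech. A sketch of the synthesis, with made-up formant tracks standing in for ones extracted from a real recording:

```python
import numpy as np

SR = 16000  # samples per second; one second of audio

# Hypothetical formant tracks (Hz), each sliding over the second
f1 = np.linspace(300, 700, SR)
f2 = np.linspace(2200, 1100, SR)
f3 = np.full(SR, 2500.0)

def sinewave(track: np.ndarray) -> np.ndarray:
    """A sinusoid whose instantaneous frequency follows the given track."""
    phase = 2 * np.pi * np.cumsum(track) / SR
    return np.sin(phase)

# Each sinusoid alone is a whistle; their sum becomes speech-like
mixture = (sinewave(f1) + sinewave(f2) + sinewave(f3)) / 3.0
print(mixture.shape)  # (16000,)
```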