---
### First, the elephant in the room
- I'm not actually in Cognitive Science!
---
### Cognitive Science wasn't a separate program where I grew up
- I have my BA, MA and Doctorate in Linguistics from the [University of Colorado at Boulder](https://www.colorado.edu/linguistics/)
- CU Boulder had the [Institute of Cognitive Science](https://www.colorado.edu/ics/)
- Collaborations between faculty in CS, Philosophy, Linguistics, Psych, Education, and more
- I went to their talks and was advised by an affiliate, but there was not a COGS major or Ph.D. specialization
- ... but I am bothered by COGS-flavored questions
---
### I'm a Computational Phonetician
- This means I study human speech perception and production
- ... using computational methods and models
- This involves a mix of experiments, data analysis, recordings, and instrumental measurements
- I collaborate lots with [Dr. Sarah Creel](https://quote.ucsd.edu/lasr/)
---
### I'm also Director of [Computational Social Science](https://css.ucsd.edu) at UCSD
- We have [lots of Cognitive Scientists](https://css.ucsd.edu/people/faculty.html#Cognitive-Science)!
- Including [Dr. Sean Tro]
- So, this means I'm always thinking about computers as a tool for understanding humans!
---
## What is Speech Technology?
---
### We're getting very used to speech technology
- Siri/Alexa/GoogleAssistant
- ChatGPT Voice Mode
- Speech-to-Text Keyboards
- Text-to-Speech (e.g. in Twitch streams)
---
### There are many kinds of speech technology
- Voice Activity Detection
- Automatic Noise Filtering
- Voice Compression and Encryption
- Forced Alignment and Timestamping
- Automatic Speech Recognition (ASR)
- Speech Synthesis or Text-to-Speech (TTS)
---
### These are really interesting tools!
- They allow new kinds of human-computer interactions
- They are incredible tools for accessibility
- They're great for processing large amounts of data
- ... but the most interesting part of these tools?
---
### The most interesting part about them is that they work at all!
---
### Today's Plan
- Why is speech so hard to produce?
- How do computers produce speech?
- Why is speech so hard to perceive?
- How do computers perceive speech?
---
## Speech is Hard
---
### Human Speech is incredibly difficult
- This is an incredibly intricate gestural dance in your mind and mouth
- Let's try it
---
### "A Linguistics Major goes very well with Cognitive Science"
- First, focus on your jaw
- Now, on your tongue
- Now, feel the vibes
---
### Ultrasound of Speech
---
### Ultrasound of Speech
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
From the University of Michigan Phonetics Lab
---
### Many paths to the same sound
---
### Many paths to the same sound
---
### Speech is *hard*
- Speech is flapping bits of meat around in your head and throat while you expel air.
- This creates tiny vibrations in the air, carrying your message to listeners
---
### So, what do these vibrations look like?
---
---
---
### There is incredible complexity in this process
- Fluid movement of your mouth and tongue
- Careful planning of air and breathing
- Control of pitch, gestures, and other aspects
---
### ... and we want to do *this* with software?!?
- 'Speech Synthesis' or 'Text-to-Speech' (TTS)
- *How do we do that?*
---
### The Task
- "A linguistics major goes very well with Cognitive Science"
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Text Analysis
- "OK, the human gave me text, what do they actually want me to say?"
- This part is usually done in Python
- This is actually hard
- '1997' could be "nineteen ninety-seven" (a year), "one thousand nine hundred ninety-seven" (a quantity), or "one nine nine seven" (a PIN); see the sketch below
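A minimal sketch of the kind of disambiguation the front end has to do (the rules and function name here are invented for illustration, not taken from any real TTS system):

```python
# Toy TTS text normalization: the same digit string has to be verbalized
# differently depending on what it "is". Rules and names are invented
# purely for illustration.

def verbalize_1997(context: str) -> str:
    """Pick a spoken form for '1997' based on a (hand-waved) context label."""
    forms = {
        "year":     "nineteen ninety seven",
        "quantity": "one thousand nine hundred ninety seven",
        "digits":   "one nine nine seven",   # e.g. a PIN or serial number
    }
    return forms[context]

print(verbalize_1997("year"))      # "in 1997"       -> nineteen ninety seven
print(verbalize_1997("quantity"))  # "1997 widgets"  -> one thousand ...
print(verbalize_1997("digits"))    # "PIN code 1997" -> one nine nine seven
```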
---
### "PG&E will file schedules on April 20."
- (Thanks to Julia Hirschberg for this annotated chunk)
---
### Then, we turn that into audio
- "I know what needs to be said, now, give me a wave I can play back for the humans"
---
### For a long time, we cheated using humans!
- **Concatenative or 'Unit Selection' TTS** chops up bits and pieces of existing speech to create new speech
- You record a huge database of speech from a voice actor, with optimum 'coverage'
- You update as new words emerge (e.g. COVID, rawdogging, skibidi)
- **You then combine these words into sentences to match the text**
- ... and you use fancy algorithms to smooth the results out.
---
### This isn't easy
- You have to choose the best recorded token
- You might have 500 recordings of 'went'
- Context matters a lot
- "park" can sound very different next to different words and in different positions (see the cost sketch below)
- You can't get full coverage
- "Ruaridh", "Krivokapic", "simp", "La Jolla"
---
### The result can be imperfect
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### ... and you've only got one voice
- Which means that each new person needs a new collection of data
---
### Then, Artificial Neural Networks arrived, and everything changed
---
### The World's Worst Introduction to Neural Networks
---
### Take COGS 181 to actually understand this!
---
### Neural TTS is quite powerful
- Train a neural network with text and corresponding audio
- Make it output something which can be made into a wave very readily
- Either make the wave directly, or make an intermediate representation which can be turned into a wave
---
### [Tacotron 2](https://arxiv.org/pdf/1712.05884) is a relatively simple, open system
- It takes text, and generates spectrograms, chunk-by-chunk, which can be turned into a waveform
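If you want to poke at this yourself, torchaudio ships pretrained Tacotron 2 pipelines. A rough sketch of what using one looks like (bundle names and exact signatures shift between torchaudio versions, so treat this as an outline and check the current docs):

```python
# Sketch: text -> mel spectrogram (Tacotron 2) -> waveform (vocoder), using a
# pretrained torchaudio pipeline bundle. Names/signatures may vary by version.
import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()   # text -> token IDs
tacotron2 = bundle.get_tacotron2()        # tokens -> mel spectrogram
vocoder = bundle.get_vocoder()            # mel spectrogram -> waveform

text = "A linguistics major goes very well with Cognitive Science."
with torch.inference_mode():
    tokens, lengths = processor(text)
    spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)
    waveform, _ = vocoder(spec, spec_lengths)

torchaudio.save("tts_output.wav", waveform.cpu(), vocoder.sample_rate)
```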
---
### Tacotron 2
---
### This allows us to go from text to speech!
- We feed in text, and we get back a wave, with no humans involved past making training data!
- The results are getting very, very good.
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### The State of the Art is Advanced, but closed
- Current state of the art models from ElevenLabs, OpenAI, Google, and Amazon are all closed and proprietary
- If you want the best TTS in the world, it has to happen on somebody else's computer
- Details are often not published and considered "trade secrets"
- They may well be built on open-source models with changes and tweaks
- It's not currently possible to teach the state of the art in TTS!
- ... and this should disturb us as a society
---
### Neural TTS can be trained using *any* voice
- You can build a model from the ground up using any voice you'd like
- [Except Scarlett Johansson](https://www.npr.org/2024/05/20/1252495087/openai-pulls-ai-voice-that-was-compared-to-scarlett-johansson-in-the-movie-her)
- If all your training data are from a bored Bostonian, you'll end up with a bored Bostonian TTS voice
- This is very expensive, though, and doesn't scale well at all
- You also need *lots* of data from the new speaker
---
> All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### ... but isn't linguistic information separate from talker information?
- Why can't we just adapt to new voices?
---
### We can think of all speech as having 'content' and 'style'
- Voices express linguistic 'content'
- Phonemes, their ordering, and whatever tone/prosody is necessary for comprehension
- This is 'all we need' to understand the utterances
- 'Style' is everything else we've been talking about
- 'Speaker' identity
- Social components
- Emotional content
- Plus prosodic factors (e.g. speed, emphasis, prosodic 'tunes', sarcasm)
---
### Couldn't we just abstract out the 'style' component and apply it to whatever Linguistic content we'd like?
- Yes!
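A cartoon of the idea: the network consumes a sequence of phoneme embeddings (the 'content') plus a single speaker/style vector (the 'style'); swap the style vector and you swap the voice without touching the content. All module names and dimensions below are invented for illustration.

```python
# Cartoon of content/style conditioning in neural TTS. The decoder sees the
# phoneme sequence ("content") concatenated with a speaker/style vector
# ("style"); different style vectors yield different voices for the same text.
# Names and dimensions are invented for illustration.
import torch
import torch.nn as nn

class ToyStyleConditionedTTS(nn.Module):
    def __init__(self, n_phonemes=50, content_dim=64, style_dim=16, n_mels=80):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, content_dim)
        self.decoder = nn.GRU(content_dim + style_dim, 128, batch_first=True)
        self.to_mel = nn.Linear(128, n_mels)   # one mel-spectrogram frame per step

    def forward(self, phoneme_ids, style_vector):
        content = self.phoneme_embedding(phoneme_ids)             # (B, T, content_dim)
        style = style_vector.unsqueeze(1).expand(-1, content.size(1), -1)
        frames, _ = self.decoder(torch.cat([content, style], dim=-1))
        return self.to_mel(frames)                                # (B, T, n_mels)

model = ToyStyleConditionedTTS()
phonemes = torch.randint(0, 50, (1, 12))              # same "content"...
alice, bob = torch.randn(1, 16), torch.randn(1, 16)   # ...two different "styles"
print(model(phonemes, alice).shape, model(phonemes, bob).shape)
```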
---
### Here's a Multi-Speaker Version of Tacotron 2
---
### The results of this are... terrifying
- ... and have given rise to 'Deepfake' voices
---
### Neural Style Transfer
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Tacotron 2)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(ElevenLabs)
(Credit to Erick Amaro and Mia Khattar!)
---
### Multilingual Examples
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(English)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(French)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Spanish)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Mandarin)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Italian)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Russian)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Japanese)
---
### This system isn't perfect
> Adenocarcinoma in Tubovillious Adenoma bona fide certiorari de jure collusion RICO ex post facto CVN AWACS Escapement Tourbillion Remontoir de Egalite
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Prosody is still hard
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### ... but OMG, this thing can do arbitrary speech, in an arbitrary voice
- ... and it has never had a tongue, never had phonics training, and doesn't actually know anything at all about mouths
- Arguably, it doesn't know anything about English
- ... although some systems use a language model too
- *This is amazing!*
- ... but it can model more complexity still
---
### Code Switching
It's like sometimes mezclo un poco de español con my English, cuando me siento particularmente spicy, y tengo curiosidad to know cómo la TTS handles it.
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Wow.
- Not only can exposure to data allow a deep neural network to learn to map written language into speech in one language
- ... but it can do it for two languages
- ... at once
- ... with clear mixing of the two
---
# This shouldn't work
- Yet, here we are
- Espicy!
---
# What about Speech Perception?
---
### Speech Perception is *hard*
- Speech is flapping bits of meat around in your head and throat while you expel air.
- This creates tiny vibrations in the air
- **Speech perception is turning the resulting vibrations in the air back into language**
---
### For computers, this is 'Automatic Speech Recognition' (ASR)
- The task is to turn speech into equivalent text
- This is *really, really hard*
---
### Let's focus on one of the really hard problems
---
## Vowel Perception
---
### What is a vowel?
What kind of vowels are we talking about?
---
---
---
### Review: What is a vowel?
* A vowel is voicing passing through (and resonating in) an unobstructed vocal tract!
* If we change the position of the tongue, we change the resonances
---
---
### Review: What is a vowel?
A vowel is voicing passing through (and resonating in) an unobstructed vocal tract!
If we change the position of the tongue, we change the resonances
* Different resonances *filter* the sound differently and determine the vowel quality
* **Different tongue shapes create different resonances, and different vowels!**
---
---
### What do vowels sound like?
* We talk about vowel quality in terms of "formants"
* These are bands of the spectrum where the energy is strongest
* The frequencies of these formants are our primary cues
---
---
---
### Vowel formants
* F1 and F2 are generally considered to be the most important
* F3 is good for rounding and rhoticity
---
### Formants alone can be enough for some perception!
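(The removed audio demos below did roughly this: synthesize pure tones at formant-like frequencies and then mix them. Here's a sketch; the formant values are approximate textbook values for an /i/-like vowel, not measurements from the original demo.)

```python
# Rough reconstruction of the demo: three sine waves at formant-like
# frequencies, written out separately and then summed. Formant values are
# approximate textbook values for an /i/-like vowel.
import numpy as np
from scipy.io import wavfile

sr = 16000                          # sample rate (Hz)
t = np.arange(int(0.5 * sr)) / sr   # half a second of time samples
formants = [280, 2250, 2900]        # roughly F1, F2, F3 for /i/

tones = [np.sin(2 * np.pi * f * t) for f in formants]
for f, tone in zip(formants, tones):
    wavfile.write(f"tone_{f}Hz.wav", sr, (0.3 * tone).astype(np.float32))

# "Now let's play all three at once!"
mix = sum(tones) / len(tones)
wavfile.write("all_three_formants.wav", sr, (0.3 * mix).astype(np.float32))
```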
---
### Let's listen to some sounds
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Let's listen to some sounds
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
### Now let's play all three at once!
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Let's listen to some sounds
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
### Now let's play all three at once!
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
### Does this help?
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### So, vowels are basically formant patterns
---
Different American English vowels, as spoken by a male speaker
---
### ... and vowel formants map to articulation!
---
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
## Speaker Variation!
---
### Speaker Vowel Space Variation
* Different speakers produce different resonances, even for the “same” vowels
* Vocal tracts can be shorter, longer, wider...
---
---
### Speaker Vowel Space Variation
Different speakers produce different resonances, even for the “same” vowels
* Speakers can have colds or allergies, or more nasal voices...
* Sociolinguistic factors galore
* Every person has a different set of basic vowel formant positions
* This is called the speaker’s “vowel space”
---
---
---
### Moment-to-moment Vowel Variation
* Even the same speaker will have variation from moment to moment
* Sometimes we misarticulate, accidentally making the wrong vowel quality
* Or we talk with food in our mouths, producing different resonances
* Or sometimes, we’re just plain lazy
* This leads to constant and massive changes in vowel production
---
---
---
---
---
### Every person you've ever talked with has had different vowel formant patterns
* ... and yet, we understand each other, somehow
---
---
### There are a few ways this might work
---
### Speaker-intrinsic vowel space normalization
* Normalization is a process that “happens”
* You meet somebody, you create a model of their vowel space, and you move on
* These models of speaker vowels are maintained in memory
* One model per person, and a new model each time!
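One concrete, widely used version of "build a model of their vowel space" is Lobanov normalization: z-score each formant within a single speaker's own vowels, so different speakers' spaces land on a comparable scale. A minimal sketch (the formant values are invented):

```python
# Lobanov-style speaker-intrinsic normalization: z-score each formant within
# ONE speaker's vowels so that different speakers become comparable.
# The (F1, F2) values below are invented for illustration.
import numpy as np

def lobanov(formants):
    """formants: array of shape (n_tokens, n_formants) for one speaker."""
    return (formants - formants.mean(axis=0)) / formants.std(axis=0)

speaker_a = np.array([[300, 2300], [700, 1200], [500, 1000]], dtype=float)
speaker_b = np.array([[380, 2700], [850, 1500], [620, 1250]], dtype=float)

print(lobanov(speaker_a))
print(lobanov(speaker_b))   # different raw Hz, similar normalized layout
```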
---
### Direct Realism
- We're using our senses to form a model of reality, including inside the mouth
- We don't really care about the acoustics per se, just estimating the gestures
- "Based on everything I'm hearing, this seems like she's making the same tongue shape I hear in /i/"
- This includes lots of adjustments 'for free'
---
### Speaker-extrinsic vowel space normalization
* We store information from *every vowel we hear*!
* Normalization is then just bulk comparison and probability
* Vowel identities are probabilistically determined
* Perhaps we also segment by speaker, dialect, language, etc
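As a cartoon of the extrinsic story: pool every vowel token you've ever heard, fit a distribution per vowel category, and label a new token by whichever category makes it most probable. All numbers below are invented.

```python
# Cartoon of speaker-extrinsic normalization: keep all vowel tokens ever heard,
# fit one Gaussian per vowel category over the pooled data, and classify a new
# token by maximum likelihood. (F1, F2) values are invented for illustration.
import numpy as np
from scipy.stats import multivariate_normal

memory = {   # pooled (F1, F2) tokens from many speakers, grouped by category
    "i": np.array([[300, 2300], [320, 2500], [280, 2250], [350, 2600]], dtype=float),
    "a": np.array([[700, 1200], [750, 1300], [820, 1150], [680, 1250]], dtype=float),
    "u": np.array([[320, 800],  [350, 920],  [300, 780],  [370, 890]],  dtype=float),
}

def classify(token):
    scores = {
        vowel: multivariate_normal(toks.mean(axis=0), np.cov(toks.T)).logpdf(token)
        for vowel, toks in memory.items()
    }
    return max(scores, key=scores.get)

print(classify([310, 2400]))   # -> 'i'
print(classify([780, 1220]))   # -> 'a'
```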
---
---
### Humans are able to do this process
- ... but what about ASR?
---
## ASR
---
### ASR builds mappings from audio to text
- We feed the system lots of text, and lots of corresponding audio
- It learns the patterns of sound associated with a given text
- Some use language models to give better predictions
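From the user's side, running a modern trained ASR model is refreshingly boring. A minimal sketch using the open-source `openai-whisper` package (which we'll meet properly in a moment); the model size and the audio filename here are placeholders:

```python
# Minimal ASR sketch with the open-source `openai-whisper` package:
# load a pretrained model, hand it an audio file, get text back.
# Assumes `pip install openai-whisper`; the filename is a placeholder.
import whisper

model = whisper.load_model("base")             # small pretrained checkpoint
result = model.transcribe("lecture_clip.wav")  # audio in...
print(result["text"])                          # ...text out
```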
---
### Vintage ASR used to require explicit speaker normalization
- In the HMM days, ASR software required personalization and 'training'
- Setup began with "Read these texts aloud"
- It would then process for a little while as it 'customized' to your voice
- The model *simply wouldn't work* without this level of customization
---
### ... but when Neural Networks happened, things changed
---
### Whisper's architecture is complicated
---
### ... but it is wildly effective
- It works relatively quickly
- On relatively low-end hardware
- ... and most amazing of all...
### Whisper can get human-like performance in speech transcription*
---
### Wait, what was that asterisk?
---