---
### Introductions
- I'm an Associate Teaching Professor of Linguistics
- I'm a Computational Phonetician
- I study human speech perception and production using computers
- This involves a mix of experiments, data analysis, recordings, and instrumental measurements
- I've also done Natural Language Processing work
- I work to develop new tools for measurement and teaching
---
### I'm also Director of [Computational Social Science](https://css.ucsd.edu) at UCSD
- We're an interdisciplinary program uniting people across Social Science who work with and care about computation and data
- We believe that **data matters because people matter**
- I'm here because I'm always thinking about computers as a tool for understanding humans!
---
### But there remains an elephant in this CSSA-filled room
- I'm not actually in Cognitive Science!
---
### Cognitive Science wasn't a separate program where I grew up
- I have my BA, MA and Doctorate in Linguistics from the [University of Colorado at Boulder](https://www.colorado.edu/linguistics/)
- CU Boulder had the [Institute of Cognitive Science](https://www.colorado.edu/ics/)
- Collaborations between faculty in CS, Philosophy, Linguistics, Psych, Education, and more
- I went to their talks and was advised by an affiliate, but there was no COGS major or Ph.D. specialization
- ... but I am bothered by many COGS-flavored questions
---
### But I do have a lot of COGS ties
- I collaborate a lot with [Dr. Sarah Creel](https://quote.ucsd.edu/lasr/) in COGS
- Who is an incredible person to talk to if you're interested in speech in COGS
- CSS has [lots of Cognitive Scientists](https://css.ucsd.edu/people/faculty.html#Cognitive-Science)!
- Including [Dr. Sean Trott](https://seantrott.github.io/), one of our Core Faculty
- And [Dr. Shannon Ellis](https://shannon-ellis.com/) is on our steering committee
- ... and I know that Language folks over here are just as awesome and effective
---
### No matter where my office is, Language is the best
- Speech is even cooler than Language
- ... and all the more awesome when we put computers into the process
---
## Speech Technology
---
### Speech Technology is pervasive in the US
- Siri/Alexa/GoogleAssistant
- ChatGPT Voice Mode
- Speech-to-Text Keyboards
- Text-to-Speech (e.g. in GPS or Twitch streams)
- ... and much more!
---
### Speech technology is absolutely fascinating
- **... but the most interesting part is that it works at all!**
---
### Today's Plan
- Why is speech so hard to produce?
- How can computers produce speech?
- Why is speech so hard to perceive?
- How can computers perceive speech?
---
## Speech is Hard
---
### Human Speech is incredibly difficult
- This is an incredibly intricate gestural dance in your mind and mouth
- Let's try it
---
### "A Linguistics Major goes very well with Cognitive Science"
- First, focus on your jaw
- Now, on your tongue
- Now, feel the vibes
---
### Speech is *hard*
- Fluid movement of your mouth and tongue
- Careful planning of air and breathing
- Control of pitch, gestures, and other aspects
- All to create tiny pressure variations in the air
---
### ... and we want to do *this* with software?!?
- 'Speech Synthesis' or 'Text-to-Speech' (TTS)
- *How do we do that?*
---
### The Task
"A Linguistics major goes very well with Cognitive Science"
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Historically, the steps were simple
- Analyze what the text needs to sound like ('Text Analysis')
- Jelena saw 1985 listings in La Jolla CA for over $2 million
- /jɛlɛnə sɑ najntin hʌndɹɪd . . . /
- Now, transform that into a wave we can play back for the humans
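That first 'Text Analysis' step can be sketched in code. Here's a toy numeral expander (not any real system's implementation) that spells out a four-digit number in pairs, the way the IPA transcription above reads it:

```python
# Toy TTS text normalization: expand a four-digit numeral into words,
# read "in pairs" ("nineteen hundred eighty-five"), as in the example above.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n: int) -> str:
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def expand_numeral(n: int) -> str:
    """Spell out a four-digit number as 'NN hundred NN'."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    return two_digits(high) + " hundred " + two_digits(low)

print(expand_numeral(1985))  # nineteen hundred eighty-five
```

Real text analysis is much messier: "$2 million", "CA", and "Jelena" each need their own rules.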
---
### For a long time, we cheated using humans!
- **Concatenative or 'Unit Selection' TTS** chops up bits and pieces of existing speech to create new speech
- You record a huge database of speech from a voice actor, with optimum 'coverage'
- You update as new words emerge (e.g. COVID, rizz, skibidi)
- **You then combine these words into sentences to match the text**
- ... and you use fancy algorithms to smooth the results out.
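The simplest of those smoothing tricks is a crossfade at each seam. A minimal sketch (nothing like production unit selection, which also matches pitch and spectra):

```python
import numpy as np

# Join two recorded "units" with a linear crossfade so the seam doesn't click.

def crossfade_join(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Concatenate unit `a` and unit `b`, blending `overlap` samples."""
    fade = np.linspace(1.0, 0.0, overlap)            # fade-out ramp for `a`
    blended = a[-overlap:] * fade + b[:overlap] * (1.0 - fade)
    return np.concatenate([a[:-overlap], blended, b[overlap:]])

unit_a = np.ones(100)      # stand-ins for two recorded snippets
unit_b = np.zeros(100)
joined = crossfade_join(unit_a, unit_b, overlap=20)
print(len(joined))  # 180 samples: 100 + 100 - 20 of overlap
```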
---
### The result can be imperfect
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Then, Artificial Neural Networks arrived, and everything changed
---
### The World's Worst Introduction to Neural Networks
---
### For today, Neural Networks learn to transform input data into a desired output data
- Training involves presenting the network with paired inputs and desired outputs
- Then we adjust the weights inside the network to bring its output closer to the desired output
- Then you feed in new input, and get new output
- They are wildly complex, and wildly powerful
- **Take COGS 181 to actually understand this!**
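The training loop described above, in miniature (a toy with one weight, not a real neural network): a "network" learns to map input x to output y = 2x by nudging its weight to shrink the error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x                          # the desired outputs

w = 0.0                              # the "network" between input and output
lr = 0.5                             # learning rate
for step in range(100):
    pred = w * x                         # present the input, get an output
    grad = np.mean(2 * (pred - y) * x)   # how the error changes with w
    w -= lr * grad                       # adjust toward the desired output

print(round(w, 3))  # ≈ 2.0: the network learned the mapping
```

Real networks do exactly this, but with millions of weights and far richer transformations between input and output.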
---
### [Tacotron 2](https://arxiv.org/pdf/1712.05884) is a relatively simple neural TTS system
- For now, we input text, we get speech
- It's trained using speech with paired text
- It takes text, and generates spectrograms, chunk-by-chunk, which can be turned into a waveform
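The chunk-by-chunk idea can be sketched as a loop: the decoder emits one spectrogram frame at a time, each conditioned on the previous frame, and halts when its "stop" output fires. The decoder below is a made-up stand-in, not Tacotron 2's actual network:

```python
import numpy as np

def toy_decoder(prev_frame: np.ndarray, step: int):
    """Stand-in decoder: returns (next_frame, stop_probability)."""
    next_frame = 0.9 * prev_frame + 0.1    # pretend spectral dynamics
    stop_prob = step / 50.0                # pretend stop-token output
    return next_frame, stop_prob

frame = np.zeros(80)                 # 80 mel bins, as Tacotron 2 uses
frames = []
for step in range(1000):
    frame, stop = toy_decoder(frame, step)
    frames.append(frame)
    if stop > 0.5:                   # the real model thresholds similarly
        break

spectrogram = np.stack(frames)       # (time, mel): the vocoder's input
print(spectrogram.shape)
```

A separate vocoder network then turns that spectrogram into the actual waveform.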
---
### Tacotron 2
---
### This allows us to go from text to speech!
- We feed in text, and we get back a wave, with no humans involved past making training data!
- State-of-the-art models are getting very, very good!
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### The State of the Art is Advanced, but closed
- Current state of the art models from ElevenLabs, OpenAI, Google, and Amazon are all closed and proprietary
- If you want the best TTS in the world, it has to happen on somebody else's computer
- Details are often not published and considered "trade secrets"
- It's not currently possible to teach the state of the art in TTS!
- ... and this should disturb us as a society
---
### Neural TTS can be trained using *any* voice
- You can build a model from the ground up using any voice you'd like
- If all your training data are from a bored Bostonian, you'll end up with a bored Bostonian TTS voice
- Yet, we might want different voices...
---
> All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Humans know that content and style are different
- Speech expresses linguistic 'content'
- Phonemes, with ordering, and necessary pitch and timing for comprehension
- This is 'all we need' to understand the utterances
- There's also 'Style', which gives us lots of other details
- 'Speaker' identity
- Social components
- Emotional content
- Plus things like speed, emphasis, pitch 'tunes', sarcasm
---
### Couldn't we just abstract out the 'style' component and apply it to whatever Linguistic content we'd like?
- Yes!
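The core trick can be sketched in a few lines: the same 'content' vector is paired with different learned speaker embeddings, so one model produces different voices. Everything below is a stand-in for trained components, just to show the conditioning idea:

```python
import numpy as np

rng = np.random.default_rng(1)
content = rng.normal(size=16)                 # "what to say"
speakers = {"alice": rng.normal(size=8),      # hypothetical learned
            "bob": rng.normal(size=8)}        # speaker embeddings
W = rng.normal(size=(4, 16 + 8))              # stand-in decoder weights

def decode(content, speaker_embedding):
    """Condition the decoder on content + style, as multi-speaker TTS does."""
    return W @ np.concatenate([content, speaker_embedding])

out_a = decode(content, speakers["alice"])
out_b = decode(content, speakers["bob"])
print(np.allclose(out_a, out_b))  # False: same content, different style
```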
---
### Here's a Multi-Speaker Version of Tacotron 2
---
### The results of this are... terrifying
- ... and have given rise to 'Deepfake' voices
---
### Neural Style Transfer
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Tacotron 2)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(ElevenLabs)
(Credit to Erick Amaro and Mia Khattar!)
---
### Multilingual Examples
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(English)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(French)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Spanish)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Mandarin)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Italian)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Russian)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(Japanese)
---
### Prosody is still hard
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### ... but OMG, this thing can do arbitrary speech, in an arbitrary voice
- ... and it's never had a tongue or phonics training, and doesn't actually know anything at all about mouths
- Arguably, it doesn't know anything about English
- ... although some systems use a language model too
- *This is amazing!*
- ... but it can model more complexity still
---
### Code Switching
It's like sometimes mezclo un poco de español con my English, cuando me siento particularmente spicy, y tengo curiosidad to know cómo la TTS handles it.
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Wow.
- Not only can exposure to data allow a deep neural network to learn to map written language into speech in one language
- ... but it can do it for two languages
- ... at once
- ... with clear mixing of the two
- **Espicy!**
---
# What about Speech Perception?
---
### Speech Perception is *hard*
- Speech is flapping bits of meat around in your head and throat while you expel air.
- This creates tiny vibrations in the air
- **Speech perception is turning the resulting vibrations in the air back into language**
---
### Let's focus on one of the really hard problems
---
## Vowel Perception
---
### What is a vowel?
* A vowel is letting the voice resonate in the vocal tract while you move the tongue
* If we change the position of the tongue, we change the resonances
---
### English has lots of vowels
/i/ - beet, see, seen, sear, seal
/ɪ/ - bit, sit, tin, sill
/ɛ/ - bet, set, sent, fair, sell
/æ/ - bat, sat, pant, pal
/ʌ/ - but, sun, pun, lull (ə in sofa, amount)
/əɹ/ - bird, purr, earl, butter, clamor (this is often broken into two vowels!)
/ɑ/ - bot, saw, star, Paul, pawn, (cot*)
/ɔ/ - corn /kɔɹn/, boy /bɔj/ (caught*)
/ʊ/ - book, hood, puss
/u/ - boot, who’d, loose, lure, loon
---
### Diphthongs, too!
/ɔj/ - boy, soy, toy, join, oil, Roy
/aj/ - buy, right, try, sigh, die, fire
/ej/ - play, bay, may, ray, lay, trail
/ow/ - boat, oat, wrote, pope, toll
/aw/ - how, now, brown, cow, prow, louse
---
### What do vowels sound like?
* We talk about vowel quality in terms of "formants"
* These are bands of the spectrum where the energy is strongest
* The frequencies of these formants are how we distinguish vowels
---
### So, different vowels are basically different formant patterns
---
Different American English vowels, as spoken by a male speaker
---
### Speaker Vowel Space Variation
* Different speakers produce different resonances, even for the “same” vowels
* Vocal tracts can be shorter, longer, wider...
---
### Here's the weird part!
- Different speakers have different formants, even for the “same” vowels!
* Every person has a different set of basic vowel formant positions
* This is called the speaker’s “vowel space”
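If every vowel were just a fixed formant pattern, perception would be easy: measure (F1, F2), pick the nearest reference vowel. Here's that naive classifier, using approximate adult-male averages (after Peterson & Barney, 1952):

```python
REFERENCE = {            # vowel: (F1 Hz, F2 Hz), approximate male averages
    "i": (270, 2290), "ɪ": (390, 1990), "ɛ": (530, 1840),
    "æ": (660, 1720), "ɑ": (730, 1090), "ɔ": (570, 840),
    "ʊ": (440, 1020), "u": (300, 870),  "ʌ": (640, 1190),
}

def classify(f1: float, f2: float) -> str:
    """Return the reference vowel closest in (F1, F2) space."""
    return min(REFERENCE,
               key=lambda v: (REFERENCE[v][0] - f1) ** 2
                           + (REFERENCE[v][1] - f2) ** 2)

print(classify(280, 2250))  # i
print(classify(700, 1100))  # ɑ
```

The catch, as the next slides show: a fixed table like this fails the moment the speaker's vowel space differs from the reference speaker's.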
---
### 'Idealized' Formants
---
### Speaker Average Formants
---
### Moment-to-moment Vowel Variation
* Even the same speaker will have variation from moment to moment
* We often move our tongues differently, changing the vowel's quality
* For many, many reasons
* This leads to constant and massive changes in vowel production
---
### Speaker Average Formants
---
### Individual Token Formants
---
### Individual Token Formants
---
### Every person you've ever talked with has had different vowel formant patterns
* ... and yet, we understand each other, somehow
---
### Humans are able to cope with this easily, somehow
- ... but what about computers?
---
## Automatic Speech Recognition
---
### ASR builds mappings from audio to text
- We feed the system lots of text, and lots of corresponding audio
- It learns the patterns of sound associated with a given text
- Some use language models to give better predictions
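Here's a toy of how a language model sharpens those predictions: combine each candidate's acoustic score with a language-model score and pick the best total. All the probabilities here are made up for illustration:

```python
import math

acoustic = {            # P(audio | words): the two phrases sound alike
    "recognize speech": 0.45,
    "wreck a nice beach": 0.55,
}
language_model = {      # P(words): which phrase is more plausible as text
    "recognize speech": 0.30,
    "wreck a nice beach": 0.001,
}

def best_hypothesis():
    """Pick the argmax of log P(audio|words) + log P(words)."""
    return max(acoustic, key=lambda words:
               math.log(acoustic[words]) + math.log(language_model[words]))

print(best_hypothesis())  # recognize speech
```

The acoustics slightly favor the wrong parse, but the language model rescues it.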
---
### Vintage ASR used to require explicit speaker adaptation
- Around the turn of the century, ASR software required personalization and 'training'
- Setup began with "Read these texts aloud"
- It would then process for a little while as it 'customized' to your voice
- The model *simply wouldn't work* without this level of customization
---
### Then, Artificial Neural Networks arrived, and everything changed (again)
---
### Whisper is OpenAI's Neural ASR Tool
---
### ... and it is wildly effective
- It works relatively quickly
- On relatively low-end hardware
- ... and most amazing of all...
---
### Whisper can get human-like performance in speech transcription*
---
### Wow.
- These ASR tools just 'listened' to a bunch of audio with texts
- They built representations of speech
- They combined it with some knowledge of how text usually looks
- ... and suddenly, it approaches human ability in speech perception*
---
### Wait, what were those asterisks?
- About that....
---
### How many of you have had great experiences with speech-to-text?
- How many think it's OK?
- How many think it works terribly?
---
### These tools are great at recognizing speech *for the dialects that they were trained on*
- ... but they're substantially weaker at adapting to different dialects
- Many people are working on that
- ... including your very own Oishani Bandopadhyay!
- **So, humans still win!**
---
### Hooray!
---
Yet, there's still one incredible fact...
---
### Neural networks can be as good as humans at speech
- ... without tongues, ears, grammatical knowledge, or human brains
- Just like LLMs, they're the second thing *ever* that can do human language
- *All it takes is lots of data and the right architecture*
- Take LIGN 168 to understand more about speech tools and follow the progress!
---
### We're still figuring out what this means for Language
- ... but one thing is for sure ...
---
### It's a fascinating time to study the Cognitive Science of Language!
- ... and to have a Linguistics double major
- (Sorry, couldn't resist)
---
Thank you!
---
### Bonus Content: Formants are enough for speech perception in humans
---
### Let's listen to some sounds
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Now let's play all three at once!
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
---
### Does this help?
(This audiovisual content has been removed for compliance with recent federal accessibility guidelines. Please see this site for details.)
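This demo is "sine-wave speech" (Remez et al., 1981): a sentence reduced to three sinusoids tracking its first three formants. Alone, each sounds like a whistle; summed, the mixture starts to sound like speech. A sketch of the synthesis, with made-up formant tracks standing in for ones extracted from a real recording:

```python
import numpy as np

SR = 16000  # samples per second; one second of audio

# Hypothetical formant tracks (Hz), each sliding over the second
f1 = np.linspace(300, 700, SR)
f2 = np.linspace(2200, 1100, SR)
f3 = np.full(SR, 2500.0)

def sinewave(track: np.ndarray) -> np.ndarray:
    """A sinusoid whose instantaneous frequency follows the given track."""
    phase = 2 * np.pi * np.cumsum(track) / SR
    return np.sin(phase)

# Each sinusoid alone is a whistle; their sum becomes speech-like
mixture = (sinewave(f1) + sinewave(f2) + sinewave(f3)) / 3.0
print(mixture.shape)  # (16000,)
```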