---
### Introductions
- I'm an Associate Teaching Professor of Linguistics
- I'm a Computational Phonetician
- I study human speech perception and production using computers
- This involves a mix of experiments, data analysis, recordings, and instrumental measurements
- I've also done work on text processing and LLMs
- I'm also Director of [Computational Social Science](https://css.ucsd.edu) at UCSD
---
### Today's Plan
- What is speech technology?
- How does text-to-speech work?
- What happened to make it *so* good?
- How does speech recognition work?
- What happened to make it *so* good?
---
## Speech Technology
---
### Speech Technology is pervasive in the US
- Siri/Alexa/Google Assistant
- ChatGPT Voice Mode
- Speech-to-Text Keyboards
- Text-to-Speech (e.g. in GPS or Twitch streams)
- ... and much more!
---
### Speech technology is absolutely fascinating
- **... but the most interesting part is that it works at all!**
---
## Producing Speech
---
### Human Speech is incredibly difficult
- Speaking is an intricate gestural dance in your mind and mouth
- Let's try it
---
### "All human beings are born free and equal in dignity and rights."
- First, focus on your jaw
- Now, on your tongue
- Now, feel the vibes
---
### Speech is *hard*
- Fluid movement of your mouth and tongue
- Careful planning of air and breathing
- Control of pitch, gestures, and other aspects
- All to create tiny pressure variations in the air
---
### ... and we want to do *this* with software?!?
- 'Speech Synthesis' or 'Text-to-Speech' (TTS)
- *How do we do that?*
---
### The Task
- Turn arbitrary text in your desired language into an audio recording that is indistinguishable from a human speaker
---
### Historically, the steps were simple
- Analyze what the text needs to sound like ('Text Analysis'; a toy version is sketched below)
    - "Jelena saw 1985 listings in La Jolla CA for over $2 million"
    - /jɛlɛnə sɑ najntin hʌndɹɪd . . . /
- Now, transform that into a wave we can play back for the humans
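To make 'Text Analysis' concrete, here is a toy normalization pass for that one sentence. The rules below are hypothetical and cover only this example; real TTS frontends use large rule sets and pronunciation lexicons, and still have to guess whether "1985" is a year or a count.

```python
import re

# Toy lexicon; real TTS frontends have far richer rules and exception lists.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digit(n: int) -> str:
    """Spell out 0-99 in words."""
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")

def normalize(text: str) -> str:
    # "$2 million" -> "two million dollars"
    text = re.sub(r"\$(\d) million",
                  lambda m: f"{ONES[int(m.group(1))]} million dollars", text)
    # Read "1985" as a year ("nineteen eighty-five"); note the ambiguity:
    # as a count of listings, it should really be "one thousand nine hundred..."
    text = re.sub(r"\b(19|20)(\d\d)\b",
                  lambda m: f"{two_digit(int(m.group(1)))} {two_digit(int(m.group(2)))}",
                  text)
    # Expand the (toy, one-entry) abbreviation inventory.
    return text.replace(" CA ", " California ")

print(normalize("Jelena saw 1985 listings in La Jolla CA for over $2 million"))
```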
---
### For a long time, we cheated using humans!
- **Concatenative or 'Unit Selection' TTS** chops up bits and pieces of existing speech to create new speech
- You record a huge database of speech from a voice actor, with optimum 'coverage'
- You update as new words emerge (e.g. COVID, rizz, skibidi)
- **You then combine these words into sentences to match the text**
- ... and you use fancy algorithms to smooth the results out (a toy version of the selection step is sketched below)
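Here's a toy version of unit selection, under heavy assumptions: a single made-up acoustic feature (pitch), made-up costs, and a hypothetical three-phone mini-database. Real systems use many features and huge databases, but the dynamic program has the same shape: pick units that sound right in isolation *and* join smoothly with their neighbors.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str        # which speech sound this recorded clip contains
    pitch: float      # a single toy acoustic feature

def target_cost(u: Unit, want_pitch: float) -> float:
    # How wrong does this unit sound in isolation?
    return abs(u.pitch - want_pitch)

def join_cost(a: Unit, b: Unit) -> float:
    # How audible would the seam between these two units be?
    return abs(a.pitch - b.pitch)

def select(targets, db):
    """targets: [(phone, desired_pitch)]; db: phone -> list of recorded Units."""
    # Dynamic programming: track the cheapest sequence ending in each candidate.
    best = [(target_cost(u, targets[0][1]), [u]) for u in db[targets[0][0]]]
    for phone, want in targets[1:]:
        step = []
        for u in db[phone]:
            cost, path = min(((c + join_cost(p[-1], u), p) for c, p in best),
                             key=lambda t: t[0])
            step.append((cost + target_cost(u, want), path + [u]))
        best = step
    return min(best, key=lambda t: t[0])[1]

# Hypothetical mini-database with two recordings of each phone.
db = {"k": [Unit("k", 110), Unit("k", 140)],
      "æ": [Unit("æ", 120), Unit("æ", 180)],
      "t": [Unit("t", 115), Unit("t", 170)]}
print(select([("k", 120), ("æ", 125), ("t", 120)], db))
```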
---
### The result can be imperfect
(audio example removed)
---
### Then, Artificial Neural Networks arrived, and everything changed
---
### The World's Worst Introduction to Neural Networks
---
### For today, Neural Networks learn to transform input data into desired output data
- Training involves presenting the network with pairs of inputs and desired outputs
- We then adjust the network's internal weights to bring its output closer to the desired output (see the sketch below)
- Then you feed in new input, and get new output
- They are wildly complex, and wildly powerful
- **We as a species do not fully understand how neural networks are as powerful as they are**
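To make 'adjust the network to bring the output closer' concrete, here is a minimal sketch: a tiny two-layer network learning the XOR function with plain numpy and gradient descent. This is not how real speech systems are built or trained, but the loop (forward pass, measure error, nudge weights) is the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)   # inputs
Y = np.array([[0], [1], [1], [0]], float)               # desired outputs (XOR)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)          # input -> hidden weights
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)          # hidden -> output weights
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for step in range(5000):
    # Forward: transform input into an output.
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward: measure the error, then nudge every weight to reduce it.
    d_out = (out - Y) * out * (1 - out)
    d_h = (d_out @ W2.T) * (1 - h**2)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(0)

print(out.round(2).ravel())   # should be close to [0, 1, 1, 0] after training
```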
---
### Neural Networks have brought on the age of 'AI'
- Large Language Models use neural architectures
- Computer image recognition and generation are all neural
- Most speech technology is neural too
---
### [Tacotron 2](https://arxiv.org/pdf/1712.05884) is a relatively simple neural TTS system
- For now, we input text, we get speech
- It's trained using speech with paired text
- It takes text and generates spectrograms, chunk by chunk, which can be turned into a waveform (a sketch of that last step follows)
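That final spectrogram-to-waveform step is the job of a *vocoder*. Tacotron 2's vocoder is neural (a modified WaveNet, per the paper), but the classical, non-neural version of the same move is Griffin-Lim, which librosa ships. The sketch below inverts a magnitude spectrogram back into audio, using a bundled librosa example clip as a stand-in since we have no trained model here.

```python
import numpy as np
import librosa
import soundfile as sf

# Stand-in audio (a bundled librosa example), since we have no TTS model here.
y, sr = librosa.load(librosa.ex("trumpet"))
# A magnitude spectrogram; TTS models emit mel spectrograms, but the idea is
# the same: a picture of which frequencies are loud at each moment.
S = np.abs(librosa.stft(y))
y_hat = librosa.griffinlim(S)      # classical spectrogram-to-waveform inversion
sf.write("reconstructed.wav", y_hat, sr)
```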
---
### Tacotron 2
---
### This allows us to go from text to speech!
- We feed in text, and we get back a wave, with no humans involved beyond making the training data!
- State-of-the-art models are getting very, very good!
(audio example removed)
---
### The State of the Art is Advanced, but closed
- Current state of the art models from ElevenLabs, OpenAI, Google, and Amazon are all closed and proprietary
- If you want the best TTS in the world, it has to happen on somebody else's computer
- Details are often not published and considered "trade secrets"
- It's not currently possible to teach the state of the art in TTS!
- ... and this should disturb us as a society
---
### Neural TTS can be trained using *any* voice
- You can build a model from the ground up using any voice you'd like
- If all your training data are from a bored Bostonian, you'll end up with a bored Bostonian TTS voice
- Yet, we might want different voices...
---
> All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
(audio examples removed)
---
### Humans know that content and style are different
- Speech expresses linguistic 'content'
    - The speech sounds, their ordering, and the pitch and timing needed for comprehension
    - This is 'all we need' to understand the utterance
- There's also 'style', which gives us lots of other details
    - 'Speaker' identity
    - Social components
    - Emotional content
    - Plus things like speed, emphasis, pitch 'tunes', sarcasm
---
### Couldn't we just abstract out the 'style' component and apply it to whatever Linguistic content we'd like?
- Yes!
---
### Here's a Multi-Speaker Version of Tacotron 2
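One common way this is wired up (a sketch of the usual speaker-embedding recipe, not the exact architecture of any particular system): every training utterance carries a speaker ID, and a learned embedding for that speaker is attached to the text encoding, so the decoder is conditioned on content and style separately.

```python
import torch
import torch.nn as nn

class MultiSpeakerEncoder(nn.Module):
    def __init__(self, vocab=64, n_speakers=100, dim=128, spk_dim=16):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)       # linguistic 'content'
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)  # speaker 'style'

    def forward(self, chars, speaker_id):
        text = self.text_emb(chars)                             # (T, dim)
        spk = self.spk_emb(speaker_id).expand(chars.shape[0], -1)  # (T, spk_dim)
        # The decoder downstream sees content and style side by side.
        return torch.cat([text, spk], dim=-1)

enc = MultiSpeakerEncoder()
hidden = enc(torch.tensor([3, 1, 4]), torch.tensor(7))  # an utterance by speaker 7
print(hidden.shape)   # torch.Size([3, 144])
```

Swap in a different `speaker_id` at inference time and the same text comes out in a different voice.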
---
### The results of this are... terrifying
- ... and have given rise to 'deepfake' voices
---
### Neural Style Transfer
(audio example removed)
(Tacotron 2, 2022)
(audio example removed)
(ElevenLabs, 2024)
(Credit to Erick Amaro and Mia Khattar!)
---
### Multilingual Examples
(Audio examples removed: English, French, Spanish, Mandarin, Italian, Russian, Japanese)
---
### Getting timing, pitch, and pauses right is still hard
(audio example removed)
---
### ... but OMG, this thing can do arbitrary speech, in an arbitrary voice
- ... and it's never had a tongue or phonics training, and it doesn't actually know anything at all about mouths
- Arguably, it doesn't know anything about English
- ... although some systems use a language model too
- *This is amazing!*
- ... but it can model more complexity still
---
### Code Switching
It's like sometimes mezclo un poco de español con my English, cuando me siento particularmente spicy, y tengo curiosidad to know cómo la TTS handles it.
(audio example removed)
---
### Wow.
- Not only can exposure to data allow a deep neural network to learn to map written language into speech in one language
- ... but it can do it for two languages
- ... at once
- ... with clear mixing of the two
- **Espicy!**
---
# Recognizing speech
---
### Speech Perception is also *hard*
- Speech is flapping bits of meat around in your head and throat while you expel air.
- This creates tiny vibrations in the air
- **Speech perception is turning the resulting vibrations in the air back into language**
---
### Automatic Speech Recognition
- We take a recording of spoken language and turn it into an accurate text transcription, automatically
- There are hundreds of complexities with this, but let's focus on one of the really hard problems...
---
## Vowel Perception
---
### What is a vowel?
* A vowel is made by letting the voice resonate in the vocal tract while you move the tongue
* If we change the position of the tongue, we change the resonances
---
### English has lots of vowels
/i/ - beet, see, seen, sear, seal
/ɪ/ - bit, sit, tin, sill
/ɛ/ - bet, set, sent, fair, sell
/æ/ - bat, sat, pant, pal
/ʌ/ - but, sun, pun, lull (ə in sofa, amount)
/əɹ/ - bird, purr, earl, butter, clamor (this is often broken into two vowels!)
/ɑ/ - bot, saw, star, Paul, pawn, (cot*)
/ɔ/ - corn /kɔɹn/, boy /bɔj/ (caught*)
/ʊ/ - book, hood, puss
/u/ - boot, who’d, loose, lure, loon
---
### Diphthongs, too!
/ɔj/ - boy, soy, toy, join, oil, Roy
/aj/ - buy, right, try, sigh, die, fire
/ej/ - play, bay, may, ray, lay, trail
/ow/ - boat, oat, wrote, pope, toll
/aw/ - how, now, brown, cow, prow, louse
---
### What do vowels sound like?
* We talk about vowel quality in terms of "formants"
* These are bands of the spectrum where the energy is strongest
* The frequencies of these formants are how we distinguish vowels (a rough way to measure them is sketched below)
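A rough sketch of how you might measure formants in code, assuming a short recording of a steady vowel in a hypothetical file `vowel.wav`: fit a linear predictive coding (LPC) model and read the resonance frequencies off the poles of the fitted filter. Real formant trackers do quite a bit more (windowing, pre-emphasis, bandwidth checks), but this is the core trick.

```python
import numpy as np
import librosa

y, sr = librosa.load("vowel.wav", sr=16000)   # hypothetical steady-vowel recording
a = librosa.lpc(y, order=int(sr / 1000) + 2)  # rule-of-thumb LPC order
# Poles of the LPC filter with positive imaginary part correspond to resonances.
roots = [r for r in np.roots(a) if np.imag(r) > 0]
freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
# Crude: keep candidates above ~90 Hz and call the lowest three F1-F3.
formants = [round(f) for f in freqs if f > 90][:3]
print("Formant estimates (Hz):", formants)
```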
---
### So, different vowels are basically different formant patterns
---
Different American English vowels, as spoken by a male speaker
---
### 'Idealized' Formants
---
### Formants are enough for speech perception in humans
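In that spirit, here's a tiny sketch that builds a vowel-like sound out of nothing but three sinusoids placed at approximate formant frequencies for /ɑ/ (ballpark textbook values for a male speaker). It's crude, but recognizably vowel-ish.

```python
import numpy as np
from scipy.io import wavfile

sr = 16000
t = np.linspace(0, 0.5, int(sr * 0.5), endpoint=False)   # half a second
formants = [710, 1100, 2540]   # approximate F1-F3 for /ɑ/ ("bot")
# One sinusoid per formant, with decreasing amplitude for higher formants.
wave = sum(np.sin(2 * np.pi * f * t) / (i + 1) for i, f in enumerate(formants))
wave = 0.3 * wave / np.abs(wave).max()                    # normalize, leave headroom
wavfile.write("ah_sine.wav", sr, (wave * 32767).astype(np.int16))
```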
---
### Let's listen to some sounds
(audio examples removed)
### Now let's play all three at once!
(audio example removed)
### Does this help?
(audio example removed)
---
### So, if you know the formants, you can understand the vowel
- There's just one problem...
---
### Speaker Vowel Space Variation
* Different speakers produce different resonances, even for the “same” vowels
* Vocal tracts can be shorter, longer, wider...
---
### Here's the weird part!
* Different speakers have different formants, even for the “same” vowels!
* Every person has a different set of basic vowel formant positions
* This is called the speaker’s “vowel space”
---
### Speaker Average Formants
---
### Moment-to-moment Vowel Variation
* Even the same speaker will have variation from moment to moment
* We often move our tongues differently, changing the vowel's quality
* For many, many reasons
* This leads to constant and massive changes in vowel production
---
### Speaker Average Formants
---
### Individual Token Formants
---
### Individual Token Formants
---
### Every person you've ever talked with has had different vowel formant patterns
* ... and yet, we understand each other, somehow
* **Weirdly, we don't seem to care at all!**
---
### How humans do this is still a topic of ongoing research
- ... so how do computers have any hope of doing this?
---
## Automatic Speech Recognition
---
### ASR builds mappings from audio to text
- We feed the system lots of text, and lots of corresponding audio
- It learns the patterns of sound associated with a given text (one common training objective is sketched below)
- Some use language models to give better predictions
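One classic recipe for training that audio-to-text mapping is CTC (connectionist temporal classification); it's not what every modern system uses (Whisper, below, trains a sequence-to-sequence Transformer instead), but it shows the shape of the problem: the loss lets the model figure out for itself which audio frames line up with which letters.

```python
import torch
import torch.nn as nn

T, N, C = 50, 1, 28                  # audio frames, batch size, letters + blank
ctc = nn.CTCLoss(blank=0)

# Stand-in for an acoustic model's per-frame letter probabilities.
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)

target = torch.tensor([[8, 5, 12, 12, 15]])   # "hello" as letter indices (a=1)
loss = ctc(log_probs, target,
           input_lengths=torch.tensor([T]),
           target_lengths=torch.tensor([5]))
loss.backward()   # training nudges the model toward the right transcript
print(float(loss))
```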
---
### Vintage ASR used to require explicit speaker adaptation
- Around the turn of the century, ASR software required personalization and 'training'
- Setup began with "Read these texts aloud"
- It would then process for a little while as it 'customized' to your voice
- The model *simply wouldn't work well* without this level of customization
---
### Then, Artificial Neural Networks arrived, and everything changed (again)
---
### Whisper is OpenAI's Neural ASR Tool
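Using it takes just a few lines (assuming the open-source `whisper` package, installed with `pip install -U openai-whisper`, and a hypothetical recording `lecture.wav`):

```python
import whisper

model = whisper.load_model("base")         # one of the smaller multilingual models
result = model.transcribe("lecture.wav")   # hypothetical audio file
print(result["text"])
```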
---
### ... but it is wildly effective
- It works relatively quickly
- On relatively low-end hardware
- ... and most amazing of all...
---
### Whisper can get human-like performance in speech transcription*
---
### Wow.
- These ASR tools just 'listened' to a bunch of audio with paired text
- They built representations of speech
- They combined it with some knowledge of how text usually looks
- ... and suddenly, it approaches human ability in speech perception*
---
### Wait, what were those asterisks?
- About that....
---
### How many of you have had great experiences with speech-to-text?
- How many think it's OK?
- How many think it works terribly?
---
### These tools are great at recognizing speech *for the dialects that they were trained on*
- ... but they're substantially weaker at adapting to different dialects
- Many people are working to make these models better at generalizing
- ... and to train on more diverse datasets
---
### So, these tools have largely 'solved' variation within a language variety
- ... but they're still not very good at adapting to new dialects and language varieties
- **This is one place where humans still win!**
---
### Hooray!
---
### Speech Technology (and 'AI') are as bad as they'll ever be
- Even in my lifetime, speech technology has gone from 'awful' to 'amazing'
- New datasets are being produced and collected, improving the data that models learn from
- Increased computing power allows for larger, more complicated models
- Improvements in model architecture will allow more efficient and effective use of data and compute
- **So, we're likely to see further improvement, faster than we expect**
---
Yet, even right now...
---
### Neural networks can be as good as humans at speech
- ... without tongues, ears, grammatical knowledge, or human brains
- *All it takes is lots of data and the right architecture*
---
### These are fascinating times for understanding Language
- Large Language Models are the second thing *ever* that can do human language
- The ability to produce and perceive speech is possible without being human
- Statistical learning appears to be sufficient to do most of the tasks in Language
- Many are having to rethink their theories of 'how Language works'
---
### There is always more to learn about language and speech
- The fields of Linguistics and Phonetics are more relevant than ever in this 'AI' future
- Humans will always be our best source of understanding about human language
- **We should also ask what computers can teach us about human language!**
---
Thank you!