---
# Why work with speech at all?
---
### Human language is mostly speech-based
- The vast majority of human languages are spoken
- Many human/computer interactions work better by voice
- We can speak faster and more readily than we can type
- **We want our systems to be able to work with spoken language too!**
---
### What are common speech tasks?
- Automatic Speech Recognition (ASR)
- Spoken language to orthography
- Speech Synthesis or Text-to-speech (TTS)
- Orthography to spoken language
- Voice or language recognition
- "I don't care what they're saying, but who/what is it?"
- Real-time spoken translation
- "Which NLP problems do you want to have?" "Yes."
---
### These tasks are hard
- ASR and TTS are fraught with complexity at every step
- The datasets are very expensive to get and store and annotate
- Speech is *amazingly* complicated
- ... but we want to be good at them, *so badly*
---
### ... but they're fundamentally similar to many other deep learning tasks
- Taking acoustic data as input and classifying it
- ASR or Voice/Language Recognition
- Taking written words as input and generating appropriate spoken data
- Text-to-speech
---
### Deep Learning for speech is a rapidly changing field
- Most speech work was done with HMMs for a long time
- Now deep neural network approaches are taking over
- Apple is [announcing Neural TTS in keynotes](https://www.theverge.com/2019/6/3/18650906/siri-new-voice-ios-13-iphone-homepod-neutral-text-to-speech-technology-natural-wwdc-2019)!
- As a result...
---
### Getting into implementation is very hard
- Companies are *really* guarded about their speech processing algos
- The state-of-the-art is changing every day
- **We're going to focus on the basic issues involved with speech classification**
- ... rather than diving deep on an algorithm which will be outdated by the end of the talk
---
### Today's Plan
- What is the nature of speech?
- How do we discuss and display sound?
- What is the nature of the speech signal?
- How can we turn speech into features?
- What is the training data?
---
# What is the nature of speech?
---
### The Speech Process
* Flapping bits of meat inside your head while blowing out air
* This creates vibrations in the air you're expelling
* The ear picks these up and interprets them as speech.
* This process is studied in the Linguistic subfield of **Phonetics**
---
### The Lungs
---
### Flapping bits of meat ("articulation")
---
### Simplified a bit...
---
### Let's do an experiment
---
> The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
---
### Speech is absolutely insane
- It's a series of fluid and overlapping gestures
- It's amazingly complex
- ... and it's nothing like we think it is
---
### How do we wrap our heads around it?
- First, we break speech into 'segments' or 'phones'
- Then, we figure out how to describe those phones and their properties
- This lets us *transcribe* what was said, rather than what words were said
- But first you need to realize that...
---
## Your writing system is a trainwreck
---
### Your writing system is lying to you
- Every minute of every day
- "They thoroughly and roughly wrought the boughs in the borough, through and through"
- C doesn't exist
- TH is neither a t nor an h, and represents two different sounds
- We have 15 vowels
- ... and if you start thinking about letters, you're going to start struggling
- Consider your writing system with the same skepticism you would normally reserve for a guy with a broken bottle walking towards you in a dark alley.
---
### For more on this, LIGN 110!
---
### We use different writing systems to capture the sounds being made
- The International Phonetic Alphabet was developed by Linguists
- ðə ɪntəɹ'næʃɪnəl fə'nɛtɪk 'ælfəbət wʌz də'vɛləpt baj 'lɪŋgwɪsts
- ARPABET uses two character combinations to encode the sounds of *English*
- AA R P AX B EH T / Y UW Z IH Z / T UW / K EH R IH K T ER ...
- **Often, TTS and ASR use these alphabets as a 'go-between'**
- They're used in resources like [CMUDict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict)
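CMUDict entries are plain text: a word, then its ARPABET phones. A minimal sketch of parsing that format (the two entries below are hardcoded samples, not read from the real file):

```python
# Sketch of looking up ARPABET pronunciations from CMUDict-style lines.
# Format: "WORD  PH OH N Z" (digits on vowels mark stress).
sample_lines = [
    "CAT  K AE1 T",
    "SPEECH  S P IY1 CH",
]

def parse_cmudict(lines):
    entries = {}
    for line in lines:
        word, *phones = line.split()
        entries[word] = phones
    return entries

pron = parse_cmudict(sample_lines)
print(pron["SPEECH"])   # ['S', 'P', 'IY1', 'CH']
```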
---
### ... but we can think about speech as a sequence of 'phones'
- Individual speech sounds
- A series of articulatory targets...
- With hard-to-identify boundaries between them
- Adjacent sounds affect each other
- Which are broadcast to the world acoustically
---
# What is Sound?
---
### Sound is compression and rarefaction in a medium
- Sound needs something to travel in (like air or water)
---
### (Yes, your childhood is a lie)
---
### Thinking of sound as waves of air compression is helpful
- Why does clapping cause a sound, but waving your hand through the air doesn’t?
- Why are gunshots loud?
---
### We're good at hearing sound
- ... but we need to visualize it
---
### Visualizing Sound
- Waveforms
- Spectrograms
---
## Waveform
A horizontal cut through the wave showing the peaks and troughs over time
- The height/strength of the wave is called its "amplitude"
---

### Let's look at the sounds in this room right now
---
### Waveforms are well and good
- ... and you can tell a lot from a waveform
---
### ... but we'll need better information to process speech
---
## Frequency
The speed with which a wave oscillates
- Measured in Hertz (Hz), Cycles per second
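A quick sketch of what "cycles per second" means in samples (numpy assumed; the sample rate and frequency are illustrative choices):

```python
import numpy as np

# Synthesize one second of a pure tone to see frequency in action.
sample_rate = 16000          # samples per second
freq_hz = 100                # cycles per second (Hz)
t = np.arange(sample_rate) / sample_rate    # 1 second of time points
wave = np.sin(2 * np.pi * freq_hz * t)      # 100 Hz sine wave

# A 100 Hz tone completes 100 full cycles per second,
# so each cycle spans sample_rate / freq_hz samples.
samples_per_cycle = sample_rate / freq_hz
print(samples_per_cycle)   # 160.0
```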
---
### 100 Hz - Waveform
---
### 200Hz - Waveform
---
### Voice Pitch
- Changing the "fundamental frequency" of your voice changes the perceived "pitch" of your voice
- Higher frequency of vocal fold vibration == "higher pitch"
- *Intonation is all about this frequency!*
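One classic way to estimate that fundamental frequency is autocorrelation: find the lag at which the signal best matches a shifted copy of itself. A toy sketch (the 80-300 Hz search range is an assumed speaking range, and the test signal is a pure tone, not a real voice):

```python
import numpy as np

# Toy f0 estimate via autocorrelation: the period of vocal-fold
# vibration shows up as the best-matching lag.
def estimate_f0(signal, sr, fmin=80, fmax=300):
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range to search
    lag = lo + int(np.argmax(corr[lo:hi]))    # best-matching period
    return sr / lag

sr = 16000
t = np.arange(sr // 4) / sr                   # quarter second of signal
tone = np.sin(2 * np.pi * 160 * t)            # stand-in "voice" at 160 Hz
print(round(estimate_f0(tone, sr)))   # 160
```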
---
### Frequency is important
- Different phenomena produce sounds at different frequencies
- Most things produce sounds with a mix of different frequencies, each at different amplitudes
- Speech has *many* components at different frequencies
- Each of those frequencies has a different power
---
### How do we visualize this?
- Spectra only show one 'moment' of the signal
---
### "Noise" - Waveform

---
## Spectrogram
Displays signal strength by frequency, over time
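Under the hood, a spectrogram is just many short Fourier transforms laid side by side. A minimal numpy sketch (frame and hop sizes are assumed, typical values):

```python
import numpy as np

# Minimal spectrogram: slice the signal into short frames, window each
# one, and take the magnitude of its FFT.
# Rows = frequency bins, columns = time frames.
def spectrogram(signal, frame_len=400, hop=160):
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only non-negative frequencies (frame_len//2 + 1 bins)
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))   # 1 s of a 440 Hz tone
print(spec.shape)   # (201, 98): frequency bins x time frames
```

With a 400-sample frame at 16 kHz, each bin is 40 Hz wide, so the 440 Hz tone lights up bin 11.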
---
### "Noise" - Waveform

---
### "Noise" - Spectrogram

---
### Let's have a bit of spectrogram fun
---
### You can have fun on your own
- SpectrumView on iOS
- https://musiclab.chromeexperiments.com/Spectrogram/
- Praat (http://praat.org)
---
# Fundamentals of Speech Acoustics
---
### Voicing

---
### Spectrograms show us many evenly-spaced vertical lines
- These are individual glottal pulses
- Higher pitched voices will have...?
- More tightly spaced lines!
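A back-of-the-envelope check: one glottal pulse per vocal-fold cycle means the pulses are 1/f0 apart (the pitch values below are illustrative, not measurements):

```python
# Spacing between glottal pulses is the period of vocal-fold vibration.
def pulse_spacing_ms(f0_hz):
    """Time between glottal pulses, in milliseconds."""
    return 1000.0 / f0_hz

low_voice = pulse_spacing_ms(100)    # ~10 ms between pulses
high_voice = pulse_spacing_ms(200)   # twice the pitch, half the gap
print(low_voice, high_voice)   # 10.0 5.0
```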
---
### Resonances in the mouth

---
### Different vowels have different resonances in the mouth
- Resonances vary depending on the tongue's position
- ... as well as the size and shape of the talker's head
- *Different resonances from the same speaker mean different vowels*
---
Different American English vowels, as spoken by a male speaker
---
### Different speakers have different resonances
- This is a fundamental problem in ASR and speech perception
- Enough data can help, but this is a *major* issue
---
### Other speech sounds have their own acoustics
---
### /l r w j/ act a lot like vowels
---
### Nasal sounds look like quiet vowels
---
### Fricative consonants have little black clouds
- ... and the cloud is higher frequency as you get closer to the mouth
---
### For stop consonants, the signal... stops
---
## Patterns of frequency and amplitude changes are indicative of sounds and words
---
### Cats
---
### Owls
---
### Chickadees
---
### Koalas
---
### Sparrows
---
### There is often no one-to-one mapping
- The expression of a given gesture can have many acoustic consequences
- Different speakers have different realizations of each sound
- Different phones sound different in different contexts
- ... but it all has to happen from this signal
---
### Representing sound as frequency, power and time is the basis of speech technology
- If we don't know what words sound like, we can't teach computers what they sound like
- Similar patterns are easy to confuse for humans and computers
- This lets us understand a bit more about how speech technology might work
---
... but first, we need to ask an important question
---
# How do we turn speech sounds into features?
---
### We've got a fundamental problem, to start
---
### Computers don't do waves
010001110010101000100101101010101010
---
### Sound is analog, computers are digital
- How do we deal with that?
---
### Quantization ('Sampling')
---
### Analog-to-digital conversion
- Sample the wave many times per second
- Record the amplitude at each sample
- The resulting series of measurements faithfully captures the signal (for frequencies up to half the sampling rate, the Nyquist limit)
---
### Relevant Parameters
- The **Bit Depth** describes how many bits of information encode amplitude
- 16 bit audio is the norm
- The **Sampling Rate** describes how many samples per second we take
- 44,100 Hz is the norm, and captures everything you need for speech
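The two parameters in action, as a sketch (numpy assumed; the tone and its amplitude are illustrative): 16 bits gives 65,536 amplitude levels, which we map onto the float range [-1.0, 1.0].

```python
import numpy as np

# Quantize a float signal in [-1, 1] to 16-bit integer samples.
def quantize_16bit(x):
    return np.clip(np.round(x * 32767), -32768, 32767).astype(np.int16)

sr = 44100                          # the standard sampling rate
t = np.arange(sr) / sr              # one second of audio
pcm = quantize_16bit(0.5 * np.sin(2 * np.pi * 440 * t))

print(len(pcm))        # 44100 samples for one second
print(pcm.dtype)       # int16: 2 bytes per sample
```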
---
### AD Conversion now yields a signal that the computer can read
- ... but what are the features?
---
### Putting in the waveform itself is a possibility
- It's cheap and easy
- [wav2vec 2.0](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) is showing amazing results doing just this!
- This uses transformers to work directly on the waveform and identify 'latent speech units'
- This is a very, very tantalizing possibility
---
### ... but we might want more information!
- Important parts of the signal live only in frequency band info
- Many approaches try to give it all the information we can
- Not the same features that linguists usually use
---
### We don't need transparent or parsimonious features
- Things like vowel features and pitch and other details are a pain to extract
- We're plugging it into a black box
- We're happy to plug in hundreds of features, if need be
- We'd just as soon turn that sound into a boring matrix
---
### Let's get that algorithm a Matrix
- Algorithms love Matrices
---
## Mel-Frequency Cepstral Coefficients (MFCCs)
---
### We're not going deep here
- This is a lot of signal processing
- We're going to teach the idea, not the practice
---
### MFCCs
---
### MFCC Process
- 1: Create a spectrogram (effectively)
- 2: Extract the most useful bands for speech (in Mels)
- 3: Look at the frequencies of this banded signal (repeating the Fourier Transform process)
- 4: Simplify this into a smaller number of coefficients using Discrete Cosine Transform (DCT)
- Usually 12 or 13
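The four steps above, sketched in plain numpy using the standard log-filterbank-plus-DCT recipe (frame sizes, 26 mel bands, and 13 coefficients are typical choices, not prescribed here):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    # Step 1: power spectrogram (windowed frames + FFT)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Step 2: triangular mel filterbank pooling the FFT bins
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)

    # Step 3: log energy in each mel band
    mel_energy = np.log(power @ fbank.T + 1e-10)

    # Step 4: DCT-II to compress into a few coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_mels))
    return mel_energy @ dct.T          # shape: (frames, n_ceps)

sr = 16000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 300 * t))
print(feats.shape)   # (98, 13): one 13-coefficient row per 10 ms hop
```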
---
### MFCC Input
---
### MFCC Output
---
### So, the sound becomes a matrix of features
- Many columns (representing time during the signal)
- N rows (usually 13) of coefficients which tell us the spectral shape
- It's black-boxy, but we don't care.
- We've created a Matrix
---
### Now we've got a matrix representing the sound
- MFCCs capture frequency information, weighted to our perceptual needs
- wav2vec 2.0 (and equivalents) go straight to vectors
---
### It's Neural Network time!
---
# What is the learning task like for ASR and TTS?
---
First, one major question...
### What units of speech are we working with?
---
### What's the desired data labeling?
- We need to give the NN labeled data
- [Chunk of Sound] == [Labeled Linguistic Info]
- (x Many many many many tokens)
- What level do we want to recognize and generate at?
---
### Possible levels of labeling
- Sentences?
- Words?
- Phones?
- Diphones?
---
### Sentences
- Why are sentences a bad idea?
---
### Words
"Noise"
---
### What are the pros and cons of words?
---
### Phones
---
### Diphones
---
### What are the pros and cons of phones and diphones?
---
### In practice, many systems use diphones
- [CMU's Sphinx does](https://cmusphinx.github.io/)
- As do many others
- Triphones are often a possibility
- Some go straight to entire words
- Speech recognition systems are often kept secret
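A diphone inventory is just every pair of adjacent phones, one label per transition. A hypothetical helper (the `sil` word-edge marker is an assumed convention):

```python
SIL = "sil"   # assumed silence marker at word edges

def to_diphones(phones):
    """Turn a phone sequence into the diphone labels spanning it."""
    padded = [SIL] + list(phones) + [SIL]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["k", "ae", "t"]))   # "cat"
# ['sil-k', 'k-ae', 'ae-t', 't-sil']
```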
---
### So, we can now train a system
- Capture sounds and annotate them as diphones
- Vectorize them and feed them into a neural network as training data
- We can do speech recognition, text-to-speech, and more!
---
### Using Neural Networks for ASR
- Feed the vectorized sound data in and get the most likely diphone sequence back
- ... or go straight to words, if you feel dangerous!
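The classification step itself is nothing exotic: score each candidate label from the frame's features and take the most likely one. A deliberately toy sketch with random weights and a made-up four-diphone label set (a real system learns `W` from huge amounts of labeled speech):

```python
import numpy as np

rng = np.random.default_rng(0)
DIPHONES = ["sil-k", "k-ae", "ae-t", "t-sil"]   # tiny made-up label set

W = rng.normal(size=(13, len(DIPHONES)))        # one weight column per label
frame = rng.normal(size=13)                     # one 13-coefficient MFCC frame

logits = frame @ W
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the labels
best = DIPHONES[int(np.argmax(probs))]
print(best)
```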
---
### Why is ASR hard?
- ASR requires good dictionaries
- "Bashira yeeted the Mel Frequency Cepstral Coefficients into the RNN"
- ASR requires some context awareness
- "Robb took a wok from the Chinese restaurant"
- Dialect is always a thing
- "English" is a convenient lie
- ... and 99 other problems
---
### Using Neural Networks for TTS
- Feed the diphone sequence in, get back a likely acoustic signal
- This will generate a voice which matches (roughly) the input training voice
- Style transfer is possible too!
- Training the model on a generic voice
- Then learning the variation associated with another as a style embedding
- Then applying the variation to the model
---
### Neural Network Text-to-Speech Style Transfer Examples
---
### Text-to-Speech is hard!
- Text-to-speech requires you to understand how humans talk
- "The NSA and NASA printed 1200 t-shirts for area code 303"
- ... and the prosody is really hard
- "Let's eat, Grandpa"
- ... and 99 other problems
---
### We talk a lot more about the linguistic difficulties with these tasks in LIGN 6 'Language and Computers'
---
### ... and we'll talk a lot more about processing speech outside of Neural Networks in LIGN 168 'Computational Speech Processing'
---
### Wrapping up
- Speech is movement of the articulators in the airstream
- This creates sounds which vary in frequency and amplitude over time
- This signal can be analyzed as a matrix of opaque cepstral features
- ... and then fed into a neural network with linguistic annotations
- To generate and classify human speech
- So...
---
### Deep Learning works for Speech too!
- It's never going to be easy
- It's never going to be cheap
- ... but it'll work
- And it'll get you that much closer to actual human interaction!
---
Thank you!