# Deep Learning and Speech Data

### Will Styler - LIGN 167
---

# Why work with speech at all?

---

### Human language is mostly speech-based

- The vast majority of human languages are spoken
- Many human/computer interactions work better by voice
- We can speak faster and more readily than we can type
- **We want our systems to be able to work with spoken language too!**

---

### What are common speech tasks?

- Automatic Speech Recognition (ASR)
    - Spoken language to orthography
- Speech Synthesis or Text-to-speech (TTS)
    - Orthography to spoken language
- Voice or language recognition
    - "I don't care what they're saying, but who/what is it?"
- Real-time spoken translation
    - "Which NLP problems do you want to have?" "Yes."

---

### These tasks are hard

- ASR and TTS are fraught with thousands of complexities
- The datasets are very expensive to collect, store, and annotate
- Speech is *amazingly* complicated
- ... but we want to be good at them, *so badly*

---

### ... but they're fundamentally similar to many other deep learning tasks

- Taking acoustic data as input and classifying it
    - ASR or Voice/Language Recognition
- Taking written words as input and generating appropriate spoken data
    - Text-to-speech

---

### Deep Learning for speech is a rapidly changing field

- Most speech work was done with HMMs for a long time
- Now deep neural network approaches are taking over
- Apple is [announcing Neural TTS in keynotes](https://www.theverge.com/2019/6/3/18650906/siri-new-voice-ios-13-iphone-homepod-neutral-text-to-speech-technology-natural-wwdc-2019)!
- As a result...

---

### Getting into implementation is very hard

- Companies are *really* guarded about their speech processing algorithms
- The state of the art is changing every day
- **We're going to focus on the basic issues involved with speech classification**
- ... rather than diving deep on an algorithm which will be outdated by the end of the talk

---

### Today's Plan

- What is the nature of speech?
- How do we discuss and display sound?
- What is the nature of the speech signal?
- How can we turn speech into features?
- What is the training data?

---

# What is the nature of speech?

---

### The Speech Process

- Flapping bits of meat inside your head while blowing out air
- This creates vibrations in the air you're expelling
- The ear picks these up and interprets them as speech
- This process is studied in the linguistic subfield of **Phonetics**

---

### The Lungs
---

### Flapping bits of meat ("articulation")
---

### Simplified a bit...
---

### Let's do an experiment

---

> The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.

---

### Speech is absolutely insane

- It's a series of fluid and overlapping gestures
- It's amazingly complex
- ... and it's nothing like we think it is

---

### How do we wrap our heads around it?

- First, we break speech into 'segments' or 'phones'
- Then, we figure out how to describe those phones and their properties
- This lets us *transcribe* what was said, rather than what words were said
- But first you need to realize that...

---

## Your writing system is a trainwreck
---

### Your writing system is lying to you

- Every minute of every day
- "They thoroughly and roughly wrought the boughs in the borough, through and through"
- C doesn't exist
- TH is neither a t nor an h, and represents two different sounds
- We have 15 vowels
- ... and if you start thinking about letters, you're going to start struggling
- Consider your writing system with the same skepticism you would normally reserve for a guy with a broken bottle walking towards you in a dark alley.

---

### For more on this, LIGN 110!

---

### We use different writing systems to capture the sounds being made

- The International Phonetic Alphabet was developed by Linguists
    - ðə ɪntəɹ'næʃɪnəl fə'nɛtɪk 'ælfəbət wʌz də'vɛləpt baj 'lɪŋgwɪsts
- ARPABET uses two-character combinations to encode the sounds of *English*
    - AAA R P AX B EH T / Y UW Z IH Z / T UW / K EH R IH K T ER ...
- **Often, TTS and ASR use these alphabets as a 'go-between'**
- They're used in resources like [CMUDict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict) (a lookup sketch follows shortly)

---

### ... but we can think about speech as a sequence of 'phones'

- Individual speech sounds
- A series of articulatory targets...
- With hard-to-identify boundaries between them
- Adjacent sounds affect each other
- Which is broadcast to the world acoustically

---

# What is Sound?

---
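Looking back a couple of slides: to make the ARPABET/CMUDict idea concrete, here's a minimal lookup sketch. It assumes the `nltk` package with its `cmudict` corpus; the word chosen is just an example.

```python
# Look up ARPABET pronunciations in CMUDict via nltk (assumed installed)
import nltk
nltk.download("cmudict", quiet=True)      # fetch the dictionary once
from nltk.corpus import cmudict

pron = cmudict.dict()                     # maps word -> list of ARPABET pronunciations
print(pron["phonetics"])                  # e.g. [['F', 'AH0', 'N', 'EH1', 'T', 'IH0', 'K', 'S']]
```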
---
---

### Sound is compression and rarefaction in a medium

- Sound needs something to travel in (like air or water)

---

### (Yes, your childhood is a lie)
---

### Thinking of sound as waves of air compression is helpful

- Why does clapping cause a sound, but waving your hand through the air doesn't?
- Why are gunshots loud?

---

### We're good at hearing sound

- ... but we need to visualize it

---

### Visualizing Sound

- Waveforms
- Spectrograms

---

## Waveform

A horizontal cut through the wave showing the peaks and troughs over time

- The height/strength of the wave is called its "amplitude"
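---

As a concrete aside: plotting a waveform takes only a few lines. A minimal sketch, assuming a hypothetical mono recording `room.wav` and the `scipy`/`matplotlib` packages:

```python
# Read a mono WAV file and plot amplitude over time
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

sr, samples = wavfile.read("room.wav")     # sampling rate, amplitude values
time = np.arange(len(samples)) / sr        # time (in seconds) for each sample
plt.plot(time, samples)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()
```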
---
---

### Let's look at the sounds in this room right now

---

### Waveforms are well and good

- ... and you can tell a lot from a waveform

---
---
---

### ... but we'll need better information to process speech

---

## Frequency

The speed with which a wave oscillates

- Measured in Hertz (Hz), cycles per second
---

### 100 Hz - Waveform
---

### 200 Hz - Waveform
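---

A minimal sketch of the two waveforms above, generating pure tones at 100 Hz and 200 Hz with `numpy` and `matplotlib` (assumed available):

```python
# Plot 30 ms of a 100 Hz and a 200 Hz sine wave
import numpy as np
import matplotlib.pyplot as plt

sr = 44100                              # samples per second
t = np.arange(0, 0.03, 1 / sr)          # 30 ms of time points
for freq in (100, 200):
    plt.plot(t, np.sin(2 * np.pi * freq * t), label=f"{freq} Hz")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.legend()
plt.show()
```

The 200 Hz wave completes twice as many cycles in the same window, which is exactly what "higher frequency" means.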
---

### Voice Pitch

- Changing the "fundamental frequency" of your voice changes the perceived "pitch" of your voice
- Higher frequency of vocal fold vibration == "higher pitch"
- *Intonation is all about this frequency!*
- (A small pitch-tracking sketch follows a few slides below)

---

### Frequency is important

- Different phenomena produce sounds at different frequencies
- Most things produce sounds with a mix of different frequencies, each at different amplitudes
- Speech has *many* components at different frequencies
- Each of those frequencies has a different power

---

### How do we visualize this?

- Spectra only show one 'moment' of the signal

---

### "Noise" - Waveform
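---

The pitch-tracking sketch promised on the Voice Pitch slide: one way to estimate fundamental frequency, assuming the `librosa` package and a hypothetical recording `speech.wav`.

```python
# Estimate F0 (fundamental frequency) frame by frame with the YIN algorithm
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
f0 = librosa.yin(y, fmin=75, fmax=400, sr=sr)   # one F0 estimate (in Hz) per frame
print(f0.round(1))                              # higher values == higher perceived pitch
```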
---

## Spectrogram

Displays signal strength by frequency, over time

---

### "Noise" - Waveform
--- ### "Noise" - Spectrogram
---

### Let's have a bit of spectrogram fun

---

### You can have fun on your own

- SpectrumView on iOS
- https://musiclab.chromeexperiments.com/Spectrogram/
- Praat (http://praat.org)

---

# Fundamentals of Speech Acoustics

---

### Voicing
---

### Spectrograms show us many evenly-spaced vertical lines

- These are individual glottal pulses
- Higher-pitched voices will have...?
    - More tightly spaced lines!

---

### Resonances in the mouth
---

### Different vowels have different resonances in the mouth

- Resonances vary depending on the tongue's position
- ... as well as the size and shape of the talker's head
- *Different resonances from the same speaker mean different vowels*

---
Different American English vowels, as spoken by a male speaker
---

### Different speakers have different resonances

- This is a fundamental problem in ASR and speech perception
- Enough data can help, but this is a *major* issue

---

### Other speech sounds have their own acoustics

---

### /l r w j/ act a lot like vowels
---

### Nasal sounds look like quiet vowels
---

### Fricative consonants have little black clouds

- ... and the cloud is higher frequency as you get closer to the mouth
---

### For stop consonants, the signal... stops
---

## Patterns of frequency and amplitude changes are indicative of sounds and words

---

### Cats
---

### Owls

---

### Chickadees

---

### Koalas

---

### Sparrows
---

### There is often no one-to-one mapping

- The expression of a given gesture can have many acoustic consequences
- Different speakers have different realizations of each sound
- Different phones sound different in different contexts
- ... but it all has to happen from this signal

---

### Representing sound as frequency, power, and time is the basis of speech technology

- If we don't know what words sound like, we can't teach computers what they sound like
- Similar patterns are easy to confuse for humans and computers
- This lets us understand a bit more about how speech technology might work

---

... but first, we need to ask an important question

---

# How do we turn speech sounds into features?

---

### We've got a fundamental problem, to start

---

### Computers don't do waves
010001110010101000100101101010101010

---

### Sound is analog, computers are digital

- How do we deal with that?

---

### Quantization ('Sampling')
---

### Quantization ('Sampling')

---

### Quantization ('Sampling')
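---

A toy sketch of what sampling and quantization actually do: measure the amplitude of a "continuous" wave many times per second and store each measurement as a 16-bit integer.

```python
# Sample a 440 Hz tone at 16,000 samples/second and quantize to 16-bit integers
import numpy as np

sr = 16000                                   # sampling rate (samples per second)
t = np.arange(0, 0.01, 1 / sr)               # 10 ms of sample times
wave = np.sin(2 * np.pi * 440 * t)           # the 'analog' signal
samples = np.round(wave * 32767).astype(np.int16)   # 16-bit amplitude steps
print(samples[:8])                           # the wave is now just a list of integers
```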
---

### Analog-to-digital conversion

- Sample the wave many times per second
- Record the amplitude at each sample
- The resulting series of measurements will faithfully capture the signal

---

### Relevant Parameters

- The **Bit Depth** describes how many bits of information encode amplitude
    - 16-bit audio is the norm
- The **Sampling Rate** describes how many samples per second we take
    - 44,100 Hz is the norm, and captures everything you need for speech

---

### A-D conversion now yields a signal that the computer can read

- ... but what are the features?

---

### Putting in the waveform itself is a possibility

- It's cheap and easy
- [wav2vec 2.0](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) is showing amazing results doing just this!
- It uses transformers to work directly on the waveform and identify 'latent speech units'
- This is a very, very tantalizing possibility

---

### ... but we might want more information!

- Important parts of the signal live only in frequency band info
- Many approaches try to give the network all the information we can
- Not the same features that linguists usually use

---

### We don't need transparent or parsimonious features

- Things like vowel features and pitch and other details are a pain to extract
- We're plugging it into a black box
- We're happy to plug in hundreds of features, if need be
- We'd just as soon turn that sound into a boring matrix

---

### Let's get that algorithm a Matrix

- Algorithms love Matrices

---

## Mel-Frequency Cepstral Coefficients (MFCCs)

---

### We're not going deep here

- This is a lot of signal processing
- We're going to teach the idea, not the practice

---

### MFCCs
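---

Before moving on to MFCCs, a quick sketch of what A-D conversion hands the computer, tying back to the 'Relevant Parameters' slide. It assumes a hypothetical 16-bit file `clip.wav` and the `scipy` package.

```python
# Inspect the sampling rate and bit depth of a digitized recording
from scipy.io import wavfile

sr, samples = wavfile.read("clip.wav")
print(sr)              # e.g. 44100 samples per second
print(samples.dtype)   # int16 for 16-bit audio
print(samples[:10])    # just a vector of integers: something a computer can read
```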
---

### MFCC Process

- 1: Create a spectrogram (effectively)
- 2: Extract the most useful bands for speech (in Mels)
- 3: Look at the frequencies of this banded signal (repeating the Fourier Transform process)
- 4: Simplify this into a smaller number of coefficients using the Discrete Cosine Transform (DCT)
    - Usually 12 or 13

---

### MFCC Input
---

### MFCC Output
---

### So, the sound becomes a matrix of features

- Many columns (representing time during the signal)
- N rows (usually 13) with coefficients which tell us the spectral shape
- It's black-boxy, but we don't care.
- We've created a Matrix

---
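One common implementation of this pipeline is `librosa.feature.mfcc`; here's a minimal sketch under that assumption, with a hypothetical recording `speech.wav`.

```python
# Turn a recording into the MFCC matrix described above
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
# Roughly: spectrogram -> mel filterbank (one common mel formula: 2595 * log10(1 + f/700)) -> log -> DCT
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)     # (13, n_frames): 13 coefficients for every time step
```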
---

### Now we've got a matrix representing the sound

- MFCCs capture frequency information, according to our perceptual needs
- wav2vec (and equivalents) go straight to vectors

---

### It's Neural Network time!
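---

A hedged sketch of the 'straight from the waveform' route: transcribing speech with a pretrained wav2vec 2.0 model, assuming the `transformers` and `torch` packages, the `facebook/wav2vec2-base-960h` checkpoint, and a hypothetical 16 kHz recording `speech.wav`.

```python
# Waveform in, rough transcript out, via a pretrained wav2vec 2.0 model with a CTC head
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

y, sr = librosa.load("speech.wav", sr=16000)               # model expects 16 kHz audio
inputs = processor(y, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits             # per-frame character scores
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids))                         # greedy CTC decode
```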
---

# What is the learning task like for ASR and TTS?

---

First, one major question...

### What units of speech are we working with?

---

### What's the desired data labeling?

- We need to give the NN labeled data
    - [Chunk of Sound] == [Labeled Linguistic Info]
    - (x Many many many many tokens)
- What level do we want to recognize and generate at?

---

### Possible levels of labeling

- Sentences?
- Words?
- Phones?
- Diphones?

---

### Sentences

- Why are sentences a bad idea?

---

### Words
"Noise" --- ### What are the pros and cons of words? --- ### Phones
---

### Diphones
---

### What are the pros and cons of phones and diphones?

---

### In practice, many systems use diphones

- [CMU's Sphinx does](https://cmusphinx.github.io/)
- As do many others
- Triphones are often a possibility
- Some go straight to entire words
- Speech recognition systems are often kept secret

---

### So, we can now train a system

- Capture sounds and annotate them as diphones
- Vectorize them and feed them into a neural network as training data
- We can do speech recognition, text-to-speech, and more!
- (A toy training sketch follows a few slides below)

---

### Using Neural Networks for ASR

- Feed the vectorized sound data in and get the most likely diphone sequence back
- ... or go straight to words, if you feel dangerous!

---

### Why is ASR hard?

- ASR requires good dictionaries
    - "Bashira yeeted the Mel Frequency Cepstral Coefficients into the RNN"
- ASR requires some context awareness
    - "Robb took a wok from the Chinese restaurant"
- Dialect is always a thing
    - "English" is a convenient lie
- ... and 99 other problems

---

### Using Neural Networks for TTS

- Feed the diphone sequence in, get back a likely acoustic signal
- This will generate a voice which matches (roughly) the input training voice
- Style transfer is possible too!
    - Training the model on a generic voice
    - Then learning the variation associated with another as a style embedding
    - Then applying the variation to the model

---

### Neural Network Text-to-Speech Style Transfer Examples
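---

The toy training sketch promised above: MFCC-style frames in, diphone labels out. The feature and label sizes are assumptions, and the data here is random stand-in data; real training needs annotated speech.

```python
# A tiny PyTorch classifier mapping feature frames to diphone labels
import torch
import torch.nn as nn

N_FEATS, N_DIPHONES = 13, 500                    # assumed feature and label-set sizes
features = torch.randn(1000, N_FEATS)            # stand-in for vectorized sound frames
labels = torch.randint(0, N_DIPHONES, (1000,))   # stand-in diphone annotations

model = nn.Sequential(
    nn.Linear(N_FEATS, 128), nn.ReLU(),
    nn.Linear(128, N_DIPHONES),                  # one score per diphone
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)      # how wrong are the diphone guesses?
    loss.backward()
    optimizer.step()
    print(epoch, loss.item())
```

A real system would also need to handle sequences (for example with an RNN or transformer) rather than classifying each frame independently.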
---

### Text-to-Speech is hard!

- Text-to-speech requires you to understand how humans talk
    - "The NSA and NASA printed 1200 t-shirts for area code 303"
- ... and the prosody is really hard
    - "Let's eat, Grandpa"
- ... and 99 other problems

---

### We talk a lot more about the linguistic difficulties with these tasks in LIGN 6 'Language and Computers'
---

### ... and we'll talk a lot more about processing speech outside of Neural Networks in LIGN 168 'Computational Speech Processing'
---

### Wrapping up

- Speech is movement of the articulators in the airstream
- This creates sounds which vary in frequency and amplitude over time
- This signal can be analyzed as a matrix of opaque cepstral features
- ... and then fed into a neural network with linguistic annotations
- To generate and classify human speech
- So...

---

### Deep Learning works for Speech too!

- It's never going to be easy
- It's never going to be cheap
- ... but it'll work
- And it'll get you that much closer to actual human interaction!

---
---
Thank you!