http://savethevowels.org/talks/deep_learning_speech.html
The vast majority of human languages are spoken
Many human/computer interactions work better by voice
We can speak faster and more readily than we can type
We want our systems to be able to work with spoken language too!
Automatic Speech Recognition (ASR)
Speech Synthesis or Text-to-speech (TTS)
Voice or language recognition
Real-time spoken translation
ASR and TTS are fraught with complexity at every level of the task
The datasets are very expensive to collect, store, and annotate
Speech is amazingly complicated
… but we want to be good at them, so badly
Taking acoustic data as input and classifying it
Taking written words as input and generating appropriate spoken data
Most speech work was done with HMMs for a long time
Now deep neural network approaches are taking over
Apple is announcing Neural TTS in keynotes!
As a result…
Companies are really guarded about their speech processing algos
The state-of-the-art is changing every day
We’re going to focus on the basic issues involved with speech classification
What is the nature of speech?
How do we discuss and display sound?
What is the nature of the speech signal?
How can we turn speech into features?
What is the training data?
Flapping bits of meat inside your head while blowing out air
This creates vibrations in the air you’re expelling
The ear picks these up, and interprets them as speech.
This process is studied in the Linguistic subfield of Phonetics
The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
It’s a series of fluid and overlapping gestures
It’s amazingly complex
… and it’s nothing like we think it is
First, we break speech into ‘segments’ or ‘phones’
Then, we figure out how to describe those phones and their properties
This lets us transcribe what was said, rather than what words were said
But first you need to realize that…
Every minute of every day
“They thoroughly and roughly wrought the boughs in the borough, through and through”
C doesn’t exist (there is no ‘c’ sound, just a ‘k’ or an ‘s’)
TH is neither a t nor an h, and represents two different sounds
We have around 15 vowel sounds, but only five or six vowel letters
… and if you start thinking about letters, you’re going to start struggling
Consider your writing system with the same skepticism you would normally reserve for a guy with a broken bottle walking towards you in a dark alley.
The International Phonetic Alphabet was developed by Linguists
ARPABET uses two character combinations to encode the sounds of English
Often, TTS and ASR use these alphabets as a ‘go-between’
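To make the ‘go-between’ concrete, here’s a minimal sketch (assuming the nltk package and its ‘cmudict’ data, which are not part of the talk) of looking up a word’s ARPABET transcription in the CMU Pronouncing Dictionary:

```python
# A minimal sketch (assumes the nltk package and its 'cmudict' data):
# use ARPABET as the go-between, mapping written words to their sounds.
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict", quiet=True)
cmu = cmudict.dict()

print(cmu["through"])   # [['TH', 'R', 'UW1']]: three sounds, seven letters
print(cmu["thorough"])  # a different vowel pattern from the same 'ough'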
Individual speech sounds
With hard-to-identify boundaries between them
Which is broadcast to the world acoustically
Why does clapping cause a sound, but waving your hand through the air doesn’t?
Why are gunshots loud?
Waveforms
Spectrograms
A plot of the wave’s peaks and troughs (amplitude) over time
The speed with which a wave oscillates
Changing the “fundamental frequency” of your voice changes the perceived “pitch” of your voice
Higher frequency of vocal fold vibration == “higher pitch”
Intonation is all about this frequency!
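As a quick illustration of frequency vs. pitch, here’s a minimal sketch (using numpy and scipy; the 110 Hz and 220 Hz tones are arbitrary examples, not from the talk) that writes out a ‘lower pitched’ and a ‘higher pitched’ wave:

```python
# A minimal sketch: two pure tones, differing only in fundamental frequency.
# The 110 Hz / 220 Hz values are arbitrary examples.
import numpy as np
from scipy.io import wavfile

sample_rate = 16000                        # samples per second
t = np.arange(0, 0.5, 1.0 / sample_rate)   # half a second of time points

low = 0.5 * np.sin(2 * np.pi * 110 * t)    # 110 Hz vibration: "lower pitch"
high = 0.5 * np.sin(2 * np.pi * 220 * t)   # 220 Hz: an octave "higher pitch"

# Write them out so you can listen to the difference.
wavfile.write("low.wav", sample_rate, low.astype(np.float32))
wavfile.write("high.wav", sample_rate, high.astype(np.float32))
```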
Different phenomena produce sounds at different frequencies
Most things produce sounds with a mix of different frequencies, each at different amplitudes
Speech has many components at different frequencies
Each of those frequencies has a different power
Displays signal strength by frequency, over time
SpectrumView on iOS
https://musiclab.chromeexperiments.com/Spectrogram/
Praat (http://praat.org)
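For a programmatic version of those tools, here’s a minimal sketch (using scipy and matplotlib; “speech.wav” is a placeholder filename) that plots signal strength by frequency, over time:

```python
# A minimal sketch of a spectrogram: signal strength by frequency, over time.
# "speech.wav" is a placeholder; assumes a mono WAV file.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

sample_rate, samples = wavfile.read("speech.wav")
freqs, times, power = spectrogram(samples, fs=sample_rate)

plt.pcolormesh(times, freqs, 10 * np.log10(power + 1e-10))  # convert to dB
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```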
These are individual glottal pulses
Higher pitched voices will have…?
Resonances vary depending on the tongue’s position
Different resonances from the same speaker mean different vowels
Different American English vowels, as spoken by a male speaker
This is a fundamental problem in ASR and speech perception
Enough data can help, but this is a major issue
The expression of a given gesture can have many acoustic consequences
Different speakers have different realizations of each sound
Different phones sound different in different contexts
… but it all has to happen from this signal
If we don’t know what words sound like, we can’t teach computers what they sound like
Similar patterns are easy to confuse for humans and computers
This lets us understand a bit more about how speech technology might work
… but first, we need to ask an important question
010001110010101000100101101010101010
Sample the wave many times per second
Record the amplitude at each sample
The resulting series of measurements faithfully captures the signal (up to half the sampling rate)
The Bit Depth describes how many bits encode the amplitude of each sample
The Sampling Rate describes how many samples per second we take
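Here’s a minimal sketch (standard library only; “speech.wav” is a placeholder filename) of reading those two numbers back out of a digitized file:

```python
# A minimal sketch of the two numbers that define digitization.
import wave

with wave.open("speech.wav", "rb") as f:
    print("Sampling rate:", f.getframerate(), "samples per second")
    print("Bit depth:", f.getsampwidth() * 8, "bits per sample")
    print("Total samples:", f.getnframes())
```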
It’s cheap and easy
Wav2Vec is showing amazing results doing just this!
This is a very, very tantalizing possibility
Important parts of the signal live only in frequency band info
Many approaches try to give it all the information we can
Not the same features that linguists usually use
Things like vowel features and pitch and other details are a pain to extract
We’re plugging it into a black box
We’re happy to plug in hundreds of features, if need be
We’d just as soon turn that sound into a boring matrix
This is a lot of signal processing
We’re going to teach the idea, not the practice
1: Create a spectrogram (effectively)
2: Extract the most useful bands for speech (in Mels)
3: Look at the frequencies of this banded signal (repeating the Fourier Transform process)
4: Simplify this into a smaller number of coefficients using Discrete Cosine Transform (DCT)
Many columns (representing time during the signal)
N rows (usually 13) with coefficients which tell us the spectral shape
It’s black-boxy, but we don’t care.
We’ve created a Matrix
MFCCs capture frequency information, according to our perceptual needs
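One common way to get that matrix in practice is librosa’s MFCC function; this is a sketch with a placeholder filename and the usual 13 coefficients, not a claim about any particular system:

```python
# A minimal sketch using librosa (one common toolkit, not necessarily what
# any given system uses); "speech.wav" is a placeholder filename.
import librosa

signal, sample_rate = librosa.load("speech.wav", sr=16000)

# librosa does the windowing, mel filterbank, log, and DCT internally.
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)

print(mfccs.shape)  # (13, n_frames): 13 coefficients for every time slice
```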
Wav2Vec (and equivalents) go straight to vectors
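For comparison, a minimal sketch (assuming the Hugging Face transformers and torch packages, and one public pretrained checkpoint) of getting learned vectors straight from the raw waveform:

```python
# A minimal sketch (assumes the 'transformers' and 'torch' packages);
# "facebook/wav2vec2-base" is one publicly available pretrained checkpoint.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# `signal` is a 1-D array of raw samples at 16 kHz (e.g. from librosa.load above).
inputs = extractor(signal, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    vectors = model(**inputs).last_hidden_state  # (1, n_frames, 768)
```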
First, one major question…
We need to give the NN labeled data
[Chunk of Sound] == [Labeled Linguistic Info]
What level do we want to recognize and generate at?
Sentences?
Words?
Phones?
Diphones?
“Noise”
As do many others
Triphones are often a possibility
Some go straight to entire words
Speech recognition systems are often kept secret
Capture sounds and annotate them as diphones
Vectorize them and feed them into a neural network as training data
We can do speech recognition, text-to-speech, and more!
Feed the vectorized sound data in and get the most likely diphone sequence back
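To make that concrete, here’s a toy sketch (PyTorch, with random placeholder data and an arbitrary label count) that classifies each feature frame into a phone/diphone label; real systems add sequence modeling on top:

```python
# A toy sketch, not any production system: classify each feature frame into
# one of N labels. The data is random placeholder data and the 200-label
# count is arbitrary; real ASR adds sequence modeling (e.g. CTC).
import torch
import torch.nn as nn

N_FEATURES, N_LABELS = 13, 200   # e.g. 13 MFCCs in, 200 diphone labels out

model = nn.Sequential(
    nn.Linear(N_FEATURES, 128), nn.ReLU(),
    nn.Linear(128, N_LABELS),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder training data: one feature vector and one label per frame.
features = torch.randn(1000, N_FEATURES)
labels = torch.randint(0, N_LABELS, (1000,))

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()

# "Recognition": pick the most likely label for each incoming frame.
predicted = model(features).argmax(dim=1)
```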
ASR requires good dictionaries
ASR requires some context awareness
Dialect is always a thing
… and 99 other problems
Feed the diphone sequence in, get back a likely acoustic signal
Style transfer is possible too!
Training the model on a generic voice
Then learning the variation associated with another as a style embedding
Then applying the variation to the model
Text-to-speech requires you to understand how humans talk
… and the prosody is really hard
… and 99 other problems
Speech is movement of the articulators in the airstream
This creates sounds which vary in frequency and amplitude over time
This signal can be analyzed as a matrix of opaque cepstral features
… and then fed into a neural network with linguistic annotations
To generate and classify human speech
So…
It’s never going to be easy
It’s never going to be cheap
… but it’ll work
And it’ll get you that much closer to actual human interaction!