---
# Why work with speech at all?
---
### Human language is mostly speech-based
- The vast majority of human languages are spoken
- Many human/computer interactions work better by voice
- We can speak faster and more readily than we can type
- **We want our systems to be able to work with spoken language too!**
---
### What are common speech tasks?
- Automatic Speech Recognition (ASR)
- Spoken language to orthography
- Speech Synthesis or Text-to-speech (TTS)
- Orthography to spoken language
- Voice or language recognition
- "I don't care what they're saying, but who/what is it?"
- Real-time spoken translation
- "Which NLP problems do you want to have?" "Yes."
---
### These tasks are hard
- ASR and TTS are fraught with complexity at every step
- The datasets are very expensive to get and store and annotate
- Speech is *amazingly* complicated
- ... but we want to be good at them, *so badly*
---
### ... but they're fundamentally similar to many other deep learning tasks
- Taking acoustic data as input and classifying it
- ASR or Voice/Language Recognition
- Taking written words as input and generating appropriate spoken data
- Text-to-speech
---
### Deep Learning for speech is a rapidly changing field
- Most speech work was done with HMMs for a long time
- Now deep neural network approaches are taking over
- Apple is [announcing Neural TTS in keynotes](https://www.theverge.com/2019/6/3/18650906/siri-new-voice-ios-13-iphone-homepod-neutral-text-to-speech-technology-natural-wwdc-2019)!
- As a result...
---
### Getting into implementation is very hard
- Companies are *really* guarded about their speech processing algos
- The state-of-the-art is changing every day
- **We're going to focus on the basic issues involved with speech classification**
- ... rather than diving deep on an algorithm which will be outdated by the end of the talk
---
### Today's Plan
- What is the nature of speech?
- How do we discuss and display sound?
- What is the nature of the speech signal?
- How can we turn speech into features?
- What is the training data?
---
# What is the nature of speech?
---
### The Speech Process
* Flapping bits of meat inside your head while blowing out air
* This creates vibrations in the air you're expelling
* The ear picks these up and interprets them as speech.
* This process is studied in the Linguistic subfield of **Phonetics**
---
### The Lungs
---
### Flapping bits of meat ("articulation")
---
### Simplified a bit...
---
### Let's do an experiment
---
> The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
---
### Speech is absolutely insane
- It's a series of fluid and overlapping gestures
- It's amazingly complex
- ... and it's nothing like we think it is
---
### How do we wrap our heads around it?
- First, we break speech into 'segments' or 'phones'
- Then, we figure out how to describe those phones and their properties
- This lets us *transcribe* what was said, rather than what words were said
- But first you need to realize that...
---
## Your writing system is a trainwreck
---
### Your writing system is lying to you
- Every minute of every day
- "They thoroughly and roughly wrought the boughs in the borough, through and through"
- C doesn't exist
- TH is neither a t nor an h, and represents two different sounds
- We have 15 vowels
- ... and if you start thinking about letters, you're going to start struggling
- Consider your writing system with the same skepticism you would normally reserve for a guy with a broken bottle walking towards you in a dark alley.
---
### For more on this, LIGN 110!
---
### We use different writing systems to capture the sounds being made
- The International Phonetic Alphabet was developed by Linguists
- ðə ɪntəɹ'næʃɪnəl fə'nɛtɪk 'ælfəbət wʌz də'vɛləpt baj 'lɪŋgwɪsts
- ARPABET uses two character combinations to encode the sounds of *English*
- AA R P AX B EH T / Y UW Z IH Z / T UW / K EH R IH K T ER ...
- **Often, TTS and ASR use these alphabets as a 'go-between'**
- They're used in resources like [CMUDict](http://www.speech.cs.cmu.edu/cgi-bin/cmudict)
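CMUDict entries are plain text: a word, then its ARPABET phones. A minimal sketch of parsing that format (the two entries below are hardcoded samples, not read from the real file):

```python
# Sketch of looking up ARPABET pronunciations from CMUDict-style lines.
# Format: "WORD  PH OH N Z" (digits on vowels mark stress).
sample_lines = [
    "CAT  K AE1 T",
    "SPEECH  S P IY1 CH",
]

def parse_cmudict(lines):
    entries = {}
    for line in lines:
        word, *phones = line.split()
        entries[word] = phones
    return entries

pron = parse_cmudict(sample_lines)
print(pron["SPEECH"])   # ['S', 'P', 'IY1', 'CH']
```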
---
### ... but we can think about speech as a sequence of 'phones'
- Individual speech sounds
- A series of articulatory targets...
- With hard-to-identify boundaries between them
- Adjacent sounds affect each other
- Which are broadcast to the world acoustically
---
# What is Sound?
---
### Sound is compression and rarefaction in a medium
- Sound needs something to travel in (like air or water)
---
### (Yes, your childhood is a lie)
---
### Thinking of sound as waves of air compression is helpful
- Why does clapping cause a sound, but waving your hand through the air doesn’t?
- Why are gunshots loud?
---
### We're good at hearing sound
- ... but we need to visualize it
---
### Visualizing Sound
- Waveforms
- Spectrograms
---
## Waveform
A horizontal cut through the wave showing the peaks and troughs over time
- The height/strength of the wave is called its "amplitude"
---

### Let's look at the sounds in this room right now
---
### Waveforms are well and good
- ... and you can tell a lot from a waveform
---
### ... but we'll need better information to process speech
---
## Frequency
The speed with which a wave oscillates
- Measured in Hertz (Hz), Cycles per second
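A quick sketch of what "cycles per second" means in samples (numpy assumed; the sample rate and frequency are illustrative choices):

```python
import numpy as np

# Synthesize one second of a pure tone to see frequency in action.
sample_rate = 16000          # samples per second
freq_hz = 100                # cycles per second (Hz)
t = np.arange(sample_rate) / sample_rate    # 1 second of time points
wave = np.sin(2 * np.pi * freq_hz * t)      # 100 Hz sine wave

# A 100 Hz tone completes 100 full cycles per second,
# so each cycle spans sample_rate / freq_hz samples.
samples_per_cycle = sample_rate / freq_hz
print(samples_per_cycle)   # 160.0
```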
---
### 100 Hz - Waveform
---
### 200Hz - Waveform
---
### Voice Pitch
- Changing the "fundamental frequency" of your voice changes the perceived "pitch" of your voice
- Higher frequency of vocal fold vibration == "higher pitch"
- *Intonation is all about this frequency!*
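One classic way to estimate that fundamental frequency is autocorrelation: find the lag at which the signal best matches a shifted copy of itself. A toy sketch (the 80-300 Hz search range is an assumed speaking range, and the test signal is a pure tone, not a real voice):

```python
import numpy as np

# Toy f0 estimate via autocorrelation: the period of vocal-fold
# vibration shows up as the best-matching lag.
def estimate_f0(signal, sr, fmin=80, fmax=300):
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range to search
    lag = lo + int(np.argmax(corr[lo:hi]))    # best-matching period
    return sr / lag

sr = 16000
t = np.arange(sr // 4) / sr                   # quarter second of signal
tone = np.sin(2 * np.pi * 160 * t)            # stand-in "voice" at 160 Hz
print(round(estimate_f0(tone, sr)))   # 160
```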
---
### Frequency is important
- Different phenomena produce sounds at different frequencies
- Most things produce sounds with a mix of different frequencies, each at different amplitudes
- Speech has *many* components at different frequencies
- Each of those frequencies has a different power
---
### How do we visualize this?
- Spectra only show one 'moment' of the signal
---
### "Noise" - Waveform

---
## Spectrogram
Displays signal strength by frequency, over time
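Under the hood, a spectrogram is just many short Fourier transforms laid side by side. A minimal numpy sketch (frame and hop sizes are assumed, typical values):

```python
import numpy as np

# Minimal spectrogram: slice the signal into short frames, window each
# one, and take the magnitude of its FFT.
# Rows = frequency bins, columns = time frames.
def spectrogram(signal, frame_len=400, hop=160):
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only non-negative frequencies (frame_len//2 + 1 bins)
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))   # 1 s of a 440 Hz tone
print(spec.shape)   # (201, 98): frequency bins x time frames
```

With a 400-sample frame at 16 kHz, each bin is 40 Hz wide, so the 440 Hz tone lights up bin 11.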
---
### "Noise" - Waveform

---
### "Noise" - Spectrogram

---
### Let's have a bit of spectrogram fun
---
### You can have fun on your own
- SpectrumView on iOS
- https://musiclab.chromeexperiments.com/Spectrogram/
- Praat (http://praat.org)
---
# Fundamentals of Speech Acoustics
---
### Voicing

---
### Spectrograms show us many evenly-spaced vertical lines
- These are individual glottal pulses
- Higher pitched voices will have...?
- More tightly spaced lines!
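A back-of-the-envelope check: one glottal pulse per vocal-fold cycle means the pulses are 1/f0 apart (the pitch values below are illustrative, not measurements):

```python
# Spacing between glottal pulses is the period of vocal-fold vibration.
def pulse_spacing_ms(f0_hz):
    """Time between glottal pulses, in milliseconds."""
    return 1000.0 / f0_hz

low_voice = pulse_spacing_ms(100)    # ~10 ms between pulses
high_voice = pulse_spacing_ms(200)   # twice the pitch, half the gap
print(low_voice, high_voice)   # 10.0 5.0
```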
---
### Resonances in the mouth

---
### Different vowels have different resonances in the mouth
- Resonances vary depending on the tongue's position
- ... as well as the size and shape of the talker's head
- *Different resonances from the same speaker mean different vowels*
---
Different American English vowels, as spoken by a male speaker
---
### Different speakers have different resonances
- This is a fundamental problem in ASR and speech perception
- Enough data can help, but this is a *major* issue
---
### Other speech sounds have their own acoustics
---
### /l r w j/ act a lot like vowels
---
### Nasal sounds look like quiet vowels
---
### Fricative consonants have little black clouds
- ... and the cloud is higher frequency as you get closer to the mouth
---
### For stop consonants, the signal... stops
---
## Patterns of frequency and amplitude changes are indicative of sounds and words
---
### Cats
---
### Owls
---
### Chickadees
---
### Koalas
---
### Sparrows
---
### There is often no one-to-one mapping
- The expression of a given gesture can have many acoustic consequences
- Different speakers have different realizations of each sound
- Different phones sound different in different contexts
- ... but it all has to happen from this signal
---
### Representing sound as frequency, power and time is the basis of speech technology
- If we don't know what words sound like, we can't teach computers what they sound like
- Similar patterns are easy to confuse for humans and computers
- This lets us understand a bit more about how speech technology might work
---
... but first, we need to ask an important question
---
# How do we turn speech sounds into features?
---
### We've got a fundamental problem, to start
---
### Computers don't do waves
010001110010101000100101101010101010
---
### Sound is analog, computers are digital
- How do we deal with that?
---
### Quantization ('Sampling')
---
### Analog-to-digital conversion
- Sample the wave many times per second
- Record the amplitude at each sample
- The resulting series of measurements faithfully captures the signal (for frequencies up to half the sampling rate, the Nyquist limit)
---
### Relevant Parameters
- The **Bit Depth** describes how many bits of information encode amplitude
- 16 bit audio is the norm
- The **Sampling Rate** describes how many samples per second we take
- 44,100 Hz is the norm, and captures everything you need for speech
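The two parameters in action, as a sketch (numpy assumed; the tone and its amplitude are illustrative): 16 bits gives 65,536 amplitude levels, which we map onto the float range [-1.0, 1.0].

```python
import numpy as np

# Quantize a float signal in [-1, 1] to 16-bit integer samples.
def quantize_16bit(x):
    return np.clip(np.round(x * 32767), -32768, 32767).astype(np.int16)

sr = 44100                          # the standard sampling rate
t = np.arange(sr) / sr              # one second of audio
pcm = quantize_16bit(0.5 * np.sin(2 * np.pi * 440 * t))

print(len(pcm))        # 44100 samples for one second
print(pcm.dtype)       # int16: 2 bytes per sample
```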
---
### AD Conversion now yields a signal that the computer can read
- ... but what are the features?
---
### Putting in the waveform itself is a possibility
- It's cheap and easy
- [wav2vec 2.0](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) is showing amazing results doing just this!
- This uses transformers to work directly on the waveform and identify 'latent speech units'
- This is a very, very tantalizing possibility
---
### ... but we might want more information!
- Important parts of the signal live only in frequency band info
- Many approaches try to give it all the information we can
- Not the same features that linguists usually use
---
### We don't need transparent or parsimonious features
- Things like vowel features and pitch and other details are a pain to extract
- We're plugging it into a black box
- We're happy to plug in hundreds of features, if need be
- We'd just as soon turn that sound into a boring matrix
---
### Let's get that algorithm a Matrix
- Algorithms love Matrices
---
## Mel-Frequency Cepstral Coefficients (MFCCs)
---
### We're not going deep here
- This is a lot of signal processing
- We're going to teach the idea, not the practice
---
### MFCCs
---
### MFCC Process
- 1: Create a spectrogram (effectively)
- 2: Extract the most useful bands for speech (in Mels)
- 3: Look at the frequencies of this banded signal (repeating the Fourier Transform process)
- 4: Simplify this into a smaller number of coefficients using Discrete Cosine Transform (DCT)
- Usually 12 or 13
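The four steps above, sketched in plain numpy using the standard log-filterbank-plus-DCT recipe (frame sizes, 26 mel bands, and 13 coefficients are typical choices, not prescribed here):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    # Step 1: power spectrogram (windowed frames + FFT)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Step 2: triangular mel filterbank pooling the FFT bins
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)

    # Step 3: log energy in each mel band
    mel_energy = np.log(power @ fbank.T + 1e-10)

    # Step 4: DCT-II to compress into a few coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_mels))
    return mel_energy @ dct.T          # shape: (frames, n_ceps)

sr = 16000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 300 * t))
print(feats.shape)   # (98, 13): one 13-coefficient row per 10 ms hop
```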
---
### MFCC Input
---
### MFCC Output
---
### So, the sound becomes a matrix of features
- Many columns (representing time during the signal)
- N rows (usually 13) of coefficients which tell us the spectral shape
- It's black-boxy, but we don't care.
- We've created a Matrix
---
### Now we've got a matrix representing the sound
- MFCCs capture frequency information, weighted to our perceptual needs
- wav2vec 2.0 (and equivalents) go straight to vectors
---
### It's Neural Network time!
---
# What is the learning task like for ASR and TTS?
---
First, one major question...
### What units of speech are we working with?
---
### What's the desired data labeling?
- We need to give the NN labeled data
- [Chunk of Sound] == [Labeled Linguistic Info]
- (x Many many many many tokens)
- What level do we want to recognize and generate at?
---
### Possible levels of labeling
- Sentences?
- Words?
- Phones?
- Diphones?
---
### Sentences
- Why are sentences a bad idea?
---
### Words
"Noise"
---
### What are the pros and cons of words?
---
### Phones
---
### Diphones
---
### What are the pros and cons of phones and diphones?
---
### In practice, many systems use diphones
- [CMU's Sphinx does](https://cmusphinx.github.io/)
- As do many others
- Triphones are often a possibility
- Some go straight to entire words
- Speech recognition systems are often kept secret
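A diphone inventory is just every pair of adjacent phones, one label per transition. A hypothetical helper (the `sil` word-edge marker is an assumed convention):

```python
SIL = "sil"   # assumed silence marker at word edges

def to_diphones(phones):
    """Turn a phone sequence into the diphone labels spanning it."""
    padded = [SIL] + list(phones) + [SIL]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["k", "ae", "t"]))   # "cat"
# ['sil-k', 'k-ae', 'ae-t', 't-sil']
```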
---
### So, we can now train a system
- Capture sounds and annotate them as diphones
- Vectorize them and feed them into a neural network as training data
- We can do speech recognition, text-to-speech, and more!
---
### Using Neural Networks for ASR
- Feed the vectorized sound data in and get the most likely diphone sequence back
- ... or go straight to words, if you feel dangerous!
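The classification step itself is nothing exotic: score each candidate label from the frame's features and take the most likely one. A deliberately toy sketch with random weights and a made-up four-diphone label set (a real system learns `W` from huge amounts of labeled speech):

```python
import numpy as np

rng = np.random.default_rng(0)
DIPHONES = ["sil-k", "k-ae", "ae-t", "t-sil"]   # tiny made-up label set

W = rng.normal(size=(13, len(DIPHONES)))        # one weight column per label
frame = rng.normal(size=13)                     # one 13-coefficient MFCC frame

logits = frame @ W
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the labels
best = DIPHONES[int(np.argmax(probs))]
print(best)
```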
---
### Why is ASR hard?
- ASR requires good dictionaries
- "Bashira yeeted the Mel Frequency Cepstral Coefficients into the RNN"
- ASR requires some context awareness
- "Robb took a wok from the Chinese restaurant"
- Dialect is always a thing
- "English" is a convenient lie
- ... and 99 other problems
---
### Using Neural Networks for TTS
- Feed the diphone sequence in, get back a likely acoustic signal
- This will generate a voice which matches (roughly) the input training voice
- Style transfer is possible too!
- Training the model on a generic voice
- Then learning the variation associated with another as a style embedding
- Then applying the variation to the model
---
### Neural Network Text-to-Speech Style Transfer Examples
---
### Text-to-Speech is hard!
- Text-to-speech requires you to understand how humans talk
- "The NSA and NASA printed 1200 t-shirts for area code 303"
- ... and the prosody is really hard
- "Let's eat, Grandpa"
- ... and 99 other problems
---
### We talk a lot more about the linguistic difficulties with these tasks in LIGN 6 'Language and Computers'
---
### ... and we'll talk a lot more about processing speech outside of Neural Networks in LIGN 168 'Computational Speech Processing'
---
### Wrapping up
- Speech is movement of the articulators in the airstream
- This creates sounds which vary in frequency and amplitude over time
- This signal can be analyzed as a matrix of opaque cepstral features
- ... and then fed into a neural network with linguistic annotations
- To generate and classify human speech
- So...
---
### Deep Learning works for Speech too!
- It's never going to be easy
- It's never going to be cheap
- ... but it'll work
- And it'll get you that much closer to actual human interaction!
---
Thank you!