Deep Learning and Speech Data

Will Styler - LIGN 167

http://savethevowels.org/talks/deep_learning_speech.html


Why work with speech at all?


Human language is mostly speech-based


What are common speech tasks?


These tasks are hard


… but they’re fundamentally similar to many other deep learning tasks


Deep Learning for speech is a rapidly changing field


Getting into implementation is very hard


Today’s Plan


What is the nature of speech?


The Speech Process


The Lungs


Flapping bits of meat (“articulation”)


Simplified a bit…


Let’s do an experiment


The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.


Speech is absolutely insane


How do we wrap our heads around it?


Your writing system is a trainwreck


Your writing system is lying to you


For more on this, LIGN 110!


We use different writing systems to capture the sounds being made


… but we can think about speech as a sequence of ‘phones’


What is Sound?




Sound is compression and rarefaction in a medium


(Yes, your childhood is a lie)


Thinking of sound as waves of air compression is helpful


We’re good at hearing sound


Visualizing Sound


Waveform

A plot of air pressure (amplitude) over time, showing the wave's peaks and troughs




Let’s look at the sounds in this room right now


Waveforms are well and good




… but we’ll need better information to process speech


Frequency

The rate at which a wave oscillates, measured in cycles per second (Hz)


100 Hz - Waveform


200 Hz - Waveform
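To make the 100 Hz versus 200 Hz comparison concrete, here is a minimal sketch that synthesizes both tones and recovers their frequency by counting how often the wave crosses zero going upward. The function names, the 8000 Hz sample rate, and the 0.05 s duration are illustrative assumptions, not values from the slides.

```python
import math

def cosine_wave(freq_hz, duration_s=0.05, sample_rate=8000):
    """Return samples of a cosine wave oscillating freq_hz times per second."""
    n = round(duration_s * sample_rate)
    return [math.cos(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def count_upward_zero_crossings(samples):
    """Count how often the signal passes from negative to non-negative."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)

def estimate_frequency(samples, sample_rate=8000):
    """One upward crossing per cycle, so crossings per second approximates Hz."""
    return count_upward_zero_crossings(samples) * sample_rate / len(samples)

for freq in (100, 200):
    wave = cosine_wave(freq)
    print(freq, "Hz tone, estimated at", estimate_frequency(wave), "Hz")
```

Zero-crossing counting only works for clean, single-frequency signals like these; real speech needs the frequency analysis discussed next.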


Voice Pitch


Frequency is important


How do we visualize this?


“Noise” - Waveform



Spectrogram

Displays signal strength by frequency, over time
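As a rough illustration of "signal strength by frequency, over time", the sketch below cuts a signal into short overlapping frames and measures the magnitude at each frequency in every frame. It uses a deliberately naive discrete Fourier transform from the standard library to stay self-contained; the frame length, hop size, and test signal are assumptions chosen for the example, and real tools use a fast Fourier transform instead.

```python
import cmath
import math

def spectrogram(samples, frame_len=64, hop=32):
    """Return one list of per-frequency magnitudes for each frame."""
    # A Hann window tapers each frame's edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT, keeping only the non-negative frequency bins.
        mags = []
        for k in range(frame_len // 2 + 1):
            total = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            mags.append(abs(total))
        frames.append(mags)
    return frames

sample_rate = 8000
# A 1000 Hz tone followed by a 2000 Hz tone, 0.05 s each.
tone = [math.sin(2 * math.pi * (1000 if i < 400 else 2000) * i / sample_rate)
        for i in range(800)]
spec = spectrogram(tone)
bin_hz = sample_rate / 64  # each frequency bin is 125 Hz wide

# The strongest bin in the first and last frames tracks the change in pitch.
first_peak = max(range(len(spec[0])), key=lambda k: spec[0][k]) * bin_hz
last_peak = max(range(len(spec[-1])), key=lambda k: spec[-1][k]) * bin_hz
print(first_peak, last_peak)
```

Each row of `spec` is one time slice and each column one frequency band, which is exactly the grid a spectrogram image draws with darkness standing in for magnitude.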


“Noise” - Spectrogram



Let’s have a bit of spectrogram fun


You can have fun on your own


Fundamentals of Speech Acoustics


Voicing



Spectrograms show us many evenly-spaced vertical lines


Resonances in the mouth



Different vowels have different resonances in the mouth


Different American English vowels, as spoken by a male speaker


Different speakers have different resonances


Other speech sounds have their own acoustics


/l r w j/ act a lot like vowels


Nasal sounds look like quiet vowels


Fricative consonants have little black clouds


For stop consonants, the signal… stops


Patterns of frequency and amplitude changes are indicative of sounds and words


Cats


Owls


Chickadees


Koalas


Sparrows


There is often no one-to-one mapping


Representing sound as frequency, power and time is the basis of speech technology


… but first, we need to ask an important question


How do we turn speech sounds into features?


We’ve got a fundamental problem, to start


Computers don’t do waves

010001110010101000100101101010101010


Sound is analog, computers are digital


Quantization (‘Sampling’)


Analog-to-digital conversion
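A minimal sketch of analog-to-digital conversion, assuming the two knobs are a sample rate (how often the wave is measured) and a bit depth (how many integer levels each measurement can take). The function name and the specific values are hypothetical, picked only to show a coarse and a fine conversion of the same 100 Hz tone.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate, bit_depth):
    """Measure a continuous function of time at regular steps, then round
    each measurement to the nearest available integer level."""
    max_level = 2 ** (bit_depth - 1) - 1  # e.g. 32767 for signed 16-bit
    n_samples = round(duration_s * sample_rate)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate                          # time of this sample
        amplitude = max(-1.0, min(1.0, signal(t)))   # clip to [-1, 1]
        samples.append(round(amplitude * max_level))
    return samples

def tone(t):
    """Stand-in for the analog signal: a 100 Hz sine wave."""
    return math.sin(2 * math.pi * 100 * t)

# Same 10 ms of sound, captured coarsely and finely.
coarse = sample_and_quantize(tone, 0.01, sample_rate=1000, bit_depth=3)
fine = sample_and_quantize(tone, 0.01, sample_rate=8000, bit_depth=16)
print(coarse)
print(len(fine), max(fine))
```

The coarse version has only 10 numbers, each limited to the range -3 to 3, while the fine version has 80 numbers with far more levels, so it follows the original curve much more closely.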


Relevant Parameters


AD Conversion now yields a signal that the computer can read


Putting in the waveform itself is a possibility


… but we might want more information!


We don’t need transparent or parsimonious features


Let’s get that algorithm a Matrix


Mel-Frequency Cepstral Coefficients (MFCCs)


We’re not going deep here


MFCCs


MFCC Process


MFCC Input


MFCC Output


So, the sound becomes a matrix of features
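To show how a sound can end up as a matrix, here is a simplified sketch of one common MFCC recipe: frame the signal, window each frame, take a power spectrum, pool it into mel-spaced bands, take logs, and apply a discrete cosine transform. This is not necessarily the exact process on the slides; the parameter values, function names, and the naive DFT are assumptions made to keep the example self-contained, and production code would normally use a tested audio library.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, frame_len, sample_rate):
    """Triangular filters evenly spaced on the mel scale, one weight list each."""
    top = hz_to_mel(sample_rate / 2)
    # Filter edges in Hz, evenly spaced in mel from 0 up to the Nyquist frequency.
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = sample_rate / frame_len
    bank = []
    for f in range(n_filters):
        lo, mid, hi = edges[f], edges[f + 1], edges[f + 2]
        weights = []
        for k in range(frame_len // 2 + 1):
            hz = k * bin_hz
            if lo < hz <= mid:
                weights.append((hz - lo) / (mid - lo))   # rising slope
            elif mid < hz < hi:
                weights.append((hi - hz) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def mfcc(samples, sample_rate, frame_len=128, hop=64, n_filters=20, n_coeffs=13):
    """Return a matrix with one row per frame and n_coeffs columns."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]  # Hamming window
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Power spectrum from a naive DFT.
        power = []
        for k in range(frame_len // 2 + 1):
            total = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            power.append(abs(total) ** 2 / frame_len)
        # Log energy in each mel band (the small floor avoids log(0)).
        log_energy = [math.log(sum(w * p for w, p in zip(weights, power)) + 1e-10)
                      for weights in bank]
        # DCT-II of the log energies; keep the first n_coeffs coefficients.
        rows.append([sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                         for m, e in enumerate(log_energy))
                     for c in range(n_coeffs)])
    return rows

sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(1024)]
features = mfcc(signal, sample_rate)
print(len(features), "frames x", len(features[0]), "coefficients")
```

The result is a frames-by-coefficients matrix: every short slice of audio becomes one row of numbers that a learning algorithm can consume.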



Now we’ve got a matrix representing the sound


It’s Neural Network time!


What is the learning task like for ASR and TTS?


First, one major question…


What’s the desired data labeling?


Possible levels of labeling


Sentences


Words

“Noise”


What are the pros and cons of words?


Phones


Diphones


What are the pros and cons of phones and diphones?


In practice, many systems use diphones
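As a small illustration of what diphone labels look like, the sketch below turns a sequence of phones into overlapping pairs, with a boundary marker for silence at each edge. The function name, the `#` boundary symbol, and the ARPABET-style transcription of "cats" are assumptions for this example rather than any particular system's format.

```python
def to_diphones(phones, boundary="#"):
    """Pair each phone with its successor, padding both ends with a boundary."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# Illustrative phone sequence for the word "cats".
cats = ["K", "AE", "T", "S"]
print(to_diphones(cats))
```

Four phones yield five diphones, each spanning the transition from one sound into the next, which is where much of the acoustic change happens.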


So, we can now train a system


Using Neural Networks for ASR


Why is ASR hard?


Using Neural Networks for TTS


Neural Network Text-to-Speech Style Transfer Examples


Text-to-Speech is hard!


We talk a lot more about the linguistic difficulties with these tasks in LIGN 6 ‘Language and Computers’


… and we’ll talk a lot more about processing speech outside of Neural Networks in LIGN 168 ‘Computational Speech Processing’


Wrapping up


Deep Learning works for Speech too!



Thank you!

http://savethevowels.org/talks/deep_learning_speech.html