NO CLASS FRIDAY


Sound, Computers, and ASR

Will Styler - LIGN 6


How do acousticians say hello?


Today’s Plan


We’ve got a fundamental problem, to start


Computers don’t do waves

010001110010101000100101101010101010


Sound is analog, computers are digital


Quantization (‘Sampling’)


Quantization (‘Sampling’)


Quantization (‘Sampling’)


Analog-to-digital conversion


How often do we sample?


Sampling Rate


Sampling Rate


Sampling Rate (low rate)


Sampling Rate (awful rate)


Bad sampling makes for bad waves


Nyquist Theorem

The highest frequency captured by a sampled signal is one half of the sampling rate
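A minimal pure-Python sketch of what the theorem implies: a tone above the Nyquist frequency "folds" back down, so its samples are indistinguishable from those of a lower-frequency tone (the numbers here are illustrative, not from the slides).

```python
import math

def sample_cosine(freq_hz, rate_hz, n_samples):
    """Sample a cosine of the given frequency at the given sampling rate."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]

# Sampling at 1000 Hz can only capture frequencies up to 1000 / 2 = 500 Hz.
rate = 1000
nyquist = rate / 2  # 500.0

# An 800 Hz tone is above the Nyquist frequency, so it aliases:
# its samples are identical to those of a (1000 - 800) = 200 Hz tone.
high = sample_cosine(800, rate, 50)
alias = sample_cosine(200, rate, 50)
max_diff = max(abs(a - b) for a, b in zip(high, alias))
print(nyquist, max_diff)  # the two sampled signals are indistinguishable
```

Once the samples are taken, no algorithm can tell which tone was actually played, which is exactly why bad sampling makes for bad waves.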


Sampling Rates (Shpongle - ‘Nothing is something worth doing’)

44,100 Hz

22,050 Hz

11,025 Hz

6000 Hz


Sampling Rates (Shpongle - ‘Nothing is something worth doing’)

44,100 Hz

6000 Hz

3000 Hz

1500 Hz

800 Hz


Different media use different sampling rates


The ‘Bit Depth’ controls how much detail we store about each amplitude
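A quick sketch of what bit depth buys you, assuming amplitudes normalized to [-1.0, 1.0]: each extra bit doubles the number of representable levels, shrinking the rounding error at each sample.

```python
def quantize(x, bits):
    """Round an amplitude in [-1.0, 1.0] to the nearest representable level."""
    levels = 2 ** bits            # 16-bit audio: 65,536 levels
    step = 2.0 / (levels - 1)     # spacing between adjacent levels
    return round(x / step) * step

# Higher bit depth -> smaller worst-case rounding ("quantization") error.
x = 0.3141592
err_8 = abs(quantize(x, 8) - x)    # coarse: 256 levels
err_16 = abs(quantize(x, 16) - x)  # fine: 65,536 levels
print(err_8, err_16)
```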


Here’s a talk about this I did which goes into more detail


AD Conversion now yields a signal that the computer can read
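The whole A/D pipeline fits in a few lines using only Python's standard-library `wave` module: pick a sampling rate, pick a bit depth, and write integer samples to disk (the 440 Hz tone and the filename `tone.wav` are just illustrative choices).

```python
import math
import struct
import wave

# One second of a 440 Hz sine as 16-bit mono PCM at 44,100 Hz.
rate = 44100          # samples per second (CD-quality)
bits = 16             # bit depth
amplitude = 0.5       # half of full scale, leaving headroom
n_samples = rate      # one second of audio

frames = b"".join(
    struct.pack("<h", int(amplitude * 32767 * math.sin(2 * math.pi * 440 * n / rate)))
    for n in range(n_samples)
)

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)          # mono
    f.setsampwidth(bits // 8)  # 2 bytes per sample
    f.setframerate(rate)
    f.writeframes(frames)
```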


Well, much like the rest of us!


There are more problems


Putting in the waveform itself was historically a poor choice


Why not linguistically useful features?


Benefits of linguistically useful features


Downsides of linguistically useful features


For research, linguistically useful features are great


We don’t need transparent or minimal


Let’s get that algorithm a Matrix


Mel-Frequency Cepstral Coefficients (MFCCs)


We’re not going deep here


MFCCs


MFCC Process


MFCC Input


MFCC Output


So, the sound becomes a matrix of features
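The MFCC pipeline can be sketched end-to-end in pure Python: power spectrum of a frame, triangular filters spaced on the mel scale, log energies, then a DCT. This is a toy with illustrative parameters (64-sample frames, 6 filters, 4 coefficients), not a production implementation; real systems use optimized libraries and longer frames.

```python
import math

def hz_to_mel(f):
    return 2595 * math.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def power_spectrum(frame):
    """Naive DFT power spectrum -- O(n^2), fine for a short toy frame."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(-frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_filters, n_fft, rate):
    """Triangular filters spaced evenly on the mel scale (ear-like spacing)."""
    lo, hi = hz_to_mel(0), hz_to_mel(rate / 2)
    mels = [lo + i * (hi - lo) / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft // 2) * mel_to_hz(m) / (rate / 2)) for m in mels]
    banks = []
    for i in range(n_filters):
        left, mid, right = bins[i], bins[i + 1], bins[i + 2]
        bank = [0.0] * (n_fft // 2 + 1)
        for k in range(left, right + 1):
            if k <= mid:
                bank[k] = (k - left) / max(mid - left, 1)
            else:
                bank[k] = (right - k) / max(right - mid, 1)
        banks.append(bank)
    return banks

def dct2(x):
    """DCT-II: decorrelates log filterbank energies into cepstral coefficients."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n))
            for k in range(n)]

def mfcc(frame, rate, n_filters=6, n_mfcc=4):
    spec = power_spectrum(frame)
    banks = mel_filterbank(n_filters, len(frame), rate)
    log_energies = [math.log(max(sum(w * s for w, s in zip(bank, spec)), 1e-10))
                    for bank in banks]
    return dct2(log_energies)[:n_mfcc]

# One 64-sample frame of a toy 440 Hz signal at 8000 Hz
rate = 8000
frame = [math.sin(2 * math.pi * 440 * n / rate) for n in range(64)]
coeffs = mfcc(frame, rate)
print(coeffs)  # one short feature vector; a whole utterance yields a matrix of these
```

Stacking one such vector per frame, frame after frame, is what turns the sound into the matrix of features the slides describe.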



Now we’ve got a matrix representing the sound


It’s Neural Network time!


… Wait, hold on.


What are we recognizing in speech recognition?


Possible levels of recognition


Sentences


Words

“Noise”


Word Recognition Pros


Word Recognition Cons


Grapheme-based Recognition


Grapheme-based Pros


Grapheme-based Cons


Phones


Phone Recognition Pros


Phone Recognition Cons


Diphones


Diphone Recognition Pros


Diphone Recognition Cons


In practice, many systems use diphones
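At the symbol level, building a diphone inventory is just pairing each phone with its successor, since each unit spans the transition between two phones. A tiny sketch (the ARPAbet-style labels and silence padding are illustrative):

```python
def diphones(phones):
    """Pair each phone with its successor: each unit spans a phone transition."""
    return [(a, b) for a, b in zip(phones, phones[1:])]

# "cat" as a phone sequence, padded with silence on both sides
phones = ["sil", "k", "ae", "t", "sil"]
print(diphones(phones))
# [('sil', 'k'), ('k', 'ae'), ('ae', 't'), ('t', 'sil')]
```

Note that a sequence of n phones yields n - 1 diphones, and the inventory of possible units is roughly the square of the phone inventory.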


… but modern systems are often going waveform-to-grapheme


So, we can now train a system


That’s a tricky step right there


Your ASR system is only as good as your dictionary and/or training data


Users have very specific matches they expect


“Hey Siri play songs by the Bedsit Infamy”


“Hey Siri play songs by the Bedsit Infamy”


How do we test the system?


Like this

https://dictation.io/speech
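Beyond trying a live demo, ASR is standardly scored with word error rate (not named on the slide): the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A sketch via dynamic-programming edit distance, with an invented hypothesis riffing on the "Bedsit Infamy" example:

```python
def word_error_rate(reference, hypothesis):
    """WER = minimum word edits (sub/del/ins) to turn hypothesis into
    reference, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance table over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

ref = "play songs by the bedsit infamy"
hyp = "play songs by the bedside infirmary"   # hypothetical misrecognition
print(word_error_rate(ref, hyp))  # 2 substitutions / 6 words = 0.333...
```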


Wrapping Up


For next time


Thank you!