Computers and Sound
Turning Signals into Features
Automatic Speech Recognition
Sample the wave many times per second
Record the amplitude at each sample
The resulting sequence of samples will faithfully capture the signal
This is called the ‘Sampling Rate’
Measured in samples per second (Hz)
The highest frequency captured by a sampled signal is one half the sampling rate (the ‘Nyquist frequency’)
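To make the idea concrete, here’s a minimal sketch of digitization in Python using numpy (the library choice is ours, not the slides’):

```python
import numpy as np

sr = 8000        # sampling rate: samples per second (Hz)
duration = 0.01  # length of the signal in seconds
freq = 440.0     # a 440 Hz tone

# Sample the wave many times per second...
t = np.arange(0, duration, 1 / sr)
# ...and record the amplitude at each sample
samples = np.sin(2 * np.pi * freq * t)

# The Nyquist limit: frequencies above sr / 2 cannot be captured
print("Nyquist frequency:", sr / 2)  # 4000.0 Hz
```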
(Audio demo: sampling rates of 44,100 Hz, 22,050 Hz, 11,025 Hz, and 6,000 Hz)
(Audio demo: sampling rates of 44,100 Hz, 6,000 Hz, 3,000 Hz, 1,500 Hz, and 800 Hz)
Radio historically used lower sampling rates than CDs
CDs are at 44,100 Hz
DVDs are at 48,000 Hz
High-End Audio DVDs are at 96,000 Hz
Some people want 192,000 Hz
(Further material covers compression, bit depth, mp3, and more)
Also LIGN 168!
We’re going to use Neural Networks
… but what are the algorithms looking at?
It’s cheap and easy
NNs historically weren’t great at estimating frequency-based effects from raw waveforms
Important parts of the signal live only in frequency band info
We want to be able to give it all the information we can, in the most useful format!
They reflect speech-specific understanding
They reflect articulatory facts
They’re efficient
They’re very transparent
Slow to extract
Require specialized algorithms to extract
They treat speech as “special”
We’re plugging it into a black box
We’re happy to plug in hundreds of features, if need be
We’d just as soon turn that sound into a boring matrix
This is a lot of signal processing
We’re going to teach the idea, not the practice
1: Create a spectrogram
2: Extract the most useful bands for speech (in Mels)
3: Look at the frequencies of this banded signal (repeating the Fourier Transform process)
4: Simplify this into a smaller number of coefficients using DCT
Many rows (representing time during the signal)
N columns (usually 13) with coefficients that describe the spectral shape
It’s black-boxy, but we don’t care.
We’ve created a Matrix
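As a concrete sketch, the whole pipeline above is a single call in librosa (our library choice; the course doesn’t prescribe one, and the filename is a placeholder):

```python
import librosa

# Load an audio file (the path is a placeholder), resampled to 16 kHz
y, sr = librosa.load("speech.wav", sr=16000)

# Steps 1-4 in one call: spectrogram -> mel bands -> log -> DCT
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# librosa returns (13, n_frames); transpose for the
# "many rows of time, 13 columns of coefficients" view above
print(mfccs.T.shape)
```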
… Wait, hold on.
We need to give the NN labeled data
[Chunk of Sound MFCCed] == [Labeled Linguistic Info]
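A tiny sketch of what those training pairs might look like in Python (the matrices and labels here are placeholders):

```python
import numpy as np

# Each example pairs an MFCC matrix (time frames x 13 coefficients)
# with the linguistic label for that chunk of sound
training_data = [
    (np.zeros((87, 13)), "boy"),  # placeholder matrix and label
    (np.zeros((92, 13)), "soy"),
]
```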
What level do we want to recognize at?
Sentences?
Words?
Letters?
Phones?
Diphones?
“Noise”
Handles larger patterns of coarticulation
Captures word-specific effects
Robust to short duration noise
Word annotation is way cheaper
What about novel words?
Training data becomes much more sparse
Can we really learn nothing about “boy” from “soy”?
You could use the orthography itself as the ‘pronunciation dictionary’ and recognize letters (‘graphemes’)
Mapping straight from letters to speech signal
This is actually happening now!
Here’s another production system you can play with on HuggingFace
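For instance, here’s a sketch using the Hugging Face transformers library with a wav2vec2 checkpoint (one grapheme-based model among many; the specific model and audio path are our assumptions, not the slides’):

```python
from transformers import pipeline

# A grapheme-based recognizer: maps the waveform straight to letters
asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")

result = asr("speech.wav")  # the audio path is a placeholder
print(result["text"])
```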
The data are much easier to get
More able to handle new words and names
You don’t need dictionaries to map from words to phones!
Grapheme-to-phone conversion is very language-specific
The mapping is often rough, and sometimes thoroughly arbitrary
Some languages’ writing systems have less mutual information with spoken language
It throws away data for many homograph differences (e.g. record, villa, does)
The most basic unit, so training data is rich
Can (theoretically) work for any language
Can still capture unknown words
Annotation is brutally expensive
Coarticulation is problematic
Phone-level recognition is overkill for many contexts
Coarticulation becomes a feature, not a bug
Still a very basic unit, so all training data is useful
Can still (theoretically) work for any language
Can still capture unknown words via Fuzzy matching
Still stupidly expensive to annotate
Still overkill in many contexts
Many other systems use diphones too
Triphones are often a possibility
Capture sounds and annotate them as diphones or words
MFCC them, or read in the waveform alongside word labels, and feed them into a neural network as training data
Then later, feed new data in and get back a list of phones (or words), which you can use to predict which words were intended!
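As a toy sketch of that train-then-predict loop, here’s a frame-level classifier using scikit-learn (our choice; real ASR systems use far larger, sequence-aware models, and the data here is random stand-in):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in training data: one MFCC frame (13 coefficients) per row,
# with a phone label for each frame
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 13))
y_train = rng.choice(["ae", "t", "k"], 200)

# A small neural network learns frame -> phone
model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
model.fit(X_train, y_train)

# Later: feed in new MFCC frames, get back predicted phones
X_new = rng.standard_normal((5, 13))
print(model.predict(X_new))
```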
“For shizzle, Bashira”
“Mel Frequency Cepstral Coefficient”
“Differentiating Theta and Eth”
“Take Caminito Santa Fe, then Mira Mesa into La Jolla”
Computers can learn to do the wave
MFCCs turn beautiful sounds into opaque, useful matrices
Speech Recognition often uses diphones
You’re only as good as your dictionary
NO CLASS FRIDAY
Why is speech recognition so damned hard?