# Neural Speech Recognition

### Will Styler - LIGN 168

---

### So, now we know what Neural Networks are

- We understand their pros, their cons, and their various types
- **How can they be applied for ASR?**

---

### Today is just a sampler platter

- There are hundreds of ways to do this task
- We're going to talk about a few interesting approaches
- As always, you can dive deeper if you'd like

---

### Today's Plan

- Spectral and CNN-based Approaches
- The Alignment Problem
- Sound-to-Vector Models and Wav2Vec 2
- Whisper
- Neural ASR is boring

---

## Spectral and CNN-based Approaches

---

### So, we know Neural Networks win at ASR

- ... but how do we give acoustical data to these models?

---

### Spectral Representations as Input

- "Let's give the model a spectrogram directly"
- Often something like a logged mel spectrogram
- Sometimes something like a mel cepstrogram
- Sometimes a png file of a spectrogram itself
    - Please don't

---

### Once we have a grid containing useful acoustic data...

- Treat it like any other image!
- Convolutional Neural Networks excel at finding patterns in images
- Spectrogram reading is just finding patterns in images
- Neural Network training will ensure you find the right patterns

---

### You can go from spectrogram to phones!
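
---

### A sketch of the spectral front end

- Here's a minimal sketch of the idea (my own illustration, not any real system's code): compute a log mel spectrogram with torchaudio, then slide a 2D convolution over it as if it were a one-channel image
- The filename and parameter values below are just hypothetical examples

```python
import torch
import torchaudio

# Load audio and downmix to mono: (channels, samples) -> (1, samples)
waveform, sr = torchaudio.load("utterance.wav")    # hypothetical file
waveform = waveform.mean(dim=0, keepdim=True)

# Log mel spectrogram: a (mel bands x time frames) grid of acoustic energy
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=400, hop_length=160, n_mels=80
)(waveform)
log_mel = torch.log(mel + 1e-6)                    # shape: (1, 80, T)

# Treat the grid like an image: the convolution learns local
# time-frequency patterns (bursts, formant movements, etc.)
conv = torch.nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
features = conv(log_mel.unsqueeze(0))              # shape: (1, 32, 80, T)
```

- A real system stacks many such layers plus a classifier over phone labels, but the core move is exactly this: spectrogram in, learned pattern detectors out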
---

### Now you just need to match those phones to words

- ... which raises...

---

## The Alignment Problem

---

### Note that 'alignment' has many meanings in ML

- "Is this system going to do what I want it to do, for the reasons I want it to do it?"
- "Are these two things synchronized, ordered, and time-aligned properly?"
    - This is the problem we have today!

---

### Some tasks have one-to-one input/output alignment

- "Is this picture a rabbit or a cat?"
- "Would this Instagram user buy diet pills, based on their follow list?"
- "Is this email likely an attempt to purchase chemical weapons?"

---

### Other tasks have ambiguous alignment

- "Detect motion in this noisy video feed from the warehouse"
- "Identify all characters in this cursive sentence"
- "Take this music and generate a score for it"
- **ASR!**

---

### Alignment is hard in ASR

- I have an input waveform which describes a sequence of phones
    - e.g. /lɪŋgwɪstɪks/
- I have a set of output characters I'd like from that
    - e.g. "linguistics"
- You'll likely need to process the sound frame-by-frame

---

### Alignment for a given frame is hard

- Not all frames contain a character
- Some characters take more than one frame to complete
    - A long /i/ vowel might be several frames long
- Some characters are repeated
    - 'Mississippi' has repeated letters ('ss', 'pp'), but the audio doesn't contain multiple /s/ or /p/ sounds
- **Any ASR approach needs to address this!**

---

### Connectionist Temporal Classification (CTC) is one way

- Let's assume something allows us to assign every frame to a likely category label (e.g. phone)
- Create a list of possible output labels (e.g. 'cat', 'penguin', 'access')
- If an output label repeats characters, place a blank ('-') between the repeated items ('ac-ces-s')
- "If a frame label repeats across multiple frames, it's probably the same chunk"
- "If the same label shows up again after a blank, it's probably a separate chunk"

---

### CTC Collapsing
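
---

### CTC Collapsing, in code

- A tiny sketch of the collapsing rule (my own toy version): merge adjacent repeated frame labels, then delete the blanks

```python
def ctc_collapse(frame_labels, blank="-"):
    """Collapse a frame-by-frame label sequence into an output string:
    merge runs of repeated labels, then drop the blank symbol."""
    merged = []
    prev = None
    for label in frame_labels:
        if label != prev:          # adjacent repeats are the same chunk
            merged.append(label)
        prev = label
    return "".join(l for l in merged if l != blank)

print(ctc_collapse(list("ccaaattt")))   # 'cat' -- repeated frames merge into one letter
print(ctc_collapse(list("ss-s")))       # 'ss'  -- a blank keeps genuinely doubled letters apart
```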
---

### CTC just finds the most likely path through the options

- Sort of like decoding an HMM using the Viterbi algorithm
- "Which output label (or labels) best fits the data for this sequence?"
- Now, you can match a series of frame labels to the most likely output labels
- **The probability of the correct choice being made is also a good loss function!**

---

### CTC Decoding
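
---

### CTC as a loss function

- Because the probability of recovering the correct output is a good training signal, PyTorch ships it as `torch.nn.CTCLoss`
- A minimal sketch with made-up shapes and a hypothetical label mapping:

```python
import torch
import torch.nn as nn

T, N, C = 50, 1, 28                  # frames, batch size, classes (blank + 27 symbols, say)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)   # stand-in for the network's per-frame output
targets = torch.tensor([[3, 1, 20]])                    # indices for a target like "cat" (hypothetical mapping)
input_lengths = torch.full((N,), T)
target_lengths = torch.tensor([3])

ctc = nn.CTCLoss(blank=0)            # index 0 is reserved for the blank symbol
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss)   # negative log of the total probability over all valid frame-to-label alignments
```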
---

### We can unite the output with a language model
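
---

### Uniting the two models, schematically

- One common recipe (often called 'shallow fusion'; this is my sketch, not any particular system's code): score each candidate transcript with both models and add the weighted log-probabilities

```python
def fused_score(acoustic_logprob, lm_logprob, lm_weight=0.5):
    # Acoustics say "this matches the audio";
    # the language model says "this looks like real text".
    # lm_weight is an illustrative tuning knob.
    return acoustic_logprob + lm_weight * lm_logprob
```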
---

### ... and we figure out the most likely candidates

- Either choose the most likely candidate over and over, given both acoustic and language model data
    - "Greedy" decoding
- Or do [Beam Search](https://en.wikipedia.org/wiki/Beam_search) to identify multiple candidates
    - You limit the number of candidates to consider at once
    - Then, choose the sequence with the highest overall probability

---

### Beam Search
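
---

### Beam Search, sketched

- A simplified toy version (real CTC beam search also merges hypotheses that collapse to the same string, and can fold in language model scores at each step): keep only the top few hypotheses per frame

```python
import math

def beam_search(frame_logprobs, beam_width=3):
    """frame_logprobs: one dict per frame mapping symbol -> log-probability."""
    beams = [("", 0.0)]                              # (hypothesis, total log-prob)
    for frame in frame_logprobs:
        candidates = [
            (hyp + sym, score + lp)
            for hyp, score in beams
            for sym, lp in frame.items()
        ]
        # Prune: keep only the beam_width best-scoring hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

frames = [
    {"c": math.log(0.6), "k": math.log(0.4)},
    {"a": math.log(0.9), "o": math.log(0.1)},
    {"t": math.log(0.7), "d": math.log(0.3)},
]
print(beam_search(frames))    # best hypothesis: ('cat', ...)
```

- "Greedy" decoding is just the beam_width=1 special case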
(Source: BogdanShevchenko - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=128835919)

---

### Now we have a working ASR model!

- We get predictions from the acoustics by feeding spectral data into a CNN
- We match those to word representations using CTC
- We can unite this with a language model, evaluating probabilities using beam search
- This can turn a sequence of phones into a transcript!

---

### That's amazing!

- ... but can't we make this simpler?

---

## Sound-to-Vector Models and Wav2Vec2

---

### It would be nice not to need to generate spectral representations

- What if we could turn the waveform into a vector directly, and just go end-to-end?
- This leads to approaches like...

---

### Wav2Vec

- Developed by Facebook's AI team
- Takes in a waveform directly, and outputs higher-quality features for use in ASR
- **Feature Encoding:** Uses CNNs to create 'latent representations' every 10ms or so
    - These aren't exactly segments, they're just 'things the network notices'
- **Context Network:** Combines the features to capture adjacency effects

---

### Wav2Vec is trained in a *self-supervised* way

- We're going to take away ('mask') some of the representations coming out of the model
- Now, we make the algorithm predict what's missing
- The loss function involves correctly choosing the missing representations from among a random set sampled from elsewhere (sketched in code a couple of slides down)
- **This allows it to learn without as much labeled data**

---

### Wav2Vec 2.0

- Same idea, but the encoded features get quantized to a set of specific 'tokens'
    - Tokens are kind of like phones, but not 1-to-1
    - We get a sequence out of the waveform
- The context network can now be a transformer
    - This has all the benefits, plus easy fine-tuning for new languages or tasks
- **You can map these tokens to whatever representation you'd like**
    - Phones/Diphones/Triphones
    - IPA characters
    - Even direct to orthography!

---

### Wav2Vec2
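
---

### That masked-prediction objective, roughly

- A rough sketch in the spirit of Wav2Vec 2.0's contrastive loss (my simplification, not the paper's exact formulation): the transformer's output at each masked position must pick out the true quantized latent from distractors sampled elsewhere in the utterance

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context_vecs, true_latents, distractors, temperature=0.1):
    """context_vecs:  (M, D) transformer outputs at the M masked positions
       true_latents:  (M, D) the quantized latents that were masked out
       distractors:   (M, K, D) latents sampled from other positions"""
    # Candidate 0 is always the true latent; the rest are distractors
    candidates = torch.cat([true_latents.unsqueeze(1), distractors], dim=1)   # (M, K+1, D)
    sims = F.cosine_similarity(context_vecs.unsqueeze(1), candidates, dim=-1) / temperature
    targets = torch.zeros(sims.size(0), dtype=torch.long)   # "the right answer is index 0"
    return F.cross_entropy(sims, targets)
```

- Notice that no transcripts appear anywhere in that loss: the model learns useful speech representations from audio alone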
---

### Wav2Vec2 can go from sound to orthography!

- This is absolutely wild, given how bad our writing system is!
- It also doesn't need CTC, as that's a part of the transformer's core competencies
- These quantized units raise the possibility of a language-independent ASR system!
    - The 'Universal Transcriber'

---

### Wav2Vec2 replaces the entire acoustical pipeline
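
---

### Trying it yourself

- A minimal, hedged transcription sketch using Hugging Face's `transformers` library and Facebook's publicly released English checkpoint; this particular checkpoint happens to be fine-tuned with a CTC head, and `utterance.wav` is a hypothetical file

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_name = "facebook/wav2vec2-base-960h"          # released English ASR checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

waveform, sr = torchaudio.load("utterance.wav")     # hypothetical audio file
waveform = torchaudio.functional.resample(waveform.mean(dim=0), sr, 16000)  # model expects 16 kHz mono

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits       # one distribution over characters per frame

predicted_ids = torch.argmax(logits, dim=-1)         # greedy decode...
print(processor.batch_decode(predicted_ids))         # ...straight to orthography
```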
---

### To 'Decode' Wav2Vec2 data into words

- You combine the token representation with a separate language model to find hypotheses
- Again, you can use probabilities and/or beam search to find the best candidate
- Some people are trying to make this *fully* end-to-end, building the language model into the Wav2Vec model
    - We'll see an example of this shortly!

---

### This is a working ASR system!

- Wav2Vec2 goes from acoustics to quantized tokens
- Those tokens can be mapped directly to linguistic units
- Linguistic units plus a language model give probabilities for outputs
- Output probabilities can be beam-searched to arrive at the best transcription
- **We've gone from Waves to Words in just two steps!**
    - ... and they're both very boring steps

---

### You can also build on this system!

- You can take an existing model, and fine-tune it with a bit of data for another language
    - So, we use a model trained on all speech, and then fine-tune on Tira
    - This is great for low-resource languages
- [Char Siu](https://github.com/lingjzhu/charsiu) turns Wav2Vec2 into a forced aligner
    - "Find me the exact temporal boundaries of these segments"
- **Open Models make the world better!**

---

### This is a very common setup now

- It's not often revealed how commercial ASR systems work
    - Secret Sauce abounds!
- We should assume many systems are using a similar architecture under the hood!
- ... but at least one system is different!

---

## Whisper

---

### Whisper is an ASR Model from OpenAI

- It does transcription
- It also does time-aligned transcription (e.g. for video captioning)
- It does some multilingual ASR and translation too!
    - This is a neat trick!
- It also can be used for language identification!
- It is shockingly good!

---

### Whisper performs nearly as well as human transcribers
---

### You can actually use it!!

- It is free and open
    - Unlike anything OpenAI does these days
- You can use and download the models for free
- It's been tuned to run (slowly) even without a GPU!
    - This allows fully local transcription!!

---

### Whisper uses a hybrid of CNNs and Transformers
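
---

### Fully local transcription, in a few lines

- A minimal sketch with the open-source `openai-whisper` package; the filename and model size below are just examples

```python
# pip install -U openai-whisper   (also requires ffmpeg)
import whisper

model = whisper.load_model("base")            # tiny / base / small / medium / large trade speed for accuracy
result = model.transcribe("lecture.mp3")      # hypothetical audio file
print(result["text"])                         # the transcript
print(result["language"])                     # Whisper's guess at the language

# It can also translate non-English speech into English text:
translated = model.transcribe("lecture.mp3", task="translate")
print(translated["text"])
```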
---

### Whisper embeds the language model into the acoustic model!

- There is no separate hypothesis step and no separate language model!

---

### Classical ASR Architecture
---

### Whisper's End-to-End Architecture
---

### Whisper offers a number of models
---

### Whisper works (mostly) great!

- It's probably the right choice for transcribing files at the moment!
- ... it's also the last neural ASR architecture we're going to look at this quarter!

---

## Neural ASR is boring lately

---

### This is a rant

- Many people have lovely careers, and the tools are exhilaratingly powerful
- ... but...

---

### Neural ASR has largely abandoned linguistics

- "Feed in waves and text to a transformer, cook for two weeks, then have a model"
- There is no linguistic nuance
- There's not even transcription anymore!

---

### More parameters == More Better

- Thus, more memory and energy cost generally wins
- This privileges large companies with large resources
- This focuses development on wealthy languages and groups

---

### ASR is approaching 'solved' for high-resource people

- For the mainest-stream speakers of American English, ASR is amazing
- ASR performance for me is at a place I didn't expect to see in my lifetime
- There is room for improvement, to be sure, but it's not a lot of room
- Most of the remaining room is in brittleness with other dialects and variation

---

### The most interesting work in ASR right now is for low-resource languages

- How do we get *great* ASR working for a language with little data, money, or hardware?
- Imagine your language not having voice-to-type!
- How can we use ASR to more quickly generate, study, and clean language data?
- How can we use ASR to enable fieldwork and linguistic inquiry more effectively?
- **How can ASR help the people the tech industry doesn't care enough to help?**

---

### For information on this, talk to Mark Simmons!

- He's in the trenches with these ideas right now!
- ... and we'll hear from him next time!

---

### Wrapping up

- Feeding spectral information into CNNs is a great way to extract features
- You need to solve the alignment problem, either with CTC or Transformers
- Wav2Vec2 offers a path straight from audio to intermediate representations
    - ... which can be fed into a language model
- Whisper is free, great, and completely end-to-end!

---
Thank you!