LIGN 168 - Introduction

All my slides are presented using [reveal.js](https://revealjs.com), and exist as websites rather than as PowerPoint or PDF files.

To navigate my slides, use the arrow keys, the onscreen controls, or swipe on mobile.  To scan through slides, hit 'Escape'.

To print or make a PDF, add ?print-pdf to the URL after .html in Google Chrome and then 'Print' as usual.

(Please note that PDF/printing is not 'officially supported' by the instructor, and the results may be very strange in places)

---

### Handout Mode

- You can also turn slides into a handout by adding "_handout" before .html

---

# Linguistics 168: Computational Speech Processing
### Dr. Will Styler

---

### Today's Plan

- Introductions

- Class Admin Stuff

- Speech Processing Tasks

- What's in and out of scope for the class?

- What's the plan?

---

## Introductions

---

### Dr. Will Styler

Teaching Professor, Linguist, Gigantic Nerd

---

### Mark Simmons

- You'll meet him Wednesday!

---

### You all

- How many of you have had 110?

- How many of you can read a spectrogram?

- How many of you know Python?

- How many of you have had signal processing?

---

## Class Administration

---

### The Syllabus

<https://savethevowels.org/168>

---

### What are we assuming you know?

- Basic Understanding of Computers
	- How to install software, open files, and very basic concepts like 'binary' and 'data transmission'

- LING 101 Level understanding of Language
	- Concepts like dialects, phonology, homophones, orthography, IPA

- Basic Familiarity with Natural Language Processing
	- You should use or have used some NLP tools or 'Virtual Assistants'
	- Playing with ChatGPT or a similar LLM is wise

---

### What are we assuming you know? (Part 2)

- Articulatory Phonetics Knowledge
	- 'Alveolar' vs. 'Velar' and similar articulatory concepts
	- Concepts like 'Pitch/F0', 'Prosody', 'Formants'

- Basic ability to see and understand Python code
	- Look at commented code and know what chunk is doing what and how, and to modify the code to suit your needs

- **If you're lacking any of these things, find somebody to work with where you can team up and form a complementary distribution**

---

### There may be bits of review

- Folks who've had LIGN 6 or 112 or 113 will have seen some of this

- **Help your classmates!**

---

### How are we teaching you this quarter?

- In Class Sessions
	- Giving you content and knowledge to apply
	
- Readings and Viewings
	- Giving you external places to learn more about the topics

- Homeworks and Discussions
	- Giving you places to play with the concepts we're discussing

- Lab Sessions
	- Giving you a place to get easy help

- Final Projects
	- Giving you a place to dive deep

---

### External Teaching Tools

- You'll need to purchase a "Premium" Cengage Doculinx rental license to access readings and assignments ($349)

- You'll need a Thomson Reuters SignalExpert license ($349.99 for the quarter, course code NGGYU)

- WebClickers haven't worked out, so we'll be using the iClicker/Texas Instruments TiClick Calcuclicker package ($235, use code "LIGN168STYLER" to reduce that to $174.99)

- All assignments should be submitted as Microsoft Office 365 links on my Chillet UGradeU board ($4.95/assignment submission fee, code "STYUCSD")

- **This entire slide is an April Fools joke, screw expensive and proprietary Educational Technology**

- If it isn't open, it doesn't belong on a campus

---

### Homeworks and Discussions

- Discussions (Submitted on Canvas)
	- Focused on testing and conceptual issues

- Homeworks (Submitted on Gradescope)
	- Focused on implementation

---

### Final Project

- Proposal is due Week 8

- Due in Finals Week

- Self Graded using a Rubric

- Two different tracks are available
	- Experimentation and Implementation

---

### Teamwork is Encouraged

- Work together on your homeworks

- Work together on your Final Project
	- Up to Five People
	- Linguists and Computer Folks should pair up

---

### The Emergency Button

- Email your TA no more than 24 hours after a due date for an instant extension until Wednesday

- You can press the button once this quarter; if you never press it, you get a 2% bonus on your final grade

---

### You should show up, but attendance isn't required

- This is an upper division class, so it's your choice to make

- WebClickers will be used, but the points won't 'count'

- Students who choose not to attend generally do worse

- Students who do attend regularly have an advantage because....

---

### Sincere effort is incredibly important to me

- 'Blow off' the class, or try to lawyer or cajole your way into an A, and you'll find neither sympathy nor help

- Low effort, low integrity, or 'cut-corners' work insults the people who you need help from
	
- Put in the effort for us, and we’ll put in effort for you

- We will bend over backwards to help students who are making a sincere effort

- **If you're struggling, talk to us ASAP**

---

### This is a new class!

- We're trying to make this as good as we can in round 1, but there may be bumps and changes

- Some things may not work well, or may need tweaking

- **Please pardon our dust!**

---

## Speech Processing Tasks

---

### Speech Interaction Tasks

- Take this person's Carl's Jr. Order without a second human involved

- "Hey Google, Play Despacito"

- "Please tell me what you're calling about today. You can say 'MacBook Pro'"

- Is this person saying 'Hey Google'? If so, trigger recognition of what follows.

---

### Speech Recognition Tasks

- Listen to all support calls and flag mentions of competitor products

- Listen to all Discord calls, and target calls mentioning breakups for Ben and Jerry's Ads

- Transcribe this recording, word-for-word ('Automatic Transcription')

- Tell me the temporal boundaries of each word and sound in this file ('Forced Alignment')

---

### Speech Synthesis Tasks

- "Please place the items back in the bagging area"

- Read this automatically generated weather report constantly on this radio channel

- Turn this eBook into an audio book

- Give [a fictional singer](https://en.wikipedia.org/wiki/Hatsune_Miku) a voice ('Vocaloid')

- Read everything on screen aloud over headphones ('Screen Readers')

---

### Pure Speech Measurement Tasks

- What's the pitch and what are the formants in this vowel? (Pitch Tracking and LPC)

- Does this person sound angry? Sad?

- Does this person have Parkinson's disease?

- Measure the amount of time this child spends hearing adult voices, child voices, and TV voices
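
---

### Sketch: Estimating Pitch by Autocorrelation

To make the first measurement task concrete, here is a minimal sketch of one simple way to estimate pitch (F0): find the time lag at which a signal best matches a shifted copy of itself. This is a simplification chosen for illustration, not necessarily the method we'll use in class, and it does not cover formants or LPC. The function name `estimate_f0`, the 75-500 Hz search range, and the synthetic 200 Hz test tone are all assumptions.

```python
import math

def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate F0 in Hz using a naive (unnormalized) autocorrelation search."""
    # Only search lags that correspond to plausible pitch values (assumed range).
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # How similar is the signal to itself, shifted by 'lag' samples?
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # The best lag is the period in samples; convert it to a frequency.
    return sample_rate / best_lag

# Synthetic test signal: 0.1 seconds of a pure 200 Hz tone at 8000 Hz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(800)]
f0 = estimate_f0(tone, SAMPLE_RATE)
print(f"Estimated F0: {f0:.1f} Hz")
```

Real speech is much messier than a pure tone, so practical pitch trackers add normalization, voicing decisions, and smoothing on top of this idea.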

---

### Speech Modification Tasks

- Compress and transmit this person's voice across this digitally encrypted channel/cell network without sending the waveform (Speech Codecs)

- Make [T-Pain's Whole Career](https://en.wikipedia.org/wiki/T-Pain) (AutoTune)

- Remove all background noise from this helicopter pilot's transmissions (Denoising)

- Make this whistleblower's voice sound unrecognizable

---

## What's in and out of scope?

---

### Modern Natural Language Processing is Pipeliney

- The process of using (e.g.) Siri is a multi-step process

- The output of one step flows into the next

- We can think of these steps as being discrete, and study them as such
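
---

### Sketch: A Pipeline as Chained Functions

The "output of one step flows into the next" idea can be sketched as a chain of functions. This is only an illustration: the stage names (`recognize_speech`, `process_query`, `synthesize_speech`) are hypothetical placeholders that tag a string so the flow is visible, and they do no real speech processing.

```python
from typing import Callable, List

# Placeholder stages: each one just wraps its input in a label.
# These names are illustrative assumptions, not real speech tools.
def recognize_speech(audio: str) -> str:
    return f"text({audio})"

def process_query(text: str) -> str:
    return f"response({text})"

def synthesize_speech(text: str) -> str:
    return f"audio({text})"

def run_pipeline(stages: List[Callable[[str], str]], data: str) -> str:
    # Feed the output of each stage into the next one, in order.
    for stage in stages:
        data = stage(data)
    return data

result = run_pipeline(
    [recognize_speech, process_query, synthesize_speech],
    "spoken query",
)
print(result)  # audio(response(text(spoken query)))
```

Because each stage only depends on the previous stage's output, we can swap out or study any one step without touching the others.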

---

### We will look at four main questions this quarter

- How do computers turn speech into numbers (and back)?

- How do computers turn speech into text?

- How do computers turn text into speech?

- How do computers *modify* speech?

---

### How do computers turn speech into numbers (and back)?

- How do sound waves become binary strings?

- How do we extract meaningful features (e.g. pitch, formants)?

- How do we turn speech into a matrix of useful numbers?

- How do we turn speech into a *compressed* and more efficient representation?
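
---

### Sketch: From Sound Wave to Binary Strings

As a preview of the first question, here is a minimal sketch of sampling and quantization: measure a wave at evenly spaced moments, then round each measurement to a 16-bit integer. The 16000 Hz sample rate, the 440 Hz tone, and the 16-bit depth are assumptions picked for the example, and a pure sine wave stands in for real speech.

```python
import math

SAMPLE_RATE = 16000   # samples per second (assumed for this example)
FREQUENCY = 440.0     # Hz, an arbitrary example tone
DURATION = 0.01       # seconds
MAX_INT16 = 32767     # largest value a signed 16-bit sample can hold

def sample_sine(frequency, duration, sample_rate):
    """Sample a sine wave at evenly spaced times; values fall in [-1.0, 1.0]."""
    n_samples = int(round(duration * sample_rate))
    return [math.sin(2 * math.pi * frequency * n / sample_rate)
            for n in range(n_samples)]

def quantize_16bit(samples):
    """Round each sample to the nearest signed 16-bit integer."""
    return [int(round(s * MAX_INT16)) for s in samples]

def to_binary_strings(int_samples):
    """Show each sample as a 16-character two's-complement bit string."""
    return [format(v & 0xFFFF, "016b") for v in int_samples]

samples = sample_sine(FREQUENCY, DURATION, SAMPLE_RATE)
quantized = quantize_16bit(samples)
bits = to_binary_strings(quantized)

print(len(samples))     # number of samples in 0.01 seconds
print(quantized[:4])    # the first few integer samples
print(bits[:4])         # the same samples as binary strings
```

Rounding to integers throws away a little information at every sample, which is one reason bit depth and sample rate matter for sound quality.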

---

### How do computers turn those numbers into text?

- Automatic Speech Recognition (ASR)

- Wake Word Detection

- Determining the most probable string based on ambiguous words

---

### How do computers turn that text into speech?

- Text-to-Speech (TTS, 'Speech Synthesis')

- Modifying the 'voice' of TTS models

- Proper pronunciations, timing, prosody

---

### How do computers modify speech?

- How do we filter a speaker's voice out of noise?

- How do we transmit a voice with encryption/efficiency?

- How do we change the characteristics of a voice (autotune/resynthesis)?

- How do we replace one voice with another ('voice deepfakes')?

---

### All of this forms a loop in many NLP tasks!

<img class="r-stretch" src="diagrams/speech_processing_pipeline.jpg" alt="Flowchart of a spoken-language pipeline laid out as a U-shaped loop. Input path, top to bottom on the left: Spoken Query, Analog to Digital Conversion (ADC), Processing Trigger/Wake Word, VAD/Noise Filtering/Speaker ID, Automatic Speech Recognition (ASR), Language Model Post-Processing, Query String. Bottom center: Text and Query Processing (LIGN 167), which takes the Query String and produces the Response String. Output path, bottom to top on the right: Response String, Text Analysis, Voice/Model Selection, Text-to-Speech (TTS), Digital to Analog Conversion (DAC), Playback, Spoken Response. A box at top center, Human Language Processing (LIGN), links Spoken Response back to Spoken Query, closing the loop. Green boxes mark the spoken endpoints, purple boxes the processing steps, grey boxes the text strings, and pink boxes the language processing stages.">

---

### We're ignoring the steps between 'Query String' and 'Response String'

- Text processing is fascinating, but out of our scope

- So is actually figuring out the response to the user query

- We're going to assume a perfect 'AI' model in the middle

- All text strings are perfectly processed with perfect answers returned

- **LIGN 167 will tell you how this part works!**

---

### We're also going to focus on English

- This is hard enough in any language

- The 'State of the Art' tools are currently built for English

- We'll talk about some differences, but we'll focus on the basics here

- We're going to assume modern devices and good sound quality

---

## What's the Plan?

---

### Part 1: Computational Sound

- What is sound?

- How do computers do sound?

- How do we analyze sound?

- What is speech like?

- **Weeks 1-2**

---

### Part 2: Measuring and Modifying Speech

- How do we filter sound?

- How do we transform sound?

- How do we detect features of voices?

- How do we modify pitch? Formants?

- **Weeks 3-4**

---

### Part 3: Automatic Speech Recognition

- Why is ASR hard?

- How was it done?

- How is it done?

- How can we make it more efficient?

- What tasks go alongside speech recognition?

- **Weeks 5-7**

---

### Part 4: Text-to-Speech

- Why is TTS hard?

- How was it done?

- How is it done?

- How can we change the voices we create?

- What *should* these voices sound like?

- **Weeks 8-10**

---

### Important questions to ask along the way

- How can we do this with respect for **linguistic diversity**?

- How can we do this with respect for **our right to privacy**?

- How can this be used to **help people**?

- How can this be used to **harm people**?

- How can these systems be **improved**?

---

### Here's hoping for a great quarter!

- Hopefully we'll learn a lot

- We'll forge a great new class

- ... and we'll have fun in the process
	- ing

---

<huge>Thank you!</huge>