All my slides are presented using [reveal.js](https://revealjs.com) and exist as websites rather than as PowerPoint or PDF files. To navigate my slides, use the arrow keys, the onscreen controls, or swipe on mobile. To scan through slides, hit 'Escape'. To print or make a PDF, open the slides in Google Chrome, add ?print-pdf to the URL after .html, and then 'Print' as usual. (Please note that PDF/printing is not 'officially supported' by the instructor, and the results may be very strange in places.)

---

### Handout Mode

- You can also turn slides into a handout by adding "_handout" before .html

---

# Linguistics 168: Computational Speech Processing

### Dr. Will Styler

---

### Today's Plan

- Introductions
- Class Admin Stuff
- Speech Processing Tasks
- What's in and out of scope for the class?
- What's the plan?

---

## Introductions

---

### Dr. Will Styler

Teaching Professor, Linguist, Gigantic Nerd


---

### Mark Simmons

- You'll meet him Wednesday!

---

### You all

- How many of you have had 110?
- How many of you can read a spectrogram?
- How many of you know Python?
- How many of you have had signal processing?

---

## Class Administration

---

### The Syllabus


---

### What are we assuming you know?

- Basic Understanding of Computers
    - How to install software, open files, and very basic concepts like 'binary' and 'data transmission'
- LIGN 101 Level understanding of Language
    - Concepts like dialects, phonology, homophones, orthography, IPA
- Basic Familiarity with Natural Language Processing
    - You should use or have used some NLP tools or 'Virtual Assistants'
    - Playing with ChatGPT or a similar LLM is wise

---

### What are we assuming you know? (Part 2)

- Articulatory Phonetics Knowledge
    - 'Alveolar' vs. 'Velar' and similar articulatory concepts
    - Concepts like 'Pitch/F0', 'Prosody', 'Formants'
- Basic ability to read and understand Python code
    - Look at commented code, know which chunk is doing what and how, and modify the code to suit your needs
- **If you're lacking any of these things, find somebody to work with where you can team up and form a complementary distribution**

---

### There may be bits of review

- Folks who've had LIGN 6 or 112 or 113 will have seen some of this
- **Help your classmates!**

---

### How are we teaching you this quarter?

- In Class Sessions
    - Giving you content and knowledge to apply
- Readings and Viewings
    - Giving you external places to learn more about the topics
- Homeworks and Discussions
    - Giving you places to play with the concepts we're discussing
- Lab Sessions
    - Giving you a place to get easy help
- Final Projects
    - Giving you a place to dive deep

---

### External Teaching Tools

- You'll need to purchase a "Premium" Cengage Doculinx rental license to access readings and assignments ($349)
- You'll need a Thomson Reuters SignalExpert license ($349.99 for the quarter, course code NGGYU)
- WebClickers haven't worked out, so we'll be using the iClicker/Texas Instruments TiClick Calcuclicker package ($235, use code "LIGN168STYLER" to reduce that to $174.99)
- All assignments should be submitted as Microsoft Office 365 links on my Chillet UGradeU board ($4.95/assignment submission fee, code "STYUCSD")
- **This entire slide is an April Fools joke, screw expensive and proprietary Educational Technology**
    - If it isn't open, it doesn't belong on a campus

---

### Homeworks and Discussions

- Discussions (Submitted on Canvas)
    - Focused on testing and conceptual issues
- Homeworks (Submitted on Gradescope)
    - Focused on implementation

---

### Final Project

- Proposal is due Week 8
- Due in Finals Week
- Self Graded using a Rubric
- Two different tracks are available
    - Experimentation and Implementation

---

### Teamwork is Encouraged

- Work together on your homeworks
- Work together on your Final Project
    - Up to Five People
- Linguists and Computer Folks should pair up

---

### The Emergency Button

- Email your TA no more than 24 hours after a due date for an instant extension until Wednesday
- You can press the button once this quarter, or take a 2% bonus on your final grade if unpressed

---

### You should show up, but attendance isn't required

- This is an upper division class, so it's your choice to make
- WebClickers will be used, but the points won't 'count'
- Students who choose not to attend generally do worse
- Students who do attend regularly have an advantage because....

---

### Sincere effort is incredibly important to me

- 'Blow off' the class, or try to lawyer or cajole your way into an A, and you'll find neither sympathy nor help
- Low effort, low integrity, or 'cut-corners' work insults the people whose help you need
- Put in the effort for us, and we’ll put in effort for you
- We will bend over backwards to help students who are making a sincere effort
- **If you're struggling, talk to us ASAP**

---

### This is a new class!

- We're trying to make this as good as we can in round 1, but there may be bumps and changes
- Some things may not work well, or may need tweaking
- **Please pardon our dust!**

---

## Speech Processing Tasks

---

### Speech Interaction Tasks

- Take this person's Carl's Jr. Order without a second human involved
- "Hey Google, Play Despacito"
- "Please tell me what you're calling about today. You can say 'MacBook Pro'"
- Is this person saying 'Hey Google'? If so, trigger recognition of what follows.

---

### Speech Recognition Tasks

- Listen to all support calls and flag mentions of competitor products
- Listen to all Discord calls, and target calls mentioning breakups for Ben and Jerry's Ads
- Transcribe this recording, word-for-word ('Automatic Transcription')
- Tell me the temporal boundaries of each word and sound in this file ('Forced Alignment')

---

### Speech Synthesis Tasks

- "Please place the items back in the bagging area"
- Read this automatically generated weather report constantly on this radio channel
- Turn this eBook into an audio book
- Give [a fictional singer](https://en.wikipedia.org/wiki/Hatsune_Miku) a voice ('Vocaloid')
- Read everything on screen aloud over headphones ('Screen Readers')

---

### Pure Speech Measurement Tasks

- What's the pitch and what are the formants in this vowel? (Pitch Tracking and LPC)
- Does this person sound angry? Sad?
- Does this person have Parkinson's disease?
- Measure the amount of time this child spends hearing adult voices, child voices, and TV voices

---

### Speech Modification Tasks

- Compress and transmit this person's voice across this digitally encrypted channel/cell network without sending the waveform (Speech Codecs)
- Make [T-Pain's Whole Career](https://en.wikipedia.org/wiki/T-Pain) (AutoTune)
- Remove all background noise from this helicopter pilot's transmissions (Denoising)
- Make this whistleblower's voice sound unrecognizable

---

## What's in and out of scope?

---

### Modern Natural Language Processing is Pipeliney

- Using (e.g.) Siri is a multi-step process
- The output of one step flows into the next
- We can think of these steps as being discrete, and study them as such

---

### We will look at four main questions this quarter

- How do computers turn speech into numbers (and back)?
- How do computers turn speech into text?
- How do computers turn text into speech?
- How do computers *modify* speech?

---

### How do computers turn speech into numbers (and back)?

- How do sound waves become binary strings?
- How do we extract meaningful features (e.g. pitch, formants)?
- How do we turn speech into a matrix of useful numbers?
- How do we turn speech into a *compressed* and more efficient representation?

---

### How do computers turn those numbers into text?

- Automatic Speech Recognition (ASR)
- Wake Word Detection
- Determining the most probable string based on ambiguous words

---

### How do computers turn that text into speech?

- Text-to-Speech (TTS, 'Speech Synthesis')
- Modifying the 'voice' of TTS models
- Proper pronunciations, timing, prosody

---

### How do computers modify speech?

- How do we filter a speaker's voice out of noise?
- How do we transmit a voice with encryption/efficiency?
- How do we change the characteristics of a voice (autotune/resynthesis)?
- How do we replace one voice with another ('voice deepfakes')?

---

### All of this forms a loop in many NLP tasks!
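
A minimal Python sketch of that loop, under loud assumptions: the function names (`digitize`, `recognize`, `respond`, `synthesize`, `run_pipeline`) are hypothetical, and every stage except the 16-bit quantization in `digitize` is a canned placeholder, not a real ASR, AI, or TTS system. It only illustrates that each step's output is the next step's input.

```python
from typing import Any, Callable, List, Sequence


def digitize(waveform: List[float]) -> List[int]:
    """Speech -> numbers: quantize samples in [-1.0, 1.0] to 16-bit integers."""
    return [max(-32768, min(32767, round(s * 32767))) for s in waveform]


def recognize(samples: List[int]) -> str:
    """Numbers -> text: PLACEHOLDER for ASR, returns a canned query string."""
    return "what time is it" if samples else ""


def respond(query: str) -> str:
    """Query -> response: PLACEHOLDER for the middle step (out of scope here)."""
    return f"You asked: {query}"


def synthesize(text: str) -> List[float]:
    """Text -> speech: PLACEHOLDER for TTS, emits one silent sample per character."""
    return [0.0] * len(text)


def run_pipeline(steps: Sequence[Callable[[Any], Any]], data: Any) -> Any:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        data = step(data)
    return data


# Hypothetical ordering of the loop: speech in, speech back out.
PIPELINE = [digitize, recognize, respond, synthesize]

if __name__ == "__main__":
    audio_out = run_pipeline(PIPELINE, [0.0, 1.0, -1.0])
    print(f"Pipeline produced {len(audio_out)} output samples")
```

Because each stage only sees the previous stage's output, any one placeholder could be swapped for a real component without touching the others, which is why the steps can be studied one at a time.
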

---

### We're ignoring the steps between 'Query String' and 'Response String'

- Text processing is fascinating, but out of our scope
- So is actually figuring out the response to the user query
- We're going to assume a perfect 'AI' model in the middle
    - All text strings are perfectly processed with perfect answers returned
- **LIGN 167 will tell you how this part works!**

---

### We're also going to focus on English

- This is hard enough in any language
- The 'State of the Art' tools are currently built for English
- We'll talk about some differences, but we'll focus on the basics here
- We're going to assume modern devices and good sound quality

---

## What's the Plan?

---

### Part 1: Computational Sound

- What is sound?
- How do computers do sound?
- How do we analyze sound?
- What is speech like?
- **Weeks 1-2**

---

### Part 2: Measuring and Modifying Speech

- How do we filter sound?
- How do we transform sound?
- How do we detect features of voices?
- How do we modify pitch? Formants?
- **Weeks 3-4**

---

### Part 3: Automatic Speech Recognition

- Why is ASR hard?
- How was it done?
- How is it done?
- How can we make it more efficient?
- What tasks go alongside speaker recognition?
- **Weeks 5-7**

---

### Part 4: Text-to-Speech

- Why is TTS hard?
- How was it done?
- How is it done?
- How can we change the voices we create?
- What *should* these voices sound like?
- **Weeks 8-10**

---

### Important questions to ask along the way

- How can we do this with respect for **linguistic diversity**?
- How can we do this with respect for **our right to privacy**?
- How can this be used to **help people**?
- How can this be used to **harm people**?
- How can these systems be **improved**?

---

### Here's hoping for a great quarter!

- Hopefully we'll learn a lot
- We'll forge a great new class
- ... and we'll have fun in the process
    - ing

---

Thank you!