# Speech Technology is Incredible

### Will Styler - CSSA Talk Series
--- ### Introductions - I'm an Associate Teaching Professor of Linguistics - I'm a Computational Phonetician - I study human speech perception and production using computers - This involves a mix of experiments, data analysis, recordings, and instrumental measurements - I've also done Natural Language Processing work - I work to develop new tools for measurement and teaching --- ### I'm also Director of [Computational Social Science](https://css.ucsd.edu) at UCSD - We're an interdisciplinary program uniting people across Social Science who work with and care about computation and data - We believe that **data matters because people matter** - I'm here because I'm always thinking about computers as a tool for understanding humans! --- ### But there remains an elephant in this CSSA-filled room - I'm not actually in Cognitive Science! ---
--- ### Cognitive Science wasn't a separate program where I grew up - I have my BA, MA and Doctorate in Linguistics from the [University of Colorado at Boulder](https://www.colorado.edu/linguistics/) - CU Boulder had the [Institute of Cognitive Science](https://www.colorado.edu/ics/) - Collaborations between faculty in CS, Philosophy, Linguistics, Psych, Education, and more - I went to their talks and was advised by an affiliate, but there was not a COGS major or Ph.D. specialization - ... but I am bothered by many COGS-flavored questions --- ### But I do have a lot of COGS ties - I collaborate lots with [Dr. Sarah Creel](https://quote.ucsd.edu/lasr/) in COGS - Who is an incredible person to talk to if you're interested in speech in COGS - CSS has [lots of Cognitive Scientists](https://css.ucsd.edu/people/faculty.html#Cognitive-Science)! - Including [Dr. Sean Trott](https://seantrott.github.io/), one of our Core Faculty - And [Dr. Shannon Ellis](https://shannon-ellis.com/) is on our steering committee - ... and I know that Language folks over here are just as awesome and effective --- ### No matter where my office is, Language is the best - Speech is even cooler than Language - ... and all the more awesome when we put computers into the process --- ## Speech Technology --- ### Speech Technology is pervasive in the US - Siri/Alexa/Google Assistant - ChatGPT Voice Mode - Speech-to-Text Keyboards - Text-to-Speech (e.g. in GPS or Twitch streams) - ... and much more! --- ### Speech technology is absolutely fascinating - **... but the most interesting part is that it works at all!** --- ### Today's Plan - Why is speech so hard to produce? - How can computers produce speech? - Why is speech so hard to perceive? - How can computers perceive speech?
--- ## Speech is Hard --- ### Human Speech is incredibly difficult - It's an intricate gestural dance in your mind and mouth - Let's try it --- ### "A Linguistics Major goes very well with Cognitive Science" - First, focus on your jaw - Now, on your tongue - Now, feel the vibes --- ### Speech is *hard* - Fluid movement of your mouth and tongue - Careful planning of air and breathing - Control of pitch, gestures, and other aspects - All to create tiny pressure variations in the air ---
---
--- ### ... and we want to do *this* with software?!? - 'Speech Synthesis' or 'Text-to-Speech' (TTS) - *How do we do that?* --- ### The Task "A Linguistics major goes very well with Cognitive Science"
--- ### Historically, the steps were simple - Analyze what the text needs to sound like ('Text Analysis') - Jelena saw 1985 listings in La Jolla CA for over $2 million - /jɛlɛnə sɑ najntin hʌndɹɪd . . . / - Now, transform that into a wave we can play back for the humans --- ### For a long time, we cheated using humans! - **Concatenative or 'Unit Selection' TTS** chops up bits and pieces of existing speech to create new speech - You record a huge database of speech from a voice actor, with optimum 'coverage' - You update as new words emerge (e.g. COVID, rizz, skibidi) - **You then combine these words into sentences to match the text** - ... and you use fancy algorithms to smooth the results out. --- ### The result can be imperfect
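The unit-selection recipe above can be sketched in a few lines. This is a hedged toy, not a real synthesizer: the "database" below maps whole words to hypothetical clip filenames, where a real system stores thousands of sub-word units and scores competing candidates before concatenating and smoothing the audio.

```python
# Toy sketch of concatenative ("unit selection") TTS.
# The filenames are invented stand-ins for recorded audio clips.
UNIT_DATABASE = {
    "a": "units/a_017.wav",
    "linguistics": "units/linguistics_004.wav",
    "major": "units/major_112.wav",
    "goes": "units/goes_051.wav",
    "very": "units/very_023.wav",
    "well": "units/well_008.wav",
}

def select_units(text):
    """Look up a recorded unit for each word; fail loudly on a coverage gap."""
    units = []
    for word in text.lower().split():
        if word not in UNIT_DATABASE:
            raise KeyError(f"No recorded unit for {word!r} -- coverage gap!")
        units.append(UNIT_DATABASE[word])
    return units  # a real system would now concatenate and smooth the audio

print(select_units("A Linguistics major goes very well"))
```

Note how the `KeyError` case is exactly why the voice database has to be updated as new words (COVID, rizz, skibidi) enter the language.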
--- ### Then, Artificial Neural Networks arrived, and everything changed
--- ### The World's Worst Introduction to Neural Networks
--- ### For today, Neural Networks learn to transform input data into desired output data - Training involves presenting the network with both input and output - Then we adjust the network in between to make its output closer to the desired output - Then you feed in new input, and get new output - They are wildly complex, and wildly powerful - **Take COGS 181 to actually understand this!** --- ### [TacoTron2](https://arxiv.org/pdf/1712.05884) is a relatively simple neural TTS system - For now, we input text, we get speech - It's trained using speech with paired text - It takes text, and generates spectrograms, chunk-by-chunk, which can be turned into a waveform --- ### TacoTron 2
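The training idea just described (show the network input and desired output, then nudge it so its output gets closer) can be sketched with the smallest possible "network": a single weight and no library. This is an illustration of the loop, not TacoTron 2's actual code.

```python
# Minimal sketch of neural-network training: present input and desired
# output, then nudge the parameter to shrink the error.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # input x, desired output y = 2x

w = 0.0      # the "network's" single parameter, starting out ignorant
lr = 0.05    # learning rate: how big each nudge is

for epoch in range(200):
    for x, y in pairs:
        prediction = w * x        # forward pass: produce an output
        error = prediction - y    # compare against the desired output
        w -= lr * error * x       # nudge w in the direction that shrinks error

print(round(w, 3))  # w ends up close to 2.0: the network "learned" y = 2x
```

Real systems do exactly this, just with billions of weights and spectrograms instead of one number.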
--- ### This allows us to go from text to speech! - We feed in text, and we get back a wave, with no humans involved past making training data! - State-of-the-art models are getting very, very good!
--- ### The State of the Art is Advanced, but closed - Current state of the art models from ElevenLabs, OpenAI, Google, and Amazon are all closed and proprietary - If you want the best TTS in the world, it has to happen on somebody else's computer - Details are often not published and considered "trade secrets" - It's not currently possible to teach the state of the art in TTS! - ... and this should disturb us as a society --- ### Neural TTS can be trained using *any* voice - You can build a model from the ground up using any voice you'd like - If all your training data are from a bored Bostonian, you'll end up with a bored Bostonian TTS voice - Yet, we might want different voices... --- > All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
--- ### Humans know that content and style are different - Speech expresses linguistic 'content' - Phonemes, with ordering, and necessary pitch and timing for comprehension - This is 'all we need' to understand the utterances - There's also 'Style', which gives us lots of other details - 'Speaker' identity - Social components - Emotional content - Plus things like speed, emphasis, pitch 'tunes', sarcasm --- ### Couldn't we just abstract out the 'style' component and apply it to whatever Linguistic content we'd like? - Yes! --- ### Here's a Multi-Speaker Version of TacoTron 2
--- ### The results of this are... terrifying - ... and have given rise to 'Deepfake' voices --- ### Neural Styler Transfer
(TacoTron2)
(ElevenLabs) (Credit to Erick Amaro and Mia Khattar!) --- ### Multilingual Examples
(English)
(French)
(Spanish)
(Mandarin)
(Italian)
(Russian)
(Japanese) --- ### Prosody is still hard
--- ### ... but OMG, this thing can do arbitrary speech, in an arbitrary voice - ... and it's never had a tongue or phonics training, and doesn't actually know anything at all about mouths - Arguably, it doesn't know anything about English - ... although some systems use a language model too - *This is amazing!* - ... but it can model more complexity still --- ### Code Switching > It's like sometimes mezclo un poco de español con my English, cuando me siento particularmente spicy, y tengo curiosidad to know cómo la TTS handles it.
--- ### Wow. - Not only can exposure to data allow a deep neural network to learn to map written language into speech in one language - ... but it can do it for two languages - ... at once - ... with clear mixing of the two - **Espicy!** --- # What about Speech Perception? --- ### Speech Perception is *hard* - Speech is flapping bits of meat around in your head and throat while you expel air. - This creates tiny vibrations in the air - **Speech perception is turning the resulting vibrations in the air back into language** --- ### Let's focus on one of the really hard problems --- ## Vowel Perception --- ### What is a vowel? * A vowel is letting the voice resonate in the vocal tract while you move the tongue * If we change the position of the tongue, we change the resonances ---
--- ### English has lots of vowels /i/ - beet, see, seen, sear, seal /ɪ/ - bit, sit, tin, sill /ɛ/ - bet, set, sent, fair, sell /æ/ - bat, sat, pant, pal /ʌ/ - but, sun, pun, lull (ə in sofa, amount) /əɹ/ - bird, purr, earl, butter, clamor (this is often broken into two vowels!) /ɑ/ - bot, saw, star, Paul, pawn (cot*) /ɔ/ - corn /kɔɹn/, boy /bɔj/ (caught*) /ʊ/ - book, hood, puss /u/ - boot, who’d, loose, lure, loon --- ### Diphthongs, too! /ɔj/ - boy, soy, toy, join, oil, Roy /aj/ - buy, right, try, sigh, die, fire /ej/ - play, bay, may, ray, lay, trail /ow/ - boat, oat, wrote, pope, toll /aw/ - how, now, brown, cow, prow, louse ---
---
--- ### What do vowels sound like? * We talk about vowel quality in terms of "formants" * These are bands of the spectrum where the energy is strongest * The frequencies of these formants are how we distinguish vowels ---
---
--- ### So, different vowels are basically different formant patterns ---
Different American English vowels, as spoken by a male speaker
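If different vowels really are just different formant patterns, a toy "perceiver" can recognize them with nothing but a nearest-neighbour lookup over (F1, F2). The formant values below are rough textbook averages for an adult male speaker, used here purely for illustration.

```python
# Toy vowel "perception": match an incoming (F1, F2) pair against stored
# formant patterns. Values are approximate male-speaker averages.
VOWEL_FORMANTS = {        # vowel: (F1 Hz, F2 Hz)
    "i": (270, 2290),     # "beet"
    "æ": (660, 1720),     # "bat"
    "ɑ": (730, 1090),     # "bot"
    "u": (300, 870),      # "boot"
}

def classify_vowel(f1, f2):
    """Return the vowel whose stored formant pattern is closest (Euclidean)."""
    return min(VOWEL_FORMANTS,
               key=lambda v: (VOWEL_FORMANTS[v][0] - f1) ** 2 +
                             (VOWEL_FORMANTS[v][1] - f2) ** 2)

print(classify_vowel(310, 2200))  # closest stored pattern is /i/
```

This works neatly for one idealized speaker, which is exactly why the speaker variation on the next slides is such a problem.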
--- ### Speaker Vowel Space Variation * Different speakers produce different resonances, even for the “same” vowels * Vocal tracts can be shorter, longer, wider... --- ### Here's the weird part! - Different speakers have different formants, even for the “same” vowels! * Every person has a different set of basic vowel formant positions * This is called the speaker’s “vowel space” --- ### 'Idealized' Formants
--- ### Speaker Average Formants
--- ### Moment-to-moment Vowel Variation * Even the same speaker will have variation from moment to moment * We often move our tongues differently, changing the vowel's quality * For many, many reasons * This leads to constant and massive changes in vowel production --- ### Speaker Average Formants
--- ### Individual Token Formants
--- ### Individual Token Formants
---
--- ### Every person you've ever talked with has had different vowel formant patterns * ... and yet, we understand each other, somehow ---
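One hedged sketch of how listeners *might* cope: rescale each speaker's formants relative to that speaker's own vowel space (Lobanov z-scoring, a standard normalization in phonetics). The Hz values below are invented for illustration, not real measurements.

```python
# Lobanov-style speaker normalization: z-score each speaker's formants
# against that speaker's own mean and standard deviation.
def lobanov(formant_values):
    """Z-score a speaker's formant measurements against their own mean/sd."""
    n = len(formant_values)
    mean = sum(formant_values) / n
    sd = (sum((v - mean) ** 2 for v in formant_values) / n) ** 0.5
    return [(v - mean) / sd for v in formant_values]

# Two hypothetical speakers with very different absolute F1 values
tall_speaker  = [250, 500, 700]   # F1 (Hz) for /i/, /ɛ/, /ɑ/
small_speaker = [350, 700, 980]

print(lobanov(tall_speaker))
print(lobanov(small_speaker))
```

Despite very different raw frequencies, the two normalized vowel spaces line up almost exactly, which is one reason the "same" vowel can survive a change of vocal tract.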
--- ### Humans are able to cope with this easily, somehow - ... but what about computers? --- ## Automatic Speech Recognition --- ### ASR builds mappings from audio to text - We feed the system lots of text, and lots of corresponding audio - It learns the patterns of sound associated with a given text - Some use language models to give better predictions --- ### Vintage ASR used to require explicit speaker adaptation - Around the turn of the century, ASR software required personalization and 'training' - Setup began with "Read these texts aloud" - It would then process for a little while as it 'customized' to your voice - The model *simply wouldn't work* without this level of customization --- ### Then, Artificial Neural Networks arrived, and everything changed (again)
--- ### Whisper is OpenAI's Neural ASR Tool
--- ### ... but it is wildly effective - It works relatively quickly - On relatively low-end hardware - ... and most amazing of all... --- ### Whisper can get human-like performance in speech transcription*
--- ### Wow. - These ASR tools just 'listened' to a bunch of audio with matching text - They built representations of speech - They combined those with some knowledge of how text usually looks - ... and suddenly, it approaches human ability in speech perception* --- ### Wait, what were those asterisks? - About that... --- ### How many of you have had great experiences with speech-to-text? - How many think it's OK? - How many think it works terribly? --- ### These tools are great at recognizing speech *for the dialects that they were trained on* - ... but they're substantially weaker at adapting to different dialects - Many people are working on that - ... including your very own Oishani Bandopadhyay! - **So, humans still win!** --- ### Hooray!
--- Yet, there's still one incredible fact... --- ### Neural networks can be as good as humans at speech - ... without tongues, ears, grammatical knowledge, or human brains - Just like LLMs are the second thing *ever* that can do human language - *All it takes is lots of data and the right architecture* - Take LIGN 168 to understand more about speech tools and follow the progress! --- ### We're still figuring out what this means for Language - ... but one thing is for sure ... --- ### It's a fascinating time to study the Cognitive Science of Language! - ... and to have a Linguistics double major - (Sorry, couldn't resist) ---
Thank you!
--- ### Bonus Content: Formants are enough for speech perception in humans --- ### Let's listen to some sounds
### Now let's play all three at once!
### Does this help?
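The "play all three at once" demo above can be recreated at home: replace each formant with a pure sine tone and sum them ("sine-wave speech"). The frequencies below are rough F1-F3 values for /ɑ/, chosen for illustration; the result often still sounds vowel-like to listeners.

```python
import math, struct, wave

# Sine-wave speech sketch: one pure tone per formant, summed together.
RATE = 16000
FORMANTS = [730, 1090, 2440]   # illustrative F1-F3 for /ɑ/ (Hz)

def sine_wave_vowel(freqs, seconds=0.5, rate=RATE):
    """Sum one sine per formant frequency; return samples scaled to [-1, 1]."""
    n = int(seconds * rate)
    return [sum(math.sin(2 * math.pi * f * t / rate) for f in freqs) / len(freqs)
            for t in range(n)]

samples = sine_wave_vowel(FORMANTS)

# Write a playable mono 16-bit WAV file
with wave.open("sine_vowel.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(RATE)
    w.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))
```

Playing the three tones separately, then together, mirrors the bonus demo: the individual sines sound like beeps, but the mixture starts to resemble a vowel.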