# Building Speech Corpora

### Will Styler - LIGN 168

---

### Today's Plan

- Language modeling and statistical learning
- What is a speech corpus?
- File Formats
- Metadata
- Avoiding Corpus Bias

---

### We've sort of skipped a step

- How exactly are we building all these neat 'AI-powered tools'?
    - AI-based Noise Detection
    - Machine Learning based Voice Activity Detection
    - Codecs which classify audio as speech or non-speech

---

## Language Modeling and Statistical Learning

---

## Language Model

A probabilistic model which can predict and quantify the probability of a given word, construction, or sentence in a given type of language

---

### Let's be language models

- "Yesterday, we went fishing and ca____"
- "Pradeep is staying at a ________ hotel"
- "Although he claimed the $50,000 payment didn't affect his decision in the case, this payment was a bribe, for all ________"
- "I'm sorry, I can't go out tonight, I _________"
- "I'm sorry, I can't go out tonight, my _________"
- "Never ________"

---

### Every element of natural language processing depends on good language models

- We need to know what language actually looks like to be able to analyze it
- We need to know the patterns to be able to interpret them
- To find patterns, we need to look at the data we're modeling

---

### Language models are created by analyzing large amounts of language to find patterns

- Relationships between text and waveforms
    - What patterns of sound go with this pattern of letters/words?
- Relationships between waveforms and text
    - What patterns of words go with this pattern of sound?
- Relationships between different *kinds* of waveforms
    - Speech and Non-Speech sounds
    - Speech and Noise
- Relationships between elements of waveforms
    - What changes about f0 at different points in an utterance?

---

### (Machine) Learning is statistical in nature

- Algorithms look at many instances of a task being done, and generalize
- Sometimes it's *supervised*, where we give labeled data and let it find probabilities
- Sometimes it's *unsupervised*, where we ask the algorithm to group the examples on its own, and effectively intuit the structure

---

### Calculating Probability (well) requires large amounts of data!

- ... and the probabilities come *directly* from the data you give it
- So, we need to gather data to make this process work *(a tiny sketch on the next slide)*

---
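### Sidebar: probability straight from the data

To make that concrete, here's a minimal sketch of a bigram model in Python. The toy corpus is invented for illustration; real language models train on vastly more data with far richer architectures, but the core idea, counting what actually occurs, is the same:

```python
from collections import Counter, defaultdict

# Toy training data; a real model would see billions of words
corpus = "we went fishing and caught a fish . we went home and ate the fish .".split()

# Count which words follow which (bigram counts)
following = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    following[prev][word] += 1

def p_next(prev, word):
    """Estimate P(word | prev) directly from the counts."""
    counts = following[prev]
    return counts[word] / sum(counts.values()) if counts else 0.0

print(p_next("went", "fishing"))   # 0.5: 'went' is followed by 'fishing' once and 'home' once
print(p_next("went", "swimming"))  # 0.0: never seen, so the model can't predict it
```

Note the second result: anything the model never saw gets zero probability, because the probabilities come *directly* from the data you give it.

---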
### "Data! Data! Data! All the rest is bullshit!"

[Source](https://youtu.be/2-SPH9hIKT8?si=ynnJ1q-8xXoG01vT)

---

## What is a speech corpus?

---

### A corpus isn't super complicated

- It's a bunch of language data pulled together into one place
- Generally includes metadata
- For speech, often includes transcripts

---

### Sample Public Speech Corpora

- **[Buckeye](https://buckeyecorpus.osu.edu/php/corpusInfo.php)**
    - 1 hour per speaker for 40 speakers of different ages and genders
- **[Callhome](https://catalog.ldc.upenn.edu/LDC97S42)**
    - 60 hours of transcribed phone calls from 1997
- **[MuST-C](https://mt.fbk.eu/must-c/)**
    - 230+ hours of translated English TED talks per language for 14 languages
- **[VoxPopuli](https://aclanthology.org/2021.acl-long.80/)**
    - 400,000 hours of unlabeled data across 23 languages
    - 17,300 hours of labeled data in 15 languages
- **[People's Speech](https://mlcommons.org/datasets/peoples-speech/)**
    - 30,000 hours of transcribed English speech

---

### Will Podcast Corpus

- Around 280 hours of mp3s from my podcasts across five years
- 96 hours of it is automatically transcribed and time-aligned
- *Contact me if this is useful to you*

---

### We should assume much larger corpora exist privately

- Mistaken OK Google and Siri activations
- Voicemails from things like Google Voice
- Call audio from call centers
- Every YouTube video, TikTok, or Instagram Reel
- Data from widespread phone surveillance

---

### ... but at the core, speech corpora are large buckets of speech data

- With the data that allows it to be useful
- In a format that doesn't suck
- Speaking of which...

---

## Corpus Files

---

### What should files in a speech corpus look like?

- Reasonable chunks
- Reasonable and durable formats
- Easy accessibility

---

### Reasonable Chunks

- Too big gets ridiculous
    - Try working with a 400-hour FLAC file...
- Too small hinders your ability to do work
    - 'We saved the corpus in one wav per phoneme'
- You want to split in natural chunks that fit the goal
    - Single-speaker utterances might make sense for ASR
    - Multi-speaker conversations might make sense for testing speaker identification
- It's often easier to split a corpus up than to squish it back together

---

### Reasonable Formats

- You want archival, durable formats
    - So, don't depend on a new sound format from Google that they'll kill in 8 months
    - Think about whether the data could still be accessed in 20 years
- You generally want a format that can be easily converted to something else
- Your corpus should be recorded at as high a quality as you can manage
    - You can always reduce quality, you can't add it
- Associated text should be in a durable format (e.g. plaintext)

---

### To Compress or not to Compress

- What are the benefits and downsides of uncompressed audio (e.g. WAV)?
- What are the benefits and downsides of lossless compression (e.g. FLAC)?
- What are the benefits and downsides of lossy compression (e.g. Opus, mp3)?

---

### Many speech corpora are compressed

- Reducing storage costs by 80% is *very* tempting
- Bandwidth costs are huge for shipping uncompressed or lossless files
    - VoxPopuli ships as Ogg Vorbis, 16000Hz, 16-bit, mono
    - It's still 6.4 terabytes (!!!)
- *Many corpora are built on already-compressed data!*
    - Saving an Opus file recorded over Bluetooth to WAV is *dumb*

---
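### Sidebar: seeing the tradeoff yourself

A quick way to feel these size differences is to convert one recording and compare. A sketch, assuming `ffmpeg` (built with libopus) is installed and that `recording.wav` is a placeholder for a file of your own:

```python
import subprocess
from pathlib import Path

src = Path("recording.wav")  # placeholder: any uncompressed recording

# Lossless: every original sample is recoverable from the FLAC
subprocess.run(["ffmpeg", "-y", "-i", str(src), "recording.flac"], check=True)

# Lossy: much smaller, but the discarded detail is gone forever
subprocess.run(["ffmpeg", "-y", "-i", str(src),
                "-c:a", "libopus", "-b:a", "32k", "recording.opus"], check=True)

for f in (src, Path("recording.flac"), Path("recording.opus")):
    print(f"{f.name}: {f.stat().st_size / 1e6:.2f} MB")
```

On typical speech, FLAC lands around half the WAV size and Opus at a small fraction of it, which is exactly why corpora like VoxPopuli ship compressed.

---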
### Your corpus should match the data it'll be working with

- If you're going to be doing ASR on 16000Hz, 16-bit, mono data, then you should train your model on that
- If you want to output 44100Hz, 16-bit audio, then you'd best input it too
- If you're working on spontaneous speech, don't feed in audiobook recordings
- Don't give YouTube data to a newsreader TTS
- *The data you're teaching the model from should match the data the model will work on!*
    - You don't necessarily get better results if you start with cleaner data

---

### Speech files aren't enough

- Well, maybe for some tasks
- ... but you're probably going to need...

---

## Corpus Metadata

---

### Metadata is just data about data

- Basically any kind of data about your data can be metadata

---

### Technical Metadata

- Sampling Rate and Bit Depth
- Filetype and Codec(s)
- File Duration
- Number of channels
- Unique identifier for the file

---

### Practical Metadata

- Recording date/time
- Device type used for recording
    - This helps control for microphone variation, etc.
- License information (e.g. who can use the data for what)
    - Also information about what the data can and can't be used for
    - Because privacy should matter

---

### Linguistically Useful Metadata

- Speaker/Device Unique Identifiers
- Geocoding (e.g. where was the data recorded)
- Language and Dialect information
    - Not just language spoken, but language background
- Positionality of the Speaker (e.g. age, gender identity, race, sexuality, etc.)
- Speaker Diarization
    - "When is which person talking?"

---

### Secondary Datastreams

- "This other file has an English transcript with timestamps"
- "This other file has an Electroglottographic recording made at the same time"
- "This other file has the video associated with it"
- "This other file has a spoken translation of this in Czech"
- "This other file has a Spanish language transcript"

---

### Other Task-Specific Annotations

- "This was a support call which was rated highly by the customer"
- "This person was happy/angry/sad during this recording"
- "This person was diagnosed with Parkinson's disease"
- "This person is an Arabic speaker from Afghanistan, not Saudi Arabia"

---
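### Sidebar: a metadata 'sidecar' file

One common pattern is a small JSON file stored next to each audio file. Here's a sketch using only the Python standard library; the filename, IDs, and field names are invented placeholders, not any standard schema:

```python
import json
import wave
from pathlib import Path

audio = Path("utt_0001.wav")  # placeholder filename

# Technical metadata can be read straight from the WAV header
with wave.open(str(audio), "rb") as w:
    meta = {
        "file_id": audio.stem,
        "sampling_rate_hz": w.getframerate(),
        "bit_depth": w.getsampwidth() * 8,
        "channels": w.getnchannels(),
        "duration_s": w.getnframes() / w.getframerate(),
    }

# Practical and linguistic metadata has to be recorded by a human
meta.update({
    "speaker_id": "spk_042",
    "language": "eng",
    "dialect": "California English",
    "recording_device": "USB headset",
    "license": "CC BY-NC 4.0",
})

# Write the sidecar next to the audio: utt_0001.json
audio.with_suffix(".json").write_text(json.dumps(meta, indent=2))
```

---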
### With the sound files, and the metadata, you have a corpus!

- You can use it to train whatever models you'd like!
- ... but wait, how do I know what data should go into the corpus?

---

## Corpus Balance, Bias, and Privacy

---

### Models are trained on corpora

- They reflect the data you've given them
- They reflect **only** the data you've given them
- Many models struggle to 'generalize' to new types of data they've never seen before
- You need to make sure your corpus reflects your task!

---

### What kind of data would be needed for...

- A Starbucks ordering 'AI' which turns your spoken order into an online transaction?
- An algorithm which detects breakups on Discord calls to better target ads afterwards?
- A voicemail transcription service?
- A Text-to-Speech engine which will run a 24/7 K-pop news stream?
- A voice-controlled elevator panel?

---

### Models are biased by their corpora

- The patterns in your corpus determine how your model will interact with the world
    - "When you feed the entire internet into a language model, you get back a racist language model"
- Your model will be most effective with the types of data most represented in the corpus
    - It's easy for a model to overfit to a particular kind of data if it's overrepresented
- Underrepresented people will be underserved by the model
    - This is a form of *sampling bias*
- Broad performance is generally improved by sampling widely *(see the audit sketch on the next slide)*

---
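### Sidebar: auditing for sampling bias

You can catch some of this before training by tallying your own metadata. A sketch reusing the JSON sidecars from earlier; the directory and field names are this sketch's assumptions, not a standard:

```python
import json
from collections import Counter
from pathlib import Path

# Tally hours of audio per dialect across the whole corpus
hours = Counter()
for sidecar in Path("corpus/").glob("*.json"):  # placeholder corpus directory
    meta = json.loads(sidecar.read_text())
    hours[meta["dialect"]] += meta["duration_s"] / 3600

total = sum(hours.values())
for dialect, h in hours.most_common():
    print(f"{dialect:30s} {h:8.1f} h  ({h / total:.1%})")

# Groups at the bottom of this list are the groups your model
# will likely serve worst
```

The same tally works for age, gender, device type, or anything else in your metadata.

---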
### What would be the downside of training your model on a corpus of...

- Will Styler's podcasts
- Outputs from a state-of-the-art text-to-speech engine
- News readers on 24-hour news channels
- Apple FaceTime call recordings
- Randomly sampled people representing exactly the racial, ethnic, linguistic, and gender statistics of the English-speaking world

---

### What would be the downside of training your model on a corpus of...

- YouTube videos
- Saved recordings from a central telecommunications facility covering every phone call to an international number
    - [This was a thing](https://en.wikipedia.org/wiki/NSA_warrantless_surveillance_(2001%E2%80%932007))
- Recordings from a live microphone hidden in a bush between benches at a public shopping mall
    - This is legal in many states, possibly even CA
    - ... but wait, is that ethical?

---

### Ethical Concerns in Corpus Building

- More data is generally more better
- Yet, language data come from real people
- Let's assume people give fully informed consent to their data being collected and used
    - *Apple, Meta, and Google have left the chat*
- How can we be careful with sensitive data?

---

### Types of Sensitive Data

- **Personally Sensitive Data** can cause financial, social, or physical harm to individual people if in the wrong hands
- **Organizationally Sensitive Data** can cause financial, reputational, or practical harm to a company, group, or institution if in the wrong hands
- Some data can be both!

---

### Deidentifying speech data?

- Speech data may always be identifiable
    - Interestingly, IRBs differ on whether this is true
- "Yeah, I'm going to Vons off Regents to pick up my fluticasone after I teach LIGN 168"
- "Then I'm going to go indulge in my deep, dark Haribo addiction"
- "Oh, damn, I'll do that as soon as I finish moving those boxes of confidential records into the secret warehouse"

---

### What other ethical concerns are present in making speech corpora?

---

### Wrapping up

- Language models learn from stored data
- Speech Corpora are very easy
    - You collect data, metadata, and other datastreams in a sane format
- Speech Corpora are very hard
    - You need to store a great deal of data and organize it well
    - You need to balance your corpus effectively
    - ... and you need to do it all with an ethical approach

---

### Next time

- I guess it's time to make computers understand human speech
- Should be easy, right?

---

Thank you!