# Building Speech Corpora

### Will Styler - LIGN 168

---

### Today's Plan

- Language modeling and statistical learning
- What is a speech corpus?
- File Formats
- Metadata
- Avoiding Corpus Bias

---

### We've sort of skipped a step

- How exactly are we building all these neat 'AI-powered tools'?
    - AI-based Noise Detection
    - Machine Learning based Voice Activity Detection
    - Codecs which classify audio as speech or non-speech

---

## Language Modeling and Statistical Learning

---

## Language Model

A probabilistic model which can predict and quantify the probability of a given word, construction, or sentence in a given type of language

---

### Let's be language models

- "Yesterday, we went fishing and ca____"
- "Pradeep is staying at a ________ hotel"
- "Although he claimed the $50,000 payment didn't affect his decision in the case, this payment was a bribe, for all ________"
- "I'm sorry, I can't go out tonight, I _________"
- "I'm sorry, I can't go out tonight, my _________"
- "Never ________"

---

### Every element of natural language processing depends on good language models

- We need to know what language actually looks like to be able to analyze it
- We need to know the patterns to be able to interpret them
- To find patterns, we need to look at the data we're modeling

---

### Language models are created by analyzing large amounts of language to find patterns

- Relationships between text and waveforms
    - What patterns of sound go with this pattern of letters/words?
- Relationships between waveforms and text
    - What patterns of words go with this pattern of sound?
- Relationships between different *kinds* of waveforms
    - Speech and Non-Speech sounds
    - Speech and Noise
- Relationships between elements of waveforms
    - What changes about f0 at different points in an utterance?

---

### (Machine) Learning is statistical in nature

- Algorithms look at many instances of a task being done, and generalize
- Sometimes it's *supervised*, where we give labeled data and let it find probabilities
- Sometimes it's *unsupervised*, where we ask the algorithm to group the examples on its own, and effectively intuit the structure

---

### Calculating Probability (well) requires large amounts of data!

- ... and the probabilities come *directly* from the data you give it
- So, we need to gather data to make this process work *(a tiny sketch on the next slide)*

---
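### Sidebar: probability straight from the data

To make that concrete, here's a minimal sketch of a bigram model in Python. The toy corpus is invented for illustration; real language models train on vastly more data with far richer architectures, but the core idea, counting what actually occurs, is the same:

```python
from collections import Counter, defaultdict

# Toy training data; a real model would see billions of words
corpus = "we went fishing and caught a fish . we went home and ate the fish .".split()

# Count which words follow which (bigram counts)
following = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    following[prev][word] += 1

def p_next(prev, word):
    """Estimate P(word | prev) directly from the counts."""
    counts = following[prev]
    return counts[word] / sum(counts.values()) if counts else 0.0

print(p_next("went", "fishing"))   # 0.5: 'went' is followed by 'fishing' once and 'home' once
print(p_next("went", "swimming"))  # 0.0: never seen, so the model can't predict it
```

Note the second result: anything the model never saw gets zero probability, because the probabilities come *directly* from the data you give it.

---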
### "Data! Data! Data! All the rest is bullshit!"

[Source](https://youtu.be/2-SPH9hIKT8?si=ynnJ1q-8xXoG01vT)

---

## What is a speech corpus?

---

### A corpus isn't super complicated

- It's a bunch of language data pulled together into one place
- Generally includes metadata
- For speech, often includes transcripts

---

### Sample Public Speech Corpora

- **[Buckeye](https://buckeyecorpus.osu.edu/php/corpusInfo.php)**
    - 1 hour per speaker for 40 speakers of different ages and genders
- **[Callhome](https://catalog.ldc.upenn.edu/LDC97S42)**
    - 60 hours of transcribed phone calls from 1997
- **[MuST-C](https://mt.fbk.eu/must-c/)**
    - 230+ hours of translated English TED talks per language for 14 languages
- **[VoxPopuli](https://aclanthology.org/2021.acl-long.80/)**
    - 400,000 hours of unlabeled data across 23 languages
    - 17,300 hours of labeled data in 15 languages
- **[People's Speech](https://mlcommons.org/datasets/peoples-speech/)**
    - 30,000 hours of transcribed English speech

---

### Will Podcast Corpus

- Around 280 hours of mp3s from my podcasts across five years
- 96 hours of it is automatically transcribed and time-aligned
- *Contact me if this is useful to you*

---

### We should assume much larger corpora exist privately

- Mistaken OK Google and Siri activations
- Voicemails from things like Google Voice
- Call audio from call centers
- Every YouTube video, TikTok, or Instagram Reel
- Data from widespread phone surveillance

---

### ... but at the core, speech corpora are large buckets of speech data

- With the data that allows it to be useful
- In a format that doesn't suck
- Speaking of which...

---

## Corpus Files

---

### What should files in a speech corpus look like?

- Reasonable chunks
- Reasonable and durable formats
- Easy accessibility

---

### Reasonable Chunks

- Too big gets ridiculous
    - Try working with a 400-hour FLAC file...
- Too small hinders your ability to do work
    - 'We saved the corpus in one wav per phoneme'
- You want to split in natural chunks that fit the goal
    - Single-speaker utterances might make sense for ASR
    - Multi-speaker conversations might make sense for testing speaker identification
- It's often easier to split a corpus up than to squish it back together

---

### Reasonable Formats

- You want archival, durable formats
    - So, don't depend on a new sound format from Google that they'll kill in 8 months
    - Think about whether the data could still be accessed in 20 years
- You generally want a format that can be easily converted to something else
- Your corpus should be recorded at as high a quality as you can manage
    - You can always reduce quality, you can't add it
- Associated text should be in a durable format (e.g. plaintext)

---

### To Compress or not to Compress

- What are the benefits and downsides of uncompressed audio (e.g. WAV)?
- What are the benefits and downsides of lossless compression (e.g. FLAC)?
- What are the benefits and downsides of lossy compression (e.g. Opus, mp3)?

---

### Many speech corpora are compressed

- Reducing storage costs by 80% is *very* tempting
- Bandwidth costs are huge for shipping uncompressed or lossless files
    - VoxPopuli ships as Ogg Vorbis, 16000Hz, 16-bit, mono
    - It's still 6.4 terabytes (!!!)
- *Many corpora are built on already-compressed data!*
    - Saving an Opus file recorded over Bluetooth to WAV is *dumb*

---
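### Sidebar: seeing the tradeoff yourself

A quick way to feel these size differences is to convert one recording and compare. A sketch, assuming `ffmpeg` (built with libopus) is installed and that `recording.wav` is a placeholder for a file of your own:

```python
import subprocess
from pathlib import Path

src = Path("recording.wav")  # placeholder: any uncompressed recording

# Lossless: every original sample is recoverable from the FLAC
subprocess.run(["ffmpeg", "-y", "-i", str(src), "recording.flac"], check=True)

# Lossy: much smaller, but the discarded detail is gone forever
subprocess.run(["ffmpeg", "-y", "-i", str(src),
                "-c:a", "libopus", "-b:a", "32k", "recording.opus"], check=True)

for f in (src, Path("recording.flac"), Path("recording.opus")):
    print(f"{f.name}: {f.stat().st_size / 1e6:.2f} MB")
```

On typical speech, FLAC lands around half the WAV size and Opus at a small fraction of it, which is exactly why corpora like VoxPopuli ship compressed.

---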
### Your corpus should match the data it'll be working with

- If you're going to be doing ASR on 16000Hz, 16-bit, mono data, then you should train your model on that
- If you want to output 44100Hz, 16-bit audio, then you'd best input it too
- If you're working on spontaneous speech, don't feed in audiobook recordings
- Don't give YouTube data to a newsreader TTS
- *The data you're teaching the model from should match the data the model will work on!*
    - You don't necessarily get better results if you start with cleaner data

---

### Speech files aren't enough

- Well, maybe for some tasks
- ... but you're probably going to need...

---

## Corpus Metadata

---

### Metadata is just data about data

- Basically any kind of data about your data can be metadata

---

### Technical Metadata

- Sampling Rate and Bit Depth
- Filetype and Codec(s)
- File Duration
- Number of channels
- Unique identifier for the file

---

### Practical Metadata

- Recording date/time
- Device type used for recording
    - This helps control for microphone variation, etc.
- License information (e.g. who can use the data for what)
    - Also information about what the data can and can't be used for
    - Because privacy should matter

---

### Linguistically Useful Metadata

- Speaker/Device Unique Identifiers
- Geocoding (e.g. where was the data recorded)
- Language and Dialect information
    - Not just language spoken, but language background
- Positionality of the Speaker (e.g. age, gender identity, race, sexuality, etc.)
- Speaker Diarization
    - "When is which person talking?"

---

### Secondary Datastreams

- "This other file has an English transcript with timestamps"
- "This other file has an Electroglottographic recording made at the same time"
- "This other file has the video associated with it"
- "This other file has a spoken translation of this in Czech"
- "This other file has a Spanish language transcript"

---

### Other Task-Specific Annotations

- "This was a support call which was rated highly by the customer"
- "This person was happy/angry/sad during this recording"
- "This person was diagnosed with Parkinson's disease"
- "This person is an Arabic speaker from Afghanistan, not Saudi Arabia"

---
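### Sidebar: a metadata 'sidecar' file

One common pattern is a small JSON file stored next to each audio file. Here's a sketch using only the Python standard library; the filename, IDs, and field names are invented placeholders, not any standard schema:

```python
import json
import wave
from pathlib import Path

audio = Path("utt_0001.wav")  # placeholder filename

# Technical metadata can be read straight from the WAV header
with wave.open(str(audio), "rb") as w:
    meta = {
        "file_id": audio.stem,
        "sampling_rate_hz": w.getframerate(),
        "bit_depth": w.getsampwidth() * 8,
        "channels": w.getnchannels(),
        "duration_s": w.getnframes() / w.getframerate(),
    }

# Practical and linguistic metadata has to be recorded by a human
meta.update({
    "speaker_id": "spk_042",
    "language": "eng",
    "dialect": "California English",
    "recording_device": "USB headset",
    "license": "CC BY-NC 4.0",
})

# Write the sidecar next to the audio: utt_0001.json
audio.with_suffix(".json").write_text(json.dumps(meta, indent=2))
```

---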
### With the sound files, and the metadata, you have a corpus!

- You can use it to train whatever models you'd like!
- ... but wait, how do I know what data should go into the corpus?

---

## Corpus Balance, Bias, and Privacy

---

### Models are trained on corpora

- They reflect the data you've given them
- They reflect **only** the data you've given them
- Many models struggle to 'generalize' to new types of data they've never seen before
- You need to make sure your corpus reflects your task!

---

### What kind of data would be needed for...

- A Starbucks ordering 'AI' which turns your spoken order into an online transaction?
- An algorithm which detects breakups on Discord calls to better target ads afterwards?
- A voicemail transcription service?
- A Text-to-Speech engine which will run a 24/7 K-pop news stream?
- A voice-controlled elevator panel?

---

### Models are biased by their corpora

- The patterns in your corpus determine how your model will interact with the world
    - "When you feed the entire internet into a language model, you get back a racist language model"
- Your model will be most effective with the types of data most represented in the corpus
    - It's easy for a model to overfit to a particular kind of data if it's overrepresented
- Underrepresented people will be underserved by the model
    - This is a form of *sampling bias*
- Broad performance is generally improved by sampling widely *(see the audit sketch on the next slide)*

---
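### Sidebar: auditing for sampling bias

You can catch some of this before training by tallying your own metadata. A sketch reusing the JSON sidecars from earlier; the directory and field names are this sketch's assumptions, not a standard:

```python
import json
from collections import Counter
from pathlib import Path

# Tally hours of audio per dialect across the whole corpus
hours = Counter()
for sidecar in Path("corpus/").glob("*.json"):  # placeholder corpus directory
    meta = json.loads(sidecar.read_text())
    hours[meta["dialect"]] += meta["duration_s"] / 3600

total = sum(hours.values())
for dialect, h in hours.most_common():
    print(f"{dialect:30s} {h:8.1f} h  ({h / total:.1%})")

# Groups at the bottom of this list are the groups your model
# will likely serve worst
```

The same tally works for age, gender, device type, or anything else in your metadata.

---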
### What would be the downside of training your model on a corpus of...

- Will Styler's podcasts
- Outputs from a state-of-the-art text-to-speech engine
- News readers on 24-hour news channels
- Apple FaceTime call recordings
- Randomly sampled people representing exactly the racial, ethnic, linguistic, and gender statistics of the English-speaking world

---

### What would be the downside of training your model on a corpus of...

- YouTube videos
- Saved recordings from a central telecommunications facility covering every phone call to an international number
    - [This was a thing](https://en.wikipedia.org/wiki/NSA_warrantless_surveillance_(2001%E2%80%932007))
- Recordings from a live microphone hidden in a bush between benches at a public shopping mall
    - This is legal in many states, possibly even CA
    - ... but wait, is that ethical?

---

### Ethical Concerns in Corpus Building

- More data is generally more better
- Yet, language data come from real people
- Let's assume people give fully informed consent to their data being collected and used
    - *Apple, Meta, and Google have left the chat*
- How can we be careful with sensitive data?

---

### Types of Sensitive Data

- **Personally Sensitive Data** can cause financial, social, or physical harm to individual people if in the wrong hands
- **Organizationally Sensitive Data** can cause financial, reputational, or practical harm to a company, group, or institution if in the wrong hands
- Some data can be both!

---

### Deidentifying speech data?

- Speech data may always be identifiable
    - Interestingly, IRBs differ on whether this is true
- "Yeah, I'm going to Vons off Regents to pick up my fluticasone after I teach LIGN 168"
- "Then I'm going to go indulge in my deep, dark Haribo addiction"
- "Oh, damn, I'll do that as soon as I finish moving those boxes of confidential records into the secret warehouse"

---

### What other ethical concerns are present in making speech corpora?

---

### Wrapping up

- Language models learn from stored data
- Speech Corpora are very easy
    - You collect data, metadata, and other datastreams in a sane format
- Speech Corpora are very hard
    - You need to store a great deal of data and organize it well
    - You need to balance your corpus effectively
    - ... and you need to do it all with an ethical approach

---

### Next time

- I guess it's time to make computers understand human speech
- Should be easy, right?

---

Thank you!