# Cepstra, Pitch Tracking and Voice Activity Detection ### Will Styler - LIGN 168 --- ### Last time, we talked about estimating the filter - LPC gives us a sense of how the vocal tract is *filtering* the vocal folds - We estimate the filter's shape, independent of the voicing (the 'excitation signal') - We discussed how LPC can be used to deconstruct and then reconstitute the voice - **... but what about estimating the source?** --- ### The Excitation Signal - To do LPC synthesis and resynthesis, we need two pieces of information about the source - What is the pitch/fundamental frequency/f0 of the voice? - Where (during the duration of the file) is there voicing to reconstruct? - "From 253-284 ms, there's no voicing" - Are there any irregular pulses? - "There's creak here, so, drop single pulses here, here, and here" - We can model this, effectively, as a series of numbers indicating f0 over time, with 'zero' as an option - With some noise modeling for not-so-voicing-like components ---
--- ### To get this, we need to be able to reliably find f0! - ... and to find where somebody's talking at all - So, how's that happen? - ... and what kind of representations of speech are used to generate it? --- ### Today's Plan - The Cepstral Domain - Mel Frequency Cepstral Coefficients (MFCCs) - Voice Activity Detection - How do we determine the f0 of the voice? --- ## The Cepstral Domain --- ### We've already thought a lot about the time domain - Amplitude varying over time gives us the 'time domain' - Waveforms view signals in the time domain - **Very good** for understanding pressure, for playback, and otherwise --- ### Then we went into the frequency domain - "What are the component frequencies and their time courses of this signal?" - Fourier Transform turns a time-domain signal into a frequency domain signal - **Very good** for seeing frequency components, spectral balance, and resonance --- ### Now we're going into the cepstral domain - "What are the dominant patterns of periodicity? What rates of change matter most?" - We do a *cepstral* transformation to see the cepstral domain - **Very good** for seeing periodicity and patterns of change --- ### To do a Cepstral Transform - First, do Fourier analysis to get a spectrum of a signal - Then, log **each individual amplitude measure in the power spectrum** - This de-emphasizes larger values, and has other mathematically convenient purposes - Then, do a Fourier analysis *of the log-transformed Fourier analysis*! - This captures the periodicities *of the spectrum itself* ---
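### The Cepstral Transform in Code

The recipe above fits in a few lines of NumPy. This is a minimal sketch, assuming a mono signal as a NumPy array; the small epsilon guarding `log(0)` is an implementation convenience, not part of the definition:

```python
import numpy as np

def real_cepstrum(signal):
    """Spectrum of the (log) spectrum: signal -> cepstrum."""
    spectrum = np.fft.rfft(signal)               # 1. Fourier analysis
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # 2. log each amplitude
    return np.fft.irfft(log_mag)                 # 3. Fourier analysis of the log spectrum
```

The result is indexed by quefrency (here, in samples): a strongly periodic voice signal shows a peak at the quefrency corresponding to its glottal period.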
--- ### This results in a 'cepstrum' - Cepstra show amplitude by **quefrency** - This can be shown on a **cepstrogram** - Filtering in the cepstral domain is **liftering** - Spectra separate components by frequency - Cepstra separate components by rate of change - They're not straightforward to interpret for humans ---
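### Liftering in Code

Liftering is just filtering applied in the cepstral domain. A common use is 'low-time' liftering: keep only the low quefrencies (the slowly-varying spectral envelope, i.e. the filter) and zero out the high quefrencies (the rapidly-varying harmonic structure, i.e. the source). A minimal sketch, assuming a mono NumPy signal; the cutoff value is an illustrative choice:

```python
import numpy as np

def lifter_envelope(signal, cutoff=30):
    """Low-time liftering: recover a smooth spectral envelope via the cepstrum."""
    spectrum = np.fft.rfft(signal)
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    ceps = np.fft.irfft(log_mag)            # into the cepstral domain
    ceps[cutoff:-cutoff] = 0.0              # the lifter: drop high quefrencies
    return np.fft.rfft(ceps).real           # back to a (log-magnitude) spectrum
```

The returned curve follows the broad shape of the log spectrum, with the fine harmonic ripple smoothed away.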
--- ### This is a very common signal processing tool - Excellent when what you care about boils down to a change in rate of change - ... and is very important for... --- ## Mel-Frequency Cepstral Coefficients (MFCCs) --- ### Mel-Frequency Cepstral Coefficients turn sounds into matrices - Each frame is put into the frequency domain and scaled to Mels - Then it's put into the quefrency domain, to emphasize time differences - Then, we reduce the dimensionality using DCT --- ### Why Mels? --- ### Our perception of frequency is non-linear - This is partly due to hearing anatomy - This is partly evolution's fault too - We care primarily about ~80-4000 Hz! --- ### ... but Hertz doesn't capture this at all - Hertz captures cycles per second - ... but not our perception of frequency! --- ### Our perception of frequency is weird! - We've already talked about auditory masking - Two sounds within the same 'critical band' seem like one sound - We also perceive jumps in frequency non-linearly --- ### Do we hear frequency in a linear and reliable way? Is the jump in file A the same as in file B? A.
B.
- **Both of these are a 200Hz Jump!** --- Is the jump in file A the same as in file B? A.
B.
- **Both of these are a 100Hz Jump** --- ### So, our perception of frequency isn't Hertz-like - **We want a perceptual scale for hearing!** --- ### Mel Scale - Maps numerical pitch measures to human perceptions of changes in pitch - People will tell you that a sound's pitch is 'half as high' at x/2 mels relative to x mels - Mel is the dominant perceptual frequency scale in use --- ### Mel Scale
--- ### Mel Formula - Mel(f) = 1125 \* ln(1+f/700) - f = Frequency in Hertz - This is a natural log - There are multiple formulae! --- ### MFCCs approximate human hearing better - Mels scale to the frequencies which matter to humans - ... and hopefully that helps the computers find the most important elements of speech, too! --- ### Mel-Frequency Cepstral Coefficient Process, Continued - Each frame is put into the frequency domain and scaled to Mels - Then it's put into the quefrency domain, to emphasize time differences - Then, we reduce the dimensionality using DCT --- ### MFCCs
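--- ### The Mel Formula in Code

The formula is a one-liner in each direction. A sketch using the natural-log variant from the slide (other constants appear in the literature, which is why the numbers vary slightly between toolkits):

```python
import math

def hz_to_mel(f):
    """Hertz -> mels: Mel(f) = 1125 * ln(1 + f/700)."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mels -> Hertz."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)
```

With this variant, 1000 Hz comes out near 1000 mels, the scale's traditional anchor point.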
--- ### Discrete Cosine Transform (DCT) - DCT breaks a complex curve into a weighted sum of many cosine functions - "Which frequency of cosine functions would we need, and with what power, in order to describe the shape of this curve?" - Each cosine we 'need' is a coefficient - More coefficients mean more detail, but also more data - Some coefficients *really won't matter at all*, and can be discarded or set to zero --- ### DCT for MFCC - "Let's find the most meaningful N coefficients which describe what's going on in the cepstrum *for each frame*" - We get a set of numbers which describe the frame-by-frame cepstral shape of the sound - With some of the least important coefficients thrown away - Usually the first 12-13 coefficients are used in MFCCs - The highest frequencies aren't as relevant in the cepstral domain --- ### We're going to see DCT a LOT this quarter - Particularly next week - [You should watch Computerphile's Video on DCT](https://www.youtube.com/watch?v=Q2aEzeMDHMA) - This will also prepare you for our discussion on compression and codecs --- ### MFCC Input
--- ### MFCC Output
--- ### So, the sound becomes a matrix of features - Many rows (representing time during the signal) - N columns (usually 13) with coefficients which **tell us the spectral shape** - It's black-boxy, but we don't care. - We've created a Matrix ---
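### The MFCC Pipeline in Code

The frame-to-matrix process above can be sketched end-to-end. This is a teaching sketch, not a production implementation: the filter count, FFT size, and 13-coefficient cutoff are conventional but adjustable, and the triangular mel filterbank follows the common textbook recipe (the DCT here is unnormalized for simplicity):

```python
import numpy as np

def hz_to_mel(f):
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters, evenly spaced on the mel scale, up to Nyquist."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fbank

def dct_ii(v):
    """Unnormalized DCT-II: how much of each cosine frequency is in v."""
    n = len(v)
    i = np.arange(n)
    return np.array([np.sum(v * np.cos(np.pi / n * (i + 0.5) * j)) for j in range(n)])

def mfcc_frame(frame, fs, n_filters=26, n_coeffs=13):
    """One frame -> its first 13 mel-frequency cepstral coefficients."""
    power = np.abs(np.fft.rfft(frame)) ** 2                  # power spectrum
    mel_energy = mel_filterbank(n_filters, len(frame), fs) @ power
    log_mel = np.log(mel_energy + 1e-10)                     # log, as in the cepstrum
    return dct_ii(log_mel)[:n_coeffs]                        # keep the lowest coefficients
```

Run per frame over a whole file, and you get exactly the matrix described above: one row per frame, 13 columns of spectral shape.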
--- ### Nerdy Aside: MFCC vs. PCA - In practice, MFCC is not so different from doing Principal Component Analysis on a log spectrum - Both find dominant patterns of variation in the spectrum - They're not exactly equivalent, but the results are very close - This helps with intuition, if you're used to PCA! --- ### Both MFCC and LPC represent the spectral shape - LPC tries explicitly to model only the filter - MFCC models everything, agnostically - MFCC is also much better for simplifying the signal than LPC --- ### Now we've got a matrix representing the sound - ... which captures frequency information, according to our perceptual needs - This will be useful for many things, including... --- ## Voice Activity Detection --- ### Not every sound file contains speech - ... sadly - One important task before we start modeling speech is detecting whether it's present or not - This is referred to as 'Voice Activity Detection' or VAD - "In this file, for every frame, tell me whether there's voicing or not" --- ### The Simplicity/Complexity Tradeoff in Speech Tools - There is a 'dumb and cheap' way to do a task - Poor results, but extremely fast computation and low data requirements - Balanced approaches between complexity and accuracy often exist - Reasonable results, at cost of complexity and required training data - Often, these are the top performing 'legacy' approaches - Expensive and accurate approaches exist, but require lots of compute or *massive* training data - Neural approaches often fall into this category - **We'll see this tradeoff again and again this quarter!** --- ### Dumb and Cheap VAD methods - These work poorly, but cheaply! - **Signal Intensity**: "Is this signal loud enough to be voice?" - This is useful for detecting speech vs. silence - **Spectral Slope**: "Is there more energy at the lower frequencies than higher frequencies?" - Speech generally has a falling spectral slope - **Zero-Crossing Rate**: "How often does this signal cross zero?"
- Speech has different rates (due to expected f0 ranges) than non-speech ---
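### A 'Dumb and Cheap' VAD in Code

The intensity-based idea above is simple enough to sketch directly. This is a minimal illustration, assuming a mono NumPy signal; the frame size and the energy threshold (a fraction of the loudest frame's energy) are illustrative guesses, not standard values:

```python
import numpy as np

def cheap_vad(signal, fs, frame_ms=25, energy_ratio=0.1):
    """Short-time energy VAD: 1 = speech-like frame, 0 = silence."""
    n = int(fs * frame_ms / 1000)                      # samples per frame
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    energies = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    # A frame counts as speech if it's loud relative to the loudest frame
    return (energies > energy_ratio * energies.max()).astype(int)
```

Exactly as the slide warns, this works poorly: steady background noise, quiet speech, and loud non-speech all fool it. But it runs essentially for free.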
--- ### Balanced Approaches to VAD - **Classification based on Spectral Features**: Use machine learning classifiers to interpret frequency patterns as 'voice' or 'non-voice' - "Does this particular balance of power in the spectrum look speechy?" - **Classification based on Mel-Frequency Cepstral Coefficients**: Use ML classifiers to interpret MFCCs as "voice" or not. - "Here's what this sound looks like in its entirety. Is it speechy?" - Often, this uses Hidden Markov Models and Gaussian Mixture Models to do the classification - But other classifiers work too! --- ### Expensive and Accurate Neural VAD systems - Turn a great deal of training data into a matrix of numbers somehow (MFCCs, wav2vec, or others) - Label the data as 'speech' and 'non-speech' - This is generally data from a particular language - Train a deep neural network on that data - Usually CNN, RNN, or LSTM - The neural network looks at the data and decides - Often *very* robust to noise and more adaptable to new segments --- ### The output is a frame-by-frame vector full of 0 and 1 - 0 indicates 'No speech in this frame' - 1 indicates 'There be speech here' - Some algorithms give you probabilities too - "42% chance of speech here" --- ### VAD can be hard! - Speech in noise is hard to detect, especially when quiet - Complex approaches may not detect un-trained kinds of speech - Fricatives, clicks, or segments outside the training data - Human voices are very variable - Children vs. Adults - Environments vary a lot - Voice detection in a crowded restaurant is a different task - Cost is a factor - The 'best' algorithm may be too slow or expensive for the use case --- ### Tuning for desired types of error - [All models are wrong, some are useful](https://en.wikipedia.org/wiki/All_models_are_wrong) - You should make sure your analysis fails optimally - Do I want to *favor preserving as much speech as possible*, and detect other things as speech more often? 
- More false positives - "Failing safe" - Do I want to *favor preserving only speech*, and throw out everything I'm not sure about? - More false negatives --- ### Evaluating VAD - What amount of speech is clipped away (at the start and middle of speech)? - What amount of noise is called speech at the end of speech? - What amount of noise is labeled as speech? --- ### What is VAD useful for? - **Constraining Analyses**: Discard non-speech frames when doing (e.g.) LPC - **Removing Silences**: In this recording, mark the places somebody's talking for transcription - **Mute-when-not-talking**: Don't transmit the mechanical keyboard clicking please - **Call Compression**: Don't send sound data to the other person when no speech is occurring - **Telemarketing**: Make robocalls, and then put a human on the line only if a human answers --- ### "OK, great, we know where there's talking" - "... but we still have to model the f0, how's that work?" --- ## f0 Detection --- ### How do we identify the f0 of a voice? - What is the fundamental frequency in a given frame? - "168 Hz in frame 134" - Where are the boundaries of individual glottal pulses? - "Delineate the right edge of each period in the glottal cycle in this file" - What's the irregularity of the signal? - Do we need to model variation in the cycle time? --- ### Dumb and Cheap f0 Detection - **Zero Crossing Rate (with filtering)**: I expect f0 to be between 80 and 350 Hz, so possible zero crossing rates associated with the fundamental would be between X and Y.... - **Cepstral Analysis**: Look for prominences in the Cepstral domain, where the voicing cycle should give a prominent peak - A related measure is **Cepstral Peak Prominence**, a measure of noise in voicing signals - **Harmonic Product Spectra**: Take an FFT, then downsample it, and multiply the original by the downsampled version, then repeat - The peaks associated with f0 (including harmonics) will grow stronger, and the noise will fade away ---
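### Harmonic Product Spectrum in Code

The downsample-and-multiply trick above can be sketched in a few lines. When the spectrum is downsampled by a factor of k, the kth harmonic lands on the fundamental's bin, so multiplying the copies together makes the harmonics 'vote' for f0. A minimal sketch, assuming a mono NumPy signal; the number of downsampled copies and the search band are illustrative choices:

```python
import numpy as np

def hps_f0(signal, fs, n_downsamples=4, fmin=50.0, fmax=400.0):
    """Harmonic Product Spectrum: harmonics 'vote' for their fundamental."""
    spectrum = np.abs(np.fft.rfft(signal))
    product = spectrum.copy()
    for k in range(2, n_downsamples + 1):
        down = spectrum[::k]                 # downsampling by k maps harmonic k onto f0
        product[:len(down)] *= down
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    in_range = (freqs >= fmin) & (freqs <= fmax)   # restrict to a plausible f0 band
    return freqs[in_range][np.argmax(product[in_range])]
```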
--- ### The Balanced Approach: Autocorrelation for f0 Tracking
--- ### Autocorrelation is a very common approach for f0 tracking - Identify candidate pitches based on autocorrelation spikes at particular lags - "Huh, big spike at 10ms lag, this is probably a 100Hz signal" - Filter out candidates which fall outside the expected pitch domain (e.g. outside 50 to 400 Hz) - This isn't always desirable for some kinds of voice study - These candidates go on to be post-processed, so we can choose the right candidates! - **This is how Praat (and others) do f0 tracking** --- ### [Post-Processing Constraints on f0 Data in Praat](https://www.fon.hum.uva.nl/praat/manual/Sound__To_Pitch__filtered_ac____.html) - **Silence threshold**: How quiet should a frame be before we call it voiceless? - **Voicing threshold**: How much autocorrelation do you need to consider something voiced? - **Octave cost**: How much do you want to favor higher-frequency candidates? - **Octave-jump cost**: How much do you want to avoid sudden octave jumps? - **Voiced/unvoiced cost**: How much do you want to penalize switching from 'voiced' to 'voiceless'? --- ### We can smooth the output of this function, too - We don't tend to suddenly jump way up or down in f0 in one or two cycles - Dynamic Programming approaches work well for this - "Find the smoothest path through all the candidates" --- ### There are other approaches - Phase-based approaches - Very fancy signal processing approaches - e.g. [Reassignment Methods](https://en.wikipedia.org/wiki/Reassignment_method) - Neural Network approaches - Good for extracting speech pitch from other pitched noise --- ### Aside: More direct measures can be more reliable - Where knowing pitch is crucial (e.g. recording for text-to-speech), other information sources are used - Electroglottography (EGG) gives much more reliable and detailed f0 information ---
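### Autocorrelation f0 Estimation in Code

The core of the approach above, stripped of Praat's candidate generation and post-processing, is just 'find the lag with the biggest autocorrelation spike inside the expected pitch range'. A minimal single-frame sketch, assuming a mono NumPy frame; real trackers keep multiple candidates per frame and smooth across frames as described on the following slides:

```python
import numpy as np

def autocorr_f0(frame, fs, fmin=50.0, fmax=400.0):
    """f0 from the strongest autocorrelation peak in the allowed lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # lags >= 0
    lag_min = int(fs / fmax)          # short lag  <-> high-pitch candidate
    lag_max = int(fs / fmin)          # long lag   <-> low-pitch candidate
    best = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / best
```

A big spike at a 10 ms lag (80 samples at 8 kHz) comes back as 100 Hz, exactly as in the example above.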
--- ### Difficulties with f0 tracking - Variability of f0 - Children and sopranos have higher f0 than expected - Periodic Background Noise - Try doing f0 detection with sheep and goats in the background - Octave Jumps - It's very easy to find 300 Hz as the 'f0' of a 150 Hz signal --- ### Difficulties with f0 tracking (Continued) - Speaker overlaps - Two people talking over each other make complex signals - ... but f0 differences are a great way to identify when there are two speakers! - f0 has some variability which you need to model - Both in period ('jitter') and amplitude ('shimmer') - Modeling when f0 stops can be tricky - ... and if you do it wrong, it's very much not good --- ### Why is f0 tracking useful? - **Modeling the voice:** LPC requires good information about the excitation signal - **Measuring f0:** f0 is a great cue for many phenomena in many languages - Particularly in tone languages! - **Identifying musical notes:** These same algorithms work great for finding the pitch of musical instruments, too! - **Finding pulses:** Once you've found a fundamental, you can identify pulses too - ... and of course.... --- ### Modifying the pitch of a voice! - Next time! --- ### Wrapping up - Voice Activity Detection identifies frames where there's speech - There are many approaches to VAD, suiting many needs - f0 detection helps us identify the pitch of the voice over many frames - There are many methods, but autocorrelation is unreasonably effective --- ### For Next Time - We'll think about how to modify the source, and the filter! ---
Thank you!