LIGN 168 - Legacy TTS

### We've come a LONG way

---
# Legacy Approaches to TTS

### Will Styler - LIGN 168

---

### Today's Plan

- Unit Selection Synthesis

- Strengths and Weaknesses of Unit Selection

- Parametric TTS

---

### Today, we're learning more legacy methods

- Neural Networks have 'won' TTS too
	- We'll talk about how that works next time

- As always, old methods still work
	- ... and they're actually quite a bit easier to implement!

- Today, we'll focus on two of them, parametric and unit selection TTS

---

### There are four main methods for TTS

- Concatenative or Unit Selection Synthesis
	- These terms are generally used interchangeably

- Parametric TTS

- Neural TTS with intermediate representations

- Neural End-to-End TTS

---

### Each works slightly differently

- Concatenative, Unit Selection Synthesis
	- "Combine together existing chunks to form the desired utterance from the text"

- Parametric TTS
	- "Let's map text to acoustic parameters (e.g. LPC coefficients, spectral envelopes, pitch) using statistics"

- Neural TTS
	- "Let's turn text into an intermediate representation like a spectrogram, and then create sound from that"

- Neural End-to-End TTS
	- "Let's just map text directly to a waveform"

- *Today, we'll focus on the two pre-neural approaches!*

---

## Unit Selection TTS

---

### Concatenative TTS

- We started thinking about this last time!

- Concatenative TTS builds utterances from pre-recorded chunks
	- You can record whatever chunks make sense
	- Or slice up existing recordings of text

- Synthesis is simply the selection, concatenation, and processing of pre-recorded units

- *Concatenative Synthesis doesn't ever generate new audio, it just combines and tweaks existing audio!*

---

### Concatenative Synthesis Units can be of any length!

- Phones, diphones, triphones, syllables, words, utterances or phrases

- ... But every size of unit comes with ups and downs!

---

### Unit size tradeoffs

- (Di)phone chunks are flexible and tiny, but don't sound as natural as larger units

- (Di)phone chunks can build the pronunciation 'guesses' that text analysis generates, even for words outside the dictionary

- Word chunks are natural and still have some flexibility, but have choppy word-to-word transitions

- Phrase-based chunks sound great internally, but completely lack flexibility and breadth (without huge data size)

- **Why not choose the best size for the situation?**

---

### Unit Selection

- "Let's build everything by concatenation, but grab whichever chunk makes the most sense!"

- Generally, start by grabbing the largest chunks you can, then grab individual words

- Words which aren't found in the data can be constructed from existing (di)phones
	- This is usually done by a spelling-to-phoneme converter

- Then smooth over the gaps by adjusting the prosody!
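
The "largest chunk first" idea can be sketched as a greedy longest-match search. This is a toy illustration, not a production selector: the `PHRASES` and `WORDS` inventories and the `pick_units` helper are made up for the example, and real systems weigh costs rather than matching greedily.

```python
# Hypothetical inventory: units that happen to exist in the recorded database.
PHRASES = {("high", "surf"), ("thursday", "afternoon")}
WORDS = {"high", "surf", "thursday", "afternoon", "through", "friday"}


def pick_units(tokens, phrases=PHRASES, words=WORDS, max_len=4):
    """Greedily cover `tokens` with the largest recorded units available."""
    plan = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase first, shrinking down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            chunk = tuple(tokens[i:i + n])
            if chunk in phrases:
                plan.append(("phrase", " ".join(chunk)))
                i += n
                break
        else:
            # No phrase matched: use a recorded word, or flag the word
            # as needing to be built from (di)phones.
            word = tokens[i]
            kind = "word" if word in words else "diphones"
            plan.append((kind, word))
            i += 1
    return plan


plan = pick_units("high surf through friday near encinitas".split())
for kind, text in plan:
    print(f"{kind:9s} {text}")
```

Here 'high surf' comes out as one phrase-sized unit, 'through' and 'friday' as word units, and the two unrecorded words get flagged for (di)phone construction.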

---

### Domain-Specific Systems

> ...HIGH SURF THURSDAY AFTERNOON THROUGH FRIDAY...

> A long period west to northwest swell will bring high surf Thursday afternoon through Friday. The peak swell and surf will occur Thursday night into early Friday morning, with the highest surf in southern San Diego County. Minor coastal flooding will occur during periods of high tides.

---

### Arbitrary Text Systems

> Alaina Rutkowska posted a great close-up photo of the tube of an ice-cream cone worm. The tube is made of sand grains, carefully selected and fitted together, and bound with a special adhesive. The worm has golden bristles used to rake through sediment so it can pick up yummy bits with little tentacles.

---

### Concatenative Synthesis requires a living database to work from

- You'll need access to the whole database to reproduce speech
	- Unlike other approaches, which only use the corpus to learn from

- The breadth of your database has a huge effect on the naturalness of the system

---

### Creating a Concatenative TTS system

- Record a large speech database

- Annotate the features of the speech in the database

- Select clever algorithms for text analysis, prosody generation, and unit selection

- Then deploy!

---

### Concatenative TTS has a few steps

- Collect and annotate a speech database (once)

- Do text analysis, grabbing chunks and modeling missing words from graphemes

- Model the prosody of the sentence

- Select the right units

- Combine them and smooth them over

---

### Collecting a speech database (once)

- Record a LOT of speech from an actual human

- You can record words, sentences, or running text, and then segment down from there
	- "Expect localized flooding." gives 'Expect', 'localized', as well as /ɛ k s p t/

- You want many versions of important words, and good coverage of the possible set of words and phrases

- You'll need to update this as new words and vocabulary come into use
	- 'COVID', 'social distancing', 'Pete Buttigieg', 'skibidi'

- **All of this happens offline, before the system is deployed**
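
To see how one recording gets segmented into smaller units, here's a minimal sketch that lists the diphones inside a word. The `diphones` helper is hypothetical, the `#` silence marker is an assumed convention, and the transcription of 'expect' is a rough one chosen to match the phones listed above.

```python
def diphones(phones):
    """Return the diphones in a phone sequence, with '#' marking silence at the edges."""
    padded = ["#"] + list(phones) + ["#"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Assumed rough transcription of "expect"
expect = ["ɛ", "k", "s", "p", "ɛ", "k", "t"]
units = diphones(expect)
print(units)
```

Note that 'ɛ-k' shows up twice, so even one word can give you multiple versions of the same unit.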

---

### Speech Databases generally have one actor

---

### Then you annotate the database (once)

- You'll mark boundaries of words, phrases, (di)phones

- You'll get the acoustic properties (via formants, MFCCs on edges) and prosodic properties (pitch, duration)

- You'll mark the adjacent phonemes
	- If you're building 'My cat king', you'd prefer the 'cat' from 'Sky cat came' rather than from 'no cat checked'

- This will let you know which chunk is the best 'fit' later
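
One way to picture the annotations is as a record per unit. This is a sketch under assumptions: the `Unit` fields, the `context_matches` helper, and all the timings and pitch values are invented for illustration, not taken from a real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    text: str          # the word or (di)phone label
    source: str        # the utterance it was cut from
    start_s: float     # boundaries within the recording, in seconds
    end_s: float
    prev_phone: str    # last phone of whatever came before
    next_phone: str    # first phone of whatever came after
    mean_f0_hz: float  # a prosodic annotation


def context_matches(unit, prev_phone, next_phone):
    """Count how many of the two neighboring phones match the target context."""
    return int(unit.prev_phone == prev_phone) + int(unit.next_phone == next_phone)


cats = [
    Unit("cat", "Sky cat came", 0.41, 0.72, "aɪ", "k", 182.0),
    Unit("cat", "no cat checked", 0.30, 0.63, "oʊ", "tʃ", 175.0),
]

# Target: "My cat king", so 'cat' is preceded by /aɪ/ and followed by /k/.
best = max(cats, key=lambda u: context_matches(u, "aɪ", "k"))
print(best.source)
```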

---

### Now you do text analysis on the input texts

- Basic syntactic parsing to choose the right parts of speech
	- 'REcord' (noun) vs. 'reCORD' (verb)

- Bits of language modeling to pick up on DJs dropping the bass

- Dictionary lookups for individual words

- Grapheme-to-phoneme modeling for the rest
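
The lookup-then-fallback logic can be sketched like this. Everything here is a toy: the `LEXICON` entries are rough illustrative transcriptions, and the `LETTER_TO_PHONE` table is a deliberately naive stand-in for a real grapheme-to-phoneme model.

```python
# Toy pronunciation dictionary keyed by (word, part of speech).
LEXICON = {
    ("record", "NOUN"): "ɹ ɛ k ɚ d",
    ("record", "VERB"): "ɹ ɪ k ɔ ɹ d",
    ("cat", "NOUN"): "k æ t",
}

# Extremely naive letter-to-sound rules, standing in for a real G2P model.
LETTER_TO_PHONE = {"a": "æ", "b": "b", "d": "d", "i": "ɪ", "k": "k", "s": "s", "t": "t"}


def pronounce(word, pos):
    """Return (phones, source): dictionary first, letter-by-letter guess otherwise."""
    key = (word.lower(), pos)
    if key in LEXICON:
        return LEXICON[key], "lexicon"
    phones = [LETTER_TO_PHONE.get(ch, "?") for ch in word.lower()]
    return " ".join(phones), "g2p"


print(pronounce("record", "NOUN"))
print(pronounce("record", "VERB"))
print(pronounce("skibidi", "NOUN"))
```

The part-of-speech tag from parsing is what lets the dictionary tell the two 'record's apart, and anything out of the dictionary falls through to the guesser.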

---

### You'll model prosody too!

- Generating values for pitch, duration, loudness, pauses

- This is often done with syntactic models

- Machine learning can take punctuation into account
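
As a rough picture of what 'prosody targets' look like, here's a rule-based stand-in (not a syntactic or machine-learned model). It assumes pitch simply falls across the sentence and that punctuation adds a pause. The `PAUSE_S` values, the pitch range, and the `prosody_targets` helper are all made up for the example.

```python
# Assumed pause lengths (in seconds) after each punctuation mark.
PAUSE_S = {",": 0.2, ".": 0.5}


def prosody_targets(tokens, start_f0=200.0, end_f0=150.0):
    """Give each word a pitch target on a falling line, plus any following pause."""
    n_words = sum(1 for t in tokens if t not in PAUSE_S)
    step = (start_f0 - end_f0) / max(n_words - 1, 1)
    targets = []
    for tok in tokens:
        if tok in PAUSE_S:
            # Punctuation attaches a pause to the word before it.
            if targets:
                targets[-1]["pause_s"] = PAUSE_S[tok]
            continue
        f0 = start_f0 - step * len(targets)
        targets.append({"word": tok, "f0_hz": f0, "pause_s": 0.0})
    return targets


targets = prosody_targets(["expect", "localized", "flooding", "."])
for t in targets:
    print(t)
```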

---

### Then, you choose the optimal chunks

- You'll generally have multiple recordings of a given word. Which combination is best?

- You'll do this by optimizing the **Target Cost** and the **Join Cost**

---

### Target Cost

- *How good is a given recording for the context it's going in?*

- What are the neighboring phonemes?
	- "Walk" before /i/ is different from before /a/

- How well does it fit the desired prosodic standards?
	- Don't put a stressed syllable in an unstressed context

- Is it the same part of speech, etc?
	- Do you want to use 'fit' as a verb or as a noun?
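
A target cost can be sketched as a weighted count of mismatches between a candidate's annotations and what the context calls for. The feature names and the weights in `DEFAULT_WEIGHTS` are assumptions for illustration. Real systems tune these, and often use graded distances rather than yes/no mismatches.

```python
# Assumed weights: how much each kind of mismatch hurts.
DEFAULT_WEIGHTS = {"prev_phone": 1.0, "next_phone": 1.0, "stressed": 2.0, "pos": 3.0}


def target_cost(candidate, target, weights=DEFAULT_WEIGHTS):
    """Sum the weights of every annotated feature that mismatches the target."""
    return sum(w for feat, w in weights.items() if candidate[feat] != target[feat])


target = {"prev_phone": "aɪ", "next_phone": "k", "stressed": True, "pos": "NOUN"}
good = dict(target)
wrong_context = {"prev_phone": "oʊ", "next_phone": "tʃ", "stressed": True, "pos": "NOUN"}
wrong_pos = {"prev_phone": "aɪ", "next_phone": "k", "stressed": True, "pos": "VERB"}

for name, cand in [("good", good), ("wrong_context", wrong_context), ("wrong_pos", wrong_pos)]:
    print(name, target_cost(cand, target))
```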

---

### Join Cost

- *How well does the chunk 'match' with its neighbors?*

- Do the edges of this recording sound like the edges of its neighbors?
	- Often done by looking at spectral distance

- Are the neighbor tokens from different prosodic conditions?
	- It may be better to 'match' neighbors even if the token is a bit less right for the context and needs fixing

- Is the pitch before and after different from the token pitch?
	- Our ability to adjust pitch is only so powerful!
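
Here's a minimal sketch of a join cost: the spectral distance between the frames on either side of the join, plus a penalty for a jump in pitch. The dictionary keys, the two-number 'spectral frames', and the `pitch_weight` value are all assumptions. A real system would compare something like MFCC vectors at the unit edges.

```python
import math


def join_cost(left, right, pitch_weight=0.01):
    """Euclidean distance between edge frames, plus a weighted pitch jump."""
    spectral = math.dist(left["end_frame"], right["start_frame"])
    pitch_jump = abs(left["end_f0_hz"] - right["start_f0_hz"])
    return spectral + pitch_weight * pitch_jump


my = {"end_frame": [0.0, 3.0], "end_f0_hz": 180.0}
cat_a = {"start_frame": [0.0, 3.0], "start_f0_hz": 180.0}  # seamless join
cat_b = {"start_frame": [4.0, 0.0], "start_f0_hz": 200.0}  # mismatched edges

print(join_cost(my, cat_a))
print(join_cost(my, cat_b))
```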

---

### We want both of these things!

- We want to choose units which are *correct for the context*, and *fit in well with adjacent tokens*

- This is a hard optimization problem

- **Viterbi-ish Algorithms** find the best pathway through the data
	- Forwards and backwards walks

- The optimal solution is the 'least worst' in terms of both target and join costs
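
The search can be sketched as a small Viterbi-style dynamic program. This is a simplified illustration: `viterbi_select` and the toy cost tables are hypothetical, and it carries whole paths forward instead of doing a separate backwards walk over backpointers, which is fine for tiny examples.

```python
def viterbi_select(candidates, target_cost, join_cost):
    """candidates: one list of options per position. Returns (total cost, best path)."""
    # best[j] = (cheapest cost of any path ending in option j, that path)
    best = [(target_cost(0, c), [c]) for c in candidates[0]]
    for t in range(1, len(candidates)):
        new_best = []
        for c in candidates[t]:
            # Cheapest way to arrive at c from any option at the previous position.
            cost, path = min(
                ((prev_cost + join_cost(prev_path[-1], c), prev_path)
                 for prev_cost, prev_path in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(t, c), path + [c]))
        best = new_best
    return min(best, key=lambda item: item[0])


# Toy costs, invented for the example.
TARGET = {"my_1": 1.0, "my_2": 0.0, "cat_1": 0.0, "cat_2": 1.0, "king_1": 0.0}
JOIN = {
    ("my_1", "cat_1"): 0.0, ("my_1", "cat_2"): 2.0,
    ("my_2", "cat_1"): 3.0, ("my_2", "cat_2"): 0.5,
    ("cat_1", "king_1"): 0.0, ("cat_2", "king_1"): 2.0,
}

cost, path = viterbi_select(
    [["my_1", "my_2"], ["cat_1", "cat_2"], ["king_1"]],
    target_cost=lambda t, c: TARGET[c],
    join_cost=lambda a, b: JOIN[(a, b)],
)
print(cost, path)
```

Notice that 'my_2' has the better target cost on its own, but the best overall path uses 'my_1' because it joins cleanly. That's the 'least worst' tradeoff in action.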

---

### Then you concatenate, and modify the output

- Adjust loudness directly

- Adjust pitch and duration using PSOLA

- Remove any join artifacts
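
Two of these steps are easy to sketch on raw samples: scaling loudness, and hiding a join with a short crossfade. This does not implement PSOLA. The `apply_gain` and `crossfade_join` helpers are hypothetical, and a linear crossfade is just one simple assumed way to soften a join.

```python
def apply_gain(samples, gain):
    """Adjust loudness directly by scaling every sample."""
    return [s * gain for s in samples]


def crossfade_join(left, right, overlap):
    """Join two sample lists, linearly crossfading over `overlap` samples."""
    if overlap == 0:
        return left + right
    if overlap > min(len(left), len(right)):
        raise ValueError("overlap is longer than one of the units")
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the right-hand unit
        mixed.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    return left[:-overlap] + mixed + right[overlap:]


loud = apply_gain([0.5, 0.5, 0.5, 0.5], 2.0)
quiet = [0.0, 0.0, 0.0, 0.0]
joined = crossfade_join(loud, quiet, overlap=3)
print(joined)
```

Instead of an abrupt jump from 1.0 to 0.0 at the boundary, the output steps down gradually across the overlap.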

---

### ... and you're done!

- You play back the audio generated!

---

### Each given inference looks the same

- Do text analysis, grabbing chunks and modeling missing words from graphemes

- Model the prosody of the sentence

- Select the right units from the database

- Combine them and smooth them over

---

### ... but each of these steps is hard!

---

## Difficulties with Unit Selection TTS

---

### Building a database is hard

- Recording all the words is hard!

- Segmenting sounds is hard

- Measuring their properties is hard
	- What about errors in the database?

- Keeping the database up to date is hard
	- What if your 'voice' leaves the company?

---

### Aside: This is different in different languages

- "Record all the words" is basically impossible for heavily synthetic, morphologically rich languages

- So, you'll need to work more below word level

---

### Text Analysis is hard

- Especially computational grapheme-to-phoneme modeling!
	- Kaetlyinn
	- Ruaridh

- Modeling prosody is hard too!

- All the rest of the text analysis difficulties we covered!

---

### Choosing optimal chunks is hard

- You don't always have the chunks you want

- The "San" from "San Diego" may be subtly different from the "San" in "San Ysidro"

- Many criteria for optimal fit, all need to be optimized quickly!

---

### Concatenation is hard

- Sometimes the 'split' points aren't ideal

- Remember PSOLA? Yeah, hard, and not 100% effective (e.g. creak)

- Even just joining waveforms without artifacts is hard

---

### So, Concatenative synthesis is hard!

- We've all heard bad unit selection

---

### Concatenative Synthesis is actually still quite useful

- It's very possible to get a very OK result, with an OK amount of work

- The amount of data needed for a good database is *much* less than what's needed to train a neural TTS model

- The running costs and compute requirements are *much* lower

- The results are very natural with large, predictable units

- **If you need to do TTS for a lower resource language, you should probably think concatenative first!**

---

### ... What if I don't want to record a database?

- Couldn't we just figure out what phonemes sound like and then generate the voice directly?

- That's...

---

## Parametric TTS

---

### Parametric Vocoders

- [WORLD](https://github.com/mmorise/World) is a good example of this

- Takes as input 'parameters' which describe the desired signal, like...
	- Spectral shape (think LPC)
	- Pitch (F0) and other source properties
	- Aperiodic components and their timings

- Outputs sound

- There are similarities in concept to the decompression part of speech compression
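
To make 'parameters in, sound out' concrete, here's a toy source-filter synthesizer. To be clear, this is not how WORLD works internally: it's an assumed minimal stand-in with an impulse-train source (pitch) fed through a single two-pole resonator (one spectral peak). The `synthesize` function and its parameter names are invented for the example.

```python
import math


def synthesize(f0_hz, formant_hz, bandwidth_hz, duration_s, sample_rate=16000):
    """Turn three parameters into samples: pitch, and one resonance's center and width."""
    n = int(duration_s * sample_rate)
    period = int(sample_rate / f0_hz)
    # Source: one impulse per pitch period.
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Filter: two-pole resonator, y[n] = x[n] + a1*y[n-1] + a2*y[n-2].
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    a1 = 2 * r * math.cos(2 * math.pi * formant_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in source:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


wave = synthesize(f0_hz=100.0, formant_hz=500.0, bandwidth_hz=100.0, duration_s=0.5)
print(len(wave), max(abs(s) for s in wave))
```

Change `f0_hz` and the pitch changes. Change `formant_hz` and the 'vowel quality' shifts. A real vocoder takes many more parameters per frame, but the idea is the same.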

---

### Parametric TTS goes from text to vocoder parameters to sound

- "What sequence of parameters for a vocoder corresponds to this text?"

- The goal is to identify the parameters which create sound that sounds like the given phoneme sequence

- You're 'shaping' the sound output based on the text, using a few key parameters to control the vocoder

---

### Parametric TTS is a three stage process

- Text analysis

- Parameter Modeling

- Vocoding to produce sound

---

### Text Analysis works just like we've discussed before

- Turning text into phoneme sequences with prosody

---

### Model training associates phoneme sequences with parameter sequences

- "In order to create this phoneme/diphone/triphone's sound, what's the sequence of parameters I need to iterate through?"

- This is usually HMM-based, predicting parameter vector 'state' from the phoneme sequence from text analysis
	- Duration is modeled too, as well as rate of parameter change
	- We are not going deeper into this because no.

- The parameters are *smoothed*, because they shouldn't change faster than tongues do

---

### New text generates new parameter sequences

- The model takes the text analysis output and creates a parameter sequence which should approximate the sound of the words

- Then, you just feed those parameters into the vocoder, and boom, speech!
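
Here's a toy version of the 'phonemes in, parameter frames out' step. A lookup table stands in for the trained statistical model, so this shows only the shape of the data flow. The `PHONE_MODEL` values and the `parameter_sequence` helper are invented for the example.

```python
# Hypothetical per-phoneme "model": pitch, first formant, and duration in frames.
# An f0 of 0.0 marks a voiceless frame.
PHONE_MODEL = {
    "k": {"f0_hz": 0.0, "f1_hz": 0.0, "frames": 2},
    "æ": {"f0_hz": 180.0, "f1_hz": 700.0, "frames": 4},
    "t": {"f0_hz": 0.0, "f1_hz": 0.0, "frames": 2},
}


def parameter_sequence(phones):
    """Expand a phoneme sequence into one (f0, f1) parameter pair per frame."""
    frames = []
    for p in phones:
        entry = PHONE_MODEL[p]
        frames.extend([(entry["f0_hz"], entry["f1_hz"])] * entry["frames"])
    return frames


frames = parameter_sequence(["k", "æ", "t"])
print(len(frames), frames[2])
```

Each frame in the output would then be handed to the vocoder in order.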

---

### Parametric TTS has benefits

- It's wildly flexible
	- As long as text analysis can analyze it, you can create it

- You don't need one specific voice actor
	- You can learn from any large amount of phoneme-annotated data

- The system can be fairly small
	- You just need enough compute to run the statistical models

- There's no chance of concatenation artifacts
	- It's built continuously!

---

### ... but it's also got disadvantages

- The naturalness is much lower than recordings of humans

- The number of parameters to estimate gets very unwieldy quickly

- The compute needed isn't trivial

- Your training data needs to be fairly homogeneous, as it can't generalize well

- Long-distance effects are harder to capture with HMMs

---

### Wait a second...

- It needs to estimate a ton of parameters

- HMM-based parametric synthesis already needs expensive compute

- It fails to generalize over heterogeneous data

- ... and it has difficulty capturing long-distance sequence effects?

- **Aren't these all things which deep neural networks are really good at?**
	- Foreshadowing? In my LIGN 168?

---

### Wrapping up

- Unit Selection TTS combines existing chunks in smart ways to generate good output

- It can sound very natural and is (relatively) cheap to build
	- ... but it's not as flexible, and has its own difficulties

- Parametric TTS uses fancy statistics to turn phoneme sequences into vocoder parameters
	- So it's incredibly flexible, but had a number of problems which we no longer need to have!

---

### Next time

- TTS with Deep Neural Networks

---

<huge>Thank you!</huge>