## asr_ctc.png

The image is a diagram explaining how CTC (Connectionist Temporal Classification) collapsing works in speech recognition systems. It shows an audio waveform at the top and a step-by-step textual illustration of the collapsing process below it.

### Audio Waveform:

- The top of the image shows an audio waveform, representing sound over time.
- The waveform is divided into segments, each corresponding to one frame-level prediction of a sound unit.

### Textual Explanation:

The text below the waveform walks through CTC collapsing:

1. **Predict a sequence of tokens:**
   - The model first predicts one token per audio frame (letters in this example, though other units are possible).
   - The predicted sequence is rendered in the description as "w o r r e l l d !"; given that step 2 drops an ε and the final output is "world!", the "e" here is most likely the CTC blank symbol ε, i.e. "w o r r ε l l d !".

2. **Merge repeats and drop ε:**
   - Consecutive repeated tokens are merged: "r r" becomes "r" and "l l" becomes "l".
   - The blank symbol ε is then dropped; it is a placeholder that carries no output and serves to separate genuinely repeated characters.

3. **Final output:**
   - After merging repeats and dropping blanks, the final output is "world!", the word spoken in the audio input.

### Source:

- The image includes a source link at the bottom: [https://alanfangblog.com/2021/08/29/ASR-CTC/](https://alanfangblog.com/2021/08/29/ASR-CTC/)

### Characters:

There are no characters or people depicted in this image. It is purely a diagram explaining the CTC collapsing process used in speech recognition systems.

This description was generated automatically from image files by a local LLM and may not be fully accurate.
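The merge-then-drop rule described above can be sketched as a small Python function. This is a minimal illustration, not taken from the figure's source; the token sequence assumes the blank is written as "ε" as in the diagram:

```python
EPSILON = "ε"  # CTC blank token, shown as ε in the figure

def ctc_collapse(tokens):
    """Collapse a CTC output: merge consecutive repeats, then drop blanks."""
    # Keep a token only if it differs from the one immediately before it.
    merged = [t for i, t in enumerate(tokens) if i == 0 or t != tokens[i - 1]]
    # Remove the blank placeholder; it only separates genuine repeats.
    return "".join(t for t in merged if t != EPSILON)

print(ctc_collapse(list("worrεlld!")))  # → world!
```

Note that the blank matters for order of operations: merging repeats *before* dropping ε is what lets "l ε l" survive as "ll" while "l l" collapses to a single "l".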
Please feel free to ask if you have further questions about the image or its meaning within the presentation.