# Deep Neural Networks I: A Conceptual Introduction

### Will Styler - LIGN 168

---

### Speech Processing is Neural now

- For better or worse, there is exactly one kind of model that wins speech processing now
- ASR, TTS, Denoising, even VAD are all now neural
- So, uh, yeah. Guess we better cover that.

---

### Today's Plan

- Neural Networks
- What is a Deep Neural Network (DNN)?
- Training Deep Neural Networks
- Neural Inputs and Outputs
- End-to-End Learning

---

### So far, we've been extracting features, and doing statistical learning

- "Hey, computer, here are things you should care about, lovingly isolated"
    - MFCCs, F0, LPC coefficients, pulse locations, and more
- "Look at these features and find the patterns based on these labels"
- "Then, when I give you a new set of curated measurements, tell me what label fits"

---

### This is great, but it also requires a lot of work

- Identify features which are helpful
- Find the features
- Measure them
- Store them, with labeled data, and only then, classify

---

### What if we didn't have to do all that?

- What if we could just feed in the data, and make the system figure out the features itself?
- Then just insert data directly, and get back answers?
- That would be world-changing!

---

## Neural Networks

---

### Aside: Neural Networks are not actually neural

- Brains are not actually involved
- They are not 'modeling the brain' any more than submarines are 'modeling fish'
- The goal here is not to make fake brains, it's to make good decisions!

---

### Neural Networks have a simple idea

- "We're going to feed the input into a network of simple functions, and with the right network design and parameters, we should get a complex decision"
- This depends on one core concept...

---

### Artificial Neurons

- This is a tiny computational model of how a neuron (sorta kinda) works
- Neurons take in a signal, and if they 'activate' or 'fire', output a modified version of the input signal
- Every neuron has an **activation function**, **weights** for outgoing connections, and a **bias**
- Weights and biases are the 'parameters' which people mean when they say 'N parameter model'

---

### Let's look at the dumbest network ever
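The diagram is omitted in this text version, but a minimal Python sketch captures the whole slide (all names and numbers here are invented for illustration, and the ReLU activation is defined on the next slide): one input, one weight, one bias, one activation.

```python
# The dumbest network ever: a single neuron.
def relu(x):
    # Activation function: zero out negatives, pass positives through
    return max(0.0, x)

def neuron(x, weight=2.0, bias=-1.0):
    # Output = activation(Weight*Input + Bias)
    return relu(weight * x + bias)

print(neuron(1.0))  # relu(2.0*1.0 - 1.0) = 1.0
```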
---

### Neural Activation Function

- The activation function describes *the mathematical relationship between the input and the output*
- Examples are Sigmoid, Tanh, Softmax, Linear
- ReLU ('Rectified Linear Unit') is the most common: "If the input is less than zero, output zero; otherwise, pass it through unchanged"
    - High pass filter!
    - f(x) = max(0,x)

---

### Neural Weight

- Weights *modify the magnitude of the output signal to another neuron*
- The output of a neuron is the input times the weight plus the bias
    - Output = Weight*Input + Bias
- Negative weights make the output signal *smaller*
    - This makes the next neuron less likely to fire
- Positive weights make the output signal *larger*
    - This makes the next neuron more likely to fire
- **Every individual connection gets a weight**

---

### Neural Bias

- Biases *modify the threshold for the neuron to activate*
- Higher bias means a neuron needs *higher activation to fire*
    - This makes the *current* neuron less likely to fire
- Lower bias means a neuron needs *less activation to fire*
    - This makes the *current* neuron more likely to fire
- **Every individual neuron gets a bias**

---

### So, every neuron takes an input signal from its connections...

- Activates based on its activation function
- Outputs based on the weight and bias to each subsequent neuron
    - For Linear Neurons, Output = Weight*Input + Bias
- Changes to the weight and bias change the output of a neuron!

---

### A 'firing' neuron
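A worked example, with invented numbers: input 1.5, weight 0.8, bias 0.2 gives 0.8 * 1.5 + 0.2 = 1.4. That's above zero, so ReLU passes 1.4 along, and the neuron 'fires'.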
---

### An 'inhibited' neuron
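Same invented setup, but with a negative weight: -0.8 * 1.5 + 0.2 = -1.0. That's below zero, so ReLU outputs 0, and the neuron stays quiet.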
---

### A Single Neuron can do work

- Output = Weight*Input + Bias is a lot like y = mx + b
- Very basic arrangements of neurons can accomplish tasks
    - The 'perceptron'
- ... but there are classes of problems a single neuron can't cover (classically, XOR, which isn't linearly separable)
- This is part of why we use...

---

## Deep Neural Networks

---

### A Deep Neural Network is just a *deep* Neural Network

- It has multiple layers of neurons between input and output
    - These are called **hidden layers**
- Each neuron is (generally) fully connected to the prior and next layers
- Each connection has an individual weight

---
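The figure for this slide is a network diagram; as a rough stand-in, here's a hedged numpy sketch (layer sizes invented) of what 'fully connected' means for the parameters: one weight per connection between adjacent layers, one bias per neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy network: 3 inputs, hidden layers of 4 and 3 neurons, 1 output.
layer_sizes = [3, 4, 3, 1]

# One weight matrix per pair of adjacent layers: entry [i, j] is the
# weight on the connection from neuron i to neuron j.
weights = [rng.normal(0.0, 0.5, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

# One bias per neuron (starting biases at zero is common, unlike weights).
biases = [np.zeros(n) for n in layer_sizes[1:]]

print([W.shape for W in weights])  # [(3, 4), (4, 3), (3, 1)] -> 12 + 12 + 3 = 27 weights
```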
---

### Each Neuron still has an activation function and bias

- As well as a list of connected neurons which feed it input and receive its output
    - With each connection having its own weight
- But now, hidden layers are activated by other hidden layers!

---

### Hidden layers
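The hidden-layers figure is a diagram; in equation form (my notation, not the slides'), each layer just repeats the single-neuron recipe in bulk: layer_output = ReLU(W * previous_layer_output + b), applied layer after layer.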
---

### At the end, there's an output function

- For now, let's focus on...
- **Sigmoid:** Give a *single probability* between zero and one

---

### Inputs map to outputs
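The mapping figure is omitted here; as a small stand-in, this is the Sigmoid output function from the previous slide, squashing whatever number the last layer produces into a probability (values invented):

```python
import math

def sigmoid(x):
    # Map any real number into (0, 1): very negative -> ~0, very positive -> ~1
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-4.0))  # ~0.02 -> "probably not"
print(sigmoid(0.0))   # 0.50  -> "no idea"
print(sigmoid(3.0))   # ~0.95 -> "probably yes"
```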
---

### Inference is mapping an input to an output

- Inputs are put into the network
- The inputs are passed through the entire network, with each neuron and connection contributing to the final output value at the output function
- The output layer, with the right output functions, should show the correct answer
- This is the 'inference' process

---

### The best-performing network will most correctly map inputs to desired outputs

Common multi-input-to-Sigmoid tasks:

- "Given these 15 sensor readings, should I report a car crash?"
- "Given all these risk and protective factors, should this person be released on bail?"
- "Given the pixels in this image, is the image likely to be pornographic?"

---

### The key to the model isn't the code, but the parameters

- The architecture of the model matters, but that's not the expensive and important part
- The right set of weights and biases will yield correct results
- The wrong set of weights and biases *in exactly the same network shape* will yield gibberish and garbage
- Sharing "the model" is sharing these correct weights
- **... but how do we find them?**

---

## Training Deep Neural Networks

---

### We're giving intuitions, not calculus

- I'm not describing the math
    - [Here's a great book which does more](https://udlbook.github.io/udlbook/)
- The math isn't so crazy, but requires calculus and linear algebra to fully understand
    - LIGN 167 goes harder on the math
- I am oversimplifying many aspects to get you the core intuition

---

### How does training work?

- **Initialization:** Set up the weight and bias matrices with reasonable values
- **Forward Pass:** What is the NN's current 'decision' based on the input?
    - This is the same process we'll use later to get decisions from the model
- **Loss Calculation:** What does the output look like relative to the desired output?
- **Backpropagation:** Let's tweak the biases and weights to try and get closer to the desired output

---

### Initialization

- Every parameter (i.e., every weight and bias) needs to start with some value
- Zeroes are mathematically Not Good for this.
- There's a lot of study on this, but the answer turns out to be "Randomish numbers, not too small, not too large"

---

### Forward Pass
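The forward-pass figure is a network animation in the slides; a hedged numpy sketch of the same process (sizes and values invented, reusing the shapes from the earlier parameter sketch):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    # Pass the input through each hidden layer in turn...
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)
    # ...then apply the output function (Sigmoid here) to the last layer
    return sigmoid(a @ weights[-1] + biases[-1])

rng = np.random.default_rng(1)
sizes = [3, 4, 1]
weights = [rng.normal(0.0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

print(forward(np.array([0.2, -1.0, 0.5]), weights, biases))  # one probability out
```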
---

### Calculating Error/Loss

- A *loss function* tells the model how to evaluate the difference between the desired output and the predicted output
- Many functions are possible, and each task will have preferred loss functions
- This is important, because it tells the model what it needs to get right!

---

### Common Loss Functions

- **Mean Squared Error:** Penalize based on the average squared difference between the predicted and actual values
- **Mean Absolute Error:** Penalize based on the average absolute difference between the predicted and actual values
- **Cross-Entropy Loss:** For probability outputs, penalize more heavily as the predicted probability diverges from the 'correct' answer
- **Dice Coefficient:** How much do the predicted pixels overlap with the actual pixels?

---

### Now, you know how far off you were from 'Correct'

- It's time for...

---

### Backpropagation

- "Let's update the biases of each neuron and the connection weights in the network by looking at the partial derivative of the loss function with respect to each"
- There's a downhill slope which leads towards less loss, so go that way!
- This also determines how much a given weight "matters"
- The amount of modification is controlled by the 'Learning Rate'
- This is an optimization problem!
    - Many optimizers are used; Adam ('Adaptive Moment Estimation') is perhaps the most common
- The goal is to find the parameter values which result in the lowest possible loss

---

### Backpropagation
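The backpropagation figure is omitted; a deliberately tiny stand-in (one weight, one invented training example, squared-error loss) shows the core move: take the derivative of the loss with respect to the weight, then step downhill.

```python
# Toy model: prediction = w * x, so there is exactly one parameter to learn.
x, y = 2.0, 10.0     # one (input, target) training example, numbers invented
w = 1.0              # initialization
learning_rate = 0.05

for step in range(5):
    pred = w * x                   # forward pass
    loss = (pred - y) ** 2         # loss calculation (squared error)
    grad = 2 * (pred - y) * x      # dLoss/dw, via the chain rule
    w -= learning_rate * grad      # update: move against the gradient
    print(f"step {step}: w = {w:.2f}, loss = {loss:.2f}")
# w climbs toward 5.0, where the prediction matches the target and loss hits zero.
```

Backpropagation is this same chain-rule computation pushed through every weight and bias in every layer, with the optimizer deciding the step sizes.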
---

### You keep cycling through the data to train

- Forward pass, Loss, Backpropagation and Update, then Forward Pass, Loss, Backprop...
- You process **batches** of data at a time
- When you've trained on the entire dataset and looped back around, you've completed an **epoch**
- This can take a while!

---

### This eventually trains the model to give correct answers

- "Keep guessing, and tweaking, and guessing, and tweaking"
- Eventually, you'll arrive at a local minimum
    - "Well, there's no direction I can go which makes the loss any less!"
    - This may or may not be the global minimum, but that's another lecture!
- The final parameters (i.e., the weights and biases) are then ready to use!

---

### You can 'fine tune' later

- You take a pre-trained model, and give it a smaller amount of new, representative data
- Repeat the learning process with smaller learning rates and a focus on the deeper layers
- This is great for adapting a general model to a more specific task or new domain
    - "I'm going to fine-tune this existing ASR model with data from Tira, so it works well there!" - Mark

---

### Interim Summary

- Artificial neurons turn inputs into outputs according to their activation function and bias, and pass those outputs along according to connection weights
- Deep Neural Networks allow more complex decision making
- Training involves doing inference, finding error, assigning that error to individual weights/biases, and updating parameters
- Inference is just putting the input in, and observing the outputs
- ... but what are the possible inputs and outputs?

---

## Inputs and Outputs

---

### Neural Networks have very flexible inputs and outputs
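The figure here surveys input and output types; the common thread, sketched below with invented values, is that every input type becomes an array of numbers before it reaches the input layer.

```python
sensor_readings = [0.82, 13.5, -0.3]   # numbers: used directly
token_ids = [17, 204, 9, 3]            # text: indices into a tokenizer's dictionary (made-up ids)
pixels = [0.0, 0.5, 1.0, 0.25]         # image: flattened pixel brightnesses
# All three are just vectors of numbers: exactly what an input layer expects.
```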
---

### Numbers

- This is easy, just slap them into the input position!

---

### Text

- 'Tokenize' the text, storing each possible text item as an entry in the tokenizer's dictionary
- Then, give the input as a series of numbers, corresponding to tokens in that dictionary
- Sentences are strings of numbers representing tokens

---

### Images
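A hedged sketch of the usual move (a made-up 2x2 grayscale 'image'): treat the image as a grid of pixel intensities, then flatten it into the input vector.

```python
import numpy as np

image = np.array([[0.0, 0.9],   # each entry is one pixel's brightness
                  [0.4, 0.1]])

input_vector = image.flatten()  # shape (4,): one input value per pixel
print(input_vector)             # [0.   0.9  0.4  0.1 ]
```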
---

### Audio

- More on this soon!

---

### Output Flexibility

- **Linear:** Just give me the last layer's summed output
- **Tanh:** Turn the last layer's output into a prediction between -1 and 1
- **Sigmoid:** Give a *single probability* between zero and one
- **Softmax:** Please choose among possible output *classes*, giving probabilities for each that add to one
- **Multiple Output Functions:** You don't have to use the same function for every output neuron!

---

### So, you can map a variety of inputs *directly* to a variety of outputs

- This leads to the most attractive property of Neural Networks...

---

## End-to-End Learning

---

### End-to-End learning

- Deep Learning doesn't require feature engineering

---

## Deep Learning doesn't require feature engineering!

---

### End-to-End learning

Deep Learning doesn't require feature engineering

- You can go straight from raw data to your final decision
- DNNs allow intermediate representations and features which work to **emerge**
- Smart architectures can amplify this effect

---

### DNNs create their own features and use them!

- Features just happen, and then get used for classification
- Inputs map to outputs, in an emergent way
- We don't actually understand what features they're using to make decisions

---

### "Wait, we don't know how they're making decisions?"

- Nope.
- **We as a species have no idea how these things do what they do.**

---

### "Isn't that a problem?"

- Why would we put up with that?!
- ... and is there just one kind of Deep Neural Network?
- **Next time!**

---

### Wrapping up

- Artificial neurons turn inputs into outputs according to their activation function and bias, pass those outputs along according to connection weights, and can combine into deep networks
- Training involves doing inference, finding error, assigning that error to individual weights/biases, and updating parameters
- You can use a variety of input and output data types
- Neural networks generate their own features and find their own patterns in the data

---
Thank you!