Lecture 6
The training set is used to fit the model’s parameters (i.e., to train the model).
The model’s prediction accuracy over the training set is called the training accuracy.
Q: Can we use the training accuracy to estimate how well a model will perform on new data?
Underfitting: the model is too simple to capture the pattern in the training data, so even the training accuracy is poor.
Overfitting: the model fits peculiarities (e.g. noise) of the training set, so a high training accuracy does not carry over to new data.
We set aside a test set of labeled examples.
The model’s prediction accuracy over the test set is called the test accuracy.
The purpose of the test set is to give us a good estimate of how well a model will perform on new data.
Q: In general, will the test accuracy be higher or lower than the training accuracy?
But what about decisions like: which architecture to use, how many layers or hidden units, how long to train, or which hyperparameter values (e.g. the learning rate) to pick?
Q: Why can’t we use the test set to determine which model we should deploy?
We therefore need a third set of labeled data called the validation set.
The model’s prediction accuracy over the validation set is called the validation accuracy.
This dataset is used to make decisions about the model: compare architectures and hyperparameter settings, and choose which model to deploy.
Example split:
The actual split depends on the amount of data that you have.
If you have more data, you can get away with a smaller percentage for the validation and test sets.
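Here is a minimal NumPy sketch of such a split (my own illustration, not code from the lecture; the 80/10/10 fractions are just one possible choice):

```python
import numpy as np

def train_val_test_split(X, t, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the data once, then carve out validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))
    n_test = int(test_frac * len(X))
    val_idx = idx[:n_val]
    test_idx = idx[n_val:n_val + n_test]
    train_idx = idx[n_val + n_test:]
    return (X[train_idx], t[train_idx]), (X[val_idx], t[val_idx]), (X[test_idx], t[test_idx])

# Example: 1000 labeled examples with 5 features each.
X = np.random.randn(1000, 5)
t = np.random.randint(0, 2, size=1000)
train, val, test = train_val_test_split(X, t)
print(len(train[0]), len(val[0]), len(test[0]))  # 800 100 100
```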
Learning curve: a plot of the training and validation accuracy (or cost) as a function of the training epoch.
Q: In which epochs is the model overfitting? Underfitting?
Q: Why don’t we also plot the test accuracy?
The best way to improve generalization is to collect more data!
But if we already have all the data we’re willing to collect, we can augment the training data by transforming the examples.
This is called data augmentation.
Examples (for images, but depends on the task): horizontal flips, small shifts or rotations, crops and rescaling, small changes in brightness or colour.
We should only augment the training examples, not the validation or test examples (why?)
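Here is a minimal NumPy sketch of augmenting a training batch with random flips and small shifts (my own illustration; the validation and test sets are left untouched):

```python
import numpy as np

def augment_batch(images, rng, max_shift=2):
    """Randomly flip and shift a batch of images of shape (N, H, W)."""
    out = images.copy()
    for i in range(len(out)):
        if rng.random() < 0.5:                       # random horizontal flip
            out[i] = out[i][:, ::-1]
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        out[i] = np.roll(out[i], shift=(dy, dx), axis=(0, 1))  # small translation
    return out

rng = np.random.default_rng(0)
train_images = np.random.rand(8, 28, 28)       # stand-in for real training images
augmented = augment_batch(train_images, rng)   # applied only to the training set
```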
Networks with fewer trainable parameters are less likely to overfit. We can reduce the number of layers, or the number of parameters per layer.
Adding a bottleneck layer is another way to reduce the number of parameters.
In practice, this isn’t a great idea, as too much information may get lost.
Idea: Penalize large weights by adding a term (e.g. \(\sum_k w_k ^ 2\)) to the cost function.
Q: Why is it not ideal to have large (absolute value) weights?
Because large weights mean that the prediction relies a lot on the content of one feature (e.g. one pixel), making the prediction overly sensitive to noise in that feature.
The red polynomial overfits. Notice it has really large coefficients
Cost function:
\[\mathcal{E}({\bf w}, b) = \frac{1}{2N}\sum_i \left(\left({\bf w} {\bf x}^{(i)} + b\right) - t^{(i)}\right)^2\]
Cost function with weight decay:
\[\mathcal{E}_{WD}({\bf w}, b) = \frac{1}{2N}\sum_i \left(\left({\bf w} {\bf x}^{(i)} + b\right) - t^{(i)}\right)^2 + \lambda \sum_j w_j^2\]
\[\frac{\partial \mathcal{E}_{WD}}{\partial w_j} = \frac{\partial \mathcal{E}}{\partial w_j} + 2 \lambda w_j\]
So the gradient descent update rule becomes:
\[w_j \leftarrow w_j - \alpha\left(\frac{\partial \mathcal{E}}{\partial w_j} + 2 \lambda w_j\right)\]
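As a minimal sketch of this update (my own NumPy illustration for linear regression with squared error, not code from the lecture):

```python
import numpy as np

# Toy linear regression data: t = 3*x + 1 + noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
t = 3 * X[:, 0] + 1 + 0.1 * rng.standard_normal(200)

w = np.zeros(1)
b = 0.0
alpha, lam = 0.1, 0.01           # learning rate and weight decay strength

for step in range(500):
    y = X @ w + b                # predictions
    err = y - t
    grad_w = X.T @ err / len(X)  # dE/dw for the squared-error cost
    grad_b = err.mean()
    w -= alpha * (grad_w + 2 * lam * w)   # weight decay term 2*lambda*w
    b -= alpha * grad_b                   # the bias is typically not decayed

print(w, b)  # w is pulled slightly toward 0 relative to plain gradient descent
```

In frameworks such as PyTorch, the same effect is usually obtained through the optimizer’s weight_decay argument (up to a constant-factor convention in how \(\lambda\) is defined).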
Idea: Stop training when the validation error starts going up.
In practice, this is implemented by checkpointing (saving) the neural network weights every few iterations/epochs during training.
We choose the checkpoint with the best validation error to actually use. (And if there is a tie, use the earlier checkpoint)
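Here is a minimal sketch of this checkpointing logic (my own illustration; train_one_epoch and validation_accuracy are hypothetical stand-ins for your actual training and evaluation code):

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, validation_accuracy, max_epochs=100):
    """Keep the checkpoint with the best validation accuracy (earliest one on ties)."""
    best_acc = float("-inf")
    best_weights, best_epoch = copy.deepcopy(model), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        acc = validation_accuracy(model)
        if acc > best_acc:                       # strict '>' keeps the earlier checkpoint on ties
            best_acc, best_epoch = acc, epoch
            best_weights = copy.deepcopy(model)  # checkpoint the weights
    return best_weights, best_epoch, best_acc

# Toy usage with dummy callables (for illustration only):
history = iter([0.70, 0.75, 0.80, 0.79, 0.78])
best, epoch, acc = fit_with_early_stopping(
    model={"w": 0.0},
    train_one_epoch=lambda m: None,
    validation_accuracy=lambda m: next(history),
    max_epochs=5,
)
print(epoch, acc)  # 3 0.8
```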
Weights start off small, so it takes time for them to grow large.
Therefore, stopping early has a similar effect to weight decay.
If you’re using sigmoid units, and the weights start out small, then the inputs to the activation functions take only a small range of values around 0, where the sigmoid is roughly linear, so early in training the network behaves like a much simpler (nearly linear) model.
If a loss function is convex (with respect to the predictions), and you have a bunch of predictions for an input but don’t know which one is best, you are always better off averaging them!
\[\mathcal{L}(\lambda_1 y_1 + \dots + \lambda_N y_N, t) \le \lambda_1 \mathcal{L}(y_1, t) + \dots + \lambda_N\mathcal{L}(y_N, t)\]
for \(\lambda_i \ge 0\) and \(\sum_i \lambda_i = 1\)
Idea: Build multiple candidate models, and average the predictions on the test data.
This set of models is called an ensemble.
Ensembles can improve generalization substantially.
However, ensembles are expensive.
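Here is a small NumPy check of this claim for the squared-error loss (my own illustration): the loss of the averaged prediction is never worse than the average of the individual losses.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.standard_normal(100)                      # targets for 100 test inputs
preds = t + 0.5 * rng.standard_normal((5, 100))   # predictions from 5 models in an ensemble

individual_losses = ((preds - t) ** 2).mean(axis=1)      # squared error of each model
ensemble_loss = ((preds.mean(axis=0) - t) ** 2).mean()   # squared error of the averaged prediction

print(individual_losses.mean(), ensemble_loss)   # ensemble loss <= mean individual loss
```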
For a network to overfit, its computations need to be really precise.
This suggests regularizing them by injecting noise into the computations, a strategy known as stochastic regularization.
One example is dropout: in each training iteration, randomly choose a portion of activations to set to 0.
The probability \(p\) that an activation is set to 0 is a hyperparameter.
Dropout can be seen as training an ensemble of \(2^D\) different architectures with shared weights (where \(D\) is the number of units).
Don’t do dropout at test time (why not?)
Multiply the weights by \(1-p\) (why?)
Since each activation is only “on” a \(1-p\) fraction of the time during training, multiplying the weights by \(1-p\) at test time matches the expected value of the activations going into the next layer.
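Below is a minimal NumPy sketch of a dropout layer with exactly this train/test behaviour (my own illustration; many frameworks instead use the equivalent “inverted dropout”, which rescales by \(1/(1-p)\) at training time):

```python
import numpy as np

def dropout(activations, p, train, rng):
    """Zero out units at train time; rescale by (1 - p) at test time."""
    if train:
        mask = rng.random(activations.shape) >= p   # keep each unit with probability 1 - p
        return activations * mask
    return activations * (1 - p)                    # test time: scale instead of dropping

rng = np.random.default_rng(0)
h = np.ones((4, 10))                  # a batch of hidden activations
h_train = dropout(h, p=0.5, train=True, rng=rng)
h_test = dropout(h, p=0.5, train=False, rng=rng)
print(h_train.mean(), h_test.mean())  # both are close to 0.5 in expectation
```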
Differential privacy (DP): a randomized training algorithm \(Model\) is \(\varepsilon\)-differentially private if, for any two datasets \(D\) and \(D'\) that differ in a single record and for any possible output \(S\):
\[\begin{align*} \Pr[Model(D) = S] \leq \exp(\varepsilon) \cdot \Pr[Model(D') = S] \end{align*}\]
Applying DP in Neural Networks: Introduce noise to gradients during training.
Mechanisms (DP-SGD): clip each per-example gradient to a maximum norm, then add Gaussian noise to the aggregated gradient before each parameter update.
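Here is a minimal NumPy sketch of one such noisy update (my own illustration of the clip-and-add-noise idea; per_example_grads, clip_norm and noise_mult are illustrative stand-ins, not values from the lecture):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.0, rng=None):
    """One DP-SGD-style update: clip each example's gradient, sum, add Gaussian noise, average."""
    if rng is None:
        rng = np.random.default_rng()
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12)) for g in per_example_grads]
    g_sum = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=g_sum.shape)   # calibrated Gaussian noise
    return params - lr * (g_sum + noise) / len(per_example_grads)

params = np.zeros(3)
per_example_grads = np.random.default_rng(0).standard_normal((8, 3))   # stand-in gradients
params = dp_sgd_step(params, per_example_grads)
print(params)
```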
Scenario: Predicting Disease Risk
Task: A neural network is trained to predict the risk of a disease based on patient health records.
Data: Sensitive medical information such as diagnosis history.
What happens without DP? The model may memorize individual records, so an attacker could infer whether a particular patient’s data was used for training, or recover details of their diagnosis history.
With DP: each patient’s record has only a bounded influence on the trained model, so such inferences about any individual are limited.
Scenario: Personalized Product Recommendations
Task: A deep learning model recommends products based on customer browsing history and previous purchases.
Data: Includes user shopping patterns, age, and location.
Without DP: the recommendations can leak an individual customer’s browsing and purchase history (e.g. whether a particular user’s data was used for training).
With DP: the model learns overall shopping patterns while limiting what can be inferred about any single customer.
The learning rate \(\alpha\) is a hyperparameter we need to tune. Here are the things that can go wrong in batch mode:
| \(\alpha\) too small: | \(\alpha\) too large: | \(\alpha\) much too large: |
|---|---|---|
| slow progress | oscillations | instability |
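To see these three regimes concretely, here is a tiny demo (my own illustration) of gradient descent on the 1-D quadratic \(f(w) = \tfrac{1}{2} w^2\), where the update is \(w \leftarrow (1 - \alpha) w\):

```python
def gradient_descent_on_quadratic(alpha, steps=10, w0=1.0):
    """Minimize f(w) = 0.5 * w**2, whose gradient is w."""
    w = w0
    trajectory = [w]
    for _ in range(steps):
        w = w - alpha * w
        trajectory.append(w)
    return trajectory

print(gradient_descent_on_quadratic(alpha=0.1))  # slow, steady progress toward 0
print(gradient_descent_on_quadratic(alpha=1.9))  # oscillates around 0 while slowly shrinking
print(gradient_descent_on_quadratic(alpha=2.5))  # |w| grows every step: unstable / diverges
```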
Batch gradient descent moves directly downhill. SGD takes steps in a noisy direction, but moves downhill on average.
[Figures: parameter-space trajectories of batch gradient descent vs. stochastic gradient descent.]
In stochastic training, the learning rate also influences the fluctuations due to the stochasticity of the gradients.
The tradeoff between smaller vs larger batch size
\[\begin{align*} \text{Var}\left[\frac{1}{S} \sum_{i=1}^S \frac{\partial \mathcal{L}^{(i)}}{\partial \theta_j}\right] &= \frac{1}{S^2} \text{Var} \left[\sum_{i=1}^S \frac{\partial \mathcal{L}^{(i)}}{\partial \theta_j} \right] \\ &= \frac{1}{S} \text{Var} \left[\frac{\partial \mathcal{L}^{(i)}}{\partial \theta_j} \right] \end{align*}\]
Larger batch size implies smaller variance, but at what cost?
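Here is a quick NumPy check of the \(1/S\) scaling (my own illustration, treating one gradient coordinate’s per-example values as i.i.d. draws):

```python
import numpy as np

rng = np.random.default_rng(0)
per_example_grads = rng.standard_normal(100_000)   # stand-in for one gradient coordinate

for S in [1, 10, 100]:
    batches = per_example_grads[: (len(per_example_grads) // S) * S].reshape(-1, S)
    batch_grads = batches.mean(axis=1)             # mini-batch gradient estimates
    print(S, batch_grads.var())                    # variance shrinks roughly as 1/S
```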
To diagnose optimization problems, it’s useful to look at learning curves: plot the training cost (or other metrics) as a function of iteration.
You might want to check out these links:
An overview of gradient descent algorithms: https://ruder.io/optimizing-gradient-descent
Why momentum really works: https://distill.pub/2017/momentum/
The bias-variance decomposition
Training set \(D = \{(x_1, y_1), ..., (x_n, y_n)\}\) drawn i.i.d. from distribution \(P(X,Y)\). Let’s write this as \(D \sim P^n\).
Assume for simplicity this is a regression problem with \(y \in \mathbb{R}\) and \(L_2\) loss.
What is the expected test error for a function \(h_D(x)=y\) trained on the training set \(D \sim P^n\), assuming a learning algorithm \(\mathcal{A}\)? It is:
\[\begin{align*} \mathbb{E}_{D \sim P^n, (x,y) \sim P} \left[ (h_D(x) - y)^2 \right] \end{align*}\]
The expectation is taken with respect to possible training sets \(D \sim P^n\) and the test distribution P. Let’s write the expectation as \(\mathbb{E}_{D,x,y}\) for notational simplicity.
Note that this is the expected test error, not the empirical test error that we report after training. How are they different?
Let’s start by adding and subtracting the same quantity \[\begin{align*} \mathbb{E}_{D,x,y} \left[ \left(h_D(x) - y\right)^2 \right] = \mathbb{E}_{D,x,y} \left[ \left(h_D(x) - \hat{h}(x) + \hat{h}(x) - y\right)^2 \right] \end{align*}\]
\(\hat{h}(x) = \mathbb{E}_{D \sim P^n}[h_D(x)]\) is the expected regressor over possible training sets, given the learning algorithm \(\mathcal{A}\).
\(\hat{y}(x) = \mathbb{E}_{y|x}[y]\) is the expected label given \(x\). Labels might not be deterministic given x.
After some algebraic manipulation (proof), we can show that:
\[\begin{align*} \underbrace{\mathbb{E}_{D,x,y} \left[ (h_D(x) - y)^2 \right]}_{\text{Expected test error}} =\;& \underbrace{\mathbb{E}_{D,x} \left[\left(h_D(x) - \hat{h}(x)\right)^2 \right]}_{\text{Variance}} + \\ & \underbrace{\mathbb{E}_{x,y} \left[\left(\hat{y}(x) - y\right)^2 \right]}_{\text{Noise}} + \\ & \underbrace{\mathbb{E}_{x} \left[\left(\hat{h}(x) - \hat{y}(x)\right)^2 \right]}_{\text{Bias}} \end{align*}\]
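For reference, here is a sketch of the algebra (my own write-up of the standard argument, not from the lecture): the cross terms vanish because \(D\) is independent of the test point \((x,y)\), so \(\mathbb{E}_D[h_D(x) - \hat{h}(x)] = 0\), and because \(\hat{y}(x)\) is the conditional mean of \(y\).
\[\begin{align*} \mathbb{E}_{D,x,y}\left[\left(h_D(x) - \hat{h}(x) + \hat{h}(x) - y\right)^2\right] &= \mathbb{E}_{D,x}\left[\left(h_D(x) - \hat{h}(x)\right)^2\right] + \mathbb{E}_{x,y}\left[\left(\hat{h}(x) - y\right)^2\right] + 2\,\mathbb{E}_{x,y}\left[\underbrace{\mathbb{E}_{D}\left[h_D(x) - \hat{h}(x)\right]}_{=0}\left(\hat{h}(x) - y\right)\right] \end{align*}\]
Adding and subtracting \(\hat{y}(x)\) inside the second term splits it in the same way into the bias and noise terms, with the remaining cross term vanishing because \(\mathbb{E}_{y|x}\left[\hat{y}(x) - y\right] = 0\).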
Variance: Captures how much your regressor \(h_D\) changes if you train on a different training set. How “over-specialized” is your regressor \(h_D\) to a particular training set \(D\)? I.e. how much does it overfit? If we have the best possible model for our training data, how far off are we from the average regressor \(\hat{h}\)?
Bias: What is the inherent error that you obtain from your regressor \(h_D\) even with infinite training data? This is due to your model being “biased” to a particular kind of solution (e.g. linear model). In other words, bias is inherent to your model/architecture.
If you use a low-capacity model, you will get high bias, but the variance over different training sets will be low.
There is a sweet spot that trades off between the two.
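To make the tradeoff concrete, here is a small simulation sketch (my own illustration, not from the lecture): fit polynomials of degree 1 (low capacity) and degree 9 (high capacity) to many training sets resampled from a noisy sine curve, and estimate the bias and variance terms empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50)
true_f = np.sin(2 * np.pi * x_test)            # the expected label y_hat(x)
n_datasets, n_train, noise_std = 200, 20, 0.3

for degree in [1, 9]:
    preds = np.empty((n_datasets, len(x_test)))
    for d in range(n_datasets):                # draw many training sets D ~ P^n
        x = rng.uniform(0, 1, n_train)
        t = np.sin(2 * np.pi * x) + noise_std * rng.standard_normal(n_train)
        coeffs = np.polyfit(x, t, degree)      # h_D for this training set
        preds[d] = np.polyval(coeffs, x_test)
    avg_pred = preds.mean(axis=0)              # the average regressor h_hat
    variance = preds.var(axis=0).mean()
    bias_sq = ((avg_pred - true_f) ** 2).mean()
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

The low-degree fit typically shows large bias and small variance, while the high-degree fit shows the reverse.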