Lecture 1
Igor Gilitschenski (LEC0101)
Florian Shkurti (LEC0102)
Please use Piazza for course-related questions
Artificial Intelligence: Create intelligent machines that perceive, reason, and act like humans. (CSC384)
Machine Learning: Find an algorithm that automatically learns from example data. (CSC311)
Deep Learning: Using deep neural networks to automatically learn from example data. (CSC413)
For many problems, it is difficult to program the correct behaviour by hand.
Machine learning approach: program an algorithm to automatically learn from data.
Reframe learning problems into optimization problems by choosing a model, defining a loss function, and picking an optimization procedure.
Different machine learning approaches differ in the model, loss, and optimizer choice.
This is why it is important to have a strong foundation in math, specifically calculus, linear algebra, and probability.
Neural networks are a class of models originally inspired by the brain.
\[ y = \phi \left(\mathbf{w}^\top \mathbf{x} + b\right) \]
A “deep” neural network contains many “layers”.
Later layers use the output of earlier layers as input.
The term deep learning emphasizes that the neural network algorithms often involve hierarchies with many stages of processing.
One of the fundamental building blocks of deep learning is the linear model, which makes decisions based on a linear function of the input vector.
Common supervised learning problems:
Input: Represented using the vector \(\textbf{x}\)
Output: Represented using the scalar \(t\)
A model is a set of assumptions about the underlying nature of the data we wish to learn about. The model, or architecture, defines the allowed family of hypotheses.
In linear regression, our model looks like this:
\[y = \sum_j w_j x_j + b\]
where \(y\) is a prediction for \(t\), and \(w_j\) and \(b\) are parameters of the model, to be determined from the data.
For the exam prediction problem, we only have a single feature, so we can simplify our model to:
\[y = w x + b\]
Our hypothesis space includes all functions of the form \(y = w x + b\). Here are some examples:
The variables \(w\) and \(b\) are called weights or parameters of our model. (Sometimes \(w\) and \(b\) are referred to as coefficients and intercept, respectively.)
We can visualize the hypothesis space or weight space:
Each point in the weight space represents a hypothesis.
The “badness” of an entire hypothesis is the average badness across our labeled data.
\[\begin{align*} \mathcal{E}(w, b) &= \frac{1}{N} \sum_i \mathcal{L}\left(y^{(i)}, t^{(i)}\right) \\ &= \frac{1}{2N} \sum_i \left(y^{(i)} - t^{(i)}\right)^2 \\ &= \frac{1}{2N} \sum_i \left(\left(w x^{(i)} + b\right) - t^{(i)}\right)^2 \end{align*}\]
This is called the cost of a particular hypothesis (in practice, “loss” and “cost” are used interchangeably).
Since the loss depends on the choice of \(w\) and \(b\), we call \(\mathcal{E}(w, b)\) the cost function.
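To make the formula concrete, here is a minimal numpy sketch of \(\mathcal{E}(w, b)\) for the single-feature model; the helper name `cost` and the toy data are made up for illustration:

```python
import numpy as np

def cost(w, b, x, t):
    """Average squared-error cost E(w, b) over all N examples."""
    y = w * x + b                       # predictions for every example at once
    return np.mean((y - t) ** 2) / 2    # matches the 1/(2N) sum in the formula

x = np.array([1.0, 2.0, 3.0, 4.0])      # feature values (e.g., hours studied)
t = np.array([20.0, 40.0, 55.0, 80.0])  # targets (e.g., exam scores)
print(cost(0.0, 0.0, x, t))             # cost of the hypothesis w = 0, b = 0
```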
Find a critical point by setting \[ \frac{\partial \mathcal{E}}{\partial w} = 0 \quad \text{and} \quad \frac{\partial \mathcal{E}}{\partial b} = 0 \]
This is possible for our hypothesis space (a closed-form solution exists), and is covered in the notes.
However, let’s use a technique that can also be applied to more general models.
We can use gradient descent to minimize the cost function.
\[ \textbf{w} \leftarrow \textbf{w} - \alpha \frac{\partial \mathcal{E}}{\partial \textbf{w}}, \quad \text{where }\, \frac{\partial \mathcal{E}}{\partial \textbf{w}} = \begin{bmatrix} \frac{\partial \mathcal{E}}{\partial w_1} \\ \vdots \\ \frac{\partial \mathcal{E}}{\partial w_D} \\ \end{bmatrix} \]
The \(\alpha\) is the learning rate, which we choose.
We’ll initialize \(w = 0\) and \(b = 0\) (arbitrary choice)
We’ll also choose \(\alpha = 0.5\)
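Putting the update rule together with these choices, here is a minimal gradient descent sketch for the single-feature model (the toy data is made up, and is scaled to \([0, 1]\) so that \(\alpha = 0.5\) remains stable):

```python
import numpy as np

x = np.array([0.1, 0.3, 0.6, 0.9])    # toy features, scaled to [0, 1]
t = np.array([0.2, 0.35, 0.7, 0.85])  # toy targets

w, b, alpha = 0.0, 0.0, 0.5           # initialization and learning rate from above

for step in range(100):
    y = w * x + b                     # predictions under the current hypothesis
    dw = np.mean((y - t) * x)         # dE/dw for the 1/(2N) squared-error cost
    db = np.mean(y - t)               # dE/db
    w, b = w - alpha * dw, b - alpha * db

print(w, b)                           # approaches the least-squares fit
```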
In theory:
In practice:
To compute the gradient \(\frac{\partial \mathcal{E}}{\partial w}\), we average the per-example gradients:
\[ \frac{\partial \mathcal{E}}{\partial w} = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial \mathcal{L}(y^{(i)}, t^{(i)})}{\partial w} \]
But this computation can be expensive if \(N\) is large!
Solution: estimate \(\frac{\partial \mathcal{E}}{\partial w}\) using a subset of the data
Full batch gradient descent:
\[ \frac{1}{N}\sum_{i=1}^{N} \frac{\partial \mathcal{L}(y^{(i)}, t^{(i)})}{\partial w} \]
Stochastic Gradient Descent:
Estimate the above quantity by computing the average of \(\frac{\partial \mathcal{L}(y^{(i)}, t^{(i)})}{\partial w}\) across a small number of \(i\)’s
The set of examples that we use to estimate the gradient is called a mini-batch.
The number of examples in each mini-batch is called the mini-batch size, or just the batch size.
In theory, any way of sampling a mini-batch is okay.
In practice, SGD is almost always implemented like this:
# repeat until convergence:
#     randomly split the data into mini-batches of size k
#     for each mini-batch:
#         estimate the gradient using the mini-batch
#         update the parameters based on the estimate
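As a rough translation of this pseudocode into runnable numpy (the function name `sgd` and the `grad_fn` interface are illustrative, and “until convergence” is simplified here to a fixed number of epochs):

```python
import numpy as np

def sgd(X, t, grad_fn, w, alpha=0.1, batch_size=10, num_epochs=50):
    """Mini-batch SGD. grad_fn(X_batch, t_batch, w) should return the
    average gradient of the loss over the given mini-batch."""
    N = X.shape[0]
    for epoch in range(num_epochs):             # "repeat until convergence"
        idx = np.random.permutation(N)          # randomly split into mini-batches
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]
            g = grad_fn(X[batch], t[batch], w)  # estimate gradient on the batch
            w = w - alpha * g                   # update based on the estimate
    return w
```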
Suppose we have 1000 examples in our training set.
Q: How many iterations are in one epoch if our batch size is 10?
Q: How many iterations are in one epoch if our batch size is 50?
Q: What happens if the batch size is too large?
Q: What happens if the batch size is too small?
| Model | \(y = {\bf w}^\top{\bf x} + b\) |
|---|---|
| Loss function | \(\mathcal{L}(y, t) = \frac{1}{2}(y - t)^2\) |
| Optimization method | \(\min_{{\bf w},\, b} \mathcal{E}({\bf w}, b)\) via gradient descent |
Updating rules:
\[ {\bf w} \leftarrow {\bf w} - \alpha \frac{\partial \mathcal{E}}{\partial {\bf w}}, \quad b \leftarrow b - \alpha \frac{\partial \mathcal{E}}{\partial b} \]
Use vectors rather than writing out the sum over examples explicitly:
\[ \mathcal{E}({\bf w}, b) = \frac{1}{2N}\sum_{i = 1}^N \left(\left({\bf w}^\top {\bf x}^{(i)} + b\right) - t^{(i)}\right)^2 \]
So we have: \[ \textbf{y} = \textbf{X}\textbf{w} + b{\bf 1} \]
where…
\[\begin{align*} \textbf{X} &= \begin{bmatrix} x^{(1)}_1 & ... & x^{(1)}_D \\ \vdots & \ddots & \vdots \\ x^{(N)}_1 & ... & x^{(N)}_D \end{bmatrix}, \, \textbf{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_D \\ \end{bmatrix}, \, \textbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \\ \end{bmatrix}, \, \textbf{t} = \begin{bmatrix} t^{(1)} \\ \vdots \\ t^{(N)} \\ \end{bmatrix} \end{align*}\]
(You can also fold the bias \(b\) into the weight vector \({\bf w}\) by appending a constant 1 feature to each \({\bf x}\), but we won’t.)
After vectorization, the cost function becomes:
\[ \mathcal{E}(\textbf{w}) = \frac{1}{2N}(\textbf{y} - \textbf{t})^\top(\textbf{y} - \textbf{t}) \]
or
\[ \mathcal{E}(\textbf{w}) = \frac{1}{2N}({\bf Xw} + b{\bf 1}- {\bf t})^\top({\bf Xw} + b{\bf 1}- {\bf t}) \]
\[ {\bf w} \leftarrow {\bf w} - \alpha \frac{\partial \mathcal{E}}{\partial {\bf w}}, \quad b \leftarrow b - \alpha \frac{\partial \mathcal{E}}{\partial b} \] where \(\frac{\partial \mathcal{E}}{\partial \textbf{w}}\) is the vector of partial derivatives: \[\begin{align*} \frac{\partial \mathcal{E}}{\partial \textbf{w}} = \begin{bmatrix} \frac{\partial \mathcal{E}}{\partial w_1} \\ \vdots \\ \frac{\partial \mathcal{E}}{\partial w_D} \\ \end{bmatrix} \end{align*}\]
Vectorization is not just for mathematical elegance.
When using Python with numpy/PyTorch/TensorFlow/JAX, code that performs vector computations is much faster than equivalent code that loops.
The same holds for many other high-level languages and libraries.
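A quick (machine-dependent) timing sketch of the difference; on typical hardware the vectorized version is orders of magnitude faster:

```python
import time
import numpy as np

N, D = 10_000, 100
X = np.random.randn(N, D)
w = np.random.randn(D)
b = 1.0

start = time.time()                  # explicit Python loops
y_loop = np.array([sum(w[j] * X[i, j] for j in range(D)) + b for i in range(N)])
print("loops:     ", time.time() - start)

start = time.time()                  # one vectorized matrix-vector product
y_vec = X @ w + b                    # y = Xw + b1
print("vectorized:", time.time() - start)

assert np.allclose(y_loop, y_vec)    # same result either way
```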
In classification, the \(t^{(i)}\) are discrete.
In binary classification, we’ll use the labels \(t \in \{0, 1\}\). Training examples with \(t = 1\) are called positive examples, and those with \(t = 0\) are called negative examples.
Why can’t we set up this problem as a regression problem?
Use the model:
\[ y = wx + b \]
Our prediction for \(t\) would be \(1\) if \(y \geq 0.5\), and \(0\) otherwise.
With the loss function
\[\mathcal{L}(y, t) = \frac{1}{2}(y - t)^2\]
And minimize the cost function via gradient descent?
If we have \(\mathcal{L}(y, t) = \frac{1}{2}(y - t)^2\), then points that are correctly classified will still have high loss!
(Figure: the blue dotted line is the decision boundary. Example: a correctly classified point in the top right still has high squared-error loss.)
Why not still use the model:
\[ y = \begin{cases} 1, & \text{ if } \mathbf{w}^\top\mathbf{x} + b > 0 \\ 0, & \text{ otherwise }\end{cases} \]
But use this loss function instead:
\[ \mathcal{L}(y, t) = \begin{cases} 0, & \text{ if } y = t \\ 1, & \text{ otherwise }\end{cases} \]
The gradient of this function is 0 almost everywhere!
So gradient descent will not change the weights! We need to define a surrogate loss function that is better behaved.
Apply a non-linearity or activation function to the linear model \(z\):
\[\begin{align*} z &= wx + b \quad \quad \text{also called the logit}\\ y &= \sigma(z) \quad \quad \text{also called a log-linear model} \end{align*}\]
where
\[\sigma(z) = \frac{1}{1 + e^{-z}}\]
is called the logistic or sigmoid function. Using the model \(y\) for solving a classification problem is called logistic regression.
Properties:
- \(0 < \sigma(z) < 1\), so \(y\) can be interpreted as a probability
- \(\sigma(0) = 0.5\), and \(\sigma\) is monotonically increasing
- \(\sigma(-z) = 1 - \sigma(z)\)
A logistic regression model will have this shape:
But how do we train this model?
Suppose we define the model like this:
\[\begin{align*} z &= wx + b \\ y &= \sigma(z) \\ \mathcal{L}_{SE}(y, t) &= \frac{1}{2}(y - t)^2 \end{align*}\]
The gradient of \(\mathcal{L}\) with respect to \(w\) is (homework):
\[ \frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial y} \frac{dy}{dz} \frac{\partial z}{\partial w} = (y - t) y (1 - y) x \]
Suppose we have a positive example (\(t = 1\)) that our model classifies extremely wrongly (\(z = -5\)):
Then we have \(y = \sigma(z) \approx 0.0067\)
Ideally, the gradient should give us strong signals regarding how to update \(w\) to do better.
But… \(\frac{\partial \mathcal{L}}{\partial w} = (y - t) y (1- y) x\) is small!
Which means that the update \(w \leftarrow w - \alpha \frac{\partial \mathcal{L}}{\partial w}\) won’t change \(w\) much!
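A quick numeric check of this vanishing-gradient effect (taking \(x = 1\) for simplicity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = 1.0, 1.0                   # a positive example
z = -5.0                          # the model is extremely wrong
y = sigmoid(z)                    # ~0.0067, as above

grad = (y - t) * y * (1 - y) * x  # squared-error gradient from the formula
print(y, grad)                    # gradient ~ -0.0066: barely any signal
```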
The problem with using sigmoid activation with square loss is that we get poor gradient signal.
We need a loss function that distinguishes between a wrong prediction and a very wrong prediction.
The cross entropy loss provides the desired behaviour:
\[ \mathcal{L}(y, t) = \begin{cases} -\log(y), & \text{if } t = 1 \\ -\log(1 - y), & \text{ if } t = 0\end{cases} \]
We can write the loss as:
\[\mathcal{L}(y, t) = - t \log(y) - (1-t) \log(1-y)\]
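To see why cross entropy fixes the poor gradient signal, it is worth checking (a standard derivation) that combining it with the sigmoid gives:

\[
\frac{\partial \mathcal{L}}{\partial z}
= \frac{\partial \mathcal{L}}{\partial y} \frac{dy}{dz}
= \left(-\frac{t}{y} + \frac{1-t}{1-y}\right) y (1 - y)
= y - t,
\qquad \text{so} \qquad
\frac{\partial \mathcal{L}}{\partial w} = (y - t)\, x
\]

For the very wrong positive example above (\(t = 1\), \(y \approx 0.0067\)), this gradient is about \(-x\) rather than \(-0.0066\,x\), so the update moves \(w\) decisively.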
| Model | \(y = \sigma({\bf w}^\top{\bf x} + b)\) |
|---|---|
| Loss function | \(\mathcal{L}(y, t) = -t \log(y) - (1-t) \log(1-y)\) |
| Optimization method | \(\min_{{\bf w},\, b} \mathcal{E}({\bf w}, b)\) via gradient descent |
Updating rules:
\[ {\bf w} \leftarrow {\bf w} - \alpha \frac{\partial \mathcal{E}}{\partial {\bf w}}, \quad b \leftarrow b - \alpha \frac{\partial \mathcal{E}}{\partial b} \]
After running gradient descent, we’ll get a model that looks something like:
Instead of there being two targets (pass/fail, cancer/not cancer, before/after 2000), we have \(K > 2\) targets.
Example: identifying which of the four Beatles appears in a photo (\(K = 4\); see the case study at the end of this lecture).
We use a one-hot vector to represent the target:
\[{\bf t} = (0,\, 0,\, \ldots ,\, 1,\, \ldots ,\, 0)\]
This vector contains \(K-1\) zeros, and a single 1 somewhere.
Each index (column) in the vector represents one of the classes.
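For instance, a quick numpy sketch of one-hot encoding (the class index here is made up):

```python
import numpy as np

K = 4                  # number of classes
label = 2              # class index of this example (0-based, illustrative)
t = np.eye(K)[label]   # one-hot vector: [0., 0., 1., 0.]
print(t)
```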
The prediction \({\bf y}\) will also be a vector. Like in logistic regression there will be a linear part, and an activation function.
Linear part: \({\bf z} = {\bf W}^\top{\bf x} + {\bf b}\)
So far, this is like having \(K\) separate logistic regression models, one for each element of the one-hot vector.
Q: What are the shapes of \({\bf z}\), \({\bf W}\), \({\bf x}\) and \({\bf b}\)?
Instead of using a sigmoid function, we use a softmax activation function:
\[y_k = \text{softmax}(z_1,...,z_K)_k = \frac{e^{z_k}}{\sum_{m=1}^K e^{z_m}}\]
The vector of predictions \({\bf y}\) is now a probability distribution over the classes!
The cross-entropy loss naturally generalizes to the multi-class case:
\[\begin{align*} \mathcal{L}({\bf y}, {\bf t}) &= -\sum_{k=1}^K t_k \log (y_k) \\ &= - {\bf t}^\top \log({\bf y}) \end{align*}\]
Recall that only one of the \(t_k\) is going to be 1, and the rest are 0.
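A minimal numpy sketch of the softmax and multi-class cross entropy (subtracting the max before exponentiating is a common numerical-stability trick, not something the formula above requires):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # shift by max(z) for numerical stability
    return e / e.sum()

def cross_entropy(y, t):
    return -np.sum(t * np.log(y))    # only the true class (t_k = 1) contributes

z = np.array([2.0, 1.0, -1.0, 0.5])  # logits (made-up values)
t = np.array([1.0, 0.0, 0.0, 0.0])   # one-hot target
y = softmax(z)
print(y, cross_entropy(y, t))        # y sums to 1; loss = -log(y_0)
```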
| Model | \({\bf y} = \text{softmax}({\bf W}^\top{\bf x} + {\bf b})\) |
|---|---|
| Loss function | \(\mathcal{L}({\bf y}, {\bf t}) = -{\bf t}^\top \log({\bf y})\) |
| Optimization method | \(\min_{{\bf W},\, {\bf b}} \mathcal{E}({\bf W}, {\bf b})\) via gradient descent |
Updating rules:
\[ {\bf W} \leftarrow {\bf W} - \alpha \frac{\partial \mathcal{E}}{\partial {\bf W}}, \quad {\bf b} \leftarrow {\bf b} - \alpha \frac{\partial \mathcal{E}}{\partial {\bf b}} \]
Given a \(100 \times 100\) pixel colour image of the face of a Beatle, identify the Beatle.
Four possible labels: John, Paul, George, or Ringo.
This is what John Lennon looks like to a computer:
Each of our input images is \(100 \times 100\) pixels.
\({\bf y} = \text{softmax}\left({\bf W}^\top{\bf x} + {\bf b}\right)\)
Q: What will be the length of our input (feature) vectors \({\bf x}\)?
Q: What will be the length of our one-hot targets \({\bf t}\)?
Q: What are the shapes of \({\bf W}\) and \({\bf b}\)?
Q: How many (scalar) parameters are in our model, in total?
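One way to sanity-check these questions in code (assuming the colour images have 3 RGB channels, which the slide does not state explicitly):

```python
import numpy as np

H, W_px, C = 100, 100, 3   # image height, width, channels (RGB assumed)
K = 4                      # one class per Beatle
D = H * W_px * C           # flattened feature length: 30,000

W = np.zeros((D, K))       # weight matrix, so z = W.T @ x has length K
b = np.zeros(K)            # bias vector, length K
x = np.zeros(D)            # one flattened image

z = W.T @ x + b
print(D, z.shape, W.size + b.size)  # 30000, (4,), 120004 parameters in total
```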