Lecture 2
Neurons can fire in response to…
We use \(K\) neurons (one for each class):
A set \(S\subseteq\mathbb{R}^n\) is convex if any line segment connecting points in \(S\) lies in \(S\).
More formally, \(S\) is convex iff
\[{\bf x_1}, {\bf x_2} \in S \implies \forall \lambda \in [0,\, 1],\, \lambda {\bf x_1} + (1 - \lambda){\bf x_2} \in S.\]
A simple inductive argument shows that for \({\bf x_1}, \dots, {\bf x_N} \in S\), any weighted average (convex combination) also lies in the set:
\[\lambda_1 {\bf x_1} + \dots + \lambda_N {\bf x_N} \in S \text{ for } \lambda_i \geq 0,\ \lambda_1 + \dots + \lambda_N = 1\ .\]
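One way to see the inductive step (for \(\lambda_N < 1\); the case \(\lambda_N = 1\) is immediate):
\[\sum_{i=1}^{N} \lambda_i {\bf x_i} = (1 - \lambda_N) \sum_{i=1}^{N-1} \frac{\lambda_i}{1 - \lambda_N} {\bf x_i} + \lambda_N {\bf x_N},\]
where the inner sum is a convex combination of \(N-1\) points of \(S\) (its coefficients are nonnegative and sum to 1), so it lies in \(S\) by the inductive hypothesis; the whole expression is then a two-point convex combination of elements of \(S\).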
Initial Observations
These images represent 16-dimensional vectors. We want to distinguish patterns A and B under all possible translations (with wrap-around).
Suppose there’s a feasible solution. The set of inputs a linear classifier labels as A is a half-space, hence convex, so it must contain the average of all translations of A, which is the vector \((0.25, 0.25, \dots, 0.25)\). Therefore, this point must be classified as A. But all translations of B have the same average, so it must also be classified as B. Contradiction!
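A quick numerical illustration of this averaging argument (the 16-pixel patterns below are placeholders with four on-pixels each, not the actual A and B from the slide):

```python
import numpy as np

# Placeholder 16-pixel patterns; each has four on-pixels,
# consistent with the stated average of 0.25 per coordinate.
A = np.array([1, 1, 1, 1] + [0] * 12, dtype=float)
B = np.array([1, 1, 0, 1, 1] + [0] * 11, dtype=float)

def average_of_translations(p):
    """Average of all cyclic shifts (translations with wrap-around) of pattern p."""
    shifts = np.stack([np.roll(p, k) for k in range(len(p))])
    return shifts.mean(axis=0)

print(average_of_translations(A))  # [0.25 0.25 ... 0.25]
print(average_of_translations(B))  # identical average, so no linear classifier can separate them
```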
Sometimes, we can overcome this limitation with nonlinear feature maps.
Nonlinear feature maps transform the original input features into a different (often higher dimensional) representation.
Consider the XOR problem again and use the following feature map: \[\Psi({\bf x}) = \begin{pmatrix}x_1 \\ x_2 \\ x_1x_2 \end{pmatrix}\]
| \(x_1\) | \(x_2\) | \(\Psi_1({\bf x})\) | \(\Psi_2({\bf x})\) | \(\Psi_3({\bf x})\) | \(t\) |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 |
This is linearly separable (Try it!)
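As a quick check, here is one separating hyperplane in the feature space (the weights are hand-picked for illustration, not from the lecture):

```python
import numpy as np

# XOR inputs, transformed features Psi(x) = (x1, x2, x1*x2), and targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Psi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
t = np.array([0, 1, 1, 0])

# One separating hyperplane in feature space (hand-picked):
w = np.array([1.0, 1.0, -2.0])
b = -0.5

pred = (Psi @ w + b > 0).astype(int)
print(pred)               # [0 1 1 0]
print((pred == t).all())  # True: the transformed data is linearly separable
```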
… but in general, it can be hard to pick good basis functions.
We’ll use neural nets to learn nonlinear hypotheses directly.
Idea
With a logistic regression model, we would have \(y = \sigma({\bf w}^\top {\bf x} + b)\): a linear function of the inputs, followed by a sigmoid.
Two layer neural network
When discussing machine learning and deep learning models, we usually separate two steps: specifying the model (which hypotheses it can represent), and fitting its parameters to training data.
Often the second step requires gradient descent or some other optimization method.
\[\begin{align*} h_1 &= f\left(\sum_{i=1}^{784} w^{(1)}_{1,i} x_i + b^{(1)}_1\right) \\ h_2 &= f\left(\sum_{i=1}^{784} w^{(1)}_{2,i} x_i + b^{(1)}_2\right) \\ ... \end{align*}\]
\[\begin{align*} z_1 &= \sum_{j=1}^{50} w^{(2)}_{1,j} h_j + b^{(2)}_1 \\ z_2 &= \sum_{j=1}^{50} w^{(2)}_{2,j} h_j + b^{(2)}_2 \\ ... \end{align*}\]
\[\begin{align*} {\bf z} &= \begin{bmatrix}z_1 \\ z_2 \\ \vdots \\ z_{10}\end{bmatrix},\quad {\bf y} = \text{softmax}({\bf z}) \end{align*}\]
\[\begin{align*} {\bf h} &= f(W^{(1)}{\bf x} + {\bf b}^{(1)}) \\ {\bf z} &= W^{(2)}{\bf h} + {\bf b}^{(2)} \\ {\bf y} &= \text{softmax}({\bf z}) \end{align*}\]
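A minimal NumPy sketch of this forward pass (layer sizes 784–50–10 taken from the scalar equations above; the weights and input are random placeholders, and \(f\) is taken to be ReLU just for concreteness):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.01, (50, 784)), np.zeros(50)   # first layer parameters
W2, b2 = rng.normal(0, 0.01, (10, 50)), np.zeros(10)    # second layer parameters

def f(z):                      # elementwise activation (ReLU here)
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())    # subtract max for numerical stability
    return e / e.sum()

x = rng.random(784)            # stand-in input (e.g., a flattened image)
h = f(W1 @ x + b1)
z = W2 @ h + b2
y = softmax(z)
print(y.shape, y.sum())        # (10,) and probabilities summing to 1
```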
Common choices for the activation function \(f\) include the logistic sigmoid \(\sigma\), \(\tanh\), and ReLU.
Rule of thumb: Start with ReLU activation. If necessary, try tanh.
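For reference, these activations are defined as
\[\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad \operatorname{ReLU}(z) = \max(0, z).\]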
Neural nets can be viewed as a way of learning features:
The goal is for these features to become linearly separable:
Exercise: design a network to compute XOR
Use a hard threshold activation function:
\[\begin{align*} f(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \end{align*}\]
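One possible solution sketch (the weights below are hand-picked, not taken from the lecture): the hidden units compute OR and AND of the inputs, and the output fires when OR holds but AND does not.

```python
import numpy as np

def step(z):
    """Hard threshold activation: 1 if z >= 0, else 0."""
    return (np.asarray(z) >= 0).astype(int)

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)   # h1 = OR(x1, x2)
    h2 = step(x1 + x2 - 1.5)   # h2 = AND(x1, x2)
    return step(h1 - h2 - 0.5) # fires iff OR and not AND, i.e., XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # outputs 0, 1, 1, 0
```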
If we used the identity (no nonlinearity) as the activation, depth would buy us nothing: the stack of linear layers collapses to a single linear map,
\[\begin{align*} {\bf y} &= \left(W^{(3)} W^{(2)} W^{(1)}\right) {\bf x} \\ &= W^\prime {\bf x} \end{align*}\]
Limits of universality
We can use gradient descent!
Goal: Compute the minimum of a function \(\mathcal{E}({\bf a})\)
Idea: Use gradient descent for “learning” neural networks.
Challenge: How to compute \(\frac{\partial \mathcal{L}}{\partial w}\) efficiently?
Solution: Backpropagation!
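As a reminder, gradient descent repeatedly steps against the gradient, \({\bf a} \leftarrow {\bf a} - \alpha \nabla_{\bf a} \mathcal{E}({\bf a})\). A minimal sketch (the quadratic objective and learning rate are illustrative choices, not from the lecture):

```python
import numpy as np

def grad_E(a):
    # Gradient of an illustrative objective E(a) = ||a - 3||^2 / 2
    return a - 3.0

a = np.zeros(2)                  # initial parameters
alpha = 0.1                      # learning rate
for _ in range(100):
    a = a - alpha * grad_E(a)    # gradient descent update
print(a)                         # approaches [3. 3.], the minimizer
```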
Recall: if \(f(x)\) and \(x(t)\) are univariate functions, then
\[\frac{d}{dt}f(x(t)) = \frac{df}{dx}\frac{dx}{dt}\]
Recall: Univariate logistic least squares model
\[\begin{align*} z &= wx + b \\ y &= \sigma(z) \\ \mathcal{L} &= \frac{1}{2}(y - t)^2 \end{align*}\]
Let’s compute the loss derivative
How you would have done it in calculus class:
\[\begin{align*} \mathcal{L} &= \frac{1}{2} ( \sigma(w x + b) - t)^2 \\ \frac{\partial \mathcal{L}}{\partial w} &= \frac{\partial}{\partial w} \left[ \frac{1}{2} ( \sigma(w x + b) - t)^2 \right] \\ &= \frac{1}{2} \frac{\partial}{\partial w} ( \sigma(w x + b) - t)^2 \\ &= (\sigma(w x + b) - t) \frac{\partial}{\partial w} (\sigma(w x + b) - t) \\ &= (\sigma(w x + b) - t) \sigma^\prime (w x + b) \frac{\partial}{\partial w} (w x + b) \\ &= (\sigma(w x + b) - t) \sigma^\prime (w x + b)\, x \end{align*}\]
Similarly for \(\frac{\partial \mathcal{L}}{\partial b}\):
\[\begin{align*} \frac{\partial \mathcal{L}}{\partial b} &= \frac{\partial}{\partial b} \left[ \frac{1}{2} ( \sigma(w x + b) - t)^2 \right] \\ &= \frac{1}{2} \frac{\partial}{\partial b} ( \sigma(w x + b) - t)^2 \\ &= (\sigma(w x + b) - t) \frac{\partial}{\partial b} (\sigma(w x + b) - t) \\ &= (\sigma(w x + b) - t) \sigma^\prime (w x + b) \frac{\partial}{\partial b} (w x + b) \\ &= (\sigma(w x + b) - t) \sigma^\prime (w x + b) \end{align*}\]
Q: What are the disadvantages of this approach?
\[\begin{align*} z &= wx + b \\ y &= \sigma(z) \\ \mathcal{L} &= \frac{1}{2}(y - t)^2 \end{align*}\]
Less repeated work; easier to write a program to efficiently compute derivatives
\[\begin{align*} \frac{d \mathcal{L}}{d y} &= y - t \\ \frac{d \mathcal{L}}{d z} &= \frac{d \mathcal{L}}{d y}\sigma'(z) \\ \frac{\partial \mathcal{L}}{\partial w} &= \frac{d \mathcal{L}}{d z} \, x \\ \frac{\partial \mathcal{L}}{\partial b} &= \frac{d \mathcal{L}}{d z} \end{align*}\]
We can diagram out the computations using a computation graph.
Computing the loss:
\[\begin{align*} z &= wx + b \\ y &= \sigma(z) \\ \mathcal{L} &= \frac{1}{2}(y - t)^2 \end{align*}\]
Computing the derivatives:
\[\begin{align*} \overline{y} &= y - t \\ \overline{z} &= \overline{y} \sigma'(z) \\ \overline{w} &= \overline{z} \, x \\ \overline{b} &= \overline{z} \end{align*}\]
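A minimal sketch of these two passes in code, for the univariate model above (the input values are arbitrary):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(w, b, x, t):
    # Forward pass: compute the loss, caching intermediate values
    z = w * x + b
    y = sigma(z)
    L = 0.5 * (y - t) ** 2
    # Backward pass: "bar" quantities are derivatives of L w.r.t. each variable
    y_bar = y - t
    z_bar = y_bar * sigma(z) * (1 - sigma(z))   # sigma'(z) = sigma(z)(1 - sigma(z))
    w_bar = z_bar * x
    b_bar = z_bar
    return L, w_bar, b_bar

print(forward_backward(w=0.5, b=0.1, x=2.0, t=1.0))
```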
In general, the computation graph fans out:
\[\begin{align*} z_l &= \sum_j w_{lj} x_j + b_l \\ y_k &= \frac{e^{z_k}}{\sum_l e^{z_l}} \\ \mathcal{L} &= -\sum_k t_k \log{y_k} \end{align*}\]
There are multiple paths through which a weight like \(w_{11}\) affects the loss \(\mathcal{L}\).
Suppose we have a function \(f(x, y)\) and functions \(x(t)\) and \(y(t)\) (all the variables here are scalar-valued). Then
\[\frac{d}{dt}f(x(t), y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}\]
If \(f(x, y) = y + e^{xy}\), \(x(t) = \cos t\) and \(y(t) = t^2\)…
\[\begin{align*} \frac{d}{dt}f(x(t), y(t)) &= \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} \\ &= \left( y e^{xy} \right) \cdot \left( -\sin (t) \right) + \left( 1 + xe^{xy} \right) \cdot 2t \end{align*}\]
In our notation
\[\overline{t} = \overline{x} \frac{dx}{dt} + \overline{y} \frac{dy}{dt}\]
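As a quick sanity check of the worked example above, we can compare the chain-rule derivative against a finite-difference estimate (the evaluation point \(t = 1\) is arbitrary):

```python
import numpy as np

# Chain-rule derivative for f(x, y) = y + e^{xy}, x(t) = cos t, y(t) = t^2
def df_dt(t):
    x, y = np.cos(t), t ** 2
    return (y * np.exp(x * y)) * (-np.sin(t)) + (1 + x * np.exp(x * y)) * (2 * t)

def f_of_t(t):
    x, y = np.cos(t), t ** 2
    return y + np.exp(x * y)

t0, eps = 1.0, 1e-6
numeric = (f_of_t(t0 + eps) - f_of_t(t0 - eps)) / (2 * eps)
print(df_dt(t0), numeric)   # the two values agree to several decimal places
```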
Forward pass: \[\begin{align*} z_i &= \sum_j w_{ij}^{(1)} x_j + b_i^{(1)} \\ h_i &= \sigma(z_i) \\ y_k &= \sum_i w_{ki}^{(2)} h_i + b_k^{(2)} \\ \mathcal{L} &= \frac{1}{2}\sum_k (y_k - t_k)^2 \end{align*}\]
Backward pass:
\[\begin{align*} \overline{\mathcal{L}} &= 1 \\ \overline{y_k} &= \overline{\mathcal{L}}(y_k - t_k) \\ \overline{w_{ki}^{(2)}} &= \overline{y_k}h_i \\ \overline{b_{k}^{(2)}} &= \overline{y_k} \end{align*}\]
\[\begin{align*} \overline{h_i} &= \sum_k \overline{y_k} w_{ki}^{(2)} \\ \overline{z_i} &= \overline{h_i} \sigma'(z_i) \\ \overline{w_{ij}^{(1)}} &= \overline{z_i} x_j \\ \overline{b_{i}^{(1)}} &= \overline{z_i} \end{align*}\]
Forward pass: \[\begin{align*} {\bf z} &= W^{(1)}{\bf x} + {\bf b}^{(1)} \\ {\bf h} &= \sigma({\bf z}) \\ {\bf y} &= W^{(2)}{\bf h} + {\bf b}^{(2)} \\ \mathcal{L} &= \frac{1}{2} || {\bf y} - {\bf t}||^2 \end{align*}\]
Backward pass: \[\begin{align*} \overline{\mathcal{L}} &= 1 \\ \overline{{\bf y}} &= \overline{\mathcal{L}}({\bf y} - {\bf t}) \\ \overline{W^{(2)}} &= \overline{{\bf y}}{\bf h}^\top \\ \overline{{\bf b^{(2)}}} &= \overline{{\bf y}} \\ & \ldots \end{align*}\]
Backward pass: \[\begin{align*} & \ldots \\ \overline{{\bf h}} &= {W^{(2)}}^\top\overline{{\bf y}} \\ \overline{{\bf z}} &= \overline{{\bf h}} \circ \sigma'({\bf z}) \\ \overline{W^{(1)}} &= \overline{{\bf z}} {\bf x}^\top \\ \overline{{\bf b}^{(1)}} &= \overline{{\bf z}} \end{align*}\]
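A minimal NumPy sketch of this vectorized forward and backward pass (the layer sizes and data here are arbitrary placeholders):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H, K = 4, 3, 2                     # input, hidden, output sizes (arbitrary)
x, t = rng.random(D), rng.random(K)
W1, b1 = rng.normal(size=(H, D)), np.zeros(H)
W2, b2 = rng.normal(size=(K, H)), np.zeros(K)

# Forward pass
z = W1 @ x + b1
h = sigma(z)
y = W2 @ h + b2
L = 0.5 * np.sum((y - t) ** 2)

# Backward pass (the vectorized equations above)
y_bar = y - t                         # dL/dy
W2_bar = np.outer(y_bar, h)           # dL/dW2 = y_bar h^T
b2_bar = y_bar
h_bar = W2.T @ y_bar
z_bar = h_bar * sigma(z) * (1 - sigma(z))
W1_bar = np.outer(z_bar, x)
b1_bar = z_bar

print(W1_bar.shape, W2_bar.shape)     # (3, 4) (2, 3)
```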
Forward pass: Each node…
Backward pass: Each node…
This algorithm provides modularity!