CSC413 Neural Networks and Deep Learning

Lecture 2

Last Week

  • Review of linear models
    • linear regression
    • linear classification (logistic regression)
  • Gradient descent to train these models

This Week

  • Biological and Artificial Neurons
  • Limitations of Linear Models for Classification
  • Multilayer Perceptrons
  • Backpropagation

Biological and Artificial Neurons

Neuron

Neuron Anatomy

  • The dendrites, which are connected to other cells that provide information.
  • The cell body, which consolidates information from the dendrites.
  • The axon, which is an extension from the cell body that passes information to other cells.
  • The synapse, which is the area where the axon of one neuron and the dendrite of another connect.

What does a neuron do?

  • Consolidates “information” (voltage difference) from its dendrites
  • If the total activity in a neuron’s dendrites lowers the voltage difference enough, the entire cell depolarizes and the neuron fires

What does a neuron do? II

  • The voltage signal spreads along the axon and to the synapse, then to the next neurons
  • Neuron sends information to the next cell

What makes a neuron fire?

Different neurons fire in response to different stimuli…

  • light (retinal cells)
  • certain edges, lines, angles, movements
  • hands and faces (in primates)
  • specific people (in humans)
    • although the existence of these “grandmother cells” is contested

Modeling Individual Neurons

  • \(x_{i}\) are inputs to the neuron
  • \(w_{i}\) are the neuron’s weights
  • \(b\) is the neuron’s bias

Modeling Individual Neurons II

  • \(f\) is an activation function
  • \(f(\sum_i x_i w_i + b)\) is the neuron’s activation (output)
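As a sketch, this model translates directly into a few lines of NumPy (the function name `neuron` and the example values below are our own, not from the lecture):

```python
import numpy as np

def neuron(x, w, b, f):
    """Compute f(sum_i x_i w_i + b), the activation of one artificial neuron."""
    return f(np.dot(w, x) + b)

# Example with a sigmoid activation
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([1.0, 2.0])   # inputs
w = np.array([0.5, -0.3])  # weights
print(neuron(x, w, b=0.1, f=sigmoid))  # a value in (0, 1)
```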

Linear Models as a Single Neuron

  • \(x_{i}\) are the inputs
  • \(w_{i}\) are components of the weight vector \({\bf w}\)
  • \(b\) is the bias

Linear Models as a Single Neuron II

  • \(f\) is the identity function
  • \(y = \sum_i x_i w_i + b = {\bf w}^\top {\bf x} + b\) is the output

Logistic Regression Models (for Binary Classification) as a Single Neuron

  • \(x_{i}\) are the inputs
  • \(w_{i}\) are components of the weight vector \({\bf w}\)
  • \(b\) is the bias

Logistic Regression Models (for Binary Classification) as a Single Neuron II

  • \(f = \sigma\)
  • \(y = \sigma(\sum_i x_i w_i + b) = \sigma({\bf w}^\top {\bf x} + b)\)

Logistic Regression Models (for Multi-Class Classification) as a Neural Network

We use \(K\) neurons (one for each class):

  • \(x_{i}\) are the inputs
  • \(w_{j,i}\) are components of the weight matrix \(W\in \mathbb{R}^{K\times N}\)
  • \(b_j\) are components of the bias vector \({\bf b} \in \mathbb{R}^{K}\)
  • \(f = \text{softmax}\), applied to the entire vector of values
  • \({\bf y} = \text{softmax}(W{\bf x} + {\bf b})\) are the outputs of the \(K\) neurons
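A minimal NumPy sketch of this \(K\)-neuron model (the helper names `softmax` and `predict` are ours):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def predict(x, W, b):
    """y = softmax(W x + b), with W in R^{K x N} and b in R^K."""
    return softmax(W @ x + b)
```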

Limits of Linear Models for Binary Classification

XOR example

  • Single neurons (linear classifiers) are very limited in expressive power
  • XOR is a classic example of a function that’s not linearly separable, with an elegant proof using convexity

Convex Sets

A set \(S\subseteq\mathbb{R}^n\) is convex if any line segment connecting points in \(S\) lies in \(S\).

More formally, \(S\) is convex iff

\[{\bf x_1}, {\bf x_2} \in S \implies \forall \lambda \in [0,\, 1],\, \lambda {\bf x_1} + (1 - \lambda){\bf x_2} \in S.\]

A simple inductive argument shows that for \({\bf x_1}, \dots, {\bf x_N} \in S\), the weighted average or convex combination lies in the set:

\[\lambda_1 {\bf x_1} + \dots + \lambda_N {\bf x_N} \in S \quad\text{for } \lambda_i \geq 0,\ \lambda_1 + \dots + \lambda_N = 1.\]

XOR not linearly separable

Initial Observations

  • A binary linear classifier divides the Euclidean space into two half-spaces
  • Half-spaces are convex

XOR not linearly separable II

  • Suppose there were some feasible hypothesis. If the positive examples lie in the positive half-space, then the green line segment connecting them must as well.
  • Similarly, the red line segment must lie within the negative half-space.
  • But the two segments intersect, and their intersection point can’t lie in both half-spaces at once. Contradiction!

History of the XOR Example

  • Minsky and Papert showed in their 1969 book Perceptrons that XOR cannot be learned by a single neuron.
  • The book’s pessimistic outlook on perceptrons is considered one of the causes of the AI winter of the 1970s and early 1980s.

A more troubling example

These images represent 16-dimensional vectors. We want to distinguish patterns A and B under all possible translations (with wrap-around).

Suppose there were a feasible solution. The average of all translations of A is the vector (0.25, 0.25, …, 0.25). This average is a convex combination of the translations, so by the convexity argument above it must be classified as A. But all translations of B have the same average, so the same point must also be classified as B. Contradiction!

(Nonlinear) Feature Maps

Sometimes, we can overcome this limitation with nonlinear feature maps.

Nonlinear feature maps transform the original input features into a different (often higher dimensional) representation.

Consider the XOR problem again and use the following feature map: \[\phi({\bf x}) = \begin{pmatrix}\phi_1({\bf x}) \\ \phi_2({\bf x}) \\ \phi_3({\bf x})\end{pmatrix} = \begin{pmatrix}x_1 \\ x_2 \\ x_1x_2 \end{pmatrix}\]

(Nonlinear) Feature Maps II

| \(x_1\) | \(x_2\) | \(\phi_1({\bf x})\) | \(\phi_2({\bf x})\) | \(\phi_3({\bf x})\) | \(t\) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 |

This is linearly separable. (Try it with the sketch below!)
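As a quick check (our own sketch; the separating weights below are one hand-picked choice, not unique):

```python
import numpy as np

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 0])

# Feature map phi(x) = (x1, x2, x1 * x2)
Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# One separating hyperplane in feature space
w, b = np.array([1.0, 1.0, -2.0]), -0.5
pred = (Phi @ w + b > 0).astype(int)
print(pred)  # [0 1 1 0], matching t, so XOR is separable in feature space
```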

… but generally, it can be hard to pick good basis functions.

We’ll use neural nets to learn nonlinear hypotheses directly

Multilayer Perceptrons

An Artificial Neural Network (Multilayer Perceptron)

Idea

  • Use a simplified (mathematical) model of a neuron as building blocks
  • Connect the neurons together across multiple layers.

An Artificial Neural Network (Multilayer Perceptron) II

  • An input layer: feed in input features (e.g. like retinal cells in your eyes)
  • A number of hidden layers: units here have no pre-assigned meaning
  • An output layer: interpret output like a “grandmother cell”

But what do these neurons mean?

  • Use \(x_i\) to encode the input
    • e.g. pixels in an image
  • Use \(y\) to encode the output (of a binary classification problem)
    • e.g. cancer vs. not cancer
  • Use \(h_i^{(k)}\) to denote a unit in the hidden layer
    • difficult to interpret

Example: MNIST Digit Recognition

MNIST Digit Recognition II

With a logistic regression model, we would have:

  • Input: A 28×28-pixel grayscale image
    • \({\bf x}\) is a vector of length 784
  • Target: The digit represented in the image
    • \({\bf t}\) is a one-hot vector of length 10
  • Model: \({\bf y} = \text{softmax}(W{\bf x} + {\bf b})\)

Adding a Hidden Layer

Two-layer neural network

  • Input size: 784 (number of features)
  • Hidden size: 50 (we choose this number)
  • Output size: 10 (number of classes)

Side note about machine learning models

When discussing machine learning and deep learning models, we usually

  • first talk about how to make predictions, assuming the weights are already trained
  • then talk about how to train the weights

Often the second step requires gradient descent or some other optimization method

Making Predictions: computing the hidden layer

\[\begin{align*} h_1 &= f\left(\sum_{i=1}^{784} w^{(1)}_{1,i} x_i + b^{(1)}_1\right) \\ h_2 &= f\left(\sum_{i=1}^{784} w^{(1)}_{2,i} x_i + b^{(1)}_2\right) \\ ... \end{align*}\]

Making Predictions: computing the output (pre-activation)

\[\begin{align*} z_1 &= \sum_{j=1}^{50} w^{(2)}_{1,j} h_j + b^{(2)}_1 \\ z_2 &= \sum_{j=1}^{50} w^{(2)}_{2,j} h_j + b^{(2)}_2 \\ ... \end{align*}\]

Making Predictions: applying the output activation

\[\begin{align*} {\bf z} &= \begin{bmatrix}z_1 \\ z_2 \\ \vdots \\ z_{10}\end{bmatrix},\quad {\bf y} = \text{softmax}({\bf z}) \end{align*}\]

Making Predictions: Vectorized

\[\begin{align*} {\bf h} &= f(W^{(1)}{\bf x} + {\bf b}^{(1)}) \\ {\bf z} &= W^{(2)}{\bf h} + {\bf b}^{(2)} \\ {\bf y} &= \text{softmax}({\bf z}) \end{align*}\]
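In NumPy, this vectorized forward pass might look like the following sketch (the ReLU choice for \(f\), the random initialization, and all names are our assumptions):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    h = np.maximum(0, W1 @ x + b1)  # hidden layer, f = ReLU
    z = W2 @ h + b2                 # output pre-activation
    e = np.exp(z - np.max(z))       # softmax (stabilized)
    return e / e.sum()

# Shapes for the MNIST example: W1 is 50x784, W2 is 10x50
rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.normal(size=(50, 784)), np.zeros(50)
W2, b2 = 0.01 * rng.normal(size=(10, 50)), np.zeros(10)
y = forward(rng.normal(size=784), W1, b1, W2, b2)
print(y.shape, y.sum())  # (10,) and ~1.0
```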

Activation Functions: common choices

Common Choices:

  • Sigmoid activation
  • Tanh activation
  • ReLU activation

Rule of thumb: Start with ReLU activation. If necessary, try tanh.

Activation Function: Sigmoid

  • Gradient vanishes at the extremes as the function converges to \(0\) and \(1\) respectively.
  • All activations are positive (see this blog post to learn why we don’t want this)

Activation Function: Tanh

  • a scaled and shifted version of the sigmoid: \(\tanh(z) = 2\sigma(2z) - 1\)
  • the gradient still vanishes at the extremes
  • activations can be positive or negative

Activation Function: ReLU

  • most often used nowadays
  • all activations are positive
  • easy to compute gradients
  • can be problematic if the bias is too large and negative, so that the activations are always 0 (the “dying ReLU” problem)
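For reference, a minimal sketch of the three activations in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # outputs in (0, 1): all positive

def tanh(z):
    return np.tanh(z)                # outputs in (-1, 1): zero-centered

def relu(z):
    return np.maximum(0.0, z)        # 0 for z < 0, identity for z >= 0
```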

Feature Learning

Neural nets can be viewed as a way of learning features:

Feature Learning (cont’d)

The goal is for these features to become linearly separable:

Computing XOR

Exercise: design a network to compute XOR

Use a hard threshold activation function:

\[\begin{align*} f(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \end{align*}\]
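One possible solution sketch (the weights below are our own choice; the exercise has many valid answers): the hidden layer computes OR and AND, and the output computes OR-and-not-AND, which is XOR.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)  # hard threshold activation

def xor_net(x):
    # Hidden layer: h[0] computes OR(x1, x2), h[1] computes AND(x1, x2)
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output: OR AND (NOT AND) = XOR
    w2, b2 = np.array([1.0, -1.0]), -0.5
    return step(w2 @ h + b2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))  # 0, 1, 1, 0
```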

Computing XOR Demo

Demo: https://playground.tensorflow.org/

Expressive Power: Linear Layers (No Activation Function)

  • We’ve seen that there are some functions that linear classifiers can’t represent. Are deep networks any better?
  • Any sequence of layers (with no activation function) can be equivalently represented with a single linear layer.

\[\begin{align*} {\bf y} &= \left(W^{(3)} W^{(2)} W^{(1)}\right) {\bf x} \\ &= W^\prime {\bf x} \end{align*}\]
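A quick numerical check of this collapse (a sketch with arbitrary layer sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # layer 1
W2 = rng.normal(size=(5, 4))   # layer 2
W3 = rng.normal(size=(2, 5))   # layer 3
x = rng.normal(size=3)

y_deep = W3 @ (W2 @ (W1 @ x))  # three linear layers applied in sequence
W_prime = W3 @ W2 @ W1         # one equivalent linear layer
print(np.allclose(y_deep, W_prime @ x))  # True
```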

Expressive Power: MLP (nonlinear activation)

  • Multilayer feed-forward neural nets with nonlinear activation functions are universal approximators: they can approximate any function arbitrarily well.
  • This has been shown for various activation functions (thresholds, logistic, ReLU, etc.)
    • Even though ReLU is “almost” linear, it’s nonlinear enough!

Universality for binary inputs and targets

  • Hard threshold hidden units, linear output
  • Strategy: \(2^D\) hidden units, each of which responds to one particular input configuration
    • Only requires one hidden layer, though it needs to be extremely wide!

Limits of universality

  • You may need an exponentially large network to represent the function.
  • If you can learn any function, you might just overfit.

Backpropagation

Training Neural Networks

  • How do we find good weights for the neural network?
  • We can continue to use the same loss functions:
    • cross-entropy loss for classification
    • square loss for regression
  • The operations in the neural network (linear layers, activation functions, etc.) are differentiable (almost everywhere)

We can use gradient descent!

Gradient Descent Recap

Goal: Compute the minimum of a function \(\mathcal{E}({\bf a})\)

  • Start with a set of parameters \(\mathbf{a}_0\) (initialize to some value)
  • Compute the gradient \(\frac{\partial \mathcal{E}}{\partial \mathbf{a}}\).
  • Update the parameters in the direction of the negative gradient (see the update rule below)
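In symbols, with learning rate \(\alpha > 0\) (a hyperparameter we choose), each step computes

\[\mathbf{a}_{t+1} = \mathbf{a}_t - \alpha \left.\frac{\partial \mathcal{E}}{\partial \mathbf{a}}\right|_{\mathbf{a} = \mathbf{a}_t}\]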

Gradient Descent for Neural Networks

Idea: Use gradient descent for “learning” neural networks.

  • We have a lot of parameters
    • High dimensional (all weights and biases are parameters)
    • Hard to visualize
    • Many iterations (“steps”) needed
  • In deep learning, \(\frac{\partial \mathcal{E}}{\partial w}\) is the average of \(\frac{\partial \mathcal{L}}{\partial w}\) over multiple training examples

Challenge: How to compute \(\frac{\partial \mathcal{L}}{\partial w}\) efficiently.

Solution: Backpropagation!

Univariate Chain Rule

Recall: if \(f(x)\) and \(x(t)\) are univariate functions, then

\[\frac{d}{dt}f(x(t)) = \frac{df}{dx}\frac{dx}{dt}\]

Univariate Chain Rule for Least Squares with a Logistic Model

Recall: Univariate logistic least squares model

\[\begin{align*} z &= wx + b \\ y &= \sigma(z) \\ \mathcal{L} &= \frac{1}{2}(y - t)^2 \end{align*}\]

Let’s compute the loss derivative

Univariate Chain Rule Computation I

How you would have done it in calculus class

\[\begin{align*} \mathcal{L} &= \frac{1}{2} ( \sigma(w x + b) - t)^2 \\ \frac{\partial \mathcal{L}}{\partial w} &= \frac{\partial}{\partial w} \left[ \frac{1}{2} ( \sigma(w x + b) - t)^2 \right] \\ &= \frac{1}{2} \frac{\partial}{\partial w} ( \sigma(w x + b) - t)^2 \\ &= (\sigma(w x + b) - t) \frac{\partial}{\partial w} (\sigma(w x + b) - t) \\ &\ldots \end{align*}\]

Univariate Chain Rule Computation II

How you would have done it in calculus class

\[\begin{align*} \ldots &= (\sigma(w x + b) - t) \frac{\partial}{\partial w} (\sigma(w x + b) - t) \\ &= (\sigma(w x + b) - t) \sigma^\prime (w x + b) \frac{\partial}{\partial w} (w x + b) \\ &= (\sigma(w x + b) - t) \sigma^\prime (w x + b) x \end{align*}\]

Univariate Chain Rule Computation III

Similarly for \(\frac{\partial \mathcal{L}}{\partial b}\)

\[\begin{align*} \mathcal{L} &= \frac{1}{2} ( \sigma(w x + b) - t)^2 \\ \frac{\partial \mathcal{L}}{\partial b} &= \frac{\partial}{\partial b} \left[ \frac{1}{2} ( \sigma(w x + b) - t)^2 \right] \\ &= \frac{1}{2} \frac{\partial}{\partial b} ( \sigma(w x + b) - t)^2 \\ &= (\sigma(w x + b) - t) \frac{\partial}{\partial b} (\sigma(w x + b) - t) \\ &\ldots \end{align*}\]

Univariate Chain Rule Computation IV

Similarly for \(\frac{\partial \mathcal{L}}{\partial b}\)

\[\begin{align*} \ldots &= (\sigma(w x + b) - t) \frac{\partial}{\partial b} (\sigma(w x + b) - t) \\ &= (\sigma(w x + b) - t) \sigma^\prime (w x + b) \frac{\partial}{\partial b} (w x + b) \\ &= (\sigma(w x + b) - t) \sigma^\prime (w x + b) \end{align*}\]

Q: What are the disadvantages of this approach?

A More Structured Way to Compute the Derivatives

\[\begin{align*} z &= wx + b \\ y &= \sigma(z) \\ \mathcal{L} &= \frac{1}{2}(y - t)^2 \end{align*}\]

Less repeated work; easier to write a program to efficiently compute derivatives

A More Structured Way to Compute the Derivatives II

\[\begin{align*} \frac{d \mathcal{L}}{d y} &= y - t \\ \frac{d \mathcal{L}}{d z} &= \frac{d \mathcal{L}}{d y}\sigma'(z) \\ \frac{\partial \mathcal{L}}{\partial w} &= \frac{d \mathcal{L}}{d z} \, x \\ \frac{\partial \mathcal{L}}{\partial b} &= \frac{d \mathcal{L}}{d z} \end{align*}\]

Less repeated work; easier to write a program to efficiently compute derivatives

Computation Graph

We can diagram out the computations using a computation graph.

  • The nodes represent all the inputs and computed quantities.
  • The edges represent which nodes are computed directly as a function of which other nodes.

Chain Rule (Error Signal) Notation

  • Use \(\overline{y}\) to denote the derivative \(\frac{d \mathcal{L}}{d y}\)
    • sometimes called the error signal
  • This notation emphasizes that the error signals are just values our program is computing (rather than a mathematical operation).
  • This notation was introduced by Prof. Roger Grosse and is not standard.

Chain Rule (Error Signal) Notation II

Computing the loss:

\[\begin{align*} z &= wx + b \\ y &= \sigma(z) \\ \mathcal{L} &= \frac{1}{2}(y - t)^2 \end{align*}\]

Computing the derivatives:

\[\begin{align*} \overline{y} &= y - t \\ \overline{z} &= \overline{y} \sigma'(z) \\ \overline{w} &= \overline{z} \, x \\ \overline{b} &= \overline{z} \end{align*}\]
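These updates translate almost line-for-line into code; a minimal sketch in NumPy, with variable names mirroring the math:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w, b, x, t):
    # Forward pass (saving intermediate values)
    z = w * x + b
    y = sigmoid(z)
    L = 0.5 * (y - t) ** 2
    # Backward pass (error signals)
    y_bar = y - t
    z_bar = y_bar * y * (1 - y)  # sigma'(z) = sigma(z)(1 - sigma(z)) = y(1 - y)
    w_bar = z_bar * x
    b_bar = z_bar
    return L, w_bar, b_bar

print(loss_and_grads(w=1.0, b=0.0, x=2.0, t=1.0))
```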

Multiclass Logistic Regression Computation Graph

In general, the computation graph fans out:

Multiclass Logistic Regression Computation Graph II

\[\begin{align*} z_l &= \sum_j w_{lj} x_j + b_l \\ y_k &= \frac{e^{z_k}}{\sum_l e^{z_l}} \\ \mathcal{L} &= -\sum_k t_k \log{y_k} \end{align*}\]

There are multiple paths through which a weight like \(w_{11}\) affects the loss \(\mathcal{L}\).

Multivariate Chain Rule

Suppose we have a function \(f(x, y)\) and functions \(x(t)\) and \(y(t)\) (all the variables here are scalar-valued). Then

\[\frac{d}{dt}f(x(t), y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}\]

Multivariate Chain Rule Example

If \(f(x, y) = y + e^{xy}\), \(x(t) = \cos t\) and \(y(t) = t^2\)

\[\begin{align*} \frac{d}{dt}f(x(t), y(t)) &= \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} \\ &= \left( y e^{xy} \right) \cdot \left( -\sin (t) \right) + \left( 1 + xe^{xy} \right) \cdot 2t \end{align*}\]

Multivariate Chain Rule Notation

In our notation

\[\overline{t} = \overline{x} \frac{dx}{dt} + \overline{y} \frac{dy}{dt}\]

The Backpropagation Algorithm

  • Backpropagation is an algorithm to compute gradients efficiently
    • Forward Pass: Compute predictions (and save intermediate values)
    • Backwards Pass: Compute gradients
  • The idea behind backpropagation is very similar to dynamic programming
    • Use chain rule, and be careful about the order in which we compute the derivatives

Backpropagation Example

Backpropagation for an MLP

Forward pass: \[\begin{align*} z_i &= \sum_j w_{ij}^{(1)} x_j + b_i^{(1)} \\ h_i &= \sigma(z_i) \\ y_k &= \sum_i w_{ki}^{(2)} h_i + b_k^{(2)} \\ \mathcal{L} &= \frac{1}{2}\sum_k (y_k - t_k)^2 \end{align*}\]

Backpropagation for an MLP II

Backward pass:

\[\begin{align*} \overline{\mathcal{L}} &= 1 \\ \overline{y_k} &= \overline{\mathcal{L}}(y_k - t_k) \\ \overline{w_{ki}^{(2)}} &= \overline{y_k}h_i \\ \overline{b_{k}^{(2)}} &= \overline{y_k} \end{align*}\]

\[\begin{align*} \overline{h_i} &= \sum_k \overline{y_k} w_{ki}^{(2)} \\ \overline{z_i} &= \overline{h_i} \sigma'(z_i) \\ \overline{w_{ij}^{(1)}} &= \overline{z_i} x_j \\ \overline{b_{i}^{(1)}} &= \overline{z_i} \end{align*}\]

Backpropagation for an MLP: Vectorized

Forward pass: \[\begin{align*} {\bf z} &= W^{(1)}{\bf x} + {\bf b}^{(1)} \\ {\bf h} &= \sigma({\bf z}) \\ {\bf y} &= W^{(2)}{\bf h} + {\bf b}^{(2)} \\ \mathcal{L} &= \frac{1}{2} || {\bf y} - {\bf t}||^2 \end{align*}\]

Backpropagation for an MLP: Vectorized II

Backward pass: \[\begin{align*} \overline{\mathcal{L}} &= 1 \\ \overline{{\bf y}} &= \overline{\mathcal{L}}({\bf y} - {\bf t}) \\ \overline{W^{(2)}} &= \overline{{\bf y}}{\bf h}^\top \\ \overline{{\bf b^{(2)}}} &= \overline{{\bf y}} \\ & \ldots \end{align*}\]

Backpropagation for an MLP: Vectorized III

Backward pass: \[\begin{align*} & \ldots \\ \overline{{\bf h}} &= {W^{(2)}}^\top\overline{{\bf y}} \\ \overline{{\bf z}} &= \overline{{\bf h}} \circ \sigma'({\bf z}) \\ \overline{W^{(1)}} &= \overline{{\bf z}} {\bf x}^\top \\ \overline{{\bf b}^{(1)}} &= \overline{{\bf z}} \end{align*}\]
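Putting the vectorized forward and backward passes together, a minimal NumPy sketch (function and variable names are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, t, W1, b1, W2, b2):
    # Forward pass (save intermediate values)
    z = W1 @ x + b1
    h = sigmoid(z)
    y = W2 @ h + b2
    L = 0.5 * np.sum((y - t) ** 2)
    # Backward pass (error signals, in reverse order)
    y_bar = y - t                 # dL/dy
    W2_bar = np.outer(y_bar, h)   # y_bar h^T
    b2_bar = y_bar
    h_bar = W2.T @ y_bar
    z_bar = h_bar * h * (1 - h)   # elementwise; sigma'(z) = h(1 - h)
    W1_bar = np.outer(z_bar, x)
    b1_bar = z_bar
    return L, (W1_bar, b1_bar, W2_bar, b2_bar)
```

Each returned error signal has the same shape as the parameter it corresponds to, which is exactly what a gradient descent update needs.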

Implementing Backpropagation I

Implementing Backpropagation II

Forward pass: Each node…

  • receives messages (inputs) from its parents
  • uses these messages to compute its own values

Backward pass: Each node…

  • receives messages (error signals) from its children
  • uses these messages to compute its own error signal
  • passes messages to its parents

This algorithm provides modularity!

Backpropagation in Vectorized Form

Backpropagation in practice

  • Backprop is used to train the overwhelming majority of neural nets today.
    • Even optimization algorithms much fancier than gradient descent (e.g. second-order methods) use backprop to compute the gradients.

Backpropagation in practice II

  • Despite its practical success, backprop is believed to be neurally (biologically) implausible.
    • No evidence for biological signals analogous to error derivatives.
    • All the biologically plausible alternatives we know about learn much more slowly (on computers).

Wrap Up

Summary

  • Artificial neurons draw inspiration from nerve cells / neurons in the brain
  • On its own, the expressiveness of a single neuron is limited
  • Stacking neurons and nonlinear activation functions allows for learning more complex functions
  • Backpropagation gives us an efficient way to compute the gradients needed for this learning

What to do this week?

  • You can already complete Assignment 1
    • Start early so you can get help early!
  • Attend tutorials this week!
  • Complete the readings for this week.
  • Preview next week’s materials