Lecture 4
How can you “hard code” an algorithm that still recognizes that this is a cat?
In the week 3 tutorial, we worked with small MNIST images, which are \(28 \times 28\) pixels and greyscale.
How do our models work?
Q: How many parameters will there be in the first layer?
A: Each of the 500 hidden units connects to all \(200 \times 200\) input pixels, plus a bias: \(200 \times 200 \times 500 + 500 =\) over 20 million!
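As a sanity check, here is a minimal PyTorch sketch (assuming, as in the example, a \(200 \times 200\) greyscale input and 500 hidden units) that counts the layer's parameters:

```python
import torch.nn as nn

# Fully connected layer: every one of the 500 hidden units sees all
# 200 * 200 = 40,000 input pixels, plus one bias per hidden unit.
layer = nn.Linear(200 * 200, 500)

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 20000500 -- over 20 million
```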
Q: Why might using a fully connected layer be problematic?
There is evidence that biological neurons in the visual cortex are locally connected
See Hubel and Wiesel Cat Experiment (Note: there is an anesthetised cat in the video that some may find disturbing).
Each hidden unit connects to a small region of the input (in this case a \(3 \times 3\) region)
The hidden units have a 2D geometry consistent with that of the input.
Q: Which region of the input is this hidden unit connected to?
Fully-connected layers:
Locally connected layers:
Locally connected layers
Convolutional layers
Use the same weights across each region (each colour represents the same weight)
\[\begin{align*} 300 = & 100 \times 1 + 100 \times 2 + 100 \times 1 + \\ & 100 \times 0 + 100 \times 0 + 100 \times 0 + \\ & 100 \times (-1) + 0 \times (-2) + 0 \times (-1) \end{align*}\]
\[\begin{align*} 300 = &100 \times 1 + 100 \times 2 + 100 \times 1 + \\ &100 \times 0 + 100 \times 0 + 100 \times 0 + \\ &0 \times (-1) + 0 \times (-2) + 100 \times (-1) \end{align*}\]
Q: What is the value of the highlighted hidden activation?
\[\begin{align*} 100 = &100 \times 1 + 100 \times 2 + 100 \times 1 + \\ &100 \times 0 + 100 \times 0 + 100 \times 0 + \\ &0 \times (-1) + 100 \times (-2) + 100 \times (-1) \end{align*}\]
Each neuron on the higher layer is detecting the same feature, but in different locations on the lower layer
“Detecting” = output (activation) is high if the feature is present
“Feature” = something in a part of the image, like an edge or shape
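To make the arithmetic concrete, here is a small sketch that reproduces the first worked sum above; the kernel is the horizontal-edge detector from the example, and the patch values are read off the sum itself:

```python
import torch

# Horizontal-edge kernel from the worked example.
kernel = torch.tensor([[ 1.,  2.,  1.],
                       [ 0.,  0.,  0.],
                       [-1., -2., -1.]])

# 3x3 input patch under the kernel (values taken from the worked sum).
patch = torch.tensor([[100., 100., 100.],
                      [100., 100., 100.],
                      [100.,   0.,   0.]])

# One output activation = elementwise product, summed.
print((kernel * patch).sum())  # tensor(300.)
```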
Q: What is the kernel size of this convolution?
Greyscale input image: \(7\times 7\)
Convolution kernel: \(3 \times 3\)
Q: How many hidden units are in the output of this convolution?
A: \(5 \times 5 = 25\), since a \(3 \times 3\) kernel fits into \(7 - 3 + 1 = 5\) positions along each dimension
Q: How many trainable weights are there?
A: There are \(3 \times 3 + 1 = 10\) trainable weights (\(+ 1\) for the bias)
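The same counts can be verified in PyTorch; a minimal sketch for the \(7 \times 7\) greyscale example (no padding, stride 1):

```python
import torch
import torch.nn as nn

# 1 input channel, 1 output channel, 3x3 kernel, no padding.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

x = torch.randn(1, 1, 7, 7)    # (batch, channels, height, width)
y = conv(x)
print(y.shape)                                     # torch.Size([1, 1, 5, 5]) -> 25 hidden units
print(sum(p.numel() for p in conv.parameters()))   # 10 = 3*3 weights + 1 bias
```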
What if we have a coloured image?
What if we want to compute multiple features?
The kernel becomes a 3-dimensional tensor!
In this example, the kernel has size \(3 \times 3 \times 3\)
Colour input image: \(3 \times 7 \times 7\)
Convolution kernel: \(3 \times 3 \times 3\)
Questions:
Input image: \(3 \times 32 \times 32\)
Convolution kernel: \(3 \times 3 \times 3\)
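The questions themselves are on the slide, but the natural ones (output size and number of weights) can be checked with a sketch like this, again assuming no padding and stride 1:

```python
import torch
import torch.nn as nn

# 3 input channels (colour), 1 output feature map, 3x3 spatial kernel.
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)

x = torch.randn(1, 3, 32, 32)
y = conv(x)
print(y.shape)                                     # torch.Size([1, 1, 30, 30])
print(sum(p.numel() for p in conv.parameters()))   # 28 = 3*3*3 weights + 1 bias
```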
Q: What if we want to detect many features of the input? (i.e. both horizontal edges and vertical edges, and maybe even other features?)
A: Have many convolutional filters!
Input image: \(3 \times 7\times 7\)
Convolution kernel: \(3 \times 3 \times 3 \times 5\)
Q:
Input image of size \(3 \times 32 \times 32\)
Convolution kernel of \(3 \times 3 \times 3 \times 5\)
Input features: \(5 \times 32 \times 32\)
Convolution kernel: \(5 \times 3 \times 3 \times 10\)
Questions:
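The questions are on the original slide, but a sketch shows how the channel dimensions chain across the two layers. Note that `padding=1` is an assumption made here so that the spatial size stays \(32 \times 32\), matching the \(5 \times 32 \times 32\) feature size above:

```python
import torch
import torch.nn as nn

# padding=1 ("same" padding for a 3x3 kernel) keeps the spatial size at 32 x 32.
conv1 = nn.Conv2d(3, 5, kernel_size=3, padding=1)    # kernel 3 x 3 x 3 x 5
conv2 = nn.Conv2d(5, 10, kernel_size=3, padding=1)   # kernel 5 x 3 x 3 x 10

x = torch.randn(1, 3, 32, 32)
h = conv1(x)
print(h.shape)   # torch.Size([1, 5, 32, 32]) -- the "input features" above
y = conv2(h)
print(y.shape)   # torch.Size([1, 10, 32, 32])
```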
In a neural network with fully-connected layers, we reduced the number of units in each hidden layer
Q: Why?
Q: How can we consolidate information in a neural network with convolutional layers?
Idea: take the maximum value in each \(2 \times 2\) grid.
We can add a max-pooling layer after each convolutional layer
More recently, people have been doing away with pooling operations and using strided convolutions instead:
Shift the kernel by 2 (stride=2) when computing the next output feature.
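Both options can be compared directly; a minimal sketch showing that a \(2 \times 2\) max-pool after a convolution and a single stride-2 convolution both halve the spatial resolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 5, 32, 32)

# Option 1: convolution followed by 2x2 max-pooling.
conv = nn.Conv2d(5, 10, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)
print(pool(conv(x)).shape)   # torch.Size([1, 10, 16, 16])

# Option 2: a single strided convolution (stride=2).
strided = nn.Conv2d(5, 10, kernel_size=3, padding=1, stride=2)
print(strided(x).shape)      # torch.Size([1, 10, 16, 16])
```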
With backprop, of course!
Recall what we need to do. Backprop is a message passing procedure, where each layer knows how to pass messages backwards through the computation graph. Let’s determine the updates for convolution layers.
The only new ingredient is: how do we do backprop with tied weights?
Consider the computation graph for the inputs:
Each input unit influences all the output units that have it within their receptive fields. Using the multivariate Chain Rule, we need to sum together the derivative terms for all these edges.
Consider the computation graph for the weights:
Each of the weights affects all the output units for the corresponding input and output feature maps.
The formula for the convolution layer for 1-D signals:
\[ y_{i,t} = \sum_{j=1}^{J} \sum_{\tau = -R}^{R} w_{i,j,\tau} \, x_{j, t + \tau}. \]
We compute the derivatives, which requires summing over all spatial locations:
\[\begin{align*} \overline{w_{i,j,\tau}} &= \sum_{t} \overline{y_{i,t}} \frac{\partial y_{i,t}}{\partial w_{i,j,\tau}} \\ &= \sum_{t} \overline{y_{i,t}} \, x_{j, t + \tau} \end{align*}\]
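As a check, this sketch verifies the formula against PyTorch's autograd on a 1-D convolution. Note that `F.conv1d` indexes the kernel from \(0\) to \(K-1\) rather than from \(-R\) to \(R\); otherwise it computes exactly the sum above:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
J, I, T, K = 2, 3, 10, 3      # input maps, output maps, signal length, kernel size
x = torch.randn(1, J, T)      # (batch, input maps, time)
w = torch.randn(I, J, K, requires_grad=True)

y = F.conv1d(x, w)            # y[0, i, t] = sum_j sum_k w[i, j, k] * x[0, j, t + k]
y_bar = torch.randn_like(y)   # stand-in for the error signal dL/dy
y.backward(y_bar)             # autograd fills in w.grad

# The formula above: w_bar[i, j, k] = sum_t y_bar[i, t] * x[j, t + k]
w_bar = torch.zeros_like(w)
for i in range(I):
    for j in range(J):
        for k in range(K):
            for t in range(y.shape[-1]):
                w_bar[i, j, k] += y_bar[0, i, t] * x[0, j, t + k]

print(torch.allclose(w.grad, w_bar, atol=1e-5))  # True
```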
Object recognition is the task of identifying which object category is present in an image.
It’s challenging because objects can differ widely in position, size, shape, appearance, etc., and we have to deal with occlusions, lighting changes, etc.
Why we care
Used for: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual benchmark competition for object recognition algorithms
Design Decisions
Size: 1.2 million full-sized images for the ILSVRC
Source: Results from image search engines, hand-labeled by Mechanical Turkers
Normalization: None, although the contestants are free to do preprocessing
| Year | Model | Top-5 error |
|---|---|---|
| 2010 | Hand-designed descriptors + SVM | 28.2% |
| 2011 | Compressed Fisher Vectors + SVM | 25.8% |
| 2012 | AlexNet | 16.4% |
| 2013 | A variant of AlexNet | 11.7% |
| 2014 | GoogLeNet | 6.6% |
| 2015 | Deep residual nets | 4.5% |
DenseNets use the same idea as ResNet blocks, but instead of addition, \(f(x) = x + g(x)\), they use concatenation, \(f(x) = [x, g(x)]\).
See https://d2l.ai/chapter_convolutional-modern/densenet.html
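A minimal sketch of the concatenation idea (a simplified dense block for illustration, not the exact architecture from the link):

```python
import torch
import torch.nn as nn

class ConcatBlock(nn.Module):
    """f(x) = [x, g(x)]: concatenate the input with the block's output
    along the channel dimension, instead of adding them as in ResNet."""
    def __init__(self, in_channels, growth):
        super().__init__()
        self.g = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),
            nn.Conv2d(in_channels, growth, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.cat([x, self.g(x)], dim=1)

x = torch.randn(1, 8, 32, 32)
block = ConcatBlock(8, growth=4)
print(block(x).shape)   # torch.Size([1, 12, 32, 32]) -- channels grow by 4
```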