Lecture 7
Because of downsampling (pooling and use of strides), higher-layer filters “cover” a larger region of the input than equal-sized filters in the lower layers.
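As a rough back-of-the-envelope check: the receptive field \(r_l\) of a unit in layer \(l\) grows as
\[\begin{align*} r_l = r_{l-1} + (k_l - 1)\, j_{l-1}, \qquad j_l = j_{l-1}\, s_l \end{align*}\]
where \(k_l\) is the kernel size, \(s_l\) the stride, and \(j_l\) the cumulative stride (“jump”). For example, a 3×3 conv, a 2×2 max-pool with stride 2, and another 3×3 conv give receptive fields of 3, 4, and 8 pixels, so a unit in the second conv layer already “sees” an 8×8 patch of the input.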
Transfer learning is the idea of taking weights/features trained on one task and reusing them on another task.
We already saw the idea of transfer learning in project 2:
Practitioners rarely train a CNN “from scratch”. Instead we could:
What we want you to know:
Most of these networks have fully connected layers at the very end.
Idea: instead of fully connected layers, we could…
This is more frequently done on pixel-wise prediction problems, which we’ll see later in this course.
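Returning to transfer learning: a minimal sketch of the workflow in PyTorch (assuming torchvision’s pretrained ResNet-18; `num_classes` and the optimizer settings are placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Option 1: freeze the pretrained weights and train only a new head.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with one sized for the new task.
num_classes = 10   # placeholder for the target dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are optimized.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# Option 2 (fine-tuning): skip the freezing step and train the whole
# network with a small learning rate instead.
```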
Convolutional neural networks are successful, but how do we know that the network has learned useful patterns from the training set?
Interpretation of deep learning models is a challenge, but there are some tricks we can use to interpret CNN models
Recall: we can understand what first-layer features in an MLP are doing by visualizing the weight matrices (left)
We can do the same thing with convolutional networks (right)
But what about higher-level features?
One approach: pick the images in the training set which activate a unit most strongly.
(Compute a forward pass for each image in the training set, track when a feature was most active, and look for the portion of the image that led to that activation)
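A rough PyTorch sketch of this procedure (the model, dataset, layer, and channel index below are all placeholders; a forward hook records the chosen feature map’s activation):

```python
import torch

activations = {}
channel_idx = 7   # which feature map to inspect (placeholder)

def hook(module, inputs, output):
    # Record the mean activation of one channel for the current image.
    activations["value"] = output[:, channel_idx].mean()

handle = model.layer2.register_forward_hook(hook)   # placeholder layer

best_score, best_image = float("-inf"), None
model.eval()
with torch.no_grad():
    for image, _ in dataset:              # iterate over the training set
        model(image.unsqueeze(0))         # forward pass triggers the hook
        score = activations["value"].item()
        if score > best_score:
            best_score, best_image = score, image

handle.remove()
# best_image most strongly activates this channel; to localize the responsible
# region, crop the patch inside the unit's receptive field.
```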
Here is the visualization for layer 1:
Higher layers seem to pick up more abstract, high-level information.
Problem: we can’t tell what the unit is actually responding to in the image!
Maybe we can use input gradients?
Recall this computation graph:
From this graph, we could compute \(\frac{\partial L}{\partial x}\) – the model’s sensitivity with respect to the input.
(We’ve never done this because there hasn’t been a need to—until now!)
Input gradients can be noisy and hard to interpret
Take a good object recognition conv net and compute the gradient of \(\log\left(p(y = \text{"deer"}|{\bf x})\right)\)
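A minimal saliency-map sketch in PyTorch (the trained classifier `model`, the input `x`, and the class index are placeholders; we differentiate the log-probability of the chosen class with respect to the input pixels):

```python
import torch
import torch.nn.functional as F

# Placeholders: `model` is a trained classifier and `x` is an input image
# of shape (1, 3, H, W); target_class is the class of interest (e.g. "deer").
target_class = 5
x = x.detach().clone().requires_grad_(True)

log_prob = F.log_softmax(model(x), dim=1)[0, target_class]
log_prob.backward()                       # populates x.grad

# Saliency map: gradient magnitude, max over the color channels.
saliency = x.grad.abs().max(dim=1)[0]     # shape (1, H, W)
```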
Several methods modify these gradients:
From: https://proceedings.neurips.cc/paper/2018/file/294a8ed24b1ad22ec2e7efea049b8737-Paper.pdf
Can we use gradient ascent on an image to maximize the activation of a given neuron?
Requires a few tricks to make this work; see https://distill.pub/2017/feature-visualization/
Similar idea:
This will accentuate whatever features of an image already kind of resemble the object (link).
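A bare-bones version of this gradient-ascent idea in PyTorch (the model and target unit are placeholders; the regularization tricks from the Distill article are omitted):

```python
import torch

# Placeholders: a trained CNN `model` and the index of the unit/class
# whose activation we want to maximize.
target_unit = 130
for p in model.parameters():
    p.requires_grad_(False)               # only the image is optimized

image = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    activation = model(image)[0, target_unit]
    (-activation).backward()              # minimizing the negative = ascent
    optimizer.step()

# To get DeepDream-style results, start from a real photo instead of noise;
# without jitter/blur regularizers the images tend to look noisy.
```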
Producing adversarial images: Given an image for one category (e.g. panda), compute the image gradient to maximize the network’s output unit for a different category (e.g. gibbon)
Goal: Choose a small perturbation \(\epsilon\) on an image \(x\) so that a neural network \(\, f\) misclassifies \(\, x + \epsilon\).
Approach:
Use the same optimization process to choose \(\epsilon\) to minimize the probability that
\[f(x + \epsilon) = \text{correct class}\]
Targeted attack: maximize the probability that \(f(x + \epsilon) =\) target incorrect class
Non-targeted attack: minimize the probability that \(f(x + \epsilon) =\) correct class
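For example, a single-step (FGSM-style) sketch of both attacks in PyTorch (the model, image `x`, label `y`, target class, and budget `eps` are placeholders):

```python
import torch
import torch.nn.functional as F

eps = 0.01                                 # perturbation budget (placeholder)

# Non-targeted: increase the loss on the correct label y.
x_adv = x.detach().clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), y)
loss.backward()
x_adv = (x_adv + eps * x_adv.grad.sign()).detach()

# Targeted: decrease the loss on a chosen target class instead.
target = torch.tensor([368])               # target class index (placeholder, e.g. "gibbon")
x_tgt = x.detach().clone().requires_grad_(True)
loss = F.cross_entropy(model(x_tgt), target)
loss.backward()
x_tgt = (x_tgt - eps * x_tgt.grad.sign()).detach()
```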
Demo time!
Adversarial examples transfer to different networks trained on a totally separate training set!
White-box Adversarial Attack: Model architecture and weights are known, so we can compute gradients. (What we’ve been doing so far in the demos)
Black-box Adversarial Attack: Model architecture and weights are unknown.
Attacks have been carried out against proprietary classification networks accessed through prediction APIs (MetaMind, Amazon, Google)
It is possible to have a 3D object that gets misclassified by a neural network from all angles.
It is possible for a printed image to cause object detection to fail.
Let’s suppose we have a training set \(D=\{(x_1,\, y_1),\, \ldots,\, (x_N,\, y_N)\}\)
We typically solve the following problem on the training data:
\[\begin{align*} \hat{\theta} = \text{argmin}_{\theta} \frac{1}{N} \left[\sum_{i=1}^{N} L\left(x_i,\, y_i;\, \theta\right) \right] \end{align*}\]
Now suppose we upweight a single point \((x,y)\) by a small amount \(\epsilon\) in this objective:
\[\begin{align*} \hat{\theta}({\epsilon}) = \text{argmin}_{\theta} \frac{1}{N} \left[ \sum_{i=1}^{N} L(x_i, y_i; \theta) \right] + \epsilon L(x,y; \theta) \end{align*}\]
For small \(\epsilon\), we can linearize the new optimum around \(\hat{\theta}\):
\[\begin{align*} \hat{\theta}({\epsilon}) \approx \hat{\theta} + \epsilon\frac{d\hat{\theta}(\epsilon)}{d\epsilon} {\Bigr |}_{\epsilon=0} \end{align*}\]
The derivative \(\frac{d\hat{\theta}(\epsilon)}{d\epsilon}{\Bigr |}_{\epsilon=0}\) is called the influence of point \((x,y)\) on the optimum. We denote it as \(\mathcal{I}(x,y)\).
How do we compute it?
Let’s denote \(R(\theta)=\frac{1}{N}\sum_{i=1}^{N}L(x_i, y_i; \theta)\)
Since \(\hat{\theta}({\epsilon}) = \text{argmin}_{\theta} \left[ R(\theta) + \epsilon L(x,y; \theta) \right]\), the gradient of this objective must vanish at \(\theta = \hat{\theta}(\epsilon)\):
\[\begin{align*} 0 = \left[ \nabla_{\theta}R(\theta) + \epsilon \nabla_{\theta}L(x,y; \theta) \right] {\Bigr |}_{\theta=\hat{\theta}(\epsilon)} \end{align*}\]
Taylor expanding this condition around \(\theta = \hat{\theta}\):
\[\begin{multline*} 0 \approx \nabla_{\theta}R(\hat{\theta}) + \epsilon \nabla_{\theta}L(x,y; \hat{\theta})\\ + \left[ \nabla_{\theta}^2 R(\hat{\theta}) + \epsilon \nabla_{\theta}^2 L(x,y; \hat{\theta}) \right](\hat{\theta}(\epsilon) - \hat{\theta}) \end{multline*}\]
Since \(\hat{\theta}\) minimizes \(R\), we have \(\nabla_{\theta}R(\hat{\theta}) = 0\), so solving for \(\hat{\theta}(\epsilon) - \hat{\theta}\):
\[\begin{align*} \hat{\theta}(\epsilon) - \hat{\theta} \approx -\left[ \nabla_{\theta}^2 R(\hat{\theta}) + \epsilon \nabla_{\theta}^2 L(x,y; \hat{\theta}) \right]^{-1} \nabla_{\theta}L(x,y; \hat{\theta})\,\epsilon \end{align*}\]
\[\begin{align*} \frac{\hat{\theta}(\epsilon) - \hat{\theta}}{\epsilon} \approx -\left[ \nabla_{\theta}^2 R(\hat{\theta}) + \epsilon \nabla_{\theta}^2 L(x,y; \hat{\theta}) \right]^{-1} \nabla_{\theta}L(x,y; \hat{\theta}) \end{align*}\]
Taking the limit \(\epsilon \to 0\):
\[\begin{align*} \frac{d \hat{\theta}(\epsilon)}{d \epsilon} {\Bigr |}_{\epsilon=0} \approx -\left[ \nabla_{\theta}^2 R(\hat{\theta}) \right]^{-1} \nabla_{\theta}L(x,y; \hat{\theta}) \end{align*}\]
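A small numerical sketch of this formula (practical only when the model is small enough to form the Hessian explicitly; `theta_hat`, `train_risk`, and `point_loss` below are placeholder names):

```python
import torch
from torch.autograd.functional import hessian

# Placeholders: theta_hat is the flattened parameter vector at the optimum,
# train_risk(theta) returns R(theta) over the training set, and
# point_loss(theta) returns L(x, y; theta) for the single point (x, y).
theta_hat = theta_hat.detach().requires_grad_(True)

H = hessian(train_risk, theta_hat)                            # Hessian of R at theta_hat
g = torch.autograd.grad(point_loss(theta_hat), theta_hat)[0]  # gradient of L at theta_hat

# Influence of (x, y) on the optimal parameters.
influence = -torch.linalg.solve(H, g)
```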
Because we can compute the sensitivity of the optimal weights to a training point, we can also compute the sensitivity of the test loss to a training point!
Consider a test point \((u,v)\), a training point \((x,y)\), and the test loss \(L(u,v; \hat{\theta})\)
How sensitive is the test loss \(L(u,v; \hat{\theta}(\epsilon))\) to upweighting the training point \((x,y)\) by \(\epsilon\)?
Using the chain rule:
\[\begin{align*} \frac{d L(u,v; \hat{\theta}(\epsilon))}{d \epsilon} {\Bigr |}_{\epsilon=0} = \frac{d L(u,v; \theta)}{d \theta}{\Bigr |}_{\theta=\hat{\theta}} \frac{d \hat{\theta}(\epsilon)}{d \epsilon}{\Bigr |}_{\epsilon=0} \end{align*}\]
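Substituting the influence expression derived above:
\[\begin{align*} \frac{d L(u,v; \hat{\theta}(\epsilon))}{d \epsilon} {\Bigr |}_{\epsilon=0} \approx -\nabla_{\theta}L(u,v; \hat{\theta})^{\top} \left[ \nabla_{\theta}^2 R(\hat{\theta}) \right]^{-1} \nabla_{\theta}L(x,y; \hat{\theta}) \end{align*}\]
This quantity tells us how much the loss on the test point \((u,v)\) would change if we slightly upweighted the training point \((x,y)\).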