The exercises this week revisit some earlier material so you can check your learning and understanding.
Exercise 1 - Maximum Likelihood Estimator
Assume you are given datapoints \((x_i)_{i=1}^N\) with \(x_i\in\R\) drawn from an exponential distribution. The probability density function of an exponential distribution with rate parameter \(\la > 0\) is given by \(f(x) = \la \exp(-\la x)\) for \(x \ge 0\). Derive the maximum likelihood estimator of the parameter \(\la\).
Solution
First, let’s quickly remember that the maximum likelihood estimator (MLE) of the parameters of a probability distribution, given datapoints \(\fx_1, \ldots, \fx_N\), is \[ \hte_{\mathrm{MLE}} = \argmax_{\te \in \Te} \prod_{i=1}^N f(\fx_i | \te), \] where \(f\) is the probability density function of the considered probability distribution family, \(\te\) are the parameters of the distribution, and \(\Te\) is the parameter space (the set containing all possible parameters).
As mentioned in our previous exercise, we usually work with the log-likelihood in practice. In this particular case, the log-likelihood is given by \[\begin{aligned} l(\la | x_1, \ldots, x_N) & := \sum_{i=1}^N \ln f(x_i | \la) \\ & = \sum_{i=1}^N \ln\li(\la \exp(-\la x_i) \ri) \\ & = \sum_{i=1}^N \li( \ln(\la) + \ln\li( \exp(-\la x_i) \ri) \ri) \\ & = N \ln(\la) - \la \sum_{i=1}^N x_i . \end{aligned}\] The derivative with respect to \(\la\) is \[ \fr{\partial l(\la | x_1, \ldots, x_N)}{\partial \la} = \fr{N}{\la} - \sum_{i=1}^N x_i . \] Setting this derivative to zero and solving for \(\la\) yields the MLE \[ \hla_{\mathrm{MLE}} = N \li( \sum_{i=1}^N x_i\ri)^{-1}. \] Since the second derivative \(-N/\la^2\) is negative for all \(\la > 0\), this stationary point is indeed a maximum.
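As a quick numerical sanity check of the closed-form result, we can draw samples from an exponential distribution with a known rate and compare the estimator \(N \li( \sum_{i=1}^N x_i\ri)^{-1}\) against it. The following is a minimal NumPy sketch; the sample size and the true rate are illustrative choices, not part of the exercise.

```python
# Minimal sanity check of the closed-form MLE (a sketch; the sample size and
# the true rate lambda_true = 2.0 are illustrative choices).
import numpy as np

rng = np.random.default_rng(0)
lambda_true = 2.0
x = rng.exponential(scale=1.0 / lambda_true, size=10_000)  # datapoints x_1, ..., x_N

lambda_mle = len(x) / x.sum()  # N * (sum_i x_i)^(-1), the derived estimator
print(lambda_mle)              # should be close to lambda_true
```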
Exercise 2 - Convolutional Layers
Consider the following \(4\times 4 \times 1\) input \(X\) and a \(2\times 2 \times 1\) convolutional kernel \(K\) with no bias term:
\[ X = \bpmat 1 & 2 & -1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 2 \\ 2 & 1 & 0 & -1 \epmat, \qquad K = \bpmat 1 & 0 \\ 2 & 1 \epmat \]
- What is the output of the convolutional layer for the case of stride 1 and no padding?
- What if we have stride 2 and no padding?
- What if we have stride 2 and zero-padding of size 1?
Solution
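It helps to first recall the standard formula for the spatial output size of a convolution (not stated in the exercise, but standard): for an \(n \times n\) input, kernel size \(k\), stride \(s\), and zero-padding \(p\), \[ n_{\mathrm{out}} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1 , \] which gives \(3\times 3\), \(2\times 2\), and \(3\times 3\) outputs for the three cases below.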
- Here, we simply apply the convolutional kernel over each \(2\times 2\) patch of the input; with stride 1 and no padding there are \(3\times 3 = 9\) such patches. For example, the top-left entry is \(1\cdot 1 + 2\cdot 0 + 1\cdot 2 + 0\cdot 1 = 3\). (All three results can also be checked with the PyTorch snippet at the end of this solution.) The output \(Y\) is then
\[ Y = \bpmat 3 & 3 & 1 \\ 2 & 2 & 3 \\ 5 & 3 & -1 \epmat \]
- Same idea, except that with stride 2 we move the kernel by two positions in each direction, resulting in only \(2\times 2 = 4\) patches. The output \(Y\) is then
\[ Y = \bpmat 3 & 1 \\ 5 & -1 \epmat \]
- Now, zero-padding of size 1 adds a border of zeros around the input. The resulting \(6\times 6\) padded input \(X_\mathrm{padded}\) and corresponding output \(Y\) are
\[ X_\mathrm{padded} = \bpmat 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 2 & -1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 2 & 0 \\ 0 & 2 & 1 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \epmat, \qquad Y = \bpmat 1 & 3 & 2 \\ 0 & 2 & 4 \\ 0 & 1 & -1 \epmat \]
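All three outputs can be reproduced with PyTorch’s functional convolution (which, like most deep learning frameworks, actually computes a cross-correlation). The snippet below is a small sketch; only the tensor construction and shapes are choices made here.

```python
# Sketch: verify the three cases with torch.nn.functional.conv2d.
import torch
import torch.nn.functional as F

X = torch.tensor([[1., 2., -1., 1.],
                  [1., 0.,  1., 0.],
                  [0., 1.,  0., 2.],
                  [2., 1.,  0., -1.]]).reshape(1, 1, 4, 4)  # (batch, channels, H, W)
K = torch.tensor([[1., 0.],
                  [2., 1.]]).reshape(1, 1, 2, 2)            # (out_ch, in_ch, kH, kW)

print(F.conv2d(X, K, stride=1, padding=0))  # 3x3 output of the first case
print(F.conv2d(X, K, stride=2, padding=0))  # 2x2 output of the second case
print(F.conv2d(X, K, stride=2, padding=1))  # 3x3 output of the third case
```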
Exercise 3 - Computational Parameter Counting
Use PyTorch to load the vgg11 model and automatically compute its number of parameters. Output the number of parameters for each layer and the total number of parameters in the model.
Solution
First, we have to load the vgg11 model, which is part of torchvision, as shown in the lecture:
import torchvision
vgg11 = torchvision.models.vgg.vgg11(pretrained=False)
Computing the number of parameters for the entire model is the easier part: we can simply use the parameters() iterator, which yields all parameter tensors of the model. The number of elements of each tensor can then be counted with the numel() method, resulting in
sum(p.numel() for p in vgg11.parameters())
Obtaining the number of parameters for each layer requires a look into the source code of the vgg11 model. All VGG models are ultimately instantiated through the VGG class. Its forward pass looks like this:
x = self.features(x)
x = self.avgpool(x)
x = torch.flatten(x, 1)
x = self.classifier(x)
A closer look at the implementation reveals that we can obtain the parameters of the individual layers by simply iterating over the self.features and self.classifier modules; the self.avgpool module does not have any parameters. The following code snippet shows how to obtain the number of parameters for each layer of the convolutional backbone:
for layer in vgg11.features:
    print(layer, sum(p.numel() for p in layer.parameters()))
To get the number of parameters in the classifier head, simply update the code snippet to iterate over vgg11.classifier instead of vgg11.features.
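Putting the pieces together, the following sketch prints the per-layer counts for both blocks and cross-checks their sum against the model-wide count (the loop structure and print format are illustrative choices; for the standard torchvision VGG-11 the total should come out at roughly 133 million parameters):

```python
# Sketch: per-layer and total parameter counts for VGG-11.
import torchvision

vgg11 = torchvision.models.vgg.vgg11(pretrained=False)

total = 0
for block_name, block in [("features", vgg11.features), ("classifier", vgg11.classifier)]:
    for idx, layer in enumerate(block):
        n = sum(p.numel() for p in layer.parameters())
        total += n
        print(f"{block_name}[{idx}]: {layer} -> {n} parameters")

print("sum over layers:", total)
print("whole model:    ", sum(p.numel() for p in vgg11.parameters()))  # should match
```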
Exercise 4 - Influence Functions
Let \(\hte\) and \(\hte(\ve)\) be as defined in class. Show that the first order Taylor expansion of \(\hte(\ve)\) around \(\ve=0\) is given by the equation given in class, i.e. by \[\begin{align*} \hat{\theta}({\epsilon}) \approx \hat{\theta} + \epsilon\frac{d\hat{\theta}(\epsilon)}{d\epsilon} {\Bigr |}_{\epsilon=0} . \end{align*}\]
Solution
First, let’s recall the definitions of \(\hte\) and \(\hte(\epsilon)\): \[\begin{align*} \hat{\theta} &= \text{argmin}_{\theta} \frac{1}{N} \left[ \sum_{i=1}^{N} L(x_i, y_i; \theta) \right] \\ \hte({\epsilon}) &= \text{argmin}_{\theta} \frac{1}{N} \left[ \sum_{i=1}^{N} L(x_i, y_i; \theta) \right] + \epsilon L(x,y; \theta) \end{align*}\] The first-order Taylor series expansion of \(\hte(\epsilon)\) around \(\epsilon = 0\) is given by \[ \hte(0) + \epsilon \fr{d\hte(\epsilon)}{d\epsilon} {\Bigr |}_{\epsilon=0} . \] For \(\epsilon = 0\) the perturbed objective reduces to the original training objective, so both problems share the same minimizer, i.e. \(\hte(0) = \hte\), which completes the proof.
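As an illustration of this expansion, one can check it numerically on a toy problem where \(\hte(\epsilon)\) has a closed form. The sketch below uses the squared loss \(L(x; \theta) = \tfrac{1}{2}(x - \theta)^2\), for which \(\hte = \bar{x}\) and \(\hte(\epsilon) = (\bar{x} + \epsilon x)/(1 + \epsilon)\); the data and the up-weighted point are illustrative choices, not part of the exercise.

```python
# Sketch: numerical check of the first-order Taylor expansion of theta_hat(eps)
# for L(x; theta) = 0.5 * (x - theta)**2 (data and up-weighted point are illustrative).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)   # training points x_1, ..., x_N
x_up = x[0]                # the point whose loss is up-weighted by eps
eps = 1e-3

theta_hat = x.mean()                               # minimizer for eps = 0
theta_eps = (x.mean() + eps * x_up) / (1 + eps)    # exact perturbed minimizer
d_theta = x_up - x.mean()                          # d theta_hat(eps) / d eps at eps = 0

print(theta_eps)                   # exact value
print(theta_hat + eps * d_theta)   # first-order approximation, nearly identical
```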