The exercises this week revisit some older material so you can check your learning and understanding.

Exercise 1 - Maximum Likelihood Estimator

Assume you are given datapoints \((x_i)_{i=1}^N\) with \(x_i \geq 0\) coming from an exponential distribution. The probability density function of an exponential distribution with rate parameter \(\lambda > 0\) is given by \(f(x) = \lambda \exp(-\lambda x)\) for \(x \geq 0\). Derive the maximum likelihood estimator of the parameter \(\lambda\).

Solution

First, let’s quickly remember that the maximum likelihood estimator (MLE) of a probability distribution from datapoints \(x_1, \ldots, x_N\) is given by \[ \hat{\theta}_{\mathrm{MLE}} = \text{argmax}_{\theta \in \Theta} \prod_{i=1}^N f(x_i | \theta), \] where \(f\) is the probability density function of the considered probability distribution family, \(\theta\) are the parameters of the distribution, and \(\Theta\) is the parameter space (a set containing all possible parameters).

As mentioned in our previous exercise, we usually work with the log-likelihood in practice. In this particular case, the log-likelihood is given by \[\begin{aligned} l(\lambda | x_1, \ldots, x_N) & := \sum_{i=1}^N \ln f(x_i | \lambda) \\ & = \sum_{i=1}^N \ln\left(\lambda \exp(-\lambda x_i) \right) \\ & = \sum_{i=1}^N \ln(\lambda) + \ln\left( \exp(-\lambda x_i) \right) \\ & = N \ln(\lambda) - \sum_{i=1}^N \lambda x_i . \end{aligned}\] The derivative with respect to \(\lambda\) is \[ \frac{\partial l(\lambda | x_1, \ldots, x_N)}{\partial \lambda} = \frac{N}{\lambda} - \sum_{i=1}^N x_i . \] Setting this derivative to 0 and solving for \(\lambda\) yields the MLE \[ \hat{\lambda}_{\mathrm{MLE}} = N \left( \sum_{i=1}^N x_i\right)^{-1}, \] i.e. the reciprocal of the sample mean. Since the second derivative \(-N/\lambda^2\) is negative for every \(\lambda > 0\), this stationary point is indeed a maximum.
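
As a quick numerical sanity check, the following sketch (assuming NumPy; the rate and sample size are arbitrary illustrative choices) draws samples from an exponential distribution and compares the estimator \(N \left(\sum_{i=1}^N x_i\right)^{-1}\) with the true rate; the two should agree increasingly well as \(N\) grows.

import numpy as np

# Draw samples from an exponential distribution with a known rate and
# compare the MLE N / sum(x_i) against it. The rate 2.5 and the sample
# size are arbitrary illustrative choices.
rng = np.random.default_rng(0)
true_rate = 2.5
x = rng.exponential(scale=1.0 / true_rate, size=100_000)  # scale = 1 / lambda

lambda_mle = len(x) / x.sum()
print(f"true rate: {true_rate}, MLE estimate: {lambda_mle:.4f}")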

Exercise 2 - Convolutional Layers

Consider the following \(4\times 4 \times 1\) input \(X\) and the \(2\times 2 \times 1\) convolutional kernel \(K\) with no bias term:

\[ X = \begin{pmatrix} 1 & 2 & -1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 2 \\ 2 & 1 & 0 & -1 \end{pmatrix}, \qquad K = \begin{pmatrix} 1 & 0 \\ 2 & 1 \end{pmatrix} \]

  1. What is the output of the convolutional layer for the case of stride 1 and no padding?

  2. What if we have stride 2 and no padding?

  3. What if we have stride 2 and zero-padding of size 1?

Solution

  1. Here, we simply slide the kernel over the input: for each \(2\times 2\) patch we compute the element-wise product with \(K\) and sum the result. With stride 1 and no padding there are \(3\times 3 = 9\) such patches. The output \(Y\) is then

\[ Y = \begin{pmatrix} 3 & 3 & 1 \\ 2 & 2 & 3 \\ 5 & 3 & -1 \end{pmatrix} \]

  2. Same idea, except that with stride 2 we only evaluate every other position in each direction, resulting in only 4 patches. The output \(Y\) is then

\[ Y = \begin{pmatrix} 3 & 1 \\ 5 & -1 \end{pmatrix} \]

  3. Now, we pad the input with a border of zeros on each side. The resulting \(6\times 6\) padded input \(X_\mathrm{padded}\) and the corresponding output \(Y\) are

\[ X_\mathrm{padded} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 2 & -1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 2 & 0 \\ 0 & 2 & 1 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad Y = \begin{pmatrix} 1 & 3 & 2 \\ 0 & 2 & 4 \\ 0 & 1 & -1 \end{pmatrix} \]
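
These results can be checked with PyTorch's torch.nn.functional.conv2d, which implements exactly the cross-correlation convention used above. The sketch below reproduces all three cases; the tensor values are taken from the exercise, the rest is standard PyTorch usage.

import torch
import torch.nn.functional as F

# Input and kernel from the exercise, reshaped to the layouts conv2d expects:
# (batch, channels, height, width) and (out_channels, in_channels, height, width).
X = torch.tensor([[1., 2., -1., 1.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 2.],
                  [2., 1., 0., -1.]]).reshape(1, 1, 4, 4)
K = torch.tensor([[1., 0.],
                  [2., 1.]]).reshape(1, 1, 2, 2)

print(F.conv2d(X, K, stride=1, padding=0))  # case 1: stride 1, no padding
print(F.conv2d(X, K, stride=2, padding=0))  # case 2: stride 2, no padding
print(F.conv2d(X, K, stride=2, padding=1))  # case 3: stride 2, zero-padding of size 1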

Exercise 3 - Computational Parameter Counting

Use PyTorch to load the vgg11 model and automatically compute its number of parameters. Output the number of parameters for each layer and the total number of parameters in the model.

Solution

First, we have to load the vgg11 model, which is part of torchvision, as shown in the lecture:

import torchvision
vgg11 = torchvision.models.vgg.vgg11(pretrained=False)
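
Note that more recent torchvision releases deprecate the pretrained flag in favour of a weights argument; there, the equivalent call is torchvision.models.vgg11(weights=None).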

The number of parameters for the entire model is the easier part: we can simply use the parameters() iterator, which yields all parameters of the model. Their sizes can then be counted with the numel() method, resulting in

sum(p.numel() for p in vgg11.parameters())

Obtaining the number of parameters for each of the layers requires looking into the source code of the vgg11 model. All VGG models are ultimately instances of the VGG class. Its forward pass looks like this:

x = self.features(x)
x = self.avgpool(x)
x = torch.flatten(x, 1)
x = self.classifier(x)

A closer look at the implementation reveals that we can obtain the parameters of the individual layers by simply iterating over the self.features and self.classifier modules. The self.avgpool module does not have any parameters. The following code snippet shows how to obtain the number of parameters for each layer of the convolutional backbone:

for layer in vgg11.features:
    print(layer, sum(p.numel() for p in layer.parameters()))

To get the number of parameters in the classifier head, simply update the code snippet to iterate over vgg11.classifier instead of vgg11.features.
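
Putting the pieces together, one possible way (a sketch, not the only valid solution) to print the per-layer counts for both parts of the network together with the total is:

# Per-layer parameter counts for the convolutional backbone and the
# classifier head, followed by the total count over the whole model.
for name, module in [("features", vgg11.features), ("classifier", vgg11.classifier)]:
    for i, layer in enumerate(module):
        n_params = sum(p.numel() for p in layer.parameters())
        print(f"{name}[{i}]: {layer} -> {n_params} parameters")

print("total:", sum(p.numel() for p in vgg11.parameters()))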

Exercise 4 - Influence Functions

Let \(\hat{\theta}\) and \(\hat{\theta}(\epsilon)\) be as defined in class. Show that the first-order Taylor expansion of \(\hat{\theta}(\epsilon)\) around \(\epsilon=0\) is given by the equation given in class, i.e. by \[\begin{align*} \hat{\theta}({\epsilon}) \approx \hat{\theta} + \epsilon\frac{d\hat{\theta}(\epsilon)}{d\epsilon} {\Bigr |}_{\epsilon=0} . \end{align*}\]

Solution

First, let’s recall the definitions of \(\hat{\theta}\) and \(\hat{\theta}(\epsilon)\): \[\begin{align*} \hat{\theta} &= \text{argmin}_{\theta} \frac{1}{N} \left[ \sum_{i=1}^{N} L(x_i, y_i; \theta) \right] \\ \hat{\theta}({\epsilon}) &= \text{argmin}_{\theta} \frac{1}{N} \left[ \sum_{i=1}^{N} L(x_i, y_i; \theta) \right] + \epsilon L(x,y; \theta) \end{align*}\] The first-order Taylor series expansion around \(\epsilon = 0\) is given by \[ \hat{\theta}(0) + \epsilon \frac{d\hat{\theta}(\epsilon)}{d\epsilon} {\Bigr |}_{\epsilon=0} . \] For \(\epsilon = 0\) the additional term \(\epsilon L(x, y; \theta)\) vanishes, so both minimization problems coincide and \(\hat{\theta}(0) = \hat{\theta}\), which completes the proof.
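
To build some intuition, the sketch below uses a toy one-dimensional squared-loss example (chosen here for illustration because \(\hat{\theta}(\epsilon)\) then has a closed form) and compares the exact upweighted minimizer with its first-order Taylor approximation for a small \(\epsilon\).

import numpy as np

# Toy example with L(y_i; theta) = (theta - y_i)^2: the minimizer is the mean,
# and the upweighted minimizer has the closed form
#   theta_hat(eps) = (mean(y) + eps * y) / (1 + eps),
# whose derivative at eps = 0 is (y - mean(y)). The first-order Taylor
# approximation is therefore theta_hat + eps * (y - mean(y)).
rng = np.random.default_rng(0)
y = rng.normal(size=50)   # training targets (arbitrary)
y_up = y[0]               # the point being upweighted
eps = 1e-3

theta_hat = y.mean()
theta_eps_exact = (theta_hat + eps * y_up) / (1 + eps)
theta_eps_taylor = theta_hat + eps * (y_up - theta_hat)

print(theta_eps_exact, theta_eps_taylor)  # should agree up to O(eps^2)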