Exercise 1 - Eigenvalues and Eigenvectors

You are given the following set of eigenvalues and eigenvectors. Compute the corresponding matrix.

\(\la_1 = 1\), \(\la_2 = 2\), \(\fv_1 = (\sqrt{0.5}, \sqrt{0.5})^\top\), \(\fv_2 = (\sqrt{0.5},-\sqrt{0.5})^\top\).

Solution

First, recall that eigenvectors of a symmetric matrix belonging to distinct eigenvalues are orthogonal; once normalized to unit length, they are orthonormal. Thus, we have \[ \fv_i^\top \fv_j = \begin{cases} 1 & i=j \\ 0 & i\neq j \end{cases}. \]

Second, for symmetric \(\fA\), its spectral decomposition is given by \(\fA = \fQ \fLa \fQ^\top\), where the columns of \(\fQ\) are orthonormal eigenvectors and \(\fLa\) is the diagonal matrix of the corresponding eigenvalues, in the same order.

In our case, the eigenvectors are already normalized and orthogonal, so we can simply write \(\fQ = (\fv_1, \fv_2)\) and \(\fLa = \diag(\la_1, \la_2)\). Then, we have \[ \fA = \fQ \fLa \fQ^\top = \la_1 \fv_1 \fv_1^\top + \la_2 \fv_2 \fv_2^\top = \bpmat 1.5 & -0.5 \\ -0.5 & 1.5 \epmat. \]
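
The result can also be checked numerically. The following is a minimal sketch in plain Python that sums the rank-one terms \(\la_i \fv_i \fv_i^\top\), which is equivalent to \(\fQ \fLa \fQ^\top\) for orthonormal eigenvectors. The helper name `reconstruct` is introduced here for illustration only.

```python
import math

# Eigenpairs from the exercise statement.
eigenvalues = [1.0, 2.0]
eigenvectors = [
    (math.sqrt(0.5), math.sqrt(0.5)),
    (math.sqrt(0.5), -math.sqrt(0.5)),
]

def reconstruct(eigenvalues, eigenvectors):
    """Return sum_i lambda_i * v_i v_i^T as a nested list (hypothetical helper)."""
    n = len(eigenvectors[0])
    A = [[0.0] * n for _ in range(n)]
    for lam, v in zip(eigenvalues, eigenvectors):
        for r in range(n):
            for c in range(n):
                A[r][c] += lam * v[r] * v[c]
    return A

A = reconstruct(eigenvalues, eigenvectors)
# Round away floating-point noise before printing.
print([[round(x, 6) for x in row] for row in A])
```

Multiplying the resulting matrix by each eigenvector and comparing against the eigenvalue times that eigenvector is a quick way to confirm the answer.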

Exercise 2 - Parameter Counting

Use PyTorch to load the alexnet model and automatically compute its number of parameters. Output the number of parameters for each layer and the total number of parameters in the model.

Solution

First, we have to load the AlexNet model, which is part of torchvision:

import torchvision
alexnet = torchvision.models.alexnet()

Counting the parameters of the entire model is the easier part: we can use the parameters() iterator, which yields every parameter tensor of the model, including those of all submodules. Each tensor's numel() method returns its number of elements, so the total is

sum(p.numel() for p in alexnet.parameters())

Obtaining the number of parameters for each layer requires looking into the source code of the AlexNet model. The structure is similar to VGG, and the forward pass looks like this:

x = self.features(x)
x = self.avgpool(x)
x = torch.flatten(x, 1)
x = self.classifier(x)

A closer look at the implementation reveals that we can obtain the parameters of the individual layers by simply iterating over the features and classifier modules (self.features and self.classifier inside the model). The avgpool module does not have any parameters. The following code snippet prints the number of parameters for each layer of the convolutional backbone; layers without parameters, such as activations and pooling layers, are reported with a count of 0:

for layer in alexnet.features:
    print(layer, sum(p.numel() for p in layer.parameters()))

To get the number of parameters in the classifier head, simply update the code snippet to iterate over alexnet.classifier instead of alexnet.features.
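
The printed counts can be cross-checked by hand. The sketch below uses the usual formulas (a convolution with bias has out_channels * in_channels * k * k + out_channels parameters, a linear layer with bias has out_features * in_features + out_features) and runs without PyTorch. The layer shapes are an assumption based on torchvision's AlexNet definition and should be compared against print(alexnet) for the installed version; the helper names are illustrative.

```python
# Layer shapes are assumed from torchvision's AlexNet definition;
# compare them against print(alexnet) for the installed version.
conv_layers = [  # (in_channels, out_channels, kernel_size)
    (3, 64, 11), (64, 192, 5), (192, 384, 3), (384, 256, 3), (256, 256, 3),
]
linear_layers = [  # (in_features, out_features)
    (256 * 6 * 6, 4096), (4096, 4096), (4096, 1000),
]

def conv_params(c_in, c_out, k):
    """Weights plus one bias per output channel."""
    return c_out * c_in * k * k + c_out

def linear_params(n_in, n_out):
    """Weights plus one bias per output feature."""
    return n_out * n_in + n_out

conv_counts = [conv_params(*shape) for shape in conv_layers]
linear_counts = [linear_params(*shape) for shape in linear_layers]
total = sum(conv_counts) + sum(linear_counts)

print("features:", conv_counts)
print("classifier:", linear_counts)
print("total:", total)
```

If the assumed shapes match the loaded model, the total should agree with the value returned by summing numel() over alexnet.parameters().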

Exercise 3 - Convolutional Layers

Consider the following \(4\times 4 \times 1\) input \(X\) and a \(2\times 2 \times 1\) convolutional kernel \(K\) with no bias term:

\[ X = \bpmat 1 & 0 & 1 & -1 \\ 1 & 0 & 1 & 0 \\ 0 & 3 & 0 & 1 \\ 1 & -1 & 0 & 1 \epmat, \qquad K = \bpmat 1 & 2 \\ 0 & 1 \epmat \]

  1. What is the output of the convolutional layer for the case of stride 1 and no padding?

  2. What if we have stride 2 and no padding?

  3. What if we have stride 2 and zero-padding of size 1?

Solution

  1. Here, we simply apply the convolutional kernel over each \(2\times 2\) patch of the input. There are 9 such patches. The output \(Y\) is then

\[ Y = \bpmat 1 & 3 & -1 \\ 4 & 2 & 2 \\ 5 & 3 & 3 \epmat \]

  2. Same idea, except that we move the kernel by two positions at a time and thereby skip every other patch, which leaves only 4 patches. The output \(Y\) is then

\[ Y = \bpmat 1 & -1 \\ 5 & 3 \epmat \]

  3. Now, zeros are added on each side of the input. The resulting \(6\times 6\) padded input \(X_\mathrm{padded}\) and corresponding output \(Y\) are

\[ X_\mathrm{padded} = \bpmat 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & -1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 3 & 0 & 1 & 0 \\ 0 & 1 & -1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \epmat, \qquad Y = \bpmat 1 & 1 & 0 \\ 2 & 2 & 0 \\ 2 & -1 & 1 \epmat \]
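
The three results can be reproduced with a short script. This is a minimal sketch in plain Python, assuming the usual deep-learning convention in which the kernel is applied without flipping (cross-correlation), which is what the outputs above correspond to. The function name `conv2d` and its arguments are chosen here for illustration.

```python
def conv2d(x, k, stride=1, padding=0):
    """Apply kernel k to the 2D list x without flipping and without bias."""
    if padding > 0:
        width = len(x[0]) + 2 * padding
        # Zero rows above and below, zero columns left and right.
        x = ([[0] * width for _ in range(padding)]
             + [[0] * padding + list(row) + [0] * padding for row in x]
             + [[0] * width for _ in range(padding)])
    kh, kw = len(k), len(k[0])
    out_h = (len(x) - kh) // stride + 1
    out_w = (len(x[0]) - kw) // stride + 1
    return [
        [
            sum(x[i * stride + a][j * stride + b] * k[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

X = [[1, 0, 1, -1], [1, 0, 1, 0], [0, 3, 0, 1], [1, -1, 0, 1]]
K = [[1, 2], [0, 1]]

print(conv2d(X, K))                       # stride 1, no padding
print(conv2d(X, K, stride=2))             # stride 2, no padding
print(conv2d(X, K, stride=2, padding=1))  # stride 2, zero-padding of size 1
```

The output size along each axis follows from (input size + 2 * padding - kernel size) // stride + 1, which gives 3, 2, and 3 for the three cases.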

Exercise 4 - Scaled Dot-Product Attention

Consider the matrices \(Q\), \(K\), \(V\) given by \[ Q = \bpmat 1 & 3\\ 0 & 1 \epmat,\quad K = \bpmat 1 & 1\\ 1 & 2\\ 0 & 1 \epmat,\quad V=\bpmat 1 & 0 & -2\\ 2 & 1 & 2 \\ 0 & 3 & -1 \epmat. \] Compute the context matrix \(C\) using scaled dot-product attention.

Solution

The resulting context matrix is given by: \[ C\approx \bpmat 1.80 & 1.00 & 1.44\\ 1.26 & 1.25 & 0.26 \epmat \] A simple implementation would look as follows:

import torch
# Matrices from the exercise statement.
Q = torch.tensor([[1, 3], [0, 1]]).float()
K = torch.tensor([[1, 1], [1, 2], [0, 1]]).float()
V = torch.tensor([[1, 0, -2], [2, 1, 2], [0, 3, -1]]).float()
d_k = K.shape[1]  # key dimension, here 2
M = torch.matmul(Q, K.transpose(0, 1)) / d_k ** 0.5  # scaled scores
S = torch.softmax(M, dim=1)  # each row of attention weights sums to 1
C = torch.matmul(S, V)  # context matrix
print(C)

PyTorch (version 2.0 and later) also provides a built-in function for scaled dot-product attention:

import torch.nn.functional as F
F.scaled_dot_product_attention(Q, K, V)
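
For checking the result by hand, the intermediate steps can be written out. This assumes the standard definition of scaled dot-product attention with \(d_k = 2\), the number of columns of \(K\), and a softmax applied to each row; the attention weights are rounded to two decimals:

\[ C = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V, \qquad QK^\top = \bpmat 4 & 7 & 3 \\ 1 & 2 & 1 \epmat, \qquad \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{2}}\right) \approx \bpmat 0.10 & 0.85 & 0.05 \\ 0.25 & 0.50 & 0.25 \epmat \]

Each row of the weight matrix sums to 1, and multiplying the unrounded weights by \(V\) yields the context matrix \(C\) stated in the solution.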