Exercise 1 - Maximum Likelihood Estimation Refresher

Assume you are given datapoints \((\fx_i)_{i=1}^N\) with \(\fx_i\in\R^n\) coming from a Gaussian distribution. Derive the maximum likelihood estimator of its mean.

Solution

First, let’s quickly recall that the maximum likelihood estimator (MLE) of a probability distribution from datapoints \(\fx_1, \ldots, \fx_N\) is given by \[ \hte_{\mathrm{MLE}} = \argmax_{\te \in \Te} \prod_{i=1}^N f(\fx_i | \te), \] where \(f\) is the probability density function of the considered probability distribution family, \(\te\) are the parameters of the distribution, and \(\Te\) is the parameter space (a set containing all possible parameters). The product on the right-hand side is also known as the likelihood function. In practice, we usually work with the log-likelihood function instead. Because the logarithm is monotonically increasing, the resulting estimators are the same, i.e. \[ \hte_{\mathrm{MLE}} = \argmax_{\te \in \Te} \prod_{i=1}^N f(\fx_i | \te) = \argmax_{\te \in \Te} \sum_{i=1}^N \log \li( f(\fx_i | \te)\ri). \]

Second, let’s remember that the probability density function of a multivariate Gaussian distribution is given by \[ f(\fx_i | \mu, \Si) = \frac{1}{(2 \pi)^{n/2} |\Si|^{1/2}} \exp \li( - \frac{1}{2} (\fx_i - \mu)^\top \Si^{-1} (\fx_i - \mu) \ri) \] with parameters \(\te=(\mu, \Si)\), where \(\mu\in\R^n\) is the mean vector and \(\Si\in\R^{n\times n}\) is the covariance matrix. Moreover, the notation \(|\Si|\) denotes the determinant of \(\Si\).

Our goal is to maximize the log-likelihood function, which in this case is \[\begin{aligned} l(\mu, \Si| \fx_1, \ldots, \fx_N) & := \sum_{i=1}^N \log f(\fx_i | \mu, \Si) \\ & = \sum_{i=1}^N \li( - \frac{n}{2} \log (2 \pi) - \frac{1}{2} \log |\Si| - \frac{1}{2} (\fx_i - \mu)^\top \Si^{-1} (\fx_i - \mu) \ri). \end{aligned}\] To obtain the MLE of \(\mu\), we take the gradient of \(l\) with respect to \(\mu\) and set it to zero: \[ \nabla_\mu l(\mu, \Si| \fx_1, \ldots, \fx_N) = \sum_{i=1}^N \Si^{-1} ( \fx_i - \mu ) = 0. \] Since \(\Si\) is a covariance matrix (and must be invertible for the density above to be defined), we can multiply both sides of the equation by \(\Si\) and obtain \[ \begin{aligned} 0 & = \sum_{i=1}^N (\fx_i - \mu) = \sum_{i=1}^N \fx_i - N \mu, \\ \Rightarrow \hmu_{\mathrm{MLE}} &= \frac{1}{N} \sum_{i=1}^N \fx_i . \end{aligned} \] Conveniently, \(\Si\) drops out, so we do not have to worry about estimating it here.
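As a quick sanity check, the closed-form estimator can be compared against a direct numerical maximization of the log-likelihood on synthetic data. The following is a minimal sketch, assuming NumPy and SciPy are available; the dimensions, parameter values, and seed are arbitrary choices.

    import numpy as np
    from scipy import optimize, stats

    rng = np.random.default_rng(0)
    n, N = 3, 500
    true_mu = np.array([1.0, -2.0, 0.5])
    true_Sigma = np.diag([1.0, 2.0, 0.5])
    X = rng.multivariate_normal(true_mu, true_Sigma, size=N)  # datapoints x_1, ..., x_N

    # Closed-form MLE of the mean: the sample average.
    mu_mle = X.mean(axis=0)

    # Numerical check: maximize the log-likelihood over mu (Sigma held fixed,
    # since the estimator of the mean does not depend on it).
    def neg_log_likelihood(mu):
        return -stats.multivariate_normal.logpdf(X, mean=mu, cov=true_Sigma).sum()

    res = optimize.minimize(neg_log_likelihood, x0=np.zeros(n))
    print(mu_mle)   # sample mean
    print(res.x)    # numerically maximized mean; should agree up to solver tolerance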

Exercise 2 - More Gradients

You are an ML Engineer at Googlezon where you are working on an internal ML framework called TorchsorFlow. You are tasked with implementing a new layer known as BatchNormalization. The idea of this layer is as follows:

During training, consider the outputs of the previous layer \(\fa_i=(a_i^{(1)}, \ldots, a_i^{(N)})\) for each element \(i\in \{1, \ldots, M\}\) of the input batch. Compute the mean \(\mu_j\) and variance \(\si_j^2\) over each input dimension \(j\). Use the resulting statistics to normalize the output of the previous layer. Finally, rescale the resulting vector with a learned constant \(\gm\) and shift it by another learned constant \(\be\).

  1. Write down the mathematical expression for the BatchNormalization layer. What are its learnable parameters?

  2. Compute the gradient of the loss \(\mcL\) with respect to the input \(\fa_i\) of the BatchNormalization layer.

  3. At test time, the batch size is usually 1. So, it is not meaningful (or even possible) to compute mean / variance. How would you implement a layer like this?

Solution

For the purpose of batch normalization, we can consider each output neuron individually. Thus, we will simplify our notation and write \(a_i\), \(y_i\), … instead of \(a_i^{(j)}\), \(y_i^{(j)}\), … respectively.

  1. The forward pass is given by the following equations \[ \begin{aligned} \mu_B &:= \frac{1}{M}\sum_{i=1}^M a_i, &\text{(mini-batch mean)} \\ \sigma_B^2 &:=\frac{1}{M}\sum_{i=1}^M (a_i-\mu_B)^2, &\text{(mini-batch variance)}\\ \ha_i &:= \frac{a_i-\mu_B}{\sqrt{\si_B^2+\epsilon}}, &\text{(normalize)}\\ y_i &:= BN_{\gamma, \beta}((a_i)_{i=1}^M) := \gamma\ha_i + \beta. &\text{(scale and shift)} \end{aligned} \] The entire layer is defined as \[ BN(\fa_1, \ldots, \fa_M)=\big( BN_{\gm^{(1)}, \be^{(1)}}\big((a_i^{(1)})_{i=1}^M\big), \ldots, BN_{\gm^{(N)}, \be^{(N)}}\big((a_i^{(N)})_{i=1}^M\big) \big) \] where \(\gm^{(1)}, \ldots, \gm^{(N)}\) and \(\be^{(1)}, \ldots, \be^{(N)}\) are learnable parameters.

  2. The derivatives can be expressed using the chain rule, where \(\mu_B\), \(\si_B^2\), \(\ha_i\), and \(y_i\) are obtained during the forward pass, while \(\partial \mcL/\partial y_i\) is obtained from earlier steps of the backward pass. The remaining derivatives are: \[ \begin{aligned} \frac{\partial\mcL}{\partial \ha_i} &= \frac{\partial \mcL}{\partial y_i}\cdot \gm, \\ \frac{\partial\mcL}{\partial \si_B^2} &= \sum_{i=1}^M \frac{\partial\mcL}{\partial\ha_i}\cdot(a_i-\mu_B)\cdot \frac{-1}{2}(\si_B^2+\epsilon)^{-3/2}, \\ \frac{\partial\mcL}{\partial\mu_B} &= \bigg(\sum_{i=1}^M \frac{\partial\mcL}{\partial\ha_i} \cdot\frac{-1}{\sqrt{\si_B^2+\epsilon}}\bigg) +\frac{\partial\mcL}{ \partial \si_B^2} \cdot\frac{\sum_{i=1}^M -2(a_i-\mu_B)}{M},\\ \frac{\partial\mcL}{\partial a_i} &= \frac{\partial\mcL}{\partial\ha_i} \cdot \frac{1}{\sqrt{\si_B^2+\epsilon}} + \frac{\partial \mcL}{\partial \si_B^2} \cdot \frac{2(a_i-\mu_B)}{M} + \frac{\partial \mcL}{\partial \mu_B}\cdot \frac{1}{M},\\ \frac{\partial\mcL}{\partial\gamma}&= \sum_{i=1}^M \frac{\partial\mcL}{\partial y_i} \cdot \ha_i, \\ \frac{\partial\mcL}{\partial\beta} &= \sum_{i=1}^M \frac{\partial\mcL}{\partial y_i}. \end{aligned} \] Here \(\epsilon\) is a small constant added to the variance in practice to avoid division by zero; it is not a learnable parameter and is simply treated as a constant when differentiating. (A NumPy sketch of the forward and backward pass, including a finite-difference check of these formulas, follows after this list.)

  3. At test time we can no longer compute meaningful batch statistics. The standard solution is to keep running (exponential moving) averages of the mini-batch mean \(\mu_B\) and variance \(\si_B^2\) during training, and to use these fixed estimates in place of the batch statistics at test time, so that the layer becomes a deterministic affine transformation of its input.
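To make the formulas above concrete, here is a minimal NumPy sketch of the forward and backward pass for a single feature dimension, together with a finite-difference check of \(\partial\mcL/\partial a_i\) for a toy loss. The function and variable names are illustrative choices, not part of any real framework API.

    import numpy as np

    def bn_forward(a, gamma, beta, eps=1e-5):
        # a: vector of length M holding a_1, ..., a_M for one feature dimension.
        mu = a.mean()                           # mini-batch mean
        var = ((a - mu) ** 2).mean()            # mini-batch variance
        a_hat = (a - mu) / np.sqrt(var + eps)   # normalize
        y = gamma * a_hat + beta                # scale and shift
        return y, (a, a_hat, mu, var, gamma, eps)

    def bn_backward(dL_dy, cache):
        # Implements the chain-rule expressions from part 2 for one feature dimension.
        a, a_hat, mu, var, gamma, eps = cache
        M = a.shape[0]
        dL_da_hat = dL_dy * gamma
        dL_dvar = np.sum(dL_da_hat * (a - mu)) * (-0.5) * (var + eps) ** (-1.5)
        dL_dmu = np.sum(dL_da_hat * (-1.0 / np.sqrt(var + eps))) \
                 + dL_dvar * np.sum(-2.0 * (a - mu)) / M
        dL_da = (dL_da_hat / np.sqrt(var + eps)
                 + dL_dvar * 2.0 * (a - mu) / M
                 + dL_dmu / M)
        dL_dgamma = np.sum(dL_dy * a_hat)
        dL_dbeta = np.sum(dL_dy)
        return dL_da, dL_dgamma, dL_dbeta

    # Finite-difference check of dL/da_i for the toy loss L = sum_i y_i.
    a = np.random.default_rng(1).normal(size=8)
    y, cache = bn_forward(a, gamma=1.5, beta=0.3)
    dL_da, _, _ = bn_backward(np.ones_like(y), cache)
    i, h = 2, 1e-6
    a_pert = a.copy()
    a_pert[i] += h
    y_pert, _ = bn_forward(a_pert, gamma=1.5, beta=0.3)
    print(dL_da[i], (y_pert.sum() - y.sum()) / h)  # the two numbers should agree closely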

Exercise 3 - Autodiff Modes

Consider the function \(F(x) = f_3(f_2(f_1(x)))\) and assume you also know the derivatives \(f_i'\) for all \(f_i\).

  1. Apply the chain rule to express \(F'(x)\) in terms of \(f_i'\)s and \(f_i\).

  2. Write down the pseudocode for computing \(F'(x)\) using the forward mode and the reverse mode respectively. Assume all functions to be scalar functions of a scalar variable, i.e. \(f_i: \R \rightarrow \R\).

  3. If you simply ask your interpreter / compiler to evaluate the expression in part 1, will the computation be in forward mode, reverse mode, or neither of the modes? Why? You can assume that your interpreter / compiler does not do any caching or optimization and simply evaluates the expression from left to right. Does anything change if you assume that your interpreter caches results that have been computed before?

Solution

  1. For better readability we will write \((g\circ h)(x)\) for \(g(h(x))\). By applying the chain rule, we obtain \[ \begin{aligned} F'(x) &= (f_3 \circ f_2 \circ f_1)' (x) \\ &= (f_2 \circ f_1)'(x) \cdot (f_3' \circ f_2 \circ f_1)(x)\\ &= f_1'(x) \cdot (f_2' \circ f_1)(x) \cdot (f_3' \circ f_2 \circ f_1)(x) \end{aligned} \]

  2. First, let’s start with the forward mode:

    d = f1'(x)
    v = f1(x)
    d = f2'(v) * d
    v = f2(v)
    d = f3'(v) * d

    Now, for the reverse mode, we first do a “forward pass” before computing gradients:

    v1 = f1(x)
    v2 = f2(v1)
    d = f3'(v2)
    d = d*f2'(v1)
    d = d*f1'(x)
  3. Simply evaluating the expression from part 1 left to right is not in line with either of the two modes. It also involves repeated computations, because \(f_1(x)\) is evaluated twice. If we allow caching of intermediate results, this duplicated computation disappears, and the left-to-right evaluation order then coincides with forward-mode automatic differentiation. However, this is specific to our example and does not hold in general. (A runnable comparison of both modes is sketched below.)
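As a complement to the pseudocode, here is a small runnable Python version of both modes for one concrete (hypothetical) choice of the \(f_i\), compared against the analytic expression from part 1.

    import math

    # Hypothetical concrete choices for f1, f2, f3 and their derivatives.
    f1, df1 = math.sin, math.cos
    f2, df2 = math.exp, math.exp
    f3, df3 = (lambda v: v * v), (lambda v: 2 * v)

    x = 0.7

    # Forward mode: propagate the derivative alongside the value.
    d = df1(x)
    v = f1(x)
    d = df2(v) * d
    v = f2(v)
    d = df3(v) * d
    forward_result = d

    # Reverse mode: forward pass first, then accumulate derivatives backwards.
    v1 = f1(x)
    v2 = f2(v1)
    d = df3(v2)
    d = d * df2(v1)
    d = d * df1(x)
    reverse_result = d

    # Analytic expression from part 1: F'(x) = f1'(x) * f2'(f1(x)) * f3'(f2(f1(x))).
    analytic = df1(x) * df2(f1(x)) * df3(f2(f1(x)))
    print(forward_result, reverse_result, analytic)  # all three should match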

Exercise 4 - GloVe Embeddings

Open the notebook presented in class and work through it by trying some of the ideas presented therein for different word combinations.