Exercise 1 - RNN for Sentiment Analysis
Suppose we are training a vanilla RNN like below to determine whether a sentence expresses positive or negative sentiment. This RNN will be a character-level RNN where \(x^{(1)}, \ldots, x^{(T)}\) is the sequence of input characters. The RNN is given as follows: \[\begin{align*} h^{(t)} &= \tanh\li(U x^{(t)} + W h^{(t-1)} + b\ri) \\ y &= \sigma\li(V h^{(T)} + d\ri) \end{align*}\]
How many times do we need to apply the weight matrix \(U\), \(W\), and \(V\)?
What are the shapes of the matrices \(U\), \(W\), and \(V\)?
How many addition and multiplication operations are required to make a prediction? You can assume that no additions or multiplications are performed when applying the tanh and sigmoid activation functions.
Solution
We will need to compute \(h^{(t)}\) for \(t = 1, \ldots, T\). Each of these computations requires applying the weight matrices \(W\) and \(U\) once. The matrix \(V\) is only applied once at the end. Therefore, we need to apply \(W\) and \(U\) \(T\) times each and \(V\) once.
The shape of \(U\) is \(d_h \times d_x\), the shape of \(W\) is \(d_h \times d_h\), and the shape of \(V\) is \(d_y \times d_h\), where \(d_h\) is the dimensionality of the hidden states \(h^{(i)}\) (i.e. \(h^{(i)}\in\R^{d_h}\)), \(d_x\) is the dimensionality of the inputs \(x^{(i)}\), and \(d_y\) is the dimensionality of the output \(y\).
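As a sanity check on these shapes (and on how often each matrix is applied), here is a minimal NumPy sketch of the forward pass; the concrete dimensions \(d_x = 5\), \(d_h = 4\), \(d_y = 1\), the sequence length \(T = 10\), and the random parameters and inputs are assumptions chosen purely for illustration.

```python
import numpy as np

# Illustrative dimensions (assumed, not part of the exercise).
d_x, d_h, d_y, T = 5, 4, 1, 10

U = np.random.randn(d_h, d_x)  # input-to-hidden,  d_h x d_x
W = np.random.randn(d_h, d_h)  # hidden-to-hidden, d_h x d_h
V = np.random.randn(d_y, d_h)  # hidden-to-output, d_y x d_h
b = np.zeros(d_h)
d = np.zeros(d_y)

xs = [np.random.randn(d_x) for _ in range(T)]  # input sequence x^(1), ..., x^(T)
h = np.zeros(d_h)                              # h^(0)
for x in xs:                                   # U and W are applied once per time step
    h = np.tanh(U @ x + W @ h + b)
y = 1.0 / (1.0 + np.exp(-(V @ h + d)))         # V is applied once, at the end
```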
For each of the \(T\) steps, we need to perform two matrix-vector multiplications (one for \(Ux^{(t)}\) and one for \(Wh^{(t-1)}\)) and two vector additions. To compute the output, we need one additional matrix-vector multiplication and one vector addition.
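If we additionally count individual scalar operations (a sketch under the convention that a matrix-vector product with an \(m \times n\) matrix costs \(mn\) multiplications and \(m(n-1)\) additions, and that adding two length-\(m\) vectors costs \(m\) additions), each recurrent step costs \(d_h d_x + d_h^2\) multiplications and \(d_h(d_x - 1) + d_h(d_h - 1) + 2 d_h = d_h d_x + d_h^2\) additions, and the output layer costs \(d_y d_h\) multiplications and \(d_y(d_h - 1) + d_y = d_y d_h\) additions. In total, making a prediction therefore requires \[ T\left(d_h d_x + d_h^2\right) + d_y d_h \] multiplications and the same number of additions.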
Exercise 2 - Scalar RNN
Suppose we have the following vanilla RNN network, where the inputs and hidden units are scalars. \[\begin{align*} h^{(t)} &= \tanh\li(w \cdot h^{(t-1)} + u \cdot x^{(t-1)} + b_h\ri) \\ y &= \sigma\li(v \cdot h^{(T)} + b_y\ri) \end{align*}\]
Show that if \(|w| < 1\), and the number of time steps \(T\) is large, then the gradient \(\frac{\partial y}{\partial x^{(0)}}\) vanishes.
Why is the result from Part (a) troubling?
Solution
To make the sequence length \(T\) explicit in the notation, we will write \(y_T\) instead of \(y\). Formally, what we have to show is \[ |w|<1 \implies \lim_{T\to\infty} \fr{\partial y_T}{\partial x^{(0)}} = 0 . \] For the proof, we expand the derivative of \(y_T\) with respect to \(x^{(0)}\) using the chain rule: \[ \begin{aligned} \fr{\partial y_T}{\partial x^{(0)}} & = \si'\li(v \cdot h^{(T)} + b_y\ri) \cdot v \cdot \fr{\partial h^{(T)}}{\partial x^{(0)}} \\ & = \si'\li(v \cdot h^{(T)} + b_y\ri) \cdot v \cdot \underbrace{ \tanh'\li(w \cdot h^{(T-1)} + u \cdot x^{(T-1)} + b_h\ri) }_{A_{T-1}(x^{(0)})} \cdot w \cdot \fr{\partial h^{(T-1)}}{\partial x^{(0)}} \\ & = \ldots \\ & = \si'\li(v \cdot h^{(T)} + b_y\ri) \cdot v \cdot \prod_{t=1}^{T-1} A_{t}(x^{(0)}) \cdot w^{T-1} \cdot \fr{\partial h^{(1)}}{\partial x^{(0)}} .\\ \end{aligned} \] Using this, we can analyze the absolute value of the derivative \(\partial y_T/\partial x^{(0)}\). For \(\tanh\) and \(\si\), the absolute value of their respective derivatives is bounded by \(1\). Thus, we have \[ \begin{aligned} \li|\fr{\partial y_T}{\partial x^{(0)}} \ri| & = \underbrace{ \li|\si'\li(v \cdot h^{(T)} + b_y\ri) \ri| }_{\leq 1} \cdot |v| \cdot \prod_{t=1}^{T-1} \underbrace{\li| A_{t}(x^{(0)}) \ri|}_{\leq 1} \cdot |w|^{T-1} \cdot \li|\fr{\partial h^{(1)}}{\partial x^{(0)}}\ri| \\ & \leq |v| \cdot |w|^{T-1} \cdot \li|\fr{\partial h^{(1)}}{\partial x^{(0)}}\ri| \\ \end{aligned} \] The remaining factor is bounded as well, since \(\li|\fr{\partial h^{(1)}}{\partial x^{(0)}}\ri| = \li|\tanh'\li(w \cdot h^{(0)} + u \cdot x^{(0)} + b_h\ri) \cdot u\ri| \leq |u|\). Because \(|w|<1\), the bound \(|v| \cdot |u| \cdot |w|^{T-1}\) converges to \(0\) as \(T\to\infty\), and thus \(|\partial y_T/\partial x^{(0)}|\) also converges to \(0\), i.e. the gradient vanishes.
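To get a feeling for how quickly this bound shrinks, consider a purely illustrative numerical example (the values are assumptions, not part of the exercise): for \(|w| = 0.9\) and \(T = 100\) we get \(|w|^{T-1} = 0.9^{99} \approx 3 \times 10^{-5}\), so even with moderate values of \(|v|\) and \(|u|\) the gradient \(\partial y_T / \partial x^{(0)}\) is already negligibly small.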
It implies that, for long input sequences, the output is essentially insensitive to the early inputs: the gradient signal from \(x^{(0)}\) vanishes, so gradient-based training cannot learn dependencies between the output and inputs far in the past.
Exercise 3 - RNN Addition
In this problem, you will implement a recurrent neural network which implements binary addition. The inputs are given as binary sequences, starting with the least significant binary digit. (It is easier to start from the least significant bit, just like how you did addition in grade school.) The sequences will be padded with at least one zero as the most significant digit, so that the output length is the same as the input length. For example, the problem \(100111 + 110010\), whose target output value is \(1011001\), will be represented as follows: \[\begin{align*} \mathbf{x}^{(1)} = \begin{bmatrix}1 \\ 0\end{bmatrix}, \mathbf{x}^{(2)} = \begin{bmatrix}1 \\ 1\end{bmatrix}, \mathbf{x}^{(3)} = \begin{bmatrix}1 \\ 0\end{bmatrix}, \mathbf{x}^{(4)} = \begin{bmatrix}0 \\ 0\end{bmatrix}, \mathbf{x}^{(5)} = \begin{bmatrix}0 \\ 1\end{bmatrix}, \mathbf{x}^{(6)} = \begin{bmatrix}1 \\ 1\end{bmatrix}, \mathbf{x}^{(7)} = \begin{bmatrix}0 \\ 0\end{bmatrix} \end{align*}\]
With the target output: \[\begin{align*} y^{(1)} = 1, y^{(2)} = 0, y^{(3)} = 0, y^{(4)} = 1, y^{(5)} = 1, y^{(6)} = 0, y^{(7)} = 1. \end{align*}\]
There are two input units corresponding to the two addends and one output unit; the vectors above give the pattern of inputs and outputs for this example.
Design, by hand, the weights and biases for an RNN which has two input units, three hidden units, and one output unit, and which implements binary addition as discussed above. All of the units use the hard threshold activation function (\(f(x) = 1\) if \(x > 0\) and \(0\) otherwise). In particular, specify the weight matrices \(\mathbf{U}\) and \(\mathbf{W}\), the weight vector \(\mathbf{v}\), the bias vector \(\mathbf{b}_{\mathbf{h}}\), and the scalar bias \(b_y\) for the following architecture: \[\begin{align*} h^{(t)} &= f(\mathbf{W}h^{(t-1)} + \mathbf{U}\mathbf{x}^{(t)} + \mathbf{b_h}) \\ y^{(t)} &= f(\mathbf{v}^T h^{(t)} + b_y) \end{align*}\]
What are the shapes of \(\mathbf{U}\), \(\mathbf{v}\), \(\mathbf{W}\), and \(\mathbf{b}_{\mathbf{h}}\)?
Come up with values for \(\mathbf{U}\), \(\mathbf{W}\), and \(\mathbf{b}_{\mathbf{h}}\). Justify your answer. Hint: When performing binary addition, in addition to adding up the two digits in a column, we need to track whether there is a carry digit from the previous column. We will choose one of the three units in \(\mathbf{h}^{(t)}\), say \(\mathbf{h}_2^{(t)}\), to represent this carry digit. You may also find it helpful to set \(\mathbf{h}_1\) to activate if the sum of the 3 digits is at least 1, \(\mathbf{h}_2\) to activate if the sum is at least 2, and \(\mathbf{h}_3\) to activate if the sum is at least 3.
Come up with the values of \(\mathbf{v}\) and \(b_y\). Justify your answer.
Solution
Since the inputs \(\mathbf{x}^{(t)}\) are \(2 \times 1\) and the hidden units \(\mathbf{h}^{(t)}\) are \(3 \times 1\), we should have:
- \(\mathbf{W}\) is \(3 \times 3\)
- \(\mathbf{U}\) is \(3 \times 2\)
- \(\mathbf{b}_h\) is \(3 \times 1\)
- \(\mathbf{v}\) is \(3 \times 1\)
We will follow the hint and implement the addition in our RNN such that:
- The first of our hidden units, \(h_1^{(t)}\), is 1 if and only if the sum \(S^{(t)} \doteq x_1^{(t)} + x_2^{(t)} + c^{(t-1)} \geq 1\), where \(c^{(t-1)}\) denotes the carry (\(h_2^{(t-1)}\) from the previous column). Note that \(S^{(t)}\) and \(c^{(t-1)}\) are not variables of the model, merely notation to help work out the solution.
- \(h_2^{(t)}\) is 1 iff the sum \(S^{(t)} \geq 2\),
- and \(h_3^{(t)}\) is 1 iff the sum \(S^{(t)}\) is 3.
Notice that the carry \(c^{(t-1)}\) is 1 iff \(h_2^{(t-1)}=1\), i.e. when the sum in the previous column was 2 or 3, and 0 otherwise. Therefore, to compute \(h_i^{(t)}\) we first compute the sum \(S^{(t)} = x_1^{(t)} + x_2^{(t)} + h_2^{(t-1)}\) and then subtract \(i - 0.5\), so that after applying the hard threshold function we get \(h_i^{(t)} = 1\) exactly when \(S^{(t)} \geq i\), as specified above. This can be achieved with the following set of parameters: \[ \mathbf{U}= \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix},\quad \mathbf{W}=\begin{bmatrix} 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0\end{bmatrix},\quad \mathbf{b_h}= \begin{bmatrix} -0.5 \\ -1.5 \\ -2.5 \end{bmatrix}. \]
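As a quick check of these parameters, consider step \(t = 2\) of the example from the problem statement: \(\mathbf{x}^{(2)} = \begin{bmatrix}1 \\ 1\end{bmatrix}\) with no incoming carry (\(h_2^{(1)} = 0\), since the first column sums to 1). The pre-activation vector is \(\mathbf{W}\mathbf{h}^{(1)} + \mathbf{U}\mathbf{x}^{(2)} + \mathbf{b_h} = \begin{bmatrix} 2 - 0.5 \\ 2 - 1.5 \\ 2 - 2.5 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 0.5 \\ -0.5 \end{bmatrix}\), so \(\mathbf{h}^{(2)} = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}\): the column sum is at least 1 and at least 2 but not 3, so a carry of 1 (stored in \(h_2^{(2)}\)) is passed on to step \(t = 3\).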
To compute the output \(y^{(t)}\) we need to check whether \(S^{(t)}\) is 1 or 3, that is, whether either \(h_1^{(t)} = 1\) while all other hidden units are zero, or all hidden units are 1. We can accomplish this by setting \(\mathbf{v}=\begin{bmatrix} 1 & -1 & 1 \end{bmatrix}^T\) and \(b_y = -0.5\): then \(\mathbf{v}^T \mathbf{h}^{(t)}\) equals \(1\) when \(S^{(t)} \in \{1, 3\}\) and \(0\) otherwise, so thresholding at \(-b_y = 0.5\) produces exactly the desired output bit.
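To double-check the full construction, here is a short NumPy sketch (not part of the exercise, just a verification aid) that runs the hand-designed RNN on the worked example \(100111 + 110010 = 1011001\) from the problem statement and compares the outputs against the target bits.

```python
import numpy as np

# Hand-designed parameters from the solution above.
U = np.ones((3, 2))
W = np.array([[0., 1., 0.],
              [0., 1., 0.],
              [0., 1., 0.]])
b_h = np.array([-0.5, -1.5, -2.5])
v = np.array([1., -1., 1.])
b_y = -0.5

def hard_threshold(z):
    """f(x) = 1 if x > 0, else 0 (applied elementwise)."""
    return (z > 0).astype(float)

# Example from the problem statement, least significant bit first.
xs = [np.array(x, dtype=float)
      for x in [(1, 0), (1, 1), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0)]]
targets = [1, 0, 0, 1, 1, 0, 1]

h = np.zeros(3)  # h^(0)
ys = []
for x in xs:
    h = hard_threshold(W @ h + U @ x + b_h)
    ys.append(int(v @ h + b_y > 0))

print(ys)  # [1, 0, 0, 1, 1, 0, 1]
assert ys == targets
```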