Exercise 1 - RNN for Sentiment Analysis
Suppose we are training the vanilla RNN below to determine whether a sentence expresses positive or negative sentiment. This is a character-level RNN, where \(x^{(1)}, \ldots, x^{(T)}\) is the sequence of input characters. The RNN is given as follows: \[\begin{align*} h^{(t)} &= \tanh\left(U x^{(t)} + W h^{(t-1)} + b\right) \\ y &= \sigma\left(V h^{(T)} + d\right) \end{align*}\]
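As a point of reference, here is a minimal NumPy sketch of this forward pass. The input dimension \(D\) (e.g. one-hot characters), hidden dimension \(H\), and sequence length \(T\) are illustrative assumptions, not values given in the exercise.

```python
import numpy as np

# Minimal sketch of the sentiment RNN's forward pass; D, H, T are assumed.
D, H, T = 128, 64, 10
rng = np.random.default_rng(0)

U = rng.normal(size=(H, D))   # input-to-hidden
W = rng.normal(size=(H, H))   # hidden-to-hidden
V = rng.normal(size=(1, H))   # hidden-to-output
b = np.zeros(H)
d = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

xs = [rng.normal(size=D) for _ in range(T)]  # stand-in character encodings
h = np.zeros(H)                              # h^{(0)}
for x in xs:
    h = np.tanh(U @ x + W @ h + b)           # one recurrence step
y = sigmoid(V @ h + d)                       # prediction from the final hidden state
print(y)                                     # probability of positive sentiment
```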
How many times do we need to apply the weight matrix \(U\), \(W\), and \(V\)?
What are the shapes of the matrices \(U\), \(W\), and \(V\)?
How many addition and multiplication operations are required to make a prediction? You can assume that no additions or multiplications are performed when applying the tanh and sigmoid activation functions.
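As a building block for the count (not the full answer), recall the cost of a single matrix-vector product: multiplying an \(m \times n\) matrix by an \(n\)-dimensional vector requires \[mn \text{ multiplications} \quad \text{and} \quad m(n-1) \text{ additions},\] and adding an \(m\)-dimensional bias costs \(m\) further additions. The total follows by summing this cost over every application of \(U\), \(W\), and \(V\).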
Exercise 2 - Scalar RNN
Suppose we have the following vanilla RNN, where the inputs and hidden units are scalars. \[\begin{align*} h^{(t)} &= \tanh\left(w \cdot h^{(t-1)} + u \cdot x^{(t-1)} + b_h\right) \\ y &= \sigma\left(v \cdot h^{(T)} + b_y\right) \end{align*}\]
Show that if \(|w| < 1\) and the number of time steps \(T\) is large, then the gradient \(\frac{\partial y}{\partial x^{(0)}}\) vanishes.
Why is the result from Part (a) troubling?
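The claim in Part (a) can also be sanity-checked numerically. Below is a short NumPy sketch (an illustration, not a proof) that estimates \(\frac{\partial y}{\partial x^{(0)}}\) by finite differences for increasing \(T\); the parameter values are arbitrary assumptions.

```python
import numpy as np

# Finite-difference check: with |w| < 1, the estimate of dy/dx^(0)
# shrinks roughly like |w|**(T-1) as T grows. Parameter values are assumed.
w, u, v, b_h, b_y = 0.5, 1.0, 1.0, 0.0, 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(xs):
    h = 0.0                              # h^{(0)}
    for x in xs:                         # xs[t-1] plays the role of x^{(t-1)}
        h = np.tanh(w * h + u * x + b_h)
    return sigmoid(v * h + b_y)

rng = np.random.default_rng(0)
eps = 1e-6
for T in [5, 10, 20, 40]:
    xs = rng.normal(size=T)
    xs_plus = xs.copy()
    xs_plus[0] += eps                    # perturb x^{(0)}
    grad = (forward(xs_plus) - forward(xs)) / eps
    print(T, abs(grad))                  # magnitude decays geometrically in T
```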
Exercise 3 - RNN Addition
In this problem, you will implement a recurrent neural network that performs binary addition. The inputs are given as binary sequences, starting with the least significant binary digit. (It is easier to start from the least significant bit, just like how you did addition in grade school.) The sequences will be padded with at least one zero as the most significant digit, so that the output length is the same as the input length. For example, the problem \(100111 + 110010\), whose target output value is \(1011001\), will be represented as follows: \[\begin{align*} \mathbf{x}^{(1)} = \begin{bmatrix}1 \\ 0\end{bmatrix}, \mathbf{x}^{(2)} = \begin{bmatrix}1 \\ 1\end{bmatrix}, \mathbf{x}^{(3)} = \begin{bmatrix}1 \\ 0\end{bmatrix}, \mathbf{x}^{(4)} = \begin{bmatrix}0 \\ 0\end{bmatrix}, \mathbf{x}^{(5)} = \begin{bmatrix}0 \\ 1\end{bmatrix}, \mathbf{x}^{(6)} = \begin{bmatrix}1 \\ 1\end{bmatrix}, \mathbf{x}^{(7)} = \begin{bmatrix}0 \\ 0\end{bmatrix} \end{align*}\]
With the target output: \[\begin{align*} y^{(1)} = 1, y^{(2)} = 0, y^{(3)} = 0, y^{(4)} = 1, y^{(5)} = 1, y^{(6)} = 0, y^{(7)} = 1. \end{align*}\]
There are two input units, corresponding to the two addends, and one output unit; the sequences above give the pattern of inputs and outputs for this example.
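To make the representation concrete, here is a hypothetical Python helper that builds these sequences; the function name `encode_addition` and the single-zero padding are assumptions for illustration.

```python
import numpy as np

# Build the LSB-first, zero-padded representation described above.
def encode_addition(a: str, b: str):
    T = max(len(a), len(b)) + 1               # one extra column for the final carry
    a_bits = [int(c) for c in reversed(a)] + [0] * (T - len(a))
    b_bits = [int(c) for c in reversed(b)] + [0] * (T - len(b))
    xs = [np.array([ai, bi]) for ai, bi in zip(a_bits, b_bits)]
    ys = [int(c) for c in reversed(bin(int(a, 2) + int(b, 2))[2:])]
    ys += [0] * (T - len(ys))
    return xs, ys

xs, ys = encode_addition("100111", "110010")
print([x.tolist() for x in xs])  # [[1, 0], [1, 1], [1, 0], [0, 0], [0, 1], [1, 1], [0, 0]]
print(ys)                        # [1, 0, 0, 1, 1, 0, 1]
```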
Design, by hand, the weights and biases for an RNN which has two input units, three hidden units, and one output unit, and which implements binary addition as discussed above. All of the units use the hard threshold activation function (\(f(x) = 1\) if \(x > 0\) and \(0\) otherwise). In particular, specify weight matrices \(\mathbf{U}\) and \(\mathbf{W}\), weight vector \(\mathbf{v}\), bias vector \(\mathbf{b}_{\mathbf{h}}\), and scalar bias \(b_y\) for the following architecture: \[\begin{align*} h^{(t)} &= f(\mathbf{W}h^{(t-1)} + \mathbf{U}\mathbf{x}^{(t)} + \mathbf{b_h}) \\ y^{(t)} &= f(\mathbf{v}^T h^{(t)} + b_y) \end{align*}\]
What are the shapes of \(\mathbf{U}\), \(\mathbf{v}\), \(\mathbf{W}\), and \(\mathbf{b}_{\mathbf{h}}\)?
Come up with values for \(\mathbf{U}\), \(\mathbf{W}\), and \(\mathbf{b}_{\mathbf{h}}\). Justify your answer. Hint: When performing binary addition, in addition to adding up the two digits in a column, we need to track whether there is a carry digit from the previous column. We will choose one of the three units in \(\mathbf{h}^{(t)}\), say \(\mathbf{h}_2^{(t)}\), to represent this carry digit. You may also find it helpful to set \(\mathbf{h}_1\) to activate if the sum of the 3 digits (the two inputs plus the carry) is at least 1, \(\mathbf{h}_2\) to activate if the sum is at least 2, and \(\mathbf{h}_3\) to activate if the sum is at least 3.
Come up with the values of \(\mathbf{v}\) and \(b_y\). Justify your answer.
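Once you have candidate values, a small NumPy harness like the one below can check them against the worked example. The zero initializations are placeholders for your answers, not a solution.

```python
import numpy as np

# Harness for testing hand-designed weights in the architecture above.
# Replace the zeros with your values for U, W, b_h, v, b_y.
U = np.zeros((3, 2))   # input-to-hidden
W = np.zeros((3, 3))   # hidden-to-hidden
b_h = np.zeros(3)
v = np.zeros(3)        # hidden-to-output
b_y = 0.0

def f(z):
    return (z > 0).astype(float)   # hard threshold activation

def run(xs):
    h = np.zeros(3)                # h^{(0)}
    ys = []
    for x in xs:
        h = f(W @ h + U @ x + b_h)
        ys.append(1 if v @ h + b_y > 0 else 0)
    return ys

# Example from the problem statement: 100111 + 110010 = 1011001 (LSB first).
xs = [np.array(p) for p in [(1,0), (1,1), (1,0), (0,0), (0,1), (1,1), (0,0)]]
print(run(xs))  # should print [1, 0, 0, 1, 1, 0, 1] once the weights are set
```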