Lecture 9
A part of the loss landscape where the gradient of the loss with respect to a parameter…
RNN For Prediction:
RNN For Generation:
Unlike the other models we have discussed so far, the training-time behaviour of a generative RNN differs from its test-time behaviour.
Test time behaviour:
During training, we try to get the RNN to generate one particular sequence in the training set:
Q1: What kind of a problem is this? (regression or classification?)
Q2: What loss function should we use during training?
First classification problem:
Second classification problem:
Continue until we get to the “<EOS>” (end of string) token
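Putting the answers together: generating each token is a classification problem over the vocabulary, and we train with a cross-entropy loss at every time step, feeding in the ground-truth previous token (teacher forcing). A minimal PyTorch sketch of this training setup, with hypothetical sizes and variable names of my own:

```python
import torch
import torch.nn as nn

# Hypothetical sizes; `tokens` stands in for one training sequence that ends in <EOS>.
vocab_size, emb_size, hidden_size = 100, 32, 64

embed = nn.Embedding(vocab_size, emb_size)
rnn = nn.GRU(emb_size, hidden_size, batch_first=True)
readout = nn.Linear(hidden_size, vocab_size)     # logits over the next token
criterion = nn.CrossEntropyLoss()                # one classification problem per time step

tokens = torch.randint(0, vocab_size, (1, 20))   # stand-in for a real token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens up to t

hidden_states, _ = rnn(embed(inputs))            # (1, T-1, hidden_size)
logits = readout(hidden_states)                  # (1, T-1, vocab_size)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # then take an optimizer step
```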
Another approach is to model text one character at a time
This solves the problem of what to do about previously unseen words.
Note that long-term memory is essential at the character level!
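As a quick illustration of the character-level setup (the text and tokens here are hypothetical, not from the lecture): the vocabulary is just the set of characters, so any new word built from known characters can still be encoded.

```python
# Character-level vocabulary built from some (hypothetical) training text.
text = "hello world"
vocab = sorted(set(text)) + ["<EOS>"]
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

# "dew hold" contains words never seen during training, but every character is known,
# so the character-level model has no out-of-vocabulary problem.
encoded = [char_to_idx[ch] for ch in "dew hold"]
```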
In lecture 8, we showed a discriminative RNN that makes a prediction based on a sequence (sequence as an input).
In the week 11 tutorial, we will build a generator RNN to generate sequences (sequence as an output)
Another common example of a sequence-to-sequence task (seq2seq) is machine translation.
The network first reads and memorizes the input sentence. When it sees the “end token”, it starts outputting the translation. The “encoder” and “decoder” are two different networks with different weights.
The encoder network reads an input sentence and stores all the information in its hidden units.
The decoder network then generates the output sentence one word at a time.
But some sentences can be really long. Can we really store all the information in a vector of hidden units?
Human translators refer back to the input.
We’ll look at the translation model from the classic paper:
Bahdanau et al., Neural machine translation by jointly learning to align and translate. ICLR, 2015.
Basic idea: each output word comes from one input word, or a handful of input words. Maybe we can learn to attend to only the relevant ones as we produce the output.
We’ll use this opportunity to look at architectural changes we can make to RNN models to make them even more performant.
The encoder computes an annotation (hidden state) of each word in the input.
The encoder is a bidirectional RNN. We have two RNNs: one that runs forward and one that runs backwards. These RNNs can be LSTMs or GRUs.
The annotation of a word is the concatenation of the forward and backward hidden vectors.
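A sketch of such an encoder in PyTorch (sizes are hypothetical): with `bidirectional=True`, the output at each position is already the concatenation of the forward and backward hidden vectors, i.e. the annotation of that word.

```python
import torch
import torch.nn as nn

emb_size, hidden_size, T = 32, 64, 10            # hypothetical sizes
encoder = nn.GRU(emb_size, hidden_size, batch_first=True, bidirectional=True)

word_embeddings = torch.randn(1, T, emb_size)    # stand-in for an embedded source sentence
annotations, _ = encoder(word_embeddings)        # (1, T, 2 * hidden_size)
# annotations[:, j, :] concatenates the forward and backward hidden vectors for word j.
```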
The decoder network is also an RNN, and makes predictions one word at a time.
The difference is that it also derives a context vector \({\bf c}^{(t)}\) at each time step, computed by attending to the inputs
“My language model tells me the next word should be an adjective. Find me an adjective in the input”
We would like to refer back to one (or a few) of the input words to help with the translation task (e.g. find the adjective)
If you were programming a translator, you might…
An attentional decoder is like a continuous form of these last three steps.
The context vector is computed as a weighted average of the encoder’s annotations:
\[{\bf c}^{(t)} = \sum_j \alpha_{tj} {\bf h}^{(j)}\]
The attention weights are computed with a softmax, where the inputs depend on the annotation \({\bf h}^{(j)}\) and the previous decoder state \({\bf s}^{(t-1)}\):
\[ e_{tj} = a({\bf s}^{(t-1)}, {\bf h}^{(j)}), \qquad \alpha_{tj} = \frac{ \exp(e_{tj}) }{\sum_{j^\prime} \exp(e_{tj^\prime})} \]
The attention function depends on the annotation vector, rather than the position in the sentence. It is a form of content-based addressing.
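Bahdanau et al. implement the scoring function \(a\) as a small neural network applied to the previous decoder state and each annotation. A minimal sketch of this additive, content-based attention (the layer names below are my own):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive score e_{tj} = v^T tanh(W_s s^{(t-1)} + W_h h^{(j)}), the form used by Bahdanau et al."""
    def __init__(self, state_size, annot_size, attn_size):
        super().__init__()
        self.W_s = nn.Linear(state_size, attn_size, bias=False)
        self.W_h = nn.Linear(annot_size, attn_size, bias=False)
        self.v = nn.Linear(attn_size, 1, bias=False)

    def forward(self, s_prev, annotations):
        # s_prev: (batch, state_size); annotations: (batch, T, annot_size)
        scores = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(annotations)))
        alphas = torch.softmax(scores.squeeze(-1), dim=-1)      # attention weights over j
        context = torch.bmm(alphas.unsqueeze(1), annotations)   # weighted sum of annotations
        return context.squeeze(1), alphas
```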
\[\begin{align*} {\bf h}^{(1)} &= \begin{bmatrix}1 & 3 & 9\end{bmatrix}^\top \\ {\bf h}^{(2)} &= \begin{bmatrix}0 & 0 & 1\end{bmatrix}^\top \\ {\bf h}^{(3)} &= \begin{bmatrix}5 & -1 & 2\end{bmatrix}^\top \\ \\ {\bf c}^{(t)} &= \begin{bmatrix}? & ? & ?\end{bmatrix}^\top \\ \end{align*}\]
\[\begin{align*} {\bf h}^{(1)} &= \begin{bmatrix}1 & 3 & 9\end{bmatrix}^\top \\ {\bf h}^{(2)} &= \begin{bmatrix}0 & 0 & 1\end{bmatrix}^\top \\ {\bf h}^{(3)} &= \begin{bmatrix}5 & -1 & 2\end{bmatrix}^\top \\ \\ {\bf c}^{(t)} &= \text{average}({\bf h}^{(1)} , {\bf h}^{(2)}, {\bf h}^{(3)})\\ &= \begin{bmatrix}2 & 2/3 & 4\end{bmatrix}^\top \\ \end{align*}\]
\[\begin{align*} {\bf h}^{(1)} &= \begin{bmatrix}1 & 3 & 9\end{bmatrix}^\top \\ {\bf h}^{(2)} &= \begin{bmatrix}0 & 0 & 1\end{bmatrix}^\top \\ {\bf h}^{(3)} &= \begin{bmatrix}5 & -1 & 2\end{bmatrix}^\top \\ \\ {\bf s}^{(t-1)} &= \begin{bmatrix}0 & 1 & 1\end{bmatrix}^\top \\ \alpha_t &= \text{softmax}\left(\begin{bmatrix} f({\bf s}^{(t-1)}, {\bf h}^{(1)}) \\ f({\bf s}^{(t-1)}, {\bf h}^{(2)}) \\ f({\bf s}^{(t-1)}, {\bf h}^{(3)}) \\ \end{bmatrix}\right) = \begin{bmatrix}\alpha_{t1} \\ \alpha_{t2} \\ \alpha_{t3}\end{bmatrix} \end{align*}\]
\[\begin{align*} {\bf h}^{(1)} &= \begin{bmatrix}1 & 3 & 9\end{bmatrix}^\top \\ {\bf h}^{(2)} &= \begin{bmatrix}0 & 0 & 1\end{bmatrix}^\top \\ {\bf h}^{(3)} &= \begin{bmatrix}5 & -1 & 2\end{bmatrix}^\top \\ \\ {\bf s}^{(t-1)} &= \begin{bmatrix}0 & 1 & 1\end{bmatrix}^\top \\ \alpha_t &= \text{softmax}\left(\begin{bmatrix} {\bf s}^{(t-1)} \cdot {\bf h}^{(1)} \\ {\bf s}^{(t-1)} \cdot {\bf h}^{(2)} \\ {\bf s}^{(t-1)} \cdot {\bf h}^{(3)} \\ \end{bmatrix}\right) = \begin{bmatrix}\alpha_{t1} \\ \alpha_{t2} \\ \alpha_{t3}\end{bmatrix} \end{align*}\]
\[\begin{align*} {\bf h}^{(1)} &= \begin{bmatrix}1 & 3 & 9\end{bmatrix}^\top \\ {\bf h}^{(2)} &= \begin{bmatrix}0 & 0 & 1\end{bmatrix}^\top \qquad \quad {\bf s}^{(t-1)} = \begin{bmatrix}0 & 1 & 1\end{bmatrix}^\top \\ {\bf h}^{(3)} &= \begin{bmatrix}5 & -1 & 2\end{bmatrix}^\top \\ \\ \alpha_t &= \text{softmax}\left(\begin{bmatrix}1 & 0 & 5 \\ 3 & 0 & -1 \\ 9 & 1 & 2 \end{bmatrix}^\top \begin{bmatrix}0 \\ 1 \\ 1\end{bmatrix}\right) \\ {\bf c}^{(t)} &= \alpha_{t1} {\bf h}^{(1)} + \alpha_{t2} {\bf h}^{(2)} + \alpha_{t3} {\bf h}^{(3)}\\ \end{align*}\]
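As a sanity check on this example, a short NumPy computation of the dot-product attention weights and context vector for these numbers:

```python
import numpy as np

H = np.array([[1, 3, 9], [0, 0, 1], [5, -1, 2]], dtype=float)  # rows are h1, h2, h3
s = np.array([0, 1, 1], dtype=float)                           # previous decoder state

logits = H @ s                                   # dot products: [12, 1, 1]
alphas = np.exp(logits) / np.exp(logits).sum()   # softmax puts almost all weight on h1
c = alphas @ H                                   # context vector, approximately [1, 3, 9]
```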
https://play.library.utoronto.ca/watch/9ed8b3c497f82b510e9ecf441c5eef4f
Visualization of the attention map (the \(\alpha_{tj}\)s at each time step)
Nothing forces the model to go (roughly) linearly through the input sentences, but somehow it learns to do it!
The attention-based translation model does much better than the encoder/decoder model on long sentences.
Caption Generation Task:
Attention can also be used to understand images.
The next few slides are based on this paper from the UofT machine learning group:
Xu et al. Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention. ICML, 2015.
The caption generation task: take an image as input, and produce a sentence describing the image.
The math is similar to before; the difference is that \(j\) now indexes a pixel location
\[\begin{align*} e_{tj} &= a({\bf s}^{(t-1)}, {\bf h}^{(j)}) \\ \alpha_{tj} &= \frac{ \exp(e_{tj}) }{\sum_{j^\prime} \exp(e_{tj^\prime})} \end{align*}\]
This lets us understand where the network is looking as it generates a sentence.
This can also help us understand the network’s mistakes.
Finally, to get more capacity/performance out of RNNs, you can stack multiple RNNs together!
The hidden state of your first RNN becomes the input to your second-layer RNN.
One disadvantage of RNNs (and especially multi-layer RNNs) is that they take a long time to train and are more difficult to parallelize. (We need the previous hidden state \({\bf h}^{(t)}\) to be able to compute \({\bf h}^{(t+1)}\).)
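A sketch of a stacked RNN in PyTorch (hypothetical sizes): with `num_layers=2`, the hidden states of the first layer are fed as inputs to the second layer, but each time step still has to wait for the previous one.

```python
import torch
import torch.nn as nn

emb_size, hidden_size, T = 32, 64, 10        # hypothetical sizes
stacked_rnn = nn.LSTM(emb_size, hidden_size, num_layers=2, batch_first=True)

x = torch.randn(1, T, emb_size)              # stand-in for an embedded input sequence
outputs, (h_n, c_n) = stacked_rnn(x)         # outputs come from the top (second) layer
# h_n has shape (num_layers, 1, hidden_size): one final hidden state per layer.
# Computation across time remains sequential: h^(t) is needed to compute h^(t+1).
```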
What is ChatGPT? We’ll let it speak for itself:
I am ChatGPT, a large language model developed by OpenAI. I use machine learning algorithms to generate responses to questions and statements posed to me by users. I am designed to understand and generate natural language responses in a variety of domains and topics, from general knowledge to specific technical fields. My purpose is to assist users in generating accurate and informative responses to their queries and to provide helpful insights and suggestions.
ChatGPT is based on OpenAI’s GPT-3, which itself is based on the transformer architecture.
Idea: Do away with recurrent networks altogether; instead exclusively use attention to obtain the history at the hidden layers
Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.
https://openai.com/blog/better-language-models/
The Transformer has an encoder-decoder architecture similar to the previous sequence-to-sequence RNN models, except that all the recurrent connections are replaced by attention modules.
In general, an attention mechanism can be described as a function of a query and a set of key-value pairs. The Transformer uses “scaled dot-product attention” to obtain the context vector:
\[\begin{align*} {\bf c}^{(t)} = \text{attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_K}}\right) V \end{align*}\]
This is very similar to the attention mechanism we saw earlier, but we scale the pre-softmax values (the logits) down by the square root of the key dimension \(d_K\).
When training the decoder (e.g. to generate a sequence), we have to be careful to mask out future positions of the desired output so that we preserve the autoregressive property.
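A minimal sketch of scaled dot-product attention with an optional causal mask (function and argument names are my own, not from the lecture or the paper):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, causal=False):
    # Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)
    logits = Q @ K.transpose(-2, -1) / math.sqrt(K.shape[-1])
    if causal:
        # Mask out future positions so that position t can only attend to positions <= t.
        T_q, T_k = logits.shape
        mask = torch.triu(torch.ones(T_q, T_k), diagonal=1).bool()
        logits = logits.masked_fill(mask, float("-inf"))
    weights = torch.softmax(logits, dim=-1)   # attention weights, one row per query
    return weights @ V                        # context vectors

# Decoder self-attention: queries, keys, and values all come from the same hidden states.
X = torch.randn(5, 8)
out = scaled_dot_product_attention(X, X, X, causal=True)
```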
\[\begin{align*} {\bf h}^{(1)} &= \begin{bmatrix}1 & 3 & 9\end{bmatrix}^\top \qquad \quad {\bf h}^{(2)} = \begin{bmatrix}0 & 0 & 1\end{bmatrix}^\top \\ {\bf h}^{(3)} &= \begin{bmatrix}5 & -1 & 2\end{bmatrix}^\top \qquad \quad {\bf s}^{(t-1)} = \begin{bmatrix}0 & 1 & 1\end{bmatrix}^\top \\ \alpha_t &= \text{softmax}\left(\frac{1}{\sqrt{3}} \begin{bmatrix}1 & 0 & 5 \\ 3 & 0 & -1 \\ 9 & 1 & 2 \end{bmatrix}^\top \begin{bmatrix}0 \\ 1 \\ 1\end{bmatrix}\right) \\ {\bf c}^{(t)} &= \alpha_{t1} {\bf h}^{(1)} + \alpha_{t2} {\bf h}^{(2)} + \alpha_{t3} {\bf h}^{(3)}\\ \end{align*}\] Q: Which values represent the Q, K, and V?
Transformer models attend both to the encoder annotations and to their own previous hidden layers.
When attending to the encoder annotations, the model computes the key-value pairs by linearly transforming the encoder outputs.
Transformer models also use “self-attention” over their own previous hidden layers. When applying attention to the previous hidden layers, the causal structure is preserved:
Scaled dot-product attention attends to one or a few entries in the input key-value pairs.
But humans can attend to many things simultaneously
Idea: apply scaled dot-product attention multiple times on the linearly transformed inputs:
\[\begin{align*} \mathbf c_i &= \text{attention}\left(QW_i^Q, KW_i^K, VW_i^V\right) \\ \text{MultiHead}(Q, K, V) &= \text{concat}({\bf c_1}, \dots, {\mathbf c_h})W^O \end{align*}\]
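A sketch of multi-head attention following this formula (class and layer names are my own; the per-head projections \(W_i^Q, W_i^K, W_i^V\) are packed into one larger linear layer per role and split afterwards):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch of MultiHead(Q, K, V) = concat(c_1, ..., c_h) W^O with h heads."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)

    def forward(self, Q, K, V):
        # Q: (batch, T_q, d_model); K, V: (batch, T_k, d_model)
        def split_heads(x):  # (batch, T, d_model) -> (batch, heads, T, d_head)
            b, t, _ = x.shape
            return x.view(b, t, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(self.W_Q(Q)), split_heads(self.W_K(K)), split_heads(self.W_V(V))
        weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        heads = weights @ v                                    # (batch, heads, T_q, d_head)
        concat = heads.transpose(1, 2).reshape(Q.shape[0], Q.shape[1], -1)
        return self.W_O(concat)                                # concat(c_1, ..., c_h) W^O

mha = MultiHeadAttention(d_model=64, num_heads=8)              # hypothetical sizes
x = torch.randn(2, 10, 64)
out = mha(x, x, x)                                             # self-attention: (2, 10, 64)
```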
Unlike RNN and CNN encoders, the attention encoder's outputs do not depend on the order of the inputs. Can you see why?
However, the order of the sequence conveys important information for machine translation, language modeling, and other tasks.
Idea: Add positional information of each input token in the sequence into the input embedding vectors.
\[\begin{align*} PE_{\text{pos}, 2i} &= \sin\left(\text{pos}/10000^{2i/d_{emb}}\right) \\ PE_{\text{pos}, 2i+1} &= \cos\left(\text{pos}/10000^{2i/d_{emb}}\right) \end{align*}\]
In the original Transformer, the positional encodings are added to the learnable token embeddings to form the final input embeddings.
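A short NumPy sketch that computes these sinusoidal positional encodings (assuming an even embedding dimension; names are my own):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_emb):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_emb)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_emb))."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_emb // 2)[None, :]         # (1, d_emb // 2)
    angles = pos / (10000 ** (2 * i / d_emb))
    pe = np.zeros((max_len, d_emb))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_emb=32)   # hypothetical sizes
# In the original Transformer, pe[t] is added to the embedding of the token at position t.
```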
The self-attention layer learns that “it” can refer to different entities in different contexts.
See ai.googleblog.com/2017/08/transformer-novel-neural-network.html
During backprop, in the standard encoder-decoder RNN, the maximum path length across time is the number of time steps.
Attention-based RNNs have a constant path length between the encoder inputs and the decoder hidden states. (Learning becomes easier. Why?)
Similar to pre-training computer vision models on ImageNet, we can pre-train a language model for NLP tasks.
The pre-trained model is then fine-tuned on textual entailment, question answering, semantic similarity assessment, and document classification
Title: United Methodists Agree to Historic Split
Subtitle: Those who oppose gay marriage will form their own denomination
Article: After two days of intense debate, the United Methodist Church has agreed to a historic split - one that is expected to end in the creation of a new denomination, one that will be “theologically and socially conservative,” according to The Washington Post. The majority of delegates attending the church’s annual General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new rules that will “discipline” clergy who officiate at same-sex weddings. But those who opposed these measures have a new plan: They say they will form a separate denomination by 2020, calling their church the Christian Methodist denomination.
The Post notes that the denomination, which claims 12.5 million members, was in the early 20th century the “largest Protestant denomination in the U.S.,” but that it has been shrinking in recent decades. The new split will be the second in the church’s history. The first occurred in 1968, when roughly 10 percent of the denomination left to form the Evangelical United Brethren Church.
The Post notes that the proposed split “comes at a critical time for the church, which has been losing members for years,” which has been “pushed toward the brink of a schism over the role of LGBTQ people in the church.” Gay marriage is not the only issue that has divided the church. In 2016, the denomination was split over ordination of transgender clergy, with the North Pacific regional conference voting to ban the
Poor English input: The patient was died.
Good English output: The patient died.
Poor English input: We think that Leslie likes ourselves.
Good English output: We think that Leslie likes us.
Poor English input: Janet broke Bill on the finger.
Good English output: Janet broke Bill’s finger.
Poor English input: Mary arranged for, in St. Louis, John to rent a house cheap.
Good English output: Mary arranged for John to rent a house in St. Louis.
Poor English input: Today I have went to the store to to buys some many bottle of water.
Good English output: Today I went to the store to buy some bottles of water.
Poor English input: I have tried to hit ball with bat, but my swing is has miss.
Good English output: I tried to hit the ball with the bat, but my swing missed.
Many Transformer-based models have been developed since 2017.
Many benchmarks have been developed such as GLUE and SQuAD.
Big players in the LLM space include Google (Brain, DeepMind), Meta (formerly Facebook, FAIR), Microsoft, Amazon, EleutherAI, OpenAI, Cohere, Hugging Face.