This year, we saw a stunning application of machine learning. A very basic choice for the Encoder and the Decoder of the Seq2Seq model is a single LSTM for each of them. One can optionally divide the dot product of Q and K by the square root of the dimensionality of the key vectors, dk. There are N encoder layers in the transformer. By now we have established that Transformers discard the sequential nature of RNNs and process the sequence elements in parallel instead. In the rambling case, we can simply hand it the starting token and have it start producing words (the trained model uses <endoftext> as its start token). The part of the Decoder that I refer to as postprocessing in the Figure above is similar to what one would usually find in the RNN Decoder for an NLP task: a fully connected (FC) layer, which follows the RNN that extracted certain features from the network's inputs, and a softmax layer on top of the FC one that will assign probabilities to each of the tokens in the model's vocabulary being the next element in the output sequence. The Transformer architecture was introduced in the paper whose title is worthy of that of a self-help book: Attention Is All You Need. Again, another self-descriptive heading: the authors literally take the RNN Encoder-Decoder model with Attention, and throw away the RNN. We focus on the Transformers for our analysis as they have been shown effective on various tasks, including machine translation (MT), standard left-to-right language models (LM) and masked language modeling (MLM). Actually, there are two different types of transformers and three different types of underlying data. Self-attention bakes in the model's understanding of relevant and associated words that designate the context of a certain word before processing that word (passing it through a neural network). The Transformer calculates self-attention using 64-dimensional vectors. This is an implementation of the Transformer translation model as described in the Attention Is All You Need paper.
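To make the attention computation a bit more concrete, here is a minimal sketch of scaled dot-product attention in plain Python: the dot product of each query with every key is divided by the square root of dk, passed through a softmax, and used to weight the value vectors. The function names and the list-of-lists representation are my own choices for illustration; a real implementation would use a tensor library and learned projection matrices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K: lists of d_k-dimensional vectors; V: list of value vectors.

    Returns (outputs, weights): one output vector and one row of
    attention weights per query. Illustrative helper, not a library API.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

For a single query that matches the first key more closely than the second, the first value vector receives the larger weight, and the weights always sum to 1.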
The language modeling process is to assign a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words. To start with, each pre-processed (more on that later) element of the input sequence wi gets fed as input to the Encoder network - this is done in parallel, unlike the RNNs. This seems to give transformer models enough representational capacity to handle the tasks that have been thrown at them so far. For the language modeling task, any tokens on the future positions should be masked. New deep learning models are introduced at an increasing rate and sometimes it is hard to keep track of all of the novelties.
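The masking of future positions mentioned above can be sketched as follows. This is an illustrative toy, assuming the attention scores are held in a square list of lists; the helper names are hypothetical. Masked positions are set to negative infinity before the softmax so they end up with a weight of exactly zero.

```python
import math

def softmax(scores):
    """Numerically stable softmax; -inf entries get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def apply_mask(scores, mask):
    """Replace disallowed (future) scores with -inf."""
    neg_inf = float("-inf")
    return [[s if ok else neg_inf for s, ok in zip(row, mask_row)]
            for row, mask_row in zip(scores, mask)]

# Toy scores for a 3-token sequence (made-up numbers).
scores = [[1.0, 2.0, 3.0]] * 3
weights = [softmax(row) for row in apply_mask(scores, causal_mask(3))]
```

In this toy case the first token can only attend to itself, the second to the first two tokens, and only the last token sees the whole sequence.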

Since attention weights apply to all tokens in the sequences, the Transformer model is able to easily capture long-distance dependencies. Those matrices Q, K and V are different for each position of the attention modules in the architecture depending on whether they are in the encoder, decoder or in-between encoder and decoder. The GPT2 paper also shows results of summarization after pre-training the model on language modeling. Example: Consider a training dataset with 100 examples that is divided into 20 batches with 5 examples per batch. The difference between the transformers is subtle and you should always consider what the "norm" data for a field should really be. For example, the "norm" data for a text field is a string, but is a DateTime object for a date field. During training this example uses teacher-forcing (like in the text generation tutorial). Teacher forcing is passing the true output to the next time step regardless of what the model predicts at the current time step. Each input element's Encoder also receives information about the other elements via its Self-Attention sublayers, allowing the relationships between words in the sentence to be captured. The output z_1 of the self-attention layer for "je" is finally obtained by summing up the weighted value vectors. The most well-known language models are smartphone keyboards that suggest the next word based on what you've currently typed. Just imagine, we have more of these Wq, Wk, Wv matrices, which were used to calculate the Q, K and V matrices, which were further used to compute self-attention for all words. A copy of the set of output properties in effect for the next transformation. Q is a matrix that contains the query (vector representation of one word in the sequence), K contains all the keys (vector representations of all the words in the sequence) and V contains the values, which are again the vector representations of all the words in the sequence.
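Teacher forcing, as described above, usually comes down to how the decoder input and the training target are built from the same true output sequence. Below is a minimal sketch under that assumption; the function name and the `<start>` / `<end>` marker strings are placeholders I picked for illustration, not something taken from a specific tutorial.

```python
def teacher_forcing_pairs(target_tokens, start_token="<start>", end_token="<end>"):
    """Build the decoder input and the expected output for one example.

    The decoder input is the true target shifted right by one position,
    so at step t the decoder sees the true token t-1, no matter what it
    predicted at the previous step.
    """
    decoder_input = [start_token] + list(target_tokens)
    decoder_target = list(target_tokens) + [end_token]
    return decoder_input, decoder_target

dec_in, dec_out = teacher_forcing_pairs(["I", "am", "a", "student"])
# dec_in  -> ['<start>', 'I', 'am', 'a', 'student']
# dec_out -> ['I', 'am', 'a', 'student', '<end>']
```

Both sequences have the same length, and position t of the target is always the token that follows position t of the input.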
The Transformer consists of the encoder, decoder and a final linear layer. We also need to remove the SoftMax layer from the output of the Transformer because our output nodes are not probabilities but real values. This means that the encoder gets a window of 24 data points as input and the decoder input is a window of 12 data points where the first one is a 'start-of-sequence' value and the next data points are just the target sequence. Now we can drown out irrelevant words, such as "étudiant", and reduce the attention on "suis", by multiplying each value vector by the softmax score. After a mapping has been built, Transformer saves both the input test data and the resulting output, along with the mapping itself. To have the actual words, the output of the nn.TransformerEncoder model is sent to the final Linear layer, which is followed by a log-Softmax function. Notice that the model now can handle up to 4,000 tokens in a certain segment - a massive upgrade from the 512 in the original transformer. XLM (from Facebook) was released along with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau. Input both the encoder sequence and the new decoder sequence into the model. There are two parts to preprocessing: first, there is the familiar word embedding, a staple in most modern NLP models.
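Going back to the time-series windows mentioned above (24 data points for the encoder, 12 for the decoder), one possible way to slice a series is sketched here. I am assuming that the target is the 12 points right after the encoder window and that the decoder input is the target shifted right by one, with a start-of-sequence value in front; the function name and the choice of 0.0 as that value are hypothetical.

```python
def make_window(series, start=0, enc_len=24, dec_len=12, sos_value=0.0):
    """Slice one training example out of a list of floats.

    Returns (encoder_input, decoder_input, target):
      encoder_input: enc_len points starting at `start`
      target:        the dec_len points that follow the encoder window
      decoder_input: sos_value followed by the first dec_len - 1 targets
    """
    end_enc = start + enc_len
    if end_enc + dec_len > len(series):
        raise ValueError("series too short for this window")
    encoder_input = series[start:end_enc]
    target = series[end_enc:end_enc + dec_len]
    decoder_input = [sos_value] + target[:-1]
    return encoder_input, decoder_input, target

series = [float(i) for i in range(40)]  # made-up data
enc, dec_in, tgt = make_window(series)
```

With this layout the decoder input and the target are both 12 points long, and the decoder never sees the value it is asked to predict at the same position.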

Let us use hi to label the final hidden state of the last Encoder layer for each wi. The Decoder also contains multiple layers - typically, the number is equal to that of the Encoder. This leads to the output vector hE1 (hidden state 1), which serves as the next input for the Encoder RNN, along with the second element in the input sequence "suis". The first layer is 4 times the size of the model (since GPT2 small is 768, this network would have 768*4 = 3072 units). Each layer of GPT-2 has retained its own interpretation of the first token and will use it in processing the second token (we'll get into more detail about this in the following section about self-attention). I have expanded the first one so you can see its self-attention layer is the masked variant. Concatenate the expected word to the decoder input and pass it to the decoder. The model continues iterating until the entire context is generated (1024 tokens) or until an end-of-sequence token is produced. The context vector is the first input to the Decoder RNN, which should then generate the first element of the output sequence "I" (in reality, the last layer of the Decoder is typically a softmax, but for simplicity we will just keep the most likely element at the end of each Decoder step). The analysis and training strings are tokenized, and the resulting data is sharded, shuffled, and saved as TFRecords. The Transformer is a distinct architecture for transforming one sequence into another one with the help of two parts, Encoder and Decoder. There are N decoder layers in the transformer. I created it to introduce more visual language for self-attention, in order to make later transformer models easier to examine and describe (looking at you, TransformerXL and XLNet).
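The generation loop described above (keep producing tokens until the 1024-token context is full or an end-of-sequence token appears) can be sketched in a few lines. The `next_token_fn` argument stands in for a trained model and `toy_next_token` is a made-up stub that replays a canned sentence; neither is part of any real library.

```python
def generate(next_token_fn, start_token, end_token, max_len=1024):
    """Greedy autoregressive loop.

    next_token_fn takes the tokens produced so far and returns the next
    one. Generation stops at end_token or once max_len tokens exist.
    """
    tokens = [start_token]
    while len(tokens) < max_len:
        nxt = next_token_fn(tokens)
        tokens.append(nxt)
        if nxt == end_token:
            break
    return tokens

def toy_next_token(tokens):
    """Hypothetical stand-in for a model: replays a fixed sentence."""
    canned = ["I", "am", "a", "student", "<endoftext>"]
    return canned[min(len(tokens) - 1, len(canned) - 1)]

result = generate(toy_next_token, "<endoftext>", "<endoftext>")
```

Note that the same `<endoftext>` token is used here both to start and to stop generation, so the stop check only looks at newly produced tokens.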
This allows the network to pay attention to relevant parts of the input sequence at different levels of abstraction: the values V of the lower Encoder layers will be closest to the original input tokens, whereas Self-Attention of the deeper layers will contain more abstract structures. In fact, the Encoder Self-Attention, which is bi-directional by design, is an important part of BERT, the pre-trained contextual word embeddings that we shall discuss later on. First, "je" (or, most likely, a word embedding for the token representing "je"), usually accompanied by a constant vector hE0 which could be either learned or fixed, gets fed into the Encoder RNN. This is true for Seq2Seq models and for the Transformer. The trick here is to re-feed our model for each position of the output sequence until we come across an end-of-sentence token. At each location in the sequence, y, the MultiHeadAttention runs all eight attention heads across all other locations in the sequence, returning a new vector of the same length at each location.
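To see why multi-head attention returns a vector of the same length at each location, here is a simplified sketch. It deliberately leaves out the learned projection matrices (Wq, Wk, Wv and the output projection) that a real layer has, and just slices each input vector into heads, runs self-attention inside each head, and concatenates the results. All function names are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """Scaled dot-product attention on lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def split_heads(x, num_heads):
    """Slice every d_model vector into num_heads chunks of equal size."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [[vec[h * d_head:(h + 1) * d_head] for vec in x] for h in range(num_heads)]

def multi_head_self_attention(x, num_heads=8):
    """Simplified: no learned projections, so Q = K = V = the head slice."""
    per_head = [attend(h, h, h) for h in split_heads(x, num_heads)]
    # Concatenate the head outputs back into one vector per position.
    return [[val for head in per_head for val in head[pos]] for pos in range(len(x))]
```

With a model size of 512 and eight heads, each head works on 64-dimensional slices, and concatenating the eight outputs gives back a 512-dimensional vector per position.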

