Introduction to Transformers

Images/Introduction_To_Transformers/Introduction_To_Transformers_1.png

Welcome to this video on transformers in Keras. After watching this video, you'll be able to define the importance of transformers in deep learning, explain the architecture and functionality of transformers, and implement a transformer model using Keras with examples.

Images/Introduction_To_Transformers/Introduction_To_Transformers_2.png

Transformers have transformed the field of natural language processing and are now being applied to a wide range of tasks, including image processing and time series prediction. Transformers were introduced by Vaswani et al. in the landmark paper "Attention Is All You Need". Unlike traditional sequence models such as RNNs, transformers leverage self-attention mechanisms to process input data in parallel, making them highly efficient and powerful. Transformers are now the backbone of state-of-the-art models like BERT, GPT, and many others.

Images/Introduction_To_Transformers/Introducing_To_Tansformers_3.png

The transformer model consists of two main parts: the encoder and the decoder. Both the encoder and the decoder are composed of layers that include self-attention mechanisms and feed-forward neural networks. Self-attention allows the model to weigh the importance of different words in a sentence when encoding a particular word. This is crucial for capturing dependencies that are far apart in the input sequence. The feed-forward neural network layers help transform the input data after the self-attention mechanism. Each layer in the encoder and decoder stacks multiple such sub-layers, enabling the model to learn complex representations.

Images/Introduction_To_Transformers/5.png

Self-attention is the core component of the transformer architecture. It allows each word in the input to attend to every other word, making it possible to capture context and relationships more effectively. In self-attention, each word is represented by three vectors: query, key, and value. The attention score is computed as the dot product of the query and key vectors, which is then used to weigh the value vectors. This process allows the model to focus on different parts of the input sequence when making predictions.

import tensorflow as tf
from tensorflow.keras.layers import Layer

class SelfAttention(Layer):
    def __init__(self, d_model):
        super(SelfAttention, self).__init__()
        self.d_model = d_model
        # Dense projections that produce the query, key, and value vectors
        self.query_dense = tf.keras.layers.Dense(d_model)
        self.key_dense = tf.keras.layers.Dense(d_model)
        self.value_dense = tf.keras.layers.Dense(d_model)

    def call(self, inputs):
        q = self.query_dense(inputs)
        k = self.key_dense(inputs)
        v = self.value_dense(inputs)
        # Scaled dot-product attention: softmax(QK^T / sqrt(d_model))
        attention_weights = tf.nn.softmax(
            tf.matmul(q, k, transpose_b=True) /
            tf.math.sqrt(tf.cast(self.d_model, tf.float32)),
            axis=-1
        )
        # Weight the value vectors by the attention weights
        output = tf.matmul(attention_weights, v)
        return output


# Example usage
inputs = tf.random.uniform((1, 60, 512))  # Batch size of 1, sequence length of 60, and model dimension of 512
self_attention = SelfAttention(d_model=512)
output = self_attention(inputs)

print(output.shape)  # Should print (1, 60, 512)

In this code example, the SelfAttention class defines the self-attention mechanism. The __init__ method initializes the dense layers for the query, key, and value projections. The call method computes the attention weights and applies them to the value vectors to produce the output. The tf.nn.softmax function applies the softmax function to the attention scores to obtain the attention weights.
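To make the role of the softmax concrete, here is a small NumPy sketch (the score values are toy numbers, not taken from the model above) showing how raw attention scores become weights that sum to 1 for each query position:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

# Hypothetical scaled dot-product scores for a 3-token sequence
scores = np.array([[2.0, 0.5, 0.1],
                   [0.5, 2.0, 0.3],
                   [0.1, 0.3, 2.0]])

weights = softmax(scores, axis=-1)
print(weights.sum(axis=-1))  # each row sums to 1
```

Each token ends up attending most strongly to the position with the highest score, while still distributing some weight to the others.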

Images/Introduction_To_Transformers/Introducing_To_Tansformers_5.png

The transformer encoder is composed of multiple layers, each containing a self-attention mechanism followed by a feed-forward neural network. Each layer also includes residual connections and layer normalization to stabilize training. The input to the encoder is first embedded and then passed through positional encoding to add information about the position of words in the sequence. This helps the model understand the order of words.
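The positional encoding mentioned here is not shown in the encoder code below, but the sinusoidal variant from the original paper can be sketched in a few lines of NumPy. This is one illustrative implementation, not the only option; learned position embeddings are also common:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Sinusoidal encoding from "Attention Is All You Need":
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, np.newaxis]          # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]            # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates                       # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd indices: cosine
    return pe

pe = positional_encoding(60, 512)
print(pe.shape)  # (60, 512) -- one vector per position, added to the embeddings
```

The resulting matrix has the same shape as the embedded sequence, so it can simply be added to the embeddings before they enter the first encoder layer.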

class TransformerEncoder(Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(TransformerEncoder, self).__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model
        )
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        attn_output = self.mha(x, x, x, attention_mask=mask)  # Self attention
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual connection and normalization
        ffn_output = self.ffn(out1)  # Feed forward network
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual connection and normalization
        return out2


# Example usage
encoder = TransformerEncoder(d_model=512, num_heads=8, dff=2048)
x = tf.random.uniform((1, 60, 512))
mask = None
output = encoder(x, training=False, mask=mask)

print(output.shape)  # Should print (1, 60, 512)

In the example, the TransformerEncoder class defines the transformer encoder. The __init__ method initializes the multi-head attention, feed-forward network, layer normalization, and dropout layers. The MultiHeadAttention layer applies multi-head attention to the input. The call method applies self-attention, a residual connection and normalization, the feed-forward network, and another residual connection and normalization.

Images/Introduction_To_Transformers/Introducing_To_Tansformers_6.png

The transformer decoder is similar to the encoder, but with an additional cross-attention mechanism to attend to the encoder's output. This allows the decoder to generate sequences based on the context provided by the encoder. The decoder takes the target sequence as input, applies self-attention and cross-attention with the encoder's output, and then passes the result through a feed-forward neural network.
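The decoder's self-attention is typically masked so that each target position can attend only to itself and earlier positions; this is the role of the look_ahead_mask argument in the decoder below. A minimal NumPy sketch of such a mask, using the convention that 1 means "may attend":

```python
import numpy as np

def look_ahead_mask(size):
    # Lower-triangular matrix: row i has 1s for positions 0..i, 0s after
    return np.tril(np.ones((size, size)))

mask = look_ahead_mask(4)
print(mask)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```

This keeps the decoder autoregressive during training: position 2, for example, cannot peek at positions 3 and beyond.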

class TransformerDecoder(Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(TransformerDecoder, self).__init__()
        self.mha1 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model
        )
        self.mha2 = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model
        )
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        attn1 = self.mha1(x, x, x, attention_mask=look_ahead_mask)  # Self attention
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(x + attn1)  # Residual connection and normalization

        attn2 = self.mha2(
            out1,
            enc_output,
            enc_output,
            attention_mask=padding_mask
        )  # Cross attention
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(out1 + attn2)  # Residual connection and normalization

        ffn_output = self.ffn(out2)  # Feed forward network
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(out2 + ffn_output)  # Residual connection and normalization

        return out3

In the example, the TransformerDecoder class defines the transformer decoder. The __init__ method initializes two multi-head attention layers, the feed-forward network, layer normalization, and dropout layers. The first attention layer applies masked self-attention to the target sequence; the second applies cross-attention to the encoder output. The call method applies self-attention, cross-attention, and the feed-forward network, each followed by a residual connection and normalization.

In this video, you learned that the transformer model consists of two main parts: the encoder and the decoder. Both the encoder and decoder are composed of layers that include self-attention mechanisms and feed-forward neural networks. Transformers have become a cornerstone of deep learning, especially in natural language processing, and understanding and implementing them will enable you to build powerful models for a wide range of tasks.