Welcome back! In the first part of our series, we meticulously prepared our dataset. Now comes the exciting part: building our very own Transformer model from the ground up using TensorFlow. We'll assemble all the essential components, including self-attention and cross-attention, and finally, train our model to understand and translate languages.
Think of a Transformer as a team of expert linguists in a room. To translate a sentence, they don't just look at words in isolation. They need to understand the context, the word order, and the intricate relationships between them. Our model will do the same, and we'll build each part of this "expert team" ourselves.
Every great construction project needs a blueprint. In machine learning, this is our configuration class. It holds all the important settings and hyperparameters that define our model's architecture and training process.
class Config:
    # --- Vocabulary and Language Settings ---
    MIX_VOCAB_SIZE = 32000  # Total unique tokens our model will know

    # --- Model Architecture ---
    EMBED_DIM = 256   # The "richness" of meaning for each word
    FF_DIM = 1024     # Workspace size for deeper thinking
    NUM_HEADS = 8     # Number of "experts" focusing on different word relationships
    NUM_ENCODER_LAYERS = 4  # Number of layers in our "understanding" unit
    NUM_DECODER_LAYERS = 4  # Number of layers in our "writing" unit

    # --- Regularization to Prevent Overthinking ---
    DROPOUT_RATE = 0.2  # A technique to ensure the model doesn't "memorize" the data

    # --- Training and Data Handling ---
    MAX_LENGTH = 128     # The maximum sentence length we'll handle
    BATCH_SIZE = 64      # How many sentences to read at once
    EPOCHS = 150         # How many times we'll review the entire dataset
    WARMUP_STEPS = 4000  # Steps for the learning rate warmup strategy (see below)
    IS_TRAINING = True   # Set to False to skip training and just load saved weights

config = Config()
Think of EMBED_DIM as the number of adjectives you can use to describe a word's meaning: a higher dimension allows for a more nuanced understanding. NUM_HEADS is like having multiple specialists; one might focus on grammatical structure, another on subject-verb relationships, helping the model capture different kinds of context.
Transformer models are brilliant at finding relationships between words but have one small quirk: they don't inherently understand the order of words. If we feed them "the cat sat on the mat," they see a collection of words, not a sequence.
That's where Positional Embedding comes in. It's like adding a unique timestamp or a page number to every word. We create a special "positional" vector using a clever combination of sine and cosine waves and add it to each word's embedding. This gives the model a crucial clue about the word's position in the sentence.
First, the imports we'll use throughout this part, followed by the layer itself:

import tensorflow as tf
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import (
    Dense, Dropout, Embedding, Input, Layer,
    LayerNormalization, MultiHeadAttention,
)
from tensorflow.keras.optimizers import Adam

class PositionalEmbedding(Layer):
    def __init__(self, sequence_length, embed_dim, **kwargs):
        super(PositionalEmbedding, self).__init__(**kwargs)
        # Create a matrix of positional encodings
        pos = tf.range(sequence_length, dtype=tf.float32)[:, tf.newaxis]
        i = tf.range(embed_dim, dtype=tf.float32)[tf.newaxis, :]
        angle_rates = 1.0 / tf.pow(10000.0, (2.0 * (i // 2)) / embed_dim)
        angle_rads = pos * angle_rates
        # Apply sin to even indices; cos to odd indices
        sines = tf.sin(angle_rads[:, 0::2])
        cosines = tf.cos(angle_rads[:, 1::2])
        # Interleave them to create the final positional encoding matrix
        pos_encoding = tf.reshape(
            tf.stack([sines, cosines], axis=-1),
            [sequence_length, embed_dim]
        )
        self.pos_encoding = tf.cast(pos_encoding, tf.float32)

    def call(self, inputs):
        # Add the positional encoding to the input word embeddings
        seq_len = tf.shape(inputs)[1]
        return inputs + self.pos_encoding[tf.newaxis, :seq_len, :]
Code Explained: This layer pre-computes a matrix where each row corresponds to a position in the sentence and each column represents a dimension of the positional signal. The call method simply adds this positional information to the word embeddings, enriching them with context about their order.
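To build intuition, here's a quick sanity check you can run once the layer is defined (a throwaway snippet, not part of the model code): the layer adds positional information without changing the tensor's shape.

# Illustrative check: output shape matches input shape
pos_layer = PositionalEmbedding(config.MAX_LENGTH, config.EMBED_DIM)
dummy_embeddings = tf.random.uniform((2, 10, config.EMBED_DIM))  # (batch, seq_len, embed_dim)
print(pos_layer(dummy_embeddings).shape)  # (2, 10, 256)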
The Encoder's job is to read and understand the input sentence. Imagine it as a meticulous reader who rereads a sentence multiple times, each time focusing on different connections between words.
Each TransformerEncoder layer has two main parts: a multi-head self-attention mechanism, which lets every word weigh its relationship to every other word in the sentence, and a feed-forward network, which gives the model extra capacity to process what attention found. We also use layer normalization and dropout to keep the learning process stable and prevent the model from becoming too specialized on the training data.
class TransformerEncoder(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, drop_rate=0.2, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        # The self-attention mechanism
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # The "thinking" network
        self.ffn = Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim),
        ])
        # Normalization and Dropout layers
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(drop_rate)
        self.dropout2 = Dropout(drop_rate)

    def call(self, inputs, training=False):
        # First, apply self-attention (query, key, and value are all the same inputs)
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        # Add the original input and normalize (residual connection)
        out1 = self.layernorm1(inputs + attn_output)
        # Then, pass it through the feed-forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        # Add and normalize again
        out2 = self.layernorm2(out1 + ffn_output)
        return out2
Code Explained: The call function defines the flow of information. The input first goes through self-attention, is regularized with dropout, and then combined with the original input (this is called a "residual connection"). This result is then passed to the feed-forward network for another round of processing and normalization.
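If you'd like to see a single encoder layer in action before assembling the full model, here's a minimal sketch on dummy data (illustrative only): like the positional embedding, it preserves the input shape.

# Minimal sketch: one encoder layer maps (batch, seq_len, embed_dim) to the same shape
encoder_layer = TransformerEncoder(config.EMBED_DIM, config.NUM_HEADS, config.FF_DIM)
dummy = tf.random.uniform((2, 10, config.EMBED_DIM))
print(encoder_layer(dummy).shape)  # (2, 10, 256)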
The Decoder's job is to generate the translated sentence word by word. It's like a writer who consults both the original sentence (via the Encoder's output) and everything they've already written to decide on the next best word.
The Decoder is similar to the Encoder but adds an extra attention layer: cross-attention, which lets it consult the Encoder's reading of the source sentence while it writes the translation.
class TransformerDecoder(Layer):
    def __init__(self, embed_dim, ff_dim, num_heads, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        # 1. Self-attention on the target sentence (masked)
        self.attention_1 = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # 2. Cross-attention with the encoder's output
        self.attention_2 = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # The "thinking" network
        self.dense_proj = Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim),
        ])
        self.layernorm_1 = LayerNormalization()
        self.layernorm_2 = LayerNormalization()
        self.layernorm_3 = LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        # Create a mask so each position can only attend to earlier positions
        causal_mask = self.get_causal_attention_mask(inputs)
        # Self-attention step (masked, so the decoder can't peek at future words)
        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)
        # Cross-attention step (linking encoder and decoder)
        attention_output_2 = self.attention_2(
            query=out_1, value=encoder_outputs, key=encoder_outputs
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)
        # Feed-forward step
        proj_output = self.dense_proj(out_2)
        out_3 = self.layernorm_3(out_2 + proj_output)
        return out_3

    def get_causal_attention_mask(self, inputs):
        # A helper function to create the "no-peeking" mask
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        # mask[i, j] = 1 where j <= i (a lower-triangular matrix)
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, sequence_length, sequence_length))
        # Tile the mask across the batch dimension
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)],
            axis=0,
        )
        return tf.tile(mask, mult)
Code Explained: The Decoder's call method orchestrates three main steps: masked self-attention on its own output, cross-attention with the encoder's output, and a final feed-forward processing stage.
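To make the "no-peeking" mask concrete, here's a small illustrative snippet that prints it for a length-4 sequence; row i allows attention only to positions 0 through i.

# Illustrative: the causal mask for a length-4 sequence
decoder_layer = TransformerDecoder(config.EMBED_DIM, config.FF_DIM, config.NUM_HEADS)
dummy = tf.zeros((1, 4, config.EMBED_DIM))
print(decoder_layer.get_causal_attention_mask(dummy)[0])
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]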
Instead of using a fixed learning rate (like a student studying at the same pace all semester), we use a scheduler that changes the rate over time. Our CustomSchedule implements the strategy from the original "Attention Is All You Need" paper: the learning rate ramps up linearly for the first WARMUP_STEPS training steps, then decays in proportion to the inverse square root of the step number.
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # Implements the learning rate formula:
        # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
        step = tf.cast(step, tf.float32)  # the optimizer passes an integer step count
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
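If you're curious what this schedule looks like in practice, here's an optional snippet that samples it at a few steps. For d_model=256 and 4,000 warmup steps, the rate climbs to a peak of roughly 1e-3 at step 4,000 and then slowly decays.

# Optional: sample the schedule to see the warmup-then-decay shape
lr_schedule = CustomSchedule(config.EMBED_DIM, config.WARMUP_STEPS)
for step in [100, 1000, 4000, 20000]:
    print(step, float(lr_schedule(tf.constant(step, dtype=tf.float32))))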
Now, we bring all our components together. We'll stack multiple Encoder layers to form the full Encoder block and multiple Decoder layers for the Decoder block. This stacking allows the model to learn more complex patterns in the language.
def get_transformer(config):
    # --- Input Layers ---
    encoder_inputs = Input(shape=(None,), dtype="int64", name="encoder_inputs")
    decoder_inputs = Input(shape=(None,), dtype="int64", name="decoder_inputs")

    # --- ENCODER STACK ---
    # Start with embedding and positional encoding
    encoder_embedding_layer = Embedding(config.MIX_VOCAB_SIZE, config.EMBED_DIM)
    encoder_pos_embedding_layer = PositionalEmbedding(config.MAX_LENGTH, config.EMBED_DIM)
    x = encoder_embedding_layer(encoder_inputs)
    x = encoder_pos_embedding_layer(x)
    # Pass the input through all encoder layers
    for _ in range(config.NUM_ENCODER_LAYERS):
        x = TransformerEncoder(
            embed_dim=config.EMBED_DIM,
            num_heads=config.NUM_HEADS,
            ff_dim=config.FF_DIM,
            drop_rate=config.DROPOUT_RATE,
        )(x)
    encoder_outputs = x

    # --- DECODER STACK ---
    # Start with embedding and positional encoding for the decoder
    decoder_embedding_layer = Embedding(config.MIX_VOCAB_SIZE, config.EMBED_DIM)
    decoder_pos_embedding_layer = PositionalEmbedding(config.MAX_LENGTH, config.EMBED_DIM)
    x = decoder_embedding_layer(decoder_inputs)
    x = decoder_pos_embedding_layer(x)
    # Pass through all decoder layers, connecting them to the encoder's output
    for _ in range(config.NUM_DECODER_LAYERS):
        x = TransformerDecoder(
            embed_dim=config.EMBED_DIM,
            ff_dim=config.FF_DIM,
            num_heads=config.NUM_HEADS,
        )(inputs=x, encoder_outputs=encoder_outputs)

    # --- Final Prediction Head ---
    decoder_outputs = Dense(config.MIX_VOCAB_SIZE, activation="softmax")(x)

    # --- Build and Compile the Model ---
    transformer = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    transformer.compile(
        optimizer=Adam(
            CustomSchedule(config.EMBED_DIM, config.WARMUP_STEPS),
            beta_1=0.9, beta_2=0.98, epsilon=1e-9,
        ),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return transformer

transformer = get_transformer(config)
transformer.summary()
Code Explained: This function defines the complete workflow. The encoder_inputs flow through the stack of TransformerEncoder layers. The final encoder_outputs are then fed into every TransformerDecoder layer, along with the decoder_inputs. Finally, a Dense layer with a softmax activation predicts, at each position, a probability distribution over all 32,000 tokens in the vocabulary. Because the targets are integer token IDs rather than one-hot vectors, sparse_categorical_crossentropy is the appropriate loss function.
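If sparse categorical crossentropy is unfamiliar, here's a toy illustration (made-up numbers over a tiny 4-token vocabulary) of how it scores integer targets against predicted distributions:

# Toy illustration: the loss takes integer token IDs as targets
# and a probability distribution per position as predictions
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
y_true = tf.constant([[2]])                     # the correct next-token ID
y_pred = tf.constant([[[0.1, 0.2, 0.6, 0.1]]])  # predicted probabilities over 4 tokens
print(float(loss_fn(y_true, y_pred)))           # -log(0.6) ≈ 0.511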
Training is where our model learns from the dataset. We'll use two helpful Keras callbacks: ModelCheckpoint, which saves the best weights seen so far, and EarlyStopping, which halts training once the validation loss stops improving.
model_name = "transformer_eng_to_esp.weights.h5"

if config.IS_TRAINING:
    # Callback to save the best model weights seen so far
    checkpoints = tf.keras.callbacks.ModelCheckpoint(
        model_name,
        monitor="val_accuracy",
        save_best_only=True,
        save_weights_only=True,
    )
    # Callback to stop early when validation loss plateaus
    early_stop = tf.keras.callbacks.EarlyStopping(patience=15, monitor="val_loss")
    # Start the training!
    history = transformer.fit(
        train_ds,
        epochs=config.EPOCHS,
        validation_data=valid_ds,
        callbacks=[checkpoints, early_stop],
    )
    # Load the best weights saved during training
    transformer.load_weights(model_name)
After training, it's time to check our model's "report card." By plotting the training and validation loss and accuracy, we can understand how well it learned.
When we train a model to translate from English to Spanish, it might seem natural to check its "accuracy": how often its predictions exactly match the reference translations. But for translation, accuracy can be very misleading.
Here's why: in language, there's rarely just one correct way to say something. For example, the English sentence "I'm going home." could be translated as either "Me voy a casa." or "Voy a casa." Both are perfectly valid translations. But if our model produces the second one while the reference is the first, accuracy would count it as wrong even though the meaning is exactly the same.
That's why, in translation tasks, we don't rely on accuracy. Instead, we use a more language-aware metric: the BLEU score.
BLEU stands for Bilingual Evaluation Understudy, and it’s a standard way to evaluate machine translations. Instead of checking for exact word matches, BLEU measures how similar the model’s output is to the reference translation by looking at word patterns (called n-grams).
In simple terms, BLEU compares word sequences: it looks for overlapping sequences of one word (unigrams), two words (bigrams), and so on between your translation and the reference. For example, if the reference is "El gato está sobre la alfombra" and the model predicts "El gato está en la alfombra," BLEU will notice that most of the word sequences match, even though one word ("sobre" vs. "en") is different.
Precision and length check: BLEU rewards the model for generating correct word patterns, and a brevity penalty ensures the translation isn't too short (to prevent "cheating" by outputting only a few safe words).
Final score: The BLEU score is a number between 0 and 1 (or sometimes shown as 0–100). A higher score means the translation is closer in meaning and structure to the human reference. A score of 1 (or 100) would mean a perfect match.
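As a concrete illustration, here's a minimal sketch of a sentence-level BLEU computation using NLTK (this assumes nltk is installed and is separate from our training pipeline; real evaluations usually average over a whole test corpus):

# Minimal sketch: sentence-level BLEU with NLTK (pip install nltk)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "el gato está sobre la alfombra".split()
candidate = "el gato está en la alfombra".split()

# Smoothing avoids a zero score when a higher-order n-gram has no overlap
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # high, since most n-grams match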
import matplotlib.pyplot as plt
# Plotting the training and validation loss
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
# Plotting the training and validation accuracy
plt.figure(figsize=(10, 6))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()
And there you have it! You've successfully built and trained a complete Transformer model from scratch. In the next part, we'll put our model to the test and see how well it can translate new sentences. Stay tuned!