
From English to Español: Model Training

Welcome back! In the first part of our series, we meticulously prepared our dataset. Now comes the exciting part: building our very own Transformer model from the ground up using TensorFlow. We'll assemble all the essential components, including self-attention and cross-attention, and finally, train our model to understand and translate languages.

Think of a Transformer as a team of expert linguists in a room. To translate a sentence, they don't just look at words in isolation. They need to understand the context, the word order, and the intricate relationships between them. Our model will do the same, and we'll build each part of this "expert team" ourselves.

The Blueprint: Our Configuration (Config Class)

Every great construction project needs a blueprint. In machine learning, this is our configuration class. It holds all the important settings and hyperparameters that define our model's architecture and training process.

class Config:
    # --- Vocabulary and Language Settings ---
    MIX_VOCAB_SIZE = 32000  # Total unique words our model will know

    # --- Model Architecture ---
    EMBED_DIM = 256         # The "richness" of meaning for each word
    FF_DIM = 1024           # Workspace size for deeper thinking
    NUM_HEADS = 8           # Number of "experts" focusing on different word relationships
    NUM_ENCODER_LAYERS = 4  # Number of layers in our "understanding" unit
    NUM_DECODER_LAYERS = 4  # Number of layers in our "writing" unit

    # --- Regularization to Prevent Overthinking ---
    DROPOUT_RATE = 0.2      # A technique to ensure the model doesn't "memorize" the data

    # --- Training and Data Handling ---
    MAX_LENGTH = 128        # The maximum sentence length we'll handle
    BATCH_SIZE = 64         # How many sentences to read at once
    EPOCHS = 150            # How many times we'll review the entire dataset
    WARMUP_STEPS = 4000     # A special learning rate strategy
    IS_TRAINING = True      # Whether to run training (used later when we fit the model)

config = Config()

Think of EMBED_DIM as the number of adjectives you can use to describe a word's meaning; a higher dimension allows for a more nuanced understanding. NUM_HEADS is like having multiple specialists: one might focus on grammatical structure, another on subject-verb relationships, helping the model capture different kinds of context.

Stop! Words Have an Order: The Positional Embedding Layer

Transformer models are brilliant at finding relationships between words but have one small quirk: they don't inherently understand the order of words. If we feed them "the cat sat on the mat," they see a collection of words, not a sequence.

That's where Positional Embedding comes in. It's like adding a unique timestamp or a page number to every word. We create a special "positional" vector using a clever combination of sine and cosine waves and add it to each word's embedding. This gives the model a crucial clue about the word's position in the sentence.
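
For reference, the exact formulas from the original "Attention Is All You Need" paper are shown below, where pos is the word's position, i indexes the embedding dimensions in pairs, and d_model is our EMBED_DIM:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))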

# Imports used throughout the rest of this post
import tensorflow as tf
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import (
    Dense, Dropout, Embedding, Input, Layer, LayerNormalization, MultiHeadAttention
)
from tensorflow.keras.optimizers import Adam

class PositionalEmbedding(Layer):
    def __init__(self, sequence_length, embed_dim, **kwargs):
        super(PositionalEmbedding, self).__init__(**kwargs)
        # Create a matrix of positional encodings
        pos = tf.range(sequence_length, dtype=tf.float32)[:, tf.newaxis]
        i = tf.range(embed_dim, dtype=tf.float32)[tf.newaxis, :]
        angle_rates = 1.0 / tf.pow(10000.0, (2.0 * (i // 2)) / embed_dim)
        angle_rads = pos * angle_rates

        # Apply sin to even indices; cos to odd indices
        sines = tf.sin(angle_rads[:, 0::2])
        cosines = tf.cos(angle_rads[:, 1::2])

        # Combine them to create the final positional encoding matrix
        pos_encoding = tf.reshape(
            tf.stack([sines, cosines], axis=-1),
            [sequence_length, embed_dim]
        )
        self.pos_encoding = tf.cast(pos_encoding, tf.float32)

    def call(self, inputs):
        # Add the positional encoding to the input word embeddings
        seq_len = tf.shape(inputs)[1]
        return inputs + self.pos_encoding[tf.newaxis, :seq_len, :]

Code Explained: This layer pre-computes a matrix where each row corresponds to a position in the sentence and each column represents a dimension of the positional signal. The call method simply adds this positional information to the word embeddings, enriching them with context about their order.
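
As a quick sanity check (a minimal sketch assuming the Config and PositionalEmbedding definitions above), you can apply the layer to a batch of dummy word embeddings and confirm that the shape is unchanged; only the values shift:

example_embeddings = tf.random.uniform((2, 10, config.EMBED_DIM))  # (batch, seq_len, embed_dim)
pos_layer = PositionalEmbedding(config.MAX_LENGTH, config.EMBED_DIM)
with_positions = pos_layer(example_embeddings)
print(with_positions.shape)  # Expected: (2, 10, 256)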

The Understanding Unit: The Transformer Encoder

The Encoder's job is to read and understand the input sentence. Imagine it as a meticulous reader who rereads a sentence multiple times, each time focusing on different connections between words.

Each TransformerEncoder layer has two main parts:

  1. Multi-Head Self-Attention: This is the core of the Transformer. The model weighs the importance of all other words in the sentence for the current word it's looking at. For example, in "The dog chased the cat," when processing "chased," the attention mechanism would likely pay more attention to "dog" (the chaser) and "cat" (the one being chased).
  2. Feed-Forward Network: After gathering context from the attention step, this network does some deeper "thinking" on each word individually to process the information it has gathered.

We also use Normalization and Dropout to keep the learning process stable and prevent the model from becoming too specialized on the training data.

class TransformerEncoder(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, drop_rate=0.2, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        # The self-attention mechanism
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # The "thinking" network
        self.ffn = Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim),
        ])
        # Normalization and Dropout layers
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(drop_rate)
        self.dropout2 = Dropout(drop_rate)

    def call(self, inputs, training=False):
        # First, apply attention to the inputs
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        # Add the original input and normalize (residual connection)
        out1 = self.layernorm1(inputs + attn_output)

        # Then, pass it through the feed-forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        # Add and normalize again
        out2 = self.layernorm2(out1 + ffn_output)

        return out2

Code Explained: The call function defines the flow of information. The input first goes through self-attention, is regularized with dropout, and then combined with the original input (this is called a "residual connection"). This result is then passed to the feed-forward network for another round of processing and normalization.
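
If you're curious what the individual "experts" are doing, Keras's MultiHeadAttention layer can also return its attention scores. This is a small standalone sketch (not part of the model above) just to show the shape of those scores:

# One score matrix per head: (batch, num_heads, query_len, key_len)
mha = MultiHeadAttention(num_heads=config.NUM_HEADS, key_dim=config.EMBED_DIM)
dummy = tf.random.uniform((1, 5, config.EMBED_DIM))
output, scores = mha(dummy, dummy, return_attention_scores=True)
print(scores.shape)  # Expected: (1, 8, 5, 5)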

The Writing Unit: The Transformer Decoder

The Decoder's job is to generate the translated sentence word by word. It's like a writer who uses both the original sentence (as understood by the Encoder) and what they've already written so far to decide on the next best word.

The Decoder is similar to the Encoder but has an extra attention layer:

  1. Masked Multi-Head Self-Attention: The Decoder looks at the words it has already generated to inform its next choice. The "masking" is crucial here because it prevents the model from "cheating" by looking at future words in the sentence it is trying to predict.
  2. Encoder-Decoder Cross-Attention: This is where the magic happens! The Decoder pays attention to the output of the Encoder. It looks at the understood source sentence to decide which parts are most relevant for generating the next word in the target language.
  3. Feed-Forward Network: Just like in the Encoder, this allows for deeper processing of the information gathered from both attention steps.

class TransformerDecoder(Layer):
    def __init__(self, embed_dim, ff_dim, num_heads, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        # 1. Self-attention on the target sentence (masked)
        self.attention_1 = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # 2. Cross-attention with the encoder's output
        self.attention_2 = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # The "thinking" network
        self.dense_proj = Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim),
        ])
        self.layernorm_1 = LayerNormalization()
        self.layernorm_2 = LayerNormalization()
        self.layernorm_3 = LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        # Create a mask to hide future words
        causal_mask = self.get_causal_attention_mask(inputs)
        
        # Self-attention step
        attention_output_1 = self.attention_1(query=inputs, value=inputs, key=inputs, attention_mask=causal_mask)
        out_1 = self.layernorm_1(inputs + attention_output_1)

        # Cross-attention step (linking encoder and decoder)
        attention_output_2 = self.attention_2(query=out_1, value=encoder_outputs, key=encoder_outputs)
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        # Feed-forward step
        proj_output = self.dense_proj(out_2)
        out_3 = self.layernorm_3(out_2 + proj_output)
        return out_3

    def get_causal_attention_mask(self, inputs):
        # A helper function to create the "no-peeking" mask
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat([tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

Code Explained: The Decoder's call method orchestrates three main steps: masked self-attention on its own output, cross-attention with the encoder's output, and a final feed-forward processing stage.
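
To make the "no-peeking" mask concrete, here is a small sketch (assuming the class above) that prints the causal mask for a sequence of length 5. Row i has ones only up to column i, so each position can attend to itself and earlier positions, never to the future:

decoder_layer = TransformerDecoder(
    embed_dim=config.EMBED_DIM, ff_dim=config.FF_DIM, num_heads=config.NUM_HEADS
)
dummy_targets = tf.random.uniform((1, 5, config.EMBED_DIM))
print(decoder_layer.get_causal_attention_mask(dummy_targets)[0].numpy())
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]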

A Smart Study Plan: Custom Learning Rate Scheduler

Instead of using a fixed learning rate (like a student studying at the same pace all semester), we use a scheduler that changes the rate over time. Our CustomSchedule implements the strategy from the original "Attention Is All You Need" paper:

  1. Warm-up: Start with a small learning rate and gradually increase it. This is like easing into a workout, preventing the model from making drastic, incorrect updates at the beginning.
  2. Cool-down: After the warm-up period, gradually decrease the learning rate. This allows the model to make smaller, finer adjustments as it gets closer to the best solution.

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # Implements the learning rate formula from "Attention Is All You Need"
        step = tf.cast(step, tf.float32)  # Keras passes an integer step; rsqrt needs a float
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
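
To see the warm-up and decay in action, you can evaluate the schedule at a few steps (a quick sketch; with EMBED_DIM=256 and WARMUP_STEPS=4000 the rate peaks around 1e-3 at step 4000, then slowly decays):

lr_schedule = CustomSchedule(config.EMBED_DIM, warmup_steps=config.WARMUP_STEPS)
for step in [1.0, 1000.0, 4000.0, 20000.0, 100000.0]:
    print(int(step), float(lr_schedule(tf.constant(step))))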

Assembling the Full Model

Now, we bring all our components together. We'll stack multiple Encoder layers to form the full Encoder block and multiple Decoder layers for the Decoder block. This stacking allows the model to learn more complex patterns in the language.

def get_transformer(config):
    # --- Input Layers ---
    encoder_inputs = Input(shape=(None,), dtype="int64", name="encoder_inputs")
    decoder_inputs = Input(shape=(None,), dtype="int64", name="decoder_inputs")

    # --- ENCODER STACK ---
    # Start with embedding and positional encoding
    encoder_embedding_layer = Embedding(config.MIX_VOCAB_SIZE, config.EMBED_DIM)
    encoder_pos_embedding_layer = PositionalEmbedding(config.MAX_LENGTH, config.EMBED_DIM)
    x = encoder_embedding_layer(encoder_inputs)
    x = encoder_pos_embedding_layer(x)
    
    # Pass the input through all encoder layers
    for i in range(config.NUM_ENCODER_LAYERS):
        x = TransformerEncoder(
            embed_dim=config.EMBED_DIM, num_heads=config.NUM_HEADS, ff_dim=config.FF_DIM
        )(x)
    encoder_outputs = x

    # --- DECODER STACK ---
    # Start with embedding and positional encoding for the decoder
    decoder_embedding_layer = Embedding(config.MIX_VOCAB_SIZE, config.EMBED_DIM)
    decoder_pos_embedding_layer = PositionalEmbedding(config.MAX_LENGTH, config.EMBED_DIM)
    x = decoder_embedding_layer(decoder_inputs)
    x = decoder_pos_embedding_layer(x)

    # Pass through all decoder layers, connecting them to the encoder's output
    for i in range(config.NUM_DECODER_LAYERS):
        x = TransformerDecoder(
            embed_dim=config.EMBED_DIM, ff_dim=config.FF_DIM, num_heads=config.NUM_HEADS
        )(inputs=x, encoder_outputs=encoder_outputs)

    # --- Final Prediction Head ---
    decoder_outputs = Dense(config.MIX_VOCAB_SIZE, activation="softmax")(x)

    # --- Build and Compile the Model ---
    transformer = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    transformer.compile(
        optimizer=Adam(CustomSchedule(config.EMBED_DIM), beta_1=0.9, beta_2=0.98, epsilon=1e-9),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )
    return transformer

transformer = get_transformer(config)
transformer.summary()

Code Explained: This function defines the complete workflow. The encoder_inputs flow through the stack of TransformerEncoder layers. The final encoder_outputs are then fed into every TransformerDecoder layer, along with the decoder_inputs. Finally, a Dense layer with a softmax activation predicts a probability distribution over the 32,000-token vocabulary for the next word at each position, which is why we use sparse_categorical_crossentropy as the loss.
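
Before kicking off training, a quick smoke test (a minimal sketch with made-up token IDs) confirms that the model maps a pair of token-ID sequences to a per-position probability distribution over the vocabulary:

dummy_source = tf.random.uniform((2, 12), maxval=config.MIX_VOCAB_SIZE, dtype=tf.int64)
dummy_target = tf.random.uniform((2, 12), maxval=config.MIX_VOCAB_SIZE, dtype=tf.int64)
predictions = transformer([dummy_source, dummy_target])
print(predictions.shape)  # Expected: (2, 12, 32000)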

The Training Process: Let the Learning Begin!

Training is where our model learns from the dataset. We'll use a couple of helpful Keras callbacks:

  • ModelCheckpoint: This is our diligent assistant who saves the model's weights every time it achieves a new "high score" on the validation accuracy. This ensures we always keep the best version of our model.
  • EarlyStopping: This prevents the model from "over-studying." If the model's performance on the validation set stops improving for a set number of epochs, training will stop automatically to prevent overfitting.

model_name = "transformer_eng_to_esp.weights.h5"

if config.IS_TRAINING:
    # Callback to save the best model
    checkpoints = tf.keras.callbacks.ModelCheckpoint(
        model_name,
        monitor="val_accuracy",
        save_best_only=True,
        save_weights_only=True,
    )
    # Callback for early stopping
    early_stop = tf.keras.callbacks.EarlyStopping(patience=15, monitor="val_loss")

    # Start the training!
    history = transformer.fit(
        train_ds,
        epochs=config.EPOCHS,
        validation_data=valid_ds,
        callbacks=[checkpoints, early_stop],
    )
    # Load the best weights saved during training
    transformer.load_weights(model_name)

Grading Our Model: Plotting the Results

After training, it's time to check our model's "report card." By plotting the training and validation loss and accuracy, we can understand how well it learned.

  • Loss Plot: We want to see both training and validation loss decrease and stabilize. If the validation loss starts to increase while the training loss continues to decrease, it's a sign of overfitting.
  • Accuracy Plot: We want to see both accuracies increase and converge. A large gap between the two might also indicate overfitting.

The BLEU Score

When we train a model to translate from English to Spanish, it might seem natural to check how “accurate” it is. That is, how often its predictions exactly match the reference translations. But for translation, accuracy can be very misleading.

Here’s why: in language, there’s rarely just one correct way to say something. For example, the English sentence “I’m going home.” could be translated as either:

  • “Voy a casa.”
  • or “Me voy a casa.”

Both are perfectly valid translations. But if our model produces the second one while the reference is the first, accuracy would count it as wrong even though the meaning is exactly the same.

That’s why, in translation tasks, we don’t rely on accuracy. Instead, we use a more language-aware metric: the BLEU score.

BLEU stands for Bilingual Evaluation Understudy, and it’s a standard way to evaluate machine translations. Instead of checking for exact word matches, BLEU measures how similar the model’s output is to the reference translation by looking at word patterns (called n-grams).

In simple terms, BLEU compares word sequences: it looks for overlapping sequences of one word (unigrams), two words (bigrams), and so on between your translation and the reference. For example, if the reference is “El gato está sobre la alfombra” and the model predicts “El gato está en la alfombra,” BLEU will notice that most of the word sequences match, even though one word (“sobre” vs. “en”) is different.

Precision and length check: BLEU rewards the model for generating correct word patterns and also applies a brevity penalty so the translation isn’t too short (preventing “cheating” by skipping words).

Final score: The BLEU score is a number between 0 and 1 (or sometimes shown as 0–100). A higher score means the translation is closer in meaning and structure to the human reference. A score of 1 (or 100) would mean a perfect match.
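
Below is a small illustration of how a sentence-level BLEU score can be computed with NLTK (one common option; libraries like sacrebleu work too). It is a standalone sketch using the example sentences from above, not part of the training pipeline:

# Sentence-level BLEU with NLTK (pip install nltk).
# Smoothing avoids zero scores on short sentences where higher-order n-grams have no matches.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["el", "gato", "está", "sobre", "la", "alfombra"]
candidate = ["el", "gato", "está", "en", "la", "alfombra"]

score = sentence_bleu(
    [reference],                                     # one or more reference translations
    candidate,                                       # the model's output, tokenized
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")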

import matplotlib.pyplot as plt

# Plotting the training and validation loss
plt.figure(figsize=(10, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()

# Plotting the training and validation accuracy
plt.figure(figsize=(10, 6))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy Over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

And there you have it! You've successfully built and trained a complete Transformer model from scratch. In the next part, we'll put our model to the test and see how well it can translate new sentences. Stay tuned!