Deepanshu Joshi | Portfolio

Hi everyone! Today, we're going on an exciting journey into the world of Artificial Intelligence. We'll be rolling up our sleeves and building our very own translator, a model that can take an English sentence and magically turn it into Spanish. Our secret weapon? The revolutionary "Attention Is All You Need" paper and its star, the Transformer model.

The Transformer: A Game-Changer in the World of NLP

Picture this: it's 2017, and the world of Natural Language Processing (NLP) is about to have its "aha!" moment. A paper titled Attention Is All You Need is published, and it introduces a groundbreaking model called the Transformer. Before the Transformer, models processed sentences word by word in a sequence, like reading a book one word at a time. This could be slow and sometimes the model would forget the beginning of a long sentence by the time it reached the end.

The Transformer changed the game by being able to look at all the words in a sentence at once, weighing their importance. This new approach, built on a clever mechanism called "attention," paved the way for the powerful language models we know and love today, like BERT and GPT.

At its heart, the Transformer has two main parts:

The Encoder: Think of the encoder as a super-smart summarizer. It reads the input sentence (in our case, English) and condenses all its meaning and context into a rich, numerical representation.
The Decoder: The decoder is like a creative writer. It takes the summary from the encoder and, word by word, writes out the translated sentence in the target language (Spanish).

But what makes the Transformer so special is its attention mechanism. Imagine you're translating the sentence, "The cat sat on the mat." When you're figuring out the Spanish word for "sat," the attention mechanism helps the model pay extra attention to "cat" because who is doing the sitting is very important. This is done through:

Self-attention: This allows the model to understand the relationships between words in the same sentence. For example, in "The bank of the river," self-attention helps the model understand that "bank" is related to "river" and not a financial institution.
Multi-head attention: This is like having several people read the same sentence, each focusing on different things. One person might focus on the grammar, another on the meaning, and a third on the nuances. The model then combines all these perspectives for a much richer understanding.
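To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the operation each attention head performs. It is only an illustration under simplifying assumptions: the random vectors stand in for word embeddings, and the learned projection matrices that a real Transformer applies to produce queries, keys and values are left out. The function name is our own, not part of any library.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for queries q, keys k and values v.

    q, k: arrays of shape (seq_len, d_k); v: array of shape (seq_len, d_v).
    """
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.T / np.sqrt(d_k)
    # Softmax over the key axis (subtract the row max for numerical stability)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors
    return weights @ v, weights

# Toy input: 4 "words", each a random 8-dimensional vector (made-up data)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# In self-attention, queries, keys and values all come from the same sentence
output, weights = scaled_dot_product_attention(x, x, x)
print(weights.round(2))  # row i: how much word i attends to every word
```

Each row of the weight matrix sums to 1, so every output vector is a blend of all the words in the sentence. Multi-head attention runs several of these in parallel on different learned projections and concatenates the results.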

What's a Sequence-to-Sequence (Seq2Seq) Model Anyway?

Before we dive into the code, let's quickly understand the big picture: the Sequence-to-Sequence (Seq2Seq) model. As the name suggests, it's a model designed to convert one sequence into another. Think of it like a magical machine that takes in a string of beads of one color (our English sentence) and outputs a string of beads of another color (the Spanish translation).

Encoder: This part of the machine scans the entire input sequence and creates a "thought vector" or "context vector" – a numerical summary that captures the essence of the input.
Decoder: The decoder then takes this "thought vector" and begins generating the output sequence, one element at a time, until the full translation is complete.
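The encode-then-generate loop can be sketched in a few lines of plain Python. Everything here is a hypothetical stand-in: the lookup table plays the role of a trained decoder, and the function names (encode, decode_step, greedy_translate) are invented for this illustration. The point is only the control flow: start from a start marker, generate one token at a time, and stop at the end marker.

```python
SOS, EOS = "[SOS]", "[EOS]"

# Hypothetical stand-in for a trained decoder: maps the previous token to the next one.
# A real decoder conditions on the encoder output and on all previously generated tokens.
TOY_NEXT_TOKEN = {
    SOS: "el",
    "el": "gato",
    "gato": "duerme",
    "duerme": EOS,
}

def encode(source_tokens):
    # Stand-in for the encoder: a real one returns vectors, not the raw tokens
    return tuple(source_tokens)

def decode_step(context, generated):
    # Stand-in for one decoder step; `context` is ignored by this toy lookup
    return TOY_NEXT_TOKEN.get(generated[-1], EOS)

def greedy_translate(source_tokens, max_len=10):
    context = encode(source_tokens)
    generated = [SOS]
    # Generate one token at a time until [EOS] or the length limit
    while len(generated) < max_len:
        token = decode_step(context, generated)
        if token == EOS:
            break
        generated.append(token)
    return generated[1:]  # drop the start marker

print(greedy_translate(["the", "cat", "sleeps"]))
```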

We train this model by showing it thousands of examples of English sentences and their corresponding Spanish translations, and over time, it learns the patterns and rules of both languages to become a proficient translator.

Let's Build Our Translator: The Code

Setting Up Our Workshop: Importing Libraries

First things first, we need to gather our tools. We'll be using several Python libraries, including:

numpy and pandas for handling our data.
matplotlib for visualizing our data.
tensorflow and keras for building and training our neural network.
sentencepiece for a smart way of tokenizing our text.

# https://www.statmt.org/europarl/

# Import necessary libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Embedding, Input, Layer, LayerNormalization, MultiHeadAttention
from tensorflow.keras.preprocessing.sequence import pad_sequences 
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam
import sentencepiece as spm
from pathlib import Path

tf.random.set_seed(34)

Getting Our Data Ready

Download Dataset

The dataset is a CSV file of English and Spanish words and sentences. We'll load it and do some basic cleaning: removing duplicate entries and converting all text to lowercase. Then we'll take a quick look at the length of our sentences, which helps us understand the data we're working with.

# Load the dataset
dataset = pd.read_csv("./data.csv")

# Dropping duplicates
dataset.drop_duplicates(inplace=True)

print("Dataset shape:", dataset.shape)
dataset.head()

dataset["english"] = dataset["english"].str.lower()
dataset["spanish"] = dataset["spanish"].str.lower()

dataset = dataset.sort_index()  # sort_index returns a new DataFrame, so assign it back
dataset.head()

import matplotlib.pyplot as plt

# histogram of sentence length, measured in whitespace-separated words
eng_length = [len(sentence.split()) for sentence in dataset["english"]]
esp_length = [len(sentence.split()) for sentence in dataset["spanish"]]

plt.hist(eng_length, label="eng", color="red", alpha=0.33)
plt.hist(esp_length, label="esp", color="blue", alpha=0.33)
plt.yscale("log")     # log scale keeps the rare, very long sentences visible
plt.ylim(plt.ylim())  # make y-axis consistent for both plots
plt.plot([max(eng_length), max(eng_length)], plt.ylim(), color="red")
plt.plot([max(esp_length), max(esp_length)], plt.ylim(), color="blue")
plt.legend()
plt.title("Examples count vs Token length")
plt.show()

max_eng = max(eng_length)
max_esp = max(esp_length)
print(max_eng, max_esp)

Turning Words into Numbers: Tokenization

Computers don't understand words, they understand numbers. So, we need to convert our sentences into a numerical format. This process is called tokenization. We'll be using a clever technique called Byte-Pair Encoding (BPE).

Think of BPE like this: instead of just breaking a sentence into words, it can also break words down into smaller, common sub-words. This is super helpful for handling rare or new words. For example, if our model has seen "eating" and "running," BPE might break the new word "jumping" into "jump" and "ing," allowing the model to guess its meaning.
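Here is a toy sketch of the core BPE training loop in plain Python: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. The corpus and its frequencies are made up, and the helper names are our own. SentencePiece's real implementation is considerably more involved (it works on raw text and handles whitespace specially), so treat this only as an illustration of the idea.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Made-up corpus: each word split into characters, with a frequency
words = {tuple("eating"): 5, tuple("running"): 4, tuple("jump"): 3}

for step in range(3):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    words = merge_pair(words, best)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")

print(list(words))
```

After the first two merges, "ing" has become a single symbol because it is shared by "eating" and "running". That learned piece is what lets a tokenizer split an unseen word like "jumping" into familiar parts.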

Since English and Spanish share the Latin alphabet and many word roots, we can train a single tokenizer on both languages. This is more efficient and allows the model to leverage the similarities between them.

# Write clean text lists to disk for SP training
Path("sp_data").mkdir(exist_ok=True)
eng_path = "sp_data/eng.txt"
esp_path = "sp_data/esp.txt"

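# Writing the dataset into text files for training the tokenizer.
# encoding="utf-8" is set explicitly so accented Spanish characters are written correctly on every platform.
with open(eng_path, "w", encoding="utf-8") as f:
    for line in dataset['english'].dropna():
        f.write(line.replace('"', '').lower() + "\n")

with open(esp_path, "w", encoding="utf-8") as f:
    for line in dataset['spanish'].dropna():
        f.write(line.lower() + "\n")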

# Training the tokenizer
spm.SentencePieceTrainer.Train(
    input=[eng_path, esp_path], # Input files
    model_prefix="bpe_mixed",   # Output model name
    vocab_size=32000,           # Vocabulary size (number of sub-word tokens in the vocabulary)
    model_type="bpe",           # Model type (bpe or unigram)
    character_coverage=0.9995,  # Character coverage (fraction of characters to cover)
    pad_id=0, unk_id=1, bos_id=2, eos_id=3 # Padding, Unknown, Beginning of Sentence, End of Sentence token ids
)

We've set our vocabulary size to 32,000, so the tokenizer learns a shared vocabulary of 32,000 sub-word tokens across both languages. We've also reserved ids for four special tokens: padding (id 0, used to make sentences the same length), unknown words (id 1), the beginning of a sentence (id 2, written [SOS] in this post), and the end of a sentence (id 3, written [EOS]).

Code to encode and decode the sentences using the trained tokenizer

# Load the trained BPE models
eng_sp = spm.SentencePieceProcessor()
eng_sp.load('bpe_mixed.model')

esp_sp = spm.SentencePieceProcessor()
esp_sp.load('bpe_mixed.model')

# Create tokenization functions
def encode_eng(text):
    return eng_sp.encode(text, add_bos=False, add_eos=False), eng_sp.encode_as_pieces(text, add_bos=False, add_eos=False)

def encode_esp(text):
    return esp_sp.encode(text, add_bos=True, add_eos=True), esp_sp.encode_as_pieces(text, add_bos=True, add_eos=True)

def decode_esp(tokens):
    return esp_sp.decode(tokens)

# Test the tokenizers
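# .iloc[0] picks the first row by position, which still works if index label 0 was dropped
sample_eng = dataset['english'].iloc[0]
sample_esp = dataset['spanish'].iloc[0]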

print(f"Original EN: {sample_eng}")
print(f"Tokenized EN: {encode_eng(sample_eng)}")
print(f"Original ESP: {sample_esp}")
print(f"Tokenized ESP: {encode_esp(sample_esp)}")

Preparing the Data for Training

Now, we'll prepare our tokenized data for the model. We'll start by splitting our dataset into a training set (for teaching the model) and a validation set (for checking its progress).

First, let's create a function to tokenize our dataset and filter out any sentences that are too long. In this function we first encode the English and Spanish sentences into numeric tokens. We also add the [SOS] and [EOS] tokens to the Spanish sentences to tell the model when to start and stop translating. Then we filter out all sentence pairs that are longer than MAX_LENGTH, which removes the outliers and keeps our dataset uniform.

MAX_LENGTH = 128

# Load the trained BPE models
eng_sp = spm.SentencePieceProcessor()
eng_sp.load('bpe_mixed.model')

esp_sp = spm.SentencePieceProcessor()
esp_sp.load('bpe_mixed.model')

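def tokenize_data(df):
    # Drop rows with a missing sentence on either side so the pairs stay aligned
    df = df.dropna(subset=['english', 'spanish'])

    # English goes to the encoder as-is; Spanish gets [SOS]/[EOS] markers for the decoder
    eng_tokenized = df['english'].apply(lambda x: eng_sp.encode(str(x), add_bos=False, add_eos=False))
    esp_tokenized = df['spanish'].apply(lambda x: esp_sp.encode(str(x), add_bos=True, add_eos=True))

    # Filter out sentences that are too long.
    # Spanish may be one token longer because it is later split into a
    # decoder input (without [EOS]) and a target (without [SOS]).
    filtered_eng, filtered_esp = [], []

    for eng, esp in zip(eng_tokenized, esp_tokenized):
        if len(eng) <= MAX_LENGTH and len(esp) <= MAX_LENGTH + 1:
            filtered_eng.append(eng)
            filtered_esp.append(esp)

    return filtered_eng, filtered_esp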

So this function splits each sentence into smaller words/sub-words and converts them into numeric tokens, wrapping every Spanish sentence in [SOS] and [EOS], and it keeps only the sentence pairs that fit within MAX_LENGTH.

Now we'll split our data, with 80% for training and 20% for validation. After that we will tokenize our dataset using the function we defined earlier.

train_df, valid_df = train_test_split(dataset, test_size=0.2, random_state=42)
train_df.shape, valid_df.shape

# Tokenize the training and validation dataframes
eng_train_tokens, esp_train_tokens = tokenize_data(train_df)
eng_valid_tokens, esp_valid_tokens = tokenize_data(valid_df)

This part is crucial: we need to format our data in a way that our Transformer model can understand. For each English sentence, we need two versions of the corresponding Spanish sentence:

Decoder input: the Spanish sentence with a [SOS] (start of sentence) token at the beginning and no [EOS] at the end. This tells the decoder, "Okay, start translating!"
Target: the Spanish sentence with an [EOS] (end of sentence) token at the end and no [SOS] at the beginning. This is the "ground truth" that we'll compare the model's output against to see how well it's doing.

So, we are going from here:

English: "I am learning Deep Neural Networks"
Spanish: "Yo estoy aprendiendo redes neuronales profundas"

To here:

Encoder Input: [I, am, learning, Deep, Neural, Networks]
Decoder Input: [[SOS], Yo, estoy, aprendiendo, redes, neuronales, profundas]
Decoder Output: [Yo, estoy, aprendiendo, redes, neuronales, profundas, [EOS]]

This setup allows the model to learn to predict the next word in the Spanish sentence based on the English sentence and the Spanish words it has already generated.
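In terms of token ids, the two versions are simply the same list shifted by one position. A small sketch, using the special-token ids chosen when training the tokenizer (0 to 3) and made-up ids for the actual words:

```python
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3  # ids chosen when training the tokenizer

# Made-up token ids standing in for "yo estoy aprendiendo"
spanish_ids = [BOS_ID, 41, 57, 893, EOS_ID]

decoder_input = spanish_ids[:-1]   # starts with [SOS], drops [EOS]
decoder_target = spanish_ids[1:]   # drops [SOS], ends with [EOS]

# Position by position: the token the decoder sees and the token it should predict
for seen, expected in zip(decoder_input, decoder_target):
    print(f"decoder sees {seen:>4} -> should predict {expected}")
```

Both lists have the same length, and the target at each position is the decoder input at the next position.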

In this step we first arrange our data into this format and then create a TensorFlow Dataset from it. We build the dataset from a generator, so examples are streamed into the input pipeline one at a time instead of being materialized as one large padded tensor up front.

Teacher Forcing Method

The method we are implementing is known as teacher forcing. The encoder reads the whole English sentence at once. At every time step, the decoder receives the encoder's output together with the correct Spanish tokens up to that point (the ground truth, not its own earlier predictions) and learns to predict the next Spanish token. This way the model learns to predict the next word from context, instead of just learning to map source words to target words.

For example, at the first time step the decoder sees only the [SOS] token and should predict "Yo". At the next time step it sees [SOS] and "Yo" and should predict "estoy". Notice that we provide the correct word "Yo" even if the model generated something else at the first step. This way the model learns sequences and not just word mapping.

In the table below, the encoder input is the full sentence "I am learning Deep Neural Networks" at every time step, and the decoder input column shows the newest token the decoder receives.

Time Step	Decoder Input	Target Output
1	[SOS]	Yo
2	Yo	estoy
3	estoy	aprendiendo
4	aprendiendo	redes
5	redes	neuronales
6	neuronales	profundas
7	profundas	[EOS]

# Create a generator function that yields the correct inputs and targets
def data_generator(eng_tokens, esp_tokens):
    def gen():
        for eng, esp in zip(eng_tokens, esp_tokens):
            yield (eng, esp[:-1], esp[1:])
    return gen

# Create the raw training dataset from the generator
train_dataset_raw = tf.data.Dataset.from_generator(
    data_generator(eng_train_tokens, esp_train_tokens),
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
    )
)

# Create the raw validation dataset from the generator
valid_dataset_raw = tf.data.Dataset.from_generator(
    data_generator(eng_valid_tokens, esp_valid_tokens),
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
    )
)

Finally, we'll organize our data into batches and buckets. Think of bucketing as sorting your laundry by color before washing. We group sentences of similar lengths together. This makes our training more efficient because we don't have to add a lot of extra padding to the shorter sentences in a batch. So in this step we will:

Bucket the dataset by the length of the English sentences. With the boundaries [10, 20, 30, 60], sentences fall into five buckets: fewer than 10 tokens, 10 to 19, 20 to 29, 30 to 59, and 60 or more.
Within each bucket, build batches of the size specified in bucket_batch_sizes. In our case every bucket uses a batch size of 64, which means 64 sentences per batch. As soon as a bucket has collected 64 sentences, it emits them as a batch.
Pad the sentences in each batch to the length of the longest sentence in that batch. This step is essential because all the sentences in a batch must have the same length, and it keeps training efficient because only a little padding is needed.
Lastly, use the map function to call format_dataset, which converts each batch from (eng, esp_in, esp_out) to ({"encoder_inputs": eng, "decoder_inputs": esp_in}, esp_out).
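The bucketing steps can be sketched in plain Python to show the mechanics. This is an illustrative re-implementation under our own assumptions, not how tf.data does it internally: the function name is invented and the sequences are made-up lists of token ids.

```python
from bisect import bisect_right

def bucket_batches(sequences, boundaries, batch_size, pad_id=0):
    """Yield padded batches of sequences grouped by length bucket."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for seq in sequences:
        # Bucket i holds lengths from boundaries[i-1] up to (not including) boundaries[i]
        index = bisect_right(boundaries, len(seq))
        buckets[index].append(seq)
        if len(buckets[index]) == batch_size:
            batch = buckets[index]
            buckets[index] = []
            # Pad only up to the longest sequence in this batch
            longest = max(len(s) for s in batch)
            yield [s + [pad_id] * (longest - len(s)) for s in batch]
    # Incomplete buckets are dropped here, similar to drop_remainder=True

# Made-up token-id sequences of lengths 3, 12, 5, 15 and 8
sequences = [[7] * n for n in (3, 12, 5, 15, 8)]
for batch in bucket_batches(sequences, boundaries=[10, 20, 30, 60], batch_size=2):
    print([len(s) for s in batch])
```

The short sequences (3 and 5 tokens) end up together and are padded to 5, while the longer ones (12 and 15 tokens) form their own batch padded to 15. The leftover 8-token sequence never fills a batch and is dropped.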

Why bucketing and padding

In our dataset, the sentences don't all have the same length, but every sentence in a batch must. To solve this, we append extra numeric tokens (usually 0 is used as the padding token) to the shorter sentences until they match the longest one. In our case, each batch is padded only up to the length of the longest sentence in that batch, separately for the English and the Spanish side.

We could have padded every sentence up to the length of the longest sentence in the entire dataset. But that would have wasted training time and memory, because most of the computation would be spent on padding tokens instead of actual tokens.

For example, if 90% of our sentences are 10 to 20 words long and the remaining 10% are 50 to 60 words long, padding all the sentences up to 60 words would add a lot of padding and slow down training. To avoid this, we divide the dataset into buckets of similar length, create batches within each bucket, and pad each batch only up to the length of its longest sentence.
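A quick back-of-the-envelope calculation shows how much padding bucketing saves. The corpus below is made up to roughly match the example (90 short sentences and 10 long ones), and for simplicity each group is treated as a single batch.

```python
# Made-up corpus: 90 sentences of 10-20 tokens and 10 sentences of 50-59 tokens
short = [10 + (i % 11) for i in range(90)]  # lengths cycle through 10..20
long = [50 + i for i in range(10)]          # lengths 50..59
lengths = short + long

def padding_needed(group):
    """Number of pad tokens needed to bring every sequence up to the group's longest."""
    longest = max(group)
    return sum(longest - n for n in group)

global_padding = padding_needed(lengths)                          # pad everything to the overall maximum
bucketed_padding = padding_needed(short) + padding_needed(long)   # pad within each group
real_tokens = sum(lengths)

print("real tokens:     ", real_tokens)
print("global padding:  ", global_padding)
print("bucketed padding:", bucketed_padding)
```

With global padding, the model would process more padding tokens than real ones. Padding within groups of similar length cuts that overhead to a fraction.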

BATCH_SIZE = 64
BUFFER_SIZE = 20000

# This function tells the bucketing mechanism how to measure the length
def element_length_func(eng, esp_in, esp_out):
    return tf.shape(eng)[0]

# This formats the output of the dataset to match the model's input names
def format_dataset(eng, esp_in, esp_out):
    return ({"encoder_inputs": eng, "decoder_inputs": esp_in}, esp_out)

# Define bucket boundaries and the batch size for each bucket
bucket_boundaries = [10, 20, 30, 60]
bucket_batch_sizes = [BATCH_SIZE, BATCH_SIZE, BATCH_SIZE, BATCH_SIZE, BATCH_SIZE]

# Create the final training dataset.
# The chain is wrapped in parentheses so it can span several lines.
# Bucketing runs on the raw (eng, esp_in, esp_out) tuples, because element_length_func
# and padding_values expect three components; format_dataset is applied afterwards.
train_ds = (
    train_dataset_raw
    .cache()                # cache before shuffling so every epoch gets a new order
    .shuffle(BUFFER_SIZE)
    .bucket_by_sequence_length(
        element_length_func=element_length_func,
        bucket_boundaries=bucket_boundaries,
        bucket_batch_sizes=bucket_batch_sizes,

        # The padding arguments are still needed to create dense batches
        padding_values=(
            tf.constant(0, dtype=tf.int64), # pad encoder inputs with 0
            tf.constant(0, dtype=tf.int64), # pad decoder inputs with 0
            tf.constant(0, dtype=tf.int64)  # pad decoder targets with 0
        ),
        drop_remainder=True # Ensures all batches have a fixed size
    )
    .map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

# Create the final validation dataset (no shuffling).
# Bucket the raw (eng, esp_in, esp_out) tuples first, then rename the components
# with format_dataset so they match the model's input names.
valid_ds = (
    valid_dataset_raw
    .bucket_by_sequence_length(
        element_length_func=element_length_func,
        bucket_boundaries=bucket_boundaries,
        bucket_batch_sizes=bucket_batch_sizes,
        padding_values=(
            tf.constant(0, dtype=tf.int64),
            tf.constant(0, dtype=tf.int64),
            tf.constant(0, dtype=tf.int64)
        ),
        drop_remainder=True
    )
    .map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

print("Data pipelines created successfully using the stable API.")

And there you have it! Our data is now perfectly prepped and ready to be fed into our Transformer model. In the next part, we'll build the model architecture itself and start the training process. Stay tuned!