From English to Español: The Inference

Hi everyone! In the previous parts we preprocessed our dataset and then built and trained our Transformer for English-to-Spanish translation. Now it's time to test it out. The process of using a trained model to generate predictions is known as inference.

Before we begin, let's revisit the theory behind generating the output sequence. Our Transformer is a sequence-to-sequence model with an encoder and a decoder. During training, the decoder receives two inputs at every step: the encoded source sentence and the target tokens up to the previous timestep. As a result, the model learned to predict the next token given the source and everything generated so far, rather than mapping words one-to-one.

So at inference time we do the same thing iteratively: we feed the encoder the English sentence, seed the decoder with <bos>, and then repeatedly feed the decoder's own predictions back in until it produces <eos> or we hit the max length.

Greedy Decoding

The simplest decoding strategy is greedy search: at every step, pick the single highest-probability token from the model's output and append it.

Remember: in Part 2 our final Dense head emits raw logits (we never applied softmax). tf.argmax only cares about which value is largest, not whether the values sum to one, and softmax preserves the ordering of the logits, so argmax works directly on them.
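To see why skipping softmax is safe here, the following plain-Python sketch (no TensorFlow; the logit values and helper names are made up for illustration) checks that the largest logit and the largest softmax probability sit at the same index.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits for a 5-token vocabulary (illustrative values only).
logits = [1.2, -0.7, 3.4, 0.0, 2.9]
probs = softmax(logits)

# Both calls select the same token index.
print(argmax(logits), argmax(probs))
```

The same reasoning does not carry over to beam search, which needs actual log-probabilities because it adds scores across steps.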

import tensorflow as tf

def translate(input_sentence, model, tokenizer, config):
    """
    Translates a sentence using Greedy Decoding.
    """
    # 1. Preprocess the source sentence
    text = input_sentence.lower()
    input_ids = tokenizer.encode(text).ids

    # Truncate if too long
    if len(input_ids) > config.MAX_LENGTH:
        input_ids = input_ids[:config.MAX_LENGTH]

    encoder_input = tf.convert_to_tensor([input_ids], dtype=tf.int64)

    # 2. Initialize the decoder with <bos>
    bos_id = tokenizer.token_to_id("<bos>")
    eos_id = tokenizer.token_to_id("<eos>")
    decoder_input = [bos_id]

    print(f"Translating: '{input_sentence}'...")

    # 3. Generation loop
    for i in range(config.MAX_LENGTH):
        decoder_tensor = tf.convert_to_tensor([decoder_input], dtype=tf.int64)
        predictions = model([encoder_input, decoder_tensor], training=False)

        # Logits for the LAST token only -> shape (1, 1, vocab_size)
        last_token_logits = predictions[:, -1:, :]

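        # Greedy: highest-logit token (cast to a plain int so the list
        # passed to tokenizer.decode holds Python ints, not NumPy scalars)
        predicted_id = int(tf.argmax(last_token_logits, axis=-1).numpy()[0][0])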

        # Stop if model emits <eos>
        if predicted_id == eos_id:
            break

        decoder_input.append(predicted_id)

    # 4. Decode (skip the leading <bos>)
    result_ids = decoder_input[1:]
    return tokenizer.decode(result_ids)

A quick test:

es_text = translate("I want to go to the library.", transformer, bpe_tok, config)
print(f"Spanish: {es_text}")

Greedy vs. Beam Search

Greedy decoding is fast but myopic: the locally best token at step t isn't always part of the globally best sequence. To address this we explore several high-probability paths in parallel and keep the ones with the best cumulative score.

This is Beam Search: at each step, instead of keeping a single token, we keep the top k sequences ("beams") by log-probability, expand each of them with its top k continuations, and prune back down to the top k overall.
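A toy example can make the difference concrete. The sketch below is plain Python with a made-up probability table standing in for the model (TOY_MODEL and beam_search are hypothetical names, not part of this series' code). Because the toy vocabulary has only two tokens per step, expanding every continuation is the same as expanding the top k. Running it with a beam width of 1 reproduces greedy decoding.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, beam_width, steps):
    """Keep the `beam_width` best prefixes by summed log-probability."""
    beams = [(0.0, ())]  # (cumulative log-prob, token tuple)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            # Expand this prefix with every possible next token.
            for token, prob in model[seq].items():
                candidates.append((score + math.log(prob), seq + (token,)))
        # Prune back down to the best `beam_width` prefixes overall.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_score, greedy_seq = beam_search(TOY_MODEL, beam_width=1, steps=2)
beam_score, beam_seq = beam_search(TOY_MODEL, beam_width=2, steps=2)

print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

Greedy commits to "a" at the first step because 0.6 > 0.4 and ends with ("a", "x") at probability 0.33. With a width of 2, the "b" prefix survives long enough to reach ("b", "x") at probability 0.36.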

Two practical refinements make beam search work well in real systems:

  1. Log-probabilities: we sum log P(token) instead of multiplying probabilities. This is numerically far more stable for long sequences. Since our model outputs logits, we use tf.nn.log_softmax to convert them.
  2. Length penalty: without normalization, beam search prefers short sequences because every extra token adds another (negative) log-prob. We use the Google NMT length penalty: lp(L) = ((5 + L) / 6) ** alpha, with alpha = 0.6. Dividing the cumulative score by lp(L) offsets that bias: alpha = 0 turns normalization off, and larger values of alpha favor longer translations more strongly.
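Both refinements can be checked with a little arithmetic. The sketch below is plain Python, and the per-token probabilities are invented for illustration. It first shows a product of many small probabilities underflowing to zero while the sum of logs stays finite. It then compares a short and a longer hypothesis before and after dividing by the length penalty.

```python
import math

def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha."""
    return ((5 + length) / 6) ** alpha

def sequence_score(token_probs, alpha=0.6):
    """Return (raw log-prob sum, length-normalized score)."""
    raw = sum(math.log(p) for p in token_probs)
    return raw, raw / length_penalty(len(token_probs), alpha)

# 1. Numerical stability: 200 tokens at probability 0.01 each.
product = math.prod([0.01] * 200)                 # underflows to 0.0
log_sum = sum(math.log(0.01) for _ in range(200))  # finite, about -921

# 2. Length bias: hypothetical per-token probabilities.
shorter = [0.5, 0.5]             # 2 tokens, less confident
longer = [0.7, 0.7, 0.7, 0.7]    # 4 tokens, more confident per token

short_raw, short_norm = sequence_score(shorter)
long_raw, long_norm = sequence_score(longer)

print(product, round(log_sum, 2))
print(round(short_raw, 3), round(long_raw, 3))    # raw: shorter wins
print(round(short_norm, 3), round(long_norm, 3))  # normalized: longer wins
```

On raw log-probability the two-token hypothesis scores higher, simply because it has fewer terms. After dividing by lp(L), the four-token hypothesis comes out ahead.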

We wrap all of this into a BeamTranslator class:

import tensorflow as tf

class BeamTranslator:
    def __init__(self, model, tokenizer, beam_width=5, max_length=60, alpha=0.6):
        self.model = model
        self.tokenizer = tokenizer
        self.beam_width = beam_width
        self.max_length = max_length
        self.alpha = alpha  # Length penalty factor (0.6 is standard)

        self.bos_id = tokenizer.token_to_id("<bos>")
        self.eos_id = tokenizer.token_to_id("<eos>")
        if self.bos_id is None: self.bos_id = 2
        if self.eos_id is None: self.eos_id = 3

    def calc_length_penalty(self, length):
        """Google NMT length penalty, prevents bias toward short translations."""
        return ((5 + length) / 6) ** self.alpha

    def translate(self, sentence):
        # 1. Preprocess source
        sentence = sentence.lower()
        input_ids = self.tokenizer.encode(sentence).ids
        encoder_input = tf.constant([input_ids], dtype=tf.int64)

        # 2. Initialize the beam: (score, sequence)
        beams = [(0.0, [self.bos_id])]
        completed_sequences = []

        # 3. Decoding loop
        for _ in range(self.max_length):
            candidates = []
            for score, seq in beams:
                # Don't expand beams that already ended
                if seq[-1] == self.eos_id:
                    completed_sequences.append((score, seq))
                    continue

                decoder_input = tf.constant([seq], dtype=tf.int64)
                predictions = self.model([encoder_input, decoder_input], training=False)

                # Logits -> log-probs for the LAST token
                last_token_logits = predictions[:, -1, :]
                log_probs = tf.nn.log_softmax(last_token_logits, axis=-1)

                top_k_log_probs, top_k_ids = tf.math.top_k(log_probs, k=self.beam_width)
                top_k_log_probs = top_k_log_probs.numpy()[0]
                top_k_ids = top_k_ids.numpy()[0]

                for i in range(self.beam_width):
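                    # Cast NumPy scalars to plain Python types so the final
                    # sequence is a list of ints that tokenizer.decode can take
                    token_id = int(top_k_ids[i])
                    log_prob = float(top_k_log_probs[i])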
                    new_score = score + log_prob
                    new_seq = seq + [token_id]
                    candidates.append((new_score, new_seq))

            candidates.sort(key=lambda x: x[0], reverse=True)
            beams = candidates[:self.beam_width]

            if all(seq[-1] == self.eos_id for _, seq in beams):
                break

        # 4. Final ranking with length penalty
        completed_sequences.extend(beams)
        final_candidates = []
        for score, seq in completed_sequences:
            length_penalty = self.calc_length_penalty(len(seq))
            final_score = score / length_penalty
            final_candidates.append((final_score, seq))

        final_candidates.sort(key=lambda x: x[0], reverse=True)
        best_seq = final_candidates[0][1]

        # 5. Strip BOS/EOS and decode
        if best_seq[0] == self.bos_id: best_seq = best_seq[1:]
        if best_seq and best_seq[-1] == self.eos_id: best_seq = best_seq[:-1]
        return self.tokenizer.decode(best_seq)

Now let's try it on a few sentences. We use beam_width=5, a commonly used default that balances translation quality against decoding cost.

translator = BeamTranslator(
    model=transformer,
    tokenizer=bpe_tok,
    beam_width=5,
    alpha=0.6,
)

sentences = [
    "Hello, how are you?",
    "The car is very fast.",
    "I want to go to the library.",
]

for sent in sentences:
    translation = translator.translate(sent)
    print(f"Input:  {sent}")
    print(f"Output: {translation}")
    print("-" * 30)

With this, we've reached the end of our series. Across three parts we built an English-to-Spanish translation system entirely from scratch: we began with dataset preparation and a shared BPE tokenizer, moved through TFRecord streaming, model design, and masked-loss training, and finished by exploring inference strategies such as greedy and beam search.

This end-to-end process not only demonstrated how transformer-based architectures can be applied to sequence-to-sequence translation tasks, but also highlighted the importance of careful experimentation and evaluation at each stage.

As always, the field of natural language processing continues to evolve rapidly. Keep exploring, stay curious, and continue building on these foundations to push your understanding and your models even further.

Must Read and Future Scope

  • Neural Machine Translation with Transformer
  • Attention is All You Need
  • Try training on a larger dataset.
  • Read about Hyperparameter Tuning and apply it to this project. After all, Machine Learning is all about playing with hyperparameters.
  • Try training with a larger batch size, more epochs, or a deeper model (more encoder/decoder layers).
  • Cache the encoder output during beam search. Right now we recompute it for every beam expansion, which is wasteful.
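As a starting point for that last item, here is a minimal sketch of the caching idea in plain Python. It assumes the model can be split into a separate encoder and decoder step, which the combined model([encoder_input, decoder_input]) call used in this series does not expose directly. CountingEncoder and toy_decoder_step are hypothetical stand-ins, not real layers.

```python
class CountingEncoder:
    """Stand-in for an expensive encoder; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, source_ids):
        self.calls += 1
        return tuple(source_ids)  # Pretend this is the encoded memory.

def toy_decoder_step(memory, prefix):
    """Stand-in decoder: derives a fake next-token id from memory and prefix."""
    return (sum(memory) + len(prefix)) % 10

def decode_without_cache(encoder, source_ids, steps):
    prefix = []
    for _ in range(steps):
        memory = encoder(source_ids)  # Recomputed on every step.
        prefix.append(toy_decoder_step(memory, prefix))
    return prefix

def decode_with_cache(encoder, source_ids, steps):
    memory = encoder(source_ids)      # Computed once, reused below.
    prefix = []
    for _ in range(steps):
        prefix.append(toy_decoder_step(memory, prefix))
    return prefix

slow_encoder, fast_encoder = CountingEncoder(), CountingEncoder()
out_slow = decode_without_cache(slow_encoder, [3, 1, 4], steps=5)
out_fast = decode_with_cache(fast_encoder, [3, 1, 4], steps=5)

print(out_slow == out_fast, slow_encoder.calls, fast_encoder.calls)
```

The outputs are identical because the source sentence never changes during decoding. Only the number of encoder passes differs: one per step without the cache, one in total with it.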