
From English to Español: The Inference

Hi everyone! In the previous parts we preprocessed our dataset and then built and trained our transformer model for English-to-Spanish translation. Now it’s time to test it out. The process of using a trained model to generate predictions is known as inference.

Before we begin, let’s study the theory behind generating the output sequence. In the last part we saw that our transformer is a sequence-to-sequence model consisting of an encoder and a decoder. We also saw that during training, in order to generate the translation, the decoder needs two inputs at each timestep: the encoded source sentence and the target tokens from all previous timesteps. This way we made the model learn to predict the next word, instead of just learning to map source words to target words.
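To make this concrete, here is a minimal sketch of that input/target shift, usually called teacher forcing. The token IDs below are made up purely for illustration and are not from our actual vocabulary:

target = [1, 51, 874, 209, 2]   # hypothetical IDs: 1 = [SOS], 2 = [EOS]
decoder_input = target[:-1]      # [1, 51, 874, 209]  -> what the decoder sees
labels = target[1:]              # [51, 874, 209, 2]  -> what it must predict
for step, expected in enumerate(labels):
    print(f"step {step}: given {decoder_input[:step + 1]}, predict {expected}")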

So during inference we will repeat the same process iteratively: we pass the encoded English sentence along with the [SOS] token to the decoder, and as the decoder generates each output token, we append it to the decoder’s input and feed everything back in for the next step.


import tensorflow as tf

# `transformer`, `eng_sp`, `esp_sp`, and `config` were all defined in the previous parts.
def translate(english_sentence):
    """
    Translates an English sentence to Spanish using the trained Transformer model.
    """
    # Preprocess the input sentence
    tokenized_english = eng_sp.encode(english_sentence.lower())
    encoder_input = tf.constant([tokenized_english], dtype=tf.int64)

    # The decoder's input starts with the BOS token
    decoder_input = [esp_sp.bos_id()]

    output = tf.constant([decoder_input], dtype=tf.int64)

    for i in range(config.MAX_LENGTH):
        # Make a prediction
        predictions = transformer([encoder_input, output], training=False)
        
        # Select the last token from the seq_len dimension
        predictions = predictions[:, -1:, :]  # (batch_size, 1, vocab_size)

        # Get the token with the highest probability (greedy search)
        # This returns a tensor with dtype=tf.int64.
        predicted_id = tf.argmax(predictions, axis=-1)

        # Append the predicted token to the output.
        output = tf.concat([output, predicted_id], axis=-1)

        # Stop decoding once the EOS token is predicted
        if int(predicted_id[0, 0]) == esp_sp.eos_id():
            break
            
    # Decode the sequence of token IDs back to a text string
    predicted_sentence = esp_sp.decode(output.numpy().flatten().tolist())
    
    return predicted_sentence

In this code, we first convert our English input sentence into tokens. For the first decoding step, we provide the [SOS] (start-of-sequence) token along with the encoded representation of the complete English sentence. The model then predicts the next token in the target (Spanish) sequence: not the actual word, just its token ID.

In each subsequent step, we feed the decoder with the tokens it has generated so far together with the encoded English sentence, allowing it to predict the next token. This process repeats until the model outputs the [EOS] (end-of-sequence) token.

Once the [EOS] token is produced, we take all the predicted tokens and decode them back into words to form the final Spanish translation.
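As a quick sanity check, you can now call the function directly. The translation shown in the comment is only illustrative; the actual output depends entirely on your trained weights:

print(translate("the weather is nice today"))
# e.g. "el clima es agradable hoy"  (your model's output will differ)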

Greedy vs. Beam Search

In the previous section, we implemented the Greedy Search technique, where we selected the token with the highest probability at each decoding step. However, this approach can sometimes lead to suboptimal translations, since the locally most probable token might not result in the best overall sequence.

To address this limitation, we can consider multiple possible sequences instead of just one. The idea is to explore several high-probability paths simultaneously and keep track of those that lead to the lowest overall loss (or highest total probability) in the long run.

In simpler terms, at each decoding step, instead of picking only the single best token, we look at the top k most probable tokens. For each of these, we then generate their next k possible continuations, evaluate their cumulative probabilities (or losses), and keep the top k best-performing sequences. We continue this process until an [EOS] token is produced.

This approach is known as Beam Search, a refined version of the Best-First Search algorithm from heuristic search methods.
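One detail worth noting before the code: longer hypotheses accumulate more negative log-probability terms, so comparing raw sums of log-probabilities would unfairly favor short sequences. The implementation below therefore ranks each hypothesis y by a length-normalized score (a common convention, though not the only option):

score(y) = (1 / len(y)) * Σ_t log P(y_t | y_<t, x)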


def translate(english_sentence, beam_width=3):
    """
    Translates an English sentence to Spanish using the trained Transformer model
    with beam search.
    """
    # Preprocess the input sentence
    tokenized_english = eng_sp.encode(english_sentence.lower())
    encoder_input = tf.constant([tokenized_english], dtype=tf.int64)

    # The decoder's input starts with the BOS token
    start_token = esp_sp.bos_id()
    end_token = esp_sp.eos_id()

    # Each beam entry is a (token ID sequence, cumulative log-probability) tuple.
    beam = [([start_token], 0.0)]
    
    completed_hypotheses = []

    for _ in range(config.MAX_LENGTH):
        new_beam = []
        for seq, score in beam:
            if seq[-1] == end_token:
                completed_hypotheses.append((seq, score))
                continue

            decoder_input = tf.constant([seq], dtype=tf.int64)
            predictions = transformer([encoder_input, decoder_input], training=False)
            
            # Get the log-probabilities of the next possible tokens
            # (this assumes the model's final layer applies a softmax;
            # if it returns raw logits, use tf.nn.log_softmax instead)
            last_token_probs = predictions[:, -1, :]
            log_probs = tf.math.log(last_token_probs)
            
            # Get the top k most likely next tokens
            top_k_log_probs, top_k_indices = tf.nn.top_k(log_probs, k=beam_width)

            for i in range(beam_width):
                new_token = top_k_indices[0, i]
                new_log_prob = top_k_log_probs[0, i].numpy()
                
                # Convert the tensor to a native Python int before appending
                new_seq = seq + [new_token.numpy().item()]
                new_score = score + new_log_prob
                
                new_beam.append((new_seq, new_score))

        # If all beams have ended in EOS, we can stop early
        if not new_beam:
            break

        # Sort all new hypotheses by their length-normalized score and keep the top k
        beam = sorted(new_beam, key=lambda x: x[1] / len(x[0]), reverse=True)[:beam_width]

    # Add any remaining (unfinished) hypotheses from the beam to the completed list
    completed_hypotheses.extend(beam)

    # Find the best translation among the completed hypotheses
    if not completed_hypotheses:
        return ""
        
    best_hypothesis = sorted(completed_hypotheses, key=lambda x: x[1] / len(x[0]), reverse=True)[0]
    best_seq = best_hypothesis[0]
    
    # Decode the sequence of token IDs back to a text string
    predicted_sentence = esp_sp.decode(best_seq)
    
    return predicted_sentence
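Note that this new translate definition replaces the greedy one from before. Conveniently, beam search with beam_width=1 behaves exactly like greedy search, so you can compare decoding strategies with a single function (the outputs, again, depend on your trained weights):

# beam_width=1 reduces beam search to greedy decoding
sentence = "I have never seen anything like this before."
for k in (1, 3, 5):
    print(f"beam_width={k}: {translate(sentence, beam_width=k)}")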

Now it’s time to test our model on some sentences. For now we will use samples from the validation set only, so that we can check whether the translations are exactly the same as the reference, close enough, or completely different.


sample_index = 89
english_sentence = valid_df.iloc[sample_index]['english']
reference_spanish = valid_df.iloc[sample_index]['spanish']

# Use the translate function to get the model's prediction
predicted_spanish = translate(english_sentence)


print(f"English Input:      {english_sentence}")
print(f"Reference Spanish:  {reference_spanish}")
print(f"Predicted Spanish:  {predicted_spanish}")

With this, we’ve reached the end of our series. Throughout this series, we built an English-to-Spanish translation system entirely from scratch — beginning with dataset preparation, moving through model design and training, and finally exploring inference strategies such as Greedy Search and Beam Search.

This end-to-end process not only demonstrated how transformer-based architectures can be applied to sequence-to-sequence translation tasks, but also highlighted the importance of careful experimentation and evaluation at each stage.

As always, the field of natural language processing continues to evolve rapidly. Keep exploring, stay curious, and continue building on these foundations to push your understanding and your models even further.

Must Read and Future Scope