Hi everyone! In the previous parts we preprocessed our dataset and then built and trained our Transformer for English-to-Spanish translation. Now it's time to test it out. The process of using a trained model to generate predictions is known as inference.
Before we begin, let's revisit the theory behind generating the output sequence. Our Transformer is a Seq-to-Seq model with an encoder and a decoder. During training, the decoder receives two inputs at every step: the encoded source sentence and the target tokens up to the previous timestep. This teaches the model to predict the next token rather than just map words 1:1.
So at inference time we do the same thing iteratively: we feed the encoder the English sentence, seed the decoder with <bos>, and then repeatedly feed the decoder's own predictions back in until it produces <eos> or we hit the max length.
Greedy Decoding
The simplest decoding strategy is greedy search: at every step, pick the single highest-probability token from the model's output and append it.
Remember: in Part 2 our final Dense head emits raw logits (we never applied softmax). tf.argmax only cares about which value is largest, not whether the values sum to one, and softmax preserves the ordering of the logits, so argmax works directly on them.
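As a quick sanity check of that claim, here is a minimal pure-Python sketch. The logits are made-up values for a hypothetical 5-token vocabulary, and the `softmax` and `argmax` helpers are written here only for illustration; they are not part of the model code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits for a 5-token vocabulary (illustrative values only).
logits = [1.2, -0.7, 3.4, 0.0, 2.9]
probs = softmax(logits)

greedy_from_logits = argmax(logits)
greedy_from_probs = argmax(probs)
print(greedy_from_logits, greedy_from_probs)  # 2 2
```

Because softmax is strictly increasing in each logit, the winning index is the same either way, so the greedy decoder can skip the softmax entirely.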
import tensorflow as tf

def translate(input_sentence, model, tokenizer, config):
    """
    Translates a sentence using Greedy Decoding.
    """
    # 1. Preprocess the source sentence
    text = str(input_sentence).lower()
    input_ids = tokenizer.encode(text).ids
    # Truncate if too long
    if len(input_ids) > config.MAX_LENGTH:
        input_ids = input_ids[:config.MAX_LENGTH]
    encoder_input = tf.convert_to_tensor([input_ids], dtype=tf.int64)

    # 2. Initialize the decoder with <bos>
    bos_id = tokenizer.token_to_id("<bos>")
    eos_id = tokenizer.token_to_id("<eos>")
    decoder_input = [bos_id]
    print(f"Translating: '{input_sentence}'...")

    # 3. Generation loop
    for _ in range(config.MAX_LENGTH):
        decoder_tensor = tf.convert_to_tensor([decoder_input], dtype=tf.int64)
        predictions = model([encoder_input, decoder_tensor], training=False)
        # Logits for the LAST token only -> shape (1, 1, vocab_size)
        last_token_logits = predictions[:, -1:, :]
        # Greedy: highest-logit token (cast to a plain int so the
        # tokenizer receives Python ints rather than NumPy scalars)
        predicted_id = int(tf.argmax(last_token_logits, axis=-1).numpy()[0][0])
        # Stop if model emits <eos>
        if predicted_id == eos_id:
            break
        decoder_input.append(predicted_id)

    # 4. Decode (skip the leading <bos>)
    result_ids = decoder_input[1:]
    return tokenizer.decode(result_ids)
A quick test:
es_text = translate("I want to go to the library.", transformer, bpe_tok, config)
print(f"Spanish: {es_text}")
Greedy vs. Beam Search
Greedy decoding is fast but myopic: the locally best token at step t isn't always part of the globally best sequence. To address this we explore several high-probability paths in parallel and keep the ones with the best cumulative score.
This is Beam Search: at each step, instead of keeping a single token, we keep the top k sequences ("beams") by log-probability, expand each of them with its top k continuations, and prune back down to the top k overall.
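To see why keeping more than one path matters, consider a toy sketch. `TOY_MODEL` below is a hypothetical hand-written table of next-token probabilities, not our Transformer, and the search is simplified: it runs for a fixed number of steps, has no <eos> handling, and expands every token in the tiny vocabulary instead of only the top k per beam.

```python
import math

# Hypothetical toy "model": next-token probabilities given the prefix so far.
TOY_MODEL = {
    ("<bos>",): {"a": 0.6, "b": 0.4},
    ("<bos>", "a"): {"x": 0.55, "y": 0.45},
    ("<bos>", "b"): {"x": 0.9, "y": 0.1},
}

def beam_search(model, beam_width, steps):
    """Return (log_prob, sequence) pairs, best first, after `steps` expansions."""
    beams = [(0.0, ("<bos>",))]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            # Extend this beam with every possible next token.
            for token, p in model[seq].items():
                candidates.append((score + math.log(p), seq + (token,)))
        # Keep only the best `beam_width` sequences overall.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

# A beam width of 1 is exactly greedy decoding.
greedy_score, greedy_seq = beam_search(TOY_MODEL, beam_width=1, steps=2)[0]
beam_score, beam_seq = beam_search(TOY_MODEL, beam_width=2, steps=2)[0]

print(greedy_seq, round(math.exp(greedy_score), 2))  # ('<bos>', 'a', 'x') 0.33
print(beam_seq, round(math.exp(beam_score), 2))      # ('<bos>', 'b', 'x') 0.36
```

Greedy commits to "a" because it looks best at the first step, but the sequence starting with "b" ends up more probable overall. With two beams, "b" survives the first step and wins at the second.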
Two practical refinements make beam search work well in real systems:
- Log-probabilities: we sum log P(token) instead of multiplying probabilities. This is numerically far more stable for long sequences. Since our model outputs logits, we use tf.nn.log_softmax to convert them.
- Length penalty: without normalization, beam search prefers short sequences because every extra token adds another (negative) log-prob. We use the Google NMT length penalty: lp(L) = ((5 + L) / 6) ** alpha, with alpha = 0.6. Dividing the score by lp(L) rewards longer, well-formed translations without overcompensating.
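Both refinements can be checked with a few lines of plain Python. This is only a sketch: the per-token probabilities are made-up values, and `length_penalty` and `log_score` are helper names introduced here for illustration.

```python
import math

def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty: ((5 + L) / 6) ** alpha."""
    return ((5 + length) / 6) ** alpha

def log_score(token_probs):
    """Sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# 1. Multiplying many small probabilities underflows; summing logs does not.
many_probs = [0.1] * 400
product = math.prod(many_probs)   # underflows to 0.0 in float64
log_sum = log_score(many_probs)   # stays finite

# 2. Two made-up hypotheses: a short one and a longer, more confident one.
short_hyp = [0.5] * 3
long_hyp = [0.7] * 7

raw_short, raw_long = log_score(short_hyp), log_score(long_hyp)
norm_short = raw_short / length_penalty(len(short_hyp))
norm_long = raw_long / length_penalty(len(long_hyp))

print(product, log_sum)
print(raw_short, raw_long)
print(norm_short, norm_long)
```

On raw log-probability the short hypothesis wins (about -2.08 versus -2.50), even though each of its tokens is less likely. After dividing by lp(L) the ranking flips (about -1.75 versus -1.65), which is the correction the length penalty is meant to provide.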
We wrap all of this into a BeamTranslator class:
import tensorflow as tf

class BeamTranslator:
    def __init__(self, model, tokenizer, beam_width=5, max_length=60, alpha=0.6):
        self.model = model
        self.tokenizer = tokenizer
        self.beam_width = beam_width
        self.max_length = max_length
        self.alpha = alpha  # Length penalty factor (0.6 is a common choice)
        self.bos_id = tokenizer.token_to_id("<bos>")
        self.eos_id = tokenizer.token_to_id("<eos>")
        # Fallback IDs in case the special tokens are missing from the vocabulary
        if self.bos_id is None: self.bos_id = 2
        if self.eos_id is None: self.eos_id = 3

    def calc_length_penalty(self, length):
        """Google NMT length penalty, prevents bias toward short translations."""
        return ((5 + length) / 6) ** self.alpha

    def translate(self, sentence):
        # 1. Preprocess source
        sentence = sentence.lower()
        input_ids = self.tokenizer.encode(sentence).ids
        encoder_input = tf.constant([input_ids], dtype=tf.int64)

        # 2. Initialize the beam: (score, sequence)
        beams = [(0.0, [self.bos_id])]
        completed_sequences = []

        # 3. Decoding loop
        for _ in range(self.max_length):
            candidates = []
            for score, seq in beams:
                # Don't expand beams that already ended
                if seq[-1] == self.eos_id:
                    completed_sequences.append((score, seq))
                    continue
                decoder_input = tf.constant([seq], dtype=tf.int64)
                predictions = self.model([encoder_input, decoder_input], training=False)
                # Logits -> log-probs for the LAST token
                last_token_logits = predictions[:, -1, :]
                log_probs = tf.nn.log_softmax(last_token_logits, axis=-1)
                top_k_log_probs, top_k_ids = tf.math.top_k(log_probs, k=self.beam_width)
                top_k_log_probs = top_k_log_probs.numpy()[0]
                top_k_ids = top_k_ids.numpy()[0]
                for i in range(self.beam_width):
                    # Cast NumPy scalars to plain Python types so the
                    # tokenizer later receives ordinary ints
                    token_id = int(top_k_ids[i])
                    log_prob = float(top_k_log_probs[i])
                    new_score = score + log_prob
                    new_seq = seq + [token_id]
                    candidates.append((new_score, new_seq))
            candidates.sort(key=lambda x: x[0], reverse=True)
            beams = candidates[:self.beam_width]
            if all(seq[-1] == self.eos_id for _, seq in beams):
                break

        # 4. Final ranking with length penalty
        completed_sequences.extend(beams)
        final_candidates = []
        for score, seq in completed_sequences:
            length_penalty = self.calc_length_penalty(len(seq))
            final_score = score / length_penalty
            final_candidates.append((final_score, seq))
        final_candidates.sort(key=lambda x: x[0], reverse=True)
        best_seq = final_candidates[0][1]

        # 5. Strip BOS/EOS and decode
        if best_seq[0] == self.bos_id: best_seq = best_seq[1:]
        if best_seq and best_seq[-1] == self.eos_id: best_seq = best_seq[:-1]
        return self.tokenizer.decode(best_seq)
Now let's try it on a few sentences. We use beam_width=5, a common default for NMT decoding that balances translation quality against decoding cost.
translator = BeamTranslator(
model=transformer,
tokenizer=bpe_tok,
beam_width=5,
alpha=0.6,
)
sentences = [
"Hello, how are you?",
"The car is very fast.",
"I want to go to the library.",
]
for sent in sentences:
translation = translator.translate(sent)
print(f"Input: {sent}")
print(f"Output: {translation}")
print("-" * 30)
With this, we've reached the end of our series. Across three parts we built an English-to-Spanish translation system entirely from scratch: we began with dataset preparation and a shared BPE tokenizer, moved through TFRecord streaming, model design, and masked-loss training, and finished by exploring inference strategies like greedy and beam search.
This end-to-end process not only demonstrated how transformer-based architectures can be applied to sequence-to-sequence translation tasks, but also highlighted the importance of careful experimentation and evaluation at each stage.
As always, the field of natural language processing continues to evolve rapidly. Keep exploring, stay curious, and continue building on these foundations to push your understanding and your models even further.
Must Read and Future Scope
- Neural Machine Translation with Transformer
- Attention is All You Need
- Try training on a larger dataset.
- Read about Hyperparameter Tuning and apply it to this project. After all, Machine Learning is all about playing with hyperparameters.
- Try training with a larger batch size, more epochs, or a deeper model (more encoder/decoder layers).
- Cache the encoder output during beam search. Right now we recompute it for every beam expansion, which is wasteful.
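As a starting point for the caching idea, here is a rough sketch of the pattern. It assumes the model can be split into separate encoder and decoder calls, which may or may not match how the Part 2 model is structured. `CachedEncoderDecoder`, `encode_fn`, and `decode_fn` are hypothetical names, and the stand-in functions below only count calls instead of running a real network.

```python
class CachedEncoderDecoder:
    """Encode the source once, then reuse that output for every decoder call."""

    def __init__(self, encode_fn, decode_fn):
        self.encode_fn = encode_fn  # hypothetical: source ids -> encoder output
        self.decode_fn = decode_fn  # hypothetical: (encoder output, target ids) -> logits
        self._memory = None

    def start(self, source_ids):
        # The only encoder call for this sentence.
        self._memory = self.encode_fn(source_ids)

    def step(self, target_ids):
        if self._memory is None:
            raise RuntimeError("call start() before step()")
        return self.decode_fn(self._memory, target_ids)

# Stand-in functions that only count calls; a real model would return tensors.
calls = {"encode": 0, "decode": 0}

def fake_encode(source_ids):
    calls["encode"] += 1
    return tuple(source_ids)

def fake_decode(memory, target_ids):
    calls["decode"] += 1
    return [0.0] * 4  # dummy logits for a 4-token vocabulary

runner = CachedEncoderDecoder(fake_encode, fake_decode)
runner.start([11, 12, 13])
for prefix in ([2], [2, 7], [2, 7, 9]):
    runner.step(prefix)

print(calls)  # {'encode': 1, 'decode': 3}
```

In a beam search built this way, every beam expansion would call `step`, so the encoder would run once per sentence rather than once per expansion.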