Summary
In my previous post, I referenced several great videos covering the transformer. To get my hands dirty, I followed this tutorial by Google using TensorFlow. The tutorial is a good introduction, but I wanted to see if I could take it a step or two further.
For one, use a larger dataset.
Two, use the full-size model from the hallmark "Attention Is All You Need" paper rather than the smaller one in the tutorial.
This post summarizes my efforts to train on a larger dataset with the original paper's model parameters, and to tune the hyperparameters for better results.
1. A Larger Dataset
On his channel, Python Lessons, Rokas Liuberskis also works through the same TensorFlow tutorial. I found his detailed YouTube playlist helpful and wanted to mention it here. The OPUS website he used is a great source of sentence pairs for many languages, with some datasets topping 100M sentence pairs. Python Lessons wrote his own custom tokenizer. While I used the OPUS site to get a larger dataset, I stuck with the TensorFlow subword tokenizer tutorial to tokenize the sentence pairs from OPUS.
OPUS Language Datasets
This link lists all the English-Spanish datasets on the site; there are datasets of 400M sentence pairs. https://opus.nlpl.eu/results/en&es/corpus-result-table
I used the Wikipedia data, consisting of 1,811,000 English-Spanish sentence pairs. https://object.pouta.csc.fi/OPUS-Wikipedia/v1.0/moses/en-es.txt.zip
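If you prefer to grab the archive from a script instead of the browser, a minimal sketch like this works (it assumes the zip extracts to the Wikipedia.en-es.en / Wikipedia.en-es.es files that the code below reads):
import os
import urllib.request
import zipfile

url = "https://object.pouta.csc.fi/OPUS-Wikipedia/v1.0/moses/en-es.txt.zip"
zip_path = "datasets/en-es.txt.zip"
os.makedirs("datasets", exist_ok=True)
# Download the OPUS Wikipedia en-es archive if it is not already on disk
if not os.path.exists(zip_path):
    urllib.request.urlretrieve(url, zip_path)
# Extract the sentence files into ./datasets
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("datasets")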
Imports:
import numpy as np
import os
import tensorflow as tf
import tensorflow_text as text
Read the downloaded files and build es_data and en_data lists, keeping only sentence pairs up to a maximum length.
def read_files(path):
    with open(path, "r", encoding="utf-8") as f:
        data = f.read().split("\n")[:-1]
    return data

# https://object.pouta.csc.fi/OPUS-Wikipedia/v1.0/mono/es.txt.gz
es_data_path = "datasets/Wikipedia.en-es.es"
es_data = read_files(es_data_path)
# https://object.pouta.csc.fi/OPUS-wikimedia/v20230407/mono/en.txt.gz
en_data_path = "datasets/Wikipedia.en-es.en"
en_data = read_files(en_data_path)

# Only keep pairs where both sentences are at most max_length characters
max_length = 500
dataset = [
    [es_sentence, en_sentence]
    for es_sentence, en_sentence in zip(es_data, en_data)
    if len(es_sentence) <= max_length and len(en_sentence) <= max_length
]

# Unzip the dataset back to get the filtered Spanish and English sentences for tokenizing below
es_data, en_data = zip(*dataset)
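A quick sanity check confirms the pairs still line up after filtering and shows how many survived (optional; the exact count depends on the snapshot you downloaded):
# Optional sanity check on the filtered data
print(f"{len(dataset):,} sentence pairs after filtering")
print("ES:", es_data[0])
print("EN:", en_data[0])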
Build a Vocabulary for the Tokenizer
From the downloaded en-es.txt.zip, build a subword vocabulary and then tokenize the sentence pairs.
I followed this Google tutorial:
https://www.tensorflow.org/text/guide/subwords_tokenizer
Note: you will need to save the CustomTokenizer class from the tutorial. I put it in my own module and import it here.
# from the tutorial I saved the tokenizer in a .py file in my transformers module
from transformers import CustomTokenizer
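For context, here is a condensed sketch of roughly what that class does; the tutorial's real version also includes detokenize, lookup, and the tf.function export signatures that tf.saved_model.save needs, so use the full implementation from the tutorial:
class CustomTokenizer(tf.Module):
    """Condensed sketch of the tutorial's BertTokenizer wrapper."""
    def __init__(self, reserved_tokens, vocab_path):
        super().__init__()
        self.tokenizer = text.BertTokenizer(vocab_path, lower_case=True)
        self._reserved_tokens = reserved_tokens
        # Positions of [START] / [END] in the vocab (the reserved tokens come first)
        self._start = tf.argmax(tf.constant(reserved_tokens) == "[START]")
        self._end = tf.argmax(tf.constant(reserved_tokens) == "[END]")

    def tokenize(self, strings):
        enc = self.tokenizer.tokenize(strings)
        # Merge the word and word-piece axes into a single token axis
        enc = enc.merge_dims(-2, -1)
        # Wrap every sentence with [START] ... [END]
        count = enc.bounding_shape()[0]
        starts = tf.fill([count, 1], self._start)
        ends = tf.fill([count, 1], self._end)
        return tf.concat([starts, enc, ends], axis=1)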
Then I followed the tutorial using the OPUS dataset. First, set up the subword tokenizer parameters.
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab
bert_tokenizer_params = dict(lower_case=True)
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size=8000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)
Then build the vocabularies. Generating them on this dataset took 10-20 minutes.
# Create tf.data Datasets of individual sentences for the vocabulary builder
es_dataset = tf.data.Dataset.from_tensor_slices(list(es_data))
en_dataset = tf.data.Dataset.from_tensor_slices(list(en_data))
def write_vocab_file(filepath, vocab):
    with open(filepath, 'w') as f:
        for token in vocab:
            print(token, file=f)
es_vocab = bert_vocab.bert_vocab_from_dataset(
    es_dataset.batch(5000).prefetch(2),
    bert_vocab_args
)
en_vocab = bert_vocab.bert_vocab_from_dataset(
    en_dataset.batch(5000).prefetch(2),
    bert_vocab_args
)
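It is worth a quick look at the result before saving; the reserved tokens should come first, followed by characters and punctuation, with longer subword pieces toward the end:
# Peek at the learned vocabularies
print(len(en_vocab), "English wordpieces")
print(en_vocab[:10])    # reserved tokens, then characters/punctuation
print(en_vocab[-10:])   # longer subword pieces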
See the Google tutorial for more details; I am just showing my steps to create the tokenizer from a custom dataset and save it. This is where I used the CustomTokenizer class from Google's tutorial. Python Lessons' guide has its own implementation.
# Save the vocab files
write_vocab_file('./datasets/es_wiki_vocab.txt', es_vocab)
write_vocab_file('./datasets/en_wiki_vocab.txt', en_vocab)
tokenizers = tf.Module()
tokenizers.es = CustomTokenizer(reserved_tokens, './datasets/es_wiki_vocab.txt')
tokenizers.en = CustomTokenizer(reserved_tokens, './datasets/en_wiki_vocab.txt')
model_name = './datasets/wikipedia_translate_es_en_converter_16k'
tf.saved_model.save(tokenizers, model_name)
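To confirm the export worked, reload the saved module and round-trip a sentence (a small check; the exact token ids depend on the vocabulary you just built):
# Reload the saved tokenizers and try a round trip
reloaded = tf.saved_model.load(model_name)
tokens = reloaded.en.tokenize(['Hello world'])
print(tokens)                          # ragged tensor of token ids
print(reloaded.en.detokenize(tokens))  # back to (lower-cased) text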