Practical 10: Neural Machine Translation

Ayoub Bagheri


Applied Text Mining - Utrecht Summer School

You made it! It is the last practical of the Applied Text Mining course.

In this practical, we will create models for neural machine translation. Today we are curious to see how a simple deep learning-based model translates a sentence into its counterpart.

The aim of this practical is to convert a Dutch sentence to its English counterpart using a Neural Machine Translation (NMT) system. We will implement this task by building a simple Sequence-to-Sequence model (a special class of Recurrent Neural Network architectures) with the help of the Keras library.

Today we will use the following libraries. Take care to have them installed!

In [1]:
import string
import re
import statistics
from numpy import array, argmax, random, take
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Bidirectional, RepeatVector, TimeDistributed
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from keras import optimizers
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# matplotlib inline
pd.set_option('display.max_colwidth', 200)

Let's get started!

1. In this practical we will use a data set of tab-delimited Bilingual Sentence Pairs from http://www.manythings.org/anki/. Use the following two functions (read_text and to_lines) to read the nld.txt data set (also provided on the course webpage next to the practical link). This data set contains phrases in Dutch with their translations in English. Convert the text sequences to an array and check the first items in your array.

In [2]:
# function to read raw text file
def read_text(filename):
    # open the file
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    file.close()
    return text
In [3]:
# split a text into sentences
def to_lines(text):
    sents = text.strip().split('\n')
    sents = [i.split('\t') for i in sents]
    return sents
In [4]:
data = read_text("nld-eng/nld.txt")
nld_eng = to_lines(data)
nld_eng = array(nld_eng)
In [5]:
nld_eng.shape
Out[5]:
(54972, 3)
In [6]:
nld_eng
Out[6]:
array([['Go.', 'Lopen!',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #7764436 (LinguisticFusion)'],
       ['Go.', 'Vooruit.',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #7915821 (Elsofie)'],
       ['Hi.', 'Hoi.',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #537889 (Dorenda)'],
       ...,
       ['Always use distilled water in steam irons because using ordinary water will cause a mineral build-up over time that will clog the steam holes.',
        'Gebruik altijd gedistilleerd water in stoomstrijkijzers, want gewoon water zorgt voor mineraalophoping dat de stoomgaatjes na verloop van tijd verstopt.',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #3020388 (Delian) & #3037091 (Citrine)'],
       ["If you translate from your second language into your own native language, rather than the other way around, you're less likely to make mistakes.",
        'Als je vanuit je tweede taal naar je eigen moedertaal vertaalt, in plaats van andersom, maak je minder snel fouten.',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #1230823 (CK) & #8627687 (MarijnKp)'],
       ["If someone who doesn't know your background says that you sound like a native speaker, it means they probably noticed something about your speaking that made them realize you weren't a native speaker. In other words, you don't really sound like a native speaker.",
        'Als iemand die je achtergrond niet kent zegt dat je klinkt als een moedertaalspreker betekent dat dat diegene waarschijnlijk iets in je spreken opgemerkt heeft dat hem deed realiseren dat je geen moedertaalspreker bent. Met andere woorden, je klinkt niet echt als een moedertaalspreker.',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #953936 (CK) & #1056762 (ReneeMona)']],
      dtype='<U286')

Using code like nld_eng = nld_eng[:1000,:], you can work with a subset of the data, for example the first 1000 sentence pairs, to reduce the training time of the model. Be aware that this will cost you some performance.

Pre-processing

2. Use the maketrans() function to remove punctuation from the nld_eng object. The maketrans() function of the str class constructs a translation table, i.e. it specifies which characters need to be replaced in a string and which characters need to be deleted from it. To apply this translation table to a string, you use the translate() function. Together, these functions can be used to remove punctuation. Similar to the example below, apply the maketrans() function to remove punctuation from the nld_eng array.

In [7]:
### Here you see an example on how to use the maketrans() function ###
# specify the list of characters that need to be replaced
str1 = "mtex"

# specify the list of characters with which the characters need to be replaced
str2 = "dwoo"

# specify the list of characters that needs to be deleted
str3 = "u"

# target string 
temp_str = "text mining"

# using maketrans() to construct a translate table
table = temp_str.maketrans(str1, str2, str3)
  
# Printing original string 
print ("The string before translating is : ", end ="")
print (temp_str)
  
# using translate() to make translations.
print ("The string after translating is : ", end ="")
print (temp_str.translate(table))
The string before translating is : text mining
The string after translating is : woow dining

Note that it is also possible to replace one character with two or more characters. To do so, supply a dict as the single argument to maketrans().
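Below is a minimal sketch of this dict form (not part of the exercise): the keys are single characters and the values may be longer strings, or None to delete the character.

In [ ]:
# minimal sketch: a dict passed to str.maketrans() can map one character
# to a longer string (or to None to delete it)
table = str.maketrans({'&': ' and ', '\t': ' '})
print("R&D department".translate(table))  # -> R and D department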

In [8]:
nld_eng[:,0] = [s.translate(str.maketrans('', '', string.punctuation)) for s in nld_eng[:,0]]
nld_eng[:,1] = [s.translate(str.maketrans('', '', string.punctuation)) for s in nld_eng[:,1]]
In [9]:
nld_eng
Out[9]:
array([['Go', 'Lopen',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #7764436 (LinguisticFusion)'],
       ['Go', 'Vooruit',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #7915821 (Elsofie)'],
       ['Hi', 'Hoi',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #537889 (Dorenda)'],
       ...,
       ['Always use distilled water in steam irons because using ordinary water will cause a mineral buildup over time that will clog the steam holes',
        'Gebruik altijd gedistilleerd water in stoomstrijkijzers want gewoon water zorgt voor mineraalophoping dat de stoomgaatjes na verloop van tijd verstopt',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #3020388 (Delian) & #3037091 (Citrine)'],
       ['If you translate from your second language into your own native language rather than the other way around youre less likely to make mistakes',
        'Als je vanuit je tweede taal naar je eigen moedertaal vertaalt in plaats van andersom maak je minder snel fouten',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #1230823 (CK) & #8627687 (MarijnKp)'],
       ['If someone who doesnt know your background says that you sound like a native speaker it means they probably noticed something about your speaking that made them realize you werent a native speaker In other words you dont really sound like a native speaker',
        'Als iemand die je achtergrond niet kent zegt dat je klinkt als een moedertaalspreker betekent dat dat diegene waarschijnlijk iets in je spreken opgemerkt heeft dat hem deed realiseren dat je geen moedertaalspreker bent Met andere woorden je klinkt niet echt als een moedertaalspreker',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #953936 (CK) & #1056762 (ReneeMona)']],
      dtype='<U286')

3. Convert all words to lowercase.

In [10]:
# convert to lowercase
for i in range(len(nld_eng)):
    nld_eng[i,0] = nld_eng[i,0].lower()    
    nld_eng[i,1] = nld_eng[i,1].lower()
In [11]:
nld_eng
Out[11]:
array([['go', 'lopen',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #7764436 (LinguisticFusion)'],
       ['go', 'vooruit',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #7915821 (Elsofie)'],
       ['hi', 'hoi',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #537889 (Dorenda)'],
       ...,
       ['always use distilled water in steam irons because using ordinary water will cause a mineral buildup over time that will clog the steam holes',
        'gebruik altijd gedistilleerd water in stoomstrijkijzers want gewoon water zorgt voor mineraalophoping dat de stoomgaatjes na verloop van tijd verstopt',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #3020388 (Delian) & #3037091 (Citrine)'],
       ['if you translate from your second language into your own native language rather than the other way around youre less likely to make mistakes',
        'als je vanuit je tweede taal naar je eigen moedertaal vertaalt in plaats van andersom maak je minder snel fouten',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #1230823 (CK) & #8627687 (MarijnKp)'],
       ['if someone who doesnt know your background says that you sound like a native speaker it means they probably noticed something about your speaking that made them realize you werent a native speaker in other words you dont really sound like a native speaker',
        'als iemand die je achtergrond niet kent zegt dat je klinkt als een moedertaalspreker betekent dat dat diegene waarschijnlijk iets in je spreken opgemerkt heeft dat hem deed realiseren dat je geen moedertaalspreker bent met andere woorden je klinkt niet echt als een moedertaalspreker',
        'CC-BY 2.0 (France) Attribution: tatoeba.org #953936 (CK) & #1056762 (ReneeMona)']],
      dtype='<U286')

Text to Sequence

4. What is the maximum length of a sentence in each of the Dutch and English sets? What about the average length?

In [12]:
# empty lists
eng_l = []
nld_l = []
# populate the lists with sentence lengths
for i in nld_eng[:,0]:
    eng_l.append(len(i.split()))

for i in nld_eng[:,1]:
    nld_l.append(len(i.split()))
In [13]:
length_df = pd.DataFrame({'eng':eng_l, 'nld':nld_l})
length_df.hist(bins = 30)
plt.show()
In [14]:
max(eng_l)
Out[14]:
44
In [15]:
max(nld_l)
Out[15]:
44
In [16]:
type(nld_l)
Out[16]:
list
In [17]:
statistics.mean(nld_l)
Out[17]:
5.853307138179437
In [18]:
statistics.mean(eng_l)
Out[18]:
5.619569962890199

Coincidentally, the maximum sentence length is 44 words for both the Dutch and the English sets.
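Since we will later pad or truncate every sequence to a fixed length, it can be useful to check how many sentences would actually be cut off at that length. A quick optional check, assuming the maximum length of 40 used further below:

In [ ]:
# optional check: fraction of sentences longer than the chosen maximum length
import numpy as np
print(np.mean(np.array(eng_l) > 40))  # English sentences that would be truncated
print(np.mean(np.array(nld_l) > 40))  # Dutch sentences that would be truncated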

5. Use the train_test_split function from sklearn to split the data set into training (80%) and test (20%) sets.

In [19]:
train, test = train_test_split(nld_eng, test_size=0.2, random_state=321)

6. Time to tokenize the sentences. Use the Tokenizer function from Keras and fit the sentences. Find out about the vocabulary size for the Dutch and English sets.

In [20]:
# prepare english tokenizer
eng_tokenizer = Tokenizer()
eng_tokenizer.fit_on_texts(nld_eng[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
print('English Vocabulary Size: %d' % eng_vocab_size)
English Vocabulary Size: 9073
In [21]:
# prepare Dutch tokenizer
nld_tokenizer = Tokenizer()
nld_tokenizer.fit_on_texts(nld_eng[:, 1])
nld_vocab_size = len(nld_tokenizer.word_index) + 1
print('Dutch Vocabulary Size: %d' % nld_vocab_size)
# print('Dutch Vocabulary Size:', nld_vocab_size)
Dutch Vocabulary Size: 12794
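As an optional illustration, you can inspect the fitted tokenizer to see how it maps words to integer indices and how it encodes a whole sentence (the exact indices depend on word frequencies in your data):

In [ ]:
# optional illustration: word-to-index mapping and sentence encoding
print(eng_tokenizer.word_index['the'])                           # index of a frequent word
print(eng_tokenizer.texts_to_sequences(['the light is green']))  # list of index lists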

7. Write a function that converts tokenized sentences into padded sequences, with the maximum sentence length as an argument. The other input arguments to this function are the tokenizer and the sentences, and its output will be the sequences of token indices.

In [22]:
# encode and pad sequences
def encode_sequences(tokenizer, maximum_length, sentences):
    # integer encode sequences
    seq = tokenizer.texts_to_sequences(sentences)
    # pad sequences with 0 values
    seq = pad_sequences(seq, maxlen=maximum_length, padding='post')
    return seq
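A minimal usage sketch of encode_sequences (not part of the exercise), using a maximum length of 10 to make the 'post' padding visible:

In [ ]:
# one Dutch sentence encoded to word indices and padded with zeros at the end
example = encode_sequences(nld_tokenizer, 10, ['het licht is groen'])
print(example)        # four word indices followed by six zeros
print(example.shape)  # (1, 10)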

8. Convert your tokenized training data into sequences. Use a maximum length of 40 and name the arrays train_X and train_Y.

In [23]:
eng_length = 40
nld_length = 40
# prepare training data
train_X = encode_sequences(nld_tokenizer, nld_length, train[:, 1])
train_Y = encode_sequences(eng_tokenizer, eng_length, train[:, 0])

9. In the same way, convert your tokenized test data into sequences and name the arrays test_X and test_Y.

In [24]:
# prepare test data
test_X = encode_sequences(nld_tokenizer, nld_length, test[:, 1])
test_Y = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
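As an optional sanity check, the encoded matrices should have one row per sentence and 40 columns:

In [ ]:
# optional sanity check of the sequence shapes
print(train_X.shape, train_Y.shape)
print(test_X.shape, test_Y.shape)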

Neural Network Model

10. Define a Sequence-to-Sequence (Seq2Seq) model architecture using an embedding layer as the input layer, an LSTM layer as the encoder, and another LSTM layer followed by a Dense layer as the decoder. Make this a function and name it build_model(). Define different input arguments for your function, including the embedding_size and the number of LSTM units.

In [25]:
# build NMT model (encoder-decoder Seq2Seq)
def build_model(in_vocab, out_vocab, in_timesteps, out_timesteps, embedding_size, LSTMunits):
    model = Sequential()
    # encoder: embed the source tokens (mask_zero ignores the 0 padding)
    model.add(Embedding(in_vocab, embedding_size, input_length=in_timesteps, mask_zero=True))
    # encoder LSTM compresses the source sentence into a single context vector
    model.add(LSTM(LSTMunits))
    # repeat the context vector once for every output timestep
    model.add(RepeatVector(out_timesteps))
    # decoder LSTM unrolls the context vector into a sequence of hidden states
    model.add(LSTM(LSTMunits, return_sequences=True))
    # softmax over the target vocabulary at every timestep
    model.add(Dense(out_vocab, activation='softmax'))
    return model

11. Create a model by calling the function with embedding_size of 300 and 512 units for the LSTM layers.

In [26]:
model = build_model(nld_vocab_size, 
                    eng_vocab_size, 
                    nld_length, 
                    eng_length, 
                    300, 
                    512)

12. Compile the model with the RMSprop optimizer and sparse_categorical_crossentropy for loss.

In [27]:
model.compile(optimizer=optimizers.RMSprop(lr=0.001), 
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 40, 300)           3838200   
_________________________________________________________________
lstm (LSTM)                  (None, 512)               1665024   
_________________________________________________________________
repeat_vector (RepeatVector) (None, 40, 512)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 40, 512)           2099200   
_________________________________________________________________
dense (Dense)                (None, 40, 9073)          4654449   
=================================================================
Total params: 12,256,873
Trainable params: 12,256,873
Non-trainable params: 0
_________________________________________________________________

Note that we have used 'sparse_categorical_crossentropy' as the loss function because it allows us to use the target sequences as integer indices instead of in a one-hot encoded format. One-hot encoding the target sequences with such a huge vocabulary might consume your system's entire memory.
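A rough back-of-the-envelope calculation (assuming float32 one-hot targets) makes the difference concrete:

In [ ]:
# rough estimate: memory for one-hot encoded targets vs the integer targets we use
n_sentences, timesteps, vocab = train_Y.shape[0], train_Y.shape[1], eng_vocab_size
print(n_sentences * timesteps * vocab * 4 / 1e9, 'GB for one-hot float32 targets')
print(train_Y.nbytes / 1e6, 'MB for the integer targets')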

13. Fit the model with your desired number of epochs (e.g. 1 :)), validation_split of 0.2, and batch_size of 128. You can use smaller values for the number of LSTM units (100) and embedding size (50) if it takes a lot of time to run.

In [28]:
# history = model.fit(train_X, train_Y.reshape(train_Y.shape[0], train_Y.shape[1], 1), 
#                    epochs=2, 
#                    batch_size=128, 
#                    validation_split = 0.2,
#                    verbose=1)

With ModelCheckpoint you can save your best model during training. We used this Keras option to run our neural translator for 15 epochs and saved the best model for your use.

In [29]:
filename = 'model.15Epochs'
checkpoint = ModelCheckpoint(filename, 
                             monitor='val_loss', 
                             verbose=1, 
                             save_best_only=True, 
                             mode='min')

history = model.fit(train_X, train_Y.reshape(train_Y.shape[0], train_Y.shape[1], 1),
                    epochs=15,
                    batch_size=128, 
                    validation_split = 0.2,
                    callbacks=[checkpoint],
                    verbose=1)
Epoch 1/15
275/275 [==============================] - ETA: 0s - loss: 1.0358 - accuracy: 0.8638
Epoch 00001: val_loss improved from inf to 0.89008, saving model to model.15Epochs
WARNING:tensorflow:From c:\programdata\miniconda3\lib\site-packages\tensorflow\python\ops\resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 662s 2s/step - loss: 1.0358 - accuracy: 0.8638 - val_loss: 0.8901 - val_accuracy: 0.8688
Epoch 2/15
275/275 [==============================] - ETA: 0s - loss: 0.8327 - accuracy: 0.8756
Epoch 00002: val_loss improved from 0.89008 to 0.79787, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 691s 3s/step - loss: 0.8327 - accuracy: 0.8756 - val_loss: 0.7979 - val_accuracy: 0.8799
Epoch 3/15
275/275 [==============================] - ETA: 0s - loss: 0.7589 - accuracy: 0.8841
Epoch 00003: val_loss improved from 0.79787 to 0.75407, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 708s 3s/step - loss: 0.7589 - accuracy: 0.8841 - val_loss: 0.7541 - val_accuracy: 0.8871
Epoch 4/15
275/275 [==============================] - ETA: 0s - loss: 0.6980 - accuracy: 0.8914
Epoch 00004: val_loss improved from 0.75407 to 0.69438, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 769s 3s/step - loss: 0.6980 - accuracy: 0.8914 - val_loss: 0.6944 - val_accuracy: 0.8930
Epoch 5/15
275/275 [==============================] - ETA: 0s - loss: 0.6487 - accuracy: 0.8966
Epoch 00005: val_loss improved from 0.69438 to 0.66017, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 789s 3s/step - loss: 0.6487 - accuracy: 0.8966 - val_loss: 0.6602 - val_accuracy: 0.8972
Epoch 6/15
275/275 [==============================] - ETA: 0s - loss: 0.6044 - accuracy: 0.9013
Epoch 00006: val_loss improved from 0.66017 to 0.62693, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 776s 3s/step - loss: 0.6044 - accuracy: 0.9013 - val_loss: 0.6269 - val_accuracy: 0.9006
Epoch 7/15
275/275 [==============================] - ETA: 0s - loss: 0.5637 - accuracy: 0.9053
Epoch 00007: val_loss improved from 0.62693 to 0.60318, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 770s 3s/step - loss: 0.5637 - accuracy: 0.9053 - val_loss: 0.6032 - val_accuracy: 0.9033
Epoch 8/15
275/275 [==============================] - ETA: 0s - loss: 0.5271 - accuracy: 0.9090
Epoch 00008: val_loss improved from 0.60318 to 0.58197, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 765s 3s/step - loss: 0.5271 - accuracy: 0.9090 - val_loss: 0.5820 - val_accuracy: 0.9054
Epoch 9/15
275/275 [==============================] - ETA: 0s - loss: 0.4937 - accuracy: 0.9128
Epoch 00009: val_loss improved from 0.58197 to 0.56812, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 1005s 4s/step - loss: 0.4937 - accuracy: 0.9128 - val_loss: 0.5681 - val_accuracy: 0.9074
Epoch 10/15
275/275 [==============================] - ETA: 0s - loss: 0.4634 - accuracy: 0.9163
Epoch 00010: val_loss improved from 0.56812 to 0.55173, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 1010s 4s/step - loss: 0.4634 - accuracy: 0.9163 - val_loss: 0.5517 - val_accuracy: 0.9094
Epoch 11/15
275/275 [==============================] - ETA: 0s - loss: 0.4352 - accuracy: 0.9195
Epoch 00011: val_loss improved from 0.55173 to 0.53879, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 1046s 4s/step - loss: 0.4352 - accuracy: 0.9195 - val_loss: 0.5388 - val_accuracy: 0.9103
Epoch 12/15
275/275 [==============================] - ETA: 0s - loss: 0.4089 - accuracy: 0.9229
Epoch 00012: val_loss improved from 0.53879 to 0.53323, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 1036s 4s/step - loss: 0.4089 - accuracy: 0.9229 - val_loss: 0.5332 - val_accuracy: 0.9116
Epoch 13/15
275/275 [==============================] - ETA: 0s - loss: 0.3835 - accuracy: 0.9264
Epoch 00013: val_loss improved from 0.53323 to 0.52817, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 1049s 4s/step - loss: 0.3835 - accuracy: 0.9264 - val_loss: 0.5282 - val_accuracy: 0.9121
Epoch 14/15
275/275 [==============================] - ETA: 0s - loss: 0.3601 - accuracy: 0.9299
Epoch 00014: val_loss improved from 0.52817 to 0.52340, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 1055s 4s/step - loss: 0.3601 - accuracy: 0.9299 - val_loss: 0.5234 - val_accuracy: 0.9136
Epoch 15/15
275/275 [==============================] - ETA: 0s - loss: 0.3382 - accuracy: 0.9330
Epoch 00015: val_loss improved from 0.52340 to 0.52015, saving model to model.15Epochs
INFO:tensorflow:Assets written to: model.15Epochs\assets
275/275 [==============================] - 1095s 4s/step - loss: 0.3382 - accuracy: 0.9330 - val_loss: 0.5201 - val_accuracy: 0.9139

14. Plot the accuracy and loss of your model for the training and validation sets.

We will use the plot_history function from the previous practical:

In [30]:
plt.style.use('ggplot')
def plot_history(history, val=0):
    acc = history.history['accuracy']
    if val == 1:
        val_acc = history.history['val_accuracy'] # we can add a validation set in our fit function with nn
    loss = history.history['loss']
    if val == 1:
        val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training accuracy')
    if val == 1:
        plt.plot(x, val_acc, 'r', label='Validation accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.title('Accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    if val == 1:
        plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.title('Loss')
    plt.legend()
In [31]:
plot_history(history, val=1)

You can also use the following code without the plot_history function:

In [32]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train','validation'])
plt.show()

15. Predict translations for the test set.

You can continue using your model, or you can use the following code to load the model trained for 15 epochs. Remember that our chosen encoder-decoder architecture is a very simple one for a deep learning translation model.

In [33]:
# model = load_model('model.15Epochs')
In [34]:
preds = model.predict_classes(test_X.reshape((test_X.shape[0],test_X.shape[1])))
WARNING:tensorflow:From <ipython-input-34-4097ca6b98b5>:1: Sequential.predict_classes (from tensorflow.python.keras.engine.sequential) is deprecated and will be removed after 2021-01-01.
Instructions for updating:
Please use instead:
* `np.argmax(model.predict(x), axis=-1)`, if your model does multi-class classification (e.g. if it uses a `softmax` last-layer activation).
* `(model.predict(x) > 0.5).astype("int32")`, if your model does binary classification (e.g. if it uses a `sigmoid` last-layer activation).
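As the warning suggests, predict_classes is deprecated. An equivalent sketch using np.argmax on the predicted probabilities (the replacement recommended above for softmax models):

In [ ]:
# equivalent to predict_classes for a softmax output:
# take the argmax over the vocabulary axis of the predicted probabilities
import numpy as np
probs = model.predict(test_X)      # shape: (n_sentences, 40, eng_vocab_size)
preds = np.argmax(probs, axis=-1)  # shape: (n_sentences, 40)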
In [35]:
preds
Out[35]:
array([[   6, 1044,  330, ...,    0,    0,    0],
       [   2,  184,   23, ...,    0,    0,    0],
       [   2,  416,    7, ...,    0,    0,    0],
       ...,
       [   4,   49,  365, ...,    0,    0,    0],
       [ 120,  143,  486, ...,    0,    0,    0],
       [   3,   35,   19, ...,    0,    0,    0]], dtype=int64)

As you noticed, preds contains only word indices, so we need to convert them back to words to be able to read the translations.

16. Use the sequences_to_texts function to convert the indices in your predictions back into words.

In [36]:
preds_text = eng_tokenizer.sequences_to_texts(preds)
In [1]:
# preds_text

17. Create a new dataframe with three columns showing the input Dutch text of the test set, the actual output, and your predictions. Use the sample() function on your dataframe to randomly inspect some of the rows.

In [44]:
pred_df = pd.DataFrame({'input' : test[:, 1], 'actual output' : test[:,0], 'predicted output' : preds_text})
In [39]:
#pd.set_option('display.max_colwidth', 50)
In [45]:
pred_df.head(15)
Out[45]:
input actual output predicted output
0 een aap beklimt een hoge boom a monkey is climbing up a tall tree a earthquake wearing a a a
1 de vredesonderhandelingen beginnen deze week the peace talks begin this week the children for week
2 het licht is groen the light is green the light is green
3 krijgt u genoeg zuurstof are you getting enough oxygen are you enough enough
4 zijn ze in leven are they alive are we in
5 veel mooie bloemen bloeien in de lente many pretty flowers bloom in the spring many flowers are made in are
6 onbevoegden zijn niet toegestaan op het terrein no unauthorised persons will be permitted inside the perimeter dont not not on the the
7 ik weet dat ik het erin heb gestopt i know i packed it i know i see it
8 het water stijgt the water is rising the ship is is
9 het is iets wat ze altijd wilden doen its something they always wanted to do its is something i can to do
10 ze is net twintig geworden she has just turned twenty she just just been
11 in de zomer draag ik shirts met korte mouwen in the summer i wear shortsleeved shirts i in i i in very
12 je kunt erop vertrouwen dat hij zijn woord zal houden you can trust him to keep his word you can know him to his his
13 zullen we lopen of met de auto gaan shall we walk or drive will we go to the the bus
14 laten we honkbal spelen lets play baseball lets play baseball
In [41]:
pred_df
Out[41]:
actual predicted
0 a monkey is climbing up a tall tree a earthquake wearing a a a
1 the peace talks begin this week the children for week
2 the light is green the light is green
3 are you getting enough oxygen are you enough enough
4 are they alive are we in
... ... ...
10990 i like dogs i love dogs
10991 i noticed you didnt finish your dinner i didnt you you your your your
10992 tom didnt even know who mary was tom didnt even know mary was was
10993 few people are able to understand his theories some people decided people is are are are
10994 you can leave me alone you can me me

10995 rows × 2 columns

In [46]:
pred_df.sample(15)
Out[46]:
input actual output predicted output
4236 wat heb je besloten what have you decided what did you
3325 hij keek neer op de vallei onder hem he looked down at the valley below he looked out out the the
2855 ik denk dat dat aan tom verteld moet worden i think tom needs to be told that i think tom should to to
3365 tom heeft een hekel aan klerken tom hates clerks tom hates hates
916 het werd in olie gebakken that was cooked in oil it was in in
9413 waarom zijn jullie getrouwd why are you married why are you married
8726 de politiemannen arresteerden de inbreker the policemen arrested the burglar the moon says the means
3322 ik zal terug zijn om 230 ill be back at 230 ill be back by
6718 ik kan me niet herinneren wat tom deed i dont remember what tom did i cant remember what did did
10403 zij stierven op het slachtveld they died on the battlefield she is on the south
2019 ik hou van wiskunde i like math i love math
3495 we hebben een half dozijn eieren we have half a dozen eggs we have a eggs eggs
6171 tom bezocht het rijksmuseum toen hij in amsterdam was tom visited the national museum when he was in amsterdam tom ran in when when he he in in
2982 goeiemorgen allemaal good morning everybody good are
4825 dat was nog maar het begin that was only the beginning it was just but

Tatoeba.org (https://tatoeba.org/en/downloads) has a large database of example sentences translated into many languages by volunteers. To get better data for your neural machine translator, you can use this tool to generate and download customized sentence pairs; for example, it has more than one million sentence pairs translated from Dutch to English. This time, try tuning the hyperparameters and adding an attention mechanism to the encoder-decoder architecture.