Character-level Text Generator using PyTorch and Amazon SageMaker
Implementing a simple character-level LSTM model with PyTorch to familiarize ourselves with the PyTorch library and the Amazon SageMaker framework. We will cover how to use Amazon SageMaker to train a model, deploy it as an endpoint service, and invoke it to get predictions.
- Character-level text generator with PyTorch
- Using PyTorch and SageMaker
- General Outline
- Loading the libraries
- Step 1: Downloading and loading the data
- Step 2: Preparing and Processing the data
- Step 3: Upload the data to S3
- Step 4: Build and Train the PyTorch Model
- Create a batch data generator
- Step 5: Testing the model
- Step 6: Deploy the model for inference
- Step 7: Use the model for testing
Character-level text generator with PyTorch
Using PyTorch and SageMaker
Some parts of this notebook have been extracted or modified from a notebook of my exercises in the Machine Learning Engineer Nanodegree.
In this notebook we will implement a simple RNN character model with PyTorch to familiarize ourselves with the PyTorch library and get started with RNNs. The goal is to build a model that can complete your sentence based on a few characters or a word used as input. We will use AWS SageMaker to train, evaluate and deploy the model.
General Outline
Recall the general outline for SageMaker projects using a notebook instance.
- Download or otherwise retrieve the data.
- Process / Prepare the data.
- Upload the processed data to S3.
- Train a chosen model.
- Test the trained model (typically using a batch transform job).
- Deploy the trained model.
- Use the deployed model.
For this project, we will follow the steps in the general outline with some modifications.
First, we will not test the model in its own step. We will still test it, but we will do so by deploying the model and then sending the test data to the deployed endpoint. One of the reasons for doing this is to make sure that our deployed model works correctly before moving forward.
import os
import random as rnd
import numpy as np
import pickle
import time
Step 1: Downloading and loading the data
First, we'll define the sentences that we want our model to output when fed the first word or the first few characters. Our dataset is a text file containing Shakespeare's plays, from which we will extract sequences of characters to use as input to our model. Then our model will learn how to complete sentences the way Shakespeare would.
This dataset can be downloaded from Karpathy's Github account: https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt.
The dataset is stored on our notebook instance; it is small and easy to "move", so we do not need to store it in S3 or another cloud storage service.
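If the file is not already on the instance, a minimal sketch to fetch it (assuming the raw URL of the file in the repository linked above):
import os
import urllib.request

# Download the Tiny Shakespeare dataset into ./data if it is not already there
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
os.makedirs('./data', exist_ok=True)
if not os.path.exists('./data/input.txt'):
    urllib.request.urlretrieve(url, './data/input.txt')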
As in many of my notebooks, we set some variables to the data directory and filenames. If you want to run this code in your own environment, you must change these values:
# Set the root folder
root_folder='.'
# Set the folder with the dataset
data_folder_name='data'
model_folder_name='model'
# Set the filename
filename='input.txt'
# Path to the data folder
DATA_PATH = os.path.abspath(os.path.join(root_folder, data_folder_name))
model_dir = os.path.abspath(os.path.join(root_folder, model_folder_name))
# Set the path where the text for training is stored
train_path = os.path.join(DATA_PATH, filename)
# Set a seed
seed = 1
def load_text_data(filename, init_dialog=False):
    ''' Load the text from filename, splitting it into lines and removing empty strings.
        Setting init_dialog=True will keep the lines where the character who is going to speak is indicated.
    '''
    sentences = []
    with open(filename, 'r') as reader:
        for line in reader:
            # Skip the speaker lines unless init_dialog is set
            if init_dialog or ':' not in line:
                # Append the line to the sentences, removing the end-of-line character
                sentences.append(line[:-1])
    return sentences
Loading the input data, sentences from Shakespeare's plays.
sentences = load_text_data(train_path)
print('Number of sentences: ', len(sentences))
print(sentences[:20])
def clean_text(sentences, alpha=False):
    ''' Cleaning process for the text '''
    if alpha:
        # Remove non-alphabetic characters and lowercase the rest
        cleaned_text = [''.join([t.lower() for t in text if t.isalpha() or t.isspace()]) for text in sentences]
    else:
        # Simply lowercase the characters
        cleaned_text = [t.lower() for t in sentences]
    # Remove any empty string
    cleaned_text = [t for t in cleaned_text if t != '']
    return cleaned_text
sentences = clean_text(sentences, False)
# Join all the sentences into one long string
sentences = ' '.join(sentences)
print('Number of characters: ', len(sentences))
print(sentences[:100])
Our input data is a sequence of about 900,000 characters. We will extract the label data from this sequence and split it into a training and a validation dataset, but we will do these tasks after encoding the text data.
class CharVocab:
    ''' Create a vocabulary for the characters in a corpus '''
    def __init__(self, type_vocab, pad_token='<PAD>', eos_token='<EOS>', unk_token='<UNK>'):
        # Initialize the type of vocabulary and the optional special tokens
        self.type = type_vocab
        self.int2char = []
        if pad_token is not None:
            self.int2char += [pad_token]
        if eos_token is not None:
            self.int2char += [eos_token]
        if unk_token is not None:
            self.int2char += [unk_token]
        self.char2int = {}

    def __call__(self, text):
        # When called, build the vocabulary from the text
        # Join all the sentences together and extract the unique characters from the combined sentences
        chars = set(''.join(text))
        # Extend the list that maps integers to the characters
        self.int2char += list(chars)
        # Create the dictionary that maps characters to integers
        self.char2int = {char: ind for ind, char in enumerate(self.int2char)}
vocab = CharVocab('char', None, None, '<UNK>')
vocab(sentences)
print('Length of vocabulary: ', len(vocab.int2char))
print('Int to Char: ', vocab.int2char)
print('Char to Int: ', vocab.char2int)
Save the dictionary
In this example it is not mandatory to save the dictionary immediately, because it is a fast and easy-to-reproduce task. But when dealing with a huge corpus and a large dictionary, we should save the dictionary so we can restore it later when running new experiments.
Later on, when we construct an endpoint which processes a submitted text, we will need to make use of the char2int and int2char dictionaries which we have created. As such, we will save them to a file now for future use.
# Check or create the directory where the dictionaries will be saved
if not os.path.exists(DATA_PATH):
    os.makedirs(DATA_PATH)
# Save the dictionaries to the data path dir
with open(os.path.join(DATA_PATH, 'char_dict.pkl'), "wb") as f:
    pickle.dump(vocab.char2int, f)
with open(os.path.join(DATA_PATH, 'int_dict.pkl'), "wb") as f:
    pickle.dump(vocab.int2char, f)
Create the input data and labels for training
As we're going to predict the next character in the sequence at each time step, we'll have to divide each sentence into:
- Input data: The last input character should be excluded as it does not need to be fed into the model (it is the target label for the last input character)
- Target/Ground Truth Label: One time-step ahead of the Input data as this will be the "correct answer" for the model at each time step corresponding to the input data
def one_hot_encode(indices, dict_size):
    ''' One-hot encode a matrix of integer sequences '''
    # Encode every integer with its one-hot representation
    features = np.eye(dict_size, dtype=np.float32)[indices.flatten()]
    # Finally reshape it to get back to the original array shape, plus the one-hot dimension
    features = features.reshape((*indices.shape, dict_size))
    return features
def encode_text(input_text, vocab, one_hot=False):
    ''' Encode a text as a sequence of integers, optionally one-hot encoded '''
    # Replace every char by its integer value based on the vocabulary
    output = [vocab.char2int.get(character, 0) for character in input_text]
    if one_hot:
        # One-hot encode every integer of the sequence
        dict_size = len(vocab.char2int)
        return one_hot_encode(np.array(output), dict_size)
    else:
        return np.array(output)
Now we can encode our text, replacing every character by its integer value in the dictionary. Once our dataset is unified and prepared, we should do a quick check and see an example of the data our model will be trained on. This is generally a good idea as it allows you to see how each of the processing steps affects the data, and it also ensures that the data has been loaded correctly.
# Encode the train dataset
train_data = encode_text(sentences, vocab, one_hot = False)
# Create the input sequence, from 0 to len-1
input_seq=train_data[:-1]
# Create the target sequence, from 1 to len. It is right-shifted one place
target_seq=train_data[1:]
print('\nOriginal text:')
print(sentences[:100])
print('\nEncoded text:')
print(train_data[:100])
print('\nInput sequence:')
print(input_seq[:100])
print('\nTarget sequence:')
print(target_seq[:100])
Let's check the one-hot encoding function that we will use later during the training phase:
print('Encoded characters: ',train_data[100:102])
print('One-hot-encoded characters: ',one_hot_encode(train_data[100:102], len(vocab.int2char)))
Step 3: Upload the data to S3
Now we need to upload the training dataset to S3 so that our training code can access it. First we will save it locally, and then upload it to S3.
Save the processed training dataset locally
It is important to note the format of the data that we are saving, as we will need to know it when we write the training code. In our case, we will save the dataset as a pickle object: an array containing the whole dataset, with every character encoded as an integer value.
# Save the encoded text to a file
encoded_data = os.path.join(DATA_PATH, 'input_data.pkl')
with open(encoded_data, 'wb') as fp:
    pickle.dump(train_data, fp)
import sagemaker
# Get the SageMaker session
sagemaker_session = sagemaker.Session()
# Get the bucket, in our example the default bucket
bucket = sagemaker_session.default_bucket()
# Set the S3 subfolder where our data will be stored
prefix = 'sagemaker/char_level_rnn'
# Get the role for permission
role = sagemaker.get_execution_role()
input_data = sagemaker_session.upload_data(path=DATA_PATH, bucket=bucket, key_prefix=prefix)
NOTE: The cell above uploads the entire contents of our data directory. This includes the char_dict.pkl and int_dict.pkl files. This is fortunate as we will need them later on when we create an endpoint that accepts an arbitrary input text. For now, we will just take note of the fact that they reside in the data directory (and so also in the S3 training bucket) and that we will need to make sure they get saved in the model directory.
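One way to ensure that (a sketch; args.data_dir and args.model_dir are illustrative names for the directories that train.py receives from SageMaker) is to copy the dictionaries into the model directory at the end of training:
import os
import shutil

# Inside train.py, after training: copy the vocabulary files from the
# training data directory into the model directory so that SageMaker
# packages them with the model artifacts.
for fname in ('char_dict.pkl', 'int_dict.pkl'):
    shutil.copyfile(os.path.join(args.data_dir, fname),
                    os.path.join(args.model_dir, fname))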
Step 4: Build and Train the PyTorch Model
A model in the SageMaker framework, in particular, comprises three objects:
- Model Artifacts,
- Training Code, and
- Inference Code,
each of which interact with one another.
We will start by implementing our own neural network in PyTorch along with a training script. For the purposes of this project we need to provide the model object implementation in the model.py
file, inside of the train
folder. You can see the provided implementation by running the cell below.
!pygmentize train/model.py
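For reference, here is a minimal sketch of what such a model can look like, consistent with how it is used later in this notebook (the constructor arguments and the init_state() signature match the calls below, but the actual train/model.py may differ in its details):
import torch
from torch import nn

class RNNModel(nn.Module):
    ''' Sketch of a character-level LSTM model '''
    def __init__(self, vocab_size, output_size, hidden_dim, n_layers, drop_rate=0.2):
        super(RNNModel, self).__init__()
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        # LSTM over one-hot encoded characters, input shape (batch, seq_len, vocab_size)
        self.lstm = nn.LSTM(vocab_size, hidden_dim, n_layers, batch_first=True,
                            dropout=drop_rate if n_layers > 1 else 0.0)
        # Fully connected layer mapping the hidden state to character scores
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x, hidden):
        out, hidden = self.lstm(x, hidden)
        # Flatten to (batch * seq_len, hidden_dim) so the loss sees one row per character
        out = out.contiguous().view(-1, self.hidden_dim)
        return self.fc(out), hidden

    def init_state(self, device, batch_size=1):
        # Initial hidden and cell states, all zeros
        weight = next(self.parameters())
        return (weight.new_zeros(self.n_layers, batch_size, self.hidden_dim).to(device),
                weight.new_zeros(self.n_layers, batch_size, self.hidden_dim).to(device))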
Create a batch data generator
When training on the dataset, we need to extract a batch of examples from the inputs and targets, run the forward and backward passes of the RNN on them, and then repeat the iteration with another batch of examples. A batch generator will help us extract these batches from our datasets.
The next code defines our batch generator:
def batch_generator_sequence(features_seq, label_seq, batch_size, seq_len):
    """Generator function that yields batches of data (input and target).

    Args:
        features_seq (array): sequence of encoded input characters to group into batches.
        label_seq (array): sequence of encoded target characters to group into batches.
        batch_size (int): number of example sequences per batch.
        seq_len (int): length, in characters, of each example sequence.

    Yields:
        tuple: a batch of inputs and a batch of targets, each of shape (batch_size, seq_len).
    """
    # Calculate the number of batches we can supply
    num_batches = len(features_seq) // (batch_size * seq_len)
    if num_batches == 0:
        raise ValueError("No batches created. Use smaller batch size or sequence length.")
    # Calculate the effective length of text to use
    rounded_len = num_batches * batch_size * seq_len
    # Reshape the features matrix into shape (batch_size, num_batches * seq_len)
    x = np.reshape(features_seq[:rounded_len], [batch_size, num_batches * seq_len])
    # Reshape the target matrix into shape (batch_size, num_batches * seq_len)
    y = np.reshape(label_seq[:rounded_len], [batch_size, num_batches * seq_len])
    epoch = 0
    while True:
        # Roll the data so there is no need to reset the RNN states between epochs
        x_epoch = np.split(np.roll(x, -epoch, axis=0), num_batches, axis=1)
        y_epoch = np.split(np.roll(y, -epoch, axis=0), num_batches, axis=1)
        for batch in range(num_batches):
            yield x_epoch[batch], y_epoch[batch]
        epoch += 1
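As a quick sanity check of the generator (hypothetical toy values, just to inspect the shapes it yields):
# Toy sequence of 1,000 integer-encoded characters
toy_seq = np.arange(1000)
gen = batch_generator_sequence(toy_seq[:-1], toy_seq[1:], batch_size=4, seq_len=10)
x_batch, y_batch = next(gen)
# Each batch has shape (batch_size, seq_len)
print(x_batch.shape, y_batch.shape)  # (4, 10) (4, 10)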
def train_main(model, optimizer, loss_fn, batch_data, num_batches, val_batches, batch_size, seq_len, n_epochs, clip_norm, device):
    # Training run
    for epoch in range(1, n_epochs + 1):
        start_time = time.time()
        # Store the loss of every batch iteration
        epoch_losses = []
        # Init the hidden state
        hidden = model.init_state(device, batch_size)
        # Train on all the batches in every epoch
        for i in range(num_batches - val_batches):
            # Get the next batch of input and target data
            input_batch, target_batch = next(batch_data)
            # One-hot encode the input data
            input_batch = one_hot_encode(input_batch, model.vocab_size)
            # Transform to tensors
            input_data = torch.from_numpy(input_batch)
            target_data = torch.from_numpy(target_batch)
            # Detach the hidden state from its history, necessary to calculate the gradients
            hidden = tuple([Variable(var.data) for var in hidden])
            # Move the input data to the device
            input_data = input_data.to(device)
            # Set the model to train mode and clear existing gradients from the previous batch
            model.train()
            optimizer.zero_grad()
            # Forward pass of the RNN
            output, hidden = model(input_data, hidden)
            # Move the target data to the device and flatten it to (batch_size * seq_len,)
            target_data = target_data.to(device)
            target_data = torch.reshape(target_data, (batch_size * seq_len,))
            loss = loss_fn(output, target_data)
            # Save the loss
            epoch_losses.append(loss.item())
            # Backpropagation: calculate the gradients
            loss.backward()
            # Clip the gradient norm to avoid exploding gradients
            nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
            # Update the weights accordingly
            optimizer.step()

        # When the epoch is finished, evaluate the model on the validation data
        model.eval()
        val_hidden = model.init_state(device, batch_size)
        val_losses = []
        with torch.no_grad():
            for i in range(val_batches):
                # Get the next batch of input and target data
                input_batch, target_batch = next(batch_data)
                # One-hot encode the input data
                input_batch = one_hot_encode(input_batch, model.vocab_size)
                # Transform to tensors and move them to the device
                input_data = torch.from_numpy(input_batch).to(device)
                target_data = torch.from_numpy(target_batch).to(device)
                # Detach the hidden state from its history
                val_hidden = tuple([Variable(var.data) for var in val_hidden])
                # Forward pass of the RNN
                output, val_hidden = model(input_data, val_hidden)
                # Flatten the target to (batch_size * seq_len,)
                target_data = torch.reshape(target_data, (batch_size * seq_len,))
                loss = loss_fn(output, target_data)
                # Save the loss
                val_losses.append(loss.item())
        model.train()

        print('Epoch: {}/{}.............'.format(epoch, n_epochs), end=' ')
        print('Time: {:.4f}'.format(time.time() - start_time), end=' ')
        print("Train Loss: {:.4f}".format(np.mean(epoch_losses)), end=' ')
        print("Val Loss: {:.4f}".format(np.mean(val_losses)))
    return epoch_losses
Assuming we have the training method above, we will test that it works by writing a bit of code in the notebook that executes it on a small sample training set. Because we are not using a GPU and we are just testing the training code, we take only 50,000 characters from the input data. The reason for doing this in the notebook is to have an opportunity to fix any errors that arise early, when they are easier to diagnose.
import torch
from torch import nn
from torch.autograd import Variable
from tqdm import tqdm
from train.model import RNNModel
# Set a seed to reproduce experiments
torch.manual_seed(seed)
# Set the device for training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Create the model: input and output size equal to the vocabulary size (38), hidden dim 16, 1 layer
model = RNNModel(38, 38, 16, 1).to(device)
# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Limit the size of our input sequence for this simple test
input_seq = input_seq[:50000]
target_seq = target_seq[:50000]
# Calculate the number of batches to train
batch_size=32
maxlen=64
num_batches = len(input_seq) // (batch_size*maxlen)
# Calculate the validation batches
val_frac = 0.1
val_batches = int(num_batches*val_frac)
# Create the batch data generator
batch_data = batch_generator_sequence(input_seq, target_seq, batch_size, maxlen)
# Train the model for 5 epochs, clipping the gradient norm at 5
losses = train_main(model, optimizer, criterion, batch_data, num_batches, val_batches, batch_size, maxlen, 5, 5, device)
In order to construct a PyTorch model using SageMaker we must provide SageMaker with a training script. We may optionally include a directory which will be copied to the container and from which our training code will be run. When the training container is executed it will check the uploaded directory (if there is one) for a requirements.txt file and install any required Python libraries, after which the training script will be run.
In this example, we only require the numpy package.
Training the model
When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file which will be executed when the model is trained. Inside of the train
directory is a file called train.py
which contains most of the necessary code to train our model.
NOTICE: The train_main() method written above has been pasted into the train/train.py file where required.
The way that SageMaker passes hyperparameters to the training script is by way of arguments. These arguments can then be parsed and used in the training script. To see how this is done take a look at the provided train/train.py
file.
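As a rough sketch of that parsing (the hyperparameter names match the ones we pass to the estimator below; SM_MODEL_DIR and SM_CHANNEL_TRAINING are environment variables set by the SageMaker PyTorch container, and the argument names here are illustrative):
import argparse
import os

parser = argparse.ArgumentParser()
# Hyperparameters arrive as command-line arguments
parser.add_argument('--epochs', type=int, default=10)
parser.add_argument('--hidden_dim', type=int, default=256)
parser.add_argument('--n_layers', type=int, default=2)
# SageMaker environment variables for the model and training data locations
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
args = parser.parse_args()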
First, we need to set which type of instance will run our training:
- Local: We do not launch a real compute instance, just a container where our scripts run. This scenario is very useful to test that the training script works, because it is faster to spin up a container than a compute instance. Once we confirm that everything works, we must change the instance type to a "real" training instance.
- ml.m4.4xlarge: A CPU instance
- ml.p2.xlarge: A GPU instance to use when training on a big volume of data.
# Select the type of instance to use for training
#instance_type='ml.m4.4xlarge' # CPU instance
instance_type='ml.p2.xlarge' # GPU instance
#instance_type='local'
from sagemaker.pytorch import PyTorch
estimator = PyTorch(entry_point="train.py",
source_dir="train",
role=role,
framework_version='0.4.0',
train_instance_count=1,
train_instance_type=instance_type,
hyperparameters={
'epochs': 50,
'hidden_dim': 512,
'n_layers': 2,
})
estimator.fit({'training': input_data})
Step 6: Deploy the model for inference
Now that our model is trained, it's time to create some custom inference code so that we can send the model an initial string which has not been processed and generate the next characters of the string.
By default the estimator which we created, when deployed, will use the entry script and directory which we provided when creating the model. However, since we wish to accept a string as input and our model expects processed input, we need to write some custom inference code.
We will store the code that we write in the serve
directory. Provided in this directory is the model.py
file that we used to construct our model, a utils.py
file which contains the one-hot-encode
and encode_text
pre-processing functions which we used during the initial data processing, and predict.py
, the file which will contain our custom inference code. Note also that requirements.txt
is present which will tell SageMaker what Python libraries are required by our custom inference code.
When deploying a PyTorch model in SageMaker, you are expected to provide four functions which the SageMaker inference container will use.
- model_fn: This function is the same function that we used in the training script and it tells SageMaker how to load our model. It must be called model_fn() and takes as its only parameter a path to the directory where the model artifacts are stored. It must also be present in the Python file which we specified as the entry point. It also reads the saved dictionaries, because they may be used during the inference process.
- input_fn: This function receives the raw serialized input that has been sent to the model's endpoint, and its job is to de-serialize the input and make it available to the inference code. Later we will describe what our input_fn function does.
- output_fn: This function takes the output of the inference code, and its job is to serialize this output and return it to the caller of the model's endpoint.
- predict_fn: The heart of the inference script, this is where the actual prediction is done and is the function which you will need to complete.
For the simple example that we are constructing in this project, the input_fn and output_fn methods are relatively straightforward. We need to accept a string as input, composed of the desired length of the output and the initial string, and we expect to return a single string as output: the newly generated text. You might imagine though that in a more complex application the input or output may be image data or some other binary data which would require some effort to serialize.
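A minimal sketch of what these two methods can look like under that convention (the provided serve/predict.py may differ slightly):
def input_fn(serialized_input_data, content_type='text/plain'):
    # The endpoint receives plain text in the form '<output length>-<initial string>'
    return serialized_input_data.decode('utf-8')

def output_fn(prediction_output, accept='text/plain'):
    # The prediction is already a string, return it as-is
    return prediction_output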
Writing inference code
Before writing our custom inference code, we will begin by taking a look at the code which has been provided.
!pygmentize serve/predict.py
As mentioned earlier, the model_fn method is the same as the one provided in the training code, and the input_fn and output_fn methods are very simple. Finally, we must build a predict_fn method that will receive the input string, encode it (char2int), one-hot encode it and send it to the model. Every output will be decoded (int2char) and appended to the final output string.
Make sure that you save the completed file as predict.py
in the serve
directory.
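As a reference, here is a condensed sketch of that generation loop. It assumes model_fn has attached the loaded dictionaries to the model as model.char2int and model.int2char; that convention and the greedy decoding are illustrative, not necessarily what serve/predict.py does:
import numpy as np
import torch
from utils import one_hot_encode  # helper provided in serve/utils.py

def predict_fn(input_data, model):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # The input arrives as '<output length>-<initial string>'
    out_len, text = input_data.split('-', 1)
    chars = list(text)
    model.eval()
    with torch.no_grad():
        for _ in range(int(out_len)):
            # Encode the running text (char2int) and one-hot encode it: shape (1, len, vocab)
            seq = np.array([[model.char2int.get(c, 0) for c in chars]])
            inp = torch.from_numpy(one_hot_encode(seq, model.vocab_size)).to(device)
            hidden = model.init_state(device, 1)
            output, hidden = model(inp, hidden)
            # Greedily pick the most likely next character from the last time step and decode it
            chars.append(model.int2char[output[-1].argmax().item()])
    return ''.join(chars)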
Deploying the model
Now that the custom inference code has been written, we will create and deploy our model. To begin with, we need to construct a new PyTorchModel object which points to the model artifacts created during training and also points to the inference code that we wish to use. Then we can call the deploy method to launch the deployment container.
NOTE: The default behaviour for a deployed PyTorch model is to assume that any input passed to the predictor is a numpy array. In our case we want to send a string, so we need to construct a simple wrapper around the RealTimePredictor class to accommodate simple strings. In a more complicated situation you may want to provide a serialization object, for example if you wanted to send image data.
NOTE: When deploying a model you are asking SageMaker to launch a compute instance that will wait for data to be sent to it. As a result, this compute instance will continue to run until you shut it down. This is important to know since the cost of a deployed endpoint depends on how long it has been running.
In other words, if you are no longer using a deployed endpoint, shut it down!
Now, we can deploy our trained model
Loading a previously trained model
In many situations, you have trained the model in another execution of this notebook, or you shut down the notebook while the model was training. In that case the estimator variable is empty or undefined, and you want to restore a previous training job and deploy it. In the next cell, we attach that trained model to the estimator variable and continue with the necessary steps to launch and deploy the model.
# Attach the estimator to a previously trained job
from sagemaker.pytorch import PyTorch
my_training_job_name = 'sagemaker-pytorch-2020-09-02-19-49-57-475'
estimator = PyTorch.attach(my_training_job_name)
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel
class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')
model = PyTorchModel(model_data=estimator.model_data,
role = role,
framework_version='0.4.0',
entry_point='predict.py',
source_dir='serve',
predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
Step 7: Use the model for testing
Now that we have deployed our model with the custom inference code, we should test it to see if everything is working. Here we test our model by creating an initial string, sending it to the endpoint and collecting the result. But we also want to tell the inference code how long the expected output should be.
This means that we need to send our predict function not only the initial string but also the length of the output. As the input expected by the deserializing function, input_fn, is a string, and we are looking for a simple solution, our input data will be a string composed as: the length of the output + '-' + the initial string.
Now, it is time to test our model, sending a very common initial string: you are
.
test_text = '100-you are '
new_text = predictor.predict(test_text).decode('utf-8')
print(new_text)
Another example to try out is to send the model a text included in the training dataset and see what the model predicts:
print('Text: ',sentences[963:1148])
init_text = sentences[963:1148]
print('Init text: ', sentences[963:1020])
test_text = str(len(init_text))+'-'+init_text
new_text = predictor.predict(test_text).decode('utf-8')
print(new_text)
We can check that the model "remembers" the texts seen during training: it can mostly reproduce the original text.
Now that we know our endpoint is working as expected, we can set up a web page or app that will interact with it.
Make sure to skip down to the end of this notebook and shut down your endpoint. You can deploy it again when you come back.
predictor.delete_endpoint()