SageMaker Experiments, TensorFlow script mode training and restore checkpoint to resume training

Some sections of this notebook have been inspired by the tutorial:

Sagemaker Python SDK Examples: tensorflow_script_mode_training_and_serving.ipynb

https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/tensorflow_script_mode_training_and_serving.ipynb

In this notebook we will describe the most relevant steps to start training a custom algorithm on AWS SageMaker using script mode (no custom container), showing how to work with experiments and how to solve some of the problems that come up with custom models in script mode. Some basic SageMaker concepts will not be detailed in order to focus on the relevant ones.

The following steps will be explained:

  1. Create an Experiment and Trial to keep track of our experiments

  2. Load the training data to our training instance

  3. Create the scripts to train our custom model, a Transformer.

  4. Create an Estimator to train our model in a Tensorflow 2.1 container in script mode

  5. Create metric definitions to keep track of them in SageMaker

  6. Download the trained model to make predictions

  7. Resume training using the latest checkpoint from a previous training

Amazon SageMaker Overview

Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.

Amazon SageMaker Developer Guide

Amazon SageMaker provides many tools to help developers manage the machine learning lifecycle workflow:

  • Fetch, clean and transform the data: you can use SageMaker notebook instances to manipulate and analyze your data, then clean and transform it to the format required by your algorithm. You can also use the Pipelines functionality to serve the data to your model during training.
  • Train and evaluate the model: there are many different possibilities to train your model. You can use built-in algorithms and models provided by SageMaker, you can use custom code to train with the most popular deep learning frameworks (TensorFlow, PyTorch, Apache MXNet, ...), or you can even use Apache Spark. Finally, you can bring your own custom algorithm, build a Docker container and train the model on SageMaker. You can keep track of your model metrics to evaluate its performance.
  • Deploy your model: once your model is trained, you can deploy it to an endpoint service in SageMaker and make predictions one at a time or in batch mode.


A simple and popular way to get started and work with SageMaker is to use the Amazon SageMaker Python SDK. It provides Python APIs and containers that make it easy to train and deploy models in SageMaker, as well as examples for use with several different machine learning and deep learning frameworks.

Problem description

For this project we will develop notebooks and scripts to train a Transformer TensorFlow 2 model to solve a neural machine translation problem, translating simple sentences from English to Spanish. This problem and the model are extensively described in my Medium post "Attention is all you need: Discovering the Transformer paper".

Data description

For this exercise, we’ll use pairs of simple sentences. The source text will be in English, and the target text will be in Spanish, from the Tatoeba project where people contribute, adding translations every day. This is the link to some translations in different languages. There you can download the Spanish/English spa_eng.zip file; it contains 124,457 pairs of sentences.
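If you want to take a quick look at the raw data first, the snippet below is a minimal sketch (it assumes the file has already been downloaded and unzipped to data/spa.txt, the same local path used later in this notebook). It reads the file the same way the training script will: tab-separated columns with the English sentence first and the Spanish translation second.

import pandas as pd

# Quick peek at the raw data (sketch only): tab-separated, English in the
# first column, Spanish in the second; any extra columns are ignored.
pairs = pd.read_csv('data/spa.txt', sep="\t", header=None,
                    names=['input', 'target'], usecols=[0, 1], nrows=5)
pairs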

Set up the environment

Let's start by setting up the environment:

First, we will import and load the libraries to use in our project.

%load_ext autoreload
%autoreload 2
import os
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import time
import pickle

import tensorflow as tf
# Create a SageMaker session to work with
sagemaker_session = sagemaker.Session()
# Get the role of our user and the region
role = get_execution_role()
region = sagemaker_session.boto_session.region_name
print(role)
print(region)
arn:aws:iam::223817798831:role/service-role/AmazonSageMaker-ExecutionRole-20200708T194212
us-east-1

Define global variables and parameters

# Set the variables for data locations
data_folder_name='data'
train_filename = 'spa.txt'
non_breaking_en = 'nonbreaking_prefix.en'
non_breaking_es = 'nonbreaking_prefix.es'
# Set the directories for our model output
trainedmodel_path = 'trained_model'
output_data_path = 'output_data'
# Set the name of the artifacts that our model generates (model not included)
model_info_file = 'model_info.pth'
input_vocab_file = 'in_vocab.pkl'
output_vocab_file = 'out_vocab.pkl'
# Set the absolute path of the train data 
train_file = os.path.abspath(os.path.join(data_folder_name, train_filename))
non_breaking_en_file = os.path.abspath(os.path.join(data_folder_name, non_breaking_en))
non_breaking_es_file = os.path.abspath(os.path.join(data_folder_name, non_breaking_es))

When working with Amazon SageMaker training jobs that run on containers in a new instance or "VM", the data has to be shared using an S3 storage folder. For this purpose we define the bucket name and the folder names where our inputs and outputs will be stored. In our case we define:

  • The training data URI: where our input data is located
  • The output folder: where our training job saves the outputs from our model
  • The checkpoint folder: where our model uploads the checkpoints
# Specify your bucket name
bucket_name = 'edumunozsala-ml-sagemaker'
# Set the training data folder in S3
training_folder = r'transformer-nmt/train'
# Set the output folder in S3
output_folder = r'transformer-nmt'
# Set the checkpoint folder in S3 for our model 
ckpt_folder = r'transformer-nmt/ckpt'

training_data_uri = r's3://' + bucket_name + r'/' + training_folder
output_data_uri = r's3://' + bucket_name + r'/' + output_folder
ckpt_data_uri = r's3://' + bucket_name + r'/' + ckpt_folder
training_data_uri,output_data_uri,ckpt_data_uri
('s3://edumunozsala-ml-sagemaker/transformer-nmt/train',
 's3://edumunozsala-ml-sagemaker/transformer-nmt',
 's3://edumunozsala-ml-sagemaker/transformer-nmt/ckpt')

Then we can upload the files needed for training to the training data folder in S3: the training data, the non-breaking prefixes for the inputs (English) and the non-breaking prefixes for the outputs (Spanish). Once uploaded, they can be loaded for training in the SageMaker container.

inputs = sagemaker_session.upload_data(train_file,
                              bucket=bucket_name, 
                              key_prefix=training_folder)

sagemaker_session.upload_data(non_breaking_en_file,
                              bucket=bucket_name, 
                              key_prefix=training_folder)

sagemaker_session.upload_data(non_breaking_es_file,
                              bucket=bucket_name, 
                              key_prefix=training_folder)
's3://edumunozsala-ml-sagemaker/transformer-nmt/train/nonbreaking_prefix.es'

Create an experiment and trial

Amazon SageMaker Experiments is a capability of Amazon SageMaker that lets you organize, track, compare, and evaluate your machine learning experiments.

Machine learning is an iterative process. You need to experiment with multiple combinations of data, algorithm and parameters, all the while observing the impact of incremental changes on model accuracy. Over time this iterative experimentation can result in thousands of model training runs and model versions. This makes it hard to track the best performing models and their input configurations. It’s also difficult to compare active experiments with past experiments to identify opportunities for further incremental improvements.

Experiments will help us to organize and manage all executions, metrics and results of a ML project.

# Install the library necessary to handle experiments
!pip install sagemaker-experiments
Collecting sagemaker-experiments
  Using cached sagemaker_experiments-0.1.24-py3-none-any.whl (36 kB)
Requirement already satisfied: boto3>=1.12.8 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from sagemaker-experiments) (1.16.9)
Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from boto3>=1.12.8->sagemaker-experiments) (0.3.3)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from boto3>=1.12.8->sagemaker-experiments) (0.10.0)
Requirement already satisfied: botocore<1.20.0,>=1.19.9 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from boto3>=1.12.8->sagemaker-experiments) (1.19.9)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from botocore<1.20.0,>=1.19.9->boto3>=1.12.8->sagemaker-experiments) (2.8.1)
Requirement already satisfied: urllib3<1.26,>=1.25.4; python_version != "3.4" in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from botocore<1.20.0,>=1.19.9->boto3>=1.12.8->sagemaker-experiments) (1.25.10)
Requirement already satisfied: six>=1.5 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.20.0,>=1.19.9->boto3>=1.12.8->sagemaker-experiments) (1.14.0)
Installing collected packages: sagemaker-experiments
Successfully installed sagemaker-experiments-0.1.24
WARNING: You are using pip version 20.0.2; however, version 20.2.4 is available.
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/tensorflow2_p36/bin/python -m pip install --upgrade pip' command.

Load the libraries to handle experiments

# Import the libraries to work with Experiments in SageMaker
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent

Set the experiment and trial name and one tag to help us identify the purpose of these items.

# Set the experiment name
experiment_name='tf-transformer'
# Set the trial name 
trial_name="{}-{}".format(experiment_name,'single-gpu')

tags = [{'Key': 'my-experiments', 'Value': 'transformerEngSpa1'}]

You can create an experiment to track all the model training iterations. Experiments are a great way to organize your data science work. You can create experiments to organize all your model development work for: a business use case you are addressing (e.g. an experiment named “customer churn prediction”), a data science team that owns the experiment (e.g. “marketing analytics experiment”), or a specific data science and ML project. Think of it as a “folder” for organizing your “files”.

We will create a Trial to track each training job run. This is just a simple example, not intended to explore all the capabilities of the product.

# create the experiment if it doesn't exist
try:
    training_experiment = Experiment.load(experiment_name=experiment_name)
    print('Loaded experiment ',experiment_name)
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        training_experiment = Experiment.create(experiment_name=experiment_name,
                                      description = "Experiment to track trainings on my tensorflow Transformer Eng-Spa", 
                                      tags = tags)
        print('Created experiment ',experiment_name)
# create the trial if it doesn't exist
try:
    single_gpu_trial = Trial.load(trial_name=trial_name)
    print('Loaded trial ',trial_name)
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        single_gpu_trial = Trial.create(experiment_name=experiment_name, 
                             trial_name= trial_name,
                             tags = tags)
        print('Created trial ',trial_name)
Loaded experiment  tf-transformer
Loaded trial  tf-transformer-single-gpu

Trackers

Another interesting tool to mention is the Tracker object. Trackers can store information about different types of objects in our model or training process, such as inputs, parameters, artifacts or metrics. A tracker is attached to a trial, associating that information with the training job, so we can record it and analyze it later in the experiment. Note that only parameters, input artifacts and output artifacts are saved to SageMaker; metrics are saved to file.

As an example, we create a Tracker to register the input data and two parameters about how that data is processed in our project.

from smexperiments.tracker import Tracker
# Create the tracker for the input data
tracker_name='TextPreprocessing'
trial_comp_name = None # Change to an existing TrialComponent name to load it

try:
    tracker = Tracker.load(trial_component_name=trial_comp_name)
    print('Loaded Tracker ',tracker_name)
except Exception as ex:
    tracker = Tracker.create(display_name=tracker_name)
    tracker.log_input(name="EngtoSpa Translations", media_type="s3/uri", value=inputs)
    tracker.log_parameters({
        "Tokenizer": 'Subword',
        "Max Length": 15,
    })
    print('Created Tracker ',tracker_name)
    
# Attach the Tracker to the trial
single_gpu_trial.add_trial_component(tracker.trial_component)
Created Tracker  TextPreprocessing

Our last step consists of creating the experiment configuration, a dictionary containing the experiment name, the trial name and the trial component display name; it will be used to label our training job.

# Create a configuration definition for our experiment and trial
trial_comp_name = 'single-gpu-components'
# Set the configuration parameters for the experiment
experiment_config = {'ExperimentName': training_experiment.experiment_name, 
                       'TrialName': single_gpu_trial.trial_name,
                       'TrialComponentDisplayName': trial_comp_name}

Check and show information about the experiment and trial

print('Experiment: ',training_experiment.experiment_name)
# Show the trials in the experiment
#for trial in training_experiment.list_trials():
    #print('Trial: ',trial.trial_name)

for trial_comp in TrialComponent.list(trial_name=single_gpu_trial.trial_name):
        print('Trial Components: ',trial_comp)
Experiment:  tf-transformer
Trial Components:  TrialComponentSummary(trial_component_name='TrialComponent-2020-11-12-115920-sbov',trial_component_arn='arn:aws:sagemaker:us-east-1:223817798831:experiment-trial-component/trialcomponent-2020-11-12-115920-sbov',display_name='TextPreprocessing',creation_time=datetime.datetime(2020, 11, 12, 11, 59, 20, 739000, tzinfo=tzlocal()),created_by={},last_modified_time=datetime.datetime(2020, 11, 12, 11, 59, 20, 739000, tzinfo=tzlocal()),last_modified_by={})
Trial Components:  TrialComponentSummary(trial_component_name='tf-transformer-single-gpu-2020-11-12-11-44-28-aws-training-job',trial_component_arn='arn:aws:sagemaker:us-east-1:223817798831:experiment-trial-component/tf-transformer-single-gpu-2020-11-12-11-44-28-aws-training-job',display_name='single-gpu-components',trial_component_source={'SourceArn': 'arn:aws:sagemaker:us-east-1:223817798831:training-job/tf-transformer-single-gpu-2020-11-12-11-44-28', 'SourceType': 'SageMakerTrainingJob'},status=TrialComponentStatus(primary_status='Failed',message='Status: Failed, secondary status: Failed, failure reason: AlgorithmError: ExecuteUserScriptError:\nCommand "/usr/bin/python3 train.py --epochs 8 --model_dir s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-11-44-28/model --non_breaking_in nonbreaking_prefix.en --non_breaking_out nonbreaking_prefix.es --nsamples 60000 --resume False --train_file spa.txt".'),creation_time=datetime.datetime(2020, 11, 12, 11, 44, 32, 948000, tzinfo=tzlocal()),created_by={},last_modified_time=datetime.datetime(2020, 11, 12, 11, 50, 27, 732000, tzinfo=tzlocal()),last_modified_by={})
Trial Components:  TrialComponentSummary(trial_component_name='TrialComponent-2020-11-12-113905-rpfc',trial_component_arn='arn:aws:sagemaker:us-east-1:223817798831:experiment-trial-component/trialcomponent-2020-11-12-113905-rpfc',display_name='TextPreprocessing',creation_time=datetime.datetime(2020, 11, 12, 11, 39, 5, 995000, tzinfo=tzlocal()),created_by={},last_modified_time=datetime.datetime(2020, 11, 12, 11, 39, 5, 995000, tzinfo=tzlocal()),last_modified_by={})

Construct a script for training

Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The SageMaker Python SDK handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job.

Script mode supports training with a Python script, a Python module, or a shell script.

This project's training script was adapted from the TensorFlow Transformer model we developed in a previous post (mentioned above). We have modified it to handle:

  • the train_file, non_breaking_in and non_breaking_out parameters, passed in with the names of the training data set, the non-breaking prefixes file for the input data and the non-breaking prefixes file for the output data.

  • the data_dir parameter passed in by SageMaker with the value of the environment variable SM_CHANNEL_TRAINING. This is the local path inside the container where the input data from S3 is made available during training.

  • the model_dir parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.

  • the local checkpoint path to store the model checkpoints during training. We use the default value /opt/ml/checkpoints, which will be uploaded to S3. We comment on this behavior later when defining our estimator.

  • At the end of the training job we have added a step to export the trained model, only the weights, to the path stored in the environment variable SM_MODEL_DIR, which always points to /opt/ml/model. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at the end of training.

  • the output_data_dir parameter passed in by SageMaker with the value of the environment variable SM_OUTPUT_DATA_DIR. This is a folder path used to save output data from our model; this folder will be uploaded to S3 as output.tar.gz. In our case we need to save the tokenizer for the input texts, the tokenizer for the outputs, the input and output vocab sizes and the tokens for eos and sos.

In addition to the train.py file, our source code folder includes the files:

  • model.py: Tensorflow model definition
  • utils.py: utility functions to process the text data
  • utils_train.py: contains functions to calculate the loss and learning rate scheduler.

Here is the entire script for the train.py file:

!pygmentize 'train/train.py'
import argparse
import json
import sys
#import sagemaker_containers

import math
import os
import gc
import time
import pandas as pd
import pickle

import tensorflow as tf

# To install tensorflow_datasets
import subprocess

def install(package):
    subprocess.check_call([sys.executable, "-q", "-m", "pip", "install", package])

# Install the library tensorflow_datasets
install('tensorflow_datasets')

from utils import preprocess_text_nonbreaking, subword_tokenize
#from utils_train import loss_function, CustomSchedule

from model import Transformer

INPUT_COLUMN = 'input'
TARGET_COLUMN = 'target'
#NUM_SAMPLES = 80000 #40000
#MAX_VOCAB_SIZE = 2**14

#BATCH_SIZE = 64  # Batch size for training.
#EPOCHS = 10  # Number of epochs to train for.
#MAX_LENGTH = 15

def get_train_data(training_dir, nonbreaking_in, nonbreaking_out, train_file, nsamples):
    # Load the nonbreaking files
    with open(os.path.join(training_dir, nonbreaking_in), 
        mode = "r", encoding = "utf-8") as f:
        non_breaking_prefix_en = f.read()
    with open(os.path.join(training_dir, nonbreaking_out), 
        mode = "r", encoding = "utf-8") as f:
        non_breaking_prefix_es = f.read()

    non_breaking_prefix_en = non_breaking_prefix_en.split("\n")
    non_breaking_prefix_en = [' ' + pref + '.' for pref in non_breaking_prefix_en]
    non_breaking_prefix_es = non_breaking_prefix_es.split("\n")
    non_breaking_prefix_es = [' ' + pref + '.' for pref in non_breaking_prefix_es]
    # Load the training data
    # Load the dataset: sentence in english, sentence in spanish 
    df=pd.read_csv(os.path.join(training_dir, train_file), sep="\t", header=None, names=[INPUT_COLUMN,TARGET_COLUMN], usecols=[0,1], 
               nrows=nsamples)
    # Preprocess the input data
    input_data=df[INPUT_COLUMN].apply(lambda x : preprocess_text_nonbreaking(x, non_breaking_prefix_en)).tolist()
    # Preprocess and include the end of sentence token to the target text
    target_data=df[TARGET_COLUMN].apply(lambda x : preprocess_text_nonbreaking(x, non_breaking_prefix_es)).tolist()

    return input_data, target_data

def main_train(dataset, transformer, n_epochs, print_every=50):
  ''' Train the transformer model for n_epochs using the data generator dataset'''
  losses = []
  accuracies = []
  # In every epoch
  for epoch in range(n_epochs):
    print("Starting epoch {}".format(epoch+1))
    start = time.time()
    # Reset the loss and accuracy calculations
    train_loss.reset_states()
    train_accuracy.reset_states()
    # Get a batch of inputs and targets
    for (batch, (enc_inputs, targets)) in enumerate(dataset):
        # Set the decoder inputs
        dec_inputs = targets[:, :-1]
        # Set the target outputs, right shifted
        dec_outputs_real = targets[:, 1:]
        with tf.GradientTape() as tape:
            # Call the transformer and get the predicted output
            predictions = transformer(enc_inputs, dec_inputs, True)
            # Calculate the loss
            loss = loss_function(dec_outputs_real, predictions)
        # Update the weights and optimizer
        gradients = tape.gradient(loss, transformer.trainable_variables)
        optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
        # Save and store the metrics
        train_loss(loss)
        train_accuracy(dec_outputs_real, predictions)
        
        if batch % print_every == 0:
            losses.append(train_loss.result())
            accuracies.append(train_accuracy.result())
            print("Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}".format(
                epoch+1, batch, train_loss.result(), train_accuracy.result()))
            
    # Checkpoint the model on every epoch        
    ckpt_save_path = ckpt_manager.save()
    print("Saving checkpoint for epoch {} in {}".format(epoch+1,
                                                        ckpt_save_path))
    #print("Time for 1 epoch: {} secs\n".format(time.time() - start))
    # Save the model
    #transformer.save(args.sm_model_dir, overwrite=True, save_format='tf')
    
  return losses, accuracies


def loss_function(target, pred):
    mask = tf.math.logical_not(tf.math.equal(target, 0))
    loss_ = loss_object(target, pred)
    
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    
    return tf.reduce_mean(loss_)

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
    
    def __call__(self, step):
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps**-1.5)
        
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)


if __name__ == '__main__':
    # Install tensorflow_datasets
    #install('tensorflow_datasets')

    # All of the model parameters and training parameters are sent as arguments when the script
    # is executed. Here we set up an argument parser to easily access the parameters.

    parser = argparse.ArgumentParser()

    # Training Parameters
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--max-len', type=int, default=15, metavar='N',
                        help='input max sequence length for training (default: 60)')
    parser.add_argument('--epochs', type=int, default=2, metavar='N',
                        help='number of epochs to train (default: 2)')
    parser.add_argument('--nsamples', type=int, default=10000, metavar='N',
                        help='number of samples to train (default: 20000)')
    parser.add_argument('--resume', type=bool, default=False, metavar='N',
                        help='Resume training from the latest checkpoint (default: False)')

    # Data parameters                    
    parser.add_argument('--train_file', type=str, default=None, metavar='N',
                        help='Training data file name')
    parser.add_argument('--non_breaking_in', type=str, default=None, metavar='N',
                        help='Non breaking prefixes for input vocabulary')
    parser.add_argument('--non_breaking_out', type=str, default=None, metavar='N',
                        help='Non breaking prefixes for output vocabulary')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')

    # Model Parameters
    parser.add_argument('--d_model', type=int, default=64, metavar='N',
                        help='Model dimension (default: 64)')
    parser.add_argument('--ffn_dim', type=int, default=128, metavar='N',
                        help='size of the FFN layer (default: 128)')
    parser.add_argument('--vocab_size', type=int, default=10000, metavar='N',
                        help='size of the vocabulary (default: 10000)')
    parser.add_argument('--n_layers', type=int, default=4, metavar='N',
                        help='number of layers (default: 4)')
    parser.add_argument('--n_heads', type=int, default=8, metavar='N',
                        help='number of heads (default: 8)')
    parser.add_argument('--dropout_rate', type=float, default=0.1, metavar='N',
                        help='Dropout rate (default: 0.1)')

    # SageMaker Parameters
    parser.add_argument('--hosts', type=list, default=json.loads(os.environ['SM_HOSTS']))
    parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST'])
    parser.add_argument('--sm-model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--model_dir', type=str)
    parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
    parser.add_argument('--num-gpus', type=int, default=os.environ['SM_NUM_GPUS'])
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])

    args = parser.parse_args()

    print(args.sm_model_dir, args.model_dir)
    # Load the training data.
    print("Get the train data")
    input_data, target_data = get_train_data(args.data_dir, args.non_breaking_in, args.non_breaking_out, args.train_file, args.nsamples)

    # Tokenize and pad the input sequences
    print("Tokenize the input and output data and create the vocabularies") 
    encoder_inputs, tokenizer_inputs, num_words_inputs, sos_token_input, eos_token_input, del_idx_inputs= subword_tokenize(input_data, 
                                                                                                        args.vocab_size, args.max_len)
    # Tokenize and pad the outputs sequences
    decoder_outputs, tokenizer_outputs, num_words_output, sos_token_output, eos_token_output, del_idx_outputs = subword_tokenize(target_data,
                                                                                                        args.vocab_size, args.max_len)
    print('Input vocab: ',num_words_inputs)
    print('Output vocab: ',num_words_output)
    
    # Define a dataset 
    dataset = tf.data.Dataset.from_tensor_slices(
                    (encoder_inputs, decoder_outputs))
    dataset = dataset.shuffle(len(input_data), reshuffle_each_iteration=True).batch(
                    args.batch_size, drop_remainder=True)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

    # Clean the session
    tf.keras.backend.clear_session()
    # Create the Transformer model
    transformer = Transformer(vocab_size_enc=num_words_inputs,
                          vocab_size_dec=num_words_output,
                          d_model=args.d_model,
                          n_layers=args.n_layers,
                          FFN_units=args.ffn_dim,
                          n_heads=args.n_heads,
                          dropout_rate=args.dropout_rate)

    # Define a categorical cross entropy loss
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction="none")
    # Define a metric to store the mean loss of every epoch
    train_loss = tf.keras.metrics.Mean(name="train_loss")
    # Define a metric to save the accuracy in every epoch
    train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")
    # Create the scheduler for learning rate decay
    learning_rate = CustomSchedule(args.d_model)
    # Create the Adam optimizer
    optimizer = tf.keras.optimizers.Adam(learning_rate,
                                     beta_1=0.9,
                                     beta_2=0.98,
                                     epsilon=1e-9)

    #Create the Checkpoint 
    print('Creating the checkpoint ...')
    ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

    ckpt_manager = tf.train.CheckpointManager(ckpt, '/opt/ml/checkpoints/', max_to_keep=1)
    # Restore from the latest checkpoint if required
    if ckpt_manager.latest_checkpoint and args.resume:
        ckpt.restore(ckpt_manager.latest_checkpoint)
        print("Last checkpoint restored.")
    # to save the model in tf 2.1.0
    #print('Preparing the model to be saved....')
    #for enc_inputs, targets in dataset.take(1):
    #    dec_inputs = targets[:, :-1]
    #    print (enc_inputs.shape, dec_inputs.shape)
    #    transformer._set_inputs(enc_inputs, dec_inputs, True)

    # Train the model
    print('Training the model ....')
    losses, accuracies = main_train(dataset, transformer, args.epochs, 100)

    # Save the model weights (TensorFlow checkpoint format)
    print('Saving the model ....')
    transformer.save_weights(os.path.join(args.sm_model_dir, 'transformer'), overwrite=True, save_format='tf')
    #transformer.save_weights(args.sm_model_dir, overwrite=True, save_format='tf')
    # Save the parameters used to construct the model
    print("Saving the model parameters")
    model_info_path = os.path.join(args.output_data_dir, 'model_info.pth')
    with open(model_info_path, 'wb') as f:
        model_info = {
            'vocab_size_enc': num_words_inputs,
            'vocab_size_dec': num_words_output,
            'sos_token_input': sos_token_input,
            'eos_token_input': eos_token_input,
            'sos_token_output': sos_token_output,
            'eos_token_output': eos_token_output,
            'n_layers': args.n_layers,
            'd_model': args.d_model,
            'ffn_dim': args.ffn_dim,
            'n_heads': args.n_heads,
            'drop_rate': args.dropout_rate
        }
        pickle.dump(model_info, f)
          
	# Save the tokenizers with the vocabularies
    print('Saving the dictionaries ....')
    vocabulary_in = os.path.join(args.output_data_dir, 'in_vocab.pkl')
    with open(vocabulary_in, 'wb') as f:
        pickle.dump(tokenizer_inputs, f)

    vocabulary_out = os.path.join(args.output_data_dir, 'out_vocab.pkl')
    with open(vocabulary_out, 'wb') as f:
        pickle.dump(tokenizer_outputs, f)

Our source code needs the tensorflow_datasets library, which is not included in the TensorFlow 2.1 container image provided by SageMaker. To solve this issue we explicitly install it in our train.py file using the command subprocess.check_call([sys.executable, "-q", "-m", "pip", "install", package]).

Create a training job using the TensorFlow estimator

The sagemaker.tensorflow.TensorFlow estimator handles locating the script mode container where the model will run, uploading your script or source code to a S3 location and creating a SageMaker training job. Let's call out a couple important parameters here:

  • source_dir and entry_point, the folder with the source code and the file to run for training.
  • framework_version is the TensorFlow version we want to run our code on.
  • py_version is set to 'py3' to indicate that we are using script mode, since legacy mode supports only Python 2. Though Python 2 will be deprecated soon, you can use script mode with Python 2 by setting py_version to 'py2' and script_mode to True.
  • code_location is an S3 folder URI where the source_dir will be uploaded. When the instance starts, the content of that folder is downloaded to a local path, /opt/ml/code. The entry_point, our main code or function, has to be included in that folder.
  • output_path is the S3 path where all the outputs of our training job will be uploaded when the training ends. In our example we upload to this S3 folder the local content of the folders SM_MODEL_DIR and SM_OUTPUT_DATA_DIR.
  • the checkpoint_local_path and checkpoint_s3_uri parameters will be explained in the next section, "Resume training from a checkpoint".
  • script_mode = True to set script mode.
from sagemaker.tensorflow import TensorFlow
# Uncomment the type of instance to use
#instance_type='ml.m4.4xlarge'
instance_type='ml.p2.xlarge'
#instance_type='local'

Another important parameter of our TensorFlow estimator is instance_type, the type of "virtual machine" where the container will run. The values we play around with in this project are:

  • local: the container will run locally on the notebook instance. This is very useful to debug or verify that our estimator definition is correct and that train.py runs successfully. It is much faster to run the container locally; the start-up time for a remote instance is too long when you are coding and debugging.
  • ml.mX.Yxlarge: a CPU instance, useful when you are running your code for a short training run, maybe for validation purposes. Check the AWS documentation for a list of alternative instances.
  • ml.p2.xlarge: this instance uses a GPU and is the preferred one when you want to launch a long-running training.

When running in local mode, some estimator functionalities are not available, such as uploading the checkpoints to S3, so the checkpoint parameters should not be defined.
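If you plan to switch between local and remote runs, one way to respect that restriction is to build the checkpoint arguments conditionally and only pass them to the estimator when the instance type is not local. This is just an illustrative sketch, not how the estimator is defined later in this notebook:

# Sketch: only configure S3 checkpointing for remote instances, since local
# mode does not support uploading checkpoints to S3.
checkpoint_kwargs = {}
if instance_type != 'local':
    checkpoint_kwargs['checkpoint_s3_uri'] = ckpt_data_uri
    # checkpoint_kwargs['checkpoint_local_path'] = 'ckpt'  # optional, defaults to /opt/ml/checkpoints
# These kwargs could then be passed to the TensorFlow estimator with **checkpoint_kwargs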

Finally, we want to mention the definition of metrics. Using a dictionary, we can define a metric name and the regular expression used to extract its value from the messages the training script writes to the logs or stdout during training. Later we can see those metrics in the SageMaker console; we show how to do this in a following section.

# Define the metrics to search for
metric_definitions = [{'Name': 'loss', 'Regex': 'Loss ([0-9\\.]+)'},{'Name': 'Accuracy', 'Regex': 'Accuracy ([0-9\\.]+)'}]
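As a quick illustrative check (not part of the original notebook), we can test these regular expressions locally against a log line in the same format our training script prints. During training, SageMaker applies them to the job logs for us.

import re

# Test the metric regexes against a sample log line in the training script's format
sample_line = "Epoch 1 Batch 100 Loss 4.4878 Accuracy 0.0394"
for metric in metric_definitions:
    match = re.search(metric['Regex'], sample_line)
    print(metric['Name'], '->', match.group(1) if match else 'no match')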

Now, we can define the estimator:

# Create the Tensorflow estimator using a Tensorflow 2.1 container
estimator = TensorFlow(entry_point='train.py',
                       source_dir="train",
                       role=role,
                       instance_count=1,
                       instance_type=instance_type,
                       framework_version='2.1.0',
                       py_version='py3',
                       output_path=output_data_uri,
                       code_location=output_data_uri,
                       base_job_name='tf-transformer',
                       script_mode= True,
                       #checkpoint_local_path = 'ckpt', #Use default value /opt/ml/checkpoints
                       checkpoint_s3_uri = ckpt_data_uri,
                       metric_definitions = metric_definitions, 
                       hyperparameters={
                        'epochs': 8,
                        'nsamples': 60000,
                        'resume': False,
                        'train_file': 'spa.txt',
                        'non_breaking_in': 'nonbreaking_prefix.en',
                        'non_breaking_out': 'nonbreaking_prefix.es'
                       })

Start the training job: fit

To start a training job, we call the estimator.fit method with a few parameter values.

  • An S3 location is used here as the input. fit creates a default channel named 'training', which points to this S3 location. In the training script we can access the training data from the local location stored in SM_CHANNEL_TRAINING. fit accepts a couple of other types of input as well. See the API doc here for details.
  • job_name the name for the training job.
  • experiment_config the dictionary with the name of the experiment and trial to attach this job to.

When training starts, the TensorFlow container executes train.py, passing hyperparameters and model_dir from the estimator as script arguments. Because we didn't explicitly define it, model_dir defaults to s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>/model, so the script execution is as follows:

python train.py --model_dir s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>/model --epochs=1 --nsamples=5000 ...

When training is complete, the training job will upload the saved model and other output artifacts to S3.

# Set the job name and show it
job_name = '{}-{}'.format(trial_name,time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()))
print(job_name)
tf-transformer-single-gpu-2020-11-12-12-25-30

Calling fit to train a model with the TensorFlow 2.1 script.

# Call the fit method to launch the training job
estimator.fit({'training':training_data_uri}, job_name = job_name, 
              experiment_config = experiment_config)
INFO:sagemaker:Creating training-job with name: tf-transformer-single-gpu-2020-11-12-12-25-30
2020-11-12 12:25:33 Starting - Starting the training job...
2020-11-12 12:25:39 Starting - Launching requested ML instances......
2020-11-12 12:26:52 Starting - Preparing the instances for training.........
2020-11-12 12:28:30 Downloading - Downloading input data
2020-11-12 12:28:30 Training - Downloading the training image...........2020-11-12 12:30:19,406 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-11-12 12:30:19,885 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "resume": false,
        "non_breaking_out": "nonbreaking_prefix.es",
        "nsamples": 60000,
        "train_file": "spa.txt",
        "model_dir": "s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-12-25-30/model",
        "non_breaking_in": "nonbreaking_prefix.en",
        "epochs": 8
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "tf-transformer-single-gpu-2020-11-12-12-25-30",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-12-25-30/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"epochs":8,"model_dir":"s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-12-25-30/model","non_breaking_in":"nonbreaking_prefix.en","non_breaking_out":"nonbreaking_prefix.es","nsamples":60000,"resume":false,"train_file":"spa.txt"}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-12-25-30/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"epochs":8,"model_dir":"s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-12-25-30/model","non_breaking_in":"nonbreaking_prefix.en","non_breaking_out":"nonbreaking_prefix.es","nsamples":60000,"resume":false,"train_file":"spa.txt"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"tf-transformer-single-gpu-2020-11-12-12-25-30","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-12-25-30/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--epochs","8","--model_dir","s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-12-25-30/model","--non_breaking_in","nonbreaking_prefix.en","--non_breaking_out","nonbreaking_prefix.es","--nsamples","60000","--resume","False","--train_file","spa.txt"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_RESUME=false
SM_HP_NON_BREAKING_OUT=nonbreaking_prefix.es
SM_HP_NSAMPLES=60000
SM_HP_TRAIN_FILE=spa.txt
SM_HP_MODEL_DIR=s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-12-25-30/model
SM_HP_NON_BREAKING_IN=nonbreaking_prefix.en
SM_HP_EPOCHS=8
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages

Invoking script with the following command:

/usr/bin/python3 train.py --epochs 8 --model_dir s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-12-25-30/model --non_breaking_in nonbreaking_prefix.en --non_breaking_out nonbreaking_prefix.es --nsamples 60000 --resume False --train_file spa.txt


Collecting tensorflow_datasets
  Downloading tensorflow_datasets-4.1.0-py3-none-any.whl (3.6 MB)
Requirement already satisfied: absl-py in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (0.9.0)
Collecting importlib-resources; python_version < "3.9"
  Downloading importlib_resources-3.3.0-py2.py3-none-any.whl (26 kB)
Collecting tensorflow-metadata
  Downloading tensorflow_metadata-0.25.0-py3-none-any.whl (44 kB)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (1.14.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (1.18.1)
Collecting typing-extensions; python_version < "3.8"
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting tqdm
  Downloading tqdm-4.51.0-py2.py3-none-any.whl (70 kB)
Collecting dill
  Downloading dill-0.3.3-py2.py3-none-any.whl (81 kB)
Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (2.22.0)
Collecting promise
  Downloading promise-2.3.tar.gz (19 kB)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)

2020-11-12 12:30:13 Training - Training image download completed. Training in progress.Requirement already satisfied: termcolor in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (1.1.0)
Collecting dataclasses; python_version < "3.7"
  Downloading dataclasses-0.7-py3-none-any.whl (18 kB)
Requirement already satisfied: protobuf>=3.6.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (3.11.3)
Collecting attrs>=18.1.0
  Downloading attrs-20.3.0-py2.py3-none-any.whl (49 kB)
Requirement already satisfied: zipp>=0.4; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from importlib-resources; python_version < "3.9"->tensorflow_datasets) (3.1.0)
Collecting googleapis-common-protos<2,>=1.52.0
  Downloading googleapis_common_protos-1.52.0-py2.py3-none-any.whl (100 kB)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow_datasets) (1.25.9)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow_datasets) (2020.4.5.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow_datasets) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow_datasets) (2.8)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.6.1->tensorflow_datasets) (46.1.3)
Building wheels for collected packages: promise, future
  Building wheel for promise (setup.py): started
  Building wheel for promise (setup.py): finished with status 'done'
  Created wheel for promise: filename=promise-2.3-py3-none-any.whl size=21495 sha256=5bfcc52cd449f0f077404e249a0252f083de1fc5a2950e1eede78546f1c0db75
  Stored in directory: /root/.cache/pip/wheels/59/9a/1d/3f1afbbb5122d0410547bf9eb50955f4a7a98e53a6d8b99bd1
  Building wheel for future (setup.py): started
  Building wheel for future (setup.py): finished with status 'done'
  Created wheel for future: filename=future-0.18.2-py3-none-any.whl size=491058 sha256=3897bdb5f59ddb7663519e2e63bce2390ad30686d73d2202d0e0a5b80c16db55
  Stored in directory: /root/.cache/pip/wheels/6e/9c/ed/4499c9865ac1002697793e0ae05ba6be33553d098f3347fb94
Successfully built promise future
Installing collected packages: importlib-resources, googleapis-common-protos, tensorflow-metadata, typing-extensions, tqdm, dill, promise, future, dataclasses, attrs, tensorflow-datasets
Successfully installed attrs-20.3.0 dataclasses-0.7 dill-0.3.3 future-0.18.2 googleapis-common-protos-1.52.0 importlib-resources-3.3.0 promise-2.3 tensorflow-datasets-4.1.0 tensorflow-metadata-0.25.0 tqdm-4.51.0 typing-extensions-3.7.4.3
WARNING: You are using pip version 20.0.2; however, version 20.2.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
/opt/ml/model s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-12-25-30/model
Get the train data
Tokenize the input and output data and create the vocabularies
Input vocab:  11460
Output vocab:  9383
Creating the checkpoint ...
Training the model ....
Starting epoch 1
Epoch 1 Batch 0 Loss 4.2718 Accuracy 0.0000
Epoch 1 Batch 100 Loss 4.4878 Accuracy 0.0394
Epoch 1 Batch 200 Loss 4.4102 Accuracy 0.0553
Epoch 1 Batch 300 Loss 4.2828 Accuracy 0.0607
Epoch 1 Batch 400 Loss 4.1153 Accuracy 0.0634
Epoch 1 Batch 500 Loss 3.9366 Accuracy 0.0683
Epoch 1 Batch 600 Loss 3.7706 Accuracy 0.0784
Epoch 1 Batch 700 Loss 3.6183 Accuracy 0.0863
Epoch 1 Batch 800 Loss 3.4776 Accuracy 0.0956
Epoch 1 Batch 900 Loss 3.3528 Accuracy 0.1038
Saving checkpoint for epoch 1 in /opt/ml/checkpoints/ckpt-1
Starting epoch 2
Epoch 2 Batch 0 Loss 2.1816 Accuracy 0.1741
Epoch 2 Batch 100 Loss 2.2265 Accuracy 0.1770
Epoch 2 Batch 200 Loss 2.2128 Accuracy 0.1801
Epoch 2 Batch 300 Loss 2.1806 Accuracy 0.1838
Epoch 2 Batch 400 Loss 2.1546 Accuracy 0.1871
Epoch 2 Batch 500 Loss 2.1287 Accuracy 0.1900
Epoch 2 Batch 600 Loss 2.1031 Accuracy 0.1930
Epoch 2 Batch 700 Loss 2.0789 Accuracy 0.1955
Epoch 2 Batch 800 Loss 2.0580 Accuracy 0.1981
Epoch 2 Batch 900 Loss 2.0352 Accuracy 0.2006
Saving checkpoint for epoch 2 in /opt/ml/checkpoints/ckpt-2
Starting epoch 3
Epoch 3 Batch 0 Loss 1.7447 Accuracy 0.2243
Epoch 3 Batch 100 Loss 1.7695 Accuracy 0.2276
Epoch 3 Batch 200 Loss 1.7488 Accuracy 0.2286
Epoch 3 Batch 300 Loss 1.7416 Accuracy 0.2300
Epoch 3 Batch 400 Loss 1.7318 Accuracy 0.2310
Epoch 3 Batch 500 Loss 1.7194 Accuracy 0.2320
Epoch 3 Batch 600 Loss 1.7094 Accuracy 0.2330
Epoch 3 Batch 700 Loss 1.6998 Accuracy 0.2340
Epoch 3 Batch 800 Loss 1.6888 Accuracy 0.2351
Epoch 3 Batch 900 Loss 1.6789 Accuracy 0.2361
Saving checkpoint for epoch 3 in /opt/ml/checkpoints/ckpt-3
Starting epoch 4
Epoch 4 Batch 0 Loss 1.5789 Accuracy 0.2511
Epoch 4 Batch 100 Loss 1.5092 Accuracy 0.2525
Epoch 4 Batch 200 Loss 1.5019 Accuracy 0.2533
Epoch 4 Batch 300 Loss 1.4967 Accuracy 0.2539
Epoch 4 Batch 400 Loss 1.4913 Accuracy 0.2550
Epoch 4 Batch 500 Loss 1.4852 Accuracy 0.2563
Epoch 4 Batch 600 Loss 1.4726 Accuracy 0.2577
Epoch 4 Batch 700 Loss 1.4652 Accuracy 0.2592
Epoch 4 Batch 800 Loss 1.4583 Accuracy 0.2604
Epoch 4 Batch 900 Loss 1.4501 Accuracy 0.2620
Saving checkpoint for epoch 4 in /opt/ml/checkpoints/ckpt-4
Starting epoch 5
Epoch 5 Batch 0 Loss 1.3142 Accuracy 0.2902
Epoch 5 Batch 100 Loss 1.2605 Accuracy 0.2823
Epoch 5 Batch 200 Loss 1.2688 Accuracy 0.2834
Epoch 5 Batch 300 Loss 1.2683 Accuracy 0.2847
Epoch 5 Batch 400 Loss 1.2640 Accuracy 0.2859
Epoch 5 Batch 500 Loss 1.2610 Accuracy 0.2869
Epoch 5 Batch 600 Loss 1.2564 Accuracy 0.2884
Epoch 5 Batch 700 Loss 1.2507 Accuracy 0.2897
Epoch 5 Batch 800 Loss 1.2449 Accuracy 0.2912
Epoch 5 Batch 900 Loss 1.2371 Accuracy 0.2925
Saving checkpoint for epoch 5 in /opt/ml/checkpoints/ckpt-5
Starting epoch 6
Epoch 6 Batch 0 Loss 0.9893 Accuracy 0.3114
Epoch 6 Batch 100 Loss 1.0413 Accuracy 0.3140
Epoch 6 Batch 200 Loss 1.0532 Accuracy 0.3141
Epoch 6 Batch 300 Loss 1.0536 Accuracy 0.3152
Epoch 6 Batch 400 Loss 1.0589 Accuracy 0.3157
Epoch 6 Batch 500 Loss 1.0589 Accuracy 0.3163
Epoch 6 Batch 600 Loss 1.0593 Accuracy 0.3168
Epoch 6 Batch 700 Loss 1.0561 Accuracy 0.3173
Epoch 6 Batch 800 Loss 1.0541 Accuracy 0.3180
Epoch 6 Batch 900 Loss 1.0528 Accuracy 0.3186
Saving checkpoint for epoch 6 in /opt/ml/checkpoints/ckpt-6
Starting epoch 7
Epoch 7 Batch 0 Loss 1.0120 Accuracy 0.3504
Epoch 7 Batch 100 Loss 0.9046 Accuracy 0.3345
Epoch 7 Batch 200 Loss 0.9157 Accuracy 0.3350
Epoch 7 Batch 300 Loss 0.9213 Accuracy 0.3342
Epoch 7 Batch 400 Loss 0.9231 Accuracy 0.3343
Epoch 7 Batch 500 Loss 0.9272 Accuracy 0.3341
Epoch 7 Batch 600 Loss 0.9302 Accuracy 0.3342
Epoch 7 Batch 700 Loss 0.9300 Accuracy 0.3343
Epoch 7 Batch 800 Loss 0.9303 Accuracy 0.3346
Epoch 7 Batch 900 Loss 0.9315 Accuracy 0.3348
Saving checkpoint for epoch 7 in /opt/ml/checkpoints/ckpt-7
Starting epoch 8
Epoch 8 Batch 0 Loss 0.8039 Accuracy 0.3315
Epoch 8 Batch 100 Loss 0.8109 Accuracy 0.3457
Epoch 8 Batch 200 Loss 0.8154 Accuracy 0.3461
Epoch 8 Batch 300 Loss 0.8282 Accuracy 0.3454
Epoch 8 Batch 400 Loss 0.8360 Accuracy 0.3459
Epoch 8 Batch 500 Loss 0.8395 Accuracy 0.3456
Epoch 8 Batch 600 Loss 0.8426 Accuracy 0.3453
Epoch 8 Batch 700 Loss 0.8456 Accuracy 0.3451
Epoch 8 Batch 800 Loss 0.8459 Accuracy 0.3451
Epoch 8 Batch 900 Loss 0.8486 Accuracy 0.3448
Saving checkpoint for epoch 8 in /opt/ml/checkpoints/ckpt-8
Saving the model ....
Saving the model parameters
Saving the dictionaries ....
2020-11-12 13:15:31,672 sagemaker_tensorflow_container.training WARNING  Your model will NOT be servable with SageMaker TensorFlow Serving container. The model artifact was not saved in the TensorFlow SavedModel directory structure:
https://www.tensorflow.org/guide/saved_model#structure_of_a_savedmodel_directory
2020-11-12 13:15:31,672 sagemaker-containers INFO     Reporting training SUCCESS

2020-11-12 13:15:47 Uploading - Uploading generated training model
2020-11-12 13:15:55 Completed - Training job completed
Training seconds: 2867
Billable seconds: 2867

Save the experiment; afterwards you can view it and its trials from SageMaker Studio.

# Save the trial
single_gpu_trial.save()
# Save the experiment
training_experiment.save()
Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f3f90c75d68>,experiment_name='tf-transformer',experiment_arn='arn:aws:sagemaker:us-east-1:223817798831:experiment/tf-transformer',display_name='tf-transformer',description='Experiment to track trainings on my tensorflow Transformer Eng-Spa',creation_time=datetime.datetime(2020, 11, 8, 17, 0, 49, 116000, tzinfo=tzlocal()),created_by={},last_modified_time=datetime.datetime(2020, 11, 12, 11, 50, 27, 732000, tzinfo=tzlocal()),last_modified_by={},response_metadata={'RequestId': '862d03a0-abf6-4215-a759-2ddcc4f622fd', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '862d03a0-abf6-4215-a759-2ddcc4f622fd', 'content-type': 'application/x-amz-json-1.1', 'content-length': '86', 'date': 'Thu, 12 Nov 2020 13:17:51 GMT'}, 'RetryAttempts': 0})

Show metrics from SageMaker Console

You can monitor the metrics that a training job emits in real time in the CloudWatch console:

  • Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
  • Choose Metrics, then choose /aws/sagemaker/TrainingJobs.
  • Choose TrainingJobName.
  • On the All metrics tab, choose the names of the training metrics that you want to monitor.

Another option is to monitor the metrics by using the SageMaker console.

  • Open the SageMaker console at https://console.aws.amazon.com/sagemaker/.
  • Choose Training jobs, then choose the training job whose metrics you want to see.
  • Choose TrainingJobName.
  • In the Monitor section, you can review the graphs of instance utilization and algorithm metrics.

It is a simple way to check how your model is "learning" during the training stage.
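If you prefer to pull the metric values into the notebook instead of browsing the console, the SageMaker SDK exposes them through TrainingJobAnalytics. The sketch below assumes job_name still holds the name of the completed training job and uses the metric names we defined earlier.

from sagemaker.analytics import TrainingJobAnalytics

# Retrieve the metric values SageMaker extracted with our regex definitions
# and load them into a pandas DataFrame
metrics_df = TrainingJobAnalytics(training_job_name=job_name,
                                  metric_names=['loss', 'Accuracy']).dataframe()
metrics_df.head()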

Restore a training job and download the trained model

At this point, we have a trained model stored in S3. But we are interested in making some predictions with it.

After you train your model, you can deploy it using Amazon SageMaker to get predictions in any of the following ways:

  • To set up a persistent endpoint to get one prediction at a time, use SageMaker hosting services.
  • To get predictions for an entire dataset, use SageMaker batch transform.

But in this notebook we do not cover that feature, because sometimes we are more interested in reloading our model in a new notebook to apply an evaluation method or to study its parameters or gradients. So, here we are going to download the model artifacts from S3 and load them into an "empty" model instance.

Attach a previous training job

If we have just trained a model using our estimator variable in this notebook execution, we can skip this step. But more likely you trained your model for hours and now need to restore the estimator variable from a previous training job. Look up the training job whose model you want to restore in the SageMaker console, copy its name and paste it in the next block of code. Then call the attach method of the estimator and you can continue working with that training job.

We can skip the next cell if the previous estimator.fit command was executed

from sagemaker.tensorflow import TensorFlow

# Set the training job you want to attach to the estimator object
# Use this option if the training job was not trained in this execution
my_training_job_name = 'tf-transformer-single-gpu-2020-11-12-18-36-15'

# In case the training job was trained in this execution, we can retrieve the name from the job_name variable
#my_training_job_name = job_name
# Attach the estimator to the selected training job
estimator = TensorFlow.attach(my_training_job_name)
2020-11-12 19:13:13 Starting - Preparing the instances for training
2020-11-12 19:13:13 Downloading - Downloading input data
2020-11-12 19:13:13 Training - Training image download completed. Training in progress.
2020-11-12 19:13:13 Uploading - Uploading generated training model
2020-11-12 19:13:13 Completed - Training job completed
# Set the job_name
job_name = estimator.latest_training_job.job_name
print('Job name where the model will be restored: ',estimator.latest_training_job.job_name)
Job name where the model will be restored:  tf-transformer-single-gpu-2020-11-12-18-36-15
print('Dir of model data: ',estimator.model_data)
print('Dir of output data: ',output_data_uri)
print('Bucket name: ',bucket_name)
Dir of model data:  s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/output/model.tar.gz
Dir of output data:  s3://edumunozsala-ml-sagemaker/transformer-nmt
Bucket name:  edumunozsala-ml-sagemaker

Download the trained model

The estimator variable model_data points to the model.tar.gz file which contains the saved model. The other output files from our model, needed to rebuild it and to tokenize or detokenize the sentences, can be found in the S3 folder output_path/output/output.tar.gz. We can download both files and unzip them.

# Set the model and the output path in S3 to download the data 
init_model_path = len('s3://')+len(bucket_name)+1
s3_model_path=estimator.model_data[init_model_path:]
s3_output_data=output_data_uri[init_model_path:]+'/{}/output/output.tar.gz'.format(job_name)
print('Dir to download trained model: ', s3_model_path)
print('Dir to download model outputs: ', s3_output_data)
Dir to download trained model:  transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/output/model.tar.gz
Dir to download model outputs:  transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/output/output.tar.gz
sagemaker_session.download_data(trainedmodel_path,bucket_name,s3_model_path)
sagemaker_session.download_data(output_data_path,bucket_name,s3_output_data)
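
As an aside, the same downloads can be expressed directly with S3 URIs using the SDK's S3Downloader helper, which avoids slicing the bucket name out of the URI. A sketch, assuming the same local folders trainedmodel_path and output_data_path:

from sagemaker.s3 import S3Downloader

# Download the artifacts straight from their S3 URIs (sketch, equivalent to the cell above)
S3Downloader.download(estimator.model_data, trainedmodel_path)
S3Downloader.download('{}/{}/output/output.tar.gz'.format(output_data_uri, job_name), output_data_path)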

Next, extract the contents of the model.tar.gz file returned by the training job in SageMaker:

!tar -zxvf $trainedmodel_path/model.tar.gz
transformer.data-00000-of-00002
transformer.index
transformer.data-00001-of-00002
checkpoint

Extract the files from output.tar.gz without recreating the directory structure; all files will be extracted to the working directory:

!tar -xvzf $output_data_path/output.tar.gz #--strip-components=1
out_vocab.pkl
model_info.pth
in_vocab.pkl

Import the tensorflow model and load the model

We import the model.py file with our model definition, but we only have the weights of the model, so we need to rebuild it. The model parameters were saved during training in model_info.pth; we just need to read that file and use the parameters to create an empty instance of the model. Then we can load the weights into that instance with load_weights().

import pickle

from train.model import Transformer

# Read the parameters from a dictionary
with open(model_info_file, 'rb') as f:
    model_info = pickle.load(f)
print('Model parameters',model_info)

#Create an instance of the Transformer model using the saved parameters
transformer = Transformer(vocab_size_enc=model_info['vocab_size_enc'],
                          vocab_size_dec=model_info['vocab_size_dec'],
                          d_model=model_info['d_model'],
                          n_layers=model_info['n_layers'],
                          FFN_units=model_info['ffn_dim'],
                          n_heads=model_info['n_heads'],
                          dropout_rate=model_info['drop_rate'])

#Load the saved model
# To do: Use variable to store the model name and pass it in as a hyperparameter of the estimator
transformer.load_weights('transformer')
Model parameters {'vocab_size_enc': 11460, 'vocab_size_dec': 9383, 'sos_token_input': [11458], 'eos_token_input': [11459], 'sos_token_output': [9381], 'eos_token_output': [9382], 'n_layers': 4, 'd_model': 64, 'ffn_dim': 128, 'n_heads': 8, 'drop_rate': 0.1}
<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f00f7b045f8>

Make some predictions

And now everything is ready to make predictions with our trained model:

  • Import the predict.py file with the functions to make a prediction and to translate a sentence. The code was described in the original post.
  • Read the files and load the tokenizers for the input and output sentences.
  • Call the translate function with the model, the tokenizers, the sos and eos tokens, the sentence to translate and the max length of the output. It returns the predicted sentence detokenized, as plain text, with the translation.
# Install the library necessary to tokenize the sentences
!pip install tensorflow-datasets
Collecting tensorflow-datasets
  Downloading tensorflow_datasets-4.1.0-py3-none-any.whl (3.6 MB)
     |████████████████████████████████| 3.6 MB 13.8 MB/s eta 0:00:01
Requirement already satisfied: tqdm in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from tensorflow-datasets) (4.42.1)
Collecting dataclasses; python_version < "3.7"
  Downloading dataclasses-0.8-py3-none-any.whl (19 kB)
Collecting importlib-resources; python_version < "3.9"
  Downloading importlib_resources-3.3.0-py2.py3-none-any.whl (26 kB)
Requirement already satisfied: attrs>=18.1.0 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from tensorflow-datasets) (19.3.0)
Requirement already satisfied: protobuf>=3.6.1 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from tensorflow-datasets) (3.8.0)
Collecting tensorflow-metadata
  Downloading tensorflow_metadata-0.25.0-py3-none-any.whl (44 kB)
     |████████████████████████████████| 44 kB 5.6 MB/s  eta 0:00:01
Requirement already satisfied: termcolor in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from tensorflow-datasets) (1.1.0)
Requirement already satisfied: absl-py in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from tensorflow-datasets) (0.11.0)
Collecting promise
  Downloading promise-2.3.tar.gz (19 kB)
Collecting dill
  Downloading dill-0.3.3-py2.py3-none-any.whl (81 kB)
     |████████████████████████████████| 81 kB 17.5 MB/s eta 0:00:01
Requirement already satisfied: future in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from tensorflow-datasets) (0.18.2)
Requirement already satisfied: numpy in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from tensorflow-datasets) (1.18.1)
Requirement already satisfied: requests>=2.19.0 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from tensorflow-datasets) (2.22.0)
Requirement already satisfied: six in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from tensorflow-datasets) (1.14.0)
Collecting typing-extensions; python_version < "3.8"
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Requirement already satisfied: zipp>=0.4; python_version < "3.8" in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from importlib-resources; python_version < "3.9"->tensorflow-datasets) (2.2.0)
Requirement already satisfied: setuptools in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from protobuf>=3.6.1->tensorflow-datasets) (45.2.0.post20200210)
Collecting googleapis-common-protos<2,>=1.52.0
  Downloading googleapis_common_protos-1.52.0-py2.py3-none-any.whl (100 kB)
     |████████████████████████████████| 100 kB 18.2 MB/s ta 0:00:01
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from requests>=2.19.0->tensorflow-datasets) (1.25.10)
Requirement already satisfied: idna<2.9,>=2.5 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from requests>=2.19.0->tensorflow-datasets) (2.8)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from requests>=2.19.0->tensorflow-datasets) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /home/ec2-user/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages (from requests>=2.19.0->tensorflow-datasets) (2020.6.20)
Building wheels for collected packages: promise
  Building wheel for promise (setup.py) ... done
  Created wheel for promise: filename=promise-2.3-py3-none-any.whl size=21495 sha256=1eeea3a689d00d6a15f97c89d4bd92ce7c63d62bbca77357f5aa7be67b44d3af
  Stored in directory: /home/ec2-user/.cache/pip/wheels/59/9a/1d/3f1afbbb5122d0410547bf9eb50955f4a7a98e53a6d8b99bd1
Successfully built promise
ERROR: tensorflow-metadata 0.25.0 has requirement absl-py<0.11,>=0.9, but you'll have absl-py 0.11.0 which is incompatible.
Installing collected packages: dataclasses, importlib-resources, googleapis-common-protos, tensorflow-metadata, promise, dill, typing-extensions, tensorflow-datasets
Successfully installed dataclasses-0.8 dill-0.3.3 googleapis-common-protos-1.52.0 importlib-resources-3.3.0 promise-2.3 tensorflow-datasets-4.1.0 tensorflow-metadata-0.25.0 typing-extensions-3.7.4.3
WARNING: You are using pip version 20.0.2; however, version 20.2.4 is available.
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/tensorflow2_p36/bin/python -m pip install --upgrade pip' command.
from serve.predict import translate
import tensorflow_datasets as tfds

Load the input and output tokenizers (vocabularies) used in training. We need them to encode and decode the sentences.

# Read the parameters from a dictionary
#model_info_path = os.path.join(model_dir, 'model_info.pth')
with open(input_vocab_file, 'rb') as f:
    tokenizer_inputs = pickle.load(f)

with open(output_vocab_file, 'rb') as f:
    tokenizer_outputs = pickle.load(f)
#Show some translations
sentence = "you should pay for it."
print("Input sentence: {}".format(sentence))
predicted_sentence = translate(transformer,sentence,tokenizer_inputs, tokenizer_outputs,15,model_info['sos_token_input'],
                               model_info['eos_token_input'],model_info['sos_token_output'],
                               model_info['eos_token_output'])
print("Output sentence: {}".format(predicted_sentence))
Input sentence: you should pay for it.
Output sentence: Deberías pagar por ello.
#Show some translations
sentence = "This is a really powerful method!"
print("Input sentence: {}".format(sentence))
predicted_sentence = translate(transformer,sentence,tokenizer_inputs, tokenizer_outputs,15,model_info['sos_token_input'],
                               model_info['eos_token_input'],model_info['sos_token_output'],
                               model_info['eos_token_output'])
print("Output sentence: {}".format(predicted_sentence))
Input sentence: This is a really powerful method!
Output sentence: ¡Esto es un montón de las carreras de las ocho!

Resume training from a checkpoint

Sometimes we need to stop our training, maybe to investigate the model's performance or to reallocate resources in the project. When that is done, we want to resume the training, restoring the model and optimizer states and continuing for some more epochs to obtain a final trained model with better performance.

To help with that scenario, the checkpoint_local_path and checkpoint_s3_uri estimator parameters are the most relevant. The first one is the local path, inside the container, where the algorithm writes its checkpoints. SageMaker continually persists all files under this path to checkpoint_s3_uri during training. On job startup the reverse happens: data from the S3 location is downloaded to this path before the algorithm is started. If the path is unset, SageMaker assumes the checkpoints will be written under /opt/ml/checkpoints/. Using this feature we can resume training from the last checkpoint (or a previous one).

For this purpose, we set the hyperparameter resume = True and fit the estimator to run another training job.
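
As a reference, this is a minimal sketch of the checkpointing pattern the training script is expected to follow; the actual train.py may differ, and transformer, optimizer and resume are assumed to be the model, the optimizer and the parsed --resume flag built inside the script.

import tensorflow as tf

# /opt/ml/checkpoints is the default local path that SageMaker syncs with checkpoint_s3_uri
checkpoint_path = '/opt/ml/checkpoints'
ckpt = tf.train.Checkpoint(transformer=transformer, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# When --resume is true, restore the latest checkpoint downloaded from S3 at job startup
if resume and ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Last checkpoint restored.')

# ... training loop: at the end of each epoch, ckpt_manager.save() writes a new checkpoint
# that SageMaker uploads to checkpoint_s3_uri ...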

Load the experiment and trial created in a previous run or create a new one:

# Set the experiment name
experiment_name='tf-transformer'
# Set the trial name 
trial_name="{}-{}".format(experiment_name,'single-gpu')
trial_comp_name = 'single-gpu-training-job'

tags = [{'Key': 'my-experiments', 'Value': 'transformerEngSpa1'}]
# create the experiment if it doesn't exist
try:
    experiment = Experiment.load(experiment_name=experiment_name)
    print('Load the experiment')
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        experiment = Experiment.create(experiment_name=experiment_name)
        print('Create the experiment')
    else:
        # Re-raise unexpected errors so the experiment variable is never left undefined
        raise


# create the trial if it doesn't exist
try:
    trial = Trial.load(trial_name=trial_name)
    print('Load the trial')
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        trial = Trial.create(experiment_name=experiment_name, trial_name=trial_name)
        print('Create the trial')
    else:
        # Re-raise unexpected errors so the trial variable is never left undefined
        raise
Load the experiment
Load the trial
# Set the configuration parameters for the experiment
experiment_config = {'ExperimentName': experiment.experiment_name, 
                       'TrialName': trial.trial_name,
                       'TrialComponentDisplayName': trial_comp_name}

Create an Estimator for a TensorFlow 2.1 model and set the hyperparameter resume to True to force the model to restore the latest checkpoint and resume training for the selected number of epochs.

#instance_type='ml.m5.xlarge'
#instance_type='ml.m4.4xlarge'
instance_type='ml.p2.xlarge'
#instance_type='local'

# Define the metrics to search for
metric_definitions = [{'Name': 'loss', 'Regex': 'Loss ([0-9\\.]+)'},{'Name': 'Accuracy', 'Regex': 'Accuracy ([0-9\\.]+)'}]
# Create an estimator with the hyperparameter resume = True
estimator = TensorFlow(entry_point='train.py',
                       source_dir='train',
                       role=role,
                       instance_count=1,
                       instance_type=instance_type,
                       framework_version='2.1.0',
                       py_version='py3',
                       output_path=output_data_uri,
                       code_location=output_data_uri,
                       base_job_name='tf-transformer',
                       script_mode= True,
                       checkpoint_s3_uri = ckpt_data_uri,
                       metric_definitions = metric_definitions, 
                       hyperparameters={
                        'epochs': 5,
                        'nsamples': 60000,
                        'resume': True,
                        'train_file': 'spa.txt',
                        'non_breaking_in': 'nonbreaking_prefix.en',
                        'non_breaking_out': 'nonbreaking_prefix.es'
                       })
# Set the job name and show it
job_name = '{}-{}'.format(trial_name,time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime()))
print(job_name)
tf-transformer-single-gpu-2020-11-12-18-36-15
# Fit or train the model from the latest checkpoint
estimator.fit({'training':training_data_uri}, job_name = job_name, 
              experiment_config = experiment_config)
INFO:sagemaker:Creating training-job with name: tf-transformer-single-gpu-2020-11-12-18-36-15
2020-11-12 18:36:34 Starting - Starting the training job...
2020-11-12 18:36:39 Starting - Launching requested ML instances......
2020-11-12 18:37:59 Starting - Preparing the instances for training.........
2020-11-12 18:39:13 Downloading - Downloading input data......
2020-11-12 18:40:24 Training - Downloading the training image......
2020-11-12 18:41:37 Training - Training image download completed. Training in progress...2020-11-12 18:41:43,057 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2020-11-12 18:41:43,563 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "resume": true,
        "non_breaking_out": "nonbreaking_prefix.es",
        "nsamples": 60000,
        "train_file": "spa.txt",
        "model_dir": "s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/model",
        "non_breaking_in": "nonbreaking_prefix.en",
        "epochs": 5
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "tf-transformer-single-gpu-2020-11-12-18-36-15",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"epochs":5,"model_dir":"s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/model","non_breaking_in":"nonbreaking_prefix.en","non_breaking_out":"nonbreaking_prefix.es","nsamples":60000,"resume":true,"train_file":"spa.txt"}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"epochs":5,"model_dir":"s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/model","non_breaking_in":"nonbreaking_prefix.en","non_breaking_out":"nonbreaking_prefix.es","nsamples":60000,"resume":true,"train_file":"spa.txt"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"tf-transformer-single-gpu-2020-11-12-18-36-15","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--epochs","5","--model_dir","s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/model","--non_breaking_in","nonbreaking_prefix.en","--non_breaking_out","nonbreaking_prefix.es","--nsamples","60000","--resume","True","--train_file","spa.txt"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_RESUME=true
SM_HP_NON_BREAKING_OUT=nonbreaking_prefix.es
SM_HP_NSAMPLES=60000
SM_HP_TRAIN_FILE=spa.txt
SM_HP_MODEL_DIR=s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/model
SM_HP_NON_BREAKING_IN=nonbreaking_prefix.en
SM_HP_EPOCHS=5
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages

Invoking script with the following command:

/usr/bin/python3 train.py --epochs 5 --model_dir s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/model --non_breaking_in nonbreaking_prefix.en --non_breaking_out nonbreaking_prefix.es --nsamples 60000 --resume True --train_file spa.txt


Collecting tensorflow_datasets
  Downloading tensorflow_datasets-4.1.0-py3-none-any.whl (3.6 MB)
Requirement already satisfied: termcolor in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (1.1.0)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
Collecting dill
  Downloading dill-0.3.3-py2.py3-none-any.whl (81 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (1.18.1)
Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (2.22.0)
Collecting tensorflow-metadata
  Downloading tensorflow_metadata-0.25.0-py3-none-any.whl (44 kB)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (1.14.0)
Requirement already satisfied: protobuf>=3.6.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (3.11.3)
Collecting attrs>=18.1.0
  Downloading attrs-20.3.0-py2.py3-none-any.whl (49 kB)
Collecting tqdm
  Downloading tqdm-4.51.0-py2.py3-none-any.whl (70 kB)
Collecting importlib-resources; python_version < "3.9"
  Downloading importlib_resources-3.3.0-py2.py3-none-any.whl (26 kB)
Collecting typing-extensions; python_version < "3.8"
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting promise
  Downloading promise-2.3.tar.gz (19 kB)
Collecting dataclasses; python_version < "3.7"
  Downloading dataclasses-0.7-py3-none-any.whl (18 kB)
Requirement already satisfied: absl-py in /usr/local/lib/python3.6/dist-packages (from tensorflow_datasets) (0.9.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow_datasets) (2020.4.5.1)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow_datasets) (1.25.9)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow_datasets) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests>=2.19.0->tensorflow_datasets) (2.8)
Collecting googleapis-common-protos<2,>=1.52.0
  Downloading googleapis_common_protos-1.52.0-py2.py3-none-any.whl (100 kB)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.6.1->tensorflow_datasets) (46.1.3)
Requirement already satisfied: zipp>=0.4; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from importlib-resources; python_version < "3.9"->tensorflow_datasets) (3.1.0)
Building wheels for collected packages: future, promise
  Building wheel for future (setup.py): started
  Building wheel for future (setup.py): finished with status 'done'
  Created wheel for future: filename=future-0.18.2-py3-none-any.whl size=491058 sha256=8ee2ee7f64812eaeae88dab477913a1a102478229062751714d6baea811887ea
  Stored in directory: /root/.cache/pip/wheels/6e/9c/ed/4499c9865ac1002697793e0ae05ba6be33553d098f3347fb94
  Building wheel for promise (setup.py): started
  Building wheel for promise (setup.py): finished with status 'done'
  Created wheel for promise: filename=promise-2.3-py3-none-any.whl size=21495 sha256=6a50141ea6170764a8fea8f42c9875da12c21f0528a4b72678a6a683e10860ce
  Stored in directory: /root/.cache/pip/wheels/59/9a/1d/3f1afbbb5122d0410547bf9eb50955f4a7a98e53a6d8b99bd1
Successfully built future promise
Installing collected packages: future, dill, googleapis-common-protos, tensorflow-metadata, attrs, tqdm, importlib-resources, typing-extensions, promise, dataclasses, tensorflow-datasets
Successfully installed attrs-20.3.0 dataclasses-0.7 dill-0.3.3 future-0.18.2 googleapis-common-protos-1.52.0 importlib-resources-3.3.0 promise-2.3 tensorflow-datasets-4.1.0 tensorflow-metadata-0.25.0 tqdm-4.51.0 typing-extensions-3.7.4.3
WARNING: You are using pip version 20.0.2; however, version 20.2.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
/opt/ml/model s3://edumunozsala-ml-sagemaker/transformer-nmt/tf-transformer-single-gpu-2020-11-12-18-36-15/model
Get the train data
Tokenize the input and output data and create the vocabularies
Input vocab:  11460
Output vocab:  9383
Creating the checkpoint ...
Last checkpoint restored.
Training the model ....
Starting epoch 1
Epoch 1 Batch 0 Loss 0.7465 Accuracy 0.3560
Epoch 1 Batch 100 Loss 0.7395 Accuracy 0.3574
Epoch 1 Batch 200 Loss 0.7495 Accuracy 0.3559
Epoch 1 Batch 300 Loss 0.7552 Accuracy 0.3551
Epoch 1 Batch 400 Loss 0.7641 Accuracy 0.3543
Epoch 1 Batch 500 Loss 0.7689 Accuracy 0.3541
Epoch 1 Batch 600 Loss 0.7754 Accuracy 0.3538
Epoch 1 Batch 700 Loss 0.7815 Accuracy 0.3535
Epoch 1 Batch 800 Loss 0.7849 Accuracy 0.3531
Epoch 1 Batch 900 Loss 0.7891 Accuracy 0.3530
Saving checkpoint for epoch 1 in /opt/ml/checkpoints/ckpt-9
Starting epoch 2
Epoch 2 Batch 0 Loss 0.7752 Accuracy 0.3616
Epoch 2 Batch 100 Loss 0.6931 Accuracy 0.3643
Epoch 2 Batch 200 Loss 0.7000 Accuracy 0.3629
Epoch 2 Batch 300 Loss 0.7107 Accuracy 0.3621
Epoch 2 Batch 400 Loss 0.7174 Accuracy 0.3610
Epoch 2 Batch 500 Loss 0.7233 Accuracy 0.3607
Epoch 2 Batch 600 Loss 0.7292 Accuracy 0.3600
Epoch 2 Batch 700 Loss 0.7342 Accuracy 0.3596
Epoch 2 Batch 800 Loss 0.7372 Accuracy 0.3593
Epoch 2 Batch 900 Loss 0.7408 Accuracy 0.3589
Saving checkpoint for epoch 2 in /opt/ml/checkpoints/ckpt-10
Starting epoch 3
Epoch 3 Batch 0 Loss 0.6146 Accuracy 0.3627
Epoch 3 Batch 100 Loss 0.6685 Accuracy 0.3686
Epoch 3 Batch 200 Loss 0.6670 Accuracy 0.3673
Epoch 3 Batch 300 Loss 0.6770 Accuracy 0.3664
Epoch 3 Batch 400 Loss 0.6802 Accuracy 0.3660
Epoch 3 Batch 500 Loss 0.6862 Accuracy 0.3653
Epoch 3 Batch 600 Loss 0.6914 Accuracy 0.3650
Epoch 3 Batch 700 Loss 0.6965 Accuracy 0.3646
Epoch 3 Batch 800 Loss 0.7007 Accuracy 0.3640
Epoch 3 Batch 900 Loss 0.7036 Accuracy 0.3641
Saving checkpoint for epoch 3 in /opt/ml/checkpoints/ckpt-11
Starting epoch 4
Epoch 4 Batch 0 Loss 0.6168 Accuracy 0.3884
Epoch 4 Batch 100 Loss 0.6127 Accuracy 0.3745
Epoch 4 Batch 200 Loss 0.6317 Accuracy 0.3719
Epoch 4 Batch 300 Loss 0.6400 Accuracy 0.3723
Epoch 4 Batch 400 Loss 0.6493 Accuracy 0.3705
Epoch 4 Batch 500 Loss 0.6558 Accuracy 0.3700
Epoch 4 Batch 600 Loss 0.6614 Accuracy 0.3694
Epoch 4 Batch 700 Loss 0.6638 Accuracy 0.3692
Epoch 4 Batch 800 Loss 0.6702 Accuracy 0.3686
Epoch 4 Batch 900 Loss 0.6730 Accuracy 0.3683
Saving checkpoint for epoch 4 in /opt/ml/checkpoints/ckpt-12
Starting epoch 5
Epoch 5 Batch 0 Loss 0.6213 Accuracy 0.3940
Epoch 5 Batch 100 Loss 0.5858 Accuracy 0.3795
Epoch 5 Batch 200 Loss 0.6030 Accuracy 0.3767
Epoch 5 Batch 300 Loss 0.6117 Accuracy 0.3760
Epoch 5 Batch 400 Loss 0.6194 Accuracy 0.3754
Epoch 5 Batch 500 Loss 0.6277 Accuracy 0.3743
Epoch 5 Batch 600 Loss 0.6322 Accuracy 0.3742
Epoch 5 Batch 700 Loss 0.6369 Accuracy 0.3733
Epoch 5 Batch 800 Loss 0.6421 Accuracy 0.3726
Epoch 5 Batch 900 Loss 0.6467 Accuracy 0.3722

2020-11-12 19:13:13 Uploading - Uploading generated training model
2020-11-12 19:13:13 Completed - Training job completed
Saving checkpoint for epoch 5 in /opt/ml/checkpoints/ckpt-13
Saving the model ....
Saving the model parameters
Saving the dictionaries ....
2020-11-12 19:13:02,312 sagemaker_tensorflow_container.training WARNING  Your model will NOT be servable with SageMaker TensorFlow Serving container. The model artifact was not saved in the TensorFlow SavedModel directory structure:
https://www.tensorflow.org/guide/saved_model#structure_of_a_savedmodel_directory
2020-11-12 19:13:02,312 sagemaker-containers INFO     Reporting training SUCCESS
Training seconds: 2040
Billable seconds: 2040

This training job will return a new trained model, which you can download to make predictions as described in the previous section.

Delete the experiment
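
Note that delete_all with action="--force" removes the experiment together with all the trials and trial components associated with it, so use it only when you no longer need the tracked results.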

training_experiment.delete_all(action="--force")