Machine learning (ML) is split into two distinct phases: training and inference. Training deals with building the model, i.e. running an ML algorithm on a dataset in order to identify meaningful patterns. This often requires large amounts of storage and computing power, making the cloud a natural place to run training jobs with services such as Amazon SageMaker and the AWS Deep Learning AMIs.
Inference deals with using the model, i.e. predicting results for data samples that the model has never seen. Here, the requirements are different: developers are typically concerned with optimizing latency (how long does a single prediction take?) and throughput (how many predictions can I run in parallel?). Of course, the hardware architecture of your prediction environment has a very significant impact on such metrics, especially if you’re dealing with resource-constrained devices: as a Raspberry Pi enthusiast, I often wish the little fellow packed a little more punch to speed up my inference code.
Tuning a model for a specific hardware architecture is possible, but the lack of tooling makes this an error-prone and time-consuming process. Minor changes to the ML framework or the model itself usually require the user to start all over again. Unfortunately, this forces most ML developers to deploy the same model everywhere regardless of the underlying hardware, thus missing out on significant performance gains.
Well, no more. Today, I’m very happy to announce Amazon SageMaker Neo, a new capability of Amazon SageMaker that enables machine learning models to train once and run anywhere in the cloud and at the edge with optimal performance.
Introducing Amazon SageMaker Neo
Without any manual intervention, Amazon SageMaker Neo optimizes models deployed on Amazon EC2 instances, Amazon SageMaker endpoints and devices managed by AWS Greengrass.
Here are the supported configurations:
The Amazon SageMaker Neo compiler converts models into an efficient common format, which is executed on the device by a compact runtime that uses less than one-hundredth of the resources that a generic framework would traditionally consume. The Amazon SageMaker Neo runtime is optimized for the underlying hardware, using specific instruction sets that help speed up ML inference.
This has three main benefits:
Under the hood
Most machine learning frameworks represent a model as a computational graph: a vertex represents an operation on data arrays (tensors) and an edge represents data dependencies between operations. The Amazon SageMaker Neo compiler exploits patterns in the computational graph to apply high-level optimizations including operator fusion, which fuses multiple small operations together; constant-folding, which statically pre-computes portions of the graph to save execution costs; a static memory planning pass, which pre-allocates memory to hold each intermediate tensor; and data layout transformations, which transform internal data layouts into hardware-friendly forms. The compiler then produces efficient code for each operator.
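To make one of these optimizations more concrete, here’s a minimal sketch of constant folding on a toy expression graph; the Node class and fold_constants function are illustrative stand-ins of my own, not SageMaker Neo internals.
# Constant folding on a toy computational graph (illustrative sketch only)
class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op = op                # 'input', 'const', 'add' or 'mul'
        self.inputs = list(inputs)  # data dependencies, i.e. graph edges
        self.value = value          # only set for 'const' nodes

def fold_constants(node):
    # Recursively replace subgraphs whose inputs are all constant
    # with a single pre-computed constant node
    node.inputs = [fold_constants(i) for i in node.inputs]
    if node.op in ('add', 'mul') and all(i.op == 'const' for i in node.inputs):
        fn = {'add': lambda a, b: a + b, 'mul': lambda a, b: a * b}[node.op]
        return Node('const', value=fn(*[i.value for i in node.inputs]))
    return node

# (2 * 3) + x folds to 6 + x: the multiplication is paid once at compile time
x = Node('input')
graph = Node('add', [Node('mul', [Node('const', value=2), Node('const', value=3)]), x])
folded = fold_constants(graph)
print(folded.op, folded.inputs[0].value)  # add 6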
Once a model has been compiled, it can be run by the Amazon SageMaker Neo runtime. This runtime takes about 1MB of disk space, compared to the 500MB-1GB required by popular deep learning libraries. An application invokes a model by first loading the runtime, which then loads the model definition, model parameters, and precompiled operations.
I can’t wait to try this on my Raspberry Pi. Let’s get to work.
Downloading a pre-trained model
Plenty of pre-trained models are available in the Apache MXNet, GluonCV or TensorFlow model zoos: here, I’m using a 50-layer model based on the ResNet architecture, pre-trained with Apache MXNet on the ImageNet dataset.
First, I’m downloading the 227MB model as well as the JSON file defining its different layers. This file is particularly important: it tells me that the input symbol is called ‘data’ and that its shape is [1, 3, 224, 224], i.e. 1 image, 3 channels (red, green and blue), 224×224 pixels. I’ll need to make sure that images passed to the model have this exact shape. The output shape is [1, 1000], i.e. a vector containing the probability for each one of the 1,000 classes present in the ImageNet dataset.
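For illustration, here’s one way to massage an arbitrary picture into that exact layout with Pillow and NumPy; the ‘dog.jpg’ file name is just an example, and depending on how the model was trained you may also need to normalize the pixel values.
import numpy as np
from PIL import Image

# Resize the image and rearrange it from HWC (height, width, channel)
# to the NCHW layout the model expects: [1, 3, 224, 224]
img = Image.open('dog.jpg').convert('RGB').resize((224, 224))
image = np.asarray(img, dtype=np.float32)  # shape: (224, 224, 3)
image = image.transpose((2, 0, 1))         # shape: (3, 224, 224)
image = np.expand_dims(image, axis=0)      # shape: (1, 3, 224, 224)
np.save('dog.npy', image)                  # saved for the predictions below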
To define a performance baseline, I use this model and a vanilla unoptimized version of Apache MXNet 1.2 to predict a few images: on average, inference takes about 6.5 seconds and requires about 306 MB of RAM.
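Here’s roughly what that baseline measurement looks like with the vanilla MXNet Module API; this is a sketch assuming the checkpoint files downloaded above and the ‘dog.npy’ array from the preprocessing step, and your exact numbers will vary.
import time
import mxnet as mx
import numpy as np

# Load the downloaded checkpoint (prefix 'resnet50_v1', epoch 0)
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet50_v1', 0)
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params)

# Time a single prediction (asnumpy() blocks until the result is ready)
image = mx.nd.array(np.load('dog.npy').astype(np.float32))
start = time.time()
mod.forward(mx.io.DataBatch(data=[image]))
prob = mod.get_outputs()[0].asnumpy()
print("Inference took %.2f seconds" % (time.time() - start))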
That’s pretty slow: let’s compile the model and see how fast it gets.
Compiling the model for the Raspberry Pi
First, let’s store both model files in a compressed TAR archive and upload it to an Amazon S3 bucket.
$ tar cvfz model.tar.gz resnet50_v1-symbol.json resnet50_v1-0000.params
a resnet50_v1-symbol.json
a resnet50_v1-0000.params
$ aws s3 cp model.tar.gz s3://jsimon-neo/
upload: ./model.tar.gz to s3://jsimon-neo/model.tar.gz
Then, I just have to write a simple configuration file for my compilation job. If you’re curious about other frameworks and hardware targets, ‘aws sagemaker create-compilation-job help’ will give you the exact syntax to use.
{
    "CompilationJobName": "resnet50-mxnet-raspberrypi",
    "RoleArn": "$SAGEMAKER_ROLE_ARN",
    "InputConfig": {
        "S3Uri": "s3://jsimon-neo/model.tar.gz",
        "DataInputConfig": "{\"data\": [1, 3, 224, 224]}",
        "Framework": "MXNET"
    },
    "OutputConfig": {
        "S3OutputLocation": "s3://jsimon-neo/",
        "TargetDevice": "rasp3b"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 300
    }
}
Launching the compilation process takes a single command.
$ aws sagemaker create-compilation-job --cli-input-json file://job.json
Compilation completes in seconds. Let’s find out the name of the compilation artifact, fetch it from Amazon S3, and extract it locally.
$ aws sagemaker describe-compilation-job \
--compilation-job-name resnet50-mxnet-raspberrypi \
--query "ModelArtifacts"
{
    "S3ModelArtifacts": "s3://jsimon-neo/model-rasp3b.tar.gz"
}
$ aws s3 cp s3://jsimon-neo/model-rasp3b.tar.gz .
$ tar xvfz model-rasp3b.tar.gz
x compiled.params
x compiled_model.json
x compiled.so
As you can see, the artifact contains the compiled model parameters (compiled.params), the compiled model definition (compiled_model.json), and a shared library holding the hardware-optimized code for the model’s operators (compiled.so).
For convenience, let’s rename them to ‘model.params’, ‘model.json’ and ‘model.so’, and then copy them to a ‘resnet50’ directory on the Raspberry Pi.
$ mkdir resnet50
$ mv compiled.params resnet50/model.params
$ mv compiled_model.json resnet50/model.json
$ mv compiled.so resnet50/model.so
$ scp -r resnet50 pi@raspberrypi.local:~
Setting up the inference environment on the Raspberry Pi
Before I can predict images with the model, I need to install the appropriate runtime on my Raspberry Pi. Pre-built packages are available: I just have to download the one for the ‘armv7l’ architecture and install it on my Pi with the provided script. Please note that I don’t need to install any additional deep learning framework (Apache MXNet in this case), saving up to 1GB of persistent storage.
$ scp -r dlr-1.0-py2.py3-armv7l pi@raspberrypi.local:~
<ssh to the Pi>
$ cd dlr-1.0-py2.py3-armv7l
$ sh ./install-py3.sh
We’re all set. Time to predict images!
Using the Amazon SageMaker Neo runtime
On the Pi, the runtime is available as a Python package named ‘dlr’ (deep learning runtime). Using it to predict images works just as you would expect: load the compiled model, load an input image, run the prediction, and read back the most probable class.
Here’s the corresponding Python code.
import os
import numpy as np
from dlr import DLRModel

# Load the compiled model from the 'resnet50' directory
model_path = 'resnet50'
input_shape = {'data': [1, 3, 224, 224]}  # A single RGB 224x224 image
output_shape = [1, 1000]                  # The probability for each one of the 1,000 classes
device = 'cpu'                            # Go, Raspberry Pi, go!
model = DLRModel(model_path, input_shape, output_shape, device)

# Load names for ImageNet classes (synset.txt stores them as a Python literal)
synset_path = os.path.join(model_path, 'synset.txt')
with open(synset_path, 'r') as f:
    synset = eval(f.read())

# Load an image stored as a numpy array, shaped [1, 3, 224, 224]
image = np.load('dog.npy').astype(np.float32)
print(image.shape)
input_data = {'data': image}

# Predict, then report the most probable class and its probability
out = model.run(input_data)
top1 = np.argmax(out[0])
prob = np.max(out[0])
print("Class: %s, probability: %f" % (synset[top1], prob))
Let’s give it a try on this image. Aren’t chihuahuas and Raspberry Pis made for one another?
(1, 3, 224, 224)
Class: Chihuahua, probability: 0.901803
The prediction is correct, but what about speed and memory consumption? Well, this prediction takes about 0.85 seconds and requires about 260MB of RAM: with Amazon SageMaker Neo, it’s now more than 7 times faster and about 15% more RAM-efficient than with a vanilla model.
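If you’d like to reproduce these measurements, here’s a minimal sketch, continuing from the Python script above: it averages the latency of model.run over a few calls and reads the process’s peak memory with the standard resource module. Numbers will of course differ from one Pi to another.
import resource
import time

# Average latency over several predictions (the first call may include
# one-off warm-up costs, so measuring a single run can be misleading)
runs = 10
start = time.time()
for _ in range(runs):
    out = model.run(input_data)
print("Average latency: %.2f seconds" % ((time.time() - start) / runs))

# Peak resident set size of the process; ru_maxrss is reported in KB on Linux
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("Peak memory: %d MB" % (peak_kb / 1024))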
This impressive performance gain didn’t require any complex and time-consuming work: all we had to do was to compile the model. Of course, your mileage will vary depending on models and hardware architectures, but you should see significant improvements across the board, including on Amazon EC2 instances such as the C5 or P3 families.
Now available
I hope this post was informative. Compiling models with Amazon SageMaker Neo is free of charge; you only pay for the underlying resources running the model (Amazon EC2 instances, Amazon SageMaker instances and devices managed by AWS Greengrass).
The service is generally available today in US East (N. Virginia), US West (Oregon) and Europe (Ireland). Please start exploring and let us know what you think. We can’t wait to see what you will build!
— Julien;