Glen Knight

NYC Based IT Professional

Amazon EKS Now Supports EC2 Inf1 Instances

Amazon Elastic Kubernetes Service (EKS) has quickly become a leading choice for machine learning workloads. It combines the developer agility and the scalability of Kubernetes, with the wide selection of Amazon Elastic Compute Cloud (EC2) instance types available on AWS, such as the C5, P3, and G4 families.

As models become more sophisticated, hardware acceleration is increasingly required to deliver fast predictions at high throughput. Today, we’re very happy to announce that AWS customers can now use the Amazon EC2 Inf1 instances on Amazon Elastic Kubernetes Service, for high performance and the lowest prediction cost in the cloud.

A primer on EC2 Inf1 instances
Inf1 instances were launched at AWS re:Invent 2019. They are powered by AWS Inferentia, a custom chip built from the ground up by AWS to accelerate machine learning inference workloads.

Inf1 instances are available in multiple sizes, with 1, 4, or 16 AWS Inferentia chips, with up to 100 Gbps network bandwidth and up to 19 Gbps EBS bandwidth. An AWS Inferentia chip contains four NeuronCores. Each one implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, saving I/O time in the process. When several AWS Inferentia chips are available on an Inf1 instance, you can partition a model across them and store it entirely in cache memory. Alternatively, to serve multi-model predictions from a single Inf1 instance, you can partition the NeuronCores of an AWS Inferentia chip across several models.

Compiling Models for EC2 Inf1 Instances
To run machine learning models on Inf1 instances, you need to compile them to a hardware-optimized representation using the AWS Neuron SDK. All tools are readily available on the AWS Deep Learning AMI, and you can also install them on your own instances. You’ll find instructions in the Deep Learning AMI documentation, as well as tutorials for TensorFlow, PyTorch, and Apache MXNet in the AWS Neuron SDK repository.

In the demo below, I will show you how to deploy a Neuron-optimized model on an EKS cluster of Inf1 instances, and how to serve predictions with TensorFlow Serving. The model in question is BERT, a state of the art model for natural language processing tasks. This is a huge model with hundreds of millions of parameters, making it a great candidate for hardware acceleration.

Building an EKS Cluster of EC2 Inf1 Instances
First of all, let’s build a cluster with two inf1.2xlarge instances. I can easily do this with eksctl, the command-line tool to provision and manage EKS clusters. You can find installation instructions in the EKS documentation.

Here is the configuration file for my cluster. Eksctl detects that I’m launching a node group with an Inf1 instance type, and will start your worker nodes using the EKS-optimized Accelerated AMI.

kind: ClusterConfig
  name: cluster-inf1
  region: us-west-2
  - name: ng1-public
    instanceType: inf1.2xlarge
    minSize: 0
    maxSize: 3
    desiredCapacity: 2
      allow: true

Then, I use eksctl to create the cluster. This process will take approximately 10 minutes.

$ eksctl create cluster -f inf1-cluster.yaml

Eksctl automatically installs the Neuron device plugin in your cluster. This plugin advertises Neuron devices to the Kubernetes scheduler, which can be requested by containers in a deployment spec. I can check with kubectl that the device plug-in container is running fine on both Inf1 instances.

$ kubectl get pods -n kube-system
NAME                                  READY STATUS  RESTARTS AGE
aws-node-tl5xv                        1/1   Running 0        14h
aws-node-wk6qm                        1/1   Running 0        14h
coredns-86d5cbb4bd-4fxrh              1/1   Running 0        14h
coredns-86d5cbb4bd-sts7g              1/1   Running 0        14h
kube-proxy-7px8d                      1/1   Running 0        14h
kube-proxy-zqvtc                      1/1   Running 0        14h
neuron-device-plugin-daemonset-888j4  1/1   Running 0        14h
neuron-device-plugin-daemonset-tq9kc  1/1   Running 0        14h

Next, I define AWS credentials in a Kubernetes secret. They will allow me to grab my BERT model stored in S3. Please note that both keys needs to be base64-encoded.

apiVersion: v1 
kind: Secret 
  name: aws-s3-secret 
type: Opaque 
  AWS_ACCESS_KEY_ID: <base64-encoded value> 
  AWS_SECRET_ACCESS_KEY: <base64-encoded value>

Finally, I store these credentials on the cluster.

$ kubectl apply -f secret.yaml

The cluster is correctly set up. Now, let’s build an application container storing a Neuron-enabled version of TensorFlow Serving.

Building an Application Container for TensorFlow Serving
The Dockerfile is very simple. We start from an Amazon Linux 2 base image. Then, we install the AWS CLI, and the TensorFlow Serving package available in the Neuron repository.

FROM amazonlinux:2
RUN yum install -y awscli
RUN echo $'[neuron] n
name=Neuron YUM Repository n
baseurl= n
enabled=1' > /etc/yum.repos.d/neuron.repo
RUN rpm --import
RUN yum install -y tensorflow-model-server-neuron

I build the image, create an Amazon Elastic Container Registry repository, and push the image to it.

$ docker build . -f Dockerfile -t tensorflow-model-server-neuron
$ docker tag IMAGE_NAME
$ aws ecr create-repository --repository-name inf1-demo
$ docker push

Our application container is ready. Now, let’s define a Kubernetes service that will use this container to serve BERT predictions. I’m using a model that has already been compiled with the Neuron SDK. You can compile your own using the instructions available in the Neuron SDK repository.

Deploying BERT as a Kubernetes Service
The deployment manages two containers: the Neuron runtime container, and my application container. The Neuron runtime runs as a sidecar container image, and is used to interact with the AWS Inferentia chips. At startup, the latter configures the AWS CLI with the appropriate security credentials. Then, it fetches the BERT model from S3. Finally, it launches TensorFlow Serving, loading the BERT model and waiting for prediction requests. For this purpose, the HTTP and grpc ports are open. Here is the full manifest.

kind: Service
apiVersion: v1
  name: eks-neuron-test
    app: eks-neuron-test
  - name: http-tf-serving
    port: 8500
    targetPort: 8500
  - name: grpc-tf-serving
    port: 9000
    targetPort: 9000
    app: eks-neuron-test
    role: master
  type: ClusterIP
kind: Deployment
apiVersion: apps/v1
  name: eks-neuron-test
    app: eks-neuron-test
    role: master
  replicas: 2
      app: eks-neuron-test
      role: master
        app: eks-neuron-test
        role: master
        - name: sock
          emptyDir: {}
      - name: eks-neuron-test
        command: ["/bin/sh","-c"]
          - "mkdir ~/.aws/ && 
           echo '[eks-test-profile]' > ~/.aws/credentials && 
           echo AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID >> ~/.aws/credentials && 
           echo AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY >> ~/.aws/credentials; 
           /usr/bin/aws --profile eks-test-profile s3 sync s3://jsimon-inf1-demo/bert /tmp/bert && 
           /usr/local/bin/tensorflow_model_server_neuron --port=9000 --rest_api_port=8500 --model_name=bert_mrpc_hc_gelus_b4_l24_0926_02 --model_base_path=/tmp/bert/"
        - containerPort: 8500
        - containerPort: 9000
        imagePullPolicy: Always
        - name: AWS_ACCESS_KEY_ID
              key: AWS_ACCESS_KEY_ID
              name: aws-s3-secret
        - name: AWS_SECRET_ACCESS_KEY
              key: AWS_SECRET_ACCESS_KEY
              name: aws-s3-secret
        - name: NEURON_RTD_ADDRESS
          value: unix:/sock/neuron.sock

            cpu: 4
            memory: 4Gi
            cpu: "1"
            memory: 1Gi
          - name: sock
            mountPath: /sock

      - name: neuron-rtd
            - SYS_ADMIN
            - IPC_LOCK

          - name: sock
            mountPath: /sock
            hugepages-2Mi: 256Mi
            memory: 1024Mi

I use kubectl to create the service.

$ kubectl create -f bert_service.yml

A few seconds later, the pods are up and running.

$ kubectl get pods
NAME                           READY STATUS  RESTARTS AGE
eks-neuron-test-5d59b55986-7kdml 2/2   Running 0        14h
eks-neuron-test-5d59b55986-gljlq 2/2   Running 0        14h

Finally, I redirect service port 9000 to local port 9000, to let my prediction client connect locally.

$ kubectl port-forward svc/eks-neuron-test 9000:9000 &

Now, everything is ready for prediction, so let’s invoke the model.

Predicting with BERT on EKS and Inf1
The inner workings of BERT are beyond the scope of this post. This particular model expects a sequence of 128 tokens, encoding the words of two sentences we’d like to compare for semantic equivalence.

Here, I’m only interested in measuring prediction latency, so dummy data is fine. I build 100 prediction requests storing a sequence of 128 zeros. I send them to the TensorFlow Serving endpoint via grpc, and I compute the average prediction time.

import numpy as np
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import time

if __name__ == '__main__':
    channel = grpc.insecure_channel('localhost:9000')
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    request = predict_pb2.PredictRequest() = 'bert_mrpc_hc_gelus_b4_l24_0926_02'
    i = np.zeros([1, 128], dtype=np.int32)
    request.inputs['input_ids'].CopyFrom(tf.contrib.util.make_tensor_proto(i, shape=i.shape))
    request.inputs['input_mask'].CopyFrom(tf.contrib.util.make_tensor_proto(i, shape=i.shape))
    request.inputs['segment_ids'].CopyFrom(tf.contrib.util.make_tensor_proto(i, shape=i.shape))

    latencies = []
    for i in range(100):
        start = time.time()
        result = stub.Predict(request)
        latencies.append(time.time() - start)
        print("Inference successful: {}".format(i))
    print ("Ran {} inferences successfully. Latency average = {}".format(len(latencies), np.average(latencies)))

On average, prediction took 5.92ms. As far as BERT goes, this is pretty good!

Ran 100 inferences successfully. Latency average = 0.05920819044113159

In real-life, we would certainly be batching prediction requests in order to increase throughput. If needed, we could also scale to larger Inf1 instances supporting several Inferentia chips, and deliver even more prediction performance at low cost.

Getting Started
Kubernetes users can deploy Amazon Elastic Compute Cloud (EC2) Inf1 instances on Amazon Elastic Kubernetes Service today in the US East (N. Virginia) and US West (Oregon) regions. As Inf1 deployment progresses, you’ll be able to use them with Amazon Elastic Kubernetes Service in more regions.

Give this a try, and please send us feedback either through your usual AWS Support contacts, on the AWS Forum for Amazon Elastic Kubernetes Service, or on the container roadmap on Github.

– Julien

Source: AWS News

Leave a Reply

Your email address will not be published. Required fields as marked *.