In recent years, the meteoric rise of deep learning has made incredible applications possible, such as detecting skin cancer (SkinVision) and building autonomous vehicles (TuSimple). Thanks to neural networks, deep learning indeed has the uncanny ability to extract and model intricate patterns from vast amounts of unstructured data (e.g. images, video, and free-form text).
However, training these neural networks requires equally vasts amounts of computing power. Graphics Processing Units (GPUs) have long proven that they were up to that task, and AWS customers have quickly understood how they could use Amazon Elastic Compute Cloud (EC2) P2 and P3 instances to train their models, in particular on Amazon SageMaker, our fully-managed, modular, machine learning service.
Today, I’m very happy to announce that the largest P3 instance, named p3dn.24xlarge, is now available for model training on Amazon SageMaker. Launched last year, this instance is designed to accelerate large, complex, distributed training jobs: it has twice as much GPU memory as other P3 instances, 50% more vCPUs, blazing-fast local NVMe storage, and 100 Gbit networking.
How about we give it a try on Amazon SageMaker?
Introducing EC2 P3dn instances on Amazon SageMaker
Let’s start from this notebook, which uses the built-in image classification algorithm to train a model on the Caltech-256 dataset. All I have to do to use a p3dn.24xlarge instance on Amazon SageMaker is to set
'ml.p3dn.24xlarge', and train!
ic = sagemaker.estimator.Estimator(training_image, role, train_instance_count=1, train_instance_type='ml.p3dn.24xlarge', input_mode='File', output_path=s3_output_location, sagemaker_session=sess) ... ic.fit(...)
I ran some quick tests on this notebook, and I got a sweet 20% training speedup out of the box (your mileage may vary!). I’m using
'File' mode here, meaning that the full dataset is copied to the training instance: the faster network (100 Gbit, up from 25 Gbit) and storage (local NVMe instead of Amazon EBS) are certainly helping!
When working with large data sets, you could put 100 Gbit networking to good use either by streaming data from Amazon Simple Storage Service (S3) with Pipe Mode, or by storing it in Amazon Elastic File System or Amazon FSx for Lustre. It would also help with distributed training (using Horovod, maybe), as instances would be able to exchange parameter updates faster.
In short, the Amazon SageMaker and P3dn tag team packs quite a punch, and it should deliver a significant performance improvement for large-scale deep learning workloads.
P3dn instances are available on Amazon SageMaker in the US East (N. Virginia) and US West (Oregon) regions. If you are ready to get started, please contact your AWS account team or use the Contact Us page to make a request.
Source: AWS News