{"id":167,"date":"2020-09-02T01:40:06","date_gmt":"2020-09-02T01:40:06","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/09\/02\/how-to-run-distributed-training-using-horovod-and-mxnet-on-aws-dl-containers-and-aws-deep-learning-amis\/"},"modified":"2020-09-02T01:40:06","modified_gmt":"2020-09-02T01:40:06","slug":"how-to-run-distributed-training-using-horovod-and-mxnet-on-aws-dl-containers-and-aws-deep-learning-amis","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/09\/02\/how-to-run-distributed-training-using-horovod-and-mxnet-on-aws-dl-containers-and-aws-deep-learning-amis\/","title":{"rendered":"How to run distributed training using Horovod and MXNet on AWS DL Containers and AWS\u00a0 Deep Learning AMIs"},"content":{"rendered":"<div id=\"\">\n<p>Distributed training of large deep learning models has become an indispensable way of model training for computer vision (CV) and natural language processing (NLP) applications. Open source frameworks such as <a href=\"https:\/\/github.com\/horovod\/horovod\" target=\"_blank\" rel=\"noopener noreferrer\">Horovod<\/a> provide distributed training support to Apache MXNet, PyTorch, and TensorFlow. Converting your non-distributed <a href=\"https:\/\/mxnet.apache.org\/versions\/1.6\/\" target=\"_blank\" rel=\"noopener noreferrer\">Apache MXNet<\/a> training script to use distributed training with Horovod only requires 4-5 lines of additional code. Horovod is an open-source distributed deep learning framework created by Uber. It leverages efficient inter-GPU and inter-node communication methods such as NVIDIA Collective Communications Library (NCCL) and Message Passing Interface (MPI) to distribute and aggregate model parameters between workers. The primary use case of Horovod is to make distributed deep learning fast and easy: to take a single-GPU training script and scale it successfully to train across many GPUs in parallel. 
For those unfamiliar with using Horovod and Apache MXNet for distributed training, we recommend first reading our previous <a href=\"https:\/\/medium.com\/apache-mxnet\/distributed-training-using-apache-mxnet-with-horovod-44f98bf0e7b7\" target=\"_blank\" rel=\"noopener noreferrer\">blog post<\/a> on the subject before diving into this example.<\/p>\n<p>MXNet is integrated with Horovod through the common distributed training APIs defined in Horovod. You can convert a non-distributed training script to use Horovod by following the high-level <a href=\"https:\/\/github.com\/horovod\/horovod\/blob\/master\/docs\/mxnet.rst\" target=\"_blank\" rel=\"noopener noreferrer\">code skeleton<\/a>. This is a streamlined user experience in which you only have to add a few lines of code to make a script Horovod compatible. However, other pain points may still keep distributed training from flowing as smoothly as expected. For example, you may need to install additional software and libraries and resolve incompatibilities to make distributed training work. Horovod requires a certain version of <a href=\"https:\/\/github.com\/horovod\/horovod#install\">Open MPI<\/a>, and if you want to leverage high-performance training on NVIDIA GPUs, you need to install the <a href=\"https:\/\/github.com\/NVIDIA\/nccl\">NCCL<\/a> library. Another pain point you may encounter is when trying to scale up the number of training nodes in the cluster.
You need to make sure all the software and libraries in the new nodes are properly installed and configured.<\/p>\n<p><a href=\"https:\/\/aws.amazon.com\/machine-learning\/containers\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Deep Learning Containers<\/a> (AWS DL Containers) have greatly simplified the process of launching new training instances in a cluster, and the <a href=\"https:\/\/github.com\/aws\/deep-learning-containers\/blob\/master\/available_images.md#general-framework-containers\" target=\"_blank\" rel=\"noopener noreferrer\">latest release<\/a> includes all the required libraries to run distributed training using MXNet with Horovod. The <a href=\"https:\/\/aws.amazon.com\/machine-learning\/amis\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Deep Learning AMIs<\/a> (DLAMI) come with popular open-source deep learning frameworks and pre-configured CUDA, cuDNN, Open MPI, and NCCL libraries.<\/p>\n<p>In this post, we demonstrate how to run distributed training using Horovod and MXNet via AWS DL Containers and the DLAMIs.<\/p>\n<h2>Getting started with AWS DL Containers<\/h2>\n<p>AWS DL Containers are a set of Docker images pre-installed with deep learning frameworks to make it easy to deploy custom machine learning (ML) environments quickly. The AWS DL Containers provide optimized environments with different deep learning frameworks (MXNet, TensorFlow, PyTorch), NVIDIA CUDA (for GPU instances), and Intel MKL (for CPU instances) libraries, and are available in <a href=\"http:\/\/aws.amazon.com\/ecr\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Container Registry<\/a> (Amazon ECR).
You can launch AWS DL Containers on <a href=\"https:\/\/aws.amazon.com\/eks\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Kubernetes Service<\/a> (Amazon EKS), self-managed Kubernetes on <a href=\"http:\/\/aws.amazon.com\/ec2\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Compute Cloud<\/a> (Amazon EC2), and <a href=\"http:\/\/aws.amazon.com\/ecs\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Container Service<\/a> (Amazon ECS). For more information about launching AWS DL Containers, follow this <a href=\"https:\/\/aws.amazon.com\/blogs\/aws\/new-aws-deep-learning-containers\/\" target=\"_blank\" rel=\"noopener noreferrer\">link<\/a>.<\/p>\n<h3>Training an MXNet model with Deep Learning Containers on Amazon EC2<\/h3>\n<p>The MXNet <a href=\"https:\/\/github.com\/aws\/deep-learning-containers\/blob\/master\/available_images.md#general-framework-containers\" target=\"_blank\" rel=\"noopener noreferrer\">Deep Learning Container<\/a> comes with pre-installed libraries such as MXNet, Horovod, NCCL, MPI, CUDA, and cuDNN. The following diagram illustrates this architecture.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15445\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/0-Architecture.jpg\" alt=\"\" width=\"490\" height=\"441\"><\/p>\n<p>For instructions on setting up AWS DL Containers on an EC2 instance, see: <a href=\"https:\/\/aws.amazon.com\/getting-started\/hands-on\/train-deep-learning-model-aws-ec2-containers\/\" target=\"_blank\" rel=\"noopener noreferrer\">Train a Deep Learning model with AWS Deep Learning Containers on Amazon EC2<\/a>. For a hands-on tutorial running a Horovod training script, complete steps 1-5 of the preceding <a href=\"https:\/\/aws.amazon.com\/getting-started\/hands-on\/train-deep-learning-model-aws-ec2-containers\/\" target=\"_blank\" rel=\"noopener noreferrer\">post<\/a>. 
To use the MXNet framework, complete the following for step 6:<\/p>\n<p><strong>CPU<\/strong>:<\/p>\n<ol>\n<li>Download the Docker image from the Amazon ECR repository.<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com\/mxnet-training:1.6.0-cpu-py27-ubuntu16.04<\/code><\/pre>\n<\/div>\n<\/div>\n<ol start=\"2\">\n<li>In the terminal of the container, run the following commands to train the MNIST example.<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">git clone --recursive https:\/\/github.com\/horovod\/horovod.git\r\nmpirun -np 1 -H localhost:1 --allow-run-as-root python horovod\/examples\/mxnet_mnist.py\r\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p><strong>GPU<\/strong>:<\/p>\n<ol>\n<li>Download the Docker image from the Amazon ECR repository.<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">nvidia-docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com\/mxnet-training:1.6.0-gpu-py27-cu101-ubuntu16.04<\/code><\/pre>\n<\/div>\n<\/div>\n<ol start=\"2\">\n<li>In the terminal of the container, run the following commands to train the MNIST example.<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">git clone --recursive https:\/\/github.com\/horovod\/horovod.git\r\nmpirun -np 4 -H localhost:4 --allow-run-as-root python horovod\/examples\/mxnet_mnist.py<\/code><\/pre>\n<\/div>\n<\/div>\n<p>If the final output looks like the following code, you successfully ran the training script:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">[1,0]&lt;stderr&gt;:INFO:root:Epoch[4]    Train: accuracy=0.987580    Validation:
accuracy=0.988582\r\n[1,0]&lt;stderr&gt;:INFO:root:Training finished with Validation Accuracy of 0.988582<\/code><\/pre>\n<\/div>\n<p>For instructions on stopping the EC2 instances, complete step 7 of the preceding <a href=\"https:\/\/aws.amazon.com\/getting-started\/hands-on\/train-deep-learning-model-aws-ec2-containers\/\" target=\"_blank\" rel=\"noopener noreferrer\">post<\/a>. You can follow the same steps with your own training script.<\/p>\n<h3>Training an MXNet model with Deep Learning Containers on Amazon EKS<\/h3>\n<p>Amazon EKS is a managed service that makes it easy for you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. In this post, we show you how to set up a deep learning environment using Amazon EKS and AWS DL Containers. With Amazon EKS, you can scale a production-ready environment for multi-node training and inference with Kubernetes containers.<\/p>\n<p>The following diagram illustrates this architecture:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15432\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/1-Architecture.jpg\" alt=\"\" width=\"900\" height=\"385\"><\/p>\n<p>For instructions on setting up a deep learning environment with Amazon EKS and AWS DL Containers, see <a href=\"https:\/\/docs.aws.amazon.com\/deep-learning-containers\/latest\/devguide\/deep-learning-containers-eks-setup.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EKS Setup<\/a>. To set up an Amazon EKS cluster, use the open-source tool <code>eksctl<\/code>. We recommend using an EC2 instance with the latest DLAMI. You can spin up a GPU cluster or CPU cluster based on your use case.
For this post, follow the <a href=\"https:\/\/docs.aws.amazon.com\/deep-learning-containers\/latest\/devguide\/deep-learning-containers-eks-setup.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EKS Setup<\/a> instructions until the <strong>Manage Your Cluster<\/strong> section.<\/p>\n<p>When your Amazon EKS cluster is up and running, you can run the Horovod MXNet training on the cluster. For instructions, see <a href=\"https:\/\/docs.aws.amazon.com\/deep-learning-containers\/latest\/devguide\/deep-learning-containers-eks-tutorials-distributed-gpu-training.html#deep-learning-containers-eks-tutorials-distributed-gpu-training-mxnet\" target=\"_blank\" rel=\"noopener noreferrer\">MXNet with Horovod distributed GPU training<\/a>, which uses a Docker image that already contains a Horovod training script and a three-node cluster with <code>node-type=p3.8xlarge<\/code>. This tutorial runs the <a href=\"https:\/\/github.com\/horovod\/horovod\/blob\/master\/examples\/mxnet_mnist.py\" target=\"_blank\" rel=\"noopener noreferrer\">Horovod example script<\/a> for MXNet on an MNIST model. The Horovod examples directory also contains an <a href=\"https:\/\/github.com\/horovod\/horovod\/blob\/master\/examples\/mxnet_imagenet_resnet50.py\" target=\"_blank\" rel=\"noopener noreferrer\">Imagenet<\/a> script, which you can run on the same Amazon EKS cluster.<\/p>\n<h2>Getting started with the AWS DLAMI<\/h2>\n<p>The AWS DLAMIs are machine learning images loaded with deep learning frameworks and their dependent libraries, such as NVIDIA CUDA, NVIDIA cuDNN, NCCL, Intel MKL-DNN, and many others. The DLAMI is a one-stop shop for deep learning in the cloud. You can launch EC2 instances with <a href=\"https:\/\/aws.amazon.com\/marketplace\/pp\/B077GCH38C\" target=\"_blank\" rel=\"noopener noreferrer\">Ubuntu<\/a> or <a href=\"https:\/\/aws.amazon.com\/marketplace\/pp\/B077GF11NF\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Linux<\/a>.
The DLAMI comes with pre-installed deep learning frameworks such as Apache MXNet, TensorFlow, Keras, and PyTorch. You can train custom models, experiment with new deep learning algorithms, and learn new deep learning skills and techniques. The AMIs also offer GPU and CPU acceleration through pre-configured drivers, include Anaconda virtual environments, and come with popular Python packages.<\/p>\n<p>The DLAMI for Ubuntu and Amazon Linux now comes with pre-installed Horovod support with an MXNet backend. You can scale your ML model from a single GPU to multiple GPUs or a multi-node cluster using an EC2 GPU instance. You can also achieve greater scaling efficiency and higher multi-GPU training performance by using Horovod with MXNet compared to the native MXNet KVStore.<\/p>\n<p>All versions of the DLAMI beginning with <a href=\"https:\/\/aws.amazon.com\/marketplace\/pp\/Amazon-Web-Services-AWS-Deep-Learning-AMI-Ubuntu-1\/B07Y43P7X5\" target=\"_blank\" rel=\"noopener noreferrer\">Ubuntu 18.04 v27.0<\/a>, <a href=\"https:\/\/aws.amazon.com\/marketplace\/pp\/Amazon-Web-Services-Deep-Learning-AMI-Amazon-Linux\/B077GF11NF\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Linux v27.0<\/a>, and <a href=\"https:\/\/aws.amazon.com\/marketplace\/pp\/Amazon-Web-Services-AWS-Deep-Learning-AMI-Amazon-L\/B07NMRZ36T\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Linux 2 v27.0<\/a> support Horovod with MXNet. You can use any AWS CPU or GPU machine to spin up the instance using deep learning images. We recommend CPU instances of type C5, C5n, or C4 (optimized for high-performance, compute-intensive workloads) and GPU instances of type P2 and P3 (the latest generation of general-purpose GPU instances).<\/p>\n<p>You can run Horovod training on a single-node or multi-node cluster. A single-node cluster consists of a single machine. A multi-node cluster consists of multiple homogeneous machines.
In this post, we walk you through running Horovod multi-node cluster training using MXNet.<\/p>\n<h3>Creating a multi-node cluster using the DLAMI<\/h3>\n<p>You can spin up the EC2 instances with <a href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a> templates, the <a href=\"http:\/\/aws.amazon.com\/cli\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Command Line Interface<\/a> (AWS CLI), or on the <a href=\"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/EC2_GetStarted.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EC2 console<\/a>. For this post, we use the Amazon EC2 console. We launch multiple identical EC2 instances from the same DLAMI. We spin up the instances in the same Region, Availability Zone, and placement group because those factors play an important role in achieving high performance.<\/p>\n<ol>\n<li>On the Amazon EC2 console, search for <code>Deep Learning AMI<\/code>.<\/li>\n<li>Choose <strong>Select<\/strong> for any <strong>Deep Learning AMI (Ubuntu)<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15433\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/2-Choose-an-AMI.jpg\" alt=\"\" width=\"900\" height=\"454\"><\/p>\n<ol start=\"3\">\n<li>You now have to choose the instance type.<\/li>\n<\/ol>\n<p>AWS supports various categories of instances.
Based on your use case, such as training time and cost, you can select <a href=\"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/general-purpose-instances.html\" target=\"_blank\" rel=\"noopener noreferrer\">General Purpose<\/a> instances such as M5, <a href=\"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/compute-optimized-instances.html\" target=\"_blank\" rel=\"noopener noreferrer\">Compute optimized<\/a> instances such as C5, or <a href=\"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/accelerated-computing-instances.html\" target=\"_blank\" rel=\"noopener noreferrer\">GPU-based<\/a> instances such as the P2 or P3 family. You can create a cluster with as many instances as your requirements dictate. For this post, we select four p3.8xlarge instances with a total of 16 GPUs.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15434\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/3-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"127\"><\/p>\n<ol start=\"4\">\n<li>Choose <strong>Next: Configure Instance Details.<\/strong>\n<\/li>\n<\/ol>\n<p>Next, we need to configure the instances.<\/p>\n<ol start=\"5\">\n<li>For <strong>Number of instances<\/strong>, enter <code>4<\/code>.<\/li>\n<li>Enter your specific network, subnet, and placement group.<\/li>\n<\/ol>\n<p>If you don\u2019t have a placement group, you can create one.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15435\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/4-Configure-instance-details.jpg\" alt=\"\" width=\"900\" height=\"481\"><\/p>\n<ol start=\"7\">\n<li>Choose <strong>Next: Add Storage.<\/strong>\n<\/li>\n<\/ol>\n<p>You can change the storage size based on your dataset.
For demo purposes, we use the default value.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15436\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/5-Add-storage.jpg\" alt=\"\" width=\"900\" height=\"176\"><\/p>\n<ol start=\"8\">\n<li>Choose <strong>Next: Add Tags<\/strong>.<\/li>\n<li>For <strong>Key<\/strong>, enter <code>Name<\/code>.<\/li>\n<li>For <strong>Value<\/strong>, enter <code>Horovod_MXNet<\/code>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15437\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/6-Add-Tags.jpg\" alt=\"\" width=\"900\" height=\"180\"><\/p>\n<ol start=\"11\">\n<li>Choose <strong>Next: Configure Security Group<\/strong>.<\/li>\n<li>Create your own security group or use an existing one.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15447\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/5A-Configure-Security-Group.jpg\" alt=\"\" width=\"900\" height=\"212\"><\/p>\n<ol start=\"13\">\n<li>Choose <strong>Review and Launch<\/strong>.<\/li>\n<li>Review your instance launch details and choose <strong>Launch<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15438\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/7-Review-Instance-Launch.jpg\" alt=\"\" width=\"900\" height=\"481\"><\/p>\n<p>After you choose <strong>Launch<\/strong>, you\u2019re asked to select an existing key pair or create a new one.<\/p>\n<ol start=\"15\">\n<li>For <strong>Key pair name<\/strong>, select a key pair.<\/li>\n<\/ol>\n<p>If you don\u2019t have a key pair, choose <strong>Download Key Pair<\/strong>.<\/p>\n<ol start=\"16\">\n<li>Choose <strong>Launch
Instances<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15439\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/8-Select-an-existing-key-pair.jpg\" alt=\"\" width=\"900\" height=\"668\"><\/p>\n<p>If you see a green banner message, you have launched the instances successfully.<\/p>\n<ol start=\"17\">\n<li>Choose <strong>View Instances<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15440\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/9-Launch-Status.jpg\" alt=\"\" width=\"900\" height=\"383\"><\/p>\n<ol start=\"18\">\n<li>Search for <code>Horovod_MXNet<\/code> to see the four instances you created.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15441\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/10-Search-horovod_.jpg\" alt=\"\" width=\"900\" height=\"597\"><\/p>\n<p>We need to do one more step in our cluster setup.
All the instances should be able to communicate with each other, so we have to add our security group ID to all the instances\u2019 inbound rules.<\/p>\n<ol start=\"19\">\n<li>Select one of the four instances you created.<\/li>\n<li>On the <strong>Description<\/strong> tab, choose <strong>Security groups<\/strong> (for this post, <strong>launch-wizard-144<\/strong>).<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15442\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/11-Description.jpg\" alt=\"\" width=\"900\" height=\"239\"><\/p>\n<ol start=\"21\">\n<li>On the <strong>Inbound<\/strong> tab, copy the security group ID (<code>sg-00e9376c8f3cab57f<\/code>).<\/li>\n<li>Choose <strong>Edit inbound rules<\/strong>.<\/li>\n<li>Choose <strong>Add rule<\/strong>.<\/li>\n<li>Select <strong>All traffic<\/strong> and <strong>SSH<\/strong>.<\/li>\n<li>Choose <strong>Save rules<\/strong>.<\/li>\n<\/ol>\n<p>You can now see your inbound rules listed.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15443\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/12-Inbound-rules-listed.jpg\" alt=\"\" width=\"900\" height=\"205\"><\/p>\n<ol start=\"26\">\n<li>Repeat the process to add the security group to the inbound rules of all the instances so they can communicate with each other.<\/li>\n<\/ol>\n<p>You are now done with setting up your cluster.<\/p>\n<h3>Horovod with MXNet training on a multi-node cluster<\/h3>\n<p>For Horovod with MXNet training on a multi-node cluster, complete the following steps:<\/p>\n<ol>\n<li>Copy your PEM key from your local machine to one of the EC2 instances (primary node):<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\"># For Ubuntu user\r\nscp -i &lt;your_pem_key_path&gt; &lt;your_pem_key_path&gt; ubuntu@&lt;IPv4_Public_IP&gt;:\/home\/ubuntu\/\r\n<\/code><\/pre>\n<\/div>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\"># For Amazon Linux user\r\nscp -i &lt;your_pem_key_path&gt; &lt;your_pem_key_path&gt; ec2-user@&lt;IPv4_Public_IP&gt;:\/home\/ec2-user\/<\/code><\/pre>\n<\/div>\n<\/div>\n<ol start=\"2\">\n<li>SSH into your primary node:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\"># For Ubuntu user\r\n$ ssh -i &lt;your_pem_key&gt; ubuntu@&lt;IPv4_Public_IP&gt;\r\n<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"hide-language\">\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\"># For Amazon Linux user\r\n$ ssh -i &lt;your_pem_key&gt; ec2-user@&lt;IPv4_Public_IP&gt;<\/code><\/pre>\n<\/div>\n<\/div>\n<ol start=\"3\">\n<li>Enable passwordless SSH between the EC2 instances so that you don\u2019t have to provide the PEM file on every connection.
Run the following command on your primary node:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">eval `ssh-agent`\r\nssh-add &lt;your_pem_key&gt;\r\n<\/code><\/pre>\n<\/div>\n<\/div>\n<ol start=\"4\">\n<li>When you SSH or connect for the first time from one EC2 instance to another, you see the following message:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">$ ssh &lt;another_ec2_ipv4_address&gt;\r\nThe authenticity of host 'xxx.xx.xx.xx' can't be established.\r\nECDSA key fingerprint is SHA256:xxxaaabbbbccc.\r\nAre you sure you want to continue connecting (yes\/no)?\r\n\r\n# Make sure you can SSH from one EC2 instance to another without this authenticity prompt;\r\n# otherwise, Horovod won't be able to communicate with the other machines\r\n\r\n# SOLUTION:\r\n# Open the file \"\/etc\/ssh\/ssh_config\" and add these lines at the end\r\nHost *\r\n   StrictHostKeyChecking no\r\n   UserKnownHostsFile=\/dev\/null\r\n   <\/code><\/pre>\n<\/div>\n<ol start=\"5\">\n<li>Activate the conda environment:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\"># If using Python 3.6\r\n$ source activate mxnet_p36\r\n<\/code><\/pre>\n<\/div>\n<ol start=\"6\">\n<li>As an optional step, confirm Horovod is using MXNet on the backend by running the following command (as of this writing, the Horovod version is 0.19.5):<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">$ horovodrun -cb\r\n\r\n# Output\r\nHorovod v0.19.5:\r\n\r\nAvailable Frameworks:\r\n    [ ] TensorFlow\r\n    [ ] PyTorch\r\n    [X] MXNet\r\n\r\nAvailable Controllers:\r\n    [X] MPI\r\n    [X] Gloo\r\n\r\nAvailable Tensor Operations:\r\n    [X] NCCL\r\n    [ ] DDL\r\n    [ ] CCL\r\n    [X] MPI\r\n    [X] Gloo\r\n<\/code><\/pre>\n<\/div>\n<ol start=\"7\">\n<li>We
have provided a sample MNIST example for you to run the Horovod training.<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">$ horovodrun -np 4 python examples\/horovod\/mxnet\/train_mxnet_hvd_mnist.py\r\n\r\n# Output\r\n[1,0]&lt;stderr&gt;:INFO:root:Namespace(batch_size=64, dtype='float32', epochs=5, lr=0.002, momentum=0.9, no_cuda=True)\r\n[1,1]&lt;stderr&gt;:INFO:root:Namespace(batch_size=64, dtype='float32', epochs=5, lr=0.002, momentum=0.9, no_cuda=True)\r\n[1,1]&lt;stderr&gt;:INFO:root:downloaded http:\/\/data.mxnet.io\/mxnet\/data\/mnist.zip into data-1\/mnist.zip successfully\r\n[1,1]&lt;stderr&gt;:[04:29:14] src\/io\/iter_mnist.cc:113: MNISTIter: load 30000 images, shuffle=1, shape=[64,1,28,28]\r\n# ....... &lt;output truncated&gt; ...........\r\n[1,0]&lt;stderr&gt;:INFO:root:Epoch[4]    Train: accuracy=0.987647    Validation: accuracy=0.986178\r\n[1,0]&lt;stderr&gt;:INFO:root:Training finished with Validation Accuracy of 0.986178\r\n[1,1]&lt;stderr&gt;:INFO:root:Training finished with Validation Accuracy of 0.986178\r\n<\/code><\/pre>\n<\/div>\n<ol start=\"8\">\n<li>Don\u2019t forget to <a href=\"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/Stop_Start.html\" target=\"_blank\" rel=\"noopener noreferrer\">stop<\/a> or <a href=\"https:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/terminating-instances.html\" target=\"_blank\" rel=\"noopener noreferrer\">terminate<\/a> the instances when you no longer need them.<\/li>\n<\/ol>\n<p>For more information about the <code>horovodrun<\/code> command, see the <a href=\"https:\/\/github.com\/horovod\/horovod\/blob\/master\/docs\/running.rst\" target=\"_blank\" rel=\"noopener noreferrer\">Horovod documentation<\/a>.<\/p>\n<p>The preceding code shows how to run the Horovod training script on a multi-node EC2 cluster.
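<\/p>\n<p>To launch the same script across all four nodes at once, <code>horovodrun<\/code> accepts a comma-separated host list with a slot (GPU) count per node. The following is a sketch for the four p3.8xlarge instances created earlier, where the IP addresses are placeholders for your nodes\u2019 private IPs:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\"># 4 nodes x 4 GPUs each = 16 worker processes\r\n$ horovodrun -np 16 -H &lt;node1_ip&gt;:4,&lt;node2_ip&gt;:4,&lt;node3_ip&gt;:4,&lt;node4_ip&gt;:4 python examples\/horovod\/mxnet\/train_mxnet_hvd_mnist.py<\/code><\/pre>\n<\/div>\n<p>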
You can find the Horovod MXNet example script on the Horovod <a href=\"https:\/\/github.com\/horovod\/horovod\/tree\/master\/examples\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a>. Additionally, you can bring your own training script that\u2019s compatible with Horovod and MXNet and train the model on a single-node or multi-node cluster. For a performance comparison between Horovod and Parameter Server, see <a href=\"https:\/\/medium.com\/apache-mxnet\/distributed-training-using-apache-mxnet-with-horovod-44f98bf0e7b7\" target=\"_blank\" rel=\"noopener noreferrer\">this blog post<\/a>, which illustrates the difference as ResNet50 scales from 1 to 64 GPUs.<\/p>\n<p>When using Horovod, keep the following in mind:<\/p>\n<ul>\n<li>All your instances must be the same type<\/li>\n<li>All your instances must have the same environment<\/li>\n<li>The data should be stored in the same location across nodes<\/li>\n<li>The training script should be in the same location across nodes<\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>In this post, we demonstrated how to run distributed training using Horovod and MXNet on Amazon EC2 and Amazon EKS using AWS DL Containers and AWS DLAMIs. Using Horovod, your Apache MXNet models can be distributed across a cluster of instances, providing a significant increase in performance with only minimal changes to your training script.<\/p>\n<p>For more information about deep learning and MXNet, see the <a href=\"https:\/\/mxnet.apache.org\/versions\/1.6\/api\/python\/docs\/tutorials\/getting-started\/crash-course\/index.html\" target=\"_blank\" rel=\"noopener noreferrer\">MXNet crash course<\/a> and <a href=\"https:\/\/d2l.ai\/\" target=\"_blank\" rel=\"noopener noreferrer\">Dive into Deep Learning<\/a> book.
You can also get started on the <a href=\"https:\/\/mxnet.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">MXNet website<\/a> and in the MXNet GitHub <a href=\"https:\/\/github.com\/apache\/incubator-mxnet\/tree\/master\/example\" target=\"_blank\" rel=\"noopener noreferrer\">examples directory<\/a>.<\/p>\n<p>If you are new to distributed training, we highly recommend reading the paper <a href=\"https:\/\/arxiv.org\/pdf\/1802.05799.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Horovod: fast and easy distributed deep learning in TensorFlow<\/a>. You can also <a href=\"https:\/\/github.com\/horovod\/horovod#install\" target=\"_blank\" rel=\"noopener noreferrer\">install Horovod<\/a>, build Horovod with MXNet, and follow the <a href=\"https:\/\/github.com\/horovod\/horovod\/blob\/master\/examples\/mxnet_mnist.py\" target=\"_blank\" rel=\"noopener noreferrer\">MNIST<\/a> or <a href=\"https:\/\/github.com\/horovod\/horovod\/blob\/master\/examples\/mxnet_imagenet_resnet50.py\" target=\"_blank\" rel=\"noopener noreferrer\">ImageNet<\/a> use case. You can find more Horovod MXNet examples in the <a href=\"https:\/\/github.com\/dmlc\/gluon-cv\/tree\/master\/scripts\" target=\"_blank\" rel=\"noopener noreferrer\">GluonCV<\/a> and <a href=\"https:\/\/github.com\/dmlc\/gluon-nlp\/tree\/master\/scripts\" target=\"_blank\" rel=\"noopener noreferrer\">GluonNLP<\/a> example scripts on GitHub.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-15163 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/26\/Chaitanya-Bapat-1.jpg\" alt=\"\" width=\"100\" height=\"136\">Chaitanya Bapat<\/strong> is a Software Engineer with the AWS Deep Learning team. He works on Apache MXNet and integrating the framework with SageMaker, DLC, and DLAMI.
In his spare time, he loves watching sports and enjoys reading books and learning Spanish.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-15164 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/08\/26\/Karan-Jariwala.jpg\" alt=\"\" width=\"100\" height=\"134\">Karan Jariwala <\/strong>is a Software Development Engineer on the AWS Deep Learning team. His work focuses on training deep neural networks. Outside of work, he enjoys hiking, swimming, and playing tennis.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/horovod-mxnet-distributed-training\/<\/p>\n","protected":false},"author":0,"featured_media":168,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/167"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=167"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/167\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/168"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=167"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=167"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistributi
on.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=167"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}