{"id":674,"date":"2020-12-10T14:55:44","date_gmt":"2020-12-10T14:55:44","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/12\/10\/identify-bottlenecks-improve-resource-utilization-and-reduce-ml-training-costs-with-the-deep-profiling-feature-in-amazon-sagemaker-debugger\/"},"modified":"2020-12-10T14:55:44","modified_gmt":"2020-12-10T14:55:44","slug":"identify-bottlenecks-improve-resource-utilization-and-reduce-ml-training-costs-with-the-deep-profiling-feature-in-amazon-sagemaker-debugger","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/12\/10\/identify-bottlenecks-improve-resource-utilization-and-reduce-ml-training-costs-with-the-deep-profiling-feature-in-amazon-sagemaker-debugger\/","title":{"rendered":"Identify bottlenecks, improve resource utilization, and reduce ML training costs with the deep profiling feature in Amazon SageMaker Debugger"},"content":{"rendered":"<div id=\"\">\n<p>Machine learning (ML) has shown great promise across domains such as predictive analysis, speech processing, image recognition, recommendation systems, bioinformatics, and more. Training ML models is a time- and compute-intensive process, requiring multiple training runs with different hyperparameters before a model yields acceptable accuracy. CPU- and GPU-based distributed training with frameworks such as Horovod and Parameter Servers addresses this issue by allowing training to be easily scalable to a cluster of resources. However, distributed training makes it harder to identify and debug resource bottlenecks. Gaining insight into the training in progress, both at the ML framework level and the underlying compute resources level, is a critical step towards understanding resource usage patterns and reducing resource wastage. 
Analyzing bottleneck issues is necessary to maximize the utilization of compute resources and optimize model training performance to deliver state-of-the-art ML models with target accuracy.<\/p>\n<p><a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at scale. <a href=\"https:\/\/aws.amazon.com\/sagemaker\/debugger\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Debugger<\/a> is a feature of SageMaker training that makes it easy to train ML models faster by capturing real-time metrics such as learning gradients and weights. This provides transparency into the training process, so you can correct anomalies such as losses, overfitting, and overtraining. Debugger provides built-in rules to easily analyze emitted data, including tensors that are critical for the success of training jobs.<\/p>\n<p>With the newly introduced profiling capability, Debugger now automatically monitors system resources such as CPU, GPU, network, I\/O, and memory, providing a complete resource utilization view of training jobs. You can also profile your entire training job or portions thereof to emit detailed framework metrics during different phases of the training job. Framework metrics are metrics that are captured from within the training script, such as step duration, data loading, preprocessing, and operator runtime on CPU and GPU.<\/p>\n<p>Debugger correlates system and framework metrics, which helps you identify possible root causes. For example, if utilization on GPU drops to zero, you can inspect what has been happening within the training script at this particular time. You can right-size resources and quickly identify bottlenecks and fix them using insights from the profiler.<\/p>\n<p>You can re-allocate resources based on recommendations from the profiling capability. 
Metrics and insights are captured and monitored programmatically using the SageMaker Python SDK or visually through <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/studio.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Studio<\/a>.<\/p>\n<p>In this post, we demonstrate Debugger profiling capabilities using a TensorFlow-based sentiment analysis use case. In the <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/tree\/master\/sagemaker-debugger\/tensorflow_nlp_sentiment_analysis\/sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">notebook<\/a> included in this post, we set up a Convolutional Neural Network (CNN) using TensorFlow script mode on SageMaker. For our dataset, we use the IMDB dataset, which consists of movie reviews labeled as positive or negative sentiment. We use Debugger to showcase how to gain visibility into the system resource utilization of the training instances, profile framework metrics, and identify training resources that are underutilized because of resource bottlenecks. We further demonstrate how to improve resource utilization after implementing the recommendations from Debugger.<\/p>\n<h2>Walkthrough overview<\/h2>\n<p>The remainder of this post details how to use the Debugger profiler capability to gain visibility into ML training jobs and analyze the profiler recommendations. The notebook includes details of using TensorFlow Horovod distributed training, where the profiling capability enabled us to improve resource utilization by up to 36%. 
The first training run was on <strong>three<\/strong> <strong>p3.8xlarge<\/strong> instances for 503 seconds, and the second training run after implementing the profiler recommendations took 502 seconds on <strong>two p3.2xlarge<\/strong> instances, resulting in <strong>83% cost savings.<\/strong> Profiler analysis of the second training run provided additional recommendations highlighting the possibility of further cost savings and better resource utilization.<\/p>\n<p>The walkthrough includes the following high-level steps:<\/p>\n<ol>\n<li>Train a TensorFlow sentiment analysis CNN model using SageMaker distributed training with a custom profiler configuration.<\/li>\n<li>Visualize the system and framework metrics generated to analyze the profiler data.<\/li>\n<li>Access Debugger Insights in Studio.<\/li>\n<li>Analyze the profiler report generated by Debugger.<\/li>\n<li>Analyze and implement recommendations from the profiler report.<\/li>\n<\/ol>\n<p>Additional steps such as importing the necessary libraries and examining the dataset are included in the notebook. Review the <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/tree\/master\/sagemaker-debugger\/tensorflow_nlp_sentiment_analysis\/sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">notebook<\/a> for complete details.<\/p>\n<h2>Training a CNN model using SageMaker distributed training with custom profiler configuration<\/h2>\n<p>In this step, you train the sentiment analysis model using the TensorFlow estimator with the profiler enabled.<\/p>\n<p>First, ensure that the Debugger libraries are imported. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># import debugger libraries\r\nfrom sagemaker.debugger import ProfilerConfig, DebuggerHookConfig, Rule, ProfilerRule, rule_configs, FrameworkProfile<\/code><\/pre>\n<\/div>\n<p>Next, set up Horovod distribution for TensorFlow distributed training. 
Horovod is a distributed deep learning training framework for TensorFlow, Keras, and PyTorch. The objective is to take a single-GPU training script and scale it to train across many GPUs in parallel. After a training script has been written for scale with Horovod, it can run on a single GPU, multiple GPUs, or even multiple hosts without any further code changes. In addition to being easy to use, Horovod is fast. For more information, see the <a href=\"https:\/\/github.com\/horovod\/horovod\" target=\"_blank\" rel=\"noopener noreferrer\">Horovod GitHub page<\/a>.<\/p>\n<p>We can set up hyperparameters such as the number of epochs, batch size, and data augmentation:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">hyperparameters = {'epoch': 25, \r\n                   'batch_size': 256,\r\n                   'data_augmentation': True}<\/code><\/pre>\n<\/div>\n<p>Changing these hyperparameters might impact the resource utilization of your training job.<\/p>\n<p>For our training, we start off using three p3.8xlarge instances and change our training configuration based on profiling recommendations from Debugger:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">distributions = {\r\n                    \"mpi\": {\r\n                        \"enabled\": True,\r\n                        \"processes_per_host\": 3,\r\n                        \"custom_mpi_options\": \"-verbose -x HOROVOD_TIMELINE=.\/hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none\",\r\n                    }\r\n                }\r\n\r\nmodel_dir = '\/opt\/ml\/model'\r\ntrain_instance_type='ml.p3.8xlarge'\r\ninstance_count = 3<\/code><\/pre>\n<\/div>\n<p>The p3.8xlarge instance comes with 4 GPUs and 32 vCPU cores with 10 Gbps networking performance. For more information, see <a href=\"https:\/\/aws.amazon.com\/ec2\/instance-types\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EC2 Instance Types<\/a>. 
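<\/p>\n<p>Before launching, it is worth sanity-checking that the MPI worker count matches the GPUs in the cluster; Horovod typically runs one process per GPU, so a <code>processes_per_host<\/code> of 3 on a 4-GPU instance leaves one GPU idle per host. The following standalone sketch (plain Python, not part of the SageMaker SDK; the GPUs-per-instance mapping is hard-coded here as an assumption) illustrates the check:<\/p>\n

```python
# Illustrative sanity check: compare the Horovod worker count to the available GPUs.
# The GPUs-per-instance mapping below is an assumption for this sketch, not an AWS API.
GPUS_PER_INSTANCE = {"ml.p3.2xlarge": 1, "ml.p3.8xlarge": 4, "ml.p3.16xlarge": 8}

def check_cluster(instance_type, instance_count, processes_per_host):
    """Compare total MPI workers to total GPUs in the training cluster."""
    total_gpus = GPUS_PER_INSTANCE[instance_type] * instance_count
    total_workers = processes_per_host * instance_count
    if total_workers < total_gpus:
        return f"undersubscribed: {total_workers} workers for {total_gpus} GPUs"
    if total_workers > total_gpus:
        return f"oversubscribed: {total_workers} workers for {total_gpus} GPUs"
    return f"balanced: one worker per GPU ({total_gpus})"

print(check_cluster("ml.p3.8xlarge", 3, 3))  # undersubscribed: 9 workers for 12 GPUs
```

\n<p>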
Take your AWS account limits into consideration when setting up the <code>instance_type<\/code> and <code>instance_count<\/code> of the cluster.<\/p>\n<p>Then we define the profiler configuration. With the following <code>profiler_config<\/code> parameter configuration, Debugger uses the default settings for monitoring and profiling. Debugger monitors system metrics every 500 milliseconds. You can specify additional details on when to start profiling and how long to run it, and you can set different profiling settings to profile target steps and target time intervals in detail.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">profiler_config = ProfilerConfig(\r\n    system_monitor_interval_millis=500,\r\n    framework_profile_params=FrameworkProfile(start_step=2, num_steps=7)\r\n)<\/code><\/pre>\n<\/div>\n<p>For a complete list of parameters, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/train-debugger.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Debugger<\/a>.<\/p>\n<p>Then we configure a training job using the TensorFlow estimator and pass in the profiler configuration. 
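<\/p>\n<p>Step-based profiling (as above) is one option; <code>FrameworkProfile<\/code> also accepts time-based windows. The following fragment is a hedged sketch based on the documented <code>start_unix_time<\/code> and <code>duration<\/code> parameters; it is not used in this walkthrough:<\/p>\n

```python
# Sketch (assumes the sagemaker SDK is available): profile a 10-minute window
# starting 5 minutes from now, instead of targeting specific steps.
import time
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        start_unix_time=int(time.time()) + 5 * 60,  # begin profiling in 5 minutes
        duration=10 * 60,                           # profile for 10 minutes
    ),
)
```

\n<p>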
For <code>framework_version<\/code> and <code>py_version<\/code>, specify the TensorFlow framework version and supported Python version, respectively:<\/p>\n<div class=\"hide-language\">\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">estimator = TensorFlow(\r\n    role=sagemaker.get_execution_role(),\r\n    base_job_name='tf-keras-silent',\r\n    image_uri=f\"763104351884.dkr.ecr.{region}.amazonaws.com\/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04\",\r\n    model_dir=model_dir,\r\n    instance_count=instance_count,\r\n    instance_type=train_instance_type,\r\n    entry_point='sentiment-distributed.py',\r\n    source_dir='.\/tf-sentiment-script-mode',\r\n    profiler_config=profiler_config,\r\n    script_mode=True,\r\n    hyperparameters=hyperparameters,\r\n    distribution=distributions\r\n)<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"hide-language\">\n<p>For a complete list of the supported framework versions and the corresponding Python versions to use, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/train-debugger.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Debugger<\/a>.<\/p>\n<\/div>\n<p>Finally, start the training job:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">estimator.fit(inputs, wait=False)<\/code><\/pre>\n<\/div>\n<h2>Visualizing the system and framework metrics generated<\/h2>\n<p>Now that our training job is running, we can perform interactive analysis of the data captured by Debugger. The analysis is organized in order of training phases: initialization, training, and finalization. The profiling data results are categorized as system metrics and algorithm (framework) metrics. After the training job initiates, Debugger starts collecting system and framework metrics. 
The <code>smdebug<\/code> library provides profiler analysis tools that enable you to access and analyze the profiling data.<\/p>\n<p>First, we collect the system and framework metrics using the <code>S3SystemMetricsReader<\/code> library:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from smdebug.profiler.system_metrics_reader import S3SystemMetricsReader\r\nimport time\r\n\r\npath = estimator.latest_job_profiler_artifacts_path()\r\nsystem_metrics_reader = S3SystemMetricsReader(path)<\/code><\/pre>\n<\/div>\n<p>Check if we have metrics available for analysis:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">while system_metrics_reader.get_timestamp_of_latest_available_file() == 0:\r\n    system_metrics_reader.refresh_event_file_list()\r\n    client = sagemaker_client.describe_training_job(\r\n        TrainingJobName=training_job_name\r\n    )\r\n    if 'TrainingJobStatus' in client:\r\n        training_job_status = f\"TrainingJobStatus: {client['TrainingJobStatus']}\"\r\n    if 'SecondaryStatus' in client:\r\n        training_job_secondary_status = f\"TrainingJobSecondaryStatus: {client['SecondaryStatus']}\"<\/code><\/pre>\n<\/div>\n<p>When the data is available, we can query and inspect it:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">system_metrics_reader.refresh_event_file_list()\r\nlast_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()\r\nevents = system_metrics_reader.get_events(0, last_timestamp)<\/code><\/pre>\n<\/div>\n<p>Along with the <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/tree\/master\/sagemaker-debugger\/tensorflow_nlp_sentiment_analysis\/sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">notebook<\/a>, the <code>smdebug<\/code> SDK contains several utility classes that can be used for visualizations. 
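<\/p>\n<p>Each event returned by <code>get_events<\/code> carries a metric name and a measured value, so you can also aggregate them directly before reaching for the plotting utilities. The sketch below is standalone plain Python over mock events; the <code>name<\/code>\/<code>value<\/code> fields mirror the shape of smdebug system-metric events, but the event class here is hypothetical:<\/p>\n

```python
# Standalone sketch: summarize utilization events per metric name.
# "Event" is a stand-in for smdebug's system metric events (illustrative only).
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Event:
    name: str      # e.g. "cpu", "gpu"
    value: float   # utilization percentage at one sample point

def summarize(events):
    """Return {metric_name: (min, max, mean)} over all sampled values."""
    grouped = defaultdict(list)
    for ev in events:
        grouped[ev.name].append(ev.value)
    return {name: (min(vals), max(vals), sum(vals) / len(vals))
            for name, vals in grouped.items()}

sample = [Event("gpu", 0.0), Event("gpu", 20.0), Event("cpu", 90.0), Event("cpu", 100.0)]
print(summarize(sample))  # {'gpu': (0.0, 20.0, 10.0), 'cpu': (90.0, 100.0, 95.0)}
```

\n<p>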
From the data collected, you can visualize the CPU and GPU utilization values as a histogram using the utility class <code>MetricsHistogram<\/code>. <code>MetricsHistogram<\/code> computes a histogram on GPU and CPU utilization values. Bins are between 0\u2013100. Good system utilization means that the center of the distribution should be between 80\u201390. In case of multi-GPU training, if distributions of GPU utilization values aren\u2019t similar, it indicates an issue with workload distribution.<\/p>\n<p>The following code plots the histograms per metric. To only plot specific metrics, define the lists <code>select_dimensions<\/code> and <code>select_events<\/code>. A dimension can be <code>CPUUtilization<\/code>, <code>GPUUtilization<\/code>, <code>GPUMemoryUtilization<\/code>, or <code>IOPS<\/code>. If no event is specified, then for the CPU utilization, a histogram for each single core and total CPU usage is plotted.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram\r\n\r\nsystem_metrics_reader.refresh_event_file_list()\r\nmetrics_histogram = MetricsHistogram(system_metrics_reader)\r\nmetrics_histogram.plot(starttime=0,\r\n                       endtime=system_metrics_reader.get_timestamp_of_latest_available_file(),\r\n                       select_dimensions=[\"CPU\", \"GPU\", \"I\/O\"],\r\n                       select_events=[\"total\"])<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows our histograms.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19832 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/util-function.jpg\" alt=\"\" width=\"800\" height=\"600\"><\/p>\n<p>Similar to system metrics, let\u2019s retrieve all the events emitted from the framework or algorithm metrics using the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from smdebug.profiler.algorithm_metrics_reader import S3AlgorithmMetricsReader\r\n\r\nframework_metrics_reader = S3AlgorithmMetricsReader(path)\r\n\r\nevents = []\r\nwhile framework_metrics_reader.get_timestamp_of_latest_available_file() == 0 or len(events) == 0:\r\n    framework_metrics_reader.refresh_event_file_list()\r\n    last_timestamp = framework_metrics_reader.get_timestamp_of_latest_available_file()\r\n    events = framework_metrics_reader.get_events(0, last_timestamp)\r\n\r\nframework_metrics_reader.refresh_event_file_list()\r\nlast_timestamp = framework_metrics_reader.get_timestamp_of_latest_available_file()\r\nevents = framework_metrics_reader.get_events(0, last_timestamp)<\/code><\/pre>\n<\/div>\n<p>We can inspect one of the recorded events to get the following:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">print(\"Event name:\", events[0].event_name, \r\n      \"\\nStart time:\", timestamp_to_utc(events[0].start_time\/1000000000), \r\n      \"\\nEnd time:\", timestamp_to_utc(events[0].end_time\/1000000000), \r\n      \"\\nDuration:\", events[0].duration, \"nanosecond\")\r\n\r\n# output:\r\n# Event name: Step:ModeKeys.TRAIN \r\n# Start time: 2020-12-04 22:44:14 \r\n# End time: 2020-12-04 22:44:25 \r\n# Duration: 10966842000 nanosecond<\/code><\/pre>\n<\/div>\n<p>For more information about system and framework metrics, see the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/debugger-configure-framework-profiling.html\" target=\"_blank\" rel=\"noopener noreferrer\">documentation<\/a>.<\/p>\n<p>Next, we use the <code>StepHistogram<\/code> utility class to create a histogram of step duration values. Significant outliers in step durations are an indication of a bottleneck. 
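<\/p>\n<p>What counts as a significant outlier can be made concrete with a small standalone check (plain Python, not the smdebug implementation): flag any step whose duration is far above the median.<\/p>\n

```python
# Standalone sketch: flag step durations well above the median (illustrative,
# not the smdebug rule implementation).
def find_outlier_steps(durations, factor=2.0):
    """Return indices of steps taking more than `factor` times the median duration."""
    ordered = sorted(durations)
    n = len(ordered)
    median = ordered[n // 2] if n % 2 else (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    return [i for i, d in enumerate(durations) if d > factor * median]

steps = [1.0, 1.1, 0.9, 1.0, 5.2, 1.0, 1.1]  # seconds per step
print(find_outlier_steps(steps))  # [4]
```

\n<p>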
The histogram makes it easy to identify clusters of step duration values.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram\r\n\r\nframework_metrics_reader.refresh_event_file_list()\r\nstep_histogram = StepHistogram(framework_metrics_reader)\r\nstep_histogram.plot()\r\n<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows our visualization.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19624\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-2.jpg\" alt=\"The following screenshot shows our visualization.\" width=\"756\" height=\"590\"><\/p>\n<p>For an alternative view of CPU and GPU utilization, the following code creates a heat map where each row corresponds to one metric (CPU core and GPU utilizations) and the x-axis is the duration of the training job. It allows you to more easily spot CPU bottlenecks, for example, when utilization on GPU is low but utilization of one or more CPU cores is high.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap\r\n\r\nview_heatmap = Heatmap(\r\n    system_metrics_reader,\r\n    framework_metrics_reader,\r\n    select_dimensions=[\"CPU\", \"GPU\", \"I\/O\"], # optional\r\n    select_events=[\"total\"],                 # optional\r\n    plot_height=450\r\n)<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows the heat map of a training job that has been using 4 GPUs and 32 CPU cores. The first few rows show the GPUs\u2019 utilization, and the remaining rows show the utilization on CPU cores. Yellow indicates maximum utilization, and purple means that utilization was 0. GPUs have frequent stalled cycles where utilization drops to 0, whereas at the same time, utilization on CPU cores is at a maximum. 
This is a clear indication of a CPU bottleneck, where the GPUs are waiting for data to arrive. Such a bottleneck can be caused by compute-heavy preprocessing.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19830 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/09\/HeatMap.jpg\" alt=\"\" width=\"800\" height=\"409\"><\/p>\n<h2>Accessing Debugger Insights in Studio<\/h2>\n<p>You can also use Studio to perform training with our existing <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/tree\/master\/sagemaker-debugger\/tensorflow_nlp_sentiment_analysis\/sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">notebook<\/a>. Studio provides built-in visualizations to analyze profiling insights. Alternatively, you can move to the next section in this post to directly analyze the profiler report generated.<\/p>\n<p>If you trained in a SageMaker notebook instance, you can still find the Debugger insights for that training in Studio if the training happened in the same Region.<\/p>\n<ol>\n<li>On the navigation pane, choose <strong>Components and registries<\/strong>.<\/li>\n<li>Choose <strong>Experiments and trials<\/strong>.<\/li>\n<li>Choose your training job (right-click).<\/li>\n<li>Choose <strong>Debugger Insights<\/strong>.<\/li>\n<\/ol>\n<p>For more information about setting up Studio, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/gs-set-up.html\" target=\"_blank\" rel=\"noopener noreferrer\">Set up Amazon SageMaker<\/a>.<\/p>\n<h3>Reviewing Debugger reports<\/h3>\n<p>After you have set up and run this <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/tree\/master\/sagemaker-debugger\/tensorflow_nlp_sentiment_analysis\/sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">notebook<\/a> in Studio, you can access Debugger Insights.<\/p>\n<ol>\n<li>On the navigation pane, choose <strong>Components and registries<\/strong>.<\/li>\n<li>Choose <strong>Experiments and trials<\/strong>.<\/li>\n<li>Choose your training job (right-click).<\/li>\n<li>Choose <strong>View Debugger for insights<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19626 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-4.jpg\" alt=\"After you have set up and run this notebook in Studio, you can access Debugger Insights.\" width=\"437\" height=\"637\"><\/p>\n<p>A Debugger tab opens for this training job. For more information, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/debugger-on-studio-insights.html\" target=\"_blank\" rel=\"noopener noreferrer\">Debugger Insights<\/a>.<\/p>\n<h3>Training job summary<\/h3>\n<p>This section of the report shows details of the training job, such as the start time, end time, duration, and time spent in individual phases of the training. The pie chart visualization of these delays shows the time spent in initialization, training, and finalization phases relative to each other.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19627 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-5.jpg\" alt=\"This section of the report shows details of the training job, such as the start time, end time, duration, and time spent in individual phases of the training.\" width=\"800\" height=\"440\"><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19628 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-6.jpg\" alt=\"The pie chart visualization of these delays shows the time spent in initialization, training, and finalization phases relative to each other.\" width=\"800\" height=\"213\"><\/p>\n<h3>System usage statistics<\/h3>\n<p>This portion of the report gives detailed system usage statistics for each training instance involved in training, along with analysis and suggestions for improvements. The following text is an excerpt from the report, with key issues highlighted:<\/p>\n<p>The 95th quantile of the total GPU utilization on node algo-1 is only 13%. The 95th quantile of the total CPU utilization is only 24%. Node algo-1 is under-utilized. You may want to consider switching to a smaller instance type. The 95th quantile of the total GPU utilization on node algo-2 is only 13%. The 95th quantile of the total CPU utilization is only 24%. Node algo-2 is under-utilized. You may want to consider switching to a smaller instance type. The 95th quantile of the total GPU utilization on node algo-3 is only 13%. The 95th quantile of the total CPU utilization is only 24%. Node algo-3 is under-utilized. You may want to consider switching to a smaller instance type.<\/p>\n<p>The following table shows usage statistics per worker node, such as total CPU and GPU utilization, and total CPU and memory footprint. The table also includes total I\/O wait time and total sent and received bytes. 
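<\/p>\n<p>The p95 figures quoted above come from simple order statistics. As a standalone refresher (plain Python; this is the textbook nearest-rank method, not necessarily the exact interpolation the report uses):<\/p>\n

```python
# Standalone sketch: nearest-rank percentile, used here to reproduce the kind of
# p95 under-utilization check quoted from the report (illustrative only).
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 100 GPU utilization samples: mostly idle, occasionally ~25% busy
gpu_util = [5.0] * 95 + [25.0] * 5
p95 = percentile(gpu_util, 95)
print(p95, "-> under-utilized" if p95 < 70 else "-> ok")
```

\n<p>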
The table shows minimum and maximum values as well as p99, p90, and p50 percentiles.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19629 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-7.jpg\" alt=\"The following table shows usage statistics per worker node, such as total CPU and GPU utilization, total CPU, and memory footprint.\" width=\"800\" height=\"483\"><\/p>\n<h3>Framework metrics summary<\/h3>\n<p>In this section, the following pie charts show the breakdown of framework operations on CPUs and GPUs.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19630 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-8.jpg\" alt=\"The following pie charts show the breakdown of framework operations on CPUs and GPUs. \" width=\"691\" height=\"541\"><\/p>\n<h3>Insights<\/h3>\n<p>Insights provides suggestions and additional details, such as the number of times each rule triggered, the rule parameters, and the default threshold values to evaluate your training job performance. According to the insights for our TensorFlow training job, profiler rules were run for three out of the eight insights. 
The following screenshot shows the insights.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19631 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-9.jpg\" alt=\"The following screenshot shows the insights.\" width=\"562\" height=\"591\"><\/p>\n<p>If you choose an insight, you can view the profiler recommendations.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19632\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-10.jpg\" alt=\"\" width=\"401\" height=\"644\"><\/p>\n<p>By default, we are showing the overview report, but you could choose <strong>Nodes<\/strong> to show the dashboard.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19633 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-11.jpg\" alt=\"We are showing the overview report, but you could choose Nodes to show the dashboard.\" width=\"784\" height=\"530\"><\/p>\n<p>You can expand each algorithm to get deep dive information such as CPU utilization, network utilization, and system metrics per algorithm used during training.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19652\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/Missing-1-4.jpg\" alt=\"You can expand each algorithm to get deep dive information such as CPU utilization, network utilization, and system metrics.\" width=\"800\" height=\"434\"><\/p>\n<p>Furthermore, you can scroll down to analyze GPU memory utilization over time and system utilization over time for each algorithm.<\/p>\n<h2>Analyzing the profiler report generated by Debugger<\/h2>\n<p>Download the profiler report by choosing <strong>Download report<\/strong>.<\/p>\n<p><img 
decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19635 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-13.jpg\" alt=\"Download the profiler report by choosing Download report.\" width=\"725\" height=\"265\"><\/p>\n<p>Alternatively, if you\u2019re not using Studio, you can download your report directly from <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) at <code>s3:\/\/&lt;your bucket&gt;\/tf-keras-sentiment-&lt;job id&gt;\/profiler-output\/<\/code>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19636 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-14.jpg\" alt=\"Alternatively, if you\u2019re not using Studio, you can download your report directly.\" width=\"800\" height=\"391\"><\/p>\n<p>Next, we review a few sections of the generated report. For additional details, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/debugger-report.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker Debugger report<\/a>. You can also use the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/debugger-analyze-data.html\" target=\"_blank\" rel=\"noopener noreferrer\">SMDebug client library<\/a> for performing data analysis.<\/p>\n<h3>Framework metrics summary<\/h3>\n<p>In this section of the report, you see a pie chart that shows the time the training job spent in the training phase, validation phase, or \u201cothers.\u201d \u201cOthers\u201d represents the accumulated time between steps; that is, the time between when a step has finished but the next step hasn\u2019t started. 
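<\/p>\n<p>How that \u201cothers\u201d share falls out of step timestamps can be shown with a small standalone calculation (plain Python; the interval data is made up for illustration):<\/p>\n

```python
# Standalone sketch: break a training job's wall-clock time into phase shares.
# Intervals are (phase, start_s, end_s); gaps between intervals count as "others".
def phase_breakdown(intervals):
    """Return the percentage of wall-clock time spent in each phase."""
    total = intervals[-1][2] - intervals[0][1]
    shares = {}
    for phase, start, end in intervals:
        shares[phase] = shares.get(phase, 0.0) + (end - start)
    shares["others"] = total - sum(shares.values())  # unaccounted gaps between steps
    return {phase: round(100 * secs / total, 1) for phase, secs in shares.items()}

intervals = [("TRAIN", 0, 70), ("EVAL", 80, 90)]  # 10s gap -> "others"
print(phase_breakdown(intervals))  # {'TRAIN': 77.8, 'EVAL': 11.1, 'others': 11.1}
```

\n<p>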
Ideally, most time should be spent in training steps.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19637 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-15.jpg\" alt='In this section of the report, you see a pie chart that shows the time the training job spent in the training phase, validation phase, or \"others.\u201d ' width=\"800\" height=\"447\"><\/p>\n<h3>Identifying the most expensive CPU operator<\/h3>\n<p>This section provides information of the CPU operators in detail. The table shows the percentage of the time and the absolute cumulative time spent on the most frequently called CPU operators.<\/p>\n<p>The following table shows a list of operators that your training job run on CPU. The most expensive operator on CPU was <code>ExecutorState::Process<\/code> with 16%.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19638\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-16.jpg\" alt=\"The following table shows a list of operators that your training job run on CPU.\" width=\"800\" height=\"259\"><\/p>\n<h3>Identifying the most expensive GPU operator<\/h3>\n<p>This section provides information of the GPU operators in detail. The table shows the percentage of the time and the absolute cumulative time spent on the most frequently called GPU operators.<strong>\u00a0<\/strong><\/p>\n<p>The following table shows a list of operators that your training job ran on GPU. The most expensive operator on GPU was <code>Adam<\/code> with 29%.<\/p>\n<p><strong> <img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19639\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-17.jpg\" alt=\"The following table shows a list of operators that your training job ran on GPU. 
\" width=\"721\" height=\"571\"><\/strong><\/p>\n<h3>Rules summary<\/h3>\n<p>In this section, Debugger aggregates all the rule evaluation results, analysis, rule descriptions, and suggestions. The following table shows a summary of the profiler rules that ran. The table is sorted by the rules that triggered most frequently. In the training job, this was the case for rule <code>LowGPUUtilization<\/code>. It processed 1,001 data points and was triggered eight times.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19719 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/profilertrain1.png\" alt=\"\" width=\"2640\" height=\"1376\"><\/p>\n<p>Because the rules were triggered for <code>LowGPUUtilization<\/code>, <code>BatchSize<\/code>, and <code>CPUBottleneck<\/code>, let\u2019s dive deep into each one to understand the profiler recommendations.<\/p>\n<h3>LowGPUUtilization<\/h3>\n<p>The <code>LowGPUUtilization<\/code> rule checks for low and fluctuating GPU usage. If usage is consistently low, it might be caused by bottlenecks, or the batch size or model might be too small. If usage is heavily fluctuating, it can be caused by bottlenecks or blocking calls.<\/p>\n<p>The rule computed the 95th and 5th quantiles of GPU utilization on 500 continuous data points and found eight cases where p95 was above 70% and p5 was below 10%. If p95 is high and p5 is low, it indicates that the usage is highly fluctuating. If both values are very low, it means that the machine is under-utilized. During initialization, utilization is likely 0, so the rule skipped the first 1,000 data points. The rule analyzed 1,001 data points and was triggered eight times. The rule also reports the time when it was last triggered.<\/p>\n<h3>BatchSize<\/h3>\n<p>The <code>BatchSize<\/code> rule helps detect if the GPU is under-utilized because of the batch size being too small. 
To detect this, the rule analyzes the GPU memory footprint and CPU and GPU utilization. The rule analyzed 1,000 data points and was triggered four times. Your training job is under-utilizing the instance. You may want to consider switching to a smaller instance type or increasing the batch size of your model training. The rule also reports the time when it was last triggered.<\/p>\n<p>The following boxplot is a snapshot from that timestamp, showing for each node the total CPU utilization and the utilization and memory usage per GPU.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19653\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/Missing-2.jpg\" alt=\"The following boxplot is a snapshot from this timestamp that shows for each node the total CPU utilization.\" width=\"800\" height=\"311\"><\/p>\n<h3>CPUBottleneck<\/h3>\n<p>The <code>CPUBottleneck<\/code> rule checks for periods when CPU utilization was above the <code>cpu_threshold<\/code> of 90% while GPU utilization was below the <code>gpu_threshold<\/code> of 10%. During initialization, utilization is likely 0, so the rule skipped the first 1,000 data points. With this configuration, the rule found 2,129 CPU bottlenecks, which is 70% of the total time. This is above the threshold of 50%. The rule analyzed 3,019 data points and was triggered four times.<\/p>\n<p>The following chart (left) shows how many data points were below the <code>gpu_threshold<\/code> of 10% and how many of those data points were likely caused by a CPU bottleneck. The rule found 3,000 out of 3,019 data points that had a GPU utilization below 10%. Out of those data points, 70.52% were likely caused by CPU bottlenecks. 
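To make the rule's logic concrete, here is a minimal sketch of that classification using the same 90%/10% thresholds; the utilization samples are made up for illustration:

```python
# Minimal sketch of the CPUBottleneck classification: a data point is a
# likely CPU bottleneck when GPU utilization is below gpu_threshold while
# CPU utilization is above cpu_threshold. Samples below are hypothetical.
cpu_util = [95, 97, 40, 92, 20, 96]  # percent
gpu_util = [5, 8, 3, 6, 80, 4]       # percent
CPU_THRESHOLD, GPU_THRESHOLD = 90, 10

low_gpu = [i for i, g in enumerate(gpu_util) if g < GPU_THRESHOLD]
bottlenecks = [i for i in low_gpu if cpu_util[i] > CPU_THRESHOLD]
bottleneck_share = len(bottlenecks) / len(low_gpu)  # fraction of low-GPU points
```

In this toy sample, 4 of the 5 low-GPU data points coincide with high CPU usage, analogous to the 70.52% share reported above.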
The second chart (right) shows whether CPU bottlenecks mainly happened during the train or validation phase.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19654\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-20.jpg\" alt=\"The following chart (left) shows how many data points were below the gpu_threshold of 10%.\" width=\"800\" height=\"231\"><\/p>\n<h2>Analyzing and implementing recommendations from the profiler report<\/h2>\n<p>Let\u2019s now analyze and implement the profiling recommendations for our training job to improve resource utilization and make our training efficient. First let\u2019s review the configuration of our training job and check the three rules that were triggered by Debugger during the training run.<\/p>\n<p>The following table summarizes the training job configuration.<\/p>\n<table border=\"1px\" cellpadding=\"5px\">\n<tbody>\n<tr>\n<td width=\"130\"><strong>Instance Type<\/strong><\/td>\n<td width=\"118\"><strong>Instance Count<\/strong><\/td>\n<td width=\"127\"><strong>Number of processes per host <\/strong><\/td>\n<td width=\"270\"><strong>Profiling Configuration<\/strong><\/td>\n<td width=\"117\"><strong>Number of Epochs<\/strong><\/td>\n<td width=\"100\"><strong>Batch Size<\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"130\">P3.8xlarge<\/td>\n<td width=\"118\">3<\/td>\n<td width=\"127\">3<\/td>\n<td width=\"270\">FrameworkProfile(start_step=2, num_steps=7), Monitoring Interval = 500 milliseconds<\/td>\n<td width=\"117\">25<\/td>\n<td width=\"100\">256<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The following table summarizes the Debugger profiling recommendations.<\/p>\n<table border=\"1px\" cellpadding=\"5px\">\n<tbody>\n<tr>\n<td width=\"225\"><strong>Rule Triggered<\/strong><\/td>\n<td width=\"225\"><strong>Reason<\/strong><\/td>\n<td width=\"225\"><strong>Recommendations<\/strong><\/td>\n<\/tr>\n<tr>\n<td 
width=\"225\">BatchSize<\/td>\n<td width=\"225\">Checks if the GPU is under-utilized because of the batch size being too small.<\/td>\n<td width=\"225\">Run on a smaller instance type or increase the batch size.<\/td>\n<\/tr>\n<tr>\n<td width=\"225\">LowGPUUtilization<\/td>\n<td width=\"225\">Checks if GPU utilization is low or suffers from fluctuations. This can happen if there are bottlenecks, many blocking calls due to synchronizations, or the batch size being too small.<\/td>\n<td width=\"225\">Check for bottlenecks, minimize blocking calls, change the distributed training strategy, or increase the batch size.<\/td>\n<\/tr>\n<tr>\n<td width=\"225\">\n<p>CPUBottleneck<\/p>\n<\/td>\n<td width=\"225\">Checks if CPU usage is high but GPU usage is low at the same time, which may indicate a CPU bottleneck where the GPU is waiting for data to arrive from the CPU.<\/td>\n<td width=\"225\">CPU bottlenecks can happen when data preprocessing is very compute-intensive. You should consider increasing the number of data-loader processes or applying pre-fetching.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Based on the recommendation to consider switching to a smaller instance type and to increase the batch size, we change the training configuration settings and rerun the training. In the notebook, the training instances are changed from p3.8xlarge to p3.2xlarge instances, the number of instances is reduced to two, and only one process per host for MPI is configured to increase the number of data loaders. 
The batch size is also increased to 512.<\/p>\n<p>The following table summarizes the revised training job configuration.<\/p>\n<table border=\"1px\" cellpadding=\"5px\">\n<tbody>\n<tr>\n<td width=\"130\"><strong>Instance Type<\/strong><\/td>\n<td width=\"118\"><strong>Instance Count<\/strong><\/td>\n<td width=\"127\"><strong>Number of processes per host <\/strong><\/td>\n<td width=\"270\"><strong>Profiling Configuration<\/strong><\/td>\n<td width=\"117\"><strong>Number of Epochs<\/strong><\/td>\n<td width=\"100\"><strong>Batch Size<\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"130\">P3.2xlarge<\/td>\n<td width=\"118\">2<\/td>\n<td width=\"127\">1<\/td>\n<td width=\"270\">FrameworkProfile(start_step=2, num_steps=7), Monitoring Interval = 500 milliseconds<\/td>\n<td width=\"117\">25<\/td>\n<td width=\"100\">512<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>After running the second training job with the new settings, a new report is generated, but with no rules triggered, indicating all the issues identified in the earlier run were resolved. Now let\u2019s compare the report analysis from the two training jobs and understand the impact of the configuration changes made.<\/p>\n<p>The training job summary shows that the training time was nearly identical: 502 seconds in the revised run compared to 503 seconds in the first run. The amount of time spent in the training loop for both jobs was also comparable at 45%.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19655 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-21.jpg\" alt=\"The amount of time spent in the training loop for both jobs was also comparable at 45%. \" width=\"800\" height=\"286\"><\/p>\n<p>Examining the system usage statistics shows that both CPU and GPU utilization of the two training instances increased when compared to the original run. 
For the first training run, the 95th-quantile GPU utilization was constant at 13.5% across the three instances, and the 95th-quantile CPU utilization was constant at 24.4% across the three instances. For the second training run, <strong>GPU utilization increased to 46%<\/strong> for the 95th quantile, and the <strong>CPU utilization increased to 61%<\/strong>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-19656\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/ML-1458-22.jpg\" alt=\"Examining the system usage statistics shows that both CPU and GPU utilization of the two training instances increased.\" width=\"800\" height=\"332\"><\/p>\n<p>Although no rules were triggered during this run, there is still room for improvement in resource utilization.<\/p>\n<p>The following screenshot shows the rules summary for our revised training run.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-19720 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/08\/profilertrain2.png\" alt=\"The following screenshot shows the rules summary for our revised training run.\" width=\"2628\" height=\"1380\"><\/p>\n<p>You can continue to tune your training job, change the training parameters, rerun the training, and compare the results against previous training runs. 
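For reference, the profiling configuration shown in the configuration tables maps to the SageMaker Python SDK roughly as follows (the estimator wiring in the comment is illustrative, not the full notebook code):

```python
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

# Monitor system metrics every 500 ms and capture framework metrics
# from step 2 for 7 steps, as in the training jobs above.
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(start_step=2, num_steps=7),
)

# The config is then passed to the estimator, for example:
# estimator = TensorFlow(..., profiler_config=profiler_config)
```

Changing only these parameters and rerunning the job is enough to produce a fresh profiler report for comparison.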
Repeat this process to fine-tune your training strategy and training resources to achieve the optimal combination of training cost and training performance according to your business needs.<\/p>\n<h2>Optimizing costs<\/h2>\n<p>The following table shows a cost comparison of the two training runs.<\/p>\n<table border=\"1px\" width=\"0\" cellpadding=\"5px\">\n<tbody>\n<tr>\n<td width=\"146\"><\/td>\n<td width=\"100\"><strong>Instance Count<\/strong><\/td>\n<td width=\"100\"><strong>Instance Type<\/strong><\/td>\n<td width=\"100\"><strong>Training Time (in Seconds)<\/strong><\/td>\n<td width=\"100\">\n<p><strong>Instance Hourly Cost<\/strong><\/p>\n<p><strong>(us-west-2)<\/strong><\/p>\n<\/td>\n<td width=\"100\"><strong>Training Cost<\/strong><\/td>\n<td width=\"100\"><strong>Cost Savings<\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"146\">First training run<\/td>\n<td width=\"100\">3<\/td>\n<td width=\"100\">p3.8xlarge<\/td>\n<td width=\"100\">503<\/td>\n<td width=\"100\">$14.688<\/td>\n<td width=\"100\">$6.16<\/td>\n<td width=\"100\">N\/A<\/td>\n<\/tr>\n<tr>\n<td width=\"146\">Second training run with Debugger profiling recommendations<\/td>\n<td width=\"100\">2<\/td>\n<td width=\"100\">p3.2xlarge<\/td>\n<td width=\"100\">502<\/td>\n<td width=\"100\">$3.825<\/td>\n<td width=\"100\">$1.07<\/td>\n<td width=\"100\">82.6%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Considering the cost of the training instances in a specific Region at the time of this writing, for example <code>us-west-2<\/code>, training with three ml.p3.8xlarge instances for 503 seconds costs $6.16, and training with two ml.p3.2xlarge instances for 502 seconds costs $1.07. That is roughly 83% in cost savings, achieved simply by implementing the profiler recommendation to switch to a smaller instance type.<\/p>\n<h2>Conclusion<\/h2>\n<p>The profiling feature of SageMaker Debugger is a powerful tool to gain visibility into ML training jobs. 
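As a sanity check, the cost figures in the table above follow directly from instance count, hourly price, and training time:

```python
# Reproduce the cost comparison: cost = instances * hourly price * hours,
# using the us-west-2 prices and durations from the table above.
def training_cost(instances, hourly_price_usd, seconds):
    return round(instances * hourly_price_usd * seconds / 3600, 2)

first_run = training_cost(3, 14.688, 503)   # three ml.p3.8xlarge -> 6.16
second_run = training_cost(2, 3.825, 502)   # two ml.p3.2xlarge -> 1.07
savings_pct = round((1 - second_run / first_run) * 100, 1)  # -> 82.6
```

The same arithmetic makes it easy to estimate the payoff of any proposed instance change before rerunning a job.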
In this post, we provided insight into training resource utilization to identify bottlenecks, analyze the various phases of training, and identify expensive framework functions. We also showed how to analyze and implement profiler recommendations. We applied profiler recommendations to a TensorFlow Horovod distributed training job for a sentiment analysis model and achieved up to a 60% improvement in resource utilization and 83% cost savings. Debugger provides profiling capabilities for all leading deep learning frameworks, including TensorFlow, PyTorch, and Keras.<\/p>\n<p>Give Debugger profiling a try and leave your feedback in the comments. For additional information on SageMaker Debugger, check out the announcement post.<\/p>\n<p>\u00a0<\/p>\n<hr>\n<h3>About the <strong>Authors<\/strong><br \/>\n<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-16219 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/23\/Mona.jpg\" alt=\"\" width=\"100\" height=\"134\"><strong>Mona Mona<\/strong> is an AI\/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. Prior to joining Amazon, she worked as an IT Consultant and completed her master\u2019s in Computer Information Systems from Georgia State University, with a focus on big data analytics. She is passionate about NLP and ML explainability in AI\/ML.<\/p>\n<p>\u00a0<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-13702 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/07\/09\/prem-ranga-100.jpg\" alt=\"\" width=\"100\" height=\"100\"><strong>Prem Ranga<\/strong> is an Enterprise Solutions Architect based out of Houston, Texas. 
He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an Autonomous Vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.<\/p>\n<p>\u00a0<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-11239 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/03\/03\/sireesha-muppala-100.jpg\" alt=\"\" width=\"100\" height=\"135\"><strong>Sireesha Muppala<\/strong> is an AI\/ML Specialist Solutions Architect at AWS, providing guidance to customers on architecting and implementing machine learning solutions at scale. She received her Ph.D. in Computer Science from the University of Colorado, Colorado Springs. In her spare time, Sireesha loves to run and hike Colorado trails.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/identify-bottlenecks-improve-resource-utilization-and-reduce-ml-training-costs-with-the-new-profiling-feature-in-amazon-sagemaker-debugger\/<\/p>\n","protected":false},"author":0,"featured_media":675,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/674"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=674"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/674\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution
.com\/machine-learning\/wp-json\/wp\/v2\/media\/675"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=674"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=674"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=674"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}