{"id":200,"date":"2020-09-10T01:23:52","date_gmt":"2020-09-10T01:23:52","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/09\/10\/visualizing-tensorflow-training-jobs-with-tensorboard\/"},"modified":"2020-09-10T01:23:52","modified_gmt":"2020-09-10T01:23:52","slug":"visualizing-tensorflow-training-jobs-with-tensorboard","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/09\/10\/visualizing-tensorflow-training-jobs-with-tensorboard\/","title":{"rendered":"Visualizing TensorFlow training jobs with TensorBoard"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/www.tensorflow.org\/tensorboard\" target=\"_blank\" rel=\"noopener noreferrer\">TensorBoard<\/a> is an open-source toolkit for <a href=\"https:\/\/www.tensorflow.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">TensorFlow<\/a> users that allows you to visualize a wide range of useful information about your model, from model graphs; to loss, accuracy, or custom metrics; to embedding projections, images, and histograms of weights and biases.<\/p>\n<p>This post demonstrates how to use TensorBoard with <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> training jobs, write logs from TensorFlow training scripts to <a href=\"https:\/\/aws.amazon.com\/s3\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3), and ways to run TensorBoard: locally, using <a href=\"https:\/\/aws.amazon.com\/ecs\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Container Service<\/a> (Amazon ECS) on <a href=\"https:\/\/aws.amazon.com\/fargate\/\">AWS Fargate<\/a>, or inside of an <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/nbi.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker notebook instance<\/a>.<\/p>\n<h2>Generating training logs using tf.summary<\/h2>\n<p>TensorFlow comes with a 
<code>tf.summary<\/code> module for writing summary data, which TensorBoard reads for monitoring and visualization. The module\u2019s API provides methods to write scalars, audio, histograms, text, and image summaries, and can trace information that\u2019s useful for profiling training jobs. For example, the following call records the accuracy at the first training step:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">tf.summary.scalar('accuracy', 0.45, step=1)<\/code><\/pre>\n<\/div>\n<p>To use the summary data after the training job is complete, it\u2019s important to write the files to persistent storage. This way, you can visualize your past jobs or compare different runs during the hyperparameter tuning phase. The <code>tf.summary<\/code> module allows you to use Amazon S3 as the destination for log files by passing the S3 bucket URI directly into the <code>create_file_writer<\/code> method. Summaries are recorded only while a writer is set as the default, for example inside the writer\u2019s <code>as_default()<\/code> context. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">tf.summary.create_file_writer('s3:\/\/&lt;bucket_name&gt;\/&lt;prefix&gt;')<\/code><\/pre>\n<\/div>\n<p><a href=\"https:\/\/keras.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Keras<\/a> users can use <code>tf.keras.callbacks.TensorBoard<\/code> as one of the callbacks provided to the <code>Model.fit()<\/code> method. This callback provides an abstraction over the low-level <code>tf.summary<\/code> API and collects much of the data automatically. With the TensorBoard callback, you can collect data to visualize training graphs, metric plots, and activation histograms, and to run profiling. 
See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\"># Write TensorBoard logs directly to Amazon S3 during training\ntb_callback = tf.keras.callbacks.TensorBoard(log_dir='s3:\/\/&lt;bucket_name&gt;\/&lt;prefix&gt;')\nmodel.fit(x, y, epochs=5, callbacks=[tb_callback])<\/code><\/pre>\n<\/div>\n<p>For a detailed example of how to collect summary data in training scripts, see the TensorBoard Keras example notebook in the <a href=\"https:\/\/github.com\/awslabs\/amazon-sagemaker-examples\/tree\/master\/sagemaker-python-sdk\/tensorboard_keras\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker examples GitHub repo<\/a>, or open it from the Amazon SageMaker <strong>Examples<\/strong> tab of a running Amazon SageMaker notebook instance. This notebook uses TensorFlow 2.2 and Keras to train a convolutional neural network (CNN) to recognize images from the <a href=\"https:\/\/www.cs.toronto.edu\/~kriz\/cifar.html\" target=\"_blank\" rel=\"noopener noreferrer\">CIFAR-10 dataset<\/a>. The notebook runs the training job once locally inside the notebook instance, and then 10 more times during the hyperparameter tuning job. All training jobs write log files under one Amazon S3 prefix, so the log destination path for every run follows the format <span><code>s3:\/\/<em>&lt;bucket_name&gt;<\/em>\/<em>&lt;project_name&gt;<\/em>\/logs\/<em>&lt;training_job_name&gt;<\/em><\/code><\/span>, where the project name is <code>tensorboard_keras_cifar10<\/code>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15517\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/1-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"793\"><\/p>\n<p>The notebook also demonstrates how to run TensorBoard inside the Amazon SageMaker notebook instance. 
This method has some limitations: the TensorBoard command blocks the notebook from running other cells, and TensorBoard stays up only as long as the notebook instance is alive. Still, it lets you quickly open the dashboard and confirm that the training is running correctly.<\/p>\n<p>In the following sections, we look at other ways to run TensorBoard.<\/p>\n<h2>Running TensorBoard on your local machine<\/h2>\n<p>If you want to run TensorBoard locally, first install TensorFlow, for example with <code>pip install tensorflow<\/code>.<\/p>\n<p>An independent distribution of TensorBoard is also available, but it has limited functionality if run without TensorFlow. For this post, we use TensorBoard as part of the TensorFlow distribution.<\/p>\n<p>Assuming your <a href=\"http:\/\/aws.amazon.com\/cli\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Command Line Interface<\/a> (AWS CLI) is installed and configured, we run TensorBoard and point it at the Amazon S3 prefix containing the generated summary data:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">AWS_REGION=eu-west-1 tensorboard --logdir s3:\/\/&lt;bucket_name&gt;\/tensorboard_keras_cifar10\/logs\/<\/code><\/pre>\n<\/div>\n<p>Set <code>AWS_REGION<\/code> to the Region where your S3 bucket is located. You can find the right Region in the <a href=\"https:\/\/s3.console.aws.amazon.com\/s3\/home\" target=\"_blank\" rel=\"noopener noreferrer\">list of buckets<\/a> on the Amazon S3 console.<\/p>\n<p>The AWS user whose credentials you use must have read access to the specified S3 bucket. 
For more information about securely granting access to S3 buckets to a specific user, see <a href=\"https:\/\/aws.amazon.com\/blogs\/security\/writing-iam-policies-how-to-grant-access-to-an-amazon-s3-bucket\/\" target=\"_blank\" rel=\"noopener noreferrer\">Writing IAM Policies: How to Grant Access to an Amazon S3 Bucket<\/a>.<\/p>\n<p>You should see something similar to the following screenshot.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-15518\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/2-TensorBoard-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"860\"><\/p>\n<h2>Running TensorBoard on Amazon ECS on AWS Fargate<\/h2>\n<p>If you prefer to have an instance of TensorBoard permanently running and accessible to your whole team, you can deploy it as an independent application in the cloud. One of the easiest ways to do this without managing servers is AWS Fargate, a serverless compute engine for containers. 
The following diagram illustrates this architecture.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/3-Flow.jpg\" alt=\"\" width=\"900\" height=\"491\"><\/p>\n<p>You can deploy an example TensorBoard container image with all required roles and an Application Load Balancer by using the provided <a href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a> template:<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/us-east-1.console.aws.amazon.com\/cloudformation\/home?region=us-east-1#\/stacks\/create\/review?templateURL=https:\/\/aws-ml-blog.s3.amazonaws.com\/artifacts\/visualizing-tensorflow-template\/template.yaml&amp;stackName=tensorboard-fargate\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-15520 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/4-Launch-Stack.jpg\" alt=\"\" width=\"107\" height=\"20\"><\/a><\/p>\n<p>This template has five input parameters:<\/p>\n<ul>\n<li>\n<strong>TensorBoard container image<\/strong> \u2013 Use <code>tensorflow\/tensorflow<\/code> for the standard distribution, or a custom container image if you want to enable the Profiler plugin<\/li>\n<li>\n<strong>S3Bucket<\/strong> \u2013 Enter the name of the bucket where TensorFlow logs are stored<\/li>\n<li>\n<strong>S3Prefix<\/strong> \u2013 Enter the path to the TensorFlow logs inside the bucket; for example, <code>tensorboard_keras_cifar10\/logs\/<\/code>\n<\/li>\n<li>\n<strong>VpcId<\/strong> \u2013 Select the VPC where you want to deploy TensorBoard<\/li>\n<li>\n<strong>SubnetId<\/strong> \u2013 Select two or more subnets in the selected VPC<\/li>\n<\/ul>\n<p>This example solution doesn\u2019t include authentication or authorization mechanisms. 
Remember that if you deploy TensorBoard to a publicly accessible subnet, your TensorBoard instance and training logs are accessible to everyone on the internet. Make sure you restrict access to TensorBoard before you deploy it this way.<\/p>\n<p>After you create the CloudFormation stack, you can find the link to the deployed TensorBoard on the <strong>Outputs<\/strong> tab on the AWS CloudFormation console.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/5-Tensorboard-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"372\"><\/p>\n<h2>Using a custom TensorBoard container image<\/h2>\n<p>Because TensorBoard is part of the TensorFlow distribution, we can use the official <code>tensorflow\/tensorflow<\/code> Docker container image hosted on Docker Hub.<\/p>\n<p>Optionally, we can build a custom image that adds the Profiler TensorBoard plugin to visualize profiling data:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\"># Dockerfile\nFROM tensorflow\/tensorflow\n\n# Add the Profiler plugin on top of the TensorBoard bundled with TensorFlow\nRUN python3 -m pip install --upgrade --no-cache-dir tensorboard_plugin_profile\n\n# TensorBoard serves on port 6006 by default\nEXPOSE 6006\n\nENTRYPOINT [\"tensorboard\"]<\/code><\/pre>\n<\/div>\n<p>You can build and test the container locally. The <code>--bind_all<\/code> flag makes TensorBoard listen on all network interfaces so that the published port is reachable from the host:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\"># Build the image from the Dockerfile in the current directory\ndocker build -t tensorboard .\n\n# Run the image; arguments after the image name are passed to tensorboard\ndocker run -p 6006:6006 \\\n    --env AWS_ACCESS_KEY_ID=XXXXX \\\n    --env AWS_SECRET_ACCESS_KEY=XXXXX \\\n    --env AWS_REGION=eu-west-1 \\\n    tensorboard \\\n    --logdir s3:\/\/&lt;bucket_name&gt;\/tensorboard_keras_cifar10\/logs\/ \\\n    --bind_all<\/code><\/pre>\n<\/div>\n<p>After testing the container, you need to push it to a container image repository of your choice. Detailed instructions on deploying an application are outside the scope of this post. 
To set up Amazon ECS and Elastic Load Balancing, see <a href=\"https:\/\/aws.amazon.com\/blogs\/compute\/building-deploying-and-operating-containerized-applications-with-aws-fargate\/\" target=\"_blank\" rel=\"noopener noreferrer\">Building, deploying, and operating containerized applications with AWS Fargate<\/a>.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this post, I showed you how to use TensorBoard to visualize TensorFlow training jobs, with Amazon S3 as storage for the logs. You can use this solution and the example notebooks to build and train a model with Amazon SageMaker and run a hyperparameter tuning job. You can use TensorBoard to compare hyperparameters from different training runs, generate and display confusion matrices for the classifier, and profile and visualize the training job\u2019s performance.<\/p>\n<hr>\n<h3>About the Author<\/h3>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-15522 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/01\/Yegor-Tokmakov.jpg\" alt=\"\" width=\"100\" height=\"100\">Yegor Tokmakov<\/strong> is a solutions architect at AWS, working with startups. Before joining AWS, Yegor was Chief Technology Officer at a healthcare startup based in Berlin and was responsible for architecture and operations, as well as product development and growth of the tech team. Yegor is passionate about novel AI applications and data analytics. 
You can find him at @yegortokmakov on Twitter.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/visualizing-tensorflow-training-jobs-with-tensorboard\/<\/p>\n","protected":false},"author":0,"featured_media":201,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/200"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=200"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/200\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/201"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=200"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=200"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=200"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}