{"id":406,"date":"2020-10-15T22:15:44","date_gmt":"2020-10-15T22:15:44","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/10\/15\/streamlining-data-labeling-for-yolo-object-detection-in-amazon-sagemaker-ground-truth\/"},"modified":"2020-10-15T22:15:44","modified_gmt":"2020-10-15T22:15:44","slug":"streamlining-data-labeling-for-yolo-object-detection-in-amazon-sagemaker-ground-truth","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/10\/15\/streamlining-data-labeling-for-yolo-object-detection-in-amazon-sagemaker-ground-truth\/","title":{"rendered":"Streamlining data labeling for YOLO object detection in Amazon SageMaker Ground Truth"},"content":{"rendered":"<div id=\"\">\n<p>Object detection is a common task in computer vision (CV), and the <a href=\"https:\/\/arxiv.org\/pdf\/1804.02767.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">YOLOv3 model<\/a> is state-of-the-art in terms of accuracy and speed. In transfer learning, you obtain a model trained on a large but generic dataset and retrain the model on your custom dataset. One of the most time-consuming parts in transfer learning is collecting and labeling image data to generate a custom training dataset. This post explores how to do this in <a href=\"https:\/\/aws.amazon.com\/sagemaker\/groundtruth\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Ground Truth<\/a>.<\/p>\n<p>Ground Truth offers a comprehensive platform for annotating the most common data labeling jobs in CV: image classification, object detection, semantic segmentation, and instance segmentation. You can perform labeling using Amazon Mechanical Turk or create your own private team to label collaboratively. You can also use one of the third-party data labeling service providers listed on the <a href=\"http:\/\/aws.amazon.com\/partners\/aws-marketplace\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Marketplace<\/a>. 
Ground Truth offers an intuitive interface that is easy to work with. You can communicate with labelers about specific needs for your particular task using examples and notes through the interface.<\/p>\n<p>Labeling data is already hard work. Creating training data for a CV modeling task requires data collection and storage, setting up labeling jobs, and post-processing the labeled data. Moreover, not all object detection models expect the data in the same format. For example, the Faster R-CNN model expects the data in the popular Pascal VOC format, which the YOLO models can\u2019t work with. These associated steps are part of any machine learning pipeline for CV. You sometimes need to run the pipeline multiple times to improve the model incrementally. This post shows how to perform these steps efficiently by using Python scripts and get to model training as quickly as possible. This post uses the YOLO format for its use case, but the steps are mostly independent of the data format.<\/p>\n<p>The image labeling step of a training data generation task is inherently manual. This post shows how to build a reusable framework that handles the steps around it efficiently. Specifically, you can do the following:<\/p>\n<ul>\n<li>Create the required directory structure in <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon S3<\/a> before starting a Ground Truth job<\/li>\n<li>Create a private team of annotators and start a Ground Truth job<\/li>\n<li>Collect the annotations when labeling is complete and save them in a pandas dataframe<\/li>\n<li>Post-process the dataset for model training<\/li>\n<\/ul>\n<p>You can download the code presented in this post from this <a href=\"https:\/\/github.com\/aws-samples\/groundtruth-object-detection\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a>. 
This post demonstrates how to run the code from the <a href=\"http:\/\/aws.amazon.com\/cli\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CLI<\/a> on a local machine that can access an AWS account. For more information about setting up AWS CLI, see <a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/userguide\/cli-chap-welcome.html\" target=\"_blank\" rel=\"noopener noreferrer\">What Is the AWS Command Line Interface?<\/a> Make sure that you configure it to access the S3 buckets in this post. Alternatively, you can run it in <a href=\"https:\/\/aws.amazon.com\/cloud9\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Cloud9<\/a> or by spinning up an <a href=\"http:\/\/aws.amazon.com\/ec2\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EC2<\/a> instance. You can also run the code blocks in an <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> notebook.<\/p>\n<p>If you\u2019re using an Amazon SageMaker notebook, you can still access the Linux shell of the underlying EC2 instance and follow along by opening a new terminal from the Jupyter main page and running the scripts from the <code>\/home\/ec2-user\/SageMaker<\/code> folder.<\/p>\n<h2>Setting up your S3 bucket<\/h2>\n<p>The first thing you need to do is to upload the training images to an S3 bucket. Name the bucket <code>ground-truth-data-labeling<\/code>. You want each labeling task to have its own self-contained folder under this bucket. 
If you start by labeling a small set of images in the first folder, but find that the model performs poorly after the first round because the data is insufficient, you can upload more images to a different folder under the same bucket and start another labeling task.<\/p>\n<p>For the first labeling task, create the folder <code>bounding_box<\/code> and the following three subfolders under it:<\/p>\n<ul>\n<li>\n<strong>images<\/strong> \u2013 You upload all the images in the Ground Truth labeling job to this subfolder.<\/li>\n<li>\n<strong>ground_truth_annots<\/strong> \u2013 This subfolder starts empty; the Ground Truth job populates it automatically, and you retrieve the final annotations from here.<\/li>\n<li>\n<strong>yolo_annot_files<\/strong> \u2013 This subfolder also starts empty, but eventually holds the annotation files ready for model training. The script populates it automatically.<\/li>\n<\/ul>\n<p>If your images are JPEG files with the <code>.jpg<\/code> extension and are available in the current working directory, you can upload them with the following command:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-ruby\">aws s3 sync . s3:\/\/ground-truth-data-labeling\/bounding_box\/images\/ --exclude \"*\" --include \"*.jpg\" <\/code><\/pre>\n<\/div>\n<p>For this use case, you use five images. There are two types of objects in the images: pencil and pen. You need to draw a bounding box around each object in the images. The following images are examples of what you need to label. 
All images are available in the GitHub repo.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15717 size-medium\" title=\"Images to be labeled\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/1-Pen-Image-300x225.jpg\" alt=\"\" width=\"300\" height=\"225\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15718 size-medium\" title=\"Images to be labeled\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/2-Pen-Image-300x225.jpg\" alt=\"\" width=\"300\" height=\"225\"><\/p>\n<h2>Creating the manifest file<\/h2>\n<p>A Ground Truth job requires a manifest file in JSON format that contains the Amazon S3 paths of all the images to label. You need to create this file before you can start the first Ground Truth job. The format of this file is simple:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-json\">{\"source-ref\": &lt; S3 path to image1 &gt;}\r\n{\"source-ref\": &lt; S3 path to image2 &gt;}\r\n...\r\n<\/code><\/pre>\n<\/div>\n<p>However, creating the manifest file by hand would be tedious for a large number of images. Therefore, you can automate the process by running a script. You first need to create a file holding the parameters required for the scripts. 
Create a file <code>input.json<\/code> in your local file system with the following content:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-json\">{\r\n    \"s3_bucket\":\"ground-truth-data-labeling\",\r\n    \"job_id\":\"bounding_box\",\r\n    \"ground_truth_job_name\":\"yolo-bbox\",\r\n    \"yolo_output_dir\":\"yolo_annot_files\"\r\n}\r\n<\/code><\/pre>\n<\/div>\n<p>Save the following code block in a file called <code>prep_gt_job.py<\/code>:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import boto3\r\nimport json\r\n\r\n\r\ndef create_manifest(job_path):\r\n    \"\"\"\r\n    Creates the manifest file for the Ground Truth job\r\n\r\n    Input:\r\n    job_path: Full path of the folder in S3 for GT job\r\n\r\n    Returns:\r\n    manifest_file: The manifest file required for GT job\r\n    \"\"\"\r\n\r\n    s3_rec = boto3.resource(\"s3\")\r\n    s3_bucket = job_path.split(\"\/\")[0]\r\n    prefix = job_path.replace(s3_bucket, \"\")[1:]\r\n    image_folder = f\"{prefix}\/images\"\r\n    print(f\"using images from ... 
{image_folder} \\n\")\r\n\r\n    bucket = s3_rec.Bucket(s3_bucket)\r\n    objs = list(bucket.objects.filter(Prefix=image_folder))\r\n    # skip the zero-byte folder placeholder object, if the folder was created in the console\r\n    img_files = [obj for obj in objs if not obj.key.endswith(\"\/\")]\r\n    n_imgs = len(img_files)\r\n    print(f\"there are {n_imgs} images \\n\")\r\n\r\n    TOKEN = \"source-ref\"\r\n    manifest_file = \"\/tmp\/manifest.json\"\r\n    with open(manifest_file, \"w\") as fout:\r\n        for img_file in img_files:\r\n            fname = f\"s3:\/\/{s3_bucket}\/{img_file.key}\"\r\n            # one JSON object per line\r\n            fout.write(f'{{\"{TOKEN}\": \"{fname}\"}}\\n')\r\n\r\n    return manifest_file\r\n\r\n\r\ndef upload_manifest(job_path, manifest_file):\r\n    \"\"\"\r\n    Uploads the manifest file into S3\r\n\r\n    Input:\r\n    job_path: Full path of the folder in S3 for GT job\r\n    manifest_file: Path to the local copy of the manifest file\r\n    \"\"\"\r\n\r\n    s3_rec = boto3.resource(\"s3\")\r\n    s3_bucket = job_path.split(\"\/\")[0]\r\n    source = manifest_file.split(\"\/\")[-1]\r\n    prefix = job_path.replace(s3_bucket, \"\")[1:]\r\n    destination = f\"{prefix}\/{source}\"\r\n\r\n    print(f\"uploading manifest file to {destination} \\n\")\r\n    s3_rec.meta.client.upload_file(manifest_file, s3_bucket, destination)\r\n\r\n\r\ndef main():\r\n    \"\"\"\r\n    Performs the following tasks:\r\n    1. Reads input from 'input.json'\r\n    2. Collects image names from S3 and creates the manifest file for GT\r\n    3. 
Uploads the manifest file to S3\r\n    \"\"\"\r\n\r\n    with open(\"input.json\") as fjson:\r\n        input_dict = json.load(fjson)\r\n\r\n    s3_bucket = input_dict[\"s3_bucket\"]\r\n    job_id = input_dict[\"job_id\"]\r\n\r\n    gt_job_path = f\"{s3_bucket}\/{job_id}\"\r\n    man_file = create_manifest(gt_job_path)\r\n    upload_manifest(gt_job_path, man_file)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    main()\r\n<\/code><\/pre>\n<\/div>\n<p>Run the following script:<\/p>\n<p><code>python prep_gt_job.py<\/code><\/p>\n<p>This script reads the S3 bucket and job names from the input file, creates a list of images available in the images folder, creates the <code>manifest.json<\/code> file, and uploads the manifest file to the S3 bucket at <code>s3:\/\/ground-truth-data-labeling\/bounding_box\/<\/code>.<\/p>\n<p>This method illustrates a programmatic control of the process, but you can also create the file from the Ground Truth API. For instructions, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-data-input.html#sms-console-create-manifest-file\" target=\"_blank\" rel=\"noopener noreferrer\">Create a Manifest File<\/a>.<\/p>\n<p>At this point, the folder structure in the S3 bucket should look like the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">ground-truth-data-labeling \r\n|-- bounding_box\r\n    |-- ground_truth_annots\r\n    |-- images\r\n    |-- yolo_annot_files\r\n    |-- manifest.json\r\n<\/code><\/pre>\n<\/div>\n<h2>Creating the Ground Truth job<\/h2>\n<p>You\u2019re now ready to create your Ground Truth job. You need to specify the job details and task type, and create your team of labelers and labeling task details. 
Then you can sign in to begin the labeling job.<\/p>\n<h3>Specifying the job details<\/h3>\n<p>To specify the job details, complete the following steps:<\/p>\n<ol>\n<li>On the Amazon SageMaker console, under <strong>Ground Truth<\/strong>, choose <strong>Labeling jobs<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15719 size-full\" title=\"Select Labeling jobs\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/3-SageMaker.jpg\" alt=\"\" width=\"300\" height=\"487\"><\/p>\n<ol start=\"2\">\n<li>On the <strong>Labeling jobs<\/strong> page, choose <strong>Create labeling job<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15720 size-full\" title=\"Create labeling jobs\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/4-Labeling-Jobs.jpg\" alt=\"\" width=\"900\" height=\"163\"><\/p>\n<ol start=\"3\">\n<li>In the <strong>Job overview <\/strong>section, for <strong>Job name<\/strong>, enter <code>yolo-bbox<\/code>. 
It must match the name you defined in the <code>input.json<\/code> file earlier.<\/li>\n<li>Under <strong>Input Data Setup<\/strong>, choose <strong>Manual Data Setup<\/strong>.<\/li>\n<li>For <strong>Input dataset location<\/strong>, enter <code>s3:\/\/ground-truth-data-labeling\/bounding_box\/manifest.json<\/code>.<\/li>\n<li>For <strong>Output dataset location<\/strong>, enter <code>s3:\/\/ground-truth-data-labeling\/bounding_box\/ground_truth_annots<\/code>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15721 size-full\" title=\"Configuring the job\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/5-Job-overview-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"710\"><\/p>\n<ol start=\"7\">\n<li>In the <strong>Create an IAM role<\/strong> section, first select <strong>Create a new role<\/strong> from the drop-down menu and then select <strong>Specific S3 buckets<\/strong>.<\/li>\n<li>Enter <code>ground-truth-data-labeling<\/code>.\n<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15722 size-full\" title=\"Managing IAM roles\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/6-Create-IAM.jpg\" alt=\"\" width=\"900\" height=\"350\"><\/p>\n<ol start=\"9\">\n<li>Choose <strong>Create<\/strong>.<\/li>\n<\/ol>\n<h3>Specifying the task type<\/h3>\n<p>To specify the task type, complete the following steps:<\/p>\n<ol>\n<li>In the <strong>Task selection<\/strong> section, from the <strong>Task Category<\/strong> drop-down menu, choose <strong>Image<\/strong>.<\/li>\n<li>Select <strong>Bounding box<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15723 size-full\" title=\"Selecting the bounding box\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/7-Create-bounding-box.jpg\" alt=\"\" 
width=\"500\" height=\"378\"><\/p>\n<ol start=\"3\">\n<li>Don\u2019t change <strong>Enable enhanced image access<\/strong>, which is selected by default. It enables Cross-Origin Resource Sharing (CORS), which some workers may need to complete the annotation task.<\/li>\n<li>Choose <strong>Next<\/strong>.<\/li>\n<\/ol>\n<h3>Creating a team of labelers<\/h3>\n<p>To create your team of labelers, complete the following steps:<\/p>\n<ol>\n<li>In the <strong>Workers <\/strong>section, select <strong>Private<\/strong>.<\/li>\n<li>Follow the instructions to create a new team.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15724 size-full\" title=\"Selecting your team of labelers\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/8-Workers-Info.jpg\" alt=\"\" width=\"900\" height=\"275\"><\/p>\n<p>Each member of the team receives a notification email titled \u201cYou\u2019re invited to work on a labeling project\u201d that contains initial sign-in credentials. For this use case, create a team with just yourself as a member.<\/p>\n<h3>Specifying labeling task details<\/h3>\n<p>In the <strong>Bounding box labeling tool<\/strong> section, you should see the images you uploaded to Amazon S3. If you don\u2019t, check that the paths you entered in the previous steps are correct. To specify your task details, complete the following steps:<\/p>\n<ol>\n<li>In the text box, enter a brief description of the task.<\/li>\n<\/ol>\n<p>This is critical if the data labeling team has more than one member and you want to make sure everyone follows the same rules when drawing the boxes. Any inconsistency in bounding box creation may end up confusing your object detection model. For example, if you\u2019re labeling beverage cans and want to create a tight bounding box only around the visible logo, instead of the entire can, you should specify that to get consistent labeling from all the workers. 
For this use case, you can enter <code>Please enter a tight bounding box around the entire object<\/code>.<\/p>\n<ol start=\"2\">\n<li>Optionally, you can upload examples of a good and a bad bounding box.<\/li>\n<\/ol>\n<p>You can make sure your team is consistent in their labels by providing good and bad examples.<\/p>\n<ol start=\"3\">\n<li>Under <strong>Labels<\/strong>, enter the names of the labels you\u2019re using to identify each bounding box; in this case, <code>pencil<\/code> and <code>pen<\/code>.<\/li>\n<\/ol>\n<p>A color is assigned to each label automatically, which helps to visualize the boxes created for overlapping objects.<\/p>\n<ol start=\"4\">\n<li>To run a final sanity check, choose <strong>Preview<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15725 size-full\" title=\"Preview view\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/9-Preview.jpg\" alt=\"\" width=\"500\" height=\"418\"><\/p>\n<ol start=\"5\">\n<li>Choose <strong>Create job<\/strong>.<\/li>\n<\/ol>\n<p>Job creation can take up to a few minutes. 
When it\u2019s complete, you should see a job titled yolo-bbox on the Ground Truth <strong>Labeling jobs<\/strong> page with In progress as the status.<\/p>\n<ol start=\"6\">\n<li>To view the job details, select the job.<\/li>\n<\/ol>\n<p>This is a good time to verify the paths are correct; the scripts don\u2019t run if there\u2019s any inconsistency in names.<\/p>\n<p>For more information about providing labeling instructions, see <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/create-high-quality-instructions-for-amazon-sagemaker-ground-truth-labeling-jobs\/\" target=\"_blank\" rel=\"noopener noreferrer\">Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs<\/a>.<\/p>\n<h3>Sign in and start labeling<\/h3>\n<p>After you receive the initial credentials to register as a labeler for this job, follow the link to reset the password and start labeling.<\/p>\n<p>If you need to interrupt your labeling session, you can resume labeling by choosing <strong>Labeling workforces<\/strong> under <strong>Ground Truth<\/strong> on the SageMaker console.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15726 size-full\" title=\"Selecting Labeling workforces\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/10-Ground-Truth.jpg\" alt=\"\" width=\"300\" height=\"208\"><\/p>\n<p>You can find the link to the labeling portal on the <strong>Private <\/strong>tab. 
The page also lists the teams and individuals involved in this private labeling task.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15736 size-full\" title=\"Link to the labeling portal\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/11-Private-Workforce-Summary.jpg\" alt=\"\" width=\"900\" height=\"324\"><\/p>\n<p>After you sign in, start labeling by choosing <strong>Start working<\/strong>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15737 size-full\" title=\"Starting labeling by selecting Start working\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/12-Jobs-1.jpg\" alt=\"\" width=\"900\" height=\"138\"><\/p>\n<p>Because you only have five images in the dataset to label, you can finish the entire task in a single session. For larger datasets, you can pause the task by choosing <strong>Stop working<\/strong> and return to the task later to finish it.<\/p>\n<h3>Checking job status<\/h3>\n<p>After the labeling is complete, the status of the labeling job changes to Complete and a new file called <code>output.manifest<\/code>, which holds one JSON record per image with its annotations, appears at <code>s3:\/\/ground-truth-data-labeling\/bounding_box\/ground_truth_annots\/yolo-bbox\/manifests\/output\/output.manifest<\/code>.<\/p>\n<h2>Parsing Ground Truth annotations<\/h2>\n<p>You can now parse the annotations and perform the necessary post-processing steps to make them ready for model training. 
Start by running the following code block:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from io import StringIO\r\nimport json\r\nimport s3fs\r\nimport boto3\r\nimport pandas as pd\r\n\r\n\r\ndef parse_gt_output(manifest_path, job_name):\r\n    \"\"\"\r\n    Captures the json Ground Truth bounding box annotations into a pandas dataframe\r\n\r\n    Input:\r\n    manifest_path: S3 path to the annotation file\r\n    job_name: name of the Ground Truth job\r\n\r\n    Returns:\r\n    df_bbox: pandas dataframe with bounding box coordinates\r\n             for each item in every image\r\n    \"\"\"\r\n\r\n    filesys = s3fs.S3FileSystem()\r\n    with filesys.open(manifest_path) as fin:\r\n        annot_list = []\r\n        for line in fin.readlines():\r\n            record = json.loads(line)\r\n            if job_name in record.keys():  # is it necessary?\r\n                image_file_path = record[\"source-ref\"]\r\n                image_file_name = image_file_path.split(\"\/\")[-1]\r\n                class_maps = record[f\"{job_name}-metadata\"][\"class-map\"]\r\n\r\n                imsize_list = record[job_name][\"image_size\"]\r\n                assert len(imsize_list) == 1\r\n                image_width = imsize_list[0][\"width\"]\r\n                image_height = imsize_list[0][\"height\"]\r\n\r\n                for annot in record[job_name][\"annotations\"]:\r\n                    left = annot[\"left\"]\r\n                    top = annot[\"top\"]\r\n                    height = annot[\"height\"]\r\n                    width = annot[\"width\"]\r\n                    class_name = class_maps[f'{annot[\"class_id\"]}']\r\n\r\n                    annot_list.append(\r\n                        [\r\n                            image_file_name,\r\n                            class_name,\r\n                            left,\r\n                            top,\r\n                            height,\r\n                     
       width,\r\n                            image_width,\r\n                            image_height,\r\n                        ]\r\n                    )\r\n\r\n        df_bbox = pd.DataFrame(\r\n            annot_list,\r\n            columns=[\r\n                \"img_file\",\r\n                \"category\",\r\n                \"box_left\",\r\n                \"box_top\",\r\n                \"box_height\",\r\n                \"box_width\",\r\n                \"img_width\",\r\n                \"img_height\",\r\n            ],\r\n        )\r\n\r\n    return df_bbox\r\n\r\n\r\ndef save_df_to_s3(df_local, s3_bucket, destination):\r\n    \"\"\"\r\n    Saves a pandas dataframe to S3\r\n\r\n    Input:\r\n    df_local: Dataframe to save\r\n    s3_bucket: Bucket name\r\n    destination: Prefix\r\n    \"\"\"\r\n\r\n    csv_buffer = StringIO()\r\n    s3_resource = boto3.resource(\"s3\")\r\n\r\n    df_local.to_csv(csv_buffer, index=False)\r\n    s3_resource.Object(s3_bucket, destination).put(Body=csv_buffer.getvalue())\r\n\r\n\r\ndef main():\r\n    \"\"\"\r\n    Performs the following tasks:\r\n    1. Reads input from 'input.json'\r\n    2. Parses the Ground Truth annotations and creates a dataframe\r\n    3. 
Saves the dataframe to S3\r\n    \"\"\"\r\n\r\n    with open(\"input.json\") as fjson:\r\n        input_dict = json.load(fjson)\r\n\r\n    s3_bucket = input_dict[\"s3_bucket\"]\r\n    job_id = input_dict[\"job_id\"]\r\n    gt_job_name = input_dict[\"ground_truth_job_name\"]\r\n\r\n    mani_path = f\"s3:\/\/{s3_bucket}\/{job_id}\/ground_truth_annots\/{gt_job_name}\/manifests\/output\/output.manifest\"\r\n\r\n    df_annot = parse_gt_output(mani_path, gt_job_name)\r\n    dest = f\"{job_id}\/ground_truth_annots\/{gt_job_name}\/annot.csv\"\r\n    save_df_to_s3(df_annot, s3_bucket, dest)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    main()\r\n<\/code><\/pre>\n<\/div>\n<p>Save the preceding code block in the file <code>parse_annot.py<\/code> and run it from the command line:<\/p>\n<p><code>python parse_annot.py<\/code><\/p>\n<p>Ground Truth describes each bounding box with four numbers in pixels: the <code>left<\/code> and <code>top<\/code> coordinates of the box, and its height and width. The procedure <code>parse_gt_output<\/code> scans through the <code>output.manifest<\/code> file and stores the information for every bounding box for each image in a pandas dataframe. The procedure <code>save_df_to_s3<\/code> saves it in a tabular format as <code>annot.csv<\/code> to the S3 bucket for further processing.<\/p>\n<p>The creation of the dataframe is useful for a few reasons. JSON files are hard to read and the <code>output.manifest<\/code> file contains more information, like label metadata, than you need for the next step. 
The dataframe contains only the relevant information and you can visualize it easily to make sure everything looks fine.<\/p>\n<p>To grab the <code>annot.csv<\/code> file from Amazon S3 and save a local copy in the current directory, run the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">aws s3 cp s3:\/\/ground-truth-data-labeling\/bounding_box\/ground_truth_annots\/yolo-bbox\/annot.csv .<\/code><\/pre>\n<\/div>\n<p>You can read it back into a pandas dataframe and inspect the first few lines. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import pandas as pd\r\ndf_ann = pd.read_csv('annot.csv')\r\ndf_ann.head()\r\n<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows the results.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15729 size-full\" title=\"Results screenshot\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/13-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"127\"><\/p>\n<p>You also capture the size of the image through <code>img_width<\/code> and <code>img_height<\/code>. This is necessary because the YOLO format expresses each bounding box relative to the image dimensions. In this case, you can see that images in the dataset were captured with a 4608\u00d73456 pixel resolution.<\/p>\n<p>Saving the annotation information in a dataframe is a good idea for several reasons:<\/p>\n<ul>\n<li>In a subsequent step, you need to rescale the bounding box coordinates into a YOLO-readable format. You can do this operation easily in a dataframe.<\/li>\n<li>If you decide to capture and label more images in the future to augment the existing dataset, all you need to do is combine the newly created dataframe with the existing one. 
Again, you can perform this easily using a dataframe.<\/li>\n<li>As of this writing, the Ground Truth console doesn\u2019t allow more than 30 label categories in a single job. If you have more categories in your dataset, you have to label them under multiple Ground Truth jobs and combine the results. Ground Truth associates each bounding box with an integer index in the <code>output.manifest<\/code> file. Because each job assigns these indexes independently, the same integer can refer to different categories in different jobs. Keeping the annotations in dataframes makes combining them easier and avoids these conflicts, because the dataframe stores category names rather than indexes. In the preceding screenshot, you can see the actual names under the category column instead of the integer index.<\/li>\n<\/ul>\n<h2>Generating YOLO annotations<\/h2>\n<p>You\u2019re now ready to reformat the bounding box coordinates Ground Truth provided into a format the YOLO model accepts.<\/p>\n<p>In the YOLO format, each bounding box is described by the center coordinates of the box and its width and height. The x coordinate and width are divided by the image width, and the y coordinate and height are divided by the image height; therefore, all four values range between 0 and 1. Instead of category names, YOLO models expect the corresponding integer categories.<\/p>\n<p>Therefore, you need to map each name in the category column of the dataframe into a unique integer. Moreover, the official <a href=\"https:\/\/pjreddie.com\/darknet\/yolo\/\" target=\"_blank\" rel=\"noopener noreferrer\">Darknet implementation of YOLOv3<\/a> requires the annotation text file to have the same base name as its image. 
For example, if the image file is <code>pic01.jpg<\/code>, the corresponding annotation file should be named <code>pic01.txt<\/code>.<\/p>\n<p>The following code block performs all these tasks:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import os\r\nimport json\r\nfrom io import StringIO\r\nimport boto3\r\nimport s3fs\r\nimport pandas as pd\r\n\r\n\r\ndef annot_yolo(annot_file, cats):\r\n    \"\"\"\r\n    Prepares the annotation in YOLO format\r\n\r\n    Input:\r\n    annot_file: csv file containing Ground Truth annotations\r\n    cats: List of object categories in proper order for model training\r\n\r\n    Returns:\r\n    df_ann: pandas dataframe that includes the following columns\r\n            img_file int_category box_center_w box_center_h box_width box_height\r\n\r\n\r\n    Note:\r\n    YOLO data format: &lt;object-class&gt; &lt;x_center&gt; &lt;y_center&gt; &lt;width&gt; &lt;height&gt;\r\n    \"\"\"\r\n\r\n    df_ann = pd.read_csv(annot_file)\r\n\r\n    df_ann[\"int_category\"] = df_ann[\"category\"].apply(lambda x: cats.index(x))\r\n    df_ann[\"box_center_w\"] = df_ann[\"box_left\"] + df_ann[\"box_width\"] \/ 2\r\n    df_ann[\"box_center_h\"] = df_ann[\"box_top\"] + df_ann[\"box_height\"] \/ 2\r\n\r\n    # scale box dimensions by image dimensions\r\n    df_ann[\"box_center_w\"] = df_ann[\"box_center_w\"] \/ df_ann[\"img_width\"]\r\n    df_ann[\"box_center_h\"] = df_ann[\"box_center_h\"] \/ df_ann[\"img_height\"]\r\n    df_ann[\"box_width\"] = df_ann[\"box_width\"] \/ df_ann[\"img_width\"]\r\n    df_ann[\"box_height\"] = df_ann[\"box_height\"] \/ df_ann[\"img_height\"]\r\n\r\n    return df_ann\r\n\r\n\r\ndef save_annots_to_s3(s3_bucket, prefix, df_local):\r\n    \"\"\"\r\n    For every image in the dataset, save a text file with annotation in YOLO format\r\n\r\n    Input:\r\n    s3_bucket: S3 bucket name\r\n    prefix: Folder name under s3_bucket where files will be written\r\n    df_local: pandas 
dataframe with the following columns\r\n              img_file int_category box_center_w box_center_h box_width box_height\r\n    \"\"\"\r\n\r\n    unique_images = df_local[\"img_file\"].unique()\r\n    s3_resource = boto3.resource(\"s3\")\r\n\r\n    for image_file in unique_images:\r\n        df_single_img_annots = df_local.loc[df_local.img_file == image_file]\r\n        # strip only the final extension so file names containing dots are preserved\r\n        annot_txt_file = os.path.splitext(image_file)[0] + \".txt\"\r\n        destination = f\"{prefix}\/{annot_txt_file}\"\r\n\r\n        csv_buffer = StringIO()\r\n        df_single_img_annots.to_csv(\r\n            csv_buffer,\r\n            index=False,\r\n            header=False,\r\n            sep=\" \",\r\n            float_format=\"%.4f\",\r\n            columns=[\r\n                \"int_category\",\r\n                \"box_center_w\",\r\n                \"box_center_h\",\r\n                \"box_width\",\r\n                \"box_height\",\r\n            ],\r\n        )\r\n        s3_resource.Object(s3_bucket, destination).put(Body=csv_buffer.getvalue())\r\n\r\n\r\ndef get_cats(json_file):\r\n    \"\"\"\r\n    Makes a list of the category names in proper order\r\n\r\n    Input:\r\n    json_file: s3 path of the json file containing the category information\r\n\r\n    Returns:\r\n    labels: List of category names\r\n    \"\"\"\r\n\r\n    filesys = s3fs.S3FileSystem()\r\n    with filesys.open(json_file) as fin:\r\n        line = fin.readline()\r\n        record = json.loads(line)\r\n        labels = [item[\"label\"] for item in record[\"labels\"]]\r\n\r\n    return labels\r\n\r\n\r\ndef main():\r\n    \"\"\"\r\n    Performs the following tasks:\r\n    1. Reads input from 'input.json'\r\n    2. Collects the category names from the Ground Truth job\r\n    3. Creates a dataframe with annotation in YOLO format\r\n    4. 
Saves a text file in S3 with YOLO annotations\r\n       for each of the labeled images\r\n    \"\"\"\r\n\r\n    with open(\"input.json\") as fjson:\r\n        input_dict = json.load(fjson)\r\n\r\n    s3_bucket = input_dict[\"s3_bucket\"]\r\n    job_id = input_dict[\"job_id\"]\r\n    gt_job_name = input_dict[\"ground_truth_job_name\"]\r\n    yolo_output = input_dict[\"yolo_output_dir\"]\r\n\r\n    s3_path_cats = (\r\n        f\"s3:\/\/{s3_bucket}\/{job_id}\/ground_truth_annots\/{gt_job_name}\/annotation-tool\/data.json\"\r\n    )\r\n    categories = get_cats(s3_path_cats)\r\n    print(\"\\n labels used in Ground Truth job: \")\r\n    print(categories, \"\\n\")\r\n\r\n    gt_annot_file = \"annot.csv\"\r\n    s3_dir = f\"{job_id}\/{yolo_output}\"\r\n    print(\"annotation files saved in = \", s3_dir)\r\n\r\n    df_annot = annot_yolo(gt_annot_file, categories)\r\n    save_annots_to_s3(s3_bucket, s3_dir, df_annot)\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    main()<\/code><\/pre>\n<\/div>\n<p>Save the preceding code block in a file named <code>create_annot.py<\/code> and run the following from the command line:<\/p>\n<p><code>python create_annot.py<\/code><\/p>\n<p>The <code>annot_yolo<\/code> procedure transforms the dataframe you created by rescaling the box coordinates by the image size, and the <code>save_annots_to_s3<\/code> procedure saves the annotations corresponding to each image into a text file and stores it in Amazon S3.<\/p>\n<p>You can now inspect a couple of images and their corresponding annotations to make sure they\u2019re properly formatted for model training. However, you first need to write a procedure to draw YOLO-formatted bounding boxes on an image. 
Save the following code block in <code>visualize.py<\/code>:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import matplotlib.pyplot as plt\r\nimport matplotlib.image as mpimg\r\nimport matplotlib.colors as mcolors\r\nimport argparse\r\n\r\n\r\ndef visualize_bbox(img_file, yolo_ann_file, label_dict, figure_size=(6, 8)):\r\n    \"\"\"\r\n    Plots bounding boxes on images\r\n\r\n    Input:\r\n    img_file: Path to the image file\r\n    yolo_ann_file: Text file containing annotations in YOLO format\r\n    label_dict: Dictionary mapping each integer category to its name\r\n    figure_size: Figure size\r\n    \"\"\"\r\n\r\n    img = mpimg.imread(img_file)\r\n    fig, ax = plt.subplots(1, 1, figsize=figure_size)\r\n    ax.imshow(img)\r\n\r\n    # the first two dimensions are height and width, for color and grayscale images alike\r\n    im_height, im_width = img.shape[:2]\r\n\r\n    palette = mcolors.TABLEAU_COLORS\r\n    colors = list(palette.keys())\r\n    with open(yolo_ann_file, \"r\") as fin:\r\n        for line in fin:\r\n            cat, center_w, center_h, width, height = line.split()\r\n            cat = int(cat)\r\n            category_name = label_dict[cat]\r\n            # reuse colors if there are more categories than palette entries\r\n            color = colors[cat % len(colors)]\r\n            # convert the normalized YOLO values back to pixel coordinates\r\n            left = (float(center_w) - float(width) \/ 2) * im_width\r\n            top = (float(center_h) - float(height) \/ 2) * im_height\r\n            width = float(width) * im_width\r\n            height = float(height) * im_height\r\n\r\n            rect = plt.Rectangle(\r\n                (left, top),\r\n                width,\r\n                height,\r\n                fill=False,\r\n                linewidth=2,\r\n                edgecolor=color,\r\n            )\r\n            ax.add_patch(rect)\r\n            props = dict(boxstyle=\"round\", facecolor=color, alpha=0.5)\r\n            ax.text(\r\n                left,\r\n                top,\r\n                category_name,\r\n                fontsize=14,\r\n                verticalalignment=\"top\",\r\n                bbox=props,\r\n            )\r\n    plt.show()\r\n\r\n\r\ndef main():\r\n 
   \"\"\"\r\n    Plots bounding boxes\r\n    \"\"\"\r\n\r\n    # the order must match the label order in the Ground Truth job's data.json\r\n    labels = {0: \"pencil\", 1: \"pen\"}\r\n    parser = argparse.ArgumentParser()\r\n    parser.add_argument(\"img\", help=\"image file\")\r\n    args = parser.parse_args()\r\n    img_file = args.img\r\n    # strip only the final extension so file names containing dots are preserved\r\n    ann_file = img_file.rsplit(\".\", 1)[0] + \".txt\"\r\n    visualize_bbox(img_file, ann_file, labels, figure_size=(6, 8))\r\n\r\n\r\nif __name__ == \"__main__\":\r\n    main()\r\n<\/code><\/pre>\n<\/div>\n<p>Download an image and the corresponding annotation file from Amazon S3. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">aws s3 cp s3:\/\/ground-truth-data-labeling\/bounding_box\/yolo_annot_files\/IMG_20200816_205004.txt .<\/code><\/pre>\n<\/div>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">aws s3 cp s3:\/\/ground-truth-data-labeling\/bounding_box\/images\/IMG_20200816_205004.jpg .<\/code><\/pre>\n<\/div>\n<p>To display the correct label of each bounding box, you need to specify the names of the objects you labeled in a dictionary and pass it to <code>visualize_bbox<\/code>. For this use case, you only have two items in the dictionary. However, the order of the labels is important: it must match the order you used while creating the Ground Truth labeling job. 
If you can\u2019t remember the order, you can access the information from the <code>s3:\/\/data-labeling-ground-truth\/bounding_box\/ground_truth_annots\/bbox-yolo\/annotation-tool\/data.json<\/code> file in Amazon S3, which the Ground Truth job creates automatically.<\/p>\n<p>The contents of the <code>data.json<\/code> file for this task look like the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-json\">{\"document-version\":\"2018-11-28\",\"labels\":[{\"label\":\"pencil\"},{\"label\":\"pen\"}]}<\/code><\/pre>\n<\/div>\n<p>Therefore, the labels dictionary in <code>visualize.py<\/code> must be defined as follows:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">labels = {0: 'pencil', 1: 'pen'}<\/code><\/pre>\n<\/div>\n<p>Now run the following to visualize the image:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">python visualize.py IMG_20200816_205004.jpg<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows the bounding boxes correctly drawn around two pens.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15730 size-full\" title=\"Labeled pens\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/14-Pens-2.jpg\" alt=\"\" width=\"400\" height=\"299\"><\/p>\n<p>To plot an image with a mix of pens and pencils, get the image and the corresponding annotation text file from Amazon S3. 
See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">aws s3 cp s3:\/\/ground-truth-data-labeling\/bounding_box\/yolo_annot_files\/IMG_20200816_205029.txt .\r\n\r\naws s3 cp s3:\/\/ground-truth-data-labeling\/bounding_box\/images\/IMG_20200816_205029.jpg .\r\n<\/code><\/pre>\n<\/div>\n<p>Override the default figure size in the <code>visualize_bbox<\/code> procedure to <code>(10, 12)<\/code> and run the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">python visualize.py IMG_20200816_205029.jpg<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows three bounding boxes correctly drawn around two types of objects.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-15731 size-full\" title=\"Labeling pens and pencils\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/09\/15-Pens-Pencils.jpg\" alt=\"\" width=\"400\" height=\"302\"><\/p>\n<h2>Conclusion<\/h2>\n<p>This post described how to create an efficient, end-to-end data-gathering pipeline in Amazon SageMaker Ground Truth for an object detection model. Try out this process yourself next time you are creating an object detection model. You can modify the annotation post-processing step to produce labeled data in the Pascal VOC format, which is required for models like Faster RCNN. You can also adapt the basic framework to other data-labeling pipelines with job-specific modifications. For example, you can rewrite the annotation post-processing procedures to adapt the framework for an instance segmentation task, in which an object is labeled at the pixel level instead of drawing a rectangle around the object. Amazon SageMaker Ground Truth is regularly updated with enhanced capabilities. 
Therefore, check the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms.html\" target=\"_blank\" rel=\"noopener noreferrer\">documentation<\/a> for the most up-to-date features.<\/p>\n<hr>\n<h3>About the Author<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-9327 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2019\/08\/05\/Arkajyoti-Misra-100.jpg\" alt=\"\" width=\"100\" height=\"133\"><strong>Arkajyoti Misra<\/strong> is a Data Scientist working in AWS Professional Services. He loves to dig into Machine Learning algorithms and enjoys reading about new frontiers in Deep Learning.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/streamlining-data-labeling-for-yolo-object-detection-in-amazon-sagemaker-ground-truth\/<\/p>\n","protected":false},"author":0,"featured_media":407,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/406"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=406"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/406\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/407"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=406"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com
\/machine-learning\/wp-json\/wp\/v2\/categories?post=406"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=406"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}