{"id":1051,"date":"2021-10-19T08:39:29","date_gmt":"2021-10-19T08:39:29","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/19\/gamify-amazon-sagemaker-ground-truth-labeling-workflows-via-a-bar-chart-race\/"},"modified":"2021-10-19T08:39:29","modified_gmt":"2021-10-19T08:39:29","slug":"gamify-amazon-sagemaker-ground-truth-labeling-workflows-via-a-bar-chart-race","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/19\/gamify-amazon-sagemaker-ground-truth-labeling-workflows-via-a-bar-chart-race\/","title":{"rendered":"Gamify Amazon SageMaker Ground Truth labeling workflows via a bar chart race"},"content":{"rendered":"<div id=\"\">\n<p>Labeling is an indispensable stage of data preprocessing in supervised learning. <a href=\"https:\/\/aws.amazon.com\/sagemaker\/groundtruth\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Ground Truth<\/a> is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning. Ground Truth helps improve the quality of labels through annotation consolidation and audit workflows. Ground Truth is easy to use, can reduce your labeling costs by up to 70% using automatic labeling, and provides options to work with labelers inside and outside of your organization.<\/p>\n<p>This post explains how you can use partial Ground Truth labeling data stored in <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) to gamify labeling workflows. The core of the gamification approach is to create a bar chart race that shows the progress of the labeling workflow and how the number of completed labels per worker evolves over time. The bar chart race can be sent periodically (such as daily or weekly). 
We present options to create and send your bar chart race manually or automatically.<\/p>\n<p>This gamification approach to Ground Truth labeling workflows can help you:<\/p>\n<ul>\n<li>Speed up labeling<\/li>\n<li>Reduce delays in labeling through continuous monitoring<\/li>\n<li>Increase user engagement and user satisfaction<\/li>\n<\/ul>\n<p>We have successfully adopted this solution for a healthcare and life sciences customer. The labeling job owner kept the internal labeling team engaged by sending a bar chart race daily, and the labeling job was completed 20% faster than planned.<\/p>\n<h2>Option 1: Manual chart creation<\/h2>\n<p>A first option for gamifying your Ground Truth labeling workflow via a bar chart race is to create an <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> notebook instance to fetch the partial labeling data, parse it, and create the bar chart race manually. You then save the video to Amazon S3 and send it to the workers. 
The following diagram shows this workflow.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/15\/ML-2967-image001.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28163\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/15\/ML-2967-image001.jpg\" alt=\"\" width=\"440\" height=\"473\"><\/a><\/p>\n<p>To create your bar chart race manually, complete the following steps:<\/p>\n<ol>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sms-getting-started-step2.html\" target=\"_blank\" rel=\"noopener noreferrer\">Create a Ground Truth labeling job<\/a> and indicate an S3 bucket where the labeling data is continuously loaded.<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/howitworks-create-ws.html\" target=\"_blank\" rel=\"noopener noreferrer\">Create a SageMaker notebook instance<\/a><strong>.<\/strong>\n<ol type=\"a\">\n<li>Attach the appropriate <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) role to allow read access to the S3 bucket containing the outputs of the Ground Truth labeling job.<\/li>\n<\/ol>\n<\/li>\n<li>Create a notebook using a <code>conda_python3<\/code> based kernel, then install the required dependencies. 
You can run the following commands from the terminal after activating the appropriate environment:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">cd \/home\/ec2-user\/SageMaker\/\nsource activate python3\npip install bar_chart_race\npip install ffmpeg-python\nsudo su -\ncd \/usr\/local\/bin\nmkdir ffmpeg\ncd ffmpeg\nwget https:\/\/www.johnvansickle.com\/ffmpeg\/old-releases\/ffmpeg-4.2.1-amd64-static.tar.xz\ntar xvf ffmpeg-4.2.1-amd64-static.tar.xz\nmv ffmpeg-4.2.1-amd64-static\/ffmpeg .\nln -s \/usr\/local\/bin\/ffmpeg\/ffmpeg \/usr\/bin\/ffmpeg\nexit<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"4\">\n<li>Import the required packages via the following code:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">import boto3\nimport json\nimport pandas as pd \nimport numpy as np<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"5\">\n<li>Set up the SageMaker notebook instance to access the S3 bucket containing the Ground Truth labeling data (for this post, we use the bucket <code>Example_SageMaker_GT<\/code>):<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">s3 = boto3.client('s3')\nbucket_name = 'Example_SageMaker_GT'\nprefix = '\/annotations\/worker-response\/iteration-1\/'<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"6\">\n<li>Analyze the partial Ground Truth labeling data:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">s3_res = boto3.resource('s3')\npaginator = boto3.client('s3').get_paginator('list_objects_v2')\npages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)\n\ntimes = []\nsubs = []\nfor page in pages:\n    for work in page['Contents']:\n        content_object = s3_res.Object(bucket_name, work['Key'])\n        file_content = content_object.get()['Body'].read().decode('utf-8')\n        json_content = json.loads(file_content)\n        times.append(json_content['answers'][0]['submissionTime'])\n        
subs.append(json_content['answers'][0]['workerMetadata']['identityData']['sub'])\n# map each worker sub to a placeholder display name\nsub_map = { s: f'Name {i}' for i,s in enumerate(np.unique(subs))}<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"7\">\n<li>Convert the partial Ground Truth labeling data into a DataFrame, structure the <code>date<\/code> and <code>hours<\/code> fields, and add a <code>count<\/code> helper column:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">df = pd.DataFrame({'times':times,'subs':subs})\ndf[\"subs\"] = df[\"subs\"].map(sub_map)\ndf['date'] = pd.to_datetime(df.times).dt.date\ndf['hours'] = pd.to_datetime(df.times).dt.strftime('%Y-%m-%d %H:30')\n# one row per worker response; this column is counted in the next step\ndf['count'] = 1<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"8\">\n<li>Count the completed labels per worker for each hour and calculate the cumulative sum:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">counts_per_sub_per_date = df.groupby(['hours','subs'])['count'].count().unstack()\ncounts_per_sub_per_date_cum = counts_per_sub_per_date.fillna(0).cumsum()<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"9\">\n<li>Create the bar chart race video:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">import bar_chart_race as bcr\n\nbcr.bar_chart_race(\n    df=counts_per_sub_per_date_cum,\n    filename=None,\n    orientation='h',\n    sort='desc',\n    #n_bars=len(counts_per_sub.columns),\n    fixed_order=False,\n    fixed_max=True,\n    steps_per_period=5,\n    interpolate_period=False,\n    label_bars=True,\n    bar_size=.95,\n    period_label={'x': .99, 'y': .25, 'ha': 'right', 'va': 'center'},\n    #period_fmt='%B %d, %Y',\n    period_summary_func=lambda v, r: {'x': .99, 'y': .18,\n                                      's': f'Total labels: {v.sum():,.0f}',\n                                      'ha': 'right', 'size': 8, 'family': 'Courier New'},\n    perpendicular_bar_func='median',\n    period_length=50,\n    figsize=(5, 3),\n    dpi=144,\n    cmap='dark12',\n    title='Who is going to be the 
top labeller?',\n    title_size='',\n    bar_label_size=7,\n    tick_label_size=7,\n    shared_fontdict={'family' : 'Helvetica', 'color' : '.1'},\n    scale='linear',\n    writer=None,\n    fig=None,\n    bar_kwargs={'alpha': .7},\n    filter_column_colors=False)  <\/code><\/pre>\n<\/p><\/div>\n<ol start=\"10\">\n<li>Download the bar chart race output and save it into an S3 bucket.<\/li>\n<li>Email this file to your workers.<\/li>\n<\/ol>\n<h2>Option 2: Automatic chart creation<\/h2>\n<p>Option 2 requires no manual intervention; the bar chart races are sent automatically to the workers at a fixed interval (such as every day or every week). We provide a completely serverless solution in which the computation runs on <a href=\"http:\/\/aws.amazon.com\/lambda\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lambda<\/a>. The advantage of this approach is that you don\u2019t need to deploy any computing infrastructure (such as the SageMaker notebook instance used in Option 1). The steps involved are as follows:<\/p>\n<ol>\n<li>A Lambda function is triggered at fixed time intervals and generates the bar chart race by replicating the steps highlighted in Option 1. 
External dependencies, such as ffmpeg, are installed as Lambda layers.<\/li>\n<li>The bar chart races are saved to Amazon S3.<\/li>\n<li>The updates to the video on Amazon S3 trigger a message sent to <a href=\"http:\/\/aws.amazon.com\/sns\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Notification Service<\/a> (Amazon SNS).<\/li>\n<li>Amazon SNS sends an email to subscribers.<\/li>\n<\/ol>\n<p>The following diagram illustrates this architecture.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/15\/ML-2967-image004.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-28165\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/15\/ML-2967-image004.jpg\" alt=\"\" width=\"738\" height=\"552\"><\/a><\/p>\n<p>The following is the code for the Lambda function:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-js\">import boto3\nimport json \nimport os\n\nimport numpy as np \nimport pandas as pd\n\nfrom matplotlib import pyplot as plt\nfrom matplotlib import animation\nimport bar_chart_race as bcr\n\n# point this to the path in your lambda layer\nplt.rcParams['animation.ffmpeg_path'] = '\/opt\/ffmpeg\/bin\/ffmpeg'\n\ns3_res = boto3.resource('s3')\n\n\nbucket_name = 'YourBucketHere'\nprefix = 'GTFolder\/annotations\/worker-response\/iteration-1\/'\n\ndef lambda_handler(event, context):\n    \n    print(os.environ)\n    print(os.getcwd())\n    print(os.listdir('\/opt\/'))\n    \n    paginator = boto3.client('s3').get_paginator('list_objects_v2')\n    pages = paginator.paginate(Bucket=bucket_name, Prefix=f'{prefix}')\n        \n    times = []\n    subs = []\n    for page in pages:\n        for work in page['Contents']:\n            content_object = s3_res.Object(bucket_name, work['Key'])\n            file_content = content_object.get()['Body'].read().decode('utf-8')\n            json_content = 
json.loads(file_content)\n            times.append(json_content['answers'][0]['submissionTime'])\n            subs.append(json_content['answers'][0]['workerMetadata']['identityData']['sub'])\n        \n    # this is where one would map back to the real names of the labelers, possibly\n    # using Cognito for sub -&gt; Name correspondence\n    \n    sub_map = { s: f'Name {i}' for i,s in enumerate(np.unique(subs))}\n    \n    df = pd.DataFrame({'times':times,'subs':subs})\n    df[\"subs\"] = df[\"subs\"].map(sub_map)\n    df['date'] = pd.to_datetime(df.times).dt.date\n    df['hours'] = pd.to_datetime(df.times).dt.strftime('%Y-%m-%d %H:30')\n    df['count']=1\n    \n    counts_per_sub_per_date = df.groupby(['hours','subs'])['count'].count().unstack()\n    counts_per_sub_per_date_cum = counts_per_sub_per_date.fillna(0).cumsum()\n    \n    bcr.bar_chart_race(df=counts_per_sub_per_date_cum.iloc[:100],\n        filename='\/tmp\/barchart.mp4',\n        orientation='h',\n        sort='desc',\n        #n_bars=len(counts_per_sub.columns),\n        fixed_order=False,\n        fixed_max=True,\n        steps_per_period=5,\n        interpolate_period=False,\n        label_bars=True,\n        bar_size=.95,\n        period_label={'x': .99, 'y': .25, 'ha': 'right', 'va': 'center'},\n        #period_fmt='%B %d, %Y',\n        period_summary_func=lambda v, r: {'x': .99, 'y': .18,\n                                          's': f'Total labels: {v.sum():,.0f}',\n                                          'ha': 'right', 'size': 8, 'family': 'Courier New'},\n        perpendicular_bar_func='median',\n        period_length=50,\n        figsize=(5, 3),\n        dpi=144,\n        cmap='dark12',\n        title='Who is going to be the top labeller?',\n        title_size='',\n        bar_label_size=7,\n        tick_label_size=7,\n        shared_fontdict={'family' : 'Helvetica', 'color' : '.1'},\n        scale='linear',\n        writer=None,\n        fig=None,\n        bar_kwargs={'alpha': .7},\n   
     filter_column_colors=False)  \n    \n    boto3.client('s3').upload_file('\/tmp\/barchart.mp4', bucket_name, 'barchart\/barchart.mp4')\n    \n    return {\n        'statusCode': 200,\n        'body': json.dumps('Hello from Lambda!')\n    }<\/code><\/pre>\n<\/p><\/div>\n<h2>Clean up<\/h2>\n<p>When you finish this exercise, remove your resources with the following steps:<\/p>\n<ol>\n<li>Delete your notebook instance.<\/li>\n<li>Stop your Ground Truth job.<\/li>\n<li>Optionally, delete the SageMaker execution role.<\/li>\n<li>Optionally, empty and delete the S3 bucket.<\/li>\n<\/ol>\n<h2>Conclusions<\/h2>\n<p>This post demonstrated how to use partial Ground Truth labeling data stored in Amazon S3 to gamify labeling workflows by periodically creating a bar chart race. In our experience, engaging workers with a bar chart race sparked friendly competition among them, sped up labeling, and increased user engagement and user satisfaction.<\/p>\n<p>Get started today! You can learn more about Ground Truth and kick off your own labeling and gamification processes by visiting the <a href=\"https:\/\/console.aws.amazon.com\/sagemaker\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker console<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/07\/danglos.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-27747 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/07\/danglos.jpg\" alt=\"\" width=\"100\" height=\"133\"><\/a><strong>Daniele Angelosante<\/strong> is a Senior Engagement Manager with AWS Professional Services. He is passionate about AI\/ML projects and products. 
In his free time, he likes coffee, sport, soccer, and baking.<\/p>\n<p><strong> <a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/08\/09\/Andrea-Simone.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-26947 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/08\/09\/Andrea-Simone.jpg\" alt=\"\" width=\"100\" height=\"133\"><\/a>Andrea Di Simone<\/strong> is a Data Scientist in the Professional Services team based in Munich, Germany. He helps customers to develop their AI\/ML products and workflows, leveraging AWS tools. He enjoys reading, classical music and hiking.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/01\/14\/Othmane-Hamzaoui.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-20706 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/01\/14\/Othmane-Hamzaoui.jpg\" alt=\"\" width=\"100\" height=\"133\"><\/a>Othmane Hamzaoui<\/strong>\u00a0is a Data Scientist working in the AWS Professional Services team. He is passionate about solving customer challenges using Machine Learning, with a focus on bridging the gap between research and business to achieve impactful outcomes. 
In his spare time, he enjoys running and discovering new coffee shops in the beautiful city of Paris.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/gamify-amazon-sagemaker-ground-truth-labeling-workflows-via-a-bar-chart-race\/<\/p>\n","protected":false},"author":0,"featured_media":1052,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1051"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1051"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1051\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1052"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1051"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1051"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1051"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}