{"id":1545,"date":"2022-02-10T21:47:29","date_gmt":"2022-02-10T21:47:29","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/02\/10\/reduce-costs-and-complexity-of-ml-preprocessing-with-amazon-s3-object-lambda\/"},"modified":"2022-02-10T21:47:29","modified_gmt":"2022-02-10T21:47:29","slug":"reduce-costs-and-complexity-of-ml-preprocessing-with-amazon-s3-object-lambda","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/02\/10\/reduce-costs-and-complexity-of-ml-preprocessing-with-amazon-s3-object-lambda\/","title":{"rendered":"Reduce costs and complexity of ML preprocessing with Amazon S3 Object Lambda"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Often, customers have objects in S3 buckets that need further processing to be used effectively by consuming applications. Data engineers must support these application-specific data views with trade-offs between persisting derived copies or transforming data at the consumer level. Neither solution is ideal because it introduces operational complexity, causes data consistency challenges, and wastes more expensive computing resources.<\/p>\n<p>These trade-offs broadly apply to many machine learning (ML) pipelines that train on unstructured data, such as audio, video, and free-form text, among other sources. In each example, the training job must download data from S3 buckets, prepare an application-specific view, and then use an AI algorithm. This post demonstrates a design pattern for reducing costs, complexity, and centrally managing this second step. 
It uses the concrete <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn\/\" target=\"_blank\" rel=\"noopener noreferrer\">example of image processing<\/a>, though the approach broadly applies to any workload. The economic benefits are most pronounced when the transformation step doesn\u2019t require a GPU but the AI algorithm does.<\/p>\n<p>The proposed solution also centralizes data transformation code and enables just-in-time (JIT) transformation. Furthermore, the approach uses a serverless infrastructure to reduce operational overhead and undifferentiated heavy lifting.<\/p>\n<h2>Solution overview<\/h2>\n<p>When ML algorithms process unstructured data like images and video, they require various normalization tasks (such as grey-scaling and resizing). This step accelerates model convergence, avoids overfitting, and improves prediction accuracy. You often perform these preprocessing steps on the instances that later run the AI training. That approach creates inefficiencies, because those resources typically have more expensive processors (for example, GPUs) than these tasks require. Instead, our solution externalizes those operations across economical, horizontally scalable <a href=\"https:\/\/aws.amazon.com\/s3\/features\/object-lambda\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon S3 Object Lambda<\/a> functions.<\/p>\n<p>This design pattern has three critical benefits. First, it centralizes shared data transformation steps, such as image normalization, and removes duplicated code across ML pipelines. Next, S3 Object Lambda functions avoid data consistency issues in derived data through JIT conversions. 
Third, the serverless infrastructure reduces operational overhead and access time, and limits costs to the milliseconds your code actually runs.<\/p>\n<p>An elegant solution exists in which you can centralize these data preprocessing and data conversion operations with S3 Object Lambda. S3 Object Lambda enables you to add code that modifies data from Amazon S3 before returning it to an application. The code runs within an <a href=\"https:\/\/aws.amazon.com\/lambda\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lambda<\/a> function, a serverless compute service. Lambda can instantly scale to tens of thousands of parallel runs while supporting dozens of programming languages and even <a href=\"https:\/\/docs.aws.amazon.com\/lambda\/latest\/dg\/images-create.html\" target=\"_blank\" rel=\"noopener noreferrer\">custom containers<\/a>. For more information, see <a href=\"https:\/\/aws.amazon.com\/blogs\/aws\/introducing-amazon-s3-object-lambda-use-your-code-to-process-data-as-it-is-being-retrieved-from-s3\/\" target=\"_blank\" rel=\"noopener noreferrer\">Introducing Amazon S3 Object Lambda \u2013 Use Your Code to Process Data as It Is Being Retrieved from S3<\/a>.<\/p>\n<p>The following diagram illustrates the solution architecture.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-32672\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/03\/ml_6508_fig1.png\" alt=\"Normalize datasets used to train machine learning model\" width=\"724\" height=\"376\"><\/p>\n<p>In this solution, you have an S3 bucket that contains the raw images to be processed. Next, you create an <a href=\"https:\/\/aws.amazon.com\/s3\/features\/access-points\/\" target=\"_blank\" rel=\"noopener noreferrer\">S3 Access Point<\/a> for these images. If you build multiple ML models, you can create a separate S3 Access Point for each model. 
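The wiring between the supporting access point, the Lambda function, and the Object Lambda access point can be sketched with the boto3 S3Control API. The following is a minimal sketch, not the post's own deployment code; the account ID, region, and resource names (raw-images-ap, image-normalizer) are hypothetical. It builds the Configuration block that create_access_point_for_object_lambda expects:

```python
def object_lambda_config(account_id, region, supporting_ap, function_name):
    """Build the Configuration block for
    s3control.create_access_point_for_object_lambda (names are illustrative)."""
    return {
        "SupportingAccessPoint": (
            f"arn:aws:s3:{region}:{account_id}:accesspoint/{supporting_ap}"
        ),
        "TransformationConfigurations": [{
            # Invoke the transformation function on GetObject requests
            "Actions": ["GetObject"],
            "ContentTransformation": {
                "AwsLambda": {
                    "FunctionArn": (
                        f"arn:aws:lambda:{region}:{account_id}"
                        f":function:{function_name}"
                    )
                }
            },
        }],
    }

# Hypothetical usage with boto3 (requires AWS credentials; not run here):
# import boto3
# s3control = boto3.client("s3control", region_name="us-west-2")
# s3control.create_access_point(AccountId="123456789012",
#                               Name="raw-images-ap", Bucket="raw-images")
# s3control.create_access_point_for_object_lambda(
#     AccountId="123456789012", Name="image-normalizer",
#     Configuration=object_lambda_config("123456789012", "us-west-2",
#                                        "raw-images-ap", "image-normalizer"))
```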
Alternatively, <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) <a href=\"https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/access-points-policies.html\" target=\"_blank\" rel=\"noopener noreferrer\">policies for access points<\/a> support sharing reusable functions across ML pipelines. Then you attach a Lambda function containing your preprocessing business logic to the S3 Access Point. When an application retrieves data through the access point, the function performs the JIT data transformation. Finally, you update your ML model to use the new <a href=\"https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/olap-create.html\" target=\"_blank\" rel=\"noopener noreferrer\">S3 Object Lambda Access Point<\/a> to retrieve data from Amazon S3.<\/p>\n<h2>Create the normalization access point<\/h2>\n<p>This section walks through the steps to create the S3 Object Lambda access point.<\/p>\n<p>Raw data is stored in an S3 bucket. To give users the right set of permissions to access this data, while avoiding complex bucket policies that can unexpectedly impact other applications, you create S3 Access Points. S3 Access Points are unique host names that you can use to reach S3 buckets. With S3 Access Points, you can create individual access control policies for each access point to control access to shared datasets easily and securely.<\/p>\n<ol>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/creating-access-points.html\" target=\"_blank\" rel=\"noopener noreferrer\">Create your access point<\/a>.<\/li>\n<li>Create a Lambda function that performs the image resizing and conversion. 
See the following Python code:<\/li>\n<\/ol>\n<pre><code class=\"lang-python\">import boto3\nimport cv2\nimport numpy as np\nimport requests\nimport io\n\ndef lambda_handler(event, context):\n    print(event)\n    object_get_context = event[\"getObjectContext\"]\n    request_route = object_get_context[\"outputRoute\"]\n    request_token = object_get_context[\"outputToken\"]\n    s3_url = object_get_context[\"inputS3Url\"]\n\n    # Get the original object from S3 through the presigned URL\n    response = requests.get(s3_url)\n    nparr = np.frombuffer(response.content, np.uint8)\n    img = cv2.imdecode(nparr, flags=1)\n\n    # Transform the object: resize and grey-scale\n    new_shape = (256, 256)\n    resized = cv2.resize(img, new_shape, interpolation=cv2.INTER_AREA)\n    gray_scaled = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)\n\n    # Encode the transformed image back to JPG\n    is_success, buffer = cv2.imencode(\".jpg\", gray_scaled)\n    if not is_success:\n        raise ValueError('Unable to imencode()')\n\n    transformed_object = io.BytesIO(buffer).getvalue()\n\n    # Write the transformed object back to S3 Object Lambda\n    s3 = boto3.client('s3')\n    s3.write_get_object_response(\n        Body=transformed_object,\n        RequestRoute=request_route,\n        RequestToken=request_token)\n\n    return {'status_code': 200}\n<\/code><\/pre>\n<ol start=\"3\">\n<li>Create an <a href=\"https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/olap-create.html\" target=\"_blank\" rel=\"noopener noreferrer\">Object Lambda access point<\/a> using the supporting access point from Step 1.<\/li>\n<\/ol>\n<p>The Lambda function uses the supporting access point to download the original objects.<\/p>\n<ol start=\"4\">\n<li>Update <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> to use the new S3 Object Lambda access point to retrieve data from Amazon S3. 
See the following bash code:<\/li>\n<\/ol>\n<pre><code class=\"lang-bash\">aws s3api get-object --bucket arn:aws:s3-object-lambda:us-west-2:12345678901:accesspoint\/image-normalizer --key images\/test.png test.png<\/code><\/pre>\n<h2>Cost savings analysis<\/h2>\n<p>Traditionally, ML pipelines copy images and other files from Amazon S3 to SageMaker instances and then perform normalization. However, performing these transformations on training instances is inefficient. In contrast, Lambda functions horizontally scale to handle the burst and then elastically shrink, <a href=\"https:\/\/aws.amazon.com\/about-aws\/whats-new\/2020\/12\/aws-lambda-changes-duration-billing-granularity-from-100ms-to-1ms\/\" target=\"_blank\" rel=\"noopener noreferrer\">charging per millisecond<\/a> only while the code is running. Many preprocessing steps don\u2019t require GPUs and can even run on ARM64. That creates an incentive to move this processing to more economical compute such as <a href=\"https:\/\/aws.amazon.com\/blogs\/aws\/aws-lambda-functions-powered-by-aws-graviton2-processor-run-your-functions-on-arm-and-get-up-to-34-better-price-performance\/\" target=\"_blank\" rel=\"noopener noreferrer\">Lambda functions powered by AWS Graviton2 processors<\/a>.<\/p>\n<p>Using an example from the <a href=\"https:\/\/calculator.aws\/#\/createCalculator\/Lambda\" target=\"_blank\" rel=\"noopener noreferrer\">Lambda pricing calculator<\/a>, you can configure the function with 256 MB of memory and compare the costs for both x86 and Graviton2 (ARM64). We chose this size because it\u2019s sufficient for many single-image data preparation tasks. Next, use the <a href=\"https:\/\/calculator.aws\/#\/createCalculator\/SageMaker\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker pricing calculator<\/a> to compute expenses for an ml.p2.xlarge instance. This is the smallest supported SageMaker training instance with GPU support. 
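As a quick sanity check on that comparison, the per-hour rates quoted in this post imply savings on the order of 90%; this sketch just does the arithmetic (prices vary by region, and Lambda bills per millisecond only while the function runs, so real savings are larger):

```python
# Per-hour prices (USD) from this post's pricing-calculator example
LAMBDA_X86_PER_HOUR = 0.061   # 256 MB function, x86, busy for a full hour
LAMBDA_ARM_PER_HOUR = 0.049   # 256 MB function, Graviton2 (ARM64)
SAGEMAKER_P2_PER_HOUR = 0.90  # ml.p2.xlarge training instance

# Fractional savings from shifting non-GPU preprocessing to Lambda
savings_x86 = 1 - LAMBDA_X86_PER_HOUR / SAGEMAKER_P2_PER_HOUR
savings_arm = 1 - LAMBDA_ARM_PER_HOUR / SAGEMAKER_P2_PER_HOUR
print(f"x86 savings: {savings_x86:.0%}")        # roughly 93%
print(f"Graviton2 savings: {savings_arm:.0%}")  # roughly 95%
```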
These results show up to 90% compute savings for operations that don\u2019t use GPUs and can shift to Lambda. The following table summarizes these findings.<\/p>\n<p>\u00a0<\/p>\n<table border=\"1px\" cellpadding=\"10px\">\n<tbody>\n<tr>\n<td><\/td>\n<td><span>Lambda with x86<\/span><\/td>\n<td><span>Lambda with Graviton2 (ARM)<\/span><\/td>\n<td><span>SageMaker ml.p2.xlarge<\/span><\/td>\n<\/tr>\n<tr>\n<td>Memory (GB)<\/td>\n<td>0.25<\/td>\n<td>0.25<\/td>\n<td>61<\/td>\n<\/tr>\n<tr>\n<td>CPU<\/td>\n<td>\u00a0\u2014<\/td>\n<td>\u00a0\u2014<\/td>\n<td>4<\/td>\n<\/tr>\n<tr>\n<td>GPU<\/td>\n<td>\u00a0\u2014<\/td>\n<td>\u00a0\u2014<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>Cost\/hour<\/td>\n<td>$0.061<\/td>\n<td>$0.049<\/td>\n<td>$0.90<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Conclusion<\/h2>\n<p>You can build modern applications to unlock insights into your data. These different applications have unique data view requirements, such as formatting and preprocessing actions. Addressing these use cases can result in data duplication, increased costs, and more complexity in maintaining consistency. This post offers a solution for efficiently handling these situations using S3 Object Lambda functions.<\/p>\n<p>Not only does this remove the need for duplication, but it also forms a path to scaling these actions horizontally across less expensive compute! 
Even optimizing the transformation code for the ml.p2.xlarge instance would still be significantly more costly because of the idle GPUs.<\/p>\n<p>For more ideas on using serverless and ML, see <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/machine-learning-inference-at-scale-using-aws-serverless\/\" target=\"_blank\" rel=\"noopener noreferrer\">Machine learning inference at scale using AWS serverless<\/a> and <a href=\"https:\/\/aws.amazon.com\/blogs\/compute\/deploying-machine-learning-models-with-serverless-templates\/\" target=\"_blank\" rel=\"noopener noreferrer\">Deploying machine learning models with serverless templates<\/a>.<\/p>\n<hr>\n<h3><strong>About the Authors<\/strong><\/h3>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/03\/Nate-Bachmeir.png\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-32677 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/03\/Nate-Bachmeir.png\" alt=\"\" width=\"100\" height=\"133\"><\/a><strong>Nate Bachmeier<\/strong> is an AWS Senior Solutions Architect who nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing customers\u2019 workloads. Besides this, Nate is a full-time student and has two kids.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/03\/Marvin-Fernandes.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-32678 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/03\/Marvin-Fernandes.jpg\" alt=\"\" width=\"100\" height=\"96\"><\/a><strong>Marvin Fernandes<\/strong> is a Solutions Architect at AWS, based in the New York City area. He has over 20 years of experience building and running financial services applications. 
He is currently working with large enterprise customers to solve complex business problems by crafting scalable, flexible, and resilient cloud architectures.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/reduce-costs-and-complexity-of-ml-preprocessing-with-amazon-s3-object-lambda\/<\/p>\n","protected":false},"author":0,"featured_media":1546,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1545"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1545"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1545\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1546"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1545"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1545"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1545"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}