{"id":1219,"date":"2021-11-18T08:32:51","date_gmt":"2021-11-18T08:32:51","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/18\/use-amazon-sagemaker-ack-operators-to-train-and-deploy-machine-learning-models\/"},"modified":"2021-11-18T08:32:51","modified_gmt":"2021-11-18T08:32:51","slug":"use-amazon-sagemaker-ack-operators-to-train-and-deploy-machine-learning-models","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/11\/18\/use-amazon-sagemaker-ack-operators-to-train-and-deploy-machine-learning-models\/","title":{"rendered":"Use Amazon SageMaker ACK Operators to train and deploy machine learning models"},"content":{"rendered":"<div id=\"\">\n<p>AWS recently released the new Amazon SageMaker Operators for Kubernetes using the <a href=\"https:\/\/aws-controllers-k8s.github.io\/community\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Controllers for Kubernetes (ACK)<\/a>. ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources like databases or message queues simply by using the Kubernetes API. The new SageMaker ACK Operators make it easier for machine learning (ML) developers and data scientists who use Kubernetes as their control plane to train, tune, and deploy ML models in <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> without signing in to the SageMaker console.<\/p>\n<h2>Kubernetes and SageMaker<\/h2>\n<p>Building scalable ML workflows involves many iterative steps, including sourcing and preparing data, building ML models, training and evaluating these models, deploying them to production, and monitoring workloads after deployment.<\/p>\n<p>SageMaker is a fully managed service designed and optimized specifically for managing these ML workflows. 
It removes the undifferentiated heavy lifting of infrastructure management and eliminates the need to invest in IT and DevOps to manage clusters for ML model building, training, and inference. Compute resources are only provisioned when requested, scaled as needed, and shut down automatically when jobs complete, thereby providing near 100% utilization. SageMaker provides many performance and cost optimizations for distributed training, spot training, automatic model tuning, inference latency, and multi-model endpoints.<\/p>\n<p>Many AWS customers who have portability requirements, implement a hybrid cloud approach, or run on premises use Kubernetes, an open-source, general-purpose container orchestration system, to set up repeatable ML pipelines for training and inference workloads. However, to support ML workloads, these developers still need to write custom code to optimize the underlying ML infrastructure, provide high availability and reliability, provide data science productivity tools, and comply with appropriate security and regulatory requirements. Kubernetes customers therefore want to use fully managed ML services such as SageMaker for cost-optimized and managed infrastructure, but want platform and infrastructure teams to continue using Kubernetes for orchestration and pipeline management to retain standardization and portability.<\/p>\n<p>To address this need, AWS allows you to train, tune, and deploy models in SageMaker by using the new SageMaker ACK Operators, which include a set of custom resource definitions for SageMaker resources that extend the Kubernetes API. 
With the SageMaker ACK Operators, you can take advantage of fully managed SageMaker infrastructure, tools, and optimizations natively from Kubernetes.<\/p>\n<h2>How did we get here?<\/h2>\n<p>In late 2019, AWS introduced the <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-operator-for-k8s\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker Operators for Kubernetes<\/a> to enable developers and data scientists to manage the end-to-end SageMaker training and production lifecycle using Kubernetes as the control plane. SageMaker operators were installed from the <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-operator-for-k8s\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a> by downloading a YAML configuration file that configured your Kubernetes cluster with the custom resource definitions and operator controller service.<\/p>\n<p>In 2020, AWS introduced ACK to facilitate a Kubernetes-native way of managing AWS Cloud resources. ACK includes a common controller runtime, a code generator, and a set of AWS service-specific controllers, one of which is the <a href=\"https:\/\/github.com\/aws-controllers-k8s\/sagemaker-controller\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker controller<\/a>.<\/p>\n<p>Going forward, new functionality will be added to the SageMaker Operators for Kubernetes through the ACK project.<\/p>\n<h2>How does ACK work?<\/h2>\n<p>The following diagram illustrates how ACK works.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-27999\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/09\/13\/4371-Architecture.jpg\" alt=\"\" width=\"801\" height=\"474\"><\/p>\n<p>In this example, Alice is a Kubernetes user. She wants to run model training on SageMaker from within the Kubernetes cluster using the Kubernetes API. 
Alice issues a call to <code>kubectl apply<\/code>, passing in a file that defines a Kubernetes <a href=\"https:\/\/kubernetes.io\/docs\/concepts\/extend-kubernetes\/api-extension\/custom-resources\/\" target=\"_blank\" rel=\"noopener noreferrer\">custom resource<\/a> describing her SageMaker training job. <code>kubectl apply<\/code> passes this file, called a <a href=\"https:\/\/kubernetes.io\/docs\/reference\/glossary\/?all=true#term-manifest\" target=\"_blank\" rel=\"noopener noreferrer\">manifest<\/a>, to the Kubernetes API server running in the Kubernetes controller node (Step 1 in the workflow diagram).<\/p>\n<p>The Kubernetes API server receives the manifest with the SageMaker training job specification and determines whether Alice <a href=\"https:\/\/aws-controllers-k8s.github.io\/community\/docs\/user-docs\/authorization\/\" target=\"_blank\" rel=\"noopener noreferrer\">has permissions<\/a> to create a custom resource of <a href=\"https:\/\/kubernetes.io\/docs\/reference\/using-api\/api-concepts\/#standard-api-terminology\" target=\"_blank\" rel=\"noopener noreferrer\">kind<\/a> <code>sagemaker.services.k8s.aws\/TrainingJob<\/code>, and whether the custom resource is properly formatted (Step 2).<\/p>\n<p>If Alice is authorized and the custom resource is valid, the Kubernetes API server writes (Step 3) the custom resource to its <code>etcd<\/code> data store and then responds (Step 4) to Alice that the custom resource has been created.<\/p>\n<p>The SageMaker controller, which is running on a Kubernetes worker node within the context of a normal Kubernetes <a href=\"https:\/\/kubernetes.io\/docs\/concepts\/workloads\/pods\/\" target=\"_blank\" rel=\"noopener noreferrer\">Pod<\/a>, is notified (Step 5) that a new custom resource of kind <code>sagemaker.services.k8s.aws\/TrainingJob<\/code> has been created.<\/p>\n<p>The SageMaker controller then communicates (Step 6) with the SageMaker API, calling the SageMaker CreateTrainingJob API to create the 
training job in AWS. After communicating with the SageMaker API, the SageMaker controller calls the Kubernetes API server to update (Step 7) the custom resource\u2019s status with information it received from SageMaker. The SageMaker controller therefore provides the same information to the developers that they would have received using the AWS SDK. This results in a better and consistent developer experience.<\/p>\n<h2>Machine learning use case<\/h2>\n<p>For this post, we follow the SageMaker example provided in the following <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/blob\/master\/introduction_to_amazon_algorithms\/xgboost_abalone\/xgboost_abalone.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">notebook<\/a>. However, you can reuse the components in this example with your preference of SageMaker built-in or custom algorithms and your own datasets.<\/p>\n<p>We use the <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/abalone\" target=\"_blank\" rel=\"noopener noreferrer\">Abalone dataset<\/a> originally from the UCI data repository [1]. In the libsvm converted version, the nominal feature (male\/female\/infant) has been converted into a real valued feature. The age of abalone is to be predicted from eight physical measurements. This dataset is already processed and stored in <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3). We train an XGBoost model on the UCI Abalone dataset to replicate the flow in the example Jupyter notebook.<\/p>\n<h2>Prerequisites<\/h2>\n<p>For this walkthrough, you should have the following prerequisites:<\/p>\n<p>An existing <a href=\"https:\/\/aws.amazon.com\/eks\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Kubernetes Service<\/a> (Amazon EKS) cluster. It should be Kubernetes version 1.16+. 
<span role=\"note\">For automated cluster creation using <\/span><code><span role=\"note\">eksctl<\/span><\/code><span role=\"note\">, see <\/span><a href=\"https:\/\/docs.aws.amazon.com\/eks\/latest\/userguide\/getting-started-eksctl.html\" target=\"_blank\" rel=\"noopener noreferrer\"><span role=\"note\">Getting started with Amazon EKS \u2013 <code>eksctl<\/code><\/span><\/a><span role=\"note\"> and create your cluster with Amazon EC2 Linux managed nodes.<\/span><\/p>\n<p>Install the following tools on the client machine used to access your Kubernetes cluster (you can use <a href=\"https:\/\/docs.aws.amazon.com\/cloud9\/latest\/user-guide\/setting-up.html\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Cloud9<\/a>, a cloud-based integrated development environment (IDE) for the Kubernetes cluster setup):<\/p>\n<ul>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/eks\/latest\/userguide\/install-kubectl.html\" target=\"_blank\" rel=\"noopener noreferrer\">kubectl<\/a> \u2013 A command line tool for working with Kubernetes clusters.<\/li>\n<li><a href=\"https:\/\/helm.sh\/docs\/intro\/install\/\" target=\"_blank\" rel=\"noopener noreferrer\">Helm<\/a> version 3.7+ \u2013 A tool for installing and managing Kubernetes applications.<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/cli\/latest\/userguide\/install-cliv1.html\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Command Line Interface<\/a> (AWS CLI) \u2013 A command line tool for interacting with AWS services.<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/eks\/latest\/userguide\/eksctl.html\" target=\"_blank\" rel=\"noopener noreferrer\">eksctl<\/a> \u2013 A command line tool for working with Amazon EKS clusters that automates many individual tasks.<\/li>\n<li><a href=\"https:\/\/mikefarah.gitbook.io\/yq\" target=\"_blank\" rel=\"noopener noreferrer\">yq<\/a> \u2013 A command line YAML processor. 
(For Linux environments, use the\u00a0<a href=\"https:\/\/github.com\/mikefarah\/yq\/#wget\" target=\"_blank\" rel=\"noopener noreferrer\">wget\u00a0plain binary installation<\/a>).<\/li>\n<\/ul>\n<h2>Set up IAM role-based authentication for the controller Pod<\/h2>\n<p>IAM roles for service accounts (IRSA) allows fine-grained roles at the Kubernetes Pod level by combining an OpenID Connect (OIDC) identity provider with Kubernetes service account annotations. In this section, we associate the Amazon EKS cluster with an OIDC provider and create an <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) role that is assumed by the ACK controller Pod via its service account to access AWS services.<\/p>\n<h3>Create a cluster and OIDC ID provider<\/h3>\n<p>Make sure you\u2019re connected to the right cluster. Substitute the values for <code>CLUSTER_NAME<\/code> and <code>CLUSTER_REGION<\/code> below:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\"># Copyright Amazon.com, Inc. or its affiliates. 
All Rights Reserved.\n# SPDX-License-Identifier: MIT-0\n\n# Set the cluster name and the Region where the cluster exists\nexport CLUSTER_NAME=&lt;CLUSTER_NAME&gt;\nexport CLUSTER_REGION=&lt;CLUSTER_REGION&gt;\nexport RANDOM_VAR=$RANDOM\n\naws eks update-kubeconfig --name $CLUSTER_NAME --region $CLUSTER_REGION\nkubectl config get-contexts \n\n# Ensure cluster has compute\nkubectl get nodes<\/code><\/pre>\n<\/p><\/div>\n<p>Set up the OIDC ID provider (IdP) in AWS and associate it with your Amazon EKS cluster:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \\\n--region ${CLUSTER_REGION} --approve\n<\/code><\/pre>\n<\/p><\/div>\n<p>Get the identity issuer URL by running the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query \"Account\" --output text)\nOIDC_PROVIDER_URL=$(aws eks describe-cluster --name $CLUSTER_NAME --region $CLUSTER_REGION --query \"cluster.identity.oidc.issuer\" --output text | cut -c9-)\n<\/code><\/pre>\n<\/p><\/div>\n<h3>Set up an IAM role<\/h3>\n<p>Next, let\u2019s set up the IAM role that defines the access to the SageMaker and Application Auto Scaling services. 
For this, we also need to have an IAM trust policy in place, allowing the specified Kubernetes service account (for example, <code>ack-sagemaker-controller<\/code>) to assume the IAM role.<\/p>\n<p>Create a file named <code>trust.json<\/code> and insert the following trust relationship code block required for IAM role:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">printf '{\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [\n    {\n      \"Effect\": \"Allow\",\n      \"Principal\": {\n        \"Federated\": \"arn:aws:iam::'$AWS_ACCOUNT_ID':oidc-provider\/'$OIDC_PROVIDER_URL'\"\n      },\n      \"Action\": \"sts:AssumeRoleWithWebIdentity\",\n      \"Condition\": {\n        \"StringEquals\": {\n          \"'$OIDC_PROVIDER_URL':aud\": \"sts.amazonaws.com\",\n          \"'$OIDC_PROVIDER_URL':sub\": [\n            \"system:serviceaccount:ack-system:ack-sagemaker-controller\",\n            \"system:serviceaccount:ack-system:ack-applicationautoscaling-controller\"\n          ]\n        }\n      }\n    }\n  ]\n}\n' &gt; .\/trust.json<\/code><\/pre>\n<\/p><\/div>\n<p>Updating an Application Auto Scaling Scalable Target requires additional permissions. 
First, create a service-linked role for Application Auto Scaling.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">aws iam create-service-linked-role --aws-service-name sagemaker.application-autoscaling.amazonaws.com<\/code><\/pre>\n<\/p><\/div>\n<p>Create a file named\u00a0<code>pass_role_policy.json<\/code>\u00a0to create the policy required for the IAM role.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">printf '{\n  \"Version\": \"2012-10-17\",\n  \"Statement\": [\n    {\n      \"Effect\": \"Allow\",\n      \"Action\": \"iam:PassRole\",\n      \"Resource\": \"arn:aws:iam::'$AWS_ACCOUNT_ID':role\/aws-service-role\/sagemaker.application-autoscaling.amazonaws.com\/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint\"\n    }\n  ]\n}\n' &gt; .\/pass_role_policy.json<\/code><\/pre>\n<\/p><\/div>\n<p>Run the following command to create a role with the trust relationship defined in <code>trust.json<\/code>. This trust relationship is required so that Amazon EKS (via a webhook) can inject the necessary environment variables and mount volumes into the Pod that are required by the AWS SDK to assume this role.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">OIDC_ROLE_NAME=ack-controller-role-$CLUSTER_NAME\n\naws iam create-role --role-name $OIDC_ROLE_NAME --assume-role-policy-document file:\/\/trust.json\n\n# Attach the AmazonSageMakerFullAccess Policy to the Role. This policy provides full access to \n# Amazon SageMaker. 
Also provides select access to related services (e.g., Application Autoscaling,\n# S3, ECR, CloudWatch Logs).\naws iam attach-role-policy --role-name $OIDC_ROLE_NAME --policy-arn arn:aws:iam::aws:policy\/AmazonSageMakerFullAccess\n\n# Attach the iam:PassRole policy required for updating ApplicationAutoscaling ScalableTarget\naws iam put-role-policy --role-name $OIDC_ROLE_NAME --policy-name \"iam-pass-role-policy\" --policy-document file:\/\/pass_role_policy.json\n\nexport IAM_ROLE_ARN_FOR_IRSA=$(aws iam get-role --role-name $OIDC_ROLE_NAME --output text --query 'Role.Arn')\necho $IAM_ROLE_ARN_FOR_IRSA<\/code><\/pre>\n<\/p><\/div>\n<h2>Install SageMaker and Application Auto Scaling controllers<\/h2>\n<p>Choose an AWS Region for the SageMaker and automatic scaling resources we create in this post. For convenience, we recommend using <code>us-east-1<\/code>:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">export SERVICE_REGION=\"us-east-1\"\n# Namespace for controller\nexport ACK_K8S_NAMESPACE=\"ack-system\"\n<\/code><\/pre>\n<\/p><\/div>\n<p>Now, let\u2019s install the SageMaker and Application Auto Scaling controller using the following helper script. This script pulls the helm charts from ACK\u2019s public <a href=\"http:\/\/aws.amazon.com\/ecr\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Elastic Container Registry<\/a> (Amazon ECR) repository and configures the values of the AWS account, default Region for resources to be created, and IAM role (created in previous step) in the service account to be used by the controller Pod to assume the role. 
Create a file named <code>install-controllers.sh<\/code> and insert the following code block:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">#!\/usr\/bin\/env bash\n\n# Deploy ACK Helm Charts\nexport HELM_EXPERIMENTAL_OCI=1\nexport ACK_K8S_NAMESPACE=${ACK_K8S_NAMESPACE:-\"ack-system\"}\n\nfunction install_ack_controller() {\n    local service=\"$1\"\n    local release_version=\"$2\"\n    local chart_export_path=\/tmp\/chart\n    local chart_ref=$service-chart\n    local chart_repo=public.ecr.aws\/aws-controllers-k8s\/$chart_ref\n    local chart_package=$chart_ref-$release_version.tgz\n    \n    # Download helm chart\n    mkdir -p $chart_export_path\n    helm pull oci:\/\/\"$chart_repo\" --version \"$release_version\" -d $chart_export_path\n    tar xvf \"$chart_export_path\"\/\"$chart_package\" -C \"$chart_export_path\"\n\n    # Update the values in helm chart\n    pushd $chart_export_path\/$service-chart\n        yq e '.aws.region = env(SERVICE_REGION)' -i values.yaml \n        yq e '.serviceAccount.annotations.\"eks.amazonaws.com\/role-arn\" = env(IAM_ROLE_ARN_FOR_IRSA)' -i values.yaml\n    popd\n\n    # Create a namespace and install the helm chart\n    helm install -n $ACK_K8S_NAMESPACE --create-namespace ack-$service-controller $chart_export_path\/$service-chart\n}\n\ninstall_ack_controller \"sagemaker\" \"v0.3.0\"\ninstall_ack_controller \"applicationautoscaling\" \"v0.2.0\"<\/code><\/pre>\n<\/p><\/div>\n<p>Run the script:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">chmod +x install-controllers.sh\n.\/install-controllers.sh\n<\/code><\/pre>\n<\/p><\/div>\n<p>The output contains the following:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">Pulled: public.ecr.aws\/aws-controllers-k8s\/sagemaker-chart:v0.3.0\n...\n\nNAME: ack-sagemaker-controller\nLAST DEPLOYED: Tue Nov 16 01:53:34 2021\nNAMESPACE: ack-system\nSTATUS: deployed\nREVISION: 1\nTEST SUITE: None\nPulled: 
public.ecr.aws\/aws-controllers-k8s\/applicationautoscaling-chart:v0.2.0\n...\n\nNAME: ack-applicationautoscaling-controller\nLAST DEPLOYED: Tue Nov 16 01:53:35 2021\nNAMESPACE: ack-system\nSTATUS: deployed\nREVISION: 1\nTEST SUITE: None<\/code><\/pre>\n<\/p><\/div>\n<p>Next, we run the following command to verify that the custom resource definitions were applied:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">kubectl get crds | grep \"services.k8s.aws\"<\/code><\/pre>\n<\/p><\/div>\n<p>The output should contain a number of custom resource definitions related to SageMaker (such as <code>trainingjobs<\/code> or <code>endpoints<\/code>) and Application Auto Scaling (such as <code>scalingpolicies<\/code> and <code>scalabletargets<\/code>). Then we check that the controller Pods are running:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\"># Get pods in controller namespace\nkubectl get pods -n $ACK_K8S_NAMESPACE\n<\/code><\/pre>\n<\/p><\/div>\n<p>We see one controller Pod per service running in the <code>ack-system<\/code> namespace:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">NAME                                                     READY   STATUS    RESTARTS   AGE\nack-applicationautoscaling-controller-7479dc78dd-ts9ng   1\/1     Running   0          4m52s\nack-sagemaker-controller-788858fc98-6fgr6                1\/1     Running   0          4m56s\n<\/code><\/pre>\n<\/p><\/div>\n<h2>Prepare SageMaker resources<\/h2>\n<p>Next, we create an S3 bucket and IAM role for SageMaker.<\/p>\n<p>To train a model with SageMaker, we need an S3 bucket to store the dataset and artifacts from the training process. 
We use a copy of the preprocessed dataset at <code>s3:\/\/sagemaker-sample-files\/datasets\/tabular\/uci_abalone<\/code>[1].<\/p>\n<p>Let\u2019s create a variable for the S3 bucket:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">export SAGEMAKER_BUCKET=ack-sagemaker-bucket-$RANDOM_VAR<\/code><\/pre>\n<\/p><\/div>\n<p>Create a file named <code>create-bucket.sh<\/code> and insert the following code block:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">printf '#!\/usr\/bin\/env bash\n# create bucket\nif [[ \"$SERVICE_REGION\" != \"us-east-1\" ]]; then\n  aws s3api create-bucket --bucket \"$SAGEMAKER_BUCKET\" --region \"$SERVICE_REGION\" --create-bucket-configuration LocationConstraint=\"$SERVICE_REGION\"\nelse\n  aws s3api create-bucket --bucket \"$SAGEMAKER_BUCKET\" --region \"$SERVICE_REGION\"\nfi\n# sync dataset\naws s3 sync s3:\/\/sagemaker-sample-files\/datasets\/tabular\/uci_abalone\/train s3:\/\/\"$SAGEMAKER_BUCKET\"\/datasets\/tabular\/uci_abalone\/train\naws s3 sync s3:\/\/sagemaker-sample-files\/datasets\/tabular\/uci_abalone\/validation s3:\/\/\"$SAGEMAKER_BUCKET\"\/datasets\/tabular\/uci_abalone\/validation\n' &gt; .\/create-bucket.sh\n<\/code><\/pre>\n<\/p><\/div>\n<p>Run the script to create the S3 bucket and copy the dataset:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-bash\">chmod +x create-bucket.sh\n.\/create-bucket.sh\n<\/code><\/pre>\n<\/p><\/div>\n<p>The SageMaker training job that we run later in the post needs an IAM role to access Amazon S3 and SageMaker. 
Run the following commands to create a SageMaker execution IAM role that is used by SageMaker to access AWS resources:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">export SAGEMAKER_EXECUTION_ROLE_NAME=ack-sagemaker-execution-role-$RANDOM_VAR\n\n# Single quotes preserve the inner double quotes so the trust policy stays valid JSON\nTRUST='{ \"Version\": \"2012-10-17\", \"Statement\": [ { \"Effect\": \"Allow\", \"Principal\": { \"Service\": \"sagemaker.amazonaws.com\" }, \"Action\": \"sts:AssumeRole\" } ] }'\naws iam create-role --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --assume-role-policy-document \"$TRUST\"\naws iam attach-role-policy --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --policy-arn arn:aws:iam::aws:policy\/AmazonSageMakerFullAccess\naws iam attach-role-policy --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --policy-arn arn:aws:iam::aws:policy\/AmazonS3FullAccess\n\nSAGEMAKER_EXECUTION_ROLE_ARN=$(aws iam get-role --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --output text --query 'Role.Arn')\n\necho $SAGEMAKER_EXECUTION_ROLE_ARN\n<\/code><\/pre>\n<\/p><\/div>\n<p>Note the execution role ARN to use in later steps.<\/p>\n<h2>Train an XGBoost model<\/h2>\n<p>Next, we define the parameters for a SageMaker training job. SageMaker training jobs enable remote training of ML models. You can customize each training job to run your own ML scripts with custom architectures, data loaders, hyperparameters, and more. To submit a SageMaker training job, we require a job name. Let\u2019s create that variable first:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">export JOB_NAME=ack-xgboost-training-job-$RANDOM_VAR<\/code><\/pre>\n<\/p><\/div>\n<p>In the following code, we create a <code>training.yaml<\/code> file that contains the hyperparameters for the training job as well as the location of the training and validation data. 
It\u2019s also where we specify the Amazon ECR image used for training.<\/p>\n<p><strong>Note<\/strong>: <strong>If your <code>$SERVICE_REGION<\/code> isn\u2019t <code>us-east-1<\/code>, change the following image URI. For the XGBoost algorithm version 1.2-1 Region-specific image URI, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sagemaker-algo-docker-registry-paths.html\" target=\"_blank\" rel=\"noopener noreferrer\">Docker Registry Paths and Example Code<\/a>.<\/strong><\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">export XGBOOST_IMAGE=683313688378.dkr.ecr.us-east-1.amazonaws.com\/sagemaker-xgboost:1.2-1<\/code><\/pre>\n<\/p><\/div>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">\nprintf '\napiVersion: sagemaker.services.k8s.aws\/v1alpha1\nkind: TrainingJob\nmetadata:\n  name: '$JOB_NAME'\nspec:\n  # Name that will appear in SageMaker console\n  trainingJobName: '$JOB_NAME'\n  hyperParameters: \n    max_depth: \"5\"\n    gamma: \"4\"\n    eta: \"0.2\"\n    min_child_weight: \"6\"\n    subsample: \"0.7\"\n    objective: \"reg:linear\"\n    num_round: \"50\"\n    verbosity: \"2\"\n  algorithmSpecification:\n    trainingImage: '$XGBOOST_IMAGE'\n    trainingInputMode: File\n  roleARN: '$SAGEMAKER_EXECUTION_ROLE_ARN'\n  outputDataConfig:\n    # The output path of our model\n    s3OutputPath: s3:\/\/'$SAGEMAKER_BUCKET'\n  resourceConfig:\n    instanceCount: 1\n    instanceType: ml.m4.xlarge\n    volumeSizeInGB: 5\n  stoppingCondition:\n    maxRuntimeInSeconds: 3600\n  inputDataConfig:\n    - channelName: train\n      dataSource:\n        s3DataSource:\n          s3DataType: S3Prefix\n          # The input path of our train data \n          s3URI: s3:\/\/'$SAGEMAKER_BUCKET'\/datasets\/tabular\/uci_abalone\/train\/abalone.train\n          s3DataDistributionType: FullyReplicated\n      contentType: text\/libsvm\n      compressionType: None\n    - channelName: validation\n      dataSource:\n        
s3DataSource:\n          s3DataType: S3Prefix\n          # The input path of our validation data \n          s3URI: s3:\/\/'$SAGEMAKER_BUCKET'\/datasets\/tabular\/uci_abalone\/validation\/abalone.validation\n          s3DataDistributionType: FullyReplicated\n      contentType: text\/libsvm\n      compressionType: None \n' &gt; .\/training.yaml\n<\/code><\/pre>\n<\/p><\/div>\n<p>Now, we can create the training job:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">kubectl apply -f training.yaml<\/code><\/pre>\n<\/p><\/div>\n<p>You should see the following output:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">trainingjob.sagemaker.services.k8s.aws\/ack-xgboost-training-job-7420 created<\/code><\/pre>\n<\/p><\/div>\n<p>You can watch the status of the training job. It takes a few minutes for <code>STATUS<\/code> to show as <code>Completed<\/code>.<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">kubectl get trainingjob.sagemaker --watch\nNAME                            SECONDARYSTATUS   STATUS\nack-xgboost-training-job-7420   Starting          InProgress\nack-xgboost-training-job-7420   Downloading       InProgress\nack-xgboost-training-job-7420   Training          InProgress\nack-xgboost-training-job-7420   Completed         Completed\n<\/code><\/pre>\n<\/p><\/div>\n<h2>Deploy the results of the SageMaker training job<\/h2>\n<p>To deploy the model, we need to specify a model name, an endpoint config name, and an endpoint name:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">export MODEL_NAME=ack-xgboost-model-$RANDOM_VAR\nexport ENDPOINT_CONFIG_NAME=ack-xgboost-endpoint-config-$RANDOM_VAR\nexport ENDPOINT_NAME=ack-xgboost-endpoint-$RANDOM_VAR\n<\/code><\/pre>\n<\/p><\/div>\n<p>We deploy this model on a c5.large instance type. 
In the following .yaml file, we define the model, the endpoint config, and the endpoint:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">printf '\napiVersion: sagemaker.services.k8s.aws\/v1alpha1\nkind: Model\nmetadata:\n  name: '$MODEL_NAME'\nspec:\n  modelName: '$MODEL_NAME'\n  primaryContainer:\n    containerHostname: xgboost\n    # The source of the model data\n    modelDataURL: s3:\/\/'$SAGEMAKER_BUCKET'\/'$JOB_NAME'\/output\/model.tar.gz\n    image: '$XGBOOST_IMAGE'\n  executionRoleARN: '$SAGEMAKER_EXECUTION_ROLE_ARN'\n---\napiVersion: sagemaker.services.k8s.aws\/v1alpha1\nkind: EndpointConfig\nmetadata:\n  name: '$ENDPOINT_CONFIG_NAME'\nspec:\n  endpointConfigName: '$ENDPOINT_CONFIG_NAME'\n  productionVariants:\n  - modelName: '$MODEL_NAME'\n    variantName: AllTraffic\n    instanceType: ml.c5.large\n    initialInstanceCount: 1\n---\napiVersion: sagemaker.services.k8s.aws\/v1alpha1\nkind: Endpoint\nmetadata:\n  name: '$ENDPOINT_NAME'\nspec:\n  endpointName: '$ENDPOINT_NAME'\n  endpointConfigName: '$ENDPOINT_CONFIG_NAME'\n' &gt; .\/deploy.yaml\n<\/code><\/pre>\n<\/p><\/div>\n<p>Now, the endpoint is ready to be deployed:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">kubectl apply -f deploy.yaml<\/code><\/pre>\n<\/p><\/div>\n<p>You should see the following output:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">model.sagemaker.services.k8s.aws\/ack-xgboost-model-7420 created\nendpointconfig.sagemaker.services.k8s.aws\/ack-xgboost-endpoint-config-7420 created\nendpoint.sagemaker.services.k8s.aws\/ack-xgboost-endpoint-7420 created\n<\/code><\/pre>\n<\/p><\/div>\n<p>We can observe that the model and endpoint config were created. 
Deploying the endpoint may take some time:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">kubectl describe models.sagemaker\nkubectl describe endpointconfigs.sagemaker\nkubectl describe endpoints.sagemaker\n<\/code><\/pre>\n<\/p><\/div>\n<p>We can watch this process using the following command:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">kubectl get endpoints.sagemaker --watch<\/code><\/pre>\n<\/p><\/div>\n<p>After some time, the <code>STATUS<\/code> changes to <code>InService<\/code>:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">NAME                        STATUS\nack-xgboost-endpoint-7420   Creating         \nack-xgboost-endpoint-7420   InService        \n<\/code><\/pre>\n<\/p><\/div>\n<p>This indicates the deployed endpoint is ready for use.<\/p>\n<h2>Verify the inference capabilities of the trained model<\/h2>\n<p>We invoke the model endpoint using Python to emulate a typical use case. We reuse the code from the <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/blob\/master\/introduction_to_amazon_algorithms\/xgboost_abalone\/xgboost_abalone.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker example notebook<\/a>.<\/p>\n<p>We first download the test set from Amazon S3. Then we load a single sample from the test set and use it to invoke the endpoint we deployed in the previous section. Download the test file with the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">pip install boto3 numpy\naws s3 cp s3:\/\/sagemaker-sample-files\/datasets\/tabular\/uci_abalone\/test\/abalone.test abalone.test\nhead -1 abalone.test &gt; abalone.single.test\n<\/code><\/pre>\n<\/p><\/div>\n<p>Use the Python interpreter to test inference. 
The Python interpreter is usually installed as <code>\/usr\/local\/bin\/python&lt;version&gt;<\/code> on those machines where it\u2019s available; putting <code>\/usr\/local\/bin<\/code> in your Unix\/Linux shell\u2019s search path makes it possible to start it by entering the <code>python<\/code> command.<\/p>\n<p>Create a file named <code>predict.py<\/code> with the following code block and run it:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">printf '\nimport math\nimport os\n\nimport boto3\n\n# Region and endpoint name are read from the environment\nregion = os.environ.get(\"SERVICE_REGION\")\nendpoint_name = os.environ.get(\"ENDPOINT_NAME\")\n\nruntime_client = boto3.client(\"runtime.sagemaker\", region_name=region)\n\n# Load the single libsvm-formatted test sample\nfile_name = \"abalone.single.test\"\nwith open(file_name, \"r\") as f:\n    payload = f.read().strip()\n\nresponse = runtime_client.invoke_endpoint(\n    EndpointName=endpoint_name, ContentType=\"text\/x-libsvm\", Body=payload\n)\n\n# Split the response body on commas and round each value up\nresult = response[\"Body\"].read().decode(\"utf-8\").split(\",\")\nresult = [math.ceil(float(i)) for i in result]\n# The label is the first field of the libsvm record\nlabel = payload.split()[0]\nprint(\"Label: \" + label)\nprint(\"Prediction: \" + str(result[0]))\n' &gt; .\/predict.py\npython predict.py\n<\/code><\/pre>\n<\/p><\/div>\n<p>Running this sample should give us the following result:<\/p>\n<p>The age of the abalone that is provided in the test example is estimated to be 13 by the ML model. The actual age was 12. This suggests that our ML model has been trained and provides reasonable predictions. 
However, experienced ML users may notice that we haven\u2019t yet performed hyperparameter tuning or applied other methods of increasing accuracy, which are outside the scope of this post.<\/p>\n<h2>Dynamically scale the endpoint according to the load<\/h2>\n<p>SageMaker ACK Operators support custom resource definitions for automatic scaling (using <a href=\"https:\/\/aws-controllers-k8s.github.io\/community\/reference\/applicationautoscaling\/v1alpha1\/ScalableTarget\/\" target=\"_blank\" rel=\"noopener noreferrer\">ScalableTarget<\/a> and <a href=\"https:\/\/aws-controllers-k8s.github.io\/community\/reference\/applicationautoscaling\/v1alpha1\/ScalingPolicy\/\" target=\"_blank\" rel=\"noopener noreferrer\">ScalingPolicy<\/a>) for your hosted models. The following resources adjust the number of instances (minimum 1 to maximum 20) provisioned for a model in response to changes in the <code>SageMakerVariantInvocationsPerInstance<\/code> metric, which is the average number of times per minute that each instance for a variant is invoked:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">printf '\napiVersion: applicationautoscaling.services.k8s.aws\/v1alpha1\nkind: ScalableTarget\nmetadata:\n  name: ack-scalable-target-predfined\nspec:\n  maxCapacity: 20\n  minCapacity: 1\n  resourceID: endpoint\/'$ENDPOINT_NAME'\/variant\/AllTraffic\n  scalableDimension: \"sagemaker:variant:DesiredInstanceCount\"\n  serviceNamespace: sagemaker\n---\napiVersion: applicationautoscaling.services.k8s.aws\/v1alpha1\nkind: ScalingPolicy\nmetadata:\n  name: ack-scaling-policy-predefined\nspec:\n  policyName: ack-scaling-policy-predefined\n  policyType: TargetTrackingScaling\n  resourceID: endpoint\/'$ENDPOINT_NAME'\/variant\/AllTraffic\n  scalableDimension: \"sagemaker:variant:DesiredInstanceCount\"\n  serviceNamespace: sagemaker\n  targetTrackingScalingPolicyConfiguration:\n    targetValue: 60\n    scaleInCooldown: 700\n    
scaleOutCooldown: 300\n    predefinedMetricSpecification:\n        predefinedMetricType: SageMakerVariantInvocationsPerInstance\n ' &gt; .\/scale-endpoint.yaml\n<\/code><\/pre>\n<\/p><\/div>\n<p>Apply with the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">kubectl apply -f scale-endpoint.yaml<\/code><\/pre>\n<\/p><\/div>\n<p>You should see the following output:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">scalabletarget.applicationautoscaling.services.k8s.aws\/ack-scalable-target-predfined created\nscalingpolicy.applicationautoscaling.services.k8s.aws\/ack-scaling-policy-predefined created\n<\/code><\/pre>\n<\/p><\/div>\n<p>We can observe that <code>scalingpolicy<\/code> was created:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">kubectl describe scalingpolicy.applicationautoscaling<\/code><\/pre>\n<\/p><\/div>\n<p>The output of <code>scalingpolicy<\/code> looks like the following:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">Status:\n\u00a0\u00a0Ack Resource Metadata:\n\u00a0\u00a0 \u00a0Arn: \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:b33d12b8-aa81-4cb8-855e-c7b6dcb9d6e7:resource\/SageMaker\/endpoint\/ack-xgboost-endpoint\/variant\/AllTraffic:policyName\/ack-scaling-policy-predefined\n\u00a0\u00a0 \u00a0Owner Account ID:\u00a0 123456789012\n\u00a0\u00a0Alarms:\n\u00a0\u00a0 \u00a0Alarm ARN: \u00a0 arn:aws:cloudwatch:us-east-1:123456789012:alarm:TargetTracking-endpoint\/ack-xgboost-endpoint\/variant\/AllTraffic-AlarmHigh-966b8232-a9b9-467d-99f3-95436f5c0383\n\u00a0\u00a0 \u00a0Alarm Name: \u00a0TargetTracking-endpoint\/ack-xgboost-endpoint\/variant\/AllTraffic-AlarmHigh-966b8232-a9b9-467d-99f3-95436f5c0383\n\u00a0\u00a0 \u00a0Alarm ARN: \u00a0 
arn:aws:cloudwatch:us-east-1:123456789012:alarm:TargetTracking-endpoint\/ack-xgboost-endpoint\/variant\/AllTraffic-AlarmLow-71e39f85-1afb-401d-9703-b788cdc10a93\n\u00a0\u00a0 \u00a0Alarm Name: \u00a0TargetTracking-endpoint\/ack-xgboost-endpoint\/variant\/AllTraffic-AlarmLow-71e39f85-1afb-401d-9703-b788cdc10a93\n<\/code><\/pre>\n<\/p><\/div>\n<h2>Clean up<\/h2>\n<p>Run the following commands to delete the resources created in this post:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">kubectl delete -f scale-endpoint.yaml\nkubectl delete -f deploy.yaml\nkubectl delete -f training.yaml\n<\/code><\/pre>\n<\/p><\/div>\n<p>Create a file named <code>uninstall-controller.sh<\/code> and insert the following code block required for deleting the controller and custom resource definitions:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">printf '\n#!\/usr\/bin\/env bash\n\n# Uninstall Controller\n\nexport HELM_EXPERIMENTAL_OCI=1\nexport ACK_K8S_NAMESPACE=${ACK_K8S_NAMESPACE:-\"ack-system\"}\n\nfunction uninstall_ack_controller() {\n   local service=\"$1\"\n   local chart_export_path=\/tmp\/chart\n   \n   helm uninstall -n $ACK_K8S_NAMESPACE ack-$service-controller\n   kubectl delete -f $chart_export_path\/ack-$service-controllerchart\/crds\n}\n\nuninstall_ack_controller \"sagemaker\"\nuninstall_ack_controller \"applicationautoscaling\"\n' &gt; .\/uninstall-controller.sh<\/code><\/pre>\n<\/p><\/div>\n<p>Run the following commands to uninstall the controller and custom resource definitions, and delete the namespace, IAM roles, and S3 bucket you created:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># uninstall controller and remove CRDs\nchmod +x uninstall-controller.sh\n.\/uninstall-controller.sh\n\n# Delete controller namespace\nkubectl delete namespace $ACK_K8S_NAMESPACE\n\n# Delete S3 bucket\naws s3 rb s3:\/\/$SAGEMAKER_BUCKET 
--force\n\n# Delete SageMaker execution role\naws iam detach-role-policy --role-name $SAGEMAKER_EXECUTION_ROLE_NAME --policy-arn arn:aws:iam::aws:policy\/AmazonSageMakerFullAccess\naws iam detach-role-policy --role-name $SAGEMAKER_EXECUTION_ROLE_NAME --policy-arn arn:aws:iam::aws:policy\/AmazonS3FullAccess\naws iam delete-role --role-name $SAGEMAKER_EXECUTION_ROLE_NAME\n\n# Delete application autoscaling service linked role\naws iam delete-service-linked-role --role-name AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint\n\n# Delete IAM role created for IRSA\naws iam detach-role-policy --role-name $OIDC_ROLE_NAME --policy-arn arn:aws:iam::aws:policy\/AmazonSageMakerFullAccess\naws iam delete-role-policy --role-name $OIDC_ROLE_NAME --policy-name \"iam-pass-role-policy\"\naws iam delete-role --role-name $OIDC_ROLE_NAME<\/code><\/pre>\n<\/p><\/div>\n<h2>Conclusion<\/h2>\n<p>SageMaker ACK Operators provide engineering teams with a native Kubernetes experience for creating and interacting with ML jobs on SageMaker, either with the Kubernetes API or with Kubernetes command-line utilities such as kubectl. You can build automation, tooling, and custom interfaces for data scientists in Kubernetes by using these controllers\u2014all without building, maintaining, or optimizing ML infrastructure. Data scientists and developers familiar with Kubernetes can compose and interact with fully managed SageMaker training, tuning, and inference jobs, as they would with Kubernetes jobs running locally. Logs from SageMaker jobs stream back to Kubernetes, allowing you to natively view logs for your model training, tuning, and prediction jobs in the command line.<\/p>\n<p>ACK is a community-driven project and will soon include <a href=\"https:\/\/aws-controllers-k8s.github.io\/community\/services\/\" target=\"_blank\" rel=\"noopener noreferrer\">service controllers for other AWS service APIs<\/a>.<\/p>\n<h3>Links<\/h3>\n<p>[1] Dua, D. and Graff, C. (2019). 
UCI Machine Learning Repository [http:\/\/archive.ics.uci.edu\/ml]. Irvine, CA: University of California, School of Information and Computer Science.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/15\/Kanwaljit-Khurmi-cropped.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-30789 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/11\/15\/Kanwaljit-Khurmi-cropped.jpg\" alt=\"\" width=\"100\" height=\"113\"><\/a>Kanwaljit Khurmi\u00a0<\/strong>is a Senior Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-20019 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/12\/16\/Suraj-Kota.jpg\" alt=\"\" width=\"101\" height=\"134\"><\/strong><strong>Suraj Kota<\/strong> is a Software Engineer specializing in machine learning infrastructure. He builds tools that make it easy to get started with and scale machine learning workloads on AWS. 
He worked on the AWS Deep Learning Containers, Deep Learning AMI, SageMaker Operators for Kubernetes, and other open source integrations like Kubeflow.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/04\/ArchisJoglekar-badgephoto.jpeg\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-28924 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/10\/04\/ArchisJoglekar-badgephoto.jpeg\" alt=\"\" width=\"100\" height=\"133\"><\/a><strong>Archis Joglekar<\/strong>\u00a0is an AI\/ML Partner Solutions Architect in the Emerging Technologies team. He is interested in performant, scalable deep learning and scientific computing using the building blocks at AWS. His past experiences range from computational physics research to machine learning platform development in academia, national labs, and startups. His time away from the computer is spent playing soccer and with friends and family.<\/p>\n<p>       <!-- '\"` -->\n      
<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/use-amazon-sagemaker-ack-operators-to-train-and-deploy-machine-learning-models\/<\/p>\n","protected":false},"author":0,"featured_media":1220,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1219"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1219"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1219\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1220"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1219"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1219"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1219"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}