{"id":432,"date":"2020-10-21T22:47:08","date_gmt":"2020-10-21T22:47:08","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/10\/21\/using-amazon-sagemaker-inference-pipelines-with-multi-model-endpoints\/"},"modified":"2020-10-21T22:47:08","modified_gmt":"2020-10-21T22:47:08","slug":"using-amazon-sagemaker-inference-pipelines-with-multi-model-endpoints","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/10\/21\/using-amazon-sagemaker-inference-pipelines-with-multi-model-endpoints\/","title":{"rendered":"Using Amazon SageMaker inference pipelines with multi-model endpoints"},"content":{"rendered":"<div id=\"\">\n<p>Businesses are increasingly deploying multiple machine learning (ML) models to serve precise and accurate predictions to their consumers. Consider a media company that wants to provide recommendations to its subscribers. The company may want to employ different custom models for recommending different categories of products\u2014such as movies, books, music, and articles. If the company wants to add personalization to the recommendations by using individual subscriber information, the number of custom models further increases. Hosting each custom model on a distinct compute instance is not only cost prohibitive, but also leads to underutilization of the hosting resources if not all models are frequently used.<\/p>\n<p><a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker <\/a>is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. After you train an ML model, you can deploy it on Amazon SageMaker endpoints that are fully managed and can serve inferences in real time with low latency. 
<a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/multi-model-endpoints.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker multi-model endpoints<\/a> (MMEs) are a cost-effective solution to deploy a large number of ML models or per-user models. You can deploy multiple models on a single multi-model enabled endpoint such that all models share the compute resources and the serving container. You get significant cost savings and also simplify model deployments and updates. For more information about MME, see <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/save-on-inference-costs-by-using-amazon-sagemaker-multi-model-endpoints\/\" target=\"_blank\" rel=\"noopener noreferrer\">Save on inference costs by using Amazon SageMaker multi-model endpoints<\/a>.<\/p>\n<p>The following diagram depicts how MMEs work.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17286 size-full\" title=\"How MMEs work\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/1-Diagram-1.jpg\" alt=\"\" width=\"900\" height=\"349\"><\/p>\n<p>Multiple model artifacts are persisted in an Amazon S3 bucket. When a specific model is invoked, Amazon SageMaker dynamically loads it onto the container hosting the endpoint. 
If the model is already loaded in the container\u2019s memory, invocation is faster because Amazon SageMaker doesn\u2019t need to download and load it.<\/p>\n<p>Until now, you could use MME with several frameworks, such as TensorFlow, PyTorch, MXNet, SKLearn, and <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/build-multi-model-build-container.html\" target=\"_blank\" rel=\"noopener noreferrer\">build your own container with a multi-model server.<\/a> This post introduces the following feature enhancements to MME:<\/p>\n<ul>\n<li>\n<strong>MME support for Amazon SageMaker built-in algorithms<\/strong> \u2013 MME is now supported natively in the following popular Amazon SageMaker built-in algorithms: <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/xgboost.html\" target=\"_blank\" rel=\"noopener noreferrer\">XGBoost<\/a>, <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/linear-learner.html\" target=\"_blank\" rel=\"noopener noreferrer\">linear learner<\/a>, <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/randomcutforest.html\" target=\"_blank\" rel=\"noopener noreferrer\">RCF<\/a>, and <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/k-nearest-neighbors.html\" target=\"_blank\" rel=\"noopener noreferrer\">KNN<\/a>. 
You can directly use the Amazon SageMaker provided containers while using these algorithms without having to build your own <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/build-multi-model-build-container.html\" target=\"_blank\" rel=\"noopener noreferrer\">custom container.<\/a>\n<\/li>\n<li>\n<strong>MME support for Amazon SageMaker inference pipelines<\/strong> \u2013 The <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/inference-pipelines.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker inference pipeline model<\/a> consists of a sequence of containers that serve inference requests by combining preprocessing, predictions, and postprocessing data science tasks. An inference pipeline allows you to reuse the same preprocessing code used during model training to process the inference request data used for predictions. You can now deploy an inference pipeline on an MME where one of the containers in the pipeline can dynamically serve requests based on the model being invoked.<\/li>\n<li>\n<strong>IAM condition keys for granular access to models<\/strong> \u2013 Prior to this enhancement, an <a href=\"http:\/\/aws.amazon.com\/iam\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Identity and Access Management<\/a> (IAM) principal with <code>InvokeEndpoint<\/code> permission on the endpoint resource could invoke all the models hosted on that endpoint. Now, we support <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/multi-model-endpoints.html#multi-model-endpoint-security\" target=\"_blank\" rel=\"noopener noreferrer\">granular access to models using IAM condition keys<\/a>. 
For example, the following IAM condition restricts the principal\u2019s access to models persisted in the <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket under the <code>company_a<\/code> or <code>common<\/code> prefixes:<\/li>\n<\/ul>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-json\">            \"Condition\": {\r\n                \"StringLike\": {\r\n                    \"sagemaker:TargetModel\": [\"company_a\/*\", \"common\/*\"]\r\n                }\r\n            }\r\n<\/code><\/pre>\n<\/div>\n<p>We also provide a fully functional notebook to demonstrate these enhancements.<\/p>\n<h2>Walkthrough overview<\/h2>\n<p>To demonstrate these capabilities, the <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/blob\/master\/advanced_functionality\/multi_model_linear_learner_home_value\/linear_learner_multi_model_endpoint_inf_pipeline.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">notebook<\/a> discusses the use case of predicting house prices in multiple cities using linear regression. House prices are predicted based on features like number of bedrooms, number of garages, square footage, and more. Depending on the city, the features affect the house price differently. For example, a small change in square footage changes house prices far more in New York City than in Houston.<\/p>\n<p>For accurate house price predictions, we train multiple linear regression models, with a unique location-specific model per city. Each location-specific model is trained on synthetic housing data with randomly generated characteristics. 
To cost-effectively serve the multiple housing price prediction models, we deploy the models on a single multi-model enabled endpoint, as shown in the following diagram.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17359 size-full\" title=\"Deploying models on a single multi-model enabled endpoint\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/21\/2-Diagram_Updated.jpg\" alt=\"\" width=\"900\" height=\"321\"><\/p>\n<p>The walkthrough includes the following high-level steps:<\/p>\n<ol>\n<li>Examine the synthetic housing data generated.<\/li>\n<li>Preprocess the raw housing data using Scikit-learn.<\/li>\n<li>Train regression models using the built-in Amazon SageMaker linear learner algorithm.<\/li>\n<li>Create an Amazon SageMaker model with multi-model support.<\/li>\n<li>Create an Amazon SageMaker inference pipeline with an Sklearn model and multi-model enabled linear learner model.<\/li>\n<li>Test the inference pipeline by getting predictions from the different linear learner models.<\/li>\n<li>Update the MME with new models.<\/li>\n<li>Monitor the MME with <a href=\"http:\/\/aws.amazon.com\/cloudwatch\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon CloudWatch<\/a>\n<\/li>\n<li>Explore fine-grained access to models hosted on the MME using IAM condition keys.<\/li>\n<\/ol>\n<p>Other steps necessary to import libraries, set up IAM permissions, and use utility functions are defined in the notebook, which this post doesn\u2019t discuss. 
You can walk through and run the code with the following <a href=\"https:\/\/github.com\/awslabs\/amazon-sagemaker-examples\/blob\/master\/advanced_functionality\/multi_model_linear_learner_home_value\/linear_learner_multi_model_endpoint_inf_pipeline.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">notebook <\/a>on the GitHub repo.<\/p>\n<h2>Examining the synthetic housing data<\/h2>\n<p>The dataset consists of six numerical features that capture the year the house was built, house size in square feet, number of bedrooms, number of bathrooms, lot size, number of garages, and two categorical features: deck and front porch, indicating whether these are present or not.<\/p>\n<p>To see the raw data, enter the following code:<\/p>\n<p>The following screenshot shows the results.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17288 size-full\" title=\"Results\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/3-Screenshot-1.jpg\" alt=\"\" width=\"900\" height=\"189\"><\/p>\n<p>You can now preprocess the categorical variables (<code>front_porch<\/code> and <code>deck<\/code>) using Scikit-learn.<\/p>\n<h2>Preprocessing the raw housing data<\/h2>\n<p>To preprocess the raw data, you first create an SKLearn estimator and use the <a href=\"https:\/\/github.com\/awslabs\/amazon-sagemaker-examples\/blob\/master\/advanced_functionality\/multi_model_linear_learner_home_value\/sklearn_preprocessor.py\" target=\"_blank\" rel=\"noopener noreferrer\">sklearn_preprocessor.py<\/a> script as the <code>entry_point<\/code>:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">#Create the SKLearn estimator with the sklearn_preprocessor.py as the script\r\nfrom sagemaker.sklearn.estimator import SKLearn\r\nscript_path = 'sklearn_preprocessor.py'\r\nsklearn_preprocessor = SKLearn(\r\n    entry_point=script_path,\r\n    role=role,\r\n    
train_instance_type=\"ml.c4.xlarge\",\r\n    sagemaker_session=sagemaker_session)\r\n<\/code><\/pre>\n<\/div>\n<p>You then launch multiple Scikit-learn training jobs to process the raw synthetic data generated for multiple locations. Before running the following code, consider the training instance limits in your account and the cost, and adjust the <code>PARALLEL_TRAINING_JOBS<\/code> value accordingly:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">preprocessor_transformers = []\r\n\r\nfor index, loc in enumerate(LOCATIONS[:PARALLEL_TRAINING_JOBS]):\r\n    print(\"preprocessing fit input data at \", index , \" for loc \", loc)\r\n    job_name='scikit-learn-preprocessor-{}'.format(strftime('%Y-%m-%d-%H-%M-%S', gmtime()))\r\n    \r\n    sklearn_preprocessor.fit({'train': train_inputs[index]}, job_name=job_name, wait=True)\r\n    \r\n    ## Once the preprocessor is fit, use a transformer to preprocess the raw training data and store the transformed data back in S3.\r\n    transformer = sklearn_preprocessor.transformer(\r\n        instance_count=1, \r\n        instance_type='ml.m4.xlarge',\r\n        assemble_with='Line',\r\n        accept='text\/csv'\r\n    )\r\n    preprocessor_transformers.append(transformer)\r\n<\/code><\/pre>\n<\/div>\n<p>When the preprocessors are fitted, use batch transform to preprocess the raw training data and store the results back in Amazon S3:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">preprocessed_train_data_path = []\r\n\r\nfor index, transformer in enumerate(preprocessor_transformers):\r\n    transformer.transform(train_inputs[index], content_type='text\/csv')\r\n    \r\n    print('Launching batch transform job: {}'.format(transformer.latest_transform_job.job_name))\r\n    preprocessed_train_data_path.append(transformer.output_path)\r\n<\/code><\/pre>\n<\/div>\n<h2>Training 
regression models<\/h2>\n<p>In this step, you train multiple models, one for each location.<\/p>\n<p>Start by accessing the built-in linear learner algorithm:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import boto3\r\nfrom sagemaker.amazon.amazon_estimator import get_image_uri\r\n# Look up the linear learner container image for the current Region\r\ncontainer = get_image_uri(boto3.Session().region_name, 'linear-learner')\r\ncontainer\r\n<\/code><\/pre>\n<\/div>\n<p>Depending on the Region you\u2019re using, you receive output similar to the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">382416733822.dkr.ecr.us-east-1.amazonaws.com\/linear-learner:1<\/code><\/pre>\n<\/div>\n<p>Next, define a function to launch a training job for a single location using the Amazon SageMaker Estimator API. In the hyperparameter configuration, you use <code>predictor_type='regressor'<\/code> to indicate that you\u2019re using the algorithm to train a regression model. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">def launch_training_job(location, transformer):\r\n    \"\"\"Launch a linear learner training job\"\"\"\r\n    \r\n    # The preprocessed data written by the batch transform job\r\n    train_inputs = '{}\/{}'.format(transformer.output_path, \"train.csv\")\r\n    val_inputs = '{}\/{}'.format(transformer.output_path, \"val.csv\")\r\n    \r\n    print(\"train_inputs:\", train_inputs)\r\n    print(\"val_inputs:\", val_inputs)\r\n     \r\n    # Each location gets its own prefix for model artifacts\r\n    full_output_prefix = '{}\/model_artifacts\/{}'.format(DATA_PREFIX, location)\r\n    s3_output_path = 's3:\/\/{}\/{}'.format(BUCKET, full_output_prefix)\r\n    \r\n    print(\"s3_output_path \", s3_output_path)\r\n    \r\n    linear_estimator = sagemaker.estimator.Estimator(\r\n                            container,\r\n                            role, \r\n                            
train_instance_count=1, \r\n                            train_instance_type='ml.c4.xlarge',\r\n                            output_path=s3_output_path,\r\n                            sagemaker_session=sagemaker_session)\r\n    \r\n    linear_estimator.set_hyperparameters(\r\n                           feature_dim=10,\r\n                           mini_batch_size=100,\r\n                           predictor_type='regressor',\r\n                           epochs=10,\r\n                           num_models=32,\r\n                           loss='absolute_loss')\r\n    DISTRIBUTION_MODE = 'FullyReplicated'\r\n    train_input = sagemaker.s3_input(s3_data=train_inputs, \r\n           distribution=DISTRIBUTION_MODE, content_type='text\/csv;label_size=1')\r\n    val_input   = sagemaker.s3_input(s3_data=val_inputs,\r\n           distribution=DISTRIBUTION_MODE, content_type='text\/csv;label_size=1')\r\n    \r\n    remote_inputs = {'train': train_input, 'validation': val_input}\r\n    # wait=False returns immediately so that jobs for several locations can run in parallel\r\n    linear_estimator.fit(remote_inputs, wait=False)\r\n    return linear_estimator.latest_training_job.name\r\n<\/code><\/pre>\n<\/div>\n<p>You can now start multiple model training jobs, one for each location. Make sure to choose an appropriate value for <code>PARALLEL_TRAINING_JOBS<\/code>, taking your AWS account service limits and cost into consideration. In the notebook, this value is set to 4. 
See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">training_jobs = []\r\nfor transformer, loc in zip(preprocessor_transformers, LOCATIONS[:PARALLEL_TRAINING_JOBS]): \r\n    job = launch_training_job(loc, transformer)\r\n    training_jobs.append(job)\r\nprint('{} training jobs launched: {}'.format(len(training_jobs), training_jobs))\r\n<\/code><\/pre>\n<\/div>\n<p>You receive output similar to the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">4 training jobs launched: [(&lt;sagemaker.estimator.Estimator object at 0x7fb54784b6d8&gt;, 'linear-learner-2020-06-03-03-51-26-548'), (&lt;sagemaker.estimator.Estimator object at 0x7fb5478b3198&gt;, 'linear-learner-2020-06-03-03-51-26-973'), (&lt;sagemaker.estimator.Estimator object at 0x7fb54780dbe0&gt;, 'linear-learner-2020-06-03-03-51-27-775'), (&lt;sagemaker.estimator.Estimator object at 0x7fb5477664e0&gt;, 'linear-learner-2020-06-03-03-51-31-457')]<\/code><\/pre>\n<\/div>\n<p>Wait until all training jobs are complete before proceeding to the next step.<\/p>\n<h2>Creating an Amazon SageMaker model with multi-model support<\/h2>\n<p>When the training jobs are complete, you\u2019re ready to create an MME.<\/p>\n<p>First, define a method to copy model artifacts from the training job output to a location in Amazon S3 where the MME dynamically loads individual models:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">def deploy_artifacts_to_mme(job_name):\r\n    print(\"job_name :\", job_name)\r\n    response = sm_client.describe_training_job(TrainingJobName=job_name)\r\n    source_s3_key,model_name =    parse_model_artifacts(response['ModelArtifacts']['S3ModelArtifacts'])\r\n    copy_source = {'Bucket': BUCKET, 'Key': source_s3_key}\r\n    key = '{}\/{}\/{}\/{}.tar.gz'.format(DATA_PREFIX, MULTI_MODEL_ARTIFACTS, model_name, model_name)\r\n 
   print('Copying {} model\\n   from: {}\\n     to: {}...'.format(model_name, source_s3_key, key))\r\n    s3_client.copy_object(Bucket=BUCKET, CopySource=copy_source, Key=key)\r\n<\/code><\/pre>\n<\/div>\n<p>Copy the model artifacts from all but the last training job to this location:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">## Deploy all but the last model trained to MME\r\nfor job_name in training_jobs[:-1]:\r\n    deploy_artifacts_to_mme(job_name)\r\n<\/code><\/pre>\n<\/div>\n<p>You receive output similar to the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">linear-learner-2020-06-03-03-51-26-973\r\nCopying LosAngeles_CA model\r\n   from: DEMO_MME_LINEAR_LEARNER\/model_artifacts\/LosAngeles_CA\/linear-learner-2020-06-03-03-51-26-973\/output\/model.tar.gz\r\n     to: DEMO_MME_LINEAR_LEARNER\/multi_model_artifacts\/LosAngeles_CA\/LosAngeles_CA.tar.gz...\r\nlinear-learner-2020-06-03-03-51-27-775\r\nCopying Chicago_IL model\r\n   from: DEMO_MME_LINEAR_LEARNER\/model_artifacts\/Chicago_IL\/linear-learner-2020-06-03-03-51-27-775\/output\/model.tar.gz\r\n     to: DEMO_MME_LINEAR_LEARNER\/multi_model_artifacts\/Chicago_IL\/Chicago_IL.tar.gz...\r\nlinear-learner-2020-06-03-03-51-31-457\r\n<\/code><\/pre>\n<\/div>\n<p>Create the Amazon SageMaker model entity using the <code>MultiDataModel<\/code> API:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from sagemaker.multidatamodel import MultiDataModel\r\n\r\nMODEL_NAME = '{}-{}'.format(HOUSING_MODEL_NAME, strftime('%Y-%m-%d-%H-%M-%S', gmtime()))\r\n\r\n# S3 prefix from which the MME dynamically loads individual models\r\n_model_url  = 's3:\/\/{}\/{}\/{}\/'.format(BUCKET, DATA_PREFIX, MULTI_MODEL_ARTIFACTS)\r\n\r\nll_multi_model = MultiDataModel(\r\n        name=MODEL_NAME,\r\n        model_data_prefix=_model_url,\r\n        image=container,\r\n        role=role,\r\n        sagemaker_session=sagemaker_session)\r\n<\/code><\/pre>\n<\/div>\n<h2>Creating an inference pipeline<\/h2>\n<p>Set up an inference pipeline 
with the <code>PipelineModel<\/code> API. This sets up a list of models in a single endpoint; for this post, we configure our pipeline model with the fitted Scikit-learn inference model and the fitted MME linear learner model. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from sagemaker.model import Model\r\nfrom sagemaker.pipeline import PipelineModel\r\nimport boto3\r\nfrom time import gmtime, strftime\r\n\r\ntimestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\r\n\r\nscikit_learn_inference_model = sklearn_preprocessor.create_model()\r\n\r\nmodel_name = '{}-{}'.format('inference-pipeline', timestamp_prefix)\r\nendpoint_name = '{}-{}'.format('inference-pipeline-ep', timestamp_prefix)\r\n\r\nsm_model = PipelineModel(\r\n    name=model_name, \r\n    role=role, \r\n    sagemaker_session=sagemaker_session,\r\n    models=[\r\n        scikit_learn_inference_model, \r\n        ll_multi_model])\r\n\r\nsm_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', endpoint_name=endpoint_name)\r\n<\/code><\/pre>\n<\/div>\n<p>The MME is now ready to take inference requests and respond with predictions. With the MME, the inference request should include the target model to invoke.<\/p>\n<h2>Testing the inference pipeline<\/h2>\n<p>You can now get predictions from the different linear learner models. 
Create a <code>RealTimePredictor<\/code> with the inference pipeline endpoint:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from sagemaker.predictor import json_serializer, csv_serializer, json_deserializer, RealTimePredictor\r\nfrom sagemaker.content_types import CONTENT_TYPE_CSV, CONTENT_TYPE_JSON\r\npredictor = RealTimePredictor(\r\n    endpoint=endpoint_name,\r\n    sagemaker_session=sagemaker_session,\r\n    serializer=csv_serializer,\r\n    content_type=CONTENT_TYPE_CSV,\r\n    accept=CONTENT_TYPE_JSON)\r\n<\/code><\/pre>\n<\/div>\n<p>Define a function to get predictions from the <code>RealTimePredictor<\/code>:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import json\r\nimport time\r\n\r\ndef predict_one_house_value(features, model_name, predictor_to_use):\r\n    print('Using model {} to predict price of this house: {}'.format(model_name,\r\n                                                                     features))\r\n    start_time = time.time()\r\n     \r\n    # The predictor's csv_serializer converts the feature list to CSV;\r\n    # target_model selects which model on the MME serves the request.\r\n    response = predictor_to_use.predict(features, target_model=model_name)\r\n    \r\n    response_json = json.loads(response)\r\n        \r\n    predicted_value = response_json['predictions'][0]['score']    \r\n    \r\n    duration = time.time() - start_time\r\n    \r\n    print('${:,.2f}, took {:,d} ms\\n'.format(predicted_value, int(duration * 1000)))\r\n<\/code><\/pre>\n<\/div>\n<p>With MME, a model is dynamically loaded into the memory of the container on the instance hosting the endpoint when it is invoked. Therefore, a model invocation may take longer the first time. When the model is already in the container\u2019s memory, subsequent invocations are faster. If the instance\u2019s memory utilization is high and a new model needs to be loaded, unused models are unloaded. 
The unloaded models remain in the instance\u2019s storage volume and can be loaded into the container\u2019s memory later without being downloaded from the S3 bucket again. If the instance\u2019s storage volume is full, unused models are deleted from the storage volume.<\/p>\n<p>Amazon SageMaker fully manages the loading and unloading of the models, without you having to take any specific actions. However, it\u2019s important to understand this behavior because it affects model invocation latency.<\/p>\n<p>Iterate through invocations with random inputs against a random model and show each prediction and the time it takes to come back:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">for i in range(10):\r\n    model_name = LOCATIONS[np.random.randint(1, len(LOCATIONS[:PARALLEL_TRAINING_JOBS]))]\r\n    full_model_name = '{}\/{}.tar.gz'.format(model_name,model_name)\r\n    # Pass the RealTimePredictor created earlier; the function calls its predict() method\r\n    predict_one_house_value(gen_random_house()[1:], full_model_name, predictor)\r\n<\/code><\/pre>\n<\/div>\n<p>You receive output similar to the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">Using model Chicago_IL\/Chicago_IL.tar.gz to predict price of this house: [1993, 2728, 6, 3.0, 0.7, 1, 'y', 'y']\r\n$439,972.62, took 1,166 ms\r\n\r\nUsing model Houston_TX\/Houston_TX.tar.gz to predict price of this house: [1989, 1944, 5, 3.0, 1.0, 1, 'n', 'y']\r\n$280,848.00, took 1,086 ms\r\n\r\nUsing model LosAngeles_CA\/LosAngeles_CA.tar.gz to predict price of this house: [1968, 2427, 4, 3.0, 1.0, 2, 'y', 'n']\r\n$266,721.31, took 1,029 ms\r\n\r\nUsing model Chicago_IL\/Chicago_IL.tar.gz to predict price of this house: [2000, 4024, 2, 1.0, 0.82, 1, 'y', 'y']\r\n$584,069.88, took 53 ms\r\n\r\nUsing model LosAngeles_CA\/LosAngeles_CA.tar.gz to predict price of this house: [1986, 3463, 5, 3.0, 0.9, 1, 'y', 'n']\r\n$496,340.19, took 43 ms\r\n\r\nUsing model 
Chicago_IL\/Chicago_IL.tar.gz to predict price of this house: [2002, 3885, 4, 3.0, 1.16, 2, 'n', 'n']\r\n$626,904.12, took 39 ms\r\n\r\nUsing model Chicago_IL\/Chicago_IL.tar.gz to predict price of this house: [1992, 1531, 6, 3.0, 0.68, 1, 'y', 'n']\r\n$257,696.17, took 36 ms\r\n\r\nUsing model Chicago_IL\/Chicago_IL.tar.gz to predict price of this house: [1992, 2327, 2, 3.0, 0.59, 3, 'n', 'n']\r\n$337,758.22, took 33 ms\r\n\r\nUsing model LosAngeles_CA\/LosAngeles_CA.tar.gz to predict price of this house: [1995, 2656, 5, 1.0, 1.16, 0, 'y', 'n']\r\n$390,652.59, took 35 ms\r\n\r\nUsing model LosAngeles_CA\/LosAngeles_CA.tar.gz to predict price of this house: [2000, 4086, 2, 3.0, 1.03, 3, 'n', 'y']\r\n$632,995.44, took 35 ms\r\n<\/code><\/pre>\n<\/div>\n<p>The output shows the predicted house price and the time each prediction took.<\/p>\n<p>Compare two invocations of the same model. The second time, Amazon SageMaker doesn\u2019t need to download the model from Amazon S3 because it\u2019s already present on the instance, so the inference returns in less time. In this run, the invocation time for the <code>Chicago_IL\/Chicago_IL.tar.gz<\/code> model dropped from 1,166 milliseconds the first time to 53 milliseconds the second time. Similarly, the invocation time for the <code>LosAngeles_CA\/LosAngeles_CA.tar.gz<\/code> model dropped from 1,029 milliseconds to 43 milliseconds.<\/p>\n<h2>Updating an MME with new models<\/h2>\n<p>To deploy a new model to an existing MME, copy a new set of model artifacts to the same Amazon S3 location you set up earlier. 
For example, copy the model for the Houston location with the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">## Copy the last model\r\nlast_training_job=training_jobs[PARALLEL_TRAINING_JOBS-1]\r\ndeploy_artifacts_to_mme(last_training_job)\r\n<\/code><\/pre>\n<\/div>\n<p>Now you can make predictions using the last model. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">model_name = LOCATIONS[PARALLEL_TRAINING_JOBS-1]\r\nfull_model_name = '{}\/{}.tar.gz'.format(model_name,model_name)\r\npredict_one_house_value(gen_random_house()[:-1], full_model_name,predictor)\r\n<\/code><\/pre>\n<\/div>\n<h2>Monitoring MMEs with CloudWatch metrics<\/h2>\n<p>Amazon SageMaker provides <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/monitoring-cloudwatch.html\" target=\"_blank\" rel=\"noopener noreferrer\">CloudWatch metrics<\/a> for MMEs so you can determine the endpoint usage and the cache hit rate and optimize your endpoint. 
To analyze the endpoint and the container behavior, you invoke multiple models in this sequence:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">##Create 200 copies of the original model and save with different names.\r\ncopy_additional_artifacts_to_mme(200)\r\n##Starting with no models loaded into the container\r\n##Invoke the first 100 models\r\ninvoke_multiple_models_mme(0,100)\r\n##Invoke the same 100 models again\r\ninvoke_multiple_models_mme(0,100)\r\n##This time invoke all 200 models to observe behavior\r\ninvoke_multiple_models_mme(0,200)\r\n<\/code><\/pre>\n<\/div>\n<p>The following chart shows the behavior of the CloudWatch metrics <code>LoadedModelCount<\/code> and <code>MemoryUtilization<\/code> corresponding to these model invocations.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17293 size-full\" title=\"Behavior of the CloudWatch metrics\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/4-Graph_Update.jpg\" alt=\"\" width=\"900\" height=\"188\"><\/p>\n<p>The <code>LoadedModelCount<\/code> metric increases continuously as more models are invoked, until it levels off at 121. The <code>MemoryUtilization<\/code> metric of the container increases correspondingly to around 79%. This shows that the instance chosen to host the endpoint could keep only 121 models in memory when all 200 models were invoked.<\/p>\n<p>The following chart adds the <code>ModelCacheHit<\/code> metric to the previous two.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17290 size-full\" title=\"ModelCacheHit metric\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/5-Graph-1.jpg\" alt=\"\" width=\"900\" height=\"204\"><\/p>\n<p>As the number of models loaded into the container\u2019s memory increases, the <code>ModelCacheHit<\/code> metric improves. 
When the same 100 models are invoked the second time, <code>ModelCacheHit<\/code> reaches 1. When new models not yet loaded are invoked, <code>ModelCacheHit<\/code> decreases again.<\/p>\n<p>You can use CloudWatch charts to help make ongoing decisions on the optimal choice of instance type, instance count, and number of models that a given endpoint should host.<\/p>\n<h2>Exploring granular access to models hosted on an MME<\/h2>\n<p>Because of the role attached to the notebook instance, it can invoke all models hosted on the MME. However, you can restrict this model invocation access to specific models by using IAM condition keys. To explore this, you create a new IAM role and IAM policy with a condition key to restrict access to a single model. You then assume this new role and verify that only a single target model can be invoked.<\/p>\n<p>The role assigned to the Amazon SageMaker notebook instance should allow IAM role and IAM policy creation for the next steps to be successful.<\/p>\n<p>Create an IAM role with the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">#Create a new role that can be assumed by this notebook.  
The role should allow access to only a single model.\r\npath='\/'\r\nrole_name=\"{}{}\".format('allow_invoke_ny_model_role', strftime('%Y-%m-%d-%H-%M-%S', gmtime()))\r\ndescription='Role that allows invoking a single model'\r\naction_string = \"sts:AssumeRole\"\r\n# Trust policy that lets the notebook's role assume the new role\r\ntrust_policy={\r\n  \"Version\": \"2012-10-17\",\r\n  \"Statement\": [\r\n    {\r\n      \"Sid\": \"statement1\",\r\n      \"Effect\": \"Allow\",\r\n      \"Principal\": {\r\n        \"AWS\": role\r\n      },\r\n      \"Action\": action_string\r\n    }\r\n  ]\r\n}\r\n\r\nresponse = iam_client.create_role(\r\n    Path=path,\r\n    RoleName=role_name,\r\n    AssumeRolePolicyDocument=json.dumps(trust_policy),\r\n    Description=description,\r\n    MaxSessionDuration=3600\r\n)\r\n# Keep the role ARN; it is needed later to assume the role\r\nrole_arn = response['Role']['Arn']\r\n\r\nprint(response)\r\n<\/code><\/pre>\n<\/div>\n<p>Create an IAM policy with a condition key to restrict access to only the <code>NewYork_NY<\/code> model:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">managed_policy = {\r\n    \"Version\": \"2012-10-17\",\r\n    \"Statement\": [\r\n        {\r\n            \"Sid\": \"SageMakerAccess\",\r\n            \"Action\": \"sagemaker:InvokeEndpoint\",\r\n            \"Effect\": \"Allow\",\r\n            \"Resource\": endpoint_resource_arn,\r\n            \"Condition\": {\r\n                \"StringLike\": {\r\n                    \"sagemaker:TargetModel\": [\"NewYork_NY\/*\"]\r\n                }\r\n            }\r\n        }\r\n    ]\r\n}\r\nresponse = iam_client.create_policy(\r\n  PolicyName='allow_invoke_ny_model_policy',\r\n  PolicyDocument=json.dumps(managed_policy)\r\n)\r\n# Keep the policy ARN; it is needed to attach the policy to the role\r\npolicy_arn = response['Policy']['Arn']\r\n<\/code><\/pre>\n<\/div>\n<p>Attach the IAM policy to the IAM role:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">iam_client.attach_role_policy(\r\n    PolicyArn=policy_arn,\r\n    RoleName=role_name\r\n)\r\n<\/code><\/pre>\n<\/div>\n<p>Assume the new role and create a 
<code>RealTimePredictor<\/code> object runtime client:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">## Invoke with the role that has access to only NY model\r\nsts_connection = boto3.client('sts')\r\nassumed_role_limited_access = sts_connection.assume_role(\r\n    RoleArn=role_arn,\r\n    RoleSessionName=\"MME_Invoke_NY_Model\"\r\n)\r\nassumed_role_limited_access['AssumedRoleUser']['Arn']\r\n\r\n#Create sagemaker runtime client with assumed role\r\nACCESS_KEY = assumed_role_limited_access['Credentials']['AccessKeyId']\r\nSECRET_KEY = assumed_role_limited_access['Credentials']['SecretAccessKey']\r\nSESSION_TOKEN = assumed_role_limited_access['Credentials']['SessionToken']\r\n\r\nruntime_sm_client_with_assumed_role = boto3.client(\r\n    service_name='sagemaker-runtime', \r\n    aws_access_key_id=ACCESS_KEY,\r\n    aws_secret_access_key=SECRET_KEY,\r\n    aws_session_token=SESSION_TOKEN,\r\n)\r\n\r\n#SageMaker session with the assumed role\r\nsagemakerSessionAssumedRole = sagemaker.Session(sagemaker_runtime_client=runtime_sm_client_with_assumed_role)\r\n#Create a RealTimePredictor with the assumed role.\r\npredictorAssumedRole = RealTimePredictor(\r\n    endpoint=endpoint_name,\r\n    sagemaker_session=sagemakerSessionAssumedRole,\r\n    serializer=csv_serializer,\r\n    content_type=CONTENT_TYPE_CSV,\r\n    accept=CONTENT_TYPE_JSON)\r\n<\/code><\/pre>\n<\/div>\n<p>Now invoke the <code>NewYork_NY<\/code> model:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">full_model_name = 'NewYork_NY\/NewYork_NY.tar.gz'\r\npredict_one_house_value(gen_random_house()[:-1], full_model_name, predictorAssumedRole) \r\n<\/code><\/pre>\n<\/div>\n<p>You receive output similar to the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">Using model NewYork_NY\/NewYork_NY.tar.gz to predict price of this house: [1992, 1659, 
2, 2.0, 0.87, 2, 'n', 'y']\r\n$222,008.38, took 154 ms\r\n<\/code><\/pre>\n<\/div>\n<p>Next, try to invoke a different model (<code>Chicago_IL\/Chicago_IL.tar.gz<\/code>). This should throw an error because the assumed role isn\u2019t authorized to invoke this model. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">full_model_name = 'Chicago_IL\/Chicago_IL.tar.gz'\r\n\r\npredict_one_house_value(gen_random_house()[:-1], full_model_name,predictorAssumedRole) \r\n<\/code><\/pre>\n<\/div>\n<p>You receive output similar to the following:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">ClientError: An error occurred (AccessDeniedException) when calling the InvokeEndpoint operation: User: arn:aws:sts::xxxxxxxxxxxx:assumed-role\/allow_invoke_ny_model_role\/MME_Invoke_NY_Model is not authorized to perform: sagemaker:InvokeEndpoint on resource: arn:aws:sagemaker:us-east-1:xxxxxxxxxxxx:endpoint\/inference-pipeline-ep-2020-07-01-15-46-51<\/code><\/pre>\n<\/div>\n<h2>Conclusion<\/h2>\n<p>Amazon SageMaker MMEs are a very powerful tool for teams developing multiple ML models to save significant costs and lower deployment overhead for a large number of ML models. 
This post discussed the new capabilities of Amazon SageMaker MMEs: native integration with Amazon SageMaker built-in algorithms (such as linear learner and KNN), native integration with inference pipelines, and fine-grained access control for the multiple models hosted on a single endpoint using IAM condition keys.<\/p>\n<p>The notebook included with the post provides detailed instructions on training multiple linear learner models for house price predictions for multiple locations, hosting all the models on a single MME, and controlling access to the individual models. When considering multi-model enabled endpoints, you should balance the cost savings against the latency requirements.<\/p>\n<p>Give Amazon SageMaker MMEs a try and leave your feedback in the comments.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-11239 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/03\/03\/sireesha-muppala-100.jpg\" alt=\"\" width=\"100\" height=\"135\">Sireesha Muppala<\/strong> is an AI\/ML Specialist Solutions Architect at AWS, providing guidance to customers on architecting and implementing machine learning solutions at scale. She received her Ph.D. in Computer Science from the University of Colorado, Colorado Springs. In her spare time, Sireesha loves to run and hike Colorado trails.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-17338 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/21\/michaelpham.jpg\" alt=\"\" width=\"100\" height=\"136\">Michael Pham<\/strong> is a Software Development Engineer on the Amazon SageMaker team. His current work focuses on helping developers efficiently host machine learning models. 
In his spare time he enjoys Olympic weightlifting, reading, and playing chess.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/using-amazon-sagemaker-inference-pipelines-with-multi-model-endpoints\/<\/p>\n","protected":false},"author":0,"featured_media":433,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/432"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=432"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/432\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/433"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=432"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=432"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=432"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}