{"id":1553,"date":"2022-02-14T21:55:47","date_gmt":"2022-02-14T21:55:47","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/02\/14\/automate-a-shared-bikes-and-scooters-classification-model-with-amazon-sagemaker-autopilot\/"},"modified":"2022-02-14T21:55:47","modified_gmt":"2022-02-14T21:55:47","slug":"automate-a-shared-bikes-and-scooters-classification-model-with-amazon-sagemaker-autopilot","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/02\/14\/automate-a-shared-bikes-and-scooters-classification-model-with-amazon-sagemaker-autopilot\/","title":{"rendered":"Automate a shared bikes and scooters classification model with Amazon SageMaker Autopilot"},"content":{"rendered":"<div id=\"\">\n<p><a href=\"https:\/\/aws.amazon.com\/sagemaker\/autopilot\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Autopilot<\/a> makes it possible for organizations to quickly build and deploy an end-to-end machine learning (ML) model and inference pipeline with just a few lines of code or even <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/autopilot-automate-model-development-create-experiment.html\" target=\"_blank\" rel=\"noopener noreferrer\">without any code<\/a> at all with <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/studio.html\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Studio<\/a>. 
Autopilot offloads the heavy lifting of configuring infrastructure and the time it takes to build an entire pipeline, including feature engineering, model selection, and hyperparameter tuning.<\/p>\n<p>In this post, we show how to go from raw data to a robust and fully deployed inference pipeline with Autopilot.<\/p>\n<h2>Solution overview<\/h2>\n<p>We use <a href=\"https:\/\/www.lyft.com\/bikes\/bay-wheels\/system-data\" target=\"_blank\" rel=\"noopener noreferrer\">Lyft\u2019s public dataset on bikesharing<\/a> for this simulation to predict whether or not a user participates in the <a href=\"https:\/\/www.lyft.com\/bikes\/bay-wheels\/bike-share-for-all\" target=\"_blank\" rel=\"noopener noreferrer\">Bike Share for All program<\/a>. This is a simple binary classification problem.<\/p>\n<p>We want to showcase how easy it is to build an automated and real-time inference pipeline to classify users based on their participation in the Bike Share for All program. To this end, we simulate an end-to-end data ingestion and inference pipeline for an imaginary bikeshare company operating in the San Francisco Bay Area.<\/p>\n<p>The architecture is broken down into two parts: the ingestion pipeline and the inference pipeline.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-32850\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/09\/Capture_bike_2.png\" alt=\"\" width=\"1443\" height=\"547\"><\/p>\n<p>We primarily focus on the ML pipeline in the first section of this post, and review the data ingestion pipeline in the second part.<\/p>\n<h2>Prerequisites<\/h2>\n<p>To follow along with this example, complete the following prerequisites:<\/p>\n<ol>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/gs-setup-working-env.html\" target=\"_blank\" rel=\"noopener noreferrer\">Create a new SageMaker notebook instance<\/a>.<\/li>\n<li>Create an <a 
href=\"https:\/\/aws.amazon.com\/kinesis\/data-firehose\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kinesis Data Firehose<\/a> delivery stream with an <a href=\"http:\/\/aws.amazon.com\/lambda\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lambda<\/a> transform function. For instructions, see <a href=\"https:\/\/aws.amazon.com\/blogs\/compute\/amazon-kinesis-firehose-data-transformation-with-aws-lambda\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Kinesis Firehose Data Transformation with AWS Lambda<\/a>. This step is optional and only needed to simulate data streaming.<\/li>\n<\/ol>\n<h2>Data exploration<\/h2>\n<p>Let\u2019s download and visualize the dataset, which is located in a public <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket and static website:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># The dataset is located in a public bucket and static s3 website.\n# https:\/\/www.lyft.com\/bikes\/bay-wheels\/system-data\n\nimport pandas as pd\nimport numpy as np\nimport os\nfrom time import sleep\n\n!wget -q -O '201907-baywheels-tripdata.zip' https:\/\/s3.amazonaws.com\/baywheels-data\/201907-baywheels-tripdata.csv.zip\n!unzip -q -o 201907-baywheels-tripdata.zip\ncsv_file = os.listdir('.')\ndata = pd.read_csv('201907-baywheels-tripdata.csv', low_memory=False)\ndata.head()<\/code><\/pre>\n<\/p><\/div>\n<p>The following screenshot shows a subset of the data before transformation.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-32882\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/10\/image15.png\" alt=\"\" width=\"2738\" height=\"774\"><\/p>\n<p>The last column of the data contains the target we want to predict, which is a binary variable taking either a Yes or No value, indicating whether the user participates in the Bike Share 
for All program.<\/p>\n<p>Let\u2019s take a look at the distribution of our target variable for any data imbalance.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># For plotting\n%matplotlib inline\nimport matplotlib.pyplot as plt\n#!pip install seaborn # If you need this library\nimport seaborn as sns\ndisplay(sns.countplot(x='bike_share_for_all_trip', data=data))<\/code><\/pre>\n<\/p><\/div>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-32880\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/10\/image14-300x189.png\" alt=\"\" width=\"300\" height=\"189\"><\/p>\n<p>As shown in the graph above, the data is imbalanced, with fewer people participating in the program.<\/p>\n<p>We need to balance the data to prevent over-representation bias. This step is optional because Autopilot also offers an internal approach to handle class imbalance automatically, which defaults to an F1 score validation metric. 
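One simple way to rebalance is to undersample the majority class; the idea can be shown with a self-contained sketch on toy data (the toy frame below is illustrative, not the notebook's dataframe):

```python
import pandas as pd

# Toy frame with an imbalanced binary target, mirroring bike_share_for_all_trip.
toy = pd.DataFrame({'bike_share_for_all_trip': ['No'] * 90 + ['Yes'] * 10})

# Undersample the majority class down to the minority class size.
minority = toy[toy['bike_share_for_all_trip'] == 'Yes']
majority = toy[toy['bike_share_for_all_trip'] == 'No'].sample(
    n=len(minority), random_state=0)

# Concatenate and reshuffle so classes are interleaved.
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)

print(balanced['bike_share_for_all_trip'].value_counts().to_dict())
```

The same undersampling idea, with a slightly looser ratio, is what the next code cell applies to the real dataset.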
Additionally, if you choose to balance the data yourself, you can use more advanced techniques for handling class imbalance, such as <a href=\"https:\/\/arxiv.org\/abs\/1106.1813\" target=\"_blank\" rel=\"noopener noreferrer\">SMOTE <\/a>or <a href=\"https:\/\/arxiv.org\/abs\/2008.09202\" target=\"_blank\" rel=\"noopener noreferrer\">GAN<\/a>.<\/p>\n<p>For this post, we downsample the majority class (No) as a data balancing technique:<\/p>\n<p>The following code enriches the data and under-samples the overrepresented class:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">df = data.copy()\ndf.drop(columns=['rental_access_method'], inplace=True)\n\ndf['start_time'] = pd.to_datetime(df['start_time'])\ndf['start_time'] = pd.to_datetime(df['end_time'])\n\n# Adding some day breakdown\ndf = df.assign(day_of_week=df.start_time.dt.dayofweek,\n                            hour_of_day=df.start_time.dt.hour,\n                            trip_month=df.start_time.dt.month)\n# Breaking the day in 4 parts: ['morning', 'afternoon', 'evening']\nconditions = [\n    (df['hour_of_day'] &gt;= 5) &amp; (df['hour_of_day'] &lt; 12),\n    (df['hour_of_day'] &gt;= 12) &amp; (df['hour_of_day'] &lt; 18),\n    (df['hour_of_day'] &gt;= 18) &amp; (df['hour_of_day'] &lt; 21),\n]\nchoices = ['morning', 'afternoon', 'evening']\ndf['part_of_day'] = np.select(conditions, choices, default='night')\ndf.dropna(inplace=True)\n\n# Downsampling the majority to rebalance the data\n# We are getting about an even distribution\ndf.sort_values(by='bike_share_for_all_trip', inplace=True)\nslice_pointe = int(df['bike_share_for_all_trip'].value_counts()['Yes'] * 2.1)\ndf = df[-slice_pointe:]\n# The data is balanced now. Let's reshuffle the data\ndf = df.sample(frac=1).reset_index(drop=True)<\/code><\/pre>\n<\/p><\/div>\n<p>We deliberately left our categorical features not encoded, including our binary target value. 
This is because Autopilot takes care of encoding and decoding the data for us as part of the automatic feature engineering and pipeline deployment, as we see in the next section.<\/p>\n<p>The following screenshot shows a sample of our data.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-large wp-image-32884\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/10\/image16-1024x193.png\" alt=\"\" width=\"1024\" height=\"193\"><\/p>\n<p>The data in the following graphs otherwise looks normal, with a bimodal distribution representing the expected peaks for the morning and afternoon rush hours. We also observe low activity on weekends and at night.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-32881\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/10\/Capture_charts.png\" alt=\"\" width=\"994\" height=\"687\"><\/p>\n<p>In the next section, we feed the data to Autopilot so that it can run an experiment for us.<\/p>\n<h2>Build a binary classification model<\/h2>\n<p>Autopilot requires that we specify the input and output destination buckets. It uses the input bucket to load the data and the output bucket to save the artifacts, such as the feature engineering code and the generated Jupyter notebooks. We retain 5% of the dataset to evaluate and validate the model\u2019s performance after the training is complete and upload 95% of the dataset to the S3 input bucket. See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">import sagemaker\nimport boto3\n\n# Let's define our storage.\n# We will use the default SageMaker bucket and will enforce encryption\n\nbucket = sagemaker.Session().default_bucket()  # SageMaker default bucket. 
\n# Encrypting the bucket\ns3 = boto3.client('s3')\nSSEConfig = {\n        'Rules': [\n            {\n                'ApplyServerSideEncryptionByDefault': {\n                    'SSEAlgorithm': 'AES256',\n                }\n            },\n        ]\n    }\ns3.put_bucket_encryption(Bucket=bucket, ServerSideEncryptionConfiguration=SSEConfig)\n\nprefix = 'sagemaker-automl01'                  # prefix for the bucket\nrole = sagemaker.get_execution_role()          # IAM role object to use by SageMaker\nsagemaker_session = sagemaker.Session()        # SageMaker API\nregion = sagemaker_session.boto_region_name    # AWS Region\n\n# Where we will load our data\ninput_path = \"s3:\/\/{}\/{}\/automl_bike_train_share-1\".format(bucket, prefix)\noutput_path = \"s3:\/\/{}\/{}\/automl_bike_output_share-1\".format(bucket, prefix)\n\n# Splitting the data into train\/test sets.\n# We will use 95% of the data for training and the remainder for testing.\nslice_point = int(df.shape[0] * 0.95)\ntraining_set = df[:slice_point] # 95%\ntesting_set = df[slice_point:]  # 5%\n\n# Just making sure we have split it correctly\nassert training_set.shape[0] + testing_set.shape[0] == df.shape[0]\n\n# Let's save the data locally and upload it to our s3 data location\ntraining_set.to_csv('bike_train.csv')\ntesting_set.to_csv('bike_test.csv', header=False)\n\n# Uploading the training set to the input bucket\nsagemaker.s3.S3Uploader.upload(local_path='bike_train.csv', desired_s3_uri=input_path)<\/code><\/pre>\n<\/p><\/div>\n<p>After we upload the data to the input destination, it\u2019s time to start Autopilot:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from sagemaker.automl.automl import AutoML\n# You give your job a name and provide the s3 path where you uploaded the data\nbike_automl_binary = AutoML(role=role, \n                         target_attribute_name='bike_share_for_all_trip', \n                         output_path=output_path,\n                         
max_candidates=30)\n# Starting the training\nbike_automl_binary.fit(inputs=input_path, \n                       wait=False, logs=False)<\/code><\/pre>\n<\/p><\/div>\n<p>All we need to start experimenting is to call the fit() method. Autopilot needs the input and output S3 locations and the target attribute column as its required parameters. After feature processing, Autopilot calls <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/automatic-model-tuning.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker automatic model tuning<\/a> to find the best version of a model by running many training jobs on your dataset. We added the optional max_candidates parameter to limit the number of candidates to 30, which is the number of training jobs that Autopilot launches with different combinations of algorithms and hyperparameters in order to find the best model. If you don\u2019t specify this parameter, it defaults to 250.<\/p>\n<p>We can observe the progress of Autopilot with the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Let's monitor the progress; this will take a while. Go grab some coffee.\nfrom time import sleep\n\ndef check_job_status():\n    return bike_automl_binary.describe_auto_ml_job()['AutoMLJobStatus']\n\ndef describe_job():\n    return bike_automl_binary.describe_auto_ml_job()\n\nwhile True:\n    print(check_job_status(), describe_job()['AutoMLJobSecondaryStatus'], end='** ')\n    if check_job_status() in [\"Completed\", \"Failed\"]:\n        if check_job_status() == \"Failed\":\n            print(describe_job()['FailureReason'])\n        break\n    sleep(20)<\/code><\/pre>\n<\/p><\/div>\n<p>The training takes some time to complete. 
While it\u2019s running, let\u2019s look at the Autopilot workflow.<br \/><img decoding=\"async\" loading=\"lazy\" class=\" wp-image-32921 aligncenter\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/11\/AmazonSagemaker_refonted_v2-529x1024.png\" alt=\"\" width=\"566\" height=\"1096\"><\/p>\n<p>To find the best candidate, use the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Let's take a look at the best candidate selected by AutoPilot\nfrom IPython.display import JSON\ndef jsonView(obj, rootName=None):\n    return JSON(obj, root=rootName, expanded=True)\n\nbestCandidate = bike_automl_binary.describe_auto_ml_job()['BestCandidate']\ndisplay(jsonView(bestCandidate['FinalAutoMLJobObjectiveMetric'], 'FinalAutoMLJobObjectiveMetric'))<\/code><\/pre>\n<\/p><\/div>\n<p>The following screenshot shows our output.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-32885\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/10\/image17-300x60.png\" alt=\"\" width=\"300\" height=\"60\"><\/p>\n<p>Our model achieved a validation accuracy of 96%, so we\u2019re going to deploy it. We could add a condition such that we only use the model if the accuracy is above a certain level.<\/p>\n<h2>Inference pipeline<\/h2>\n<p>Before we deploy our model, let\u2019s examine our best candidate and what\u2019s happening in our pipeline. 
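The accuracy condition mentioned above can be as simple as a comparison against a threshold (a sketch; in the notebook the metric value would come from `bestCandidate['FinalAutoMLJobObjectiveMetric']['Value']`, and the 0.90 bar is an illustrative choice):

```python
# Sketch: gate deployment on the validation metric clearing a threshold.
# In the notebook, metric_value would be read from
# bestCandidate['FinalAutoMLJobObjectiveMetric']['Value'].
ACCURACY_THRESHOLD = 0.90

def should_deploy(metric_value, threshold=ACCURACY_THRESHOLD):
    """Return True when the candidate's validation metric clears the bar."""
    return metric_value >= threshold

print(should_deploy(0.96))  # this run's 96% validation accuracy clears a 90% bar
```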
See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">display(jsonView(bestCandidate['InferenceContainers'], 'InferenceContainers'))<\/code><\/pre>\n<\/p><\/div>\n<p>The following diagram shows our output.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-32886\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/10\/image18_LI.jpg\" alt=\"\" width=\"1795\" height=\"1314\"><\/p>\n<p>Autopilot built the model and packaged it into three different containers, each sequentially running a specific task: transform, predict, and reverse-transform. This multi-step inference is possible with a <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/inference-pipelines.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker inference pipeline<\/a>.<\/p>\n<p>A multi-step inference can also chain multiple inference models. For instance, one container can perform <a href=\"https:\/\/en.wikipedia.org\/wiki\/Principal_component_analysis\" target=\"_blank\" rel=\"noopener noreferrer\">principal component analysis<\/a> before passing the data to the XGBoost container.<\/p>\n<h2>Deploy the inference pipeline to an endpoint<\/h2>\n<p>The deployment process involves just a few lines of code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># We chose to define an endpoint name.\nfrom datetime import datetime as dt\ntoday = str(dt.today())[:10]\nendpoint_name = 'binary-bike-share-' + today\nendpoint = bike_automl_binary.deploy(initial_instance_count=1,\n                                  instance_type='ml.m5.xlarge',\n                                  endpoint_name=endpoint_name,\n                                  candidate=bestCandidate,\n                                  wait=True)<\/code><\/pre>\n<\/p><\/div>\n<p>Let\u2019s configure our endpoint for prediction with a predictor:<\/p>\n<div 
class=\"hide-language\">\n<pre><code class=\"lang-python\">from sagemaker.serializers import CSVSerializer\nfrom sagemaker.deserializers import CSVDeserializer\ncsv_serializer = CSVSerializer()\ncsv_deserializer = CSVDeserializer()\n# Initialize the predictor\npredictor = sagemaker.predictor.Predictor(endpoint_name=endpoint_name, \n                                                  sagemaker_session=sagemaker.Session(),\n                                                  serializer=csv_serializer,\n                                                  deserializer=csv_deserializer\n                                                  )<\/code><\/pre>\n<\/p><\/div>\n<p>Now that we have our endpoint and predictor ready, it\u2019s time to use the testing data we set aside and test the accuracy of our model. We start by defining a utility function that sends the data one line at a time to our inference endpoint and gets a prediction in return. Because we have an <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/xgboost.html\" target=\"_blank\" rel=\"noopener noreferrer\">XGBoost<\/a> model, we drop the target variable before sending the CSV line to the endpoint. Additionally, we remove the header from the testing CSV before looping through the file, which is another requirement for XGBoost on SageMaker. 
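In isolation, the per-line preparation works like this (the sample line and its column index are illustrative; the real test file carries the target at index 14):

```python
# A sample CSV test line with the target as the last field (illustrative).
line = '0,cola,colb,Yes\n'.split(',')

# Pop the target (here index 3; index 14 in the real test file) and strip the newline.
observed = line.pop(3).strip('\n')
payload = ','.join(line)

print(observed, '|', payload)
```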
See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># The function takes 3 arguments: the file containing the test set,\n# the predictor, and finally the number of lines to send for prediction.\n# It returns a DataFrame with an Inferred and an Observed column.\ndef get_inference(file, predictor, n=1):\n    inferred = []\n    actual = []\n    with open(file, 'r') as csv:\n        for i in range(n):\n            line = csv.readline().split(',')\n            try:\n                # Here we remove the target variable from the csv line before predicting\n                observed = line.pop(14).strip('\\n')\n                actual.append(observed)\n            except IndexError:\n                pass\n            obj = ','.join(line)\n            predicted = predictor.predict(obj)[0][0]\n            inferred.append(predicted)\n    data = {'Inferred': pd.Series(inferred), 'Observed': pd.Series(actual)}\n    return pd.DataFrame(data=data)\n\nn = testing_set.shape[0] # The size of the testing data\ninference_df = get_inference('bike_test.csv', predictor, n)\n\ninference_df['Binary_Result'] = (inference_df['Observed'] == inference_df['Inferred'])\ndisplay(inference_df.head())<\/code><\/pre>\n<\/p><\/div>\n<p>The following screenshot shows our output.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-32889\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/10\/image20-300x183.png\" alt=\"\" width=\"300\" height=\"183\"><\/p>\n<p>Now let\u2019s calculate the accuracy of our model.<br \/><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-medium wp-image-32888\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/10\/image19-300x56.png\" alt=\"\" width=\"300\" height=\"56\"><\/p>\n<p>See the following code:<\/p>\n<div 
class=\"hide-language\">\n<pre><code class=\"lang-python\">count_binary = inference_df['Binary_Result'].value_counts()\naccuracy = count_binary[True]\/n\nprint('Accuracy:', accuracy)<\/code><\/pre>\n<\/p><\/div>\n<p>We get an accuracy of 92%. This is slightly lower than the 96% obtained during the validation step, but it\u2019s still high enough. We don\u2019t expect the accuracy to be exactly the same because the test is performed with a new dataset.<\/p>\n<h2>Data ingestion<\/h2>\n<p>We downloaded the data directly and configured it for training. In real life, you may have to send the data directly from the edge device into the data lake and have SageMaker load it directly from the data lake into the notebook.<\/p>\n<p>Kinesis Data Firehose is a good option and the most straightforward way to reliably load streaming data into data lakes, data stores, and analytics tools. It can capture, transform, and load streaming data into Amazon S3 and other AWS data stores.<\/p>\n<p>For our use case, we create a Kinesis Data Firehose delivery stream with a Lambda transformation function to do some lightweight data cleaning as it traverses the stream. 
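Firehose delivers records to the transform function base64-encoded and expects base64-encoded data back; the round trip at the core of the transform can be sketched as follows (the sample payload and record ID are illustrative):

```python
import base64

# A Firehose record's 'data' field arrives base64-encoded (sample payload).
incoming = base64.b64encode(b'trip_id,duration_sec\n123,600\n').decode('utf-8')

# Decode, apply the cleaning step (a stand-in here), and re-encode.
payload = base64.b64decode(incoming).decode('utf-8')
cleaned = payload.strip()  # stand-in for the real transformation
outgoing = base64.b64encode(cleaned.encode('utf-8')).decode('utf-8')

# This is the shape the transform must return for each record.
output_record = {'recordId': 'sample-id', 'result': 'Ok', 'data': outgoing}
print(base64.b64decode(output_record['data']).decode('utf-8'))
```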
See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Data processing libraries\nimport pandas as pd\nimport numpy as np\nimport base64\nfrom io import StringIO\n\n\ndef lambda_handler(event, context):\n    output = []\n    print('Received', len(event['records']), 'Records')\n    for record in event['records']:\n\n        payload = base64.b64decode(record['data']).decode('utf-8')\n        df = pd.read_csv(StringIO(payload), index_col=0)\n\n        df.drop(columns=['rental_access_method'], inplace=True)\n\n        df['start_time'] = pd.to_datetime(df['start_time'])\n        df['end_time'] = pd.to_datetime(df['end_time'])\n\n        # Adding some day breakdown\n        df = df.assign(day_of_week=df.start_time.dt.dayofweek,\n                                 hour_of_day=df.start_time.dt.hour,\n                                 trip_month=df.start_time.dt.month)\n        # Breaking the day into 4 parts: morning, afternoon, evening, and night (the default)\n        conditions = [\n            (df['hour_of_day'] &gt;= 5) &amp; (df['hour_of_day'] &lt; 12),\n            (df['hour_of_day'] &gt;= 12) &amp; (df['hour_of_day'] &lt; 18),\n            (df['hour_of_day'] &gt;= 18) &amp; (df['hour_of_day'] &lt; 21),\n        ]\n        choices = ['morning', 'afternoon', 'evening']\n        df['part_of_day'] = np.select(conditions, choices, default='night')\n        df.dropna(inplace=True)\n\n        # Downsampling the majority class to rebalance the data\n        # We are getting about an even distribution\n        df.sort_values(by='bike_share_for_all_trip', inplace=True)\n        slice_point = int(df['bike_share_for_all_trip'].value_counts()['Yes'] * 2.1)\n        df = df[-slice_point:]\n        # The data is balanced now. 
Let's reshuffle the data\n        df = df.sample(frac=1).reset_index(drop=True)\n\n        data = base64.b64encode(bytes(df.to_csv(), 'utf-8')).decode(\"utf-8\")\n        output_record = {\n            'recordId': record['recordId'],\n            'result': 'Ok',\n            'data': data\n        }\n        output.append(output_record)\n    print('Returned', len(output), 'Records')\n    print('Event', event)\n\n    return {'records': output}<\/code><\/pre>\n<\/p><\/div>\n<p>This Lambda function performs a light transformation of the data streamed from the devices into the data lake. It expects CSV-formatted data.<\/p>\n<p>For the ingestion step, we download the data and simulate a data stream through Kinesis Data Firehose, with a Lambda transform function, into our S3 data lake.<\/p>\n<p>Let\u2019s simulate streaming a few lines:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Saving the data in one file.\nfile = '201907-baywheels-tripdata.csv'\ndata.to_csv(file)\n\nimport random\n# Firehose client used by the streamer below\nclient = boto3.client('firehose')\n\n# Stream the data 'n' lines at a time.\n# Only run this for a minute and stop the cell\ndef streamer(file, n):\n    with open(file, 'r') as csvfile:\n        header = next(csvfile)\n        data = header\n        counter = 0\n        loop = True\n        while loop:\n            for i in range(n):\n                line = csvfile.readline()\n                data += line\n                # We reached the end of the csv file.\n                if line == '':\n                    loop = False\n            counter += n\n            # Use your Kinesis Data Firehose delivery stream name\n            stream = client.put_record(DeliveryStreamName='firehose12-DeliveryStream-OJYW71BPYHF2', Record={\"Data\": bytes(data, 'utf-8')})\n            data = header\n            print(file, 'HTTPStatusCode: ' + str(stream['ResponseMetadata']['HTTPStatusCode']), 'csv_lines_sent: ' + str(counter), end=' -*- ')\n\n            sleep(random.randrange(1, 3))\n        return\n# Streaming for 500 lines at 
a time. You can change this number up and down.\nstreamer(file, 500)\n\n# We can now load our data as a DataFrame because it\u2019s streamed into the S3 data lake.\n# Getting data from the s3 location where it was streamed.\nfrom io import StringIO\nSTREAMED_DATA = 's3:\/\/firehose12-deliverybucket-11z0ya3patrod\/firehose\/2020'\ncsv_uri = sagemaker.s3.S3Downloader.list(STREAMED_DATA)\nin_memory_string = [sagemaker.s3.S3Downloader.read_file(file) for file in csv_uri]\nin_memory_csv = [pd.read_csv(StringIO(file), index_col=0) for file in in_memory_string]\nstreamed_df = pd.concat(in_memory_csv)\ndisplay(streamed_df.tail())<\/code><\/pre>\n<\/p><\/div>\n<h2>Clean up<\/h2>\n<p>It\u2019s important to delete all the resources used in this exercise to minimize cost. The following code deletes the SageMaker inference endpoint we created as well as the training and testing data we uploaded:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\"># Delete the inference endpoint\npredictor.delete_endpoint()\n\n# Delete the s3 data\ns3 = boto3.resource('s3')\nml_bucket = sagemaker.Session().default_bucket()\ndelete_data = s3.Bucket(ml_bucket).objects.filter(Prefix=prefix).delete()<\/code><\/pre>\n<\/p><\/div>\n<h2>Conclusion<\/h2>\n<p>ML engineers, data scientists, and software developers can use Autopilot to build and deploy an inference pipeline with little to no ML programming experience. Autopilot saves time and resources, using data science and ML best practices. Large organizations can now shift engineering resources away from infrastructure configuration towards improving models and solving business use cases. 
Startups and smaller organizations can get started on machine learning with little to no ML expertise.<\/p>\n<p>We recommend learning more about other important features SageMaker has to offer, such as the <a href=\"https:\/\/aws.amazon.com\/sagemaker\/feature-store\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Feature Store<\/a>, which integrates with <a href=\"https:\/\/aws.amazon.com\/sagemaker\/pipelines\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Pipelines<\/a> to create, add feature search and discovery, and reuse automated ML workflows. You can run multiple Autopilot simulations with different feature or target variants in your dataset. You could also approach this as a dynamic vehicle allocation problem in which your model tries to predict vehicle demand based on time (such as time of day or day of the week) or location, or a combination of both.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-full wp-image-32898\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/10\/blog_image-2.jpg\" alt=\"\" width=\"100\" height=\"108\"><a href=\"https:\/\/www.linkedin.com\/in\/dougmbaya\/\"><strong>Doug Mbaya<\/strong><\/a> is a Senior Solution architect with a focus in data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solution in the cloud. 
Doug\u2019s prior experience includes\u00a0 supporting AWS customers in the ride sharing and food delivery segment.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-full wp-image-32490\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/01\/26\/vperrone.png\" alt=\"\" width=\"100\" height=\"133\"><strong><a class=\"c-link\" href=\"https:\/\/www.linkedin.com\/in\/valerio-perrone-391731132\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-stringify-link=\"https:\/\/www.linkedin.com\/in\/valerio-perrone-391731132\/\" data-sk=\"tooltip_parent\" data-remove-tab-index=\"true\" aria-describedby=\"sk-tooltip-7235\">Valerio Perrone<\/a><\/strong> is an Applied Science Manager working on Amazon SageMaker Automatic Model Tuning and Autopilot.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/automate-a-shared-bikes-and-scooters-classification-model-with-amazon-sagemaker-autopilot\/<\/p>\n","protected":false},"author":0,"featured_media":1554,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1553"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1553"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1553\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1554"}],"wp:attachment":[{"href":"https:\/\/salarydistribu
tion.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1553"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1553"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1553"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}