{"id":434,"date":"2020-10-21T22:47:09","date_gmt":"2020-10-21T22:47:09","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/10\/21\/time-series-forecasting-using-unstructured-data-with-amazon-forecast-and-the-amazon-sagemaker-neural-topic-model\/"},"modified":"2020-10-21T22:47:09","modified_gmt":"2020-10-21T22:47:09","slug":"time-series-forecasting-using-unstructured-data-with-amazon-forecast-and-the-amazon-sagemaker-neural-topic-model","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/10\/21\/time-series-forecasting-using-unstructured-data-with-amazon-forecast-and-the-amazon-sagemaker-neural-topic-model\/","title":{"rendered":"Time series forecasting using unstructured data with Amazon Forecast and the Amazon SageMaker Neural Topic Model"},"content":{"rendered":"<div id=\"\">\n<p>As the volume of unstructured data such as text and voice continues to grow, businesses are increasingly looking for ways to incorporate this data into their time series predictive modeling workflows. One example use case is transcribing calls from call centers to forecast call handle times and improve call volume forecasting. In the retail or media industry, companies are interested in using related information about products or content to forecast popularity of existing or new products or content from unstructured information such as product type, description, audience reviews, or social media feeds. However, combining this unstructured data with time series is challenging because most traditional time series models require numerical inputs for forecasting. 
In this post, we describe how you can combine <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> with <a href=\"https:\/\/aws.amazon.com\/forecast\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Forecast<\/a> to incorporate unstructured text data into your time series use cases.<\/p>\n<h2>Solution overview<\/h2>\n<p>For our use case, we predict the popularity of news articles based on their topics over a 15-day horizon. You first download and preprocess the data and then run the Neural Topic Model (NTM) algorithm to generate topic vectors. After generating the topic vectors, you save them and use these vectors as a related time series to create the forecast.<\/p>\n<p>The following diagram illustrates the architecture of this solution.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17262 size-full\" title=\"Solution architecture\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/1-SolutionArchitecture.jpg\" alt=\"\" width=\"900\" height=\"461\"><\/p>\n<h2>AWS services<\/h2>\n<p>Forecast is a fully managed service that uses machine learning (ML) to generate highly accurate forecasts without requiring any prior ML experience. Forecast is applicable in a wide variety of use cases, including energy demand forecasting, estimating product demand, workforce planning, and computing cloud infrastructure usage.<\/p>\n<p>With Forecast, there are no servers to provision or ML models to build manually. Additionally, you only pay for what you use, and there is no minimum fee or upfront commitment. To use Forecast, you only need to provide historical data for what you want to forecast, and, optionally, any related data that you believe may impact your forecasts. This related data may include time-varying data (such as price, events, and weather) and categorical data (such as color, genre, or region). 
The service automatically trains and deploys ML models based on your data and provides you with a custom API to retrieve forecasts.<\/p>\n<p>Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. The <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/ntm.html\" target=\"_blank\" rel=\"noopener noreferrer\">Neural Topic Model<\/a> (NTM) algorithm is an unsupervised learning algorithm that can organize a collection of documents into topics that contain word groupings based on their statistical distribution. For example, documents that contain frequent occurrences of words such as \u201cbike,\u201d \u201ccar,\u201d \u201ctrain,\u201d \u201cmileage,\u201d and \u201cspeed\u201d are likely to share a topic on \u201ctransportation.\u201d You can use topic modeling to classify or summarize documents based on the topics detected. You can also use it to retrieve information and recommend content based on topic similarities.<\/p>\n<p>The derived topics that NTM learns are characterized as a latent representation because they are inferred from the observed word distributions in the collection. The semantics of topics are usually inferred by examining the top ranking words they contain. Because the method is unsupervised, only the number of topics, not the topics themselves, are pre-specified. In addition, the topics aren\u2019t guaranteed to align with how a human might naturally categorize documents. 
NTM is one of the built-in algorithms you can train and deploy using Amazon SageMaker.<\/p>\n<h2>Prerequisites<\/h2>\n<p>To follow along with this post, you must create the following:<\/p>\n<ul>\n<li>An IAM role with permissions for Amazon SageMaker, Forecast, and Amazon S3<\/li>\n<li>An Amazon SageMaker notebook instance<\/li>\n<li>An S3 bucket to store your datasets<\/li>\n<\/ul>\n<p>To create the aforementioned resources and clone the <code>forecast-samples<\/code> GitHub repo into the notebook instance, launch the following <a href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"noopener noreferrer\">AWS CloudFormation<\/a> stack:<\/p>\n<p><a href=\"https:\/\/console.aws.amazon.com\/cloudformation\/home#\/stacks\/new?stackName=ForecastDemo&amp;templateURL=https:\/\/chriskingpartnershare.s3.amazonaws.com\/ForecastDemo.yaml\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-16174\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/22\/LaunchStack.jpg\" alt=\"\" width=\"144\" height=\"27\"><\/a><\/p>\n<p>In the <strong>Parameters<\/strong> section, enter unique names for your S3 bucket and notebook and leave all other settings at their default.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17263 size-full\" title=\"Specify stack details\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/2-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"595\"><\/p>\n<p>When the CloudFormation script is complete, you can view the created resources on the <strong>Resources<\/strong> tab of the stack.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17264 size-full\" title=\"Viewing created resources\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/3-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"396\"><\/p>\n<p>Navigate to Amazon SageMaker and open the notebook instance created from the CloudFormation template. 
Open Jupyter and continue to the <code>\/notebooks\/blog_materials\/Time_Series_Forecasting_with_Unstructured_Data_and_Amazon_SageMaker_Neural_Topic_Model\/<\/code> folder and start working your way through the notebooks.<\/p>\n<h3>Creating the resources manually<\/h3>\n<p>For the sake of completeness, we explain in detail the steps necessary to create the resources that the CloudFormation script creates automatically.<\/p>\n<ol>\n<li>Create an IAM role that does the following:\n<ol type=\"a\">\n<li>Has permission to access Forecast and Amazon S3 to store the training and test datasets.<\/li>\n<li>Has an attached trust policy to give Amazon SageMaker permission to assume the role.<\/li>\n<li>Allows Forecast to access Amazon S3 to pull the stored datasets into Forecast.<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p>For more information, see <a href=\"https:\/\/docs.aws.amazon.com\/forecast\/latest\/dg\/aws-forecast-iam-roles.html\" target=\"_blank\" rel=\"noopener noreferrer\">Set Up Permissions for Amazon Forecast<\/a>.<\/p>\n<ol start=\"2\">\n<li>\n<a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/howitworks-create-ws.html\" target=\"_blank\" rel=\"noopener noreferrer\">Create an Amazon SageMaker notebook instance<\/a>.<\/li>\n<li>Attach the IAM role you created for Amazon SageMaker to this notebook instance.<\/li>\n<li>\n<a href=\"https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/gsg\/CreatingABucket.html\" target=\"_blank\" rel=\"noopener noreferrer\">Create an S3 bucket<\/a> to store your datasets and outputs.<\/li>\n<li>Copy the ARN of the bucket to use in the accompanying Jupyter notebook.<\/li>\n<\/ol>\n<p>This project consists of three notebooks, available in the <a href=\"https:\/\/github.com\/aws-samples\/amazon-forecast-samples\/tree\/master\/notebooks\/blog_materials\/Time_Series_Forecasting_with_Unstructured_Data_and_Amazon_SageMaker_Neural_Topic_Model\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a>. 
They cover the following:<\/p>\n<ul>\n<li>Preprocessing the dataset<\/li>\n<li>NTM with Amazon SageMaker<\/li>\n<li>Using Amazon Forecast to predict the topic\u2019s popularity on various social media platforms going forward<\/li>\n<\/ul>\n<h2>Training and deploying the forecast<\/h2>\n<p>In the first notebook, <code>1_preprocess.ipynb<\/code>, you download the <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/News+Popularity+in+Multiple+Social+Media+Platforms\" target=\"_blank\" rel=\"noopener noreferrer\">News Popularity in Multiple Social Media Platforms<\/a> dataset from the University of California Irvine (UCI) Machine Learning Repository using the requests library [1]. The following screenshot shows a sample of the dataset, where we have anonymized the topic names without loss of generality. It consists of news articles and their popularity on various social channels.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17265 size-full\" title=\"News Popularity in Multiple Social Media Platforms dataset\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/4-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"216\"><\/p>\n<p>Because we\u2019re focused on predictions based on the Headline and Title columns, we drop the Source and IDLink columns. We examine the current state of the data with a simple histogram plot. 
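As an illustrative sketch of this step (the dataframe below is synthetic; the notebook uses the column from the downloaded dataset instead), bucketing the Facebook popularity counts reveals the skew:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Facebook popularity column; the notebook
# reads these values from the downloaded dataset instead.
rng = np.random.default_rng(0)
df = pd.DataFrame({'Facebook': rng.lognormal(mean=2.0, sigma=1.5, size=1000)})

# Bucket the popularity values; heavy right skew shows up as most of the
# mass in the first few bins followed by a long, sparse tail.
counts, edges = np.histogram(df['Facebook'], bins=50)
```

Plotting `counts` against `edges` (for example, with `df['Facebook'].plot.hist(bins=50)`) produces histograms like the ones that follow.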
The following plot depicts the popularity of a subset of articles on Facebook.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17266 size-full\" title=\"Popularity of a subset of articles on Facebook\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/5-Graph.jpg\" alt=\"\" width=\"900\" height=\"624\"><\/p>\n<p>The following plot depicts the popularity of a subset of articles on GooglePlus.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17267 size-full\" title=\"Popularity of a subset of articles on GooglePlus\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/6-Graph.jpg\" alt=\"\" width=\"900\" height=\"620\"><\/p>\n<p>The distributions are heavily skewed towards a very small number of views; however, there are a few outlier articles that have an extremely high popularity.<\/p>\n<h3>Preprocessing the data<\/h3>\n<p>As the plots show, the popularity of the articles is extremely skewed. To build a usable time series for ML, we first convert the <code>PublishDate<\/code> column, which is read in as a string type, to a datetime type using the Pandas <code>to_datetime<\/code> method:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">df['PublishDate'] = pd.to_datetime(df['PublishDate'], infer_datetime_format=True)<\/code><\/pre>\n<\/div>\n<p>We then group by topic and save the preprocessed CSV to be used by the next notebook, <code>2_NTM.ipynb<\/code>. In the directory <code>\/data<\/code>, you should see a file called NewsRatingsdataset.csv. You can now move to the next notebook, where you build a neural topic model to extract topic vectors from the processed dataset.<\/p>\n<p>Before creating the topic model, it\u2019s helpful to explore the data some more. 
In the following code, we plot the daily time series for the popularity of a given topic across the three social media channels, as well as a daily time series for the sentiment of a topic based on news article titles and headlines:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">topic = 1 # Change this to any of [0, 1, 2, 3]\r\nsubdf = df[(df['Topic']==topic)&amp;(df['PublishDate']&gt;START_DATE)]\r\nsubdf = subdf.reset_index().set_index('PublishDate')\r\nsubdf.index = pd.to_datetime(subdf.index)\r\nsubdf.head()\r\nsubdf[[\"LinkedIn\", 'GooglePlus', 'Facebook']].resample(\"1D\").mean().dropna().plot(figsize=(15, 4))\r\nsubdf[[\"SentimentTitle\", 'SentimentHeadline']].resample(\"1D\").mean().dropna().plot(figsize=(15, 4))\r\n<\/code><\/pre>\n<\/div>\n<p>The following are the plots for the topic <code>Topic_1<\/code>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17268 size-full\" title=\"Topic_1 plots\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/7-TwinCharts.jpg\" alt=\"\" width=\"900\" height=\"564\"><\/p>\n<p>The dataset still needs a bit more cleaning before it\u2019s ready for the NTM algorithm to use. Not much data exists before October 13, 2015, so you can drop the data before that date and reset the indexes accordingly. Moreover, some of the headlines and ratings contain missing values, denoted by <code>NaN<\/code> and<code> -1<\/code>, respectively. You can use regex to find and replace those headlines with empty strings and convert these ratings to zeros. There is a difference in scale for the popularity of a topic on Facebook vs. LinkedIn vs. GooglePlus. For this post, you focus on forecasting popularity on Facebook only.<\/p>\n<h2>Topic modeling<\/h2>\n<p>Now you use the built-in NTM algorithm on Amazon SageMaker to extract topics from the news headlines. 
When preparing a corpus of documents for NTM, you must clean and standardize the data by converting the text to lowercase, removing stop words, removing any numeric characters that may not be meaningful to your corpus, and tokenizing the document\u2019s text.<\/p>\n<p>We use the <a href=\"https:\/\/www.nltk.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Natural Language Toolkit<\/a> and sklearn Python libraries to convert the headlines into tokens and create vectors of token counts. Also, we drop the <code>Title<\/code> column in the dataframe, but store the titles in a separate dataframe. This is because the <code>Headline<\/code> column contains information similar to the <code>Title<\/code> column, but the headlines are longer and more descriptive, and we want to use the titles later on as a validation set for our NTM during training.<\/p>\n<p>Lastly, we cast the vectors into a sparse array to reduce memory usage, because the bag-of-words matrix can quickly become quite large and memory intensive. For more information, see the notebook or <a href=\"https:\/\/aws.amazon.com\/getting-started\/hands-on\/semantic-content-recommendation-system-amazon-sagemaker\/4\/\" target=\"_blank\" rel=\"noopener noreferrer\">Build a semantic content recommendation system with Amazon SageMaker<\/a>.<\/p>\n<h3>Training an NTM<\/h3>\n<p>To extract text vectors, you convert each headline into a 20 (<code>NUM_TOPICS<\/code>)-dimensional topic vector. This can be viewed as an effective lower-dimensional embedding of all the text in the corpus into a predefined number of topics. Each topic has a representation as a vector, and related topics have similar vector representations. This topic is a derived topic and is not to be confused with the original <code>Topic<\/code> field in the raw dataset. 
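The text preparation described earlier (lowercasing, stop-word removal, dropping numeric tokens, and building a sparse bag-of-words matrix) can be sketched with sklearn's `CountVectorizer`; the settings below are illustrative and not necessarily the notebook's exact configuration:

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up headlines standing in for the real corpus.
headlines = [
    'Stocks rally as tech giants report record 2016 earnings',
    'Economy data shows growth slowing in 3 major markets',
]

# Lowercase the text, drop English stop words, and keep only alphabetic
# tokens of two or more letters, which also discards numeric tokens.
vectorizer = CountVectorizer(lowercase=True,
                             stop_words='english',
                             token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
bow = vectorizer.fit_transform(headlines)  # returned as a scipy sparse matrix
vocab = set(vectorizer.vocabulary_)
```

The sparse `bow` matrix is what keeps memory usage manageable as the vocabulary grows.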
Assuming that there is some correlation between topics from one day to the next (for example, the top topics don\u2019t change very frequently on a daily basis), you can represent all the text in the dataset as a collection of 20 topics.<\/p>\n<p>You then set the training dataset and trained model artifact location in Amazon S3 and upload the data. To train the model, you can use one or more instances (specified by the parameter <code>train_instance_count<\/code>) and choose a strategy to either fully replicate the data on each instance or use <code>ShardedByS3Key<\/code>, which only puts certain data shards on each instance. This speeds up training at the cost of each instance only seeing a fraction of the data.<\/p>\n<p>To reduce overfitting, it\u2019s helpful to introduce a validation dataset in addition to the training dataset. The hyperparameter <code>num_patience_epochs<\/code> controls the early stopping behavior, which makes sure the training is stopped if the change in the loss is less than the specified tolerance (set by <code>tolerance<\/code>) consistently for <code>num_patience_epochs<\/code>. The <code>epochs<\/code> hyperparameter specifies the total number of epochs to run the job. 
For this post, we chose hyperparameters to balance the tradeoff between accuracy and training time:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">%%time\r\nfrom sagemaker.session import s3_input\r\n\r\nsess = sagemaker.Session()\r\nntm = sagemaker.estimator.Estimator(container,\r\n                                    role, \r\n                                    train_instance_count=1, \r\n                                    train_instance_type='ml.c4.xlarge',\r\n                                    output_path=output_path,\r\n                                    sagemaker_session=sess)\r\nntm.set_hyperparameters(num_topics=NUM_TOPICS, feature_dim=vocab_size, mini_batch_size=128, \r\n                        epochs=100, num_patience_epochs=5, tolerance=0.001)\r\ns3_train = s3_input(s3_train_data, distribution='FullyReplicated') \r\nntm.fit({'train': s3_train})\r\n<\/code><\/pre>\n<\/div>\n<p>To further improve the model performance, you can take advantage of <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/automatic-model-tuning-how-it-works.html\" target=\"_blank\" rel=\"noopener noreferrer\">hyperparameter tuning<\/a> in Amazon SageMaker.<\/p>\n<h3>Deploying and testing the model<\/h3>\n<p>To generate the feature vectors for the headlines, you first deploy the model and run inferences on the entire training dataset to obtain the topic vectors. An alternative option is to run a <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/batch-transform.html\" target=\"_blank\" rel=\"noopener noreferrer\">batch transform job<\/a>.<\/p>\n<p>To ensure that the topic model works as expected, we extract topic vectors from the titles and check whether the topic distribution of each title is similar to that of the corresponding headline. Remember that the model hasn\u2019t seen the titles before. 
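The comparison that follows relies on cosine similarity, the dot product of two vectors normalized by their magnitudes; a minimal sketch with illustrative topic vectors:

```python
import numpy as np

def cosine_sim(a, b):
    # 1.0 means the two topic vectors point in the same direction;
    # values near 0 mean the topic distributions are unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional topic weights for a title and its headline.
title_vec = np.array([0.7, 0.1, 0.1, 0.1])
headline_vec = np.array([0.6, 0.2, 0.1, 0.1])
sim = cosine_sim(title_vec, headline_vec)  # close to 1.0 for similar vectors
```

In the notebook, the same quantity is computed with sklearn's `cosine_similarity` over the model's 20-dimensional topic vectors.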
As a measure of similarity, we compute the cosine similarity for a random title and associated headline. A high cosine similarity indicates that titles and headlines have a similar representation in this low-dimensional embedding space.<\/p>\n<p>You can also use the cosine similarity between title and headline as a feature: well-written titles that correlate well with the actual headline may obtain a higher popularity score. You could use this to check if titles and headlines represent the content of the document accurately, but we don\u2019t explore this further in this notebook [2].<\/p>\n<p>Finally, you store the headlines mapped onto the extracted <code>NUM_TOPICS<\/code> (20) topic dimensions back into a dataframe and save the dataframe as preprocessed_data.csv in <code>data\/<\/code> for use in subsequent notebooks.<\/p>\n<p>The following code tests the vector similarity:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from sklearn.metrics.pairwise import cosine_similarity\r\n\r\nntm_predictor = ntm.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')\r\ntopic_data = np.array(topic_vectors.tocsr()[:10].todense())\r\ntopic_vecs = []\r\nfor index in range(10):\r\n    results = ntm_predictor.predict(topic_data[index])\r\n    predictions = np.array([prediction['topic_weights'] for prediction in results['predictions']])\r\n    topic_vecs.append(predictions)\r\ncomparisonvec = []\r\nfor idx in range(10):\r\n    comparisonvec.append([df.Headline[idx], title_column[idx], cosine_similarity(topic_vecs[idx], [pred_array_cc[idx]])[0][0]])\r\npd.DataFrame(comparisonvec, columns=['Headline', 'Title', 'CosineSimilarity'])\r\n<\/code><\/pre>\n<\/div>\n<p>The following screenshot shows the output.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17269 size-full\" title=\"Output\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/8-Output.jpg\" alt=\"\" width=\"900\" height=\"400\"><\/p>\n<h3>Visualizing headlines<\/h3>\n<p>Another way to visualize the results is to plot a T-SNE graph. T-SNE is a nonlinear embedding technique that tries to match the nearest-neighbor joint probability distribution in the high-dimensional space (for this use case, <code>NUM_TOPICS<\/code> dimensions) with the equivalent joint distribution in a lower-dimensional (here, two-dimensional) space by minimizing a loss known as the Kullback-Leibler divergence [3]. Essentially, this is a dimensionality reduction technique that maps high-dimensional vectors to a lower-dimensional space.<\/p>\n<p>Computing the T-SNE can take quite some time, especially for large datasets, so we shuffle the dataset and extract only 10,000 headline embeddings for the T-SNE plot. For more information about the advantages and pitfalls of using T-SNE in topic modeling, see <a href=\"https:\/\/distill.pub\/2016\/misread-tsne\/\" target=\"_blank\" rel=\"noopener noreferrer\">How to Use t-SNE Effectively<\/a>.<\/p>\n<p>The following T-SNE plot shows a few large topics (indicated by the similar color clusters\u2014red, green, purple, blue, and brown), which is consistent with the dataset containing four primary topics. 
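The projection step can be sketched as follows, with randomly generated stand-ins for the NUM_TOPICS-dimensional topic vectors (the notebook uses the real headline embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE

NUM_TOPICS = 20
rng = np.random.default_rng(42)
topic_vecs = rng.random((200, NUM_TOPICS))  # placeholder headline embeddings

# Project the 20-dimensional vectors down to 2-D for plotting; perplexity
# must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30, init='random', random_state=0)
embedding_2d = tsne.fit_transform(topic_vecs)
```

Scattering `embedding_2d` and coloring each point by its dominant topic yields plots like the one below.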
But by expanding the dimensionality of the topic vectors to <code>NUM_TOPICS = 20<\/code>, we allow the NTM model to capture more semantic information across the headlines than a single topic label can.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17353 size-full\" title=\"T-SNE plot\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/21\/9-Map_Resized.jpg\" alt=\"\" width=\"378\" height=\"327\"><\/p>\n<p>With our topic modeling complete and our data saved, you can now delete the endpoint to avoid incurring any charges.<\/p>\n<h2>Forecasting topic popularity<\/h2>\n<p>Now you run the third and final notebook, where you use the Forecast DeepAR+ algorithm to forecast the popularity of the topics. First, you establish a Forecast session using the Forecast SDK. It\u2019s very important that your bucket is in the same Region as the session.<\/p>\n<p>After this step, you read preprocessed_data.csv into a dataframe for some additional preprocessing. Drop the <code>Headline<\/code> column and replace the index of the dataframe with the publish date of the news article. You do this so you can easily aggregate the data on a daily basis. The following screenshot shows your results.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17270 size-full\" title=\"Results\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/10-Screenshot.jpg\" alt=\"\" width=\"900\" height=\"180\"><\/p>\n<h3>Creating the target and related time series<\/h3>\n<p>For this post, you want to forecast the Facebook ratings for each of the four topics in the <code>Topic<\/code> column of the dataset. 
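To make the expected layout concrete, here is a hedged pandas sketch of assembling the two CSV files used in this section; the small dataframe and column names are assumptions standing in for the daily aggregated data:

```python
import pandas as pd

# Hypothetical daily aggregates: one row per (timestamp, topic).
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2015-11-01', '2015-11-01', '2015-11-02']),
    'Topic': ['Topic_0', 'Topic_1', 'Topic_0'],
    'Facebook': [10.0, 3.0, 7.0],
    'SentimentHeadline': [0.1, -0.2, 0.0],
})

# Target series: item_id, timestamp, and the value to forecast.
target = df.rename(columns={'Topic': 'item_id', 'Facebook': 'value'})
target = target[['item_id', 'timestamp', 'value']]
target.to_csv('target_time_series.csv', header=False, index=False)

# Related series: item_id, timestamp, then the dynamic features.
related = df.rename(columns={'Topic': 'item_id'})
related = related[['item_id', 'timestamp', 'SentimentHeadline']]
related.to_csv('related_time_series.csv', header=False, index=False)
```

The files are written without a header row because Forecast's import relies on the schema definition (shown later) rather than column headers.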
In Forecast, we need to define a target time series that consists of the item ID, timestamp, and the value we want to forecast.<\/p>\n<p>Additionally, as of this writing, you can provide a related time series with up to 13 dynamic features; in our use case, these are the <code>SentimentHeadline<\/code> and the topic vectors. Because we can include at most 13 features in Forecast, we choose 10 of the 20 topic vector dimensions to illustrate building the Forecast model. Currently, the CNN-QR, DeepAR+ (which we use in this post), and Prophet algorithms support related time series.<\/p>\n<p>As before, we start forecasting from <code>2015-11-01<\/code> and end our training data at <code>2016-06-21<\/code>. Using this, we forecast for 15 days into the future. The following screenshot shows our target time series.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17354 size-full\" title=\"Target time series\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/21\/11-Smallscreensot_Resized.jpg\" alt=\"\" width=\"300\" height=\"222\"><\/p>\n<p>The following screenshot shows our related time series.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17271 size-full\" title=\"Related time series\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/12-Screen.jpg\" alt=\"\" width=\"900\" height=\"240\"><\/p>\n<p>Upload the datasets to the S3 bucket.<\/p>\n<h3>Defining the dataset schemas and dataset group to ingest into Forecast<\/h3>\n<p>Forecast has several predefined domains that come with predefined schemas for data ingestion. Because we\u2019re interested in web traffic, you can choose the <code>WEB_TRAFFIC<\/code> domain. 
For more information about dataset domains, see <a href=\"https:\/\/docs.aws.amazon.com\/forecast\/latest\/dg\/howitworks-domains-ds-types.html\" target=\"_blank\" rel=\"noopener noreferrer\">Predefined Dataset Domains and Dataset Types<\/a>.<\/p>\n<p>This provides a predefined schema and attribute types for the attributes you include in the target and related time series. The <code>WEB_TRAFFIC<\/code> domain doesn\u2019t have item metadata; only target and related time series data is allowed.<\/p>\n<p>Define the schema for the target time series with the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\"># Set the dataset name to a new unique value. If it already exists, go to the Forecast console and delete any existing\r\n# dataset ARNs and datasets.\r\n\r\ndatasetName = 'webtraffic_forecast_NLP'\r\n\r\nschema ={\r\n   \"Attributes\":[\r\n      {\r\n         \"AttributeName\":\"item_id\",\r\n         \"AttributeType\":\"string\"\r\n      },    \r\n       {\r\n         \"AttributeName\":\"timestamp\",\r\n         \"AttributeType\":\"timestamp\"\r\n      },\r\n      {\r\n         \"AttributeName\":\"value\",\r\n         \"AttributeType\":\"float\"\r\n      }      \r\n   ]\r\n}\r\n\r\ntry:\r\n    response = forecast.create_dataset(\r\n                    Domain=\"WEB_TRAFFIC\",\r\n                    DatasetType='TARGET_TIME_SERIES',\r\n                    DatasetName=datasetName,\r\n                    DataFrequency=DATASET_FREQUENCY, \r\n                    Schema = schema\r\n                   )\r\n    datasetArn = response['DatasetArn']\r\n    print('Success')\r\nexcept Exception as e:\r\n    print(e)\r\n    datasetArn = 'arn:aws:forecast:{}:{}:dataset\/{}'.format(REGION_NAME, ACCOUNT_NUM, datasetName)\r\n<\/code><\/pre>\n<\/div>\n<p>Define the schema for the related time series with the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code 
class=\"lang-python\"># Set the dataset name to a new unique value. If it already exists, go to the Forecast console and delete any existing\r\n# dataset ARNs and datasets.\r\n\r\ndatasetName = 'webtraffic_forecast_related_NLP'\r\nschema ={\r\n   \"Attributes\":[{\r\n         \"AttributeName\":\"item_id\",\r\n         \"AttributeType\":\"string\"\r\n      }, \r\n       {\r\n         \"AttributeName\":\"timestamp\",\r\n         \"AttributeType\":\"timestamp\"\r\n      },\r\n       {\r\n         \"AttributeName\":\"SentimentHeadline\",\r\n         \"AttributeType\":\"float\"\r\n      }]\r\n    + \r\n      [{\r\n         \"AttributeName\":\"Headline_{}\".format(x),\r\n         \"AttributeType\":\"float\"\r\n      } for x in range(10)] \r\n}\r\n\r\ntry:\r\n    response=forecast.create_dataset(\r\n                    Domain=\"WEB_TRAFFIC\",\r\n                    DatasetType='RELATED_TIME_SERIES',\r\n                    DatasetName=datasetName,\r\n                    DataFrequency=DATASET_FREQUENCY, \r\n                    Schema = schema\r\n                   )\r\n    related_datasetArn = response['DatasetArn']\r\n    print('Success')\r\nexcept Exception as e:\r\n    print(e)\r\n    related_datasetArn = 'arn:aws:forecast:{}:{}:dataset\/{}'.format(REGION_NAME, ACCOUNT_NUM, datasetName)\r\n<\/code><\/pre>\n<\/div>\n<p>Before ingesting any data into Forecast, we need to combine the target and related time series into a dataset group:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">datasetGroupName = 'webtraffic_forecast_NLPgroup'\r\n\r\ntry:\r\n    create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=datasetGroupName,\r\n                                                                  Domain=\"WEB_TRAFFIC\",\r\n                                                                  DatasetArns= [datasetArn, related_datasetArn]\r\n                                                                 )\r\n    datasetGroupArn = create_dataset_group_response['DatasetGroupArn']\r\nexcept Exception as e:\r\n    print(e)\r\n    datasetGroupArn = 'arn:aws:forecast:{}:{}:dataset-group\/{}'.format(REGION_NAME, ACCOUNT_NUM, datasetGroupName)\r\n<\/code><\/pre>\n<\/div>\n<h3>Ingesting the target and related time series data from Amazon S3<\/h3>\n<p>Next you import the target and related data previously stored in Amazon S3 to create a Forecast dataset. You provide the location of the training data in Amazon S3 and the ARN of the dataset placeholder you created.<\/p>\n<p>Ingest the target and related time series with the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">s3DataPath = 's3:\/\/{}\/{}\/target_time_series.csv'.format(bucket, prefix)\r\ndatasetImportJobName = 'forecast_DSIMPORT_JOB_TARGET'\r\n\r\ntry:\r\n    ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,\r\n                                                          DatasetArn=datasetArn,\r\n                                                          DataSource= {\r\n                                                              \"S3Config\" : {\r\n                                                                 \"Path\":s3DataPath,\r\n                                                                 \"RoleArn\": role_arn\r\n                                                              } \r\n                                                          },\r\n                                                          TimestampFormat=TIMESTAMP_FORMAT\r\n                                                         )\r\n    ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']\r\n    target_ds_import_job_arn = copy.copy(ds_import_job_arn) #used to delete the resource during cleanup\r\nexcept Exception as e:\r\n    print(e)\r\n    ds_import_job_arn='arn:aws:forecast:{}:{}:dataset-import-job\/{}\/{}'.format(REGION_NAME, 
ACCOUNT_NUM, datasetArn, datasetImportJobName)\r\ns3DataPath = 's3:\/\/{}\/{}\/related_time_series.csv'.format(bucket, prefix)\r\ndatasetImportJobName = 'forecast_DSIMPORT_JOB_RELATED'\r\ntry:\r\n    ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,\r\n                                                          DatasetArn=related_datasetArn,\r\n                                                          DataSource= {\r\n                                                              \"S3Config\" : {\r\n                                                                 \"Path\":s3DataPath,\r\n                                                                 \"RoleArn\": role_arn\r\n                                                              } \r\n                                                          },\r\n                                                          TimestampFormat=TIMESTAMP_FORMAT\r\n                                                         )\r\n    ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']\r\n    related_ds_import_job_arn = copy.copy(ds_import_job_arn) #used to delete the resource during cleanup\r\nexcept Exception as e:\r\n    print(e)\r\n    ds_import_job_arn='arn:aws:forecast:{}:{}:dataset-import-job\/{}\/{}'.format(REGION_NAME, ACCOUNT_NUM, related_datasetArn, datasetImportJobName)\r\n<\/code><\/pre>\n<\/div>\n<h3>Creating the predictor<\/h3>\n<p>The Forecast DeepAR+ algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNNs). Classic forecasting methods, such as ARIMA or exponential smoothing (ETS), fit a single model to each individual time series. 
In contrast, DeepAR+ creates a global model (one model for all the time series) with the potential benefit of learning across time series.<\/p>\n<p>The DeepAR+ model is particularly useful when working with large collections (thousands or more) of target time series, in which certain time series have only a limited amount of information. For example, as a generalization of this use case, global models such as DeepAR+ can use the information from related topics with strong statistical signals to predict the popularity of new topics with little historical data. Importantly, DeepAR+ also allows you to include related information, such as the topic vectors, in a related time series.<\/p>\n<p>To create the predictor, use the following code:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">try:\r\n    create_predictor_response=forecast.create_predictor(PredictorName=predictorName, \r\n                                                  ForecastHorizon=forecastHorizon,\r\n                                                  AlgorithmArn=algorithmArn,\r\n                                                  PerformAutoML=False, # change to True to perform AutoML\r\n                                                  PerformHPO=False, # change to True to perform HPO\r\n                                                  EvaluationParameters= {\"NumberOfBacktestWindows\": 1, \r\n                                                                         \"BackTestWindowOffset\": 15}, \r\n                                                  InputDataConfig= {\"DatasetGroupArn\": datasetGroupArn},\r\n                                                  FeaturizationConfig= {\"ForecastFrequency\": \"D\"}\r\n                                                 )\r\n    predictorArn=create_predictor_response['PredictorArn']\r\nexcept Exception as e:\r\n    print(e)\r\n    predictorArn = 
'arn:aws:forecast:{}:{}:predictor\/{}'.format(REGION_NAME, ACCOUNT_NUM, predictorName)\r\n<\/code><\/pre>\n<\/div>\n<p>When you call the <code>create_predictor()<\/code> method, it takes several minutes to complete.<\/p>\n<p><em>Backtesting<\/em> is a method of testing an ML model that is trained on, and designed to predict, time series data. Due to the sequential nature of time series data, training and test data can\u2019t be randomized. Moreover, the most recent time series data is generally considered the most relevant for testing purposes. Therefore, backtesting uses the most recent windows, which were unseen by the model during training, to test the model and collect metrics. Amazon Forecast lets you choose up to five windows for backtesting. For more information, see <a href=\"https:\/\/docs.aws.amazon.com\/forecast\/latest\/dg\/metrics.html\" target=\"_blank\" rel=\"noopener noreferrer\">Evaluating Predictor Accuracy<\/a>.<\/p>\n<p>For this post, we evaluate the DeepAR+ model using both the mean absolute percentage error (MAPE), a common error metric in time series forecasting, and the root mean square error (RMSE), which penalizes larger deviations more heavily. The RMSE is the average deviation between the forecasted and actual values, expressed in the same units as the dependent variable (in this use case, topic popularity on Facebook).<\/p>\n<h3>Creating and querying the forecast<\/h3>\n<p>When you\u2019re satisfied with the accuracy metrics from your trained Forecast model, it\u2019s time to generate a forecast. You can do this by creating a forecast for each item in the target time series used to train the predictor. 
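These two steps can be sketched with boto3. The `create_forecast` and `query_forecast` calls are the service's API, but the wrapper function, the forecast name, and the flattening helper below are illustrative assumptions rather than code from the original notebook:

```python
def create_and_query_forecast(predictor_arn, item_id, region_name):
    """Create a forecast from the trained predictor and query one item's horizon.

    The forecast name below is a hypothetical placeholder.
    """
    import boto3  # imported here so the pure helper below has no dependencies

    forecast = boto3.client("forecast", region_name=region_name)
    forecastquery = boto3.client("forecastquery", region_name=region_name)

    forecast_arn = forecast.create_forecast(
        ForecastName="webtraffic_forecast_NLP",  # hypothetical name
        PredictorArn=predictor_arn,
    )["ForecastArn"]

    # In practice, wait until the forecast status is ACTIVE before querying.
    return forecastquery.query_forecast(
        ForecastArn=forecast_arn,
        Filters={"item_id": item_id},
    )

def quantile_series(query_response, quantile="p50"):
    """Flatten one quantile of a query_forecast response into (timestamp, value) pairs."""
    return [(p["Timestamp"], p["Value"])
            for p in query_response["Forecast"]["Predictions"][quantile]]
```

By default, `query_forecast` returns the p10, p50, and p90 quantiles, so you can plot an 80% prediction interval around the median rather than a single point forecast.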
Query the results to find out the popularity of the different topics in the original dataset.<\/p>\n<p>The following is the result for <code>Topic 0<\/code>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17272 size-full\" title=\"Result for Topic 0\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/13-Graph1.jpg\" alt=\"\" width=\"900\" height=\"273\"><\/p>\n<p>The following is the result for <code>Topic 1<\/code>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17273 size-full\" title=\"Result for Topic 1\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/13-Graph2.jpg\" alt=\"\" width=\"900\" height=\"272\"><\/p>\n<p>The following is the result for <code>Topic 2<\/code>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17274 size-full\" title=\"Result for Topic 2\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/13-Graph3.jpg\" alt=\"\" width=\"900\" height=\"269\"><\/p>\n<p>The following is the result for <code>Topic 3<\/code>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-17275 size-full\" title=\"Result for Topic 3\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/20\/13-Graph4.jpg\" alt=\"\" width=\"900\" height=\"272\"><\/p>\n<h3>Forecast accuracy<\/h3>\n<p>As an example, the RMSE for <code>Topic 1<\/code> is approximately 22.05. Although the actual range of popularity values in the ground truth set over the date range of the forecast is quite large [3, 331], this RMSE does not by itself indicate whether the model is production ready. 
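For reference, both metrics can be computed directly from a backtest window with a few lines of Python; the values used here are made-up placeholders, not numbers from this dataset:

```python
import math

def rmse(actual, predicted):
    # Root mean square error: reported in the same units as the target variable.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    # Mean absolute percentage error (undefined where an actual value is 0).
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative placeholder values for one backtest window
actuals = [3, 4]
predictions = [1, 2]
window_rmse = rmse(actuals, predictions)
window_mape = mape(actuals, predictions)
```

Because the errors are squared before averaging, a few large misses raise the RMSE far more than the MAPE, which is why reporting both gives a fuller picture of forecast quality.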
The RMSE is simply one additional data point to consider when evaluating the efficacy of your model.<\/p>\n<h2>Cleaning up<\/h2>\n<p>To avoid incurring future charges, delete each Forecast component. Also delete any other resources used in the notebook, such as the Amazon SageMaker NTM endpoint, any S3 buckets used for storing data, and finally the Amazon SageMaker notebooks.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this post, you learned how to build a forecasting model using unstructured raw text data. You also learned how to train a topic model and use the generated topic vectors as related time series for Forecast. Although this post is intended to demonstrate how you can combine these models, you can improve model accuracy by applying the same methodology to much larger datasets with many more topics. Amazon Forecast also supports other deep learning models for time series forecasting, such as <a class=\"c-link\" href=\"https:\/\/docs.aws.amazon.com\/forecast\/latest\/dg\/aws-forecast-algo-cnnqr.html#aws-forecast-algo-cnnqr-how-it-works\" target=\"_blank\" rel=\"noopener noreferrer\" data-stringify-link=\"https:\/\/docs.aws.amazon.com\/forecast\/latest\/dg\/aws-forecast-algo-cnnqr.html#aws-forecast-algo-cnnqr-how-it-works\" data-sk=\"tooltip_parent\">CNN-QR<\/a>. To read more about how you can build an end-to-end operational workflow with Amazon Forecast and AWS Step Functions, see <a class=\"c-link\" href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/building-ai-powered-forecasting-automation-with-amazon-forecast-by-applying-mlops\/\" target=\"_blank\" rel=\"noopener noreferrer\" data-stringify-link=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/building-ai-powered-forecasting-automation-with-amazon-forecast-by-applying-mlops\/\" data-sk=\"tooltip_parent\">here<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h2>References<\/h2>\n<p>[1] Multi-Source Social Feedback of Online News Feeds, N. Moniz and L. 
Torgo, arXiv:1801.07055 (2018).<\/p>\n<p>[2] Learning to determine the quality of news headlines, A. Omidvar <em>et al.<\/em>, arXiv:1911.11139 (2019).<\/p>\n<p>[3] \u201cVisualizing Data using t-SNE\u201d, L. van der Maaten and G. Hinton, Journal of Machine Learning Research <strong>9<\/strong>, 2579-2605 (2008).<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-17357 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/10\/21\/davidehrlich.jpg\" alt=\"\" width=\"101\" height=\"137\">David Ehrlich<\/strong> is a Machine Learning Specialist at Amazon Web Services. He is passionate about helping customers unlock the true potential of their data. In his spare time, he enjoys exploring the different neighborhoods in New York City, going to comedy clubs, and traveling.<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p>\u00a0<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-15884 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2020\/09\/12\/stefan-natu.jpg\" alt=\"\" width=\"100\" height=\"113\">Stefan Natu<\/strong> is a Sr. Machine Learning Specialist at Amazon Web Services. He is focused on helping financial services customers build end-to-end machine learning solutions on AWS. 
In his spare time, he enjoys reading machine learning blogs, playing the guitar, and exploring the food scene in New York City.<\/p>\n<p>\u00a0<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/time-series-forecasting-using-unstructured-data-with-amazon-forecast-and-the-amazon-sagemaker-neural-topic-model\/<\/p>\n","protected":false},"author":0,"featured_media":435,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/434"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=434"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/434\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/435"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=434"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=434"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=434"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}