{"id":1561,"date":"2022-02-16T16:41:02","date_gmt":"2022-02-16T16:41:02","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/02\/16\/prepare-time-series-data-with-amazon-sagemaker-data-wrangler\/"},"modified":"2022-02-16T16:41:02","modified_gmt":"2022-02-16T16:41:02","slug":"prepare-time-series-data-with-amazon-sagemaker-data-wrangler","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/02\/16\/prepare-time-series-data-with-amazon-sagemaker-data-wrangler\/","title":{"rendered":"Prepare time series data with Amazon SageMaker Data Wrangler"},"content":{"rendered":"<div id=\"\">\n<p>Time series data is widely present in our lives. Stock prices, house prices, weather information, and sales data captured over time are just a few examples. As businesses increasingly look for new ways to gain meaningful insights from time series data, visualizing the data and applying the desired transformations are fundamental steps. However, time series data possesses unique characteristics and nuances compared to other kinds of tabular data, and requires special consideration. For example, standard tabular or cross-sectional data is collected at a specific point in time. In contrast, time series data is captured repeatedly over time, with each successive data point dependent on its past values.<\/p>\n<p>Because most time series analyses rely on the information gathered across a contiguous set of observations, missing data and inherent sparseness can reduce the accuracy of forecasts and introduce bias. Additionally, most time series analysis approaches rely on equal spacing between data points, in other words, periodicity. Therefore, the ability to fix data spacing irregularities is a critical prerequisite. Finally, time series analysis often requires the creation of additional features that can help explain the inherent relationship between input data and future predictions.
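The post doesn't show this check as code, but the periodicity requirement is easy to sketch. The following minimal pandas example (synthetic data and a hypothetical column name, not the bitcoin dataset) tests whether observations are evenly spaced and restores a regular grid, leaving gaps as missing values to impute later:

```python
import pandas as pd

# Hypothetical minute-level series with one missing observation at 00:02.
ts = pd.DataFrame(
    {"price": [100.0, 101.5, 102.0, 101.0]},
    index=pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 00:01",
         "2021-01-01 00:03", "2021-01-01 00:04"]
    ),
)

# Evenly spaced data has a single unique gap between consecutive timestamps.
gaps = ts.index.to_series().diff().dropna()
is_regular = gaps.nunique() == 1
print(is_regular)  # False: the 00:02 observation is missing

# Restore periodicity: reindex to a full minute-frequency range,
# which inserts a NaN row at 00:02 for later imputation.
regular = ts.asfreq("min")
print(len(regular))  # 5 rows
```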
All these factors differentiate time series projects from traditional machine learning (ML) scenarios and demand a distinct approach to their analysis.<\/p>\n<p>This post walks through how to use <a href=\"https:\/\/aws.amazon.com\/sagemaker\/data-wrangler\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Data Wrangler<\/a> to apply time series transformations and prepare your dataset for time series use cases.<\/p>\n<h2>Use cases for Data Wrangler<\/h2>\n<p>Data Wrangler provides a no-code\/low-code solution to time series analysis with features to clean, transform, and prepare data faster. It also enables data scientists to prepare time series data in adherence to their forecasting model\u2019s input format requirements. The following are a few ways you can use these capabilities:<\/p>\n<ul>\n<li><strong>Descriptive analysis<\/strong> \u2013 Usually, step one of any data science project is understanding the data. When we plot time series data, we get a high-level overview of its patterns, such as trend, seasonality, cycles, and random variations. This helps us decide on the correct forecasting methodology for accurately representing these patterns. Plotting can also help identify outliers, preventing unrealistic and inaccurate forecasts. Data Wrangler comes with a <strong>seasonality-trend decomposition visualization<\/strong> for representing components of a time series, and an <strong>outlier detection visualization<\/strong> to identify outliers.<\/li>\n<li><strong>Explanatory analysis<\/strong> \u2013 For multivariate time series, the ability to explore, identify, and model the relationship between two or more time series is essential for obtaining meaningful forecasts. The <strong>Group by<\/strong> transform in Data Wrangler creates multiple time series by grouping data by specified columns.
Additionally, Data Wrangler time series transforms, where applicable, allow specification of additional ID columns to group on, enabling complex time series analysis.<\/li>\n<li><strong>Data preparation and feature engineering<\/strong> \u2013 Time series data is rarely in the format expected by time series models. It often requires data preparation to convert raw data into time series-specific features. You may want to validate that time series data is regularly or equally spaced prior to analysis. For forecasting use cases, you may also want to incorporate additional time series characteristics, such as autocorrelation and statistical properties. With Data Wrangler, you can quickly create time series features such as lag columns for multiple lag periods, resample data to multiple time granularities, and automatically extract statistical properties of a time series, to name a few capabilities.<\/li>\n<\/ul>\n<h2>Solution overview<\/h2>\n<p>This post elaborates on how data scientists and analysts can use Data Wrangler to visualize and prepare time series data. We use the bitcoin cryptocurrency dataset from <a href=\"https:\/\/www.cryptodatadownload.com\/data\/bitstamp\/\" target=\"_blank\" rel=\"noopener noreferrer\">cryptodatadownload<\/a> with bitcoin trading details to showcase these capabilities. We clean, validate, and transform the raw dataset with time series features, and also generate bitcoin volume forecasts using the transformed dataset as input.<\/p>\n<p>The sample of bitcoin trading data is from January 1 \u2013 November 19, 2021, with 464,116 data points.
The dataset attributes include the timestamp of the price record, the opening (first) price at which the coin was exchanged on a particular day, the highest price at which the coin was exchanged on the day, the last price at which the coin was exchanged on the day, and the volume exchanged on the day, in both BTC and the corresponding USD value.<\/p>\n<h2>Prerequisites<\/h2>\n<p>Download the <code>Bitstamp_BTCUSD_2021_minute.csv<\/code> file from <a href=\"https:\/\/www.cryptodatadownload.com\/data\/bitstamp\/\" target=\"_blank\" rel=\"noopener noreferrer\">cryptodatadownload<\/a> and upload it to <a href=\"https:\/\/aws.amazon.com\/s3\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service (Amazon S3)<\/a>.<\/p>\n<h2>Import the bitcoin dataset in Data Wrangler<\/h2>\n<p>To start the ingestion process in Data Wrangler, complete the following steps:<\/p>\n<ol>\n<li>On the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/studio.html\" target=\"_blank\" rel=\"noopener noreferrer\">SageMaker Studio<\/a> console, on the <strong>File<\/strong> menu, choose <strong>New<\/strong>, then choose <strong>Data Wrangler Flow<\/strong>.<\/li>\n<li>Rename the flow as desired.<\/li>\n<li>For <strong>Import data<\/strong>, choose <strong>Amazon S3<\/strong>.<\/li>\n<li>Choose the <code class=\"lang-xml\">Bitstamp_BTCUSD_2021_minute.csv<\/code> file from your S3 bucket.<\/li>\n<\/ol>\n<p>You can now preview your dataset.<\/p>\n<ol start=\"5\">\n<li>In the <strong>Details<\/strong> pane, choose <strong>Advanced configuration<\/strong> and deselect <strong>Enable sampling<\/strong>.<\/li>\n<\/ol>\n<p>This is a relatively small dataset, so we don\u2019t need sampling.<\/p>\n<ol start=\"6\">\n<li>Choose <strong>Import<\/strong>.<\/li>\n<\/ol>\n<p>You have successfully created the flow diagram and are ready to add transformation steps.<\/p>\n<p><a 
href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-Import-data-1.gif\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-32956 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-Import-data-1.gif\" alt=\"\" width=\"1920\" height=\"1080\"><\/a><\/p>\n<h2>Add transformations<\/h2>\n<p>To add data transformations, choose the plus sign next to <strong>Data types<\/strong> and choose <strong>Edit data types<\/strong>.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-edit-datatypes.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-32946 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-edit-datatypes.png\" alt=\"\" width=\"790\" height=\"683\"><\/a><\/p>\n<p>Ensure that Data Wrangler automatically inferred the correct data types for the data columns.<\/p>\n<p>In our case, the inferred data types are correct. However, suppose some of the data types were incorrect.
You can easily modify them through the UI, as shown in the following screenshot.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32954\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-review-datatypes.png\" alt=\"edit and review data types\" width=\"1276\" height=\"631\"><\/p>\n<p>Let\u2019s kick off the analysis and start adding transformations.<\/p>\n<h2>Data cleaning<\/h2>\n<p>We first perform several data cleaning transformations.<\/p>\n<h3>Drop column<\/h3>\n<p>Let\u2019s start by dropping the <code>unix<\/code> column, because we use the <code>date<\/code> column as the index.<\/p>\n<ol>\n<li>Choose <strong>Back to data flow<\/strong>.<\/li>\n<li>Choose the plus sign next to <strong>Data types<\/strong> and choose <strong>Add transform<\/strong>.<\/li>\n<li>Choose <strong>+ Add step<\/strong> in the<strong> TRANSFORMS<\/strong> pane.<\/li>\n<li>Choose <strong>Manage columns<\/strong>.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Drop column<\/strong>.<\/li>\n<li>For <strong>Column to drop<\/strong>, choose <strong>unix<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<li>Choose <strong>Add<\/strong> to save the step.<\/li>\n<\/ol>\n<h3>Handle missing<\/h3>\n<p>Missing data is a well-known problem in real-world datasets. Therefore, it\u2019s a best practice to verify the presence of any missing or null values and handle them appropriately. Our dataset doesn\u2019t contain missing values. But if there were, we would use the <strong>Handle missing<\/strong> time series transform to fix them. Commonly used strategies for handling missing data include dropping rows with missing values or filling the missing values with reasonable estimates. Because time series data relies on a sequence of data points across time, filling missing values is the preferred approach. 
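Data Wrangler applies these strategies through the UI; as a rough illustration of what forward fill and interpolation actually do, here is a minimal pandas sketch on a synthetic series (not the bitcoin data):

```python
import numpy as np
import pandas as pd

# Synthetic series with gaps (illustrative, not the bitcoin dataset).
s = pd.Series([10.0, np.nan, 12.0, np.nan, np.nan, 15.0])

# Forward fill: each missing value takes the last preceding observed value.
ffilled = s.ffill()
print(ffilled.tolist())  # [10.0, 10.0, 12.0, 12.0, 12.0, 15.0]

# Interpolation instead estimates from neighboring values on both sides.
interp = s.interpolate(method="linear")
print(interp.tolist())  # [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
```

Note how interpolation produces smoother estimates when values on both sides of the gap are available, which is why it works well for correlated neighboring observations.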
The process of filling missing values is referred to as <em>imputation<\/em>. The <strong>Handle missing<\/strong> time series transform allows you to choose from multiple imputation strategies.<\/p>\n<ol>\n<li>Choose <strong>+ Add step<\/strong> in the<strong> TRANSFORMS<\/strong> pane.<\/li>\n<li>Choose the <strong>Time Series<\/strong> transform.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Handle missing<\/strong>.<\/li>\n<li>For <strong>Time series input type<\/strong>, choose <strong>Along column<\/strong>.<\/li>\n<li>For <strong>Method for imputing values<\/strong>, choose <strong>Forward fill<\/strong>.<\/li>\n<\/ol>\n<p>The <strong>Forward fill<\/strong> method replaces each missing value with the most recent non-missing value that precedes it.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32951\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-handle-missing.png\" alt=\"handle missing time series transform\" width=\"293\" height=\"617\"><\/p>\n<p><strong>Backward fill<\/strong>, <strong>Constant Value<\/strong>, <strong>Most common value<\/strong>, and <strong>Interpolate<\/strong> are other imputation strategies available in Data Wrangler. Interpolation techniques rely on neighboring values for filling missing values. Time series data often exhibits correlation between neighboring values, making interpolation an effective filling strategy. For additional details on the functions you can use for applying interpolation, refer to <a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.interpolate.html\" target=\"_blank\" rel=\"noopener noreferrer\">pandas.DataFrame.interpolate<\/a>.<\/p>\n<h3>Validate timestamp<\/h3>\n<p>In time series analysis, the timestamp column acts as the index column, around which the analysis revolves.
Therefore, it\u2019s essential to make sure the timestamp column doesn\u2019t contain invalid or incorrectly formatted timestamp values. Because we\u2019re using the <code>date<\/code> column as the timestamp column and index, let\u2019s confirm its values are correctly formatted.<\/p>\n<ol>\n<li>Choose <strong>+ Add step<\/strong> in the<strong> TRANSFORMS<\/strong> pane.<\/li>\n<li>Choose the <strong>Time Series<\/strong> transform.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Validate timestamps<\/strong>.<\/li>\n<\/ol>\n<p>The <strong>Validate timestamps<\/strong> transform allows you to check that the timestamp column in your dataset doesn\u2019t contain incorrectly formatted or missing values.<\/p>\n<ol start=\"4\">\n<li>For <strong>Timestamp Column<\/strong>, choose <strong>date<\/strong>.<\/li>\n<li>For the <strong>Policy<\/strong> dropdown, choose <strong>Indicate<\/strong>.<\/li>\n<\/ol>\n<p>The <strong>Indicate<\/strong> policy option creates a Boolean column indicating whether the value in the timestamp column is in a valid date\/time format. Other options for <strong>Policy<\/strong> include:<\/p>\n<ul>\n<li><strong>Error<\/strong> \u2013 Throws an error if the timestamp column is missing or invalid<\/li>\n<li><strong>Drop<\/strong> \u2013 Drops the row if the timestamp column is missing or invalid<\/li>\n<\/ul>\n<ol start=\"6\">\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<\/ol>\n<p>A new Boolean column named <code>date_is_valid<\/code> is created, with <code>true<\/code> values indicating correct format and non-null entries. Our dataset doesn\u2019t contain invalid timestamp values in the <code>date<\/code> column.
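The Indicate policy isn't exposed as code in the post; a rough pandas equivalent of the Boolean validity column (on synthetic data with one deliberately malformed entry) might look like:

```python
import pandas as pd

# Synthetic timestamp column with one malformed entry (illustrative only).
df = pd.DataFrame({"date": ["2021-01-01 00:00", "2021-01-01 00:01", "not-a-date"]})

# errors="coerce" turns unparseable values into NaT instead of raising,
# so the notna() mask mirrors the Indicate policy's Boolean column.
parsed = pd.to_datetime(df["date"], errors="coerce")
df["date_is_valid"] = parsed.notna()
print(df["date_is_valid"].tolist())  # [True, True, False]
```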
But if it did, you could use the new Boolean column to identify and fix those values.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32955\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-validate-timestamp.png\" alt=\"Validate Timestamp time series transform\" width=\"1276\" height=\"528\"><\/p>\n<ol start=\"7\">\n<li>Choose <strong>Add<\/strong> to save this step.<\/li>\n<\/ol>\n<h2>Time series visualization<\/h2>\n<p>After we clean and validate the dataset, we can better visualize the data to understand its different components.<\/p>\n<h3>Resample<\/h3>\n<p>Because we\u2019re interested in daily predictions, let\u2019s transform the frequency of the data to daily.<\/p>\n<p>The <strong>Resample<\/strong> transformation changes the frequency of the time series observations to a specified granularity, and comes with both upsampling and downsampling options. Applying upsampling increases the frequency of the observations (for example from daily to hourly), whereas downsampling decreases the frequency of the observations (for example from hourly to daily).<\/p>\n<p>Because our dataset is at minute granularity, let\u2019s use the downsampling option.<\/p>\n<ol>\n<li>Choose <strong>+ Add step<\/strong>.<\/li>\n<li>Choose the <strong>Time Series<\/strong> transform.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Resample<\/strong>.<\/li>\n<li>For <strong>Timestamp<\/strong>, choose <strong>date<\/strong>.<\/li>\n<li>For <strong>Frequency unit<\/strong>, choose <strong>Calendar day<\/strong>.<\/li>\n<li>For <strong>Frequency quantity<\/strong>, enter <code>1<\/code>.<\/li>\n<li>For <strong>Method to aggregate numeric values<\/strong>, choose <strong>mean<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<\/ol>\n<p>The frequency of our dataset has changed from per minute to daily.<\/p>\n<p><a 
href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/14\/ML-6133-Resample-3.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-33076 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/14\/ML-6133-Resample-3.png\" alt=\"\" width=\"1048\" height=\"834\"><\/a><\/p>\n<ol start=\"9\">\n<li>Choose <strong>Add<\/strong> to save this step.<\/li>\n<\/ol>\n<h3>Seasonal-Trend decomposition<\/h3>\n<p>After resampling, we can visualize the transformed series and its associated STL (Seasonal and Trend decomposition using LOESS) components using the <strong>Seasonal-Trend-decomposition<\/strong> visualization. This breaks down the original time series into distinct trend, seasonality, and residual components, giving us a good understanding of how each pattern behaves. We can also use this information when modelling forecasting problems.<\/p>\n<p>Data Wrangler uses LOESS, a robust and versatile statistical method for modelling trend and seasonal components.
Its underlying implementation uses polynomial regression for estimating nonlinear relationships present in the time series components (seasonality, trend, and residual).<\/p>\n<ol>\n<li>Choose <strong>Back to data flow<\/strong>.<\/li>\n<li>Choose the plus sign next to the <strong>Steps<\/strong> on <strong>Data Flow<\/strong>.<\/li>\n<li>Choose <strong>Add analysis<\/strong>.<\/li>\n<li>In the <strong>Create analysis<\/strong> pane, for <strong>Analysis type<\/strong>, choose <strong>Time Series<\/strong>.<\/li>\n<li>For <strong>Visualization<\/strong>, choose <strong>Seasonal-Trend decomposition<\/strong>.<\/li>\n<li>For <strong>Analysis Name<\/strong>, enter a name.<\/li>\n<li>For <strong>Timestamp column<\/strong>, choose <strong>date<\/strong>.<\/li>\n<li>For <strong>Value column<\/strong>, choose <strong>Volume USD<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<\/ol>\n<p>The analysis allows us to visualize the input time series and the decomposed seasonality, trend, and residual.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32945\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-decomposition.png\" alt=\"\" width=\"1276\" height=\"727\"><\/p>\n<ol start=\"10\">\n<li>Choose <strong>Save<\/strong> to save the analysis.<\/li>\n<\/ol>\n<p>With the <strong>seasonal-trend decomposition visualization<\/strong>, we can generate four patterns, as shown in the preceding screenshot:<\/p>\n<ul>\n<li><strong>Original<\/strong> \u2013 The original time series resampled to daily granularity.<\/li>\n<li><strong>Trend<\/strong> \u2013 The polynomial trend with an overall negative trend pattern for the year 2021, indicating a decrease in <code>Volume USD<\/code> value.<\/li>\n<li><strong>Season<\/strong> \u2013 The multiplicative seasonality represented by the varying oscillation patterns.
We see a decrease in seasonal variation, characterized by decreasing amplitude of oscillations.<\/li>\n<li><strong>Residual<\/strong> \u2013 The remaining residual or random noise. The residual series is the resulting series after the trend and seasonal components have been removed. Looking closely, we observe spikes between January and March, and between April and June, suggesting that such events could be modelled using historical data.<\/li>\n<\/ul>\n<p>These visualizations give data scientists and analysts valuable insight into existing patterns and can help you choose a modelling strategy. However, it\u2019s always a good practice to validate the output of STL decomposition with the information gathered through descriptive analysis and domain expertise.<\/p>\n<p>To summarize, we observe a downward trend consistent with the original series visualization, which increases our confidence in incorporating the information conveyed by the trend visualization into downstream decision-making. In contrast, although the seasonality visualization confirms the presence of seasonality and the need to remove it by applying techniques such as differencing, it doesn\u2019t provide detailed insight into the various seasonal patterns present, which calls for deeper analysis.<\/p>\n<h2>Feature engineering<\/h2>\n<p>After we understand the patterns present in our dataset, we can start to engineer new features aimed at increasing the accuracy of the forecasting models.<\/p>\n<h3>Featurize datetime<\/h3>\n<p>Let\u2019s start the feature engineering process with the more straightforward date\/time features. Date\/time features are created from the <code>timestamp<\/code> column and provide a natural starting point for feature engineering. We begin with the <strong>Featurize datetime<\/strong> time series transformation to add the month, day of the month, day of the year, week of the year, and quarter features to our dataset.
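A rough pandas equivalent of this datetime featurization, and of the one-hot encoding applied in a later step, on a synthetic date column; note that pandas quarters are 1-based, whereas an ordinal encoding may be 0-based:

```python
import pandas as pd

# Synthetic daily date column (illustrative, standing in for the resampled data).
df = pd.DataFrame({"date": pd.date_range("2021-01-01", periods=5, freq="D")})

# Extract the same ordinal date/time components as the Featurize datetime step.
df["date_month"] = df["date"].dt.month
df["date_day"] = df["date"].dt.day
df["date_week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
df["date_day_of_year"] = df["date"].dt.dayofyear
df["date_quarter"] = df["date"].dt.quarter  # 1-4 in pandas

# One-hot encode the quarter, as in the Encode categorical step later on.
dummies = pd.get_dummies(df["date_quarter"], prefix="date_quarter")
df = pd.concat([df, dummies], axis=1)
print(df.filter(like="date_quarter_").columns.tolist())
```

On a full year of data, `get_dummies` would produce all four quarter columns; the five January dates here yield only one.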
Because we\u2019re providing the date\/time components as separate features, we enable ML algorithms to detect signals and patterns for improving prediction accuracy.<\/p>\n<ol>\n<li>Choose <strong>+ Add step<\/strong>.<\/li>\n<li>Choose the <strong>Time Series<\/strong> transform.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Featurize datetime<\/strong>.<\/li>\n<li>For <strong>Input Column<\/strong>, choose <strong>date<\/strong>.<\/li>\n<li>For <strong>Output Column<\/strong>, enter <code>date<\/code> (this step is optional).<\/li>\n<li>For <strong>Output mode<\/strong>, choose <strong>Ordinal<\/strong>.<\/li>\n<li>For <strong>Output format<\/strong>, choose <strong>Columns<\/strong>.<\/li>\n<li>For the date\/time features to extract, select <strong>Month<\/strong>, <strong>Day<\/strong>, <strong>Week of year<\/strong>, <strong>Day of year<\/strong>, and <strong>Quarter<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<\/ol>\n<p>The dataset now contains new columns named <code>date_month<\/code>, <code>date_day<\/code>, <code>date_week_of_year<\/code>, <code>date_day_of_year<\/code>, and <code>date_quarter<\/code>. The information captured by these new features could help data scientists derive additional insights from the data and better understand the relationship between the input features and the output.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32950\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-featurize-datetime.png\" alt=\"featurize datetime time series transform\" width=\"1276\" height=\"767\"><\/p>\n<ol start=\"10\">\n<li>Choose <strong>Add<\/strong> to save this step.<\/li>\n<\/ol>\n<h3>Encode categorical<\/h3>\n<p>Date\/time features aren\u2019t limited to integer values.
You may also choose to consider certain extracted date\/time features as categorical variables and represent them as one-hot encoded features, with each column containing binary values. The newly created <code>date_quarter<\/code> column contains values from 0 to 3, and can be one-hot encoded using four binary columns. Let\u2019s create four new binary features, each representing the corresponding quarter of the year.<\/p>\n<ol>\n<li>Choose <strong>+ Add step<\/strong>.<\/li>\n<li>Choose the <strong>Encode categorical<\/strong> transform.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>One-hot encode<\/strong>.<\/li>\n<li>For <strong>Input column<\/strong>, choose <strong>date_quarter<\/strong>.<\/li>\n<li>For <strong>Output style<\/strong>, choose <strong>Columns<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<li>Choose <strong>Add<\/strong> to add the step.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32947\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-encode-categorical.png\" alt=\"\" width=\"1285\" height=\"801\"><\/p>\n<h3>Lag feature<\/h3>\n<p>Next, let\u2019s create lag features for the target column <code>Volume USD<\/code>. Lag features in time series analysis are values at prior timestamps that are considered helpful in inferring future values. They also help identify autocorrelation (also known as <em>serial correlation<\/em>) patterns in the residual series by quantifying the relationship of each observation with observations at previous time steps. Autocorrelation is similar to regular correlation, but between the values in a series and its past values. It forms the basis for the autoregressive forecasting models in the ARIMA family.<\/p>\n<p>With the Data Wrangler <strong>Lag feature<\/strong> transform, you can easily create lag features <em>n<\/em> periods apart.
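As an illustration of lagging and of the autocorrelation it helps surface (code and column names below are illustrative, not Data Wrangler's actual output):

```python
import pandas as pd

# Synthetic target series (illustrative, standing in for Volume USD).
s = pd.Series([3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# A single lag feature: the value observed one step earlier.
lag_1 = s.shift(1)
print(lag_1.tolist()[:3])  # [nan, 3.0, 4.0]

# An entire lag window, one column per lag, like "Include the entire lag
# window" with "Flatten the output" (hypothetical column names).
lags = pd.concat({f"volume_lag_{k}": s.shift(k) for k in range(1, 8)}, axis=1)
print(lags.shape)  # (8, 7)

# Autocorrelation at lag 1: correlation of the series with its own past values.
print(round(s.autocorr(lag=1), 4))  # 1.0 for a perfectly linear series
```

Early rows contain missing values where no lagged observation exists yet, which is why lag features are typically created before any final row filtering.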
Additionally, we often want to create multiple lag features at different lags and let the model decide the most meaningful features. For such a scenario, the <strong>Lag features <\/strong>transform helps create multiple lag columns over a specified window size.<\/p>\n<ol>\n<li>Choose <strong>Back to data flow<\/strong>.<\/li>\n<li>Choose the plus sign next to the <strong>Steps<\/strong> on <strong>Data Flow<\/strong>.<\/li>\n<li>Choose <strong>+ Add step<\/strong>.<\/li>\n<li>Choose <strong>Time Series<\/strong> transform.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Lag features<\/strong>.<\/li>\n<li>For <strong>Generate lag features for this column<\/strong>, choose <strong>Volume USD<\/strong>.<\/li>\n<li>For <strong>Timestamp Column<\/strong>, choose <strong>date<\/strong>.<\/li>\n<li>For <strong>Lag<\/strong>, enter <code>7<\/code>.<\/li>\n<li>Because we\u2019re interested in observing up to the previous seven lag values, let\u2019s select <strong>Include the entire lag window<\/strong>.<\/li>\n<li>To create a new column for each lag value, select <strong>Flatten the output<\/strong>.<\/li>\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<\/ol>\n<p>Seven new columns are added, suffixed with the <code>lag_number<\/code> keyword for the target column <code>Volume USD<\/code>.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32952\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-laf_feature.png\" alt=\"Lag feature time series transform\" width=\"1276\" height=\"719\"><\/p>\n<ol start=\"12\">\n<li>Choose <strong>Add<\/strong> to save the step.<\/li>\n<\/ol>\n<h3>Rolling window features<\/h3>\n<p>We can also calculate meaningful statistical summaries across a range of values and include them as input features. 
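A hedged pandas sketch of rolling window statistics on a synthetic series; tsfresh's minimal feature subset computes a similar set of simple per-window statistics (mean, standard deviation, minimum, maximum, and so on):

```python
import pandas as pd

# Synthetic daily values (illustrative, standing in for Volume USD).
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# A window size of 7 in Data Wrangler combines the current value with the
# previous seven; pandas expresses that as a rolling window of length 8.
window = s.rolling(window=8)

feats = pd.DataFrame({
    "rolling_mean": window.mean(),
    "rolling_std": window.std(),
    "rolling_min": window.min(),
    "rolling_max": window.max(),
})

# Only the last row has a complete window of 8 observations.
print(feats["rolling_mean"].iloc[-1])  # 4.5
```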
Let\u2019s extract common statistical time series features.<\/p>\n<p>Data Wrangler implements automatic time series feature extraction capabilities using the open source <a href=\"https:\/\/tsfresh.readthedocs.io\/en\/v0.17.0\/text\/list_of_features.html\" target=\"_blank\" rel=\"noopener noreferrer\">tsfresh<\/a> package. With the time series feature extraction transforms, you can automate the feature extraction process. This eliminates the time and effort otherwise spent manually implementing signal processing routines. For this post, we extract features using the <strong>Rolling window features<\/strong> transform. This method computes statistical properties across a set of observations defined by the window size.<\/p>\n<ol>\n<li>Choose <strong>+ Add step<\/strong>.<\/li>\n<li>Choose the <strong>Time Series<\/strong> transform.<\/li>\n<li>For <strong>Transform<\/strong>, choose <strong>Rolling window features<\/strong>.<\/li>\n<li>For <strong>Generate rolling window features for this column<\/strong>, choose <strong>Volume USD<\/strong>.<\/li>\n<li>For <strong>Timestamp Column<\/strong>, choose <strong>date<\/strong>.<\/li>\n<li>For <strong>Window size<\/strong>, enter <code>7<\/code>.<\/li>\n<\/ol>\n<p>Specifying a window size of <code>7<\/code> computes features by combining the value at the current timestamp and values for the previous seven timestamps.<\/p>\n<ol start=\"7\">\n<li>Select <strong>Flatten<\/strong> to create a new column for each computed feature.<\/li>\n<li>For the strategy, choose <strong>Minimal subset<\/strong>.<\/li>\n<\/ol>\n<p>This strategy extracts eight features that are useful in downstream analyses. Other strategies include <strong>Efficient Subset<\/strong>, <strong>Custom subset<\/strong>, and <strong>All features<\/strong>.
For a full list of features available for extraction, refer to <a href=\"https:\/\/tsfresh.readthedocs.io\/en\/latest\/text\/list_of_features.html\" target=\"_blank\" rel=\"noopener noreferrer\">Overview on extracted features<\/a>.<\/p>\n<ol start=\"9\">\n<li>Choose <strong>Preview<\/strong>.<\/li>\n<\/ol>\n<p>We can see eight new columns, each with the specified window size of <code>7<\/code> in its name, appended to our dataset.<\/p>\n<ol start=\"10\">\n<li>Choose <strong>Add<\/strong> to save the step.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32949\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-extract_features.png\" alt=\"\" width=\"1276\" height=\"569\"><\/p>\n<h2>Export the dataset<\/h2>\n<p>We have transformed the time series dataset and are ready to use the transformed dataset as input for a forecasting algorithm. The last step is to export the transformed dataset to Amazon S3. In Data Wrangler, you can choose <strong>Export step<\/strong> to automatically generate a Jupyter notebook with Amazon SageMaker Processing code for processing and exporting the transformed dataset to an S3 bucket.
However, because our dataset contains just over 300 records, let\u2019s take advantage of the <strong>Export data<\/strong> option in the <strong>Add Transform<\/strong> view to export the transformed dataset directly to Amazon S3 from Data Wrangler.<\/p>\n<ol>\n<li>Choose <strong>Export data<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32948\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-export-data.png\" alt=\"\" width=\"987\" height=\"517\"><\/p>\n<ol start=\"2\">\n<li>For <strong>S3 location<\/strong>, choose <strong>Browse<\/strong> and choose your S3 bucket.<\/li>\n<li>Choose <strong>Export data<\/strong>.<\/li>\n<\/ol>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-32944\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/13\/ML-6133-browse.png\" alt=\"\" width=\"1276\" height=\"346\"><\/p>\n<p>Now that we have successfully transformed the bitcoin dataset, we can use <a href=\"https:\/\/aws.amazon.com\/forecast\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Forecast<\/a> to generate bitcoin predictions.<\/p>\n<h2>Clean up<\/h2>\n<p>If you\u2019re done with this use case, clean up the resources you created to avoid incurring additional charges. For Data Wrangler, you can shut down the underlying instance when finished. Refer to the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-shut-down.html\" target=\"_blank\" rel=\"noopener noreferrer\">Shut Down Data Wrangler<\/a> documentation for details.
Alternatively, you can continue to <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/?p=32976&amp;preview=true\" target=\"_blank\" rel=\"noopener noreferrer\">Part 2<\/a> of this series to use this dataset for forecasting.<\/p>\n<h2>Summary<\/h2>\n<p>This post demonstrated how to utilize Data Wrangler to simplify and accelerate time series analysis using its built-in time series capabilities. We explored how data scientists can easily and interactively clean, format, validate, and transform time series data into the desired format, for meaningful analysis. We also explored how you can enrich your time series analysis by adding a comprehensive set of statistical features using Data Wrangler. To learn more about time series transformations in Data Wrangler, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/data-wrangler-transform.html\" target=\"_blank\" rel=\"noopener noreferrer\">Transform Data<\/a>.<\/p>\n<hr>\n<h3>About the Author<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/20\/Roop-Bains.png\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-31932 size-full alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/20\/Roop-Bains.png\" alt=\"\" width=\"100\" height=\"132\"><\/a>Roop Bains <\/strong>is a Solutions Architect at AWS focusing on AI\/ML. He is passionate about helping customers innovate and achieve their business objectives using Artificial Intelligence and Machine Learning. 
In his spare time, Roop enjoys reading and hiking.<\/p>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/14\/Nikita-headshot-1.png\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-33054 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/14\/Nikita-headshot-1.png\" alt=\"\" width=\"100\" height=\"134\"><\/a><strong>Nikita Ivkin\u00a0<\/strong><\/strong>is an Applied Scientist, Amazon SageMaker Data Wrangler.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/prepare-time-series-data-with-amazon-sagemaker-data-wrangler\/<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1561"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1561"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1561\/revisions"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1561"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1561"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1561"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templat
ed":true}]}}