{"id":546,"date":"2020-11-14T04:22:02","date_gmt":"2020-11-14T04:22:02","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/11\/14\/how-to-identify-overfitting-machine-learning-models-in-scikit-learn\/"},"modified":"2020-11-14T04:22:02","modified_gmt":"2020-11-14T04:22:02","slug":"how-to-identify-overfitting-machine-learning-models-in-scikit-learn","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/11\/14\/how-to-identify-overfitting-machine-learning-models-in-scikit-learn\/","title":{"rendered":"How to Identify Overfitting Machine Learning Models in Scikit-Learn"},"content":{"rendered":"<div id=\"\">\n<p id=\"last-modified-info\">Last Updated on November 13, 2020<\/p>\n<p><strong>Overfitting<\/strong> is a common explanation for the poor performance of a predictive model.<\/p>\n<p>An analysis of learning dynamics can help to identify whether a model has overfit the training dataset and may suggest an alternate configuration to use that could result in better predictive performance.<\/p>\n<p>Performing an analysis of learning dynamics is straightforward for algorithms that learn incrementally, like neural networks, but it is less clear how we might perform the same analysis with other algorithms that do not learn incrementally, such as decision trees, k-nearest neighbors, and other general algorithms in the scikit-learn machine learning library.<\/p>\n<p>In this tutorial, you will discover how to identify overfitting for machine learning models in Python.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Overfitting is a possible cause of poor generalization performance of a predictive model.<\/li>\n<li>Overfitting can be analyzed for machine learning models by varying key model hyperparameters.<\/li>\n<li>Although overfitting is a useful tool for analysis, it must not be confused with model selection.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_11580\" 
class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-11580\" loading=\"lazy\" class=\"size-full wp-image-11580\" src=\"https:\/\/3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com\/wp-content\/uploads\/2021\/01\/Identify-Overfitting-Machine-Learning-Models-With-Scikit-Learn.jpg\" alt=\"Identify Overfitting Machine Learning Models With Scikit-Learn\" width=\"799\" height=\"533\"><\/p>\n<p id=\"caption-attachment-11580\" class=\"wp-caption-text\">Identify Overfitting Machine Learning Models With Scikit-Learn<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/icetsarina\/24082660497\/\">Bonnie Moreland<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into five parts; they are:<\/p>\n<ol>\n<li>What Is Overfitting<\/li>\n<li>How to Perform an Overfitting Analysis<\/li>\n<li>Example of Overfitting in Scikit-Learn<\/li>\n<li>Counterexample of Overfitting in Scikit-Learn<\/li>\n<li>Separate Overfitting Analysis From Model Selection<\/li>\n<\/ol>\n<h2>What Is Overfitting<\/h2>\n<p>Overfitting refers to an unwanted behavior of a machine learning algorithm used for predictive modeling.<\/p>\n<p>It is the case where model performance on the training dataset is improved at the cost of worse performance on data not seen during training, such as a holdout test dataset or new data.<\/p>\n<p>We can identify if a machine learning model has overfit by first evaluating the model on the training dataset and then evaluating the same model on a holdout test dataset.<\/p>\n<p>If the performance of the model on the training dataset is significantly better than the performance on the test dataset, then the model may have overfit the training dataset.<\/p>\n<p>We care about overfitting because it is a common cause for \u201c<em>poor generalization<\/em>\u201d of the model as measured by high \u201c<em>generalization error<\/em>.\u201d That is error made by the model when making predictions 
on new data.<\/p>\n<p>This means, if our model has poor performance, maybe it is because it has overfit.<\/p>\n<p>But what does it mean if a model\u2019s performance is \u201c<em>significantly better<\/em>\u201d on the training set compared to the test set?<\/p>\n<p>For example, it is common and perhaps normal for the model to have better performance on the training set than the test set.<\/p>\n<p>As such, we can perform an analysis of the algorithm on the dataset to better expose the overfitting behavior.<\/p>\n<h2>How to Perform an Overfitting Analysis<\/h2>\n<p>An overfitting analysis is an approach for exploring how and when a specific model is overfitting on a specific dataset.<\/p>\n<p>It is a tool that can help you learn more about the learning dynamics of a machine learning model.<\/p>\n<p>This might be achieved by reviewing the model behavior during a single run for algorithms like neural networks that are fit on the training dataset incrementally.<\/p>\n<p>A plot of the model performance on the train and test set can be calculated at each point during training and plots can be created. This plot is often called a learning curve plot, showing one curve for model performance on the training set and one curve for the test set for each increment of learning.<\/p>\n<p>If you would like to learn more about learning curves for algorithms that learn incrementally, see the tutorial:<\/p>\n<p>The common pattern for overfitting can be seen on learning curve plots, where model performance on the training dataset continues to improve (e.g. 
loss or error continues to fall or accuracy continues to rise) while performance on the test or validation set improves to a point and then begins to get worse.<\/p>\n<p>If this pattern is observed for an algorithm that learns incrementally, training should stop at the point where performance on the test or validation set begins to get worse.<\/p>\n<p>This makes sense for algorithms that learn incrementally, like neural networks, but what about other algorithms?<\/p>\n<ul>\n<li><strong>How do you perform an overfitting analysis for machine learning algorithms in scikit-learn?<\/strong><\/li>\n<\/ul>\n<p>One approach for performing an overfitting analysis on algorithms that do not learn incrementally is to vary a key model hyperparameter and evaluate the model performance on the train and test sets for each configuration.<\/p>\n<p>To make this clear, let\u2019s explore a case of analyzing a model for overfitting in the next section.<\/p>\n<h2>Example of Overfitting in Scikit-Learn<\/h2>\n<p>In this section, we will look at an example of overfitting a machine learning model to a training dataset.<\/p>\n<p>First, let\u2019s define a synthetic classification dataset.<\/p>\n<p>We will use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_classification.html\">make_classification() function<\/a> to define a binary (two-class) classification prediction problem with 10,000 examples (rows) and 20 input features (columns).<\/p>\n<p>The example below creates the dataset and summarizes the shape of the input and output components.<\/p>\n<pre>\n# synthetic classification dataset\nfrom sklearn.datasets import make_classification\n# define dataset\nX, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1)\n# summarize the dataset\nprint(X.shape, y.shape)\n<\/pre>\n<p>Running the example creates the dataset and reports the shape, confirming our expectations.<\/p>\n<pre>\n(10000, 20) (10000,)\n<\/pre>\n<p>Next, we need to split the dataset into train and test subsets.<\/p>\n<p>We will use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.train_test_split.html\">train_test_split() function<\/a> and split the data into 70 percent for training a model and 30 percent for evaluating it.<\/p>\n<pre>\n# split a dataset into train and test sets\nfrom sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\n# create dataset\nX, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1)\n# split into train test sets\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)\n# summarize the shape of the train and test sets\nprint(X_train.shape, X_test.shape, y_train.shape, y_test.shape)\n<\/pre>\n<p>Running the example splits the dataset, and we can confirm that we have 7,000 examples for training and 3,000 for evaluating a model.<\/p>\n<pre>\n(7000, 20) (3000, 20) (7000,) (3000,)\n<\/pre>\n<p>Next, we can explore a machine learning model overfitting the training dataset.<\/p>\n<p>We will use a decision tree via the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html\">DecisionTreeClassifier<\/a> and test different tree depths with the \u201c<em>max_depth<\/em>\u201d argument.<\/p>\n<p>Shallow decision trees (e.g. 
few levels) generally do not overfit but tend to have poor predictive performance (high bias, low variance), whereas deep trees (e.g. many levels) generally do overfit, performing well on the training dataset at the cost of worse performance on held-out data (low bias, high variance). A desirable tree is one that is not so shallow that it has low skill and not so deep that it overfits the training dataset.<\/p>\n<p>We evaluate decision tree depths from 1 to 20.<\/p>\n<pre>\n...\n# define the tree depths to evaluate\nvalues = [i for i in range(1, 21)]\n<\/pre>\n<p>We will enumerate each tree depth, fit a tree with a given depth on the training dataset, then evaluate the tree on both the train and test sets.<\/p>\n<p>The expectation is that as the depth of the tree increases, performance on train and test will improve to a point, and as the tree gets too deep, it will begin to overfit the training dataset at the expense of worse performance on the holdout test set.<\/p>\n<pre>\n...\n# evaluate a decision tree for each depth\nfor i in values:\n\t# configure the model\n\tmodel = DecisionTreeClassifier(max_depth=i)\n\t# fit model on the training dataset\n\tmodel.fit(X_train, y_train)\n\t# evaluate on the train dataset\n\ttrain_yhat = model.predict(X_train)\n\ttrain_acc = accuracy_score(y_train, train_yhat)\n\ttrain_scores.append(train_acc)\n\t# evaluate on the test dataset\n\ttest_yhat = model.predict(X_test)\n\ttest_acc = accuracy_score(y_test, test_yhat)\n\ttest_scores.append(test_acc)\n\t# summarize progress\n\tprint('&gt;%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))\n<\/pre>\n<p>At the end of the run, we plot all model accuracy scores on the train and test sets for visual comparison.<\/p>\n<pre>\n...\n# plot of train and test scores vs tree depth\npyplot.plot(values, train_scores, '-o', label='Train')\npyplot.plot(values, test_scores, '-o', label='Test')\npyplot.legend()\npyplot.show()\n<\/pre>\n<p>Tying this together, the complete example of exploring different tree depths on the synthetic binary classification dataset is listed below.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5faf595942a74418253267\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# evaluate decision tree performance on train and test sets with different tree depths<br \/>\nfrom sklearn.datasets import make_classification<br \/>\nfrom sklearn.model_selection import train_test_split<br \/>\nfrom sklearn.metrics import accuracy_score<br \/>\nfrom sklearn.tree import DecisionTreeClassifier<br \/>\nfrom matplotlib import pyplot<br \/>\n# create dataset<br \/>\nX, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1)<br \/>\n# split into train test sets<br \/>\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)<br \/>\n# define lists to collect scores<br \/>\ntrain_scores, test_scores = list(), list()<br \/>\n# define the tree depths to evaluate<br \/>\nvalues = [i for i in range(1, 21)]<br \/>\n# evaluate a decision tree for each depth<br \/>\nfor i in values:<br \/>\n\t# configure the model<br \/>\n\tmodel = DecisionTreeClassifier(max_depth=i)<br \/>\n\t# fit model on the 
training dataset<br \/>\n\tmodel.fit(X_train, y_train)<br \/>\n\t# evaluate on the train dataset<br \/>\n\ttrain_yhat = model.predict(X_train)<br \/>\n\ttrain_acc = accuracy_score(y_train, train_yhat)<br \/>\n\ttrain_scores.append(train_acc)<br \/>\n\t# evaluate on the test dataset<br \/>\n\ttest_yhat = model.predict(X_test)<br \/>\n\ttest_acc = accuracy_score(y_test, test_yhat)<br \/>\n\ttest_scores.append(test_acc)<br \/>\n\t# summarize progress<br \/>\n\tprint(&#8216;&gt;%d, train: %.3f, test: %.3f&#8217; % (i, train_acc, test_acc))<br \/>\n# plot of train and test scores vs tree depth<br \/>\npyplot.plot(values, train_scores, &#8216;-o&#8217;, label=&#8217;Train&#8217;)<br \/>\npyplot.plot(values, test_scores, &#8216;-o&#8217;, label=&#8217;Test&#8217;)<br \/>\npyplot.legend()<br \/>\npyplot.show()<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<p>22<\/p>\n<p>23<\/p>\n<p>24<\/p>\n<p>25<\/p>\n<p>26<\/p>\n<p>27<\/p>\n<p>28<\/p>\n<p>29<\/p>\n<p>30<\/p>\n<p>31<\/p>\n<p>32<\/p>\n<p>33<\/p>\n<p>34<\/p>\n<p>35<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># evaluate decision tree performance on train and test sets with different tree depths<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">make_classification<\/span><\/p>\n<p><span class=\"crayon-e\">from 
<\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">train_test_split<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">metrics <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">accuracy_score<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tree <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">DecisionTreeClassifier<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">matplotlib <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">pyplot<\/span><\/p>\n<p><span class=\"crayon-p\"># create dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_classification<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">10000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_redundant<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># split into train test sets<\/span><\/p>\n<p><span class=\"crayon-v\">X_train<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X_test<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y_train<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y_test<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">train_test_split<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">test_size<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">0.3<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># define lists to collect scores<\/span><\/p>\n<p><span class=\"crayon-v\">train_scores<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">test_scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># define the tree depths to evaluate<\/span><\/p>\n<p><span class=\"crayon-v\">values<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-i\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">21<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-p\"># evaluate a decision tree for each depth<\/span><\/p>\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">values<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-p\"># configure the model<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">DecisionTreeClassifier<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">max_depth<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-p\"># fit model on the training dataset<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">fit<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">X_train<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y_train<\/span><span 
class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-p\"># evaluate on the train dataset<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">train_yhat<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">predict<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">X_train<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">train_acc<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">accuracy_score<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">y_train<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">train_yhat<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">train_scores<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">train_acc<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-p\"># evaluate on the test dataset<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">test_yhat<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">predict<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">X_test<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">test_acc<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-e\">accuracy_score<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">y_test<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">test_yhat<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">test_scores<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">test_acc<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-p\"># summarize progress<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;&gt;%d, train: %.3f, test: %.3f&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">train_acc<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">test_acc<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># plot of train and test scores vs tree depth<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">plot<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">values<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">train_scores<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8216;-o&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">label<\/span><span class=\"crayon-o\">=<\/span><span 
class=\"crayon-s\">&#8216;Train&#8217;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">plot<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">values<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">test_scores<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8216;-o&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">label<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;Test&#8217;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">legend<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">show<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0012 seconds] --><\/p>\n<p>Running the example fits and evaluates a decision tree on the train and test sets for each tree depth and reports the accuracy scores.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. 
Consider running the example a few times and comparing the average outcome. Note that \u201c<em>train_test_split<\/em>\u201d is called without a \u201c<em>random_state<\/em>\u201d in this example, so the train\/test split will differ on each run; setting it makes the split repeatable.<\/p>\n<p>In this case, we can see that accuracy on the training dataset increases with tree depth until, around a depth of 19-20 levels, the tree fits the training dataset perfectly.<\/p>\n<p>We can also see that accuracy on the test set improves with tree depth until a depth of about eight or nine levels, after which it begins to get worse with each further increase in depth.<\/p>\n<p>This is exactly the pattern we would expect to see with overfitting.<\/p>\n<p>We would choose a tree depth of eight or nine, before the model begins to overfit the training dataset.<\/p>\n<div id=\"urvanov-syntax-highlighter-5faf595942a75207695972\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&gt;1, train: 0.769, test: 0.761<br \/>\n&gt;2, train: 0.808, test: 0.804<br \/>\n&gt;3, train: 0.879, test: 0.878<br \/>\n&gt;4, train: 0.902, test: 0.896<br \/>\n&gt;5, train: 0.915, test: 0.903<br \/>\n&gt;6, train: 0.929, test: 0.918<br \/>\n&gt;7, train: 0.942, test: 0.921<br \/>\n&gt;8, train: 0.951, test: 0.924<br \/>\n&gt;9, train: 0.959, test: 0.926<br \/>\n&gt;10, train: 0.968, test: 0.923<br \/>\n&gt;11, train: 0.977, test: 0.925<br \/>\n&gt;12, train: 0.983, test: 0.925<br \/>\n&gt;13, train: 0.987, test: 0.926<br \/>\n&gt;14, train: 0.992, test: 0.921<br \/>\n&gt;15, train: 0.995, test: 0.920<br \/>\n&gt;16, train: 0.997, test: 0.913<br \/>\n&gt;17, train: 0.999, test: 0.918<br \/>\n&gt;18, train: 0.999, test: 0.918<br \/>\n&gt;19, train: 1.000, test: 0.914<br \/>\n&gt;20, train: 1.000, test: 0.913<\/textarea><\/p>\n<\/div>\n<p>A figure is also created that shows line plots of the model accuracy on the train and test sets for different tree depths.<\/p>\n<p>The plot clearly shows that increasing the tree depth in the early stages improves accuracy on both the train and test sets.<\/p>\n<p>This continues until a depth of around 10 levels, after which the model overfits the training dataset at the cost of worse performance on the holdout dataset.<\/p>\n<div 
id=\"attachment_11577\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-11577\" loading=\"lazy\" class=\"wp-image-11577 size-full\" src=\"https:\/\/3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com\/wp-content\/uploads\/2020\/09\/Line-Plot-of-Decision-Tree-Accuracy-on-Train-and-Test-Datasets-for-Different-Tree-Depths.png\" alt=\"Line Plot of Decision Tree Accuracy on Train and Test Datasets for Different Tree Depths\" width=\"1280\" height=\"960\"><\/p>\n<p id=\"caption-attachment-11577\" class=\"wp-caption-text\">Line Plot of Decision Tree Accuracy on Train and Test Datasets for Different Tree Depths<\/p>\n<\/div>\n<p>This analysis is interesting: it shows why the model performs worse on the holdout test set when \u201c<em>max_depth<\/em>\u201d is set to large values.<\/p>\n<p>But it is not required.<\/p>\n<p>We can just as easily choose a \u201c<em>max_depth<\/em>\u201d with a grid search, without analyzing why some values perform better than others.<\/p>\n<p>In fact, in the next section, we will see where this kind of analysis can be misleading.<\/p>\n<h2>Counterexample of Overfitting in Scikit-Learn<\/h2>\n<p>Sometimes, we may perform an analysis of machine learning model behavior and be deceived by the results.<\/p>\n<p>A good example of this is varying the number of neighbors for the k-nearest neighbors algorithm, which we can implement using the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.neighbors.KNeighborsClassifier.html\">KNeighborsClassifier<\/a> class and configure via the \u201c<em>n_neighbors<\/em>\u201d argument.<\/p>\n<p>Let\u2019s forget how KNN works for the moment.<\/p>\n<p>We can perform the same analysis of the KNN algorithm as we did in the previous section for the decision tree and see if our model overfits for different configuration values. 
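<\/p>\n<p>The analysis loop differs from the decision-tree version only in the line that builds the model. A minimal sketch of that shared sweep pattern is below; the \u201c<em>evaluate_values<\/em>\u201d helper, the smaller dataset, and the fixed \u201c<em>random_state<\/em>\u201d are illustrative assumptions, not part of the tutorial.<\/p>

```python
# Sketch: a generic train/test accuracy sweep over one hyperparameter.
# The evaluate_values() helper, the smaller dataset, and the fixed
# random_state are illustrative assumptions, not from the tutorial.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate_values(make_model, values, X_train, X_test, y_train, y_test):
    # fit one model per hyperparameter value, scoring on train and test sets
    train_scores, test_scores = list(), list()
    for value in values:
        model = make_model(value)
        model.fit(X_train, y_train)
        train_scores.append(accuracy_score(y_train, model.predict(X_train)))
        test_scores.append(accuracy_score(y_test, model.predict(X_test)))
    return train_scores, test_scores

# same kind of synthetic dataset as the tutorial, but smaller for speed
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)
# sweep n_neighbors; swapping the lambda reproduces the decision-tree sweep
train_scores, test_scores = evaluate_values(
    lambda v: KNeighborsClassifier(n_neighbors=v), range(1, 6),
    X_train, X_test, y_train, y_test)
for v, tr, te in zip(range(1, 6), train_scores, test_scores):
    print('>%d, train: %.3f, test: %.3f' % (v, tr, te))
```

<p>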
In this case, we will vary the number of neighbors from 1 to 50 to make the effect more pronounced.<\/p>\n<p>The complete example is listed below.<\/p>\n<div id=\"urvanov-syntax-highlighter-5faf595942a79455274825\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# evaluate knn performance on train and test sets with different numbers of neighbors<br \/>\nfrom sklearn.datasets import make_classification<br \/>\nfrom sklearn.model_selection import train_test_split<br \/>\nfrom sklearn.metrics import accuracy_score<br \/>\nfrom sklearn.neighbors import KNeighborsClassifier<br \/>\nfrom matplotlib import pyplot<br \/>\n# create dataset<br \/>\nX, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1)<br \/>\n# split into train test sets<br \/>\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)<br \/>\n# define lists to collect scores<br \/>\ntrain_scores, test_scores = list(), list()<br \/>\n# define the numbers of neighbors to evaluate<br \/>\nvalues = [i for i in range(1, 51)]<br \/>\n# evaluate a knn model for each number of neighbors<br \/>\nfor i in values:<br \/>\n\t# configure the model<br \/>\n\tmodel = KNeighborsClassifier(n_neighbors=i)<br \/>\n\t# fit model on the training dataset<br \/>\n\tmodel.fit(X_train, y_train)<br \/>\n\t# evaluate on the train dataset<br \/>\n\ttrain_yhat = model.predict(X_train)<br \/>\n\ttrain_acc = accuracy_score(y_train, train_yhat)<br \/>\n\ttrain_scores.append(train_acc)<br \/>\n\t# evaluate on the test dataset<br \/>\n\ttest_yhat = model.predict(X_test)<br \/>\n\ttest_acc = accuracy_score(y_test, test_yhat)<br \/>\n\ttest_scores.append(test_acc)<br \/>\n\t# summarize progress<br \/>\n\tprint('&gt;%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))<br \/>\n# plot of train and test scores vs number of neighbors<br \/>\npyplot.plot(values, train_scores, '-o', label='Train')<br \/>\npyplot.plot(values, test_scores, '-o', label='Test')<br \/>\npyplot.legend()<br \/>\npyplot.show()<\/textarea><\/p>\n<\/div>\n<p>Running the example fits and evaluates a KNN model on the train and test sets for each number of neighbors and reports the accuracy scores.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. 
Consider running the example a few times and comparing the average outcome.<\/p>\n<p>Recall that we are looking for a pattern where performance on the test set improves and then starts to get worse, while performance on the training set continues to improve.<\/p>\n<p>We do not see this pattern.<\/p>\n<p>Instead, we see that accuracy on the training dataset starts at 100 percent and falls with almost every increase in the number of neighbors.<\/p>\n<p>We also see that performance of the model on the holdout test set improves up to about five neighbors, holds level, and then begins a downward trend.<\/p>\n<div id=\"urvanov-syntax-highlighter-5faf595942a7a641892427\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&gt;1, train: 1.000, test: 0.919<br \/>\n&gt;2, train: 0.965, test: 0.916<br \/>\n&gt;3, train: 0.962, test: 0.932<br \/>\n&gt;4, train: 0.957, test: 0.932<br \/>\n&gt;5, train: 0.954, test: 0.935<br \/>\n&gt;6, train: 0.953, test: 0.934<br \/>\n&gt;7, train: 0.952, test: 0.932<br \/>\n&gt;8, train: 0.951, test: 0.933<br \/>\n&gt;9, train: 0.949, test: 0.933<br \/>\n&gt;10, train: 0.950, test: 0.935<br \/>\n&gt;11, train: 0.947, test: 0.934<br \/>\n&gt;12, train: 0.947, test: 0.933<br \/>\n&gt;13, train: 0.945, test: 0.932<br \/>\n&gt;14, train: 0.945, test: 0.932<br \/>\n&gt;15, train: 0.944, test: 0.932<br \/>\n&gt;16, train: 0.944, test: 0.934<br \/>\n&gt;17, train: 0.943, test: 0.932<br \/>\n&gt;18, train: 0.943, test: 0.935<br \/>\n&gt;19, train: 0.942, test: 0.933<br \/>\n&gt;20, train: 0.943, test: 0.935<br \/>\n&gt;21, train: 0.942, test: 0.933<br \/>\n&gt;22, train: 0.943, test: 0.933<br \/>\n&gt;23, train: 0.941, test: 0.932<br 
\/>\n&gt;24, train: 0.942, test: 0.932<br \/>\n&gt;25, train: 0.942, test: 0.931<br \/>\n&gt;26, train: 0.941, test: 0.930<br \/>\n&gt;27, train: 0.941, test: 0.932<br \/>\n&gt;28, train: 0.939, test: 0.932<br \/>\n&gt;29, train: 0.938, test: 0.931<br \/>\n&gt;30, train: 0.938, test: 0.931<br \/>\n&gt;31, train: 0.937, test: 0.931<br \/>\n&gt;32, train: 0.938, test: 0.931<br \/>\n&gt;33, train: 0.937, test: 0.930<br \/>\n&gt;34, train: 0.938, test: 0.931<br \/>\n&gt;35, train: 0.937, test: 0.930<br \/>\n&gt;36, train: 0.937, test: 0.928<br \/>\n&gt;37, train: 0.936, test: 0.930<br \/>\n&gt;38, train: 0.937, test: 0.930<br \/>\n&gt;39, train: 0.935, test: 0.929<br \/>\n&gt;40, train: 0.936, test: 0.929<br \/>\n&gt;41, train: 0.936, test: 0.928<br \/>\n&gt;42, train: 0.936, test: 0.929<br \/>\n&gt;43, train: 0.936, test: 0.930<br \/>\n&gt;44, train: 0.935, test: 0.929<br \/>\n&gt;45, train: 0.935, test: 0.929<br \/>\n&gt;46, train: 0.934, test: 0.929<br \/>\n&gt;47, train: 0.935, test: 0.929<br \/>\n&gt;48, train: 0.934, test: 0.929<br \/>\n&gt;49, train: 0.934, test: 0.929<br \/>\n&gt;50, train: 0.934, test: 0.929<\/textarea><\/p>\n<\/div>\n<p>A figure is also created that shows line plots of the model accuracy on the train and test sets with different numbers of neighbors.<\/p>\n<p>The plots make the situation clearer. It looks as though the line plot for the training set is dropping to converge with the line for the test set. Indeed, this is exactly what is happening.<\/p>\n<div id=\"attachment_11578\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-11578\" loading=\"lazy\" class=\"size-full wp-image-11578\" src=\"https:\/\/3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com\/wp-content\/uploads\/2020\/09\/Line-Plot-of-KNN-Accuracy-on-Train-and-Test-Datasets-for-Different-Numbers-of-Neighbors.png\" alt=\"Line Plot of KNN Accuracy on Train and Test Datasets for Different Numbers of Neighbors\" width=\"1280\" height=\"960\"><\/p>\n<p id=\"caption-attachment-11578\" class=\"wp-caption-text\">Line Plot of KNN Accuracy on Train and Test Datasets for Different Numbers of Neighbors<\/p>\n<\/div>\n<p>Now, recall how KNN works.<\/p>\n<p>The \u201c<em>model<\/em>\u201d is really just the entire training dataset stored in an efficient data structure. 
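<\/p>\n<p>Because the model is just the stored training data, a 1-nearest neighbor classifier assigns each training example its own label, so training accuracy is necessarily perfect. A minimal sketch of this check (assuming scikit-learn is installed; the synthetic dataset here is illustrative, not the tutorial's data):<\/p>

```python
# With n_neighbors=1, every training point is its own nearest neighbor,
# so the model reproduces the training labels exactly.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# small synthetic classification dataset (illustrative)
X, y = make_classification(n_samples=100, n_features=5, random_state=1)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
print('train accuracy: %.3f' % model.score(X, y))  # prints 1.000
```

<p>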
With one neighbor, skill for the \u201c<em>model<\/em>\u201d on the training dataset must be 100 percent, because each training example is its own nearest neighbor; anything less would be surprising.<\/p>\n<p>In fact, this argument holds for any machine learning algorithm and cuts to the core of the confusion around overfitting for beginners.<\/p>\n<h2>Separate Overfitting Analysis From Model Selection<\/h2>\n<p>Overfitting can be an explanation for the poor performance of a predictive model.<\/p>\n<p>Creating learning curve plots that show the learning dynamics of a model on the train and test datasets is a helpful way to learn more about the behavior of a model on a dataset.<\/p>\n<p><strong>But overfitting should not be confused with model selection.<\/strong><\/p>\n<p>We choose a predictive model or model configuration based on its out-of-sample performance; that is, its performance on new data not seen during training.<\/p>\n<p>The reason we do this is that in predictive modeling, we are primarily interested in a model that makes skillful predictions. We want the model that can make the best possible predictions given the time and computational resources we have available.<\/p>\n<p>This might mean we choose a model that looks like it has overfit the training dataset, in which case an overfitting analysis might be misleading.<\/p>\n<p>It might also mean that the chosen model has poor or even terrible performance on the training dataset.<\/p>\n<p>In general, if we cared about model performance on the training dataset in model selection, then we would expect a model to have perfect performance on the training dataset. 
It\u2019s data we have available; we should not tolerate anything less.<\/p>\n<p>As we saw with the KNN example above, we can achieve perfect performance on the training set by storing the training set directly and returning predictions with one neighbor, at the cost of poor performance on any new data.<\/p>\n<ul>\n<li><strong>Wouldn\u2019t a model that performs well on both train and test datasets be a better model?<\/strong><\/li>\n<\/ul>\n<p>Maybe. But maybe not.<\/p>\n<p>This argument is based on the idea that a model that performs well on both train and test sets has a better understanding of the underlying problem.<\/p>\n<p>A corollary is that a model that performs well on the test set but poorly on the training set is lucky (e.g., a statistical fluke), and a model that performs well on the train set but poorly on the test set is overfit.<\/p>\n<p>I believe this is the sticking point for beginners who often ask how to fix overfitting in their scikit-learn machine learning model.<\/p>\n<p>The worry is that the model must perform well on both train and test sets; otherwise, they are in trouble.<\/p>\n<p><strong>This is not the case<\/strong>.<\/p>\n<p>Performance on the training set is not relevant during model selection. 
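<\/p>\n<p>In code, that means the selection loop compares candidate configurations on held-out accuracy alone. A minimal sketch (assuming scikit-learn; the synthetic dataset and variable names are illustrative):<\/p>

```python
# Select the number of neighbors using held-out (test) accuracy only;
# training-set accuracy plays no role in the choice.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic classification dataset (illustrative)
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

best_k, best_score = None, 0.0
for k in range(1, 51):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    test_acc = model.score(X_test, y_test)  # out-of-sample accuracy
    if test_acc > best_score:
        best_k, best_score = k, test_acc
print('best k=%d, test accuracy=%.3f' % (best_k, best_score))
```

<p>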
You must focus only on out-of-sample performance when choosing a predictive model.<\/p>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to identify overfitting for machine learning models in Python.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Overfitting is a possible cause of poor generalization performance of a predictive model.<\/li>\n<li>Overfitting can be analyzed for machine learning models by varying key model hyperparameters.<\/li>\n<li>Although an overfitting analysis is a useful tool, it must not be confused with model selection.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>Ask your questions in the comments below and I will do my best to answer.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/machinelearningmastery.com\/overfitting-machine-learning-models\/<\/p>\n","protected":false},"author":0,"featured_media":547,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/546"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=546"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/546\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/547"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=546"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=546"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=546"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}