{"id":604,"date":"2020-11-23T00:28:19","date_gmt":"2020-11-23T00:28:19","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/11\/23\/extreme-gradient-boosting-xgboost-ensemble-in-python\/"},"modified":"2020-11-23T00:28:19","modified_gmt":"2020-11-23T00:28:19","slug":"extreme-gradient-boosting-xgboost-ensemble-in-python","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/11\/23\/extreme-gradient-boosting-xgboost-ensemble-in-python\/","title":{"rendered":"Extreme Gradient Boosting (XGBoost) Ensemble in Python"},"content":{"rendered":"<div id=\"\">\n<p>Extreme Gradient Boosting (XGBoost) is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm.<\/p>\n<p>Although other open-source implementations of the approach existed before XGBoost, the release of XGBoost appeared to unleash the power of the technique and made the applied machine learning community take notice of gradient boosting more generally.<\/p>\n<p>Shortly after its development and initial release, XGBoost became the go-to method and often the key component in winning solutions for classification and regression problems in machine learning competitions.<\/p>\n<p>In this tutorial, you will discover how to develop Extreme Gradient Boosting ensembles for classification and regression.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>Extreme Gradient Boosting is an efficient open-source implementation of the stochastic gradient boosting ensemble algorithm.<\/li>\n<li>How to develop XGBoost ensembles for classification and regression with the scikit-learn API.<\/li>\n<li>How to explore the effect of XGBoost model hyperparameters on model performance.<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_11222\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-11222\" loading=\"lazy\" class=\"size-full wp-image-11222\" src=\"https:\/\/3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com\/wp-content\/uploads\/2020\/11\/Extreme-Gradient-Boosting-XGBoost-Ensemble-in-Python.jpg\" alt=\"Extreme Gradient Boosting (XGBoost) Ensemble in Python\" width=\"799\" height=\"533\"><\/p>\n<p id=\"caption-attachment-11222\" class=\"wp-caption-text\">Extreme Gradient Boosting (XGBoost) Ensemble in Python<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/anieto2k\/24313819645\/\">Andr\u00e9s Nieto Porras<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial Overview<\/h2>\n<p>This tutorial is divided into three parts; they are:<\/p>\n<ol>\n<li>Extreme Gradient Boosting Algorithm<\/li>\n<li>XGBoost Scikit-Learn API\n<ol>\n<li>XGBoost Ensemble for Classification<\/li>\n<li>XGBoost Ensemble for Regression<\/li>\n<\/ol>\n<\/li>\n<li>XGBoost Hyperparameters\n<ol>\n<li>Explore Number of Trees<\/li>\n<li>Explore Tree Depth<\/li>\n<li>Explore Learning Rate<\/li>\n<li>Explore Number of Samples<\/li>\n<li>Explore Number of Features<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<h2>Extreme Gradient Boosting Algorithm<\/h2>\n<p><a href=\"https:\/\/machinelearningmastery.com\/gradient-boosting-machine-ensemble-in-python\/\">Gradient boosting<\/a> refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.<\/p>\n<p>Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.<\/p>\n<p>Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, \u201c<em>gradient boosting<\/em>,\u201d as the loss gradient is minimized as the model is fit, much like a neural network.<\/p>\n<p>For more on gradient boosting, see the tutorial:<\/p>\n<p>Extreme Gradient Boosting, or XGBoost for short is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.<\/p>\n<p>It was initially developed by <a href=\"https:\/\/www.linkedin.com\/in\/tianqi-chen-679a9856\/\">Tianqi Chen<\/a> and was described by Chen and <a href=\"https:\/\/homes.cs.washington.edu\/~guestrin\/index.html\">Carlos Guestrin<\/a> in their 2016 paper titled \u201c<a href=\"https:\/\/arxiv.org\/abs\/1603.02754\">XGBoost: A Scalable Tree Boosting System<\/a>.\u201d<\/p>\n<p>It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.<\/p>\n<blockquote>\n<p>The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost.<\/p>\n<\/blockquote>\n<p>\u2014 Tianqi Chen, in answer to the question \u201c<a href=\"https:\/\/www.quora.com\/What-is-the-difference-between-the-R-gbm-gradient-boosting-machine-and-xgboost-extreme-gradient-boosting\/answer\/Tianqi-Chen-1\">What is the difference between the R gbm (gradient boosting machine) and xgboost (extreme gradient boosting)?<\/a>\u201d on Quora<\/p>\n<p>The two main reasons to use XGBoost are execution speed and model performance.<\/p>\n<p>Generally, XGBoost is fast when compared to other implementations of gradient boosting. Szilard Pafka performed some objective benchmarks comparing the performance of XGBoost to other implementations of gradient boosting and bagged decision trees. He wrote up his results in May 2015 in the blog post titled \u201c<a href=\"http:\/\/datascience.la\/benchmarking-random-forest-implementations\/\">Benchmarking Random Forest Implementations<\/a>.\u201d<\/p>\n<p>His results showed that XGBoost was almost always faster than the other benchmarked implementations from R, Python Spark, and H2O.<\/p>\n<p>From his experiment, he commented:<\/p>\n<blockquote>\n<p>I also tried xgboost, a popular library for boosting which is capable of building random forests as well. It is fast, memory efficient and of high accuracy<\/p>\n<\/blockquote>\n<p>\u2014 <a href=\"http:\/\/datascience.la\/benchmarking-random-forest-implementations\/\">Benchmarking Random Forest Implementations, Szilard Pafka<\/a>, 2015.<\/p>\n<p>XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.<\/p>\n<blockquote>\n<p>Among the 29 challenge winning solutions 3 published at Kaggle\u2019s blog during 2015, 17 solutions used XGBoost. [\u2026] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10.<\/p>\n<\/blockquote>\n<p>\u2014 <a href=\"https:\/\/arxiv.org\/abs\/1603.02754\">XGBoost: A Scalable Tree Boosting System<\/a>, 2016.<\/p>\n<p>Now that we are familiar with what XGBoost is and why it is important, let\u2019s take a closer look at how we can use it in our predictive modeling projects.<\/p>\n<h2>XGBoost Scikit-Learn API<\/h2>\n<p>XGBoost can be installed as a standalone library and an XGBoost model can be developed using the scikit-learn API.<\/p>\n<p>The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip python package manager on most platforms; for example:<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<p><!-- [Format Time: 0.0001 seconds] --><\/p>\n<p>You can then confirm that the XGBoost library was installed correctly and can be used by running the following script.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e3b169149816\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# check xgboost version<br \/>\nimport xgboost<br \/>\nprint(xgboost.__version__)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># check xgboost version<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">xgboost<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">xgboost<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">__version__<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0001 seconds] --><\/p>\n<p>Running the script will print your version of the XGBoost library you have installed.<\/p>\n<p>Your version should be the same or higher. If not, you must upgrade your version of the XGBoost library.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>It is possible that you may have problems with the latest version of the library. It is not your fault.<\/p>\n<p>Sometimes, the most recent version of the library imposes additional requirements or may be less stable.<\/p>\n<p>If you do have errors when trying to run the above script, I recommend downgrading to version 1.0.1 (or lower). This can be achieved by specifying the version to install to the pip command, as follows:<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e3e041504299\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\nsudo pip install xgboost==1.0.1<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>sudo pip install xgboost==1.0.1<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>If you see a warning message, you can safely ignore it for now. For example, below is an example of a warning message that you may see and can ignore:<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e3f127969355\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\nFutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>If you require specific instructions for your development environment, see the tutorial:<\/p>\n<p>The XGBoost library has its own custom API, although we will use the method via the scikit-learn wrapper classes: <a href=\"https:\/\/xgboost.readthedocs.io\/en\/latest\/python\/python_api.html#xgboost.XGBRegressor\">XGBRegressor<\/a> and <a href=\"https:\/\/xgboost.readthedocs.io\/en\/latest\/python\/python_api.html#xgboost.XGBClassifier\">XGBClassifier<\/a>. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.<\/p>\n<p>Both models operate the same way and take the same arguments that influence how the decision trees are created and added to the ensemble.<\/p>\n<p>Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it will produce a slightly different model.<\/p>\n<p>When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.<\/p>\n<p>Let\u2019s take a look at how to develop an XGBoost ensemble for both classification and regression.<\/p>\n<h3>XGBoost Ensemble for Classification<\/h3>\n<p>In this section, we will look at using XGBoost for a classification problem.<\/p>\n<p>First, we can use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_classification.html\">make_classification() function<\/a> to create a synthetic binary classification problem with 1,000 examples and 20 input features.<\/p>\n<p>The complete example is listed below.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e41054423534\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# test classification dataset<br \/>\nfrom sklearn.datasets import make_classification<br \/>\n# define dataset<br \/>\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)<br \/>\n# summarize the dataset<br \/>\nprint(X.shape, y.shape)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># test classification dataset<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-v\">make<\/span><span class=\"crayon-sy\">_<\/span>classification<\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_classification<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_redundant<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># summarize the dataset<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0003 seconds] --><\/p>\n<p>Running the example creates the dataset and summarizes the shape of the input and output components.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>Next, we can evaluate an XGBoost model on this dataset.<\/p>\n<p>We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e44302981006\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# evaluate xgboost algorithm for classification<br \/>\nfrom numpy import mean<br \/>\nfrom numpy import std<br \/>\nfrom sklearn.datasets import make_classification<br \/>\nfrom sklearn.model_selection import cross_val_score<br \/>\nfrom sklearn.model_selection import RepeatedStratifiedKFold<br \/>\nfrom xgboost import XGBClassifier<br \/>\n# define dataset<br \/>\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)<br \/>\n# define the model<br \/>\nmodel = XGBClassifier()<br \/>\n# evaluate the model<br \/>\ncv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)<br \/>\nn_scores = cross_val_score(model, X, y, scoring=&#8217;accuracy&#8217;, cv=cv, n_jobs=-1)<br \/>\n# report performance<br \/>\nprint(&#8216;Accuracy: %.3f (%.3f)&#8217; % (mean(n_scores), std(n_scores)))<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># evaluate xgboost algorithm for classification<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">mean<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">std<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">make_classification<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">cross_val_score<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">xgboost <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">XGBClassifier<\/span><\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_classification<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_redundant<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># define the model<\/span><\/p>\n<p><span class=\"crayon-v\">model<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">XGBClassifier<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># evaluate the model<\/span><\/p>\n<p><span class=\"crayon-v\">cv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_splits<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">10<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_repeats<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">n_scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">cross_val_score<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">scoring<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;accuracy&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_jobs<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># report performance<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;Accuracy: %.3f (%.3f)&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">mean<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">std<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0006 seconds] --><\/p>\n<p>Running the example reports the mean and standard deviation accuracy of the model.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see the XGBoost ensemble with default hyperparameters achieves a classification accuracy of about 92.5 percent on this test dataset.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>We can also use the XGBoost model as a final model and make predictions for classification.<\/p>\n<p>First, the XGBoost ensemble is fit on all available data, then the <em>predict()<\/em> function can be called to make predictions on new data. Importantly, this function expects data to always be provided as a NumPy array as a matrix with one row for each input sample.<\/p>\n<p>The example below demonstrates this on our binary classification dataset.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e47473977887\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# make predictions using xgboost for classification<br \/>\nfrom numpy import asarray<br \/>\nfrom sklearn.datasets import make_classification<br \/>\nfrom xgboost import XGBClassifier<br \/>\n# define dataset<br \/>\nX, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)<br \/>\n# define the model<br \/>\nmodel = XGBClassifier()<br \/>\n# fit the model on the whole dataset<br \/>\nmodel.fit(X, y)<br \/>\n# make a single prediction<br \/>\nrow = [0.2929949,-4.21223056,-1.288332,-2.17849815,-0.64527665,2.58097719,0.28422388,-7.1827928,-1.91211104,2.73729512,0.81395695,3.96973717,-2.66939799,3.34692332,4.19791821,0.99990998,-0.30201875,-4.43170633,-2.82646737,0.44916808]<br \/>\nrow = asarray([row])<br \/>\nyhat = model.predict(row)<br \/>\nprint(&#8216;Predicted Class: %d&#8217; % yhat[0])<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># make predictions using xgboost for classification<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">asarray<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">make_classification<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">xgboost <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">XGBClassifier<\/span><\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_classification<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_redundant<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># define the model<\/span><\/p>\n<p><span class=\"crayon-v\">model<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">XGBClassifier<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># fit the model on the whole dataset<\/span><\/p>\n<p><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">fit<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># make a single prediction<\/span><\/p>\n<p><span class=\"crayon-v\">row<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0.2929949<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">4.21223056<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1.288332<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">2.17849815<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.64527665<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">2.58097719<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">0.28422388<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">7.1827928<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1.91211104<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">2.73729512<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">0.81395695<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">3.96973717<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">2.66939799<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">3.34692332<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">4.19791821<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">0.99990998<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.30201875<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">4.43170633<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">2.82646737<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">0.44916808<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-v\">row<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">asarray<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">row<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">yhat<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">predict<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">row<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;Predicted Class: %d&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">yhat<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0006 seconds] --><\/p>\n<p>Running the example fits the XGBoost ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>Now that we are familiar with using XGBoost for classification, let\u2019s look at the API for regression.<\/p>\n<h3>XGBoost Ensemble for Regression<\/h3>\n<p>In this section, we will look at using XGBoost for a regression problem.<\/p>\n<p>First, we can use the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.datasets.make_regression.html\">make_regression() function<\/a> to create a synthetic regression problem with 1,000 examples and 20 input features.<\/p>\n<p>The complete example is listed below.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e4a579519916\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# test regression dataset<br \/>\nfrom sklearn.datasets import make_regression<br \/>\n# define dataset<br \/>\nX, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)<br \/>\n# summarize the dataset<br \/>\nprint(X.shape, y.shape)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># test regression dataset<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-v\">make<\/span><span class=\"crayon-sy\">_<\/span>regression<\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_regression<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">noise<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">0.1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># summarize the dataset<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0003 seconds] --><\/p>\n<p>Running the example creates the dataset and summarizes the shape of the input and output components.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>Next, we can evaluate an XGBoost algorithm on this dataset.<\/p>\n<p>As we did with the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger negative MAE are better and a perfect model has a MAE of 0.<\/p>\n<p>The complete example is listed below.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e4d743746786\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# evaluate xgboost ensemble for regression<br \/>\nfrom numpy import mean<br \/>\nfrom numpy import std<br \/>\nfrom sklearn.datasets import make_regression<br \/>\nfrom sklearn.model_selection import cross_val_score<br \/>\nfrom sklearn.model_selection import RepeatedKFold<br \/>\nfrom xgboost import XGBRegressor<br \/>\n# define dataset<br \/>\nX, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)<br \/>\n# define the model<br \/>\nmodel = XGBRegressor()<br \/>\n# evaluate the model<br \/>\ncv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)<br \/>\nn_scores = cross_val_score(model, X, y, scoring=&#8217;neg_mean_absolute_error&#8217;, cv=cv, n_jobs=-1, error_score=&#8217;raise&#8217;)<br \/>\n# report performance<br \/>\nprint(&#8216;MAE: %.3f (%.3f)&#8217; % (mean(n_scores), std(n_scores)))<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># evaluate xgboost ensemble for regression<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">mean<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">std<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">make_regression<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">cross_val_score<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">RepeatedKFold<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">xgboost <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">XGBRegressor<\/span><\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_regression<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">noise<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">0.1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># define the model<\/span><\/p>\n<p><span class=\"crayon-v\">model<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">XGBRegressor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># evaluate the model<\/span><\/p>\n<p><span class=\"crayon-v\">cv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">RepeatedKFold<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_splits<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">10<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_repeats<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">n_scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">cross_val_score<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">scoring<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;neg_mean_absolute_error&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_jobs<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">error_score<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;raise&#8217;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># report performance<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;MAE: %.3f (%.3f)&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">mean<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">std<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0007 seconds] --><\/p>\n<p>Running the example reports the mean and standard deviation accuracy of the model.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see the XGBoost ensemble with default hyperparameters achieves a MAE of about 76.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>We can also use the XGBoost model as a final model and make predictions for regression.<\/p>\n<p>First, the XGBoost ensemble is fit on all available data, then the <em>predict()<\/em> function can be called to make predictions on new data. As with classification, the single row of data must be represented as a two-dimensional matrix in NumPy array format.<\/p>\n<p>The example below demonstrates this on our regression dataset.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e50172983893\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# gradient xgboost for making predictions for regression<br \/>\nfrom numpy import asarray<br \/>\nfrom sklearn.datasets import make_regression<br \/>\nfrom xgboost import XGBRegressor<br \/>\n# define dataset<br \/>\nX, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)<br \/>\n# define the model<br \/>\nmodel = XGBRegressor()<br \/>\n# fit the model on the whole dataset<br \/>\nmodel.fit(X, y)<br \/>\n# make a single prediction<br \/>\nrow = [0.20543991,-0.97049844,-0.81403429,-0.23842689,-0.60704084,-0.48541492,0.53113006,2.01834338,-0.90745243,-1.85859731,-1.02334791,-0.6877744,0.60984819,-0.70630121,-1.29161497,1.32385441,1.42150747,1.26567231,2.56569098,-0.11154792]<br \/>\nrow = asarray([row])<br \/>\nyhat = model.predict(row)<br \/>\nprint(&#8216;Prediction: %d&#8217; % yhat[0])<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># gradient xgboost for making predictions for regression<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">asarray<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">make_regression<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">xgboost <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">XGBRegressor<\/span><\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_regression<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">noise<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">0.1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># define the model<\/span><\/p>\n<p><span class=\"crayon-v\">model<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">XGBRegressor<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># fit the model on the whole dataset<\/span><\/p>\n<p><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">fit<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># make a single prediction<\/span><\/p>\n<p><span class=\"crayon-v\">row<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0.20543991<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.97049844<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.81403429<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.23842689<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.60704084<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.48541492<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">0.53113006<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">2.01834338<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.90745243<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1.85859731<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1.02334791<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.6877744<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">0.60984819<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.70630121<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1.29161497<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">1.32385441<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">1.42150747<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">1.26567231<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">2.56569098<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">0.11154792<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-v\">row<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">asarray<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">row<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">yhat<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">predict<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">row<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;Prediction: %d&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">yhat<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0007 seconds] --><\/p>\n<p>Running the example fits the XGBoost ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>Now that we are familiar with using the XGBoost Scikit-Learn API to evaluate and use XGBoost ensembles, let\u2019s look at configuring the model.<\/p>\n<h2>XGBoost Hyperparameters<\/h2>\n<p>In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the Gradient Boosting ensemble and their effect on model performance.<\/p>\n<h3>Explore Number of Trees<\/h3>\n<p>An important hyperparameter for the XGBoost ensemble algorithm is the number of decision trees used in the ensemble.<\/p>\n<p>Recall that decision trees are added to the model sequentially in an effort to correct and improve upon the predictions made by prior trees. As such, more trees is often better.<\/p>\n<p>The number of trees can be set via the \u201c<em>n_estimators<\/em>\u201d argument and defaults to 100.<\/p>\n<p>The example below explores the effect of the number of trees with values between 10 to 5,000.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e53467834369\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# explore xgboost number of trees effect on performance<br \/>\nfrom numpy import mean<br \/>\nfrom numpy import std<br \/>\nfrom sklearn.datasets import make_classification<br \/>\nfrom sklearn.model_selection import cross_val_score<br \/>\nfrom sklearn.model_selection import RepeatedStratifiedKFold<br \/>\nfrom xgboost import XGBClassifier<br \/>\nfrom matplotlib import pyplot<\/p>\n<p># get the dataset<br \/>\ndef get_dataset():<br \/>\n\tX, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)<br \/>\n\treturn X, y<\/p>\n<p># get a list of models to evaluate<br \/>\ndef get_models():<br \/>\n\tmodels = dict()<br \/>\n\ttrees = [10, 50, 100, 500, 1000, 5000]<br \/>\n\tfor n in trees:<br \/>\n\t\tmodels[str(n)] = XGBClassifier(n_estimators=n)<br \/>\n\treturn models<\/p>\n<p># evaluate a give model using cross-validation<br \/>\ndef evaluate_model(model):<br \/>\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)<br \/>\n\tscores = cross_val_score(model, X, y, scoring=&#8217;accuracy&#8217;, cv=cv, n_jobs=-1)<br \/>\n\treturn scores<\/p>\n<p># define dataset<br \/>\nX, y = get_dataset()<br \/>\n# get the models to evaluate<br \/>\nmodels = get_models()<br \/>\n# evaluate the models and store results<br \/>\nresults, names = list(), list()<br \/>\nfor name, model in models.items():<br \/>\n\tscores = evaluate_model(model)<br \/>\n\tresults.append(scores)<br \/>\n\tnames.append(name)<br \/>\n\tprint(&#8216;&gt;%s %.3f (%.3f)&#8217; % (name, mean(scores), std(scores)))<br \/>\n# plot model performance for comparison<br \/>\npyplot.boxplot(results, labels=names, showmeans=True)<br \/>\npyplot.show()<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<p>22<\/p>\n<p>23<\/p>\n<p>24<\/p>\n<p>25<\/p>\n<p>26<\/p>\n<p>27<\/p>\n<p>28<\/p>\n<p>29<\/p>\n<p>30<\/p>\n<p>31<\/p>\n<p>32<\/p>\n<p>33<\/p>\n<p>34<\/p>\n<p>35<\/p>\n<p>36<\/p>\n<p>37<\/p>\n<p>38<\/p>\n<p>39<\/p>\n<p>40<\/p>\n<p>41<\/p>\n<p>42<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># explore xgboost number of trees effect on performance<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">mean<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">std<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">make_classification<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">cross_val_score<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">xgboost <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">XGBClassifier<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">matplotlib <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">pyplot<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># get the dataset<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">get_dataset<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_classification<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_redundant<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">y<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># get a list of models to evaluate<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">get_models<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">dict<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">trees<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">10<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">50<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">100<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">500<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">5000<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">n<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">trees<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t\t<\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-e\">str<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">XGBClassifier<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_estimators<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">n<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">models<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># evaluate a give model using cross-validation<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">evaluate_model<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_splits<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">10<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_repeats<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">cross_val_score<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">scoring<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;accuracy&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_jobs<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">scores<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">get_dataset<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># get the models to evaluate<\/span><\/p>\n<p><span class=\"crayon-v\">models<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">get_models<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># evaluate the models and store results<\/span><\/p>\n<p><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">model <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">items<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">evaluate_model<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;&gt;%s %.3f (%.3f)&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">mean<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">std<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># plot model performance for comparison<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">boxplot<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">labels<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">showmeans<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">show<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0013 seconds] --><\/p>\n<p>Running the example first reports the mean accuracy for each configured number of decision trees.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see that that performance improves on this dataset until about 500 trees, after which performance appears to level off or decrease.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e54879327977\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&gt;10 0.885 (0.029)<br \/>\n&gt;50 0.915 (0.029)<br \/>\n&gt;100 0.925 (0.028)<br \/>\n&gt;500 0.927 (0.028)<br \/>\n&gt;1000 0.926 (0.028)<br \/>\n&gt;5000 0.925 (0.027)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>&gt;10 0.885 (0.029)<\/p>\n<p>&gt;50 0.915 (0.029)<\/p>\n<p>&gt;100 0.925 (0.028)<\/p>\n<p>&gt;500 0.927 (0.028)<\/p>\n<p>&gt;1000 0.926 (0.028)<\/p>\n<p>&gt;5000 0.925 (0.027)<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.<\/p>\n<p>We can see the general trend of increasing model performance and ensemble size.<\/p>\n<div id=\"attachment_11217\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-11217\" loading=\"lazy\" class=\"size-full wp-image-11217\" src=\"https:\/\/3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com\/wp-content\/uploads\/2020\/07\/Box-Plots-of-XGBoost-Ensemble-Size-vs.-Classification-Accuracy.png\" alt=\"Box Plots of XGBoost Ensemble Size vs. Classification Accuracy\" width=\"1280\" height=\"960\"><\/p>\n<p id=\"caption-attachment-11217\" class=\"wp-caption-text\">Box Plots of XGBoost Ensemble Size vs. Classification Accuracy<\/p>\n<\/div>\n<h3>Explore Tree Depth<\/h3>\n<p>Varying the depth of each tree added to the ensemble is another important hyperparameter for gradient boosting.<\/p>\n<p>The tree depth controls how specialized each tree is to the training dataset: how general or overfit it might be. Trees are preferred that are not too shallow and general (like <a href=\"https:\/\/machinelearningmastery.com\/adaboost-ensemble-in-python\/\">AdaBoost<\/a>) and not too deep and specialized (like <a href=\"https:\/\/machinelearningmastery.com\/bagging-ensemble-with-python\/\">bootstrap aggregation<\/a>).<\/p>\n<p>Gradient boosting generally performs well with trees that have a modest depth, finding a balance between skill and generality.<\/p>\n<p>Tree depth is controlled via the \u201c<em>max_depth<\/em>\u201d argument and defaults to 6.<\/p>\n<p>The example below explores tree depths between 1 and 10 and the effect on model performance.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e56273348521\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# explore xgboost tree depth effect on performance<br \/>\nfrom numpy import mean<br \/>\nfrom numpy import std<br \/>\nfrom sklearn.datasets import make_classification<br \/>\nfrom sklearn.model_selection import cross_val_score<br \/>\nfrom sklearn.model_selection import RepeatedStratifiedKFold<br \/>\nfrom xgboost import XGBClassifier<br \/>\nfrom matplotlib import pyplot<\/p>\n<p># get the dataset<br \/>\ndef get_dataset():<br \/>\n\tX, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)<br \/>\n\treturn X, y<\/p>\n<p># get a list of models to evaluate<br \/>\ndef get_models():<br \/>\n\tmodels = dict()<br \/>\n\tfor i in range(1,11):<br \/>\n\t\tmodels[str(i)] = XGBClassifier(max_depth=i)<br \/>\n\treturn models<\/p>\n<p># evaluate a give model using cross-validation<br \/>\ndef evaluate_model(model):<br \/>\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)<br \/>\n\tscores = cross_val_score(model, X, y, scoring=&#8217;accuracy&#8217;, cv=cv, n_jobs=-1)<br \/>\n\treturn scores<\/p>\n<p># define dataset<br \/>\nX, y = get_dataset()<br \/>\n# get the models to evaluate<br \/>\nmodels = get_models()<br \/>\n# evaluate the models and store results<br \/>\nresults, names = list(), list()<br \/>\nfor name, model in models.items():<br \/>\n\tscores = evaluate_model(model)<br \/>\n\tresults.append(scores)<br \/>\n\tnames.append(name)<br \/>\n\tprint(&#8216;&gt;%s %.3f (%.3f)&#8217; % (name, mean(scores), std(scores)))<br \/>\n# plot model performance for comparison<br \/>\npyplot.boxplot(results, labels=names, showmeans=True)<br \/>\npyplot.show()<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<p>22<\/p>\n<p>23<\/p>\n<p>24<\/p>\n<p>25<\/p>\n<p>26<\/p>\n<p>27<\/p>\n<p>28<\/p>\n<p>29<\/p>\n<p>30<\/p>\n<p>31<\/p>\n<p>32<\/p>\n<p>33<\/p>\n<p>34<\/p>\n<p>35<\/p>\n<p>36<\/p>\n<p>37<\/p>\n<p>38<\/p>\n<p>39<\/p>\n<p>40<\/p>\n<p>41<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># explore xgboost tree depth effect on performance<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">mean<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">std<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">make_classification<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">cross_val_score<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">xgboost <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">XGBClassifier<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">matplotlib <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">pyplot<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># get the dataset<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">get_dataset<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_classification<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_redundant<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">y<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># get a list of models to evaluate<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">get_models<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">dict<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-cn\">11<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t\t<\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-e\">str<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">XGBClassifier<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">max_depth<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">models<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># evaluate a give model using cross-validation<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">evaluate_model<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_splits<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">10<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_repeats<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">cross_val_score<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">scoring<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;accuracy&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_jobs<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">scores<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">get_dataset<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># get the models to evaluate<\/span><\/p>\n<p><span class=\"crayon-v\">models<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">get_models<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># evaluate the models and store results<\/span><\/p>\n<p><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">model <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">items<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">evaluate_model<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;&gt;%s %.3f (%.3f)&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">mean<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">std<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># plot model performance for comparison<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">boxplot<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">labels<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">showmeans<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">show<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0012 seconds] --><\/p>\n<p>Running the example first reports the mean accuracy for each configured tree depth.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see that performance improves with tree depth, perhaps peeking around a depth of 3 to 8, after which the deeper, more specialized trees result in worse performance.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e57819376045\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&gt;1 0.849 (0.028)<br \/>\n&gt;2 0.906 (0.032)<br \/>\n&gt;3 0.926 (0.027)<br \/>\n&gt;4 0.930 (0.027)<br \/>\n&gt;5 0.924 (0.031)<br \/>\n&gt;6 0.925 (0.028)<br \/>\n&gt;7 0.926 (0.030)<br \/>\n&gt;8 0.926 (0.029)<br \/>\n&gt;9 0.921 (0.032)<br \/>\n&gt;10 0.923 (0.035)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>&gt;1 0.849 (0.028)<\/p>\n<p>&gt;2 0.906 (0.032)<\/p>\n<p>&gt;3 0.926 (0.027)<\/p>\n<p>&gt;4 0.930 (0.027)<\/p>\n<p>&gt;5 0.924 (0.031)<\/p>\n<p>&gt;6 0.925 (0.028)<\/p>\n<p>&gt;7 0.926 (0.030)<\/p>\n<p>&gt;8 0.926 (0.029)<\/p>\n<p>&gt;9 0.921 (0.032)<\/p>\n<p>&gt;10 0.923 (0.035)<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>A box and whisker plot is created for the distribution of accuracy scores for each configured tree depth.<\/p>\n<p>We can see the general trend of increasing model performance with the tree depth to a point, after which performance begins to sit flat or degrade with the over-specialized trees.<\/p>\n<div id=\"attachment_11218\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-11218\" loading=\"lazy\" class=\"size-full wp-image-11218\" src=\"https:\/\/3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com\/wp-content\/uploads\/2020\/07\/Box-Plots-of-XGBoost-Ensemble-Tree-Depth-vs.-Classification-Accuracy.png\" alt=\"Box Plots of XGBoost Ensemble Tree Depth vs. Classification Accuracy\" width=\"1280\" height=\"960\"><\/p>\n<p id=\"caption-attachment-11218\" class=\"wp-caption-text\">Box Plots of XGBoost Ensemble Tree Depth vs. Classification Accuracy<\/p>\n<\/div>\n<h3>Explore Learning Rate<\/h3>\n<p>Learning rate controls the amount of contribution that each model has on the ensemble prediction.<\/p>\n<p>Smaller rates may require more decision trees in the ensemble.<\/p>\n<p>The learning rate can be controlled via the \u201c<em>eta<\/em>\u201d argument and defaults to 0.3.<\/p>\n<p>The example below explores the learning rate and compares the effect of values between 0.0001 and 1.0.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e59115143745\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# explore xgboost learning rate effect on performance<br \/>\nfrom numpy import mean<br \/>\nfrom numpy import std<br \/>\nfrom sklearn.datasets import make_classification<br \/>\nfrom sklearn.model_selection import cross_val_score<br \/>\nfrom sklearn.model_selection import RepeatedStratifiedKFold<br \/>\nfrom xgboost import XGBClassifier<br \/>\nfrom matplotlib import pyplot<\/p>\n<p># get the dataset<br \/>\ndef get_dataset():<br \/>\n\tX, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)<br \/>\n\treturn X, y<\/p>\n<p># get a list of models to evaluate<br \/>\ndef get_models():<br \/>\n\tmodels = dict()<br \/>\n\trates = [0.0001, 0.001, 0.01, 0.1, 1.0]<br \/>\n\tfor r in rates:<br \/>\n\t\tkey = &#8216;%.4f&#8217; % r<br \/>\n\t\tmodels[key] = XGBClassifier(eta=r)<br \/>\n\treturn models<\/p>\n<p># evaluate a give model using cross-validation<br \/>\ndef evaluate_model(model):<br \/>\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)<br \/>\n\tscores = cross_val_score(model, X, y, scoring=&#8217;accuracy&#8217;, cv=cv, n_jobs=-1)<br \/>\n\treturn scores<\/p>\n<p># define dataset<br \/>\nX, y = get_dataset()<br \/>\n# get the models to evaluate<br \/>\nmodels = get_models()<br \/>\n# evaluate the models and store results<br \/>\nresults, names = list(), list()<br \/>\nfor name, model in models.items():<br \/>\n\tscores = evaluate_model(model)<br \/>\n\tresults.append(scores)<br \/>\n\tnames.append(name)<br \/>\n\tprint(&#8216;&gt;%s %.3f (%.3f)&#8217; % (name, mean(scores), std(scores)))<br \/>\n# plot model performance for comparison<br \/>\npyplot.boxplot(results, labels=names, showmeans=True)<br \/>\npyplot.show()<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<p>22<\/p>\n<p>23<\/p>\n<p>24<\/p>\n<p>25<\/p>\n<p>26<\/p>\n<p>27<\/p>\n<p>28<\/p>\n<p>29<\/p>\n<p>30<\/p>\n<p>31<\/p>\n<p>32<\/p>\n<p>33<\/p>\n<p>34<\/p>\n<p>35<\/p>\n<p>36<\/p>\n<p>37<\/p>\n<p>38<\/p>\n<p>39<\/p>\n<p>40<\/p>\n<p>41<\/p>\n<p>42<\/p>\n<p>43<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># explore xgboost learning rate effect on performance<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">mean<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">std<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">make_classification<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">cross_val_score<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">xgboost <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">XGBClassifier<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">matplotlib <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">pyplot<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># get the dataset<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">get_dataset<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_classification<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_redundant<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">y<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># get a list of models to evaluate<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">get_models<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">dict<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">rates<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0.0001<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0.001<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0.01<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0.1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1.0<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">r<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">rates<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t\t<\/span><span class=\"crayon-v\">key<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8216;%.4f&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">r<\/span><\/p>\n<p><span class=\"crayon-h\">\t\t<\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">key<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">XGBClassifier<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">eta<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">r<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">models<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># evaluate a give model using cross-validation<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">evaluate_model<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_splits<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">10<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_repeats<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">cross_val_score<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">scoring<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;accuracy&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_jobs<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">scores<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">get_dataset<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># get the models to evaluate<\/span><\/p>\n<p><span class=\"crayon-v\">models<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">get_models<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># evaluate the models and store results<\/span><\/p>\n<p><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">model <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">items<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">evaluate_model<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;&gt;%s %.3f (%.3f)&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">mean<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">std<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># plot model performance for comparison<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">boxplot<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">labels<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">showmeans<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">show<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0013 seconds] --><\/p>\n<p>Running the example first reports the mean accuracy for each configured learning rate.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see that a larger learning rate results in better performance on this dataset. We would expect that adding more trees to the ensemble for the smaller learning rates would further lift performance.<\/p>\n<p>This highlights the trade-off between the number of trees (speed of training) and learning rate, e.g. we can fit a model faster by using fewer trees and a larger learning rate.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e5a859534142\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&gt;0.0001 0.804 (0.039)<br \/>\n&gt;0.0010 0.814 (0.037)<br \/>\n&gt;0.0100 0.867 (0.027)<br \/>\n&gt;0.1000 0.923 (0.030)<br \/>\n&gt;1.0000 0.913 (0.030)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>&gt;0.0001 0.804 (0.039)<\/p>\n<p>&gt;0.0010 0.814 (0.037)<\/p>\n<p>&gt;0.0100 0.867 (0.027)<\/p>\n<p>&gt;0.1000 0.923 (0.030)<\/p>\n<p>&gt;1.0000 0.913 (0.030)<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>A box and whisker plot is created for the distribution of accuracy scores for each configured learning rate.<\/p>\n<p>We can see the general trend of increasing model performance with the increase in learning rate of 0.1, after which performance degrades.<\/p>\n<div id=\"attachment_11219\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-11219\" loading=\"lazy\" class=\"size-full wp-image-11219\" src=\"https:\/\/3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com\/wp-content\/uploads\/2020\/07\/Box-Plot-of-XGBoost-Learning-Rate-vs.-Classification-Accuracy.png\" alt=\"Box Plot of XGBoost Learning Rate vs. Classification Accuracy\" width=\"1280\" height=\"960\"><\/p>\n<p id=\"caption-attachment-11219\" class=\"wp-caption-text\">Box Plot of XGBoost Learning Rate vs. Classification Accuracy<\/p>\n<\/div>\n<h3>Explore Number of Samples<\/h3>\n<p>The number of samples used to fit each tree can be varied. This means that each tree is fit on a randomly selected subset of the training dataset.<\/p>\n<p>Using fewer samples introduces more variance for each tree, although it can improve the overall performance of the model.<\/p>\n<p>The number of samples used to fit each tree is specified by the \u201c<em>subsample<\/em>\u201d argument and can be set to a fraction of the training dataset size. By default, it is set to 1.0 to use the entire training dataset.<\/p>\n<p>The example below demonstrates the effect of the sample size on model performance with ratios varying from 10 percent to 100 percent in 10 percent increments.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e5c760699403\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# explore xgboost subsample ratio effect on performance<br \/>\nfrom numpy import arange<br \/>\nfrom numpy import mean<br \/>\nfrom numpy import std<br \/>\nfrom sklearn.datasets import make_classification<br \/>\nfrom sklearn.model_selection import cross_val_score<br \/>\nfrom sklearn.model_selection import RepeatedStratifiedKFold<br \/>\nfrom xgboost import XGBClassifier<br \/>\nfrom matplotlib import pyplot<\/p>\n<p># get the dataset<br \/>\ndef get_dataset():<br \/>\n\tX, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)<br \/>\n\treturn X, y<\/p>\n<p># get a list of models to evaluate<br \/>\ndef get_models():<br \/>\n\tmodels = dict()<br \/>\n\tfor i in arange(0.1, 1.1, 0.1):<br \/>\n\t\tkey = &#8216;%.1f&#8217; % i<br \/>\n\t\tmodels[key] = XGBClassifier(subsample=i)<br \/>\n\treturn models<\/p>\n<p># evaluate a give model using cross-validation<br \/>\ndef evaluate_model(model):<br \/>\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)<br \/>\n\tscores = cross_val_score(model, X, y, scoring=&#8217;accuracy&#8217;, cv=cv, n_jobs=-1)<br \/>\n\treturn scores<\/p>\n<p># define dataset<br \/>\nX, y = get_dataset()<br \/>\n# get the models to evaluate<br \/>\nmodels = get_models()<br \/>\n# evaluate the models and store results<br \/>\nresults, names = list(), list()<br \/>\nfor name, model in models.items():<br \/>\n\tscores = evaluate_model(model)<br \/>\n\tresults.append(scores)<br \/>\n\tnames.append(name)<br \/>\n\tprint(&#8216;&gt;%s %.3f (%.3f)&#8217; % (name, mean(scores), std(scores)))<br \/>\n# plot model performance for comparison<br \/>\npyplot.boxplot(results, labels=names, showmeans=True)<br \/>\npyplot.show()<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<p>22<\/p>\n<p>23<\/p>\n<p>24<\/p>\n<p>25<\/p>\n<p>26<\/p>\n<p>27<\/p>\n<p>28<\/p>\n<p>29<\/p>\n<p>30<\/p>\n<p>31<\/p>\n<p>32<\/p>\n<p>33<\/p>\n<p>34<\/p>\n<p>35<\/p>\n<p>36<\/p>\n<p>37<\/p>\n<p>38<\/p>\n<p>39<\/p>\n<p>40<\/p>\n<p>41<\/p>\n<p>42<\/p>\n<p>43<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># explore xgboost subsample ratio effect on performance<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">arange<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">mean<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">std<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">make_classification<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">cross_val_score<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">xgboost <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">XGBClassifier<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">matplotlib <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">pyplot<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># get the dataset<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">get_dataset<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_classification<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_redundant<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">y<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># get a list of models to evaluate<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">get_models<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">dict<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">arange<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">0.1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1.1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0.1<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t\t<\/span><span class=\"crayon-v\">key<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8216;%.1f&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">i<\/span><\/p>\n<p><span class=\"crayon-h\">\t\t<\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">key<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">XGBClassifier<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">subsample<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">models<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># evaluate a give model using cross-validation<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">evaluate_model<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_splits<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">10<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_repeats<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">cross_val_score<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">scoring<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;accuracy&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_jobs<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">scores<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">get_dataset<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># get the models to evaluate<\/span><\/p>\n<p><span class=\"crayon-v\">models<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">get_models<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># evaluate the models and store results<\/span><\/p>\n<p><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">model <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">items<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">evaluate_model<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;&gt;%s %.3f (%.3f)&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">mean<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">std<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># plot model performance for comparison<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">boxplot<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">labels<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">showmeans<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">show<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0013 seconds] --><\/p>\n<p>Running the example first reports the mean accuracy for each configured sample size.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see that mean performance is probably best for a sample size that covers most of the dataset, such as 80 percent or higher.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e5e302305506\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&gt;0.1 0.876 (0.027)<br \/>\n&gt;0.2 0.912 (0.033)<br \/>\n&gt;0.3 0.917 (0.032)<br \/>\n&gt;0.4 0.925 (0.026)<br \/>\n&gt;0.5 0.928 (0.027)<br \/>\n&gt;0.6 0.926 (0.024)<br \/>\n&gt;0.7 0.925 (0.031)<br \/>\n&gt;0.8 0.928 (0.028)<br \/>\n&gt;0.9 0.928 (0.025)<br \/>\n&gt;1.0 0.925 (0.028)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>&gt;0.1 0.876 (0.027)<\/p>\n<p>&gt;0.2 0.912 (0.033)<\/p>\n<p>&gt;0.3 0.917 (0.032)<\/p>\n<p>&gt;0.4 0.925 (0.026)<\/p>\n<p>&gt;0.5 0.928 (0.027)<\/p>\n<p>&gt;0.6 0.926 (0.024)<\/p>\n<p>&gt;0.7 0.925 (0.031)<\/p>\n<p>&gt;0.8 0.928 (0.028)<\/p>\n<p>&gt;0.9 0.928 (0.025)<\/p>\n<p>&gt;1.0 0.925 (0.028)<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>A box and whisker plot is created for the distribution of accuracy scores for each configured sampling ratio.<\/p>\n<p>We can see the general trend of increasing model performance, perhaps peaking around 80 percent and staying somewhat level.<\/p>\n<div id=\"attachment_11220\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-11220\" loading=\"lazy\" class=\"size-full wp-image-11220\" src=\"https:\/\/3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com\/wp-content\/uploads\/2020\/07\/Box-Plots-of-XGBoost-Ensemble-Sample-Ratio-vs.-Classification-Accuracy.png\" alt=\"Box Plots of XGBoost Ensemble Sample Ratio vs. Classification Accuracy\" width=\"1280\" height=\"960\"><\/p>\n<p id=\"caption-attachment-11220\" class=\"wp-caption-text\">Box Plots of XGBoost Ensemble Sample Ratio vs. Classification Accuracy<\/p>\n<\/div>\n<h3>Explore Number of Features<\/h3>\n<p>The number of features used to fit each decision tree can be varied.<\/p>\n<p>Like changing the number of samples, changing the number of features introduces additional variance into the model, which may improve performance, although it might require an increase in the number of trees.<\/p>\n<p>The number of features used by each tree is taken as a random sample and is specified by the \u201c<em>colsample_bytree<\/em>\u201d argument and defaults to all features in the training dataset, e.g. 100 percent or a value of 1.0. You can also sample columns for each split, and this is controlled by the \u201c<em>colsample_bylevel<\/em>\u201d argument, but we will not look at this hyperparameter here.<\/p>\n<p>The example below explores the effect of the number of features on model performance with ratios varying from 10 percent to 100 percent in 10 percent increments.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e60463676818\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n# explore xgboost column ratio per tree effect on performance<br \/>\nfrom numpy import arange<br \/>\nfrom numpy import mean<br \/>\nfrom numpy import std<br \/>\nfrom sklearn.datasets import make_classification<br \/>\nfrom sklearn.model_selection import cross_val_score<br \/>\nfrom sklearn.model_selection import RepeatedStratifiedKFold<br \/>\nfrom xgboost import XGBClassifier<br \/>\nfrom matplotlib import pyplot<\/p>\n<p># get the dataset<br \/>\ndef get_dataset():<br \/>\n\tX, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)<br \/>\n\treturn X, y<\/p>\n<p># get a list of models to evaluate<br \/>\ndef get_models():<br \/>\n\tmodels = dict()<br \/>\n\tfor i in arange(0.1, 1.1, 0.1):<br \/>\n\t\tkey = &#8216;%.1f&#8217; % i<br \/>\n\t\tmodels[key] = XGBClassifier(colsample_bytree=i)<br \/>\n\treturn models<\/p>\n<p># evaluate a give model using cross-validation<br \/>\ndef evaluate_model(model):<br \/>\n\tcv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)<br \/>\n\tscores = cross_val_score(model, X, y, scoring=&#8217;accuracy&#8217;, cv=cv, n_jobs=-1)<br \/>\n\treturn scores<\/p>\n<p># define dataset<br \/>\nX, y = get_dataset()<br \/>\n# get the models to evaluate<br \/>\nmodels = get_models()<br \/>\n# evaluate the models and store results<br \/>\nresults, names = list(), list()<br \/>\nfor name, model in models.items():<br \/>\n\tscores = evaluate_model(model)<br \/>\n\tresults.append(scores)<br \/>\n\tnames.append(name)<br \/>\n\tprint(&#8216;&gt;%s %.3f (%.3f)&#8217; % (name, mean(scores), std(scores)))<br \/>\n# plot model performance for comparison<br \/>\npyplot.boxplot(results, labels=names, showmeans=True)<br \/>\npyplot.show()<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<p>22<\/p>\n<p>23<\/p>\n<p>24<\/p>\n<p>25<\/p>\n<p>26<\/p>\n<p>27<\/p>\n<p>28<\/p>\n<p>29<\/p>\n<p>30<\/p>\n<p>31<\/p>\n<p>32<\/p>\n<p>33<\/p>\n<p>34<\/p>\n<p>35<\/p>\n<p>36<\/p>\n<p>37<\/p>\n<p>38<\/p>\n<p>39<\/p>\n<p>40<\/p>\n<p>41<\/p>\n<p>42<\/p>\n<p>43<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-p\"># explore xgboost column ratio per tree effect on performance<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">arange<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">mean<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">std<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">make_classification<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">cross_val_score<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">xgboost <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">XGBClassifier<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">matplotlib <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">pyplot<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># get the dataset<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">get_dataset<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">make_classification<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_features<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">20<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_informative<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_redundant<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">7<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">y<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># get a list of models to evaluate<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">get_models<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">dict<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">arange<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">0.1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1.1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0.1<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t\t<\/span><span class=\"crayon-v\">key<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8216;%.1f&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">i<\/span><\/p>\n<p><span class=\"crayon-h\">\t\t<\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">key<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">XGBClassifier<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">colsample_bytree<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">models<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># evaluate a give model using cross-validation<\/span><\/p>\n<p><span class=\"crayon-e\">def <\/span><span class=\"crayon-e\">evaluate_model<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">RepeatedStratifiedKFold<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_splits<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">10<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_repeats<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">cross_val_score<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">scoring<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;accuracy&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">cv<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_jobs<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-st\">return<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">scores<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># define dataset<\/span><\/p>\n<p><span class=\"crayon-v\">X<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">get_dataset<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># get the models to evaluate<\/span><\/p>\n<p><span class=\"crayon-v\">models<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">get_models<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># evaluate the models and store results<\/span><\/p>\n<p><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">model <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">items<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">evaluate_model<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">append<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\t<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;&gt;%s %.3f (%.3f)&#8217;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">%<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">name<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">mean<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">std<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">scores<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-p\"># plot model performance for comparison<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">boxplot<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">results<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">labels<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">showmeans<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">pyplot<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">show<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0013 seconds] --><\/p>\n<p>Running the example first reports the mean accuracy for each configured ratio of columns.<\/p>\n<p><strong>Note<\/strong>: Your <a href=\"https:\/\/machinelearningmastery.com\/different-results-each-time-in-machine-learning\/\">results may vary<\/a> given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.<\/p>\n<p>In this case, we can see that mean performance increases to about half the number of features (50 percent) and stays somewhat level after that. It\u2019s surprising that removing half of the input variables per tree has so little effect.<\/p>\n<p><!-- Urvanov Syntax Highlighter v2.8.14 --><\/p>\n<div id=\"urvanov-syntax-highlighter-5fbb015866e61152849330\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&gt;0.1 0.861 (0.033)<br \/>\n&gt;0.2 0.906 (0.027)<br \/>\n&gt;0.3 0.923 (0.029)<br \/>\n&gt;0.4 0.917 (0.029)<br \/>\n&gt;0.5 0.928 (0.030)<br \/>\n&gt;0.6 0.929 (0.031)<br \/>\n&gt;0.7 0.924 (0.027)<br \/>\n&gt;0.8 0.931 (0.025)<br \/>\n&gt;0.9 0.927 (0.033)<br \/>\n&gt;1.0 0.925 (0.028)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>&gt;0.1 0.861 (0.033)<\/p>\n<p>&gt;0.2 0.906 (0.027)<\/p>\n<p>&gt;0.3 0.923 (0.029)<\/p>\n<p>&gt;0.4 0.917 (0.029)<\/p>\n<p>&gt;0.5 0.928 (0.030)<\/p>\n<p>&gt;0.6 0.929 (0.031)<\/p>\n<p>&gt;0.7 0.924 (0.027)<\/p>\n<p>&gt;0.8 0.931 (0.025)<\/p>\n<p>&gt;0.9 0.927 (0.033)<\/p>\n<p>&gt;1.0 0.925 (0.028)<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table>\n<\/div>\n<\/div>\n<p><!-- [Format Time: 0.0000 seconds] --><\/p>\n<p>A box and whisker plot is created for the distribution of accuracy scores for each configured column ratio.<\/p>\n<p>We can see the general trend of increasing model performance perhaps peaking with a ratio of 60 percent and staying somewhat level.<\/p>\n<div id=\"attachment_11221\" class=\"wp-caption aligncenter\">\n<img decoding=\"async\" aria-describedby=\"caption-attachment-11221\" loading=\"lazy\" class=\"size-full wp-image-11221\" src=\"https:\/\/3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com\/wp-content\/uploads\/2020\/07\/Box-Plots-of-XGBoost-Ensemble-Column-Ratio-vs.-Classification-Accuracy.png\" alt=\"Box Plots of XGBoost Ensemble Column Ratio vs. Classification Accuracy\" width=\"1280\" height=\"960\"><\/p>\n<p id=\"caption-attachment-11221\" class=\"wp-caption-text\">Box Plots of XGBoost Ensemble Column Ratio vs. Classification Accuracy<\/p>\n<\/div>\n<h2>Further Reading<\/h2>\n<p>This section provides more resources on the topic if you are looking to go deeper.<\/p>\n<h3>Tutorials<\/h3>\n<h3>Papers<\/h3>\n<h3>Project<\/h3>\n<h3>APIs<\/h3>\n<h3>Articles<\/h3>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered how to develop Extreme Gradient Boosting ensembles for classification and regression.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>Extreme Gradient Boosting is an efficient open-source implementation of the stochastic gradient boosting ensemble algorithm.<\/li>\n<li>How to develop XGBoost ensembles for classification and regression with the scikit-learn API.<\/li>\n<li>How to explore the effect of XGBoost model hyperparameters on model performance.<\/li>\n<\/ul>\n<p><strong>Do you have any questions?<\/strong><br \/>Ask your questions in the comments below and I will do my best to answer.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/machinelearningmastery.com\/extreme-gradient-boosting-ensemble-in-python\/<\/p>\n","protected":false},"author":0,"featured_media":605,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/604"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=604"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/604\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/605"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=604"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=604"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=604"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}