Machine Learning Basics - Gradient Boosting & XGBoost
In a recent video, I covered Random Forests and Neural Nets as part of the codecentric.ai Bootcamp.
In the most recent video, I covered Gradient Boosting and XGBoost.
You can find the video on YouTube and the slides on slides.com. Both are again in German with code examples in Python.
But below, you'll find the English version of the content, plus code examples in R for caret, xgboost and h2o. :-)
Like Random Forest, Gradient Boosting is another technique for performing supervised machine learning tasks, like classification and regression. Implementations of this technique go by different names; most commonly, you will encounter Gradient Boosting Machines (abbreviated GBM) and XGBoost. XGBoost is particularly popular because it has been the winning algorithm in a number of recent Kaggle competitions.
Similar to Random Forests, Gradient Boosting is an ensemble learner. This means it creates a final model from a collection of individual models. The predictive power of these individual models is weak and prone to overfitting, but combining many such weak models in an ensemble leads to a much improved overall result. In Gradient Boosting Machines, the most common type of weak model is the decision tree - another parallel to Random Forests.
How Gradient Boosting works
Let’s look at how Gradient Boosting works. Most of the magic is described in the name: “Gradient” plus “Boosting”.
Boosting builds models from individual so-called "weak learners" in an iterative way. In the Random Forests part, I had already discussed the differences between Bagging and Boosting as tree ensemble methods. In Boosting, the individual models are not built on completely random subsets of data and features but sequentially, by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances which are hard to predict correctly ("difficult" cases) get more focus during learning, so that the model learns from past mistakes. When each weak learner is additionally trained on a random subset of the training data, we also call this Stochastic Gradient Boosting, which can help improve the generalizability of our model.
The gradient is used to minimize a loss function, similar to how Neural Nets utilize gradient descent to optimize ("learn") weights. In each round of training, the weak learner is built and its predictions are compared to the correct outcome that we expect. The distance between prediction and truth represents the error of our model. These errors can now be used to calculate the gradient. The gradient is nothing fancy; it is basically the partial derivative of our loss function, so it describes the steepness of our error function. The gradient can be used to find the direction in which to change the model parameters in order to (maximally) reduce the error in the next round of training by "descending the gradient".
In Neural Nets, gradient descent is used to look for the minimum of the loss function, i.e. to learn the model parameters (e.g. weights) for which the prediction error is lowest in a single model. In Gradient Boosting, we are combining the predictions of multiple models, so we are not optimizing the model parameters directly but the boosted model predictions. Therefore, the gradients are fed back into the training process: the next tree is fit to these values (the negative gradients, often called pseudo-residuals).
Because we apply gradient descent, we will find learning rate (the "step size" with which we descend the gradient), shrinkage (reduction of the learning rate) and loss function as hyperparameters in Gradient Boosting models - just as with Neural Nets. Other hyperparameters of Gradient Boosting are similar to those of Random Forests (a minimal code sketch of the boosting loop follows this list):
- the number of iterations (i.e. the number of trees to ensemble),
- the number of observations in each leaf,
- tree complexity and depth,
- the proportion of samples and
- the proportion of features to train on.
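To make this more concrete, here is a minimal sketch of such a boosting loop in R for a regression task with squared-error loss, where the negative gradient is simply the residual. It uses shallow rpart trees as weak learners and is for illustration only - nothing here is part of the packages shown below, which do all of this for you much more efficiently.
library(rpart)   # shallow decision trees as the weak learners
library(ISLR)    # College data (also used further below)
features <- subset(College, select = -Apps)
target   <- College$Apps
learning_rate <- 0.1                               # step size for descending the gradient
prediction    <- rep(mean(target), nrow(College))  # start from a constant model
for (i in 1:50) {                                  # number of boosting iterations
  pseudo_residuals <- target - prediction          # negative gradient of the squared-error loss
  weak_learner <- rpart(pseudo_residuals ~ .,
                        data = cbind(features, pseudo_residuals),
                        control = rpart.control(maxdepth = 2))
  # take a small step in the direction that reduces the error
  prediction <- prediction + learning_rate * predict(weak_learner, features)
}
# the training RMSE shrinks as trees are added to the ensemble
sqrt(mean((target - prediction)^2))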
Gradient Boosting Machines vs. XGBoost
XGBoost stands for Extreme Gradient Boosting; it is a specific implementation of the Gradient Boosting method which uses more accurate approximations to find the best tree model. It employs a number of nifty tricks that make it exceptionally successful, particularly with structured data. The most important are:
1.) computing second-order gradients, i.e. second partial derivatives of the loss function (similar to Newton’s method), which provides more information about the direction of gradients and how to get to the minimum of our loss function. While regular gradient boosting uses the loss function of our base model (e.g. decision tree) as a proxy for minimizing the error of the overall model, XGBoost uses the 2nd order derivative as an approximation.
2.) advanced regularization (L1 & L2), which improves model generalization.
XGBoost has additional advantages: training is very fast and can be parallelized / distributed across clusters.
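For illustration, here is how these regularization knobs appear in the xgboost R package used below: lambda and alpha control the L2 and L1 regularization of the leaf weights, and gamma sets the minimum loss reduction required for a further split. The values in this sketch are placeholders, not tuned recommendations.
# regularization hyperparameters as exposed by the xgboost R package;
# the values are placeholders, not tuned recommendations
xgb_params <- list(objective = "binary:logistic",
                   max_depth = 3,
                   eta       = 0.1,  # learning rate / shrinkage
                   lambda    = 1,    # L2 regularization of the leaf weights
                   alpha     = 0,    # L1 regularization of the leaf weights
                   gamma     = 0)    # minimum loss reduction required for a split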
Code in R
Here is a very quick run-through of how to train Gradient Boosting and XGBoost models in R with caret, xgboost and h2o.
Data
First, the data: I'll be using the ISLR package, which contains a number of datasets; one of them is College.
Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
library(tidyverse)
library(ISLR)
ml_data <- College
ml_data %>%
glimpse()
## Observations: 777
## Variables: 18
## $ Private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ...
## $ Apps <dbl> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, ...
## $ Accept <dbl> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 4...
## $ Enroll <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 4...
## $ Top10perc <dbl> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38...
## $ Top25perc <dbl> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64...
## $ F.Undergrad <dbl> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 7...
## $ P.Undergrad <dbl> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110,...
## $ Outstate <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 138...
## $ Room.Board <dbl> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 44...
## $ Books <dbl> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, ...
## $ Personal <dbl> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, ...
## $ PhD <dbl> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60...
## $ Terminal <dbl> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 8...
## $ S.F.Ratio <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3...
## $ perc.alumni <dbl> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21,...
## $ Expend <dbl> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487...
## $ Grad.Rate <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74...
Gradient Boosting in caret
The most flexible R package for machine learning is caret. If you go to the Available Models section in the online documentation and search for "Gradient Boosting", this is what you'll find:
Model | method Value | Type | Libraries | Tuning Parameters |
---|---|---|---|---|
eXtreme Gradient Boosting | xgbDART | Classification, Regression | xgboost, plyr | nrounds, max_depth, eta, gamma, subsample, colsample_bytree, rate_drop, skip_drop, min_child_weight |
eXtreme Gradient Boosting | xgbLinear | Classification, Regression | xgboost | nrounds, lambda, alpha, eta |
eXtreme Gradient Boosting | xgbTree | Classification, Regression | xgboost, plyr | nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample |
Gradient Boosting Machines | gbm_h2o | Classification, Regression | h2o | ntrees, max_depth, min_rows, learn_rate, col_sample_rate |
Stochastic Gradient Boosting | gbm | Classification, Regression | gbm, plyr | n.trees, interaction.depth, shrinkage, n.minobsinnode |
This table lists the different Gradient Boosting implementations you can use with caret. Here, I'll show a very simple Stochastic Gradient Boosting example:
library(caret)
# Partition into training and test data
set.seed(42)
index <- createDataPartition(ml_data$Private, p = 0.7, list = FALSE)
train_data <- ml_data[index, ]
test_data <- ml_data[-index, ]
# Train model with preprocessing & repeated cv
model_gbm <- caret::train(Private ~ .,
data = train_data,
method = "gbm",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
verboseIter = FALSE),
verbose = 0)
model_gbm
## Stochastic Gradient Boosting
##
## 545 samples
## 17 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 437, 436, 435, 436, 436, 436, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9217830 0.7940197
## 1 100 0.9327980 0.8264864
## 1 150 0.9370795 0.8389860
## 2 50 0.9334095 0.8275826
## 2 100 0.9364341 0.8373727
## 2 150 0.9333872 0.8298388
## 3 50 0.9370627 0.8373028
## 3 100 0.9376629 0.8398466
## 3 150 0.9370401 0.8395797
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
With predict(), we can use this model to make predictions on test data. Here, I'll be feeding this directly to the confusionMatrix function:
caret::confusionMatrix(
data = predict(model_gbm, test_data),
reference = test_data$Private
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 57 9
## Yes 6 160
##
## Accuracy : 0.9353
## 95% CI : (0.8956, 0.9634)
## No Information Rate : 0.7284
## P-Value [Acc > NIR] : 7.952e-16
##
## Kappa : 0.839
## Mcnemar's Test P-Value : 0.6056
##
## Sensitivity : 0.9048
## Specificity : 0.9467
## Pos Pred Value : 0.8636
## Neg Pred Value : 0.9639
## Prevalence : 0.2716
## Detection Rate : 0.2457
## Detection Prevalence : 0.2845
## Balanced Accuracy : 0.9258
##
## 'Positive' Class : No
##
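Instead of letting caret choose a default tuning grid (as above), you can also pass an explicit grid over the gbm tuning parameters from the table via the tuneGrid argument. Here is a minimal sketch; the grid values are just examples, not tuned recommendations.
# explicit tuning grid for method = "gbm" (parameter names as in the table above);
# the values are examples, not tuned recommendations
gbm_grid <- expand.grid(n.trees = c(50, 100, 150),
                        interaction.depth = c(1, 2, 3),
                        shrinkage = c(0.1, 0.01),
                        n.minobsinnode = 10)
model_gbm_tuned <- caret::train(Private ~ .,
                                data = train_data,
                                method = "gbm",
                                trControl = trainControl(method = "repeatedcv",
                                                         number = 5,
                                                         repeats = 3,
                                                         verboseIter = FALSE),
                                tuneGrid = gbm_grid,
                                verbose = 0)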
The xgboost library
We can also work directly with the xgboost package in R. It is a bit more involved but also offers more advanced possibilities.
The easiest way to work with xgboost is the xgboost() function. The four most important arguments to give are:
- data: a matrix of the training data
- label: the response variable in numeric format (for binary classification: 0 & 1)
- objective: defines what learning task should be trained, here binary classification
- nrounds: the number of boosting iterations
library(xgboost)
xgboost_model <- xgboost(data = as.matrix(train_data[, -1]),
label = as.numeric(train_data$Private)-1,
max_depth = 3,
objective = "binary:logistic",
nrounds = 10,
verbose = FALSE,
prediction = TRUE)
xgboost_model
## ##### xgb.Booster
## raw: 6.7 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, max_depth = 3, objective = "binary:logistic",
## prediction = TRUE)
## params (as set within xgb.train):
## max_depth = "3", objective = "binary:logistic", prediction = "TRUE", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.evaluation.log()
## # of features: 17
## niter: 10
## nfeatures : 17
## evaluation_log:
## iter train_error
## 1 0.064220
## 2 0.051376
## ---
## 9 0.036697
## 10 0.033028
We can again use predict(); because we get prediction probabilities here, we need to convert them into labels to compare them with the true class:
predict(xgboost_model,
as.matrix(test_data[, -1])) %>%
as.tibble() %>%
mutate(prediction = round(value),
label = as.numeric(test_data$Private)-1) %>%
count(prediction, label)
## # A tibble: 4 x 3
## prediction label n
## <dbl> <dbl> <int>
## 1 0 0 56
## 2 0 1 6
## 3 1 0 7
## 4 1 1 163
Alternatively, we can use xgb.train(), which is more flexible and allows for more advanced settings than xgboost(). Here, we first need to create a so-called DMatrix from the data. Optionally, we can define a watchlist for evaluating model performance during the training run. I am also creating a parameter set as a list object, which I am feeding to the params argument.
dtrain <- xgb.DMatrix(as.matrix(train_data[, -1]),
label = as.numeric(train_data$Private)-1)
dtest <- xgb.DMatrix(as.matrix(test_data[, -1]),
label = as.numeric(test_data$Private)-1)
params <- list(max_depth = 3,
objective = "binary:logistic",
silent = 0)
watchlist <- list(train = dtrain, eval = dtest)
bst_model <- xgb.train(params = params,
data = dtrain,
nrounds = 10,
watchlist = watchlist,
verbose = FALSE,
prediction = TRUE)
bst_model
## ##### xgb.Booster
## raw: 6.7 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = 10, watchlist = watchlist,
## verbose = FALSE, prediction = TRUE)
## params (as set within xgb.train):
## max_depth = "3", objective = "binary:logistic", silent = "0", prediction = "TRUE", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.evaluation.log()
## # of features: 17
## niter: 10
## nfeatures : 17
## evaluation_log:
## iter train_error eval_error
## 1 0.064220 0.099138
## 2 0.051376 0.077586
## ---
## 9 0.036697 0.060345
## 10 0.033028 0.056034
The model can be used just as before:
predict(bst_model,
as.matrix(test_data[, -1])) %>%
as.tibble() %>%
mutate(prediction = round(value),
label = as.numeric(test_data$Private)-1) %>%
count(prediction, label)
## # A tibble: 4 x 3
## prediction label n
## <dbl> <dbl> <int>
## 1 0 0 56
## 2 0 1 6
## 3 1 0 7
## 4 1 1 163
The third option is to use xgb.cv(), which performs cross-validation. This function does not return a model; it is rather used to find optimal hyperparameters, particularly for nrounds.
cv_model <- xgb.cv(params = params,
data = dtrain,
nrounds = 100,
watchlist = watchlist,
nfold = 5,
verbose = FALSE,
prediction = TRUE) # prediction of cv folds
Here, we can see after how many rounds we achieved the smallest test error:
cv_model$evaluation_log %>%
filter(test_error_mean == min(test_error_mean))
## iter train_error_mean train_error_std test_error_mean test_error_std
## 1 17 0.0082568 0.002338999 0.0550458 0.01160461
## 2 25 0.0018350 0.001716352 0.0550458 0.01004998
## 3 29 0.0009176 0.001123826 0.0550458 0.01421269
## 4 32 0.0009176 0.001123826 0.0550458 0.01535140
## 5 33 0.0004588 0.000917600 0.0550458 0.01535140
## 6 80 0.0000000 0.000000000 0.0550458 0.01004998
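A natural next step (sketched here; the names best_nrounds and final_model are mine and not part of the original run) is to take the smallest of these iteration numbers as nrounds and refit the final model with xgb.train():
# refit the final model with the number of rounds that gave the smallest
# cross-validated test error (sketch, following the filter above)
best_nrounds <- cv_model$evaluation_log %>%
  filter(test_error_mean == min(test_error_mean)) %>%
  pull(iter) %>%
  min()
final_model <- xgb.train(params = params,
                         data = dtrain,
                         nrounds = best_nrounds)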
H2O
H2O is another popular package for machine learning in R. We will first set up the session and create training and test data:
library(h2o)
h2o.init(nthreads = -1)
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /var/folders/5j/v30zfr7s14qfhqwqdmqmpxw80000gn/T//RtmpWCdBYk/h2o_shiringlander_started_from_r.out
## /var/folders/5j/v30zfr7s14qfhqwqdmqmpxw80000gn/T//RtmpWCdBYk/h2o_shiringlander_started_from_r.err
##
##
## Starting H2O JVM and connecting: ... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 105 milliseconds
## H2O cluster timezone: Europe/Berlin
## H2O data parsing timezone: UTC
## H2O cluster version: 3.20.0.8
## H2O cluster version age: 2 months and 8 days
## H2O cluster name: H2O_started_from_R_shiringlander_phb668
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.56 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.1 (2018-07-02)
h2o.no_progress()
data_hf <- as.h2o(ml_data)
splits <- h2o.splitFrame(data_hf,
ratios = 0.75,
seed = 1)
train <- splits[[1]]
test <- splits[[2]]
response <- "Private"
features <- setdiff(colnames(train), response)
Gradient Boosting
The Gradient Boosting implementation can be used as follows:
h2o_gbm <- h2o.gbm(x = features,
y = response,
training_frame = train,
nfolds = 3) # cross-validation
h2o_gbm
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: GBM_model_R_1543572213551_1
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 50 50 12998 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 8 21 15.74000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.00244139
## RMSE: 0.04941043
## LogLoss: 0.02582422
## Mean Per-Class Error: 0
## AUC: 1
## Gini: 1
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 160 0 0.000000 =0/160
## Yes 0 419 0.000000 =0/419
## Totals 160 419 0.000000 =0/579
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.671121 1.000000 246
## 2 max f2 0.671121 1.000000 246
## 3 max f0point5 0.671121 1.000000 246
## 4 max accuracy 0.671121 1.000000 246
## 5 max precision 0.996764 1.000000 0
## 6 max recall 0.671121 1.000000 246
## 7 max specificity 0.996764 1.000000 0
## 8 max absolute_mcc 0.671121 1.000000 246
## 9 max min_per_class_accuracy 0.671121 1.000000 246
## 10 max mean_per_class_accuracy 0.671121 1.000000 246
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.05794659
## RMSE: 0.240721
## LogLoss: 0.1971785
## Mean Per-Class Error: 0.1030131
## AUC: 0.9741125
## Gini: 0.9482249
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 132 28 0.175000 =28/160
## Yes 13 406 0.031026 =13/419
## Totals 145 434 0.070812 =41/579
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.345249 0.951934 265
## 2 max f2 0.149750 0.969939 284
## 3 max f0point5 0.971035 0.958493 184
## 4 max accuracy 0.345249 0.929188 265
## 5 max precision 0.997741 1.000000 0
## 6 max recall 0.009001 1.000000 385
## 7 max specificity 0.997741 1.000000 0
## 8 max absolute_mcc 0.345249 0.819491 265
## 9 max min_per_class_accuracy 0.893580 0.904535 223
## 10 max mean_per_class_accuracy 0.917039 0.916982 213
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## accuracy 0.9278624 0.008904516 0.921466 0.94545454
## auc 0.9762384 0.0051301476 0.96743006 0.9851994
## err 0.07213761 0.008904516 0.07853403 0.054545455
## err_count 13.666667 0.8819171 15.0 12.0
## f0point5 0.9352853 0.013447009 0.92972183 0.9608541
## f1 0.9512787 0.0065108957 0.94423795 0.96428573
## f2 0.9681131 0.0052480404 0.9592145 0.9677419
## lift_top_group 1.3879367 0.040602904 1.4580153 1.3173653
## logloss 0.20110694 0.028338892 0.23033275 0.1444385
## max_per_class_error 0.20442705 0.049009725 0.18333334 0.13207547
## mcc 0.81914276 0.016271077 0.8149471 0.8491877
## mean_per_class_accuracy 0.8877074 0.01979144 0.89306617 0.9189922
## mean_per_class_error 0.1122926 0.01979144 0.10693384 0.08100779
## mse 0.059073452 0.007476475 0.06384816 0.044414397
## precision 0.9250553 0.01813692 0.9202899 0.9585799
## r2 0.7061854 0.02871026 0.70365846 0.75712836
## recall 0.9798418 0.010080538 0.9694657 0.9700599
## rmse 0.24200912 0.015890896 0.25268194 0.21074724
## specificity 0.79557294 0.049009725 0.81666666 0.8679245
## cv_3_valid
## accuracy 0.9166667
## auc 0.9760858
## err 0.083333336
## err_count 14.0
## f0point5 0.91527987
## f1 0.9453125
## f2 0.9773829
## lift_top_group 1.3884298
## logloss 0.22854955
## max_per_class_error 0.29787233
## mcc 0.7932934
## mean_per_class_accuracy 0.85106385
## mean_per_class_error 0.14893617
## mse 0.068957806
## precision 0.8962963
## r2 0.65776944
## recall 1.0
## rmse 0.2625982
## specificity 0.70212764
We can calculate performance on test data with h2o.performance():
h2o.performance(h2o_gbm, test)
## H2OBinomialMetrics: gbm
##
## MSE: 0.03509102
## RMSE: 0.187326
## LogLoss: 0.1350709
## Mean Per-Class Error: 0.05216017
## AUC: 0.9770811
## Gini: 0.9541623
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 48 4 0.076923 =4/52
## Yes 4 142 0.027397 =4/146
## Totals 52 146 0.040404 =8/198
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.580377 0.972603 136
## 2 max f2 0.214459 0.979730 146
## 3 max f0point5 0.907699 0.979827 127
## 4 max accuracy 0.580377 0.959596 136
## 5 max precision 0.997449 1.000000 0
## 6 max recall 0.006710 1.000000 187
## 7 max specificity 0.997449 1.000000 0
## 8 max absolute_mcc 0.580377 0.895680 136
## 9 max min_per_class_accuracy 0.821398 0.952055 131
## 10 max mean_per_class_accuracy 0.821398 0.956797 131
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
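If you want to tune the GBM hyperparameters, H2O also offers a grid search via h2o.grid(). Here is a minimal sketch; the hyperparameter values are placeholders, not tuned recommendations.
# grid search over a few GBM hyperparameters with H2O (example values only)
h2o_gbm_grid <- h2o.grid(algorithm = "gbm",
                         grid_id = "h2o_gbm_grid",
                         x = features,
                         y = response,
                         training_frame = train,
                         nfolds = 3,
                         hyper_params = list(ntrees = c(50, 100),
                                             max_depth = c(3, 5),
                                             learn_rate = c(0.01, 0.1)))
# inspect the grid results, sorted by cross-validated AUC
h2o.getGrid("h2o_gbm_grid", sort_by = "auc", decreasing = TRUE)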
XGBoost
Alternatively, we can also use the XGBoost implementation of H2O:
h2o_xgb <- h2o.xgboost(x = features,
y = response,
training_frame = train,
nfolds = 3)
h2o_xgb
## Model Details:
## ==============
##
## H2OBinomialModel: xgboost
## Model ID: XGBoost_model_R_1543572213551_364
## Model Summary:
## number_of_trees
## 1 50
##
##
## H2OBinomialMetrics: xgboost
## ** Reported on training data. **
##
## MSE: 0.25
## RMSE: 0.5
## LogLoss: 0.6931472
## Mean Per-Class Error: 0.5
## AUC: 0.5
## Gini: 0
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 0 160 1.000000 =160/160
## Yes 0 419 0.000000 =0/419
## Totals 0 579 0.276339 =160/579
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.500000 0.839679 0
## 2 max f2 0.500000 0.929047 0
## 3 max f0point5 0.500000 0.765996 0
## 4 max accuracy 0.500000 0.723661 0
## 5 max precision 0.500000 0.723661 0
## 6 max recall 0.500000 1.000000 0
## 7 max specificity 0.500000 0.000000 0
## 8 max absolute_mcc 0.500000 0.000000 0
## 9 max min_per_class_accuracy 0.500000 0.000000 0
## 10 max mean_per_class_accuracy 0.500000 0.500000 0
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: xgboost
## ** Reported on cross-validation data. **
## ** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.25
## RMSE: 0.5
## LogLoss: 0.6931472
## Mean Per-Class Error: 0.5
## AUC: 0.5
## Gini: 0
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 0 160 1.000000 =160/160
## Yes 0 419 0.000000 =0/419
## Totals 0 579 0.276339 =160/579
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.500000 0.839679 0
## 2 max f2 0.500000 0.929047 0
## 3 max f0point5 0.500000 0.765996 0
## 4 max accuracy 0.500000 0.723661 0
## 5 max precision 0.500000 0.723661 0
## 6 max recall 0.500000 1.000000 0
## 7 max specificity 0.500000 0.000000 0
## 8 max absolute_mcc 0.500000 0.000000 0
## 9 max min_per_class_accuracy 0.500000 0.000000 0
## 10 max mean_per_class_accuracy 0.500000 0.500000 0
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## accuracy 0.7260711 0.01583762 0.73595506 0.6950673
## auc 0.5 0.0 0.5 0.5
## err 0.27392888 0.01583762 0.26404494 0.30493274
## err_count 53.333332 7.3560257 47.0 68.0
## f0point5 0.7680598 0.014220628 0.77698696 0.7402101
## f1 0.8411026 0.010714028 0.84789646 0.8201058
## f2 0.92966795 0.005267985 0.9330484 0.9193357
## lift_top_group 1.0 0.0 1.0 1.0
## logloss 0.6931472 0.0 0.6931472 0.6931472
## max_per_class_error 1.0 0.0 1.0 1.0
## mcc 0.0 NaN NaN NaN
## mean_per_class_accuracy 0.5 0.0 0.5 0.5
## mean_per_class_error 0.5 0.0 0.5 0.5
## mse 0.25 0.0 0.25 0.25
## precision 0.7260711 0.01583762 0.73595506 0.6950673
## r2 -0.26316962 0.043160092 -0.28650317 -0.17953037
## recall 1.0 0.0 1.0 1.0
## rmse 0.5 0.0 0.5 0.5
## specificity 0.0 0.0 0.0 0.0
## cv_3_valid
## accuracy 0.747191
## auc 0.5
## err 0.252809
## err_count 45.0
## f0point5 0.78698224
## f1 0.8553055
## f2 0.9366197
## lift_top_group 1.0
## logloss 0.6931472
## max_per_class_error 1.0
## mcc NaN
## mean_per_class_accuracy 0.5
## mean_per_class_error 0.5
## mse 0.25
## precision 0.747191
## r2 -0.32347536
## recall 1.0
## rmse 0.5
## specificity 0.0
And use it just as before:
h2o.performance(h2o_xgb, test)
## H2OBinomialMetrics: xgboost
##
## MSE: 0.25
## RMSE: 0.5
## LogLoss: 0.6931472
## Mean Per-Class Error: 0.5
## AUC: 0.5
## Gini: 0
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 0 52 1.000000 =52/52
## Yes 0 146 0.000000 =0/146
## Totals 0 198 0.262626 =52/198
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.500000 0.848837 0
## 2 max f2 0.500000 0.933504 0
## 3 max f0point5 0.500000 0.778252 0
## 4 max accuracy 0.500000 0.737374 0
## 5 max precision 0.500000 0.737374 0
## 6 max recall 0.500000 1.000000 0
## 7 max specificity 0.500000 0.000000 0
## 8 max absolute_mcc 0.500000 0.000000 0
## 9 max min_per_class_accuracy 0.500000 0.000000 0
## 10 max mean_per_class_accuracy 0.500000 0.500000 0
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
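If you need the actual class predictions (and class probabilities) on new data rather than performance metrics, h2o.predict() returns them for any trained H2O model; a quick sketch:
# class predictions and probabilities on the test set
pred <- h2o.predict(h2o_xgb, test)
head(as.data.frame(pred))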
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] h2o_3.20.0.8 bindrcpp_0.2.2 xgboost_0.71.2 caret_6.0-80
## [5] lattice_0.20-38 ISLR_1.2 forcats_0.3.0 stringr_1.3.1
## [9] dplyr_0.7.7 purrr_0.2.5 readr_1.1.1 tidyr_0.8.2
## [13] tibble_1.4.2 ggplot2_3.1.0 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-137 bitops_1.0-6 lubridate_1.7.4
## [4] dimRed_0.1.0 httr_1.3.1 rprojroot_1.3-2
## [7] tools_3.5.1 backports_1.1.2 utf8_1.1.4
## [10] R6_2.3.0 rpart_4.1-13 lazyeval_0.2.1
## [13] colorspace_1.3-2 nnet_7.3-12 withr_2.1.2
## [16] gbm_2.1.4 gridExtra_2.3 tidyselect_0.2.5
## [19] compiler_3.5.1 cli_1.0.1 rvest_0.3.2
## [22] xml2_1.2.0 bookdown_0.7 scales_1.0.0
## [25] sfsmisc_1.1-2 DEoptimR_1.0-8 robustbase_0.93-3
## [28] digest_0.6.18 rmarkdown_1.10 pkgconfig_2.0.2
## [31] htmltools_0.3.6 rlang_0.3.0.1 readxl_1.1.0
## [34] ddalpha_1.3.4 rstudioapi_0.8 bindr_0.1.1
## [37] jsonlite_1.5 ModelMetrics_1.2.2 RCurl_1.95-4.11
## [40] magrittr_1.5 Matrix_1.2-15 fansi_0.4.0
## [43] Rcpp_0.12.19 munsell_0.5.0 abind_1.4-5
## [46] stringi_1.2.4 yaml_2.2.0 MASS_7.3-51.1
## [49] plyr_1.8.4 recipes_0.1.3 grid_3.5.1
## [52] pls_2.7-0 crayon_1.3.4 haven_1.1.2
## [55] splines_3.5.1 hms_0.4.2 knitr_1.20
## [58] pillar_1.3.0 reshape2_1.4.3 codetools_0.2-15
## [61] stats4_3.5.1 CVST_0.2-2 magic_1.5-9
## [64] glue_1.3.0 evaluate_0.12 blogdown_0.9
## [67] data.table_1.11.8 modelr_0.1.2 foreach_1.4.4
## [70] cellranger_1.1.0 gtable_0.2.0 kernlab_0.9-27
## [73] assertthat_0.2.0 DRR_0.0.3 xfun_0.4
## [76] gower_0.1.2 prodlim_2018.04.18 broom_0.5.0
## [79] e1071_1.7-0 class_7.3-14 survival_2.43-1
## [82] geometry_0.3-6 timeDate_3043.102 RcppRoll_0.3.0
## [85] iterators_1.0.10 lava_1.6.3 ipred_0.9-8