Machine Learning Basics - Gradient Boosting & XGBoost
In a recent video, I covered Random Forests and Neural Nets as part of the codecentric.ai Bootcamp.
In the most recent video, I covered Gradient Boosting and XGBoost.
You can find the video on YouTube and the slides on slides.com. Both are again in German with code examples in Python.
But below, you'll find the English version of the content, plus code examples in R for caret, xgboost and h2o. :-)
Like Random Forest, Gradient Boosting is another technique for performing supervised machine learning tasks, like classification and regression. Implementations of this technique go by different names; most commonly, you will encounter Gradient Boosting Machines (abbreviated GBM) and XGBoost. XGBoost is particularly popular because it has been the winning algorithm in a number of recent Kaggle competitions.
Similar to Random Forests, Gradient Boosting is an ensemble learner. This means it creates a final model from a collection of individual models. The predictive power of these individual models is weak and prone to overfitting, but combining many such weak models in an ensemble leads to a much improved overall result. In Gradient Boosting Machines, the most common type of weak model is the decision tree - another parallel to Random Forests.
How Gradient Boosting works
Let’s look at how Gradient Boosting works. Most of the magic is described in the name: “Gradient” plus “Boosting”.
Boosting builds models from individual so-called "weak learners" in an iterative way. In the Random Forests part, I had already discussed the differences between Bagging and Boosting as tree ensemble methods. In Boosting, the individual models are not built on completely random subsets of data and features but sequentially, by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances which are hard to predict correctly ("difficult" cases) get more focus during learning, so that the model learns from past mistakes. When each weak learner is additionally trained on a random subset of the training data, we also call this Stochastic Gradient Boosting, which can help improve the generalizability of our model.
The gradient is used to minimize a loss function, similar to how Neural Nets utilize gradient descent to optimize ("learn") weights. In each round of training, the weak learner is built and its predictions are compared to the correct outcome that we expect. The distance between prediction and truth represents the error of our model. These errors can now be used to calculate the gradient. The gradient is nothing fancy; it is basically the partial derivative of our loss function, so it describes the steepness of our error function. The gradient can be used to find the direction in which to change the model parameters in order to (maximally) reduce the error in the next round of training by "descending the gradient".
In Neural Nets, gradient descent is used to look for the minimum of the loss function, i.e. to learn the model parameters (e.g. weights) for which the prediction error is lowest in a single model. In Gradient Boosting, we are combining the predictions of multiple models, so we are not optimizing the model parameters directly but the boosted model predictions. Therefore, the gradients are fed back into the training process: the next tree is fit to these values (the negative gradients, often called pseudo-residuals).
Because we apply gradient descent, we will find learning rate (the "step size" with which we descend the gradient), shrinkage (reduction of the learning rate) and loss function as hyperparameters in Gradient Boosting models - just as with Neural Nets. Other hyperparameters of Gradient Boosting are similar to those of Random Forests (a minimal code sketch of the boosting loop follows this list):
- the number of iterations (i.e. the number of trees to ensemble),
- the number of observations in each leaf,
- tree complexity and depth,
- the proportion of samples and
- the proportion of features to train on.
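To make this more concrete, here is a minimal sketch of such a boosting loop in R for a regression task with squared-error loss, where the negative gradient is simply the residual. It uses shallow rpart trees as weak learners and is for illustration only - nothing here is part of the packages shown below, which do all of this for you much more efficiently.
library(rpart)   # shallow decision trees as the weak learners
library(ISLR)    # College data (also used further below)
features <- subset(College, select = -Apps)
target   <- College$Apps
learning_rate <- 0.1                               # step size for descending the gradient
prediction    <- rep(mean(target), nrow(College))  # start from a constant model
for (i in 1:50) {                                  # number of boosting iterations
  pseudo_residuals <- target - prediction          # negative gradient of the squared-error loss
  weak_learner <- rpart(pseudo_residuals ~ .,
                        data = cbind(features, pseudo_residuals),
                        control = rpart.control(maxdepth = 2))
  # take a small step in the direction that reduces the error
  prediction <- prediction + learning_rate * predict(weak_learner, features)
}
# the training RMSE shrinks as trees are added to the ensemble
sqrt(mean((target - prediction)^2))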
Gradient Boosting Machines vs. XGBoost
XGBoost stands for Extreme Gradient Boosting; it is a specific implementation of the Gradient Boosting method which uses more accurate approximations to find the best tree model. It employs a number of nifty tricks that make it exceptionally successful, particularly with structured data. The most important are:
1.) computing second-order gradients, i.e. second partial derivatives of the loss function (similar to Newton’s method), which provides more information about the direction of gradients and how to get to the minimum of our loss function. While regular gradient boosting uses the loss function of our base model (e.g. decision tree) as a proxy for minimizing the error of the overall model, XGBoost uses the 2nd order derivative as an approximation.
2.) advanced regularization (L1 & L2), which improves model generalization.
XGBoost has additional advantages: training is very fast and can be parallelized / distributed across clusters.
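For illustration, here is how these regularization knobs appear in the xgboost R package used below: lambda and alpha control the L2 and L1 regularization of the leaf weights, and gamma sets the minimum loss reduction required for a further split. The values in this sketch are placeholders, not tuned recommendations.
# regularization hyperparameters as exposed by the xgboost R package;
# the values are placeholders, not tuned recommendations
xgb_params <- list(objective = "binary:logistic",
                   max_depth = 3,
                   eta       = 0.1,  # learning rate / shrinkage
                   lambda    = 1,    # L2 regularization of the leaf weights
                   alpha     = 0,    # L1 regularization of the leaf weights
                   gamma     = 0)    # minimum loss reduction required for a split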
Code in R
Here is a very quick run-through of how to train Gradient Boosting and XGBoost models in R with caret, xgboost and h2o.
Data
First, the data: I'll be using the ISLR package, which contains a number of datasets; one of them is College.
Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
library(tidyverse)
library(ISLR)
ml_data <- College
ml_data %>%
glimpse()
## Observations: 777
## Variables: 18
## $ Private <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ...
## $ Apps <dbl> 1660, 2186, 1428, 417, 193, 587, 353, 1899, 1038, ...
## $ Accept <dbl> 1232, 1924, 1097, 349, 146, 479, 340, 1720, 839, 4...
## $ Enroll <dbl> 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 4...
## $ Top10perc <dbl> 23, 16, 22, 60, 16, 38, 17, 37, 30, 21, 37, 44, 38...
## $ Top25perc <dbl> 52, 29, 50, 89, 44, 62, 45, 68, 63, 44, 75, 77, 64...
## $ F.Undergrad <dbl> 2885, 2683, 1036, 510, 249, 678, 416, 1594, 973, 7...
## $ P.Undergrad <dbl> 537, 1227, 99, 63, 869, 41, 230, 32, 306, 78, 110,...
## $ Outstate <dbl> 7440, 12280, 11250, 12960, 7560, 13500, 13290, 138...
## $ Room.Board <dbl> 3300, 6450, 3750, 5450, 4120, 3335, 5720, 4826, 44...
## $ Books <dbl> 450, 750, 400, 450, 800, 500, 500, 450, 300, 660, ...
## $ Personal <dbl> 2200, 1500, 1165, 875, 1500, 675, 1500, 850, 500, ...
## $ PhD <dbl> 70, 29, 53, 92, 76, 67, 90, 89, 79, 40, 82, 73, 60...
## $ Terminal <dbl> 78, 30, 66, 97, 72, 73, 93, 100, 84, 41, 88, 91, 8...
## $ S.F.Ratio <dbl> 18.1, 12.2, 12.9, 7.7, 11.9, 9.4, 11.5, 13.7, 11.3...
## $ perc.alumni <dbl> 12, 16, 30, 37, 2, 11, 26, 37, 23, 15, 31, 41, 21,...
## $ Expend <dbl> 7041, 10527, 8735, 19016, 10922, 9727, 8861, 11487...
## $ Grad.Rate <dbl> 60, 56, 54, 59, 15, 55, 63, 73, 80, 52, 73, 76, 74...
Gradient Boosting in caret
The most flexible R package for machine learning is caret. If you go to the Available Models section in the online documentation and search for "Gradient Boosting", this is what you'll find:
Model | method Value | Type | Libraries | Tuning Parameters |
---|---|---|---|---|
eXtreme Gradient Boosting | xgbDART | Classification, Regression | xgboost, plyr | nrounds, max_depth, eta, gamma, subsample, colsample_bytree, rate_drop, skip_drop, min_child_weight |
eXtreme Gradient Boosting | xgbLinear | Classification, Regression | xgboost | nrounds, lambda, alpha, eta |
eXtreme Gradient Boosting | xgbTree | Classification, Regression | xgboost, plyr | nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample |
Gradient Boosting Machines | gbm_h2o | Classification, Regression | h2o | ntrees, max_depth, min_rows, learn_rate, col_sample_rate |
Stochastic Gradient Boosting | gbm | Classification, Regression | gbm, plyr | n.trees, interaction.depth, shrinkage, n.minobsinnode |
This table lists the different Gradient Boosting implementations you can use with caret. Here, I'll show a very simple Stochastic Gradient Boosting example:
library(caret)
# Partition into training and test data
set.seed(42)
index <- createDataPartition(ml_data$Private, p = 0.7, list = FALSE)
train_data <- ml_data[index, ]
test_data <- ml_data[-index, ]
# Train model with preprocessing & repeated cv
model_gbm <- caret::train(Private ~ .,
data = train_data,
method = "gbm",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
verboseIter = FALSE),
verbose = 0)
model_gbm
## Stochastic Gradient Boosting
##
## 545 samples
## 17 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 437, 436, 435, 436, 436, 436, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9217830 0.7940197
## 1 100 0.9327980 0.8264864
## 1 150 0.9370795 0.8389860
## 2 50 0.9334095 0.8275826
## 2 100 0.9364341 0.8373727
## 2 150 0.9333872 0.8298388
## 3 50 0.9370627 0.8373028
## 3 100 0.9376629 0.8398466
## 3 150 0.9370401 0.8395797
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
With predict(), we can use this model to make predictions on test data. Here, I'll be feeding this directly to the confusionMatrix function:
caret::confusionMatrix(
data = predict(model_gbm, test_data),
reference = test_data$Private
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 57 9
## Yes 6 160
##
## Accuracy : 0.9353
## 95% CI : (0.8956, 0.9634)
## No Information Rate : 0.7284
## P-Value [Acc > NIR] : 7.952e-16
##
## Kappa : 0.839
## Mcnemar's Test P-Value : 0.6056
##
## Sensitivity : 0.9048
## Specificity : 0.9467
## Pos Pred Value : 0.8636
## Neg Pred Value : 0.9639
## Prevalence : 0.2716
## Detection Rate : 0.2457
## Detection Prevalence : 0.2845
## Balanced Accuracy : 0.9258
##
## 'Positive' Class : No
##
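Instead of letting caret choose a default tuning grid (as above), you can also pass an explicit grid over the gbm tuning parameters from the table via the tuneGrid argument. Here is a minimal sketch; the grid values are just examples, not tuned recommendations.
# explicit tuning grid for method = "gbm" (parameter names as in the table above);
# the values are examples, not tuned recommendations
gbm_grid <- expand.grid(n.trees = c(50, 100, 150),
                        interaction.depth = c(1, 2, 3),
                        shrinkage = c(0.1, 0.01),
                        n.minobsinnode = 10)
model_gbm_tuned <- caret::train(Private ~ .,
                                data = train_data,
                                method = "gbm",
                                trControl = trainControl(method = "repeatedcv",
                                                         number = 5,
                                                         repeats = 3,
                                                         verboseIter = FALSE),
                                tuneGrid = gbm_grid,
                                verbose = 0)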
The xgboost library
We can also work directly with the xgboost package in R. It is a bit more involved but also offers more advanced possibilities.
The easiest way to work with xgboost is the xgboost() function. The four most important arguments to give are:
- data: a matrix of the training data
- label: the response variable in numeric format (for binary classification: 0 & 1)
- objective: defines what learning task should be trained, here binary classification
- nrounds: the number of boosting iterations
library(xgboost)
xgboost_model <- xgboost(data = as.matrix(train_data[, -1]),
label = as.numeric(train_data$Private)-1,
max_depth = 3,
objective = "binary:logistic",
nrounds = 10,
verbose = FALSE,
prediction = TRUE)
xgboost_model
## ##### xgb.Booster
## raw: 6.7 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, max_depth = 3, objective = "binary:logistic",
## prediction = TRUE)
## params (as set within xgb.train):
## max_depth = "3", objective = "binary:logistic", prediction = "TRUE", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.evaluation.log()
## # of features: 17
## niter: 10
## nfeatures : 17
## evaluation_log:
## iter train_error
## 1 0.064220
## 2 0.051376
## ---
## 9 0.036697
## 10 0.033028
We can again use predict(); because we get prediction probabilities here, we need to convert them into labels to compare them with the true class:
predict(xgboost_model,
as.matrix(test_data[, -1])) %>%
as.tibble() %>%
mutate(prediction = round(value),
label = as.numeric(test_data$Private)-1) %>%
count(prediction, label)
## # A tibble: 4 x 3
## prediction label n
## <dbl> <dbl> <int>
## 1 0 0 56
## 2 0 1 6
## 3 1 0 7
## 4 1 1 163
Alternatively, we can use xgb.train(), which is more flexible and allows for more advanced settings than xgboost(). Here, we first need to create a so-called DMatrix from the data. Optionally, we can define a watchlist for evaluating model performance during the training run. I am also creating a parameter set as a list object, which I am feeding to the params argument.
dtrain <- xgb.DMatrix(as.matrix(train_data[, -1]),
label = as.numeric(train_data$Private)-1)
dtest <- xgb.DMatrix(as.matrix(test_data[, -1]),
label = as.numeric(test_data$Private)-1)
params <- list(max_depth = 3,
objective = "binary:logistic",
silent = 0)
watchlist <- list(train = dtrain, eval = dtest)
bst_model <- xgb.train(params = params,
data = dtrain,
nrounds = 10,
watchlist = watchlist,
verbose = FALSE,
prediction = TRUE)
bst_model
## ##### xgb.Booster
## raw: 6.7 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = 10, watchlist = watchlist,
## verbose = FALSE, prediction = TRUE)
## params (as set within xgb.train):
## max_depth = "3", objective = "binary:logistic", silent = "0", prediction = "TRUE", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.evaluation.log()
## # of features: 17
## niter: 10
## nfeatures : 17
## evaluation_log:
## iter train_error eval_error
## 1 0.064220 0.099138
## 2 0.051376 0.077586
## ---
## 9 0.036697 0.060345
## 10 0.033028 0.056034
The model can be used just as before:
predict(bst_model,
as.matrix(test_data[, -1])) %>%
as.tibble() %>%
mutate(prediction = round(value),
label = as.numeric(test_data$Private)-1) %>%
count(prediction, label)
## # A tibble: 4 x 3
## prediction label n
## <dbl> <dbl> <int>
## 1 0 0 56
## 2 0 1 6
## 3 1 0 7
## 4 1 1 163
The third option is to use xgb.cv(), which performs cross-validation. This function does not return a model; it is rather used to find optimal hyperparameters, particularly for nrounds.
cv_model <- xgb.cv(params = params,
data = dtrain,
nrounds = 100,
watchlist = watchlist,
nfold = 5,
verbose = FALSE,
prediction = TRUE) # prediction of cv folds
Here, we can see after how many rounds we achieved the smallest test error:
cv_model$evaluation_log %>%
filter(test_error_mean == min(test_error_mean))
## iter train_error_mean train_error_std test_error_mean test_error_std
## 1 17 0.0082568 0.002338999 0.0550458 0.01160461
## 2 25 0.0018350 0.001716352 0.0550458 0.01004998
## 3 29 0.0009176 0.001123826 0.0550458 0.01421269
## 4 32 0.0009176 0.001123826 0.0550458 0.01535140
## 5 33 0.0004588 0.000917600 0.0550458 0.01535140
## 6 80 0.0000000 0.000000000 0.0550458 0.01004998
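A natural next step (sketched here; the names best_nrounds and final_model are mine and not part of the original run) is to take the smallest of these iteration numbers as nrounds and refit the final model with xgb.train():
# refit the final model with the number of rounds that gave the smallest
# cross-validated test error (sketch, following the filter above)
best_nrounds <- cv_model$evaluation_log %>%
  filter(test_error_mean == min(test_error_mean)) %>%
  pull(iter) %>%
  min()
final_model <- xgb.train(params = params,
                         data = dtrain,
                         nrounds = best_nrounds)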
H2O
H2O is another popular package for machine learning in R. We will first set up the session and create training and test data:
library(h2o)
h2o.init(nthreads = -1)
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /var/folders/5j/v30zfr7s14qfhqwqdmqmpxw80000gn/T//RtmpWCdBYk/h2o_shiringlander_started_from_r.out
## /var/folders/5j/v30zfr7s14qfhqwqdmqmpxw80000gn/T//RtmpWCdBYk/h2o_shiringlander_started_from_r.err
##
##
## Starting H2O JVM and connecting: ... Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 105 milliseconds
## H2O cluster timezone: Europe/Berlin
## H2O data parsing timezone: UTC
## H2O cluster version: 3.20.0.8
## H2O cluster version age: 2 months and 8 days
## H2O cluster name: H2O_started_from_R_shiringlander_phb668
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.56 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.1 (2018-07-02)
h2o.no_progress()
data_hf <- as.h2o(ml_data)
splits <- h2o.splitFrame(data_hf,
ratios = 0.75,
seed = 1)
train <- splits[[1]]
test <- splits[[2]]
response <- "Private"
features <- setdiff(colnames(train), response)
Gradient Boosting
The Gradient Boosting implementation can be used as follows:
h2o_gbm <- h2o.gbm(x = features,
y = response,
training_frame = train,
nfolds = 3) # cross-validation
h2o_gbm
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: GBM_model_R_1543572213551_1
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 50 50 12998 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 8 21 15.74000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.00244139
## RMSE: 0.04941043
## LogLoss: 0.02582422
## Mean Per-Class Error: 0
## AUC: 1
## Gini: 1
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 160 0 0.000000 =0/160
## Yes 0 419 0.000000 =0/419
## Totals 160 419 0.000000 =0/579
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.671121 1.000000 246
## 2 max f2 0.671121 1.000000 246
## 3 max f0point5 0.671121 1.000000 246
## 4 max accuracy 0.671121 1.000000 246
## 5 max precision 0.996764 1.000000 0
## 6 max recall 0.671121 1.000000 246
## 7 max specificity 0.996764 1.000000 0
## 8 max absolute_mcc 0.671121 1.000000 246
## 9 max min_per_class_accuracy 0.671121 1.000000 246
## 10 max mean_per_class_accuracy 0.671121 1.000000 246
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.05794659
## RMSE: 0.240721
## LogLoss: 0.1971785
## Mean Per-Class Error: 0.1030131
## AUC: 0.9741125
## Gini: 0.9482249
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 132 28 0.175000 =28/160
## Yes 13 406 0.031026 =13/419
## Totals 145 434 0.070812 =41/579
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.345249 0.951934 265
## 2 max f2 0.149750 0.969939 284
## 3 max f0point5 0.971035 0.958493 184
## 4 max accuracy 0.345249 0.929188 265
## 5 max precision 0.997741 1.000000 0
## 6 max recall 0.009001 1.000000 385
## 7 max specificity 0.997741 1.000000 0
## 8 max absolute_mcc 0.345249 0.819491 265
## 9 max min_per_class_accuracy 0.893580 0.904535 223
## 10 max mean_per_class_accuracy 0.917039 0.916982 213
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## accuracy 0.9278624 0.008904516 0.921466 0.94545454
## auc 0.9762384 0.0051301476 0.96743006 0.9851994
## err 0.07213761 0.008904516 0.07853403 0.054545455
## err_count 13.666667 0.8819171 15.0 12.0
## f0point5 0.9352853 0.013447009 0.92972183 0.9608541
## f1 0.9512787 0.0065108957 0.94423795 0.96428573
## f2 0.9681131 0.0052480404 0.9592145 0.9677419
## lift_top_group 1.3879367 0.040602904 1.4580153 1.3173653
## logloss 0.20110694 0.028338892 0.23033275 0.1444385
## max_per_class_error 0.20442705 0.049009725 0.18333334 0.13207547
## mcc 0.81914276 0.016271077 0.8149471 0.8491877
## mean_per_class_accuracy 0.8877074 0.01979144 0.89306617 0.9189922
## mean_per_class_error 0.1122926 0.01979144 0.10693384 0.08100779
## mse 0.059073452 0.007476475 0.06384816 0.044414397
## precision 0.9250553 0.01813692 0.9202899 0.9585799
## r2 0.7061854 0.02871026 0.70365846 0.75712836
## recall 0.9798418 0.010080538 0.9694657 0.9700599
## rmse 0.24200912 0.015890896 0.25268194 0.21074724
## specificity 0.79557294 0.049009725 0.81666666 0.8679245
## cv_3_valid
## accuracy 0.9166667
## auc 0.9760858
## err 0.083333336
## err_count 14.0
## f0point5 0.91527987
## f1 0.9453125
## f2 0.9773829
## lift_top_group 1.3884298
## logloss 0.22854955
## max_per_class_error 0.29787233
## mcc 0.7932934
## mean_per_class_accuracy 0.85106385
## mean_per_class_error 0.14893617
## mse 0.068957806
## precision 0.8962963
## r2 0.65776944
## recall 1.0
## rmse 0.2625982
## specificity 0.70212764
We can calculate performance on test data with h2o.performance():
h2o.performance(h2o_gbm, test)
## H2OBinomialMetrics: gbm
##
## MSE: 0.03509102
## RMSE: 0.187326
## LogLoss: 0.1350709
## Mean Per-Class Error: 0.05216017
## AUC: 0.9770811
## Gini: 0.9541623
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 48 4 0.076923 =4/52
## Yes 4 142 0.027397 =4/146
## Totals 52 146 0.040404 =8/198
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.580377 0.972603 136
## 2 max f2 0.214459 0.979730 146
## 3 max f0point5 0.907699 0.979827 127
## 4 max accuracy 0.580377 0.959596 136
## 5 max precision 0.997449 1.000000 0
## 6 max recall 0.006710 1.000000 187
## 7 max specificity 0.997449 1.000000 0
## 8 max absolute_mcc 0.580377 0.895680 136
## 9 max min_per_class_accuracy 0.821398 0.952055 131
## 10 max mean_per_class_accuracy 0.821398 0.956797 131
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
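If you want to tune the GBM hyperparameters, H2O also offers a grid search via h2o.grid(). Here is a minimal sketch; the hyperparameter values are placeholders, not tuned recommendations.
# grid search over a few GBM hyperparameters with H2O (example values only)
h2o_gbm_grid <- h2o.grid(algorithm = "gbm",
                         grid_id = "h2o_gbm_grid",
                         x = features,
                         y = response,
                         training_frame = train,
                         nfolds = 3,
                         hyper_params = list(ntrees = c(50, 100),
                                             max_depth = c(3, 5),
                                             learn_rate = c(0.01, 0.1)))
# inspect the grid results, sorted by cross-validated AUC
h2o.getGrid("h2o_gbm_grid", sort_by = "auc", decreasing = TRUE)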
XGBoost
Alternatively, we can also use the XGBoost implementation of H2O:
h2o_xgb <- h2o.xgboost(x = features,
y = response,
training_frame = train,
nfolds = 3)
h2o_xgb
## Model Details:
## ==============
##
## H2OBinomialModel: xgboost
## Model ID: XGBoost_model_R_1543572213551_364
## Model Summary:
## number_of_trees
## 1 50
##
##
## H2OBinomialMetrics: xgboost
## ** Reported on training data. **
##
## MSE: 0.25
## RMSE: 0.5
## LogLoss: 0.6931472
## Mean Per-Class Error: 0.5
## AUC: 0.5
## Gini: 0
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 0 160 1.000000 =160/160
## Yes 0 419 0.000000 =0/419
## Totals 0 579 0.276339 =160/579
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.500000 0.839679 0
## 2 max f2 0.500000 0.929047 0
## 3 max f0point5 0.500000 0.765996 0
## 4 max accuracy 0.500000 0.723661 0
## 5 max precision 0.500000 0.723661 0
## 6 max recall 0.500000 1.000000 0
## 7 max specificity 0.500000 0.000000 0
## 8 max absolute_mcc 0.500000 0.000000 0
## 9 max min_per_class_accuracy 0.500000 0.000000 0
## 10 max mean_per_class_accuracy 0.500000 0.500000 0
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: xgboost
## ** Reported on cross-validation data. **
## ** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.25
## RMSE: 0.5
## LogLoss: 0.6931472
## Mean Per-Class Error: 0.5
## AUC: 0.5
## Gini: 0
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 0 160 1.000000 =160/160
## Yes 0 419 0.000000 =0/419
## Totals 0 579 0.276339 =160/579
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.500000 0.839679 0
## 2 max f2 0.500000 0.929047 0
## 3 max f0point5 0.500000 0.765996 0
## 4 max accuracy 0.500000 0.723661 0
## 5 max precision 0.500000 0.723661 0
## 6 max recall 0.500000 1.000000 0
## 7 max specificity 0.500000 0.000000 0
## 8 max absolute_mcc 0.500000 0.000000 0
## 9 max min_per_class_accuracy 0.500000 0.000000 0
## 10 max mean_per_class_accuracy 0.500000 0.500000 0
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## accuracy 0.7260711 0.01583762 0.73595506 0.6950673
## auc 0.5 0.0 0.5 0.5
## err 0.27392888 0.01583762 0.26404494 0.30493274
## err_count 53.333332 7.3560257 47.0 68.0
## f0point5 0.7680598 0.014220628 0.77698696 0.7402101
## f1 0.8411026 0.010714028 0.84789646 0.8201058
## f2 0.92966795 0.005267985 0.9330484 0.9193357
## lift_top_group 1.0 0.0 1.0 1.0
## logloss 0.6931472 0.0 0.6931472 0.6931472
## max_per_class_error 1.0 0.0 1.0 1.0
## mcc 0.0 NaN NaN NaN
## mean_per_class_accuracy 0.5 0.0 0.5 0.5
## mean_per_class_error 0.5 0.0 0.5 0.5
## mse 0.25 0.0 0.25 0.25
## precision 0.7260711 0.01583762 0.73595506 0.6950673
## r2 -0.26316962 0.043160092 -0.28650317 -0.17953037
## recall 1.0 0.0 1.0 1.0
## rmse 0.5 0.0 0.5 0.5
## specificity 0.0 0.0 0.0 0.0
## cv_3_valid
## accuracy 0.747191
## auc 0.5
## err 0.252809
## err_count 45.0
## f0point5 0.78698224
## f1 0.8553055
## f2 0.9366197
## lift_top_group 1.0
## logloss 0.6931472
## max_per_class_error 1.0
## mcc NaN
## mean_per_class_accuracy 0.5
## mean_per_class_error 0.5
## mse 0.25
## precision 0.747191
## r2 -0.32347536
## recall 1.0
## rmse 0.5
## specificity 0.0
And use it just as before:
h2o.performance(h2o_xgb, test)
## H2OBinomialMetrics: xgboost
##
## MSE: 0.25
## RMSE: 0.5
## LogLoss: 0.6931472
## Mean Per-Class Error: 0.5
## AUC: 0.5
## Gini: 0
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 0 52 1.000000 =52/52
## Yes 0 146 0.000000 =0/146
## Totals 0 198 0.262626 =52/198
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.500000 0.848837 0
## 2 max f2 0.500000 0.933504 0
## 3 max f0point5 0.500000 0.778252 0
## 4 max accuracy 0.500000 0.737374 0
## 5 max precision 0.500000 0.737374 0
## 6 max recall 0.500000 1.000000 0
## 7 max specificity 0.500000 0.000000 0
## 8 max absolute_mcc 0.500000 0.000000 0
## 9 max min_per_class_accuracy 0.500000 0.000000 0
## 10 max mean_per_class_accuracy 0.500000 0.500000 0
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
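If you need the actual class predictions (and class probabilities) on new data rather than performance metrics, h2o.predict() returns them for any trained H2O model; a quick sketch:
# class predictions and probabilities on the test set
pred <- h2o.predict(h2o_xgb, test)
head(as.data.frame(pred))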
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] h2o_3.20.0.8 bindrcpp_0.2.2 xgboost_0.71.2 caret_6.0-80
## [5] lattice_0.20-38 ISLR_1.2 forcats_0.3.0 stringr_1.3.1
## [9] dplyr_0.7.7 purrr_0.2.5 readr_1.1.1 tidyr_0.8.2
## [13] tibble_1.4.2 ggplot2_3.1.0 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-137 bitops_1.0-6 lubridate_1.7.4
## [4] dimRed_0.1.0 httr_1.3.1 rprojroot_1.3-2
## [7] tools_3.5.1 backports_1.1.2 utf8_1.1.4
## [10] R6_2.3.0 rpart_4.1-13 lazyeval_0.2.1
## [13] colorspace_1.3-2 nnet_7.3-12 withr_2.1.2
## [16] gbm_2.1.4 gridExtra_2.3 tidyselect_0.2.5
## [19] compiler_3.5.1 cli_1.0.1 rvest_0.3.2
## [22] xml2_1.2.0 bookdown_0.7 scales_1.0.0
## [25] sfsmisc_1.1-2 DEoptimR_1.0-8 robustbase_0.93-3
## [28] digest_0.6.18 rmarkdown_1.10 pkgconfig_2.0.2
## [31] htmltools_0.3.6 rlang_0.3.0.1 readxl_1.1.0
## [34] ddalpha_1.3.4 rstudioapi_0.8 bindr_0.1.1
## [37] jsonlite_1.5 ModelMetrics_1.2.2 RCurl_1.95-4.11
## [40] magrittr_1.5 Matrix_1.2-15 fansi_0.4.0
## [43] Rcpp_0.12.19 munsell_0.5.0 abind_1.4-5
## [46] stringi_1.2.4 yaml_2.2.0 MASS_7.3-51.1
## [49] plyr_1.8.4 recipes_0.1.3 grid_3.5.1
## [52] pls_2.7-0 crayon_1.3.4 haven_1.1.2
## [55] splines_3.5.1 hms_0.4.2 knitr_1.20
## [58] pillar_1.3.0 reshape2_1.4.3 codetools_0.2-15
## [61] stats4_3.5.1 CVST_0.2-2 magic_1.5-9
## [64] glue_1.3.0 evaluate_0.12 blogdown_0.9
## [67] data.table_1.11.8 modelr_0.1.2 foreach_1.4.4
## [70] cellranger_1.1.0 gtable_0.2.0 kernlab_0.9-27
## [73] assertthat_0.2.0 DRR_0.0.3 xfun_0.4
## [76] gower_0.1.2 prodlim_2018.04.18 broom_0.5.0
## [79] e1071_1.7-0 class_7.3-14 survival_2.43-1
## [82] geometry_0.3-6 timeDate_3043.102 RcppRoll_0.3.0
## [85] iterators_1.0.10 lava_1.6.3 ipred_0.9-8