Machine Learning in R

Machine learning packages in R

  • The Good: hundreds of ML packages, many comply with an unwritten interface convention
    model = fit(target ~ ., data = train.data, ...)
    predictions = predict(model, newdata = test.data, ...)
    
  • The Bad: some expose a different API, and many implement general procedures (training, prediction, evaluation) in package-specific ways
  • The Ugly: meta-information and hyperparameters are buried in the docs (if documented at all); large experiments lead to tedious, error-prone code

mlr package (machine learning in R)

  • Domain-specific language for machine learning concepts
  • Unified interface:
    • Tasks: data and meta-info (e.g. target features)
    • Learners: fit a model, make predictions
    • Resampling: evaluate a model, optimize hyperparameters
  • Reflections: all objects are queryable, you can program on them
  • OO structure: generic algorithms (Bagging, Stacking, Feature Selection,...)
  • Guide: https://mlr-org.github.io/mlr-tutorial/
  • Reference: http://rpackages.ianhowson.com/cran/mlr/
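
A minimal end-to-end sketch of how these concepts fit together (standard mlr calls on the built-in iris data; each step is covered in detail in the following sections):

library(mlr)
# Task: data plus meta-information (here: which column is the target)
task = makeClassifTask(data = iris, target = "Species")
# Learner: a wrapper around a fitting algorithm with queryable hyperparameters
lrn = makeLearner("classif.rpart")
# Resampling: evaluate the learner on the task with 5-fold cross-validation
res = resample(lrn, task, makeResampleDesc("CV", iters = 5))
res$aggr  # aggregated performance (mean misclassification error by default)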

mlr tasks

  • Encapsulate data and meta-data about it (e.g. target features)
  • Define precisely what you want to do with the data
  • Types: regression, classification, clustering,...
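
The task type determines the constructor; a quick sketch of the non-regression variants (iris is only used as example data here):

# Classification: target must be a factor
classif.task = makeClassifTask(data = iris, target = "Species")
# Clustering: no target column at all
cluster.task = makeClusterTask(data = iris[, -5])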

Create regression task on the 'BostonHousing' dataset from the mlbench package

In [32]:
data(BostonHousing, package = "mlbench")
task = makeRegrTask(data = BostonHousing, target = "medv") 
print(task)
Supervised task: BostonHousing
Type: regr
Target: medv
Observations: 506
Features:
numerics  factors  ordered 
      12        1        0 
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE

Task introspection: what's in a task?

In [33]:
names(task)               # objects in a task
names(task$task.desc)     # objects in the task description
str(getTaskId(task))      # a unique ID that we can track
getTaskSize(task)         # number of data points
getTaskFeatureNames(task) # feature names
getTaskTargetNames(task)  # name of target feature
summary(getTaskTargets(task)) # distribution of target values
  1. 'type'
  2. 'env'
  3. 'weights'
  4. 'blocking'
  5. 'task.desc'
  1. 'id'
  2. 'type'
  3. 'target'
  4. 'size'
  5. 'n.feat'
  6. 'has.missings'
  7. 'has.weights'
  8. 'has.blocking'
 chr "BostonHousing"
506
  1. 'crim'
  2. 'zn'
  3. 'indus'
  4. 'chas'
  5. 'nox'
  6. 'rm'
  7. 'age'
  8. 'dis'
  9. 'rad'
  10. 'tax'
  11. 'ptratio'
  12. 'b'
  13. 'lstat'
'medv'
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.00   17.02   21.20   22.53   25.00   50.00 

Get the actual data set from a task

In [34]:
data = getTaskData(task)
str(data)
'data.frame':	506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : num  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ b      : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

OpenML integration

  • Many datasets are available, but they are often hard to find and come in many different formats
  • There is often no overview of the current state-of-the-art models for a dataset
  • To compare against them, we would have to rerun all other algorithms ourselves

OpenML.org: Collaboration platform for machine learning

  • Datasets, models, evaluations shared by anyone (reproducible and reusable)
  • Predefined tasks with train/test splits so that results are comparable
  • Run large-scale experiments effortlessly, automatically share results
  • Overview of state-of-the-art, leaderboards,...

Load any dataset from OpenML

In [35]:
datasets = listOMLDataSets()  # fetches all data sets
nrow(datasets) # number of available datasets
datasets[1:20, c("did", "name", "NumberOfInstances", "NumberOfFeatures")]
# You can also view them online on www.openml.org/d/<did>
2420
    did                name NumberOfInstances NumberOfFeatures
1     1              anneal               898               39
2     2              anneal               898               39
3     3            kr-vs-kp              3196               37
4     4               labor                57               17
5     5          arrhythmia               452              280
6     6              letter             20000               17
7     7           audiology               226               70
8     8     liver-disorders               345                7
9     9               autos               205               26
10   10               lymph               148               19
11   11       balance-scale               625                5
12   12       mfeat-factors              2000              217
13   13       breast-cancer               286               10
14   14       mfeat-fourier              2000               77
15   15            breast-w               699               10
16   16      mfeat-karhunen              2000               65
17   18 mfeat-morphological              2000                7
18   20         mfeat-pixel              2000              241
19   21                 car              1728                7
20   22       mfeat-zernike              2000               48

Search for specific datasets

In [36]:
# Show only a subset of dataset properties here
datacols = c("did", "name", "NumberOfInstances", "NumberOfFeatures", "NumberOfClasses")
subset(datasets, name == "iris")[, datacols]
subset(datasets, NumberOfInstances > 10000)[0:5, datacols]
subset(datasets, NumberOfClasses == 2)[0:5, datacols]
    did name NumberOfInstances NumberOfFeatures NumberOfClasses
55   61 iris               150                5               3
82  969 iris               150                5               2

    did                             name NumberOfInstances NumberOfFeatures NumberOfClasses
6     6                           letter             20000               17              26
24   26                          nursery             12960                9               5
30   32                        pendigits             10992               17              10
57   70      BNG(anneal,nominal,1000000)           1000000               39               6
58   71 BNG(anneal.ORIG,nominal,1000000)           1000000               39               6

    did          name NumberOfInstances NumberOfFeatures NumberOfClasses
3     3      kr-vs-kp              3196               37               2
4     4         labor                57               17               2
13   13 breast-cancer               286               10               2
15   15      breast-w               699               10               2
22   24      mushroom              8124               23               2

Create an mlr task on an OpenML dataset

In [37]:
ddata = getOMLDataSet(did = 35)$data # Dermatology dataset (id=35), see openml.org/d/35  
task = makeClassifTask(data = ddata, target = "class")
print(task)
Supervised task: ddata
Type: classif
Target: class
Observations: 366
Features:
numerics  factors  ordered 
       1       33        0 
Missings: TRUE
Has weights: FALSE
Has blocking: FALSE
Classes: 6
  1   2   3   4   5   6 
112  61  72  49  52  20 
Positive class: NA

Load predefined OpenML tasks

So you can use the same train/test samples as everyone else (and compare results)

In [38]:
omltask = getOMLTask(task.id = 35) # Classification task on the dermatology dataset (task_id=35), see openml.org/t/35  
print(omltask)
mlrtask = convertOMLTaskToMlr(omltask)$mlr.task # Optional: convert to MLR
OpenML Task 35 :: (Data ID = 35)
  Task Type            : Supervised Classification
  Data Set             : dermatology :: (Version = 1, OpenML ID = 35)
  Target Feature(s)    : class
  Estimation Procedure : Stratified crossvalidation (1 x 10 folds)

List all OpenML tasks

In [39]:
tasks = listOMLTasks() # Fetches all tasks
# Only show a subset of task properties
taskcols = c("task.id","task.type","name","target.feature","estimation.procedure")
head(tasks[,taskcols])
  task.id                 task.type       name target.feature     estimation.procedure
1       1 Supervised Classification     anneal          class 10-fold Crossvalidation
2       2 Supervised Classification     anneal          class 10-fold Crossvalidation
3       3 Supervised Classification   kr-vs-kp          class 10-fold Crossvalidation
4       4 Supervised Classification      labor          class 10-fold Crossvalidation
5       5 Supervised Classification arrhythmia          class 10-fold Crossvalidation
6       6 Supervised Classification     letter          class 10-fold Crossvalidation

Search for specific tasks

In [40]:
subset(tasks, name == "breast-cancer")[, taskcols]
     task.id                             task.type          name target.feature             estimation.procedure
13        13             Supervised Classification breast-cancer          Class          10-fold Crossvalidation
66        72                        Learning Curve breast-cancer          Class 10 times 10-fold Learning Curve
231      243             Supervised Classification breast-cancer          Class                  33% Holdout set
341     1712                        Learning Curve breast-cancer          Class           10-fold Learning Curve
402     1777             Supervised Classification breast-cancer          Class   5 times 2-fold Crossvalidation
511     1893             Supervised Classification breast-cancer          Class 10 times 10-fold Crossvalidation
566     1954             Supervised Classification breast-cancer          Class                    Leave one out
698     2181 Supervised Data Stream Classification breast-cancer          Class      Interleaved Test then Train
2865    5533                            Clustering breast-cancer             NA              50 times Clustering
4881   10125                            Clustering breast-cancer             NA              50 times Clustering

mlr learners

  • Wrappers around fit() and predict() functions
  • Learners: 73 Classification, 54 Regression, 8 Clustering,...
  • Descriptions of parameter sets
  • Annotations (e.g. handles missing values)
  • Naming convention: <tasktype>.<functionname>
    makeLearner("classif.rpart")
    makeLearner("regr.rpart")
    

Create learner

Initializes a learner with default hyperparameters, not trained yet. Naming convention: <task>.<algorithm>

In [41]:
lrn = makeLearner("classif.rpart")
print(lrn)
Learner classif.rpart from package rpart
Type: classif
Name: Decision Tree; Short name: rpart
Class: classif.rpart
Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights
Predict-Type: response
Hyperparameters: xval=0

Search for available learners via the help page ?learners or listLearners()

In [42]:
?learners
learners {mlr}    R Documentation

List of supported learning algorithms.

Description

All supported learners can be found by listLearners or as a table in the tutorial appendix: http://mlr-org.github.io/mlr-tutorial/release/html/integrated_learners/.


[Package mlr version 2.8 ]
In [43]:
listLearners()[0:10,c(1,2,5,20)]
   class               type    name                                               note
1  classif.ada         classif ada Boosting                                       `xval` has been set to `0` by default for speed.
2  classif.avNNet      classif Neural Network                                     `size` has been set to `3` by default. Doing bagging training of `nnet` if set `bag = TRUE`.
3  classif.bartMachine classif Bayesian Additive Regression Trees                 `use_missing_data` has been set to `TRUE` by default to allow missing data support.
4  classif.bdk         classif Bi-Directional Kohonen map
5  classif.binomial    classif Binomial Regression                                Delegates to `glm` with freely choosable binomial link function via learner parameter `link`.
6  classif.blackboost  classif Gradient Boosting With Regression Trees            See `?ctree_control` for possible breakage for nominal features with missingness.
7  classif.boosting    classif Adabag Boosting                                    `xval` has been set to `0` by default for speed.
8  classif.bst         classif Gradient Boosting                                  Renamed parameter `learner` to `Learner` due to nameclash with `setHyperPars`. Default changes: `Learner = "ls"`, `xval = 0`, and `maxdepth = 1`.
9  classif.cforest     classif Random forest based on conditional inference trees See `?ctree_control` for possible breakage for nominal features with missingness.
10 classif.clusterSVM  classif Clustered Support Vector Machines                  `centers` set to `2` by default.
In [44]:
# list all classification learners which can handle missing values 
listLearners("classif", properties = c("missings"))[, 1:4]
   class                   type    package         short.name
1  classif.bartMachine     classif bartMachine     bartmachine
2  classif.blackboost      classif mboost,party    blackbst
3  classif.boosting        classif adabag,rpart    adabag
4  classif.cforest         classif party           cforest
5  classif.ctree           classif party           ctree
6  classif.gbm             classif gbm             gbm
7  classif.J48             classif RWeka           j48
8  classif.JRip            classif RWeka           jrip
9  classif.naiveBayes      classif e1071           nbayes
10 classif.OneR            classif RWeka           oner
11 classif.PART            classif RWeka           part
12 classif.randomForestSRC classif randomForestSRC rfsrc
13 classif.rpart           classif rpart           rpart
In [45]:
# list all classification learners that can handle the current task 
listLearners(task)[, 1:4]
   class                   type    package         short.name
1  classif.boosting        classif adabag,rpart    adabag
2  classif.cforest         classif party           cforest
3  classif.ctree           classif party           ctree
4  classif.gbm             classif gbm             gbm
5  classif.J48             classif RWeka           j48
6  classif.JRip            classif RWeka           jrip
7  classif.naiveBayes      classif e1071           nbayes
8  classif.OneR            classif RWeka           oner
9  classif.PART            classif RWeka           part
10 classif.randomForestSRC classif randomForestSRC rfsrc
11 classif.rpart           classif rpart           rpart

Hyperparameters

List all hyperparameters

In [46]:
lrn = makeLearner("classif.rpart")
getParamSet(lrn)
                   Type len  Def   Constr Req Tunable Trafo
minsplit        integer   -   20 1 to Inf   -    TRUE     -
minbucket       integer   -    - 1 to Inf   -    TRUE     -
cp              numeric   - 0.01   0 to 1   -    TRUE     -
maxcompete      integer   -    4 0 to Inf   -    TRUE     -
maxsurrogate    integer   -    5 0 to Inf   -    TRUE     -
usesurrogate   discrete   -    2    0,1,2   -    TRUE     -
surrogatestyle discrete   -    0      0,1   -    TRUE     -
maxdepth        integer   -   30  1 to 30   -    TRUE     -
xval            integer   -   10 0 to Inf   -   FALSE     -
parms           untyped   -    -        -   -   FALSE     -

Setting hyperparameters at creation time

In [47]:
cluster.lrn = makeLearner("cluster.SimpleKMeans", N = 5)
lrn = makeLearner("classif.rpart", par.vals = list(maxdepth = 10, cp = 0.1))
lrn
Learner classif.rpart from package rpart
Type: classif
Name: Decision Tree; Short name: rpart
Class: classif.rpart
Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights
Predict-Type: response
Hyperparameters: xval=0,maxdepth=10,cp=0.1

You can change parameter values at any time, or go back to default values

In [48]:
# Afterwards
setHyperPars(lrn, maxdepth = 20)
# Go back to defaults
removeHyperPars(lrn, c("maxdepth", "cp"))
Learner classif.rpart from package rpart
Type: classif
Name: Decision Tree; Short name: rpart
Class: classif.rpart
Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights
Predict-Type: response
Hyperparameters: xval=0,maxdepth=20,cp=0.1
Learner classif.rpart from package rpart
Type: classif
Name: Decision Tree; Short name: rpart
Class: classif.rpart
Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights
Predict-Type: response
Hyperparameters: xval=0

Learner properties

  • numerics, factors, ordered: Handles numeric, factor (categorical), and ordered features
  • missings: Handles missing values in the data
  • weights: Training examples can be weighted
  • oneclass, twoclass, multiclass: Handles one-class, two-class, and multi-class classification problems
  • class.weights: Handles class weights
  • prob: Can predict probabilities
  • se: Can predict standard errors (for regression)

Querying and setting learner properties

In [49]:
class_tree = makeLearner("classif.rpart")
regr_forest = makeLearner("regr.randomForest")
class_tree$properties
setPredictType(class_tree, "prob")
setPredictType(regr_forest, "se")
  1. 'twoclass'
  2. 'multiclass'
  3. 'missings'
  4. 'numerics'
  5. 'factors'
  6. 'ordered'
  7. 'prob'
  8. 'weights'
Learner classif.rpart from package rpart
Type: classif
Name: Decision Tree; Short name: rpart
Class: classif.rpart
Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights
Predict-Type: prob
Hyperparameters: xval=0
Learner regr.randomForest from package randomForest
Type: regr
Name: Random Forest; Short name: rf
Class: regr.randomForest
Properties: numerics,factors,ordered,se
Predict-Type: se
Hyperparameters: se.method=bootstrap,se.boot=50,ntree.for.se=100

Training a learner

  • Training: fitting a model to the given data (as defined in a task)
In [50]:
iristask = convertOMLTaskToMlr(getOMLTask(task.id=59))$mlr.task 
lrn = makeLearner("classif.rpart")
model = train(lrn,iristask)
names(model)
model$learner.model
  1. 'learner'
  2. 'learner.model'
  3. 'task.desc'
  4. 'subset'
  5. 'features'
  6. 'factor.levels'
  7. 'time'
n= 150 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 Iris-setosa (0.33333333 0.33333333 0.33333333)  
  2) petallength< 2.45 50   0 Iris-setosa (1.00000000 0.00000000 0.00000000) *
  3) petallength>=2.45 100  50 Iris-versicolor (0.00000000 0.50000000 0.50000000)  
    6) petalwidth< 1.75 54   5 Iris-versicolor (0.00000000 0.90740741 0.09259259) *
    7) petalwidth>=1.75 46   1 Iris-virginica (0.00000000 0.02173913 0.97826087) *
In [92]:
library(rpart.plot) # For plotting rpart trees
rpart.plot(model$learner.model, extra = 4)

Making predictions

  • Given a trained model, make predictions for a test set
In [52]:
n = getTaskSize(iristask)
train.set = seq(1, n, by = 2) # odd rows for training
test.set = seq(2, n, by = 2)  # even rows for testing
model = train(lrn, iristask, subset = train.set) # train on the training subset
pred = predict(model, task = iristask, subset = test.set) # predict the rest
pred
Prediction: 75 observations
predict.type: response
threshold: 
time: 0.00
   id       truth    response
1   2 Iris-setosa Iris-setosa
3   4 Iris-setosa Iris-setosa
5   6 Iris-setosa Iris-setosa
7   8 Iris-setosa Iris-setosa
9  10 Iris-setosa Iris-setosa
11 12 Iris-setosa Iris-setosa

Visualizing predictions

  • mlr includes several functions to visualize learning performance
  • They return a ggplot object which you can manipulate as usual

Classification: only 2D visualizations

In [53]:
lrn = makeLearner("classif.rpart")
plotLearnerPrediction(lrn, iristask, features = c("petallength", "petalwidth")) + theme_cowplot()
In [54]:
# Learner parameters can be passed in the plotting function
lrn = makeLearner("classif.ksvm")
plotLearnerPrediction(lrn, iristask, kernel = "rbfdot", features = c("sepallength", "sepalwidth")) + theme_cowplot()

Regression: 1D and 2D visualizations

In [55]:
data(BostonHousing, package = "mlbench")
bh.task = makeRegrTask(data = BostonHousing, target = "medv")
lrn = makeLearner("regr.rpart")
plotLearnerPrediction(lrn, bh.task, features = c("lstat"))
In [56]:
lrn = makeLearner("regr.rpart")
plotLearnerPrediction(lrn, bh.task, features = c("lstat","rm"))

mlr resampling (evaluation)

  • Estimate generalization error of our models
  • Repeatedly fit models on training sets
  • Evaluate performance on independent test sets and average performance results

Evaluation measures

  • Default measure for classification: Mean misclassification error (mmce) $$ \frac{1}{n} \sum_{i=1}^n \mathbb{1}(y_i \neq \hat{y}_i) $$

  • Default measure for regression: Mean squared error (mse) $$ \frac{1}{n} \sum_{i = 1}^{n}(y_i - \hat{y}_i)^2 $$
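
Both defaults are simple averages and can be checked by hand; a small sketch with made-up truth/prediction vectors (all names below are illustrative only):

# mmce: fraction of misclassified points
y     = factor(c("a", "a", "b", "b"))
y.hat = factor(c("a", "b", "b", "b"))
mean(y != y.hat)               # 1 of 4 wrong -> 0.25

# mse: mean of squared residuals
y.num     = c(1.0, 2.0, 3.0)
y.num.hat = c(1.5, 2.0, 2.0)
mean((y.num - y.num.hat)^2)    # (0.25 + 0 + 1) / 3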

In [57]:
listMeasures("classif") # All available classification measures
print(mmce)
listMeasures("regr") # All available regression measures
print(mse)
  1. 'timepredict'
  2. 'gmean'
  3. 'acc'
  4. 'auc'
  5. 'ber'
  6. 'fn'
  7. 'fp'
  8. 'fnr'
  9. 'gpr'
  10. 'featperc'
  11. 'ppv'
  12. 'fpr'
  13. 'mmce'
  14. 'timeboth'
  15. 'npv'
  16. 'timetrain'
  17. 'fdr'
  18. 'tnr'
  19. 'mcc'
  20. 'bac'
  21. 'tpr'
  22. 'tn'
  23. 'f1'
  24. 'tp'
  25. 'multiclass.auc'
  26. 'brier'
Name: Mean misclassification error
Performance measure: mmce
Properties: classif,classif.multi,req.pred,req.truth
Minimize: TRUE
Best: 0; Worst: 1
Aggregated by: test.mean
Note: 
  1. 'timepredict'
  2. 'sae'
  3. 'sse'
  4. 'rmse'
  5. 'expvar'
  6. 'rsq'
  7. 'featperc'
  8. 'arsq'
  9. 'timeboth'
  10. 'timetrain'
  11. 'mae'
  12. 'mse'
  13. 'medae'
  14. 'medse'
Name: Mean of squared errors
Performance measure: mse
Properties: regr,req.pred,req.truth
Minimize: TRUE
Best: 0; Worst: Inf
Aggregated by: test.mean
Note: 

Evaluations are included in the predictions object

In [58]:
lrn = makeLearner("classif.rpart")
lrn = setPredictType(lrn, "prob") # We need probabilities
model = train(lrn, iristask, subset = train.set) # train on the training subset
pred = predict(model, task = iristask, subset = test.set) # predict the rest
performance(pred)
performance(pred, measures = list(mlr::multiclass.auc, mlr::acc))
# timetrain also needs the model
performance(pred, measures = mlr::timetrain, model = model)
mmce: 0.0533333333333333
multiclass.auc
0.833333333333333
acc
0.946666666666667
timetrain: 0.00500000000465661

Resampling strategies

  • Cross-validation (CV): Split data in $k$ roughly equal parts (folds), iteratively use one as the test set and the remainder as the training set, then average the performance estimates. $k$ typically 5..10.
  • Repeated Cross-validation (RepCV): Repeat CV $i$ times (randomly shuffling the data points), and average the CV results
  • Leave-one-out Cross-validation (LOO): Cross-Validation with $k$ = $n$ = the number of data points
  • Holdout, training/test (Holdout): Randomly split the data in a train and test set. Typically 70/30 split.
  • Subsampling, a.k.a. Monte-Carlo Cross-Validation (Subsample): Draw $k$ random holdouts and average the results. $k$ typically 30..100
  • Out-of-bag bootstrap (Bootstrap): Randomly draw $B$ training sets of size $n$ with replacement (points can be drawn more than once). These are expected to contain 63.2% of the original data points. The non-drawn (out-of-bag, OOB) points form the test sets. $B$ typically 30..100
  • Bootstrapping with the 632-rule (B632): Variant of OOB bootstrapping that takes a weighted average of the OOB bootstrap estimate and the training set error on the whole dataset. Less pessimistic than the OOB bootstrap.
  • Stratified resampling: Option for all strategies, creates splits so that class labels occur in equal proportions as in the original data
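
Each strategy above corresponds to a resample description in mlr; a sketch of the constructors (iteration counts and split ratios are just example values):

cv      = makeResampleDesc("CV", iters = 10)                     # k-fold cross-validation
repcv   = makeResampleDesc("RepCV", folds = 10, reps = 5)        # repeated CV
loo     = makeResampleDesc("LOO")                                # leave-one-out
holdout = makeResampleDesc("Holdout", split = 0.7)               # 70/30 train/test split
subsamp = makeResampleDesc("Subsample", iters = 30, split = 0.7) # Monte-Carlo CV
boot    = makeResampleDesc("Bootstrap", iters = 30)              # out-of-bag bootstrap
strcv   = makeResampleDesc("CV", iters = 10, stratify = TRUE)    # stratified variant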

First create resample description, then execute with learner and task. Returned resample object contains all performance estimates.

In [59]:
rdesc = makeResampleDesc("CV", iters = 3)
lrn = makeLearner("classif.rpart")
r = resample(lrn, iristask, resampling = rdesc)
print(r)
# Shorthand
r = crossval("classif.rpart", iristask, iters = 3)
Resample Result
Task: OpenML-Task-59
Learner: classif.rpart
mmce.aggr: 0.07
mmce.mean: 0.07
mmce.sd: 0.01
Runtime: 0.0511
In [60]:
names(r)
head(r$measures.test) # Results per CV fold
head(as.data.frame(r$pred)) # Final predictions
  1. 'learner.id'
  2. 'task.id'
  3. 'measures.train'
  4. 'measures.test'
  5. 'aggr'
  6. 'pred'
  7. 'models'
  8. 'err.msgs'
  9. 'extract'
  10. 'runtime'
  iter mmce
1    1 0.06
2    2 0.02
3    3 0.10

  id       truth    response iter  set
1  1 Iris-setosa Iris-setosa    1 test
2  3 Iris-setosa Iris-setosa    1 test
3  9 Iris-setosa Iris-setosa    1 test
4 14 Iris-setosa Iris-setosa    1 test
5 15 Iris-setosa Iris-setosa    1 test
6 16 Iris-setosa Iris-setosa    1 test

To retrieve the fitted models, set models = TRUE and extract them from the resample object

In [61]:
r = resample(lrn, iristask, resampling = rdesc, models = TRUE)
names(r)
print(r$models[]) # 3 models are returned (one for each fold)
  1. 'learner.id'
  2. 'task.id'
  3. 'measures.train'
  4. 'measures.test'
  5. 'aggr'
  6. 'pred'
  7. 'models'
  8. 'err.msgs'
  9. 'extract'
  10. 'runtime'
[[1]]
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = OpenML-Task-59; obs = 100; features = 4
Hyperparameters: xval=0

[[2]]
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = OpenML-Task-59; obs = 100; features = 4
Hyperparameters: xval=0

[[3]]
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = OpenML-Task-59; obs = 100; features = 4
Hyperparameters: xval=0

To compare multiple learners, reuse the same train/test sets for all by saving an instance of the resampling

In [62]:
rin = makeResampleInstance(rdesc, iristask)
model1 = resample("classif.rpart", iristask, resampling = rin)
model2 = resample("classif.knn", iristask, resampling = rin)
print(model1$aggr)
print(model2$aggr)
mmce.test.mean 
    0.04666667 
mmce.test.mean 
    0.04666667 

Benchmarking

  • Comparing algorithms over many datasets
  • Train and test sets need to be synchronized: learners see the same data splits
  • Can be combined with feature selection and hyperparameter tuning via mlr wrappers (see the sketch below this list)
  • Results stored in well-defined container object
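
A hedged sketch of how wrapped learners plug into a benchmark: the tuned and filtered learners below behave like ordinary learners, so benchmark() resamples the whole tuning/selection process (the parameter range and the filter method are example choices):

# Tuning wrapper: tunes cp on an inner 2-fold CV inside each outer training set
tune.ps   = makeParamSet(makeNumericParam("cp", lower = 0.001, upper = 0.1))
tune.ctrl = makeTuneControlGrid()
tuned.rpart = makeTuneWrapper("classif.rpart", resampling = makeResampleDesc("CV", iters = 2),
                              par.set = tune.ps, control = tune.ctrl)
# Filter wrapper: keeps the top 50% of features according to an ANOVA filter
filtered.rpart = makeFilterWrapper("classif.rpart", fw.method = "anova.test", fw.perc = 0.5)
# Benchmark both against the plain learner on the iris task from above
bmr.wrapped = benchmark(list(makeLearner("classif.rpart"), tuned.rpart, filtered.rpart),
                        iristask, makeResampleDesc("CV", iters = 3))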

A regression benchmark

In [63]:
data("BostonHousing", "mtcars", "swiss", package = c("mlbench", "datasets"))
cv10f = makeResampleDesc("CV", iters = 10)
tasks = list(
    makeRegrTask(data = BostonHousing, target = "medv"), 
    makeRegrTask(data = swiss, target = "Fertility"), 
    makeRegrTask(data = mtcars, target = "mpg")
    )
learners = list(
    makeLearner("regr.rpart"),
    makeLearner("regr.randomForest"),
    makeLearner("regr.lm")
    )
bmr = benchmark(learners, tasks, cv10f, mlr::mse)
bmr
        task.id        learner.id mse.test.mean
1 BostonHousing        regr.rpart     23.196284
2 BostonHousing regr.randomForest     10.877412
3 BostonHousing           regr.lm     23.654804
4        mtcars        regr.rpart     16.844322
5        mtcars regr.randomForest      5.016773
6        mtcars           regr.lm     12.279907
7         swiss        regr.rpart    134.075749
8         swiss regr.randomForest     65.037044
9         swiss           regr.lm     57.770478

Accessing benchmark results

In [64]:
head(getBMRAggrPerformances(bmr, as.df = TRUE), 3) # Aggregated results
head(getBMRPerformances(bmr, as.df = TRUE), 3) # Per-fold results
head(getBMRPredictions(bmr, as.df = TRUE), 3)  # Predictions
        task.id        learner.id mse.test.mean
1 BostonHousing        regr.rpart      23.19628
2 BostonHousing regr.randomForest      10.87741
3 BostonHousing           regr.lm      23.65480

        task.id learner.id iter       mse
1 BostonHousing regr.rpart    1 48.429430
2 BostonHousing regr.rpart    2  9.563561
3 BostonHousing regr.rpart    3 21.843050

        task.id learner.id id truth response iter  set
1 BostonHousing regr.rpart 19  20.2 21.32925    1 test
2 BostonHousing regr.rpart 45  21.2 21.32925    1 test
3 BostonHousing regr.rpart 46  19.3 21.32925    1 test

Visualizing performance

In [65]:
plotBMRBoxplots(bmr, measure = mlr::mse)
In [66]:
# Some ggplot2 customizations
plotBMRBoxplots(bmr, measure = mlr::mse, style = "violin") + aes(color = learner.id)
In [67]:
# Summary plot
plotBMRSummary(bmr)

A classification benchmark

In [68]:
# Classification benchmark with OpenML
cv10f = makeResampleDesc("CV", iters = 10)
tasks = list(
    convertOMLTaskToMlr(getOMLTask(task.id=59))$mlr.task,  # iris
    convertOMLTaskToMlr(getOMLTask(task.id=57))$mlr.task,  # ionosphere
    convertOMLTaskToMlr(getOMLTask(task.id=2382))$mlr.task # wine
    )
learners = list(
    makeLearner("classif.rpart"),
    makeLearner("classif.knn"),
    makeLearner("classif.randomForest")
    )
bmr = benchmark(learners, tasks, cv10f, mlr::mmce)
bmr
           task.id           learner.id mmce.test.mean
1 OpenML-Task-2382        classif.rpart     0.11143791
2 OpenML-Task-2382          classif.knn     0.26928105
3 OpenML-Task-2382 classif.randomForest     0.01666667
4   OpenML-Task-57        classif.rpart     0.13912698
5   OpenML-Task-57          classif.knn     0.13920635
6   OpenML-Task-57 classif.randomForest     0.07111111
7   OpenML-Task-59        classif.rpart     0.06000000
8   OpenML-Task-59          classif.knn     0.04000000
9   OpenML-Task-59 classif.randomForest     0.04000000

Benchmark + share automatically on OpenML

In [69]:
task.ids = c(59,57,2382)
for (lrn in learners) {
  for (id in task.ids) {
    task = getOMLTask(id)
    res = runTaskMlr(task, lrn) # shorthand to run OpenML tasks
    run.id = uploadOMLRun(res)  # upload results
  }
}
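
Uploading runs requires authentication; a minimal sketch of the one-time configuration (the key string below is a placeholder, find yours in your account settings on openml.org):

library(OpenML)
setOMLConfig(apikey = "YOUR_OPENML_API_KEY") # register your API key for this session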

Hyperparameter tuning

  • Find the 'best' hyperparameters for a given dataset
  • Tuners iteratively evaluate hyperparameters and move to next setting
  • mlr offers many tuners through the same interface
  • All info logged in OptPath object

Tuning strategies

  • Grid search: exhaustively try all combinations on finite grid
  • Random search: draw parameters randomly (scales better)
  • Iterated F-Racing: all candidates are trained in parallel, on increasing amounts of data (steps). At each step, candidates that perform statistically worse than the others are dropped
  • Model-based (Bayesian) optimization: while evaluating parameter settings, a surrogate model learns the hyperparameter space and predicts interesting parameter settings to try next
  • ...
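
The grid, random, and irace strategies appear in the cells below. For model-based optimization, mlr offers an analogous control object; a minimal sketch (assumes the mlrMBO package is installed; default surrogate model and budget):

ctrl.mbo = makeTuneControlMBO() # model-based (Bayesian) optimization as a tuning control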

A simple grid search

In [118]:
task = convertOMLTaskToMlr(getOMLTask(task.id=2073))$mlr.task #yeast
lrn = makeLearner("classif.rpart")
rdesc = makeResampleDesc("CV", iters = 3)
ps = makeParamSet( # Define the grid
    makeIntegerParam("maxdepth", lower = 2, upper = 30),
    makeNumericParam("cp", lower = 0, upper = 0.5)
    )
ctrl = makeTuneControlGrid() # Use a grid search
res = tuneParams(lrn,task, rdesc, par.set = ps, control = ctrl, 
                measures = list(acc, setAggregation(acc, test.sd)))
res
Tune result:
Op. pars: maxdepth=5; cp=0
acc.test.mean=0.567,acc.test.sd=0.0136
In [105]:
ctrl = makeTuneControlIrace(maxExperiments = 200) # Iterative F-Racing
ctrl = makeTuneControlRandom(maxit = 200) # Random Search
In [119]:
res$opt.path
opt.grid = as.data.frame(res$opt.path)
head(opt.grid)
Optimization path
  Dimensions: x = 2/2, y = 2
  Length: 100
  Add x values transformed: FALSE
  Error messages: TRUE. Errors: 0 / 100.
  Exec times: TRUE. Range: 0.055 - 0.126. 0 NAs.
  maxdepth cp acc.test.mean acc.test.sd dob eol error.message exec.time
1        2  0     0.4770880 0.005189885   1  NA            NA     0.063
2        5  0     0.5667144 0.013611030   2  NA            NA     0.070
3        8  0     0.5667130 0.004435438   3  NA            NA     0.083
4       11  0     0.5619910 0.010188160   4  NA            NA     0.084
5       14  0     0.5586213 0.007977376   5  NA            NA     0.082
6       18  0     0.5566011 0.006109453   6  NA            NA     0.083
In [121]:
myg = ggplot(opt.grid, aes(x = maxdepth, y = cp, fill = acc.test.mean, label = round(acc.test.sd, 3))) 
myg + geom_tile() + geom_text(color = "white")

Nested resampling

  • Never optimize parameters over the same test data as used for estimating the final performance of the model
  • Nested resampling: embed the whole model selection process into an outer resampling loop
  • In mlr you can wrap the learner in a tuning procedure with makeTuneWrapper

Nested subsampling with 2 iterations in the inner loop (for selecting hyperparameters) and 3-fold crossvalidation in the outer loop (for final evaluation)

In [74]:
## Inner tuning loop
ctrl = makeTuneControlGrid()
inner = makeResampleDesc("Subsample", iters = 2)
lrn = makeTuneWrapper("classif.rpart", resampling = inner, par.set = ps, control = ctrl, show.info = FALSE)

## Outer resampling loop
outer = makeResampleDesc("CV", iters = 3)
r = resample(lrn, iristask, resampling = outer, extract = getTuneResult, show.info = FALSE)
print(r)
Resample Result
Task: OpenML-Task-59
Learner: classif.rpart.tuned
mmce.aggr: 0.05
mmce.mean: 0.05
mmce.sd: 0.02
Runtime: 58.4672

ROC analysis

In [122]:
data(Sonar, package = "mlbench")
sonar.task = makeClassifTask(data=Sonar,target="Class")

n = getTaskSize(sonar.task) # make a 2/3 split
train.set = sample(n, size = round(2/3 * n))
test.set = setdiff(seq_len(n), train.set)

lrn1 = makeLearner("classif.kknn", predict.type = "prob") # output probabilities
mod1 = train(lrn1, sonar.task, subset = train.set)
pred1 = predict(mod1, task = sonar.task, subset = test.set)

# Evaluate performance over different thresholds
df = generateThreshVsPerfData(pred1, measures = list(fpr, tpr, mmce))
In [76]:
plotROCCurves(df)
In [77]:
plotThreshVsPerf(df)

Compare learners

In [78]:
lrn2 = makeLearner("classif.ksvm", predict.type = "prob")
mod2 = train(lrn2, sonar.task, subset = train.set)
pred2 = predict(mod2, task = sonar.task, subset = test.set)
df = generateThreshVsPerfData(list(kknn = pred1, ksvm = pred2), measures = list(fpr, tpr))
plotROCCurves(df)
In [79]:
qplot(x = fpr, y = tpr, color = learner, data = df$data, geom = "path")

Extra: Handling datasets in R

  • In R, datasets are typically represented as data frames
  • Here we show some useful operations. Skip this part if you already know them.


Basics

In [80]:
data(iris, package = "datasets") # Loads dataset iris
str(iris) # Returns a description
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
In [81]:
head(iris) # First few rows
summary(iris) # Summary statistics
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
In [82]:
names(iris) # All feature names
dim(iris)   # Dimensions
nrow(iris)  # Number of rows / examples
ncol(iris)  # Number of columns / features
  1. 'Sepal.Length'
  2. 'Sepal.Width'
  3. 'Petal.Length'
  4. 'Petal.Width'
  5. 'Species'
  1. 150
  2. 5
150
5

Slicing

In [83]:
head(iris$Petal.Length) # Select specific feature
iris[1:5,3]     # Select values in row 1:5, column 3
head(iris[,c("Petal.Length","Petal.Width")]) # Select columns by name
iris[3,]        # Select row 3
  1. 1.4
  2. 1.4
  3. 1.3
  4. 1.5
  5. 1.4
  6. 1.7
  1. 1.4
  2. 1.4
  3. 1.3
  4. 1.5
  5. 1.4
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3          4.7         3.2          1.3         0.2  setosa
In [84]:
# R converts a single selected column to a vector; use drop = FALSE to keep a data frame
head(iris[,c("Petal.Length")])
head(iris[,c("Petal.Length"), drop=FALSE])
  1. 1.4
  2. 1.4
  3. 1.3
  4. 1.5
  5. 1.4
  6. 1.7
  Petal.Length
1          1.4
2          1.4
3          1.3
4          1.5
5          1.4
6          1.7

Applying functions

In [85]:
dat = iris[1:8,3:4] # select rows 1-8, columns 3-4
apply(dat, 1, mean)       # computes mean per ROW (=1)
apply(dat, 2, median)     # computes median per COLUMN (=2)
   1    2    3    4    5    6    7    8 
0.80 0.80 0.75 0.85 0.80 1.05 0.85 0.85 

Petal.Length  Petal.Width 
         1.4          0.2 

Subsetting

In [86]:
head(subset(iris, Petal.Length <= 4 & Petal.Width > 1.2))
head(subset(iris, Species == "setosa"))
head(subset(iris, Species %in% c("setosa", "versicolor")))
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
54          5.5         2.3          4.0         1.3 versicolor
60          5.2         2.7          3.9         1.4 versicolor
65          5.6         2.9          3.6         1.3 versicolor
72          6.1         2.8          4.0         1.3 versicolor
90          5.5         2.5          4.0         1.3 versicolor

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
In [87]:
# Use droplevels to actually remove unused feature values (levels) from the dataset
onlysetosa = subset(iris, Species == "setosa")
str(onlysetosa)
str(droplevels(onlysetosa))
'data.frame':	50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
'data.frame':	50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 1 level "setosa": 1 1 1 1 1 1 1 1 1 1 ...

Removing and adding rows

In [88]:
small_iris = iris[iris$Sepal.Length < 4.5,] # Remove rows based on test
head(small_iris)
new_row = c(4.0, 3.0, 1.0, 0.2, "setosa")
small_iris = rbind(small_iris, new_row)
head(small_iris)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
9           4.4         2.9          1.4         0.2  setosa
14          4.3         3.0          1.1         0.1  setosa
39          4.4         3.0          1.3         0.2  setosa
43          4.4         3.2          1.3         0.2  setosa

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
9           4.4         2.9          1.4         0.2  setosa
14          4.3         3.0          1.1         0.1  setosa
39          4.4         3.0          1.3         0.2  setosa
43          4.4         3.2          1.3         0.2  setosa
5             4           3            1         0.2  setosa

Removing and adding columns

In [89]:
sepalwidth = iris$Sepal.Width
iris$Sepal.Width = NULL # Remove column Sepal.Width
head(iris)
iris$Sepal.Width = sepalwidth # Add it again
head(iris)
iris = iris[c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species")] #Reorder
head(iris)
  Sepal.Length Petal.Length Petal.Width Species
1          5.1          1.4         0.2  setosa
2          4.9          1.4         0.2  setosa
3          4.7          1.3         0.2  setosa
4          4.6          1.5         0.2  setosa
5          5.0          1.4         0.2  setosa
6          5.4          1.7         0.4  setosa

  Sepal.Length Petal.Length Petal.Width Species Sepal.Width
1          5.1          1.4         0.2  setosa         3.5
2          4.9          1.4         0.2  setosa         3.0
3          4.7          1.3         0.2  setosa         3.2
4          4.6          1.5         0.2  setosa         3.1
5          5.0          1.4         0.2  setosa         3.6
6          5.4          1.7         0.4  setosa         3.9

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Generating data

Several R packages allow you to generate synthetic data

In [90]:
library(mlbench)
newdata = as.data.frame(mlbench.spirals(n=1000, cycles=1.5, sd=0.05))
spirals = ggplot(data=newdata, aes(x=x.1, y=x.2, color=classes)) + geom_point() + coord_fixed(ratio=1)
newdata = as.data.frame(mlbench.threenorm(n=1000, d=2))
blobs = ggplot(data=newdata, aes(x=x.1, y=x.2, color=classes)) + geom_point() + coord_fixed(ratio=1)
newdata = as.data.frame(mlbench.cassini(n=1000, relsize=c(2,2,1)))
waves = ggplot(data=newdata, aes(x=x.1, y=x.2, color=classes)) + geom_point() + coord_fixed(ratio=1)
newdata = as.data.frame(mlbench.ringnorm(n=1000, d=2))
shapes = ggplot(data=newdata, aes(x=x.1, y=x.2, color=classes)) + geom_point() + coord_fixed(ratio=1)
grid.arrange(spirals, blobs, waves, shapes, ncol=2, nrow =2)
In [91]:
# Class distribution
plotclass = ggplot(data=iris, aes(x=Species)) + geom_bar(stat="count") + coord_fixed(ratio=0.02)
# Sepal.Length distribution per class
plotpetal = ggplot(data=iris, aes(x=Sepal.Length, fill=Species)) + geom_bar(stat="count") + coord_fixed(ratio=0.05)
# Plot Petal.Length against Petal.Width
plotdim = ggplot(data=iris, aes(x=Petal.Length, y=Petal.Width, color=Species)) + geom_point()
grid.arrange(plotclass, plotpetal, plotdim, ncol = 1)