Nested cross-validation for second-stage validated predictions
long_nestor.Rd
Perform a nested cross-validation for a late-integration scheme: run a cross-validation for the early model, train second-stage models on its validated predictions, and evaluate the resulting nested models via validated predictions made by the late model. "Validated prediction" means a prediction made on independent data, such as out-of-bag (OOB) or cross-validated (CV) predictions.
Arguments
- x
Data as a numeric matrix (rows = samples). The zero-sum regression requires data on the log scale, i.e., x should hold log-transformed data.
- y
Named list with entries (for an illustrative construction, see the sketch after this argument list):
"bin", a named numeric one-column matrix, the binary response used for training;
"cox", a named numeric two-column matrix used for training, with time to event and event status (0 = censoring, 1 = event) in the first and second column, respectively;
"true", a named numeric one-column matrix, the binary response used for calculating the CV error.
- val_error_fun
Function used to calculate the error of independently validated predictions. Must take two numeric vectors of equal length, y and y_hat, the true and predicted outcomes, respectively, and return a numeric scalar; the lower the value, the better the model. See error_rate() or neg_roc_auc() for examples.
- fitter1
A patroklos-compliant fitter with CV tuning (see README for more details).
- fitter2
A patroklos-compliant fitter with validated predictions (see README for more details). If it returns "next", we skip the current combination of hyperparameters and set its metric to -Inf.
- hyperparams1
A named list with hyperparameters we will pass to fitter1.
- hyperparams2
A named list with hyperparameters for the late model. Unlike with hyperparams1, we call fitter2 for every combination of values in hyperparams2 and every lambda value from fitter1.
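A minimal sketch of how the x, y, and val_error_fun inputs could look. Everything below apart from those argument names is an illustrative assumption: the toy dimensions, the random data, and the my_error helper are not part of the package, and you would additionally need patroklos-compliant fitters for fitter1 and fitter2 (see README).

# Illustrative inputs (assumed toy data, not from the package): 20 samples,
# 50 log-scale features, a binary outcome, and matching survival times.
set.seed(1)
n <- 20
x <- matrix(rnorm(n * 50), nrow = n,
            dimnames = list(paste0("sample", seq_len(n)), paste0("gene", seq_len(50))))
bin <- matrix(rbinom(n, 1, 0.5), ncol = 1,
              dimnames = list(rownames(x), "status"))
cox <- cbind(time = rexp(n), event = bin[, 1])
rownames(cox) <- rownames(x)
y <- list(bin = bin, cox = cox, true = bin)

# A val_error_fun maps true and predicted outcomes to a scalar where lower is
# better, e.g. the misclassification rate of thresholded predictions.
my_error <- function(y, y_hat) mean(y != (y_hat > 0.5))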
Value
An S3 object with class nested_fit, the model with the best performance according to validated predictions assessed with val_error_fun.
Details
This function does hyperparameter tuning for a nested model: a so-called early model makes predictions from the high-dimensional part of the data (e.g. RNA-seq, NanoString); we then provide these predictions as a one-dimensional feature, together with new features, to a late model. Both the early and the late model try to predict y. To avoid providing the late model with overly optimistic (because overfitted) predictions during training, we feed its training algorithm values comparable to those we would observe for independent test samples, i.e. either cross-validated (CV) or out-of-bag (OOB) predictions. To evaluate the overall model, we do a second cross-validation or use OOB predictions.
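Stripped of patroklos, the late-integration scheme itself can be sketched in a few lines of base R. The glm early and late models, the variable names, and the 5-fold split below are illustrative assumptions, not the package's implementation.

# Assumed toy data: a high-dimensional block, one extra clinical feature, binary outcome.
set.seed(1)
n <- 100
x_high <- matrix(rnorm(n * 200), nrow = n)
age    <- rnorm(n, 60, 10)
y_bin  <- rbinom(n, 1, 0.5)

# Early model: cross-validated predictions from the high-dimensional block.
folds <- sample(rep(1:5, length.out = n))
early_cv_pred <- numeric(n)
for (k in 1:5) {
  train <- folds != k
  # A plain logistic regression on a few columns stands in for a regularized fitter.
  fit_early <- glm(y_bin[train] ~ ., data = data.frame(x_high[train, 1:5]),
                   family = binomial())
  early_cv_pred[!train] <- predict(fit_early,
                                   newdata = data.frame(x_high[!train, 1:5]),
                                   type = "response")
}

# Late model: trained on the validated (CV) early predictions plus the new feature,
# so it never sees overfitted in-sample predictions of the early model.
fit_late <- glm(y_bin ~ age + early_cv_pred, family = binomial())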
Note that the predictions we get for the nested model are not predictions as one would observe for independent test samples: fix a sample i. We get the OOB/CV prediction for sample i from a model (or models) whose training algorithm did not see sample i itself. But that algorithm probably saw the prediction for some sample j != i made by an early model whose training algorithm had seen sample i. Hence the term "pseudo" in the name of this function. This heuristic saves a factor of n_folds in computation time compared to a full nested cross-validation.
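To make the claimed saving concrete, here is a rough fit count under the assumption that both stages use the same number of folds; the numbers are illustrative, not measured.

# Assuming n_folds-fold CV at both stages: a full nested CV re-runs the early-model
# CV inside every outer fold, while this heuristic reuses one set of early CV
# predictions for all late-model folds.
n_folds <- 5
full_nested_early_fits <- n_folds * n_folds   # one inner CV per outer fold
heuristic_early_fits   <- n_folds             # a single CV of the early model
full_nested_early_fits / heuristic_early_fits # = n_folds, the claimed saving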