Nested cross-validation for second-stage validated predictions
long_nestor.Rd
Perform a nested cross-validation for a late-integration scheme: run a cross-validation for the early model, train second-stage models on its validated predictions, and evaluate the resulting nested models via validated predictions made by the late model. "Validated prediction" means a prediction made on independent data, such as out-of-bag (OOB) or cross-validated (CV) predictions.
Arguments
- x
Data as a numeric matrix (rows = samples). The zero-sum regression requires data on the log scale, i.e., x should hold log-transformed data.
- y
Named list with entries (for an illustrative construction, see the sketch after this argument list):
"bin", a named numeric one-column matrix, the binary response used for training;
"cox", a named numeric two-column matrix used for training, with time to event and event status (0 = censoring, 1 = event) in the first and second column, respectively;
"true", a named numeric one-column matrix, the binary response used for calculating the CV error.
- val_error_fun
Function used to calculate the error of independently validated predictions. Must take two numeric vectors of equal length, y and y_hat, the true and predicted outcomes, respectively, and return a numeric scalar; the lower the value, the better the model. See error_rate() or neg_roc_auc() for examples.
- fitter1
A patroklos-compliant fitter with CV tuning (see README for more details).
- fitter2
A patroklos-compliant fitter with validated predictions (see README for more details). If it returns "next", we skip the current combination of hyperparameters and set its metric to -Inf.
- hyperparams1
A named list with hyperparameters we will pass to fitter1.
- hyperparams2
A named list with hyperparameters for the late model. Unlike with hyperparams1, we call fitter2 for every combination of values in hyperparams2 and every lambda value from fitter1.
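A minimal sketch of how the x, y, and val_error_fun inputs could look. Everything below apart from those argument names is an illustrative assumption: the toy dimensions, the random data, and the my_error helper are not part of the package, and you would additionally need patroklos-compliant fitters for fitter1 and fitter2 (see README).

# Illustrative inputs (assumed toy data, not from the package): 20 samples,
# 50 log-scale features, a binary outcome, and matching survival times.
set.seed(1)
n <- 20
x <- matrix(rnorm(n * 50), nrow = n,
            dimnames = list(paste0("sample", seq_len(n)), paste0("gene", seq_len(50))))
bin <- matrix(rbinom(n, 1, 0.5), ncol = 1,
              dimnames = list(rownames(x), "status"))
cox <- cbind(time = rexp(n), event = bin[, 1])
rownames(cox) <- rownames(x)
y <- list(bin = bin, cox = cox, true = bin)

# A val_error_fun maps true and predicted outcomes to a scalar where lower is
# better, e.g. the misclassification rate of thresholded predictions.
my_error <- function(y, y_hat) mean(y != (y_hat > 0.5))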
Value
An S3 object with class nested_fit, the model with the best performance according to validated predictions assessed with val_error_fun.
Details
This function does hyperparameter tuning for a nested model: a so-called early model makes predictions from the high-dimensional part of the data (e.g. RNA-seq, NanoString); we then provide these predictions as a one-dimensional feature, together with new features, to a late model. Both the early and the late model try to predict y. To avoid providing the late model with overly optimistic (because overfitted) predictions during training, we feed its training algorithm values comparable to those we would observe for independent test samples, i.e. either cross-validated (CV) or out-of-bag (OOB) predictions. To evaluate the overall model, we do a second cross-validation or use OOB predictions.
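Stripped of patroklos, the late-integration scheme itself can be sketched in a few lines of base R. The glm early and late models, the variable names, and the 5-fold split below are illustrative assumptions, not the package's implementation.

# Assumed toy data: a high-dimensional block, one extra clinical feature, binary outcome.
set.seed(1)
n <- 100
x_high <- matrix(rnorm(n * 200), nrow = n)
age    <- rnorm(n, 60, 10)
y_bin  <- rbinom(n, 1, 0.5)

# Early model: cross-validated predictions from the high-dimensional block.
folds <- sample(rep(1:5, length.out = n))
early_cv_pred <- numeric(n)
for (k in 1:5) {
  train <- folds != k
  # A plain logistic regression on a few columns stands in for a regularized fitter.
  fit_early <- glm(y_bin[train] ~ ., data = data.frame(x_high[train, 1:5]),
                   family = binomial())
  early_cv_pred[!train] <- predict(fit_early,
                                   newdata = data.frame(x_high[!train, 1:5]),
                                   type = "response")
}

# Late model: trained on the validated (CV) early predictions plus the new feature,
# so it never sees overfitted in-sample predictions of the early model.
fit_late <- glm(y_bin ~ age + early_cv_pred, family = binomial())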
Note that the predictions we get for the nested model are not predictions as one would observe for independent test samples: fix a sample i. We get the OOB/CV prediction for sample i from a model (or models) whose training algorithm did not see sample i itself. But that algorithm probably saw the prediction for some sample j != i made by an early model whose training algorithm had seen sample i. Hence the term "pseudo" in the name of this function. This heuristic saves a factor of n_folds in computation time compared to a full nested cross-validation.
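To make the claimed saving concrete, here is a rough fit count under the assumption that both stages use the same number of folds; the numbers are illustrative, not measured.

# Assuming n_folds-fold CV at both stages: a full nested CV re-runs the early-model
# CV inside every outer fold, while this heuristic reuses one set of early CV
# predictions for all late-model folds.
n_folds <- 5
full_nested_early_fits <- n_folds * n_folds   # one inner CV per outer fold
heuristic_early_fits   <- n_folds             # a single CV of the early model
full_nested_early_fits / heuristic_early_fits # = n_folds, the claimed saving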