An R6 class for the data

A Data object holds the phenotype and expression data belonging to a data set. It specifies the names of the important columns, reads in the data, restricts it to a subset of the samples (the cohort), and prepares it for a model.

Public fields

name: A descriptive name for the data set.
directory: Directory where the expression and pheno csv files are located.
pivot_time_cutoff: Time cutoff that divides the samples into high-risk (event before the cutoff) and low-risk (event after the cutoff).
cohort: Regular expression to subset the data to a cohort.
imputer: Function handling NAs in the predictor matrix.
expr_mat: Named numeric matrix. Samples correspond to rows.
pheno_tbl: A tibble with phenotypic features and samples as rows.
expr_file: Name of the expression csv file inside directory.
pheno_file: Name of the pheno data csv inside directory.
cohort_col: The name of the column in the pheno data that holds the cohort a sample belongs to.
patient_id_col: The name of the column in the pheno data that holds unique patient identifiers.
time_to_event_col: The name of the column in the pheno data that holds the time-to-event values.
event_col: The name of the column in the pheno data that holds the event status encoded as 1 = occurrence, 0 = censoring.
gene_id_col: The name of the column in the expression data that holds the gene identifiers.
benchmark_col: The name of the column in the pheno data that holds the benchmark risk score (like the IPI).

Methods

Method `new()`

Construct a Data R6 object.

Usage

Data$new(
  name,
  directory,
  pivot_time_cutoff,
  cohort,
  imputer = mean_impute,
  time_to_event_col,
  event_col,
  cohort_col,
  benchmark_col = NULL,
  expr_file = "expr.csv",
  pheno_file = "pheno.csv",
  patient_id_col = "patient_id",
  gene_id_col = "gene_id"
)

Arguments

name: string. A descriptive name for the data set.
directory: string. The directory where both the expression and the pheno csv file are located.
pivot_time_cutoff: numeric. Time cutoff that divides the samples into high-risk (event before the cutoff) and low-risk (event after the cutoff).
cohort: string. At the end of preparing the data, subset it to those samples whose value in the cohort_col column matches cohort.
imputer: function or NULL. Function imputing NAs in the predictor matrix. See imputer_prototype() for its interface. Default is mean_impute(). NULL means no imputation.
time_to_event_col: string. The name of the column in the pheno data that holds the time-to-event values.
event_col: string. The name of the column in the pheno data that holds the event status encoded as 1 = occurrence, 0 = censoring.
cohort_col: string. The name of the column in the pheno data that holds the cohort a sample belongs to.
benchmark_col: string or NULL. The name of the column in the pheno data that holds the output of a benchmark model.
expr_file: string. The name of the expression csv file inside directory. Default is "expr.csv". See details for the expected format.
pheno_file: string. The name of the pheno data csv inside directory. Default is "pheno.csv". See details for the expected format.
patient_id_col: string. The name of the column in the pheno data that holds the patient identifiers.
gene_id_col: string. The name of the column in the expression data that holds the gene identifiers.
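The imputer argument can be any function that fills in NAs in the predictor matrix. The sketch below is a hypothetical imputer, not the package's mean_impute(). It assumes the interface is "numeric matrix in, numeric matrix of the same shape out"; the actual interface is documented in imputer_prototype() and may differ.

```r
# Hypothetical imputer (assumed interface: numeric matrix in, numeric matrix out).
# Samples are rows, so every column is one feature and is imputed on its own.
column_mean_impute <- function(x) {
  for (j in seq_len(ncol(x))) {
    missing <- is.na(x[, j])
    # A column consisting only of NAs yields NaN here and stays unusable.
    x[missing, j] <- mean(x[, j], na.rm = TRUE)
  }
  x
}

toy_mat <- matrix(
  c(1, NA, 3, 4, 5, NA),
  nrow = 3,
  dimnames = list(c("P1", "P2", "P3"), c("GENE_A", "GENE_B"))
)
imputed <- column_mean_impute(toy_mat)
```

If the assumed interface holds, such a function could be passed as imputer = column_mean_impute.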

Details

The pheno csv file holds the samples as rows and the variables as columns. Its first column, a character column named by patient_id_col, holds the unique patient identifiers.

The expression csv file holds the genes as rows and the samples as columns. Its first column, a character column named by gene_id_col, holds the unique gene identifiers.
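To make the two layouts concrete, here is a minimal sketch that writes a toy pair of files with base R. The directory, the patient and gene identifiers, and the column names time_to_event, event and cohort are made up for illustration; only the defaults "expr.csv", "pheno.csv", "patient_id" and "gene_id" come from the usage above.

```r
# Toy directory inside the session's temporary directory (hypothetical location).
toy_dir <- file.path(tempdir(), "toy_data")
dir.create(toy_dir, showWarnings = FALSE)

# Pheno data: one row per sample, patient identifiers in the first column.
pheno <- data.frame(
  patient_id = c("P1", "P2", "P3"),
  time_to_event = c(1.5, 3.2, 0.7),
  event = c(1, 0, 1),                  # 1 = occurrence, 0 = censoring
  cohort = c("train", "train", "test")
)

# Expression data: one row per gene, gene identifiers in the first column,
# one further column per sample.
expr <- data.frame(
  gene_id = c("GENE_A", "GENE_B"),
  P1 = c(5.1, 2.3),
  P2 = c(4.8, NA),
  P3 = c(6.0, 2.9)
)

write.csv(pheno, file.path(toy_dir, "pheno.csv"), row.names = FALSE)
write.csv(expr, file.path(toy_dir, "expr.csv"), row.names = FALSE)
```

Note that the sample columns of the expression file carry the same identifiers as the first column of the pheno file.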

Returns

A Data object.
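A constructor call might look like the following sketch. It assumes that the package exporting Data is attached and that the hypothetical directory data/toy contains expr.csv and pheno.csv in the format described in the details. The column names time_to_event, event and cohort are illustrative as well.

```r
# Hypothetical location of the two csv files.
toy_dir <- file.path("data", "toy")

# Only run if the Data class generator is available in this session.
if (exists("Data")) {
  toy_data <- Data$new(
    name = "Toy data",
    directory = toy_dir,
    pivot_time_cutoff = 2,
    cohort = "train",
    time_to_event_col = "time_to_event",
    event_col = "event",
    cohort_col = "cohort"
  )
  # Fill the expr_mat and pheno_tbl fields from the csv files.
  toy_data$read()
}
```

All remaining arguments keep their defaults, so the files are expected under the names expr.csv and pheno.csv with identifier columns patient_id and gene_id.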

Method `read()`

Read expression data into the expr_mat attribute and pheno data into the pheno_tbl attribute.

Usage

Data$read()

Method `prepare()`

Prepare the already read-in data for a model.

Usage

Data$prepare(model, quiet = FALSE)

Arguments

model: A Model object.
quiet: logical. If TRUE, suppress messages.

Details

You need to set cohort before calling this method.

Method `survival_quantiles()`

Calculate the quantiles of the survival times.

Usage

Data$survival_quantiles(round_digits = 3)

Arguments

round_digits: integer. Round the numbers in the returned tibble to this number of digits after the point.

Details

We take censoring into account.

Returns

A tibble with two columns: the first holds the quantiles q, the second the corresponding time-to-event values.
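One common way to compute censoring-aware quantiles of survival times is the Kaplan-Meier estimator. The sketch below uses the survival package for this. Whether survival_quantiles() works exactly this way is an assumption, and the toy data as well as the column names of the result are made up.

```r
library(survival)

# Toy data: event = 1 is an observed event, event = 0 is censoring.
time <- c(2, 5, 7, 11, 13, 17)
event <- c(1, 0, 1, 1, 0, 1)

# Kaplan-Meier fit; censored samples leave the risk set without counting as events.
fit <- survfit(Surv(time, event) ~ 1)

# Quantiles of the estimated time-to-event distribution; an entry is NA if the
# survival curve never drops far enough to reach that quantile.
probs <- c(0.25, 0.5, 0.75)
quantile_tbl <- data.frame(
  quantile = probs,
  time_to_event = round(
    unname(quantile(fit, probs = probs, conf.int = FALSE)),
    digits = 3
  )
)
```

Simply taking quantile() of the raw time column would treat censored times as event times and therefore tend to underestimate survival.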

Method `split()`

Split the data into a train and a test cohort.

Usage

Data$split(train_prop, save = TRUE, keep_risk = TRUE, quiet = FALSE)

Arguments

train_prop: numeric. Proportion of the data to put in the train cohort.
save: logical. If TRUE, save the named cohort vector to a file.
keep_risk: logical. If TRUE, keep the ratio of high-risk versus low-risk samples in train and test cohort the same as in the complete data set.
quiet: logical. If TRUE, suppress messages.

Details

Cohort affiliation will show up in the column cohort_col of pheno_tbl.
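The idea behind keep_risk = TRUE can be sketched as a stratified split. The helper below is hypothetical and not the package's implementation. It assumes that high-risk means an observed event before pivot_time_cutoff and, for simplicity, pools all other samples (including those censored before the cutoff) into the low-risk group.

```r
# Hypothetical helper: returns a character vector of "train" / "test" labels.
split_cohorts <- function(time, event, pivot_time_cutoff, train_prop) {
  risk <- ifelse(event == 1 & time < pivot_time_cutoff, "high", "low")
  cohort <- rep("test", length(time))
  # Draw the train samples separately within each risk group.
  for (group in unique(risk)) {
    idx <- which(risk == group)
    n_train <- round(train_prop * length(idx))
    # sample.int() avoids the special behavior of sample() on length-one input.
    train_idx <- idx[sample.int(length(idx), n_train)]
    cohort[train_idx] <- "train"
  }
  cohort
}

set.seed(1)
time <- c(0.5, 1.2, 3.4, 0.8, 4.1, 2.9, 1.1, 5.0)
event <- c(1, 1, 0, 1, 0, 1, 1, 1)
risk <- ifelse(event == 1 & time < 2, "high", "low")
cohort <- split_cohorts(time, event, pivot_time_cutoff = 2, train_prop = 0.5)

# Cross-tabulate cohort and risk group to inspect the ratio in both cohorts.
split_table <- table(cohort, risk)
```

Because sampling happens within each risk group, both cohorts end up with roughly the same share of high-risk samples, up to rounding.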

Method `qc_preprocess()`

Quality control at the end of preprocessing.

Usage

Data$qc_preprocess(expr_tbl)

Arguments

expr_tbl: A tibble with the expression data. The first column, named gene_id_col, holds the gene identifiers; the other columns hold the samples.

Details

Check whether the expression and pheno tibbles are consistent with the other attributes of the Data object. You typically call this method at the end of preprocessing; the read() method calls it as well.
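The kind of consistency check meant here could look like the following sketch. The function check_consistency() and the individual checks are assumptions for illustration; the checks that qc_preprocess() actually performs are not listed in this documentation.

```r
# Hypothetical consistency check between expression and pheno data.
check_consistency <- function(expr_tbl, pheno_tbl, gene_id_col, patient_id_col) {
  stopifnot(
    gene_id_col %in% colnames(expr_tbl),
    patient_id_col %in% colnames(pheno_tbl),
    is.character(expr_tbl[[gene_id_col]]),
    # anyDuplicated() returns 0 if all identifiers are unique.
    anyDuplicated(expr_tbl[[gene_id_col]]) == 0,
    anyDuplicated(pheno_tbl[[patient_id_col]]) == 0
  )
  # Every sample column in the expression data needs a row in the pheno data
  # and vice versa.
  sample_cols <- setdiff(colnames(expr_tbl), gene_id_col)
  stopifnot(setequal(sample_cols, pheno_tbl[[patient_id_col]]))
  invisible(TRUE)
}

expr_tbl <- data.frame(
  gene_id = c("GENE_A", "GENE_B"),
  P1 = c(5.1, 2.3),
  P2 = c(4.8, 3.3)
)
pheno_tbl <- data.frame(
  patient_id = c("P1", "P2"),
  time_to_event = c(1.5, 3.2),
  event = c(1, 0)
)
ok <- check_consistency(expr_tbl, pheno_tbl, "gene_id", "patient_id")
```

The function stops with an error at the first violated condition, so a silent return signals that the two tables fit together.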

Method `clone()`

The objects of this class are cloneable with this method.

Usage

Data$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
