
A Data object holds the phenotype and expression data belonging to a data set. It specifies the names of the important columns, reads in the data, restricts it to a part of the data (the cohort), and prepares it for a model.

Public fields

name

A telling name for the data set.

directory

Directory where the expression and pheno csv files lie.

pivot_time_cutoff

Time cutoff that divides the samples into high-risk (event before the cutoff) and low-risk (event after the cutoff).

cohort

Regular expression to subset the data to a cohort.

imputer

Function handling NAs in the predictor matrix.

expr_mat

Named numeric matrix. Samples correspond to rows.

pheno_tbl

A tibble with phenotypic features and samples as rows.

expr_file

Name of the expression csv file inside directory.

pheno_file

Name of the pheno data csv inside directory.

cohort_col

The name of the column in the pheno data that holds the cohort a sample belongs to.

patient_id_col

The name of the column in the pheno data that holds unique patient identifiers.

time_to_event_col

The name of the column in the pheno data that holds the time-to-event values.

event_col

The name of the column in the pheno data that holds the event status encoded as 1 = occurrence, 0 = censoring.

gene_id_col

The name of the column in the expression data that holds the gene identifiers.

benchmark_col

The name of the column in the pheno data that holds the benchmark risk score (like the IPI).
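How pivot_time_cutoff, time_to_event_col and event_col interact can be illustrated with a small sketch. It assumes the common convention that an observed event before the cutoff marks a high-risk sample, that a sample still under observation at the cutoff is low-risk, and that a sample censored before the cutoff cannot be classified. The helper label_risk() is hypothetical and not part of the class.

```r
# Hypothetical helper, not part of the documented class.
# Assumed convention: event before the cutoff = high risk.
label_risk <- function(time_to_event, event, pivot_time_cutoff) {
  risk <- rep(NA_character_, length(time_to_event))
  # Event observed before the cutoff: high risk.
  risk[time_to_event < pivot_time_cutoff & event == 1] <- "high"
  # Still event-free when the cutoff is reached: low risk.
  risk[time_to_event >= pivot_time_cutoff] <- "low"
  # Censored before the cutoff: the true risk is unknown, so it stays NA.
  risk
}

risk <- label_risk(
  time_to_event = c(0.5, 3.0, 1.2, 2.0),
  event = c(1, 0, 0, 1),
  pivot_time_cutoff = 2
)
print(risk)
```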

Methods


Method new()

Construct a Data R6 object.

Usage

Data$new(
  name,
  directory,
  pivot_time_cutoff,
  cohort,
  imputer = mean_impute,
  time_to_event_col,
  event_col,
  cohort_col,
  benchmark_col = NULL,
  expr_file = "expr.csv",
  pheno_file = "pheno.csv",
  patient_id_col = "patient_id",
  gene_id_col = "gene_id"
)

Arguments

name

string. A telling name for the data set.

directory

string. The directory where both expression and pheno csv files lie.

pivot_time_cutoff

numeric. Time cutoff that divides the samples into high-risk (event before the cutoff) and low-risk (event after the cutoff).

cohort

string. At the end of preparing the data, subset it to those samples whose value in the cohort_col column matches cohort.

imputer

function or NULL. Function imputing NAs in the predictor matrix. See imputer_prototype() for its interface. Default is mean_impute(). NULL means no imputation.

time_to_event_col

string. The name of the column in the pheno data that holds the time-to-event values.

event_col

string. The name of the column in the pheno data that holds the event status encoded as 1 = occurrence, 0 = censoring.

cohort_col

string. The name of the column in the pheno data that holds the cohort a sample belongs to.

benchmark_col

string or NULL. The name of the column in the pheno data that holds the output of a benchmark model.

expr_file

string. The name of the expression csv file inside directory. Default is "expr.csv". See details for the expected format.

pheno_file

string. The name of the pheno data csv inside directory. Default is "pheno.csv". See details for the expected format.

patient_id_col

string. The name of the column in the pheno data that holds the patient identifiers.

gene_id_col

string. The name of the column in the expression data that holds the gene identifiers.
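To give an idea of what a function passed as imputer might look like, here is a minimal sketch. It assumes that an imputer takes a numeric matrix and returns a matrix of the same shape without NAs; the actual interface is the one documented in imputer_prototype(), which this page does not spell out. The name column_mean_impute() is hypothetical and is not the package's mean_impute().

```r
# Hypothetical imputer. Assumed interface: numeric matrix in, numeric
# matrix of the same shape out. Check imputer_prototype() for the real one.
column_mean_impute <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  # Mean of every column, ignoring missing values.
  col_means <- colMeans(x, na.rm = TRUE)
  # Row and column positions of all NAs.
  na_pos <- which(is.na(x), arr.ind = TRUE)
  # Replace each NA by the mean of its column.
  x[na_pos] <- col_means[na_pos[, "col"]]
  x
}

x <- matrix(
  c(1, NA, 3, 4, 5, NA),
  nrow = 3,
  dimnames = list(c("s1", "s2", "s3"), c("g1", "g2"))
)
x_imputed <- column_mean_impute(x)
print(x_imputed)
```

A column that consists of NAs only has no mean, so it would come out as NaN; a production imputer would need to handle that case.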

Details

The pheno csv file holds the samples as rows and the variables as columns. Its first column is a character column whose name is given by patient_id_col and which holds unique sample ids.

The expression csv file holds the genes as rows and the samples as columns. Its first column is a character column whose name is given by gene_id_col and which holds unique gene ids.
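The two file layouts described above can be illustrated with a toy data set written in base R. All column names other than the defaults patient_id and gene_id (pfs_years, progression, split) and all values are made up for illustration. The last lines show one way to turn the expression file into a samples-by-genes matrix like expr_mat; the class itself may read the files differently.

```r
# Toy files in a temporary directory; names and values are made up.
toy_dir <- file.path(tempdir(), "toy_data")
dir.create(toy_dir, showWarnings = FALSE)

# Pheno data: samples as rows, first column holds the patient ids.
pheno <- data.frame(
  patient_id = c("p1", "p2", "p3"),
  pfs_years = c(0.8, 3.1, 2.4),
  progression = c(1, 0, 0),
  split = c("train", "train", "test"),
  stringsAsFactors = FALSE
)

# Expression data: genes as rows, first column holds the gene ids,
# every other column is one sample.
expr <- data.frame(
  gene_id = c("geneA", "geneB"),
  p1 = c(5.1, 2.3),
  p2 = c(4.8, 2.9),
  p3 = c(6.0, 1.7),
  stringsAsFactors = FALSE
)

write.csv(pheno, file.path(toy_dir, "pheno.csv"), row.names = FALSE)
write.csv(expr, file.path(toy_dir, "expr.csv"), row.names = FALSE)

# Read the expression file back and transpose it so that samples
# correspond to rows and genes to columns.
expr_in <- read.csv(file.path(toy_dir, "expr.csv"), check.names = FALSE)
expr_mat <- t(as.matrix(expr_in[, -1]))
colnames(expr_mat) <- expr_in$gene_id
print(expr_mat)
```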

Returns

A Data object.


Method read()

Read expression data into the expr_mat attribute and pheno data into the pheno_tbl attribute.

Usage

Data$read()


Method prepare()

Prepare the already read-in data for a model.

Usage

Data$prepare(model, quiet = FALSE)

Arguments

model

A Model object.

quiet

logical. If TRUE, suppress messages.

Details

You need to set cohort before calling this method.
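Since cohort is a regular expression that is matched against the values in cohort_col, the subsetting step can be pictured as follows. This is a base R sketch with made-up data; whether the class matches with grepl() or in some other way is an assumption.

```r
# Made-up pheno data; "split" plays the role of cohort_col.
pheno_tbl <- data.frame(
  patient_id = c("p1", "p2", "p3", "p4"),
  split = c("train", "test", "train", "val_1"),
  stringsAsFactors = FALSE
)

# A regular expression selecting the test sample and all validation samples.
cohort <- "^(test|val_)"

# Keep the samples whose cohort label matches the pattern.
in_cohort <- grepl(cohort, pheno_tbl[["split"]])
cohort_tbl <- pheno_tbl[in_cohort, ]
print(cohort_tbl)
```

Anchoring the pattern with ^ and $ avoids accidental partial matches, for example "train" also matching a label such as "pretrain".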


Method survival_quantiles()

Calculate the quantiles of the survival times.

Usage

Data$survival_quantiles(round_digits = 3)

Arguments

round_digits

integer. Round the numbers in the returned tibble to this number of decimal places.

Details

We take censoring into account.

Returns

A tibble with two columns: the quantile q in the first column and the corresponding time-to-event value in the second column.
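One standard way to compute survival-time quantiles under censoring is the Kaplan-Meier estimator. The sketch below implements it in base R; it is an assumption that the method works along these lines. The function name km_quantiles(), its probs argument and the column names of the result are hypothetical, and the sketch returns a data frame rather than a tibble to avoid extra dependencies.

```r
# Hypothetical helper: quantiles of the survival time from a
# Kaplan-Meier estimate. event: 1 = occurrence, 0 = censoring.
km_quantiles <- function(time, event, probs = c(0.25, 0.5, 0.75),
                         round_digits = 3) {
  event_times <- sort(unique(time[event == 1]))
  # Number of samples still at risk just before each event time.
  n_at_risk <- vapply(event_times, function(t) sum(time >= t), numeric(1))
  # Number of events at each event time.
  n_events <- vapply(
    event_times, function(t) sum(time == t & event == 1), numeric(1)
  )
  # Kaplan-Meier estimate of the survival function at the event times.
  surv <- cumprod(1 - n_events / n_at_risk)
  # The q-quantile is the first event time at which the estimated
  # survival probability has dropped to 1 - q or below.
  q_time <- vapply(probs, function(p) {
    idx <- which(surv <= 1 - p)
    if (length(idx) == 0) NA_real_ else event_times[idx[1]]
  }, numeric(1))
  data.frame(quantile = probs, time_to_event = round(q_time, round_digits))
}

quantile_tbl <- km_quantiles(
  time = c(1, 2, 3, 4, 5),
  event = c(1, 0, 1, 1, 0)
)
print(quantile_tbl)
```

A quantile is NA when the estimated survival curve never drops far enough, which happens when many late samples are censored.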


Method split()

Split the data into a train and a test cohort.

Usage

Data$split(train_prop, save = TRUE, keep_risk = TRUE, quiet = FALSE)

Arguments

train_prop

numeric. Proportion of the data to put in the train cohort.

save

logical. If TRUE, save the named cohort vector to a file.

keep_risk

logical. If TRUE, keep the ratio of high-risk versus low-risk samples in train and test cohort the same as in the complete data set.

quiet

logical. If TRUE, suppress messages.

Details

Cohort affiliation will show up in the column cohort_col in pheno_tbl.
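The effect of keep_risk can be sketched as a stratified random split: draw the train samples separately within the high-risk and the low-risk group so that both cohorts keep the overall ratio. The helper split_cohort() and its seed argument are hypothetical; the method's actual sampling scheme may differ.

```r
# Hypothetical helper: assign every sample to "train" or "test".
split_cohort <- function(risk, train_prop, keep_risk = TRUE, seed = 1) {
  set.seed(seed)
  cohort <- rep("test", length(risk))
  # With keep_risk, sample within each risk group; otherwise sample
  # from all samples at once. Samples with NA risk stay in "test"
  # when keep_risk is TRUE because split() drops them.
  groups <- if (keep_risk) {
    split(seq_along(risk), risk)
  } else {
    list(seq_along(risk))
  }
  for (idx in groups) {
    n_train <- round(train_prop * length(idx))
    train_idx <- idx[sample.int(length(idx), n_train)]
    cohort[train_idx] <- "train"
  }
  cohort
}

risk <- rep(c("high", "low"), times = c(10, 30))
cohort <- split_cohort(risk, train_prop = 0.8)
print(table(risk, cohort))
```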


Method qc_preprocess()

Quality control at the end of preprocessing.

Usage

Data$qc_preprocess(expr_tbl)

Arguments

expr_tbl

A tibble with the expression data. The first column, named gene_id_col, holds the gene identifiers and the other columns hold the samples.

Details

Check whether the expression and pheno tibbles are consistent with the other attributes of the Data object. You typically call this method at the end of preprocessing; the read() method also calls it.
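The kind of consistency check meant here can be sketched in base R. The concrete checks below (first column name, unique ids, matching sample sets) are assumptions about what such a quality control plausibly covers, and check_consistency() is a hypothetical helper, not the method's implementation.

```r
# Hypothetical helper: stop with an error if the expression and pheno
# tables do not fit together.
check_consistency <- function(expr_tbl, pheno_tbl, gene_id_col,
                              patient_id_col) {
  stopifnot(
    # The first expression column holds the gene ids.
    names(expr_tbl)[1] == gene_id_col,
    # The pheno data has the patient id column.
    patient_id_col %in% names(pheno_tbl),
    # Gene and patient ids are unique.
    !anyDuplicated(expr_tbl[[gene_id_col]]),
    !anyDuplicated(pheno_tbl[[patient_id_col]]),
    # Both tables describe exactly the same samples.
    setequal(names(expr_tbl)[-1], pheno_tbl[[patient_id_col]])
  )
  invisible(TRUE)
}

expr_tbl <- data.frame(
  gene_id = c("geneA", "geneB"),
  p1 = c(5.1, 2.3),
  p2 = c(4.8, 2.9),
  stringsAsFactors = FALSE
)
pheno_tbl <- data.frame(
  patient_id = c("p1", "p2"),
  progression = c(1, 0),
  stringsAsFactors = FALSE
)
ok <- check_consistency(expr_tbl, pheno_tbl, "gene_id", "patient_id")
print(ok)
```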


Method clone()

The objects of this class are cloneable with this method.

Usage

Data$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.