
An R6 class for the data
A Data object holds the phenotype and expression data belonging to
a data set. It specifies the names of the important columns, reads in the data, restricts
attention to a part of the data (the cohort), and prepares the data for a model.
Public fields
name: A telling name for the data set.
directory: Directory where the expression and pheno csv files lie.
pivot_time_cutoff: Time cutoff that divides the samples into high-risk (event before the cutoff) and low-risk (event after the cutoff).
cohort: Regular expression to subset the data to a cohort.
imputer: Function handling NAs in the predictor matrix.
expr_mat: Named numeric matrix. Samples correspond to rows.
pheno_tbl: A tibble with phenotypic features and samples as rows.
expr_file: Name of the expression csv file inside directory.
pheno_file: Name of the pheno data csv inside directory.
cohort_col: Find the cohort of a sample in this column of the pheno data.
patient_id_col: The name of the column in the pheno data that holds unique patient identifiers.
time_to_event_col: The name of the column in the pheno data that holds the time-to-event values.
event_col: The name of the column in the pheno data that holds the event status encoded as 1 = occurrence, 0 = censoring.
gene_id_col: The name of the column in the expression data that holds the gene identifiers.
benchmark_col: The name of the column in the pheno data that holds the benchmark risk score (like the IPI).
Methods
Method new()
Construct a Data R6 object.
Usage
Data$new(
  name,
  directory,
  pivot_time_cutoff,
  cohort,
  imputer = mean_impute,
  time_to_event_col,
  event_col,
  cohort_col,
  benchmark_col = NULL,
  expr_file = "expr.csv",
  pheno_file = "pheno.csv",
  patient_id_col = "patient_id",
  gene_id_col = "gene_id"
)
Arguments
name: string. A telling name for the data set.
directory: string. The directory where both expression and pheno csv files lie.
pivot_time_cutoff: numeric. Time cutoff that divides the samples into high-risk (event before the cutoff) and low-risk (event after the cutoff).
cohort: string. At the end of preparing the data, subset it to those samples whose value in the cohort_col column matches cohort.
imputer: function or NULL. Function imputing NAs in the predictor matrix. See imputer_prototype() for its interface. Default is mean_impute(). NULL means no imputation.
time_to_event_col: string. The name of the column in the pheno data that holds the time-to-event values.
event_col: string. The name of the column in the pheno data that holds the event status encoded as 1 = occurrence, 0 = censoring.
cohort_col: string. The name of the column in the pheno data that holds the cohort a sample belongs to.
benchmark_col: string or NULL. The name of the column in the pheno data that holds the output of a benchmark model.
expr_file: string. The name of the expression csv file inside directory. Default is "expr.csv". See details for the expected format.
pheno_file: string. The name of the pheno data csv inside directory. Default is "pheno.csv". See details for the expected format.
patient_id_col: string. The name of the column in the pheno data that holds the patient identifiers.
gene_id_col: string. The name of the column in the expression data that holds the gene identifiers.
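To make the role of the imputer argument more concrete, here is a minimal sketch of a column-wise mean imputation in base R. It is not the package's mean_impute(), and the interface shown (a numeric matrix in, a numeric matrix out) is an assumption; imputer_prototype() remains the reference for the real interface. The name mean_impute_sketch is hypothetical.

```r
# Hypothetical stand-in for an imputer: replace every NA by the mean of
# the non-missing values in its column (assumed interface: matrix -> matrix).
mean_impute_sketch <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  col_means <- colMeans(x, na.rm = TRUE)
  # Row/column positions of all missing entries
  na_pos <- which(is.na(x), arr.ind = TRUE)
  x[na_pos] <- col_means[na_pos[, "col"]]
  x
}

# Toy predictor matrix: samples as rows, features as columns
x <- matrix(
  c(1, NA, 3, 4, 5, NA),
  nrow = 3,
  dimnames = list(paste0("s", 1:3), c("g1", "g2"))
)
imputed <- mean_impute_sketch(x)
```

Passing NULL instead of a function would, according to the argument description, skip this step entirely.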
Details
The pheno csv file holds the samples as rows and the variables as columns; its first
(character) column, named as given by patient_id_col, holds the unique sample ids.
The expression csv file holds the genes as rows and the samples as columns; its first
(character) column, named as given by gene_id_col, holds the unique gene ids.
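The two layouts described above can be illustrated with a pair of toy files. This is only a sketch: the column names cohort, pfs_years and progression, the sample and gene ids, and the directory name toy_data are all made up for illustration; only patient_id, gene_id, expr.csv and pheno.csv correspond to the defaults listed under Usage.

```r
# Pheno data: one row per sample, identifier in the first column
pheno <- data.frame(
  patient_id  = c("p1", "p2", "p3"),
  cohort      = c("train", "train", "test"),  # hypothetical cohort column
  pfs_years   = c(1.2, 3.5, 0.8),             # hypothetical time-to-event column
  progression = c(1, 0, 1)                    # 1 = occurrence, 0 = censoring
)

# Expression data: one row per gene, one column per sample
expr <- data.frame(
  gene_id = c("geneA", "geneB"),
  p1 = c(5.1, 2.3),
  p2 = c(4.8, 2.9),
  p3 = c(6.0, 1.7)
)

# Write both files into a temporary directory
toy_dir <- file.path(tempdir(), "toy_data")
dir.create(toy_dir, showWarnings = FALSE)
write.csv(pheno, file.path(toy_dir, "pheno.csv"), row.names = FALSE)
write.csv(expr, file.path(toy_dir, "expr.csv"), row.names = FALSE)
```

Note that the sample ids in the first pheno column reappear as the column names of the expression file.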
Method read()
Read expression data into the expr_mat attribute and pheno data into the
pheno_tbl attribute.
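Since the expression csv stores genes as rows while expr_mat stores samples as rows, reading presumably involves a transposition. The following base-R sketch shows one way to get from the file layout to such a matrix; it is an assumption about what read() has to achieve, not the method's actual implementation, and the toy values are invented.

```r
# Toy expression csv in the documented layout: genes as rows, samples as columns
expr_csv <- "gene_id,p1,p2,p3
geneA,5.1,4.8,6.0
geneB,2.3,2.9,1.7"

expr_df <- read.csv(text = expr_csv, check.names = FALSE)

# Drop the identifier column, convert to a numeric matrix, and transpose
# so that samples become rows and genes become columns
expr_mat <- t(as.matrix(expr_df[, -1]))
colnames(expr_mat) <- expr_df[["gene_id"]]
```

After this step the row names of expr_mat are the sample ids and the column names are the gene ids.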
Method survival_quantiles()
Calculate the quantiles of the survival times.
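How the method computes these quantiles is not stated here. One common way to obtain quantiles of censored survival times is via a Kaplan-Meier fit, sketched below with the survival package. Treat this as an illustration under that assumption; the toy times and event indicators are invented.

```r
library(survival)

# Toy data: time to event and event status (1 = occurrence, 0 = censoring)
time  <- c(5, 8, 12, 20, 24, 30, 36, 40)
event <- c(1, 1, 0, 1, 1, 0, 1, 0)

# Kaplan-Meier estimate of the survival function for the whole sample
km_fit <- survfit(Surv(time, event) ~ 1)

# Quantiles of the survival time distribution, with confidence limits
km_quantiles <- quantile(km_fit, probs = c(0.25, 0.5, 0.75))
km_quantiles$quantile
```

Unlike plain quantile() on the observed times, this approach accounts for censored samples.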
Method split()
Split the data into a train and a test cohort.
Arguments
train_prop: numeric. Proportion of the data to put in the train cohort.
save: logical. If TRUE, save the named cohort vector to a file.
keep_risk: logical. If TRUE, keep the ratio of high-risk versus low-risk samples in train and test cohort the same as in the complete data set.
quiet: logical. If TRUE, suppress messages.
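The effect of keep_risk = TRUE can be pictured as a stratified split: samples are drawn separately within each risk group, so both cohorts keep the overall ratio. The base-R sketch below illustrates the idea only; the function name stratified_split, the toy risk labels and the rounding rule are assumptions, not the package's code.

```r
set.seed(1)

# Hypothetical risk labels per sample (4 high-risk, 8 low-risk)
risk <- c(rep("high", 4), rep("low", 8))
names(risk) <- paste0("sample_", seq_along(risk))

stratified_split <- function(risk, train_prop) {
  # Start with everything in the test cohort, then move samples to train
  cohort <- setNames(rep("test", length(risk)), names(risk))
  for (level in unique(risk)) {
    idx <- which(risk == level)
    n_train <- round(train_prop * length(idx))
    # Index into idx instead of calling sample(idx, ...), because
    # sample() interprets a single number n as the range 1:n
    train_idx <- idx[sample.int(length(idx), n_train)]
    cohort[train_idx] <- "train"
  }
  cohort
}

# Named cohort vector: one "train"/"test" entry per sample
cohort <- stratified_split(risk, train_prop = 0.75)
table(cohort, risk)
```

The result is a named vector with one entry per sample, which matches the description of what save = TRUE would write to a file.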
Method qc_preprocess()
Quality control at the end of preprocessing.