An R6 class for the data
A Data object holds the phenotype and expression data belonging to a data set. It specifies the names of the important columns, reads in the data, restricts the data to a part of it (the cohort), and prepares the data for a model.
Public fields
name
A descriptive name for the data set.
directory
Directory where the expression and pheno csv files lie.
pivot_time_cutoff
Time cutoff that divides the samples into high-risk (event before the cutoff) and low-risk (event after the cutoff).
cohort
Regular expression to subset the data to a cohort.
imputer
Function handling NAs in the predictor matrix.
expr_mat
Named numeric matrix. Samples correspond to rows.
pheno_tbl
A tibble with phenotypic features and samples as rows.
expr_file
Name of the expression csv file inside directory.
pheno_file
Name of the pheno data csv inside directory.
cohort_col
Find the cohort of a sample in this column of the pheno data.
patient_id_col
The name of the column in the pheno data that holds unique patient identifiers.
time_to_event_col
The name of the column in the pheno data that holds the time-to-event values.
event_col
The name of the column in the pheno data that holds the event status encoded as 1 = occurrence, 0 = censoring.
gene_id_col
The name of the column in the expression data that holds the gene identifiers.
benchmark_col
The name of the column in the pheno data that holds the benchmark risk score (like the IPI).
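To make the interplay of pivot_time_cutoff, time_to_event_col and event_col concrete, here is a minimal base R sketch that labels samples by risk. It is not the class's implementation. It assumes that high risk means an event observed before the cutoff; the toy table, its column names, the cutoff of 2 and the choice to mark samples censored before the cutoff as NA (their status at the cutoff is unknown) are all assumptions made for illustration.

```r
# Toy pheno table; all column names and values are hypothetical.
pheno <- data.frame(
  patient_id = c("p1", "p2", "p3", "p4"),
  pfs_years = c(0.8, 3.1, 1.2, 2.5),
  progression = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)
pivot_time_cutoff <- 2

time <- pheno[["pfs_years"]]
event <- pheno[["progression"]]

# High risk: event observed before the cutoff.
# Low risk: still event-free at the cutoff (event or censoring at or after it).
# Censored before the cutoff: status at the cutoff is unknown, hence NA.
risk <- ifelse(
  time < pivot_time_cutoff & event == 1, "high",
  ifelse(time >= pivot_time_cutoff, "low", NA_character_)
)
names(risk) <- pheno[["patient_id"]]
print(risk)
```

Sample p3 is censored before the cutoff and therefore gets no risk label in this sketch.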
Methods
Method new()
Construct a Data R6 object.
Usage
Data$new(
name,
directory,
pivot_time_cutoff,
cohort,
imputer = mean_impute,
time_to_event_col,
event_col,
cohort_col,
benchmark_col = NULL,
expr_file = "expr.csv",
pheno_file = "pheno.csv",
patient_id_col = "patient_id",
gene_id_col = "gene_id"
)
Arguments
name
string. A descriptive name for the data set.
directory
string. The directory where both expression and pheno csv files lie.
pivot_time_cutoff
numeric. Time cutoff that divides the samples into high-risk (event before the cutoff) and low-risk (event after the cutoff).
cohort
string. At the end of preparing the data, subset it to those samples whose value in the cohort_col column matches cohort.
imputer
function or NULL. Function imputing NAs in the predictor matrix. See imputer_prototype() for its interface. Default is mean_impute(). NULL means no imputation.
time_to_event_col
string. The name of the column in the pheno data that holds the time-to-event values.
event_col
string. The name of the column in the pheno data that holds the event status encoded as 1 = occurrence, 0 = censoring.
cohort_col
string. The name of the column in the pheno data that holds the cohort a sample belongs to.
benchmark_col
string or NULL. The name of the column in the pheno data that holds the output of a benchmark model.
expr_file
string. The name of the expression csv file inside directory. Default is "expr.csv". See details for the expected format.
pheno_file
string. The name of the pheno data csv inside directory. Default is "pheno.csv". See details for the expected format.
patient_id_col
string. The name of the column in the pheno data that holds the patient identifiers.
gene_id_col
string. The name of the column in the expression data that holds the gene identifiers.
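As an illustration of what an imputer for the predictor matrix might look like, the following base R sketch replaces every NA by the mean of its column. The function name mean_impute_sketch and its single-argument signature are hypothetical; the interface the class actually expects is the one described in imputer_prototype(), which may differ.

```r
# Hypothetical imputer: replaces NAs in each column (feature) by that column's mean.
# The signature is an assumption; see imputer_prototype() for the real interface.
mean_impute_sketch <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  col_means <- colMeans(x, na.rm = TRUE)
  # Row and column positions of all missing entries.
  na_idx <- which(is.na(x), arr.ind = TRUE)
  x[na_idx] <- col_means[na_idx[, "col"]]
  x
}

# Toy predictor matrix with samples as rows and genes as columns.
x <- matrix(
  c(1, NA, 3,
    4, 6, NA),
  nrow = 3,
  dimnames = list(c("s1", "s2", "s3"), c("gene_a", "gene_b"))
)
imputed <- mean_impute_sketch(x)
print(imputed)
```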
Details
The pheno csv file holds the samples as rows (with unique sample ids in the first (character) column called patient_id_col), the variables as columns.
The expression csv file holds the genes as rows (with unique gene ids in the first (character) column called gene_id_col), the samples as columns.
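The two file layouts can be sketched by writing a toy data set to a temporary directory. Everything about the toy data (values, the column names pfs_years, progression and split_1, the directory name) is hypothetical. The constructor call at the end follows the documented usage and only runs if the package providing the Data class is attached; whether the constructor performs further checks on the files is not covered here.

```r
# Hypothetical toy data set written to a temporary directory.
directory <- file.path(tempdir(), "toy_data")
dir.create(directory, showWarnings = FALSE, recursive = TRUE)

# Pheno csv: samples as rows, patient ids in the first column.
pheno <- data.frame(
  patient_id = c("p1", "p2", "p3"),
  pfs_years = c(0.8, 3.1, 1.2),
  progression = c(1, 0, 1),
  split_1 = c("train", "train", "test"),
  stringsAsFactors = FALSE
)
write.csv(pheno, file.path(directory, "pheno.csv"), row.names = FALSE)

# Expression csv: genes as rows, gene ids in the first column, samples as columns.
expr <- data.frame(
  gene_id = c("gene_a", "gene_b"),
  p1 = c(5.1, 2.3),
  p2 = c(4.8, NA),
  p3 = c(6.0, 1.9),
  stringsAsFactors = FALSE
)
write.csv(expr, file.path(directory, "expr.csv"), row.names = FALSE)

# Constructor call following the documented usage; skipped unless the Data
# class generator is available in the current session.
if (exists("Data") && inherits(Data, "R6ClassGenerator")) {
  toy_data <- Data$new(
    name = "toy data",
    directory = directory,
    pivot_time_cutoff = 2,
    cohort = "train",
    time_to_event_col = "pfs_years",
    event_col = "progression",
    cohort_col = "split_1"
  )
}
```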
Method read()
Read expression data into the expr_mat attribute and pheno data into the pheno_tbl attribute.
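What such a read step has to accomplish can be sketched in base R: the expression csv stores genes as rows, whereas expr_mat holds samples as rows, so the table needs to be transposed. This is only an illustration under assumptions, not the method's actual code: the csv content is made up, and a plain data frame stands in for the tibble.

```r
# Toy csv content in the documented layouts (hypothetical values).
expr_csv <- "gene_id,p1,p2,p3
gene_a,5.1,4.8,6.0
gene_b,2.3,NA,1.9"
pheno_csv <- "patient_id,pfs_years,progression
p1,0.8,1
p2,3.1,0
p3,1.2,1"

expr_df <- read.csv(text = expr_csv, stringsAsFactors = FALSE)
pheno_tbl <- read.csv(text = pheno_csv, stringsAsFactors = FALSE)

# Drop the gene id column, then transpose so that samples become rows.
expr_mat <- t(as.matrix(expr_df[, -1]))
colnames(expr_mat) <- expr_df[["gene_id"]]

# Align the rows of the matrix with the order of the pheno table.
expr_mat <- expr_mat[pheno_tbl[["patient_id"]], , drop = FALSE]
print(dim(expr_mat))
```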
Method survival_quantiles()
Calculate the quantiles of the survival times.
Method split()
Split the data into a train and a test cohort.
Arguments
train_prop
numeric. Proportion of the data to put in the train cohort.
save
logical. If TRUE, save the named cohort vector to a file.
keep_risk
logical. If TRUE, keep the ratio of high-risk versus low-risk samples in train and test cohort the same as in the complete data set.
quiet
logical. If TRUE, suppress messages.
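The idea behind keep_risk = TRUE can be sketched as a stratified split: the train samples are drawn separately within the high-risk and the low-risk group. The following base R code is a sketch under assumptions, not the method's implementation; the risk labels, sample names and seed are hypothetical.

```r
set.seed(1) # for a reproducible toy split

# Hypothetical risk labels for ten samples: 4 high-risk, 6 low-risk.
risk <- c(rep("high", 4), rep("low", 6))
names(risk) <- paste0("p", seq_along(risk))
train_prop <- 0.5

# Draw the train samples separately within each risk group so that both
# cohorts keep roughly the same high-to-low ratio as the full data.
train_ids <- unlist(lapply(split(names(risk), risk), function(ids) {
  sample(ids, size = round(train_prop * length(ids)))
}))

# Named cohort vector: one "train" or "test" label per sample.
cohort <- ifelse(names(risk) %in% train_ids, "train", "test")
names(cohort) <- names(risk)
print(table(cohort, risk))
```

With these toy numbers, the train cohort receives 2 of the 4 high-risk and 3 of the 6 low-risk samples, so both cohorts keep the 4:6 ratio of the full data.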
Method qc_preprocess()
Quality control at the end of preprocessing.