# Data Preparation

#### 1. Functionalities

• To upload a data file, preview the data set, and check that the data were read in correctly
• To pre-process some variables (when necessary) before building the model
• To obtain basic descriptive statistics and plots of the variables
• To prepare the survival object that serves as the dependent variable in the model

• Your data need to include one survival time variable, one 1/0 censoring variable, and at least one independent variable (denoted as X)
• Your data need to have more rows than columns
• Do not mix character and numbers in the same column
• The data used to build a model is called a training set
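The requirements above can be checked before any model is built. The following is a minimal sketch, not the tool's own validation code: the function name `check_training_data`, the list-of-dicts layout, and the column names are assumptions made for illustration.

```python
def check_training_data(rows, time_col, status_col):
    """Return a list of problems found in a data set given as a list of dicts."""
    if not rows:
        return ["data set is empty"]
    problems = []
    columns = list(rows[0].keys())
    # The data need more rows than columns.
    if len(rows) <= len(columns):
        problems.append("need more rows than columns")
    # At least one independent variable besides time and censoring status.
    if not [c for c in columns if c not in (time_col, status_col)]:
        problems.append("need at least one independent variable")
    # The censoring variable must be coded 1/0.
    if any(r[status_col] not in (0, 1) for r in rows):
        problems.append("censoring variable must be coded 1/0")
    # A column must not mix characters and numbers.
    for c in columns:
        if len({isinstance(r[c], str) for r in rows}) > 1:
            problems.append(f"column '{c}' mixes characters and numbers")
    return problems


# Hypothetical example rows (made-up values, not from the Diabetes data).
example = [
    {"time": 5.0, "status": 1, "age": 30},
    {"time": 8.0, "status": 0, "age": 41},
    {"time": 2.5, "status": 1, "age": 55},
    {"time": 9.0, "status": 0, "age": 62},
]
print(check_training_data(example, "time", "status"))
```

An empty list means that none of the listed problems were detected.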

#### Case Example 1: Right-censored diabetes data

Suppose that in a study we obtained observations from a trial of laser coagulation for the treatment of diabetic retinopathy. Each patient had one eye randomized to laser treatment, and the other eye received no treatment. For each eye, the event of interest was the time from initiation of treatment to the time when visual acuity dropped below 5/200 at two visits in a row. Thus there is a built-in lag time of approximately 6 months (visits were every 3 months). Survival times in this dataset are therefore the actual time to blindness in months, minus the minimum possible time to event (6.5 months). Censoring status: 0 = censored; 1 = visual loss. Treatment: 0 = no treatment; 1 = laser. Age is the age at diagnosis.

#### Case Example 2: Left-truncated right-censored Nki70 data

Suppose we wanted to explore metastasis-free survival in 100 lymph node positive breast cancer patients. Some patients enrolled in the study later than others. The data contain 5 clinical risk factors: (1) Diam: diameter of the tumor; (2) N: number of affected lymph nodes; (3) ER: estrogen receptor status; (4) Grade: grade of the tumor; and (5) Age: patient age at diagnosis (years); and gene expression measurements of 70 genes found to be prognostic for metastasis-free survival in an earlier study. The time variable is metastasis-free follow-up time (months). Censoring indicator variable: 1 = metastasis or death; 0 = censored.

We wanted to explore the association between survival time and the independent variables.

#### Training Set Preparation

Uploaded data will replace the example data

2. Show 1st row as column names?

3. Use 1st column as row names? (No duplicates)

Choosing the correct separator and quote character ensures that the data are read in successfully

Find some example data here
Diabetes data has only one time duration variable, while Nki70 data has start.time and end.time.

#### Step 2. Create a Survival Object

Right-censored time needs only 1 time duration / follow-up variable

Left-truncated right-censored time needs start time and end time variables

Diabetes data has right-censored time, while Nki70 data has left-truncated right-censored time.

#### Step 3. Check the Survival Object

Valid survival object example: Surv (time, status)

or, Surv (start.time, end.time, status)
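To make the two forms concrete, here is a small sketch of a survival object as plain records. It is an illustration only and not how the tool stores the object; `SurvRecord` and `make_surv` are hypothetical names, and the check that the start time precedes the end time is an assumption about what a valid left-truncated record looks like.

```python
from typing import NamedTuple


class SurvRecord(NamedTuple):
    start: float  # entry time; 0.0 when only a follow-up time is given
    end: float    # event or censoring time
    status: int   # 1 = event, 0 = censored


def make_surv(end, status, start=None):
    """Build records for (time, status) or (start.time, end.time, status)."""
    left_truncated = start is not None
    if not left_truncated:
        start = [0.0] * len(end)
    if not (len(start) == len(end) == len(status)):
        raise ValueError("all inputs must have the same length")
    records = []
    for s, e, d in zip(start, end, status):
        if d not in (0, 1):
            raise ValueError("status must be coded 1/0")
        if left_truncated and not s < e:
            raise ValueError("start time must be earlier than end time")
        records.append(SurvRecord(float(s), float(e), int(d)))
    return records


# Right-censored form: one follow-up time per subject.
print(make_surv([5, 8], [1, 0]))
# Left-truncated right-censored form: start and end time per subject.
print(make_surv([10, 12], [1, 0], start=[2, 3]))
```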

#### Change the reference level for a categorical variable?

2. Input the reference level, one line per variable

#### Output 1. Data Information

Data Preview

1. Numeric variable information list

2. Categorical variable information list

#### Output 2. Descriptive Results

1. For numeric variable

2. For categorical variable

Histogram: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

Density plot: to show the distribution of a variable

Histogram

When the number of bins is 0, the plot uses the default number of bins

Density plot

# Non-Parametric Kaplan-Meier Estimator and Log-rank Test

Kaplan–Meier estimator, also known as the product-limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data.
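The product-limit idea can be sketched in a few lines for right-censored data: at each distinct event time the current survival estimate is multiplied by (1 - d/n), where d is the number of events at that time and n is the number of subjects still at risk. The function name `kaplan_meier` and the example values are assumptions for illustration; the sketch does not handle left truncation.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (right-censored data)."""
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        # d: events at time t; n: subjects with follow-up of at least t.
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        n = sum(1 for ti in times if ti >= t)
        if d > 0:
            surv *= 1 - d / n
            curve.append((t, surv))
    return curve


# Made-up example: the subject with time 2 is censored.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Censored subjects never cause a step in the curve, but they leave the risk set, which is why the denominator shrinks between event times.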

The log-rank test is a hypothesis test to compare the survival distributions of two samples. It compares estimates of the hazard functions of the two groups at each observed event time.

#### 1. Functionalities

• To obtain Kaplan-Meier survival probability estimates
• To obtain Kaplan-Meier survival curves, cumulative event curves, and cumulative hazard curves by a group variable
• To conduct a log-rank test to compare the survival curves of 2 groups
• To conduct pairwise log-rank tests to compare the survival curves of more than two groups

• Prepare the survival object in the Data tab
• A categorical variable is required in this model

#### Choose group variable to build the model

Prepare the data in the Data tab

1. Check survival object, Surv(time, event), in the Data Tab

In the example of the Diabetes data, we chose 'laser' as the categorical group variable, in order to explore whether the survival curves of the two laser groups differ.

#### Log-rank Test

Null hypothesis

Two groups have identical hazard functions

See method explanations in Output 2. Log-rank Test tab.
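As a rough illustration of how the two-group test works, the sketch below compares the observed number of events in one group with the number expected under the null hypothesis at each event time, and refers the resulting statistic to a chi-square distribution with 1 degree of freedom. It is an unweighted (rho = 0) sketch for right-censored data; the function name `logrank_test` is an assumption, and the tool's own implementation may differ in details.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test. Returns (chi-square statistic, p-value)."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("exactly two groups are required")
    g1 = labels[0]
    observed = expected = variance = 0.0
    for t in sorted({ti for ti, ei in zip(times, events) if ei == 1}):
        n = sum(1 for ti in times if ti >= t)                      # at risk
        n1 = sum(1 for ti, gi in zip(times, groups) if ti >= t and gi == g1)
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        d1 = sum(1 for ti, ei, gi in zip(times, events, groups)
                 if ti == t and ei == 1 and gi == g1)
        observed += d1
        expected += d * n1 / n
        if n > 1:  # hypergeometric variance of d1 at this event time
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("variance is zero; the test is not defined")
    chi2 = (observed - expected) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up example in which group "a" fails much earlier than group "b".
print(logrank_test([1, 2, 3, 10, 11, 12], [1] * 6, ["a"] * 3 + ["b"] * 3))
```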

#### Pairwise Log-rank Test

Null hypothesis

Two groups have identical hazard functions

See method explanations in Output 2. Pairwise Log-rank Test tab.

#### Output 1. Data Preview

Check full data in Data tab

#### Output 2. Estimate and Test Results

Kaplan-Meier survival probability by group

Explanations

This implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^rho, where S is the Kaplan-Meier estimate of survival.

• rho = 0: log-rank or Mantel-Haenszel test
• rho = 1: Peto & Peto modification of the Gehan-Wilcoxon test.
• p < 0.05 indicates the curves are significantly different in the survival probabilities
• p >= 0.05 indicates the curves are NOT significantly different in the survival probabilities
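As a sketch of what the weighting means, the two-group G-rho statistic can be written as below. The notation is introduced here for illustration and is not taken from the tool: at the j-th distinct event time, d_j deaths occur among n_j subjects at risk, of which d_1j deaths and n_1j subjects at risk belong to group 1.

$$
U_\rho = \sum_j \hat{S}(t_j)^{\rho} \left( d_{1j} - \frac{n_{1j}\, d_j}{n_j} \right)
$$

With rho = 0 every event time gets weight 1, so the sum is the observed minus expected number of events of the ordinary log-rank test. With rho = 1 early event times, where the survival estimate is close to 1, receive more weight than late ones. Software implementations may differ in details such as whether the survival estimate just before or at the event time is used.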

Log-rank Test Result

In this example, we did not find a statistically significant difference between the 2 laser groups (p=0.8). The Kaplan-Meier plot also shows that the survival curves of the 2 laser groups cross each other.

Explanations

This implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^rho, where S is the Kaplan-Meier estimate of survival.

• rho = 0: log-rank or Mantel-Haenszel test
• rho = 1: Peto & Peto modification of the Gehan-Wilcoxon test.
• Bonferroni correction is a generic but very conservative approach
• Bonferroni-Holm is less conservative and uniformly more powerful than Bonferroni
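• False Discovery Rate-BH, developed by Benjamini and Hochberg, controls the false discovery rate and is more powerful than the Bonferroni-type corrections
• False Discovery Rate-BY, developed by Benjamini and Yekutieli, also controls the false discovery rate; it is more conservative than BH but remains valid under arbitrary dependence between the tests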
• p < 0.05 indicates the curves are significantly different in the survival probabilities
• p >= 0.05 indicates the curves are NOT significantly different in the survival probabilities
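The four corrections listed above can be sketched as follows. This is an illustrative implementation of the standard adjustment formulas, not the tool's own code; the function name `adjust_p_values` and the method labels are assumptions.

```python
def adjust_p_values(p_values, method="bonferroni"):
    """Adjust p-values for multiple comparisons; output keeps the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * m
    if method == "bonferroni":
        return [min(1.0, p * m) for p in p_values]
    if method == "holm":
        running = 0.0
        for rank, i in enumerate(order):
            # Step-down: multiply by the number of hypotheses still in play.
            running = max(running, (m - rank) * p_values[i])
            adjusted[i] = min(1.0, running)
        return adjusted
    if method in ("BH", "BY"):
        # BY adds the harmonic-sum factor to stay valid under dependence.
        factor = sum(1 / k for k in range(1, m + 1)) if method == "BY" else 1.0
        running = 1.0
        for rank in range(m - 1, -1, -1):  # step-up from the largest p-value
            i = order[rank]
            running = min(running, p_values[i] * m * factor / (rank + 1))
            adjusted[i] = running
        return adjusted
    raise ValueError("unknown method: " + method)


# Made-up p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
for name in ("bonferroni", "holm", "BH", "BY"):
    print(name, adjust_p_values(raw, name))
```

For the same raw p-values, the Holm-adjusted values are never larger than the Bonferroni-adjusted ones, and the BY-adjusted values are never smaller than the BH-adjusted ones.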

Pairwise Log-rank Test P Value Table

# Semi-Parametric Cox Regression

Cox regression, also known as Cox proportional hazards regression, is a semi-parametric model: if the proportional hazards assumption holds (or is assumed to hold), the effect parameter(s) can be estimated without any consideration of the baseline hazard function. Cox regression assumes that the effects of the predictor variables on survival are constant over time and additive on one scale.

#### 1. Functionalities

• To build a Cox regression model
• To obtain the estimates of the model, such as (1) coefficient estimates, (2) predictions from the training data, (3) residuals, (4) adjusted survival curves, (5) the proportional hazards test, and (6) diagnostic plots
• To upload new data and get predictions
• To evaluate the model on new data that contain the dependent variable
• To obtain the Brier score and time-dependent AUC

• Please prepare the training data in the Data tab
• Please prepare the survival object, Surv(time, event), in the Data tab
• New data (test set) should cover all the independent variables used in the model.

#### Build the Model

Prepare the data in the Data tab

#### Step 1. Choose variables to build the model

1. Check Surv(time, event), survival object, in the Data Tab

If you want to consider the heterogeneity in the sample, continue with the Extending Model tab

Frailty: individuals have different frailties, and those who are most frail will die earlier than others. The frailty model estimates the relative risk within the random effect variable.

The cluster model is also called the marginal model. It estimates the population-averaged relative risk due to the independent variable.

In the example of the Diabetes data: 'eye' could be used as the strata variable, and the results will then be shown by eye group; 'id' could be used as the cluster variable, and the results will then assume independence between clusters while allowing for correlation within a cluster; 'id' could also be used as the random effect variable of a frailty term, and the results will then be adjusted by the frailty distribution estimated over 'id'.

#### Step 2. Check Cox Model

Valid model example: Surv(time, status) ~ X1 + X2

Or, Surv(time1, time2, status) ~ X1 + X2
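For orientation, a formula such as Surv(time, status) ~ X1 + X2 corresponds to the standard Cox model below. The symbols are notation introduced here: h_0(t) is the unspecified baseline hazard and beta_1, beta_2 are the coefficients to be estimated.

$$
h(t \mid X_1, X_2) = h_0(t)\,\exp(\beta_1 X_1 + \beta_2 X_2)
$$

Increasing X_1 by one unit while holding X_2 fixed multiplies the hazard by a constant that does not depend on t, which is the proportional hazards property:

$$
\frac{h(t \mid X_1 + 1, X_2)}{h(t \mid X_1, X_2)} = \exp(\beta_1)
$$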

#### Output 1. Data Preview

Check full data in Data tab

#### Output 2. Model Results

Explanations
• For each variable, estimated coefficients (coef), the statistic for the significance of a single variable, and P value are given.
• The column marked 'z' gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta coefficient of a given variable is statistically significantly different from 0.
• The coefficients relate to hazard; a positive coefficient indicates a worse prognosis, and a negative coefficient indicates a protective effect of the variable with which it is associated.
• exp(coef) = hazard ratio (HR). HR = 1: No effect; HR < 1: Reduction in the hazard; HR > 1: Increase in Hazard
• The output also gives the lower and upper 95% confidence limits for the hazard ratio (exp(coef)).
• The likelihood-ratio test, Wald test, and score (log-rank) test give the global statistical significance of the model. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The likelihood-ratio test has better behavior for small sample sizes, so it is generally preferred.
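The per-variable quantities described above follow directly from a coefficient and its standard error. The sketch below shows the arithmetic under the usual normal approximation; the function name `wald_summary` and the input values are made up for illustration and are not output from the example data.

```python
import math


def wald_summary(coef, se):
    """Wald z, two-sided p-value, hazard ratio, and 95% confidence interval."""
    z = coef / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return {
        "z": z,
        "p": p,
        "hr": math.exp(coef),
        "lower": math.exp(coef - 1.96 * se),
        "upper": math.exp(coef + 1.96 * se),
    }


# Hypothetical coefficient and standard error.
print(wald_summary(0.5, 0.2))
```

The confidence interval is computed on the coefficient scale and then exponentiated, which is why it is not symmetric around the hazard ratio.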

Fitted values and residuals from the existing data

Explanations
• The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
• Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.
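The ranking by AIC can be sketched as follows, using the usual definition AIC = 2k - 2 log-likelihood, where k is the number of estimated parameters. The candidate models and their log-likelihood values below are invented for illustration, and the helper names are assumptions; a stepwise procedure repeats this kind of comparison while adding or dropping one variable at a time.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


def rank_by_aic(models):
    """models maps a name to (log_likelihood, n_params); best model first."""
    return sorted(models, key=lambda name: aic(*models[name]))


# Hypothetical candidate models with made-up log-likelihoods.
candidates = {
    "X1": (-120.4, 1),
    "X1 + X2": (-115.0, 2),
    "X1 + X2 + X3": (-114.8, 3),
}
for name in rank_by_aic(candidates):
    print(name, aic(*candidates[name]))
```

In this made-up example the third variable improves the log-likelihood too little to justify the extra parameter, so the two-variable model ranks first.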

Model selection suggested by AIC

Explanations
• This plot presents the expected survival curves calculated from the Cox model separately for subpopulations / strata
• If there is no strata() component, only a single curve is plotted, which is the average for the whole population

The adjusted survival curves from Cox regression

Explanations
• Schoenfeld residuals are used to check the proportional hazards (PH) assumption
• Under the PH assumption, Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of a violation of the PH assumption
• If the test is not statistically significant (p > 0.05) for each of the independent variables, the proportional hazards assumption is reasonable

Explanations

Plotting martingale residuals against a continuous independent variable is a common approach for detecting nonlinearity. For a given continuous covariate, patterns in the plot may suggest that the variable is not properly fit. Martingale residuals may take any value in the range (-INF, +1):
• A martingale residual near 1 represents an individual that 'died too soon',
• Large negative values correspond to individuals that 'lived too long'.
The deviance residual is a normalized transform of the martingale residual. These residuals should be roughly symmetrically distributed about zero with a standard deviation of 1.
• Positive values correspond to individuals that 'died too soon' compared to expected survival times.
• Negative values correspond to individuals that 'lived too long'.
• Very large or very small values are outliers, which are poorly predicted by the model.
Cox-Snell residuals are used to check the overall goodness of fit of survival models.
• Cox-Snell residuals are equal to -log(survival probability) for each observation
• If the model fits the data well, Cox-Snell residuals should behave like a sample from an exponential distribution with a mean of 1
• If the residuals act like a sample from a unit exponential distribution, their estimated cumulative hazard plotted against the residuals should lie along the 45-degree diagonal line.
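The three residual types are linked by simple formulas, which the sketch below illustrates: the martingale residual is the event indicator minus the Cox-Snell residual, and the deviance residual rescales the martingale residual so that it is more symmetric around zero. The function name and the input values are assumptions for illustration; the Cox-Snell residual is taken as given and must be positive when the status is 1.

```python
import math


def martingale_and_deviance(status, cox_snell):
    """Return (martingale residual, deviance residual) for one observation.

    status: 1 = event, 0 = censored.
    cox_snell: -log(predicted survival probability) at the observed time.
    """
    martingale = status - cox_snell
    # The log term only contributes for observed events.
    log_term = math.log(cox_snell) if status == 1 else 0.0
    inner = max(0.0, -2 * (martingale + log_term))  # guard against rounding
    deviance = math.copysign(math.sqrt(inner), martingale)
    return martingale, deviance


# An event with a small Cox-Snell residual: 'died too soon'.
print(martingale_and_deviance(1, 0.2))
# A censored subject with a large Cox-Snell residual: 'lived too long'.
print(martingale_and_deviance(0, 1.5))
```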

The residuals can be found in Data Fitting tab.

Red points are those who 'died soon'; black points are those who 'lived long'

1. Martingale residuals plot against continuous independent variable

2. Deviance residuals plot by observational id

3. Cox-Snell residuals plot

#### Prediction

Prepare model first

#### Data: Diabetes / NKI70

Data for prediction should cover all the variables in the model

2. Show 1st row as column names?

3. Use 1st column as row names? (No duplicates)

Choosing the correct separator and quote character ensures that the data are read in successfully

Find some example data here

#### Output 3. Prediction Results

The Brier score is used to evaluate the accuracy of a predicted survival function at a given series of time points. It represents the average squared distance between the observed survival status and the predicted survival probability and is always a number between 0 and 1, with 0 being the best possible value.

The Integrated Brier Score (IBS) provides an overall calculation of the model performance at all available times.
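A simplified sketch of both quantities is given below. It is deliberately naive: subjects censored before the evaluation time are skipped, whereas practical implementations typically reweight observations to account for censoring, so the numbers will generally differ from the tool's output. The function names, the trapezoidal integration, and the example values are assumptions for illustration.

```python
def brier_score(t, times, events, predicted_surv):
    """Mean squared difference between survival status at t and predicted S(t).

    predicted_surv[i] is the predicted probability that subject i survives
    beyond t. Subjects censored at or before t are skipped (a simplification).
    """
    total, n = 0.0, 0
    for ti, ei, s in zip(times, events, predicted_surv):
        if ti > t:
            observed = 1.0   # still event-free at t
        elif ei == 1:
            observed = 0.0   # event happened at or before t
        else:
            continue         # censored at or before t: status at t unknown
        total += (observed - s) ** 2
        n += 1
    if n == 0:
        raise ValueError("no subject has a known status at time t")
    return total / n


def integrated_brier_score(grid, scores):
    """Trapezoidal average of Brier scores over an increasing time grid."""
    area = sum((grid[k + 1] - grid[k]) * (scores[k] + scores[k + 1]) / 2
               for k in range(len(grid) - 1))
    return area / (grid[-1] - grid[0])


# Made-up follow-up data and predictions at t = 5.
times, events = [2, 4, 6, 8], [1, 1, 0, 1]
print(brier_score(5, times, events, [0.1, 0.3, 0.7, 0.9]))
```

A model that always predicts 0.5 scores 0.25, which is a common reference point when reading Brier scores.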

The default setting gives the time series 1, 2, ..., 10

Brier score at given time

Explanations

AUC here is the time-dependent AUC, which gives the AUC at each time point of a given series.
• Chambless and Diao: assumes that lp and lpnew are the predictors of a Cox proportional hazards model. (Chambless, L. E. and G. Diao (2006). Estimation of time-dependent area under the ROC curve for long-term risk prediction. Statistics in Medicine 25, 3474–3486.)
• Hung and Chiang: assumes that there is a one-to-one relationship between the predictor and the expected survival times conditional on the predictor. (Hung, H. and C.-T. Chiang (2010). Estimation methods for time-dependent AUC models with survival data. Canadian Journal of Statistics 38, 8–26.)
• Song and Zhou: in this method, the estimators remain valid even if the censoring times depend on the values of the predictors. (Song, X. and X.-H. Zhou (2008). A semiparametric approach for the covariate specific ROC curve with survival outcome. Statistica Sinica 18, 947–965.)
• Uno et al.: the estimators are based on inverse-probability-of-censoring weights and do not assume a specific working model for deriving the predictor lpnew. It is assumed that there is a one-to-one relationship between the predictor and the expected survival times conditional on the predictor. (Uno, H., T. Cai, L. Tian, and L. J. Wei (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association 102, 527–537.)
The example time series: 1, 2, 3, ..., 10

Time dependent AUC at given time

# Parametric Accelerated Failure Time (AFT) Model

The accelerated failure time (AFT) model is a parametric model that assumes that the effect of a covariate is to accelerate or decelerate the life course of a disease by some constant factor.
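In the usual log-linear formulation, which is assumed here and written in notation introduced for illustration, the model for the survival time T is:

$$
\log T = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \sigma\,\varepsilon
$$

where sigma is a scale parameter and the distribution of the error term epsilon is determined by the chosen parametric family. Equivalently, the covariates rescale the time axis of a baseline survival function S_0:

$$
S(t \mid X) = S_0\!\left( t\, e^{-(\beta_1 X_1 + \dots + \beta_p X_p)} \right)
$$

Under this formulation, exp(beta_1) is the factor by which survival time is multiplied when X_1 increases by one unit.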

#### 1. Functionalities

• To build an AFT model
• To obtain the estimates of the model, such as parameter coefficients, residuals, and diagnostic plots
• To obtain fitted values predicted from the training data
• To upload new data and get predictions
• To evaluate the model on new data that contain the dependent variable

• Prepare the training data in the Data tab
• Prepare the survival object, Surv(time, event), in the Data tab
• New data (test set) should cover all the independent variables used in the model.

#### Build the Model

Prepare the data in the Data tab

#### Step 1. Choose variables to build the model

1. Check survival object, Surv(time, event), in the Data Tab

If you want to consider the heterogeneity in the sample, continue with the Extending Model tab

In the example of the Diabetes data: 'eye' could be used as the strata variable; 'id' could be used as the cluster variable.

#### Step 2. Check AFT Model

Valid model example: Surv(time, status) ~ X1 + X2

Or, Surv(time1, time2, status) ~ X1 + X2

'-1' in the formula indicates that the intercept/constant term has been removed

#### Output 1. Data Preview

Check full data in Data tab

#### Output 2. Model Results

Explanations
• For each variable, the estimated coefficient (Value), the statistic for the significance of a single variable, and the p value are given.
• The column marked 'z' gives the Wald statistic value. It is the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta coefficient of a given variable is statistically significantly different from 0.
• In an AFT model the coefficients relate to the log of survival time rather than to the hazard; a positive coefficient indicates longer survival times (a better prognosis), and a negative coefficient indicates shorter survival times.
• exp(Value) = time ratio (TR), also called the acceleration factor. TR = 1: No effect; TR < 1: Shorter survival time; TR > 1: Longer survival time
• Scale and Log(scale) are the estimated parameters in the error term of the AFT model
• The log-likelihood of the model is given. When models fitted by maximum likelihood to the same data are compared, a higher log-likelihood (LL) indicates a better fit.
• For left-truncated data, the time here is the difference between end time and start time

Fitted values and residuals from the existing data

Explanations
• The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
• Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.

Model selection suggested by AIC

Explanations

Plotting martingale residuals against a continuous independent variable is a common approach for detecting nonlinearity. For a given continuous covariate, patterns in the plot may suggest that the variable is not properly fit. Martingale residuals may take any value in the range (-INF, +1):
• A martingale residual near 1 represents an individual that 'died too soon',
• Large negative values correspond to individuals that 'lived too long'.
The deviance residual is a normalized transform of the martingale residual. These residuals should be roughly symmetrically distributed about zero with a standard deviation of 1.
• Positive values correspond to individuals that 'died too soon' compared to expected survival times.
• Negative values correspond to individuals that 'lived too long'.
• Very large or very small values are outliers, which are poorly predicted by the model.
Cox-Snell residuals are used to check the overall goodness of fit of survival models.
• Cox-Snell residuals are equal to -log(survival probability) for each observation
• If the model fits the data well, Cox-Snell residuals should behave like a sample from an exponential distribution with a mean of 1
• If the residuals act like a sample from a unit exponential distribution, their estimated cumulative hazard plotted against the residuals should lie along the 45-degree diagonal line.

The residuals can be found in Data Fitting tab.

Red points are those who 'died soon'; black points are those who 'lived long'

1. Martingale residuals plot against continuous independent variable

2. Deviance residuals plot by observational id

3. Cox-Snell residuals plot

#### Prediction

Prepare model first

#### Data: Diabetes / NKI70

Data for prediction should cover all the variables in the model

2. Show 1st row as column names?

3. Use 1st column as row names? (No duplicates)

Choosing the correct separator and quote character ensures that the data are read in successfully

Find some example data here

#### Output 3. Prediction Results

The predicted survival probability of N'th observation