Data Preparation

1. Functionalities

To upload a data file, preview the data set, and check that the data were read in correctly
To pre-process some variables (when necessary) for building the model
To obtain basic descriptive statistics and plots of the variables
To prepare the survival object that serves as the dependent variable in the model

2. About your data (training set)

Your data need to include one survival time variable, one 1/0 censoring variable, and at least one independent variable (denoted as X)
Your data need to have more rows than columns
Do not mix characters and numbers in the same column
The data used to build a model are called the training set

Case Example 1: Right-censored diabetes data

Suppose we have observations from a trial of laser coagulation for the treatment of diabetic retinopathy. Each patient had one eye randomized to laser treatment, while the other eye received no treatment. For each eye, the event of interest was visual acuity dropping below 5/200 at two consecutive visits, and the survival time was measured from initiation of treatment to that event. Because visits were every 3 months, there is a built-in lag time of approximately 6 months. Survival times in this dataset are therefore the actual time to blindness in months, minus the minimum possible time to event (6.5 months). Censoring status: 0 = censored; 1 = visual loss. Treatment: 0 = no treatment; 1 = laser. Age is the age at diagnosis.

Case Example 2: Left-truncated right-censored Nki70 data

Suppose we wanted to study metastasis-free survival in 100 lymph node positive breast cancer patients. Some patients enrolled in the study later than others, so follow-up is left-truncated. The data contain 5 clinical risk factors: (1) Diam: diameter of the tumor; (2) N: number of affected lymph nodes; (3) ER: estrogen receptor status; (4) Grade: grade of the tumor; and (5) Age: patient age at diagnosis (years); and gene expression measurements of 70 genes found to be prognostic for metastasis-free survival in an earlier study. The time variable is metastasis-free follow-up time (months). Censoring indicator variable: 1 = metastasis or death; 0 = censored.

We wanted to explore the association between survival time and the independent variables.

Please follow the Steps; the Outputs show the analytical results in real time. Once the data are ready, build the model in the next tabs.

Training Set Preparation

Example data
Upload Data

Use example data

Uploaded data will replace the example data

Please follow the format of the example data when uploading new data

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

Choosing the correct separator and quote character ensures that the data are read in successfully
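
To illustrate why the separator and quote settings matter, here is a minimal Python sketch. It is not the tool's own import code: the function read_table and the sample text are hypothetical, and only the standard csv module is assumed.

```python
import csv
import io

def read_table(text, sep=",", quote='"', header=True):
    """Parse delimited text into (column_names, rows). Hypothetical helper."""
    kwargs = {"delimiter": sep}
    if quote is None:
        kwargs["quoting"] = csv.QUOTE_NONE   # treat quote characters literally
    else:
        kwargs["quotechar"] = quote
    rows = [r for r in csv.reader(io.StringIO(text), **kwargs) if r]
    if header:
        return rows[0], rows[1:]
    # No header row: make up column names V1, V2, ...
    return [f"V{i + 1}" for i in range(len(rows[0]))], rows

# Made-up sample using semicolons and double quotes
sample = 'id;time;status;laser\n1;46.23;0;"argon"\n2;42.5;1;"xenon"\n'

names, body = read_table(sample, sep=";", quote='"')
wrong_names, _ = read_table(sample, sep=",", quote='"')  # wrong separator
```

With the wrong separator, every row collapses into a single column, which is the usual sign that the separator option needs changing.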

Find some example data here

Diabetes data has only one time duration variable, while Nki70 data has start.time and end.time.

Step 2. Create a Survival Object

2. Select survival time type

Right-censored time needs only 1 time duration / follow-up variable

Left-truncated right-censored time needs start time and end time variables

Diabetes data has right-censored time, while Nki70 data has left-truncated right-censored time.
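
The difference between the two time types can be sketched in terms of who counts as "at risk" at a given time. The snippet below is an illustrative Python sketch with made-up subjects, not data from Nki70; the function at_risk is a hypothetical helper.

```python
def at_risk(t, end, start=0.0):
    """True if a subject is under observation just before time t."""
    return start < t <= end

# Hypothetical subjects given as (start.time, end.time)
subjects = [(0.0, 10.0), (4.0, 12.0), (6.0, 8.0)]

# Left-truncated view: a subject joins the risk set only after start.time
risk_set_at_5 = [i for i, (s, e) in enumerate(subjects) if at_risk(5.0, e, s)]

# Ignoring start.time treats everyone as observed from time 0
naive_at_5 = [i for i, (s, e) in enumerate(subjects) if at_risk(5.0, e)]
```

The third subject enters at time 6, so it should not count as at risk at time 5; ignoring the start time would overstate the size of the risk set.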

Step 3. Check the Survival Object

Valid survival object example: Surv(time, status)

or, Surv(start.time, end.time, status)

Change the types of some variables?

Change the reference level of a categorical variable?

2. Enter the reference level, one line per variable

Output 1. Data Information

Data Preview

1. Numeric variable information list

2. Categorical variable information list

Output 2. Descriptive Results

Basic Descriptives
Survival Curves
Life Table
Histogram and Density Plot

1. For numeric variable

2. For categorical variable

Download Results (Categorical variables)

Choose one plot

1. Survival Probability

2. Cumulative Events

3. Cumulative Hazard

Histogram: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

Density plot: to show the distribution of a variable

Histogram

The number of bins in the histogram

When the number of bins is 0, the plot uses the default number of bins

Density plot

Non-Parametric Kaplan-Meier Estimator and Log-rank Test

The Kaplan-Meier estimator, also known as the product-limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data.
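
As a rough illustration of the product-limit idea, the following Python sketch multiplies, at each distinct event time, the running survival estimate by (1 - deaths / number at risk). The function name and the toy data are assumptions made for this example; they are not the tool's implementation.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate).

    times  : follow-up times
    events : 1 = event observed, 0 = censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed   # events and censored both leave the risk set
    return curve

# Toy data (made up): six subjects, two of them censored
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Censored subjects contribute to the number at risk up to their censoring time but never trigger a drop in the curve.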

The log-rank test is a hypothesis test to compare the survival distributions of two samples. It compares estimates of the hazard functions of the two groups at each observed event time.

1. Functionalities

To obtain the Kaplan-Meier survival probability estimate
To obtain Kaplan-Meier survival curves, cumulative events distribution curves, and cumulative hazard curves by a group variable
To conduct a log-rank test to compare the survival curves from 2 groups
To conduct a pairwise log-rank test to compare the survival curves from more than two groups

2. About your data

Prepare the survival object in the Data tab
A categorical variable is required in this model

Please follow the Steps to build the model, and click Outputs to get analytical results.

Choose group variable to build the model

Prepare the data in the Data tab

1. Check survival object, Surv(time, event), in the Data Tab

In the Diabetes example, we chose 'laser' as the categorical group variable, to explore whether the survival curves of the two laser groups differ.

Log-rank Test

Null hypothesis

Two groups have identical hazard functions

Choose Log-rank Test Method

1. Log-rank test

2. Peto & Peto modification of the Gehan-Wilcoxon test

See method explanations in Output 2. Log-rank Test tab.

Pairwise Log-rank Test

Null hypothesis

Two groups have identical hazard functions

1. Choose Log-rank Test Method

1. Log-rank test

2. Peto & Peto modification of the Gehan-Wilcoxon test

2. Choose a method to adjust P value

Bonferroni

Bonferroni-Holm: often used

False Discovery Rate-BH

False Discovery Rate-BY

See method explanations in Output 2. Pairwise Log-rank Test tab.

Output 1. Data Preview

Variables information
Part of Data

Check full data in Data tab

Output 2. Estimate and Test Results

Kaplan-Meier Survival Probability
Kaplan-Meier Plot by Group
Log-Rank Test
Pairwise Log-Rank Test

Kaplan-Meier survival probability by group

Which plot do you want to see?

1. Survival Probability

2. Cumulative Events

3. Cumulative Hazard

Explanations

This implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^rho, where S is the Kaplan-Meier estimate of survival.

rho = 0: log-rank or Mantel-Haenszel test
rho = 1: Peto & Peto modification of the Gehan-Wilcoxon test.
p < 0.05 indicates that the survival curves differ significantly
p >= 0.05 indicates no statistically significant difference between the survival curves
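
The weighting scheme described above can be sketched for two groups as follows. This is an illustrative Python version, not the routine the tool calls: the function name g_rho_test is hypothetical, the weight uses the pooled Kaplan-Meier estimate just before each event time (implementations may differ in that detail), and the p value assumes a chi-square distribution with 1 degree of freedom.

```python
import math

def g_rho_test(times, events, groups, rho=0.0):
    """Two-group G-rho test; groups coded 0/1. Returns (chi2, p_value)."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0            # pooled Kaplan-Meier estimate just before time t
    obs_minus_exp = 0.0   # weighted observed minus expected events in group 1
    var = 0.0
    for t in event_times:
        n = sum(1 for x in times if x >= t)
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e == 1 and g == 1)
        w = surv ** rho   # rho = 0 gives weight 1 at every event time
        obs_minus_exp += w * (d1 - d * n1 / n)
        if n > 1:
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        surv *= 1 - d / n
    chi2 = obs_minus_exp ** 2 / var
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value

# Toy data (made up): group 0 fails early, group 1 fails late
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
groups = [0, 0, 0, 1, 1, 1]
chi2_logrank, p_logrank = g_rho_test(times, events, groups, rho=0.0)
chi2_peto, p_peto = g_rho_test(times, events, groups, rho=1.0)
```

With rho = 1 the early event times, where the pooled survival estimate is still high, carry more weight than the late ones.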

Log-rank Test Result

In this example, we found no statistically significant difference between the 2 laser groups (p=0.8). The Kaplan-Meier plot also shows that the survival curves of the 2 laser groups cross each other.

Explanations

This implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^rho, where S is the Kaplan-Meier estimate of survival.

rho = 0: log-rank or Mantel-Haenszel test
rho = 1: Peto & Peto modification of the Gehan-Wilcoxon test.
Bonferroni correction is a generic but very conservative approach
Bonferroni-Holm is less conservative and uniformly more powerful than Bonferroni
False Discovery Rate-BH, developed by Benjamini and Hochberg, controls the false discovery rate and is more powerful than the Bonferroni-type methods
False Discovery Rate-BY, developed by Benjamini and Yekutieli, controls the false discovery rate under arbitrary dependence between tests and is more conservative than BH
p < 0.05 indicates that the survival curves differ significantly
p >= 0.05 indicates no statistically significant difference between the survival curves
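
To make the four adjustment options concrete, here is a small Python sketch of the standard formulas. The function p_adjust and the example p values are assumptions for illustration, not output from the tool.

```python
def p_adjust(pvals, method="holm"):
    """Adjust p values for multiple comparisons (illustrative sketch)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    if method == "bonferroni":
        return [min(1.0, p * m) for p in pvals]
    if method == "holm":
        running = 0.0
        for rank, i in enumerate(order):
            # Step-down: multiply by the number of hypotheses still in play
            running = max(running, (m - rank) * pvals[i])
            adjusted[i] = min(1.0, running)
        return adjusted
    if method in ("BH", "BY"):
        # BY adds the harmonic-sum penalty c = 1 + 1/2 + ... + 1/m
        c = sum(1.0 / k for k in range(1, m + 1)) if method == "BY" else 1.0
        running = 1.0
        for rank in range(m - 1, -1, -1):
            i = order[rank]
            running = min(running, c * m / (rank + 1) * pvals[i])
            adjusted[i] = running
        return adjusted
    raise ValueError("unknown method: " + method)

# Hypothetical p values from three pairwise comparisons
raw = [0.01, 0.02, 0.03]
results = {m: p_adjust(raw, m) for m in ("bonferroni", "holm", "BH", "BY")}
```

On the same raw p values, Bonferroni never gives smaller adjusted values than Holm, and BY never gives smaller values than BH.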

Pairwise Log-rank Test P Value Table

Semi-Parametric Cox Regression

Cox regression, also known as Cox proportional hazards regression, is a semi-parametric model: if the proportional hazards assumption holds (or is assumed to hold), the effect parameter(s) can be estimated without specifying the baseline hazard function. Cox regression assumes that the effects of the predictor variables on survival are constant over time and additive on the log-hazard scale.

1. Functionalities

To build a Cox regression model
To obtain the estimates of the model, such as (1) coefficient estimates, (2) predictions from the training data, (3) residuals, (4) the adjusted survival curves, (5) the proportional hazards test, and (6) diagnostic plots
To upload new data and get the prediction
To evaluate predictions on new data that contain the dependent variable
To obtain the Brier score and time-dependent AUC

2. About your data (training set)

Please prepare the training data in the Data tab
Please prepare the survival object, Surv(time, event), in the Data tab
New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.

Build the Model

Prepare the data in the Data tab

Step 1. Choose variables to build the model

1. Check Surv(time, event), survival object, in the Data Tab

Basic Model
Extending Model

If you want to consider the heterogeneity in the sample, continue with Extending Model tab

4. (Optional) Choose Method for Ties Handling

1. Efron method: more accurate if there are a large number of ties

2. Breslow approximation: the easiest to program and the first option coded for almost all computer routines

3. Exact partial likelihood method: the Cox partial likelihood is equivalent to that for matched logistic regression

5. (Optional) Add random effect term (the effect of heterogeneity)

None

Strata: identifies stratification variable (categorical, such as disease subtype and enrolling institutes)

Cluster: identifies correlated groups of observations (such as multiple events per subject)

Gamma Frailty: allows one to add a simple gamma distributed random effects term

Gaussian Frailty: allows one to add a simple Gaussian distributed random effects term

Frailty: individuals have different frailties, and those who are most frail will die earlier than others. The frailty model estimates the relative risk within levels of the random effect variable.

The cluster model is also called the marginal model. It estimates the population-averaged relative risk due to the independent variable.

In the Diabetes example: 'eye' could be used as the strata variable, in which case the results are shown by eye group; 'id' can be used as the cluster variable, in which case observations within the same cluster are treated as correlated; 'id' can also be used as the frailty variable, in which case the results are adjusted by a random effect estimated for each 'id'.

Step 2. Check Cox Model

Valid model example: Surv(time, status) ~ X1 + X2

Or, Surv(time1, time2, status) ~ X1 + X2

Step 3. If data and model are ready, click the blue button to generate model results.

Output 1. Data Preview

Variables information
Part of Data

Check full data in Data tab

Output 2. Model Results

Model Estimation
Data Fitting
AIC-based Selection
Survival Curve
Proportional Hazards Test
Diagnostic Plot

Explanations

For each variable, estimated coefficients (coef), the statistic for the significance of a single variable, and P value are given.
The column marked 'z' gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta coefficient of a given variable is statistically significantly different from 0.
The coefficients relate to hazard; a positive coefficient indicates a worse prognosis, and a negative coefficient indicates a protective effect of the variable with which it is associated.
exp(coef) = hazard ratio (HR). HR = 1: No effect; HR < 1: Reduction in the hazard; HR > 1: Increase in Hazard
The output also gives the lower and upper 95% confidence limits for the hazard ratio (exp(coef)).
The likelihood-ratio test, Wald test, and score (log-rank) test give the global statistical significance of the model. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The likelihood-ratio test behaves better for small sample sizes, so it is generally preferred.
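
The relationships among coef, se(coef), z, the hazard ratio, and its confidence interval can be sketched in a few lines of Python. The function wald_summary and the input numbers are hypothetical; a two-sided normal p value and a 1.96 multiplier for the 95% interval are assumed.

```python
import math

def wald_summary(coef, se):
    """Summarise one Cox coefficient (illustrative sketch)."""
    z = coef / se
    # Two-sided p value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "z": z,
        "p": p_value,
        "HR": math.exp(coef),
        "lower95": math.exp(coef - 1.96 * se),
        "upper95": math.exp(coef + 1.96 * se),
    }

# Hypothetical estimate: coef = -0.5 with standard error 0.25
summary = wald_summary(-0.5, 0.25)
```

The interval is built on the coefficient scale and then exponentiated, which is why it is not symmetric around the hazard ratio.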

Fitted values and residuals from the existing data

Explanations

The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.
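
As a minimal sketch of the ranking step, AIC can be computed as 2k - 2LL, where k is the number of estimated parameters and LL is the (partial) log-likelihood. The candidate models and log-likelihood values below are made up for illustration.

```python
def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_lik

# Hypothetical candidates: formula -> (log-likelihood, number of parameters)
candidates = {
    "age": (-120.4, 1),
    "age + laser": (-118.9, 2),
    "age + laser + eye": (-118.7, 3),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)   # lowest AIC
```

Adding a variable always raises the log-likelihood at least slightly, so the 2k penalty is what stops the largest model from winning automatically.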

Model selection suggested by AIC

Explanations

This plot presents expected survival curves, calculated from the Cox model, separately for subpopulations / strata
If there is no strata() component, only a single curve is plotted: the average for the whole population

The adjusted survival curves from Cox regression

Explanations

Schoenfeld residuals are used to check the proportional hazards (PH) assumption
Under the PH assumption, Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption
If the test is not statistically significant (p>0.05) for each of the independent variables, we can assume proportional hazards

Choose N'th variable

Explanations

Plotting martingale residuals against a continuous independent variable is a common way to detect nonlinearity. For a given continuous covariate, patterns in the plot may suggest that the functional form of the variable is not properly fitted. Martingale residuals can take any value in the range (-INF, +1):

A value of martingale residuals near 1 represents individuals that 'died too soon'.
Large negative values correspond to individuals that 'lived too long'.

Deviance residual is a normalized transform of the martingale residual. These residuals should be roughly symmetrically distributed about zero with a standard deviation of 1.

Positive values correspond to individuals that 'died too soon' compared to expected survival times.
Negative values correspond to individuals that 'lived too long'.
Very large or small values are outliers, which are poorly predicted by the model.

Cox-Snell residuals are used to check for overall goodness of fit in survival models.

Cox-Snell residuals are equal to the -log(survival probability) for each observation
If the model fits the data well, Cox-Snell residuals should behave like a sample from an exponential distribution with a mean of 1
If the residuals act like a sample from a unit exponential distribution, a plot of their estimated cumulative hazard against the residuals should lie along the 45-degree diagonal line.
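
The three residual types are linked by simple formulas, sketched here in Python under the usual definitions: Cox-Snell r = -log(S), martingale M = status - r, and deviance = sign(M) * sqrt(-2 * (M + status * log(status - M))). The function names and survival probabilities are hypothetical.

```python
import math

def cox_snell(surv_prob):
    """Cox-Snell residual from the model-based survival probability."""
    return -math.log(surv_prob)

def martingale(status, surv_prob):
    """Martingale residual: observed events minus expected events."""
    return status - cox_snell(surv_prob)

def deviance(status, surv_prob):
    """Deviance residual: a symmetrising transform of the martingale residual."""
    r = cox_snell(surv_prob)
    m = status - r
    # status - m equals r, so log(status - m) is log(r) when status == 1
    inner = m + (math.log(r) if status == 1 else 0.0)
    return math.copysign(math.sqrt(max(0.0, -2.0 * inner)), m)

# Hypothetical subject who died although predicted survival was 0.9
early_death = (martingale(1, 0.9), deviance(1, 0.9))

# Hypothetical subject still event-free although predicted survival was 0.1
long_survivor = (martingale(0, 0.1), deviance(0, 0.1))
```

The first subject gets positive residuals ('died too soon') and the second gets negative ones ('lived too long'), matching the interpretation given above.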

The residuals can be found in Data Fitting tab.

Red points are those who 'died soon'; black points are those who 'lived long'

1. Martingale residuals plot against continuous independent variable

2. Deviance residuals plot by observational id

3. Cox-Snell residuals plot

Prediction

Prepare model first

Step 4. Test Set Preparation

Example data
Upload Data

Data: Diabetes / NKI70

Data for prediction should cover all the variables in the model

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

Choosing the correct separator and quote character ensures that the data are read in successfully

Find some example data here

Step 5. If the model and new data are ready, click the blue button to generate prediction results.

Output 3. Prediction Results

Prediction Table
Brier Score
AUC

The Brier score evaluates the accuracy of a predicted survival function at given time points. It is the average squared distance between the observed survival status and the predicted survival probability, and is always a number between 0 and 1, with 0 being the best possible value.

The Integrated Brier Score (IBS) provides an overall calculation of the model performance at all available times.
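
A simplified Python sketch of the Brier score at a single time point is given here. It is an assumption-laden illustration, not the tool's method: subjects censored before the chosen time are simply skipped, whereas practical implementations usually reweight for censoring. The function name and data are made up.

```python
def brier_score(t, times, events, surv_prob):
    """Mean squared gap between observed status at t and predicted S(t).

    Simplification (assumption): subjects censored before t are dropped
    instead of being handled with censoring weights.
    """
    squared = []
    for ti, ei, s in zip(times, events, surv_prob):
        if ti > t:
            observed = 1.0        # still event-free at time t
        elif ei == 1:
            observed = 0.0        # event happened by time t
        else:
            continue              # censored before t: status unknown
        squared.append((observed - s) ** 2)
    return sum(squared) / len(squared)

# Hypothetical test set and predicted survival probabilities at t = 6
times = [2, 5, 7, 9]
events = [1, 0, 1, 1]
predicted = [0.2, 0.5, 0.7, 0.9]
score = brier_score(6, times, events, predicted)
```

Repeating the calculation over a grid of time points and averaging over the time range gives the idea behind the integrated version.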

Set time series: start point

Set time series: end point

Set time series: sequence

The default setting gives the time series 1, 2, ..., 10

Brier score at given time

Explanations

AUC here is the time-dependent AUC, which gives the AUC at each of the given time points.

Chambless and Diao: assumes that lp and lpnew are the predictors of a Cox proportional hazards model. (Chambless, L. E. and G. Diao (2006). Estimation of time-dependent area under the ROC curve for long-term risk prediction. Statistics in Medicine 25, 3474–3486.)
Hung and Chiang: assumes that there is a one-to-one relationship between the predictor and the expected survival times conditional on the predictor. (Hung, H. and C.-T. Chiang (2010). Estimation methods for time-dependent AUC models with survival data. Canadian Journal of Statistics 38, 8–26.)
Song and Zhou: the estimators remain valid even if the censoring times depend on the values of the predictors. (Song, X. and X.-H. Zhou (2008). A semiparametric approach for the covariate specific ROC curve with survival outcome. Statistica Sinica 18, 947–965.)
Uno et al.: the estimators are based on inverse-probability-of-censoring weights and do not assume a specific working model for deriving the predictor lpnew. It is assumed that there is a one-to-one relationship between the predictor and the expected survival times conditional on the predictor. (Uno, H., T. Cai, L. Tian, and L. J. Wei (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association 102, 527–537.)

Set time series: start point

Set time series: end point

Set time series: sequence

The example time series: 1, 2, 3, ..., 10

Choose one AUC estimator

Chambless and Diao

Hung and Chiang

Song and Zhou

Uno et al.

Time dependent AUC at given time

Parametric Accelerated Failure Time (AFT) Model

The accelerated failure time (AFT) model is a parametric model that assumes the effect of a covariate is to accelerate or decelerate the life course of a disease by some constant factor.
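
The "constant factor" idea can be sketched with a Weibull AFT model written as log(T) = linear predictor + scale * error. The Python snippet below is illustrative only: the parameter values are hypothetical, and the survival function assumes this common log-linear parameterisation.

```python
import math

def weibull_aft_survival(t, linear_predictor, scale):
    """S(t) for a Weibull AFT model with log(T) = lp + scale * W."""
    return math.exp(-((t / math.exp(linear_predictor)) ** (1.0 / scale)))

# Hypothetical fitted values: intercept, coefficient of a 0/1 covariate, scale
intercept, beta, scale = 3.0, 0.5, 0.8

time_ratio = math.exp(beta)   # factor by which the covariate stretches time

# Survival of the x = 0 group at t = 20
s_control = weibull_aft_survival(20.0, intercept, scale)

# Survival of the x = 1 group at the stretched time 20 * exp(beta)
s_treated = weibull_aft_survival(20.0 * time_ratio, intercept + beta, scale)
```

The two probabilities coincide: the x = 1 group reaches any given survival probability exp(beta) times later than the x = 0 group, which is what acceleration or deceleration by a constant means.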

1. Functionalities

To build an AFT model
To obtain the estimates of the model, such as coefficients of parameters, residuals, and diagnostic plots
To obtain fitted values, which are predicted from the training data
To upload new data and get the prediction
To evaluate predictions on new data that contain the dependent variable

2. About your data

Prepare the training data in the Data tab
Prepare the survival object, Surv(time, event), in the Data tab
New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.

Build the Model

Prepare the data in the Data tab

Step 1. Choose variables to build the model

1. Check survival object, Surv(time, event), in the Data Tab

Basic Model
Extending Model

3. Choose AFT Model

1. Log-normal regression model

2. Weibull regression model

3. Exponential regression model

4. Log-logistic regression model

5. (Optional) Keep or remove intercept / constant term

Remove intercept / constant

Keep intercept / constant term

If you want to consider the heterogeneity in the sample, continue with Extending Model tab

6. (Optional) Add random effect term (the effect of heterogeneity)

None

Strata: identifies stratification variable (categorical, such as disease subtype and enrolling institutes)

Cluster: identifies correlated groups of observations (such as multiple events per subject)

In the Diabetes example: 'eye' could be used as the strata variable; 'id' can be used as the cluster variable.

Step 2. Check AFT Model

Valid model example: Surv(time, status) ~ X1 + X2

Or, Surv(time1, time2, status) ~ X1 + X2

'-1' in the formula indicates that the intercept/constant term has been removed

Step 3. If data and model are ready, click the blue button to generate model results.

Output 1. Data Preview

Variables information
Part of Data

Check full data in Data tab

Output 2. Model Results

Model Estimation
Data Fitting
AIC-based Selection
Diagnostics Plot

Explanations

For each variable, the estimated coefficient (Value), the statistic for the significance of a single variable, and the p value are given.
The column marked 'z' gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta coefficient of a given variable is statistically significantly different from 0.
In an AFT model the coefficients relate to log survival time; a positive coefficient indicates longer survival (a better prognosis), and a negative coefficient indicates shorter survival.
exp(Value) = time ratio (acceleration factor). exp(Value) = 1: no effect; exp(Value) > 1: survival time is lengthened; exp(Value) < 1: survival time is shortened
Scale and Log(scale) are the estimated parameters in the error term of the AFT model
The log-likelihood of the model is reported. For models fitted by maximum likelihood to the same data, a higher log-likelihood (LL) indicates a better fit.
For left-truncated data, the time here is the difference between end time and start time

Fitted values and residuals from the existing data

Explanations

The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.

Model selection suggested by AIC

Explanations

A value of martingale residuals near 1 represents individuals that 'died too soon'.
Large negative values correspond to individuals that 'lived too long'.

Deviance residual is a normalized transform of the martingale residual. These residuals should be roughly symmetrically distributed about zero with a standard deviation of 1.

Positive values correspond to individuals that 'died too soon' compared to expected survival times.
Negative values correspond to individuals that 'lived too long'.
Very large or small values are outliers, which are poorly predicted by the model.

Cox-Snell residuals are used to check for overall goodness of fit in survival models.

Cox-Snell residuals are equal to the -log(survival probability) for each observation
If the model fits the data well, Cox-Snell residuals should behave like a sample from an exponential distribution with a mean of 1
If the residuals act like a sample from a unit exponential distribution, a plot of their estimated cumulative hazard against the residuals should lie along the 45-degree diagonal line.

The residuals can be found in Data Fitting tab.

Red points are those who 'died soon'; black points are those who 'lived long'

1. Martingale residuals plot against continuous independent variable

2. Deviance residuals plot by observational id

3. Cox-Snell residuals plot

Prediction

Prepare model first

Step 4. Test Set Preparation

Example data
Upload Data

Data: Diabetes / NKI70

Data for prediction should cover all the variables in the model

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

Choosing the correct separator and quote character ensures that the data are read in successfully

Find some example data here

Step 5. If the model and new data are ready, click the blue button to generate prediction results.

Output 3. Prediction Results

Prediction Table
Predicted Survival Plot

The predicted survival probability of N'th observation

Choose N'th observation (N'th row of new data)