Data Preparation

1. Functionalities

  • To upload a data file, preview the data set, and check that the data were read in correctly
  • To pre-process variables (when necessary) before building the model
  • To obtain basic descriptive statistics and plots of the variables
  • To prepare the survival object that serves as the dependent variable in the model

2. About your data (training set)

  • Your data need to include one survival time variable, one 1/0 censoring variable, and at least one independent variable (denoted as X)
  • Your data need to have more rows than columns
  • Do not mix characters and numbers in the same column
  • The data used to build a model is called a training set
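The requirements above can be checked before uploading. The following is a minimal sketch, assuming the data have already been read into a list of dictionaries (one per row). The function name check_training_data and the column names time, status, and trt are hypothetical and are not part of the app.

```python
def check_training_data(rows, time_col, event_col):
    """Return a list of problems found in a list-of-dicts data set."""
    problems = []
    columns = list(rows[0].keys())
    # Rule: survival time + censoring indicator + at least one X variable.
    if time_col not in columns or event_col not in columns or len(columns) < 3:
        problems.append("need a time variable, a censoring variable, and at least one X")
    # Rule: more rows than columns.
    if len(rows) <= len(columns):
        problems.append("need more rows than columns")
    # Rule: the censoring indicator must be coded 1/0.
    if any(r.get(event_col) not in (0, 1) for r in rows):
        problems.append("censoring variable must be coded 1/0")
    # Rule: do not mix characters and numbers in the same column.
    for c in columns:
        kinds = {isinstance(r[c], str) for r in rows}
        if len(kinds) > 1:
            problems.append(f"column {c!r} mixes characters and numbers")
    return problems


# Toy data (hypothetical values, for illustration only).
good = [
    {"time": 5.0, "status": 1, "trt": 0},
    {"time": 8.2, "status": 0, "trt": 1},
    {"time": 3.1, "status": 1, "trt": 1},
    {"time": 9.9, "status": 0, "trt": 0},
]
bad = [dict(r) for r in good]
bad[0]["trt"] = "laser"  # a character value in an otherwise numeric column

print(check_training_data(good, "time", "status"))
print(check_training_data(bad, "time", "status"))
```

The first call reports no problems; the second flags the mixed column.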

Case Example 1: Right-censored diabetes data

Suppose we have observations from a trial of laser coagulation for the treatment of diabetic retinopathy. Each patient had one eye randomized to laser treatment, while the other eye received no treatment. For each eye, the event of interest was the time from initiation of treatment to the time when visual acuity dropped below 5/200 at two consecutive visits. Because visits were every 3 months, there is a built-in lag time of approximately 6 months. Survival times in this dataset are therefore the actual time to blindness in months, minus the minimum possible time to event (6.5 months). Censoring status: 0 = censored, 1 = visual loss. Treatment: 0 = no treatment, 1 = laser. Age is the age at diagnosis.

Case Example 2: Left-truncated right-censored Nki70 data

Suppose we want to explore metastasis-free survival in 100 lymph node positive breast cancer patients. Some patients enrolled in the study later than others, so the data are left-truncated. The data contain 5 clinical risk factors: (1) Diam: diameter of the tumor; (2) N: number of affected lymph nodes; (3) ER: estrogen receptor status; (4) Grade: grade of the tumor; and (5) Age: patient age at diagnosis (years); and gene expression measurements of 70 genes found to be prognostic for metastasis-free survival in an earlier study. The time variable is metastasis-free follow-up time (months). Censoring indicator: 1 = metastasis or death; 0 = censored.

We wanted to explore the association between survival time and the independent variables.

Please follow the Steps; the Outputs panel shows the analytical results in real time. Once the data are ready, build the model in the following tabs.


Output 1. Data Information

Data Preview


1. Numeric variable information list


            

2. Categorical variable information list


            

Output 2. Descriptive Results


1. For numeric variables

2. For categorical variables


                  
                    
                  
                



                



Histogram: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

Density plot: to show the distribution of a variable


Histogram

When the number of bins is set to 0, the plot uses the default number of bins

Density plot


Non-Parametric Kaplan-Meier Estimator and Log-rank Test

The Kaplan–Meier estimator, also known as the product-limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data.
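As a rough illustration of the product-limit idea, the sketch below multiplies, at each distinct event time, the running survival estimate by (1 - deaths / number at risk). The function name kaplan_meier and the toy data are hypothetical; the app's own computation may differ in details such as tie handling and confidence intervals.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (event time, S(t)) pairs.

    times  -- observed follow-up times
    events -- 1 if the event occurred, 0 if the observation was censored
    """
    surv = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        # Events occurring exactly at time t.
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


# Toy data (hypothetical): five subjects, two of them censored.
km_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for t, s in km_curve:
    print(f"t = {t}: S(t) = {s:.3f}")
```

Censored observations never trigger a drop in the curve; they only reduce the number at risk at later event times.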

The log-rank test is a hypothesis test to compare the survival distributions of two samples. It compares estimates of the hazard functions of the two groups at each observed event time.

1. Functionalities

  • To obtain Kaplan-Meier survival probability estimates
  • To obtain Kaplan-Meier survival curves, cumulative events distribution curves, and cumulative hazard curves by a group variable
  • To conduct a log-rank test to compare the survival curves of two groups
  • To conduct a pairwise log-rank test to compare the survival curves of more than two groups
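The cumulative events and cumulative hazard curves are commonly drawn as transformations of the Kaplan-Meier survival estimate S(t): 1 - S(t) for cumulative events and -log S(t) for cumulative hazard. Assuming the app follows this common convention, a minimal sketch is shown below; the function names and the input values are hypothetical.

```python
import math


def cumulative_events(surv):
    """Cumulative event probability, 1 - S(t)."""
    return 1.0 - surv


def cumulative_hazard(surv):
    """Cumulative hazard, -log S(t) (requires S(t) > 0)."""
    return -math.log(surv)


# Hypothetical Kaplan-Meier estimates at three event times.
for s in (0.8, 0.6, 0.3):
    print(f"S = {s:.1f}  events = {cumulative_events(s):.3f}  "
          f"cumhaz = {cumulative_hazard(s):.3f}")
```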

2. About your data

  • Prepare the survival object in the Data tab
  • A categorical group variable is required for this analysis

Please follow the Steps to build the model, and click Outputs to get analytical results.


Output 1. Data Preview



                

Check full data in Data tab


Output 2. Estimate and Test Results



Kaplan-Meier survival probability by group


                



                


Explanations

This implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^rho, where S is the Kaplan-Meier estimate of survival.

  • rho = 0: log-rank or Mantel-Haenszel test
  • rho = 1: Peto & Peto modification of the Gehan-Wilcoxon test.
  • p < 0.05 indicates the curves are significantly different in the survival probabilities
  • p >= 0.05 indicates the curves are NOT significantly different in the survival probabilities
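To make the role of rho concrete, here is a minimal sketch of a two-group weighted log-rank statistic. It is an illustration under stated assumptions, not the app's implementation: the weight at each event time is taken as the pooled Kaplan-Meier estimate just before that time raised to the power rho, and the p value uses a chi-square distribution with 1 degree of freedom. The function name g_rho_test and the toy data are hypothetical.

```python
import math


def g_rho_test(times, events, groups, rho=0.0):
    """Two-group G-rho test; returns (chi-square statistic, p value)."""
    first = sorted(set(groups))[0]
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv = 1.0          # pooled Kaplan-Meier estimate just before time t
    u = var = 0.0       # weighted observed-minus-expected sum and its variance
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == first)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == first)
        w = surv ** rho  # rho = 0 gives weight 1 everywhere (log-rank)
        u += w * (d1 - d * n1 / n)
        if n > 1:
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        surv *= 1 - d / n
    chi2 = u * u / var
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Toy data (hypothetical): group A fails early, group B fails late.
t_sep = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
g_sep = ["A"] * 5 + ["B"] * 5
chi2_sep, p_sep = g_rho_test(t_sep, [1] * 10, g_sep, rho=0)
print(f"log-rank: chi2 = {chi2_sep:.2f}, p = {p_sep:.4f}")

chi2_peto, p_peto = g_rho_test(t_sep, [1] * 10, g_sep, rho=1)
print(f"rho = 1:  chi2 = {chi2_peto:.2f}, p = {p_peto:.4f}")
```

With rho = 1, early event times (where the pooled survival estimate is still high) receive more weight than late ones.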

Log-rank Test Result


                  

In this example, we could not find a statistically significant difference between the 2 laser groups (p = 0.8). The Kaplan-Meier plot also shows that the survival curves of the 2 laser groups cross each other.


Explanations

This implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^rho, where S is the Kaplan-Meier estimate of survival.

  • rho = 0: log-rank or Mantel-Haenszel test
  • rho = 1: Peto & Peto modification of the Gehan-Wilcoxon test.
  • Bonferroni correction is a generic but very conservative approach
  • Bonferroni-Holm is less conservative and uniformly more powerful than Bonferroni
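  • False Discovery Rate-BH, developed by Benjamini and Hochberg, controls the false discovery rate rather than the family-wise error rate, so it is more powerful than Bonferroni and Bonferroni-Holm
  • False Discovery Rate-BY, developed by Benjamini and Yekutieli, also controls the false discovery rate; it is more conservative than BH but remains valid under arbitrary dependence between the tests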
  • p < 0.05 indicates the curves are significantly different in the survival probabilities
  • p >= 0.05 indicates the curves are NOT significantly different in the survival probabilities
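The four corrections can be compared on a small set of raw p values. The sketch below follows the standard textbook definitions of the adjusted p values; the function name adjust_p and the three raw p values are hypothetical.

```python
def adjust_p(pvals, method):
    """Adjusted p values for 'bonferroni', 'holm', 'BH', or 'BY'."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adj = [0.0] * m
    if method == "bonferroni":
        return [min(1.0, p * m) for p in pvals]
    if method == "holm":
        running = 0.0
        for rank, i in enumerate(order):
            # Step-down: multiply by the number of hypotheses still in play.
            running = max(running, (m - rank) * pvals[i])
            adj[i] = min(1.0, running)
        return adj
    if method in ("BH", "BY"):
        # BY adds the harmonic-sum factor to allow arbitrary dependence.
        c = sum(1.0 / k for k in range(1, m + 1)) if method == "BY" else 1.0
        running = 1.0
        for rank in range(m - 1, -1, -1):
            i = order[rank]
            # Step-up: scale by m / rank, keep the sequence monotone.
            running = min(running, c * m / (rank + 1) * pvals[i])
            adj[i] = running
        return adj
    raise ValueError(f"unknown method: {method}")


raw = [0.01, 0.02, 0.04]  # hypothetical pairwise log-rank p values
for name in ("bonferroni", "holm", "BH", "BY"):
    print(name, [round(p, 4) for p in adjust_p(raw, name)])
```

On these values, BH gives adjusted p values no larger than Holm, Holm no larger than Bonferroni, and BY no smaller than BH.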

Pairwise Log-rank Test P Value Table


Semi-Parametric Cox Regression

Cox regression, also known as Cox proportional hazards regression, is a semi-parametric model: if the proportional hazards assumption holds (or is assumed to hold), the effect parameters can be estimated without specifying the baseline hazard function. Cox regression assumes that the effects of the predictor variables on survival are constant over time and additive on the log-hazard scale.

1. Functionalities

  • To build a Cox regression model
  • To obtain the estimates of the model, such as (1) coefficient estimates, (2) predictions from the training data, (3) residuals, (4) adjusted survival curves, (5) the proportional hazards test, and (6) diagnostic plots
  • To upload new data and get the prediction
  • To evaluate the model on new data that contain the dependent variable
  • To obtain the Brier score and time-dependent AUC

2. About your data (training set)

  • Please prepare the training data in the Data tab
  • Please prepare the survival object, Surv(time, event), in the Data tab
  • New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.


Output 1. Data Preview



                

Check full data in Data tab


Output 2. Model Results


Explanations
  • For each variable, estimated coefficients (coef), the statistic for the significance of a single variable, and P value are given.
  • The column marked 'z' gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta coefficient of a given variable is statistically significantly different from 0.
  • The coefficients relate to hazard; a positive coefficient indicates a worse prognosis, and a negative coefficient indicates a protective effect of the variable with which it is associated.
  • exp(coef) = hazard ratio (HR). HR = 1: no effect; HR < 1: reduction in the hazard; HR > 1: increase in the hazard
  • The output also gives the lower and upper 95% confidence limits for the hazard ratio (exp(coef)).
  • The likelihood-ratio test, Wald test, and score (log-rank) test give the global statistical significance of the model. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The likelihood-ratio test behaves better for small sample sizes, so it is generally preferred.
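The per-variable quantities described above can be reproduced by hand from a coefficient and its standard error. The sketch below assumes a normal approximation for the Wald test and a 95% interval based on the 1.96 quantile; the function name wald_summary and the numbers coef = -0.5, se = 0.25 are hypothetical and do not come from the example data.

```python
import math


def wald_summary(coef, se):
    """Wald z, two-sided p value, hazard ratio, and 95% CI for one variable."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p value
    return {
        "z": z,
        "p": p,
        "HR": math.exp(coef),
        "lower95": math.exp(coef - 1.96 * se),
        "upper95": math.exp(coef + 1.96 * se),
    }


# Hypothetical coefficient and standard error.
result = wald_summary(coef=-0.5, se=0.25)
for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

Here the negative coefficient gives a hazard ratio below 1, and the confidence interval is computed on the coefficient scale before exponentiating.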

                


Fitted values and residuals from the existing data


Explanations
  • The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
  • Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.
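For reference, AIC is defined as 2k - 2 log L, where k is the number of estimated parameters and log L is the maximized log-likelihood (for a Cox model this is usually the partial log-likelihood, which is an assumption about the app). The sketch below ranks three candidate models; the model labels and log-likelihood values are hypothetical.

```python
def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_lik


# Hypothetical candidates: (log-likelihood, number of parameters).
candidates = {
    "trt": (-850.2, 1),
    "trt + age": (-849.9, 2),
    "trt + age + trt:age": (-849.8, 3),
}

ranked = sorted(candidates, key=lambda name: aic(*candidates[name]))
for name in ranked:
    print(f"{name}: AIC = {aic(*candidates[name]):.1f}")
best = ranked[0]
```

In this toy comparison the extra terms improve the log-likelihood too little to offset the penalty of 2 per parameter, so the smallest model ranks first.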

Model selection suggested by AIC


                


Explanations
  • This plot presents the expected survival curves computed from the Cox model, separately for each subpopulation / stratum
  • If there is no strata() component, only a single curve is plotted: the average for the whole population

The adjusted survival curves from Cox regression


Explanations
  • Schoenfeld residuals are used to check the proportional hazards assumption
  • Under the proportional hazards (PH) assumption, Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption
  • If the test is not statistically significant (p > 0.05) for each of the independent variables, we can assume proportional hazards


Explanations

Plotting martingale residuals against a continuous independent variable is a common approach for detecting nonlinearity. For a given continuous covariate, patterns in the plot may suggest that the functional form of the variable is not properly specified. Martingale residuals can take any value in the range (-INF, +1):
  • A value of martingale residuals near 1 represents individuals that 'died too soon',
  • Large negative values correspond to individuals that 'lived too long'.
The deviance residual is a normalized transform of the martingale residual. These residuals should be roughly symmetrically distributed about zero with a standard deviation of 1.
  • Positive values correspond to individuals that 'died too soon' compared to expected survival times.
  • Negative values correspond to individuals that 'lived too long'.
  • Very large or small values are outliers, which are poorly predicted by the model.
Cox-Snell residuals are used to check for overall goodness of fit in survival models.
  • Cox-Snell residuals are equal to the -log(survival probability) for each observation
  • If the model fits the data well, Cox-Snell residuals should behave like a sample from an exponential distribution with a mean of 1
  • If the residuals act like a sample from a unit exponential distribution, they should lie along the 45-degree diagonal line.
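The three residuals are closely linked. A minimal sketch, using the standard definitions: the Cox-Snell residual is the estimated cumulative hazard H = -log(survival probability), the martingale residual is the event indicator minus H, and the deviance residual rescales the martingale residual. The function name residuals and the input values are hypothetical.

```python
import math


def residuals(delta, cum_hazard):
    """Return (Cox-Snell, martingale, deviance) residuals for one subject.

    delta      -- 1 if the event was observed, 0 if censored
    cum_hazard -- estimated cumulative hazard at the subject's observed time
    """
    cox_snell = cum_hazard
    martingale = delta - cox_snell
    # delta * log(delta - martingale) equals log(cum_hazard) when delta = 1
    # and is taken as 0 when delta = 0.
    log_term = math.log(cox_snell) if delta == 1 else 0.0
    inner = martingale + log_term
    # max() guards against tiny positive values caused by rounding.
    deviance = math.copysign(math.sqrt(max(0.0, -2.0 * inner)), martingale)
    return cox_snell, martingale, deviance


# Hypothetical subjects.
early_death = residuals(delta=1, cum_hazard=0.1)   # event despite low hazard
long_survivor = residuals(delta=0, cum_hazard=2.0)  # censored despite high hazard
print("died too soon: ", [round(x, 3) for x in early_death])
print("lived too long:", [round(x, 3) for x in long_survivor])
```

The first subject has a martingale residual close to 1 and a positive deviance residual; the second has large negative values for both.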

The residuals can be found in the Data Fitting tab.

Red points are those who 'died soon'; black points are those who 'lived long'

1. Martingale residuals plot against continuous independent variable

2. Deviance residuals plot by observational id

3. Cox-Snell residuals plot


Output 3. Prediction Results




The Brier score is used to evaluate the accuracy of a predicted survival function at given time points. It is the average squared distance between the observed survival status and the predicted survival probability and is always a number between 0 and 1, with 0 being the best possible value.

The Integrated Brier Score (IBS) provides an overall calculation of the model performance at all available times.
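The idea behind the Brier score at a single time point can be sketched as follows. This is a simplified illustration, not the app's estimator: subjects censored before the evaluation time are simply skipped, whereas estimators used in practice typically reweight for censoring. The function name brier_score and the toy values are hypothetical.

```python
def brier_score(t, times, events, pred_surv):
    """Mean squared gap between observed status at t and predicted S(t).

    Simplification: subjects censored before t are skipped because their
    status at t is unknown; no censoring weights are applied.
    """
    squared = []
    for ti, e, s in zip(times, events, pred_surv):
        if ti > t:
            alive = 1.0      # still event-free at time t
        elif e == 1:
            alive = 0.0      # event occurred at or before t
        else:
            continue         # censored before t
        squared.append((alive - s) ** 2)
    return sum(squared) / len(squared)


# Toy data (hypothetical): predicted survival probabilities at t = 4.
bs = brier_score(
    t=4,
    times=[2, 5, 8, 3],
    events=[1, 1, 0, 0],
    pred_surv=[0.2, 0.7, 0.9, 0.5],
)
print(f"Brier score at t = 4: {bs:.4f}")
```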

The default setting gives the time series 1, 2, ..., 10

Brier score at given time


Explanations

AUC here is the time-dependent AUC, which gives the AUC at each time point of a given series.
  • Chambless and Diao: assumes that lp and lpnew are the predictors of a Cox proportional hazards model. (Chambless, L. E. and G. Diao (2006). Estimation of time-dependent area under the ROC curve for long-term risk prediction. Statistics in Medicine 25, 3474–3486.)
  • Hung and Chiang: assumes that there is a one-to-one relationship between the predictor and the expected survival times conditional on the predictor. (Hung, H. and C.-T. Chiang (2010). Estimation methods for time-dependent AUC models with survival data. Canadian Journal of Statistics 38, 8–26.)
  • Song and Zhou: in this method, the estimators remain valid even if the censoring times depend on the values of the predictors. (Song, X. and X.-H. Zhou (2008). A semiparametric approach for the covariate specific ROC curve with survival outcome. Statistica Sinica 18, 947–965.)
  • Uno et al.: the estimators are based on inverse-probability-of-censoring weights and do not assume a specific working model for deriving the predictor lpnew. It is assumed that there is a one-to-one relationship between the predictor and the expected survival times conditional on the predictor. (Uno, H., T. Cai, L. Tian, and L. J. Wei (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association 102, 527–537.)
The example time series: 1, 2, 3, ...,10
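To show what an AUC at a given time measures, the sketch below computes a naive version: subjects with an event by time t are treated as cases, subjects still event-free after t as controls, and the AUC is the share of case-control pairs in which the case has the higher risk score. This is not any of the four estimators listed above, and it ignores censoring; the function name naive_auc_at and the toy values are hypothetical.

```python
def naive_auc_at(t, times, events, risk):
    """Share of (case, control) pairs ranked correctly by the risk score.

    Cases: event observed at or before t. Controls: observed time beyond t.
    Subjects censored before t are left out; no censoring weights are used.
    """
    cases = [r for ti, e, r in zip(times, events, risk) if ti <= t and e == 1]
    controls = [r for ti, r in zip(times, risk) if ti > t]
    concordant = 0.0
    for rc in cases:
        for rk in controls:
            if rc > rk:
                concordant += 1.0
            elif rc == rk:
                concordant += 0.5   # ties count as half
    return concordant / (len(cases) * len(controls))


# Toy data (hypothetical): risk is the model's linear predictor.
auc4 = naive_auc_at(
    t=4,
    times=[1, 2, 6, 7, 3],
    events=[1, 1, 1, 0, 0],
    risk=[2.0, 0.5, 1.0, 0.1, 0.8],
)
print(f"naive AUC at t = 4: {auc4:.2f}")
```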

Time dependent AUC at given time


Parametric Accelerated Failure Time (AFT) Model

The accelerated failure time (AFT) model is a parametric model which assumes that the effect of a covariate is to accelerate or decelerate the life course of a disease by some constant factor.
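The "constant" can be read as a rescaling of the time axis: if a covariate stretches survival time by a factor c, the survival curve becomes S1(t) = S0(t / c). The sketch below illustrates this with a Weibull baseline; the baseline shape and scale, the factor c = 2, and the function names are all hypothetical choices for illustration.

```python
import math


def weibull_survival(t, shape=1.5, scale=10.0):
    """Baseline survival S0(t) for a Weibull distribution (assumed values)."""
    return math.exp(-((t / scale) ** shape))


def aft_survival(t, accel):
    """Survival when time is stretched by the constant factor accel."""
    return weibull_survival(t / accel)


# Baseline median: the time at which S0(t) = 0.5.
baseline_median = 10.0 * math.log(2) ** (1 / 1.5)

# With accel = 2 the whole time scale is doubled, so the median doubles too.
print(f"baseline median: {baseline_median:.2f}")
print(f"S1 at twice the baseline median: {aft_survival(2 * baseline_median, 2.0):.3f}")
```

A factor above 1 slows the course of the disease (longer survival at every quantile); a factor below 1 speeds it up.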

1. Functionalities

  • To build an AFT model
  • To obtain the estimates of the model, such as coefficients of parameters, residuals, and diagnostic plots
  • To obtain fitted values predicted from the training data
  • To upload new data and get the prediction
  • To evaluate the model on new data that contain the dependent variable

2. About your data

  • Prepare the training data in the Data tab
  • Prepare the survival object, Surv(time, event), in the Data tab
  • New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.


Output 1. Data Preview



                

Check full data in Data tab


Output 2. Model Results



Explanations
  • For each variable, the estimated coefficient (Value), the statistic for the significance of a single variable, and the p value are given.
  • The column marked 'z' gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta coefficient of a given variable is statistically significantly different from 0.
  • The coefficients relate to the log of survival time, not to the hazard; a positive coefficient indicates longer survival (a protective effect), and a negative coefficient indicates shorter survival (a worse prognosis).
  • exp(Value) = time ratio (acceleration factor). exp(Value) = 1: no effect; exp(Value) > 1: longer survival time; exp(Value) < 1: shorter survival time
  • Scale and Log(scale) are the estimated parameters in the error term of the AFT model
  • The log-likelihood of the model is given. When models are fitted by maximum likelihood to the same data, the closer the log-likelihood (LL) is to zero, the better the model fit.
  • For left-truncated data, the time used here is the difference between end-time and start-time

                


Fitted values and residuals from the existing data


Explanations
  • The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
  • Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.

Model selection suggested by AIC


                


Explanations

Plotting martingale residuals against a continuous independent variable is a common approach for detecting nonlinearity. For a given continuous covariate, patterns in the plot may suggest that the functional form of the variable is not properly specified. Martingale residuals can take any value in the range (-INF, +1):
  • A value of martingale residuals near 1 represents individuals that 'died too soon',
  • Large negative values correspond to individuals that 'lived too long'.
The deviance residual is a normalized transform of the martingale residual. These residuals should be roughly symmetrically distributed about zero with a standard deviation of 1.
  • Positive values correspond to individuals that 'died too soon' compared to expected survival times.
  • Negative values correspond to individuals that 'lived too long'.
  • Very large or small values are outliers, which are poorly predicted by the model.
Cox-Snell residuals are used to check for overall goodness of fit in survival models.
  • Cox-Snell residuals are equal to the -log(survival probability) for each observation
  • If the model fits the data well, Cox-Snell residuals should behave like a sample from an exponential distribution with a mean of 1
  • If the residuals act like a sample from a unit exponential distribution, they should lie along the 45-degree diagonal line.

The residuals can be found in the Data Fitting tab.

Red points are those who 'died soon'; black points are those who 'lived long'

1. Martingale residuals plot against continuous independent variable

2. Deviance residuals plot by observational id

3. Cox-Snell residuals plot


Output 3. Prediction Results




The predicted survival probability of the N-th observation