To upload a data file, preview the data set, and check that the data were read in correctly
To pre-process some variables (when necessary) before building the model
To obtain basic descriptive statistics and plots of the variables
To prepare the survival object that serves as the dependent variable in the model
2. About your data (training set)
Your data need to include one survival time variable, one 1/0 censoring variable, and at least one independent variable (denoted as X)
Your data need to have more rows than columns
Do not mix character and numbers in the same column
The data used to build a model is called a training set
Case Example 1: Right-censored diabetes data
Suppose that, in a study, we obtained observations from a trial of laser coagulation for the treatment of diabetic retinopathy.
Each patient had one eye randomized to laser treatment, and the other eye received no treatment.
For each eye, the outcome of interest was the time from initiation of treatment to the time when visual acuity dropped below 5/200 at two consecutive visits.
Thus there is a built-in lag time of approximately 6 months (visits were every 3 months).
Therefore, survival times in this dataset are the actual time to blindness in months, minus the minimum possible time to event (6.5 months).
Censoring status: 0 = censored, 1 = visual loss. Treatment: 0 = no treatment, 1 = laser. Age is the age at diagnosis.
Case Example 2: Left-truncated right-censored Nki70 data
Suppose we wanted to study metastasis-free survival in 100 lymph node positive breast cancer patients. Some patients entered the study later than others.
The data contained 5 clinical risk factors: (1) Diam: diameter of the tumor; (2) N: number of affected lymph nodes; (3) ER: estrogen receptor status; (4) Grade: grade of the tumor; and (5) Age: patient age at diagnosis (years),
as well as gene expression measurements of 70 genes found to be prognostic for metastasis-free survival in an earlier study.
The time variable is metastasis-free follow-up time (months). Censoring indicator variable: 1 = metastasis or death; 0 = censored.
We wanted to explore the association between survival time and the independent variables.
Please follow the Steps; Outputs will show the analytical results in real time. After the data are ready, find the models in the following tabs.
Histogram: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.
Density plot: to show the distribution of a variable
Histogram
When the number of bins is 0, the plot uses the default number of bins
Density plot
Non-Parametric Kaplan-Meier Estimator and Log-rank Test
The Kaplan-Meier estimator, also known as the product-limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data.
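The product-limit calculation can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used by this tool: the function name kaplan_meier and the toy data are made up for the example. At each distinct event time the running survival probability is multiplied by (1 - deaths / number at risk).

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up times
    events : 1 = event observed, 0 = censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        # Both events and censored subjects leave the risk set after t.
        at_risk -= removed
    return curve


# Toy data (hypothetical): six subjects, two of them censored.
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 4))
```

Censored subjects never cause a drop in the curve; they only reduce the number at risk for later event times.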
The log-rank test is a hypothesis test to compare the survival distributions of two samples. It compares estimates of the hazard functions of the two groups at each observed event time.
1. Functionalities
To obtain the Kaplan-Meier survival probability estimate
To obtain Kaplan-Meier survival curves, cumulative event distribution curves, and cumulative hazard curves by a group variable
To conduct a log-rank test to compare the survival curves of 2 groups
To conduct pairwise log-rank tests to compare the survival curves of more than two groups
2. About your data
Prepare the survival object in the Data tab
A categorical variable is required in this model
Please follow the Steps to build the model, and click Outputs to get analytical results.
This implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^rho, where S is the Kaplan-Meier estimate of survival.
rho = 0: log-rank or Mantel-Haenszel test
rho = 1: Peto & Peto modification of the Gehan-Wilcoxon test.
p < 0.05 indicates the curves are significantly different in the survival probabilities
p >= 0.05 indicates the curves are NOT significantly different in the survival probabilities
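A minimal sketch of the two-group G-rho test may help to show how rho changes the weights. It is not the implementation used by this tool. The function name g_rho_test and the toy data are hypothetical, and the weight at each event time is assumed to be the pooled Kaplan-Meier estimate just before that time, raised to the power rho. The p value comes from a chi-square distribution with 1 degree of freedom.

```python
import math


def g_rho_test(times, events, groups, rho=0.0):
    """Two-group G-rho test; returns (chi_square, p_value)."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("exactly two groups are required")
    first = labels[0]
    rows = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in rows if e == 1})
    surv = 1.0  # pooled Kaplan-Meier estimate just before the current time
    u = 0.0     # weighted sum of (observed - expected) events in group 1
    var = 0.0   # variance of u under the null hypothesis
    for t in event_times:
        n = sum(1 for x, _, _ in rows if x >= t)
        n1 = sum(1 for x, _, g in rows if x >= t and g == first)
        d = sum(1 for x, e, _ in rows if x == t and e == 1)
        d1 = sum(1 for x, e, g in rows if x == t and e == 1 and g == first)
        w = surv ** rho  # rho = 0 gives weight 1 everywhere (log-rank)
        u += w * (d1 - d * n1 / n)
        if n > 1:
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        surv *= 1.0 - d / n
    if var == 0.0:
        raise ValueError("test statistic is undefined for these data")
    chi2 = u * u / var
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 df
    return chi2, p_value


# Toy data (hypothetical): group "a" fails early, group "b" fails late.
t = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
e = [1] * 10
g = ["a"] * 5 + ["b"] * 5
print(g_rho_test(t, e, g, rho=0))  # log-rank
print(g_rho_test(t, e, g, rho=1))  # Peto & Peto type weights
```

With rho = 1, early event times get more weight because the survival estimate is still close to 1 there; later differences between the curves count for less.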
Log-rank Test Result
In this example, we could not find a statistically significant difference between the 2 laser groups (p = 0.8).
Also, from the Kaplan-Meier plot, we can see that the survival curves of the 2 laser groups cross each other.
Explanations
This implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^rho, where S is the Kaplan-Meier estimate of survival.
rho = 0: log-rank or Mantel-Haenszel test
rho = 1: Peto & Peto modification of the Gehan-Wilcoxon test.
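Bonferroni correction is a generic but very conservative approach
Bonferroni-Holm is less conservative and uniformly more powerful than Bonferroni
False Discovery Rate-BH, developed by Benjamini and Hochberg, controls the false discovery rate rather than the family-wise error rate and is more powerful than the Bonferroni-type corrections
False Discovery Rate-BY, developed by Benjamini and Yekutieli, also controls the false discovery rate; it is more conservative than BH but remains valid under arbitrary dependence between the tests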
p < 0.05 indicates the curves are significantly different in the survival probabilities
p >= 0.05 indicates the curves are NOT significantly different in the survival probabilities
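The four corrections can be compared on a small set of p values. The sketch below uses the standard formulas for adjusted p values; the function name adjust_pvalues and the three input p values are hypothetical and are not output from this tool.

```python
def adjust_pvalues(pvals, method="holm"):
    """Return adjusted p values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    if method == "bonferroni":
        return [min(1.0, p * m) for p in pvals]
    if method == "holm":
        running = 0.0
        for rank, i in enumerate(order):
            # Step-down: multiply by the number of hypotheses still in play.
            running = max(running, (m - rank) * pvals[i])
            adjusted[i] = min(1.0, running)
        return adjusted
    if method in ("BH", "BY"):
        # BY adds the harmonic-sum penalty that covers arbitrary dependence.
        c = sum(1.0 / k for k in range(1, m + 1)) if method == "BY" else 1.0
        running = 1.0
        for rank in range(m - 1, -1, -1):
            i = order[rank]
            running = min(running, c * m / (rank + 1) * pvals[i])
            adjusted[i] = running
        return adjusted
    raise ValueError("unknown method: " + method)


# Hypothetical pairwise log-rank p values for three group comparisons.
raw = [0.01, 0.04, 0.03]
for name in ("bonferroni", "holm", "BH", "BY"):
    print(name, [round(p, 4) for p in adjust_pvalues(raw, name)])
```

On these inputs the adjusted values shrink from Bonferroni to Holm to BH, while BY falls between BH and Bonferroni, which matches the ordering of conservativeness described above.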
Pairwise Log-rank Test P Value Table
Semi-Parametric Cox Regression
Cox regression, also known as Cox proportional hazards regression, is a semi-parametric model: if the proportional hazards assumption holds (or is assumed to hold), the effect parameters can be estimated without specifying the baseline hazard function.
Cox regression assumes that the effects of the predictor variables on survival are constant over time and are additive on the log-hazard scale.
1. Functionalities
To build a Cox regression model
To obtain the estimates of the model, such as (1) coefficient estimates, (2) predictions from the training data, (3) residuals, (4) the adjusted survival curves, (5) the proportional hazards test, and (6) diagnostic plots
To upload new data and get the prediction
To evaluate the model on new data that contain a new dependent variable
To obtain the Brier score and time-dependent AUC
2. About your data (training set)
Please prepare the training data in the Data tab
Please prepare the survival object, Surv(time, event), in the Data tab
New data (test set) should cover all the independent variables used in the model.
Please follow the Steps to build the model, and click Outputs to get analytical results.
For each variable, the estimated coefficient (coef), the test statistic for the significance of that variable, and the P value are given.
The column marked 'z' gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta coefficient of a given variable is statistically significantly different from 0.
The coefficients relate to hazard; a positive coefficient indicates a worse prognosis, and a negative coefficient indicates a protective effect of the variable with which it is associated.
exp(coef) = hazard ratio (HR). HR = 1: no effect; HR < 1: reduction in the hazard; HR > 1: increase in the hazard
The output also gives the lower and upper 95% confidence limits for the hazard ratio (exp(coef)).
The likelihood-ratio test, Wald test, and score (log-rank) test give the global statistical significance of the model. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The likelihood-ratio test behaves better for small sample sizes, so it is generally preferred.
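The quantities in the coefficient table are linked by simple formulas, sketched below. The function name wald_summary and the input values (coef = -0.5, se = 0.2) are hypothetical and do not come from any example data set. The 95% limits assume the usual normal approximation.

```python
import math


def wald_summary(coef, se, z_crit=1.959964):
    """Wald z, two-sided p value, hazard ratio and its 95% limits."""
    z = coef / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return {
        "z": z,
        "p": p_value,
        "HR": math.exp(coef),
        "lower95": math.exp(coef - z_crit * se),
        "upper95": math.exp(coef + z_crit * se),
    }


# Hypothetical coefficient and standard error for one variable.
for name, value in wald_summary(coef=-0.5, se=0.2).items():
    print(name, round(value, 4))
```

The confidence limits are computed on the coefficient scale and then exponentiated, so the interval for the hazard ratio is not symmetric around HR.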
Fitted values and residuals from the existing data
Explanations
The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.
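As a small illustration of the ranking, AIC = 2k - 2 log-likelihood, where k is the number of estimated parameters. The candidate models and log-likelihood values below are invented for the example. For a Cox model the partial log-likelihood would take the place of the log-likelihood.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidates: (log-likelihood, number of parameters).
candidates = {
    "age": (-100.0, 1),
    "age + trt": (-97.5, 2),
    "age + trt + sex": (-97.2, 3),
}
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
for name in sorted(scores, key=scores.get):
    print(name, round(scores[name], 1))
print("lowest AIC:", best)
```

Here the third variable improves the log-likelihood only slightly, so the penalty of 2 per extra parameter makes the two-variable model rank first.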
Model selection suggested by AIC
Explanations
This plot presents the expected survival curves, calculated from the Cox model, separately for each subpopulation / stratum
If there is no strata() component, only a single curve is plotted: the average for the whole population
The adjusted survival curves from Cox regression
Explanations
Schoenfeld residuals are used to check the proportional hazards (PH) assumption
Under the PH assumption, Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption
If the test is not statistically significant (p > 0.05) for every independent variable, there is no evidence against the proportional hazards assumption
Explanations
Plotting martingale residuals against a continuous independent variable is a common way to detect nonlinearity. For a given continuous covariate, patterns in the plot may suggest that the functional form of the variable is not modeled properly.
Martingale residuals may take any value in the range (-∞, +1):
A value of martingale residuals near 1 represents individuals that 'died too soon',
Large negative values correspond to individuals that 'lived too long'.
Deviance residual is a normalized transform of the martingale residual. These residuals should be roughly symmetrically distributed about zero with a standard deviation of 1.
Positive values correspond to individuals that 'died too soon' compared to expected survival times.
Negative values correspond to individuals that 'lived too long'.
Very large or small values are outliers, which are poorly predicted by the model.
Cox-Snell residuals are used to check for overall goodness of fit in survival models.
Cox-Snell residuals are equal to the -log(survival probability) for each observation
If the model fits the data well, Cox-Snell residuals should behave like a sample from an exponential distribution with a mean of 1
If the residuals act like a sample from a unit exponential distribution, the plot of their estimated cumulative hazard against the residuals should lie along the 45-degree diagonal line.
The residuals can be found in Data Fitting tab.
Red points are those who 'died soon'; black points are those who 'lived long'
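The three residuals described above are related through the model-based survival probability of each observation. A minimal sketch, using the textbook formulas and made-up inputs (the function name survival_residuals is hypothetical): Cox-Snell = -log(S), martingale = event - Cox-Snell, and deviance = sign(martingale) * sqrt(-2 * (martingale + event * log(event - martingale))).

```python
import math


def survival_residuals(surv_prob, event):
    """Cox-Snell, martingale and deviance residuals for one observation.

    surv_prob : model-based survival probability at the observed time (0 < S < 1)
    event     : 1 = event observed, 0 = censored
    """
    if not 0.0 < surv_prob < 1.0:
        raise ValueError("surv_prob must lie strictly between 0 and 1")
    cox_snell = -math.log(surv_prob)
    martingale = event - cox_snell
    # For an event, event - martingale equals the Cox-Snell residual.
    log_term = math.log(cox_snell) if event == 1 else 0.0
    inner = max(0.0, -2.0 * (martingale + log_term))
    deviance = math.copysign(math.sqrt(inner), martingale)
    return cox_snell, martingale, deviance


# Hypothetical observations.
early_death = survival_residuals(surv_prob=0.9, event=1)    # 'died too soon'
long_survivor = survival_residuals(surv_prob=0.2, event=0)  # 'lived too long'
print([round(r, 4) for r in early_death])
print([round(r, 4) for r in long_survivor])
```

The first observation had an event although the model still predicted 90% survival, so its martingale residual is close to 1; the second was still event-free when the model predicted only 20% survival, so its residuals are clearly negative.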
1. Martingale residuals plot against continuous independent variable
The Brier score is used to evaluate the accuracy of a predicted survival function at given time points.
It is the average squared distance between the observed survival status and the predicted survival probability, and is always a number between 0 and 1,
with 0 being the best possible value.
The Integrated Brier Score (IBS) summarizes model performance over all available time points.
The default setting uses the time points 1, 2, ..., 10
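The definition can be illustrated with a simplified sketch that assumes every subject is observed until the event, so no censoring weights are needed. Real implementations usually weight by the inverse probability of censoring, which is left out here. The function name brier_score and all values are hypothetical.

```python
def brier_score(event_times, predicted_surv, t):
    """Brier score at time t, assuming NO censoring.

    event_times    : observed event time for each subject
    predicted_surv : predicted probability of surviving beyond t, per subject
    """
    total = 0.0
    for time, s_hat in zip(event_times, predicted_surv):
        alive = 1.0 if time > t else 0.0  # observed status at time t
        total += (alive - s_hat) ** 2
    return total / len(event_times)


# Hypothetical event times and predicted survival probabilities at t = 6.
times = [2, 5, 8, 12]
print(brier_score(times, [0.1, 0.4, 0.7, 0.9], t=6))  # informative predictions
print(brier_score(times, [0.5, 0.5, 0.5, 0.5], t=6))  # uninformative: 0.25
print(brier_score(times, [0.0, 0.0, 1.0, 1.0], t=6))  # perfect: 0.0
```

A model that always predicts 0.5 scores 0.25, so values well below 0.25 indicate useful predictions.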
Brier score at given time
Explanations
AUC here is the time-dependent AUC, which gives the AUC at given time points.
Chambless and Diao: assumes that lp and lpnew are the predictors of a Cox proportional hazards model.
(Chambless, L. E. and G. Diao (2006). Estimation of time-dependent area under the ROC curve for long-term risk prediction. Statistics in Medicine 25, 3474–3486.)
Hung and Chiang: assumes that there is a one-to-one relationship between the predictor and the expected survival times conditional on the predictor.
(Hung, H. and C.-T. Chiang (2010). Estimation methods for time-dependent AUC models with survival data. Canadian Journal of Statistics 38, 8–26.)
Song and Zhou: in this method, the estimators remain valid even if the censoring times depend on the values of the predictors.
(Song, X. and X.-H. Zhou (2008). A semiparametric approach for the covariate specific ROC curve with survival outcome. Statistica Sinica 18, 947–965.)
Uno et al.: the estimators are based on inverse-probability-of-censoring weights and do not assume a specific working model for deriving the predictor lpnew.
It is assumed, however, that there is a one-to-one relationship between the predictor and the expected survival times conditional on the predictor.
(Uno, H., T. Cai, L. Tian, and L. J. Wei (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association 102, 527–537.)
The example time points: 1, 2, 3, ..., 10
Time dependent AUC at given time
Parametric Accelerated Failure Time (AFT) Model
The accelerated failure time (AFT) model is a parametric model that assumes that the effect of a covariate is to accelerate or decelerate the life course of a disease by a constant factor.
1. Functionalities
To build an AFT model
To obtain the estimates of the model, such as the coefficients, residuals, and diagnostic plots
To obtain fitted values, which are predicted from the training data
To upload new data and get the prediction
To evaluate the model on new data that contain a new dependent variable
2. About your data
Prepare the training data in the Data tab
Prepare the survival object, Surv(time, event), in the Data tab
New data (test set) should cover all the independent variables used in the model.
Please follow the Steps to build the model, and click Outputs to get analytical results.
For each variable, the estimated coefficient (Value), the test statistic for the significance of that variable, and the p value are given.
The column marked 'z' gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta coefficient of a given variable is statistically significantly different from 0.
In an AFT model, the coefficients relate to the log of the survival time, not to the hazard; a positive coefficient indicates longer survival (a better prognosis), and a negative coefficient indicates shorter survival.
exp(Value) = time ratio (TR), also called the acceleration factor. TR = 1: no effect; TR < 1: shorter survival time; TR > 1: longer survival time
Scale and Log(scale) are the estimated parameters in the error term of the AFT model
The log-likelihood of the fitted model is also given. Among models fitted to the same data by maximum likelihood, a higher log-likelihood indicates a better fit.
For left-truncated data, the time used here is the difference between the end time and the start time
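In the log-linear form of an AFT model, log(T) = b0 + b1*x + scale*error, a coefficient acts multiplicatively on survival time. The sketch below shows the conversion under that form. The function name aft_effects and the input values are hypothetical, and the hazard-ratio conversion holds only under the added assumption of a Weibull distribution.

```python
import math


def aft_effects(coef, scale):
    """Time ratio for any log-linear AFT model, and the implied
    hazard ratio under a Weibull distribution (assumption)."""
    time_ratio = math.exp(coef)           # multiplies survival time per unit of x
    weibull_hr = math.exp(-coef / scale)  # valid for the Weibull AFT model only
    return time_ratio, weibull_hr


# Hypothetical coefficient and scale estimate.
tr, hr = aft_effects(coef=0.3, scale=0.5)
print("time ratio:", round(tr, 4))
print("Weibull hazard ratio:", round(hr, 4))
```

A positive coefficient gives a time ratio above 1 and, under the Weibull assumption, a hazard ratio below 1: both describe longer survival.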
Fitted values and residuals from the existing data
Explanations
The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.
Model selection suggested by AIC
Explanations
Plotting martingale residuals against a continuous independent variable is a common way to detect nonlinearity. For a given continuous covariate, patterns in the plot may suggest that the functional form of the variable is not modeled properly.
Martingale residuals may take any value in the range (-∞, +1):
A value of martingale residuals near 1 represents individuals that 'died too soon',
Large negative values correspond to individuals that 'lived too long'.
Deviance residual is a normalized transform of the martingale residual. These residuals should be roughly symmetrically distributed about zero with a standard deviation of 1.
Positive values correspond to individuals that 'died too soon' compared to expected survival times.
Negative values correspond to individuals that 'lived too long'.
Very large or small values are outliers, which are poorly predicted by the model.
Cox-Snell residuals are used to check for overall goodness of fit in survival models.
Cox-Snell residuals are equal to the -log(survival probability) for each observation
If the model fits the data well, Cox-Snell residuals should behave like a sample from an exponential distribution with a mean of 1
If the residuals act like a sample from a unit exponential distribution, the plot of their estimated cumulative hazard against the residuals should lie along the 45-degree diagonal line.
The residuals can be found in Data Fitting tab.
Red points are those who 'died soon'; black points are those who 'lived long'
1. Martingale residuals plot against continuous independent variable