- To upload a data file, preview the data set, and check that the data were entered correctly
- To pre-process some variables (when necessary) for building the model
- To obtain basic descriptive statistics and plots of the variables
- To prepare the **survival object**, which serves as the **dependent variable** in the model

- Your data need to include **one survival time variable and one 1/0 censoring variable** and **at least one independent variable (denoted as X)**
- Your data need to have more rows than columns
- Do not mix characters and numbers in the same column
- The data used to build a model are called a **training set**

We wanted to explore the association between survival time and the independent variables.


**Data Preview**

**1. Numeric variable information list**

**2. Categorical variable information list**

**Histogram**: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

**Density plot**: to show the distribution of a variable

**Histogram**

When the number of bins is 0, the plot uses the default number of bins.

**Density plot**

**Kaplan–Meier estimator**, also known as the product-limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data.
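The product-limit calculation can be illustrated with a minimal Python sketch. The function name `kaplan_meier` and the toy data are assumptions made for illustration only, and are not the code used by this tool; events are coded 1 = event observed, 0 = censored.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if the observation was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)          # events and censorings at each time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # product-limit step: multiply by the fraction surviving time t
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        # everyone observed at time t leaves the risk set afterwards
        at_risk -= leaving[t]
    return curve

# Hypothetical toy data
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Censored observations never lower the curve directly; they only shrink the risk set used at later event times.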

The **log-rank test** is a hypothesis test to compare the survival distributions of two samples. It compares estimates of the hazard functions of the two groups at each observed event time.

- To obtain Kaplan-Meier survival probability estimates
- To obtain Kaplan-Meier survival curves, cumulative event distribution curves, and cumulative hazard curves by a group variable
- To conduct a log-rank test to compare the survival curves of 2 groups
- To conduct a pairwise log-rank test to compare the survival curves of more than two groups

- Prepare the survival object in the Data tab
- A categorical variable is required in this model

Check full data in Data tab

**Kaplan-Meier survival probability by group**

This implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^{rho}, where S is the Kaplan-Meier estimate of survival.

- rho = 0: log-rank or Mantel-Haenszel test
- rho = 1: Peto & Peto modification of the Gehan-Wilcoxon test.
- p < 0.05 indicates the curves are significantly different in the survival probabilities
- p >= 0.05 indicates the curves are NOT significantly different in the survival probabilities
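As a rough illustration of how the weighted test statistic is formed for two groups, here is a minimal Python sketch. The function name `g_rho_test` and the toy data are hypothetical. The sketch assumes the weight at each event time is the pooled Kaplan-Meier estimate just before that time raised to the power rho, and that the p value comes from a chi-square distribution with 1 degree of freedom.

```python
import math

def g_rho_test(times, events, groups, rho=0.0):
    """Two-group G-rho test. Returns (chi-square statistic, p value).

    rho = 0 gives the log-rank test; rho = 1 gives the Peto & Peto
    modification of the Gehan-Wilcoxon test.
    """
    labels = sorted(set(groups))
    assert len(labels) == 2, "this sketch handles exactly two groups"
    first = labels[0]
    surv_prev = 1.0   # pooled Kaplan-Meier estimate just before the current time
    num = 0.0         # weighted sum of (observed - expected) events in group 1
    var = 0.0         # weighted hypergeometric variance
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        rows = list(zip(times, events, groups))
        n = sum(1 for x, _, _ in rows if x >= t)
        n1 = sum(1 for x, _, g in rows if x >= t and g == first)
        d = sum(1 for x, e, _ in rows if x == t and e == 1)
        d1 = sum(1 for x, e, g in rows if x == t and e == 1 and g == first)
        w = surv_prev ** rho
        num += w * (d1 - d * n1 / n)
        if n > 1:
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        surv_prev *= 1.0 - d / n
    chi2 = num * num / var
    # upper tail of a chi-square distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Hypothetical toy data: group A fails early, group B fails late
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
events = [1] * 10
groups = ["A"] * 5 + ["B"] * 5
chi2, p = g_rho_test(times, events, groups, rho=0.0)
print(round(chi2, 2), p < 0.05)
```

With rho = 1, early event times receive more weight than late ones, which makes the test more sensitive to early differences between the curves.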

**Log-rank Test Result**

*In this example, we found no statistically significant difference between the 2 laser groups (p = 0.8).
The Kaplan-Meier plot also shows that the survival curves of the 2 laser groups cross each other.*

This implements the G-rho family of Harrington and Fleming (1982), with weights on each death of S(t)^{rho}, where S is the Kaplan-Meier estimate of survival.

- **rho = 0:** log-rank or Mantel-Haenszel test
- **rho = 1:** Peto & Peto modification of the Gehan-Wilcoxon test
- **Bonferroni** correction is a generic but very conservative approach
- **Bonferroni-Holm** is less conservative and uniformly more powerful than Bonferroni
- **False Discovery Rate-BH**, developed by Benjamini and Hochberg, controls the false discovery rate and is more powerful than the two family-wise methods above
- **False Discovery Rate-BY**, developed by Benjamini and Yekutieli, controls the false discovery rate under arbitrary dependence and is more conservative than BH
- p < 0.05 indicates the curves are significantly different in the survival probabilities
- p >= 0.05 indicates the curves are NOT significantly different in the survival probabilities
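To make the difference between the correction methods concrete, the following Python sketch adjusts a small set of p values with Bonferroni, Holm, and Benjamini-Hochberg. The function names and the three p values are hypothetical and chosen only for illustration.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Step-down Holm adjustment (controls the family-wise error rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # the smallest p value is multiplied by m, the next by m - 1, ...
        running = max(running, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running)
    return adjusted

def benjamini_hochberg(p_values):
    """Step-up Benjamini-Hochberg adjustment (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of this p value in ascending order
        running = min(running, p_values[i] * m / rank)
        adjusted[i] = running
    return adjusted

# Hypothetical p values from three pairwise comparisons
raw = [0.01, 0.04, 0.03]
print(bonferroni(raw))
print(holm(raw))
print(benjamini_hochberg(raw))
```

In this toy example the Bonferroni-adjusted values are the largest and the Benjamini-Hochberg values the smallest, which matches the ordering from most to least conservative.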

**Pairwise Log-rank Test P Value Table**

**Cox regression**, also known as Cox proportional hazards regression, is a semi-parametric model: if the proportional hazards assumption holds (or is assumed to hold), the effect parameters can be estimated without specifying the baseline hazard function.
Cox regression assumes that the effects of the predictor variables on survival are constant over time and additive on the log-hazard scale.

- To build a Cox regression model
- To obtain the estimates of the model, such as (1) coefficient estimates, (2) predictions from the training data, (3) residuals, (4) the adjusted survival curves, (5) the proportional hazards test, and (6) diagnostic plots
- To upload new data and get the prediction
- To evaluate the model on new data that contain the dependent variable
- To obtain the Brier score and time-dependent AUC

- Please prepare the training data in the Data tab
- Please prepare the survival object, Surv(time, event), in the Data tab
- New data (test set) should cover all the independent variables used in the model.

Check full data in Data tab

- For each variable, estimated coefficients (coef), the statistic for the significance of a single variable, and P value are given.
- The column marked 'z' gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta coefficient of a given variable is statistically significantly different from 0.
- The coefficients relate to hazard; a positive coefficient indicates a worse prognosis, and a negative coefficient indicates a protective effect of the variable with which it is associated.
- exp(coef) = hazard ratio (HR). HR = 1: no effect; HR < 1: reduction in the hazard; HR > 1: increase in the hazard
- The output also gives the lower and upper 95% confidence limits for the hazard ratio (exp(coef)).
- The likelihood-ratio test, Wald test, and score (log-rank) test give the global statistical significance of the model. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The likelihood-ratio test behaves better for small sample sizes, so it is generally preferred.
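The relationship between the columns of the coefficient table can be sketched in Python as follows. The function name `wald_summary` and the example coefficient and standard error are hypothetical; the 95% interval uses the normal quantile 1.959964.

```python
import math

def wald_summary(coef, se, z_crit=1.959964):
    """Derive z, two-sided p value, hazard ratio, and 95% CI from coef and se."""
    z = coef / se
    # two-sided p value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return {
        "z": z,
        "p": p,
        "HR": math.exp(coef),
        "lower95": math.exp(coef - z_crit * se),
        "upper95": math.exp(coef + z_crit * se),
    }

# Hypothetical coefficient and standard error
row = wald_summary(coef=0.5, se=0.25)
print({k: round(v, 4) for k, v in row.items()})
```

A confidence interval for the hazard ratio that excludes 1 corresponds to a Wald p value below 0.05.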

**Fitted values and residuals from the existing data**

- The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
- Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.
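As a small illustration of the ranking, AIC is computed as 2k - 2 log-likelihood, where k is the number of estimated parameters. The candidate models and log-likelihood values below are made up for illustration and do not come from any real fit.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidate models: (log-likelihood, number of parameters)
candidates = {
    "age": (-120.4, 1),
    "age + sex": (-117.9, 2),
    "age + sex + bmi": (-117.6, 3),
}
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("lowest AIC:", best)
```

Here the third variable improves the log-likelihood too little to offset the penalty for the extra parameter, so the two-variable model has the lowest AIC.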

**Model selection suggested by AIC**

- This plot presents the expected survival curves calculated from the Cox model, separately for each subpopulation / stratum
- If there is no strata() component, only a single curve is plotted: the average for the whole population

**The adjusted survival curves from Cox regression**

- Schoenfeld residuals are used to check the proportional hazards (PH) assumption
- Under the PH assumption, Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of a violation of the PH assumption
- If the test is not statistically significant (p > 0.05) for each of the independent variables, we can assume proportional hazards

**Explanations**

- A value of martingale residuals near 1 represents individuals that 'died too soon',
- Large negative values correspond to individuals that 'lived too long'.

- Positive deviance residuals correspond to individuals that 'died too soon' compared to expected survival times.
- Negative values correspond to individuals that 'lived too long'.
- Very large or small values are outliers, which are poorly predicted by the model.

- Cox-Snell residuals are equal to the -log(survival probability) for each observation
- If the model fits the data well, Cox-Snell residuals should behave like a sample from an exponential distribution with a mean of 1
- If the residuals act like a sample from a unit exponential distribution, they should lie along the 45-degree diagonal line.
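The three residual types described above are closely related and can be derived from the predicted survival probability at each subject's observed time. The Python sketch below uses the standard definitions (Cox-Snell r = -log S, martingale M = event - r, and deviance as a signed transformation of M). The function name `survival_residuals` and the input values are hypothetical, and the sketch assumes 0 < survival probability < 1.

```python
import math

def survival_residuals(surv_prob, event):
    """Return (Cox-Snell, martingale, deviance) residuals for one subject.

    surv_prob : predicted survival probability at the observed time (0 < p < 1)
    event     : 1 if the event was observed, 0 if censored
    """
    cox_snell = -math.log(surv_prob)
    martingale = event - cox_snell
    # deviance residual: sign(M) * sqrt(-2 * [M + event * log(event - M)])
    inner = martingale
    if event == 1:
        inner += math.log(event - martingale)  # event - M equals cox_snell
    deviance = math.copysign(math.sqrt(max(0.0, -2.0 * inner)), martingale)
    return cox_snell, martingale, deviance

# Hypothetical subjects
early_death = survival_residuals(surv_prob=0.9, event=1)    # died when survival was likely
late_censored = survival_residuals(surv_prob=0.2, event=0)  # still alive when death was likely
print([round(x, 3) for x in early_death])
print([round(x, 3) for x in late_censored])
```

The first subject gets a martingale residual close to 1 ('died too soon'), while the second gets a large negative value ('lived too long'); the deviance residual keeps the same sign but is more symmetric around 0.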

The residuals can be found in the Data Fitting tab.

Red points are those who 'died soon'; black points are those who 'lived long'

**1. Martingale residuals plot against continuous independent variable**

**2. Deviance residuals plot by observational id**

**3. Cox-Snell residuals plot**

The Brier score is used to evaluate the accuracy of a predicted survival function at given time points. It represents the average squared distance between the observed survival status and the predicted survival probability and is always a number between 0 and 1, with 0 being the best possible value.

The Integrated Brier Score (IBS) provides an overall calculation of the model performance at all available times.
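The idea behind the Brier score at a single time point can be sketched in Python as below. This is a deliberately simplified version: it skips subjects censored before the evaluation time, whereas practical implementations typically use inverse-probability-of-censoring weights instead. The function name `brier_score` and the data are hypothetical.

```python
def brier_score(times, events, pred_surv, t):
    """Simplified Brier score at time t.

    pred_surv[i] is the predicted probability that subject i survives beyond t.
    Subjects censored before t have an unknown status at t and are skipped
    (a simplification; no censoring weights are applied).
    """
    squared_errors = []
    for time, event, s in zip(times, events, pred_surv):
        if time > t:
            squared_errors.append((1.0 - s) ** 2)  # known to be event-free at t
        elif event == 1:
            squared_errors.append((0.0 - s) ** 2)  # event occurred by t
    return sum(squared_errors) / len(squared_errors)

# Hypothetical data: the fourth subject is censored at time 3, before t = 4
times = [2, 5, 7, 3]
events = [1, 1, 0, 0]
pred_surv = [0.2, 0.8, 0.9, 0.5]
score = brier_score(times, events, pred_surv, t=4)
print(round(score, 4))
```

Averaging such scores over a grid of time points gives a rough analogue of the integrated score.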

*The default setting gives the time points 1, 2, ..., 10*

**Brier score at given time**

**Explanations**

- Chambless and Diao: assumes that lp and lpnew are the predictors of a Cox proportional hazards model. (Chambless, L. E. and G. Diao (2006). Estimation of time-dependent area under the ROC curve for long-term risk prediction. Statistics in Medicine 25, 3474–3486.)
- Hung and Chiang: assumes that there is a one-to-one relationship between the predictor and the expected survival times conditional on the predictor. (Hung, H. and C.-T. Chiang (2010). Estimation methods for time-dependent AUC models with survival data. Canadian Journal of Statistics 38, 8–26.)
- Song and Zhou: in this method, the estimators remain valid even if the censoring times depend on the values of the predictors. (Song, X. and X.-H. Zhou (2008). A semiparametric approach for the covariate specific ROC curve with survival outcome. Statistica Sinica 18, 947–965.)
- Uno et al.: the estimators are based on inverse-probability-of-censoring weights and do not assume a specific working model for deriving the predictor lpnew. It is assumed that there is a one-to-one relationship between the predictor and the expected survival times conditional on the predictor. (Uno, H., T. Cai, L. Tian, and L. J. Wei (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association 102, 527–537.)

**Time dependent AUC at given time**

**Accelerated failure time (AFT) model** is a parametric model that assumes that the effect of a covariate is to accelerate or decelerate the life course of a disease by some constant.
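The "constant factor" idea can be illustrated with a Weibull AFT model. The Python sketch below assumes the common parameterization log(T) = linear predictor + scale × error, with a standard minimum extreme value error. The function names and parameter values are hypothetical and are not taken from this tool.

```python
import math

def weibull_aft_survival(t, linear_predictor, scale):
    """S(t | x) for a Weibull AFT model with log(T) = lp + scale * W."""
    return math.exp(-math.exp((math.log(t) - linear_predictor) / scale))

def weibull_aft_median(linear_predictor, scale):
    """Median survival time implied by the same model."""
    return math.exp(linear_predictor) * math.log(2.0) ** scale

# Hypothetical values: baseline linear predictor 2.0, covariate effect 0.3
scale = 0.5
median_base = weibull_aft_median(2.0, scale)
median_shifted = weibull_aft_median(2.0 + 0.3, scale)
print(round(median_shifted / median_base, 4), round(math.exp(0.3), 4))
```

Under this parameterization, adding 0.3 to the linear predictor stretches every survival time quantile by the same factor exp(0.3), which is what accelerating or decelerating the time scale means.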

- To build an AFT model
- To obtain the estimates of the model, such as the coefficients, residuals, and diagnostic plots
- To obtain fitted values, which are predicted from the training data
- To upload new data and get the prediction
- To evaluate the model on new data that contain the dependent variable

- Prepare the training data in the Data tab
- Prepare the survival object, Surv(time, event), in the Data tab
- New data (test set) should cover all the independent variables used in the model.

Check full data in Data tab

- For each variable, the estimated coefficient (Value), the statistic for the significance of a single variable, and the p value are given.
- The column marked 'z' gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard error (z = coef/se(coef)). The Wald statistic evaluates whether the beta coefficient of a given variable is statistically significantly different from 0.
- In an AFT model, the coefficients relate to the log of survival time rather than to the hazard; a positive coefficient indicates longer expected survival (a protective effect), and a negative coefficient indicates shorter expected survival (a worse prognosis).
- exp(Value) = time ratio (acceleration factor). Time ratio = 1: no effect; time ratio < 1: shorter survival time; time ratio > 1: longer survival time
- Scale and Log(scale) are the estimated parameters of the error term of the AFT model
- The log-likelihood is given in the model output. When maximum likelihood estimation is used, the closer the log-likelihood (LL) is to zero, the better the model fit.
- For left-truncated data, the time here is the difference between the end time and the start time

**Fitted values and residuals from the existing data**

- The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
- Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.

**Model selection suggested by AIC**

**Explanations**

- A value of martingale residuals near 1 represents individuals that 'died too soon',
- Large negative values correspond to individuals that 'lived too long'.

- Positive deviance residuals correspond to individuals that 'died too soon' compared to expected survival times.
- Negative values correspond to individuals that 'lived too long'.
- Very large or small values are outliers, which are poorly predicted by the model.

- Cox-Snell residuals are equal to the -log(survival probability) for each observation
- If the model fits the data well, Cox-Snell residuals should behave like a sample from an exponential distribution with a mean of 1
- If the residuals act like a sample from a unit exponential distribution, they should lie along the 45-degree diagonal line.

The residuals can be found in the Data Fitting tab.

Red points are those who 'died soon'; black points are those who 'lived long'

**1. Martingale residuals plot against continuous independent variable**

**2. Deviance residuals plot by observational id**

**3. Cox-Snell residuals plot**

The predicted survival probability of the N-th observation