To upload data files, preview the data set, and check the correctness of the data input
To pre-process some variables (when necessary) for building the model
To get the basic descriptive statistics and plots of the variables
2. About your data (training set)
The data need to be all numeric
The data used to build a model is called a training set
Case Example: NKI data
Suppose that in one study we wanted to explore metastasis-free survival in lymph node-positive breast cancer patients.
The data contained the clinical risk factors (1) Age: patient age at diagnosis (years) and (2) years until relapse;
and gene expression measurements of 70 genes found to be prognostic for metastasis-free survival in an earlier study.
In this example, we wanted to create a model that could find the relations between age, years until relapse, and the gene expression measurements.
Case Example: Liver toxicity data
This data set contains the expression measures and clinical measurements for rats that were exposed to non-toxic, moderately toxic or severely toxic doses of acetaminophen in a controlled experiment.
Please follow the Steps; the Outputs will give real-time analytical results. After getting the data ready, please find the models in the next tabs.
Linear fitting plot: to roughly show the linear relation between any two numeric variables.
The grey area is the 95% confidence interval.
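Below is a minimal sketch of such a linear fitting plot, written in Python with seaborn rather than the app's own implementation; the data frame `df`, the column names, and the synthetic data are all placeholders.

```python
# Minimal sketch (not the app's implementation): linear fitting plot with a
# 95% confidence band and custom axis labels, using synthetic placeholder data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
age = rng.normal(55, 10, size=100)                 # stand-in numeric variable X
gene = 0.02 * age + rng.normal(0, 0.3, size=100)   # stand-in numeric variable Y
df = pd.DataFrame({"age": age, "gene_1": gene})

ax = sns.regplot(data=df, x="age", y="gene_1", ci=95)  # shaded band = 95% CI
ax.set_xlabel("Age (years)")        # optional: change the X axis label
ax.set_ylabel("Gene 1 expression")  # optional: change the Y axis label
plt.show()
```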
3. Change the labels of X and Y axes
Histogram: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.
Density plot: to show the distribution of a variable
Histogram and Density Plot
When the number of bins is 0, the plot will use the default number of bins
Density plot
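Below is a minimal sketch of a histogram plus a density estimate for one variable, written in Python with matplotlib and SciPy rather than the app's own implementation; the variable `x` and the `bins` setting are placeholders.

```python
# Minimal sketch (not the app's implementation): histogram with a chosen number
# of bins plus a kernel density estimate for a single numeric variable.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(size=500)                 # synthetic placeholder variable

bins = 20                                # 0 in the app means "use the default"
n_bins = bins if bins > 0 else "auto"

fig, ax = plt.subplots()
ax.hist(x, bins=n_bins, density=True, alpha=0.5, label="histogram")
grid = np.linspace(x.min(), x.max(), 200)
ax.plot(grid, gaussian_kde(x)(grid), label="density")   # smooth density estimate
ax.legend()
plt.show()
```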
Principal Component Regression
Principal component regression (PCR) is a regression analysis technique based on principal component analysis (PCA): the independent variables are first summarized by principal components that capture their maximum variance, and the response is then regressed on these components.
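As a rough illustration of the idea (not the app's implementation), the following Python sketch with scikit-learn scales the predictors, keeps the first A principal components, and regresses the response on them; the data and the choice A = 3 are synthetic placeholders.

```python
# Minimal PCR sketch: scale X, keep the first A principal components,
# then regress the response on those components.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 70))                 # e.g. 70 gene-expression predictors
y = X[:, :3] @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.5, size=100)

A = 3                                          # number of components (placeholder)
pcr = make_pipeline(StandardScaler(), PCA(n_components=A), LinearRegression())
pcr.fit(X, y)

X_new = rng.normal(size=(5, 70))               # "new data" with the same predictors
print(pcr.predict(X_new))                      # predicted dependent variable
```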
1. Functionalities
To obtain a correlation matrix and plots
To obtain the results from a model
To obtain the factors and loadings result tables
To obtain the factors and loadings distribution plots in 2D and 3D
To obtain the predicted dependent variables
To upload new data and conduct the prediction
2. About your data
All the data for analysis must be numeric
New data (test set) should cover all the independent variables used in the model.
Please follow the Steps to build the model, and click Outputs to get analytical results.
The results from 1 component, 2 components, ..., n components are given
'CV' is the cross-validation estimate
'adjCV' (for RMSEP and MSEP) is the bias-corrected cross-validation estimate
R^2 is the squared correlation between the fitted values and the response. The R^2 shown for train is the unadjusted value, while the one shown for CV is the adjusted value.
A recommended number of components gives a high R^2 and a low MSEP / RMSEP
10-fold cross-validation randomly splits the data into 10 folds each time, so the results will not be exactly the same after a refresh.
R^2
Mean square error of prediction (MSEP)
Root mean square error of prediction (RMSEP)
From the results we can see that, as the number of components (A) increases, the training results improve (higher R^2, lower MSEP and RMSEP).
However, the CV results behave differently. Extremely good training results combined with extremely poor CV results suggest overfitting, which indicates poor predictive ability.
In this example, we decided to choose 3 components (A=3), according to the MSEP and RMSEP.
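The following Python sketch (with scikit-learn; not the app's implementation) illustrates this selection: for each candidate A it reports the training R^2 and the 10-fold cross-validated MSEP and RMSEP, and a good A combines a high R^2 with a low (R)MSEP. The data are synthetic placeholders.

```python
# Minimal sketch of choosing the number of components A by 10-fold CV.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 70))
y = X[:, :3] @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.5, size=100)

cv = KFold(n_splits=10, shuffle=True)          # random folds: results vary per run
for A in range(1, 6):
    model = make_pipeline(StandardScaler(), PCA(n_components=A), LinearRegression())
    r2_train = model.fit(X, y).score(X, y)     # unadjusted R^2 on the training set
    y_cv = cross_val_predict(model, X, y, cv=cv)
    msep = np.mean((y - y_cv) ** 2)            # cross-validated MSEP
    print(f"A={A}: train R^2={r2_train:.3f}  MSEP={msep:.3f}  RMSEP={np.sqrt(msep):.3f}")
```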
1. Predicted Y and residuals (Y-Predicted Y)
Coefficient
Explanations
This plot graphs the relation between two components' scores; you can use the score plot to assess the data structure and detect clusters, outliers, and trends
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
When A >= 2, choose 2 different components to show the 2D component and loading plot
In this plot, we plotted the scores of component 1 against component 2 and found that observations 327 and 332 were outliers.
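The sketch below (Python/matplotlib, not the app's implementation) shows how such a score plot can be drawn: the scores of two chosen components are plotted against each other and each observation is labelled so that outliers stand out. The data and the component choice are placeholders.

```python
# Minimal score-plot sketch: scatter the scores of two chosen components and
# label each observation so outlying rows are easy to spot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                       # synthetic placeholder data
scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))

i, j = 0, 1                                          # which two components to show
plt.scatter(scores[:, i], scores[:, j])
for row, (sx, sy) in enumerate(scores[:, [i, j]]):
    plt.annotate(str(row), (sx, sy), fontsize=6)     # row labels reveal outliers
plt.axhline(0, lw=0.5); plt.axvline(0, lw=0.5)
plt.xlabel(f"Component {i + 1}"); plt.ylabel(f"Component {j + 1}")
plt.show()
```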
Explanations
This plot shows the contributions from the variables to the PCs (choose the PC in the left panel)
Red indicates negative and blue indicates positive effects
Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
For descriptive purposes, you may need only 80% (0.8) of the variance explained.
If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.
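As a rough illustration (not the app's implementation), the following Python sketch computes the cumulative proportion of variance and reports how many components are needed to reach 80% or 90%; the data are synthetic placeholders.

```python
# Minimal sketch: cumulative proportion of variance explained by the components,
# and the number of components needed to reach 80% / 90%.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                           # synthetic placeholder data
pca = PCA().fit(StandardScaler().fit_transform(X))

cumvar = np.cumsum(pca.explained_variance_ratio_)        # as in the variance table
print(cumvar)
print("components for 80%:", np.argmax(cumvar >= 0.80) + 1)
print("components for 90%:", np.argmax(cumvar >= 0.90) + 1)
```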
Explanations
This plot (a biplot) overlays the components and the loadings (choose the PCs in the left panel)
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component.
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.
When A >= 2, choose 2 different components to show the 2D component and loading plot
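The sketch below (Python/matplotlib, not the app's implementation) draws a simple biplot: the component scores appear as points and the variable loadings as arrows, rescaled only so the arrows are visible on the score axes. The data and variable names are placeholders.

```python
# Minimal biplot sketch: overlay component scores (points) and variable
# loadings (arrows) for two chosen components.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                         # synthetic placeholder data
cols = [f"var{k}" for k in range(X.shape[1])]         # placeholder variable names

Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(Xs)
scores = pca.transform(Xs)
loadings = pca.components_.T                          # one row per variable

plt.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.5)
scale = np.abs(scores).max()                          # stretch arrows to the score range
for name, (lx, ly) in zip(cols, loadings):
    plt.arrow(0, 0, lx * scale, ly * scale, color="red", head_width=0.05)
    plt.annotate(name, (lx * scale, ly * scale), color="red")
plt.xlabel("Component 1"); plt.ylabel("Component 2")
plt.show()
```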
Explanations
This is the extension of the 2D plot: it overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
This plot has similar functionality to the 2D plots. Each trace corresponds to a variable and can be hidden by clicking on it.
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.
This plot may take some time to load the first time
When A >= 3, choose 3 different components to show the 3D component and loading plot
Partial least squares regression (PLSR) is a regression analysis technique that finds a linear regression model by projecting the predicted variables and the observable variables to a new space.
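As a rough illustration (not the app's implementation), the following Python sketch fits a PLS regression with scikit-learn and uses it for prediction; the data and the choice of 3 components are synthetic placeholders.

```python
# Minimal PLSR sketch: components are built from X and Y jointly, then Y is
# regressed on those components.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                         # synthetic placeholder data
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

pls = PLSRegression(n_components=3, scale=True)        # A = 3 (placeholder)
pls.fit(X, y)
print(pls.score(X, y))                                 # R^2 on the training set
print(pls.predict(X[:5]))                              # predicted Y for 5 rows
```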
1. Functionalities
To obtain a correlation matrix and plots
To obtain the results from a model
To obtain the factors and loadings result tables
To obtain the factors and loadings distribution plots in 2D and 3D
To obtain the predicted dependent variables
To upload new data and conduct the prediction
2. About your data (training set)
All the data for analysis must be numeric
New data (test set) should cover all the independent variables used in the model.
Please follow the Steps to build the model, and click Outputs to get analytical results.
The results from 1 component, 2 components, ..., n components are given
'CV' is the cross-validation estimate
'adjCV' (for RMSEP and MSEP) is the bias-corrected cross-validation estimate
R^2 is the squared correlation between the fitted values and the response. The R^2 shown for train is the unadjusted value, while the one shown for CV is the adjusted value.
A recommended number of components gives a high R^2 and a low MSEP / RMSEP
10-fold cross-validation randomly splits the data into 10 folds each time, so the results will not be exactly the same after a refresh.
R^2
Mean square error of prediction (MSEP)
Root mean square error of prediction (RMSEP)
Because we chose more than one dependent variable (Y), the results are shown for each Y.
PLSR generates new components from both Y and X, so its R^2 is better than that of PCR. The variance explained (%) in Y is also higher than for PCR.
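A minimal sketch of the multi-Y case (Python with scikit-learn, not the app's implementation): a single PLS model is fitted on the whole Y matrix and an R^2 is reported for each Y column. The data are synthetic placeholders.

```python
# Minimal sketch: PLS regression with two dependent variables, R^2 per Y column.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = np.column_stack([X[:, 0] + rng.normal(scale=0.3, size=100),
                     X[:, 1] - X[:, 2] + rng.normal(scale=0.3, size=100)])

pls = PLSRegression(n_components=3).fit(X, Y)
print(r2_score(Y, pls.predict(X), multioutput="raw_values"))  # one R^2 per Y
```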
Predicted Y
Residuals (Y-Predicted Y)
Coefficient
Explanations
This plot graphs the relation between two components' scores; you can use the score plot to assess the data structure and detect clusters, outliers, and trends
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
When A >= 2, choose 2 different components to show the 2D component and loading plot
In this plot, we plotted the scores of component 1 against component 2 and found that observations 327 and 332 were outliers.
Explanations
This plot shows the contributions from the variables to the PCs (choose the PC in the left panel)
Red indicates negative and blue indicates positive effects
Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
For descriptive purposes, you may need only 80% (0.8) of the variance explained.
If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.
Explanations
This plot (a biplot) overlays the components and the loadings (choose the PCs in the left panel)
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component.
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.
When A >= 2, choose 2 different components to show the 2D component and loading plot
Explanations
This is the extension of the 2D plot: it overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
This plot has similar functionality to the 2D plots. Each trace corresponds to a variable and can be hidden by clicking on it.
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.
This plot may take some time to load the first time
When A >= 3, choose 3 different components to show the 3D component and loading plot
Sparse partial least squares regression (SPLSR) is a regression analysis technique that aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors.
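The sketch below illustrates only the sparsity idea in a highly simplified, single-component form (plain NumPy; not the app's SPLSR algorithm): the PLS weight vector is soft-thresholded so that variables with small weights are dropped from the component. The data and the tuning parameter `eta` are placeholders.

```python
# Highly simplified, single-component sketch of the sparsity idea behind SPLS:
# soft-threshold the PLS weight vector so small-weight variables drop out.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                  # synthetic placeholder data
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

Xc = X - X.mean(axis=0)                         # center the predictors
yc = y - y.mean()

w = Xc.T @ yc                                   # ordinary PLS weight direction
eta = 0.7                                       # sparsity tuning parameter in [0, 1)
thresh = eta * np.abs(w).max()
w = np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)   # soft-threshold small weights
selected = np.flatnonzero(w)                    # indices of the selected variables
w /= np.linalg.norm(w)

t = Xc @ w                                      # sparse component (score)
beta = (t @ yc) / (t @ t)                       # regress y on the component
print("selected variables:", selected)
print("coefficients on the original variables:", (w * beta)[selected])
```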
1. Functionalities
To obtain a correlation matrix and plot
To obtain the results from a model
To obtain the factors and loadings result tables
To obtain the factors and loadings distribution plots in 2D and 3D
To obtain the predicted dependent variables
To upload new data and conduct the prediction
2. About your data (training set)
All the data for analysis must be numeric
New data (test set) should cover all the independent variables used in the model.
Please follow the Steps to build the model, and click Outputs to get analytical results.
This plot shows how the coefficients change as variables are selected
Coefficient
These are the components derived from the selected variables
Explanations
This plot graphs the relation between two components' scores; you can use the score plot to assess the data structure and detect clusters, outliers, and trends
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
When A >= 2, choose 2 different components to show the 2D component and loading plot
In this plot, we plotted the scores of component 1 against component 2 and found that observation 378 was an outlier.
These are the loadings derived from the selected variables
Explanations
This plot shows the contributions from the variables to the PCs (choose the PC in the left panel)
Red indicates negative and blue indicates positive effects
Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the factors explain.
For descriptive purposes, you may need only 80% (0.8) of the variance explained.
If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the factors.
These are the loadings derived from the selected variables
Explanations
This plot (a biplot) overlays the components and the loadings (choose the PCs in the left panel)
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component.
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.
When A >= 2, choose 2 different components to show the 2D component and loading plot
Explanations
This is the extension of the 2D plot: it overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
This plot has similar functionality to the 2D plots. Each trace corresponds to a variable and can be hidden by clicking on it.
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.
This plot may take some time to load the first time
When A >= 3, choose 3 different components to show the 3D factor and loading plot