Data Preparation

1. Functionalities

  • To upload data files, preview the data set, and check that the data were read in correctly
  • To pre-process some variables (when necessary) before building the model
  • To get basic descriptive statistics and plots of the variables

2. About your data (training set)

  • The data need to be all numeric
  • The data used to build a model is called a training set

Case Example: NKI data

Suppose that in one study we wanted to explore metastasis-free survival in lymph node-positive breast cancer patients. The data contained two clinical risk factors, (1) Age: patient age at diagnosis (years) and (2) the years until relapse, together with gene expression measurements of 70 genes found to be prognostic for metastasis-free survival in an earlier study. In this example, we wanted to create a model that could find the relations between age, years until relapse, and the gene expression measurements.

Case Example: Liver toxicity data

This data set contains the expression measurements and clinical measurements for rats that were exposed to non-toxic, moderately toxic, or severely toxic doses of acetaminophen in a controlled experiment.

Please follow the Steps; the Outputs will show the analytical results in real time. Once the data are ready, build the model in the next tabs.


Training Set Preparation




Uploaded data will replace the example data

Please follow the example data format when uploading new data

2. Show 1st row as column names?

3. Use 1st column as row names? (No duplicates)

The correct separator and quote character ensure that the data are read in successfully
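As a rough illustration of how the header, row-name, separator, and quote options interact, the sketch below parses a small delimited text with the Python standard library. The function name `read_table` and the sample values are made up for this example and are not part of the application.

```python
import csv
import io

def read_table(text, sep=",", quote='"', header=True, row_names=True):
    """Parse delimited text into (column names, row names, numeric rows)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep, quotechar=quote))
    col_names = rows.pop(0) if header else None
    row_ids = None
    if row_names:
        row_ids = [r[0] for r in rows]
        if len(set(row_ids)) != len(row_ids):
            raise ValueError("row names contain duplicates")
        rows = [r[1:] for r in rows]
        if col_names is not None:
            col_names = col_names[1:]  # drop the label of the row-name column
    # float() raises ValueError when a cell is not numeric
    values = [[float(v) for v in r] for r in rows]
    return col_names, row_ids, values

# Hypothetical semicolon-separated sample with quoted column names
sample = 'id;"Age";"time"\np1;43;14.8\np2;48;14.2\n'
cols, ids, vals = read_table(sample, sep=";")
```

If the separator does not match the file, each line is read as a single field, which is why the preview should be checked after every upload.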

Find some example data here


Change the types of some variables?








Output 1. Data Information

Data Preview


1. Numeric variable information list


            

2. Categorical variable information list


            

Output 2. Descriptive Results


1. For numeric variable

2. For categorical variable


                  
                    
                    Download Results (Categorical variable)
                  
                


Linear fitting plot: to roughly show the linear relation between any two numeric variables. The grey area is the 95% confidence interval.
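A minimal sketch of what such a plot computes, assuming an ordinary least-squares line. The function name `linear_fit` and the five data points are made up. The 95% band around the line would be the fitted value plus or minus the 0.975 quantile of the t distribution with n - 2 degrees of freedom times `se_mean`; the quantile is left out here to avoid extra dependencies.

```python
import numpy as np

def linear_fit(x, y):
    """Least-squares line and the standard error of the fitted mean at each x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    resid_var = np.sum((y - fitted) ** 2) / (n - 2)      # residual variance
    sxx = np.sum((x - x.mean()) ** 2)
    # Standard error of the fitted mean: smallest at the mean of x
    se_mean = np.sqrt(resid_var * (1.0 / n + (x - x.mean()) ** 2 / sxx))
    return slope, intercept, fitted, se_mean

# Made-up illustrative points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, intercept, fitted, se_mean = linear_fit(x, y)
```

The band is narrowest near the mean of x and widens toward the ends, which matches the shape of the grey area.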


3. Change the labels of X and Y axes


Histogram: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

Density plot: to show the distribution of a variable


Histogram and Density Plot

When the number of bins is 0, the plot uses the default number of bins
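The rule can be sketched as follows. The default of 30 bins and the name `histogram_counts` are assumptions made for this example; the application's actual default is not stated here.

```python
import numpy as np

DEFAULT_BINS = 30  # assumed default, not taken from the application

def histogram_counts(values, bins=0):
    """Return histogram counts and bin edges; bins=0 falls back to the default."""
    n_bins = DEFAULT_BINS if bins == 0 else int(bins)
    counts, edges = np.histogram(np.asarray(values, dtype=float), bins=n_bins)
    return counts, edges

rng = np.random.default_rng(0)
sample = rng.normal(size=200)            # synthetic data
counts_default, _ = histogram_counts(sample)
counts_10, edges_10 = histogram_counts(sample, bins=10)
```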

Density plot



Principal Component Regression

Principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). It first finds the directions of maximum variance among the independent variables (the principal components) and then regresses the response on the scores of the first few components.
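A minimal sketch of the PCR idea with numpy, assuming centered but unscaled variables. The names `pcr_fit` and `pcr_predict` and the synthetic data are made up; this is not the code the application runs.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress centered y on the first principal component scores of centered X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Rows of Vt are the principal directions (maximum variance in X)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                      # loadings, shape (p, A)
    scores = Xc @ V                              # component scores, shape (n, A)
    # Scores are orthogonal, so each regression coefficient is a simple ratio
    gamma = (scores.T @ yc) / (s[:n_components] ** 2)
    beta = V @ gamma                             # coefficients for the original variables
    return beta, y_mean - x_mean @ beta

def pcr_predict(X, beta, intercept):
    return X @ beta + intercept

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))                     # synthetic predictors
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=30)
beta_full, b0_full = pcr_fit(X, y, n_components=5)
beta_two, b0_two = pcr_fit(X, y, n_components=2)
```

With all components kept, PCR reproduces ordinary least squares; keeping fewer components trades some fit for stability.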

1. Functionalities

  • To obtain a correlation matrix and plots
  • To obtain the results from a model
  • To obtain the factors and loadings result tables
  • To obtain the factors and loadings distribution plots in 2D and 3D
  • To obtain the predicted dependent variables
  • To upload new data and conduct the prediction

2. About your data

  • All the data for analysis are numeric
  • New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.


Build the Model

Prepare the data in the Data tab


Step 1. Choose parameters to build the model

In the example of the NKI data, we used time as the dependent variable (Y) and the variables from TSPYL5 ... as independent variables (X). By default, all variables other than Y are put into X, so we need to remove the Diam and Age variables.

From the Data tab, we know that X is a 20 by 25 matrix, so the maximum of A (the number of components) is 19. There will be an error if A=20.
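One way to see where the limit of 19 comes from, assuming the columns of X are centered before the components are extracted: centering removes one degree of freedom, so a matrix with 20 rows has rank at most 19. The random matrix below only stands in for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 25))            # stand-in for a 20 x 25 predictor matrix
Xc = X - X.mean(axis=0)                  # column-centering
rank_raw = np.linalg.matrix_rank(X)      # 20: limited by the number of rows
rank_centered = np.linalg.matrix_rank(Xc)  # 19: one dimension lost to centering
```

Only 19 components can carry any variance, so asking for a 20th cannot succeed.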

We used 10-fold CV to compare the results on the training set with those on the CV / validation set.


Step 2. If data and model are ready, click the blue button to generate model results.





Output 1. Data Preview

Part of Data

Please edit data in Data tab


Output 2. Model Results


Explanations
  • The results from 1 component, 2 components, ..., n components are given
  • 'CV' is the cross-validation estimate
  • 'adjCV' (for RMSEP and MSEP) is the bias-corrected cross-validation estimate
  • R^2 is equivalent to the squared correlation between the fitted values and the response. The R^2 shown for train is the unadjusted one, while the R^2 shown for CV is the adjusted one.
  • Choose the number of components that gives a high R^2 and a low MSEP / RMSEP
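The three quantities in these tables can be sketched as follows. R^2 is computed here as 1 - SSE/SST, which is one common definition and an assumption of this example; the function names and the four values are made up.

```python
import numpy as np

def msep(y_true, y_pred):
    """Mean squared error of prediction."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction, in the units of Y."""
    return float(np.sqrt(msep(y_true, y_pred)))

def r_squared(y_true, y_pred):
    """1 - SSE/SST: share of the variation in Y captured by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    sse = np.sum((y_true - np.asarray(y_pred)) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - sse / sst)

# Made-up observed and predicted values
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.0])
```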

10-fold cross-validation randomly splits the data into 10 folds every time, so the results will not be exactly the same after a refresh.
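The random split can be sketched like this. The helper `kfold_indices` is hypothetical; fixing the seed makes a split reproducible, while a new seed (or none) gives a new split and therefore slightly different CV results.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=10, seed=None):
    """Randomly assign sample indices to n_folds folds of near-equal size."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return np.array_split(order, n_folds)

folds_a = kfold_indices(20, seed=1)
folds_b = kfold_indices(20, seed=2)       # a different random split
folds_a_again = kfold_indices(20, seed=1)  # same seed, same split
```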



                  

R^2


                  

Mean square error of prediction (MSEP)


                  

Root mean square error of prediction (RMSEP)


                  

From the results we could see that, as A increased, the training results improved (higher R^2, lower MSEP and RMSEP).

However, the results in CV were different. A model that performs extremely well in training but extremely badly in CV is overfitted, which indicates poor predictive ability.

In this example, we decided to choose 3 components (A=3), according to the MSEP and RMSEP.
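A sketch of how the number of components could be picked from the cross-validated RMSEP, assuming a plain PCR fit on centered data. The data are synthetic, not the NKI data, and the function names are made up.

```python
import numpy as np

def pcr_coefficients(X, y, n_components):
    """PCR coefficients and intercept from the first n_components components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    gamma = (V.T @ Xc.T @ yc) / (s[:n_components] ** 2)
    beta = V @ gamma
    return beta, y_mean - x_mean @ beta

def cv_rmsep(X, y, max_components, n_folds=10, seed=0):
    """Cross-validated RMSEP for 1..max_components components."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_err = np.zeros(max_components)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        for a in range(1, max_components + 1):
            beta, b0 = pcr_coefficients(X[train_idx], y[train_idx], a)
            pred = X[test_idx] @ beta + b0
            sq_err[a - 1] += np.sum((y[test_idx] - pred) ** 2)
    return np.sqrt(sq_err / len(y))

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))                       # synthetic predictors
y = X[:, 0] - X[:, 1] + 0.2 * rng.normal(size=60)  # synthetic response
rmsep_by_a = cv_rmsep(X, y, max_components=8)
best_a = int(np.argmin(rmsep_by_a)) + 1            # A with the lowest CV RMSEP
```

In practice a smaller A with an RMSEP close to the minimum is often preferred over the exact minimizer.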


1. Predicted Y and residuals (Y-Predicted Y)


Coefficient


Explanations
  • This plot graphs the relation between the scores of two components. You can use the score plot to assess the data structure and detect clusters, outliers, and trends
  • If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

When A >= 2, choose 2 different components to show in the component and loading 2D plot

In this plot, we plotted the scores of component 1 against component 2 and found that observations 327 and 332 were outliers.


Explanations
  • This plot shows the contributions of the variables to the PCs (choose the PC in the left panel)
  • Red indicates negative and blue indicates positive effects
  • Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
  • For descriptive purposes, you may need only 80% (0.8) of the variance explained.
  • If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.
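The cumulative proportion of variance and the number of components needed to reach a threshold can be sketched as below. The function name `components_for_variance` and the synthetic data are made up for this example.

```python
import numpy as np

def components_for_variance(X, threshold):
    """Cumulative explained variance and the fewest components reaching threshold."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = s ** 2 / np.sum(s ** 2)        # proportion of variance per component
    cumulative = np.cumsum(explained)
    # First position where the cumulative proportion reaches the threshold
    n_needed = min(int(np.searchsorted(cumulative, threshold)) + 1, len(s))
    return cumulative, n_needed

rng = np.random.default_rng(4)
# Synthetic variables with very different spreads
X = rng.normal(size=(50, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
cumulative, n80 = components_for_variance(X, 0.80)
_, n90 = components_for_variance(X, 0.90)
```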


Explanations
  • This plot (biplot) overlays the components and the loadings (choose the PCs in the left panel)
  • If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
  • Loadings identify which variables have the largest effect on each component.
  • Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.
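The range from -1 to 1 follows when each loading vector is scaled to unit length, which is how principal directions come out of a singular value decomposition. The sketch below assumes that scaling and uses synthetic data; other software may scale loadings differently.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 4))                       # synthetic data
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=40)      # make two variables strongly related
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                                    # column j: loadings of component j + 1
# Index of the variable with the largest absolute loading on each component
dominant = np.argmax(np.abs(loadings), axis=0)
```

Because the squared loadings of a component sum to 1, a loading near -1 or 1 means that a single variable accounts for most of that component.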

When A >= 2, choose 2 different components to show in the component and loading 2D plot


Explanations
  • This is the extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
  • This plot works like the 2D plots. The traces are the variables, which can be hidden by clicking them.
  • If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
  • Loadings identify which variables have the largest effect on each component
  • Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

This plot needs some time to load the first time

When A >= 3, choose 3 different components to show in the component and loading 3D plot

Trace legend


                

Prediction

Prepare model first


Step 3. Test Set Preparation


Data: NKI


Data for prediction should cover all the variables in the model
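A small sketch of this check, assuming the new data arrive as a list of column names plus rows of values. The function `align_new_data` and the variable names `g1`, `g2`, `g3`, and `extra` are hypothetical.

```python
def align_new_data(model_vars, new_columns, new_rows):
    """Check that new data cover the model variables and put them in model order."""
    missing = [v for v in model_vars if v not in new_columns]
    if missing:
        raise ValueError("new data lack model variables: " + ", ".join(missing))
    idx = [new_columns.index(v) for v in model_vars]
    # Extra columns in the new data are ignored
    return [[row[i] for i in idx] for row in new_rows]

new_columns = ["g2", "extra", "g1"]
new_rows = [[1.0, 9.0, 2.0]]
aligned = align_new_data(["g1", "g2"], new_columns, new_rows)
```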

2. Show 1st row as column names?

3. Use 1st column as row names? (No duplicates)

The correct separator and quote character ensure that the data are read in successfully

Find some example data here

Step 4. If the model and new data are ready, click the blue button to generate prediction results.






Partial Least Squares Regression

Partial least squares regression (PLSR) is a regression analysis technique that finds a linear regression model by projecting the predicted variables and the observable variables to a new space.
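To illustrate how the projection differs from PCA, the sketch below computes the first PLS weight vector for a single response, which is proportional to X'y on centered data, and compares it with the first principal direction. The data are synthetic and the variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic predictors: the first column has by far the largest variance
X = rng.normal(size=(50, 6)) * np.array([4.0, 1.0, 1.0, 1.0, 1.0, 1.0])
y = X[:, 5] + 0.1 * rng.normal(size=50)        # y depends on a low-variance column
Xc, yc = X - X.mean(axis=0), y - y.mean()

w_pls = Xc.T @ yc
w_pls = w_pls / np.linalg.norm(w_pls)          # first PLS weight vector (unit length)
w_pca = np.linalg.svd(Xc, full_matrices=False)[2][0]  # first principal direction

cov_pls = abs((Xc @ w_pls) @ yc)               # covariance (up to scale) of score and y
cov_pca = abs((Xc @ w_pca) @ yc)
```

Among all unit-length directions, the PLS weight gives the score with the largest covariance with y, whereas the principal direction only follows the variance of X.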

1. Functionalities

  • To obtain a correlation matrix and plots
  • To obtain the results from a model
  • To obtain the factors and loadings result tables
  • To obtain the factors and loadings distribution plots in 2D and 3D
  • To obtain the predicted dependent variables
  • To upload new data and conduct the prediction

2. About your data (training set)

  • All the data for analysis are numeric
  • New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.


Build the Model

Prepare the data in the Data tab


Step 1. Choose parameters to build the model

The available algorithms give very similar results

PLSR can use more than one dependent variable and finds the linear relation between the Y matrix and the X matrix. Thus, in this example, we used time, Diam, and Age as dependent variables and the other variables as independent variables.
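A minimal sketch of PLS regression with several dependent variables, assuming a NIPALS-style deflation in which each weight vector is the leading singular vector of X'Y. The function `pls_fit` and the synthetic data are made up; the algorithms offered in the application may differ in detail.

```python
import numpy as np

def pls_fit(X, Y, n_components):
    """PLS regression for a Y matrix; returns coefficients and intercepts."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xk, Yk = X - x_mean, Y - y_mean
    W, P, Q = [], [], []
    for _ in range(n_components):
        # Weight: direction in X with maximal covariance with the Y matrix
        w = np.linalg.svd(Xk.T @ Yk, full_matrices=False)[0][:, 0]
        t = Xk @ w                       # scores of this component
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        q = Yk.T @ t / tt                # Y loadings
        Xk = Xk - np.outer(t, p)         # deflate X and Y before the next component
        Yk = Yk - np.outer(t, q)
        W.append(w)
        P.append(p)
        Q.append(q)
    W, P, Q = np.column_stack(W), np.column_stack(P), np.column_stack(Q)
    B = W @ np.linalg.solve(P.T @ W, Q.T)    # coefficients, shape (p, m)
    return B, y_mean - x_mean @ B

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 4))                          # synthetic predictors
C = rng.normal(size=(4, 3))
Y = X @ C + 0.1 * rng.normal(size=(40, 3))            # three synthetic responses
B_full, b0_full = pls_fit(X, Y, n_components=4)
B_two, b0_two = pls_fit(X, Y, n_components=2)
```

With as many components as X has columns, this fit coincides with ordinary least squares; fewer components give a more constrained model.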

We wanted to find the components that had good predictive ability.

From the Data tab, we know that X is a 20 by 25 matrix, so the maximum of A (the number of components) is 19. There will be an error if A=20.

In this example, we decided to choose 3 components (A=3), according to the MSEP and RMSEP. We used 10-fold CV and a simple and fast algorithm.


Step 2. If data and model are ready, click the blue button to generate model results.





Output 1. Data Preview

Part of Data

Please edit data in Data tab


Output 2. Model Results


Explanations
  • The results from 1 component, 2 components, ..., n components are given
  • 'CV' is the cross-validation estimate
  • 'adjCV' (for RMSEP and MSEP) is the bias-corrected cross-validation estimate
  • R^2 is equivalent to the squared correlation between the fitted values and the response. The R^2 shown for train is the unadjusted one, while the R^2 shown for CV is the adjusted one.
  • Choose the number of components that gives a high R^2 and a low MSEP / RMSEP

10-fold cross-validation randomly splits the data into 10 folds every time, so the results will not be exactly the same after a refresh.



                  

R^2


                  

Mean square error of prediction (MSEP)


                  

Root mean square error of prediction (RMSEP)


                  

Because we chose more than one dependent variable (Y), the results are shown for each Y.

PLSR generates new variables (components) from both Y and X, so its R^2 is usually higher than that of PCR with the same number of components. The variance explained (%) in Y is also higher than with PCR.


Predicted Y


Residuals (Y-Predicted Y)


Coefficient


Explanations
  • This plot graphs the relation between the scores of two components. You can use the score plot to assess the data structure and detect clusters, outliers, and trends
  • If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

When A >= 2, choose 2 different components to show in the component and loading 2D plot

In this plot, we plotted the scores of component 1 against component 2 and found that observations 327 and 332 were outliers.


Explanations
  • This plot shows the contributions of the variables to the PCs (choose the PC in the left panel)
  • Red indicates negative and blue indicates positive effects
  • Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
  • For descriptive purposes, you may need only 80% (0.8) of the variance explained.
  • If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.


Explanations
  • This plot (biplot) overlays the components and the loadings (choose the PCs in the left panel)
  • If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
  • Loadings identify which variables have the largest effect on each component.
  • Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

When A >= 2, choose 2 different components to show in the component and loading 2D plot


Explanations
  • This is the extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
  • This plot works like the 2D plots. The traces are the variables, which can be hidden by clicking them.
  • If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
  • Loadings identify which variables have the largest effect on each component
  • Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

This plot needs some time to load the first time

When A >= 3, choose 3 different components to show in the component and loading 3D plot

Trace legend


                

Prediction

Prepare model first


Step 3. Test Set Preparation


Data: NKI


Data for prediction should cover all the variables in the model

2. Show 1st row as column names?

3. Use 1st column as row names? (No duplicates)

The correct separator and quote character ensure that the data are read in successfully

Find some example data here

Step 4. If the model and new data are ready, click the blue button to generate prediction results.






Sparse Partial Least Squares Regression

Sparse partial least squares regression (SPLSR) is a regression analysis technique that aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors.

1. Functionalities

  • To obtain a correlation matrix and plot
  • To obtain the results from a model
  • To obtain the factors and loadings result tables
  • To obtain the factors and loadings distribution plots in 2D and 3D
  • To obtain the predicted dependent variables
  • To upload new data and conduct the prediction

2. About your data (training set)

  • All the data for analysis are numeric
  • New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.


Build the Model

Prepare the data in the Data tab


Step 1. Choose parameters to build the model

SPLS adds a penalty that enables variable selection. The penalty selects the variables that may be useful for prediction, and the components are generated from the selected variables.
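One common way such a penalty works is soft thresholding of the PLS direction vector: entries whose absolute value falls below a fraction eta of the largest entry are set to zero, and the corresponding variables drop out. The sketch below assumes that form; the function `sparse_direction`, the parameter name `eta`, and the synthetic data are illustrative and may not match the application's implementation.

```python
import numpy as np

def sparse_direction(X, y, eta):
    """Soft-threshold the direction X'y; eta in [0, 1) controls the sparsity."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    z = Xc.T @ yc                                   # ordinary first PLS direction
    cutoff = eta * np.max(np.abs(z))
    w = np.sign(z) * np.maximum(np.abs(z) - cutoff, 0.0)
    selected = np.flatnonzero(w)                    # indices of the variables kept
    return z, w, selected

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 10))                      # synthetic predictors
y = 5.0 * X[:, 0] - 5.0 * X[:, 3] + rng.normal(size=200)
z, w_none, sel_none = sparse_direction(X, y, eta=0.0)   # no penalty: all variables kept
_, w_mid, sel_mid = sparse_direction(X, y, eta=0.5)
_, w_high, sel_high = sparse_direction(X, y, eta=0.9)
```

A larger eta keeps fewer variables; with data like these, the variables that actually drive y are typically the ones that survive.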

In the example of the NKI data, we used time as the dependent variable (Y) and the variables from TSPYL5 ... as independent variables (X). By default, all variables other than Y are put into X, so we need to remove the Diam and Age variables.

From the Data tab, we know that X is a 20 by 25 matrix, so the maximum of A (the number of components) is 19. There will be an error if A=20.


Step 2. If data and model are ready, click the blue button to generate model results.





Output 1. Data Preview


Choose optimal parameters from the following ranges

Cross-validation selects the parameters that give the minimum error; use them as suggestions when choosing the parameters.


                

Please edit data in Data tab


Output 2. Model Results



                  


Selected variables


Predicted Y


This plot shows how the coefficients changed to choose variables

Coefficient


These are the components derived from the selected variables

Explanations
  • This plot graphs the relation between the scores of two components. You can use the score plot to assess the data structure and detect clusters, outliers, and trends
  • If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

When A >= 2, choose 2 different components to show in the component and loading 2D plot

In this plot, we plotted the scores of component 1 against component 2 and found that observation 378 was an outlier.


These are the loadings derived from the selected variables

Explanations
  • This plot shows the contributions of the variables to the PCs (choose the PC in the left panel)
  • Red indicates negative and blue indicates positive effects
  • Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the factors explain.
  • For descriptive purposes, you may need only 80% (0.8) of the variance explained.
  • If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the factors.


These are the loadings derived from the selected variables

Explanations
  • This plot (biplot) overlays the components and the loadings (choose the PCs in the left panel)
  • If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
  • Loadings identify which variables have the largest effect on each component.
  • Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

When A >= 2, choose 2 different components to show in the component and loading 2D plot


Explanations
  • This is the extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
  • This plot works like the 2D plots. The traces are the variables, which can be hidden by clicking them.
  • If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
  • Loadings identify which variables have the largest effect on each component
  • Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

This plot needs some time to load the first time

When A >= 3, choose the components to show in the factor and loading 3D plot

The x, y, and z components must be different

Trace legend


                

Prediction

Prepare model first


Step 3. Test Set Preparation


Data: NKI


Data for prediction should cover all the variables in the model

2. Show 1st row as column names?

3. Use 1st column as row names? (No duplicates)

The correct separator and quote character ensure that the data are read in successfully

Find some example data here

Step 4. If the model and new data are ready, click the blue button to generate prediction results.





Output. Model Results