Data Preparation

1. Functionalities

To upload data files, preview the data set, and check that the data were read in correctly
To pre-process some variables (when necessary) for building the model
To get the basic descriptive statistics and plots of the variables

2. About your data (training set)

All data must be numeric
The data used to build a model are called the training set

Case Example: NKI data
Suppose that, in one study, we wanted to explore metastasis-free survival in lymph node-positive breast cancer patients. The data contained clinical risk factors, (1) Age: patient age at diagnosis (years) and (2) the years until relapse, together with gene expression measurements of 70 genes found to be prognostic for metastasis-free survival in an earlier study. In this example, we wanted to build a model that could find the relations between age, years until relapse, and the gene expression measurements.
Case Example: Liver toxicity data
This data set contains the expression measurements and clinical measurements for rats that were exposed to non-toxic, moderately toxic, or severely toxic doses of acetaminophen in a controlled experiment.

Please follow the Steps; the Outputs show analytical results in real time. Once the data are ready, build the model in the following tabs.

Training Set Preparation

Example data
Upload Data

Use example data

Uploaded data will replace the example data

Please refer to the example data format to upload new data

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

The correct separator and quote character are needed for the data to be read successfully
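
The effect of the separator and quote settings can be illustrated outside the app. The sketch below uses Python's standard csv module on a small made-up table; the column names, the values, and the helper read_table are hypothetical and only show why a wrong separator collapses every row into a single column.

```python
import csv
import io

# Hypothetical file contents: semicolon-separated, text fields in double quotes.
raw = 'id;"gene";"value"\n"s1";"g1";0.25\n"s2";"g2";1.5\n'

def read_table(text, sep, quote):
    """Parse delimited text; quote=None means the fields are not quoted."""
    if quote is None:
        reader = csv.reader(io.StringIO(text), delimiter=sep,
                            quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(io.StringIO(text), delimiter=sep, quotechar=quote)
    return [row for row in reader]

right = read_table(raw, sep=";", quote='"')   # matching separator
wrong = read_table(raw, sep=",", quote='"')   # separator not used in the file

print(right[0])       # header split into three column names
print(len(wrong[0]))  # the whole header line ends up in one column
```

If the preview shows a single wide column, the separator is the first setting to check.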

Find some example data here

Transform the data?

Change the types of some variables?

Output 1. Data Information

Data Preview

1. Numeric variable information list

2. Categorical variable information list

Output 2. Descriptive Results

Basic Descriptives
Linear Fitting Plot
Histogram
Heatmap

1. For numeric variable

2. For categorical variable

Download Results (Categorical variable)

Linear fitting plot: to roughly show the linear relation between any two numeric variables. The grey area is the 95% confidence interval.

3. Change the labels of X and Y axes

Histogram: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

Density plot: to show the distribution of a variable

Histogram and Density Plot

The number of bins in the histogram

When the number of bins is 0, the plot uses the default number of bins
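
To see what the number of bins changes, here is a minimal sketch that counts observations in equal-width bins. It assumes equal-width bins over the observed range; the function name histogram_counts and the sample values are made up, and the app's own default bin count is not reproduced here.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width ranges."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into the first bin.
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(histogram_counts(values, 4))  # [1, 2, 3, 3]
print(histogram_counts(values, 2))  # [3, 6]
```

Fewer bins give a coarser picture of the same data; more bins show more detail but also more noise.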

Density plot

Scale the data?

Principal Component Regression

Principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). It first finds the directions of maximum variance among the independent variables (the principal components) and then regresses the response on the leading components.
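
As a rough illustration of the idea, PCR with a single response can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the app's implementation: the function pcr_fit, the simulated data, and the choice of an SVD on the centered (unscaled) X are all illustrative.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal components of X."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                       # center the predictors
    # Rows of Vt are the principal directions of the centered X.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T               # loadings, p x A
    T = Xc @ V                            # component scores, n x A
    gamma, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    beta = V @ gamma                      # coefficients on the original variables
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Simulated data (hypothetical): 20 observations, 5 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=20)

def train_msep(beta, intercept):
    return float(np.mean((y - (X @ beta + intercept)) ** 2))

beta_full, b0_full = pcr_fit(X, y, n_components=5)
beta_two, b0_two = pcr_fit(X, y, n_components=2)
# Using fewer components can only keep or increase the training error.
print(train_msep(beta_two, b0_two) >= train_msep(beta_full, b0_full))  # True
```

When all components are kept, the fit coincides with ordinary least squares; the benefit of PCR comes from keeping only the first few.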

1. Functionalities

To obtain a correlation matrix and plots
To obtain the results from a model
To obtain the factors and loadings result tables
To obtain the factors and loadings distribution plots in 2D and 3D
To obtain the predicted dependent variables
To upload new data and conduct the prediction

2. About your data

All the data for analysis are numeric
New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.

Build the Model

Prepare the data in the Data tab

Step 1. Choose parameters to build the model

3. How many new components? (A <= dimension of X)

4. Do cross-validation?

No, use full data

10-fold cross-validation

Leave-one-out cross-validation

Scale the data?

In the NKI example, we used time as the dependent variable (Y), and the variables from TSPYL5 ... as independent variables (X). By default, all variables other than Y are put into X, so we need to remove the Diam and Age variables.

From the Data tab, we knew that X is a 20 by 25 matrix, so the maximum of A is 19. There will be an error if A = 20.
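
One plausible reason for this limit, assuming that X has 20 rows (observations) and 25 columns and that the columns are centered before the components are computed, is that centering removes one degree of freedom. The sketch below checks this on random data of the same shape; it illustrates the linear algebra only and is not taken from the app.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 25))     # hypothetical data with the same shape
Xc = X - X.mean(axis=0)           # column-centered copy

print(np.linalg.matrix_rank(X))   # 20
print(np.linalg.matrix_rank(Xc))  # 19
```

With only 19 independent directions left in the centered matrix, a 20th component has nothing to capture.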

We used 10-fold CV to compare the results on the training set with those on the CV (validation) set.

Step 2. If data and model are ready, click the blue button to generate model results.

Output 1. Data Preview

Part of Data

Please edit data in Data tab

Output 2. Model Results

Result
Data Fitting
Component
Loading
Component and Loading 2D Plot
Component and Loading 3D Plot

Explanations

Results are given for 1 component, 2 components, ..., n components
'CV' is the cross-validation estimate
'adjCV' (for RMSEP and MSEP) is the bias-corrected cross-validation estimate
R^2 is equivalent to the squared correlation between the fitted values and the response. The R^2 shown for train is unadjusted, while the R^2 shown for CV is adjusted.
Choose the number of components that gives a high R^2 and a low MSEP / RMSEP
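
The three measures can be written out directly. The sketch below uses the textbook definitions with the number of observations as the denominator; the app may apply a different degrees-of-freedom correction, so treat the function names and the toy numbers as illustrative only.

```python
import math

def msep(y, y_hat):
    """Mean squared error of prediction."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmsep(y, y_hat):
    """Root mean squared error of prediction, in the units of y."""
    return math.sqrt(msep(y, y_hat))

def r_squared(y, y_hat):
    """1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]          # observed values (made up)
y_hat = [1.5, 1.5, 3.5, 3.5]      # predicted values (made up)

print(msep(y, y_hat))       # 0.25
print(rmsep(y, y_hat))      # 0.5
print(r_squared(y, y_hat))  # 0.8
```

RMSEP is often the easiest of the three to read because it is on the same scale as the response.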

10-fold cross-validation splits the data into 10 folds at random each time, so the results will not be exactly the same after a refresh.
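
The random splitting can be sketched as follows. The function kfold_indices and its seed argument are hypothetical (the app does not necessarily expose a seed); the point is only that a different shuffle produces different folds and therefore slightly different CV estimates.

```python
import random

def kfold_indices(n, k, seed=None):
    """Shuffle the row indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(20, 10, seed=1)

# Each fold is held out once while the model is fitted on the other rows.
for held_out in folds:
    train_rows = [i for i in range(20) if i not in held_out]
    assert len(train_rows) == 18

print(len(folds), [len(f) for f in folds])
```

Fixing the seed, where a tool allows it, makes the split and hence the CV results reproducible.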

R^2

Mean square error of prediction (MSEP)

Root mean square error of prediction (RMSEP)

The results show that, as A increased, the fit on the training set improved (higher R^2, lower MSEP and RMSEP).

However, the results in CV were different. Very good results in training combined with very bad results in CV indicate overfitting, which means poor predictive ability.

In this example, we decided to choose 3 components (A=3), according to the MSEP and RMSEP.

1. Predicted Y and residuals (Y-Predicted Y)

Coefficient

Explanations

This plot shows the relation between the scores of two components. You can use the score plot to assess the data structure and detect clusters, outliers, and trends
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

When A >=2, choose 2 different components to show component and loading 2D plot

1. Component at x-axis

2. Component at y-axis

In this plot, we plotted the scores of component 1 against component 2 and found that 327 and 332 were outliers.

Explanations

This plot shows the contributions of the variables to the PCs (choose the PC in the left panel)
Red indicates negative and blue indicates positive effects
Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
For descriptive purposes, you may need only 80% (0.8) of the variance explained.
If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.

Explanations

This plot (a biplot) overlays the components and the loadings (choose the PC in the left panel)
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component.
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

When A >=2, choose 2 different components to show component and loading 2D plot

1. Component at x-axis

2. Component at y-axis

Explanations

This is the extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
This plot works like the 2D plot. Each trace is a variable, which can be hidden by clicking on it.
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

This plot may take some time to load the first time

When A >=3, choose 3 different components to show component and loading 3D plot

1. Component at x-axis

2. Component at y-axis

3. Component at z-axis

4. (Optional) Change line scale (length)

Trace legend

Prediction

Prepare model first

Step 3. Test Set Preparation

Example data
Upload Data

Data: NKI

Data for prediction should cover all the variables in the model
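
A quick way to check this requirement before uploading is to compare column names. The sketch below is illustrative only: missing_variables is a hypothetical helper and the variable names are examples, not the actual list used by the model.

```python
def missing_variables(model_vars, new_columns):
    """Return the model variables that are absent from the new data."""
    present = set(new_columns)
    return [v for v in model_vars if v not in present]

model_vars = ["TSPYL5", "Diam", "Age"]            # example names only
new_columns = ["Age", "TSPYL5", "extra_column"]   # columns in the new file

print(missing_variables(model_vars, new_columns))  # ['Diam']
```

Extra columns in the new data are harmless in this check; only missing model variables are reported.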

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

The correct separator and quote character are needed for the data to be read successfully

Find some example data here

Step 4. If the model and new data are ready, click the blue button to generate prediction results.

Output. Model Results

Test Data
Predicted Dependent Variable
Predicted Components

Partial Least Squares Regression

Partial least squares regression (PLSR) is a regression analysis technique that finds a linear regression model by projecting the dependent variables (Y) and the independent variables (X) to a new space.
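
For a single response, the projection can be sketched with the classical NIPALS-style iteration: each component is the direction in X with the largest covariance with what is left of Y. The code below is a minimal sketch under that assumption; pls1_fit and the simulated data are illustrative, and it is not one of the algorithms offered in the app's menu.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS regression for one response using NIPALS-style deflation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E = X - x_mean                 # residual part of X
    f = y - y_mean                 # residual part of y
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                # direction of maximum covariance with y
        w = w / np.linalg.norm(w)
        t = E @ w                  # component scores
        tt = t @ t
        p = E.T @ t / tt           # X loadings
        c = f @ t / tt             # y loading
        E = E - np.outer(t, p)     # remove what this component explains
        f = f - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)   # coefficients on original X
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Simulated data (hypothetical): 20 observations, 5 predictors.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=20)

beta_one, b0_one = pls1_fit(X, y, n_components=1)
beta_all, b0_all = pls1_fit(X, y, n_components=5)
print(np.round(beta_all, 2))
```

Unlike PCR, the components here are chosen with Y in view, which is why a few PLS components often predict as well as many principal components.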

1. Functionalities

To obtain a correlation matrix and plots
To obtain the results from a model
To obtain the factors and loadings result tables
To obtain the factors and loadings distribution plots in 2D and 3D
To obtain the predicted dependent variables
To upload new data and conduct the prediction

2. About your data (training set)

All the data for analysis are numeric
New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.

Build the Model

Prepare the data in the Data tab

Step 1. Choose parameters to build the model

3. How many new components? (A <= dimension of X)

4. Do cross-validation?

No, use full data

10-fold cross-validation

Leave-one-out cross-validation

5. Which PLS algorithm?

SIMPLS: simple and fast

Kernel algorithm

Wide kernel algorithm

Classical orthogonal scores algorithm

These algorithms give very similar results

Scale the data?

PLSR can use more than one dependent variable and finds the linear relation between the Y matrix and the X matrix. Thus, in this example, we used time, Diam, and Age as dependent variables and the other variables as independent variables.

We wanted to find the components that had good predictive ability.

From the Data tab, we knew that X is a 20 by 25 matrix, so the maximum of A is 19. There will be an error if A = 20.

In this example, we decided to choose 3 components (A=3), according to the MSEP and RMSEP. We used 10-fold CV and the SIMPLS algorithm (simple and fast).

Step 2. If data and model are ready, click the blue button to generate model results.

Output 1. Data Preview

Part of Data

Please edit data in Data tab

Output 2. Model Results

Result
Data Fitting
Component
Loading
Component and Loading 2D Plot
Component and Loading 3D Plot

Explanations

Results are given for 1 component, 2 components, ..., n components
'CV' is the cross-validation estimate
'adjCV' (for RMSEP and MSEP) is the bias-corrected cross-validation estimate
R^2 is equivalent to the squared correlation between the fitted values and the response. The R^2 shown for train is unadjusted, while the R^2 shown for CV is adjusted.
Choose the number of components that gives a high R^2 and a low MSEP / RMSEP

10-fold cross-validation splits the data into 10 folds at random each time, so the results will not be exactly the same after a refresh.

R^2

Mean square error of prediction (MSEP)

Root mean square error of prediction (RMSEP)

Because we chose more than one dependent variable (Y), the results were shown for each Y.

PLSR generates new variables from both Y and X, so its R^2 is better than that of PCR. The variance explained (%) in Y is also higher than with PCR.

Predicted Y

Residuals (Y-Predicted Y)

Coefficient

Explanations

This plot shows the relation between the scores of two components. You can use the score plot to assess the data structure and detect clusters, outliers, and trends
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

When A >=2, choose 2 different components to show component and loading 2D plot

1. Component at x-axis

2. Component at y-axis

In this plot, we plotted the scores of component 1 against component 2 and found that 327 and 332 were outliers.

Explanations

This plot shows the contributions of the variables to the PCs (choose the PC in the left panel)
Red indicates negative and blue indicates positive effects
Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
For descriptive purposes, you may need only 80% (0.8) of the variance explained.
If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.

Explanations

This plot (a biplot) overlays the components and the loadings (choose the PC in the left panel)
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component.
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

When A >=2, choose 2 different components to show component and loading 2D plot

1. Component at x-axis

2. Component at y-axis

Explanations

This is the extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
This plot works like the 2D plot. Each trace is a variable, which can be hidden by clicking on it.
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

This plot may take some time to load the first time

When A >=3, choose 3 different components to show component and loading 3D plot

1. Component at x-axis

2. Component at y-axis

3. Component at z-axis

4. (Optional) Change line scale (length)

Trace legend

Prediction

Prepare model first

Step 3. Test Set Preparation

Example data
Upload Data

Data: NKI

Data for prediction should cover all the variables in the model

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

The correct separator and quote character are needed for the data to be read successfully

Find some example data here

Step 4. If the model and new data are ready, click the blue button to generate prediction results.

Output. Model Results

Test Data
Predicted dependent variables
Predicted Components

Sparse Partial Least Squares Regression

Sparse partial least squares regression (SPLSR) is a regression analysis technique that aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors.

1. Functionalities

To obtain a correlation matrix and plot
To obtain the results from a model
To obtain the factors and loadings result tables
To obtain the factors and loadings distribution plots in 2D and 3D
To obtain the predicted dependent variables
To upload new data and conduct the prediction

2. About your data (training set)

All the data for analysis are numeric
New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.

Build the Model

Prepare the data in the Data tab

Step 1. Choose parameters to build the model

3. How many new components? (A; a larger number selects more variables)

4. Parameter for selection range (a larger number selects fewer variables)

5. Which PLS algorithm?

SIMPLS: simple and fast

Kernel algorithm

Wide kernel algorithm

Classical orthogonal scores algorithm

Scale the data?

SPLS adds a penalty that makes variable selection possible. The penalty selects the variables that are likely to be useful for prediction, and the components are generated from the selected variables.
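
One common way to implement such a penalty is soft thresholding of the component's direction vector: entries smaller than a fraction of the largest entry are set to zero, and the rest are shrunk. The sketch below assumes this rule, with the selection parameter (called eta here) between 0 and 1; whether the app uses exactly this rule is an assumption, and the function name and numbers are made up.

```python
import math

def soft_threshold(direction, eta):
    """Zero out entries below eta * max|entry| and shrink the others."""
    cutoff = eta * max(abs(v) for v in direction)
    result = []
    for v in direction:
        shrunk = abs(v) - cutoff
        # Keep the sign of the original entry; small entries become 0.
        result.append(math.copysign(shrunk, v) if shrunk > 0 else 0.0)
    return result

# Hypothetical direction vector: one entry per independent variable.
z = [0.9, -0.5, 0.1, -0.05]

def n_selected(eta):
    return sum(1 for v in soft_threshold(z, eta) if v != 0.0)

print(n_selected(0.1), n_selected(0.5), n_selected(0.9))  # 3 2 1
```

This matches the behaviour described for the selection parameter: a larger value leaves fewer variables with non-zero weight.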

From the Data tab, we knew that X is a 20 by 25 matrix, so the maximum of A is 19. There will be an error if A = 20.

Step 2. If data and model are ready, click the blue button to generate model results.

Output 1. Data Preview

Cross-validated SPLS
Part of Data

Choose optimal parameters from the following ranges

Maximum new components (default: 1 to 10)

Parameter for selection range (a larger number selects fewer variables; default: 0.1 to 0.9)

Cross-validation chooses the parameters that give the minimum error, which provides a suggestion for the parameter choice.

Please edit data in Data tab

Output 2. Model Results

Selection
Data Fitting
Components
Loading
Component and Loading 2D Plot
Component and Loading 3D Plot

Selected variables

Predicted Y

This plot shows how the coefficients change as variables are selected

Which response (N'th dependent variable) to plot

Coefficient

These are the components derived from the selected variables

Explanations

This plot shows the relation between the scores of two components. You can use the score plot to assess the data structure and detect clusters, outliers, and trends
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

When A >=2, choose 2 different components to show component and loading 2D plot

1. Component at x-axis

2. Component at y-axis

In this plot, we plotted the scores of component 1 against component 2 and found that 378 was an outlier.

These are the loadings derived from the selected variables

Explanations

This plot shows the contributions of the variables to the PCs (choose the PC in the left panel)
Red indicates negative and blue indicates positive effects
Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the factors explain.
For descriptive purposes, you may need only 80% (0.8) of the variance explained.
If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the factors.

These are the loadings derived from the selected variables

Explanations

This plot (a biplot) overlays the components and the loadings (choose the PC in the left panel)
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component.
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

When A >=2, choose 2 different components to show component and loading 2D plot

1. Component at x-axis

2. Component at y-axis

Explanations

This is the extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
This plot works like the 2D plot. Each trace is a variable, which can be hidden by clicking on it.
If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
Loadings identify which variables have the largest effect on each component
Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

This plot may take some time to load the first time

When A >=3, choose 3 components to show component and loading 3D plot

1. Component at x-axis

2. Component at y-axis

3. Component at z-axis

The x, y, and z components must be different

4. (Optional) Change line scale (length)

Trace legend

Prediction

Prepare model first

Step 3. Test Set Preparation

Example data
Upload Data

Data: NKI

Data for prediction should cover all the variables in the model

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

The correct separator and quote character are needed for the data to be read successfully

Find some example data here

Step 4. If the model and new data are ready, click the blue button to generate prediction results.

Output. Model Results

Test Data
Predicted Dependent Variable