- To upload data files, preview the data set, and check that the data were read in correctly
- To pre-process some variables (when necessary) before building the model
- To get basic descriptive statistics and plots of the variables

- All the data need to be numeric
- The data used to build a model is called the **training set**

**Data Preview**

**1. Numeric variable information list**

**2. Categorical variable information list**

**Linear fitting plot**: to roughly show the linear relation between any two numeric variables.
The grey area is the 95% confidence interval.
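
The fitted line and its grey band can be sketched as follows. This is a minimal illustration, not the code behind the plot: the function name `linear_fit_with_band` is hypothetical, the data are simulated, and the band uses the normal-approximation multiplier 1.96 instead of the exact t quantile.

```python
import numpy as np

def linear_fit_with_band(x, y, z=1.96):
    """Least-squares line plus an approximate 95% band for the fitted mean.

    z=1.96 is a normal approximation (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    slope, intercept = np.polyfit(x, y, 1)      # degree-1 fit: [slope, intercept]
    fitted = intercept + slope * x
    resid = y - fitted
    s2 = resid @ resid / (n - 2)                # residual variance
    sxx = np.sum((x - x.mean()) ** 2)
    # Standard error of the fitted mean at each x
    se_mean = np.sqrt(s2 * (1.0 / n + (x - x.mean()) ** 2 / sxx))
    return slope, intercept, fitted - z * se_mean, fitted + z * se_mean

# Simulated data, for illustration only
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
slope, intercept, lower, upper = linear_fit_with_band(x, y)
```

The band is narrowest near the mean of x and widens toward the edges, which is the shape the grey area takes in the plot.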

**3. Change the labels of X and Y axes**

**Histogram**: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

**Density plot**: to show the distribution of a variable

**Histogram and Density Plot**

When the number of bins is set to 0, the plot uses the default number of bins.
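
The "0 means default" convention can be sketched like this. The helper `histogram_counts` is hypothetical, and NumPy's `"auto"` rule stands in for the default; the rule the plot actually uses may differ.

```python
import numpy as np

def histogram_counts(values, n_bins=0):
    """Count observations per bin; n_bins=0 falls back to a default rule."""
    values = np.asarray(values, dtype=float)
    # Assumption: NumPy's "auto" rule is used as the stand-in default
    bins = n_bins if n_bins > 0 else "auto"
    counts, edges = np.histogram(values, bins=bins)
    return counts, edges

# Simulated data, for illustration only
rng = np.random.default_rng(0)
data = rng.normal(size=200)
default_counts, default_edges = histogram_counts(data, 0)
five_counts, five_edges = histogram_counts(data, 5)
```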

**Density plot**

- To obtain a correlation matrix and plots
- To obtain the results from a model
- To obtain the factors and loadings result tables
- To obtain the factors and loadings distribution plots in 2D and 3D
- To obtain the predicted dependent variables
- To upload new data and conduct the prediction

- All the data for analysis are numeric
- New data (test set) should cover all the independent variables used in the model.
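
A quick way to check that a test set covers the model's independent variables is to compare column names. This is a small sketch; `missing_predictors` and the column names are hypothetical.

```python
def missing_predictors(model_columns, new_data_columns):
    """Return the model's independent variables that the new data lack."""
    new_cols = set(new_data_columns)
    return [c for c in model_columns if c not in new_cols]

# Hypothetical column names, for illustration only
model_columns = ["x1", "x2", "x3"]
ok = missing_predictors(model_columns, ["x3", "x1", "x2", "extra"])
bad = missing_predictors(model_columns, ["x1", "x3"])
```

An empty result means prediction can proceed; extra columns in the new data do no harm, but any missing one has to be supplied first.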

**Part of Data**

Please edit the data in the Data tab

- Results are given for 1 component, 2 components, ..., n components
- 'CV' is the cross-validation estimate
- 'adjCV' (for RMSEP and MSEP) is the bias-corrected cross-validation estimate
- R^2 is equivalent to the squared correlation between the fitted values and the response. The R^2 shown for train is unadjusted, while the R^2 shown for CV is adjusted.
- Choose the number of components that gives a high R^2 and a low MSEP / RMSEP

10-fold cross-validation randomly splits the data into 10 folds each time, so the results will not be exactly the same after a refresh.
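
The following sketch shows how training and cross-validated R^2, MSEP, and RMSEP can be computed for each number of components A. It is a plain NumPy illustration on simulated data, not the routine used to produce the tables: all function names are hypothetical, and it does not compute the bias-corrected 'adjCV' value.

```python
import numpy as np

def pcr_fit(X, y, a):
    """Principal component regression with the first a components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    _, _, vt = np.linalg.svd(X - x_mean, full_matrices=False)
    V = vt[:a].T                                  # loadings of the first a PCs
    T = (X - x_mean) @ V                          # scores
    gamma, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    return x_mean, y_mean, V @ gamma              # coefficients on the X scale

def pcr_predict(model, X):
    x_mean, y_mean, beta = model
    return y_mean + (X - x_mean) @ beta

def metrics(y, y_hat):
    """Return (R^2, MSEP, RMSEP) for observed and predicted values."""
    msep = np.mean((y - y_hat) ** 2)
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return r2, msep, np.sqrt(msep)

def cv_predictions(X, y, a, k=10, seed=None):
    """k-fold CV: each observation is predicted by a model that never saw it."""
    rng = np.random.default_rng(seed)             # seed=None gives a new split each run
    folds = np.array_split(rng.permutation(len(y)), k)
    y_hat = np.empty_like(y, dtype=float)
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        y_hat[test_idx] = pcr_predict(pcr_fit(X[train], y[train], a), X[test_idx])
    return y_hat

# Simulated data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.3, size=60)

train_msep, cv_msep = [], []
for a in range(1, X.shape[1] + 1):
    train_msep.append(metrics(y, pcr_predict(pcr_fit(X, y, a), X))[1])
    cv_msep.append(metrics(y, cv_predictions(X, y, a, seed=0))[1])
best_a = int(np.argmin(cv_msep)) + 1              # A with the lowest CV MSEP
```

Training MSEP can only go down as A grows, while CV MSEP may rise again once extra components start fitting noise. Leaving `seed=None` reproduces the behaviour described above, where each refresh draws a new random split.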

**R^2**

**Mean square error of prediction (MSEP)**

**Root mean square error of prediction (RMSEP)**

*The results show that as A increases, the training results improve (higher R^2, lower MSEP and RMSEP).*

*However, the CV results were different. Very good training results combined with very poor CV results indicate overfitting, which means poor predictive ability.*

*In this example, we chose 3 components (A=3), based on the MSEP and RMSEP.*

**1. Predicted Y and residuals (Y-Predicted Y)**

**Coefficient**

- This plot graphs the scores of two components against each other. You can use the score plot to assess the data structure and detect clusters, outliers, and trends
- If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

**When A >=2, choose 2 different components to show component and loading 2D plot**

*In this plot, we plotted the scores of component 1 and component 2 and found that points 327 and 332 were outliers.*

- This plot shows the contributions of the variables to the PCs (choose PC in the left panel)
- Red indicates negative and blue indicates positive effects
- Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
- For descriptive purposes, you may need only 80% (0.8) of the variance explained.
- If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.
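
The cumulative proportion of variance, and the number of components needed to reach a target such as 0.8 or 0.9, can be sketched as below. The function names are hypothetical, the data are simulated, and the sketch only centers the variables; whether the variance table is based on scaled variables is not assumed here.

```python
import numpy as np

def cumulative_variance(X):
    """Cumulative proportion of variance explained by successive PCs."""
    Xc = X - X.mean(axis=0)                       # center only (an assumption)
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values
    var = s ** 2                                  # proportional to PC variances
    return np.cumsum(var) / var.sum()

def components_needed(cum_prop, target):
    """Smallest number of components whose cumulative proportion reaches target."""
    idx = int(np.searchsorted(cum_prop, target))  # first index with value >= target
    return min(idx + 1, len(cum_prop))

# Simulated data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
cum_prop = cumulative_variance(X)
n_80 = components_needed(cum_prop, 0.8)
n_90 = components_needed(cum_prop, 0.9)
```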

- This plot (a biplot) overlays the components and the loadings (choose PC in the left panel)
- If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
- Loadings identify which variables have the largest effect on each component.
- Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.
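
As a rough illustration of reading loadings, the sketch below takes them to be the unit-length principal directions of the centered data, so every entry lies between -1 and 1. That definition is an assumption of the sketch (some software rescales loadings), and the function and variable names are hypothetical.

```python
import numpy as np

def pca_loadings(X):
    """Loadings as unit-length principal directions (rows: variables, columns: PCs)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt.T

# Simulated data with hypothetical variable names, for illustration only
rng = np.random.default_rng(0)
names = ["v1", "v2", "v3"]
X = rng.normal(size=(80, 3)) * np.array([4.0, 1.0, 0.5])
loadings = pca_loadings(X)

# For each component, the variable with the largest absolute loading
dominant = [names[int(np.argmax(np.abs(loadings[:, j])))] for j in range(loadings.shape[1])]
```

Only the absolute size of a loading measures influence; the sign gives the direction of the effect, matching the red and blue colouring of the loadings plot.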

**When A >=2, choose 2 different components to show component and loading 2D plot**

- This is the 3D extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
- This plot works similarly to the 2D plots. Each trace is a variable, and it can be hidden by clicking on it.
- If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
- Loadings identify which variables have the largest effect on each component
- Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

**This plot needs some time to load for the first time**

**When A >=3, choose 3 different components to show component and loading 3D plot**

**Trace legend**

- To obtain a correlation matrix and plots
- To obtain the results from a model
- To obtain the factors and loadings result tables
- To obtain the factors and loadings distribution plots in 2D and 3D
- To obtain the predicted dependent variables
- To upload new data and conduct the prediction

- All the data for analysis are numeric
- New data (test set) should cover all the independent variables used in the model.

**Part of Data**

Please edit the data in the Data tab

- Results are given for 1 component, 2 components, ..., n components
- 'CV' is the cross-validation estimate
- 'adjCV' (for RMSEP and MSEP) is the bias-corrected cross-validation estimate
- R^2 is equivalent to the squared correlation between the fitted values and the response. The R^2 shown for train is unadjusted, while the R^2 shown for CV is adjusted.
- Choose the number of components that gives a high R^2 and a low MSEP / RMSEP

10-fold cross-validation randomly splits the data into 10 folds each time, so the results will not be exactly the same after a refresh.

**R^2**

**Mean square error of prediction (MSEP)**

**Root mean square error of prediction (RMSEP)**

*Because we chose more than one dependent variable (Y), the results are shown for each Y.*

*PLSR generates new variables from both Y and X, so its R^2 is higher than that of PCR. The variance explained (%) in Y is also higher than with PCR.*
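
The difference can be seen with a single response and a single component. In the sketch below, the first PCR direction is chosen from X alone, while the first PLSR direction is proportional to X'y and therefore uses Y as well. It is a simplified illustration on simulated data with hypothetical names, not the multi-response fit used in the app.

```python
import numpy as np

def r2_one_component(X, y, direction):
    """Training R^2 when centered y is regressed on the single score X @ direction."""
    t = X @ direction
    y_hat = t * (t @ y) / (t @ t)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum(y ** 2)

# Simulated data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([5.0, 1.0, 1.0, 1.0])
y = X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=100)
X = X - X.mean(axis=0)                            # center X and y
y = y - y.mean()

# PCR: first principal direction, chosen without looking at y
_, _, vt = np.linalg.svd(X, full_matrices=False)
w_pcr = vt[0]

# PLSR: first weight vector, proportional to X'y
w_pls = X.T @ y
w_pls = w_pls / np.linalg.norm(w_pls)

r2_pcr = r2_one_component(X, y, w_pcr)
r2_pls = r2_one_component(X, y, w_pls)
```

Here the high-variance column of X carries no information about y, so the first principal component explains little of Y, while the PLSR component points toward the columns that do.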

**Predicted Y**

**Residuals (Y-Predicted Y)**

**Coefficient**

- This plot graphs the scores of two components against each other. You can use the score plot to assess the data structure and detect clusters, outliers, and trends

**When A >=2, choose 2 different components to show component and loading 2D plot**

*In this plot, we plotted the scores of component 1 and component 2 and found that points 327 and 332 were outliers.*

- This plot shows the contributions of the variables to the PCs (choose PC in the left panel)
- Red indicates negative and blue indicates positive effects
- Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
- For descriptive purposes, you may need only 80% (0.8) of the variance explained.
- If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.

- This plot (a biplot) overlays the components and the loadings (choose PC in the left panel)
- Loadings identify which variables have the largest effect on each component.
- Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

**When A >=2, choose 2 different components to show component and loading 2D plot**

- This is the 3D extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
- This plot works similarly to the 2D plots. Each trace is a variable, and it can be hidden by clicking on it.
- Loadings identify which variables have the largest effect on each component

**This plot needs some time to load for the first time**

**When A >=3, choose 3 different components to show component and loading 3D plot**

**Trace legend**

- To obtain a correlation matrix and plot
- To obtain the results from a model
- To obtain the factors and loadings result tables
- To obtain the factors and loadings distribution plots in 2D and 3D
- To obtain the predicted dependent variables
- To upload new data and conduct the prediction

- All the data for analysis are numeric
- New data (test set) should cover all the independent variables used in the model.

Choose optimal parameters from the following ranges

Cross-validation picks the parameters that give the minimum error, as a suggestion for which parameters to choose.

Please edit the data in the Data tab

**Selected variables**

**Predicted Y**

**This plot shows how the coefficients change as variables are selected**

**Coefficient**

These are the components derived from the selected variables

- This plot graphs the scores of two components against each other. You can use the score plot to assess the data structure and detect clusters, outliers, and trends

**When A >=2, choose 2 different components to show component and loading 2D plot**

*In this plot, we plotted the scores of component 1 and component 2 and found that point 378 was an outlier.*

**These are the loadings derived from the selected variables**

- This plot shows the contributions of the variables to the PCs (choose PC in the left panel)
- Red indicates negative and blue indicates positive effects
- Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the factors explain.
- For descriptive purposes, you may need only 80% (0.8) of the variance explained.
- If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the factors.

**These are the loadings derived from the selected variables**

- This plot (a biplot) overlays the components and the loadings (choose PC in the left panel)
- Loadings identify which variables have the largest effect on each component.

**When A >=2, choose 2 different components to show component and loading 2D plot**

- This is the 3D extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
- This plot works similarly to the 2D plots. Each trace is a variable, and it can be hidden by clicking on it.
- Loadings identify which variables have the largest effect on each component

**This plot needs some time to load for the first time**

**When A >=3, choose components to show factor and loading 3D plot**

x, y, and z must be different

**Trace legend**