# Data Preparation

#### 1. Functionalities

• To upload data files, preview the data set, and check the correctness of the data input
• To pre-process some variables (when necessary) before building the model
• To obtain basic descriptive statistics and plots of the variables

• The data must be all numeric
• The data used to build a model are called the training set
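The numeric check and the pre-processing can also be done before upload; a minimal sketch with pandas (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical training set: two numeric columns and one categorical column
df = pd.DataFrame({
    "age": [45, 52, 38, 61],
    "expr_gene1": [0.12, -0.40, 0.93, 0.05],
    "dose": ["low", "high", "low", "medium"],
})

# Find the columns that are not already numeric
non_numeric = df.columns[~df.dtypes.apply(pd.api.types.is_numeric_dtype)]

# One-hot encode the categorical columns so the data are all numeric
df_numeric = pd.get_dummies(df, columns=list(non_numeric))
```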

#### Case Example: NKI data

Suppose that in one study we wanted to explore the effect of clinical and genomic factors on metastasis-free survival in lymph node-positive breast cancer patients. The data contained the clinical risk factors (1) Age: patient age at diagnosis (years) and (2) the years until relapse, together with expression measurements of 70 genes found to be prognostic for metastasis-free survival in an earlier study. In this example, we wanted to build a model relating age, years until relapse, and the gene expression measurements.

#### Case Example: Liver toxicity data

This data set contains the expression measures and clinical measurements for rats that were exposed to non-toxic, moderately toxic, or severely toxic doses of acetaminophen in a controlled experiment.

#### Output 1. Data Information

Data Preview

1. Numeric variable information list

2. Categorical variable information list

#### Output 2. Descriptive Results

1. For numeric variable

2. For categorical variable

Linear fitting plot: roughly shows the linear relation between any two numeric variables. The grey area is the 95% confidence interval.

3. Change the labels of X and Y axes

Histogram: roughly shows the probability distribution of a variable by depicting the frequencies of observations falling in certain ranges of values.

Density plot: shows the distribution of a variable as a smooth curve.

Histogram and Density Plot

When the number of bins is 0, the plot uses the default number of bins

Density plot
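The same two views of a distribution can be sketched outside the app with NumPy and SciPy (synthetic data; `"auto"` stands in for the default bin count):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)

# Histogram: frequencies of observations falling in each bin
counts, edges = np.histogram(x, bins="auto")

# Kernel density estimate: a smooth view of the same distribution
kde = gaussian_kde(x)
grid = np.linspace(x.min(), x.max(), 200)
density = kde(grid)
```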

# Principal Component Regression

Principal component regression (PCR) is a regression analysis technique based on principal component analysis (PCA): instead of regressing the dependent variable on the original explanatory variables directly, it uses principal components of the explanatory variables as regressors.
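A minimal PCR sketch, assuming scikit-learn and synthetic data: PCA extracts components of X, then ordinary least squares regresses y on them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=100)

# PCR = PCA on the predictors followed by ordinary least squares
pcr = make_pipeline(PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
r2_train = pcr.score(X, y)  # unadjusted R^2 on the training set
```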

#### 1. Functionalities

• To obtain a correlation matrix and plots
• To obtain the results from a model
• To obtain the factor and loading distribution plots in 2D and 3D
• To obtain the predicted dependent variables
• To upload new data and conduct the prediction

• All the data for analysis must be numeric
• The new data (test set) must cover all the independent variables used in the model.

#### Output 1. Data Preview

Part of Data

Please edit data in Data tab

#### Output 2. Model Results

Explanations
• The results for 1 component, 2 components, ..., n components are given
• 'CV' is the cross-validation estimate
• 'adjCV' (for RMSEP and MSEP) is the bias-corrected cross-validation estimate
• R^2 is equivalent to the squared correlation between the fitted values and the response. The R^2 shown under train is the unadjusted one, while that shown under CV is the adjusted one.
• A number of components with a high R^2 and a low MSEP / RMSEP is recommended

10-fold cross-validation randomly splits the data into 10 folds each time, so the results will not be exactly the same after a refresh.

R^2

Mean square error of prediction (MSEP)

Root mean square error of prediction (RMSEP)
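These two error measures can be written out explicitly; a small NumPy sketch with made-up numbers:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])       # observed response
y_pred = np.array([2.5, 5.5, 6.0, 9.5])  # predicted response

msep = np.mean((y - y_pred) ** 2)  # mean square error of prediction
rmsep = np.sqrt(msep)              # root mean square error of prediction
```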

From the results we can see that, as the number of components (A) increases, the training results get better (higher R^2, lower MSEP and RMSEP).

However, the results in CV were different. Extremely good results in training combined with extremely bad results in CV indicate overfitting, i.e. a poor ability to predict new data.

In this example, we decided to choose 3 components (A = 3), according to the MSEP and RMSEP.
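Selecting A by the cross-validated error can be sketched as follows (scikit-learn and synthetic data; the app's CV table plays the same role):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 8))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=120)

# 10-fold CV estimate of MSEP for A = 1..8 components
msep_by_a = {}
for a in range(1, 9):
    model = make_pipeline(PCA(n_components=a), LinearRegression())
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_mean_squared_error")
    msep_by_a[a] = -scores.mean()

# Choose the A with the smallest cross-validated error
best_a = min(msep_by_a, key=msep_by_a.get)
```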

1. Predicted Y and residuals (Y-Predicted Y)

Coefficient

Explanations
• This plot graphs the relation between two component scores; you can use the score plot to assess the data structure and detect clusters, outliers, and trends
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

When A >=2, choose 2 different components to show component and loading 2D plot

In this plot, we plotted the scatter points of component 1 against component 2 and found that observations 327 and 332 were outliers.

Explanations
• This plot shows the contributions of the variables to the PCs (choose the PC in the left panel)
• Red indicates negative and blue indicates positive effects
• Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
• For descriptive purposes, you may need only 80% (0.8) of the variance explained.
• If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.

Explanations
• This plot (biplots) overlays the components and the loadings (choose PC in the left panel)
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
• Loadings identify which variables have the largest effect on each component.
• Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

When A >=2, choose 2 different components to show component and loading 2D plot

Explanations
• This is the extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
• This plot has similar functionality to the 2D plot. Each trace corresponds to a variable and can be hidden by clicking it in the legend.
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
• Loadings identify which variables have the largest effect on each component
• Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

This plot needs some time to load for the first time

When A >=3, choose 3 different components to show component and loading 3D plot

Trace legend

# Partial Least Squares Regression

Partial least squares regression (PLSR) is a regression analysis technique that finds a linear regression model by projecting the predicted variables and the observable variables to a new space.

#### 1. Functionalities

• To obtain a correlation matrix and plots
• To obtain the results from a model
• To obtain the factor and loading distribution plots in 2D and 3D
• To obtain the predicted dependent variables
• To upload new data and conduct the prediction

• All the data for analysis must be numeric
• The new data (test set) must cover all the independent variables used in the model.

#### Output 1. Data Preview

Part of Data

Please edit data in Data tab

#### Output 2. Model Results

Explanations
• The results for 1 component, 2 components, ..., n components are given
• 'CV' is the cross-validation estimate
• 'adjCV' (for RMSEP and MSEP) is the bias-corrected cross-validation estimate
• R^2 is equivalent to the squared correlation between the fitted values and the response. The R^2 shown under train is the unadjusted one, while that shown under CV is the adjusted one.
• A number of components with a high R^2 and a low MSEP / RMSEP is recommended

10-fold cross-validation randomly splits the data into 10 folds each time, so the results will not be exactly the same after a refresh.

R^2

Mean square error of prediction (MSEP)

Root mean square error of prediction (RMSEP)

Because we chose more than one dependent variable (Y), the results are shown for each Y.

PLSR derives new variables from both Y and X, so its R^2 is better than PCR's. The variance explained (%) in Y is also higher than with PCR.

Predicted Y

Residuals (Y-Predicted Y)

Coefficient

Explanations
• This plot graphs the relation between two component scores; you can use the score plot to assess the data structure and detect clusters, outliers, and trends
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

When A >=2, choose 2 different components to show component and loading 2D plot

In this plot, we plotted the scatter points of component 1 against component 2 and found that observations 327 and 332 were outliers.

Explanations
• This plot shows the contributions of the variables to the PCs (choose the PC in the left panel)
• Red indicates negative and blue indicates positive effects
• Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
• For descriptive purposes, you may need only 80% (0.8) of the variance explained.
• If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.

Explanations
• This plot (biplots) overlays the components and the loadings (choose PC in the left panel)
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
• Loadings identify which variables have the largest effect on each component.
• Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

When A >=2, choose 2 different components to show component and loading 2D plot

Explanations
• This is the extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
• This plot has similar functionality to the 2D plot. Each trace corresponds to a variable and can be hidden by clicking it in the legend.
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
• Loadings identify which variables have the largest effect on each component
• Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

This plot needs some time to load for the first time

When A >=3, choose 3 different components to show component and loading 3D plot

Trace legend

# Sparse Partial Least Squares Regression

Sparse partial least squares regression (SPLSR) is a regression analysis technique that aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors.
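The sparsity idea can be illustrated with a toy one-component sketch (NumPy, synthetic data). It soft-thresholds the PLS weight vector so that weak predictors get exactly zero weight; this is only an illustration, not the exact SPLS algorithm used by the app:

```python
import numpy as np

def sparse_pls_component(X, y, threshold=0.3):
    """Toy one-component sparse PLS via soft-thresholding."""
    w = X.T @ y                 # PLS direction: covariance of X with y
    w = w / np.abs(w).max()     # scale so the largest |weight| is 1
    # Soft threshold: shrink weights toward zero, dropping the weak ones
    w = np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)
    t = X @ w                   # sparse component (score vector)
    return w, t

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 10))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Center before fitting, as is standard for PLS
X = X - X.mean(axis=0)
y = y - y.mean()
w, t = sparse_pls_component(X, y, threshold=0.3)
selected = np.flatnonzero(w)  # indices of the variables that were kept
```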

#### 1. Functionalities

• To obtain a correlation matrix and plot
• To obtain the results from a model
• To obtain the factor and loading distribution plots in 2D and 3D
• To obtain the predicted dependent variables
• To upload new data and conduct the prediction

• All the data for analysis must be numeric
• The new data (test set) must cover all the independent variables used in the model.

#### Output 1. Data Preview

Choose optimal parameters from the following ranges

Cross-validation chooses the parameters according to the minimum error, giving suggestions for the parameter choice.

Please edit data in Data tab

#### Output 2. Model Results

Selected variables

Predicted Y

This plot shows how the coefficients change as variables are selected

Coefficient

These are the components derived from the selected variables

Explanations
• This plot graphs the relation between two component scores; you can use the score plot to assess the data structure and detect clusters, outliers, and trends
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

When A >=2, choose 2 different components to show component and loading 2D plot

In this plot, we plotted the scatter points of component 1 against component 2 and found that observation 378 was an outlier.

Explanations
• This plot shows the contributions of the variables to the PCs (choose the PC in the left panel)
• Red indicates negative and blue indicates positive effects
• Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the factors explain.
• For descriptive purposes, you may need only 80% (0.8) of the variance explained.
• If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the factors.

Explanations
• This plot (biplots) overlays the components and the loadings (choose PC in the left panel)
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
• Loadings identify which variables have the largest effect on each component.
• Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

When A >=2, choose 2 different components to show component and loading 2D plot

Explanations
• This is the extension of the 2D plot. It overlays the components and the loadings for 3 PCs (choose the PCs and the length of the lines in the left panel)
• This plot has similar functionality to the 2D plot. Each trace corresponds to a variable and can be hidden by clicking it in the legend.
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
• Loadings identify which variables have the largest effect on each component
• Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

This plot needs some time to load for the first time

When A >=3, choose 3 different components to show the factor and loading 3D plot

The x, y, and z components must be different

Trace legend