# Data Preparation

#### 1. Functionalities

• To upload data files, preview the data set, and check that the data were entered correctly
• To pre-process some variables (when necessary) for building the model
• To obtain basic descriptive statistics and plots of the variables

• Your data need to have more rows than columns
• Your data need to be all numeric
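The two requirements above can be checked before uploading a file. The following is only a minimal sketch, assuming the table has already been read into a list of rows; the function name `check_data_requirements` is hypothetical and not part of the tool.

```python
def check_data_requirements(rows):
    """Return a list of problems; an empty list means the table passes.

    `rows` is a list of equally long lists, one per observation.
    This helper is an illustration only, not part of the tool.
    """
    if not rows:
        return ["the table is empty"]
    problems = []
    n_rows = len(rows)
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        problems.append("rows have different numbers of columns")
    # Requirement 1: more rows than columns
    if n_rows <= n_cols:
        problems.append(
            f"need more rows than columns, got {n_rows} rows and {n_cols} columns"
        )
    # Requirement 2: every value is numeric (bool is excluded on purpose)
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"non-numeric value at row {i}, column {j}: {value!r}")
    return problems


good = [[1.0, 2.0], [3.0, 4.5], [5.0, 6.1]]
bad = [[1.0, "high"], [3.0, 4.5]]
print(check_data_requirements(good))
print(check_data_requirements(bad))
```

An empty list means both requirements hold; otherwise each entry describes one violation.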

#### Case Example 1: Mouse gene expression data

This data set contains gene expression measurements from 20 mice in a diet experiment. Some mice shared the same genotype, and some gene variables were correlated. We wanted to compute linearly uncorrelated principal components from the gene expression data.

#### Case Example 2: Chemical data

Suppose that a study measured 9 chemical attributes of 7 types of drugs. Some chemical attributes had latent associations. We wanted to explore the latent relational structure among the chemical variables and reduce them to a smaller number of variables.

#### Output 1. Data Information

Data Preview

1. Numeric variable information list

2. Categorical variable information list

#### Output 2. Descriptive Results

1. For numeric variables

2. For categorical variables

Linear fitting plot: to roughly show the linear relation between any two numeric variables. The grey area is the 95% confidence interval.
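As a rough illustration of the straight line drawn in such a plot, the sketch below fits an ordinary least-squares line to two numeric variables. It assumes simple linear regression and does not compute the confidence band; the function name `fit_line` and the sample values are made up for the example.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and sum of cross-products of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative values only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```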

3. Change the labels of X and Y axes

Histogram: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

Density plot: to show the distribution of a variable

Histogram and Density Plot

When the number of bins is 0, the plot uses the default number of bins.
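The sketch below shows how a histogram counts observations per range and how a bin setting of 0 can fall back to a default. The fallback value `DEFAULT_BINS = 30` is an assumption made for illustration; the tool's actual default is not stated here.

```python
DEFAULT_BINS = 30  # hypothetical default; the tool's real default is not documented here


def histogram_counts(values, bins=0):
    """Count how many values fall into each of `bins` equal-width ranges."""
    if bins == 0:
        bins = DEFAULT_BINS  # 0 means "use the default number of bins"
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(histogram_counts(values, bins=4))
print(len(histogram_counts(values, bins=0)))
```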

Density plot

# Principal Component Analysis

Principal components analysis (PCA) is a data reduction technique that transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components.
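To make the idea concrete, here is a minimal sketch of PCA computed from the eigendecomposition of the correlation matrix. Whether the tool works on the correlation or the covariance matrix is not stated, so standardizing the variables is an assumption; the function name `pca` and the simulated data are illustrative only.

```python
import numpy as np


def pca(data):
    """PCA on standardized data (rows = samples, columns = variables)."""
    X = np.asarray(data, dtype=float)
    # Standardize each variable (assumption: correlation-based PCA)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    # eigh returns ascending eigenvalues; reorder to descending
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs  # principal component scores
    return eigvals, eigvecs, scores


# Simulated data: two strongly correlated variables plus one independent variable
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
data = np.hstack([
    base + 0.1 * rng.normal(size=(50, 1)),
    base + 0.1 * rng.normal(size=(50, 1)),
    rng.normal(size=(50, 1)),
])
eigvals, eigvecs, scores = pca(data)
print("proportion of variance:", np.round(eigvals / eigvals.sum(), 3))
print("score covariance:\n", np.round(np.cov(scores, rowvar=False), 3))
```

The covariance matrix of the scores is diagonal, which is what "uncorrelated" means for the principal components.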

#### 1. Functionalities

• To estimate the number of components from parallel analysis
• To obtain a correlation matrix and draw plots
• To obtain the principal components and loadings distribution plots in 2D and 3D

• All the data for analysis are numeric
• The sample size is larger than the number of independent variables; that is, the number of rows is greater than the number of columns

#### Output 1. Data Explores

Part of Data

Please edit data in the Data tab

#### Output 2. Model Results

Explanations
• This plot graphs the relation between two components. You can use the score plot to assess the data structure and detect clusters, outliers, and trends
• Groupings of data on the plot may indicate two or more separate distributions in the data
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

2. When A >= 2, choose 2 components to show the component and loading 2D plot

In the plot of PC1 and PC2 (without group circles), we can see some outliers, for example, points 11 and 23. If we choose diet as the grouping variable and add group circles based on Euclidean distance, we can see that diet type sun is separated from the others.

Explanations
• This plot shows the contributions of the variables to the PCs (choose PC in the left panel)
• Red indicates negative and blue indicates positive effects
• Use the cumulative proportion of variance (in the variance table) to determine the amount of variance that the components explain.
• For descriptive purposes, you may need only 80% (0.8) of the variance explained.
• If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the components.
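The 80% and 90% guidelines above can be applied mechanically to the cumulative proportion of variance. The sketch below assumes the eigenvalues (variances) are sorted from largest to smallest; the function name `components_needed` and the eigenvalues are made up for illustration.

```python
def components_needed(eigenvalues, threshold=0.8):
    """Return (k, cumulative proportion) for the smallest k reaching `threshold`.

    `eigenvalues` must be sorted from largest to smallest.
    """
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(eigenvalues, start=1):
        cumulative += ev / total  # add this component's proportion of variance
        if cumulative >= threshold:
            return k, cumulative
    return len(eigenvalues), cumulative


# Illustrative eigenvalues only
eigenvalues = [5.0, 2.0, 1.5, 1.0, 0.5]
k80, cum80 = components_needed(eigenvalues, threshold=0.8)
k90, cum90 = components_needed(eigenvalues, threshold=0.9)
print(k80, round(cum80, 2))
print(k90, round(cum90, 2))
```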

Variance table

Explanations
• This plot (a biplot) overlays the components and the loadings (choose PC in the left panel)
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
• Loadings identify which variables have the largest effect on each component.
• Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.
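One common definition that yields loadings between -1 and 1 is the eigenvector of the correlation matrix scaled by the square root of its eigenvalue, which equals the correlation between a variable and a component. The sketch below assumes that definition; the tool may compute loadings differently, and the function name `pca_loadings` and the matrix `R` are illustrative.

```python
import numpy as np


def pca_loadings(R):
    """Loadings from a correlation matrix: eigenvectors scaled by sqrt(eigenvalue)."""
    eigvals, eigvecs = np.linalg.eigh(np.asarray(R, dtype=float))
    # Reorder so the first column belongs to the largest eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Clip tiny negative eigenvalues caused by rounding before taking the root
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))


# Illustrative correlation matrix: variables 1 and 2 are strongly correlated
R = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
L = pca_loadings(R)
print(np.round(L, 3))  # rows = variables, columns = components
```

With all components kept, the squared loadings in each row sum to 1, so no single loading can exceed 1 in absolute value.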

When A >= 2, choose 2 components to show the component and loading 2D plot

In the plot of PC1 and PC2, ACAT2 has a comparatively strong negative effect on PC1, and PKD4 has a strong positive effect on PC1. For PC2, THIOL has a strong positive effect and VDR has a strong negative effect. These results correspond to the loading plot.

Explanations
• This is the 3D extension of the 2D plot. This plot overlays the components and the loadings for 3 PCs (choose PCs and the length of lines in the left panel)
• We can find the outliers in the plot.
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
• Loadings identify which variables have the largest effect on each component
• Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component.

This plot may take some time to load the first time

When A >= 3, choose 3 components to show the component and loading 3D plot

By default, the first 3 PCs are shown in the 3D plot

Trace legend

# Exploratory Factor Analysis

Exploratory factor analysis (EFA) is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.

#### 1. Functionalities

• To estimate the number of factors from parallel analysis
• To obtain a correlation matrix and plots
• To obtain the factors and loadings distribution plots in 2D and 3D
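Parallel analysis compares the eigenvalues of the observed correlation matrix with those obtained from random data of the same size, and keeps the leading ones that exceed the random cutoff. The sketch below is a simplified version of Horn's procedure based on the full correlation matrix; the tool's exact procedure is not described here, and the function name `parallel_analysis`, the number of simulations, the percentile, and the simulated data are all assumptions for illustration.

```python
import numpy as np


def parallel_analysis(data, n_sim=200, percentile=95, seed=0):
    """Suggest how many components or factors to keep (simplified sketch)."""
    X = np.asarray(data, dtype=float)
    n, p = X.shape
    # Observed eigenvalues, largest first
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    rng = np.random.default_rng(seed)
    simulated = np.empty((n_sim, p))
    for s in range(n_sim):
        # Uncorrelated normal noise with the same shape as the data
        noise = rng.normal(size=(n, p))
        simulated[s] = np.sort(
            np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))
        )[::-1]
    cutoff = np.percentile(simulated, percentile, axis=0)
    # Count the leading eigenvalues that exceed the random-data cutoff
    n_keep = 0
    for obs, cut in zip(observed, cutoff):
        if obs <= cut:
            break
        n_keep += 1
    return n_keep, observed, cutoff


# Simulated data: 6 variables driven by 2 latent factors
rng = np.random.default_rng(1)
f = rng.normal(size=(200, 2))
data = np.hstack(
    [f[:, [0]] + 0.3 * rng.normal(size=(200, 1)) for _ in range(3)]
    + [f[:, [1]] + 0.3 * rng.normal(size=(200, 1)) for _ in range(3)]
)
n_keep, observed, cutoff = parallel_analysis(data)
print("suggested number:", n_keep)
```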

• All the data for analysis are numeric
• The sample size is larger than the number of independent variables; that is, the number of rows is greater than the number of columns

#### Output 1. Data Explores

Part of Data

Please edit data in the Data tab

#### Output 2. Model Results

Explanations
• This plot graphs the relations between the factors and the variables
• The results in the window show the statistical test of whether the number of factors is sufficient.

Explanations
• This plot graphs the relation between two factors. You can use the score plot to assess the data structure and detect clusters, outliers, and trends
• Groupings of data on the plot may indicate two or more separate distributions in the data
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero

2. When A >= 2, choose 2 factors to show the factor and loading 2D plot

In the plot of ML1 and ML2, we can see some outliers, for example, points 169 and 208. These points can be removed in the Data tab. If we choose type as the grouping variable and add group circles based on Euclidean distance, we can see that group B is somewhat different. Not all groups have circles because some groups contain too few points.

Explanations
• This plot shows the contributions of the variables to the factors (choose PC in the left panel)
• Red indicates negative and blue indicates positive effects
• Use the proportion of variance (in the variance table) to determine the amount of variance that the factors explain.
• For descriptive purposes, you may need only 80% (0.8) of the variance explained.
• If you want to perform other analyses on the data, you may want to have at least 90% of the variance explained by the factors.

Variance table

Explanations
• This plot (a biplot) overlays the factors and the loadings (choose PC in the left panel)
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
• Loadings identify which variables have the largest effect on each factor
• Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the factor. Loadings close to 0 indicate that the variable has a weak influence on the factor.

When A >= 2, choose 2 factors to show the factor and loading 2D plot

After removing points 169 and 208, we can see that chem2 has a comparatively strong relation to ML2.

Explanations
• This is the 3D extension of the 2D plot. This plot overlays the factors and the loadings for 3 factors (choose PCs and the length of lines in the left panel)
• We can find the outliers in the plot.
• If the data follow a normal distribution and no outliers are present, the points are randomly distributed around zero
• Loadings identify which variables have the largest effect on each factor
• Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the factor. Loadings close to 0 indicate that the variable has a weak influence on the factor.

This plot may take some time to load the first time

When A >= 3, choose 3 factors to show the factor and loading 3D plot

By default, the first 3 factors are shown in the 3D plot

Trace legend