- To upload a data file, preview the data set, and check that the data were read in correctly
- To pre-process some variables (when necessary) for building the model
- To calculate the basic descriptive statistics and draw plots of the variables

- Your data need to include **one dependent variable (denoted as Y)** and **at least one independent variable (denoted as X)**
- Your data need to have more rows than columns
- Do not mix characters and numbers in the same column
- The data used to build a model is called a **training set**
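The two structural requirements above (more rows than columns, no mixing of characters and numbers within a column) can be checked before uploading. The following is a minimal sketch, not the application's own validation; the function names `is_number` and `check_table` and the dictionary-of-columns layout are assumptions made for illustration.

```python
def is_number(cell):
    """Return True if a raw cell value can be read as a number."""
    try:
        float(cell)
        return True
    except (ValueError, TypeError):
        return False


def check_table(columns):
    """Check a table given as {column name: list of raw cell values}.

    Returns a list of human-readable problems (empty if none were found).
    """
    problems = []
    n_cols = len(columns)
    n_rows = max((len(values) for values in columns.values()), default=0)
    if n_rows <= n_cols:
        problems.append(
            f"need more rows than columns (rows={n_rows}, columns={n_cols})"
        )
    for name, values in columns.items():
        kinds = {"number" if is_number(v) else "text" for v in values}
        if len(kinds) > 1:
            problems.append(f"column {name!r} mixes numbers and text")
    return problems


# Hypothetical example data: X1 contains the text "three" among numbers.
data = {"Y": ["1.2", "2.3", "3.1", "4.8"], "X1": ["1", "2", "three", "4"]}
print(check_table(data))
```

A purely categorical column (all text) passes this check; only a column that contains both kinds of values is reported.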

**Data Preview**

**Variable types**

**1. For numeric variable**

**2. For categorical variable**

**Linear fitting plot**: to roughly show the linear relation between any two numeric variables.
The grey area is the 95% confidence interval.

**3. Change the labels of X and Y axes**

**Histogram**: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

**Density plot**: to show the distribution of a variable

**Histogram**

When the number of bins is set to 0, the plot uses the default number of bins.
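As an illustration of what a histogram computes, the sketch below counts observations in equal-width bins. The text does not state which default the application uses when the number of bins is 0; the fallback here (Sturges' rule) is only an assumption for the example, and `histogram_counts` is a hypothetical helper name.

```python
import math


def histogram_counts(values, bins=0):
    """Count observations falling into equal-width bins.

    bins <= 0 selects a default; Sturges' rule is ASSUMED here and may
    differ from the application's actual default.
    """
    if bins <= 0:
        bins = math.ceil(math.log2(len(values))) + 1
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against all-equal values
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(histogram_counts(values, bins=4))
print(histogram_counts(values))  # default number of bins
```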

**Density plot**

- To build a simple or multiple linear regression model
- To obtain the regression estimates, including (1) coefficient estimates with t test, p value, and 95% CI, (2) $R^2$ and adjusted $R^2$, and (3) the F-test for overall significance of the regression
- To obtain additional information: (1) predicted dependent variable and residuals, (2) ANOVA table of the model, (3) AIC-based variable selection, and (4) diagnostic plots based on the residuals and the predicted dependent variable
- To upload new data and get predictions
- To evaluate the predictions when the new data contain the dependent variable
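For the simplest case of one independent variable, the fit-then-predict workflow listed above can be sketched with ordinary least squares. This is an illustrative sketch under that single-variable assumption, not the application's implementation; `fit_simple_ols` and `predict` are hypothetical names.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, new_x):
    """Predicted dependent variable for new values of X."""
    return [intercept + slope * xi for xi in new_x]


# Training set (made-up numbers lying exactly on y = 1 + 2x).
train_x = [1, 2, 3, 4]
train_y = [3, 5, 7, 9]
intercept, slope = fit_simple_ols(train_x, train_y)

# New data (test set) must supply the same independent variable.
fitted = predict(intercept, slope, train_x)
residuals = [yi - fi for yi, fi in zip(train_y, fitted)]
print(intercept, slope, predict(intercept, slope, [5, 6]))
```

With several independent variables the same idea applies, but the coefficients are obtained by solving the normal equations rather than from this closed form.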

- The dependent variable is real-valued, continuous, and assumed to follow an underlying normal distribution.
- Please prepare the training set data in the previous **Data** tab
- New data (test set) should cover all the independent variables used in the model.

Please edit the data in the Data tab

- The values for each variable are: estimated coefficient (95% confidence interval), T statistic (t = ), and P value (p = ) for the significance of each variable
- For the t test of each variable, P < 0.05 indicates that the variable is statistically significant in the model
- Observations shows the number of samples
- R2 ($R^2$) is a goodness-of-fit measure for linear regression models and indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. Suppose R2 = 0.49. This result implies that 49% of the variability of the dependent variable has been accounted for, and the remaining 51% is still unaccounted for.
- Adjusted R2 (adjusted $R^2$) is used to compare the goodness-of-fit of regression models that contain different numbers of independent variables.
- F statistic (F-test for overall significance of the regression) tests multiple coefficients jointly. $F = \dfrac{R^2/(k-1)}{(1-R^2)/(n-k)}$, where $n$ is the sample size and $k$ is the number of variables plus the constant term
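The quantities above can be worked through numerically. The sketch below computes $R^2$ from observed and fitted values, then the adjusted $R^2$ and the overall F statistic, using $n$ for the sample size and $k$ for the number of variables plus the constant term as in the text. The adjusted $R^2$ formula $1 - (1 - R^2)(n-1)/(n-k)$ is the standard definition and is not spelled out in the text; the function names and the example values $n = 30$, $k = 3$ are illustrative assumptions.

```python
def r_squared(y, fitted):
    """Share of the variance in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_residual / ss_total


def adjusted_r_squared(r2, n, k):
    """Adjusted R^2; k counts the variables plus the constant term."""
    return 1 - (1 - r2) * (n - 1) / (n - k)


def f_statistic(r2, n, k):
    """Overall F statistic: (R^2 / (k - 1)) / ((1 - R^2) / (n - k))."""
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))


# The R2 = 0.49 example from the text, with assumed n = 30 and k = 3
# (two independent variables plus the constant term).
r2, n, k = 0.49, 30, 3
print(round(adjusted_r_squared(r2, n, k), 4))
print(round(f_statistic(r2, n, k), 4))
```

Turning the F statistic into a p value requires the F distribution with $(k-1, n-k)$ degrees of freedom, which is outside this sketch.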

**Results**

- $DF_{\text{variable}} = 1$
- $DF_{\text{residual}}$ = [number of sample values] - [number of variables] - 1
- MS = SS/DF
- $F = MS_{\text{variable}} / MS_{\text{residual}}$
- P Value < 0.05: the variable is significant to the model.
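The arithmetic behind one row of the ANOVA table follows directly from the definitions above. This is a small sketch with made-up sums of squares; `anova_row` is a hypothetical helper, and the p value is omitted because it needs the F distribution.

```python
def anova_row(ss_variable, ss_residual, n_samples, n_variables):
    """Degrees of freedom, mean squares, and F for one variable."""
    df_variable = 1
    df_residual = n_samples - n_variables - 1
    ms_variable = ss_variable / df_variable   # MS = SS / DF
    ms_residual = ss_residual / df_residual
    return {
        "DF_variable": df_variable,
        "DF_residual": df_residual,
        "MS_variable": ms_variable,
        "MS_residual": ms_residual,
        "F": ms_variable / ms_residual,
    }


# Made-up example: 33 samples, 2 variables in the model.
row = anova_row(ss_variable=40.0, ss_residual=60.0, n_samples=33, n_variables=2)
print(row)
```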

**ANOVA Table**

- The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
- Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.
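Ranking candidate models by AIC can be sketched as follows. For least-squares fits, one common form of the criterion (up to an additive constant shared by all models on the same data) is $n \ln(\mathrm{RSS}/n) + 2k$, where RSS is the residual sum of squares and $k$ the number of estimated coefficients. Whether the application uses exactly this form is an assumption; the candidate formulas and RSS values below are invented for illustration.

```python
import math


def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit, up to a constant: n*ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k


def rank_by_aic(candidates, n):
    """Sort {model name: (rss, k)} from lowest (best) to highest AIC."""
    scored = [
        (aic_least_squares(rss, n, k), name)
        for name, (rss, k) in candidates.items()
    ]
    return sorted(scored)


# Hypothetical candidates fitted to the same n = 50 observations.
candidates = {
    "Y ~ X1": (120.0, 2),
    "Y ~ X1 + X2": (90.0, 3),
    "Y ~ X1 + X2 + X3": (89.5, 4),
}
for aic, name in rank_by_aic(candidates, n=50):
    print(f"{name}: AIC = {aic:.2f}")
```

Adding X3 barely reduces the RSS, so the penalty of 2 per extra coefficient outweighs the gain and the two-variable model ranks first.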

**Model selection suggested by AIC**

- The Q-Q normal plot of residuals checks the normality of the residuals. If the points lie close to a straight line, the residuals are approximately normally distributed.
- The residuals vs fitting plot helps to identify outliers
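The coordinates of a Q-Q normal plot can be computed by pairing the sorted residuals with standard normal quantiles. The sketch below uses the plotting positions $(i - 0.5)/n$, which is one common convention and an assumption here; `qq_points` is a hypothetical helper name and the residuals are made up.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each sorted residual with a standard normal quantile.

    Plotting positions (i - 0.5) / n are ASSUMED; other conventions exist.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()
    theoretical = [
        standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)
    ]
    return list(zip(theoretical, ordered))


# Made-up residuals from a fitted model.
residuals = [0.3, -1.1, 0.8, -0.2, 0.1]
for quantile, residual in qq_points(residuals):
    print(f"{quantile:+.3f}  {residual:+.3f}")
```

Plotting the second column against the first gives the Q-Q plot; points far from the line through the bulk of the data are candidates for closer inspection.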

**1. Q-Q normal plot of residuals**

**2. Residuals vs Fitting plot**

- The 3D scatter plot shows the relation between the dependent variable (Y) and two independent variables (X1, X2)
- The group variable splits the points into groups

The predicted dependent variable is shown in the first column

**Prediction vs True Dependent Variable Plot**

This plot is shown when a new dependent variable is provided in the test data.

This plot shows the relation between the predicted dependent variable and the new dependent variable, using a linear smooth. The grey area is the confidence interval.