Data Preparation

Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent (explanatory) variables. With one explanatory variable, the method is called simple linear regression. With more than one explanatory variable, it is called multiple linear regression.

1. Functionalities

To upload a data file, preview the data set, and check that the data were read in correctly
To pre-process some variables (when necessary) for building the model
To calculate the basic descriptive statistics and draw plots of the variables

2. About your data (training set)

Your data need to include one dependent variable (denoted as Y) and at least one independent variable (denoted as X)
Your data need to have more rows than columns
Do not mix characters and numbers in the same column
The data used to build a model is called a training set
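
The requirements above can be checked before uploading. The following is a minimal sketch using only the Python standard library; the function name check_training_table and the sample tables are hypothetical and are not part of the app.

```python
import csv
import io


def check_training_table(text, sep=","):
    """Return a list of problems found in a delimited table with a header row."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=sep) if r]
    header, body = rows[0], rows[1:]
    problems = []
    if len(header) < 2:
        problems.append("need Y and at least one X column")
    if len(body) <= len(header):
        problems.append("need more rows than columns")
    for j, name in enumerate(header):
        kinds = set()
        for row in body:
            try:
                float(row[j])
                kinds.add("number")
            except ValueError:
                kinds.add("character")
        if len(kinds) > 1:
            problems.append(f"column {name!r} mixes characters and numbers")
    return problems


# Hypothetical sample tables: the second has a text value in a numeric column.
good = "Y,X\n3.1,2\n3.4,3\n3.9,5\n"
bad = "Y,X\n3.1,2\nhigh,3\n3.9,5\n"
print(check_training_table(good))
print(check_training_table(bad))
```

An empty list means that none of the three checks failed; it does not guarantee that the data are suitable for regression.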

Case Example

Suppose that in one study, doctors recorded the birth weight of 10 infants, together with age (months), age group (a: age < 4 months; b: otherwise), and SBP. We want (1) to predict the birth weight of infants, and (2) to find the relations between birth weight and the other variables, that is, to find out which variables contribute significantly to the dependent variable.

Please follow the Steps, and Outputs will give real-time analytical results. After getting the data ready, please build the model in the next tab.

Training Set Preparation

Example data
Upload Data

Use example data

Uploaded data will replace the example data

Please refer to the example data format to upload new data

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

The correct separator and quote character ensure that the data are read in successfully
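
To illustrate why the separator and quote settings matter, here is a small sketch that parses the same text with and without a quote character. It uses Python's csv module rather than whatever reader the app uses internally; read_table and its parameters are hypothetical names that mirror the upload options.

```python
import csv
import io


def read_table(text, sep=",", quote=None, header=True, row_names=True):
    """Parse delimited text using the upload options as parameters."""
    options = {"delimiter": sep}
    if quote is None:
        options["quoting"] = csv.QUOTE_NONE  # treat quote marks as ordinary text
    else:
        options["quotechar"] = quote
    rows = [r for r in csv.reader(io.StringIO(text), **options) if r]
    columns = rows.pop(0) if header else [f"V{i + 1}" for i in range(len(rows[0]))]
    names = None
    if row_names:
        names = [r[0] for r in rows]
        if len(set(names)) != len(names):
            raise ValueError("row names must not contain duplicates")
        columns, rows = columns[1:], [r[1:] for r in rows]
    return columns, names, rows


# Hypothetical semicolon-separated text with double-quoted character values.
text = 'id;group;note\nr1;a;"x;y"\nr2;b;"z"\n'
columns, names, rows = read_table(text, sep=";", quote='"')
_, _, unquoted_rows = read_table(text, sep=";", quote=None)
print(columns, names, rows)
print(unquoted_rows[0])  # the quoted field is split at the inner semicolon
```

With the wrong quote setting, the first data row gains an extra field, which is the kind of misalignment the Data Preview output helps to catch.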

Find some example data here

Transform the data?

Change the types of some variables?

Change the reference level for categorical variables?

2. Input the reference level, one line per variable
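
The effect of the reference level can be sketched with indicator (dummy) coding, in which every level except the reference gets its own 0/1 column. This assumes the model uses this common coding scheme; the source does not state which coding is applied, and dummy_code is a hypothetical helper.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator list per non-reference level."""
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError(f"{reference!r} is not a level of this variable")
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Hypothetical age-group values using the levels from the case example.
age_group = ["a", "a", "b", "b", "a"]
print(dummy_code(age_group, reference="a"))
print(dummy_code(age_group, reference="b"))
```

Under this coding, the coefficient of a categorical variable is read as the difference from the reference level, so changing the reference changes the interpretation but not the fitted values.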

Output 1. Data Information

Data Preview

Variable types

Output 2. Descriptive Results

Basic Descriptives
Linear Fitting Plot
Histogram and Density Plot

1. For numeric variable

2. For categorical variable

Linear fitting plot: roughly shows the linear relation between any two numeric variables. The grey area is the 95% confidence interval.

3. Change the labels of X and Y axes

Histogram: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

Density plot: to show the distribution of a variable

Histogram

The number of bins in the histogram

When the number of bins is 0, the plot uses the default number of bins
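
As a rough illustration of how the bin count changes a histogram, the sketch below counts observations per equal-width bin. The source does not say what the default number of bins is, so Sturges' rule is used here purely as a stand-in assumption; histogram_counts and the sample values are hypothetical.

```python
import math


def histogram_counts(values, bins=0):
    """Count observations in equal-width bins; bins=0 picks a default."""
    if bins <= 0:
        # Assumed default (Sturges' rule); the app may use a different rule.
        bins = math.ceil(math.log2(len(values))) + 1
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
print(histogram_counts(values, bins=4))
print(histogram_counts(values))
```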

Density plot

Linear Regression

1. Functionalities

Build the model

To build a simple or multiple linear regression model
To obtain the regression estimates, including (1) coefficient estimates with t test, p value, and 95% CI, (2) R² and adjusted R², and (3) F test for overall significance in regression
To obtain additional information: (1) predicted dependent variable and residuals, (2) ANOVA table of the model, (3) AIC-based variable selection, and (4) diagnostic plots based on the residuals and the predicted dependent variable
To upload new data and get the prediction
To evaluate the prediction when the new data contain the dependent variable

2. About your data (training set)

The dependent variable is real-valued and continuous, and is assumed to follow an underlying normal distribution.
Please prepare the training set data in the previous Data tab
New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.

Build the Model

Prepare the data in the previous tab

Step 1. Choose variables to build the model

3. (Optional) Keep or remove intercept / constant term

Remove intercept / constant term

Keep intercept / constant term

Step 2. Check the model and generate results

Valid model example: Y ~ X1 + X2

'-1' in the formula indicates that the intercept/constant term has been removed
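
To show what removing the constant term does, here is a minimal least-squares sketch for a single independent variable, fitted with and without an intercept. The function fit_line and the data are hypothetical; this is the textbook closed form, not the app's implementation.

```python
def fit_line(x, y, intercept=True):
    """Least-squares fit of y on one x; returns (constant, slope)."""
    n = len(x)
    if intercept:
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        slope = sxy / sxx
        return my - slope * mx, slope
    # Without a constant term the line is forced through the origin.
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return 0.0, slope


x = [1, 2, 3, 4]
y = [3, 5, 7, 9]  # generated as y = 1 + 2x
print(fit_line(x, y, intercept=True))   # corresponds to Y ~ X1
print(fit_line(x, y, intercept=False))  # corresponds to Y ~ X1 - 1
```

Because the data have a non-zero constant, dropping the intercept changes the slope estimate, which is why the option should be used only when a line through the origin is justified.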

Step 3. If data and model are ready, click the blue button to generate model results.

Output 1. Data Preview

Variables Information
Part of Data

Please edit data in Data tab

Output 2. Model Results

Model Estimation
Data Fitting
ANOVA
AIC-based Selection
Diagnostics Plot
3D Scatter Plot

Explanations

For each variable, the table reports the estimated coefficient (95% confidence interval), the t statistic (t = ), and the p value (p = ) for the significance of that variable
A t test with P < 0.05 indicates that the variable is statistically significant in the model
Observations is the number of samples
R2 (R²) is a goodness-of-fit measure for linear regression models and indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. Suppose R2 = 0.49. This result implies that 49% of the variability of the dependent variable has been accounted for, and the remaining 51% is still unaccounted for.
Adjusted R2 (adjusted R²) is used to compare the goodness-of-fit of regression models that contain different numbers of independent variables.
F statistic (F test for overall significance in regression) tests all coefficients jointly. F = (R² / (k - 1)) / ((1 - R²) / (n - k)), where n is the sample size and k is the number of variables plus the constant term
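
The F statistic and adjusted R² can be worked through numerically. The sketch below uses R2 = 0.49 from the explanation above; n = 10 and k = 3 (two variables plus the constant term) are assumed values chosen only for illustration, and the adjusted R² formula is the standard one, which the source does not spell out.

```python
def overall_f(r2, n, k):
    """F statistic for overall significance; k counts variables plus constant."""
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))


def adjusted_r2(r2, n, k):
    """Standard adjusted R-squared, penalizing the number of parameters."""
    return 1 - (1 - r2) * (n - 1) / (n - k)


r2, n, k = 0.49, 10, 3  # n and k are assumptions for this example
print(round(overall_f(r2, n, k), 3))
print(round(adjusted_r2(r2, n, k), 3))
```

The F statistic is compared with an F distribution with k - 1 and n - k degrees of freedom to obtain the p value of the overall test.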

Results

Save into CSV
Save LaTeX code

Explanations

DF_variable = 1
DF_residual = [number of sample values] - [number of variables] -1
MS = SS/DF
F = MS_variable / MS_residual
P Value < 0.05: the variable is significant to the model.
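
The ANOVA quantities above are simple arithmetic once the sums of squares are known. The sketch below applies the listed formulas to made-up sums of squares; the variable names, the numbers, and the helper anova_rows are all hypothetical.

```python
def anova_rows(ss_by_variable, ss_residual, n):
    """Build ANOVA rows with DF_variable = 1 and DF_residual = n - p - 1."""
    p = len(ss_by_variable)
    df_residual = n - p - 1
    ms_residual = ss_residual / df_residual
    table = {}
    for name, ss in ss_by_variable.items():
        ms = ss / 1  # MS = SS / DF, with DF_variable = 1
        table[name] = {"DF": 1, "SS": ss, "MS": ms, "F": ms / ms_residual}
    table["Residuals"] = {"DF": df_residual, "SS": ss_residual, "MS": ms_residual}
    return table


# Made-up sums of squares for two variables and 10 samples.
table = anova_rows({"age": 8.0, "SBP": 2.0}, ss_residual=7.0, n=10)
for name, row in table.items():
    print(name, row)
```

The p value for each F is then read from an F distribution with 1 and DF_residual degrees of freedom, which is omitted here because the standard library has no F distribution.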

ANOVA Table

Explanations

The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.
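
The ranking idea can be sketched with a common form of AIC for least-squares models, n·ln(RSS/n) + 2k, which drops an additive constant shared by all models fitted to the same data. Whether the app uses exactly this form is an assumption, and the residual sums of squares below are made up.

```python
import math


def aic_least_squares(rss, n, k):
    """AIC up to an additive constant; k is the number of estimated coefficients."""
    return n * math.log(rss / n) + 2 * k


n = 10
# Hypothetical candidates: (residual sum of squares, number of coefficients).
candidates = {"Y ~ X1": (12.0, 2), "Y ~ X1 + X2": (11.5, 3)}
scores = {f: aic_least_squares(rss, n, k) for f, (rss, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print(best)
```

In this example the second variable lowers the residual sum of squares only slightly, so the penalty of 2 per extra coefficient outweighs the gain and the smaller model ranks first.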

Model selection suggested by AIC

Save into TXT

Explanations

The Q-Q normal plot of residuals checks the normality of the residuals. Points that lie close to a straight line suggest that the residuals are approximately normally distributed.
The residuals vs fitting plot helps to identify outliers
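
The points of a Q-Q normal plot can be computed by pairing the sorted residuals with theoretical normal quantiles. This sketch uses the plotting positions (i + 0.5)/n, which is one common choice and not necessarily the one the app uses; qq_points and the residuals are hypothetical.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair standard normal quantiles with the sorted residuals."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Made-up residuals from a fitted model.
points = qq_points([0.3, -1.2, 0.1, 1.0, -0.2])
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

If the residuals are roughly normal, plotting the second value against the first gives points near a straight line.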

1. Q-Q normal plot of residuals

2. Residuals vs Fitting plot

Explanations

The 3D scatter plot shows the relation between the dependent variable (Y) and 2 independent variables (X1, X2)
The group variable splits the points into groups

Prediction

Prepare model first

Step 4. Test Set Preparation

Example data
Upload Data

Data: Birth Weight

Data for prediction should cover all the independent variables used in the model
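
A quick way to verify this requirement before uploading is to compare column names. The sketch below is illustrative only; missing_variables is a hypothetical helper, and the variable names are taken loosely from the case example.

```python
def missing_variables(model_variables, new_columns):
    """List the model's independent variables that the new data lack."""
    available = set(new_columns)
    return [v for v in model_variables if v not in available]


model_variables = ["age", "SBP"]  # independent variables assumed to be in the model
print(missing_variables(model_variables, ["age", "SBP", "weight"]))
print(missing_variables(model_variables, ["age"]))
```

Extra columns in the new data are harmless in this check; only missing independent variables prevent prediction.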

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

The correct separator and quote character ensure that the data are read in successfully

Find some example data here

Step 5. If the model and new data are ready, click the blue button to generate prediction results.

Output 3. Prediction Results

Prediction
Evaluation Plot

The predicted dependent variable is shown in the 1st column

Prediction vs True Dependent Variable Plot

This plot is shown when the test data include the dependent variable.

This plot shows the relation between the predicted dependent variable and the new (observed) dependent variable, using a linear smooth. The grey area is the confidence interval.
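
The agreement shown in the plot can also be summarized numerically. The source does not say that the app reports any such numbers, so the root mean squared error and the Pearson correlation below are offered only as common companions to the plot; evaluate and the values are hypothetical.

```python
import math


def evaluate(predicted, observed):
    """Return (RMSE, Pearson correlation) between predictions and observations."""
    n = len(observed)
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)
    mp, mo = sum(predicted) / n, sum(observed) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_o = sum((o - mo) ** 2 for o in observed)
    return rmse, cov / math.sqrt(var_p * var_o)


# Made-up predictions and observed values of the dependent variable.
predicted = [2.9, 3.4, 4.1]
observed = [3.0, 3.5, 4.0]
rmse, r = evaluate(predicted, observed)
print(round(rmse, 3), round(r, 3))
```

A small RMSE and a correlation close to 1 correspond to points that lie near the fitted line in the plot.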