Data Preparation

Logistic regression is used to model the probability of a particular class or event with a binary outcome, such as pass/fail, win/lose, alive/dead, or healthy/sick. It uses a logistic function to model a binary dependent variable.
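
As a minimal sketch of the logistic function mentioned above, the following Python snippet maps a linear predictor to a probability. The function name `logistic` and the coefficients `b0` and `b1` are illustrative assumptions, not values taken from this tool.

```python
import math

def logistic(z):
    """Map a linear predictor z = b0 + b1*x to a probability between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Algebraically equivalent form that avoids overflow for large negative z
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients, for illustration only
b0, b1 = -1.0, 0.5
p = logistic(b0 + b1 * 2.0)  # the linear predictor is 0 here, so p is 0.5
```

A linear predictor of 0 corresponds to a probability of 0.5; larger values push the probability toward 1 and smaller values toward 0.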

1. Functionalities

To upload data files, preview the data set, and check that the data were read in correctly
To pre-process variables (when necessary) before building the model
To obtain basic descriptive statistics and plots of the variables

2. About your data (training set)

Your data need to include one binary dependent variable (denoted as Y) and at least one independent variable (denoted as X)
Your data need to have more rows than columns
Do not mix characters and numbers in the same column
The data used to build a model are called the training set
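
The requirements above can be checked before uploading. The sketch below is one possible way to do so in Python; the helper names `is_number` and `check_training_rows` and the tiny example rows are hypothetical and are not part of this tool.

```python
def is_number(text):
    """Return True if the string can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def check_training_rows(rows, y_name):
    """rows: list of dicts, one per observation. Returns a list of problems found."""
    if not rows:
        return ["no data rows"]
    problems = []
    columns = list(rows[0].keys())
    if len(rows) <= len(columns):
        problems.append("need more rows than columns")
    if len({r[y_name] for r in rows}) != 2:
        problems.append("dependent variable must have exactly two values")
    for col in columns:
        # A column is consistent if all its values are numbers or none are
        kinds = {is_number(r[col]) for r in rows}
        if len(kinds) > 1:
            problems.append(f"column {col!r} mixes numbers and characters")
    return problems

# Hypothetical rows, for illustration only
good = [{"Y": "B", "X1": "1.2"}, {"Y": "M", "X1": "3.4"}, {"Y": "B", "X1": "2.2"}]
bad = [{"Y": "B", "X1": "1.2"}, {"Y": "M", "X1": "high"}]
good_problems = check_training_rows(good, "Y")
bad_problems = check_training_rows(bad, "Y")
```

Here `good` passes all checks, while `bad` is flagged both for having too few rows and for mixing numbers and characters in `X1`.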

Case Example

Suppose we want to explore the Breast Cancer dataset and develop a model that classifies suspected cells as Benign (B) or Malignant (M). The dependent variable is a binary outcome (B/M). We are interested in (1) building a model to calculate the probability of benign or malignant and to determine whether the patient's cells are benign or malignant, and (2) finding the relations between the binary dependent variable and the other variables, that is, finding which variables contribute significantly to the dependent variable.

Please follow the Steps; the Outputs give analytical results in real time. Once the data are ready, build the model in the following tabs.

Training Set Preparation

Example data
Upload Data

Use example data

Uploaded data will replace the example data

Please follow the format of the example data when uploading new data

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

Choosing the correct separator and quote character ensures that the data are read in successfully
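
To illustrate why the separator and quote settings matter, here is a minimal Python sketch that reads delimited text with the standard `csv` module. The function `read_table`, its parameters, and the sample text are assumptions made for illustration; they do not describe how this tool reads files internally.

```python
import csv
import io

def read_table(text, sep=",", quote='"', header=True, row_names=True):
    """Parse delimited text into (column names, row names, data rows)."""
    options = {"delimiter": sep}
    if quote is None:
        options["quoting"] = csv.QUOTE_NONE  # treat quote marks as ordinary characters
    else:
        options["quotechar"] = quote
    rows = [r for r in csv.reader(io.StringIO(text), **options) if r]
    columns = rows.pop(0) if header else None
    names = None
    if row_names:
        names = [r[0] for r in rows]
        if len(set(names)) != len(names):
            raise ValueError("row names must not contain duplicates")
        rows = [r[1:] for r in rows]
        if columns is not None:
            columns = columns[1:]
    return columns, names, rows

# Hypothetical semicolon-separated sample with double-quoted characters
text = 'id;diagnosis;radius\n"p1";"B";12.3\n"p2";"M";20.1\n'
columns, names, rows = read_table(text, sep=";")
wrong_columns, _, _ = read_table(text, sep=",")  # wrong separator
```

With the correct separator the sample splits into two data columns; with the wrong one each line is read as a single field and no data columns remain.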

Find some example data here

Change the types of some variables?

Change the reference level for categorical variables?

2. Input the reference levels, one line per variable

Output 1. Data Information

Data Preview

Variable types

Output 2. Descriptive Results

Basic Descriptives
Logit Plot
Histogram and Density Plot

1. For numeric variable

2. For categorical variable

Logit plot: to roughly show the relation between any two numeric variables.

3. Change the labels of X and Y axes

Histogram: to roughly show the probability distribution of a variable by depicting the frequencies of observations occurring in certain ranges of values.

Density plot: to show the distribution of a variable

Histogram

The number of bins in the histogram

When the number of bins is 0, the plot uses the default number of bins
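
As a rough sketch of how a histogram groups observations into bins, the Python function below counts values in equal-width ranges. The name `histogram_counts` is hypothetical, and the use of Sturges' rule when the number of bins is 0 is an assumption; the default used by this tool is not stated here.

```python
import math

def histogram_counts(values, bins=0):
    """Count observations in equal-width bins between min(values) and max(values)."""
    if bins == 0:
        # Assumed default (Sturges' rule); the tool's actual default may differ
        bins = math.ceil(math.log2(len(values))) + 1
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts

# Hypothetical values, for illustration only
values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
four_bins = histogram_counts(values, bins=4)
default_bins = histogram_counts(values)
```

Fewer bins give a coarser picture of the distribution, and more bins give a finer but noisier one.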

Density plot

Logistic Regression

1. Functionalities

To build a simple or multiple logistic regression model
To obtain the regression estimates, including (1) coefficient estimates with t test, p value, and 95% CI, (2) R² and adjusted R², and (3) F-test for the overall significance of the regression
To obtain additional information: (1) predicted dependent variable and residuals, (2) AIC-based variable selection, (3) ROC plot, and (4) sensitivity and specificity table for the ROC plot
To upload new data and obtain predictions
To evaluate the predictions on new data that contain the dependent variable

2. About your data

The dependent variable is binary
Please prepare the training set data in the previous Data tab
New data (test set) should cover all the independent variables used in the model.

Please follow the Steps to build the model, and click Outputs to get analytical results.

Prepare the Model

Prepare the data in the previous tab

Step 1. Choose variables to build the model

3. (Optional) Keep or remove intercept / constant term

Remove intercept / constant term

Keep intercept / constant term

Step 2. Check the model

Valid model example: Y ~ X1 + X2

'-1' in the formula indicates that the intercept / constant term has been removed
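
The formula notation above can be sketched as simple string building. The helper `build_formula` below is hypothetical and only illustrates how the chosen variables and the intercept option combine into a formula string.

```python
def build_formula(y, xs, intercept=True):
    """Build a formula string such as 'Y ~ X1 + X2' from variable names."""
    if not xs:
        raise ValueError("at least one independent variable is required")
    formula = f"{y} ~ " + " + ".join(xs)
    if not intercept:
        formula += " - 1"  # '- 1' marks the removed intercept / constant term
    return formula

with_intercept = build_formula("Y", ["X1", "X2"])
without_intercept = build_formula("Y", ["X1", "X2"], intercept=False)
```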

Step 3. If data and model are ready, click the blue button to generate model results.

Output 1. Data Preview

Variables Information
Part of Data

Check full data in Data tab

Output 2. Model Results

Model Estimation
Data Fitting
AIC-based Selection
ROC Plot

Explanations

The output on the left shows the estimated coefficients with 95% confidence intervals, the T statistic (t = ) for the significance of each variable, and the P value (p = )
The output on the right shows the odds ratios = exp(b) and the standard errors of the original coefficients
A T test is performed for each variable; P < 0.05 indicates that the variable is statistically significant in the model
Observations is the number of samples
Akaike Inf. Crit. = AIC = -2 (log likelihood) + 2k, where k is the number of variables plus the constant
To obtain the odds ratio estimates, apply exp() to the estimated coefficients and to the bounds of their 95% confidence intervals; the T statistics and P values stay the same.
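
As a small worked example of the exp() transformation described above, the Python sketch below converts a coefficient and its standard error into an odds ratio with an interval. The function name, the example numbers, and the use of a normal-approximation interval (b ± 1.96 × SE) are assumptions; the tool may compute its confidence intervals differently.

```python
import math

def odds_ratio_with_ci(b, se, z=1.96):
    """Return (odds ratio, lower bound, upper bound) from a log-odds coefficient.

    Assumes an approximate 95% interval of b +/- z * se on the log-odds scale.
    """
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# Hypothetical coefficient and standard error, for illustration only
odds_ratio, lower, upper = odds_ratio_with_ci(0.7, 0.2)
```

An odds ratio above 1 means the odds of the outcome increase as the variable increases; an interval that excludes 1 corresponds to a coefficient interval that excludes 0.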

Save into CSV
Save LaTeX codes

Explanations

The Akaike Information Criterion (AIC) is used to perform stepwise model selection.
Model fits are ranked according to their AIC values, and the model with the lowest AIC value is sometimes considered the 'best'.
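
The ranking idea can be sketched in a few lines of Python using AIC = -2 (log likelihood) + 2k. The candidate formulas and log-likelihood values below are made up for illustration and are not results from this tool; a full stepwise search would also add and drop variables iteratively, which this sketch does not do.

```python
def aic(log_likelihood, k):
    """AIC = -2 * log likelihood + 2 * k, where k counts variables plus the constant."""
    return -2.0 * log_likelihood + 2 * k

# Hypothetical candidates: formula -> (log likelihood, k)
candidates = {
    "Y ~ X1": (-60.2, 2),
    "Y ~ X1 + X2": (-55.0, 3),
    "Y ~ X1 + X2 + X3": (-54.8, 4),
}

# Rank the candidates from lowest to highest AIC
ranked = sorted(candidates, key=lambda name: aic(*candidates[name]))
best = ranked[0]
```

In this example the third variable barely improves the log likelihood, so the penalty of 2 per extra parameter makes the two-variable model rank first.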

Model selection suggested by AIC

Explanations

ROC curve: the receiver operating characteristic curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings
Sensitivity (also called the true positive rate) measures the proportion of actual positives that are correctly identified as such
Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such; FPR = 1 - specificity
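
The definitions above can be sketched as a sensitivity and specificity table computed at several thresholds. In the Python sketch below, the function name `roc_points`, the example labels and scores, and the convention that a score at or above the threshold is predicted positive are all assumptions made for illustration.

```python
def roc_points(y_true, scores, thresholds):
    """Return (threshold, sensitivity, specificity) tuples.

    y_true uses 1 for positive and 0 for negative; a score >= threshold
    is treated as a positive prediction (an assumed convention).
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    table = []
    for t in thresholds:
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
        tn = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 0)
        table.append((t, tp / pos, tn / neg))
    return table

# Hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
table = roc_points(y_true, scores, thresholds=[0.0, 0.3, 0.5])
```

Plotting sensitivity against 1 - specificity for each row of such a table traces out the ROC curve.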

Prediction

Prepare model first

Step 4. Test Set Preparation

Example data
Upload Data

Data: Breast Cancer

Data for prediction should cover all the variables in the model
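
A quick way to check this requirement before uploading is to compare the variable names in the model with the column names in the new data. The helper `missing_variables` and the column names below are hypothetical.

```python
def missing_variables(model_vars, new_data_columns):
    """Return the model variables that are absent from the new data, in model order."""
    available = set(new_data_columns)
    return [v for v in model_vars if v not in available]

# Hypothetical variable and column names, for illustration only
model_vars = ["X1", "X2", "X3"]
missing = missing_variables(model_vars, ["X1", "X3", "X4"])
complete = missing_variables(model_vars, ["X3", "X2", "X1", "Y"])
```

An empty result means the new data cover every variable used in the model; extra columns in the new data are not a problem for this check.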

1. Choose CSV/TXT file

Browse...

2. Show 1st row as column names?

Yes

3. Use 1st column as row names? (No duplicates)

Yes

4. Which separator for data?

Comma (,): CSV often uses this

One Tab (->|): TXT often uses this

Semicolon (;)

One Space (_)

5. Which quote for characters?

None

Double Quote

Single Quote

Choosing the correct separator and quote character ensures that the data are read in successfully

Find some example data here

If the model and new data are ready, click the blue button to generate prediction results.

Output 3. Prediction Results

Prediction
ROC Evaluation

The predicted dependent variable is shown in the 1st column

This plot is shown when the dependent variable is provided in the test data.

This plot shows the ROC curve comparing predicted values with true values, based on new data that were not used to build the model.

Sensitivity and specificity table