Logistic Regression

Author

Biometrics Unit

Introduction

Logistic regression, also known as the logistic model or logit model, examines how multiple independent variables relate to a categorical dependent variable. It predicts the probability of an event happening by modeling data to fit a logistic curve.

Logistic regression is a supervised machine learning algorithm that accomplishes binary classification tasks by predicting the probability of an outcome, event, or observation. The outcome is binary or dichotomous, limited to two possible values: yes/no, 0/1, or true/false.

Logistic regression is a type of statistical model used to predict the probability of a certain event happening. It may be used, for example, to determine whether an email is spam, or to support diagnosis by predicting whether a disease is present or absent based on patients’ test results. It works by taking input variables and transforming them into a probability value between 0 and 1, where values near 0 represent a low probability and values near 1 represent a high probability.

For example, suppose a researcher wants to assess the poverty status of farmers in a certain locality using their annual income, classifying them as poor (1) or not poor (0) based on a chosen threshold. Logistic regression could then be used to determine the factors influencing poverty among the farmers, using predictor variables such as gender, age, farm size, household size, and years of farming experience.

The reason it is named “logistic” is that an S-shaped curve (sigmoid function) is produced when the input variables are transformed using a mathematical function known as the logistic function.
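As a quick illustration, the sketch below plots this curve using plogis(), base R's logistic (sigmoid) function; the axis labels are illustrative.

# The logistic (sigmoid) function maps any real-valued input to a probability in (0, 1)
curve(plogis(x), from = -6, to = 6,
      xlab = "Linear predictor", ylab = "Probability",
      main = "Logistic (sigmoid) curve")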

Advantages of Logistic Regression

Logistic regression has several advantages in machine learning. These are highlighted below.


  1. Easy to implement: Logistic regression is computationally efficient compared to many other machine learning methods, making it easier to implement, interpret, and train.

  2. Optimal for linearly separable data: Logistic regression is specifically designed for binary classification tasks, effectively categorizing data into two distinct groups when the data is linearly separable.

  3. Provides valuable insights: Logistic regression coefficients indicate both the strength and direction of the relationship between predictor variables and the outcome.

The Logistic Curve

Logistic regression is a technique used to model the relationship between a binary (dichotomous) response variable \(y\) and a numerical predictor variable \(x\). It fits a logistic curve, an S-shaped or sigmoid curve, to represent how \(y\) changes as \(x\) varies. This method is particularly useful when \(y\) represents binary outcomes coded as 0 (failure) or 1 (success); the same S-shaped curve also appears in other settings, such as models of population growth.

Logistic regression fits \(\alpha\) and \(\beta\), the regression coefficients. The logistic or logit function is used to transform the S-shaped curve into an approximately straight line and to change the range of the proportion from [0, 1] to \((-\infty, +\infty)\).

Logistic Curve


\[\text{logit}(P) = \ln(\text{odds}) = \ln\left(\frac{P}{1 - P}\right) = \alpha + \beta X\]
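For example, a success probability of \(P = 0.8\) corresponds to odds of \(0.8/0.2 = 4\) and a logit of \(\ln(4) \approx 1.39\), while \(P = 0.5\) gives odds of 1 and a logit of 0. In R, qlogis() and plogis() compute the logit and its inverse.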

Assumptions

  • Binary Outcome: The dependent variable should be binary in nature, i.e. it should take on one of two possible values, coded as 0 and 1, “success” and “failure”, or “yes” and “no”.

  • Independence of Errors: This implies that the error for each observation in the dataset should not be related to the error for any other observation. If violations of independence are detected, this may indicate the need to consider a different model or to account for correlation or clustering in the data using other methods, such as mixed-effects models.

  • Linearity of the Logit: The relationship between the independent variables and the log-odds of the outcome is assumed to be linear, i.e. the effect of each independent variable on the log-odds is constant across its range. Logistic regression can still accommodate non-linear relationships between the independent variables and the outcome probability, because the logit link itself is a non-linear transformation of the linear predictor.

  • Large Sample Size: A relatively large sample size is required to detect meaningful effects and to ensure stable estimates in logistic regression. A small sample can lead to overfitting, where the model captures noise rather than the underlying signal in the data, and to underpowered statistical tests that fail to detect real effects.

  • Outliers: The dataset is assumed to be free of extreme outliers and highly influential observations for the purposes of logistic regression. To deal with outliers, you can do any of the following:

    1. Replace the outliers with a mean or median value
    2. Eliminate the outliers
    3. Keep the outliers and maintain a record while reporting the regression results.
  • No Multicollinearity: The assumption of no or low multicollinearity among the independent variables is vital in logistic regression. When two or more explanatory variables are highly correlated with one another, the regression model cannot obtain unique or independent information from them; this phenomenon is known as multicollinearity. A high degree of correlation between variables can make the model difficult to fit and to interpret (a quick diagnostic check is sketched below).

The acronym BILLION gives a useful way to remember the six conditions that make up the logistic regression model.

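A simple way to screen for multicollinearity before fitting a model is to inspect pairwise correlations among the candidate predictors. The sketch below uses simulated, illustrative data; the car package's vif() function, mentioned in the final comment, is an optional, more formal check on a fitted model.

# Simulated predictors; x3 is constructed to be almost a copy of x1
set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$x3 <- df$x1 + rnorm(50, sd = 0.1)
round(cor(df), 2)   # correlations close to 1 or -1 flag potential multicollinearity
# For a fitted model, variance inflation factors give a more formal check,
# e.g. car::vif(fitted_model); values well above 5-10 are a common warning sign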

Comparison Between Linear Regression and Logistic Regression

The following points highlight the differences between linear and logistic regression.

Linear regression is not suitable for binary dependent variables for several reasons. Firstly, the distribution of the binary variable \(Y\) is not normal; it follows a Bernoulli distribution rather than the Gaussian distribution assumed by linear regression. Secondly, a linear model can produce predicted values outside the [0, 1] range, because the binary outcome on the left-hand side and the unbounded linear predictor on the right-hand side of the model are on fundamentally different scales.

The key difference between a logistic regression model and a linear regression model is that the outcome variable in logistic regression is binary (dichotomous).
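A quick way to see the problem is to fit both models to the same binary outcome and compare their fitted values. The sketch below uses simulated data; the variable names are illustrative.

# Simulate a binary outcome that depends on one predictor
set.seed(42)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.5 * x))
fit_lm  <- lm(y ~ x)                      # linear regression on a 0/1 outcome
fit_glm <- glm(y ~ x, family = binomial)  # logistic regression
range(fitted(fit_lm))                     # linear fitted values can fall outside [0, 1]
range(fitted(fit_glm))                    # logistic fitted values always lie in (0, 1)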

Models of Logistic Regression

There are three main types of logistic regression model:

  1. Binary Logistic Regression
  2. Multinomial Logistic Regression
  3. Ordinal Logistic Regression

Binary Logistic Regression

Binary logistic regression belongs to the broader category of statistical models known as generalized linear models. What sets binary logistic regression apart from other models within this category is its specific application to dependent variables that have two distinct levels.

As seen above, binary logistic regression is suitable when the dependent variable has two categories and the independent variables are either continuous or categorical. It is used when we are trying to predict a dependent variable with only two outcomes (a dichotomous variable), for example, yes and no.

Like linear regression, logistic regression estimates a coefficient for each parameter; in addition, it provides odds ratios, obtained by exponentiating the coefficients.

The odds ratio is \(e^{\beta}\).
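For example, a coefficient of \(\beta = 0.7\) corresponds to an odds ratio of \(e^{0.7} \approx 2.01\), meaning the odds of success roughly double for each one-unit increase in that predictor; in R this is obtained with exp(coef(model)).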

The binary logistic regression equation is shown below:

\[\log\frac{p}{1 - p} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k\]

Where \(\log\frac{p}{1 - p}\) is the log-odds (logit) of the outcome, with \(p\) the probability of success (the dependent variable),

\(\beta_0\) = constant or intercept term in the equation,

\(\beta_1, \dots, \beta_k\) = logistic regression coefficients of the predictor variables,

\(X_1, \dots, X_k\) = independent variables.

Example

A sugar cane farmer wants to visually select seedlings to plant in another location. The decision is to select (1) or reject (0) a seedling based on the number of stalks, height (m), stalk diameter, seedling cane yield (kg), and variety. Four of the explanatory variables are quantitative, one (variety) is qualitative, and the response variable is dichotomous, so logistic regression can be used for the analysis.


# load libraries
library(readxl)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
knitr::opts_chunk$set(echo = TRUE)
# read in data
dat <- read_excel("Sugar cane.xlsx", sheet = "Sugar cane")
head(dat) # print the first six rows
# A tibble: 6 × 6
  Choice Stalks Height Diameter  Cane Variety
   <dbl>  <dbl>  <dbl>    <dbl> <dbl>   <dbl>
1      1     23   2.37     1.7  12.4        2
2      0     11   2.25     1.68  5.49       3
3      0      9   2.5      1.93  6.59       2
4      1     25   2.4      2.12 21.2        2
5      1     20   2.5      1.7  11.4        1
6      0     12   1.9      1.51  4.08       1
dat <- dat %>% mutate(across(c(Choice, Variety), factor)) # convert Choice and Variety to factors

The Logistic Model


model_log <- glm(Choice ~ Cane + Diameter + Height, 
             data = dat, family = binomial)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_log)

Call:
glm(formula = Choice ~ Cane + Diameter + Height, family = binomial, 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -5.2139    15.5265  -0.336   0.7370  
Cane          2.3901     1.1340   2.108   0.0351 *
Diameter     -6.8924     4.9139  -1.403   0.1607  
Height        0.8381     6.6179   0.127   0.8992  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 40.381  on 29  degrees of freedom
Residual deviance: 11.797  on 26  degrees of freedom
AIC: 19.797

Number of Fisher Scoring iterations: 8

Formula: The model predicts Choice (a binary outcome) based on three predictors: Cane, Diameter, and Height.

Family: The model uses a binomial distribution, suitable for binary response variables.

Coefficients

Intercept: The estimated intercept is -5.2139. This represents the log-odds of the outcome (Choice = 1) when all predictors are zero. However, the very large standard error (15.5265) and p-value (0.7370) indicate that this estimate is not statistically significant and should not be over-interpreted.

Cane: The coefficient for Cane is 2.3901, with a p-value of 0.0351, which is statistically significant (denoted by *). This suggests that for each one-unit increase in Cane, the log-odds of choosing the outcome (Choice = 1) increases by approximately 2.39. In practical terms, this implies that Cane has a positive influence on the likelihood of the event happening.

Diameter: The coefficient for Diameter is -6.8924, with a p-value of 0.1607, indicating it is not statistically significant. A negative coefficient suggests that as Diameter increases, the log-odds of the outcome decrease, but we cannot conclude that this effect is meaningful due to the p-value being above the common threshold of 0.05.

Height: The coefficient for Height is 0.8381, but like Diameter, it has a very high p-value (0.8992), implying it is not statistically significant. Therefore, Height does not appear to have a meaningful impact on the outcome.

Model Fit Information

Null Deviance: 40.381, which measures the difference in the likelihood of the model with only an intercept compared to the saturated model.

Residual Deviance: 11.797, which is considerably lower than the null deviance, suggesting that the predictors explain a significant amount of variance in the response variable. The degrees of freedom for residuals (26) suggests there are enough observations after accounting for the parameters estimated.

AIC (Akaike Information Criterion): 19.797, a measure used for model comparison; lower values indicate a better fit when comparing multiple models.

Among the predictors, only Cane appears to have a statistically significant impact on whether a seedling is selected or rejected. In contrast, Diameter and Height do not appear to significantly affect the selection. The model fits the data better than a null model, as indicated by the reduction in deviance, but the overall significance and impact of some predictors are weak.
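As a quick follow-up (a sketch using the model_log object fitted above), the drop in deviance can be turned into a formal likelihood-ratio test of the model against the intercept-only model:

# Likelihood-ratio test: the difference in deviance is approximately chi-squared
# distributed with df equal to the number of estimated predictors (here 3)
lr_stat <- model_log$null.deviance - model_log$deviance
lr_df   <- model_log$df.null - model_log$df.residual
pchisq(lr_stat, df = lr_df, lower.tail = FALSE)  # small p-value: predictors improve fit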

Variable Importance


caret::varImp(model_log)
           Overall
Cane     2.1076503
Diameter 1.4026300
Height   0.1266381
exp(coef(model_log))
 (Intercept)         Cane     Diameter       Height 
 0.005440294 10.914966068  0.001015469  2.311918797 

cbind("Odds ratio" = round(exp(coef(model_log)),4), 
      "P-value" = round(coef(summary(model_log)),4)[,4], 
      round(exp(confint.default(model_log, level = 0.95)),4))
            Odds ratio P-value  2.5 %       97.5 %
(Intercept)     0.0054  0.7370 0.0000 8.949811e+10
Cane           10.9150  0.0351 1.1823 1.007648e+02
Diameter        0.0010  0.1607 0.0000 1.546670e+01
Height          2.3119  0.8992 0.0000 9.934237e+05

Cane has an odds ratio of 10.9150. This implies that for each unit increase in cane yield, the odds of a seedling being selected are approximately 10.92 times higher; the p-value of 0.0351 indicates that this effect is statistically significant (evidence that Cane has a meaningful effect on the outcome).

Diameter has an odds ratio of 0.0010, indicating that for each unit increase in diameter the estimated odds of selection decrease sharply (towards zero). However, the p-value of 0.1607 shows that this effect is not statistically significant.

Height has an odds ratio of 2.3119, implying that for each unit increase in height the estimated odds of selection are about 2.31 times higher, but the p-value of 0.8992 shows that this effect is not statistically significant either.

Hence, we can say that cane yield is a significant predictor of the choice of seedlings.
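To see how well the fitted probabilities separate the two groups, one option (a sketch using the dat and model_log objects created above, with an illustrative 0.5 cut-off) is to convert the probabilities into predicted classes and cross-tabulate them against the observed choices:

# Fitted probabilities of selection for each seedling
pred_prob  <- predict(model_log, type = "response")
# Classify as selected (1) if the probability exceeds 0.5, otherwise rejected (0)
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = levels(dat$Choice))
# Confusion table of predicted versus observed choices
table(Predicted = pred_class, Observed = dat$Choice)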

Multinomial Logistic Regression

Multinomial logistic regression (often just called “multinomial regression”) is used to predict a nominal dependent variable, i.e. one with more than two unordered categories, from one or more independent variables.

The equation for multinomial logistic regression is as shown below:

Let K be the number of possible outcomes.

Let P(Y=k|X) be the probability of outcome k given predictors X.

Let \(\beta_{kj}\) be the coefficient for predictor \(j\) in predicting outcome \(k\).

Then, the probability of outcome k is given by:

\[P(Y=k \mid X) = \frac{\exp(\beta_{k0} + \beta_{k1}X_1 + \dots + \beta_{kp}X_p)}{\sum_{j=1}^{K}\exp(\beta_{j0} + \beta_{j1}X_1 + \dots + \beta_{jp}X_p)}\]

\(\beta_{k0}\) is the intercept for outcome \(k\).

The summation in the denominator is over all possible outcomes j.

Interpretation

The formula calculates the probability of each possible outcome given the values of the predictor variables.

The coefficients (β values) represent the impact of the predictors on the log-odds of each outcome compared to a reference category.
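As a small numerical illustration of the formula, the sketch below uses made-up coefficients for three outcomes and a single predictor; outcome 3 is the reference, so its coefficients are fixed at zero.

# Hypothetical coefficients for K = 3 outcomes and one predictor x
x   <- 2
eta <- c("1" = 0.5 + 1.2 * x,   # linear predictor for outcome 1
         "2" = -0.3 + 0.4 * x,  # linear predictor for outcome 2
         "3" = 0)               # reference outcome
probs <- exp(eta) / sum(exp(eta))  # multinomial probabilities
round(probs, 3)                    # the three probabilities sum to 1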


Example

In this example, we predict the species of new flowers using a multinomial logistic regression model fitted on the iris dataset.

The VGAM (Vector Generalized Linear and Additive Models) package in R Programming Language provides a suite of functions for fitting a variety of regression models.


library(VGAM) 
Loading required package: stats4
Loading required package: splines

Attaching package: 'VGAM'
The following object is masked from 'package:caret':

    predictors
# Load and prepare the data 
data(iris) 
  
# Convert the species variable to a factor 
iris$Species <- as.factor(iris$Species) 

This loads the library and the data, then ensures that the variable Species is stored as a factor.


# Fit a multinomial logistic regression model 
suppressWarnings({
  fit <- vglm(Species ~ Sepal.Length  
            + Sepal.Width 
            + Petal.Length 
            + Petal.Width, 
            data = iris, 
            family = multinomial) 
})
  
# Print the model summary 
suppressWarnings({
  summary(fit)
})

Call:
vglm(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + 
    Petal.Width, family = multinomial, data = iris)

Coefficients: 
                Estimate Std. Error z value Pr(>|z|)  
(Intercept):1     35.490  22666.953      NA       NA  
(Intercept):2     42.638     25.708   1.659   0.0972 .
Sepal.Length:1     9.495   6729.217      NA       NA  
Sepal.Length:2     2.465      2.394   1.030   0.3032  
Sepal.Width:1     12.300   3143.611      NA       NA  
Sepal.Width:2      6.681      4.480   1.491   0.1359  
Petal.Length:1   -22.975   4799.227  -0.005   0.9962  
Petal.Length:2    -9.429      4.737      NA       NA  
Petal.Width:1    -33.843   7583.502      NA       NA  
Petal.Width:2    -18.286      9.743      NA       NA  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])

Residual deviance: 11.8985 on 290 degrees of freedom

Log-likelihood: -5.9493 on 290 degrees of freedom

Number of Fisher scoring iterations: 21 

Warning: Hauck-Donner effect detected in the following estimate(s):
'(Intercept):1', 'Sepal.Length:1', 'Sepal.Width:1', 'Petal.Length:2', 'Petal.Width:1', 'Petal.Width:2'


Reference group is level  3  of the response
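The very large standard errors and the Hauck-Donner warning arise because the iris species are almost perfectly separable on these measurements, so some coefficient estimates are unstable. The same model can also be fitted with multinom() from the nnet package, which provides a convenient predict() method for new observations: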
library(nnet) 
data(iris) 
  
# Fit multinomial logistic regression model 
model <- multinom(Species ~ Petal.Length 
                  + Petal.Width 
                  + Sepal.Length 
                  + Sepal.Width, 
                  data = iris) 
# weights:  18 (10 variable)
initial  value 164.791843 
iter  10 value 16.177348
iter  20 value 7.111438
iter  30 value 6.182999
iter  40 value 5.984028
iter  50 value 5.961278
iter  60 value 5.954900
iter  70 value 5.951851
iter  80 value 5.950343
iter  90 value 5.949904
iter 100 value 5.949867
final  value 5.949867 
stopped after 100 iterations
# Predict flower species for new data 
new_flo <- data.frame(Petal.Length = 1.5, 
                       Petal.Width = 0.3,  
                       Sepal.Length = 4.5,  
                       Sepal.Width = 3.1) 
predict(model, newdata = new_flo, type = "class") 
[1] setosa
Levels: setosa versicolor virginica

Replace the values in new_flo with the actual measurements of the new flower to be predicted. The call returns the predicted species (e.g., “setosa”, “versicolor”, or “virginica”).
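If the class probabilities are wanted rather than just the predicted label, the same fit can return them:

# Predicted probability of each species for the new observation
predict(model, newdata = new_flo, type = "probs")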

Ordinal Logistic Regression

Ordinal logistic regression or (ordinal regression) is used to model the relationship between an ordinal response variable and one or more explanatory variables. It is used to predict an ordinal dependent variable given one or more independent variables. The explanatory variables may be either continuous or categorical. We explore how one or more independent variables relate to the probability of an ordinal outcome being categorized into a specific or higher category using ordinal logistic regression.

Ordinal logistic regression has assumptions that differ from those of general logistic regression. They are listed below:

Assumptions

• The dependent variable is measured on an ordinal level.

• One or more of the independent variables are either continuous, categorical or ordinal.

• No multicollinearity: two or more independent variables should not be highly correlated with each other.

• Proportional Odds - i.e. that each independent variable has an identical effect at each cumulative split of the ordinal dependent variable.

Ordinal Regression Model:

\[\text{logit}(P(Y \geq j)) = \beta_{j0} + \beta_1X_1 + \dots + \beta_pX_p\]

Where:

\(P(Y \geq j)\) = the cumulative probability of the response being in category \(j\) or higher

\(\beta_{j0}\) = the threshold parameter for category \(j\)

\(\beta_{1}, \beta_2, \dots, \beta_p\) are the model coefficients associated with the predictor variables \(X_1, X_2, \dots, X_p\)

Example

  • Let’s use a hypothetical dataset where we predict customer satisfaction (ordinal variable: Low, Medium, High) based on factors like age, income, and product usage.
# Load necessary libraries
library(tidyverse)
library(ordinal)

Attaching package: 'ordinal'
The following objects are masked from 'package:VGAM':

    dgumbel, dlgamma, pgumbel, plgamma, qgumbel, rgumbel, wine
The following object is masked from 'package:dplyr':

    slice
# Simulated data (replace with your actual data)
set.seed(123)
data <- data.frame(
  satisfaction = factor(sample(c("Low", "Medium", "High"), 100, replace = TRUE), ordered = TRUE),
  age = sample(25:65, 100, replace = TRUE),
  income = sample(30000:100000, 100, replace = TRUE),
  usage = sample(1:5, 100, replace = TRUE)
)


head(data)
  satisfaction age income usage
1         High  47  73927     2
2         High  39  36600     4
3         High  45  87700     5
4       Medium  61  83517     2
5         High  32  82352     3
6       Medium  34  96153     3
sum(is.na(data)) #checking for missing values
[1] 0
str(data)
'data.frame':   100 obs. of  4 variables:
 $ satisfaction: Ord.factor w/ 3 levels "High"<"Low"<"Medium": 1 1 1 3 1 3 3 3 1 2 ...
 $ age         : int  47 39 45 61 32 34 58 34 46 36 ...
 $ income      : int  73927 36600 87700 83517 82352 96153 77801 61516 96074 62262 ...
 $ usage       : int  2 4 5 2 3 3 1 3 2 3 ...
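Note that because the levels were not specified, the ordered factor defaults to alphabetical order ("High" < "Low" < "Medium"), as the str() output shows. In a real analysis the ordering should be set explicitly, e.g. factor(x, levels = c("Low", "Medium", "High"), ordered = TRUE), so that the threshold coefficients are interpretable.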
plot(data) # quick pairwise plot of the variables

# Fit the ordinal logistic regression model

library(ordinal)
model <- clm(satisfaction ~ age + usage, data = data)
model
formula: satisfaction ~ age + usage
data:    data

 link  threshold nobs logLik  AIC    niter max.grad cond.H 
 logit flexible  100  -108.48 224.95 3(0)  5.13e-08 9.5e+04

Coefficients:
      age     usage 
-0.005358  0.206033 

Threshold coefficients:
  High|Low Low|Medium 
   -0.2749     1.1276 
# Summary of the model
summary(model)
formula: satisfaction ~ age + usage
data:    data

 link  threshold nobs logLik  AIC    niter max.grad cond.H 
 logit flexible  100  -108.48 224.95 3(0)  5.13e-08 9.5e+04

Coefficients:
       Estimate Std. Error z value Pr(>|z|)
age   -0.005358   0.017546  -0.305    0.760
usage  0.206033   0.129595   1.590    0.112

Threshold coefficients:
           Estimate Std. Error z value
High|Low    -0.2749     0.8794  -0.313
Low|Medium   1.1276     0.8869   1.271
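In this simulated dataset, neither age (p = 0.760) nor usage (p = 0.112) is a statistically significant predictor of satisfaction. Before relying on such a model, the proportional-odds assumption should also be checked; the ordinal package provides a likelihood-ratio test of non-proportional (nominal) effects (a sketch using the model object fitted above):

# Likelihood-ratio tests that relax the proportional-odds assumption for each
# predictor in turn; small p-values suggest the assumption may be violated
nominal_test(model)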