Logistic Regression

A Powerful Tool for Classification

Basics of Logistic Regression

Classification Algorithm:

  • Two categories for the outcome variable (analysed in what follows), e.g., unemployed: true or false

  • Multiple categories for the outcome variable (not covered here)
    • unordered logistic regression
    • ordered logistic regression

A Mock-Up Example to Introduce the Idea

Code
library(tidymodels); library(kableExtra); library(janitor); library(rio)
DataYachts=import("https://lange-analytics.com/AIBook/Data/DataYachts.csv") %>% 
  mutate(YachtNum=Yacht, Yacht=as.factor(Yacht)) 
kbl(DataYachts %>% select(-YachtNum), caption="Income and Yacht Ownership") %>% 
  kable_styling(bootstrap_options=c("striped","hover"), position="center", full_width = F)
Income and Yacht Ownership
Name Income Yacht
Jack 45 1
Sarah 50 0
Carl 55 0
Eric 60 0
Zoe 67 0
James 250 1
Enrico 280 1
Erica 320 1
Stephanie 370 1
Susan 500 1

Using OLS is a Tempting (but bad) Idea

Code
library(plotly)
ggplotly(ggplot(aes(x=Income,y=YachtNum),data=DataYachts)+
  geom_point(size=2.7, color="magenta")+
  scale_x_continuous(limits = c(-50,500), breaks = seq(0,500,50))+ 
  scale_y_continuous(breaks = seq(0,1.25,0.25))+ 
  labs(y="Probability of Yacht Ownership", x="Income in $1,000"))

Using OLS is a Tempting (but bad) Idea

Code
library(plotly)
ggplotly(ggplot(aes(x=Income,y=YachtNum),data=DataYachts)+
  geom_hline(yintercept = 0.5)+
  geom_point(size=2.7, color=ifelse(DataYachts$Income==45,"red","cyan"))+
  geom_smooth(method="lm",se=FALSE, size=1.7)+
  scale_x_continuous(limits = c(-50,500), breaks = seq(0,500,50))+ 
  scale_y_continuous(breaks = seq(0,1.25,0.25))+ 
  labs(y="Probability of Yacht Ownership", x="Income in $1,000"))

Quick Way to Find a Decision Boundary

  1. Find the intersection point between the prediction line and the horizontal 0.5 probability line.

  2. Draw a vertical line through the intersection point. This line is called a decision boundary.

  3. All incomes left of the decision boundary (income below 158, i.e., $158,000) are predicted as “no”. All incomes right of the decision boundary (income above 158) are predicted as “yes”.
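The boundary does not have to be read off the plot: fit the OLS line and solve \(0.5=b_2+b_1\cdot Inc\) for income. A base-R sketch, with the mock-up data typed in from the table above; the result should land near the ≈158 reported on this slide:

```r
# Mock-up data from the table above
Income   <- c(45, 50, 55, 60, 67, 250, 280, 320, 370, 500)
YachtNum <- c(1, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Fit the (ill-advised) OLS prediction line
ModelOLS <- lm(YachtNum ~ Income)

# Solve 0.5 = Intercept + Slope * Income for Income
b <- coef(ModelOLS)
Boundary <- (0.5 - b[["(Intercept)"]]) / b[["Income"]]
Boundary  # income (in $1,000) where the line crosses the 0.5 probability line
```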

Why OLS for Classification is a Bad Idea

Note: incomes > $370,000 are predicted to own a yacht with a probability greater than 100%!
E.g., the predicted probability of owning a yacht for an income of $500,000 is 125%.

A similar problem can occur with negative probabilities!

An S-Shaped (Sigmoid) Function as an Alternative to OLS

The Logistic Function

  • The Logistic function (confusingly sometimes also called the sigmoid function):

\[ y_i=\frac{1}{1+e^{-x_i}} \]

We use: \(y_i=P^{rob}_{yes,i}\) and \(x_i=\beta_1 Inc_i+\beta_2\) which gives us:

\[ P^{rob}_{yes,i}=\frac{1}{1+e^{-(\beta_1 Inc_i+\beta_2)}} \]

\(\beta_1\) and \(\beta_2\) change the slope and position of the curve; \(\beta_1=1\) and \(\beta_2=0\) recover the original logistic function.
🤓
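To see how \(\beta_1\) and \(\beta_2\) act, the function can be evaluated at a few points in base R (the parameter values below are chosen purely for illustration):

```r
# Logistic function with adjustable slope (beta1) and position (beta2)
Logistic <- function(x, beta1 = 1, beta2 = 0) {
  1 / (1 + exp(-(beta1 * x + beta2)))
}

Logistic(0)              # 0.5: the original logistic function crosses 0.5 at x = 0
Logistic(0, beta2 = 2)   # beta2 shifts the curve horizontally
Logistic(2, beta1 = 5)   # a larger beta1 makes the S-curve steeper
```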

What Makes the Logistic Function so Special?

— compared to other S-shaped (sigmoid) functions —

Time for some mathematical magic:

The logistic function, where \(P^{rob}_{yes,i}\) denotes the probability of the positive event (e.g., yacht ownership: yes):

\[P^{rob}_{yes,i}=\frac{1}{1+e^{-(\beta_1\cdot x_i+\beta_2)}} \]

Take the reciprocal of both sides of the equation:

\[\frac{1}{P^{rob}_{yes,i}}=1+e^{-(\beta_1\cdot x_i+\beta_2)}\]

Subtract 1 on both sides:

\[\frac{1}{P^{rob}_{yes,i}}-1=e^{-(\beta_1\cdot x_i+\beta_2)}\]

Substituting \(-1=-\frac{P^{rob}_{yes,i}}{P^{rob}_{yes,i}}\) and simplifying gives:

\[\frac{1-P^{rob}_{yes,i}}{P^{rob}_{yes,i}}=e^{-(\beta_1\cdot x_i+\beta_2)}\]

\(1-P^{rob}_{yes,i}\) equals by definition \(P^{rob}_{no,i}\):

\[\frac{P^{rob}_{no,i}}{P^{rob}_{yes,i}}=e^{-(\beta_1\cdot x_i+\beta_2)}\]

Take the reciprocal of both sides again:

\[\frac{P^{rob}_{yes,i}}{P^{rob}_{no,i}}=e^{\beta_1\cdot x_i+\beta_2}\]

Take the logarithm on both sides:

\[\ln\left (\frac{P^{rob}_{yes,i}}{P^{rob}_{no,i}}\right )=\beta_1\cdot x_i+\beta_2\]
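The chain of steps can be checked numerically: feeding the logistic function's output into the log-odds recovers \(\beta_1\cdot x_i+\beta_2\). A base-R check with arbitrary illustration values:

```r
beta1 <- 0.02; beta2 <- -2.7; x <- 100          # arbitrary illustration values
ProbYes <- 1 / (1 + exp(-(beta1 * x + beta2)))  # logistic function
LogOdds <- log(ProbYes / (1 - ProbYes))         # ln(P_yes / P_no)
all.equal(LogOdds, beta1 * x + beta2)           # TRUE: the logit inverts the logistic
```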

One More Step — Odds vs Probabilities

  • The ratio of the yes/no probabilities can be interpreted as \(Odds\), as commonly used in betting.

  • Example: The probability of getting two heads when flipping two coins is \(P^{rob}_{yes,i}=0.25\).

  • Consequently, the probability of not getting two heads when flipping two coins is \(P^{rob}_{no,i}=0.75\).

  • The \(Odds\) of two heads versus not two heads are 1 to 3, i.e., 0.33:

\[O^{dds}=\frac{P^{rob}_{yes,i}}{P^{rob}_{no,i}}=\frac{0.25}{0.75}=\frac{1}{3}=0.33\]
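The coin example in base R, converting between probability and odds in both directions:

```r
ProbYes <- 0.25              # P(two heads with two fair coins)
ProbNo  <- 1 - ProbYes       # 0.75
Odds    <- ProbYes / ProbNo  # 1/3, i.e., approx. 0.33

# Converting back from odds to probability:
ProbBack <- Odds / (1 + Odds)  # 0.25
```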

Interpretation of the \(\beta s\): Yacht Ownership

\[\ln(O^{dds})=\ln\left (\frac{P^{rob}_{yes,i}}{P^{rob}_{no,i}}\right )=0.02\cdot Inc_i+(-2.7)\]

Model results after fitting and printing the workflow:

Code
# The workflow WFYachts had been created and
# fitted with the DataYachts data before.
print(WFYachts)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
(Intercept)       Income  
   -2.68660      0.02448  

Degrees of Freedom: 9 Total (i.e. Null);  8 Residual
Null Deviance:      13.46 
Residual Deviance: 5.654    AIC: 9.654

Interpretation of the \(\beta s\): Yacht Ownership

\[\ln(O^{dds})=\ln\left (\frac{P^{rob}_{yes,i}}{P^{rob}_{no,i}}\right )=0.02\cdot Inc_i+(-2.7)\]

  • If income increases by 1 ($1,000) the logarithm of the odds increases by 0.02.

  • Since a change in a logarithm approximates a relative (percentage) change:

If income increases by 1 ($1,000), the odds increase by about 2% (0.02). (Be careful with these results: the data are made up and N is far too small!)
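For an exact rather than approximate interpretation, exponentiate the coefficient: each additional $1,000 of income multiplies the odds by \(e^{0.02448}\). The ≈2% reading relies on the small-coefficient approximation \(\ln(1+r)\approx r\). In base R, using the Income coefficient printed above:

```r
Beta1 <- 0.02448         # Income coefficient from the model output above

exp(Beta1)               # approx. 1.0248: factor by which the odds change
(exp(Beta1) - 1) * 100   # approx. 2.48: exact percentage change of the odds
```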

Confusion Matrix

Note: in the mock-up we did not create separate training and testing data. Therefore, we use DataYachts (the data we used to fit/train the workflow) here. This is not proper methodology, but it is good enough for the mock-up:

Code
DataYachtsWithPred=augment(WFYachts, new_data=DataYachts)
conf_mat(DataYachtsWithPred, truth=Yacht, estimate=.pred_class)
          Truth
Prediction 0 1
         0 4 1
         1 0 5
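From this matrix the accuracy can be read off by hand: 4 + 5 of the 10 observations lie on the main diagonal. A base-R check, with the cell counts copied from the output above:

```r
# Confusion matrix entries copied from the output above
ConfMat <- matrix(c(4, 0, 1, 5), nrow = 2,
                  dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1")))

Accuracy <- sum(diag(ConfMat)) / sum(ConfMat)  # 9 of 10 correct: 0.9
```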

Real World Churn Analysis with Logistic Regression — the Data

We use data (7,043 customers) of the fictional telecommunication company TELCO, generated by IBM for training purposes:

  • The outcome variable \(Churn\) indicates whether a customer departed within the last month (\(Churn=Yes\)) or not (\(Churn=No\)).
  • Predictor variables contain:
    • Customers’ \(Gender\) (\(Female\) or \(Male\)),
    • Customers’ \(SeniorCitizen\) status (\(0\) for no or \(1\) for yes),
    • Customers’ \(Tenure\) with TELCO (months of membership), as well as
    • Customers’ \(MonthlyCharges\) (in US-$).

Real World Churn Analysis with Logistic Regression — the Data

Code
DataChurn=import("https://lange-analytics.com/AIBook/Data/TelcoChurnData.csv") %>%
  clean_names("upper_camel") %>%
  select(Churn,Gender,SeniorCitizen,Tenure,MonthlyCharges) %>%
  mutate(Churn=factor(Churn, levels=c("Yes","No"))) 
head(DataChurn)
  Churn Gender SeniorCitizen Tenure MonthlyCharges
1    No Female             0      1          29.85
2    No   Male             0     34          56.95
3   Yes   Male             0      2          53.85
4    No   Male             0     45          42.30
5   Yes Female             0      2          70.70
6   Yes Female             0      8          99.65

Real World Churn Analysis with Logistic Regression

— Do it yourself —

Create the Churn analysis with logistic regression. Click on the link in the footer to get an R-script with a skeleton for the analysis.🤓

Results from Churn Analysis with Logistic Regression

Confusion Matrix:

           Truth
Prediction  Yes   No
       Yes  239  150
       No   322 1403

Accuracy:

.metric .estimator .estimate
accuracy binary 0.7767266

Sensitivity:

.metric .estimator .estimate
sensitivity binary 0.426025

Specificity:

.metric .estimator .estimate
specificity binary 0.9034127

Hint: What do the column sums of the confusion matrix tell you?
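All three metrics can be reproduced by hand from the confusion matrix (cell counts copied from above; the event class is Yes):

```r
# Cell counts copied from the confusion matrix above (event class: Yes)
TP <- 239; FP <- 150   # predicted Yes: truth Yes / truth No
FN <- 322; TN <- 1403  # predicted No:  truth Yes / truth No

Accuracy    <- (TP + TN) / (TP + FP + FN + TN)  # 0.7767
Sensitivity <- TP / (TP + FN)                   # 0.4260: under half of the churners are found
Specificity <- TN / (TN + FP)                   # 0.9034

colSums(matrix(c(TP, FN, FP, TN), nrow = 2))    # 561 vs. 1553: the truth classes are imbalanced
```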

Problem: Unbalanced Training Data

Churn n
Yes 1308
No 3621

Majority Class: \(Churn=No\) has 3621 observations in the training dataset.

Minority class \(Churn=Yes\) has 1308 observations in the training dataset.

What can we do?

Churn n
Yes 1308
No 3621

  • Downsampling: Randomly delete observations from the majority class until the ratio of majority-class to minority-class observations reaches the desired value (e.g., 1:1).

  • Upsampling: In its simplest version, randomly chosen observations from the minority class are copied until the ratio of majority-class to minority-class observations reaches the desired value (e.g., 1:1).

  • Often, a combination of downsampling and upsampling is performed.
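The two ideas can be sketched with base R's sample() (a toy illustration of the principle, not the themis implementation):

```r
set.seed(123)
Majority <- rep("No", 3621)   # majority-class labels, counts as in DataTrain
Minority <- rep("Yes", 1308)  # minority-class labels, counts as in DataTrain

# Downsampling: keep a random subset of the majority class
Down <- c(sample(Majority, length(Minority)), Minority)
table(Down)  # 1308 No, 1308 Yes

# Upsampling: draw minority observations with replacement
Up <- c(Majority, sample(Minority, length(Majority), replace = TRUE))
table(Up)    # 3621 No, 3621 Yes
```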

Performing Down-Sampling with step_downsample()

You need to add the R package themis. Then in your script, you can add step_downsample(Churn) to the recipe (don’t forget to execute the following command lines again). As a reminder our original DataTrain had 4,929 observations, \(Churn_{Yes}=1308\), \(Churn_{No}=3621\):

Code
library(themis)
RecipeChurn=recipe(Churn ~ ., data=DataTrain) %>% 
  step_naomit(all_predictors()) %>% 
  step_dummy(Gender) %>% 
  step_downsample(Churn)

# You do not need to do the following steps;
# they just display the count() for the training data.
ExtractedDataTrain=juice(RecipeChurn %>% prep())
kbl(count(ExtractedDataTrain, Churn))
Churn n
Yes 1308
No 1308

Note, the number of observations has decreased by 2313. This is an information loss!

Performing Up-Sampling with step_upsample()

You need to add the R package themis. Then in your script, you can add step_upsample(Churn) to the recipe (don’t forget to execute the following command lines again). As a reminder our original DataTrain had 4,929 observations, \(Churn_{Yes}=1308\), \(Churn_{No}=3621\):

Code
library(themis)
RecipeChurn=recipe(Churn ~ ., data=DataTrain) %>% 
  step_naomit(all_predictors()) %>% 
  step_dummy(Gender) %>% 
  step_upsample(Churn)

# You do not need to do the following steps;
# they just display the count() for the training data.
ExtractedDataTrain=juice(RecipeChurn %>% prep())
kbl(count(ExtractedDataTrain, Churn))
Churn n
Yes 3621
No 3621

Note, the number of observations has increased by 2313. The information in the dataset has not increased!

Performing Up-Sampling with step_smote(): What is the Advantage?

As a reminder our original DataTrain had 4,929 observations, \(Churn_{Yes}=1308\), \(Churn_{No}=3621\):

Code
library(themis)
RecipeChurn=recipe(Churn ~ ., data=DataTrain) %>% 
  step_naomit(all_predictors()) %>% 
  step_dummy(Gender) %>% 
  step_smote(Churn)

# You do not need to do the following steps;
# they just display the count() for the training data.
ExtractedDataTrain=juice(RecipeChurn %>% prep())
kbl(count(ExtractedDataTrain, Churn))
Churn n
Yes 3621
No 3621

Instead of copying a record from the training dataset, step_smote() finds the nearest minority-class neighbors of that record and creates a new record whose features are a weighted average (an interpolation) of the original record and a randomly chosen nearest neighbor.
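The interpolation step can be sketched for a single record in base R (a simplified illustration of the idea, not the themis code; the two feature values are taken from the churn data shown earlier, and the neighbor is hypothetical):

```r
set.seed(42)

# A minority-class record and a (hypothetical) nearest minority neighbor,
# using two numeric features from the churn data for illustration
Record   <- c(Tenure = 2, MonthlyCharges = 70.70)
Neighbor <- c(Tenure = 8, MonthlyCharges = 99.65)

Weight    <- runif(1)                               # random weight in [0, 1]
NewRecord <- Record + Weight * (Neighbor - Record)  # interpolated synthetic record
NewRecord  # lies on the line segment between Record and Neighbor
```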