Probability and Classification

Am I a man, or am I a muppet?

Fernando Rios-Avila

Levy Economics Institute

November 20, 2024

Motivation

  • Say you work for a consultancy agency helping a bank. You are asked to predict which firms will default on their loans.
    • What would be better: a model that predicts probabilities or a model that classifies firms into default or not default?
    • How would you use it to decide which firms should get a loan?
  • Companies need to assess the likelihood of their suppliers or clients staying in business, as it impacts their own operations.
    • How to use historical data on company exits, along with key features, to predict the probability of a company’s exit.

Previously on DA

  • In the previous weeks, we covered the basics of prediction when the target is quantitative.
    • Almost straightforward: predict the value of the target. Consider many specifications and pick the best one.
  • We also covered the basics of probability modeling
    • LPM, logit, and probit models: when your dependent variable is binary.
  • However, we have not fully covered how to use these models for prediction.

Prediction with qualitative target

  • Consider cases where \(Y\) is qualitative
    • Whether a debtor defaults (will default) on their loan
    • Email is spam or not
    • Game result is win / lose (no draw).
  • For all these cases, the target (dependent variable) is binary.
  • The question is: given this, what is the best way to predict the target?
    • Predict the probability of “success” (default, spam, win)
    • or make a classification (default, spam, win) based on that probability.

Classification: The extra step

The process

  • Predict probability: We have done this.
    • Predicted probability between 0 and 1 (from logit or probit; the LPM can fall outside this range in extreme cases)
    • For each observation we predict a probability. Often that is all we need.

If logit: \[\Pr[y_i = 1|x_i] = \Lambda(\beta_0 + \beta_1x_i) = \frac{\exp (\beta_0 + \beta_1x_i)}{1 + \exp (\beta_0 + \beta_1x_i)}\]

  • That’s it! You can go a couple of steps further and use various specifications, as well as LASSO (for the logit), to pick the best model.

  • The best model can still be picked based on RMSE, Brier score, or calibration (see the sketch below).
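A minimal Stata sketch of this step; the outcome y and predictors x1-x3 are hypothetical placeholders:

    * Estimate a logit and predict probabilities
    logit y x1 x2 x3
    predict phat, pr

    * Brier score (mean squared error of the probability prediction) and RMSE
    gen sq_err = (y - phat)^2
    summarize sq_err
    display "Brier score = " r(mean) "  RMSE = " sqrt(r(mean))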

Refresher: Probability Models

  • LPM - not this time: the predicted value MUST be between 0 and 1, which the LPM does not guarantee

  • Logit or probit (or other non-linear probability models)

  • Nonlinear probability models \[\Pr[y_i = 1|x_i] = \Lambda(\beta_0 + \beta_1x_i) = \frac{\exp (\beta_0 + \beta_1x_i)}{1 + \exp (\beta_0 + \beta_1x_i)}\]

    • Predicted probability between 0 and 1
    • Starts with the explanatory variables
    • Multiplies them by coefficients, just like linear regression
    • And then transforms that linear combination into something that is always between 0 and 1: the predicted probability.

What’s New with Binary target?

  • The predicted probability is not a value of the target: it is a number between 0 and 1, not a 0 or a 1.
  • Desire to classify
    • assign 0 or 1
    • based on a probability that comes from a model
    • But how?
  • We also need new measures of fit
    • Some based on probabilities
    • Others based on classification

What’s NOT new with Binary target?

  • We still need the best fit
    • with the highest external validity
  • Usual worries: overfitting
    • Cross-validation helps avoid the worst overfitting
  • Models similar to those used earlier
    • Regression-like models (probability models)
    • Tree-based models (CART, Random Forest) <- We will not cover this

Probability prediction and process

  • We build models to predict probability when:
    • aim is to predict probabilities – (Duh!)
    • aim is to classify (predict 0 or 1) – (we need probabilities first)
  • Build models
    • several logit models guided by domain knowledge
    • LASSO - logit LASSO
  • Pick the best model via cross-validation using RMSE / Brier score (see the sketch below)
    • or another loss function, if you have one
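A sketch of the cross-validation step in Stata; the fold assignment, seed, and variable names (y, x1-x5) are hypothetical:

    * Assign each observation to one of 5 folds at random
    set seed 12345
    gen fold = 1 + floor(5 * runiform())

    * For each fold: fit on the other folds, predict on the held-out fold
    gen sq_err = .
    forvalues k = 1/5 {
        quietly logit y x1 x2 x3 x4 x5 if fold != `k'
        quietly predict phat`k' if fold == `k', pr
        quietly replace sq_err = (y - phat`k')^2 if fold == `k'
    }

    * Cross-validated RMSE (square root of the Brier score)
    quietly summarize sq_err
    display "CV RMSE = " sqrt(r(mean))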

Classification process

  • After you predict the conditional probability, you can make classifications based on some threshold or rule.
    • For example, if \(\Pr[y = 1] > 0.5\) then predict 1, otherwise predict 0
    • But we could choose any threshold. How should we pick one? (say, flag the top 10%?)
  • We need to consider that we can make errors
    • False negative
    • False positive
  • Thus we need a threshold that minimizes the expected loss from these errors

Classification Table: Confusion Matrix

\(y_j = 0\) \(y_j = 1\) Total
\(\hat{y}_j = 0\) TN FN TN + FN
Predicted negative (true negative) (false negative) (all classified negative)
\(\hat{y}_j = 1\) FP TP FP + TP
Predicted positive (false positive) (true positive) (all classified positive)
Total TN + FP FN + TP TN + FN + FP + TP
(all actual negative) (all actual positive) (N, all observations)

Classification Table: making errors

\(y_j = 0\) \(y_j = 1\) Total
\(\hat{y}_j = 0\) Predict firm stay Predict firm stay TN + FN
Predicted negative (Firm did stay) (Firm exited) (all classified stay)
\(\hat{y}_j = 1\) Predict firm exit Predict firm exit FP + TP
Predicted positive (Firm stayed) (Firm did exit) (all classified exit)
Total TN + FP FN + TP TN + FN + FP + TP
(all actual stay) (all actual exit) (N, all observations)

Measures of classification

There are several measures of classification, each with a different focus (a Stata sketch follows the list).

  • Accuracy \(=(TP+TN)/N\)
    • The proportion of correctly classified observations
    • Hit rate
  • Sensitivity \(=TP / (TP+FN)\)
    • The proportion of true positives among all actual positives
    • The probability that the predicted \(y\) is 1, conditional on \(y = 1\)
  • Specificity \(= TN/(TN+FP)\)
    • The proportion of true negatives among all actual negatives
    • The probability that the predicted \(y\) is 0, conditional on \(y = 0\)
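A Stata sketch of these measures, assuming predicted probabilities stored in a (hypothetical) variable phat and a 0.5 threshold; after a logit, estat classification reports the same quantities:

    * Classify at the chosen threshold and cross-tabulate against the actual outcome
    gen yhat_class = (phat > 0.5) if !missing(phat)
    tabulate yhat_class y

    * Compute accuracy, sensitivity, and specificity from the four cells
    count if yhat_class == 1 & y == 1
    local TP = r(N)
    count if yhat_class == 0 & y == 0
    local TN = r(N)
    count if yhat_class == 1 & y == 0
    local FP = r(N)
    count if yhat_class == 0 & y == 1
    local FN = r(N)
    display "Accuracy    = " (`TP' + `TN') / (`TP' + `TN' + `FP' + `FN')
    display "Sensitivity = " `TP' / (`TP' + `FN')
    display "Specificity = " `TN' / (`TN' + `FP')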

Theory: The ROC

Sensitivity vs Specificity

Measures of classification

  • The key point is that there is a trade-off between making false positive and false negative errors.
  • This is the essential insight in classification
  • This can be expressed with specificity and sensitivity.
    • If your threshold is low, you detect almost all of the positives, but you also produce many false positives. Sensitivity is high, specificity is low.
    • If your threshold is high, you correctly label almost all of the negatives, but you also produce many false negatives. Specificity is high, sensitivity is low.

ROC Curve

  • The ROC curve is a popular graphic for simultaneously displaying specificity and sensitivity for all possible thresholds.
  • ROC: Receiver operating characteristic curve
    • Name from engineering
  • For each threshold, we can compute confusion table \(\rightarrow\) calculate sensitivity and specificity
  • Then, we can plot sensitivity vs 1-specificity for all thresholds
    • Horizontal axis: False positive rate (one minus specificity) = the proportion of FP among actual negatives
    • Vertical axis: true positive rate (sensitivity) = the proportion of TP among actual positives

ROC Curve Intuition

  • Consider this:
    • If the threshold is 0, we predict all observations as 1. The sensitivity is 1, but the specificity is 0.
    • If the threshold is 1, we predict all observations as 0. The sensitivity is 0, but the specificity is 1.
    • The “ideal” threshold is somewhere in between.
  • The ROC curve shows how true positives and false positives increase relative to each other.

ROC Curve Intuition

Area Under ROC Curve

  • ROC curve: the closer it is to the top left corner, the better the (in-sample) prediction.
  • The area under the ROC curve (AUC) summarizes the quality of the probabilistic prediction
    • For all possible threshold choices
    • Area \(=\) 0.5 for random classification
    • Area \(>\) 0.5 if the curve is mostly above the 45 degree line
  • AUC is a good statistic for comparing models
    • Defined from the whole ROC curve, so it does not depend on a particular threshold
    • The larger the better
    • Ranges between 0 and 1.

Stata Corner

  • Logit estimation: logit y x1 x2 x3
  • Predict probabilities: predict yhat, pr
  • Classification:
    • gen yhat_class = (yhat > 0.5) if !missing(yhat)
    • estat classification
  • ROC curve: lroc
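Putting these commands together, a minimal end-to-end sketch (variable names are placeholders):

    * Estimate, predict, and inspect the fit
    logit y x1 x2 x3
    predict yhat, pr

    * Classification table at the default 0.5 cutoff and at a custom cutoff
    estat classification
    estat classification, cutoff(0.2)

    * ROC curve and area under the curve for the estimated model
    lroc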

Model selection Nr.1: Probability models

  • Model selection when we have no loss function, based on probability models only
    • Predict probabilities (No actual classification)
    • Use predicted probability to calculate RMSE
    • Pick by smallest RMSE
  • Or
    • Draw the ROC curve and compute the AUC; pick the model with the largest AUC
    • Less sensitive to class imbalance
  • In practice, AUC is more frequently used (a sketch follows)
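A sketch of the AUC comparison in Stata: roctab reports the AUC of a predicted probability against the actual outcome, and roccomp tests whether two areas differ. The variables phat_m1 and phat_m2 are hypothetical out-of-sample predictions from two candidate models:

    * AUC of each candidate model's predicted probabilities
    roctab y phat_m1
    roctab y phat_m2

    * Compare the two areas under the ROC curve
    roccomp y phat_m1 phat_m2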

Theory: Classification and loss function

How do we make a classification from a predicted probability?

  • We set a threshold!
  • The process of classification
    • If the predicted probability of the event is higher than the threshold \(\rightarrow\) assign (predict) class 1; otherwise, class 0.
  • Who sets the threshold?
    • Usually approximated by 0.5
    • or by the frequency of the event in the data

Classification: select the threshold with loss function

  • Find the optimal threshold with a loss function.
    • A loss function assigns a (dollar) value to each false positive and each false negative error.
    • Most often, the costs of FP and FN are very different.

Consider loss function

\[E[loss] = \Pr[FN] \times loss(FN) + \Pr[FP] \times loss(FP)\]

  • In the ideal case, minimizing this expected loss suggests that the optimal threshold is:

\[\text{Threshold} = \frac{loss(FP)}{loss(FN) + loss(FP)}\]

  • Or we can search for the threshold that minimizes the expected loss using cross-validation (a software exercise).
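For example, with the loss values used in the case study later in these slides, loss(FN) = 10 and loss(FP) = 1:

\[\text{Threshold} = \frac{loss(FP)}{loss(FN) + loss(FP)} = \frac{1}{10 + 1} = \frac{1}{11} \approx 0.091\]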

When to use Formula

  • Use the formula
    • when the dataset is “large”
    • when our model has a “good” fit

\[\text{Threshold}_{\min E (loss)} = \frac{loss(FP)}{loss(FN) + loss(FP)}\]
  • In practice
    • Pro: easy to use, often close enough
    • Con: not the best cutoff, especially for smaller data and a poorer-fitting model

Class imbalance

  • A potential issue in some datasets: the relative frequency of the classes.
  • Class imbalance = the event we care about is very rare or very frequent (\(\Pr(y = 1)\) or \(\Pr(y = 0)\) is very small)
    • Fraud, Sport injury
  • What is rare?
    • Something like 1%, 0.1%. (10% should be okay.)
    • Depends on size: in larger datasets we can identify rare patterns better.
  • Consequence: it is hard to find those rare events.
    • You may “identify” some patterns purely by chance.

Class imbalance: the consequences

  • The methods we use are not good at handling it.
    • Both for the models to predict probabilities, and for the measures of fit used for model selection.
  • The functional form assumptions behind the logit model tend to matter more, the closer the probabilities are to zero or one.
  • Cross-validation can be less effective at avoiding overfitting with very rare or very frequent events if the dataset is not very big. (Many samples will not even have the event.)
  • Usual measures of fit are less able to differentiate between models.
  • Consequence: Model fitting and selection setup not ideal

Class imbalance: what to do

  • What to do? Two key insights.
    1. Know when it’s happening, and be ready for poor performance.
    2. May need an action: rebalance sample to help build better models
  • Downsampling – randomly drop observations from the frequent class to make the sample more balanced (see the sketch after this list)
    • Before: 100,000 observations, 1% event rate (1,000 with \(y = 1\), 99,000 with \(y = 0\))
    • After: 10,000 observations, 10% event rate (1,000 with \(y = 1\), 9,000 with \(y = 0\))
  • Over-sampling of rare events
  • Try smart algorithms: Synthetic Minority Over-Sampling Technique (SMOTE)
    • Create synthetic observations that are similar to the rare events
    • Each synthetic rare observation is a combination (interpolation) of existing rare events
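Returning to the downsampling option above, a minimal Stata sketch; the event is assumed to be coded y = 1, and the 10 percent keep rate and the seed are illustrative:

    * Keep all rare events and a 10% random subsample of the frequent class
    set seed 12345
    gen u = runiform()
    keep if y == 1 | (y == 0 & u < 0.10)
    tabulate y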

Case study

Firm exit case study: background

  • Banks and business partners are often interested in the stability of their customers.
  • Predicting which firms will be around to do business with is an important part of many prediction projects.
  • Working with financial and non-financial information, your task may be to predict which firms are more likely to default than others.
  • Goal: Predict corporate default - exit from the market.
    • We have to figure out and decide on target, features, etc.

Firm exit case study: bisnode-firms dataset

  • Firm data
  • Many different types of variables
    • Financial, Management, Ownership, Status (HQ)
  • The dataset is panel data
  • Rows are identified by company id (comp-id) and year.
  • We’ll focus on the 2012 cross-section.

Firm exit case study: Label (target) engineering

  • Defining our target. There is no “exit” - we have to define it!
  • Option: If a firm is operational in year \(t\), but is not in business in \(t + 2\) \(\rightarrow\) exit.
  • This definition is broad
    • Defaults / forced exit
    • Orderly closure
    • Acquisitions

Firm exit case study: Sample design

  • Look at a cross section in 2012
    • If alive in Year=2014, status_alive=1
  • Keep if established in 2012
  • We do not care about all firms: drop the very small and the very large
    • keep firms below 10 million euros
    • and above 1,000 euros
  • Hardest call: keep when important variables are not missing
    • Balance sheet like liquid assets
    • Ownership like foreign
    • Industry classification
  • We end with about 19K observations and a 20% default rate

Firm exit case study: Features - overview

  • Key predictors
    • size: sales, sales growth
    • management: foreign, female, young, number of managers
    • region, industry, firm age
    • other financial variables from the balance sheet and P&L.
  • For financial variables, we use ratios (to sales or size of balance sheet).
  • Here it will turn out to be important to look at functional form carefully, especially for the financial variables.
  • Mix domain knowledge and statistics.
  • Plenty of analyst (judgment) calls.

Firm exit case study: Feature engineering

  • Growth rates
    • 1 year growth rate of sales. Log difference.
    • Could use a longer time period, but we would lose observations
  • Ownership, management info
    • Keep if well covered, impute some, but drop if key vars missing
    • Sometimes simplify (unless big data): ceo_young = ceo_age_mod <40 & ceo_age_mod >15
  • Industry categories - need to simplify
  • Foreign ownership - above a threshold
  • Numerical variables from balance sheet: Check functional form - logs, polynomials

Firm exit case study: Feature engineering tools

  • Check coverage (missing values)
  • Decide on imputation vs drop
  • Categorical (factor) variables
  • Numerical variables
    • Check functional form - logs, polynomials
    • Look at relationships in scatterplot, loess and decide

Firm exit case study: Feature engineering

  • We may need additional cleaning steps.
  • Create binary variables (flags) when implementing changes to values.
  • When financial values are negative: replace with zero and add a flag to capture the imputation.
    • Zeros will not work with logs.
  • Annual growth in sales (difference in log sales) vs default
    • Try editing variables by winsorizing and adding flags for extreme values.
    • Some odd shapes appear due to extreme values.

The Weird Shape

Winsorize

Firm exit case study: Winsorizing

  • When the edge of a distribution is weird…
  • Winsorizing is a process to keep observations with extreme values in the sample
    • for each variable, we
      • identify a threshold value, and replace values outside that threshold with the threshold value itself
      • and add a flag variable.
  • Two ways to do it:
    • an automatic approach, where the lowest and highest 1 percent or 5 percent are replaced and flagged.
    • a manual approach: pick thresholds by domain knowledge and by looking at the lowess fit. Preferred. (A sketch follows below.)
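A Stata sketch of the automatic approach, winsorizing a hypothetical variable sales_growth at its 1st and 99th percentiles and adding flags (community-contributed commands such as winsor2 automate a similar step):

    * Find the 1st and 99th percentiles
    _pctile sales_growth, percentiles(1 99)
    local p1  = r(r1)
    local p99 = r(r2)

    * Flag the extreme observations, then replace them with the threshold values
    gen flag_low  = (sales_growth < `p1')  if !missing(sales_growth)
    gen flag_high = (sales_growth > `p99') if !missing(sales_growth)
    gen sales_growth_ws = sales_growth
    replace sales_growth_ws = `p1'  if flag_low  == 1
    replace sales_growth_ws = `p99' if flag_high == 1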

Firm exit case study: Firm sales growth

  • The winsorized value simply equals the original value within a range, and is flat below/above the thresholds.

Firm exit case study: Model features 1

  • Firm: Age of firm, squared age, a dummy if newly established, industry categories, location regions for its headquarters, and dummy if located in a big city.
  • Financial 1: Winsorized financial variables: fixed, liquid (incl current), intangible assets, current liabilities, inventories, equity shares, subscribed capital, sales revenues, income before tax, extra income, material, personal and extra expenditure.
  • Financial 2: Flags (extreme, low, high, zero - when applicable) and polynomials: Quadratic terms are created for profit and loss, extra profit and loss, income before tax, and share equity.
  • Growth: Sales growth is captured by a winsorized growth variable, its quadratic term and flags for extreme low and high values.

Firm exit case study: Model features 2

  • HR: For the CEO: female dummy, winsorized age and flags, flag for missing information, foreign management dummy; and labor cost, and flag for missing labor cost information.
  • Data Quality: Variables related to the data quality of the financial information: a flag for problems, and the length of the period that the balance sheet covers.
  • Interactions: Interactions with sales growth, firm size, and industry.

Firm exit case study: Models

Models (number of predictors)

  • Logit M1: handpicked few variables (\(p = 11\))
  • Logit M2: handpicked few variables + Firm (\(p = 18\))
  • Logit M3: Firm, Financial 1, Growth (\(p = 35\))
  • Logit M4: M3 + Financial 2 + HR + Data Quality (\(p = 79\))
  • Logit M5: M4 + interactions (\(p = 153\))
  • Logit LASSO: M5 + LASSO (\(p = 142\))
  • Number of coefficients = N of predictors +1 (constant)

Firm exit case study: Data

  • \(N = 19,036\)
  • \(N = 15,229\) in work set (80%)
  • Split 5 times into training + test sets
    • Used for 5-fold cross-validation (model selection)
  • \(N = 3,807\) in holdout set (20%)
    • Used only for diagnostics of selected model.
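A sketch of the work/holdout split in Stata 16 or newer with splitsample (the seed is arbitrary); an equivalent split can be built by hand with runiform():

    * 80/20 split into a work set and a holdout set
    splitsample, generate(sample) split(0.8 0.2) rseed(20241120)
    label define samplelbl 1 "work" 2 "holdout"
    label values sample samplelbl
    tabulate sample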

Firm exit case study: Comparing model fit

Model Variables Coefficients CV RMSE
Logit M1 4 12 0.374
Logit M2 9 19 0.366
Logit M3 22 36 0.364
Logit M4 30 80 0.362
Logit M5 30 154 0.363
Logit LASSO 30 143 0.362
  • 5-fold cross-validated on work set, average RMSE
  • We will use the Logit M4 model as the benchmark

Classification

  • Picked a model on RMSE/Brier score
  • For classification, we will need a threshold

Firm exit case study: ROC curve

  • ROC curve shows trade-off for various values of the threshold
  • Go through values of the ROC curve for selected threshold values, between 0.05 and 0.75, by steps of 0.05

Firm exit case study: AUC

Model RMSE AUC
Logit M1 0.374 0.738
Logit M2 0.366 0.771
Logit M3 0.364 0.777
Logit M4 0.362 0.782
Logit M5 0.363 0.777
Logit LASSO 0.362 0.768
  • Can calculate the AUC for all our models
  • Model selection by RMSE or AUC
  • Here both criteria pick the same model (they could differ when models are close)

Firm exit case study: Comparing two thresholds

  • Take the Logit M4 model, predict probabilities and use that to classify on the holdout set
  • Two thresholds: 50% and 20%
  • Predict exit if probability > threshold

Firm exit case study: Comparing two thresholds

Threshold: 0.5
                Actual stay   Actual exit   Total
Predicted stay  75%           15%           90%
Predicted exit  4%            6%            10%
Total           79%           21%           100%

Threshold: 0.2
                Actual stay   Actual exit   Total
Predicted stay  57%           7%            64%
Predicted exit  22%           14%           36%
Total           79%           21%           100%

Firm exit case study: Threshold choice consequences

  • Having a higher threshold leads to
    • fewer predicted exits:
      • 10% when the threshold is 50% (36% for threshold 20%).
    • fewer false positives (4% versus 22%)
    • more false negatives (15% versus 7%).
  • The 50% threshold leads to a higher accuracy rate than the 20% threshold
    • 50% threshold: 75% + 6% = 81%
    • 20% threshold: 57% + 14% = 71%
    • even though the 20% threshold is very close to the actual proportion of exiting firms.

Summary

First option: no loss fn

  • On the work set, do 5-fold CV and loop over models
  • Make probability predictions
  • Calculate the average RMSE across the test folds
  • Draw the ROC curve and calculate the AUC for each fold
  • Pick the best model based on average RMSE
  • Take the best model and estimate the RMSE on the holdout set \(\rightarrow\) best guess for live data performance
  • Output: a probability ranking - most likely to least likely.
  • Show the ROC curve and confusion table for Logit M4 on the holdout set at \(t = 0.5\) and \(t = 0.2\) - to illustrate the trade-off.

Firm exit case study: The loss function

  • The loss function assigns costs to the FN and FP errors
  • What matters is the relative cost of FN versus FP
  • FN = 10
    • If the model predicts staying in business and the firm exits the market (a false negative), the bank loses all 10 thousand euros.
  • FP = 1
    • If the model predicts exit and the bank denies the loan, but the firm in fact stays in business (a false positive), the bank loses a profit opportunity of 1 thousand euros.
  • With correct decisions, there is no loss.

Firm exit case study: Finding the threshold

  • Find the threshold by formula or by a search algorithm
  • Formula: the optimal classification threshold is \(1/11 \approx 0.091\)
  • Algorithm: search through possible cutoffs

Firm exit case study: Finding the threshold

  • Consider all thresholds \(T = 0.01, 0.02, \ldots, 1\)
  • Calculate the expected loss for all thresholds
  • Pick the threshold where the expected loss is at its minimum (see the sketch below)
  • Done in CV; the illustration uses fold Nr. 5.
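A Stata sketch of the search, assuming predicted probabilities in phat, the actual outcome in y (no missing values), and the case study's loss values loss(FN) = 10, loss(FP) = 1:

    * Grid search over thresholds 0.01, 0.02, ..., 1.00, minimizing expected loss
    quietly count
    local N = r(N)
    local best_loss = .
    local best_t    = .
    forvalues i = 1/100 {
        local t = `i' / 100
        quietly count if phat <= `t' & y == 1    // false negatives at this threshold
        local FN = r(N)
        quietly count if phat >  `t' & y == 0    // false positives at this threshold
        local FP = r(N)
        local loss = (10 * `FN' + 1 * `FP') / `N'
        if `loss' < `best_loss' {
            local best_loss = `loss'
            local best_t    = `t'
        }
    }
    display "Optimal threshold = " `best_t' "   expected loss per firm = " `best_loss'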

Firm exit case study

  • Model selection process
    • Predict probabilities
    • Use predicted probabilities and loss function to pick optimal threshold
    • Use that threshold to calculate expected loss
    • Pick model with smallest expected loss (in 5-fold CV)
  • We run the threshold selection algorithm on the work set, with 5-fold cross-validation.
  • Best is model Logit M4
  • The optimal classification threshold found by the algorithm is 0.082, close to the formula value (0.091).
  • The average expected loss is 0.64.

Firm exit case study: Summary of process with loss function

  • On the work set, do 5-fold CV and loop over models
  • Make probability predictions
  • Calculate the average RMSE on each test fold
  • Draw the ROC curve and find the optimal threshold with the loss function (FP = 1, FN = 10)
    • show: threshold search - loss plots and ROC curve for fold 5
  • Summarize, for each model: the average of the optimal thresholds, the threshold for fold 5, the average expected loss, and the expected loss for fold Nr. 5.
  • Pick the best model based on average expected loss
  • Take the best model, re-estimate it on the work set, find the optimal threshold, and estimate the expected loss on the holdout set

Summary

  • Decide whether the goal is predicting probabilities or classification.
  • The outcome of prediction with a binary target variable is always the predicted probabilities as a function of predictors.
  • When our goal is probability prediction, we should find the best model that predicts probabilities by cross-validation + RMSE/AUC.
  • When our goal is classification, we should find the best model that has the smallest expected loss.
    • With the formula for the threshold or a search algorithm
    • Finding the optimal classification threshold needs a loss function.

Summary

  • Without a loss function, no classification.
  • If you don’t have one, make it up.
  • Don’t rely on the default threshold of 0.5.