[Day 182] Learning about feature selection in fraud detection and finding a classifier model with low recall

 Hello :)
Today is Day 182!


A quick summary of today:
  • learning about IV, WoE, and finding the best model for an imbalanced insurance fraud dataset


The time has come to start thinking about the project for MLOps zoomcamp. 

I was looking around for an interesting dataset related to PD (probability of default), LGD (loss given default), or EAD (exposure at default), and I found this notebook. Warning: it is fairly long. But inside I saw something that interested me: it talked about WoE and IV, and said they are good measures for evaluating features for fraud and similar classification tasks. This website's definition was the clearest.

Weight of Evidence (WoE)

It is a technique used in credit scoring and predictive modeling to assess the predictive power of independent variables relative to a dependent variable. Originating from the credit risk world, WoE measures the separation between "good" and "bad" customers. Here, "bad" customers typically refer to those who have defaulted on a loan, while "good" customers are those who have repaid their loans.

Key Concepts of WoE:

  • Calculation: WoE is calculated using the natural logarithm (ln) of the ratio of the proportion of non-events (e.g., non-defaults) to the proportion of events (e.g., defaults) within a group or bin of the independent variable.
  • Interpretation: A positive WoE indicates that the proportion of non-events is higher than events, suggesting a stronger likelihood of the event not occurring. Conversely, a negative WoE indicates a higher proportion of events compared to non-events.
  • Transformation: WoE transforms both continuous and categorical variables into continuous measures, making it easier to interpret their relationship with the target variable.

Steps to Calculate WoE:

  • Bin Creation: For continuous variables, divide data into bins or groups based on the variable's distribution.
  • Event and Non-event Calculation: Count the number of events and non-events (e.g., defaults and non-defaults) within each bin.
  • Percentage Calculation: Calculate the percentage of events and non-events in each bin.

  • WoE Calculation: Compute the WoE for each bin using the formula: WoE = ln(% of non-events / % of events)

  • Usage: Use WoE values instead of raw input values in predictive models like logistic regression to enhance model interpretability and performance.
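To make these steps concrete, here is a minimal pandas sketch of a per-bin WoE table (my own helper, not from the notebook; the eps constant is an assumption to smooth bins with zero events):

import numpy as np
import pandas as pd

def woe_table(df, feature, target, bins=5, eps=0.5):
    # Per-bin WoE; target is 1 for events (e.g. fraud/default), 0 otherwise
    d = df[[feature, target]].copy()
    # Bin continuous features; categorical ones are grouped as-is
    if pd.api.types.is_numeric_dtype(d[feature]):
        d[feature] = pd.qcut(d[feature], q=bins, duplicates='drop')
    g = d.groupby(feature)[target].agg(events='sum', total='count')
    g['non_events'] = g['total'] - g['events']
    # Share of all events / non-events that falls into each bin
    pct_event = (g['events'] + eps) / (g['events'].sum() + eps)
    pct_non_event = (g['non_events'] + eps) / (g['non_events'].sum() + eps)
    g['woe'] = np.log(pct_non_event / pct_event)
    return g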

Advantages of WoE:

  • Handles Missing Data: Allows separate treatment of missing values.
  • Treats Outliers: Bins extreme values into groups, reducing their impact.
  • Simplifies Model Interpretation: Provides a clear linear relationship between predictors and the target variable's log odds.
  • No Need for Dummy Variables: Directly transforms categorical variables into continuous ones.

Considerations:

  • Monotonicity: WoE should exhibit a monotonic relationship with the target variable across bins.
  • Bin Size: Each bin should ideally contain at least 5% of the observations to ensure robustness.
  • Model Stability: Incorrect binning can lead to model instability, hence the importance of validating the monotonic relationship of WoE values across bins.


Information Value

It is a technique used in predictive modeling to assess the importance of independent variables in predicting a binary outcome, typically used in credit scoring and similar applications.

Definition and Calculation:

  • Definition: IV measures the predictive power of a variable by quantifying the relationship between the variable's WOE (Weight of Evidence) and the target variable.
  • Formula: IV is calculated as the sum, over the bins (or categories) of the variable, of the difference between the percentage of non-events and the percentage of events, weighted by the WoE of that bin: IV = Σ (% of non-events − % of events) × WoE


Interpretation of IV:

  • IV values are used to rank variables based on their predictive power:
    • Less than 0.02: Not useful for prediction.
    • 0.02 to 0.1: Weak predictive power.
    • 0.1 to 0.3: Medium predictive power.
    • 0.3 to 0.5: Strong predictive power.
    • Greater than 0.5: Suspiciously strong predictive power, warrants careful examination.

Usage and Considerations:

  • Variable Selection: IV helps in selecting important variables for inclusion in models, especially in binary logistic regression where it directly relates to the log odds of the target variable.
  • Limitations: IV is most effective for binary outcomes and logistic regression models. For other models like random forests or SVMs that can capture non-linear relationships well, IV might not always identify the most predictive variables.
  • Bin Size Impact: Increasing the number of bins or groups for a variable can increase its IV, but too many bins with few observations can lead to unreliable IV values.
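Given the per-bin WoE, IV is just the WoE-weighted sum of the proportion differences. A sketch that builds on the woe_table helper from above:

def information_value(df, feature, target, bins=5):
    t = woe_table(df, feature, target, bins=bins)
    pct_non_event = t['non_events'] / t['non_events'].sum()
    pct_event = t['events'] / t['events'].sum()
    return ((pct_non_event - pct_event) * t['woe']).sum()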


So how did I apply this new thing that I learned?

I found a nice car insurance fraud dataset on Kaggle.

Turns out I can compute IV and WoE for both numerical and categorical features. All but two features in this dataset are categorical.

Here are the features I used:

numerical_features = ['DriverRating']

categorical_features = [
    'Month', 'WeekOfMonth', 'DayOfWeek',
    'Make', 'AccidentArea', 'DayOfWeekClaimed',
    'MonthClaimed', 'WeekOfMonthClaimed', 'Sex',
    'MaritalStatus', 'Fault', 'PolicyType',
    'VehicleCategory', 'VehiclePrice', 'Days_Policy_Accident',
    'Days_Policy_Claim', 'PastNumberOfClaims', 'AgeOfVehicle',
    'AgeOfPolicyHolder', 'PoliceReportFiled', 'WitnessPresent',
    'AgentType', 'NumberOfSuppliments', 'AddressChange_Claim',
    'NumberOfCars', 'Year', 'BasePolicy', 'Deductible'
]

target = 'FraudFound_P'


First, I used IV to filter out features with IV below 0.02 or above 0.6 (0.6 rather than 0.5, because one feature had an IV of 0.55).
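In code, the filter might look like this (a sketch, assuming the raw data is loaded as df and using the helpers above):

ivs = {f: information_value(df, f, target)
       for f in numerical_features + categorical_features}
selected = [f for f, iv in ivs.items() if 0.02 <= iv <= 0.6]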

This filter cut down the list to only:

['NumberOfSuppliments',

 'AgeOfVehicle',

 'AgeOfPolicyHolder',

 'Month',

 'Deductible',

 'MonthClaimed',

 'Make',

 'AddressChange_Claim',

 'PastNumberOfClaims',

 'VehiclePrice',

 'VehicleCategory',

 'Fault']


Then I plotted the WoE for each of those.

An example, from AgeOfVehicle:

Here I decided to combine 3 years and 4 years into one group, and 5, 6, 7, and more than 7 years into another (see the sketch below).

I repeated the same process by looking at the WoE line graph for each of the 'meaningful' features. For some I also considered their distribution. For example, for car Make (manufacturer), these are the counts of each value:

Here I also decided to combine the under-represented makes, but only if their WoE was similar. There may be other things to consider, but for now the above is what I did.
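The merging itself can be a plain mapping applied before encoding. A sketch for AgeOfVehicle (the exact category spellings in the dataset may differ):

age_map = {
    '3 years': '3-4 years', '4 years': '3-4 years',
    '5 years': '5+ years', '6 years': '5+ years',
    '7 years': '5+ years', 'more than 7': '5+ years',
}
df['AgeOfVehicle'] = df['AgeOfVehicle'].replace(age_map)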

Next, I had a dataset with 12 categorical features. I got dummies for all of them, and the dataset was ready for a model.
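A minimal sketch of that encoding step, plus a stratified split to preserve the class imbalance in the test set (the split settings are my assumption):

from sklearn.model_selection import train_test_split

X = pd.get_dummies(df[selected], drop_first=True)
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)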

Just like in the notebook I referenced at the start (which uses a different dataset, related to credit risk), I started with a simple logistic regression.

The resulting confusion matrix:

Recall was 0.83 for fraud cases. I think recall is the most important number here, because a recall of 1 would mean we detect all insurance fraud cases.
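For reference, a baseline along these lines (class_weight='balanced' is my assumption for handling the imbalance; the exact settings behind the numbers above may differ):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

logreg = LogisticRegression(class_weight='balanced', max_iter=1000)
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))  # recall on the fraud class is the key number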

To search for a better model, I looked into XGBoost.

The best model I got (recall 0.41 for fraud cases):
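A sketch of the kind of setup this involved (hyperparameters are placeholders; scale_pos_weight set to the negative/positive ratio is the usual imbalance knob):

from xgboost import XGBClassifier

ratio = (y_train == 0).sum() / (y_train == 1).sum()
xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                    scale_pos_weight=ratio, eval_metric='logloss')
xgb.fit(X_train, y_train)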

Next, I tried CatBoost (recall for fraud cases is 0.21):
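A comparable CatBoost sketch (auto_class_weights='Balanced' is CatBoost's built-in reweighting; again, placeholder hyperparameters):

from catboost import CatBoostClassifier

cat = CatBoostClassifier(iterations=300, depth=6,
                         auto_class_weights='Balanced', verbose=0)
cat.fit(X_train, y_train)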

Next, an ensemble of RandomForestClassifier and XGBoost (recall for fraud cases is 0.39):
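One way to build such an ensemble is sklearn's VotingClassifier; a sketch reusing the ratio from the XGBoost snippet (soft voting averages the predicted probabilities):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier

ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=300, class_weight='balanced')),
        ('xgb', XGBClassifier(scale_pos_weight=ratio, eval_metric='logloss')),
    ],
    voting='soft')
ensemble.fit(X_train, y_train)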

Next, I decided to look at the notebooks on Kaggle that use this dataset. It turns out that BalancedRandomForestClassifier does the best job (according to the top notebook). I applied it using my features, and recall for fraud cases is 0.90. However, as seen from the confusion matrix, the other indicators are hurt. So the task for tomorrow is to find a way to get the False Positives (top right) lower, while keeping the False Negatives low.
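BalancedRandomForestClassifier comes from the imbalanced-learn package and undersamples the majority class in each bootstrap, which explains both the recall jump and the extra false positives. A minimal sketch, reusing the earlier imports:

from imblearn.ensemble import BalancedRandomForestClassifier

brf = BalancedRandomForestClassifier(n_estimators=300, random_state=42)
brf.fit(X_train, y_train)
print(confusion_matrix(y_test, brf.predict(X_test)))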



That is all for today!

See you tomorrow :)
