The business problem is to predict which auto insurance customers are most likely to get in an automobile accident. The data set used to predict who will likely get in an accident contains 8,161 records of data with 15 variables representing various customer statistics. To predict the outcome, let’s review using a Decision Tree Classification Model.
Classification data mining techniques are often used to predict, explain, describe and classify an outcome or target variable. The goal of this technique is to accurately predict the target variable for each classification (grouping) of data. The simplest type of classification is binary classification model with two predictor values such as yes or no, however, multi-class targets can also be trained.
The analysis of determining which model to apply requires understanding the business work flows, identification of the data and how to obtain it, who the target audience is, and validation of the outcome. The process begins with exploring historical data to determine the predictor variables such as demographic information about a customer that is needed to identify the target variable; the goal of the model. After data validation and predictor variables are outlined analysis begins by choosing the model that best fits the problem being solved.
Decision trees are a classification data mining techniques used due to the visualization of the tree structure making it easy to understand. The model uses a recursive algorithm that analyzes and classifies the data attributes by each branch also known as a node. Each branch has leaves with decision points that point downward to the next branch that will ultimately lead to bottom leaves resulting in what is predicted based on the rules set within the tree structure. With this technique, the model can categorize numerical and alpha character attributes as inputs for forming the tree structure for multiple-variable analysis.
Data Quality Check
To prepare the dataset for modeling, exploratory data analysis was conducted to determine the state of the data and to identify attributes for automated variable selection to determine the best predictor variable for predicting accidents.
Below is a listing of all the variables available in the data set. The first target variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person had an auto claim. A zero means that the customer did not have an accident claim.
As you can see from the Means Procedure table below, there are many missing values within the data set that will need to be addressed. An evaluation of whether or not these values can be used for prediction will be performed. While decision trees can handle missing data, it is sometimes valuable to impute the mean.
In order to address the missing values for AGE, the median age of 45 was imputed. For missing values related to the JOB variable, a conditional statement was performed based on the median income range associated with the job categories of the individuals without missing values in the data set or age based for students. Once the job variable was imputed, the missing values for INCOME were defined. The median income per job was used to replace missing values. The median was also utilized for the HOME VALUE variable. The CAR AGE was a conditional statement and median were used due to a negative value found in the data set to replace missing values. Lastly, a flag was also created for missing variables since data missing can sometimes be predictable.
Two new variables were created to determine if they are predictable for determining who might have an auto accident. A variable called COST PER CLAIM was calculated which is the OLD CLAIM divided by the CLAIM FREQ. The OLD CLAIM variable is the dollar value associated with the claims and the CLAIM FREQUENCY is the number of times the individual had a claim.
The second variable created is CLAIM RATIO YIS; claims ratio for years in service. This is calculated by TIF divided by the CLAIM FREQUENCY to determine the percentage of claims based on the number of years the individual has been with the company.
There are several binary variables that have been converted to dummy variables to determine if they are predictable with the target variable. The following were converted to binary variables:
|Parent1||Single Parent Yes/No||Single Parent = 1 Married = 0|
|Sex||Male/Female||Male = 1 Female = 0|
|Revoked||Yes/No||License Revoked = 1 No = 0|
|MStatus||Yes/No||Married = 1 No = 0|
|Urbanicity||Highly Urban/ Urban or Highly Rural/ Rural||City = 1 Urban = 0|
|Car Use||Commercial/Private||Commercial = 1 Private = 0|
The discrete variable which counts the number of KIDS DRIVING and KIDS AT HOME were converted to dummy variables to determine if having any kids driving or at home would be predictable regardless of how many kids lived in the home.
|KidsDriv >0||Kids driving||Kids Driving = 1 No = 0|
|KidsHome>0||Kids at home||Kids at Home = 1 No = 0|
Two dummy variables were created to determine if a missing home value meant the individual is a renter and to identify new customers based on the number of years with the company.
|IMP_HOME_VALUE = 0||If no home value, then maybe a renter||Renter = 1 No = 0|
|TIF <= 1||Customer Years with Company||New Customer = 1 No = 0|
The only variable that had outliers to address was the blue book value of a car. The BLUEBOOK VALUE variable was capped at the 95% percentile ($35,000) and the 5th percentile ($5,000).
Lastly, a log transformation were performed on the IMP_INCOME, CAP_BLUEBOOK, OLDCLAIM, CLM_FREQ, MVR_PTS and CLAIM_RATIO_YIS.
The data exploration begins with reviewing mean, medians and maximums to the target flags for the imputed variables, new variables and dummy variables. Categorical variables CAR_TYPE, EDUCATION and JOB were also reviewed to determine their value to the model.
Decision Tree Model
The following variables were selected and applied to the HPSPLIT method using SAS Version 9.4. Predictor variables were chosen during the exploratory data analysis due to their possible importance to the model as described in the table above (see code at end).
As the tree demonstrates, the first split is whether or not the driver lives in a City. The node split at values greater than or less than .010 since 1 means they live in a city and anything less means they don’t live in city. The node estimated that individuals living in a city are more likely to get an accident compared to those not living in a city (93% will not crash). Of the 68% who live in a city, approximately 31% will have an accident. The 6,492 observations with drivers in the city split into the types of jobs the drivers held. Here we can see if the drivers were in the professional industry the model predicted that 80% would not get into an accident compared to 44% that would for clerical, homemaker, or student. The leaves continue to split further into marriage and number of kids driving. In total there were 47 leaves to this decision tree.
The most commonly used metrics; sensitivity and false positive rate are incorporated in the Receiver Operating Characteristic Graph (ROC). This graph is a great tool to compare the predicted and actual target values in a visual way to determine how likely the model will correctly predict the positive and negative classes in a binary classification model. The true positive rate exists on the y-axis while the false positive rate (1-specificity) is on the x-axis of the graph as shown in the image below.
The dots on the plotted line represent the discrimination threshold. As the threshold is changed, the sensitivity rate (TPR) either increases or decreases since the number of true positive and false positive records also change. A bowed curve pointing towards the upper left hand corner would be an indication of the model predicting a higher proportion of cases accurately. In this model, the ROC is .8000. At the .34 cutoff level, the model would predict approximately 60% correctly with a .20 false positive rate.
The gini index is used to manage the population diversity within the node and between the subsets of nodes. For example, if the group was male and female, with males at 52% and female at 48%, the results would be considered a pure split leading to the next leaf on the branch to be trained.
Entropy measures homogeneity, how similar the values are for a sample. The target variable will have an entropy rate as well as each attribute or classification column in the sample. The measurement is used to help determine the information gain.
Information Gain is important in determining which attributes to split. It returns the highest rate or in other words the most homogeneous branches of the decision tree based on the difference between the entropy of the target variable and combined entropy of the each attribute. The attribute with the largest information gain becomes the decision node. A branch with an entropy of zero results in no further splitting. If the entropy value is greater than zero, then a new leaf is created for further analysis.
The chart below is an example of a confusion matrix with A representing the true positives, B false negatives, C false positives and D true negatives.
|Positives||True Positives (A)||False Negatives (B)||A/(A+B) TPR
|Negatives||False Positives (C)||True Negatives (D)||C/(C+D) FPR
Based on the confusion matrix, it is important to review the sensitivity and specificity. Sensitivity also known as the True Positive Rate (TPR) considers the number of true positives (A) correctly predicted by the model as percentage of the total true positives (A) and false negatives (B). The measurement helps determine how accurate the model predicted the proportion of actual positive records. A sensitivity rate of 100 is interpreted as the model predicted all true positive records correctly. This is normally not the case. In general, it is best to review the sensitivity rate in relation to the false positive rate on a Receiver Operating Characteristic Graph (ROC).
In this model, the true positive rate had an accuracy of 94% that the driver will not get in an auto accident. The false positive rate that they would crash is 55%.
Overall, decision tree classification methods are a wonderful way to model an outcome. The consumer can gain a lot of insight about their customer population and use this information to develop rates for car insurance drivers.
data TEMPFILE; set &INFILE;
R_EDUCATION = EDUCATION;
if R_EDUCATION = “z_High School” THEN R_EDUCATION = “High School”;
else if R_EDUCATION = “<High School” THEN R_EDUCATION = “High School”;
/*Fix missing values*/
IMP_AGE = AGE;
if missing (IMP_AGE) then IMP_AGE = 45;
IMP_JOB = JOB ; /*decision tree logic based on median*/
M_JOB = 0;
if missing (IMP_JOB) then do;
if IMP_INCOME >= 128680 then do IMP_JOB = “Doctor”;end;
if IMP_INCOME >= 88001 and IMP_INCOME <= 128680 THEN do IMP_JOB = “Lawyer”;end;
if IMP_INCOME >= 85001 and IMP_INCOME <= 88000 THEN do IMP_JOB = “Manager”;end;
if IMP_INCOME >= 75001 and IMP_INCOME <= 85000 then do IMP_JOB = “Professional”;end;
if IMP_INCOME >= 58001 and IMP_INCOME <= 75000 then do IMP_JOB = “z_Blue Collar”;end;
if IMP_INCOME >= 33001 and IMP_INCOME <= 58000 then do IMP_JOB = “Clerical”;end;
if IMP_INCOME >= 12074 and IMP_INCOME <= 33000 then do IMP_JOB = “Home Maker”;end;
IF IMP_AGE <= 20 THEN do IMP_JOB = “Student”; end;
if IMP_INCOME = . then do IMP_JOB = “Professional”; end;
M_JOB = 1; end;
IMP_INCOME = INCOME; /* Copy INCOME into IMP_INCOME*/
M_INCOME = 0;
if missing (IMP_INCOME) then do ;
if IMP_JOB = “Doctor” then do IMP_INCOME = 128680;end;
if IMP_JOB = “Lawyer” then do IMP_INCOME = 88304;end;
if IMP_JOB = “Manager” THEN do IMP_INCOME = 87461;end;
if IMP_JOB = “Professional” then do IMP_INCOME = 76593;end;
if IMP_JOB = “z_Blue Collar” then do IMP_INCOME = 58957; end;
if IMP_JOB = “Clerical” then do IMP_INCOME = 33861;end;
if IMP_JOB = “Home Maker” then do IMP_INCOME = 12073;end;
if IMP_JOB = “Student” then do IMP_INCOME = 6309;end;
M_INCOME = 1;
end; /*FLAG to 1 means fixed the value of IMP_INCOME and 54000 was a guess.*/
log_IMP_INCOME = sign(IMP_INCOME)*log(abs(IMP_INCOME)+1);
IMP_HOME_VALUE = HOME_VAL;
M_HOME_VAL = 0;
if missing(HOME_VAL) then do;
IMP_HOME_VALUE = 145000;
M_HOME_VAL = 1;
IMP_YOJ = YOJ;
M_YOJ = 0;
if missing (IMP_YOJ) then do;
if IMP_JOB = “Student” then IMP_YOJ = 5;
else if IMP_JOB = “Home Maker” then IMP_YOJ = 4;
else IMP_YOJ =11;
M_YOJ = 1; end;
IMP_CAR_AGE = CAR_AGE;
if IMP_CAR_AGE = –3 then IMP_CAR_AGE = 3;
if missing (IMP_CAR_AGE) then IMP_CAR_AGE = 8;
/*create dummy values*/
if PARENT1 = ‘Yes’ then parent1 = ‘1’;
else parent1 = ‘0’;
if sex = ‘M’ then sex = ‘1’;
else sex = ‘0’;
if revoked = ‘Yes’ then revoked = ‘1’;
else revoked = ‘0’;
if mstatus = ‘Yes’ then mstatus = ‘1’;
else mstatus = ‘0’;
if urbanicity = ‘Highly Urban/ Urban’ then urbanicity = ‘1’;
else urbanicity = ‘0’;
if car_use = ‘Commercial’ then car_use = ‘1’;
else car_use = ‘0’;
if KIDSDRIV > 0 then KIDSDRIV = ‘1’;
else KIDSDRIV = ‘0’;
KIDS_DRIVING = KIDSDRIV*1;
if HOMEKIDS > 0 then HOMEKIDS = ‘1’;
else HOMEKIDS = ‘0’;
KIDS_AT_HOME = HOMEKIDS*1;
if imp_job = “Clerical” THEN IMP_JOB_CLERICAL = 1;
ELSE IMP_JOB_CLERICAL = 0;
IF imp_job = “z_Blue Collar” THEN IMP_JOB_BLUECOLLAR = 1;
ELSE IMP_JOB_BLUECOLLAR = 0;
IF imp_job = “Home Maker” THEN IMP_HOME_MAKER = 1;
ELSE IMP_HOME_MAKER = 0;
IF imp_job = “Student” THEN IMP_JOB_STUDENT = 1;
ELSE IMP_JOB_STUDENT = 0;
if R_EDUCATION = “High School” then HIGH_SCHOOL_OR_LESS = 1;
else HIGH_SCHOOL_OR_LESS = 0;
if CAR_TYPE = “z_SUV” THEN CAR_TYPE_SUV = 1;
ELSE CAR_TYPE_SUV= 0;
IF CAR_TYPE = “Sports Car” THEN CAR_TYPE_SPORTY = 1;
ELSE CAR_TYPE_SPORTY = 0;
IF CAR_TYPE = “Pickup” THEN CAR_TYPE_PICKUP = 1;
ELSE CAR_TYPE_PICKUP = 0;
if CAR_TYPE = “Sports Car” and imp_job = “Student” then STUDENT_SPORT_CAR = 1;
else STUDENT_SPORT_CAR = 0;
if IMP_HOME_VALUE = “0” then RENTER = 1;
ELSE RENTER = 0;
if TIF <= ‘1’ then NEW_CUSTOMER = 1;
ELSE NEW_CUSTOMER = 0;
/*CAP – DEAL WITH OUTLIERS*/
CAP_BLUEBOOK = BLUEBOOK;
if CAP_BLUEBOOK >= 35000 THEN CAP_BLUEBOOK = 35000 /*95 percentile*/;
else if CAP_BLUEBOOK <= 5000 THEN CAP_BLUEBOOK = 5000 /*5 percentile*/;
log_CAP_BLUEBOOK = sign(CAP_BLUEBOOK)*log(abs(CAP_BLUEBOOK)+1);
COST_PER_CLAIM = OLDCLAIM/CLM_FREQ ;
if missing (Cost_per_Claim) then Cost_per_Claim = 0;
CLAIM_RATIO_YIS = TIF/CLM_FREQ;
if missing (CLAIM_RATIO_YIS) then CLAIM_RATIO_YIS = 0;
log_CLAIM_RATIO_YIS = sign(CLAIM_RATIO_YIS)*log(abs(CLAIM_RATIO_YIS)+1);
DROP AGE; /*replaced*/
DROP KIDSDRIV; /*replaced*/
DROP HOMEKIDS; /*replaced*/
drop BLUEBOOK; /*replaced*/
DROP TRAVTIME; /*not predictable*/
DROP CAR_AGE; /*replaced*/
drop HOME_VAL; /*replaced*/
drop INCOME; /*replaced*/
drop JOB; /*replaced*/
drop YOJ; /*replaced*/
drop PARENT1; /*replaced*/
drop sex; /*replaced*/
drop car_use; /*replaced*/
drop mstatus; /*replaced*/
drop URBANICITY; /*replaced*/
drop revoked; /*replaced*/
proc hpsplit data=&SCRUBFILE. seed=15566;
class Target_flag CAR_TYPE R_EDUCATION IMP_JOB ;
model Target_flag = OLDCLAIM