LASSO Regression to Predict Auto Claims

Business Problem

Using the same data set as the Decision Tree Classification blog from Week 1, the business problem is to predict which auto insurance customers are most likely to get into an automobile accident. The data set contains 8,161 records with 15 variables representing various customer statistics. To predict the outcome, let’s apply the LASSO method.

Model

LASSO stands for Least Absolute Shrinkage and Selection Operator. This regression model imposes a constraint on the sum of the absolute values of the model coefficients: that sum must stay below an upper bound. The constraint shrinks the coefficients toward zero, and some are set to exactly zero; hence the shrinkage and selection in the name. The model is sometimes preferred over a multiple linear regression model due to its ease of interpretation and its identification of the variables most associated with the target variable. The goal of the selection process is to find the subset of predictor variables that minimizes prediction error.
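As a sketch, the constrained problem described above can be written as follows. The symbols are assumptions introduced here for illustration: n observations, p predictors, response y_i, predictor values x_ij, coefficients beta_j, and an upper bound t on the sum of absolute coefficients.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

A smaller t forces more shrinkage, and a large enough t leaves the ordinary least squares fit unchanged.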

Variables Used

The target variable is 0 for individuals who did not get in an accident and 1 for those who did. The predictor variables are listed in the code below and were passed to the GLMSELECT procedure in SAS Version 9.4. These predictors were chosen during the exploratory data analysis for their possible importance to the model, which is outlined in detail in the decision tree blog (Decision Tree Link: https://analyticswithlohr.com/2016/12/04/decision-tree-model-for-auto-crashes/?iframe=true&theme_preview=true).

The first step was to divide the 8,161 observations into a training set and a test set. That is what the SAS code below does.

* Split data randomly into test and training data;

proc surveyselect data=&SCRUBFILE. out=traintest seed = 123

samprate=0.7 method=srs outall;

run;
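A quick way to check the split is to tabulate the Selected variable, which PROC SURVEYSELECT adds to the output data set when outall is specified (1 for sampled rows, 0 for the rest). This is only a sketch of a sanity check and was not part of the original analysis; it assumes the traintest data set created above.

```sas
* Sanity check (illustrative only): row counts by Selected, and the accident rate within each partition;
proc freq data=traintest;
  tables Selected Selected*Target_flag;
run;
```

With samprate=0.7, roughly 70% of the rows should show Selected = 1, and the share of Target_flag = 1 should look similar in both groups.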

Next we run the lasso multiple regression model applying the algorithm to the traintest set.

* lasso multiple regression with lars algorithm k=10 fold validation;

proc glmselect data=traintest plots=all seed=123;

partition ROLE=selected(train='1' test='0');

model Target_flag = OLDCLAIM  CITY MVR_PTS KIDS_DRIVING IMP_HOME_VALUE

LICENSE_REVOKED MARRIED IMP_AGE COMMERCIAL log_CLAIM_RATIO_YIS

TIF log_CAP_BLUEBOOK  IMP_YOJ IMP_INCOME M_HOME_VAL IMP_CAR_AGE/selection=lar(choose=cv stop=none) cvmethod=random(10);

run;
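One detail worth noting: in PROC GLMSELECT, selection=lar requests least angle regression, while selection=lasso requests the LASSO itself. The two methods are closely related, but they are separate options. The results discussed in this post come from the selection=lar run above. A minimal sketch of the same model with the LASSO option, which was not run for this post and may select a different set of variables, would be:

```sas
* Sketch only: same partition and predictors, requesting LASSO instead of LAR;
proc glmselect data=traintest plots=all seed=123;
  partition role=selected(train='1' test='0');
  model Target_flag = OLDCLAIM CITY MVR_PTS KIDS_DRIVING IMP_HOME_VALUE
        LICENSE_REVOKED MARRIED IMP_AGE COMMERCIAL log_CLAIM_RATIO_YIS
        TIF log_CAP_BLUEBOOK IMP_YOJ IMP_INCOME M_HOME_VAL IMP_CAR_AGE
        / selection=lasso(choose=cv stop=none) cvmethod=random(10);
run;
```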

Next let’s review all the output graphs from this model.

The LAR Selection Summary details the ASE (Average Square Error) for the training and test sets. The ASE decreases as each variable is added to the model. We can also see the variables in descending order of importance. Motor Vehicle Points has the highest importance to the model, followed by City and Imputed Home Value. The star at step 14 indicates that adding any more variables to the model will not enhance its accuracy. This is the point where the bias-variance tradeoff is balanced.

[Figure: LAR selection summary (lasso_lar_selection_1)]
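For reference, the average square error reported in the summary is the mean of the squared prediction errors over the rows of a given partition. The notation is an assumption for illustration: n rows in the partition, observed value y_i, and predicted value y-hat_i.

```latex
\mathrm{ASE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2
```

Because the target here is coded 0 or 1, each squared error lies between 0 and 1 whenever the prediction itself falls in that range.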

The coefficient progression plot shows the change in coefficients at each step, with the vertical line marking the selected model. It shows how each variable, as it is added, affects the coefficients already in the model. City and Motor Vehicle Points had the largest positive coefficients, followed by a history of an old claim, which had a negative coefficient.

[Figure: coefficient progression plot (lasso_coefficient_4)]

The graph below shows how the model performed on the training and test sets. The prediction accuracy seems relatively good, and the average square error is stable across both data sets.

[Figure: training versus test performance (lasso_target_test_3)]

Lastly, the Analysis of Variance output shows the model is statistically significant. Unfortunately, the adjusted R-square value is relatively low at 0.19, so the model could use some improvement.

[Figure: Analysis of Variance output (lass_anova_2)]
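As a reminder of what the adjusted R-square measures, the usual definition for a model with an intercept is sketched below. The symbols are assumptions for illustration: n is the number of observations used to fit the model and p is the number of predictors in the selected model.

```latex
R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
```

The adjustment penalizes each added predictor, so a value of 0.19 says the selected variables explain only a modest share of the variation in the 0/1 target.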

Limitations of LASSO models: all variables are chosen based on statistical analysis, without human intervention. When the candidate variables are highly correlated with each other, the choice among them can be somewhat arbitrary, and the analyst has little room to pick variables by judgment. It is also difficult to estimate p-values for the selected coefficients.

This method seems like a good way to identify important variables to try in other predictive models.

CODE

%let PATH = C:\Users\mailb_000\Documents\Sas_data\nwu2016;
%let NAME = nwu;
%let LIB = &NAME..;

libname &NAME. "&PATH.";

%let SCORE_ME = &LIB.LOGIT_INSURANCE_TEST;
%let INFILE = &LIB.LOGIT_INSURANCE;
%let SCRUBFILE = SCRUBFILE;
proc means data=&INFILE. mean median nmiss n p1 p99;
run;

data TEMPFILE;
set &INFILE;

R_EDUCATION = EDUCATION;
if R_EDUCATION = "z_High School" THEN R_EDUCATION = "High School";
else if R_EDUCATION = "<High School" THEN R_EDUCATION = "High School";
drop index;
drop education;
drop red_car;
drop target_amt;
run;

proc means data = TEMPFILE nmiss mean median;
var _numeric_;
run;

proc univariate data = TEMPFILE;
class Target_flag;
var _numeric_;
histogram;
run;

proc freq data = TEMPFILE;
table (_character_) * Target_flag/missing;
run;
data &SCRUBFILE.;
SET TEMPFILE;

/*LOG TRANSFORMATIONS*/

log_OLDCLAIM=sign(OLDCLAIM)*log(abs(OLDCLAIM)+1);
log_CLM_FREQ=sign(CLM_FREQ)*log(abs(CLM_FREQ)+1);
log_MVR_PTS=sign(MVR_PTS)*log(abs(MVR_PTS)+1);

/*Fix missing values*/
IMP_AGE = AGE;
if missing (IMP_AGE) then IMP_AGE = 45;
IMP_INCOME = INCOME; /* Copy INCOME into IMP_INCOME first, because the IMP_JOB logic below reads it */
IMP_JOB = JOB ; /*decision tree logic based on median*/
M_JOB = 0;
if missing (IMP_JOB) then do;
if IMP_INCOME >= 128680 then IMP_JOB = "Doctor";
if IMP_INCOME >= 88001 and IMP_INCOME <= 128680 then IMP_JOB = "Lawyer";
if IMP_INCOME >= 85001 and IMP_INCOME <= 88000 then IMP_JOB = "Manager";
if IMP_INCOME >= 75001 and IMP_INCOME <= 85000 then IMP_JOB = "Professional";
if IMP_INCOME >= 58001 and IMP_INCOME <= 75000 then IMP_JOB = "z_Blue Collar";
if IMP_INCOME >= 33001 and IMP_INCOME <= 58000 then IMP_JOB = "Clerical";
if IMP_INCOME >= 12074 and IMP_INCOME <= 33000 then IMP_JOB = "Home Maker";
if IMP_AGE <= 20 then IMP_JOB = "Student";
if IMP_INCOME = . then IMP_JOB = "Professional";
M_JOB = 1; /* flag: IMP_JOB was imputed */
end;
M_INCOME = 0;
if missing (IMP_INCOME) then do;
if IMP_JOB = "Doctor" then IMP_INCOME = 128680;
if IMP_JOB = "Lawyer" then IMP_INCOME = 88304;
if IMP_JOB = "Manager" then IMP_INCOME = 87461;
if IMP_JOB = "Professional" then IMP_INCOME = 76593;
if IMP_JOB = "z_Blue Collar" then IMP_INCOME = 58957;
if IMP_JOB = "Clerical" then IMP_INCOME = 33861;
if IMP_JOB = "Home Maker" then IMP_INCOME = 12073;
if IMP_JOB = "Student" then IMP_INCOME = 6309;
M_INCOME = 1; /* flag: IMP_INCOME was imputed */
end;

log_IMP_INCOME = sign(IMP_INCOME)*log(abs(IMP_INCOME)+1);
IMP_HOME_VALUE = HOME_VAL;
M_HOME_VAL = 0;
if missing(HOME_VAL) then do;
IMP_HOME_VALUE = 145000;
M_HOME_VAL = 1;
end;
IMP_YOJ = YOJ;
M_YOJ = 0;
if missing (IMP_YOJ) then do;
if IMP_JOB = "Student" then IMP_YOJ = 5;
else if IMP_JOB = "Home Maker" then IMP_YOJ = 4;
else IMP_YOJ =11;
M_YOJ = 1; end;

IMP_CAR_AGE = CAR_AGE;
if IMP_CAR_AGE = -3 then IMP_CAR_AGE = 3;
if missing (IMP_CAR_AGE) then IMP_CAR_AGE = 8;

/*create dummy values: a SAS comparison evaluates to 1 when true and 0 when false*/
SINGLE_PARENT = (PARENT1 = 'Yes');
MALE = (sex = 'M');
LICENSE_REVOKED = (revoked = 'Yes');
MARRIED = (mstatus = 'Yes');
CITY = (urbanicity = 'Highly Urban/ Urban');
COMMERCIAL = (car_use = 'Commercial');
KIDS_DRIVING = (KIDSDRIV > 0);
KIDS_AT_HOME = (HOMEKIDS > 0);
if imp_job = "Clerical" THEN IMP_JOB_CLERICAL = 1;
ELSE IMP_JOB_CLERICAL = 0;
IF imp_job = "z_Blue Collar" THEN IMP_JOB_BLUECOLLAR = 1;
ELSE IMP_JOB_BLUECOLLAR = 0;
IF imp_job = "Home Maker" THEN IMP_HOME_MAKER = 1;
ELSE IMP_HOME_MAKER = 0;
IF imp_job = "Student" THEN IMP_JOB_STUDENT = 1;
ELSE IMP_JOB_STUDENT = 0;

if R_EDUCATION = "High School" then HIGH_SCHOOL_OR_LESS = 1;
else HIGH_SCHOOL_OR_LESS = 0;

if CAR_TYPE = "z_SUV" THEN CAR_TYPE_SUV = 1;
ELSE CAR_TYPE_SUV= 0;
IF CAR_TYPE = "Sports Car" THEN CAR_TYPE_SPORTY = 1;
ELSE CAR_TYPE_SPORTY = 0;
IF CAR_TYPE = "Pickup" THEN CAR_TYPE_PICKUP = 1;
ELSE CAR_TYPE_PICKUP = 0;
if CAR_TYPE = "Sports Car" and imp_job = "Student" then STUDENT_SPORT_CAR = 1;
else STUDENT_SPORT_CAR = 0;

/* IMP_HOME_VALUE and TIF are numeric, so compare against numeric literals */
if IMP_HOME_VALUE = 0 then RENTER = 1;
ELSE RENTER = 0;

if TIF <= 1 then NEW_CUSTOMER = 1;
ELSE NEW_CUSTOMER = 0;

/*CAP - DEAL WITH OUTLIERS*/

CAP_BLUEBOOK = BLUEBOOK;
if CAP_BLUEBOOK >= 35000 THEN CAP_BLUEBOOK = 35000 /*95 percentile*/;
else if CAP_BLUEBOOK <= 5000 THEN CAP_BLUEBOOK = 5000 /*5 percentile*/;

log_CAP_BLUEBOOK = sign(CAP_BLUEBOOK)*log(abs(CAP_BLUEBOOK)+1);
/*New Variables*/
COST_PER_CLAIM = OLDCLAIM/CLM_FREQ ;
if missing (Cost_per_Claim) then Cost_per_Claim = 0;

CLAIM_RATIO_YIS = TIF/CLM_FREQ;
if missing (CLAIM_RATIO_YIS) then CLAIM_RATIO_YIS = 0;

log_CLAIM_RATIO_YIS = sign(CLAIM_RATIO_YIS)*log(abs(CLAIM_RATIO_YIS)+1);

DROP AGE; /*replaced*/
DROP KIDSDRIV; /*replaced*/
DROP HOMEKIDS; /*replaced*/
drop BLUEBOOK; /*replaced*/
DROP TRAVTIME; /*not predictable*/
DROP CAR_AGE; /*replaced*/
drop HOME_VAL; /*replaced*/
drop INCOME; /*replaced*/
drop JOB; /*replaced*/
drop YOJ; /*replaced*/
drop PARENT1; /*replaced*/
drop sex; /*replaced*/
drop car_use; /*replaced*/
drop mstatus; /*replaced*/
drop URBANICITY; /*replaced*/
drop revoked; /*replaced*/
run;
* Split data randomly into test and training data;
proc surveyselect data=&SCRUBFILE. out=traintest seed = 123
samprate=0.7 method=srs outall;
run;

* lasso multiple regression with lars algorithm k=10 fold validation;
proc glmselect data=traintest plots=all seed=123;
partition ROLE=selected(train='1' test='0');
model Target_flag = OLDCLAIM CITY MVR_PTS KIDS_DRIVING IMP_HOME_VALUE
LICENSE_REVOKED MARRIED IMP_AGE COMMERCIAL log_CLAIM_RATIO_YIS
TIF log_CAP_BLUEBOOK IMP_YOJ IMP_INCOME M_HOME_VAL IMP_CAR_AGE/selection=lar(choose=cv stop=none) cvmethod=random(10);
run;
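The log transformations in the scrub step all use the form sign(x)*log(abs(x)+1). This keeps zero at zero, preserves the sign of the input, and compresses large magnitudes. The short sketch below prints the transform for a few made-up values to the SAS log; the values and the variable names x and log_x are illustrative only and are not part of the insurance data.

```sas
* Illustration only: signed log transform applied to a few made-up values;
data _null_;
  do x = -1000, -1, 0, 1, 1000;
    log_x = sign(x) * log(abs(x) + 1);
    put x= log_x=;
  end;
run;
```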
