Using the same data set as the Decision Tree Classification blog from Week 1, the business problem is to predict which auto insurance customers are most likely to get in an automobile accident. The data set contains 8,161 records with 15 variables representing various customer statistics. To predict the outcome, this post applies a Random Forest classification method.
The Random Forest method creates many trees with many leaves, which differs from the single decision tree method. The SAS 9.4 HPFOREST procedure builds each tree by recursively splitting the data into two segments. The process is repeated in each segment, and again in each new segment, and so on until some constraint is met. In the terminology of the tree metaphor, the segments are nodes, the original data set is the root node, and the final un-partitioned segments are leaves or terminal nodes. The data in a leaf determine the estimates of the value of the target variable. These estimates are subsequently applied to predict the target of a new observation assigned to the leaf (SAS Institute Inc.).
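The recursive splitting described above can be sketched in a few lines of Python. This is an illustrative toy, not the HPFOREST implementation: it splits on a single numeric feature by minimizing Gini impurity, and stops when a depth or size constraint is met or a segment is pure, at which point the leaf predicts the majority class (0 or 1, as in this data set).

```python
# Minimal sketch of recursive binary partitioning, the core step a forest
# repeats for each tree. Illustrative only; not the SAS HPFOREST procedure.
# Data: (feature_value, class_label) pairs, labels 0/1 as in the blog.

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows):
    """Threshold on the single feature that minimizes weighted Gini."""
    best_t, best_score = None, float("inf")
    for x, _ in rows:
        left = [y for v, y in rows if v <= x]
        right = [y for v, y in rows if v > x]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best_t, best_score = x, score
    return best_t

def grow(rows, depth=0, max_depth=2, min_size=2):
    """Split recursively until a constraint is met; leaves predict the majority class."""
    labels = [y for _, y in rows]
    if depth >= max_depth or len(rows) <= min_size or gini(labels) == 0.0:
        return round(sum(labels) / len(labels))  # leaf (terminal node)
    t = best_split(rows)
    left = [r for r in rows if r[0] <= t]
    right = [r for r in rows if r[0] > t]
    if not left or not right:
        return round(sum(labels) / len(labels))
    return {"threshold": t,
            "left": grow(left, depth + 1, max_depth, min_size),
            "right": grow(right, depth + 1, max_depth, min_size)}

def predict(tree, x):
    """Route a new observation down the tree to its leaf's estimate."""
    while isinstance(tree, dict):
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree
```

A forest repeats this tree-growing step many times on bootstrap samples of the data, with each split considering only a random subset of variables.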
The target variable is 0 for individuals who did not get in an accident and 1 for those who did. The variables selected appear in the code below and were applied to the HPFOREST procedure using SAS Version 9.4. These predictor variables were chosen during the exploratory data analysis because of their possible importance to the model, as outlined in detail in the decision tree blog.
proc hpforest data=&SCRUBFILE. seed=15546;
   /* target statement is required; the binary flag's name is assumed here */
   target TARGET_FLAG / level=binary;
   input CAR_TYPE R_EDUCATION IMP_JOB / level=nominal;
   input OLDCLAIM CITY MVR_PTS KIDS_DRIVING IMP_HOME_VALUE
         LICENSE_REVOKED MARRIED IMP_AGE COMMERCIAL log_CLAIM_RATIO_YIS
         TIF log_CAP_BLUEBOOK IMP_YOJ IMP_INCOME M_HOME_VAL IMP_CAR_AGE / level=interval;
run;
The fit statistics below show a misclassification rate of 26% for the target value of one, that is, people who would get into an accident. After reviewing the fit statistics, the first ten trees or so should be examined to determine whether the fit statistics improve as the number of trees grows.
Forest models provide an alternative estimate of average square error and misclassification rate, called the out-of-bag (OOB) estimate. The OOB estimate is a convenient substitute for an estimate based on test data and is a less biased estimate of how the model will perform on future data (SAS Institute Inc.).
The first tree, with 1,227 leaves, has an OOB misclassification rate of 0.298. This is slightly higher than the overall fit misclassification rate of 0.26. However, as more trees are built with more leaves, the misclassification rate declines to as low as 0.24. This downward trend suggests the additional trees improve the model.
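A rough analogue of the OOB estimate can be reproduced in Python's scikit-learn (an assumption for illustration; the blog's model was fit in SAS 9.4). With `oob_score=True`, each observation is scored only by the trees whose bootstrap sample excluded it, giving the less biased error estimate described above. The data here is synthetic, standing in for the insurance file.

```python
# Illustrative OOB misclassification estimate with scikit-learn;
# not the SAS HPFOREST output. Synthetic stand-in for the insurance data:
# target 0 = no accident, 1 = accident.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=15, random_state=15546)

forest = RandomForestClassifier(
    n_estimators=100,   # the blog's run also grew 100 trees
    oob_score=True,     # score each row only with trees that never saw it
    random_state=15546,
)
forest.fit(X, y)

# OOB misclassification rate: a stand-in for test-set error.
oob_misclassification = 1 - forest.oob_score_
print(round(oob_misclassification, 3))
```

As in the SAS run, the OOB rate is expected to sit somewhat above the training misclassification rate, since training error is optimistically biased.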
The variable with the highest level of importance is whether the driver lives in a city, followed by the type of profession held (if any). This is very similar to the outcome of the decision tree classification model. The IMP_YOJ field has a negative OOB margin, meaning it does not contribute to the model's overall performance.
This was a very simple model using automatic pruning and the default generation of 100 trees. Other methods should be evaluated to see whether the model improves.