Lending Club Data Analysis of Variance – ANOVA


ANOVA

In this article, we perform an analysis of variance, commonly referred to as ANOVA in statistics. ANOVA enables the comparison of means for multiple groups or populations. It compares the variance “between” groups to the variance “within” groups for a given attribute. In this case, the groups are defined by loan status, “fully paid/good” or “charged off/bad”, and we want to understand how each attribute varies across them.
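As a rough illustration of the “between” versus “within” comparison, the sketch below computes a one-way ANOVA F statistic using only the Python standard library. The function name `one_way_anova_f` and the interest rates are made up for this example; they are not taken from the Lending Club data or from the tool used for this article.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = mean(x for g in groups for x in g)

    # "Between" variation: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # "Within" variation: how far each value sits from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)    # mean square between groups
    ms_within = ss_within / (n - k)      # mean square within groups
    return ms_between / ms_within


# Hypothetical interest rates (percent), for illustration only.
fully_paid = [7.9, 10.2, 11.1, 12.5, 9.8]
charged_off = [13.5, 15.2, 14.1, 17.3, 16.0]

f_stat = one_way_anova_f([fully_paid, charged_off])
print(f"F = {f_stat:.2f}")
```

A large F means the group means are far apart relative to the spread inside each group.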

Let's start with a simple example: an analysis of a single attribute, interest rate, grouped by loan status.

Lending Club Interest Rate Anova Test

Without going into the statistical calculations, we see that the F score is high, which indicates that the difference between the mean interest rates of the two loan classes is large relative to the variance of interest rates within each class. In other words, the interest rates of good loans cluster around a different average than those of bad loans. In addition, the difference between these means is significant given the reported probability (P) of 0. From this example, we can conclude that average interest rates differ significantly between good and bad loans.
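To give a feel for how a high F score translates into a probability near 0, here is a small sketch. It assumes exactly two groups (good and bad loans) and a large sample, in which case F behaves approximately like a squared standard normal value. This is an approximation chosen to keep the example free of dependencies; a statistics package would normally use the exact F distribution, and the function name is hypothetical.

```python
import math


def approx_p_value_two_groups(f_stat):
    """Large-sample approximation of the ANOVA P value for two groups.

    Assumption: with two groups and many observations, F is roughly the
    square of a standard normal value, so P(F > f) ~ erfc(sqrt(f / 2)).
    """
    return math.erfc(math.sqrt(f_stat / 2.0))


# Illustrative F scores, not values from the article.
for f_stat in (0.5, 3.84, 50.0):
    p = approx_p_value_two_groups(f_stat)
    print(f"F = {f_stat:6.2f}  ->  P ~ {p:.3f}")
```

Under this approximation an F score of 50 already prints as 0.000 at three decimals, which is how a P of 0 should be read: very small, not literally zero.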

The table below lists each attribute sorted by its corresponding significance probability (P) value (smaller is more significant). Each nominal attribute such as grade is assigned a unique integer value per category (grade a=1, b=2, c=3 …) to calculate variance.

Attribute P value (Loan Status)
grade 0.000
home_ownership 0.000
is_inc_v 0.000
purpose 0.000
acc_open_past_24mths 0.000
annual_inc 0.000
avg_cur_bal 0.000
bc_open_to_buy 0.000
bc_util 0.000
dti 0.000
fico_range_high 0.000
inq_last_6mths 0.000
int_rate 0.000
loan_amnt 0.000
mort_acc 0.000
mo_sin_old_il_acct 0.000
mo_sin_old_rev_tl_op 0.000
mo_sin_rcnt_tl 0.000
mths_since_last_delinq 0.000
mths_since_last_major_derog 0.000
mths_since_last_record 0.000
mths_since_oldest_il_open 0.000
mths_since_recent_bc 0.000
mths_since_recent_bc_dlq 0.000
mths_since_recent_inq 0.000
mths_since_recent_loan_delinq 0.000
mths_since_recent_revol_delinq 0.000
num_accts_ever_120_pd 0.000
num_actv_rev_tl 0.000
num_bc_sats 0.000
num_bc_tl 0.000
num_il_tl 0.000
num_op_rev_tl 0.000
num_rev_accts 0.000
num_rev_tl_bal_gt_0 0.000
num_sats 0.000
num_tl_30dpd 0.000
num_tl_op_past_12m 0.000
open_acc 0.000
percent_bc_gt_75 0.000
pub_rec 0.000
pub_rec_bankruptcies 0.000
pub_rec_gt_100 0.000
revol_util 0.000
term 0.000
total_acc 0.000
total_bal_ex_mort 0.000
total_bc_limit 0.000
total_il_high_credit_limit 0.000
total_rev_hi_lim 0.000
tot_coll_amt 0.000
tot_cur_bal 0.000
tot_hi_cred_lim 0.000
cpiaucsl 0.000
gs10 0.000
indpro 0.000
mprime 0.000
spcs20rsa 0.000
unrate 0.000
usslind 0.000
num_tl_90g_dpd_24m 0.001
delinq_2yrs 0.004
revol_bal 0.007
num_actv_bc_tl 0.010
emp_length 0.020
num_tl_120dpd_2m 0.026
chargeoff_within_12_mths 0.045
tax_liens 0.054
collections_12_mths_ex_med 0.253
pct_tl_nvr_dlq 0.258
delinq_amnt 0.455
acc_now_delinq 0.739
sub_grade 0.756
m2sl 0.784
ahetpi 0.939
mo_sin_rcnt_rev_tl_op 0.950
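The integer coding of nominal attributes used for the table above can be sketched as follows. The helper `integer_code_grades` and the sample rows are hypothetical; the sketch only shows how grade letters become numbers that can then be grouped by loan status.

```python
def integer_code_grades(grades):
    """Map grade letters to integers: a=1, b=2, ..., g=7."""
    codes = {letter: i for i, letter in enumerate("abcdefg", start=1)}
    return [codes[g.lower()] for g in grades]


# Hypothetical loans, for illustration only.
sample_grades = ["a", "c", "b", "g", "d"]
sample_status = ["good", "bad", "good", "bad", "good"]

coded = integer_code_grades(sample_grades)

# Group the coded values by loan status; these groups are what ANOVA compares.
by_status = {}
for code, status in zip(coded, sample_status):
    by_status.setdefault(status, []).append(code)

print(coded)
print(by_status)
```

Note that this coding treats the gap between a and b as equal to the gap between f and g, which is an assumption built into the integer values.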


Interpretation

There is some valuable information in the table above that can be used to help create loan filters. Notably, attributes such as grade, purpose, home ownership, etc. vary significantly with loan status and may be good candidates for loan filters.

You may have noticed that the sub_grade attribute does not have significant variance, contrary to what you might have expected. Because each sub grade is represented by an integer value (A1-G5 = 1-35), the variance “between” each sub grade (mean squares numerator of F) is inherently smaller than for the “grade” (A-G = 1-7) attribute. In addition, the residuals/errors are potentially larger because of the number of “bins” (35 vs 7). It may help to take a look at the distribution of good and bad loans for both grade and sub_grade attributes:

Distribution by Grade

Distribution by Sub Grade

Notice that the distribution by grade is skewed between good and bad loans, while the distribution by sub_grade looks very similar for both. It is important to note that attributes with higher P values are not necessarily poor filter candidates.
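A distribution like the ones referred to above can be tabulated with a short sketch such as the following, which counts the share of bad loans per category. The function name `bad_loan_share`, the status labels, and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def bad_loan_share(categories, statuses, bad_label="charged off"):
    """Return the fraction of bad loans for each category value."""
    totals = Counter(categories)
    bad = Counter(c for c, s in zip(categories, statuses) if s == bad_label)
    return {c: bad[c] / totals[c] for c in sorted(totals)}


# Hypothetical loans, for illustration only.
grades = ["a", "a", "b", "b", "c", "c", "c", "d"]
statuses = [
    "fully paid", "fully paid", "fully paid", "charged off",
    "charged off", "fully paid", "charged off", "charged off",
]

shares = bad_loan_share(grades, statuses)
for grade, share in shares.items():
    print(f"grade {grade}: {share:.0%} charged off")
```

The same function works for sub_grade or purpose by passing a different list of category values.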

In the table below, each category of a nominal attribute is coded as its own binary (0/1) value. This enables us to see potentially important nominal categories using ANOVA. Remember, as with the above table, a high P value does not necessarily indicate a poor filter candidate.

Attribute P value (Loan Status)
emp_length = 2 years 0.000
emp_length = n/a 0.000
grade = c 0.000
grade = f 0.000
grade = a 0.000
grade = b 0.000
grade = d 0.000
grade = g 0.000
grade = e 0.000
home_ownership = mortgage 0.000
home_ownership = rent 0.000
is_inc_v = false 0.000
is_inc_v = true 0.000
purpose = credit card 0.000
purpose = car 0.000
purpose = other 0.000
purpose = small business 0.000
purpose = major purchase 0.000
purpose = medical 0.000
sub_grade = f1 0.000
sub_grade = a1 0.000
sub_grade = c3 0.000
sub_grade = b4 0.000
sub_grade = d1 0.000
sub_grade = b3 0.000
sub_grade = b1 0.000
sub_grade = f3 0.000
sub_grade = d2 0.000
sub_grade = g1 0.000
sub_grade = e2 0.000
sub_grade = d5 0.000
sub_grade = e3 0.000
sub_grade = e1 0.000
sub_grade = a2 0.000
sub_grade = e5 0.000
sub_grade = a5 0.000
sub_grade = b2 0.000
sub_grade = e4 0.000
sub_grade = d3 0.000
sub_grade = a3 0.000
sub_grade = d4 0.000
sub_grade = f4 0.000
sub_grade = a4 0.000
sub_grade = f2 0.000
sub_grade = g2 0.000
sub_grade = f5 0.000
sub_grade = g4 0.000
sub_grade = g3 0.000
sub_grade = g5 0.000
acc_open_past_24mths 0.000
annual_inc 0.000
avg_cur_bal 0.000
bc_open_to_buy 0.000
bc_util 0.000
dti 0.000
fico_range_high 0.000
inq_last_6mths 0.000
int_rate 0.000
loan_amnt 0.000
mort_acc 0.000
mo_sin_old_il_acct 0.000
mo_sin_old_rev_tl_op 0.000
mo_sin_rcnt_tl 0.000
mths_since_last_delinq 0.000
mths_since_last_major_derog 0.000
mths_since_last_record 0.000
mths_since_oldest_il_open 0.000
mths_since_recent_bc 0.000
mths_since_recent_bc_dlq 0.000
mths_since_recent_inq 0.000
mths_since_recent_loan_delinq 0.000
mths_since_recent_revol_delinq 0.000
num_accts_ever_120_pd 0.000
num_actv_rev_tl 0.000
num_bc_sats 0.000
num_bc_tl 0.000
num_il_tl 0.000
num_op_rev_tl 0.000
num_rev_accts 0.000
num_rev_tl_bal_gt_0 0.000
num_sats 0.000
num_tl_30dpd 0.000
num_tl_op_past_12m 0.000
open_acc 0.000
percent_bc_gt_75 0.000
pub_rec 0.000
pub_rec_bankruptcies 0.000
pub_rec_gt_100 0.000
revol_util 0.000
term 0.000
total_acc 0.000
total_bal_ex_mort 0.000
total_bc_limit 0.000
total_il_high_credit_limit 0.000
total_rev_hi_lim 0.000
tot_coll_amt 0.000
tot_cur_bal 0.000
tot_hi_cred_lim 0.000
cpiaucsl 0.000
gs10 0.000
indpro 0.000
mprime 0.000
spcs20rsa 0.000
unrate 0.000
usslind 0.000
num_tl_90g_dpd_24m 0.001
sub_grade = c5 0.002
purpose = renewable energy 0.003
delinq_2yrs 0.004
revol_bal 0.007
num_actv_bc_tl 0.010
purpose = home improvement 0.011
purpose = wedding 0.024
num_tl_120dpd_2m 0.026
purpose = moving 0.028
sub_grade = c2 0.034
chargeoff_within_12_mths 0.045
tax_liens 0.054
emp_length = 9 years 0.071
emp_length = 5 years 0.080
sub_grade = b5 0.108
emp_length = 10+ years 0.126
purpose = educational 0.175
sub_grade = c1 0.247
collections_12_mths_ex_med 0.253
pct_tl_nvr_dlq 0.258
purpose = vacation 0.292
home_ownership = other 0.338
home_ownership = none 0.384
emp_length = 3 years 0.402
purpose = debt consolidation 0.438
delinq_amnt 0.455
emp_length = 1 year 0.467
emp_length = 7 years 0.500
emp_length = 8 years 0.513
emp_length = 4 years 0.543
sub_grade = c4 0.550
purpose = house 0.652
acc_now_delinq 0.739
emp_length = < 1 year 0.748
emp_length = 6 years 0.753
m2sl 0.784
home_ownership = own 0.889
ahetpi 0.939
mo_sin_rcnt_rev_tl_op 0.950

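The binary coding used for the table above can be sketched like this: every distinct category of a nominal attribute becomes its own 0/1 column, labeled in the same "attribute = category" style as the table rows. The helper `binary_code` and the sample values are hypothetical.

```python
def binary_code(attribute, values):
    """Return one 0/1 indicator column per distinct category."""
    return {
        f"{attribute} = {category}": [1 if v == category else 0 for v in values]
        for category in sorted(set(values))
    }


# Hypothetical loan purposes, for illustration only.
purposes = ["car", "wedding", "car", "medical"]

indicators = binary_code("purpose", purposes)
for name, column in indicators.items():
    print(name, column)
```

Each indicator column can then be tested on its own against loan status, which is why a single attribute such as purpose appears as many separate rows.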
Let's take a look at the distribution of purpose = debt consolidation versus purpose = small business:

 

Distribution by Purpose = Small Business

Distribution by Purpose = Debt Consolidation

 

From the ANOVA information presented above, there is significant variance among the majority of attributes included in the Lending Club historical loan database. It is recommended to combine the ANOVA results with what you already know about the data in order to interpret them fully. For example, if tax_liens is not significant according to ANOVA (P = 0.054) and you believe based on experience that it provides little value, you should consider dropping it. This article makes no recommendations for the filter criteria of specific attributes. In future articles, we’ll look at methods to derive statistically significant criteria thresholds among the most influential attributes.

About the Author

sociallender: Married with two beautiful daughters. Enjoy artificial intelligence, investing and Jiu-Jitsu. Life is good - Godspeed!
