Survival analysis is a collection of statistical procedures for data in which the outcome variable of interest is the time until an event occurs. So how does survival analysis help choose better P2P loan investments? Machine learning models for loans are generally designed for classification (fully paid or charged off). This requires every loan in the training set to be fully mature, because current loans have an indeterminate future and cannot be labeled. As a result, classification models learn from only the mature fraction of the available loans. This is a proven methodology, but it comes with some drawbacks:
- All non-mature notes are omitted because their class (fully paid or charged off) is not yet known. The majority of loans in Lending Club’s database are non-mature, so traditional classification models leave out a significant amount of data.
- The classification dichotomy does not represent probability over time. It is simply a prediction of outcome given a set of variables.
Survival analysis is not designed to predict events. Rather, it is primarily used to estimate a survival curve, which measures the probability that an event has not occurred by time t, taking all previous events and censored observations into account. In other words, survival analysis considers both mature and non-mature notes and provides an estimate of survival over time. This is helpful not only for projecting a more accurate ROI in backtest analysis, but also for producing a forward-looking loss curve for each loan. Peer Lending Server (PLS) uses both classification and survival analysis to help investors make informed decisions. Our historical analytics also use time-sensitive survival probabilities to project return on investment.
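To make the idea of a survival curve concrete, here is a minimal sketch of the Kaplan-Meier estimator, one common way to estimate such a curve from censored data. It is an illustration only, not a description of how Peer Lending Server computes its curves. The function name, the toy data, and the choice to treat every loan without an observed charge-off as censored are all assumptions made for this example.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Estimate S(t) at each time where a charge-off was observed.

    durations: months each loan was on the books (hypothetical input)
    observed:  True if the loan charged off at that month, False if it is
               censored (assumption: still current, or paid off with no default)
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)      # every loan leaving the risk set at t
    at_risk = len(durations)        # loans still under observation
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the share that survived month t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]         # drop events and censored loans alike
    return curve

# Toy data: six loans, three observed charge-offs, three censored
durations = [3, 5, 5, 8, 12, 12]
observed = [True, False, True, False, True, False]
curve = kaplan_meier(durations, observed)
print([(t, round(s, 4)) for t, s in curve])
# [(3, 0.8333), (5, 0.6667), (12, 0.3333)]
```

Note that the censored loans still count toward the number at risk until they leave observation, which is how non-mature notes contribute information without needing a final label.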
Let's take a look at the benefits of survival analysis for historical analysis and for classification separately:
Many popular P2P historical tools provide an instantaneous ROI based on filter parameters, with a discount factor for late loans. This may mislead people into thinking that the filter will perform similarly in the future. However, the instantaneous ROI applies no discount to current loans that may yet default; it assumes that all current loans will reach maturity without default. An estimated default projection for each loan, based on its age and characteristics, produces a forward-looking ROI. Survival analysis makes it possible to build a loss curve for all current loans and so gives a more complete ROI picture.
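One way to see how a survival curve feeds a forward-looking projection is through conditional survival: for a loan that is current at a given age, the probability that it is still paying at a later month m is S(m) / S(age). The sketch below uses that ratio to weight the remaining payments. The curve values, function name, and payment amount are hypothetical, and the calculation deliberately ignores recoveries, prepayments, fees, and discounting, so it should be read as an illustration rather than a full ROI model.

```python
def expected_remaining_payments(curve, age, term, payment):
    """Expected sum of future payments for a loan that is current at `age`.

    curve:   list where curve[m] is S(m), the probability of no charge-off
             by month m (curve[0] == 1.0); hypothetical values here
    age:     months the loan has already survived
    term:    total number of monthly payments
    payment: fixed monthly payment (assumption: level payments)
    """
    return sum(payment * curve[m] / curve[age] for m in range(age + 1, term + 1))

# Hypothetical survival curve for a toy 6-month loan
curve_by_month = [1.0, 0.98, 0.96, 0.95, 0.94, 0.93, 0.92]

# Naive view: a current loan at month 3 pays all 3 remaining installments
naive_total = 100 * (6 - 3)

# Survival-weighted view: each installment is weighted by S(m) / S(3)
expected_total = expected_remaining_payments(curve_by_month, age=3, term=6, payment=100)

print(naive_total, round(expected_total, 2))
```

The gap between the two totals is the expected loss that an instantaneous ROI leaves out for current loans.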
Standard practice for classification-based machine learning involves feeding a model each loan's characteristics together with its final outcome (fully paid or charged off). This means that only mature loans are used to train or “educate” a model. In the latest Lending Club historical database of loans, the vast majority of loans are not mature, so this approach omits a large amount of information. Wouldn’t it be better to use the complete historical data set to build a comprehensive model? PLS uses an ensemble (combined) approach of classification and survival analysis to provide a complete picture. The final result is an easy-to-understand probability of the loan being fully paid.
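The difference in usable data can be sketched with a handful of made-up loan records. A classification data set keeps only loans with a final outcome, while a survival data set keeps every loan and simply flags whether a charge-off has been observed so far. The records, field layout, and helper names below are hypothetical and say nothing about how PLS builds its ensemble.

```python
# Hypothetical records: (loan_status, months_on_book)
loans = [
    ("Fully Paid", 36),
    ("Charged Off", 14),
    ("Current", 9),
    ("Current", 22),
    ("Charged Off", 5),
]

MATURE = {"Fully Paid", "Charged Off"}

def classification_rows(loans):
    """Keep only mature loans; label 1 = fully paid, 0 = charged off."""
    return [(months, int(status == "Fully Paid"))
            for status, months in loans if status in MATURE]

def survival_rows(loans):
    """Keep every loan; event is True only when a charge-off was observed."""
    return [(months, status == "Charged Off") for status, months in loans]

print(len(classification_rows(loans)), "rows for classification")
print(len(survival_rows(loans)), "rows for survival analysis")
```

In this toy set the two current loans are dropped by the classification view but retained, as censored observations, by the survival view.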