NOVATEUR PUBLICATIONS
JournalNX - A Multidisciplinary Peer Reviewed Journal
ISSN No: 2581 - 4230
VOLUME 4, ISSUE 5, May - 2018

MACHINE LEARNING APPLICATION IN LOAN DEFAULT PREDICTION

ABHISHEK KUMAR TIWARI
Manager, Tata Consultancy [email protected]

ABSTRACT:
In today's world, most of the world's population has access to banking services, and the number of consumers has increased manyfold in the last few years. For banks, the risks related to loans have increased, especially after the Great Recession (2007–2012) and with the job threats caused by automation and advances in technologies such as artificial intelligence (AI). At the same time, technological advancement has enabled companies to gather and store huge volumes of data that represent customers' behavior and the risks around loans. Data mining is a promising area of data analysis that aims to extract useful knowledge from tremendous amounts of complex data. Non-Performing Assets (NPA) are the topmost concern of banks. The NPA list is topped by the PIIGS countries (Portugal, Italy, Ireland, Greece and Spain).

Introduction
This paper proposes the use of statistical methods, especially machine learning techniques, to model and predict bank losses. We used machine learning algorithms specifically designed to handle the computationally intensive recognition of interactions in large data sets. The methods use all available information regardless of prior beliefs about its importance, and take into account the interaction effects among all variables (features). We applied four machine learning algorithms to loan default prediction: Logistic Regression, K-Nearest Neighbors (KNN), the tree-based classifier Classification and Regression Trees (CART), and Random Forest (RF). These models are suited to loan default prediction because of the large sample sizes and the complexity of the possible relationships among variables.
The data is split into a training data set (75%) for model development and a testing data set (25%) used for out-of-sample prediction. Each of the machine learning methods used has its pros and cons.

MACHINE LEARNING ALGORITHMS

LOGISTIC REGRESSION:
Logistic regression is one of the most widely used techniques for classification. It expresses the linear regression equation in logarithmic terms, called the logit or log of odds, where
odds = probability of success / probability of failure
The response variable for this paper is "Default", which has a binary outcome, Yes or No. Logistic regression models the probability of success, in this case the probability of default (Default = Yes). Logistic regression is defined by the sigmoid function, an S-shaped curve (Fig 1), with the logit given by
$$\operatorname{logit}(Y) = \log\left(\frac{Y}{1 - Y}\right) = \beta_0 + \sum_{i=1}^{p} \beta_i X_i$$
Fig 1: Sigmoid Function

K-NEAREST NEIGHBORS (KNN) CLASSIFIER:
K-nearest neighbor classification is a very simple method that works well on many problems and data sets. Given a positive integer K and a test observation x0, the KNN classifier first computes the distances and identifies the K points (neighbors) in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j:
$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$$
Finally, KNN applies Bayes rule and classifies the test observation x0 to the class with the largest probability.

CLASSIFICATION AND REGRESSION TREES (CART):
CART uses the Gini index as the impurity (or purity) measure when building a decision tree:
$$\text{Gini} = \sum_{i \neq j} p(i)\, p(j)$$
where i and j are levels of the target variable.
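As a rough illustration of the Gini index defined above, the following sketch computes the impurity of a set of class labels both as the pairwise sum over distinct levels and through the equivalent form 1 - sum of p(i)^2. It is written in plain Python rather than the R used for the paper's models; the function names and the sample labels are assumptions made for this example, not part of the original study.

```python
from collections import Counter


def class_proportions(labels):
    """Return the proportion p(i) of each level of the target variable."""
    total = len(labels)
    if total == 0:
        raise ValueError("labels must not be empty")
    return {level: count / total for level, count in Counter(labels).items()}


def gini_pairwise(labels):
    """Gini index as the sum of p(i) * p(j) over all pairs of distinct levels."""
    p = class_proportions(labels)
    return sum(p[i] * p[j] for i in p for j in p if i != j)


def gini_index(labels):
    """Equivalent closed form: 1 minus the sum of squared proportions."""
    p = class_proportions(labels)
    return 1.0 - sum(v * v for v in p.values())


# Hypothetical node with one defaulter and three non-defaulters.
node = ["Yes", "No", "No", "No"]
print(gini_pairwise(node), gini_index(node))
```

Both forms give 0.375 for this node, while a pure node (all labels identical) has a Gini index of 0, which is why a tree prefers splits that lower the index.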

minsplit and minbucket are important parameters in CART: minsplit is the minimum number of observations required for a split to be attempted, and minbucket is the minimum number of observations in a leaf node. CART can work even on unbalanced data by changing the prior probabilities used to obtain the decision tree
    parms = list(prior = c(non_default_proportion, default_proportion))
or by including a loss matrix as
    parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol = 2))
These methods are explained in detail in the section on imbalanced data.

Trees obtained by CART are easy to interpret and explain, which is very useful when the rules need to be translated into plain English. Less data preparation is required. CART is very robust to outliers in the input variables. CART can use the same variable multiple times in different parts of the tree, which can uncover complex interactions between sets of variables. On the other hand, the final tree can change with a small change in the data, and CART usually overfits, which can be addressed by pruning the tree.

RANDOM FOREST:
Random forests use trees as building blocks to construct more powerful prediction models. They improve accuracy by fitting many trees with a small tweak that decorrelates the trees: at each split, a random sample of m variables is chosen as split candidates from the full set of p variables. mtry is the number of randomly selected variables used at each split:
$$m \approx \sqrt{p} \ \text{for classification and} \ m \approx p/3 \ \text{for regression}$$
For each bootstrap sample, an unpruned tree is grown by using the best split among the mtry candidate variables at each node. Random forest predicts the test data by choosing the majority class for classification and by taking the mean for regression. Random forests tend to be biased towards variables that have a larger number of distinct values, i.e. they favor numeric variables over binary/categorical variables.

MODEL BUILDING METHODOLOGY:
The steps involved in this model building methodology are listed below:
• Data Selection
• Exploratory Data Analysis
• Outlier Detection
  o Outlier Treatment
• Missing Value Detection
  o Missing Value Treatment
• Splitting Training & Test Datasets
• Check for Data Imbalance
  o Over Sampling
  o Under Sampling
  o SMOTE
  o Changing the prior probabilities
  o Loss matrix
• Features Selection
• Building Classification Model
• Predicting Class Labels of Test Dataset
• Evaluating Accuracy and other metrics
• Parameter Tuning
• Finalize the Model

EDA (EXPLORATORY DATA ANALYSIS):

OUTLIERS DETECTION AND TREATMENT:
Any observation outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR can be considered an outlier. Before treating or eliminating outliers, we need to be sure that they are not influential observations.
Graphical method: Box plot
    outlier_cutoff_high <- quantile(Data$Var, 0.75) + 1.5 * IQR(Data$Var)
    outlier_cutoff_low <- quantile(Data$Var, 0.25) - 1.5 * IQR(Data$Var)
Replace observations beyond the higher cutoff point by the 95th percentile and those below the lower cutoff point by the 5th percentile.
Statistical technique: Grubbs' test for outliers (R package "outliers")

MISSING VALUE TREATMENT
• Drop a variable if more than 30% of its values are missing
• Drop an observation if many of its attributes are missing

IMPUTATION OF MISSING VALUES:
Multivariate Imputation by Chained Equations (MICE) is a very powerful and popular technique for imputing missing values. It uses different default methods for different kinds of data:
• Numeric data: pmm, predictive mean matching
• Binary data (factor with 2 levels): logreg, logistic regression imputation
• Unordered categorical data (factor > 2 levels): polyreg, polytomous regression imputation
• Ordered categorical data (factor > 2 levels): polr, proportional odds model

KNN IMPUTATION:
Impute each missing value from its nearest neighbors, which are found from the existing attributes using Euclidean or Manhattan distance.

DATA IMBALANCE:
Over-sampling methods replicate observations from the minority class to balance the data; this may cause overfitting.
Under-sampling methods remove observations from the majority class to balance the data. Removing observations causes the loss of useful information pertaining to the majority class.
Synthetic Minority Oversampling Technique (SMOTE) finds random points within the nearest neighbors of each minority class observation and generates new synthetic minority class observations from them. Because the new data are not copies of the existing data, SMOTE is less prone to the overfitting caused by simple replication.
Changing the prior probabilities: Changing the prior probabilities used to obtain the decision tree is an indirect way of adjusting the importance of misclassifications for each class.
    parms = list(prior = c(non_default_proportion, default_proportion))
Including a loss matrix: A loss matrix can be included to change the relative importance of misclassifying a default as a non-default versus a non-default as a default.
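To illustrate why a loss matrix changes the relative importance of the two kinds of misclassification, the sketch below applies a 2x2 loss matrix to a predicted default probability and picks the class with the lower expected loss. This is a generic decision-rule illustration in plain Python, not a description of how a tree-fitting routine uses the matrix internally; the function names and the cost values (10 and 1) are assumptions chosen for the example.

```python
def build_loss_matrix(cost_def_as_nondef, cost_nondef_as_def):
    """Loss matrix indexed as loss[true_class][predicted_class].

    Correct predictions cost nothing; the two error types carry the given costs.
    """
    return {
        "default": {"default": 0.0, "non_default": cost_def_as_nondef},
        "non_default": {"default": cost_nondef_as_def, "non_default": 0.0},
    }


def expected_loss(p_default, loss, predicted):
    """Expected loss of predicting `predicted` when P(default) = p_default."""
    return (p_default * loss["default"][predicted]
            + (1.0 - p_default) * loss["non_default"][predicted])


def classify(p_default, loss):
    """Choose the class with the smaller expected loss."""
    return min(("default", "non_default"),
               key=lambda c: expected_loss(p_default, loss, c))


equal_costs = build_loss_matrix(1.0, 1.0)   # both errors equally bad
heavy_miss = build_loss_matrix(10.0, 1.0)   # missed default costs 10x (assumed)

for p in (0.05, 0.30, 0.60):
    print(p, classify(p, equal_costs), classify(p, heavy_miss))
```

With equal costs an applicant is flagged as a default only when the predicted probability exceeds 0.5; when a missed default costs ten times as much, the break-even probability drops to 1/11, so more applicants are flagged.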

If the problem demands it, misclassifying a default as a non-default can be penalized more heavily. The loss matrix is again supplied through the parms argument:
    parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol = 2))
This constructs a 2x2 matrix with zeros on the diagonal and the changed loss penalties off the diagonal. The default loss matrix has all ones off the diagonal.

Features Selection
A graphical representation of the variable importance (mean decrease in Gini index) for the top 10 variables is shown in Fig 2. The variables with the largest mean decrease in Gini index are Amount, Chk Acct, and Duration. The importance of a predictor is the amount by which the Gini index is decreased by splits over that predictor, averaged over all trees. A large value indicates high importance.
Fig 2: Variable Importance plot

PREDICTION ACCURACY METRICS:

CONFUSION MATRIX:
With TP, TN, FP and FN denoting the true positive, true negative, false positive and false negative counts of the confusion matrix:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
The F1 score is the harmonic mean of precision and sensitivity (recall):
$$F_1 = \frac{2TP}{2TP + FP + FN}$$

RECEIVER OPERATING CHARACTERISTICS (ROC):
The ROC curve is a popular graphic that simultaneously displays the two types of errors for all possible thresholds. The vertical axis is the true positive rate (sensitivity) and the horizontal axis is the false positive rate (1 - specificity) at different threshold points. The closer the curve is to the top left corner, the higher the accuracy of the prediction. ROC curves are useful for comparing different classification algorithms. According to the ROC curves, the Random Forest and KNN models have the highest accuracy and the CART model has the lowest.
Fig 3: ROC curve for different classifiers

AREA UNDER THE CURVE (AUC):
AUC is a metric for binary classification that summarizes the accuracy of a model. For a useful classifier it ranges from 0.5 (no better than random guessing) to 1 (perfect classification).

PARAMETER TUNING:
CP (Complexity Parameter)
$$\text{Misclassification rate} = \text{Root node error} \times \text{xerror} \times 100\%$$
Pick the Cp value that corresponds to the lowest xerror and use this Cp to prune the tree.
Fig 4: Cp value vs. xerror
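The Cp selection rule above can be sketched as follows, assuming a cross-validation table with one row per candidate Cp value (similar in layout to the complexity table that R's rpart reports). The table values, the root node error of 0.30 and the function names are hypothetical and serve only to show the arithmetic.

```python
def misclassification_rate(root_node_error, xerror):
    """Cross-validated misclassification rate in percent."""
    return root_node_error * xerror * 100.0


def pick_cp(cp_table):
    """Return the row with the lowest cross-validated error (xerror)."""
    return min(cp_table, key=lambda row: row["xerror"])


# Hypothetical complexity table: xerror is relative to the root node error.
cp_table = [
    {"cp": 0.050, "nsplit": 0, "xerror": 1.00},
    {"cp": 0.020, "nsplit": 2, "xerror": 0.85},
    {"cp": 0.010, "nsplit": 5, "xerror": 0.90},
]
root_node_error = 0.30  # assumed error rate of the tree with no splits

best = pick_cp(cp_table)
rate = misclassification_rate(root_node_error, best["xerror"])
print(f"prune with cp={best['cp']}, estimated misclassification {rate:.1f}%")
```

In this made-up table the largest tree (cp = 0.010) has a higher cross-validated error than the medium one, which is the typical sign of overfitting that pruning is meant to remove.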

NTREE (NUMBER OF TREES IN RANDOM FOREST):
Choose the number of trees from the plot at the point where an elbow is formed, beyond which the error does not decrease significantly. A random forest with fewer trees will be faster to execute.
Fig 5: Number of trees vs. error

MTRY
Number of variables randomly sampled at each split.
Fig 6: mtry vs. OOB error

CONCLUSION:
Accurate default estimation can help banks to avoid huge losses. In this paper we presented a framework for effectively predicting the class labels of new loan applicants. The models were built using the data mining techniques available in R. Preprocessing is the most important and time-consuming step. The pre-processed dataset is then used for building the decision tree classifier. The results verify that machine learning algorithms yield higher forecast accuracy. Machine learning algorithms can also help to recognize the importance of the variables. The results indicate that each of these four machine learning methods has its pros and cons. We evaluated the prediction performance using Area Under the Curve (AUC), F1 score, recall, precision and accuracy, together with the ROC curve, which plots the true positive rate against the false positive rate. According to these metrics, the Random Forest and KNN models have the highest accuracy and the CART model has the lowest. Random Forest gave an 86% accurate classification result.
