Institutional Sector Classification


Workshop on “Big Data & Machine Learning Applications for Central Banks”
October 22nd 2019, Centro Carlo Azeglio Ciampi

Institutional sector classification: A Machine Learning Application

Paolo Massaro, Oliver Giudice
Divisione Informazioni Anagrafiche, Dipartimento ECS
Divisione Ricerca sulle Tecnologie Avanzate, Dipartimento IT
www.bankit.art

The opinions expressed and conclusions drawn are those of the authors and do not necessarily reflect the views of the Bank of Italy.

Problem statement

Given: a set of features of a company. These are numeric and non-numeric: name, number of employees, balance sheet data, whether publicly held or not, etc.

Determine: the appropriate SAE code to assign to it.

SAE (“Settore di Attività Economica”) is a code defined by Circ. 140/97, meant to cluster companies into one of 116 "institutional sectors" (e.g., public institution, productive company, financial holding, etc.).

Machine Learning approach

We start from existing data: a “machine learning model” is trained on companies already labeled (by hand). On the basis of this "past experience" it learns to predict which SAE any new company belongs to.

Provided: the machine is given several (tens of thousands of) prior samples of correctly labeled companies.

Why and when should ML help here?

AS-IS: Who classifies companies into SAEs?
[Table only partly recoverable from the slide: columns "Type of company" and "Classified ..."; legible entries: ISTAT (authoritative), supervised entities, Bank of ..., financial intermediaries.]

Existing SAE labels may be:
incorrect -> fix
inconsistent -> spot
stale -> update
missing (30%) -> autofill


Outline: Data, Preprocessing, Feature extraction, Imbalanced learning, Classification, Results

Data: original datasets

Dataset                 #       Origin
Anagrafe Soggetti       42M     Bank of Italy
Listed Companies        1K      Bank of Italy
ATECO                   3.6M    Ag. Entrate
Balance Sheet et al.    2.2M    CERVED
Info Imprese            4.8M    INFOCAMERE

Data: data ingestion (ETL)

The five source datasets (Anagrafe Soggetti, Listed Companies, ATECO, Balance Sheet et al., Info Imprese) go through Extract, Transform, Load into the Big Data Analytics Platform.

Data: input dataset to the ML machinery

Single text file, 1.4M records, 400 MBytes.
Each record contains info about:
a. Company structure
b. Balance sheet
c. Other info
d. SAE

Data: inside the ML machinery, for each company

Company structure: 15 numeric features (num. of employees, PA-owned shares, ...)
Balance sheet: 14 numeric features (share capital, personnel costs, ...)
Other info: 3 structured features (listed (y/n), ATECO, Comune)
Name & notes: 2 textual features (company name, balance notes)
SAE

SAE: (un)balanced data

[Figure: number of companies per SAE on a logarithmic scale from 1 to 1,000,000, financial vs. non-financial. SAE codes on the horizontal axis: 430, 288, 476, 280, 432, 268, 284, 273, 270, 258, 475, 287, 259, 263, 477, 285, 450, 283, 257.]


Preprocessing: dealing with missing data (specific w.r.t. data type)

Structured data:
Try to fix or [...]-checking
Use “zero” or average value
Regression on other variables
etc.

Unstructured data:
Ignore
It is textual and can be divided into company denomination (always present) and balance notes (missing in almost 50% of the dataset)
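As an illustration of the “zero” or average value option for structured data, here is a minimal sketch in plain Python. The function name `impute_numeric` and the two-column records are hypothetical; the slides do not say which strategy was applied to which feature.

```python
from statistics import mean

def impute_numeric(rows, strategy="mean"):
    """Fill None entries column by column with the column mean or with zero."""
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        if strategy == "mean" and present:
            fills.append(mean(present))
        else:
            # "zero" strategy, or a column with no observed value at all
            fills.append(0.0)
    return [[fills[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

# Hypothetical records: (number of employees, share capital)
records = [[10, 1000.0], [None, 3000.0], [30, None]]
filled_mean = impute_numeric(records, "mean")
filled_zero = impute_numeric(records, "zero")
```

Mean imputation keeps the column average unchanged, while zero filling is only sensible when a missing value plausibly means "none".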

Feature extraction: types of features

numeric quantity: [direct]
categorical property: [one-hot-encoding]
textual property: [tf-idf]

The slide also shows PCA and the resulting list of company features.
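The three encodings can be sketched in plain Python as follows. This is an assumption-laden illustration, not the authors' pipeline: the helper names, the whitespace tokeniser, the idf formula ln(N / df) and the sample company names are all hypothetical, and the PCA step is left out.

```python
import math
from collections import Counter

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]

def tfidf(docs):
    """Return (vocabulary, vectors) with weights tf * idf, idf = ln(N / df).

    Assumes every document contains at least one token.
    """
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n = len(docs)
    df = Counter(t for doc in tokenised for t in set(doc))
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        vectors.append([counts[t] / len(doc) * math.log(n / df[t])
                        for t in vocab])
    return vocab, vectors

# Hypothetical company names
names = ["banca alfa spa", "alfa holding spa", "comune di beta"]
vocab, text_vecs = tfidf(names)

listed = one_hot("n", ["y", "n"])   # categorical property
employees = 120                     # numeric quantity, used directly

# One feature vector for the first company: numeric + one-hot + tf-idf
features = [float(employees)] + listed + text_vecs[0]
```

Terms that appear in few names (here "banca") get a higher weight than terms shared by many names (here "spa").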


Imbalanced learning: a couple of unbalanced classes

[Figure: number of samples per class; class 1 is over-represented, class 2 is under-represented.]

Imbalanced learning: under-sampling & over-sampling

SMOTE: new artificial samples (over-sampling)
Removed samples (under-sampling)

[Figure: number of samples for class 1 and class 2 after resampling.]
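A rough sketch of both ideas in plain Python: SMOTE-style over-sampling creates artificial points by interpolating between a minority sample and one of its nearest minority neighbours, while under-sampling randomly drops majority samples. The function names, the toy 2-D data and the target size of 20 samples per class are assumptions made for illustration only.

```python
import random

def smote_like(minority, n_new, k=2, rng=None):
    """Create n_new synthetic points between minority samples and their
    k nearest minority neighbours (needs at least two minority samples)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

def undersample(majority, n_keep, rng=None):
    """Keep a random subset of the majority class."""
    rng = rng or random.Random(0)
    return rng.sample(majority, n_keep)

class_1 = [[float(i), float(i % 7)] for i in range(100)]    # over-represented
class_2 = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [1.5, 2.0]]  # under-represented

balanced_1 = undersample(class_1, 20)
new_points = smote_like(class_2, 16)
balanced_2 = class_2 + new_points
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies.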


Classification: SAE hierarchy

[Figure: SAE codes organised on three levels: Sector, Sub-sector, SAE. SAE 430 is among the codes shown; the other codes are not reliably legible.]

Classification: classifier hierarchy

[Diagram: a tree of numbered classifiers (Classifier 2, 3, 4, 5, 7 and 8 are legible). Legible branch labels: "430" / "non-430", "financial" / "non-fin", "holding". Legible leaf group sizes include 16 SAEs and 31.]
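The routing idea behind a classifier hierarchy can be sketched as a small tree in which each node picks a branch and passes the company on. The sketch below is hypothetical: the `Node` class, the precomputed scores `score_430` and `score_fin`, the 0.5 thresholds and the leaf labels stand in for trained models and real SAE groups, which the slides do not spell out.

```python
class Node:
    """A classifier in the hierarchy: it picks a branch, and each branch
    is either a final label or another Node."""

    def __init__(self, classifier, branches):
        self.classifier = classifier  # callable: company -> branch name
        self.branches = branches      # dict: branch name -> label or Node

    def predict(self, company):
        outcome = self.branches[self.classifier(company)]
        if isinstance(outcome, Node):
            return outcome.predict(company)
        return outcome

# Hypothetical stand-in rules; in practice each would be a trained model.
financial_split = Node(
    lambda c: "financial" if c["score_fin"] >= 0.5 else "non-fin",
    {"financial": "financial subtree", "non-fin": "non-financial subtree"},
)
root = Node(
    lambda c: "430" if c["score_430"] >= 0.5 else "non-430",
    {"430": "430", "non-430": financial_split},
)

label_a = root.predict({"score_430": 0.9, "score_fin": 0.1})
label_b = root.predict({"score_430": 0.2, "score_fin": 0.8})
label_c = root.predict({"score_430": 0.2, "score_fin": 0.3})
```

An error made near the root cannot be corrected further down, which is one reason such structures handle "ambiguous" data poorly.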

Classification: ensemble classifiers

[Diagram, shown for the "non-fin" branch: separate classifiers per data type, including GBoost on numeric properties, an SVM on categoric properties and a neural classifier. Each produces 16 values, and an ensemble combines them into the SAE prediction (16 values).]
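One simple way to combine several classifiers that each emit 16 per-class values is soft voting, i.e. averaging the score vectors and taking the highest entry. This is only an assumed combiner for illustration; the slides do not state how the ensemble merges its members, and the score vectors below are made up.

```python
def soft_vote(score_lists, weights=None):
    """Average per-class scores from several classifiers.

    Returns (index of the winning class, averaged scores).
    """
    weights = weights or [1.0] * len(score_lists)
    total = sum(weights)
    n_classes = len(score_lists[0])
    avg = [sum(w * s[i] for w, s in zip(weights, score_lists)) / total
           for i in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

def one_peak(idx, peak, n=16):
    """Hypothetical score vector: `peak` on one class, the rest spread evenly."""
    rest = (1.0 - peak) / (n - 1)
    return [peak if i == idx else rest for i in range(n)]

numeric_scores = one_peak(3, 0.6)    # e.g. from a model on numeric properties
categoric_scores = one_peak(3, 0.5)  # e.g. from a model on categoric properties
text_scores = one_peak(7, 0.7)       # e.g. from a model on textual features

winner, avg = soft_vote([numeric_scores, categoric_scores, text_scores])
```

Here two moderately confident members outvote one more confident member, so class index 3 wins.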


Results: datasets and performance

1.4 million records of SAE-labeled data.

[Chart: share of records in SAE 430, SAE 288 and other SAEs.]

Results: datasets and performance

training set: used to automatically learn classifier parameters
validation set: used to optimise classifier hyperparameters
test set: used to evaluate classifier performance (420,000 records the classifier has never seen before)
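A three-way split can be sketched as below. The 30% test fraction matches 420,000 out of 1.4 million records; the 10% validation fraction, the seed and the function name `split` are assumptions, since the slides do not give the size of the validation set.

```python
import random

def split(records, test_frac=0.3, val_frac=0.1, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test parts."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Stand-in for the 1.4 million records, scaled down by a factor of 1000
records = list(range(1400))
train, val, test = split(records)
```

Keeping the test records out of both training and hyperparameter tuning is what makes the reported performance an estimate for unseen companies.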

Results: performance metrics

Absolute number of errors (min 0, max 4000): a raw, direct, relatable measure. The maximum of 4000 is 1% of the samples in the test set.

Accuracy (min 99%, max 100%): the standard performance measure. Any classifier that gets the large SAE 430 “right” surpasses 99%.

Average F1 score (min 0%, max 100%): insensitive to class size. To score high here a classifier has to get most things right.
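The contrast between accuracy and average F1 can be reproduced on a toy example. The sketch assumes that "average F1" means the unweighted (macro) mean of per-class F1 scores, which fits the remark that it is insensitive to class size; the label counts are invented.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical test set: 990 companies in SAE 430, 10 in two small SAEs
y_true = ["430"] * 990 + ["288"] * 6 + ["476"] * 4
always_430 = ["430"] * 1000   # baseline that ignores its input

errors = sum(t != p for t, p in zip(y_true, always_430))
acc = accuracy(y_true, always_430)
f1 = macro_f1(y_true, always_430)
```

The baseline reaches 99% accuracy while its macro F1 stays near one third, because two of the three classes are never predicted correctly.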

Results: classification results

[Chart: absolute number of errors (0 to 4000), accuracy (99% to 100%) and average F1 score (0% to 100%) for classifiers trained on numerical data, categorical data, names and all preprocessed data, for the neural ensemble, and for an "always 430" baseline. Only some values survive in the transcription (86.3, 86.1, 1528, 5.2) and they cannot be reliably matched to classifiers or metrics.]

Conclusions

Dealing with hybrid data is complex, and different pipelines (with ensemble techniques) are needed.
Hierarchical structures give comfortable a-priori knowledge but are not well suited for “ambiguous” data.
A scientific paper with details on all the techniques presented is currently under review and will be published soon.

From problem to solution

A business necessity to improve DQM activity efficiency
A Machine Learning solution could solve the problem
A research activity was carried out in order to find the best solution
A final solution is being developed as an integration in the enterprise SW

Thank you for your attention
Any questions?

Marco Benedetti, Gennaro Catapano, Francesco De Sclavis*, Roberto Favaroni, Giuseppe Galano, Andrea Gentili, Marco [...]
*Intern at ART

