
FROM TRADITIONAL TO ANALYTICAL DATA QUALITY MANAGEMENT
Our Point of View

THE NEW ERA OF DATA QUALITY MANAGEMENT IN BANKING IS HERE

History and context
Historically, banks have always been subject to large amounts of data flowing through their systems. Since the introduction of digital devices such as smartphones, tablets and wearables, the pace of data creation has exploded: estimates show that more than 90% of total data has been created in only the past two years. Banks are part of this trend and acquire and process increasing amounts of data. In the past, this data was gathered to a great extent in source administration systems. Over the years, however, banks increasingly deal with new regulations such as GDPR, TRIM, AnaCredit and BCBS239. Banks are therefore working intensively on integral databases where uniform granular data is available. This integration poses huge challenges for the data quality management (DQM) practices of banks. In addition, banks are being criticized and even fined for data quality issues in practices like fraud detection and anti-money laundering.

Traditional Data Quality Management
Traditional, human-based Data Quality Management (DQM) often consists of only semi-automatic data monitoring and highly manual data cleaning processes. Processing the growing bulk of complex, interdependent data is becoming increasingly costly and time-consuming as more and more DQ rules are required. As a result, crude rules are often applied which only touch the surface of data quality issues.

New approach
Our approach, Analytical Data Quality Management (ADQM), deals with these challenges by incorporating advanced analytical methods such as 'Anomaly detection' and 'Root cause analysis'. Methods like these allow us to identify anomalies in the data quickly and allow for the implementation of more refined DQ rules because the root cause is known. The efficiency and accuracy benefits of ADQM can be crucial for the success of banks, thereby helping to minimize financial costs, increase productivity, safeguard regulatory compliance, increase organizational innovation, improve the decision-making process and prevent loss of reputation.

Figure 1: The importance of data quality: data quality issues increase cost, decrease productivity, worsen reputation, negatively impact organizational innovation, weaken decision making and jeopardize the regulatory compliance of organizations:
- Financial costs: the average financial impact of poor data quality on banks is significant; poor-quality data costs approximately 30% of revenues.
- Productivity: tasks based on weak data are prone to errors, require manual corrections and more cleaning time; one third of analysts spend 40% of their time validating and vetting data before decision making, and data scientists spend 50-80% of their time collecting and preparing data.
- Regulatory compliance: failure to integrate new data standards required by regulations; inability to meet continuously evolving regulatory reporting requirements; fines from the regulator for being unable to be compliant.
- Organizational innovation: inability to rely on data for insight-driven business innovation and AI/predictive banking; data quality issues hindering the expansion of open banking.
- Decision making: weak business decisions; missed opportunities; poorly phrased business strategies in banking; inaccurate customer information.
- Reputation: misleading assumptions create customer satisfaction issues; media leakage of bad encounters harms the bank's image; employees and customers question the validity of the data used for validating services or goods, which deteriorates trust in products.

1. TRADITIONAL DATA QUALITY MANAGEMENT

Traditionally, Data Quality Management starts with defining business rules. It builds on metadata and the knowledge of subject matter experts. Traditional DQM is about creating data quality rules, the right controls and custom dashboards to improve the level of data quality. This also includes the implementation of methods such as Six Sigma and DQ KPIs.

Figure 2: Core elements of traditional data quality management:
01 Data Quality Management: business rules, methods, quality checks
02 Resources: roles & responsibilities, organization, culture
03 Tools: data governance, data catalogue
04 Data & systems descriptions: BI architecture and system principles, data flows, master data management, metadata management

Core elements
Traditional data quality management requires data quality policies in which the business defines why data quality matters and which issues in the data should be avoided. Data rules are designed for the business to establish controls on the data. Data quality management is a continuous process; adopting the Six Sigma method in data quality management therefore helps in the cycle of improvements.

Resources and tools should be in place as well. In order to steer and control this process, the right data governance should be set up. Defining resources and data-related roles such as the Chief Data Officer (CDO), Data Steward, Business Owner, Data Owner and Data Custodian leads to clear responsibilities and creates ownership of the data. This helps to establish a data-driven culture. To assist people in their work, the right tools should be implemented. Numerous tools can facilitate easy custom dashboards and quality controls.

In addition, it is important to structure and design the data architecture and system principles. An IT-architecture cartography of the key systems helps organizations to establish quality checks at the data entry interfaces and to define principles which promote automation, system harmonization and simplification of the business processes. One example of a popular change in the data architecture is the creation of a data quality gate that works as a gateway in the data sourcing layer; a sketch of this pattern follows below. Another example is the usage of data hubs. Data that passes the quality gate can be integrated in the data hubs. The implementation of master data management systems can provide common services to all applications and ensures access to master data. Consequently, a metadata repository serves as a single point of truth for the definitions of data elements and reports, which can be managed from all enterprise systems.
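To make the quality-gate pattern concrete, the sketch below shows how incoming records could be checked against a small set of rules before being admitted to a data hub. It is a minimal illustration under stated assumptions, not Capgemini's implementation: the rule set, field names and the in-memory "hub" are invented for the example.

```python
# Minimal data-quality-gate sketch: records that pass all rules are admitted to
# the data hub, the rest are routed to a remediation queue.
# The rules and field names below are illustrative assumptions.
from typing import Callable

Record = dict
Rule = Callable[[Record], bool]

rules: dict[str, Rule] = {
    "customer_id is present": lambda r: bool(r.get("customer_id")),
    "balance is numeric":     lambda r: isinstance(r.get("balance"), (int, float)),
    "currency is ISO code":   lambda r: r.get("currency") in {"EUR", "USD", "GBP"},
}

def quality_gate(records: list[Record]):
    """Split records into (accepted, rejected-with-failed-rule-names)."""
    accepted, rejected = [], []
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            rejected.append((record, failed))
        else:
            accepted.append(record)
    return accepted, rejected

incoming = [
    {"customer_id": "C001", "balance": 1250.0, "currency": "EUR"},
    {"customer_id": "",     "balance": "n/a",  "currency": "EUR"},
]
hub, remediation_queue = quality_gate(incoming)
print(len(hub), "record(s) admitted;", len(remediation_queue), "sent to remediation")
```

The design choice illustrated here is that the gate does not silently drop records: rejected records carry the names of the rules they violated, which is exactly the evidence a data steward needs to trigger remediation at the source.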

Figure 3: DMAIC cycle of Six Sigma (Define, Measure, Analyze, Improve, Control)

DMAIC cycle
The Six Sigma DMAIC method reflects the traditional DQM process for reaching continuous data quality improvements. In the beginning, data quality rules are defined based on metadata and an agreed business concept. Secondly, measurements are implemented and executed, and their results are aggregated into KPIs. A detailed analysis of the measurement results helps to identify root causes and hence prepare data quality improvements by deriving appropriate measures. Subsequently, the execution of the planned measures improves data quality. Based on KPI dashboards, data quality can then be controlled: either the improvement activities show the expected results, or the analysis is repeated and corrective actions are initiated.

Challenges with traditional DQM
Even with correct data governance, dedicated resources and solid processes, there are huge challenges in controlling current and future data flows within banks. Due to unprecedented changes in the technological environment and the continuous rise of new digital solutions, traditional DQM reaches its limits.

Firstly, the requirement to have access to granular data poses new challenges for data quality management. In the past, banks could report data from their administration at an aggregated level. Nowadays, a higher granularity of data is requested by the regulator. In combination with the enormous growth of data volume, this poses new challenges in data quality management. New digital devices and changing user behavior have accelerated the pace of data creation, roughly decupling data volumes between 2013 and 2020. Integrating this data throughout the IT landscape requires many checks and data quality rules. Hence, the traditional human-based DQM, which consists of only semi-automatic data monitoring and highly manual data cleaning processes, becomes more time consuming and less accurate.

Secondly, traditional data governance and data quality management platforms have proven too constrained to deal with the increased complexity and interdependency of the data. Static rules are the most common and work fine for well-defined data sources. In a modern data environment, however, simple rules struggle to keep up with the complexity of the data.

Of course, data stewards can come up with complex rules. These rules can be hard-coded manually, but that takes time: having subject matter experts define the rules and then developers code them can be time consuming. This makes scalability a costly challenge. What this means in practice is that many banks keep simple rules and engage in the cost-inducing activity of data cleansing only once in a while. This short-term problem solver delivers a quick fix for local data quality issues but skips the core of the problem. It may appear an easy and cheap solution locally, but in the long term, the total cost of staying in control of the data keeps growing. Banks can be excused, as understanding the underlying data problems is very difficult. Even with solid data quality management in place, the traditional way of working runs into human limitations when it comes to finding root causes.
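As a small, hedged illustration of the Measure and Control steps described above, the snippet below aggregates individual rule-check results into a pass-rate KPI per rule and flags rules that fall below a target. The rule names, records and target value are invented for the example.

```python
# Aggregate rule-level check results into a data quality KPI per rule
# (illustrative Measure-step sketch; rule names, data and target are invented).
import pandas as pd

check_results = pd.DataFrame({
    "rule":   ["iban_format", "iban_format", "dob_not_null", "dob_not_null", "dob_not_null"],
    "record": ["r1", "r2", "r1", "r2", "r3"],
    "passed": [True, False, True, True, False],
})

kpi = (check_results.groupby("rule")["passed"]
       .mean()
       .mul(100)
       .rename("pass_rate_pct"))
print(kpi)

# A simple Control-step trigger: flag rules whose KPI falls below a target.
target_pct = 95
print("Rules needing corrective action:", list(kpi[kpi < target_pct].index))
```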

2. OUR APPROACH: ANALYTICAL DATA QUALITY MANAGEMENT

Machine learning methods offer a new realm of possibilities when it comes to identifying, analyzing and solving data quality issues.

Analytical Data Quality Management (ADQM) has three additional core elements: Data Analytics, Technologies and Big Data Infrastructure. Built on top of the existing core elements, ADQM addresses the shortcomings of traditional data quality management. The analytical methods of ADQM can identify anomalies in data without a hard-coded DQ rule. Additionally, the root cause of a known DQ issue can be localized without complete knowledge of the data lineage and dependencies. Detecting anomalies and finding root causes complements traditional Data Quality Management and improves robustness and efficiency.

Figure 4: Core elements of (Analytical) Data Quality Management. The traditional elements (business rules, methods and quality checks; resources: roles & responsibilities, organization, culture; tools: data governance, data catalogue; data & systems description: BI architecture and system principles, data flows, master data management, metadata management) are complemented by three additional elements:
- Data analytics: predictive patterns, KPI dashboards, big data insights
- Technologies: DWH, cloud computing, stream processing, big data tools
- Big data infrastructure: big data processing, big data storage

Methods of ADQM
The two main methods, 'Anomaly detection' and 'Root cause analysis', form the core of ADQM and are explained in more detail below. For both, adapted supervised and unsupervised data analytics methods are used that are based on nothing more than the data itself.

The first method, 'Anomaly detection', finds unknown data quality problems through historical data patterns, which is possible without knowledge of the data's semantics. First, the dataset is encoded and compressed, thereby forcing the algorithm to find patterns. Secondly, the data is decoded again, thereby creating a representation of the original data. This method is also called an "autoencoder". ADQM uses autoencoder technology to investigate the reconstruction error between the original data and the data representation. Autoencoders learn the standard patterns of historical data and subsequently distinguish between regular and irregular records, i.e. hidden issues. This means the differences between the original data and the data representation are the anomalies. Once the anomalies are known, this insight can be used to alter the original data set.

Figure 5: Visualisation of autoencoder technology, a neural network that finds patterns in the data and creates a representation (original data is encoded into compressed data and decoded into a data representation; the difference is the reconstruction error). The differences between the original data and the representation are the anomalies.
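The following is a minimal sketch of the autoencoder idea described above: a network with a narrow hidden layer is trained to reconstruct its own input, and records with a large reconstruction error are flagged as anomalies. The synthetic data, network size and threshold are assumptions made for the example; this is not Capgemini's prototype.

```python
# Autoencoder-style anomaly detection sketch (assumptions: numpy/scikit-learn
# available; data, network size and threshold are illustrative).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic "historical" records: two correlated numeric attributes,
# plus a handful of records that break the pattern.
normal = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])
anomalies = rng.uniform(-6, 6, size=(5, 2))
data = np.vstack([normal, anomalies])

X = StandardScaler().fit_transform(data)

# A narrow hidden layer forces the network to compress (encode) and then
# reconstruct (decode) the input, i.e. an autoencoder trained on X -> X.
autoencoder = MLPRegressor(hidden_layer_sizes=(1,), activation="tanh",
                           max_iter=5000, random_state=0)
autoencoder.fit(X, X)

reconstruction = autoencoder.predict(X)
error = np.mean((X - reconstruction) ** 2, axis=1)   # reconstruction error per record

threshold = np.quantile(error, 0.99)                 # illustrative cut-off
flagged = np.where(error > threshold)[0]
print("Flagged record indices:", flagged)
```

In practice the threshold would be tuned against known issues or reviewed by data stewards; the point of the sketch is only that no hand-written DQ rule is needed to surface the irregular records.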

The second method, 'Root cause analysis', is a statistical model that enables a broader perspective on data quality issues by filtering for significant impact factors. The model finds data quality problems across multidimensional layers. Using supervised learning, root cause analysis enables investigation of several layers below the symptoms of data quality issues. First, patterns are identified in the symptoms by using the best-fitting separators in the data. Next, a partial dependency plot and model coefficients are created to approximate causation. ADQM does not merely look at correlating occurrences but analyses the impacting factors; in this sense, the deeper underlying problem is investigated. Lastly, the critical impact factors are identified.

Figure 6: Levels of DQ issue causation: 'Root Cause Analysis' investigates the deepest level of a data quality issue.

Before one or both methods can be applied, there are some prerequisites: the data needs to be prepared. Firstly, the bank should have implemented an integrated data model; structured and labeled data helps prepare for the use of the algorithms. Next to that, a data hub needs to be in place where uniform data is stored. Ideally, process data should be captured next to the content data; this includes data on the location, user, system and timing of entries or changes. Lastly, an analytical base table needs to be created to enable the algorithm to work its magic.

Figure 7: Screenshots of the prototypes of the "Root Cause Analysis" and "Anomaly Detection" methods of Capgemini Invent
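To illustrate the supervised root-cause idea on an analytical base table, the sketch below fits a gradient boosting classifier on a DQ-issue flag and ranks the impact factors by permutation importance (a partial dependence plot could be derived in the same way via sklearn.inspection). The column names, source systems and the synthetic issue pattern are assumptions made for the example, not the prototype shown in Figure 7.

```python
# Hedged root-cause-analysis sketch: which factors in an analytical base table
# separate issue-bearing records from clean ones? Columns and data are invented.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000
abt = pd.DataFrame({
    "source_system":   rng.choice(["core_banking", "cards", "mortgage"], size=n),
    "entry_channel":   rng.choice(["branch", "online", "batch"], size=n),
    "record_age_days": rng.integers(0, 365, size=n),
})

# Synthetic ground truth: issues cluster in one system's batch feed.
p = 0.02 + 0.30 * ((abt["source_system"] == "cards") & (abt["entry_channel"] == "batch"))
abt["dq_issue"] = rng.random(n) < p

X = pd.get_dummies(abt.drop(columns="dq_issue"))
y = abt["dq_issue"]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank the impact factors that best explain where issues occur.
imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = pd.Series(imp.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking.head(5))
```

The ranking narrows the investigation down to a few objects and systems (here the "cards" batch feed), which is the efficiency gain the method aims at.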

Analytical Data Quality Management significantly increases the robustness, efficiency and proactiveness of traditional data quality management.

Benefits of AI and Analytics for DQM
Banks can increase their robustness by using ADQM. Banks that adopt these statistical methods build resistance to outliers. 'Anomaly detection' facilitates continuous data quality improvements and proactive remediation by finding data quality issues before they cause process failures and errors. Root cause analysis enables organizations to understand the interdependencies between different data sources and to avoid gaps during the data creation or integration process. In practice, this means that organizations can be in true control of their data, which enables regulatory compliance and avoids data and process errors and costly regulatory fines.

Secondly, ADQM increases the efficiency of DQM processes as a result of 'Anomaly detection', i.e. automated data quality issue identification. When defining data quality rules based on metadata and the learnings of previous DMAIC cycles, the AI supports the analysis of measurement results in order to identify root causes and prepare data quality improvements. Another significant efficiency benefit is gained in the root cause analysis process. Traditional DQM processes are very manual, particularly when data lineage and dependencies are not transparent and clear. With the ADQM approach, AI methods are used which speed up the process. Even though ADQM cannot identify the reason for a certain group of issues completely automatically, it can significantly reduce the number of objects and systems to check. The result is that issues can be identified much faster, which enables data stewards to solve them much faster. Furthermore, it avoids new DQ issues in the future. In this sense, ADQM enables increased productivity and decreases future cost.

ADQM also creates a better understanding of the data and its dependencies as a result of the AI-driven root cause analysis. This can even be used to predict DQ issues that do not yet exist: identifying the patterns that lead to DQ issues can be used to take preventive action. This opens the door to new opportunities and enables organizational innovation. With 'Anomaly detection' in conjunction with a big data environment, banks could react to DQ issues in real time. In some cases, it is even possible to correct these issues and to use the already cleansed data in further processes.

Lastly, finding the root causes of data quality issues enables stakeholders to take sustainable actions and improve decision-making based on data. In this sense, ADQM is the next step for data-driven banks.

Figure 8: ADQM assists to solve the five main quality issues of correctness, completeness, validity, timeliness and consistency:
- Consistency: the degree to which a unique piece of data holds the same value across multiple data sets. ADQM method: calculate data set closeness using probabilistic record linkage.
- Correctness: the degree of conformity of a data element or a data set to an authoritative source. ADQM method: anomaly detection (unsupervised, e.g. autoencoder).
- Completeness: the degree to which all required occurrences of data are populated. ADQM method: missing data imputation using causality and correlation between attributes.
- Validity: the measure of how a data value conforms to its domain value set (i.e. value, range of values). ADQM methods: validate existing DQ rules; identify root causes of data quality issues.
- Timeliness: the degree to which data is available when it is required. ADQM method: identify process and system bottlenecks using process mining of log files (NLP).
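As a hedged illustration of the completeness method listed above, the sketch below imputes missing values by exploiting the correlation between attributes. The columns and figures are invented for the example; scikit-learn's IterativeImputer is used as one possible, generic implementation of this idea, not as the method referenced in Figure 8.

```python
# Illustrative completeness sketch: impute missing values from correlated
# attributes. The loan columns and figures are invented for the example.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "loan_amount":     [200_000, 150_000, np.nan, 300_000, 250_000],
    "monthly_payment": [950, 720, 1180, np.nan, 1190],
    "term_months":     [240, 240, 240, 300, 240],
})

# Each attribute with gaps is modelled as a function of the others, so the
# imputed figures respect the observed correlations between attributes.
imputer = IterativeImputer(random_state=0)
completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(completed.round(0))
```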

3. EXAMPLES OF USE CASES IN THE BANKING INDUSTRY

Anomaly Detection Improves Risk Domain Data

Capgemini Invent helped a bank upgrade a dataset in the risk domain by improving its data management practice with the Capgemini Data Management Flywheel as a framework, building a data quality monitoring dashboard and applying machine learning techniques to improve data quality.

The dataset in the risk domain (risk management banking software includes systems such as Matlab, SAS, SCART, RiskPro and Murex) was facing challenges such as inadequate data management and poor data quality. Data objects included customers, accounts, products and services as well as transaction types and a balance history, and each data object had attributes (type, length, scaling, integer, count, number, defined and operation code). There was also a lack of evidence of data quality issues to communicate to the data sources. The bank wanted to gain a better understanding of the relationships between its disparate pieces of data, which traditional databases found difficult to analyze meaningfully.

Our ADQM approach enabled risk functions to make use of structured and unstructured customer information. This increased the predictive power of the models and led to better credit risk decisions by monitoring portfolios for early evidence of existing or potential problems and by detecting financial crime. For example, a fraudulent transaction may be for a product the account owner has never bought or would likely never buy, or the geographical location of the person who made the purchase may not coincide with where the account owner was at the time of purchase. After being trained, the machine learning algorithm can detect these inconsistencies, so it will be more sensitive to those data points within transactions and flag them if the location data and the purchased product are suspicious.

The dashboard was developed to give frequent, periodic insights into the quality of the data. It contains three sections: a summary that shows an overall overview of the data quality of the variables, and dashboards for individual data variables that give more insight into the data quality of a specific variable; a minimal sketch of such a per-variable summary follows below. Workshops with experts were organized to gather more advanced, cross-relational data quality business rules, which were incorporated in the data quality monitoring dashboard. The dashboard brings transparency to data quality issues and can be used as evidence to communicate with stakeholders and to trigger the data quality remediation process.

Machine learning algorithms were used to help identify anomalies, which helps to increase the quality of the data. Anomaly detection algorithms can find strange data entries that the usual rule-based data quality rules would not find. Application of the algorithms helps to narrow down the potential pool of anomalies, which can then be investigated further. The technology brought false positives down in the anti-money laundering function, allowing for focused approaches to risk detection and avoidance.

Machine learning techniques helped improve the accuracy of risk models by identifying non-linear and complex patterns in large datasets.

The following was achieved:
- Creation and enrichment of data assets in an efficient way
- Management of the data lifecycle, especially when it comes to sensitive and retiring data
- Improvement of data use and discovery by users
- Reduced risk costs and fines
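The sketch below shows the kind of per-variable summary the monitoring dashboard described above could present. The dataset, column names and the two example metrics (completeness and validity) are assumptions made for the illustration, not the client's actual dashboard.

```python
# Hedged sketch of a per-variable data quality summary for a monitoring
# dashboard. The records, columns and rules are illustrative assumptions.
import numpy as np
import pandas as pd

records = pd.DataFrame({
    "account_id": ["A1", "A2", None, "A4", "A5"],
    "balance":    [1200.0, -50.0, 300.0, np.nan, 99_000.0],
    "currency":   ["EUR", "EUR", "XXX", "EUR", "USD"],
})

valid_currencies = {"EUR", "USD", "GBP"}

summary = pd.DataFrame({
    "completeness_pct": records.notna().mean() * 100,
    "validity_pct": {
        "account_id": records["account_id"].str.match(r"^A\d+$", na=False).mean() * 100,
        "balance":    records["balance"].ge(0).mean() * 100,
        "currency":   records["currency"].isin(valid_currencies).mean() * 100,
    },
})
print(summary.round(1))
```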

ADQM improves 'Know Your Customer' and 'Anti Money Laundering'

Know Your Customer (KYC) and Anti Money Laundering (AML) have been in the news regularly. Data is the foundation of both the KYC and the AML process. Manually screening all clients and verifying all transactions is a huge challenge, due to the large amounts of data that come with this. Capgemini Invent proposes ADQM as a solution.

Financial institutions perform background checks before accepting customers, leading to a large pool of customer data. Names are checked against blacklists and sanction lists. Using Natural Language Processing (NLP) in combination with several other Artificial Intelligence (AI) techniques, the quality of the data that the bank collects can be increased. For example, AI and NLP might be able to link a news article containing negative, implicating information about 'John Smith' to the actual 'John Smith' who is trying to open a bank account. In that case, the John Smith named in the news article is not allowed to become a customer, while a second, different John Smith does not get unnecessarily flagged. Being flagged here would imply that the bank needs to investigate a potential customer further. In this way, financial institutions are able to increase the quality of their data and can make more informed decisions and take them more quickly.

'Anomaly detection' can be leveraged within transaction monitoring with the goal of preventing money laundering. Banks hold large amounts of transaction data. With this data, the goal is to discover criminal activities, preferably with as few false positives as possible. A false positive here can be, for example, a transaction of a large sum of money to a certain country that is flagged as suspicious, while further manual investigation shows that there is no proof of criminal activity.

The autoencoder technology investigates the data and learns from patterns in historical data to flag outliers in the transactions. Through this, 'Anomaly detection' finds fewer false positives, leading to a reduced need to manually verify flagged transactions. Anomaly detection algorithms not only reduce false positives but can be used to detect suspicious transactions as well. The advantage of using algorithms over traditional methods is that manually determined rules are static and never all-covering, while algorithms detect previously unknown, real patterns. Finding suspicious transactions can be done using two types of algorithms. The first type is a supervised algorithm, which needs a training data set in which the true anomalies are known; this algorithm also requires user input to learn. The second type is an unsupervised algorithm, which has a self-learning ability and can update itself based on new data.

Capgemini has implemented both Root Cause Analysis and Anomaly Detection at client sites. A showcase is available to demonstrate the principles and the look & feel. Please reach out to us if you are interested in a demo.
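As a deliberately simple illustration of the list-screening step described above, the sketch below matches a candidate name against a sanctions list using fuzzy string similarity. It is far more basic than the NLP/AI approach the paper describes; the names, the list and the threshold are invented for the example.

```python
# Minimal, illustrative list-screening sketch using fuzzy name matching.
# The sanctions list, names and threshold are invented assumptions.
from difflib import SequenceMatcher

sanctions_list = ["John A. Smith", "Maria Gonzalez", "Ivan Petrov"]

def screen(candidate: str, threshold: float = 0.85) -> list[tuple[str, float]]:
    """Return sanction-list entries whose similarity to the candidate exceeds the threshold."""
    hits = []
    for entry in sanctions_list:
        score = SequenceMatcher(None, candidate.lower(), entry.lower()).ratio()
        if score >= threshold:
            hits.append((entry, round(score, 2)))
    return hits

print(screen("John Smith"))   # likely hit -> needs further investigation
print(screen("Jane Doe"))     # no hit -> not flagged
```

The threshold trade-off mirrors the false-positive discussion above: a lower threshold catches more spelling variants but creates more flags that must be manually verified.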

4. ABOUT THE AUTHORS

Casper Stam, Head of Data, Finance, Risk and Regulatory, Netherlands
Frank Clement, Lead Innovative, Netherlands
Thomas Bakker, Senior Consultant Data, Finance, Risk and Regulatory, Netherlands
Elena Paniagua Avila, Senior Consultant Data, Finance, Risk and Regulatory, Netherlands
Koen Wolvers, Consultant Data, Finance, Risk and Regulatory, Netherlands
Chao Bao, Senior Consultant Data, Finance, Risk and Regulatory, Netherlands

ABOUT CAPGEMINI INVENT

As the digital innovation, consulting and transformation brand of the Capgemini Group, Capgemini Invent helps CxOs envision and build what's next for their organizations. Located in more than 30 offices and 22 creative studios around the world, its 6,000-strong team combines strategy, technology, data science and creative design with deep industry expertise and insights, to develop new digital solutions and business models of the future.

Capgemini Invent is an integral part of Capgemini, a global leader in consulting, technology services and digital transformation. The Group is at the forefront of innovation to address the entire breadth of clients' opportunities in the evolving world of cloud, digital and platforms. Building on its strong 50-year heritage and deep industry-specific expertise, Capgemini enables organizations to realize their business ambitions through an array of services from strategy to operations. Capgemini is driven by the conviction that the business value of technology comes from and through people. It is a multicultural company of over 200,000 team members in more than 40 countries. The Group reported 2018 global revenues of EUR 13.2 billion.

Visit us at www.capgemini.com/invent

The information contained in this document is proprietary. © 2019 Capgemini. All rights reserved.
