Text Mining For Central Banks - LSE Research Online


David Bholat, Stephen Hansen, Pedro Santos and Cheryl Schonhardt-Bailey
Text mining for central banks
Article (Published version)

Original citation: Bholat, David, Hansen, Stephen, Santos, Pedro and Schonhardt-Bailey, Cheryl (2015) Text mining for central banks: handbook. Centre for Central Banking Studies (33). pp. 1-19. ISSN 1756-7270. © 2015 Bank of England

This version available at: http://eprints.lse.ac.uk/62548/
Available in LSE Research Online: June 2015

LSE has developed LSE Research Online so that users may access research output of the School. Copyright and Moral Rights for the papers on this site are retained by the individual authors and/or other copyright owners. Users may download and/or print one copy of any article(s) in LSE Research Online to facilitate their private study or for non-commercial research. You may not engage in further distribution of the material or use it for any profit-making activities or any commercial gain. You may freely distribute the URL (http://eprints.lse.ac.uk) of the LSE Research Online website.

Centre for Central Banking Studies
Handbook – No. 33: Text mining for central banks
David Bholat, Stephen Hansen, Pedro Santos and Cheryl Schonhardt-Bailey
CCBS 25 Year Anniversary

CCBS Handbook No. 33
Text mining for central banks
David Bholat, Stephen Hansen, Pedro Santos and Cheryl Schonhardt-Bailey(*)

Although often applied in other social sciences, text mining has been less frequently used in economics and in policy circles, particularly inside central banks. This Handbook is a brief introduction to the field. We discuss how text mining is useful for addressing research topics of interest to central banks, and provide a step-by-step primer on how to mine text, including an overview of unsupervised and supervised techniques.

(*) We thank Ayeh Bandeh-Ahmadi, Aude Bicquelet, David Bradnum, Peter Eckley, Jo Gill, David Gregory, Sujit Kapadia, Tom Khabaza, Christopher Lovell, Rickard Nyman, Paul Ormerod, Paul Robinson, Robert Smith, David Tuckett, Iulian Udrea and Derek Vallès for their input.

ccbsinfo@bankofengland.co.uk
Centre for Central Banking Studies, Bank of England, Threadneedle Street, London, EC2R 8AH

The views expressed in this Handbook are those of the authors, and are not necessarily those of our employers.
Series editor: Andrew Blake, email andrew.blake@bankofengland.co.uk
This copy is also available via the internet site at www.bankofengland.co.uk/education/ccbs/handbooks lectures.htm
© Bank of England 2015
ISSN: 1756-7270 (Online)

Contents
Introduction
1 Text as data for central bank research
2 Primer on text mining techniques
   Analytical pre-processing
   Boolean techniques
   Dictionary techniques
   Weighting words
   Vector space models
   Latent Semantic Analysis
   Latent Dirichlet Allocation
   Descending Hierarchical Classification
   Supervised machine learning
Conclusion
References
Further reading
Glossary

Text mining for central banks

Introduction

Text mining (sometimes called natural language processing(1) or computational linguistics) is an umbrella term for a range of computational tools and statistical techniques that quantify text.(2) Text mining is similar to reading in that both activities involve extracting meaning from strings of letters. However, the computational and statistical analysis of text differs from reading in two important respects. First, computer-enabled approaches can process and summarise far more text than any person has time to read. And second, such approaches may be able to extract meaning from text that is missed by human readers, who may overlook certain patterns because they do not conform to prior beliefs and expectations.

Although widely applied in other fields such as political science and marketing, text mining has historically been less used as a technique in economics. This is particularly the case with respect to research undertaken inside central banks. There may be a couple of reasons why this has been the case. First, it may not be obvious that text can be described and analysed as quantitative data.(3) As a result, there is probably a lack of familiarity in central banks with the tools and statistical techniques that make this possible. Second, even if central bankers have heard of text mining, they already have access to other readily available quantitative data. The opportunity and other costs of transforming texts into quantitative data, and of learning new tools and techniques to analyse these data, may be viewed as outweighing the expected benefits.

However, text mining may be worth central banks' investment because these techniques make tractable a range of data sources which matter for assessing monetary and financial stability and cannot be quantitatively analysed by other means. Key text data for central banks include news articles, financial contracts, social media, supervisory and market intelligence, and written reports of various kinds.

With text mining techniques, we can analyse one document or a collection of documents (a corpus). A document could be a particular speech by a Bank of England (hereafter 'the Bank') Monetary Policy Committee (MPC) member, a staff note, or a field report filed by an Agent.(4) The corresponding corpus would be all MPC member speeches, staff notes, and field reports, respectively.

Although the intentional use of text mining techniques by central banks is still limited, central bankers already reap the benefits of text mining applications on a daily basis. Consider, for example, how often central bankers (or anyone) Google for information, or use spellcheck before publishing documents. Consider also the spam detection firewalls used to insulate central banks from cyber-attacks, or the query functionality in citation databases used for retrieving the existing scholarly literature on any given topic. In these and other instances, text mining techniques operate in the background to help central banks perform their jobs more efficiently.

The additional purpose of this Handbook, then, is to demonstrate the value central banks may gain from a more conscious application of text mining techniques, and to explain some of these techniques using examples relevant to central banks. The Handbook proceeds in two main parts. The first part explains how text mining can be applied in central bank research and policymaking, drawing on examples from the existing literature.
The second part of the Handbook provides a step-by-step primer on how to mine text. We begin by explaining how to prepare texts for analysis. We then discuss various text mining techniques, starting with some intuitive approaches, such as Boolean and dictionary techniques, before moving on to those that are more elaborate, namely Latent Semantic Analysis, Latent Dirichlet Allocation and Descending Hierarchical Classification.

Boolean and dictionary text mining, on the one hand, and Latent Semantic Analysis, Latent Dirichlet Allocation and Descending Hierarchical Classification, on the other, map onto different epistemologies, that is, different approaches to knowledge-making: deduction and abduction, respectively.(5) Deduction starts from a general theory and then uses particular datasets to test the validity of the theory. By contrast, abduction attempts to infer the best explanation for a particular event based on some data, without ambition to generate an explanation generalisable to other cases.(6) Boolean and dictionary text mining are deductive approaches in that they start with a predefined list of words, motivated by a general theory as to why these words matter. The strengths of this approach are simplicity and scalability: code for its implementation is typically just a few lines long, and can be applied easily to massive text files. However, the weakness of this approach is its focus only on words pre-judged by the researcher to be informative, while ignoring all other words.

(1) Natural language processing is the computational processing and analysis of naturally occurring human languages, as opposed to programming languages, like Java.
(2) There are also computer-assisted approaches for qualitative analysis of text. These are outside the scope of this Handbook. However, see the following link for an overview and comparison of some of the qualitative text mining tools: ...esearchcentres/caqdas/support/choosing/. See also Upshall (2014).
(3) Text is sometimes called unstructured data, in contrast to structured data (numbers). However, referring to text as unstructured is somewhat misleading. Text does have structure; most obviously grammar, but also structural patterns of various kinds that text mining techniques extract.
(4) Agents are Bank staff scattered across the UK who provide intelligence on local economic conditions.
(5) Of course, deduction and abduction are ideal types. In reality, all explanatory approaches are mixed. Nevertheless, we think this classification helps situate different text mining techniques in terms of their similarities and differences.
(6) Induction is a third epistemology that, like abduction, starts from data without priors, but like deduction, then seeks to generate general theoretical claims.
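To make the deductive approaches concrete, the sketch below applies a Boolean filter and a dictionary count to a tiny corpus in Python. The corpus, query terms and dictionary are invented for illustration only; they are not word lists used by the Bank.

```python
import re

# Hypothetical mini-corpus; in practice this could be MPC speeches, staff notes or Agents' reports.
corpus = [
    "Inflation expectations remain anchored despite rising energy prices.",
    "Credit conditions tightened and unemployment rose sharply.",
    "The committee judged that growth would recover gradually.",
]

def tokenise(document):
    """Lower-case a document and split it into word tokens."""
    return re.findall(r"[a-z']+", document.lower())

# Boolean technique: retain only documents containing at least one query term.
query_terms = {"inflation", "unemployment"}          # illustrative, theory-motivated word list
matches = [doc for doc in corpus if query_terms & set(tokenise(doc))]

# Dictionary technique: count occurrences of words from a predefined dictionary in each document.
tightening_words = {"tightened", "rose", "falling"}  # illustrative dictionary
counts = [sum(token in tightening_words for token in tokenise(doc)) for doc in corpus]

print(matches)   # documents matching the Boolean query
print(counts)    # dictionary word counts per document, e.g. [0, 2, 0]
```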

By comparison, Latent Semantic Analysis, Latent Dirichlet Allocation and Descending Hierarchical Classification infer thematic patterns in a particular corpus without claiming that these patterns hold in other documents. The main strength of these techniques is that they analyse all words within the sampling frame and yield more sophisticated statistical outputs. Their main disadvantage is programming complexity.

Text mining is a vast topic. By necessity we have had to be selective in the techniques we cover in the Handbook. We mostly focus on unsupervised machine learning techniques. Unsupervised machine learning involves taking unclassified observations and uncovering hidden patterns that structure them in some meaningful way.(1) These techniques can be contrasted with supervised machine learning. Supervised machine learning starts with a researcher classifying observations to 'train' an algorithm, under human 'supervision', to 'learn' the correlation between the researcher's ascribed classes and words characteristic of documents in those classes (Grimmer and Stewart (2013)). While we touch on supervised machine learning briefly in conclusion, the focus of this Handbook is on unsupervised machine learning techniques because they resonate with the Bank's evolving 'big data' ethos (Bholat (2015); Haldane (2015)). Throughout the Handbook we bold key terms where they are defined, as we have done in this introduction.

1 Text as data for central bank research

In order to motivate our discussion of text mining, we start by considering current core research topics of interest to central banks, using the Bank's recently released One Bank Research Agenda (OBRA) Discussion Paper as a proxy (Bank of England (2015)). Indeed, the OBRA Discussion Paper identifies text as a data source whose analytical potential has not been fully tapped. This is particularly so given that there is a wealth of new text data available via social media and internet search records. McLaren and Shanbhogue (2011) offer a fine example of what can be done. Using Google data on search volumes, they find that such data provide a timelier tracking of key economic variables than do official statistics. For instance, Figure 1 shows that Google searches for Jobseeker's Allowance (JSA)(2) closely track official unemployment.

Figure 1 Googling the labour market
[Chart: the unemployment rate (per cent of the labour force) and a Google index of JSA searches, plotted by year.]
Source: McLaren and Shanbhogue (2011).

One issue that interests central banks is measuring risk and uncertainty in the economy and the financial system. A recent contribution in this direction is research by Nyman et al. (2015). Nyman and his co-authors start from a general theory – the emotional finance hypothesis. This is the hypothesis that individuals gain conviction to take positions in financial markets by creating narratives about the possible outcomes of their actions. These conviction narratives embody emotion such as excitement about expected gains and anxiety about possible losses. According to Nyman and his co-authors, these narratives are not composed by individuals in isolation. Rather, they are constructed socially, through interactions like when individuals talk to one another. Through these social interactions, narratives are created and disseminated, with potential impact on asset prices.

They test their hypothesis by looking at three text data sources: the Bank's daily market commentary (2000-2010), broker research reports (2010-2013) and the Reuters News Archive (1996-2014). Sentiment is measured by constructing the sentiment ratio in Equation 1.

Equation 1 Sentiment ratio(a)

SI[T] = (Excitement − Anxiety) / T

Source: Nyman et al. (2015).
(a) SI[T] is the sentiment ratio of document T, Excitement is the number of 'excitement' words, Anxiety is the number of 'anxiety' words and T is the total number of words in document T.

The sign of the ratio gives an indication of market sentiment: bullish, if the ratio is positive, or bearish, if the number is negative. The ratio is then compared with historical events and other financial indicators.

In addition, they measure narrative consensus. In particular, their approach is to group articles into topic clusters.(3) The entropy of the distribution of topics then acts as a proxy for uncertainty. In other words, reduced entropy in the topic distribution is used as an indicator of topic concentration, or consensus. Figure 2 depicts the time series for the sentiment index and the consensus measure. The authors find evidence of herding behaviour (reduced entropy) and increased excitement ahead of the recent financial crisis.

(1) The outputs of algorithms for unsupervised machine learning can be used as inputs into econometric models for predicting some variable of interest, but this is a different approach from intentionally choosing the dimensions of content based on their predictive ability.
(2) Unemployment benefit paid by the government of the United Kingdom.
(3) In particular, the authors use the X-means clustering algorithm, which employs the Bayesian Information Criterion (BIC) to detect the optimal number of clusters. They then use Shannon entropy as a measure of the topic distribution. Increased consensus is gauged through (1) a reduction in the number of topic clusters, when the actual size of each cluster remains unchanged, and (2) relative growth of one particular topic, for a fixed number of topic clusters.
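To illustrate Equation 1 and the entropy-based consensus measure described in footnote (3), the sketch below computes a sentiment ratio and the Shannon entropy of a topic distribution in Python. The excitement and anxiety word lists are invented stand-ins for those used by Nyman et al. (2015), and the clustering step itself (X-means) is not shown.

```python
import math
import re

# Invented stand-ins for the 'excitement' and 'anxiety' word lists used by Nyman et al. (2015).
EXCITEMENT = {"rally", "boom", "surge", "optimism"}
ANXIETY = {"fear", "crisis", "losses", "uncertainty"}

def sentiment_ratio(document):
    """Equation 1: (excitement word count - anxiety word count) / total word count."""
    words = re.findall(r"[a-z']+", document.lower())
    if not words:
        return 0.0
    excitement = sum(w in EXCITEMENT for w in words)
    anxiety = sum(w in ANXIETY for w in words)
    return (excitement - anxiety) / len(words)

def topic_entropy(cluster_sizes):
    """Shannon entropy of the topic distribution; lower entropy indicates greater consensus."""
    total = sum(cluster_sizes)
    return -sum((s / total) * math.log2(s / total) for s in cluster_sizes if s > 0)

print(sentiment_ratio("Markets continued to rally on optimism about growth"))  # positive => bullish
print(topic_entropy([80, 10, 10]), topic_entropy([34, 33, 33]))                # concentration => lower entropy
```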

Figure 2 Sentiment and entropy in Reuters' News Archive
[Chart: time series of the sentiment ratio and topic entropy, plotted by year.]
Source: Nyman et al. (2015).

Once uncertainty in the economy is measured, central banks aim to manage it. This is one of the main motivations behind the Bank's recent policy of forward guidance (Carney (2013)), by which the Bank steers expectations about the future direction of policy through communications of future intentions and official forecasts. Text mining can help here, and in other similar instances, to measure the extent to which Bank officials are communicating a consistent message to the outside world.(1) And assessing the efficacy of the Bank's communications is an area of research identified by the OBRA Discussion Paper.

Figure 3 from Vallès and Schonhardt-Bailey (2015) exemplifies the kind of research that can be conducted. The figure depicts the thematic content of MPC speeches and minutes in the last year of Mervyn King's Governorship and the first year of Mark Carney's Governorship.

Figure 3 Thematic content of MPC Minutes(a)
[Two graphs plotting topics and speakers along Factor 1 and Factor 2, one for each Governorship.]
Source: Vallès and Schonhardt-Bailey (2015).
(a) These graphs depict the correlations between topics and speakers in the King and Carney Governorships. The positions of the points and the distances between them reflect the degree of co-occurrence. The axes identify the maximum amount of association along factors, as explained in greater detail in Section 2.

Each graph spatially represents co-occurrences – that is, the convergence and divergence of individuals in speaking about certain topics. Spatial proximity suggests a greater degree of co-occurrence. For instance, in both graphs, the 'Real Economy' topic class is closely associated with the MPC when it speaks as a committee in its minutes.

A distinct divide can be detected in the topics discussed by MPC members in their external speeches during the Carney era. While some members used their speeches to discuss forward guidance, others did not. Vallès and Schonhardt-Bailey thus shed light on where the committee as a whole conveys one message while individual members deliver more varied messages.(2)

(1) See Rosa and Verga (2006), Blinder et al. (2008), Jansen and Haan (2010) and Bennani and Farvaque (2014) for similar investigations into the consistency of central banks' communication. However, consistency in communication is not always good. For example, Humpherys et al. (2011) developed models to identify fraudulent financial statements from management communications and found evidence that fraudulent statements are likely to contain less lexical diversity.
(2) Other recent papers using text mining to understand central banks' communications include analysis by Bulir et al. (2014) of central banks' inflation reports, and analysis by Siklos (2013) of minutes from five central banks, showing how their diction changed after the financial crisis. Also, Nergues et al. (2014) derive network metrics to investigate changes in discourse in the European Central Bank before and after the financial crisis.
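The maps in Figure 3 are produced with the authors' own topic-classification tooling; as a loose illustration only, the sketch below projects a small, invented speaker-by-topic co-occurrence table onto two factors using a standard correspondence-analysis computation in NumPy. Speakers and topics that load similarly end up close together, mimicking the spatial logic described above.

```python
import numpy as np

# Invented speaker-by-topic co-occurrence counts (rows: minutes/speakers, columns: topic classes).
counts = np.array([
    [40.0, 10.0,  5.0],   # MPC minutes
    [12.0, 30.0,  8.0],   # speaker A
    [ 6.0,  9.0, 25.0],   # speaker B
])

P = counts / counts.sum()
r = P.sum(axis=1)                     # row masses
c = P.sum(axis=0)                     # column masses

# Standardised residuals; their singular vectors define the factors (axes) of the map.
S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)

row_coords = (U * s) / np.sqrt(r)[:, None]     # positions of speakers on the factors
col_coords = (Vt.T * s) / np.sqrt(c)[:, None]  # positions of topics on the factors

print(np.round(row_coords[:, :2], 2))          # plotting these two columns gives a Figure 3-style map
print(np.round(col_coords[:, :2], 2))
```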

Just as central banks want to understand whether they are communicating a consistent message, they are equally interested in whether the various policies they enact are complementary or conflicting. In fact, the OBRA Discussion Paper identifies understanding the interactions between monetary, macro-prudential and micro-prudential policy as an important research topic for the Bank. In order to understand these interactions, text mining may be useful. Here we draw inspiration from a recent paper by William Li and his co-authors titled "Law is Code" (Li et al. (2015)). Their paper tracks the increasing complexity of American law over time by analysing the complete United States (US) Code from 1926 up to the present. Striking a chord similar in tone to Haldane's (2012) critique of complex financial regulation, the authors argue that the increasing complexity of legal code makes it difficult to understand, generates negative unintended consequences, and is a potential drag on productivity. In order to capture empirically the increasing complexity of the US Code, the authors produce several text-based metrics. These include:

1 Metrics assessing the size and substance of the Code over time. The authors interpret the lengthening of the Code, as measured by word count, to mean it is increasingly burdensome. They note that the gross size of the Code and its rate of growth have been increasing in recent decades. In addition, they track changes in the substance of specific sections of the Code by comparing the words added and deleted across time. For instance, Figure 4 shows changes in Title 12 (Banks and Banking) of the Code between 1934 and 1976.

[Figure 4 (chart): number of words conserved since 1934 in sections of Title 12 – §1841 Bank Holding Company Act definitions, §101 repealed (delivery of circulating notes), §1818 termination of status as insured depository institution, §1709 insurance of mortgages and §1813 Federal Deposit Insurance Act – plotted over the years 1934 to 1976.]
Source: Li et al. (2015).

Table 1 Sections of Title 12 of the US Code with greatest interconnectivity

There are obvious applications of the approach taken by Li and his co-authors
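As a rough sketch of the first family of metrics – gross word counts, and words added or deleted between vintages of a text – the example below compares two invented versions of a section in Python; it is a stand-in for illustration, not the authors' pipeline or the actual US Code.

```python
import re
from collections import Counter

def word_counts(text):
    """Count word occurrences in a piece of legal text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Invented stand-ins for two vintages of one section of a legal code.
version_1934 = "The term bank means any national banking association or state bank."
version_1976 = "The term bank means any insured depository institution or national banking association."

old, new = word_counts(version_1934), word_counts(version_1976)

size_change = sum(new.values()) - sum(old.values())   # growth (or shrinkage) in total word count
words_added = set(new) - set(old)                     # vocabulary introduced between vintages
words_deleted = set(old) - set(new)                   # vocabulary removed between vintages
words_conserved = set(old) & set(new)                 # vocabulary retained ('words conserved')

print(size_change, sorted(words_added), sorted(words_deleted), len(words_conserved))
```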

