
Paper 1066-2017

A Big Data Challenge: Visualizing Social Media Trends about Cancer using SAS Text Miner

Scott Koval, Yijie Li, and Mia Lyst, Pinnacle Solutions, Inc.

ABSTRACT

Analyzing big data and visualizing trends in social media is a challenge that many companies face as large sources of publicly available data become accessible. While the sheer size of usable data can be staggering, knowing how to find trends in unstructured textual data is just as important an issue. At a Big Data conference, data scientists from several companies were invited to tackle this challenge by identifying trends in cancer using unstructured data from Twitter users and presenting their results. This paper explains how our approach using SAS analytical methods was superior to other Big Data approaches in investigating these trends.

INTRODUCTION

In recent years, public interest and participation have become the heart of big data. In particular, data from social media has increased exponentially. At the 2016 Indy Big Data Conference, a Visualization Challenge offered a way for companies to share their methodologies for handling large and complex datasets. The Visualization Challenge required participants to mine a defined set of Twitter tweets and produce visualizations that offer trends and insights into cancer. The raw data contained more than 143,845,720 individual tweets with three attributes identifying a user ID, date and time, and text content of each tweet. The data was not allowed to be augmented in any way.

Given the size and complexity of the data, only four companies ultimately participated in this challenge, with only two weeks to generate results and a presentation.

The traditional way of analyzing this data is hypothesis-based, where data is examined based on a particular question of interest. While other companies followed this approach using modules of the Hadoop framework or other visualization tools (e.g., Hadoop, Apache Spark, Apache Solr, Datameer), we offered a data-driven, analytical solution by combining SAS Enterprise Guide, SAS Enterprise Miner, and SAS Visual Analytics along with other tools (Hadoop, Python, and Spark). This allowed the data to tell the story rather than restricting the outcomes based on our limited knowledge of current cancer trends.

METHODOLOGY

Our solution blended different technologies in order to apply the best features of each to the appropriate function. As a result, Python, Spark, and Scala were used to process the data in a timely manner, and SAS was used for text analytics and visualizations (Figure 1) to identify trends.

Figure 1. Indy Big Data Challenge Solution

The raw data provided for this challenge existed in 555,000 separate CSV files (18 GB). We used Python, Spark, and Scala to merge these files together and then imported the data into SAS Enterprise Guide for additional preprocessing. A query was created to determine whether or not a tweet contained a reference to the word ‘cancer’ or any related terms. This filter narrowed the data to the topic at hand. Retweets, or messages that are simply shared, were also removed from the data in order to prevent a bias in the results. Overall, the filtered data included 1.9 million cancer-related tweets to analyze.

While only three columns were provided in the raw data, we were able to create additional fields to help investigate the data. This included breaking up the date field into year, month, day, day of week, and time of day variables. An additional field was created to hold a cleaned-up version of the tweet, which retained alphanumeric values. Binary variables were also created to flag the message as containing a mention or hashtag. The number of mentions and hashtags used in each message was also calculated.

Figure 2. SAS Enterprise Miner process flow for creating text topics on cancer

These processed data were imported into SAS Enterprise Miner for analysis (Figure 2). This program contains an add-on called SAS Text Miner, a useful tool for analyzing unstructured text data in order to identify underlying topics and segments of words. Cancer is a very broad topic, so to explore trends we used this software to determine a list of text topics present in the data and detect any underlying themes.

After the data were randomly sampled, the first step in the analysis was to parse it using the Text Parsing node in SAS Text Miner.
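
As a rough illustration of the preprocessing described earlier in this section (the cancer-term filter, retweet removal, and the derived date, cleaned-text, mention, and hashtag fields), the following Python sketch may help. It is not the SAS Enterprise Guide query we used. The term list, the "RT @" retweet convention, the timestamp layout, and every function and field name are assumptions made for illustration only.

```python
import re
from datetime import datetime

# Hypothetical term list: the paper does not publish the actual query terms.
CANCER_TERMS = ("cancer", "tumor", "chemo", "melanoma", "leukemia")


def is_retweet(text):
    """Assume the common convention that retweets start with 'RT @'."""
    return text.lstrip().upper().startswith("RT @")


def mentions_cancer(text):
    """Case-insensitive substring match against the assumed term list."""
    lowered = text.lower()
    return any(term in lowered for term in CANCER_TERMS)


def derive_fields(user_id, timestamp, text):
    """Build the extra columns from the three raw attributes."""
    # Assumed timestamp layout; the real files may use a different one.
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    mentions = re.findall(r"@\w+", text)
    hashtags = re.findall(r"#\w+", text)
    # Keep letters, digits, and spaces only, then collapse repeated spaces.
    clean = re.sub(r"[^A-Za-z0-9 ]+", " ", text)
    clean = re.sub(r" +", " ", clean).strip()
    return {
        "user_id": user_id,
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "day_of_week": ts.weekday(),  # Monday = 0
        "hour": ts.hour,
        "clean_text": clean,
        "has_mention": bool(mentions),
        "has_hashtag": bool(hashtags),
        "n_mentions": len(mentions),
        "n_hashtags": len(hashtags),
    }


def preprocess(rows):
    """Keep cancer-related, non-retweet rows and add the derived fields."""
    return [
        derive_fields(user_id, timestamp, text)
        for user_id, timestamp, text in rows
        if mentions_cancer(text) and not is_retweet(text)
    ]


# Made-up rows for illustration only.
sample_rows = [
    ("u1", "2014-05-08 14:30:00", "Wear sunscreen! #skincancer @RealHughJackman"),
    ("u2", "2014-05-08 15:00:00", "RT @someone: breast cancer awareness walk"),
    ("u3", "2014-05-09 09:15:00", "Nice weather today"),
]
processed = preprocess(sample_rows)
```

Only the first sample row survives: the second is dropped as a retweet and the third never mentions a cancer term. A substring match is a deliberately simple choice here; a production filter would likely need word boundaries and a curated term list.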
This node performs a series of tasks that tokenize, stem, and restructure the data. A stop list was also used in this step in order to drop frequently used English words from the analysis. Examples of these common terms include “a”, “the”, “of”, and “at”. A spell check was also used during this step to correct commonly misspelled words, standardize the data, and reduce noise during analysis.

Next, the data were processed through the Text Filter node. The default parameters were used for both the frequency weighting (Log) and the term weighting (Entropy). After running the node, we used the Interactive Filter Viewer to further refine the results. Unnecessary words were manually dropped from analysis, and synonyms were created to group similar terms. For example, the terms “SKIN CANCER” and “MELANOMA” were combined into a single synonym.

The data were then ready to explore using the Text Topic node. We created up to 50 single-term topics and 25 multi-term topics (Figure 3). Upon exploring the results, it appeared that 6 of the 50 single-term topics contained specific cancers. These included Breast Cancer (n = 321,745), Colon Cancer (n = 39,016), Lung Cancer (n = 62,014), Ovarian Cancer (n = 60,226), Prostate Cancer (n = 34,583), and Skin Cancer (n = 35,853). Six new cancer-specific datasets were then created based on the tweets that were flagged for each of these types of cancer.

Figure 3. Table containing cancer text topic results

Each of these six new datasets was run through the same Text Parsing and Text Filter steps described above. Afterward, the Text Cluster node was used on each of them to create hierarchical clusters and segment the tweets based on frequently occurring terms (Figure 4). For each of these cancer types, we reviewed the clusters produced and gave each an appropriate name.

Figure 4. Diagram featuring flows to create text clusters for each type of cancer

SAS Visual Analytics was used to help display the results by creating several reports and explorations. These visualizations helped analysts further explore the findings and infer trends.
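
To make the Log frequency weighting and Entropy term weighting mentioned for the Text Filter node more concrete, here is a small Python sketch of the textbook log-entropy scheme: each count f is replaced by log2(f + 1), and each term receives a weight of 1 + sum_j(p_j * log2(p_j)) / log2(n), where p_j is the share of the term's occurrences that fall in document j and n is the number of documents. This is assumed to mirror the node's defaults in spirit only; the exact SAS Text Miner implementation may differ, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def log_entropy_weights(docs):
    """Return (weighted term vectors, per-term entropy weights).

    docs is a list of token lists, one per document.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]

    # Total frequency of each term across the whole collection.
    totals = Counter()
    for doc_counts in counts:
        totals.update(doc_counts)

    term_weight = {}
    for term, total in totals.items():
        if n_docs < 2:
            # Entropy weighting is undefined for a single document.
            term_weight[term] = 1.0
            continue
        spread = sum(
            (c[term] / total) * math.log2(c[term] / total)
            for c in counts
            if c[term] > 0
        )
        # 1.0 for a term confined to one document, 0.0 for a term
        # spread evenly over every document.
        term_weight[term] = 1.0 + spread / math.log2(n_docs)

    weighted = [
        {term: math.log2(freq + 1) * term_weight[term] for term, freq in c.items()}
        for c in counts
    ]
    return weighted, term_weight


# Made-up, already tokenized tweets for illustration only.
toy_docs = [
    ["breast", "cancer", "walk"],
    ["lung", "cancer", "study"],
    ["skin", "cancer", "sunscreen"],
]
weighted_docs, entropy_weights = log_entropy_weights(toy_docs)
```

In this toy collection the word "cancer" appears once in every document, so its entropy weight is effectively zero, while a term such as "breast" that appears in only one document keeps a weight of 1. That is the behavior one would want after filtering on cancer-related tweets: the ubiquitous query term stops dominating, and the distinguishing terms drive the topics.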

RESULTS

The results of the topic analysis corresponded to the top 6 Most Common Cancers in 2016, except Ovarian Cancer (Tables 1 and 2). We suspect Ovarian Cancer may have surfaced in the analysis since it is the 5th leading cancer-related cause of death in women.

Cancer Type                                   Estimated New Cases   Estimated Deaths
Breast (Female – Male)                        246,660 – 2,600       40,450 – 440
Lung (Including Bronchus)                     224,390               158,080
Prostate                                      180,890               26,120
Colon and Rectal …                            …                     …
…                                             76,380                10,130
Non-Hodgkin Lymphoma                          72,580                20,150
Thyroid                                       64,300                1,980
Kidney (Renal Cell and Renal Pelvis) Cancer   62,700                14,240
Leukemia (All …                               …                     …
…                                             53,070                41,780

Table 1. 2016 Cancer facts and figures

[Table 2 is only partly recoverable: columns by cancer type (Breast, Lung, Prostate, Colon, …) reporting % of tweets and % of topic; surviving values: … 7%, 18.7%, 4.6%, 10.7%, 13.6%, 9.1%]

Table 2. Cancer cluster topic results

SAS Visual Analytics allowed us to easily see seasonal trends in tweets by plotting the frequency of tweets over time for each of the main cancer types. Tweets about breast cancer spiked every year in October, its annual awareness month. The same is true for colon cancer in March, lung cancer in November, ovarian and prostate cancer in September, and skin cancer in May (Figure 6). This suggests that the awareness campaigns are effective in raising discussion of the associated diseases during specific times of the year.
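
The seasonal comparison above was done visually in SAS Visual Analytics. As a hedged sketch of the underlying aggregation, the Python below counts tweets per cancer type and month and then reports the calendar month with the most tweets for each type. The function names and the toy records are hypothetical and synthetic; they are not drawn from the challenge data.

```python
from collections import Counter, defaultdict


def monthly_counts(tweets):
    """Count tweets per cancer type and (year, month).

    tweets is an iterable of (cancer_type, year, month) tuples.
    """
    counts = defaultdict(Counter)
    for cancer_type, year, month in tweets:
        counts[cancer_type][(year, month)] += 1
    return counts


def peak_calendar_month(counts_for_type):
    """Pool all years and return the calendar month with the most tweets."""
    by_month = Counter()
    for (_year, month), n in counts_for_type.items():
        by_month[month] += n
    return by_month.most_common(1)[0][0]


# Synthetic records for illustration only.
toy_tweets = (
    [("breast", 2014, 10)] * 5
    + [("breast", 2014, 3)] * 1
    + [("breast", 2015, 10)] * 4
    + [("colon", 2014, 3)] * 3
    + [("colon", 2014, 10)] * 1
)
counts = monthly_counts(toy_tweets)
peaks = {ctype: peak_calendar_month(c) for ctype, c in counts.items()}
```

With these synthetic records, the peak month is 10 (October) for breast and 3 (March) for colon. Keeping the (year, month) key in `monthly_counts` preserves the time series needed for a frequency-over-time plot, while `peak_calendar_month` pools years to check whether the same month recurs.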

Figure 6. Cancer awareness months work!

Of the 6 diseases, breast cancer had the highest frequency of tweets, which speaks to the very salient and established campaigns put out by organizations like Pink and Susan G. Komen (Figure 7). In addition, breast and ovarian cancer were the only two cancer topics to have clusters formed around fundraising campaigns.

Figure 7. Breast cancer has the highest percentage of tweets

Looking at the types of clusters that emerged, lung and colon cancer had the highest frequencies of tweets categorized into research and studies segments. This could be due to the high mortality rates of these specific diseases and the amount of funding spent on research.

Word clouds of the hashtags and mentions for each cancer category displayed some meaningful top mentioned words. We first examined the results from the breast cancer data. The top mentions in breast cancer include Taylor Swift, The Ellen Show, Kylie Minogue, Joan Lunden, Carolina Herrera, Christina Applegate, and Robin Roberts, indicating a clear celebrity effect.

In addition, the word cloud of the Male Breast Cancer Awareness cluster shows that the top mentioned word in this cluster is NFL (Figure 8).

Figure 8. Breakdown of tweets in the Male Breast Cancer Awareness cluster

Figure 9. Celebrities influence spikes in cancer tweets

This kind of celebrity effect can also be found in other subcategories of cancer. In May 2014, Hugh Jackman posted about his skin cancer scare on Instagram and reminded people to wear sunscreen. This can be correlated with the spike in tweets about Sun Exposure and Sunscreen and the top mention for that month, @REALHUGHJACKMAN (Figure 9).

This further confirms that anyone with a high celebrity profile can help increase public awareness of cancer prevention and treatment.

A table of all the trends and insights obtained in this analysis is shown below (Table 3):

Description

General Trends
  Top 6 cancer types obtained from topic modeling correspond to the 2016 Most Common Cancer Types, except Ovarian Cancer.
  Ovarian Cancer tweets may have surfaced in topic modeling due to high mortality rates.
  Breast Cancer had the highest number of tweets.
  Cancer awareness months show a significant increase in tweets about the particular cancer for that month.
  Breast and Ovarian cancer were the only two topic areas that surfaced fundraising/campaign tweets.
  Colon and Lung Cancer formed the largest clusters for Research & Studies.
  People tweet about Prevention & Screening for cancers where early prevention screening affects survival rates.

Breast Cancer Insights
  The top 100 mentions contain several celebrities, such as Taylor Swift (mom), Ellen Show (mom), Kylie Minogue, Joan Lunden, Carolina Herrera (designer), Christina Applegate, Robin Roberts (Good Morning America), and Oprah.
  @NFL is the top mention for the Male Breast Cancer cluster.

Ovarian Cancer Insights
  Spike in tweets for the Clinical Studies cluster in May 2015 shows @THEROCATEST as the top mention. This correlates to the results of the ROCA test, which were shown to be twice as effective for early detection as other screenings.
  Visible UK mentions (@OVARIANCANCERUK).

Prostate Cancer Insights
  Spike in tweets for the Screening cluster in May 2011 shows Coffee as a Top 25 word. This correlates to a study that showed that coffee reduces the risk of prostate cancer.
  Visible UK mentions (@PROSTATEUK).

Skin Cancer Insights
  Spike in tweets for the Sun Exposure and Sunscreen cluster in May 2014. This correlates to an Instagram post by Hugh Jackman in which he reveals his skin cancer scare and reminds people to wear sunscreen.

Table 3. Trends and insights found in cancer tweet clusters

CONCLUSION

Although we are not subject matter experts in current cancer trends, the use of SAS software in big data analytics allowed us to simplify a fairly complex problem and identify several trends and insights using only three columns of Twitter data. By combining these tools and techniques, we were able to let the data speak for itself rather than relying on ad hoc analysis. While other participants used the power of the Hadoop framework and big data visualization tools to process all of the data, they did not apply analytical techniques to uncover the hidden trends the data had to offer.

REFERENCES

American Cancer Society (ACS). 2016. “Cancer Facts & Figures …
École Polytechnique Fédérale de Lausanne (EPFL). 2016. “Scala logo.png”. By Source, Fair use, https://en.wikipedia.org/w/index.php?curid=21286998.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Scott Koval
Pinnacle Solutions, Inc.
(317) epinnaclesolutions.com

Yijie Li
Pinnacle Solutions, Inc.
(317) nnaclesolutions.com

Mia Lyst
Pinnacle Solutions, Inc.
(317) nnaclesolutions.com
