From Big Data To Big Projects: A Step-by-Step Roadmap


2014 International Conference on Future Internet of Things and Cloud
978-1-4799-4357-9/14 $31.00 © 2014 IEEE, DOI 10.1109/FiCloud.2014.66

Hajar Mousannif
LISI Laboratory, FSSM, Cadi Ayyad University, Marrakesh, Morocco
mousannif@uca.ma

Hasna Sabah, Yasmina Douiji, Younes Oulad Sayad
OSER research team, FSTG, Cadi Ayyad University, Marrakesh, Morocco
{hasna.sabah; yasmina.douiji; younes.ouladsayad}@ced.uca.ma

Abstract: While technologies to build and run big data projects have started to mature and proliferate over the last couple of years, exploiting the full potential of big data is still at a relatively early stage. In fact, building effective big data projects inside organizations is hindered by the lack of a clear data-driven and analytical roadmap to move businesses and organizations from an opinion-operated era, where human skills are a necessity, to a data-driven and smart era, where big data analytics plays a major role in discovering unexpected insights in the oceans of data routinely generated or collected. This paper provides a solid and well-founded methodology for organizations to build any big data project and reap the most rewards out of their data. It covers all aspects of big data project implementation, from data collection to final project evaluation. In each stage of the process, we introduce different sets of platforms and tools in order to assist IT professionals and managers in gaining a comprehensive understanding of the methods and technologies involved and in making the best use of them. We also complete the picture by illustrating the process through different real-world big data project implementations.

Keywords: big data, advanced analytics, big data project, big data technologies.

I. INTRODUCTION

Every time we visit a website, "like" or "follow" a social page, and share our experiences, thoughts, feelings, and opinions on the Internet, we make already "big" data even "bigger"! Every day we collectively generate mountains of data that are waiting to be processed and analyzed. As an example, almost 500 terabytes of data are uploaded each day to Facebook servers [1], while YouTubers upload 100 hours of video every minute [2], and over 571 new websites are created every minute of the day [3]. Yet, using big data is not about collecting or generating massive amounts of data, but more about making sense of it. In fact, big data is absolutely worthless if it is not actionable and, most importantly, smart. Hence, what companies [...] transactions, and such, may be of no use if no insights are smartly and timely extracted from it.

How to establish an effective and rewarding big data solution is the major concern of any company or organization willing to embark on the big data adventure. Throughout this paper, we will show how business leaders and directors can leverage their data in the most efficient way possible through a clear and analytical methodology where descriptive, inquisitive, predictive and prescriptive analytics enter into action to improve results, support mission-critical applications, and drive better decision making. While doing so, this paper attempts to provide satisfying answers to the following fundamental questions:

• Where does big data come from?
• What is/are the appropriate system(s) to capture, cure, store, explore, share, transfer, analyze, and visualize data?
• What is the size range of data and its implications in terms of storage and retrieval?
• How could big data be used to determine market opportunities and seize them? And how could it contribute to making forecasts?
• And finally, how to take into account people's expectations of privacy and bake them in advance into the big data project design?

The remainder of this paper is organized as follows: Section 2 introduces some existing methodologies for implementing big data projects in today's enterprise, and highlights our contribution. In Section 3, we describe a clear roadmap for building smart and effective big data projects within organizations and illustrate the stages of the process through different sets of platforms and tools, as well as real-world big data project use cases. Conclusions and directions for future work are given in Section 4.

II. RELATED WORK

A recent Gartner survey showed that companies are now more aware of the opportunities offered by analyzing larger amounts of data and are increasingly investing or planning to invest in big data projects, from 58% in 2012 to 64% last year [4]. This trend is accompanied by an increase in the need for a global model or roadmap that assists IT departments not only in implementing a big data project, but in making the best of it in order to meet business objectives. The Gartner approach in [5] introduces a roadmap for the successful adoption of big data solutions, starting from the stage where a company is unaware of the necessity of big data in facing today's business objectives, to the final stage of the data-driven enterprise.

The other example is that of the U.S. Census Bureau, which has implemented a big data project to conduct a head count of all the people in the U.S. The life cycle of the US
Census Bureau big data project includes three fundamental steps: 1) data collection using a multi-mode model, 2) data analysis to explore technology solutions based on methodological techniques, and 3) data dissemination by implementing new platforms for integrating census and survey data with other big data [6].

In [7], ASE Consulting provides a well-considered approach for building big data projects. It consists of six steps that do not commit to any particular technology or tool, ranging from understanding the scope of the project by identifying business problems and opportunities, to evaluating the big data project while providing insights into what worked well and what did not.

IBM, in their recent report [8], introduced a three-phase approach for building big data projects: planning, execution and post-implementation. It mainly consists in understanding the business and legal policies, communication between IT departments and the project stakeholders, and conducting an impact analysis at the end of the implementation.

With respect to all the related literature presented above, the existing efforts either fail to cover some fundamental aspects of big data project setup, or limit their approach to providing basic guidelines for big data project implementation without further insights into the technologies and platforms involved. The present work aims to overcome such limitations by:

• Providing a holistic approach to building big data projects, which tackles all implementation challenges a company or organization may face in each stage of their big data project setup: from strategy elaboration to final project evaluation.
• Assisting companies and organizations willing to establish an effective and rewarding big data solution in gaining a comprehensive understanding of the technologies involved and in making the best use of them.
• Baking people's expectations of privacy and security into the big data project design in advance.
• Illustrating the proposed process through different real-world big data project implementations.

III. ROADMAP FOR BUILDING SMART BIG DATA PROJECTS

In this section, we explore the design of the proposed methodology and provide a set of useful tips to follow in each phase of the big data project setup. The suggested approach, as shown in Figure 1, consists of three major phases: 1) elaboration of the global strategy, 2) implementation of the project, and 3) post-implementation.

Fig. 1. Big data project workflow

A. Global Strategy Elaboration

Starting a big data project requires big changes and new investments. The changes particularly include the establishment of a new technological infrastructure, and a new way to process and harness data. Here are a few points to consider before undertaking any changes:

1) Why a big data project?

To answer this question, companies have to find the problems that need a solution, and decide whether they could be solved using new technologies or just with available software and techniques. Those problems could be: volume challenges, real-time analytics, predictive analytics, customer-centric analytics, among others. The other thing to consider is to define business priorities, by focusing on the most important activities that form the greatest economic leverage in the business.

2) What data should the organization consider?

Once priorities are defined, business leaders and IT practitioners must target the data that will yield the most value. This important phase is defined by IBM as data exploration [9], as it is about exploring both internal and external data available to the company, to ensure that it can be accessed to sustain decision making and everyday operations. IBM InfoSphere Data Explorer [10] is one example of a tool that performs such a task. It allows federated discovery, navigation and inquiry over a wide range of data sources and types, both inside and outside the organization, to help companies start big data initiatives.

3) How to protect data?

Securing a huge amount of continuously evolving data can be very complicated, considering that firms' servers cannot store all the needed data. Moreover, the fact that big data is most of the time processed in real time induces even more security challenges. The Cloud Security Alliance (CSA) is one of the few organizations that have addressed this issue by providing, in their recent report [11], a set of solutions and methods for each privacy or security challenge. Similarly, the Enterprise Strategy Group (ESG) shows in [12] the most significant obstacles facing the implementation of a security policy in a big data environment, and gives valuable tips for CIOs to enter the big data security analytics era, whereby companies would not only be able to monitor the traffic coming into and going out of their systems to detect threats, but also predict cyber-attacks even before they happen.

We consider that there are three main features to focus on when planning to implement a new security management solution:

• Protection of sensitive data: by controlling the access to the data, by providing encryption solutions, or both. The chosen solution for this purpose must be easy to integrate within the current system and
consider performance issues. Vormetric Encryption [13] and Voltage [14] are examples of such solutions.
• Network security management: by monitoring the local network, analyzing data coming from security devices and network end-points, and detecting and reacting to intrusions and suspicious traffic in a timely manner, without impeding the main objective of the big data project. Among the available platforms for this purpose, we find LogRhythm SIEM 2.0 [15] and Fortscale [16].
• Security intelligence: by providing actionable and comprehensive insight that reduces risk and operational effort for an organization of any size, using data generated by users, applications and infrastructure. InfoSphere Streams and InfoSphere BigInsights [9] are some of the big data technologies that offer security intelligence features.

4) What to avoid?

Below are the most common mistakes companies may make before or while undertaking a big data initiative:

• Technology is not the goal of a big data project; it is rather a means to be seriously thought about once business objectives are fixed.
• There is no everlasting technological solution for implementing the whole cycle of a big data project. As big data solutions proliferate, it becomes difficult to predict which platforms, applications or methods will work better in the future. Hence, companies should stay open to any new big data solution.
• Avoid the warehouse-or-Hadoop trick: it is imperative to use both of them, as they work well alongside and complement each other.

B. Data collection

Data collection is the first technical step of a big data project setup. This section sheds light on some major big data sources. We cover sensors, open data, social media and crowdsourcing.

1) Internet of things

Sensors are a major source of big data. They are increasingly deployed everywhere: smart phones and other daily life devices, commercial buildings or transportation systems. With an expected population of 1 trillion by 2015 [17], sensors allow collecting various types of data including body-related metrics, location, movement, temperature and sounds. Coupled with ubiquitous wireless networks, sensors are driving a myriad of smart innovations in the context of the Internet of Things (IoT), for example: smart buildings where lighting and air-conditioning are optimized, smart transportation and traffic management systems that monitor both vehicular and pedestrian traffic for better flow and better evacuation in emergencies [18], and smart phones that automatically recognize our emotional states and appropriately respond to them [19].

2) Open data

Public institutions, organizations and a growing number of private companies are making some of their datasets available to the public. A major contributor to the open data initiative is the World Bank [20]. Its catalog includes macro, financial and sector databases. In addition to searching datasets on the World Bank site and downloading tables, users can also access the data via different APIs [21]. The open data catalog [22] maintains a list of worldwide open data and shows an increase in governments' contribution. A number of companies are also making parts of their data available through download or via APIs, like Yelp [23], which gives academics access to its business rating database.

3) Social networks

Social networks store huge amounts of data. This data is generally accessible via dedicated APIs or through special grants. For example, Twitter offers its own APIs [24], which give access to a fraction of all tweets. The company also established a certified partners program in 2012: partners such as Gnip [25] are given deeper access to Twitter data, which they process to offer custom datasets and services. Finally, Twitter announced in 2014 a data grant project that will give selected research institutions access to its public and historical data [26].

4) Crowdsourcing

Collecting massive amounts of data can be quite challenging, especially when it has to be done on a large scale. Crowdsourcing is a great solution for data collection and emerged in the last decade as an efficient way to harness the creativity and intelligence of crowds. Recently, researchers sought to apply crowdsourcing to human subject research [27]. Technical University Munich (TUM)'s ProteomicsDB and the International Barcode of Life projects are two good examples of collecting and gathering data using crowdsourcing [28]. Amazon Mechanical Turk (MTurk) [29] is one crowdsourcing framework among others in which assignments are distributed to a population of many unknown workers for fulfillment.

C. Data preprocessing

It is important to lay the ground for data analysis by applying various preprocessing operations to address the imperfections in raw collected data. For example, data can have different formats, as multiple sources might be involved. It can also contain noise (errors, redundancies, outliers and others). Finally, it may simply need to fit the requirements of analysis algorithms. Data preprocessing includes a range of operations [30]:

• Data cleaning: eliminates incorrect values and checks for data inconsistency.
• Data integration: combines data from databases, files and different sources.
• Data transformation: converts collected data formats to the format of the destination data system. It can be divided into two steps: 1) data mapping, which links data elements from the source data system to the destination data system, and 2) data reduction, which condenses the data into a structure that is smaller but still generates the same analytical results.
• Data discretization: can be considered part of data reduction, yet it has its own particular importance. It refers to the process of partitioning or converting continuous properties, characteristics or variables into nominal variables, features or attributes.
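The four preprocessing operations listed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the tooling used in the paper: the sensor records, the field names (`id`, `temp_c`, `temp_f`) and the temperature thresholds are all hypothetical.

```python
# Hypothetical raw records from two sources (illustrative data only).
source_a = [
    {"id": 1, "temp_c": "21.5"},
    {"id": 2, "temp_c": "n/a"},   # incorrect value, removed by cleaning
    {"id": 3, "temp_c": "35.0"},
]
source_b = [
    {"id": 3, "temp_c": "35.0"},  # redundant copy of a record in source_a
    {"id": 4, "temp_f": "50.0"},  # different format, handled by transformation
]


def clean(records):
    """Data cleaning: drop records whose reading is not a valid number."""
    cleaned = []
    for rec in records:
        key = "temp_c" if "temp_c" in rec else "temp_f"
        try:
            value = float(rec[key])
        except ValueError:
            continue
        cleaned.append({"id": rec["id"], key: value})
    return cleaned


def transform(rec):
    """Data transformation: map a record onto the destination schema (Celsius)."""
    if "temp_f" in rec:
        celsius = round((rec["temp_f"] - 32) * 5 / 9, 1)
        return {"id": rec["id"], "temp_c": celsius}
    return rec


def integrate(*sources):
    """Data integration: combine sources, keeping one record per id."""
    merged = {}
    for source in sources:
        for rec in source:
            merged.setdefault(rec["id"], rec)
    return [merged[key] for key in sorted(merged)]


def discretize(temp_c):
    """Data discretization: turn a continuous reading into a nominal label."""
    if temp_c < 15:
        return "cold"
    if temp_c < 28:
        return "mild"
    return "hot"


def preprocess(*sources):
    """Run cleaning, transformation, integration and discretization in order."""
    prepared = [[transform(rec) for rec in clean(src)] for src in sources]
    return [
        {**rec, "level": discretize(rec["temp_c"])}
        for rec in integrate(*prepared)
    ]


result = preprocess(source_a, source_b)
for row in result:
    print(row)
```

Keeping each operation in its own function mirrors the list above and makes it easy to swap one step (for example, a different discretization scheme) without touching the others.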

Most data mining and business intelligence platforms include data preprocessing tools, such as the open source WEKA [31] and Data Cleaner for Pentaho [32].

D. Smart data analysis

Extracting value from a huge set of data is the ultimate goal of many firms. One efficient approach to achieve this is to use advanced analytics, which provides algorithms to perform complex analytics on either structured or unstructured data. There are four types of advanced analytics:

• Descriptive analytics: answers the question "what happened in the past?", knowing that in this context the past could mean one minute ago or a few years back [33]. It uses descriptive statistics such as sums, counts, averages, min, and max to provide meaningful results about the analyzed dataset. Descriptive analytics is typically used in social analytics and recommendation engines, such as the Netflix recommendation system.
• Inquisitive analytics: also called diagnostic analytics, it answers the question "why is something happening?" by validating or rejecting business hypotheses. Explorys, a leader in healthcare big data, used inquisitive analytics to find out that an unexplained variation in the evaluation of patients' weight was mainly due to a documentation gap [34].
• Predictive analytics: consists in studying the data we have to predict data we do not have, such as future outcomes, in a probabilistic way [35], thereby answering the question "what is likely to happen?".
• Prescriptive analytics: also called optimization analytics, it consists in guiding decision making by answering the question "so what?" or "what must we do now?". It can be used by companies to optimize their scheduling, production, inventory and supply chain design [33].

To guide companies in their choice of analytics software, Gartner published the first magic quadrant (MQ) for advanced analytics [36], which presents 16 analytics platforms divided into four areas: leaders, challengers, visionaries and niche players, based on two criteria: completeness of vision and ability to execute. The three top leaders of the Gartner MQ are IBM [37], SAS Visual Analytics [38] and Knime [39].

E. Representation and Visualization

Visualization guides the analysis process and presents results in a meaningful way. For the simple depiction of data, most software packages support classical charts and dashboards. The choice is generally dictated by the type of desired analytics [40]. Working at big data scale brings multiple technical issues [41]. Firstly, there are challenges related to volume, such as processing time, memory limitations, and the need to fit different display types. Different approaches are explored to scale to big data; for example, at the Intel Science & Technology Center for Big Data, projects work on various techniques such as visual summaries, caching and prefetching to hide data store latency, query steering and large-scale parallelism [42]. Secondly, deriving meaning from semi-structured and unstructured data requires adequate visualizations, which are supported to varying degrees by software. Examples are word clouds, association trees, and network diagrams.

Concerning the market of data visualization platforms, a study by Forrester Inc. [43] highlights the market's diversity and the importance of both technology and visual design quality. It also shows that the main technical differentiation factors are the performance of the in-memory engine, the quality of the graphical user interface, and the comprehensiveness of data exploration and discovery tools. The leaders' board includes Tableau Software, Tibco Spotfire and SAS BI. A white paper by Tableau Inc. [44] identifies, as shown in Table I, seven key features to assess a visual analytics application.

TABLE I. KEY ELEMENTS OF VISUAL ANALYTICS APPLICATION
[Table body truncated in this transcription. Column headers: "Key elements", "Description & ...". Surviving fragments: "Augmented human perception", "Effective visual properties and well designed ...".]
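To make two of the analytics types from Section D concrete, the following Python sketch computes the descriptive statistics named there (count, sum, average, min, max) and then produces a naive forecast. The order amounts and weekly sales figures are hypothetical, and the straight-line least-squares trend is a deliberately simple stand-in for the probabilistic models used in practice; it is not a method taken from the paper.

```python
from statistics import mean

# Hypothetical order amounts and weekly sales (illustrative data only).
orders = [120.0, 75.5, 210.0, 75.5, 99.0]
weekly_sales = [10.0, 12.0, 14.0, 16.0]


def describe(values):
    """Descriptive analytics: summarize what happened in the past."""
    return {
        "count": len(values),
        "sum": sum(values),
        "average": mean(values),
        "min": min(values),
        "max": max(values),
    }


def linear_forecast(ys, steps_ahead=1):
    """Predictive analytics (simplified): extend a least-squares trend line.

    Observations are assumed to be equally spaced, at x = 0, 1, 2, ...
    """
    xs = range(len(ys))
    mean_x = mean(xs)
    mean_y = mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (len(ys) - 1 + steps_ahead)


summary = describe(orders)
forecast = linear_forecast(weekly_sales)
print(summary)
print(forecast)
```

The descriptive step only looks backward at recorded values, while the forecast extrapolates beyond them, which is the distinction the text draws between the two types.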
