Using Big Data for the Analysis of Historic Context Information - FIWARE

Using Big Data for the analysis of historic context information
Francisco Romero Bueno
Technological Specialist. FIWARE data engineer
francisco.romerobueno@telefonica.com

Big Data: what is it and how much data is there

What is big data?
[image: a small data set - "small data"]

What is big data?
[image: interior view of the Stockholm Public Library - "big data"]

Not a matter of thresholds
If both the data used by your app and the processing capabilities your app logic needs fit the available infrastructure, then you are not dealing with a big data problem.
If either the data used by your app or the processing capabilities your app logic needs don't fit the available infrastructure, then you are facing a big data problem, and you need specialized services.

How much data is there?

Data is growing
Source: …ni/vni-hyperconnectivity-wp.html

Two (three) approaches for dealing with Big Data: batch and stream processing (and Lambda architectures)

Batch processing
- It is about joining a lot of data (batching)
  - A lot may mean Terabytes or more
  - Most probably, data cannot be stored in a single server
- Once joined, it is analyzed
  - Most probably, data cannot be analyzed using a single process
- Time is not a problem
  - Batching can last for days or even months
  - Processing can last for hours or even days
- Analysis can be reproduced

Stream processing
- It is about not storing the data and analyzing it on the fly
  - Most probably, data cannot be analyzed by a single process
- Time is important
  - Since the data is not stored, it must be analyzed as it is received
  - The results are expected to be available in near real-time
- Analysis cannot be reproduced

Lambda architectures
A Big Data architecture is Lambda compliant if it produces near-real-time data insights based on the last data only, while large batches are accumulated and processed for robust insights.
- Data must feed both batch-based and stream-based sub-systems
- Real-time insights are cached
- Batch insights are cached
- Queries to the whole system combine both kinds of insights
http://lambda-architecture.net/
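As a minimal illustration of that query step, a plain-Java sketch (the batchView and speedView maps are hypothetical caches standing in for the two cached insight stores; they are not part of any Lambda framework):

```java
// Sketch of the Lambda query step: a query combines the robust (but old)
// batch insight with the near-real-time insight for data that arrived
// after the last batch run.
import java.util.HashMap;
import java.util.Map;

public class LambdaQuery {
    // Combine the cached batch view and the cached speed (real-time) view.
    static long query(Map<String, Long> batchView, Map<String, Long> speedView, String key) {
        return batchView.getOrDefault(key, 0L) + speedView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new HashMap<>();  // robust insight, hours old
        batchView.put("pageviews", 1_000_000L);
        Map<String, Long> speedView = new HashMap<>();  // last few minutes only
        speedView.put("pageviews", 1_234L);
        System.out.println(query(batchView, speedView, "pageviews")); // prints 1001234
    }
}
```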

Distributed storage: the Hadoop reference (HDFS)

What happens if one shelf is not enough? You buy more shelves…

… then you create an index:
"The Avengers", 1-100, shelf 1
"The Avengers", 101-125, shelf 2
"Superman", 1-50, shelf 2
"X-Men", 1-100, shelf 3
"X-Men", 101-200, shelf 4
"X-Men", 201-225, shelf n

Hadoop Distributed File System (HDFS)
- Based on the Google File System
- Large files are stored across multiple machines (Datanodes) by splitting them into blocks that are distributed
- Metadata is managed by the Namenode
- Scalable by simply adding more Datanodes
- Fault-tolerant, since HDFS replicates each block (default to 3)
- Security based on authentication (Kerberos) and authorization (permissions, HACLs)
- It is managed like a Unix-like file system

Splitting, replication and distribution
[diagram: "large file.txt" (4 blocks), each block replicated and distributed across the datanodes of rack 1 (datanodes 1 to 4) and rack 2 (datanodes 5 to 8)]
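The splitting and placement idea can be sketched in a few lines of Java (an illustration only, not the HDFS client API: the round-robin policy is a simplification, since the real Namenode is rack-aware when choosing replicas):

```java
// Sketch: split a file into fixed-size blocks and assign each block
// a set of replica datanodes, mimicking what the Namenode records.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockPlacement {
    // Number of blocks needed for a file of the given size.
    static int numBlocks(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize);
    }

    // Simplified round-robin replica assignment over the datanode list.
    static List<List<String>> placeReplicas(int blocks, List<String> datanodes, int replication) {
        List<List<String>> placement = new ArrayList<>();
        int next = 0;
        for (int b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(datanodes.get(next % datanodes.size()));
                next++;
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 500 MB "large file.txt" with 128 MB blocks -> 4 blocks, as in the diagram
        int blocks = numBlocks(500 * mb, 128 * mb);
        List<String> nodes = Arrays.asList("dn1","dn2","dn3","dn4","dn5","dn6","dn7","dn8");
        System.out.println(blocks + " blocks: " + placeReplicas(blocks, nodes, 3));
    }
}
```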

Namenode metadata
Path                             Replicas  Block IDs
/user/user1/data/large file.txt  3         1 {dn1,dn2,dn5}
                                           2 {dn3,dn5,dn8}
                                           3 {dn3,dn6,dn8}
                                           4 {dn1,dn4,dn7}
/user/user1/data/other file.txt  2         5 { }
                                           6 { }
                                           7 { }

Datanode failure recovery
[diagram: when a Datanode fails, the blocks of "large file.txt" it held are re-replicated to the surviving Datanodes of rack 1 (datanodes 1 to 4) and rack 2 (datanodes 5 to 8)]

Namenode failure recovery
Path                             Replicas  Block IDs
/user/user1/data/large file.txt  3         1 {dn1,dn2,dn5}
                                           2 {dn2,dn5,dn8}
                                           3 {dn4,dn6,dn8}
                                           4 {dn1,dn4,dn7}
/user/user1/data/other file.txt  2         5 { }
                                           6 { }
                                           7 { }

Managing HDFS
[diagram: a client machine (ssh client + Hadoop commands, browser, custom apps with an HTTP client) talks to the services node, which exposes an ssh daemon, the WebHDFS and HttpFS REST APIs and the HUE web UI, all on top of HDFS]

Managing HDFS: HTTP REST API
- The HTTP REST API supports the complete FileSystem interface for HDFS
  - Other Hadoop commands are not available through a REST API
- It relies on the webhdfs schema for URIs: webhdfs://HOST:HTTP_PORT/PATH
- HTTP URLs are built as: http://HOST:HTTP_PORT/webhdfs/v1/PATH?op=…
- Full API specification: …t-dist/hadoop-hdfs/WebHDFS.html
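A small helper shows how such URLs are assembled from the pattern above (a sketch; the method name is illustrative, and the host and port used in the example are the Cosmos storage endpoint mentioned later in this deck):

```java
// Sketch: build a WebHDFS URL following http://HOST:PORT/webhdfs/v1/PATH?op=...
public class WebHdfsUrl {
    static String url(String host, int port, String path, String op, String user) {
        return "http://" + host + ":" + port + "/webhdfs/v1" + path
             + "?op=" + op + "&user.name=" + user;
    }

    public static void main(String[] args) {
        // List a user's directory on the FIWARE Lab storage cluster
        System.out.println(url("storage.cosmos.lab.fiware.org", 14000,
                               "/user/frb/webinar", "liststatus", "frb"));
    }
}
```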

Managing HDFS: HTTP REST API examples

$ curl -X GET "http://<host>:14000/webhdfs/v1/user/frb/webinar/abriefhistoryoftime_page1?op=open&user.name=frb"
CHAPTER 1
OUR PICTURE OF THE UNIVERSE
A well-known scientist (some say it was Bertrand Russell) once gave a public lecture on astronomy. He described how the earth orbits around the sun and how the sun, in turn, orbits around the center of a vast …

$ curl -X PUT "http://<host>:14000/webhdfs/v1/user/frb/webinar/afolder?op=mkdirs&user.name=frb"
{"boolean":true}

$ curl -X GET "http://<host>:14000/webhdfs/v1/user/frb/webinar?op=liststatus&user.name=frb"
{"FileStatuses":{"FileStatus":[ …abriefhistoryoftime… "replication":0}]}}

$ curl -X DELETE "http://<host>:14000/webhdfs/v1/user/frb/webinar/afolder?op=delete&user.name=frb"
{"boolean":true}

Distributed batch computing: the Hadoop reference (MapReduce)

What happens if you cannot read all your books?

Hadoop was created by Doug Cutting at Yahoo!, based on the MapReduce patent by Google

Well, MapReduce was really invented by Julius Caesar: Divide et impera* (* Divide and conquer)

An example
How many pages are written in Latin among the books in the Ancient Library of Alexandria?
[diagram sequence: several Mappers read the books in parallel; each time a Mapper finds a Latin book it emits its page count - 45 (ref 1), 73 (ref 4), 34 (ref 5) - and ignores the Greek and Egyptian ones; when all Mappers are idle, the Reducer sums the emitted values: 45 + 73 + 34 = 152 TOTAL]

Another example
How many pages are written in all the languages among the books in the Ancient Library of Alexandria?
[diagram sequence: each Mapper now emits a (language, pages) pair for every book - e.g. (egy,12), (gre,128), (lat,45) - and the pairs are routed to one Reducer per language; when the Mappers are idle, each Reducer sums its own values: lat: 45 + 73 + 34 = 152; egy: 12 + 10 = 22; gre: 128 + 230 + 20 = 378]
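The whole worked example fits in a few lines of plain Java that mimic the map/shuffle/reduce flow in memory (illustrative only, no Hadoop involved; the class and method names are made up for this sketch):

```java
// Sketch of the library example: "map" emits a (language, pages) pair per
// book, and the grouping+sum plays the role of shuffle and reduce.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageCountSim {
    // Each book record is {language, pages}; group by language and sum pages.
    static Map<String, Integer> totalPagesByLanguage(List<String[]> books) {
        Map<String, Integer> totals = new HashMap<>();
        for (String[] book : books) {
            totals.merge(book[0], Integer.parseInt(book[1]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String[]> books = Arrays.asList(
            new String[]{"latin", "45"}, new String[]{"latin", "73"},
            new String[]{"latin", "34"}, new String[]{"greek", "128"},
            new String[]{"greek", "230"}, new String[]{"greek", "20"},
            new String[]{"egyptian", "12"}, new String[]{"egyptian", "10"});
        // latin=152, greek=378, egyptian=22, matching the slides
        System.out.println(totalPagesByLanguage(books));
    }
}
```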

Writing MapReduce applications
- MapReduce applications are commonly written in the Java language
  - They can be written in other languages through Hadoop Streaming
- A MapReduce job consists of:
  - A driver, a piece of software where inputs, outputs, formats, etc. are defined, and which is the entry point for launching the job
  - A set of Mappers, given by a piece of software defining their behaviour
  - A set of Reducers, given by a piece of software defining their behaviour

Implementing the example
- The input will be a single big file with one book per line, in the form <title>,<language>,<pages>, e.g.:
  symbolae ia est vincit,latin,134
- The mappers will receive pieces of the above file, which will be read line by line
  - Each line will be represented by a (key,value) pair, i.e. the offset in the file and the real data within the line, respectively
  - For each input pair a (key,value) pair will be output, i.e. a common "num pages" key and the third field in the line
- The reducers will receive arrays of pairs produced by the mappers, all having the same key ("num pages")
  - For each array of pairs, the sum of the values will be output as a (key,value) pair, in this case a "total pages" key and the sum as value

Implementing the example: JCMapper.class

public static class JCMapper extends
        Mapper<Object, Text, Text, IntWritable> {
    private final Text globalKey = new Text("num pages");
    private final IntWritable bookPages = new IntWritable();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        System.out.println("Processing " + fields[0]);

        if (fields[1].equals("latin")) {
            bookPages.set(Integer.parseInt(fields[2]));
            context.write(globalKey, bookPages);
        } // if
    } // map
} // JCMapper

Implementing the example: JCReducer.class

public static class JCReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable totalPages = new IntWritable();

    @Override
    public void reduce(Text globalKey, Iterable<IntWritable> bookPages,
            Context context) throws IOException, InterruptedException {
        int sum = 0;

        for (IntWritable val : bookPages) {
            sum += val.get();
        } // for

        totalPages.set(sum);
        context.write(globalKey, totalPages);
    } // reduce
} // JCReducer

Implementing the example: JC.class

public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new JC(), args);
    System.exit(res);
} // main

@Override
public int run(String[] args) throws Exception {
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "julius");
    job.setJarByClass(JC.class);
    job.setMapperClass(JCMapper.class);
    job.setReducerClass(JCReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
} // run

Simplifying the batch analysis: querying tools

Querying tools
- The MapReduce paradigm may be hard to understand and, worse, to use
- Indeed, many data analysts just need to query the data
  - If possible, by using already well-known languages
- For that reason, some querying tools appeared in the Hadoop ecosystem
  - Hive, and its HiveQL language, quite similar to SQL
  - Pig, and its Pig Latin language, a new language

Hive and HiveQL
- HiveQL reference: …LanguageManual
- All the data is loaded into Hive tables
  - Not real tables (they don't contain the real data) but metadata pointing to the real data at HDFS
- The best thing is that Hive uses pre-defined MapReduce jobs behind the scenes for:
  - Column selection
  - Field grouping
  - Table joining
  - Value filtering
- Important remark: since MapReduce is used by Hive, the queries may take some time to produce a result

Hive CLI

$ hive
Hive history file=/tmp/myuser/hive_job_log_opendata_XXX_XXX.txt
hive> select column1,column2,otherColumns from mytable where column1='whatever' and columns2 like '%whatever%';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201308280930_0953, Tracking URL = http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201308280930_0953
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=cosmosmaster-gi:8021 -kill job_201308280930_0953
2013-10-03 09:15:34,519 Stage-1 map = 0%, reduce = 0%
2013-10-03 09:15:36,545 Stage-1 map = 67%, reduce = 0%
2013-10-03 09:15:37,554 Stage-1 map = 100%, reduce = 0%
2013-10-03 09:15:44,609 Stage-1 map = 100%, reduce = 33%

Hive Java API
- Hive CLI and Hue are OK for human-driven testing purposes, but they are not usable by remote applications
  - Hive has no REST API
- Hive has several drivers and libraries:
  - JDBC for Java
  - Python
  - PHP
  - ODBC for C/C++
  - Thrift for Java and C++
- A remote Hive client usually performs:
  - A connection to the Hive server (TCP/10000)
  - The query execution

Hive Java API: get a connection

private static Connection getConnection(String ip, String port,
        String user, String password) {
    try {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    } catch (ClassNotFoundException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch

    try {
        return DriverManager.getConnection("jdbc:hive://" + ip + ":" + port
                + "/default?user=" + user + "&password=" + password);
    } catch (SQLException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch
} // getConnection

Hive Java API: do the query

private static void doQuery() {
    try {
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("select column1,column2,"
                + "otherColumns from mytable where "
                + "column1='whatever' and "
                + "columns2 like '%whatever%'");

        while (res.next()) {
            String column1 = res.getString(1);
            int column2 = res.getInt(2);
        } // while

        res.close(); stmt.close(); con.close();
    } catch (SQLException e) {
        System.exit(0);
    } // try catch
} // doQuery

Hive tables creation
Both locally using the CLI, or remotely using the Java API, use this command: create [external] table…

- CSV-like HDFS files:
  create external table <table_name> (<field1_name> <field1_type>, ..., <fieldN_name> <fieldN_type>)
  row format delimited fields terminated by '<separator>'
  location '/user/<username>/<path>/<to>/<the>/<data>';

- Json-like HDFS files:
  create external table <table_name> (<field1_name> <field1_type>, ..., <fieldN_name> <fieldN_type>)
  row format serde 'org.openx.data.jsonserde.JsonSerDe'
  location '/user/<username>/<path>/<to>/<the>/<data>';

Distributed streaming computing: the Storm reference

Storm project
- Created by Nathan Marz at BackType/Twitter
- A distributed realtime computation system

Storm basics
- Based on processing building blocks that can be composed in a topology
  - Spouts: blocks in charge of polling for data streams, producing data tuples
  - Bolts: blocks in charge of processing data tuples, performing basic operations
    - 1:1 operations: arithmetics, transformations
    - N:1 operations: filtering, joining
    - 1:N operations: splitting, replication
- It is scalable and fault-tolerant
  - A basic operation can be replicated many times in a layer of bolts
  - If a bolt fails, there are several other bolts performing the same basic operation in the layer
- Guarantees the data will be processed
  - Storm performs an ACK mechanism for data tuples
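The spout/bolt pipeline can be mimicked with plain Java streams (this is NOT the Storm API, just a sketch of the 1:N splitting, N:1 filtering and grouping operations listed above; the class and method names are made up):

```java
// Sketch: a "spout" is a list standing in for an incoming stream, and each
// stage below plays the role of one bolt applied to every tuple in flight.
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MiniTopology {
    static Map<String, Long> wordCounts(List<String> spout) {
        return spout.stream()
            .flatMap(s -> Arrays.stream(s.split(" ")))   // splitting bolt (1:N)
            .filter(w -> !w.isEmpty())                   // filtering bolt (N:1)
            .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // grouping bolt
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(Arrays.asList("big data", "big problems")));
    }
}
```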

Big Data in FIWARE Lab: Cosmos and Sinfonier

Cosmos
- Cosmos is the name of the Hadoop-based global instance in FIWARE Lab
- Nothing has to be installed!
- There are two clusters exposing some services:
  - Storage (storage.cosmos.lab.fiware.org)
    - WebHDFS REST API (TCP/14000)
  - Computing (computing.cosmos.lab.fiware.org)
    - Tidoop REST API (TCP/12000)
    - Auth proxy (TCP/13000)
    - HiveServer2 (TCP/10000)

Feeding Cosmos with context data
- Cygnus tool
  - Apache Flume-based
  - Standard NGSI connector for FIWARE
- Provides connectors for a wide variety of persistence backends:
  - HDFS, MySQL, CKAN, MongoDB, STH Comet, PostgreSQL, Kafka, DynamoDB, Carto

Sinfonier
- Sinfonier will be the name of the Storm-based global instance in FIWARE Lab
- Nothing will have to be installed!
- There will be one cluster exposing streaming analysis services through an IDE
- It will be fed using Cygnus and Kafka queues
- Coming soon!

Thank you!
http://fiware.org
Follow @FIWARE on Twitter

