Real Time Micro-Blog Summarization Based On Hadoop/HBase


Real Time Micro-Blog Summarization based on Hadoop/HBase
Sanghoon Lee, Sunny Shakya
Dept. of Computer Science, Georgia State University
05/03/2013

Outline
– Introduction
– Hadoop
– HDFS
– MapReduce
– HBase
– The Big Picture
– HBase Operation
– Application
– Twitter Application Architecture
– Demo

Introduction – Apache Hadoop
What is Apache Hadoop?
– An open-source framework that supports data-intensive distributed applications
– Created by Doug Cutting, the creator of Apache Lucene
– Derived from Google's MapReduce and Google File System (GFS) papers
– A solution for Big Data: it deals with the complexities of high volume, velocity and variety of data
– Transforms commodity hardware into services that store petabytes of data reliably and allow huge distributed computations

Introduction – Apache Hadoop
Key attributes
– Redundant and reliable (no data loss)
– Extremely powerful
– Batch-processing centric
– Easy to program distributed applications
– Runs on commodity hardware
– Easily scalable

Introduction – Apache Hadoop
– MapReduce is the processing part of Hadoop
– HDFS is the data part of Hadoop
(Diagram: a single machine runs both MapReduce and HDFS.)

Introduction – Apache Hadoop
– The MapReduce server on a typical machine is called a TaskTracker
– The HDFS server on a typical machine is called a DataNode
(Diagram: a machine running a TaskTracker and a DataNode.)

Introduction – Apache Hadoop
– Having multiple machines run Hadoop creates a cluster
(Diagram: three machines, each with a TaskTracker and a DataNode.)

Introduction – Apache Hadoop
– The JobTracker keeps track of the jobs being run
(Diagram: one JobTracker coordinating the TaskTrackers on three machines.)

Introduction – Apache Hadoop
– The NameNode keeps information about data location
(Diagram: one NameNode coordinating the DataNodes on three machines.)

Introduction – HDFS
HDFS is scalable, reliable and manageable.
– Highly scalable file system
  – Add commodity servers and disks to scale storage and I/O bandwidth
  – Supports parallel reading and processing of the data
  – Read, write, rename and append; optimized for streaming reads/writes of large files
  – Bandwidth scales linearly with the number of nodes and disks
– Fault tolerant and easily manageable
  – Built-in redundancy; tolerates node and disk failures
  – Automatically manages the addition and removal of nodes

Introduction – HDFS
(Diagram: the NameNode holds the namespace metadata; data blocks D1–D4 are replicated across DataNodes in Rack 1 and Rack 2.)

Introduction – HDFS
HDFS and its uses
– HDFS provides a reliable, scalable and manageable solution for working with huge amounts of data
– HDFS has been successfully deployed in clusters of 10 to 4,500 nodes and can store up to 25 petabytes of data

Introduction – MapReduce
(Diagram: a client submits a job to the JobTracker, which consults the NameNode and schedules tasks on TaskTrackers; each TaskTracker is co-located with a DataNode holding data blocks.)

Introduction – MapReduce
– Map step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. Each worker node processes its smaller problem and passes the answer back to the master node.
– Reduce step: the master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
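The two steps can be sketched as a minimal, single-process word count (the function names and the in-process setup are illustrative only, not Hadoop's actual API):

```python
from collections import defaultdict

def map_step(post):
    # Mapper: turn one input record into (word, 1) pairs
    return [(word.lower(), 1) for word in post.split()]

def reduce_step(pairs):
    # Reducer: combine all partial answers into the final counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

posts = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for post in posts for pair in map_step(post)]
print(reduce_step(pairs))  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop the pairs are shuffled across machines between the two steps; here the list comprehension plays that role.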

Introduction – MapReduce
(Diagram: the MapReduce data flow.)

Introduction – Summary
Hadoop is
– Reliable: data is held in multiple locations, and tasks that fail are redone
– Scalable: the same program runs on 1, 1,000 or 4,000 machines, and it scales linearly
– Simple: easy-to-use APIs
– Very powerful: you can process massive amounts of data – petabytes – in parallel, which allows for timely processing

Introduction – The Big Picture
(Stack diagram: Pig and Hive sit on top of MapReduce, which sits on top of HDFS.)

Introduction – The Big Picture
(Stack diagram: Pig and Hive on top of MapReduce; ZooKeeper and HBase alongside, with HBase on top of HDFS.)

HBase – Introduction
– A distributed, column-oriented database built on top of HDFS
– Not relational and does not support SQL
– Designed to run on a cluster of computers instead of a single computer, with scalability and the ability to deal with any type of data in mind
– Often described as a schema-less database

HBase – Introduction
HBase depends on Hadoop for two primary reasons:
– Hadoop MapReduce provides a distributed computation framework for high-throughput data computation
– The Hadoop Distributed File System (HDFS) gives HBase a reliable storage layer, providing availability and reliability

HBase – Table Structure
– Every row in an HBase table has a unique identifier called its rowkey. Rowkey values are distinct across all rows, and every interaction with data in a table begins with the rowkey.
– Table rows are sorted by rowkey.
– A cell – the intersection of a row and a column – is versioned. By default, the version is a timestamp auto-assigned by HBase at the time of cell insertion.
– A cell's content is an uninterpreted array of bytes.
– Row columns are grouped into column families, and all members of a column family share a common prefix.
– Columns can be added on the fly by the client, as long as the column family they belong to already exists.

HBase – Table Structure (example)
The table is lexicographically sorted on the rowkeys; each cell can hold multiple versions, distinguished by timestamp.

Column family: Info
Rowkey   | Name     | Email          | Password
slee72   | Sanghoon | slee@gmail.com | slee123, 123slee (two versions of the cell)
sshakya1 | Sunny    | ss1@gmail.com  | ss123

HBase – Implementation
– Tables are automatically partitioned horizontally by HBase into regions; each region comprises a subset of a table's rows.
– Initially a table comprises a single region, but the region is split once its size crosses a configurable threshold. As the table grows, the number of its regions grows.
– Regions are the units that get distributed over an HBase cluster. In this way, a table that is too big for any one server can be carried by a cluster of servers, with each node hosting a subset of the table's total regions.

HBase – Implementation
(Diagram: regions distributed across the servers of an HBase cluster.)

HBase – Implementation
– HBase internally keeps special catalog tables named ROOT and META.
– The ROOT table holds the list of META table regions; the META table holds the list of all user-space regions.
– Fresh clients connect to the ZooKeeper cluster first to learn the location of ROOT.
– Clients then consult ROOT to learn the location of the META region.
– Finally, clients do a lookup against the found META region to figure out the hosting user-space region and its location.
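The three-hop lookup can be mimicked with plain dictionaries standing in for the real services (the region and server names here are made up for illustration; this is not the HBase client API):

```python
# Illustrative stand-ins for ZooKeeper and the two catalog tables
zookeeper = {"ROOT": "server-a"}                 # hop 1: where the ROOT region lives
root_table = {"META-region-1": "server-b"}       # hop 2: ROOT lists META regions
meta_table = {"mytable,row-prefix": "server-c"}  # hop 3: META lists user-space regions

def locate(rowkey_prefix):
    root_server = zookeeper["ROOT"]              # ask ZooKeeper for ROOT's location
    meta_server = root_table["META-region-1"]    # ask ROOT for the META region
    # ask META which server hosts the user-space region for this rowkey range
    return meta_table[f"mytable,{rowkey_prefix}"]

print(locate("row-prefix"))  # server-c
```

Real clients cache the META results, so most reads skip the first two hops entirely.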

HBase – Operations
(Diagram: HBase operations overview.)

HBase – Operations
– Five primitive commands: Get, Put, Delete, Scan, and Increment.
– In the shell, create ‘mytable’, ‘cf’ creates a table ‘mytable’ with a single column family ‘cf’.
– put ‘mytable’, ‘first’, ‘cf:message’, ‘hello HBase’ puts the bytes ‘hello HBase’ into a cell of ‘mytable’, in the ‘first’ row, at the ‘cf:message’ column.
– There are two ways to read a table: Get and Scan.
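The semantics of the five primitives can be sketched with a tiny in-memory table (a toy model, not HBase's storage engine or API; the shell equivalents are noted in comments):

```python
class TinyTable:
    """In-memory sketch of one HBase table: {rowkey: {column: value}}."""
    def __init__(self):
        self.rows = {}

    def put(self, rowkey, column, value):       # shell: put 'mytable', 'first', 'cf:message', 'hello HBase'
        self.rows.setdefault(rowkey, {})[column] = value

    def get(self, rowkey):                      # shell: get 'mytable', 'first'
        return self.rows.get(rowkey, {})

    def scan(self):                             # shell: scan 'mytable'
        return sorted(self.rows.items())        # rows come back sorted by rowkey

    def delete(self, rowkey):                   # shell: deleteall 'mytable', 'first'
        self.rows.pop(rowkey, None)

    def increment(self, rowkey, column, by=1):  # shell: incr 'mytable', 'first', 'cf:n'
        row = self.rows.setdefault(rowkey, {})
        row[column] = row.get(column, 0) + by
        return row[column]

t = TinyTable()
t.put("first", "cf:message", "hello HBase")
print(t.get("first"))  # {'cf:message': 'hello HBase'}
```

Note how scan returns rows in rowkey order, matching the lexicographic sort described above.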

HBase – Operations via the Java Client API
The same five primitive commands – Get, Put, Delete, Scan, and Increment – are available through the Java client API.
(Diagram: Java client code examples.)

HBase – Versioned Data
– In addition to being schema-less, HBase is also versioned: every time you perform an operation on a cell, HBase implicitly stores a new version.
– By default, HBase keeps only the last three versions; this is configurable per column family.

HBase – Data Coordinates
A table is conceptually a map of maps, and a value is addressed by four coordinates:
Map<RowKey, Map<ColumnFamily, Map<ColumnQualifier, Map<Version, Data>>>>
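Those four coordinates can be written down directly as nested dictionaries; reading a cell without a version picks the newest one. This reuses the Password cell from the table-structure example (the second timestamp is invented for illustration):

```python
# {rowkey: {family: {qualifier: {version_timestamp: value}}}}
table = {
    "slee72": {
        "Info": {
            "Password": {13452684: "slee123", 13452690: "123slee"},
        }
    }
}

def read_cell(rowkey, family, qualifier, version=None):
    versions = table[rowkey][family][qualifier]
    if version is None:
        version = max(versions)  # default: the latest timestamp wins
    return versions[version]

print(read_cell("slee72", "Info", "Password"))            # 123slee
print(read_cell("slee72", "Info", "Password", 13452684))  # slee123
```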

HBase – Modes of Operation
HBase can run in three different modes:
– Standalone: all of HBase runs in one Java process
– Pseudo-distributed: a single machine runs many Java processes
– Fully distributed: HBase is distributed across a cluster of machines

HBase – How It Differs from Cassandra
– Tables and keyspaces: Cassandra lacks the concept of a table; it is not common to have multiple keyspaces, the keyspace in a cluster is shared, and adding a keyspace requires a cluster restart. HBase has tables, each table has its own key space, and you can add and remove tables as easily as in an RDBMS.
– Column sorting: Cassandra offers sorting of columns; HBase does not.
– Supercolumns: Cassandra's supercolumns allow you to design very flexible, very complex schemas. HBase does not have supercolumns, but you can design a supercolumn-like structure, since column names and values are binary.
– MapReduce: in Cassandra, MapReduce support is new; you need a Hadoop cluster to run it, data is transferred from the Cassandra cluster to the Hadoop cluster, and it is not suitable for large MapReduce jobs. In HBase, MapReduce support is native: HBase is built on Hadoop, and data does not get transferred.
– Maintenance: Cassandra is comparatively simpler to maintain if you do not have to run Hadoop. HBase is comparatively complicated, as it has many moving pieces such as ZooKeeper, Hadoop and HBase itself.
– Java API: Cassandra does not have a native Java API as of now, and no Javadoc; even though it is written in Java, you have to use Thrift to communicate with the cluster. HBase has a nice native Java API, plus a Thrift interface for other languages.
– Single point of failure: Cassandra has no master server, hence no single point of failure. HBase has the concept of a master server but does not depend on it heavily; the cluster can keep serving data even if the master goes down. However, the Hadoop NameNode is a single point of failure.

Application – Real-time Micro-Blog Summarization
Twitter
– Started in 2006 as a micro-blogging site
– A very popular micro-blogging site where people send short messages of 140 characters, called tweets
– By 2013 it had 100 million active users sending 200 million tweets per day
– A majority of posts are conversational or not meaningful; 3.6% of posts concern topics of mainstream news
– It has become a very popular medium to disperse information

Application – Real-time Micro-Blog Summarization
Trending topics
– Twitter provides a list of popular topics; a user retrieves a list of recent posts containing the topic phrase.
– Some trends have a pound (#) sign before the word or phrase; a hashtag is included in a tweet to mark it as relating to a topic.
– Problem: the user has to read through the posts manually to understand a specific topic, because the posts are sorted by recency, not relevancy.

Application – Real-time Micro-Blog Summarization
Twitter APIs
– REST APIs (request/response)
– Streaming APIs (persistent HTTP connection)
  – Public streams: suitable for following specific users or topics, and for data mining
  – User streams: a single user's view of Twitter
  – Site streams: intended for servers connecting on behalf of many users

Application – Real-time Micro-Blog Summarization
How the APIs feed Hadoop and HBase: the public streams carry samples of all public updates, user streams carry one user's updates, and site streams carry multiple users' updates.


Application – Real-time Micro-Blog Summarization
To consume the stream continuously, we need a server.

Application – Architecture
Hadoop/HBase service node:
– Twitter Streaming API → receive Twitter information → preprocessing → write rows to an HTable through the HBase REST gateway
– Web logic: scan the HTable and summarize the data into a static HTML web page
Cluster layout: a Hadoop master (NameNode, HBase master server) and Hadoop slaves (each running a DataNode and a RegionServer).

Application – Summary Procedure
Tweet table:
Rowkey        | ColumnFamily | Column name | Timestamp | Value
Username/Time | UserInfo     | Username    | 13452684  | CSc8711
              |              | UserID      | 13452684  | Xke1kdfk
              |              | Location    | 13452684  | CL400
              |              | Post        | 13452684  | This is column #database
              |              | HashTag     | 13452684  | database

Hashtag counter table (the hashtag is extracted from the post):
Rowkey   | ColumnFamily | Column name | Timestamp | Value
database | HashTag      | Number      | 13452684  | 1
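Writing one tweet under this schema and bumping its hashtag counter can be sketched as follows (the values are the ones from the slide; the dict layout is an illustrative stand-in for the two HTables):

```python
import re

tweets, hashtags = {}, {}  # stand-ins for the tweet table and the counter table

def store_tweet(username, ts, user_id, location, post):
    rowkey = f"{username}/{ts}"  # rowkey is Username/Time
    tweets[rowkey] = {"UserInfo": {
        "Username": username, "UserID": user_id,
        "Location": location, "Post": post,
    }}
    # Extract hashtags from the post and increment each tag's counter
    for tag in re.findall(r"#(\w+)", post):
        tweets[rowkey]["UserInfo"]["HashTag"] = tag  # last tag wins in this sketch
        row = hashtags.setdefault(tag, {"HashTag": {"Number": 0}})
        row["HashTag"]["Number"] += 1

store_tweet("CSc8711", 13452684, "Xke1kdfk", "CL400", "This is column #database")
print(hashtags["database"]["HashTag"]["Number"])  # 1
```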

Application – Summary Procedure
Preprocessing: raw posts → tokenizing → removing stop words → stemming → vectorizing
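The four preprocessing stages can be sketched end to end; the stop-word list and the crude suffix-stripping stemmer below are placeholders for whatever the authors actually used:

```python
STOPWORDS = {"this", "is", "a", "the", "of"}  # illustrative list

def tokenize(post):
    # Lowercase and strip leading/trailing punctuation and '#'
    return [w.lower().strip("#.,!?") for w in post.split()]

def remove_stopwords(tokens):
    return [w for w in tokens if w not in STOPWORDS]

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def vectorize(tokens):
    # Bag-of-words term counts
    vec = {}
    for w in tokens:
        vec[w] = vec.get(w, 0) + 1
    return vec

post = "This is column #database"
vec = vectorize([stem(w) for w in remove_stopwords(tokenize(post))])
print(vec)  # {'column': 1, 'database': 1}
```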

Application – Summary Procedure
TF-IPF calculation
– TF(t, p) is the number of occurrences of term t in post p.
– IPF(t) is the inverse post frequency of term t, where totalPost is the total number of posts and numPost is the number of posts in which term t occurs.

The top posts are stored as the summary:
Rowkey        | ColumnFamily | Column name | Timestamp | Value
database/Time | Top10Summary | Summary     | 13452684  | This is column #database
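With those definitions, and assuming the standard inverse-frequency form IPF(t) = log(totalPost / numPost(t)) (the slide names the quantities but not the exact formula), scoring and ranking posts reduces to:

```python
import math

def tf_ipf_score(post_tokens, all_posts):
    total = len(all_posts)                           # totalPost
    score = 0.0
    for t in set(post_tokens):
        tf = post_tokens.count(t)                    # TF(t, p)
        num = sum(1 for p in all_posts if t in p)    # numPost(t)
        score += tf * math.log(total / num)          # TF x IPF
    return score

posts = [["hbase", "stores", "tweets"],
         ["hbase", "hbase"],
         ["tweets", "hbase"]]
best = max(posts, key=lambda p: tf_ipf_score(p, posts))
print(best)  # ['hbase', 'stores', 'tweets']
```

Terms that appear in every post (like "hbase" here) contribute log(1) = 0, so the summary favors posts with distinctive terms.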

Application – Demo

Application – Future Work
– Evaluation of summaries: summaries are generated but not yet evaluated or compared with other types of summaries, such as human summaries
– Utilize Hadoop/HBase in fully distributed mode

Thank You
Any Questions?

