Operations And Big Data: Hadoop, Hive And Scribe - O'Reilly


Operations and Big Data: Hadoop, Hive and Scribe
Zheng Shao (Weibo: @邵铮9)
12/7/2011, Velocity China 2011

Agenda
1. Operations: Challenges and Opportunities
2. Big Data Overview
3. Operations with Big Data
4. Big Data Details: Hadoop, Hive, Scribe
5. Conclusion

Operations: Challenges and Opportunities

The operations cycle (diagram): Measure → Understand → Improve → Monitor

Challenges
- Huge amount of data: sampling may not be good enough
- Distributed environment: log collection is hard
- Hardware failures are normal
- Distributed failures are hard to understand

Example 1: Cache miss and performance (Web → Memcache → MySQL)
- The Memcache layer has a bug that halves the cache hit rate
- The MySQL layer gets hit hard and MySQL performance degrades
- Web performance degrades as a result
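The cascade in this example can be made concrete with a back-of-the-envelope calculation. The numbers below are illustrative only (the talk gives no figures); the point is that halving the hit rate multiplies database load far more than 2x.

```python
# Illustrative model of the cache-miss cascade. The traffic and
# hit-rate numbers are made up for illustration.

def mysql_load(total_qps, cache_hit_rate):
    """Queries per second that fall through the cache to MySQL."""
    return total_qps * (1.0 - cache_hit_rate)

normal = mysql_load(100_000, 0.95)   # healthy cache: 5,000 qps on MySQL
buggy  = mysql_load(100_000, 0.475)  # hit rate halved: 52,500 qps on MySQL

print(buggy / normal)  # MySQL load grows 10.5x, not 2x
```

This is why a small cache bug shows up first as a MySQL problem: the miss traffic, not the hit rate, is what the database sees.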

Example 2: Map-Reduce Retries (one map task, four attempts)
- Attempt 1 hits a transient distributed file system issue and fails
- Attempt 2 hits a real hardware issue and fails
- Attempt 3 hits a transient application logic issue and fails
- Attempt 4, by chance, succeeds
- The whole process slows down
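The retry behavior in this example can be sketched as a simple loop. This is a toy model, not Hadoop's actual scheduler (which also blacklists nodes with repeated hardware failures); it just shows how each extra attempt adds latency to the whole job.

```python
# Toy model of map-task retries: attempts keep failing for different
# reasons until one succeeds, and every retry adds latency.

class TransientError(Exception):
    pass

def run_with_retries(task, max_attempts=4):
    """Retry a task up to max_attempts times; return which attempt
    finally succeeded along with its result."""
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, task(attempt)
        except TransientError:
            continue  # try again, ideally on a different node
    raise RuntimeError("task failed after %d attempts" % max_attempts)

def flaky_task(attempt):
    # Mirrors the slide: attempts 1-3 fail, attempt 4 succeeds by chance.
    if attempt < 4:
        raise TransientError("attempt %d failed" % attempt)
    return "ok"

print(run_with_retries(flaky_task))  # (4, 'ok')
```

Because the scheduler cannot tell a transient failure from a real one, it pays the full retry cost either way, which is exactly why these mixed-cause failures are hard to understand from the outside.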

Example 3: RPC Hierarchy (RPC 0 calls RPC 1, 2, 3; RPC 1 fans out to 1A/1B, RPC 3 to 3A/3B)
- RPC 3A failed
- The whole RPC 0 failed because of that
- The blame fell on the owner of service 3, because the log in service 0 shows that

Example 4: Inconsistent results in RPC
- RPC 0 got results from both RPC 1 and RPC 2
- Both RPC 1 and RPC 2 succeeded
- But RPC 0 detects that the results are inconsistent, and fails
- We may not have logged any trace information for RPC 1 and RPC 2 to continue debugging

Opportunities
- Big Data technologies: distributed logging systems, distributed storage systems, distributed computing systems
- Deeper analysis: data mining and outlier detection, time-series analysis
(diagram: storage, model, and computing layers)

Big Data Overview: An example from Facebook

Big Data
- What is Big Data? Volume big enough that it is hard to manage with traditional technologies; value big enough that it should not be sampled or dropped
- Where is Big Data used? Product analysis, user behavior analysis, business intelligence
- Why use Big Data for operations? Reuse existing infrastructure.

Overall Architecture (diagram): PHP and Java Scribe clients feed Scribe-HDFS. Near-realtime processing (3 GB/sec): Scribe Policy → PTail → Puma → HBase. Batch processing: Copy/Load into Central HDFS, queried with Hive (6 GB/sec and 9 GB/sec are also labeled in the figure).

Operations with Big Data

logview
Features:
- PHP fatal stack traces
- Group stack traces by similarity, order by counts
- Integrated with SVN/Task/Oncall tools
- Low priority: Scribe can drop logview data
(pipeline: PHP Scribe client → Scribe mid-tier → LogView → HTTP)
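Grouping stack traces "by similarity" can be approximated by normalizing away the variable parts of each trace (line numbers, addresses) before counting. The slides do not describe logview's actual grouping algorithm, so this is a minimal sketch of the idea:

```python
import re
from collections import Counter

def normalize(trace):
    """Strip line numbers and hex addresses so the 'same' fatal from
    different requests collapses into one group."""
    trace = re.sub(r":\d+", ":N", trace)            # file.php:123 -> file.php:N
    trace = re.sub(r"0x[0-9a-f]+", "0xADDR", trace)  # pointers -> 0xADDR
    return trace

def group_traces(traces):
    """Return (normalized_trace, count) pairs, biggest groups first."""
    return Counter(normalize(t) for t in traces).most_common()

traces = [
    "Fatal at render.php:10 in widget()",
    "Fatal at render.php:99 in widget()",
    "Fatal at db.php:7 in query()",
]
print(group_traces(traces))
# [('Fatal at render.php:N in widget()', 2), ('Fatal at db.php:N in query()', 1)]
```

Ordering the groups by count, as logview does, immediately surfaces the most common fatal rather than the most recent one.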

logmonitor
Rules:
- Regular-expression based, e.g. ".*Missing Block.*"
- Rules have levels: WARN, ERROR, etc.
- Dynamic rules: rules can be modified and are propagated to clients
(diagram: logmonitor clients apply rules against PTail/local logs and report (rule name, count) to a stats server backed by rules storage; top rules are shown on the web)
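The regex rules with levels might be modeled as below. The rule shapes and names here are assumptions for illustration; the slide only shows the ".*Missing Block.*" pattern and the WARN/ERROR levels.

```python
import re
from collections import Counter

# Each rule: (name, level, compiled regex). Appending to this list at
# runtime is the "dynamic rules" idea from the slide.
RULES = [
    ("missing-block", "ERROR", re.compile(r".*Missing Block.*")),
    ("slow-rpc",      "WARN",  re.compile(r".*rpc took \d{4,}ms.*")),
]

def apply_rules(lines):
    """Count rule hits over a stream of log lines, as a stats server
    aggregating per-rule counts might."""
    hits = Counter()
    for line in lines:
        for name, level, rx in RULES:
            if rx.match(line):
                hits[(name, level)] += 1
    return hits

log = [
    "2011-12-07 ERROR Missing Block blk_123",
    "2011-12-07 INFO rpc took 5000ms",
]
print(apply_rules(log))
```

Keeping only (rule, count) pairs instead of raw lines is what lets a central stats server keep up with the full log volume.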

Self Monitoring
Goal:
- Set KPIs for SOA
- Isolate issues in distributed systems
- Make it easy for service owners to monitor
Approach:
- Log4J integration with Scribe
- JMX/Thrift/Fb303 counters
- Client-side logging
- Server-side counter query by the service owner

Global Debugging with PTail
- Logging instruction: logging levels, logging destination (log name)
- Additional fields: Request ID
(diagram: services 1-3 pass RPC logging instructions downstream; each service's log data is collected by Scribe and read by PTail)
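Propagating a request ID through RPCs so a PTail-style reader can stitch one request's logs together across services might look like the sketch below. All names here are hypothetical; the actual RPC plumbing is Facebook-internal.

```python
import uuid

LOG = []  # stand-in for a Scribe log category

def log(request_id, service, message):
    # Every line carries the request ID, the key idea of the slide.
    LOG.append({"request_id": request_id, "service": service, "msg": message})

def handle_request():
    rid = str(uuid.uuid4())
    log(rid, "service1", "start")
    call_service2(rid)          # the ID travels with the downstream RPC
    log(rid, "service1", "done")
    return rid

def call_service2(rid):
    log(rid, "service2", "queried db")  # downstream logs reuse the same ID

def ptail_filter(rid):
    """What a PTail-style grep for one request would return."""
    return [e for e in LOG if e["request_id"] == rid]

rid = handle_request()
print([e["msg"] for e in ptail_filter(rid)])  # ['start', 'queried db', 'done']
```

Without the shared ID, the service2 line is indistinguishable from thousands of others, which is exactly the debugging dead-end of Example 4 above.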

Hive Pipelines
- Daily and historical data analysis: What is the trend of a metric? When did this bug first happen?
Examples:
SELECT percentile(latency, "50,75,90,99") FROM latency_log;
SELECT request_id, GROUP_CONCAT(log_line) AS total_log
FROM trace
GROUP BY request_id
HAVING total_log LIKE "%FATAL%";

Big Data Details: Hadoop, Hive, Scribe

Key Requirements
- Ease of use: smooth learning curve, easy integration, structured/unstructured data, schema evolution
- Latency: real-time data, historical data
- Scalability: spiky traffic and QoS, raw data / drill-down support
- Reliability: low data loss, consistent computation

Overall Architecture (diagram, repeated): Scribe clients (PHP, Java) → Scribe-HDFS; near-realtime processing via PTail → Puma → HBase; batch processing via Copy/Load into Central HDFS with Hive.

Distributed Logging System - Scribe https://github.com/facebook/scribe

Distributed Logging System - Scribe (diagram): clients send LogData (category, message) to Scribe servers over Thrift RPC.

Scribe Improvements
- Network efficiency: per-RPC compression (using quicklz)
- Operation interface: category-based blacklisting and sampling
- Adaptive logging: use BufferStore and NullStore to drop messages as needed
- QoS: use separate hardware for now
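Category-based blacklisting and sampling amounts to a per-category keep-probability. The configuration shape below is an assumption for illustration; the slide does not show Scribe's actual config format.

```python
import random

# Assumed config: keep probability per category. 1.0 = log everything,
# 0.0 = blacklisted, in between = sampled. Unknown categories default
# to keeping everything.
SAMPLING = {"click_log": 0.01, "fatal": 1.0, "debug_spam": 0.0}

def should_log(category, rng=random.random):
    """Decide per message whether to forward or drop it."""
    return rng() < SAMPLING.get(category, 1.0)

# Deterministic edge cases: a blacklisted category never logs,
# a rate-1.0 category always logs.
assert not should_log("debug_spam")
assert should_log("fatal")
```

Making this decision at the client keeps the dropped volume off the network entirely, which is the point of putting sampling in the operation interface.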

Distributed Storage Systems - Scribe-HDFS
Architecture: Scribe clients → Calligraphus mid-tier → Calligraphus writers → HDFS, coordinated via ZooKeeper
Features:
- Scalability: 9 GB/sec
- No single point of failure (except the NameNode)
- Not open-sourced yet

Distributed Storage Systems - HDFS
Architecture:
- NameNode: namespace, block locations
- DataNodes: data blocks, replicated 3 times
Features:
- 3000 nodes, PBs of space
- Highly reliable
- No random writes
https://github.com/facebook/hadoop-20

HDFS Improvements
Efficiency:
- Random read keep-alive: HDFS-941
- Faster checksum: HDFS-2080
- Use fadvise: HADOOP-7714
Credits: -presentationslides-hadoop-and-performance

Distributed Storage Systems - HBase
Architecture:
- (row, column family, column, value) data model
- Write-ahead log; records are sorted in memory/files
- Master and RegionServers
Features:
- 100 nodes; random read/write; great write performance

Distributed Computing Systems – MR
Architecture: JobTracker, TaskTrackers, MR client
Features:
- Pushes computation to the data
- Reliable: automatic retry
- Not easy to use

MR Improvements
Efficiency:
- Faster compareBytes: HADOOP-7761
- MR sort cache locality: MAPREDUCE-3235
- Shuffle: MAPREDUCE-64, MAPREDUCE-318
Credits: -presentationslides-hadoop-and-performance

Distributed Computing Systems – Hive
Architecture: MetaStore, Compiler, Execution (Hive command line → Compiler → MR client → Map-Reduce on TaskTrackers)
Features:
- SQL on Map-Reduce: SELECT, GROUP BY, JOIN
- UDF, UDAF, UDTF, scripts

Useful Features in Hive
- Complex column types: Array, Struct, Map, Union
  CREATE TABLE t (a STRUCT<c1:MAP<STRING,STRING>, c2:ARRAY<STRING>>);
- UDFs: UDF, UDAF, UDTF
- Efficient joins: bucketed map join (HIVE-917)

Distributed Computing Systems – Puma
Architecture: HDFS → PTail → Puma → HBase
Features:
- StreamSQL: SELECT, GROUP BY, JOIN; UDF, UDAF
- Reliable: no data loss or duplication
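A streaming GROUP BY of the kind Puma runs over PTail output can be sketched as below. This is a toy version under assumed semantics; Puma's StreamSQL engine itself is not public.

```python
from collections import defaultdict

# Toy stream aggregator: consume (key, value) events and keep per-key
# sums, flushing a window the way Puma periodically writes to HBase.

class StreamGroupBy:
    def __init__(self):
        self.counts = defaultdict(int)

    def consume(self, key, value=1):
        self.counts[key] += value

    def flush(self):
        """Emit the current window's aggregates and start a new window."""
        window, self.counts = dict(self.counts), defaultdict(int)
        return window

agg = StreamGroupBy()
for url in ["/home", "/home", "/profile"]:
    agg.consume(url)
print(agg.flush())  # {'/home': 2, '/profile': 1}
```

The "no data loss/duplicate" guarantee on the slide is the hard part the toy skips: it requires checkpointing the PTail read position together with the flushed aggregates.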

Conclusion: Big Data can help operations

Big Data can help Operations. 5 steps to make it effective:
1. Make Big Data easy to use
2. Log more data and keep more samples whenever needed
3. Build debugging infrastructure on top of Big Data
4. Do both real-time and historical analysis
5. Continue to improve Big Data

(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

