Info Theory and Big Data

Info theory and big data
"Typical or not-typical, that is the question"
Han Vinck, University Duisburg-Essen, Germany
September 2016

Content: big data issues
A definition: a large amount of collected and stored data to be used for further analysis, too large for traditional data processing applications.
Benefits: we can do things that we could not do before!
- Healthcare: 20% decrease in patient mortality by analyzing streaming patient data.
- Telco: 92% decrease in processing time by analyzing networking and call data.
- Utilities: 99% improved accuracy in placing power generation resources by analyzing 2.8 petabytes of untapped data.
Note: remember that you must invest in security to protect your information.

Big data: collect, store, and draw conclusions from the data.
Some problems:
- extracting knowledge from the data: knowledge is based on information or relevant data
- what to collect: variety, importance
- how to store: volume, structure
- privacy and security

What kind of problems to solve? There are:
- technical processing problems: how to collect and store
- semantic/content problems: what to collect and how to use

Information theory can be used to quantify information and relations. Two contributions of great importance.

1956, Shannon and the "Bandwagon": Shannon was critical of "his" information theory.

A nice picture (often used) to illustrate the idea of content: context and understanding (who? what? how? why?). Semantics are used to make decisions or draw conclusions.

Shannon and semantics (Shannon, 1916-2016).

Extension from Shannon's Fig. 1 to a system using semantics.

How to store large amounts of data?
- High density: need for error control.
- Distributed: need for communication.
- Cloud: out of control, need for trust.

How to access large amounts of data? Problems: where? who? how? (concentrated, distributed, cloud)

Shannon's reliable information theory. Communication is the transfer of information; knowledge is based on information.

Reliable transmission/storage (Shannon). NO SEMANTICS! For a certain transmission quality (errors), codes exist (constructive) that give P(error) → 0 at data rates up to a certain maximum (calculable) efficiency, the capacity, a bound determined by the quality of the channel.

Large memories are not error free! SSD drives use BCH codes that can correct 1 error or detect 2 errors.
- Can we improve the lifetime of an SSD by using stronger codes?
- How big is the improvement?
(My MSc computer (1974): 44 kB main memory, 1 MByte hard disk; compare today's 3.8 TByte drives.)

Assuming that memory cells get defective: for a memory of N words there is a gain in MTTF (mean time to failure) when coding is used. For a simple d_min = 3 code the gain is proportional to [...]. (Figure: chip surface needed to realize time T.)

Shannon's information theory. NO SEMANTICS!
- Assign -log2 p(x) bits to a message x from a given set: likely → short, unlikely → long.
- Shannon showed how, and quantified the minimum obtainable average assigned length:
  H(X) = -∑ p(x) log2 p(x)   (Shannon entropy)
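
As a concrete illustration of the entropy formula above, here is a minimal Python sketch (my own, not from the slides; the function name and example distributions are assumptions) that computes H(X) in bits for a finite probability distribution:

    import math

    def entropy(probs):
        """Shannon entropy in bits: H(X) = -sum p(x) * log2 p(x)."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # A biased binary source with p = 0.1 needs far less than 1 bit per symbol.
    print(entropy([0.1, 0.9]))   # about 0.469
    # A uniform source over 8 symbols reaches the maximum log2(8) = 3 bits.
    print(entropy([1 / 8] * 8))  # 3.0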

Data compression (exact reconstruction possible). Exact representation is costly (it depends on the source variability!). We need a good algorithm (non-exponential in the block length n).

Data reduction (no exact reconstruction). NOTE: in big data we are interested in the NOISE! With no exact reconstruction we get a good memory reduction, but in general we lose the details.
- How many bits do we need for a particular distortion?
- We need to define the distortion properly!

Algorithms (old techniques from the past) to avoid large data files. New data of length n is compared with a memory of N sequences: find a "close" match in the memory, compress the difference, and store the match number and the difference. If we use N sequences from the memory, we need k = log2 N bits for the memory index and H(difference) bits for the new data. The memory can be updated (based on the frequency of using a word). Optimization: number of words in memory versus size of the difference.
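
The "close match plus difference" idea can be sketched in a few lines of Python. This is only a hedged illustration under my own assumptions (fixed-length bit strings, Hamming distance as the closeness measure, no memory update); the function names are hypothetical.

    def hamming(a, b):
        """Number of positions in which two equal-length bit strings differ."""
        return sum(x != y for x, y in zip(a, b))

    def encode_with_memory(new_data, memory):
        """Return (index of the closest stored sequence, XOR difference)."""
        best = min(range(len(memory)), key=lambda i: hamming(new_data, memory[i]))
        diff = ''.join('1' if x != y else '0' for x, y in zip(new_data, memory[best]))
        return best, diff   # store log2(N) bits for 'best' plus the (compressible) diff

    def decode_with_memory(index, diff, memory):
        """Reconstruct the original data from the match index and the difference."""
        return ''.join(str(int(d) ^ int(m)) for d, m in zip(diff, memory[index]))

    memory = ['00000000', '11110000', '10101010']        # N stored sequences
    idx, diff = encode_with_memory('11100000', memory)
    print(idx, diff)                                     # 1 00010000
    print(decode_with_memory(idx, diff, memory))         # 11100000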

Modification to save bits for sources with memory: use prediction. The new data of length n is passed through a predictor and only the difference is compressed, at a cost of H(difference) bits for the new data. This is as in video stream coding; example: video coding using DCT and Huffman coding.

Shannon's prediction of English (again, no semantics).

Example showing the importance of prediction. Metering: only the difference with the last value is of interest. If the consumption is typical (within expectations), encode the difference; if it is a-typical, encode the real value. (Figure: typical range for expected values; total consumption over time, Jan, Feb, March.)
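
A minimal sketch of the metering idea, under my own assumptions (a fixed "typical" range for the difference and a simple escape symbol for a-typical readings); it is not the encoder of any real metering standard.

    TYPICAL_RANGE = 16   # assumption: differences in [-16, 16] count as typical

    def encode_readings(readings):
        """Encode a reading as a small difference if typical, else as an escape plus the raw value."""
        out, last = [], 0
        for r in readings:
            d = r - last
            out.append(('diff', d) if abs(d) <= TYPICAL_RANGE else ('raw', r))
            last = r
        return out

    def decode_readings(symbols):
        vals, last = [], 0
        for kind, v in symbols:
            last = last + v if kind == 'diff' else v
            vals.append(last)
        return vals

    readings = [100, 104, 103, 250, 255]        # 250 is an a-typical jump
    enc = encode_readings(readings)
    print(enc)
    print(decode_readings(enc) == readings)     # True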

An important issue is outlier and anomaly detection.
- Outlier: a legitimate data point that is far away from the mean or median of a distribution. Example: treated in information theory.
- Anomaly: an illegitimate data point generated by a different process than whatever generated the rest of the data. Example: used in authentication of data.

Further problems appear for classification: what is normal?

Classical information theory approach: outliers. Information theory focuses on typicality:
- the set of most probable outputs of a channel/source
- measures like entropy, divergence, etc.
NO SEMANTICS.

Properties of typical sequences (Shannon, 1948).
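
The properties referred to here are those of the asymptotic equipartition property: a typical sequence of length n has probability close to 2^(-nH), there are roughly 2^(nH) such sequences, and together they carry almost all of the probability. As a hedged illustration (my own example, a memoryless binary source), the sketch below checks whether a sequence is epsilon-typical by comparing its per-symbol log-probability with the entropy.

    import math

    def is_typical(seq, p_one, eps=0.1):
        """True if -(1/n) log2 P(seq) is within eps of the source entropy H(p_one)."""
        n = len(seq)
        h = -(p_one * math.log2(p_one) + (1 - p_one) * math.log2(1 - p_one))
        logp = sum(math.log2(p_one if s == 1 else 1 - p_one) for s in seq)
        return abs(-logp / n - h) <= eps

    # Bernoulli(0.2) source: a sequence with about 20% ones is typical,
    # while the all-ones sequence (the individually least probable pattern) is not.
    print(is_typical([1, 0, 0, 0, 0] * 20, p_one=0.2))   # True
    print(is_typical([1] * 100, p_one=0.2))              # False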

Example. PROBLEM: we need the entropy!

How to estimate the entropy, or a probability distribution? Given a finite set of observations, can we estimate the entropy of a source? Many papers study this topic, especially in neuroscience.
Ref: Martin Vinck, Francesco P. Battaglia, Vladimir B. Balakirsky, A. J. Han Vinck, and Cyriel M. A. Pennartz, "Estimation of the entropy based on its polynomial representation," Phys. Rev. E 85, 051139 (2012).
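
The cited paper develops an estimator based on a polynomial representation of the entropy; the sketch below is not that method, only the simple plug-in (maximum likelihood) estimate from observed symbol frequencies, with the well-known Miller-Madow bias correction shown for comparison. Names and data are my own.

    import math
    from collections import Counter

    def plugin_entropy(samples):
        """Plug-in estimate: entropy (in bits) of the empirical distribution."""
        counts, n = Counter(samples), len(samples)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def miller_madow_entropy(samples):
        """Plug-in estimate plus the first-order bias correction (K - 1) / (2 n ln 2)."""
        k = len(set(samples))
        return plugin_entropy(samples) + (k - 1) / (2 * len(samples) * math.log(2))

    data = ['a', 'a', 'b', 'a', 'c', 'b', 'a', 'a', 'b', 'c']
    print(plugin_entropy(data))        # tends to underestimate H for small samples
    print(miller_madow_entropy(data))  # slightly larger, bias-corrected value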

Information retrieval.

Checking properties: questions.
- Do you have a particular property? (identification) Example: is yellow a property? Search in the database.
- Is this a valid property? (authentication) Example: is yellow a valid property? Search in the property list.

A test for the validity of a property can be done using the Bloom filter. For T properties, every property is mapped to k '1's in random positions of an n-bit array. To check a property, check the map (k positions) of that property in the n-bit array.
Performance: P(false accept) = (1 - (1 - 1/n)^(kT))^k ≈ 2^(-k), for k = (n/T) ln 2.
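
A minimal Bloom filter sketch in Python (my own toy implementation; the class name and parameters are assumptions). It shows the essential behaviour: no false negatives, and a small false-positive probability controlled by k.

    import hashlib

    class BloomFilter:
        def __init__(self, n_bits, k_hashes):
            self.n, self.k = n_bits, k_hashes
            self.bits = [0] * n_bits

        def _positions(self, item):
            # k pseudo-random positions derived from k salted SHA-256 digests.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.n

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1

        def might_contain(self, item):
            # True means "possibly present"; False means "definitely not present".
            # With k chosen as (n/T) ln 2 for T stored items, P(false accept) is about 2^-k.
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter(n_bits=1024, k_hashes=7)
    bf.add("yellow")
    print(bf.might_contain("yellow"))   # True
    print(bf.might_contain("purple"))   # almost certainly False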

Bloom (1970), quote: "The same idea appeared as 'superimposed codes' at Bell Labs, which I left in 1969." Every sum of up to T different code words logically includes no code word other than those used to form the sum (Problem 2).

Superimposed codes: check the presence of a property. Start with an N x n array; every property corresponds to a row, and every row has p·n '1's. Property: the OR of any subset of size T does not cover any other row. Signature or descriptor list: the OR of T rows. Check for a particular property: is the property covered by the signature (included in the OR) or not?
Code existence: the probability that a random vector is covered by T others goes to 0 for p = ln 2 / T (same as before), and since we want a specific code, n is of the order T log N.
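
The cover property can be checked directly. Below is a hedged Python sketch (a random construction with my own parameter choices, not a specific code from the literature) that builds an N x n binary array with 1-probability p = ln 2 / T, forms the OR (signature) of T rows, and tests whether any other row is covered by it.

    import math
    import random

    def random_superimposed_code(N, n, T, seed=1):
        """N rows of length n; each bit is 1 with probability p = ln(2) / T."""
        random.seed(seed)
        p = math.log(2) / T
        return [[1 if random.random() < p else 0 for _ in range(n)] for _ in range(N)]

    def signature(rows):
        """Bitwise OR of a set of rows (the descriptor list)."""
        return [max(bits) for bits in zip(*rows)]

    def covered(row, sig):
        """A row is covered if the signature has a 1 wherever the row has a 1."""
        return all(s >= r for r, s in zip(row, sig))

    T, N = 3, 200
    n = 3 * T * T * int(math.log2(N))    # within the bound T log2 N <= n <= 3 T^2 log2 N
    code = random_superimposed_code(N, n, T)
    sig = signature(code[:T])            # OR of the first T properties
    print(all(covered(r, sig) for r in code[:T]))   # True: the used rows are covered
    print(any(covered(r, sig) for r in code[T:]))   # False with high probability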

Example.

Code example. Bound: T log2 N ≤ n ≤ 3 T^2 log2 N. In the example, each property is assigned a binary vector such that the OR of any two property vectors does not cover any other property vector.

How to retrieve information from a big set: superimposed codes. We need associative memory!

More general: take distinct [...] for 1, 2, ..., m.

References:
- Arkadii G. D'yachkov
- W. H. Kautz
- Calvin N. Mooers (1956), "Zatocoding and developments in information retrieval," Aslib Proceedings, Vol. 8, Iss. 1, pp. 3-22.
My own: A. J. Han Vinck and Samuel Martirossian, "On superimposed codes," in Numbers, Information and Complexity, editors Ingo Althöfer, Ning Cai, Gunter Dueck, 2013, Technology & Engineering.

Security and privacy concerns for big data. Problems:
- data privacy
- data protection/security

Message encryption without source coding. The message consists of Part 1, Part 2, ..., Part n (for example, every part 56 bits), and dependency exists between the parts of the message. Each part is enciphered with the key into one of n cryptograms and deciphered with the key; dependency also exists between the cryptograms. Attacker: n cryptograms to analyze for a particular message of n parts.

Message encryption with source coding (for example, every part 56 bits). The message Part 1, ..., Part n is first source encoded (assume a data compression factor of n-to-1), then enciphered with the key into 1 cryptogram; the receiver deciphers and source decodes back into Part 1, ..., Part n. Attacker: only 1 cryptogram to analyze for a particular message of n parts. Hence, less material for the same message!

The biometric identification/authentication problem:
1. conversion to binary
2. complex searching, f(N)
3. privacy
4. variations?

Illustration of the authentication problem using a hash sh(): compare the stored hash with the hash of the presented biometric. PROBLEM: the biometric differs from measurement to measurement, and thus so does the hash! Advantage: no memorization.

Information theory can help to solve the security/privacy problem; it "transformed cryptography from an art to a science." For perfect secrecy we have a necessary condition: H(S|X) = H(S), and H(S) ≤ H(B), i.e. the number of messages ≤ the number of keys (B is the secret key).
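
A small Python sketch (my own illustration) of why the key must carry at least as much entropy as the message: with a uniform one-bit key XORed onto a one-bit message, the ciphertext distribution is the same whatever the message is, so observing the ciphertext gives no information about the message.

    from collections import Counter

    def cipher_distribution(message_bit):
        """Distribution of the ciphertext X = S xor B when the key bit B is uniform."""
        counts = Counter(message_bit ^ key for key in (0, 1))
        total = sum(counts.values())
        return {c: counts[c] / total for c in (0, 1)}

    # The ciphertext is uniform regardless of the message, so H(S|X) = H(S).
    print(cipher_distribution(0))   # {0: 0.5, 1: 0.5}
    print(cipher_distribution(1))   # {0: 0.5, 1: 0.5}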

Shannon's noisy key model: B' = B ⊕ E. For perfect secrecy H(S|X) = H(S) we need H(S) ≤ H(B) - H(E), i.e. we pay a price for the noise!

Shannon's noisy key model used for biometrics (Ari Juels). The biometric B is combined with a "random" linear codeword c(r) with k "info" symbols, and the database stores c(r) ⊕ B. A noisy biometric B' = B ⊕ E gives (c(r) ⊕ B) ⊕ B' = c(r) ⊕ E, from which r can be decoded. The key size is K = H(B) - H(E). There is a limit on the error-correcting capability and on the privacy: the probability of a correct guess is 2^(-k). Larger k: fewer errors corrected, more privacy. Smaller k: more errors corrected, less privacy.

Biometrics challenge: get biometric features into binary form (protection, identification).

Examples where information theory helps to solve problems in big data:
- data compression/reduction with/without distortion
- data quality using error-correcting codes
- data protection: the cryptographic approach
- outlier/anomaly detection and classification
- information retrieval

The end.
My website: https://www.uni-due.de/dc/
My recent (2013) book with some of my research results (free ...s/dc/book coding concepts and reed solomon codes.pdf).


Privacy?

References.

Information theory: channel coding theorem (1). For a binary code with words of length n and rate (efficiency) R = k/n, the number of code words is 2^k. To achieve the Shannon channel capacity with P(error) → 0, n → infinity and thus also k → infinity. Hence: a coding problem (# of code words = 2^k; how to encode!) and also a decoding problem!

Topics we can work on, based on past performance:
- information-theoretic principles for anomaly detection
- biometrics and big data
- memory systems and big data
- privacy in the smart grid
- information retrieval and superimposed codes

Use an error-correcting code for noiseless source coding: 2^k code words of length n, each correcting about 2^(nH(p)) noise vectors, where 2^k · 2^(nH(p)) = 2^n, or k/n = 1 - H(p) (at capacity).
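
A quick numerical check of the relation k/n = 1 - H(p), under my own example values (p = 0.1, n = 1000); this is a hedged illustration of the counting argument, not a statement about any particular code.

    import math

    def h2(p):
        """Binary entropy function in bits."""
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    p, n = 0.1, 1000
    # Each codeword must account for roughly 2^(n*h2(p)) likely noise patterns, so the
    # number of codewords is at most 2^n / 2^(n*h2(p)) = 2^(n*(1 - h2(p))).
    rate = 1 - h2(p)
    print(round(h2(p), 3))    # about 0.469
    print(round(rate, 3))     # about 0.531
    print(round(n * rate))    # k: about 531 information bits per 1000-bit word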

An obvious algorithm (like Lempel and Ziv). Keep a stored sequence of length about 2^(nH) and test whether the next string of length n occurs in the stored sequence somewhere. If yes, then it is typical; if not, then it is a-typical. The stored sequence is updated as data arrives. Efficiency: the entropy, H bits/symbol. Since the probability of a typical sequence is about 2^(-nH), we expect all typical sequences to occur in the stored sequence.
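
A toy version of this search-based typicality test, under my own assumptions (a binary memoryless source, a fixed stored window, plain substring search rather than any optimized data structure):

    import random

    def typical_by_search(block, stored):
        """Declare a block 'typical' if it occurs somewhere in the stored sequence."""
        return block in stored

    random.seed(0)
    p_one = 0.2
    # Stored reference sequence drawn from the source; its length plays the role of ~2^(nH).
    stored = ''.join('1' if random.random() < p_one else '0' for _ in range(200000))
    typical_block = '001000010'     # roughly 20% ones: expected to occur in the store
    atypical_block = '111111111'    # probability 0.2^9: very unlikely to occur
    print(typical_by_search(typical_block, stored))    # very likely True
    print(typical_by_search(atypical_block, stored))   # very likely False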

Uniquely decipherable codes.

Some pictures. Shannon "transformed cryptography from an art to a science." The book co-authored with Warren Weaver, The Mathematical Theory of Communication, reprints Shannon's 1948 article and Weaver's popularization of it, which is accessible to the non-specialist. In short, Weaver reprinted Shannon's two-part paper, wrote a 28-page introduction for a 144-page book, and changed the title from "A mathematical theory ..." to "The mathematical theory ...".

Illustration of the authentication problem using a memorized password. Enrollment: hash(pwd) of the password is stored in a database. Verification: hash(pwd) of the presented password is computed and compared with the stored value.

We use information and communication theory.

For perfect secrecy we have the necessary condition H(S|X) = H(S) and H(S) ≤ H(B), i.e. the number of messages ≤ the number of keys. Wiretap channel model (noiseless): sender, receiver and eavesdropper (wiretapper); the message s is protected by the secret key B, and the secrecy rate is Cs = H(B), the amount of secret bits per transmission.

For perfect secrecy H(S|X) = H(S) we need H(S) ≤ H(B) - H(E), i.e. we pay a price for the noise! Wiretap channel model (Aaron Wyner): the wiretapper's observation is degraded by the error E, and the secrecy rate is Cs = H(B) - H(E) secret bits per transmission.

Solution given by the Juels-Wattenberg scheme, using binary codes. Generate one out of 2^k codewords c(r) at random; for fingerprint b, store c(r) ⊕ b in the database. Condition: given c(r) ⊕ b it is hard to estimate b or c(r); an attacker can only guess one out of 2^k codewords.

Safe storage: how to deal with noisy fingerprints? At verification the fingerprint is b* = b ⊕ e. The database stores c(r) ⊕ b; adding b* gives (c(r) ⊕ b) ⊕ b* = c(r) ⊕ e, and the decoder recovers one out of the 2^k codewords, c(r), and hence r. Condition: given c(r) ⊕ b it is hard to estimate b or c(r); an attacker can only guess one out of 2^k codewords.

Reconstruction of the original fingerprint: after decoding c(r) from c(r) ⊕ e, XOR it with the stored value c(r) ⊕ b to obtain b. Thus b can be reconstructed and used as the correct password.

Authentication: how to check the result? The database stores c(r) ⊕ b together with hash(r, b). From the noisy fingerprint b* = b ⊕ e and the stored c(r) ⊕ b, decode c(r), recover r' and b', and compare hash(r', b') with the stored hash(r, b): is b' a noisy version of b? The result is correct when r' = r.
- False Rejection Rate (FRR): a valid b' is rejected.
- False Acceptance Rate (FAR): an invalid b' is accepted.
- Successful Attack Rate (SAR): a correct guess of c, then construct b from c ⊕ b.
PERFORMANCE DEPENDS ON THE CODE! Small k gives good error protection.
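
A toy Python sketch of the scheme described in the last few slides, assuming a simple repetition code as the codeword generator and SHA-256 for hash(r, b). These choices and all parameter values are my own for illustration; they are not the codes proposed by Juels and Wattenberg.

    import hashlib
    import random

    K, REP = 8, 5            # k info bits, each repeated 5 times
    N = K * REP              # codeword / biometric length in bits

    def encode(r):           # repetition code: every info bit repeated REP times
        return [bit for bit in r for _ in range(REP)]

    def decode(word):        # majority vote per block of REP bits
        return [1 if sum(word[i * REP:(i + 1) * REP]) > REP // 2 else 0 for i in range(K)]

    def hash_rb(r, b):
        return hashlib.sha256(bytes(r) + bytes(b)).hexdigest()

    def enroll(b):
        r = [random.randint(0, 1) for _ in range(K)]
        c = encode(r)
        return [ci ^ bi for ci, bi in zip(c, b)], hash_rb(r, b)   # store c(r) xor b and hash(r, b)

    def authenticate(b_star, stored, stored_hash):
        noisy_c = [si ^ bi for si, bi in zip(stored, b_star)]     # equals c(r) xor e
        r_prime = decode(noisy_c)
        b_prime = [si ^ ci for si, ci in zip(stored, encode(r_prime))]
        return hash_rb(r_prime, b_prime) == stored_hash

    random.seed(1)
    b = [random.randint(0, 1) for _ in range(N)]      # enrolled fingerprint
    stored, stored_hash = enroll(b)
    b_noisy = b[:]
    b_noisy[0] ^= 1
    b_noisy[17] ^= 1                                  # two bit errors in different blocks
    print(authenticate(b_noisy, stored, stored_hash))                                   # True
    print(authenticate([random.randint(0, 1) for _ in range(N)], stored, stored_hash))  # almost certainly False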

Entropy and mutual information:
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)
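
These identities can be checked numerically. A small sketch (with an example joint distribution of my own choosing) computing H(X), H(Y), H(X,Y) and I(X;Y) from a joint probability table:

    import math

    def H(probs):
        """Entropy in bits of a list of probabilities."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Example joint distribution P(X, Y) on {0, 1} x {0, 1}.
    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    px = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
    py = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

    Hx, Hy, Hxy = H(px), H(py), H(list(joint.values()))
    I = Hx + Hy - Hxy                    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    print(round(Hx, 3), round(Hy, 3), round(Hxy, 3), round(I, 3))
    # Consistency with the other identity: I(X;Y) = H(X) - H(X|Y) = H(X) - (H(X,Y) - H(Y)).
    print(abs(I - (Hx - (Hxy - Hy))) < 1e-12)   # True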

How can we reduce the amount of data? (1) Represent every possible source output of length n by a "binary" vector of length m.
- Noiseless: an exact representation is costly (it depends on the source variability!); we need a good algorithm (non-exponential in the block length n).
- Noisy: good memory reduction, but in general we lose the details; how many bits do we need for a particular distortion? We need to define the distortion properly!
NOTE: we are interested in the NOISE!

How can we reduce the amount of data? (2) Assign -log p(x) bits to a message x: likely → small, unlikely → large. Shannon showed how to do this; the minimum obtainable average assigned length is H(X) = -∑ p(x) log p(x) (Shannon entropy). Suppose that we use another assignment, -log q(x). The difference (divergence) in average length is D(P||Q) := -∑ p(x) log q(x) - (-∑ p(x) log p(x)) = ∑ p(x) log(p(x)/q(x)) ≥ 0!
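
A matching sketch for the divergence (again with example distributions of my own), confirming that D(P||Q) is exactly the extra average length paid for coding with the wrong distribution and that it is never negative:

    import math

    def avg_length(p, q):
        """Average code length when symbols with true probabilities p are given -log2 q(x) bits."""
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    def divergence(p, q):
        """D(P||Q) = sum p(x) log2(p(x)/q(x)) >= 0, in bits."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.25]       # true source distribution
    q = [1 / 3, 1 / 3, 1 / 3]   # assumed (wrong) distribution
    print(round(avg_length(p, p), 3))                     # H(X) = 1.5 bits
    print(round(avg_length(p, q), 3))                     # 1.585 bits: longer on average
    print(round(avg_length(p, q) - avg_length(p, p), 3))  # the penalty ...
    print(round(divergence(p, q), 3))                     # ... equals D(P||Q), about 0.085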

What do we need? Good knowledge of the structure of the data, for:
- good prediction
- a high compression rate
- variability for non-stationary data statistics

Anomaly: normal or abnormal? We need to develop decision mechanisms!

For perfect secrecy H(S|X) = H(S): H(S) ≤ H(B) - H(E), i.e. we pay a price for the noise. Wiretap channel model (Aaron Wyner); secrecy rate Cs = H(B) - H(E) secret bits per transmission.

The same model with the biometric as the key: the biometric Bio (with error E) takes the role of B, and the database stores c(r) ⊕ B, where c(r) is a "random" linear codeword. For perfect secrecy H(S|X) = H(S): H(S) ≤ H(B) - H(E), i.e. we pay a price for the noise. Secrecy rate Cs = H(B) - H(E) secret bits per transmission.
