SecEDMO: Enabling Efficient Data Mining With Strong .

2y ago
48 Views
2 Downloads
3.10 MB
14 Pages
Last View : 2d ago
Last Download : 2m ago
Upload by : Gia Hauser
Transcription

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2019.2932065, IEEETransactions on Cloud Computing1SecEDMO: Enabling Efficient Data Mining withStrong Privacy Protection in Cloud ComputingJiahui Wu, Nankun Mu, Xinyu Lei, Junqing Le, Di Zhang, and Xiaofeng Liao Senior member, IEEEAbstract—Frequent itemsets mining and association rules mining are among the top used algorithms in the area of data mining. Secureoutsourcing of data mining tasks to the third-party cloud is an effective option for data owners. However, due to the untrust cloud and thedistrust between data owners, the traditional algorithms which only work over plaintext should be re-considered to take security and privacyconcerns into account. For example, each data owner may not be willing to disclose their own private data to others during the cooperativedata mining process. The previous solutions are either not sufficiently secure or not efficient. Therefore, we propose a Secure and EfficientData Mining Outsourcing (SecEDMO) scheme for secure outsourcing of frequent itemsets mining and association rules mining over the jointdatabase (i.e., database aggregated from multiple data owners) in the paradigm of cloud computing. Based on our customized lightweightsymmetric homomorphic encryption algorithm and a secure comparison algorithm, SecEDMO can ensure strong privacy protection and lowdata mining latency simultaneously. Moreover, the well-designed virtual transaction insertion algorithm can hide the information of the originaldatabase while still preserving the cloud’s ability to perform data mining over the obfuscated data. By evaluation of a numerical experimentand theoretical comparisons, the correctness, security, and efficiency of SecEDMO are confirmed.Index Terms—Data mining, frequent itemsets mining, association rules mining, privacy protection, cloud computing.F1Introduction1.1 MotivationsDiscovering the association relationships among the frequentitemsets from a given dataset is called association rules mining,which leverages the frequent itemsets mining process as a subroutine. Association rules mining (along with frequent itemsetsmining) is one of the most important algorithms in the field ofdata mining. For example, the classical association rules miningalgorithm (i.e., Apriori [1]) is identified as a top 3 algorithm indata mining [2]. It is widely utilized in many fields includingmarket supervision and management [3], web browsing preferenceprediction [4], intrusion detection [5], health-care services [6], tojust name a few. The traditional algorithms for frequent itemsetsmining and association rules mining mainly consider how to mineon plaintext domain, so they lack security and privacy concerns.However, the privacy-preserving property is strongly required inmany scenarios. One typical setting is that there are multiple dataowners wishing to learn association rules or frequent itemsetsfrom their joint data, which is outsourced and stored in the publiccloud. The security problem becomes a critical concern in thisservice model. First, the data owners may be unwilling to disclosetheir own data to other owners. For example, some data owners’ This work is supported by National Key Research and DevelopmentProgram of China (Grant no. 2016YF-B0800601), National Natural Science Foundation of China (Grant no. 61472331, 61772434, 61403121,61806169) and Fundamental Research Funds for the Central Universities(Grant no. XDJK2018D005). (Corresponding author: Liao Xiaofeng)J. Wu, N. Mu, J. Le, and D. Zhang are with the College of Electronic and Information Engineering, Southwest University, Chongqing400715, China, and Chongqing Key Laboratory of Nonlinear Circuits andIntelligent Information Processing. E-mail: wjh2015@email.swu.edu.cn;nankun.mu@qq.com; lejunqing@163.com; zhangdiii@163.comX. Lei is with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA. E-mail:leixinyu@msu.eduX. Liao is with the college of computer, Chongqing University, Chongqing400044, China. E-mail: xfliao@cqu.edu.cndata may have commercial value and they want to keep secret.Second, the public cloud cannot be fully trusted. These thirdparty clouds may have financial incentives to collect or infer theircustomer sensitive information. Moreover, these public cloudsmay be compromised and all of the stored information maybefurther leaked by hackers. Therefore, we focus on developing ascheme to enable privacy-preserving frequent itemsets mining andassociation rules mining with multiple data owners in the cloudcomputing model.1.2 Problem Formulation Frequent Itemsets Mining & Association Rules Mining. Frequent itemsets mining refers to finding frequent patterns (includingitemsets, sequences, substructures) that occur frequently in datasets. Support1 represents how frequently the pattern appears inthe data sets. With a pre-defined adjustable count threshold (i.e.,minimum support), the aim of the mining algorithm is to findall itemsets with the count/support greater than or equal to theminimum support. And the motivation of association rules miningis to find out the frequently occurred association relationshipsbetween frequent itemsets. Therefore, association rules miningalgorithm mainly consists of two steps: (1) find all frequentitemsets from the dataset (by using the frequent itemsets mining);(2) generate association rules from these frequent itemsets.More concretely, let I represent the collection of data items andD represent the database that consists of a set of transactions. Ingeneral, a rule has the form of X Y, where X, Y I. The ruleX Y with support s, i.e., (X Y).sp s, implies the occurrencecount that X and Y co-occur in the database. The rule X Y withconfidence c, i.e., (X Y).cf (X Y).sp/Y.sp c, impliesthat the probability of Y’s occurrence given X’s occurrence is c.We denote the minimum support and the minimum confidence asT s and T c , respectively. A rule X Y with support no less than1. To facilitate the subsequent operation in this paper, we denote the supportof a pattern as its count number.2168-7161 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2019.2932065, IEEETransactions on Cloud Computing21.4 Technical ChallengesThere are three technical challenges that SecEDMO shall dealwith. The first challenge is how to realize strong privacy-preserving,low data mining latency, and low communication complexitysimultaneously. It is challenging to strike a well-balanced tradeoff between them. The second challenge is how to design an effective algorithmto obfuscate some data for a large-sized database while stillpreserving the cloud’s ability to perform data mining over theobfuscated data. The third challenge is how to prevent privacy leakage fromencrypted databases and mining results in a strong threat model.It is desirable that (1) each data owner should not deduceother owners’ data; (2) the scheme should be able to deal withattacks from a corrupted owner and multiple collusive owners;In this paper, we design SecEDMO to address the abovechallenges. The comparison between SecEDMO and the previousschemes is illustrated in Figure 1, where the red area representsthe optimal solution in both security and performance (i.e., anideal solution). SecEDMO (in green area) is the closest schemeto the optimum compared with other schemes. It is shown thatour SecEDMO strikes a better trade-off between security andperformance than the prior art. Moreover, SecEDMO can beapplied in other secure data computing solution such as securedata aggregation (presented in Appendix B). In summary, thispaper makes the following contributions. We design a customized symmetric HE algorithm and a securecomparison algorithm to address the secure association rulesmining problem. The designed algorithms ensure strong privacypreserving and low data mining latency. We propose a virtual transaction insertion algorithm to obfuscatethe support of each item in the original database. The proposedalgorithm can hide the information of the original databasewhile still preserving the cloud’s ability to perform data miningover the obfuscated data. We design the scheme in a strong threat model. Various countermeasures (including secret-key dividing, ciphertext characterizing and mining share) are adopted to resist different kindsof attacks from the data owners, the cloud, and the outsideadversary.HighThe previous solutions for secure frequent itemsets mining orassociation rules mining suffer from three limitations as describedbelow. Some schemes [7]–[11] are built based on public-key homomorphic encryption (HE) algorithms (e.g., elliptic curve-basedElGamal, Paillier encryption). Since the current HE lacks efficiency, these schemes suffer from high running time. Some schemes perform privacy-preserving data mining overa number of distributed servers (e.g., [10], [12], [13]). Theseschemes have a large number of communication rounds. Some symmetric key encryption-based schemes (e.g., [14])can achieve low pruning time and low communication cost.However, they have a weak threat model, and therefore, theypossess poor anti-attack capabilities (as analyzed in [15]).Therefore, we believe that there is a critical need to develop asecure and effective association rules mining scheme on multiowner databases with high-efficiency in a strong threat model.1.5 Main ContributionsClassicSolutionsPerformance1.3 Limitation of Prior Art(3) the scheme should be able to resist attacks from an ], [11], [12]Schemes[6]–[9]LowT s and confidence no less than T c , is identified as an associationrule. Note that the thresholds T s and T c are configurable by dataowners. Privacy-Preserving Data Mining. We consider the privacypreserving frequent itemsets mining and association rules miningproblem by taking advantage of cloud computing. The cloud canachieve on-demand high-quality data mining and reduces storagecost at the same time. To achieve privacy-preserving data mining,each data owner needs to encrypt their data before outsourcing tothe untrusted cloud. The cloud should be able to perform miningover the encrypted data and then return the mining results (inencrypted form) to the data owners for decryption. Horizontally Partitioned Database. We consider horizontallypartitioned databases in this paper. The horizontally partitioneddatabase means that data outsourced by different owners are allin the same format. The other type of database is verticallypartitioned, which means that outsourced data of distinct dataowners are in different formats. For example, consider that the dataowners A and B have databases with items I {bread, milk, egg}and the data owner C has a database with items I {pearl, agate},then we have the database aggregated by A and B is horizontallypartitioned and the database aggregated by A and C is verticallypartitioned.HighLowSecurityFig. 1. Comparison between SecEDMO and the previousschemes in terms of performance and security.1.6 Paper OrganizationThe remainder of this paper is organized as follows. Section 2gives a formal description of our system model, threat model,and design goals. Section 3 introduces the preliminaries of therequested cryptographic knowledge. Section 4 presents the designsof HE, secure comparison, and virtual transaction insertion. InSection 5, the design of SecEDMO is elaborated. Section 6performs correctness and security analysis of SecEDMO and compares its security with several other schemes. Section 7 presentsperformance evaluation of SecEDMO. Related work is describedin Section 8, followed by conclusion in Section 9.2System Model, Threat Model, and Design goalsIn this section, we introduce the adopted system model, threatmodel, and design goals of SecEDMO.2168-7161 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2019.2932065, IEEETransactions on Cloud Computing32.1 System ModelFigure 2 shows the system model, which consists of threeentities: an administration server, multiple data owners, and acloud. To enable that distinct owners have distinct individualsecret keys, a trusted administration is introduced to distributekeys to each owner separately. The administration server respondsto a request of the concerted data owners to distribute secretkeys and parameters. After obtaining secret keys and parameters,each data owner separately encrypts his own database and thenoutsources the encrypted database to the cloud. The cloud connectsthe received horizontally partitioned databases from multiple dataowners and then mines frequent itemsets or association rulesaccording to the requests of data owners. The returned mining results consist of frequent itemsets, association rules, and thresholdcomparisons’ results. All of the returned results are ciphertexts inencrypted form. Upon receiving the returned ciphertexts from thecloud, the data owner decrypts them to obtain the correspondingmining results in ryptindividualdatabasesdecryptminingresultsHonest butCurious Cloudjoint databasesAlgorithm 1:data miningprocessingParametersand secret keysrawdatabases.Data owners’requirements2.3 Design Goals OutsideTypes of attack:AdversaryReason secret keysTamper ciphertextsCorrupt a data ownerCollude data owners and the cloudData Owners SideAdministrationServerthe corrupted data owners and the cloud. With the above threecapabilities, the adversaries can launch a variety of attacks (e.g.,attacks on reasoning secret keys, tampering ciphertexts, corruptinga data owner, and colluding with data owners or the cloud). Theseattacks are described and analyzed in detail in Section 6.mining resultsFig. 2. System model of privacy-preserving data mining for thejoint database.2.2 Threat ModelWe consider the possible threats of the system come from threeentities: the cloud, the data owners, and the outside adversary, asdepicted in Figure 2. Cloud. The cloud who performs data mining task over encryptedjoint databases is assumed to be honest-but-curious [8], [14],[16]. That is, the cloud follows the designed scheme honestly butattempts to deduce the private information of encrypted databases,frequent itemsets, and association rules. Such an assumption of untrusted cloud server better approaches the real-world applicationsbecause most of the cloud service providers are commercial thirdparties. We have also noted that there is an alternative securitymodel in which the cloud is assumed to be trusted [17]. Data Owners. We consider two types of data owners. Thefirst type of data owner is treated as a collaborative but curiousentity. They honestly follow the scheme but are curious aboutother owners’ private information. The second type of data owneris assumed to be corrupted. By corrupted data owners we meanthat the adversary may deviate from the scheme specificationsand attempt to deduce the other owners’ private information (e.g.,tamper the other owners’ ciphertexts). Outside Adversary. We consider the outside adversary who hasthe capabilities of (1) eavesdropping and tampering the ciphertextsin the communication channel between the data owners andthe cloud, (2) corrupting a data owner, and (3) colluding withSecEDMO should achieve the following design goals.Correctness. If both data owners and cloud follow the SecEDMO honestly, the data mining tasks can be indeed performedby the cloud, and the data owners can get the correct miningresults.Security. (1) Data owners’ databases should be confidential toone another. (2) The mining results should be secret to the cloud.(3) SecEDMO should resist attacks from an outside adversarywith various capabilities.Efficiency. SecEDMO should have low computational complexity, communication traffic, storage cost, and end-to-end latencyto guarantee the efficiency of secure association rules mining.Accommodate lightweight client. SecEDMO should keep as lessas workload on the client side (i.e., the data owner side) andaccommodate lightweight client.3 PreliminariesIn this section, we introduce the background knowledge ofhomomorphic encryption, virtual transactions, and mathematicalnotations. Homomorphic Encryption. Homomorphic encryption was firstproposed by Rivest et al., closely after the creation of RSA [18],[19]. HE satisfies the homomorphism property so that the algebraicoperations on the plaintext can be performed on the correspondingciphertext, which prevents the plaintext from being cracked. Ingeneral, an HE algorithm includes four algorithms: (GenKey, Enc,Dec, Eva) [20], which represent the key generation algorithm, theencryption algorithm, the decryption algorithm, and the evaluationalgorithm, respectively.We denote an operation on ciphertexts (Eva algorithm) as F,whereas a required operation on plaintext as f . Enc is homomorphic, if F(c1 , · · · , cn ) EncK ( f (m1 , · · · , mn )), where K is a setof secret keys generated by GenKey, and mi , ci (i 1, · · · , n) areplaintexts and ciphertexts respectively. Its underlying meaning is:f is expected to act on owners’ plaintexts but instead of operatingf , the corresponding ciphertexts are outsourced to an untrustedthird party and implement F(c1 , · · · , cn ). Then f (m1 , · · · , mn ) canbe obtained by decrypting F(c1 , · · · , cn ) – the third party worksfor the owners, but knows nothing about the operative plaintexts. Virtual Transactions. Virtual transactions, also known as fakeor fictitious transactions in [14], [21], are extra and nonexistenttransactions inserted into the original database. They are oftenused to obfuscate the privacy of the real database when thedatabase tends to be outsourced to the cloud. Conversely, withoutinsertion of virtual transactions, the original database can be easilyattacked by the cloud using the statistical attack although it isencrypted by a 1-1 substitution cipher. In general, after beinginserted with virtual transactions, the database adds a tag attributefor each transaction which records the real transaction as 1 andthe virtual transaction as 0. Table 1 gives an example of a databaseinserted with virtual transactions. The tag values of T 1 , T 3 are 1and T 2 , T 4 are 0, that is, T 1 , T 3 are real transactions and T 2 , T 4 arevirtual transactions.2168-7161 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCC.2019.2932065, IEEETransactions on Cloud Computing4TABLE 1. Database with inserted virtual transactions.TID bread milk apple banana TagT1T2T3T401000101101101111010 Mathematical Notations. Table 2 summarizes mathematicalnotations and their semantic meanings used throughout this paper. TABLE 2. The notations and their semantic meanings.NotationsMeaningsDBITiTIDX YX YX.sp(X Y).cflint , nvnnINT , nTλk, d, ε, Nr0 , rṙFpZ peiKNdGenKeyEnc, Dec

problem by taking advantage of cloud computing. The cloud can achieve on-demand high-quality data mining and reduces storage cost at the same time. To achieve privacy-preserving data mining, each data owner needs to encrypt their data before outsourcing to the untrusted cloud

Related Documents:

Preface to the First Edition xv 1 DATA-MINING CONCEPTS 1 1.1 Introduction 1 1.2 Data-Mining Roots 4 1.3 Data-Mining Process 6 1.4 Large Data Sets 9 1.5 Data Warehouses for Data Mining 14 1.6 Business Aspects of Data Mining: Why a Data-Mining Project Fails 17 1.7 Organization of This Book 21 1.8 Review Questions and Problems 23

DATA MINING What is data mining? [Fayyad 1996]: "Data mining is the application of specific algorithms for extracting patterns from data". [Han&Kamber 2006]: "data mining refers to extracting or mining knowledge from large amounts of data". [Zaki and Meira 2014]: "Data mining comprises the core algorithms that enable one to gain fundamental in

Data Mining and its Techniques, Classification of Data Mining Objective of MRD, MRDM approaches, Applications of MRDM Keywords Data Mining, Multi-Relational Data mining, Inductive logic programming, Selection graph, Tuple ID propagation 1. INTRODUCTION The main objective of the data mining techniques is to extract .

October 20, 2009 Data Mining: Concepts and Techniques 7 Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithm Other Disciplines Visualization October 20, 2009 Data Mining: Concepts and Techniques 8 Why Not Traditional Data Analysis? Tremendous amount of data

enable mining to leave behind only clean water, rehabilitated landscapes, and healthy ecosystems. Its objective is to improve the mining sector's environmental performance, promote innovation in mining, and position Canada's mining sector as the global leader in green mining technologies and practices. Source: Green Mining Initiative (2013).

Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning Popularity predates popularity of machine learning "Data mining" often associated with specific data types and patterns We will focus on "market-basket" data Widely applicable (despite the name) And two types of data mining patterns

Distributed Data Mining: mining data that is located in various different locations Uses a combination of localized data analysis with a global data model Hypertext/Hypermedia Data Mining: mining data which includes text, hype

down your commitment to practice jazz piano, tell it to others, and schedule in specific practice times. MONTH ONE: Jazz Piano 101 A. Chord types (Play each in all keys) 2 B. Quick Fix Voicing C. ETUDE: (Quick fix voicings with inversions for better voice leading) ALL MUSICAL EXAMPLES TAKEN FROM “JAZZ PIANO HANDBOOK” (ALFRED PUBLISHING) AND USED WITH PERMISSION MONTH TWO: Position .