Parallel Implementation Of Fuzzy Clustering Algorithm .

2y ago
32 Views
2 Downloads
543.61 KB
5 Pages
Last View : 8d ago
Last Download : 2m ago
Upload by : Esmeralda Toy
Transcription

Jerril Mathson Mathew et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4740-4744Parallel Implementation of Fuzzy ClusteringAlgorithm Based on MapReduce Computing Modelof Hadoop –A Detailed SurveyJerril Mathson MathewLekshmy P ChandranM.Tech StudentCollege of Engineering KidangoorKerala, IndiaAssistant ProfessorCollege of Engineering KidangoorKerala, IndiaAbstract -Clustering is regarded as one of the significant taskin data mining which deals with primarily grouping of similardata. To cluster large data is a point of apprehension. Hadoop isa software framework which deals with distributed processing ofhuge amount of data across clusters of distributed computersusing MapReduce programming model. MapReduce allows akind of parallelization to solve a problem that involves largedatasets using computing clusters and is also a strikingimplication for data clustering involving large datasets. Mahout,a scalable machine learning library is an approach to Fuzzyclustering which runs on Hadoop. This paper focuses onstudying the performance of using Fuzzy K-mean clustering inMapReduce on Hadoop.Keywords- Fuzzy C Means Clustering (FCM), MapReduce,Hadoop, HDFS, Mahout, Parallel Computing.the so-called data clustering is a process which divides thecollection of physical or abstract objects into multiple classesor clusters. A cluster is a compilation of data objects. Dataobjects in the same cluster (group) are as similar as possible.However, data objects from different clusters are as differentas possible. By clustering, one can identify dense and sparseareas and find an interesting correspondence between theoverall allocation pattern and data attributes. The k-meansalgorithm belongs to a basic division technique of clusteringanalysis.In the appearance of gigantic data, existing clusteringalgorithms have encountered the blockage in time and spacecomplexity. This is one of the questions needed to be solvedurgently in the field of clustering algorithms. A proposal tounravel this problem is to apply the parallel processingtechnology to data clustering, design proficient parallelclustering algorithms, and progress the performance of dataclustering algorithms handling massive amounts of data.Cloud computing has got widespread attention as anpromising business model. [5] Hadoop is a cloud computingplatform which can more easily develop and process largedata in parallel. [6] Its main features include strongdevelopment capacity, low expenditure, high effectivenessand good trustworthiness. Hadoop platform consists of twoparts: Hadoop Distributed File System (HDFS) andMapReduce computing model. [7] On the basis of cloudcomputing platform Hadoop, the paper depicts a parallel kmeans clustering algorithm based on MapReduce computingmodel.The K-mean algorithm faces a problem of giving a hardpartitioning of the data which means that each point isdedicated to one and only one cluster. The data points on theedge of the cluster as well as lying near another cluster maynot be as much in the cluster as the points in the center ofcluster. Hence, Fuzzy K-mean clustering [1] (also known asFuzzy C-means clustering) given by Bezdek introduced thateach point has a probability of belonging to a certain cluster.A coefficient value associated with every point gives thedegree of being in the kth cluster and coefficient values shouldsum to one. Nowadays larger datasets are considered forclustering which do not even fit into main memory.I.INTRODUCTIONWith mature database technologies and universal dataapplications, business enterprises, research institutions andgovernment departments have agglomerated a large amountof data stored in different forms. How to hoard and handlethese enormous collections of data, as well as further dig outuseful knowledge which can lead the applications has becomea problematic issue.Data mining is also known as knowledge discovery indatabases. Data mining extracts mysterious probable valuableinformation or pattern from large, incomplete, noisy, blur,random data. With the hasty development of computertechnology and the popularity of the network, people havemore opportunities to use suitable way to barter informationwith the outside world. However, the influx of large amountsof data increases the difficulty of obtaining usefulinformation. How to obtain valuable information from largeamounts of data brings problems of implementing datamining structure. Due to the high complication of processingthese data, the computing power of the system is difficult tomeet the requirements. At this point, the limited computingresources which traditional stand-alone server can offerregularly cannot meet the desires. There need distributedcomputing technology to achieve large-scale parallelcomputing.Data clustering is a key research area in the field of datamining. Data clustering analyses the data and finds usefulinformation. Based on the philosophy of “Like attracts like”,www.ijcsit.com4740

Jerril Mathson Mathew et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4740-4744Apache Hadoop[2,4] was born to solve the problemspertaining to large datasets. With the help of MapReduce,Hadoop fires a query on the large datasets, divide it and thenruns it in parallel on multiple nodes. Mahout[3] is a scalablemachine learning library which is built on Hadoop and isusually written in Java.The paper later explains about the literature survey in theSection II. The Section III gives the descriptions of themethodology. Finally we conclude with Section IV.II.LITERATURE SURVEYData set X {x1, x2, , xi, , xn} contains n ddimensional data points. Each data point belongs to ddimensional data space, which is defined as xi Rd. Thevariable k is the number of data subsets to be generated. Thek-means clustering algorithm arranges the data objects into kdivisions, which is defined as C {ck, k 1, 2, ,K}. Eachdivision represents a cluster ck. Each cluster has a clustercenter μk. Select the Euclidean distance as the similaritycriterion, and calculate the square of the distance between thepoints in the cluster and the cluster center. The square of thedistance is defined as:‖‖The goal of clustering makes the minimum of thetotal distance of all clusters. The entire distance of all clustersis defined as:‖‖‖‖1, 0, Apparently according to the least squares methodand the Lagrange principle, the cluster center μk is the averageof data points in the cluster ck. The k-means clusteringalgorithm begins from initial k categories, and assigns eachdata point to each category in order to diminish the total sumof squared distances. With the increase of the number of thecategories, the total sum of squared distances tends todecrease. When k is equal to n, J(C) is 0. Therefore, the totalsum of squared distances obtains the least only under thedetermined number k.Paper [9] presented a clustering algorithm based on MPI(Message Passing Interface). Because MPI uses the way ofinter-process communication to harmonize parallelcomputing, this leads to lower parallel efficiency, memoryoverhead. Paper [10] presented the parallelization of a kmeans algorithm based on PVM (Parallel Virtual Machine).However the algorithm is limited by the system and lackswww.ijcsit.comflexibility. Paper [11] used multi-core CPU platform toimprove the clustering speed. Paper [12] presented a fastclustering algorithm based on GPU (Graphics ProcessingUnit). Paper [13] presented a capable parallel clusteringalgorithm in a top-performance cluster environment. Thesesolutions which paper [11-13] presented are based onexpensive high-performance hardware. These solutionscannot make large-scale promotion. Paper [14] presented aclustering algorithm by means of data and task parallelism.The disadvantage is that communication overhead betweennodes is high.III.METHODOLOGYClustering, an unsupervised learning techniquegroups the samples having similarity in different classes. Thegroups or classes are referred to as clusters. Samples within aclass are of high similarity as compared to the samples ofother classes. It is learning by observation process which ismainly used in the areas like data mining, machine learningand statistics. Fuzzy K-mean clustering is based on centroidbased clustering technique.A. Hadoop PlatformBy Hadoop, an open source framework implementing theMapReduce programming model includes two componentsnamely the Hadoop Distributed File System (HDFS)[4] andMapReduce. HDFS is used for storage of large dataset andMapReduce is used for processing the datasets. In HDFS, thefile is split in contiguous chunks each of size 64MB (defaultblock size)[4] and each of these chunk is replicated indifferent racks. The NameNode in HDFS stores the metadataand the DataNodes stores the blocks from files. Associatedwith the NameNode and the DataNode is the daemon knownas the JobTracker and the TaskTracker respectively. It is theduty of the JobTracker to assign the jobs to the TaskTrackerwhich then processes each of the jobs assigned to it using theMapReduce model. Hadoop, a distributed file system iswritten in Java.Fig.1. Architecture diagram of HDFS4741

Jerril Mathson Mathew et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4740-4744B. MapReduce TechniqueThere are mainly two programs in MapReduce[8], one isthe Map and another is Reduce. Dataset is split according toblock size of Hadoop. Map( ) function is associated with eachblock and considers the input pair in the form of a key andvalue and then processes the input pair thereby generating anintermediate set of key, value pairs. The function ofReduce( ) aggregates the intermediate results and generatesthe final output. Like HDFS, MapReduce of Hadoop alsoadopts Master / Slave architecture, particularly as shown inFig 2.distributes the many reduce tasks across the cluster of nodesand deals with distributing the appropriate fragment ofintermediate data to each reduce task.Tasks in each phase are put to death in a faulttolerant manner. If node(s) fail in the middle of a computationthe tasks allocated to them are re-distributed among theoutstanding nodes. Having many map and reduce tasksenables good load balancing and permits failed tasks to be rerun with small runtime overhead. Fig. 3 shows MapReducecomputing model.Fig.3. MapReduce Computing ModelFig.2. Architecture diagram of MapReduce of HadoopMapReduce is a programming paradigm thatexpresses a bulky distributed computation as a sequence ofdistributed operations on data sets of key/value pairs. TheMapReduce framework of Hadoop harnesses a cluster ofmachines and executes user defined MapReduce jobs athwartthe nodes in the cluster. MapReduce is composed byJobTrackers and TaskTrackers. A MapReduce working outhas two phases: a map phase and a reduce phase. The input tothe computation is a data set of key/value pairs.In the map phase, the architecture splits the inputdata set into a large number of fragments and allocates eachfragment to a map task. The framework also distributes themany map tasks across the cluster of nodes on which itoperates. Each map task guzzles key/value pairs from itsassigned fragment and produces a set of intermediatekey/value pairs. For each input key/value pair (k, v), the maptask invokes a user defined map function that transmutes theinput into a diverse key/value pair (k', v').Following the map phase the skeleton arranges theintermediate data set by key and produces a set of (k', v')tuples so that all the values coupled with a specifickey appeartogether. It also partitions the set of tuples into a number offragments equivalent to the number of reduce tasks. In thereduce phase, each reduce task consumes the fragment of (k',v') tuples assigned to it. For each such tuple it calls a userdefined reduce function that transmutes the tuple into anoutput key/value pair (k, v). Once again, the skeletonwww.ijcsit.comC. MahoutApache Mahout [3] is a scalable machine learning librarywhich involves clustering, classification and collaborativefiltering employed on top of Apache Hadoop using theMapReduce paradigm and is also written in Java. There arevarious subcategories of machine learning such asunsupervised learning, supervised learning, semi-supervisedlearning, learning to learn and reinforcement learning.D. Fuzzy K-Mean ClusteringThe fuzzy K-mean clustering, also known as softclustering is an extension of K-mean clustering. It minimizesthe intra-cluster variance. Bezdek introduced the concept offuzziness parameter (m) in Fuzzy K-mean clustering whichdetermines the degree of fuzziness in the clusters [1]. Thealgorithm of standard Fuzzy K-mean clustering algorithm isas follows:1. Choose a number of clusters2. Create distance matrix from a point xj to each of thecluster centers considering the Euclidean distancebetween the point and the cluster center using theformula: (1)where,dij Euclidean distance between the jth data point and the ithcluster center3. The membership matrix is created using:(2) where,is the membership of xj in the ith clusterm fuzziness parameter4742

Jerril Mathson Mathew et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4740-4744p number of specified clustersdkj distance of xj in cluster CkFor a point in a sample, the total membership mustsum to 1.The value of m is kept generally greater than 1because if it is kept equal to 1, then it resembles K-meanclustering algorithm.4.The new centroid for each cluster is generated as: (3)Stopping criteria: - The algorithm continues untilany centers of the clusters do not change beyond theconvergence threshold and neither the points change in theassigned cluster.The limitation of this iterative algorithm is thatnumber of iterations is increased for forming overlappingclusters thereby increasing the execution time and if largedataset is used then it becomes difficult to handle in mainmemory. Hence to overcome this problem, MapReduceapproach is used.MapReduce ApproachMapReduce approach partitions the large datasetsand then computes on the partitioned dataset (known as jobs)in a parallel manner where the individual jobs are processedby the maps and then the sorted output from the maps areprocessed by the reduce.Input: Data points, randomly selected centroid points, numberof clusters.Output: Final centroids and their clustered points.Algorithm of Map:1. The randomly selected centroid point isconsidered as key and vector points as values.2. Calculate the Euclidean distance betweencentroid point and the vector point using (1)3. Compute the membership value of each vectorpoint and create the membership matrix using(2)4. Clusters are generated using nearest centroidand the data points assigned to that particularcluster5. Maintains a cache holding the detail aboutwhich vector point is in which cluster.Algorithm of Reduce:1. Recalculates the centroid for each clusterThe recalculated centroid would go serially to Mapand after that as it iterates, the work would be done in paralleluntil the centroid converged as depicted in figure 1. Total no.of reducers are less than total no. of mappers(M N).www.ijcsit.comFig.4. Fuzzy K-Mean Clustering Algorithm in MapReduceIV.CONCLUSIONThis paper conducts in-depth research on the parallel fuzzyalgorithm based on MapReduce computing platform ofHadoop. First, briefly describe the basic components ofHadoop platform including structural relationships of HDFSframework and the workflow of all stages of MapReduce.Then, consider the main issues, the main processes in thedesign of the parallel fuzzy k-means algorithm based onHadoop.With the rise of cloud computing concepts, the research ondata mining and clustering algorithms based on cloudcomputing platform gradually becomes a hot topic ofscholars. Future research directions include the following:Study general law of the parallelization of clusteringalgorithms. Find the relation between data scale and thenumber of nodes. Find influencing factors of speedup andscalability to design highly efficient parallel clusteringalgorithm. Study information security and privacy issuesbased on data mining applications in the cloud computingplatform. Solving the problem will play a key role in cloudcomputing applications in the actual business.ACKNOWLEDGMENTI would like to extend my sincere gratitude to my guide Ms.Lekshmy P Chandran, Assistant Professor, InformationTechnology Department, CEK and Ms. Anitha R, HOD,Computer Science Department, CEK for their valuablesuggestions and guidance which helped to improve thepaper’s quality.[1][2][3][4][5][6]REFERENCESBezdek, James C, "FCM : THE FUZZY c-MEANS CLUSTERINGALGORITHM", vol. 10, pp191-203, 1984Hadoop official site, he.org/Shvachko, K.; Hairong Kuang; Radia, S.; Chansler, R., "The HadoopDistributed File System," Mass Storage Systems and Technologies(MSST), 2010 IEEE 26th Symposium on , vol., no., pp.1,10, 3-7 May2010Armbrust M, Fox A.(2009) Above the clouds: a Berkeley view of cloudcomputing. University of California at Berkeley.Tom White.(2012) Hadoop: The Definitive Guide Third Edition.O’Reilly Media.4743

Jerril Mathson Mathew et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4740-4744[7][8][9][10][11][12][13][14][15]Dean J, Ghemawat S.(2008) MapReduce: simplified data processing onlarge clusters. Communications of the ACM, 51(1), p.p.07-113.Ghemawat, H. Gobioff, S. Leung. “The Google file system,” In Proc.ofACM Symposium on Operating Systems Principles, Lake George, NY ,pp 29–43, Oct 2003Zhao Zongtang, Sun Shenli, Fan Ji (2005) Parallel clustering algorithmbased on MPI. Journal of Zhengzhou Institute of Aeronautical IndustryManagement, 24(3), p.p.160-171.Mao Jiali (2003) K-means Algorithm and Parallelization. Chongqing:College of Computer Science and Engineering Chongqing University.Li Jingbin, Yang Liu, Hua Bei (2011) Research on parallel K-Medoidsalgorithm based on multi-core platform. Application Research ofComputers, 28(2), p.p.498-505.Cao Feng, Zhou Aoying (2007) Fast Clustering of Data Streams UsingGraphics Processors. Journal of Software, 18(2), p.p.291-302.Zhou Bing, Feng Zhonghui, Wang Hexing (2007) The Study of ParallelClustering Algorithm for Cluster System. Computer Science,34(10),p.p.4-16.Boutsinas B, Gnardellis T.(2002) On distributing the clustering process.Pattern Recognition Letters, 23(8), p.p. 999-1008.Wegener D, Mock M, Adranale D.(2009) Toolkit-based highperformance data mining of large data on MapReduce clusters. ional Conference on Data Mining, Washington: IEEE, p.p.296301.Chen Hui-ping, Lin Li-li, Wang Jian-dong, Miao Xinrui (2008) Datamining platform WEKA and secondary development on WEKA.Computer Engineering and Applications, 44(19), p.p.76-79.X. Zhang, H. Li, and C. Qi, “Spatially constrained fuzzy-clusteringbased sensor placement for spatiotemporal fuzzy-control system,” IEEETrans. Fuzzy Syst., vol. 18, no. 5, pp. 946–957, Oct. 2010.L. F. S. Coletta, L. Vendramin, E. R. Hruschka, R. J. G. B. Campello,and W. Pedrycz, “Collaborative fuzzy clustering algorithms: Somerefinements and design guidelines,” IEEE Trans. Fuzzy Syst., vol. 20,no. 3, pp. 444– 462, Jun. 2012.E. Hullermeier, M. Rifqi, S. Henzgen, and R. Senge, “Comparing fuzzypartitions: A generalization of the rand index and related measures,”IEEETrans. Fuzzy Syst., vol. 20, no. 3, pp. 546–556, Jun. 2012.J. P. Mei and L. H. Chen, “A fuzzy approach for multitype relationaldata clustering,” IEEE Trans. Fuzzy Syst., vol. 20, no. 2, pp. 358–371,Apr. 2012.H. C. Huang, Y. Y. Chuang, and C. S. Chen, “Multiple kernel fuzzyclustering,”IEEE Trans. Fuzzy Syst., vol. 20, no. 1, pp. 120–134, Feb.2012.4744

cluster. Hence, Fuzzy K-mean clustering [1] (also known as Fuzzy C-means clustering) given by Bezdek introduced that each point has a probability of belonging to a certain cluster. A coefficient value associated with every point gives the degree of being in the kth cluster and coefficient values should sum to one.

Related Documents:

with ellipsoidal shape. Then, a fuzzy clustering algorithm for relational data is described (Davé and Sen,2002) Fuzzy k-means algorithm The most known and used fuzzy clustering algorithm is the fuzzy k-means (FkM) (Bezdek,1981). The FkM algorithm aims at discovering the best fuzzy

ing fuzzy sets, fuzzy logic, and fuzzy inference. Fuzzy rules play a key role in representing expert control/modeling knowledge and experience and in linking the input variables of fuzzy controllers/models to output variable (or variables). Two major types of fuzzy rules exist, namely, Mamdani fuzzy rules and Takagi-Sugeno (TS, for short) fuzzy .

fuzzy controller that uses an adaptive neuro-fuzzy inference system. Fuzzy Inference system (FIS) is a popular computing framework and is based on the concept of fuzzy set theories, fuzzy if and then rules, and fuzzy reasoning. 1.2 LITERATURE REVIEW: Implementation of fuzzy logic technology for the development of sophisticated

methods based on the fuzzy relation. These can be found in (10,12-161. Fuzzy clustering, based on fuzzy relations, was first proposed by Tamura et al. [12]. They proposed an N-step procedure by using the composition of fuzzy relations beginning with a reflexive and symmetric fuzzy relation R in X. They showed that there is an n such that

Different types of fuzzy sets [17] are defined in order to clear the vagueness of the existing problems. D.Dubois and H.Prade has defined fuzzy number as a fuzzy subset of real line [8]. In literature, many type of fuzzy numbers like triangular fuzzy number, trapezoidal fuzzy number, pentagonal fuzzy number,

Caiado, J., Maharaj, E. A., and D’Urso, P. (2015) Time series clustering. In: Handbook of cluster analysis. Chapman and Hall/CRC. Andrés M. Alonso Time series clustering. Introduction Time series clustering by features Model based time series clustering Time series clustering by dependence Introduction to clustering

the data set. In graph-theoretic fuzzy clustering, the graph representing the data structure is a fuzzy graph and di erent notions of connectivity lead to di erent types of clusters. The idea of fuzzy graphs is rst mentioned in [10] whereby the fuzzy analogues of several basic graph-theoretic concepts

named Fuzzy DBSCAN subsumes the previous ones, thus allowing to generate clusters with both fuzzy cores and fuzzy overlapping borders. Our proposals are compared w.r.t. state of the art fuzzy clustering methods over real world datasets. 1 Introduction The advent of the big data era has