Principles Of Green Data Mining

Proceedings of the 52nd Hawaii International Conference on System Sciences 2019

Johannes Schneider, University of [email protected]
Basalla, University of [email protected]
Stefan Seidel, University of [email protected]

URI: 978-0-9981331-2-6 (CC BY-NC-ND 4.0)

This paper develops a set of principles for green data mining, related to the key stages of business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The principles are grounded in a review of the Cross Industry Standard Process for Data mining (CRISP-DM) model and relevant literature on data mining methods and Green IT. We describe how data scientists can contribute to designing environmentally friendly data mining processes, for instance, by using green energy, choosing between make-or-buy, exploiting approaches to data reduction based on business understanding or pure statistics, or choosing energy friendly models.

1. Introduction

The use of computing power, coupled with the unprecedented availability of data, provides ample opportunity to improve energy efficiency [18]. However, computing and data are also an increasingly relevant source of energy consumption and associated carbon emissions. Data centers consumed about 70 billion kWh in 2016 in the United States alone [50], and the total consumption of all IT is estimated to be close to 5% of total energy consumption [18]. In response to this increasing amount of energy used by IT, Greenpeace published the “Guide to Building the Green Internet” [10], promoting “a more widespread adaption in best practices” for energy efficient data center design. They demand that “data center operators and customers should regularly report their energy performance and establish transparent energy savings targets.” Electricity consumption is costly: it involves various detrimental effects on nature and society, ranging from bird deaths caused by wind turbines to severe air pollution and CO2 emissions from coal power plants and the risk of catastrophes stemming from nuclear power plants.

These concerns are partially addressed by current initiatives under notions such as green information systems (Green IS) or green information technology (Green IT) [34, 57], but environmentally friendly data mining is a novel topic.

Data scientists often leverage a large pool of computational resources using sophisticated and computationally costly machine learning techniques to extract knowledge and insights from data. Though existing processes such as the Cross Industry Standard Process for Data mining (CRISP-DM) [61] provide some guidance on how to execute a data mining project, the skills of a data scientist heavily rely on creativity [53], involving many degrees of freedom, often including the choice of tools, models, and data sources.

It is against this background that, in this paper, we develop guidelines for data scientists to implement more environmentally friendly practices that can complement technology-focused perspectives aiming to design more energy efficient IT-based systems. Specifically, we focus on one important area of data science: data mining. Data mining can be described as knowledge discovery from data [23] or in terms of different activities such as collecting, cleaning, processing, analyzing and gaining useful insights from data [2]. We ask: How can data scientists implement more environmentally friendly data mining processes?

The remainder of this paper is structured as follows. We first describe our methodology.
We then review the data mining process and develop a set of principles for green data mining. We conclude by discussing limitations and future work.

2. Methodology

We derived our principles by analyzing the CRISP-DM data mining process and literature on green IT and data mining. In a first step, we identified factors determining energy consumption. In a second step, we identified individual steps of the CRISP-DM process by investigating possibilities for reduction of each factor. We limited our analysis to those aspects that can be directly influenced by data scientists, including the choice of data, its representation, as well as processes and techniques used throughout the data

analysis process. We do not target the development of novel data mining algorithms for specific problems or improving hardware or software, though some of our insights might be helpful in guiding such developments.

We conducted a narrative literature review [25] on green IT, green IS, and data mining because our goal was to investigate elementary factors and research outcomes related to these areas of research. Green data science [59] is a novel field and, therefore, is more amenable to a qualitative approach such as a narrative literature review than to a more quantitative approach detailing the current state of research, as done for a descriptive review. Our focus was on using established online databases from computer science as well as information systems such as IEEE Xplore, the [...] electronic library, and the ACM digital library. We did not limit ourselves to journals since new ideas are often presented first at academic conferences and a significant body of work, in particular in the field of computer science, appears only as conference articles.

3. The data mining process

There are multiple data mining processes [27], most of which share common phases. CRISP-DM [61] is arguably the most widely known and practiced model [41], attending to business and data understanding, data preparation, modelling, evaluation and deployment (Figure 1). The business understanding phase clarifies project objectives and business requirements, which are then translated into a data mining problem. There are unsupervised data mining problems including association pattern mining and clustering as well as supervised approaches like classification [2, 23]. Data understanding typically requires initial data selection or collection. Data is first analyzed in an exploratory fashion to get a basic understanding of the data in the business context. Exploratory analysis supports the development of hypotheses by identifying patterns in the data [3]. It yields first insights and helps identify data quality problems.
Data preparation includes using raw data to derive data that can be fed into the models. Activities include data selection, transformation, and cleaning. The data might have to be prepared separately for each model. The modelling phase consists of defining suitable models, selecting a model, and adapting the model, for instance, optimizing its parameters to solve the data mining problem. Computational evaluation of the model is part of the model selection process. Every data mining problem can be tackled using different strategies and models. Generally, there is no clear consensus about which model is best for a task. Consequently, some form of trial and error often cannot be avoided. This is supported by the “no free lunch” theorem stating that any algorithm outperforms any other algorithm on some datasets [63] as well as by empirical studies [9, 26]. The choice of models depends on many factors such as data (dimensionality, number of observations, structuredness), data mining objectives (need for best possible expected outcome, need to explain results), and cost (focus on minimum human effort to build or operate). From the perspective of green data mining, performance is assessed in terms of energy consumption for model training and model use, for instance, for making predictions. For the evaluation phase the main goal is to review all steps involved in the construction of the model, and to verify whether the final model meets the defined business objectives. If the best model meets the evaluation criteria, then it is deployed. Deployment ranges from producing a report presenting the findings in an easy-to-comprehend manner to implementing a long running system. Such a system might learn continuously while often performing a prediction task.

4. Principles of green data mining

Grounded in concepts and ideas from the literature on Green IT as well as data mining and its processes, we identified factors determining the ecological footprint of data mining and we developed principles for reducing this footprint (Table 1, Figure 1).

Table 1: Factors and methods related to green data mining

Factor | Subfactors | Methods for Green Data Mining
Project Objectives and Execution | Performance specification; Make, buy, share |
Data | Quantity; Quality; Representation; Data acquisition method; Data storage | Sampling, Active Learning, Dimensionality Reduction, Compression, Change of Data Representation, Data [...]
[...] | [...]tation; Choice/Training of models; Training of models | Reuse of intermediate results; Approximate Models/Algorithms
IT Infrastructure | Hardware, e.g., CPU, Storage | Transfer Learning

Green IT discusses institutional perspectives [39], the role of users, including their behavior and beliefs when using IT-based systems [38], as well as technical concerns [1,19,24]. Topics include computational methods [1], their implementation in software [8,21], hardware components of computers [24,44], datacenters [39], cloud computing [18,33], parallel data processing (for big data) [19,22,40], as well as organizational and business aspects such as sustainable value chains, green oriented procurement [7], and adoption of Green IT [28]. Loeser et al. [30] discussed constructs and practices from Green IT (and IS) with respect to sourcing, operations, disposal, governance and end products.

Current literature on data mining [2,38,59], in particular data mining processes [27], does not explicitly discuss environmental concerns of data mining but touches upon aspects related to computational efficiency and storage such as data reduction and approximate algorithms.

Next, we describe principles of green data mining related to the different steps of the CRISP-DM process. We first elaborate on those principles that pertain to all stages of the process (principles 1-3 in Figure 1), before we turn to those which only address specific stages (principles 4-8).

Principle #1: Identify and focus on the most energy consuming phases

To maximize the outcome of time invested into making data mining more environmentally friendly, the focus should be on the most energy consuming factors. This analysis can be performed by investigating the factors listed in Table 1 and analyzing each process step shown in Figure 1.
Which process steps and factors dominate energy consumption depends on the goals and particularities of the data mining endeavor. Project objectives such as predictive accuracy or required confidence in the analysis are very likely to have a profound impact on energy consumption, since they often indirectly influence the choice of computational methods and data. For example, recent “deep learning” [20] methods have outperformed other machine learning approaches for multiple classification tasks. A data scientist might turn to deep learning to meet certain project objectives, because it achieves state-of-the-art performance with respect to accuracy but, at the same time, requires lots of data and computation. Data preparation often requires only simple techniques, but it might dominate energy consumption if complex, computationally expensive methods are needed to extract features from the data that are used in later phases of the process. Deployment might be the dominating step if a system is built for continuous usage with large amounts of data. Still, deployment might contribute very little to the overall energy consumption compared to model selection, if the goal of the data mining project is to derive a report supporting a one-time decision.

Figure 1: CRISP-DM with “green” design principles

Principle #2: Share and re-use data, models, frameworks and skills

A data scientist might control make-or-buy decisions. For example, for marketing purposes, she might choose to acquire data from social media

channels such as Twitter or Facebook and conduct the analysis by herself. She might also acquire models (implemented in software) to conduct the analysis. She might also decide to consult an external company to conduct the analysis or to obtain models. From an environmental perspective, outsourcing can be preferable if the contractor is more energy-efficient in extracting the demanded information, for instance, because of their prior experience and specialization, more energy efficient infrastructure, or even possession of relevant data. On a global scale, outsourcing of data analysis has the potential to involve less computation and to save energy.

Progress in the field of data science also relies on publicly available data, models, and development frameworks. Initiatives to make data available by research institutions [12] and by governments help create entire ecosystems [11]. State-of-the-art tools to develop (deep learning) models such as Google’s TensorFlow are made freely available by large corporations. For such frameworks there are also numerous pre-trained models freely available, e.g., for image recognition based on the ImageNet dataset [12]. Transfer learning is a technique that enables using knowledge from existing models trained for a specific task and dataset on different tasks [42, 31]. The idea is that some “knowledge” of a model can be transferred to another domain. Deep learning networks might benefit from reusing parameters or layers of an already trained network [4, 64] to reduce time (and energy consumption) on developing a new model. Thus, a green data scientist should also contribute data, models, and potentially extensions to frameworks to encourage re-use.

Principle #3: Use green energy

The use of renewable (“green”) energy such as solar or wind should be maximized. Conceptually, the idea is to align computation with the availability of green energy.
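One way to picture this alignment is a small scheduling sketch. It assumes an hourly forecast of the green-energy share and a deadline by which a batch job must finish; both inputs, the function name `best_start_hour`, and the sample numbers are hypothetical and only serve as an illustration, not as the method of any cited system.

```python
from typing import Sequence


def best_start_hour(green_share: Sequence[float], duration: int, deadline: int) -> int:
    """Return the start hour that maximizes the summed green-energy share.

    green_share[h] is an assumed forecast of the fraction of green energy
    available in hour h. The job runs for `duration` consecutive hours and
    must be finished before hour `deadline` (exclusive).
    """
    if duration <= 0 or deadline > len(green_share) or duration > deadline:
        raise ValueError("job does not fit into the forecast window")
    latest_start = deadline - duration
    # Score every feasible start hour by the green share it would consume.
    scores = [sum(green_share[s:s + duration]) for s in range(latest_start + 1)]
    # max() returns the earliest hour among equally good candidates.
    return max(range(latest_start + 1), key=scores.__getitem__)


# Hypothetical forecast for eight hours (e.g., solar output peaking mid-window).
forecast = [0.1, 0.2, 0.6, 0.9, 0.8, 0.3, 0.1, 0.1]
print(best_start_hour(forecast, duration=2, deadline=8))
print(best_start_hour(forecast, duration=2, deadline=4))
```

With the later deadline the job can be moved into the hours with the highest forecast share; with the tighter deadline it has to settle for an earlier, less green window.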
Technical realizations for data processing tasks on distributed data processing platforms (e.g., Hadoop) have been investigated [19]. A system must predict the availability of green energy as well as brown energy and derive a schedule to maximize green energy use and to avoid using brown power at peak demand times. This strategy might also have a positive impact on energy costs as these increase with demand. The data scientist should identify the maximum possible slack in executing data processing tasks based on business objectives. More flexible scheduling allows for using more green energy.

4.1 Business understanding

The business understanding phase typically does not involve computation and as such generally does not contribute directly to the energy consumption. Still, understanding the business requirements and trends in the industry sector helps anticipate factors that influence energy consumption of later process steps, such as “What data are relevant and should be collected?” or “What precision of numbers is needed (over time)?” or “How frequently is a deployed system used?” or “How does the value of data change over time?”

Principle #4: Understand value, then collect and forget

Following the idea that “Data is the new oil” (a statement coined by Clive Humby in 2006), it seems natural to collect as much data as possible, in particular given that storage is cheap and data might generate value “eventually.” It is not uncommon that data can be obtained almost for free, for instance, in the form of trace data generated by users visiting a webpage. But more data increases costs (due to storage and processing), requires more energy, impacts system performance and complexity and, additionally, increases the risk of information overload. Query times to a database, for instance, increase with the amount of data stored in the database. The idea of collecting data only for the sake of collection has been criticized: “less data can be more value” [6].
The data scientist should thus try to determine what data is relevant for the business or task at hand [65]. Moreover, the quality of the data should be taken into consideration because data of inferior quality might require non-negligible effort for data cleaning [23].

Not all data has the same value. Even when data consists of a set of observations of the same kind, certain observations might be more valuable than others. For example, for observations that should be split into classes, “difficult” to classify observations are often more helpful in training data mining models than “easy” to classify observations [56]. Though computational methods can often determine the relevance of data with respect to well-defined metrics, a holistic understanding of the business, its objectives, data, and analytical methodology is essential to limit the collection of data. Leading data analytics companies such as Google embrace the idea of computing on more “little” data, that is, samples [6]. This reasoning is well-founded not only based on statistical models, but also because models benefit from training data in a highly non-linear fashion with decreasing marginal gains given more data [16]. Therefore, in some scenarios, reducing the volume of data might be feasible with considerable impact on energy consumption but only minor changes for other

relevant metrics. Since each model comes with its own strengths and weaknesses related to interpretability, robustness, speed of learning, etc., the overall assessment of advantages and disadvantages must be carefully conducted and aligned with underlying business objectives.

4.2 Data understanding

Principle #5: Reduce data

The data scientist might face the choice of what data to collect (or store). This choice must be made with great foresight in order not to miss any opportunity for data-driven value creation. Business understanding as well as an in-depth understanding of the data are necessary. However, there are also multiple helpful techniques based on computational and statistical methods that might be supportive. We describe strategies to minimize the amount of data to be collected or used for training such as sampling and dimensionality reduction. These strategies can be employed to limit the number of attributes or observations, reducing precision and changing the representation of data.

Principle #5.1: Reduce number of data items

Often the data scientist can retrieve accurate results by looking at data samples or by using aggregated data. Data can also be categorized (or clustered) into groups, such that different attributes are relevant for some groups but not for others. A group might also be described using an average or median value. The grouping itself might be obtained by clustering algorithms, for instance, documents can be summarized using centroids obtained through clustering [43]. Intuitively, one should maintain data that is most relevant to achieve a certain task. Active learning [2] seeks to incrementally acquire relevant samples for learning. Thus, rather than having a passive model (or learner) that just uses the training data as given, an active learner might ask explicitly for data that is expected to yield maximal improvement in learning. Active learning is typically used in determining what data to collect.
But the idea of active learning might also be used to assess the relevance of data and filter data accordingly. A model can be trained using active learning by incrementally adding the most important data items of the full dataset. The learning process might be stopped if there is no more data that improves the model beyond a small threshold. Unused data, which does not improve the model significantly, could then be discarded. Uncertainty sampling is the most prominent technique in active learning in the context of classification [49]. It seeks to obtain labelled data where there is most uncertainty about the correct class labels. Uncertainty sampling has been employed successfully for margin-based classifiers such as Support Vector Machines (SVMs) [56]. Standard sampling techniques [52] can also be helpful to reduce the amount of data. One of the simplest, but often sufficient, approaches is simple random sampling, that is, choosing each data point with the same probability without replacement of selected data points. In a case study on predicting conversion probabilities for two online retailers, Stange and Funk [52] showed that only 1% of the data available to them was enough to achieve the optimal tradeoff between accuracy and the cost of collecting and processing the data. Stratified sampling is an appropriate sampling technique if groups are homogeneous, that is, data within groups has lower variance than data from distinct groups. One could also employ density-based sampling, for instance, assign samples with lower density a higher probability. This is useful if data from rare regions is highly important.

Principle #5.2: Reduce number or precision of attributes

The dataset might contain attributes that are irrelevant for the analysis. These attributes can be safely neglected. The relevance might depend on the type of data. For many text mining problems very frequent words (so-called stop words, such as “and”, “the”, “is”, “are”) can be ignored.
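A minimal sketch of such a filter is shown here. The stop-word list contains only the four words named in the text and is an illustrative assumption, not a standard list; the whitespace tokenization and the function name `remove_stop_words` are likewise simplifications chosen for this example.

```python
# Tiny illustrative stop-word list taken from the examples in the text.
STOP_WORDS = frozenset({"and", "the", "is", "are"})


def remove_stop_words(text: str, stop_words: frozenset = STOP_WORDS) -> list:
    """Lowercase the text, split it on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


print(remove_stop_words("The model and the data are small"))
```

Every removed token is one attribute fewer in a bag-of-words representation, so later steps have less data to store and process.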
In fact, removing unnecessary or noisy attributes such as stop words is often recommended [2]. More generally, dimensionality reduction can be achieved by feature selection and extraction as well as type transformation [51, 2]. Feature selection techniques encompass filter and wrapper methods as well as their combination. Filter models assess the impact of features by some criterion independent of the model. Wrapper models train the model using a subset of features. An example of a filter model is the use of predictive attribute dependence, where the idea is that correlated features yield better outcomes than uncorrelated ones. Therefore, the relevance of an attribute might be determined by assessing the classification accuracy when using all other attributes to predict the attribute. These techniques can be employed to remove attributes that do not reach a minimum relevance threshold. Since many of the techniques are of heuristic nature, the impact of the removal of data that is deemed irrelevant should be tested, for instance, by comparing models being trained on the full and the reduced attribute set. Attribute reduction can also lead to an increase in accuracy, e.g., for decision trees [59]. Feature extraction is often performed through axis rotations in a way that axes are sorted according to their ability to reconstruct data with minimal error [2].

Axes with negligible impact on data reconstruction can be removed. The derived dataset can often be used to train a model or it might be used to reconstruct the original data, which in turn is used for training. The former approach is preferable, since a smaller volume of data must be processed. Prominent techniques include singular value decomposition (SVD), and a special case called principal component analysis (PCA).

SVD and similar techniques for feature extraction solve an optimization problem. This can be time consuming, making potential energy savings questionable. Random projections [51], where data is projected onto random manifolds, are a simpler and more efficient dimensionality reduction technique. However, to achieve the same approximation guarantees more dimensions are needed than for SVD. Random projections approximately preserve Euclidean distances according to the Johnson-Lindenstrauss Lemma as well as similarity computed using dot products [51], but random projections (as well as other dimensionality reduction techniques) do not preserve metrics such as the Manhattan distance. Therefore, some care is needed to ensure correct outcomes when applying dimensionality reduction techniques. There is also empirical evidence comparing learning outcomes on the original data to outcomes on the data with reduced dimensionality [17]. Unfortunately, the comparison neglects metrics relevant to energy, e.g., computation time.

Aggarwal [2] describes dimensionality reduction with type transformation as the change of data from a more complex to a less complex type. For instance, graphs can be expressed as multidimensional data that might potentially be easier (and faster) to process. Time series can also be transformed to multidimensional data using the Haar Wavelet Transform or Fourier Transformation that both express the data using a (small) set of orthogonal functions.
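As an illustration of the wavelet idea, the sketch below applies an averaging (unnormalized) variant of the Haar transform to a short series and then discards small coefficients. The function names, the threshold, and the sample series are assumptions made for this example; it is a simplified sketch, not a production implementation.

```python
def haar_transform(values):
    """Haar decomposition (averaging variant) of a list whose length is a power of two.

    Result layout: [overall average, coarsest difference, ..., finest differences].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        differences = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:length] = averages + differences
        length = half
    return out


def inverse_haar(coeffs):
    """Invert haar_transform: rebuild pairs as (average + diff, average - diff)."""
    out = list(coeffs)
    length = 1
    while length < len(out):
        merged = []
        for avg, diff in zip(out[:length], out[length:2 * length]):
            merged += [avg + diff, avg - diff]
        out[:2 * length] = merged
        length *= 2
    return out


def drop_small(coeffs, threshold):
    """Set coefficients with magnitude below the threshold to zero."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]


series = [10, 10, 12, 12, 20, 20, 20, 21]  # hypothetical time series
coeffs = haar_transform(series)
kept = drop_small(coeffs, threshold=0.6)
print(sum(1 for c in kept if c != 0.0), "of", len(kept), "coefficients kept")
print(inverse_haar(kept))
```

Only the non-zero coefficients need to be stored. The reconstruction from the reduced set is close to the original series but not identical to it.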
This form of data compression typically implies a loss of precision [46]. Often, a dataset might only contain a few informative attributes and, therefore, the loss of precision might be very small, while achieving a substantial amount of data reduction. A high-level understanding of the data mining task helps the data scientist choose a suitable dimensionality reduction technique. A technique might distort some instances more than others, and a small number of instances that are very different in the original context can be very similar in the space with reduced dimensions. For tasks like outlier detection this can be unacceptable, since outliers might be transformed so that they are not identifiable in the transformed data. Other tasks such as segmenting data into unspecified groups (clustering) might be less impacted by altering a few instances in an undesirable way.

Principle #5.3: Change data representation

Data can be described in many ways without any loss of information, using lossless compression algorithms [46]. This means that data is transformed among different representations without any effect on the minable knowledge. The green data scientist should prefer the representation that requires the least amount of storage, the least amount of computational effort to process throughout the data mining task, and the least amount of computation to create from the original data description.

A sequence of 0,0,0,0,99,99 can be written more compactly using run-length encoding as 0:4, 99:2. Another form of encoding is difference encoding, where differences between consecutive elements are stored, e.g., 0,0,0,0,99,0. Difference encoding is often beneficial for time-series data, where commonly there is a strong dependency between consecutive data points. It is also possible to store only non-zero elements with indexes, e.g., the sequence 0,0,0,0,99,99 becomes 4:99, 5:99. In multiple dimensions such data structures are called sparse matrices.
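The three encodings can be sketched in a few lines. The function names are chosen for this illustration only, and the example reuses the sequence from the text.

```python
from itertools import accumulate, groupby


def run_length_encode(seq):
    """Collapse runs of equal values into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(seq)]


def difference_encode(seq):
    """Keep the first element, then store each element minus its predecessor."""
    if not seq:
        return []
    return [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]


def difference_decode(diffs):
    """Invert difference_encode with a running sum, showing it is lossless."""
    return list(accumulate(diffs))


def sparse_encode(seq):
    """Store only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(seq) if v != 0}


data = [0, 0, 0, 0, 99, 99]
print(run_length_encode(data))   # value:count pairs
print(difference_encode(data))   # first element followed by differences
print(sparse_encode(data))       # index:value pairs of non-zero entries
```

Which encoding is smallest depends on the data: run-length encoding pays off for long runs, difference encoding for slowly changing values, and the sparse form when most entries are zero.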
There are many applications where zero entries are common, e.g., document-term matrices representing textual documents and user-item matrices used to derive recommendations.

Numerous compression algorithms can be used to alter the data representation: general purpose algorithms such as Lempel-Ziv as well as algorithms tailored to specific types of data. Sakr [45], for instance, surveys algorithms for XML data compression. A dataset can be compressed in such a way that the entire dataset must be decompressed to access a single element. A compressed dataset might also allow for even faster access and manipulation of data than non-compressed data. For large matrices in a sparse matrix representation, for instance, some manipulations such as multiplication of two matrices are often faster. Compression and decompression also consume energy and, thus, data compression might or might not be beneficial depending on the number of required compress and decompress operations. General purpose algorithms allow the user to specify how much effort they should invest into finding the representation that minimizes space. Some algorithms take advantage of compressed representations and work on them [...]. In case data is transferred across networks or is infrequently accessed, compression is even more appealing.

Principle #5.4: Accurate specification of attribute requirements

Whereas discrete attribute values stem from a fixed set of values, attributes with continuous values are

stored with a specific precision. The precision of individual attributes as well as the set of possible values can be defined by specifying an attribute type. For example, for an attribute containing temperature measurements, a data scientist might specify a precision of 0.001 degrees and a range of feasible values such as [0,100] as a so-called “domain constraint” in database systems [15]. As a next step, a data type can be chosen that meets these requirements and uses the least amount of storage. For instance, databases provide a set of data types according to the SQL standard [15], whereas programming languages usually follow the IEEE standards for floating point, integer, and other data types. The data type also determines the amount of storage and impacts the time and energy to conduct operations on data. The green data scientist should specify reasonable requirements. Choosing inappropriate types might more than double the amount of needed storage. For example, choosing an integer type (64 bits) rather than a (single) byte type (8 bits) for an array of many values leads to an increase of a factor of almost eight in memory demand.

Domain constraints depend on the data source, the range of the data, and the intended application: For sensor data, the accuracy is given by the maximum precision that seems achievable in the next years. For financial data, the needed accuracy might be given by the smallest unit, that is, one cent or one dollar. For time information, a precision up to milliseconds might not yield better outcomes than maintaining timestamps with hourly precision. For images, accuracy can be translated to the maximal resolution in terms of number of pixels or color depth that is beneficial for the analysis.

4.3 Data preparation and modeling

Principle #6: Execute common operations only once

Data preparation should be structured in such a way that common preparation operations for multiple models are executed only once.
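A minimal sketch of this principle, assuming two toy "models" that share one cleaning step. The helper names, the call counter, and the sample records are hypothetical; in-memory memoization with `functools.lru_cache` merely stands in for whatever caching or storage mechanism a real pipeline would use.

```python
from functools import lru_cache

CALLS = {"prepare": 0}  # counts how often the shared step actually runs


@lru_cache(maxsize=None)
def prepare(raw: tuple) -> tuple:
    """Shared cleaning step: strip whitespace, lowercase, drop empty records.

    The input must be hashable (a tuple) so that lru_cache can memoize it.
    """
    CALLS["prepare"] += 1
    return tuple(r.strip().lower() for r in raw if r.strip())


def train_length_model(raw: tuple) -> float:
    """Toy stand-in for a model: mean record length of the cleaned data."""
    data = prepare(raw)
    return sum(len(r) for r in data) / len(data)


def train_vocabulary_model(raw: tuple) -> list:
    """Toy stand-in for a second model: sorted vocabulary of the cleaned data."""
    data = prepare(raw)
    return sorted(set(data))


raw = ("  Red", "blue ", "", "RED")
train_length_model(raw)
train_vocabulary_model(raw)
print("prepare executed", CALLS["prepare"], "time(s)")
```

Both stand-in models request the prepared data, but the cleaning code runs only for the first request; the second one is served from the cache.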
For example, it can be reasonable to store a version of pre-processed data after general transformation and cleaning steps have been performed. The principle of factoring out common operations is already known, for instance, in the context of the Extract-Transform-Load (ETL) process optimization for data warehouses [58]. The idea of storing temporary results has also been applied in the co
