Data Mining - Documentation


Functional Genomics Workshop, October 15-16, 2014

Data Mining w/o Programming

A hands-on workshop at the Functional Genomics Workshop, Ljubljana, Slovenia

Clustering, SVM, MDS, ranking, heat maps, networks, ontologies, enrichment

Why do I have to find a math-inspired grad student every time I want to see my data?

These notes include Orange workflows that we will construct, and visualizations we will create during the workshop.

Workshop instructors: Blaz Zupan, Janez Demsar and Tomaž Curk, with help from members of Bioinformatics Lab, Ljubljana.

Welcome to the hands-on Data Mining workshop! This three-hour workshop is designed for students and researchers in molecular biology. You will see how common data mining tasks can be accomplished without programming. We will use Orange to construct visual data mining flows. Many similar data mining environments exist, but the organizers prefer Orange for a simple reason: they are its authors.

If you haven’t already installed Orange, please follow the installation guide at p-orange

Lesson 1: Workflows in Orange

Orange workflows consist of components that read, process and visualize data. We call them “widgets”. Widgets are placed on a drawing board, the “canvas”. Widgets communicate by sending information along a communication channel. Output from one widget is used as input to another.

A simple workflow with two connected widgets and one widget without connections. The outputs of a widget appear on the right, while the inputs appear on the left.

We construct workflows by dragging widgets onto the canvas and connecting them by drawing a line from the transmitting widget to the receiving widget.

Construct a data flow that consists of a File widget, two Scatter Plot widgets and two Data Table widgets.

Workflow with a File widget that reads data from disk and sends it to the Scatter Plot and Data Table widget. The Data Table renders data in a spreadsheet, while the Scatter Plot visualizes it. Selected data points from the Scatter Plot are sent to two other widgets: Data Table (1) and Scatter Plot (1).

The File widget reads data from disk. Open the File widget by double-clicking its icon. Orange comes with several preloaded data sets. From these (“Browse documentation data sets”), choose brown-selected.tab, a yeast gene expression data set.

Orange workflows often start with a File widget. The brown-selected data set comprises 186 rows (genes) and 81 columns. Out of the 81 columns, 79 contain gene expressions of baker’s yeast under various conditions, one column (marked as a “meta attribute”) provides gene names, and one column contains the “class” value or gene function.

After you load the data, open the other widgets. In the Scatter Plot widget select a few data points and watch as they appear in Data Table (1). Use a combination of two Scatter Plot widgets, where the second scatterplot shows a detail of a smaller region selected in the first scatterplot.

The scatterplot for a pair of random features does not provide much information on gene function. Does this change with a different choice of the features? Try intelligent visualization scoring by VizRank, which is implemented within the Scatter Plot widget.

We can connect the output of the Data Table widget to the Scatter Plot widget to highlight chosen data instances (rows) in the scatterplot.

In this workflow we have switched on the option “Show channel names between widgets”.

How does Orange distinguish between the primary data source and the selection? It uses the first connected signal as the entire data set and the next one as its subset. To make changes, double-click on the line connecting the two widgets.

Lesson 2: Classification

Genes in the yeast data set are labeled with three functions (“Proteas”, “Resp”, and “Ribo”). Can we construct a model that predicts the gene function based on the gene’s expression profile? We’ll first create a classification tree and observe its predictions.

Something in this workflow is conceptually wrong. Can you guess what?

Classification trees split the data into smaller and smaller data sets until one of the classes prevails. We can use the Classification Tree Graph widget to visualize a classification tree model. Consider a combination with a scatterplot to visualize how the classification tree splits the data.

The Classification Tree widget outputs a classification tree model that is sent to the Classification Tree Graph widget, which renders the tree. Selecting a tree node in this widget will output the corresponding data.

In the next workflow we split the data set into two subsets: a training set and a test set. We construct the model from the training set, and observe the predicted class probabilities on the test set. Are the predictions reasonable? How can we assess their quality?

Widgets may transmit several types of signals. Data Sampler outputs both sampled data and left-out data. Orange will ask you which type of signal to pass to the receiving widget if it cannot resolve this automatically by matching the signal types.

To observe which data instances were selected, feed the output of the Data Sampler widget to the Data Table or Info widgets.
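For readers curious about what happens behind the Data Sampler widget, the split into sampled and left-out data can be sketched in a few lines of Python. This is only an illustration of the idea, not Orange's implementation; the function name split_train_test, the 70% fraction and the toy gene names are our own assumptions.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and return (sampled data, left-out data)."""
    rng = random.Random(seed)          # fixed seed keeps the split repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Hypothetical toy data: ten gene identifiers standing in for data rows.
genes = [f"gene{i}" for i in range(10)]
train, test = split_train_test(genes, train_fraction=0.7)
print(len(train), "training rows,", len(test), "test rows")
```

Every row ends up in exactly one of the two outputs, which is what lets us evaluate the model on data it has not seen.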

Lesson 3: Classification Accuracy

To measure the quality of the constructed model we split the data into a training set and a test set. We evaluate the accuracy on data instances that have not been used for training. Accuracy can be measured by the proportion of data instances for which the class prediction was correct.

Try changing the size of the training set and observe the impact on accuracy. What do you expect? Try this with other data sets that come with Orange.

A predictor with 90% accuracy might sound good, but if 95% of instances belong to the same class, it is actually worse than always predicting the majority class. For data sets with a skewed class distribution, other evaluation scores (such as Area Under ROC) are more appropriate.
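The 90% versus 95% example above can be checked with a short calculation. The sketch below uses made-up labels (95 instances of class "A", 5 of class "B") and a hypothetical predictor that gets 90 of them right; the function names are ours.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Proportion of instances whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def majority_baseline_accuracy(true_labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(true_labels).most_common(1)[0][1]
    return most_common_count / len(true_labels)


# Made-up skewed data: 95 instances of class A, 5 of class B.
true_labels = ["A"] * 95 + ["B"] * 5
# Hypothetical predictor: 88 of the A's and 2 of the B's are predicted correctly.
predicted = ["A"] * 88 + ["B"] * 7 + ["B"] * 2 + ["A"] * 3

model_accuracy = accuracy(true_labels, predicted)
baseline_accuracy = majority_baseline_accuracy(true_labels)
print("model:", model_accuracy, "majority baseline:", baseline_accuracy)
```

Comparing against the majority baseline is a quick sanity check before trusting any accuracy figure on skewed data.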

Lesson 4: Cross-Validation

Estimating the accuracy may depend on a particular split of the data set. To increase robustness we can repeat the measurement several times, each time choosing a different subset of the data for training. One such method is cross-validation. It is available in Orange through the Test Learners widget. We will analyze its output by examining the confusion matrix and the ROC curve.

The Confusion Matrix widget outputs data instances related to the selected cells. In this schema we visualize them in the Scatter Plot widget as a data subset. What can you say about the misclassified instances? Does the scatterplot provide insights? Are there outliers?

In cross-validation each data instance is used for testing exactly once.

We can use the Confusion Matrix widget to find how many test instances were classified correctly and, if not, which class they were mistaken for.
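To make the two ideas of this lesson concrete, here is a minimal sketch of how cross-validation folds and a confusion matrix can be computed. It is not how the Test Learners or Confusion Matrix widgets are implemented; the interleaved fold assignment, the function names and the toy predictions are assumptions made for illustration.

```python
from collections import Counter


def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Interleaved assignment: instance i goes to fold i % k.
    folds = [list(range(i, n_instances, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n_instances) if i not in held_out]
        yield train, test


def confusion_matrix(true_labels, predicted_labels):
    """Count (true class, predicted class) pairs."""
    return Counter(zip(true_labels, predicted_labels))


# Each instance should be used for testing exactly once.
times_tested = Counter(i for _, test in k_fold_indices(10, 3) for i in test)
print(sorted(times_tested.items()))

# Made-up predictions for four genes with the class names used in Lesson 2.
confusion = confusion_matrix(
    ["Ribo", "Ribo", "Resp", "Proteas"],
    ["Ribo", "Resp", "Resp", "Proteas"],
)
print(confusion[("Ribo", "Resp")], "Ribo gene(s) mistaken for Resp")
```

The off-diagonal entries of the confusion matrix, such as ("Ribo", "Resp"), correspond to the cells one would select in the Confusion Matrix widget to inspect misclassified instances.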

Lesson 5: GEO Data Sets

The bioinformatics add-on provides access to a data set library by Gene Expression Omnibus (GEO). Orange queries GEO for each selected data set and downloads it. Construct the depicted workflow and inspect a few data sets.

In the GEO Data Sets widget try changing the setting of what data will be represented in rows. Check the output in the Data Table and Info widgets. Which setting would be appropriate for creating a data set for classification?

The data sets that have been downloaded are marked with a bullet in the first column of the table.

Lesson 6: GEO Data Sets and Classification

From the GEO widget, select the data on breast cancer (GDS360) with 14 treatment resistant and 10 treatment sensitive tumors. Can we predict the treatment sensitivity from gene expression profiles?

The Random Forest classifier often achieves good accuracy on gene expression data. Try changing the number of classification trees in the forest. How does the accuracy change? Does random forest beat a single classification tree? How does logistic regression compare with the other two methods?

We will test the accuracy of three learners: classification tree, logistic regression, and random forest. We recommend starting with smaller data sets as some of the learning algorithms require a lot of time.

Lesson 7: Venn Diagram

The following workflow looks intimidating, but it’s not as complicated as it looks. The question we are trying to answer is: do different classifiers misclassify the same tissue samples? That is, are some specific test instances hard to classify? Are they outliers, or even originally misclassified tissue samples? We can answer all but the last question by cross-validating the classifiers, selecting misclassified instances in the Confusion Matrix, and relating the three sets of misclassifications in the Venn diagram.

Most widgets in Orange are interactive. For example, you can click on different sections of the Venn diagram to output a related data item and inspect it with other widgets.

We can now choose various sections of the Venn Diagram and inspect which of the data instances were the hardest to classify.
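The sections of the Venn diagram are simply set operations on the three groups of misclassified samples. A small sketch with Python sets shows the idea; the sample identifiers below are invented and do not come from GDS360.

```python
# Hypothetical sets of misclassified tissue samples, one per classifier.
tree_errors = {"s03", "s07", "s11", "s19"}
logreg_errors = {"s07", "s11", "s15"}
forest_errors = {"s07", "s19", "s21"}

# Central section of the diagram: samples that every classifier got wrong.
hard_for_all = tree_errors & logreg_errors & forest_errors

# Section belonging to the classification tree alone.
only_tree = tree_errors - logreg_errors - forest_errors

# Samples misclassified by at least two of the three classifiers.
at_least_two = (
    (tree_errors & logreg_errors)
    | (tree_errors & forest_errors)
    | (logreg_errors & forest_errors)
)

print("hard for all:", sorted(hard_for_all))
print("only the tree:", sorted(only_tree))
print("at least two:", sorted(at_least_two))
```

Samples in the central section are the natural candidates to inspect as outliers.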

Lesson 8: Hierarchical Clustering

For hierarchical clustering, we need to measure the distances between genes (rows), which are fed into a Hierarchical Clustering widget that displays the dendrogram. The dendrogram is interactive: clicking on any branch sends its data instances to the output.

We used Euclidean distance (in the Distances widget) and Ward’s linkage (in the Hierarchical Clustering widget). Euclidean distance may not be the best choice in this case. Do you agree? Experiment with other distance measures. Do you notice any changes in the dendrogram?

We display data instances selected in the dendrogram in a scatterplot. Make sure this widget is showing an informative visualization.
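Why might Euclidean distance be a poor choice for expression profiles? The sketch below compares it with a correlation-based distance on three invented profiles. We assume the common definition 1 - r for the Pearson distance; the notes do not say which formula the Distances widget uses.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - r, where r is the Pearson correlation (an assumed definition)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / math.sqrt(var_a * var_b)


# Invented expression profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same trend as gene_a, ten times the amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]       # similar magnitude to gene_a, opposite trend

print("Euclidean a-b:", euclidean(gene_a, gene_b))
print("Euclidean a-c:", euclidean(gene_a, gene_c))
print("Pearson   a-b:", pearson_distance(gene_a, gene_b))
print("Pearson   a-c:", pearson_distance(gene_a, gene_c))
```

Euclidean distance places gene_a closer to the oppositely regulated gene_c than to the co-regulated gene_b, while the correlation-based distance does the reverse.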

Lesson 9: k-Means Clustering

Hierarchical clustering is not suitable for larger data sets due to the prohibitive size of the distance matrix. An alternative approach, which doesn’t use the distance matrix, is k-means clustering. Here we have to provide the number of clusters in advance. Alternatively, we can use cluster scoring techniques to discover the optimal value for the number of clusters from a predefined range. You are free to try k-means clustering on any data set; however, we will discuss its properties on hand-painted data.

A game we like to play is to see if silhouette scoring in k-means can discover the “correct” number of clusters.

How many clusters do you see in the data set on the right? What is the number of clusters proposed by the silhouette method and k-means clustering? Help k-means find the expected number of clusters by modifying the data set.
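The game of letting silhouette scoring pick the number of clusters can be sketched in plain Python (3.8 or later, for math.dist). This is a toy version under our own assumptions, not the code behind Orange's k-Means widget: centroids are initialized with a deterministic farthest-point rule so that the result is repeatable, singleton clusters get a silhouette of 0, and the twelve "painted" points are invented.

```python
import math


def farthest_point_init(points, k):
    """Deterministic start: each new centroid is the point farthest from the chosen ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, iterations=100):
    """Return one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:       # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette over all points; singleton clusters contribute 0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                     # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())       # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Invented "painted" data: three tight groups of four points each.
corners = [(0, 0), (10, 0), (0, 10)]
points = [(cx + dx, cy + dy) for cx, cy in corners for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]

scores = {k: silhouette(points, k_means(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print("silhouette by k:", scores)
print("best k:", best_k)
```

On data this clean the silhouette peaks at the number of painted groups; the interesting cases are elongated or overlapping groups, where it may not.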

Lesson 10: Data Projection

We have already seen one type of data projection, the scatterplot, but we were limited to projecting the data onto a hyperplane defined by two features. A technique that finds projections that retain the most variance is Principal Component Analysis (PCA). Another approach is Multidimensional Scaling (MDS), where we embed the data into a low dimensional space while trying to preserve distances between objects. The two approaches often yield similar visualizations.

Try replacing the GEO Data Sets widget with the File widget and select the brown-selected.tab data set. Are the visualizations by PCA and MDS similar?

PCA can also be used for preprocessing by transforming the data to a lower dimensional space. This could sometimes increase accuracy, but also make the results harder to interpret.
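To see what “retaining the most variance” means, the first principal component can be found with a few lines of Python. The sketch uses power iteration on the covariance matrix, which is just one simple way to obtain the leading direction and is not necessarily what the PCA widget does; the six data points are invented, and the fixed starting vector is an assumption that would fail if it happened to be orthogonal to the component.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (direction, share of variance explained, projected values)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    vector = [1.0] + [0.0] * (d - 1)        # assumed starting direction
    for _ in range(iterations):
        # Multiply by the covariance matrix and renormalize.
        product = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    # Variance along the found direction, relative to the total variance.
    eigenvalue = sum(
        vector[i] * sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)
    )
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    projection = [sum(c[j] * vector[j] for j in range(d)) for c in centered]
    return vector, explained, projection


# Invented 2-D data stretched along the diagonal.
rows = [(-2, -2), (-1, -1), (1, 1), (2, 2), (-0.5, 0.5), (0.5, -0.5)]
direction, explained, projection = first_principal_component(rows)
print("direction:", direction)
print("share of variance explained:", explained)
```

The projection values are the coordinates one would plot on the first PCA axis; repeating the procedure on what remains gives the second axis.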

Lesson 11: Correlation Networks

Similarity between data instances (e.g. genes, tissue samples, chemicals) can also be visualized with a network. We need to choose a similarity threshold or limit the number of edges per node. You need to have the Orange network add-on installed to construct and explore similarity-based networks.

Widgets in the network add-on provide many different options for visualization and analysis. How do the resulting networks change with different distance metrics? Are hubs invariant to the choice of the distance metric? Which are the hub genes?

We added the Net Analysis widget to compute graph and node level statistics and pass them to the Net Explorer widget to be rendered in the network.

Lesson 12: Gene Set Enrichment

Data sets can store gene profiles in rows and also include gene names. We can use Orange workflows to select data instances, and see if the corresponding genes are present in some pathways or Gene Ontology terms. For this task the Orange bioinformatics add-on includes GO Browser and Gene Set Enrichment widgets.

Lists of gene sets (pathways, GO terms) in enrichment analysis widgets are clickable. Try rendering the output of these widgets in the Gene Info widget, and use it to find your favorite gene in the NCBI Gene database.

GO Browser presents two views of enriched pathways: one displaying the ontology tree and the other showing a list of enriched GO terms.
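What does it mean for a GO term to be “enriched” among the selected genes? A common way to quantify it is the hypergeometric test, sketched below. The notes do not state which test the enrichment widgets use, so treat this as an assumed, generic illustration; the gene names and set sizes are invented.

```python
from math import comb


def enrichment_p_value(selected, annotated, background_size):
    """P(overlap at least as large as observed) under the hypergeometric model.

    Assumes both gene sets are drawn from the same background of
    `background_size` genes.
    """
    overlap = len(selected & annotated)
    n_selected = len(selected)
    n_annotated = len(annotated)
    total = comb(background_size, n_selected)
    upper = min(n_selected, n_annotated)
    tail = sum(
        comb(n_annotated, k) * comb(background_size - n_annotated, n_selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Invented example: 20 background genes, 5 annotated with a GO term,
# 4 selected in the workflow, 3 of which carry the annotation.
annotated = {"g0", "g1", "g2", "g3", "g4"}
selected = {"g0", "g1", "g2", "g10"}
p_value = enrichment_p_value(selected, annotated, background_size=20)
print("enrichment p-value:", p_value)
```

A small p-value says that an overlap this large would be unlikely if the genes had been selected at random from the background.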

Lesson 13: The Genetic Landscape of a Cell

The data set for this lesson is in the documentation data sets (File widget, yeastinteractions.tab). This is a sample with 454 query genes with a subset of 184 most informative array genes selected using the CUR decomposition. Query genes in the sample were chosen to represent gene annotation groups from Figure 2 in Costanzo et al. (2010).

The title of this lesson comes from the famous Costanzo et al. (2010) Science paper. We use a sample of their gene interaction data to reconstruct the correlation-based gene network. In this data set genes are described with their interaction profiles. We use the absolute Pearson correlation coefficient to estimate distances between genes (Distances widget). Two genes in the network are connected if their profile distance is below a certain threshold (Net from Distances). We explore the “gene galaxy” (Net Explorer) for GO function and process enrichment (GO Browser).
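The construction described above, a distance from the absolute Pearson correlation followed by a threshold, can be sketched as follows. We assume the distance is 1 - |r|; the notes do not give the exact formula used by the Distances widget. The four interaction profiles and the threshold of 0.1 are invented.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_network(profiles, threshold):
    """Connect two genes when 1 - |r| (assumed distance) is below the threshold."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(profiles.items(), 2):
        distance = 1 - abs(pearson(a, b))
        if distance < threshold:
            edges.append((name_a, name_b))
    return edges


# Invented interaction profiles.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],    # strongly positively correlated with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],    # strongly negatively correlated with geneA
    "geneD": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with the others
}

edges = correlation_network(profiles, threshold=0.1)
degree = Counter(gene for edge in edges for gene in edge)
print("edges:", edges)
print("node degrees:", dict(degree))
```

Because the absolute value is used, strongly anti-correlated genes such as geneA and geneC are connected as well; node degrees are the simplest statistic for spotting hubs.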

Lesson 14: Differential Expression Analysis

We can find the most differentially expressed genes in the gastric cancer data (GDS1210, 22 cases and 10 controls) with the Differential Expression widget.

Is the distribution of observed gene scores always as different from the null distribution as in GDS1210? Examine some other data sets from GEO. What can you say about those in which the observed score distribution is similar to the null distribution? Are there many such data sets in GEO?

The Differential Expression widget can compare the distribution of gene scores to scores from randomly permuted data.
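The idea of a null distribution from permuted data can be illustrated for a single gene. In this sketch the score is the difference between the mean expression of cases and controls, which is our assumption; the widget may use other scoring methods. Because the toy data are tiny, we enumerate every possible relabeling instead of sampling random permutations, so the result is exactly reproducible. The expression values are invented.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def score(values, case_indices):
    """Difference of mean expression between cases and controls (assumed score)."""
    cases = [values[i] for i in case_indices]
    controls = [v for i, v in enumerate(values) if i not in case_indices]
    return mean(cases) - mean(controls)


def permutation_p_value(values, case_indices):
    """Share of relabelings whose absolute score reaches the observed one."""
    observed = abs(score(values, set(case_indices)))
    null_scores = [
        abs(score(values, set(relabeled)))
        for relabeled in combinations(range(len(values)), len(case_indices))
    ]
    # Small tolerance guards against floating-point noise in the comparison.
    extreme = sum(s >= observed - 1e-12 for s in null_scores)
    return observed, extreme / len(null_scores)


# Invented expression of one gene: first four samples are cases, last four controls.
expression = [5.1, 5.3, 4.9, 5.2, 1.0, 1.2, 0.9, 1.1]
observed, p_value = permutation_p_value(expression, case_indices=[0, 1, 2, 3])
print("observed score:", observed)
print("permutation p-value:", p_value)
```

Only the true labeling and its mirror image reach the observed score here, so the p-value is 2 out of 70 relabelings; for a gene with no real difference the observed score would sit in the middle of the null distribution.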

Lesson 15: Heat Maps

We can visualize gene and case profiles with a combination of a heat map and hierarchical clustering. The Heat Map widget supports row selection and outputs the associated data, which can be analyzed further (e.g. gene set enrichment analysis).

The Heat Map widget offers several ways to sort rows and columns, filter data, and define color schemes.

We use this workflow to analyze yeast cell cycle data and select a particular set of experiments using the Select Attributes widget.

Lesson 16: Chemogenomics

A chemogenomics data set comprising 87 compounds and 289 yeast strains is sampled from Lee et al. (2014). We use data from the homozygous pool and extract compounds and strains that are found to be significant by the clearance algorithm (the clearance max parameter was set to 4.00). Load the data set from the documentation data sets (File widget, chemogenomics.tab).

We will mine chemogenomics fitness signatures from Lee et al. (Science, 2014). In this data set compounds were characterized through fitness of yeast single-mutant strains. We will check if compounds with similar profiles share common annotations.

Orange data sets can contain links to images in local files or on the web, which can be viewed by the Image Viewer widget.
