Data Mining - Documentation


Functional Genomics Workshop, October 15-16, 2014

Clustering, SVM, MDS, ranking, heat maps, networks, ontologies, enrichment

Why do I have to find a math-inspired grad student every time I want to see my data?

Data Mining w/o Programming

A hands-on workshop at the Functional Genomics Workshop, Ljubljana, Slovenia

These notes include Orange workflows that we will construct, and visualizations we will create during the workshop.

Workshop instructors: Blaz Zupan, Janez Demsar and Tomaž Curk, with help from members of Bioinformatics Lab, Ljubljana.

Welcome to the hands-on Data Mining workshop! This three-hour workshop is designed for students and researchers in molecular biology. You will see how common data mining tasks can be accomplished without programming. We will use Orange to construct visual data mining flows. Many similar data mining environments exist, but the organizers prefer Orange for a simple reason: they are its authors.

If you haven’t already installed Orange, please follow the installation guide at p-orange

Lesson 1: Workflows in Orange

Orange workflows consist of components that read, process and visualize data. We call them “widgets”. Widgets are placed on a drawing board, the “canvas”. Widgets communicate by sending information along a communication channel. Output from one widget is used as input to another.

A simple workflow with two connected widgets and one widget without connections. The outputs of a widget appear on the right, while the inputs appear on the left.

We construct workflows by dragging widgets onto the canvas and connecting them by drawing a line from the transmitting widget to the receiving widget.

Construct a data flow that consists of a File widget, two Scatter Plot widgets and two Data Table widgets.

Workflow with a File widget that reads data from disk and sends it to the Scatter Plot and Data Table widgets. The Data Table renders data in a spreadsheet, while the Scatter Plot visualizes it. Selected data points from the Scatter Plot are sent to two other widgets: Data Table (1) and Scatter Plot (1).

The File widget reads data from disk. Open the File widget by double-clicking its icon. Orange comes with several preloaded data sets. From these (“Browse documentation data sets”), choose brown-selected.tab, a yeast gene expression data set.

Orange workflows often start with a File widget. The brown-selected data set comprises 186 rows (genes) and 81 columns. Out of the 81 columns, 79 contain gene expressions of baker’s yeast under various conditions, one column (marked as a “meta attribute”) provides gene names, and one column contains the “class” value or gene function.

After you load the data, open the other widgets. In the Scatter Plot widget select a few data points and watch as they appear in Data Table (1). Use a combination of two Scatter Plot widgets, where the second scatterplot shows a detail of a smaller region selected in the first scatterplot.

The scatterplot for a pair of random features does not provide much information on gene function. Does this change with a different choice of the features? Try intelligent visualization scoring by VizRank, which is implemented within the Scatter Plot widget.

We can connect the output of the Data Table widget to the Scatter Plot widget to highlight chosen data instances (rows) in the scatterplot.

In this workflow we have switched on the option “Show channel names between widgets”.

How does Orange distinguish between the primary data source and the selection? It uses the first connected signal as the entire data set and the next one as its subset. To make changes, double-click on the line connecting the two widgets.

Lesson 2: Classification

Genes in the yeast data set are labeled with three functions (“Proteas”, “Resp”, and “Ribo”). Can we construct a model that predicts the gene function based on the gene’s expression profile? We’ll first create a classification tree and observe its predictions.

Something in this workflow is conceptually wrong. Can you guess what?

Classification trees split the data into smaller and smaller data sets until one of the classes prevails. We can use the Classification Tree Graph widget to visualize a classification tree model. Consider a combination with a scatterplot to visualize how the classification tree splits the data.

The Classification Tree widget outputs a classification tree model that is sent to the Classification Tree Graph widget, which renders the tree. Selecting a tree node in this widget will output the corresponding data.

In the next workflow we split the data set into two subsets: a training set and a test set. We construct the model from the training set, and observe the predicted class probabilities on the test set. Are the predictions reasonable? How can we assess their quality?

Widgets may transmit several types of signals. Data Sampler outputs both sampled data and left-out data. Orange will ask you which type of signal to pass to the receiving widget if it cannot resolve this automatically by matching the signal types.

To observe which data instances were selected, feed the output of the Data Sampler widget to the Data Table or Info widgets.

Lesson 3: Classification Accuracy

To measure the quality of the constructed model we split the data into a training set and a test set. We evaluate the accuracy on data instances that have not been used for training. Accuracy is the proportion of data instances for which the class prediction was correct.

Try changing the size of the training set and observe the impact on accuracy. What do you expect? Try this with other data sets that come with Orange.

A predictor with 90% accuracy might sound good, but if 95% of instances belong to the same class, it is actually worse than always predicting the majority class. For data sets with a skewed class distribution, other evaluation scores (such as Area Under ROC) are more appropriate.
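
For readers who want to check the arithmetic outside Orange, the comparison between a 90% accurate predictor and the majority class can be sketched in a few lines of plain Python. This is only an illustration: the labels and predictions are made up, and the function names `accuracy` and `majority_baseline` are hypothetical helpers, not part of Orange.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Proportion of instances whose predicted class matches the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def majority_baseline(true_labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(true_labels).most_common(1)[0][1]
    return most_common_count / len(true_labels)

# Hypothetical test set: 95 instances of class "A" and 5 of class "B".
truth = ["A"] * 95 + ["B"] * 5
# Hypothetical model that is wrong on 10 of the "A" instances.
predictions = ["B"] * 10 + ["A"] * 85 + ["B"] * 5

print(accuracy(truth, predictions))  # 0.9
print(majority_baseline(truth))      # 0.95
```

The model looks good in isolation, yet it loses to a rule that never looks at the data, which is why the baseline is worth reporting next to the accuracy.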

Lesson 4: Cross-Validation

An accuracy estimate may depend on the particular split of the data set. To increase robustness we can repeat the measurement several times, each time choosing a different subset of the data for training. One such method is cross-validation. It is available in Orange through the Test Learners widget. We will analyze its output by examining the confusion matrix and the ROC curve.

The Confusion Matrix widget outputs data instances related to the selected cells. In this schema we visualize them in the Scatter Plot widget as a data subset. What can you say about the misclassified instances? Does the scatterplot provide insights? Are there outliers?

In cross-validation each data instance is used for testing exactly once.

We can use the Confusion Matrix widget to find how many test instances were classified correctly and, if not, which class they were mistaken for.
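
The bookkeeping behind cross-validation and the confusion matrix can be sketched in plain Python. This is an illustrative sketch, not how the Test Learners widget is implemented; the function names, the choice of 10 folds and the random seed are assumptions made for the example.

```python
import random
from collections import Counter

def k_fold_indices(n_instances, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validation_splits(n_instances, k, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_instances, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true class, predicted class) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

# 186 instances, as in the brown-selected data set; k=10 is an assumed setting.
splits = list(cross_validation_splits(186, k=10))
tested = sorted(idx for _, test in splits for idx in test)
print(tested == list(range(186)))  # True: every instance is tested exactly once
```

A learner would be trained on each `train` list and asked to predict the matching `test` list; pooling those predictions and passing them to `confusion_matrix` gives the counts that the Confusion Matrix widget displays.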

Lesson 5: GEO Data Sets

The bioinformatics add-on provides access to a data set library by Gene Expression Omnibus (GEO). Orange queries GEO for each selected data set and downloads it. Construct the depicted workflow and inspect a few data sets.

In the GEO Data Sets widget try changing the setting of what data will be represented in rows. Check the output in the Data Table and Info widgets. Which setting would be appropriate for creating a data set for classification?

The data sets that have been downloaded are marked with a bullet in the first column of the table.

Lesson 6: GEO Data Sets and Classification

From the GEO widget, select the data on breast cancer (GDS360) with 14 treatment-resistant and 10 treatment-sensitive tumors. Can we predict the treatment sensitivity from gene expression profiles?

The Random Forest classifier often achieves good accuracy on gene expression data. Try changing the number of classification trees in the forest. How does the accuracy change? Does random forest beat a single classification tree? How does logistic regression compare with the other two methods?

We will test the accuracy of three learners: classification tree, logistic regression, and random forest. We recommend starting with smaller data sets, as some of the learning algorithms require a lot of time.

Lesson 7: Venn Diagram

The following workflow looks intimidating, but it’s not as complicated as it looks. The question we are trying to answer is: do different classifiers misclassify the same tissue samples? That is, are some specific test instances hard to classify? Are they outliers, or even originally misclassified tissue samples? We can answer all but the last question by cross-validating the classifiers, selecting misclassified instances in the Confusion Matrix, and relating the three sets of misclassifications in the Venn diagram.

Most widgets in Orange are interactive. For example, you can click on different sections of the Venn diagram to output a related data item and inspect it with other widgets.

We can now choose various sections of the Venn Diagram and inspect which of the data instances were the hardest to classify.
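
The sections of the Venn diagram correspond to simple set operations. As a hedged illustration in plain Python, the sketch below groups misclassified samples by the exact combination of classifiers that got them wrong. The sample identifiers and the helper `venn_regions` are invented for the example and do not come from GDS360 or from Orange.

```python
def venn_regions(sets):
    """Map each combination of set names to the items found in exactly
    those sets, i.e. the disjoint regions of a Venn diagram."""
    regions = {}
    all_items = set().union(*sets.values())
    for item in all_items:
        members = frozenset(name for name, s in sets.items() if item in s)
        regions.setdefault(members, set()).add(item)
    return regions

# Hypothetical sample identifiers misclassified by each learner.
misclassified = {
    "classification tree": {"s03", "s07", "s11", "s19"},
    "logistic regression": {"s07", "s11", "s22"},
    "random forest": {"s07", "s19"},
}

regions = venn_regions(misclassified)
# Samples that all three learners got wrong are the hardest to classify.
hardest = regions[frozenset(misclassified)]
print(sorted(hardest))  # ['s07']
```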

Lesson 8: Hierarchical Clustering

For hierarchical clustering, we need to measure the distances between genes (rows), which are fed into a Hierarchical Clustering widget that displays the dendrogram. The dendrogram is interactive: clicking on any branch sends its data instances to the output.

We used Euclidean distance (in the Distances widget) and Ward’s linkage (in the Hierarchical Clustering widget). Euclidean distance may not be the best choice in this case. Do you agree? Experiment with other distance measures. Do you notice any changes in the dendrogram?

We display data instances selected in the dendrogram in a scatterplot. Make sure this widget is showing an informative visualization.
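
One way to build intuition for the choice of distance measure is to compare two measures on a handful of profiles. The sketch below is plain Python with made-up expression values; `pearson_distance` (one minus the Pearson correlation) is just one possible alternative to Euclidean distance, chosen here as an assumption for the example.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: close to 0 for profiles with the same shape.
    Assumes neither profile is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1 - cov / (norm_a * norm_b)

# Hypothetical expression profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [2.0, 4.0, 6.0, 4.0]  # same shape as gene_a, twice the amplitude
gene_c = [2.0, 2.0, 1.0, 2.0]  # similar magnitude, opposite trend

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name, euclidean(gene_a, other), pearson_distance(gene_a, other))
```

With these values Euclidean distance places gene_a nearer to gene_c, while the correlation-based distance places it nearer to gene_b. Which behavior is preferable depends on whether the amplitude or the shape of an expression profile matters for the question at hand.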

Lesson 9: k-Means Clustering

Hierarchical clustering is not suitable for larger data sets due to the prohibitive size of the distance matrix. An alternative approach, which doesn’t use the distance matrix, is k-means clustering. Here we have to provide the number of clusters in advance. Alternatively, we can use cluster scoring techniques to discover the optimal value for the number of clusters from a predefined range. You are free to try k-means clustering on any data set; however, we will discuss its properties on hand-painted data.

A game we like to play is to see if silhouette scoring in k-means can discover the “correct” number of clusters.

How many clusters do you see in the data set on the right? What is the number of clusters proposed by the silhouette method and k-means clustering? Help k-means find the expected number of clusters by modifying the data set.
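
To make the silhouette game concrete, here is a minimal plain-Python sketch of the standard silhouette score. It only scores a given assignment of points to clusters and does not run k-means itself. The points and both labelings are invented stand-ins for hand-painted data, and Orange's own scoring may differ in detail.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points; higher means better-separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widest = max(a, b)
        scores.append((b - a) / widest if widest > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical "painted" data: three small, well-separated blobs.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (5, 9), (5, 10), (6, 9)]
three_clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]
two_clusters = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # merges two of the blobs

print(silhouette_score(points, three_clusters))
print(silhouette_score(points, two_clusters))
```

The labeling that matches the three painted blobs scores higher than the one that merges two of them, which is the signal used to propose the number of clusters.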

Lesson 10: Data Projection

We have already seen one type of data projection, the scatterplot, but we were limited to projecting the data onto the plane defined by two features. A technique that finds projections that retain the most variance is Principal Component Analysis (PCA). Another approach is Multidimensional Scaling (MDS), where we embed the data into a low-dimensional space while trying to preserve distances between objects. The two approaches often yield similar visualizations.

Try replacing the GEO Data Sets widget with the File widget and select the brown-selected.tab data set. Are the visualizations by PCA and MDS similar?

PCA can also be used for preprocessing by transforming the data to a lower-dimensional space. This can sometimes increase accuracy, but it also makes the results harder to interpret.

Lesson 11: Correlation Networks

Similarity between data instances (e.g. genes, tissue samples, chemicals) can also be visualized with a network. We need to choose a similarity threshold or limit the number of edges per node. You need to have the Orange network add-on installed to construct and explore similarity-based networks.

Widgets in the network add-on provide many different options for visualization and analysis. How do the resulting networks change with different distance metrics? Are hubs invariant to the choice of the distance metric? Which are the hub genes?

We added the Net Analysis widget to compute graph and node-level statistics and pass them to the Net Explorer widget to be rendered in the network.

Lesson 12: Gene Set Enrichment

Data sets can store gene profiles in rows and also include gene names. We can use Orange workflows to select data instances, and see if the corresponding genes are present in some pathways or Gene Ontology terms. For this task the Orange bioinformatics add-on includes GO Browser and Gene Set Enrichment widgets.

Lists of gene sets (pathways, GO terms) in enrichment analysis widgets are clickable. Try rendering the output of these widgets in the Gene Info widget, and use it to find your favorite gene in the NCBI Gene database.

GO Browser presents two views of enriched pathways: one displaying the ontology tree and the other showing a list of enriched GO terms.
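
The notes do not say which statistic the enrichment widgets use. A common choice for this kind of analysis is the hypergeometric test, and the plain-Python sketch below assumes it purely for illustration. The function name `enrichment_p_value` and all the counts in the example are hypothetical.

```python
from math import comb

def enrichment_p_value(n_reference, n_in_set, n_selected, n_overlap):
    """Hypergeometric tail P(X >= n_overlap): the chance of seeing at least
    n_overlap gene-set members when n_selected genes are drawn without
    replacement from n_reference genes, n_in_set of which are in the set."""
    upper = min(n_in_set, n_selected)
    total = comb(n_reference, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_reference - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 186 reference genes, a GO term annotating 20 of them,
# and a selection of 10 genes that contains 6 genes annotated with the term.
print(enrichment_p_value(186, 20, 10, 6))
```

A small value suggests that the selected genes contain more members of the gene set than a random selection of the same size would. When many gene sets are tested at once, the p-values need a multiple-testing correction before they are interpreted.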

Lesson 13: The Genetic Landscape of a Cell

The title of this lesson comes from the famous Costanzo et al. (2010) Science paper. We use a sample of their gene interaction data to reconstruct the correlation-based gene network. In this data set genes are described with their interaction profiles. We use the absolute Pearson correlation coefficient to estimate distances between genes (Distances widget). Two genes in the network are connected if their profile distance is below a certain threshold (Net from Distances). We explore the “gene galaxy” (Net Explorer) for GO function and process enrichment (GO Browser).

The data set for this lesson is in the documentation data sets (File widget, yeastinteractions.tab). This is a sample with 454 query genes with a subset of 184 most informative array genes selected using the CUR decomposition. Query genes in the sample were chosen to represent gene annotation groups from Figure 2 in Costanzo et al. (2010).
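
The two steps described above, a distance from the absolute Pearson correlation followed by a threshold, can be sketched in plain Python. This is an illustration under stated assumptions: the distance is taken to be 1 - |r|, the threshold of 0.2 is arbitrary, and the gene names and profiles are invented. The Distances and Net from Distances widgets may differ in detail.

```python
import math
from collections import Counter
from itertools import combinations

def abs_pearson_distance(a, b):
    """1 - |r|: strongly correlated or anti-correlated profiles are close.
    Assumes neither profile is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1 - abs(cov / (norm_a * norm_b))

def build_network(profiles, threshold):
    """Connect two genes when their profile distance is below the threshold."""
    return [
        (g1, g2)
        for (g1, p1), (g2, p2) in combinations(profiles.items(), 2)
        if abs_pearson_distance(p1, p2) < threshold
    ]

def node_degrees(edges):
    """Number of edges per gene; high-degree genes are candidate hubs."""
    return Counter(gene for edge in edges for gene in edge)

# Hypothetical interaction profiles for four genes.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],    # nearly proportional to geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],    # mirror image of geneA
    "geneD": [1.0, -1.0, 1.0, -1.0],  # unrelated oscillation
}

edges = build_network(profiles, threshold=0.2)
print(edges)
print(node_degrees(edges))
```

Lowering the threshold removes edges and raising it adds them, so it is worth trying several values before reading much into hubs or isolated genes.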

Lesson 14: Differential Expression Analysis

We can find the most differentially expressed genes in the gastric cancer data (GDS1210, 22 cases and 10 controls) with the Differential Expression widget.

Is the distribution of observed gene scores always as different from the null distribution as in GDS1210? Examine some other data sets from GEO. What can you say about those in which the observed score distribution is similar to the null distribution? Are there many such data sets in GEO?

The Differential Expression widget can compare the distribution of gene scores to scores from randomly permuted data.
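
The idea of comparing observed scores with scores from permuted data can be sketched for a single gene in plain Python. The scoring function (difference of group means), the number of permutations, the random seed and the expression values are all assumptions made for this example; the widget offers its own scoring methods.

```python
import random
from statistics import mean

def score(values, labels):
    """Difference between the mean expression of cases and of controls."""
    cases = [v for v, label in zip(values, labels) if label == "case"]
    controls = [v for v, label in zip(values, labels) if label == "control"]
    return mean(cases) - mean(controls)

def permutation_null(values, labels, n_permutations=1000, seed=0):
    """Scores obtained after randomly shuffling the case/control labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_scores = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        null_scores.append(score(values, shuffled))
    return null_scores

def empirical_p_value(observed, null_scores):
    """Share of permuted scores at least as extreme as the observed score."""
    extreme = sum(abs(s) >= abs(observed) for s in null_scores)
    return (extreme + 1) / (len(null_scores) + 1)

# Hypothetical expression of one gene in 6 cases and 4 controls.
values = [5.1, 4.8, 5.5, 5.0, 5.3, 4.9, 2.1, 2.4, 1.9, 2.2]
labels = ["case"] * 6 + ["control"] * 4

observed = score(values, labels)
null_scores = permutation_null(values, labels)
print(observed, empirical_p_value(observed, null_scores))
```

If the observed score sits far in the tail of the permuted scores, the gene is unlikely to look this different by chance. When the observed and permuted distributions overlap, the data set carries little evidence of differential expression.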

Lesson 15: Heat Maps

We can visualize gene and case profiles with a combination of a heat map and hierarchical clustering. The Heat Map widget supports row selection and outputs the associated data, which can be analyzed further (e.g. gene set enrichment analysis).

The Heat Map widget offers several ways to sort rows and columns, filter data, and define color schemes.

We use this workflow to analyze yeast cell cycle data and select a particular set of experiments using the Select Attributes widget.

Lesson 16: Chemogenomics

We will mine chemogenomics fitness signatures from Lee et al. (Science, 2014). In this data set compounds were characterized through fitness of yeast single-mutant strains. We will check if compounds with similar profiles share common annotations.

A chemogenomics data set comprising 87 compounds and 289 yeast strains is sampled from Lee et al. (2014). We use data from the homozygous pool and extract compounds and strains that are found to be significant by the clearance algorithm (the clearance max parameter was set to 4.00). Load the data set from the documentation data sets (File widget, chemogenomics.tab).

Orange data sets can contain links to images in local files or on the web, which can be viewed by the Image Viewer widget.
