Information Retrieval And Text Mining . - Semantic Scholar

Upload by : Francisco Tran

Information Retrieval and Text Mining Opportunities in Bioinformatics
Dr. N. JEYAKUMAR, M.Sc., Ph.D.
Dept. of Bioinformatics, Bharathiar University, Coimbatore - 641046

Purpose & Targeted Audience
- Purpose: a broad overview of information retrieval and text mining and their application to bioinformatics
  - An attempt at a definition
  - A brief history of use in the bioinformatics literature
  - Outline of key applications, papers & emerging areas
- Audience: people with a good background in
  - Biology
  - Computer science
  - Neither of the two disciplines

Outline
- Introduction to IR and TM
- Biomedical Literature Resources
- Two basic tasks: Bio-Entity and Entity Relation Identification
- Knowledge Discovery with text
- Text data integration
- Outlook

Information Retrieval and Text Mining: Biology – why?
- Rich sources of text in the form of
  - Abstracts
  - Full text
  - Patients' records
  - Annotations in data sources (sequence and structure databases)
- For example, the abstract database Medline
  - contains 18 million records (abstracts)
  - grows by 50,000 records every month
- Novel biomedical information is hidden across the text
  - such as protein interactions, protein localization, gene annotations, molecular pathways, etc.

Information Extraction: Sample PubMed Record
TI - Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein
AB - Originally identified as a ‘mitotic cyclin’, cyclin A exhibits properties of growth factor sensitivity, susceptibility to viral subversion and association with a tumor-suppressor protein, properties which are indicative of an S-phase-promoting factor (SPF) as well as a candidate proto-oncogene.
Other recent studies have identified human cyclin D1 (PRAD1) as a putative G1 cyclin and candidate proto-oncogene.
However, the specific enzymatic activities and, hence, the precise biochemical mechanisms through which cyclins function to govern cell cycle progression remain unresolved.
In the present study we have investigated the coordinate interactions between these two potentially oncogenic cyclins, cyclin-dependent protein kinase subunits (cdks) and the Rb tumor-suppressor protein.
The distribution of cyclin D isoforms was modulated by serum factors in primary fetal rat lung epithelial cells.
Moreover, cyclin D1 was found to be phosphorylated on tyrosine residues in vivo and, like cyclin A, was readily phosphorylated by pp60c-src in vitro.
In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.
Immunoprecipitation experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase activity.
Immobilized, recombinant cyclins A and D1 were found to associate with cellular proteins in complexes that contain the p105Rb protein.

Information Extraction: Sample PubMed Record with Named Entities
[Figure: the same PubMed record as on the previous slide, with the named entities (e.g. cyclin A, cyclin D1, Rb protein) highlighted]

Text Mining: Genetic Basics
- Gene/protein (concept) – associate/interact (conceptual relation) – gene/protein (concept) – pathway (biological process)
  - (e.g.) STAT3 – interact – BCL-X – apoptosis (cell death)
- Gene/protein (concept) – symptom (function) – disease (concept)
  - (e.g.) p53 – tumor suppressor – cancer
  - (e.g.) TNFRSF1B – insulin resistance – diabetes
- So, the main goal of any text mining/information extraction system in the biomedical domain is to identify the bio-entities and their relationships

Part I: Information Retrieval and Text Mining

Information Retrieval: Introduction and overview
- Information retrieval (IR) is the science of searching for documents, for information within documents and for metadata about documents, as well as that of searching the World Wide Web.
  - (e.g.) Google, Google Scholar, PubMed, PubMed Central
- Component tasks
  - Document indexing
  - Query types: Boolean queries, bag of words/vector space model
- Related tasks
  - Sentence tokenization/word tokenization
  - Stemming
  - Stop word removal
  - Text classification
  - Text clustering

Information Retrieval: Information Retrieval - Example
[Figure: Input Query → IR System → Related Documents]

Information Retrieval: IR stages of processing – Lexical Analysis
- Sentence tokenization
  - separates text into individual sentences.
- Word tokenization
  - breaks pieces of text into word-sized chunks; in biology this is a difficult task, as the definition of what a word is can be quite complex and is further complicated by heavy use of punctuation (e.g., ERD-1/2, endothelin-1).
- Stemming
  - is a process that determines the stem of a word; a word stem is the main part and excludes elements that are used to indicate plurality, tense, case, gender, person, etc.
  - (e.g.) activate is the stem of the words activation, activated, activates, and activating.
  - Porter stemmer: many implementations are available on the net
- Stop word removal
  - removes the most common words that are unlikely to help text mining, such as prepositions, articles, and pronouns
  - (e.g.) "the", "a", "an", "with", "you"
  - many stop word lists are available on the net
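The lexical analysis steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production tokenizer: the stop word list is a tiny hand-picked sample, the regular expressions are only one possible definition of a "word", and `crude_stem` is a toy suffix stripper, not the Porter algorithm. Note that such stemmers usually return a truncated stem (here "activ") rather than the dictionary word "activate".

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "with", "you", "of", "and", "is", "in", "to"}

# Suffixes removed by the toy stemmer, longest first. This is NOT the Porter algorithm.
SUFFIXES = ("ation", "ating", "ates", "ated", "ate", "ion", "ing", "es", "ed", "s")


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and an upper-case letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Keep runs of letters/digits, allowing inner hyphens and slashes (e.g. ERD-1/2)."""
    return re.findall(r"[A-Za-z0-9]+(?:[-/][A-Za-z0-9]+)*", sentence)


def remove_stop_words(tokens):
    """Drop tokens whose lower-cased form is in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


text = "ERD-1/2 activates endothelin-1. The activation is activated with a drug."
for sentence in split_sentences(text):
    tokens = remove_stop_words(tokenize(sentence))
    print([crude_stem(t) for t in tokens])
```

All five forms from the slide (activate, activation, activated, activates, activating) map to the same stem with this sketch, which is the property a retrieval system relies on.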

Information Retrieval: IR stages of processing – Query Types
Boolean queries
- Based on a combination of terms using Boolean operators
- Basic Boolean operators: AND, OR, NOT
- Queries are matched against the terms in the inverted index file
- Fast and easy to implement, but retrieves many irrelevant documents

Information Retrieval: Boolean Queries
- DB: Database of documents.
- Vocabulary: $\{t_1, \ldots, t_M\}$ (terms in DB, produced by the tokenization stage)
- Index structure: a term → all the documents containing it
[Figure: inverted index, with "blood" and "pressure" as example terms]
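A minimal sketch of an inverted index with Boolean queries is shown below. The three documents and the helper names (`build_inverted_index`, `boolean_and`, `boolean_or`, `boolean_not`) are illustrative assumptions; tokenization is reduced to lower-casing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Documents containing every term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


def boolean_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())


# Hypothetical mini database.
docs = {
    1: "high blood pressure in adults",
    2: "blood glucose and insulin",
    3: "atmospheric pressure measurement",
}
index = build_inverted_index(docs)

print(boolean_and(index, "blood", "pressure"))                           # blood AND pressure
print(boolean_or(index, "blood", "pressure"))                            # blood OR pressure
print(boolean_and(index, "blood") & boolean_not(index, docs, "pressure"))  # blood AND NOT pressure
```

The third query also shows why Boolean retrieval returns irrelevant hits so easily: document 3 matches "pressure" although it has nothing to do with blood pressure.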

Information Retrieval: IR stages of processing – Query Types
Bag of words/vector space model
- A text document is represented by the words it contains (and their occurrences)
  - (e.g.) "Lord of the rings" → {"the", "Lord", "rings", "of"}
- Highly efficient
- Makes learning far simpler and easier
- The order of words is not that important for certain applications
- Each sentence is represented as a vector of word frequencies
- Relations between the sentences are identified by the cosine of the angle between their vectors

Information Retrieval: Vector space model
Documents a, b, and x:
- A: Gene BRCA1 and BRCA2 participate in repairing radiation-induced breaks in DNA ... and other genes.
- B: Cancer genes BRCA1 on chromosome 17 and BRCA2 on chromosome 13 might disable mechanisms ... gene and drug. But BRCA1 and BRCA2 are also implicated ...
- X: Gene therapy using novel drug to treat breast and ovarian cancer ... of BRCA1.

Vector space representation of a, b, and x:

        Gene  BRCA1  BRCA2  Cancer  drug
V(a)    2     1      1      0       0
V(b)    2     2      2      1       1
V(x)    1     1      0      1       1

Figure 1: Vector space representation: (a) Coding of texts as weighted vectors; each entry represents the weight of the corresponding term in the vector representing a document. (b) Illustration of the cosine coefficient similarity θ1 and θ2 of query vector V(x) with the two vectors V(a) and V(b) in vector space. Notice that V(x) is closer to V(b) than to V(a).

Information Retrieval: Vector space model
- DB: Database of documents.
- Vocabulary: $\{v_1, \ldots, v_M\}$ (terms in DB)
- Document $d \in DB$: vector $\langle w_1^d, \ldots, w_M^d \rangle$ of weights.
Weighting principles
- Document frequency: terms occurring in a few documents are more useful than terms occurring in many.
- Local term frequency: terms occurring frequently within a document are likely to be significant for the document.
- Document length: a term occurring the same number of times in a long document and in a short one has less significance in the long one.
- Relevance: terms occurring in documents judged as relevant to a query are likely to be significant (with respect to the query).

Information Retrieval: Vector space model
Some weighting schemes:
- Binary: $W_i^d = 1$ if $t_i \in d$, $0$ otherwise
- TF: $W_i^d = f_i^d$ = number of times $t_i$ occurs in $d$ (considers local term frequency)
- TF x IDF (one version): $W_i^d = f_i^d / f_i$, where $f_i$ = number of docs containing $t_i$ (considers local term frequency and document frequency)
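The three weighting schemes can be sketched as follows. The two token lists and the function names are illustrative assumptions; the "tfidf" branch implements only the one version given on the slide (term frequency divided by the number of documents containing the term), not the more common logarithmic variants.

```python
from collections import Counter


def document_frequency(docs, term):
    """Number of documents in which the term occurs at least once."""
    return sum(1 for tokens in docs.values() if term in tokens)


def weight_vector(tokens, docs, vocabulary, scheme):
    """Return one weight per vocabulary term for a single document."""
    counts = Counter(tokens)
    vector = []
    for term in vocabulary:
        tf = counts[term]  # local term frequency
        if scheme == "binary":
            weight = 1 if tf > 0 else 0
        elif scheme == "tf":
            weight = tf
        elif scheme == "tfidf":
            df = document_frequency(docs, term)
            weight = tf / df if df else 0.0
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vector.append(weight)
    return vector


# Hypothetical, already tokenized documents.
docs = {
    "d1": ["gene", "brca1", "gene", "repair"],
    "d2": ["gene", "drug", "cancer"],
}
vocabulary = ["gene", "brca1", "drug"]

for scheme in ("binary", "tf", "tfidf"):
    print(scheme, weight_vector(docs["d1"], docs, vocabulary, scheme))
```

In this toy database "gene" occurs in both documents, so its weight in d1 drops from 2 (TF) to 1.0 (TF x IDF), while the rarer "brca1" keeps its full weight.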

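Information Retrieval: Vector space model
- Document $d = \langle w_1^d, \ldots, w_M^d \rangle \in DB$
- Query $q = \langle w_1^q, \ldots, w_M^q \rangle$ (q could itself be a document in DB.)
- $Sim(q, d) = \cos(q, d) = \dfrac{q \cdot d}{\sqrt{q \cdot q}\,\sqrt{d \cdot d}}$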
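As a worked check, the cosine similarity can be computed for the term-count vectors V(a), V(b) and V(x) from Figure 1 (term order: Gene, BRCA1, BRCA2, Cancer, drug). The function name `cosine` is an illustrative choice.

```python
import math


def cosine(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_q * norm_d)


# Term counts taken from Figure 1.
V_a = [2, 1, 1, 0, 0]
V_b = [2, 2, 2, 1, 1]
V_x = [1, 1, 0, 1, 1]

sim_xa = cosine(V_x, V_a)  # 3 / (2 * sqrt(6))  ≈ 0.612
sim_xb = cosine(V_x, V_b)  # 6 / (2 * sqrt(14)) ≈ 0.802
print(round(sim_xa, 3), round(sim_xb, 3))
```

The larger value for V(b) agrees with the figure caption: the query document x is closer to b than to a.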

Information Retrieval: IR Evaluation
- Precision: number of relevant documents retrieved divided by the total number of returned documents
- Recall: number of relevant documents returned divided by the total number of relevant documents
- F-score: the harmonic mean of precision and recall
- Precision-recall curves

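Information Retrieval: IR Evaluation
- $\text{precision} = \dfrac{TP}{TP + FP}$
- $\text{recall} = \dfrac{TP}{TP + FN}$
- $\text{F-measure} = \dfrac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$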
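A small sketch of these measures, computed from a set of retrieved document ids and a set of relevant document ids. The ids and the function name `evaluate` are illustrative assumptions.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant and returned
    fp = len(retrieved - relevant)   # returned but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical query result: 4 documents returned, 3 documents are truly relevant.
precision, recall, f_measure = evaluate(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(precision, recall, f_measure)
```

Here TP = 2, FP = 2 and FN = 1, so precision is 1/2, recall is 2/3 and the F-measure is 4/7.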

Text Clustering
- Find which documents have many words in common, and place the documents with the most words in common into the same groups.
- Similarity of documents instead of similarity of sequences, expression profiles or structures
- Cluster documents into topics, for instance: clinical, biochemical and microbiology articles
- A clustering program tries to find the groups in the data.

Text Clustering: Idea
- $D = \{D_1, \ldots, D_n\}$: the set of documents
- $D_j \subseteq T$, the set of all terms
- Frequent terms carry more information about the "cluster" they might belong to
- Highly correlated frequent terms probably belong to the same cluster
- Candidate clusters are then generated from $F = \{F_1, \ldots, F_k\}$, where each $F_i$ is a set of frequent terms which occur together.
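The idea of generating candidate clusters from frequent term sets can be sketched as below. This is a brute-force illustration under stated assumptions: documents are given as sets of terms, `min_support` (the minimum number of documents a term set must cover) and `max_size` are hypothetical parameters, and no pruning or final cluster selection step is included.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Return {frozenset(terms): ids of documents containing all of those terms}."""
    vocabulary = sorted(set().union(*docs.values()))
    candidates = {}
    for size in range(1, max_size + 1):
        for terms in combinations(vocabulary, size):
            cover = {doc_id for doc_id, words in docs.items() if set(terms) <= words}
            if len(cover) >= min_support:
                candidates[frozenset(terms)] = cover
    return candidates


# Hypothetical documents, each represented as a set of terms (D_j is a subset of T).
docs = {
    "D1": {"protein", "kinase", "cancer"},
    "D2": {"protein", "kinase"},
    "D3": {"cancer", "therapy"},
    "D4": {"cancer", "therapy", "protein"},
}

clusters = frequent_term_sets(docs, min_support=2)
for terms, cover in clusters.items():
    print(sorted(terms), sorted(cover))
```

Each frequent term set acts as a cluster description, and the documents it covers form the candidate cluster; overlapping candidates would still have to be resolved by a selection step that is not shown here.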

Text Mining: Text Clustering
[Figure: a text clustering system grouping a collection of documents into clusters]

Text Clustering: Techniques used
- Partitioning
- Hierarchical
  - Agglomerative
  - Divisive
- Grid based
- Model based

Text Classification: The problem statement
- Given a set of documents, each with a label called the class label for that document
- Given a classifier which learns from the above data set
- For a new, unseen document, the classifier should be able to "predict" with a high degree of accuracy the correct class to which the new document belongs

Text Classification
- Common problem in information science.
- Assignment of an electronic document to one or more categories, based on its contents (words).
- Supervised document classification: training examples of document classification are provided and the correct classification model is learnt based on one of the following techniques:
  - naive Bayes classifier
  - tf-idf
  - latent semantic indexing
  - support vector machines
  - artificial neural network
  - kNN
  - decision trees, such as ID3
- Classification techniques have been applied to spam filtering

Text Classification - Example
(e.g.) Spam mail filtering
[Figure: New Mail → Text Mining System → Spam Mail / Good Mail]
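Spam filtering with one of the listed techniques, a naive Bayes classifier, might look like the sketch below. It is a minimal multinomial naive Bayes with add-one smoothing written from scratch; the class name, the four training mails and the whitespace tokenization are illustrative assumptions, and a real filter would need far more data and preprocessing.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over whitespace-separated, lower-cased words."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)        # number of documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior + sum of log likelihoods with add-one (Laplace) smoothing
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled training mails.
mails = ["win money now", "cheap money offer", "meeting agenda attached", "project meeting tomorrow"]
labels = ["spam", "spam", "good", "good"]

classifier = NaiveBayesTextClassifier().fit(mails, labels)
print(classifier.predict("cheap money"))      # expected: spam
print(classifier.predict("project meeting"))  # expected: good
```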

Text Mining: Introduction and overview
- Text mining aims to identify non-trivial, implicit, previously unknown, and potentially useful patterns in text (e.g. classification systems, summarization, association rules, hypotheses, etc.)
- Includes more established research areas such as
  - Information Retrieval (IR)
  - Natural Language Processing (NLP)
  - Information Extraction (IE)
  - traditional Data Mining (DM)
- Related tasks
  - Text summarization
  - Question answering

IR and Text Mining: The Big Picture
[Figure: Unstructured text (implicit knowledge) → Structured content (explicit knowledge)]

Text Mining: Text Mining – Simple Example
Automatically curating literature information
[Figure: Publication → Manual Curator → List of MeSH keywords; Publication → Text Mining System → List of MeSH keywords]

Text Mining: Pattern or Knowledge Discovery - Example
Hypothesis generation
- (e.g.1) Ram and Ravi are friends
- (e.g.2) Ram and Rajiv are friends
- → Ravi and Rajiv may be friends or known to each other
- (e.g.1) gene A regulates gene B
- (e.g.2) gene B induces gene C
- → genes A, B and C are in the same pathway
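This kind of hypothesis generation can be sketched as a search for two entities that share a partner but are never mentioned together. The function name and the pair-based fact format are illustrative assumptions; the output is a list of candidate hypotheses to be checked, not established facts.

```python
from collections import defaultdict
from itertools import combinations


def generate_hypotheses(facts):
    """facts: (a, b) pairs stated in text. Return (x, y, shared) for unstated pairs x, y linked via `shared`."""
    facts = list(facts)
    neighbours = defaultdict(set)
    for a, b in facts:
        neighbours[a].add(b)
        neighbours[b].add(a)
    known = {frozenset(pair) for pair in facts}
    hypotheses = set()
    for shared, partners in neighbours.items():
        for x, y in combinations(sorted(partners), 2):
            if frozenset((x, y)) not in known:
                hypotheses.add((x, y, shared))
    return hypotheses


friend_facts = [("Ram", "Ravi"), ("Ram", "Rajiv")]
gene_facts = [("gene A", "gene B"), ("gene B", "gene C")]

print(generate_hypotheses(friend_facts))  # Rajiv and Ravi, linked via Ram
print(generate_hypotheses(gene_facts))    # gene A and gene C, linked via gene B
```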

Text Mining: Related Fields
- Information retrieval aims to identify relevant documents in response to a query (e.g. Google search, PubMed search, etc.)
- Natural language processing, also called computational linguistics, attempts to use automated means to process text and deduce its syntactic and semantic structure
- Information extraction aims to automatically identify specific predefined classes of entities (e.g. protein and gene names), relations (e.g. protein interactions) or known facts (e.g. cell localization) in natural language text

Text Mining: Natural Language Processing and Component Tasks
- Syntactic and semantic relations of text
- Gives the sentence structure and how words form the sentence
  - (e.g.) noun, verb, adverb, pronoun, prepositions, etc., and the complete sentence structure
- Component tasks
  - Part-of-speech (POS) tagging
  - Shallow parsing
  - Full parsing

Text Mining: NLP stages of processing
- Part-of-speech tagging
  - involves the assignment of part-of-speech information or labels such as word categories (e.g., adjective, article, noun, proper noun, preposition, verb) and other lexical class markers to individual tokens in a text corpus.
  - e.g., John (noun) gave (verb) the (det) ball (noun)
- Shallow parsing
  - refers to a class of techniques concerned with the identification of phrasal chunks (noun, noun phrase, verb, verb phrase) in each sentence of a corpus without assignment of 'deep' hierarchical structures (graph).
- Full parsing
  - is concerned with the construction of a complete parse tree (deep hierarchical structures) for a sentence in a corpus

Text Mining: NLP - POS tagging
- Part-of-speech (POS) tagging involves the assignment of part-of-speech information or labels such as word categories (e.g., adjective, article, noun, proper noun, preposition, verb)

<sentence> BRCA1 physically associates with p53 and stimulates its transcriptional activity. </sentence>
<POS Sentence> BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC stimulates/VBZ its/PRP transcriptional/JJ activity/NN </POS Sentence>

Text Mining: NLP - Full Parser
- Full parsing: complete understanding of sentence structure

Text Mining: Information Extraction and Component Tasks
- Find concepts
  - (e.g.) genes, protein names, relations, cross relations
- Pronoun concepts
- Concept relations, scenario relations
- Component tasks
  - Named entity recognition (NER)
  - Co-reference resolution
  - Template element extraction
  - Template relation extraction
  - Scenario template extraction

Text Mining: IE – Named Entity Tagging
- Named entity tagging in text (identifying concepts such as protein/gene names, etc.)

<sentence> It has been shown that genistein induces phosphorylation of ATM on serine 1981 and phosphorylation of histone H2AX on serine 13 in B cells. </sentence>
<Tagged Sentence> It has been shown that <smallmol>genistein</smallmol> induces phosphorylation of <protein>ATM</protein> on <enzyme>serine 1981</enzyme> and phosphorylation of <protein>histone H2AX</protein> on <enzyme>serine 13</enzyme> in <celltype>B cells</celltype>. </Tagged Sentence>

Text Mining: IE – Template Relation Extraction
- Template relation extraction (identifying relations between the concepts, such as protein-protein interactions, etc.)

<sentence> It has been shown that genistein induces phosphorylation of ATM on serine 1981 and phosphorylation of histone H2AX on serine 13 in B cells. </sentence>
<protein id=p1> ATM </protein>
<protein id=p2> histone H2AX </protein>
<smallmol id=s1> genistein </smallmol>
<relation id=r1 type='induce' node1=s1 node2=p1>
<relation id=r2 type='induce' node1=s1 node2=p2>
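A toy, rule-based sketch of how the two 'induce' relations could be produced from the example sentence is given below. Everything here is an illustrative assumption: the three-entry lexicon stands in for a real named entity recognizer, and the single rule (a small molecule before the trigger verb acts on every protein after it) is far cruder than the approaches used in practice.

```python
import re

# Hypothetical mini lexicon standing in for a named entity recognizer.
LEXICON = {"genistein": "smallmol", "ATM": "protein", "histone H2AX": "protein"}
# Trigger verb form → relation type.
TRIGGERS = {"induces": "induce"}


def tag_entities(sentence):
    """Return (position, name, type) for every lexicon entry found, ordered by position."""
    mentions = []
    for name, entity_type in LEXICON.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", sentence):
            mentions.append((match.start(), name, entity_type))
    return sorted(mentions)


def extract_relations(sentence):
    """Rule: a small molecule before the trigger is related to each protein after it."""
    mentions = tag_entities(sentence)
    relations = []
    for trigger, relation_type in TRIGGERS.items():
        match = re.search(r"\b" + trigger + r"\b", sentence)
        if not match:
            continue
        agents = [n for pos, n, t in mentions if t == "smallmol" and pos < match.start()]
        targets = [n for pos, n, t in mentions if t == "protein" and pos > match.end()]
        relations.extend((relation_type, agent, target) for agent in agents for target in targets)
    return relations


sentence = ("It has been shown that genistein induces phosphorylation of ATM on serine 1981 "
            "and phosphorylation of histone H2AX on serine 13 in B cells.")
relations = extract_relations(sentence)
print(relations)
```

The two tuples correspond to relations r1 and r2 on the slide; a sentence with negation or several agents would already break this rule, which is one motivation for the parsing-based and ontology-driven methods.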

Text Mining: IE – Methodology
- Rule based approaches
- Context-free grammar approaches
- Full parsing approaches
- Sublanguage driven IE
- Ontology-driven IE

Text Mining: Text Mining from Related Fields
- Data collection (gathering documents related to a specific problem) (IR)
- Data pre-processing (tokenization, normalization, parsing, stemming, stop word removal, etc.) (NLP/IR)
- Finding entities (named objects like proteins, genes, etc.) (IE)
- Finding facts (relationships among entities) (IE)
- Mining (more complex relationships among entities and concept-to-concept relationships) (TM)
  - (e.g.1) gene A regulates gene B
  - (e.g.2) gene B induces gene C
  - → genes A, B and C are in the same pathway

Text Mining: Text mining stages of processing

Text Mining: Text mining stages of processing
- Text preprocessing
  - Stemming, stop word removal
  - Syntactic/semantic text analysis
- Features generation
  - Bag of words
- Features selection
  - Simple counting
  - Statistics
- Text/data mining
  - Classification (supervised learning)
  - Clustering (unsupervised learning)
- Post-processing
  - Analyzing results
  - Evaluation

Text Mining: Resources Example

Part II: Text Mining and Biomedical Literature

Text Mining: Biology – why?
- Rich sources of text in the form of
  - Abstracts
  - Full text
  - Patients' records
  - Annotations in data sources (sequence and structure databases)
- For example, the abstract database Medline
  - contains 18 million records (abstracts)
  - grows by 50,000 records every month
- Novel biomedical information is hidden across the text
  - such as protein interactions, protein localization, gene annotations, molecular pathways, etc.

Text Mining: Why Text About Biology is Special
- Large number of entities/concepts (genes, proteins, etc.)
- Evolving field with no widely followed standards for terminology: rapid change and inconsistency
- Ambiguity (many proteins and genes have the same name)
- Synonymy (many proteins and genes have many names)
- Abbreviations (heavy use of abbreviations in text)

Text Mining: What are the concepts/relations of interest?
- Genes (T-Gene)
- Proteins (P53)
- Compounds
- Biological functions (lipid metabolism)
- Biological processes (cell death, apoptosis)
- Pathways (cell metabolism, urea cycle)
- Diseases (cancer, Alzheimer's, etc.)

Text Mining: Curation of Biological Literature
- Classical method: manual curation
  - Trained human experts read the scientific literature and extract information of interest
  - Manual, time-consuming and labor-intensive process
  - Accurate through human inference and background knowledge
  - (E.g.) MeSH, UniProt, GOA, SGD, MGI, etc.
- Text mining assisted curation
  - Retrieval of relevant literature from literature repositories
  - Textual evidence and entity detection
  - Revision and editing of manual records
  - E.g. Textpresso, Rodriguez-Penagos et al. (gene regulation), Grover et al. (PPI), Chang et al. (pathways), Ongenaert et al. (methylation)

Text Mining: Curation of Literature in Biology – Pictorial summary

Text Mining: Current Literature Repositories
- e-Books: NCBI Bookshelf
- Citations of biomedical research articles (abstracts): PubMed (http://www.ncbi.nlm.nih.gov/pubmed)
- Full text research articles:
  - PubMed Central (PMC)
  - HighWire Press
  - BioMed Central
  - Google Scholar

Text Mining: PubMed Overview
- Developed by NCBI
- Citation entries of scientific articles from all biomedical sciences
- Each entry is characterized by a unique identifier, the PubMed identifier: PMID
- Links to the full text articles are often displayed
Statistics
- No. of citations: 16 million
- No. of indexed journals: approx. 5000
- No. of English articles: 12 million
- No. of articles with abstracts: 7,000,000

Text Mining: PubMed
- Approximately 1 million entries refer to gene descriptions
- Author, journal and title information of the publication
- Some records contain gene symbols and molecular sequence databank numbers
- Indexed with Medical Subject Headings (MeSH)
- Accessed online through a text-based search query system called Entrez
- Offers additional programming utilities, the Entrez Programming Utilities (eUtils)
- The majority (approx. 80%) of current biomedical text mining is based on PubMed
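As a rough sketch of how eUtils is typically addressed, the snippet below only builds request URLs for the ESearch and EFetch utilities and performs no network access. The helper names, the query string and the PMIDs are hypothetical; the parameter names used (`db`, `term`, `retmax`, `retmode`, `id`, `rettype`) are those of the public E-utilities interface, and NCBI's current usage guidelines should be checked before sending real requests.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """URL for an ESearch request that returns PMIDs matching a query."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """URL for an EFetch request that returns the abstracts of the given PMIDs as text."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Hypothetical query and placeholder PMIDs, for illustration only.
search_url = esearch_url("BRCA1 AND breast cancer", retmax=5)
fetch_url = efetch_url([123, 456])
print(search_url)
print(fetch_url)
```

A text mining pipeline would normally fetch the search URL first, read the returned PMID list, and then request the corresponding abstracts in batches.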

Text Mining: PubMed – web page

Text Mining: PubMed Central
- Digital archive of full text life science journals
- Articles have a unique PMCID
- Allows Boolean query search
- Offers free full text articles
- Uses the Journal Publishing XML DTD, but also other DTDs widely used in life science

Text Mining: PubMed Central – web page

Text Mining: NCBI Bookshelf
- Collection of biomedical textbooks
- Allows Boolean query searches
- Offers free full text
- The books can be searched directly or from PubMed abstracts

Text Mining: Google Scholar
- Google Scholar is a freely accessible Web search engine that indexes the full text of scholarly literature across an array of publishing formats and disciplines. Released in beta in November 2004
- Serves as one full-text biomedical resource for text mining

Text Mining: Other Biomedical Corpora
- BioCreative corpus
- GENIA corpus
- Yapex corpus

Text Mining: GENIA Corpus
http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/

Text Mining: Application Areas in Biology
Helps to address the following problems:
- Finding biological named entities (e.g. protein, gene, chemical names, etc.) in the context of a particular study
- Finding molecule interactions (e.g. protein-protein interactions, protein-gene interactions, etc.)
- Finding relations between bio-concepts (e.g. relations between genes and diseases, diseases and drugs)
- Finding biochemical pathways
- Finding sub-cellular localization information of proteins
- Constructing biological vocabulary/ontology from text
- Automatically curating biological databases
- Assisting the gene expression data mining process
- Knowledge-based information retrieval in the context of biological repositories (e.g. MEDLINE, etc.)

Text Mining: Sample Data Processing – Biomedical Text

Text Mining: Biomedical Text Mining Systems - Examples
- iHOP: http://www.ihop-net.org/UniPub/iHOP/
  - Gene centric search engine
- EBIMed
  - Concept based search linked to UniProt
- GoPubMed: http://www.gopubmed.org/
  - Clusters documents based on Gene/MeSH Ontology
- BioMinT: http://biomint.pharmadm.com/
  - An easy to use information retrieval and extraction tool
- Textpresso: http://www.textpresso.org/
  - Text categorization genome search engine

References
- Shatkay H., "Hairpins in bookstacks: Information retrieval from biomedical text", Briefings in Bioinformatics, Vol. 6(3), 222-238 (2005).
- Natarajan J., Berrar D., Hack C.J., Dubitzky W., "Knowledge discovery in biology and biotechnology texts: A review of techniques, evaluation strategies, and applications", Critical Reviews in Biotechnology, Vol. 25, 31-52 (2005).
- Krallinger M., Valencia A., "Text-Mining and Information Retrieval Services for Molecular Biology", Genome Biology, Vol. 6, 224 (2005).

Acknowledgement
- Prof. Werner Dubitzky, University of Ulster
- Dr. Daniel Berrar, University of Ulster
- Martin Krallinger and Ashish V Tendulkar, APBIO Text Mining Tools in Biology
- Dr. Hagit Shatkay, http://www.shatkay.org/

Thank You
Contact: N. JEYAKUMAR, n.jeyakumar@yahoo.co.in
