Biological Sequence Matching Using Fuzzy Logic

2y ago
30 Views
2 Downloads
369.69 KB
5 Pages
Last View : 1m ago
Last Download : 3m ago
Upload by : Harley Spears
Transcription

International Journal of Scientific & Engineering Research Volume 2, Issue 7, July-2011ISSN 2229-55181Biological Sequence Matching Using Fuzzy LogicNivit Gill, Shailendra SinghAbstract: Sequence alignment is the most basic and essential module of computational bio-informatics. In this paper, we propose a multiplesequence alignment algorithm that employs fuzzy logic to measure the similarity of sequences based on fuzzy parameters. To guarantee theoptimal alignment of the sequences, dynamic programming is used to align the sequences. The algorithm is tested on few sets of realbiological sequences taken from NCBI bank and its performance is evaluated using SinicView tool.Index Terms: Biological sequences, Dynamic programming, Fuzzy logic, Fuzzy matching score, Fuzzy parameters, Global alignment,Multiple sequence ��——————1 INTRODUCTIONGenetic code of every organism can be represented as asequence of alphabets, such as four base pairs of DNAand RNA, or twenty amino acids of protein. Over time,these biological sequences undergo changes, calledmutations, and as a result, many organisms evolve. Allliving organism cell are composed of DNA molecules thatare passed from one generation to other. This is the reasonfor some living organisms being biologically similar andsome being distinct. The goal of bioinformatics is to align alarge number of sequences in order to study theirevolutionary relationships through comparative sequenceanalysis.With the help of bio-informatics, computations are appliedto the biological sequences in order to analyze andmanipulate them. The prime objective behind this is todiscover and record the role of genetics in an organism’sbiological characteristics. Sequence alignment is the mostbasic and essential part of computational bio-informaticsand provides a base for other tasks of bioinformatics, suchas sequence assembly, sequence annotation, structural andfunctionalprediction,evolutionary orphylogenyrelationship analysis. A sequence alignment refers to themethod of arranging biological sequences in order to searchsimilar regions in the sequences. The sequences with highdegree of similarity have similar structure and function, andsuch sequences help in deriving evolutionary orphylogenetic relationships among organisms.In this paper, we propose to align multiple biologicalsequences by matching the sequences using fuzzy logic. Inthe proposed method, the given biological sequences arecompared pair wise so as to determine the number ofmatches, and mismatches between them. Then these countsare fuzzified using fuzzy membership functions, and thenfuzzified counts are put in an aggregate fuzzy function inorder to find the fuzzy match value of the two sequences.The fuzzy match value is further used to order �sequences according to the similarity. The most similar pairis aligned first and the rest of the sequences are then alignedprogressively, to this aligned pair.The outline of this paper is: Section 2 discusses the basicsof sequence alignment and its types and Section 3 reviewsthe related work. The fuzzy logic concepts and its usage inthe proposed algorithm are provided in Section 4. Section 5details the classical Needleman-Wunsch algorithm. Theproposed algorithm is described in Section 6. Experimentalresults and their discussions are presented in Section 7 andfinally Section 8 concludes the paper.2 SEQUENCE ALIGNMENT CONCEPTSAny biological sequence is formed from a sequence ofcharacters drawn from an alphabet. For DNA sequence, thecharacter alphabet is {A, C, G, T}, for RNA sequence, thealphabet is {A, C, G, U}, and for protein sequence, thecharacter set is {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S,T, W, Y, V}. A sequence alignment is the process ofidentifying one-to-one correspondence among sub-units ofsequences in order to measure the similarities among them.These similar regions provide functional, structural, andevolutionary information about the sequences under study.Aligned sequences are generally represented as rows withina matrix. Gaps (‘-‘) are inserted between the characters sothat identical or similar characters are aligned in successivecolumns. Gaps are also called indels, as they representinsertion of a character in or a deletion of a character from abiological sequence. Sequence alignment of two biologicalsequences is called pair-wise sequence alignment, and incase more than two biological sequences are involved, it iscalled multiple sequence alignment [12]. The sequencealignment can be global or local. Global alignment "forces"the alignment to span the entire length of all querysequences (Figure 1). Local alignments identify regions ofsimilarity within long sequences that are often widelydivergent overall (Figure 2).Nivit Gill is currently pursuing masters of engineering in computerscience and engineering in Punjab Engineering College- University ofTechnology Chandigarh, India. E-mail: nivitgill@gmail.comShailendra Singh is an Assistant Professor in the department ofcomputer science and engineering, Punjab Engineering College- IJSER : http://www.ijser.orgshailendrasingh@pec.ac.in

International Journal of Scientific & Engineering Research Volume 2, Issue 7, July-2011ISSN 2229-5518Fig. 1 Global Alignment of two biological sequencesFig. 2 Local Alignment of two biological sequencesScoring Matrices are used to quantify the similarityachieved by an alignment [12]. These matrices contain avalue (positive, zero or negative value) for each possiblesubstitution, and the alignment score is the sum of thematrix's entries for each aligned pair. For gaps (indels), aspecial gap score is used--a very simple one is just to add aconstant penalty score for each indel. The optimalalignment is the one which maximizes the alignment score.Commonly used matrices are PAM (Percent AcceptedMutations) matrices, BLOSUM (BLOck SUbstitutionMatrix), etc. The most popular algorithms for local andglobal alignment are based on dynamic programming:Needleman-Wunsch algorithm for global alignment, andSmith-Waterman algorithm for local alignment.3 LITERATURE REVIEWAs the sizes of biological sequence databases growexponentially, the need for fast and efficient sequencealignment algorithms is ever-increasing. Most of theresearch work has been intended on primarily providingnew algorithms with the main requisite of the meeting thedemands of efficient sequence alignment. Researchers haveused all the latest techniques with the aim of providing fastand efficient alignment algorithms.Needleman and Wunsch proposed a dynamic programmingalgorithm for performing a global alignment of twosequences [1]. Smith and Waterman proposed an algorithmto find a pair of segments one from each of two longsequences such that there is no other pair of segmentswith greater similarity (homology) [2]. In this localalignment algorithm, similarity measure allowed arbitrarylength deletions and insertions. A new algorithm for localalignment of DNA sequences had been proposed by Dasand Dey [4]. Naznin, Sarker and Essam designed aniterative progressive alignment method for multiplesequence alignment by using new techniques for bothgenerating guide trees for randomly selected sequences aswell as for rearranging the sequences in the guide trees [10].Paul and Konar proposed direct comparison methods toobtainglobalandlocal alignment betweenthetwo sequences [5]. They also proposed an alternate scoringscheme based on fuzzy concept. Cai, Juedes, andLiakhovitch proposed to combine existing efficientalgorithms for near optimal global and local multiplesequence alignment with evolutionary computationtechniques to search for better near optimal sequence2alignments [3]. Y. Chen et.al proposed a partitioningapproach, based on ant-colony optimization algorithm thatsignificantly improved the solution time and quality byutilizing the locality structure of the problem [6]. Yue andTang applied the divide-and-conquer strategy to align threesequences so as to reduce the memory usage from O (n3) toO(n2). They used dynamic programming so as to guaranteeoptimal alignment [9]. Nasser et al. provided a hybridapproach of dynamic programming and fuzzy logic to alignmultiple sequences progressively [8]. They computedoptimal alignment of subsequences based on several factorssuch as quality of bases, length of overlap, gap penalty.Chang et al. established fuzzy PAM matrix using fuzzy logicand then estimated score for fitness function of geneticalgorithm using fuzzy arithmetic [7]. Their experimentalresults evidenced fuzzy logic useful in dealing with theuncertainties problem, and applied to protein sequencealignment successfully.In all this work, the main objective of the researchers hadbeen to apply different techniques in order to provideefficient alignment algorithms in terms of time and memoryrequirements.4 FUZZY APPROACHFuzzy logic is based on the fuzzy-set theory proposed byL.A. Zadeh in 1965. It is a form of many-valued logic whichdeals with reasoning that is approximate rather than fixedand exact. In contrast with "crisp logic", where binary setshave two-valued logic: true or false, fuzzy logic variablesmay have a truth value that ranges in degree between 0 and1 [16]. Fuzzy logic has been extended to handle the conceptof partial truth, where the truth value may range betweencompletely true and completely false.In a fuzzy system, the values of a fuzzified input execute allthe rules in the knowledge repository that have the fuzzifiedinput as part of their premise. This process generates a newfuzzy set representing each output or solution variable.Defuzzification creates a value for the output variable fromthat new fuzzy set [11]. So, in order to apply fuzzy logic toan application, first the inputs must be fuzzified so that theirvalue is in the range 0 to 1, then the rules defined by theapplication are applied, and after this, the results derivedfrom various rules are combined using an aggregationfunction. Finally, the aggregated results are defuzzified byusing an inference function. The evaluations of the fuzzyrules and the combination of the results of the individualrules are performed using fuzzy set operations. Theoperations on fuzzy sets are different than the operations onnon-fuzzy sets [13]. The operations for OR and ANDoperators are max and min, respectively. For complement(NOT) operation, NOT(A) is evaluated as (1-A).The proposed sequence-matching algorithm uses threeinput variables – match-count (#match), mismatch-count(#mismatch), and calculated-score (#score – calculatedusing substitution matrix). These inputs are then fuzzifiedusing following membership functions:IJSER 2011http://www.ijser.org

International Journal of Scientific & Engineering Research Volume 2, Issue 7, July-2011ISSN 2229-5518(match) (mismatch) (score)0, if #match 0,1, if #match lenSeq,[0,1] (#match / lenSeq)(1)0, if #mismatch 0,1, if #mismatch lenSeq,[0,1] (#mismatch / lenSeq)(2)0, if #score 0 1, if #score perfectScore,[0,1] #score/perfectscore(3)In these equations, lenSeq is the length of the shortersequence of the two sequences being matched, andperfectScore is the score of matching the two candidatesequences, if there are no indels or replacements.5 THE NEEDLEMAN-WUNSCH ALGORITHMIn Needleman-Wunsch algorithm, a scoring matrix iscalculated for the two given sequences A and B, by placingone sequence along row side and another column side. Thesize of the matrix is (M 1)*(N 1) (M and N are the lengthsof the two sequences). The optimal score at each matrix (i, j)position is calculated by adding the current match score topreviously scored positions and subtracting gap penalties,which may evaluate to either a positive, negative or 0 value.A matrix F(i, j) indexed by residues of each sequence is builtrecursively, such thatF(i, 0) F(0, j) 0F(i, j) max { F(i-1, j-1) S(xi, yj), F(i-1, j) G, F(i, j-1) G }subject to boundary conditions; here, S(i, j) is thesubstitution score for residues i and j, and G is the gappenalty [17].The proposed algorithm uses the Needleman-Wunschalgorithm for aligning two biological sequences. Thesubstitution matrix S, given in Figure 3, is used to calculatethe scores.A CG TAC2-1-121-1-11G1-12-1T-11-123multiple sequences, by first aligning the two most similarsequences, using Needleman-Wunsch algorithm. Thensequences are chosen, one by one, from the remainingsequences, and are aligned to the aligned set of sequences.The proposed algorithm is composed of following twoalgorithms:Algorithm MATCH SEQFL (A, B)This algorithm finds the fuzzy score of similarity, based onfuzzy parameters (as given in section 4), for the given twobiological sequences A and B.I. Calculate the number of matches and mismatchesbetween the two sequences A and B. Also calculate thescore using the substitution matrix.II. Fuzzify the MatchCount, MismatchCount, and Scoreinto the µ(match), µ(mismatch), and µ(score) using theequations 1, 2 and 3, given in section 4.III. Calculate the aggregate fuzzy similarity score based onthe three fuzzy parameters.Algorithm ALIGN SEQFL (SeqDB, N)This algorithm aligns the given set of N sequences stored inSeqDB using progressive approach. The resultant alignedsequences are stored in AlignDB.I. Find the fuzzy similarity value between all pairs ofsequences using algorithm MATCH SEQFL. Removethe most similar pair of sequences from the input set,say it is sequences P and Q. Align sequences P and Qusing the Needleman Wunsch algorithm. Say, thecorresponding aligned sequences are AlignedP andAlignedQ. Store these aligned sequences in AlignDB.II. Select a sequence R from SeqDB, which has higherfuzzy similarity value to one of the aligned sequences.Now align the sequence R with the aligned sequencesAlignedP and AlignedQ. Find the fuzzy match value ofboth aligned pairs, using MATCH SEQFL algorithm.From the two aligned versions of R, select the onewhich gives higher fuzzy similarity score. Say, it isalignR, and store it in AlignDB. Remove the sequenceR from the given sequence set SeqDB.III. Repeat Step II, until all the biological sequences ofSeqDB have been aligned.7 RESULTS AND DISCUSSIONSFig. 3 Substitution Matrix used in the proposed algorithmAn alignment is computed using the F-matrix (calculatedabove): start from the bottom right cell, and compare the cellvalue with the three possible sources ((i-1, j-1) i.e. a Match,(i, j-1) i.e. an Insert, and (i-1, j) i.e. a Delete) to see which itcame from. If it is same as Match, then Ai and Bj are aligned,if same as Delete, then Ai is aligned with a gap, and if sameas Insert, then Bj is aligned with a gap.MATLABTM was used to implement the proposedalgorithm. The fuzzy logic toolbox GUI tool provides aneasy method to build a fuzzy inference system [15]. Usingthis tool, we designed a fuzzy sequence matcher system forimplementing the algorithm (Fig. 4).6 PROPOSED ALGORITHMThe proposed algorithm attempts to align multiplebiological sequences (DNA, for example), using fuzzy logic.The algorithm uses progressive approach for aligningIJSER 2011http://www.ijser.org

International Journal of Scientific & Engineering Research Volume 2, Issue 7, July-2011ISSN 2229-55184Fig. 5(a) Detailed text view of the AH1N1 sequence-set alignmentresults for the (401-500) base pairsFig. 4 System FLSeqMatcher with 3 inputs, 1 output, and 3 rulesThree different sets of influenza virus genome sequences(host: human): AH1N1, AH1N2, and AH1N3, for differentcountries were collected (randomly) from NCBI’s Influenzavirus resource site [18]. Table I enlists the various attributesof the tested sequence sets.The proposed algorithm was used to align all the three setsof sequences. The performance of the alignment results wasevaluated using SinicView – a visualization environment forcomparison of multiple sequence alignment tools [14].Fig. 5(b) Distribution of Per cent Identical rate in the aligned AH1N1sequence-setTABLE IFig. 6(a) Detailed text view of the AH1N2 sequence-set alignmentresults for the (401-500) base pairsTESTED SEQUENCE SETSInfluenzaVirus typeNo. -2009(majorly2007)2007-2009AH3N2501033USA2008The alignment results for AH1N1, AH1N2 and AH3N2virus sets are shown in figures 5(a), 6(a) and 7(a)respectively. The line graph (in red) shows the percentidentity plot for the whole length of sequences. The secondportion in these figures shows the detailed text alignmentsfor the base pairs ranging from 401 to 500. The left side ofthe second portion lists the names of the sequences, and theright side shows the corresponding aligned sequences. Thevertical red bar indicates an identity match in all thesequences. Fig. 5(b), 6(b), and 7(b) plot the distribution ofpercent identical rate of the three aligned sequence- setsrespectively.Fig. 6(b) Distribution of Per cent Identical rate in the aligned AH1N2sequence-setFig. 7(a) Detailed text view of the AH3N2 sequence-set alignmentresults for the (401-500) base pairsIJSER 2011http://www.ijser.org

International Journal of Scientific & Engineering Research Volume 2, Issue 7, July-2011ISSN 2229-55185obtained clearly indicated that the algorithm based on fuzzylogic is an efficient method for biological sequencematching.REFERENCES[1][2][3]Fig. 7(b) Distribution of Per cent Identical rate in the aligned AH3N2sequence-set[4]Table II summarizes the distribution of percent identical rateof alignments of candidate influenza virus sequence-sets.[5]TABLE IITABULATION OF PER CENT IDENTICAL RATE DISTRIBUTION IN THEALIGNED CANDIDATE SEQUENCE-SETSSequence SetAH1N1Per centIdentical Rate70 8080 905.102AH1N20 4014.90440 5014.89450 605.31960 7059.57470 805.31990 100100AH3N2[6]Distribution %94.898[7][8]The evaluation results depict a lot of variation which can bedue to the difference in the origin of sequence (i.e.country/continent) and the time (year) of collection. In thefirst set (AH1N1), the percent identity rate varies majorly in80% - 90% range, as the sequences originate from USA andcollection years are 2007-2009 (majority 2007). Second set’sper cent identity rate is low as the aligned sequences belongto different countries in a continent and the collection year isalso different. AH3N2 sequences show high almost 100%identity rate as the sequences have been majorly collectedfrom Washington (USA) and in the year 2008.8 CONCLUSIONIn this paper, an algorithm for sequence matching based onfuzzy logic has been proposed and implemented. The fuzzysimilarity score of two sequences was calculated based onfew fuzzy parameters. The calculated similarity score guidesin the progressive alignment of multiple sequences, whichwas done using dynamic programming. The experimentalresults show that the algorithm performs the alignment ofsequences quite well and the nature and relationship ofsequences reveal the alignment performance. The R 2011http://www.ijser.orgS. B. Needleman and C. D. Wunsch, "A General Method Applicable to theSearch for Similarities in the Amino Acid Sequence of two Proteins," J.Molecular Biology, vol. 48, pp. 443-453, 1970.T. F. Smith and M. S. Waterman, "Identification of Common MolecularSubsequence," J. Molecular Biology, vol. 147, pp. 195-197, 1981.L.Cai,D.Juedes,E.Liakhovitch,“Evolutionary Computation Techniques for ongressonEvolutionary Computation, 2000, pp. 829-835Swagatam Das & Debangshu Dey, “A New Algorithm for LocalAlignment in DNA Sequencing”, Proc. of IEEE Conference, INDICON2004, pp. 410-413Bandyopadhyay, S.S.; Paul, S.; Konar, A., “Improved Algorithms for DNASequence Alignment and Revision of Scoring Matrix”, Proceedings ofInternational Conference on Intelligent Sensing and InformationProcessing, 2005, pp. 485-490Y. Pan, Y. Chen, Juan Chen, Wei Liu, Ling Chen “PartitionedOptimization Algorithms for Mul

Chang et al. established fuzzy PAM matrix using fuzzy logic and then estimated score for fitness function of genetic algorithm using fuzzy arithmetic [7]. Their experimental results evidenced fuzzy logic useful in dealing with the uncertainties problem

Related Documents:

ing fuzzy sets, fuzzy logic, and fuzzy inference. Fuzzy rules play a key role in representing expert control/modeling knowledge and experience and in linking the input variables of fuzzy controllers/models to output variable (or variables). Two major types of fuzzy rules exist, namely, Mamdani fuzzy rules and Takagi-Sugeno (TS, for short) fuzzy .

fuzzy controller that uses an adaptive neuro-fuzzy inference system. Fuzzy Inference system (FIS) is a popular computing framework and is based on the concept of fuzzy set theories, fuzzy if and then rules, and fuzzy reasoning. 1.2 LITERATURE REVIEW: Implementation of fuzzy logic technology for the development of sophisticated

Different types of fuzzy sets [17] are defined in order to clear the vagueness of the existing problems. D.Dubois and H.Prade has defined fuzzy number as a fuzzy subset of real line [8]. In literature, many type of fuzzy numbers like triangular fuzzy number, trapezoidal fuzzy number, pentagonal fuzzy number,

Fuzzy Logic IJCAI2018 Tutorial 1. Crisp set vs. Fuzzy set A traditional crisp set A fuzzy set 2. . A possible fuzzy set short 10. Example II : Fuzzy set 0 1 5ft 11ins 7 ft height . Fuzzy logic begins by borrowing notions from crisp logic, just as

of fuzzy numbers are triangular and trapezoidal. Fuzzy numbers have a better capability of handling vagueness than the classical fuzzy set. Making use of the concept of fuzzy numbers, Chen and Hwang [9] developed fuzzy Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) based on trapezoidal fuzzy numbers.

ii. Fuzzy rule base: in the rule base, the if-then rules are fuzzy rules. iii. Fuzzy inference engine: produces a map of the fuzzy set in the space entering the fuzzy set and in the space leaving the fuzzy set, according to the rules if-then. iv. Defuzzification: making something nonfuzzy [Xia et al., 2007] (Figure 5). PROPOSED METHOD

2D Membership functions : Binary fuzzy relations (Binary) fuzzy relations are fuzzy sets A B which map each element in A B to a membership grade between 0 and 1 (both inclusive). Note that a membership function of a binary fuzzy relation can be depicted with a 3D plot. (, )xy P Important: Binary fuzzy relations are fuzzy sets with two dimensional

Update to reflect user’s comments Version 2 1.3.16 Hugo den Boogert UEQ31 Update to reflect new developments and user’s comments Version 0 1.10.2018 Habsi, Haitham UEQ32 Revised entirely to SP (previously, it was PR-1708) iii Related Business Processes Code Business Process (EPBM 4.0) iv Related Corporate Management Frame Work (CMF) Documents The related CMF Documents can be retrieved from .