Identifying Protein-protein Interaction Sites On A Genome .

3y ago

39 Views

2 Downloads

416.58 KB

8 Pages

Last View : 1m ago

Last Download : 3m ago

Upload by : Aarya Seiber

Report this link

Download PDF

Transcription

Identifying protein-protein interaction sites on agenome-wide scale Haidong Wang Eran Segal Asa Ben-Hur Daphne Koller Douglas L. BrutlagComputerScience Dept. hdwang, eran, koller @cs.stanford.edu Dept. of Biochemistry asa.benhur, brutlag @stanford.eduStanford University, CA 94305AbstractProtein interactions typically arise from a physical interaction of one ormore small sites on the surface of the two proteins. Identifying these sitesis very important for drug and protein design. In this paper, we proposea computational method based on probabilistic relational model that attempts to address this task using high-throughput protein interaction dataand a set of short sequence motifs. We learn the model using the EMalgorithm, with a branch-and-bound algorithm as an approximate inference for the E-step. Our method searches for motifs whose presence in apair of interacting proteins can explain their observed interaction. It alsotries to determine which motif pairs have high affinity, and can thereforelead to an interaction. We show that our method is more accurate thanothers at predicting new protein-protein interactions. More importantly,by examining solved structures of protein complexes, we find that 2/3 ofthe predicted active motifs correspond to actual interaction sites.1 IntroductionMany cellular functions are carried out through physical interactions between proteins.Discovering the protein interaction map can therefore help to better understand the workings of the cell. Indeed, there has been much work recently on developing high-throughputmethods to produce a more complete map of protein-protein interactions [1, 2, 3].Interactions between two proteins arise from physical interactions between small regions on the surface of the proteins [4] (see Fig. 2(b)). Finding interaction sites is animportant task, which is of particular relevance to drug design. There is currently no highthroughput experimental method to achieve this goal, so computational methods are required. Existing methods either require solving a protein’s 3D structure (e.g., [5]), andtherefore are computationally very costly and not applicable on a genome-wide scale, oruse known interaction sites as training data (e.g., [6]), which are relatively scarce and hencehave poor coverage. Other work focuses on refining the highly noisy high-throughput interaction maps [7, 8, 9], or on assessing the confidence levels of the observed interactions [10].In this paper, we propose a computational method for predicting protein interactionsand the sites at which the interactions take place, which uses as input only high-throughputprotein-protein interaction data and the protein sequences. In particular, our method assumes no knowledge of the 3D protein structure, or of the sites at which binding occurs.Our approach is based on the assumption that interaction sites can be described using alimited repertoire of conserved sequence motifs [11]. This is a reasonable assumption since

P1ddaAaP5P2AdAbP5AdAdP1AabBab S BdbbdbcbP2P3AdbP4T12 OAadAddBab S BabT15 OAbdBab ST25 O(a)(b)Figure 1: (a) Simple illustration of our assumptions for protein-protein interactions. The smallelements denote motif occurrences on proteins, with red denoting active and gray denoting inactivemotifs. (b) A fragment of our probabilistic model, for the proteins . We use yellow todenote an assignment of the value true, and black to denote the value false; full circles denote anassignment observed in the data, and patterned circles an assignment hypothesized by our algorithm.The dependencies involving inactive motif pairs were removed from the graph because they do notaffect the rest of the model.interaction sites are significantly more conserved than the rest of the protein surface [12].Given a protein interaction map, our method tries to explain the observed interactions byidentifying a set of sites of motif occurrence on every pair of interacting proteins throughwhich the interaction is mediated. To understand the intuition behind our approach, consider the example of Fig. 1(a). Here, the interaction pattern of the protein can best beexplained using the motif pair , where appears in and in the proteins but not in . By contrast, the motif pair is not as good an explanation, because also appears in , which has a different interaction pattern. In general, our method aimsto identify motif pairs that have high affinity, potentially leading to interaction betweenprotein pairs that contain them.However, a sequence motif might be used for a different purpose, and not give rise to anactive binding site; it might also be buried inside the protein, and thus be inaccessible forinteraction. Thus, the appearance of an appropriate motif does not always imply interaction.A key feature of our approach is that we allow each motif occurrence in a protein to beeither active or inactive. Interactions are then induced only by the interactions of highaffinity active motifs in the two proteins. Thus, in our example, the motif in is inactive,and hence does not lead to an interaction between and , despite the affinity betweenthe motif pair . We note that Deng et al. [8] proposed a somewhat related method forgenome-wide analysis of protein interaction data, based on protein domains. However,their method is focused on predicting protein-protein interactions and not on revealing thesite of interaction, and they do not allow for the possibility that some domains are inactive.Our goal is thus to identify two components: the affinities between pairs of motifs, andthe activity of the occurrences of motifs in different proteins. Our algorithm addresses thisproblem by using the framework of Bayesian networks [13] and probabilistic relationalmodels [14], which allows us to handle the inherent noise in the protein interaction dataand the uncertain relationship between interactions and motif pairs. We construct a modelencoding our assumption that protein interactions are induced by the interactions of activemotif pairs. We then use the EM algorithm [15], to fill in the details of the model, learningboth the motif affinities and activities from the observed data of protein-protein interactionsand protein motif occurrences. We address the computational complexity of the E-step inthese large, densely connected models by using an approximate inference procedure basedon branch-and-bound.We evaluated our model on protein-protein interactions in yeast and Prosite motifs [11].As a basic performance measure, we evaluated the ability of our method to predict newprotein-protein interactions, showing that it achieves better performance than several other

models. In particular, our results validate our assumption that we can explain interactionsvia the interactions of active sequence motifs. More importantly, we analyze the abilityof our method to discover the mechanism by which the interaction occurs. Finally, weexamined co-crystallized protein pairs where the 3D structure of the interaction is known,so that we can determine the sites at which the interaction took place. We show that ouractive motifs are more likely to participate in interactions.2 The Probabilistic ModelThe basic entities in our probabilistic model are the proteins and the set of sequence motifsthat can mediate protein interactions. Our model therefore contains a set of protein entities , with the motifs that occur in them. Each protein is associated withthe set of motifs that occur in it, denoted by . As we discussed, a key premise ofour approach is that a specific occurrence of a sequence motif may or may not be active. Thus, each motif occurrence is associated with a binary-value variable ,false otherwise. We structure thewhich takes the value true if is active in protein and , to capture our intuition that prior probability true the number of active motifs in a protein is roughly a constant fraction of the total numberof motifs in the protein, but that even proteins with few motifs tend to have at least somenumber of active motifs.A pair of active motifs in two proteins can potentially bind and induce an interactionbetween the corresponding proteins. Thus, in our model, a pair of proteins interact if eachcontains an active motif, and this pair of motifs bind to each other. The probability withwhich two motifs bind to each other is called their affinity. We encode this assumption byincluding in our model entities corresponding to a pair of proteins . For eachpair of motifs and , we introduce a variable , which is adeterministic AND of the activity of these two motifs. Intuitively, this variable representswhether the pair of motifs can potentially interact. The probability with which two activemotif occurrences bind is their affinity. We model the binding event between two motif occurrencesusing a variable , and define:true true and true false, whereis the affinity between motifs and . This model reflects our assumption that two motif occurrences can bind only ifthey are both active, but their actual binding probability depends on their affinity. Note thatthis affinity is a feature of the motif pair and does not depend on the proteins in which theyappear.We must also account for interactions that are not explained by our set of motifs,whether because of false positives in the data, or because of inadequacies of our modelor of our motif set. Thus, we add a spurious binding variable , for cases where aninteraction between and exists, but cannot be explained well by our set of activemotifs. The probability that a spurious binding occurs is given by true.Finally, we observe an interaction between two proteins if and only if some form ofbinding occurs, whether by a motif pair or a spurious binding. Thus, we define a variable , which represents whether protein was observed to interact with protein ! , to be a deterministic OR of all the binding variablesand. Overall, is a noisy-OR [13] of all motif pair variables .Note that our model accounts for both types of errors in the protein interaction data.False negatives (missing interactions) in the data are addressed through the fact that thepresence of an active motif pair only implies that binding takes place with some probability.False positives (wrong interactions) in the data are addressed through the introduction ofthe spurious interaction variables.

The full model defines a joint probability distribution over the entire set of attributes: where each of these conditional probability distributionsas specified above. We use isto denote the entire set of model parameters . An instantiation of ourprobabilistic model is illustrated in Fig. 1(b).3 Learning the ModelWe now turn to the task of learning the model from the data. In a typical setting, we aregiven as input a protein interaction data set, specifying a set of proteins and a set ofobserved interacting pairs . We are also given a set of potentially relevant motifs, andthe occurrences of these motifs in the different proteins in . Thus, all the variables exceptfor the variables are hidden. Our learning task is thus twofold: we need to infer the valuesof the hidden variables, both the activity variables , , and the binding variables , ; we also need to find a setting of the model parameters , which specify themotif affinities. We use a variant of the EM algorithm [15] to find both an assignment tothe parameters , and an assignment to the motif variables , which is a local maximumof the likelihood function . Note that, to maximize this objective, wesearch for a MAP assignment to the motif activity variables, but sum out over the otherhidden variables. This design decision is reasonable in our setting, where determining motifactivities is an important goal; it is a key assumption for our computational procedure.As in most applications of EM, our main difficulty arises in the E-step, where we needto compute the distribution over the hidden variables given the settings of the observedvariables and the current parameter settings. In our model, any two motif variables (bothwithin the same protein and across different proteins) are correlated, as there exists a pathof influence between them in the underlying Bayesian network (see Fig. 1(c)). These correlations make the task of computing the posterior distribution over the hidden variablesintractable, and we must resort to an approximate computation. While we could apply ageneral purpose approximate inference algorithm such as loopy belief propagation [16],such methods may not converge in densely connected model such as this one, and thereare few guarantees on the quality of the results even if they do converge. Fortunately, ourmodel turns out to have additional structure that we can exploit. We now describe an approximate inference algorithm that is tailored to our model, and is guaranteed to convergeto a (strong) local maximum.Our first observation is that the only variables that correlate the different protein pairs are the motif variables . Given an assignment to these activity variables, the network decomposes into a set of independent subnetworks, one for each protein pair. Basedon this observation, we divide our computation of the E-step into two parts. In the first,we find an assignment to the motif variables in each protein, ; in the second, we compute the posterior probability over the binding motif pair variables , , given theassignment to the motif variables.We begin by describing the second phase. We observe that, as all the motif pair variables, , are fully determined by the motif variables, the only variables left to reasonabout are the binding variables and . The variables for any pair are independent of the rest of the model given the instantiation to and the interaction evidence. That fact, combined with the noisy-OR form of the interaction, allows us to compute the posterior probability required in the E-step exactly and efficiently. Specifically,the computation for the variables associated with a particular protein pair is as follows, where we omit the common prefix to simplify notation. Iffalse, then

false true . Otherwise, if true true, then true The first term in the numerator is simply the motif affinity; the second term is ifnumeratorcaneasilybecomputedastrue and otherwise. The . The computation for is very similar. true We now turn to the first phase, of finding a setting to all of the motif variables. Unfortunately, as we discussed, the model is highly interconnected, and a finding an optimaljoint setting to all of these variables is intractable. We thus approximate finding thisjoint assignment using a method that exploits our specific structure. Our method iteratesover proteins, finding in each iteration the optimal assignment to the motif variables of eachprotein given the current assignment to the motif activities in the remaining proteins. Theprocess repeats, iterating over proteins, until convergence.As we discussed, the likelihood of each assignment to can be easily computedusing the method described above. However, the computation for each protein is still exponential in the number of motifs it contains, which can be large (e.g., 15). However, inour specific model, we can apply the following branch-and-bound algorithm (similar to anapproach proposed by Henrion [17] for BN2O networks) to find the globally optimal assignment to the motif variables of each protein. The idea is that we search over the spaceof possible assignments for one that maximizes the objective we wish to maximize.We can show that if making a motif active relative to one assignment does not improve theobjective, it will also not improve the objective relative to a large set of other assignments. denote the objective we wishMore precisely, let to maximize, where is the fixed assignment to motif variables in all proteinsexceptdenote the assignment to all the motif variables in except . Let forcomputethe ratioof after we switch from false to true. Let . We true denote the probability that motif does not bind with any active motif in . We can now compute: true false false ! true (1) true truewhere is the prior probability for a motif in protein to be active.Now, considera different point in the search, where our current motif activity assign ment is , which has all the active motifs in and some additional ones. The .first two terms in the product of Eq. (1) are the same forandFor the final term (themanipulation large fraction), one can show using some algebraic islowerthanthatfor .Weconcludethatthatthistermin , and hence that: true false true false It follows that, if switching motif from inactive to active relative to decreases , itwill also decrease if we have some additional active motifs.We can exploit this property in a branch-and-bound algorithm in order to find the globally optimal assignment . Our algorithm keeps a set of viable candidates for motifassignments. For presentation, we encode assignments via the set of active motifs theycontain. Initially, contains only the empty assignment . We start out by considering

motif assignments with a single active motif. We put such an assignment in if its -score is higher than . Now, we consider assignments that have two activemotifs. We consider only if both and are in . If so, we evaluate its -score,and add it to if this score is greater than that of and . Otherwise, we throw itaway. We continue this process for all assignments of size : For each assignment withactive motif set , we test whether for all ; if we compare to each , and add it if it dominates all of them. The algorithm terminates when, fromsome , no assignment of size is saved.To understand the intuition behind this pruning procedure, consider a candidate assignment , and assume that , but . In this case, we musthave that , but adding to that assignment reduces the -score. In this case, asshown by our analysis, adding to the superset would also reduce the -score.This algorithm is still exponential in worst case. However, in our setting, a protein withmany motifs has a low prior probability that each of them is active. Hence, adding newmotifs is less likely to increase the -score, and the algorithm tends to terminate quickly.As we show in Section 4, this algorithm significantly reduces the cost of our procedure.Our E-step finds an assignment to which is a strong local optimum of the objective function : The assignment has higher probability than anyassignment that changes any of the motif variables for any single protein. For that assignment, our algorithm also computes the distribution over all of the binding variables, asdescribed above. Using this completion, we can now easily compute the (expected) sufficient statistics for the differe

protein-protein interaction data and the protein sequences. In particular, our method as-sumes no knowledge of the 3D protein structure, or of the sites at which binding occurs. Our approach is based on the assumption that interaction sites can be described using a limited repertoire of conserved sequence motifs [11].

Related Documents:

Human Computer Interaction Notes Interaction Design ...

Human Computer Interaction Notes Interaction Design ( Scenarios) Interaction Design is about creating user experiences that enhance and augment the way people work, communicate, and interact.1 Interaction Design has a much wider scope than Human Computer Interaction. ID is concerned with the theory and practice of designing user experiences for any technology or

39 Views

2y ago

Protein:ligand standard binding free energies: A tutorial for ...

protein:ligand, K eq [protein:ligand] [protein][ligand] (1) can be restated as, K eq 1 [ligand] p 1 p 0 (2) where p 0 is the fraction of free protein and p 1, the fraction of protein binding the ligand. Assuming low protein concentration, one can imagine an isolated protein in a solution of Nindistinguishable ligands. Under these premises .

14 Views

6m ago

Structural analysis of protein–DNA and protein–RNA ...

biological signiﬁcance of protein complexation with RNA has been well recognized, the speciﬁc mecha-nism of protein–RNA interaction is not fully understood [10]. Measurement of sequence–speciﬁc DNA– protein and RNA–protein interactions is a key experimental procedure in molecular biology of gene regulation.

36 Views

3y ago

Quick Nutrition Guide for Bariatric Surgery

MLO Super High Protein powder, MLO Brown Rice Protein powder, MLO Milk and Egg Protein powder, MLO Vegetable Protein powder UNJURY Protein bariatric surgery patients Optimum Protein Diet Shakes Bariatric Fusion Protein Supplement Bodytech Whey Pro 24 Premier

53 Views

2y ago

Coomassie Protein Assay - G-Biosciences

of protein assay for research applications. Protein assays based on these methods are divided into two categories: dye binding protein assays and protein a ssays based on alkaline copper. The dye binding protein assay s are based on the binding of protein molecules to Coomassie dye under acidic conditions.

12 Views

1y ago

Protein interaction networks: Protein domain interaction ...

methods have been explored and multiple ways of using biological evidences have been studied in this framework [6, 8, 38, 102, 97, 68, 72, 79, 61, 99]. High-throughput PPI experiments for elucidating protein-protein interactions have been applied to model organisms in recent yea

5 Views

2y ago

10193 CAR BUYER JOURNEY 2017 - coxautoinc.com

While car buyers use a variety of sites to shop, third-party sites are the most-used site of any online resource. THIRD-PARTY SITES ARE THE MOST-USED SITES FOR ONLINE CAR SHOPPING 83% 77% 86% 35% 51% 29% 54% 55% 53% 3rd Party Sites Dealership OEM Sites Total New Used SOURCES USED TO SHOP *

27 Views

1y ago

Adeng Pustikaningsih, M.Si. Dosen Jurusan Pendidikan ...

Dosen Jurusan Pendidikan Akuntansi Fakultas Ekonomi Universitas Negeri Yogyakarta CP: 08 222 180 1695 Email : adengpustikaningsih@uny.ac.id. 23-2. 23-3 PREVIEW OF CHAPTER Intermediate Accounting IFRS 2nd Edition Kieso, Weygandt, and Warfield 23. 23-4 6. Identify sources of information for a statement of cash flows. 7. Contrast the direct and indirect methods of calculating net cash flow from .

83 Views

3y ago

Recent Views

Dear Members of the Harvard Community,

Life science graduate education at Harvard is comprised of 14 Ph.D. programs of study across four Harvard faculties—Harvard Faculty of Arts and Sciences, Harvard T. H. Chan School of Public Health, Harvard Medical School, and Harvard School of Dental Medicine. These 14 programs make up the Harvard Integrated Life Sciences (HILS).

3y ago

182 Views

Xavier Du Maine, Lara Roach, Perspectives - Harvard University

Sciences at Harvard University Richard A. and Susan F. Smith Campus Center 1350 Massachusetts Avenue, Suite 350 Cambridge, MA 02138 617-495-5315 gsas.harvard.edu Office of Diversity and Minority Affairs minrec@fas.harvard.edu gsas.harvard.edu/diversity Office of Admissions and Financial Aid admiss@fas.harvard.edu gsas.harvard.edu/apply

1y ago

146 Views

PROGRAM ON CRISIS LEADERSHIP - Harvard Kennedy School

Harvard Kennedy School Arnold M. Howitt Harvard Kennedy School Philip B. Heymann Harvard Law School April 2014 An earlier version of this white paper provided background for an expert dialogue on lessons learned from the events of the Boston Marathon bombing that was held at the John F. Kennedy School of Government at Harvard

2y ago

330 Views

Harvard Law School - WordPress

Law & Business, Harvard Law School, and H. Douglas Weaver Professor of Business Law. Harvard Business School. 10.30-10.55h. 13th Lecture "Cross-border Insolvency: the New European Regime". Pedro de Miguel Asensio. Full Professor of Private International Law. UCM. 11.00-12.00h. Round Table. "Latest reforms and tendencies on Insolvency Law".

1y ago

145 Views

Harvard Buildings Emergency Phones Harvard University .

Faculty of Arts and Sciences, Harvard University Class of 2018 LEGEND Harvard Buildings Emergency Phones Harvard University Police Department Designated Pathways Harvard Shuttle Bus Stops l e s R i v e r a C h r YOKE ST YMOR E DRIVE BEACON STREET OXFORD ST VENUE CAMBRIDGE STREET KIRKLAND STREET AUBURN STREET VE MEMORIAL

3y ago

171 Views

THE FIRST CENTURY OF THE AMERICAN . - Princeton

Harvard University Press, 1935) and Harvard College in the Seventeenth Century (Cambridge: Harvard University Press, 1936). Quotes, Founding of Harvard, 168, 449. These works are summarized in Three Centuries of Harvard (Cambridge: Harvard U

2y ago

225 Views

Catherine G. Barrera HARVARD UNIVERSITY

danbjork@fas.harvard.edu HARVARD UNIVERSITY Placement Director: Gita Gopinath GOPINATH@HARVARD.EDU 617-495-8161 Placement Director: Nathan Nunn NNUNN@FAS.HARVARD.EDU 617-496-4958 Graduate Administrator: Brenda Piquet BPIQUET@FAS.HARVARD.EDU 617-495-8927 Office Contact Information Department of Economics

2y ago

363 Views

SEAS Lab Safety Officer Orientation

Kuan ebrandin@harvard.edu akuan@fas.harvard.edu Donhee Ham MD B129, MDB132 Dongwan Ha dha@seas.harvard.edu Lene Hau Cruft 112-116 Danny Kim dannykim@seas.harvard.edu Robert Howe 60 Oxford, 312-317,319-321 Paul Loschak loschak@seas.harvard.edu Evelyn Hu McKay 222,226,232 Kathryn Greenberg greenber@fas.harvard.edu

2y ago

359 Views

12 PUBLIC LAW AND PRIVATE LAW - Home: The National .

INTRODUCTION TO LAW MODULE - 3 Public Law and Private Law Classification of Law 164 Notes z define Criminal Law; z list the differences between Public and Private Law; and z discuss the role of Judges in shaping Law 12.1 MEANING AND NATURE OF PUBLIC LAW Public Law is that part of law, which governs relationship between the State

3y ago

745 Views

Dr. Ram Manohar Lohiya National Law University, Lucknow

2. Health and Medicine Law 3. Int. Commercial Arbitration 4. Law and Agriculture IXth SEMESTER 1. Consumer Protection Law 2. Law, Science and Technology 3. Women and Law 4. Land Law (UP) Xth SEMESTER 1. Real Estate Law 2. Law and Economics 3. Sports Law 4. Law and Education **Seminar Courses Xth SEMESTER (i) Law and Morality (ii) Legislative .

3y ago

496 Views

Dangerous Defendants - Yale Law Journal

Law School, Louisiana State University Paul M. Hebert Law Center, Roger Williams University School of Law, Rutgers Law School, Sandra Day O'Connor College of Law, Southern Methodist University Dedman School of Law, University of Georgia School of Law, and University of Utah S.J. Quinney College of Law. For institutional support, I am grateful .

1y ago

169 Views

EMPLOYER GUIDE - Harvard Kennedy School

HARVARD KENNEDY SCHOOL EMPLOYER GUIDE At Harvard Kennedy School, our students are being trained in public policy . Post in the HKS JACK job bank or send to HKS_Career@hks.harvard.edu 2. Browse our resume book to identify a student with the skills and experience you need 3. Visit us and meet our talent HARVARD

2y ago

138 Views

2008-2009 FACT BOOK - Harvard University

Harvard Business School Harvard Medical School Harvard Faculty of Arts and Sciences Harvard School of Public . Publishing Division Joint Center for Housing Studies American Repertory Theatre . WIDE is the Wide-scale Interactive Development for Educators. (5) The Nanoscale Science and Engineering Center is a joint program with M.I.T., U.C.S .

2y ago

364 Views

HARVARD UNIVERSITY 2007-08

Harvard Business School Harvard Medical School Harvard Faculty of Arts and Sciences Harvard School of Public . Publishing Division Joint Center for Housing Studies* American Repertory Theatre . WIDE is the Wide-scale Interactive Development for Educators. (5) The Nanoscale Science and Engineering Center is a joint program with M.I.T., U.C.S .

2y ago

314 Views

ANNA BRADY - Harvard University

Jun 02, 2008 · ANNA BRADY 12 Oxford Street Apt. 9 Cambridge, MA 02138 (617) 495-3108 abrady@jd11.law.harvard.edu EDUCATION HARVARD LAW SCHOOL, Candidate for J.D., June 2011 Activities: Harvard Civil Rights-Civil Liberties Law Review UNIVERSITY OF CHICAGO, B.A. i

2y ago

137 Views

Identifying Protein-protein Interaction Sites On A Genome .

It looks like you're using an ad-blocker