Current Topics in Genome Analysis, Spring 2005, Week 4


NHGRI Current Topics in Genome Analysis 2005
Biological Sequence Analysis I
Andy Baxevanis, Ph.D.

Overview
- Week 4: Comparative methods and concepts
  - Similarity vs. Homology
  - Global vs. Local Alignments
  - Scoring Matrices
  - BLAST
  - BLAT
- Week 5: Predictive methods and concepts
  - Profiles, patterns, motifs, and domains
  - Secondary structure prediction
  - Structures: VAST, Cn3D, and de novo prediction

Why do sequence alignments?
- Provide a measure of relatedness between nucleotide or amino acid sequences
- Determining relatedness allows one to draw biological inferences regarding
  - structural relationships
  - functional relationships
  - evolutionary relationships
- Importance of using correct terminology

Defining the Terms
- The quantitative measure: Similarity
  - Always based on an observable
  - Usually expressed as percent identity
  - Quantify changes that occur as two sequences diverge
    - substitutions
    - insertions
    - deletions
  - Identify residues crucial for maintaining a protein's structure or function
- High degrees of sequence similarity might imply
  - a common evolutionary history
  - possible commonality in biological function
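As a small illustration of similarity expressed as percent identity, the sketch below counts identical columns in two sequences that are already aligned. The function name, the "-" gap character, and the choice to count every alignment column in the denominator are assumptions made for this example; other tools use other denominators.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap (they carry no information).
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        raise ValueError("no comparable columns")
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


identity = percent_identity("ACGT-A", "ACCTTA")  # 4 identical columns out of 6
```

The result is an observable (similarity); whether the two sequences are homologous is a separate conclusion.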

Defining the Terms
- The conclusion: Homology
  - Genes are or are not homologous (not measured in degrees)
  - Homology implies an evolutionary relationship
- The term "homolog" may apply to the relationship
  - between genes separated by the event of speciation (orthology)
  - between genes separated by the event of genetic duplication (paralogy)

Defining the Terms
- Orthologs
  - Sequences are direct descendants of a sequence in a common ancestor
  - Most likely have similar domain structure, three-dimensional structure, and biological function
- Paralogs
  - Related through a gene duplication event
  - Provides insight into "evolutionary innovation" (adapting a pre-existing gene product for a new function)

Defining the Terms
[Figure: orthologs. Genes A1, B2, and C3 descend from gene α in the most recent common ancestor.]

Defining the Terms
[Figure: paralogs and orthologs. A gene duplication in the most recent common ancestor produces α (A1, B2, C3) and β (A4, B5, C6).]
- Genes 1-3 are orthologous
- Genes 4-6 are orthologous
- Any pair of α and β genes is paralogous (genes related through a gene duplication event)

Global Sequence Alignments
- Sequence comparison along the entire length of the two sequences being aligned
- Best for highly similar sequences of similar length
- As the degree of sequence similarity declines, global alignment methods tend to miss important biological relationships
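One way to make "comparison along the entire length" concrete is the Needleman-Wunsch dynamic program, sketched below. The slides do not name an algorithm, and the match, mismatch, and linear gap values here are illustrative assumptions rather than values from the lecture.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best end-to-end (global) alignment of a and b."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


score = global_score("ACGT", "AGT")  # three matches and one gap
```

Every residue of both sequences must be accounted for, which is why unrelated flanking regions pull a global score down.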

Local Sequence Alignments
- Sequence comparison intended to find the most similar regions in the two sequences being aligned ("paired subsequences")
- Regions outside the area of local alignment are excluded
- More than one local alignment could be generated for any two sequences being compared
- Best for sequences that share some similarity, or for sequences of different lengths
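The exclusion of regions outside the aligned segment can be sketched with the Smith-Waterman recurrence (Smith and Waterman, 1981): any cell whose score would drop below zero is reset to zero, so poor flanking regions never count against the best segment. The scoring values below are illustrative assumptions.

```python
def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best local alignment between any substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets an alignment start fresh at any position.
            cell = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


score = local_score("XXXGATTACAYYY", "ZZGATTACAWW")  # only the shared GATTACA scores
```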

Scoring Matrices
- Empirical weighting scheme to represent biology (side chain chemistry, structure, and function)
  - Cys/Pro important for structure and function
  - Trp has bulky side chain
  - Lys/Arg have positively charged side chains

Scoring Matrices
- Conservation: What residues can substitute for another residue and not adversely affect the function of the protein?
  - Ile/Val - both small and hydrophobic
  - Ser/Thr - both polar
  - Conserve charge, size, hydrophobicity, other physicochemical factors
- Frequency: How often does a particular residue occur amongst the entire constellation of proteins?

Scoring Matrices
- Importance of understanding scoring matrices
  - Appear in all analyses involving sequence comparison
  - Implicitly represent particular evolutionary patterns
  - Choice of matrix can strongly influence outcomes

Matrix Structure
[Figure: nucleotide scoring matrix]
- Simple match/mismatch scoring scheme
- Assumes each nucleotide occurs 25% of the time

Matrix Structure
[Figure: BLOSUM62 matrix]

PAM Matrices
- Margaret Dayhoff and colleagues, 1978
- Look at patterns of substitutions in highly related proteins (> 85% similar) within multiple sequence alignments
- Analysis documented 1572 changes in 71 groups of proteins examined
- Substitution tables constructed based on results
- Given high degree of similarity within original sequence set, results represent substitution pattern that would be expected over short evolutionary distances

PAM Matrices
- Short evolutionary distance: change in function unlikely
- Point Accepted Mutation (PAM)
  - The new side chain must function the same way as the old one ("acceptance")
  - On average, 1 PAM corresponds to 1 amino acid change per 100 residues
  - 1 PAM ~ 1% divergence
  - Extrapolate to predict patterns at longer evolutionary distances

PAM Matrices: Assumptions
- All sites assumed to be equally mutable
- Replacement of amino acids is independent of previous mutations at the same position
- Replacement is independent of surrounding residues
- Forces responsible for sequence evolution over shorter time spans are the same as those over longer time spans
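The extrapolation step can be sketched as repeated matrix multiplication: under the independence assumptions listed above, the substitution probabilities at n PAMs are the PAM 1 probabilities raised to the n-th power. The three-letter alphabet and the values in TOY_PAM1 are hypothetical and chosen only so that each residue changes with probability 1%; they are not Dayhoff's data.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = matrix
    while n > 0:
        if n % 2 == 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n //= 2
    return result


# Hypothetical PAM 1-like matrix: each row gives P(row residue -> column residue).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)  # extrapolated, not directly observed
```

Because every entry of the extrapolated matrix is a product of PAM 1 entries, any error in PAM 1 is carried into the longer distances.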

PAM Matrices: Sources of Error
- Small, globular proteins of average composition used to derive matrices
- Errors in PAM 1 are magnified up to PAM 250 (only PAM 1 is based on direct observation)
- Does not account for conserved blocks or motifs

BLOSUM Matrices
- Henikoff and Henikoff, 1992
- Blocks Substitution Matrix
- Look only for differences in conserved, ungapped regions of a protein family ("blocks")
- Directly calculated, using no extrapolations
- More sensitive to detecting structural or functional substitutions
- Generally perform better than PAM matrices for local similarity searches (Henikoff and Henikoff, 1993)

BLOSUM n
- Calculated from sequences sharing no more than n% identity
- Contribution of sequences > n% identical clustered and weighted to 1
[Figure: example block alignment, AT Hook Domain (Block IPB000637B)]
- 2,000 blocks representing 500 groups of related proteins

BLOSUM n
- Clustering reduces contribution of closely related sequences (less bias towards substitutions that occur in the most closely related members of a family)
- Substitution frequencies are more heavily influenced by sequences that are more divergent than this cutoff
- Reducing n yields more distantly related sequences
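A rough sketch of how scores can be calculated directly from a block follows: count residue pairs within each column, compare the observed pair frequencies with those expected from the residue frequencies, and take a rounded log-odds ratio in half-bit units. The toy block is hypothetical, and the clustering and weighting step described above is omitted to keep the example short.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_frequencies(block):
    """Observed frequency of each unordered residue pair within the block's columns."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


def log_odds_scores(block):
    """Rounded log-odds scores, 2 * log2(observed / expected), for each pair."""
    observed = pair_frequencies(block)
    residue_freq = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            residue_freq[a] += freq
        else:
            residue_freq[a] += freq / 2
            residue_freq[b] += freq / 2
    scores = {}
    for (a, b), freq in observed.items():
        expected = residue_freq[a] * residue_freq[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


toy_block = ["AS", "AS", "AS", "AA"]  # hypothetical ungapped block, one row per sequence
toy_scores = log_odds_scores(toy_block)
```

Pairs seen more often than chance predicts get positive scores; pairs seen less often get negative scores.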

So many matrices...

Triple-PAM Strategy (Altschul, 1991)
Matrix     Best use                                      Similarity
PAM 40     Short alignments, highly similar              70-90%
PAM 160    Detecting known members of a protein family   50-60%
PAM 250    Longer, weaker local alignments               ~30%

BLOSUM (Henikoff, 1993)
Matrix     Best use                                               Similarity
BLOSUM 90  Short alignments, highly similar                       70-90%
BLOSUM 80  Detecting known members of a protein family            50-60%
BLOSUM 62  Most effective in finding all potential similarities   30-40%
BLOSUM 30  Longer, weaker local alignments                        < 30%

So many matrices...
- Matrix Equivalencies
  - PAM 250 ~ BLOSUM 45
  - PAM 160 ~ BLOSUM 62
  - PAM 120 ~ BLOSUM 80
- Specialized matrices
  - Transmembrane proteins
  - Species-specific matrices
Wheeler, 2003

So many matrices...
No single matrix is the complete answer for all sequence comparisons

Gaps
- Compensate for insertions and deletions
- Used to improve alignments between two sequences
- Must be kept to a reasonable number, so as not to reflect a biologically implausible scenario (~1 gap per 20 residues is a good rule of thumb)
- Cannot be scored simply as a "match" or a "mismatch"

Affine Gap Penalty
Fixed deduction for introducing a gap plus an additional deduction proportional to the length of the gap

Deduction for a gap = G + Ln

where
  G = gap-opening penalty
  L = gap-extension penalty
  n = length of the gap

       nuc   pro
  G     5    11
  L     2     1

Can adjust scores to make gap insertion more or less permissive, but most programs will use values of G and L most appropriate for the scoring matrix selected
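The affine deduction G + Ln is simple to compute directly. The sketch below uses the nucleotide and protein values of G and L shown on the slide; the function and dictionary names are made up for this example.

```python
# (G, L) pairs from the slide: gap-opening and gap-extension penalties.
GAP_PARAMETERS = {"nuc": (5, 2), "pro": (11, 1)}


def affine_gap_deduction(gap_length: int, kind: str = "pro") -> int:
    """Deduction for one gap of the given length: G + L * n."""
    opening, extension = GAP_PARAMETERS[kind]
    return opening + extension * gap_length


one_long_gap = affine_gap_deduction(3, "pro")          # 11 + 1*3 = 14
three_short_gaps = 3 * affine_gap_deduction(1, "pro")  # 3 * (11 + 1) = 36
```

With these values one gap of length 3 costs far less than three separate gaps of length 1, which is the behavior the opening penalty is meant to produce.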

BLAST
- Basic Local Alignment Search Tool
- Seeks high-scoring segment pairs (HSP)
  - pair of sequences that can be aligned without gaps
  - when aligned, have maximal aggregate score (score cannot be improved by extension or trimming)
  - score must be above score threshold S
  - gapped or ungapped
- Results not limited to the "best HSP" for any given sequence pair

BLAST Algorithms
Program   Query Sequence                      Target Sequence
BLASTN    Nucleotide                          Nucleotide
BLASTP    Protein                             Protein
BLASTX    Nucleotide, six-frame translation   Protein
TBLASTN   Protein                             Nucleotide, six-frame translation
TBLASTX   Nucleotide, six-frame translation   Nucleotide, six-frame translation
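The six-frame translation used by the translated BLAST programs can be sketched as follows: translate the nucleotide sequence in three reading frames on the forward strand and three on the reverse complement. This is a minimal illustration using the standard genetic code; it assumes the input contains only A, C, G, and T and is not how BLAST itself is implemented.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> list:
    """Frames +1, +2, +3 followed by -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


frames = six_frames("ATGGCCTAA")
```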

Neighborhood Words
[Figure: a query word of length W and its neighborhood words (for example PQA and PQN), listed with their scores in descending order. Words scoring at or above the neighborhood score threshold (T = 13) are kept and used to seed a high-scoring segment pair.]
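The neighborhood idea can be sketched by enumerating every word of the same length as the query word and keeping those that score at or above T against it. The four-letter alphabet, the toy scoring function, and the threshold below are hypothetical; a real search would use a matrix such as BLOSUM62 over all 20 amino acids.

```python
from itertools import product


def toy_score(a: str, b: str) -> int:
    """Hypothetical scoring: +5 for identical residues, -1 otherwise."""
    return 5 if a == b else -1


def neighborhood(word: str, alphabet: str, threshold: int):
    """All words scoring at least `threshold` against `word`, best first."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


words = neighborhood("PQG", alphabet="AGPQ", threshold=9)
```

Raising the threshold shrinks the neighborhood (faster, less sensitive); lowering it has the opposite effect.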

[Figure: extension of a word hit into a high-scoring segment pair. Cumulative score is plotted against extension, with thresholds S and T and drop-off X marked. Significance decay: mismatches, gap penalties.]

Karlin-Altschul Equation
E = kmN e^(-λS)

where
  m  = # letters in query
  N  = # letters in database
  mN = size of search space
  λS = normalized score
  k  = minor constant
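The Karlin-Altschul relationship E = kmN e^(-λS) can be evaluated directly. In the sketch below the values of λ and k are placeholders chosen for illustration; in practice they depend on the scoring matrix and gap penalties in use.

```python
from math import exp


def expect_value(score: float, query_len: int, db_len: int, lam: float, k: float) -> float:
    """E = k * m * N * exp(-lambda * S): HSPs expected by chance at score S or better."""
    return k * query_len * db_len * exp(-lam * score)


# Hypothetical parameters: a 300-letter query against a database of 10**9 letters.
e_low_score = expect_value(40, 300, 10**9, lam=0.3, k=0.1)
e_high_score = expect_value(80, 300, 10**9, lam=0.3, k=0.1)
```

E grows linearly with the size of the search space mN and falls exponentially with the score S, so the same alignment becomes less significant as the database grows.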

E = kmN e^(-λS)
- Number of HSPs found purely by chance
- Lower values signify higher similarity
- E < 10^-6 for nucleotides
- E < 10^-3 for proteins


Available protein databases
[Screenshot: database menu, including Reference Sequences, SWISS-PROT, Patents, Protein Data Bank, and Last 30 days]

Low-Complexity Regions
- Defined as regions of biased composition
  - Homopolymeric runs
  - Short-period repeats
  - Subtle over-representation of several residues
[Example: gi|20455478|sp|P50553|ASC1_HUMAN, Achaete-scute homolog 1, showing a homopolymeric alanine-glutamine tract]

Identifying Low-Complexity Regions
- Biological origins and role not well understood
  - DNA replication errors (polymerase slippage)?
  - Unequal crossing-over?
- May confound sequence analysis
  - BLAST relies on uniformly distributed amino acid frequencies
  - Often lead to false positives
  - Filtering is advised (and usually enabled by default)
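To give a feel for what "biased composition" means computationally, the sketch below masks any window whose Shannon entropy falls below a cutoff. This is a simplified stand-in and not the SEG or DUST algorithm; the window size and cutoff are arbitrary assumptions.

```python
from collections import Counter
from math import log2


def window_entropy(window: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of a window."""
    counts = Counter(window)
    size = len(window)
    return -sum(count / size * log2(count / size) for count in counts.values())


def mask_low_complexity(seq: str, window: int = 12, cutoff: float = 2.2, mask: str = "X") -> str:
    """Replace every residue that lies in a low-entropy window with the mask character."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < cutoff:
            for position in range(start, start + window):
                masked[position] = mask
    return "".join(masked)


masked_run = mask_low_complexity("A" * 12)             # homopolymeric run, fully masked
unmasked_diverse = mask_low_complexity("ACDEFGHIKLMN")  # 12 distinct residues, unchanged
```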

[Screenshot: BLAST search options. Matrix choices: PAM30, PAM70, BLOSUM80, BLOSUM62, BLOSUM45. E value threshold: reports all hits with E < 10.]

[Screenshot: limiting a search by Organism [ORGN]]

[Screenshot: graphical overview of BLAST hits, shown in descending score order. Annotations: color key; unrelated hits; gap within same hit (> 1 HSP); masked region.]


[Screenshot: BLAST hit list in descending score order. An E value of 0.0 means < 10^-1000; 6e-95 means 6 x 10^-95. Link icons: S = Structure, G = Gene. Hits are marked "Accept (for now)" or "Reject".]

[Screenshot: pairwise alignment view. Identities: > 25% for proteins, > 70% for nucleotides. "-" marks a gap; X marks a low-complexity region. An alignment with no definition line means a second HSP was identified.]

[Screenshot: alignment with HSP 1 and HSP 2 marked]

Suggested BLAST Cutoffs
              E value    Sequence Identity
Nucleotide    < 10^-6    > 70%
Protein       < 10^-3    > 25%

- Do not use these cutoffs blindly!
- Pay attention to alignments on either side of the dividing line
- Do not ignore biology!
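As a sketch of how the suggested cutoffs might be applied to a parsed hit list, the function below flags hits on each side of the dividing line. The comparison directions (E below the cutoff, identity above it) are assumptions, and the hit tuples are invented examples. In keeping with the warning above, hits that fail are flagged for inspection and not discarded.

```python
# (maximum E value, minimum percent identity) per sequence type, from the slide.
CUTOFFS = {"nucleotide": (1e-6, 70.0), "protein": (1e-3, 25.0)}


def passes_cutoffs(kind: str, evalue: float, identity_pct: float) -> bool:
    """True if a hit is on the favorable side of both suggested cutoffs."""
    max_evalue, min_identity = CUTOFFS[kind]
    return evalue < max_evalue and identity_pct > min_identity


# Hypothetical protein hits: (name, E value, percent identity).
hits = [("hit_1", 6e-95, 48.0), ("hit_2", 2e-4, 24.0), ("hit_3", 0.5, 31.0)]
labeled = [(name, "accept (for now)" if passes_cutoffs("protein", e, pid) else "inspect")
           for name, e, pid in hits]
```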

Database Searching Artifacts
- Low-complexity regions
  - Nucleotide searches: removed with DUST (replaced with N)
  - Protein searches: removed with SEG (replaced with X)
- Repetitive elements
  - LINE, SINE, Alu
  - Automatic masking "experimental and still under development"
  - repeatmasker.org

Database Searching Artifacts
- Low-quality sequence hits
  - Expressed sequence tags (ESTs)
  - Single-pass sequence reads from large-scale sequencing (possibly with vector contaminants)

BLAST 2 Sequences
- Finds local alignments between two protein or nucleotide sequences of interest
- All BLAST programs available
- Select BLOSUM and PAM matrices available for protein comparisons
- Same affine gap costs (adjustable)
- Input sequences can be masked
- Implementations
  - NCBI Web interface
  - bl2seq, downloadable /blast/executables/

[Screenshot: matrix choices PAM30, PAM70, BLOSUM80, BLOSUM62, BLOSUM45]

MegaBLAST
- Optimized for aligning very long and/or highly similar sequences
- Good for batch nucleotide searches
- Search targets include
  - Entire eukaryotic genomes
  - Complete chromosomes and contigs from RefSeq
- Run speeds approximately 10 times faster than BLASTN
  - Adjusted word size
  - Different gap scoring scheme

BLASTN vs. MegaBLAST
- Word size
  - BLASTN default: 11
  - MegaBLAST default: 28
- Non-affine gap penalties

Deduction for a gap = r/2 - q

where
  r = match reward (default 1)
  q = mismatch penalty (default -2)
and no penalty for opening the gap
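The non-affine deduction r/2 - q can be worked through with the default values given on the slide. Treating the deduction as a per-position cost, so that a gap of length n costs n times as much, is an assumption of this sketch; the function name is invented.

```python
def megablast_gap_deduction(gap_length: int = 1, reward: float = 1, penalty: float = -2) -> float:
    """Deduction of (r/2 - q) per gapped position, with no gap-opening penalty."""
    return gap_length * (reward / 2 - penalty)


single_position = megablast_gap_deduction(1)  # 1/2 - (-2) = 2.5
four_positions = megablast_gap_deduction(4)   # 4 * 2.5 = 10.0
```

Because there is no opening penalty, the cost depends only on the total number of gapped positions, not on how they are grouped.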



- Overlapping clones?
- Two separate regions of chromosome 5?
- Finished sequence needed
- Check subsequent builds of mouse genome

BLAT
- "BLAST-Like Alignment Tool"
- Designed to rapidly align longer nucleotide sequences (L ≥ 40) having ≥ 95% sequence similarity
- Can find exact matches reliably down to L = 33
- Method of choice when looking for exact matches in nucleotide databases
- ~500 times faster for mRNA/DNA searches
- May miss divergent or shorter sequence alignments
- Can be used on protein sequences

When to Use BLAT
- To characterize an unknown gene or sequence fragment
  - Find its genomic coordinates
  - Determine gene structure (the presence and position of exons)
  - Identify markers of interest in the vicinity of a sequence
- To find highly similar sequences
  - Identify gene family members
  - Identify putative homologs
- To display a specific sequence as a separate track




FASTA
- Identifies regions of local alignment
- Employs an approximation of the Smith-Waterman algorithm to determine the best alignment between two sequences
- Method is significantly different from that used by BLAST
- Online implementations
  - bi.ac.uk/fasta33

Further Reading

Altschul, S.F., Boguski, M.S., Gish, W., and Wootton, J.C. 1994. Issues in searching molecular sequence databases. Nat. Genet. 6: 119-129. A review of the issues that are of importance in using sequence similarity search programs, including potential pitfalls.

Baxevanis, A.D. Assessing pairwise sequence similarity: BLAST and FASTA. In Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, third edition (Baxevanis, A.D. and Ouellette, B.F.F., eds.), John Wiley and Sons, 2005. An overview of the methods used to generate pairwise sequence alignments and assess the biological significance of results.

Henikoff, S. and Henikoff, J.G. 2000. Amino acid substitution matrices. Adv. Protein Chem. 54: 73-97. A comprehensive review covering the factors critical to the construction of protein scoring matrices.

Korf, I., Yandell, M., and Bedell, J. BLAST. O'Reilly and Associates, 2003. An in-depth treatment of the BLAST algorithm, its applications, as well as installation, hardware, and software considerations. The book provides "documentation" that is not easily found elsewhere.

Pearson, W.R. Finding protein and nucleotide similarities with FASTA. 2003. Current Protocols in Bioinformatics 3.9.1-3.9.23. An in-depth discussion of the FASTA algorithm, including worked examples and additional information regarding run options and use scenarios.

Wheeler, D.G. Selecting the right protein scoring matrix. 2003. Current Protocols in Bioinformatics 3.5.1-3.5.6. A discussion of PAM, BLOSUM, and specialized scoring matrices, with guidance regarding the proper choice of matrices for particular types of protein-based analyses.

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.

Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. 1978. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, M.O. Dayhoff, ed., National Biomedical Research Foundation, Washington, 5: 345-352.

Henikoff, S. and Henikoff, J.G. 1991. Automated assembly of protein blocks for database searching. Nucleic Acids Res. 19: 6565-6572.

Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.

Henikoff, S. and Henikoff, J.G. 1993. Performance evaluation of amino acid substitution matrices. Proteins Struct. Funct. Genet. 17: 49-61.

Henikoff, S. and Henikoff, J.G. 2000. Amino acid substitution matrices. Adv. Protein Chem. 54: 73-97.

Karlin, S. and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87: 2264-2268.

Kent, W.J. 2002. BLAT: the BLAST-like alignment tool. Genome Res. 12: 656-664.

Pearson, W.R. 1995. Comparison of methods for searching protein sequence databases. Protein Sci. 4: 1145-1160.

Pearson, W.R. 2000. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132: 185-219.

Pearson, W.R. Finding protein and nucleotide similarities with FASTA. 2003. Current Protocols in Bioinformatics 3.9.1-3.9.23.

Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85: 2444-2448.

Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195-197.

Tatusova, T.A. and Madden, T.L. 1999. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 174: 247-250.

Wootton, J.C. and Federhen, S. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17: 149-163.
