Biol5705 Module: Gene Sequence Analysis Lecture 1


Biol5705 Module: Gene Sequence Analysis
Lecture 1
Homology Searching
Dr. Morgan Langille

Outline
- What is homology?
  - orthologs, paralogs, etc.
- local vs global alignment
- E-values, bit scores, "coverage", identity vs similarity
- different BLAST flavours (blastn, blastp, tblastn, etc.)
- BLAST (Web)

What is homology?
- Homology refers to shared ancestry.
- Two sequences are homologous if they are derived from a common ancestral sequence.
- One sequence by itself is not informative; it must be analyzed by comparative methods against existing sequence databases to develop hypotheses concerning relatives and function.

Types of homologs
- Orthologs
  - Think same gene in different organisms
  - Often thought to have similar function
- Paralogs
  - Think gene duplication
  - Less likely to have similar function

What is similarity?
- Similarity is a measure of the likeness between sequences.
- Gene searching tools calculate the similarity between sequences and rank more similar sequences higher.
- Sequences can NOT be partially homologous
  - WRONG: Gene X is 80% homologous to Gene Y
- Sequences can be partially similar
  - CORRECT: Gene X has 80% identity to Gene Y

Identity vs Similarity
- Identity is a percentage measurement that states how many characters in the aligned sequences are identical.
- Similarity can also be used as a metric, meaning how many aligned characters are "positive scoring" in the scoring matrix.
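The difference between the two metrics can be sketched in a few lines of Python. This is an illustration only: the function name and the example sequences are made up, the percentages are taken over all alignment columns (tools differ in what they use as the denominator), and POSITIVE_PAIRS is a small hand-picked subset of the non-identical residue pairs that score positive in BLOSUM62, not the full matrix.

```python
# Illustrative subset of non-identical residue pairs with a positive
# BLOSUM62 score. This is NOT the full matrix (assumption for the demo).
POSITIVE_PAIRS = {
    frozenset("IV"), frozenset("IL"), frozenset("LV"),
    frozenset("KR"), frozenset("DE"), frozenset("ST"), frozenset("FY"),
}


def identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent positives) over all alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    positive = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns count toward the length but never match
        if a == b:
            identical += 1
            positive += 1  # identical residues also score positive
        elif frozenset((a, b)) in POSITIVE_PAIRS:
            positive += 1
    length = len(aligned_a)
    return 100.0 * identical / length, 100.0 * positive / length


# Hypothetical 7-column alignment: 3 identities, 3 conservative changes, 1 gap.
ident, sim = identity_and_similarity("KVLDESF", "RVIDE-Y")
print(f"identity = {ident:.1f}%, positives = {sim:.1f}%")
```

Similarity is never lower than identity, because every identical column is also a positive-scoring column.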

Assessing
[Figure: pairwise alignment of two protein sequences labelled Rbn and Lsz; the alignment layout was lost in transcription]
- Is this alignment significant?

Twilight Zone
[Figure: Evolutionary Distance vs Percent Sequence Identity. Sequence Identity (%) plotted against Number of Residues, with the "Twilight Zone" region marked.]

Some Simple Suggestions
- If two sequences are more than 100 residues long and more than 25% identical, they are likely related.
- If two sequences are 15-25% identical, they may be related, but more tests are needed.
- If two sequences are less than 15% identical, they are probably not related.
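These rules of thumb can be written as a small helper. This is a sketch, assuming the thresholds read as "more than 100 residues", "more than 25% identical" and "less than 15% identical"; the function name is hypothetical, and short alignments above 25% identity are treated conservatively as needing more tests.

```python
def rule_of_thumb(aligned_length, percent_identity):
    """Classify a pairwise comparison using the slide's rough guidelines."""
    if aligned_length > 100 and percent_identity > 25:
        return "likely related"
    if percent_identity >= 15:
        # Covers the 15-25% range, plus short alignments above 25%.
        return "possibly related; more tests needed"
    return "probably not related"


for length, identity in [(150, 30), (150, 20), (60, 40), (150, 10)]:
    print(length, identity, "->", rule_of_thumb(length, identity))
```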

Global vs Local
- Alignments can be global or local (this is algorithm specific).
- A global alignment is an optimal alignment that includes all characters from each sequence (e.g. Multiple Sequence Alignment).
- A local alignment is an optimal alignment that includes only the most similar local region or regions (e.g. BLAST).
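To make the distinction concrete, the sketch below computes the best global score (Needleman-Wunsch style) and the best local score (Smith-Waterman style) for the same pair of sequences. It is a minimal dynamic-programming illustration, not how BLAST works internally (BLAST uses a faster heuristic); the scoring values, function name and sequences are assumptions chosen for the example.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: every character must be included, so leading gaps cost.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)  # local: a poor prefix can be dropped
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two sequences sharing only the core region "ACGTACG".
a = "TTTTACGTACGTTTTT"
b = "GGACGTACGGG"
print("global:", alignment_score(a, b))              # -7
print("local: ", alignment_score(a, b, local=True))  # 7
```

The local score reflects only the shared core, while the global score is dragged down by the unrelated flanking characters that must still be aligned.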

Dot Plots
- A popular freeware package is Dotter
  - http://sonnhammer.sbc.su.se/Dotter.html
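The idea behind a dot plot can be sketched in plain Python: place one sequence along each axis and mark every cell where the two characters match. This is only a toy text version with hypothetical sequences and function name; Dotter itself adds windowed scoring and an interactive display.

```python
def dot_plot(seq_x, seq_y):
    """Return one string per residue of seq_y: '*' where it equals the
    residue of seq_x in that column, '.' otherwise."""
    return ["".join("*" if y == x else "." for x in seq_x) for y in seq_y]


seq_x = "GATTACA"
seq_y = "GATTTACA"  # seq_x with one extra T inserted

print("  " + seq_x)
for residue, row in zip(seq_y, dot_plot(seq_x, seq_y)):
    print(residue + " " + row)
```

Similar regions show up as diagonal runs of marks; an insertion or deletion shifts the diagonal sideways.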

The BLAST algorithm
- The BLAST programs (Basic Local Alignment Search Tools) are a set of sequence comparison algorithms introduced in 1990 that are used to search sequence databases for optimal local alignments to a query.
  - Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
  - Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." NAR 25:3389-3402.

Several different BLAST programs:

Program   Description
blastp    Compares an amino acid query sequence against a protein sequence database.
blastn    Compares a nucleotide query sequence against a nucleotide sequence database.
blastx    Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn   Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx   Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is too computationally intensive.
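One way to remember the programs listed above is as a lookup from query type and database type. The sketch below is a memory aid only; the dictionary and function names are hypothetical and are not part of any BLAST software.

```python
# (query type, database type) -> BLAST programs that accept that combination
PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def candidate_programs(query_type, database_type):
    """Return the BLAST programs that fit the given query and database types."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'protein' or 'nucleotide'") from None


print(candidate_programs("nucleotide", "protein"))  # ['blastx']
print(candidate_programs("protein", "nucleotide"))  # ['tblastn']
```

For a nucleotide query against a nucleotide database, blastn compares the sequences directly, while tblastx compares their six-frame translations.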

MegaBLAST
- For aligning very similar sequences
- Nucleotide only
- Very efficient for long query sequences
- Uses big word (k-tuple) sizes to start search
  - Very fast
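Why a bigger word size speeds things up can be seen with a toy seeding step: index every exact word of length k in the subject, then look up each word of the query. This is a simplified sketch of exact-word seeding, not the actual MegaBLAST implementation; the function name, the sequences and the small word sizes are assumptions for the demo.

```python
from collections import defaultdict


def shared_words(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where an exact word of
    length word_size occurs in both sequences."""
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


query = "ACGTACGT"
subject = "TTACGTAA"
print(shared_words(query, subject, 4))  # [(0, 2), (1, 3), (3, 1), (4, 2)]
print(shared_words(query, subject, 6))  # []
```

Longer words produce far fewer seeds to extend, which makes the search faster but means that only very similar sequences are found.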

http://www.ncbi.nlm.nih.gov/BLAST/


[Figure: the QUERY sequence(s) and a BLAST database are given to a BLAST program, which produces the BLAST results.]

Considerations for choosing a BLAST database
- First consider your research question:
- Are you looking for a particular gene in a particular species?
  - BLAST against the genome of that species.
- Are you looking for additional members of a protein family across all species?
  - BLAST against the non-redundant database (nr); if you can't find hits, check wgs, htgs, and the trace archives.
- Are you looking to annotate genes in your species of interest?
  - BLAST against known genes (RefSeq) and/or ESTs from a closely related species.

When choosing a database for BLAST
- Changing your choice of database is changing your search space
- Database size affects the BLAST statistics
- Databases change rapidly and are updated frequently

Where does the score (S) come from?
- The quality of each pair-wise alignment is represented as a score and the scores are ranked.
- Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein).
- The alignment score will be the sum of the scores for each position.

What's a scoring matrix?
- Substitution matrices are used for amino acid alignments.
  - Each possible residue substitution is given a score.
- A simpler unitary matrix is used for DNA pairs.
  - Each position can be given a score of 1 if it matches and a score of -1 if it does not.
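The unitary DNA scoring described above is easy to sketch: score each position +1 for a match and -1 for a mismatch, then sum. The function name and the two sequences are hypothetical, and the sketch assumes an ungapped alignment (gap penalties are not covered here).

```python
def unitary_score(aligned_a, aligned_b, match=1, mismatch=-1):
    """Sum of per-position scores for two ungapped, equal-length DNA sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be the same length")
    return sum(match if a == b else mismatch
               for a, b in zip(aligned_a, aligned_b))


# 6 matching positions and 2 mismatches: 6 * 1 + 2 * (-1) = 4
print(unitary_score("ATTCTCGC", "ATTGTCGA"))  # 4
```

For proteins the same summation applies, but each per-position score is looked up in a substitution matrix such as BLOSUM 62 instead of being fixed at +1 or -1.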

BLOSUM vs. PAM

More Divergent    BLOSUM 45    PAM 250
                  BLOSUM 62    PAM 160
Less Divergent    BLOSUM 90    PAM 100

- BLOSUM 62 is the default matrix in BLAST. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

Sequence Similarity Searching - The statistics are important
- Discriminating between real and artifactual matches is done using an estimate of the probability that the match might occur by chance.

What do the Score and the E-value really mean?
- The quality of the alignment is represented by the Score.
  - Score (S): The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM), whereas gap scores are assigned empirically.
- The significance of each alignment is computed as an E value.
  - E value (E): Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

I'm confused! What does the E-value mean again?
- E value (E): Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
- When E < 0.01, P-values and E-values are nearly identical.
- So, the E-value is the number of times you expect to see your hit occur in the database (with as good as or better score) due to random chance alone.
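The link between the two statistics can be checked numerically. Assuming chance hits follow a Poisson distribution, the probability of seeing at least one chance hit is P = 1 - e^(-E). The short sketch below (function name hypothetical) shows that P and E nearly coincide for small E and diverge for large E.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit when e_value hits are expected."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (0.001, 0.01, 1.0, 5.0):
    print(f"E = {e:<6} P = {p_value_from_e_value(e):.5f}")
```

A P-value can never exceed 1, whereas an E-value can: E = 5 means five chance hits that good are expected, which corresponds to a P-value above 0.99.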

Notes on E-values
- Low E-values suggest that sequences are homologous
  - Can't show non-homology
- Statistical significance depends on both the size of the alignments and the size of the sequence database
  - Important consideration for comparing results across different searches
  - E-value increases as database gets bigger
  - E-value decreases as alignments get longer
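Both trends follow from the standard relationship between the E-value and the bit score S': E = m * n * 2^(-S'), where m is the query length and n is the total database length. The sketch below uses raw lengths for simplicity (BLAST applies corrected "effective" lengths), and the function name and example numbers are made up for illustration.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for bit score S', query length m, database length n."""
    return query_length * database_length * 2.0 ** (-bit_score)


small_db = expected_chance_hits(50, 300, 1_000_000_000)
big_db = expected_chance_hits(50, 300, 2_000_000_000)
higher_score = expected_chance_hits(60, 300, 1_000_000_000)

print(f"database of 1e9 letters:  E = {small_db:.3g}")
print(f"database of 2e9 letters:  E = {big_db:.3g}")
print(f"10 more bits of score:    E = {higher_score:.3g}")
```

Doubling the database doubles E for the same score, while every additional bit of score halves it. A longer alignment of the same quality accumulates a higher score, which is why its E-value is lower.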

Coverage
- Coverage: the aligned length as a proportion of the length of the query or subject.
- Example: Your gene is 1000 bp, and you have a BLAST alignment from 250-500. What is the query coverage?
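The example can be worked through in a few lines. This sketch assumes 1-based, inclusive alignment coordinates (as BLAST reports them) and a single alignment; the function name is hypothetical.

```python
def query_coverage(query_length, align_start, align_end):
    """Percent of the query covered by one alignment (1-based, inclusive)."""
    aligned_length = align_end - align_start + 1
    return 100.0 * aligned_length / query_length


# Positions 250 through 500 span 251 bases of a 1000 bp gene.
print(f"{query_coverage(1000, 250, 500):.1f}%")  # 25.1%
```

So the query coverage in the example is roughly 25%.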

FASTA File Format
- Plain text file (e.g. don't open with Word!)
- Each sequence has 2 parts:
  - One header line that starts with ">"
    - e.g. ">This is a fasta header. Any text goes here."
  - One or more sequence lines
    - e.g. "ATTCTCGCTCGAATCGATCGCATAGTAGCA"
- Each file can contain multiple sequences
- Sequences can be DNA or protein (not a mixture)
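Because the format is so simple, a minimal reader fits in a few lines. The sketch below assumes header lines begin with ">" as in standard FASTA; the function name and the example records are made up, and a repeated header would overwrite the earlier record. For real work an established parser (for example the one in Biopython) is a safer choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first header line")
        else:
            records[header].append(line)
    # Sequence lines belonging to one record are joined into a single string.
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 first example
ATTCTCGCTCGAATCG
ATCGCATAGTAGCA
>seq2 second example
GGGTTTAAACCC
"""

for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```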

Alignments

Databases
- NR "non-redundant" database
  - Sequences from various experiments (not just completed genomes)
  - May not be that "non-redundant"
- RefSeq
  - Sequences curated by NCBI
  - Does not contain duplicates
- Swissprot
  - A manually curated database of protein sequences
- Protein Data Bank
  - Contains protein sequences that have 3D structures available

Blast Web Demo
- Assignment 1
  - ment1.pdf
  - Due before next class

