Homology - Washington University Genetics

2y ago
15 Views
2 Downloads
5.16 MB
126 Pages
Last View : 15d ago
Last Download : 3m ago
Upload by : Allyson Cromer
Transcription

HomologyBio5488Ting Wang1/25/15, 1/27/15

GGGCCTTTTTT x x CTTTCCCGGCCTCCTGGCCAmatch: 1mismatch: -1matching score 16 How to align them? Why we can align them?Why 1 for match, and -1 for mismatch?What does the score mean?Is 16 a good score?

Outline Nobel-price-worthy work on homologyWhat is homology?How to detect homology?How to quantify homology?How to use homology?Homology beyond sequence analysisNext-gen sequencing alignment

Russell Doolittle(Bishop and Varmus)

Bishop and Varmus strategy (Nobel price 1989)Doolittle strategy (could be the first Nobel price for computational biology)What is the significance?

A few DefinitionsHomologs: genes/sequences sharing a common originOrthologs: genes originating from a single ancestralgene in the last common ancestor of the comparedgenomes; genes related via speciationParalogs: genes related via duplicationXenolog: sequences that have arisen out of horizontal transfer events (symbiosis,viruses, etc)Co-orthologs: two or more genes in one lineage that are, collectively, orthologous toone or more genes in another lineage due to a lineage-specific duplication(s)Outparalogs: paralogous genes resulting from a duplication(s) preceding a givenspeciation eventInparalogs: paralogous genes resulting from a lineage-specific duplication(s)subsequent to a given speciation event

Relation of sequencesNeed ancestral sequences to distinguish orthologs and paralogs(exercise on board)

Similarity versus Homology Similarity refers to the likeness or% identity between 2 sequences Similarity means sharing astatistically significant number ofbases or amino acids Similarity does not implyhomology Similarity can be quantified It is ok to say that two sequencesare X% identical It is ok to say that two sequenceshave a similarity score of Z It is generally incorrect to say thattwo sequences are X% similar Homology refers to sharedancestryTwo sequences are homologous ifthey are derived from a commonancestral sequenceHomology usually impliessimilarityLow complexity regions can behighly similar without beinghomologousHomologous sequences are notalways highly similarA sequence is either homologousor not.Never say two things are X%homologous

Why Compare Sequences? Sequence comparisons lie at the heart of all bioinformatics Identify sequences– What is this thing I just found? Compare new genes to known ones Compare genes from different species– information about evolution Guess functions for entire genomes full of new genesequences– Metagenomics What does it matter if two sequences are similar or not?– Globally similar sequences are likely to have the same biologicalfunction or role– Locally similar sequences are likely to have some physical shapeor property with similar biochemical roles– If we can figure out what one does, we may be able to figure outwhat they all do

Sequence alignment How to optimally align two sequences– Dot plots– Dynamic programming Global alignment Local alignment How to score an alignment Fast similar sequence search– BLAST– BLAT– More recent development: short read alignment Determine statistical significance Using information in multiple sequence alignment toimprove sensitivity

Visual Alignments (Dot Plots) Build a comparison matrix– Rows: Sequence #1– Columns: Sequence #2 Filling– For each coordinate, if the character in therow matches the one in the column, fill in thecell– Continue until all coordinates have beenexamined

Example Dot Plot

Noise in Dot Plots Nucleic Acids (DNA, RNA)– 1 out of 4 bases matches at random Windowing helps reduce noise– Can require X bp match before plotting– Percentage of bases matching in the windowis set as threshold

Met14 vsMet2“DotPlot”Match 1Mismatch -1Gray: 1MET14 (1000nt)

Met14 vsMet2Red: 5MET14 (1000nt)

chain of human hemoglobin chain of humanhemoglobin

MAZ: Myc associated zinc finger isoform 1 self alignment

Human vs ChimpY chromosomecomparison

Dot plots of DNA sequence identity betweenchimpanzee and human Y chromosomes and chromosomes 21.JF Hughes et al. Nature 000, 1-4 (2010) doi:10.1038/nature08700

Aligning sequences by residue Match: award Mismatch (substitution or mutation): penalize Insertion/Deletion (INDELS – gaps): penalize(gap open, gap extension)A L I G N M E N T - L I G A M E N T

More than one solution is possible Which alignment is best?A T C G G A T - C T A – C – G G – A C TA T C G G A T C T A – C G G – A C T

Alignment Scoring Scheme Possible scoring scheme:match: 2mismatch: -1indel –2 Alignment 1: 5*2 1*-1 4*-2 10 – 1 – 8 1 Alignment 2: 6*2 1*-1 2*-2 12 – 1 – 4 7

Dynamic Programming Global Alignments:– Needleman S.B. and Wunsch C.D. (1970) J. Mol. Biol. 48, 443-453 Local Alignments:– Smith T.F. and Waterman M.S. (1981) J. Mol. Biol.147, 195-197– One simple modification of Needleman/Wunsch: when a value in thescore matrix becomes negative, reset it to zero (begin of newalignment) Guaranteed to be mathematically optimal:– Given two sequences (and a scoring system) these algorithms areguaranteed to find the very best alignment between the twosequences! Slow N2 algorithm Performed in 2 stages Prepare a scoring matrix using recursive function Scan matrix diagonally using traceback protocol

Dynamic 10000G GT0000000I00000100E EC0000000N NS000010010E EGENESIST*SG6040302020100I IE4050302020100N3030402020100CS 100S000010010

DP (demo) Match 5, mismatch -3, gap -4

DP (demo)S1,1 MAX{S0,0 5, S1,0 - 4, S0,1 – 4,0} MAX{5, -4, -4, 0} 5

DP (demo)S1,2 MAX{S0,1 -3, S1,1 - 4, S0,2 – 4, 0} MAX{0 - 3, 5 – 4, 0 – 4, 0} MAX{-3, 1, -4, 0} 1

DP (demo)S1,3 MAX{S0,2 -3, S1,2 - 4, S0,3 – 4, 0} MAX{0 - 3, 1 – 4, 0 – 4, 0} MAX{-3, -3, -4, 0} 0

DP (demo)

Trace Back (Local Alignment) Maximum local alignment score is thehighest score anywhere in the matrix (14in this example) 14 is found in two separate cells,indicating two possible multiple alignmentsproducing the maximal local alignmentscore

Trace Back (Local Alignment) Traceback begins in the position with the highestvalue. At each cell, we look to see where we move nextaccording to the pointers When a cell is reached where there is not apointer to a previous cell, we have reached thebeginning of the alignment

Trace Back Demo

Trace Back Demo

Trace Back Demo

Maximum Local AlignmentG A A T T C - A G G A T – C G AG A A T T C - A G G A – T C G A - - - 5 3 5 5 4 5 4 5 14 - - - 5 3 5 4 5 5 4 5 14

Linear vs. Affine Gaps So far, gaps have been modeled as linear More likely contiguous block of residues insertedor deleted– 1 gap of length k rather than k gaps of length 1 Can create scoring scheme to penalize big gapsrelatively less– Biggest cost is to open new gap, but extendingis not so costly

Affine Gap Penalty wx g r(x-1)wx : total gap penaltyg: gap open penaltyr: gap extend penaltyx: gap length gap penalty chosen relative to score matrix

Scoring Alignments Pick a scoring matrix BLOSUM62 PAM250 Match 5, mismatch -4 Decide on gap penalties -gap opening penalty (-8) -gap extension penalty (-1) Assume every position is independent Sum scores at each position [log(x*y) logx logy]

Scoring MatricesSij qijlog()pi p jl An empirical model of evolution, biology and chemistry allwrapped up in a 20 X 20 (or 4 X 4) table of numbers Structurally or chemically similar residues should ideally havehigh diagonal or off-diagonal numbers Structurally or chemically dissimilar residues should ideallyhave low diagonal or off-diagonal numbers What does the score mean: The likelihood of seeing tworesidues align (preserved) than random expected.

Scoring AlignmentsBlosum62 Scoring Matrix

BLOSUM substitution matricesDeveloped for distantly related proteinsSubstitutions only from multiple alignments ofconserved regions of protein families, hand curated,constitute the known homologous blocksIdentity threshold to define conserved blocks can bevaried, e.g. 62% idenitity gives BLOSUM62Scores calculated from frequency of amino acids inaligned pairs compared to what would be expecteddue to abundance alone, given all sequencesHenikoff and Henikoff (1992) Proc. Natl. Acad. Sci. USA 89, 10915-19

Blosum MatriciesWhat score should we give to a ser residuealigned with a thr residue?P(S : T homology)score (S : T) log 2P(S : T random)

example of deriving Blosum scores forS:S, S:T, and T:TDatabase of known alignmentsSDHIP HKSASDHLP HRTASDHIP omology Model (consider each pair of sequences separately)S:S pairs in alignments 11S:T pairs in alignments 6T:T pairs in alignments 9Total pairs in alignments 117P(S:S homology) 11/117 .094P(S:T homology) 6/117 .051P(T:T homology) 9/117 .078

example of deriving Blosum scores forS:S, S:T, and T:TDatabase of known alignmentsSDHIP HKSASDHLP HRTASDHIP HKSGSEHLPSEHLPWMFET RTQCWMFDT RTNCWLFDT KTQCKSQCKTQCRandom ModelNumber of S residues 8Number of T residues 8Total residues 72P(S:S random) P(S)P(S) (8/72)2 .012P(S:T random) 2*P(S)P(T) 2*(8/72)2 .024P(T:T random) P(T)P(T) (8/72)2 .012

example of deriving Blosum scores forS:S, S:T, and T:TP(S : S homology).094score(S : S) log 2 log 2 2.96P(S : S random).012P(S : T homology).051score(S : T) log 2 log 2 1.09P(S : T random).024P(T : T homology).078score(T : T) log 2 log 2 2.70P(T : T random).012

BLOSUM and PAMBLOSUM 45PAM 250More DivergentBLOSUM 62PAM 160BLOSUM 90PAM 100Less Divergent BLOSUM 62 is the default matrix in BLAST 2.0.Though it is tailored for comparisons of moderatelydistant proteins, it performs well in detecting closerrelationships. A search for distant relatives may bemore sensitive with a different matrix. PAM matrices: point accepted mutation

Scoring Matrices Take Home Points Based on log odds scores– Ratios 1 give positive scores, ratios 1 givenegative scores– Because log(x*y) logx logy the score of an alignment isthe sum of the scores for each pair of aligned residues Assume independence of adjacent residueswhen scoring Introduced the concept that the frequency ofa residue in a multiple alignment isinformative

Fast Similar Sequence Search Can we run Smith-Waterman betweenquery and every DB sequence? Yes, but too slow! General approach– Break query and DB sequence to matchsubsequences– Extend the matched subsequences, filterhopeless sequences– Use dynamic programming to get optimalalignment

BLAST Basic Local Alignment Search Tool Altschul et al. J Mol Biol. 1990 One of the most widely used bioinformaticsapplications– Alignment quality not as good as Smith-Waterman– But much faster, supported at NCBI with big computercluster For tutorials or BLASTinfo/information3.html

BLAST Algorithm Steps Query and DB sequences are optionallyfiltered to remove low-complexity regions– E.g. ACACACACA, TTTTTTTTT

BLAST Algorithm Steps Query and DB sequences are optionallyfiltered to remove low-complexity regions Break DB sequences into k-mer words andhash their locations to speed later searches– k is usually 11 for DNA/RNA and 3 for proteinLPPQGLLLPPPPQPQGQGLGLL

BLAST Algorithm Steps Query and DB sequences are optionallyfiltered to remove low-complexity regions Break DB sequences into k-mer wordsand hash their locations to speed latersearches Each k-mer in query find possible k-mersthat matches well with it– “well” is evaluated by substitution matrices

BLAST Algorithm Steps Only words with T cutoff score is kept– T is usually 11-13, 50 words make T cutoff– Note: this is 50 words at every query position For each DB sequence with a high scoring word,try to extend it in both endsQuery:LP PQG LLDB seq:MP PEG LLHSP score 9 15 8 32– Form HSP (High-scoring Segment Pairs)– Use BLOSUM to score the extended alignment– No gaps allowed

The BLAST Search AlgorithmQuery Word{Query: EGPRGPKGPNGPDGPHGPMGPSGPQAPQN 1815141413131313131212Score Threshold (13)325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365 LA L TP G R W P D ER A290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330High-scoring Segment Pair

BLAST Algorithm Steps Keep only statistically significant HSPs– Based on the scores of aligning 2 random seqs Use Smith-Waterman algorithm to join the HSPs and getoptimalalignment– Gaps are alloweddefault (-11, -1)

BLAST algorithm summary“query”“seeds”“subjects” (database)Scan the index and find all word hits(111111)(111000111)DP extension to recover the high scoring pairs“neighborhood words”(branch and bound algorithm)Extending high scoring pairsIndexing all seedsEvaluate Significance of HSPs byKarlin-Altschul Statistic: E KMNexp(-lambda*S)

Different BLAST ProgramsBLAST DB: nr (non-redundant):–GenBank, RefSeq,EMBL est:– expressedsequences (cDNA),redundant Swissprot and pdb:–protein databasesIf query is DNA, butknown to be coding(e.g. cDNA)– Translate cDNAinto protein– Zero gapextension penalty

Different BLAST ProgramsProgramDescriptionblastpCompares an amino acid query sequence against a protein sequencedatabase.blastnCompares a nucleotide query sequence against a nucleotide sequencedatabase.blastxCompares a nucleotide query sequence translated in all reading framesagainst a protein sequence database. You could use this option to findpotential translation products of an unknown nucleotide sequence.tblastnCompares a protein query sequence against a nucleotide sequencedatabase dynamically translated in all reading frames.tblastxCompares the six-frame translations of a nucleotide query sequenceagainst the six-frame translations of a nucleotide sequence database.Please note that the tblastx program cannot be used with the nrdatabase on the BLAST Web page because it is too computationallyintensive.

PSI-BLAST Position Specific Iterative BLAST– Align high scoring hits in initial BLAST toconstruct a profile for the hits– Use profile (PSSM) for next iteration BLASTQuerySeq2Seq1Seq3Seq4 Find remote homologs or protein families FP sequences can degrade search quickly

PSI-BLASTInitial Refined MatchesPSSM Gly-30-3020-60-70His bi.ac.uk/blastpgp/

Reciprocal Blast Search for orthologous sequencesbetween two species– GeneA in Species1 BLAST Species2 GeneB– GeneB in Species2 BLAST Species1 GeneA– GeneAorthologousGeneB Also called bi-directional best hit

BLAT BLAST-Like Alignment Tool– Compare to BLAST, BLAT can align much longerregions (MB) really fast with little resources– E.g. can map a sequence to the genome in secondson one Linux computer– Allow big gaps (mRNA to genome)– Need higher similarity ( 95% for DNA and 80% forproteins) for aligned sequences Basic approach– Break long sequence into blocks– Index k-mers, typically 8-13– Stitch blocks together for final alignment

BLAT: IndexingGenome: cacaattatcacgaccgc3-mers: cac aat tat cac gac cgcIndex:aat 3cac 0,9cgc 15gac 12tat 6cDNA (mRNA - DNA): aattctcac3-mers: aat att ttc tct ctc tca cac0123456hits: aat 0,3cac 6,0cac 6,9-36-3clump: cacAATtatCACgaccgc aattctcac

BLAT Example Enter sequence and parameters

BLAT Example Get result instantly!!

Summary of Fast Search Fast sequence similarity search– Break seq, hash DB sub-seq, match sub-seqand extend, use DP for optimal alignment– *BLAST, most widely used, many applicationswith sound statistical foundations– *BLAT, align sequence to genome, fast yetneed higher similarity

BLAST score and significance Report DB sequences above a threshold– E value: Number (instead of probability pvalue) of matches expected merely by chanceE Kmn e S-xp(s ³ x) »1- exp[-e ] m, n are query and DB length K, are constants Smaller E, more stringent

Are these proteins homologs?SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLYL PW LY NY CLProbably not (score 9)SEQ 2: QFFPLMPPAPYWILATDYENLPLVYSCTTFFWLFSEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLYL PW LDATYKNYAY CLMAYBE (score 15)SEQ 2: QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLFSEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLYRVV L PSW LDATYKNYAY CDVTYKLSEQ 2: RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLFMost likely (score 24)

Significance of 45Low score unrelatedHigh score homologsHow high is high enough?

Other significance questions Pairwise sequence comparison scoresMicroarray expression measurementsSequence motif scoresFunctional assignments of genesCall peaks from ChIP-seq data

The null hypothesis We are interested in characterizing thedistribution of scores from sequence comparisonalgorithms. We would like to measure how surprising a givenscore is, assuming that the two sequences arenot related. The assumption is called the null hypothesis. The purpose of most statistical tests is todetermine whether the observed results provide areason to reject the hypothesis that they aremerely a product of chance factors.

Gaussian vs.Extreme Value Distribution (EVD)Gaussian (Normal)Extreme value

Gaussian 68 in., 3 in. What is the chance of picking a person at least 75 in. tall P(X 75)?z score( x) x 75 68 2.333From Table:z 2.33 P 0.01

EVD , , , Each alignment/score that BLAST returns is thevery best alignment/score among a large numberof alignments/scores for those two sequences (ieand EVD problem).(Altschul et al, Nat. Genet. 1994)

Computing a p-value The probability ofobserving a score 4is the area under thecurve to the right of 4. This probability iscalled a p-value. p-value Pr(data null)

Scaling the EVD An extreme value distribution derived from, e.g., theSmith-Waterman algorithm will have a characteristicmode μ and scale parameter λ. P S x 1 exp e x These parameters depend upon the size of the query,the size of the target database, the substitution matrixand the gap penalties.

An exampleYou run BLAST and get a score of 45. You then run BLAST on a shuffledversion of the database, and fit an extreme value distribution to the resultingempirical distribution. The parameters of the EVD are μ 25 and λ 0.693.What is the p-value associated with 45? P S x 1 exp e x 1 exp e 1 exp 9.565 10 P S 45 1 exp e 0.693 45 25 13.86 7 1 0.999999043 9.565 10 7

Summary of statistical significance A distribution plots the frequency of a given type ofobservation. The area under the distribution is 1. Most statistical tests compare observed data to theexpected result according to the null hypothesis. Sequence similarity scores follow an extreme valuedistribution, which is characterized by a larger tail. The p-value associated with a score is the areaunder the curve to the right of that score.

Applying homology: concept and technology Genome evolution– Mammalian genome evolution– Human genome variation– Cancer genome evolution Gene finding– Comparative approaches– Ab initio approaches Hidden Markov ModelProtein structure– Threading Regulatory motif finding– Profile comparison Pathway/Network comparison– PathBLAST Conservation– Ultra conserved elements– Human accelerated regions

Gene prediction Comparing to a known gene from a differentspecies Using EST evidence (aligning transcript to genome) Predicting from sequence (HHM) Using conservation– Signature of coding potential– What about RNA gene? Using other genomics signals– Specific epigenetic marks of promoters and genebodies

Modeling gene featuresExon 1[human]Intron 1Exon 2Intron 2Exon 35’3’[mouse]CNSCNSCNS

Genscan (Burge and Karlin, 1998) Dramatic improvementover previous methods Generalised HMM Different parametersets for different GCcontent regions (intronlength distribution andexon stats)

Predicting non-coding RNA? From sequence?– Not clear which properties can be exploited– Sequence features such as promoters are tooweak Histone modifications conservation worked

So far, only linear sequence comparisonA C T T G A A T T C G C C TCGAATTCAC

Expanding the idea of a sequence

Central theme of the new algorithm –compare profilesACGTT810801G003010C004012A071065 60006000114000606000501001050510

Met14 vsMet2“DotPlot”Match 1Mismatch -1Gray: 1MET14 (1000nt)

Met14 vsMet2Red: 5MET14 (1000nt)

Met14 vsMet2MET14 (1000nt)PhyloNetTTTCACGTGAP 1.75E-5HSPs:E 0.1P 0.002P 0.003P 0.03P 0.02

PathBlast, NetworkBlastTrey Ideker

Comparative Genomics Functional DNA often evolves slower than neutral DNA. To detect functional elements:– align genomes of related species,– and find regions of high conservation. The difference between conservation and constraint.

Ultra conserved elementsultra conservede.d 12.5

HARs: Human accelerated regions 118 bp segment with 18 changes betweenthe human and chimp sequences Expect less than 1

Human HAR1F differs from the ancestralRNA stucture

Aligning Short Reads

PastandCurrent Sequencing0 and1st generationsequencingTechnologiesPre-1992“old fashionedway”S35 ddNTPsGelsManual loadingManual base calling1992-1999ABI 373/377Fluorescent ddNTPs*GelsManual loadingAutomated base calling*1999ABI 3700Fluorescent ddNTPsCapillaries*Robotic loading*Automated base callingBreaks down frequently2003ABI 3730XLFluorescent ddNTPsCapillariesRobotic loadingAutomated base callingReliable*

Nextor chnology454/Roche GS-20/FLX(Oct 2005)ABI SOLiD(Oct 2007)Illumina/Solexa1G Genetic Analyser (Feb 2007)

ClustergenerationClusterGeneration8 channels (lanes)

IGA without cover

Flowcell imagingFlow cell imaging

A flow cellA flow cell contains eight lanesLane 1Lane 2.Lane 8Each lane/channel contains three columns of tilesColumn 1Column 2Column 3Each column contains 100 tilesTileEach tile is imaged four times percycle – one image per base.20K-30KClusters345,600 images for a 36-cycle run350 X 350 µm

Data analysis pipelineData Analysis PipelineFirecresttiff image files(345,600)AdditionalData AnalysisBustardintensity filesSequence filesElandAlignment to Genome

Primary tools and analysis tasks Image processing– (unique to each manufacturer) Basecalling– (unique to each manufacturer) Align sequence reads to reference genome Assemble contigs and whole genomes usingquality scores and/or paired-end information Peak finding for Chip-Seq applications– (and statistics to validate, map to regulated genes,etc) SNP calling/genotyping Transcript profiling– measure gene expression, identifying alternative splicing, etc.

NGS: Sequence alignment Map the large numbers of short reads to areference genome– In a broader sense: Identify similar sequences(DNA, RNA, or protein) in consequence offunctional, structural, or evolutionary relationshipsbetween the them– Applications: Genome assembly, SNP detection,homology search, etc large faster search speedshort greater search sensitivity.

Mapping Reads Back Hash Table (Lookup table)– FAST, but requires perfect matches Array Scanning– Can handle mismatches, but not gaps Dynamic Programming (Smith Waterman, Forward,Viterbi)– Indels– Mathematically optimal solution– Slow (most programs use Hash Mapping as a prefilter) Burrows-Wheeler Transform (BW Transform)– FAST (memory efficient)– But for gaps/mismatches, it lacks sensitivity

Many short read aligners

Short read mapping Input:– A reference genome– A collection of many 25-100bp tags (reads)– User-specified parameters Output:– One or more genomic coordinates for each tag In practice, only 70-75% of tags successfullymap to the reference genome. Why?

Multiple mapping A single tag may occur more than once inthe reference genome. The user may choose to ignore tags thatappear more than n times. As n gets large, you get more data, butalso more noise in the data.

Inexact matching? An observed tag may not exactly match any position inthe reference genome. Sometimes, the tag almost matches one or morepositions. Such mismatches may represent a SNP or a bad readout. The user can specify the maximum number ofmismatches, or a phred-style quality score threshold. As the number of allowed mismatches goes up, thenumber of mapped tags increases, but so does thenumber of incorrectly mapped tags.

Using base qualities to evaluateREAD: AGGTCCGGGATACCGGGGACBETTER!Q: 30CHR1: CGGTCCGGGATACCGGGGACCHR2: AGGTCCGGGATACCGGGGGTBETTER!Q: 10 10

Hash table (Eland, SOAP) Main idea: preprocess genome to speed up queries– Hash every substring of length k– K is a tiny constant For each query p, can easily retrieve all suffixes of thegenome that start with p1, p2, pk. Easy to implement. Significant speed up in practice. Large memory consumption. Inexact match is difficult.– Need multiple hash tables– More memory

Spaced seedalignment (MAQ) Tags and tag-sizedpieces of reference arecut into small “seeds.” Pairs of spaced seedsare stored in an index. Look up spaced seeds foreach tag. For each “hit,” confirmthe remaining positions. Report results to the user.

Index the reference genome: Suffix Tree Each suffix correspondsto exactly one path fromthe root to a leaf Edges spell non-emptystrings Construction: linear timeand space Check if a string oflength m is a substring Each substring is aprefix of a suffix!

Burrows-Wheeler(Bowtie, BWA) Store entire referencegenome. Align tag base by basefrom the end. When tag is traversed, allactive locations arereported. If no match is found, thenback up and try asubstitution.

Why Burrows-Wheeler? BWT very compact:– Approximately ½ byte per base– As large as the original text, plus a few“extras”– Can fit onto a standard computer with 2GB ofmemory Linear-time search algorithm– proportional to length of query for exactmatches

Burrows-Wheeler Transform (BWT)BWTacaacg acaacgaacg acacaacg acg acacaacg acg acaag acaacgc aaacBurrows-Wheeler Matrix (BWM)

Key observationa1c1a2a3c2g1 1“last first (LF) mapping”The i-th occurrence of character X inthe last column corresponds tothe same text character as the i-thoccurrence of X in the first column.1 acaacg12aacg ac11acaacg 13acg aca21caacg a12cg acaa31g acaac2

Burrows-Wheeler Matrix314256 acaacgaacg acacaacg acg acacaacg acg acaag acaacSee the suffix array?

Burrows-Wheeler Transform Originally designed for data compression for large text Burrows-Wheeler matrix: sort lexicographically all cyclicrotations of S BWT(S): the last column of Burrows-Wheeler matrix Compression: runs of repeated characters are easy tocompress using move-to-front transform and run-lengthencoding, etc. BWT(S) is a reversible permutation of S

Reverse Burrows-Wheeler Transform BW Matrix Property: Last-First (LF)Mapping The ith occurrence of character X in thelast column correspond to the same textcharacter as the ith occurrence of X in thefirst column

qc the nextSearchingcharacter to BWTthe left in the queryBWT Exact MatchingFerragina P, Manzini G: Opportunistic data structures with applications. FOCS. IEEE Computer Society; 2000.

Searching BWTBWT(agcagcagact) tgcc ggaaaacgcagcaSearch for pattern: gcagcagca agcagcagact agcagcagact agcagcagact agcagcagactact agcagcagact agcagcagact agcagcagact agcagcagagact agcagcagact agcagcagact agcagcagact agcagcagcagact agcagcagact agcagcagact agcagcagact agcagcagcagact agcagcagact agcagcagact agcagcagact cagact agcagcagact agcagcagact agcagcagact agcagcagcagact agcagcagact agcagcagact agcagcagact agct agcagcagact agcagcagact agcagcagact agcagcagagact agcagcagact agcagcagact agcagcagact agcagcagcagact agcagcagact agcagcagact agcagcagact agcagcagcagact agcagcagact agcagcagact agcagcagact at agcagcagact agcagcagact agcagcagact agcagcagac

SummaryAvoidsomeexcessivebacktrackingA Bowtie genome index consists ofthe BWT plusauxiliarydata structures. All told,Bowtie index: memory footprintthe index is only about 65% larger than the input genome. This is radically smaller thanother indexing schemes such as suffix trees, suffix arrays, and seed hash tables. Thisallows Bowtie to run easily and efficiently on a typical computer with 2 gigabytes of RAM.owtie index: memory footprintHuman genome memory footprintHuman genome memory footprintBowtie IndexSuffix Tree1.3 gigabytes 35 gigabytesSuffix ArrayHash Tables 12 gigabytes 12 gigabytesMemory footprintBowtieSearch AlgorithmThe index is only about 65% larger than the input genometh occurrenceThis allows Bowtie to run easily andonofa typical computer with 2 GBthe iefficientlyThe Burrows-Wheeler MatrixRAM.character X in thehas a property called the LF 19last column The Bowtie index can be distributedover the Internet.mapping: the ith occurrence of

BLOSUM and PAM BLOSUM 45 BLOSUM 62 BLOSUM 90 PAM 250 PAM 160 PAM 100 More Divergent Less Divergent BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

Related Documents:

Keywords: Homology, evolution, vertebrates, fossils, comparative anatomy Introduction One of the most fundamental concepts underlying student understanding of the theory of evolution is homology, a term coined by Sir Richard Owen in 1848, which discusses the structure of the skeleton in vertebrates (Owen, 1848). Homology may

Genetics – A Continuity of Life – Daniel Fairbanks, Ralph Anderson. Concepts of Genetics – Klug and Cummings. Principles of Genetics – Hartt and Jones. GN 5 B 07 : CORE COURSE VII Medical Genetics Total – 54 hrs Unit- 1: Principles of Human Genetics (2 hrs) History, Origin of medical genetics, classification of genetic disease .

Somatic cell genetics. Books Recommended: 1. Genetics - Gardener 2. Molecular Genetics of Bacteria 2nd edition 1995, Jeremy W.Dale.-John Wiley and sons. 3.Cell biology (1993)-David E.Sadva (Jones and Barrette) 4.Modern genetics (2nd edition,1984)-A.J.Ayala and W.Castra(Goom Helns,London) 5. Genetics by P.K Gupta. 6. Genetics by Verma and Agarwal 7.

HUMAN GENETICS AND PEDIGREES 7.4 . Key Concept A combination of methods is used to study human genetics . Human Genetics Human genetics follows the patterns seen in other organisms The basic principles of genetics

describing this Rosetta-focused methodology, we note that other groups have also developed excellent tools outside Rosetta for RNA homology modeling, and we advise the reader to try multiple tools and check for Xu&Chen,2018).We also note that current homology modeling methods can be computationally

invariants in the remaining case, that of rational homology spheres. Around the same time, Lim [12] succeeded in providing a combinatorial de-scription of the Seiberg–Witten invariants of integral homology spheres. Namely, in this case they coincide with the Casson invariant. In [19] we investigated a spe-

For any positive integer l, let Z (l)"ZC1 (l!1)!n!D. For a rational homology 3-sphere M (i.e. a closed oriented 3-manifold whose homology group H 1 (M,Z)is"nite), let Q[[x]](M)be the set of all power series l d l x l such that d l is a number in ZC 1 (2l#n(n!1))!DH 1 (M,Z)DD1.2. Quantum invariants of oriented framed links and 3-manifolds For a framed oriented link ‚ with m ordered .

on Embedded Contact Homology [1] as well as his 2008 lectures at MSRI. The fundamental objects of study for Embedded Contact Homology (ECH) are Reeb orbits on contact manifolds, so we de ne: De nition 1.1. A contact manifold (Y; ) is a smooth, 2n 1 dimensional manifold to