Homology - Washington University Genetics

2y ago

15 Views

2 Downloads

5.16 MB

126 Pages

Last View : 15d ago

Last Download : 3m ago

Upload by : Allyson Cromer

Report this link

Download PDF

Transcription

HomologyBio5488Ting Wang1/25/15, 1/27/15

GGGCCTTTTTT x x CTTTCCCGGCCTCCTGGCCAmatch: 1mismatch: -1matching score 16 How to align them? Why we can align them?Why 1 for match, and -1 for mismatch?What does the score mean?Is 16 a good score?

Outline Nobel-price-worthy work on homologyWhat is homology?How to detect homology?How to quantify homology?How to use homology?Homology beyond sequence analysisNext-gen sequencing alignment

Russell Doolittle(Bishop and Varmus)

Bishop and Varmus strategy (Nobel price 1989)Doolittle strategy (could be the first Nobel price for computational biology)What is the significance?

A few DefinitionsHomologs: genes/sequences sharing a common originOrthologs: genes originating from a single ancestralgene in the last common ancestor of the comparedgenomes; genes related via speciationParalogs: genes related via duplicationXenolog: sequences that have arisen out of horizontal transfer events (symbiosis,viruses, etc)Co-orthologs: two or more genes in one lineage that are, collectively, orthologous toone or more genes in another lineage due to a lineage-speciﬁc duplication(s)Outparalogs: paralogous genes resulting from a duplication(s) preceding a givenspeciation eventInparalogs: paralogous genes resulting from a lineage-speciﬁc duplication(s)subsequent to a given speciation event

Relation of sequencesNeed ancestral sequences to distinguish orthologs and paralogs(exercise on board)

Similarity versus Homology Similarity refers to the likeness or% identity between 2 sequences Similarity means sharing astatistically significant number ofbases or amino acids Similarity does not implyhomology Similarity can be quantified It is ok to say that two sequencesare X% identical It is ok to say that two sequenceshave a similarity score of Z It is generally incorrect to say thattwo sequences are X% similar Homology refers to sharedancestryTwo sequences are homologous ifthey are derived from a commonancestral sequenceHomology usually impliessimilarityLow complexity regions can behighly similar without beinghomologousHomologous sequences are notalways highly similarA sequence is either homologousor not.Never say two things are X%homologous

Why Compare Sequences? Sequence comparisons lie at the heart of all bioinformatics Identify sequences– What is this thing I just found? Compare new genes to known ones Compare genes from different species– information about evolution Guess functions for entire genomes full of new genesequences– Metagenomics What does it matter if two sequences are similar or not?– Globally similar sequences are likely to have the same biologicalfunction or role– Locally similar sequences are likely to have some physical shapeor property with similar biochemical roles– If we can figure out what one does, we may be able to figure outwhat they all do

Sequence alignment How to optimally align two sequences– Dot plots– Dynamic programming Global alignment Local alignment How to score an alignment Fast similar sequence search– BLAST– BLAT– More recent development: short read alignment Determine statistical significance Using information in multiple sequence alignment toimprove sensitivity

Visual Alignments (Dot Plots) Build a comparison matrix– Rows: Sequence #1– Columns: Sequence #2 Filling– For each coordinate, if the character in therow matches the one in the column, fill in thecell– Continue until all coordinates have beenexamined

Example Dot Plot

Noise in Dot Plots Nucleic Acids (DNA, RNA)– 1 out of 4 bases matches at random Windowing helps reduce noise– Can require X bp match before plotting– Percentage of bases matching in the windowis set as threshold

Met14 vsMet2“DotPlot”Match 1Mismatch -1Gray: 1MET14 (1000nt)

Met14 vsMet2Red: 5MET14 (1000nt)

chain of human hemoglobin chain of humanhemoglobin

MAZ: Myc associated zinc finger isoform 1 self alignment

Human vs ChimpY chromosomecomparison

Dot plots of DNA sequence identity betweenchimpanzee and human Y chromosomes and chromosomes 21.JF Hughes et al. Nature 000, 1-4 (2010) doi:10.1038/nature08700

Aligning sequences by residue Match: award Mismatch (substitution or mutation): penalize Insertion/Deletion (INDELS – gaps): penalize(gap open, gap extension)A L I G N M E N T - L I G A M E N T

More than one solution is possible Which alignment is best?A T C G G A T - C T A – C – G G – A C TA T C G G A T C T A – C G G – A C T

Alignment Scoring Scheme Possible scoring scheme:match: 2mismatch: -1indel –2 Alignment 1: 5*2 1*-1 4*-2 10 – 1 – 8 1 Alignment 2: 6*2 1*-1 2*-2 12 – 1 – 4 7

Dynamic Programming Global Alignments:– Needleman S.B. and Wunsch C.D. (1970) J. Mol. Biol. 48, 443-453 Local Alignments:– Smith T.F. and Waterman M.S. (1981) J. Mol. Biol.147, 195-197– One simple modification of Needleman/Wunsch: when a value in thescore matrix becomes negative, reset it to zero (begin of newalignment) Guaranteed to be mathematically optimal:– Given two sequences (and a scoring system) these algorithms areguaranteed to find the very best alignment between the twosequences! Slow N2 algorithm Performed in 2 stages Prepare a scoring matrix using recursive function Scan matrix diagonally using traceback protocol

Dynamic 10000G GT0000000I00000100E EC0000000N NS000010010E EGENESIST*SG6040302020100I IE4050302020100N3030402020100CS 100S000010010

DP (demo) Match 5, mismatch -3, gap -4

DP (demo)S1,1 MAX{S0,0 5, S1,0 - 4, S0,1 – 4,0} MAX{5, -4, -4, 0} 5

DP (demo)S1,2 MAX{S0,1 -3, S1,1 - 4, S0,2 – 4, 0} MAX{0 - 3, 5 – 4, 0 – 4, 0} MAX{-3, 1, -4, 0} 1

DP (demo)S1,3 MAX{S0,2 -3, S1,2 - 4, S0,3 – 4, 0} MAX{0 - 3, 1 – 4, 0 – 4, 0} MAX{-3, -3, -4, 0} 0

DP (demo)

Trace Back (Local Alignment) Maximum local alignment score is thehighest score anywhere in the matrix (14in this example) 14 is found in two separate cells,indicating two possible multiple alignmentsproducing the maximal local alignmentscore

Trace Back (Local Alignment) Traceback begins in the position with the highestvalue. At each cell, we look to see where we move nextaccording to the pointers When a cell is reached where there is not apointer to a previous cell, we have reached thebeginning of the alignment

Trace Back Demo

Maximum Local AlignmentG A A T T C - A G G A T – C G AG A A T T C - A G G A – T C G A - - - 5 3 5 5 4 5 4 5 14 - - - 5 3 5 4 5 5 4 5 14

Linear vs. Affine Gaps So far, gaps have been modeled as linear More likely contiguous block of residues insertedor deleted– 1 gap of length k rather than k gaps of length 1 Can create scoring scheme to penalize big gapsrelatively less– Biggest cost is to open new gap, but extendingis not so costly

Affine Gap Penalty wx g r(x-1)wx : total gap penaltyg: gap open penaltyr: gap extend penaltyx: gap length gap penalty chosen relative to score matrix

Scoring Alignments Pick a scoring matrix BLOSUM62 PAM250 Match 5, mismatch -4 Decide on gap penalties -gap opening penalty (-8) -gap extension penalty (-1) Assume every position is independent Sum scores at each position [log(x*y) logx logy]

Scoring MatricesSij qijlog()pi p jl An empirical model of evolution, biology and chemistry allwrapped up in a 20 X 20 (or 4 X 4) table of numbers Structurally or chemically similar residues should ideally havehigh diagonal or off-diagonal numbers Structurally or chemically dissimilar residues should ideallyhave low diagonal or off-diagonal numbers What does the score mean: The likelihood of seeing tworesidues align (preserved) than random expected.

Scoring AlignmentsBlosum62 Scoring Matrix

BLOSUM substitution matricesDeveloped for distantly related proteinsSubstitutions only from multiple alignments ofconserved regions of protein families, hand curated,constitute the known homologous blocksIdentity threshold to define conserved blocks can bevaried, e.g. 62% idenitity gives BLOSUM62Scores calculated from frequency of amino acids inaligned pairs compared to what would be expecteddue to abundance alone, given all sequencesHenikoff and Henikoff (1992) Proc. Natl. Acad. Sci. USA 89, 10915-19

Blosum MatriciesWhat score should we give to a ser residuealigned with a thr residue?P(S : T homology)score (S : T) log 2P(S : T random)

example of deriving Blosum scores forS:S, S:T, and T:TDatabase of known alignmentsSDHIP HKSASDHLP HRTASDHIP omology Model (consider each pair of sequences separately)S:S pairs in alignments 11S:T pairs in alignments 6T:T pairs in alignments 9Total pairs in alignments 117P(S:S homology) 11/117 .094P(S:T homology) 6/117 .051P(T:T homology) 9/117 .078

example of deriving Blosum scores forS:S, S:T, and T:TDatabase of known alignmentsSDHIP HKSASDHLP HRTASDHIP HKSGSEHLPSEHLPWMFET RTQCWMFDT RTNCWLFDT KTQCKSQCKTQCRandom ModelNumber of S residues 8Number of T residues 8Total residues 72P(S:S random) P(S)P(S) (8/72)2 .012P(S:T random) 2*P(S)P(T) 2*(8/72)2 .024P(T:T random) P(T)P(T) (8/72)2 .012

example of deriving Blosum scores forS:S, S:T, and T:TP(S : S homology).094score(S : S) log 2 log 2 2.96P(S : S random).012P(S : T homology).051score(S : T) log 2 log 2 1.09P(S : T random).024P(T : T homology).078score(T : T) log 2 log 2 2.70P(T : T random).012

BLOSUM and PAMBLOSUM 45PAM 250More DivergentBLOSUM 62PAM 160BLOSUM 90PAM 100Less Divergent BLOSUM 62 is the default matrix in BLAST 2.0.Though it is tailored for comparisons of moderatelydistant proteins, it performs well in detecting closerrelationships. A search for distant relatives may bemore sensitive with a different matrix. PAM matrices: point accepted mutation

Scoring Matrices Take Home Points Based on log odds scores– Ratios 1 give positive scores, ratios 1 givenegative scores– Because log(x*y) logx logy the score of an alignment isthe sum of the scores for each pair of aligned residues Assume independence of adjacent residueswhen scoring Introduced the concept that the frequency ofa residue in a multiple alignment isinformative

Fast Similar Sequence Search Can we run Smith-Waterman betweenquery and every DB sequence? Yes, but too slow! General approach– Break query and DB sequence to matchsubsequences– Extend the matched subsequences, filterhopeless sequences– Use dynamic programming to get optimalalignment

BLAST Basic Local Alignment Search Tool Altschul et al. J Mol Biol. 1990 One of the most widely used bioinformaticsapplications– Alignment quality not as good as Smith-Waterman– But much faster, supported at NCBI with big computercluster For tutorials or BLASTinfo/information3.html

BLAST Algorithm Steps Query and DB sequences are optionallyfiltered to remove low-complexity regions– E.g. ACACACACA, TTTTTTTTT

BLAST Algorithm Steps Query and DB sequences are optionallyfiltered to remove low-complexity regions Break DB sequences into k-mer words andhash their locations to speed later searches– k is usually 11 for DNA/RNA and 3 for proteinLPPQGLLLPPPPQPQGQGLGLL

BLAST Algorithm Steps Query and DB sequences are optionallyfiltered to remove low-complexity regions Break DB sequences into k-mer wordsand hash their locations to speed latersearches Each k-mer in query find possible k-mersthat matches well with it– “well” is evaluated by substitution matrices

BLAST Algorithm Steps Only words with T cutoff score is kept– T is usually 11-13, 50 words make T cutoff– Note: this is 50 words at every query position For each DB sequence with a high scoring word,try to extend it in both endsQuery:LP PQG LLDB seq:MP PEG LLHSP score 9 15 8 32– Form HSP (High-scoring Segment Pairs)– Use BLOSUM to score the extended alignment– No gaps allowed

The BLAST Search AlgorithmQuery Word{Query: EGPRGPKGPNGPDGPHGPMGPSGPQAPQN 1815141413131313131212Score Threshold (13)325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365 LA L TP G R W P D ER A290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330High-scoring Segment Pair

BLAST Algorithm Steps Keep only statistically significant HSPs– Based on the scores of aligning 2 random seqs Use Smith-Waterman algorithm to join the HSPs and getoptimalalignment– Gaps are alloweddefault (-11, -1)

BLAST algorithm summary“query”“seeds”“subjects” (database)Scan the index and find all word hits(111111)(111000111)DP extension to recover the high scoring pairs“neighborhood words”(branch and bound algorithm)Extending high scoring pairsIndexing all seedsEvaluate Significance of HSPs byKarlin-Altschul Statistic: E KMNexp(-lambda*S)

Different BLAST ProgramsBLAST DB: nr (non-redundant):–GenBank, RefSeq,EMBL est:– expressedsequences (cDNA),redundant Swissprot and pdb:–protein databasesIf query is DNA, butknown to be coding(e.g. cDNA)– Translate cDNAinto protein– Zero gapextension penalty

Different BLAST ProgramsProgramDescriptionblastpCompares an amino acid query sequence against a protein sequencedatabase.blastnCompares a nucleotide query sequence against a nucleotide sequencedatabase.blastxCompares a nucleotide query sequence translated in all reading framesagainst a protein sequence database. You could use this option to findpotential translation products of an unknown nucleotide sequence.tblastnCompares a protein query sequence against a nucleotide sequencedatabase dynamically translated in all reading frames.tblastxCompares the six-frame translations of a nucleotide query sequenceagainst the six-frame translations of a nucleotide sequence database.Please note that the tblastx program cannot be used with the nrdatabase on the BLAST Web page because it is too computationallyintensive.

PSI-BLAST Position Specific Iterative BLAST– Align high scoring hits in initial BLAST toconstruct a profile for the hits– Use profile (PSSM) for next iteration BLASTQuerySeq2Seq1Seq3Seq4 Find remote homologs or protein families FP sequences can degrade search quickly

PSI-BLASTInitial Refined MatchesPSSM Gly-30-3020-60-70His bi.ac.uk/blastpgp/

Reciprocal Blast Search for orthologous sequencesbetween two species– GeneA in Species1 BLAST Species2 GeneB– GeneB in Species2 BLAST Species1 GeneA– GeneAorthologousGeneB Also called bi-directional best hit

BLAT BLAST-Like Alignment Tool– Compare to BLAST, BLAT can align much longerregions (MB) really fast with little resources– E.g. can map a sequence to the genome in secondson one Linux computer– Allow big gaps (mRNA to genome)– Need higher similarity ( 95% for DNA and 80% forproteins) for aligned sequences Basic approach– Break long sequence into blocks– Index k-mers, typically 8-13– Stitch blocks together for final alignment

BLAT: IndexingGenome: cacaattatcacgaccgc3-mers: cac aat tat cac gac cgcIndex:aat 3cac 0,9cgc 15gac 12tat 6cDNA (mRNA - DNA): aattctcac3-mers: aat att ttc tct ctc tca cac0123456hits: aat 0,3cac 6,0cac 6,9-36-3clump: cacAATtatCACgaccgc aattctcac

BLAT Example Enter sequence and parameters

BLAT Example Get result instantly!!

Summary of Fast Search Fast sequence similarity search– Break seq, hash DB sub-seq, match sub-seqand extend, use DP for optimal alignment– *BLAST, most widely used, many applicationswith sound statistical foundations– *BLAT, align sequence to genome, fast yetneed higher similarity

BLAST score and significance Report DB sequences above a threshold– E value: Number (instead of probability pvalue) of matches expected merely by chanceE Kmn e S-xp(s ³ x) »1- exp[-e ] m, n are query and DB length K, are constants Smaller E, more stringent

Are these proteins homologs?SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLYL PW LY NY CLProbably not (score 9)SEQ 2: QFFPLMPPAPYWILATDYENLPLVYSCTTFFWLFSEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLYL PW LDATYKNYAY CLMAYBE (score 15)SEQ 2: QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLFSEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLYRVV L PSW LDATYKNYAY CDVTYKLSEQ 2: RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLFMost likely (score 24)

Significance of 45Low score unrelatedHigh score homologsHow high is high enough?

Other significance questions Pairwise sequence comparison scoresMicroarray expression measurementsSequence motif scoresFunctional assignments of genesCall peaks from ChIP-seq data

The null hypothesis We are interested in characterizing thedistribution of scores from sequence comparisonalgorithms. We would like to measure how surprising a givenscore is, assuming that the two sequences arenot related. The assumption is called the null hypothesis. The purpose of most statistical tests is todetermine whether the observed results provide areason to reject the hypothesis that they aremerely a product of chance factors.

Gaussian vs.Extreme Value Distribution (EVD)Gaussian (Normal)Extreme value

Gaussian 68 in., 3 in. What is the chance of picking a person at least 75 in. tall P(X 75)?z score( x) x 75 68 2.333From Table:z 2.33 P 0.01

EVD , , , Each alignment/score that BLAST returns is thevery best alignment/score among a large numberof alignments/scores for those two sequences (ieand EVD problem).(Altschul et al, Nat. Genet. 1994)

Computing a p-value The probability ofobserving a score 4is the area under thecurve to the right of 4. This probability iscalled a p-value. p-value Pr(data null)

Scaling the EVD An extreme value distribution derived from, e.g., theSmith-Waterman algorithm will have a characteristicmode μ and scale parameter λ. P S x 1 exp e x These parameters depend upon the size of the query,the size of the target database, the substitution matrixand the gap penalties.

An exampleYou run BLAST and get a score of 45. You then run BLAST on a shuffledversion of the database, and fit an extreme value distribution to the resultingempirical distribution. The parameters of the EVD are μ 25 and λ 0.693.What is the p-value associated with 45? P S x 1 exp e x 1 exp e 1 exp 9.565 10 P S 45 1 exp e 0.693 45 25 13.86 7 1 0.999999043 9.565 10 7

Summary of statistical significance A distribution plots the frequency of a given type ofobservation. The area under the distribution is 1. Most statistical tests compare observed data to theexpected result according to the null hypothesis. Sequence similarity scores follow an extreme valuedistribution, which is characterized by a larger tail. The p-value associated with a score is the areaunder the curve to the right of that score.

Applying homology: concept and technology Genome evolution– Mammalian genome evolution– Human genome variation– Cancer genome evolution Gene finding– Comparative approaches– Ab initio approaches Hidden Markov ModelProtein structure– Threading Regulatory motif finding– Profile comparison Pathway/Network comparison– PathBLAST Conservation– Ultra conserved elements– Human accelerated regions

Gene prediction Comparing to a known gene from a differentspecies Using EST evidence (aligning transcript to genome) Predicting from sequence (HHM) Using conservation– Signature of coding potential– What about RNA gene? Using other genomics signals– Specific epigenetic marks of promoters and genebodies

Modeling gene featuresExon 1[human]Intron 1Exon 2Intron 2Exon 35’3’[mouse]CNSCNSCNS

Genscan (Burge and Karlin, 1998) Dramatic improvementover previous methods Generalised HMM Different parametersets for different GCcontent regions (intronlength distribution andexon stats)

Predicting non-coding RNA? From sequence?– Not clear which properties can be exploited– Sequence features such as promoters are tooweak Histone modifications conservation worked

So far, only linear sequence comparisonA C T T G A A T T C G C C TCGAATTCAC

Expanding the idea of a sequence

Central theme of the new algorithm –compare profilesACGTT810801G003010C004012A071065 60006000114000606000501001050510

Met14 vsMet2“DotPlot”Match 1Mismatch -1Gray: 1MET14 (1000nt)

Met14 vsMet2Red: 5MET14 (1000nt)

Met14 vsMet2MET14 (1000nt)PhyloNetTTTCACGTGAP 1.75E-5HSPs:E 0.1P 0.002P 0.003P 0.03P 0.02

PathBlast, NetworkBlastTrey Ideker

Comparative Genomics Functional DNA often evolves slower than neutral DNA. To detect functional elements:– align genomes of related species,– and find regions of high conservation. The difference between conservation and constraint.

Ultra conserved elementsultra conservede.d 12.5

HARs: Human accelerated regions 118 bp segment with 18 changes betweenthe human and chimp sequences Expect less than 1

Human HAR1F differs from the ancestralRNA stucture

Aligning Short Reads

PastandCurrent Sequencing0 and1st generationsequencingTechnologiesPre-1992“old fashionedway”S35 ddNTPsGelsManual loadingManual base calling1992-1999ABI 373/377Fluorescent ddNTPs*GelsManual loadingAutomated base calling*1999ABI 3700Fluorescent ddNTPsCapillaries*Robotic loading*Automated base callingBreaks down frequently2003ABI 3730XLFluorescent ddNTPsCapillariesRobotic loadingAutomated base callingReliable*

Nextor chnology454/Roche GS-20/FLX(Oct 2005)ABI SOLiD(Oct 2007)Illumina/Solexa1G Genetic Analyser (Feb 2007)

ClustergenerationClusterGeneration8 channels (lanes)

IGA without cover

Flowcell imagingFlow cell imaging

A flow cellA flow cell contains eight lanesLane 1Lane 2.Lane 8Each lane/channel contains three columns of tilesColumn 1Column 2Column 3Each column contains 100 tilesTileEach tile is imaged four times percycle – one image per base.20K-30KClusters345,600 images for a 36-cycle run350 X 350 µm

Data analysis pipelineData Analysis PipelineFirecresttiff image files(345,600)AdditionalData AnalysisBustardintensity filesSequence filesElandAlignment to Genome

Primary tools and analysis tasks Image processing– (unique to each manufacturer) Basecalling– (unique to each manufacturer) Align sequence reads to reference genome Assemble contigs and whole genomes usingquality scores and/or paired-end information Peak finding for Chip-Seq applications– (and statistics to validate, map to regulated genes,etc) SNP calling/genotyping Transcript profiling– measure gene expression, identifying alternative splicing, etc.

NGS: Sequence alignment Map the large numbers of short reads to areference genome– In a broader sense: Identify similar sequences(DNA, RNA, or protein) in consequence offunctional, structural, or evolutionary relationshipsbetween the them– Applications: Genome assembly, SNP detection,homology search, etc large faster search speedshort greater search sensitivity.

Mapping Reads Back Hash Table (Lookup table)– FAST, but requires perfect matches Array Scanning– Can handle mismatches, but not gaps Dynamic Programming (Smith Waterman, Forward,Viterbi)– Indels– Mathematically optimal solution– Slow (most programs use Hash Mapping as a prefilter) Burrows-Wheeler Transform (BW Transform)– FAST (memory efficient)– But for gaps/mismatches, it lacks sensitivity

Many short read aligners

Short read mapping Input:– A reference genome– A collection of many 25-100bp tags (reads)– User-specified parameters Output:– One or more genomic coordinates for each tag In practice, only 70-75% of tags successfullymap to the reference genome. Why?

Multiple mapping A single tag may occur more than once inthe reference genome. The user may choose to ignore tags thatappear more than n times. As n gets large, you get more data, butalso more noise in the data.

Inexact matching? An observed tag may not exactly match any position inthe reference genome. Sometimes, the tag almost matches one or morepositions. Such mismatches may represent a SNP or a bad readout. The user can specify the maximum number ofmismatches, or a phred-style quality score threshold. As the number of allowed mismatches goes up, thenumber of mapped tags increases, but so does thenumber of incorrectly mapped tags.

Using base qualities to evaluateREAD: AGGTCCGGGATACCGGGGACBETTER!Q: 30CHR1: CGGTCCGGGATACCGGGGACCHR2: AGGTCCGGGATACCGGGGGTBETTER!Q: 10 10

Hash table (Eland, SOAP) Main idea: preprocess genome to speed up queries– Hash every substring of length k– K is a tiny constant For each query p, can easily retrieve all suffixes of thegenome that start with p1, p2, pk. Easy to implement. Significant speed up in practice. Large memory consumption. Inexact match is difficult.– Need multiple hash tables– More memory

Spaced seedalignment (MAQ) Tags and tag-sizedpieces of reference arecut into small “seeds.” Pairs of spaced seedsare stored in an index. Look up spaced seeds foreach tag. For each “hit,” confirmthe remaining positions. Report results to the user.

Index the reference genome: Suffix Tree Each suffix correspondsto exactly one path fromthe root to a leaf Edges spell non-emptystrings Construction: linear timeand space Check if a string oflength m is a substring Each substring is aprefix of a suffix!

Burrows-Wheeler(Bowtie, BWA) Store entire referencegenome. Align tag base by basefrom the end. When tag is traversed, allactive locations arereported. If no match is found, thenback up and try asubstitution.

Why Burrows-Wheeler? BWT very compact:– Approximately ½ byte per base– As large as the original text, plus a few“extras”– Can fit onto a standard computer with 2GB ofmemory Linear-time search algorithm– proportional to length of query for exactmatches

Burrows-Wheeler Transform (BWT)BWTacaacg acaacgaacg acacaacg acg acacaacg acg acaag acaacgc aaacBurrows-Wheeler Matrix (BWM)

Key observationa1c1a2a3c2g1 1“last first (LF) mapping”The i-th occurrence of character X inthe last column corresponds tothe same text character as the i-thoccurrence of X in the first column.1 acaacg12aacg ac11acaacg 13acg aca21caacg a12cg acaa31g acaac2

Burrows-Wheeler Matrix314256 acaacgaacg acacaacg acg acacaacg acg acaag acaacSee the suffix array?

Burrows-Wheeler Transform Originally designed for data compression for large text Burrows-Wheeler matrix: sort lexicographically all cyclicrotations of S BWT(S): the last column of Burrows-Wheeler matrix Compression: runs of repeated characters are easy tocompress using move-to-front transform and run-lengthencoding, etc. BWT(S) is a reversible permutation of S

Reverse Burrows-Wheeler Transform BW Matrix Property: Last-First (LF)Mapping The ith occurrence of character X in thelast column correspond to the same textcharacter as the ith occurrence of X in thefirst column

qc the nextSearchingcharacter to BWTthe left in the queryBWT Exact MatchingFerragina P, Manzini G: Opportunistic data structures with applications. FOCS. IEEE Computer Society; 2000.

Searching BWTBWT(agcagcagact) tgcc ggaaaacgcagcaSearch for pattern: gcagcagca agcagcagact agcagcagact agcagcagact agcagcagactact agcagcagact agcagcagact agcagcagact agcagcagagact agcagcagact agcagcagact agcagcagact agcagcagcagact agcagcagact agcagcagact agcagcagact agcagcagcagact agcagcagact agcagcagact agcagcagact cagact agcagcagact agcagcagact agcagcagact agcagcagcagact agcagcagact agcagcagact agcagcagact agct agcagcagact agcagcagact agcagcagact agcagcagagact agcagcagact agcagcagact agcagcagact agcagcagcagact agcagcagact agcagcagact agcagcagact agcagcagcagact agcagcagact agcagcagact agcagcagact at agcagcagact agcagcagact agcagcagact agcagcagac

SummaryAvoidsomeexcessivebacktrackingA Bowtie genome index consists ofthe BWT plusauxiliarydata structures. All told,Bowtie index: memory footprintthe index is only about 65% larger than the input genome. This is radically smaller thanother indexing schemes such as suffix trees, suffix arrays, and seed hash tables. Thisallows Bowtie to run easily and efficiently on a typical computer with 2 gigabytes of RAM.owtie index: memory footprintHuman genome memory footprintHuman genome memory footprintBowtie IndexSuffix Tree1.3 gigabytes 35 gigabytesSuffix ArrayHash Tables 12 gigabytes 12 gigabytesMemory footprintBowtieSearch AlgorithmThe index is only about 65% larger than the input genometh occurrenceThis allows Bowtie to run easily andonofa typical computer with 2 GBthe iefficientlyThe Burrows-Wheeler MatrixRAM.character X in thehas a property called the LF 19last column The Bowtie index can be distributedover the Internet.mapping: the ith occurrence of

BLOSUM and PAM BLOSUM 45 BLOSUM 62 BLOSUM 90 PAM 250 PAM 160 PAM 100 More Divergent Less Divergent BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

Related Documents:

Evolving the Concept of Homology

Keywords: Homology, evolution, vertebrates, fossils, comparative anatomy Introduction One of the most fundamental concepts underlying student understanding of the theory of evolution is homology, a term coined by Sir Richard Owen in 1848, which discusses the structure of the skeleton in vertebrates (Owen, 1848). Homology may

9 Views

2y ago

B.Sc. GENETICS SYLLABUS

Genetics – A Continuity of Life – Daniel Fairbanks, Ralph Anderson. Concepts of Genetics – Klug and Cummings. Principles of Genetics – Hartt and Jones. GN 5 B 07 : CORE COURSE VII Medical Genetics Total – 54 hrs Unit- 1: Principles of Human Genetics (2 hrs) History, Origin of medical genetics, classification of genetic disease .

84 Views

3y ago

SEMESTER-III PAPER-I INHERITANCE BIOLOGY UNIT-I Extra ...

Somatic cell genetics. Books Recommended: 1. Genetics - Gardener 2. Molecular Genetics of Bacteria 2nd edition 1995, Jeremy W.Dale.-John Wiley and sons. 3.Cell biology (1993)-David E.Sadva (Jones and Barrette) 4.Modern genetics (2nd edition,1984)-A.J.Ayala and W.Castra(Goom Helns,London) 5. Genetics by P.K Gupta. 6. Genetics by Verma and Agarwal 7.

46 Views

3y ago

Extending Mendelian Genetics

HUMAN GENETICS AND PEDIGREES 7.4 . Key Concept A combination of methods is used to study human genetics . Human Genetics Human genetics follows the patterns seen in other organisms The basic principles of genetics

40 Views

2y ago

Using Rosetta for RNA homology modeling - Stanford University

describing this Rosetta-focused methodology, we note that other groups have also developed excellent tools outside Rosetta for RNA homology modeling, and we advise the reader to try multiple tools and check for Xu&Chen,2018).We also note that current homology modeling methods can be computationally

8 Views

1y ago

SEIBERG WITTEN INVARIANTS OF RATIONAL HOMOLOGY 3-SPHERES

invariants in the remaining case, that of rational homology spheres. Around the same time, Lim [12] succeeded in providing a combinatorial de-scription of the Seiberg–Witten invariants of integral homology spheres. Namely, in this case they coincide with the Casson invariant. In [19] we investigated a spe-

18 Views

3y ago

On perturbative PS N) invariants of rational homology 3 ...

For any positive integer l, let Z (l)"ZC1 (l!1)!n!D. For a rational homology 3-sphere M (i.e. a closed oriented 3-manifold whose homology group H 1 (M,Z)is"nite), let Q[[x]](M)be the set of all power series l d l x l such that d l is a number in ZC 1 (2l#n(n!1))!DH 1 (M,Z)DD1.2. Quantum invariants of oriented framed links and 3-manifolds For a framed oriented link ‚ with m ordered .

21 Views

3y ago

Introduction on Embedded Contact Homology

on Embedded Contact Homology [1] as well as his 2008 lectures at MSRI. The fundamental objects of study for Embedded Contact Homology (ECH) are Reeb orbits on contact manifolds, so we de ne: De nition 1.1. A contact manifold (Y; ) is a smooth, 2n 1 dimensional manifold to

32 Views

2y ago

Recent Views

SHORT RUSSIAN PHRASEBOOK FOR ENGLISH-SPEAKING TRAVELERS .

Russian words in English. Version 4.0 December 2011 English-Russian phrases for trips to Belarus (Russia) compiled by Andrei Burdenkov, a certified Minsk guide a@minskguide.travel 2 Common Russian phrases. Numerals Russian phrase Say it in Russian English translation 0 – ноль nol' zero 1 – один Odin one 2 – два Dva two 3 – три Tri three

3y ago

209 Views

UTOMOTIVEEMC - unitest

AUTOMOTIVEEMC SAFETY&EMC2011 Table3ISO11452-2severitylevels SeverityLevel Field/(V/m) I25 II50 III75 IV100 V(opentotheusersofthestandard) a

3y ago

38 Views

DELHI TECHNOLOGICAL UNIVERSITY

4 ME104 Basic Mechanical Engineering AEC 4 4 0 0 3 0 25 - 25 50 - 5 ME106 Workshop Practice AEC 2 0 0 3 0 3 - 50 - - 50 6 HU102 Communication Skills HMC 3 3 0 0 3 0 25 - 25 50 - Total 21 16 1 7. ME-9 II Year: Odd Semester S.No. Code Title Area Cr L T P TH PH CWS PRS MTE ETE PRE 1. PE251 Engineering Materials & Metallurgy AEC 4 3 0 2 3 0 15 15 30 40 2. ME201 Mechanics of Solids DCC 4 3 0 2 3 0 .

3y ago

49 Views

Director Due Diligence

what to consider before nominating a potential director onto a board and should not be considered as a checklist in deciding whether to accept a potential candidate or not. From a potential director’s perspective, the paper aims to guide the individual on what to consider about a company prior to accepting an appointment. Terminology used in the paper Whilst the terms “company” and .

3y ago

154 Views

Hierarchical Level and Leadership Style

ORGANIZATIONAL BEHAVIOR AND HUMAN PERFORMANCE 18, 131--145 (1977) Hierarchical Level and Leadership Style ARTHUR G. JAGO AND VICTOR H. VROOM School of Organization and Management, Yale University This research investigates the relationship between the hierarchical level of managerial personnel and individual differences in their leadership styles, specifically the degree to which they are .

3y ago

64 Views

EFEKTIVITAS MEDIA PEMBELAJARAN E-LEARNING

EFEKTIVITAS MEDIA PEMBELAJARAN E-LEARNING TERHADAP PRESTASI BELAJAR PENDIDIKAN AGAMA ISLAM SISWA DI SMA NEGERI 1 YOGYAKARTA SKRIPSI Diajukan Kepada Fakultas Ilmu Tarbiyah dan Keguruan Universitas Islam Negeri Sunan Kalijaga Yogyakarta Untuk Memenuhi Sebagian Syarat Memperoleh Gelar Sarjana Strata Satu Pendidikan Islam Disusun oleh: Aldila Siddiq Hastomo (09410111) PENDIDIKAN AGAMA ISLAM .

3y ago

179 Views

USAHA PENGERINGAN EMPON-EMPON BAHAN OBAT HERBAL DI .

Keywords: empon-empon, herbal medicine, production, management, marketing. PENDAHULUAN Indonesia merupakan salah satu negara penghasil komoditi obat-obatan yang potensial. Aneka ragam jenis tanaman obat telah diproduksi sebagai bahan baku obat modern maupun obat tradisional (jamu). Prospek pengembangan produksi tanaman obat cukup cerah antara lain karena berkembangnya industri obat modern dan .

3y ago

92 Views

Sara Cristina Oliveira A Guitarra clássica Marques Almeida .

caracterização técnica, estilística e estética do repertório para guitarra de Carlo Domeniconi. Associados a este ponto fundamental, apresenta-se uma abordagem histórica e analítica dos recursos técnicos e estilísticos da guitarra, e, de uma forma sucinta dada a natureza do trabalho, os movimentos estéticos da vanguarda e a sua reflexão na criação artística, que constituem um .

3y ago

36 Views

UNCLASSIFIED//FOR OFFICIAL USE ONLY DELIBERATIVE DOCUMENT .

to provide intelligence and information on transnational organized crime, terrorism, cyber-threat actors, counterintelligence vulnerabilities, economic security, and other developing threats that pose a critical danger to the Nation’s security and our democratic way of life.

3y ago

89 Views

2020 Asia Pacific Virtual Summit Agenda Overview

Cyber Threat Hunting: Resourcing & Methods Sindhu HS, Vice President, Goldman Sachs y To effectively and efficiently execute cyber threat missions leveraging collaborative Tiger Teams to address resource constraints common in every organization y To proactively identify previously undetected malicious activity and

3y ago

28 Views

Corporate Governance and Controls: The Federal Reserve’s .

The Federal Reserve has proposed new supervisory guidance on corporate governance (Governance Proposal) that would apply to large U.S. financial institutions . its internal risk management, its internal controls and, for U.S. G-SIBs, its recovery

3y ago

20 Views

Cross Stitch Tutorial - allstitches4you

cross stitch kit. It is better to cut and fasten off your thread at the back of the needle work as normal, and start again at the new area of the design. Half Cross Stitch Many projects now have areas worked in half cross stitch, for example to give a "soft focus" background.

3y ago

32 Views

Service Manual for WP6 Diesel Engine

Service Manual for WP6 Diesel Engine Preface WP6 mechanical pump series diesel engine has the features of compact structure, reliable

3y ago

121 Views

From Eye to Insight

medaka, and Xenopus, are often used in molecular and developmental biology. An adult zebrafish is shown below. Adult zebrafish (Danio rerio). In molecular and developmental biology, these aquatic vertebrate model organisms are widely applied to study molecular processes of development and as disease models. To study these molecular

3y ago

25 Views

Geometry SIG Update - POSC Caesar

ISO 15926 P&ID/3D Information Models and Proteus Schema Mapping Project yCurrent Team Members – Adrian Laud (Lead), Keith Willshaw, Andrew Prosser, Manoj Dharwadkar yDocument P&ID and 3D Models in terms of Classes and Templates yMapping the Proteus Schema to the ISO 15926 P&ID/3D Information Model

3y ago

45 Views

Homology - Washington University Genetics

It looks like you're using an ad-blocker