Introduction To Bioinformatics - Lehigh University

2y ago
27 Views
3 Downloads
2.02 MB
43 Pages
Last View : 1m ago
Last Download : 3m ago
Upload by : Nora Drum
Transcription

Introductionto BioinformaticsDan LoprestiAssociate ProfessorOffice PL 404Bdal9@lehigh.eduIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 1

Motivation“Biology easily has 500 years of exciting problems to work on.”Donald Knuth (Stanford Professor & famous computer scientist)By developing techniques for analyzing sequence data and relatedstructures, we can attempt to understand molecular basis of ion to Bioinformatics LoprestiBioS 95 November 2008 Slide 2

Before We Get GoingRecall your recent lectures by Professors Marzillier and Warewho presented biological background:Professor Marzillier's lecture on Nov 7.Professor Ware's lecture on Nov 10.Today I'll focus on the related computational questions.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 3

BioinformaticsWhat is bioinformatics? Application of techniques from computerscience to problems from biology.Computer ScienceBioinformaticsBiologyWhy is it interesting?Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 4 Important problems.Massive quantities of data.Desperate need for efficient solutions.Success is rewarded.

Data ExplosionOur genetic identity isencoded in long moleculesmade up of four basic units,the nucleic acids:(1) Adenine,(2) Cytosine,(3) Guanine,(4) Thymine.To first approximation, DNA isa language over a four characteralphabet, {A, C, G, T}.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide nbankstats.html

GenomesComplete set of chromosomes that determines an organism isknown as its genome.Us?Mus ://www.nsrl.ttu.edu/tmot1/mus gle.asp?strID 324Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 6Conclusion: sizedoes not matter!(But you alreadyknew this.)

Comparative GenomicsRecall this amazing diagram from Professor Ware's lecture:How did wedecipher ources/Human ion to Bioinformatics LoprestiBioS 95 November 2008 Slide 7

Algorithms are CentralAn algorithm is a precisely-specified series of steps to solve aparticular problem of interest. Develop model(s) for task at hand. Study inherent computational complexity: Can task be phrased as an optimization problem? If so, can it be solved efficiently? Speed, memory, etc. If we can't find a good algorithm, can we prove task is “hard”? If known to be hard, is there approximation algorithm (one thatworks at least some of the time or comes close to optimal)? Conduct experimental evaluations (perhaps iterate above steps).Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 8

Sequence Nature of .htmlMacromolecules are chains of simpler molecules.In the case of proteins, these basicbuilding blocks are amino acids.In DNA and RNA, they are nucleotides.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 9

NCBI GenBankNational Center forBiotechnology Information(NCBI), which is branch ofNational Library of Medicine(NLM), which is branch ofNational Institutes of Health(NIH), maintains GenBank, aworldwide repository ofgenetic sequence data (allMassive quantities of sequence data publicly available DNAneed for good computational Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 10

Reading DNARecall Professor Marzillier's lecture:This is known as ns/sommaire2/sanger.htmhttp://www.iupui.edu/ wellsctr/MMIA/htm/animations.htmIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 11Gel electrophoresis isprocess of separating amixture of molecules in agel media by application ofan electric field.In general, DNA moleculeswith similar lengths willmigrate same distance.Make DNA fragments thatend at each base: A, C, G, T.Then run gel and read offsequence: ATCGTG .

Reading DNAOriginal sequence: TCGATAGCATCGTGTCGATAGCGCIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 12

Sequencing a GenomeMost genomes are enormous (e.g., 1010 base pairs in case ofhuman). Current sequencing technology, on the other hand, onlyallows biologists to determine 103 base pairs at a time.This leads to some very interesting problems in bioinformatics .Genetic linkage map(107 – 108 base pairs)Physical map(105 – 106 base pairs)ACTAGCTGATCGATTTAGCAGCAG.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 13Sequencing(103 – 104 base pairs)

Sequencing a GenomeGenomes can also be determined using a technique known asshotgun sequencing.Computer scientists haveplayed an important role indeveloping algorithms forassembling such data.It's kind of like puttingtogether a jigsaw puzzle withmillions of pieces (a lot ofwhich are “blue /pubbooks/bc mcampbell genomics 1/medialib/method/shotgun.htmlIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 14

Sequence rgetoriginalIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 15

Sequence AssemblyA simple model of DNA assembly is the Shortest SupersequenceProblem: given a set of sequences, find the shortest sequence Ssuch that each of original sequences appears as subsequence of S.Look for overlap between prefix ofone sequence and suffix of another:ACCGT3CGTGC21TTACIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 16--ACCGT-----CGTGCTTAC----TTACCGTGC

Sequence AssemblySketch of algorithm: Create an overlap graph in which every node represents afragment and edges indicate overlap. Determine which overlaps will be used in the final assembly:find an optimal spanning forest in overlap graph.5sW AGTATTGGCAATCZ AATCGATGU ATGCAAACCTy9w4xX CCTTTTGGY TTGGCAATCAS AATCAGGIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 17343zu

Sequence Assembly Look for paths of maximum weight: use greedy algorithm toselect edge with highest weight at every step. Selected edge must connect nodes with in- and out-degrees 1. May end up with set of paths: each corresponds to a contig.5sy9wW Y S4AGTATTGGCAATCAGGx343zIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 18uAGTATTGGCAATCTTGGCAATCAAATCAGGZ U XAATCGATGATGCAAACCTCCTTTTGGAATCGATGCAAACCTTTTGG

Sequence ComparisonWhat's the problem? Google for biologists . Given new DNA or protein sequence, biologist will want to searchdatabases of known sequences to look for anything similar. Sequence similarity can provide clues about function andevolutionary relationships. Databases such as GenBank are far too large to search manually.To search them efficiently, we need an algorithm.Shouldn't expect exact matches (so it's not really like google): Genomes aren't static: mutations, insertions, deletions. Human (and machine) error in reading sequencing gels.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 19

Genomes Aren't StaticSequence comparison must account for such i PDFs/deletion.pdfIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 20http://www.accessexcellence.org/AB/GG/nhgri PDFs/insertion.pdf

Genomes Aren't StaticDifferent kinds of mutations can arise during DNA mutation.htmIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 21

The Human FactorIn addition, errors can arise during the sequencing process:“.the error rate is generally less than 1% over the first 650bases and then rises significantly over the remaining tmlA hard-to-read gel (arrow marks location where bands of similarintensity appear in two different lanes):Machines also makemistakes, of course!http://hshgp.genome.washington.edu/teacher ction to Bioinformatics LoprestiBioS 95 November 2008 Slide 22

Sequence ComparisonWhy not just line up sequences and count matches?A G T C T A T ADifference 2A T T C T G T ADoesn't work well in case of deletions or insertions:A G T C T A T ADifference 8G T C T A T AIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 23One missing symbol at start ofsequence leads to large difference!

Sequence ComparisonInstead, we'll use a technique known as dynamic programming. Model allows three basic operations: delete a single symbol,insert a single symbol, substitute one symbol for another. Goal: given two sequences, find the shortest series ofoperations needed to transform one into the other.A G T C T A T ADelete TA G C T A T AA G C T G T AIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 24Substitute G for A

Sequence ComparisonHow can we determine optimal series of operations? Approach is to build up longer solutions from previouslycomputed shorter solutions. Say we want to compute solution at index i in first sequenceand index j in second sequence:Sequence 1ivs.Sequence 2jAssume that we already know the best way to compare:Sequence 1Sequence 2vs.iIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 25Sequence 1vs.Sequence 2Sequence 1vs.Sequence 2j

Sequence ComparisonSo, best way to do this comparison:iSequence 1vs.Sequence 2jIs best choice from following three cases:Sequence 1Sequence 1Sequence 1Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 26ivs.Sequence 2vs.Sequence 2vs.Sequence 2j insertingj deletingi substitutingjfori

Sequence ComparisonNormally, this computation builds a table of distance values:sequencetcost of inserting t0cost of deleting ssequencesεεd [i-1, j] 1d [i, j] minIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 27d [i, j-1] 1d [i-1, j-1] 0 if s[i] t[j]1 if s[i] t[j]

Sequence ComparisonBy keeping track of optimal decision, we can determine operations:Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 28

Genome RearrangementsRecall what wesaw earlier: 99% of mouse genes have homologues in human genome. 96% of mouse genes are in same relative location to one another. Mouse genome can be broken up into 300 synteny blocks which,when rearranged, yield human genome. Provides a way to think about evolutionary relationships.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 29

Reversal DistanceHuman Chromosome X12345-145Cut and reverse-3-2Cut and reverse-3-2-5-41-41Cut and reverse-35Mouse Chromosome X2Reversal distance is the minimum number of such steps needed.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 30

Interesting SidenoteEarly work on a related problem,sorting by prefix reversals, wasperformed in 1970's by ChristosPapadimitriou, a famous computerscientist now at UC Berkeley, andone “William H. Gates” .Yes, that Bill Gates .Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 31

History of Chromosome XRat Consortium, Nature, 2004Hypothesized reversalsIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 32

Waardenburg’s SyndromeMouse provides insight into human genetic disorder: Waardenburg’s syndrome ischaracterized by pigmentary dysphasia. Disease gene linked to Chromosome 2,but not clear where it was located.“Splotch” mice: A breed of mice (with splotch gene) had similar symptoms. Scientists succeeded in identifying location of gene in mice. This gave clues as to where same gene is located in humans.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 33

Building the “Tree of Life”Note: these trees are “best guesses”and certainly contain some errors!Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 34http://en.wikipedia.org/wiki/Phylogenetic ogyPages/T/Taxonomy.htmlScientists build phylogenetic trees in an attempt to understandevolutionary relationships. Reversal distance is often used here.

DNA Microarrays Allows simultaneous measurement of the level of transcriptionfor every gene in a genome (gene expression). Differential expression, changes over time. Single microarray can test 10k genes. Data obtained faster than can be processed. Want to find genes that behave similarly. A pattern discovery problem.green repressedred inducedIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 35

Using DNA Microarrays Track sample over aperiod of time to seegene expression overtime. Track two differentsamples under sameconditions to seedifference in geneexpressions.Each box represents onegene’s expression over 10 Clustering.pptIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 36

DNA MicroarraysK-means clustering is one way to organize this data: Given set of n data points and an integer k. We want to find set of k points that minimizes the mean-squareddistance from each data point to its nearest cluster center.Sketch of algorithm: Choose k initial center points randomly and cluster data. Calculate new centers for each cluster using points in cluster. Re-cluster all data using new center points. Repeat second two steps until no data points are moved from onecluster to another or some other convergence criterion is met.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 37

Clustering Microarray Data Pick k 2 centers at random. Cluster data around these centerpoints. Re-calculate centers based oncurrent clusters.From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 38

Clustering Microarray Data Re-cluster data around newcenter points. Repeat last two steps until nomore data points are moved intoa different cluster.From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 39

Example of Hierarchical ClusteringDifferent genes that express similarlyFrom “Cluster analysis and display of genome-wide expression patterns” by Eisen, Spellman, Brown, and Botstein, Proc.Natl. Acad. Sci. USA, Vol. 95, pp. 14863–14868, December 1998Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 40

Why Study Bioinformatics? Still many urgent open problems lots of opportunities tomake fundamental contributions (and become rich and famous). Stretch your creativity and problem-solving skills to the limit. Join a cross-disciplinary team – work with interesting people. Participate in unlocking the mysteries of life itself. Make the world a better place.Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 41

CSE Course in BioinformaticsIn CSE 308, we cover:CSE 308 is not a Intro to molecular biology & algorithms, programming course! Basic programming for bioinformatics, Genetic sequence comparison & alignment, Physical mapping, sequencing, and assembly of DNA, Standard formats and sources for genomic data, Advanced topics: DNA microarrays, genome rearrangements,RNA and protein structure prediction, etc.Materials @ http://www.cse.lehigh.edu/ lopresti/courses.htmlQuestions: dal9@lehigh.eduIntroduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 42

Thank you!Introduction to Bioinformatics LoprestiBioS 95 November 2008 Slide 43

Introduction to Bioinformatics Lopresti BioS 95 November 2008 Slide 13 Sequencing a Genome Most genomes are enormous (e.g., 1010 base pairs in case of human). Current sequencing technology, on the other hand, only allows biologists to determine 103 base pairs at a time. This leads to some very interesting problems in bioinformatics .

Related Documents:

Lehigh County Drug and Alcohol 71, 74 Lehigh County Family Court 128 Lehigh County Juvenile Probation 128 Lehigh County Conference of Churches 65 Lehigh County Housing Authority and Valley Housing Development Corporation 57 Lehigh County Information and Referral 41 Lehigh County Office of Children and Youth 29 .

Lehigh Valley Drug and Alcohol Intake Units 25 Lehigh Valley Eye Center 65 Lehigh Valley Eye Center and Children's Eye Care 58 Lehigh Valley Family Health Center 49 Lehigh Valley Hospital Center 40 Lehigh Valley Hospital Center for Women's Medicine 58 Lehigh Valley Hospital Center for Women's Health at Casa - "Viva Nueva" Clinic 58 .

Lehigh Valley Drug and Alcohol Intake Units 46 Lehigh Valley Eye Center - Bethlehem 98 Lehigh Valley Eye Center and Children's Eye Care - Allentown 90 Lehigh Valley Family Health Center 81 Lehigh Valley Health Network 25 Lehigh Valley Hospital Center 71 Lehigh Valley Hospital Center for Women's Health at Casa - "Viva Nueva" Clinic 91

Structural bioinformatics adds scale and precision Structural Bioinformatics Structure Prediction Integrative Methods Molecular Simulation Structure Alignment Functional Site Comparison Docking . Lehigh University BioS 10: BioSciences in the 21st Century Brian Y. Chen Many computational fields support Structural Bioinformatics Structural

Bioinformatics Crash Course Ian Misner Ph.D. Bioinformatics Coordinator UMD Bioinformatics Core . Bioinformatics!Core The Plan Monday – Introductions – Linux and Python Hands-on Training Tuesday – NGS Introduction – RNAseq with Sailfish (Dr. Steve Mount, CBCB) – RNAse

Greater Bath Area Chamber of Commerce, Bath, PA Greater Lehigh Valley Chamber of Commerce, Lehigh Valley, PA Greater Northern Lehigh Chamber of Commerce, Lehigh Valley, PA Greater Pittsburgh Chamber of Commerce, Pittsburgh, PA He

COURT OF COMMON PLEAS OF LEHIGH COUNTY . Rule 51 Title and Citation of Rules. All civil rules of procedure adopted by the Court of Common Pleas of Lehigh County shall be cited as Lehigh Rules of Civil Procedure ("Leh.R.C.P.") Rule 52 Effective Dates of Rules. (a) A rule or amendment to a rule shall become effective upon the date specified by

It is not strictly a storage tank and it is not a pressure vessel. API does not have a standard relating to venting capacities for low-pressure process vessels. It is recommended that engineering calculations be performed based on process and atmospheric conditions in order to determine the proper sizing of relief devices for these type vessels. However, it has been noted that may people do .