Introduction To R: Basic String And DNA Sequence Handling


Manuela Benary and Pål O. Westermark
Bioinformatics - SS 2014

Strings and other data structures

So far, we have concentrated on numerical data. However, can we store characters too? Let's try working with the alphabet of the DNA. In this section we will search for occurrences of the hormone response element in the promoter of a gene.

    nucleobase <- T
    print(nucleobase)
    ## [1] TRUE
    nucleobase <- "T"
    print(nucleobase)
    ## [1] "T"

In the first example, the variable 'nucleobase' has the value 'TRUE', because 'T' and 'F' are the short forms of the boolean values 'TRUE' and 'FALSE'. So, as you can see, characters need to be surrounded by quotation marks (""). We can compare different strings with '==' and test whether they are identical.

    nucleobase == "T"
    ## [1] TRUE
    nucleobase == "A"
    ## [1] FALSE

Base complement

Please write a function 'basecomp' to translate nucleotides into their complement.
If you considered all cases, testing your function would give you something like

    basecomp("A")
    ## [1] "T"
    basecomp("T")
    ## [1] "A"
    basecomp("P")
    ## Warning in basecomp("P"): You have entered an invalid base

But there's an easier way. Let's create a vector of the bases (similar to a vector of numbers):

    thebases <- c("A", "C", "G", "T")
    thebases[2]
    ## [1] "C"
    names(thebases)

    ## NULL
    names(thebases) <- c("T", "G", "C", "A")
    thebases["C"]
    ##   C
    ## "G"

Each element in a vector can be accessed either by its index or by its name. The function 'names' assigns the names for the individual elements of a vector.

Of course, we can also store whole sequences of characters. One example is the sequence of the hormone response element, which can be bound by the activated steroid receptor and thus initiates gene expression (Figure 1). A second example is the TATA-box, which can be found in the core promoter region of 10–20% of human genes. We can combine different strings in a vector as well:

Figure 1: A hypothetical example of a eukaryotic core promoter including the TATA-box and the (dimeric) binding site of a hormone receptor.

    hre <- "AGGTCA"
    print(hre)
    ## [1] "AGGTCA"
    dnaseq <- c(hre, "TATAAA")
    print(dnaseq)
    ## [1] "AGGTCA" "TATAAA"
    print(dnaseq[2])
    ## [1] "TATAAA"

We can easily split strings using a separating character, and we will end up with a list of characters. A list can hold all types of different things. The list is your first encounter with a data structure. Lists work like boxes within boxes. If we have a vector of strings, the result for each string will be added as a new element in the list. Each element in the list can be accessed via [[]] (mind the difference: for vectors only one bracket is necessary).

    dnabases <- strsplit(dnaseq, split = NULL)
    print(dnabases)
    ## [[1]]
    ## [1] "A" "G" "G" "T" "C" "A"
    ##
    ## [[2]]
    ## [1] "T" "A" "T" "A" "A" "A"

    print(dnabases[[1]])
    ## [1] "A" "G" "G" "T" "C" "A"
    dnabases[[2]] <- 11
    print(dnabases)

    ## [[1]]
    ## [1] "A" "G" "G" "T" "C" "A"
    ##
    ## [[2]]
    ## [1] 11

Here, we used a named argument, 'split'. It tells the function at which characters we want to split our strings. If set to NULL, like above, this means splitting between all characters. As you can see, you can mix numbers and characters in your list and R will not complain. If we want to get the complement of the HR-element, we can use the function 'basecomp' from before and apply this function to each nucleotide in our sequence. One way is using a for-loop again (if you do not feel confident yet, give it another try). However, the better way (faster and more efficient) is to use one of the 'apply' functions, in particular 'lapply'. 'l' stands for 'list' here. A general overview of apply-functions in R can be found at Neil Saunders' blog.

    dnacomplist <- lapply(dnabases[[1]], basecomp)
    print(dnacomplist)

You got back a list, but this is not the easiest way to handle strings. Let's use 'sapply' ('s' stands for 'simplify'), which you have already encountered:

    dnacompvec <- sapply(dnabases[[1]], basecomp)
    print(dnacompvec)

Lists and vectors

- Print the third element of 'dnacompvec' and of 'dnacomplist'. Are they the same?
- Use the vector 'thebases' to define the complement of the first element of 'dnabases'. Remember, you can use a vector to access elements of a vector.

In general, DNA sequences are stored as a single strand. If we want to find occurrences of the HRE, we have to consider the original pattern as well as the reverse complement: in general, proteins can bind to either strand, sometimes even in any orientation.
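
To make the need for the reverse complement concrete, here is a minimal sketch. The 11-base string 'toyseq' is an assumption for illustration (it is spelled out from the first fragments printed later in this tutorial), and the reverse complement "TGACCT" is taken from the expected output given in the text. 'grepl' with 'fixed = TRUE' serves only as a quick check here; it is not the search method this tutorial develops.

```r
# Toy stored strand (assumed for illustration only).
toyseq <- "TGATGACCTGA"

hre <- "AGGTCA"      # motif as defined in the tutorial
hre_rc <- "TGACCT"   # its reverse complement, as stated in the tutorial

# fixed = TRUE treats the pattern as a literal string, not a regular expression.
found_forward <- grepl(hre, toyseq, fixed = TRUE)
found_reverse <- grepl(hre_rc, toyseq, fixed = TRUE)

print(found_forward)
print(found_reverse)
```

In this toy case only the reverse complement is found, so a search for 'hre' alone would miss a binding site that lies on the opposite strand.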

We already know how to generate the complement, but we need to reverse the sequence to generate the reverse complement.

    1:5
    ## [1] 1 2 3 4 5
    rev(1:5)
    ## [1] 5 4 3 2 1
    dnaseqcompt <- rev(dnacompvec)
    paste(dnaseqcompt, collapse = "")
    ## [1] "TGACCT"

(There are also other ways to reverse, index, and slice vectors, but we cannot cover them all here.)

Reverse complement

Write a function 'revcomplement', which returns the reverse complement of a given DNA sequence.
Testing the function with the HR-element should result in the following.

    revcomplement(hre)
    ## [1] "TGACCT"

Finally, we have the building blocks to actually search for occurrences of the hormone response element. We will use the promoter of the gene 'Prdx1' as an example here.

Download a gene sequence

Go to genome.ucsc.edu and download 3000 bp upstream of the transcription start site of the gene 'Prdx1':

- In the Genome Browser, select Mammal / Mouse / mm10 and type Prdx1.
- Click the blue box with Prdx1, then Genomic Sequence (upper left).
- Select only Promoter/Upstream and fill in 3000 bases.
- Select One FASTA record per gene.
- Check All upper case and uncheck Mask repeats.
- Save the result (Save as / Speichern unter) under the name prdx1.fa in the format Text. Choose the same directory as your R working directory, which you obtain with getwd().

We will now read the sequence from the file line by line and store the lines in a vector. The first line should be the header, which we do not keep. Then we collapse the vector of strings into one gigantic string.

    rawseq <- readLines("prdx1.fa")
    prdx1arr <- rawseq[2:length(rawseq)]
    prdx1seq <- paste(prdx1arr, collapse = "")

You can check whether everything went right by counting the number of characters in your string using the function nchar(). We are now going to write code to count the occurrences of a given string in this DNA sequence. As always in good engineering tradition, we approach the problem by creating small building blocks, which we later combine to achieve our goal. The first building block is a function for picking out substrings:

    substr(prdx1seq, 1, 2)
    ## [1] "TG"

Substrings

- Extract the bases from position 4 to 9.
- Using substr and nchar, extract the last 6 bases of the prdx1 sequence.

If we want to look for occurrences of the motif AGGTCA, we need the motif (should be stored as 'hre') and its reverse complement.

    comp.hre <- revcomplement(hre)
    comp.hre
    ## [1] "TGACCT"
    len <- nchar(hre)

Next, we want to extract all (overlapping) substrings of the length of the motif ('len') in our sequence, which would be a recurring use of the function 'substr':

    substr(prdx1seq, 1, len)
    substr(prdx1seq, 2, len + 1)
    ...
    substr(prdx1seq, nchar(prdx1seq) - (len - 1), nchar(prdx1seq))

This looks similar to the for-loop (or, even easier, an apply-function) we have used before. So let's do this systematically: we need the starting index and the ending index for each substring. We need to be really careful at the end of the gene sequence to make sure that the last substring is as long as the motif.

    leftlims <- 1:(nchar(prdx1seq) - (len - 1))
    rightlims <- len:nchar(prdx1seq)

substr() takes a start and a stop position, i.e. two arguments that change from fragment to fragment, therefore we cannot use the simple apply functions from before. However, we can use 'mapply' to apply a function to each element of multiple arguments (see Figure 2).
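
How 'leftlims' and 'rightlims' line up is easiest to see on a short string. The following sketch uses an assumed 11-base toy sequence and a throwaway anonymous function to show that 'mapply' pairs the i-th start with the i-th end; neither is part of the tutorial's own code.

```r
toyseq <- "TGATGACCTGA"   # assumed toy sequence, 11 bases
len <- 6                  # motif length, as for AGGTCA

leftlims <- 1:(nchar(toyseq) - (len - 1))   # start positions 1..6
rightlims <- len:nchar(toyseq)              # end positions 6..11

# Both index vectors have the same length, and every window spans len bases.
stopifnot(length(leftlims) == length(rightlims))
stopifnot(all(rightlims - leftlims + 1 == len))

# mapply calls the function once per pair (leftlims[i], rightlims[i]).
windows <- mapply(function(l, r) paste0(l, "-", r), leftlims, rightlims)
print(windows)
```

Every start/end pair spans exactly 'len' positions, which is why the last start index has to be nchar(toyseq) - (len - 1).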

Figure 2: Dissecting a large sequence into a vector of overlapping fragments using the function 'mapply'.

    prdx1substrings <- mapply(substr, prdx1seq, leftlims, rightlims,
                              USE.NAMES = FALSE)
    head(prdx1substrings)
    ## [1] "TGATGA" "GATGAC" "ATGACC" "TGACCT" "GACCTG" "ACCTGA"

We combined 'substr' and 'mapply' to split our sequence into a vector of overlapping fragments. The next step is to check whether a fragment is a hit. This check should be written as a function, as it is to be reused multiple times.

    hitcounter <- function(fragment, themotif){
      if (fragment == themotif)
        return(1)
      else
        return(0)
    }

The if-else-statement can be written in a shorter way using the function 'ifelse'. This function condenses the two return statements into one function call.

Finding occurrences of HRE

- Test your function 'hitcounter' with fragment number 1 and 4.
- Try writing a shorter version of 'hitcounter' using 'ifelse' (if you are not sure what to do, use the help pages of R/RStudio).
- Extend your function 'hitcounter' to search for the reverse complement as well. Test your function again with fragment number 1 and 4.
- Bonus: What happens if you give your function 'hitcounter' your complete vector instead of one element?

So, we created our own building block that returns a 1 whenever a fragment matches a given motif. If we now apply this function to all our fragments, we obtain a vector with mostly 0's and a few 1's, as seen in the last exercise. Note that this needs the vectorised 'ifelse' version of 'hitcounter' from the exercise; the 'if' version only handles a single fragment. If we sum these numbers up ...

    scores <- hitcounter(prdx1substrings, hre)
    nhits <- sum(scores)
    nhits
    ## [1] 3

... we get the total number of hits in our sequence. As we may want to use that in other sequences or with other motifs as well, we will wrap everything up in a function called 'counthits':

    counthits <- function(sequence, motif) {
      # a hit is either the motif itself or its reverse complement
      compmotif <- revcomplement(motif)
      len <- nchar(motif)
      # start and end positions of all overlapping windows of width len
      leftlim <- 1:(nchar(sequence) - (len - 1))
      rightlim <- len:nchar(sequence)
      frags <- mapply(substr, sequence, leftlim, rightlim, USE.NAMES = FALSE)
      # 1 for every fragment matching either orientation, 0 otherwise
      scores <- ifelse(frags == motif | frags == compmotif, 1, 0)
      return(sum(scores))
    }
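
As a cross-check for 'counthits', the same count can be sketched with base R's vectorised 'substring' in place of 'mapply'. The names 'revcomp_sketch' and 'counthits_sketch' and the toy sequence are assumptions for illustration. 'chartr' is used instead of the 'basecomp' function from the exercises, so this is an alternative route rather than the tutorial's solution.

```r
# Hypothetical helper: reverse complement via chartr(), strsplit() and rev().
revcomp_sketch <- function(dna) {
  comp <- chartr("ACGT", "TGCA", dna)   # complement every base
  paste(rev(strsplit(comp, split = NULL)[[1]]), collapse = "")
}

# Hypothetical variant of counthits based on the vectorised substring().
counthits_sketch <- function(sequence, motif) {
  len <- nchar(motif)
  nwin <- nchar(sequence) - len + 1     # number of overlapping windows
  if (nwin < 1) return(0)               # sequence shorter than the motif
  starts <- seq_len(nwin)
  frags <- substring(sequence, starts, starts + len - 1)
  sum(frags == motif | frags == revcomp_sketch(motif))
}

toyseq <- "TGATGACCTGA"   # assumed toy sequence, not the real Prdx1 promoter
toy_hits <- counthits_sketch(toyseq, "AGGTCA")
print(toy_hits)
```

On the toy sequence the motif itself does not occur, but its reverse complement occurs once, so the sketch reports one hit. A palindromic motif, whose reverse complement equals the motif, is counted once per position because the two comparisons are combined with '|'.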
