Perl Pipelines - Gorgonzola.cshl.edu

1y ago
9 Views
1 Downloads
869.70 KB
48 Pages
Last View : 1m ago
Last Download : 5m ago
Upload by : Brenna Zink
Transcription

Perl PipelinesUsing perl as bioinformatics glueSimon Prochnikwith code from Scott Cain1

Some of what cover in the coursePhylogenetics: paup,MrBayes, phyml, PAMLmultiple sequencealignment: clustalw,musclepairwise sequence alignment:smith-waterman, fastaGene predictionsequence search:blastEST alignment to genomegenome sequenceEST sequencedatabases: nr,swissprot2

perl and bioperl glue all these togetherPhylogenetics: paup,MrBayes, phyml, PAMLmultiple sequencealignment: clustalw,musclepairwise sequence alignment:smith-waterman, fastaGene predictionsequence search:blastEST alignment to genomegenome sequenceEST sequencedatabases: nr,swissprot3

Looking for information on bioperl modules and functionsbioperl toolbar bioperl main page search box on left tool barUseful linksbioperl Quick start http://www.bioperl.org/wiki/Quick startbioperl scripts pagehttp://www.bioperl.org/wiki/Bioperl scriptshas someone else come across the problem? http://www.bioperl.org/wiki/FAQHow do I. http://www.bioperl.org/wiki/HOWTOsFull rowse modules (then use search function in your dulesperldoc Bio::Align::clustalw http://www.bioperl.org/wiki/Main Page 4

Toy example Let’s assume we have a multiple fasta file and we want touse perl to run the program clustalw to make a multiplesequence alignment and read in the results. Here are some sequences vca4886446 93762MSPPPTHSTTESRMAPPSQSSTPSGDVDGS vca4887371 120236MAGLHSVPKLSARRPDWELPELHGDLQLAP vca4887497 89954MAYKLFGTAAVLNYDLPAERRAELDAMSME vca4888938 93984MLHTDLQPPRCRTSGPRPDPLRMETRARER5

perldoc NAMEBio::AlignIO::clustalw - clustalw sequence input/output streamSYNOPSISDo not use this module directly.class.Use it via the Bio::AlignIODESCRIPTIONThis object can transform Bio::Align::AlignI objects to and fromclustalw files.From bioperl quickstart pageProcess a Clustalw formatted alignmentUse Bio::AlignIO to parse multiple sequence alignments. If one wanted to convert the alignment into aNEXUS multiple alignment format for running with PAUP and MrBayes, use Bio::AlignIO::nexus. If onewanted to run a PHYLIP program use the Bio::AlignIO::phylip module to convert the alignment intoPHYLIP multiple alignment format.6

Finding help on formats rent/bioperl-live/Bio/AlignIO.html7

Figuring out what formats are availableexamples aregiven in thesynopsisSearch the Bio::AlignIOdocumentation webpage for examples of‘clustalw’It looks like we can use :-format ‘clustalw’to make Bio::AlignIO work with clustalw format fileshttp://search.cpan.org/ birney/bioperl-1.4/Bio/AlignIO.pm8

Using Bio::AlignIO::clustalw via Bio::AlignIO#!/usr/bin/perl -wuse strict;use Bio::AlignIO; inputfilename "testaln.clustalw"; in Bio::AlignIO- new(-file inputfilename ,-format 'clustalw');while ( my aln in- next aln() ) {!.!do something!.}Here we tellBio::AlignIO to usethesubclass ::clustalw9

perldoc -f command to get help% perldoc -f splitsplit /PATTERN/,EXPR,LIMITsplit /PATTERN/,EXPRsplit /PATTERN/splitSplits the string EXPR into a list of strings and returns thatlist. By default, empty leading fields are preserved, andempty trailing ones are deleted. (If all fields are empty,they are considered to be trailing.)10

The perl debugger perl -d myScript.plLoading DB routines from perl5db.pl version 1.28Editor support available.Enter h or h h' for help, or man perldebug' for more help.main::(myScript.pl:3):! print "hello world\n";DB 1 hqn or s return c 45b 45b 45 a 0p ax ahelpquitnext line or step through next linerepeat last n or scontinue to line 45break at line 45break at line 45 if a equals 0print the value of aunpack or extract the data structurein a11

The interactive perl debugger perl -de 4Loading DB routines from perl5db.pl version 1.28Editor support available.Enter h or h h' for help, or man perldebug' for more help.main::(-e:1):!4DB 1 a {foo [1,2] , boo [2,3] , moo [6,7]}DB 2 x a0 HASH(0x8cd314)'boo' ARRAY(0x8c3298)0 21 3'foo' ARRAY(0x8d10d4)0 11 2'moo' ARRAY(0x815a88)0 61 712

More perl tricks: one line perl perl -e COMMAND perl -e '@a (1.4);print join("\t",@a),"\n"'12!3! 4#print IDs from fasta file perl -ne 'if (/ (\S )/) {print " 1\n"}' volvox AP2EREBP.favca4886446 93762vca4887371 120236vca4887497 89954 see Chapter 19, p. 492-502 Perl book 3rd ed.13

Is a module installled?one-line perl program with ‘-e’this is the program in quotes% perl -e 'use Bio::AlignIO::clustalw'all ok: no errorsThe module in the next example hasn’t been installed(it doesn’t actually exist)% perl -e 'use Bio::AlignIO::myformat'Can't locate Bio/AlignIO/myformat.pm in@INC (@INC contains: /sw/lib/perl5 /sw/lib/perl5/darwin /Users/simonp/lib i-2level /Users/simonp/Library/Perl/5.8.1 /Users/simonp/com lib /Users/simonp/cvs/bdgp/software/perl-modules .To install a module% sudo cpaninstall Bio::AlignIO::clustalwperl can’t find the module in any ofthe paths in the PERL5LIB list (whichis in the perl variable @INC)You can add directories withuse lib ‘/Users/yourname/lib’;after the use strict; at the beginningof your script14

Looking for help with Google Google program name documentation / docs / command lineeg google ‘clustal command line’USE OF OPTIONS!All parameters of Clustalw can be used as optionswith a "-" That permits to use Clustalw in a script orin batch.!!!! clustalw -optionsCLUSTAL W (1.7) Multiple Sequence Alignmentsclustalw option list:-help-options-infile filename-outfile filename-type protein OR dna-output gcg OR gde OR pir OR phylip15

Build a command line from the options you needUSE OF OPTIONS!All parameters of Clustalw can be used as optionswith a "-" That permits to use Clustalw in a script orin batch.!!!! clustalw -optionsCLUSTAL W (1.7) Multiple Sequence Alignmentsclustalw option list:-help-options-infile filename-outfile filename-type protein OR dna-output gcg OR gde OR pir OR phylipCommand line would be:% clustalw -infile ExDNA.fasta -outfile ExDNA.pir -output pir-type dna16

Running a command line from perlCommand lineclustalw -infile ExDNA.fasta-outfile ExDNA.pir -output pir -type dnaScript#!/usr/bin/perluse strict; use warnings;use Bio::AlignIO;my file ‘ExDNA.fasta’;my clustFile ‘ExDNA.pir’;my cmd “clustalw -infile file -outfile clustFile-output pir -type dna”;# build command lineprint “Call to clustalw cmd\n”;# show commandmy oops system cmd;# system call and save return# value in oopsdie “FAILED !” if oops;# oops true if failedmy in Bio::AlignIO- new(-file clustFile,-format 'clustalw');while ( my aln in- next aln() ) {.}17

Util.pm packagepackage Util;use strict;our @EXPORT qw(do or die);!!! ! ! ! ! !!!! ! ! ! ! !use Exporter;use base 'Exporter';# allow do or die() to be exported! # without specifying! # Util::do or die()# -----------sub do or die {my cmd shift;print "CMD: cmd\n";my oops system cmd;die "Failed" if oops;}# -----------1;18

Util.pm in a script#!/usr/bin/perluse strict; use warnings;use lib ‘lib’; # you might need to tell perl where to find Util.pm# or with something like this# use lib ‘/Users/simonp/lib’;use Util;use Bio::AlignIO;my file ‘ExDNA.fasta’;my clustFile ‘ExDNA.pir’;my cmd “clustalw -infile file -outfile clustFile-output pir -type dna”;# build command lineprint “Call to clustalw cmd\n”;# show commanddo or die( cmd);!!# I use this all the timemy in Bio::AlignIO- new(-file clustFile,-format 'clustalw');while ( my aln in- next aln() ) {.}19

See bioperl-run for modules that will run other programs The Run package (bioperl-run) provides wrappers for executing some 60common bioinformatics applications (bioperl-run in CVS) http://www.bioperl.org/wiki/Run packageBio::Tools::Run::Alignment::clustalwThere are several pieces to bioperlhttp://www.bioperl.org/wiki/Using CVS bioperl-live bioperl-pedigreeCore modules including parsers and main objectsbioperl-run Wrapper modules around key applicationsbioperl-ext Ext package has C extensions including alignment routines and link tostaden IO library for sequence trace reads.bioperl-microarraybioperl-guibioperl-db20

Smart Essential coding practices use strict; use warnings. ALWAYS. Do it!Put all the hard stuff in subroutines. This makes the code easy to read and understand. don’t copy and paste code: bugs multiply, corrections getcomplicated;It keeps the code on a single screen, which prevents bugs.Each subroutine should have similar design.If you want to re-use a subroutine several times, put it in a moduleand re-use the module eg Util.pm#comments (ESC-; makes a comment in EMACS)what a subroutine expects and returnsanything new to you or unusualUse tab indentation for loops, logic, subroutines it’s so much easier to spot bugs and follow the code21

Coding strategy Use the simplest tool for the job: it will be faster to codeRe-use and modify existing code as much as possibleTurn to bigger/more complicated tools if and only if you needthem: is it going to take less time to wait for your code to finish than learningabout a complex tool? is it going to take more time to write a complex tool or search for it on theweb or ask your friends what they use?Write your code in small pieces and test each piece as you go.Check your input data weird characters, line returns (\r or \n ? ), whitespace at the end of lines,spaces instead of tabs. You can use % od -c mydatafile moreare there missing pieces, duplicated IDs?use a small piece of (real or fake) data to test your codeIs the output exactly what you expect?22

gene pred pipe.pl (by Scott Cain) part I#!/usr/bin/perl -wuse s::GFF;my acc ARGV[0]; # read argument from command line# main functions in simple subroutinesmy seq obj acc to seq obj( acc);my masked seq repeat mask( seq obj);my @predictions run genscan( masked seq);predictions to gff(@predictions);warn ------23

gene pred pipe.pl (by Scott Cain) part IIsub acc to seq obj {#takes a genbank accession, fetches the seq from#genbank and returns a Bio::Seq object#parent script has to use Bio::DB::Genbank my acc shift;my db new Bio::DB::GenBank;return db- get Seq by id( acc);}sub repeat mask {#takes a Bio::Seq object and runs RepeatMasker locally.#Parent script must use Bio::Tools::Run::RepeatMasker my seq shift;#BTRRM- new() takes a hash for configuration parameters#You'll have to set those up appropriatelymy factory Bio::Tools::Run::RepeatMasker- new();return factory- masked seq( seq);}24

gene pred pipe.pl (by Scott Cain) part IIIsub run genscan {#takes a Bio::Seq object and runs Genscan locally and returns#a list of Bio::SeqFeatureI objects#Parent script must use Bio::Tools::Run::Genscan my seq shift;#BTRG- new() takes a hash for configuration parameters#You'll have to set those up appropriatelymy factory Bio::Tools::Run::Genscan- new();#produces a list of Bio::Tools::Prediction::Gene objects#which inherit from Bio::SeqFeature::Gene::Transcript#which is a Bio::SeqFeatureI with child featuresmy @genes factory- run( seq);my @features;for my gene (@genes) {push @features, gene- features;}return @features;}sub predictions to gff {#takes a list of features and writes GFF2 to a file#parent script must use Bio::Tools::GFF my @features @ ;my gff out Bio::Tools::GFF- new(-gff version 2,-file ' prediction.gff'); gff out- write feature( ) for (@features);return;}25

Getting arguments from the command line with Getopt::Long andGetOptions() complicated.pl -flag --pie -start 4-expect 1e-50 -value 0.00423 -pet cat -pet dog order of arguments doesn’t matterdeals with flags, integers, decimals, strings, listsan example:use Getopt::Long;my ( flag, count, price, string);GetOptions( “flag” \ flag,“count i”,\ count,# integer“price f”,\ price,# floating point 0.12,3e-49“name s”,\ string,# always use trailing ‘,’);26

genbank to blast.pl (by Scott Cain) part I#!/usr/bin/perl -wuse strict;use lib "/home/scott/cvs stuff/bioperl-live";# this will change depending!!!!!!!!!!!# on your machineuse Getopt::Long;use Bio::DB::GenBank;#use Bio::Tools::Run::RepeatMasker;# running repeat masked first is a good!!!!!!!!!# idea, but takes a whileuse Bio::Tools::Run::RemoteBlast;use Bio::SearchIO;use Bio::SearchIO::Writer::GbrowseGFF;use Bio::SearchIO::Writer::HTMLResultWriter;use Data::Dumper;# print out contents of objects etc#take care of getting argumentsmy usage " 0 [--html] [--gff] --accession GB accession number ";my ( HTML, GFF, ACC);GetOptions ("html" \ HTML,"gff" \ GFF,"accession s" \ ACC);unless ( ACC) {warn " usage\n";exit(1);}#This will set GFF as the default if nothing is set but allowing both to be set GFF 1 unless HTML;#Now do real stuff .27

genbank to blast.pl (by Scott Cain) part II# Now do real stuff# nice and neat subroutine calls# easy to understand logic of codemy seq obj acc to seq obj( ACC);my masked seq repeat mask( seq obj);my blast res blast seq( masked seq);gff out( blast res, ACC) if GFF;html out( blast res, ACC) if HTML;#------------------------------------------28

genbank to blast.pl (by Scott Cain) part IIIsub acc to seq obj {print STDERR "Getting record from GenBank\n";my acc shift;my db new Bio::DB::GenBank;return db- get Seq by id( acc);}sub repeat mask {my seq shift;return seq;#short circuiting RM since we#don't have it installed, but this would be where# you would run it#my factory Bio::Tools::Run::RepeatMasker new();#return factory- masked seq( seq);}29

genbank to blast.pl (by Scott Cain) part IVsub blast seq {my seq shift;my prog 'blastn';my e val '1e-10';my db 'refseq rna';my @params (-prog prog,-expect e val,-readmethod 'SearchIO',-data db);my factory Bio::Tools::Run::RemoteBlast- new(@params); factory- submit blast( seq);my v 1; # message flagprint STDERR "waiting for BLAST." if ( v 0 );while ( my @rids factory- each rid ) {foreach my rid ( @rids ) {my rc factory- retrieve blast( rid);if( !ref( rc) ) { #waiting.if( rc 0 ) { factory- remove rid( rid);}print STDERR "." if ( v 0 );sleep 25;}else {print STDERR "\n";return rc- next result();}}}}30

genbank to blast.pl (by Scott Cain) part Vsub gff out {my ( result, acc) @ ;my gff out Bio::SearchIO- new(-output format 'GbrowseGFF',-output signif 1,-file " acc.gff",-reference 'query',-hsp tag 'match part',); gff out- write result( result);}sub html out {my ( result, acc) @ ;my writer Bio::SearchIO::Writer::HTMLResultWriter- new();my html out Bio::SearchIO- new(-writer writer,-format 'blast',-file " acc.html"); html out- write result( result);}31

32

33

How to approach perl pipelines use strict and warningsuse (bio)perl as gluehttp://www.bioperl.org/wiki/Main Pagegoogle.comtest small pieces as you write them (debugger: perl -d)construct a command line and test it (catch failure .or die.)convert into system call, check it worked with small sample datasetextend to more complex code only as neededif you use code more than once, put it into a subroutine in a modulee.g. Util.pmget command line arguments with GetOptions()34

Perl Best PracticesPfB 2009Always usestrictand usewarnings.use strict and use warnings arebuilt-in ways to get error messages fromperl.use strict forces you to declare yourvariables.use warnings turns on warnings fortypos and trying to print undefined variables.

Always check that youropens and closes workedopen (MYFILE, seqfile) or die"couldn't open seqfile: !\n"; ! is the error message that explains whatwent wrong. perl test.pl seqs.txtcouldn't open seqs.txt:No suchfile or directoryExplicitly pass variables to a functionFor now, say shift @ARGV, not shift.while (my line )notwhile ( )

Just because you can use a clever shortcutdoesn't mean you'll remember how cleveryou were a month later.It takes more time to debug code writtenwith shortcuts than you'll save using them.Write code a step at a timeand then run it.If you write 2 lines of code, run it, and it'sbroken, how many lines do you have tosearch to find your problem?If you write 10 lines of code, run it, and it'sbroken, how many lines do you have tosearch to find your problem?

Write code a step at a timeand then run it.Which will take longer and be morefrustrating?Going stepwise and checking your workalong the way, or writing a bunch and thentrying to find where the problem is?Write pseudocodeIt helps you to think more clearly and figureout what you want to do.Draw diagrams of your data structures.Just because you're working on thecomputer doesn't mean you can't use apencil and paper.

Write comments to explain codeanytime you had tostop and think about it.Like pseudocode, it helps you to figure outwhat you mean.If your code says one thing and yourcomments another, it might be a bug.Use newlines and tabs toorganize your code.Just as you write English in paragraphscontaining related ideas, write your code inparagraphs that keep related code together.

Use indentation for blocksIt helps you see the structure of your codeand see mismatched parentheses orbrackets.Use an editor that colors textfor you and helps you indentIf you forget to close a string or put a on avariable, usually you'll notice that all of yourscript is colored as if it's a string.And you didn't mean that, did you?

Use meaningful variable namesIt makes your code self-documenting.my a x/ y * 100;vsmy percent gc gc count/ nucleotide count * 100;Use meaningful variable namesLonger is usually better.Using a meaningful name will help you toread your code and understand what it'ssupposed to be doing.

Use meaningful variable namesIf the name says what’s supposed to be inside thevariable, you're more likely to notice if it’s wrong.my seq Bio::Seq- new (-seq 'AGTC');# later .print seq, "\n"; perl seq.plBio::Seq HASH(0x100804ed0)Use meaningful variable namesNow that my sequence object is named seq obj,I’m less likely to forget it’s not the actual sequencestring.my seq obj Bio::Seq- new (-seq 'AGTC');# later .print seq obj- seq, "\n"; perl seq.plAGTC

Use meaningful variable namesCall a ref a ref.my gene names ref \%gene names;Call an object an object.my seq obj Bio::Seq- new();Use meaningful variable namesUse singular for scalar names and plural for arraynames.my @genes;foreach my gene (@genes) {}

Don't reuse variable namesRemember the scope example.Don't make a subroutine with the samename as a built-in function.Make usage statementsSay what parameters and options your scriptexpects.

Make usage statementsmy usage "Usage: 0[ -triangles proteomeIds order transcripts lowestfirst and add metrics[ -evalue]cutoff]proteomeIds are separated by a comma.Finds pairs of longest, active transcripts which are mbh withrespect to score-triangles option prints MBH pairs with transcript IDs inincreasing order followed by evalue and score columns (total 4cols)";die usage if #ARGV ! 1;Make usage statements perl MBHpairs.plUsage: MBHpairs.pl proteomeIds [ -trianglesorder transcripts lowestfirst and add metrics[ -evalue]cutoff]proteomeIds are separated by a comma.Finds pairs of longest, active transcripts which arembh with respect to score-triangles option prints pairs with transcript IDs inincreasing order followed by evalue and score columns(total 4 cols)";

Keep good notesJust like in the wet lab, keep a lab notebook.Here's a (short) article which has a gooddiscussion of how to do this:http://tinyurl.com/organizecompbioUse version controlKeeps track of previous versions of yourcode. If you screw something up, you can goback to to any earlier version.We can show you how if you’re interested.

Why am I insisting you followthese rules?Am I just being pedantic?I have made all of the mistakes you havemade and many more you can't evenimagine. Debugging code is not fun.These rules are designed to keep you fromgetting in your own way.You may think you are writing code "just foryou", or "just for this project", but rememberthat in six months you'll be the one who hasto figure out what the heck you werethinking when you wrote it.

There’s a whole book on this topicby Damien Conway

Perl Pipelines Using perl as bioinformatics glue Simon Prochnik with code from Scott Cain 1. Some of what cover in the course Gene prediction EST alignment to genome EST sequence Phylogenetics: paup, MrBayes, phyml, PAML multiple sequence alignment: clustalw, muscle pairwise sequence alignment:

Related Documents:

Why Perl? Perl is built around regular expressions -REs are good for string processing -Therefore Perl is a good scripting language -Perl is especially popular for CGI scripts Perl makes full use of the power of UNIX Short Perl programs can be very short -"Perl is designed to make the easy jobs easy,

Perl can be embedded into web servers to speed up processing by as much as 2000%. Perl's mod_perl allows the Apache web server to embed a Perl interpreter. Perl's DBI package makes web-database integration easy. Perl is Interpreted Perl is an interpreted language, which means that your code can be run as is, without a

Run Perl Script Option 3: Create a Perl script my_script.pl: Run my_script.pl by calling perl: 8/31/2017 Introduction to Perl Basics I 10 print Hello World!\n; perl ./my_script.pl Option 4: For a small script with several lines, you can run it directly on the command line: perl -e print Hello World!\n;

Other Perl resources from O’Reilly Related titles Learning Perl Programming Perl Advanced Perl Programming Perl Best Practices Perl Testing: A Developer’s . Intermedi

Run Perl Script Option 3: Create a Perl script my_script.pl: Run my_script.pl by calling perl: 8/31/2017 Introduction to Perl Basics I 10 print Hello World!\n; perl ./my_script.pl Option 4: For a small script with several lines, you can run it directly on the command line: perl -e print Hello World!\n;

Perl's creator, Larry Wall, announced it the next day in his State of the Onion address. Most notably, he said "Perl 6 is going to be designed by the community." Everyone thought that Perl 6 would be the version after the just-released Perl v5.6. That didn't happen, but that's why "Perl" was in the name "Perl 6."

tutorial Sorry about that but I have to keep my tutorial's example scripts short and to the point Finally, this is a tutorial for Perl/Tk only I will not be teaching perl here So if you know perl, continue But if you are a beginner to perl, I would recommend that you read my perl tutorial

A. General guidance for academic writing The style of writing required for LSHTM assessments may call for different skills to those you have used in your previous education or employment. If you are not entirely confident in this, remember that the more academic writing you do, the better you will become at it. Aspects that may be new or unfamiliar, such as citing and referencing, should .