Stacks: An Analysis Tool Set For Population Genomics


Molecular Ecology (2013) 22, 3124–3140
doi: 10.1111/mec.12354

Stacks: an analysis tool set for population genomics

JULIAN CATCHEN,* PAUL A. HOHENLOHE,*† SUSAN BASSHAM,* ANGEL AMORES‡ and WILLIAM A. CRESKO*
*Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403-5289, USA, †Biological Sciences, University of Idaho, Moscow, ID 83844-3051, USA, ‡Institute of Neuroscience, University of Oregon, Eugene, OR 97403-1254, USA

Abstract

Massively parallel short-read sequencing technologies, coupled with powerful software platforms, are enabling investigators to analyse tens of thousands of genetic markers. This wealth of data is rapidly expanding and allowing biological questions to be addressed with unprecedented scope and precision. The sizes of the data sets are now posing significant data processing and analysis challenges. Here we describe an extension of the Stacks software package to efficiently use genotype-by-sequencing data for studies of populations of organisms. Stacks now produces core population genomic summary statistics and SNP-by-SNP statistical tests. These statistics can be analysed across a reference genome using a smoothed sliding window. Stacks also now provides several output formats for several commonly used downstream analysis packages. The expanded population genomics functions in Stacks will make it a useful tool to harness the newest generation of massively parallel genotyping data for ecological and evolutionary genetics.

Keywords: GBS, genetics, next-generation sequencing, population genomics, RAD-seq

Received 19 November 2012; revision received 16 April 2013; accepted 16 April 2013

Introduction

The study of nearly complete genetic information in numerous individuals drawn from scores of populations is now rapidly becoming a reality (Storz 2005; Bonin 2008; Hohenlohe et al. 2010a, 2012a; Stapley et al. 2010).
New molecular genetic techniques (Mardis 2008), enabled by massively parallel short-read sequencing technologies coupled with powerful software, have been critical to advances in this nascent field of population genomics. Investigators have employed these methods to move from painstakingly developing dozens of microsatellite markers to rapidly producing tens of thousands of single nucleotide polymorphism (SNP) markers (Davey et al. 2011; McCormack et al. 2013). Several molecular approaches have been developed to focus the large number of short reads provided by modern sequencing platforms on specific, restriction enzyme–anchored positions in the genome (e.g. CRoPS, Van Orsouw et al. 2007; RAD-seq, Baird et al. 2008; Etter et al. 2011b; GBS, Elshire et al. 2011; double-digest RAD-seq, Peterson et al. 2012; and 2bRAD, Wang et al. 2012b). This family of reduced representation genotyping approaches, generically called genotype-by-sequencing (GBS) or restriction site–associated DNA sequencing (RAD-seq; Davey et al. 2011), subsamples the genome at homologous locations to identify and type SNPs evenly throughout the genome. Population genomics using GBS allows classic problems in ecological and evolutionary genetics, such as identification of parentage and relatedness, migration and gene flow, population structure and phylogeography, and phylogenetic reconstruction, to be addressed with unprecedented power and precision (Mitchell-Olds et al. 2008; Hohenlohe et al. 2010a; Stapley et al. 2010). More importantly, population genomic studies allow the simultaneous identification of a genome-wide average and outliers for any given statistic to help identify genomic regions contributing to local adaptation or even speciation (Lewontin & Krakauer 1973; Maynard Smith & Haigh 1974; Luikart et al. 2003; Beaumont & Balding 2004; Nielsen 2005; Storz 2005; Nielsen et al. 2007; Foll & Gaggiotti 2008; Gaggiotti et al. 2009; Hohenlohe et al. 2010b, 2012b; Strasburg et al. 2012).

Correspondence: William A. Cresko, Fax: 541-346-2364; E-mail: wcresko@uoregon.edu

© 2013 John Wiley & Sons Ltd

The wealth of genetic data provided by massively parallel short-read sequencing brings serious challenges in data processing and analysis (Shendure & Ji 2008; Glenn 2011). Studies now commonly comprise billions of raw sequences used to genotype tens of thousands to millions of SNPs. The key to making such studies feasible is software that can efficiently assemble reads together, identify alleles and genotypes, and track those genotypes in hundreds of individuals in scores of populations using a statistically rigorous framework (Lynch 2009; Gompert et al. 2010; Hohenlohe et al. 2010b). To help minimize the challenges of using GBS methods for genetic studies, we developed Stacks (http://creskolab.uoregon.edu/stacks/), a computational pipeline designed to work with any restriction enzyme–based GBS data. Stacks is computationally robust, efficient and flexible and can assemble short reads de novo or use data aligned to a reference genome. The Stacks software can handle data from thousands of individuals and incorporates a MySQL database and web front end for efficient data visualization, management and modification.

Stacks was initially designed for genetic mapping crosses (Catchen et al. 2011), and we have added significant functionality for ecological and evolutionary genomic analyses. Here, we describe and evaluate these new features of Stacks using RAD-seq data from Oregon threespine stickleback (Gasterosteus aculeatus) populations. A complete manual for Stacks is ks manual.pdf), as are additional tutorials and other resources.

Experimental space and the central concept of Stacks

Analysing GBS data requires several steps such as acquiring raw sequence data, filtering out low-quality reads, assembling or aligning reads, and finally inferring SNPs and genotypes. Each step has its own associated challenges and uncertainties.
These arise from genomic attributes such as the number of loci identified, the degree of repetitive sequences throughout the genome, and the level of polymorphism and divergence among populations. These biological factors also interact with sequencing characteristics such as the quality of DNA and degree of sample multiplexing, the total number and length of reads, and the sequencing error rate. Key decisions therefore need to be made at each step about such items as the required depth of coverage or allowable nucleotide distance between reads for assembly. Finally, because of biological and sequencing sampling variation, the use of statistical models will often be necessary.

We have built the Stacks software platform to be modular and tunable to facilitate iterative exploration of the biological and sequencing parameter space for a particular study and to easily acquire and incorporate additional data. At the core of Stacks is the catalogue – a collection of all the loci and alleles identified in a population of individuals. In a mapping cross, the catalogue is simple and contains only loci found in the parents, enabling the identification of parental alleles present in the progeny. In the more general case of a set of individuals from one or more populations, the catalogue grows more complex and can often contain many more loci and segregating alleles. If a reference genome is available, those loci can be ordered, allowing them to be compared along the genome. Stacks uses a relational database and a web-based user interface. This interface allows for data visualization and user-directed modifications and corrections to the genetic hypotheses. Below we describe some of the major steps, decision points, statistical considerations and ways to specify the major parameters for Stacks.

Major steps of a Stacks analysis

The raw input data to Stacks are sequenced DNA fragments from any restriction enzyme–based GBS protocol.
These protocols provide reads that will be anchored to homologous locations in the genome, which then appear as well arranged ‘stacks’ when visualized (see Davey et al. 2011 for details). Stacks can handle raw sequencing data in FASTA or FASTQ format to identify loci de novo and reads aligned against a reference genome in SAM (Li et al. 2009) format. Aligned reads may be gapped to allow for indels. Regardless of whether the data are assembled de novo, or aligned against a reference genome, many subsequent steps in Stacks are shared.

Stacks is a collection of several original C programs and Perl scripts. The components of Stacks can be run individually by hand or using one of two provided wrapper programs that will execute the entire pipeline (denovo_map.pl or ref_map.pl).

The pipeline is outlined in Fig. 1 and can be described as follows:

1 Raw sequence reads are demultiplexed and cleaned (process_radtags).
2 Data from each individual are grouped into loci, and polymorphic nucleotide sites are identified (ustacks or pstacks for unaligned or aligned data, respectively).
3 Loci are grouped together across individuals and a catalogue is written (cstacks).
4 Loci from each individual are matched against the catalogue to determine the allelic state at each locus in each individual (sstacks).

5 Allelic states are either converted into a set of mappable genotypes (for a genetic map) using genotypes or subjected to population genetic statistics via populations, with the results being written in one or several useful output files.

Fig. 1 The Stacks pipeline. Stacks proceeds in five major stages. First, reads are demultiplexed and cleaned by the process_radtags program. The next three stages comprise the main Stacks pipeline: building loci (ustacks/pstacks), creating the catalogue of loci (cstacks) and matching against the catalogue (sstacks). In the fifth stage, either the populations or genotypes program is executed, depending on the type of input data. The populations program tabulates the state of loci within and among populations, calculates population genetics statistics and exports to a number of additional, useful formats. The genotypes program is further described in Catchen et al. 2011.

As described previously in Catchen et al. (2011), a web-based front end, backed by a MySQL database, is available to visualize the data. Both denovo_map.pl and ref_map.pl will automatically populate a MySQL database during execution.

De novo stack formation

Stacks will, through the program ustacks, use a k-mer search algorithm to merge alleles into loci. First, exactly matching reads are formed into stacks using a hashing algorithm. Stacks are subsequently decomposed into k-mers (subsequences of length k) that are compared among stacks to find matching alleles (see Catchen et al. 2011 for more detail). In the previous version of Stacks, this process was controlled by two parameters. The stack depth parameter (-m) controls the number of raw reads required to form a stack, and the mismatch parameter (-M) specifies the number of allowed nucleotide mismatches between two stacks to merge them into a locus.

We here add a third parameter. The maximum stacks allowed per locus can also now be modulated (--max_locus_stacks). The expectation for nonrepetitive genomic regions is that a monomorphic locus will produce a single stack because the two sequences on the two homologous chromosomes are identical and thus indistinguishable. In contrast, a polymorphic locus will produce two stacks representing alternative alleles (Fig. 2A). More complex cases abound, however, from short, sequencing error-based stacks in addition to the true alleles, to repetitive sequences, where hundreds of loci in the genome may collapse to a single putative locus. Stacks can be used to identify and remove these confounding cases. For example, the maximum stacks per locus parameter allows the user to limit the number of stacks at any single locus (default 3). If the limit is exceeded, the locus is blacklisted, meaning it will not be available for insertion into, or matching against, the catalogue. These confusing loci can be ignored for all subsequent analyses. However, Stacks also contains a deleveraging algorithm in ustacks to help deconvolute some of these confounded loci. In previous versions of Stacks, if too many stacks were present at a single locus, the locus would be broken down using a hierarchical clustering algorithm. We have replaced this algorithm with a more sensitive heuristic that is based upon a minimum-spanning tree [See Appendix S1, 1.1, Supporting information for details of the algorithm].

Reference-guided stack formation

When a reference genome is available, Stacks relies on a set of aligned reads to assemble loci.
Through the program pstacks, Stacks is able to use data from any alignment program that can produce SAM or BAM output files and has been extensively tested with Bowtie (Langmead et al. 2009), BWA (Li & Durbin 2009) and GSNAP (Wu & Nacu 2010). The pstacks program will read the CIGAR string (Li et al. 2009) from each alignment in the SAM file to determine whether the read contained an insertion, deletion or soft-masking [see Appendix S1, 1.2, Supporting information for information on CIGAR strings]. When a deletion has occurred in the read relative to the reference, pstacks will insert Ns to regain phase with the reference, and trim the end of the read to keep the length constant. Conversely, if an insertion has occurred in the read relative to the reference, pstacks will trim out the inserted bases and pad the end of the read with Ns. Both of these operations will allow bi-allelic loci

Fig. 2 The ustacks deleveraging algorithm. (A) The simples
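The handling of insertions and deletions described above for pstacks can be illustrated with a small sketch. This is not the pstacks source code: the function name `normalize_read` and the regular expression are hypothetical, and the sketch deliberately rejects soft-masked and other CIGAR operations because the text above does not say how pstacks treats them. It only shows the two stated rules: deleted positions are filled with Ns and the read end is trimmed, while inserted bases are removed and the read end is padded with Ns.

```python
import re

# One CIGAR element: a count followed by an operation letter (SAM format).
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def normalize_read(seq: str, cigar: str) -> str:
    """Re-phase a read against the reference while keeping its length constant.

    Hypothetical helper for illustration only; handles M/=/X, I and D.
    """
    pieces = []
    pos = 0  # current position within the read sequence
    for count, op in CIGAR_ELEMENT.findall(cigar):
        n = int(count)
        if op in "M=X":
            # Aligned bases are copied through unchanged.
            pieces.append(seq[pos:pos + n])
            pos += n
        elif op == "D":
            # Deletion in the read: insert Ns to regain phase with the reference.
            pieces.append("N" * n)
        elif op == "I":
            # Insertion in the read: skip (trim out) the inserted bases.
            pos += n
        else:
            # Soft-masking and other operations are outside this sketch.
            raise ValueError(f"unsupported CIGAR operation: {op}")
    rephased = "".join(pieces)
    # Trim the end after a deletion, or pad the end with Ns after an insertion.
    return rephased[:len(seq)].ljust(len(seq), "N")
```

For an 8-base read with CIGAR `4M2D4M`, the sketch inserts two Ns after the fourth base and drops the last two bases; for a 10-base read with `4M2I4M`, it removes bases five and six and appends two Ns. In both cases the output has the same length as the input, which is what keeps every read at a locus in register.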

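The de novo stack formation procedure described above (exact-match stacks, a minimum stack depth, a mismatch threshold for merging stacks into loci, and a cap on stacks per locus) can be sketched in a few lines. This is a simplified illustration, not the ustacks algorithm: it compares every pair of stacks by Hamming distance instead of using a k-mer search, it merges stacks by single linkage, and it omits the deleveraging step. The function names and the default values for `min_depth` and `max_mismatch` are assumptions made for the example; only the default of 3 for the maximum stacks per locus comes from the text.

```python
from collections import Counter, defaultdict


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def build_loci(reads, min_depth=2, max_mismatch=2, max_locus_stacks=3):
    """Group reads into putative loci; returns (loci, blacklisted_loci)."""
    # Step 1: exactly matching reads form stacks (hashing on the sequence).
    depth = Counter(reads)
    # Keep only stacks supported by at least min_depth raw reads (like -m).
    stacks = sorted(seq for seq, d in depth.items() if d >= min_depth)

    # Step 2: merge stacks within max_mismatch of each other (like -M),
    # tracked with a small union-find structure.
    parent = {s: s for s in stacks}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for i, a in enumerate(stacks):
        for b in stacks[i + 1:]:
            if hamming(a, b) <= max_mismatch:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for s in stacks:
        groups[find(s)].append(s)

    # Step 3: blacklist loci that exceed the maximum stacks per locus.
    loci, blacklisted = [], []
    for members in groups.values():
        target = loci if len(members) <= max_locus_stacks else blacklisted
        target.append(sorted(members))
    return sorted(loci), sorted(blacklisted)
```

With toy reads, a monomorphic locus comes out as a single stack and a polymorphic locus as two stacks one mismatch apart, while a read seen only once never forms a stack. Lowering `max_locus_stacks` moves the two-stack locus to the blacklist, which mirrors how the cap excludes repetitive or error-laden loci from the catalogue.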
