SMRT Tools Reference Guide (v7.0.0) - PacBio

Introduction

This document describes the command-line tools included with SMRT Link v7.0.0. These tools are for use by bioinformaticians working with secondary analysis results.

The command-line tools are located in the $SMRT_ROOT/smrtlink/smrtcmds/bin subdirectory.

Installation

The command-line tools are installed as an integral component of the SMRT Link software. For installation details, see SMRT Link Software Installation (v7.0.0).

To install only the command-line tools, use the --smrttools-only option with the installation command, whether for a new installation or an upgrade. Examples:

    smrtlink-*.run --rootdir smrtlink --smrttools-only
    smrtlink-*.run --rootdir smrtlink --smrttools-only --upgrade

Pacific Biosciences Command-Line Tools

Following is information on the Pacific Biosciences-supplied command-line tools included in the installation. Third-party tools installed are described at the end of the document.

Tool
    Description

arrow
    The variantCaller tool with the consensus algorithm set to arrow. See “variantCaller” for details.

bam2fasta/bam2fastq
    Converts PacBio BAM files into gzipped FASTA and FASTQ files. See “bam2fasta/bam2fastq”.

bamsieve
    Generates a subset of a BAM or PacBio Data Set file based on either a whitelist of hole numbers, or a percentage of reads to be randomly selected. See “bamsieve”.

bax2bam
    Converts the legacy PacBio basecall format (bax.h5) into the BAM basecall format. See “bax2bam”.

blasr
    Aligns long reads against a reference sequence. See “blasr”.

ccs
    Calculates consensus sequences from multiple “passes” around a circularized single DNA molecule (SMRTbell template). See “ccs”.

dataset
    Creates, opens, manipulates and writes Data Set XML files. See “dataset”.

Demultiplex Barcodes
    Identifies barcode sequences in PacBio single-molecule sequencing data. See “Demultiplex Barcodes”.

fasta-to-reference
    Converts a FASTA file to a ReferenceSet Data Set XML. See “fasta-to-reference”.

ipdSummary
    Detects DNA base-modifications from kinetic signatures. See “ipdSummary”.

isoseq3
    Characterizes full-length transcripts and generates full-length transcript isoforms, eliminating the need for computational reconstruction. See “isoseq3”.

juliet
    A general-purpose minor variant caller that identifies and phases minor single nucleotide substitution variants in complex populations. See “juliet”.

laa
    Finds phased consensus sequences from a pooled set of amplicons sequenced with Pacific Biosciences’ SMRT technology. See “laa”.

motifMaker
    Identifies motifs associated with DNA modifications in prokaryotic genomes. See “motifMaker”.

pbalign
    Aligns PacBio reads to reference sequences; filters aligned reads according to user-specified filtering criteria; and converts the output to PacBio BAM, SAM, or PacBio DataSet format. See “pbalign”.

pbdagcon
    Implements DAGCon (Directed Acyclic Graph Consensus), a sequence consensus algorithm based on using directed acyclic graphs to encode multiple sequence alignments. See “pbdagcon”.

pbindex
    Creates an index file that enables random access to PacBio-specific data in BAM files. See “pbindex”.

pbmm2
    Aligns PacBio reads to reference sequences. A SMRT wrapper for minimap2, and the successor to blasr and pbalign. See “pbmm2”.

pbservice
    Performs a variety of useful tasks within SMRT Link. See “pbservice”.

pbsmrtpipe
    Secondary analysis workflow engine of PacBio’s SMRT Analysis software. See “pbsmrtpipe”.

pbsv
    Structural variant caller for PacBio reads. See “pbsv”.

pbtranscript
    Part of the Iso-Seq Analysis pipeline used for the Classify and Cluster/polish steps, as well as post-polish analysis. See “pbtranscript”.

pbvalidate
    Validates that files produced by PacBio software are compliant with Pacific Biosciences’ own internal specifications. See “pbvalidate”.

quiver
    The variantCaller tool with the consensus algorithm set to quiver. See “variantCaller” for details.

sawriter
    Generates a suffix array file from an input FASTA file. See “sawriter”.

summarizeModifications
    Generates a GFF summary file from the output of base modification analysis combined with the coverage summary GFF generated by resequencing pipelines. See “summarizeModifications”.

variantCaller
    Variant-calling tool which provides several variant-calling algorithms for PacBio sequencing data. See “variantCaller”.

arrow

This is the variantCaller tool with the consensus algorithm set to arrow. See “variantCaller” for details.

bam2fasta/bam2fastq

The bam2fastx tools convert PacBio BAM files into gzipped FASTA and FASTQ files, including demultiplexing of barcoded data.

Usage

Both tools have an identical interface and take BAM and/or Data Set files as input.

Examples

    bam2fasta -o projectName m54008_160330_053509.subreads.bam
    bam2fastq -o myEcoliRuns m54008_160330_053509.subreads.bam m54008_160331_235636.subreads.bam
    bam2fasta -o myHumanGenome m54012_160401_000001.subreadset.xml

Input Files
  - One or more *.bam files
  - *.subreadset.xml file (Data Set file)

Output Files
  - *.fasta.gz
  - *.fastq.gz

bamsieve

The bamsieve tool creates a subset of a BAM or PacBio Data Set file based on either a whitelist of hole numbers, or a percentage of reads to be randomly selected, while keeping all subreads within a read together. Although bamsieve is BAM-centric, it has some support for dataset XML and will propagate metadata, as well as scraps BAM files in the special case of SubreadSets. bamsieve is useful for generating minimal test Data Sets containing a handful of reads.

bamsieve operates in two modes: whitelist/blacklist mode, where the ZMWs to keep or discard are explicitly specified, or percentage/count mode, where a fraction of the ZMWs is randomly selected.

ZMWs may be whitelisted or blacklisted in one of several ways:
  - As a comma-separated list on the command line.
  - As a flat text file, one ZMW per line.
  - As another PacBio BAM or Data Set of any type.

Usage

    bamsieve [-h] [--version] [--log-file LOG_FILE]
             [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} | --debug | --quiet | -v]
             [--show-zmws] [--whitelist WHITELIST] [--blacklist BLACKLIST]
             [--percentage PERCENTAGE] [-n COUNT] [-s SEED]
             [--ignore-metadata] [--barcodes]
             input_bam [output_bam]
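The two bamsieve modes can be illustrated with a small Python sketch. This is not bamsieve's implementation: the function names (parse_zmws, pick_fraction, sieve) and the (zmw, read_name) record tuples are hypothetical stand-ins for BAM records, and the exact rounding bamsieve applies in percentage mode is an assumption. The sketch only shows how a whitelist given as a comma-separated string or as one-ZMW-per-line text can be parsed, and how filtering by ZMW keeps all subreads of a read together.

```python
import random


def parse_zmws(text):
    """Parse hole numbers from a comma-separated string or one-per-line text."""
    tokens = text.replace(",", "\n").split()
    return {int(tok) for tok in tokens}


def pick_fraction(zmws, percentage, seed=None):
    """Randomly pick about `percentage` percent of the distinct ZMWs.

    Assumption: at least one ZMW is kept when the input is non-empty.
    """
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    ordered = sorted(set(zmws))
    n_keep = min(len(ordered), max(1, round(len(ordered) * percentage / 100.0)))
    return set(rng.sample(ordered, n_keep))


def sieve(records, whitelist=None, blacklist=None):
    """Filter (zmw, read_name) records by ZMW so subreads stay together."""
    kept = []
    for zmw, name in records:
        if whitelist is not None and zmw not in whitelist:
            continue
        if blacklist is not None and zmw in blacklist:
            continue
        kept.append((zmw, name))
    return kept


# Hypothetical records: two subreads from ZMW 111111, one each from the others.
records = [
    (111111, "movie/111111/0_500"),
    (111111, "movie/111111/550_1000"),
    (222222, "movie/222222/0_800"),
    (333333, "movie/333333/0_900"),
    (444444, "movie/444444/0_700"),
]

# Whitelist mode: keep two explicitly named ZMWs.
subset = sieve(records, whitelist=parse_zmws("111111,222222"))

# Percentage mode, then the complementary half via a blacklist.
first_half = pick_fraction([zmw for zmw, _ in records], 50, seed=98765)
second_half = sieve(records, blacklist=first_half)
```

Selecting on the ZMW rather than on individual records is what keeps every subread of a chosen read in the output; the blacklist pass yields the complementary set of ZMWs.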

Required

input_bam
    The name of the input BAM file or Data Set from which reads will be read.

output_bam
    The name of the output BAM file or Data Set where filtered reads will be written. (Default None)

Options

-h, --help
    Displays help information and exits.

--version
    Displays program version number and exits.

--log-file LOG_FILE
    Writes the log to file. (Default None, writes to stdout.)

--log-level
    Specifies the log level; values are [DEBUG, INFO, WARNING, ERROR, CRITICAL]. (Default WARNING)

--debug
    Alias for setting the log level to DEBUG. (Default False)

--quiet
    Alias for setting the log level to CRITICAL to suppress output. (Default False)

-v, --verbose
    Sets the verbosity level. (Default NONE)

--show-zmws
    Prints a list of ZMWs and exits. (Default False)

--whitelist WHITELIST
    Specifies the ZMWs to include in the output. This can be a comma-separated list of ZMWs, or a file containing a list of ZMWs (one hole number per line), or a BAM/Data Set file. (Default NONE)

--blacklist BLACKLIST
    Specifies the ZMWs to exclude from the output. This can be a comma-separated list of ZMWs, or a file containing a list of ZMWs (one hole number per line), or a BAM/Data Set file that specifies ZMWs. (Default NONE)

--percentage PERCENTAGE
    Specifies a percentage of a SMRT Cell to recover (Range 1-100) rather than a specific list of reads. (Default NONE)

-n COUNT, --count COUNT
    Specifies a specific number of ZMWs picked at random to recover. (Default NONE)

-s SEED, --seed SEED
    Specifies a random seed for selecting a percentage of reads. (Default NONE)

--ignore-metadata
    Discards the input Data Set metadata. (Default False)

--barcodes
    Specifies that the whitelist or blacklist contains barcode indices instead of ZMW numbers. (Default False)

Examples

Pulling out two ZMWs from a BAM file:
    bamsieve --whitelist 111111,222222 full.subreads.bam sample.subreads.bam

Pulling out two ZMWs from a Data Set file:
    bamsieve --whitelist 111111,222222 full.subreadset.xml sample.subreadset.xml

Using a text whitelist:
    bamsieve --whitelist zmws.txt full.subreads.bam sample.subreads.bam

Using another BAM or Data Set as a whitelist:
    bamsieve --whitelist mapped.alignmentset.xml full.subreads.bam mappable.subreads.bam

Generating a whitelist from a Data Set:
    bamsieve --show-zmws mapped.alignmentset.xml > mapped_zmws.txt

Anonymizing a Data Set:
    bamsieve --whitelist zmws.txt --ignore-metadata --anonymize full.subreadset.xml anonymous_sample.subreadset.xml

Removing a read:
    bamsieve --blacklist 111111 full.subreadset.xml filtered.subreadset.xml

Selecting 0.1% of reads:
    bamsieve --percentage 0.1 full.subreads.bam random_sample.subreads.bam

Selecting a different 0.1% of reads:
    bamsieve --percentage 0.1 --seed 98765 full.subreads.bam random_sample.subreads.bam

Selecting just two ZMWs/reads at random:
    bamsieve --count 2 full.subreads.bam two_reads.subreads.bam

Selecting by barcode:
    bamsieve --barcodes --whitelist 4,7 full.subreads.bam two_barcodes.subreads.bam

Generating a tiny BAM file that contains only mappable reads:
    bamsieve --whitelist mapped.subreads.bam full.subreads.bam mappable.subreads.bam
    bamsieve --count 4 mappable.subreads.bam tiny.subreads.bam

Splitting a Data Set into two halves:
    bamsieve --percentage 50 full.subreadset.xml split.1of2.subreadset.xml
    bamsieve --blacklist split.1of2.subreadset.xml [...]

[...]ting Unmapped Reads:
    bamsieve --blacklist mapped.alignmentset.xml [...]

bax2bam

The bax2bam tool converts the legacy PacBio basecall format (bax.h5) into the BAM basecall format.

Usage

    bax2bam [options] <input files...>

Options

-h, --help
    Displays help information and exits.

--version
    Displays program version number and exits.

Pulse feature options

These options configure pulse features in the output BAM. Supported features include:

    Pulse Feature        BAM [...]
    [...]stitutionTag    st    N

If the Pulse Feature option is used, then only those features listed are included, regardless of the default state.

--pulsefeatures STRING
    Comma-separated list of desired pulse features, using the names in the table above.

--losslessframes
    Store full, 16-bit IPD/PulseWidth data, instead of the default downsampled, 8-bit encoding.

Input Files
  - movie.1.bax.h5, movie.2.bax.h5 ... (Note: Input files should be from the same movie.)
  - --xml STRING (Data Set XML file containing a list of movie names.)
  - -f STRING, --fofn STRING (File-of-file-names containing a list of input files.)

Output Files
  - -o STRING (Prefix of output file names. The movie name is used if no prefix is provided.)
  - --output-xml STRING (Explicit output XML name. If not provided, bax2bam will use the -o prefix (<prefix>.dataset.xml). If that is not specified either, the output XML file name is <moviename>.dataset.xml.)
  - Output read types (Note: These types are mutually exclusive.):
      --subread: Output subreads (Default)
      --hqregion: Output HQ regions
      --polymeraseread: Output full polymerase read
      --ccs: Output CCS sequences
  - Output BAM file type:
      --internal: Output BAMs in internal mode. Currently this indicates that non-sequencing ZMWs should be included in the output scraps BAM file, if applicable.

Example

Assuming your original file is named mydata.bas.h5, you can produce a file mynewbam.subreads.bam using the following command:

    bax2bam -o mynewbam mydata.1.bax.h5 mydata.2.bax.h5 mydata.3.bax.h5

blasr

The blasr tool aligns long reads against a reference sequence, possibly a multi-contig reference.

blasr maps reads to genomes by finding the highest scoring local alignment or set of local alignments between the read and the genome. The initial set of candidate alignments is found by querying a rapidly searched precomputed index of the reference genome, and then refining until only high-scoring alignments are kept. The base assignment in alignments is optimized and scored using all available quality information, such as insertion and deletion quality values.

Because alignment approximates an exhaustive search, alignment significance is computed by comparing the optimal alignment score to the distribution of all other significant alignment scores.

Usage

    blasr {subreads|ccs}.bam genome.fasta --bam --out aligned.bam [--options]
    blasr {subreadset|consensusreadset}.xml genome.fasta --bam --out aligned.bam [--options]
    blasr reads.fasta genome.fasta [--options]

Input Files
  - {subreads|ccs}.bam is in PacBio BAM format, which is the native Sequel/Sequel II System output format of SMRT reads. PacBio BAM files carry rich quality information (such as insertion, deletion, and substitution quality values) needed for mapping, consensus calling and variant detection. For the PacBio BAM format specifications, see BAM.html.
  - {subreadset|consensusreadset}.xml is in PacBio Data Set format. For the PacBio Data Set format specifications, see DataSet.html.
  - reads.fasta: A multi-FASTA file of reads. While any FASTA file is valid input, BAM or Data Set files are preferable as they contain richer quality value information.
  - genome.fasta: A FASTA file to which reads should map, usually containing reference sequences.

Output Files
  - aligned.bam: The pairwise alignments for each read, in PacBio BAM format.

Input Options

--sa suffixArrayFile
    Uses the suffix array sa for detecting matches between the reads and the reference. (The suffix array is prepared by the sawriter program.)

--ctab tab
    Specifies a table of tuple counts used to estimate match significance, created by printTupleCountTable. While it is quick to generate on the fly, if there are many invocations of blasr, it is useful to precompute the ctab.

--regionTable table
    Specifies a read-region table in HDF format for masking portions of reads. This may be a single table if there is just one input file, or a fofn (file-of-file names). When a region table is specified, any region table inside the reads.plx.h5 or reads.bax.h5 files is ignored. Note: This option works only with PacBio RS II HDF5 files.

--noSplitSubreads
    Does not split subreads at adapters. This is typically only useful when the genome is an unrolled version of a known template, and contains template-adapter-reverse-template sequences. (Default False)

Options for Aligning Output

--bestn n
    Provides the top n alignments for the hit policy to select from. (Default 10)

--sam
    Writes output in SAM format.

--bam
    Writes output in PacBio BAM format.

--clipping
    Uses no/hard/soft clipping for SAM output. (Default none)

--out file
    Writes output to file. (Default terminal)

--unaligned file
    Outputs reads that are not aligned to file.

--m t
    If not printing SAM, modifies the output of the alignment. Values of t:
      0: Print blast-like output with |'s connecting matched nucleotides.
      1: Print only a summary: Score and position.
      2: Print in Compare.xml format.
      3: Print in vulgar format (Deprecated).
      4: Print a longer tabular version of the alignment.
      5: Print in a machine-parsable format that is read by compareSequences.py.

--noSortRefinedAlignments
    Once candidate alignments are generated and scored via sparse dynamic programming, they are rescored using local alignment that accounts for different error profiles. Resorting based on the local alignment may change the order in which the hits are returned. (Default False)

--allowAdjacentIndels
    Allows adjacent insertions or deletions. Otherwise, adjacent insertions and deletions are merged into one operation. Using quality values to guide pairwise alignments may dictate that the higher probability alignment contains adjacent insertions or deletions. Tools such as GATK do not permit this and so they are not reported by default.

--header
    Prints a header as the first line of the output file describing the contents of each column.

--titleTable tab
    Builds a table of reference sequence titles. The reference sequences are enumerated by row, 0,1,... The reference index is printed in alignment results rather than the full reference name. This makes output concise, particularly when very verbose titles exist in reference names. (Default NULL)

--minPctSimilarity p
    Reports alignments only if they are greater than p percent identity. (Default 0)

--holeNumbers LIST
    Aligns reads whose ZMW hole numbers are in LIST only. LIST is a comma-delimited string of ranges, such as 1,2,3,10-13. This option only works when reads are in base or pulse h5 format.

--hitPolicy policy
    Specifies how blasr treats multiple hits:
      all: Reports all alignments.
      allbest: Reports all equally top-scoring alignments.
      random: Reports a single random alignment.
      randombest: Reports a single random alignment from multiple equally top-scoring alignments.
      leftmost: Reports an alignment which has the best alignment score and has the smallest mapping coordinates in any reference.

Options for Anchoring Alignment Regions

These options will have the greatest effects on speed and sensitivity.

--minMatch m
    Specifies the minimum seed length. A higher value will speed up alignment, but decrease sensitivity. (Default 12)

--maxMatch m
--maxLCPLength m
    Stops mapping a read to the genome when the LCP length reaches m. This is useful when the query is part of the reference, for example when constructing pairwise alignments for de novo assembly. (Both options work the same.)

--maxAnchorsPerPosition m
    Does not add anchors from a position if it matches to more than m locations in the target.

--advanceExactMatches E
    Speeds up alignments with match - E fewer anchors. Rather than finding anchors between the read and the genome at every position in the read, when an anchor is found at position i in a read of length L, the next position in a read to find an anchor is at i+L-E. Use this when aligning already assembled contigs. (Default 0)

--nCandidates n
    Keeps up to n candidates for the best alignment. A large value will slow mapping as the slower dynamic programming steps are applied to more clusters of anchors; this can be a rate-limiting step when reads are very long. (Default 10)

--concordant
    Maps all subreads of a ZMW (hole) to where the longest full pass subread of the ZMW aligned to. This requires using the region table and hq regions. This option only works when reads are in base or pulse h5 format. (Default False)

--placeGapConsistently
    Produces alignments with gaps placed consistently for better variant calling. See “Gaps When Aligning” for details.
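To make the effect of the anchoring options more concrete, the following Python sketch shows a toy version of exact-match seeding. It is an illustration only and not how blasr works internally: blasr searches a suffix array (see --sa), whereas this sketch uses a plain dictionary of fixed-length k-mers, and the function names index_kmers and find_anchors are hypothetical. The parameter k plays the role of the minimum seed length (--minMatch), and max_anchors_per_position mimics --maxAnchorsPerPosition by skipping seeds that match too many target locations.

```python
from collections import defaultdict


def index_kmers(reference, k):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_anchors(read, index, k, max_anchors_per_position=None):
    """Return (read_pos, ref_pos) exact k-mer matches between read and reference.

    Seeds that occur at more than `max_anchors_per_position` reference
    locations are skipped entirely, as repetitive seeds add many anchors
    while carrying little positional information.
    """
    anchors = []
    for read_pos in range(len(read) - k + 1):
        hits = index.get(read[read_pos:read_pos + k], [])
        if max_anchors_per_position is not None and len(hits) > max_anchors_per_position:
            continue
        anchors.extend((read_pos, ref_pos) for ref_pos in hits)
    return anchors


reference = "ACGTACGTTTGACCA"
index = index_kmers(reference, k=4)

unique_anchors = find_anchors("TTGACC", index, k=4)
repeat_anchors = find_anchors("ACGT", index, k=4)
filtered_anchors = find_anchors("ACGT", index, k=4, max_anchors_per_position=1)
```

In this toy setting a larger k produces fewer, more specific seeds, which mirrors the stated trade-off for --minMatch: faster alignment at the cost of sensitivity.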

Options for Refining

[...]
    Refines concordant alignments. This slightly increases alignment accuracy at the cost of time. This option is omitted if --concordant is not set to True. (Default False)

--sdpTupleSize K
    Uses matches of length K to speed dynamic programming alignments. This option controls accuracy of assigning gaps in pairwise alignments once a mapping has been found, rather than mapping sensitivity itself. (Default 11)

--scoreMatrix "score matrix string"
    Specifies an alternative score matrix for scoring FASTA reads. The matrix is in the format:

            A  C  G  T  N
        A   a  b  c  d  e
        C   f  g  h  i  j
        G   k  l  m  n  o
        T   p  q  r  s  t
        N   u  v  w  x  y

    The values a...y should be input as a quoted, space-separated string: "a b c ... y". Lower scores are better, so matches should be less than mismatches; such as a,g,m,s = -5 (match), mismatch = 6.

--affineOpen value
    Sets the penalty for opening an affine alignment. (Default 10)

--affineExtend a
    Changes affine (extension) gap penalty. Lower value allows more gaps. (Default 0)

Options for Overlap/Dynamic Programming Alignments and Pairwise Overlap f[...]
