Practical Guide To Interpreting RNA-seq Data


Skyler Kuhn (1,2), Mayank Tandon (1,2)
1. CCR Collaborative Bioinformatics Resource (CCBR), Center for Cancer Research, NCI
2. Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research

Overview
I. Experimental Design
- Hypothesis-driven
- Overview of Best Practice
II. Quality-control
- Pre- and post-alignment QC metrics
- Interpretation
III. Pipeline
- FastQ files to counts matrix
- Reproducibility
IV. Downstream Analysis
- Principal Components Analysis (PCA)
- Differential Expression
- Pathway Analysis
V. Advanced Visualizations
- Group comparisons
- Alternative Splicing Events
- Pathway Diagrams

I. Experimental Design

I. Experimental Design: Overview
- Hypothesis-driven
  - Addresses a well thought-out, quantifiable question
- Considerations:
  - Library Construction: mRNA versus total RNA
  - Single-end versus Paired-end Sequencing
  - Sequencing Depth: quantifying gene-level or transcript-level expression
  - Number of Replicates: statistical power and the ability to drop a bad sample
  - Reducing Batch Effects

I. Experimental Design: Library Construction
Total RNA contains high levels of ribosomal RNA (rRNA): 80%
- mRNA
  - poly(A) selection
  - Standard profiling for gene expression
  - Low RIN may result in 3' bias
- Total RNA
  - rRNA depletion
  - mRNA and non-coding RNA species (lncRNA)
  - Prokaryotic samples

I. Experimental Design: Sequencing Depth
- mRNA: poly(A)-selection
  - Recommended Sequencing Depth: 10-20M paired-end reads (or 20-40M reads)
  - RNA must be high quality (RIN 8)
- Total RNA: rRNA depletion
  - Recommended Sequencing Depth: 25-60M paired-end reads (or 50-120M reads)
  - RNA must be high quality (RIN 8)
* Differential isoform regulation or alternative splicing events: 100M paired-end reads
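The two ways of quoting depth on this slide differ by a factor of two, because each paired-end fragment produces two reads. A minimal sketch of that conversion follows; the dictionary and function names are illustrative assumptions, and treating the lower bound of each range as a minimum is also an assumption, not something the slide states.

```python
# Recommended depth ranges in millions of read PAIRS, copied from the slide.
RECOMMENDED_PAIRS_M = {
    "mRNA": (10, 20),
    "total RNA": (25, 60),
}


def pairs_to_reads(pairs_m: float) -> float:
    """Each paired-end fragment yields two reads (R1 and R2)."""
    return 2 * pairs_m


def meets_minimum(library: str, sequenced_pairs_m: float) -> bool:
    """Assumption: the lower end of the recommended range is the minimum."""
    low, _high = RECOMMENDED_PAIRS_M[library]
    return sequenced_pairs_m >= low


for library, (low, high) in RECOMMENDED_PAIRS_M.items():
    print(
        f"{library}: {low}-{high}M pairs = "
        f"{pairs_to_reads(low):g}-{pairs_to_reads(high):g}M reads"
    )
```

The printed ranges reproduce the "(or 20-40M reads)" and "(or 50-120M reads)" figures quoted on the slide.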

I. Experimental Design: Number of Replicates
- Recommended
  - Biological Replicates Technical Replicates
  - Number of Replicates: 4
  - Peace of mind: ability to drop a bad sample without compromising statistical power
- Bare Minimum
  - Biological Replicates Technical Replicates
  - Number of Replicates: 3

I. Experimental Design: Reducing Batch Effects
- Unwanted sources of technical variation
  - Different Lab Technicians
  - Different processing times
  - Different Reagent Lots
- Decrease batch effects by uniform processing
  - Protocol-driven
- Sequencing
  - Lane effect

Sample Name     Group   Batch   Batch*
Treatment r1    KO      1       1
Treatment r2    KO      2       1
Treatment r3    KO      1       1
Treatment r4    KO      2       1
Cntrl r1        WT      1       2
Cntrl r2        WT      2       2
Cntrl r3        WT      1       2
Cntrl r4        WT      2       2

* Confounded Groups and Batches!
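The difference between the Batch and Batch* columns can be checked mechanically: a design is fully confounded when every batch contains samples from only one group, so batch and group cannot be separated. The sketch below uses the sample sheet from the slide; the function name and the definition of "confounded" as complete overlap are assumptions made for illustration.

```python
from collections import defaultdict

# (sample name, group, batch, batch*) as laid out on the slide.
SAMPLES = [
    ("Treatment r1", "KO", 1, 1),
    ("Treatment r2", "KO", 2, 1),
    ("Treatment r3", "KO", 1, 1),
    ("Treatment r4", "KO", 2, 1),
    ("Cntrl r1", "WT", 1, 2),
    ("Cntrl r2", "WT", 2, 2),
    ("Cntrl r3", "WT", 1, 2),
    ("Cntrl r4", "WT", 2, 2),
]


def is_confounded(groups, batches):
    """True when there are several groups but each batch holds only one."""
    groups_per_batch = defaultdict(set)
    for group, batch in zip(groups, batches):
        groups_per_batch[batch].add(group)
    several_groups = len(set(groups)) > 1
    return several_groups and all(len(g) == 1 for g in groups_per_batch.values())


groups = [row[1] for row in SAMPLES]
balanced = [row[2] for row in SAMPLES]      # "Batch" column
confounded = [row[3] for row in SAMPLES]    # "Batch*" column

print("Batch  confounded:", is_confounded(groups, balanced))
print("Batch* confounded:", is_confounded(groups, confounded))
```

Running a check like this on the sample sheet before library prep is cheap; once groups and batches are confounded, no downstream correction can separate them.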

II. Quality Control

II. Quality-control: Overview
No need to reinvent the wheel, but there are a lot of wheels!
- Pre-alignment Quality-control
  - Sequencing Quality
  - Contamination Screening
- Post-alignment Quality-control
  - Alignment Quality
- Aggregation and Interpretation
  - MultiQC Report
  - QC metric guidelines

II. Quality-control: Pre-alignment
FastQC raw -> Adapter Trimming -> FastQC trimmed -> Contamination Screening
- Sequencing Quality
  - FastQC: run twice, on raw and trimmed data
- Contamination Screening
  - FastQ Screen
  - Kraken
  - BioBloom

II. Quality-control: Pre-alignment
FastQC (raw) -> Adapter Trimming -> FastQC (trimmed)
- FastQC
  - Identify potential problems that can arise during sequencing or library prep
  - Run on raw reads (pre-adapter removal) and trimmed reads (post-adapter removal)
  - Summarizes:
    - Per base and per sequence quality scores
    - Per sequence GC content
    - Per sequence adapter content
    - Per sequence read lengths
    - Overrepresented sequences
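To make the quality-score summary concrete, the sketch below computes the fraction of bases at or above Q30 directly from FASTQ records. It is not how FastQC is implemented; it assumes four-line records and the Phred+33 encoding, and the function names and toy reads are made up for illustration.

```python
import io

# Toy FASTQ text (hypothetical reads). 'I' encodes Q40 and '#' encodes Q2
# under the Phred+33 convention assumed here.
FASTQ_TEXT = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\nII##\n"


def read_fastq(handle):
    """Yield (sequence, quality string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        yield sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores."""
    return [ord(char) - offset for char in quality]


def q30_fraction(records):
    """Fraction of all bases with Phred score >= 30."""
    total = passing = 0
    for _sequence, quality in records:
        scores = phred_scores(quality)
        total += len(scores)
        passing += sum(score >= 30 for score in scores)
    return passing / total if total else 0.0


print(f"Q30 fraction: {q30_fraction(read_fastq(io.StringIO(FASTQ_TEXT))):.2f}")
```

Six of the eight toy bases are Q40 and two are Q2, so the fraction printed is 0.75.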

II. Quality-control: FastQC
[Figure: FastQC]

II. Quality-control: Pre-alignment
Adapter Trimming -> Contamination Screen -> Alignment
- FastQ Screen
  - Aligns to Human, Mouse, Fungi, Bacteria, Viral references
  - Easy to interpret and an important QC step
- Kraken
  - Taxonomic composition of microbial contamination
    - Archaea
    - Bacteria
    - Plasmid
    - Viral

FastQ Screen: Contamination Screening
[Figure: FastQ Screen]

Kraken Krona: Microbial Taxonomic Composition
[Figure: Kraken Krona]

II. Quality-control: Post-alignment
Alignment -> Alignment Quality -> Quantify Counts
- Preseq
  - Estimates library complexity
- Picard RNAseqMetrics
  - Number of reads that align to coding, intronic, UTR, intergenic, ribosomal regions
  - Normalize gene coverage across a meta-gene body
  - Identify 5' or 3' bias
- RSeQC
  - Suite of tools to assess various aspects of post-alignment quality
  - Calculate distribution of Insert Size
  - Junction Annotation (% Known, % Novel reads spanning splice junctions)
  - BAM to BigWig (Visual Inspection with IGV)

CollectRnaseqMetrics: Alignment Summary
[Figure: CollectRnaseqMetrics alignment summary]

Picard CollectRnaseqMetrics: Normalized Gene Coverage, 3' Bias
[Figure: normalized gene coverage showing 3' bias]

II. Quality-control: Aggregation
- MultiQC
  - HTML report that aggregates information across all samples
    - Plots, filtering, and highlighting
  - Highly customizable with great documentation
    - Add text and embed custom figures
    - Create your own module to extend missing functionality
  - Supports over 73 commonly-used open source bioinformatics tools

QC Metric Guidelines

Metric                             mRNA                          total RNA
RNA Type(s)                        Coding                        Coding, non-coding
RIN                                8 [low RIN: 3' bias]          8
Single-end vs Paired-end           Paired-end                    Paired-end
Recommended Sequencing Depth       10-20M PE reads               25-60M PE reads
FastQC                             Q30 70%                       Q30 70%
Percent Aligned to Reference       70%                           65%
Million Reads Aligned Reference    7M PE reads (or 14M reads)    16.5M PE reads (or 33M reads)
Percent Aligned to rRNA            5%                            15%
Picard RNAseqMetrics               Coding 50%                    Coding 35%
Picard RNAseqMetrics               Intronic Intergenic 25%       Intronic Intergenic 40%
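Guidelines like these are easy to apply as an automated check on each sample. The sketch below encodes the percentage thresholds from the QC metric guidelines. The transcribed values carry no comparison symbols, so the direction of each check ("min" or "max") is an assumption: alignment, Q30, and coding percentages are treated as minimums, and rRNA and intronic/intergenic percentages as maximums. The metric key names are hypothetical.

```python
# Threshold values copied from the guidelines. ASSUMPTION: "min"/"max"
# directions are inferred, since the source text has no comparison symbols.
GUIDELINES = {
    "mRNA": {
        "pct_q30": ("min", 70),
        "pct_aligned": ("min", 70),
        "pct_rrna": ("max", 5),
        "pct_coding": ("min", 50),
        "pct_intronic_intergenic": ("max", 25),
    },
    "total RNA": {
        "pct_q30": ("min", 70),
        "pct_aligned": ("min", 65),
        "pct_rrna": ("max", 15),
        "pct_coding": ("min", 35),
        "pct_intronic_intergenic": ("max", 40),
    },
}


def flag_sample(library, metrics):
    """Return the names of reported metrics that miss the guideline."""
    flagged = []
    for name, (kind, limit) in GUIDELINES[library].items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported for this sample
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            flagged.append(name)
    return flagged


# Hypothetical sample: good base quality, but low alignment and high rRNA.
sample = {"pct_q30": 85, "pct_aligned": 60, "pct_rrna": 8}
print(flag_sample("mRNA", sample))
```

A flagged metric is a prompt to look at the sample in the MultiQC report, not an automatic reason to discard it.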

III. Pipeline

III. Processing Pipeline: Conceptual Diagram
Raw data (FastQ files) -> Adapter Trimming -> Alignment -> Quantification -> Differential Expression
- Adapter Trimming: Adapters are composed of synthetic sequences and should be removed prior to alignment
- Alignment: Adding biological context to your data; find where reads align to the reference genome
- Quantification: Counting the number of reads that align to a particular feature of interest (genes, isoforms, etc.)
- Differential Expression: Summarizing differences between two groups or conditions (KO vs. WT)

III. Processing Pipeline: Practical Example
FastQ files to raw counts matrix: Cutadapt -> STAR -> RSEM
- FastQC: Pre- and post-trimming
- Cutadapt: Remove adapters
- FastQ Screen: Run twice on different sets of references
- STAR: Splice-aware aligner
- RSEM: Generates gene and isoform counts
- MultiQC: Aggregates everything into an HTML report
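As a rough illustration of how these tools chain together for one paired-end sample, the sketch below only builds and prints the command lines; it does not run anything. The adapter placeholders, index and reference paths, thread count, and output directories are all assumptions, the FastQ Screen step is left out, and the exact options should be checked against each tool's documentation for the installed version.

```python
import shlex

# Placeholders (assumptions, not from the slides).
ADAPTER_R1 = "ADAPTER_SEQUENCE_R1"
ADAPTER_R2 = "ADAPTER_SEQUENCE_R2"
STAR_INDEX = "star_index"
RSEM_REF = "rsem_ref/genome"
THREADS = "8"


def build_commands(sample, r1, r2):
    """Return the per-sample command lines, in execution order."""
    trimmed_r1 = f"trim/{sample}_R1.fastq.gz"
    trimmed_r2 = f"trim/{sample}_R2.fastq.gz"
    star_prefix = f"star/{sample}."
    return [
        # FastQC on raw reads
        ["fastqc", "-o", "qc/raw", r1, r2],
        # Cutadapt removes adapters from both mates
        ["cutadapt", "-a", ADAPTER_R1, "-A", ADAPTER_R2,
         "-o", trimmed_r1, "-p", trimmed_r2, r1, r2],
        # FastQC again on trimmed reads
        ["fastqc", "-o", "qc/trimmed", trimmed_r1, trimmed_r2],
        # STAR: splice-aware alignment, also emitting a transcriptome BAM
        ["STAR", "--runThreadN", THREADS, "--genomeDir", STAR_INDEX,
         "--readFilesIn", trimmed_r1, trimmed_r2,
         "--readFilesCommand", "zcat",
         "--outSAMtype", "BAM", "Unsorted",
         "--quantMode", "TranscriptomeSAM",
         "--outFileNamePrefix", star_prefix],
        # RSEM: gene and isoform counts from the transcriptome BAM
        ["rsem-calculate-expression", "--paired-end", "--alignments",
         "-p", THREADS,
         star_prefix + "Aligned.toTranscriptome.out.bam",
         RSEM_REF, f"rsem/{sample}"],
        # MultiQC aggregates all reports found under the working directory
        ["multiqc", ".", "-o", "qc/multiqc"],
    ]


for command in build_commands("sampleA", "sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz"):
    print(shlex.join(command))
```

In a real project these steps would normally live in a workflow manager rather than a hand-run script, so that each sample is processed identically.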

III. Processing Pipeline: Reproducibility
- Workflow management systems
  - Snakemake, Nextflow
- Package management
  - No active management: rat's nest of interdependencies, prone to break
  - Python: virtual environments
  - Conda: Python, R, Scala, Java, C/C++, FORTRAN
  - Docker or Singularity: Portability and high reproducibility

IV. Downstream Analysis

IV. Downstream Analysis
- Step 1: Think
- Step 2: Analyze
- Step 3: QC?
- Step 4: Nobel Prize!
Raw data (FastQ files) -> Adapter Trimming -> Alignment -> Quantification -> Differential Expression
Answer Biological Questions

IV. Downstream Analysis
- Principal Components Analysis (PCA)
  - Data summarization, visualization, and QC tool
- Differential Expression
  - Find genes that are different between groups of interest
- Pathway Enrichment
  - Analyze for broader biological patterns

IV. Downstream Analysis: PCA
- Principal Components Analysis (PCA)
  - Dimensionality reduction technique
  - Captures patterns of variance into singular values
  - Visualizes global transcriptomic patterns
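A small worked example may help show what PCA does to a counts matrix. The sketch below log-transforms a toy matrix, centers each gene, and finds the first principal component by power iteration on the sample-by-sample matrix X Xᵀ, whose leading eigenvalue is the square of the leading singular value. This is a standard-library illustration on made-up counts; in practice a numerical library would be used, and all names here are assumptions.

```python
import math

# Toy counts (hypothetical): rows are samples, columns are three genes.
SAMPLES = ["KO_1", "KO_2", "WT_1", "WT_2"]
COUNTS = [
    [100, 10, 50],
    [110, 12, 48],
    [10, 100, 52],
    [12, 90, 50],
]


def log_center(counts):
    """log2(count + 1), then subtract each gene's mean across samples."""
    logged = [[math.log2(c + 1) for c in row] for row in counts]
    n = len(logged)
    means = [sum(row[j] for row in logged) / n for j in range(len(logged[0]))]
    return [[v - m for v, m in zip(row, means)] for row in logged]


def first_pc(matrix, iterations=200):
    """Return (PC1 sample scores, fraction of variance explained)."""
    n = len(matrix)
    gram = [[sum(a * b for a, b in zip(matrix[i], matrix[k])) for k in range(n)]
            for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        new = [sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0:
            return [0.0] * n, 0.0  # no variance in the data
        vec = [v / norm for v in new]
    eigenvalue = sum(vec[i] * sum(gram[i][k] * vec[k] for k in range(n))
                     for i in range(n))
    eigenvalue = max(eigenvalue, 0.0)
    total = sum(gram[i][i] for i in range(n))
    scores = [v * math.sqrt(eigenvalue) for v in vec]  # singular value * vector
    return scores, eigenvalue / total


scores, explained = first_pc(log_center(COUNTS))
for name, score in zip(SAMPLES, scores):
    print(f"{name}: PC1 = {score:+.2f}")
print(f"PC1 explains {explained:.1%} of the variance")
```

In this toy data the KO and WT samples land on opposite sides of PC1, which is the pattern one hopes to see when the biological grouping dominates the variance. The sign of a principal component is arbitrary, so only the separation matters.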

IV. Downstream Analysis: PCA
PCA can help drive biological insights.

IV. Downstream Analysis: PCA
... or be used as a QC tool

IV. Downstream Analysis: Differential Expression
- Goal: Identify genes or transcripts that vary due to biological effects
- Question: Can't I just use a t-test to do that?
- Answer: Sure. But the data are noisy, so a plain t-test is a bad idea. Instead, we apply normalization and/or employ specialized statistical tests.
Law, C. W., et al. (2014). "voom: Precision weights unlock linear model analysis tools for RNA-seq read counts." Genome Biol 15(2): R29.
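One reason raw counts cannot be compared directly is that samples are sequenced to different depths. The sketch below uses counts per million (CPM) as a deliberately simple stand-in for normalization; it is not what voom, DESeq2, or edgeR do internally, and the gene names, counts, and the prior added before taking logs are assumptions for illustration.

```python
import math


def cpm(counts_by_sample):
    """Counts per million: scale each sample by its own library size."""
    normalized = {}
    for sample, counts in counts_by_sample.items():
        library_size = sum(counts.values())
        normalized[sample] = {
            gene: count / library_size * 1e6 for gene, count in counts.items()
        }
    return normalized


def log2_fold_change(values, gene, group_a, group_b, prior=1.0):
    """log2 ratio of group means; the prior avoids taking log of zero."""
    mean_a = sum(values[s][gene] for s in group_a) / len(group_a)
    mean_b = sum(values[s][gene] for s in group_b) / len(group_b)
    return math.log2((mean_a + prior) / (mean_b + prior))


# Hypothetical data: KO_1 was sequenced twice as deeply as WT_1, but the
# relative composition of the two libraries is identical.
raw = {
    "KO_1": {"geneA": 200, "geneB": 1800},
    "WT_1": {"geneA": 100, "geneB": 900},
}

print("raw log2FC:", round(log2_fold_change(raw, "geneA", ["KO_1"], ["WT_1"]), 2))
print("CPM log2FC:", round(log2_fold_change(cpm(raw), "geneA", ["KO_1"], ["WT_1"]), 2))
```

On raw counts geneA looks roughly two-fold higher in the KO sample purely because of depth; after scaling by library size the apparent difference disappears. Dedicated packages go further by also modelling count dispersion, which this sketch does not attempt.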

IV. Downstream Analysis: Differential Expression
Seyednasrollah, F., et al. (2015). "Comparison of software packages for detecting differential expression in RNA-seq studies." Brief Bioinform 16(1): 59-70.

IV. Downstream Analysis: Differential Expression
Practical Rules of Thumb
- Limma, DESeq2, and EdgeR will perform very similarly in most cases
  - Consensus or intersection of the three is sometimes used
- Limma works better with larger cohorts (7 or more samples per group)
- DESeq2 works better with small cohorts (3 or fewer per group)
  - May also be more sensitive for low-depth data
- EdgeR provides convenience functions for converting to various normalized values
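The "consensus or intersection" idea amounts to simple set operations once each tool has produced its list of significant genes. A minimal sketch follows, assuming each tool's calls are already available as a set of gene IDs; the gene names and the `min_tools` parameter are hypothetical.

```python
def consensus(calls, min_tools=None):
    """Genes called significant by at least `min_tools` tools.

    `calls` maps tool name -> set of significant gene IDs. With the default
    (min_tools=None) every tool must agree, i.e. a strict intersection.
    """
    if min_tools is None:
        min_tools = len(calls)
    tally = {}
    for genes in calls.values():
        for gene in genes:
            tally[gene] = tally.get(gene, 0) + 1
    return {gene for gene, n in tally.items() if n >= min_tools}


# Hypothetical significant-gene lists from the three packages.
calls = {
    "limma": {"GENE1", "GENE2", "GENE3"},
    "DESeq2": {"GENE1", "GENE2", "GENE4"},
    "edgeR": {"GENE1", "GENE3", "GENE4"},
}

print("All three agree:", sorted(consensus(calls)))
print("At least two agree:", sorted(consensus(calls, min_tools=2)))
```

The strict intersection is conservative and shrinks quickly; a two-of-three vote keeps more genes at the cost of relying on partial agreement.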

IV. Downstream Analysis: Differential Expression
Output
[Figure: differential expression output]

IV. Downstream Analysis: Pathway Enrichment
- Gene annotation and network databases capture biological meaning
  - Manual curation, text mining
  - Gene function and/or interactions
- Dozens of databases and hundreds of tools
  - Depends on how you want to look at gene-pathway relationships

IV. Downstream Analysis: Pathway Enrichment
Types of pathway analysis
- Simple enrichment test: Qualitative
  - Fisher's Exact Test
  - Hypergeometric test
- Enrichment algorithms: Quantitative
  - GSEA (Broad Institute)
- Network Analysis
- Commercial vs. open source
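The simple enrichment test can be written out in a few lines. The sketch below computes the upper-tail hypergeometric probability of seeing at least the observed overlap between a gene list and a pathway, which is the same quantity as a one-sided Fisher's exact test on the corresponding 2x2 table. The function name and the example numbers are assumptions for illustration.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random,
    without replacement, from n_universe genes of which n_pathway belong
    to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20,000 genes measured, 100 in the pathway,
# 500 differentially expressed, 12 of them in the pathway.
p_value = hypergeom_enrichment_p(20000, 100, 500, 12)
print(f"enrichment p-value: {p_value:.3g}")
```

The test is "qualitative" in the sense that it only uses membership in the differentially expressed list, not the size or direction of each change. The choice of gene universe strongly affects the result, and p-values across many pathways still need multiple-testing correction.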

IV. Downstream Analysis: Pathway Enrichment
Types of pathway analysis
[Figure: types of pathway analysis]

V. Visualizations

V. Visualizations of RNA-Seq Data
- Group comparisons of pathway enrichment
- Heatmaps
- Visualizing Set Overlap
- Dotplots
- Alternative Splicing: Sashimi plots

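V. Visualizations: Group Enrichment
Group comparison of pathway enrichment: Simple Enrichment Test
[Figure: group comparison of pathway enrichment]

V. Visualizations: Expression Heatmap
[Figure: expression heatmap]

V. Visualizations: Set Intersection
[Figure: set intersection]

V. Visualizations: Pathway enrichment
[Figure: pathway enrichment]

V. Visualizations: Sashimi Plot
[Figure: Sashimi plot]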

Conclusions
- Think BEFORE you sequence!
- This is a three-way partnership: bench, sequencing, analysis
  - Everyone should agree on experimental design, platform, approach
- QC is extremely important!
- There is no need to reinvent the wheel, but there are a lot of wheels
- Garbage in, Garbage out!
  - Only some problems can be fixed bioinformatically
- There will always be significant changes detected
  - Interpretation must be cautious and deliberate

THANKS!
Acknowledgements: CCBR, NCBR, and GAU members
Any questions?

Cost-Benefit Considerations

                              MiSeq         NextSeq        HiSeq 4000     Novaseq
Run Time                      4-55 hours    12-30 hours    1-3.5 days     13-44 hours
Max Output                    15 Gb         120 Gb         1500 Gb        6000 Gb
Max Reads Per Run             25 million    400 million    5 billion      20 billion
Lanes                         1             1              8              4
Maximum Read Length           2 x 300 bp    2 x 150 bp     2 x 150 bp     2 x 250
Cost from SF                  623           1956           1007/lane      4382/lane
Max Coverage (12 samples)     2 million     33 million     52 million     416 million
                              reads         reads          reads          reads
Cost per sample (12 samples)  51.91         163.92         83.92          365.16

Caveats:
- Expected reads/sample based on maximum possible yield
- Typical runs likely yield 80% of max
- Different platforms may have different turnaround times depending on queue length and popularity
- Library prep cost is not included here: 50-84 per sample depending on type of kit

Source link (truncated): ng-platforms.html?langsel /ch/
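The per-sample rows of this comparison follow from simple division, which the sketch below reproduces. It assumes 12 samples per lane, reads the flattened Lanes values as 1, 1, 8, and 4, and treats the MiSeq and NextSeq costs as the cost of their single lane; all of these are interpretations of the flattened slide text rather than confirmed figures.

```python
SAMPLES_PER_LANE = 12

# name: (max reads per run, lanes, cost per lane). Assumed reading of the
# slide: single-lane platforms are priced per run, the others per lane.
PLATFORMS = {
    "MiSeq": (25e6, 1, 623),
    "NextSeq": (400e6, 1, 1956),
    "HiSeq 4000": (5e9, 8, 1007),
    "NovaSeq": (20e9, 4, 4382),
}


def per_sample(max_reads, lanes, cost_per_lane, samples=SAMPLES_PER_LANE):
    """Return (max reads per sample, sequencing cost per sample)."""
    reads_per_sample = max_reads / lanes / samples
    cost_per_sample = cost_per_lane / samples
    return reads_per_sample, cost_per_sample


for name, (max_reads, lanes, cost) in PLATFORMS.items():
    reads, sample_cost = per_sample(max_reads, lanes, cost)
    print(f"{name:<11} {reads / 1e6:7.1f}M reads/sample  {sample_cost:8.2f} per sample")
```

Three of the four per-sample costs on the slide agree with this arithmetic to within a cent. The NextSeq figure on the slide (163.92) does not match 1956 / 12 = 163.00, so one of those two numbers may have been garbled. Since typical runs yield less than the maximum, the read counts are upper bounds.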
