Practical Guide to Interpreting RNA-seq Data
Skyler Kuhn (1,2), Mayank Tandon (1,2)
1. CCR Collaborative Bioinformatics Resource (CCBR), Center for Cancer Research, NCI
2. Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research
Overview
I. Experimental Design
- Hypothesis-driven
- Overview of Best Practice
II. Quality-control
- Pre- and post-alignment QC metrics
- Interpretation
III. Pipeline
- FastQ files to counts matrix
- Reproducibility
IV. Downstream Analysis
- Principal Components Analysis (PCA)
- Differential Expression
- Pathway Analysis
V. Advanced Visualizations
- Group comparisons
- Alternative Splicing Events
- Pathway Diagrams
I. Experimental Design

I. Experimental Design: Overview
Hypothesis-driven
- Addresses a well thought-out, quantifiable question
Considerations:
- Library construction: mRNA versus total RNA
- Single-end versus paired-end sequencing
- Sequencing depth: quantifying gene-level or transcript-level expression
- Number of replicates: statistical power and the ability to drop a bad sample
- Reducing batch effects
I. Experimental Design: Library Construction
Total RNA contains high levels of ribosomal RNA (rRNA): 80%
mRNA
- poly(A) selection
- Standard profiling for gene expression
- Low RIN may result in 3’ bias
Total RNA
- rRNA depletion
- mRNA + non-coding RNA species (lncRNA)
- Prokaryotic samples
I. Experimental Design: Sequencing Depth
mRNA: poly(A) selection
- Recommended sequencing depth: 10-20M paired-end reads (or 20-40M reads)
- RNA must be high quality (RIN 8)
Total RNA: rRNA depletion
- Recommended sequencing depth: 25-60M paired-end reads (or 50-120M reads)
- RNA must be high quality (RIN 8)
* Differential isoform regulation or alternative splicing events: 100M paired-end reads
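The two ways of stating depth on this slide differ by a factor of two, because each paired-end fragment yields two reads. The short sketch below reproduces that arithmetic from the slide's ranges; the function and dictionary names are illustrative only.

```python
def pairs_to_reads(paired_end_millions):
    """Convert a depth given in millions of read pairs to millions of reads."""
    return 2 * paired_end_millions

# Recommended depth per library type, in millions of paired-end reads (from the slide).
RECOMMENDED_PE_DEPTH_M = {
    "mRNA (poly(A) selection)": (10, 20),
    "total RNA (rRNA depletion)": (25, 60),
}

for library, (low, high) in RECOMMENDED_PE_DEPTH_M.items():
    print(f"{library}: {low}-{high}M pairs = "
          f"{pairs_to_reads(low)}-{pairs_to_reads(high)}M reads")
```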
I. Experimental Design: Number of Replicates
Recommended
- Biological replicates over technical replicates
- Number of replicates: 4
- Peace of mind: ability to drop a bad sample without compromising statistical power
Bare Minimum
- Biological replicates over technical replicates
- Number of replicates: 3
I. Experimental Design: Reducing Batch Effects
Batch effects
- Unwanted sources of technical variation
- Decrease batch effects by uniform processing
Protocol-driven
- Different lab technicians
- Different processing times
- Different reagent lots
Sequencing
- Lane effect

Sample Name     Group   Batch   Batch*
Treatment r1    KO      1       1
Treatment r2    KO      2       1
Treatment r3    KO      1       1
Treatment r4    KO      2       1
Cntrl r1        WT      1       2
Cntrl r2        WT      2       2
Cntrl r3        WT      1       2
Cntrl r4        WT      2       2

* Confounded groups and batches!
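The difference between the two batch columns can be checked mechanically before any sample is sequenced. The sketch below is one simple way to do it, under the assumption that a design counts as confounded when every batch contains samples from only one group. The sample sheet mirrors the slide's table; the key names "batch" and "batch_star" are made up for illustration.

```python
from collections import defaultdict

# Sample sheet from the slide: "batch" is the balanced layout,
# "batch_star" is the confounded layout (marked Batch* on the slide).
SAMPLES = [
    {"name": "Treatment r1", "group": "KO", "batch": 1, "batch_star": 1},
    {"name": "Treatment r2", "group": "KO", "batch": 2, "batch_star": 1},
    {"name": "Treatment r3", "group": "KO", "batch": 1, "batch_star": 1},
    {"name": "Treatment r4", "group": "KO", "batch": 2, "batch_star": 1},
    {"name": "Cntrl r1", "group": "WT", "batch": 1, "batch_star": 2},
    {"name": "Cntrl r2", "group": "WT", "batch": 2, "batch_star": 2},
    {"name": "Cntrl r3", "group": "WT", "batch": 1, "batch_star": 2},
    {"name": "Cntrl r4", "group": "WT", "batch": 2, "batch_star": 2},
]

def groups_per_batch(samples, batch_key):
    """Map each batch label to the set of experimental groups it contains."""
    seen = defaultdict(set)
    for sample in samples:
        seen[sample[batch_key]].add(sample["group"])
    return dict(seen)

def is_confounded(samples, batch_key):
    """True when every batch holds a single group, so the group effect
    cannot be separated from the batch effect."""
    return all(len(groups) == 1
               for groups in groups_per_batch(samples, batch_key).values())

print("Batch  confounded:", is_confounded(SAMPLES, "batch"))
print("Batch* confounded:", is_confounded(SAMPLES, "batch_star"))
```

Partial confounding (for example, one batch that is mostly KO) would not be caught by this all-or-nothing check; a cross-tabulation of group by batch is the more general tool.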
II. Quality Control

II. Quality-control: Overview
No need to reinvent the wheel, but there are a lot of wheels!
Pre-alignment quality control
- Sequencing quality
- Contamination screening
Post-alignment quality control
- Alignment quality
Aggregation and interpretation
- MultiQC report
- QC metric guidelines
II. Quality-control: Pre-alignment
Workflow: FastQC (raw), Adapter Trimming, FastQC (trimmed), Contamination Screening
Sequencing quality
- FastQC: run twice, on raw and trimmed data
Contamination screening
- FastQ Screen
- Kraken
- BioBloom
II. Quality-control: Pre-alignment
Workflow: FastQC (raw), Adapter Trimming, FastQC (trimmed)
FastQC
- Identifies potential problems that can arise during sequencing or library prep
- Run on raw reads (pre-adapter removal) and trimmed reads (post-adapter removal)
- Summarizes:
  - Per base and per sequence quality scores
  - Per sequence GC content
  - Per sequence adapter content
  - Per sequence read lengths
  - Overrepresented sequences

II. Quality-control: FastQC
II. Quality-control: Pre-alignment
Workflow: Adapter Trimming, Contamination Screen, Alignment
FastQ Screen
- Aligns to human, mouse, fungi, bacteria, and viral references
- Easy to interpret and an important QC step
Kraken
- Taxonomic composition of microbial contamination
  - Archaea
  - Bacteria
  - Plasmid
  - Viral

FastQ Screen: Contamination Screening

Kraken + Krona: Microbial Taxonomic Composition
II. Quality-control: Post-alignment
Workflow: Alignment, Alignment Quality, Quantify Counts
Preseq
- Estimates library complexity
Picard RNAseqMetrics
- Number of reads that align to coding, intronic, UTR, intergenic, and ribosomal regions
- Normalizes gene coverage across a meta-gene body
  - Identifies 5’ or 3’ bias
RSeQC
- Suite of tools to assess various aspects of post-alignment quality
  - Calculate the distribution of insert size
  - Junction annotation (% known, % novel reads spanning splice junctions)
  - BAM to BigWig (visual inspection with IGV)

CollectRnaseqMetrics: Alignment Summary

Picard CollectRnaseqMetrics: Normalized Gene Coverage, 3’ Bias
II. Quality-control: Aggregation
MultiQC
- HTML report that aggregates information across all samples
  - Plots, filtering, and highlighting
- Highly customizable, with great documentation
  - Add text and embed custom figures
  - Create your own module to extend missing functionality
- Supports over 73 commonly used open-source bioinformatics tools
QC Metric Guidelines

Metric                            mRNA                          total RNA
RNA type(s)                       Coding                        Coding + non-coding
RIN                               8 [low RIN: 3’ bias]          8
Single-end vs paired-end          Paired-end                    Paired-end
Recommended sequencing depth      10-20M PE reads               25-60M PE reads
FastQC                            Q30 70%                       Q30 70%
Percent aligned to reference      70%                           65%
Million reads aligned reference   7M PE reads (or 14M reads)    16.5M PE reads (or 33M reads)
Percent aligned to rRNA           5%                            15%
Picard RNAseqMetrics              Coding 50%                    Coding 35%
Picard RNAseqMetrics              Intronic + Intergenic 25%     Intronic + Intergenic 40%
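Guidelines like these are easy to turn into an automated per-sample check. The sketch below encodes four of the table's values and flags samples that miss them. It assumes that the Q30 and alignment values are minimums and that the rRNA value is a maximum; the table gives only the numbers, so that direction is an interpretation. The metric key names are hypothetical and would need to be mapped to whatever your QC tools report.

```python
# Values taken from the guideline table. Direction of each threshold is an
# assumption: Q30 and alignment figures are treated as minimums, rRNA as a maximum.
QC_GUIDELINES = {
    "mRNA": {"min_q30_pct": 70, "min_aligned_pct": 70,
             "min_aligned_pe_reads_m": 7, "max_rrna_pct": 5},
    "total RNA": {"min_q30_pct": 70, "min_aligned_pct": 65,
                  "min_aligned_pe_reads_m": 16.5, "max_rrna_pct": 15},
}

def qc_flags(metrics, library_type):
    """Return a list of human-readable warnings for one sample."""
    guide = QC_GUIDELINES[library_type]
    flags = []
    if metrics["q30_pct"] < guide["min_q30_pct"]:
        flags.append("low Q30")
    if metrics["aligned_pct"] < guide["min_aligned_pct"]:
        flags.append("low percent aligned")
    if metrics["aligned_pe_reads_m"] < guide["min_aligned_pe_reads_m"]:
        flags.append("too few aligned read pairs")
    if metrics["rrna_pct"] > guide["max_rrna_pct"]:
        flags.append("high rRNA")
    return flags

# Hypothetical sample metrics, for illustration only.
sample = {"q30_pct": 92.0, "aligned_pct": 88.0,
          "aligned_pe_reads_m": 14.2, "rrna_pct": 1.3}
print("as mRNA library:     ", qc_flags(sample, "mRNA"))
print("as total RNA library:", qc_flags(sample, "total RNA"))
```

A flagged sample is a prompt for closer inspection rather than an automatic exclusion.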
III. Pipeline

III. Processing Pipeline: Conceptual Diagram
Raw data
- FastQ files
Adapter Trimming
- Adapters are composed of synthetic sequences and should be removed prior to alignment
Alignment
- Adding biological context to your data: find where reads align to the reference genome
Quantification
- Counting the number of reads that align to a particular feature of interest (genes, isoforms, etc.)
Differential Expression
- Summarizing differences between two groups or conditions (KO vs. WT)
III. Processing Pipeline: Practical Example
FastQ files to raw counts matrix
- FastQC: pre- and post-trimming
- Cutadapt: removes adapters
- FastQ Screen: run twice, on different sets of references
- STAR: splice-aware aligner
- RSEM: generates gene and isoform counts
- MultiQC: aggregates everything into an HTML report
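To make the order of the tools concrete, the sketch below assembles one plausible command line per step for a single paired-end sample and prints them without running anything. It is an illustration, not the pipeline used by the authors: the file names, index paths, thread count, and adapter sequence are all assumptions, and FastQ Screen is left out because it depends on a site-specific configuration file.

```python
import shlex
from pathlib import Path

# Assumption: a commonly used Illumina adapter prefix. Replace it with the
# adapter sequence that matches your library prep kit.
ADAPTER = "AGATCGGAAGAGC"

def build_commands(sample, r1, r2, star_index, rsem_ref, outdir="results", threads=8):
    """Return the per-sample commands, in execution order, as argument lists."""
    out = Path(outdir)
    trim1 = str(out / f"{sample}.R1.trim.fastq.gz")
    trim2 = str(out / f"{sample}.R2.trim.fastq.gz")
    star_prefix = str(out / f"{sample}.")
    # STAR writes this file when --quantMode TranscriptomeSAM is set.
    transcriptome_bam = star_prefix + "Aligned.toTranscriptome.out.bam"
    return [
        # QC on raw reads
        ["fastqc", "-o", str(out), r1, r2],
        # Adapter trimming for both mates
        ["cutadapt", "-a", ADAPTER, "-A", ADAPTER,
         "-o", trim1, "-p", trim2, r1, r2],
        # QC on trimmed reads
        ["fastqc", "-o", str(out), trim1, trim2],
        # Splice-aware alignment, also emitting transcriptome-space alignments
        ["STAR", "--runThreadN", str(threads), "--genomeDir", star_index,
         "--readFilesIn", trim1, trim2, "--readFilesCommand", "zcat",
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--quantMode", "TranscriptomeSAM",
         "--outFileNamePrefix", star_prefix],
        # Gene- and isoform-level quantification from the STAR alignments
        ["rsem-calculate-expression", "--paired-end", "--alignments",
         "-p", str(threads), transcriptome_bam, rsem_ref, str(out / sample)],
        # Aggregate all reports found in the output directory
        ["multiqc", str(out), "-o", str(out)],
    ]

# Hypothetical sample and reference paths.
commands = build_commands("S1", "S1_R1.fastq.gz", "S1_R2.fastq.gz",
                          "star_index", "rsem_ref/genome")
for cmd in commands:
    print(shlex.join(cmd))
```

In practice these steps would be wrapped in a workflow manager so that each one reruns only when its inputs change.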
III. Processing Pipeline: Reproducibility
Workflow management systems
- Snakemake, Nextflow
Package management
- No active management: rat’s nest of interdependencies, prone to break
- Python: virtual environments
- Conda: Python, R, Scala, Java, C/C++, FORTRAN
- Docker or Singularity: portability and high reproducibility
IV. Downstream Analysis

IV. Downstream Analysis
Pipeline: Raw data (FastQ files), Adapter Trimming, Alignment, Quantification, Differential Expression, Answer Biological Questions
- Step 1: Think
- Step 2: Analyze
- Step 3: QC?
- Step 4: Nobel Prize!

IV. Downstream Analysis
Principal Components Analysis (PCA)
- Data summarization, visualization, and QC tool
Differential Expression
- Find genes that are different between groups of interest
Pathway Enrichment
- Analyze for broader biological patterns
IV. Downstream Analysis: PCA
Principal Components Analysis (PCA)
- Dimensionality reduction technique
- Captures patterns of variance into singular values
- Visualizes global transcriptomic patterns
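As a rough illustration of how singular values relate to sample-level patterns, the sketch below runs PCA on a small simulated log-expression matrix using a singular value decomposition from NumPy. The data are synthetic (4 KO and 4 WT samples, 200 genes, with 50 genes shifted in KO), and every name is illustrative. Real analyses typically start from normalized, variance-stabilized counts.

```python
import numpy as np

def pca(log_expr, n_components=2):
    """PCA via SVD on a samples x genes matrix of log-scale expression.

    Returns per-sample scores and the fraction of variance explained
    by each of the requested components.
    """
    centered = log_expr - log_expr.mean(axis=0)      # center each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates
    explained = s**2 / np.sum(s**2)                   # variance fractions
    return scores, explained[:n_components]

# Synthetic toy data, for illustration only.
rng = np.random.default_rng(0)
baseline = rng.normal(5.0, 1.0, size=200)
ko_shift = np.zeros(200)
ko_shift[:50] = 3.0                                   # 50 genes up in KO
ko = baseline + ko_shift + rng.normal(0.0, 0.2, size=(4, 200))
wt = baseline + rng.normal(0.0, 0.2, size=(4, 200))
log_expr = np.vstack([ko, wt])

scores, explained = pca(log_expr)
for label, (pc1, pc2) in zip(["KO"] * 4 + ["WT"] * 4, scores):
    print(f"{label}  PC1={pc1:8.2f}  PC2={pc2:8.2f}")
print("fraction of variance explained:", np.round(explained, 3))
```

When the first component separates samples by group, as it does in this toy case, the dominant source of variance is the biology of interest; when it separates them by batch or processing date, PCA is doing its job as a QC tool.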
IV. Downstream Analysis: PCA
PCA can help drive biological insights, or be used as a QC tool.
IV. Downstream Analysis: Differential Expression
Goal: Identify genes or transcripts that vary due to biological effects
Question: Can’t I just use a t-test to do that?
Answer: Sure, but the data are noisy, so it is a bad idea. Instead, we apply normalization and/or employ specialized statistical tests.
Law, C. W., et al. (2014). "voom: Precision weights unlock linear model analysis tools for RNA-seq read counts." Genome Biol 15(2): R29.
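To see why normalization comes first, consider two samples with identical expression profiles but different sequencing depth: their raw counts differ even though the biology does not. The sketch below uses counts per million (CPM), one simple and widely used scaling, chosen here only for illustration; the dedicated packages apply more sophisticated normalization. Gene names and counts are made up.

```python
import math

def counts_per_million(counts):
    """Scale one sample's raw gene counts so that they sum to one million."""
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("sample has no counted reads")
    return {gene: count * 1e6 / library_size for gene, count in counts.items()}

def log2_cpm(counts, pseudocount=1.0):
    """log2-transform CPM values; the pseudocount avoids log2(0)."""
    return {gene: math.log2(cpm + pseudocount)
            for gene, cpm in counts_per_million(counts).items()}

# Hypothetical counts: sample_b was sequenced twice as deeply as sample_a.
sample_a = {"GeneA": 500, "GeneB": 1500, "GeneC": 8000}
sample_b = {"GeneA": 1000, "GeneB": 3000, "GeneC": 16000}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
for gene in sample_a:
    print(f"{gene}: raw {sample_a[gene]} vs {sample_b[gene]}, "
          f"CPM {cpm_a[gene]:.0f} vs {cpm_b[gene]:.0f}")
```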
IV. Downstream Analysis: Differential Expression
Seyednasrollah, F., et al. (2015). "Comparison of software packages for detecting differential expression in RNA-seq studies." Brief Bioinform 16(1): 59-70.
IV. Downstream Analysis: Differential Expression
Practical Rules of Thumb
- Limma, DESeq2, and edgeR will perform very similarly in most cases
  - Consensus or intersection of the three is sometimes used
- Limma works better with larger cohorts (7 or more samples per group)
- DESeq2 works better with small cohorts (3 or fewer per group)
  - May also be more sensitive for low-depth data
- edgeR provides convenience functions for converting to various normalized values
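The consensus approach mentioned above amounts to simple set operations on the lists of significant genes from each tool. The sketch below shows a strict intersection and a looser "at least two of three" vote. The gene lists are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter

def consensus_genes(gene_sets, min_tools=None):
    """Return genes reported by at least `min_tools` of the given sets.

    With min_tools left as None, every tool must agree (strict intersection).
    """
    if min_tools is None:
        min_tools = len(gene_sets)
    votes = Counter(gene for genes in gene_sets for gene in set(genes))
    return {gene for gene, n in votes.items() if n >= min_tools}

# Hypothetical significant-gene lists from the three packages.
limma_hits = {"TP53", "MYC", "EGFR", "KRAS"}
deseq2_hits = {"TP53", "MYC", "EGFR", "BRCA1"}
edger_hits = {"TP53", "MYC", "KRAS", "BRCA1"}

all_three = consensus_genes([limma_hits, deseq2_hits, edger_hits])
two_of_three = consensus_genes([limma_hits, deseq2_hits, edger_hits], min_tools=2)
print("called by all three:", sorted(all_three))
print("called by two or more:", sorted(two_of_three))
```

The strict intersection trades sensitivity for confidence, so the choice of threshold depends on whether the gene list feeds validation experiments or exploratory pathway analysis.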
IV. Downstream Analysis: Differential Expression Output
IV. Downstream Analysis: Pathway Enrichment
Gene annotation and network databases capture biological meaning
- Manual curation, text mining
- Gene function and/or interactions
Dozens of databases and hundreds of tools
- Depends on how you want to look at gene-pathway relationships
IV. Downstream Analysis: Pathway Enrichment
Types of pathway analysis
Simple enrichment test: qualitative
- Fisher’s Exact Test
- Hypergeometric test
Enrichment algorithms: quantitative
- GSEA (Broad Institute)
Network analysis
Commercial vs. open source
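The simple enrichment test asks how surprising the overlap between a gene list and a pathway would be if the list had been drawn at random. The sketch below computes the upper-tail hypergeometric probability using only the Python standard library, which is equivalent to a one-sided Fisher's exact test on the corresponding 2x2 table. The gene counts in the example are invented for illustration.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random,
    without replacement, from a universe containing n_pathway pathway genes."""
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical example: 20,000 genes tested, 150 in the pathway,
# 400 differentially expressed, 12 of which are in the pathway.
p_value = hypergeom_enrichment_pvalue(20000, 150, 400, 12)
expected_overlap = 150 * 400 / 20000
print(f"expected overlap by chance: {expected_overlap:.1f}")
print(f"enrichment p-value: {p_value:.3g}")
```

The test is called qualitative because it uses only gene membership in the significant list, not the size of each gene's change; methods such as GSEA use the full ranking instead. When many pathways are tested, the resulting p-values also need multiple-testing correction.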
V. Visualizations

V. Visualizations of RNA-Seq Data
- Group comparisons of pathway enrichment
- Heatmaps
- Visualizing set overlap
- Dotplots
- Sashimi plots (alternative splicing)
V. Visualizations: Group Enrichment
Group comparison of pathway enrichment: Simple Enrichment Test

V. Visualizations: Expression Heatmap

V. Visualizations: Set Intersection

V. Visualizations: Pathway Enrichment

V. Visualizations: Sashimi Plot
Conclusions
Think BEFORE you sequence!
- This is a three-way partnership: bench, sequencing, analysis
- Everyone should agree on experimental design, platform, and approach
QC is extremely important!
- There is no need to reinvent the wheel, but there are a lot of wheels
Garbage in, garbage out!
- Only some problems can be fixed bioinformatically
There will always be significant changes detected
- Interpretation must be cautious and deliberate
THANKS!
Acknowledgements
- CCBR, NCBR, and GAU members
Any questions?
Cost-Benefit Considerations

                               MiSeq             NextSeq            HiSeq 4000         NovaSeq
Run time                       4-55 hours        12-30 hours        1-3.5 days         13-44 hours
Max output                     15 Gb             120 Gb             1500 Gb            6000 Gb
Max reads per run              25 million        400 million        5 billion          20 billion
Lanes                          1                 1                  8                  4
Maximum read length            2 x 300 bp        2 x 150 bp         2 x 150 bp         2 x 250**
Cost from SF                   623               1956               1007/lane          4382/lane
Max coverage (12 samples)      2 million reads   33 million reads   52 million reads   416 million reads
Cost per sample (12 samples)   51.91             163.92             83.92              365.16

Caveats:
- Expected reads/sample are based on maximum possible yield
  - Typical runs likely yield 80% of max
- Different platforms may have different turnaround times depending on queue length and popularity
- Library prep cost is not included here: 50-84 depending on type of kit

ng-platforms.html?langsel /ch/