RNA-Seq Illumina PDF

Although this technique yields high-quality RNA, the total yield is low and requires PCR amplification, thereby introducing amplification biases and creating less distinguishable expression profiles across different cell types (Kube et al.).

Cell purification and enrichment protocols are also available, such as differential centrifugation and fluorescence-activated cell sorting (Cantor et al.). In conjunction with RNA-Seq, these experimental methods have overcome previous technical limitations and enable researchers to uncover unique expression signatures across specific cell types and developmental stages (Moran et al.). In addition to these experimental methods, in silico probabilistic models can be applied in downstream analysis to differentiate the transcript abundances of distinct cells from RNA-Seq data of heterogeneous tissue samples (Erkkila et al.).

Interestingly, in some cases, sample heterogeneity can be an advantage in transcriptome profiling by identifying novel pathways, implicating cellular origins of disease, or identifying previously unknown pathological sites (Alizadeh et al.). Beyond tissue heterogeneity, considerable evidence indicates that cell-to-cell variability in gene expression is ubiquitous, even within phenotypically homogeneous cell populations (Huang). Unfortunately, conventional RNA-Seq studies do not capture the transcriptomic composition of individual cells.

The transcriptome of a single cell is highly dynamic, reflecting its functionality and responses to ever-changing stimuli. Furthermore, genes that show mutually exclusive expression in individual cells may be observed as genes showing co-expression in expression analyses of bulk cell populations.

To uncover cell-to-cell variation within populations, significant efforts have been invested in developing single-cell RNA-Seq methods. The biggest challenge has been extending the limits of library preparation to accommodate extremely low input RNA. In addition, PCR amplification methods do not amplify transcripts linearly and are prone to introducing biases based on the nucleic acid composition of different transcripts, ultimately altering the relative abundance of these transcripts in the sequencing library.

Methods such as CEL-Seq, which replace PCR with linear in vitro amplification of the transcriptome, can avoid these biases (Hashimshony et al.). In addition, the use of nanoliter-scale reaction volumes with microfluidic devices, as opposed to microliter-scale reactions, can reduce biases that arise during sample preparation (Wu et al.). Although single-cell methods are still under active development, quantitative assessments of these techniques indicate that obtaining accurate transcriptome measurements by single-cell RNA-Seq is possible after accounting for technical noise (Brennecke et al.).

These methods will undoubtedly be important for uncovering oscillatory and heterogeneous gene expression within single-cell types, as well as identifying cell-specific biomarkers that further our understanding of biology across many physiological and pathological conditions.

When designing an RNA-Seq experiment, the selection of a sequencing platform is important and depends on the experimental goals. Currently, several NGS platforms are commercially available and other platforms are under active technological development (Metzker). The majority of high-throughput sequencing platforms use a sequencing-by-synthesis method to sequence tens of millions of sequence clusters in parallel. The NGS platforms can often be categorized as either ensemble-based i.

The differences between these sequencing techniques and platforms can affect downstream analysis and interpretation of the sequencing data. In recent years, the sequencing industry has been dominated by Illumina, which applies an ensemble-based sequencing-by-synthesis approach (Bentley et al.). Using fluorescently labeled reversible-terminator nucleotides, DNA molecules are clonally amplified while immobilized on the surface of a glass flowcell.

Because molecules are clonally amplified, this approach provides the relative RNA expression levels of genes. To remove potential PCR-amplification biases, PCR controls and specific steps in the downstream computational analysis are required. Low error rates are particularly important for sequencing miRNAs, whose relatively small sizes result in misalignment or loss of reads if error rates are too high.

The platform has two flow cells, each providing eight separate lanes for sequencing reactions to occur. The sequencing reactions can take between 1. The simplified workflow of the MiSeq instrument offers rapid turnaround time for transcriptome sequencing on a smaller scale.

This approach uses DNA polymerase to perform uninterrupted template-directed synthesis using fluorescently labeled nucleosides. As each base is enzymatically incorporated into a growing DNA strand, a distinctive pulse of fluorescence is detected in real-time by zero-mode waveguide nanostructure arrays.

An advantage of SMRT is that it does not include a PCR amplification step, thereby avoiding amplification bias and improving uniform coverage across the transcriptome. Another advantage of this sequencing approach is the ability to produce extraordinarily long reads with average lengths of to bp, which greatly improves the detection of novel transcript structures (Au et al.). Another important consideration for choosing a sequencing platform is transcriptome assembly.

Transcriptome assembly, which is discussed in greater detail later, is necessary to transform a collection of short sequencing reads into a set of full-length transcripts. In general, longer sequencing reads make it simpler to accurately and unambiguously assemble transcripts, as well as identify splicing isoforms. The extremely long reads generated by the PacBio platform are ideal for de novo transcriptome assembly in which the reads are not aligned to a reference transcriptome.

The longer reads facilitate accurate detection of alternative splice isoforms, which may not be discovered with shorter reads. Moleculo, a company acquired by Illumina, has developed long-read sequencing technology capable of producing bp reads. Although it has yet to be widely adopted for transcriptome sequencing, the long reads aid transcriptome assembly. Lastly, Illumina has developed protocols for its desktop MiSeq to sequence slightly longer reads up to bp.

Although much shorter than PacBio and Moleculo reads, the longer MiSeq reads can also be used to improve both de novo and reference transcriptome assembly. Gene expression profiling by RNA-Seq provides an unprecedented high-resolution view of the global transcriptional landscape.

As sequencing technologies and protocol methodologies continually evolve, new informatics challenges and applications develop. Beyond surveying gene expression levels, RNA-Seq can also be applied to discover novel gene structures, alternatively spliced isoforms, and allele-specific expression (ASE).

In addition, genetic studies of gene expression using RNA-Seq have observed genetically correlated variability in expression, splicing, and ASE (Montgomery et al.). This section will introduce how expression data are analyzed to provide greater insight into the extensive complexity of transcriptomes.

Although basic sequencing analysis tools are more accessible than ever, RNA-Seq analysis presents unique computational challenges not encountered in other sequencing-based analyses and requires specific consideration of the biases inherent in expression data.

Overview of RNA-Seq data analysis. Following typical RNA-Seq experiments, reads are first aligned to a reference genome. Second, the reads may be assembled into transcripts using reference transcript annotations or de novo assembly approaches. Next, the expression level of each gene is estimated by counting the number of reads that align to each exon or full-length transcript.
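As a rough illustration of the counting step described above, the following Python sketch assigns each aligned read to a gene by interval overlap. The function name, the tuple layouts, and the half-open coordinate convention are assumptions made for this example; real counting tools also handle strand, spliced alignments, and several ambiguity modes that are omitted here.

```python
from collections import Counter


def count_reads_per_gene(reads, exons):
    """Count reads that overlap the exons of exactly one gene.

    reads: iterable of (chrom, start, end) alignment spans (hypothetical layout).
    exons: iterable of (gene, chrom, start, end) annotation records.
    Coordinates are assumed to be 0-based and half-open.
    """
    exons = list(exons)
    counts = Counter()
    for r_chrom, r_start, r_end in reads:
        # Collect every gene with at least one exon overlapping the read.
        hit_genes = {
            gene
            for gene, chrom, start, end in exons
            if chrom == r_chrom and r_start < end and start < r_end
        }
        # Reads hitting no gene or several genes are skipped as ambiguous.
        if len(hit_genes) == 1:
            counts[next(iter(hit_genes))] += 1
    return counts
```

The linear scan over all exons keeps the sketch short; with a realistic annotation an interval index would be the more sensible choice.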

Downstream analyses with RNA-Seq data include testing for differential expression between samples, detecting allele-specific expression, and identifying expression quantitative trait loci (eQTLs). Mapping RNA-Seq reads to the genome is considerably more challenging than mapping DNA sequencing reads because many reads map across splice junctions.

In fact, conventional read mapping algorithms, such as Bowtie (Langmead et al.). One approach to resolving this problem is to supplement the reference genome with sequences derived from exon-exon splice junctions acquired from known gene annotations (Mortazavi et al.). As RNA-Seq data have become more widely used, a number of splicing-aware mapping tools have been developed specifically for mapping transcriptome data.
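The junction-library idea can be sketched in a few lines of Python. The sketch assumes the exon sequences of one annotated transcript are already available in transcript order, and that a flank of read length minus one base on each side is wanted so that any read crossing the junction covers at least one base of each exon. The function name and the flank rule are illustrative assumptions, not a description of any specific aligner.

```python
def junction_sequences(exon_seqs, read_length):
    """Build one synthetic sequence per consecutive exon-exon junction.

    exon_seqs: exon sequences of a single transcript, in transcript order.
    read_length: sequencing read length in bases (must be at least 2).
    """
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    flank = read_length - 1
    junctions = []
    for upstream, downstream in zip(exon_seqs, exon_seqs[1:]):
        # Join the tail of the upstream exon to the head of the downstream exon.
        junctions.append(upstream[-flank:] + downstream[:flank])
    return junctions
```

Only junctions between adjacent annotated exons are produced here, so exon-skipping events and unannotated junctions would still be missed.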

Each aligner has different advantages in terms of performance, speed, and memory utilization. Selecting the best aligner to use depends on these metrics and the overall objectives of the RNA-Seq study. After RNA-Seq reads are aligned, the mapped reads can be assembled into transcripts. The majority of computational programs infer transcript models from the accumulation of read alignments to the reference genome (Trapnell et al.).

An alternative approach for transcript assembly is de novo reconstruction, in which contiguous transcript sequences are assembled without the use of a reference genome or annotations (Robertson et al.). The reconstruction of transcripts from short-read data is a major challenge, and a gold standard method for transcript assembly does not exist.

The nature of the transcriptome e. RGASP3 has initiated efforts to evaluate computational methods for transcriptome reconstruction and has found that most algorithms can identify discrete transcript components, but the assembly of complete transcript structures remains a major challenge Steijger et al. A common downstream feature of transcript reconstruction software is the estimation of gene expression levels.

Computational tools such as Cufflinks (Trapnell et al.). Alternative approaches, such as HTSeq, can quantify expression without assembling transcripts by counting the number of reads that map to an exon (Anders et al.). To accurately estimate gene expression, read counts must be normalized to correct for systematic variability, such as library fragment size, sequence composition bias, and read depth (Oshlack and Wakefield; Roberts et al.). To account for these sources of variability, the reads per kilobase of transcript per million mapped reads (RPKM) metric normalizes a transcript's read count by both the gene length and the total number of mapped reads in the sample.

For paired-end reads, the corresponding metric is fragments per kilobase of transcript per million mapped reads (FPKM), which accounts for the dependency between paired-end reads in the RPKM estimate (Trapnell et al.).
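The RPKM normalization described above reduces to a single formula: RPKM = 10^9 * C / (N * L), where C is the read count for a gene, L is its length in bases, and N is the total number of mapped reads. A minimal Python sketch follows. The function name and the dictionary inputs are assumptions for illustration, and N defaults to the sum of the supplied counts when no separate total is given. Counting fragments instead of reads gives FPKM with the same arithmetic.

```python
def rpkm(read_counts, lengths_bp, total_mapped=None):
    """Reads per kilobase of transcript per million mapped reads.

    read_counts: dict mapping gene -> number of mapped reads (or fragments).
    lengths_bp: dict mapping gene -> length in bases.
    total_mapped: total mapped reads in the sample; if omitted, the sum of
        read_counts is used, which is an assumption of this sketch.
    """
    if total_mapped is None:
        total_mapped = sum(read_counts.values())
    if total_mapped <= 0:
        raise ValueError("total_mapped must be positive")
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total_mapped)
        for gene, count in read_counts.items()
    }
```

Two genes with the same read density per base receive the same value regardless of length, which is the point of dividing by L.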

Another technical challenge for transcript quantification is the mapping of reads to multiple transcripts that are a result of genes with multiple isoforms or close paralogs.

However, this strategy is far from ideal for genes lacking unique exons. An alternative strategy used by Cufflinks (Trapnell et al.). To identify known miRNAs, the sequencing reads can be mapped to a specific database, such as miRBase, a repository containing over 24, miRNA loci from species in its latest release v21 in June (Kozomara and Griffiths-Jones). In addition, several tools have been developed to facilitate analysis of miRNAs, including the commonly used tools miRanalyzer (Hackenberg et al.).

MiRanalyzer can detect known miRNAs annotated in miRBase as well as predict novel miRNAs using a machine-learning approach based on the random forest method with a broad range of features. Although miRDeep and miRanalyzer contain modules for target prediction, expression quantification, and differential expression, the methods developed for mRNA quantification and differential expression can also be applied to miRNA data (Eminaga et al.). At each stage in the RNA-Seq analysis pipeline, careful consideration should be applied to identifying and correcting for various sources of bias.

Bias can arise throughout the RNA-Seq experimental pipeline, including during RNA extraction, sample preparation, library construction, sequencing, and read mapping (Kleinman and Majewski; Lin et al.).

First, the quality of the raw sequence data in FASTQ-format files should be evaluated to ensure high-quality reads. Important parameters to evaluate include the sequence diversity of reads, adaptor contamination, base qualities, nucleotide composition, and percentage of called bases. These technical artifacts can arise at the sequencing stage or during the construction of the RNA-Seq library. If possible, corrective actions such as trimming the ends of reads should be performed to speed up the read alignments and improve their quality.
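To make the base-quality check and end trimming concrete, here is a small Python sketch. It assumes four-line FASTQ records with Phred+33 quality encoding and uses a simple rule that removes trailing bases below a threshold. The function names and the default threshold of 20 are illustrative assumptions; dedicated trimming tools use more elaborate windowed or cumulative rules.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        # Line 0 is '@id', line 1 the bases, line 2 '+', line 3 the qualities.
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def phred_scores(quality_string, offset=33):
    """Convert an ASCII quality string to integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop trailing bases whose Phred score is below min_quality."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]
```

A typical use would loop over `parse_fastq(open(path))`, where `path` is a hypothetical file name, and keep only reads whose trimmed length stays above some minimum.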

After aligning the reads, additional parameters should be assessed to account for biases that arise at the read mapping stage. One of the most common sources of mapping errors for RNA-Seq data occurs when a read spans the splice junction of an alternatively spliced gene. A misalignment can easily be introduced by ambiguous mapping of the read end to one of two or more possible exons, and it is especially common when reads are mapped to a reference transcriptome that contains an incomplete annotation of isoforms (Kleinman and Majewski; Pickrell et al.).

If genotype information is available, the integrity of the samples should also be evaluated by investigating the correlation of single-nucleotide variants (SNVs) between the DNA and RNA reads ('t Hoen et al.).

The concordance between the DNA and RNA sequencing data may provide insight into sample swaps or sample mixtures caused accidentally as a result of personnel or equipment error.
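A simple way to picture this check is a concordance score between genotype calls from DNA and from RNA at shared sites. The Python sketch below is only an assumption-laden illustration: the function name, the dictionary layout keyed by site, and the use of exact genotype matches are all choices made for this example.

```python
def genotype_concordance(dna_calls, rna_calls):
    """Fraction of shared sites with identical genotype calls.

    dna_calls, rna_calls: dicts mapping a site key, e.g. (chrom, pos),
    to a genotype string such as "A/G". Returns None if no site is shared.
    """
    shared = [site for site in dna_calls if site in rna_calls]
    if not shared:
        return None
    matches = sum(dna_calls[site] == rna_calls[site] for site in shared)
    return matches / len(shared)
```

Computing this score for every DNA and RNA sample pair and checking that each RNA sample matches its expected DNA sample best is one way to flag a possible swap. Allele-specific expression can make a true heterozygote look homozygous in RNA, so perfect concordance should not be expected.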

In the case of a mixture of samples, more significant patterns of allele-specific expression would be observed than expected for a single individual as a result of more combinations of heterozygous and homozygous sites that would skew the alleles beyond the expected allelic ratio.

A primary objective of many gene expression experiments is to detect transcripts showing differential expression across various conditions. Extensive statistical approaches have been developed to test for differential expression with microarray data, where the continuous probe intensities across replicates can be approximated by a normal distribution (Cui and Churchill; Smyth; Grant et al.).

Although in principle these approaches are also applicable to RNA-Seq data, different statistical models must be considered for discrete read counts that do not fit a normal distribution.

However, further studies indicated that biological variability is not captured by the Poisson assumption, resulting in high false-positive rates due to underestimation of sampling error (Anders and Huber; Langmead et al.). Hence, negative binomial distribution models that take into account overdispersion, or extra-Poisson variation, have been shown to best fit the distribution of read counts across biological replicates.
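The difference between the two models can be made concrete. Under a Poisson model the variance equals the mean, whereas one common negative binomial parameterization sets variance = mu + alpha * mu^2, with alpha the dispersion. The Python sketch below estimates alpha for one gene by the method of moments. This is a deliberately crude estimator chosen for illustration; dedicated differential expression packages instead share information across genes when estimating dispersion.

```python
from statistics import mean, variance


def dispersion_moments(counts):
    """Method-of-moments dispersion for replicate counts of one gene.

    Solves variance = mu + alpha * mu**2 for alpha, clipping at zero so
    that Poisson-like or under-dispersed data give alpha = 0.
    Requires at least two replicates.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # unbiased sample variance
    return max(0.0, (var - mu) / mu ** 2)
```

An alpha well above zero across biological replicates is the extra-Poisson variation that the negative binomial model is meant to absorb.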

To model the count-based nature of RNA-Seq data, statistical models have been developed that account for overdispersion and other sources of variability across technical and biological replicates. One source of variability is differences in sequencing read depth, which can artificially create differences between samples.

For instance, differences in read depth will result in the samples appearing more divergent if raw read counts between genes are compared. Although this correction metric is commonly used in place of read counts, the presence of several highly expressed genes in a particular sample can significantly alter the RPKM and FPKM values. To account for this bias, several statistical models have been proposed that use the highly expressed genes as model covariates (Robinson and Oshlack). Another source of variability that has been observed is that the distribution of sequencing reads is unequal across genes.

Therefore, a two-parameter generalized Poisson model that simultaneously considers read depth and sequencing bias as independent parameters was developed and shown to improve RNA-Seq analysis (Srivastava and Chen). More complex normalization methods have also been developed to account for hidden covariates without removing significant biological variability.

To detect differential expression, a variety of statistical methods have been designed specifically for RNA-Seq data. Although these packages can assign significance to differentially expressed transcripts, the biological observations should be carefully interpreted.

Each model makes specific assumptions that may be violated in the context of the observed data; therefore, an understanding of the model parameters and their constraints is critical for drawing meaningful and accurate biological conclusions (Bullard et al.).

Furthermore, replicates in RNA-Seq experiments are crucial for measuring variability and improving estimates of the model parameters (Tarazona et al.). Biological replicates e. Although the number of replicates required per condition is an open research question, a minimum of three replicates per sample has been suggested (Auer and Doerge). In many cases, multiplexed RNA-Seq libraries can be used to add biological replicates without increasing sequencing costs if sequenced at a lower depth, which greatly improves the robustness of the experimental design (Liu et al.).

Additionally, the accuracy of measurements of differential gene expression can be further improved by using ERCC spike-in controls to distinguish technical variation from biological variation. A major advantage of RNA-Seq is the ability to profile transcriptome dynamics at a single-nucleotide resolution. Therefore, the sequenced transcript reads can provide coverage across heterozygous sites, representing transcription from both the maternal and paternal alleles.

If a sufficient number of reads cover a heterozygous site within a gene, the null hypothesis is that the ratio of maternal to paternal alleles is balanced. Significant deviation from this expectation suggests allele-specific expression (ASE). Potential mechanisms for ASE include genetic variation e. Studies have also applied ASE to identify expression modifiers of protein-coding variation (Lappalainen et al.). Furthermore, ASE studies using single-cell transcriptomics have uncovered a stochastic pattern of allelic expression that may contribute to variable expressivity, a novel perspective which may have fundamental implications for variable disease penetrance and severity (Deng et al.).

Conventional workflows to detect ASE involve counting reads containing each allele at heterozygous sites and applying a statistical test, such as the binomial test or Fisher's exact test (Degner et al.).
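The binomial test mentioned above can be written out directly. The following Python sketch computes an exact two-sided p-value for the null hypothesis of a balanced allelic ratio at one heterozygous site, using the convention of summing the probabilities of all outcomes no more likely than the observed one. The function name and that two-sided convention are assumptions for this example, and the direct formula is only suitable for moderate read depths.

```python
from math import comb


def binomial_two_sided(ref_reads, alt_reads, p=0.5):
    """Exact two-sided binomial p-value for allelic imbalance.

    ref_reads, alt_reads: read counts supporting each allele.
    p: expected reference-allele fraction under the null (0.5 = balanced).
    """
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum outcomes at most as probable as the observed one; the small
    # relative tolerance guards against floating-point ties.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-7))
    return min(1.0, total)
```

Setting `p` slightly above 0.5 is one simple way to acknowledge reference-allele mapping bias, although it does not remove site-specific effects.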

However, more rigorous statistical approaches are necessary to overcome technical challenges involved in ASE detection. These challenges include read-mapping bias, sampling variance, overdispersion at extreme read depths, alternatively spliced alleles, insertions and deletions (indels), and genotyping errors.

To account for overdispersion, one approach is to model allelic read counts using a beta-binomial distribution at individual loci (Sun); however, accurate estimation of the overdispersion parameter requires replicates and, in our experience, a major source of bias comes from site-specific mapping differences.
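For reference, the beta-binomial probability of observing k reference reads out of n is C(n, k) * B(k + a, n - k + b) / B(a, b), where B is the beta function, a / (a + b) is the expected reference fraction, and 1 / (a + b + 1) acts as the overdispersion. The Python sketch below evaluates this in log space. The function names and the (a, b) parameterization are assumptions for illustration; fitting a and b from replicates is not shown.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, a, b):
    """P(k reference reads out of n) under a beta-binomial(a, b) model."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))
```

As a and b grow with a fixed ratio, the distribution approaches the plain binomial, so small values of a + b correspond to strong overdispersion.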

Another strategy is to use a hierarchical Bayesian model that combines information across loci, as well as across replicates and technologies, to make global and site-specific inferences for ASE (Skelly et al.). To assess reference-allele mapping bias, the number of mismatches in reads containing the nonreference allele should be assessed, as increased bias is observed with greater sequence divergence between alleles (Stevenson et al.).

To correct for read-mapping bias, an enhanced reference genome can be constructed that masks all SNP positions or includes the alternative alleles at polymorphic loci (Degner et al.). Statistical methods to better address these technical biases are under active development and are expected to foster further improvements in ASE detection. Another prominent direction of RNA-Seq studies has been the integration of expression data with other types of biological information, such as genotyping data.
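The masking variant of this correction is easy to sketch: every known SNP position in the reference is replaced by N so that neither allele is favored during alignment. The Python function below is a minimal illustration under assumed conventions: a single sequence held in memory, 0-based positions, and a made-up function name.

```python
def mask_snps(sequence, snp_positions):
    """Return the reference sequence with SNP positions replaced by 'N'.

    sequence: reference sequence for one chromosome or contig.
    snp_positions: iterable of 0-based positions to mask.
    """
    bases = list(sequence)
    for pos in snp_positions:
        if not 0 <= pos < len(bases):
            raise IndexError(f"SNP position {pos} is outside the sequence")
        bases[pos] = "N"
    return "".join(bases)
```

Masking treats both alleles as a mismatch to the same extent, at the cost of slightly lower mappability around each SNP. Including the alternative alleles avoids that cost but requires a larger, personalized reference.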

The combination of RNA-Seq with genetic variation data has enabled the identification of genetic loci correlated with gene expression variation, also known as expression quantitative trait loci (eQTLs). This expression variation caused by common and rare variants is postulated to contribute to phenotypic variation and susceptibility to complex disease across individuals (Majewski and Pastinen). The goal of eQTL analysis is to identify associations that will uncover underlying biological processes, discover genetic variants causing disease, and determine causal pathways.

Most of the eQTLs identified directly influenced gene expression in an allele-specific manner and were located near transcriptional start sites, indicating that eQTLs could modulate expression directly, or in cis. Although trans-eQTLs show weaker effects and present validation difficulties, they can potentially reveal previously unknown pathways in gene regulation networks. RNA-Seq has revolutionized QTL analyses because it enables association analyses of more than just gene expression levels alone.

For example, RNA-Seq provides unprecedented opportunity to investigate variations in splicing by profiling alternately spliced isoforms of a gene.

This has enabled the identification of variants influencing the quantitative expression of alternatively spliced isoforms, commonly referred to as splicing-QTLs (sQTLs) (Lalonde et al.). In addition, specific RNA-Seq library constructions e. The expanding potential of RNA-Seq to associate phenotypic variations with genetic variation offers an enhanced understanding of gene regulation.

Traditional eQTL mapping methods that were developed for microarray data use linear models such as linear regression and ANOVA to associate genetic variants with gene expression (Kendziorski and Wang). These methods have been directly applied to RNA-Seq data following appropriate normalization of total read counts.
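The linear-regression form of an eQTL test can be illustrated for a single SNP-gene pair. The Python sketch below fits normalized expression against genotype dosage (0, 1, or 2 copies of the alternative allele) by ordinary least squares and reports the slope, intercept, and R squared. The function name and the additive dosage coding are assumptions for this example, and significance testing and covariates are left out.

```python
def eqtl_linear_fit(genotypes, expression):
    """Least-squares fit of expression on genotype dosage for one pair.

    genotypes: allele dosages per individual (e.g. 0, 1, 2).
    expression: normalized expression values, in the same order.
    Returns (slope, intercept, r_squared).
    """
    n = len(genotypes)
    if n != len(expression) or n < 2:
        raise ValueError("need two equally sized inputs with n >= 2")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if sxx == 0:
        raise ValueError("genotype has no variation in this sample")
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared
```

The slope is the estimated change in expression per additional copy of the alternative allele, which is the effect size usually reported for an eQTL.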

Nonlinear approaches have also been developed to test associations, such as generalized linear and mixed models and Bayesian regression (Servin and Stephens). Alternative models, such as Merlin, have also been developed to detect eQTLs from expression data that include related individuals using pedigree data (Abecasis et al.).

In addition, several methods have been developed to simultaneously test the effect of multiple SNPs on the expression of a single gene using Bayesian methods (Lee et al.). To further improve the detection of causal regulatory variants, several studies have integrated ASE information with eQTL analysis. These studies showed that genetic variants showing allele-specific effects and identified as eQTLs show higher enrichment in functional annotations and provide stronger evidence of cis-regulatory impact (Battle et al.).

Because high-throughput sequencing has created genotype data sets featuring millions of SNPs and expression data sets featuring tens of thousands of transcripts, the task of testing billions of transcript-SNP pairs in eQTL analysis can be computationally intensive. To mitigate this computational burden, software such as Matrix eQTL has been developed to efficiently test the associations by modeling the effect of genotype as either additive (a linear least squares model) or categorical (an ANOVA model) (Shabalin). Because of the large number of tests performed, it is important to correct for multiple testing by calculating the false discovery rate (Benjamini and Hochberg; Yekutieli and Benjamini) or by resampling using bootstrap or permutation procedures (Karlsson; Zhang et al.).
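The false discovery rate correction can be shown in a few lines. The Python sketch below implements the Benjamini-Hochberg step-up adjustment: p-values are ranked, each is scaled by m / rank, and monotonicity is enforced from the largest rank downward. The function name is an assumption, and the sketch presumes independent or positively dependent tests, as the standard procedure does.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is capped by the adjusted values of all larger p-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

Transcript-SNP pairs whose adjusted value falls below the chosen threshold, for example 0.05, would then be reported as significant at that false discovery rate.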

However, the design and interpretation of eQTL studies are not straightforward. Many complications result from the complexity of gene regulation, which shows both spatial (cell and tissue location) specificity and temporal (developmental stage) specificity. For instance, several studies have performed eQTL analysis across multiple tissues, indicating that genetic regulatory elements can have tissue-specific effects (Petretto et al.).

Therefore, future eQTL analyses should test for SNP-transcript associations in well-defined cell types that are relevant to the trait of interest (Lonsdale et al.).

For example, a study detecting eQTLs in cardiovascular disease should use heart tissue, while a study interested in autoimmune disease should use whole blood. Another major consideration for eQTL studies is accounting for population structure and elucidating the causal variants (Stranger et al.).

The structure of genomic variation can vary significantly between populations and will influence the resolution of any genetic association study (Frazer et al.). As eQTL studies integrate data across different populations and use population-scale genome sequencing, the ability to elucidate causal variants will greatly improve (Montgomery et al.). As sequencing technologies advance, computational tools will need to evolve in parallel to solve new technical challenges and support novel applications.

For example, as sequencing platforms become able to produce longer reads, new mapping methods are required to accurately and efficiently align long reads. Because longer reads can span multiple exon-exon junctions, the identification and quantification of alternative isoforms will improve significantly with the extra information encoded in longer reads.

Furthermore, as laboratory methods mature to enable sequencing of minute quantities of RNA, complex statistical approaches will be needed to discriminate between technical noise and meaningful biological variation. These advances will facilitate the analysis of transcriptomes in rare cell types and cell states, enabling researchers to reconstruct biological networks active at the cellular level.

In addition, these advancements will allow transcriptome analysis to move into the field of clinical diagnostics; for example, earlier cancer screening and pregnancy monitoring could be accomplished by sequencing cancerous RNA or fetal RNA in the maternal blood.

Furthermore, the integration of whole-genome sequencing with RNA-Seq in larger samples will provide greater insight into genetic regulatory variation.




