| 1 | == Using RNA-Seq to quantify gene levels and assay for differential expression for transposable elements == |
| 2 | |
| 3 | === Background === |
| 4 | |
| 5 | * For each sample, map reads to genome using splice-aware mapper. |
| 6 | * Count reads mapped to each gene (or other set of features). |
| 7 | * Use gene counts to identify differentially expressed genes. |
| 8 | |
| 9 | === General suggestions === |
| 10 | * **Preliminary issues** |
| 11 | * Statistics for all methods require a matrix of counts (positive integer values) for each gene for each sample. |
| 12 | * Create a tab-delimited matrix of integer counts, with column labels for each sample. |
| 13 | * Genes with no counts in any sample should generally be removed to permit higher statistical power to identify differential expression. |
| 14 | * According to [[http://www.ncbi.nlm.nih.gov/pubmed/20167110|Bullard et al., 2010]], differential expression analysis is influenced more by the normalization method than by the choice of differential expression statistic. |
| 15 | * Note that without replication, one cannot make very strong conclusions. High-throughput sequencing, just like every other technology, needs biological replication. |
| 16 | * One can conclude that certain genes in sample A have a different RNA abundance than in sample B, but the results cannot be generalized. |
| 17 | * Example, using an extremely precise balance: If Dick weighs more than Sally, we cannot conclude that males weigh more than females because we know nothing about the variability of weights among males and among females. Even if we weighed several individuals together, we'd still be missing information about within-group variability. |
| 18 | * Sample commands to get raw counts from an alignment file: |
| 19 | * ''coverageBed -split -abam accepted_hits.bam -b transcripts.gtf > transcript.coverage.bed'' (See the [http://bedtools.readthedocs.io/en/latest/content/tools/coverage.html bedTools coverage] page for details) |
| 20 | * ''htseq-count -m intersection-strict --stranded=no accepted_hits.sam transcripts.gff > transcript.coverage.txt'' (See the [[http://www-huber.embl.de/users/anders/HTSeq/doc/count.html|htseq-count]] page for details) |
| 21 | * In our view, htseq-count is better at handling reads that map to a genome region with overlapping genes. |
| 22 | |
| 23 | === Step by step analysis === |
| 24 | |
| 25 | * **Mapping** |
| 26 | * Use [https://github.com/alexdobin/STAR STAR] or another spliced mapper to map short reads to the genome of choice. |
| 27 | * See our [http://barcwiki.wi.mit.edu/wiki/SOPs/mapping mapping SOP] for more details. |
| 28 | |
| 29 | * **Quantification of raw counts** |
| 30 | |
| 31 | * Is your sequencing library stranded or unstranded? This information is needed to help these tools accurately count features. Strandedness of some library prep methods: |
| 32 | * TruSeq Stranded mRNA Kits ("TruSeqStrandedPolyA") reads are reverse stranded (stranded in the reverse direction relative to the transcript orientation). |
| 33 | * SMART-Seq v4 Ultra Low Input RNA Kit ("SMARTerUltra-lowPOLYA-V4") reads are unstranded. |
| 34 | * KAPA RNA HyperPrep Kits ("KAPAHyperPrepmRNA") reads are reverse stranded. |
| 35 | * The Whitehead Genome Core has some more [http://genomecore.wi.mit.edu/index.php/NCBISubmission Library Prep Descriptions]. |
| 36 | * See [[SAMBAMqc]] (and/or look at mapped reads in a genome browser) to determine or verify strandedness |
| 37 | |
| 38 | {{{ |
| 39 | # single-end reads (unstranded) |
| 40 | featureCounts -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam |
| 41 | # single-end reads (forward stranded) |
| 42 | featureCounts -s 1 -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam |
| 43 | # single-end reads (reverse stranded) |
| 44 | featureCounts -s 2 -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam |
| 45 | |
| 46 | |
| 47 | # paired-end reads (unstranded) |
| 48 | featureCounts -p -a gene_anotations.gtf -o MySample.featureCounts.txt MySample.bam |
| 49 | # paired-end reads (forward stranded) |
| 50 | featureCounts -p -s 1 -a gene_annotations.gtf -o MySamples.featureCounts.txt *sortedByName.bam |
| 51 | # paired-end reads (reverse stranded) |
| 52 | featureCounts -p -s 2 -a gene_annotations.gtf -o MySamples.featureCounts.txt *sortedByName.bam |
| 53 | }}} |
| 54 | |
| 55 | |
| 56 | * **Other** |
| 57 | * Review articles: |
| 58 | * [[http://www.ncbi.nlm.nih.gov/pubmed/24020486|Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data.]] - Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D. Genome Biol. 2013 Sep 10;14(9):R95. |
| 59 | * [[http://www.ncbi.nlm.nih.gov/pubmed/21106489|A survey of statistical software for analyzing RNA-seq data]] - Gao D, Kim J, Kim H, Phang TL, Selby H, Tan AC, Tong T. Hum Genomics. 2010 Oct;5(1):56-60. |
| 60 | * [[http://www.ncbi.nlm.nih.gov/pubmed/21176179|From RNA-seq reads to differential expression results]] - Oshlack A, Robinson MD, Young MD. Genome Biol. 2010;11(12):220. Epub 2010 Dec 22. |
| 61 | * For more practical information, see the third session of [http://jura.wi.mit.edu/bio/education/R2011/ An introduction to R and Bioconductor: A BaRC Short Course] and the [http://jura.wi.mit.edu/bio/hot_topics/ BaRC Hot Topic] (under "Short Read Sequencing", see "Practical RNA-Seq analysis") |
| 62 | |