9 | | === General suggestions === |
10 | | * **Preliminary issues** |
11 | | * Statistics for all methods require a matrix of counts (non-negative integers) for each gene in each sample. |
12 | | * Create a tab-delimited matrix of integer counts, with one row per gene and a column label for each sample. |
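13 | | * Genes with no counts in any sample should generally be removed: they carry no information about differential expression, and dropping them reduces the number of tests, which increases the statistical power to detect differential expression. |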
14 | | * According to [[http://www.ncbi.nlm.nih.gov/pubmed/20167110|Bullard et al., 2010]], differential expression analysis is influenced more by the normalization method than by the choice of differential expression statistic. |
15 | | * Note that without replication, one cannot make very strong conclusions. High-throughput sequencing, just like every other technology, needs biological replication. |
16 | | * One can conclude that certain genes in sample A have a different RNA abundance than in sample B, but the results cannot be generalized. |
17 | | * For example, using an extremely precise balance: if Dick weighs more than Sally, we cannot conclude that males weigh more than females, because we know nothing about the variability of weights among males and among females. Even if we weighed several individuals together, we'd still be missing information about within-group variability. |
18 | | * Sample commands to get raw counts from an alignment file: |
19 | | * ''coverageBed -split -abam accepted_hits.bam -b transcripts.gtf > transcript.coverage.bed'' (See the [[http://bedtools.readthedocs.io/en/latest/content/tools/coverage.html|bedTools coverage]] page for details) |
20 | | * ''htseq-count -m intersection-strict --stranded=no accepted_hits.sam transcripts.gff > transcript.coverage.txt'' (See the [[http://www-huber.embl.de/users/anders/HTSeq/doc/count.html|htseq-count]] page for details) |
21 | | * In our view, htseq-count is better at handling reads that map to a genome region with overlapping genes. |
22 | | |
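The suggestion above to remove genes with no counts in any sample can be sketched in a few lines of Python. This is only a minimal illustration, not part of any of the tools mentioned: it assumes a tab-delimited matrix whose first row holds the sample labels and whose first column holds the gene identifier. The function name ''drop_all_zero_genes'' and the gene and sample names are hypothetical.

```python
import csv
import io


def drop_all_zero_genes(lines):
    """Yield the header row, then every gene row with at least one nonzero count.

    Assumes (for illustration) a tab-delimited matrix: the first row holds
    sample labels and the first column holds the gene identifier.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    yield header
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts = [int(value) for value in row[1:]]
        if any(count < 0 for count in counts):
            raise ValueError(f"negative count for gene {row[0]}")
        if any(counts):  # keep the gene if any sample has a nonzero count
            yield row


# Hypothetical toy matrix: geneY has no counts in any sample.
example = io.StringIO(
    "gene\tsampleA\tsampleB\n"
    "geneX\t10\t0\n"
    "geneY\t0\t0\n"
    "geneZ\t3\t7\n"
)

kept = list(drop_all_zero_genes(example))
for row in kept:
    print("\t".join(row))
```

To run this on a real matrix, pass an open file handle in place of the ''io.StringIO'' object. A gene with a zero in only some samples (such as geneX here) is kept, because only genes with no counts in any sample are removed.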