= Experimental design of short read sequencing experiments = == How long should the reads be? Should they be single or paired-end? == * What is the goal of your experiment? * For typical RNA-seq expression level quantification, a read or read pair gets one count, regardless of the length. As a result, shorter reads may provide just as good data, as long as they aren't so short that repetitive mapping is a problem. * Longer and/or paired reads are surely beneficial if the experimental goal is * novel gene discovery: longer reads are much better at identifying novel splice junctions * According to [https://www.ncbi.nlm.nih.gov/pubmed/26100517 Chhangawala et al., 2015], 50-bp reads are adequate for identifying differentially expressed genes. Identifying novel splice junctions, however, can benefit from paired and longer reads. * For variant discovery, coverage is key, whether it's fewer long reads or more shorter reads (as long as the reads are long enough to map uniquely) * How much read length is used for primers, adapters, barcodes, etc.? Of course make sure that enough actual experimental DNA is left for effective mapping. == If you are able to sequence more than one lane, how should the samples be partitioned? == * The magnitude of a lane effect is typically small but typically non-zero. * To balance any lane effect, sequence all of your samples on each of your lanes. * Another benefit of barcoding and mixing all samples together is that the samples can be re-sequenced in other lanes in the future (from the same library preparation) without unbalancing the experimental design. == How many reads are needed for each sample? == * Resources like [http://scotty.genetics.utah.edu/ Scotty] can help with a power analysis for RNA-seq experiments. == How many replicates are needed for a RNA-seq experiment? == * One resource is the [https://bioconductor.org/packages/release/bioc/html/RNASeqPower.html RNASeqPower] R package described in [https://www.ncbi.nlm.nih.gov/pubmed/23961961 Hart et al., 2019] * A good resource to help with this analysis is [https://www.ncbi.nlm.nih.gov/pubmed/25246651 Ching et al.] * The main command from RNASeqPower is '''rnapower(depth, n, cv, effect, alpha, power)''', where one can enter all values but one, and the algorithm will calculate the missing value * The inputs are * depth = the number of reads (of a median gene, which according to Ching et al. is abou 16-32 per million mapped reads) * n = sample size (number of biological replicates per group) * cv = coefficient of variation (sqrt(disperson)), which can vary between 0.2 (cell lines) and 0.5 (animals or human subjects); see the above references for a discussion of typical values for different types of experiments * effect = effect size that one wants to identify (such as 2 for a 2-fold difference) * alpha = test statistic (such as FDR) threshold, such as 0.05 * power = power of the test (the fraction of true positives that will be detected); can be calculated or set to a value like 0.8 == Calculating number of DNA or RNA reads needed to obtain the desired coverage == * Some useful references: * Sims et al., 2014. [http://www.ncbi.nlm.nih.gov/pubmed/24434847 Sequencing depth and coverage: key considerations in genomic analyses.] * Includes methods to estimate the number of reads required for single nucleotide variant calling, and RNA-seq and ChIP-seq experiments * Ajay et al., 2011. [http://www.ncbi.nlm.nih.gov/pubmed/21771779/ Accurate and comprehensive sequencing of personal genomes.] * Includes methods to estimate the number of reads required for single nucleotide variant calling * ''Example 1'' (genome sequencing): For a genome of 3e+9 nt, to get 35x coverage we would need: * For 40-nt reads: * 3e+9 * 35 / 40 = 2.625e+09 => ~2.6 billion reads * For 100-nt reads: * 3e+9 * 35 / 100 = 1.05e+09 => ~1 billion reads * '' Example 2'' (RNA_seq experiment): * If we have * 6 million 35x35-nt paired end reads * a genome with ~7000 genes expressed * average gene length = 5741 bp * then the total length of the transcriptome is 7000 x 5741 => 38,297,000 nt * and the total length of the reads is 6 million x 70 nt [35 + 35] => 420,000,000 nt * so the average coverage will be 420,000,000 / 38,297,000 => ~11x * but note that coverage will be very irregular to due a wide range of expression levels