How long should the reads be? Should they be single or paired-end?

  • What is the goal of your experiment?
    • For typical RNA-seq expression level quantification, a read or read pair gets one count regardless of its length. As a result, shorter reads may provide equally good data, as long as they aren't so short that they map to multiple locations.
    • Longer and/or paired-end reads are beneficial if the experimental goal is
      • novel gene discovery: longer reads are much better at identifying novel splice junctions
    • For variant discovery, coverage is key, whether it comes from fewer longer reads or more shorter reads (as long as the reads are long enough to map uniquely).
  • How much of the read length is used for primers, adapters, barcodes, etc.? Make sure that enough of each read is left as actual sample sequence for effective mapping.
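
As a rough illustration of the last point, the sketch below subtracts non-sample bases from the read length. The function name usable_length and the example overhead values are hypothetical and are not taken from any particular library preparation.

```python
def usable_length(read_length, overhead_lengths):
    """Return the number of bases per read left for mapping.

    overhead_lengths: lengths (nt) of primers, adapters, barcodes, etc.
    that are sequenced as part of the read (hypothetical inputs).
    """
    remaining = read_length - sum(overhead_lengths)
    if remaining < 0:
        raise ValueError("overhead exceeds read length")
    return remaining


# Hypothetical example: a 50-nt read carrying a 6-nt inline barcode
# and 4 nt of primer sequence leaves 40 nt of sample sequence.
print(usable_length(50, [6, 4]))  # 40
```

If the remaining length is close to the point where reads stop mapping uniquely, a longer read length is worth considering.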

If you are able to sequence more than one lane, how should the samples be partitioned?

  • The magnitude of a lane effect is typically small but non-zero.
  • To balance any lane effect, sequence all of your samples on each of your lanes.
  • Another benefit of barcoding and mixing all samples together is that the samples can be re-sequenced in other lanes in the future (from the same library preparation) without unbalancing the experimental design.
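
The balanced layout described above can be sketched as follows. This is only an illustration: the function names, sample names, and the per-lane read yield are all hypothetical, and equal pooling across barcodes is assumed.

```python
def balanced_lane_design(samples, lanes):
    """Assign every (barcoded) sample to every lane."""
    return {lane: list(samples) for lane in lanes}


def reads_per_sample(reads_per_lane, n_samples, n_lanes):
    """Expected reads per sample, assuming samples are pooled equally."""
    return reads_per_lane * n_lanes / n_samples


# Hypothetical experiment: 8 samples pooled onto 2 lanes.
samples = [f"sample_{i}" for i in range(1, 9)]
design = balanced_lane_design(samples, ["lane_1", "lane_2"])
print(design["lane_1"] == design["lane_2"])  # True

# Assumed yield of 200 million reads per lane (a placeholder value).
print(reads_per_sample(200_000_000, n_samples=8, n_lanes=2))  # 50000000.0
```

Because every lane contains the same pool, adding a lane later increases the depth of all samples equally.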

How many reads are needed for each sample?

Calculating number of DNA or RNA reads needed to obtain the desired coverage

  • Example 1 (genome sequencing): For a genome of 3e+9 nt, to get 35x coverage we would need:
    • For 40-nt reads:
      • 3e+9 * 35 / 40 = 2.625e+09 => ~2.6 billion reads
    • For 100-nt reads:
      • 3e+9 * 35 / 100 = 1.05e+09 => ~1 billion reads
  • Example 2 (RNA-seq experiment):
    • If we have
      • 6 million 35x35-nt paired end reads
      • a genome with ~7000 genes expressed
      • average gene length = 5471 bp
    • then the total length of the transcriptome is 7000 x 5741 => 38,297,000 nt
    • and the total length of the reads is 6 million x 70 nt [35 + 35] => 420,000,000 nt
    • so the average coverage will be 420,000,000 / 38,297,000 => ~11x
    • but note that coverage will be very irregular due to a wide range of expression levels
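
The two calculations above follow the same relation, coverage = (number of reads x read length) / target length. A minimal sketch, with hypothetical function names, that reproduces the numbers from both examples:

```python
def reads_needed(target_length, coverage, read_length):
    """Reads required so that total sequenced bases = target_length * coverage."""
    return target_length * coverage / read_length


def average_coverage(n_fragments, bases_per_fragment, target_length):
    """Average coverage = total sequenced bases / target length."""
    return n_fragments * bases_per_fragment / target_length


# Example 1: 3e9-nt genome at 35x coverage
print(reads_needed(3e9, 35, 40))   # 2625000000.0
print(reads_needed(3e9, 35, 100))  # 1050000000.0

# Example 2: 6 million 35x35-nt read pairs over a 38,297,000-nt transcriptome
print(round(average_coverage(6e6, 35 + 35, 38_297_000), 1))  # 11.0
```

These are averages only. For RNA-seq in particular, the per-gene coverage depends on expression level, so the number is best treated as a rough planning figure.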