Changes between Version 4 and Version 5 of SOPs/rna-seq-diff-expressions_TE


Timestamp:
02/24/21 20:23:59 (4 years ago)
Author:
twhitfie
Comment:

--

  • SOPs/rna-seq-diff-expressions_TE

    v4 v5  
    33=== Background ===
    44
    5     * Repetitive elements comprise a substantial portion of many eukaryotic genomes. In humans, for example, estimates of the repetitive fraction of the genome range from [https://www.pnas.org/content/111/17/6131.full 1/2] to more than [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3228813/ 2/3]. Moreover, repetitive elements are known to play an important role in cellular function and disease, yet are typically excluded from the analysis of high-throughput sequencing experiments.  This exclusion is due to the ambiguity that accompanies assigning multi-mapping reads.
     5    * Repetitive elements comprise a substantial portion of many eukaryotic genomes. In humans, for example, estimates of the repetitive fraction of the genome range from [https://www.pnas.org/content/111/17/6131.full 1/2] to more than [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3228813/ 2/3]. Moreover, repetitive elements are known to play an important role in cellular function and disease, yet are typically excluded from the analysis of high-throughput sequencing experiments.  This exclusion is due to the ambiguity that accompanies assigning multi-mapping reads (i.e. short reads that cannot be mapped ''uniquely'' to genomic loci). Strategies to address this ambiguity include assigning fractional reads to multiple matching loci or assigning such reads to [https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-583 subfamilies of repetitive elements]. [https://academic.oup.com/bioinformatics/article/31/22/3593/240793 TEtranscripts] leverages subfamily annotations to include transposable elements (TEs), along with the customary gene annotations, in the analysis of short read sequencing data. Transposable elements (e.g. LTRs, LINEs and SINEs) make up most of the repetitive DNA in the human genome, with the remainder being tandem repeats (e.g. satellites and microsatellites) that characterize heterochromatin and centromeres.  The workflow below illustrates how to use [https://github.com/mhammell-laboratory/TEtranscripts TEtranscripts] on the resources at the Whitehead Institute.
    66
    77=== Step by step analysis ===
     
    1010    * Use [https://github.com/alexdobin/STAR STAR] or another spliced mapper to map short reads to the genome of choice.
    1111    * See our [http://barcwiki.wi.mit.edu/wiki/SOPs/mapping mapping SOP] for details on running STAR.
     12    * When assessing transcription of TEs, it is ''essential'' to include multi-mapping reads.  When using STAR, in particular, the --winAnchorMultimapNmax and --outFilterMultimapNmax flags control multi-mapping: the former sets the maximum number of loci that an anchor (seed) is allowed to map to, and the latter sets the maximum number of loci a read may map to and still be reported.  A command for STAR mapping of paired-end reads using gzipped fastq input can look like:
     13
     14{{{
     15bsub STAR --genomeDir /path/to/STAR/index/for/organism --readFilesIn /path/to/reads_1.fastq.gz /path/to/reads_2.fastq.gz --outFileNamePrefix somePrefix --sjdbScore 2 --runThreadN 8 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100
     16}}}
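
As a quick sanity check that multi-mapping reads were retained by the mapping step above, alignment records can be tallied by their NH tag (the number of reported loci for a read, which STAR writes with its default SAM attributes). The following is only a sketch under stated assumptions: the function name count_by_nh is hypothetical, it counts alignment records rather than reads, and it expects SAM text on standard input (for example from samtools view, which is assumed to be installed).

```shell
# Hypothetical helper: tally SAM alignment records as unique (NH:i:1) or
# multi-mapped (NH:i:>1). Reads SAM text on stdin; header lines (@...) are skipped.
# Records without an NH tag are not counted in either category.
count_by_nh() {
    awk -F'\t' '
        /^@/ { next }                                  # skip SAM header lines
        {
            nh = 0
            for (i = 12; i <= NF; i++)                 # optional tags start at column 12
                if ($i ~ /^NH:i:/) { nh = substr($i, 6) + 0 }
            if (nh == 1) unique++
            else if (nh > 1) multi++
        }
        END { printf "unique\t%d\nmulti\t%d\n", unique, multi }
    '
}

# Example usage (assumes samtools is available; the BAM name follows from
# the --outFileNamePrefix used in the STAR command):
# samtools view somePrefixAligned.sortedByCoord.out.bam | count_by_nh
```

If the multi count is zero, the multi-mapping flags were most likely not applied and TE quantification will be unreliable.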
    1217
    1318  * **Quantification of raw counts**
     19    * TEtranscripts uses the BAM file(s) of aligned reads (from STAR in this example) as input.
     20    * TEtranscripts relies on separate annotation files (GTFs) for genes and TEs.  A curated collection of TE GTFs can be found [https://www.dropbox.com/sh/1ppg2e0fbc64bqw/AACUXf-TA1rnBIjvykMH2Lcia?dl=0 here].
     21    * Before assigning reads, it is important to know whether they are stranded (see the '''Quantification of raw counts''' section of our [http://barcwiki.wi.mit.edu/wiki/SOPs/rna-seq-diff-expressions best practices] page for details on how to determine this).
     22    * The best way to use the resources on the cluster to assign reads to genes and TEs is by running TEcount separately on each experiment (reverse stranded reads are shown in the example below; for forward stranded reads use --stranded forward, and for unstranded reads use --stranded no, the default):
    1423
    1524{{{
    16 # Unstranded reads
    17 TEtranscripts --format BAM --stranded no -t treat1.bam treat2.bam -c control1.bam control2.bam --GTF genes.gtf --TE transposons.gtf --mode multi --project treat_vs_control --minread 1 -i 100 --padj 0.05 --norm DESeq_default --sortByPos
    18 
    19 # Forward stranded reads
    20 TEtranscripts --format BAM --stranded forward -t treat1.bam treat2.bam -c control1.bam control2.bam --GTF genes.gtf --TE transposons.gtf --mode multi --project treat_vs_control --minread 1 -i 100 --padj 0.05 --norm DESeq_default --sortByPos
    21 
    2225# Reverse stranded reads
    23 TEtranscripts --format BAM --stranded reverse -t treat1.bam treat2.bam -c control1.bam control2.bam --GTF genes.gtf --TE transposons.gtf --mode multi --project treat_vs_control --minread 1 -i 100 --padj 0.05 --norm DESeq_default --sortByPos
     26bsub TEcount --sortByPos --format BAM --stranded reverse -b /path/to/alignment.bam --GTF /path/to/gene.gtf --TE /path/to/TE.gtf --mode multi --project projectName -i 100
    2427}}}
    2528
     29    * The --sortByPos flag is necessary here because the BAM file produced by the STAR command above is sorted by coordinate.
     30    * The -i 100 flag (the default) sets the maximum number of expectation-maximization iterations used to compute maximum likelihood estimates of counts for repetitive elements.
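
Running TEcount separately on each experiment, as described above, amounts to one job per BAM file. The sketch below prints one submission command per BAM so they can be reviewed before anything is submitted. The function name tecount_jobs, the dry-run approach, and deriving the project name from the BAM file name are all assumptions made for illustration; paths containing spaces are not handled.

```shell
# Hypothetical helper: print (not run) one TEcount submission per BAM file.
# Arguments: strandedness (no|forward|reverse), gene GTF, TE GTF, then BAM files.
tecount_jobs() {
    stranded=$1; gene_gtf=$2; te_gtf=$3
    shift 3
    for bam in "$@"; do
        project=$(basename "$bam" .bam)     # e.g. treat1.bam -> treat1
        echo "bsub TEcount --sortByPos --format BAM --stranded $stranded -b $bam --GTF $gene_gtf --TE $te_gtf --mode multi --project $project -i 100"
    done
}

# Example usage: review the printed commands, then pipe them to sh to submit.
# tecount_jobs reverse /path/to/gene.gtf /path/to/TE.gtf /path/to/*.bam
# tecount_jobs reverse /path/to/gene.gtf /path/to/TE.gtf /path/to/*.bam | sh
```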
    2631
    27  * **Other**
    28    * Alternative software:
    29       * [[https://www.nature.com/articles/s41467-019-13035-2|Transposable element expression in tumors is associated with immune infiltration and increased antigenicity]] -  Yu Kong, Christopher M. Rose, Ashley A. Cass, Alexander G. Williams, Martine Darwish, Steve Lianoglou, Peter M. Haverty, Ann-Jay Tong, Craig Blanchette, Matthew L. Albert, Ira Mellman, Richard Bourgon, John Greally, Suchit Jhunjhunwala & Haiyin Chen-Harris ''Nature Communications'' '''10''', 5228 (2019)
     32* **Assessing differential expression for genes and TEs**
     33    * After running TEcount on each sample in your experiment, the reported counts (i.e. a list of raw counts per gene/TE for each sample) can be combined into a counts matrix and analyzed following the steps outlined in the '''Statistics for differential expression''', '''Identifying differentially expressed genes''' and '''Accounting for a batch effect in a differential expression model''' sections of our [http://barcwiki.wi.mit.edu/wiki/SOPs/rna-seq-diff-expressions best practices] page.
     34    * If the number of samples is not too large, the counting and analysis of differential expression can be carried out using a single execution of TEtranscripts (reverse stranded reads are shown in the example below; for forward stranded reads use --stranded forward, and for unstranded reads use --stranded no, the default):
    3035
     36{{{
     37# Reverse stranded reads
     38bsub TEtranscripts --format BAM --stranded reverse -t /path/to/treat1.bam /path/to/treat2.bam -c /path/to/control1.bam /path/to/control2.bam --GTF /path/to/gene.gtf --TE /path/to/TE.gtf --mode multi --project treat_vs_control --minread 1 -i 100 --padj 0.05 --norm DESeq_default --sortByPos
     39}}}
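
When TEcount has instead been run once per sample, the per-sample tables need to be combined into a counts matrix before the differential expression steps. The sketch below assumes each TEcount output is a two-column, tab-delimited table (gene/TE identifier, then count) with a single header line; that layout, the function name merge_counts, and the use of file names as column headers are assumptions to verify against your own output files.

```shell
# Hypothetical helper: merge two-column count tables (ID <tab> count, one header
# line each) into a single tab-delimited matrix, one column per input file.
# IDs missing from a file are filled with 0.
merge_counts() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { nfiles++; header = header OFS FILENAME; next }   # new file: skip its header
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }     # remember first-seen order
            count[$1, nfiles] = $2
        }
        END {
            print "id" header
            for (i = 1; i <= n; i++) {
                line = order[i]
                for (f = 1; f <= nfiles; f++) {
                    key = order[i] SUBSEP f
                    line = line OFS ((key in count) ? count[key] : 0)
                }
                print line
            }
        }
    ' "$@"
}

# Example usage (file names are placeholders):
# merge_counts treat1.cntTable treat2.cntTable control1.cntTable control2.cntTable > counts_matrix.txt
```

The resulting matrix keeps rows in the order they first appear, so genes and TEs stay grouped as in the input tables.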
     40 * **Alternative software**
     41   * [https://github.com/nerettilab/RepEnrich2 RepEnrich2]
     42      * [[https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-583|Transcriptional landscape of repetitive elements in normal and cancer human cells]] - Steven W Criscione, Yue Zhang, William Thompson, John M Sedivy & Nicola Neretti ''BMC Genomics'' '''15''', 583 (2014).
     43   * [http://research-pub.gene.com/REdiscoverTEpaper/software/ REdiscoverTE]
     44      * [[https://www.nature.com/articles/s41467-019-13035-2|Transposable element expression in tumors is associated with immune infiltration and increased antigenicity]] -  Yu Kong, Christopher M. Rose, Ashley A. Cass, Alexander G. Williams, Martine Darwish, Steve Lianoglou, Peter M. Haverty, Ann-Jay Tong, Craig Blanchette, Matthew L. Albert, Ira Mellman, Richard Bourgon, John Greally, Suchit Jhunjhunwala & Haiyin Chen-Harris ''Nature Communications'' '''10''', 5228 (2019).
     45