Changes between Version 51 and Version 52 of SOPs/atac_Seq


Ignore:
Timestamp:
06/02/21 12:36:11 (4 years ago)
Author:
byuan
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/atac_Seq

    v51 v52  
    1010 * [#QC Run quality control and calculate QC metrics]
    1111 * [#call_peaks Call ATAC-seq "peaks"] with a high coverage of mapped reads
     12 * [#Blacklist Blacklist filtering for peaks ]
    1213 * [#Analyze Analyze peak regions for binding motifs]
    1314 * Identify differentially accessible regions (for multiple-sample experiments)
     
    2728Paired-end (PE) reads are better and recommended over single reads for ATAC-seq.  The original [[https://www.nature.com/articles/nmeth.2688 | ATAC-Seq method]] from the Greenleaf lab recommends PE reads, stating, "We found that ATAC-seq paired-end reads produced detailed information about nucleosome packing and positioning."
    2829
    29 === [=#preprocess Map reads] ===
     30=== [=#preprocess Map reads, and post-alignment filtering] ===
    3031
    3132  * Map reads with bowtie2 (or another unspliced mapping tool).
     
    5657  * Check deduplication level with [[http://barcwiki.wi.mit.edu/wiki/SOPs/qc_shortReads | 'fastqc']].
    5758 
    58    * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions.  The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/].  We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
    59 
    60    * Reads overlapping the blacklist can be filtered using alignmentSieve (from the deepTools package) or 'intersectBed -v' (from the bedtools suite):
    61 {{{
    62 alignmentSieve -b Reads.bam --blackListFileName hg38-blacklist.bed -o Reads.no_blackList.bam
    63 intersectBed -v -a Reads.bam -b hg38-blacklist.bed > Reads.no_blackList.bam
    64 }}}
    65 
    66    * An alternative for handling blacklist regions is to keep the mapped reads as is but (further downstream) remove peaks overlapping blacklist regions.  Between-sample normalization will typically differ whether one filters these regions in reads or in peaks.
    67 
    6859
    6960=== [=#QC Run quality control and calculate QC metrics] ===
     
    7667dev.off()
    7768}}}
    78 See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
     69      See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
    7970
    8071   * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
     
    8879  * The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has [[https://www.encodeproject.org/atac-seq/#standards|recommendations]] on TSS enrichment scores, fragment size distribution. You can also download ENCODE pipeline and analyze your samples with the pipeline. Its QC output html file includes quality controls results and their interpretation. It also estimates the library complexity based the uniqueness of reads. 
    8980
    90   * In addition to do varies quality controls, [[https://www.sciencedirect.com/science/article/pii/S240547122030079X | ataqv]] summarizes QC results into an interactive html page, which also allows you to view multiple samples together.
    91 
    92 First, run ataqv on each bam file to generate JSON files.
    93 Here is a sample command for a bulk ATAC_seq sample:
     81  * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X | ataqv]] summarizes QC results into an interactive html page, which also allows you to view multiple samples together.
     82          * First, run ataqv on each bam file to generate JSON files. Here is a sample command for a bulk ATAC_seq sample:
    9483
    9584
     
    111100
    112101}}}
    113 
    114 
    115 Next, run mkarv on the JSON files to generate the interactive web viewer. By default, SRR891268 will be used as the reference sample in the viewer. You can specify a different reference when you built your viewer instance, refer to the information on how to configure the reference with mkarv -h.
     102         Next, run mkarv on the JSON files to generate the interactive web viewer. By default, SRR891268 will be used as the reference sample in the viewer. You can specify a different reference when you built your viewer instance, refer to the information on how to configure the reference with mkarv -h.
    116103
    117104{{{
     
    146133
    147134
    148   * Using pair-end bed as macs2 input. It considers ends of both mates, focus on cutting/insertion sites enrichment in ATAC-seq. [[ [[ https://twitter.com/XiChenUoM/status/1336658454866325506 | Explaination ]]. Codes implemented in the ATAC-seq review paper ( https://github.com/alexyfyf/atac_nf/blob/7f996b7de0e349c5a10dbbd75b2c266339517a3b/atac.nf#L341 )
    149  [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3 | From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis]]
     135  * Using pair-end bed as macs2 input. It considers ends of both mates, focus on cutting/insertion sites enrichment in ATAC-seq. ( https://twitter.com/XiChenUoM/status/1336658454866325506 )
     136       * Codes below were implemented in the ATAC-seq review paper ( https://github.com/alexyfyf/atac_nf/blob/7f996b7de0e349c5a10dbbd75b2c266339517a3b/atac.nf#L341 ).
     137       * [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3 | From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis]]
    150138{{{
    151139    # convert bam to bed
     
    156144
    157145   * Run macs2 with BAMPE option: If using the 'BAMPE' option with paired-end reads, let MACS run the pileup and calculate 'extsize'.
    158        * #BAMPE format: there is no special format for BAMPE - MACS will treat PE reads as coming from the same fragment, from the manual,
     146       * BAMPE format: there is no special format for BAMPE - MACS will treat PE reads as coming from the same fragment, from the manual,
    159147       * "If the BAM file is generated for paired-end data, MACS will only keep the left mate(5' end) tag. However, when format BAMPE is specified, MACS will use the real fragments inferred from alignment results for reads pileup."
    160148
     
    165153}}}
    166154
    167    * Run macs 2 using --nomodel --shift s --extsize 2s. This can be used for single-end reads. For PE reads, [[ https://github.com/macs3-project/MACS/issues/145 | it ignores the the right mate, and not recommended ]]. [[ https://github.com/macs3-project/MACS/issues/145 | In case where short fragment population is extremely dominant, the final output won't be off much as compared with BAMPE ]]
     155   * Run macs 2 using --nomodel --shift s --extsize 2s. This can be used for single-end reads.
     156      * For PE reads, it ignores the the right mate, and not recommended ( https://github.com/macs3-project/MACS/issues/145 ).
     157      * In case where short fragment population is extremely dominant, the final output won't be off much as compared with BAMPE ( https://github.com/macs3-project/MACS/issues/145 )
    168158{{{
    169159
     
    171161}}}
    172162 
    173     * Which macs optMACS' author, T.Liu, recommends using -f BAMPE if PE reads are used [[https://github.com/taoliu/MACS/issues/331]], using BAMPE option asks MACS to pileup and calculate the extension size -  works for finding accessible regions within cut sites.  The additional parameters can also be used to look only at the //exact// cut sites by Tn5 instead of the open/accessible regions [[https://github.com/taoliu/MACS/issues/145]], if so, -f BAMPE may not be suitable.
    174       * Shifting reads, pos. strand +4 and neg strand -5 (see recommendations below) may be needed as well to find //exact// cut sites.
    175 
    176 
    177    * [[https://github.com/jsh58/Genrich | Genrich]] is another piece of software for peak-calling. It has the advantages of (a) running all of the post-alignment steps through peak-calling with one command, and (b) can process multiple replicates. Detailed information can be found in [[https://informatics.fas.harvard.edu/atac-seq-guidelines.html|Harvard ATAC-seq Guidelines]]
     163    * Which macs option?
     164       * MACS' author, T.Liu, recommends using -f BAMPE if PE reads are used [[https://github.com/taoliu/MACS/issues/331]], using BAMPE option asks MACS to pileup and calculate the extension size -  works for finding accessible regions within cut sites.  The additional parameters can also be used to look only at the //exact// cut sites by Tn5 instead of the open/accessible regions [[https://github.com/taoliu/MACS/issues/145]], if so, -f BAMPE may not be suitable.
     165          * Shifting reads, pos. strand +4 and neg strand -5 (see recommendations below) may be needed as well to find //exact// cut sites.
     166
     167
     168[[https://github.com/jsh58/Genrich | Genrich]] is another piece of software for peak-calling. It has the advantages of (a) running all of the post-alignment steps through peak-calling with one command, and (b) can process multiple replicates. Detailed information can be found in [[https://informatics.fas.harvard.edu/atac-seq-guidelines.html|Harvard ATAC-seq Guidelines]]
    178169
    179170{{{
     
    207198
    208199
     200=== [=#Blacklist Blacklist filtering for peaks ] ===
     201   * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions.  The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/].  We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
     202{{{
     203bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} \
     204                 | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' \
     205                 | grep -P 'chr[\dXY]+[ \t]'  | gzip -nc > ${FILTERED_PEAK}
     206
     207}}}
     208
    209209=== [=#Analyze Analyze peak regions for binding motifs] ===
    210210