Changes between Version 15 and Version 16 of SOPs/atac_Seq


Ignore:
Timestamp:
08/14/20 13:30:18 (5 years ago)
Author:
gbell
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/atac_Seq

    v15 v16  
    66
    77=== Basic Approach ===
    8  * [#preprocess Pre-process reads](remove adapters and other "contamination")
     8 * [#preprocess Pre-process reads] (remove adapters and other "contamination")
    99 * [#map Map reads to the genome] (with an unspliced mapping tool)
    1010 * [#QC Run quality control and calculate QC metrics]
     
    1313 * Identify differentially accessible regions (for multiple-sample experiments)
    1414 * Identify potential targets of peak regions and/or differentially accessible regions
    15 
    16   * Paired-end reads are better and recommended over single reads for ATAC-seq.  The original [[https://www.nature.com/articles/nmeth.2688 | ATAC-Seq method]] from the Greenleaf lab recommends PE reads, stating, "We found that ATAC-seq paired-end reads produced detailed information about nucleosome packing and positioning."
    1715
    1816=== [=#preprocess Pre-process reads] ===
     
    2725}}}
    2826
     27Paired-end (PE) reads are better and recommended over single reads for ATAC-seq.  The original [[https://www.nature.com/articles/nmeth.2688 | ATAC-Seq method]] from the Greenleaf lab recommends PE reads, stating, "We found that ATAC-seq paired-end reads produced detailed information about nucleosome packing and positioning."
     28
    2929=== [=#preprocess Map reads] ===
    3030
    31   * Map reads with bowtie2 (or another unspliced mapping tool)
    32   * If mapping paired-end reads, keep only concordant pairs
     31  * Map reads with bowtie2 (or another unspliced mapping tool).
     32  * If mapping paired-end reads, keep only concordant pairs.
    3333
    3434{{{
     35bowtie2 --very-sensitive --no-discordant -p 2 -X 2000 -x /nfs/genomes/human_hg38_dec13_no_random/bowtie/hg38 -1 read1.fq -2 read2.fq | samtools view -ub - | samtools sort - >| bowtie_out.bam
     36
     37# using the following settings:
    3538# --very-sensitive
    3639# --no-discordant    suppress discordant alignments for paired reads
    3740# -X/--maxins <int>  maximum fragment length (default=500). Increase to 2000 to get a better nucleosome distribution.
    38 
    39 bowtie2 --very-sensitive --no-discordant -p 2 -X 2000 -x /nfs/genomes/human_hg38_dec13_no_random/bowtie/hg38 -1 read1.fq -2 read2.fq | samtools view -ub - | samtools sort -T $name - >| bowtie_out.bam
    4041}}}
    4142
    42   * Remove duplicates with Picard MarkDuplicates or Samtools rmdup
    43   * Check deduplication level with 'fastqc'
    44   * Remove reads mapped to mitochondria
     43  * Remove duplicates with Picard's 'MarkDuplicates' or 'samtools rmdup'.
     44  * Check deduplication level with 'fastqc'.
     45  * Remove reads mapped to mitochondria.
    4546
    4647{{{
     
    5051=== [=#QC Run quality control and calculate QC metrics] ===
    5152
    52 Calculate the fragment size distribution with the ATACseqQC R package:
     53Calculate the fragment size distribution with the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package:
    5354{{{
    5455library("ATACseqQC")
     
    5758dev.off()
    5859}}}
     60See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes.
    5961
    6062Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
     
    6466./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt
    6567}}}
     68
     69The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has some recommended TSS enrichment scores, depending on the genome.
    6670
    6771=== [=#call_peaks Call peaks] ===
     
    8185}}}
    8286
    83   * Additional notes on calling peaks using MACS: using MACS default options without BAMPE may not call the correct peaks.  Alternative options/parameters in calling peaks using MACS, if BAMPE is not used are, --nomodel --shift s --extsize 2s, this can be used for single-end reads as well, e.g. --shift 100 --extsize 200 is suitable for ATAC-Seq.   
     87Additional notes on calling peaks using MACS: using MACS default options without BAMPE may not call the correct peaks.  Alternative options/parameters in calling peaks using MACS, if BAMPE is not used are, --nomodel --shift s --extsize 2s, this can be used for single-end reads as well, e.g. --shift 100 --extsize 200 is suitable for ATAC-Seq.   
    8488    * MACS' author, T.Liu, recommends using -f BAMPE if PE reads are used [[https://github.com/taoliu/MACS/issues/331]], using BAMPE option asks MACS to pileup and calculate the extension size -  works for finding accessible regions within cut sites.  The additional parameters can also be used to look only at the //exact// cut sites by Tn5 instead of the open/accessible regions [[https://github.com/taoliu/MACS/issues/145]], if so, -f BAMPE may not be suitable.
    8589
     
    8892{{{
    8993
    90 # need to sort by name
    91 samtools sort -n -T abc bowtie/ATAC.bam >ATAC_sortedByName.bam"
    92 Genrich -e chrM -j -r -v -t ATAC_sortedByName.bam  -o peaks/ATAC_peak  -f peaks/ATAC_log
     94# Start by sorting mapped reads by read name
     95samtools sort -n bowtie/ATAC.bam > ATAC_sortedByName.bam
     96# Run Genrich
     97Genrich -e chrM -j -r -v -t ATAC_sortedByName.bam -o peaks/ATAC_peak -f peaks/ATAC_log
    9398
    94 Where
    95 -j      ATAC-seq mode (must be specified)  intervals are interpreted that are centered on transposase cut sites (the ends of each DNA fragment). Only properly paired alignments are analyzed by default
    96 -r      Remove PCR duplicates
     99where
     100-j              ATAC-seq mode (must be specified)  intervals are interpreted that are centered on transposase cut sites (the ends of each DNA fragment). Only properly paired alignment are analyzed by default
     101-r              Remove PCR duplicates
    97102-d <int>        Expand cut sites to the given length (default 100bp)
    98 -e <arg>        Chromosomes (reference sequences) to exclude. Can be a comma-separated list, e.g. -e chrM,chrY.
     103-e <arg>        Chromosomes (reference sequences) to exclude. Can be a comma-separated  list, e.g. -e chrM,chrY.
    99104-E <file>       Input BED file(s) of genomic regions to exclude, such as 'N' homopolymers or high mappability regions
    100 -t  <file>       Input SAM/BAM file(s) for experimental sample(s)
    101 -o  <file>       Output peak file (in ENCODE narrowPeak format)
    102 -f <LOG>        Those who wish to explore the results of varying the peak-calling parameters (-q/-p, -a, -l, -g) should consider having Genrich produce a log file when it parses the SAM/BAM files (for example, with -f <LOG> added to the above command). Then, Genrich can call peaks directly from the log file with the -P option:
    103                         Genrich  -P  -f <LOG>  -o <OUT2>  -p 0.01  -a 200  -v
    104 }}}
    105 
    106 Calculate the FRiP score
    107 {{{
    108 /nfs/BaRC_Public/BaRC_code/Python/calculate_FRiP_score/calculate_FRiP_score.py
     105-t <file>       Input SAM/BAM file(s) for experimental sample(s)
     106-o <file>       Output peak file (in ENCODE narrowPeak format)
     107-f <LOG>        Those who wish to explore the results of varying the peak-calling parameters (-q/-p, -a, -l, -g) should consider having Genrich produce a log file when it parses the SAM/BAM files (for example, with -f <LOG> added to the above command). Then, Genrich can call peaks directly from the log file with the -P option: Genrich -P -f <LOG> -o <OUT2> -p 0.01  -a 200 -v
    109108}}}
    110109
    111110
     111Calculate the FRiP score (with /nfs/BaRC_Public/BaRC_code/Python/calculate_FRiP_score/calculate_FRiP_score.py).  The FRiP (Fraction of reads in peaks) score describes the fraction of all mapped reads that fall into the called peak regions.  The higher the score, the better, preferably over 0.3, according to [https://www.encodeproject.org/atac-seq/#standards ENCODE].
     112{{{
     113# Using a 'narrowPeak' file listing MACS2 peaks
     114./calculate_FRiP_score.py SampleA.bam SampleA_peaks.narrowPeak
     115# Using a BED file of peaks
     116./calculate_FRiP_score.py SampleA.bam SampleA_peaks.bed
     117}}}
     118
     119
     120  * [[https://github.com/LiuLabUB/HMMRATAC | HMMRATAC]] is another piece of software for ATAC-seq peak-calling.
     121
    112122=== Tips/Recommendations for ATAC-Seq ===
    113123//Based on presentation by H.Liu at BioC 2020 presentation //
    114   *  post-alignment filtering: remove mito/ChrM reads, remove duplciates, remove mapping artifacts: < 38bp and fragments > 2kb, and discordant reads
    115     * The nucleosome free region (NFR) should be > 38bp to < 147 bases (one nucleosome); mono nucleosome fragments are in the range of 147-200 bp
    116     * unless you have high coverage, use all reads for downstream analysis and not remove mono-, di-, tri-, etc. nucleosome fragments or reads.  Otherwise, you may miss many open chromatin regions during peak calling.
    117     * recommend at least ~50M reads/coverage for calling open/accessible regions, and ~200M reads for TF/footprinting.
    118   * check fragment size distribution (from ATACSeqQC)
    119     * see [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro, J.D., et al. for expected distribution of fragment size
     124  * Post-alignment filtering: remove mito/ChrM reads, remove duplicates, remove mapping artifacts: < 38bp and fragments > 2kb, and discordant reads
     125  * Unless you have high coverage, use all reads for downstream analysis and not remove mono-, di-, tri-, etc. nucleosome fragments or reads.  Otherwise, you may miss many open chromatin regions during peak calling.
     126  * The nucleosome free region (NFR) should be > 38bp to < 147 bases (one nucleosome); mono nucleosome fragments are in the range of 147-200 bp
     127  * For calling open/accessible regions, at least ~50M reads are recommended.
     128  * For transcription factor footprinting, at least ~200M reads are recommended.
    120129  * Tn5 produces 5’ overhangs of 9 bases long: pos. strand + 4 and neg strand -5 (shiftGAlignmentsList and shiftReads function) splitGcAlignmentsByCut: creates different bins of reads: NFR, mono, di, etc. Shifted reads that do not fit into any of the bins should be discarded.
    121   *  use housekeeping genes to check QC: signal enrichment is expected in the regulatory regions of housekeeping genes in good ATACSeq eperiments. Use IGVSnapshot function with geneNames param. splitGAlignmentsByCut: creates different bins of reads: NFR, mono, di, etc. Shifted reads that do not fit into any of the bins should be discarded.
     130  * Use housekeeping genes to check QC: signal enrichment is expected in the regulatory regions of housekeeping genes in good ATACSeq experiments. Use IGVSnapshot function with geneNames param. splitGAlignmentsByCut: creates different bins of reads: NFR, mono, di, etc. Shifted reads that do not fit into any of the bins should be discarded.
    122131
    123132
    124 === Other Software/Reference ===
     133=== Further reading ===
    125134
    126135* [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3 | From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis]]
    127 * software: [[https://github.com/LiuLabUB/HMMRATAC | HMMRATAC]]
    128136
    129137