== Using ATAC-Seq to identify open chromatin == ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a method to identify regions of accessible ("open") chromatin that are presumed to be available for binding by transcription factors and other DNA-binding proteins, which in turn can regulate genome function. It can help answer experimental questions previously addressed by techniques like MNase-seq, DNase-Seq, and FAIRE-Seq. === Basic Approach === * [#preprocess Pre-process reads](remove adapters and other "contamination") * [#map Map reads to the genome] (with an unspliced mapping tool) * Run quality control and calculate QC metrics * [#call_peaks Call ATAC-seq "peaks"] with a high coverage of mapped reads * Analyze peak regions for binding motifs * Identify differentially accessible regions (for multiple-sample experiments) * Identify potential targets of peak regions and/or differentially accessible regions * Paired-end reads are better and recommended over single reads for ATAC-seq. The original [[https://www.nature.com/articles/nmeth.2688 | ATAC-Seq method]] from the Greenleaf lab recommends PE reads, stating, "We found that ATAC-seq paired-end reads produced detailed information about nucleosome packing and positioning." === [=#preprocess Pre-process reads] === * Check adapter contamination with fastqc. Remove adapters with cutadapt or trim_galore: {{{ # --fastqc Run FastQC in the default mode on the FastQ file once trimming is complete. # --nextera Adapter sequence to be trimmed is the first 12bp of the Nextera adapter 'CTGTCTCTTATA' instead of the default auto-detection of adapter sequence. trim_galore --fastqc -nextera --paired --length 30 -o clean_adaptor_folder read1.fq read2.fq }}} === [=#preprocess Map reads] === * Map reads with bowtie2, * If mapping paired-end reads, keep only concordant pairs {{{ # --very-sensitive # --no-discordant suppress discordant alignments for paired reads # -X/--maxins maximum fragment length (default=500). Increase to 2000 to get a better nucleosome distribution. bowtie2 --very-sensitive --no-discordant -p 2 -X 2000 -x /nfs/genomes/human_hg38_dec13_no_random/bowtie/hg38 -1 read1.fq -2 read2.fq | samtools view -ub - | samtools sort -T $name - >| bowtie_out.bam }}} * Remove duplicates with Picard MarkDuplicates or Samtools rmdup * Check deduplication level with 'fastqc' * Remove reads mapped to mitochondria {{{ samtools view -h file.bam | grep -v chrM | samtools view -b -h -f 0x2 - | samtools sort - > file.sorted.bam }}} === [=#call_peaks Call peaks] === MACS v2 works well for this. If using the 'BAMPE' option with paired-end reads, let MACS run the pileup and calculate 'extsize'; ask MACS to keep only 1 read if duplicates are present. {{{ macs2 callpeak -f BAMPE -t file.sorted.bam --broad --keep-dup 1 -B -q 0.01 -g mm -n MACS_ATACSeq_Peaks #note: if duplicates were removed, use --keep-dup all; if not, use --keep-dup 1. Removing duplicates is recommended. #BAMPE format: there is no special format for BAMPE - MACS will treat PE reads as coming from the same fragment, from the manual, "If the BAM file is generated for paired-end data, MACS will only keep the left mate(5' end) tag. However, when format BAMPE is specified, MACS will use the real fragments inferred from alignment results for reads pileup." }}} * Additional notes on calling peaks using MACS: using MACS default options without BAMPE may not call the correct peaks. Alternative options/parameters in calling peaks using MACS, if BAMPE is not used are, --nomodel --shift s --extsize 2s, this can be used for single-end reads as well, e.g. --shift 100 --extsize 200 is suitable for ATAC-Seq. * MACS' author, T.Liu, recommends using -f BAMPE if PE reads are used [[https://github.com/taoliu/MACS/issues/331]], using BAMPE option asks MACS to pileup and calculate the extension size - works for finding accessible regions within cut sites. The additional parameters can also be used to look only at the //exact// cut sites by Tn5 instead of the open/accessible regions [[https://github.com/taoliu/MACS/issues/145]], if so, -f BAMPE may not be suitable. * [[https://github.com/jsh58/Genrich | Genrich]] is another piece of software for peak-calling. It has the advantage of running all of the post-alignment steps through peak-calling with one command. It can take multiple replicates. Detailed information can be found in [[https://informatics.fas.harvard.edu/atac-seq-guidelines.html|Harvard ATAC-seq Guidelines]] {{{ # need to sort by name samtools sort -n -T abc bowtie/ATAC.bam >ATAC_sortedByName.bam" Genrich -e chrM -j -r -v -t ATAC_sortedByName.bam -o peaks/ATAC_peak -f peaks/ATAC_log Where -j ATAC-seq mode (must be specified) intervals are interpreted that are centered on transposase cut sites (the ends of each DNA fragment). Only properly paired alignments are analyzed by default -r Remove PCR duplicates -d Expand cut sites to the given length (default 100bp) -e Chromosomes (reference sequences) to exclude. Can be a comma-separated list, e.g. -e chrM,chrY. -E Input BED file(s) of genomic regions to exclude, such as 'N' homopolymers or high mappability regions -t Input SAM/BAM file(s) for experimental sample(s) -o Output peak file (in ENCODE narrowPeak format) -f Those who wish to explore the results of varying the peak-calling parameters (-q/-p, -a, -l, -g) should consider having Genrich produce a log file when it parses the SAM/BAM files (for example, with -f added to the above command). Then, Genrich can call peaks directly from the log file with the -P option: Genrich -P -f -o -p 0.01 -a 200 -v }}} === Tips/Recommendations for ATAC-Seq === //Based on presentation by H.Liu at BioC 2020 presentation // * post-alignment filtering: remove mito/ChrM reads, remove duplciates, remove mapping artifacts: < 38bp and fragments > 2kb, and discordant reads * The nucleosome free region (NFR) should be > 38bp to < 147 bases (one nucleosome); mono nucleosome fragments are in the range of 147-200 bp * unless you have high coverage, use all reads for downstream analysis and not remove mono-, di-, tri-, etc. nucleosome fragments or reads. Otherwise, you may miss many open chromatin regions during peak calling. * recommend at least ~50M reads/coverage for calling open/accessible regions, and ~200M reads for TF/footprinting. * check fragment size distribution (from ATACSeqQC) * see [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro, J.D., et al. for expected distribution of fragment size * Tn5 produces 5’ overhangs of 9 bases long: pos. strand + 4 and neg strand -5 (shiftGAlignmentsList and shiftReads function) splitGcAlignmentsByCut: creates different bins of reads: NFR, mono, di, etc. Shifted reads that do not fit into any of the bins should be discarded. * use housekeeping genes to check QC: signal enrichment is expected in the regulatory regions of housekeeping genes in good ATACSeq eperiments. Use IGVSnapshot function with geneNames param. splitGAlignmentsByCut: creates different bins of reads: NFR, mono, di, etc. Shifted reads that do not fit into any of the bins should be discarded. === Other Software/Reference === * [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3 | From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis]] * software: [[https://github.com/LiuLabUB/HMMRATAC | HMMRATAC]]