== Using ATAC-Seq to identify open chromatin == ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a method to identify regions of accessible ("open") chromatin that are presumed to be available for binding by transcription factors and other DNA-binding proteins, which in turn can regulate genome function. It can help answer experimental questions previously addressed by techniques like MNase-seq, DNase-Seq, and FAIRE-Seq. === Basic Approach === If your samples consist of human or mouse specimens with biological replicates, you can execute the [[ENCODE pipeline | ENCODE pipeline]] in our server seamlessly, encompassing steps 1 through 7 in a single run. 1. [#preprocess Pre-process reads] (remove adapters and other "contamination") 2. [#map Map reads to the genome, and post-alignment filtering] 3. [#postAlignQC Post-alignment quality control and calculate QC metrics] 4. [#call_peaks Call ATAC-seq "peaks"] with a high coverage of mapped reads 5. [#AfterPeakQC Quality control after calling peaks] 6. [#Blacklist Blacklist filtering for peaks ] 7. [#QC Run quality control and calculate QC metrics] 8. [#Analyze Analyze peak regions for binding motifs] 9. Identify differentially accessible regions (for multiple-sample experiments) 10. Identify potential targets of peak regions and/or differentially accessible regions === [=#preprocess Pre-process reads] === * Check adapter contamination with fastqc. Remove adapters with cutadapt or trim_galore. {{{ # --fastqc Run FastQC in the default mode on the fastq file once trimming is complete. # --nextera Adapter sequence to be trimmed is the first 12bp of the Nextera adapter 'CTGTCTCTTATA' instead of the default auto-detection of adapter sequence. Since many ATAC-seq experiments use the Nextera DNA Library Prep Kit, one commonly needs to trim Nextera adapters. # --paired If reads are paired, both reads need to pass or they are both removed. # --length Discard trimmed reads shorter than this length. trim_galore --fastqc -nextera --paired --length 30 -o clean_adaptor_folder read1.fq [read2.fq] }}} Paired-end (PE) reads are better and recommended over single reads for ATAC-seq. The original [[https://www.nature.com/articles/nmeth.2688 | ATAC-Seq method]] from the Greenleaf lab recommends PE reads, stating, "We found that ATAC-seq paired-end reads produced detailed information about nucleosome packing and positioning." === [=#preprocess Map reads, and post-alignment filtering] === * Map reads with bowtie2 (or another unspliced mapping tool). * If mapping paired-end reads, keep only concordant pairs. {{{ bowtie2 --very-sensitive --no-discordant -p 2 -X 2000 -x /nfs/genomes/human_hg38_dec13_no_random/bowtie/hg38 -1 read1.fq -2 read2.fq | samtools view -ub - | samtools sort - >| bowtie_out.bam # using the following settings: # --very-sensitive # --no-discordant suppress discordant alignments for paired reads # -X/--maxins maximum fragment length (default=500). Increase to 2000 to get a better nucleosome distribution. }}} * Remove reads with low quality score: MAPQ < 30 {{{ # Use alignmentSieve from [[https://deeptools.readthedocs.io/en/develop/|DeepTools]] # "--samFlagInclude 2" => keep only properly paired reads alignmentSieve -b Mapped_reads.bam --minMappingQuality 30 --samFlagInclude 2 -o Mapped_reads.MAPQ30.bam # Use samtools (for single-end reads) samtools view -b -q 30 Mapped_reads.bam > Mapped_reads.MAPQ30.bam }}} * Remove duplicates with Picard's 'MarkDuplicates' or 'samtools rmdup'. If your sequence depth is exceptionally high, duplicated reads might actually be part of the normal signals, meaning they may not necessarily need to be removed. {{{ java -jar /usr/local/share/picard-tools/picard.jar MarkDuplicates I=Mapped_reads.bam O=Mapped_reads.noDups.bam M=foo.marked_dup_metrics.txt REMOVE_DUPLICATES=true }}} * Check deduplication level with [[http://barcwiki.wi.mit.edu/wiki/SOPs/qc_shortReads | 'fastqc']]. * Remove reads mapped to mitochondria. {{{ samtools view -h Mapped_reads.bam | grep -v chrM | samtools view -b -h -f 0x2 - | samtools sort - > Mapped_reads.sorted.bam }}} * Index the bam file. {{{ samtools index Mapped_reads.sorted.bam }}} === [=#postAlignQC Post-alignment quality control and calculate QC metrics] === * Calculate the fragment size distribution using BaRC code (for paired-end reads) (/nfs/BaRC_Public/BaRC_code/R/calculate_ATACseq_fragment_size_distribution/calculate_ATACseq_fragment_size_distribution.R) {{{ # USAGE: calculate_ATACseq_fragment_size_distribution.R input.bam output.pdf title_of_the_figure calculate_ATACseq_fragment_size_distribution.R sample.bam sample_fragment_sizes.pdf Sample }}} Run these commands in R to createan image with the fragment size distribution using the function fragSizeDist from the [https://www.bioconductor.org/packages/release/bioc/html/ATACseqQC.html ATACseqQC] R package: {{{ library("ATACseqQC") pdf("My_sample.fragment_sizes.pdf", w=11, h=8.5) fragSizeDist("Mapped_reads.bam", "My_sample") dev.off() }}} See [[https://www.nature.com/articles/nmeth.2688/figures/2 | Fig 2]] from Buenrostro et al. for the ideal distribution of fragment sizes. * Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py) {{{ # USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM # The alignment file (.bam) and tss (.bed) file should have the same chromosomes(contigs). ./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt }}} === [=#call_peaks Call peaks] === MACS v2 is applicable for ATAC-Seq using the appropriate options/parameters. * Run macs2 using pair-end bed as input and the options "--shift -75 --extsize 150". With those settings you will be creating a profile of reads around the cutting sites (one at each end of the fragment/paired read) that will result on peaks centered around the cutting sites (open chromatin). This is an important difference with ChIP-seq analysis. On ChIP-seq the binding event tends to be in the middle of the fragment; on ATAC-seq chromatin was opened where the cutting occurred and that is the end of the fragment. [[https://twitter.com/XiChenUoM/status/1336658454866325506|cutting/insertion sites enrichment in ATAC-seq]]. {{{ # convert bam to bed bedtools bamtobed -i foo.bam > foo_pe.bed # shift reads. Reads should be shifted + 4 bp and − 5 bp for positive and negative strand respectively, to account for the 9-bp duplication created by DNA repair of the nick by Tn5 transposase cat foo.pe.bed | awk -F $'\t' 'BEGIN {OFS = FS}{ if ($6 == "+") {$2 = $2 + 4} else if ($6 == "-") {$3 = $3 - 5} print $0}' >| foo_tn5_pe.bed # call peaks. # --keep-dup all: since duplicates have been removed in previous step macs2 callpeak -t foo_tn5_pe.bed -n foo -f BED -g mm -q 0.01 --nomodel --shift -75 --extsize 150 --call-summits --keep-dup all }}} * If you are working with the nucleosome free region (NFR) ( detail information can be found in the bottom of the page), macs2 BAMPE option also works. When using'BAMPE' option with paired-end reads, we let MACS run the pileup and calculate 'extsize'. * BAMPE format: there is no special format for BAMPE - MACS will treat PE reads as coming from the same fragment, from the manual, * "If the BAM file is generated for paired-end data, MACS will only keep the left mate(5' end) tag. However, when format BAMPE is specified, MACS will use the real fragments inferred from alignment results for reads pileup." {{{ macs2 callpeak -f BAMPE -t NFR.bam --keep-dup all -B -q 0.01 -g mm -n MACS_ATACSeq_Peaks }}} * We strongly suggest to use pair-end reads. But if you have single-end reads, you can run macs2 using --nomodel --shift s --extsize 2s. {{{ macs2 callpeak -t $foo.bam -f BAM -B -q 0.01 -g hs -n foo --nomodel --shift -75 --extsize 150 }}} ==== Other Tools or Options ==== * [[https://github.com/LiuLabUB/HMMRATAC | HMMRATAC]] * [[https://github.com/jsh58/Genrich | Genrich]] === Subset peaks into reproducible peaks === * If you have biological replicates, you can identify reproducible peaks with the 'idr' function, based on the method of [[https://projecteuclid.org/journals/annals-of-applied-statistics/volume-5/issue-3/Measuring-reproducibility-of-high-throughput-experiments/10.1214/11-AOAS466.full | Qunhua Li et al]] and using [[https://github.com/nboley/idr | code from Nathan Boley]]. {{{ idr --samples rep1.narrowPeak rep2.narrowPeak --input-file-type narrowPeak --output-file IDR.narrowPeak.bed --plot }}} * The output file contains the '''scaled IDR value''' (min(int(-125*log2(IDR), 1000)) in the 5th field. If one wants to choose 0.05 as the IDR threshold to identify "reproducible" peaks, then this metric must be at least 540. {{{ awk '$5 >= 540 {print $0}' IDR.narrowPeak.bed > IDR.narrowPeak.filtered.bed }}} * To convert the 'scaled IDR value' into the actual IDR (somewhat analogous to the FDR), the formula is 2^scaled_IDR_value/(-125)^ , although scaled IDR values of 1000 (the max value) have an IDR < 0.004. === [=#AfterPeakQC Quality control after calling peaks] === * Assessing Peak Calls by calculating the FRiP score (with /nfs/BaRC_Public/BaRC_code/Python/calculate_FRiP_score/calculate_FRiP_score.py). The FRiP (Fraction of reads in peaks) score describes the fraction of all mapped reads that fall into the called peak regions. The higher the score, the better, preferably over 0.3, according to [https://www.encodeproject.org/atac-seq/#standards ENCODE]. {{{ # Using a 'narrowPeak' file listing MACS2 peaks ./calculate_FRiP_score.py Sample.bam Sample_peaks.narrowPeak > Sample_FRiP_score.txt # Using a BED file of peaks ./calculate_FRiP_score.py Sample.bam Sample_peaks.bed > Sample_FRiP_score.txt }}} * The [https://www.encodeproject.org/atac-seq/#standards ENCODE project] has [[https://www.encodeproject.org/atac-seq/#standards|recommendations]] on TSS enrichment scores (ENCODE uses one tss to represent a gene, not one tss per transcript. TSS score is much smaller with one tss per transcript), fragment size distribution. You can also download ENCODE pipeline and analyze your samples with the pipeline. Its QC output html file includes quality controls results and their interpretation. It also estimates the library complexity based the uniqueness of reads. Here is a [[http://barc.wi.mit.edu/education/hot_topics/ChIPseq_ATACseq_2021/qc.html | sample QC report]]. * [[https://www.sciencedirect.com/science/article/pii/S240547122030079X|ataqv]] summarizes QC results into an interactive html page, which also allows you to view multiple samples together. * First, run ataqv on each bam file to generate JSON files. Here is a sample command for a bulk ATAC_seq sample: {{{ # --peak-file: peak file in bed format # --tss-file: tss in bed format # --excluded-region-file A bed file containing excluded regions, in this case, we exclude the regions in the ENCODE blast list # --metrics-file: output in json format # --ignore-read-groups: Even if read groups are present in the BAM file, ignore them and combine metrics for all reads under a single sample and library named with the --name option. This also implies that a single peak file will be used for all reads. # The duplicated reads need to be labeled by Picard MarkDuplicates ( REMOVE_DUPLICATES=FALSE) before running the ataqv # ataqc supports human/mouse/rat/fly/worm/yeast currently. If your genome is not listed, you can still run it by adding --autosomal-reference-file and --mitochondrial-reference-name. # --autosomal-reference-file: a file containing autosomal reference names, one per line # --mitochondrial-reference-name: name for the mitochondrial DNA ataqv --peak-file sample1_peak.bed --name sample1 --metrics-file sample1.ataqv.json.gz --excluded-region-file /nfs/genomes/human_hg38_dec13/anno/hg38-blacklist.v2.bed.gz --tss-file hg38.tss.refseq.bed.gz --ignore-read-groups human sample1.bam > sample1.ataqv.out }}} Next, run mkarv on the JSON files to generate the interactive web viewer. By default, SRR891268 will be used as the reference sample in the viewer. You can specify a different reference when you built your viewer instance, refer to the information on how to configure the reference with mkarv -h. {{{ # This will create a folder named as sample1, whose index.html contains the interactive plots. mkarv sample1 sample1.ataqv.json.gz # If you have multiple samples, you can combine them into a single report. In this case, both sample1 and sample2 will be in the same plots inside folder called "all_samples" mkarv all_samples sample1.ataqv.json.gz sample2.ataqv.json.gz }}} === [=#Blacklist Blacklist filtering for peaks ] === * For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions. The blacklist, [https://www.nature.com/articles/s41598-019-45839-z popularized by ENCODE], is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from [https://github.com/Boyle-Lab/Blacklist/]. We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists {{{ bedtools intersect -v -a ${PEAK} -b ${BLACKLIST} | gzip -nc > ${FILTERED_PEAK} }}} === [=#Analyze Analyze peak regions for binding motifs] === Search motifs with [[http://homer.ucsd.edu/homer/ngs/peakMotifs.html | homer findMotifsGenome.pl]] By default, it performs de novo motif discovery as well as check the enrichment of known motifs (By default, known.motifs in the downloaded homer folder is used). Note: findMotifsGenome.pl calls tab2fasta.pl, which sharing the same name as one of our BaRC script. Make sure that you calls the script from homer. {{{ # add homer bin folder to your path, so the scripts called by findMotifsGenome.pl could be identified PATH=/nfs/BaRC_Public/homer/bin:$PATH findMotifsGenome.pl peak.bed hg38 out_dir -size 300 -S 2 -p 5 -cache 100 -fdr 5 -mask -mknown Jaspar_hs_core_homer.motifs -mcheck Jaspar_hs_core_homer.motifs input parameters: -mask: use the repeat-masked sequence -size: (default 200). Explanation from homer website: "If analyzing ChIP-Seq peaks from a transcription factor, Chuck would recommend 50 bp for establishing the primary motif bound by a given transcription factor and 200 bp for finding both primary and "co-enriched" motifs for a transcription factor. When looking at histone marked regions, 500-1000 bp is probably a good idea (i.e. H3K4me or H3/H4 acetylated regions). -mknown (known motifs to check for enrichment. -mcheck (known motifs to check against de novo motifs, -S: Number of motifs to find (default 25) -p Number of processors to use (default 1) }}} To download species specific Jaspar motifs, convert to homer motif format, and save to motif file. In this example, download human core motifs from JASPAR2016, and saved to Jaspar_hs_core_homer.motifs {{{ library(TFBSTools) opts <- list() opts["collection"] <- "CORE" opts["species"] = 9606 Jaspar_hs_core <- getMatrixSet(JASPAR2016::JASPAR2016, opts) # convert to homer motif format: library(universalmotif) write_homer (Jaspar_hs_core, file="Jaspar_hs_core_homer.motifs") }}} The findMotifsGenome.pl creates two html files, one for de novo identified motifs, the other is known motifs. Annotated motifs with homer [[http://homer.ucsd.edu/homer/ngs/quantification.html | annotatePeaks.pl]] {{{ annotatePeaks.pl peak.bed hg38 -m input_motifs -mbed motif.bed > annotated_motifs.txt Where: -m: motifs can be combined first and save as a file. In the output file, this will link motifs associated with a peak together. -mbed (Output motif positions to a BED file to load at genome browser) }}} For each peak, it gives the distance to nearest feature, categorized them into promoter, intergenic, intron#, exon#, TSS) === Other recommendations for ATAC-Seq === //Based on presentation by H. Liu at BioC 2020 presentation // * Post-alignment filtering: remove mito/ChrM reads, remove duplicates, remove mapping artifacts: < 38bp and fragments > 2kb, and discordant reads * Unless you have high coverage, use all reads for downstream analysis and not remove mono-, di-, tri-, etc. nucleosome fragments or reads. Otherwise, you may miss many open chromatin regions during peak calling. * The nucleosome free region (NFR) should be > 38bp to < 147 bases (one nucleosome); mono nucleosome fragments are in the range of 147-200 bp * Shorter reads lengths, e.g. 50x50 or 75x75, should be used instead of longer reads, e.g. 100x100 or longer, to ensure NFR/fragments are sequenced * ATAC-Seq depth or coverage: * calling open/accessible regions, at least ~50M reads are recommended * transcription factor footprinting, at least ~200M reads are recommended or optimal to ensure high coverage of the NFR * Tn5 produces 5’ overhangs of 9 bases long: pos. strand +4 and neg strand -5 (see shiftGAlignmentsList and shiftReads functions in ATACseqQC package) * splitGAlignmentsByCut (in ATACseqQC package): creates different bins of reads, e.g. NFR, mono, di, etc. Shifted reads that do not fit into any of the bins should be discarded. * Use housekeeping genes to check QC: signal enrichment is expected in the regulatory regions of housekeeping genes in good ATAC-seq experiments. Use IGVSnapshot function with geneNames param. === Further reading === * [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1929-3 | From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis]]