Version 28 (modified by 4 years ago) ( diff ) | ,
---|
Using ATAC-Seq to identify open chromatin
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a method to identify regions of accessible ("open") chromatin that are presumed to be available for binding by transcription factors and other DNA-binding proteins, which in turn can regulate genome function. It can help answer experimental questions previously addressed by techniques like MNase-seq, DNase-Seq, and FAIRE-Seq.
Basic Approach
- Pre-process reads (remove adapters and other "contamination")
- Map reads to the genome (with an unspliced mapping tool)
- Run quality control and calculate QC metrics
- Call ATAC-seq "peaks" with a high coverage of mapped reads
- Analyze peak regions for binding motifs
- Identify differentially accessible regions (for multiple-sample experiments)
- Identify potential targets of peak regions and/or differentially accessible regions
Pre-process reads
- Check adapter contamination with fastqc. Remove adapters with cutadapt or trim_galore:
# --fastqc Run FastQC in the default mode on the FastQ file once trimming is complete. # --nextera Adapter sequence to be trimmed is the first 12bp of the Nextera adapter 'CTGTCTCTTATA' instead of the default auto-detection of adapter sequence. trim_galore --fastqc -nextera --paired --length 30 -o clean_adaptor_folder read1.fq read2.fq
Paired-end (PE) reads are better and recommended over single reads for ATAC-seq. The original ATAC-Seq method from the Greenleaf lab recommends PE reads, stating, "We found that ATAC-seq paired-end reads produced detailed information about nucleosome packing and positioning."
Map reads
- Map reads with bowtie2 (or another unspliced mapping tool).
- If mapping paired-end reads, keep only concordant pairs.
bowtie2 --very-sensitive --no-discordant -p 2 -X 2000 -x /nfs/genomes/human_hg38_dec13_no_random/bowtie/hg38 -1 read1.fq -2 read2.fq | samtools view -ub - | samtools sort - >| bowtie_out.bam # using the following settings: # --very-sensitive # --no-discordant suppress discordant alignments for paired reads # -X/--maxins <int> maximum fragment length (default=500). Increase to 2000 to get a better nucleosome distribution.
- Remove duplicates with Picard's 'MarkDuplicates' or 'samtools rmdup'.
- Check deduplication level with 'fastqc'.
- Remove reads mapped to mitochondria.
samtools view -h file.bam | grep -v chrM | samtools view -b -h -f 0x2 - | samtools sort - > file.sorted.bam
- For samples from human, mouse, fly, or C. elegans, one can prevent some probable false-positive peaks by removing reads that overlap "blacklisted" regions. The blacklist, popularized by ENCODE, is a a comprehensive set of genomic regions that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The blacklist regions can be downloaded from https://github.com/Boyle-Lab/Blacklist/. We have them on Whitehead servers at /nfs/BaRC_datasets/ENCODE_blacklist/Blacklist/lists
- Reads overlapping the blacklist can be filtered using alignmentSieve (from the deepTools package) or 'intersectBed -v' (from the bedtools suite):
alignmentSieve -b Reads.bam --blackListFileName hg38-blacklist.bed -o Reads.no_blackList.bam intersectBed -v -a Reads.bam -b hg38-blacklist.bed > Reads.no_blackList.bam
- An alternative for handling blacklist regions is to keep the mapped reads as is but (further downstream) remove peaks overlapping blacklist regions. Between-sample normalization will typically differ whether one filters these regions in reads or in peaks.
Run quality control and calculate QC metrics
Calculate the fragment size distribution with the ATACseqQC R package:
library("ATACseqQC") pdf("My_sample.fragment_sizes.pdf", sep="_"), w=11, h=8.5) fragSizeDist("Mapped_reads.bam", "My_sample") dev.off()
See Fig 2 from Buenrostro et al. for the ideal distribution of fragment sizes.
Calculate the TSS enrichment score (the degree to which transcription start sites show enrichment for ATAC-seq reads) using BaRC code (/nfs/BaRC_Public/BaRC_code/Python/calculate_TSS_enrichment_score/calculate_TSS_enrichment_score.py)
# USAGE: calculate_TSS_enrichment_score.py --outdir OUTDIR --outprefix OUTPREFIX --fastq1 FASTQ1 --tss TSS_BED --chromsizes CHROMSIZES --bam BAM ./calculate_TSS_enrichment_score.py --outdir OUT_QC_1 --outprefix Sample_A --fastq1 ATACseq_reads.fq.gz --tss TSS.hg38.bed --chromsizes chromInfo.hg38.txt --bam ATACseq_mapped_reads.bam >| Sample_A.TSS_enrichment_score.txt
The ENCODE project has some recommended TSS enrichment scores, depending on the genome.
Call peaks
MACS v2 works well for this.
If using the 'BAMPE' option with paired-end reads, let MACS run the pileup and calculate 'extsize'; ask MACS to keep only 1 read if duplicates are present.
macs2 callpeak -f BAMPE -t file.sorted.bam --broad --keep-dup 1 -B -q 0.01 -g mm -n MACS_ATACSeq_Peaks #note: if duplicates were removed, use --keep-dup all; if not, use --keep-dup 1. Removing duplicates is recommended. #BAMPE format: there is no special format for BAMPE - MACS will treat PE reads as coming from the same fragment, from the manual, "If the BAM file is generated for paired-end data, MACS will only keep the left mate(5' end) tag. However, when format BAMPE is specified, MACS will use the real fragments inferred from alignment results for reads pileup."
Additional notes on calling peaks using MACS: using MACS default options without BAMPE may not call the correct peaks. Alternative options/parameters in calling peaks using MACS, if BAMPE is not used are, --nomodel --shift s --extsize 2s, this can be used for single-end reads as well, e.g. --shift 100 --extsize 200 is suitable for ATAC-Seq.
- MACS' author, T.Liu, recommends using -f BAMPE if PE reads are used https://github.com/taoliu/MACS/issues/331, using BAMPE option asks MACS to pileup and calculate the extension size - works for finding accessible regions within cut sites. The additional parameters can also be used to look only at the exact cut sites by Tn5 instead of the open/accessible regions https://github.com/taoliu/MACS/issues/145, if so, -f BAMPE may not be suitable.
Genrich is another piece of software for peak-calling. It has the advantages of (a) running all of the post-alignment steps through peak-calling with one command, and (b) can process multiple replicates. Detailed information can be found in Harvard ATAC-seq Guidelines
# Start by sorting mapped reads by read name samtools sort -n bowtie/ATAC.bam > ATAC_sortedByName.bam # Run Genrich Genrich -e chrM -j -r -m 30 -v -t ATAC_sortedByName.bam -o peaks/ATAC_peak -f peaks/ATAC_log where -j ATAC-seq mode (must be specified) intervals are interpreted that are centered on transposase cut sites (the ends of each DNA fragment). Only properly paired alignment are analyzed by default -r Remove PCR duplicates -m <int> Minimum MAPQ to keep an alignment (def. 0) -d <int> Expand cut sites to the given length (default 100bp) -e <arg> Chromosomes (reference sequences) to exclude. Can be a comma-separated list, e.g. -e chrM,chrY. -E <file> Input BED file(s) of genomic regions to exclude, such as 'N' homopolymers or high mappability regions -t <file> Input SAM/BAM file(s) for experimental sample(s) -o <file> Output peak file (in ENCODE narrowPeak format) -f <LOG> Those who wish to explore the results of varying the peak-calling parameters (-q/-p, -a, -l, -g) should consider having Genrich produce a log file when it parses the SAM/BAM files (for example, with -f <LOG> added to the above command). Then, Genrich can call peaks directly from the log file with the -P option: Genrich -P -f <LOG> -o <OUT2> -p 0.01 -a 200 -v
HMMRATAC is another piece of software for ATAC-seq peak-calling.
Calculate the FRiP score (with /nfs/BaRC_Public/BaRC_code/Python/calculate_FRiP_score/calculate_FRiP_score.py). The FRiP (Fraction of reads in peaks) score describes the fraction of all mapped reads that fall into the called peak regions. The higher the score, the better, preferably over 0.3, according to ENCODE.
# Using a 'narrowPeak' file listing MACS2 peaks ./calculate_FRiP_score.py SampleA.bam SampleA_peaks.narrowPeak # Using a BED file of peaks ./calculate_FRiP_score.py SampleA.bam SampleA_peaks.bed
Analyze peak regions for binding motifs
Search motifs with homer findMotifsGenome.pl
By default, it performs de novo motif discovery as well as check the enrichment of known motifs (By default, known.motifs in the downloaded homer folder is used).
Note: findMotifsGenome.pl calls tab2fasta.pl, which sharing the same name as one of our BaRC script. Make sure that you calls the script from homer.
# add homer bin folder to your path, so the scripts called by findMotifsGenome.pl could be identified PATH=/nfs/BaRC_Public/homer/bin:$PATH findMotifsGenome.pl peak.bed hg38 out_dir -size 300 -S 2 -p 5 -cache 100 -fdr 5 -mask -mknown Jaspar_hs_core_homer.motifs -mcheck Jaspar_hs_core_homer.motifs input parameters: -mask: use the repeat-masked sequence -size: (default 200). Explanation from homer website: "If analyzing ChIP-Seq peaks from a transcription factor, Chuck would recommend 50 bp for establishing the primary motif bound by a given transcription factor and 200 bp for finding both primary and "co-enriched" motifs for a transcription factor. When looking at histone marked regions, 500-1000 bp is probably a good idea (i.e. H3K4me or H3/H4 acetylated regions). -mknown <motif file> (known motifs to check for enrichment. -mcheck <motif file> (known motifs to check against de novo motifs, -S: Number of motifs to find (default 25) -p Number of processors to use (default 1)
To download species specific Jaspar motifs, convert to homer motif format, and save to motif file.
In this example, download human core motifs from JASPAR2016, and saved to Jaspar_hs_core_homer.motifs
library(TFBSTools) opts <- list() opts["collection"] <- "CORE" opts["species"] = 9606 Jaspar_hs_core <- getMatrixSet(JASPAR2016::JASPAR2016, opts) # convert to homer motif format: library(universalmotif) write_homer (Jaspar_hs_core, file="Jaspar_hs_core_homer.motifs")
The findMotifsGenome.pl creates two html files, one for de novo identified motifs, the other is known motifs.
Annotated motifs with homer annotatePeaks.pl
annotatePeaks.pl peak.bed hg38 -m input_motifs -mbed motif.bed > annotated_motifs.txt Where: -m: motifs can be combined first and save as a file. In the output file, this will link motifs associated with a peak together. -mbed <filename> (Output motif positions to a BED file to load at genome browser)
For each peak, it gives the distance to nearest feature, categorized them into promoter, intergenic, intron#, exon#, TSS)
Other recommendations for ATAC-Seq
Based on presentation by H. Liu at BioC 2020 presentation
- Post-alignment filtering: remove mito/ChrM reads, remove duplicates, remove mapping artifacts: < 38bp and fragments > 2kb, and discordant reads
- Unless you have high coverage, use all reads for downstream analysis and not remove mono-, di-, tri-, etc. nucleosome fragments or reads. Otherwise, you may miss many open chromatin regions during peak calling.
- The nucleosome free region (NFR) should be > 38bp to < 147 bases (one nucleosome); mono nucleosome fragments are in the range of 147-200 bp
- For calling open/accessible regions, at least ~50M reads are recommended.
- For transcription factor footprinting, at least ~200M reads are recommended.
- Tn5 produces 5’ overhangs of 9 bases long: pos. strand + 4 and neg strand -5 (shiftGAlignmentsList and shiftReads function) splitGcAlignmentsByCut: creates different bins of reads: NFR, mono, di, etc. Shifted reads that do not fit into any of the bins should be discarded.
- Use housekeeping genes to check QC: signal enrichment is expected in the regulatory regions of housekeeping genes in good ATAC-seq experiments. Use IGVSnapshot function with geneNames param. splitGAlignmentsByCut: creates different bins of reads: NFR, mono, di, etc. Shifted reads that do not fit into any of the bins should be discarded.
Further reading