'''Contents:''' * [#chipseq ChIP-Seq] * [#CUTnRUN CUT&RUN] == Using ChIP-Seq to identify and/or quantify peaks (or other interesting enriched regions) == === [=#chipseq Basic Approach] === * [#map Step 1: Map reads] * [#cca Step 2: Perform strand cross correlation analysis] * [#peaks Step 3: Call peaks] * [#repro Step 4: Identify reproducible peaks] (for peak-calling experiments with replication) * [#link Step 5: Link "bound" regions to genomic features] * [#compare Step 6: Compare binding across different samples] Note that quality control is important after read mapping and after peak calling. The ENCODE consortium recommends some [[https://genome.ucsc.edu/ENCODE/qualityMetrics.html|quality metrics]]. A basic analysis pipeline can be found in our [[http://barc.wi.mit.edu/education/hot_topics/ChIPseq_2018/ChIPseq_2018.commands.txt| Hot Topics workshop]]. It should be customized, based on your specific experiment, as mentioned below. === Reviews === [http://www.plosone.org/article/info:doi/10.1371/journal.pone.0011471 Evaluation of ChIP-Seq performance] Wilbanks EG, and Facciotti MT (2010; PLoS ONE) [http://nar.oxfordjournals.org/content/early/2010/11/25/nar.gkq1187.abstract?etoc A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs] (NAR Nov. 2010) [http://genome.cshlp.org/content/22/9/1813.full ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia] (Genome Research, 2012) === [=#map Step 1: Map reads] === * Use [[http://bowtie-bio.sourceforge.net/tutorial.shtml|Bowtie]], [[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml|bowtie2]], or another unspliced mapping tool. We recommend using Bowtie2 (default parameters), filtering to remove multi-mapped reads (eg. a mapping quality of 2 or higher) is not needed as Bowtie2 randomly chooses a best hit when multiple good hits exist (see [[http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#reporting | Bowtie2 manual [under Reporting] ]). Low mapping quality reads may need to be removed if using another aligner since MACS does not filter on mapping quality. * See the [[http://barcwiki.wi.mit.edu/wiki/SOPs/mapping|mapping SOP]] for more details. === [=#cca Step 2: Perform strand cross correlation analysis] === * The goal of this step is to assess the quality of the IP and to estimate the fragment size of the immunoprecipitated DNA. * For a detailed explanation on strand cross-correlation analysis see box 2 of this paper ([[http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003326|Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data ]]). {{{ run_spp.R -c=TreatmentIP.bam -savp=TreatmentIP.spp.pdf -out=TreatmentIP.run_spp.out }}} * ENCODE recommends normalized strand coefficientt (NSC) > 1.05 and relative strand correlation (RSC) > 0.08 [[https://genome.cshlp.org/content/genome/22/9/1813.full.html#boxed-text-3 | Ref]] * After this analysis a good ChIP-seq experiment will have a second peak (reflecting the fragment size) at least as tall as the first peak (a "phantom" peak reflecting read length). This is how the graph should look: ([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/figure/F4/|Fig4E]]). If the second peak is smaller than the first, ([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/figure/F4/|like the example shown in Fig4G Marginal]]), macs will not estimate fragment size correctly. In that case we recommend running macs with parameters "--nomodel" and "--shiftsize=half_of_the_fragment_size", or "--nomodel" and "--extsize=fragment_size". The fragment size is detected on the strand cross correlation analysis. === [=#peaks Step 3: Call peaks] === Peak-calling identifies presumed bound regions. This is especially relevant for transcription factors, which tend to have narrow peaks. For other ChIP-seq experiments that assay binding proteins exhibiting broad "peaks" (such as specific histone modifications), identifying "peaks" may be less helpful (a less useful way of representing the data), and some other (such as sliding window) quantification may be more relevant. Peaks can be called without the use of a control sample; however, it's highly recommended to include one (e.g. IgG or input, with input generally preferred over IgG). Some of the parameters to consider when comparing peak-calling programs are: * Adjustment of sequence tags to better represent the original DNA fragment (by shifting tags in the 3′ direction or by extending tags to the estimated length of the original fragments) * Background model used * Use of strand dependent bimodality BaRC performed a [[chipSeqAnalysisPackages|ChIP-Seq bake off]] in the summer of 2011, with these [[chipSeqBakeOff|ChIP-Seq bake off results]]. Based on our ChIP-Seq bake off and on a published review ([[http://www.plosone.org/article/info:doi/10.1371/journal.pone.0011471|Evaluation of ChIP-Seq performance]]), MACS2 is the peak-calling program we typically use. ==== MACS2 ==== * [[https://github.com/taoliu/MACS|MACS2]] is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions. For broader peaks we recommend using --nomodel,--broad, --nolambda (if there's no control), and using the fragment size calculated on the strand cross correlation analysis. We recommend using macs2 rather than macs14 for broad peaks. Identifying reproducible peaks (across replicates) with IDR requires MACS2 and works best with a relaxed p-value threshold. {{{ bsub macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -g mm -n Name -B -f BAM --keep-dup auto --nomodel --extsize "size calculated on the strand crosscorrelation analysis" bsub macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -g mm -n Name -B -f BAMPE --keep-dup 1 -p 1e-3 }}} * -B create bedgraph output files * -f Input format (typically BAM for single-end reads and BAMPE for paired-end reads). In BAMPE mode, MACS2 will only use properly-paired read alignments, see [[https://groups.google.com/forum/#!topic/macs-announcement/Vh916L3tOOE | MACS2 Forum ]], also from the readme/manual, "when format BAMPE is specified, MACS will use the real fragments inferred from alignment results for reads pileup.", i.e. --nomodel/--extsize will be ignored. * --nomodel whether or not to build the shifting model. If True, MACS will not build model. By default it means shifting size = 100. * --extsize When nomodel is true, MACS will use this value as fragment size to extend each read towards 3' end * --keep-dup How should MACS handle duplicate tags at the same location? The default is to keep just one. The 'auto' option (recommended for high-coverage samples) has MACS calculate the expected maximum tags at the same location (based on the binomal distribution). FastQC results can be used to gauge the duplication levels in a sample. * For future IDR analysis, use '-p 1e -3' => Set p-value cutoff to 1e-3 (which is more relaxed than the default setting) ==== Output format (xls/text file):==== * chr name * start pos of peak * end pos of peak * length of peak region * absolute peak summit position * pileup height at peak summit * -log10(pvalue) for the peak summit * fold enrichment * -log10(qvalue) at peak summit [[https://github.com/taoliu/MACS/wiki | MACS2 wiki]] === [=#repro Step 4: Identify reproducible peaks with the IDR method [for peak-calling experiments with replication]] === You can identify reproducible peaks with the 'idr' function, based on the method of [[https://projecteuclid.org/journals/annals-of-applied-statistics/volume-5/issue-3/Measuring-reproducibility-of-high-throughput-experiments/10.1214/11-AOAS466.full | Qunhua Li et al]] and using [[https://github.com/nboley/idr | code from Nathan Boley]]. Start by running macs2 (macs version 2, not version 1) to call ChIP-seq peaks (see step 3) The IDR analysis requires that you call lots of peaks, including more than the obvious "good" ones (signal). ==== Sort peaks in .narrowPeak files (created by macs2) ==== Sort from best to worst using the -log10(p-value) column (column 8), and only keep the top 100k peaks (at most). {{{ bsub "sort -k 8,8nr IP.1_vs_control.1_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.1_vs_control.1.regionPeak.gz" bsub "sort -k 8,8nr IP.2_vs_control.2_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.2_vs_control.2.regionPeak.gz" }}} ==== Estimate Irreproducibility Discovery Rate (IDR) between replicates ==== For full usage, try 'idr h' {{{ idr --samples My_TF_rep_1.narrowPeak My_TF_rep_2.narrowPeak --input-file-type narrowPeak --output-file My_TF.IDR.txt --plot }}} === [=#link Step 5: Link "bound" regions to genomic features] === Both MACS and SISSRs provide bed files with the set of peaks, presumably indicating bound regions. To Link "bound" (or other interesting) regions to genomic features (genes, promoters, enhancers, etc.), check out this SOP: [[genome_regions_annotations|Linking genome regions to genome annotations]] Below it is a detailed example of how to do the following annotation: A. Find the genes that overlap with your peaks A. Find the genes that have a peak within -D distance upstream of the TSS/gene A. Find the genes that have a peak -D distance downstream of the gene A. Find the genes that have a peak -D distance downstream of the TSS Note: We assume that the TSS is the 5' end of the transcript (gene), but you can also go to the UCSC Bioinformatics table browser and download the coordinates of the transcription start sites (TSS). **Step A: Get the coordinates of the desired annotation features** For analysis A, B and C those coordinates can be downloaded from the UCSC tables. {{{ i.e., on the table browser select: Mammal->Mouse group: Genes and Gene Prediction Tracks track: RefSeq Genes table: refGene region: genome output format: bed (name your file) get output }}} {{{ for A: Select Whole Gene for B: Select Upstream by D bases for C: Downstream by D bases }}} You will have now 3 annotation files named for example: {{{ mm9_REFseqgenes.bed (gene bodies) mm9_REFseqgenes1kbUp.bed (1Kb upstream of the gene) mm9_REFseqgenes1kbDownofGene.bed (1Kb downstream of the gene) }}} The gene bodies file has more columns that you need, so keep only the first 6 columns. {{{ cut -f1-6 mm9_REFseqgenes.bed > mm9_REFseqgenes_TRIM.bed }}} For annotation D, to get the coordinates of the regions from the TSS to a certain distance downstream of the TSS (beginning of the gene): 1. Take the genes files and keep the coordinate that represents the start of the gene (based on the strand), {{{ awk -F "\t" '{if ($6 == "+") {print $1"\t"$2"\t"$2+1"\t"$4"\t"$5"\t"$6} else {print $1"\t"$3-1"\t"$3"\t"$4"\t"$5"\t"$6} }' mm9_REFseqgenes.bed >| mm9_REFseqgenesTSS.bed }}} 2. Then use the tool "slopBed" to increase the range of genomic coordinates to include downstream of the TSS The tool "slopBed" needs a file with the chromosome sizes that can be created with this command {{{ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from mm9.chromInfo" > mm9.genome }}} Slop the file with the TSS coordinates to get the desired distance downstream (3') of the TSS {{{ slopBed -i mm9_REFseqgenesTSS.bed -g mm9.genome -l 0 -r 1000 -s >| mm9_REFseqgenes_1kbdownofTSS.bed }}} **Step B: Intersect your peaks and your annotation files** {{{ intersectBed -a genes.bed -b peaks.bed -wa -wb -wa returns the list from file A -wb returns the list from file B }}} {{{ intersectBed -a mm9_1KbUpDownofTSS.bed -b peaks.bed -wa -wb >| 1KbUpDownofTSS_peaks.bed }}} === [=#compare Step 6: Compare binding across different samples] === * Method 1: Run IDR to identify reproducible peaks * Method 2: Use [[https://github.com/taoliu/MACS/wiki/Call-differential-binding-events | MACS2 bdgdiff]] to call differential binding. bedGraph (.bdg) files are required which can be created using the "-B" option without "--SPMR" option. * Method 3: Intersect bound regions (peaks) * This binary comparison asks simply, "Is each peak in sample A present in sample B (and vice versa)?" * This ignores the enrichment/score of each peak and gives each non-peak a score of 0. * Method 4: Compare affinity (number of mapped reads) across bound regions * Filter out redundant reads from each BAM file with a command like 'samtools rmdup'. * Merge peaks of all samples (and make non-redundant): * mergeBed -i Peaks_all_redundant.bed > Peaks.bed * Count reads in each sample that overlaps each peak * multiBamCov -bams *.sorted.bam -bed Merged_peaks.bed > Merged_peak_counts.reads_samples.bed * Modify counts per peak to account for differing numbers of mapped reads. * Method 5: Use [[https://github.com/shenlab-sinai/ngsplot | ngsplot]] to make stacked heatmaps,and profiles, of peaks for each sample * Method 6: Use [[https://github.com/Genometric/MSPC | MSPC]] from [[https://www.ncbi.nlm.nih.gov/pubmed/25957351 | Jalili et al., 2015]]. == [=#CUTnRUN CUT&RUN] == Cleavage Under Targets and Release Using Nuclease (CUT&RUN) is an alternative method to ChIP-Seq: it improves on lower input, i.e. number of cells needed, and reduces antibody cross-reactivity. On average, sequencing depths of 3 to 5 million reads per sample should be sufficient, though depths of more than 15 million reads will increase duplication rates. For other details, see the [[https://www.cellsignal.com/learn-and-support/frequently-asked-questions/cut-and-run-faqs | Cell Signaling Technology FAQ]]. Similar to ChIP-Seq, higher coverage is needed for histone marks compared to transcription factor binding sites. CUT&RUN data analysis is similar to a typical ChIP-Seq. The following protocol is slightly modified from [[https://elifesciences.org/articles/46314| Improved CUT&RUN chromatin profiling tools]]: * Mapping: bowtie2 can be used for aligning the reads with the following parameters (for paired-end reads) * end-to-end read alignment not required: * --local * --very-sensitive-local * concordant pairs only: * --no-mixed * --no-discordant * min/max fragment length, respectively * -I 10 * -X 700 * Calling peaks: MACS2 //callpeak// is suitable to call peaks * -t treatment file * -c control file (a control is required for CUT&RUN, e.g. IgG) * -f BEDPE or BAMPE * -g hs or mm * –p 1e-5 * --keep-dup all * note: MACS2 default is 1, i.e. keep one tag; high duplication rates in CUT&RUN may be due to poor data quality, see [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1802-4 | CUT&RUNTools]] * --nomodel and --extsize can be used after strand cross-correlation [see above] * note: for paired-end reads (using BAMPE), these parameters are ignored since MACS2 will pileup and compute the 'extsize'. Called peaks should be filtered the same way as one would do with ChIP-seq peaks.