| 1 | | == Get reproducible peaks from multiple ChIP-seq replicates with IDR (after using macs for peak calling) == |
| 2 | | |
| 3 | | For more information about the method, see the main [[https://sites.google.com/site/anshulkundaje/projects/idr| IDR page]]. |
| 4 | | |
| 5 | | === Run macs2 (macs version 2, not version 1) to call ChIP-seq peaks. === |
| 6 | | |
| 7 | | The IDR analysis requires that you call lots of peaks, including all "good" ones (signal) and some bad ones (noise). |
| 8 | | |
| 9 | | Adjust the fragment length with --nomodel --shiftsize if NSC/RSC plots are not good. |
| 10 | | |
| 11 | | {{{ |
| 12 | | # -B => create bedgraph output files |
| 13 | | # -p 1e -3 => Set p-value cutoff to 1e-3 (which is more relaxed than the default setting) |
| 14 | | |
| 15 | | bsub macs2 callpeak -t IP_1.bam -c control_1.bam -f BAM -g hs -n IP.1_vs_control.1 -B -p 1e-3 |
| 16 | | bsub macs2 callpeak -t IP_2.bam -c control_2.bam -f BAM -g hs -n IP.2_vs_control.2 -B -p 1e-3 |
| 17 | | }}} |
| 18 | | |
| 19 | | === Sort peaks in .narrowPeak files (created by macs2) === |
| 20 | | |
| 21 | | Sort from best to worst using the -log10(p-value) column (column 8), and only keep the top 100k peaks (at most). |
| 22 | | |
| 23 | | {{{ |
| 24 | | bsub "sort -k 8,8nr IP.1_vs_control.1_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.1_vs_control.1.regionPeak.gz" |
| 25 | | bsub "sort -k 8,8nr IP.2_vs_control.2_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.2_vs_control.2.regionPeak.gz" |
| 26 | | }}} |
| 27 | | |
| 28 | | === Estimate Irreproducibility Discovery Rate (IDR) between replicates === |
| 29 | | |
| 30 | | The interface of this command has been slightly modified by BaRC, but the main method remains exactly the same. |
| 31 | | |
| 32 | | The command requires names and lengths of all chromosomes in a file (ex: chromInfo) of format chromosome<tab>size. |
| 33 | | |
| 34 | | {{{ |
| 35 | | # USAGE: batch-consistency-analysis.r [peakfile1] [peakfile2] [peak.half.width] [outfile.prefix] [min.overlap.ratio] [is.broadpeak] [ranking.measure]\ |
| 36 | | # |
| 37 | | # peak.half.width => set the peak.half.width to the reported peak width in the peak files (-1 (minus one) is recommended) |
| 38 | | # outfile.prefix => the one-word name of your analysis |
| 39 | | # min.overlap.ratio => fractional bp overlap (0 to 1) between peaks in replicates to be considered as overlapping peaks. (0 is recommended) |
| 40 | | # is.broadpeak => Is the peak file format broadPeak (or else is narrowPeak)? (F for false) The IDR method is not recommended for broad peaks. |
| 41 | | # ranking.measure => p.value is recommended |
| 42 | | |
| 43 | | batch-consistency-analysis.r IP.1_vs_control.1.regionPeak.gz IP.2_vs_control.2.regionPeak.gz -1 rep1_vs_rep2_IDR 0 F p.value chromInfo.txt |
| 44 | | }}} |
| 45 | | |
| 46 | | === Create the IDR plots === |
| 47 | | {{{ |
| 48 | | # USAGE: batch-consistency-plot.r [npairs] [output.prefix] [input.file.prefix1] [input.file.prefix2] [input.file.prefix3] .... |
| 49 | | # This method can plot 1 or more pairs of replicates |
| 50 | | |
| 51 | | batch-consistency-plot.r 1 rep1_vs_rep2_IDR_plot rep1_vs_rep2_IDR |
| 52 | | }}} |
| 53 | | |
| 54 | | === Generate a conservative and an optimal final set of peak calls === |
| 55 | | {{{ |
| 56 | | # Use an IDR cutoff of 0.01 to 0.05, depending on the number of pre-IDR peaks and size of the genome: |
| 57 | | # IDR <= 0.05 for < 100K pre-IDR peaks for large genomes (human/mouse) |
| 58 | | # IDR <= 0.01 or 0.02 for ~15K to 40K peaks in smaller genomes such as worm |
| 59 | | |
| 60 | | awk '{ if($NF < 0.05) print $0 }' rep1_vs_rep2_IDR-overlapped-peaks.txt > rep1_vs_rep2_conserved_peaks_by_IDR.txt |
| 61 | | }}} |
| | 1 | None at this time |