Changes between Version 18 and Version 19 of SOPs/InProgress
- Timestamp:
- 05/14/14 09:59:24 (11 years ago)
Legend:
- Unmodified
- Added
- Removed
- Modified
-
SOPs/InProgress
v18 v19 1 == Get reproducible peaks from multiple chip-seq replicates with IDR if macs is used for peak calls. https://sites.google.com/site/anshulkundaje/projects/idr==1 == Get reproducible peaks from multiple chip-seq replicates with IDR if macs is used for peak calls. == 2 2 3 For more information about the method, see the main [[https://sites.google.com/site/anshulkundaje/projects/idr IDR page]] 3 4 4 === Use macs2 to get peaks (not macs14): You can adjust fragment length with --nomodel --shiftsize if NSC/RSC plots are not good === 5 === Run macs2 (macs version 2, not version 1) to call ChIP-seq peaks. Adjust fragment length with --nomodel --shiftsize if NSC/RSC plots are not good === 6 7 The IDR analysis requires that you call lots of peaks, including all "good" ones and some that are just noise. 8 5 9 {{{ 6 # -B create bedgraph 7 # -p 1e -3 : p values as 1e -3 8 bsub macs2 callpeak -t IP_1.bam -c control_1.bam -f BAM -g hs -n IP.1_vs_control.1 -B -p 1e-3 9 bsub macs2 callpeak -t IP_2.bam -c control_2.bam -f BAM -g hs -n IP.2_vs_control.2 -B -p 1e-3 10 # -B => create bedgraph output files 11 # -p 1e -3 => Set p-value cutoff to 1e-3 (which is more relaxed than the default setting) 12 13 bsub macs2 callpeak -t IP_1.bam -c control_1.bam -f BAM -g hs -n IP.1_vs_control.1 -B -p 1e-3 14 bsub macs2 callpeak -t IP_2.bam -c control_2.bam -f BAM -g hs -n IP.2_vs_control.2 -B -p 1e-3 10 15 }}} 11 16 12 === Sort .narrowPeak files ( macs2 output) from best to worst using the -log10(pvalue) column i.e. column 8, and only keep the top 100k peaks === 17 === Sort peaks in .narrowPeak files (created by macs2) from best to worst using the -log10(p-value) column (column 8), and only keep the top 100k peaks (at most) === 18 13 19 {{{ 14 bsub "sort -k 8 nr,8nr IP.1_vs_control.1_peaks.narrowPeak |head -n 100000|gzip -c >| IP.1_vs_control.1.regionPeak.gz"15 bsub "sort -k 8 nr,8nr IP.2_vs_control.2_peaks.narrowPeak |head -n 100000|gzip -c >| IP.2_vs_control.2.regionPeak.gz"20 bsub "sort -k 8,8nr IP.1_vs_control.1_peaks.narrowPeak |head -n 100000 | gzip -c >| IP.1_vs_control.1.regionPeak.gz" 21 bsub "sort -k 8,8nr IP.2_vs_control.2_peaks.narrowPeak |head -n 100000 | gzip -c >| IP.2_vs_control.2.regionPeak.gz" 16 22 }}} 17 23 18 === Estimate Irreproducibility Discovery Rate (IDR) between replicates: === 24 === Estimate Irreproducibility Discovery Rate (IDR) between replicates === 25 26 The interface of this command has been slightly modified by BaRC, but the main method remains exactly the same. 27 28 The command requires names and lengths of all chromosomes in a file (ex: chromInfo) of format chromosome<tab>size. 29 19 30 {{{ 20 # chromInfo.txt format: chromosome<tab>size 21 # -1 (one) not l, 0 F p.value are fixed. Use p.value as the ranking measure. 22 # rep1_vs_rep2_IDR is used for output file names 23 Rscript batch-consistency-analysis.r IP.1_vs_control.1.regionPeak.gz IP.2_vs_control.2.regionPeak.gz -1 rep1_vs_rep2_IDR 0 F p.value chromInfo.txt 31 # USAGE: batch-consistency-analysis.r [peakfile1] [peakfile2] [peak.half.width] [outfile.prefix] [min.overlap.ratio] [is.broadpeak] [ranking.measure]\ 32 # 33 # peak.half.width => set the peak.half.width to the reported peak width in the peak files (-1 (minus one) is recommended) 34 # outfile.prefix => the one-word name of your analysis 35 # min.overlap.ratio => fractional bp overlap (0 to 1) between peaks in replicates to be considered as overlapping peaks. (0 is recommended) 36 # is.broadpeak => Is the peak file format broadPeak (or else is narrowPeak)? (F for false) 37 # ranking.measure => p.value is recommended 38 # 39 batch-consistency-analysis.r IP.1_vs_control.1.regionPeak.gz IP.2_vs_control.2.regionPeak.gz -1 rep1_vs_rep2_IDR 0 F p.value chromInfo.txt 24 40 }}} 25 41 26 === Plot the IDR plots:===42 === Create the IDR plots === 27 43 {{{ 28 # This R script can also take multiple comparisons, please refer to https://sites.google.com/site/anshulkundaje/projects/idr for detail. 29 Rscript batch-consistency-plot.r 1 rep1_vs_rep2_IDR_plot rep1_vs_rep2_IDR 44 # USAGE: batch-consistency-plot.r [npairs] [output.prefix] [input.file.prefix1] [input.file.prefix2] [input.file.prefix3] .... 45 # This method can plot 1 or more pairs of replicates 46 batch-consistency-plot.r 1 rep1_vs_rep2_IDR_plot rep1_vs_rep2_IDR 30 47 }}} 31 48 32 === Generate a conservative and an optimal final set of peak calls :===49 === Generate a conservative and an optimal final set of peak calls === 33 50 {{{ 34 # IDR cutoff is between 0.01 or 0.05 depending on the number of pre-IDR peaks and size of genomes:35 # IDR <=0.05 for < 100K pre-IDR peaks for large genomes (human/mouse)36 # IDR <= 0.01 or 0.02 for ~15K to 40K peaks in smaller genomes such as worm51 # Use an IDR cutoff of 0.01 to 0.05, depending on the number of pre-IDR peaks and size of the genome: 52 # IDR <=0.05 for < 100K pre-IDR peaks for large genomes (human/mouse) 53 # IDR <= 0.01 or 0.02 for ~15K to 40K peaks in smaller genomes such as worm 37 54 38 awk '{ if($NF < 0.05) print $0 }' rep1_vs_rep2_IDR-overlapped-peaks.txt >rep1_vs_rep2_conserved_peaks_by_IDR.txt55 awk '{ if($NF < 0.05) print $0 }' rep1_vs_rep2_IDR-overlapped-peaks.txt > rep1_vs_rep2_conserved_peaks_by_IDR.txt 39 56 }}}