wiki:SOPs/InProgress

Version 19 (modified by gbell, 11 years ago) ( diff )

--

Get reproducible peaks from multiple chip-seq replicates with IDR if macs is used for peak calls.

For more information about the method, see the main https://sites.google.com/site/anshulkundaje/projects/idr IDR page

Run macs2 (macs version 2, not version 1) to call ChIP-seq peaks. Adjust fragment length with --nomodel --shiftsize if NSC/RSC plots are not good

The IDR analysis requires that you call lots of peaks, including all "good" ones and some that are just noise.

# -B => create bedgraph output files
# -p 1e -3 => Set p-value cutoff to 1e-3 (which is more relaxed than the default setting)

bsub macs2 callpeak -t IP_1.bam -c control_1.bam  -f BAM -g hs -n IP.1_vs_control.1 -B -p 1e-3
bsub macs2 callpeak -t IP_2.bam -c control_2.bam  -f BAM -g hs -n IP.2_vs_control.2 -B -p 1e-3

Sort peaks in .narrowPeak files (created by macs2) from best to worst using the -log10(p-value) column (column 8), and only keep the top 100k peaks (at most)

bsub "sort -k 8,8nr IP.1_vs_control.1_peaks.narrowPeak |head -n 100000 | gzip -c >| IP.1_vs_control.1.regionPeak.gz"
bsub "sort -k 8,8nr IP.2_vs_control.2_peaks.narrowPeak |head -n 100000 | gzip -c >| IP.2_vs_control.2.regionPeak.gz"

Estimate Irreproducibility Discovery Rate (IDR) between replicates

The interface of this command has been slightly modified by BaRC, but the main method remains exactly the same.

The command requires names and lengths of all chromosomes in a file (ex: chromInfo) of format chromosome<tab>size.

# USAGE: batch-consistency-analysis.r [peakfile1] [peakfile2] [peak.half.width] [outfile.prefix] [min.overlap.ratio] [is.broadpeak] [ranking.measure]\
#
# peak.half.width => set the peak.half.width to the reported peak width in the peak files (-1 (minus one) is recommended)
# outfile.prefix => the one-word name of your analysis
# min.overlap.ratio => fractional bp overlap (0 to 1) between peaks in replicates to be considered as overlapping peaks. (0 is recommended)  
# is.broadpeak => Is the peak file format broadPeak (or else is narrowPeak)? (F for false)
# ranking.measure => p.value is recommended
#
batch-consistency-analysis.r IP.1_vs_control.1.regionPeak.gz IP.2_vs_control.2.regionPeak.gz -1 rep1_vs_rep2_IDR 0 F p.value chromInfo.txt 

Create the IDR plots

# USAGE: batch-consistency-plot.r [npairs] [output.prefix] [input.file.prefix1] [input.file.prefix2] [input.file.prefix3] ....
# This method can plot 1 or more pairs of replicates
batch-consistency-plot.r 1 rep1_vs_rep2_IDR_plot rep1_vs_rep2_IDR

Generate a conservative and an optimal final set of peak calls

 # Use an IDR cutoff of 0.01 to 0.05, depending on the number of pre-IDR peaks and size of the genome:
 # IDR <=0.05 for < 100K pre-IDR peaks for large genomes (human/mouse)
 # IDR <= 0.01 or 0.02 for ~15K to 40K peaks in smaller genomes such as worm

awk '{ if($NF < 0.05) print $0 }' rep1_vs_rep2_IDR-overlapped-peaks.txt > rep1_vs_rep2_conserved_peaks_by_IDR.txt
Note: See TracWiki for help on using the wiki.