Changes between Version 18 and Version 19 of SOPs/InProgress


Ignore:
Timestamp:
05/14/14 09:59:24 (11 years ago)
Author:
gbell
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/InProgress

    v18 v19  
    1 == Get reproducible peaks from multiple chip-seq replicates with IDR if macs is used for peak calls. https://sites.google.com/site/anshulkundaje/projects/idr ==
     1== Get reproducible peaks from multiple chip-seq replicates with IDR if macs is used for peak calls. ==
    22
     3For more information about the method, see the main [[https://sites.google.com/site/anshulkundaje/projects/idr IDR page]]
    34
    4 === Use macs2 to get peaks (not macs14): You can adjust fragment length with --nomodel --shiftsize if NSC/RSC plots are not good ===
     5=== Run macs2 (macs version 2, not version 1) to call ChIP-seq peaks.  Adjust fragment length with --nomodel --shiftsize if NSC/RSC plots are not good ===
     6
     7The IDR analysis requires that you call lots of peaks, including all "good" ones and some that are just noise.
     8
    59{{{
    6 # -B create bedgraph
    7 # -p 1e -3 : p values as 1e -3
    8 bsub macs2 callpeak -t IP_1.bam -c control_1.bam  -f BAM -g hs -n IP.1_vs_control.1 -B  -p 1e-3
    9 bsub macs2 callpeak -t IP_2.bam -c control_2.bam  -f BAM -g hs -n IP.2_vs_control.2 -B  -p 1e-3
     10# -B => create bedgraph output files
     11# -p 1e -3 => Set p-value cutoff to 1e-3 (which is more relaxed than the default setting)
     12
     13bsub macs2 callpeak -t IP_1.bam -c control_1.bam  -f BAM -g hs -n IP.1_vs_control.1 -B -p 1e-3
     14bsub macs2 callpeak -t IP_2.bam -c control_2.bam  -f BAM -g hs -n IP.2_vs_control.2 -B -p 1e-3
    1015}}}
    1116
    12 === Sort .narrowPeak files ( macs2 output) from best to worst using the -log10(pvalue) column i.e. column 8, and only keep the top 100k peaks ===
     17=== Sort peaks in .narrowPeak files (created by macs2) from best to worst using the -log10(p-value) column (column 8), and only keep the top 100k peaks (at most) ===
     18
    1319{{{
    14 bsub "sort -k 8nr,8nr IP.1_vs_control.1_peaks.narrowPeak |head -n 100000|gzip -c >| IP.1_vs_control.1.regionPeak.gz"
    15 bsub "sort -k 8nr,8nr IP.2_vs_control.2_peaks.narrowPeak |head -n 100000|gzip -c >| IP.2_vs_control.2.regionPeak.gz"
     20bsub "sort -k 8,8nr IP.1_vs_control.1_peaks.narrowPeak |head -n 100000 | gzip -c >| IP.1_vs_control.1.regionPeak.gz"
     21bsub "sort -k 8,8nr IP.2_vs_control.2_peaks.narrowPeak |head -n 100000 | gzip -c >| IP.2_vs_control.2.regionPeak.gz"
    1622}}}
    1723
    18 === Estimate Irreproducibility Discovery Rate (IDR) between replicates: ===
     24=== Estimate Irreproducibility Discovery Rate (IDR) between replicates ===
     25
     26The interface of this command has been slightly modified by BaRC, but the main method remains exactly the same.
     27
     28The command requires names and lengths of all chromosomes in a file (ex: chromInfo) of format chromosome<tab>size.
     29
    1930{{{
    20 # chromInfo.txt format: chromosome<tab>size
    21 # -1 (one) not l, 0 F p.value are fixed. Use p.value as the ranking measure.
    22 # rep1_vs_rep2_IDR is used for output file names
    23 Rscript batch-consistency-analysis.r IP.1_vs_control.1.regionPeak.gz IP.2_vs_control.2.regionPeak.gz -1 rep1_vs_rep2_IDR 0 F p.value chromInfo.txt
     31# USAGE: batch-consistency-analysis.r [peakfile1] [peakfile2] [peak.half.width] [outfile.prefix] [min.overlap.ratio] [is.broadpeak] [ranking.measure]\
     32#
     33# peak.half.width => set the peak.half.width to the reported peak width in the peak files (-1 (minus one) is recommended)
     34# outfile.prefix => the one-word name of your analysis
     35# min.overlap.ratio => fractional bp overlap (0 to 1) between peaks in replicates to be considered as overlapping peaks. (0 is recommended) 
     36# is.broadpeak => Is the peak file format broadPeak (or else is narrowPeak)? (F for false)
     37# ranking.measure => p.value is recommended
     38#
     39batch-consistency-analysis.r IP.1_vs_control.1.regionPeak.gz IP.2_vs_control.2.regionPeak.gz -1 rep1_vs_rep2_IDR 0 F p.value chromInfo.txt
    2440}}}
    2541
    26 === Plot the IDR plots: ===
     42=== Create the IDR plots ===
    2743{{{
    28  # This R script can also take multiple comparisons, please refer to https://sites.google.com/site/anshulkundaje/projects/idr for detail.
    29  Rscript batch-consistency-plot.r 1 rep1_vs_rep2_IDR_plot rep1_vs_rep2_IDR
     44# USAGE: batch-consistency-plot.r [npairs] [output.prefix] [input.file.prefix1] [input.file.prefix2] [input.file.prefix3] ....
     45# This method can plot 1 or more pairs of replicates
     46batch-consistency-plot.r 1 rep1_vs_rep2_IDR_plot rep1_vs_rep2_IDR
    3047}}}
    3148
    32 === Generate a conservative and an optimal final set of peak calls: ===
     49=== Generate a conservative and an optimal final set of peak calls ===
    3350{{{
    34  #IDR cutoff is between 0.01 or 0.05 depending on the number of pre-IDR peaks and size of genomes:
    35  #IDR <=0.05 for < 100K pre-IDR peaks for large genomes (human/mouse)
    36  #IDR <= 0.01 or 0.02 for ~15K to 40K peaks in smaller genomes such as worm
     51 # Use an IDR cutoff of 0.01 to 0.05, depending on the number of pre-IDR peaks and size of the genome:
     52 # IDR <=0.05 for < 100K pre-IDR peaks for large genomes (human/mouse)
     53 # IDR <= 0.01 or 0.02 for ~15K to 40K peaks in smaller genomes such as worm
    3754
    38 awk '{ if($NF < 0.05) print $0 }'  rep1_vs_rep2_IDR-overlapped-peaks.txt > rep1_vs_rep2_conserved_peaks_by_IDR.txt
     55awk '{ if($NF < 0.05) print $0 }' rep1_vs_rep2_IDR-overlapped-peaks.txt > rep1_vs_rep2_conserved_peaks_by_IDR.txt
    3956}}}