Context Navigation

Changes between Version 9 and Version 10 of SOPs/chip_seq_peaks

Timestamp:: 05/15/14 09:35:21 (11 years ago)
Author:: gbell
Comment:: --

Legend:

: Unmodified
: Added
: Removed
: Modified

SOPs/chip_seq_peaks

-              v9
+              v10
  * See the [[http://barcwiki.wi.mit.edu/wiki/SOPs/mapping|mapping SOP]] for more details.
+=== Step 2: Strand cross correlation analysis ===
+ * The goal of this step is to asses the quality of the IP and to estimate the fragment size of the DNA immunoprecipitated.
+=== Step 2: Perform strand cross correlation analysis ===
+ * The goal of this step is to asses the quality of the IP and to estimate the fragment size of the immunoprecipitated DNA.
  * For a detailed explanation on strand cross-correlation analysis see box 2 of this paper ([[http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003326|Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data ]]).
 {{{
 …
 }}}
   * After this analysis a good ChIP-seq experiment will have a second peak (reflecting the fragment size) at least as tall as the first peak (reflecting read length). This is how the graph should look: ([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/figure/F4/|Fig4E ]]). If the second peak is smaller that the first, ([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/figure/F4/|like the example shown in Fig4G Marginal ]]),  macs will not estimate fragment size correctly. In that case we recommend running macs with parameters "--nomodel" and "--shiftsize=half_of_the_fragment_size". The fragment size is detected on the strand cross correlation analysis.
+  * After this analysis a good ChIP-seq experiment will have a second peak (reflecting the fragment size) at least as tall as the first peak (reflecting read length). This is how the graph should look: ([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/figure/F4/|Fig4E]]). If the second peak is smaller that the first, ([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/figure/F4/|like the example shown in Fig4G Marginal]]),  macs will not estimate fragment size correctly. In that case we recommend running macs with parameters "--nomodel" and "--shiftsize=half_of_the_fragment_size". The fragment size is detected on the strand cross correlation analysis.
 === Step 3: Call peaks (bound regions) ===
 …
 ==== MACS14 ====
   * For MACS to work the header of the sequences has to have no spaces.
   * macs points to macs14 on WIBR local machines
+  * For MACS to work the header of the sequences can have no spaces.
+  * The command 'macs' points to macs14 on WIBR local machines
   * MACS may have trouble with a SAM file from bowtie if it contains unmapped reads (which it generally does).  As a result, you may need to filter out unmapped reads with a command like
 {{{
 …
 ==== MACS2 ====
+  * MACS2 is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions. For broader peaks we recommend using --nomodel, --nolambda (if there's no control), and using the fragment size calculated on the strand cross correlation analysis.  We recommend using macs2 rather than macs14 for broad peaks.
+{{{
+macs2  callpeak -t IP_reads.mapped_only.bam  -c Control_reads.mapped_only.bam -f BAM -g mm  -n Name --nomodel   -B
+}}}
+  * --nomodel  whether or not to build the shifting model. If True, MACS will not build model. by default it means shifting size = 100.
+  * MACS2 is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions. For broader peaks we recommend using --nomodel, --nolambda (if there's no control), and using the fragment size calculated on the strand cross correlation analysis.  We recommend using macs2 rather than macs14 for broad peaks.  Identifying reproducible peaks (across replicates) with IDR requires MACS2 and works best with a relaxed p-value threshold.
+{{{
+bsub macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -f BAM -g mm -n Name --nomodel -B
+bsub macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -f BAM -g mm -n Name -B -p 1e-3
+}}}
+  * --nomodel  whether or not to build the shifting model. If True, MACS will not build model. By default it means shifting size = 100.
   * -f Input format
   * -B create bedgraph output files
+  * For future IDR analysis, use '-p 1e -3' => Set p-value cutoff to 1e-3 (which is more relaxed than the default setting)
 ==== SISSRs ====
 [[http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/|SiSSRs]] ([[http://nar.oxfordjournals.org/cgi/content/full/36/16/5221|Reference]]) ([[http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/SISSRs-Manual.pdf|Manual]])
 …
+=== Step 4: Linking bound regions to genes ===
+=== Step 4 [for experiments with replication]: Identify reproducible peaks ===
+For more information about the method, see the main [[https://sites.google.com/site/anshulkundaje/projects/idr| IDR page]].
+Run macs2 (macs version 2, not version 1) to call ChIP-seq peaks (see step 3)
+The IDR analysis requires that you call lots of peaks, including all "good" ones (signal) and some bad ones (noise).
+==== Sort peaks in .narrowPeak files (created by macs2) ====
+Sort from best to worst using the -log10(p-value) column (column 8), and only keep the top 100k peaks (at most).
+{{{
+bsub "sort -k 8,8nr IP.1_vs_control.1_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.1_vs_control.1.regionPeak.gz"
+bsub "sort -k 8,8nr IP.2_vs_control.2_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.2_vs_control.2.regionPeak.gz"
+}}}
+==== Estimate Irreproducibility Discovery Rate (IDR) between replicates ====
+The interface of this command has been slightly modified by BaRC, but the main method remains exactly the same.
+The command requires names and lengths of all chromosomes in a file (ex: chromInfo) of format chromosome<tab>size.
+{{{
+# USAGE: batch-consistency-analysis.r [peakfile1] [peakfile2] [peak.half.width] [outfile.prefix] [min.overlap.ratio] [is.broadpeak] [ranking.measure]\
+#
+# peak.half.width => set the peak.half.width to the reported peak width in the peak files (-1 (minus one) is recommended)
+# outfile.prefix => the one-word name of your analysis
+# min.overlap.ratio => fractional bp overlap (0 to 1) between peaks in replicates to be considered as overlapping peaks. (0 is recommended)
+# is.broadpeak => Is the peak file format broadPeak (or else is narrowPeak)? (F for false)  The IDR method is not recommended for broad peaks.
+# ranking.measure => p.value is recommended
+batch-consistency-analysis.r IP.1_vs_control.1.regionPeak.gz IP.2_vs_control.2.regionPeak.gz -1 rep1_vs_rep2_IDR 0 F p.value chromInfo.txt
+}}}
+==== Create the IDR plots ====
+{{{
+# USAGE: batch-consistency-plot.r [npairs] [output.prefix] [input.file.prefix1] [input.file.prefix2] [input.file.prefix3] ....
+# This method can plot 1 or more pairs of replicates
+batch-consistency-plot.r 1 rep1_vs_rep2_IDR_plot rep1_vs_rep2_IDR
+}}}
+==== Generate a conservative and an optimal final set of peak calls ====
+{{{
+# Use an IDR cutoff of 0.01 to 0.05, depending on the number of pre-IDR peaks and size of the genome:
+# IDR <= 0.05 for < 100K pre-IDR peaks for large genomes (human/mouse)
+# IDR <= 0.01 or 0.02 for ~15K to 40K peaks in smaller genomes such as worm
+awk '{ if($NF < 0.05) print $0 }' rep1_vs_rep2_IDR-overlapped-peaks.txt > rep1_vs_rep2_conserved_peaks_by_IDR.txt
+}}}
+=== Step 5: Link bound regions to genes ===
 Both MACS and SISSRs provide bed files with the set of peaks, presumably indicating bound regions.
 …
 Note: We assume that the TSS is the 5' end of the transcript (gene), but you can also go to the UCSC Bioinformatics table browser and download the coordinates of the transcription start sites (TSS).
+=== Comparing binding across different samples ===
+  * Method 1: Intersect bound regions (peaks)
+    * This binary comparison asks simply, "Is each peak in sample A present in sample B (and vice versa)?"
+    * This ignores the enrichment/score of each peak and gives each non-peak a score of 0.
+  * Method 2: Compare affinity (number of mapped reads) across bound regions
+    * Filter out redundant reads from each BAM file with a command like 'samtools rmdup'.
+    * Merge peaks of all samples (and make non-redundant):
+    * mergeBed -i Peaks_all_redundant.bed > Peaks.bed
+    * Count reads in each sample that overlaps each peak
+    * intersectBed -bed -wb -abam Reads_1.noDups.bam -b Peaks.bed | cut -f13-15 | mergeBed -i stdin -n > Peak_counts.reads_1.bed
+    * Modify counts per peak to account for differing numbers of mapped reads.
+  * Method 3: Use [[http://code.google.com/p/ngsplot/ | ngsplot]] to make stacked heatmaps of peaks for each sample
+----
+**Step 1**
+**Get the coordinates of the annotation features that you are going to use**. For analysis A, B and C those coordinates can be downloaded from the UCSC tables.
+**Step A: Get the coordinates of the desired annotation features**
+For analysis A, B and C those coordinates can be downloaded from the UCSC tables.
 {{{
 i.e., on the table browser select:
 …
 }}}
+----
+**Step 2: Intersect your peaks and your annotation files**
+**Step B: Intersect your peaks and your annotation files**
 {{{
 intersectBed -a genes.bed -b peaks.bed -wa -wb
 …
+=== Step 5 [if appropriate]: Comparing binding across different samples ===
+  * Method 1: Run IDR to identify reproducible peaks
+  * Method 2: Intersect bound regions (peaks)
+    * This binary comparison asks simply, "Is each peak in sample A present in sample B (and vice versa)?"
+    * This ignores the enrichment/score of each peak and gives each non-peak a score of 0.
+  * Method 3: Compare affinity (number of mapped reads) across bound regions
+    * Filter out redundant reads from each BAM file with a command like 'samtools rmdup'.
+    * Merge peaks of all samples (and make non-redundant):
+    * mergeBed -i Peaks_all_redundant.bed > Peaks.bed
+    * Count reads in each sample that overlaps each peak
+    * intersectBed -bed -wb -abam Reads_1.noDups.bam -b Peaks.bed | cut -f13-15 | mergeBed -i stdin -n > Peak_counts.reads_1.bed
+    * Modify counts per peak to account for differing numbers of mapped reads.
+  * Method 4: Use [[http://code.google.com/p/ngsplot/ | ngsplot]] to make stacked heatmaps of peaks for each sample