Changes between Version 9 and Version 10 of SOPs/chip_seq_peaks


Ignore:
Timestamp:
05/15/14 09:35:21 (11 years ago)
Author:
gbell
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/chip_seq_peaks

    v9 v10  
    1313
    1414 * See the [[http://barcwiki.wi.mit.edu/wiki/SOPs/mapping|mapping SOP]] for more details.
    15 === Step 2: Strand cross correlation analysis ===
    16  * The goal of this step is to asses the quality of the IP and to estimate the fragment size of the DNA immunoprecipitated.
     15
     16=== Step 2: Perform strand cross correlation analysis ===
     17
     18 * The goal of this step is to asses the quality of the IP and to estimate the fragment size of the immunoprecipitated DNA.
    1719 * For a detailed explanation on strand cross-correlation analysis see box 2 of this paper ([[http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003326|Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data ]]).
    1820{{{
     
    2022}}}
    2123
    22   * After this analysis a good ChIP-seq experiment will have a second peak (reflecting the fragment size) at least as tall as the first peak (reflecting read length). This is how the graph should look: ([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/figure/F4/|Fig4E ]]). If the second peak is smaller that the first, ([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/figure/F4/|like the example shown in Fig4G Marginal ]]),  macs will not estimate fragment size correctly. In that case we recommend running macs with parameters "--nomodel" and "--shiftsize=half_of_the_fragment_size". The fragment size is detected on the strand cross correlation analysis.
     24  * After this analysis a good ChIP-seq experiment will have a second peak (reflecting the fragment size) at least as tall as the first peak (reflecting read length). This is how the graph should look: ([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/figure/F4/|Fig4E]]). If the second peak is smaller that the first, ([[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/figure/F4/|like the example shown in Fig4G Marginal]]),  macs will not estimate fragment size correctly. In that case we recommend running macs with parameters "--nomodel" and "--shiftsize=half_of_the_fragment_size". The fragment size is detected on the strand cross correlation analysis.
    2325 
    2426=== Step 3: Call peaks (bound regions) ===
     
    3335
    3436==== MACS14 ====
    35   * For MACS to work the header of the sequences has to have no spaces.
    36   * macs points to macs14 on WIBR local machines
     37  * For MACS to work the header of the sequences can have no spaces.
     38  * The command 'macs' points to macs14 on WIBR local machines
    3739  * MACS may have trouble with a SAM file from bowtie if it contains unmapped reads (which it generally does).  As a result, you may need to filter out unmapped reads with a command like
    3840{{{
     
    7779==== MACS2 ====
    7880
    79   * MACS2 is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions. For broader peaks we recommend using --nomodel, --nolambda (if there's no control), and using the fragment size calculated on the strand cross correlation analysis.  We recommend using macs2 rather than macs14 for broad peaks.
    80 {{{
    81 macs2  callpeak -t IP_reads.mapped_only.bam  -c Control_reads.mapped_only.bam -f BAM -g mm  -n Name --nomodel   -B
    82 }}}
    83   * --nomodel  whether or not to build the shifting model. If True, MACS will not build model. by default it means shifting size = 100.                     
     81  * MACS2 is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions. For broader peaks we recommend using --nomodel, --nolambda (if there's no control), and using the fragment size calculated on the strand cross correlation analysis.  We recommend using macs2 rather than macs14 for broad peaks.  Identifying reproducible peaks (across replicates) with IDR requires MACS2 and works best with a relaxed p-value threshold.
     82{{{
     83bsub macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -f BAM -g mm -n Name --nomodel -B
     84bsub macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -f BAM -g mm -n Name -B -p 1e-3
     85}}}
     86  * --nomodel  whether or not to build the shifting model. If True, MACS will not build model. By default it means shifting size = 100.                     
    8487  * -f Input format
    8588  * -B create bedgraph output files
     89  * For future IDR analysis, use '-p 1e -3' => Set p-value cutoff to 1e-3 (which is more relaxed than the default setting)
     90
    8691==== SISSRs ====
    8792[[http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/|SiSSRs]] ([[http://nar.oxfordjournals.org/cgi/content/full/36/16/5221|Reference]]) ([[http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/SISSRs-Manual.pdf|Manual]])
     
    120125
    121126
    122 === Step 4: Linking bound regions to genes ===
     127=== Step 4 [for experiments with replication]: Identify reproducible peaks ===
     128
     129For more information about the method, see the main [[https://sites.google.com/site/anshulkundaje/projects/idr| IDR page]].
     130
     131Run macs2 (macs version 2, not version 1) to call ChIP-seq peaks (see step 3)
     132
     133The IDR analysis requires that you call lots of peaks, including all "good" ones (signal) and some bad ones (noise).
     134
     135==== Sort peaks in .narrowPeak files (created by macs2) ====
     136
     137Sort from best to worst using the -log10(p-value) column (column 8), and only keep the top 100k peaks (at most).
     138
     139{{{
     140bsub "sort -k 8,8nr IP.1_vs_control.1_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.1_vs_control.1.regionPeak.gz"
     141bsub "sort -k 8,8nr IP.2_vs_control.2_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.2_vs_control.2.regionPeak.gz"
     142}}}
     143
     144==== Estimate Irreproducibility Discovery Rate (IDR) between replicates ====
     145
     146The interface of this command has been slightly modified by BaRC, but the main method remains exactly the same.
     147
     148The command requires names and lengths of all chromosomes in a file (ex: chromInfo) of format chromosome<tab>size.
     149
     150{{{
     151# USAGE: batch-consistency-analysis.r [peakfile1] [peakfile2] [peak.half.width] [outfile.prefix] [min.overlap.ratio] [is.broadpeak] [ranking.measure]\
     152#
     153# peak.half.width => set the peak.half.width to the reported peak width in the peak files (-1 (minus one) is recommended)
     154# outfile.prefix => the one-word name of your analysis
     155# min.overlap.ratio => fractional bp overlap (0 to 1) between peaks in replicates to be considered as overlapping peaks. (0 is recommended) 
     156# is.broadpeak => Is the peak file format broadPeak (or else is narrowPeak)? (F for false)  The IDR method is not recommended for broad peaks.
     157# ranking.measure => p.value is recommended
     158
     159batch-consistency-analysis.r IP.1_vs_control.1.regionPeak.gz IP.2_vs_control.2.regionPeak.gz -1 rep1_vs_rep2_IDR 0 F p.value chromInfo.txt
     160}}}
     161
     162==== Create the IDR plots ====
     163{{{
     164# USAGE: batch-consistency-plot.r [npairs] [output.prefix] [input.file.prefix1] [input.file.prefix2] [input.file.prefix3] ....
     165# This method can plot 1 or more pairs of replicates
     166
     167batch-consistency-plot.r 1 rep1_vs_rep2_IDR_plot rep1_vs_rep2_IDR
     168}}}
     169
     170==== Generate a conservative and an optimal final set of peak calls ====
     171{{{
     172# Use an IDR cutoff of 0.01 to 0.05, depending on the number of pre-IDR peaks and size of the genome:
     173# IDR <= 0.05 for < 100K pre-IDR peaks for large genomes (human/mouse)
     174# IDR <= 0.01 or 0.02 for ~15K to 40K peaks in smaller genomes such as worm
     175
     176awk '{ if($NF < 0.05) print $0 }' rep1_vs_rep2_IDR-overlapped-peaks.txt > rep1_vs_rep2_conserved_peaks_by_IDR.txt
     177}}}
     178
     179=== Step 5: Link bound regions to genes ===
    123180Both MACS and SISSRs provide bed files with the set of peaks, presumably indicating bound regions.
    124181
     
    132189Note: We assume that the TSS is the 5' end of the transcript (gene), but you can also go to the UCSC Bioinformatics table browser and download the coordinates of the transcription start sites (TSS).
    133190
    134 === Comparing binding across different samples ===
    135 
    136   * Method 1: Intersect bound regions (peaks)
    137     * This binary comparison asks simply, "Is each peak in sample A present in sample B (and vice versa)?"
    138     * This ignores the enrichment/score of each peak and gives each non-peak a score of 0.
    139   * Method 2: Compare affinity (number of mapped reads) across bound regions
    140     * Filter out redundant reads from each BAM file with a command like 'samtools rmdup'.
    141     * Merge peaks of all samples (and make non-redundant):
    142     * mergeBed -i Peaks_all_redundant.bed > Peaks.bed
    143     * Count reads in each sample that overlaps each peak
    144     * intersectBed -bed -wb -abam Reads_1.noDups.bam -b Peaks.bed | cut -f13-15 | mergeBed -i stdin -n > Peak_counts.reads_1.bed
    145     * Modify counts per peak to account for differing numbers of mapped reads.
    146   * Method 3: Use [[http://code.google.com/p/ngsplot/ | ngsplot]] to make stacked heatmaps of peaks for each sample
    147 
    148 ----
    149 **Step 1**
    150 
    151 **Get the coordinates of the annotation features that you are going to use**. For analysis A, B and C those coordinates can be downloaded from the UCSC tables.
     191**Step A: Get the coordinates of the desired annotation features**
     192
     193For analysis A, B and C those coordinates can be downloaded from the UCSC tables.
    152194{{{
    153195i.e., on the table browser select:
     
    204246}}}
    205247
    206 
    207 ----
    208 
    209 **Step 2: Intersect your peaks and your annotation files**
     248**Step B: Intersect your peaks and your annotation files**
    210249{{{
    211250intersectBed -a genes.bed -b peaks.bed -wa -wb
     
    219258
    220259
    221 
     260=== Step 5 [if appropriate]: Comparing binding across different samples ===
     261
     262  * Method 1: Run IDR to identify reproducible peaks
     263  * Method 2: Intersect bound regions (peaks)
     264    * This binary comparison asks simply, "Is each peak in sample A present in sample B (and vice versa)?"
     265    * This ignores the enrichment/score of each peak and gives each non-peak a score of 0.
     266  * Method 3: Compare affinity (number of mapped reads) across bound regions
     267    * Filter out redundant reads from each BAM file with a command like 'samtools rmdup'.
     268    * Merge peaks of all samples (and make non-redundant):
     269    * mergeBed -i Peaks_all_redundant.bed > Peaks.bed
     270    * Count reads in each sample that overlaps each peak
     271    * intersectBed -bed -wb -abam Reads_1.noDups.bam -b Peaks.bed | cut -f13-15 | mergeBed -i stdin -n > Peak_counts.reads_1.bed
     272    * Modify counts per peak to account for differing numbers of mapped reads.
     273  * Method 4: Use [[http://code.google.com/p/ngsplot/ | ngsplot]] to make stacked heatmaps of peaks for each sample
     274