79 | | * MACS2 is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions. For broader peaks we recommend using --nomodel, --nolambda (if there's no control), and using the fragment size calculated on the strand cross correlation analysis. We recommend using macs2 rather than macs14 for broad peaks. |
80 | | {{{ |
81 | | macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -f BAM -g mm -n Name --nomodel -B |
82 | | }}} |
83 | | * --nomodel whether or not to build the shifting model. If True, MACS will not build model. by default it means shifting size = 100. |
| 81 | * MACS2 is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions. For broader peaks we recommend using --nomodel, --nolambda (if there's no control), and using the fragment size calculated on the strand cross correlation analysis. We recommend using macs2 rather than macs14 for broad peaks. Identifying reproducible peaks (across replicates) with IDR requires MACS2 and works best with a relaxed p-value threshold. |
| 82 | {{{ |
| 83 | bsub macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -f BAM -g mm -n Name --nomodel -B |
| 84 | bsub macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -f BAM -g mm -n Name -B -p 1e-3 |
| 85 | }}} |
| 86 | * --nomodel whether or not to build the shifting model. If True, MACS will not build model. By default it means shifting size = 100. |
122 | | === Step 4: Linking bound regions to genes === |
| 127 | === Step 4 [for experiments with replication]: Identify reproducible peaks === |
| 128 | |
| 129 | For more information about the method, see the main [[https://sites.google.com/site/anshulkundaje/projects/idr| IDR page]]. |
| 130 | |
| 131 | Run macs2 (macs version 2, not version 1) to call ChIP-seq peaks (see step 3) |
| 132 | |
| 133 | The IDR analysis requires that you call lots of peaks, including all "good" ones (signal) and some bad ones (noise). |
| 134 | |
| 135 | ==== Sort peaks in .narrowPeak files (created by macs2) ==== |
| 136 | |
| 137 | Sort from best to worst using the -log10(p-value) column (column 8), and only keep the top 100k peaks (at most). |
| 138 | |
| 139 | {{{ |
| 140 | bsub "sort -k 8,8nr IP.1_vs_control.1_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.1_vs_control.1.regionPeak.gz" |
| 141 | bsub "sort -k 8,8nr IP.2_vs_control.2_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.2_vs_control.2.regionPeak.gz" |
| 142 | }}} |
| 143 | |
| 144 | ==== Estimate Irreproducibility Discovery Rate (IDR) between replicates ==== |
| 145 | |
| 146 | The interface of this command has been slightly modified by BaRC, but the main method remains exactly the same. |
| 147 | |
| 148 | The command requires names and lengths of all chromosomes in a file (ex: chromInfo) of format chromosome<tab>size. |
| 149 | |
| 150 | {{{ |
| 151 | # USAGE: batch-consistency-analysis.r [peakfile1] [peakfile2] [peak.half.width] [outfile.prefix] [min.overlap.ratio] [is.broadpeak] [ranking.measure]\ |
| 152 | # |
| 153 | # peak.half.width => set the peak.half.width to the reported peak width in the peak files (-1 (minus one) is recommended) |
| 154 | # outfile.prefix => the one-word name of your analysis |
| 155 | # min.overlap.ratio => fractional bp overlap (0 to 1) between peaks in replicates to be considered as overlapping peaks. (0 is recommended) |
| 156 | # is.broadpeak => Is the peak file format broadPeak (or else is narrowPeak)? (F for false) The IDR method is not recommended for broad peaks. |
| 157 | # ranking.measure => p.value is recommended |
| 158 | |
| 159 | batch-consistency-analysis.r IP.1_vs_control.1.regionPeak.gz IP.2_vs_control.2.regionPeak.gz -1 rep1_vs_rep2_IDR 0 F p.value chromInfo.txt |
| 160 | }}} |
| 161 | |
| 162 | ==== Create the IDR plots ==== |
| 163 | {{{ |
| 164 | # USAGE: batch-consistency-plot.r [npairs] [output.prefix] [input.file.prefix1] [input.file.prefix2] [input.file.prefix3] .... |
| 165 | # This method can plot 1 or more pairs of replicates |
| 166 | |
| 167 | batch-consistency-plot.r 1 rep1_vs_rep2_IDR_plot rep1_vs_rep2_IDR |
| 168 | }}} |
| 169 | |
| 170 | ==== Generate a conservative and an optimal final set of peak calls ==== |
| 171 | {{{ |
| 172 | # Use an IDR cutoff of 0.01 to 0.05, depending on the number of pre-IDR peaks and size of the genome: |
| 173 | # IDR <= 0.05 for < 100K pre-IDR peaks for large genomes (human/mouse) |
| 174 | # IDR <= 0.01 or 0.02 for ~15K to 40K peaks in smaller genomes such as worm |
| 175 | |
| 176 | awk '{ if($NF < 0.05) print $0 }' rep1_vs_rep2_IDR-overlapped-peaks.txt > rep1_vs_rep2_conserved_peaks_by_IDR.txt |
| 177 | }}} |
| 178 | |
| 179 | === Step 5: Link bound regions to genes === |
134 | | === Comparing binding across different samples === |
135 | | |
136 | | * Method 1: Intersect bound regions (peaks) |
137 | | * This binary comparison asks simply, "Is each peak in sample A present in sample B (and vice versa)?" |
138 | | * This ignores the enrichment/score of each peak and gives each non-peak a score of 0. |
139 | | * Method 2: Compare affinity (number of mapped reads) across bound regions |
140 | | * Filter out redundant reads from each BAM file with a command like 'samtools rmdup'. |
141 | | * Merge peaks of all samples (and make non-redundant): |
142 | | * mergeBed -i Peaks_all_redundant.bed > Peaks.bed |
143 | | * Count reads in each sample that overlaps each peak |
144 | | * intersectBed -bed -wb -abam Reads_1.noDups.bam -b Peaks.bed | cut -f13-15 | mergeBed -i stdin -n > Peak_counts.reads_1.bed |
145 | | * Modify counts per peak to account for differing numbers of mapped reads. |
146 | | * Method 3: Use [[http://code.google.com/p/ngsplot/ | ngsplot]] to make stacked heatmaps of peaks for each sample |
147 | | |
148 | | ---- |
149 | | **Step 1** |
150 | | |
151 | | **Get the coordinates of the annotation features that you are going to use**. For analysis A, B and C those coordinates can be downloaded from the UCSC tables. |
| 191 | **Step A: Get the coordinates of the desired annotation features** |
| 192 | |
| 193 | For analysis A, B and C those coordinates can be downloaded from the UCSC tables. |
221 | | |
| 260 | === Step 5 [if appropriate]: Comparing binding across different samples === |
| 261 | |
| 262 | * Method 1: Run IDR to identify reproducible peaks |
| 263 | * Method 2: Intersect bound regions (peaks) |
| 264 | * This binary comparison asks simply, "Is each peak in sample A present in sample B (and vice versa)?" |
| 265 | * This ignores the enrichment/score of each peak and gives each non-peak a score of 0. |
| 266 | * Method 3: Compare affinity (number of mapped reads) across bound regions |
| 267 | * Filter out redundant reads from each BAM file with a command like 'samtools rmdup'. |
| 268 | * Merge peaks of all samples (and make non-redundant): |
| 269 | * mergeBed -i Peaks_all_redundant.bed > Peaks.bed |
| 270 | * Count reads in each sample that overlaps each peak |
| 271 | * intersectBed -bed -wb -abam Reads_1.noDups.bam -b Peaks.bed | cut -f13-15 | mergeBed -i stdin -n > Peak_counts.reads_1.bed |
| 272 | * Modify counts per peak to account for differing numbers of mapped reads. |
| 273 | * Method 4: Use [[http://code.google.com/p/ngsplot/ | ngsplot]] to make stacked heatmaps of peaks for each sample |
| 274 | |