wiki:SOPs/chip_seq_peaks

Version 53 (modified by byuan, 4 years ago) ( diff )

--

Using ChIP-Seq to identify and/or quantify peaks (or other interesting enriched regions)

Basic Approach

  • For each sample map the reads to the appropriate genome (e.g. with Bowtie2).
  • Call peaks (eg. MACS) or use another method to analyze enrichment and/or presumed binding. For ChIP-seq experiments profiling transcription factors (with discrete binding sites), binding site identification (via peak calling) is typically recommended. Peaks can be called without control; however, it's highly recommended to include a control sample (e.g. IgG or input, with input generally preferred over IgG). For ChIP-seq experiments profiling epigenetic (such as histone) modifications, however, modeling ChIP enrichment as peaks may not accurately describe the actual data, and some other (such as sliding window) quantification may be more relevant.
  • Note that quality control is important after read mapping and after peak calling. The ENCODE consortium recommends some quality metrics.

Reviews

Evaluation of ChIP-Seq performance Wilbanks EG, and Facciotti MT (2010; PLoS ONE)

A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs (NAR Nov. 2010)

ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia (Genome Research, 2012)

Step 1: Map reads

  • Use Bowtie, bowtie2, or another unspliced mapping tool. We recommend using Bowtie2 (default parameters), filtering to remove multi-mapped reads (eg. a mapping quality of 2 or higher) is not needed as Bowtie2 randomly chooses a best hit when multiple good hits exist (see [| Bowtie2 manual [under Reporting ]). Low mapping quality reads may need to be removed if using another aligner since MACS does not filter on mapping quality.

Step 2: Perform strand cross correlation analysis

  • The goal of this step is to assess the quality of the IP and to estimate the fragment size of the immunoprecipitated DNA.
  • For a detailed explanation on strand cross-correlation analysis see box 2 of this paper (Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data ).
    run_spp.R -c=TreatmentIP.bam -savp=TreatmentIP.spp.pdf -out=TreatmentIP.run_spp.out
    
    • ENCODE recommends normalized strand coefficientt (NSC) > 1.05 and relative strand correlation (RSC) > 0.08 Ref
    • After this analysis a good ChIP-seq experiment will have a second peak (reflecting the fragment size) at least as tall as the first peak (a "phantom" peak reflecting read length). This is how the graph should look: (Fig4E). If the second peak is smaller than the first, (like the example shown in Fig4G Marginal), macs will not estimate fragment size correctly. In that case we recommend running macs with parameters "--nomodel" and "--shiftsize=half_of_the_fragment_size", or "--nomodel" and "--extsize=fragment_size". The fragment size is detected on the strand cross correlation analysis.

Step 3: Call peaks (presumed bound regions), especially for transcription factors

Some of the parameters to consider when comparing programs are:

  • Adjustment of sequence tags to better represent the original DNA fragment (by shifting tags in the 3′ direction or by extending tags to the estimated length of the original fragments)
  • Background model used
  • Use of strand dependent bimodality

BaRC performed a ChIP-Seq bake off in the summer of 2011, with these ChIP-Seq bake off results.

Based on our ChIP-Seq bake off and on a published review (Evaluation of ChIP-Seq performance), MACs and SISSRs are good programs to try.

MACS2

  • MACS2 is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions. For broader peaks we recommend using --nomodel,--broad, --nolambda (if there's no control), and using the fragment size calculated on the strand cross correlation analysis. We recommend using macs2 rather than macs14 for broad peaks. Identifying reproducible peaks (across replicates) with IDR requires MACS2 and works best with a relaxed p-value threshold.
    bsub macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -g mm -n Name -B -f BAM   --keep-dup auto --nomodel --extsize "size calculated on the strand crosscorrelation analysis"
    bsub macs2 callpeak -t IP_reads.mapped_only.bam -c Control_reads.mapped_only.bam -g mm -n Name -B -f BAMPE --keep-dup 1    -p 1e-3 
    
  • -B create bedgraph output files
  • -f Input format (typically BAM for single-end reads and BAMPE for paired-end reads). In BAMPE mode, MACS2 will only use properly-paired read alignments, see MACS2 Forum , also from the readme/manual, "when format BAMPE is specified, MACS will use the real fragments inferred from alignment results for reads pileup.", i.e. --nomodel/--extsize will be ignored.
  • --nomodel whether or not to build the shifting model. If True, MACS will not build model. By default it means shifting size = 100.
  • --extsize When nomodel is true, MACS will use this value as fragment size to extend each read towards 3' end
  • --keep-dup How should MACS handle duplicate tags at the same location? The default is to keep just one. The 'auto' option (recommended for high-coverage samples) has MACS calculate the expected maximum tags at the same location (based on the binomal distribution). FastQC results can be used to gauge the duplication levels in a sample.
  • For future IDR analysis, use '-p 1e -3' => Set p-value cutoff to 1e-3 (which is more relaxed than the default setting)

Output format (xls/text file):

  • chr name
  • start pos of peak
  • end pos of peak
  • length of peak region
  • absolute peak summit position
  • pileup height at peak summit
  • -log10(pvalue) for the peak summit
  • fold enrichment
  • -log10(qvalue) at peak summit

MACS2 wiki

MACS14

We now recommend using macs2. Macs2 is more effective at peak finding especially for broad peaks. If you still want to use macs1.4, below are our previous recommendations on how to use it.

  • For MACS to work the header of the sequences can have no spaces.
  • The command 'macs' points to macs14 on WIBR local machines
  • MACS may have trouble with a SAM file from bowtie if it contains unmapped reads (which it generally does). As a result, you may need to filter out unmapped reads with a command like
    samtools view -hS -F 4 all_reads.sam > mapped_reads.sam
    

MACS (README) (Zhang et al., 2008)

  • MACS is appropriate for both proteins like transcription factors that may have narrow peaks, as well as histone modifications that may affect broader regions. However, for broader peaks changing values of the parameters may be needed: eg. using --nomodel, --nolambda (if there's no control), --call-subpeaks. Running MACS with different parameters and viewing in IGV the results can help in choosing the optimal values.

Sample commands to run MACS (current version as of March 5 2012): 1.4.2 using mapped reads in map or sam format:

macs -t IP_mapped.map -c Control_mapped.map -g 1.87e9 --name=outputName --format=BOWTIE --tsize=36 --wig --space=25  --mfold=10,30
macs -t IP_mapped.sam -c Control_mapped.sam -g 1.87e9 --name=outputName --format=SAM    --tsize=36 --wig --space=25  --mfold=10,30
macs -t IP_mapped.sam -c Control_mapped.sam -g 1.87e9 --name=outputName --format=SAM    --tsize=36 --wig --space=25  --nomodel --shiftsize=100

The parameters used on the command are:

  • -t TFILE Treatment file
  • -c CFILE Control file
  • --name=NAME Experiment name, which will be used to generate output file names. DEFAULT: "NA"
  • --format=FORMAT Format of tag file, "BED" or "ELAND" or "ELANDMULTI" or "ELANDMULTIPET" or "SAM" or "BAM" or "BOWTIE". DEFAULT: "BED"
  • --tsize=TSIZE Tag size. DEFAULT: auto detected tag size
  • --wig Whether or not to save shifted raw tag count at every bp into a wiggle file
  • --mfold=MFOLD Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. The regions must be lower than upper limit, and higher than the lower limit. DEFAULT:10,30
  • -g GSIZE Effective genome size. It can be 1.0e+9 or 1000000000, or shortcuts:'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8), Default:hs
  • --keep-dup=1 Controls the MACS behavior towards duplicate tags at the exact same location. DEFAULT: 1 in both MACS 1.4 and MACS2.
  • --nomodel whether or not to build the shifting model. If True, MACS will not build model. by default it means shifting size = 100.
  • --shiftsize The arbitrary shift size in bp. When nomodel is true, MACS will use this value as 1/2 of fragment size. DEFAULT: 100.
bsub "macs14 -t IP_mapped.map -c Control_mapped.map --name=outputName --format=BOWTIE --tsize=36 --wig --space=25 --mfold=10,30"
bsub "macs14 -t IP_mapped.sam -c Control_mapped.sam --name=outputName --format=SAM    --tsize=36 --wig --space=25 --mfold=10,30"

Note: The wig files that macs14 generates are not normalized.

SISSRs

SiSSRs (manual) ChIP-Seq bake off

SISSRs uses strand bimodality to try to find the summit of the peak. The summit should be very close to the DNA bound by the TF. It is more appropriate for TFs because they tend to bind in specific narrow regions.

Map with Bowtie, use --sam parameter to get a SAM output file

bsub "bowtie -t -m 3 -n 3 -l 36 --strata --best --solexa1.3-quals --sam inputSeq bowtieOutput.sam"

SISRs input is a bed file. Convert mapped reads from SAM to BAM and from BAM to bed format

bsub "samtools view -S -b -o bowtieOutput.bam bowtieOutput.sam" 
-S       input is SAM
-b       output BAM
bsub "bamToBed -i bowtieOutput.bam > bowtieOutput.bed"

Run SISSRs with a sample command like

sissrs.pl -i bowtieOutput.bed -o outputFile -s 2716965481 -b Background.bed -L 200

The parameters used in the sample command:

  • -s is the size of the genome
  • -L is the maximum length of the fragment
  • -m is the percentage of mappable bps. Default is .8 for Eland in human.

For more detailed description of parameters see our [chipSeqBakeOff ChIP-Seq bake off].

Step 4 [for peak-calling experiments with replication]: Identify reproducible peaks

For more information about the method, see the main IDR page.

Run macs2 (macs version 2, not version 1) to call ChIP-seq peaks (see step 3)

The IDR analysis requires that you call lots of peaks, including all "good" ones (signal) and some bad ones (noise).

Sort peaks in .narrowPeak files (created by macs2)

Sort from best to worst using the -log10(p-value) column (column 8), and only keep the top 100k peaks (at most).

bsub "sort -k 8,8nr IP.1_vs_control.1_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.1_vs_control.1.regionPeak.gz"
bsub "sort -k 8,8nr IP.2_vs_control.2_peaks.narrowPeak | head -n 100000 | gzip -c >| IP.2_vs_control.2.regionPeak.gz"

Estimate Irreproducibility Discovery Rate (IDR) between replicates

The interface of this command has been slightly modified by BaRC, but the main method remains exactly the same. The modified scripts are in /nfs/BaRC_Public/BaRC_code/R/IDR

The command requires names and lengths of all chromosomes in a file (ex: chromInfo) of format chromosome<tab>size.

# USAGE: batch-consistency-analysis.r [peakfile1] [peakfile2] [peak.half.width] [outfile.prefix] [min.overlap.ratio] [is.broadpeak] [ranking.measure]\
#
# peak.half.width => set the peak.half.width to the reported peak width in the peak files (-1 (minus one) is recommended)
# outfile.prefix => the one-word name of your analysis
# min.overlap.ratio => fractional bp overlap (0 to 1) between peaks in replicates to be considered as overlapping peaks. (0 is recommended)  
# is.broadpeak => Is the peak file format broadPeak (or else is narrowPeak)? (F for false)  The IDR method is not recommended for broad peaks.
# ranking.measure => p.value is recommended

batch-consistency-analysis.r IP.1_vs_control.1.regionPeak.gz IP.2_vs_control.2.regionPeak.gz -1 rep1_vs_rep2_IDR 0 F p.value chromInfo.txt 

Create the IDR plots

# USAGE: batch-consistency-plot.r [npairs] [output.prefix] [input.file.prefix1] [input.file.prefix2] [input.file.prefix3] ....
# This method can plot 1 or more pairs of replicates

batch-consistency-plot.r 1 rep1_vs_rep2_IDR_plot rep1_vs_rep2_IDR

Generate a conservative and an optimal final set of peak calls

# Use an IDR cutoff of 0.01 to 0.05, depending on the number of pre-IDR peaks and size of the genome:
# IDR <= 0.05 for < 100K pre-IDR peaks for large genomes (human/mouse)
# IDR <= 0.01 or 0.02 for ~15K to 40K peaks in smaller genomes such as worm
# awk note: $NF refers to the last data column which (in the *overlapped-peaks.txt file) is the IDR

awk '{ if($NF < 0.05) print $0 }' rep1_vs_rep2_IDR-overlapped-peaks.txt > rep1_vs_rep2_conserved_peaks_by_IDR.txt

Step 5: Link "bound" (or other interesting) regions to genomic features (genes, promoters, enhancers, etc.)

Both MACS and SISSRs provide bed files with the set of peaks, presumably indicating bound regions.

To link this regions to genes check out this SOP: Linking genome regions to genome annotations

Below it is a detailed example of how to do the following annotation:

  1. Find the genes that overlap with your peaks
  2. Find the genes that have a peak within -D distance upstream of the TSS/gene
  3. Find the genes that have a peak -D distance downstream of the gene
  4. Find the genes that have a peak -D distance downstream of the TSS

Note: We assume that the TSS is the 5' end of the transcript (gene), but you can also go to the UCSC Bioinformatics table browser and download the coordinates of the transcription start sites (TSS).

Step A: Get the coordinates of the desired annotation features

For analysis A, B and C those coordinates can be downloaded from the UCSC tables.

i.e., on the table browser select:
Mammal->Mouse
group: Genes and Gene Prediction Tracks
track: RefSeq Genes
table: refGene
region: genome
output format: bed
(name your file)
get output
for A: Select Whole Gene
for B: Select Upstream by D bases 
for C: Downstream by D bases 

You will have now 3 annotation files named for example:

mm9_REFseqgenes.bed   (gene bodies)
mm9_REFseqgenes1kbUp.bed  (1Kb upstream of the gene)
mm9_REFseqgenes1kbDownofGene.bed (1Kb downstream of the gene)

The gene bodies file has more columns that you need, so keep only the first 6 columns.

cut -f1-6 mm9_REFseqgenes.bed > mm9_REFseqgenes_TRIM.bed 

For annotation D, to get the coordinates of the regions from the TSS to a certain distance downstream of the TSS (beginning of the gene):

  1. Take the genes files and keep the coordinate that represents the start of the gene (based on the strand),
awk -F "\t" '{if ($6 == "+") {print  $1"\t"$2"\t"$2+1"\t"$4"\t"$5"\t"$6} else {print  $1"\t"$3-1"\t"$3"\t"$4"\t"$5"\t"$6}  }' mm9_REFseqgenes.bed >| mm9_REFseqgenesTSS.bed 
  1. Then use the tool "slopBed" to increase the range of genomic coordinates to include downstream of the TSS

The tool "slopBed" needs a file with the chromosome sizes that can be created with this command

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from mm9.chromInfo" > mm9.genome

Slop the file with the TSS coordinates to get the desired distance downstream (3') of the TSS

slopBed -i mm9_REFseqgenesTSS.bed  -g mm9.genome -l 0 -r 1000 -s  >|  mm9_REFseqgenes_1kbdownofTSS.bed 

Step B: Intersect your peaks and your annotation files

intersectBed -a genes.bed -b peaks.bed -wa -wb
-wa returns the list from file A
-wb returns the list from file B
intersectBed -a mm9_1KbUpDownofTSS.bed -b peaks.bed -wa -wb >| 1KbUpDownofTSS_peaks.bed

Step 6 [if appropriate]: Comparing binding across different samples

  • Method 1: Run IDR to identify reproducible peaks
  • Method 2: Use MACS2 bdgdiff to call differential binding. bedGraph (.bdg) files are required.
  • Method 3: Intersect bound regions (peaks)
    • This binary comparison asks simply, "Is each peak in sample A present in sample B (and vice versa)?"
    • This ignores the enrichment/score of each peak and gives each non-peak a score of 0.
  • Method 4: Compare affinity (number of mapped reads) across bound regions
    • Filter out redundant reads from each BAM file with a command like 'samtools rmdup'.
    • Merge peaks of all samples (and make non-redundant):
      • mergeBed -i Peaks_all_redundant.bed > Peaks.bed
    • Count reads in each sample that overlaps each peak
      • multiBamCov -bams *.sorted.bam -bed Merged_peaks.bed > Merged_peak_counts.reads_samples.bed
    • Modify counts per peak to account for differing numbers of mapped reads.
  • Method 5: Use ngsplot to make stacked heatmaps,and profiles, of peaks for each sample
  • Method 6: Use MSPC from Jalili et al., 2015.
Note: See TracWiki for help on using the wiki.