wiki:SOPs/chipSeqBakeoffResults

MACS

  • Summary: This software is recommended. It appears to work correctly, is flexible but easy to use, and generates all desired output files.
  • Version 1.3.7.1 is installed at /nfs/BaRC/apps/MACS/bin/macs
  • Required input(s):
    • file of mapped ChIP reads
    • accepted formats: bed, eland, bowtie, sam/bam
  • Recommended:
    • effective genome size for mapping (or defaults to 2.7e+9 [human])
    • file of mapped control reads
  • Each read can only be mapped to one location
  • Replicate samples should be merged before processing.
  • Reads are extended by distance d (calculated by program) between paired peaks and shifted by d/2 towards middle of paired peaks
  • Peaks are selected as middle of peak pair and ranked by Poisson p-value (if it's less than threshold, default = 1e-5)
  • Control data (reads) are used to calculate FDR
  • Program is easy to run: type 'macs' to get arguments
  • Tags with counts in numbers greater than expected are removed
  • Google group: http://groups.google.com/group/macs-announcement
  • Output includes
    • Peaks in BED file and Excel file
    • WIG files (one per chr, gzipped) for ChIP and control datasets
    • R code to make figure of model (meta-peak pair)
  • Read IDs cannot have spaces
  • One run takes about 20 min, wtih most time used in writing of wig files
  • No method for handling replicate lanes
  • Several publications used MACS

SISSRs

Note Oct 21 2010 SISSRS script is now in /nfs/BaRC/apps/SISSRs/sissrs.pl

  • Summary: This software is recommended specially for CHIP-seq data for TFs. It appears to work correctly and be more stringent than other methods. It is highly recommended to use the option -a that only keeps 1 read per genomic coordinate. Generates output files that can be converted to bed or wig files.
  • Version: v1.4, the perl script and manual are located at /nfs/BaRC/apps/SISSRs/
  • Publication for SISSRs has been cited by 31 other articles.
  • Required input:
  • Parameters needed and recommended:
    • -i chip bed file
    • -b control bed file
    • -o output file
    • -s size of genome
    • -L is the maximum length of the fragment
    • -m is the percentage of mappable bps. Default is .8 for Eland in human. (I left the default, Inma Jan 10)
    • -a only keeps 1 read per genomic coordinate
    • -t report only the center of the inferred binding site / -c (default) reports binding site (BS) start and BS end and clusters BS is close to each other / -r same as -c but it doesn't cluster the sites.
    • Leave other parameters as default.effective genome size for mapping (or defaults to 2.7e+9 [human]
  • SISSRs compares forward and reverse tags to find out the "transition" point. This point should correspond with the binding site. (see Fig 1 of the paper).
  • Control data is used as background model.
  • The distribution of fold enrichment ratios of 1 million random genomic positions is used to calculate p-values. P-value is the prob that that enrichment over BG is seen by chance.
  • Peaks over the p-value threshold are reported (default: 0.001)
  • Software can not used reads mapped to more than one location
  • Replicas have to be processed separately or concatenate together
  • No scaling is done in data sets with different number of tags. It is recommended to have close to 10 million tags on the control sample. If not a more stringent fold enrichment will be required to meet the p-value. (my(Inma) control samples had around ~6 millions though)
  • Program is easy to use for bioinformatics people. It requires converting input formats to bed and running the sissr.pl perl script.
  • Runs take around 10 minutes
  • Other comments:
    • SISSRs treats forward and reverse reads differently and finds the middle point that should represent the “actual binding site”. For this reason peaks are narrow. The program output either the binding location or a peak that is the binding location plus/minus the average length of the fragments from chip (-F). This parameter could be estimated from reads (default) or given by the user.
    • Option –c reports binding sites that are F-bp from each other as a binding region. The NumTags in this case reports the number of tags for the strongest site.
    • The program doesn't give the total read counts at each position. The closer I could get to output "signal wig files" was to use p value = 1. This outputs all possible transition points but there are coordinates on the genome that are not output because they are not transition points.

CisGenome

  • Summary: Can be used as a first pass by biologists since the results are acceptable and there's an easy to use GUI available. *Version 1.1 was used to test, source code is in /nfs/BaRC/ChIP_Seq_analysis/CisGenome/cisgenome_project. The scripts need to be run on a 64-bit machine. Both command line and GUI (for Windows) available. * User Manual *CisGenome has been cited 29 times *CisGenome does not have any options to treat replicates, or reads mapped to multiple locations ie. reads aligned to multiple locations are treated independently, likewise for replicates *In two-sample analysis, CisGenome uses a conditional binomial model to identify regions in which the ChIP reads are significantly enriched compared to the control.
    Windows passing a user-specified FDR cutoff are used to generate predicted binding regions.
    If there's only one sample available, a negative binomial distribution is used to model background in computing FDR. *Smoothing of data is accomplished simply by sliding a window of fixed width and replacing the tag count at each site with the summed value within the window.
    CisGenome computes the number of reads needed in each window to satisfy the FDR cutoff, windows passing user cutoff FDR is reported as significant. *Two post-processing filtering options are available, both options were used in testing: *Boundary Refinement: highest ranking peak pairs (from forward and reverse) are used to define bound region. *Single-strand filtering: reports regions only when there are both forward and reverse tags showing the binding region is significant. *Multiple scripts are needed to run command line CisGenome, a wrapper can be written to automate this, if needed *Input:
    • aln file, there's a script to convert eland to aln, if using another aligner convert the format such that it's, <chr>\t<start>\t<strand> eg. chr11 113929683 R
    • Genome database which can be downloaded from CisGenome website if you're planning to do any annotation (optional)

*The aln file is sorted and converted to a bar file, which is then used to compute FDR, an FDR of 0.10 was used here, and finally peaks a detected (as stated above) *Output:

*bar files, which has a genome-scale of all peaks *cod files, which contains significant peaks

*There's no utility to convert CisGenome's output to wig files. However, the bar and cod files can be converted to wig using other scripts. *Run time was approx 30 min. *Results:

*H3K4me3_ES (combined) had 20264 significant peaks by CisGenome, 19630 were reported on Marson et al.
19567 of the CisGenome peaks overlaps with what's reported. *Nanog_mES (combined) had 9403 significant peaks by CisGenome, 16688 were reported on Marson et al.
All CisGenome peaks overlaps with what's reported.

ERANGE

  • Summary: This software is not recommended because the running time is too long on our unix machines.