
Version 10 (modified by gbell, 4 years ago)


De novo search for overrepresented DNA motifs that could represent Transcription Factor Binding Sites (TFBS)

Methods differ in how they sample candidate DNA motifs and in how they estimate motif overrepresentation. Searching for all sites of a known motif is a related but different task.

Below are links to review articles:

MEME

Based on expectation maximization (deterministic optimization). Spurious motifs can be reduced by filtering the input sequences. For TF ChIP-Seq data, for example, filter the MACS peaks by fold enrichment and/or shorten each sequence (e.g. to a ~200 bp region around the peak summit).
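The filtering described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the peaks are assumed to be in MACS2 narrowPeak format (column 7 = fold enrichment, column 10 = summit offset from the peak start), the function name summit_windows is hypothetical, and the thresholds (fold enrichment 5, flank 100 bp) and file names are placeholders rather than recommended values.

```shell
# Hypothetical helper: read narrowPeak rows on stdin, keep peaks with
# fold enrichment >= min_fold, and print a BED window of +/- flank bp
# around each summit (chrom, start, end, peak name).
# Assumed columns: 2 = peak start, 7 = fold enrichment, 10 = summit offset.
summit_windows() {
    local min_fold="${1:-5}" flank="${2:-100}"
    awk -v OFS='\t' -v min="$min_fold" -v flank="$flank" '
        $7 >= min {
            summit = $2 + $10            # absolute summit position
            start = summit - flank
            if (start < 0) start = 0     # clamp at the chromosome start
            print $1, start, summit + flank, $4
        }'
}

# Example usage (file names are placeholders):
# summit_windows 5 100 < sample_peaks.narrowPeak > summit_200bp.bed
# bedtools getfasta -fi genome.fa -bed summit_200bp.bed -fo summit_200bp.fa
```

The resulting FASTA file can then be given to meme in place of the full-length peak sequences.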

Sample commands:

meme testsmall.FA -oc TEST-OUT -dna
meme seq.fa -minw 6 -maxw 50 -mod oops
# Look for 5 motifs of width 5-15, expecting zero or one occurrence of each motif per sequence
meme Promoters.fa -dna -oc . -mod zoops -nmotifs 5 -minw 5 -maxw 15 -revcomp

Selected options:

[-oc <output dir>]      name of directory for output files; will replace existing directory
[-dna]                  sequences use DNA alphabet
[-minw <minw>]          minimum motif width
[-maxw <maxw>]          maximum motif width
[-mod oops|zoops|anr]   distribution of motifs
     oops    One occurrence per sequence
     zoops   Zero or one occurrence per sequence
     anr     Any number of repetitions

Tomtom can then be run to compare MEME motifs to database(s) of known motifs. It's part of the MEME suite.

tomtom -no-ssc -verbosity 1 -min-overlap 5 -dist pearson -evalue -thresh 10.0 -o tomtom_out meme_out/meme.txt /nfs/BaRC_datasets/MEME_matrix_databases/Jaspar.meme.2016.txt /nfs/BaRC_datasets/MEME_matrix_databases/MotifDb.matrices.txt /nfs/BaRC_datasets/MEME_matrix_databases/Transfac_2014.1.dat.txt

MEME-ChIP

MEME-ChIP performs motif analysis of large DNA datasets. It is especially appropriate for analyzing the bound genomic regions identified in a transcription factor (TF) ChIP-seq experiment. Note that MEME-ChIP pre-processes the data around the center of each region: "Prior to motif discovery and motif enrichment analysis, MEME-ChIP centers and trims each sequence to 100 bp; the full-length sequences are used in the subsequent motif visualization step." (MEME-ChIP)
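To get a feel for what centering and trimming does to the input, the step can be imitated on a FASTA file. The sketch below is not MEME-ChIP's own code; the function name center_trim_fasta is hypothetical, and leaving sequences shorter than the requested width unchanged is an assumption of this example.

```shell
# Illustrative only: trim each FASTA record on stdin to its central
# `width` bases (default 100). Multi-line records are joined first;
# records no longer than `width` are printed unchanged.
center_trim_fasta() {
    local width="${1:-100}"
    awk -v w="$width" '
        function emit(   len) {
            if (name == "") return
            len = length(seq)
            if (len > w) seq = substr(seq, int((len - w) / 2) + 1, w)
            print name
            print seq
        }
        /^>/ { emit(); name = $0; seq = ""; next }   # new record header
        { seq = seq $0 }                             # accumulate sequence lines
        END { emit() }'
}

# Example usage (file names are placeholders):
# center_trim_fasta 100 < peaks.fa > peaks_center100.fa
```

Comparing the trimmed file with the original shows how much flanking sequence is excluded from motif discovery when the submitted regions are much longer than 100 bp.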

MEME-ChIP Documentation

MEME-ChIP Submission form

Sample files: enrichFileTest AllSequencesTest.txt

Search for all occurrences of known DNA motifs that could represent Transcription Factor Binding Sites (TFBS)

Source of the Motifs:

  • Databases such as TRANSFAC or JASPAR
  • Protein binding microarrays (PBM)
  • TFBS prediction programs

Depending on the source of the motif, the program used to scan for potential binding sites may be different.

TRANSFAC's match - for transcription factor binding sites

  • commercial application requiring a license for the most up-to-date version
  • Whitehead only: See BaRC_datasets/Transfac for the command-line program and (old) data files
    # Search using all Transfac profiles
    match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
    
    # Search using a subset of profiles
    match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf
    
  • Publication: Kel et al., 2003
  • Public web site (older data): http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi

For position weight matrices (PWM) or regular expressions we can use programs like MAST. Most motif prediction programs also have a mode for scanning sequences for occurrences of a given motif.

Example of mast commands:

mast motif.txt Sequence.fasta
mast p53_BMC.txt Promoters.fasta
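When the motif is available only as a consensus string or regular expression rather than a PWM, a quick scan can be done with standard tools. The following is a minimal sketch, not a replacement for MAST: the helper names iupac_to_regex and scan_consensus are hypothetical, only the forward strand is searched, and no score or p-value is computed.

```shell
# Hypothetical helper: translate an IUPAC consensus (e.g. RRRCWWGYYY)
# into a regular expression over A, C, G and T.
iupac_to_regex() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Hypothetical helper: scan FASTA on stdin for a consensus and print
# sequence name, 1-based start and matched site (overlapping hits included).
scan_consensus() {
    local regex
    regex=$(iupac_to_regex "$1")
    awk -v re="$regex" -v OFS='\t' '
        function scan(   off, rest) {
            if (name == "") return
            off = 0; rest = toupper(seq)
            while (match(rest, re)) {
                print name, off + RSTART, substr(rest, RSTART, RLENGTH)
                off += RSTART                    # resume one base past this hit start
                rest = substr(rest, RSTART + 1)
            }
        }
        /^>/ { scan(); name = substr($0, 2); seq = ""; next }
        { seq = seq $0 }
        END { scan() }'
}

# Example usage (file name is a placeholder):
# scan_consensus RRRCWWGYYY < Promoters.fasta > consensus_hits.txt
```

To cover both strands, the same scan can be repeated with the reverse complement of the consensus.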
