wiki:SOPs/enriched_tf_binding_sites

De novo search for overrepresented DNA motifs that could represent Transcription Factor Binding Sites (TFBS)

Different methods sample DNA motifs and estimate their overrepresentation in different ways. Searching for all sites that match known motifs is a related but different task.

Below are links to review articles:

MEME

The MEME program of the MEME Suite uses expectation maximization (deterministic optimization) to perform de novo motif discovery from sequence input. Spurious motifs can be reduced by filtering the input sequences, for example by keeping only regions with high fold enrichment and/or by shortening the sequences (e.g., ~200 bp regions around the peak summits reported by MACS for TF ChIP-seq data).
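The summit-based trimming mentioned above can be sketched in shell as follows. This is a minimal sketch, assuming a MACS2-style summits BED file with five tab-separated columns (chromosome, summit start, summit end, peak name, score). The function name summit_windows and the file names in the usage comment are hypothetical, and ranking by the score column is only one possible filter.

```shell
# summit_windows SUMMITS_BED [FLANK] [TOP_N]
# Keep the TOP_N highest-scoring summits and print a BED window of
# FLANK bp on each side of every summit (2 * FLANK bp in total).
# Assumes a MACS2-style summits file: chrom, start, end, name, score.
summit_windows() {
    local summits="$1" flank="${2:-100}" top="${3:-500}"
    # Sort by score (column 5), highest first, then build the windows.
    sort -t "$(printf '\t')" -k5,5gr "$summits" |
    awk -F '\t' -v OFS='\t' -v f="$flank" -v top="$top" '
        NR <= top {
            s = $2 - f
            if (s < 0) s = 0    # windows near a chromosome start are clipped
            print $1, s, $2 + f, $4, $5
        }'
}

# Usage (file names are placeholders):
#   summit_windows sample_summits.bed 100 500 > top500_200bp.bed
#   bedtools getfasta -fi genome.fa -bed top500_200bp.bed -fo top500_200bp.fa
#   meme top500_200bp.fa -dna -oc meme_out -mod zoops -revcomp
```

Windows are not clipped at chromosome ends here, so it may be worth checking the resulting BED file against the chromosome sizes before extracting sequences.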

Sample commands:

meme testsmall.FA -oc TEST-OUT -dna
meme seq.fa -minw 6 -maxw 50 -mod oops
# Look for 5 motifs of width 5-15, expecting zero or one occurrence of each motif per sequence, searching both strands
meme Promoters.fa -dna -oc . -mod zoops -nmotifs 5 -minw 5 -maxw 15 -revcomp

Selected options:

[-oc <output dir>]      name of directory for output files; replaces an existing directory
[-dna]                  sequences use DNA alphabet
[-minw <minw>]          minimum motif width
[-maxw <maxw>]          maximum motif width
[-mod oops|zoops|anr]   distribution of motifs
     oops    One occurrence per sequence
     zoops   Zero or one occurrence per sequence
     anr     Any number of repetitions

Tomtom, which is also part of the MEME Suite, can then be run to compare the motifs found by MEME against one or more databases of known motifs.

tomtom -no-ssc -verbosity 1 -min-overlap 5 -dist pearson -evalue -thresh 10.0 -o tomtom_out \
    meme_out/meme.txt \
    /nfs/BaRC_datasets/MEME_matrix_databases/Jaspar.meme.2016.txt \
    /nfs/BaRC_datasets/MEME_matrix_databases/MotifDb.matrices.txt \
    /nfs/BaRC_datasets/MEME_matrix_databases/Transfac_2014.1.dat.txt
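With several databases and a permissive E-value threshold, the Tomtom result table can be long. The sketch below pulls out the best database match for each query motif. It assumes the tab-separated tomtom.tsv written by recent MEME Suite releases, with header columns named Query_ID, Target_ID and q-value (older releases write a differently laid out text file). The function name tomtom_best_hits is hypothetical.

```shell
# tomtom_best_hits TOMTOM_TSV [MAX_Q]
# For each query motif, print the target with the smallest q-value,
# provided that q-value is <= MAX_Q (default 0.05).
# Columns are located by header name, so the column order does not matter.
tomtom_best_hits() {
    local tsv="$1" max_q="${2:-0.05}"
    awk -F '\t' -v OFS='\t' -v maxq="$max_q" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Query_ID")  qc = i
                if ($i == "Target_ID") tc = i
                if ($i == "q-value")   vc = i
            }
            if (!qc || !tc || !vc) {
                print "unexpected header in Tomtom table" > "/dev/stderr"
                bad = 1
                exit 1
            }
            next
        }
        /^#/ || NF < vc { next }          # skip trailing comments and blank lines
        {
            q = $vc + 0
            if (!($qc in best)) order[++n] = $qc   # remember first-seen order
            if (!($qc in best) || q < best[$qc]) {
                best[$qc] = q; qstr[$qc] = $vc; target[$qc] = $tc
            }
        }
        END {
            if (bad) exit 1
            for (i = 1; i <= n; i++) {
                id = order[i]
                if (best[id] <= maxq + 0) print id, target[id], qstr[id]
            }
        }
    ' "$tsv"
}

# Usage (path assumes the -o tomtom_out option used above):
#   tomtom_best_hits tomtom_out/tomtom.tsv 0.05
```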

MEME-ChIP

MEME-ChIP performs motif analysis of large DNA datasets. It is especially appropriate for analyzing the bound genomic regions identified in a transcription factor (TF) ChIP-seq experiment. Note that MEME-ChIP pre-processes the data around the center of each region: "Prior to motif discovery and motif enrichment analysis, MEME-ChIP centers and trims each sequence to 100 bp; the full-length sequences are used in the subsequent motif visualization step." (MEME-ChIP)

MEME-ChIP Documentation

MEME-ChIP Submission form

Sample files: enrichFileTest AllSequencesTest.txt
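To preview roughly which part of each region MEME-ChIP uses for motif discovery, the centering and trimming step quoted above can be approximated as follows. This is a sketch under stated assumptions, not MEME-ChIP's own code: it keeps the central 100 bp of every FASTA record, and the exact rounding MEME-ChIP applies when the excess length is odd may differ. The function name center_trim_fasta is hypothetical.

```shell
# center_trim_fasta FASTA [WIDTH]
# Print each FASTA record trimmed to its central WIDTH bases (default 100).
# Records shorter than WIDTH are printed unchanged; multi-line sequences
# are joined onto a single line.
center_trim_fasta() {
    local fasta="$1" width="${2:-100}"
    awk -v w="$width" '
        function emit(   n) {
            if (name == "") return
            n = length(seq)
            # Drop an equal number of bases from both ends (left side
            # rounded down when the excess is odd).
            if (n > w) seq = substr(seq, int((n - w) / 2) + 1, w)
            print name
            print seq
        }
        /^>/ { emit(); name = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$fasta"
}

# Usage (file name is a placeholder):
#   center_trim_fasta peaks.fa 100 > peaks_center100.fa
```

This matters when peaks are wide or when the summit is not near the middle of the region: a motif outside the central window will not contribute to motif discovery.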

Search for all DNA motifs that could represent Transcription Factor Binding Sites (TFBS)

Source of the Motifs:

  • Databases such as TRANSFAC or JASPAR
  • Protein binding microarrays (PBMs)
  • TFBS prediction programs

Depending on the source and format of the motif file, a different program may be needed to scan for potential binding sites:

  • MEME-format files can be found in /nfs/BaRC_datasets/MEME_matrix_databases
  • TRANSFAC-format files can be found in /nfs/BaRC_datasets/Transfac

TRANSFAC's match - for transcription factor binding sites

  • commercial application requiring a license for the most up-to-date version
  • Whitehead only: See BaRC_datasets/Transfac for the command-line program and (old) data files such as the PRF files with matrix subsets of different types.
    # Search using all Transfac profiles
    match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
    
    # Search using a subset of profiles
    match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf
    
  • Publication: Kel et al., 2003
  • Public web site (older data; requires registration): http://gene-regulation.com/pub/programs.html#match

For position weight matrices (PWMs) or regular expressions, we can use programs like MAST or FIMO. Most motif prediction programs also have a setting to scan for TFBS matching a given motif. Be aware that these programs generally predict a large number of potential binding sites, many more than are likely to be functional, especially in any single cell type of interest.

MAST can be run using its web interface (https://meme-suite.org/meme/tools/mast) or from the command line (documentation: https://meme-suite.org/meme/doc/mast.html?man_type=web), using one or more motifs in MEME format on a set of sequences, like this:

mast myMotifs.meme mySequences.fa

When running from the command line, MAST will create a directory called mast_out where it places its output.

FIMO can be run using its web interface (https://meme-suite.org/meme/tools/fimo) or from the command line (documentation: https://meme-suite.org/meme/doc/fimo.html?man_type=web), using one or more motifs in MEME format on a set of sequences, like this:

fimo myMotifs.meme mySequences.fa

When running from the command line, FIMO will create a directory called fimo_out where it places its output.
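Because these scans tend to report many more sites than are likely to be functional, a common first step is to keep only the matches below a q-value cutoff. The sketch below assumes the tab-separated fimo.tsv written by recent MEME Suite releases, with a header column named q-value (older releases write fimo.txt with a different header, and q-values may be empty in some output modes). The function name fimo_filter and the 0.05 default are illustrative, not a recommendation.

```shell
# fimo_filter FIMO_TSV [MAX_Q]
# Print the header line plus every match whose q-value is <= MAX_Q
# (default 0.05). The q-value column is located by its header name.
fimo_filter() {
    local tsv="$1" max_q="${2:-0.05}"
    awk -F '\t' -v maxq="$max_q" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "q-value") col = i
            if (!col) {
                print "no q-value column found" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        # Skip trailing comment lines, blank lines and rows without a q-value.
        /^#/ || NF < col || $col == "" { next }
        ($col + 0) <= (maxq + 0)
    ' "$tsv"
}

# Usage (path assumes the default fimo_out directory):
#   fimo_filter fimo_out/fimo.tsv 0.05 > fimo_q05.tsv
```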
