De novo search for overrepresented DNA motifs that could represent Transcription Factor Binding Sites (TFBS)
Different methods sample candidate DNA motifs and estimate their overrepresentation in different ways. Searching for all sites that match a motif is a related but different task.
Below are links to review articles:
The MEME program of the MEME Suite uses expectation maximization (a deterministic optimization) to perform de novo motif discovery from input sequences. For TF ChIP-Seq data, spurious motifs can be reduced by filtering the input sequences, for example by fold enrichment and/or by shortening each sequence (e.g., to a ~200 bp region around the summit reported by MACS).
meme testsmall.FA -oc TEST-OUT -dna
meme seq.fa -minw 6 -maxw 50 -mod oops
# Look for 5 motifs of width 5-15, with 0 or 1 motifs per sequence expected
meme Promoters.fa -dna -oc . -mod zoops -nmotifs 5 -minw 5 -maxw 15 -revcomp
[-oc <output dir>]      name of directory for output files; will replace existing directory
[-dna]                  sequences use DNA alphabet
[-minw <minw>]          minimum motif width
[-maxw <maxw>]          maximum motif width
[-mod oops|zoops|anr]   distribution of motifs
                        oops    One per sequence
                        zoops   Zero or one per sequence
                        anr     Any number
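The input filtering suggested for MEME above (keep enriched peaks, shorten each to roughly 200 bp around the summit) can be sketched as a small awk filter. This is a minimal sketch, not part of MACS or MEME: the function name summit_windows is hypothetical, and it assumes MACS2 narrowPeak input (column 7 = fold enrichment, column 10 = summit offset from the peak start), which the text above does not specify.

```shell
# Hypothetical helper (not part of MACS or the MEME Suite).
# Reads narrowPeak lines on stdin and prints BED windows of +/- flank bp
# around each summit, keeping only peaks with fold enrichment >= min_fold.
# Assumed columns: 1 chrom, 2 start, 4 name, 7 fold enrichment, 10 summit offset.
# Usage: summit_windows [flank] [min_fold] < peaks.narrowPeak
summit_windows() {
    awk -v flank="${1:-100}" -v min_fold="${2:-0}" 'BEGIN { FS = OFS = "\t" }
    $7 + 0 >= min_fold + 0 {
        summit = $2 + $10              # absolute summit position
        start = summit - flank
        if (start < 0) start = 0       # do not run past the chromosome start
        print $1, start, summit + flank, $4, $7
    }'
}
```

The resulting BED lines could then be converted to FASTA with a tool such as bedtools getfasta before running meme. The window end is not clamped to the chromosome length here, so regions near chromosome ends may need extra care.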
Tomtom can then be run to compare MEME motifs to database(s) of known motifs. It is part of the MEME Suite.
tomtom -no-ssc -verbosity 1 -min-overlap 5 -dist pearson -evalue -thresh 10.0 -o tomtom_out \
    meme_out/meme.txt \
    /nfs/BaRC_datasets/MEME_matrix_databases/Jaspar.meme.2016.txt \
    /nfs/BaRC_datasets/MEME_matrix_databases/MotifDb.matrices.txt \
    /nfs/BaRC_datasets/MEME_matrix_databases/Transfac_2014.1.dat.txt
MEME-ChIP: Motif Analysis of Large DNA Datasets. It is especially appropriate for analyzing the bound genomic regions identified in a transcription factor (TF) ChIP-seq experiment. Note that MEME-ChIP pre-processes the data around the center of each region: "Prior to motif discovery and motif enrichment analysis, MEME-ChIP centers and trims each sequence to 100 bp; the full-length sequences are used in the subsequent motif visualization step."
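To get a feel for what the centering and trimming step quoted above does to the input, the idea can be imitated on a FASTA file. This is only an illustrative sketch under assumptions, not MEME-ChIP's own implementation: the function name center_trim is hypothetical, and sequences shorter than the target width are simply left unchanged.

```shell
# Hypothetical helper illustrating "center and trim each sequence to N bp".
# Reads FASTA on stdin (sequences may span several lines) and prints each
# record with only the central `width` bases (default 100).
# Usage: center_trim [width] < regions.fa
center_trim() {
    awk -v width="${1:-100}" '
    function emit(    len) {
        if (name == "") return                 # nothing read yet
        len = length(seq)
        if (len > width)                       # keep the middle `width` bases
            seq = substr(seq, int((len - width) / 2) + 1, width)
        print name
        print seq
    }
    /^>/ { emit(); name = $0; seq = ""; next } # new record: flush the previous one
    { seq = seq $0 }                           # join wrapped sequence lines
    END { emit() }'
}
```

Because the trimmed window is taken around the midpoint of each region, regions whose ChIP-seq summit is far from the midpoint may lose the bound site, which is one reason to supply summit-centered regions.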
Search for all DNA motifs that could represent Transcription Factor Binding Sites (TFBS)
Source of the Motifs:
- Databases such as TRANSFAC or JASPAR
- Protein binding microarrays (PBM).
- TFBS prediction programs.
Depending on the source of the motif file, the program used to scan for potential binding sites may be different:
- MEME-format files can be found in /nfs/BaRC_datasets/MEME_matrix_databases
- TRANSFAC-format files can be found in /nfs/BaRC_datasets/Transfac
TRANSFAC's match - for transcription factor binding sites
- commercial application requiring a license for the most up-to-date version
- Whitehead only: See BaRC_datasets/Transfac for the command-line program and (old) data files such as the PRF files with matrix subsets of different types.
# Search using all Transfac profiles
match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
# Search using a subset of profiles
match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf
- Publication: Kel et al., 2003
- Public web site (older data): http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi
For position weight matrices (PWM) or regular expressions we can use programs like MAST or FIMO. Most prediction programs have a setting to scan for TFBS for a given motif. Be aware that these programs generally predict a large number of potential binding sites, many more than are likely to be functional, especially in any single cell type of interest.
/usr/local/meme/bin/mast myMotifs.meme mySequences.fa
When running from the command line, MAST will create a directory called mast_out where it places its output.
/usr/local/meme/bin/fimo myMotifs.meme mySequences.fa
When running from the command line, FIMO will create a directory called fimo_out where it places its output.
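Since scans like these tend to report many more sites than are likely to be functional, one common first step is to keep only the more significant matches. The following is a minimal sketch under assumptions: the function name filter_fimo is hypothetical, and it assumes a tab-separated FIMO results table whose first line is a header containing a column named q-value (the exact file name and column order vary between MEME Suite versions, so the column is located by name).

```shell
# Hypothetical helper (not part of the MEME Suite).
# Reads a tab-separated FIMO results table on stdin and keeps the header
# plus rows whose q-value is <= max_q (default 0.05).
# Usage: filter_fimo [max_q] < fimo_results_table
filter_fimo() {
    awk -v max_q="${1:-0.05}" 'BEGIN { FS = "\t" }
    NR == 1 {
        # Find the q-value column by its header name, then keep the header.
        for (i = 1; i <= NF; i++) if ($i == "q-value") col = i
        print
        next
    }
    # Skip blank lines, trailing "#" comment lines, and rows with no q-value.
    col && NF >= col && $0 !~ /^#/ && $col != "" && $col + 0 <= max_q + 0
    '
}
```

A q-value cutoff is only a statistical filter. It does not show that a site is bound in a given cell type, so additional evidence (for example ChIP-seq peaks or conservation) is still needed.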