De novo search for overrepresented DNA motifs that could represent Transcription Factor Binding Sites (TFBS)
Different methods have different ways of sampling the DNA motifs and estimating their overrepresentation. Searching for all sites of a motif is a related but different task.
Below are links to review articles:
MEME
Based on expectation maximization (deterministic optimization). Spurious motifs can be reduced by filtering the input sequences, for example by fold enrichment and/or by shortening the sequences (e.g. ~200 bp regions around the summits reported by MACS for TFs in ChIP-Seq data).
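The filtering described above can be sketched with awk. This is a minimal sketch, not part of MEME or MACS: the function name `summit_windows` and its default thresholds are hypothetical, and it assumes the peaks are in MACS2 narrowPeak format (column 7 = fold enrichment, column 10 = summit offset from the peak start).

```shell
# Hypothetical helper: keep peaks whose fold enrichment is at least min_fold
# and print a BED interval of +/- half_width bp around each summit.
# Assumes MACS2 narrowPeak input (col 7 = fold enrichment, col 10 = summit offset).
summit_windows() {
    local narrowpeak="$1" min_fold="${2:-5}" half_width="${3:-100}"
    awk -v OFS='\t' -v min_fold="$min_fold" -v hw="$half_width" '
        $7 + 0 >= min_fold + 0 {
            summit = $2 + $10          # absolute summit position (0-based)
            start = summit - hw
            if (start < 0) start = 0   # do not run off the chromosome start
            print $1, start, summit + hw, $4
        }' "$narrowpeak"
}
```

For example, `summit_windows peaks.narrowPeak 5 100 > summits_200bp.bed` would write ~200 bp windows for peaks with at least 5-fold enrichment; the BED file can then be converted to FASTA (for instance with `bedtools getfasta`) before running meme.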
Sample commands:
meme testsmall.FA -oc TEST-OUT -dna
meme seq.fa -minw 6 -maxw 50 -mod oops
# Look for 5 motifs of width 5-15, with 0 or 1 motifs per sequence expected
meme Promoters.fa -dna -oc . -mod zoops -nmotifs 5 -minw 5 -maxw 15 -revcomp
[-oc <output dir>]       name of directory for output files; will replace existing directory
[-dna]                   sequences use DNA alphabet
[-minw <minw>]           minimum motif width
[-maxw <maxw>]           maximum motif width
[-mod oops|zoops|anr]    distribution of motifs
    oops     One per sequence
    zoops    Zero or one per sequence
    anr      Any number
Tomtom can then be run to compare MEME motifs to database(s) of known motifs. It's part of the MEME suite.
tomtom -no-ssc -verbosity 1 -min-overlap 5 -dist pearson -evalue -thresh 10.0 -o tomtom_out meme_out/meme.txt /nfs/BaRC_datasets/MEME_matrix_databases/Jaspar.meme.2016.txt /nfs/BaRC_datasets/MEME_matrix_databases/MotifDb.matrices.txt /nfs/BaRC_datasets/MEME_matrix_databases/Transfac_2014.1.dat.txt
MEME-ChIP
Motif analysis of large DNA datasets. MEME-ChIP is especially appropriate for analyzing the bound genomic regions identified in a transcription factor (TF) ChIP-seq experiment. Note that MEME-ChIP pre-processes the data around the center of each region: "Prior to motif discovery and motif enrichment analysis, MEME-ChIP centers and trims each sequence to 100 bp; the full-length sequences are used in the subsequent motif visualization step." (MEME-ChIP)
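To see what centering and trimming does to the input, the step can be imitated with awk. This is only an illustrative sketch and not MEME-ChIP's own code: the function name `center_trim_fasta` is hypothetical, and the exact rounding used to pick the central window is an assumption.

```shell
# Hypothetical helper: keep the central `width` bases of every FASTA record.
# Records shorter than or equal to `width` are printed unchanged.
# Multi-line records are joined into a single sequence line.
center_trim_fasta() {
    local fasta="$1" width="${2:-100}"
    awk -v w="$width" '
        function flush_record(   len, start) {
            if (header == "") return
            len = length(seq)
            if (len > w) {
                start = int((len - w) / 2) + 1   # 1-based start of central window
                seq = substr(seq, start, w)
            }
            print header
            print seq
        }
        /^>/ { flush_record(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { flush_record() }
    ' "$fasta"
}
```

Running `center_trim_fasta peaks.fa 100 > peaks_100bp.fa` gives a rough idea of which part of each region motif discovery would actually use, which helps when deciding how wide the input regions should be.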
Sample files:
enrichFileTest
AllSequencesTest.txt
Search for all occurrences of known DNA motifs that could represent Transcription Factor Binding Sites (TFBS)
Source of the Motifs:
- Databases such as TRANSFAC or JASPAR
- Protein binding arrays (PBM).
- TFBS prediction programs.
Depending on the source of the motif, the program used to scan for potential binding sites may be different.
TRANSFAC's match - for transcription factor binding sites
- commercial application requiring a license for the most up-to-date version
- Whitehead only: See BaRC_datasets/Transfac for the command-line program and (old) data files
# Search using all Transfac profiles
match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
# Search using a subset of profiles
match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf
- Publication: Kel et al., 2003
- Public web site (older data): http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi
For position weight matrices (PWM) or regular expressions we can use programs like MAST. Most prediction programs have a setting to scan for TFBS for a given motif.
Example of mast commands:
mast motif.txt Sequence.fasta
mast p53_BMC.txt Promoters.fasta
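For the regular-expression case mentioned above, a quick scan can also be sketched with plain awk, without any motif software. This is a minimal sketch under stated assumptions: the function name `scan_regex` is hypothetical, only the forward strand is searched, and the pattern in the usage example is purely illustrative.

```shell
# Hypothetical helper: report every (possibly overlapping) match of a regular
# expression in each FASTA record as: sequence name, 1-based start, matched text.
# Only the forward strand is scanned; sequences are upper-cased first.
scan_regex() {
    local fasta="$1" pattern="$2"
    awk -v pat="$pattern" '
        function scan(   offset, rest) {
            if (name == "") return
            rest = toupper(seq)
            offset = 0
            while (match(rest, pat) && RLENGTH > 0) {
                printf "%s\t%d\t%s\n", name, offset + RSTART, substr(rest, RSTART, RLENGTH)
                offset += RSTART                   # bases consumed so far
                rest = substr(rest, RSTART + 1)    # restart one base after the match start
            }
        }
        /^>/ { scan(); name = substr($1, 2); seq = ""; next }
        { seq = seq $0 }
        END { scan() }
    ' "$fasta"
}
```

For example, `scan_regex Promoters.fasta 'TGA[CG]TCA'` would list forward-strand matches of that pattern. Unlike MAST, this gives no scores or p-values, so it is only suited to a quick check of an exact consensus.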