=== De novo search for overrepresented DNA motifs that could represent Transcription Factor Binding Sites (TFBS) ===

Different methods sample candidate DNA motifs and estimate their overrepresentation in different ways. [http://barcwiki/wiki/SOP/PatternsMotifs Searching for all sites] is a related but different task.

Below are links to review articles:
 * [[http://www.springerlink.com/content/k712222066900072/#section=93033&page=1|Discovering sequence motifs]]
 * [[http://www.biomedcentral.com/1471-2105/8/S7/S21#IDAW3YLD|A survey of DNA motif finding algorithms]]

==== MEME ====

Based on expectation maximization (deterministic optimization). Spurious motifs can be reduced by filtering the input sequences, for example by fold enrichment and/or by shortening the sequences (e.g. ~200 bp regions around the summit reported by MACS for TFs in ChIP-Seq data).

Sample commands:
{{{
meme testsmall.FA -oc TEST-OUT -dna
meme seq.fa -minw 6 -maxw 50 -mod oops
# Look for 5 motifs of width 5-15, with 0 or 1 motifs per sequence expected
meme Promoters.fa -dna -oc . -mod zoops -nmotifs 5 -minw 5 -maxw 15 -revcomp
}}}

{{{
[-oc ]                  name of directory for output files
                        will replace existing directory
[-dna]                  sequences use DNA alphabet
[-minw ]                minimum motif width
[-maxw ]                maximum motif width
[-mod oops|zoops|anr]   distribution of motifs
                          oops    One per sequence
                          zoops   Zero or one per sequence
                          anr     Any number
}}}

Tomtom, which is part of the MEME Suite, can then be run to compare the MEME motifs to one or more databases of known motifs.
{{{
tomtom -no-ssc -verbosity 1 -min-overlap 5 -dist pearson -evalue -thresh 10.0 -o tomtom_out meme_out/meme.txt \
  /nfs/BaRC_datasets/MEME_matrix_databases/Jaspar.meme.2016.txt \
  /nfs/BaRC_datasets/MEME_matrix_databases/MotifDb.matrices.txt \
  /nfs/BaRC_datasets/MEME_matrix_databases/Transfac_2014.1.dat.txt
}}}

==== MEME-ChIP ====

Motif Analysis of Large DNA Datasets.
It is especially appropriate for analyzing the bound genomic regions identified in a transcription factor (TF) ChIP-seq experiment. Note that MEME-ChIP pre-processes the data around the center of each region: "Prior to motif discovery and motif enrichment analysis, MEME-ChIP centers and trims each sequence to 100 bp; the full-length sequences are used in the subsequent motif visualization step."

 * [[http://bioinformatics.oxfordjournals.org/content/27/12/1696.full|MEME-ChIP]]
 * [[http://meme.nbcr.net/meme/memechip-intro.html|MEME-ChIP Documentation]]
 * [[http://meme.nbcr.net/meme/cgi-bin/meme-chip.cgi|MEME-ChIP Submission form]]

Sample files: [[enrichFileTest|enrichFileTest]] [[AllSequencesTest.txt|AllSequencesTest.txt]]
[[br]][[br]]

=== Search for all occurrences of known DNA motifs that could represent Transcription Factor Binding Sites (TFBS) ===

Source of the Motifs:
 * Databases such as TRANSFAC or JASPAR
 * Protein binding microarrays (PBM)
 * TFBS prediction programs

Depending on the source of the motif, the program used to scan for potential binding sites may differ.

[http://gene-regulation.com/Match_command_line.txt TRANSFAC's match] - for transcription factor binding sites
 * commercial application requiring a license for the most up-to-date version
 * Whitehead only: See BaRC_datasets/Transfac for the command-line program and (old) data files
{{{
# Search using all Transfac profiles
match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
# Search using a subset of profiles
match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf
}}}
 * Publication: [http://www.ncbi.nlm.nih.gov/pubmed/12824369 Kel et al., 2003]
 * Public web site (older data): http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi

For position weight matrices (PWM) or regular expressions we can use programs like MAST. Most prediction programs have a setting to scan for TFBS for a given motif.
**Example of mast commands:**
{{{
mast motif.txt Sequence.fasta
mast p53_BMC.txt Promoters.fasta
}}}
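To make the idea of scanning a sequence with a position weight matrix more concrete, here is a minimal Python sketch. It is not MAST's actual algorithm or scoring scheme. The count matrix, the pseudocount, the uniform background, the score threshold, and all function names are hypothetical and chosen only for illustration. The sketch converts counts to log-odds scores and slides the matrix along both strands of a sequence, reporting windows that score at or above the threshold.

```python
import math

BASES = "ACGT"

# Hypothetical count matrix (one row per motif position, columns A, C, G, T).
# The numbers are made up for illustration; they favour the site TGAC.
COUNTS = [
    [0, 1, 0, 9],  # position 1: mostly T
    [1, 0, 9, 0],  # position 2: mostly G
    [8, 0, 1, 1],  # position 3: mostly A
    [0, 9, 0, 1],  # position 4: mostly C
]


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Turn counts into log2(probability / background) scores per position."""
    matrix = []
    for row in counts:
        total = sum(row) + len(BASES) * pseudocount
        matrix.append({
            base: math.log2(((count + pseudocount) / total) / background)
            for base, count in zip(BASES, row)
        })
    return matrix


def reverse_complement(seq):
    """Reverse-complement a DNA string (other characters are left unchanged)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan(seq, pwm, threshold):
    """Return (start, strand, site, score) for windows scoring >= threshold.

    start is 0-based and always given on the forward strand.
    """
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            if any(base not in BASES for base in window):
                continue  # skip windows containing N or other symbols
            score = sum(column[base] for column, base in zip(pwm, window))
            if score >= threshold:
                start = i if strand == "+" else len(s) - width - i
                hits.append((start, strand, window, round(score, 2)))
    return hits


pwm = log_odds_matrix(COUNTS)
for hit in scan("AATGACTT", pwm, threshold=5.0):
    print(hit)
```

In practice the threshold is the hard part: dedicated tools such as MAST or match use statistically motivated cutoffs rather than a hand-picked score like the one assumed here.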