=== De novo search for overrepresented DNA motifs that could represent Transcription Factor Binding Sites (TFBS) ===

Different methods sample DNA motifs and estimate their overrepresentation in different ways. [http://barcwiki/wiki/SOP/PatternsMotifs Searching for all sites] is a related but different task.

Below are links to review articles:
 * [[http://www.springerlink.com/content/k712222066900072/#section=93033&page=1|Discovering sequence motifs]]
 * [[http://www.biomedcentral.com/1471-2105/8/S7/S21#IDAW3YLD|A survey of DNA motif finding algorithms]]

==== MEME ====

The [[http://meme-suite.org/tools/meme|MEME]] function of the [[http://meme-suite.org/index.html|MEME Suite]] uses expectation maximization (deterministic optimization) to perform de novo motif discovery from sequence input. Spurious motifs can be reduced by filtering the input sequences, for example by fold enrichment and/or by shortening the sequences (e.g., to ~200 bp regions around the summit reported by MACS) for TFs in ChIP-seq data.

Sample commands:
{{{
meme testsmall.FA -oc TEST-OUT -dna
meme seq.fa -minw 6 -maxw 50 -mod oops
# Look for 5 motifs of width 5-15, with 0 or 1 motifs per sequence expected
meme Promoters.fa -dna -oc . -mod zoops -nmotifs 5 -minw 5 -maxw 15 -revcomp
}}}
{{{
[-oc ]                  name of directory for output files; will replace existing directory
[-dna]                  sequences use DNA alphabet
[-minw ]                minimum motif width
[-maxw ]                maximum motif width
[-mod oops|zoops|anr]   distribution of motifs
        oops    One per sequence
        zoops   Zero or one per sequence
        anr     Any number
}}}

[[http://meme-suite.org/tools/tomtom|Tomtom]], which is also part of the MEME Suite, can then be run to compare MEME motifs to one or more databases of known motifs.
{{{
tomtom -no-ssc -verbosity 1 -min-overlap 5 -dist pearson -evalue -thresh 10.0 -o tomtom_out \
  meme_out/meme.txt \
  /nfs/BaRC_datasets/MEME_matrix_databases/Jaspar.meme.2016.txt \
  /nfs/BaRC_datasets/MEME_matrix_databases/MotifDb.matrices.txt \
  /nfs/BaRC_datasets/MEME_matrix_databases/Transfac_2014.1.dat.txt
}}}

==== MEME-ChIP ====

MEME-ChIP performs motif analysis of large DNA datasets. It is especially appropriate for analyzing the bound genomic regions identified in a transcription factor (TF) ChIP-seq experiment. Note that MEME-ChIP pre-processes the data around the center of each region: "Prior to motif discovery and motif enrichment analysis, MEME-ChIP centers and trims each sequence to 100 bp; the full-length sequences are used in the subsequent motif visualization step."

 * [[http://bioinformatics.oxfordjournals.org/content/27/12/1696.full | MEME-ChIP]]
 * [[http://meme.nbcr.net/meme/memechip-intro.html|MEME-ChIP Documentation]]
 * [[http://meme.nbcr.net/meme/cgi-bin/meme-chip.cgi|MEME-ChIP Submission form]]

Sample files: [[enrichFileTest|enrichFileTest]] [[AllSequencesTest.txt|AllSequencesTest.txt]][[br]][[br]]

=== Search for all DNA motifs that could represent Transcription Factor Binding Sites (TFBS) ===

Source of the motifs:
 * Databases such as TRANSFAC or [[http://jaspar.genereg.net|JASPAR]]
 * Protein binding arrays (PBM)
 * TFBS prediction programs

Depending on the source of the motif file, the program used to scan for potential binding sites may be different:
 * MEME-format files can be found in /nfs/BaRC_datasets/MEME_matrix_databases
 * TRANSFAC-format files can be found in /nfs/BaRC_datasets/Transfac

[http://gene-regulation.com/Match_command_line.txt TRANSFAC's match] - for transcription factor binding sites
 * Commercial application requiring a license for the most up-to-date version
 * Whitehead only: See BaRC_datasets/Transfac for the command-line program and (old) data files such as the PRF files with matrix subsets of different types.
{{{
# Search using all Transfac profiles
match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
# Search using a subset of profiles
match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf
}}}
 * Publication: [http://www.ncbi.nlm.nih.gov/pubmed/12824369 Kel et al., 2003]
 * Public web site (older data): http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi

For position weight matrices (PWM) or regular expressions we can use programs like [[http://meme-suite.org/tools/mast|MAST]] or [[http://meme-suite.org/doc/fimo.html|FIMO]]. Most prediction programs have a setting to scan for TFBS for a given motif. Be aware that these programs generally predict a large number of potential binding sites, many more than are likely to be functional, especially in any single cell type of interest.

MAST can be run using its web [[http://meme-suite.org/tools/mast|interface]] or from the [[http://meme-suite.org/doc/mast.html|command line]] using one or more motifs in MEME format on a set of sequences like this:
{{{
/usr/local/meme/bin/mast myMotifs.meme mySequences.fa
}}}
When running from the command line, MAST will create a directory called mast_out where it places its output.

FIMO can be run using its web [[http://meme-suite.org/tools/fimo|interface]] or from the [[http://meme-suite.org/doc/fimo.html?man_type=command|command line]] using one or more motifs in MEME format on a set of sequences like this:
{{{
/usr/local/meme/bin/fimo myMotifs.meme mySequences.fa
}}}
When running from the command line, FIMO will create a directory called fimo_out where it places its output.
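
To make concrete what scanning sequences with a position weight matrix involves, and why such scans tend to report many candidate sites, here is a minimal Python sketch. It is only an illustration under stated assumptions: the count matrix `COUNTS`, the function names, the pseudocount, the uniform 0.25 background, and the score threshold are all hypothetical. It is not how MAST or FIMO compute their scores or p-values, and it scans one strand only.

```python
import math

# Hypothetical 4-position count matrix (one dict per motif position).
# These counts are invented for illustration; they are not from any database.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]


def counts_to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts to log2-odds scores.

    Assumes a uniform background frequency; the pseudocount avoids log(0).
    """
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for each window scoring >= threshold.

    Start positions are 0-based. Only the given strand is scanned, and
    windows containing characters other than A/C/G/T are skipped.
    """
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    pwm = counts_to_log_odds(COUNTS)
    for start, window, score in scan("ttACGTggACGAcc", pwm, threshold=3.0):
        print(start, window, round(score, 2))
```

Lowering the threshold admits progressively weaker matches, which is one reason a scan with a short motif over many sequences can return far more candidate sites than are likely to be bound.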