Changes between Initial Version and Version 1 of SOPs/enriched_tf_binding_sites


Timestamp:
01/23/13 16:49:43 (12 years ago)
Author:
trac
Comment:

--

  • SOPs/enriched_tf_binding_sites

    v1 v1  
     1=== De novo search for overrepresented DNA motifs that could represent Transcription Factor Binding Sites (TFBS) ===
     2Different methods sample DNA motifs and estimate their overrepresentation in different ways.
     3
     4Below are links to review articles:
     5
     6 * [[http://www.springerlink.com/content/k712222066900072/#section=93033&page=1|Discovering sequence motifs]]
     7 * [[http://www.biomedcentral.com/1471-2105/8/S7/S21#IDAW3YLD|A survey of DNA motif finding algorithms]]
     8
     9==== MEME ====
     10MEME is based on expectation maximization (deterministic optimization).
     11
     12Sample commands:
     13
     14{{{
     15meme testsmall.FA -oc TEST-OUT -dna
     16meme seq.fa -minw 6 -maxw 50 -mod oops
     17}}}
     18
     19{{{
     20[-oc <output dir>]      name of directory for output files will replace existing directory
     21[-dna]                  sequences use DNA alphabet
     22[-minw <minw>]          minimum motif width
     23[-maxw <maxw>]          maximum motif width
     24[-mod oops|zoops|anr]   distribution of motifs
     25     oops    One per sequence
     26     zoops   Zero or one per sequence
     27     anr     Any number of repetitions
     28}}}
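To make the three `-mod` settings concrete, the sketch below checks which site distributions a list of per-sequence motif counts is consistent with. The function name `compatible_models` and the example counts are hypothetical and are not part of MEME; this only illustrates the assumptions, not how MEME fits each model.

```python
def compatible_models(site_counts):
    """List the -mod values whose assumption holds for these per-sequence counts."""
    models = ["anr"]                      # any number of sites is always allowed
    if all(c <= 1 for c in site_counts):
        models.append("zoops")            # zero or one site per sequence
    if all(c == 1 for c in site_counts):
        models.append("oops")             # exactly one site per sequence
    return models


print(compatible_models([1, 1, 1]))   # every sequence has exactly one site
print(compatible_models([1, 0, 1]))   # some sequences lack a site
print(compatible_models([2, 0, 1]))   # one sequence carries repeated sites
```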
     29==== MEME-ChIP ====
     30MEME-ChIP performs motif analysis of large DNA datasets. It is especially appropriate for analyzing the bound genomic regions identified in a transcription factor (TF) ChIP-seq experiment. Note that MEME-ChIP pre-processes the data around the center of each region: "Prior to motif discovery and motif enrichment analysis, MEME-ChIP centers and trims each sequence to 100 bp; the full-length sequences are used in the subsequent motif visualization step." [[http://bioinformatics.oxfordjournals.org/content/27/12/1696.full | MEME-ChIP]]
     31
     32
     33[[http://meme.nbcr.net/meme/memechip-intro.html|MEME-ChIP Documentation]]
     34
     35[[http://meme.nbcr.net/meme/cgi-bin/meme-chip.cgi|MEME-ChIP Submission form]]
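The centering and trimming step quoted above can be sketched as follows. This is a minimal illustration, not MEME-ChIP's code: the function name `center_trim`, the handling of odd lengths, and the example region are assumptions.

```python
def center_trim(seq, width=100):
    """Return the central `width` bases of seq (the whole seq if it is shorter)."""
    if len(seq) <= width:
        return seq
    start = (len(seq) - width) // 2   # assumption: round the left offset down
    return seq[start:start + width]


# Hypothetical 150 bp peak region with a site near the middle.
region = "A" * 70 + "TGACTCA" + "C" * 73
trimmed = center_trim(region)
print(len(trimmed))
```

Sites far from the center of a long region are lost by this step, which is worth keeping in mind when peaks are broad.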
     36
     37
     38
     39
     40==== FIRE ====
     41FIRE can only be applied to several distinct groups of sequences that share a feature that could result from specific TF binding (expression pattern, being bound by a TF, etc.). Motifs are selected based on how informative they are in predicting membership in one or more of the groups. Saying that a motif is overrepresented in a group means that it is overrepresented in that group relative to the other groups of sequences; it does not mean that it is overrepresented relative to the background of that group or of that organism. FIRE makes no assumptions about the background sequences and does not have to model the background: it is background independent, which makes it very different from other prediction programs. FIRE is based on mutual information.
     42
     43[[https://tavazoielab.c2b2.columbia.edu/FIRE/|FIRE Web site]] [[https://tavazoielab.c2b2.columbia.edu/lab/publications/Elemento_etal_Mol_Cell_2007.pdf|FIRE Paper[PDF]]]
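The idea of scoring a motif by how informative it is about group membership can be sketched with a plain mutual information calculation. This is only an illustration under stated assumptions: the sequences, the motif, and the function name are hypothetical, only the forward strand is checked, and FIRE's own search and significance testing are not reproduced.

```python
import re
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length lists of labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y)))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


# Hypothetical toy data: two groups of two sequences each.
seqs = {"s1": "TTAGATAAGG", "s2": "CCAGATATCC",
        "s3": "GGGGCCCCGG", "s4": "ACGTACGTAC"}
groups = {"s1": 1, "s2": 1, "s3": 2, "s4": 2}

names = sorted(seqs)
present = [bool(re.search("AGATA[AT]", seqs[n])) for n in names]
labels = [groups[n] for n in names]
mi = mutual_information(present, labels)
print(mi)
```

Here the motif occurs in every sequence of group 1 and in none of group 2, so knowing whether it is present fully determines the group; no background model was needed to reach that conclusion.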
     44
     45
     46Sequences are divided into several groups, //e.g.// corresponding to different expression profiles, or bound by different TFs in ChIP-Seq experiments.
     47One input file has all the sequences in FASTA format; the other input file lists each sequence name followed by the group it belongs to, like:
     48
     49ID           cluster
     50
     51sequenceName1   1
     52sequenceName2   1
     53sequenceName3   2
     54sequenceName4   2
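A group file in this layout can be written with a few lines of Python. This is a sketch under assumptions: the tab delimiter, the uniqueness check on names, and the function name `write_groups` are not taken from the FIRE documentation.

```python
import csv
import io


def write_groups(assignments, handle):
    """Write (sequence name, group) pairs as a two-column table with a header."""
    names = [name for name, _ in assignments]
    if len(names) != len(set(names)):        # assumption: names must be unique
        raise ValueError("sequence names must be unique")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "cluster"])
    writer.writerows(assignments)


buffer = io.StringIO()                       # stands in for an open file
write_groups([("sequenceName1", 1), ("sequenceName2", 1),
              ("sequenceName3", 2), ("sequenceName4", 2)], buffer)
print(buffer.getvalue(), end="")
```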
     55
     56
     57Sample command:
     58
     59{{{
     60fire.pl  --expfiles=enrichFileTest  --exptype=discrete --fastafile_dna=AllSequencesTest.txt --nodups=1
     61}}}
     62
     63This script generates the file specifying the groups (enrichment file) and the file containing the sequences: [[PrepareFilesForFIRE_keepall.txt|"PrepareFilesForFIRE_keepall.pl"]]
     64 
     65
     66Sample files:
     67[[enrichFileTest|enrichFileTest]]
     68[[AllSequencesTest.txt|AllSequencesTest.txt]][[br]][[br]]
     69
     70==== OTHER PROGRAMS YOU MAY WANT TO EXPLORE ====
     71===== Amadeus =====
     72[[http://bioinfo-out.curie.fr/training/CGH-PATHWAYworkshop/pathway_charting_materials/amadeusPaper.pdf|Amadeus paper]]
     73
     74===== Weeder and YMF =====
     75These programs enumerate n-mers and look for n-mers that are overrepresented versus the background.  [[http://iona/barcwiki/doku.php?id=identifying_all_and_or_enriched_transcription_factor_binding_sites|See review articles]]
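The enumeration idea can be sketched as counting every n-mer in a foreground and a background set and comparing frequencies. The ratio with a pseudocount used here is an assumption chosen for brevity; it is not the statistic that Weeder or YMF actually compute, and the sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def enrichment(foreground, background, k, pseudo=1.0):
    """Foreground/background frequency ratio for each foreground k-mer."""
    fc, bc = kmer_counts(foreground, k), kmer_counts(background, k)
    ft, bt = sum(fc.values()), sum(bc.values())
    return {kmer: ((fc[kmer] + pseudo) / (ft + pseudo))
                  / ((bc[kmer] + pseudo) / (bt + pseudo))
            for kmer in fc}


# Hypothetical toy sequences.
fg = ["TGACTCAT", "ATGACTCA"]
bg = ["ACGTACGT", "GGCCGGCC"]
scores = enrichment(fg, bg, k=7)
top = max(scores, key=scores.get)
print(top, scores[top])
```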
     76 
     77
     78===== Gibbs sampling =====
     79Gibbs sampling is based on probabilistic optimization.  [[http://iona/barcwiki/doku.php?id=identifying_all_and_or_enriched_transcription_factor_binding_sites|See review articles]]
     80 
     81===== ConTra =====
     82
     83ConTra combines multiz alignments (from UCSC) with PWMs from JASPAR and TRANSFAC to predict TFBS. [[http://bioit.dmbr.ugent.be/contrav2/index.php|ConTra]]
     84
     85
     86===== TRAP =====
     87
     88[[http://trap.molgen.mpg.de/cgi-bin/home.cgi | TF Affinity Prediction (TRAP)]] uses binding affinities to predict associations between TFs and co-regulated genes. [[http://bioinformatics.oxfordjournals.org/content/25/4/435.full|PASTAA: identifying transcription factors associated with sets of co-regulated genes]]
     89
     90=== Scan for TFBS using known motifs  ===
     911. Source of the motifs:
     92       * Databases: TRANSFAC, JASPAR, others.
     93       * Protein binding arrays (PBM).
     94       * TFBS prediction programs.
     95 
     96       
     972. The program used to scan depends on the source of the motif.
     98    * For position weight matrices (PWMs) or regular expressions, programs like MAST or FIRE can be used. Most prediction programs have a setting to scan for TFBS with a given motif.
     99   
     100      **Example of a MAST command:** (files are in the system_testing -> Aug2010_Testing folder)
     101{{{
     102mast motif.txt  Sequence.fasta
     103mast p53_BMC.txt  Sengupta.fasta
     104 }}}
     105     
     106**Example of a FIRE command**
     107{{{
     108fire.pl --expfiles=groups.txt --exptype=discrete --fastafile_dna=FileWithSeqsFASTa.txt --nodups=1 --doskipdiscovery=1 --motiffile_dna=dnamotifs.txt
     109 }}}
     110     
     111INPUT FILES[[br]][[br]]
     112     
     113//groups.txt// defines the sequences in each group
     114     
     115{{{
     116ID           cluster
     117sequenceName1   1
     118sequenceName2   1
     119sequenceName3   2
     120sequenceName4   2
     121}}}
     122     
     123//FileWithSeqsFASTa.txt//:
     124has all the sequences in FASTA format[[br]][[br]]
     125     
     126//dnamotifs.txt//:
     127has the DNA motifs, //e.g.// .AGATA[AT]..
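A motif written this way can be tried out on a sequence before running FIRE. The sketch assumes that the motif syntax behaves like a Python regular expression (`.` for any base, `[AT]` for either base) and scans the forward strand only; the function name `scan` and the test sequence are hypothetical.

```python
import re


def scan(motif, seq):
    """Return 0-based start positions of all matches, including overlapping ones."""
    pattern = re.compile("(?=(" + motif + "))")   # lookahead keeps overlaps
    return [m.start() for m in pattern.finditer(seq)]


hits = scan(".AGATA[AT]..", "CCTAGATAAGGTTAGATATCC")
print(hits)
```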
     128
     129
     130
     131
     132    * For other inputs, such as files coming from PBM experiments, using a PWM is a simplification that discards part of the data. It is more appropriate to use a dedicated script.
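For the PWM case mentioned above, scanning amounts to sliding the matrix along the sequence and scoring each window. The sketch below uses a log-odds score against a uniform background, which is an assumption; it does not reproduce MAST's scoring or p-values, and the toy matrix and function name `best_pwm_hit` are hypothetical.

```python
from math import log2


def best_pwm_hit(pwm, seq, background=0.25):
    """Return (score, position) of the best-scoring window on the forward strand.

    pwm is a list of dicts, one per motif position, mapping base -> probability.
    """
    width = len(pwm)
    best = None
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        score = sum(log2(col[base] / background)
                    for col, base in zip(pwm, window))
        if best is None or score > best[0]:
            best = (score, i)
    return best


def column(consensus):
    """Toy PWM column: 0.85 for the consensus base, 0.05 for each other base."""
    return {b: (0.85 if b == consensus else 0.05) for b in "ACGT"}


pwm = [column(b) for b in "GATA"]
score, position = best_pwm_hit(pwm, "CCGATACC")
print(position, round(score, 3))
```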