wiki:SOPs/enriched_tf_binding_sites

De novo search for overrepresented DNA motifs that could represent Transcription Factor Binding Sites (TFBS)

Different methods differ in how they sample DNA motifs and how they estimate motif overrepresentation. Searching for all sites of a known motif is a related but different task.

Below are links to review articles:

MEME

Based on expectation maximization (deterministic optimization). Spurious motifs can be reduced by filtering the input sequences, for example by fold enrichment and/or by shortening the sequences (e.g. ~200 bp regions around the summit reported by MACS for TFs in ChIP-Seq data).
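The trimming step can be sketched in Python as below. This is only an illustration: it assumes a MACS-style summits BED file (tab-delimited: chrom, start, end, name, score) in which the start column is the 0-based summit position, and the function name `summit_windows` and the default half-width of 100 bp are hypothetical choices, not part of MEME or MACS.

```python
def summit_windows(summit_lines, half_width=100):
    """Return (chrom, start, end, name) windows centred on each summit.

    Assumption: each line is a tab-delimited BED record whose second
    column is the 0-based summit position (as in a MACS summits file).
    """
    windows = []
    for line in summit_lines:
        line = line.strip()
        # Skip blank lines and BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, summit = fields[0], int(fields[1])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{summit}"
        start = max(0, summit - half_width)  # clamp at the chromosome start
        end = summit + half_width            # not clamped: chromosome sizes are unknown here
        windows.append((chrom, start, end, name))
    return windows
```

The resulting windows would still need to be converted to FASTA (for example with a genome-aware tool) before being passed to meme.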

Sample commands:

meme testsmall.FA -oc TEST-OUT -dna
meme seq.fa -minw 6 -maxw 50 -mod oops
# Look for 5 motifs of width 5-15, with 0 or 1 motifs per sequence expected
meme Promoters.fa -dna -oc . -mod zoops -nmotifs 5 -minw 5 -maxw 15 -revcomp
[-oc <output dir>]      name of directory for output files; replaces an existing directory
[-dna]                  sequences use DNA alphabet
[-minw <minw>]          minimum motif width
[-maxw <maxw>]          maximum motif width
[-mod oops|zoops|anr]   distribution of motifs
     oops    One per sequence
     zoops   Zero or one per sequence
     anr     Any number

Tomtom can then be run to compare MEME motifs to database(s) of known motifs. It's part of the MEME suite.

tomtom -no-ssc -verbosity 1 -min-overlap 5 -dist pearson -evalue -thresh 10.0 -o tomtom_out meme_out/meme.txt /nfs/BaRC_datasets/MEME_matrix_databases/Jaspar.meme.2016.txt /nfs/BaRC_datasets/MEME_matrix_databases/MotifDb.matrices.txt /nfs/BaRC_datasets/MEME_matrix_databases/Transfac_2014.1.dat.txt

MEME-ChIP

Motif Analysis of Large DNA Datasets. It is especially appropriate for analyzing the bound genomic regions identified in a transcription factor (TF) ChIP-seq experiment. Note that MEME-ChIP pre-processes the data around the center of each region: "Prior to motif discovery and motif enrichment analysis, MEME-ChIP centers and trims each sequence to 100 bp; the full-length sequences are used in the subsequent motif visualization step." (MEME-ChIP)
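To make the effect of the centering and trimming concrete, here is a minimal Python sketch. It is not MEME-ChIP's own code: the function name `center_trim` is hypothetical, and how MEME-ChIP treats sequences shorter than 100 bp or with an odd amount of excess is an assumption here (short sequences are returned unchanged, and the extra base is dropped from the right).

```python
def center_trim(seq, width=100):
    """Keep the central `width` bases of a sequence.

    Assumption: sequences no longer than `width` are returned as-is.
    """
    if len(seq) <= width:
        return seq
    start = (len(seq) - width) // 2  # left offset of the central window
    return seq[start:start + width]
```

In practice this means that a motif lying far from the center of a long peak region will not be seen by the discovery step, so it can be worth checking that the regions are centered on the summit.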

MEME-ChIP Documentation

MEME-ChIP Submission form

FIRE

FIRE can only be applied to several distinct groups of sequences, each sharing a feature that could result from specific binding of TFs (expression pattern, being bound by a TF, etc.). Motifs are selected based on how informative they are in predicting membership in one or more of the groups of sequences. Saying that a motif is overrepresented in a group means that it is overrepresented in that group versus the other groups of sequences; it does not mean it is overrepresented in one group versus the background of that group or that organism. FIRE makes no assumptions about the background sequences and does not have to model the background: it is background independent. In this respect it is very different from other prediction programs. FIRE is based on mutual information.
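The idea of scoring a motif by how informative it is about group membership can be illustrated with a small Python sketch. This shows only the mutual-information calculation between motif presence and group label; FIRE itself does considerably more than this, and the function names, the forward-strand-only regular-expression search, and the toy inputs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length label lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (x_counts[x] * y_counts[y]))
    return mi


def motif_information(motif_regex, sequences, groups):
    """Score a motif by how well its presence predicts the group label.

    Assumption: presence is tested on the given strand only.
    """
    pattern = re.compile(motif_regex)
    presence = [bool(pattern.search(seq)) for seq in sequences]
    return mutual_information(presence, groups)
```

A motif found in every sequence of one group and in none of the other gets the maximum score for two equally sized groups (1 bit), whereas a motif spread evenly across the groups scores 0, regardless of any background model.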

FIRE Web site FIRE Paper[PDF]

Sequences are divided into several groups, e.g. corresponding to different expression profiles, or bound by different TFs in ChIP-Seq experiments. One input file has all the sequences in FASTA format; the other input file lists the sequence names followed by the group they belong to, like:

ID	cluster
sequenceName1	1
sequenceName2	1
sequenceName3	2
sequenceName4	2

Sample command:

fire.pl  --expfiles=enrichFileTest  --exptype=discrete --fastafile_dna=AllSequencesTest.txt --nodups=1

This script generates the file specifying the groups (enrichment file) and the file containing the sequences: PrepareFilesForFIRE_keepall.pl

Sample files: enrichFileTest AllSequencesTest.txt
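For readers who want to build the two input files by hand, the following Python sketch produces text in the layout described above. It is not the PrepareFilesForFIRE_keepall.pl script; the function name `fire_inputs` is hypothetical, and the tab delimiter and the "ID cluster" header line are assumptions taken from the example shown on this page.

```python
def fire_inputs(sequences, groups):
    """Return (enrichment_text, fasta_text) for the two input files.

    sequences: dict mapping sequence name -> DNA sequence
    groups:    dict mapping sequence name -> group (cluster) label
    Assumption: the enrichment file is tab-delimited with an "ID cluster" header.
    """
    missing = [name for name in groups if name not in sequences]
    if missing:
        raise ValueError("no sequence for: " + ", ".join(missing))
    enrichment_lines = ["ID\tcluster"]
    fasta_lines = []
    for name, cluster in groups.items():
        enrichment_lines.append(f"{name}\t{cluster}")
        fasta_lines.append(f">{name}")
        fasta_lines.append(sequences[name])
    return "\n".join(enrichment_lines) + "\n", "\n".join(fasta_lines) + "\n"
```

The two returned strings can then be written to files such as the enrichFileTest and AllSequencesTest.txt names used in the sample command.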

OTHER PROGRAMS YOU MAY WANT TO EXPLORE

Amadeus

Amadeus paper

Weeder and YMF

These programs enumerate the n-mers and look for overrepresentation of the n-mers versus background. See review articles.
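The enumeration idea can be sketched in Python as follows. This is a deliberately simple frequency-ratio score with a pseudocount, not the statistics that Weeder or YMF actually use; the function names and the default pseudocount are assumptions for illustration.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer made only of A, C, G, T (overlapping windows)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def kmer_enrichment(foreground, background, k, pseudocount=1.0):
    """Rank foreground k-mers by frequency ratio versus background."""
    fg = kmer_counts(foreground, k)
    bg = kmer_counts(background, k)
    n_kmers = 4 ** k
    fg_total = sum(fg.values()) + pseudocount * n_kmers
    bg_total = sum(bg.values()) + pseudocount * n_kmers
    scores = {}
    for kmer, count in fg.items():
        fg_freq = (count + pseudocount) / fg_total
        bg_freq = (bg[kmer] + pseudocount) / bg_total  # bg[kmer] is 0 if unseen
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A real tool would also consider both strands, allow mismatches, and attach a significance estimate to each n-mer, none of which is attempted here.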

Gibbs sampling

Based on probabilistic optimization. See review articles.

ConTra

Combines multiz alignments (from UCSC) and PWMs from JASPAR and TRANSFAC, to predict TFBS. ConTra

TRAP

TF Affinity Prediction (TRAP) uses binding affinities to predict associations between TFs and co-regulated genes. PASTAA: identifying transcription factors associated with sets of co-regulated genes

RSA-Tools: Peak-motifs

RSA-Tools: Peak-motifs discovers motifs in ChIP-Seq peak sequences.

TAMO

TAMO : motif discovery package (incl. interfaces to other motif searching tools, e.g. MEME) along with integration of expression and other databases.

WebMOTIFS

WebMOTIFS : motif discovery using TAMO and other tools (e.g. MEME).

Scan for TFBS using known motifs

  1. Source of the Motifs:
    • Databases: Transfac, Jaspar, other.
    • Protein binding arrays (PBM).
    • TFBS prediction programs.

  2. Depending on the source of the motif the program used to scan will be different.
    • For position weight matrices (PWM) or regular expressions we can use programs like MAST or FIRE. Most prediction programs have a setting to scan for TFBS for a given motif.

Example of a mast command: (files are on the system_testing -> Aug2010_Testing folder)

mast motif.txt  Sequence.fasta
mast p53_BMC.txt  Sengupta.fasta
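As a rough illustration of what scanning a sequence with a PWM involves, the Python sketch below converts per-position base counts into log-odds scores and reports windows that score above a threshold. MAST's own scoring and p-value calculations are more involved and are not reproduced here; the function names, the uniform background of 0.25, the pseudocount and the threshold are all assumptions.

```python
import math

BASES = "ACGT"


def log_odds_pwm(count_columns, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2-odds scores.

    count_columns: list of dicts (base -> count), one per motif position.
    Assumption: a uniform background frequency for all four bases.
    """
    pwm = []
    for column in count_columns:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan_pwm(seq, pwm, threshold):
    """Return (start, site, score) for forward-strand windows scoring >= threshold."""
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other symbols
        score = sum(pwm[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i, window, score))
    return hits
```

Choosing the threshold is the hard part in practice, which is one reason to rely on a tool that reports p-values rather than on raw scores.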

Example of a FIRE command

fire.pl --expfiles=groups.txt --exptype=discrete --fastafile_dna=FileWithSeqsFASTa.txt --nodups=1 --doskipdiscovery=1 --motiffile_dna=dnamotifs.txt

INPUT FILES

groups.txt defines the sequences in each group

ID	     cluster
sequenceName1	1
sequenceName2	1
sequenceName3	2
sequenceName4	2

FileWithSeqsFASTa.txt: has all the sequences in fasta format

dnamotifs.txt: has the DNA motifs, e.g. .AGATA[AT]..

  • For other inputs, like files coming from PBM, using a PWM is a simplification that throws out part of the data. It is more appropriate to use a dedicated script.
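Regular-expression motifs of the kind listed in dnamotifs.txt above can also be checked quickly against a handful of sequences with a short Python sketch. This is not how FIRE scans internally; the function names are hypothetical, and treating the motif as a Python regular expression applied to both strands is an assumption.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (other symbols are kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def scan_motif(motif_regex, sequences):
    """Return (name, start, strand, site) for every motif occurrence.

    sequences: dict mapping sequence name -> DNA sequence.
    Starts are 0-based positions on the forward strand; overlapping
    occurrences are reported thanks to the lookahead wrapper.
    """
    pattern = re.compile(f"(?=({motif_regex}))")
    hits = []
    for name, seq in sequences.items():
        seq = seq.upper()
        for match in pattern.finditer(seq):
            hits.append((name, match.start(), "+", match.group(1)))
        for match in pattern.finditer(reverse_complement(seq)):
            site = match.group(1)
            # Map the reverse-strand position back to forward coordinates.
            start = len(seq) - (match.start() + len(site))
            hits.append((name, start, "-", site))
    return hits
```

For example, `scan_motif("AGATA[AT]", {"s1": "TTAGATAAGG"})` reports one site on the "+" strand starting at position 2.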