
Searching for patterns, motifs, or profiles in a DNA or protein sequence

This is a traditional bioinformatics task, and many tools approach it in different ways. The main factor in choosing a tool is how you represent what you're looking for: a text pattern, a custom profile, or a dataset of profiles. Identifying enriched sites is a related but different task.

Search with a pattern (text, with optional choices at some positions)

dreg (EMBOSS suite) - for nucleic acids (where "pattern" is a regular expression)

dreg -pattern "GGCC[ACGT]" -sequence My_promoters.fa -outfile My_promoters.GGCCN.dreg_out.txt

preg (EMBOSS suite) - for proteins (where "pattern" is a regular expression)

preg -pattern "LPE[ACS]G" -sequence My_proteins.fa -outfile My_proteins.fa.LPEMG.preg_out.txt
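To make the idea of a regular-expression pattern concrete, here is a minimal Python sketch of the same kind of search. It is not how dreg or preg are implemented; the function name find_pattern and the example sequence are made up for illustration. It scans only the strand it is given and reports non-overlapping matches.

```python
import re


def find_pattern(sequence, pattern):
    """Return (start, end, matched_text) for each non-overlapping match.

    Coordinates are 1-based and inclusive. Only the given strand is scanned.
    """
    hits = []
    # Upper-case the sequence so soft-masked (lower-case) bases still match.
    for match in re.finditer(pattern, sequence.upper()):
        hits.append((match.start() + 1, match.end(), match.group()))
    return hits


if __name__ == "__main__":
    # Hypothetical promoter fragment, for illustration only.
    promoter = "ttGGCCAtaGGCCGGCCT"
    for start, end, text in find_pattern(promoter, "GGCC[ACGT]"):
        print(start, end, text)
```

Because re.finditer skips overlapping matches, the third GGCC in the example (which overlaps the second hit) is not reported.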

fuzznuc (EMBOSS suite) - for nucleic acids (where "pmismatch" is the maximum number of mismatches allowed in the pattern)

fuzznuc -pattern "nnnGGCCTnnn" -sequence My_promoters.fa -pmismatch 1 -outfile My_promoters.GGCCT.1mis.fuzznuc_out.txt

fuzzpro (EMBOSS suite) - for proteins (where "pmismatch" is the maximum number of mismatches allowed in the pattern)

fuzzpro -pattern "xxxxLPEAGxxxx" -sequence My_proteins.fa -pmismatch 1 -outfile My_proteins.LPEAG.1mis.fuzzpro_out.txt
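What "allowing mismatches" means can be sketched in a few lines of Python. This is a simplified illustration, not the fuzznuc or fuzzpro algorithm: the function name find_with_mismatches is hypothetical, a single wildcard character (N by default, pass "X" for proteins) matches anything, and other ambiguity codes are not handled.

```python
def find_with_mismatches(sequence, pattern, max_mismatches, wildcard="N"):
    """Return (start, matched_text, n_mismatches) for every window of the
    sequence that differs from the pattern at no more than max_mismatches
    positions. Start coordinates are 1-based. Wildcard positions in the
    pattern never count as mismatches.
    """
    sequence = sequence.upper()
    pattern = pattern.upper()
    width = len(pattern)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(
            1 for p, s in zip(pattern, window) if p != wildcard and p != s
        )
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits


if __name__ == "__main__":
    # Hypothetical sequence, for illustration only.
    for hit in find_with_mismatches("AAAGGCCTAAAGGCATAAA", "GGCCT", 1):
        print(hit)
```

With max_mismatches set to 0 this reduces to an exact search; each extra allowed mismatch widens the set of accepted sites, so expect more hits and more false positives.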

EMBOSS output format can be changed with the -rformat option. See EMBOSS Report Formats for more details.

Search with a custom profile (a probability matrix, with choices at all positions)

These searches are generally a two-step process: one step to create the profile and one step to search with it. Your sequences must be aligned before you can create a profile. Each tool has several detailed options, so check its documentation.
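The two steps can be sketched in Python as follows. This is a toy illustration under stated assumptions, not what prophecy, profit, HMMER, or MEME compute: the function names are hypothetical, the profile is a plain per-column frequency table built from gap-free aligned sites of equal length, and a window's score is simply the sum of the frequencies of its residues (real tools typically use log-odds scores against a background model).

```python
from collections import Counter


def build_frequency_matrix(aligned_sites):
    """Step 1: turn aligned, equal-length sites into per-column frequencies."""
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("All sites must be aligned to the same length")
    matrix = []
    for column in range(width):
        counts = Counter(site[column] for site in aligned_sites)
        matrix.append(
            {residue: n / len(aligned_sites) for residue, n in counts.items()}
        )
    return matrix


def scan(sequence, matrix):
    """Step 2: score every window of the sequence against the matrix.

    Returns (start, window, score) tuples with 1-based start coordinates.
    Residues never seen in a column contribute 0 to the score.
    """
    width = len(matrix)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i].get(residue, 0.0) for i, residue in enumerate(window))
        results.append((start + 1, window, score))
    return results


def best_hit(sequence, matrix):
    """Return the highest-scoring window (first one wins on ties)."""
    return max(scan(sequence, matrix), key=lambda hit: hit[2])


if __name__ == "__main__":
    # Hypothetical aligned sites and target sequence, for illustration only.
    sites = ["LPEAG", "LPECG", "LPESG", "LPEAG"]
    profile = build_frequency_matrix(sites)
    print(best_hit("MKLPEAGTT", profile))
```

Unlike a text pattern, the profile weights each choice at a position by how often it was observed, so a site with the common residue scores higher than one with a rare residue.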

prophecy + profit (EMBOSS suite) - for proteins

prophecy -sequence Aligned_protein_sites.fa -type F -name MyProfile -outfile MyProfile.txt -filter
profit -infile MyProfile.txt -sequence My_proteins.fa -outfile My_proteins.MyProfile.profit_out.txt

HMMER - for proteins or nucleic acids

# Create an HMM from an alignment of proteins or nucleic acids (aligned fasta or other common alignment format)
hmmbuild MyProfile.hmm Aligned_protein_sites.fa
# Use the HMM to search a fasta file of proteins
hmmsearch MyProfile.hmm Protein_set.fa > Protein_set.MyProfile.hmmsearch_out.txt

JASPAR - for transcription factor binding sites

MEME Suite - for proteins or nucleic acids

  • Build a PWM from a series of sequences using the meme command
    meme Sequence_set.fa -protein -nmotifs 5 -minw 8 -maxw 12 -o output_directory
    

Program options:

-protein: specifies your input sequences are amino acids.
-dna: specifies your input sequences are DNA nucleotides.
-nmotifs: maximum number of motifs to look for in your sequences
-minw, -maxw: minimum and maximum motif widths

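  • Use the MAST program within the MEME Suite to search a FASTA file of protein sequences with the motif(s) produced by the meme command above. In the example, output_directory/meme.txt is the motif file written by that meme run and My_proteins.fa stands for your protein FASTA file. The results go to a separate directory because -o will not write into a directory that already exists.
    mast output_directory/meme.txt My_proteins.fa -minseqs 100 -m 1 -comp -ev 10 -o mast_output_directory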
    

Program options:

-minseqs: Sets a lower bound on the number of sequences in the database, which MAST uses when calculating E-values.
-m: Restricts the search to the specified motif in the motif file (here, motif 1) instead of using all motifs. Motif-selection options have changed between MEME Suite releases, so check the MAST documentation for your version.
-comp: Adjusts p-values and E-values for sequence composition. This can improve search selectivity when erroneous matches are due to biased sequence composition.
-ev: MAST only displays sequences whose E-values are below the threshold given here. By default, sequences with E-values less than 10 are displayed. If your motifs are very short or have low information content (are not very specific), it may be impossible for any sequence to achieve a low E-value.
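Why short motifs struggle to reach a low E-value can be shown with a rough back-of-the-envelope model. The Python sketch below is not MAST's actual calculation. It assumes an exact-match motif, equally likely letters, independent positions, and an E-value equal to the per-sequence p-value multiplied by the number of sequences searched; the function name best_case_evalue is hypothetical.

```python
def best_case_evalue(motif_width, seq_length, n_sequences, alphabet_size=4):
    """Approximate E-value of a sequence containing one perfect motif match.

    Simplifying assumptions (not MAST's real model): every letter is equally
    likely, positions are independent, and E-value = p-value * n_sequences.
    """
    # Probability that a random window matches the motif exactly.
    p_site = alphabet_size ** -motif_width
    # Number of windows in which the motif could start.
    n_positions = seq_length - motif_width + 1
    # Probability that a random sequence has at least one match by chance.
    p_sequence = 1 - (1 - p_site) ** n_positions
    return p_sequence * n_sequences


if __name__ == "__main__":
    for width in (4, 8, 12):
        print(width, best_case_evalue(width, seq_length=1000, n_sequences=100))
```

Under these assumptions a 4-letter DNA motif is expected by chance in almost every 1000-base sequence, so even a perfect match cannot fall below an E-value threshold of 10 when 100 sequences are searched, while a 12-letter motif can.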

Search with a dataset of profiles

JASPAR - for transcription factor binding sites

  • Select a JASPAR CORE database (like Vertebrata) to search your DNA sequence(s) with the set of profiles

TRANSFAC's match - for transcription factor binding sites

  • commercial application requiring a license for the most up-to-date version
  • Whitehead only: See BaRC_datasets/Transfac for the command-line program and data files
    # Search using all Transfac profiles
    match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
    # Search using a subset of profiles
    match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf
    
  • Publication: Kel et al., 2003
  • Public web site (older data): http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi

HMMER - for proteins or nucleic acids

  • Use a set of HMMs (like Pfam or another public protein domain resource) to search a fasta file of proteins
    hmmsearch Pfam-A.hmm MyProteins.fa > MyProteins.Pfam-A.hmm.out.txt
    

Bioconductor packages
