
Version 14 (modified by gbell, 9 years ago)


Searching for patterns, motifs, or profiles in a DNA or protein sequence

This is a traditional bioinformatics task, and many tools do it in a variety of ways. The main factor in choosing a tool is how the thing you're looking for is represented: a text pattern, a custom profile, or a dataset of profiles. Identifying enriched sites is a related but different task.

Search with a pattern (text, with optional choices at some positions)

dreg (EMBOSS suite) - for nucleic acids (where "pattern" is a regular expression)

dreg -pattern "GGCC[ACGT]" -sequence My_promoters.fa -outfile My_promoters.GGCCN.dreg_out.txt

preg (EMBOSS suite) - for proteins (where "pattern" is a regular expression)

preg -pattern "LPE[ACS]G" -sequence My_proteins.fa -outfile My_proteins.fa.LPEMG.preg_out.txt
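
To illustrate what a regular-expression pattern search reports, here is a minimal Python sketch that scans FASTA records and lists the position of each match. It is not how dreg or preg are implemented; the function names (read_fasta, regex_hits) and the tiny example records are made up for illustration, and only non-overlapping matches are reported.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def regex_hits(lines, pattern):
    """Return (name, 1-based start, end, matched text) for each match.

    Matching is case-insensitive and non-overlapping (re.finditer behaviour).
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for name, seq in read_fasta(lines):
        for match in regex.finditer(seq):
            hits.append((name, match.start() + 1, match.end(), match.group()))
    return hits


# Hypothetical input: two short records standing in for My_promoters.fa.
example = [">promoter1", "TTGGCCAAAT", "GGCCT", ">promoter2", "ACGTACGT"]
for hit in regex_hits(example, "GGCC[ACGT]"):
    print(*hit, sep="\t")
```

The same function works for proteins by passing a protein pattern such as "LPE[ACS]G".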

fuzznuc (EMBOSS suite) - for nucleic acids (where "pmismatch" is the number of mismatches allowed when matching the pattern)

fuzznuc -pattern "nnnGGCCTnnn" -sequence My_promoters.fa -pmismatch 1 -outfile My_promoters.GGCCT.1mis.fuzznuc_out.txt

fuzzpro (EMBOSS suite) - for proteins (where "pmismatch" is the number of mismatches allowed when matching the pattern)

fuzzpro -pattern "xxxxLPEAGxxxx" -sequence My_proteins.fa -pmismatch 1 -outfile My_proteins.LPEAG.1mis.fuzzpro_out.txt
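
The idea of a search that tolerates mismatches can be sketched as a sliding window that counts substitutions against the pattern. This is a simplified illustration, not the fuzznuc/fuzzpro algorithm: it handles only fixed-length patterns with a single wildcard character, and the function name mismatch_hits and its parameters are hypothetical.

```python
def mismatch_hits(seq, pattern, max_mismatch=0, wildcard="n"):
    """Return (1-based start, matched text, mismatches) for every window of
    `seq` that matches `pattern` with at most `max_mismatch` substitutions.

    `wildcard` is a pattern character that matches any residue
    ("n" for nucleotides, "x" for proteins).
    """
    seq_u, pat = seq.upper(), pattern.upper()
    wild = wildcard.upper()
    hits = []
    for start in range(len(seq_u) - len(pat) + 1):
        window = seq_u[start:start + len(pat)]
        # Count positions where a non-wildcard pattern character differs.
        mismatches = sum(1 for p, s in zip(pat, window) if p != wild and p != s)
        if mismatches <= max_mismatch:
            hits.append((start + 1, window, mismatches))
    return hits


# Made-up sequences for illustration.
print(mismatch_hits("AAAGGCATTTT", "GGCCT", max_mismatch=1))
print(mismatch_hits("MKLPETGAA", "LPEAG", max_mismatch=1, wildcard="x"))
```

Use wildcard="x" for proteins, because "N" is a real residue (asparagine) in protein sequences.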

The EMBOSS output format can be changed with the -rformat option. See EMBOSS Report Formats for more details.

Search with a custom profile (a probability matrix, with choices at all positions)

These searches are generally a two-step process: one step to create the profile and one step to search with it. Each tool has several detailed options, so check its documentation. Your sequences must be aligned before you can create a profile.
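
The two steps can be sketched in a few lines of Python: build a per-column probability matrix from aligned sites, then slide it along a sequence and score each window. This is a minimal illustration under simple assumptions (ungapped equal-length sites, a pseudocount of 1, a uniform background, log-odds scoring); it does not reproduce the scoring used by prophecy/profit or HMMER, and the function names and example sites are made up.

```python
import math


def build_profile(aligned_sites, alphabet="ACGT", pseudocount=1.0):
    """Step 1: column-wise residue probabilities from equal-length aligned sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("aligned sites must all have the same length")
    profile = []
    for col in range(length):
        counts = {a: pseudocount for a in alphabet}
        for site in aligned_sites:
            residue = site[col].upper()
            if residue in counts:  # gaps and ambiguity codes are skipped
                counts[residue] += 1
        total = sum(counts.values())
        profile.append({a: counts[a] / total for a in alphabet})
    return profile


def best_hit(seq, profile):
    """Step 2: return (1-based start, log-odds score) of the best window, or None."""
    alphabet = list(profile[0])
    background = 1.0 / len(alphabet)  # assumed uniform background
    best = None
    seq = seq.upper()
    for start in range(len(seq) - len(profile) + 1):
        window = seq[start:start + len(profile)]
        if any(ch not in profile[0] for ch in window):
            continue  # skip windows containing characters outside the alphabet
        score = sum(math.log2(profile[i][ch] / background)
                    for i, ch in enumerate(window))
        if best is None or score > best[1]:
            best = (start + 1, score)
    return best


# Hypothetical aligned binding sites and a sequence to scan.
sites = ["GGCCT", "GGCCA", "GGCCT", "GGACT"]
profile = build_profile(sites)
print(best_hit("TTTTGGCCTAAA", profile))
```

The pseudocount keeps residues that never appear in a column from getting a probability of zero, which would otherwise make the log-odds score undefined.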

prophecy + profit (EMBOSS suite) - for proteins

prophecy -sequence Aligned_protein_sites.fa -type F -name MyProfile -outfile MyProfile.txt -filter
profit -infile MyProfile.txt -sequence My_proteins.fa -outfile My_proteins.MyProfile.profit_out.txt

HMMER - for proteins or nucleic acids

# Create an HMM from an aligned set of proteins or nucleic acids (fasta or other common format)
hmmbuild MyProfile.hmm Aligned_protein_sites.fa
# Use the HMM to search a fasta file of proteins
hmmsearch MyProfile.hmm Protein_set.fa > Protein_set.MyProfile.hmmsearch_out.txt

JASPAR - for transcription factor binding sites

Search with a dataset of profiles

JASPAR - for transcription factor binding sites

  • Select a JASPAR CORE database (like Vertebrata) to search your DNA sequence(s) with the set of profiles

TRANSFAC's match - for transcription factor binding sites

  • commercial application requiring a license for the most up-to-date version
  • Whitehead only: See BaRC_datasets/Transfac for the command-line program and data files
    # Search using all Transfac profiles
    match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
    # Search using a subset of profiles
    match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf
    
  • Publication: Kel et al., 2003
  • Public web site (older data): http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi

HMMER - for proteins or nucleic acids

  • Use a set of HMMs (like Pfam or another public protein domain resource) to search a fasta file of proteins
    hmmsearch Pfam-A.hmm MyProteins.fa > MyProteins.Pfam-A.hmm.out.txt
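
Searching with a whole profile database can produce many hits, so it often helps to filter them afterwards. The sketch below assumes hmmsearch was run with its --tblout option (not used in the command above) to write a whitespace-delimited per-sequence table, and that the column order is as described in the HMMER3 user guide (target name, accession, query name, accession, full-sequence E-value, score, ...); check this against your installed version. The function name, the E-value cutoff, and the example lines are made up for illustration.

```python
def parse_tblout(lines, max_evalue=1e-5):
    """Return hits from `hmmsearch --tblout` lines whose full-sequence
    E-value is at most `max_evalue`, sorted from most to least significant.

    For hmmsearch, the "target" is the sequence and the "query" is the HMM.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 18:
            continue  # not a complete table row
        evalue = float(fields[4])
        if evalue <= max_evalue:
            hits.append({
                "target": fields[0],
                "query": fields[2],
                "query_accession": fields[3],
                "evalue": evalue,
                "score": float(fields[5]),
            })
    return sorted(hits, key=lambda hit: hit["evalue"])


# Invented rows in the assumed --tblout layout (not real search results).
example_rows = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "protB - Domain_Y PF99998.1 0.02 12.1 0.0 0.03 11.8 0.0 1.0 1 0 0 1 1 1 0 weak hit",
    "protA - Domain_X PF99999.1 1.2e-30 105.3 0.1 1.5e-30 105.0 0.1 1.0 1 0 0 1 1 1 1 strong hit",
]
for hit in parse_tblout(example_rows):
    print(hit["target"], hit["query"], hit["evalue"], sep="\t")
```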
    

Bioconductor packages
