Version 14 (modified by gbell, 10 years ago)

Searching for patterns, motifs, or profiles in a DNA or protein sequence

This is a traditional bioinformatics task, and many tools approach it in different ways. The main factor in choosing a tool is how you represent what you're looking for: a text pattern, a custom profile, or a dataset of profiles. Identifying enriched sites is a related but separate task.

Search with a pattern (text, with optional choices at some positions)

dreg (EMBOSS suite) - for nucleic acids (where "pattern" is a regular expression)

dreg -pattern "GGCC[ACGT]" -sequence My_promoters.fa -outfile My_promoters.GGCCN.dreg_out.txt
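
To preview what a regular expression like the one above will match, it can help to try it on a short test string first. The following is a minimal sketch using grep rather than dreg; the sequence is made up for illustration, and grep's extended regex syntax is assumed to agree with EMBOSS only for simple patterns like this one.

```shell
# Toy sequence (hypothetical); not taken from My_promoters.fa
seq="TTGGCCATTAGGCCGAA"

# -E enables extended regular expressions;
# -o prints each non-overlapping match on its own line
hits=$(printf '%s\n' "$seq" | grep -oE 'GGCC[ACGT]')
printf '%s\n' "$hits"
```

This only checks the pattern itself; dreg additionally reports match positions for each sequence in the input file.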

preg (EMBOSS suite) - for proteins (where "pattern" is a regular expression)

preg -pattern "LPE[ACS]G" -sequence My_proteins.fa -outfile My_proteins.fa.LPEMG.preg_out.txt

fuzznuc (EMBOSS suite) - for nucleic acids (where "pmismatch" is the maximum number of mismatches allowed in the pattern)

fuzznuc -pattern "nnnGGCCTnnn" -sequence My_promoters.fa -pmismatch 1 -outfile My_promoters.GGCCT.1mis.fuzznuc_out.txt

fuzzpro (EMBOSS suite) - for proteins (where "pmismatch" is the maximum number of mismatches allowed in the pattern)

fuzzpro -pattern "xxxxLPEAGxxxx" -sequence My_proteins.fa -pmismatch 1 -outfile My_proteins.LPEAG.1mis.fuzzpro_out.txt
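
To illustrate what a mismatch threshold means, here is a rough sketch that slides a fixed pattern along a sequence and reports windows with at most a given number of mismatches. It is not how fuzznuc or fuzzpro are implemented; the function name approx_match and the test sequence are hypothetical, and wildcard positions (n or x) are not handled.

```shell
# Hypothetical helper: print "start window mismatches" for every window of
# the sequence that differs from the pattern at no more than max positions.
approx_match() {
  awk -v pat="$1" -v seq="$2" -v max="$3" 'BEGIN {
    n = length(pat)
    for (i = 1; i + n - 1 <= length(seq); i++) {
      mis = 0
      # Compare the window to the pattern one character at a time
      for (j = 1; j <= n; j++)
        if (substr(seq, i + j - 1, 1) != substr(pat, j, 1)) mis++
      if (mis <= max) print i, substr(seq, i, n), mis
    }
  }'
}

# Toy sequence (made up); allow at most 1 mismatch against GGCCT
approx_hits=$(approx_match GGCCT AAGGCCTTTGGACTAA 1)
printf '%s\n' "$approx_hits"
```

With this toy input, the exact site at position 3 and the one-mismatch site GGACT at position 10 are both reported.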

The EMBOSS output format can be changed with the -rformat option. See EMBOSS Report Formats for more details.

Search with a custom profile (a probability matrix, with choices at all positions)

These searches are generally a two-step process: one step to create the profile from your sequences and one step to search with it. Each tool has several detailed options, so check its documentation. You need to align your sequences before you can create a profile.
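
As a rough illustration of what a profile holds, the sketch below tallies per-column base frequencies from a handful of aligned sites. The sites are made up, and this simple frequency table is an assumption for illustration only; prophecy and hmmbuild produce their own, richer formats.

```shell
# Hypothetical aligned sites (equal length, one per line); not real data
sites="GGCCA
GGCCT
GGACT
GGCCT"

# Count each base in each column, then divide by the number of sites
pfm=$(printf '%s\n' "$sites" | awk '
  { n++; len = length($0)
    for (i = 1; i <= len; i++) count[i, substr($0, i, 1)]++ }
  END {
    for (i = 1; i <= len; i++) {
      printf "%d", i
      for (k = 1; k <= 4; k++) {
        b = substr("ACGT", k, 1)
        printf " %s=%.2f", b, count[i, b] / n
      }
      printf "\n"
    }
  }')
printf '%s\n' "$pfm"
```

Each output line is one alignment column, which is why the input sequences must be aligned (same length, same register) before a profile can be built.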

prophecy + profit (EMBOSS suite) - for proteins

prophecy -sequence Aligned_protein_sites.fa -type F -name MyProfile -outfile MyProfile.txt -filter
profit -infile MyProfile.txt -sequence My_proteins.fa -outfile My_proteins.MyProfile.profit_out.txt

HMMER - for proteins or nucleic acids

# Create an HMM from an aligned set of proteins or nucleic acids (fasta or other common format)
hmmbuild MyProfile.hmm Aligned_protein_sites.fa
# Use the HMM to search a fasta file of proteins
hmmsearch MyProfile.hmm Protein_set.fa > Protein_set.MyProfile.hmmsearch_out.txt

JASPAR - for transcription factor binding sites

Search with a dataset of profiles

JASPAR - for transcription factor binding sites

Select a JASPAR CORE database (like Vertebrata) to search your DNA sequence(s) with the set of profiles

TRANSFAC's match - for transcription factor binding sites

commercial application requiring a license for the most up-to-date version

Whitehead only: See BaRC_datasets/Transfac for the command-line program and data files

# Search using all Transfac profiles
match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
# Search using a subset of profiles
match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf

Publication: Kel et al., 2003
Public web site (older data): http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi

HMMER - for proteins or nucleic acids

Use a set of HMMs (like Pfam or another public protein domain resource) to search a fasta file of proteins
```
hmmsearch Pfam-A.hmm MyProteins.fa > MyProteins.Pfam-A.hmm.out.txt
```

Bioconductor packages

TFBSTools
