Version 13 (modified by gbell, 10 years ago)
--

Searching for patterns, motifs, or profiles in a DNA or protein sequence

This is a traditional bioinformatics task, and many tools approach it in different ways. The main factor in choosing a tool is how the thing you're looking for is represented: as a text pattern, as a custom profile, or as a dataset of profiles. Identifying enriched sites is a related but different task.

Search with a pattern (text, with optional choices at some positions)

dreg (EMBOSS suite) - for nucleic acids (where "pattern" is a regular expression)

dreg -pattern "GGCC[ACGT]" -sequence My_promoters.fa -outfile My_promoters.GGCCN.dreg_out.txt

preg (EMBOSS suite) - for proteins (where "pattern" is a regular expression)

preg -pattern "LPE[ACS]G" -sequence My_proteins.fa -outfile My_proteins.fa.LPEMG.preg_out.txt
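To make the idea of a regular-expression pattern search concrete, here is a minimal Python sketch. It is not how dreg or preg are implemented; it only illustrates the concept. The toy FASTA text and the helper names read_fasta and find_pattern are assumptions made for this example, and only the sequence as given is searched (no reverse strand).

```python
import re

def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict (uppercased)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records

def find_pattern(records, pattern):
    """Return (name, start, end, match) tuples with 1-based inclusive coordinates.

    re.finditer reports non-overlapping matches only.
    """
    regex = re.compile(pattern)
    hits = []
    for name, seq in records.items():
        for m in regex.finditer(seq):
            hits.append((name, m.start() + 1, m.end(), m.group()))
    return hits

# Hypothetical toy input standing in for a file such as My_promoters.fa
fasta = """>promoter1
ttaGGCCAtt
>promoter2
ACGTACGT
"""

hits = find_pattern(read_fasta(fasta), "GGCC[ACGT]")
for hit in hits:
    print(*hit, sep="\t")
```

The same function works for proteins by passing a protein pattern such as "LPE[ACS]G".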

fuzznuc (EMBOSS suite) - for nucleic acids (where "pmismatch" is the number of mismatches allowed in the pattern)

fuzznuc -pattern "nnnGGCCTnnn" -sequence My_promoters.fa -pmismatch 1 -outfile My_promoters.GGCCT.1mis.fuzznuc_out.txt

fuzzpro (EMBOSS suite) - for proteins (where "pmismatch" is the number of mismatches allowed in the pattern)

fuzzpro -pattern "xxxxLPEAGxxxx" -sequence My_proteins.fa -pmismatch 1 -outfile My_proteins.LPEAG.1mis.fuzzpro_out.txt
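As an illustration of what searching with allowed mismatches means, the following Python sketch slides the pattern along the sequence and counts mismatching positions. It is a simplified stand-in, not the fuzznuc or fuzzpro algorithm: the function name fuzzy_find is hypothetical, and only a single wildcard character is supported (N here; pass wildcard="X" for proteins), not the full set of ambiguity codes.

```python
def fuzzy_find(seq, pattern, max_mismatches=1, wildcard="N"):
    """Return (start, window, mismatches) for each window within the mismatch limit.

    Start positions are 1-based. Wildcard positions in the pattern match
    any character and never count as mismatches.
    """
    seq = seq.upper()
    pattern = pattern.upper()
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        mismatches = sum(
            1 for p, s in zip(pattern, window) if p != wildcard and p != s
        )
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits

# Toy sequence (an assumption for this example) with one mismatch to GGCCT
hits = fuzzy_find("AAAGGCGTAAATTT", "nnnGGCCTnnn", max_mismatches=1)
for start, window, mismatches in hits:
    print(start, window, mismatches, sep="\t")
```

The flanking wildcards, as in the commands above, simply require that a hit has sequence context on both sides.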

The EMBOSS output format can be changed with the -rformat option (for example, -rformat gff). See EMBOSS Report Formats for more details.

Search with a custom profile (a probability matrix, with choices at all positions)

These searches are generally a two-step process: one step to create the motif and one step to search with it. Each tool has many detailed options, so check its documentation.
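The two steps can be sketched in a few lines of Python. This is a generic position-probability-matrix example under stated assumptions, not the method used by prophecy, profit, or HMMER: the aligned toy sites, the pseudocount of 1, the uniform background of 0.25, and the score cutoff are all choices made for illustration.

```python
import math

def build_profile(sites, alphabet="ACGT", pseudocount=1.0):
    """Step 1: turn equal-length aligned sites into per-position probabilities."""
    length = len(sites[0])
    profile = []
    for i in range(length):
        column = [site[i] for site in sites]
        total = len(column) + pseudocount * len(alphabet)
        profile.append(
            {a: (column.count(a) + pseudocount) / total for a in alphabet}
        )
    return profile

def score_window(profile, window, background=0.25):
    """Sum of log2 odds (profile probability vs. uniform background)."""
    return sum(
        math.log2(col[base] / background) for col, base in zip(profile, window)
    )

def scan(profile, seq, min_score):
    """Step 2: score every window; keep (start, window, score) above the cutoff.

    Start positions are 1-based. Characters outside the alphabet (such as N)
    are not handled in this sketch.
    """
    width = len(profile)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = score_window(profile, window)
        if score >= min_score:
            hits.append((start + 1, window, round(score, 2)))
    return hits

# Hypothetical aligned binding sites
sites = ["GGCCA", "GGCCT", "GGCCA", "GACCA"]
profile = build_profile(sites)
hits = scan(profile, "TTGGCCATT", min_score=4.0)
print(hits)
```

Unlike a text pattern, every position contributes a graded score, so a window can still score well when it differs from the consensus at a weakly conserved position.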

prophecy + profit (EMBOSS suite) - for proteins

prophecy -sequence Aligned_protein_sites.fa -type F -name MyProfile -outfile MyProfile.txt -filter
profit -infile MyProfile.txt -sequence My_proteins.fa -outfile My_proteins.MyProfile.profit_out.txt

HMMER - for proteins or nucleic acids

# Create an HMM from an aligned set of proteins or nucleic acids (fasta or other common format)
hmmbuild MyProfile.hmm Aligned_protein_sites.fa
# Use the HMM to search a fasta file of proteins
hmmsearch MyProfile.hmm Protein_set.fa > Protein_set.MyProfile.hmmsearch_out.txt

JASPAR - for transcription factor binding sites

Search with a dataset of profiles

JASPAR - for transcription factor binding sites

Select a JASPAR CORE database (like Vertebrata) to search your DNA sequence(s) with the set of profiles
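Conceptually, searching with a dataset of profiles repeats the single-profile scan once per profile. The Python sketch below shows that loop under stated assumptions: the profile names toyTF_A and toyTF_B and their count matrices are invented for this example and are not JASPAR data, and the scoring (log2 odds with a pseudocount of 0.5 and a uniform background) is one common choice rather than what any particular tool uses.

```python
import math

def counts_from_consensus(consensus, n=8):
    """Hypothetical helper: a count matrix where all n sites equal the consensus."""
    return [{b: (n if b == c else 0) for b in "ACGT"} for c in consensus]

def log_odds(counts, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(column)
        matrix.append(
            {b: math.log2((n + pseudocount) / total / background)
             for b, n in column.items()}
        )
    return matrix

def best_hit(matrix, seq):
    """Return (1-based start, score) of the top-scoring window, or None if seq is too short."""
    width = len(matrix)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(col[b] for col, b in zip(matrix, window))
        if best is None or score > best[1]:
            best = (start + 1, score)
    return best

# Toy profile set (invented for this example, not real JASPAR matrices)
profiles = {
    "toyTF_A": counts_from_consensus("GCC"),
    "toyTF_B": counts_from_consensus("TATA"),
}

seq = "ACGTGCCATATAC"
results = {
    name: best_hit(log_odds(counts), seq) for name, counts in profiles.items()
}
for name, (start, score) in results.items():
    print(name, start, round(score, 2), sep="\t")
```

In practice each profile usually needs its own score cutoff, since matrices of different widths and information content produce scores on different scales.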

TRANSFAC's match - for transcription factor binding sites

Commercial application; the most up-to-date version requires a license

Whitehead only: See BaRC_datasets/Transfac for the command-line program and data files

# Search using all Transfac profiles
match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
# Search using a subset of profiles
match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf

Publication: Kel et al., 2003
Public web site (older data): http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi

HMMER - for proteins or nucleic acids

Use a set of HMMs (like Pfam or another public protein domain resource) to search a fasta file of proteins
```
hmmsearch Pfam-A.hmm MyProteins.fa > MyProteins.Pfam-A.hmm.out.txt
```

Bioconductor packages

TFBSTools
