= Searching for patterns, motifs, or profiles in a DNA or protein sequence =

This is a traditional bioinformatics task, and many tools do this in a variety of ways.  The choice of tool depends mainly on how you represent what you're looking for: a text pattern, a custom profile, or a dataset of profiles.  [http://barcwiki/wiki/SOPs/enriched_tf_binding_sites Identifying enriched sites] is a related but different task.

== Search with a pattern (text, with optional choices at some positions) ==

[http://emboss.sourceforge.net/apps/cvs/emboss/apps/dreg.html dreg] (EMBOSS suite) - for nucleic acids (where "pattern" is a regular expression)
{{{
dreg -pattern "GGCC[ACGT]" -sequence My_promoters.fa -outfile My_promoters.GGCCN.dreg_out.txt
}}}

[http://emboss.sourceforge.net/apps/cvs/emboss/apps/preg.html preg] (EMBOSS suite) - for proteins (where "pattern" is a regular expression)
{{{
preg -pattern "LPE[ACS]G" -sequence My_proteins.fa -outfile My_proteins.fa.LPEMG.preg_out.txt
}}}
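
To make the regular-expression patterns used above concrete, here is a minimal Python sketch of the same kind of search on a single sequence.  It is only an illustration, not how dreg or preg are implemented: the function name find_pattern is hypothetical, matches are non-overlapping, only the given strand is searched, and positions are reported as 1-based inclusive coordinates.

```python
import re


def find_pattern(sequence, pattern):
    """Return (start, end, matched_text) for each non-overlapping regex match.

    Coordinates are 1-based and inclusive. This is an illustrative helper,
    not part of EMBOSS.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(sequence)]


# Nucleotide example, using the same pattern as the dreg command
print(find_pattern("TTGGCCAAGGCCTT", "GGCC[ACGT]"))
# -> [(3, 7, 'GGCCA'), (9, 13, 'GGCCT')]

# Protein example, using the same pattern as the preg command
print(find_pattern("MKLPEAGTTLPESGQ", "LPE[ACS]G"))
# -> [(3, 7, 'LPEAG'), (10, 14, 'LPESG')]
```

For whole FASTA files, the EMBOSS commands above remain the simpler choice; the sketch is mainly useful for checking what a pattern will and will not match.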

[http://emboss.sourceforge.net/apps/cvs/emboss/apps/fuzznuc.html fuzznuc] (EMBOSS suite) - for nucleic acids (where "pmismatch" is the maximum number of mismatches allowed in the pattern)
{{{
fuzznuc -pattern "nnnGGCCTnnn" -sequence My_promoters.fa -pmismatch 1 -outfile My_promoters.GGCCT.1mis.fuzznuc_out.txt
}}}

[http://emboss.sourceforge.net/apps/cvs/emboss/apps/fuzzpro.html fuzzpro] (EMBOSS suite) - for proteins (where "pmismatch" is the maximum number of mismatches allowed in the pattern)
{{{
fuzzpro -pattern "xxxxLPEAGxxxx" -sequence My_proteins.fa -pmismatch 1 -outfile My_proteins.LPEAG.1mis.fuzzpro_out.txt
}}}
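
The idea of a pattern search that tolerates mismatches can be sketched in a few lines of Python.  This is an assumption-laden illustration rather than the fuzznuc algorithm: the function name fuzzy_find is hypothetical, only substitutions are counted (no insertions or deletions), only the given strand is searched, and an ambiguous base in the sequence itself (such as N) is counted as a mismatch.

```python
# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def fuzzy_find(sequence, pattern, max_mismatches=0):
    """Return (start, window, mismatches) for windows within the mismatch limit.

    start is 1-based. Illustrative helper only, not part of EMBOSS.
    """
    sequence = sequence.upper()
    pattern = pattern.upper()
    width = len(pattern)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # A position mismatches when the base is not allowed by the IUPAC code
        mismatches = sum(
            1 for base, code in zip(window, pattern) if base not in IUPAC[code]
        )
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits


# Same pattern as the fuzznuc command; the site GGACT differs from GGCCT by one base
print(fuzzy_find("AAAGGACTAAA", "nnnGGCCTnnn", max_mismatches=1))
# -> [(1, 'AAAGGACTAAA', 1)]
```

Raising the mismatch limit quickly increases the number of chance hits for short patterns, so it is worth starting with 0 or 1.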


''EMBOSS output format can be changed with the -rformat option.  See [http://emboss.sourceforge.net/docs/themes/ReportFormats.html EMBOSS Report Formats] for more details.''

== Search with a custom profile (a probability matrix, with choices at all positions) ==

These searches are generally a two-step process: one step to create the profile and one step to search with it.  Each tool has several detailed options, so check the documentation.  You need to [http://barcwiki/wiki/SOPs/multipleSequenceAlignment align your sequences] before you can create a profile.
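
The two steps can be illustrated with a small Python sketch: build a position-specific scoring matrix from gap-free aligned sites, then slide it along a sequence.  This is a generic log-odds profile under stated assumptions (uniform background frequencies, a pseudocount of 1, sequences containing only letters from the given alphabet), not the scoring scheme used by prophecy, profit, HMMER, or MEME.  The function names build_profile and scan and the example sites are hypothetical.

```python
import math
from collections import Counter


def build_profile(aligned_sites, alphabet, pseudocount=1.0):
    """Step 1: per-column log-odds scores from gap-free aligned sites."""
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    background = 1.0 / len(alphabet)  # assumption: uniform background
    profile = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            letter: math.log2(((counts[letter] + pseudocount) / total) / background)
            for letter in alphabet
        })
    return profile


def scan(sequence, profile):
    """Step 2: score every window; return (1-based start, window, score)."""
    width = len(profile)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[letter] for column, letter in zip(profile, window))
        results.append((start + 1, window, score))
    return results


# Made-up aligned sites, for illustration only
sites = ["GGCCT", "GGCCA", "GGCCT", "GACCT"]
profile = build_profile(sites, "ACGT")
best = max(scan("TTAGGCCTTA", profile), key=lambda hit: hit[2])
print(best[:2])
# -> (4, 'GGCCT')
```

Unlike a text pattern, a profile gives every window a score, so you also need to choose a score threshold; the tools below handle this with their own statistics.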

[http://emboss.sourceforge.net/apps/cvs/emboss/apps/prophecy.html prophecy] + [http://emboss.sourceforge.net/apps/cvs/emboss/apps/profit.html profit] (EMBOSS suite) - for proteins
{{{
prophecy -sequence Aligned_protein_sites.fa -type F -name MyProfile -outfile MyProfile.txt -filter
profit -infile MyProfile.txt -sequence My_proteins.fa -outfile My_proteins.MyProfile.profit_out.txt
}}}

[http://hmmer.org/ HMMER] - for proteins or nucleic acids
{{{
# Create an HMM from a multiple sequence alignment of proteins or nucleic acids (aligned fasta or other common alignment format)
hmmbuild MyProfile.hmm Aligned_protein_sites.fa
# Use the HMM to search a fasta file of proteins
hmmsearch MyProfile.hmm Protein_set.fa > Protein_set.MyProfile.hmmsearch_out.txt
}}}

[http://jaspar.genereg.net/ JASPAR] - for transcription factor binding sites

[http://meme-suite.org/ MEME Suite] - for proteins or nucleic acids
 * Build a PWM from a series of sequences using the [http://meme-suite.org/tools/meme meme] command
{{{
meme Sequence_set.fa -protein -nmotifs 5 -minw 8 -maxw 12 -o output_directory
}}}

Program options:
    -protein:   specifies that your input sequences are amino acids.\\
    -dna:   specifies that your input sequences are DNA nucleotides.\\
    -nmotifs:  sets the maximum number of motifs to find in your sequences.\\
    -minw, -maxw:           set the minimum and maximum motif widths.\\

 * Search a FASTA file of protein sequences for your motif with [http://meme-suite.org/doc/mast.html MAST] (part of the MEME Suite), using the motif file written by the meme command above (meme.txt in its output directory).
{{{
mast output_directory/meme.txt My_proteins.fa -minseqs 100 -m 1 -comp -ev 10 -o mast_output_directory
}}}

Program options:
   -minseqs: sets a lower bound on the number of sequences in the database.\\
   -m:            restricts the search to the given motif in your motif file (here, motif 1).\\
   -comp:     adjusts for sequence composition, which can improve search selectivity when erroneous matches are due to biased sequence composition.\\
   -ev:           sets the E-value threshold: MAST displays only sequences whose matches have E-values below this value (default: 10). If your motifs are very short or have low information content (are not very specific), it may be impossible for any sequence to achieve a low E-value.\\
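
To see why short motifs struggle to reach a low E-value, a back-of-the-envelope estimate helps.  The Python sketch below computes the expected number of exact chance matches to a fully specified motif in random sequence with uniform letter frequencies.  This is a simplifying assumption for intuition only, not MAST's E-value calculation; the function name and the database size (100 proteins of 400 residues, about 40,000 positions) are made up for the example.

```python
def expected_chance_matches(motif_width, total_positions, alphabet_size):
    """Expected exact matches to one fully specified motif in random sequence.

    Assumes uniform, independent letter frequencies and ignores edge effects.
    """
    return total_positions / alphabet_size ** motif_width


# Hypothetical database: 100 proteins x 400 residues = 40,000 positions
for width in (3, 5, 8):
    expected = expected_chance_matches(width, 40000, 20)
    print(width, expected)
```

Under these assumptions a 3-residue motif is expected to occur about 5 times by chance, while a 5-residue motif is expected about 0.0125 times, so only the longer motif can produce matches that stand out from the background.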

== Search with a dataset of profiles ==

[http://jaspar.genereg.net/ JASPAR] - for transcription factor binding sites
  * Select a JASPAR CORE database (like [http://jaspar.genereg.net/cgi-bin/jaspar_db.pl?rm=browse&db=core&tax_group=vertebrates "Vertebrata"]) to search your DNA sequence(s) with the set of profiles

[http://www.biobase-international.com/wp-content/uploads/2012/03/Match_command_line.txt TRANSFAC's match] - for transcription factor binding sites
  * commercial application requiring a license for the most up-to-date version
  * Whitehead only: See BaRC_datasets/Transfac for the command-line program and data files
{{{
# Search using all Transfac profiles
match matrix.dat MyPromoters.fa MyPromoters.match_out.txt minSUM_good.prf
# Search using a subset of profiles
match matrix.dat MyPromoters.fa MyPromoters.vert.match_out.txt vertebrate_non_redundant_minSUM.prf
}}}
  * Publication: [http://www.ncbi.nlm.nih.gov/pubmed/12824369 Kel et al., 2003]
  * Public web site (older data): http://www.gene-regulation.com/cgi-bin/pub/programs/match/bin/match.cgi

[http://hmmer.org/ HMMER] - for proteins or nucleic acids
  * Use a set of HMMs (like Pfam or another public protein domain resource) to search a fasta file of proteins
{{{
hmmsearch Pfam-A.hmm MyProteins.fa > MyProteins.Pfam-A.hmm.out.txt
}}}

[http://www.bioconductor.org/ Bioconductor] packages
  * [http://bioconductor.org/packages/release/bioc/html/TFBSTools.html TFBSTools]