wiki:SOPs/homologous

Identifying homologous genes, proteins, or genome regions

Several ways to identify homologs are listed below. Because homology means shared ancestry, which is difficult to assess directly, most of these resources and methods predict homology from sequence similarity.

Homologs are either orthologs (separated by a speciation event, and so found in different species) or paralogs (separated by a gene duplication event, and typically found within the same species).

Use a gene-centric homology database

HomoloGene

HomoloGene is a system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes. We have a local MySQL copy of the HomoloGene database on canna, which is updated on the first Wednesday of each month.

Database fields include homologene_group_id, taxon_id, gene_id_key, gene_symbol, protein_gi and protein_acc.

The database is used for our BaRC tool Find orthologs (Whitehead only).
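As a rough sketch of how the fields listed above can be used, the example below looks up the genes that share a homologene_group_id with a query gene. It runs against an in-memory SQLite table rather than the real MySQL database on canna; the table name "homologene", the gene symbols, IDs, and accessions are all made up for illustration, and only the column names come from this page. The taxon IDs 9606 (human) and 10090 (mouse) are the standard NCBI values.

```python
import sqlite3

# Toy stand-in for the local HomoloGene database. The real database is MySQL;
# the table name and every row below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE homologene (
           homologene_group_id INTEGER,
           taxon_id INTEGER,
           gene_id_key INTEGER,
           gene_symbol TEXT,
           protein_gi INTEGER,
           protein_acc TEXT
       )"""
)
toy_rows = [
    (1, 9606, 101, "GENEA", 9001, "TOY_PROT_1"),
    (1, 10090, 201, "Genea", 9002, "TOY_PROT_2"),
    (2, 9606, 102, "GENEB", 9003, "TOY_PROT_3"),
    (2, 10090, 202, "Geneb", 9004, "TOY_PROT_4"),
]
conn.executemany("INSERT INTO homologene VALUES (?, ?, ?, ?, ?, ?)", toy_rows)


def find_homologs(conn, gene_symbol, source_taxon, target_taxon):
    """Return (gene_symbol, gene_id_key, protein_acc) for genes in target_taxon
    that belong to the same HomoloGene group as the query gene."""
    query = """
        SELECT t.gene_symbol, t.gene_id_key, t.protein_acc
        FROM homologene AS s
        JOIN homologene AS t
          ON s.homologene_group_id = t.homologene_group_id
        WHERE s.gene_symbol = ? AND s.taxon_id = ? AND t.taxon_id = ?
        ORDER BY t.gene_id_key
    """
    return conn.execute(query, (gene_symbol, source_taxon, target_taxon)).fetchall()


human_to_mouse = find_homologs(conn, "GENEA", 9606, 10090)
print(human_to_mouse)
```

The self-join on homologene_group_id is the only real logic here; the same query should carry over to MySQL once the actual table name is substituted.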

Ensembl

Ensembl is a comprehensive system for genome annotation that has been applied to a wide variety of organisms. Ensembl groups genes by homology. Homolog sets can be obtained by going to the gene page in your reference organism, such as human GATA4. Clicking on the "Orthologues" link in the left-side menu opens an Orthologues page that lists orthologs, and clicking on a "Gene Tree" link displays the gene tree.

For genome-wide analysis, all Ensembl data (like ortholog alignments) can also be downloaded as large text files. Most homology data is found in the ensembl_compara database.

Extract information from genome alignments

UCSC Genome Bioinformatics

UCSC Genome Bioinformatics displays and provides all the data behind genome assemblies, many kinds of data mapped to these assemblies, and genome-genome alignments. To get a genome-genome alignment of a region of interest, navigate the genome browser to the desired location, such as human GATA4, and set the Conservation track (or a similar track for another reference genome) to "full". Clicking on the Conservation link lets you select the genomes to include in the alignment. Clicking on the "Multiz Alignments of 46 Vertebrates" (or similar) track then creates a configurable, detailed alignment in MAF format (limited to 30,000 nt). Alignments of multiple regions can be obtained using the Table Browser and selecting

  • the desired clade, genome, and assembly
  • group = Comparative Genomics; track = Conservation
  • table = multiz46way (or related table for another assembly)
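Once an alignment has been exported in MAF format, it can be read with a few lines of code. The sketch below is a minimal reader, not a full MAF parser: it only uses the "a" lines that start a block and the "s" lines that carry sequences, and it ignores every other line type. The sample text, including the coordinates and sequences, is invented for illustration.

```python
from collections import namedtuple

# One "s" line of a MAF block: source, start, aligned size, strand,
# total source length, and the gapped alignment text.
MafRow = namedtuple("MafRow", "src start size strand src_size text")


def parse_maf(lines):
    """Yield each alignment block as a list of MafRow (one per 's' line)."""
    block = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":  # an "a" line starts a new block
            if block:
                yield block
            block = []
        elif fields[0] == "s":
            src, start, size, strand, src_size, text = fields[1:7]
            block.append(MafRow(src, int(start), int(size), strand, int(src_size), text))
    if block:
        yield block


def percent_identity(row_a, row_b):
    """Percent identical bases over columns where neither row has a gap."""
    pairs = [
        (a, b)
        for a, b in zip(row_a.text.upper(), row_b.text.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


# Invented two-block example; not real genome coordinates.
SAMPLE_MAF = """##maf version=1
a score=100.0
s hg19.chr8  100 10 + 1000 ACGTACGTAC
s mm9.chr14  200  9 + 2000 ACGTAC-TAC

a score=50.0
s hg19.chr8  300 8 + 1000 AAAACCCC
s mm9.chr14  400 8 + 2000 AAAAGGCC
"""

blocks = list(parse_maf(SAMPLE_MAF.splitlines()))
identities = [percent_identity(b[0], b[1]) for b in blocks]
print(identities)
```

For real files, which can be very large, passing an open file handle to parse_maf instead of a list of lines keeps memory use low because blocks are yielded one at a time.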

Genome alignment and conservation metrics can also be downloaded in bulk. BaRC has placed some alignment files (in MAF format) on tak at /nfs/genomes/GENOME_NAME/maf/ . Others are available from sites like these:

liftOver

UCSC LiftOver can be used to convert, or lift over, genome coordinates between different assembly versions of the same species. Alternatively, it can be used to convert between the assemblies of different species, though it may not convert correctly, e.g. due to rearrangements.

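#command line: lift mouse (mm10) regions over to human (hg38) coordinates
liftOver -minMatch=0.95 inputRegions.bed mm10ToHg38.over.chain.gz outRegionsConverted.bed failedConversions.bed
#-minMatch is the minimum ratio of bases that must remap; set it lower to relax the threshold
#default input file is bed format, see -gff or -genePred for other accepted formats
#regions that could not be converted are written to failedConversions.bed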
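After a liftOver run it is often useful to see why regions failed to convert. The sketch below tallies the failure reasons in the unmapped output file. It assumes that file consists of "#"-prefixed reason lines, each followed by the BED record it refers to, which is our understanding of liftOver's output rather than something documented on this page; the sample text and region names are invented.

```python
from collections import Counter


def summarize_unmapped(lines):
    """Count unconverted regions per failure reason.

    Assumes each '#' comment line gives the reason for the record(s)
    that follow it, until the next '#' line.
    """
    reasons = Counter()
    current_reason = "unknown"
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            current_reason = line.lstrip("#").strip()
        else:
            reasons[current_reason] += 1
    return reasons


# Invented example of what failedConversions.bed might contain.
SAMPLE_UNMAPPED = """#Deleted in new
chr1\t100\t200\tregionA
#Partially deleted in new
chr2\t300\t400\tregionB
#Deleted in new
chr3\t500\t600\tregionC
"""

failure_counts = summarize_unmapped(SAMPLE_UNMAPPED.splitlines())
for reason, count in failure_counts.most_common():
    print(f"{count}\t{reason}")
```

A large share of partial failures may suggest trying a lower -minMatch, whereas fully deleted regions will not be rescued by relaxing the threshold.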

VISTA

VISTA also has genome-genome alignments available for download, but the last update appears to have been in May 2008.

Extract information about protein families

Many databases are available that contain pre-aligned sequences for protein families.

Pfam

The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). It is available at several sites and appears to have last been updated in May 2010.

Do database searches

If your favorite species or genes are not included in the above resources, you will have to identify homologs yourself. Even if they are included, you may want to verify known homologs or identify new ones with the methods below:

Sequence Searching

We have a tool, Find similar genes in another species, that runs an all vs. all blastp comparison to identify similar genes in another species. The blast searches are rerun on the second Wednesday of each month.

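A reciprocal blast search is a good way of finding homologs: gene A in one species and gene B in another are a candidate pair if B is the best hit for A and A is, in turn, the best hit for B.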
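The reciprocal search can be sketched as follows, assuming both blast runs were saved in the standard 12-column tabular format (-outfmt 6), where column 1 is the query, column 2 the subject, and column 12 the bit score. The sequence names and scores below are invented, and picking the best hit by bit score is one reasonable choice, not necessarily what the BaRC tool does.

```python
def best_hits(lines):
    """Map each query to its highest-bit-score subject from tabular blast output."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented tabular blast lines: species A proteins searched against species B, and back.
A_VS_B = [
    "humA\tmusA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180",
    "humA\tmusB\t40.0\t100\t60\t0\t1\t100\t1\t100\t1e-5\t50",
    "humB\tmusA\t45.0\t100\t55\t0\t1\t100\t1\t100\t1e-6\t55",
]
B_VS_A = [
    "musA\thumA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180",
    "musB\thumA\t40.0\t100\t60\t0\t1\t100\t1\t100\t1e-5\t50",
]

rbh_pairs = reciprocal_best_hits(A_VS_B, B_VS_A)
print(rbh_pairs)
```

In this toy data humB's best hit is musA, but musA's best hit is humA, so only the humA/musA pair is reported.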

Profile Searching

HMMER is an excellent tool for finding distant homologs. Rather than searching a database with a single sequence, HMMER builds a profile from a set of related sequences and uses it to search a sequence database more sensitively. The complete user's guide for HMMER3 is online. HMMER3 is installed on tak and the cluster.
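A typical HMMER3 profile search has two steps: build a profile from an alignment with hmmbuild, then search a sequence database with hmmsearch. The sketch below shows the two commands as comments and then filters the tabular output written by the --tblout option. All file names, the profile name "myfam", and the sample hits are hypothetical; the column positions (full-sequence E-value in column 5, score in column 6) follow our reading of the tblout layout in the HMMER3 user's guide.

```python
# Commands that would produce the table parsed below (file names are hypothetical):
#   hmmbuild myfam.hmm myfam_alignment.sto
#   hmmsearch --tblout hits.tbl myfam.hmm proteins.fasta


def parse_tblout(lines, max_evalue=1e-5):
    """Return (target, query, evalue, score) for hits at or below max_evalue,
    sorted by full-sequence E-value (most significant first)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # 18 whitespace-separated columns, then a free-text description.
        fields = line.split(None, 18)
        target, query = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            hits.append((target, query, evalue, score))
    return sorted(hits, key=lambda hit: hit[2])


# Invented example rows in tblout layout.
SAMPLE_TBLOUT = """# target name  accession  query name  accession  E-value  score  bias  E-value  score  bias  exp reg clu ov env dom rep inc description of target
seq3  -  myfam  -  3e-12    45.2  0.0  4e-12    44.9  0.0  1.1  1  0  0  1  1  1  1 toy protein three
seq2  -  myfam  -  0.5       8.1  0.0  0.7       7.5  0.0  1.0  1  0  0  1  1  0  0 toy protein two
seq1  -  myfam  -  1.2e-30  100.5  0.1  2.0e-30  99.8  0.1  1.0  1  0  0  1  1  1  1 toy protein one
"""

significant_hits = parse_tblout(SAMPLE_TBLOUT.splitlines())
for target, query, evalue, score in significant_hits:
    print(f"{target}\t{query}\t{evalue:g}\t{score}")
```

The 1e-5 cutoff is only a placeholder; an appropriate threshold depends on the size of the database and on how distant the homologs of interest are.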
