Identifying homologous genes, proteins, or genome regions
Several ways to identify homologs are listed below. Because homology means shared ancestry, which is difficult to assess directly, most of these resources and methods predict homology from sequence similarity.
Homologs can be either orthologs (produced by a speciation event) in different species or paralogs (produced by a gene duplication event) in the same species.
Use a gene-centric homology database
HomoloGene
HomoloGene is a system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes. We have a local MySQL database of HomoloGene on canna. It is updated on the first Wednesday of each month.
Database fields include homologene_group_id, taxon_id, gene_id_key, gene_symbol, protein_gi and protein_acc.
The database is used for our BaRC tool Find orthologs (Whitehead only).
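Given the fields listed above, finding the homologs of a gene amounts to a self-join on homologene_group_id. The following is a minimal Python sketch of that lookup, not the code behind the BaRC tool. It uses an in-memory SQLite database as a stand-in for the MySQL server; the table name `homologene`, the function name, and every row are assumptions made up for illustration. Only the column names and the NCBI taxon IDs (9606 for human, 10090 for mouse) are real.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server. The table name
# "homologene" and all rows below are made-up placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE homologene (
           homologene_group_id INTEGER,
           taxon_id INTEGER,
           gene_id_key INTEGER,
           gene_symbol TEXT,
           protein_gi INTEGER,
           protein_acc TEXT)"""
)
conn.executemany(
    "INSERT INTO homologene VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 9606, 101, "GATA4", 9001, "PLACEHOLDER_1"),
        (1, 10090, 102, "Gata4", 9002, "PLACEHOLDER_2"),
        (2, 9606, 103, "GENE_B", 9003, "PLACEHOLDER_3"),
    ],
)


def find_homologs(conn, symbol, source_taxon, target_taxon):
    """Return (gene_id_key, gene_symbol) rows in target_taxon that share
    a HomoloGene group with the given gene in source_taxon."""
    query = """
        SELECT tgt.gene_id_key, tgt.gene_symbol
        FROM homologene AS src
        JOIN homologene AS tgt
          ON src.homologene_group_id = tgt.homologene_group_id
        WHERE src.gene_symbol = ?
          AND src.taxon_id = ?
          AND tgt.taxon_id = ?
        ORDER BY tgt.gene_id_key
    """
    return conn.execute(query, (symbol, source_taxon, target_taxon)).fetchall()


# Human GATA4 -> mouse homologs in the toy table.
print(find_homologs(conn, "GATA4", 9606, 10090))
```

A gene whose group has no member in the target species simply returns an empty list.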
Ensembl
Ensembl is a comprehensive system for genome annotation that has been applied to a wide variety of organisms. Ensembl groups genes by homology. Homolog sets can be obtained by going to the gene page in your reference organism, such as human GATA4. Clicking the "Orthologues" link in the left-side menu opens an Orthologues page that lists orthologs, and clicking a "Gene Tree" link can create
- an interactive tree
- a computer-readable representation of the tree
- a multiple sequence alignment that can be customized by clicking on the "Configure this page" box at left
Ensembl protein IDs in these outputs can be decoded into species names using the list of Ensembl stable IDs.
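Ensembl stable IDs carry the species in a short prefix between "ENS" and the feature letter ("P" for protein), with human IDs having no species prefix. The decoding step can be sketched in Python as below. The function name is hypothetical, the prefix table is a small hand-picked subset rather than the full Ensembl list, and the example IDs are placeholders, not real proteins.

```python
import re

# Small illustrative subset of Ensembl species prefixes; the full list
# is much longer. The empty prefix denotes human.
SPECIES_BY_PREFIX = {
    "": "Homo sapiens",
    "MUS": "Mus musculus",
    "RNO": "Rattus norvegicus",
    "DAR": "Danio rerio",
    "GAL": "Gallus gallus",
}

# ENS + optional 3-4 letter species prefix + P + 11 digits.
PROTEIN_ID = re.compile(r"^ENS([A-Z]{3,4})?P(\d{11})$")


def species_for_protein_id(stable_id):
    """Return the species name for an Ensembl protein ID, or None if the
    ID is malformed or its prefix is not in the subset above."""
    match = PROTEIN_ID.match(stable_id)
    if match is None:
        return None
    prefix = match.group(1) or ""
    return SPECIES_BY_PREFIX.get(prefix)


# Placeholder IDs, for illustration only.
for example_id in ("ENSP00000000001", "ENSMUSP00000000001"):
    print(example_id, species_for_protein_id(example_id))
```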
For genome-wide analysis, all Ensembl data (like ortholog alignments) can also be downloaded as large text files. Most homology data is found in the ensembl_compara database.
Extract information from genome alignments
UCSC Genome Bioinformatics
UCSC Genome Bioinformatics displays and provides the data behind genome assemblies, many kinds of data mapped to these assemblies, and genome-genome alignments. To get a genome-genome alignment of a region of interest, point the genome browser to the desired location, such as human GATA4, and set the Conservation track (or a similar track for another reference genome) to "full". Clicking on the Conservation link lets you select the genomes to include in the alignment. Clicking on the "Multiz Alignments of 46 Vertebrates" (or similar) track then creates a configurable, detailed alignment in MAF format (for regions of no more than 30,000 nt). Alignments of multiple regions can be obtained using the Table Browser and selecting
- the desired clade, genome, and assembly
- group = Comparative Genomics; track = Conservation
- table = multiz46way (or related table for another assembly)
Genome alignment and conservation metrics can also be downloaded in bulk. BaRC has placed some alignment files (in MAF format) on tak at /nfs/genomes/GENOME_NAME/maf/ . Others are available from sites like these:
- Multiple alignments of 45 vertebrate genomes with Human
- Conservation scores for alignments of 45 vertebrate genomes with Human
- Basewise conservation scores (phyloP) of 45 vertebrate genomes with Human
- FASTA alignments of 45 vertebrate genomes with Human for CDS regions
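MAF files such as those mentioned above are plain text: each alignment block starts with an "a" line and contains one "s" line per aligned sequence, with the fields source, start, size, strand, source size, and aligned text. A minimal Python reader might look like the following sketch. The function name and the sample block are made up for illustration, and other MAF line types ("i", "e", "q") are simply skipped.

```python
import io


def read_maf_blocks(handle):
    """Yield each alignment block as a list of dicts, one per 's' line."""
    block = []
    for line in handle:
        fields = line.split()
        if not fields or fields[0] == "a":
            # A blank line or a new 'a' line closes the current block.
            if block:
                yield block
                block = []
        elif fields[0] == "s":
            _, src, start, size, strand, src_size, text = fields
            block.append(
                {
                    "species": src.split(".", 1)[0],  # e.g. "hg19" from "hg19.chr8"
                    "src": src,
                    "start": int(start),  # zero-based start in the source sequence
                    "size": int(size),  # aligned bases, excluding gaps
                    "strand": strand,
                    "src_size": int(src_size),
                    "text": text,
                }
            )
    if block:
        yield block


# Made-up block for illustration; not taken from a real alignment.
SAMPLE_MAF = """##maf version=1
a score=100.0
s hg19.chr8 100 8 + 1000 ACGT-ACGT
s mm9.chr14 200 7 + 2000 ACGT-ACG-

"""

for maf_block in read_maf_blocks(io.StringIO(SAMPLE_MAF)):
    for row in maf_block:
        print(row["species"], row["start"], row["text"])
```

Reading one block at a time keeps memory use low, which matters for the whole-genome files.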
liftOver
UCSC LiftOver converts (lifts over) genome coordinates between different assembly versions of the same species. It can also be used to convert coordinates between species, though some regions may not convert correctly, e.g. due to rearrangements.
#command line
liftOver -minMatch=0.95 inputRegions.bed mm10ToHg38.over.chain.gz outRegionsConverted.bed failedConversions.bed
#set minMatch lower to relax the threshold for bases to remap
#default input file is bed format, see -gff or -genePred for other accepted formats
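After a run like the one above, it can be useful to see why regions failed to convert before lowering minMatch. The Python sketch below tallies the failure reasons in the unmapped file (failedConversions.bed in the command above). It assumes that liftOver writes a reason line starting with "#" before each unconverted record; the function name and sample text are made up for illustration.

```python
import io
from collections import Counter


def summarize_unmapped(handle):
    """Count the '#' reason lines in a liftOver unmapped file."""
    reasons = Counter()
    for line in handle:
        line = line.strip()
        if line.startswith("#"):
            reasons[line.lstrip("#").strip()] += 1
    return reasons


# Made-up unmapped file contents, for illustration only.
SAMPLE_UNMAPPED = (
    "#Deleted in new\n"
    "chr1\t100\t200\tregionA\n"
    "#Partially deleted in new\n"
    "chr2\t300\t400\tregionB\n"
    "#Deleted in new\n"
    "chr3\t500\t600\tregionC\n"
)

for reason, count in summarize_unmapped(io.StringIO(SAMPLE_UNMAPPED)).most_common():
    print(count, reason)
```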
VISTA
VISTA also has genome-genome alignments available for download, but the last update appears to have been in May 2008.
Extract information about protein families
Many databases are available that contain pre-aligned sequences for protein families.
Pfam
The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). It is available at several sites and appears to have last been updated in May 2010.
Do database searches
If your favorite species or genes are not included in the above resources, you will have to identify homologs yourself. Even if your species and genes are included, you may want to verify known homologs or identify new ones with the methods below:
Sequence Searching
Our tool, Find similar genes in another species, runs an all vs. all blastp comparison to identify similar genes in another species. The BLAST searches are rerun on the second Wednesday of each month.
A reciprocal BLAST search is a good way of finding homologs: search species B with a gene from species A, then search species A with the top hit from species B, and check that it returns the original gene.
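The reciprocal search can be automated from two BLAST result tables, one for each direction. The Python sketch below is one simple way to do it, not the code behind the BaRC tool. It assumes the default 12-column tabular BLAST output (-outfmt 6), in which columns 1, 2, and 12 hold the query ID, subject ID, and bit score, and it picks the best hit by bit score. The function names, gene IDs, and scores are made up for illustration.

```python
import io


def best_hits(handle):
    """Map each query ID to the subject ID with the highest bit score."""
    best = {}
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def toy_row(query, subject, bitscore):
    # Only columns 1, 2 and 12 are used here; the rest are filler zeros.
    return "\t".join([query, subject] + ["0"] * 9 + [str(bitscore)])


# Made-up results: humB's best hit is mouB, but mouB's best hit is humA,
# so only the humA/mouA pair is reciprocal.
A_VS_B = "\n".join(
    [toy_row("humA", "mouA", 200), toy_row("humA", "mouB", 60), toy_row("humB", "mouB", 150)]
)
B_VS_A = "\n".join(
    [toy_row("mouA", "humA", 200), toy_row("mouB", "humA", 70), toy_row("mouB", "humB", 65)]
)

print(reciprocal_best_hits(io.StringIO(A_VS_B), io.StringIO(B_VS_A)))
```

Ties in bit score are resolved in favor of the first hit read, which is a simplification; a stricter version might report or discard tied hits.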
Profile Searching
HMMER is an excellent tool for finding distant homologs. Rather than searching a database with a single sequence, HMMER can build a profile from an alignment of related sequences and thus search a sequence database more sensitively. The complete user's guide for HMMER3 is online. HMMER3 is installed on tak and the cluster.
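A typical HMMER3 run builds a profile with hmmbuild and searches with hmmsearch, whose --tblout option writes one whitespace-delimited line per target sequence. The Python sketch below filters such a table by full-sequence E-value. It assumes the standard tblout layout (target name in column 1, query name in column 3, full-sequence E-value and score in columns 5 and 6); the function name, file names, the 1e-5 cutoff, and the sample rows are assumptions chosen for illustration.

```python
import io

# The table would come from commands like these (file names hypothetical):
#   hmmbuild family.hmm family_alignment.sto
#   hmmsearch --tblout hits.tbl family.hmm proteins.fasta


def parse_tblout(handle, max_evalue=1e-5):
    """Return (target, query, evalue, score) tuples with a full-sequence
    E-value at or below max_evalue, sorted from most to least significant."""
    hits = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        # The first 18 columns are fixed; the rest is a free-text description.
        fields = line.split(None, 18)
        target, query = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            hits.append((target, query, evalue, score))
    return sorted(hits, key=lambda hit: hit[2])


# Made-up rows in tblout layout, for illustration only.
SAMPLE_TBLOUT = (
    "# target name accession query name accession E-value score bias ...\n"
    "protA - fam1 - 1e-30 105.2 0.1 2e-30 104.0 0.1 1.0 1 0 0 1 1 1 1 toy strong hit\n"
    "protB - fam1 - 0.5 10.1 0.0 0.6 9.8 0.0 1.0 1 0 0 1 1 1 0 toy weak hit\n"
)

for hit in parse_tblout(io.StringIO(SAMPLE_TBLOUT)):
    print(hit)
```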