Selecting genome features

It's worth carefully identifying what sorts of features are of interest, such as

  • protein-coding genes (Ensembl, RefSeq, etc. Is one gene set better than others?) represented as sets of exons
  • non-coding genes
  • short RNAs (miRNAs (mature or hairpins?), piRNAs, etc.)
  • lincRNAs (long intergenic non-coding RNAs)
  • other novel transcripts
  • promoters or transcription start sites (such as for transcription factor IP experiments)
  • 5' or 3' untranslated regions (UTRs)

Also, does it make sense to merge different types of features into one file or keep them separate?


Method 1 (using BEDTools)

  • Create a BED file with the regions you'd like to annotate.
  • Create a BED file with your desired genome annotation, such as
    • transcription start sites
    • transcripts (start - end)
  • Depending on how you'd like to apply annotations, use intersectBed or closest Bed like
intersectBed -wa -wb -a regionsToAnnotate.bed -b landmarks.bed > regionsOverlappingFeatures.bed
closestBed -D b -a regionsToAnnotate.bed -b landmarks.bed > regionsAndClosestFeatures.bed
  • Further detail: Both commands have a "-s" option to force strandedness (only matching regions if they're on the same strand).
  • Output BED files can be further processed/filtered with a spreadsheet or other matrix-oriented application.

Method 2 (using a GFF database)

  • Create a GFF database and upload GFF files with the features to include on the annotation.
    • This is a sample command to create the tables of a new database (the database itself was created by unix)
perl --dsn "" --adaptor dbi::mysqlopt  --create --user  userlogin --pass xxx
  • Uploading GFF2 or GFF3 files makes a difference in the way the database is queried, especially for the "Class ID" and attributes on the last column of the GFF file. My sample commands work better with GFF2 file
  • We currently have the following databases
    • scerevisiae_oct09
    • mouse_mm9
    • human_hg18
  • Write and run a perl script that queries the database. This is the code for the main steps of the script:
  1. connect to the database
use Bio::Seq;
use Bio::DB::GFF;
use Bio::Graphics; 
my $db_path = "";
my $db    = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
	                                 -dsn     => $db_path,
	                                 -user => $user,
	                                 -pass => $passwd
	                                 ) or die();
print "connection done\n";
  1. Create a segment for each area or single coordinate we want to query.
my $segment = $db->segment($chrom,$start,$stop);
#if we want to query 10 kb up and down $start and $stop we would do:
$start2 = $start - 10000; $stop2 = $stop + 10000;
my $segment2 = $db->segment($chrom,$start2,$stop2);

  1. Retrieve all features of interest within that segment. Those features are objects and from then we can get the start, stop, strand, frame, gene name or other information included on the GFF file. We can calculate distances to a feature, find which features overlap with a point on the genome, count how many features fall within a segment, find their orientation with respect to the point of interest, get the sequence of a region etc.

      my @genes = $segment->features(-type => ['gene:SGD']);
      my @CDS = $segment->features(-type => ['CDS:SGD']);
      			my  $CDSStart = $_ -> start();
			my  $CDSEnd = $_ -> stop();
			my  $CDSStrand = $_ -> strand();
			my $CDSFrame =  $_ -> phase();

Sample script

  1. Main differences with method 1.
    • Method 2 requires spending more time in setting up the database and writting the perl script.
    • It allows the retrieval of more detailed information. i.e. Identify a binding event that tends to fall within bidirectional promoters or only upstream or downstream of a gene. Classify coordinates (SNPs) into exonic, intronic intragenic intergenic in one perl script run.
    • Once the script is written repeating the analysis for a different cut off distance is very easy to do and doesn't take much time.

Method 3 (for SNPs, using snpEff)

  • This takes SNP positions as input (in VCF format). I don't think it works on regions > 1 nt.
  • This method has the advantage that it can link SNPs to synonymous and nonsynonymous codons to identify changed in encoded protein.
  • The snpEff web site has lots of information.
  • It was installed with the human, mouse, fly, worm, and yeast genomes listed on the snpEff features page but you can create a custom annotation index from a GFF file using the "snpEff build" command (see details on the snpEff manual page).

Sample commands:

# Default output based on Ensembl gene annotation
snpEff GRCh37.75 < snps.calls.vcf 1> snps.calls.out.txt 2> snps.calls.err.txt
# Default output based on UCSC gene annotation
snpEff hg19 < snps.calls.vcf 1> snps.calls.out.txt 2> snps.calls.err.txt
# Don't show so much stuff
snpEff -no-utr -no-downstream -no-intergenic -no-intron -no-upstream hg19 < snps.calls.vcf 1> snps.calls.filt.out.txt 2> snps.calls.filt.err.txt

On Whitehead servers, pre-installed genomes (with annotations) can be chosen from the directory names in /usr/local/share/snpEff/data

Creating a novel genome database for snpEff:

  • Check if the genome database has been installed:
    • cd /usr/local/share/snpEff
    • java -jar /usr/local/share/snpEff/snpEff.jar databases
  • If not already installed, check if the genome has been processed for snpEff at
  • If you need to create it yourself, do the following:
    • Make a copy of the snpEff config file containing the path to your new database (ex: ~/snpEff/data/myGenome.v1)
    • In the config file directory (ex: data/myGenome.v1) put the genome sequence (called sequences.fa) and a GTF file describing the genes (called genes.gtf).
    • Go to that directory and run
      • java -jar /usr/local/share/snpEff/snpEff.jar build -gtf22 -v myGenome.v1
    • Now there should be a new file: data/myGenome.v1/snpEffectPredictor.bin