Version 2 (modified by 12 years ago) ( diff ) | ,
---|
Selecting genome features
It's worth carefully identifying what sorts of features are of interest, such as
- protein-coding genes (Ensembl, RefSeq, etc. Is one gene set better than others?) represented as sets of exons
- non-coding genes
- short RNAs (miRNAs (mature or hairpins?), piRNAs, etc.)
- lincRNAs (long intergenic non-coding RNAs)
- other novel transcripts
- promoters or transcription start sites (such as for transcription factor IP experiments)
- 5' or 3' untranslated regions (UTRs)
Also, does it make sense to merge different types of features into one file or keep them separate?
Annotations
- Ensembl ncRNA Annotation Procedure
- Guttman, M. et al (2009) Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs (See Supp.)
- Guttman, M. et al (2010) Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals (See Supp.)
Method 1 (using BEDTools)
- Create a BED file with the regions you'd like to annotate.
- Create a BED file with your desired genome annotation, such as
- transcription start sites
- transcripts (start - end)
- Depending on how you'd like to apply annotations, use intersectBed or closest Bed like
intersectBed -wa -wb -a regionsToAnnotate.bed -b landmarks.bed > regionsOverlappingFeatures.bed
closestBed -a regionsToAnnotate.bed -b landmarks.bed > regionsAndClosestFeatures.bed
- Further detail: Both commands have a "-s" option to force strandedness (only matching regions if they're on the same strand).
- Output BED files can be further processed/filtered with a spreadsheet or other matrix-oriented application.
- You may also want to check Step 3: "Linking bound regions to genes" of this SOP: Identifying and/or quantifying peaks in ChIP-Seq projects
Method 2 (using a GFF database)
- Create a GFF database and upload GFF files with the features to include on the annotation.
- This is a sample command to create the tables of a new database (the database itself was created by unix)
perl load_gff.pl --dsn "dbi:mysql:host=canna.wi.mit.edu:scerevisiae_oct09" --adaptor dbi::mysqlopt --create --user userlogin --pass xxx
- Uploading GFF2 or GFF3 files makes a difference in the way the database is queried, especially for the "Class ID" and attributes on the last column of the GFF file. My sample commands work better with GFF2 file
- We currently have the following databases
- scerevisiae_oct09
- mouse_mm9
- human_hg18
- Write and run a perl script that queries the database. This is the code for the main steps of the script:
- connect to the database
use Bio::Seq; use Bio::DB::GFF; use Bio::Graphics; my $db_path = "dbi:mysql:host=canna.wi.mit.edu:scerevisiae_oct09"; my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', -dsn => $db_path, -user => $user, -pass => $passwd ) or die(); print "connection done\n";
- Create a segment for each area or single coordinate we want to query.
my $segment = $db->segment($chrom,$start,$stop); #if we want to query 10 kb up and down $start and $stop we would do: $start2 = $start - 10000; $stop2 = $stop + 10000; my $segment2 = $db->segment($chrom,$start2,$stop2);
- Retrieve all features of interest within that segment. Those features are objects and from then we can get the start, stop, strand, frame, gene name or other information included on the GFF file. We can calculate distances to a feature, find which features overlap with a point on the genome, count how many features fall within a segment, find their orientation with respect to the point of interest, get the sequence of a region etc.
my @genes = $segment->features(-type => ['gene:SGD']); my @CDS = $segment->features(-type => ['CDS:SGD']); my $CDSStart = $_ -> start(); my $CDSEnd = $_ -> stop(); my $CDSStrand = $_ -> strand(); my $CDSFrame = $_ -> phase();
Sample script MapRegions_mm9.pl
- Main differences with method 1.
- Method 2 requires spending more time in setting up the database and writting the perl script.
- It allows the retrieval of more detailed information. i.e. Identify a binding event that tends to fall within bidirectional promoters or only upstream or downstream of a gene. Classify coordinates (SNPs) into exonic, intronic intragenic intergenic in one perl script run.
- Once the script is written repeating the analysis for a different cut off distance is very easy to do and doesn't take much time.
Method 3 (for SNPs, using snpEff)
- This takes SNP positions as input (in VCF format). I don't think it works on regions > 1 nt.
- This method has the advantage that it can link SNPs to synonymous and nonsynonymous codons to identify changed in encoded protein.
- The snpEff web site has lots of information.
- It was installed with the human, mouse, fly, worm, and yeast genomes listed on the snpEff features page but you can create a custom annotation index from a GFF file using the "snpEff build" command (see details on the snpEff manual page).
Sample commands:
# Default output based on Ensembl gene annotation snpEff hg19 < snps.calls.vcf 1> snps.calls.out.txt 2> snps.calls.err.txt # Default output based on UCSC gene annotation snpEff hg19 < snps.calls.vcf 1> snps.calls.out.txt 2> snps.calls.err.txt
# Don't show so much stuff snpEff -no-downstream -no-intergenic -no-intron -no-upstream hg19 < snps.calls.vcf 1> snps.calls.filt.out.txt 2> snps.calls.filt.err.txt
On Whitehead servers, pre-installed genomes (with annotations) can be chosen from the directory names in /usr/local/share/snpEff/data
Note:
See TracWiki
for help on using the wiki.