Changes between Initial Version and Version 1 of SOPs/genome_regions_annotations


Ignore:
Timestamp:
01/23/13 16:49:43 (12 years ago)
Author:
trac
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/genome_regions_annotations

    v1 v1  
     1=== Selecting genome features ===
     2
     3It's worth carefully identifying what sorts of features are of interest, such as
     4  * protein-coding genes (Ensembl, RefSeq, etc. Is one gene set better than others?) represented as sets of exons
     5  * non-coding genes
     6  * short RNAs (miRNAs (mature or hairpins?), piRNAs, etc.)
     7  * lincRNAs (long intergenic non-coding RNAs)
     8  * other novel transcripts
     9  * promoters or transcription start sites (such as for transcription factor IP experiments)
     10  * 5' or 3' untranslated regions (UTRs)
     11
     12Also, does it make sense to merge different types of features into one file or keep them separate?
     13 
     14==== Annotations ====
     15  * [[http://www.ensembl.org/info/docs/genebuild/ncrna.html|Ensembl ncRNA Annotation Procedure]]
     16  * Guttman, M. et al (2009)[[http://www.nature.com/nbt/journal/v28/n5/full/nbt.1633.html | Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs (See Supp.)]]
     17  * Guttman, M. et al (2010) [[http://www.nature.com/nature/journal/v458/n7235/full/nature07672.html | Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals (See Supp.)]]
     18
     19
     20=== Method 1 (using BEDTools) ===
     21
     22  * Create a BED file with the regions you'd like to annotate.
     23  * Create a BED file with your desired genome annotation, such as
     24      * transcription start sites
     25      * transcripts (start - end)
     26  * Depending on how you'd like to apply annotations, use intersectBed or closest Bed like
     27
     28{{{
     29intersectBed -wa -wb -a regionsToAnnotate.bed -b landmarks.bed > regionsOverlappingFeatures.bed
     30}}}
     31
     32{{{
     33closestBed -a regionsToAnnotate.bed -b landmarks.bed > regionsAndClosestFeatures.bed
     34}}}
     35
     36  * Further detail: Both commands have a "-s" option to force strandedness (only matching regions if they're on the same strand).
     37  * Output BED files can be further processed/filtered with a spreadsheet or other matrix-oriented application.
     38 
     39  * You may also want to check Step 3: "Linking bound regions to genes" of this SOP: [wiki:SOPs/chip_seq_peaks/ Identifying and/or quantifying peaks in ChIP-Seq projects  ]
     40
     41===  Method 2 (using a GFF database) ===
     42  * Create a GFF database and upload GFF files with the features to include on the annotation.
     43      * This is a sample command to create the tables of a new database (the database itself was created by unix)
     44
     45{{{
     46perl load_gff.pl --dsn "dbi:mysql:host=canna.wi.mit.edu:scerevisiae_oct09" --adaptor dbi::mysqlopt  --create --user  userlogin --pass xxx
     47}}}
     48
     49      * Uploading GFF2 or GFF3 files makes a difference in the way the database is queried, especially for the "Class ID" and attributes on the last column of the GFF file.  My sample commands work better with GFF2 file
     50      * We currently have the following databases
     51         * scerevisiae_oct09
     52         * mouse_mm9
     53         * human_hg18
     54
     55  * Write and run a perl script that queries the database. This is the code for the main steps of the script:
     56 1. connect to the database
     57
     58 {{{
     59use Bio::Seq;
     60use Bio::DB::GFF;
     61use Bio::Graphics;
     62my $db_path = "dbi:mysql:host=canna.wi.mit.edu:scerevisiae_oct09";
     63my $db    = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
     64                                         -dsn     => $db_path,
     65                                         -user => $user,
     66                                         -pass => $passwd
     67                                         ) or die();
     68print "connection done\n";
     69}}}
     70
     712. Create a segment for each area or single coordinate we want to query.
     72
     73{{{
     74my $segment = $db->segment($chrom,$start,$stop);
     75#if we want to query 10 kb up and down $start and $stop we would do:
     76$start2 = $start - 10000; $stop2 = $stop + 10000;
     77my $segment2 = $db->segment($chrom,$start2,$stop2);
     78}}}
     79 
     803. Retrieve all features of interest within that segment. Those features are objects and from then we can get the start, stop, strand, frame, gene name or other information included on the GFF file. We can calculate distances to a feature, find which features overlap with a point on the genome, count how many features fall within a segment, find their orientation with respect to the point of interest, get the sequence of a region etc.
     81     
     82{{{
     83      my @genes = $segment->features(-type => ['gene:SGD']);
     84      my @CDS = $segment->features(-type => ['CDS:SGD']);
     85                        my  $CDSStart = $_ -> start();
     86                        my  $CDSEnd = $_ -> stop();
     87                        my  $CDSStrand = $_ -> strand();
     88                        my $CDSFrame =  $_ -> phase();
     89}}}
     90
     91Sample script [[MapRegions_mm9.pl|MapRegions_mm9.pl]]
     92                       
     93   
     944. Main differences with method 1.
     95  * Method 2 requires spending more time in setting up the database and writting the perl script.
     96  * It allows the retrieval of more detailed information. //i.e.// Identify a binding event that tends to fall within bidirectional promoters or only upstream or downstream of a gene. Classify coordinates (SNPs) into exonic, intronic intragenic intergenic in one perl script run.
     97  * Once the script is written repeating the analysis for a different cut off distance is very easy to do and doesn't take much time.
     98
     99=== Method 3 (for SNPs, using snpEff) ===
     100
     101  * This takes SNP positions as input.  I don't think it works on regions > 1 nt.
     102  * This method has the advantage that it can link SNPs to synonymous and nonsynonymous codons to identify changed in encoded protein.
     103  * The [[http://snpeff.sourceforge.net/|snpEff web site]] has lots of information.
     104  * It was installed with the human, mouse, fly, worm, and yeast genomes listed on the [[http://snpeff.sourceforge.net/features.html|snpEff features page]] but you can create a custom annotation index from a GFF file using the "snpEff build" command (see details on the [[http://snpeff.sourceforge.net/manual.html|snpEff manual page]]).
     105
     106Sample commands:
     107
     108 {{{
     109# Default output based on Ensembl gene annotation
     110snpEff hg37.61 snps.calls 1> snps.calls.out.txt 2> snps.calls.err.txt
     111# Default output based on UCSC gene annotation
     112snpEff hg37 snps.calls 1> snps.calls.out.txt 2> snps.calls.err.txt
     113}}}
     114
     115{{{
     116# Don't show so much stuff
     117snpEff -no-downstream -no-intergenic -no-intron -no-upstream hg37.61 snps.calls 1> snps.calls.filt.out.txt 2> snps.calls.filt.err.txt
     118}}}