= Identifying enriched biological themes in gene sets =


=== Recommendations ===

* Choose appropriate:
   * gene sets (reference) for the biological question, e.g. the MSigDB H (hallmark) gene sets
   * background genes; note that some tools do not let you select a different background
* Gene sets should contain approximately 15 to 500 genes
* Use gene identifiers that are accepted by the tool
* Check that the tool’s databases or gene sets are updated and maintained
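
The size recommendation above can be applied to a custom gene set file before running a tool. The following is a minimal sketch, assuming the gene sets are stored in GMT format (one set per line, tab-separated: name, description, then genes). The function names and the toy set names are hypothetical and not part of any tool mentioned on this page.

```python
import io

def read_gmt(handle):
    """Parse GMT text: one gene set per line, tab-separated as
    name, description, gene1, gene2, ..."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, genes = fields[0], fields[2:]
        gene_sets[name] = {g for g in genes if g}
    return gene_sets

def filter_by_size(gene_sets, min_size=15, max_size=500):
    """Keep gene sets whose size falls within [min_size, max_size]."""
    return {name: genes for name, genes in gene_sets.items()
            if min_size <= len(genes) <= max_size}

# Toy GMT content (hypothetical set names and genes).
toy_gmt = io.StringIO(
    "TINY_SET\tna\tA\tB\tC\n"
    "OK_SET\tna\t" + "\t".join(f"G{i}" for i in range(20)) + "\n"
)
kept = filter_by_size(read_gmt(toy_gmt))
print(sorted(kept))
```

In practice it may be more meaningful to count only the genes that are also present in your dataset before applying the thresholds.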


=== Database for Annotation, Visualization and Integrated Discovery (DAVID) ===

* [[http://david.abcc.ncifcrf.gov/home.jsp | DAVID]] is generally the best place to start your enrichment analysis.
* DAVID is a tool that analyzes a subset of assayed genes, asking the general question, "What's special about these genes compared to a random list of genes of the same size?" 
* Instructions for using DAVID can be found under //Functional Annotation// on the DAVID web site.
* You'll probably end up running DAVID multiple times, with different types of annotations, to find the most informative combination.
* More details:
    * DAVID has an upper limit of 3000 genes as input
    * Full output can be downloaded as text and viewed as a spreadsheet.
    * By default, DAVID uses the entire genome (annotation) as the background. If your background is different, make sure to enter it; e.g. for a microarray study, the background gene list would be all the genes/probesets assayed.
    * The [[https://david.ncifcrf.gov/helps/update.html| DAVID Knowledgebase]] and [[https://academic.oup.com/view-large/82532053| annotation ]] pages describe the databases and sources used by DAVID.
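
To illustrate why the choice of background matters, here is a minimal sketch of a plain one-sided hypergeometric test for over-representation. This is not DAVID's own statistic (DAVID reports an EASE score, a modified Fisher exact test), and all counts below are made-up numbers chosen only for illustration.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_annotated, n_list, n_overlap):
    """P(X >= n_overlap) when drawing n_list genes without replacement
    from n_background genes, n_annotated of which carry the annotation."""
    total = comb(n_background, n_list)
    upper = min(n_annotated, n_list)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 300 genes of interest, 10 of them annotated with a
# term that covers 200 genes in total.
p_genome = hypergeom_enrichment_p(20000, 200, 300, 10)   # whole-genome background
p_assayed = hypergeom_enrichment_p(10000, 200, 300, 10)  # only the assayed genes
print(f"genome background:  p = {p_genome:.3g}")
print(f"assayed background: p = {p_assayed:.3g}")
```

With these toy numbers the same overlap looks more significant against the larger genome-wide background, which is why an overly broad background can overstate enrichment.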

=== Gene Set Enrichment Analysis (GSEA) ===

[[http://www.broadinstitute.org/gsea/index.jsp|GSEA]] is very different from tools like DAVID.  GSEA takes as input all assayed genes, along with a metric that GSEA uses to order the genes.  Then it asks the general question, "What's special about the order of these genes compared to a randomly ordered list of the same genes?"  In other words, it looks for gene annotations that are enriched at the top or bottom of your ordered genes.
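
The idea of "enriched at the top or bottom" can be sketched as a running sum that walks down the ranked list, stepping up at genes in the set and down otherwise. The code below is a simplified illustration only: it computes a raw enrichment score, with no permutation testing or normalization, and is not the Broad implementation. Gene names and scores are hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Raw running-sum enrichment score.
    ranked: list of (gene, score) pairs; gene_set: set of gene names.
    Returns the running-sum value with the largest absolute deviation from zero."""
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    # Step size for each hit is proportional to |score| ** weight.
    hit_weights = [abs(score) ** weight if gene in gene_set else 0.0
                   for gene, score in ordered]
    n_hits = sum(1 for gene, _ in ordered if gene in gene_set)
    n_miss = len(ordered) - n_hits
    hit_total = sum(hit_weights)
    if n_hits == 0 or n_miss == 0 or hit_total == 0:
        raise ValueError("gene set must partially overlap the ranked list")
    running, best = 0.0, 0.0
    for (gene, _), w in zip(ordered, hit_weights):
        if gene in gene_set:
            running += w / hit_total   # step up at a gene in the set
        else:
            running -= 1.0 / n_miss    # step down otherwise
        if abs(running) > abs(best):
            best = running
    return best

# Ten hypothetical genes with scores 5, 4, 3, 2, 1, -1, -2, -3, -4, -5.
scores = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]
ranked = [(f"g{i + 1}", s) for i, s in enumerate(scores)]
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom gives a negative one; GSEA then assesses significance by comparing against permuted data.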

GSEA can be run on almost any operating system (so on your own computer or on a Whitehead Linux server like tak).

==== Introductory information about GSEA ====

* [[https://www.gsea-msigdb.org/gsea/login.jsp|Download the GSEA software and additional resources to analyze, annotate and interpret enrichment results.]]
* [[https://www.gsea-msigdb.org/gsea/msigdb/index.jsp|Explore the Molecular Signatures Database (MSigDB)]], a collection of annotated gene sets for use with GSEA software.  The H (hallmark), C2 (curated) and C5 (GO) collections are good gene sets to start the analysis with.
* [[http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Main_Page | GSEA and MSigDB documentation]]
* [[http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Using_RNA-seq_Datasets_with_GSEA | Guidelines for using RNA-seq datasets with GSEA]]
* [[https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ | GSEA FAQ]]
  
   ==== GSEAPreranked: start with a list of genes and values ====
   1. Create a two-column, tab-delimited file with gene names in the first column and numeric values in the second column. The file does not need to be sorted, and it should have the extension ".rnk".
     * The second column, used to rank genes, could be log2 fold change, log2 ratio, t-statistic, or another scoring scheme that takes into account both log ratio and p-value.

   2. Run GSEA:

     *  To run using the GUI
        * 1. Start GSEA. On tak, the command is 'gsea'.
        * 2. Upload your ranked file "file.rnk". Click on "Steps in GSEA analysis -> Load data"  
        * 3. Click on  "Tools -> GseaPreranked"  
        * 4. Select one of the gene sets from the "Gene sets database". We recommend starting with the Hallmarks set (h.all). You can find more information about the sets [[https://www.gsea-msigdb.org/gsea/msigdb/index.jsp|here ]]
        * 5. Select your uploaded ranked list (rnk file) for "Ranked list".
        * 6. The "Chip platform" refers to the type of identifiers in your rnk file.  If your input file has human gene symbols, choose a platform file like "Human_Gene_Symbol_with_Remapping_MSigDB*.chip".  If your input file has mouse gene symbols, you'll need to choose a "platform" to assign the mouse symbols to orthologous human symbols (like Mouse_Gene_Symbol_Remapping_to_Human_Orthologs_MSigDB*.chip).
        * 7. Click the "Show" button next to "Basic fields" to name your sample/comparison.  This is especially important, of course, if you're running GSEA multiple times.  You can also set the output directory. 
        * 8. Click "Run" at the bottom of the GSEA window.  It usually takes at least several minutes.  If you see "Error!" near the bottom left, click on it to diagnose what went wrong and try again.

     *  To run the same type of analysis on the command line, click the "Command" button to see the command the GUI used, then run that command on your Linux machine.
{{{
gsea-cli.sh GSEAPreranked -gmx ftp.broadinstitute.org://pub/gsea/gene_sets/h.all.v7.2.symbols.gmt -norm meandiv -nperm 1000 -rnk myFile.rnk -scoring_scheme weighted -rpt_label my_analysis -create_svgs false -make_sets true -plot_top_x 20 -rnd_seed timestamp -set_max 500 -set_min 15 -zip_report false -out ./output
}}}
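
The ranked file from step 1 can be generated with a short script. The sketch below assumes you already have a log2 fold change and a p-value per gene, and uses one possible scoring scheme that combines both: the sign of the fold change times -log10(p-value). The function names, gene values and output location are hypothetical; any other ranking metric could be substituted.

```python
import csv
import math
import os
import tempfile

def rank_score(log2_fc, p_value, min_p=1e-300):
    """Signed significance: direction from the fold change,
    magnitude from the p-value (floored at min_p to avoid log10(0))."""
    sign = 1.0 if log2_fc >= 0 else -1.0
    return sign * -math.log10(max(p_value, min_p))

def write_rnk(rows, path):
    """rows: iterable of (gene, log2_fc, p_value).
    Writes one 'gene<TAB>score' line per gene, without a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for gene, log2_fc, p_value in rows:
            writer.writerow([gene, f"{rank_score(log2_fc, p_value):.6g}"])

# Hypothetical differential expression results.
rows = [("TP53", 2.1, 1e-8), ("MYC", -1.4, 1e-3), ("GAPDH", 0.05, 0.9)]
rnk_path = os.path.join(tempfile.mkdtemp(), "myFile.rnk")
write_rnk(rows, rnk_path)

with open(rnk_path) as handle:
    rnk_lines = handle.read().splitlines()
print(rnk_lines)
```

The resulting file can then be loaded in the GUI or passed to the command-line call via `-rnk`.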
    

   ==== Traditional GSEA ====
 
  1.  Create the necessary files in the correct format for expression, phenotype and chip annotation ([[http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Using_RNA-seq_Datasets_with_GSEA | see GSEA wiki]])
  1.  Use MSigDB for gene sets or create custom gene sets in the correct format
  1.  Run GSEA, starting with the default options


   ==== Single-sample GSEA (ssGSEA) ====
  An extension of GSEA that can be used to determine enrichment of gene sets in individual samples.


More information

* [[http://software.broadinstitute.org/webservices/gpModuleRepository/download/prod/module/?file=/ssGSEAProjection/broad.mit.edu:cancer.software.genepattern.module.analysis/00270/7.6/ssGSEAProjection.zip | Broad's ssGSEA from GenePattern R/jar scripts]]

   ==== Fast gene set enrichment analysis (fgsea) ====
[[http://bioconductor.org/packages/release/bioc/html/fgsea.html | fgsea]] is an R package for fast preranked GSEA. It can quickly and accurately calculate arbitrarily low GSEA P-values for a collection of gene sets. You may want to try fgsea if the Broad GSEA takes too long to run. 

=== Cytoscape: BiNGO and ClueGO for visualization ===

You need to have [[http://cytoscape.org | Cytoscape]] installed to use BiNGO or ClueGO.

[[http://apps.cytoscape.org/apps/bingo|BiNGO Plugin/App]] and [[https://www.psb.ugent.be/cbd/papers/BiNGO/Home.html | documentation]]\\

  1.  Start BiNGO via Cytoscape, Apps->Start BiNGO
  1. Get genes from cluster/network or paste gene list
  1.  Select the correct options (e.g. species)
  1.  Run BiNGO

  ''Results'': Each node is color coded by its significance (p-value).  The size of a node represents the number of genes belonging to that GO term or category.  Edges represent (parent-child) relationships between terms. 

[[http://apps.cytoscape.org/apps/cluego | ClueGO Plugin/App]] and [[http://www.ici.upmc.fr/cluego/| documentation]] \\
Note:  A [[http://www.ici.upmc.fr/cluego/cluegoLicense.shtml | (free) license key]] is needed to run this tool.

  1.  Start ClueGO via Cytoscape, Apps->Start ClueGO
  1.  Paste genes
  1.  Select the correct options (e.g. species, database)
  1.  Run ClueGO (Note: you can use the slide bar to specify how general or detailed the visualized terms should be)

  ''Results'': Each node is color coded by the functionally related group it belongs to; a node with more than one color contains genes shared with multiple (related) groups.  The size of a node represents the significance of the enrichment.  Edges represent term-term interactions; a thicker edge represents more genes shared between the terms. 




=== GeneGO ===
[[http://portal.genego.com/ | GeneGO Login (Password Required)]]
  1.  Upload gene list and activate
  1.  One-click analysis -> Select GeneGo Pathway Maps


== Other/Useful Links ==

[[http://go.princeton.edu/cgi-bin/GOTermFinder | GO Term Finder]] : significant GO terms shared among a list of genes from your organism.[[BR]]

[[http://go.princeton.edu/cgi-bin/GOTermMapper | GO Term Mapper]] :  maps the granular GO annotations for genes in a list to a set of GO slim terms, allowing you to bin your genes into broad categories.

[[http://www.ingenuity.com/products/ipa | Ingenuity IPA]], subscription required.

[[http://www.advaitabio.com/ipathwayguide.html | Advaita iPathwayGuide]], login required - subscription required for downloading.


== More Information ==

Hot Topics: [[http://jura.wi.mit.edu/bio/education/hot_topics/enrichment/Gene_list_enrichment_Mar10.pdf | Gene List Enrichment ]]