Identifying enriched biological themes in gene sets

Recommendations

  • Choose appropriate:
    • gene sets (reference) for the biological question, e.g. the MSigDB hallmark (H) gene sets
    • background genes; note that some tools do not let you select a different background
  • Gene sets should contain approximately 15 to 500 genes
  • Use gene identifiers that are accepted by the tool
  • Check that the tool's databases and gene sets are up to date and actively maintained
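
The size recommendation above can be applied before running any tool. The following is a minimal sketch, assuming gene sets are held in a dict that maps a set name to a collection of gene identifiers; the function name `filter_gene_sets` and the toy sets are hypothetical and not part of any tool mentioned here.

```python
def filter_gene_sets(gene_sets, background=None, min_size=15, max_size=500):
    """Keep gene sets whose size, after restricting to the background, is within bounds."""
    kept = {}
    for name, genes in gene_sets.items():
        genes = set(genes)
        if background is not None:
            # Only genes that were actually assayed can contribute to enrichment.
            genes &= set(background)
        if min_size <= len(genes) <= max_size:
            kept[name] = genes
    return kept


# Toy example with made-up identifiers.
toy_sets = {
    "too_small": {f"g{i}" for i in range(5)},
    "ok": {f"g{i}" for i in range(40)},
    "too_large": {f"g{i}" for i in range(600)},
}
print(sorted(filter_gene_sets(toy_sets)))
```

Restricting each set to the background first means the size check reflects the genes that could actually be detected in the experiment.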

Database for Annotation, Visualization and Integrated Discovery (DAVID)

  • DAVID is generally the best place to start your enrichment analysis.
  • DAVID is a tool that analyzes a subset of assayed genes, asking the general question, "What's special about these genes compared to a random list of genes of the same size?"
  • Instructions for using DAVID can be found under Functional Annotation on the DAVID web site.
  • You'll probably end up running DAVID multiple times, with different types of annotations, to find the most informative combination.
  • More details:
    • DAVID has an upper limit of 3000 genes as input
    • Full output can be downloaded as text and viewed as a spreadsheet.
    • By default, DAVID uses the entire genome (annotation) as background. If your background is different, make sure to enter or change it; e.g. for a microarray study the background gene list would be all the genes/probesets assayed.
    • The DAVID Knowledgebase and annotation pages describe the databases and sources used by DAVID.
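
The question DAVID asks, and the reason the background matters, can be illustrated with an over-representation test. The sketch below computes a plain hypergeometric tail probability using only the Python standard library. It is an illustration of the idea, not necessarily the statistic DAVID reports; the function name and all counts are made up.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_list, n_overlap):
    """P(X >= n_overlap) when drawing n_list genes at random from the background.

    n_background: genes in the background (e.g. all genes assayed)
    n_annotated:  background genes carrying the annotation
    n_list:       genes in the submitted list
    n_overlap:    submitted genes carrying the annotation
    """
    total = comb(n_background, n_list)
    upper = min(n_annotated, n_list)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Same list and overlap, two different backgrounds (made-up counts).
p_genome = hypergeom_enrichment_p(20000, 200, 300, 10)
p_assayed = hypergeom_enrichment_p(12000, 200, 300, 10)
print(p_genome, p_assayed)
```

With these counts, the smaller background yields the larger p-value, because the annotation is relatively more common among the assayed genes than in the whole genome.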

Gene Set Enrichment Analysis (GSEA)

GSEA is very different from tools like DAVID. GSEA takes as input all assayed genes, along with a metric that GSEA uses to order the genes. Then it asks the general question, "What's special about the order of these genes compared to a randomly ordered list of the same genes?" In other words, it looks for gene annotations that are enriched at the top or bottom of your ordered genes.
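
To make "enriched at the top or bottom" concrete, here is a simplified running-sum score over a ranked list. It is a sketch loosely modeled on the weighted running-sum idea behind GSEA, not the Broad implementation: it omits the permutation-based significance and normalization steps, and the function name and toy data are hypothetical.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score over a ranked gene list (simplified sketch).

    ranked_genes: gene identifiers sorted from highest to lowest score
    scores:       ranking metric for each gene, in the same order
    gene_set:     genes belonging to the annotation being tested
    """
    gene_set = set(gene_set)
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_weight = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)
    if n_miss == 0 or hit_weight == 0:
        raise ValueError("need hits with non-zero scores and at least one non-member gene")

    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        if h:
            # Step up at a gene-set member, in proportion to its score.
            running += abs(s) ** weight / hit_weight
        else:
            # Step down by a fixed amount at every non-member.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (made-up genes and scores).
ranked = ["a", "b", "c", "d", "e", "f"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
print(enrichment_score(ranked, metric, {"a", "b"}))  # set concentrated at the top
print(enrichment_score(ranked, metric, {"e", "f"}))  # set concentrated at the bottom
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom gives a negative one.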

GSEA runs on almost any operating system, so you can use your own computer or a Whitehead Linux server such as tak.

Introductory information about GSEA

GSEAPreranked: start with a list of genes and values

  1. Create a two-column file with gene names in the first column and numeric values in the second (e.g. log2 fold change or log2 ratio). The file does not need to be sorted, and it should have the extension ".rnk".
    • The second column, used to rank genes, could be log2 fold change, t-statistic, or another scoring scheme that takes into account both log ratio and p-value.
  2. Run GSEA:
  • To run using the GUI
    • 1. Start GSEA. On tak, the command is 'gsea'.
    • 2. Load your ranked file "file.rnk": click on "Steps in GSEA analysis -> Load data".
    • 3. Click on "Tools -> GseaPreranked"
    • 4. Select one of the gene sets from the "Gene sets database". We recommend starting with the Hallmarks set (h.all). You can find more information about the sets here
    • 5. Select your uploaded ranked list (rnk file) for "Ranked list".
    • 6. The "Chip platform" refers to the type of identifiers in your rnk file. If your input file has human gene symbols, choose a platform file like "Human_Gene_Symbol_with_Remapping_MSigDB*.chip". If your input file has mouse gene symbols, you'll need to choose a "platform" that assigns the mouse symbols to orthologous human symbols (like Mouse_Gene_Symbol_Remapping_to_Human_Orthologs_MSigDB*.chip).
    • 7. Click the "Show" button next to "Basic fields" to name your sample/comparison. This is especially important, of course, if you're running GSEA multiple times. You can also set the output directory.
    • 8. Click "Run" at the bottom of the GSEA window. It usually takes at least several minutes. If you see "Error!" near the bottom left, click on it to diagnose what went wrong and try again.
  • To run the same type of analysis on the command line, click the "Command" button to see the command the GUI used, then run that command on your Linux machine.
    gsea-cli.sh GSEAPreranked -gmx ftp.broadinstitute.org://pub/gsea/gene_sets/h.all.v7.2.symbols.gmt -norm meandiv -nperm 1000 -rnk myFile.rnk -scoring_scheme weighted -rpt_label my_analysis -create_svgs false -make_sets true -plot_top_x 20 -rnd_seed timestamp -set_max 500 -set_min 15 -zip_report false -out ./output
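
The ".rnk" input from step 1 can be generated from a differential expression table. The sketch below assumes a tab-delimited two-column layout and uses sign(log2 fold change) × -log10(p-value) as one possible score that combines log ratio and p-value. The function name, gene names, and values are hypothetical.

```python
import math
import os
import tempfile


def write_rnk(path, rows):
    """Write a tab-delimited .rnk file: gene name, then ranking score.

    rows: iterable of (gene, log2_fold_change, p_value) tuples.
    The score is sign(log2FC) * -log10(p), one possible combined scheme.
    """
    with open(path, "w") as handle:
        for gene, log2fc, p_value in rows:
            # Guard against p-values reported as exactly 0.
            p_value = max(p_value, 1e-300)
            score = math.copysign(-math.log10(p_value), log2fc)
            handle.write(f"{gene}\t{score:.6g}\n")


# Toy input (made-up genes and statistics), written to a temporary directory.
toy_rows = [("GENE_A", 2.0, 0.001), ("GENE_B", -1.5, 0.01)]
out_path = os.path.join(tempfile.mkdtemp(), "myFile.rnk")
write_rnk(out_path, toy_rows)
with open(out_path) as handle:
    print(handle.read())
```

The rows are written in input order, since the file does not need to be sorted.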
    

Traditional GSEA

  1. Create the necessary files in the correct format for expression, phenotype and chip annotation (see GSEA wiki)
  2. Use MSigDB for gene sets or create custom gene sets in the correct format
  3. Run GSEA, using the default options to start

Single-sample GSEA (ssGSEA)

An extension of GSEA that can be used to determine enrichment of gene sets in individual samples.

More information

Fast gene set enrichment analysis (fgsea)

fgsea is an R package for fast preranked GSEA. It can quickly and accurately calculate arbitrarily low GSEA p-values for a collection of gene sets. You may want to try fgsea if the Broad GSEA takes too long to run.

Cytoscape: BiNGO and ClueGO for visualization

You need to have Cytoscape installed to use BiNGO or ClueGO.

BiNGO Plugin/App and documentation

  1. Start BiNGO via Cytoscape, Apps->Start BiNGO
  2. Get genes from cluster/network or paste gene list
  3. Select the correct options (e.g. species)
  4. Run BiNGO

Results: Each node is color coded by its significance (p-value). The size of a node represents the number of genes belonging to that GO term or category. Edges represent parent-child relationships between terms.

ClueGO Plugin/App and documentation
Note: A (free) license key is needed to run this tool.

  1. Start ClueGO via Cytoscape, Apps->Start ClueGO
  2. Paste genes
  3. Select the correct options (e.g. species, database)
  4. Run ClueGO (Note: you can use the slider to specify how general or detailed the visualized terms should be)

Results: Each node is color coded by the functionally related group it belongs to; a node with more than one color contains genes shared with multiple (related) groups. The size of a node represents the significance of the enrichment. Edges represent term-term interactions, and a thicker edge means more genes are shared between the terms.

GeneGO

GeneGO Login (Password Required)

  1. Upload gene list and activate
  2. One-click analysis -> Select GeneGo Pathway Maps

GO Term Finder : significant GO terms shared among a list of genes from your organism.

GO Term Mapper : maps the granular GO annotations for genes in a list to a set of GO slim terms, allowing you to bin your genes into broad categories.
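
The idea of binning genes into broad categories can be sketched by following parent links from each granular term until a slim term is reached. This is an illustration only and does not reproduce GO Term Mapper's actual algorithm; the term labels, gene names, and function names are hypothetical placeholders rather than real GO identifiers.

```python
def map_to_slim(term, parents, slim_terms):
    """Return the slim terms reachable from `term` by following parent links."""
    found, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim_terms:
            found.add(current)
            continue  # design choice of this sketch: stop at the first slim term on a path
        stack.extend(parents.get(current, ()))
    return found


def bin_genes(annotations, parents, slim_terms):
    """Group genes under every slim term that one of their annotations maps to."""
    bins = {}
    for gene, terms in annotations.items():
        for term in terms:
            for slim in map_to_slim(term, parents, slim_terms):
                bins.setdefault(slim, set()).add(gene)
    return bins


# Toy ontology and annotations (placeholder labels, not real GO terms).
toy_parents = {
    "T:specific_a": ["T:mid"],
    "T:mid": ["T:broad_1"],
    "T:specific_b": ["T:broad_2", "T:mid"],
}
toy_slim = {"T:broad_1", "T:broad_2"}
toy_annotations = {"geneX": ["T:specific_a"], "geneY": ["T:specific_b"]}
bins = bin_genes(toy_annotations, toy_parents, toy_slim)
for slim, genes in sorted(bins.items()):
    print(slim, sorted(genes))
```

A gene can land in more than one bin when its terms have several broad ancestors, as geneY does here.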

Ingenuity IPA, subscription required.

Advaita iPathwayGuide, login required - subscription required for downloading.

More Information

Hot Topics: Gene List Enrichment
