wiki:SOPs/integrateExpressionIP

Introduction

Given a ChIP-seq experiment for a transcription factor (TF) of interest and a microarray or RNA-Seq experiment comparing WT samples to KO or RNAi for the same TF, the goal of this analysis is to find out if the genes potentially regulated by the TF are changing in expression and if there is an enrichment of genes differentially expressed (DE) within the genes bound by the TF and vice versa.

We will be using the term “genes bound” or “genes potentially regulated by the TF” to refer to genes that are within a certain distance from a ChIP-Seq peak. This is defined by the user. Some examples of criteria to define “genes bound” by the TF would be: 1) The region from 2kb upstream to 500 pb downstream of the TSS overlaps with a peak. 2) The gene body or the 10Kb upstream or 10 Kb downstream of the gene overlap with a peak.

Step 1: Define peaks ("bound regions") from a ChIP-Seq experiment

Follow the standard operating procedure for identifying ChIP-Seq peaks

Step 2: Find genes differentially expressed from an expression microarray experiment

Follow standard operating procedures for normalizing microarray experiments and identifying differentially expressed genes to get a list of genes differentially expressed or a list of all the genes and their fold change on the microarray experiment.

For RNA-seq data, you can follow the steps in SOPs/rna-seq-diff-expressions to get a list of genes differentially expressed or a list of all the genes and their fold change.

Step 3: Find out if the genes potentially regulated by the TF are changing in expression

Method 1. Use R to draw cumulative distribution function (cdf) plots that show the distribution of expression changes for all genes and for genes bound. If binding results in activation or repression of the genes bound we should see a shift in the distributions. We can test if the difference is significant with a Kolmogorov–Smirnov (KS) test.

Naming: FC stands for fold change in expression between control cells and cells where the TF of interest has been knock down. GenesBound is the list of genes potentially regulated by the TF according to the ChIP-seq experiment (see introduction above). AllTranscripts.FC.txt and GenesBound.FC.txt files have the FC for all genes or for the subset of genes bound. We plot both distributions and look for differences.

The input files used for the R code below are like:

AllTranscripts.FC.txt

 NM_000022        0.13    -0.02   -0.53     0.29   -0.06   -0.22   -0.18   -0.07
 NM_000073        0.03    -0.39   -0.42     0.29    0.01   -0.23   -0.02   -0.14
 NM_000188       -0.18    -0.43   -0.41    -0.72   -0.04   -0.29   -0.38   -0.47
 NM_000270        0.31     0.54    0.50     0.67    0.32    0.18    0.31    0.54
 NM_000314       -0.12     0.13    0.04    -0.23    0.03    0.14    0.11   -0.06
 NM_000401        0.00    -0.40    0.13    -0.06   -0.32    0.10    0.15    0.18
 NM_000404       -0.08     0.17    0.30    -0.20   -0.03    0.22   -0.08    0.11
    .....

GenesBound.FC.txt

 NM_000073       0.03    -0.39  -0.42   0.29    0.01  -0.23   -0.02   -0.14
 NM_000401       0.00    -0.40   0.13  -0.06   -0.32   0.10    0.15    0.18
 ...

These input files have the logFC of 8 know-down experiments. The expression changes we are interested in plotting are on the 6th expression column.

 control <- read.delim("AllTranscripts.FC.txt")
 GenesBound <- read.delim("GenesBound.FC.txt")
 png(file="cdfPlot.png")
 plot(ecdf(control[,6]), xlab="logFC_hsTF", ylab="Cumulative Fraction")
 lines(ecdf(GenesBound[,6]), verticals=T,do.points=FALSE, col="purple", lty=1, xlim=c(-6,6))
 dev.off()
 # Perform a KS test
 ks.test(control[,6], GenesBound[,6])

Method 2. Pick a cut-off for genes that are differentially expressed and test if the bound genes are enriched for differentially expressed genes using Fisher's exact test.

For example: Of the 20,000 transcripts assayed by the microarray, 700 were differentially expressed and 400 of the DE expressed genes were also found to be bound by the TF. The total number of genes bound by the TF was 3000.

              DE  non-DE  Totals
   Bound      400   2600  3000
   Not-Bound  300  16700  17000
   Totals     700  19300  20000

You can perform a Chi-squared test or a Fisher's exact test in R:

   data <- matrix(c(400, 2600, 300, 16700), ncol=2, byrow=T)
   chisq.test(data)
   fisher.test(data)

Method 3. Use the GSEA tool from the Broad Institute. Use the raw array data and genes bound as the gene data set to look for enrichment GSEA

Method 4. For visualization, draw a scatterplot on log2 gene expression levels with WT and KO on X and Y axis, highlight the bound genes on top of all genes. This can be achieved with simple R codes. In the example below, the two text files each has three columns, with geneID, WT and KO as header.

 control=read.delim("all_genes_exp.txt")
 GenesBound = read.delim("GenesBound_exp.txt")
 par(pty="s")
 plot(log2(control$WT), log2(control$KO))
 abline(0,1)
 points (log2(GenesBound$WT, log2(GenesBound$KO),  col="red")