Given a ChIP-seq experiment for a transcription factor (TF) of interest and a microarray or RNA-Seq experiment comparing WT samples to KO or RNAi for the same TF, the goal of this analysis is to find out if the genes potentially regulated by the TF are changing in expression and if there is an enrichment of genes differentially expressed (DE) within the genes bound by the TF and vice versa.
We will be using the term “genes bound” or “genes potentially regulated by the TF” to refer to genes that are within a certain distance from a ChIP-Seq peak. This is defined by the user. Some examples of criteria to define “genes bound” by the TF would be: 1) The region from 2kb upstream to 500 pb downstream of the TSS overlaps with a peak. 2) The gene body or the 10Kb upstream or 10 Kb downstream of the gene overlap with a peak.
Step 1: Define peaks ("bound regions") from a ChIP-Seq experiment
Step 2: Find genes differentially expressed from an expression microarray experiment
Follow standard operating procedures for normalizing microarray experiments and identifying differentially expressed genes to get a list of genes differentially expressed or a list of all the genes and their fold change on the microarray experiment.
For RNA-seq data, you can follow the steps in SOPs/rna-seq-diff-expressions to get a list of genes differentially expressed or a list of all the genes and their fold change.
Step 3: Find out if the genes potentially regulated by the TF are changing in expression
Method 1. Use R to draw cumulative distribution function (cdf) plots that show the distribution of expression changes for all genes and for genes bound. If binding results in activation or repression of the genes bound we should see a shift in the distributions. We can test if the difference is significant with a Kolmogorov–Smirnov (KS) test.
Naming: FC stands for fold change in expression between control cells and cells where the TF of interest has been knock down. GenesBound is the list of genes potentially regulated by the TF according to the ChIP-seq experiment (see introduction above). AllTranscripts.FC.txt and GenesBound.FC.txt files have the FC for all genes or for the subset of genes bound. We plot both distributions and look for differences.
The input files used for the R code below are like:
NM_000022 0.13 -0.02 -0.53 0.29 -0.06 -0.22 -0.18 -0.07 NM_000073 0.03 -0.39 -0.42 0.29 0.01 -0.23 -0.02 -0.14 NM_000188 -0.18 -0.43 -0.41 -0.72 -0.04 -0.29 -0.38 -0.47 NM_000270 0.31 0.54 0.50 0.67 0.32 0.18 0.31 0.54 NM_000314 -0.12 0.13 0.04 -0.23 0.03 0.14 0.11 -0.06 NM_000401 0.00 -0.40 0.13 -0.06 -0.32 0.10 0.15 0.18 NM_000404 -0.08 0.17 0.30 -0.20 -0.03 0.22 -0.08 0.11 .....
NM_000073 0.03 -0.39 -0.42 0.29 0.01 -0.23 -0.02 -0.14 NM_000401 0.00 -0.40 0.13 -0.06 -0.32 0.10 0.15 0.18 ...
These input files have the logFC of 8 know-down experiments. The expression changes we are interested in plotting are on the 6th expression column.
control <- read.delim("AllTranscripts.FC.txt") GenesBound <- read.delim("GenesBound.FC.txt") png(file="cdfPlot.png") plot(ecdf(control[,6]), xlab="logFC_hsTF", ylab="Cumulative Fraction") lines(ecdf(GenesBound[,6]), verticals=T,do.points=FALSE, col="purple", lty=1, xlim=c(-6,6)) dev.off()
# Perform a KS test ks.test(control[,6], GenesBound[,6])
Method 2. Pick a cut-off for genes that are differentially expressed and test if the bound genes are enriched for differentially expressed genes using Fisher's exact test.
For example: Of the 20,000 transcripts assayed by the microarray, 700 were differentially expressed and 400 of the DE expressed genes were also found to be bound by the TF. The total number of genes bound by the TF was 3000.
DE non-DE Totals Bound 400 2600 3000 Not-Bound 300 16700 17000 Totals 700 19300 20000
You can perform a Chi-squared test or a Fisher's exact test in R:
data <- matrix(c(400, 2600, 300, 16700), ncol=2, byrow=T) chisq.test(data) fisher.test(data)
Method 3. Use the GSEA tool from the Broad Institute. Use the raw array data and genes bound as the gene data set to look for enrichment GSEA
Method 4. For visualization, draw a scatterplot on log2 gene expression levels with WT and KO on X and Y axis, highlight the bound genes on top of all genes. This can be achieved with simple R codes. In the example below, the two text files each has three columns, with geneID, WT and KO as header.
control=read.delim("all_genes_exp.txt") GenesBound = read.delim("GenesBound_exp.txt") par(pty="s") plot(log2(control$WT), log2(control$KO)) abline(0,1) points (log2(GenesBound$WT, log2(GenesBound$KO), col="red")