Quality Control of Microarray Data
- Quality assessment of microarrays should be performed to ensure the data is reliable before normalization. Bioconductor has packages for QC'ing.
- For affy, use simpleaffy and/or affyPLM to examine percent present calls (from A/P calls), Normalized Unscale Standard Error (NUSE), and Relative Log Expression (RLE). Choose a cutoff for each method, for eg. more than 30% for present.
- For Agilent, the package arrayQualityMetrics can used. Sample code for getting percent present calls, RLE and NUSE (for Affy)
library("simpleaffy") library("affyPLM") #for NUSE # Read cel files from directory data = ReadAffy() # Create affy QA matrix data.qc = qc(data) # percent present pp = percent.present(data.qc) # RLE and NUSE plmStruc = fitPLM(data) RLE(plmStruct, type="stats") NUSE(plmStruct, type="stats") # arrayQualityMetrics using custom CDF (optional) library(arrayQualityMetrics) CELs = ReadAffy(cdfname="hgu133plus2hsentrezgcdf") eset = rma(CELs) arrayQualityMetrics(eset, outdir="QC", force=TRUE) # arrayQualityMetrics using oligo package library(arrayQualityMetrics) library(oligo) library(pd.huex.1.0.st.v2) geneCELs =list.celfiles(full.names = TRUE) affyGeneFS = read.celfiles(geneCELs) eset.oligoGeneCore = rma(affyGeneFS, target = "core") arrayQualityMetrics(eset.oligoGeneCore, outdir="QC", force=TRUE) #Agilent (arrayQualityMetrics) on 2-color library(arrayQualityMetrics) library(limma) scanFiles = dir(pattern = ".*.txt$") maData = read.maimages(scanFiles, source="agilent") arrayQualityMetrics(expressionset=maData, outdir="QC", force = TRUE, do.logtransform = TRUE) #Agilent (arrayQualityMetrics) on 1-color library(arrayQualityMetrics) library(limma) scanFiles = dir(pattern = ".*.txt$") maData = read.maimages(scanFiles, source="agilent", green.only=TRUE) eSet = new("ExpressionSet", exprs = maData$E, annotation =maData$genes[,7]) arrayQualityMetrics(eSet, outdir="QC", force = TRUE, do.logtransform = TRUE)
Pre-processing Affymetrix microarrays
- Each probe or gene signal needs to be corrected.
- General steps (for Affymetrix arrays):
- Background Correction, if any
- Normalization
- Summarization
- Commonly used methods for background correction and normalization Affy data include MAS5 (by Affy), Robust Multichip Array (RMA) and GC Robust Multichip Array (GCRMA).
- References for these methods:
- RMA: Irizarry RA et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2):249-64, 2003.
- RMA FAQs
- GCRMA: Wu Z and Irizarry RA. Stochastic models inspired by hybridization theory for short oligonucleotide arrays. Proceedings of RECOMB ’04.
- RMA is our preferred method. It's as good (or better) as MAS5 and GCRMA but also faster. On the other hand, all three of these methods are popular.
- RMA and GCRMA output log2-transformed intensities, whereas MAS5 outputs untransformed intensities.
Sample code for processing Affymetrix arrays using standard Affymetrix probeset definitions (CDF):
library(affy) # Read all CEL files CELs = ReadAffy() # Normalize with RMA or use justRMA for large data set # To use GCRMA use gcrma which ignores MM intensities eset = rma(CELs) # Get the expression values alone (as a matrix) eset.rma.expr = exprs(eset) # Affy QC report #library(affyQCReport) # Print report library(arrayQualityMetrics) arrayQualityMetrics(eset, outdir="QC", force=TRUE)
- Clustering samples after normalization can be used to see possible batch effects
d=dist(t(eset.rma.expr)) plot(hclust(d, method="average"))
- Analysis of Mouse (or Human) Gene 1.0 ST Arrays
The use of the Affy CDF file for these arrays is not recommended because it is "not supported" by Affymetrix. For these arrays use either a custom CDF (ex: CELs.entrez = ReadAffy(cdfname="mogene10stv1mmentrezgcdf")) or the standard Affymetrix probeset definitions but using the oligo package. The output from the oligo package agrees with the output from commercial Affymetrix analysis software. NOTE: Normalizing these arrays with small sample size (eg. < 5) may not work properly or as expected, eg. normalized values might occur multiple times within, or across, arrays. Possible solutions may include:
-
- normalizing each array individually and then quantile
- using Expresso (from Affy)
- fRMA
- Aroma package
Code to normalize using the oligo package and the standard Affymetrix probeset definitions (CDF)
with Human Gene ST Arrays:
library(oligo) library(pd.hugene.1.0.st.v1) geneCELs =list.celfiles(full.names = TRUE) affyGeneFS = read.celfiles(geneCELs) # Do RMA - transcript level (35556 features) eset.oligoGeneCore = rma(affyGeneFS, target = "core") eset.oligoGeneCore.exprs = exprs(eset.oligoGeneCore) write.table(eset.oligoGeneCore.exprs, file="RMA_using_oligo.txt", sep="\t", quote=F)
with Human Exon ST Arrays:
library(oligo) library(pd.huex.1.0.st.v2) geneCELs =list.celfiles(full.names = TRUE) affyGeneFS = read.celfiles(geneCELs) # Do RMA - gene level (22011 genes ["transcript clusters"]) eset.oligoGeneCore = rma(affyGeneFS, target = "core") eset.oligoGeneCore.exprs = exprs(eset.oligoGeneCore) write.table(eset.oligoGeneCore.exprs, file="RMA_using_oligo.txt", sep="\t", quote=F)
Code to normalize using the affy package and custom probeset definitions (CDF):
library(affy) library(mogene10stv1mmentrezgcdf) CELs.entrezGene = ReadAffy(cdfname="mogene10stv1mmentrezgcdf") # Do RMA - gene level (21225 features) eset.entrezGene = rma(CELs.entrezGene) eset.entrezGene.expr = exprs(eset.entrezGene) write.table(eset.ense.expr, file="RMA_using_affy_customCDF_EntrezGene.txt", sep="\t", quote=F)
Using custom probeset definitions with Affymetrix microarrays
- At the time they were designed, probes were organized into probesets, with each probeset representing one transcript (or part of a transcript). Since then, our understanding of transcripts and genes have increased, and it's possible to use updated probeset definitions based on gene models according to resources like NCBI Entrez Gene or Ensembl.
- One source of custom CDFs (chip definition files, describing probeset definitions) is is the BrainArray Microarray Lab at the University of Michigan.
- For custom CDFs keyed to Entrez Gene IDs, a probeset like 10000_at represents Entrez Gene with ID = 10000 (AKT3). Instead of using the entrez_gene database on canna, we can also use the 'db' package, like mogene10stv1mmentrezg.db to link probeset ID to other information.
- Annotations of custom CDF probesets can also be access from the corresponding R package [see code below].
- Some custom CDF files are designed to define each gene by just one probeset, eliminating the issue of summarizing a gene represented by multiple probesets.
Sample code for processing Affymetrix arrays using custom probeset definitions (CDF) and RMA:
library(affy) library(hgu133plus2hsentrezgcdf) # Read all CEL files CELs = ReadAffy(cdfname="hgu133plus2hsentrezgcdf") # Normalize with RMA or use justRMA for large data set eset = rma(CELs) # Get the expression values alone (as a matrix) eset.rma.expr = exprs(eset)
Sample code for processing Affymetrix arrays using custom probeset definitions (CDF) and GCRMA:
library(gcrma) # Normalize with GCRMA where CEL files are in the directory CEL_Files eset = justGCRMA(celfile.path = "CEL_Files", verbose=T, cdfname="mouse4302mmentrezg") # Get the expression values alone (as a matrix) eset.rma.expr = exprs(eset)
Accessing probeset annotations for Affymetrix arrays
To link standard probeset IDs to genes, one can download a "NetAffx Annotation File" from the Affymetrix web site (requiring free registration). Custom probeset IDs usually have a gene ID encoded in the probeset ID. For example, using Entrez Gene - based custom probesets, probeset 12960_at represents Crybb1, which has an Entrez Gene ID of 12960. Standard Affymetrix or custom probesets can also be annotated using Bioconductor, as long as one has installed the "db" package corresponding to the CDF file. If one used hgu133plus2hsentrezgcdf as the CDF, then hgu133plus2hsentrezg.db would contain probeset annotations.
# Use the standard "db" file for the Mouse Genome 430 2.0 array library(mouse4302.db) # Start with a list of all probesets on the array with a command like probesets.all = rownames(eset.rma.expr) # List all annotation types available (optional) ls("package:mouse4302.db") # Get some selected information gene.IDs = unlist(mget(probesets.all, mouse4302ENTREZID, ifnotfound=NA)) gene.symbols = unlist(mget(probesets.all, mouse4302SYMBOL, ifnotfound=NA)) gene.names = unlist(mget(probesets.all, mouse4302GENENAME, ifnotfound=NA)) # Print them all (and/or merge them with the expression matrix) probe.anno = cbind(probesets.all, gene.IDs, gene.symbols, gene.names) write.table(probe.anno, file="Mouse4302.probe_anno.txt", quote=F, sep="\t", row.names=F) # Get probes for a specific probeset (2_at) library(hgu133plus2hsentrezgprobe) as.data.frame(hgu133plus2hsentrezgprobe[hgu133plus2hsentrezgprobe$Probe.Set.Name=="2_at",])
Normalizing Agilent microarrays
- Steps for analyzing Agilent data and code
- Read data (*.txt files, one for each array)
- Run background correction (if desired)
- Use loess for within-array normalization and Aquantile for between-array normalization.
- Draw QC plots (eg. MA plot)
- Print MA and RG values for all data
library(limma) # Read array scanner files scanFiles = dir(pattern = ".*.txt$") maData = read.maimages(scanFiles, source="agilent") # Do not background subtract maData.nobg.0 = backgroundCorrect(maData, method="none", offset=0) # Normalize using loess MA.loess.0 = normalizeWithinArrays(maData.nobg.0, method="loess") # Normalize between arrays MA.loess.q.0 = normalizeBetweenArrays(MA.loess.0, method="Aquantile") # Write-out normalized values # After quantile normalization (M and A values) write.table(cbind(MA.loess.q.0$genes, MA.loess.q.0$M), file="M_log2ratios_loess_q.TXT", sep="\t", quote=FALSE, row.names=F) write.table(cbind(MA.loess.q.0$genes, MA.loess.q.0$A), file="A_log2intensities_loess_q.TXT", sep="\t", quote=FALSE, row.names=F) # Get R and G values for verification (optional) RG.loess.q.0 = RG.MA(MA.loess.q.0) write.table(cbind(RG.loess.q.0$genes, RG.loess.q.0$R ,RG.loess.q.0$G), file="RG_loess_q.TXT", sep="\t", quote=FALSE, row.names=F)
Sample R code for processing Agilent data:
Wi-files1\BaRC_Public\BaRC_code\R\normalize_agilent_2color_arrays_GB.R
Choice of offset
You can judge a good value for the offset by inspection of the MA-plots. If you really want a quantitative way to judge this, look at the component fit$df.prior after you use the eBayes() function in limma. The better you stabilise the variances, the larger will be df.prior and the greater will be the power to detect DE genes. Hence the offset which maximises df.prior is, in sense, optimal.
From: LIMMA: choice of offset... Reply by Gordon Smyth
Other Issues
- Some genes may be represented by multiple probes. To get a unique value for each gene, median (recommended), mean, or maximum value of the probes should be determined. See, http://jura.wi.mit.edu/bio/scripts/R/makeNonRedundantGeneMatrix.R and http://jura.wi.mit.edu/bio/scripts/R/makeNonRedundantProbeMatrix.R
- Spot flags can be used to define spot weights, which are taken into consideration for normalization and/or statistics, such as with commands like
# Define weight function spotWeights <- function(qta) { mapply(min,1-qta[,"gIsFeatNonUnifOL"],1-qta[,"gIsFeatNonUnifOL"], 1-qta[,"gIsBGNonUnifOL"],1-qta[,"gIsBGNonUnifOL"], 1-qta[,"gIsFeatPopnOL"],1-qta[,"gIsFeatPopnOL"], 1-qta[,"gIsBGPopnOL"],1-qta[,"gIsBGPopnOL"]) } # Use weights for normalization maData.weighted = read.maimages(scanFiles, source="agilent", wt.fun=spotWeights) # Use weights for differnetial expression statistics fit = lmFit(maData.weighted, design, weights = maData.weighted$weights)