Context Navigation

← Previous Change
Wiki History
Next Change →

Changes between Initial Version and Version 1 of SOPs/normalize_ma

Timestamp:: 01/23/13 16:49:43 (12 years ago)
Author:: trac
Comment:: --

Legend:

: Unmodified
: Added
: Removed
: Modified

SOPs/normalize_ma

               v1
+=== Quality Control of Microarray Data===
+  * Quality assessment of microarrays should be performed to ensure the data is reliable before normalization.  Bioconductor has packages for QC'ing.
+    * For affy, use //simpleaffy// and/or //affyPLM// to examine percent present calls (from A/P calls), Normalized Unscale Standard Error (NUSE), and Relative Log Expression (RLE).  Choose a cutoff for each method, for eg. more than 30% for present.
+    * For Agilent, the package //arrayQualityMetrics//  can used.  Sample code for getting percent present calls, RLE and NUSE (for Affy)
+ {{{
+    library("simpleaffy")
+    library("affyPLM")  #for NUSE
+    # Read cel files from directory
+    data = ReadAffy()
+    # Create affy QA matrix
+    data.qc = qc(data)
+    # percent present
+    pp = percent.present(data.qc)
+    # RLE and NUSE
+    plmStruc = fitPLM(data)
+    RLE(plmStruct, type="stats")
+    NUSE(plmStruct, type="stats")
+    # arrayQualityMetrics using custom CDF (optional)
+    library(arrayQualityMetrics)
+    CELs = ReadAffy(cdfname="hgu133plus2hsentrezgcdf")
+    eset = rma(CELs)
+    arrayQualityMetrics(eset, outdir="QC", force=TRUE)
+    #Agilent (arrayQualityMetrics) on 2-color
+    library(arrayQualityMetrics)
+    library(limma)
+    scanFiles = dir(pattern = ".*.txt$")
+    maData = read.maimages(scanFiles, source="agilent")
+    arrayQualityMetrics(expressionset=maData, outdir="QC", force = TRUE, do.logtransform = TRUE)
+    #Agilent (arrayQualityMetrics) on 1-color
+    library(arrayQualityMetrics)
+    library(limma)
+    scanFiles = dir(pattern = ".*.txt$")
+    maData = read.maimages(scanFiles, source="agilent")
+    eSet = new("ExpressionSet", exprs = maData$E, annotation =maData$genes[,7])
+    arrayQualityMetrics(eSet, outdir="QC", force = TRUE, do.logtransform = TRUE)
+ }}}
+=== Pre-processing Affymetrix microarrays ===
+  * Each probe or gene signal needs to be corrected.
+  * General steps (for Affymetrix arrays):
+. Background Correction, if any
+. Normalization
+. Summarization
+  * Commonly used methods for background correction and normalization Affy data include MAS5 (by Affy), Robust Multichip Array (RMA) and GC Robust Multichip Array (GCRMA).
+  * References for these methods:
+      * RMA: Irizarry RA et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2):249-64, 2003.
+      * GCRMA: Wu Z and Irizarry RA. Stochastic models inspired by hybridization theory for short oligonucleotide arrays. Proceedings of RECOMB ’04.
+  * RMA is our preferred method.  It's as good (or better) as MAS5 and GCRMA but also faster.  On the other hand, all three of these methods are popular.
+  * RMA and GCRMA output log2-transformed intensities, whereas MAS5 outputs untransformed intensities.
+ Sample code for processing Affymetrix arrays using standard Affymetrix probeset definitions (CDF):
+{{{
+    library(affy)
+    # Read all CEL files
+    CELs = ReadAffy()
+    # Normalize with RMA or use justRMA for large data set
+    # To use GCRMA use gcrma which ignores MM intensities
+    eset = rma(CELs)
+    # Get the expression values alone (as a matrix)
+    eset.rma.expr = exprs(eset)
+    # Affy QC report
+    #library(affyQCReport)
+    # Print report
+    library(arrayQualityMetrics)
+    arrayQualityMetrics(eset, outdir="QC", force=TRUE)
+}}}
+  * Analysis of Mouse (or Human) Gene 1.0 ST Arrays
+ The use of the Affy CDF file for these arrays is not recommended because it is "not supported" by Affymetrix. For these arrays use either a custom CDF (ex: CELs.entrez = ReadAffy(cdfname="mogene10stv1mmentrezgcdf")) or the standard Affymetrix probeset definitions but using the oligo package. The output from the oligo package agrees with the output from commercial Affymetrix analysis software.  '''NOTE: Normalizing these arrays with small sample size (eg. < 5) may not work properly or as expected, eg. normalized values might occur multiple times within, or across, arrays. Possible solutions may include:'''
+  ::
+      * normalizing each array individually and then quantile
+      * using Expresso (from Affy)
+      * [http://www.bioconductor.org/packages/2.6/bioc/html/frma.html fRMA]
+      * [http://www.aroma-project.org/vignettes/GeneSTArrayAnalysis Aroma package]
+Code to normalize using the oligo package and the standard Affymetrix probeset definitions (CDF)
+ with Human Gene ST Arrays:
+{{{
+library(oligo)
+library(pd.mogene.1.0.st.v1)
+geneCELs =list.celfiles(full.names = TRUE)
+affyGeneFS = read.celfiles(geneCELs)
+# Do RMA - transcript level (35556 features)
+eset.oligoGeneCore = rma(affyGeneFS, target = "core")
+eset.oligoGeneCore.exprs = exprs(eset.oligoGeneCore)
+write.table(eset.oligoGeneCore.exprs, file="RMA_using_oligo.txt", sep="\t", quote=F)
+}}}
+ with Human Exon ST Arrays:
+{{{
+library(oligo)
+library(pd.huex.1.0.st.v2)
+geneCELs =list.celfiles(full.names = TRUE)
+affyGeneFS = read.celfiles(geneCELs)
+# Do RMA - gene level (22011 genes ["transcript clusters"])
+eset.oligoGeneCore = rma(affyGeneFS, target = "core")
+eset.oligoGeneCore.exprs = exprs(eset.oligoGeneCore)
+write.table(eset.oligoGeneCore.exprs, file="RMA_using_oligo.txt", sep="\t", quote=F)
+}}}
+Code to normalize using the affy package and custom probeset definitions (CDF):
+{{{
+library(affy)
+library(mogene10stv1mmentrezgcdf)
+CELs.entrezGene = ReadAffy(cdfname="mogene10stv1mmentrezgcdf")
+# Do RMA - gene level (21225 features)
+eset.entrezGene = rma(CELs.entrezGene)
+eset.entrezGene.expr = exprs(eset.entrezGene)
+write.table(eset.ense.expr, file="RMA_using_affy_customCDF_EntrezGene.txt", sep="\t", quote=F)
+}}}
+=== Using custom probeset definitions with Affymetrix microarrays ===
+  * At the time they were designed, probes were organized into probesets, with each probeset representing one transcript (or part of a transcript).  Since then, our understanding of transcripts and genes have increased, and it's possible to use updated probeset definitions based on gene models according to resources like NCBI Entrez Gene or Ensembl.
+  * One source of custom CDFs (chip definition files, describing probeset definitions) is [http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/ is the BrainArray Microarray Lab at the University of Michigan].
+  * For custom CDFs keyed to Entrez Gene IDs, a probeset like 10000_at represents Entrez Gene with ID = 10000 (AKT3). Instead of using the entrez_gene database on canna, we can also use the 'db' package, like mogene10stv1mmentrezg.db to link probeset ID to other information.
+  * Annotations of custom CDF probesets can also be access from the corresponding R package [see code below].
+  * Some custom CDF files are designed to define each gene by just one probeset, eliminating the issue of summarizing a gene represented by multiple probesets.
+ Sample code for processing Affymetrix arrays using custom probeset definitions (CDF):
+{{{
+    library(hgu133plus2hsentrezgcdf)
+    # Read all CEL files
+    CELs = ReadAffy(cdfname="hgu133plus2hsentrezgcdf")
+    # Normalize with RMA or use justRMA for large data set
+    eset = rma(CELs)
+    # Get the expression values alone (as a matrix)
+    eset.rma.expr = exprs(eset)
+}}}
+=== Accessing probeset annotations for Affymetrix arrays ===
+To link standard probeset IDs to genes, one can download a "NetAffx Annotation File" from the [[http://www.affymetrix.com/support/technical/byproduct.affx?cat=arrays | Affymetrix web site]] (requiring free registration).
+Custom probeset IDs usually have a gene ID encoded in the probeset ID.
+For example, using Entrez Gene - based custom probesets, probeset 12960_at represents Crybb1, which has an Entrez Gene ID of 12960.
+Standard Affymetrix or custom probesets can also be annotated using Bioconductor, as long as one has installed the "db" package corresponding to the CDF file.  If one used hgu133plus2hsentrezgcdf as the CDF, then
+hgu133plus2hsentrezg.db would contain probeset annotations.
+{{{
+    # Use the standard "db" file for the Mouse Genome 430 2.0 array
+    library(mouse4302.db)
+    # Start with a list of all probesets on the array with a command like
+    probesets.all = rownames(eset.rma.expr)
+    # List all annotation types available (optional)
+    ls("package:mouse4302.db")
+    # Get some selected information
+    gene.IDs = unlist(mget(probesets.all, mouse4302ENTREZID, ifnotfound=NA))
+    gene.symbols = unlist(mget(probesets.all, mouse4302SYMBOL, ifnotfound=NA))
+    gene.names = unlist(mget(probesets.all, mouse4302GENENAME, ifnotfound=NA))
+    # Print them all (and/or merge them with the expression matrix)
+    probe.anno = cbind(probesets.all, gene.IDs, gene.symbols, gene.names)
+    write.table(probe.anno, file="Mouse4302.probe_anno.txt", quote=F, sep="\t", row.names=F)
+    # Get probes for a specific probeset (2_at)
+    library(hgu133plus2hsentrezgprobe)
+    as.data.frame(hgu133plus2hsentrezgprobe[hgu133plus2hsentrezgprobe$Probe.Set.Name=="2_at",])
+}}}
+=== Normalizing Agilent microarrays ===
+  * Steps for analyzing Agilent data and code
+. Read data (*.txt files, one for each array)
+. Run background correction (if desired)
+. Use loess for within-array normalization and Aquantile for between-array normalization.
+. Draw QC plots (eg. MA plot)
+. Print MA and RG values for all data
+   {{{
+    library(limma)
+    # Read array scanner files
+    scanFiles = dir(pattern = ".*.txt$")
+    maData = read.maimages(scanFiles, source="agilent")
+    # Do not background subtract
+    maData.nobg.0 = backgroundCorrect(maData, method="none", offset=0)
+    # Normalize using loess
+    MA.loess.0 = normalizeWithinArrays(maData.nobg.0, method="loess")
+    # Normalize between arrays
+    MA.loess.q.0 = normalizeBetweenArrays(MA.loess.0, method="Aquantile")
+    # Write-out normalized values
+    # After quantile normalization (M and A values)
+    write.table(cbind(MA.loess.q.0$genes, MA.loess.q.0$M), file="M_log2ratios_loess_q.TXT", sep="\t", quote=FALSE, row.names=F)
+    write.table(cbind(MA.loess.q.0$genes, MA.loess.q.0$A), file="A_log2intensities_loess_q.TXT", sep="\t", quote=FALSE, row.names=F)
+    # Get R and G values for verification (optional)
+    RG.loess.q.0 = RG.MA(MA.loess.q.0)
+    write.table(cbind(RG.loess.q.0$genes, RG.loess.q.0$R ,RG.loess.q.0$G), file="RG_loess_q.TXT", sep="\t", quote=FALSE, row.names=F)
+}}}
+Sample R code for processing Agilent data: \\Wi-files1\BaRC_Public\BaRC_code\R\normalize_agilent_2color_arrays_GB.R
+====  Choice of offset   ====
+You can judge a good value for the offset by inspection of the
+MA-plots. If you really want a quantitative way to judge this, look
+at the component fit$df.prior after you use the eBayes() function in
+limma. The better you stabilise the variances, the larger will be
+df.prior and the greater will be the power to detect DE genes. Hence
+the offset which maximises df.prior is, in sense, optimal.
+From: [[https://stat.ethz.ch/pipermail/bioconductor/2006-April/012554.html | LIMMA: choice of offset...  Reply by Gordon Smyth ]]
+====  Other Issues   ====
+  * Some genes may be represented by multiple probes.  To get a unique value for each gene, median (recommended), mean, or maximum value of the probes should be determined.  See, [[http://iona.wi.mit.edu/bio/bioinfo/Rscripts/makeNonRedundantGeneMatrix.R]] and [[http://iona.wi.mit.edu/bio/bioinfo/Rscripts/makeNonRedundantProbeMatrix.R]]
+  * Spot flags can be used to define spot weights, which are taken into consideration for normalization and/or statistics, such as with commands like
+{{{
+# Define weight function
+spotWeights <- function(qta) {
+    mapply(min,1-qta[,"gIsFeatNonUnifOL"],1-qta[,"gIsFeatNonUnifOL"],
+-qta[,"gIsBGNonUnifOL"],1-qta[,"gIsBGNonUnifOL"],
+-qta[,"gIsFeatPopnOL"],1-qta[,"gIsFeatPopnOL"],
+-qta[,"gIsBGPopnOL"],1-qta[,"gIsBGPopnOL"])
+    }
+# Use weights for normalization
+maData.weighted = read.maimages(scanFiles, source="agilent", wt.fun=spotWeights)
+# Use weights for differnetial expression statistics
+fit = lmFit(maData.weighted, design, weights = maData.weighted$weights)
+}}}
+=== More Information ===
+  * [[http://jura.wi.mit.edu/bio/education/bioinfo2007/arrays/ | BaRC Course Notes: Microarray Analysis ]]
+  * [[http://iona.wi.mit.edu/bio/education/R2011/ | An introduction to R and Bioconductor: A BaRC Short Course ]]