Changes between Initial Version and Version 1 of SOPs/normalize_ma


Ignore:
Timestamp:
01/23/13 16:49:43 (12 years ago)
Author:
trac
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SOPs/normalize_ma

    v1 v1  
     1
     2=== Quality Control of Microarray Data===
     3
     4  * Quality assessment of microarrays should be performed to ensure the data is reliable before normalization.  Bioconductor has packages for QC'ing.
     5    * For affy, use //simpleaffy// and/or //affyPLM// to examine percent present calls (from A/P calls), Normalized Unscale Standard Error (NUSE), and Relative Log Expression (RLE).  Choose a cutoff for each method, for eg. more than 30% for present.
     6    * For Agilent, the package //arrayQualityMetrics//  can used.  Sample code for getting percent present calls, RLE and NUSE (for Affy)
     7
     8 {{{   
     9    library("simpleaffy")
     10    library("affyPLM")  #for NUSE
     11    # Read cel files from directory
     12    data = ReadAffy()
     13    # Create affy QA matrix
     14    data.qc = qc(data)
     15    # percent present
     16    pp = percent.present(data.qc)
     17    # RLE and NUSE
     18    plmStruc = fitPLM(data)
     19    RLE(plmStruct, type="stats")
     20    NUSE(plmStruct, type="stats")
     21   
     22    # arrayQualityMetrics using custom CDF (optional)
     23    library(arrayQualityMetrics)
     24    CELs = ReadAffy(cdfname="hgu133plus2hsentrezgcdf")
     25    eset = rma(CELs)
     26    arrayQualityMetrics(eset, outdir="QC", force=TRUE)
     27   
     28    #Agilent (arrayQualityMetrics) on 2-color
     29    library(arrayQualityMetrics)
     30    library(limma)
     31    scanFiles = dir(pattern = ".*.txt$")
     32    maData = read.maimages(scanFiles, source="agilent")
     33    arrayQualityMetrics(expressionset=maData, outdir="QC", force = TRUE, do.logtransform = TRUE)
     34
     35    #Agilent (arrayQualityMetrics) on 1-color
     36    library(arrayQualityMetrics)
     37    library(limma)
     38    scanFiles = dir(pattern = ".*.txt$")
     39    maData = read.maimages(scanFiles, source="agilent")
     40    eSet = new("ExpressionSet", exprs = maData$E, annotation =maData$genes[,7])
     41    arrayQualityMetrics(eSet, outdir="QC", force = TRUE, do.logtransform = TRUE)
     42
     43 }}}
     44
     45=== Pre-processing Affymetrix microarrays ===
     46
     47  * Each probe or gene signal needs to be corrected.
     48
     49  * General steps (for Affymetrix arrays):
     50     1. Background Correction, if any
     51     1. Normalization
     52     1. Summarization
     53
     54  * Commonly used methods for background correction and normalization Affy data include MAS5 (by Affy), Robust Multichip Array (RMA) and GC Robust Multichip Array (GCRMA). 
     55  * References for these methods:
     56      * RMA: Irizarry RA et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2):249-64, 2003.
     57      * GCRMA: Wu Z and Irizarry RA. Stochastic models inspired by hybridization theory for short oligonucleotide arrays. Proceedings of RECOMB ’04.
     58  * RMA is our preferred method.  It's as good (or better) as MAS5 and GCRMA but also faster.  On the other hand, all three of these methods are popular.
     59  * RMA and GCRMA output log2-transformed intensities, whereas MAS5 outputs untransformed intensities.
     60
     61 Sample code for processing Affymetrix arrays using standard Affymetrix probeset definitions (CDF):
     62
     63{{{
     64    library(affy)
     65    # Read all CEL files
     66    CELs = ReadAffy()
     67    # Normalize with RMA or use justRMA for large data set
     68    # To use GCRMA use gcrma which ignores MM intensities
     69    eset = rma(CELs)
     70    # Get the expression values alone (as a matrix)
     71    eset.rma.expr = exprs(eset)
     72    # Affy QC report
     73    #library(affyQCReport)
     74    # Print report
     75    library(arrayQualityMetrics)
     76    arrayQualityMetrics(eset, outdir="QC", force=TRUE)
     77}}}
     78  * Analysis of Mouse (or Human) Gene 1.0 ST Arrays
     79 The use of the Affy CDF file for these arrays is not recommended because it is "not supported" by Affymetrix. For these arrays use either a custom CDF (ex: CELs.entrez = ReadAffy(cdfname="mogene10stv1mmentrezgcdf")) or the standard Affymetrix probeset definitions but using the oligo package. The output from the oligo package agrees with the output from commercial Affymetrix analysis software.  '''NOTE: Normalizing these arrays with small sample size (eg. < 5) may not work properly or as expected, eg. normalized values might occur multiple times within, or across, arrays. Possible solutions may include:'''
     80  ::
     81      * normalizing each array individually and then quantile
     82      * using Expresso (from Affy)
     83      * [http://www.bioconductor.org/packages/2.6/bioc/html/frma.html fRMA]
     84      * [http://www.aroma-project.org/vignettes/GeneSTArrayAnalysis Aroma package]
     85
     86Code to normalize using the oligo package and the standard Affymetrix probeset definitions (CDF)
     87
     88 with Human Gene ST Arrays:
     89
     90{{{
     91library(oligo)
     92library(pd.mogene.1.0.st.v1)
     93geneCELs =list.celfiles(full.names = TRUE)
     94affyGeneFS = read.celfiles(geneCELs)
     95# Do RMA - transcript level (35556 features)
     96eset.oligoGeneCore = rma(affyGeneFS, target = "core")
     97eset.oligoGeneCore.exprs = exprs(eset.oligoGeneCore)
     98write.table(eset.oligoGeneCore.exprs, file="RMA_using_oligo.txt", sep="\t", quote=F)
     99}}}
     100
     101 with Human Exon ST Arrays:
     102
     103{{{
     104library(oligo)
     105library(pd.huex.1.0.st.v2)
     106geneCELs =list.celfiles(full.names = TRUE)
     107affyGeneFS = read.celfiles(geneCELs)
     108# Do RMA - gene level (22011 genes ["transcript clusters"])
     109eset.oligoGeneCore = rma(affyGeneFS, target = "core")
     110eset.oligoGeneCore.exprs = exprs(eset.oligoGeneCore)
     111write.table(eset.oligoGeneCore.exprs, file="RMA_using_oligo.txt", sep="\t", quote=F)
     112}}}
     113
     114Code to normalize using the affy package and custom probeset definitions (CDF):
     115
     116{{{
     117library(affy)
     118library(mogene10stv1mmentrezgcdf)
     119CELs.entrezGene = ReadAffy(cdfname="mogene10stv1mmentrezgcdf")
     120# Do RMA - gene level (21225 features)
     121eset.entrezGene = rma(CELs.entrezGene)
     122eset.entrezGene.expr = exprs(eset.entrezGene)
     123write.table(eset.ense.expr, file="RMA_using_affy_customCDF_EntrezGene.txt", sep="\t", quote=F)
     124}}}
     125
     126
     127=== Using custom probeset definitions with Affymetrix microarrays ===
     128   
     129  * At the time they were designed, probes were organized into probesets, with each probeset representing one transcript (or part of a transcript).  Since then, our understanding of transcripts and genes have increased, and it's possible to use updated probeset definitions based on gene models according to resources like NCBI Entrez Gene or Ensembl.
     130  * One source of custom CDFs (chip definition files, describing probeset definitions) is [http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/ is the BrainArray Microarray Lab at the University of Michigan].
     131  * For custom CDFs keyed to Entrez Gene IDs, a probeset like 10000_at represents Entrez Gene with ID = 10000 (AKT3). Instead of using the entrez_gene database on canna, we can also use the 'db' package, like mogene10stv1mmentrezg.db to link probeset ID to other information.
     132  * Annotations of custom CDF probesets can also be access from the corresponding R package [see code below].
     133  * Some custom CDF files are designed to define each gene by just one probeset, eliminating the issue of summarizing a gene represented by multiple probesets.
     134
     135 Sample code for processing Affymetrix arrays using custom probeset definitions (CDF):
     136
     137{{{
     138    library(hgu133plus2hsentrezgcdf)
     139    # Read all CEL files
     140    CELs = ReadAffy(cdfname="hgu133plus2hsentrezgcdf")
     141    # Normalize with RMA or use justRMA for large data set
     142    eset = rma(CELs)
     143    # Get the expression values alone (as a matrix)
     144    eset.rma.expr = exprs(eset)
     145}}}
     146
     147=== Accessing probeset annotations for Affymetrix arrays ===
     148
     149To link standard probeset IDs to genes, one can download a "NetAffx Annotation File" from the [[http://www.affymetrix.com/support/technical/byproduct.affx?cat=arrays | Affymetrix web site]] (requiring free registration).
     150Custom probeset IDs usually have a gene ID encoded in the probeset ID.
     151For example, using Entrez Gene - based custom probesets, probeset 12960_at represents Crybb1, which has an Entrez Gene ID of 12960.
     152Standard Affymetrix or custom probesets can also be annotated using Bioconductor, as long as one has installed the "db" package corresponding to the CDF file.  If one used hgu133plus2hsentrezgcdf as the CDF, then
     153hgu133plus2hsentrezg.db would contain probeset annotations.
     154
     155
     156{{{
     157    # Use the standard "db" file for the Mouse Genome 430 2.0 array
     158    library(mouse4302.db)
     159
     160    # Start with a list of all probesets on the array with a command like
     161    probesets.all = rownames(eset.rma.expr)
     162 
     163    # List all annotation types available (optional)
     164    ls("package:mouse4302.db")
     165
     166    # Get some selected information
     167    gene.IDs = unlist(mget(probesets.all, mouse4302ENTREZID, ifnotfound=NA))
     168    gene.symbols = unlist(mget(probesets.all, mouse4302SYMBOL, ifnotfound=NA))
     169    gene.names = unlist(mget(probesets.all, mouse4302GENENAME, ifnotfound=NA))
     170
     171    # Print them all (and/or merge them with the expression matrix)
     172    probe.anno = cbind(probesets.all, gene.IDs, gene.symbols, gene.names)
     173    write.table(probe.anno, file="Mouse4302.probe_anno.txt", quote=F, sep="\t", row.names=F)
     174
     175    # Get probes for a specific probeset (2_at)
     176    library(hgu133plus2hsentrezgprobe)
     177    as.data.frame(hgu133plus2hsentrezgprobe[hgu133plus2hsentrezgprobe$Probe.Set.Name=="2_at",])
     178
     179}}}
     180
     181=== Normalizing Agilent microarrays ===
     182
     183  * Steps for analyzing Agilent data and code
     184  1. Read data (*.txt files, one for each array)
     185  1. Run background correction (if desired)
     186  1. Use loess for within-array normalization and Aquantile for between-array normalization.
     187  1. Draw QC plots (eg. MA plot)
     188  1. Print MA and RG values for all data
     189
     190   {{{
     191    library(limma)
     192    # Read array scanner files
     193    scanFiles = dir(pattern = ".*.txt$")
     194    maData = read.maimages(scanFiles, source="agilent")
     195    # Do not background subtract
     196    maData.nobg.0 = backgroundCorrect(maData, method="none", offset=0)
     197    # Normalize using loess
     198    MA.loess.0 = normalizeWithinArrays(maData.nobg.0, method="loess")
     199    # Normalize between arrays
     200    MA.loess.q.0 = normalizeBetweenArrays(MA.loess.0, method="Aquantile")
     201   
     202    # Write-out normalized values
     203    # After quantile normalization (M and A values)
     204    write.table(cbind(MA.loess.q.0$genes, MA.loess.q.0$M), file="M_log2ratios_loess_q.TXT", sep="\t", quote=FALSE, row.names=F)
     205    write.table(cbind(MA.loess.q.0$genes, MA.loess.q.0$A), file="A_log2intensities_loess_q.TXT", sep="\t", quote=FALSE, row.names=F)
     206    # Get R and G values for verification (optional)
     207    RG.loess.q.0 = RG.MA(MA.loess.q.0)
     208    write.table(cbind(RG.loess.q.0$genes, RG.loess.q.0$R ,RG.loess.q.0$G), file="RG_loess_q.TXT", sep="\t", quote=FALSE, row.names=F)
     209}}}
     210
     211Sample R code for processing Agilent data: \\Wi-files1\BaRC_Public\BaRC_code\R\normalize_agilent_2color_arrays_GB.R
     212
     213====  Choice of offset   ====
     214
     215You can judge a good value for the offset by inspection of the
     216MA-plots. If you really want a quantitative way to judge this, look
     217at the component fit$df.prior after you use the eBayes() function in
     218limma. The better you stabilise the variances, the larger will be
     219df.prior and the greater will be the power to detect DE genes. Hence
     220the offset which maximises df.prior is, in sense, optimal.
     221
     222From: [[https://stat.ethz.ch/pipermail/bioconductor/2006-April/012554.html | LIMMA: choice of offset...  Reply by Gordon Smyth ]]
     223
     224====  Other Issues   ====
     225
     226  * Some genes may be represented by multiple probes.  To get a unique value for each gene, median (recommended), mean, or maximum value of the probes should be determined.  See, [[http://iona.wi.mit.edu/bio/bioinfo/Rscripts/makeNonRedundantGeneMatrix.R]] and [[http://iona.wi.mit.edu/bio/bioinfo/Rscripts/makeNonRedundantProbeMatrix.R]]
     227
     228  * Spot flags can be used to define spot weights, which are taken into consideration for normalization and/or statistics, such as with commands like
     229{{{
     230# Define weight function
     231spotWeights <- function(qta) {
     232    mapply(min,1-qta[,"gIsFeatNonUnifOL"],1-qta[,"gIsFeatNonUnifOL"],
     233    1-qta[,"gIsBGNonUnifOL"],1-qta[,"gIsBGNonUnifOL"],
     234    1-qta[,"gIsFeatPopnOL"],1-qta[,"gIsFeatPopnOL"],
     235    1-qta[,"gIsBGPopnOL"],1-qta[,"gIsBGPopnOL"])
     236    }
     237# Use weights for normalization
     238maData.weighted = read.maimages(scanFiles, source="agilent", wt.fun=spotWeights)
     239# Use weights for differnetial expression statistics
     240fit = lmFit(maData.weighted, design, weights = maData.weighted$weights)
     241}}}
     242
     243=== More Information ===
     244
     245  * [[http://jura.wi.mit.edu/bio/education/bioinfo2007/arrays/ | BaRC Course Notes: Microarray Analysis ]]
     246  * [[http://iona.wi.mit.edu/bio/education/R2011/ | An introduction to R and Bioconductor: A BaRC Short Course ]]
     247