wiki:SOP/scRNA-seq

Single cell RNA-Seq to quantify gene levels and assay for differential expression

Load appropriate version of the Seurat package

As of April 18, 2024, the default version of Seurat available on fry/RStudio is v5.0.1.

To use Seurat version 5, enter the following command in R:

library(Seurat)

In some instances, certain software may only be compatible with older versions of Seurat, or you might need to run analyses on Seurat objects created with earlier versions. In these cases, you can load an earlier version of Seurat by specifying the library location.

For example, to load Seurat version 4, use this command:

library(irlba, lib.loc = "/nfs/apps/lib/R/4.2-focal/site-library.2023q4") # the 'irlba' package is required by Seurat for linear algebra
library(Seurat, lib.loc = "/nfs/apps/lib/R/4.2-focal/site-library.2023q1")

This command directs R to use a specific library path where Seurat version 4 is installed, ensuring compatibility with older datasets or software requirements.

Alternatively, we could explicitly load all R packages from the last R library set to keep the package versions consistent and compatible.

set_lib_paths <- function(lib_vec) {
  lib_vec <- normalizePath(lib_vec, mustWork = TRUE)
  shim_fun <- .libPaths
  shim_env <- new.env(parent = environment(shim_fun))
  shim_env$.Library <- character()
  shim_env$.Library.site <- character()
  environment(shim_fun) <- shim_env
  shim_fun(lib_vec)
}

set_lib_paths(c("/nfs/apps/lib/R/4.2-focal/site-library.2023q1", "/opt/R/4.2.1/lib/R/library"))

library(Seurat)
# load other required packages, e.g. scPred, for cell type annotation.
library(scPred)

By running the codes above, all the required R packages will be loaded from the previous R library set (2023q1).

Create a matrix of gene counts by cells

  • For 10x Genomics experiments, we use cell ranger to get this counts matrix.
    • The main command is cellranger count, which requires a reference transcriptome indexed specifically for cellranger.
    • Pre-built reference transcriptomes are available from 10x Genomics. Several of them are available at Whitehead on tak under /nfs/genomes/[ASSEMBLY]/10x where ASSEMBLY is specific to our nomenclature. Note that only certain gene types are included in these pre-built references.
    • Custom reference transcriptomes can be created with cellranger commands:
      • Filter the gtf to include only a subset of the annotated gene biotypes, for example,
        bsub cellranger mkgtf Homo_sapiens.GRCh38.93.gtf Homo_sapiens.GRCh38.93.filtered.gtf --attribute=gene_biotype:protein_coding
        
      • Create the cellranger index using a command such as
        bsub cellranger mkref --genome=MyGenome --fasta=genome.fa --genes=Genes.filtered.gtf --ref-version=1.0
        
    • Optional: How to create a STAR index with parameters different than the defaults
      • Run the "cellranger mkref" as specified above.
      • Look inside the "mkref" output folder for the "star" folder and the "genomeParameters.txt" file with the STAR command
      • Rerun the "STAR --runMode genomeGenerate" command with the new parameters and use the output from this step to replace the original "star" folder inside the cellranger mkref output.
  • Run the actual cellranger count command using syntax like
    bsub cellranger count --id=ID --fastqs=PATH --transcriptome=DIR --sample=SAMPLE_LIST --project=PROJECT
    
  • The output of 'cellranger count' includes
    • An indexed BAM file of all mapped reads (possorted_genome_bam.bam)
    • A Loupe Browser visualization and analysis file (cloupe.cloupe)
  • The quality control summary is "web_summary.html" in the 'outs' folder and has important quality metrics and graphs such as: Estimated Number of Cells, Mean Reads per Cell, Mean Reads per Cell, Sequencing Saturation, etc.
  • The "matrix" output files are not in the usual matrix structure. To create a standard 2-dimensional matrix, one can use R commands such as
    library(monocle)
    library(cellrangerRkit)
    cellranger_data_path = "/path/to/dir/with/outs/dir"
    crm = load_cellranger_matrix(cellranger_data_path)
    crm.matrix = as.matrix(exprs(crm))
    write.table(crm.matrix, "My.cellranger.matrix.txt", sep="\t", quote=F)
    
  • For experiments using hashing or CITE-Seq,
    • Follow this page Demultiplexing scRNA-seq hashtag oligos (HTOs)

      or
      #use CITE-seq-Count (https://hoohm.github.io/CITE-seq-Count)
      CITE-seq-Count -R1 R1_001.fastq.gz -R2 R2_001.fastq.gz -t tags.csv -cbf 1 -cbl 16 -umif 17 -umil 28 -cells 4000 -o CITESeq_Count_Out
      #10X cell barcode is usually first 16 bases
      #cbf, cell barcode first position, 1
      #cbl, cell barcode last position, 16
      
      #umi is usually the next 12 bases (for V3 chemistry), use 10 bases (for V2 chemistry)
      #umif, umi first position, 17
      #umil, umi last position, 28
      
      #FASTQC can be also used to verify if the above positions look correct
      
      #cells, expected number of cells (use Loupe browser to get estimate)
      
      #CITESeq_Count_Out will contain similar output to usual 10X which can be read-in, e.g
      data.htos<-Read10X("CITESeq_Count_Out/umi_count/", gene.column=1)
      

Combine/intersect CITE-seq-Count matrix in Seurat using HTODemux in Seurat Demultiplexing with hashtag oligos (HTOs)

Run quality control and filter cells

We typically use the Seurat R package for these steps. The commands below are for Seurat 3.

  • Start out by loading the counts matrix from cellranger:
    library("Seurat") 
    message("Loaded Seurat version", packageDescription("Seurat")$Version)
    # Load the barcodes*, features*, and matrix* files in your 10x Genomics directory
    counts.data <- Read10X(data.dir = input.counts.filename)
    
  • This is how you would load the counts from a file with gene counts:
    library("Seurat") 
    message("Loaded Seurat version", packageDescription("Seurat")$Version)
    counts.data <- read.table(file = paste0("./exp1_forSeurat.txt"))
    
  • Example of input data if starting with gene counts (only a few samples):
    E25-35_A4       E25-35_A5       E25-35_A6       E25-35_B4       E25-35_B5       E25-35_B6       E25-35_C4       E25-35_C5       E25-35_C6
    ENSMUSG00000064372_mt-Tp        7       10      7       10      3       17      13      4
    ENSMUSG00000064371_mt-Tt        0       0       0       1       0       5       0       0
    
  • Make the Seurat object and calculate the percentage of mitochondrial reads
    seuratObject <- CreateSeuratObject(counts = counts.data,project = "ProjectName")
    seuratObject[["percent.mt"]] <- PercentageFeatureSet(object = seuratObject, pattern = "^mt-")
    

Filter cells with high % reads mapping to mitochondrial transcripts and with low number of genes detected

These cutoffs are specific for each experiment

MIN.NUM.GENES = 200
MAX.NUM.GENES = 8000 
MAX.PERCENT.MITO = 20
all_Filt <- subset(x = seuratObject, subset = nFeature_RNA > MIN.NUM.GENES & nFeature_RNA < MAX.NUM.GENES & percent.mt < MAX.PERCENT.MITO)

Normalize data

all_Filt <- NormalizeData(object = all_Filt, normalization.method = "LogNormalize", scale.factor = 10000)

Identify of highly variable features

num.variable.features.to.find = 2000
all_Filt <- FindVariableFeatures(object = all_Filt, selection.method = "vst", nfeatures = num.variable.features.to.find)

Scale data

all.genes <- rownames(x = all_Filt)
all_Filt <- ScaleData(object = all_Filt, features = all.genes)

Perform and visualize dimensional analysis

Perform principal components analysis

all_Filt <- RunPCA(object = all_Filt, features = VariableFeatures(object = all_Filt))

pdf("./PCAPlot.pdf", w=11, h=8.5)
DimPlot(object = all_Filt, reduction = "pca")
DimPlot(object = all_Filt, dims = c(3, 4), reduction = "pca")
dev.off()

pdf("ElbowPlot.pdf", w=11, h=8.5)
ElbowPlot(object = all_Filt)
dev.off()

Based on the elbow plot decide how many components to use to run UMAP, tSNE and the Louvain clustering. Run non-linear dimensional reduction (UMAP/tSNE)

all_Filt <- RunUMAP(object = all_Filt, dims = 1:20)
all_Filt <- RunTSNE(object = all_Filt, dims = 1:20)

pdf("./UMAP_colorByExp.pdf", w=11, h=8.5)
DimPlot(object = all_Filt, reduction = "umap")
dev.off()
pdf("./TSNE_colorByExp.pdf", w=11, h=8.5)
TSNEPlot(object = all_Filt)
dev.off()

Partition cells into clusters

Run Louvain clustering using different resolutions to then decide which one to follow up on for further analysis. The resolution chosen depends on the granularity we want to work with and the cell heterogeneity.

all_Filt <- FindNeighbors(object = all_Filt, dims = 1:20)
all_Filt <- FindClusters(object = all_Filt, resolution = 0.5)

pdf("./UMAP_colorByCluster_Res0.5.pdf", w=11, h=8.5)
UMAPPlot(object = all_Filt, label= TRUE)
dev.off()

all_Filt <- FindClusters(object = all_Filt, resolution = 0.4)
pdf("./UMAP_colorByCluster_Res0.4.pdf", w=11, h=8.5)
UMAPPlot(object = all_Filt, label= TRUE)
dev.off()

all <- FindClusters(object = all_Filt, resolution = 0.3)
pdf("./UMAP_colorByCluster_Res0.3.pdf", w=11, h=8.5)
UMAPPlot(object = all_Filt, label= TRUE)
dev.off()

The clustering at different resolutions are stored in all_Filt$RNA_snn_res.0.3, all_Filt$RNA_snn_res.0.4, and all_Filt$RNA_snn_res.0.5

Identify genes that differentially expressed between samples or clusters

Years of research has led to effective algorithms to quantify differential expression between RNA-seq samples that have been assayed genome-wide. Single cell expression profiles, however, typically assay only a small fraction of all genes, and this single property greatly complicates differential expression analysis. Two general approaches exist for differential expression:

  • consider each cell as a sample
  • aggregate counts across all cells in a group/cluster, and treat them as one sample

Seurat has 2 functions "FindAllMarkers" and "FindMarkers" that work well as long as the fold change and percentage of cells expressing the gene thresholds are not too relaxed. We recommend logfc.threshold = 0.7, min.pct = .25. These functions by default add one pseudocount, "pseudocount.use = 1", to avoid dividing by zero. How we set this parameter, pseudocount.use, will have a strong effect on the LogFC output, especially when there are few cells in the groups compared.

  • example command to compare each cluster to all other clusters:
    MIN_LOGFOLD_CHANGE = 0.7
    MIN_PCT_CELLS_EXPR_GENE = .25
    all.markers.pos.wilcox = FindAllMarkers(all_Filt,
                                            min.pct = MIN_PCT_CELLS_EXPR_GENE,
                                            logfc.threshold = MIN_LOGFOLD_CHANGE,
                                            only.pos = TRUE,
                                            test.use="wilcox")
    
  • example command to compare 2 or more clusters or cell identities (i.e. control and treatment)
    markers1_versus_18.pos.wilcox = FindMarkers(all_Filt,
                                            min.pct = MIN_PCT_CELLS_EXPR_GENE,
                                            logfc.threshold = MIN_LOGFOLD_CHANGE,
                                            only.pos = TRUE,
                                            test.use="wilcox",
                                            ident.1  = "1",
                                            ident.2 = "18"
    

Add biological annotations to cells or cell clusters

Drawing biological conclusions from a single-cell experiment usually requires that one classify cells (or at least cell clusters) by type. Traditionally this is a time-consuming process of exploring marker genes and manually assigning cell type to each numbered cluster. Given that a number of public scRNA-seq experiments already have these annotations, one can leverage automated software with these ideally "gold standard" datasets to classify current experiments, either via expression profiles or marker genes.

Perform trajectory analysis

This step is relevant for projects that include cells at different stages of a developmental process or other change that is associated with a time course. Specific methods/algorithms for dimensional reduction are available to do this. Most of the methods have some concept of pseudotime, metric that one expects is correlated with actual time, but given that they aren't identical, interpretation needs to be performed with caution. Diffusion maps work well for this step and they have been implemented in R (https://bioconductor.org/packages/release/bioc/html/destiny.html) and python (https://scanpy.readthedocs.io/en/stable/api/scanpy.tl.diffmap.html).

Example R code to make diffusion maps with destiny

A comprehensive comparison among different trajectory methods have been done using dynverse. You can select the methods based on the type of topology on their website.

Here is the sample R code for with Slingshot, which is the top ranked method for bifurcation.

Detect genes that are differentially expressed between conditions

Often one is interested in how a treatment or condition alters the gene expression profile in a selected cell type. In the desirable case where multiple biological replicates are present for each condition, K. D. Zimmerman, M. A. Espeland and C. D. Langefeld have recently highlighted the importance of properly taking account of the correlation present in such hierarchically structured data. One strategy, the so-called pseudobulk approach, is to aggregate counts across cells from the same biological sample or subject (for examples of how to implement pseudobulk models see this link). Mixed-effects modeling, where sample is treated as a random effect, is another strategy. The code below uses mixed effects modeling within MAST and has been adapted from K. D. Zimmerman et al.. Note that the additional complexity and potential benefit of these mixed-effects models are accompanied by increased computational expense: fitting these models to thousands of genes in thousands of cells can be slow. A vignette outlining how to use MAST for differential expression in the more traditional fixed-effect mode (i.e. without including any random effects) can be found here.

  • Testing for differential expression with mixed-effects models in MAST:
    require(MAST)
    require(dplyr)
    require(gdata)
    require(reshape2)
    
    # Assume that differential expression is being tested with in a single
    # cluster/cell type from a single cell RNA-seq data-set.  The counts matrix
    # is the starting point.  From a Seurat object clusterCells, it can be extracted as follows:
    allgenesCT<-t(as.data.frame(clusterCells@assays$RNA@counts)) # Un-normalized counts.
    allgenesCT<-as.data.frame(allgenesCT)
    ngenes<-ncol(allgenesCT)
    allgenesCT$Sample<-clusterCells@meta.data$Sample
    allgenesCT$Treat<-clusterCells@meta.data$Treat
    allgenesCT$wellKey<-paste(clusterCells@meta.data$Sample, rownames(clusterCells@meta.data), sep = "_")
    rownames(allgenesCT)<-allgenesCT$wellKey
    allgenesCT<-allgenesCT[,c(ngenes+3,ngenes+2,ngenes+1,1:ngenes)]
    genecountsRaw<-as.matrix(t(allgenesCT[,c(-1,-2,-3)]))
    genecounts<-log2(genecountsRaw + 1) # Logarithmic (base 2) transformation with pseudocount.
    coldata<-allgenesCT[,1:3]
    coldata$Sample<-as.factor(coldata$Sample)
    
    # Filter out genes expressed in <10% of cells
    genecounts<-genecounts[rowSums(genecounts>0)/(ncol(genecounts))>=0.1,]
    genecounts<-genecounts[,rownames(coldata)]
    
    fDataCT<-data.frame(primerid=rownames(genecounts))
    sca<-MAST::FromMatrix(exprsArray=genecounts, cData=coldata, fData=fDataCT)
    
    # CDR is the proportion of genes detected in each cell and is defined in
    # G. Finak et al., Genome Biology 16, 278 (2015).  CDR functions as a
    # regularizer in the model.
    cdr2<-colSums(SummarizedExperiment::assay(sca)>0)
    
    # Reassign cellular detection rate.
    SummarizedExperiment::colData(sca)$ngeneson<-scale(cdr2)
    
    # Assign predictors as factors.
    SummarizedExperiment::colData(sca)$Sample<-factor(SummarizedExperiment::colData(sca)$Sample)
    SummarizedExperiment::colData(sca)$Treat<-factor(SummarizedExperiment::colData(sca)$Treat)
    
    # In the model below, "Treat" is a categorical fixed effect, while "Sample" is a categorical random effect with intercept
    # varying by sample.
    #
    # A note on the nAGQ=0 argument for fitting the discrete component of MAST's hurdle model:
    # 
    # Fitting the model involves optimizing an objective function, the Laplace approximation to the deviance, with respect to the
    # parameters. The Laplace approximation to the deviance requires determining the conditional modes of the random effects.
    # These are the values that maximize the conditional density of the random effects, given the model parameters and the data.
    # This is done using Penalized Iteratively Reweighted Least Squares (PIRLS). In most cases PIRLS is fast and stable. It is simply
    # a penalized version of the IRLS algorithm used in fitting GLMs.
    #
    # The distinction between the "fast" (nAGQ=0) and "slow" (nAGQ=1) algorithms in lme4 (to which glmer belongs)
    # is whether the fixed-effects parameters are optimized in PIRLS or a nonlinear optimizer.
    
    options(mc.cores = 4) # The number of available cores is hardware-dependent.
    zlmCond<-zlm(~ Treat + ngeneson + (1 | Sample), sca, method='glmer', ebayes = FALSE, fitArgsD = list(nAGQ = 0), strictConvergence = FALSE, parallel=TRUE)
    
    # Tabulate statistics for for treatment ("TreatT") versus control ("TreatC" is the reference level of "Treat").
    summaryCond<-MAST::summary(zlmCond,doLRT=TRUE)
    summaryDt<-summaryCond$datatable
    fcHurdle<-merge(summaryDt[contrast=='TreatT' & component=='H',.(primerid, `Pr(>Chisq)`)], summaryDt[contrast=='TreatT' & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid')
    fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')] # Correct for multiple testing.
    fcHurdle<-stats::na.omit(as.data.frame(fcHurdle)) # Omit NAs, if present.
    fcHurdle<-fcHurdle[order(fcHurdle$fdr),]
    
    # Examine the top differentially expressed genes, ranked by FDR.
    head(fcHurdle)
    

Perform trajectory-based differential expression analysis

Example R code to run trajectory-based differential expression analysis with tradeSeq

Combine multiple scRNA-seq datasets

Many experiments are especially informative when compared to other experiments, either performed by the same or different laboratories. This is challenging, however, especially when the different experiments profile different types of cells. In these cases, biological and technical differences are confounded, and one needs to make thoughtful assumptions about how to perform batch correction and achieve "success" during dataset integration.

As of 2020 there are more than a dozen algorithms available for integrating single cell RNA-seq data-sets. Three such methods are canonical correlation analysis (implemented in Seurat), iterative linear correction based on soft clustering (implemented in Harmony) and integrative nonnegative matrix factorization (implemented in LIGER). Commands for using each of these methods from within a Seurat workflow are given below.

  • Using CCA in Seurat (please see T. Stuart et al. “Comprehensive Integration of Single-Cell Data”, Cell 177, 1888-1902 (2019), the associated Seurat v.3 vignette and the documentation for the FindIntegrationAnchors function):
    library(Seurat) 
    
    # Merge two or more Seurat objects, objA and objB, from different batches.
    all <- merge(x=objA,y=objB,add.cell.ids=c("A","B"))
    
    # Split and re-integrate the merged object according to the batch slot.
    s3.list <- SplitObject(all, split.by = "batch")
    
    # This loop normalizes each experiment separately first.
    for (i in 1:length(s3.list)) {
        s3.list[[i]] <- NormalizeData(s3.list[[i]], verbose = FALSE)
        s3.list[[i]] <- FindVariableFeatures(s3.list[[i]], selection.method = "vst", nfeatures = 2000, verbose = FALSE)
    }
    
    # Find so-called anchors and carry out the integration.
    s3.anchors <- FindIntegrationAnchors(object.list = s3.list)
    s3.integrated <- IntegrateData(anchorset = s3.anchors) 
    DefaultAssay(s3.integrated) <- "integrated"
    

Use the integrated assay to run umap visualization and jointly define cell types. Use the RNA assay to identify canonical cell type marker genes and to show gene expression using FeaturePlot. Please check this vignette for more detail. https://satijalab.org/seurat/archive/v3.0/immune_alignment.html

  • Using Harmony from within Seurat (please see I. Korsunsky et al. “Fast, sensitive and accurate integration of single-cell data with Harmony”, Nature Methods 16, 1289-1296 (2019), the details of the Harmony algorithm linked below and the documentation for the RunHarmony function):
    library(Seurat) 
    library(harmony)
    
    # Merge two or more Seurat objects, objA and objB, from different batches.
    all <- merge(x=objA,y=objB,add.cell.ids=c("A","B"))
    
    # In anticipation of using Harmony to integrate data-sets below, first use Seurat to run PCA on the un-corrected data.
    
    all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
    all <- FindVariableFeatures(all, selection.method = "vst", nfeatures = 2000)
    all <- ScaleData(all, features = rownames(all))
    all <- RunPCA(all, features = VariableFeatures(object = all))
    
    # Do the integration using Harmony, indexing samples by the batch slot:
    all <- RunHarmony(all, "batch")
    
    # When generating UMAP or another embedding, be sure to use the integrated "harmony" reduction.
    all <- RunUMAP(all,reduction = "harmony")
    
  • Using LIGER (v 0.4.2.9000) from within Seurat (please see J.D. Welsh et al. “Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity”, Nature Biotechnology 37, 1873–1887 (2019), the LIGER documentation/vignettes the documentation for the RunOptimizeALS and RunQuantileAlignSNF functions):
    library(Seurat)
    library(SeuratWrappers) 
    library(liger)
    
    # Merge two or more Seurat objects, objA and objB, from different batches.
    all <- merge(x=objA,y=objB,add.cell.ids=c("A","B"))
    
    # In anticipation of using LIGER to integrate data-sets below, first use Seurat to scale the data without centering.
    
    all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
    all <- FindVariableFeatures(all, selection.method = "vst", nfeatures = 2000)
    all <- ScaleData(all, do.center=FALSE, split.by = "batch")
    
    # Do the integration using LIGER, indexing samples by the batch slot:
    all <- RunOptimizeALS(all, split.by = "batch")
    all <- RunQuantileAlignSNF(all, split.by = "batch")
    
    # When generating UMAP or another embedding, be sure to use the reduction from the integrated nonnegative factorization ("iNMF").
    all <- RunUMAP(all, dims = 1:ncol(all[["iNMF"]]), reduction = "iNMF")
    
    # The commands above use default values for the rank, k, of the NMF and the homogeneity parameter, lambda. Generally, the rank
    # should be chosen to be large enough to capture structure in the data matrix, yet small enough that the factorization gives reproducible
    # interpretations under multiple non-unique solutions. Increasing the homogeneity parameter places greater emphasis on the common
    # factors in the integration.  The heuristic functions suggestK and suggestLambda can be used to guide the data-set-specific
    # adjustment of these two parameters. 
    

Export expression and dimensional analysis data for interactive viewing

We prefer using UCSC's Cell Browser environment for this task.

  • Prerequisites. To make the most of this interactive viewing tool,
    • Run dimensional reduction (such as PCA, tSNE, UMAP).
    • Cluster/partition the cells (such as with Seurat's FindClusters()).
    • Identify cluster-specific marker genes (such as with Seurat'sFindAllMarkers()) and assemble/print information about them with commands such as
      all.markers.forCB = cbind(as.numeric(all.markers$cluster), all.markers$gene, all.markers$p_val_adj, all.markers$avg_logFC,
                            all.markers$pct.1, all.markers$pct.2)
      write.table(all.markers.forCB, file="all.markers.exported.txt", quote = FALSE, sep = "\t", row.names=F)
      
    • Add info/links about the marker genes with the CellBrowser command
      cbMarkerAnnotate all.markers.exported.txt markers.txt
      
  • Export the key data from the Seurat object:
    ExportToCellbrowser(seurat, dir=export.dir, dataset.name=dataset.name, markers.file=markers.file, reductions=c("pca", "tsne", "umap"))
    
  • Run Cell Browser's cbBuild (on tak) to create the web-viewable directory of files.
  • Move the cbBuild output to a web server, which creates a page that looks something like https://cells.ucsc.edu/

10x Genomics' Loupe Browser

The Loupe Browser is a nice tool for exploring and visualizing single-cell RNA-seq experiments. The cloupe file format that the Loupe Browser requires is typically created by the 10x Genomics program, Cell Ranger.

We recommend the Loupe Browser for exploration and visualization but not analysis or re-analysis.

Cell type classification

Several methods for cell-type classification have been published. A good place to start is to try:

This review article evaluates thirty-two published automatic methods for cell-type identification using scRNA-seq data.

Links to recommended scRNA-seq analysis tutorials and resources

Note: See TracWiki for help on using the wiki.