Single cell RNA-Seq to quantify gene levels and assay for differential expression

Create a matrix of gene counts by cells

  • For 10x Genomics experiments, we use cell ranger to get this counts matrix.
    • The main command is cellranger count, which requires a reference transcriptome indexed specifically for cellranger.
    • Pre-built reference transcriptomes are available from 10x Genomics. Several of them are available at Whitehead on tak under /nfs/genomes/[ASSEMBLY]/10x where ASSEMBLY is specific to our nomenclature. Note that only certain gene types are included in these pre-built references.
    • Custom reference transcriptomes can be created with cellranger commands:
      • Filter the gtf to include only a subset of the annotated gene biotypes, for example,
        bsub cellranger mkgtf Homo_sapiens.GRCh38.93.gtf Homo_sapiens.GRCh38.93.filtered.gtf --attribute=gene_biotype:protein_coding
      • Create the cellranger index using a command such as
        bsub cellranger mkref --genome=MyGenome --fasta=genome.fa --genes=Genes.filtered.gtf --ref-version=1.0
    • Run the actual cellranger count command using syntax like
      bsub cellranger count –id=ID –fastqs=PATH –transcriptome=DIR –sample=SAMPLE_LIST –project=PROJECT
  • The output of 'cellranger count' includes
    • An indexed BAM file of all mapped reads (possorted_genome_bam.bam)
    • A Loupe Browser visualization and analysis file (cloupe.cloupe)
  • The quality control summary is "web_summary.html" in the 'outs' folder and has important quality metrics and graphs such as: Estimated Number of Cells, Mean Reads per Cell, Mean Reads per Cell, Sequencing Saturation, etc.
  • The "matrix" output files are not in the usual matrix structure. To create a standard 2-dimensional matrix, one can use R commands such as
    cellranger_data_path = "/path/to/dir/with/outs/dir"
    crm = load_cellranger_matrix(cellranger_data_path)
    crm.matrix = as.matrix(exprs(crm))
    write.table(crm.matrix, "My.cellranger.matrix.txt", sep="\t", quote=F)

Run quality control and filter cells

We typically use the Seurat R package for these steps.

  • Start out by loading the counts matrix from cellranger:
    message("Loaded Seurat version", packageDescription("Seurat")$Version)
    # Load the barcodes*, features*, and matrix* files in your 10x Genomics directory
    counts.all <- Read10X(data.dir = input.counts.filename)

Perform and visualize dimensional analysis

Partition cells into clusters

Identify genes that differentially expressed between samples or clusters

Years of research has led to effective algorithms to quantify differential expression between RNA-seq samples that have been assayed genome-wide. Single cell expression profiles, however, typically assay only a small fraction of all genes, and this single property greatly complicates differential expression analysis. Two general approaches exist for differential expression:

  • consider each cell as a sample
  • aggregate counts across all cells in a group/cluster, and treat them as one sample

Add biological annotations to cells or cell clusters

Drawing biological conclusions from a single-cell experiment usually requires that one classify cells (or at least cell clusters) by type. Traditionally this is a time-consuming process of exploring marker genes and manually assigning cell type to each numbered cluster. Given that a number of public scRNA-seq experiments already have these annotations, one can leverage automated software with these ideally "gold standard" datasets to classify current experiments, either via expression profiles or marker genes.

Perform trajectory analysis

This step is relevant for projects that include cells at different stages of a developmental process or other change that is associated with a time course. Specific methods/algorithms for dimensional reduction are available to do this, but they often give very different results. Most of the methods have some concept of pseudotime, metric that one expects is correlated with actual time, but given that they aren't identical, interpretation needs to be performed with caution.

Combine multiple scRNA-seq datasets

Many experiments are especially informative when compared to other experiments, either performed by the same or different laboratories. This is challenging, however, especially when the different experiments profile different types of cells. In these cases, biological and technical differences are confounded, and one needs to make thoughtful assumptions about how to perform batch correction and achieve "success" during dataset integration.

As of 2020 there are more than a dozen algorithms available for integrating single cell RNA-seq data-sets. Three such methods are canonical correlation analysis (implemented in Seurat), iterative linear correction based on soft clustering (implemented in Harmony) and integrative nonnegative matrix factorization (implemented in LIGER). Commands for using each of these methods from within a Seurat workflow are given below.

  • Using CCA in Seurat (please see T. Stuart et al. “Comprehensive Integration of Single-Cell Data”, Cell 177, 1888-1902 (2019), the associated Seurat v.3 vignette and the documentation for the FindIntegrationAnchors function):
    # Merge two or more Seurat objects, objA and objB, from different batches.
    all <- merge(x=objA,y=objB,add.cell.ids=c("A","B"))
    # Split and re-integrate the merged object according to the batch slot.
    s3.list <- SplitObject(all, = "batch")
    # This loop normalizes each experiment separately first.
    for (i in 1:length(s3.list)) {
        s3.list[[i]] <- NormalizeData(s3.list[[i]], verbose = FALSE)
        s3.list[[i]] <- FindVariableFeatures(s3.list[[i]], selection.method = "vst", nfeatures = 2000, verbose = FALSE)
    # Find so-called anchors and carry out the integration.
    s3.anchors <- FindIntegrationAnchors(object.list = s3.list)
    s3.integrated <- IntegrateData(anchorset = s3.anchors) 
    DefaultAssay(s3.integrated) <- "integrated"
  • Using Harmony from within Seurat (please see I. Korsunsky et al. “Fast, sensitive and accurate integration of single-cell data with Harmony”, Nature Methods 16, 1289-1296 (2019), the details of the Harmony algorithm linked below and the documentation for the RunHarmony function):
    # Merge two or more Seurat objects, objA and objB, from different batches.
    all <- merge(x=objA,y=objB,add.cell.ids=c("A","B"))
    # In anticipation of using Harmony to integrate data-sets below, first use Seurat to run PCA on the un-corrected data.
    all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
    all <- FindVariableFeatures(all, selection.method = "vst", nfeatures = 2000)
    all <- ScaleData(all, features = rownames(all))
    all <- RunPCA(all, features = VariableFeatures(object = all))
    # Do the integration using Harmony, indexing samples by the batch slot:
    all <- RunHarmony(all, "batch")
    # When generating UMAP or another embedding, be sure to use the integrated "harmony" reduction.
    all <- RunUMAP(all,reduction = "harmony")
  • Using LIGER (v from within Seurat (please see J.D. Welsh et al. “Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity”, Nature Biotechnology 37, 1873–1887 (2019), the LIGER documentation/vignettes the documentation for the RunOptimizeALS and RunQuantileAlignSNF functions):
    # Merge two or more Seurat objects, objA and objB, from different batches.
    all <- merge(x=objA,y=objB,add.cell.ids=c("A","B"))
    # In anticipation of using LIGER to integrate data-sets below, first use Seurat to scale the data without centering.
    all <- NormalizeData(all, normalization.method = "LogNormalize", scale.factor = 10000)
    all <- FindVariableFeatures(all, selection.method = "vst", nfeatures = 2000)
    all <- ScaleData(all,, = "batch")
    # Do the integration using LIGER, indexing samples by the batch slot:
    all <- RunOptimizeALS(all, = "batch")
    all <- RunQuantileAlignSNF(all, = "batch")
    # When generating UMAP or another embedding, be sure to use the reduction from the integrated nonnegative factorization ("iNMF").
    all <- RunUMAP(all, dims = 1:ncol(all[["iNMF"]]), reduction = "iNMF")
    # The commands above use default values for the rank, k, of the NMF and the homogeneity parameter, lambda. Generally, the rank
    # should be chosen to be large enough to capture structure in the data matrix, yet small enough that the factorization gives reproducible
    # interpretations under multiple non-unique solutions. Increasing the homogeneity parameter places greater emphasis on the common
    # factors in the integration.  The heuristic functions suggestK and suggestLambda can be used to guide the data-set-specific
    # adjustment of these two parameters. 

Export expression and dimensional analysis data for interactive viewing

We prefer using UCSC's Cell Browser environment for this task.

  • Prerequisites. To make the most of this interactive viewing tool,
    • Run dimensional reduction (such as PCA, tSNE, UMAP).
    • Cluster/partition the cells (such as with Seurat's FindClusters()).
    • Identify cluster-specific marker genes (such as with Seurat'sFindAllMarkers()) and assemble/print information about them with commands such as
      all.markers.forCB = cbind(as.numeric(all.markers$cluster), all.markers$gene, all.markers$p_val_adj, all.markers$avg_logFC,
                            all.markers$pct.1, all.markers$pct.2)
      write.table(all.markers.forCB, file="all.markers.exported.txt", quote = FALSE, sep = "\t", row.names=F)
    • Add info/links about the marker genes with the CellBrowser command
      cbMarkerAnnotate all.markers.exported.txt markers.txt
  • Export the key data from the Seurat object:
    ExportToCellbrowser(seurat, dir=export.dir,, markers.file=markers.file, reductions=c("pca", "tsne", "umap"))
  • Run Cell Browser's cbBuild to create the web-viewable directory of files.
  • Move the cbBuild output to a web server, which creates a page that looks something like

Links to recommended scRNA-seq analysis tutorials and resources