Pooled CRISPR screen analysis

Method 1 (based on Wang et al., 2015)

Recommendations for this method come from Whitehead Functional Genomics platform.

Basic method is described in Wang et al., 2015, Identification and characterization of essential genes in the human genome, Science, 2015 Nov 27, 350(6264).

Detailed methods are in the supplementary materials.

Guide sequences should be required to have an exact match at the expected position (typically at the beginning of the read). See /nfs/BaRC_Public/BaRC_code/Perl/read_count_CRISPR_guides/ for one implementation of this.

Once counts are obtained, replicate measurements were averaged.

An aggregate reference set (summing initial counts from several samples) may be warranted, based on the experimental design.

Remove all sgRNAs that don't have an adequate number of counts in the (initial) aggregate reference set. This “adequate number” depends on the number of samples and sample coverage. For example, this was set to 200 for AML and 400 for CML.

Add a pseudocount of 1 to each sgRNA for each sample.

One may want to remove control and/or non-genic guides before further analysis.

Normalize all final samples and the aggregate reference set by reads per million (RPM).

Calculate the following metrics:

guide-based CRISPR score (per sample) = log2 [(final counts + pseudocount) / (initial counts + pseudocount)]

guide-based CRISPR score (overall) = average guide-based CRISPR score across all samples

gene-based CRISPR score = average guide-based CRISPR score across all sgRNAs targeting a given gene

For each gene, the sgRNA with the lowest overall CRISPR score can be defined as the “best” sgRNA for that gene.

Other details from the Wang et al. methods:

To identify genes essential for optimal proliferation under standard media conditions, the log2 fold change distribution for all sgRNAs targeting a given gene was compared with the entire distribution using a Kolmogorov-Smirnov test using the ks_2samp function from the scipy.stats Python library. The resulting p-values were corrected using the Benjamini-Hochberg procedure.

To identify cell line-specific essential genes, the CS distribution of each line was mean-normalized to zero. For each gene in each line, the CS in the given line was subtracted by the minimum CS in the other three lines to define a cell line-specific essentiality score (negative values indicate cell line specificity). For each line, genes with a differential score less than -1.5 (~4 standard deviations from the mean score) whose minimum CS in the other three lines was greater than -1 were defined as cell line-specific genes.

Getting guide counts from reads

Implementation 1: alignment free

  • Read in list of CRISPR guide sequences (of length n).
  • Read through short reads, counting presence of first n nucleotides (where n can have more than one value).
  • Iterate through CRISPR guide sequences, printing counts for each one.
  • Also print out most prevalent n-mers that are not in CRISPR guide list (for quality control).
  • See /nfs/BaRC_Public/BaRC_code/Perl/read_count_CRISPR_guides for this implementation.

Implementation 2: using bowtie alignment

  • Make bowtie index with sgRNA sequences, and ran bowtie to align reads to indexed sgRNAs.
  • Sample commands with 0 mismatch in the alignment, with 20bp long sequence
  • bowtie-build guide.fa guide
  • bowtie -3 20 -n 0 -l 20 -p 4 -S guide foo.fq.gz foo.sam
  • -3: trim <int> bases from 3' end of reads
  • -n: <int> max mismatches in seed (can be 0-3, default: -n 2)
  • -l: <int> seed length for -n (default: 28)
  • -p: <int> number of alignment threads to launch (default: 1)
  • Note: don't use --best nor -m with bowtie
  • Command to assign read counts for each sgRNA:
grep -v ^\@ foo.sam | awk -F"\t" '{ if($2==0 && $13=="MD:Z:20") print $3 }' | sort | uniq -c | awk '{ print $2"\t"$1 }'

Method 2: MAGeCK

Analyze CRISPR genome-wide, or targeted, screen.

Test: compare two conditions

  • Common usage to test, or compare, two conditions
mageck test -k count_matrix.txt -t top1,top2 -c bot1,bot2 -n mageck_out.txt
# the options -t and -c specificity the treatment and control samples, respectively.

The input file, count_matrix.txt, column names must match arguments to -c and -t, e.g.


The output files include,

  • .summary.pdf file which summarizes (only) the top hits, and also includes a waterfall plot.
  • .gene_summary.txt results summarized by gene (for all genes)
  • .sgrna_summary.txt resuls by guide (for all guides); this file can be made into a matrix using a few UNIX commands, e.g.
#get only the columns of interest: Gene, sgrna, control_mean, treat_mean
cut -f 1,2,5,6 mageck.sgrna_summary.txt | awk '{print $2"\t"$1"\t"$3"\t"$4}' > CRISPR_score_sgRNA.txt

#convert the single column into a (wide) matrix, each column is a guide and each row is a gene
grep -v crispr_sgRNA.txt | sed 's/_/\t/' | sort -k 1,1 -k 2,2 -k 3,3n | awk -F '\t' '{print $1"\t"$2"_"$3"\t"$4"\t"$5}' | grep  -v INTERGENIC | grep -v CTRL0 | cut -f 1,4 | groupBy -g 1 -c 2 -o collapse |sed 's/,/\t/g' > CRISPR_score_sgRNA.txt