Pooled CRISPR screen analysis
Method 1 (based on Wang et al., 2015)
These recommendations come from the Whitehead Functional Genomics platform.
The basic method is described in Wang et al., 2015, "Identification and characterization of essential genes in the human genome," Science 350(6264), 2015 Nov 27.
Detailed methods are in the supplementary materials.
Guide sequences should be required to have an exact match at the expected position (typically at the beginning of the read). See /nfs/BaRC_Public/BaRC_code/Perl/read_count_CRISPR_guides/read_count_CRISPR_guides.pl for one implementation of this.
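The exact-match requirement can be illustrated with a short Python sketch. This is not the Perl script referenced above; the function name count_guides, the start offset parameter, and the assumption that all guides in the library have the same length are hypothetical choices made for illustration.

```python
from collections import Counter


def count_guides(reads, guides, start=0):
    """Count reads that contain a library guide exactly at the expected position.

    reads  : iterable of read sequences (strings)
    guides : dict mapping guide sequence -> sgRNA ID
    start  : 0-based offset in the read where the guide is expected
             (0 means the guide sits at the beginning of the read)
    """
    lengths = {len(seq) for seq in guides}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes all guides have the same length")
    guide_len = lengths.pop()

    # Start every sgRNA at zero so guides with no reads are still reported.
    counts = Counter({sgrna_id: 0 for sgrna_id in guides.values()})
    for read in reads:
        candidate = read[start:start + guide_len].upper()
        sgrna_id = guides.get(candidate)  # exact match only, no mismatches allowed
        if sgrna_id is not None:
            counts[sgrna_id] += 1
    return dict(counts)
```

Reads with a mismatch, or with the guide shifted away from the expected offset, are simply not counted.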
Once counts are obtained, average the replicate measurements.
An aggregate reference set (summing initial counts from several samples) may be warranted, based on the experimental design.
Remove all sgRNAs that don't have an adequate number of counts in the (initial) aggregate reference set. This “adequate number” depends on the number of samples and sample coverage. For example, this was set to 200 for AML and 400 for CML.
Add a pseudocount of 1 to each sgRNA for each sample.
One may want to remove control and/or non-genic guides before further analysis.
Normalize all final samples and the aggregate reference set by reads per million (RPM).
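The preprocessing steps above (replicate averaging, aggregate reference, low-count filtering, pseudocount, RPM) can be sketched as follows. All function names are hypothetical, counts are assumed to be plain dicts of sgRNA ID to count, and treating the threshold as inclusive (>=) is an assumption; the appropriate threshold itself depends on the experiment.

```python
def average_replicates(replicates):
    """Average per-sgRNA counts across replicate count dicts.

    sgRNAs are taken from the first replicate; an sgRNA missing from
    another replicate is treated as having 0 counts there.
    """
    n = len(replicates)
    return {g: sum(rep.get(g, 0) for rep in replicates) / n for g in replicates[0]}


def aggregate_reference(initial_samples):
    """Sum initial counts across several samples into one reference set."""
    reference = {}
    for sample in initial_samples:
        for g, c in sample.items():
            reference[g] = reference.get(g, 0) + c
    return reference


def filter_by_reference(counts, reference, min_count):
    """Keep only sgRNAs with at least min_count reads in the reference set."""
    return {g: c for g, c in counts.items() if reference.get(g, 0) >= min_count}


def add_pseudocount_and_rpm(counts, pseudocount=1):
    """Add a pseudocount to every sgRNA, then scale to reads per million."""
    shifted = {g: c + pseudocount for g, c in counts.items()}
    total = sum(shifted.values())
    return {g: v * 1e6 / total for g, v in shifted.items()}
```

The same filter (based on the reference set) would be applied to every final sample so that all samples keep an identical set of sgRNAs before normalization.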
Calculate the following metrics:
guide-based CRISPR score (per sample) = log2 [(final counts + pseudocount) / (initial counts + pseudocount)]
guide-based CRISPR score (overall) = average guide-based CRISPR score across all samples
gene-based CRISPR score = average guide-based CRISPR score across all sgRNAs targeting a given gene
For each gene, the sgRNA with the lowest overall CRISPR score can be defined as the “best” sgRNA for that gene.
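The metrics above can be sketched in Python as follows. The function names and the guide_to_gene mapping are hypothetical. The pseudocount parameter mirrors the formula as written; if a pseudocount was already added before normalization, one might pass pseudocount=0 here (an assumption, since the text does not spell out how the two steps combine).

```python
import math
from collections import defaultdict


def guide_scores(final, initial, pseudocount=1.0):
    """Per-sample guide-based CRISPR score: log2 fold change, final vs. initial."""
    return {
        g: math.log2((final[g] + pseudocount) / (initial[g] + pseudocount))
        for g in initial
        if g in final
    }


def overall_guide_scores(per_sample_scores):
    """Average each guide's score across all samples (guides present in every sample)."""
    shared = set.intersection(*(set(s) for s in per_sample_scores))
    n = len(per_sample_scores)
    return {g: sum(s[g] for s in per_sample_scores) / n for g in shared}


def gene_scores(overall, guide_to_gene):
    """Gene-based CRISPR score: mean of the overall scores of the gene's sgRNAs."""
    by_gene = defaultdict(list)
    for g, score in overall.items():
        by_gene[guide_to_gene[g]].append(score)
    return {gene: sum(v) / len(v) for gene, v in by_gene.items()}


def best_guides(overall, guide_to_gene):
    """For each gene, the sgRNA with the lowest overall CRISPR score."""
    best = {}
    for g, score in overall.items():
        gene = guide_to_gene[g]
        if gene not in best or score < overall[best[gene]]:
            best[gene] = g
    return best
```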
Other details from the Wang et al. methods:
To identify genes essential for optimal proliferation under standard media conditions, the log2 fold change distribution of all sgRNAs targeting a given gene was compared with the entire distribution using a two-sample Kolmogorov-Smirnov test (the ks_2samp function from the scipy.stats Python library). The resulting p-values were corrected for multiple testing using the Benjamini-Hochberg procedure.
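A minimal sketch of this test, assuming the log2 fold changes are held in a dict keyed by sgRNA ID. ks_2samp is the scipy.stats function named above; ks_essentiality, benjamini_hochberg, and guide_to_gene are hypothetical names, and the Benjamini-Hochberg step is written out by hand here rather than taken from the original analysis code.

```python
import numpy as np
from scipy.stats import ks_2samp


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    ranked = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def ks_essentiality(guide_lfc, guide_to_gene):
    """Compare each gene's sgRNA scores with the scores of all sgRNAs.

    Returns {gene: (p_value, adjusted_p_value)}.
    """
    all_scores = np.array(list(guide_lfc.values()))
    by_gene = {}
    for g, lfc in guide_lfc.items():
        by_gene.setdefault(guide_to_gene[g], []).append(lfc)
    genes = sorted(by_gene)
    pvals = np.array([ks_2samp(by_gene[gene], all_scores).pvalue for gene in genes])
    qvals = benjamini_hochberg(pvals)
    return {gene: (float(p), float(q)) for gene, p, q in zip(genes, pvals, qvals)}
```

Note that the two-sample test only reports that a gene's distribution differs from the whole; the sign of the gene-based CRISPR score is still needed to tell depletion from enrichment.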
To identify cell line-specific essential genes, the CRISPR score (CS) distribution of each line was mean-normalized to zero. For each gene in each line, the minimum CS in the other three lines was subtracted from the CS in the given line to define a cell line-specific essentiality score (negative values indicate cell line specificity). For each line, genes with a differential score less than -1.5 (~4 standard deviations from the mean score) whose minimum CS in the other three lines was greater than -1 were defined as cell line-specific genes.
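The cell line-specific score described above can be sketched as follows. The function names and the dict-of-dicts input layout are hypothetical; the sketch assumes every line has a gene-based CRISPR score for the same set of genes, and the cutoffs default to the values quoted from the Wang et al. methods.

```python
def mean_center(scores):
    """Shift a line's gene-based CRISPR scores so that their mean is zero."""
    mean = sum(scores.values()) / len(scores)
    return {gene: s - mean for gene, s in scores.items()}


def line_specific_genes(cs_by_line, line, diff_cutoff=-1.5, other_min_cutoff=-1.0):
    """Genes specifically essential in `line`.

    cs_by_line : dict mapping cell line name -> {gene: gene-based CRISPR score}
    Returns {gene: differential score} for genes passing both cutoffs.
    """
    centered = {name: mean_center(cs) for name, cs in cs_by_line.items()}
    others = [name for name in centered if name != line]
    hits = {}
    for gene, cs in centered[line].items():
        # Most negative (most essential) score of this gene in the other lines.
        other_min = min(centered[o][gene] for o in others)
        diff = cs - other_min
        if diff < diff_cutoff and other_min > other_min_cutoff:
            hits[gene] = diff
    return hits
```

The second cutoff excludes genes that are already clearly essential in at least one other line, so a hit reflects a dependency unique to the given line.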