wiki:SOP/PooledCRISPR

Version 23 (modified by byuan, 2 years ago) ( diff )

--

Pooled CRISPR screen analysis

Method 1 (based on Wang et al., 2015)

Recommendations for this method come from Whitehead Functional Genomics platform.

Basic method is described in Wang et al., 2015, Identification and characterization of essential genes in the human genome, Science, 2015 Nov 27, 350(6264).

Detailed methods are in the supplementary materials.

Guide sequences should be required to have an exact match at the expected position (typically at the beginning of the read). See /nfs/BaRC_Public/BaRC_code/Perl/read_count_CRISPR_guides/read_count_CRISPR_guides.pl for one implementation of this.

Once counts are obtained, replicate measurements were averaged.

An aggregate reference set (summing initial counts from several samples) may be warranted, based on the experimental design.

Remove all sgRNAs that don't have an adequate number of counts in the (initial) aggregate reference set. This “adequate number” depends on the number of samples and sample coverage. For example, this was set to 200 for AML and 400 for CML.

Add a pseudocount of 1 to each sgRNA for each sample.

One may want to remove control and/or non-genic guides before further analysis.

Normalize all final samples and the aggregate reference set by reads per million (RPM).

Calculate the following metrics:

guide-based CRISPR score (per sample) = log2 [(final counts + pseudocount) / (initial counts + pseudocount)]

guide-based CRISPR score (overall) = average guide-based CRISPR score across all samples

gene-based CRISPR score = average guide-based CRISPR score across all sgRNAs targeting a given gene

For each gene, the sgRNA with the lowest overall CRISPR score can be defined as the “best” sgRNA for that gene.

Other details from the Wang et al. methods:

To identify genes essential for optimal proliferation under standard media conditions, the log2 fold change distribution for all sgRNAs targeting a given gene was compared with the entire distribution using a Kolmogorov-Smirnov test using the ks_2samp function from the scipy.stats Python library. The resulting p-values were corrected using the Benjamini-Hochberg procedure.

To identify cell line-specific essential genes, the CS distribution of each line was mean-normalized to zero. For each gene in each line, the CS in the given line was subtracted by the minimum CS in the other three lines to define a cell line-specific essentiality score (negative values indicate cell line specificity). For each line, genes with a differential score less than -1.5 (~4 standard deviations from the mean score) whose minimum CS in the other three lines was greater than -1 were defined as cell line-specific genes.

Getting guide counts from reads

Implementation 1: alignment free

  • Read in list of CRISPR guide sequences (of length n).
  • Read through short reads, counting presence of first n nucleotides (where n can have more than one value).
  • Iterate through CRISPR guide sequences, printing counts for each one.
  • Also print out most prevalent n-mers that are not in CRISPR guide list (for quality control).
  • See /nfs/BaRC_Public/BaRC_code/Perl/read_count_CRISPR_guides for this implementation.
  • If you have low amount of reads assigned to guides, the guides could be in a opposite strand. You can re-map with reverse and compliment of the guides.

Implementation 2: If you cannot access to the script in the above folder, you can use bowtie to do the alignment

  • Make bowtie index with sgRNA sequences, and ran bowtie to align reads to indexed sgRNAs.
  • Sample commands with 0 mismatch in the alignment, with 20bp long sequence
  • bowtie-build guide.fa guide
  • bowtie -3 20 -n 0 -l 20 -p 4 -S guide foo.fq.gz foo.sam
  • -3: trim <int> bases from 3' end of reads
  • -n: <int> max mismatches in seed (can be 0-3, default: -n 2)
  • -l: <int> seed length for -n (default: 28)
  • -p: <int> number of alignment threads to launch (default: 1)
  • Note: don't use --best nor -m with bowtie
  • Command to assign read counts for each sgRNA:
grep -v ^\@ foo.sam | awk -F"\t" '{ if($2==0 && $13=="MD:Z:20") print $3 }' | sort | uniq -c | awk '{ print $2"\t"$1 }'

Method 2: MAGeCK

Analyze CRISPR genome-wide, or targeted, screen. As input, MAGeCK requires (raw) counts. If there are no replicates, MAGeCK will estimate the mean/variance from all the samples, i.e. both conditions.

Test: compare two conditions

  • Common usage to test, or compare, two conditions
# the options -t and -c specificity the treatment and control samples, respectively.
mageck test -k count_matrix.txt -t top1,top2 -c bot1,bot2 -n mageck_out.txt

# For paired samples:
mageck test -k count_matrix.txt -t top1,top2 -c bot1,bot2 -n mageck_out.txt --paired
# note: --paired option is available in version 0.5.9+
# use the option --normcounts-to-file to write normalized counts (by guide) to a file

# If control guides are used, these can be specified using the options (below).
# MAGeCK will use the control guides for normalization and generating the null distribution in running RRA, instead of all the guides.
--control-sgrna controlGuides.txt --norm-method control

#One option is to run with and without control guides and compare the results, e.g. volcano plot of -log(RRA) vs logFC to see how the control guides compare.

# A summary of the output (in pdf) is available from running the R script (see below).
# In some versions of 0.5.9+ an error may occur for the 'paired' option,
# Error in axis(1, at = 1:length(vali), labels = (collabel), las = 2) :
#'at' and 'labels' lengths differ, 2 != 4
# a workaround this is to edit the R script is to set the collabel to only two conditions,
# e.g.  collabel=c("low","high")
# in the R script for all occurrences


The input file, count_matrix.txt, column names must match arguments to -c and -t, e.g.

sgRNAgenebot1bot2top1top2
sgACTL7A_2ACTL7A32141026
sgACTL7A_3ACTL7A444082118
sgACTL7A_4ACTL7A6461418313
sgACTL7A_5ACTL7A901774
sgACTL7A_6ACTL7A42547166
sgACTL7A_7ACTL7A14322360

The output files include,

  • .gene_summary.txt results summarized by gene (for all genes)
  • .sgrna_summary.txt resuls by guide (for all guides); this file can be made into a matrix using a few UNIX commands, e.g.

Run .Rmd to create a file (.html) which summarizes (only) the top hits, and also includes a waterfall plot. To create this summary report file, you can either open the .Rmd in Rstudio and click on "knit", or run the command below:

Rscript -e "rmarkdown::render('foo.report.Rmd')"
#get only the columns of interest: Gene, sgrna, control_mean, treat_mean
cut -f 1,2,5,6 mageck.sgrna_summary.txt | awk '{print $2"\t"$1"\t"$3"\t"$4}' > CRISPR_score_sgRNA.txt

#convert the single column into a (wide) matrix, each column is a guide and each row is a gene
grep -v crispr_sgRNA.txt | sed 's/_/\t/' | sort -k 1,1 -k 2,2 -k 3,3n | awk -F '\t' '{print $1"\t"$2"_"$3"\t"$4"\t"$5}' | grep  -v INTERGENIC | grep -v CTRL0 | cut -f 1,4 | groupBy -g 1 -c 2 -o collapse |sed 's/,/\t/g' > CRISPR_score_sgRNA.txt
Note: See TracWiki for help on using the wiki.