= Pooled CRISPR screen analysis =

== Method 1: MAGeCK ==

Analyze a CRISPR genome-wide, or targeted, screen.  As input, MAGeCK requires (raw) counts.  If there are no replicates, MAGeCK will estimate the mean/variance from all the samples, i.e. both conditions.

  * [[https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0554-4 | Publication]]
  * [[https://sourceforge.net/projects/mageck/ | MAGeCK Home/Download Page]]
  * [[https://sourceforge.net/p/mageck/wiki/demo/ | Tutorial]]


==== Test: compare two conditions ====

  * Common usage to test, or compare, two conditions


{{{

# the options -t and -c specificity the treatment and control samples, respectively.
mageck test -k count_matrix.txt -t top1,top2 -c bot1,bot2 -n mageck_out.txt

# For paired samples:
mageck test -k count_matrix.txt -t top1,top2 -c bot1,bot2 -n mageck_out.txt --paired
# note: --paired option is available in version 0.5.9+
# use the option --normcounts-to-file to write normalized counts (by guide) to a file

# If control guides are used, these can be specified using the options (below).
# MAGeCK will use the control guides for normalization and generating the null distribution in running RRA, instead of all the guides.
--control-sgrna controlGuides.txt --norm-method control

#One option is to run with and without control guides and compare the results, e.g. volcano plot of -log(RRA) vs logFC to see how the control guides compare.

# A summary of the output (in pdf) is available from running the R script (see below).
# In some versions of 0.5.9+ an error may occur for the 'paired' option,
# Error in axis(1, at = 1:length(vali), labels = (collabel), las = 2) :
#'at' and 'labels' lengths differ, 2 != 4
# a workaround this is to edit the R script is to set 'collabel' (column labels) to only two conditions,
# e.g.  collabel=c("low","high")
# in the R script for all occurrences


}}}


The input file, count_matrix.txt, column names must match arguments to -c and -t but can also include other samples, which will be ignored.  The tab-delimited format should look like


||sgRNA||gene||bot1||bot2||top1||top2||
||sgACTL7A_2||ACTL7A||32||14||10||26||
||sgACTL7A_3||ACTL7A||44||40||82||118||
||sgACTL7A_4||ACTL7A||64||61||418||313||
||sgACTL7A_5||ACTL7A||9||0||17||74||
||sgACTL7A_6||ACTL7A||42||5||47||166||
||sgACTL7A_7||ACTL7A||14||32||23||60||


The output files include

  - .gene_summary.txt results summarized by gene (for all genes)
  - .sgrna_summary.txt results by guide (for all guides); this file can be made into a matrix using a few UNIX commands, e.g. 

Run .Rmd to create a file (.html) which summarizes (only) the top hits, and also includes a waterfall plot. To create this summary report file, you can either open the .Rmd in Rstudio and click on "knit", or run the command below:
{{{
Rscript -e "rmarkdown::render('foo.report.Rmd')"
}}}


{{{
# Get only the columns of interest: Gene, sgrna, control_mean, treat_mean
cut -f 1,2,5,6 mageck.sgrna_summary.txt | awk '{print $2"\t"$1"\t"$3"\t"$4}' > CRISPR_score_sgRNA.txt

# Convert the single column into a (wide) matrix, each column is a guide and each row is a gene
grep -v crispr_sgRNA.txt | sed 's/_/\t/' | sort -k 1,1 -k 2,2 -k 3,3n | awk -F '\t' '{print $1"\t"$2"_"$3"\t"$4"\t"$5}' | grep  -v INTERGENIC | grep -v CTRL0 | cut -f 1,4 | groupBy -g 1 -c 2 -o collapse |sed 's/,/\t/g' > CRISPR_score_sgRNA.txt
}}}


== Method 2 (based on Wang et al., 2015) ==

Basic method is described in [https://www.ncbi.nlm.nih.gov/pubmed/26472758 Wang et al., 2015], Identification and characterization of essential genes in the human genome, Science, 2015 Nov 27, 350(6264).

Detailed methods are in the [http://science.sciencemag.org/content/sci/suppl/2015/10/14/science.aac7041.DC1/Wang-SM.pdf supplementary materials].

Guide sequences should be required to have an exact match at the expected position (typically at the beginning of the read). See /nfs/BaRC_Public/BaRC_code/Perl/read_count_CRISPR_guides/read_count_CRISPR_guides.pl for one implementation of this.

Once counts are obtained, replicate measurements were averaged.

An aggregate reference set (summing initial counts from several samples) may be warranted, based on the experimental design.

Remove all sgRNAs that don't have an adequate number of counts in the (initial) aggregate reference set. This “adequate number” depends on the number of samples and sample coverage. For example, this was set to 200 for AML and 400 for CML.

Add a pseudocount of 1 to each sgRNA for each sample.

One may want to remove control and/or non-genic guides before further analysis.

Normalize all final samples and the aggregate reference set by reads per million (RPM).

Calculate the following metrics:

guide-based CRISPR score (per sample) = log2 [(final counts + pseudocount) / (initial counts + pseudocount)]

guide-based CRISPR score (overall) = average guide-based CRISPR score across all samples

gene-based CRISPR score = average guide-based CRISPR score across all sgRNAs targeting a given gene

For each gene, the sgRNA with the lowest overall CRISPR score can be defined as the “best” sgRNA for that gene.

'''Other details from the Wang et al. methods:'''

To identify genes essential for optimal proliferation under standard media conditions, the log2 fold change distribution for all sgRNAs targeting a given gene was compared with the entire distribution using a Kolmogorov-Smirnov test using the ks_2samp function from the scipy.stats Python library. The resulting p-values were corrected using the Benjamini-Hochberg procedure.

To identify cell line-specific essential genes, the CS distribution of each line was mean-normalized to zero. For each gene in each line, the CS in the given line was subtracted by the minimum CS in the other three lines to define a cell line-specific essentiality score (negative values indicate cell line specificity). For each line, genes with a differential score less than -1.5 (~4 standard deviations from the mean score) whose minimum CS in the other three lines was greater than -1 were defined as cell line-specific genes.

== Getting guide counts from reads ==

'''Implementation 1: alignment free'''

* Read in list of CRISPR guide sequences (of length n).
* Read through short reads, counting presence of first n nucleotides (where n can have more than one value).
* Iterate through CRISPR guide sequences, printing counts for each one.
* Also print out most prevalent n-mers that are not in CRISPR guide list (for quality control).
* See /nfs/BaRC_Public/BaRC_code/Perl/read_count_CRISPR_guides for this implementation.
* If you have low amount of reads assigned to guides, the guides could be in a opposite strand. You can re-map with reverse and compliment of the guides.

'''Implementation 2: If you cannot access to the script in the above folder, you can use bowtie to do the alignment'''
* Make bowtie index with sgRNA sequences, and ran bowtie to align reads to indexed sgRNAs.

* Sample commands with 0 mismatch in the alignment, with 20bp long sequence
* bowtie-build guide.fa guide
* bowtie  -3 20 -n 0  -l 20 -p 4  -S guide foo.fq.gz foo.sam

* -3: trim <int> bases from 3' end of reads
* -n: <int> max mismatches in seed (can be 0-3, default: -n 2)
* -l: <int> seed length for -n (default: 28)
* -p: <int> number of alignment threads to launch (default: 1)

* Note: don't use --best nor -m with bowtie

* Command to assign read counts for each sgRNA:

{{{
grep -v ^\@ foo.sam | awk -F"\t" '{ if($2==0 && $13=="MD:Z:20") print $3 }' | sort | uniq -c | awk '{ print $2"\t"$1 }'
}}}