==== Preparing a mass spec experiment for differential protein expression ====

This method is for label-free samples from our Proteomics Core Facility, which has some [[http://massspec.wi.mit.edu/documents/Scaffoldhowto060513.pdf|Scaffold quick instructions]].

  * Using the (free) Scaffold Viewer (available from [[http://www.proteomesoftware.com/products/free-viewer|Proteome Software]], with a large [[http://www.proteomesoftware.com/pdf/scaffold_users_guide.pdf|User's Manual]]), open the sf3 file.

  * By default, three filters prevent all mapped proteins from being displayed:
    * Protein Threshold (default = 99%)
    * Min # Peptides (default = 2)
    * Peptide Threshold (default = 95%) 
  * These filters are typical good with the default settings.

  * Scaffold has multiple display and quantification options.
    * From the Scaffold User's Manual:
      * Spectrum Counting methods are the most reliable in answering the question, "Is anything changing between experimental conditions?".
      * Precursor Ion Intensity quantification methods are very reliable in answering the question, "How much is the amount of change I am dealing with?"
      * The Total Ion Count (TIC) methods can answer both questions but not very well. 
    * For a quick QC check: 
      * Look at the first few most highly expressed proteins.  
      * Are they within 2-fold or so of each other?  
      * If some samples have much lower or higher counts, their sensitivity may differ so much that samples may not be comparable.

    * Under Display Options, select "Total Spectrum Count"

    * For Spectrum Counting, click on the "Quantitative Analysis" icon (showing a bar graph).
       * Keep "Use Normalization" checked.
       * For Quantitative Method, select "Total Spectra".
       * Click the Apply button.
       * Next to Display Option: select Quantitative Value (Normalized Total Spectra)
       * Export (top menu) => Current View.
       * Use this file to identify differentially expressed proteins (to be explained below).

    * For Precursor Ion Intensity quantification, click on the "Quantitative Analysis" icon (showing a bar graph).
       * Keep "Use Normalization" checked.
       * For Quantitative Method, select "Top 3 Precursor Intensity".
       * Click the Apply button.
       * Next to Display Option: select Quantitative Value (Normalized Top 3 Precursor Intensity)
       * Export (top menu) => Current View.
       * Use this file for visualization (to be explained below).
  
    * To identify differentially expressed proteins, use the normalized Total Spectrum Count
      * Normalize across samples (typically using quantile normalization)
      * Impute missing values 
      * Recommended statistic: t-test or moderated t-test (such as is implemented in 'limma') on log2 transformed values
      * Correct p-values with FDR (or an alternate method) 

    * For pathway analysis: [[https://david.ncifcrf.gov/|DAVID]] usually works fine.

    * For visualization:
      * Draw a heatmap (Cluster3.0 -> Java TreeView) using the normalized Top 3 Precursor Intensities. 
      * Draw scatterplot using the normalized Top 3 Precursor Intensities, highlighting the differentially expressed proteins (from the Total Spectrum).  

==== Performing differential protein expression analysis ====

More detail about this analysis (briefly describe above): 

  * Create a tab-delimited matrix of desired metric across all samples, with one column of unique protein identifiers

  * Normalize by quantiles (or another method) across all samples, based on the assumption that total protein mass should be the same in each sample.  If this assumption is not valid, then spike-in (or another non-global) normalization method should be applied.  See our code: normalize_matrix.R (which also includes other methods).

  * Impute missing values.  We prefer the half-minimum method, which imputes any missing values of a protein with half of the minimum assayed value for that protein.  This assumes that the true level of a protein with a missing value is between 0 and the minimum assayed level for that protein.  The half-min method calculates the middle of this range with our code: impute_missing_matrix_values.R (which also includes other methods).

  * Calculate statistics for the differential expression analysis using limma, which applies moderated t-tests, one per protein.  The protein levels must first be log-transformed, but that step occurs within our code: Run_2_groups_limma_differential_expression.R (which also calculates adjusted p-values).  Choose an appropriate FDR threshold for differential expression.

  * Create volcano and MA plots for a global perspective of changing protein levels.

==== Recommendations from Northeastern (May Institute, Vitek Lab)  ====

  * Best input is peptide-level "peak intensities", which are any continuous metric, such as Scaffold's
    * Average Precursor Intensity
    * Total Precursor Intensity
    * Top Three Precursor Intensities
  * Ideal analysis pipeline is to input these values into MSstats for pre-processing, statistics, and data visualization
  * Preprocessing steps recommended by (performed by) MSstats:
    * Log2 transform
    * Median-normalize across samples and runs (ignoring any 0s)
    * Convert all 0s to NA
    * Censor low measurements
      * Get median
      * Get 99.9th (or other percentile) to identify right tail of distribution ("r")
      * Get threshold of left side ("l") of the distribution (2*median - r)
      * Censor all values less than "l"
    * Impute all missing values using a MNAR method, such as the accelerated failure model
    * Summarize all features of a protein using Tukey's median polish (TMP), but ignore proteins with only 1 peptide (or risk increased false positive rate)
  * Model each protein with a linear mixed-effects model
     * Limma does a good job too, but it doesn't handle all the experimental designed handled by MSstats
   * Use model to calculate fold changes and raw p-values
   * Correct all p-values with FDR (BH method)
   * Draw summary plots (volcano plots, MA plots) 

==== Preparing and processing an experiment with MSstats  ====

  * Export peptide-level intensity values from your favorite MS quantification software.

  * Create a peptide intensity file
      * Organize the dataset so the first three columns are Gene.symbol, Protein.Accession, and Peptide.sequence
        * If needed, convert each protein accession to a gene symbol
        * Replace any intensities shown as "-" with 0
        * If Excel is used, check that gene symbols aren't being converted to dates
      * After the first 3 columns, the remaining columns hold intensities, one column per sample 

  * To avoid losing information about peptides that could have originated from multiple proteins and/or genes, merge peptide rows representing more than 1 protein and/or gene
     * Sample command: sort -k3,3 Peptide_intensities.matrix.txt | groupBy -g 3 -c 1,2,3,4,5,6,7,8,9,10 -o distinct >| Peptide_intensities.matrix.mergedByPeptide.txt
     * Make sure that all rows of the output file are unique.

  * Create a sample description file
     * Columns are  Run, Condition, BioReplicate
     * See MSstats documentation on how to use these fields to represent technical replicates, biological replicates, and paired designs.
     * Replication is not required for subsequent protein quantification but is required for statistical analysis.
 
  * Run MSstats using peptide intensities and sample description as input files.
    * For sample code, see **/nfs/BaRC_code/R/analyze_MS_with_MSstats/analyze_MS_with_MSstats.R**

==== References  ====

  * Choi et al., 2017 [[https://pubs.acs.org/doi/abs/10.1021/acs.jproteome.6b00881|ABRF Proteome Informatics Research Group (iPRG) 2015 Study: Detection of Differentially Abundant Proteins in Label-Free Quantitative LC−MS/MS Experiments]]
  * MSstats at [[https://bioconductor.org/packages/release/bioc/html/MSstats.html|Bioconductor]] and [[https://github.com/MeenaChoi/MSstats|GitHub]]
  * Other references on these topics
    * Lazar et. al., 2016 [[https://pubs.acs.org/doi/full/10.1021/acs.jproteome.5b00981|Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies]]
    * Wei et al., 2018  [[https://www.nature.com/articles/s41598-017-19120-0|Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data]]