wiki:SOP/MassSpec

Differential protein expression (with mass spec)

This method is for label-free samples from our Proteomics Core Facility, which has some Scaffold quick instructions.

  • By default, three filters prevent all mapped proteins from being displayed:
    • Protein Threshold (default = 99%)
    • Min # Peptides (default = 2)
    • Peptide Threshold (default = 95%)
  • These filters are typical good with the default settings.
  • Scaffold has multiple display and quantification options.
    • From the Scaffold User's Manual:
      • Spectrum Counting methods are the most reliable in answering the question, "Is anything changing between experimental conditions?".
      • Precursor Ion Intensity quantification methods are very reliable in answering the question, "How much is the amount of change I am dealing with?"
      • The Total Ion Count (TIC) methods can answer both questions but not very well.
    • For a quick QC check:
      • Look at the first few most highly expressed proteins.
      • Are they within 2-fold or so of each other?
      • If some samples have much lower or higher counts, their sensitivity may differ so much that samples may not be comparable.
  • Under Display Options, select "Total Spectrum Count"
  • For Spectrum Counting, click on the "Quantitative Analysis" icon (showing a bar graph).
    • Keep "Use Normalization" checked.
    • For Quantitative Method, select "Total Spectra".
    • Click the Apply button.
    • Next to Display Option: select Quantitative Value (Normalized Total Spectra)
    • Export (top menu) => Current View.
    • Use this file to identify differentially expressed proteins (to be explained below).
  • For Precursor Ion Intensity quantification, click on the "Quantitative Analysis" icon (showing a bar graph).
    • Keep "Use Normalization" checked.
    • For Quantitative Method, select "Top 3 Precursor Intensity".
    • Click the Apply button.
    • Next to Display Option: select Quantitative Value (Normalized Top 3 Precursor Intensity)
    • Export (top menu) => Current View.
    • Use this file for visualization (to be explained below).

  • To identify differentially expressed proteins, use the normalized Total Spectrum Count
    • Recommended statistic: t-test on log2 transformed values
    • Correct p-values with FDR (or an alternate method)
  • For pathway analysis: DAVID usually works fine.
  • For visualization:
    • Draw a heatmap (Cluster3.0 -> Java TreeView) using the normalized Top 3 Precursor Intensities.
    • Draw scatterplot using the normalized Top 3 Precursor Intensities, highlighting the differentially expressed proteins (from the Total Spectrum).

Recommendations from Northeastern (May Institute, Vitek Lab)

  • Best input is peptide-level "peak intensities", which are any continuous metric, such as Scaffold's
    • Average Precursor Intensity
    • Total Precursor Intensity
    • Top Three Precursor Intensities
  • Ideal analysis pipeline is to input these values into MSstats for pre-processing, statistics, and data visualization
  • Preprocessing steps recommended by (performed by) MSstats:
    • Log2 transform
    • Median-normalize across samples and runs (ignoring any 0s)
    • Convert all 0s to NA
    • Censor low measurements
      • Get median
      • Get 99.9th (or other percentile) to identify right tail of distribution ("r")
      • Get threshold of left side ("l") of the distribution (2*median - r)
      • Censor all values less than "l"
    • Impute all missing values using a MNAR method, such as the accelerated failure model
    • Summarize all features of a protein using Tukey's median polish (TMP), but ignore proteins with only 1 peptide (or risk increased false positive rate)
  • Model each protein with a linear mixed-effects model
    • Limma does a good job too, but it doesn't handle all the experimental designed handled by MSstats
  • Use model to calculate fold changes and raw p-values
  • Correct all p-values with FDR (BH method)
  • Draw summary plots (volcano plots, MA plots)

Preparing and processing an experiment with MSstats

  • Export peptide-level intensity values from your favorite MS quantification software.
  • Create a peptide intensity file
    • Organize the dataset so the first three columns are Gene.symbol, Protein.Accession, and Peptide.sequence
      • If needed, convert each protein accession to a gene symbol
      • Replace any intensities shown as "-" with 0
      • If Excel is used, check that gene symbols aren't being converted to dates
    • After the first 3 columns, the remaining columns hold intensities, one column per sample
  • To avoid losing information about peptides that could have originated from multiple proteins and/or genes, merge peptide rows representing more than 1 protein and/or gene
    • Sample command: sort -k3,3 Peptide_intensities.matrix.txt | groupBy -g 3 -c 1,2,3,4,5,6,7,8,9,10 -o distinct >| Peptide_intensities.matrix.mergedByPeptide.txt
    • Make sure that all rows of the output file are unique.
  • Create a sample description file
    • Columns are Run, Condition, BioReplicate
    • See MSstats documentation on how to use these fields to represent technical replicates, biological replicates, and paired designs.
    • Replication is not required for subsequent protein quantification but is required for statistical analysis.

  • Run MSstats using peptide intensities and sample description as input files.
    • For sample code, see /nfs/BaRC_code/R/analyze_MS_with_MSstats/analyze_MS_with_MSstats.R

References