SOP/MassSpec – BaRC Wiki

This method is for label-free samples from our Proteomics Core Facility, which also provides brief instructions for Scaffold.

Using the (free) Scaffold Viewer (available from Proteome Software, with a large User's Manual), open the sf3 file.

By default, three filters prevent all mapped proteins from being displayed:
- Protein Threshold (default = 99%)
- Min # Peptides (default = 2)
- Peptide Threshold (default = 95%)
The default settings for these filters are typically appropriate.

To identify differentially expressed proteins, use the normalized Total Spectrum Count:
- Normalize across samples (typically using quantile normalization)
- Impute missing values
- Recommended statistic: t-test or moderated t-test (such as is implemented in 'limma') on log2 transformed values
- Correct p-values with FDR (or an alternate method)

For visualization:
- Draw a heatmap (Cluster3.0 -> Java TreeView) using the normalized Top 3 Precursor Intensities.
- Draw a scatterplot using the normalized Top 3 Precursor Intensities, highlighting the differentially expressed proteins (identified from the Total Spectrum Count analysis).

More detail about this analysis (briefly described above):

Create a tab-delimited matrix of the desired metric across all samples, with one column of unique protein identifiers.

Normalize by quantiles (or another method) across all samples, based on the assumption that total protein mass should be the same in each sample. If this assumption is not valid, then a spike-in (or another non-global) normalization method should be applied. See our code: normalize_matrix.R (which also includes other methods).
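As an illustration of what quantile normalization does, here is a minimal Python sketch. It is not the contents of normalize_matrix.R; the function name, the list-of-lists layout (rows are proteins, columns are samples), and the simple tie handling are assumptions made for the example, and it assumes there are no missing values.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a rows-by-columns list of lists."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # The mean of the k-th smallest value across samples is the reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(sc[k] for sc in sorted_cols) / n_cols for k in range(n_rows)]
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Rank each value within its sample; ties are broken by row order in this sketch.
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[i][j] = reference[rank]
    return normalized


# Hypothetical matrix: 4 proteins (rows) x 3 samples (columns).
counts = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(counts)
```

After normalization every sample has the same set of values, which is why the method relies on the equal-total-protein assumption stated above.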

Impute missing values. We prefer the half-minimum method, which replaces each missing value of a protein with half of the minimum assayed value for that protein. This assumes that the true level of a protein with a missing value lies between 0 and the minimum assayed level for that protein; the half-minimum method takes the midpoint of this range. See our code: impute_missing_matrix_values.R (which also includes other methods).
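The half-minimum rule can be sketched in a few lines of Python. This is an illustration only, not the contents of impute_missing_matrix_values.R; it assumes missing values are encoded as None and that each row holds one protein.

```python
def impute_half_min(matrix):
    """Replace None in each row (protein) with half of that row's minimum observed value."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            # No measurement at all: there is nothing to base the imputation on.
            imputed.append(list(row))
            continue
        fill = min(observed) / 2.0
        imputed.append([fill if v is None else v for v in row])
    return imputed


# Hypothetical protein measured in two of three samples.
example = impute_half_min([[10.0, None, 4.0]])
```

Here the missing value becomes 2.0, half of the smallest observed value (4.0) for that protein.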

Calculate statistics for the differential expression analysis using limma, which applies moderated t-tests, one per protein. The protein levels must first be log-transformed, but that step occurs within our code: Run_2_groups_limma_differential_expression.R (which also calculates adjusted p-values). Choose an appropriate FDR threshold for differential expression.
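To make the p-value adjustment concrete, here is a minimal Python sketch of the Benjamini-Hochberg (BH) FDR procedure. It assumes BH is the correction in use (the BH method is named in the MSstats steps on this page) and is not taken from Run_2_groups_limma_differential_expression.R; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per protein.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
# Proteins passing a chosen FDR threshold (0.05 is only an example).
significant = [i for i, p in enumerate(adj_p) if p < 0.05]
```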

Create volcano and MA plots for a global perspective of changing protein levels.
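The coordinates behind both plots are simple to compute once a p-value is available for each protein. The sketch below is a hedged illustration in Python (the function and key names are hypothetical): a volcano plot shows log2 fold change against -log10(p-value), and an MA plot shows log2 fold change (M) against the mean log2 level (A).

```python
import math


def plot_coordinates(group_a, group_b, p_value):
    """Return volcano/MA coordinates for one protein from two groups of intensities."""
    mean_a = sum(math.log2(v) for v in group_a) / len(group_a)
    mean_b = sum(math.log2(v) for v in group_b) / len(group_b)
    return {
        "log2_fold_change": mean_b - mean_a,   # x-axis of volcano, y-axis (M) of MA
        "mean_log2": (mean_a + mean_b) / 2.0,  # x-axis (A) of MA
        "neg_log10_p": -math.log10(p_value),   # y-axis of volcano
    }


# Hypothetical protein: 4-fold higher in group B, p = 0.01.
coords = plot_coordinates([4.0, 4.0], [16.0, 16.0], 0.01)
```

The intensities must be positive (already imputed) for the log transform to be defined.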

The best input is peptide-level "peak intensities", which can be any continuous metric, such as Scaffold's:
- Average Precursor Intensity
- Total Precursor Intensity
- Top Three Precursor Intensities
The ideal analysis pipeline is to input these values into MSstats for pre-processing, statistics, and data visualization.
Preprocessing steps recommended by (performed by) MSstats:
- Log2 transform
- Median-normalize across samples and runs (ignoring any 0s)
- Convert all 0s to NA
- Censor low measurements
  - Get median
  - Get the 99.9th (or another) percentile to identify the right tail of the distribution ("r")
  - Get threshold of left side ("l") of the distribution (2*median - r)
  - Censor all values less than "l"
- Impute all missing values using an MNAR (missing not at random) method, such as the accelerated failure time (AFT) model
- Summarize all features of a protein using Tukey's median polish (TMP), but ignore proteins with only 1 peptide (or risk increased false positive rate)
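The log2, zero-to-NA, and censoring steps listed above can be sketched as follows. This is an illustration of the arithmetic only, not MSstats code: the function names are hypothetical, missing or censored values are represented as None, the percentile uses a simple nearest-rank rule, and the median normalization step is omitted for brevity.

```python
import math
import statistics


def to_log2(intensities):
    """Log2-transform raw intensities; zeros and None become None (missing)."""
    return [math.log2(v) if v else None for v in intensities]


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def censor_low(log2_values, pct=99.9):
    """Mark values below l = 2*median - r as censored (None); return (values, l)."""
    observed = sorted(v for v in log2_values if v is not None)
    median = statistics.median(observed)
    r = percentile(observed, pct)   # right tail of the distribution
    l = 2 * median - r              # mirror image of r on the left side
    censored = [None if (v is not None and v < l) else v for v in log2_values]
    return censored, l


# Hypothetical raw intensities, including one zero.
raw = [2.0**10, 2.0**20, 2.0**20, 2.0**20, 2.0**22, 0.0]
censored, threshold = censor_low(to_log2(raw))
```

In this example the median is 20 and r is 22, so the threshold is 18 and the value at log2 = 10 is censored.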
Model each protein with a linear mixed-effects model
- Limma does a good job too, but it doesn't handle all the experimental designs handled by MSstats
Use model to calculate fold changes and raw p-values
Correct all p-values with FDR (BH method)
Draw summary plots (volcano plots, MA plots)
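For readers unfamiliar with the Tukey's median polish (TMP) summarization listed among the preprocessing steps, here is a minimal Python sketch of the general algorithm. It is not the MSstats implementation: the function name, the iteration limit, and the assumption of a complete table (rows are peptide features of one protein, columns are runs, values are log2 intensities with no missing entries) are choices made for the example.

```python
import statistics


def median_polish(table, max_iter=10, tol=1e-6):
    """Tukey's median polish; returns (overall, row_effects, col_effects)."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row (feature) effects.
        row_med = [statistics.median(r) for r in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row_eff[i] += row_med[i]
        shift = statistics.median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians out of the residuals into the column (run) effects.
        col_med = [statistics.median(resid[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
            col_eff[j] += col_med[j]
        shift = statistics.median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff


# Hypothetical protein with 3 peptide features (rows) measured in 3 runs (columns).
features = [
    [19.0, 21.0, 23.0],
    [20.0, 22.0, 24.0],
    [21.0, 23.0, 25.0],
]
overall, feature_effects, run_effects = median_polish(features)
# One summarized protein-level value per run.
protein_per_run = [overall + effect for effect in run_effects]
```

Because medians are used instead of means, a single outlying peptide has little influence on the run-level summary.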

Export peptide-level intensity values from your favorite MS quantification software.

To avoid losing information about peptides that could have originated from multiple proteins and/or genes, merge peptide rows representing more than 1 protein and/or gene
- Sample command: sort -k3,3 Peptide_intensities.matrix.txt | groupBy -g 3 -c 1,2,3,4,5,6,7,8,9,10 -o distinct >| Peptide_intensities.matrix.mergedByPeptide.txt
- Make sure that all rows of the output file are unique.
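The effect of the merge can be illustrated with a small Python sketch that groups rows on the third column and joins the distinct values of every column with commas. This is an approximation of the sample command, not a replacement for it: the column layout (protein, gene, peptide, intensity) is an assumption for the example, and the order of the joined values may differ from what groupBy produces.

```python
def merge_by_peptide(rows, key_col=2):
    """Merge rows sharing the same key column, keeping distinct values per column."""
    grouped = {}
    for row in rows:
        cols = grouped.setdefault(row[key_col], [[] for _ in row])
        for j, value in enumerate(row):
            if value not in cols[j]:
                cols[j].append(value)
    return [[",".join(values) for values in cols] for cols in grouped.values()]


# Hypothetical rows: the same peptide is reported for two proteins.
rows = [
    ["P1", "G1", "PEPA", "10"],
    ["P2", "G2", "PEPA", "10"],
    ["P3", "G3", "PEPB", "7"],
]
merged = merge_by_peptide(rows)
# Uniqueness check: each peptide should now appear in exactly one row.
peptides = [row[2] for row in merged]
all_unique = len(peptides) == len(set(peptides))
```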

Create a sample description file
- Columns are Run, Condition, BioReplicate
- See MSstats documentation on how to use these fields to represent technical replicates, biological replicates, and paired designs.
- Replication is not required for subsequent protein quantification but is required for statistical analysis.
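A sample description file with these three columns could be generated and checked as in the sketch below. The run names, condition labels, and the tab delimiter are hypothetical; consult the MSstats documentation for the exact format it expects. The second function flags conditions with fewer than two biological replicates, which would be quantified but could not be tested statistically.

```python
import csv
import io
from collections import defaultdict

FIELDS = ["Run", "Condition", "BioReplicate"]


def write_sample_description(samples, handle):
    """Write one row per run with the Run, Condition, and BioReplicate columns."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(samples)


def conditions_lacking_replication(samples):
    """Return conditions that have fewer than two distinct biological replicates."""
    replicates = defaultdict(set)
    for sample in samples:
        replicates[sample["Condition"]].add(sample["BioReplicate"])
    return sorted(c for c, reps in replicates.items() if len(reps) < 2)


# Hypothetical design: two control replicates, one treated replicate.
samples = [
    {"Run": "run1", "Condition": "control", "BioReplicate": "1"},
    {"Run": "run2", "Condition": "control", "BioReplicate": "2"},
    {"Run": "run3", "Condition": "treated", "BioReplicate": "3"},
]
buffer = io.StringIO()
write_sample_description(samples, buffer)
unreplicated = conditions_lacking_replication(samples)
```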

Run MSstats using peptide intensities and sample description as input files.
- For sample code, see /nfs/BaRC_code/R/analyze_MS_with_MSstats/analyze_MS_with_MSstats.R
