Differential protein expression (with mass spec)
This method is for label-free samples from our Proteomics Core Facility, which also provides quick instructions for Scaffold.
- Using the (free) Scaffold Viewer (available from Proteome Software, with a large User's Manual), open the sf3 file.
- By default, three filters prevent all mapped proteins from being displayed:
- Protein Threshold (default = 99%)
- Min # Peptides (default = 2)
- Peptide Threshold (default = 95%)
- The default settings for these filters are typically fine.
- Scaffold has multiple display and quantification options.
- From the Scaffold User's Manual:
- Spectrum Counting methods are the most reliable in answering the question, "Is anything changing between experimental conditions?".
- Precursor Ion Intensity quantification methods are very reliable in answering the question, "How much is the amount of change I am dealing with?"
- The Total Ion Count (TIC) methods can answer both questions but not very well.
- For a quick QC check:
- Look at the first few most highly expressed proteins.
- Are they within 2-fold or so of each other?
- If some samples have much lower or higher counts, their sensitivity may differ so much that samples may not be comparable.
- From the Scaffold User's Manual:
- Under Display Options, select "Total Spectrum Count"
- For Spectrum Counting, click on the "Quantitative Analysis" icon (showing a bar graph).
- Keep "Use Normalization" checked.
- For Quantitative Method, select "Total Spectra".
- Click the Apply button.
- Next to Display Option: select Quantitative Value (Normalized Total Spectra)
- Export (top menu) => Current View.
- Use this file to identify differentially expressed proteins (to be explained below).
- For Precursor Ion Intensity quantification, click on the "Quantitative Analysis" icon (showing a bar graph).
- Keep "Use Normalization" checked.
- For Quantitative Method, select "Top 3 Precursor Intensity".
- Click the Apply button.
- Next to Display Option: select Quantitative Value (Normalized Top 3 Precursor Intensity)
- Export (top menu) => Current View.
- Use this file for visualization (to be explained below).
- To identify differentially expressed proteins, use the normalized Total Spectrum Count
- Recommended statistic: t-test on log2 transformed values
- Adjust p-values for multiple testing with FDR (or an alternative method)
- For pathway analysis: DAVID usually works fine.
- For visualization:
- Draw a heatmap (Cluster3.0 -> Java TreeView) using the normalized Top 3 Precursor Intensities.
- Draw a scatterplot using the normalized Top 3 Precursor Intensities, highlighting the differentially expressed proteins (identified from the normalized Total Spectrum Count).
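The statistics step above (t-test on log2 values, then FDR correction) can be sketched in plain Python. This is a minimal illustration, not the facility's pipeline: the function names, the pseudocount of 1, and the choice of Welch's (unequal-variance) t-test are all assumptions. Turning the t statistic and degrees of freedom into a p-value needs the t distribution, which is not in the Python standard library (scipy.stats.t.sf is one option), so that step is left out here.

```python
import math
from statistics import mean, variance

def log2_values(counts, pseudocount=1.0):
    """Log2-transform counts; the pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(c + pseudocount) for c in counts]

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two groups of values.
    Each group needs at least 2 values, and the groups must not both have zero variance."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Walk from the largest p-value down, keeping a running minimum so the
    # adjusted values never increase as the raw p-values decrease.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = n - rank_from_top  # 1-based rank when sorted ascending
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up normalized spectrum counts for one protein in two conditions.
control = log2_values([10, 12, 9])
treated = log2_values([30, 28, 35])
t_stat, df = welch_t(treated, control)
print(f"log2 fold change = {mean(treated) - mean(control):.2f}, t = {t_stat:.2f}, df = {df:.1f}")
```

In practice, bh_adjust would be applied once to the full list of per-protein p-values, not protein by protein.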
Recommendations from Northeastern (May Institute, Vitek Lab)
- The best input is peptide-level "peak intensities", meaning any continuous metric, such as Scaffold's:
- Average Precursor Intensity
- Total Precursor Intensity
- Top Three Precursor Intensities
- Ideal analysis pipeline is to input these values into MSstats for pre-processing, statistics, and data visualization
- Preprocessing steps recommended (and performed) by MSstats:
- Log2 transform
- Median-normalize across samples and runs (ignoring any 0s)
- Convert all 0s to NA
- Censor low measurements
- Get the median
- Get the 99.9th percentile (or another high percentile) to mark the right tail of the distribution ("r")
- Mirror "r" about the median to get the left-side threshold ("l"): l = 2*median - r
- Censor all values less than "l"
- Impute all missing values using an MNAR (missing not at random) method, such as an accelerated failure time model
- Summarize all features of a protein using Tukey's median polish (TMP), but ignore proteins with only 1 peptide (or risk increased false positive rate)
- Model each protein with a linear mixed-effects model
- Limma does a good job too, but it doesn't handle all the experimental designs handled by MSstats
- Use model to calculate fold changes and raw p-values
- Correct all p-values with FDR (BH method)
- Draw summary plots (volcano plots, MA plots)
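The first preprocessing steps listed above (log2 transform, zeros to NA, median normalization, and the censoring threshold l = 2*median - r) can be illustrated with a short Python sketch. This is not MSstats code and makes no claim to match its implementation details: the function names, the use of None for NA, the choice of the median of run medians as the common target, and the linear-interpolation percentile are all assumptions made for illustration.

```python
import math
from statistics import median

def log2_or_none(x):
    """Log2-transform a positive intensity; 0 or missing becomes None (NA)."""
    return math.log2(x) if x else None

def median_normalize(runs):
    """Shift each run's log2 values so that all runs share the same median.
    `runs` maps a run name to a list of log2 intensities (None = NA).
    NA values are ignored when computing the medians."""
    run_medians = {r: median(v for v in vals if v is not None)
                   for r, vals in runs.items()}
    target = median(run_medians.values())
    return {r: [None if v is None else v - run_medians[r] + target for v in vals]
            for r, vals in runs.items()}

def percentile(values, pct):
    """Percentile by linear interpolation between closest ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def censor_threshold(values, pct=99.9):
    """Left-side threshold l = 2*median - r, where r is the pct-th percentile."""
    observed = [v for v in values if v is not None]
    r = percentile(observed, pct)
    return 2 * median(observed) - r

def censor(values, threshold):
    """Mark values below the threshold as None so they can be imputed later."""
    return [None if v is None or v < threshold else v for v in values]

# Made-up raw intensities for two runs; 0 means "not detected".
raw = {"run1": [4, 8, 16, 0], "run2": [8, 16, 32, 64]}
logged = {r: [log2_or_none(x) for x in vals] for r, vals in raw.items()}
normalized = median_normalize(logged)
all_values = [v for vals in normalized.values() for v in vals]
l = censor_threshold(all_values)
print({r: censor(vals, l) for r, vals in normalized.items()})
```

Imputation, Tukey's median polish, and the mixed-effects model are deliberately left out; those are the parts where using MSstats itself matters most.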
Preparing and processing an experiment with MSstats
- Export peptide-level intensity values from your favorite MS quantification software.
- Create a peptide intensity file
- Organize the dataset so the first three columns are Gene.symbol, Protein.Accession, and Peptide.sequence
- If needed, convert each protein accession to a gene symbol
- Replace any intensities shown as "-" with 0
- If Excel is used, check that gene symbols aren't being converted to dates
- After the first 3 columns, the remaining columns hold intensities, one column per sample
- To avoid losing information about peptides that could have originated from multiple proteins and/or genes, merge peptide rows representing more than 1 protein and/or gene
- Sample command: sort -k3,3 Peptide_intensities.matrix.txt | groupBy -g 3 -c 1,2,3,4,5,6,7,8,9,10 -o distinct >| Peptide_intensities.matrix.mergedByPeptide.txt
- Make sure that all rows of the output file are unique.
- Create a sample description file
- Columns are Run, Condition, BioReplicate
- See MSstats documentation on how to use these fields to represent technical replicates, biological replicates, and paired designs.
- Replication is not required for subsequent protein quantification but is required for statistical analysis.
- Run MSstats using peptide intensities and sample description as input files.
- For sample code, see /nfs/BaRC_code/R/analyze_MS_with_MSstats/analyze_MS_with_MSstats.R
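As a rough Python counterpart to the file-preparation steps above, the sketch below replaces "-" with 0, merges rows that share a peptide sequence, and writes a sample description table. It is an illustration only and is not taken from analyze_MS_with_MSstats.R. The function names, the comma used to join merged gene symbols and accessions, the tab-delimited output, and all gene, accession, peptide, and sample names are made up. Unlike the groupBy command, which lists distinct values for every column, this sketch keeps the intensities from the first row seen for each peptide, on the assumption that duplicate peptide rows carry identical intensities.

```python
import csv
import io

def clean_intensity(value):
    """Convert an intensity string to a float, replacing the '-' placeholder with 0."""
    return 0.0 if value.strip() == "-" else float(value)

def merge_by_peptide(rows):
    """Merge rows sharing a peptide sequence (third column).
    Each row is [gene, accession, peptide, intensity, ...]. Distinct gene symbols
    and accessions are joined with commas; intensities come from the first row."""
    merged = {}
    for gene, accession, peptide, *intensities in rows:
        entry = merged.setdefault(
            peptide, {"genes": [], "accessions": [], "intensities": intensities})
        if gene not in entry["genes"]:
            entry["genes"].append(gene)
        if accession not in entry["accessions"]:
            entry["accessions"].append(accession)
    return [[",".join(merged[p]["genes"]), ",".join(merged[p]["accessions"]), p,
             *merged[p]["intensities"]]
            for p in sorted(merged)]

def write_sample_description(samples, handle):
    """Write (Run, Condition, BioReplicate) tuples as a tab-delimited table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Run", "Condition", "BioReplicate"])
    writer.writerows(samples)

# Hypothetical input: one peptide maps to two genes, and one intensity is "-".
text = (
    "Gene.symbol\tProtein.Accession\tPeptide.sequence\tS1\tS2\n"
    "GENE1\tACC1\tPEPTIDEA\t100\t-\n"
    "GENE2\tACC2\tPEPTIDEA\t100\t-\n"
    "GENE3\tACC3\tPEPTIDEB\t50\t60\n"
)
reader = csv.reader(io.StringIO(text), delimiter="\t")
header = next(reader)
rows = [[g, a, p, *map(clean_intensity, vals)] for g, a, p, *vals in reader]
for row in merge_by_peptide(rows):
    print("\t".join(map(str, row)))

out = io.StringIO()
write_sample_description([("S1", "control", 1), ("S2", "treated", 2)], out)
print(out.getvalue(), end="")
```

Because the merged output is keyed on peptide sequence, every row is unique by construction, which covers the uniqueness check in the steps above.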
References
- Choi et al., 2017. ABRF Proteome Informatics Research Group (iPRG) 2015 Study: Detection of Differentially Abundant Proteins in Label-Free Quantitative LC-MS/MS Experiments
- MSstats at Bioconductor and GitHub
- Other references on these topics