==== Differential protein expression (with mass spec) ====

This method is for label-free samples from our Proteomics Core Facility, which has some [[http://massspec.wi.mit.edu/documents/Scaffoldhowto060513.pdf|Scaffold quick instructions]].

  * Using the (free) Scaffold Viewer (available from [[http://www.proteomesoftware.com/products/free-viewer|Proteome Software]], with a large [[http://www.proteomesoftware.com/pdf/scaffold_users_guide.pdf|User's Manual]]), open the sf3 file.
  * By default, three filters prevent all mapped proteins from being displayed:
    * Protein Threshold (default = 99%)
    * Min # Peptides (default = 2)
    * Peptide Threshold (default = 95%)
    * The default settings for these filters are typically good.
  * Scaffold has multiple display and quantification options.
    * From the Scaffold User's Manual:
      * Spectrum Counting methods are the most reliable in answering the question, "Is anything changing between experimental conditions?"
      * Precursor Ion Intensity quantification methods are very reliable in answering the question, "How much is the amount of change I am dealing with?"
      * The Total Ion Count (TIC) methods can answer both questions but not very well.
  * For a quick QC check:
    * Look at the first few most highly expressed proteins.
    * Are their counts within 2-fold or so of each other across samples?
    * If some samples have much lower or higher counts, their sensitivity may differ so much that the samples may not be comparable.
    * Under Display Options, select "Total Spectrum Count".
  * For Spectrum Counting, click on the "Quantitative Analysis" icon (showing a bar graph).
    * Keep "Use Normalization" checked.
    * For Quantitative Method, select "Total Spectra".
    * Click the Apply button.
    * Next to Display Option: select Quantitative Value (Normalized Total Spectra)
    * Export (top menu) => Current View.
    * Use this file to identify differentially expressed proteins (explained below).
  * For Precursor Ion Intensity quantification, click on the "Quantitative Analysis" icon (showing a bar graph).
    * Keep "Use Normalization" checked.
    * For Quantitative Method, select "Top 3 Precursor Intensity".
    * Click the Apply button.
    * Next to Display Option: select Quantitative Value (Normalized Top 3 Precursor Intensity)
    * Export (top menu) => Current View.
    * Use this file for visualization (explained below).
  * To identify differentially expressed proteins, use the normalized Total Spectrum Count.
    * Recommended statistic: t-test on log2-transformed values
    * Correct p-values with FDR (or an alternate method)
  * For pathway analysis: [[https://david.ncifcrf.gov/|DAVID]] usually works fine.
  * For visualization:
    * Draw a heatmap (Cluster3.0 -> Java TreeView) using the normalized Top 3 Precursor Intensities.
    * Draw a scatterplot using the normalized Top 3 Precursor Intensities, highlighting the differentially expressed proteins (from the Total Spectrum Count analysis).

==== Recommendations from Northeastern (May Institute, Vitek Lab) ====

  * The best input is peptide-level "peak intensities", which can be any continuous metric, such as Scaffold's
    * Average Precursor Intensity
    * Total Precursor Intensity
    * Top Three Precursor Intensities
  * The ideal analysis pipeline is to input these values into MSstats for pre-processing, statistics, and data visualization.
  * Preprocessing steps recommended by (and performed by) MSstats:
    * Log2 transform
    * Median-normalize across samples and runs (ignoring any 0s)
    * Convert all 0s to NA
    * Censor low measurements
      * Get the median
      * Get the 99.9th (or other) percentile to identify the right tail of the distribution ("r")
      * Get the threshold for the left side ("l") of the distribution (2*median - r)
      * Censor all values less than "l"
    * Impute all missing values using a missing-not-at-random (MNAR) method, such as the accelerated failure time model
    * Summarize all features of a protein using Tukey's median polish (TMP), but ignore proteins with only 1 peptide (or risk an increased false positive rate)
    * Model each protein with a linear mixed-effects model
      * Limma does a good job too, but it doesn't handle all the experimental designs handled by MSstats
    * Use the model to calculate fold changes and raw p-values
    * Correct all p-values with FDR (BH method)
    * Draw summary plots (volcano plots, MA plots)

==== Preparing and processing an experiment with MSstats ====

  * Export peptide-level intensity values from your favorite MS quantification software.
  * Create a peptide intensity file
    * Organize the dataset so the first three columns are Gene.symbol, Protein.Accession, and Peptide.sequence
    * If needed, convert each protein accession to a gene symbol
    * Replace any intensities shown as "-" with 0
    * If Excel is used, check that gene symbols aren't being converted to dates
    * After the first 3 columns, the remaining columns hold intensities, one column per sample
    * To avoid losing information about peptides that could have originated from multiple proteins and/or genes, merge peptide rows representing more than 1 protein and/or gene
      * Sample command: sort -k3,3 Peptide_intensities.matrix.txt | groupBy -g 3 -c 1,2,3,4,5,6,7,8,9,10 -o distinct >| Peptide_intensities.matrix.mergedByPeptide.txt
      * Make sure that all rows of the output file are unique.
  * Create a sample description file
    * Columns are Run, Condition, BioReplicate
    * See the MSstats documentation on how to use these fields to represent technical replicates, biological replicates, and paired designs.
    * Replication is not required for subsequent protein quantification but is required for statistical analysis.
  * Run MSstats using the peptide intensity and sample description files as input.
    * For sample code, see **/nfs/BaRC_code/R/analyze_MS_with_MSstats/analyze_MS_with_MSstats.R**

==== References ====

  * Choi et al., 2017 [[https://pubs.acs.org/doi/abs/10.1021/acs.jproteome.6b00881|ABRF Proteome Informatics Research Group (iPRG) 2015 Study: Detection of Differentially Abundant Proteins in Label-Free Quantitative LC−MS/MS Experiments]]
  * MSstats at [[https://bioconductor.org/packages/release/bioc/html/MSstats.html|Bioconductor]] and [[https://github.com/MeenaChoi/MSstats|GitHub]]
  * Other references on these topics
    * Lazar et al., 2016 [[https://pubs.acs.org/doi/full/10.1021/acs.jproteome.5b00981|Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies]]
    * Wei et al., 2018 [[https://www.nature.com/articles/s41598-017-19120-0|Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data]]
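
==== Example: censoring low intensities (sketch) ====

The log2, median-normalization, and censoring steps listed under the Northeastern recommendations can be sketched in a few lines of Python. This is only an illustration and not the code MSstats runs: the function names and toy intensities are hypothetical, zeros are treated as missing before the log2 transform (log2 of 0 is undefined), and median normalization is assumed to mean shifting each sample so that all sample medians match.

```python
import math
from statistics import median


def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def log2_or_missing(intensity):
    """log2 of a positive intensity; zeros become None (missing)."""
    return math.log2(intensity) if intensity > 0 else None


def median_normalize(columns):
    """Shift each sample (one list per sample) so its median equals the
    median of all sample medians. Missing values are ignored.
    Assumption: this is one common reading of 'median-normalize'."""
    medians = [median(v for v in col if v is not None) for col in columns]
    target = median(medians)
    return [[None if v is None else v - m + target for v in col]
            for col, m in zip(columns, medians)]


def censor_threshold(columns, right_pct=99.9):
    """Left-side threshold l = 2*median - r, where r is the right-tail
    percentile of all observed (non-missing) values."""
    observed = [v for col in columns for v in col if v is not None]
    r = percentile(observed, right_pct)
    return 2 * median(observed) - r


def censor(columns, threshold):
    """Replace every value below the threshold with None (censored)."""
    return [[None if v is None or v < threshold else v for v in col]
            for col in columns]


# Toy peptide intensities for two samples (hypothetical numbers).
raw = [[1024, 2048, 4096, 0, 8],
       [2048, 4096, 8192, 512, 0]]
logged = [[log2_or_missing(x) for x in sample] for sample in raw]
normalized = median_normalize(logged)
threshold = censor_threshold(normalized)
censored = censor(normalized, threshold)
print("threshold l =", threshold)
print("censored values:", censored)
```

In a real analysis the threshold would be computed from thousands of peptide features, so the 99.9th percentile sits well inside the right tail; the censored values would then be passed on to the imputation step.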