| 1 | | Label-free mass spectrometry data analysis with data from Scaffold. |
| 2 | | 1. Quantitative Methods. |
| 3 | | 2. Normalization. |
| 4 | | 3. Differentially expressed proteins. |
| | 1 | ==== Differential protein expression (with mass spec) ==== |
| | 2 | |
| | 3 | This method is for label-free samples from our Proteomics Core Facility, which has some [[http://massspec.wi.mit.edu/documents/Scaffoldhowto060513.pdf|Scaffold quick instructions]]. |
| | 4 | |
| | 5 | * Using the (free) Scaffold Viewer (available from [[http://www.proteomesoftware.com/products/free-viewer|Proteome Software]], with a large [[http://www.proteomesoftware.com/pdf/scaffold_users_guide.pdf|User's Manual]]), open the sf3 file. |
| | 6 | |
| | 7 | * By default, three filters prevent all mapped proteins from being displayed: |
| | 8 | * Protein Threshold (default = 99%) |
| | 9 | * Min # Peptides (default = 2) |
| | 10 | * Peptide Threshold (default = 95%) |
| | 11 | * These filters typically work well at their default settings. |
| | 12 | |
| | 13 | * Scaffold has multiple display and quantification options. |
| | 14 | * From the Scaffold User's Manual: |
| | 15 | * Spectrum Counting methods are the most reliable in answering the question, "Is anything changing between experimental conditions?". |
| | 16 | * Precursor Ion Intensity quantification methods are very reliable in answering the question, "How much is the amount of change I am dealing with?" |
| | 17 | * The Total Ion Count (TIC) methods can answer both questions but not very well. |
| | 18 | * For a quick QC check: |
| | 19 | * Look at the first few most highly expressed proteins. |
| | 20 | * Are their counts within about 2-fold of each other across samples? |
| | 21 | * If some samples have much lower or higher counts, their sensitivity may differ so much that samples may not be comparable. |
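The quick QC check above can be sketched in a few lines of Python. This is only an illustration: the protein names, counts, and the helper names `fold_spread` and `flag_uneven_proteins` are hypothetical, and the 2-fold cutoff is simply the rule of thumb mentioned above.

```python
def fold_spread(counts_by_sample):
    """Ratio of the highest to the lowest count for one protein across samples."""
    lowest, highest = min(counts_by_sample), max(counts_by_sample)
    # A zero count makes the ratio undefined; treat it as an infinite spread.
    return float("inf") if lowest == 0 else highest / lowest


def flag_uneven_proteins(top_proteins, max_fold=2.0):
    """Return names of proteins whose counts differ by more than max_fold."""
    return [name for name, counts in top_proteins.items()
            if fold_spread(counts) > max_fold]


# Hypothetical spectrum counts for two highly expressed proteins in three samples.
top_proteins = {
    "ProteinA": [120, 150, 135],
    "ProteinB": [300, 90, 280],
}
uneven = flag_uneven_proteins(top_proteins)
print(uneven)
```

If the most abundant proteins are flagged, the second sample here may have been measured with much lower sensitivity than the others.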
| | 22 | |
| | 23 | * Under Display Options, select "Total Spectrum Count" |
| | 24 | |
| | 25 | * For Spectrum Counting, click on the "Quantitative Analysis" icon (showing a bar graph). |
| | 26 | * Keep "Use Normalization" checked. |
| | 27 | * For Quantitative Method, select "Total Spectra". |
| | 28 | * Click the Apply button. |
| | 29 | * Next to Display Option: select Quantitative Value (Normalized Total Spectra) |
| | 30 | * Export (top menu) => Current View. |
| | 31 | * Use this file to identify differentially expressed proteins (to be explained below). |
| | 32 | |
| | 33 | * For Precursor Ion Intensity quantification, click on the "Quantitative Analysis" icon (showing a bar graph). |
| | 34 | * Keep "Use Normalization" checked. |
| | 35 | * For Quantitative Method, select "Top 3 Precursor Intensity". |
| | 36 | * Click the Apply button. |
| | 37 | * Next to Display Option: select Quantitative Value (Normalized Top 3 Precursor Intensity) |
| | 38 | * Export (top menu) => Current View. |
| | 39 | * Use this file for visualization (to be explained below). |
| | 40 | |
| | 41 | * To identify differentially expressed proteins, use the normalized Total Spectrum Count |
| | 42 | * Recommended statistic: t-test on log2 transformed values |
| | 43 | * Correct p-values for multiple testing with FDR (or an alternative method) |
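As a rough sketch of the test-and-correct steps above, the following standard-library Python computes a t-test on log2-transformed counts and Benjamini-Hochberg adjusted p-values. Several choices here are assumptions, not part of the procedure above: the pseudocount of 1 (to avoid log2 of 0), the Welch (unequal variance) form of the t-test, and the numerical integration used to get a p-value without extra packages. In practice a library routine such as `scipy.stats.ttest_ind` would normally be used instead.

```python
import math
from statistics import mean, variance


def log2_counts(counts, pseudocount=1.0):
    """log2-transform counts; the pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(c + pseudocount) for c in counts]


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value from Simpson integration of the t density on [0, |t|]."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def welch_t_test(a, b):
    """Welch's t-test; fails if both groups have zero variance."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, two_sided_p(t, df)


def bh_adjust(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical normalized spectrum counts: 3 control vs. 3 treated samples.
t_stat, p_value = welch_t_test(log2_counts([10, 12, 11]), log2_counts([40, 55, 47]))
print(t_stat, p_value, bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Each protein gets one raw p-value from the t-test; the full list of raw p-values is then passed to `bh_adjust` once.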
| | 44 | |
| | 45 | * For pathway analysis: [[https://david.ncifcrf.gov/|DAVID]] usually works fine. |
| | 46 | |
| | 47 | * For visualization: |
| | 48 | * Draw a heatmap (Cluster3.0 -> Java TreeView) using the normalized Top 3 Precursor Intensities. |
| | 49 | * Draw a scatterplot using the normalized Top 3 Precursor Intensities, highlighting the differentially expressed proteins (identified from the Total Spectrum Count analysis). |
| | 50 | |
| | 51 | |
| | 52 | ==== Recommendations from Northeastern (May Institute, Vitek Lab) ==== |
| | 53 | |
| | 54 | * Best input is peptide-level "peak intensities", which can be any continuous metric, such as Scaffold's: |
| | 55 | * Average Precursor Intensity |
| | 56 | * Total Precursor Intensity |
| | 57 | * Top Three Precursor Intensities |
| | 58 | * Ideal analysis pipeline is to input these values into MSstats for pre-processing, statistics, and data visualization |
| | 59 | * Preprocessing steps recommended by (performed by) MSstats: |
| | 60 | * Log2 transform |
| | 61 | * Median-normalize across samples and runs (ignoring any 0s) |
| | 62 | * Convert all 0s to NA |
| | 63 | * Censor low measurements |
| | 64 | * Get median |
| | 65 | * Get 99.9th (or other percentile) to identify right tail of distribution ("r") |
| | 66 | * Get threshold of left side ("l") of the distribution (2*median - r) |
| | 67 | * Censor all values less than "l" |
| | 68 | * Impute all missing values using an MNAR (missing not at random) method, such as the accelerated failure time (AFT) model |
| | 69 | * Summarize all features of a protein using Tukey's median polish (TMP), but ignore proteins with only 1 peptide (or risk increased false positive rate) |
| | 70 | * Model each protein with a linear mixed-effects model |
| | 71 | * Limma does a good job too, but it doesn't handle all the experimental designs handled by MSstats |
| | 72 | * Use model to calculate fold changes and raw p-values |
| | 73 | * Correct all p-values with FDR (BH method) |
| | 74 | * Draw summary plots (volcano plots, MA plots) |
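To make the first preprocessing steps in the list above concrete (log2 transform, median normalization, 0s to NA, and the censoring threshold), here is a minimal standard-library sketch. It is not the MSstats implementation. The function names are hypothetical, and two details are assumptions: run medians are equalized to the median of the run medians, and the threshold is computed over all observed values at once.

```python
import math
from statistics import median


def log2_or_none(x):
    """log2 of a positive intensity; 0 (or missing) becomes None, i.e. NA."""
    return math.log2(x) if x and x > 0 else None


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0..100) of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def preprocess(matrix, upper_pct=99.9):
    """matrix: rows = peptides, columns = runs, raw intensities (0 = missing).

    Returns (censored log2 matrix with None for NA, censoring threshold l).
    """
    # Log2 transform; 0s become None here, so they are ignored in the medians.
    logged = [[log2_or_none(x) for x in row] for row in matrix]
    n_runs = len(matrix[0])
    # Median-normalize: shift each run so that all run medians coincide.
    run_medians = [median(row[j] for row in logged if row[j] is not None)
                   for j in range(n_runs)]
    target = median(run_medians)
    normalized = [[None if v is None else v - run_medians[j] + target
                   for j, v in enumerate(row)] for row in logged]
    # Censoring threshold: mirror the right tail r around the median.
    observed = sorted(v for row in normalized for v in row if v is not None)
    med = median(observed)
    r = percentile(observed, upper_pct)
    l = 2 * med - r
    censored = [[None if (v is None or v < l) else v for v in row]
                for row in normalized]
    return censored, l


# Hypothetical intensities: 4 peptides x 2 runs, with two missing values.
raw = [[4, 8], [16, 32], [0, 128], [64, 0]]
censored, threshold = preprocess(raw)
print(threshold, censored)
```

The values set to None would then go on to the imputation step, which is not sketched here.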
| | 75 | |
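The summarization step in the recommendations above uses Tukey's median polish. The sketch below shows the general technique on a small peptide-by-run table of log2 intensities. It assumes there are no missing values left (imputation has already been done), and the function name and stopping rule are illustrative choices rather than the MSstats code. The per-run protein summary is taken as the overall effect plus the column (run) effect.

```python
from statistics import median


def median_polish(table, max_iter=10, tol=1e-6):
    """Tukey's median polish on a complete table (rows = peptides, cols = runs).

    Returns (overall, row_effects, col_effects, residuals).
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        change = 0.0
        # Sweep row medians out of the residuals into the row effects.
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
            change += abs(m)
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        # Sweep column medians out of the residuals into the column effects.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
            change += abs(m)
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
        if change < tol:  # nothing left to sweep out
            break
    return overall, row_eff, col_eff, resid


# Hypothetical log2 intensities: 3 peptides of one protein measured in 3 runs.
peptides_by_run = [[10, 12, 11],
                   [11, 13, 12],
                   [13, 15, 14]]
overall, row_eff, col_eff, resid = median_polish(peptides_by_run)
run_summaries = [overall + c for c in col_eff]
print(run_summaries)
```

Because medians are used instead of means, a single outlying peptide has little influence on the run-level summaries.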
| | 76 | ==== Preparing and processing an experiment with MSstats ==== |
| | 77 | |
| | 78 | * Export peptide-level intensity values from your favorite MS quantification software. |
| | 79 | |
| | 80 | * Create a peptide intensity file |
| | 81 | * Organize the dataset so the first three columns are Gene.symbol, Protein.Accession, and Peptide.sequence |
| | 82 | * If needed, convert each protein accession to a gene symbol |
| | 83 | * Replace any intensities shown as "-" with 0 |
| | 84 | * If Excel is used, check that gene symbols aren't being converted to dates |
| | 85 | * After the first 3 columns, the remaining columns hold intensities, one column per sample |
| | 86 | |
| | 87 | * To avoid losing information about peptides that could have originated from multiple proteins and/or genes, merge peptide rows representing more than 1 protein and/or gene |
| | 88 | * Sample command: sort -k3,3 Peptide_intensities.matrix.txt | groupBy -g 3 -c 1,2,3,4,5,6,7,8,9,10 -o distinct >| Peptide_intensities.matrix.mergedByPeptide.txt |
| | 89 | * Make sure that all rows of the output file are unique. |
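For readers without groupBy available, the merge step above can be approximated in Python. This sketch is similar in spirit to the sample command but not identical: it keeps the original column order, joins distinct values with commas in order of first appearance, and assumes the header line has already been removed. The function name and example rows are hypothetical.

```python
def merge_by_peptide(rows, key_col=2):
    """Collapse rows sharing a peptide sequence (column key_col, 0-based).

    Each output cell holds the distinct values seen for that peptide,
    comma-joined in order of first appearance.
    """
    merged = {}  # dicts keep insertion order, so peptide order is preserved
    for row in rows:
        slot = merged.setdefault(row[key_col], [[] for _ in row])
        for values, cell in zip(slot, row):
            if cell not in values:
                values.append(cell)
    return [[",".join(values) for values in slot] for slot in merged.values()]


# Hypothetical rows: Gene.symbol, Protein.Accession, Peptide.sequence, 2 samples.
rows = [
    ["GENE1", "P1", "PEPA", "10", "20"],
    ["GENE2", "P2", "PEPA", "10", "20"],
    ["GENE3", "P3", "PEPB", "0", "5"],
]
merged_rows = merge_by_peptide(rows)
for row in merged_rows:
    print("\t".join(row))

# Every peptide should now appear on exactly one row.
peptides = [row[2] for row in merged_rows]
print(len(peptides) == len(set(peptides)))
```

A comma inside an intensity column of the output would mean that the same peptide was reported with different intensities, which is worth checking by hand.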
| | 90 | |
| | 91 | * Create a sample description file |
| | 92 | * Columns are Run, Condition, BioReplicate |
| | 93 | * See MSstats documentation on how to use these fields to represent technical replicates, biological replicates, and paired designs. |
| | 94 | * Replication is not required for subsequent protein quantification but is required for statistical analysis. |
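A sample description file with the three columns above might look like the text embedded in the sketch below, which also counts biological replicates per condition, since replication is needed for the statistical analysis. The run names, condition labels, and BioReplicate values are hypothetical; how BioReplicate IDs should be assigned for a given design is covered in the MSstats documentation.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample description; real files would list the actual run names.
ANNOTATION = """Run,Condition,BioReplicate
run01,control,1
run02,control,2
run03,treated,3
run04,treated,4
"""


def replicates_per_condition(text):
    """Count distinct BioReplicate values for each Condition."""
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text)):
        seen[row["Condition"]].add(row["BioReplicate"])
    return {condition: len(ids) for condition, ids in seen.items()}


def conditions_without_replication(text):
    """Conditions with fewer than two biological replicates."""
    return [c for c, n in replicates_per_condition(text).items() if n < 2]


counts = replicates_per_condition(ANNOTATION)
print(counts, conditions_without_replication(ANNOTATION))
```

Any condition returned by `conditions_without_replication` can still be quantified but cannot be tested statistically.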
| | 95 | |
| | 96 | * Run MSstats using peptide intensities and sample description as input files. |
| | 97 | * For sample code, see **/nfs/BaRC_code/R/analyze_MS_with_MSstats/analyze_MS_with_MSstats.R** |
| | 98 | |
| | 99 | ==== References ==== |
| | 100 | |
| | 101 | * Choi et al., 2017 [[https://pubs.acs.org/doi/abs/10.1021/acs.jproteome.6b00881|ABRF Proteome Informatics Research Group (iPRG) 2015 Study: Detection of Differentially Abundant Proteins in Label-Free Quantitative LC−MS/MS Experiments]] |
| | 102 | * MSstats at [[https://bioconductor.org/packages/release/bioc/html/MSstats.html|Bioconductor]] and [[https://github.com/MeenaChoi/MSstats|GitHub]] |
| | 103 | * Other references on these topics |
| | 104 | * Lazar et al., 2016 [[https://pubs.acs.org/doi/full/10.1021/acs.jproteome.5b00981|Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies]] |
| | 105 | * Wei et al., 2018 [[https://www.nature.com/articles/s41598-017-19120-0|Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data]] |