1 | | Label-free mass spectrometry data analysis with data from Scaffold. |
2 | | 1. Quantitative Methods. |
3 | | 2. Normalization. |
4 | | 3. Differentially expressed proteins. |
| 1 | ==== Differential protein expression (with mass spec) ==== |
| 2 | |
| 3 | This method is for label-free samples from our Proteomics Core Facility, which has some [[http://massspec.wi.mit.edu/documents/Scaffoldhowto060513.pdf|Scaffold quick instructions]]. |
| 4 | |
| 5 | * Using the (free) Scaffold Viewer (available from [[http://www.proteomesoftware.com/products/free-viewer|Proteome Software]], with a large [[http://www.proteomesoftware.com/pdf/scaffold_users_guide.pdf|User's Manual]]), open the sf3 file. |
| 6 | |
| 7 | * By default, three filters prevent all mapped proteins from being displayed: |
| 8 | * Protein Threshold (default = 99%) |
| 9 | * Min # Peptides (default = 2) |
| 10 | * Peptide Threshold (default = 95%) |
| 11 | * The default settings for these filters are typically appropriate. |
| 12 | |
| 13 | * Scaffold has multiple display and quantification options. |
| 14 | * From the Scaffold User's Manual: |
| 15 | * Spectrum Counting methods are the most reliable in answering the question, "Is anything changing between experimental conditions?". |
| 16 | * Precursor Ion Intensity quantification methods are very reliable in answering the question, "How much is the amount of change I am dealing with?" |
| 17 | * The Total Ion Count (TIC) methods can answer both questions but not very well. |
| 18 | * For a quick QC check: |
| 19 | * Look at the first few most highly expressed proteins. |
| 20 | * Are they within 2-fold or so of each other? |
| 21 | * If some samples have much lower or higher counts, their sensitivity may differ so much that samples may not be comparable. |
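The quick QC check above can be sketched in a few lines of Python. This is only an illustration: the protein names and counts are made up, the function name ''top_protein_fold_ranges'' is hypothetical, and the 2-fold cutoff is the rough guideline from the list above rather than a fixed rule.

```python
def top_protein_fold_ranges(counts, n_top=3):
    """Return (protein, max/min ratio across samples) for the n_top
    proteins with the highest total spectrum count.

    counts: dict mapping protein name -> list of per-sample counts.
    """
    ranked = sorted(counts, key=lambda p: sum(counts[p]), reverse=True)[:n_top]
    result = []
    for protein in ranked:
        values = counts[protein]
        lowest = min(values)
        # A zero count in any sample makes the ratio undefined; report infinity.
        ratio = max(values) / lowest if lowest > 0 else float("inf")
        result.append((protein, ratio))
    return result


# Hypothetical counts for four samples (the fourth looks under-loaded).
example_counts = {
    "ProteinA": [520, 480, 610, 150],
    "ProteinB": [300, 310, 280, 90],
    "ProteinC": [200, 220, 190, 60],
    "ProteinD": [20, 25, 18, 5],
}

for protein, ratio in top_protein_fold_ranges(example_counts):
    flag = "check samples" if ratio > 2 else "ok"
    print(f"{protein}\t{ratio:.2f}\t{flag}")
```

If the most abundant proteins all exceed the 2-fold range because of the same sample, that sample is the one to inspect before comparing conditions.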
| 22 | |
| 23 | * Under Display Options, select "Total Spectrum Count" |
| 24 | |
| 25 | * For Spectrum Counting, click on the "Quantitative Analysis" icon (showing a bar graph). |
| 26 | * Keep "Use Normalization" checked. |
| 27 | * For Quantitative Method, select "Total Spectra". |
| 28 | * Click the Apply button. |
| 29 | * Next to Display Option: select Quantitative Value (Normalized Total Spectra) |
| 30 | * Export (top menu) => Current View. |
| 31 | * Use this file to identify differentially expressed proteins (to be explained below). |
| 32 | |
| 33 | * For Precursor Ion Intensity quantification, click on the "Quantitative Analysis" icon (showing a bar graph). |
| 34 | * Keep "Use Normalization" checked. |
| 35 | * For Quantitative Method, select "Top 3 Precursor Intensity". |
| 36 | * Click the Apply button. |
| 37 | * Next to Display Option: select Quantitative Value (Normalized Top 3 Precursor Intensity) |
| 38 | * Export (top menu) => Current View. |
| 39 | * Use this file for visualization (to be explained below). |
| 40 | |
| 41 | * To identify differentially expressed proteins, use the normalized Total Spectrum Count |
| 42 | * Recommended statistic: t-test on log2 transformed values |
| 43 | * Correct p-values with FDR (or an alternate method) |
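As a minimal sketch of the FDR step, the function below applies the Benjamini-Hochberg adjustment to a list of raw p-values. It assumes the p-values have already been computed (for example, by a t-test on log2-transformed values); the function name and the example p-values are illustrative only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five proteins.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
for raw, adj in zip(raw_p, benjamini_hochberg(raw_p)):
    print(f"{raw}\t{adj:.5f}")
```

Note that the third and fourth proteins end up with the same adjusted value, because the adjustment never lets a smaller raw p-value receive a larger adjusted one.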
| 44 | |
| 45 | * For pathway analysis: [[https://david.ncifcrf.gov/|DAVID]] usually works fine. |
| 46 | |
| 47 | * For visualization: |
| 48 | * Draw a heatmap (Cluster3.0 -> Java TreeView) using the normalized Top 3 Precursor Intensities. |
| 49 | * Draw a scatterplot using the normalized Top 3 Precursor Intensities, highlighting the differentially expressed proteins (identified from the Total Spectrum Count analysis). |
| 50 | |
| 51 | |
| 52 | ==== Recommendations from Northeastern (May Institute, Vitek Lab) ==== |
| 53 | |
| 54 | * The best input is peptide-level "peak intensities", meaning any continuous metric, such as Scaffold's: |
| 55 | * Average Precursor Intensity |
| 56 | * Total Precursor Intensity |
| 57 | * Top Three Precursor Intensities |
| 58 | * Ideal analysis pipeline is to input these values into MSstats for pre-processing, statistics, and data visualization |
| 59 | * Preprocessing steps recommended by (performed by) MSstats: |
| 60 | * Log2 transform |
| 61 | * Median-normalize across samples and runs (ignoring any 0s) |
| 62 | * Convert all 0s to NA |
| 63 | * Censor low measurements |
| 64 | * Get median |
| 65 | * Get the 99.9th (or another) percentile to identify the right tail of the distribution ("r") |
| 66 | * Get threshold of left side ("l") of the distribution (2*median - r) |
| 67 | * Censor all values less than "l" |
| 68 | * Impute all missing values using an MNAR (missing not at random) method, such as an accelerated failure time model |
| 69 | * Summarize all features of a protein using Tukey's median polish (TMP), but ignore proteins with only 1 peptide (or risk an increased false positive rate) |
| 70 | * Model each protein with a linear mixed-effects model |
| 71 | * Limma does a good job too, but it doesn't handle all the experimental designs handled by MSstats |
| 72 | * Use model to calculate fold changes and raw p-values |
| 73 | * Correct all p-values with FDR (BH method) |
| 74 | * Draw summary plots (volcano plots, MA plots) |
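To make the first preprocessing steps concrete, here is a rough Python sketch of the log2 transform, median normalization, and censoring threshold as they are listed above. It is not MSstats's own code: the function names, the interpolation used for the percentile, and the example intensities are all assumptions made for illustration.

```python
import math
import statistics


def to_log2(raw_values):
    """Log2-transform intensities; zeros become None (i.e. NA)."""
    return [math.log2(v) if v > 0 else None for v in raw_values]


def equalize_medians(runs):
    """Shift each run so its median matches the median of all run medians."""
    medians = {
        run: statistics.median(v for v in values if v is not None)
        for run, values in runs.items()
    }
    target = statistics.median(medians.values())
    return {
        run: [None if v is None else v - medians[run] + target for v in values]
        for run, values in runs.items()
    }


def percentile(sorted_values, pct):
    """Percentile by linear interpolation between neighbouring ranks."""
    position = (len(sorted_values) - 1) * pct / 100
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def censoring_threshold(values, pct=99.9):
    """Left threshold l = 2 * median - r, where r is the upper percentile."""
    observed = sorted(v for v in values if v is not None)
    r = percentile(observed, pct)
    return 2 * statistics.median(observed) - r


def censor(values, threshold):
    """Replace values below the threshold with None."""
    return [None if v is None or v < threshold else v for v in values]


# Hypothetical raw intensities for two runs; 0 marks a missing measurement.
raw = {"run1": [1024, 2048, 0, 4096], "run2": [2048, 4096, 8192, 0]}
logged = {run: to_log2(values) for run, values in raw.items()}
normalized = equalize_medians(logged)
pooled = [v for values in normalized.values() for v in values]
left_threshold = censoring_threshold(pooled)
censored = {run: censor(values, left_threshold) for run, values in normalized.items()}
print(normalized)
print(left_threshold)
```

In this toy example nothing falls below the threshold, so the censoring step leaves the normalized values unchanged. Imputation, summarization, and modeling are left to MSstats.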
| 75 | |
| 76 | ==== Preparing and processing an experiment with MSstats ==== |
| 77 | |
| 78 | * Export peptide-level intensity values from your favorite MS quantification software. |
| 79 | |
| 80 | * Create a peptide intensity file |
| 81 | * Organize the dataset so the first three columns are Gene.symbol, Protein.Accession, and Peptide.sequence |
| 82 | * If needed, convert each protein accession to a gene symbol |
| 83 | * Replace any intensities shown as "-" with 0 |
| 84 | * If Excel is used, check that gene symbols aren't being converted to dates |
| 85 | * After the first 3 columns, the remaining columns hold intensities, one column per sample |
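The cleanup steps above can also be scripted instead of done by hand. The following is a minimal sketch, assuming a tab-delimited file with the three identifier columns first; the accessions and peptide sequences are placeholders, and the date pattern only catches common Excel conversions such as "2-Sep" or "Mar-01".

```python
import csv
import io
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# Matches values such as "2-Sep" or "Mar-01" that Excel produces from gene symbols.
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$", re.IGNORECASE
)


def clean_rows(rows, n_id_columns=3):
    """Replace "-" intensities with "0" and collect date-like gene symbols."""
    cleaned, suspicious = [], []
    for row in rows:
        ids = row[:n_id_columns]
        intensities = ["0" if v.strip() == "-" else v for v in row[n_id_columns:]]
        if DATE_LIKE.match(ids[0].strip()):
            suspicious.append(ids[0])
        cleaned.append(ids + intensities)
    return cleaned, suspicious


# Placeholder data: accessions and peptide sequences are not real identifiers.
example = (
    "Gene.symbol\tProtein.Accession\tPeptide.sequence\tS1\tS2\n"
    "ACTB\tACC_A\tPEPTIDEK\t1200\t-\n"
    "2-Sep\tACC_B\tSAMPLER\t-\t300\n"
)
rows = list(csv.reader(io.StringIO(example), delimiter="\t"))
header, body = rows[0], rows[1:]
cleaned, suspicious = clean_rows(body)
print(cleaned)
print("Gene symbols that look like dates:", suspicious)
```

Any symbol reported as date-like needs to be restored from the protein accession, since the original symbol cannot be recovered reliably from the date itself.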
| 86 | |
| 87 | * To avoid losing information about peptides that could have originated from multiple proteins and/or genes, merge peptide rows representing more than 1 protein and/or gene |
| 88 | * Sample command: sort -k3,3 Peptide_intensities.matrix.txt | groupBy -g 3 -c 1,2,3,4,5,6,7,8,9,10 -o distinct >| Peptide_intensities.matrix.mergedByPeptide.txt |
| 89 | * Make sure that all rows of the output file are unique. |
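For readers without groupBy available, the same idea (one row per peptide, with the distinct values of every column joined by commas) can be sketched in Python. This is an illustrative stand-in, not a reimplementation of the command above: it keeps the original column order, and the gene names, accessions, and peptides are placeholders.

```python
def merge_by_peptide(rows, peptide_col=2):
    """Collapse rows sharing a peptide into one row per peptide.

    For every column, the distinct values are joined with commas in order
    of first appearance.
    """
    merged = {}
    for row in rows:
        key = row[peptide_col]
        if key not in merged:
            merged[key] = [[] for _ in row]
        for column_values, value in zip(merged[key], row):
            if value not in column_values:
                column_values.append(value)
    return [[",".join(col) for col in cols] for cols in merged.values()]


def all_rows_unique(rows):
    """True if no row appears more than once."""
    return len({tuple(row) for row in rows}) == len(rows)


# Placeholder rows: one peptide maps to two genes/proteins.
example_rows = [
    ["GENE1", "ACC_A", "PEPTIDEK", "100", "200"],
    ["GENE2", "ACC_B", "PEPTIDEK", "100", "200"],
    ["GENE3", "ACC_C", "SAMPLER", "50", "0"],
]
merged_rows = merge_by_peptide(example_rows)
for row in merged_rows:
    print("\t".join(row))
print("All rows unique:", all_rows_unique(merged_rows))
```

If an intensity column ends up containing a comma after merging, the same peptide was reported with different intensities in different rows, which is worth checking before running MSstats.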
| 90 | |
| 91 | * Create a sample description file |
| 92 | * Columns are Run, Condition, BioReplicate |
| 93 | * See MSstats documentation on how to use these fields to represent technical replicates, biological replicates, and paired designs. |
| 94 | * Replication is not required for subsequent protein quantification but is required for statistical analysis. |
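A sample description file with these three columns can be written with a short script. The sketch below is only an example: the run names, conditions, and replicate IDs are hypothetical, comma-separated output is an assumption, and how BioReplicate should be numbered for a given design is covered in the MSstats documentation. The second function flags conditions that have fewer than two biological replicates and therefore cannot be tested statistically.

```python
import csv
import io
from collections import defaultdict


def build_annotation(samples):
    """Return the sample description table as text.

    samples: list of (run, condition, bioreplicate) tuples.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Run", "Condition", "BioReplicate"])
    writer.writerows(samples)
    return buffer.getvalue()


def conditions_lacking_replicates(samples):
    """Return the conditions represented by fewer than two biological replicates."""
    replicates = defaultdict(set)
    for _run, condition, bioreplicate in samples:
        replicates[condition].add(bioreplicate)
    return sorted(c for c, reps in replicates.items() if len(reps) < 2)


# Hypothetical design: two control replicates, one treated sample.
samples = [
    ("run1", "control", 1),
    ("run2", "control", 2),
    ("run3", "treated", 3),
]
annotation_text = build_annotation(samples)
print(annotation_text)
print("Conditions without replication:", conditions_lacking_replicates(samples))
```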
| 95 | |
| 96 | * Run MSstats using peptide intensities and sample description as input files. |
| 97 | * For sample code, see **/nfs/BaRC_code/R/analyze_MS_with_MSstats/analyze_MS_with_MSstats.R** |
| 98 | |
| 99 | ==== References ==== |
| 100 | |
| 101 | * Choi et al., 2017 [[https://pubs.acs.org/doi/abs/10.1021/acs.jproteome.6b00881|ABRF Proteome Informatics Research Group (iPRG) 2015 Study: Detection of Differentially Abundant Proteins in Label-Free Quantitative LC−MS/MS Experiments]] |
| 102 | * MSstats at [[https://bioconductor.org/packages/release/bioc/html/MSstats.html|Bioconductor]] and [[https://github.com/MeenaChoi/MSstats|GitHub]] |
| 103 | * Other references on these topics |
| 104 | * Lazar et al., 2016 [[https://pubs.acs.org/doi/full/10.1021/acs.jproteome.5b00981|Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies]] |
| 105 | * Wei et al., 2018 [[https://www.nature.com/articles/s41598-017-19120-0|Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data]] |