class: center, middle, inverse, title-slide # Data Preprocessing and Processing ### Max Qiu, PhD Bioinformatician/Computational Biologist
maxqiu@unl.edu
###
Bioinformatics Research Core Facility, Center for Biotechnology
Data Life Science Core, NCIBC
### 02-07-2022 --- background-image: url(data:image/png;base64,#https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13024-018-0304-2/MediaObjects/13024_2018_304_Fig1_HTML.png?as=webp) background-size: 75% # Metabolomics Workflow .footnote[ [Shao, Y., Le, W. Mol Neurodegeneration 14, 3 (2019)](https://doi.org/10.1186/s13024-018-0304-2)</br> Copyright © 2021 BioMed Central Ltd ] --- # Data Acquisition .pull-left[ ### Instrumentation * Separation: LC-, GC-, CE- * Detection: + -MS (Quadrupole, TOF, Ion-trap/Orbitrap) + -MSn (Triple Quad, Qtrap, Q-TOF, Q-Orbitrap) * Others: DI (direct infusion)-, MALDI-, etc... ] .pull-right[ ### Data preprocessing Converting raw MS data to a data matrix (peak intensity table or concentration table, etc.) containing each peak area in each sample * Peak picking (Deconvolution) * Peak alignment * Chemical identification of metabolites ] ??? The normal workflow for preprocessing of LC-MS spectral data involves three steps. The first step is to perform **peak picking or deconvolution on each sample separately**, which is to **filter and identify peaks** within each sample. --- # Data Acquisition: Preprocessing ### Deconvolution and alignment .pull-left[ * **Peak picking or deconvolution** on each sample separately + Mass-to-charge ratio + Retention time + Peak area ] .pull-right[ <img src="data:image/png;base64,#./img/deconvolution.png" width="100%" style="display: block; margin: auto 0 auto auto;" /> ] .footnote[ Copyright © University of Birmingham </br> and Birmingham Metabolomics Training Center ] ??? More than one metabolite feature can elute at the same time. Peak picking is the process of determining the peak area (intensity) of each feature (m/z ratio) eluting at the same time. After processing the whole RT range, we get an EIC (extracted ion chromatogram) for each feature (m/z ratio). There are many software packages that facilitate this process, and each performs peak picking in its own, often complex, way. We will not discuss any of the details here, but they all **report a metabolite feature with an associated m/z ratio, retention time and peak area**. This process is relatively easy, because we expect the m/z ratio and RT for a specific metabolite feature in an analysis to be the same. --- # Data Acquisition: Preprocessing ### Deconvolution and alignment .pull-left[ * **Peak alignment** + Setting bins relating to small ranges of retention time and mass-to-charge ratios + Align data from each sample and combine into a single data table - A data table where **rows are samples and columns are features (dimensions)** * Chemical identification of metabolites + Typically done after statistical analysis ] .pull-right[ <img src="data:image/png;base64,#./img/alignment_1.png" width="100%" style="display: block; margin: auto;" /> ] .footnote[ Copyright © University of Birmingham and Birmingham Metabolomics Training Center ] ??? The next step is to **align the mass-to-charge ratios and retention times** across all samples and combine these data into a single dataset. This is **never a perfect process**, because the same metabolite feature may be detected at a slightly different mass-to-charge ratio or retention time in different samples. A simple approach is to set bins. The **same metabolite in different samples is added to the same bin** to allow these metabolites to be aligned across different samples.
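To make the binning idea concrete, here is a toy sketch in R (not how any particular peak-picking or alignment package actually works), using a hypothetical long-format `peaks` data frame of picked peaks; rounding m/z and retention time defines the bins, and reshaping yields the sample-by-feature peak intensity table described on the following slides.

```r
# Hypothetical table of picked peaks: one row per detected peak in each sample
peaks <- data.frame(
  sample = c("S1", "S1", "S2", "S2", "S3", "S3"),
  mz     = c(180.063, 132.077, 180.064, 132.076, 180.062, 132.078),
  rt     = c(5.02, 7.51, 5.05, 7.49, 4.98, 7.53),
  area   = c(1.2e6, 4.0e5, 1.1e6, 3.8e5, 1.3e6, 4.2e5)
)

# Bin by rounding m/z (to 0.01) and RT (to 0.1 min): peaks falling in the
# same bin across samples are treated as the same metabolite feature
peaks$feature <- paste0("M", round(peaks$mz, 2), "T", round(peaks$rt, 1))

# Aligned table: rows are samples, columns are features, values are peak areas
peak_table <- xtabs(area ~ sample + feature, data = peaks)
peak_table
```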
This will normally involve using both the accurately measured **m/z ratio** and, where possible, the **retention time**, and **fragmentation mass spectra** acquired using tandem mass spectrometry. The data across all samples are then binned together to ensure the **same metabolite in each sample is reported as the same metabolite** in the single integrated dataset. --- # Data Acquisition: Preprocessing ### Deconvolution and alignment <img src="data:image/png;base64,#./img/alignment_2.png" width="70%" style="display: block; margin: auto;" /> .footnote[ Analytical Chemistry, Vol. 78, No. 3, Feb 1, 2006 ] ??? This is only the most surface-level introduction to MS and LC-MS; there are many important concepts such as mass accuracy, mass resolution, ion suppression, monoisotopic mass, how to read an MS spectrum, and how to calculate molecular mass. --- # Data Acquisition: Preprocessing ### Peak intensity table <img src="data:image/png;base64,#./img/peak_intensity_table.png" width="100%" style="display: block; margin: auto;" /> --- # (Post-Acquisition) Data processing .pull-left[ ### MS Data is highly complex * Hundreds to thousands of metabolites * Complexity of mass spectral data: multiple peaks for each metabolite * Hundreds or thousands of samples ] -- .pull-right[ ### Common artefacts in metabolomics data * Baseline drift (batch correction) * Peak misalignment * Unwanted peak intensity differences (normalization) * Noise variables (batch correction) * Missing values (imputation) * Batch effects (batch correction) * Unequal variable weights (scaling) * ... ] --- # (Post-Acquisition) Data processing ### Quality control and Batch correction ### Assess the quality of data: missing values ### Assess feature presence/absence: imputation ### Data processing * Log2 transformation * Normalization * Scaling ??? Several processing steps have to be applied to remove systematic bias from the data, while preserving biological information. --- # Quality Control .pull-left[ ### Why? Reproducibility * Sample interacts directly with the instrument * Changes in measured metabolic feature response over time * QC samples are periodically analyzed throughout an analytical run + QC samples are **theoretically identical biological samples**, with a **metabolic and sample matrix composition** similar to those of the biological samples under study. ] ??? The first important thing to keep in mind is quality control and the use of QC samples. We are concerned about the **issue of reproducibility**. Not all samples can be run in a single analytical batch, because of issues ranging from instrument **medium- to long-term reproducibility** to necessary preventative **maintenance**. But the issue of reproducibility is not only time-dependent. It is also **instrument dependent**. In any chromatography-MS system, the **sample unavoidably interacts directly with the instrument**, and this leads to **changes in measured metabolic feature response over time**, both in terms of chromatography and MS. (**Signal attenuation**) The **degree and timing of signal attenuation is not consistent across all measured metabolic features** (and it is also dependent on the type of biofluid measured). For this reason, it is a necessary requirement that QC samples are periodically analyzed throughout an analytical run in order to provide robust quality assurance (QA) for each metabolic feature detected. The ideal QC samples are ...
-- .pull-right[ ### Purpose of using QC samples * To **'condition' or equilibrate** the analytical platform * To provide data to calculate technical precision within each analytical block: **within batch effect** * To provide data to use for signal correction between analytical blocks: **between batch effect** ] .footnote[ [Dunn WB, et al. Nat Protoc. 2011 Jun 30;6(7):1060-83.](https://www.nature.com/articles/nprot.2011.335) ] ??? In data processing stages, batch correction algorithms (we will mention a few) can **use the QC responses as the basis** to assess the quality of the data, **remove peaks with poor repeatability, correct the signal attenuation and concatenate batch data together** after data acquisition and before statistical analysis. --- # QC sample ### Types of QC samples .pull-left[ * Pooled QC (**Gold standard**) + **Small aliquots of each biological sample to be studied are pooled and thoroughly mixed** + Closest to the composition of the biological samples + Suited to small, focused studies, e.g., small clinical trials or animal studies of hundreds of samples + Not always applicable in large-scale studies involving thousands of samples ] .pull-right[ * Surrogate QC + Commercially available biofluids composed of multiple biological samples **not present** in the study + Some metabolites are only detected in the samples and some are only detected in the commercial serum sample. During data pre-processing, these metabolites (all associated metabolic features) are **removed** from the data set, as the methods applied require detection of all metabolites in QC and subject population samples. + This results in a loss (~20%) of metabolic information in the data set. ] .footnote[ [Dunn WB, et al. Nat Protoc. 2011 Jun 30;6(7):1060-83.](https://www.nature.com/articles/nprot.2011.335) ] ??? Ideally, QC samples are theoretically identical biological samples, with a metabolic and sample matrix composition similar to those of the biological samples under study. Two types of QC sample are available: pooled QC and commercially available surrogate QC. --- # QC samples and Injection Order <img src="data:image/png;base64,#./img/injection_order_1.PNG" width="75%" style="display: block; margin: auto;" /> .footnote[ [Tripp, B.A., Dillon, S.T., Yuan, M. et al. Sci Rep 11, 1521 (2021)](https://doi.org/10.1038/s41598-020-80412-z)</br> Copyright © 2021 Springer Nature Limited ] ??? As we mentioned before, the purpose of incorporating QC samples is to **provide a basis** to assess the quality of the data. To achieve that, they will need to be **run periodically through the entire data acquisition** and must be **included within each batch**. --- # QC samples and Injection Order .pull-left[ <img src="data:image/png;base64,#./img/injection_order_2.PNG" width="100%" style="display: block; margin: auto;" /> .footnote[ [Dunn, W., Broadhurst, et al. Nat Protoc 6, 1060–1083 (2011)](https://doi.org/10.1038/nprot.2011.335)</br> Copyright © 2011 Springer Nature Limited ] ] ??? A few tips: * Use the **same** pooled QC sample across batches throughout the whole acquisition phase. * In each batch, QC samples must be at the **beginning and at the end of each batch**, to correct the total signal drift of one batch. * Within each batch, a QC sample should be **run every 3-5 samples**, to correct the signal drift within a batch. * Don't leave your QC samples in the autosampler for days, because they will dry out and defeat the purpose of using them in the first place.
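As a small illustration of these tips, the sketch below builds a batch injection order with conditioning QC injections at the start, a pooled QC every five study samples, and a QC at the end; the sample names and batch size are hypothetical.

```r
# Hypothetical batch of 20 biological samples, run in randomized order
set.seed(42)
study_samples <- sample(paste0("Sample_", sprintf("%02d", 1:20)))

# Start with a few QC injections to condition/equilibrate the platform,
# then inject one pooled QC every 5 study samples and one at the very end
injection_order <- rep("QC", 3)
for (i in seq_along(study_samples)) {
  injection_order <- c(injection_order, study_samples[i])
  if (i %% 5 == 0) injection_order <- c(injection_order, "QC")
}
if (tail(injection_order, 1) != "QC") injection_order <- c(injection_order, "QC")

data.frame(injection = seq_along(injection_order), sample = injection_order)
```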
-- .pull-right[ ### Tips * The **same** QC samples are **included within each batch** throughout the whole acquisition phase. * In each batch, QC samples must be at the **beginning and at the end of each batch**, to correct the total signal drift of one batch. * Within each batch, a QC sample should be **run every 3-5 samples**, to correct the signal drift within a batch. ### What is considered a batch? An extended period of time during which the instrument is not running; an instrument shutdown; instrument re-calibration... ] --- # Batch correction ### Use pooled QC to correct both between and within batch effects * Quantile normalization: **map time-varying changes in samples** forcing identical peak-intensity distributions. [QC-RFSC](https://www.sciencedirect.com/science/article/pii/S0003267018309395) (`statTarget`), [QC-SVR](https://pubs.rsc.org/en/content/articlelanding/2015/an/c5an01638j) * Regression normalization: **operate on each ion in isolation**, correcting for sample-to-sample systematic variation using either linear or nonlinear regression. [QC-RLSC](https://www.nature.com/articles/nprot.2011.335), [QC-RSC](https://link.springer.com/article/10.1007%2Fs00216-013-6856-7) ### QA criteria After batch correction, each metabolic feature is required to pass strict QA criteria. Any feature that does not pass the QA criteria is removed from the data set and, thus, ignored in any subsequent data analysis. * Calculate the RSD for all metabolic features in the QC samples. * Remove all metabolic features that are detected in <50% of QC samples. * Remove all metabolic features with an RSD, as calculated for the QC samples, of >30% (in GC-MS) or >20% (in LC-MS). **Code demo and tutorial:** [Batch Correction using `statTarget`](https://github.com/whats-in-the-box/tutorials_and_demos/blob/main/BatchCorrection.ipynb) ??? Previously we mentioned that data conditioning algorithms **use the QC responses as a basis** to correct signal drifts between and within a batch. These algorithms are generally grouped into two classes in terms of how they work mathematically: they **operate on (align) samples** or **operate on (align) features**. Here, normalization refers strictly to the batch correction process; there are other "normalization" steps in this workflow that should be distinguished from it. After signal correction and batch integration, each detected metabolic feature is required to pass strict QA criteria. Any peak that does not pass the QA criteria is removed from the data set and thus ignored in any subsequent data analysis. These are mostly **peaks with poor repeatability** (detected in some samples but not others, or present at a concentration so low that it tests the limit of detection). RSD: relative standard deviation (coefficient of variation). --- # Missing values .pull-left[ ## Potentially three main reasons * The metabolite is not present in the biological sample * The metabolite is present in the sample, but the concentration is below the limit of detection * The metabolite is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process ] .pull-left[  ] .footnote[ Copyright © University of Birmingham and Birmingham Metabolomics Training Center ] ??? If you have QC samples, you are not likely to have missing values after batch correction. That is another advantage of using quality control, because without QC samples, raw mass spec data will usually contain quite a lot of missing values.
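Before turning to missing values in detail, here is a minimal sketch of the QA filtering described on the previous slide, assuming a hypothetical samples-by-features matrix `peak_table` (with `NA` for "not detected") and a logical vector `is_qc` that flags the QC injections; the 20% RSD cut-off shown is the LC-MS threshold (use 30% for GC-MS).

```r
# Intensities of the QC injections only (hypothetical peak_table and is_qc)
qc_data <- peak_table[is_qc, , drop = FALSE]

# Fraction of QC samples in which each feature was detected
qc_detect_rate <- colMeans(!is.na(qc_data))

# Relative standard deviation (%) of each feature across the QC samples
qc_rsd <- apply(qc_data, 2, function(x) {
  x <- x[!is.na(x)]
  100 * sd(x) / mean(x)
})

# Keep features detected in at least 50% of QCs with an RSD of at most 20%
keep <- qc_detect_rate >= 0.5 & !is.na(qc_rsd) & qc_rsd <= 20
filtered_table <- peak_table[, keep, drop = FALSE]
```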
Missing values in mass spec data can be **unavoidable and problematic** in the analysis, and they occur as a result of **both technical and biological reasons**. As they will have a negative impact on statistical analysis, it's important that you explore the peak intensity table after acquisition pre-processing and investigate the **number and percentage of your missing values**. There are three main reasons why a metabolite feature may be missing from a spectrum. --- # Missing values .pull-left[ ## Potentially three main reasons * The metabolite is not present in the biological sample * The metabolite is present in the sample, but the concentration is below the limit of detection * **The metabolite is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process** + Data matrix contains a high number of missing values (>20%): not normal ] ??? If your data matrix contains a relatively **high percentage of missing values, say 20%**, in practice 25%, you need to evaluate your data and find the main reason for it, because this is not normal. -- .pull-right[ ### Check data preprocessing parameters for * Retention time misalignment * Incorrect grouping of mass-to-charge values * Inaccurate signal-to-noise threshold ] ??? **Make sure you are using a set of optimal parameters to process the spectra**. Non-optimal processing parameters may result in missing values due to retention time misalignment, incorrect grouping of m/z values, or applying an inaccurate signal-to-noise threshold. --- # Missing values .pull-left[ ## Potentially three main reasons * **The metabolite is not present in the biological sample** * **The metabolite is present in the sample, but the concentration is below the limit of detection** * The metabolite is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process ] ??? The other two are legitimate reasons behind missing values, even if you did everything perfectly. Under these circumstances, we can **filter to remove problematic samples or metabolic features that have a high percentage of missing values**. -- .pull-right[ ### Remove Sample * Affects statistical power * Tread lightly! Check previous steps. ### Assess feature presence/absence * A feature is deemed present if its missing values account for less than 50%* of the sample space + Features deemed present: **missing value imputation** - KNN (k-nearest neighbor) - Small value replacement - Random forest + Features deemed absent: remove features ] ??? Removing samples is a risky move, as it will reduce the power of the test in statistical analysis. So tread lightly; check the previous data preprocessing steps first: deconvolution and alignment. If we are sure a metabolic feature is not present or is below the detection limit, we can remove it with fewer negative effects. But first, you must **set a criterion to assess whether a feature is really missing**. Usually we deem a feature present if its missing values account for less than 50% of the sample space. If a feature is missing in most of the samples, then remove the feature. If it is deemed a present feature with missing values, then we do imputation. The objective of missing value imputation is to **replace the missing values with non-zero values while maintaining the structure of the data**.
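As a minimal sketch of two of the imputation strategies listed above (again assuming a hypothetical samples-by-features matrix `peak_table` with `NA` for missing values): small value replacement can be done in base R, while the KNN line is only indicative and assumes the Bioconductor `impute` package is installed.

```r
# (1) Small value replacement: each missing value of a feature is replaced
#     with half of the minimum observed intensity of that feature
half_min_impute <- function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}
imputed_small <- apply(peak_table, 2, half_min_impute)

# (2) KNN imputation (impute.knn() expects features in rows, hence the transposes)
# library(impute)
# imputed_knn <- t(impute.knn(t(peak_table), k = 5)$data)
```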
--- # Missing values .pull-left[  .footnote[ Copyright © University of Birmingham and </br> Birmingham Metabolomics Training Center ] ] .pull-right[ ### Remove Sample * Affects statistical power * Tread lightly! Check previous steps. ### Assess feature presence/absence * A feature is deemed present if its missing values account for less than 50%* of the sample space + Features deemed present: **missing value imputation** - **KNN (k-nearest neighbor)** - Small value replacement - Random forest + Features deemed absent: remove features ] ??? The K-nearest neighbor imputation method (KNN for short) uses a feature-specific approach: **the nearest K (e.g. five) neighbors of the feature with the missing value are identified using some distance metric (e.g. Euclidean distance)**. The missing value is then replaced with an average of the corresponding non-missing values from those neighbors. The small value replacement method **imputes the same value for each of the missing values of a specific metabolite feature**. In this approach the missing value is usually replaced by **half of the minimum peak intensity of that feature**. This method would be expected to perform well when the missing values occur because the concentration of the metabolite is less than the limit of detection. --- # Data Processing .pull-left[ <img src="data:image/png;base64,#./img/data_example.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ### Understand the statistical nature of your data * Is your data normally distributed? * Does your data have good statistical power?* * High-dimensional ] ??? Up to this point, we have shown that several processing steps have to be applied to **remove systematic bias from the data, while preserving biological information**. The next step in the metabolomics pipeline is to use statistical techniques to extract the relevant and meaningful information from the preprocessed data. However, before statistical analysis, we must make sure we understand the **statistical characteristics** of the data matrix, specifically the **distribution of the data and its high-dimensional nature**, what **statistical tests** are suitable for our data, and whether the data matrix **satisfies the assumptions** of those tests. --- # Data Processing .pull-left[ <img src="data:image/png;base64,#./img/data_example.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ### Understand the statistical nature of your data * Is your data **normally distributed**? * Does your data have good statistical power?* * High-dimensional ### What statistical tests can we use * Parametric or non-parametric ### Whether your data satisfy the assumptions of statistical tests * Normal distribution (Gaussian) * Equal variances (homoscedasticity) * ... ] ??? Parametric statistics are based on the assumption that your sample follows a specific distribution, the distribution of the population from which the sample was taken. Nonparametric statistics are not based on such assumptions; that is, **the data can be collected from a sample that does not follow a specific distribution**. **Nonparametric tests are also called distribution-free tests because they don’t assume that your data follow a specific distribution**. This is why we have to check whether the data are normally distributed (Gaussian distribution). If yes, then we can use parametric tests; if not, we can only use non-parametric tests. If the data are not normally distributed, then the **goal of the data processing step is to transform the data so that it resembles a normal distribution as closely as possible**.
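As a minimal sketch of checking normality in practice, the code below compares one feature of the hypothetical `peak_table` before and after the log2 transformation discussed on the next slides, using base R graphics and the Shapiro-Wilk test purely as an illustration.

```r
# Distribution of one feature before and after log2 transformation
x     <- na.omit(peak_table[, 1])   # raw peak intensities for one feature
x_log <- log2(x)

op <- par(mfrow = c(2, 2))
hist(x, main = "Raw intensities")
qqnorm(x); qqline(x)
hist(x_log, main = "log2 intensities")
qqnorm(x_log); qqline(x_log)
par(op)

# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro.test(x)
shapiro.test(x_log)
```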
--- # Data Processing **Goal of data processing**: to get to a (near) **Gaussian** distribution so we can use parametric tests. .footnote[ [Nonparametric Tests vs. Parametric Tests](https://statisticsbyjim.com/hypothesis-testing/nonparametric-parametric-tests/) ] .pull-left[ - Reasons to use a parametric test + **Parametric tests have greater statistical power**. If an effect actually exists, a parametric analysis is more likely to detect it. + Parametric tests can perform well with skewed and nonnormal distributions. + Parametric tests can perform well when the spread (dispersion) of each group is different. ] .pull-right[ - Reasons to use a nonparametric test + Your sample size is small. + The median more accurately represents the center of your data (highly skewed data).  ] ??? Parametric tests offer a few advantages. There are also times when non-parametric tests are better suited. The answer is often contingent upon whether the **mean or median** is a better measure of central tendency for the distribution of your data. * If the mean is a better measure and you have a sufficiently large sample size, a parametric test usually is the better, more powerful choice. * If the median is a better measure, consider a nonparametric test regardless of your sample size. --- # Data Processing **Goal of data processing**: to get to a (near) **Gaussian** distribution so we can use parametric tests. .pull-left[ ### Log Transformation * Data transformation applies a mathematical transformation to the individual values themselves, e.g. a log or cube root transformation. * For mass spec data, log transformation is the better choice, as **log transformation reduces or removes the skewness** of mass spec data. * Generalized logarithm transformation (glog) for datasets that contain negative values ] .pull-right[ <img src="data:image/png;base64,#./img/distribution_before_after.png" width="100%" style="display: block; margin: auto;" /> ] --- # Data Processing **Goal of data processing**: to get to a (near) **Gaussian** distribution so we can use parametric tests. ### Normalization * There are many types of normalization methods. The point of normalization is that the data **can be described as a normal distribution (bell curve).** * Normalization methods + Linear regression normalization (Rlr, RlrMA, RlrMACyc): assumes bias in the data is linearly dependent on the peak intensity + Local regression normalization (LoessF, LoessCyc): assumes a nonlinear relationship between bias in the data and the peak intensity + EigenMS normalization: fits an analysis of variance model to the data to evaluate the treatment group effect and then uses singular value decomposition on the model residual matrix to identify and remove the bias .footnote[ [Tommi Välikangas, Tomi Suomi, Laura L Elo, Briefings in Bioinformatics, Volume 19, Issue 1, January 2018, Pages 1–11](https://doi.org/10.1093/bib/bbw095) ] --- # Data Processing **Goal of data processing**: to get to a (near) **Gaussian** distribution so we can use parametric tests. ### Scaling .pull-left[ * Scaling aims to **standardize the range of feature data** (transforming your data so that it fits within a specific scale), thereby making individual features more comparable (as the range of values may vary widely between features).
* Scaling methods + Auto Scaling: mean-centered and divided by the standard deviation of each variable + Pareto Scaling: mean-centered and divided by the square root of the standard deviation of each variable + Range Scaling: mean-centered and divided by the range of each variable ] .pull-right[  .footnote[ [van den Berg RA, et al. BMC Genomics. 2006 Jun 8;7:142](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-7-142) ] ] ??? A few tips: * Log transformation is usually the first step in mass spec data processing. Check the histogram and Q-Q plot afterwards. * You can choose either normalization or scaling, or both, again judging by the histogram and Q-Q plot. * Scaling should not be the first step in data processing. A minimal code sketch of these scaling formulas is included in the notes of the closing slide. --- class: inverse, center, middle # Next: Statistical analysis Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).
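???

For reference, a minimal sketch of the three scaling formulas from the previous slide, applied column-wise to a hypothetical samples-by-features matrix `x` (for example, the log2-transformed peak table).

```r
# Each method mean-centers a feature and then divides by a different factor
auto_scale   <- function(v) (v - mean(v)) / sd(v)             # standard deviation
pareto_scale <- function(v) (v - mean(v)) / sqrt(sd(v))       # square root of the SD
range_scale  <- function(v) (v - mean(v)) / (max(v) - min(v)) # range

x_auto   <- apply(x, 2, auto_scale)    # equivalent to scale(x) in base R
x_pareto <- apply(x, 2, pareto_scale)
x_range  <- apply(x, 2, range_scale)
```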