class: center, middle, inverse, title-slide .title[ # Data Preprocessing and Processing ] .author[ ###
Max Qiu, PhD
Bioinformatician/Computational Biologist
maxqiu@unl.edu
] .institute[ ###
Bioinformatics Research Core Facility, Center for Biotechnology
Data Life Science Core, NCIBC
] .date[ ### 02-13-2023 ] --- background-image: url(data:image/png;base64,#https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13024-018-0304-2/MediaObjects/13024_2018_304_Fig1_HTML.png?as=webp) background-size: 75% # MS Omics Workflow .footnote[ [Shao, Y., Le, W. Mol Neurodegeneration 14, 3 (2019)](https://doi.org/10.1186/s13024-018-0304-2) ] ??? HPLC: * Components: pumps, sampler, column, detector * Operating principle: separation based on adsorption/desorption rates * Column (stationary phase): polarity similar to your target analytes, so they are retained * Mobile phase: competes with the analytes for the stationary phase * Separation modes: normal phase for polar analytes, reverse phase for non-polar analytes * Elution modes: isocratic (eluent unchanged, one pump) or gradient (eluent changing throughout the run, two pumps) * **Detector**: spectrophotometric (UV/Vis, fluorescence), refractive index (RI), evaporative light-scattering (ELSD), electrochemical, **mass spectrometer** MS: * Inlet, ionization source, mass analyzer, and detector * Ionization source: ESI and MALDI * Mass analyzer: quadrupole, TOF, ion-trap/Orbitrap MS/MS: * Auto mode and targeted mode * Collision cell An LC-MS dataset is three-dimensional: it includes three components for each feature, (1) the retention time, (2) the mass-to-charge ratio and (3) the chromatographic peak area/intensity. All three components are used to construct a data matrix that defines each feature and records its peak intensity in each sample. --- # Data Acquisition .pull-left[ ### Instrumentation * Separation: LC-, GC-, CE-, 1D or 2D Gel * Detection: + -MS (Quadrupole, TOF, Ion-trap/Orbitrap) + -MSn (Triple Quad, QTrap, Q-TOF, Q-Orbitrap) * Others: DI (direct infusion)-, MALDI-, etc. ] .pull-right[ ### Data preprocessing (Deconvolution) The raw mass spectra acquired are **complicated (convoluted)** by many interferences and artefacts (instrumental mass bias, baseline noise, baseline drift). **Deconvolution** is the process of computationally **removing interferences/artefacts, separating co-eluting components and creating a pure spectrum for each component**, i.e., converting raw mass spectra into a data matrix (peak intensity table, concentration table, etc.) containing the intensity of each peak in each sample * Sample alignment * Peak picking ] ??? During acquisition, the mass spectra become complicated (convoluted) by many interferences and artefacts. Deconvolution is the process of computationally removing interferences/artefacts, separating co-eluting components and creating a pure spectrum for each component. Generally, deconvolution involves two steps: sample alignment and peak picking. Some software performs **peak picking on each sample separately**, i.e., it **filters and identifies peaks** within each sample and then aligns all samples together. Other software, such as Waters Progenesis, aligns samples first (by selecting a reference sample that **contains the most commonality**) and then performs peak picking. --- # Data Acquisition: Preprocessing <img src="data:image/png;base64,#./img/progenesis_auto_processing.png" width="90%" style="display: block; margin: auto;" /> ??? After all files are imported, the software chooses an alignment reference: it picks the file that has the most features (black dots) in common with all the other files. Once the software has picked a reference, it uses all the black dots to align the files to the reference, so that a feature is not identified as a different peak just because it drifted slightly in the chromatogram.
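As a purely illustrative example (hypothetical feature names and intensity values), the data matrix produced by preprocessing can be thought of as a peak intensity table in which each feature is defined by its retention time and m/z and quantified by its chromatographic peak intensity in every sample:

```r
# Toy peak intensity table: one row per feature (RT + m/z) and one column per sample
peak_table <- data.frame(
  feature  = c("F001", "F002", "F003"),
  rt_min   = c(1.82, 4.57, 7.31),              # retention time (minutes)
  mz       = c(180.0634, 302.1512, 445.2049),  # mass-to-charge ratio
  sample_A = c(1.2e5, 3.4e4, 8.9e3),           # chromatographic peak intensities
  sample_B = c(1.1e5, 2.9e4, 9.6e3),
  sample_C = c(1.4e5, 3.8e4, 7.5e3)
)
peak_table
```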
--- # Data Acquisition: Preprocessing <img src="data:image/png;base64,#./img/progenesis_review_alignment.png" width="90%" style="display: block; margin: auto;" /> ??? The four panels here show the same thing, just with different visualizations. The software is looking at the chromatogram for alignment. The blue dots are the vectors being used to align the data; a vector has a direction in 3D, connecting this sample to the reference. The alignment score indicates how well a sample is aligned to the reference. A bad score or red area is not necessarily a bad thing; it just means that the sample is significantly different from the reference. --- # Data Acquisition: Preprocessing <img src="data:image/png;base64,#./img/progenesis_peak_picking.png" width="90%" style="display: block; margin: auto;" /> ??? Once all the data are imported and aligned, the software picks the spectral peaks. This shows an aggregate sample containing all ~4,000 peaks that have been found across all samples. On the aggregate, the software will try to normalize the data, i.e., decide how the data are compared to one another. By default, the data are normalized to all ions. --- # Data Acquisition: Preprocessing <img src="data:image/png;base64,#./img/progenesis_normalization.png" width="90%" style="display: block; margin: auto;" /> ??? The software looks at all the spectral data and tries to find the most commonality between the files. The file with the most spectral commonality is chosen as the normalization reference, and all other files are compared against this reference. All other files are either slightly higher or lower in total intensity compared to the reference. --- # Data Acquisition: Preprocessing ### Peak intensity table <img src="data:image/png;base64,#./img/peak_intensity_table.png" width="100%" style="display: block; margin: auto;" /> ??? This is only a surface-level introduction to MS and LC-MS; there are many important concepts such as mass accuracy, mass resolution, ion suppression, monoisotopic mass, how to read an MS spectrum, and how to calculate molecular mass. Common preprocessing software includes Waters Progenesis QI, PEAKS Studio from Bioinformatics Solutions, and MaxQuant. --- # (Post-Acquisition) Data processing .pull-left[ ### MS data is highly complex * Hundreds to thousands of metabolites/peptides * Complexity of mass spectral data: multiple peaks for each analyte * Hundreds or thousands of samples ] -- .pull-right[ ### Common artefacts * Instrumental mass bias (use lock mass) * Baseline noise * Baseline drift (batch correction) * Misalignment * Unwanted peak intensity differences (normalization) * Missing values (imputation) * Batch effects (batch correction) * Unequal variable weights (scaling) * ... ] --- # (Post-Acquisition) Data processing * Quality control and injection order * Assess the quality of data: missing values * Assess feature presence/absence: imputation * (QC-based) batch correction * Data processing + Log2 transformation + Normalization + Scaling ??? Several processing steps have to be applied to remove systematic bias from the data while preserving biological information. --- # Quality Control .pull-left[ ### Why? 
Reproducibility * Sample interacts directly with the instrument * Changes in measured feature response over time ### Purpose of using QC samples * To **'condition' or equilibrate** the analytical platform * To provide data to calculate technical precision within each analytical block: **within-batch effect** * To provide data to use for signal correction between analytical blocks: **between-batch effect** ] ??? The first important thing to keep in mind is quality control and the use of QC samples. We are concerned about the **issue of reproducibility**. Not all samples can be run in a single analytical batch, because of issues ranging from instrument **medium- to long-term reproducibility** to necessary preventative **maintenance**. But the issue of reproducibility is not only time-dependent. It is also **instrument-dependent**. In any chromatography-MS system, the **sample unavoidably interacts directly with the instrument**, and this leads to **changes in measured feature response over time**, both in terms of chromatography and MS (**signal attenuation**). The **degree and timing of signal attenuation are not consistent across all measured features** (and they also depend on the type of biofluid measured). For this reason, it is a necessary requirement that QC samples are periodically analyzed throughout an analytical run in order to provide robust quality assurance (QA) for each feature detected. -- .pull-right[ ### QC-based batch correction * Use the QC responses as the basis to assess the quality of the data * Remove peaks with poor repeatability * Correct the signal attenuation * Concatenate batch data together ] .footnote[ [Dunn WB, et al. Nat Protoc. 2011 Jun 30;6(7):1060-83.](https://www.nature.com/articles/nprot.2011.335) ] ??? In the data processing stages, QC-based batch correction algorithms (we will mention a few) can **use the QC responses as the basis** to assess the quality of the data, **remove peaks with poor repeatability, correct the signal attenuation and concatenate batch data together** after data acquisition and before statistical analysis. --- # QC sample ### Types of QC samples .pull-left[ * Pooled QC (**Gold standard**) + **Small aliquots of each biological sample to be studied are pooled and thoroughly mixed** + Closest to the composition of the biological samples + Suited to small, focused studies, e.g., small clinical trials or animal studies of hundreds of samples + Not always applicable in large-scale studies involving thousands of samples ] .pull-right[ * Surrogate QC + Commercially available biofluids composed of multiple biological samples **not present** in the study + Some features are detected only in the study samples and some only in the commercial serum sample. During batch correction, these features are **removed** from the dataset, as the **QC-based batch correction methods only correct features that were detected in both the QC sample and the real samples**. + This results in a loss of information (~20% in this dataset). ] .footnote[ [Dunn WB, et al. Nat Protoc. 2011 Jun 30;6(7):1060-83.](https://www.nature.com/articles/nprot.2011.335) ] ??? Ideally, QC samples are **theoretically identical** to real biological samples, with a sample matrix composition similar to that of the biological samples under study. Two types of QC sample are available: pooled QC and commercially available surrogate QC.
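To make the QA role of the QC injections concrete, here is a minimal, hypothetical R sketch (not the Progenesis or statTarget workflow) of estimating per-feature technical precision as the coefficient of variation (CV) across pooled-QC injections and flagging poorly repeatable features; the matrix and column indices are invented for illustration:

```r
# Hypothetical feature-by-injection intensity matrix; qc_cols marks pooled-QC injections
set.seed(1)
peaks   <- matrix(rlnorm(5 * 12, meanlog = 10), nrow = 5,
                  dimnames = list(paste0("F", 1:5), paste0("inj", 1:12)))
qc_cols <- c(1, 5, 9, 12)   # e.g., a pooled QC injected periodically through the run

# Coefficient of variation (%) of each feature across the QC injections
qc_cv <- apply(peaks[, qc_cols], 1, function(x) 100 * sd(x) / mean(x))

# Keep only features whose QC CV is below a chosen threshold (e.g., 20-30%)
peaks_kept <- peaks[qc_cv < 30, , drop = FALSE]
```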
--- # QC samples and Injection Order <img src="data:image/png;base64,#./img/injection_order_1.PNG" width="75%" style="display: block; margin: auto;" /> .footnote[ [Tripp, B.A., Dillon, S.T., Yuan, M. et al. Sci Rep 11, 1521 (2021)](https://doi.org/10.1038/s41598-020-80412-z) ] ??? As we mentioned before, the purpose of incorporating QC samples is to **provide a basis** to assess the quality of the data. To achieve that, they need to be **run periodically throughout the entire data acquisition** and must be **included within each batch**. --- # QC samples and Injection Order .pull-left[ <img src="data:image/png;base64,#./img/injection_order_2.PNG" width="100%" style="display: block; margin: auto;" /> .footnote[ [Dunn, W., Broadhurst, et al. Nat Protoc 6, 1060–1083 (2011)](https://doi.org/10.1038/nprot.2011.335) ] ] ??? A few tips: * Use the **same** pooled QC sample across batches throughout the whole acquisition phase. * In each batch, QC samples must be placed at the **beginning and at the end of each batch**, to correct the total signal drift of the batch. * Within each batch, a QC sample should be **run every 3-5 samples**, to correct the signal drift within the batch. * Don't leave your QC samples in the autosampler for days, because they will dry out and defeat the purpose of using them in the first place. -- .pull-right[ ### Tips * The **same** QC samples are **included within each batch** throughout the whole acquisition phase. * In each batch, QC samples must be placed at the **beginning and at the end of each batch**, to correct the total signal drift of the batch. * Within each batch, a QC sample should be **run every 3-5 samples**, to correct the signal drift within the batch. ### What is considered a batch? An extended period during which the instrument is not running; an instrument shutdown; an instrument re-calibration... ] --- background-image: url(data:image/png;base64,#./img/ms_omics_workflow_mod.png) background-size: 75% # MS Omics Workflow .footnote[ [Shao, Y., Le, W. Mol Neurodegeneration 14, 3 (2019)](https://doi.org/10.1186/s13024-018-0304-2) ] ??? The preparation of the pooled QC sample and the planning of the injection order should happen at the sample preparation step, before data acquisition. --- # Missing values .pull-left[ ## Potentially three main reasons * **True missing (true negatives)**: The feature is not present in the biological sample * **Missing not at random (MNAR)**: The feature is present in the sample, but the concentration is below the limit of detection (left-censored data) * **Missing completely at random (MCAR)/missing at random (MAR)**: The feature is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process ] .pull-right[ ![missing values](data:image/png;base64,#./img/missing_values.jpg) ] ??? Missing values in mass spec data can be **unavoidable and problematic** for the analysis, and they occur as a result of **both technical and biological reasons**: * True missing * MNAR: MNAR refers to left-censored data, where the distribution of abundance is truncated on the left side. For example, low-intensity peptides show a higher rate of missing values, as some of them fall below the limit of detection of the instrument. 
* MCAR/MAR --- # Missing values .pull-left[ ## Potentially three main reasons * True missing (true negatives): The feature is not present in the biological sample * Missing not at random (MNAR): The feature is present in the sample, but the concentration is below the limit of detection (left-censored data) * **Missing completely at random (MCAR)/missing at random (MAR)**: The feature is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process + A data matrix containing a high proportion of missing values (>25%) is not normal ] ??? The feature is present in the sample and above the limit of detection, but was not detected due to **random errors, technical limitations, or general stochastic fluctuations** during the data acquisition process. As a result, each missing value cannot be directly explained by the nature of the feature or its measured intensity. MCAR/MAR affects the entire dataset with a **uniform distribution**. As it will have a negative impact on statistical analysis, it is important to explore the peak intensity table and investigate the **percentage of your missing values**. -- .pull-right[ ### Check data preprocessing parameters for * Retention time misalignment * Incorrect grouping of mass-to-charge values * Inaccurate signal-to-noise threshold ] ??? A **very large percentage of missing values** is problematic for data analysis and may lead to erroneous conclusions. As such, it is recommended to explore the data table and investigate the percentage of missing values. If a data table contains a relatively high percentage of missing values (>30%), it is recommended to reevaluate the data collection and deconvolution parameters. **Make sure you are using a set of optimal parameters to process the spectra**. Non-optimal processing parameters may result in missing values due to retention time misalignment, incorrect grouping of m/z values, or an inaccurate signal-to-noise threshold. --- # Missing values .pull-left[ ## Potentially three main reasons * **True missing (true negatives)**: The feature is not present in the biological sample * **Missing not at random (MNAR)**: The feature is present in the sample, but the concentration is below the limit of detection (left-censored data) * Missing completely at random (MCAR)/missing at random (MAR): The feature is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process ] ??? The other two are legitimate reasons for missing values, even if everything was done perfectly. Under these circumstances, we can **filter out problematic samples or features that have a high percentage of missing values**. -- .pull-right[ ### Remove Sample * Affects statistical power * Tread lightly! Check previous steps. ### Assess feature presence/absence * A feature is deemed present if missing values account for less than 80%* of the sample space + Features deemed present: **missing value imputation** - KNN (k-nearest neighbor) - Small value replacement - Random forest + Features deemed absent: remove features ] ??? Removing a sample is a risky move, as it will reduce the statistical power of the tests. So tread lightly, unless **a sample contains missing values across most of the feature space or there is sufficient scientific rationale for its exclusion**. The sketch below illustrates how one can quantify missingness before making these decisions. 
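A purely illustrative R sketch (hypothetical `peaks` matrix, missing values coded as `NA`) of computing the percentage of missing values per sample and per feature:

```r
# Hypothetical feature-by-sample intensity matrix with NAs marking missing values
set.seed(2)
peaks <- matrix(rlnorm(6 * 8, meanlog = 9), nrow = 6,
                dimnames = list(paste0("F", 1:6), paste0("S", 1:8)))
peaks[sample(length(peaks), 10)] <- NA   # introduce some missing values at random

# Percentage of missing values per sample (columns) and per feature (rows)
missing_per_sample  <- colMeans(is.na(peaks)) * 100
missing_per_feature <- rowMeans(is.na(peaks)) * 100
round(missing_per_sample, 1)
round(missing_per_feature, 1)
```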
Removing problematic features has less impact on the integrity of the dataset, but we need to **make sure that the features being removed are “true missing”**. For this purpose, **a criterion should be established** before removing any feature. A popular criterion is the “modified 80% rule”, where **a feature is kept if it has a non-zero value in at least 80% of the samples in any one group**. This threshold value can be modified depending on the dataset. A feature whose percentage of non-zero values is **below the threshold** is removed from the dataset and excluded from all subsequent data analysis. For features deemed present, proceed to missing value imputation. The objective of missing value imputation is to **replace the missing values with non-zero values while maintaining the structure of the data**. --- # Missing values .pull-left[ ![KNN](data:image/png;base64,#./img/KNN.PNG) ] .pull-right[ ### Remove Sample * Affects statistical power * Tread lightly! Check previous steps. ### Assess feature presence/absence * A feature is deemed present if missing values account for less than 50%* of the sample space + Features deemed present: **missing value imputation** - **KNN (k-nearest neighbor)** - Small value replacement - Random forest + Features deemed absent: remove features ] ??? The k-nearest neighbor imputation method (KNN for short) uses a feature-specific approach: **the nearest K (e.g., five) neighbors of the feature with the missing value are identified using a distance metric (e.g., Euclidean distance)**. The missing value is then replaced with the average of the nearest non-missing values. The small value replacement method **imputes the same value for each of the missing values of a specific feature**. In this approach the missing value is usually replaced by **half of the minimum peak intensity of that feature**. This method would be expected to perform well when the missing values occur because the concentration of the feature is less than the limit of detection. The selection of missing value imputation methods can significantly affect downstream analysis and result interpretation, and no single method is ideal for every application. It is recommended that users choose imputation methods based on **the types of missing values that predominate in the dataset**. --- # QC-based Batch correction ### Use pooled QC to correct both between- and within-batch effects .pull-left[ * Signal correction * Batch concatenation ] .pull-right[ * [QC-RFSC](https://www.sciencedirect.com/science/article/pii/S0003267018309395) (`statTarget`) * [QC-SVR](https://pubs.rsc.org/en/content/articlelanding/2015/an/c5an01638j) * [QC-RLSC](https://www.nature.com/articles/nprot.2011.335) * [QC-RSC](https://link-springer-com.libproxy.unl.edu/article/10.1007%2Fs00216-013-6856-7) ] ### QA criteria After batch correction, each feature is required to pass a set of QA criteria. Any feature that does not pass the QA criteria is removed from the dataset and thus ignored in any subsequent data analysis. * Calculate the CV (coefficient of variation) for all features in the QC samples. * Remove all features that are not detected in at least **two-thirds** of the QC samples. * Remove all features with a CV (as calculated from the QC samples) of >30% (GC-MS) or >20% (LC-MS). **Code demo and tutorial:** [Batch Correction using `statTarget`](https://github.com/whats-in-the-box/tutorials_and_demos/blob/main/BatchCorrection.ipynb) ??? 
Previously we mentioned that data conditioning algorithms **use the QC responses as a basis** to correct signal drift between and within batches. Here are some examples of QC-based batch correction algorithms. In addition, there are also QC-free approaches for batch correction, the most popular of which is ComBat. After **signal correction and batch integration**, each detected feature is required to pass a set of QA criteria. Any peak that does not pass the QA criteria is removed from the dataset and thus ignored in any subsequent data analysis. These are mostly **peaks with poor repeatability** (detected in some samples but not others, or with a concentration so low that it approaches the limit of detection). The FDA recommended that **a measured response detected in two-thirds of QC samples should be within a coefficient of variation (CV) of 15%**, except for analytes approaching the limit of detection, in which case a CV of 20% is acceptable. For biomarkers, the FDA guidance allows up to 30% CV. --- # Data Processing .pull-left[ <img src="data:image/png;base64,#./img/data_example.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ### Understand the statistical nature of your data * Is your data normally distributed? * High-dimensional ] ??? Up to this point, we have shown that several processing steps have to be applied to **remove systematic bias from the data, while preserving biological information**. The next step in the workflow is to use statistical techniques to extract the relevant and meaningful information from the preprocessed data. However, before statistical analysis, we must make sure we understand the **statistical characteristics** of the data matrix, specifically the **distribution of the data and its high-dimensional nature**, what **statistical tests** are suitable for our data, and whether the data matrix **satisfies the assumptions** of those tests. --- # Data Processing .pull-left[ <img src="data:image/png;base64,#./img/data_example.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ### Understand the statistical nature of your data * Is your data **normally distributed**? * High-dimensional ### What statistical tests can we use * Parametric or non-parametric ### Whether your data satisfy the assumptions of statistical tests * Normal distribution (Gaussian) * Equal variances (homoscedasticity) * ... ] ??? Parametric statistics are based on the assumption that your sample follows a specific distribution, namely the distribution of the population from which the sample was taken. Nonparametric statistics are not based on such assumptions; that is, **the data can be collected from a sample that does not follow a specific distribution**. **Nonparametric tests are also called distribution-free tests because they don’t assume that your data follow a specific distribution**. This is why we have to check whether the data are normally distributed (Gaussian). If yes, we can use parametric tests; if not, we can only use non-parametric tests. If the data are not normally distributed, then the **goal of the data processing step is to transform the data so that they resemble a normal distribution as closely as possible**. <!-- # Data Processing --> <!-- **Goal of data processing**: to get to (near) **Gaussian** distribution so we can use parametric tests. --> <!-- [Nonparametric Tests vs. 
Parametric Tests](https://statisticsbyjim.com/hypothesis-testing/nonparametric-parametric-tests/) --> <!-- - Reasons to use a parametric test --> <!-- + **Parametric tests have greater statistical power**. If an effect actually exists, a parametric analysis is more likely to detect it. --> <!-- + Parametric tests can perform well with skewed and nonnormal distributions. --> <!-- + Parametric tests can perform well when the spread (dispersion) of each group is different. --> <!-- - Reasons to use a nonparametric test --> <!-- + Your sample size is small. --> <!-- + The median more accurately represents the center of your data (highly skewed data). --> <!-- ![highly skewed data](data:image/png;base64,#./img/highly_skewed.png) --> <!-- ??? --> <!-- Parametric tests offer a few advantages. --> <!-- There are also times when non-parametric tests are better suited. --> <!-- The answer is often contingent upon whether the **mean or median** is a better measure of central tendency for the distribution of your data. --> <!-- * If the mean is a better measure and you have a sufficiently large sample size, a parametric test usually is the better, more powerful choice. --> <!-- * If the median is a better measure, consider a nonparametric test regardless of your sample size. --> --- # Data Processing **Goal of data processing**: to get to (near) **Gaussian** distribution so we can use parametric tests. .pull-left[ ### Log Transformation * Data transformation applies a mathematical transformation to the individual values themselves, e.g., log or cube root transformation. * For mass spec data, log transformation is the better choice, as **log transformation reduces or removes the skewness** of mass spec data and improves linearity between the independent variable and the response variable. * Generalized logarithm transformation (glog) for datasets that contain negative values ] .pull-right[ <img src="data:image/png;base64,#./img/distribution_before_after.png" width="100%" style="display: block; margin: auto;" /> ] ??? After imputation and batch correction, the peak intensity data table typically **spans several orders of magnitude, and its distribution is highly skewed**. Log transformation reduces or removes the skewness of the original data, which makes it an ideal choice for mass spectrometric data transformation. More importantly, it improves linearity between the independent variable (i.e., treatment, group, disease) and the response (i.e., peaks) and boosts the validity of the statistical tests. In the event that negative values exist in the dataset after imputation, the generalized logarithm transformation (glog) is a good alternative. --- # Data Processing **Goal of data processing**: to get to (near) **Gaussian** distribution so we can use parametric tests. ### Normalization Normalization aims to **remove systematic biases** from the dataset and **make the samples more comparable and the subsequent statistical analysis more reliable**. 
* Normalization methods + Linear regression normalization: assumes that the bias in the data is linearly dependent on the peak intensity + Local regression normalization: assumes a nonlinear relationship between the bias in the data and the peak intensity + Global adjustment/scaling: assumes a similar distribution of intensities across samples and calculates a global scaling factor between samples by using a selected reference sample + **EigenMS normalization**: evaluates and removes the treatment group effect, then uses singular value decomposition (SVD) on the residual matrix to identify trends attributable to bias .footnote[ [Tommi Välikangas, Tomi Suomi, Laura L Elo, Briefings in Bioinformatics, Volume 19, Issue 1, January 2018, Pages 1–11](https://doi.org/10.1093/bib/bbw095) ] ??? Normalization is a data processing step that aims to **remove systematic biases** from the dataset (i.e., non-biological signals that are attributable to instrumental or technical aspects). Normalization aims to **make the samples more comparable and the subsequent statistical analysis more reliable**. There have been many reviews evaluating the performance of these methods, typically focused on their ability to decrease intragroup variation between technical and/or biological replicates. EigenMS provides several advantages. It is suitable for data containing high percentages of missing values. Users do not need to know the sources of bias to be able to remove them, and the method can be easily incorporated into an existing data analysis workflow without requiring any special upstream or downstream steps. Apart from the above methods, there are also ANOVA normalization, variance stabilization normalization, quantile normalization, median normalization, etc. --- # Data Processing **Goal of data processing**: to get to (near) **Gaussian** distribution so we can use parametric tests. ### Scaling .pull-left[ * Scaling aims to **standardize the range of feature data** (transforming your data so that it fits within a specific scale), thereby making individual features more comparable (as the ranges of the data values may vary widely). * Scaling methods + Auto Scaling: mean-centered and divided by the standard deviation of each variable + Pareto Scaling: mean-centered and divided by the square root of the standard deviation of each variable + Range Scaling: mean-centered and divided by the range of each variable ] .pull-right[ ![scaling](data:image/png;base64,#./img/scaling.PNG) .footnote[ [van den Berg RA, et al. BMC Genomics. 2006 Jun 8;7:142](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-7-142) ] ] ??? Simply put, we are transforming the **range (i.e., the spread)** of each feature by dividing by a factor, the scaling factor, which is different for each feature, so as to shift the focus toward the variation between samples rather than within them. **Scaling often results in the inflation of small values and, in turn, of the influence of measurement error.** Compared to the previous data processing steps, scaling **changes the structure of the data much more profoundly**, and therefore should be considered an optional step and exercised with greater caution. A few tips: * Log transformation is usually the first step in mass spec data processing. Check the histogram and QQ plot afterwards. * You can choose either normalization or scaling, or both. Again, judge by the histogram and QQ plot. * Scaling should not be the first step in data processing. A minimal sketch of these steps follows below. 
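A minimal R sketch of log2 transformation followed by auto and Pareto scaling, assuming a hypothetical feature-by-sample matrix `peaks` with no remaining zeros or missing values (the variable names and toy values are illustrative only, not part of any specific package workflow):

```r
# Hypothetical feature-by-sample intensity matrix (features in rows)
set.seed(3)
peaks <- matrix(rlnorm(5 * 6, meanlog = 10, sdlog = 1.5), nrow = 5,
                dimnames = list(paste0("F", 1:5), paste0("S", 1:6)))

# Log2 transformation: reduces the skewness of the raw intensities
log_peaks <- log2(peaks)

# Auto scaling: mean-center each feature and divide by its standard deviation
auto_scaled <- t(scale(t(log_peaks)))

# Pareto scaling: mean-center each feature and divide by the square root of its SD
pareto_scaled <- t(apply(log_peaks, 1, function(x) (x - mean(x)) / sqrt(sd(x))))

# Inspect the distribution after transformation (histogram and QQ plot)
hist(as.vector(log_peaks))
qqnorm(as.vector(log_peaks)); qqline(as.vector(log_peaks))
```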
--- class: inverse, center, middle # Next: Statistical analysis Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).