class: center, middle, inverse, title-slide .title[ # (Mass Spec-generated) Omics data analysis ] .author[ ###
Max Qiu, PhD
Bioinformatician/Computational Biologist
maxqiu@unl.edu
] .institute[ ###
Bioinformatics Research Core Facility, Center for Biotechnology
] .date[ ### 03-23-2023 ] --- background-image: url(data:image/png;base64,#./img/center.png) background-position: bottom # [Bioinformatics Core Research Facility](https://biotech.unl.edu/bioinformatics) .pull-left[ ## Who are we? * Part of Nebraska Center for Biotechnology, located at the Beadle Center E204 * Providing expertise and comprehensive analyses for large and small data sets across all NU campuses as well as external institutions/private industry * Fee-for-service business model * Supported by Nebraska Research Initiative and grant funding through collaborations ] .pull-right[ ![pathway](data:image/png;base64,#./img/pathway_enrichment.PNG) ] ??? Our mission is to provide faculty, staff and students with our expertise, which is large-scale, high-dimensional, high-throughput data analysis, bioinformatics and statistics in the field of life science. We operate on a fee-for-service business model and we do insist on authorship or acknowledgement for publication, depending on contribution; the fee is waived if the PI includes us in the grant application. --- background-image: url(data:image/png;base64,#./img/center.png) background-position: bottom # [Bioinformatics Core Research Facility](https://biotech.unl.edu/bioinformatics) .pull-left[ <img src="data:image/png;base64,#./img/genome_assembly.PNG" width="85%" style="display: block; margin: auto;" /> ] .pull-right[ ## Our skillset * Bulk and single-cell RNA-Seq data analysis * De novo or guided genome or transcriptome assemblies * Functional genomics (gene enrichment and pathway analysis) * Microbiome analysis via amplicon sequencing or shotgun metagenomics * Mass Spec-generated Omics analysis (proteomics, peptidomics, metabolomics, etc.) We also provide custom analyses, including the development of new analysis workflows or automated pipelines, setting up research databases and web portals, and integration of large-scale data sets. ] ??? We utilize the high-performance computing cluster from HCC to assist our clients with bioinformatics and computational needs, such as.... --- background-image: url(data:image/png;base64,#./img/center.png) background-position: bottom # [Bioinformatics Core Research Facility](https://biotech.unl.edu/bioinformatics) .pull-left[ ![BCRF_1](data:image/png;base64,#./img/bcrf_1.PNG) ] .pull-right[ ![BCRF_2](data:image/png;base64,#./img/bcrf_2.PNG) ] ??? If you scroll down on our home page, you'll find **appointments and consultation**, which will lead you to our consultation scheduler. **We usually do not charge students for simple statistics or data analysis consultations.** --- # Mass Spectrometry-generated Omics <img src="data:image/png;base64,#./img/ms_omics.png" width="80%" style="display: block; margin: auto;" /> .footnote[ [Gorrochategui et al, Trends in Analytical Chemistry, Volume 82, 2016, Pages 425-442](https://www.sciencedirect.com/science/article/pii/S0165993616300425) ] ??? This class focuses on mass spec-generated omics data analysis (from an analytical methodology point of view), i.e., proteomics and metabolomics. Mass spectrometry has been one of the most **powerful research and analytical tools since the inception of proteomics and metabolomics as a science**. **Highly sensitive, precise, rapid, and selective with a high dynamic range**, it has proven particularly effective and informative when combined with **separation techniques, or as part of a tandem experiment**. <!-- ESI-MS and MALDI-MS have the most significant opportunities for the study of DNA and RNA.
These methods can be used to determine the **molecular weights** of such macromolecules, their various complexes, and **oligonucleotides**, which is essential in studying their transformations. Tandem mass spectrometry is suitable for determining the **primary structure** (nucleotide sequence) of native molecules. However, next-generation sequencing (NGS), which generates a massive number of nucleotide sequences, is **more efficient and faster**. --> Peptidomics should be considered a branch of proteomics; it is **the study of all possible biological properties of low-weight endogenous peptides**. Lipidomics is usually considered a branch of metabolomics, focusing on lipids in the cell and their interaction with other metabolites, lipids, and proteins. --- background-image: url(data:image/png;base64,#https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13024-018-0304-2/MediaObjects/13024_2018_304_Fig1_HTML.png?as=webp) background-size: 75% # We bring in errors every step of the way .footnote[ [Shao, Y., Le, W. Mol Neurodegeneration 14, 3 (2019)](https://doi.org/10.1186/s13024-018-0304-2) ] ??? We bring in errors and bias every step of the way. We will discuss what to look out for in each of these steps. Let's pause for a minute and think about what could possibly go wrong in each step and what variance we could possibly bring to the data. --- # (How to reduce) Sources of variance .pull-left[ ### [Experimental design](https://shibalytics.com/courses/cbio842/MQ2_Exp_Design.html) - Confounding factors - Control, randomize, replicate, block - Sample size `\(n\)` and **statistical power** ### [Sample Collection and Preparation](https://shibalytics.com/courses/cbio842/MQ2_Exp_Design.html) - Robust SOP - Sampling bias - Technical replicates vs biological replicates - Pseudo-replication - **Quality Control: pooled QC and injection order** ] .pull-right[ ### [Data Acquisition (instrumentation and data deconvolution)](https://shibalytics.com/courses/cbio842/MQ3_Acquisition.html) - Instruments: LC-MS, LC-MS/MS - Deconvolution: sample alignment and peak picking ] ??? The study question needs to be clearly defined ahead of time, because it will influence the experimental design and everything that follows. In experimental design, we need to **control all sources of variation except the one** we are interested in studying. We can control and remove confounding factors by **setting up a control group, random sampling and assignment, use of biological and technical replicates, and blocking**. One of the most important things in the experimental design stage is to decide the sample number per group to ensure an **appropriate number of biological replicates** in each biological group to **confidently** answer the question in a **statistically robust manner**. Here, **statistically robust manner** refers to the issue of **statistical power**, a very important issue in the experimental design phase that researchers most frequently neglect. In sample collection and preparation, **experimental procedures should be standardized and optimized** and different groups should be treated in the same way. However, there are a few things to be cautious about here (sampling bias, the difference between technical replicates and biological replicates, and what pseudo-replication is). We will discuss types of QC samples, the preparation of pooled QC, and planning of the injection order (layout). In data acquisition, know your instrument, test and use optimal parameters for deconvolution (peak picking and sample alignment).
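As a rough illustration of the power question raised above, base R's `power.t.test()` turns an assumed effect size and variance into a per-group sample size for a simple two-group comparison. This is only a minimal sketch; the effect size and standard deviation below are placeholder values, not recommendations, and should be replaced with estimates from pilot data or the literature.

```r
# Minimal sketch: per-group sample size for a two-sample comparison.
# delta and sd are made-up placeholders; substitute estimates from
# pilot data or the literature for the feature(s) you care about.
power.t.test(delta = 1.0,        # expected between-group difference (log scale)
             sd = 1.2,           # expected within-group standard deviation
             sig.level = 0.05,   # alpha
             power = 0.8,        # desired statistical power
             type = "two.sample")
# The reported n is the required number of biological replicates per group
# (round up); for omics panels, pair this with multiple testing correction.
```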
--- class: inverse, middle # Experimental workflow is an upstream-to-downstream process. ??? Experimental workflow is an upstream-to-downstream process. To be specific, steps that come first have more impact on the final results than steps that come later. More importantly, don't expect later steps to compensate for what was done (or not done) before. **By sample collection and preparation, 50% of the result is fixed. After data acquisition, 80% of the result is done. No amount of statistical analysis, model building, or machine learning can fix bad data. The quality of the analysis depends on the quality of the data.** Think hard early on, and consult with a bioinformatician and a statistician early on. -- ## What comes first has more impact than what comes later; -- ## Don't expect later steps to compensate for what was done wrong before. --- .pull-left[ # Data processing ### (Quality control) (*Types of QC sample and Injection order*) ### Assess the quality of data: Missing values * Assess feature presence/absence * Missing value imputation ### QC-based batch correction ### Data transformation * Log Transformation * Normalization * Scaling ] .pull-right[ # Statistical analysis ### Univariate - feature selection * Comparisons with multiple testing correction * Ratios (fold-change) and volcano plot ### Multivariate - visualization and feature selection/extraction * Principal component analysis (PCA) * Partial least squares-discriminant analysis (PLS-DA) ] --- # Quality control .pull-left[ ### Why? Reproducibility * Sample interacts directly with the instrument * Changes in measured feature response over time ### Purpose of using QC samples * To **'condition' or equilibrate** the analytical platform * To provide data to calculate technical precision within each analytical block: **within-batch effect** * To provide data to use for signal correction between analytical blocks: **between-batch effect** ] ??? The first important thing to keep in mind is quality control and the use of QC samples. We are concerned about the **issue of reproducibility**. Not all samples can be run in a single analytical batch, because of issues ranging from instrument **medium- to long-term reproducibility** to necessary preventive **maintenance**. But the issue of reproducibility is not only time-dependent. It is also **instrument dependent**. In any chromatography-MS system, the **sample unavoidably interacts directly with the instrument**, and this leads to **changes in measured feature response over time**, both in terms of chromatography and MS. (**Signal attenuation**) The **degree and timing of signal attenuation are not consistent across all measured features** (and they are also dependent on the type of biofluid measured). For this reason, it is a necessary requirement that QC samples be periodically analyzed throughout an analytical run in order to provide robust quality assurance (QA) for each feature detected. | -- .pull-right[ ### QC-based batch correction * Use the QC responses as the base to assess the quality of the data * Remove peaks with poor repeatability * Correct the signal attenuation ] .footnote[ [Dunn WB, et al. Nat Protoc. 2011 Jun 30;6(7):1060-83.](https://www.nature.com/articles/nprot.2011.335) ] ???
In the data processing stage, QC-based batch correction algorithms (we will mention a few) can **use the QC responses as the basis** to assess the quality of the data, **remove peaks with poor repeatability, correct the signal attenuation and concatenate batch data together** after data acquisition and before statistical analysis. --- # QC sample ### Types of QC samples .pull-left[ * Pooled QC (**Gold standard**) + **Small aliquots of each biological sample to be studied are pooled and thoroughly mixed** + Closest to the composition of the biological samples + Suited to small, focused studies, e.g., small clinical trials or animal studies of hundreds of samples + Not always applicable in large-scale studies involving thousands of samples ] .pull-right[ * Surrogate QC + Commercially available biofluids composed of multiple biological samples **not present** in the study + Some features are only detected in the samples and some are only detected in the commercial serum sample. During batch correction, these features are **removed** from the dataset, as the **QC-based batch correction methods only correct for features that were detected in both QC sample and real samples**. + This results in a loss (~20%) of information in the data set. ] .footnote[ [Dunn WB, et al. Nat Protoc. 2011 Jun 30;6(7):1060-83.](https://www.nature.com/articles/nprot.2011.335) ] ??? Ideally, QC samples are **theoretically identical** to real biological samples, with a sample matrix composition similar to that of the biological samples under study. Two types of QC samples are available: pooled QC and commercially available surrogate QC. --- # QC samples and Injection order <img src="data:image/png;base64,#./img/injection_order_1.PNG" width="75%" style="display: block; margin: auto;" /> .footnote[ [Tripp, B.A., Dillon, S.T., Yuan, M. et al. Sci Rep 11, 1521 (2021)](https://doi.org/10.1038/s41598-020-80412-z) ] ??? As we mentioned before, the purpose of incorporating QC samples is to **provide a base** to assess the quality of the data. To achieve that, they will need to be **run periodically through the entire data acquisition** and must be **included within each batch**. --- # QC samples and Injection order .pull-left[ <img src="data:image/png;base64,#./img/injection_order_2.PNG" width="100%" style="display: block; margin: auto;" /> .footnote[ [Dunn, W., Broadhurst, et al. Nat Protoc 6, 1060–1083 (2011)](https://doi.org/10.1038/nprot.2011.335) ] ] ??? A few tips: * Use the **same** pooled QC sample across batches throughout the whole acquisition phase. * In each batch, QC samples must be at the **beginning and at the end of each batch**, to correct the total signal drift of one batch. * Within each batch, a QC sample should be **run every 3-5 samples**, to correct the signal drift within a batch. * Don't leave your QC samples in the autosampler for days, because they will dry out and defeat the purpose of using them in the first place. -- .pull-right[ ### Tips * The **same** QC samples are **included within each batch** throughout the whole acquisition phase. * In each batch, QC samples must be at the **beginning and at the end of each batch**, to correct the total signal drift of one batch. * Within each batch, a QC sample should be **run every 3-5 samples**, to correct the signal drift within a batch. ### What is considered a batch? An extended period during which the instrument is not running; instrument shutdown; instrument re-calibration...
] --- background-image: url(data:image/png;base64,#./img/ms_omics_workflow_mod.png) background-size: 75% # MS Omics Workflow .footnote[ [Shao, Y., Le, W. Mol Neurodegeneration 14, 3 (2019)](https://doi.org/10.1186/s13024-018-0304-2) ] ??? The preparation of the pooled QC sample and the planning of the injection order should already have happened in the sample preparation step, before data acquisition. --- # Example data after data acquisition and deconvolution ![peak intensity table](data:image/png;base64,#./img/peak_intensity_table.PNG) * Samples = Observations (in this case, rows) * Peaks, metabolites or peptides... = Features (Dimensions, in this case, columns) --- # Missing values .pull-left[ ## Potentially three main reasons * **True missing (true negatives)**: The feature is not present in the biological sample * **Missing not at random (MNAR)**: The feature is present in the sample, but the concentration is below the limit of detection (left-censored data) * **Missing completely at random (MCAR)/missing at random (MAR)**: The feature is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process ] .pull-right[ ![missing values](data:image/png;base64,#./img/missing_values.jpg) ] ??? Missing values in mass spec data can be **unavoidable and problematic** in the analysis and they occur as a result of **both technical and biological reasons**. Generally, missing values can result from both biological and technical reasons: * True missing * MNAR: MNAR refers to left-censored data, where the distribution of abundance is truncated on the left side. For example, low-intensity peptides record a higher rate of missing values, as some of them fall below the limit of detection of the instrument. * MCAR/MAR --- # Missing values .pull-left[ ## Potentially three main reasons * True missing (true negatives): The feature is not present in the biological sample * Missing not at random (MNAR): The feature is present in the sample, but the concentration is below the limit of detection (left-censored data) * **Missing completely at random (MCAR)/missing at random (MAR)**: The feature is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process + A data matrix containing a high number of missing values (>25%) is not normal ] ??? The feature is present in the sample and above the limit of detection but was not detected due to **random errors, technical limitations, or general stochastic fluctuations** during the data acquisition process. As a result, each missing value cannot be directly explained by the nature of the feature or its measured intensity. MCAR/MAR affects the entire dataset with a **uniform distribution**. As it will have a negative impact on statistical analysis, it's important that you explore the peak intensity table and investigate the **percentage of your missing values**. | -- .pull-right[ ### Check data preprocessing parameters for * Retention time misalignment * Incorrect grouping of mass-to-charge values * Inaccurate signal-to-noise threshold ] ??? A **very large percentage of missing values** is problematic for data analysis and may lead to erroneous conclusions. As such, it is recommended to explore the data table and investigate the percentage of missing values. If a data table contains a relatively high percentage of missing values (>30%), it is recommended to reevaluate data collection and deconvolution parameters.
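A quick way to eyeball this in R is sketched below, assuming a hypothetical peak intensity matrix `peaks` with samples in rows and features in columns; the object name and the NA encoding are assumptions, not part of the original workflow.

```r
# Minimal sketch: quantify missing values in a peak intensity table `peaks`
# (rows = samples, columns = features). Missing entries are assumed to be NA;
# use `peaks == 0` instead if zeros encode missingness in your export.
feature_missing <- colMeans(is.na(peaks)) * 100   # % missing per feature
sample_missing  <- rowMeans(is.na(peaks)) * 100   # % missing per sample

summary(feature_missing)
hist(feature_missing, breaks = 20,
     main = "Missingness per feature", xlab = "% missing")
```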
**Make sure you are using a set of optimal parameters to process the spectra**. Non-optimal processing parameters may result in missing values due to retention time misalignment, incorrect grouping of m/z values, or applying an inaccurate signal-to-noise threshold. --- # Missing values .pull-left[ ## Potentially three main reasons * **True missing (true negatives)**: The feature is not present in the biological sample * **Missing not at random (MNAR)**: The feature is present in the sample, but the concentration is below the limit of detection (left-censored data) * Missing completely at random (MCAR)/missing at random (MAR): The feature is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process ] ??? The other two are legitimate reasons behind missing values, even if you did everything perfectly. Under these circumstances, we can **filter to remove problematic samples or features that have a high percentage of missing values**. -- .pull-right[ ### Remove Sample * Affects statistical power * Tread lightly! Check previous steps. ### Assess feature presence/absence * A feature is deemed present if missing values account for less than 80%* of the sample space + Features deemed present: **missing value imputation** - KNN (k-nearest neighbor) - Small value replacement - Random forest + Features deemed absent: remove features ] ??? Removing a sample is a risky move, as it will reduce the power of the test in statistical analysis. So tread lightly, unless **a sample contains missing values in most of the feature space or there is sufficient scientific rationale for its exclusion**. Removing problematic features has less impact on the integrity of the dataset, but we need to **make sure that the features being removed are "true missing"**. For this purpose, **a criterion should be established** before removing any feature. A popular criterion is the "modified 80% rule", where **a feature will be kept if it has a non-zero value in at least 80% of the samples in any one group**. This threshold value can be modified depending on the sample dataset. A feature that has a percentage of non-zero values **below the threshold** value will be removed from the dataset and excluded from all subsequent data analysis. For features that are deemed present, proceed to missing value imputation. The objective of missing value imputation is to **replace the missing values with non-zero values while maintaining the structure of the data**. --- # Missing values .pull-left[ ![KNN](data:image/png;base64,#./img/KNN.PNG) ] .pull-right[ ### Remove Sample * Affects statistical power * Tread lightly! Check previous steps. ### Assess feature presence/absence * A feature is deemed present if missing values account for less than 50%* of the sample space + Features deemed present: **missing value imputation** - **KNN (k-nearest neighbor)** - Small value replacement - Random forest + Features deemed absent: remove features ] ??? The K-nearest neighbor imputation method (or KNN for short) uses a feature-specific approach: **the nearest K (e.g., five) neighbors of the missing feature are identified using some distance metric (e.g., Euclidean distance)**. The missing value is replaced with an average of the nearest non-missing values. The small value replacement method **imputes the same value for each of the missing values for a specific feature**.
In this approach, the missing value is usually replaced by a value **half of the minimum peak intensity of that feature**. This method would be expected to perform well when the missing values occur because the concentration of the feature is less than the limit of detection. The selection of missing value imputation methods can significantly affect downstream analysis and result interpretation, and there is no one method that is ideal for every application. It is recommended that users choose imputation methods based on **the types of missing values that are predominantly present in the dataset**. --- # QC-based batch correction ### Use pooled QC to correct both between- and within-batch effects * [QC-RFSC](https://www.sciencedirect.com/science/article/pii/S0003267018309395) (`statTarget`) * [QC-SVR](https://pubs.rsc.org/en/content/articlelanding/2015/an/c5an01638j) * [QC-RLSC](https://www.nature.com/articles/nprot.2011.335) * [QC-RSC](https://link-springer-com.libproxy.unl.edu/article/10.1007%2Fs00216-013-6856-7) ### QA criteria After batch correction, each feature is required to pass a set of QA criteria. Any feature that does not pass the QA criteria is removed from the data set and, thus, ignored in any subsequent data analysis. * Calculate the CV (coefficient of variation) for all features in the QC samples. * Remove all features that are not detected in at least **two-thirds** of QC samples. * Remove all features with a CV (as calculated for the QC samples) of >30% (in GC-MS) or >20% (in LC-MS). **Code demo and tutorial:** [Batch Correction using `statTarget`](https://github.com/whats-in-the-box/tutorials_and_demos/blob/main/BatchCorrection.ipynb) ??? Previously we mentioned that data conditioning algorithms will **use the QC responses as a base** to correct signal drifts between and within a batch. Here are some examples of QC-based batch correction algorithms. In addition, there are also QC-free approaches for batch correction, the most popular one among which is ComBat. After **signal correction and batch integration**, each detected feature is required to pass a set of QA criteria. Any peak that does not pass the QA criteria is removed from the data set and thus ignored in any subsequent data analysis. These are mostly **peaks with poor repeatability** (detected in some samples but not others, or the concentration is so low that it tests the limit of detection). The FDA recommended that **a measured response detected in two-thirds of QC samples should be within a coefficient of variation (CV) of 15%**, except for analytes approaching the limit of detection, in which case a CV of 20% is acceptable. For biomarkers, the FDA guidance allows up to 30% CV. --- # Special considerations .pull-left[ <img src="data:image/png;base64,#./img/data_example.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ### Understand the statistical nature of your data * Is your data normally distributed? * High-dimensional ] ??? The next step in the workflow is to use statistical techniques to extract the relevant and meaningful information from the imputed and batch-corrected data. However, before statistical analysis, we must make sure we understand the **statistical characteristics** of the data matrix, specifically the **distribution of data and its high-dimensional nature**, what **statistical tests** are suitable for our data, and whether the data matrix **satisfies the assumptions** of those tests.
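As a small, hedged illustration of that check, the sketch below looks at the distribution of one feature from a hypothetical processed intensity matrix `peaks`; the object name and the column index are placeholders.

```r
# Minimal sketch: check whether a (hypothetical) feature looks normally
# distributed before choosing between parametric and non-parametric tests.
x <- peaks[, 1]                      # first feature; substitute any column

par(mfrow = c(1, 2))
hist(x, main = "Histogram", xlab = "Intensity")
qqnorm(x); qqline(x)                 # points hugging the line suggest normality

shapiro.test(x)                      # formal normality test (best for small n)
```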
--- # Special considerations-1 .pull-left[ ### Understand the statistical nature of your data * Is your data **normally distributed**? * High-dimensional ] .pull-right[ ### What statistical tests can we use * Parametric or non-parametric ### Whether your data satisfy the assumptions of statistical tests * Normal distribution (Gaussian) * Equal variances (homoscedasticity) * ... ] ??? Parametric statistics are based on the assumption that your sample follows a specific distribution, namely the distribution of the population from which the sample was taken. Nonparametric statistics are not based on such assumptions; that is, **the data can be collected from a sample that does not follow a specific distribution**. **Nonparametric tests are also called distribution-free tests because they don't assume that your data follow a specific distribution**. This is why we have to check whether the data is normally distributed (Gaussian distribution). If yes, then we can use parametric tests; if not, we can only use non-parametric tests. If the data is not normally distributed, then the **goal of the data processing step is to transform the data so that it resembles a normal distribution as closely as possible**. --- # Data transformation **Goal of data processing**: to get to a (near) **Gaussian** distribution so we can use parametric tests. .pull-left[ ### Log Transformation * Data transformation applies a mathematical transformation to individual values themselves, e.g., log or cube root transformation. * For mass spec data, log transformation is the better choice, as **log transformation reduces or removes the skewness** of mass spec data and improves linearity between the independent variable and the response variable. * Generalized logarithm transformation (glog) for datasets that contain negative values ] .pull-right[ <img src="data:image/png;base64,#./img/distribution_before_after.png" width="100%" style="display: block; margin: auto;" /> .footnote[ [Abid M.S., Qiu H, et al., Sci Rep 12, 8289 (2022)](https://doi.org/10.1038/s41598-022-12197-2) ] ] ??? After imputation and batch correction, the peak intensity data table typically **spans several orders of magnitude, and its distribution is highly skewed**. Log transformation reduces or removes the skewness of the original data, which makes it an ideal choice for mass spectrometric data transformation. More importantly, it improves linearity between the independent variable (i.e., treatment, group, disease) and the response (i.e., peaks) and boosts the validity of the statistical tests. In the event that negative values exist in the dataset after imputation, generalized logarithm transformation (glog) is a good alternative. --- # Data transformation **Goal of data processing**: to get to a (near) **Gaussian** distribution so we can use parametric tests. ### Normalization Normalization aims to **remove systematic biases** from the dataset and **make the samples more comparable and the subsequent statistical analysis more reliable**.
* Normalization methods + Linear regression normalization: assumes bias in the data is linearly dependent on the peak intensity + Local regression normalization: assumes a nonlinear relationship between bias in the data and the peak intensity + Global adjustment/scaling: assumes a similar distribution of intensities across samples and calculates a global scaling factor between samples by using a selected reference sample + **EigenMS normalization**: evaluates and removes the treatment group effect, then uses singular value decomposition (SVD) on the residual matrix to identify trends attributable to bias .footnote[ [Tommi Välikangas, Tomi Suomi, Laura L Elo, Briefings in Bioinformatics, Volume 19, Issue 1, January 2018, Pages 1–11](https://doi.org/10.1093/bib/bbw095) ] ??? Normalization is a data processing step that aims to **remove systematic biases** from the dataset (i.e., non-biological signals that are attributable to instrumental or technical aspects). Normalization aims to **make the samples more comparable and the subsequent statistical analysis more reliable**. There have been many reviews evaluating the performance of these methods, typically focused on their ability to decrease intragroup variation between technical and/or biological replicates. EigenMS provides several advantages. It is suitable for data containing high percentages of missing values. Users do not need to know the sources of bias to be able to remove them, and the method can be easily incorporated into an existing data analysis workflow without requiring any special upstream or downstream steps. Apart from the above methods, there are also ANOVA normalization, variance stabilization normalization, quantile normalization, median normalization, etc. --- # Data transformation **Goal of data processing**: to get to a (near) **Gaussian** distribution so we can use parametric tests. ### Scaling .pull-left[ * Scaling aims to **standardize the range of feature data** (transforming your data so that it fits within a specific scale), thereby making individual features more comparable (as the range of values may vary widely). * Scaling methods + Auto Scaling: mean-centered and divided by the standard deviation of each variable + Pareto Scaling: mean-centered and divided by the square root of the standard deviation of each variable + Range Scaling: mean-centered and divided by the range of each variable ] .pull-right[ ![scaling](data:image/png;base64,#./img/scaling.PNG) .footnote[ [van den Berg RA, et al. BMC Genomics. 2006 Jun 8;7:142](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-7-142) ] ] ??? Simply put, we are transforming the **range (i.e., the spread)** of each feature by dividing by a factor, the scaling factor, which is different for each feature, so as to shift the focus towards the variation between samples and not within. **Scaling often results in the inflation of small values and, in turn, of the influence of measurement error.** Compared to previous data processing steps, scaling **changes the structure of the data much more profoundly**, and therefore should be considered an optional step and exercised with greater caution. A few tips: * Log transformation is usually the first step in mass spec data processing. Check the histogram and qq plot afterwards. * You can choose either normalization or scaling, or both, still judging by the histogram and qq plot. * Scaling should not be the first step in data processing.
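A minimal sketch of these transformations in base R follows, assuming a processed intensity matrix `peaks` (samples in rows, features in columns); the +1 offset and object names are illustrative assumptions rather than a fixed recipe.

```r
# Minimal sketch: log-transform, then (optionally) scale a peak intensity
# matrix `peaks` (rows = samples, columns = features).
logged <- log2(peaks + 1)       # +1 offset avoids log(0); consider glog if
                                # negative values remain after imputation

# Auto scaling: mean-center each feature and divide by its standard deviation
auto_scaled <- scale(logged, center = TRUE, scale = TRUE)

# Pareto scaling: mean-center and divide by the square root of the SD
pareto_scaled <- scale(logged, center = TRUE,
                       scale = sqrt(apply(logged, 2, sd)))

# Re-check the distribution after each step
hist(logged[, 1]); qqnorm(logged[, 1]); qqline(logged[, 1])
```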
--- # Special considerations-2 .pull-left[ ### Understand the statistical nature of your data * Is your data normally distributed? * **High-dimensional** ] .pull-right[ ### High-dimensional data * n >> m: many more features than samples (wide data table) * Traditional statistical analysis assumes a long data table (m >> n) ### Considerations for high-dimensionality * Univariate analysis: **multiple testing correction** * Multivariate analysis: **dimension reduction** ] ??? The next step in the workflow is to use statistical techniques to extract the relevant and meaningful information from the processed data. Two types of statistical analysis: univariate and multivariate. Last lecture, we discussed the **distribution of the data**, and how it will affect the type of statistical test applicable to the dataset. This time our focus will be on the **high-dimensional nature of the data**, and its impact on statistics, specifically **multiple testing correction and dimension reduction**. We will walk through these as we talk about the statistical analysis. --- # Univariate analysis .pull-left[ * Comparisons + compare numeric features between groups + check distribution with histogram and qq plot * Multiple testing correction + control false discovery rate (FDR) with BH correction * Ratios (Fold change) + degree of quantity change between two groups + Volcano plot: -log10(p-value) ~ log2(FC) ] ??? Univariate statistics refer to all statistical analyses that include **a single dependent variable** and can include one or more independent variables. Univariate analysis mainly includes contrast analysis (i.e., pairwise comparison) and omnibus tests (i.e., multi-group comparison). **Typically, the primary goal is to identify features that differ between groups.** --- # Univariate analysis: differential analysis .pull-left[ * **Comparisons** + **compare numeric features between groups** + **check distribution with histogram and qq plot** * Multiple testing correction + control false discovery rate (FDR) with BH correction * Ratios (Fold change) + degree of quantity change between two groups + Volcano plot: -log10(p-value) ~ log2(FC) ] ??? The goal of univariate analysis is to **differentiate**: to identify the features that are significantly changing between classes of biological samples. The type of univariate analysis, parametric or non-parametric, depends on the distribution of the data. Therefore, always check your distribution with a histogram and qq plot. | -- .pull-right[ ## Data distribution and normality * If the data distribution is near normal, **parametric** methods can be applied + **T-test**: assumes normal distribution and equal variance; **Welch's t-test** over Student's t-test + **ANOVA**: assumes normal distribution and equal variance * If the data distribution is not normal and the central limit theorem (CLT) is not satisfied, only **non-parametric** methods can be used for hypothesis testing + **Wilcoxon rank sum test** + **Kruskal-Wallis ANOVA test** ] ??? After the data processing steps above, if the data distribution is approximately normal, parametric statistical analysis should be applied (i.e., using a t-test for a contrast analysis and/or ANOVA for an omnibus test). In the case of data with a non-normal distribution, nonparametric statistical analysis should be considered: for example, the Wilcoxon rank-sum test (also called Mann–Whitney U test or Wilcoxon–Mann–Whitney test) and/or Kruskal-Wallis one-way ANOVA for a contrast analysis and an omnibus test, respectively.
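A minimal sketch of these tests in base R, where `x` is one feature vector and `group` a two-level factor (both placeholder names):

```r
# Minimal sketch: test one feature between two groups.
t.test(x ~ group)                    # Welch's t-test (R's default, var.equal = FALSE)
t.test(x ~ group, var.equal = TRUE)  # Student's t-test, for comparison
wilcox.test(x ~ group)               # non-parametric Wilcoxon rank-sum test

# Applied to every feature of a matrix `peaks` (columns = features):
pvals <- apply(peaks, 2, function(v) t.test(v ~ group)$p.value)
```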
We should use **Welch's t-test** by default, instead of Student's t-test, because Welch's t-test performs better than Student's t-test whenever **sample sizes and variances are unequal between groups**, and gives the same result when sample sizes and variances are equal. --- # Univariate analysis .pull-left[ * Comparisons + compare numeric features between groups + check distribution with histogram and qq plot * **Multiple testing correction** + **control false discovery rate (FDR) with BH correction** * Ratios (Fold change) + degree of quantity change between two groups + Volcano plot: -log10(p-value) ~ log2(FC) ] .pull-right[ ## Multiple testing correction * Why? **Multiple simultaneous statistical tests** increase the number of false positives in the results. * Familywise error rate (FWER) vs. False discovery rate (FDR) .footnote[ [FWER vs. FDR](https://egap.org/resource/10-things-to-know-about-multiple-comparisons/) ] ] ??? Univariate analysis focuses on the analysis of a single dependent variable (metabolite/peptide). However, in high-dimensional datasets generated from LC-MS, there are thousands of features being tested all at once. **Performing multiple statistical tests simultaneously will increase the number of false positives in the result dramatically.** If multiple hypothesis tests are run at the same time, **the number of false positives will inflate with the number of tests**. For example, if one test is performed at the 5% level and the corresponding null hypothesis is true, there is only a **5% chance of incorrectly rejecting the null hypothesis**. However, if 100 tests are each conducted at the 5% level and all corresponding null hypotheses are true, **the expected number of incorrect rejections** (also known as false positives or Type I errors) is 5. The truth is that **when you run multiple simultaneous statistical tests, a fraction will always be false positives.** But there are ways we can **decrease the number of false positives.** --- # Multiple testing correction (cont.) .pull-left[ * Controlling the familywise error rate (**FWER**): **Bonferroni correction** + If a significance threshold of `\(α\)` is used (**family-wise error rate**), but `\(n\)` separate tests are performed, then the Bonferroni adjustment deems a feature significant only if the corresponding P-value is `\(≤ α/n\)`. + **Too strict.** ] .pull-right[ ![BH-correction](data:image/png;base64,#./img/BH_correction.png) ] * Controlling the false discovery rate (**FDR**): **Benjamini–Hochberg procedure** + First, rank the p-values in ascending order and assign each a rank; + Set the significance threshold of `\(α\)` (FDR) you are willing to accept. + Calculate each individual p-value's Benjamini-Hochberg critical value using the formula `\((i/m)Q\)`; - i = the individual p-value's rank - m = total number of tests - Q = the false discovery rate (`\(α\)`, chosen by you) + Compare each original p-value against its Benjamini-Hochberg critical value; find the largest p-value that is smaller than its BH critical value. That p-value and all p-values with a smaller rank are declared significant. ??? **Adjusting the p-value threshold for significance is one of the main approaches for addressing the multiple testing problem**. There are two types of p-value adjustment: either controlling the Family-Wise Error Rate (FWER) using the Bonferroni-style correction (including the Holm correction) or controlling the False Discovery Rate (FDR) using the Benjamini-Hochberg procedure. In many cases, Bonferroni is too strict.
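A minimal sketch of both adjustments in base R, assuming a placeholder vector `pvals` with one p-value per feature:

```r
# Minimal sketch: adjust a vector of per-feature p-values `pvals`.
p_bonf <- p.adjust(pvals, method = "bonferroni")  # controls FWER (strict)
p_bh   <- p.adjust(pvals, method = "BH")          # controls FDR (Benjamini-Hochberg)

sum(p_bh < 0.05)        # features retained at an FDR of 5%

# The BH critical values from the slide, (i/m)*Q with Q = 0.05:
m    <- length(pvals)
crit <- (rank(pvals) / m) * 0.05
```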
Bonferroni "penalize" all input p-values equally, whereas Benjamini-Hochberg (as a way to control the FDR) "punishes" p-values accordingly to their ranking. --- # Univariant analysis .pull-left[ * Comparisons + compare numeric features between groups + check distribution with histogram and qq plot * Multiple testing corrections + control false discovery rate (FDR) with BH correction * **Ratios (Fold change)** + **degree of quantity change between two groups** + **Volcano plot: log2(FC) ~ log10(p-values)** ] .pull-right[ <img src="data:image/png;base64,#./img/volcano_plot.png" width="100%" style="display: block; margin: auto;" /> ] ??? A volcano plot is a type of scatterplot that shows **statistical significance (P value)** versus **magnitude of change (fold change)**. It enables quick visual identification of genes with large fold changes that are also statistically significant. In a volcano plot, the most upregulated genes are towards the right, the most downregulated genes are towards the left, and the most statistically significant genes are towards the top. --- # Multivariate analysis * Purpose of multivariate analysis (through dimension reduction) + **Visualization** and **feature selection/extraction** * Motivating question + How do we visualize high-dimensional data? + Can we find a **small number of features** that accurately capture the **relevant properties** of the data? * Gist: Project the data from original high-dimensional space into a "smaller" low-dimensional subspace + **Goal: to discover the dimensions that matter the most** * Two main methods for reducing dimensionality + Feature extraction: PCA (unsupervised method): finding a **new** set of `\(k\)` dimensions that are **combinations of the original** `\(d\)` dimensions + Feature selection: PLS-DA (supervised method): finding `\(k\)` of the `\(d\)` dimensions that give us the most information and we discard the other `\((d-k)\)` dimensions ??? Multivariate analyses are statistical procedures designed for analyzing **data involving more than one type of measurement or observation**. Multivariate analysis techniques **look at multiple variables simultaneously, assuming that information resides in the joint distribution**. The purpose of multivariate analysis are two-fold: **visualization and feature selection/extraction**, through dimension reduction, or subspace estimation (in machine learning term). The motivating question we are asking here is "". (For example, we can probably describe the trajectory of a tennis ball by its velocity, diameter, and mass.) Similarly, can we find just a few features in your high dimensional omics data that can capture the essence of this dataset? How? (gist) The two main methods to use are PCA and PLS-DA, that can help us achieve our goal, which is **to discover the dimensions that matter the most, or to help us see the dominant trend in the data**. In feature extraction, we are interested in finding a **new** set of `\(k\)` dimensions that are **combinations of the original** `\(d\)` dimensions. In feature selection, we are interested in finding `\(k\)` of the `\(d\)` dimensions that give us the most information and we discard the other `\((d-k)\)` dimensions. --- # Principle component analysis (PCA) .pull-left[ ![2D to 1D](data:image/png;base64,#./img/pca_1.PNG) ] .pull-right[ * Goal: We want to find a feature that can explain most of the variance of the data. * Problem: `\(x1\)` and `\(x2\)` are correlated, i.e., non-zero covariance. 
**Both features contribute to the variance of the data.** * This is **Feature extraction**. Instead of choosing between `\(x1\)` and `\(x2\)` (existing features), we create a **new** feature that can explain most of the variance. * Solution: De-correlation + Eigen-decomposition + Singular value decomposition ] ??? Principal component analysis (PCA) is the most versatile and prevalent dimension reduction method. **It projects the original features onto a new feature space and creates a set of new features** (principal components, PCs), each of which is a linear combination of the original features. The method reduces the dimension of the data by keeping only the top new features that capture the majority of the variability of the original dataset. This graph explains the main idea of PCA. Let's consider a 2D data distribution being plotted here. We want to reduce the dimensionality of this dataset, i.e., we want to find a feature (1D) that can explain most of the variance of this data. The problem is that we cannot simply decide which feature to keep and which to eliminate. Why? Because these two features are **correlated**. We cannot determine if one feature contributes more to the variance than the other one. How do we find this most discriminating new feature? By **de-correlating** the features such that their covariance vanishes, i.e., by rotating the axes and changing the basis. In the new basis, the horizontal axis accounts for most of the variance of the data. Thus, we have created (extracted) a new feature (along the horizontal axis) that contributes most to the variance. How does PCA find the de-correlated features `\(z1\)` and `\(z2\)`? That is the matrix algebra behind PCA, which we will not get into. There are generally two ways: **eigen-decomposition and singular value decomposition**. --- # Principal component analysis (PCA) .pull-left[ * **Score plot** + **Projected observation distribution on the new plane/basis** + Similar observations accumulate within the same relative space (dispersion = dissimilar) * Loading plot (biplot) + Explains how original variables are linearly combined to form new PCs + Variables with largest absolute loadings have greatest importance + Direction in score plots corresponds to direction in loading plot - biplot * Scree plot + How many PCs should we keep? + Plot the variance explained as a function of the number of PCs kept + At the "elbow", adding a new PC does not significantly increase the variance explained by the PCA ] .pull-right[ ![](data:image/png;base64,#MS_Omics_data_analysis_files/figure-html/unnamed-chunk-8-1.png)<!-- --> .footnote[ [Nguyen LH, Holmes S. PLoS Comput. Biol. 15(6): e1006907 (2019)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] ] ??? Diagnostic plots of PCA (output). Example dataset: wine data. The variables include the chemical properties and composition of the wines. Class labels for grape varieties (59 Barolo, 71 Grignolino, 48 Barbera). --- # Principal component analysis (PCA) .pull-left[ * Score plot + Projected observation distribution on the new plane/basis + Similar observations accumulate within the same relative space (dispersion = dissimilar) * **Loading plot (biplot)** + **Explains how original variables are linearly combined to form new PCs** + **Variables with largest absolute loadings have greatest importance** + Direction in score plots corresponds to direction in loading plot - biplot * Scree plot + How many PCs should we keep?
+ Plot the variance explained as a function of the number of PCs kept + At the "elbow", adding a new PC does not significantly increase the variance explained by the PCA ] .pull-right[ ![](data:image/png;base64,#MS_Omics_data_analysis_files/figure-html/unnamed-chunk-9-1.png)<!-- --> .footnote[ [Nguyen LH, Holmes S. PLoS Comput. Biol. 15(6): e1006907 (2019)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] ] ??? The loading plot explains exactly how the old variables (dimensions) contribute to the new variables (PCs). Some original variables contribute a lot to a PC, while others contribute less. --- # Principal component analysis (PCA) .pull-left[ * Score plot + Projected observation distribution on the new plane/basis + Similar observations accumulate within the same relative space (dispersion = dissimilar) * **Loading plot (biplot)** + **Explains how original variables are linearly combined to form new PCs** + **Variables with largest absolute loadings have greatest importance** + Direction in score plots corresponds to direction in loading plot - biplot * Scree plot + How many PCs should we keep? + Plot the variance explained as a function of the number of PCs kept + At the "elbow", adding a new PC does not significantly increase the variance explained by the PCA ] .pull-right[ ![](data:image/png;base64,#MS_Omics_data_analysis_files/figure-html/unnamed-chunk-10-1.png)<!-- --> .footnote[ [Nguyen LH, Holmes S. PLoS Comput. Biol. 15(6): e1006907 (2019)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] ] ??? This is another version of the loading plot, not a bar graph, but it still shows how important each original variable is to the new PC. Variables with the largest absolute loadings (imagine a shadow) have the greatest importance to the new PC. --- # Principal component analysis (PCA) .pull-left[ * Score plot + Projected observation distribution on the new plane/basis + Similar observations accumulate within the same relative space (dispersion = dissimilar) * **Loading plot (biplot)** + Explains how original variables are linearly combined to form new PCs + Variables with largest absolute loadings have greatest importance + **Direction in score plots corresponds to direction in loading plot** - biplot * Scree plot + How many PCs should we keep? + Plot the variance explained as a function of the number of PCs kept + At the "elbow", adding a new PC does not significantly increase the variance explained by the PCA ] .pull-right[ ![](data:image/png;base64,#MS_Omics_data_analysis_files/figure-html/unnamed-chunk-11-1.png)<!-- --> .footnote[ [Nguyen LH, Holmes S. PLoS Comput. Biol. 15(6): e1006907 (2019)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] ] ??? Overlaying the loading plot with the score plot gives the biplot. --- # Principal component analysis (PCA) .pull-left[ * Score plot + Projected observation distribution on the new plane/basis + Similar observations accumulate within the same relative space (dispersion = dissimilar) * Loading plot (biplot) + Explains how original variables are linearly combined to form new PCs + Variables with largest absolute loadings have greatest importance + Direction in score plots corresponds to direction in loading plot - biplot * **Scree plot** + How many PCs should we keep?
+ **Plot the variance explained as a function of the number of PCs kept** + At the "**elbow**", adding a new PC does not significantly increase the variance explained by the PCA ] .pull-right[ ![](data:image/png;base64,#MS_Omics_data_analysis_files/figure-html/unnamed-chunk-12-1.png)<!-- --> .footnote[ [Nguyen LH, Holmes S. PLoS Comput. Biol. 15(6): e1006907 (2019)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907) ] ] ??? There is an elbow point in the scree plot, after which the variance explained by the following PCs flattens. Adding more PCs after the elbow point does not significantly increase the total variance explained by the PCA. --- # Partial least squares-discriminant analysis (PLS-DA) <img src="data:image/png;base64,#./img/plsda_v_pca.png" width="80%" style="display: block; margin: auto;" /> .footnote[ [Abid M.S., Qiu H, et al., Sci Rep 12, 8289 (2022)](https://doi.org/10.1038/s41598-022-12197-2) ] ??? Partial least squares-discriminant analysis (PLS-DA) is another versatile dimension reduction method that can be used for predictive and descriptive modeling as well as for feature selection in high-dimensional datasets. In contrast to PCA, PLS-DA is a **supervised method and provides stronger classification prediction**. However, because PLS-DA takes into account sample label information, this method runs the risk of **overfitting** the data, and the user needs to optimize many parameters before reaching reliable and valid outcomes. --- # Partial least squares-discriminant analysis (PLS-DA) ![](data:image/png;base64,#./img/vip_ROC.png) .footnote[ [Abid M.S., Qiu H, et al., Sci Rep 12, 8289 (2022)](https://doi.org/10.1038/s41598-022-12197-2) ] ??? PLS-DA generates a **variable importance in projection (VIP)** score for each variable (peptide), which **measures the importance of that individual variable and summarizes the contribution the variable makes to the PLS model**. --- class: inverse # Resources: * [Data Carpentry](https://datacarpentry.org/lessons/) + [Data Analysis and Visualization in R](https://datacarpentry.org/R-genomics/index.html) + [Data Analysis and Visualization in R for Ecologists](https://datacarpentry.org/R-ecology-lesson/index.html) * [Software Carpentry](https://software-carpentry.org/lessons/) + [Programming with R](http://swcarpentry.github.io/r-novice-inflammation/) + [Programming with Python](https://swcarpentry.github.io/python-novice-inflammation/) + [Version Control with Git](https://swcarpentry.github.io/git-novice/) + [The Unix Shell](https://swcarpentry.github.io/shell-novice/) + [R for Reproducible Scientific Analysis](https://swcarpentry.github.io/r-novice-gapminder/) * [OpenIntro Statistics](https://www.openintro.org/book/os/) * [Modern Statistics for Modern Biology](http://web.stanford.edu/class/bios221/book/index.html) * [An Introduction to Statistical Learning](https://www.statlearning.com/) --- background-image: url(data:image/png;base64,#./img/bcrf.png) background-size: 75% <!-- class: inverse, center, middle --> <!-- # Thank you! --> <!-- Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). -->