class: center, middle, inverse, title-slide .title[ # Data Preprocessing and Processing ] .author[ ###
Max Qiu, PhD
Bioinformatician/Computational Biologist
maxqiu@unl.edu
] .institute[ ###
Bioinformatics Research Core Facility, Center for Biotechnology
Data Life Science Core, NCIBC
] .date[ ### 02-13-2023 ] --- background-image: url(data:image/png;base64,#https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs13024-018-0304-2/MediaObjects/13024_2018_304_Fig1_HTML.png?as=webp) background-size: 75% # MS Omics Workflow .footnote[ [Shao, Y., Le, W. Mol Neurodegeneration 14, 3 (2019)](https://doi.org/10.1186/s13024-018-0304-2) ] ??? HPLC: * Components: pumps, sampler, column, detector * Operating principle: separation based on adsorption/desorption rates * Column (stationary phase): polarity similar to your target analytes, so they are retained * Mobile phase: competes with the analytes for the stationary phase * Separation modes: normal phase for polar analytes, reverse phase for non-polar analytes * Elution modes: isocratic (eluent unchanged, one pump) or gradient (eluent changing throughout the run, two pumps) * **Detector**: spectrophotometric (UV/Vis, fluorescence), refractive index (RI), evaporative light-scattering (ELSD), electrochemical, **mass spectrometer** MS: * Inlet, ionization source, mass analyzer, and detector * Ionization source: ESI and MALDI * Mass analyzer: quadrupole, TOF, ion-trap/Orbitrap MS/MS: * Auto mode and targeted mode * Collision cell An LC-MS dataset is three-dimensional: it includes three components for each feature, (1) the retention time, (2) the mass-to-charge ratio and (3) the chromatographic peak area/intensity. All three components are used to construct a data matrix that defines each feature and records its peak intensity in each sample. --- # Data Acquisition .pull-left[ ### Instrumentation * Separation: LC-, GC-, CE-, 1D or 2D Gel * Detection: + -MS (Quadrupole, TOF, Ion-trap/Orbitrap) + -MSn (Triple Quad, QTrap, Q-TOF, Q-Orbitrap) * Others: DI (direct infusion)-, MALDI-, etc. ] .pull-right[ ### Data preprocessing (Deconvolution) The raw mass spectra acquired are **complicated (convoluted)** by many interferences and artefacts (instrumental mass bias, baseline noise, baseline drift). **Deconvolution** is the process of computationally **removing interferences/artefacts, separating co-eluting components and creating a pure spectrum for each component**, i.e., converting raw mass spectra into a data matrix (peak intensity table, concentration table, etc.) containing the intensity of each peak in each sample * Sample alignment * Peak picking ] ??? During acquisition, the mass spectra become complicated (convoluted) by many interferences and artefacts. Deconvolution is the process of computationally removing interferences/artefacts, separating co-eluting components and creating a pure spectrum for each component. Generally, deconvolution involves two steps: sample alignment and peak picking. Some software performs **peak picking on each sample separately**, i.e., it **filters and identifies peaks** within each sample and then aligns all samples together. Other software, such as Waters Progenesis, aligns samples first (by selecting a reference sample that **contains the most commonality**) and then performs peak picking. --- # Data Acquisition: Preprocessing <img src="data:image/png;base64,#./img/progenesis_auto_processing.png" width="90%" style="display: block; margin: auto;" /> ??? After all files are imported, the software chooses an alignment reference: it picks the file that has the most features (black dots) in common with all the other files. Once the software has picked a reference, it uses all the black dots to align the files to the reference, so that a feature is not identified as a different peak just because it drifted slightly in the chromatogram.
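As a purely illustrative example (hypothetical feature names and intensity values), the data matrix produced by preprocessing can be thought of as a peak intensity table in which each feature is defined by its retention time and m/z and quantified by its chromatographic peak intensity in every sample:

```r
# Toy peak intensity table: one row per feature (RT + m/z) and one column per sample
peak_table <- data.frame(
  feature  = c("F001", "F002", "F003"),
  rt_min   = c(1.82, 4.57, 7.31),              # retention time (minutes)
  mz       = c(180.0634, 302.1512, 445.2049),  # mass-to-charge ratio
  sample_A = c(1.2e5, 3.4e4, 8.9e3),           # chromatographic peak intensities
  sample_B = c(1.1e5, 2.9e4, 9.6e3),
  sample_C = c(1.4e5, 3.8e4, 7.5e3)
)
peak_table
```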
--- # Data Acquisition: Preprocessing <img src="data:image/png;base64,#./img/progenesis_review_alignment.png" width="90%" style="display: block; margin: auto;" /> ??? The four panels here show the same thing, just with different visualizations. The software is looking at the chromatogram for alignment. The blue dots are the vectors being used to align the data; a vector has a direction in 3D, connecting this sample to the reference. The alignment score indicates how well a sample is aligned to the reference. A bad score or red area is not necessarily a bad thing; it just means that the sample is significantly different from the reference. --- # Data Acquisition: Preprocessing <img src="data:image/png;base64,#./img/progenesis_peak_picking.png" width="90%" style="display: block; margin: auto;" /> ??? Once all the data are imported and aligned, the software picks the spectral peaks. This shows an aggregate sample containing all ~4,000 peaks that have been found across all samples. On the aggregate, the software will try to normalize the data, i.e., decide how the data are compared to one another. By default, the data are normalized to all ions. --- # Data Acquisition: Preprocessing <img src="data:image/png;base64,#./img/progenesis_normalization.png" width="90%" style="display: block; margin: auto;" /> ??? The software looks at all the spectral data and tries to find the most commonality between the files. The file with the most spectral commonality is chosen as the normalization reference, and all other files are compared against this reference. All other files are either slightly higher or lower in total intensity compared to the reference. --- # Data Acquisition: Preprocessing ### Peak intensity table <img src="data:image/png;base64,#./img/peak_intensity_table.png" width="100%" style="display: block; margin: auto;" /> ??? This is only a surface-level introduction to MS and LC-MS; there are many important concepts such as mass accuracy, mass resolution, ion suppression, monoisotopic mass, how to read an MS spectrum, and how to calculate molecular mass. Common preprocessing software includes Waters Progenesis QI, PEAKS Studio from Bioinformatics Solutions, and MaxQuant. --- # (Post-Acquisition) Data processing .pull-left[ ### MS data is highly complex * Hundreds to thousands of metabolites/peptides * Complexity of mass spectral data: multiple peaks for each analyte * Hundreds or thousands of samples ] -- .pull-right[ ### Common artefacts * Instrumental mass bias (use lock mass) * Baseline noise * Baseline drift (batch correction) * Misalignment * Unwanted peak intensity differences (normalization) * Missing values (imputation) * Batch effects (batch correction) * Unequal variable weights (scaling) * ... ] --- # (Post-Acquisition) Data processing * Quality control and injection order * Assess the quality of data: missing values * Assess feature presence/absence: imputation * (QC-based) batch correction * Data processing + Log2 transformation + Normalization + Scaling ??? Several processing steps have to be applied to remove systematic bias from the data while preserving biological information. --- # Quality Control .pull-left[ ### Why? 
Reproducibility * Sample interacts directly with the instrument * Changes in measured feature response over time ### Purpose of using QC samples * To **'condition' or equilibrate** the analytical platform * To provide data to calculate technical precision within each analytical block: **within-batch effect** * To provide data to use for signal correction between analytical blocks: **between-batch effect** ] ??? The first important thing to keep in mind is quality control and the use of QC samples. We are concerned about the **issue of reproducibility**. Not all samples can be run in a single analytical batch, because of issues ranging from instrument **medium- to long-term reproducibility** to necessary preventative **maintenance**. But the issue of reproducibility is not only time-dependent. It is also **instrument-dependent**. In any chromatography-MS system, the **sample unavoidably interacts directly with the instrument**, and this leads to **changes in measured feature response over time**, both in terms of chromatography and MS (**signal attenuation**). The **degree and timing of signal attenuation are not consistent across all measured features** (and they also depend on the type of biofluid measured). For this reason, it is a necessary requirement that QC samples are periodically analyzed throughout an analytical run in order to provide robust quality assurance (QA) for each feature detected. -- .pull-right[ ### QC-based batch correction * Use the QC responses as the basis to assess the quality of the data * Remove peaks with poor repeatability * Correct the signal attenuation * Concatenate batch data together ] .footnote[ [Dunn WB, et al. Nat Protoc. 2011 Jun 30;6(7):1060-83.](https://www.nature.com/articles/nprot.2011.335) ] ??? In the data processing stages, QC-based batch correction algorithms (we will mention a few) can **use the QC responses as the basis** to assess the quality of the data, **remove peaks with poor repeatability, correct the signal attenuation and concatenate batch data together** after data acquisition and before statistical analysis. --- # QC sample ### Types of QC samples .pull-left[ * Pooled QC (**Gold standard**) + **Small aliquots of each biological sample to be studied are pooled and thoroughly mixed** + Closest to the composition of the biological samples + Suited to small, focused studies, e.g., small clinical trials or animal studies of hundreds of samples + Not always applicable in large-scale studies involving thousands of samples ] .pull-right[ * Surrogate QC + Commercially available biofluids composed of multiple biological samples **not present** in the study + Some features are detected only in the study samples and some only in the commercial serum sample. During batch correction, these features are **removed** from the dataset, as the **QC-based batch correction methods only correct features that were detected in both the QC sample and the real samples**. + This results in a loss of information (~20% in this dataset). ] .footnote[ [Dunn WB, et al. Nat Protoc. 2011 Jun 30;6(7):1060-83.](https://www.nature.com/articles/nprot.2011.335) ] ??? Ideally, QC samples are **theoretically identical** to real biological samples, with a sample matrix composition similar to that of the biological samples under study. Two types of QC sample are available: pooled QC and commercially available surrogate QC.
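To make the QA role of the QC injections concrete, here is a minimal, hypothetical R sketch (not the Progenesis or statTarget workflow) of estimating per-feature technical precision as the coefficient of variation (CV) across pooled-QC injections and flagging poorly repeatable features; the matrix and column indices are invented for illustration:

```r
# Hypothetical feature-by-injection intensity matrix; qc_cols marks pooled-QC injections
set.seed(1)
peaks   <- matrix(rlnorm(5 * 12, meanlog = 10), nrow = 5,
                  dimnames = list(paste0("F", 1:5), paste0("inj", 1:12)))
qc_cols <- c(1, 5, 9, 12)   # e.g., a pooled QC injected periodically through the run

# Coefficient of variation (%) of each feature across the QC injections
qc_cv <- apply(peaks[, qc_cols], 1, function(x) 100 * sd(x) / mean(x))

# Keep only features whose QC CV is below a chosen threshold (e.g., 20-30%)
peaks_kept <- peaks[qc_cv < 30, , drop = FALSE]
```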
--- # QC samples and Injection Order <img src="data:image/png;base64,#./img/injection_order_1.PNG" width="75%" style="display: block; margin: auto;" /> .footnote[ [Tripp, B.A., Dillon, S.T., Yuan, M. et al. Sci Rep 11, 1521 (2021)](https://doi.org/10.1038/s41598-020-80412-z) ] ??? As we mentioned before, the purpose of incorporating QC samples is to **provide a basis** to assess the quality of the data. To achieve that, they need to be **run periodically throughout the entire data acquisition** and must be **included within each batch**. --- # QC samples and Injection Order .pull-left[ <img src="data:image/png;base64,#./img/injection_order_2.PNG" width="100%" style="display: block; margin: auto;" /> .footnote[ [Dunn, W., Broadhurst, et al. Nat Protoc 6, 1060–1083 (2011)](https://doi.org/10.1038/nprot.2011.335) ] ] ??? A few tips: * Use the **same** pooled QC sample across batches throughout the whole acquisition phase. * In each batch, QC samples must be placed at the **beginning and at the end of each batch**, to correct the total signal drift of the batch. * Within each batch, a QC sample should be **run every 3-5 samples**, to correct the signal drift within the batch. * Don't leave your QC samples in the autosampler for days, because they will dry out and defeat the purpose of using them in the first place. -- .pull-right[ ### Tips * The **same** QC samples are **included within each batch** throughout the whole acquisition phase. * In each batch, QC samples must be placed at the **beginning and at the end of each batch**, to correct the total signal drift of the batch. * Within each batch, a QC sample should be **run every 3-5 samples**, to correct the signal drift within the batch. ### What is considered a batch? An extended period during which the instrument is not running; an instrument shutdown; an instrument re-calibration... ] --- background-image: url(data:image/png;base64,#./img/ms_omics_workflow_mod.png) background-size: 75% # MS Omics Workflow .footnote[ [Shao, Y., Le, W. Mol Neurodegeneration 14, 3 (2019)](https://doi.org/10.1186/s13024-018-0304-2) ] ??? The preparation of the pooled QC sample and the planning of the injection order should happen at the sample preparation step, before data acquisition. --- # Missing values .pull-left[ ## Potentially three main reasons * **True missing (true negatives)**: The feature is not present in the biological sample * **Missing not at random (MNAR)**: The feature is present in the sample, but the concentration is below the limit of detection (left-censored data) * **Missing completely at random (MCAR)/missing at random (MAR)**: The feature is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process ] .pull-right[ ![missing values](data:image/png;base64,#./img/missing_values.jpg) ] ??? Missing values in mass spec data can be **unavoidable and problematic** for the analysis, and they occur as a result of **both technical and biological reasons**: * True missing * MNAR: MNAR refers to left-censored data, where the distribution of abundance is truncated on the left side. For example, low-intensity peptides show a higher rate of missing values, as some of them fall below the limit of detection of the instrument. 
* MCAR/MAR --- # Missing values .pull-left[ ## Potentially three main reasons * True missing (true negatives): The feature is not present in the biological sample * Missing not at random (MNAR): The feature is present in the sample, but the concentration is below the limit of detection (left-censored data) * **Missing completely at random (MCAR)/missing at random (MAR)**: The feature is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process + A data matrix containing a high proportion of missing values (>25%) is not normal ] ??? The feature is present in the sample and above the limit of detection, but was not detected due to **random errors, technical limitations, or general stochastic fluctuations** during the data acquisition process. As a result, each missing value cannot be directly explained by the nature of the feature or its measured intensity. MCAR/MAR affects the entire dataset with a **uniform distribution**. As it will have a negative impact on statistical analysis, it is important to explore the peak intensity table and investigate the **percentage of your missing values**. -- .pull-right[ ### Check data preprocessing parameters for * Retention time misalignment * Incorrect grouping of mass-to-charge values * Inaccurate signal-to-noise threshold ] ??? A **very large percentage of missing values** is problematic for data analysis and may lead to erroneous conclusions. As such, it is recommended to explore the data table and investigate the percentage of missing values. If a data table contains a relatively high percentage of missing values (>30%), it is recommended to reevaluate the data collection and deconvolution parameters. **Make sure you are using a set of optimal parameters to process the spectra**. Non-optimal processing parameters may result in missing values due to retention time misalignment, incorrect grouping of m/z values, or an inaccurate signal-to-noise threshold. --- # Missing values .pull-left[ ## Potentially three main reasons * **True missing (true negatives)**: The feature is not present in the biological sample * **Missing not at random (MNAR)**: The feature is present in the sample, but the concentration is below the limit of detection (left-censored data) * Missing completely at random (MCAR)/missing at random (MAR): The feature is present in the sample and above the limit of detection but was not annotated as a peak during the peak picking (deconvolution) process ] ??? The other two are legitimate reasons for missing values, even if everything was done perfectly. Under these circumstances, we can **filter out problematic samples or features that have a high percentage of missing values**. -- .pull-right[ ### Remove Sample * Affects statistical power * Tread lightly! Check previous steps. ### Assess feature presence/absence * A feature is deemed present if missing values account for less than 80%* of the sample space + Features deemed present: **missing value imputation** - KNN (k-nearest neighbor) - Small value replacement - Random forest + Features deemed absent: remove features ] ??? Removing a sample is a risky move, as it will reduce the statistical power of the tests. So tread lightly, unless **a sample contains missing values across most of the feature space or there is sufficient scientific rationale for its exclusion**. The sketch below illustrates how one can quantify missingness before making these decisions. 
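A purely illustrative R sketch (hypothetical `peaks` matrix, missing values coded as `NA`) of computing the percentage of missing values per sample and per feature:

```r
# Hypothetical feature-by-sample intensity matrix with NAs marking missing values
set.seed(2)
peaks <- matrix(rlnorm(6 * 8, meanlog = 9), nrow = 6,
                dimnames = list(paste0("F", 1:6), paste0("S", 1:8)))
peaks[sample(length(peaks), 10)] <- NA   # introduce some missing values at random

# Percentage of missing values per sample (columns) and per feature (rows)
missing_per_sample  <- colMeans(is.na(peaks)) * 100
missing_per_feature <- rowMeans(is.na(peaks)) * 100
round(missing_per_sample, 1)
round(missing_per_feature, 1)
```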
Removing problematic features has less impact on the integrity of the dataset, but we need to **make sure that the features being removed are “true missing”**. For this purpose, **a criterion should be established** before removing any feature. A popular criterion is the “modified 80% rule”, where **a feature is kept if it has a non-zero value in at least 80% of the samples in any one group**. This threshold value can be modified depending on the dataset. A feature whose percentage of non-zero values is **below the threshold** is removed from the dataset and excluded from all subsequent data analysis. For features deemed present, proceed to missing value imputation. The objective of missing value imputation is to **replace the missing values with non-zero values while maintaining the structure of the data**. --- # Missing values .pull-left[ ![KNN](data:image/png;base64,#./img/KNN.PNG) ] .pull-right[ ### Remove Sample * Affects statistical power * Tread lightly! Check previous steps. ### Assess feature presence/absence * A feature is deemed present if missing values account for less than 50%* of the sample space + Features deemed present: **missing value imputation** - **KNN (k-nearest neighbor)** - Small value replacement - Random forest + Features deemed absent: remove features ] ??? The k-nearest neighbor imputation method (KNN for short) uses a feature-specific approach: **the nearest K (e.g., five) neighbors of the feature with the missing value are identified using a distance metric (e.g., Euclidean distance)**. The missing value is then replaced with the average of the nearest non-missing values. The small value replacement method **imputes the same value for each of the missing values of a specific feature**. In this approach the missing value is usually replaced by **half of the minimum peak intensity of that feature**. This method would be expected to perform well when the missing values occur because the concentration of the feature is less than the limit of detection. The selection of missing value imputation methods can significantly affect downstream analysis and result interpretation, and no single method is ideal for every application. It is recommended that users choose imputation methods based on **the types of missing values that predominate in the dataset**. --- # QC-based Batch correction ### Use pooled QC to correct both between- and within-batch effects .pull-left[ * Signal correction * Batch concatenation ] .pull-right[ * [QC-RFSC](https://www.sciencedirect.com/science/article/pii/S0003267018309395) (`statTarget`) * [QC-SVR](https://pubs.rsc.org/en/content/articlelanding/2015/an/c5an01638j) * [QC-RLSC](https://www.nature.com/articles/nprot.2011.335) * [QC-RSC](https://link-springer-com.libproxy.unl.edu/article/10.1007%2Fs00216-013-6856-7) ] ### QA criteria After batch correction, each feature is required to pass a set of QA criteria. Any feature that does not pass the QA criteria is removed from the dataset and thus ignored in any subsequent data analysis. * Calculate the CV (coefficient of variation) for all features in the QC samples. * Remove all features that are not detected in at least **two-thirds** of the QC samples. * Remove all features with a CV (as calculated from the QC samples) of >30% (GC-MS) or >20% (LC-MS). **Code demo and tutorial:** [Batch Correction using `statTarget`](https://github.com/whats-in-the-box/tutorials_and_demos/blob/main/BatchCorrection.ipynb) ??? 
Previously we mentioned that data conditioning algorithms **use the QC responses as a basis** to correct signal drift between and within batches. Here are some examples of QC-based batch correction algorithms. In addition, there are also QC-free approaches for batch correction, the most popular of which is ComBat. After **signal correction and batch integration**, each detected feature is required to pass a set of QA criteria. Any peak that does not pass the QA criteria is removed from the dataset and thus ignored in any subsequent data analysis. These are mostly **peaks with poor repeatability** (detected in some samples but not others, or with a concentration so low that it approaches the limit of detection). The FDA recommended that **a measured response detected in two-thirds of QC samples should be within a coefficient of variation (CV) of 15%**, except for analytes approaching the limit of detection, in which case a CV of 20% is acceptable. For biomarkers, the FDA guidance allows up to 30% CV. --- # Data Processing .pull-left[ <img src="data:image/png;base64,#./img/data_example.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ### Understand the statistical nature of your data * Is your data normally distributed? * High-dimensional ] ??? Up to this point, we have shown that several processing steps have to be applied to **remove systematic bias from the data, while preserving biological information**. The next step in the workflow is to use statistical techniques to extract the relevant and meaningful information from the preprocessed data. However, before statistical analysis, we must make sure we understand the **statistical characteristics** of the data matrix, specifically the **distribution of the data and its high-dimensional nature**, what **statistical tests** are suitable for our data, and whether the data matrix **satisfies the assumptions** of those tests. --- # Data Processing .pull-left[ <img src="data:image/png;base64,#./img/data_example.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ### Understand the statistical nature of your data * Is your data **normally distributed**? * High-dimensional ### What statistical tests can we use * Parametric or non-parametric ### Whether your data satisfy the assumptions of statistical tests * Normal distribution (Gaussian) * Equal variances (homoscedasticity) * ... ] ??? Parametric statistics are based on the assumption that your sample follows a specific distribution, namely the distribution of the population from which the sample was taken. Nonparametric statistics are not based on such assumptions; that is, **the data can be collected from a sample that does not follow a specific distribution**. **Nonparametric tests are also called distribution-free tests because they don’t assume that your data follow a specific distribution**. This is why we have to check whether the data are normally distributed (Gaussian). If yes, we can use parametric tests; if not, we can only use non-parametric tests. If the data are not normally distributed, then the **goal of the data processing step is to transform the data so that they resemble a normal distribution as closely as possible**. <!-- # Data Processing --> <!-- **Goal of data processing**: to get to (near) **Gaussian** distribution so we can use parametric tests. --> <!-- [Nonparametric Tests vs. 
Parametric Tests](https://statisticsbyjim.com/hypothesis-testing/nonparametric-parametric-tests/) --> <!-- - Reasons to use a parametric test --> <!-- + **Parametric tests have greater statistical power**. If an effect actually exists, a parametric analysis is more likely to detect it. --> <!-- + Parametric tests can perform well with skewed and nonnormal distributions. --> <!-- + Parametric tests can perform well when the spread (dispersion) of each group is different. --> <!-- - Reasons to use a nonparametric test --> <!-- + Your sample size is small. --> <!-- + The median more accurately represents the center of your data (highly skewed data). --> <!-- ![highly skewed data](data:image/png;base64,#./img/highly_skewed.png) --> <!-- ??? --> <!-- Parametric tests offer a few advantages. --> <!-- There are also times when non-parametric tests are better suited. --> <!-- The answer is often contingent upon whether the **mean or median** is a better measure of central tendency for the distribution of your data. --> <!-- * If the mean is a better measure and you have a sufficiently large sample size, a parametric test usually is the better, more powerful choice. --> <!-- * If the median is a better measure, consider a nonparametric test regardless of your sample size. --> --- # Data Processing **Goal of data processing**: to get to (near) **Gaussian** distribution so we can use parametric tests. .pull-left[ ### Log Transformation * Data transformation applies a mathematical transformation to the individual values themselves, e.g., log or cube root transformation. * For mass spec data, log transformation is the better choice, as **log transformation reduces or removes the skewness** of mass spec data and improves linearity between the independent variable and the response variable. * Generalized logarithm transformation (glog) for datasets that contain negative values ] .pull-right[ <img src="data:image/png;base64,#./img/distribution_before_after.png" width="100%" style="display: block; margin: auto;" /> ] ??? After imputation and batch correction, the peak intensity data table typically **spans several orders of magnitude, and its distribution is highly skewed**. Log transformation reduces or removes the skewness of the original data, which makes it an ideal choice for mass spectrometric data transformation. More importantly, it improves linearity between the independent variable (i.e., treatment, group, disease) and the response (i.e., peaks) and boosts the validity of the statistical tests. In the event that negative values exist in the dataset after imputation, the generalized logarithm transformation (glog) is a good alternative. --- # Data Processing **Goal of data processing**: to get to (near) **Gaussian** distribution so we can use parametric tests. ### Normalization Normalization aims to **remove systematic biases** from the dataset and **make the samples more comparable and the subsequent statistical analysis more reliable**. 
* Normalization methods + Linear regression normalization: assumes that the bias in the data is linearly dependent on the peak intensity + Local regression normalization: assumes a nonlinear relationship between the bias in the data and the peak intensity + Global adjustment/scaling: assumes a similar distribution of intensities across samples and calculates a global scaling factor between samples by using a selected reference sample + **EigenMS normalization**: evaluates and removes the treatment group effect, then uses singular value decomposition (SVD) on the residual matrix to identify trends attributable to bias .footnote[ [Tommi Välikangas, Tomi Suomi, Laura L Elo, Briefings in Bioinformatics, Volume 19, Issue 1, January 2018, Pages 1–11](https://doi.org/10.1093/bib/bbw095) ] ??? Normalization is a data processing step that aims to **remove systematic biases** from the dataset (i.e., non-biological signals that are attributable to instrumental or technical aspects). Normalization aims to **make the samples more comparable and the subsequent statistical analysis more reliable**. There have been many reviews evaluating the performance of these methods, typically focused on their ability to decrease intragroup variation between technical and/or biological replicates. EigenMS provides several advantages. It is suitable for data containing high percentages of missing values. Users do not need to know the sources of bias to be able to remove them, and the method can be easily incorporated into an existing data analysis workflow without requiring any special upstream or downstream steps. Apart from the above methods, there are also ANOVA normalization, variance stabilization normalization, quantile normalization, median normalization, etc. --- # Data Processing **Goal of data processing**: to get to (near) **Gaussian** distribution so we can use parametric tests. ### Scaling .pull-left[ * Scaling aims to **standardize the range of feature data** (transforming your data so that it fits within a specific scale), thereby making individual features more comparable (as the ranges of the data values may vary widely). * Scaling methods + Auto Scaling: mean-centered and divided by the standard deviation of each variable + Pareto Scaling: mean-centered and divided by the square root of the standard deviation of each variable + Range Scaling: mean-centered and divided by the range of each variable ] .pull-right[ ![scaling](data:image/png;base64,#./img/scaling.PNG) .footnote[ [van den Berg RA, et al. BMC Genomics. 2006 Jun 8;7:142](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-7-142) ] ] ??? Simply put, we are transforming the **range (i.e., the spread)** of each feature by dividing by a factor, the scaling factor, which is different for each feature, so as to shift the focus toward the variation between samples rather than within them. **Scaling often results in the inflation of small values and, in turn, of the influence of measurement error.** Compared to the previous data processing steps, scaling **changes the structure of the data much more profoundly**, and therefore should be considered an optional step and exercised with greater caution. A few tips: * Log transformation is usually the first step in mass spec data processing. Check the histogram and QQ plot afterwards. * You can choose either normalization or scaling, or both. Again, judge by the histogram and QQ plot. * Scaling should not be the first step in data processing. A minimal sketch of these steps follows below. 
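A minimal R sketch of log2 transformation followed by auto and Pareto scaling, assuming a hypothetical feature-by-sample matrix `peaks` with no remaining zeros or missing values (the variable names and toy values are illustrative only, not part of any specific package workflow):

```r
# Hypothetical feature-by-sample intensity matrix (features in rows)
set.seed(3)
peaks <- matrix(rlnorm(5 * 6, meanlog = 10, sdlog = 1.5), nrow = 5,
                dimnames = list(paste0("F", 1:5), paste0("S", 1:6)))

# Log2 transformation: reduces the skewness of the raw intensities
log_peaks <- log2(peaks)

# Auto scaling: mean-center each feature and divide by its standard deviation
auto_scaled <- t(scale(t(log_peaks)))

# Pareto scaling: mean-center each feature and divide by the square root of its SD
pareto_scaled <- t(apply(log_peaks, 1, function(x) (x - mean(x)) / sqrt(sd(x))))

# Inspect the distribution after transformation (histogram and QQ plot)
hist(as.vector(log_peaks))
qqnorm(as.vector(log_peaks)); qqline(as.vector(log_peaks))
```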
--- class: inverse, center, middle # Next: Statistical analysis Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).