Bioinformatics and Downstream Analysis of Microbiome Data

.title[
# Bioinformatics and Downstream Analysis of Microbiome Data
]
.author[
### <a href="https://shibalytics.com/">Max Qiu, PhD</a> Bioinformatician/Computational Biologist <a href="mailto:maxqiu@unl.edu" class="email">maxqiu@unl.edu</a>
]
.institute[
### <a href="https://biotech.unl.edu/bioinformatics">Bioinformatics Core Research Facility, Center for Biotechnology</a>
]
.date[
### 10-16-2023
]

---

# [Bioinformatics Core Research Facility](https://biotech.unl.edu/bioinformatics)

## Who are we?

* Part of Nebraska Center for Biotechnology, located at the Beadle Center E204

* Providing expertise and comprehensive analyses for large and small data sets across all NU campuses, as well as for external institutions and private industry

* Fee-for-service business model

* Supported by Nebraska Research Initiative and grant funding through collaborations

]

???

Our mission is to provide faculty, staff and students with our expertise in large-scale, high-dimensional data analysis and statistical analysis in the life sciences. We operate on a fee-for-service business model, and we do insist on authorship or acknowledgement in publications, depending on contribution; the fee is waived if the PI includes us in the grant application.

---

# [Bioinformatics Core Research Facility](https://biotech.unl.edu/bioinformatics)

.pull-left[
<img src="data:image/png;base64,#./img/genome_assembly.PNG" width="85%" style="display: block; margin: auto;" />

]

* Bulk and single-cell RNA-Seq data analysis
* De novo or guided genome or transcriptome assemblies
* Functional genomics (gene enrichment and pathway analysis)
* Microbiome analysis via amplicon sequencing or shotgun metagenomics
* Mass Spec-generated Omics analysis (proteomics, peptidomics, metabolomics, etc.)

We also provide custom analyses, including the development of new analysis workflows or automated pipelines, research databases and web portals, and the integration of large-scale data sets.

]

???
We use the high-performance computing cluster at HCC to assist our clients with their bioinformatics and computational needs, such as....

---

# [Bioinformatics Core Research Facility](https://biotech.unl.edu/bioinformatics)

]

???

If you scroll down on our home page, you'll find **appointments and consultation**, which will lead you to our consultation scheduler. **We usually do not charge students for simple stat or data analysis consultations.**

---

# Outline

- ### 16S rRNA Sequencing Approach

- ### Microbiome Data Structure

- ### Challenges of Modeling Microbiome Data

- ### What Microbiome Bioinformatics Analysis .highlight[Can] and .highlight[Cannot] Answer

---

# Metagenomics Workflow Overview

???
This slide shows an overview of a metagenomics workflow, from **sample collection** to **optional enrichment of microbes** (e.g. selective media, depletion of host genomes) to **extraction of the genetic material** for targeted or shotgun sequencing. We will go over these two types of sequencing in subsequent slides and compare them.

---

# 16S rRNA Sequencing Approach

- The Advantages of 16S rRNA Sequencing

- Bioinformatic Analysis of 16S rRNA Sequencing Data

- Limitations of 16S rRNA Sequencing Approach

---

## Advantages of 16S rRNA Sequencing

* Should have sufficient .highlight[resolution] to differentiate the different communities you want to study

* Availability of a good .highlight[reference DB] for the samples you study is necessary

* .highlight[Single copy gene] preferred but not always possible

]

???

The marker should have sufficient resolution to differentiate the different communities you want to study (16S differentiates genera and some species; 18S and ITS rRNA for fungi);

A reference database is needed for taxonomic assignment so availability of a good reference DB for the samples you study is necessary;

A single-copy gene is preferred but not always possible (rRNA genes are not single-copy genes).

* The 16S rRNA gene is ubiquitous

* Contains highly conserved regions

* Well-studied primer sets are available

* Well-curated databases of reference sequences and taxonomies

* Relatively cheap and simple with mature analysis pipelines
]

???

rRNA genes are present in all living organisms, which makes them a universal phylogenetic marker.

16S rRNA contains nine hypervariable (V) regions (V1–V9). Currently, the segments of V1–V3, V4, and V4–V5 regions are most commonly used because research showed that each can provide **genus-level resolution**.

Common reference databases are the Ribosomal Database Project (RDP), Greengenes, and SILVA. Greengenes is the smallest of the three, while SILVA is the largest. SILVA is the most up to date, and its taxonomic assignments are manually curated.

---

## Bioinformatics Data Analysis Tools

![](data:image/png;base64,#img/tools.png)

???

All packages are open source and have online tutorials and forums. QIIME and mothur are **self-contained** pipelines. All of these tools can analyze 16S rRNA gene sequencing data from raw sequence reads to an ASV/OTU abundance table, enable comparison of multiple samples, and support the SILVA 16S rRNA gene reference database. QIIME 2 became available in 2018; it is a completely redesigned and rewritten version of the QIIME microbiome analysis pipeline.

---

## Bioinformatic Workflow Overview

---

## Limitations of 16S rRNA Sequencing Approach

### 16S rRNA

* Variation of 16S copy number across most organisms

* Insufficient resolution at the finest (strain) levels

### Amplicon sequencing

* Amplify rRNA marker via PCR
  + Primer bias: may .highlight[fail to detect some taxa], reducing observed microbial diversity
  
]

* Artificial sequences
  + Sequencing errors and incorrectly assembled amplicons (i.e., .highlight[chimeras])
  + .highlight[Incorrectly assigning OTU/ASV], and the 16S locus being transferred between distantly related taxa

* Only taxonomic composition, no functional composition
  + [PICRUSt2](https://www.nature.com/articles/s41587-020-0548-6#citeas): leveraging genome databases, .highlight[predicts] community gene abundances; for .highlight[hypothesis-generating] purposes only

]

???

The limitations are two-sided; there are limitations with 16S rRNA itself, ...

There are also limitations with amplicon sequencing technology:

* Amplifying rRNA markers via PCR may .highlight[fail to detect some taxa] due to various biases associated with PCR, which may substantially reduce the observed microbial diversity. One example is differential amplification of groups depending on primer affinity.

* 16S rRNA sequencing .highlight[overestimates the community diversity or species abundance] due to .highlight[artificial sequences]. Artificial sequences can occur due to several reasons.

* Amplicon sequencing .highlight[only discerns the taxonomic composition] of microbiome community. It cannot directly analyze the biological functions of associated taxa. 
  + .highlight[PICRUSt2]: leveraging genome database, .highlight[predicts] community gene abundances, .highlight[hypothesis-generating] purpose only

PICRUSt predicts community gene abundances

* no active molecules (proteins/metabolites)
* accuracy could be low on individual genomes 
* accuracy based on annotated well studied enzymes and pathways
* Hypothesis-generating; followup studies are needed to validate findings

---

## Summary

#### Targeted Sequencing (i.e., amplicon sequencing, marker gene profiling)

- Use PCR primers to target specific regions of genome, e.g., 16S rRNA

- Obtain .highlight[taxonomic] information, .highlight[no metabolic functional] information

- Less expensive (~ $100 per sample), relatively smaller computational needs

- Relatively free of host DNA contamination

- Able to sequence deeper and broader; majority of genes can be assigned to at least phylum level

- Good to answer: .highlight["Who is there?"]

]

#### Shotgun Sequencing (WGS sequencing)

- Randomly sequence all the DNA in the sample

- Obtain .highlight[taxonomic and functional] information

- Relatively expensive (~ $1000 per sample); usually requires huge computational resources

- Prone to host DNA contamination; the source organism of each gene is unknown (a "bag of genes")

- Many more unassigned gene fragments (“wasted” data)

- Good to answer: .highlight["Who is there and what are they capable of?"]
]

---

# Microbiome Data Structure

- Data matrix

- Sample metadata

- Feature metadata

- Structured as a phylogenetic tree

---

## (Sample-by-feature) Data matrix

```r
otu_table(ps)[1:10, 1:10]
```

```
## OTU Table:          [10 taxa and 10 samples]
##                      taxa are columns
##        ASV_3837 ASV_1432 ASV_3347 ASV_3098 ASV_546 ASV_1097 ASV_2257 ASV_3871
## 12post        0        0        0        0      13        0        0        0
## 12pre         0        0        0        0      39        0        0        0
## 14post        1        1        0        2      13        1        0        0
## 14pre         0        0        0        0       3        0        0        1
## 20post        0        2        0        0      64        1        0        0
## 20pre         0        5        0        0      48        0        0        0
## 21post        0       43        0        0       2        1        0        0
## 21pre         0       22        1        0       2        0        0        0
## 23post        0        0        1        0      31       18        0        0
## 23pre         0        0        0        0      11       10        1        0
##        ASV_2892 ASV_2761
## 12post        0        0
## 12pre         1        0
## 14post        2        0
## 14pre         5        0
## 20post        0        0
## 20pre         0        0
## 21post        0        0
## 21pre         0        0
## 23post        0        0
## 23pre         0        0
```

---

## Sample metadata

```r
sample_data(ps)[1:10,]
```

```
##        sample                             fq1                             fq2
## 12post 12post 12post_S19_L001_R1_001.fastq.gz 12post_S19_L001_R2_001.fastq.gz
## 12pre   12pre  12pre_S17_L001_R1_001.fastq.gz  12pre_S17_L001_R2_001.fastq.gz
## 14post 14post  14post_S3_L001_R1_001.fastq.gz  14post_S3_L001_R2_001.fastq.gz
## 14pre   14pre  14pre_S18_L001_R1_001.fastq.gz  14pre_S18_L001_R2_001.fastq.gz
## 20post 20post 20post_S23_L001_R1_001.fastq.gz 20post_S23_L001_R2_001.fastq.gz
## 20pre   20pre  20pre_S14_L001_R1_001.fastq.gz  20pre_S14_L001_R2_001.fastq.gz
## 21post 21post 21post_S13_L001_R1_001.fastq.gz 21post_S13_L001_R2_001.fastq.gz
## 21pre   21pre   21pre_S5_L001_R1_001.fastq.gz   21pre_S5_L001_R2_001.fastq.gz
## 23post 23post  23post_S7_L001_R1_001.fastq.gz  23post_S7_L001_R2_001.fastq.gz
## 23pre   23pre  23pre_S24_L001_R1_001.fastq.gz  23pre_S24_L001_R2_001.fastq.gz
##        ID time group
## 12post 12 post   CPB
## 12pre  12  pre   CPB
## 14post 14 post   CPB
## 14pre  14  pre   CPB
## 20post 20 post   CNT
## 20pre  20  pre   CNT
## 21post 21 post   CNT
## 21pre  21  pre   CNT
## 23post 23 post   CPB
## 23pre  23  pre   CPB
```

---

## Feature metadata

```r
tax_table(ps)[1:10,]
```

```
## Taxonomy Table:     [10 taxa by 7 taxonomic ranks]:
##          Kingdom    Phylum       Class        Order           
## ASV_3837 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## ASV_1432 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## ASV_3347 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## ASV_3098 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## ASV_546  "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## ASV_1097 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## ASV_2257 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## ASV_3871 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## ASV_2892 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
## ASV_2761 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
##          Family            Genus                           Species    
## ASV_3837 "Lachnospiraceae" NA                              NA         
## ASV_1432 "Lachnospiraceae" "Lachnoclostridium"             NA         
## ASV_3347 "Lachnospiraceae" "Lachnoclostridium"             NA         
## ASV_3098 "Lachnospiraceae" "Lachnoclostridium"             NA         
## ASV_546  "Lachnospiraceae" "Lachnoclostridium"             NA         
## ASV_1097 "Lachnospiraceae" "Lachnoclostridium"             NA         
## ASV_2257 "Lachnospiraceae" NA                              NA         
## ASV_3871 "Lachnospiraceae" NA                              NA         
## ASV_2892 "Lachnospiraceae" "Lachnoclostridium"             NA         
## ASV_2761 "Lachnospiraceae" "Lachnospiraceae NK4A136 group" "bacterium"
```
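
The three tables are linked by their names: rows of the data matrix match rows of the sample metadata, and columns of the data matrix match rows of the feature metadata. A minimal base R sketch with made-up toy values (the object names `counts`, `samples`, and `taxa` are assumptions for illustration and are not part of the original analysis):

```r
# Toy sample-by-feature count matrix (values are made up for illustration)
counts <- matrix(
  c(0, 13, 0,
    5, 48, 0,
    43, 2, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("ASV_1", "ASV_2", "ASV_3"))
)

# Sample metadata: one row per sample
samples <- data.frame(
  time  = c("pre", "post", "pre"),
  group = c("CPB", "CPB", "CNT"),
  row.names = c("s1", "s2", "s3")
)

# Feature metadata: one row per ASV
taxa <- data.frame(
  Family = rep("Lachnospiraceae", 3),
  Genus  = c(NA, "Lachnoclostridium", "Lachnoclostridium"),
  row.names = c("ASV_1", "ASV_2", "ASV_3")
)

# The tables can only be analyzed together if their keys agree
keys_match <- identical(rownames(counts), rownames(samples)) &&
  identical(colnames(counts), rownames(taxa))
keys_match
```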

---

## Structured as a phylogenetic tree

.pull-right[
.footnote[
[Salomon JD, Q H, et al. Dis Model Mech (2023) 16 (5): dmm049742.](https://doi.org/10.1242/dmm.049742)
Figure generated using [FigTree](http://tree.bio.ed.ac.uk/software/figtree/).
]
]

---

# Challenges of Modeling Microbiome Data

- High-throughput count data

- High-dimensionality

- Sparsity

- Over-dispersion

- Compositionality

---

## High-throughput count data (Non-negative integers)

* Imaging pixels get summarized as base calls

* Base calls get pooled into sequence/read

* Index sequences are used to group sequences within each sample

* Similar sequences between samples are pooled as proxy for taxonomy inference
]

???

Between sequencing and counting, there is an important aggregation or clustering step involved, which aggregates sequences that belong together. This is the same for all high-throughput sequencing data. For instance, all reads belonging to the same gene (in RNA-Seq), or to the same binding region (ChIP-Seq).
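
The aggregation step can be sketched in base R as a simple tabulation. This toy example pools only identical sequences; real pipelines also have to deal with sequencing errors, which this sketch ignores. The reads and sample labels are made up for illustration.

```r
# Toy demultiplexed reads: each read has a sample label and a sequence
# (sequences and labels are made up for illustration)
reads <- data.frame(
  sample   = c("s1", "s1", "s1", "s2", "s2", "s2", "s2"),
  sequence = c("ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "TTGA", "GGCC")
)

# Pool identical sequences within each sample and count them;
# the result is a sample-by-sequence table of non-negative integers
count_table <- table(reads$sample, reads$sequence)
count_table
```

A sequence that was never observed in a sample (here `GGCC` in `s1`) shows up as a zero count.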

### Challenges in downstream analysis

- .emphasize[High-dimensional (underdetermined)]

- .emphasize[Sparse (zero-inflated)]

- .emphasize[Over-dispersed (large within-group heterogeneities)]

- .emphasize[Discrete Compositionality]
]

---

## High-dimensional (underdetermined)

* High-dimensional data
  + p (number of features, i.e., taxa/ASV) >> n (number of samples)

]

```r
dim(otu_table(ps)) 
```

```
## [1]   24 4578
```

]

![](data:image/png;base64,#./img/otu_table.png)
.footnote[
[Salomon JD, Q H, et al. Dis Model Mech (2023) 16 (5): dmm049742.](https://doi.org/10.1242/dmm.049742)
]

???

Microbiome sequence data sets are high dimensional, with tens of thousands of different features. They are underdetermined: the number of taxa or ASVs is much greater than the number of samples. Simply put, we do not have enough samples to reliably estimate a parameter for every feature.
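
One consequence of p >> n can be illustrated with pure noise. The sketch below uses simulated random numbers with the same dimensions as the table above (24 samples, 4578 features); the seed and the variable names are arbitrary assumptions.

```r
set.seed(1)            # arbitrary seed for reproducibility
n <- 24                # samples, as in dim(otu_table(ps)) above
p <- 4578              # features

# Pure noise: no feature is truly related to the outcome
features <- matrix(rnorm(n * p), nrow = n, ncol = p)
outcome  <- rnorm(n)

# Correlation of the outcome with every feature
r <- as.vector(cor(outcome, features))

# Even with no real signal, the strongest correlation looks impressive
max_abs_r <- max(abs(r))
max_abs_r
```

With thousands of features and only a handful of samples, some feature will correlate strongly with almost any outcome by chance alone.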

---

## The curse of dimensionality

* High dimensionality violates the assumptions of many standard statistical tests
  + "Spurious correlations"

* Common practice is to do dimension reduction using .highlight[distance matrices] between samples 
  + Instead of a matrix of 200 samples × 60,000 ASVs, we can work with a 200 × 200 symmetric matrix 
  + Many choices of distance measures; none is "the best", as they all emphasize different patterns

* Ordination (e.g., PCoA)

* STRONG, VERY STRONG, SUPER STRONG ASSUMPTIONS

???

Incorporate abundance or just use presence/absence?
Partially pool information from phylogenetically related taxa or not?
  
Ordination is a collective term for multivariate techniques that summarize a multidimensional dataset (community data such as species abundance data, i.e., sample-by-species data) and project it onto a low-dimensional space, where similar species and samples are plotted close together and dissimilar species and samples are placed far apart.

---

## Sparse (zero-inflated)

???

.highlight[In microbiome data, sparsity is seen as the absence of many taxa across samples, and zeros are generated in most experiments.] Microbiome taxa abundance, especially at lower taxonomic levels or as ASV counts, often has many zeros and is heavily right-skewed.

---

## Sparse (zero-inflated)

```r
ggplot_truehist(unlist(otu_table(ps)), "ASV table histogram")
```

![](data:image/png;base64,#MQ_microbiome_bioinformatics_files/figure-html/unnamed-chunk-13-1.png)
]

```r
# Proportion of zero entries in the ASV table
sum(otu_table(ps)==0)/(dim(otu_table(ps))[1]*dim(otu_table(ps))[2])
```

```
## [1] 0.6905581
```

### Where do the zeros come from?

* Sampling zero (count zero)
  + A count is used to record the number of times an event occurs.
  + Count zero occurs due to .highlight[non-exhaustive sampling]

* Structural zero (essential zero, genuine zero, absolute zero ...)
  + True negatives
  
]

???

**A count** is used to record the number of times an event occurs. **Count zeros** arise when the event did not occur in one situation but may occur in another. This type of zero is due to a **sampling problem**, because components may be unobserved due to the limited size of the sample or undetectable due to the limits of the technique, such as .highlight[sequencing depth or library size].

A structural zero is “a component which is truly zero, not something recorded as zero simply because the experimental design or the measuring instrument has not been sufficiently sensitive to detect a trace of the part”. A structural zero represents the true absence of a taxon from a sample.
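
The link between sampling zeros and sequencing depth can be sketched analytically. If N reads are drawn from a community, a taxon with true relative abundance p receives zero reads with probability (1 - p)^N. The community profile below is a hypothetical assumption chosen only for illustration.

```r
# Hypothetical community: 200 taxa with a skewed abundance profile
# (the profile is an assumption chosen only for illustration)
n_taxa <- 200
abund  <- 1 / seq_len(n_taxa)        # a few dominant taxa, many rare ones
prob   <- abund / sum(abund)         # true relative abundances, all > 0

# Probability that a taxon with relative abundance p gets zero reads
# when N reads are drawn: (1 - p)^N; average over all taxa
expected_zero_fraction <- function(depth) mean((1 - prob)^depth)

shallow <- expected_zero_fraction(1000)
deep    <- expected_zero_fraction(100000)
c(shallow = shallow, deep = deep)
```

Every taxon in this toy community is truly present, so all of its zeros are sampling zeros, and their expected fraction shrinks as depth increases. Structural zeros would not disappear with deeper sequencing.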

---

## Over-dispersed (large within-group heterogeneities)

.pull-left[
![](data:image/png;base64,#MQ_microbiome_bioinformatics_files/figure-html/unnamed-chunk-15-1.png)
]

.pull-right[
![](data:image/png;base64,#MQ_microbiome_bioinformatics_files/figure-html/unnamed-chunk-16-1.png)

]

???

Common-Scale Variance versus Mean for Salomon example. Each point in each panel represents a different ASV's mean/variance estimate for a biological group.

---

## Over-dispersed (large within-group heterogeneities)

.pull-left[
![](data:image/png;base64,#MQ_microbiome_bioinformatics_files/figure-html/unnamed-chunk-17-1.png)
]

### Over-dispersion in sequencing data

* Library sizes are widely different

* The variance of taxa counts (or count proportions) is larger than what would be predicted by a pre-assumed model
  + e.g., a Poisson distribution for counts or a multinomial distribution for proportions

]

???

It is known that there is over-dispersion in sequencing data. This applies to all high-throughput sequencing data, not just amplicon sequencing but RNA-Seq data as well. This is due to two reasons:

* library sizes of DNA or RNA sequencing are widely different between samples

* read counts are more variable than what is expected according to a Poisson distribution.
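
The second point can be sketched by comparing a Poisson model with one commonly used over-dispersed alternative, the negative binomial. The mean, the dispersion parameter, and the seed below are assumed values for illustration only.

```r
set.seed(1)   # arbitrary seed
mu   <- 50    # common mean count
size <- 2     # negative binomial dispersion parameter (assumed value)

pois_counts <- rpois(10000, lambda = mu)
nb_counts   <- rnbinom(10000, mu = mu, size = size)

# Poisson: variance equals the mean
# Negative binomial: variance = mu + mu^2 / size
theoretical_var <- c(poisson = mu, negbin = mu + mu^2 / size)
observed_var    <- c(poisson = var(pois_counts), negbin = var(nb_counts))
rbind(theoretical_var, observed_var)
```

Both simulated data sets have the same mean, but the negative binomial counts are far more variable, which is the pattern seen in the mean-variance plots.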

---

## Discrete Compositionality

.pull-left[
* Most high-throughput sequencing studies result in .highlight[counts] that can be interpreted as .highlight[relative abundances]

* Relative abundances of different variables from a single sample are .highlight[not independent]
  + They are parts of a whole and provide exclusively .highlight[relative] information between their components
  + The elements of the composition are .highlight[non-negative] and .highlight[sum to unity]

* Many methods transform these data in a way that .highlight[reduces the interdependency] and helps approximate residual normality. It is fundamentally impossible to do such transformations without making .highlight[strong] assumptions, which are .highlight[almost always] violated in microbiome studies
  
]

![](data:image/png;base64,#./img/relative_abun.jpg)

]

???

i.e., within a sample, each relative abundance is a non-negative value between 0 and 1, which adds up to 1

Compositional data quantitatively describe the parts of a whole and provide only relative information between their components. Thus, compositional data exist as .highlight[the proportions or fractions of a whole, or portions of a total], conveying exclusively relative information, and have the following properties: the elements of the composition are .highlight[non-negative and sum to unity]. From a practical point of view, if researchers are really only interested in .highlight[relative frequencies, not the absolute amount of data], then the data are compositional.

The total sum of all component values (sometimes called the library size) is an artifact of the sampling procedure. The library size can be affected by many factors, such as technical variability or differences in experiment-specific abundance.
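
Why the lack of independence matters can be sketched with three hypothetical taxa. The absolute abundances below are made up; only taxon A truly changes between the two conditions.

```r
# Hypothetical absolute abundances of three taxa in two conditions;
# only taxon A truly changes (values are made up for illustration)
before <- c(A = 100, B = 100, C = 100)
after  <- c(A = 400, B = 100, C = 100)

# Closure: convert counts to relative abundances that sum to 1
closure <- function(x) x / sum(x)

rel_before <- closure(before)
rel_after  <- closure(after)

rbind(rel_before, rel_after)
```

Taxa B and C are unchanged in absolute terms, yet their relative abundances drop from 1/3 to 1/6, simply because A increased and the proportions must still sum to 1.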

---

# What Microbiome Bioinformatics Analysis .highlight[Can] and .highlight[Cannot] Answer

- Community diversities

- Downstream statistical analysis

---

## What Microbiome Bioinformatics Analysis .highlight[Can] Answer

### Who is there

- .highlight[Taxonomic composition] (classification and abundance)

### What are they doing

- .highlight[Functional composition] (only with WGS data)

- Expressed functional composition (only with WTS data)

???

This is what bioinformatics can give you at the end of its pipeline: taxa identification and estimates of their abundance. From here, we can do diversity analysis; specifically, we can calculate alpha diversity for each sample using different metrics, and we can visualize beta diversity with a 2-D plot.

---

## Community Diversities

* Alpha diversity: diversity at a single site

* Beta diversity: diversity between sites

???

Analyses of community diversity are widely used in microbiome studies. Three levels of diversity (alpha diversity, beta diversity and gamma diversity) have become central to community ecology (Whittaker 1967, 1969). In microbiome studies, alpha diversity and beta diversity are commonly used.

---

## Alpha Diversity

* Diversity at one spot or community
  + .highlight[Local] diversity
  + Acts like a .highlight[summary statistic] of a single community

* Fundamental questions:
  + How many species are present? (richness)
  + How many species are truly there? (diversity)
  + How evenly are the species distributed relative to each other? (evenness)

]

### Observed OTU, Chao1, Shannon, Simpson, InvSimpson, Fisher, Phylogenetic diversity, ...

+ .emphasize[Observed vs Estimated]

+ .emphasize[Non-phylogenetic vs Phylogenetic]

+ .emphasize[Unweighted vs Weighted] by abundance
]

???

Alpha diversity, one of the basic diversity indices, is defined as .highlight[diversity in one spot or sample]. It acts like a .highlight[summary statistic] of a single population; it is .highlight[local].

Alpha diversity is one of the essential concepts in ecology. The fundamental questions encountered by researchers are: ...

.highlight[Communities that are numerically dominated by one or a few species exhibit low evenness] while .highlight[communities where abundance is distributed equally among species exhibit high evenness].

There are many alpha diversity metrics, and they represent different concepts.
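
How richness and evenness differ can be sketched in base R with two toy communities. The counts are made up, and the helper name `alpha_metrics` is an assumption for illustration; estimated (Chao1) and phylogenetic (PD) metrics are not covered by this sketch.

```r
# Two toy communities with the same richness but different evenness
even_comm   <- c(25, 25, 25, 25)
uneven_comm <- c(97, 1, 1, 1)

alpha_metrics <- function(counts) {
  counts <- counts[counts > 0]          # only observed taxa contribute
  p <- counts / sum(counts)             # relative abundances
  shannon <- -sum(p * log(p))           # Shannon index (natural log)
  c(
    Observed   = length(counts),        # richness
    Shannon    = shannon,
    Simpson    = 1 - sum(p^2),          # Gini-Simpson index
    InvSimpson = 1 / sum(p^2),
    Pielou     = shannon / log(length(counts))  # evenness, 1 = perfectly even
  )
}

rbind(even = alpha_metrics(even_comm), uneven = alpha_metrics(uneven_comm))
```

Both communities have four observed taxa, but the one dominated by a single taxon scores lower on every abundance-weighted metric.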

---

## Alpha Diversity

```r
adiv[1:20, 1:7]
```

```
##         Observed    Chao1  Shannon   Simpson InvSimpson   Fisher        PD
## X12post      772 1069.439 5.153406 0.9789896   47.59538 203.6945  57.36124
## X12pre      1414 1733.444 5.545307 0.9902083  102.12695 244.6882 105.36945
## X14post     1221 1432.757 5.307597 0.9878167   82.07974 202.8327  95.81532
## X14pre       933 1178.000 5.208907 0.9831334   59.28875 193.2584  71.00369
## X20post     1329 1741.248 4.979233 0.9792213   48.12623 230.1734 103.31272
## X20pre      1215 1508.007 5.127905 0.9848628   66.06240 213.5342  98.36759
## X21post     1408 1837.875 5.409402 0.9861229   72.06100 266.6152 110.18600
## X21pre      1532 1905.312 5.440965 0.9873182   78.85342 281.8831 118.72087
## X23post     1622 1903.114 5.145834 0.9713741   34.93343 266.0578  97.02621
## X23pre      1426 1753.434 4.985194 0.9658489   29.28160 252.2428 101.73798
## X44post     1637 1939.544 5.806304 0.9866761   75.05291 338.4221  99.57581
## X44pre      1357 1778.787 5.521896 0.9759116   41.51383 319.3114 115.27071
## X48post     1379 1708.024 5.580212 0.9911054  112.42722 241.2026  85.74055
## X48pre      1524 1780.321 5.520602 0.9891785   92.40903 258.6288 127.79839
## X50post     1567 1882.972 5.471001 0.9866266   74.77548 280.1078 118.23638
## X50pre      1388 1703.269 5.126916 0.9672926   30.57411 265.0344 113.85581
## X54post     1304 1608.603 5.290703 0.9889997   90.90663 217.0067 100.64586
## X54pre      1432 1717.187 5.437186 0.9870839   77.42249 249.1576 112.94817
## X59post     1726 1981.088 5.439747 0.9877155   81.40321 264.2693 171.56131
## X59pre      1339 1632.182 5.515967 0.9897586   97.64324 251.2341 104.31080
```

---

## Beta Diversity

* Community classification (i.e., to differentiate) 
  + leads to .highlight[measuring the similarity] between two community samples

* "Species turnover" 
  + .highlight[a measure of change] in diversity across environmental gradients
  + reflects .highlight[species replacement] as one moves across space or time

* Elucidate .highlight[how much diversity is unique] to a community, or describe .highlight[how many taxa are shared] between communities.

???

.highlight[One important purpose of a microbiome study is to determine whether the microbiome communities can be classified together or need to be separated], to differentiate treatment from control, healthy from diseased, genetic mutant from wild type, etc. The question of .highlight[community classification] leads to .highlight[measuring the similarity] between two community samples (beta diversity). The concept of “similarity” or beta diversity and its measures mainly come from ecology and other fields.

Beta diversity was originally defined as .highlight[a measure of change] in diversity across environmental gradients; in other words, it is .highlight[the rate of change in species composition] from one community to another along gradients (Whittaker 1960). Hence, it .highlight[reflects species replacement] as a community moves across space or time (Magurran 2004). Beta diversity is also known as ‘species turnover’.

In general, beta diversity evaluates differences between two or more communities (Koleff et al. 2003; Lozupone and Knight 2008), thus allowing us to elucidate how much diversity is .highlight[unique to one community], or describe how many taxa are .highlight[shared between communities].

---

## Beta Diversity

* Beta diversity is calculated by using a .highlight[similarity or distance] measure to represent the relationships of samples

  + Jaccard similarity
  + Bray-Curtis dissimilarity
  + UniFrac (Unweighted vs Weighted by abundance)

]

.pull-right[
![](data:image/png;base64,#MQ_microbiome_bioinformatics_files/figure-html/unnamed-chunk-20-1.png)

]

.footnote[
[Salomon JD, Q H, et al. Dis Model Mech (2023) 16 (5): dmm049742.](https://doi.org/10.1242/dmm.049742)
]

???

As beta diversity is a measure of similarity, it is calculated by using a similarity or distance measure/matrix. There are a few popular choices.

The .highlight[selection] of the proper beta diversity measure should be based on the .highlight[microbiome hypothesis being tested]; .highlight[the selection must be tailored to the hypothesis, rather than vice versa]. No single measure is best in all circumstances.

This graph shows the distance between each pair of samples, computed using UniFrac distance.
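
Two of the measures listed on the slide can be sketched in base R. The toy counts and the helper names (`bray_curtis`, `jaccard`, `pairwise`) are assumptions for illustration. Jaccard is written here as a dissimilarity (1 minus the similarity), and UniFrac is omitted because it also needs a phylogenetic tree.

```r
# Toy sample-by-taxon counts (made up for illustration)
x <- rbind(
  s1 = c(10, 5, 0, 0),
  s2 = c(10, 5, 0, 0),
  s3 = c(0, 0, 7, 3),
  s4 = c(5, 0, 5, 0)
)

# Bray-Curtis dissimilarity: uses abundances
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Jaccard dissimilarity on presence/absence: 1 - shared / total observed taxa
jaccard <- function(a, b) {
  a <- a > 0; b <- b > 0
  1 - sum(a & b) / sum(a | b)
}

# Fill an n-by-n symmetric matrix for any pairwise function
pairwise <- function(x, f) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) for (j in seq_len(n)) d[i, j] <- f(x[i, ], x[j, ])
  d
}

bc  <- pairwise(x, bray_curtis)
jac <- pairwise(x, jaccard)
bc
```

Identical samples (`s1`, `s2`) have a dissimilarity of 0, and samples that share no taxa (`s1`, `s3`) have a dissimilarity of 1. Whatever the number of taxa, the result is always a samples-by-samples matrix.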

---

## Beta Diversity

* Beta diversity is calculated by using a .highlight[similarity or distance] measure to represent the relationships of samples
  + Jaccard similarity
  + Bray-Curtis dissimilarity
  + UniFrac (Unweighted vs Weighted by abundance)

* Ordination (visualization)
  + Goal: .highlight[Visualization of similarity among samples]
  + PCA, PCoA, NMDS ...

]

]

.footnote[
[Salomon JD, Q H, et al. Dis Model Mech (2023) 16 (5): dmm049742.](https://doi.org/10.1242/dmm.049742)
]

???

After we have calculated a distance matrix, what do we do with it? We use it for ordination.

The primary aim of ordination is to .highlight[represent multiple samples in a reduced number of] orthogonal (i.e., independent) .highlight[axes]. The importance of ordination axes decreases by order. The first axis of an ordination explains the most variation in the dataset, followed by the second axis, then the third, and so on.

The ordination plots are particularly useful for visualizing the similarity among samples (subjects). For example, .highlight[in the context of beta diversity, samples that are closer in ordination space have species assemblages that are more similar to one another than samples that are further apart in ordination space].

PCoA is a flexible ordination technique that allows the user to choose virtually any distance metric (e.g., Jaccard, Bray-Curtis, Euclidean, etc.) while PCA only uses Euclidean distances.
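
PCoA itself is available in base R as classical multidimensional scaling (`cmdscale`). The sketch below uses simulated toy counts and Euclidean distance only to stay within base R; any other distance matrix could be passed in instead. The seed and object names are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed
# Toy data: 6 samples x 10 taxa of Poisson counts (illustration only)
x <- matrix(rpois(60, lambda = 5), nrow = 6,
            dimnames = list(paste0("s", 1:6), paste0("taxon", 1:10)))

# Any distance can be used; Euclidean here to keep the sketch in base R
d <- dist(x, method = "euclidean")

# Classical multidimensional scaling is PCoA
pcoa <- cmdscale(d, k = 2, eig = TRUE)

coords <- pcoa$points                       # sample coordinates on 2 axes
# Share of variation per axis (eigenvalues are sorted in decreasing order)
explained <- pcoa$eig[1:2] / sum(pcoa$eig[pcoa$eig > 0])
coords
```

The two columns of `coords` are what gets drawn in a 2-D ordination plot, and `explained` gives the share of variation usually shown on the axis labels.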

---

## What Microbiome Bioinformatics Analysis .highlight[Cannot] Answer

### Two main themes in the current microbiome studies

- .emphasize[To characterize the relationship between microbiome features and biological, genetic, clinical or experimental conditions]

- .emphasize[To identify potential biological and environmental factors that are associated with microbiome composition]

.highlight[Goal: to understand mechanisms of host genetic and environmental factors that shape microbiome.]

]

![Interactions among environment, microbiome and host](data:image/png;base64,#img/hypothesis_microbiome.jpg)

]

???
Up to now, we know what kind of answers bioinformatics can give you. We can get the taxonomic composition and estimate taxa abundance. From there, we can calculate alpha diversity using different metrics and we can visualize beta diversity.

There are mainly two themes in the current microbiome studies. The goal of these studies is to ... Insights gained from the studies potentially contribute to the development of therapeutic strategies in modulating the microbiome composition in human diseases.

None of these questions can be answered just by exploring the microbiome alone (bioinformatics alone).

What can help you answer these questions are statistics and hypothesis testing; they help explore the interactions among environment, microbiome and host, which are dynamic and complicated.

---

## What Can We Do With Alpha Diversity?

]

]

.footnote[
[Salomon JD, Q H, et al. Dis Model Mech (2023) 16 (5): dmm049742.](https://doi.org/10.1242/dmm.049742)
]

???
Hypothesis tests can be performed by comparing alpha and beta diversity indices between groups. Depending on whether the data are normally or non-normally distributed, the number of experimental groups, and the experimental conditions, we can use a t-test, analysis of variance (ANOVA), or the corresponding non-parametric test.

.highlight[The statistical hypothesis could be about alpha diversity.] For example, for antibiotic studies, we hypothesize that antibiotic treatment decreases microbial diversity.

In the Salomon JD example, there was a significant decrease in phylogenetic diversity in the CPB post-operative samples compared to the control post-operative samples.
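
A two-group comparison of an alpha diversity index can be sketched in base R. The values below are made-up numbers, not the data from the Salomon et al. study, and the object names are assumptions for illustration.

```r
# Hypothetical phylogenetic diversity values for two groups
# (made-up numbers, NOT the values from the Salomon et al. study)
adiv_toy <- data.frame(
  group = rep(c("CNT", "CPB"), each = 6),
  PD    = c(110.2, 103.3, 118.7, 98.4, 112.9, 104.3,
             57.4,  95.8,  97.0, 71.0,  85.7, 99.6)
)

# Non-parametric two-group comparison (Wilcoxon rank-sum test)
wt <- wilcox.test(PD ~ group, data = adiv_toy)

# Parametric alternative if the values look roughly normal (Welch t-test)
tt <- t.test(PD ~ group, data = adiv_toy)

c(wilcoxon_p = wt$p.value, t_test_p = tt$p.value)
```

Which test is appropriate depends on the design; paired pre/post samples from the same subject, for example, would call for a paired version of the test.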

---

## What Can We Do With Beta Diversity?

```
## Unifrac
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## vegan::adonis2(formula = dist_list[[i]] ~ phyloseq::sample_data(ps.beta)$group)
##                                      Df SumOfSqs      R2      F Pr(>F)  
## phyloseq::sample_data(ps.beta)$group  1  0.25481 0.08254 1.9793  0.034 *
## Residual                             22  2.83214 0.91746                
## Total                                23  3.08695 1.00000                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

]

.footnote[
[Salomon JD, Q H, et al. Dis Model Mech (2023) 16 (5): dmm049742.](https://doi.org/10.1242/dmm.049742)
]

???

.highlight[The statistical hypothesis could also be about beta diversity.] This is PERMANOVA, a multivariate analysis of variance based on .highlight[distance matrices] and .highlight[permutation]. In the Salomon JD example, we tested community similarities between CNT and CPB, i.e., comparing how dissimilar the communities are by group. There was a statistically significant difference in β-diversity between the CPB group and the control group.
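
The idea behind PERMANOVA can be sketched in base R for a single grouping factor: compute a pseudo-F statistic from the distance matrix, then compare it with the values obtained after shuffling the group labels. This is a simplified illustration on simulated toy data, not the `vegan::adonis2` implementation; the helper name `pseudo_f`, the seed, and the simulation settings are assumptions.

```r
set.seed(1)  # arbitrary seed
# Toy data: 2 groups x 6 samples, 5 taxa; group B has a shifted mean
group <- rep(c("A", "B"), each = 6)
x <- rbind(matrix(rpois(30, lambda = 5),  nrow = 6),
           matrix(rpois(30, lambda = 12), nrow = 6))
d <- as.matrix(dist(x))          # any distance matrix could be used here

# Pseudo-F for a one-factor design, computed from distances only
pseudo_f <- function(d, group) {
  n <- nrow(d)
  a <- length(unique(group))
  ss_total <- sum(d[upper.tri(d)]^2) / n
  ss_within <- sum(sapply(unique(group), function(g) {
    idx <- which(group == g)
    dg <- d[idx, idx]
    sum(dg[upper.tri(dg)]^2) / length(idx)
  }))
  ss_between <- ss_total - ss_within
  (ss_between / (a - 1)) / (ss_within / (n - a))
}

f_obs <- pseudo_f(d, group)

# Null distribution: shuffle group labels and recompute the statistic
n_perm <- 999
f_perm <- replicate(n_perm, pseudo_f(d, sample(group)))
p_value <- (sum(f_perm >= f_obs) + 1) / (n_perm + 1)

c(pseudo_F = f_obs, p = p_value)
```

Because the p-value comes from permutations rather than from a reference distribution, the test makes no normality assumption about the distances, but it still assumes that samples are exchangeable under the null hypothesis.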

---

# What else?

### Differential abundance analysis (DA)

### PICRUSt2

### Canonical correlation analysis (CCA)

### Mediation analysis

### Ordination (Unconstrained and Constrained)

---

# There is no such thing as a standard workflow when it comes to downstream statistics

## Let your question guide your stats

## ASSUMPTION, ASSUMPTION, ASSUMPTION

For example,

Would it be important to identify differential abundance of a single ASV, with high power? Use the rawest possible data, and if necessary, compromise by heavier filtering or pooling of unrelated sequences.

Do strain-level differences not matter as much as macro-ecological patterns? Start with distance-matrix based stats, and see if there are specific patterns of interest that could be modeled with more precise methods.

???

Clearly define the objectives of the study before deciding on the study design and statistical analysis.

---

background-image: url(data:image/png;base64,#./img/bcrf.png)
background-size: 55%

Sections of this presentation were sampled and modified from [Statistical Analysis of Microbiome Data with R](https://link.springer.com/book/10.1007/978-981-13-1534-3) by Yinglin Xia, Jun Sun and Ding-Geng Chen (2018).

]