Bioinformatic Analysis of Microbiome Data

.title[
# Bioinformatic Analysis of Microbiome Data
]
.author[
### <a href="https://shibalytics.com/">Max Qiu, PhD</a> </br> Bioinformatician/Computational Biologist </br> <a href="mailto:maxqiu@unl.edu" class="email">maxqiu@unl.edu</a>
]
.institute[
### <a href="https://biotech.unl.edu/bioinformatics">Bioinformatics Core Research Facility, Center for Biotechnology</a>
]
.date[
### 04-17-2023
]

---

# [Bioinformatics Core Research Facility](https://biotech.unl.edu/bioinformatics)

## Who are we?

* Part of Nebraska Center for Biotechnology, located at the Beadle Center E204

* Provides expertise and comprehensive analyses for large and small data sets across all NU campuses, as well as for external institutions and private industry

* Fee-for-service business model

* Supported by Nebraska Research Initiative and grant funding through collaborations

]

???

Our mission is to provide faculty, staff and students with our expertise in large-scale, high-dimensional data analysis and statistical analysis in the life sciences. We operate on a fee-for-service business model, and we do insist on authorship or acknowledgement in publications, depending on contribution; the fee is waived if the PI includes us in the grant application.

---

# [Bioinformatics Core Research Facility](https://biotech.unl.edu/bioinformatics)

.pull-left[
<img src="data:image/png;base64,#./img/genome_assembly.PNG" width="85%" style="display: block; margin: auto;" />

]

* Bulk and single-cell RNA-Seq data analysis
* De novo or guided genome or transcriptome assemblies
* Functional genomics (gene enrichment and pathway analysis)
* Microbiome analysis via amplicon sequencing or shotgun metagenomics
* Mass Spec-generated Omics analysis (proteomics, peptidomics, metabolomics, etc.)

We also provide custom analyses, including the development of new analysis workflows or automated pipelines, research databases and web portals, and the integration of large-scale data sets.

]

???
We use the high-performance computing cluster from HCC to assist our clients with bioinformatics and computational needs, such as....

---

# [Bioinformatics Core Research Facility](https://biotech.unl.edu/bioinformatics)

]

???

If you scroll down on our home page, you'll find **appointments and consultation**, which will lead you to our consultation scheduler. **We usually do not charge students for simple stat or data analysis consultations.**

---

# Outline

- ### Background

- ### 16S rRNA Sequencing Approach

- ### Shotgun Metagenomic Sequencing Approach

- ### What Microbiome Bioinformatics Analysis Can Answer

---

# Background

- The big picture

- Metagenomics workflow overview

---

## The Big Picture

To accomplish this, we use a series of .highlight[experimental] and .highlight[computational] techniques to make inferences about the community:

* Metagenomics
  + Marker gene (amplicon) sequencing
  + Shotgun sequencing
* Metatranscriptomics
* Metaproteomics
* Metametabolomics
* ...
]

???

The goal of microbiome studies is to explore the relationship between microbes and their habitat, including humans, and their effect on our health. To accomplish this, we use a variety of molecular biology and computational techniques to make inferences about the community.

In this workshop, we'll be talking about how to use marker genes to characterize a community. There are many other routinely used experimental techniques, such as shotgun metagenomic sequencing and its data analysis, and RNA-Seq data from metatranscriptomics; there are also the newer metaproteomes and metametabolomes.

The point is that there are now many terms, "omes" and "omics", that refer to a systems-biology-based approach to studying a community holistically.

---

## Human Microbiome Studies

* Human gut microbiome consists of ~1200 different species; each person harbors ~200 species

]

]

???

Some of you might have heard the phrase "Most of you is not you", which stems from the observation that most of the cells in your body are non-human cells. The ratio is about 2:1 based on the latest estimate.

For example, we now know that over 1200 different species of microorganisms are found in the human gut (the gut microbiome), and at any given time a person can harbor ~200 species (and possibly more low-abundance species and variants). The microbes in and on your body encode 500X more genes than the human genome, and they collectively weigh about 1.5-2 kg.

---

## Human Microbiome Studies

* Human gut microbiome consists of ~1200 different species; each person harbors ~200 species

* Some functions of microbiome
  + Prevent pathogens from easily colonizing the host
  + Provide metabolic functions that are not encoded by hosts
  + Needed for proper development of host immune system
  + Modulate host immunity to defend against pathogens
  + Microbial signals can induce metabolic gene expression – implication in modulating metabolic diseases

.footnote[
[Morgan XC, Segata N, Huttenhower C. Trends Genet. 2013 Jan;29(1):51-8](https://doi.org/10.1016/j.tig.2012.09.005)
]

]

???

These microorganisms occupy different niches on the human body, and different body sites have different microbiome profiles. Over the years, numerous studies have repeatedly demonstrated that microbiomes play many functional roles for the host. These include:

---

## Animal Microbiome Studies

* Microbes play similar roles in animal hosts
* Manipulation of microbiome can make agriculture more sustainable and efficient
  + Reduce the use of antibiotics and reduce harboring of harmful pathogens

.footnote[
[Peixoto RS, Harkins DM, Nelson KE.</br> Annu Rev Anim Biosci. 2021 Feb 16;9:289-311.](https://doi.org/10.1146/annurev-animal-091020-075907)
]

]

]

???

While the human microbiome has been extensively studied, studies of animal microbiomes have important implications for .highlight[human, animal and environmental health and preservation]. Animal microbiomes have been shown to serve similar roles in animal hosts as they do in human hosts. Manipulating the microbiome can .highlight[make agricultural practices more sustainable and efficient]. This is a recent review of microbiome studies of animal hosts.

---

## Environmental Microbiome Studies

* Microbial activities can shape planetary health and macroorganism health
  + Different metabolic capacities are found in different environments

* The microbiomes can be used to predict the biogeochemical conditions of each habitat 
  + Ice core microbiomes reveal microbes and their environmental parameters under historical conditions
  
.footnote[
[Dinsdale EA, et al. Nature. 2008 Apr 3;452(7187):629-32](https://doi.org/10.1038/nature06810)
]
]

]

???

Many of the early microbiome studies looked at microbial composition and function in natural habitats, as microbial activities can shape .highlight[planetary health] (e.g. by influencing nutrient cycles) and .highlight[macroorganism health] (by influencing the environment of these organisms). For example, in one of the early metagenomics studies, Dinsdale et al. looked at 9 different biomes and found that different metabolic capacities characterized each environment.

Another study, looking at a glacier ice core, enabled researchers to look back in time and see what microbes and their environmental parameters were like 15,000 years ago.

The more we learn about microbial communities, the more we see that affecting one community through .highlight[anthropogenic activities] (from the use of antibiotics to climate change, deforestation and agricultural activities) can have a widespread effect that ripples through the .highlight[interconnected] microbiota.
Therefore, more sophisticated omics platforms and analyses are needed to tease out these complex interactions.

---

## Experimental Techniques

### Culture (colonies)

* Artificial

* Bacteria in nature are not found in isolation but in mixed communities

* Bacterial populations cooperate and compete with each other and interact with host and environment

* Most members of a microbiome cannot be cultivated

]

???
Traditional microbiome studies examine microbes in pure culture (colonies). Pure culture (mono-isolate) is .highlight[artificial]. Bacteria in nature are not found in isolation but in .highlight[mixed] communities, and they .highlight[cooperate and compete] with each other and interact with host and environment.

I've heard that <1% of organisms across many habitats are cultivable, which is controversial and probably not true for habitats such as human body sites. In any case, it would be nearly .highlight[impossible to culture all constituents] of a given microbiome sample.

### High-throughput sequencing technologies-metagenomics

* Metagenomics offers an effective way to profile the structure and function of microbial communities

![](data:image/png;base64,#img/sequencing_tech.png)

]

???

Then came high-throughput sequencing technologies, which enabled metagenomics as an effective, if imperfect, way to profile the structure and function of microbial communities. There are a lot of microbiome papers, close to 100,000 in the last 20 years.

The key enabler of microbiome studies is the advancement in sequencing technologies. We can divide the platforms into 3 generations.

* The first generation is the Sanger sequencers. Their most advanced form is the capillary sequencer shown here, which was the workhorse of the Human Genome Project.

* The second (next) generation sequencers, such as Ion Torrent, Roche 454 (both rarely used now) and Illumina, use massively parallel sequencing to achieve high throughput. Read lengths, however, are typically short (~300 bp).

* The third generation sequencers, such as PacBio and Oxford Nanopore (ONT), enable longer reads (10 kb-100 kb) to be generated from single molecules. However, sequencing error rates are significantly higher for 3rd generation sequencers than for Sanger or NGS.

---

## Metagenomics Workflow Overview

???
On this slide, you can see an overview of a metagenomics workflow, from sample collection to optional enrichment of microbes (e.g. selective media, depletion of host genomes) to extraction of the genetic material for targeted or shotgun sequencing. We will go over these two types of sequencing in subsequent slides and compare them.

---

# 16S rRNA Sequencing Approach

- rRNA-the universal phylogenetic markers

- The Advantages of 16S rRNA Sequencing

- Bioinformatic Analysis of 16S rRNA Sequencing Data

- Limitations of 16S rRNA Sequencing Approach

---

## rRNA-the Universal Phylogenetic Markers

* rRNAs play critical roles in protein translation

* rRNAs are relatively conserved and rarely acquired horizontally

* Behave like a molecular clock
  + Proposed by Carl Woese in the 1970s for use in taxonomic classification
  + Useful for phylogenetic analysis
  + Used to build tree-of-life (placing organisms in a single phylogenetic tree)

* 16S rRNA most commonly used
]

]

???

There are a few reasons that rRNA was chosen as the marker gene.

First of all, rRNA genes are present in all living organisms.

Because of their critical roles in protein translation, the sequences (and structures) of rRNAs are highly conserved and rarely horizontally acquired. Being .highlight[highly conserved] means that a .highlight[tree of life] can be constructed to link together all known bacteria.

They behave like a .highlight[molecular clock], meaning that they provide good signals for phylogenetic reconstruction. Carl Woese proposed in the 70s to use these molecules for taxonomic classification. They have been used ever since and are the most popular phylogenetic markers, with good databases available for them.

This is a schematic of the prokaryotic ribosome, which contains ~45 proteins (in blue) and 3 rRNAs (not translated, in pink). One of these 3 is the 16S rRNA of the small subunit.

---

## Advantages of 16S rRNA Sequencing

* The 16S rRNA gene is ubiquitous

* Contains highly conserved regions

* Well-studied primer sets are available

* Well-curated databases of reference sequences and taxonomies

* Relatively cheap and simple with mature analysis pipelines
]

* Should have sufficient .highlight[resolution] to differentiate the different communities you want to study

* Availability of a good .highlight[reference DB] for the samples you study is necessary

* .highlight[Single copy gene] preferred but not always possible

]

???

The marker should have sufficient resolution to differentiate the different communities you want to study (16S differentiates genera and some species; 18S and ITS rRNA for fungi);

A reference database is needed for taxonomic assignment so availability of a good reference DB for the samples you study is necessary;

Single copy gene preferred but not always possible (rRNAs not single copy genes);

When comparing across studies, need to standardize markers (different V regions of 16S gene), experimental protocols, and bioinformatic pipelines.

---

## Advantages of 16S rRNA Sequencing

* The 16S rRNA gene is ubiquitous

* Contains highly conserved regions

* .emphasize[Well-studied primer sets are available]

* Well-curated databases of reference sequences and taxonomies

* Relatively cheap and simple with mature analysis pipelines
]

16S rRNA contains **nine hypervariable (V) regions** (V1–V9); their high variability makes them suitable as unique identifiers for fine-level taxonomic classification.

Currently, the V1–V3, V4, and V4–V5 regions are most commonly used because research has shown that each can provide genus-level sequence resolution, with the V1–V3 or V1–V4 regions providing more accurate estimates than others.
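
Primer sets that target these regions are usually written with IUPAC degenerate bases so that one primer can bind the slightly different conserved flanks found in different taxa. The sketch below shows one way to check in silico whether a degenerate primer matches a sequence exactly. The primer `GTGYCAGCM` and the three template sequences are made-up toy values for illustration, not a published primer or real 16S sequences, and the function names are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to the bases each code stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Translate a degenerate primer into a regular expression string."""
    parts = []
    for code in primer.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_primer(primer, sequence):
    """Return 0-based start positions where the primer matches exactly."""
    # The lookahead lets overlapping matches be reported as well.
    pattern = "(?=" + primer_to_regex(primer) + ")"
    return [m.start() for m in re.finditer(pattern, sequence.upper())]


# Toy primer and templates (assumptions, for illustration only).
toy_primer = "GTGYCAGCM"
templates = {
    "template_1": "AAGTGCCAGCAGG",  # Y = C, M = A
    "template_2": "AAGTGTCAGCCGG",  # Y = T, M = C
    "template_3": "AAGTGACAGCAGG",  # A at the Y position: no match
}

for name, seq in templates.items():
    print(name, find_primer(toy_primer, seq))
```

A template with no hit, like `template_3` here, would be amplified poorly or not at all. Real primer evaluation also allows for mismatches and checks the reverse primer on the opposite strand, which this sketch leaves out.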

]

???

By choosing an appropriate primer set for the microbial community being studied, the problem of PCR bias can be alleviated. With efficient computational algorithms, chimeric reads can be readily detected. **Only amplify what you want (no host contamination)**

---

## Advantages of 16S rRNA Sequencing

* The 16S rRNA gene is ubiquitous

* Contains highly conserved regions

* Well-studied primer sets are available

* .emphasize[Well-curated databases of reference sequences and taxonomies]

* Relatively cheap and simple with mature analysis pipelines
]

1. Ribosomal Database Project (RDP) (Cole et al. 2009)
2. Greengenes (DeSantis et al. 2006)
3. SILVA (Pruesse et al. 2007)

]

???

Greengenes is the smallest of the three databases, while SILVA is the largest. SILVA is the most up to date, and its taxonomic assignments are manually curated.

---

## Advantages of 16S rRNA Sequencing

* The 16S rRNA gene is ubiquitous

* Contains highly conserved regions

* Well-studied primer sets are available

* Well-curated databases of reference sequences and taxonomies

* .emphasize[Relatively cheap and simple with mature analysis pipelines]
]

]

---

## Bioinformatics Data Analysis Tools

![](data:image/png;base64,#img/tools.png)

???

All packages are open source and have online tutorials and forums. QIIME and mothur are **self-contained** pipelines. All of these tools can take 16S rRNA gene sequencing data from raw sequence reads to an ASV/OTU abundance table, enable comparison of multiple samples, and can use the SILVA 16S rRNA gene reference database. QIIME 2, available as of 2018, is a completely redesigned and rewritten version of the QIIME microbiome analysis pipeline.

---

## Bioinformatic Workflow Overview

---

## Limitations of 16S rRNA Sequencing Approach

### 16S rRNA

* Variation of 16S copy number across most organisms

* Insufficient resolution at the finest (strain) levels
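
To see why copy number variation matters, consider that a genome carrying seven copies of the 16S gene contributes seven times as many amplicon templates per cell as a genome carrying one. A minimal sketch of a copy-number correction follows; the taxon names, read counts and copy numbers are invented for illustration, and the function name is hypothetical.

```python
def copy_number_corrected(read_counts, copy_numbers):
    """Divide reads by 16S copies per genome, then renormalize to fractions."""
    adjusted = {taxon: read_counts[taxon] / copy_numbers[taxon]
                for taxon in read_counts}
    total = sum(adjusted.values())
    return {taxon: value / total for taxon, value in adjusted.items()}


# Invented example values (assumptions, not measured data).
reads = {"taxon_A": 700, "taxon_B": 300}
copies = {"taxon_A": 7, "taxon_B": 1}

raw_total = sum(reads.values())
raw = {taxon: n / raw_total for taxon, n in reads.items()}
corrected = copy_number_corrected(reads, copies)

print("raw read fractions:      ", raw)
print("copy-number corrected:   ", corrected)
```

In this toy case taxon_A makes up 70% of the reads but only 25% of the estimated cells. In practice the copy number of many taxa is unknown and has to be predicted from relatives, so a correction like this carries its own uncertainty.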

### Amplicon sequencing

* Amplify rRNA marker via PCR
  + Primer bias: may .highlight[fail to detect some taxa], reducing observed microbial diversity
  
]

* Artificial sequences
  + Sequencing errors and incorrectly assembled amplicons (i.e., .highlight[chimeras])
  + .highlight[Incorrectly assigning OTU/ASV], and the 16S locus being transferred between distantly related taxa

* Only taxonomic composition, no functional composition
  + [PICRUSt2](https://www.nature.com/articles/s41587-020-0548-6#citeas): leveraging genome databases, .highlight[predicts] community gene abundances, for .highlight[hypothesis-generating] purposes only

]

???

The limitations are two-sided; there are limitations with 16S rRNA itself, ...

There are also limitations with amplicon sequencing technology:

* Amplicon sequencing of rRNA markers via PCR may .highlight[fail to detect some taxa] because of various biases associated with PCR, which may substantially reduce observed microbial diversity. Groups are amplified differentially depending on primer affinity.

* 16S rRNA sequencing .highlight[overestimates the community diversity or species abundance] due to .highlight[artificial sequences]. Artificial sequences can occur for several reasons.

* Amplicon sequencing .highlight[only discerns the taxonomic composition] of the microbiome community. It cannot directly analyze the biological functions of associated taxa.
  + .highlight[PICRUSt2]: leveraging genome databases, .highlight[predicts] community gene abundances, for .highlight[hypothesis-generating] purposes only

PICRUSt predicts community gene abundances

* no active molecules (proteins/metabolites)
* accuracy could be low on individual genomes 
* accuracy based on annotated well studied enzymes and pathways
* Hypothesis-generating; followup studies are needed to validate findings

---

# Shotgun Metagenomic Sequencing Approach

- Advantages of Shotgun Metagenomic Sequencing

- Bioinformatic Analysis of Shotgun Metagenomic Data

- Challenges of Analyzing Shotgun Metagenomic Data

---

## Advantages of Shotgun Metagenomic Sequencing

* Shotgun sequencing not only discerns the taxonomic composition of the microbiome community, but also .highlight[provides information about microbial functions associated with different conditions], such as health and disease, treatment and control, or wild type and knockout.

---

## Bioinformatic Workflow Overview

---

## Challenges of Analyzing Shotgun Metagenomic Data

* Shotgun sequencing datasets are large and complex, which makes it .highlight[difficult to determine the genome from which a read was derived], poses computational problems, and complicates sequence alignment.

* .highlight[Containing unwanted host DNA and vulnerable to contamination]. Filtering out unwanted host DNA requires both molecular and bioinformatic methods. Identifying and removing contaminating metagenomic sequences is especially problematic and requires dedicated tools.

* The large metagenomic datasets make it even more .highlight[challenging to identify the significantly different taxa] between communities.

* The cost of whole-genome sequencing is still high, especially in complex communities or when host DNA greatly outnumbers microbial DNA.

---

## Summary

#### Targeted Sequencing (i.e., amplicon sequencing, marker gene profiling)

- Use PCR primers to target specific regions of genome, e.g., 16S rRNA

- Obtain .highlight[taxonomic] information, .highlight[no metabolic functional] information

- Less expensive (~ $100 per sample), relatively smaller computational needs

- Relatively free of host DNA contamination

- Able to sequence deeper and broader; majority of genes can be assigned to at least phylum level

- Good to answer: .highlight["Who is there?"]

]

#### Shotgun Sequencing

- Randomly sequence all the DNA in the sample

- Obtain .highlight[taxonomic and functional] information

- Relatively expensive (~ $1000 per sample); usually requires huge computational resources

- Prone to host DNA contamination; the source organism of each gene is not known (a bag of genes)

- Many more unassigned gene fragments (“wasted” data)

- Good to answer: .highlight["Who is there and what are they doing?"]
]

---

## What Microbiome Bioinformatics Analysis .highlight[Can] Answer

### Who is there

- .highlight[Taxonomic composition] (classification and abundance)
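
As a rough illustration of what "classification and abundance" means in practice, the sketch below collapses per-ASV read counts for one sample into genus-level relative abundances. The ASV identifiers, counts and genus assignments are invented for this example, and `genus_profile` is a hypothetical helper, not part of any pipeline named in these slides.

```python
from collections import defaultdict


def genus_profile(asv_counts, asv_genus):
    """Sum ASV read counts per genus and convert them to fractions."""
    totals = defaultdict(int)
    for asv, count in asv_counts.items():
        # ASVs without a classification are pooled as "Unassigned".
        totals[asv_genus.get(asv, "Unassigned")] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("sample has no reads")
    return {genus: count / grand_total for genus, count in totals.items()}


# Invented counts for one sample (assumptions, not real data).
asv_counts = {"ASV1": 40, "ASV2": 20, "ASV3": 30, "ASV4": 10}
# Invented classification results; ASV4 is left unclassified on purpose.
asv_genus = {"ASV1": "Bacteroides", "ASV2": "Bacteroides", "ASV3": "Prevotella"}

profile = genus_profile(asv_counts, asv_genus)
for genus, fraction in sorted(profile.items()):
    print(f"{genus}: {fraction:.2f}")
```

Working with fractions rather than raw counts makes samples with different sequencing depths comparable, although relative abundances are compositional and need to be interpreted with that in mind.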

### What are they doing

- .highlight[Functional composition]

---

## Introduction to Community Diversities

* Richness - the count of "things"

* Diversity - the .highlight[count] of "things" with some consideration of .highlight[evenness]
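
The difference between richness and diversity is easiest to see with numbers. The sketch below computes richness, the Shannon index (one common diversity measure, used here as an example rather than the only option) and Pielou's evenness for two invented communities that have the same richness but very different evenness. All counts are made up for illustration.

```python
import math


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for n in counts if n > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over observed taxa."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def pielou_evenness(counts):
    """Shannon index divided by its maximum, ln(richness)."""
    s = richness(counts)
    # Evenness is undefined for a single taxon; 0.0 is returned here by choice.
    return shannon(counts) / math.log(s) if s > 1 else 0.0


# Two invented communities with four taxa each (assumed counts).
even_community = [25, 25, 25, 25]
uneven_community = [97, 1, 1, 1]

for name, counts in [("even", even_community), ("uneven", uneven_community)]:
    print(name,
          "richness =", richness(counts),
          "shannon =", round(shannon(counts), 3),
          "evenness =", round(pielou_evenness(counts), 3))
```

Both communities have a richness of 4, but the Shannon index is much lower for the community dominated by a single taxon, which is the "consideration of evenness" mentioned above.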

???
Analyses of community diversity are widely used in microbiome community studies.

---

## Introduction to Community Diversities

* Alpha diversity: diversity at a single site

* Beta diversity: diversity between sites
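
Beta diversity is usually summarized as a pairwise dissimilarity between sites. The sketch below uses Bray-Curtis dissimilarity, which is one widely used choice and an assumption of this example rather than a measure named on this slide. The site names and counts are invented.

```python
def bray_curtis(site_a, site_b):
    """Bray-Curtis dissimilarity between two count profiles (0 = identical, 1 = no shared taxa)."""
    taxa = set(site_a) | set(site_b)
    # Sum of the smaller count for every taxon, i.e. the shared abundance.
    shared = sum(min(site_a.get(t, 0), site_b.get(t, 0)) for t in taxa)
    total = sum(site_a.values()) + sum(site_b.values())
    if total == 0:
        raise ValueError("both sites are empty")
    return 1 - 2 * shared / total


# Invented count profiles (assumptions, not real data).
site_1 = {"taxon_A": 10, "taxon_B": 10}
site_2 = {"taxon_A": 10, "taxon_B": 10}
site_3 = {"taxon_A": 10, "taxon_C": 10}
site_4 = {"taxon_C": 20}

print("site_1 vs site_2:", bray_curtis(site_1, site_2))
print("site_1 vs site_3:", bray_curtis(site_1, site_3))
print("site_1 vs site_4:", bray_curtis(site_1, site_4))
```

Computing this value for every pair of samples gives a distance matrix, which is the usual input for ordination methods such as PCoA. Other measures, for example Jaccard or UniFrac, weight presence, abundance or phylogeny differently.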

???

Three levels of diversity (alpha diversity, beta diversity and gamma diversity) have become central to community ecology (Whittaker 1967, 1969). In microbiome studies, alpha diversity and beta diversity are the ones commonly used.

---

background-image: url(data:image/png;base64,#./img/bcrf.png)
background-size: 55%

Sections of this presentation were sampled and modified from [Canadian Bioinformatics Workshop](https://bioinformatics.ca/) "Microbiome Analysis" workshop materials.

]