Microbiome analysis in R - Part II
Workshop Overview
The second installment of the microbiome analysis is to
assign taxonomy to the ASVs
construct phylogenetic tree
use the ASV table (ASV abundance in samples) (1), ASV sequences, taxonomic assignment and phylogenetic tree (2, 3 and 4), and sample metadata (5) to construct a
phyloseq
object (5 components) for further data exploration.
Getting Ready
Load R packages we will be using for today’s workshop.
suppressMessages({
library("tidyverse")
library("dada2")
library("gridExtra")
library("phyloseq")
library("Biostrings")
})= 1 no_of_cores
Set your working directory to where your data (fastq files) are located, using combinations of commands listed below.
- Check current working directory
getwd()
- Create a directory
dir.create()
- Set working directory.
setwd()
Load where we left off in day 2.
load("./intermediate/dada2_day_2.rda")
Inspect objects loaded into your environment.
ls()
[1] "create_dt" "dadaFs" "dadaRs" "derepFs"
[5] "derepRs" "errF" "errPlotF" "errPlotR"
[9] "errR" "filt_path" "filtFs" "filtRs"
[13] "fnFs" "fnRs" "FORWARD_TRUNC" "fPlot"
[17] "getN" "mergers" "metadata" "metadata_df"
[21] "no_of_cores" "out" "REVERSE_TRUNC" "rPlot"
[25] "sample.names" "seqtab" "seqtab.nochim" "taxa"
[29] "track" "track.long"
Taxonomy Inference
To infer the taxonomy of our ASVs, we can BLAST our ASVs against a reference database. It breaks the sequences to k-mers and tries to find the closest match and assign taxonomy. Or we can use different type of supervised learning algorithm such as Naive Bayes classifier, which is commonly used.
Naive Bayes Classifier is accurate with relatively low runtime.
Step 1: calculate the probability that ASV 1 belong to Taxon 1, given that what we know about Taxon1’s sequences in the database
Step 2: sort all taxa based on their probabilities
Step 3: assign ASV 1 to which taxon that has the highest probability
The three most common 16S databases are Silva, RDP and GreenGenes. GreenGenes is the smallest among the three databases while Silva is the largest. Silva is the most up to date and its taxonomic assignments are manually curated. Read more about the databases: here.
We will be using the Silva database in this workshop. The DADA2
-formatted reference fastas are downloaded from DADA2
website.
The assignTaxonomy
function in DADA2
implements the RDP naive Bayesian classifier method. The k-mer profile of the sequences to be classified are compared against k-mer profiles of all reference sequences in a training set. The reference sequence with the most similar profile is used to assign taxonomy to the query sequence, and then a bootstrapping approach is used to assess the confidence assignment at each taxonomic level.
KIND OF SLOW.
If you have trouble with this step, load("./intermediate/dada2_assign_taxonomy.rda")
<- "silva_nr99_v138.1_wSpecies_train_set.fa.gz" # don't change this
silva <- assignTaxonomy(seqtab.nochim, silva, multithread=no_of_cores, verbose=TRUE) taxa
Finished processing reference fasta.
Inspect the taxonomic assignments
<- taxa # Removing sequence rownames for display only
taxa.print rownames(taxa.print) <- NULL
head(taxa.print)
Kingdom Phylum Class Order
[1,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
[2,] "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
[3,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
[4,] "Bacteria" "Firmicutes" "Negativicutes" "Veillonellales-Selenomonadales"
[5,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
[6,] "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
Family Genus Species
[1,] "Lactobacillaceae" "Lactobacillus" "amylovorus"
[2,] "Lachnospiraceae" "Agathobacter" NA
[3,] "Prevotellaceae" "Prevotella_9" NA
[4,] "Veillonellaceae" "Megasphaera" "elsdenii"
[5,] "Prevotellaceae" "Prevotella_9" NA
[6,] "Lachnospiraceae" "Lachnospiraceae NK4A136 group" "bacterium"
The return value of assignTaxonomy()
is a character matrix, with each row corresponds to an ASV, and each column corresponds to a taxonomic level.
For the Silva training set, the levels are Kingdom, Phylum, Class, Order, Family, Genus, and we see that all but the last sequence have been classified all the way to species.
Phylogeny Reconstruction
We need a multiple alignment of the ASVs to build a phylogenetic tree. But before that, we need a FASTA file of the ASV sequences.
Let’s print a file from the ASV table (seqtab.nochim):
# Name the sequence file
<- "dada2_nochim.fa"
sequence_outfile print(paste0('Writing FASTA file ', sequence_outfile, ' with ', ncol(seqtab.nochim), ' sequences.'))
for (i in 1:ncol(seqtab.nochim)){
cat(paste0(">ASV_", i), file = sequence_outfile, sep="\n", append = TRUE)
cat(colnames(seqtab.nochim)[i], file = sequence_outfile, sep="\n", append= TRUE)
}
Let’s quickly check that the ASVs FASTA file that we printed looks fine.
system("head dada2_nochim.fa", intern=TRUE) # use system() to run bash command in R
Then, we will generate the phylogenetic tree with MAFFT
and FastTree
. MAFFT
is a multiple sequence alignment program for unix-like operating systems. MAFFT
is useful for hard-to-align sequences such as those containing large gaps (e.g., rRNA sequences containing variable loop regions). Fasttree
is a phylogenetic tree building program that infers approximately-maximum-likelihood phylogenetic trees from alignments.
# Multiple sequence alignment with MAFFT
system("ml mafft; mafft --auto --thread -1 dada2_nochim.fa > dada2_nochim_mafft_msa.fa", intern=TRUE)
# Phylogenetic tree reconstruction with FastTree
system("ml fasttree; fasttree -gtr -nt < dada2_nochim_mafft_msa.fa > dada2_nochim.tree", intern=TRUE)
Introducing Phyloseq
The phyloseq
package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into OTUs or ASVs, especially when there is associated sample data, phylogenetic tree, and/or taxonomic assignment of the OTUs or ASVs.
Load the ASVs FASTA file and phylogenetic tree back to the environment.
<- "dada2_nochim.fa"
seqfile <- "dada2_nochim.tree"
treefile ls()
[1] "create_dt" "dadaFs" "dadaRs" "derepFs"
[5] "derepRs" "errF" "errPlotF" "errPlotR"
[9] "errR" "filt_path" "filtFs" "filtRs"
[13] "fnFs" "fnRs" "FORWARD_TRUNC" "fPlot"
[17] "getN" "mergers" "metadata" "metadata_df"
[21] "no_of_cores" "out" "REVERSE_TRUNC" "rPlot"
[25] "sample.names" "seqfile" "seqtab" "seqtab.nochim"
[29] "taxa" "taxa.print" "track" "track.long"
[33] "treefile"
Let’s have a look at the data first.
metadata_df
: sample meta dataseqtab.nochim
: ASV table after removing chimerataxa
: taxonomic assignments for those ASVsseqfile
is the result of multiple sequence alignment withMAFFT
treefile
is the result of phylogenetic tree reconstruction withFastTree
These are the 5 components of a phyloseq
object. Let’s preview some of our components here.
1:3,1:3] seqtab.nochim[
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG
6post 686
21post 1957
54post 1244
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG
6post 11
21post 308
54post 253
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG
6post 221
21post 650
54post 692
1:3,] taxa[
Kingdom
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Bacteria"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Bacteria"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Bacteria"
Phylum
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Firmicutes"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Firmicutes"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Bacteroidota"
Class
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Bacilli"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Clostridia"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Bacteroidia"
Order
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Lactobacillales"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Lachnospirales"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Bacteroidales"
Family
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Lactobacillaceae"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Lachnospiraceae"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Prevotellaceae"
Genus
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Lactobacillus"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Agathobacter"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Prevotella_9"
Species
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "amylovorus"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG NA
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG NA
1:3,] metadata_df[
sample fq1
6post 6post sub_6post_S2_L001_R1_001.fastq.gz
21post 21post sub_21post_S13_L001_R1_001.fastq.gz
54post 54post sub_54post_S6_L001_R1_001.fastq.gz
fq2 ID time group
6post sub_6post_S2_L001_R2_001.fastq.gz 6 post CNT
21post sub_21post_S13_L001_R2_001.fastq.gz 21 post CNT
54post sub_54post_S6_L001_R2_001.fastq.gz 54 post CNT
Let’s backup our data before we start phyloseq.
.0 <- seqtab.nochim
seqtab.nochim.0 <- taxa
taxa.0 <- metadata_df metadata_df
Step 1, change ASV names, as DADA2
use raw sequence as names
Change colnames(seqtab.nochim)
and rownames(taxa)
to the same: ASV_1, ASV_2…
colnames(seqtab.nochim) <- paste0("ASV_", seq(1:ncol(seqtab.nochim)))
rownames(taxa) <- paste0("ASV_", seq(1:nrow(taxa)))
Change rownames(metadata_df)
and rownames(seqtab.nochim)
to match sample names.
rownames(metadata_df) <- metadata_df$sample
1:3,1:3] seqtab.nochim[
ASV_1 ASV_2 ASV_3
6post 686 11 221
21post 1957 308 650
54post 1244 253 692
1:3,] taxa[
Kingdom Phylum Class Order
ASV_1 "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
ASV_2 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
ASV_3 "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
Family Genus Species
ASV_1 "Lactobacillaceae" "Lactobacillus" "amylovorus"
ASV_2 "Lachnospiraceae" "Agathobacter" NA
ASV_3 "Prevotellaceae" "Prevotella_9" NA
1:3,] metadata_df[
sample fq1
6post 6post sub_6post_S2_L001_R1_001.fastq.gz
21post 21post sub_21post_S13_L001_R1_001.fastq.gz
54post 54post sub_54post_S6_L001_R1_001.fastq.gz
fq2 ID time group
6post sub_6post_S2_L001_R2_001.fastq.gz 6 post CNT
21post sub_21post_S13_L001_R2_001.fastq.gz 21 post CNT
54post sub_54post_S6_L001_R2_001.fastq.gz 54 post CNT
Step 2, use phyloseq()
to make the data object
Five components of a phyloseq
object: tax_table()
, sample_data()
, otu_table()
, phy_tree()
, refseq()
.
<- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE),
ps sample_data(metadata_df),
tax_table(taxa))
ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 1190 taxa and 6 samples ]
sample_data() Sample Data: [ 6 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 1190 taxa by 7 taxonomic ranks ]
# Add rep seqs
<- readDNAStringSet(seqfile)
repseqs <- merge_phyloseq(ps, repseqs)
ps ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 1190 taxa and 6 samples ]
sample_data() Sample Data: [ 6 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 1190 taxa by 7 taxonomic ranks ]
refseq() DNAStringSet: [ 1190 reference sequences ]
# Add tree
<- read_tree(treefile)
tree <- merge_phyloseq(ps, tree)
ps ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 1190 taxa and 6 samples ]
sample_data() Sample Data: [ 6 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 1190 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 1190 tips and 1188 internal nodes ]
refseq() DNAStringSet: [ 1190 reference sequences ]
Bioinformatics Workflow Summary
This is the end of bioinformatic analysis of 16S fastq files. At this point, we formed a phyloseq
object that contains the results of our data processing. From here, we can explore the phyloseq
object for abundance visualization and diversity analysis.
References and Further Readings
MAFFT
website : https://mafft.cbrc.jp/alignment/software/
FastTree
website : http://www.microbesonline.org/fasttree/
- Comparing Denoising methods: https://peerj.com/articles/5364/
- SILVA, RDP, Greengenes, NCBI and OTT — how do these taxonomies compare? https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3501-4
phyloseq
website: http://joey711.github.io/phyloseq/index.htmlphyloseq
paper: https://doi.org/10.1371/journal.pone.0061217
Reproducibility
R session information:
sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Biostrings_2.64.1 GenomeInfoDb_1.32.4 XVector_0.36.0
[4] IRanges_2.30.1 S4Vectors_0.34.0 BiocGenerics_0.42.0
[7] phyloseq_1.40.0 gridExtra_2.3 dada2_1.24.0
[10] Rcpp_1.0.10 lubridate_1.9.2 forcats_1.0.0
[13] stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1
[16] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[19] ggplot2_3.4.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] nlme_3.1-162 bitops_1.0-7
[3] matrixStats_0.63.0 RColorBrewer_1.1-3
[5] tools_4.2.1 vegan_2.6-4
[7] utf8_1.2.3 R6_2.5.1
[9] mgcv_1.8-42 DBI_1.1.3
[11] colorspace_2.1-0 permute_0.9-7
[13] rhdf5filters_1.8.0 ade4_1.7-22
[15] withr_2.5.0 tidyselect_1.2.0
[17] compiler_4.2.1 cli_3.6.1
[19] Biobase_2.56.0 DelayedArray_0.22.0
[21] scales_1.2.1 digest_0.6.31
[23] Rsamtools_2.12.0 rmarkdown_2.21
[25] jpeg_0.1-10 pkgconfig_2.0.3
[27] htmltools_0.5.5 MatrixGenerics_1.8.1
[29] fastmap_1.1.1 htmlwidgets_1.6.2
[31] rlang_1.1.1 rstudioapi_0.14
[33] generics_0.1.3 hwriter_1.3.2.1
[35] jsonlite_1.8.4 BiocParallel_1.30.4
[37] RCurl_1.98-1.12 magrittr_2.0.3
[39] GenomeInfoDbData_1.2.8 biomformat_1.24.0
[41] interp_1.1-4 Matrix_1.5-4
[43] Rhdf5lib_1.18.2 munsell_0.5.0
[45] fansi_1.0.4 ape_5.7-1
[47] lifecycle_1.0.3 stringi_1.7.12
[49] yaml_2.3.7 MASS_7.3-60
[51] SummarizedExperiment_1.26.1 zlibbioc_1.42.0
[53] rhdf5_2.40.0 plyr_1.8.8
[55] grid_4.2.1 parallel_4.2.1
[57] crayon_1.5.2 deldir_1.0-6
[59] lattice_0.21-8 splines_4.2.1
[61] multtest_2.52.0 hms_1.1.3
[63] knitr_1.42 pillar_1.9.0
[65] igraph_1.4.2 GenomicRanges_1.48.0
[67] reshape2_1.4.4 codetools_0.2-19
[69] glue_1.6.2 evaluate_0.21
[71] ShortRead_1.54.0 latticeExtra_0.6-30
[73] data.table_1.14.8 RcppParallel_5.1.7
[75] png_0.1-8 vctrs_0.6.2
[77] tzdb_0.4.0 foreach_1.5.2
[79] gtable_0.3.3 xfun_0.39
[81] survival_3.5-5 iterators_1.0.14
[83] GenomicAlignments_1.32.1 cluster_2.1.4
[85] timechange_0.2.0