Microbiome analysis in R - Part II
For a free consultation, please visit our consultation scheduler!
Workshop Overview
The second installment of the microbiome analysis is to
assign taxonomy to the ASVs
construct phylogenetic tree
use the ASV table (ASV abundance in samples) (1), ASV sequences, taxonomic assignment and phylogenetic tree (2, 3 and 4), and sample metadata (5) to construct a
phyloseqobject (5 components) for further data exploration.
Getting Ready
Load R packages we will be using for today’s workshop.
suppressMessages({
library("tidyverse")
library("dada2")
library("gridExtra")
library("phyloseq")
library("Biostrings")
})
no_of_cores = 1Set your working directory to where your data (fastq files) are located, using combinations of commands listed below.
- Check current working directory
getwd()- Create a directory
dir.create()- Set working directory.
setwd()Load where we left off in day 2.
load("./intermediate/dada2_day_2.rda")Inspect objects loaded into your environment.
ls() [1] "create_dt" "dadaFs" "dadaRs" "derepFs"
[5] "derepRs" "errF" "errPlotF" "errPlotR"
[9] "errR" "filt_path" "filtFs" "filtRs"
[13] "fnFs" "fnRs" "FORWARD_TRUNC" "fPlot"
[17] "getN" "mergers" "metadata" "metadata_df"
[21] "no_of_cores" "out" "REVERSE_TRUNC" "rPlot"
[25] "sample.names" "seqtab" "seqtab.nochim" "taxa"
[29] "track" "track.long"
Taxonomy Inference
To infer the taxonomy of our ASVs, we can BLAST our ASVs against a reference database. It breaks the sequences to k-mers and tries to find the closest match and assign taxonomy. Or we can use different type of supervised learning algorithm such as Naive Bayes classifier, which is commonly used.
Naive Bayes Classifier is accurate with relatively low runtime.
Step 1: calculate the probability that ASV 1 belong to Taxon 1, given that what we know about Taxon1’s sequences in the database
Step 2: sort all taxa based on their probabilities
Step 3: assign ASV 1 to which taxon that has the highest probability
The three most common 16S databases are Silva, RDP and GreenGenes. GreenGenes is the smallest among the three databases while Silva is the largest. Silva is the most up to date and its taxonomic assignments are manually curated. Read more about the databases: here.
We will be using the Silva database in this workshop. The DADA2-formatted reference fastas are downloaded from DADA2 website.
The assignTaxonomy function in DADA2 implements the RDP naive Bayesian classifier method. The k-mer profile of the sequences to be classified are compared against k-mer profiles of all reference sequences in a training set. The reference sequence with the most similar profile is used to assign taxonomy to the query sequence, and then a bootstrapping approach is used to assess the confidence assignment at each taxonomic level.
KIND OF SLOW.
If you have trouble with this step, load("./intermediate/dada2_assign_taxonomy.rda")
silva <- "silva_nr99_v138.1_wSpecies_train_set.fa.gz" # don't change this
taxa <- assignTaxonomy(seqtab.nochim, silva, multithread=no_of_cores, verbose=TRUE)Finished processing reference fasta.
Inspect the taxonomic assignments
taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print) Kingdom Phylum Class Order
[1,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
[2,] "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
[3,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
[4,] "Bacteria" "Firmicutes" "Negativicutes" "Veillonellales-Selenomonadales"
[5,] "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
[6,] "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
Family Genus Species
[1,] "Lactobacillaceae" "Lactobacillus" "amylovorus"
[2,] "Lachnospiraceae" "Agathobacter" NA
[3,] "Prevotellaceae" "Prevotella_9" NA
[4,] "Veillonellaceae" "Megasphaera" "elsdenii"
[5,] "Prevotellaceae" "Prevotella_9" NA
[6,] "Lachnospiraceae" "Lachnospiraceae NK4A136 group" "bacterium"
The return value of assignTaxonomy() is a character matrix, with each row corresponds to an ASV, and each column corresponds to a taxonomic level.
For the Silva training set, the levels are Kingdom, Phylum, Class, Order, Family, Genus, and we see that all but the last sequence have been classified all the way to species.
Phylogeny Reconstruction
We need a multiple alignment of the ASVs to build a phylogenetic tree. But before that, we need a FASTA file of the ASV sequences.
Let’s print a file from the ASV table (seqtab.nochim):
Below two code blocks to print out ASV table should only be run once. If you run it again, it will keep appending the ASV table at the end of existing file. Duplication will cause error in Fasttree step.
# Name the sequence file
sequence_outfile <- "dada2_nochim.fa"
print(paste0('Writing FASTA file ', sequence_outfile, ' with ', ncol(seqtab.nochim), ' sequences.'))for (i in 1:ncol(seqtab.nochim)){
cat(paste0(">ASV_", i), file = sequence_outfile, sep="\n", append = TRUE)
cat(colnames(seqtab.nochim)[i], file = sequence_outfile, sep="\n", append= TRUE)
}Remember the columns of the ASV table are the sequence variants. Therefore we can just print the column as the sequence after the FASTA description line.
Let’s quickly check that the ASVs FASTA file that we printed looks fine.
system("head dada2_nochim.fa", intern=TRUE) # use system() to run bash command in RThen, we will generate the phylogenetic tree with MAFFT and FastTree. MAFFT is a multiple sequence alignment program for unix-like operating systems. MAFFT is useful for hard-to-align sequences such as those containing large gaps (e.g., rRNA sequences containing variable loop regions). Fasttree is a phylogenetic tree building program that infers approximately-maximum-likelihood phylogenetic trees from alignments.
# Multiple sequence alignment with MAFFT
system("ml mafft; mafft --auto --thread -1 dada2_nochim.fa > dada2_nochim_mafft_msa.fa", intern=TRUE)
# Phylogenetic tree reconstruction with FastTree
system("ml fasttree; fasttree -gtr -nt < dada2_nochim_mafft_msa.fa > dada2_nochim.tree", intern=TRUE)Introducing Phyloseq
The phyloseq package is a tool to import, store, analyze, and graphically display complex phylogenetic sequencing data that has already been clustered into OTUs or ASVs, especially when there is associated sample data, phylogenetic tree, and/or taxonomic assignment of the OTUs or ASVs.
Load the ASVs FASTA file and phylogenetic tree back to the environment.
seqfile <- "dada2_nochim.fa"
treefile <- "dada2_nochim.tree"
ls() [1] "create_dt" "dadaFs" "dadaRs" "derepFs"
[5] "derepRs" "errF" "errPlotF" "errPlotR"
[9] "errR" "filt_path" "filtFs" "filtRs"
[13] "fnFs" "fnRs" "FORWARD_TRUNC" "fPlot"
[17] "getN" "mergers" "metadata" "metadata_df"
[21] "no_of_cores" "out" "REVERSE_TRUNC" "rPlot"
[25] "sample.names" "seqfile" "seqtab" "seqtab.nochim"
[29] "taxa" "taxa.print" "track" "track.long"
[33] "treefile"
Let’s have a look at the data first.
metadata_df: sample meta dataseqtab.nochim: ASV table after removing chimerataxa: taxonomic assignments for those ASVsseqfileis the result of multiple sequence alignment withMAFFTtreefileis the result of phylogenetic tree reconstruction withFastTree
These are the 5 components of a phyloseq object. Let’s preview some of our components here.
seqtab.nochim[1:3,1:3] GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG
6post 686
21post 1957
54post 1244
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG
6post 11
21post 308
54post 253
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG
6post 221
21post 650
54post 692
taxa[1:3,] Kingdom
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Bacteria"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Bacteria"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Bacteria"
Phylum
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Firmicutes"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Firmicutes"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Bacteroidota"
Class
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Bacilli"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Clostridia"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Bacteroidia"
Order
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Lactobacillales"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Lachnospirales"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Bacteroidales"
Family
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Lactobacillaceae"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Lachnospiraceae"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Prevotellaceae"
Genus
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "Lactobacillus"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG "Agathobacter"
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG "Prevotella_9"
Species
GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG "amylovorus"
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGCGAAGAAGTATTTCGGTATGTAAAGCTCTATCAGCAGGGAAGATAATGACGGTACCTGACTAAGAAGCACCGGCTAAATACGTGCCAGCAGCCGCGGTAATACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGTGCGGCAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGTCGTACTAGAGTGTCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTACTGGACGATAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAG NA
GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG NA
metadata_df[1:3,] sample fq1
6post 6post sub_6post_S2_L001_R1_001.fastq.gz
21post 21post sub_21post_S13_L001_R1_001.fastq.gz
54post 54post sub_54post_S6_L001_R1_001.fastq.gz
fq2 ID time group
6post sub_6post_S2_L001_R2_001.fastq.gz 6 post CNT
21post sub_21post_S13_L001_R2_001.fastq.gz 21 post CNT
54post sub_54post_S6_L001_R2_001.fastq.gz 54 post CNT
Always back up your data after and/or before analysis.
Let’s backup our data before we start phyloseq.
seqtab.nochim.0 <- seqtab.nochim
taxa.0 <- taxa
metadata_df.0 <- metadata_dfStep 1, change ASV names, as DADA2 use raw sequence as names
Change colnames(seqtab.nochim) and rownames(taxa) to the same: ASV_1, ASV_2…
colnames(seqtab.nochim) <- paste0("ASV_", seq(1:ncol(seqtab.nochim)))
rownames(taxa) <- paste0("ASV_", seq(1:nrow(taxa)))Change rownames(metadata_df) and rownames(seqtab.nochim) to match sample names.
rownames(metadata_df) <- metadata_df$sampleMake sure the names match.
seqtab.nochim[1:3,1:3] ASV_1 ASV_2 ASV_3
6post 686 11 221
21post 1957 308 650
54post 1244 253 692
taxa[1:3,] Kingdom Phylum Class Order
ASV_1 "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
ASV_2 "Bacteria" "Firmicutes" "Clostridia" "Lachnospirales"
ASV_3 "Bacteria" "Bacteroidota" "Bacteroidia" "Bacteroidales"
Family Genus Species
ASV_1 "Lactobacillaceae" "Lactobacillus" "amylovorus"
ASV_2 "Lachnospiraceae" "Agathobacter" NA
ASV_3 "Prevotellaceae" "Prevotella_9" NA
metadata_df[1:3,] sample fq1
6post 6post sub_6post_S2_L001_R1_001.fastq.gz
21post 21post sub_21post_S13_L001_R1_001.fastq.gz
54post 54post sub_54post_S6_L001_R1_001.fastq.gz
fq2 ID time group
6post sub_6post_S2_L001_R2_001.fastq.gz 6 post CNT
21post sub_21post_S13_L001_R2_001.fastq.gz 21 post CNT
54post sub_54post_S6_L001_R2_001.fastq.gz 54 post CNT
Step 2, use phyloseq() to make the data object
Five components of a phyloseq object: tax_table(), sample_data(), otu_table(), phy_tree(), refseq().
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=FALSE),
sample_data(metadata_df),
tax_table(taxa))
psphyloseq-class experiment-level object
otu_table() OTU Table: [ 1190 taxa and 6 samples ]
sample_data() Sample Data: [ 6 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 1190 taxa by 7 taxonomic ranks ]
# Add rep seqs
repseqs <- readDNAStringSet(seqfile)
ps <- merge_phyloseq(ps, repseqs)
psphyloseq-class experiment-level object
otu_table() OTU Table: [ 1190 taxa and 6 samples ]
sample_data() Sample Data: [ 6 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 1190 taxa by 7 taxonomic ranks ]
refseq() DNAStringSet: [ 1190 reference sequences ]
# Add tree
tree <- read_tree(treefile)
ps <- merge_phyloseq(ps, tree)
psphyloseq-class experiment-level object
otu_table() OTU Table: [ 1190 taxa and 6 samples ]
sample_data() Sample Data: [ 6 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 1190 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 1190 tips and 1188 internal nodes ]
refseq() DNAStringSet: [ 1190 reference sequences ]
Bioinformatics Workflow Summary
This is the end of bioinformatic analysis of 16S fastq files. At this point, we formed a phyloseq object that contains the results of our data processing. From here, we can explore the phyloseq object for abundance visualization and diversity analysis.
References and Further Readings
MAFFTwebsite : https://mafft.cbrc.jp/alignment/software/
FastTreewebsite : http://www.microbesonline.org/fasttree/
- Comparing Denoising methods: https://peerj.com/articles/5364/
- SILVA, RDP, Greengenes, NCBI and OTT — how do these taxonomies compare? https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3501-4
phyloseqwebsite: http://joey711.github.io/phyloseq/index.htmlphyloseqpaper: https://doi.org/10.1371/journal.pone.0061217
Reproducibility
R session information:
sessionInfo()R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Biostrings_2.64.1 GenomeInfoDb_1.32.4 XVector_0.36.0
[4] IRanges_2.30.1 S4Vectors_0.34.0 BiocGenerics_0.42.0
[7] phyloseq_1.40.0 gridExtra_2.3 dada2_1.24.0
[10] Rcpp_1.0.10 lubridate_1.9.2 forcats_1.0.0
[13] stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1
[16] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[19] ggplot2_3.4.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] nlme_3.1-162 bitops_1.0-7
[3] matrixStats_0.63.0 RColorBrewer_1.1-3
[5] tools_4.2.1 vegan_2.6-4
[7] utf8_1.2.3 R6_2.5.1
[9] mgcv_1.8-42 DBI_1.1.3
[11] colorspace_2.1-0 permute_0.9-7
[13] rhdf5filters_1.8.0 ade4_1.7-22
[15] withr_2.5.0 tidyselect_1.2.0
[17] compiler_4.2.1 cli_3.6.1
[19] Biobase_2.56.0 DelayedArray_0.22.0
[21] scales_1.2.1 digest_0.6.31
[23] Rsamtools_2.12.0 rmarkdown_2.21
[25] jpeg_0.1-10 pkgconfig_2.0.3
[27] htmltools_0.5.5 MatrixGenerics_1.8.1
[29] fastmap_1.1.1 htmlwidgets_1.6.2
[31] rlang_1.1.1 rstudioapi_0.14
[33] generics_0.1.3 hwriter_1.3.2.1
[35] jsonlite_1.8.4 BiocParallel_1.30.4
[37] RCurl_1.98-1.12 magrittr_2.0.3
[39] GenomeInfoDbData_1.2.8 biomformat_1.24.0
[41] interp_1.1-4 Matrix_1.5-4
[43] Rhdf5lib_1.18.2 munsell_0.5.0
[45] fansi_1.0.4 ape_5.7-1
[47] lifecycle_1.0.3 stringi_1.7.12
[49] yaml_2.3.7 MASS_7.3-60
[51] SummarizedExperiment_1.26.1 zlibbioc_1.42.0
[53] rhdf5_2.40.0 plyr_1.8.8
[55] grid_4.2.1 parallel_4.2.1
[57] crayon_1.5.2 deldir_1.0-6
[59] lattice_0.21-8 splines_4.2.1
[61] multtest_2.52.0 hms_1.1.3
[63] knitr_1.42 pillar_1.9.0
[65] igraph_1.4.2 GenomicRanges_1.48.0
[67] reshape2_1.4.4 codetools_0.2-19
[69] glue_1.6.2 evaluate_0.21
[71] ShortRead_1.54.0 latticeExtra_0.6-30
[73] data.table_1.14.8 RcppParallel_5.1.7
[75] png_0.1-8 vctrs_0.6.2
[77] tzdb_0.4.0 foreach_1.5.2
[79] gtable_0.3.3 xfun_0.39
[81] survival_3.5-5 iterators_1.0.14
[83] GenomicAlignments_1.32.1 cluster_2.1.4
[85] timechange_0.2.0