Microbiome analysis in R - Part I

Author

Max Qiu,
Binformatics Core Research Facility (BCRF)
Nebraska Center for Biotechnology,
University of Nebraska-Lincoln

Published

April 17, 2023

Note

For a free consultation, please visit our consultation scheduler!

Workshop and Data Overview

The first installment of the microbiome analysis is to obtain Amplicon Sequence Variants (ASVs), their abundance in each sample, given raw 16S fastq data.

The data we will be using for this workshop is originated from Salomon, JD et al., 2023. “Many animal models have been useful in evaluating cardiopulmonary bypass (CPB) and a variety of cardiovascular, kidney, respiratory and neurologic outcomes. No animal models of CPB, however, have been utilized to evaluate changes to the microbiome … In this article, we used a model of piglet CPB/DHCA to evaluate the microbiome … A total of 12 piglets were used in this study, five in the CPB/DHCA group and seven in the control group receiving mechanical ventilation only. Sequences of 16S rRNA amplicon libraries generated using fecal microbial DNA resulted in a total of 12.6 million reads.”

For the sake of this workshop, we picked a few samples and subsetted the raw fastq files, so the analysis will fit in the time period of this workshop. To read more about this study, please refer to the publication.

In this workshop, we will be following the Bioconductor workflow for microbiome data analysis (Callahan et al., 2016b) using R software.

Getting Ready

Load R packages we will be using for today’s workshop.

suppressMessages({
        library("tidyverse")
        library("dada2")
        library("gridExtra")
})
no_of_cores = 1

If you run this code in your own PC, no_of_core is 1 due to forking limitations in Windows. When you run this code on the HCC cluster (or Mac), no_of_core can be adjusted to what you have requested for the job. Additional note: not all Mac configurations support forking within RStudio, especially if you run RStudio within one of the virtualization technologies (e.g. Docker, VirtualBox).

Set your working directory to where your data (fastq files) are located, using combinations of commands listed below.

Check current working directory

getwd()

Create a directory

dir.create()

Set working directory.

setwd()

Load metadata file.

metadata = "sample_meta.csv"
metadata_df = read.csv(metadata, row.names=1)

metadata_df

Note

Setting up a metadata file before running the analysis saves time and minimizes the probability of making errors further down the line.

Read forward and reverse fastq filenames from metadata data frame.

fnFs <- metadata_df$fq1
fnRs <- metadata_df$fq2
# Extract sample names from metadata data frame
sample.names <- metadata_df$sample

Inspect files and sample names.

fnFs
length(fnFs)
fnRs
length(fnRs)
sample.names

DADA2 Workflow

This graph is a reminder of microbiome analysis overview. When we are interested in an environment, we take a sample and extract DNA. We amplify a region of interest and sequence it. After that, we cluster the sequences into groups, which will be the foundation of diversity analysis.

Why do we cluster sequences?

In order to minimize the risks of sequencer error in targeted sequencing, clustering approaches were initially developed. Clustering approaches are based upon the idea that related/similar organisms will have similar target gene sequences and that rare sequencing errors will have a trivial contribution, if any, to the consensus sequence for these clusters.

Reduce dimensionality
Remove sequencing artifacts
Compensate for within species variability

What is OTU (Operational taxonomic units) clustering?

To generate OTUs from sequencing data, clusters are often generated using a similarity threshold of 97% sequence identity. This approach carries with it the risk that multiple similar species can be grouped into a single OTU, with their individual identifications being lost to the abstract of a cluster. Alternatively, some have tried the approach of requiring extremely high levels of sequence identity to minimize the risk of losing diversity to clustering, with thresholds closer to 100% being used, but this creates a significant risk of identifying sequencing errors as new species and false diversity.

What is ASV (Amplicon sequence variants) analysis?

The ASV approach will start by determining which exact sequences were read and how many times each exact sequence was read. These data will be combined with an error model for the sequencing run, enabling the comparison of similar reads to determine the probability that a given read at a given frequency is not due to sequencer error. This creates, in essence, a p-value for each exact sequence, where the null-hypothesis is equivalent to that exact sequence being a consequence of sequencing error.

Following this calculation, sequences are filtered according to some threshold value for confidence, leaving behind a collection of exact sequences having a defined statistical confidence. Because these are exact sequences, generated without clustering or reference databases, ASV results can be readily compared between studies using the same target region. Additionally, a given target gene sequence should always generate the same ASV and a given ASV, being an exact sequence, can be compared to a reference database at a much higher resolution allowing for more precise identification down to the species level and even potentially beyond.

OTU vs ASV

As stated above, ASV approaches can provide a significant advantage to more precise identification of microbes. In addition, they can provide a more detailed picture of the diversity within a sample. An OTU, being a cluster of multiple, similar sequences that may either be “real” sequences from the sample or errors can contain multiple, similar species of microbe lumped into a single unit. An ASV does not have this issue, as even a single base difference in the sequence will result in a unique ASV and a more detailed picture of the diversity of a given sample.

ASV analysis with DADA2

Step 1: Filter and Trim
Step 2: Infer sequence variants using DADA2

Step 3: Chimera removal

Filter and Trim

For quality checking, we want to remove low-quality sequencing reads and trimming the reads to a consistent length. Every dataset is different and therefore it is always worth inspecting the quality of the data before filtering.

Inspect read quality

We start by visualizing the quality profiles of the forward reads:

dada2::plotQualityProfile(fnFs[1:2])

Next, we also visualize the quality profiles of the reverse reads:

dada2::plotQualityProfile(fnRs[1:2])

Note the variation in the number of reads and quality scores across samples.

An aggregate quality plot from all samples make it easier to choose cut offs for filtering.

fPlot <- dada2::plotQualityProfile(fnFs, aggregate = TRUE) +
        ggtitle("Forward") +
        geom_hline(yintercept =  30, colour = "blue")
rPlot <- dada2::plotQualityProfile(fnRs, aggregate = TRUE) +
        ggtitle("Reverse") +
        geom_hline(yintercept =  30, colour = "blue")

grid.arrange(fPlot, rPlot, nrow = 1)

The forward reads are of good quality. We will truncate the forward reads at position 280 (trimming the last 20 nucleotides).

The reverse reads are of significantly worse quality, especially at the end, which is common in Illumina sequencing. Based on these profiles, we will truncate the reverse reads at position 250 where the quality distribution crashes.

Warning

Deciding on trimming locations is your judgement call based on quality plots. Trim too little, we will include sections of very low quality; trim too much, we won’t be able to merge paired-end reads in later steps.

Filter and trim using `DADA2::filterAndTrim()`

Assign file names for the filtered fastq.gz files.

# create a folder called "filtered" in our working directory, where the filtered fastq files will be stored
filt_path <- paste0(getwd(), '/filtered') # don't change this 
filtFs <- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz"))

Set truncating lengths based on quality profiles:

FORWARD_TRUNC <- 280 # determine from quality plots
REVERSE_TRUNC <- 250 # determine from quality plots

Run DADA2 filter and trim function. The maxEE parameter sets the maximum number of “expected errors” allowed in a read.

NOT FAST.

If you have trouble with this step, load("./intermediate/dada2_filter_trim.rda")

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen=c(FORWARD_TRUNC,REVERSE_TRUNC), 
                     trimLeft=c(20, 20), maxEE=c(2,2), 
                     multithread=no_of_cores,
                     matchIDs=TRUE, compress=TRUE, 
                     verbose=TRUE)

Note

Many functions in DADA2 have arguments that might be useful and relevant for you datasets. Remember to check default parameters of arguments and change them if needed.

head(out)

                                    reads.in reads.out
sub_6post_S2_L001_R1_001.fastq.gz      35798     12423
sub_21post_S13_L001_R1_001.fastq.gz    85163     30440
sub_54post_S6_L001_R1_001.fastq.gz    127525     45874
sub_9post_S20_L001_R1_001.fastq.gz     46189     16557
sub_23post_S7_L001_R1_001.fastq.gz    183281     65417
sub_48post_S8_L001_R1_001.fastq.gz    136211     45415

After filtering, we want to make sure that the filtered fastq.gz files are there.

list.files(filt_path)

 [1] "21post_F_filt.fastq.gz" "21post_R_filt.fastq.gz" "23post_F_filt.fastq.gz"
 [4] "23post_R_filt.fastq.gz" "48post_F_filt.fastq.gz" "48post_R_filt.fastq.gz"
 [7] "54post_F_filt.fastq.gz" "54post_R_filt.fastq.gz" "6post_F_filt.fastq.gz" 
[10] "6post_R_filt.fastq.gz"  "9post_F_filt.fastq.gz"  "9post_R_filt.fastq.gz"

Infer Sequence Variants using `DADA2`

The DADA2 algorithm makes use of a parametric error model of substitutions (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data using a form of unsupervised learning, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution.

Dereplication using `DADA2::derepFastq()`

After filter and trimming based on quality profile, we want to remove replicated sequences in the filtered fastq files. This is to reduce repetition and file size/memory requirements in downstream steps.

derepFs <- derepFastq(filtFs, n = 1e+06, verbose = TRUE)
derepRs <- derepFastq(filtRs, n = 1e+06, verbose = TRUE)

names(derepFs) <- sample.names
names(derepRs) <- sample.names

Learning error rates using `DADA2::learnErrors()`

Estimate and inspect the error rates before using them as the err parameter for the dada function.

SLOW STEP.

If you have trouble with this step, load("./intermediate/dada2_error_rate.rda")

errF <- learnErrors(filtFs, verbose=TRUE, multithread=no_of_cores)

56192760 total bases in 216126 reads from 6 samples will be used for learning the error rates.
Initializing error rates to maximum possible estimate.
selfConsist step 1 ......
   selfConsist step 2
   selfConsist step 3
   selfConsist step 4
   selfConsist step 5
   selfConsist step 6
Convergence after  6  rounds.

errR <- learnErrors(filtRs, verbose=TRUE, multithread=no_of_cores)

49708980 total bases in 216126 reads from 6 samples will be used for learning the error rates.
Initializing error rates to maximum possible estimate.
selfConsist step 1 ......
   selfConsist step 2
   selfConsist step 3
   selfConsist step 4
   selfConsist step 5
Convergence after  5  rounds.

Visualize estimated error rates.

errPlotF <- plotErrors(errF, nominalQ = TRUE) + labs(title = "Forward")
errPlotR <- plotErrors(errR, nominalQ = TRUE) + labs(title = "Reverse")
grid.arrange(errPlotF, errPlotR, nrow = 1)

Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates. The red line shows the expected error rates. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected.

Inferring sequence variants using `DADA2::dada()`

Now we are ready to infer sequence variants using dereplicated data and error rates learned in previous step.

The dada function takes as input dereplicated amplicon sequencing reads and returns the inferred composition of the samples. Put simply, dada removes all sequencing errors to reveal the members of the sequenced community.

Because we are dealing with a small number of samples, we use pool=TRUE to allow information to be shared across samples which improves the detection of rare variants. This option is more computationally intensive and takes longer to run. More info: here.

NOT AS SLOW, BUT STILL PRETTY SLOW.

If you have trouble with this step, load("./intermediate/dada2_infer_variants.rda")

dadaFs <- dada(derepFs, err=errF, pool=TRUE, multithread=no_of_cores, verbose=TRUE)

6 samples were pooled: 216126 reads in 62158 unique sequences.

dadaRs <- dada(derepRs, err=errR, pool=TRUE, multithread=no_of_cores, verbose=TRUE)

6 samples were pooled: 216126 reads in 61353 unique sequences.

Inspect the results of the first sample.

cat("dada-class: object describing DADA2 denoising results", sep="\n")

dada-class: object describing DADA2 denoising results

cat(paste(length(dadaFs[[1]]$denoised), "sequence variants were inferred from", 
          length(derepFs[[1]]$uniques), "input unique sequences."), sep="\n")

748 sequence variants were inferred from 4771 input unique sequences.

cat(paste("Key parameters: OMEGA_A =", dadaFs[[1]]$opts$OMEGA_A, "OMEGA_C =", 
          dadaFs[[1]]$opts$OMEGA_C, "BAND_SIZE =", dadaFs[[1]]$opts$BAND_SIZE), sep="\n")

Key parameters: OMEGA_A = 1e-40 OMEGA_C = 1e-40 BAND_SIZE = 16

Note

There is a lot to the dada-class object than what is shown here (see ?"dada-class" for more info).

Merging paired-end reads using `DADA2::mergePairs()`

We now merge the forward and reverse reads together to obtain the full denoised sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region.

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)

11502 paired-reads (in 736 unique pairings) successfully merged out of 11923 (in 1048 pairings) input.

28226 paired-reads (in 935 unique pairings) successfully merged out of 29405 (in 1732 pairings) input.

42811 paired-reads (in 945 unique pairings) successfully merged out of 44418 (in 2028 pairings) input.

15581 paired-reads (in 700 unique pairings) successfully merged out of 16058 (in 995 pairings) input.

60919 paired-reads (in 1096 unique pairings) successfully merged out of 63332 (in 2652 pairings) input.

41468 paired-reads (in 1024 unique pairings) successfully merged out of 43533 (in 2438 pairings) input.

Inspect the merger data.frame from the first sample

head(mergers[[1]])

                                                                                                                                                                                                                                                                                                                                                                                                                                   sequence
1 GGAATCTTCCACAATGGACGCAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTGGTGAAGAAGGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTAATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACTGTTTTTCTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG
2 GGAATCTTCCACAATGGGCGCAAGCCTGATGGAGCAACACCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAGCTCTGTTGTTGGAGAAGAACGTGCGTGAGAGTAACTGTTCACGCAGTGACGGTATCCAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTGCTTAGGTCTGATGTGAAAGCCTTCGGCTTAACCGAAGAAGTGCATCGGAAACCGGGCGACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG
3 GGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAGCTCTGTTGGTAGTGAAGAAAGATAGAGGTAGTAACTGGCCTTTATTTGACGGTAATTACTTAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAG
4 AGAATATTCCGCAATGGAGGAAACTCTGACGGAGCAACGCCGCGTGGACGATGAAGGCCGGAAGGTTGTAAAGTCCTTTTATAATTGAGGAATAAGCGGGACAGGGAATGGTTCCGCGGTGACTGTAGATTATGAATAAGCACCGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGGTGCGAGCGTTATTCGGAATTATTGGGCGTAAAGGGCACGCAGGCGGCTTTGCAAGCTTGGTGTGAAATCTCAGGGCTCAACCCTGAAACTGCATTGAGAACTGCATGGCTAGAGTTAGTGAGGGGAAACCGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAGGTTTCCAGCACATAACTGACGCTCAGGTGCGAAGGTGCGGGGATCAAACAG
5      GGAATATTGGTCAATGGGCGGAAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATGCGGGGATAAAGTGGGCCACGCGTGGCCTTTTGCAGGTACCGCATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGCAGGCCGCCGGGCAAGCGTGTTGTGAAATGCAGTCGCTCAACGTCTGCACTGCAGCGCGAACTGCCCAGCTTGAGTGCGCGCAACGTTGGCGGAATTCGCCGTGTAGCGGTGAAATGCTTAGATATGGCGAAGAACTCCGATTGCGAAGGCAGCTGACGGGTGCGTAACTGACGCTCATGCTCGAAAGTGCGGGTATCGAACAG
6      GGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGCAGGACGACGGCCCTATGGGTTGTAAACTGCTTTTATAAGGGAATAAAGTGAGTCTCGTGAGACTTTTTGCATGTACCTTATGAATAAGGACCGGCTAATTCCGTGCCAGCAGCCGCGGTAATACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGACGCTCAACGTCTGCACTGCAGCGCGAACTGGTTTCCTTGAGTACGCACAAAGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGAAAGTGCGGGTATCGAACAG
  abundance forward reverse nmatch nmismatch nindel prefer accept
1       686       1       1     65         0      0      1   TRUE
2       449      19      39     65         0      0      1   TRUE
3       389      43      43     65         0      0      2   TRUE
4       301       5      13     65         0      0      2   TRUE
5       226      21      16     70         0      0      2   TRUE
6       221       3       3     70         0      0      2   TRUE

The mergers object is a list of data.frames from each sample which contains the merged $sequence, $abundance and other information of the merged sequence variants.

Construct sequence table

After obtaining full denoised sequences, we can now construct an ASV table.

seqtab <- makeSequenceTable(mergers)
dim(seqtab)

[1]    6 1679

#head(seqtab)

There sequence table is a matrix with rows corresponding to the samples, and columns corresponding to the sequence variants. There are 1679 ASVs in this table.

Inspect distribution of sequence lengths.

table(nchar(getSequences(seqtab)))


260 301 383 384 399 400 401 402 403 404 406 409 416 417 418 419 420 421 422 423 
  5   1   2   4  31 358 116 104 145   7  10   2   2   5  11  88 478  46   5   2 
424 425 426 427 451 454 473 
 34 206  12   2   1   1   1

The lengths of most of our merged sequences fall within the expected range for this 16S amplicon.

Chimera Removal

Denoising step that we did before does not remove chimeric ASVs. Chimeric ASVs are sequences composed of two or more parents. The accuracy of sequence variants after denoising makes identifying chimeras simpler. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences.

Removal chimera using `DADA2::removeBimeraDenovo()`

THIS STEP IS CONSIDERABLY LESS SLOW. BUT DEFINITELY NOT FAST.

If you have trouble with this step, load("./intermediate/dada2_remove_chimera.rda")

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=no_of_cores, verbose=TRUE)

Identified 489 bimeras out of 1679 input sequences.

#head(seqtab.nochim)
dim(seqtab.nochim)

[1]    6 1190

sum(seqtab.nochim)/sum(seqtab)

[1] 0.9414634

The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on factors including experimental procedures and sample complexity. In these 6 samples, chimeras make up about 29% of the merged sequence variants and about 6% the merged sequence reads when abundances of those variants are accounted for.

Tracking reads through the workflow

Before assigning taxonomy to the ASVs, we want to look at the number of reads that made it through the different steps.

getN <- function(x) sum(getUniques(x)) # getUniques() gets abundance of unique sequences
# Calculate the number of reads at different steps
track <- cbind(out, 
               sapply(dadaFs, getN), 
               sapply(dadaRs, getN), 
               sapply(mergers, getN), 
               rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
track

        input filtered denoisedF denoisedR merged nonchim
6post   35798    12423     12060     12258  11502   11237
21post  85163    30440     29680     30128  28226   26562
54post 127525    45874     44856     45383  42811   40304
9post   46189    16557     16207     16379  15581   15006
23post 183281    65417     63945     64744  60919   56903
48post 136211    45415     44021     44862  41468   38758

Plot the track data.frame.

track.long <- as.data.frame(track, 
                            row.names = row.names(track)) %>% 
  gather(., key = steps, value = counts, input:nonchim, factor_key = TRUE)

ggplot(track.long, aes(x = steps, y = counts, color = steps)) +
    theme_classic() +
    geom_boxplot() +
    geom_jitter(shape = 16, position = position_jitter(0.3)) +
    scale_y_continuous(labels = scales::comma)

Note

Outside of filtering, there should be no step in which a majority of reads are lost.

save.image(file = "./intermediate/dada2_day_2.rda")

At the end of day 2, make sure to save your work so we can continue where we left off in day 3.

Note

Always back up your data after and/or before analysis.

References and Further Readings

MICROBIOME INFORMATICS: OTU VS. ASV: https://www.zymoresearch.com/blogs/blog/microbiome-informatics-otu-vs-asv
DADA2 Pipeline Tutorial (1.16) : http://benjjneb.github.io/dada2/tutorial.html
DADA2 manuscript : https://www.nature.com/articles/nmeth.3869#methods
DADA2 website : http://benjjneb.github.io/dada2/index.html. Highly recommended, a treasure trove of information.
Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses : https://f1000research.com/articles/5-1492/v2

Acknowledgement

Sections of this presentation were sampled and modified from Canadian Bioinformatics Workshop “Microbiome Analysis” workshop materials.

Reproducibility

R session information:

sessionInfo()

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gridExtra_2.3   dada2_1.24.0    Rcpp_1.0.10     lubridate_1.9.2
 [5] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1    
 [9] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.2  
[13] tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                matrixStats_0.63.0         
 [3] RColorBrewer_1.1-3          GenomeInfoDb_1.32.4        
 [5] tools_4.2.1                 bslib_0.4.2                
 [7] utf8_1.2.3                  R6_2.5.1                   
 [9] DT_0.27                     DBI_1.1.3                  
[11] BiocGenerics_0.42.0         colorspace_2.1-0           
[13] withr_2.5.0                 tidyselect_1.2.0           
[15] compiler_4.2.1              cli_3.6.1                  
[17] Biobase_2.56.0              DelayedArray_0.22.0        
[19] labeling_0.4.2              sass_0.4.6                 
[21] scales_1.2.1                digest_0.6.31              
[23] Rsamtools_2.12.0            rmarkdown_2.21             
[25] XVector_0.36.0              jpeg_0.1-10                
[27] pkgconfig_2.0.3             htmltools_0.5.5            
[29] MatrixGenerics_1.8.1        fastmap_1.1.1              
[31] htmlwidgets_1.6.2           rlang_1.1.1                
[33] rstudioapi_0.14             farver_2.1.1               
[35] jquerylib_0.1.4             generics_0.1.3             
[37] hwriter_1.3.2.1             jsonlite_1.8.4             
[39] crosstalk_1.2.0             BiocParallel_1.30.4        
[41] RCurl_1.98-1.12             magrittr_2.0.3             
[43] GenomeInfoDbData_1.2.8      interp_1.1-4               
[45] Matrix_1.5-4                munsell_0.5.0              
[47] S4Vectors_0.34.0            fansi_1.0.4                
[49] lifecycle_1.0.3             stringi_1.7.12             
[51] yaml_2.3.7                  SummarizedExperiment_1.26.1
[53] zlibbioc_1.42.0             plyr_1.8.8                 
[55] grid_4.2.1                  parallel_4.2.1             
[57] crayon_1.5.2                deldir_1.0-6               
[59] lattice_0.21-8              Biostrings_2.64.1          
[61] hms_1.1.3                   knitr_1.42                 
[63] pillar_1.9.0                GenomicRanges_1.48.0       
[65] reshape2_1.4.4              codetools_0.2-19           
[67] stats4_4.2.1                glue_1.6.2                 
[69] evaluate_0.21               ShortRead_1.54.0           
[71] latticeExtra_0.6-30         RcppParallel_5.1.7         
[73] png_0.1-8                   vctrs_0.6.2                
[75] tzdb_0.4.0                  gtable_0.3.3               
[77] cachem_1.0.8                xfun_0.39                  
[79] snow_0.4-4                  GenomicAlignments_1.32.1   
[81] IRanges_2.30.1              timechange_0.2.0           
[83] ellipsis_0.3.2

Workshop and Data Overview

Getting Ready

DADA2 Workflow

Why do we cluster sequences?

What is OTU (Operational taxonomic units) clustering?

What is ASV (Amplicon sequence variants) analysis?

OTU vs ASV

ASV analysis with DADA2

Filter and Trim

Inspect read quality

Filter and trim using DADA2::filterAndTrim()

Infer Sequence Variants using DADA2

Dereplication using DADA2::derepFastq()

Learning error rates using DADA2::learnErrors()

Inferring sequence variants using DADA2::dada()

Merging paired-end reads using DADA2::mergePairs()

Construct sequence table

Chimera Removal

Removal chimera using DADA2::removeBimeraDenovo()

Tracking reads through the workflow

References and Further Readings

Acknowledgement

Reproducibility

Filter and trim using `DADA2::filterAndTrim()`

Infer Sequence Variants using `DADA2`

Dereplication using `DADA2::derepFastq()`

Learning error rates using `DADA2::learnErrors()`

Inferring sequence variants using `DADA2::dada()`

Merging paired-end reads using `DADA2::mergePairs()`

Removal chimera using `DADA2::removeBimeraDenovo()`