Microbiome analysis in R - PICRUSt2

Author

Jean-Jack Riethoven, Max Qiu, Chia Sin Liew,
Bioinformatics Core Research Facility (BCRF)
Nebraska Center for Biotechnology,
University of Nebraska-Lincoln

Published

October 17, 2023

Note

For a free consultation, please visit our consultation scheduler!

Workshop and Data Overview

This installment of the microbiome analysis is PICRUST2 for prediction of metagenomics functions. For this session, we will be using the full phyloseq object from Salomon, JD et al., 2023 so we have more covariants to work with. The data has two main covariates, group (CNT and CPB representing control samples and cardiopulmonary bypass samples) and time (pre and post representing pre-operation and post-operation time points).

Getting Ready

Load R packages we will be using for today’s workshop.

suppressMessages({
        library("tidyverse")
        library("phyloseq")
        library("rstatix")
        library("vegan")
        library("picante")
        library("kableExtra")
        library("reticulate")
        library("ALDEx2")
        library("data.table")
})
no_of_cores = 1

## useful functions for generating PICRUSt2 input

#---- phyloseq_counts_to_df
# Converts phyloseq into a data frame which can be used to export the
# ASVs and samples with their counts to a file. This function also
# adds a new column with the ASV IDs to the front.
phyloseq_counts_to_df <- function(ps) {
  df <- as.data.frame(t(otu_table(ps)))
  df <- cbind("#OTU ID" = rownames(df), df)  # Cheeky way to add a comment to first line

  # Note: we use OTU ID above since at the moment this function is meanly used
  # to pipe files into picrust, which needs the 'OTU ID'.
  return(df)
}


#---- write_counts_to_file
# Write the (raw) counts from a phyloseq object as a data matrix to file.
#
# INPUT  ps           : phyloseq object
#        file_name    : full file name (incl. path if required when not writing
#                       to working directory)
write_counts_to_file <- function(ps, filename) {
  df <- phyloseq_counts_to_df(ps)

  write.table(df,
              file=filename,
              sep = "\t",
              row.names = FALSE)
}

#---- write_seqs_to_file
# Write the sequences from a phyloseq object as a fasta file.
#
# INPUT  ps           : phyloseq object
#        filename    : full file name (incl. path if required when not writing
#                       to working directory)
write_seqs_to_file <- function(ps, filename) {
  df <- as.data.frame(refseq(ps))

  unlink(filename)  # make sure we delete before we concatenate below

  names <- rownames(df)
  for (i in 1:nrow(df)) {
    cat(paste0(">", names[i]), file=filename, sep="\n", append=TRUE)
    cat(df[i,1], file=filename, sep="\n", append=TRUE)
  }
  message(paste0("Wrote ", nrow(df), " ASV sequences to file ", filename))
}

Copy all needed files to $WORK, create the 16S_Workshop directory if it does not exist, and set it as working directory using commands below.

system("cp -Ru /work/riethoven/jeanjack/16S_Workshop $WORK/", intern=TRUE)
setwd(file.path(Sys.getenv("WORK"), "16S_Workshop"))
Note

After the above setwd() command use the ‘gear’ icon in RStudio’s right-bottom panel and select ‘Go to working directory’. That will then enable you see files in the ‘Files’ tab of the right-bottom panel.

Load the phyloseq object constructed from whole dataset of Salomon, JD et al., 2023.

load("./intermediate/SalomonJD_et_al_covariates.rda")

Inspect objects loaded into your environment.

ls()
[1] "create_dt"             "has_annotations"       "no_of_cores"          
[4] "phyloseq_counts_to_df" "ps"                    "write_counts_to_file" 
[7] "write_seqs_to_file"   
ps
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4578 taxa and 24 samples ]
sample_data() Sample Data:       [ 24 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 4578 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 4578 tips and 4576 internal nodes ]
refseq()      DNAStringSet:      [ 4578 reference sequences ]

To rearrange the levels of factor or to identify which level of a factor is reference

Note

Levels of a factor are ordered alphabetically by default, which means for group factor with two levels CNT and CPB, CNT is default reference level; for time factor with two levels pre and post, post is default reference level.

sample_meta = sample_data(ps) %>%
        dplyr::as_tibble() %>%
        dplyr::mutate(time = factor(time)) %>% # this step make `time` a factor so we can rearrange its levels
        within(., time <- relevel(time, ref = "pre")) %>% # this step set `pre` as ref level
        as.data.frame()

rownames(sample_meta) = sample_meta$sample 
sample_data(ps) = data.frame(sample_meta)

Inspect and Filter phyloseq object

A few screening and filtering steps before final phyloseq object is ready for abundance visualization and diversity analysis.

  • Step 1, remove Archaea, unknowns, chloroplasts;
ps.clean <- subset_taxa(ps, Kingdom == "Bacteria") %>%
        subset_taxa(!is.na(Phylum)) %>%
        subset_taxa(!Class %in% c("Chloroplast")) %>%
        subset_taxa(!Family %in% c("mitochondria"))
ps.clean
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4541 taxa and 24 samples ]
sample_data() Sample Data:       [ 24 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 4541 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 4541 tips and 4539 internal nodes ]
refseq()      DNAStringSet:      [ 4541 reference sequences ]
Note

We are only interested in bacteria. Removed 37 taxa.

  • Step 2, filter taxa (ASV) with low abundance (< 2);
ps.clean.p0 <- filter_taxa(ps.clean, function (x) {sum(x > 0) >= 2}, prune=TRUE)
ps.clean.p0
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3564 taxa and 24 samples ]
sample_data() Sample Data:       [ 24 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 3564 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 3564 tips and 3562 internal nodes ]
refseq()      DNAStringSet:      [ 3564 reference sequences ]
Note

Abundance is the sum of the number of times an ASV is observed across all samples. Removed 977 taxa.

Complete metadata are shown below:

sample_data(ps.clean.p0)

Prepare for input files for PICRUSt2:

Note

If have trouble generating files in the next step, copy the prepared files:

system("cp -u ./intermediate/raw_counts.tsv picrust2/")
system("cp -u ./intermediate/seqs.fna picrust2/ ")
dir.create("picrust2", showWarnings = FALSE)
write_counts_to_file(ps.clean.p0, filename = "picrust2/raw_counts.tsv")
write_seqs_to_file(ps.clean.p0, filename = "picrust2/seqs.fna")

Functional prediction using PICRUSt2

PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) is a software for predicting functional abundances based only on marker gene sequences. “Function” usually refers to gene families such as KEGG orthologs (KO) and Enzyme Classification numbers (EC), but predictions can be made for any arbitrary trait. Similarly, predictions are typically based on 16S rRNA gene sequencing data, but other marker genes can also be used.

The entire PICRUSt2 pipeline can be run using a single script, called picrust2_pipeline.py. This script will run each of the 4 key steps outlined in above figure: (1) sequence placement, (2) hidden-state prediction of genomes, (3) metagenome prediction, (4) pathway-level predictions.

Run PICRUSt2 on cluster:

Warning

Slow step. Depending on our schedule we might skip this and use pregenerated files instead. Instructor will let you know.

If we were to run this in a terminal or a submitted job (via slurm) on HCC, these would be the commands (do not run now; they are an example):

module load picrust2/2.4
# Use an extra cd to go to your work directory (not shown)
mkdir -p picrust2; cd picrust2
rm -Rf results # delete the results directory if it already exists
               # otherwise picrust2 will stop (safety feature)
picrust2_pipeline.py -s seqs.fna -i raw_counts.tsv -p 2 -o results \
                     --stratified --verbose

Instead, if within R/RStudio, you should copy and run the next (long) line to execute this from within R. If you get an error ‘file not found’, you have forgotten to include the ‘picrust2/2.4’ module in setting up your RStudio instance. However, we are not going to run this either since it takes a relatively long time (40 minutes) for our workshop. The longest steps in the workflow are all single-core commands that are not efficient. You are welcome to run the next commands outside the workshop at another time.

system('mkdir -p picrust2 ; cd picrust2 ; rm -Rf results; picrust2_pipeline.py -s seqs.fna -i raw_counts.tsv -p 4 -o results --stratified --verbose')

We have prepared the output of the picrust2 workflow in advance, so we are going to use those files. Let’s copy them here:

system('cp -Ru ./intermediate/picrust2 .')

After the full workflow is finished, we can add descriptions as a new column in gene family and pathway abundance tables using add_descriptions.py. To add a description column to E.C. number, KO, and MetaCyc pathway abundance tables respectively (again, don’t copy and paste these; use the long system() command instead:

cd $WORK/16S_Workshop/picrust2/results

add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                    -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
                    -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
                    -o pathways_out/path_abun_unstrat_descrip.tsv.gz
                    
# Decompress tsv files
find . -name "*desc*" -exec pigz -dkv \{\} \;

The long system command we will be running is:

system("cd ./picrust2/results; add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz; add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz; add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC -o pathways_out/path_abun_unstrat_descrip.tsv.gz; find . -name '*desc*.gz' -exec gunzip -dfv \\{\\} \\;")

(Again, this long command has exactly the same commands as listed in the previous code block; although this one is easier to run for us without making typos or mistakes). You will see some messages about replacing files since we pregenerated the unzipped files as well.

The main output of PICRUSt2 is the tsv file within EC_metagenome_out, KO_metagenome_out, and pathways_out folders. Copy those tsv files to your own working directory. Be sure to change their names (EC.tsv, KO.tsv, and path.tsv, respectively), as some files share the same name. You can do this manually, or use the quick system commands below.

system('cp ./picrust2/results/EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv  ./picrust2/EC.tsv')
system('cp ./picrust2/results/KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv  ./picrust2/KO.tsv')
system('cp ./picrust2/results/pathways_out/path_abun_unstrat_descrip.tsv  ./picrust2/path.tsv')
p2_EC = paste0("picrust2", "/EC.tsv")
p2EC = as.data.frame(fread(p2_EC))
rownames(p2EC) = p2EC$"function"
p2EC = p2EC %>%
        mutate(across(-c(`function`, description), round ))
p2EC = as.matrix(p2EC[,-1]) # drop off function column as we have it as row names
p2_KO = paste0("picrust2", "/KO.tsv")
p2KO = as.data.frame(fread(p2_KO))
rownames(p2KO) = p2KO$"function"
p2KO = p2KO %>%
        mutate(across(-c(`function`, description), round ))
p2KO = as.matrix(p2KO[,-1])
p2_PW = paste0("picrust2", "/path.tsv")
p2PW = as.data.frame(fread(p2_PW))
rownames(p2PW) = p2PW$"pathway"
p2PW = p2PW %>%
        mutate(across(-c(`pathway`, description), round ))
p2PW = as.matrix(p2PW[,-1]) # drop off pathway column as we have it as row names

Downstream analyses

There are many possible ways to analyze the PICRUSt2 output. Package ggpicrust2 is an R package for analyzing PICRUSt2 data. For statistical analysis, package ALDEx2 was mentioned in PICRUSt2 github wiki, but you can use any packages discussed in differential abundance analysis section. We will demonstrate using ALDEx2 to analyze our PICRUSt2 output.

model = "group + time"
covariate = str_split(model, pattern = "\\+") %>% unlist() %>% str_trim()
var1 = covariate[1]
var2 = covariate[2]
covariate_matrix = sample_meta[, c(var1, var2)] %>% as.data.frame()
formula = as.formula(paste("", model, sep = " ~ " ))
mm <- model.matrix(object = formula, covariate_matrix)
Warning

Slow step. Alternatively,

load("./intermediate/picrust2_aldex2.rda")

This is the step we are likely going to take in the interest of time. Instructor will let you know. The loading of the above prepared data will replace the code block just below for EC, KO, and PW.

EC = p2EC %>%
    as.data.frame() %>%
    dplyr::select(-`description`) %>%
    mutate(across(everything(), as.numeric))

EC_glm_x <- aldex.clr(EC, mm, mc.samples=128, denom="all", verbose=F)
EC_glm_test <- aldex.glm(EC_glm_x, mm) # slow step
EC_glm_effect <- aldex.glm.effect(EC_glm_x)

Check estimated effect size

hist(EC_glm_effect[["groupCPB"]]$effect, breaks = 20, xlab = "effect size", col = "yellow", main = "EC")

Create MW and MA plots

aldex.plot(EC_glm_effect[["groupCPB"]], type = "MW", test = "effect",
           cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6,
           xlab = "Dispersion", ylab = "Difference")
title(main = "(EC) MW Plot")

aldex.plot(EC_glm_effect[["groupCPB"]], type = "MA", test = "effect",
           cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6, 
           xlab = "Dispersion", ylab = "Difference")
title(main = "(EC) MA Plot")

Relationship between effect, difference, and p values

plot(EC_glm_effect[["groupCPB"]]$effect, EC_glm_test$`groupCPB:pval`, 
     log = "y", cex = 0.4, col = "blue", pch = 19, 
     xlab = "Effect size", ylab = "P value", main = "(EC) Effect size plot")

points(EC_glm_effect[["groupCPB"]]$effect, EC_glm_test$`groupCPB:pval.holm`, 
       cex = 0.5, col = "red", pch = 19)

abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))

plot(EC_glm_test$`groupCPB:Est`, EC_glm_test$`groupCPB:pval`, 
     log = "y", cex = 0.4, col = "blue", pch = 19, 
     xlab = "Difference", ylab = "P value", main = "(EC) Volcano plot")

points(EC_glm_test$`groupCPB:Est`, EC_glm_test$`groupCPB:pval.holm`, 
       cex = 0.5, col = "red", pch = 19)

abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))

Extracting only significant features

This code is specific for the model you have; in our case we want to look at features that were significantly different in our CPB group vs the CNT group. If we look at the structure of the GLM test, we can see that there are two columns groupCPB:pval.holm and timepost:pval.holm that denote the adjusted p-values for the group and the time, respectively.

colnames(EC_glm_test)
 [1] "Intercept::Est"       "Intercept::SE"        "Intercept::t.val"    
 [4] "Intercept::pval"      "groupCPB:Est"         "groupCPB:SE"         
 [7] "groupCPB:t.val"       "groupCPB:pval"        "timepost:Est"        
[10] "timepost:SE"          "timepost:t.val"       "timepost:pval"       
[13] "Intercept::pval.holm" "groupCPB:pval.holm"   "timepost:pval.holm"  

And then we can select, for example, for those features that are significant between the two levels in our group by:

significant_EC <- EC_glm_test[EC_glm_test$'groupCPB:pval.holm' < 0.05,]
head(significant_EC)
             Intercept::Est Intercept::SE Intercept::t.val Intercept::pval
EC:1.21.98.1       1.278804     0.2065985         6.192541    4.031766e-06
EC:2.5.1.120       1.790699     0.2353511         7.611366    1.906152e-07
EC:4.2.1.151       1.277179     0.2059617         6.204028    3.951708e-06
             groupCPB:Est groupCPB:SE groupCPB:t.val groupCPB:pval timepost:Est
EC:1.21.98.1    -1.422846   0.2543588      -5.595640  1.552865e-05   -0.2873220
EC:2.5.1.120    -1.581422   0.2897582      -5.459127  2.112049e-05   -0.3154491
EC:4.2.1.151    -1.419692   0.2535747      -5.600972  1.547512e-05   -0.2827332
             timepost:SE timepost:t.val timepost:pval Intercept::pval.holm
EC:1.21.98.1   0.2508011      -1.144916     0.2670328         0.0018645293
EC:2.5.1.120   0.2857054      -1.103844     0.2834117         0.0001221716
EC:4.2.1.151   0.2500281      -1.130683     0.2726728         0.0018310857
             groupCPB:pval.holm timepost:pval.holm
EC:1.21.98.1         0.02657190                  1
EC:2.5.1.120         0.03612157                  1
EC:4.2.1.151         0.02648130                  1
Note

Instead of filtering by pval.holm, we can filter by absolute effect size (in the *_glm_effect variable) instead, as this is the recommended method by the author of ALDEx2. Absolute effect > 1 is then considered significant.

We can now use the row names of the significant features to extract the effect sizes and the description (that we threw out earlier)..

significant_EC_effect <- EC_glm_effect$groupCPB[rownames(significant_EC),]
significant_EC_description <- as.data.frame(p2EC)$description[rownames(p2EC) %in% rownames(significant_EC)]

head(significant_EC_effect)
               rab.all rab.win.0   rab.win.1  diff.btw  diff.win    effect
EC:1.21.98.1 0.5527638  1.166232 -0.17991944 -1.420502 0.9142540 -1.464474
EC:2.5.1.120 0.9952047  1.548379 -0.05057648 -1.587636 1.0269548 -1.458089
EC:4.2.1.151 0.5487666  1.173546 -0.17382010 -1.442925 0.8930056 -1.602430
                overlap
EC:1.21.98.1 0.04212226
EC:2.5.1.120 0.05156297
EC:4.2.1.151 0.04531303
head(significant_EC_description)
[1] "Cyclic dehypoxanthinyl futalosine synthase"
[2] "Aminodeoxyfutalosine synthase"             
[3] "Chorismate dehydratase"                    

.. and then glue the three objects together. This can then be pruned for columns that are not interesting or too overwhelming/not needed, and written to a file for further analysis (not shown, use as exercise!)

EC_CPB_sign <- cbind(description = significant_EC_description, significant_EC, significant_EC_effect)
KO = p2KO %>%
    as.data.frame() %>%
    dplyr::select(-`description`) %>%
    mutate(across(everything(), as.numeric))

KO_glm_x <- aldex.clr(KO, mm, mc.samples=128, denom="all", verbose=F)
KO_glm_test <- aldex.glm(KO_glm_x, mm)
KO_glm_effect <- aldex.glm.effect(KO_glm_x)

# KO_glm <- aldex(KO, mm, mc.samples=128, test = "glm", effect = TRUE, verbose=TRUE, denom = "all")

Check estimated effect size

hist(KO_glm_effect[["groupCPB"]]$effect, breaks = 20, xlab = "effect size", col = "yellow", main = "KO")

Create MW and MA plots

aldex.plot(KO_glm_effect[["groupCPB"]], type = "MW", test = "effect",
           cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6,
           xlab = "Dispersion", ylab = "Difference")
title(main = "(KO) MW Plot")

aldex.plot(KO_glm_effect[["groupCPB"]], type = "MA", test = "effect",
           cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6, 
           xlab = "Dispersion", ylab = "Difference")
title(main = "(KO) MA Plot")

Relationship between effect, difference, and p values

plot(KO_glm_effect[["groupCPB"]]$effect, KO_glm_test$`groupCPB:pval`, 
     log = "y", cex = 0.4, col = "blue", pch = 19, 
     xlab = "Effect size", ylab = "P value", main = "(KO) Effect size plot")

points(KO_glm_effect[["groupCPB"]]$effect, KO_glm_test$`groupCPB:pval.holm`, 
       cex = 0.5, col = "red", pch = 19)

abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))

plot(KO_glm_test$`groupCPB:Est`, KO_glm_test$`groupCPB:pval`, 
     log = "y", cex = 0.4, col = "blue", pch = 19, 
     xlab = "Difference", ylab = "P value", main = "(KO) Volcano plot")

points(KO_glm_test$`groupCPB:Est`, KO_glm_test$`groupCPB:pval.holm`, 
       cex = 0.5, col = "red", pch = 19)

abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))

Extracting only significant features

This code is specific for the model you have; in our case we want to look at features that were significantly different in our CPB group vs the CNT group. If we look at the structure of the GLM test, we can see that there are two columns groupCPB:pval.holm and timepost:pval.holm that denote the adjusted p-values for the group and the time, respectively. Instead of filtering by pval.holm, we can filter by absolute effect size (in the *_glm_effect variable) instead, as this is the recommended method by the author of ALDEx2. Absolute effect > 1 is then considered significant.

In this case, that is what we are showing you here (there are no significant features if we look at pval.holm - see how we did that for EC and check it yourself).

str(KO_glm_effect)
List of 2
 $ groupCPB:'data.frame':   5505 obs. of  7 variables:
  ..$ rab.all  : num [1:5505] 4.424 0.271 6.915 0.287 4.889 ...
  ..$ rab.win.0: num [1:5505] 4.4245 1.4167 6.9146 -0.0528 4.8481 ...
  ..$ rab.win.1: num [1:5505] 4.445 -0.662 7.023 0.759 4.925 ...
  ..$ diff.btw : num [1:5505] 0.119 -2.216 0.269 1.06 0.234 ...
  ..$ diff.win : num [1:5505] 0.771 2.783 0.799 1.682 0.579 ...
  ..$ effect   : num [1:5505] 0.173 -0.716 0.316 0.525 0.366 ...
  ..$ overlap  : num [1:5505] 0.424 0.155 0.395 0.295 0.367 ...
 $ timepost:'data.frame':   5505 obs. of  7 variables:
  ..$ rab.all  : num [1:5505] 4.424 0.271 6.915 0.287 4.889 ...
  ..$ rab.win.0: num [1:5505] 4.474 0.271 6.944 0.495 4.936 ...
  ..$ rab.win.1: num [1:5505] 4.369 0.2388 6.6983 0.0219 4.8264 ...
  ..$ diff.btw : num [1:5505] 0.0772 -0.3461 -0.2318 -0.5853 -0.111 ...
  ..$ diff.win : num [1:5505] 0.766 3.059 0.66 1.8 0.569 ...
  ..$ effect   : num [1:5505] 0.1114 -0.0771 -0.2728 -0.3227 -0.174 ...
  ..$ overlap  : num [1:5505] 0.459 0.474 0.37 0.354 0.44 ...
colnames(KO_glm_effect$groupCPB)
[1] "rab.all"   "rab.win.0" "rab.win.1" "diff.btw"  "diff.win"  "effect"   
[7] "overlap"  

And then we can select, for example, for those features that are significant between the two levels in our group by:

significant_KO <- rownames(KO_glm_effect$groupCPB[abs(KO_glm_effect$groupCPB$effect) > 1,])
head(significant_KO)
[1] "K00127" "K00380" "K00557" "K01497" "K01654" "K02386"

We can now use the list of the significant features to extract the test and the effect size data, as well as the description (that we threw out earlier)..

significant_KO_test <- KO_glm_test[significant_KO,]
significant_KO_effect <- KO_glm_effect$groupCPB[significant_KO,]
significant_KO_description <- as.data.frame(p2KO)$description[rownames(p2KO) %in% significant_KO]

head(significant_KO_test)
       Intercept::Est Intercept::SE Intercept::t.val Intercept::pval
K00127       2.545390     0.1905511        13.362673    1.046347e-11
K00380       1.149812     0.2274718         5.056823    5.573997e-05
K00557       1.454651     0.2194788         6.629542    1.525066e-06
K01497       1.409107     0.2565700         5.493784    1.966397e-05
K01654       3.050979     0.2477408        12.317390    4.648561e-11
K02386       2.626857     0.2098831        12.521219    3.585200e-11
       groupCPB:Est groupCPB:SE groupCPB:t.val groupCPB:pval timepost:Est
K00127   -1.1287451   0.2346016      -4.812509  9.665926e-05   -0.1093162
K00380   -1.1092477   0.2800573      -3.962150  7.555302e-04    0.0410316
K00557   -1.0071969   0.2702166      -3.728412  1.298367e-03    0.2424569
K01497   -0.9944875   0.3158823      -3.149054  4.971209e-03    0.1598871
K01654   -1.3550777   0.3050120      -4.443141  2.289655e-04   -0.4074431
K02386   -1.1685579   0.2584026      -4.523418  1.912052e-04   -0.1866487
       timepost:SE timepost:t.val timepost:pval Intercept::pval.holm
K00127   0.2313203     -0.4718006     0.6433740         3.743312e-08
K00380   0.2761402      0.1489362     0.8738907         1.034422e-01
K00557   0.2664371      0.9106022     0.3751650         3.721714e-03
K01497   0.3114641      0.5142822     0.6141965         4.138823e-02
K01654   0.3007459     -1.3546659     0.1906326         1.623544e-07
K02386   0.2547884     -0.7322239     0.4730964         1.257243e-07
       groupCPB:pval.holm timepost:pval.holm
K00127          0.5315235                  1
K00380          1.0000000                  1
K00557          1.0000000                  1
K01497          1.0000000                  1
K01654          0.9933919                  1
K02386          0.9254890                  1
head(significant_KO_effect)
         rab.all rab.win.0 rab.win.1  diff.btw  diff.win    effect    overlap
K00127 1.9478705  2.455850 1.3027083 -1.116716 0.7960562 -1.377084 0.07812530
K00380 0.6862665  1.139806 0.1207816 -1.093551 0.8963992 -1.151500 0.09828416
K00557 0.9054649  1.665706 0.6248386 -1.027131 0.7747888 -1.074238 0.12656267
K01497 0.8488230  1.658194 0.5838604 -1.091957 0.8652529 -1.046514 0.16562512
K01654 2.4223764  3.026338 1.5046760 -1.424031 0.9927465 -1.337132 0.10468771
K02386 2.0480352  2.496997 1.4082157 -1.188066 0.9299798 -1.244966 0.08580370
head(significant_KO_description)
[1] "fdoI, fdsG; formate dehydrogenase subunit gamma"                          
[2] "cysJ; sulfite reductase (NADPH) flavoprotein alpha-component [EC:1.8.1.2]"
[3] "trmA; tRNA (uracil-5-)-methyltransferase [EC:2.1.1.35]"                   
[4] "ribA, RIB1; GTP cyclohydrolase II [EC:3.5.4.25]"                          
[5] "E2.5.1.56, neuB; N-acetylneuraminate synthase [EC:2.5.1.56]"              
[6] "flgA; flagella basal body P-ring formation protein FlgA"                  

.. and then glue the three objects together. This can then be pruned for columns that are not interesting or too overwhelming/not needed, and written to a file for further analysis (not shown, use as exercise!)

KO_CPB_sign <- cbind(description = significant_KO_description, significant_KO_test, significant_KO_effect)
PW = p2PW %>%
    as.data.frame() %>%
    dplyr::select(-`description`) %>%
    mutate(across(everything(), as.numeric))

PW_glm_x <- aldex.clr(PW, mm, mc.samples=128, denom="all", verbose=F)
PW_glm_test <- aldex.glm(PW_glm_x, mm)
PW_glm_effect <- aldex.glm.effect(PW_glm_x)

# PW_glm <- aldex(PW, mm, mc.samples=128, test = "glm", effect = TRUE, verbose=TRUE, denom = "all")
Check estimated effect size
hist(PW_glm_effect[["groupCPB"]]$effect, breaks = 20, xlab = "effect size", col = "yellow", main = "Pathway")

Create MW and MA plots
aldex.plot(PW_glm_effect[["groupCPB"]], type = "MW", test = "effect",
           cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6,
           xlab = "Dispersion", ylab = "Difference")
title(main = "(PW) MW Plot")

aldex.plot(PW_glm_effect[["groupCPB"]], type = "MA", test = "effect",
           cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6, 
           xlab = "Dispersion", ylab = "Difference")
title(main = "(PW) MA Plot")

Relationship between effect, difference, and p values
plot(PW_glm_effect[["groupCPB"]]$effect, PW_glm_test$`groupCPB:pval`, 
     log = "y", cex = 0.4, col = "blue", pch = 19, 
     xlab = "Effect size", ylab = "P value", main = "(PW) Effect size plot")

points(PW_glm_effect[["groupCPB"]]$effect, PW_glm_test$`groupCPB:pval.holm`, 
       cex = 0.5, col = "red", pch = 19)

abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))

plot(PW_glm_test$`groupCPB:Est`, PW_glm_test$`groupCPB:pval`, 
     log = "y", cex = 0.4, col = "blue", pch = 19, 
     xlab = "Difference", ylab = "P value", main = "(PW) Volcano plot")

points(PW_glm_test$`groupCPB:Est`, PW_glm_test$`groupCPB:pval.holm`, 
       cex = 0.5, col = "red", pch = 19)

abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))

Extracting only significant features

This code is specific for the model you have; in our case we want to look at features that were significantly different in our CPB group vs the CNT group. If we look at the structure of the GLM test, we can see that there are two columns groupCPB:pval.holm and timepost:pval.holm that denote the adjusted p-values for the group and the time, respectively. Instead of filtering by pval.holm, we can filter by absolute effect size (in the *_glm_effect variable) instead, as this is the recommended method by the author of ALDEx2. Absolute effect > 1 is then considered significant.

On purpose, we have not provided code here for you. Try it yourself, look at the examples for EC where we filtered on pval.holm, and KO where we filtered on absolute effect.

The answer to the number of significant pathways is 2 and 2, respectively. Check if both filtering methods give you the same answer; what is the description of the pathways?

Key limitations

There are several limitations to keep in mind when analyzing PICRUSt2 output, which are mainly related to predictions being limited to the gene contents of existing reference genomes. The main drawback is:

Differential abundance results will likely differ substantially from what would be found based on shotgun metagenomics data. Although amplicon-based predictions may be highly correlated with functional profiles based on shotgun metagenomics sequencing data, the actual functions identified as significantly different can substantially differ. Check out this post for more details.

More detailed discussion on its limitation can be found in their github wiki.

References and Further Readings

  1. Douglas et al. (2020) PICRUSt2 for prediction of metagenome functions paper local copy
  2. PICRUSt2 wiki
  3. ALDEx2 vignette
  4. A demo of picrust2 & ALDEX2
  5. Fernandes et al. (2013) ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq link local copy
  6. Fernandes et al. (2014) Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis link local copy
  7. Gloor et al. (2016) Displaying Variation in Large Datasets: Plotting a Visual Summary of Effect Sizes. link local copy

Reproducibility

R session information:

sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.14.8   ALDEx2_1.30.0       zCompositions_1.4.1
 [4] truncnorm_1.0-9     NADA_1.6-1.1        survival_3.5-7     
 [7] MASS_7.3-60         reticulate_1.34.0   kableExtra_1.3.4   
[10] picante_1.8.2       nlme_3.1-163        ape_5.7-1          
[13] vegan_2.6-4         lattice_0.21-8      permute_0.9-7      
[16] rstatix_0.7.2       phyloseq_1.42.0     lubridate_1.9.3    
[19] forcats_1.0.0       stringr_1.5.0       dplyr_1.1.3        
[22] purrr_1.0.2         readr_2.1.4         tidyr_1.3.0        
[25] tibble_3.2.1        ggplot2_3.4.4       tidyverse_2.0.0    

loaded via a namespace (and not attached):
 [1] colorspace_2.1-0            ellipsis_0.3.2             
 [3] XVector_0.38.0              GenomicRanges_1.50.2       
 [5] rstudioapi_0.15.0           DT_0.30                    
 [7] fansi_1.0.4                 xml2_1.3.5                 
 [9] codetools_0.2-19            splines_4.2.2              
[11] cachem_1.0.8                knitr_1.44                 
[13] ade4_1.7-22                 jsonlite_1.8.7             
[15] broom_1.0.5                 cluster_2.1.4              
[17] png_0.1-8                   compiler_4.2.2             
[19] httr_1.4.7                  backports_1.4.1            
[21] RcppZiggurat_0.1.6          Matrix_1.6-1               
[23] fastmap_1.1.1               cli_3.6.1                  
[25] htmltools_0.5.6             tools_4.2.2                
[27] igraph_1.5.1                gtable_0.3.4               
[29] glue_1.6.2                  GenomeInfoDbData_1.2.9     
[31] reshape2_1.4.4              Rcpp_1.0.11                
[33] carData_3.0-5               Biobase_2.58.0             
[35] jquerylib_0.1.4             vctrs_0.6.3                
[37] Biostrings_2.66.0           rhdf5filters_1.10.1        
[39] multtest_2.54.0             svglite_2.1.2              
[41] crosstalk_1.2.0             iterators_1.0.14           
[43] xfun_0.40                   rvest_1.0.3                
[45] timechange_0.2.0            lifecycle_1.0.3            
[47] zlibbioc_1.44.0             scales_1.2.1               
[49] hms_1.1.3                   MatrixGenerics_1.10.0      
[51] parallel_4.2.2              SummarizedExperiment_1.28.0
[53] biomformat_1.26.0           rhdf5_2.42.1               
[55] yaml_2.3.7                  sass_0.4.7                 
[57] stringi_1.7.12              S4Vectors_0.36.2           
[59] foreach_1.5.2               BiocGenerics_0.44.0        
[61] BiocParallel_1.32.6         GenomeInfoDb_1.34.9        
[63] rlang_1.1.1                 pkgconfig_2.0.3            
[65] systemfonts_1.0.5           bitops_1.0-7               
[67] matrixStats_1.0.0           evaluate_0.22              
[69] Rhdf5lib_1.20.0             htmlwidgets_1.6.2          
[71] Rfast_2.0.8                 tidyselect_1.2.0           
[73] plyr_1.8.8                  magrittr_2.0.3             
[75] R6_2.5.1                    IRanges_2.32.0             
[77] generics_0.1.3              DelayedArray_0.24.0        
[79] DBI_1.1.3                   pillar_1.9.0               
[81] withr_2.5.1                 mgcv_1.9-0                 
[83] abind_1.4-5                 RCurl_1.98-1.12            
[85] crayon_1.5.2                car_3.1-2                  
[87] utf8_1.2.3                  tzdb_0.4.0                 
[89] rmarkdown_2.25              grid_4.2.2                 
[91] digest_0.6.33               webshot_0.5.5              
[93] stats4_4.2.2                munsell_0.5.0              
[95] viridisLite_0.4.2           bslib_0.5.1