suppressMessages({
library("tidyverse")
library("phyloseq")
library("rstatix")
library("vegan")
library("picante")
library("kableExtra")
library("reticulate")
library("ALDEx2")
library("data.table")
})= 1
no_of_cores
## useful functions for generating PICRUSt2 input
#---- phyloseq_counts_to_df
# Converts phyloseq into a data frame which can be used to export the
# ASVs and samples with their counts to a file. This function also
# adds a new column with the ASV IDs to the front.
<- function(ps) {
phyloseq_counts_to_df <- as.data.frame(t(otu_table(ps)))
df <- cbind("#OTU ID" = rownames(df), df) # Cheeky way to add a comment to first line
df
# Note: we use OTU ID above since at the moment this function is meanly used
# to pipe files into picrust, which needs the 'OTU ID'.
return(df)
}
#---- write_counts_to_file
# Write the (raw) counts from a phyloseq object as a data matrix to file.
#
# INPUT ps : phyloseq object
# file_name : full file name (incl. path if required when not writing
# to working directory)
<- function(ps, filename) {
write_counts_to_file <- phyloseq_counts_to_df(ps)
df
write.table(df,
file=filename,
sep = "\t",
row.names = FALSE)
}
#---- write_seqs_to_file
# Write the sequences from a phyloseq object as a fasta file.
#
# INPUT ps : phyloseq object
# filename : full file name (incl. path if required when not writing
# to working directory)
<- function(ps, filename) {
write_seqs_to_file <- as.data.frame(refseq(ps))
df
unlink(filename) # make sure we delete before we concatenate below
<- rownames(df)
names for (i in 1:nrow(df)) {
cat(paste0(">", names[i]), file=filename, sep="\n", append=TRUE)
cat(df[i,1], file=filename, sep="\n", append=TRUE)
}message(paste0("Wrote ", nrow(df), " ASV sequences to file ", filename))
}
Microbiome analysis in R - PICRUSt2
For a free consultation, please visit our consultation scheduler!
Workshop and Data Overview
This installment of the microbiome analysis is PICRUST2 for prediction of metagenomics functions. For this session, we will be using the full phyloseq
object from Salomon, JD et al., 2023 so we have more covariants to work with. The data has two main covariates, group
(CNT
and CPB
representing control samples and cardiopulmonary bypass samples) and time
(pre
and post
representing pre-operation and post-operation time points).
Getting Ready
Load R packages we will be using for today’s workshop.
Copy all needed files to $WORK
, create the 16S_Workshop
directory if it does not exist, and set it as working directory using commands below.
system("cp -Ru /work/riethoven/jeanjack/16S_Workshop $WORK/", intern=TRUE)
setwd(file.path(Sys.getenv("WORK"), "16S_Workshop"))
After the above setwd() command use the ‘gear’ icon in RStudio’s right-bottom panel and select ‘Go to working directory’. That will then enable you see files in the ‘Files’ tab of the right-bottom panel.
Load the phyloseq
object constructed from whole dataset of Salomon, JD et al., 2023.
load("./intermediate/SalomonJD_et_al_covariates.rda")
Inspect objects loaded into your environment.
ls()
[1] "create_dt" "has_annotations" "no_of_cores"
[4] "phyloseq_counts_to_df" "ps" "write_counts_to_file"
[7] "write_seqs_to_file"
ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 4578 taxa and 24 samples ]
sample_data() Sample Data: [ 24 samples by 17 sample variables ]
tax_table() Taxonomy Table: [ 4578 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 4578 tips and 4576 internal nodes ]
refseq() DNAStringSet: [ 4578 reference sequences ]
To rearrange the levels of factor or to identify which level of a factor is reference
Levels of a factor are ordered alphabetically by default, which means for group
factor with two levels CNT
and CPB
, CNT
is default reference level; for time
factor with two levels pre
and post
, post
is default reference level.
= sample_data(ps) %>%
sample_meta ::as_tibble() %>%
dplyr::mutate(time = factor(time)) %>% # this step make `time` a factor so we can rearrange its levels
dplyrwithin(., time <- relevel(time, ref = "pre")) %>% # this step set `pre` as ref level
as.data.frame()
rownames(sample_meta) = sample_meta$sample
sample_data(ps) = data.frame(sample_meta)
Inspect and Filter phyloseq
object
A few screening and filtering steps before final phyloseq
object is ready for abundance visualization and diversity analysis.
- Step 1, remove Archaea, unknowns, chloroplasts;
<- subset_taxa(ps, Kingdom == "Bacteria") %>%
ps.clean subset_taxa(!is.na(Phylum)) %>%
subset_taxa(!Class %in% c("Chloroplast")) %>%
subset_taxa(!Family %in% c("mitochondria"))
ps.clean
phyloseq-class experiment-level object
otu_table() OTU Table: [ 4541 taxa and 24 samples ]
sample_data() Sample Data: [ 24 samples by 17 sample variables ]
tax_table() Taxonomy Table: [ 4541 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 4541 tips and 4539 internal nodes ]
refseq() DNAStringSet: [ 4541 reference sequences ]
We are only interested in bacteria. Removed 37 taxa.
- Step 2, filter taxa (ASV) with low abundance (< 2);
<- filter_taxa(ps.clean, function (x) {sum(x > 0) >= 2}, prune=TRUE)
ps.clean.p0 ps.clean.p0
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3564 taxa and 24 samples ]
sample_data() Sample Data: [ 24 samples by 17 sample variables ]
tax_table() Taxonomy Table: [ 3564 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 3564 tips and 3562 internal nodes ]
refseq() DNAStringSet: [ 3564 reference sequences ]
Abundance is the sum of the number of times an ASV is observed across all samples. Removed 977 taxa.
Complete metadata are shown below:
sample_data(ps.clean.p0)
Prepare for input files for PICRUSt2:
If have trouble generating files in the next step, copy the prepared files:
system("cp -u ./intermediate/raw_counts.tsv picrust2/")
system("cp -u ./intermediate/seqs.fna picrust2/ ")
dir.create("picrust2", showWarnings = FALSE)
write_counts_to_file(ps.clean.p0, filename = "picrust2/raw_counts.tsv")
write_seqs_to_file(ps.clean.p0, filename = "picrust2/seqs.fna")
Functional prediction using PICRUSt2
PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) is a software for predicting functional abundances based only on marker gene sequences. “Function” usually refers to gene families such as KEGG orthologs (KO) and Enzyme Classification numbers (EC), but predictions can be made for any arbitrary trait. Similarly, predictions are typically based on 16S rRNA gene sequencing data, but other marker genes can also be used.
The entire PICRUSt2 pipeline can be run using a single script, called picrust2_pipeline.py
. This script will run each of the 4 key steps outlined in above figure: (1) sequence placement, (2) hidden-state prediction of genomes, (3) metagenome prediction, (4) pathway-level predictions.
Run PICRUSt2 on cluster:
Slow step. Depending on our schedule we might skip this and use pregenerated files instead. Instructor will let you know.
If we were to run this in a terminal or a submitted job (via slurm) on HCC, these would be the commands (do not run now; they are an example):
module load picrust2/2.4
# Use an extra cd to go to your work directory (not shown)
mkdir -p picrust2; cd picrust2
rm -Rf results # delete the results directory if it already exists
# otherwise picrust2 will stop (safety feature)
picrust2_pipeline.py -s seqs.fna -i raw_counts.tsv -p 2 -o results \
--stratified --verbose
Instead, if within R/RStudio, you should copy and run the next (long) line to execute this from within R. If you get an error ‘file not found’, you have forgotten to include the ‘picrust2/2.4’ module in setting up your RStudio instance. However, we are not going to run this either since it takes a relatively long time (40 minutes) for our workshop. The longest steps in the workflow are all single-core commands that are not efficient. You are welcome to run the next commands outside the workshop at another time.
system('mkdir -p picrust2 ; cd picrust2 ; rm -Rf results; picrust2_pipeline.py -s seqs.fna -i raw_counts.tsv -p 4 -o results --stratified --verbose')
We have prepared the output of the picrust2 workflow in advance, so we are going to use those files. Let’s copy them here:
system('cp -Ru ./intermediate/picrust2 .')
After the full workflow is finished, we can add descriptions as a new column in gene family and pathway abundance tables using add_descriptions.py
. To add a description column to E.C. number, KO, and MetaCyc pathway abundance tables respectively (again, don’t copy and paste these; use the long system() command instead:
cd $WORK/16S_Workshop/picrust2/results
add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
-o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
-o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
-o pathways_out/path_abun_unstrat_descrip.tsv.gz
# Decompress tsv files
find . -name "*desc*" -exec pigz -dkv \{\} \;
The long system command we will be running is:
system("cd ./picrust2/results; add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz; add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz; add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC -o pathways_out/path_abun_unstrat_descrip.tsv.gz; find . -name '*desc*.gz' -exec gunzip -dfv \\{\\} \\;")
(Again, this long command has exactly the same commands as listed in the previous code block; although this one is easier to run for us without making typos or mistakes). You will see some messages about replacing files since we pregenerated the unzipped files as well.
The main output of PICRUSt2
is the tsv file within EC_metagenome_out
, KO_metagenome_out
, and pathways_out
folders. Copy those tsv files to your own working directory. Be sure to change their names (EC.tsv
, KO.tsv
, and path.tsv
, respectively), as some files share the same name. You can do this manually, or use the quick system commands below.
system('cp ./picrust2/results/EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv ./picrust2/EC.tsv')
system('cp ./picrust2/results/KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv ./picrust2/KO.tsv')
system('cp ./picrust2/results/pathways_out/path_abun_unstrat_descrip.tsv ./picrust2/path.tsv')
= paste0("picrust2", "/EC.tsv")
p2_EC = as.data.frame(fread(p2_EC))
p2EC rownames(p2EC) = p2EC$"function"
= p2EC %>%
p2EC mutate(across(-c(`function`, description), round ))
= as.matrix(p2EC[,-1]) # drop off function column as we have it as row names p2EC
= paste0("picrust2", "/KO.tsv")
p2_KO = as.data.frame(fread(p2_KO))
p2KO rownames(p2KO) = p2KO$"function"
= p2KO %>%
p2KO mutate(across(-c(`function`, description), round ))
= as.matrix(p2KO[,-1]) p2KO
= paste0("picrust2", "/path.tsv")
p2_PW = as.data.frame(fread(p2_PW))
p2PW rownames(p2PW) = p2PW$"pathway"
= p2PW %>%
p2PW mutate(across(-c(`pathway`, description), round ))
= as.matrix(p2PW[,-1]) # drop off pathway column as we have it as row names p2PW
Downstream analyses
There are many possible ways to analyze the PICRUSt2
output. Package ggpicrust2
is an R package for analyzing PICRUSt2 data. For statistical analysis, package ALDEx2
was mentioned in PICRUSt2
github wiki, but you can use any packages discussed in differential abundance analysis section. We will demonstrate using ALDEx2
to analyze our PICRUSt2
output.
= "group + time"
model = str_split(model, pattern = "\\+") %>% unlist() %>% str_trim()
covariate = covariate[1]
var1 = covariate[2]
var2 = sample_meta[, c(var1, var2)] %>% as.data.frame()
covariate_matrix = as.formula(paste("", model, sep = " ~ " ))
formula <- model.matrix(object = formula, covariate_matrix) mm
Slow step. Alternatively,
load("./intermediate/picrust2_aldex2.rda")
This is the step we are likely going to take in the interest of time. Instructor will let you know. The loading of the above prepared data will replace the code block just below for EC, KO, and PW.
= p2EC %>%
EC as.data.frame() %>%
::select(-`description`) %>%
dplyrmutate(across(everything(), as.numeric))
<- aldex.clr(EC, mm, mc.samples=128, denom="all", verbose=F)
EC_glm_x <- aldex.glm(EC_glm_x, mm) # slow step
EC_glm_test <- aldex.glm.effect(EC_glm_x) EC_glm_effect
Check estimated effect size
hist(EC_glm_effect[["groupCPB"]]$effect, breaks = 20, xlab = "effect size", col = "yellow", main = "EC")
Create MW and MA plots
aldex.plot(EC_glm_effect[["groupCPB"]], type = "MW", test = "effect",
cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6,
xlab = "Dispersion", ylab = "Difference")
title(main = "(EC) MW Plot")
aldex.plot(EC_glm_effect[["groupCPB"]], type = "MA", test = "effect",
cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6,
xlab = "Dispersion", ylab = "Difference")
title(main = "(EC) MA Plot")
Relationship between effect, difference, and p values
plot(EC_glm_effect[["groupCPB"]]$effect, EC_glm_test$`groupCPB:pval`,
log = "y", cex = 0.4, col = "blue", pch = 19,
xlab = "Effect size", ylab = "P value", main = "(EC) Effect size plot")
points(EC_glm_effect[["groupCPB"]]$effect, EC_glm_test$`groupCPB:pval.holm`,
cex = 0.5, col = "red", pch = 19)
abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))
plot(EC_glm_test$`groupCPB:Est`, EC_glm_test$`groupCPB:pval`,
log = "y", cex = 0.4, col = "blue", pch = 19,
xlab = "Difference", ylab = "P value", main = "(EC) Volcano plot")
points(EC_glm_test$`groupCPB:Est`, EC_glm_test$`groupCPB:pval.holm`,
cex = 0.5, col = "red", pch = 19)
abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))
Extracting only significant features
This code is specific for the model you have; in our case we want to look at features that were significantly different in our CPB group vs the CNT group. If we look at the structure of the GLM test, we can see that there are two columns groupCPB:pval.holm
and timepost:pval.holm
that denote the adjusted p-values for the group and the time, respectively.
colnames(EC_glm_test)
[1] "Intercept::Est" "Intercept::SE" "Intercept::t.val"
[4] "Intercept::pval" "groupCPB:Est" "groupCPB:SE"
[7] "groupCPB:t.val" "groupCPB:pval" "timepost:Est"
[10] "timepost:SE" "timepost:t.val" "timepost:pval"
[13] "Intercept::pval.holm" "groupCPB:pval.holm" "timepost:pval.holm"
And then we can select, for example, for those features that are significant between the two levels in our group by:
<- EC_glm_test[EC_glm_test$'groupCPB:pval.holm' < 0.05,]
significant_EC head(significant_EC)
Intercept::Est Intercept::SE Intercept::t.val Intercept::pval
EC:1.21.98.1 1.278804 0.2065985 6.192541 4.031766e-06
EC:2.5.1.120 1.790699 0.2353511 7.611366 1.906152e-07
EC:4.2.1.151 1.277179 0.2059617 6.204028 3.951708e-06
groupCPB:Est groupCPB:SE groupCPB:t.val groupCPB:pval timepost:Est
EC:1.21.98.1 -1.422846 0.2543588 -5.595640 1.552865e-05 -0.2873220
EC:2.5.1.120 -1.581422 0.2897582 -5.459127 2.112049e-05 -0.3154491
EC:4.2.1.151 -1.419692 0.2535747 -5.600972 1.547512e-05 -0.2827332
timepost:SE timepost:t.val timepost:pval Intercept::pval.holm
EC:1.21.98.1 0.2508011 -1.144916 0.2670328 0.0018645293
EC:2.5.1.120 0.2857054 -1.103844 0.2834117 0.0001221716
EC:4.2.1.151 0.2500281 -1.130683 0.2726728 0.0018310857
groupCPB:pval.holm timepost:pval.holm
EC:1.21.98.1 0.02657190 1
EC:2.5.1.120 0.03612157 1
EC:4.2.1.151 0.02648130 1
Instead of filtering by pval.holm
, we can filter by absolute effect size (in the *_glm_effect
variable) instead, as this is the recommended method by the author of ALDEx2
. Absolute effect > 1 is then considered significant.
We can now use the row names of the significant features to extract the effect sizes and the description (that we threw out earlier)..
<- EC_glm_effect$groupCPB[rownames(significant_EC),]
significant_EC_effect <- as.data.frame(p2EC)$description[rownames(p2EC) %in% rownames(significant_EC)]
significant_EC_description
head(significant_EC_effect)
rab.all rab.win.0 rab.win.1 diff.btw diff.win effect
EC:1.21.98.1 0.5527638 1.166232 -0.17991944 -1.420502 0.9142540 -1.464474
EC:2.5.1.120 0.9952047 1.548379 -0.05057648 -1.587636 1.0269548 -1.458089
EC:4.2.1.151 0.5487666 1.173546 -0.17382010 -1.442925 0.8930056 -1.602430
overlap
EC:1.21.98.1 0.04212226
EC:2.5.1.120 0.05156297
EC:4.2.1.151 0.04531303
head(significant_EC_description)
[1] "Cyclic dehypoxanthinyl futalosine synthase"
[2] "Aminodeoxyfutalosine synthase"
[3] "Chorismate dehydratase"
.. and then glue the three objects together. This can then be pruned for columns that are not interesting or too overwhelming/not needed, and written to a file for further analysis (not shown, use as exercise!)
<- cbind(description = significant_EC_description, significant_EC, significant_EC_effect) EC_CPB_sign
= p2KO %>%
KO as.data.frame() %>%
::select(-`description`) %>%
dplyrmutate(across(everything(), as.numeric))
<- aldex.clr(KO, mm, mc.samples=128, denom="all", verbose=F)
KO_glm_x <- aldex.glm(KO_glm_x, mm)
KO_glm_test <- aldex.glm.effect(KO_glm_x)
KO_glm_effect
# KO_glm <- aldex(KO, mm, mc.samples=128, test = "glm", effect = TRUE, verbose=TRUE, denom = "all")
Check estimated effect size
hist(KO_glm_effect[["groupCPB"]]$effect, breaks = 20, xlab = "effect size", col = "yellow", main = "KO")
Create MW and MA plots
aldex.plot(KO_glm_effect[["groupCPB"]], type = "MW", test = "effect",
cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6,
xlab = "Dispersion", ylab = "Difference")
title(main = "(KO) MW Plot")
aldex.plot(KO_glm_effect[["groupCPB"]], type = "MA", test = "effect",
cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6,
xlab = "Dispersion", ylab = "Difference")
title(main = "(KO) MA Plot")
Relationship between effect, difference, and p values
plot(KO_glm_effect[["groupCPB"]]$effect, KO_glm_test$`groupCPB:pval`,
log = "y", cex = 0.4, col = "blue", pch = 19,
xlab = "Effect size", ylab = "P value", main = "(KO) Effect size plot")
points(KO_glm_effect[["groupCPB"]]$effect, KO_glm_test$`groupCPB:pval.holm`,
cex = 0.5, col = "red", pch = 19)
abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))
plot(KO_glm_test$`groupCPB:Est`, KO_glm_test$`groupCPB:pval`,
log = "y", cex = 0.4, col = "blue", pch = 19,
xlab = "Difference", ylab = "P value", main = "(KO) Volcano plot")
points(KO_glm_test$`groupCPB:Est`, KO_glm_test$`groupCPB:pval.holm`,
cex = 0.5, col = "red", pch = 19)
abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))
Extracting only significant features
This code is specific for the model you have; in our case we want to look at features that were significantly different in our CPB group vs the CNT group. If we look at the structure of the GLM test, we can see that there are two columns groupCPB:pval.holm
and timepost:pval.holm
that denote the adjusted p-values for the group and the time, respectively. Instead of filtering by pval.holm
, we can filter by absolute effect size (in the *_glm_effect
variable) instead, as this is the recommended method by the author of ALDEx2
. Absolute effect > 1 is then considered significant.
In this case, that is what we are showing you here (there are no significant features if we look at pval.holm - see how we did that for EC and check it yourself).
str(KO_glm_effect)
List of 2
$ groupCPB:'data.frame': 5505 obs. of 7 variables:
..$ rab.all : num [1:5505] 4.424 0.271 6.915 0.287 4.889 ...
..$ rab.win.0: num [1:5505] 4.4245 1.4167 6.9146 -0.0528 4.8481 ...
..$ rab.win.1: num [1:5505] 4.445 -0.662 7.023 0.759 4.925 ...
..$ diff.btw : num [1:5505] 0.119 -2.216 0.269 1.06 0.234 ...
..$ diff.win : num [1:5505] 0.771 2.783 0.799 1.682 0.579 ...
..$ effect : num [1:5505] 0.173 -0.716 0.316 0.525 0.366 ...
..$ overlap : num [1:5505] 0.424 0.155 0.395 0.295 0.367 ...
$ timepost:'data.frame': 5505 obs. of 7 variables:
..$ rab.all : num [1:5505] 4.424 0.271 6.915 0.287 4.889 ...
..$ rab.win.0: num [1:5505] 4.474 0.271 6.944 0.495 4.936 ...
..$ rab.win.1: num [1:5505] 4.369 0.2388 6.6983 0.0219 4.8264 ...
..$ diff.btw : num [1:5505] 0.0772 -0.3461 -0.2318 -0.5853 -0.111 ...
..$ diff.win : num [1:5505] 0.766 3.059 0.66 1.8 0.569 ...
..$ effect : num [1:5505] 0.1114 -0.0771 -0.2728 -0.3227 -0.174 ...
..$ overlap : num [1:5505] 0.459 0.474 0.37 0.354 0.44 ...
colnames(KO_glm_effect$groupCPB)
[1] "rab.all" "rab.win.0" "rab.win.1" "diff.btw" "diff.win" "effect"
[7] "overlap"
And then we can select, for example, for those features that are significant between the two levels in our group by:
<- rownames(KO_glm_effect$groupCPB[abs(KO_glm_effect$groupCPB$effect) > 1,])
significant_KO head(significant_KO)
[1] "K00127" "K00380" "K00557" "K01497" "K01654" "K02386"
We can now use the list of the significant features to extract the test and the effect size data, as well as the description (that we threw out earlier)..
<- KO_glm_test[significant_KO,]
significant_KO_test <- KO_glm_effect$groupCPB[significant_KO,]
significant_KO_effect <- as.data.frame(p2KO)$description[rownames(p2KO) %in% significant_KO]
significant_KO_description
head(significant_KO_test)
Intercept::Est Intercept::SE Intercept::t.val Intercept::pval
K00127 2.545390 0.1905511 13.362673 1.046347e-11
K00380 1.149812 0.2274718 5.056823 5.573997e-05
K00557 1.454651 0.2194788 6.629542 1.525066e-06
K01497 1.409107 0.2565700 5.493784 1.966397e-05
K01654 3.050979 0.2477408 12.317390 4.648561e-11
K02386 2.626857 0.2098831 12.521219 3.585200e-11
groupCPB:Est groupCPB:SE groupCPB:t.val groupCPB:pval timepost:Est
K00127 -1.1287451 0.2346016 -4.812509 9.665926e-05 -0.1093162
K00380 -1.1092477 0.2800573 -3.962150 7.555302e-04 0.0410316
K00557 -1.0071969 0.2702166 -3.728412 1.298367e-03 0.2424569
K01497 -0.9944875 0.3158823 -3.149054 4.971209e-03 0.1598871
K01654 -1.3550777 0.3050120 -4.443141 2.289655e-04 -0.4074431
K02386 -1.1685579 0.2584026 -4.523418 1.912052e-04 -0.1866487
timepost:SE timepost:t.val timepost:pval Intercept::pval.holm
K00127 0.2313203 -0.4718006 0.6433740 3.743312e-08
K00380 0.2761402 0.1489362 0.8738907 1.034422e-01
K00557 0.2664371 0.9106022 0.3751650 3.721714e-03
K01497 0.3114641 0.5142822 0.6141965 4.138823e-02
K01654 0.3007459 -1.3546659 0.1906326 1.623544e-07
K02386 0.2547884 -0.7322239 0.4730964 1.257243e-07
groupCPB:pval.holm timepost:pval.holm
K00127 0.5315235 1
K00380 1.0000000 1
K00557 1.0000000 1
K01497 1.0000000 1
K01654 0.9933919 1
K02386 0.9254890 1
head(significant_KO_effect)
rab.all rab.win.0 rab.win.1 diff.btw diff.win effect overlap
K00127 1.9478705 2.455850 1.3027083 -1.116716 0.7960562 -1.377084 0.07812530
K00380 0.6862665 1.139806 0.1207816 -1.093551 0.8963992 -1.151500 0.09828416
K00557 0.9054649 1.665706 0.6248386 -1.027131 0.7747888 -1.074238 0.12656267
K01497 0.8488230 1.658194 0.5838604 -1.091957 0.8652529 -1.046514 0.16562512
K01654 2.4223764 3.026338 1.5046760 -1.424031 0.9927465 -1.337132 0.10468771
K02386 2.0480352 2.496997 1.4082157 -1.188066 0.9299798 -1.244966 0.08580370
head(significant_KO_description)
[1] "fdoI, fdsG; formate dehydrogenase subunit gamma"
[2] "cysJ; sulfite reductase (NADPH) flavoprotein alpha-component [EC:1.8.1.2]"
[3] "trmA; tRNA (uracil-5-)-methyltransferase [EC:2.1.1.35]"
[4] "ribA, RIB1; GTP cyclohydrolase II [EC:3.5.4.25]"
[5] "E2.5.1.56, neuB; N-acetylneuraminate synthase [EC:2.5.1.56]"
[6] "flgA; flagella basal body P-ring formation protein FlgA"
.. and then glue the three objects together. This can then be pruned for columns that are not interesting or too overwhelming/not needed, and written to a file for further analysis (not shown, use as exercise!)
<- cbind(description = significant_KO_description, significant_KO_test, significant_KO_effect) KO_CPB_sign
= p2PW %>%
PW as.data.frame() %>%
::select(-`description`) %>%
dplyrmutate(across(everything(), as.numeric))
<- aldex.clr(PW, mm, mc.samples=128, denom="all", verbose=F)
PW_glm_x <- aldex.glm(PW_glm_x, mm)
PW_glm_test <- aldex.glm.effect(PW_glm_x)
PW_glm_effect
# PW_glm <- aldex(PW, mm, mc.samples=128, test = "glm", effect = TRUE, verbose=TRUE, denom = "all")
Check estimated effect size
hist(PW_glm_effect[["groupCPB"]]$effect, breaks = 20, xlab = "effect size", col = "yellow", main = "Pathway")
Create MW and MA plots
aldex.plot(PW_glm_effect[["groupCPB"]], type = "MW", test = "effect",
cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6,
xlab = "Dispersion", ylab = "Difference")
title(main = "(PW) MW Plot")
aldex.plot(PW_glm_effect[["groupCPB"]], type = "MA", test = "effect",
cutoff=1,all.cex = 0.4, rare.cex = 0.4, called.cex = 0.6,
xlab = "Dispersion", ylab = "Difference")
title(main = "(PW) MA Plot")
Relationship between effect, difference, and p values
plot(PW_glm_effect[["groupCPB"]]$effect, PW_glm_test$`groupCPB:pval`,
log = "y", cex = 0.4, col = "blue", pch = 19,
xlab = "Effect size", ylab = "P value", main = "(PW) Effect size plot")
points(PW_glm_effect[["groupCPB"]]$effect, PW_glm_test$`groupCPB:pval.holm`,
cex = 0.5, col = "red", pch = 19)
abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))
plot(PW_glm_test$`groupCPB:Est`, PW_glm_test$`groupCPB:pval`,
log = "y", cex = 0.4, col = "blue", pch = 19,
xlab = "Difference", ylab = "P value", main = "(PW) Volcano plot")
points(PW_glm_test$`groupCPB:Est`, PW_glm_test$`groupCPB:pval.holm`,
cex = 0.5, col = "red", pch = 19)
abline(h = 0.05, lty = 2, col = "grey")
legend("bottom", legend = c("P value", "BH-adjusted"), pch = 19, col = c("blue", "red"))
Extracting only significant features
This code is specific for the model you have; in our case we want to look at features that were significantly different in our CPB group vs the CNT group. If we look at the structure of the GLM test, we can see that there are two columns groupCPB:pval.holm
and timepost:pval.holm
that denote the adjusted p-values for the group and the time, respectively. Instead of filtering by pval.holm
, we can filter by absolute effect size (in the *_glm_effect
variable) instead, as this is the recommended method by the author of ALDEx2
. Absolute effect > 1 is then considered significant.
On purpose, we have not provided code here for you. Try it yourself, look at the examples for EC where we filtered on pval.holm
, and KO where we filtered on absolute effect
.
The answer to the number of significant pathways is 2 and 2, respectively. Check if both filtering methods give you the same answer; what is the description of the pathways?
Key limitations
There are several limitations to keep in mind when analyzing PICRUSt2
output, which are mainly related to predictions being limited to the gene contents of existing reference genomes. The main drawback is:
Differential abundance results will likely differ substantially from what would be found based on shotgun metagenomics data. Although amplicon-based predictions may be highly correlated with functional profiles based on shotgun metagenomics sequencing data, the actual functions identified as significantly different can substantially differ. Check out this post for more details.
More detailed discussion on its limitation can be found in their github wiki.
References and Further Readings
- Douglas et al. (2020) PICRUSt2 for prediction of metagenome functions paper local copy
PICRUSt2
wikiALDEx2
vignette- A demo of picrust2 & ALDEX2
- Fernandes et al. (2013) ANOVA-Like Differential Expression (ALDEx) Analysis for Mixed Population RNA-Seq link local copy
- Fernandes et al. (2014) Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis link local copy
- Gloor et al. (2016) Displaying Variation in Large Datasets: Plotting a Visual Summary of Effect Sizes. link local copy
Reproducibility
R session information:
sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.14.8 ALDEx2_1.30.0 zCompositions_1.4.1
[4] truncnorm_1.0-9 NADA_1.6-1.1 survival_3.5-7
[7] MASS_7.3-60 reticulate_1.34.0 kableExtra_1.3.4
[10] picante_1.8.2 nlme_3.1-163 ape_5.7-1
[13] vegan_2.6-4 lattice_0.21-8 permute_0.9-7
[16] rstatix_0.7.2 phyloseq_1.42.0 lubridate_1.9.3
[19] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3
[22] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[25] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] colorspace_2.1-0 ellipsis_0.3.2
[3] XVector_0.38.0 GenomicRanges_1.50.2
[5] rstudioapi_0.15.0 DT_0.30
[7] fansi_1.0.4 xml2_1.3.5
[9] codetools_0.2-19 splines_4.2.2
[11] cachem_1.0.8 knitr_1.44
[13] ade4_1.7-22 jsonlite_1.8.7
[15] broom_1.0.5 cluster_2.1.4
[17] png_0.1-8 compiler_4.2.2
[19] httr_1.4.7 backports_1.4.1
[21] RcppZiggurat_0.1.6 Matrix_1.6-1
[23] fastmap_1.1.1 cli_3.6.1
[25] htmltools_0.5.6 tools_4.2.2
[27] igraph_1.5.1 gtable_0.3.4
[29] glue_1.6.2 GenomeInfoDbData_1.2.9
[31] reshape2_1.4.4 Rcpp_1.0.11
[33] carData_3.0-5 Biobase_2.58.0
[35] jquerylib_0.1.4 vctrs_0.6.3
[37] Biostrings_2.66.0 rhdf5filters_1.10.1
[39] multtest_2.54.0 svglite_2.1.2
[41] crosstalk_1.2.0 iterators_1.0.14
[43] xfun_0.40 rvest_1.0.3
[45] timechange_0.2.0 lifecycle_1.0.3
[47] zlibbioc_1.44.0 scales_1.2.1
[49] hms_1.1.3 MatrixGenerics_1.10.0
[51] parallel_4.2.2 SummarizedExperiment_1.28.0
[53] biomformat_1.26.0 rhdf5_2.42.1
[55] yaml_2.3.7 sass_0.4.7
[57] stringi_1.7.12 S4Vectors_0.36.2
[59] foreach_1.5.2 BiocGenerics_0.44.0
[61] BiocParallel_1.32.6 GenomeInfoDb_1.34.9
[63] rlang_1.1.1 pkgconfig_2.0.3
[65] systemfonts_1.0.5 bitops_1.0-7
[67] matrixStats_1.0.0 evaluate_0.22
[69] Rhdf5lib_1.20.0 htmlwidgets_1.6.2
[71] Rfast_2.0.8 tidyselect_1.2.0
[73] plyr_1.8.8 magrittr_2.0.3
[75] R6_2.5.1 IRanges_2.32.0
[77] generics_0.1.3 DelayedArray_0.24.0
[79] DBI_1.1.3 pillar_1.9.0
[81] withr_2.5.1 mgcv_1.9-0
[83] abind_1.4-5 RCurl_1.98-1.12
[85] crayon_1.5.2 car_3.1-2
[87] utf8_1.2.3 tzdb_0.4.0
[89] rmarkdown_2.25 grid_4.2.2
[91] digest_0.6.33 webshot_0.5.5
[93] stats4_4.2.2 munsell_0.5.0
[95] viridisLite_0.4.2 bslib_0.5.1