Microbiome analysis in R - PICRUSt2

Author

Jean-Jack Riethoven, Max Qiu, Chia Sin Liew,
Bioinformatics Core Research Facility (BCRF)
Nebraska Center for Biotechnology,
University of Nebraska-Lincoln

Published

October 17, 2023

Note

For a free consultation, please visit our consultation scheduler!

Workshop and Data Overview

This installment of the microbiome analysis is PICRUST2 for prediction of metagenomics functions. For this session, we will be using the full phyloseq object from Salomon, JD et al., 2023 so we have more covariants to work with. The data has two main covariates, group (CNT and CPB representing control samples and cardiopulmonary bypass samples) and time (pre and post representing pre-operation and post-operation time points).

Getting Ready

Load R packages we will be using for today’s workshop.

suppressMessages({
        library("tidyverse")
        library("phyloseq")
        library("rstatix")
        library("vegan")
        library("picante")
        library("kableExtra")
        library("reticulate")
        library("ALDEx2")
        library("data.table")
})
no_of_cores = 1

## useful functions for generating PICRUSt2 input

#---- phyloseq_counts_to_df
# Converts phyloseq into a data frame which can be used to export the
# ASVs and samples with their counts to a file. This function also
# adds a new column with the ASV IDs to the front.
phyloseq_counts_to_df <- function(ps) {
  df <- as.data.frame(t(otu_table(ps)))
  df <- cbind("#OTU ID" = rownames(df), df)  # Cheeky way to add a comment to first line

  # Note: we use OTU ID above since at the moment this function is meanly used
  # to pipe files into picrust, which needs the 'OTU ID'.
  return(df)
}


#---- write_counts_to_file
# Write the (raw) counts from a phyloseq object as a data matrix to file.
#
# INPUT  ps           : phyloseq object
#        file_name    : full file name (incl. path if required when not writing
#                       to working directory)
write_counts_to_file <- function(ps, filename) {
  df <- phyloseq_counts_to_df(ps)

  write.table(df,
              file=filename,
              sep = "\t",
              row.names = FALSE)
}

#---- write_seqs_to_file
# Write the sequences from a phyloseq object as a fasta file.
#
# INPUT  ps           : phyloseq object
#        filename    : full file name (incl. path if required when not writing
#                       to working directory)
write_seqs_to_file <- function(ps, filename) {
  df <- as.data.frame(refseq(ps))

  unlink(filename)  # make sure we delete before we concatenate below

  names <- rownames(df)
  for (i in 1:nrow(df)) {
    cat(paste0(">", names[i]), file=filename, sep="\n", append=TRUE)
    cat(df[i,1], file=filename, sep="\n", append=TRUE)
  }
  message(paste0("Wrote ", nrow(df), " ASV sequences to file ", filename))
}

Copy all needed files to $WORK, create the 16S_Workshop directory if it does not exist, and set it as working directory using commands below.

system("cp -Ru /work/riethoven/jeanjack/16S_Workshop $WORK/", intern=TRUE)
setwd(file.path(Sys.getenv("WORK"), "16S_Workshop"))
Note

After the above setwd() command use the ‘gear’ icon in RStudio’s right-bottom panel and select ‘Go to working directory’. That will then enable you see files in the ‘Files’ tab of the right-bottom panel.

Load the phyloseq object constructed from whole dataset of Salomon, JD et al., 2023.

load("./intermediate/SalomonJD_et_al_covariates.rda")

Inspect objects loaded into your environment.

ls()
[1] "create_dt"             "has_annotations"       "no_of_cores"          
[4] "phyloseq_counts_to_df" "ps"                    "write_counts_to_file" 
[7] "write_seqs_to_file"   
ps
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4578 taxa and 24 samples ]
sample_data() Sample Data:       [ 24 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 4578 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 4578 tips and 4576 internal nodes ]
refseq()      DNAStringSet:      [ 4578 reference sequences ]

To rearrange the levels of factor or to identify which level of a factor is reference

Note

Levels of a factor are ordered alphabetically by default, which means for group factor with two levels CNT and CPB, CNT is default reference level; for time factor with two levels pre and post, post is default reference level.

sample_meta = sample_data(ps) %>%
        dplyr::as_tibble() %>%
        dplyr::mutate(time = factor(time)) %>% # this step make `time` a factor so we can rearrange its levels
        within(., time <- relevel(time, ref = "pre")) %>% # this step set `pre` as ref level
        as.data.frame()

rownames(sample_meta) = sample_meta$sample 
sample_data(ps) = data.frame(sample_meta)

Inspect and Filter phyloseq object

A few screening and filtering steps before final phyloseq object is ready for abundance visualization and diversity analysis.

  • Step 1, remove Archaea, unknowns, chloroplasts;
ps.clean <- subset_taxa(ps, Kingdom == "Bacteria") %>%
        subset_taxa(!is.na(Phylum)) %>%
        subset_taxa(!Class %in% c("Chloroplast")) %>%
        subset_taxa(!Family %in% c("mitochondria"))
ps.clean
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4541 taxa and 24 samples ]
sample_data() Sample Data:       [ 24 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 4541 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 4541 tips and 4539 internal nodes ]
refseq()      DNAStringSet:      [ 4541 reference sequences ]
Note

We are only interested in bacteria. Removed 37 taxa.

  • Step 2, filter taxa (ASV) with low abundance (< 2);
ps.clean.p0 <- filter_taxa(ps.clean, function (x) {sum(x > 0) >= 2}, prune=TRUE)
ps.clean.p0
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3564 taxa and 24 samples ]
sample_data() Sample Data:       [ 24 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 3564 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 3564 tips and 3562 internal nodes ]
refseq()      DNAStringSet:      [ 3564 reference sequences ]
Note

Abundance is the sum of the number of times an ASV is observed across all samples. Removed 977 taxa.

Complete metadata are shown below:

sample_data(ps.clean.p0)

Prepare for input files for PICRUSt2:

Note

If have trouble generating files in the next step, copy the prepared files:

system("cp -u ./intermediate/raw_counts.tsv picrust2/")
system("cp -u ./intermediate/seqs.fna picrust2/ ")
dir.create("picrust2", showWarnings = FALSE)
write_counts_to_file(ps.clean.p0, filename = "picrust2/raw_counts.tsv")
write_seqs_to_file(ps.clean.p0, filename = "picrust2/seqs.fna")

Functional prediction using PICRUSt2

PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) is a software for predicting functional abundances based only on marker gene sequences. “Function” usually refers to gene families such as KEGG orthologs (KO) and Enzyme Classification numbers (EC), but predictions can be made for any arbitrary trait. Similarly, predictions are typically based on 16S rRNA gene sequencing data, but other marker genes can also be used.

The entire PICRUSt2 pipeline can be run using a single script, called picrust2_pipeline.py. This script will run each of the 4 key steps outlined in above figure: (1) sequence placement, (2) hidden-state prediction of genomes, (3) metagenome prediction, (4) pathway-level predictions.

Run PICRUSt2 on cluster:

Warning

Slow step. Depending on our schedule we might skip this and use pregenerated files instead. Instructor will let you know.

If we were to run this in a terminal or a submitted job (via slurm) on HCC, these would be the commands (do not run now; they are an example):

module load picrust2/2.4
# Use an extra cd to go to your work directory (not shown)
mkdir -p picrust2; cd picrust2
rm -Rf results # delete the results directory if it already exists
               # otherwise picrust2 will stop (safety feature)
picrust2_pipeline.py -s seqs.fna -i raw_counts.tsv -p 2 -o results \
                     --stratified --verbose

Instead, if within R/RStudio, you should copy and run the next (long) line to execute this from within R. If you get an error ‘file not found’, you have forgotten to include the ‘picrust2/2.4’ module in setting up your RStudio instance. However, we are not going to run this either since it takes a relatively long time (40 minutes) for our workshop. The longest steps in the workflow are all single-core commands that are not efficient. You are welcome to run the next commands outside the workshop at another time.

system('mkdir -p picrust2 ; cd picrust2 ; rm -Rf results; picrust2_pipeline.py -s seqs.fna -i raw_counts.tsv -p 4 -o results --stratified --verbose')

We have prepared the output of the picrust2 workflow in advance, so we are going to use those files. Let’s copy them here:

system('cp -Ru ./intermediate/picrust2 .')

After the full workflow is finished, we can add descriptions as a new column in gene family and pathway abundance tables using add_descriptions.py. To add a description column to E.C. number, KO, and MetaCyc pathway abundance tables respectively (again, don’t copy and paste these; use the long system() command instead:

cd $WORK/16S_Workshop/picrust2/results

add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                    -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
                    -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
                    -o pathways_out/path_abun_unstrat_descrip.tsv.gz
                    
# Decompress tsv files
find . -name "*desc*" -exec pigz -dkv \{\} \;

The long system command we will be running is:

system("cd ./picrust2/results; add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz; add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz; add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC -o pathways_out/path_abun_unstrat_descrip.tsv.gz; find . -name '*desc*.gz' -exec gunzip -dfv \\{\\} \\;")

(Again, this long command has exactly the same commands as listed in the previous code block; although this one is easier to run for us without making typos or mistakes). You will see some messages about replacing files since we pregenerated the unzipped files as well.

The main output of PICRUSt2 is the tsv file within EC_metagenome_out, KO_metagenome_out, and pathways_out folders. Copy those tsv files to your own working directory. Be sure to change their names (EC.tsv, KO.tsv, and path.tsv, respectively), as some files share the same name. You can do this manually, or use the quick system commands below.

system('cp ./picrust2/results/EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv  ./picrust2/EC.tsv')
system('cp ./picrust2/results/KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv  ./picrust2/KO.tsv')
system('cp ./picrust2/results/pathways_out/path_abun_unstrat_descrip.tsv  ./picrust2/path.tsv')
p2_EC = paste0("picrust2", "/EC.tsv")
p2EC = as.data.frame(fread(p2_EC))
rownames(p2EC) = p2EC$"function"
p2EC = p2EC %>%
        mutate(across(-c(`function`, description), round ))
p2EC = as.matrix(p2EC[,-1]) # drop off function column as we have it as row names
p2_KO = paste0("picrust2", "/KO.tsv")
p2KO = as.data.frame(fread(p2_KO))
rownames(p2KO) = p2KO$"function"
p2KO = p2KO %>%
        mutate(across(-c(`function`, description), round ))
p2KO = as.matrix(p2KO[,-1])