Microbiome analysis in R - Canonical Correlation Analysis

Author

Max Qiu, Chia Sin Liew, Jean-Jack Riethoven,
Binformatics Core Research Facility (BCRF)
Nebraska Center for Biotechnology,
University of Nebraska-Lincoln

Published

October 17, 2023

Note

For a free consultation, please visit our consultation scheduler!

Workshop and Data Overview

This installment of the microbiome analysis is canonical correlation analysis. For this session, we will be using the full phyloseq object from Salomon, JD et al., 2023 so we have more covariants to work with. The data has two main covariates, group (CNT and CPB representing control samples and cardiopulmonary bypass samples) and time (pre and post representing pre-operation and post-operation time points).

Getting Ready

Load R packages we will be using for today’s workshop.

suppressMessages({
        library("tidyverse")
        library("phyloseq")
        library("mixOmics")
})
no_of_cores = 1

Copy all needed files to $WORK, create the 16S_Workshop directory if it does not exist, and set it as working directory using commands below.

system("cp -Ru /work/riethoven/jeanjack/16S_Workshop $WORK/", intern=TRUE)
setwd(file.path(Sys.getenv("WORK"), "16S_Workshop"))

Load the phyloseq object constructed from whole dataset of Salomon, JD et al., 2023.

load("./intermediate/SalomonJD_et_al_covariates.rda")

Inspect objects loaded into your environment.

ls()
[1] "create_dt"   "no_of_cores" "pngconvert"  "ps"         
ps
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4578 taxa and 24 samples ]
sample_data() Sample Data:       [ 24 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 4578 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 4578 tips and 4576 internal nodes ]
refseq()      DNAStringSet:      [ 4578 reference sequences ]

To rearrange the levels of factor or to identify which level of a factor is reference

Note

Levels of a factor are ordered alphabetically by default, which means for group factor with two levels CNT and CPB, CNT is default reference level; for time factor with two levels pre and post, post is default reference level.

sample_meta = sample_data(ps) %>%
        dplyr::as_tibble() %>%
        dplyr::mutate(time = factor(time)) %>% # this step make `time` a factor so we can rearrange its levels
        within(., time <- relevel(time, ref = "pre")) %>% # this step set `pre` as ref level
        as.data.frame()

rownames(sample_meta) = sample_meta$sample 
sample_data(ps) = data.frame(sample_meta)

Inspect and Filter phyloseq object

A few screening and filtering steps before final phyloseq object is ready for abundance visualization and diversity analysis.

  • Step 1, remove Archaea, unknowns, chloroplasts;
ps.clean <- subset_taxa(ps, Kingdom == "Bacteria") %>%
        subset_taxa(!is.na(Phylum)) %>%
        subset_taxa(!Class %in% c("Chloroplast")) %>%
        subset_taxa(!Family %in% c("mitochondria"))
ps.clean
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 4541 taxa and 24 samples ]
sample_data() Sample Data:       [ 24 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 4541 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 4541 tips and 4539 internal nodes ]
refseq()      DNAStringSet:      [ 4541 reference sequences ]
Note

We are only interested in bacteria. Removed 37 taxa.

  • Step 2, filter taxa (ASV) with low abundance (< 2);
ps.clean.p0 <- filter_taxa(ps.clean, function (x) {sum(x > 0) >= 2}, prune=TRUE)
ps.clean.p0
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3564 taxa and 24 samples ]
sample_data() Sample Data:       [ 24 samples by 17 sample variables ]
tax_table()   Taxonomy Table:    [ 3564 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 3564 tips and 3562 internal nodes ]
refseq()      DNAStringSet:      [ 3564 reference sequences ]
Note

Abundance is the sum of the number of times an ASV is observed across all samples. Removed 977 taxa.

Complete metadata are shown below:

sample_data(ps.clean.p0)

Multi-Omics integration

Sometimes single Omics analysis does not provide enough information to give a deep understanding of a biological system, but we can obtain a more holistic view of a system by combining multiple Omics analyses. There are generally three type of data integration:

  • Early integration: simply concatenation, curse of dimensionality
  • Late integration: each omics analyzed separatedly, results combined at the end
  • Middle integration: use ML or multivariant method to consolidate data without cocatenating features or merging results.

Illustration of early, middle, and late integration for merging data matrices generated by different omics

Specifically, what we are discussing here are two (or more) datasets measured over the same set of samples (experiemntal units), aiming to identify correlated variables across these datasets or to explain the categorical outcome. The question we are trying to answer is: do the information from both datasets agree and reflect any biological condition of interest?

Multivariate methods are well suited for large Omics datasets, where the number of variables (e.g. genes, proteins, metabolites) is much larger than the number of samples (patients, cells, mice). They have the appealing properties of reducing the dimension of the data by using instrumental variables (‘components’, ‘latent variables’), which are defined as combination of all variables. Those components are then used to produce useful graphical outputs that enable better understanding of the relationships and correlation structure between the different data sets that are integrated. It is important to know that multivariate statistical method, aiming at summarizing the main characteristics of the data while capturing the largest source of variation in the data, is exploratory, not enable statistical inference.

Canonical Correlation analysis

Canonical Correlation Analysis (CCA) is a multivariate approach to highlight correlations between two data sets acquired on the same experimental units. It is a dimension reduction technique that aids in exploring datasets. The components yielded by CCA (referred to as canonical variates) are linear combinations of variables from each original dataset. Canonical variates are constructed via the maximisation of the correlation between pairs of canonical variates. Each pair of canonical variates has an associated canonical correlation – the correlation between the two novel components.

Canonical correlation analysis is a rather old statistical technique for finding common components between two sets of correlated variables. The classical CCA method is only applicable when \(P + Q < N\), where \(P\) is the number of variables in the first dataset, \(Q\) is the number of variables in the second dataset and \(N\) is the number of samples in each dataset.

The issue of high dimensionality can be by-passed by introducing regularization into the CCA method. Regularized Canonical Correlation Analysis (rCCA) is able to perform on datasets of high dimensions and/or those with high collinearities (both of which are common in biological contexts). In rCCA, ridge penalties (\(\lambda1\) and \(\lambda2\)) are added to the diagnonal of X and Y respectively to make them invertible.

rCCA using mixOmics

There are two methods included in the mixOmics package to allow the CCA method to be regularized, ridge approach (\(L2\) norm) and shrinkage approach (\(L1\) norm). To read more about rCCA methodology in mixOmics, please visit their website. Case study can be found here.

Here, rCCA is performed to evaluate the correlation between the microbiome and other sets of measured biomarkers. For the sake of time, we will agglomerate ps.clean.p0 to genus level, ps.clean.genus.

# Agglomerate to genus-level and rename
ps.clean.genus = phyloseq::tax_glom(ps.clean.p0, taxrank = rank_names(ps.clean.p0)[6])
phyloseq::taxa_names(ps.clean.genus) = phyloseq::tax_table(ps.clean.genus)[,"Genus"]
otu_table(ps.clean.genus)[1:5,1:5]
OTU Table:          [5 taxa and 5 samples]
                     taxa are columns
       Lachnospiraceae UCG-004 28-4 Lachnospiraceae AC2044 group
12post                       9    0                          120
12pre                       34    0                          152
14post                      22    0                          650
14pre                       12    0                          118
20post                      33    0                            1
       Lachnospiraceae NK4A136 group Lachnoclostridium
12post                           261                40
12pre                           3962               169
14post                          2109               228
14pre                            569                91
20post                            54               180
ps.clean.cca = ps.clean.genus
# subset samples, no missing value
phyXpp <- subset_samples(ps.clean.cca, !is.na(claudin2) & !is.na(fabp2))
# covariates after subset
ebd <- sample_data(phyXpp)[,c("claudin2","claudin3","fabp2")] %>% log()
phyXpp <- filter_taxa(phyXpp, function(x) sum(x>0)>0, TRUE)
phyXpp <- transform_sample_counts(phyXpp, function(x) x/sum(x))
OTUtb <- data.frame(otu_table(phyXpp))
# scaling
OTUM <- as.matrix(scale(OTUtb, center = T)) # scale the otu table
ebdM <- as.matrix(scale(ebd, center = T)) # scale the covariates
rownames(OTUM) = rownames(ebdM)
set.seed(42)
X <- ebdM
Y <- OTUM

# Using the Shrinkage (shrinkage) Method
result.rcca.shrinkage <- rcc(X, Y, method = 'shrinkage') # run the CCA method

rCCA is an unsupervised approach that focuses on maximizing the correlation between the two data sets X and Y and therefore the information about the treatments or groups is not taken into account in the analysis.

Sample plot

The plotIndiv() function shows the relationship and similarities between samples. The samples are represented as points in a two (or three) dimensional subspace that is spanned by latent variables yielded from the multivariate models. These plots allow for the clustering of samples to be evaluated.

In the figure below, the different group are colored to illustrate how the samples are clustered on the sample plot. The first variate seem to separate the samples.

# sample plots
# plot the projection of samples for shrinkage rCCA data
plotIndiv(result.rcca.shrinkage, comp = 1:2,
          ind.names = sample_data(phyXpp)$time,
          group = sample_data(phyXpp)$group, rep.space = "XY-variate",
          legend = TRUE, title = 'rCCA shrinkage XY-space', guide = "none", cex = 6)

Arrow plot

The next sample plot to utilize is the arrow plot. The length of the arrows indicate the level of agreement or disagreement between the canonical variates of the X (EBD) dataset and the Y (microbiome) dataset.

From this, the relation between each sample’s projection into each dataset’s variate space can be evaluated. The start of each arrow represents the sample’s position in the space spanned by the X variate while the tip represents its position in the Y variate space.

Short arrows indicate that the sample shows a high level of agreement between the two datasets and a long arrow indicates disagreement. The arrow lengths can be equivocated to the correlation between components associated to their respective data set for a given dimension. The longer each arrow, the higher the disparity between the datasets.

# plot the arrow plot of samples for shrinkage rCCA data
plotArrow(result.rcca.shrinkage, group = sample_data(phyXpp)$group, col.per.group = color.mixo(1:2),
          legend = FALSE,
          title = "rCCA shrinkage method")

Variable plot (Correlation Circle Plot)

Another important step is analyzing the correlation structure between features and the canonical variates. This plot can also be used to assess the correlation between each of the original variables. Variable vectors that have a close proximity to one another will be high correlated. Vectors with an acute angle (less than 90°) between them indicates they have a positive correlation. An obtuse angle (greater than 90°) indicates a negative correlation. If they are at a right angle, their correlation is zero. In other words, the correlation between two features is equal to the cosine of the angle between their vectors (that begin at the origin).

Below depict the correlation circle plot, where we use a cutoff of 0.5 to prevent features with a negligible correlation being depicted. This plot shows that the main correlations between microbiome and EBD markers.

# plot original variables' correlation with canonical variates
plotVar(result.rcca.shrinkage, var.names = c(TRUE, TRUE), cex = c(4,4), cutoff = 0.5,
        title = "rCCA shrinkage comp 1-2", style = "ggplot2") 

Relevance network graph

A more robust way to evaluate the correlation structure is through a relevance network graph. Relevance networks are an extremely useful tool in evaluating the structure of associations between variables. Networks are able to show positive and negative correlations simultaneously. Networks also provide information on when variables belong to the same biological pathway.

The circular nodes represent EBD markers and the rectangular nodes represent taxa. The bipartite relationships between these non-negligibly correlated features can be seen by the color of the connecting lines (edges).

# network
network(result.rcca.shrinkage, comp = 1:2, interactive = TRUE, color.node = c("orange","lightblue"), lwd.edge = 2, cutoff = 0.5, color.edge = rev(color.jet(100)))

Above graph shows the correlation structure for all bipartite relationships with a correlation above 0.5. There is a reasonably complex correlation structure; there were two clusters.

Clustered image map

To complement the network plot, a clustered image maps (CIM) can be used (bellow). Clustered Image Maps (CIM) are a form of heatmap to represent Pearson correlation coefficients between two matched datasets.

Hierarchical clustering is applied on the rows and columns of the real-valued similarity matrix simultaneously. This is represented as a 2D colored grid where each cell’s color is based on the values of the similarity matrix when performing two dataset integration. Dendrograms are used along the axes to depict how each row/column clusters based on the hierarchical clustering method.

cim() is a great complementary tool to the correlation circle plot (plotVar()) and relevance networks (network()). The correlation between subsets of variables from each dataset can be observed.

# heatmap
cim(result.rcca.shrinkage, comp = 1:2, margins = c(10,12), color = rev(color.jet(25)) )

# subset samples, no missing value
phyXpp <- subset_samples(ps.clean.cca, !is.na(valeric_acid) )
# covariates after subset
scfa <- sample_data(phyXpp)[,c(10:17)] %>% log()
phyXpp <- filter_taxa(phyXpp, function(x) sum(x>0)>0, TRUE)
phyXpp <- transform_sample_counts(phyXpp, function(x) x/sum(x))
OTUtb <- data.frame(otu_table(phyXpp))
# scale the variables
OTUM <- as.matrix(scale(OTUtb, center = T)) 
scfaM <- as.matrix(scale(scfa, center = T))
rownames(OTUM) = rownames(scfaM)
set.seed(42)
X <- OTUM 
Y <- scfaM

# Using the Shrinkage (shrinkage) Method
result.rcca.shrinkage <- rcc(X, Y, method = 'shrinkage')

Sample plot

# sample plots
# plot projection into canonical variate subspace
plotIndiv(result.rcca.shrinkage, comp = 1:2,
          ind.names = sample_data(phyXpp)$time,
          group = sample_data(phyXpp)$group, rep.space = "XY-variate",
          legend = TRUE, title = 'rCCA shrinkage XY-space', guide = "none", cex = 6)

Arrow plot

plotArrow(result.rcca.shrinkage, group = sample_data(phyXpp)$group, col.per.group = color.mixo(1:2), 
          legend = FALSE,
          title = "rCCA shrinkage method")

Variable plot (Correlation Circle Plot)

# plot original variables' correlation with canonical variates
plotVar(result.rcca.shrinkage, var.names = c(TRUE, TRUE), cex = c(4,4), cutoff = 0.5,
        title = "rCCA shrinkage comp 1-2", style = "ggplot2") 

Relevance network graph

# network
network(result.rcca.shrinkage, comp = 1:2, interactive = TRUE, color.node = c("orange","lightblue"), lwd.edge = 2, cutoff = 0.5, color.edge = rev(color.jet(100)) )

Clustered image map

# heatmap
cim(result.rcca.shrinkage, comp = 1:2, margins = c(10,12), color = rev(color.jet(25)) )

References and Further Readings

  1. mixOmics
  2. Machine learning for multi-omics data integration in cancer
  3. Microbiome Data Analysis

Reproducibility

R session information:

sessionInfo()
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] mixOmics_6.22.0 lattice_0.20-45 MASS_7.3-58.2   phyloseq_1.42.0
 [5] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.1    
 [9] purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
[13] ggplot2_3.4.2   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] nlme_3.1-162           bitops_1.0-7           matrixStats_0.63.0    
 [4] RColorBrewer_1.1-3     GenomeInfoDb_1.34.9    bslib_0.4.2           
 [7] tools_4.2.3            utf8_1.2.3             R6_2.5.1              
[10] vegan_2.6-4            DT_0.27                DBI_1.1.3             
[13] BiocGenerics_0.44.0    mgcv_1.8-42            colorspace_2.1-0      
[16] permute_0.9-7          rhdf5filters_1.10.1    ade4_1.7-22           
[19] withr_2.5.0            tidyselect_1.2.0       gridExtra_2.3         
[22] compiler_4.2.3         cli_3.6.1              Biobase_2.58.0        
[25] labeling_0.4.2         sass_0.4.6             scales_1.2.1          
[28] digest_0.6.31          rmarkdown_2.21         XVector_0.38.0        
[31] pkgconfig_2.0.3        htmltools_0.5.5        fastmap_1.1.1         
[34] htmlwidgets_1.6.2      rlang_1.1.0            rstudioapi_0.14       
[37] farver_2.1.1           jquerylib_0.1.4        generics_0.1.3        
[40] jsonlite_1.8.4         crosstalk_1.2.0        BiocParallel_1.32.6   
[43] RCurl_1.98-1.12        magrittr_2.0.3         GenomeInfoDbData_1.2.9
[46] biomformat_1.26.0      Matrix_1.5-3           Rcpp_1.0.10           
[49] munsell_0.5.0          S4Vectors_0.36.2       Rhdf5lib_1.20.0       
[52] fansi_1.0.4            ape_5.7-1              lifecycle_1.0.3       
[55] stringi_1.7.12         yaml_2.3.7             zlibbioc_1.44.0       
[58] rhdf5_2.42.1           plyr_1.8.8             grid_4.2.3            
[61] parallel_4.2.3         ggrepel_0.9.3          crayon_1.5.2          
[64] Biostrings_2.66.0      splines_4.2.3          multtest_2.54.0       
[67] hms_1.1.3              magick_2.7.5           knitr_1.42            
[70] pillar_1.9.0           igraph_1.4.2           corpcor_1.6.10        
[73] reshape2_1.4.4         codetools_0.2-19       stats4_4.2.3          
[76] glue_1.6.2             evaluate_0.21          data.table_1.14.8     
[79] vctrs_0.6.1            tzdb_0.4.0             foreach_1.5.2         
[82] gtable_0.3.3           cachem_1.0.8           xfun_0.38             
[85] RSpectra_0.16-1        survival_3.5-3         rARPACK_0.11-0        
[88] iterators_1.0.14       IRanges_2.32.0         ellipse_0.4.5         
[91] cluster_2.1.4          timechange_0.2.0       ellipsis_0.3.2