IFAA

IFAA is a robust approach to make inference on the association of covariates with the absolute abundance (AA) of microbiome in an ecosystem. It can be also directly applied to relative abundance (RA) data to make inference on AA because the ratio of two RA is equal ratio of their AA. This algorithm can estimate and test the associations of interest while adjusting for potential confounders. High-dimensional covariates are handled with regularization. The estimates of this method have easy interpretation like a typical regression analysis. High-dimensional covariates are handled with regularization and it is implemented by parallel computing. False discovery rate is automatically controlled by this approach. Zeros do not need to be imputed by a positive value for the analysis. The IFAA package also offers the ‘MZILN’ function for estimating and testing associations of abundance ratios with covariates.

The following mixed effect model is used for the association: \[ \log(\mathcal{Y}_i^k)|\mathcal{Y}_i^k>0=\beta^{0k}+X_i^T\beta^k+W_i^T\gamma^k+Z_i^Tb_i+\epsilon_i^k,\hspace{0.2cm}k=1,...,K+1, \] where

The challenge in microbiome analysis is that we can not oberve \(\mathcal{Y}_i^k\). What is observed is its small proportion: \(Y_i^k=C_i\mathcal{Y}^k_i\) where \(C_i\) is an unknown number between 0 and 1 that denote the observed proportion. The IFAA method successfuly addressed this challenge.

Package installation

To install, type the following command in R console:

install.packages("IFAA", repos = "http://cran.us.r-project.org")

The package could be also installed from GitHub using the following code:

require(devtools)
devtools::install_github("gitlzg/IFAA")

Input for IFAA() function

Most of the time, users just need to feed the first five inputs to the function: MicrobData, CovData, linkIDname, testCov and ctrlCov. All other inputs can just take their default values. Below are all the inputs of the functions

Output for IFAA() function

The estimation results are saved in the following lists:

The covariates data used in the analyses including testCov and ctrlCov is saved in the following object:

Examples

The example datasets dataM and dataC are included in the package. They could be accessed by:

library(IFAA)

data(dataM)
dim(dataM)
#> [1] 40 61
dataM[1:5, 1:8]
#>   id rawCount1 rawCount2 rawCount3 rawCount4 rawCount5 rawCount6 rawCount7
#> 1  1         4        49         2         0       360       222         4
#> 2  2         0        20        14         0        86       211         5
#> 3  3         3         0         3         7         0        57         0
#> 4  4         9        18         5        31        42        58         8
#> 5  5         0         2         1        19        15        67         6

data(dataC)
dim(dataC)
#> [1] 40  4
dataC[1:3, ]
#>   id        v1        v2        v3
#> 1  1  58.06969 -49.90376 -15.30643
#> 2  2  25.96522 -68.58894 -23.10992
#> 3  3 193.71625 124.40186 119.56747

Both the microbiome data dataM and the covariates data dataC contain 40 samples (i.e., 40 rows).

Next we analyze the data to test the association between microbiome and the variable "v1" while adjusting for the variables (potential confounders) "v2" and "v3".

results <- IFAA(MicrobData = dataM,
                CovData = dataC,
                linkIDname = "id",
                testCov = c("v1"),
                ctrlCov = c("v2","v3"),
                fdrRate = 0.15)
#> Data dimensions (after removing missing data if any):
#> 40 samples
#> 60 taxa/OTU/ASV
#> 1 testCov variables in the analysis
#> These are the testCov variables:
#> v1
#> 2 ctrlCov variables in the analysis
#> These are the ctrlCov variables:
#> v2, v3
#> 0 binary covariates in the analysis
#> 25.71 percent of microbiome sequencing reads are zero
#> Start Phase 1 analysis
#> 6 parallel jobs are registered for analyzing 40 reference taxa in Phase 1
#> 33 percent of phase 1 analysis has been done
#> 6 parallel jobs are registered for analyzing 20 reference taxa in Phase 1
#> 67 percent of phase 1 analysis has been done
#> 6 parallel jobs are registered for analyzing 20 reference taxa in Phase 1
#> 100 percent of phase 1 analysis has been done
#> Phase 1 analysis used 1.02 minutes
#> Start Phase 2 parameter estimation
#> Start estimation for the 1th final reference taxon
#> Estimation done for the 1th final reference taxon and it took 0.01 minutes
#> Start estimation for the 2th final reference taxon
#> Estimation done for the 2th final reference taxon and it took 0.009 minutes
#> Phase 2 parameter estimation done and took 0.019 minutes.
#> The entire analysis took 1.04 minutes

In this example, we are only interested in testing the associations with "v1" which is why testCov=c("v1"). The variables "v2" and "v3" are adjusted as potential confounders in the analyses. The final analysis results are saved in the list sig_results:

results$sig_results
#> $v1
#>              estimate      SE est     CI low      CI up  adj p-value
#> rawCount18 0.02223674 0.005107266 0.01222649 0.03224698 1.201099e-03
#> rawCount36 0.02843518 0.005403042 0.01784522 0.03902514 2.926016e-05
#> rawCount41 0.02598367 0.005012267 0.01615963 0.03580772 2.926016e-05

The results found three taxa "rawCount18", "rawCount36", "rawCount41" associated with "v1" while adjusting for "v2" and "v3". The regression coefficients and their 95% confidence intervals are provided. These coefficients correspond to \(\beta^k\) in the model equation.

The interpretation is that

All the analyzed covariates including testCov and ctrlCov can be extracted using the object covariatesData. The covariates data of the first 10 subjects can extracted as follows:

results$covariatesData[1:10,]
#>    id         v1         v2          v3
#> 1   1  58.069691  -49.90376  -15.306430
#> 2   2  25.965216  -68.58894  -23.109922
#> 3   3 193.716251  124.40186  119.567468
#> 4   4  72.156467  -98.48536    2.877972
#> 5   5  98.062712   23.55358  -79.893161
#> 6   6  83.094848 -116.95821 -107.641285
#> 7   7   8.217154 -205.64480 -139.958481
#> 8   8  36.169820   58.95708   26.890379
#> 9   9 152.786131  162.60935  138.731954
#> 10 10  41.621790   65.15427   59.974310

MZILN() function

The IFAA package can also implement the Multivariate Zero-Inflated Logistic Normal (MZILN) regression model for estimating and testing the association of abundance ratios with covariates. The MZILN() function estimates and tests the associations of user-specified abundance ratios with covariates. When the denominator taxon of the ratio is independent of the covariates, ‘MZILN()’ should generate similar results as ‘IFAA()’. The regression model of ‘MZILN()’ can be expressed as follows: \[ \log\bigg(\frac{\mathcal{Y}_i^k}{\mathcal{Y}_i^{K+1}}\bigg)|\mathcal{Y}_i^k>0,\mathcal{Y}_i^{K+1}>0=\alpha^{0k}+\mathcal{X}_i^T\alpha^k+\epsilon_i^k,\hspace{0.2cm}k=1,...,K, \] where

Input for MZILN() function

Most of the time, users just feed the first six inputs to the function: MicrobData, CovData, linkIDname, targetTaxa, refTaxa and allCov. All other inputs can just take their default values. All the inputs for ‘MZILN()’ are:

Output for MZILN() function

The estimation results are saved in the following lists:

All covariates data used in the analysis is saved in the following object:

Examples

We use the same example data The example dataset as that for illustrating the IFAA function. dataM and dataC are included in the package. They could be accessed by:

data(dataM)
dim(dataM)
#> [1] 40 61
dataM[1:5, 1:8]
#>   id rawCount1 rawCount2 rawCount3 rawCount4 rawCount5 rawCount6 rawCount7
#> 1  1         4        49         2         0       360       222         4
#> 2  2         0        20        14         0        86       211         5
#> 3  3         3         0         3         7         0        57         0
#> 4  4         9        18         5        31        42        58         8
#> 5  5         0         2         1        19        15        67         6

data(dataC)
dim(dataC)
#> [1] 40  4
dataC[1:3, ]
#>   id        v1        v2        v3
#> 1  1  58.06969 -49.90376 -15.30643
#> 2  2  25.96522 -68.58894 -23.10992
#> 3  3 193.71625 124.40186 119.56747

Both the microbiome data dataM and the covariates data dataC contain 40 samples (i.e., 40 rows).

Next we analyze the data to test the associations between the ratio “rawCount18/rawCount11” and all the three variables "v1", "v2" and "v3" in a multivariate model where all "v1", "v2" and "v3" are independent variables simultaneously.

results <- MZILN(MicrobData = dataM,
                CovData = dataC,
                 linkIDname = "id",
                 targetTaxa = "rawCount18",
                 refTaxa=c("rawCount11"),
                 allCov=c("v1","v2","v3"),
                 fdrRate=0.15)
#> Data dimensions (after removing missing data if any):
#> 40 samples
#> 60 taxa/OTU/ASV
#> 3 testCov variables in the analysis
#> These are the testCov variables:
#> v1, v2, v3
#> 0 ctrlCov variables in the analysis
#> 0 binary covariates in the analysis
#> 25.71 percent of microbiome sequencing reads are zero
#> Estimation done for the 1th denominator taxon: rawCount11 and it took 0.01 minutes
#> The entire analysis took 0.01 minutes

The final analysis results are saved in the list targettaxa_result_list:

results$targettaxa_result_list
#> $rawCount11
#> $rawCount11$v1
#>              estimate      SE est     CI low      CI up adj p-value
#> rawCount18 0.02310369 0.005570789 0.01218495 0.03402244 0.003085391
#> 
#> $rawCount11$v2
#>               estimate      SE est       CI low       CI up adj p-value
#> rawCount18 0.002604117 0.003173695 -0.003616325 0.008824558           1
#> 
#> $rawCount11$v3
#>                estimate      SE est      CI low        CI up adj p-value
#> rawCount18 -0.006250528 0.002814316 -0.01176659 -0.000734468   0.8055964

The regression coefficients and their 95% confidence intervals are provided. These coefficients correspond to \(\alpha^k\) in the model equation, and can be interpreted as the associations between the covariates and log-ratio of "rawCount18" over ‘“rawCount11”’.

The interpretation for the results is that

We can also extract all the ratios (with "rawCount11" being the denominator taxon) that are significantly associated with any of the covariates as follows:

results$sig_results
#> $rawCount11
#> $rawCount11$v1
#>              estimate      SE est     CI low      CI up  adj p-value
#> rawCount18 0.02310369 0.005570789 0.01218495 0.03402244 0.0030853909
#> rawCount36 0.02965713 0.005894748 0.01810342 0.04121084 0.0001341658
#> rawCount41 0.02562728 0.005462932 0.01491993 0.03633463 0.0003737775

The interpretation for the results is that

All covariates used in the analysis can be extracted using the object covariatesData. The covariates data of the first 10 subjects are extracted as follows:

results$covariatesData[1:10,]
#>    id         v1         v2          v3
#> 1   1  58.069691  -49.90376  -15.306430
#> 2   2  25.965216  -68.58894  -23.109922
#> 3   3 193.716251  124.40186  119.567468
#> 4   4  72.156467  -98.48536    2.877972
#> 5   5  98.062712   23.55358  -79.893161
#> 6   6  83.094848 -116.95821 -107.641285
#> 7   7   8.217154 -205.64480 -139.958481
#> 8   8  36.169820   58.95708   26.890379
#> 9   9 152.786131  162.60935  138.731954
#> 10 10  41.621790   65.15427   59.974310