Ecology in silico

Multilevel Modeling of Community Composition With Imperfect Detection

This is a guest post generously provided by Joe Mihaljevic.

A common goal of community ecology is to understand how and why species composition shifts across space. Techniques to determine which environmental covariates might drive such shifts typically rely on ordination of community data to reduce dimensionality. These techniques include redundancy analysis (RDA), canonical correspondence analysis (CCA), and nonmetric multidimensional scaling (NMDS), each paired with permutation tests. However, as Jackson et al. (2012: Ecosphere) point out, these ordination techniques do not discern species-level covariate effects, making it difficult to attribute community-level shifts to species-level changes. Jackson et al. (2012) propose a hierarchical modeling framework as an alternative, which we extend in this post to correct for imperfect detection.

Multilevel models can estimate species-level random and fixed covariate effects to determine the relative contribution of environmental covariates to changing composition across space (Jackson et al. 2012). For presence/absence data with a single covariate, such models are often formulated as:

$$y_q \sim \text{Bernoulli}(\psi_q)$$

$$\text{logit}(\psi_q) = \alpha_{spp[q]} + \beta_{spp[q]} x_q$$

$$\alpha_{spp} \sim N(\mu_{int}, \sigma_{int}^2), \qquad \beta_{spp} \sim N(\mu_{slope}, \sigma_{slope}^2)$$

Here $y_q$ is the presence/absence of each species at each site ($q = 1, \dots, nm$, where $n$ is the number of species and $m$ the number of sites), $spp[q]$ maps observation $q$ to its species, and $x_q$ is the covariate value at the corresponding site. This model can be extended to incorporate multiple covariates.

We are interested in whether species respond differently to environmental gradients (e.g. elevation, temperature, precipitation). If so, we expect community composition to change along such gradients. Concretely, we are interested in whether $\sigma_{slope}^2$ for any covariate differs from zero.

Jackson et al. (2012) provide code for a maximum likelihood implementation of their model, using the R package lme4, with data on Southern Appalachian understory herbs. Here we present a simple extension of Jackson and colleagues' work, correcting for detection error with repeat surveys (i.e. multi-species occupancy modeling). Specifically, the above model can be changed slightly to:

$$y_q \sim \text{Binomial}(J, z_q \, p_{spp[q]})$$

$$z_q \sim \text{Bernoulli}(\psi_q)$$

$$\text{logit}(\psi_q) = \alpha_{spp[q]} + \beta_{spp[q]} x_q$$

Now $y_q$ is the number of times each species is observed at each site over $J$ surveys, $p_{spp[q]}$ is the species-specific probability of detection when the species is present, and $z_q$ is the ‘true’ occurrence of the species, a Bernoulli random variable with probability $\psi_q$.

To demonstrate the method, we simulate data for a 20-species community across 100 sites with 4 repeat surveys. We assume that three site-level environmental covariates were measured, two of which have effects that vary among species (i.e. random effects), and one of which has a consistent effect across all species (i.e. a fixed effect). We also assume that species-specific detection probabilities vary, but are independent of the environmental covariates.

################################################
# Simulate data
################################################

Nsite <- 100
Ncov <- 3
Nspecies <- 20
J <- 4

# species-specific intercepts:
alpha <- rnorm(Nspecies, 0, 1)

# covariate values
Xcov <- matrix(rnorm(Nsite*Ncov, 0, 2),
               nrow=Nsite, ncol=Ncov)

# I'll assume 2 of the 3 covariates have effects that vary among species
Beta <- array(c(rnorm(Nspecies, 0, 2),
                rnorm(Nspecies, -1, 1),
                rep(1, Nspecies)
                ),
              dim=c(Nspecies, Ncov)
              )

# species-specific detection probs
p0 <- plogis(rnorm(Nspecies, 1, 0.5))
p0

#### Occupancy states ####
Yobs <- array(0, dim = c(Nspecies, Nsite)) # Simulated observations

for(n in 1:Nspecies){
  for(k in 1:Nsite){
    lpsi <- alpha[n] + Beta[n, ] %*% Xcov[k, ] # Covariate effects on occurrence 
    psi <- 1/(1+exp(-lpsi)) #anti-logit

    z <- rbinom(1, 1, psi) # True Occupancy
    Yobs[n, k] <- rbinom(1, J, p0[n] * z) # Observed Occupancy
  }
}

################################################
# Format data for model
################################################
# X needs the site covariates repeated for each species (long format)
X <- Xcov[rep(1:Nsite, times = Nspecies), ]

# Species
Species <- rep(c(1:Nspecies), each=Nsite)

# Observations/data: stack each species' row of Yobs into one long vector
Y <- as.vector(t(Yobs))

# All sites surveyed same # times:
J <- rep(J, times=Nspecies*Nsite)

# Number of total observations
Nobs <- Nspecies*Nsite

We fit the following model in JAGS, with vague priors.

model {
  # Priors
  psi.mean ~ dbeta(1,1)
  p.detect.mean ~ dbeta(1,1)

  sd.psi ~ dunif(0,10)
  psi.tau <- pow(sd.psi, -2)

  sd.p.detect ~ dunif(0,10)
  p.detect.tau <- pow(sd.p.detect, -2)

  for(i in 1:Nspecies){
    alpha[i] ~ dnorm(logit(psi.mean), psi.tau)T(-12,12)
    lp.detect[i] ~ dnorm(logit(p.detect.mean), p.detect.tau)T(-12,12)
    p.detect[i] <- exp(lp.detect[i]) / (1 + exp(lp.detect[i]))
  }

  for(j in 1:Ncov){
    mean.beta[j] ~ dnorm(0, 0.01)
    sd.beta[j] ~ dunif(0, 10)
    tau.beta[j] <- pow(sd.beta[j] + 0.001, -2) # small constant keeps precision finite if sd.beta ~ 0
    for(i in 1:Nspecies){
      betas[i,j] ~ dnorm(mean.beta[j], tau.beta[j])
    }
  }

  # Likelihood
  for(i in 1:Nobs){
    logit(psi[i]) <- alpha[Species[i]] + inprod(betas[Species[i],], X[i, ])
    z[i] ~ dbern(psi[i])
    Y[i] ~ dbinom(z[i] * p.detect[Species[i]], J[i])
  }
}
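The fitting call isn't shown here; a minimal sketch with runjags might look like the following (the filename msom.txt and the initialization scheme are assumptions, not part of the original analysis):

# minimal fitting sketch: assumes the model statement above has been
# saved to "msom.txt" (a hypothetical filename)
library(runjags)

jags_data <- list(Y=Y, X=X, J=J, Species=Species,
                  Nspecies=Nspecies, Ncov=Ncov, Nobs=Nobs)

# start every z at 1 so initial likelihoods are positive
# (z must be 1 wherever Y > 0)
inits <- function(chain) list(z = rep(1, Nobs))

fit <- run.jags("msom.txt",
                monitor=c("alpha", "betas", "p.detect",
                          "mean.beta", "sd.beta"),
                data=jags_data, inits=inits, n.chains=3,
                method="parallel", adapt=1000, burnin=5000, sample=5000)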

Using information theory, specifically the Watanabe-Akaike information criterion (WAIC), we compared this model, which assumes all covariates have random effects, to all combinations of models varying whether each covariate has a fixed or random effect. See this GitHub repository for all model statements and code.
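Computing WAIC requires posterior draws of the pointwise log-likelihood. Here's a minimal sketch, assuming ll is an iterations-by-observations matrix of log-likelihood values evaluated at each posterior draw (ll is not an object created above; you would monitor or compute it yourself):

# WAIC sketch: ll is an (iterations x observations) matrix of pointwise
# log-likelihood values computed from the posterior draws
waic <- function(ll){
  lppd <- sum(log(colMeans(exp(ll))))  # log pointwise predictive density
  p_waic <- sum(apply(ll, 2, var))     # effective number of parameters
  -2 * (lppd - p_waic)                 # deviance scale: lower is better
}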

A model that assumes all covariates have random effects and the data-generating model, in which only covariates 1 and 2 have random effects, performed best, and the two were indistinguishable from one another.

This result makes sense because the model with all random effects is able to recover species-specific responses to site-level covariates very well.

However, under this model the 95% HDI of $\sigma_{slope}$ for covariate 3 includes zero, indicating that this covariate effectively has a fixed, rather than random, effect among species.
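This is easy to check with coda; a sketch, assuming fit is a fitted runjags object with sd.beta monitored (as in the fitting sketch above):

library(coda)
library(runjags)
# 95% HDI for the among-species SD of covariate 3's effect
sd3 <- combine.mcmc(fit, vars="sd.beta[3]")  # pool the chains
HPDinterval(sd3, prob=0.95)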

Thus, we could conclude that the first two covariates have random effects, while the third covariate has a fixed effect. This means that composition shifts along gradients of covariates 1 and 2. We can visualize the relative contribution of covariate 1 and 2’s random effects to composition using ordination, as discussed in Jackson et al. (2012). To do this, we compare the linear predictor (i.e. $\text{logit}(\psi_q)$) of the best model that includes only significant random effects to that of a model without any random effects.

The code to extract linear predictors and ordinate the community is provided on GitHub.
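As a rough sketch of the idea, you can compute the posterior-mean linear predictor for every site-species combination and ordinate sites in that space; alpha_hat (length Nspecies) and beta_hat (Nspecies x Ncov) are assumed posterior means, not objects defined above:

# sketch: ordinate sites on their posterior-mean linear predictors
# alpha_hat and beta_hat are hypothetical posterior-mean objects
lp <- matrix(NA, nrow=Nsite, ncol=Nspecies)
for (s in 1:Nspecies){
  lp[, s] <- alpha_hat[s] + Xcov %*% beta_hat[s, ]
}
ord <- prcomp(lp, scale.=TRUE)  # PCA of sites in linear-predictor space
plot(ord$x[, 1:2], xlab="PC1", ylab="PC2")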

Shiny Variance Inflation Factor Sandbox

In multiple regression, strong correlation among covariates increases the uncertainty, or variance, in estimated regression coefficients. Variance inflation factors (VIFs) are one common diagnostic for problematic covariate collinearity. In teaching students about VIFs, it may be useful to have interactive supplementary material so that they can manipulate the factors affecting the uncertainty in slope estimates in real time.
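For reference, the VIF for covariate $j$ is

$$VIF_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ comes from regressing covariate $j$ on all of the other covariates. With only two covariates, both VIFs reduce to $1 / (1 - r^2)$, where $r$ is the correlation between them.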

Here’s a little R shiny app that could be used as a starting point for such a supplement. For simplicity it currently includes only two covariates, and it gives the user control over the covariate $R^2$ value, the residual variance, the variance of both covariates, and the sample size.
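Under the hood, the app back-calculates the covariance of the two covariates from the user-specified $R^2$ as $cov(X_1, X_2) = r \sigma_1 \sigma_2 = \sqrt{R^2 \, var(X_1) \, var(X_2)}$, which fills in the off-diagonal elements of the covariance matrix handed to rmvnorm.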

As usual, the file server.R defines what actually gets computed in R:

server.R
# interactive variance inflation factor module
library(shiny)
library(car)
library(mvtnorm)
library(gridExtra)
library(ggplot2)

shinyServer(function(input, output){
  output$plot <- renderPlot({
    r2 <- input$r2
    var_error <- input$resid_var
    var_x1 <- input$var_x1
    var_x2 <- input$var_x2
    beta <- c(0, 1, 1)

    # users enter R^2, this backcalculates covariance
    cov_x1x2 <- sqrt(var_x1 * var_x2 * r2)
    sigma <- matrix(c(var_x1, cov_x1x2,
                      cov_x1x2, var_x2),
                    nrow=2)

    X <- array(1, dim=c(input$n, 3))
    X[, c(2, 3)] <- rmvnorm(n=input$n, sigma=sigma, method="chol")
    mu <- X %*% beta
    epsilon <- rnorm(input$n, 0, sqrt(var_error))
    y <- mu + epsilon

    X1 <- X[, 2]
    X2 <- X[, 3]

    model <- lm(y ~ 1 + X1 + X2)

    # thanks to Ben Bolker for the next 3 lines
    # https://groups.google.com/forum/#!topic/ggplot2/4-l3dUT-h2I
    l <- list(vif = round(vif(model)[1], digits=2))
    eq <- substitute(italic(VIF) == vif, l)
    eqstr <- as.character(as.expression(eq))

    l2 <- list(vif = round(vif(model)[2], digits=2))
    eq2 <- substitute(italic(VIF) == vif, l2)
    eqstr2 <- as.character(as.expression(eq2))

    # plot 1: parameter recovery
    df3 <- data.frame(truth = beta,
                      lci = confint(model)[, 1],
                      uci = confint(model)[, 2],
                      est = coef(model), y=0:(length(beta) - 1))

    cip <- ggplot(df3, aes(x=truth, y=y)) +
      geom_point(col="blue", size=5, alpha=.5) +
      theme_bw() +
      geom_segment(aes(x=lci, y=y, xend=uci, yend=y)) +
      geom_point(aes(x=est, y=y), size=3, pch=1) +
      ggtitle("Coefficient recovery") +
      xlab("Value") +
      ylab(expression(beta)) +
      scale_y_continuous(breaks = c(0, 1, 2)) +
      theme(axis.title.y = element_text(angle = 0, hjust = 0))  +
      annotate(geom="text", x=0, y=2, label=eqstr,
               parse=TRUE, vjust=1.5, hjust=-1) +
      annotate(geom="text", x=0, y=1, label=eqstr2,
               parse=TRUE, vjust=1.5, hjust=-1)

    # plot 2: X1 & X2 correlation plot
    xcorp <- ggplot(data.frame(X1, X2), aes(x=X1, y=X2)) +
      geom_point(pch=1) +
      theme_bw() +
      ggtitle("Covariate scatterplot")

    grid.arrange(cip, xcorp,
                 ncol=2)

  })
})

The file ui.R defines the user interface:

ui.R
library(shiny)

shinyUI(pageWithSidebar(
  headerPanel("Variance inflation factor sandbox"),
  sidebarPanel(
    # server.R references input$n, so the UI needs a control named "n"
    sliderInput("n", "Sample size:",
                min=10, max=500, value=100),
    sliderInput("r2", "Covariate R-squared:",
                min=0, max=.99, value=0),
    sliderInput("resid_var", "Residual variance:",
                min=1, max=10, value=1),
    sliderInput("var_x1", "Variance of X1:",
                min=1, max=10, value=1),
    sliderInput("var_x2", "Variance of X2:",
                min=1, max=10, value=1)
    ),
  mainPanel(plotOutput("plot")
            )
  ))

Everything’s ready to fork or clone on GitHub.
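To run it locally, something like this should work (the directory name is hypothetical; point runApp at wherever server.R and ui.R live):

# assuming server.R and ui.R sit together in a directory called "vif-app"
library(shiny)
runApp("vif-app")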

Stochastic Search Variable Selection in JAGS

Stochastic search variable selection (SSVS) identifies promising subsets of multiple regression covariates via Gibbs sampling (George and McCulloch 1993). Here’s a short SSVS demo with JAGS and R.

Assume we have a multiple regression problem:

$$\boldsymbol{Y} \sim N(\alpha + \boldsymbol{X} \boldsymbol{\beta}, \sigma^2_Y \boldsymbol{I})$$

We suspect only a subset of the elements of $\boldsymbol{\beta}$ are non-zero, i.e. some of the covariates have no effect.

Assume each $\beta_i$ arises from one of two normal mixture components, depending on a latent indicator variable $\gamma_i$:

$$\beta_i \mid \gamma_i \sim (1 - \gamma_i) \, N(0, \tau_i^2) + \gamma_i \, N(0, c_i^2 \tau_i^2)$$

$\tau_i$ is positive but small, such that $\beta_i$ is close to zero when $\gamma_i = 0$. $c_i$ is large enough to allow reasonable deviations from zero when $\gamma_i = 1$. The prior probability that covariate $i$ has a nonzero effect is $Pr(\gamma_i = 1) = p_i$. Important subtleties about priors are covered in George and McCulloch (1993) and elsewhere.

Let’s simulate a dataset in which some covariates have strong effects on the linear predictor and others don’t: suppose we have 20 candidate covariates, but only 60 observations.

ncov <- 20         # number of candidate covariates
nobs <- 60         # number of observations
var_beta <- .004   # variance of "effectively zero" coefficients
c <- 1000          # variance multiplier for nonzero coefficients
p_inclusion <- .5  # prior probability that a covariate has a nonzero effect
sigma_y <- 1       # residual standard deviation

# generate covariates
X <- array(dim=c(nobs, ncov))
for (i in 1:ncov){
  X[, i] <- rnorm(nobs, 0, 1)
}

included <- rbinom(ncov, 1, p_inclusion)
coefs <- rnorm(n=ncov,
               mean=0,
               sd=ifelse(included==1,
                         sqrt(var_beta * c),
                         sqrt(var_beta)
                         )
               )
coefs <- sort(coefs) # sort for plotting; note this decouples coefs from 'included'
Y <- rnorm(nobs, mean=X %*% coefs, sd=sigma_y)

Specifying the model:

cat("model{
  # typical regression priors
  alpha ~ dnorm(0, .001)
  sd_y ~ dunif(0, 100)
  tau_y <- pow(sd_y, -2)

  # ssvs priors
  sd_bet ~ dunif(0, 100)
  tau_in <- pow(sd_bet, -2)
  tau[1] <- tau_in            # coef effectively zero
  tau[2] <- tau_in / 1000     # nonzero coef
  p_ind[1] <- 1/2
  p_ind[2] <- 1 - p_ind[1]

  for (j in 1:ncov){
    indA[j] ~ dcat(p_ind[]) # returns 1 or 2
    gamma[j] <- indA[j] - 1   # returns 0 or 1
    beta[j] ~ dnorm(0, tau[indA[j]])
  }

  # likelihood
  for (i in 1:nobs){
    Y[i] ~ dnorm(alpha + inprod(X[i, ], beta[]), tau_y)
  }
}
    "
    , file="ssvs.txt")

Fitting the model and assessing performance against known values:

require(gridExtra)
require(runjags)
require(ggmcmc)
library(coda)
dat <- list(Y=Y, X=X, nobs=nobs, ncov=ncov)
vars <- c("alpha", "sd_bet", "gamma", "beta", "tau_in", "sd_y")
out2 <- run.jags("ssvs.txt", vars, data=dat, n.chains=3,
                 method="parallel", adapt=5000, burnin=5000)
outdf <- ggs(as.mcmc.list(out2))

# extract E(gamma_i): the posterior inclusion probabilities
# (select rows by name rather than relying on row positions)
sumstats <- summary(as.mcmc.list(out2))$statistics
probs <- sumstats[grep("^gamma\\[", rownames(sumstats)), 1]

labels <- paste0("beta[", 1:ncov, "]")
xdf <- data.frame(Parameter = labels, value = 1:ncov)
p1 <- ggs_caterpillar(outdf, "beta", X=xdf) +
  theme_classic() +
  geom_vline(xintercept = 0, linetype = "longdash") +
  geom_point(data=data.frame(coefs, pos = 1:ncov),
              aes(x=coefs, y=pos), size=5, col="green4", alpha=.7) +
  xlab("Value") +
  ylab("Coefficient") +
  geom_hline(yintercept = 1:ncov, alpha=.05) +
  scale_y_continuous(breaks=seq(0, ncov, 1))

df <- data.frame(probs=probs, coefs = coefs)
p2 <- ggplot(df, aes(x=abs(coefs), y=probs)) +
  geom_point(size=5, alpha=.7) +
  theme_classic() +
  xlab("Absolute value of true coefficient") +
  ylab("Posterior probability of non-zeroness")

grid.arrange(p1, p2, ncol=2)

On the left, green points indicate the true coefficient values, and black segments show the posterior credible intervals. The right plot shows the relationship between the true magnitude of each effect and the posterior probability that its coefficient is non-zero, $E(\gamma_i \mid \boldsymbol{Y})$.
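A common way to turn these inclusion probabilities into an actual variable selection is the median probability model: retain the covariates with posterior inclusion probability above 0.5.

# median probability model: keep covariates with P(gamma_i = 1 | Y) > 0.5
selected <- which(probs > 0.5)
selected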