Ecology in silico

Stochastic Search Variable Selection in JAGS

Stochastic search variable selection (SSVS) identifies promising subsets of multiple regression covariates via Gibbs sampling (George and McCulloch 1993). Here’s a short SSVS demo with JAGS and R.

Assume we have a multiple regression problem:

$$\boldsymbol{Y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \sigma^2 \boldsymbol{I})$$

We suspect only a subset of the elements of $\boldsymbol{\beta}$ are non-zero, i.e. some of the covariates have no effect.

Assume each element of $\boldsymbol{\beta}$ arises from one of two normal mixture components, depending on a latent indicator variable $\gamma_i$:

$$\beta_i \mid \gamma_i \sim (1 - \gamma_i) \, N(0, \tau_i^2) + \gamma_i \, N(0, c_i^2 \tau_i^2)$$

Here $\tau_i$ is positive but small, so that $\beta_i$ is close to zero when $\gamma_i = 0$, and $c_i$ is large enough to allow reasonable deviations from zero when $\gamma_i = 1$. The prior probability that covariate $i$ has a nonzero effect is $Pr(\gamma_i = 1) = p_i$. Important subtleties about these priors are covered in George and McCulloch (1993) and elsewhere.
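
To make this concrete with the values used in the simulation below, the two mixture components imply very different prior standard deviations for $\beta_i$. A quick illustration (not part of the model fit):

var_beta <- 0.004    # tau_i^2
c <- 1000            # variance inflation for the nonzero component (c_i^2 above)
sqrt(var_beta)       # ~0.063: effectively-zero component (gamma_i = 0)
sqrt(var_beta * c)   # 2: nonzero component (gamma_i = 1)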

Let’s simulate a dataset in which some covariates have strong effects on the linear predictor and others don’t: suppose we have 20 candidate variables but only 60 observations.

ncov <- 20
nobs <- 60
var_beta <- .004
c <- 1000
p_inclusion <- .5
sigma_y <- 1

# generate covariates
X <- array(dim=c(nobs, ncov))
for (i in 1:ncov){
  X[, i] <- rnorm(nobs, 0, 1)
}

included <- rbinom(ncov, 1, p_inclusion)
coefs <- rnorm(n=ncov,
               mean=0,
               sd=ifelse(included==1,
                         sqrt(var_beta * c),
                         sqrt(var_beta)
                         )
               )
coefs <- sort(coefs)
Y <- rnorm(nobs, mean=X %*% coefs, sd=sigma_y)

Specifying the model:

cat("model{
  # typical regression priors
  alpha ~ dnorm(0, .001)
  sd_y ~ dunif(0, 100)
  tau_y <- pow(sd_y, -2)

  # ssvs priors
  sd_bet ~ dunif(0, 100)
  tau_in <- pow(sd_bet, -2)
  tau[1] <- tau_in            # coef effectively zero
  tau[2] <- tau_in / 1000     # nonzero coef
  p_ind[1] <- 1/2
  p_ind[2] <- 1 - p_ind[1]

  for (j in 1:ncov){
    indA[j] ~ dcat(p_ind[]) # returns 1 or 2
    gamma[j] <- indA[j] - 1   # returns 0 or 1
    beta[j] ~ dnorm(0, tau[indA[j]])
  }

  # likelihood
  for (i in 1:nobs){
    Y[i] ~ dnorm(alpha + X[i ,] %*% beta[], tau_y)
  }
}
    "
    , file="ssvs.txt")

Fitting the model and assessing performance against known values:

require(gridExtra)
require(runjags)
require(ggmcmc)
library(coda)
dat <- list(Y=Y, X=X, nobs=nobs, ncov=ncov)
vars <- c("alpha", "sd_bet", "gamma", "beta", "tau_in", "sd_y")
out2 <- run.jags("ssvs.txt", vars, data=dat, n.chains=3,
                 method="parallel", adapt=5000, burnin=5000)
outdf <- ggs(as.mcmc.list(out2))

# extract E(gamma_i) for posterior inclusion probabilities
post_sum <- summary(as.mcmc.list(out2))$statistics
probs <- post_sum[grep("gamma", rownames(post_sum)), 1]

labels <- rep(NA, ncov)
for (i in 1:ncov){
  labels[i] <- paste("beta[", i, "]", sep="")
}
xdf <- data.frame(Parameter = labels, value = 1:ncov)
p1 <- ggs_caterpillar(outdf, "beta", X=xdf) +
  theme_classic() +
  geom_vline(xintercept = 0, linetype = "longdash") +
  geom_point(data=data.frame(coefs, pos = 1:ncov),
              aes(x=coefs, y=pos), size=5, col="green4", alpha=.7) +
  xlab("Value") +
  ylab("Coefficient") +
  geom_hline(yintercept = 1:ncov, alpha=.05) +
  scale_y_continuous(breaks=seq(0, ncov, 1))

df <- data.frame(probs=probs, coefs = coefs)
p2 <- ggplot(df, aes(x=abs(coefs), y=probs)) +
  geom_point(size=5, alpha=.7) +
  theme_classic() +
  xlab("Absolute value of true coefficient") +
  ylab("Posterior probability of non-zeroness")

grid.arrange(p1, p2, ncol=2)

On the left, green points mark the true coefficient values, and the black lines show posterior credible intervals. The plot on the right shows the relationship between the true magnitude of each effect and the posterior probability that its coefficient is non-zero, $E(\gamma_i \mid \boldsymbol{Y})$.
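
If a hard classification is needed, one simple rule is to flag covariates whose posterior inclusion probability exceeds 0.5 and compare them against the true coefficients. A quick sketch using the objects created above:

# flag covariates with posterior inclusion probability > 0.5
flagged <- probs > 0.5
data.frame(true_coef = round(coefs, 2),
           inclusion_prob = round(probs, 2),
           flagged = flagged)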

Better Living Through Zero-one Inflated Beta Regression

Dealing with proportion data on the interval $[0, 1]$ is tricky. I realized this while trying to explain variation in vegetation cover. Unfortunately this is a true proportion, and can’t be made into a binary response. Further, true 0’s and 1’s rule out beta regression. You could arcsine square root transform the data (but shouldn’t; Warton and Hui 2011). Enter zero-and-one inflated beta regression.

The zero-and-one-inflated beta distribution facilitates modeling fractional or proportional data that contain both 0’s and 1’s (Ospina and Ferrari 2010 - highly recommended). The general idea is to model the response variable (call it $y$) as a mixture of Bernoulli and beta distributions, from which the true 0’s and 1’s and the values between 0 and 1 are generated, respectively. The probability density function is

$$
f_{BEINF}(y; \alpha, \gamma, \mu, \phi) =
\begin{cases}
\alpha (1 - \gamma) & \text{if } y = 0 \\
\alpha \gamma & \text{if } y = 1 \\
(1 - \alpha) \, f(y; \mu, \phi) & \text{if } y \in (0, 1)
\end{cases}
$$

where $0 < \alpha, \gamma, \mu < 1$ and $\phi>0$. $f(y; \mu, \phi)$ is the probability density function for the beta distribution, parameterized in terms of its mean $\mu$ and precision $\phi$:

$$
f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu \phi) \, \Gamma((1 - \mu)\phi)} \, y^{\mu \phi - 1} (1 - y)^{(1 - \mu)\phi - 1}, \qquad 0 < y < 1
$$

$\alpha$ is a mixture parameter that determines the extent to which the Bernoulli or beta component dominates the pdf. $\gamma$ determines the probability that $y=1$ if it comes from the Bernoulli component. $\mu$ and $\phi$ are the expected value and the precision for the beta component, which is usually parameterized in terms of $p$ and $q$ (Ferrari and Cribari-Neto 2004). $\mu = \frac{p}{p + q}$, and $\phi=p+q$.
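
The mapping between the two beta parameterizations is easy to check by simulation. A quick sketch with illustrative values (the _ex names are just for this check):

# convert a mean/precision pair to the usual shape parameters and check the mean
mu_ex <- 0.3
phi_ex <- 5
p_ex <- mu_ex * phi_ex          # shape1
q_ex <- (1 - mu_ex) * phi_ex    # shape2
mean(rbeta(1e5, p_ex, q_ex))    # should be close to mu_ex = 0.3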

Although ecologists often deal with proportion data, I haven’t found any examples of zero-and-one-inflated beta regression in the ecological literature. The closest thing I’ve found is Nishii and Tanaka (2012), who take a different approach in which values between 0 and 1 are modeled as logit-normal.

Here’s a quick demo in JAGS with simulated data. For simplicity, I’ll assume that 1) there is one covariate that increases the expected value at the same rate for both the Bernoulli and beta components, so that $\mu = \gamma$, and 2) the Bernoulli component dominates at extreme values of the covariate, where the expected value is near 0 or 1.

n <- 100
x <- runif(n, -3, 3)
a0 <- -3
a1 <- 0
a2 <- 1
antilogit <- function(x){
  exp(x) / (1 + exp(x))
}
alpha <- antilogit(a0 + a1 * x + a2 * x^2)

b0 <- 0
b1 <- 2
mu <- antilogit(b0 + b1 * x)
phi <- 5
p <- mu * phi
q <- phi - mu * phi

# mixture: draw component membership, then the response
y.discrete <- rbinom(n, 1, alpha)  # 1 = Bernoulli (0/1) component, 0 = beta component
y <- rep(NA, n)
for (i in 1:n){
  if (y.discrete[i] == 1){
    y[i] <- rbinom(1, 1, mu[i])   # discrete component: exact 0 or 1
  } else {
    y[i] <- rbeta(1, p[i], q[i])  # continuous component: value in (0, 1)
  }
}

# split the data into discrete and continuous components
y.d <- ifelse(y == 1 | y == 0, y, NA)
y.discrete <- ifelse(is.na(y.d), 0, 1)
y.d <- y.d[!is.na(y.d)]
x.d <- x[y.discrete == 1]
n.discrete <- length(y.d)

which.cont <- which(y < 1 & y > 0)
y.c <- ifelse(y < 1 & y > 0, y, NA)
y.c <- y.c[!is.na(y.c)]
n.cont <- length(y.c)
x.c <- x[which.cont]

Now we can specify our model in JAGS, following the factorization of the likelihood given by Ospina and Ferrari (2010), estimate our parameters, and see how well the model performs.
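
For reference, here is a sketch of that factorization in the notation above (my transcription, with observation-level parameters to match the model below; see Ospina and Ferrari 2010 for the general case): the likelihood splits into a term for whether each observation is discrete, a Bernoulli term for the discrete observations, and a beta term for the continuous ones.

$$
L = \prod_{i=1}^{n} \alpha_i^{I[y_i \in \{0, 1\}]} (1 - \alpha_i)^{I[y_i \in (0, 1)]}
\times \prod_{i:\, y_i \in \{0, 1\}} \gamma_i^{y_i} (1 - \gamma_i)^{1 - y_i}
\times \prod_{i:\, y_i \in (0, 1)} f(y_i; \mu_i, \phi)
$$

The three products correspond to the three loops in the JAGS model.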

# write model
cat(
  "
  model{
  # priors
  a0 ~ dnorm(0, .001)
  a1 ~ dnorm(0, .001)
  a2 ~ dnorm(0, .001)
  b0 ~ dnorm(0, .001)
  b1 ~ dnorm(0, .001)
  t0 ~ dnorm(0, .01)
  tau <- exp(t0)   # beta precision (phi in the notation above)
  
  # likelihood for alpha
  for (i in 1:n){
    logit(alpha[i]) <- a0 + a1 * x[i] + a2 * x[i] ^ 2
    y.discrete[i] ~ dbern(alpha[i])
  }
  
  # likelihood for gamma
  for (i in 1:n.discrete){
    y.d[i] ~ dbern(mu[i])
    logit(mu[i]) <- b0 + b1 * x.d[i]
  }
  
  # likelihood for mu and tau
  for (i in 1:n.cont){
    y.c[i] ~ dbeta(p[i], q[i])
    p[i] <- mu2[i] * tau
    q[i] <- (1 - mu2[i]) * tau
    logit(mu2[i]) <- b0 + b1 * x.c[i]
  }
  }  
  ", file="beinf.txt"
)
require(rjags)
jd <- list(x=x, y.d=y.d, y.c=y.c, y.discrete = y.discrete,
           n.discrete=n.discrete, n.cont = n.cont,
           x.d = x.d, x.c=x.c, n=n)
mod <- jags.model("beinf.txt", data= jd, n.chains=3, n.adapt=1000)
out <- coda.samples(mod, c("a0", "a1", "a2", "b0", "b1", "tau"),
                    n.iter=6000, thin=6)
# plot(out) check for convergence

require(ggmcmc)
require(reshape2)  # provides melt(), used below
ggd <- ggs(out)
a0.post <- subset(ggd, Parameter == "a0")$value
a1.post <- subset(ggd, Parameter == "a1")$value
a2.post <- subset(ggd, Parameter == "a2")$value
n.stored <- length(a2.post)
P.discrete <- array(dim=c(n.stored, n))
for (i in 1:n){
  P.discrete[, i] <- antilogit(a0.post + a1.post * x[i] + a2.post * x[i] ^ 2)
}
pdd <- melt(P.discrete, varnames = c("iteration", "site"), value.name = "Pr.discrete")
pdd$x <- x[pdd$site]
b0.post <- subset(ggd, Parameter == "b0")$value
b1.post <- subset(ggd, Parameter == "b1")$value
expect <- array(dim=c(n.stored, n))
for (i in 1:n){
  expect[, i] <- antilogit(b0.post + b1.post * x[i])
}
exd <- melt(expect, varnames=c("iteration", "site"), value.name = "Expectation")
exd$x <- x[exd$site]
obsd <- data.frame(x=x, y=y,
                   component = factor(ifelse(y < 1 & y > 0, "Continuous", "Discrete")))
trued <- data.frame(x=x, mu=mu)

ggplot(pdd) +
  geom_line(aes(x=x, y=Pr.discrete, group=iteration), alpha=0.05, color="grey") +
  geom_line(aes(x=x, y=Expectation, group=iteration), data=exd, color="blue", alpha=0.05) +
  geom_point(aes(x=x, y=y, fill=component), data=obsd, size=3, color="black",
             position = position_jitter(width=0, height=.01), pch=21) +
  scale_fill_manual(values = c("red", "black")) +
  ylab("y") +
  geom_line(aes(x=x, y=mu), data=trued, color="green", linetype="dashed") +
  theme_bw()

Here, the posterior probability that $y$ comes from the discrete Bernoulli component is shown in grey, and the posterior expected value shared by the Bernoulli and beta components across values of the covariate is shown in blue. The dashed green line shows the true expected value used to generate the data. Finally, the observed data are shown as jittered points, color-coded by whether they came from the continuous beta component or the discrete Bernoulli component.
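
To go one step further, a quick posterior predictive check can be run by simulating replicated datasets from the posterior draws and comparing, for example, the proportion of exact 0’s and 1’s to the observed proportion. A sketch using the posterior vectors extracted above (tau.post is pulled out the same way as the other parameters):

tau.post <- subset(ggd, Parameter == "tau")$value

# simulate one replicated dataset per posterior draw
prop01 <- rep(NA, n.stored)
for (s in 1:n.stored){
  alpha.rep <- antilogit(a0.post[s] + a1.post[s] * x + a2.post[s] * x^2)
  mu.rep <- antilogit(b0.post[s] + b1.post[s] * x)
  discrete.rep <- rbinom(n, 1, alpha.rep)
  y.rep <- ifelse(discrete.rep == 1,
                  rbinom(n, 1, mu.rep),
                  rbeta(n, mu.rep * tau.post[s], (1 - mu.rep) * tau.post[s]))
  prop01[s] <- mean(y.rep %in% c(0, 1))
}

hist(prop01, main = "", xlab = "Replicated proportion of exact 0's and 1's")
abline(v = mean(y %in% c(0, 1)), col = "red", lwd = 2)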

Errors-in-variables Models in Stan

In a previous post, I gave a cursory overview of how prior information about covariate measurement error can reduce bias in linear regression. In the comments, Rasmus Bååth asked about estimation in the absence of strong priors. Here, I’ll describe a Bayesian approach for estimation and correction for covariate measurement error using a latent-variable based errors-in-variables model that does not rely on strong prior information. Recall that this matters because error in covariate measurements tends to bias slope estimates towards zero.

For what follows, we’ll assume a simple linear regression in which continuous covariates are measured with error. True covariate values are treated as latent variables, with repeated measurements of covariate values arising from a normal distribution whose mean equals the true value, with some measurement error $\sigma_x$. Graphically, the latent variables can be represented as circles and the observables as boxes, with $\epsilon_x \sim Normal(0, \sigma_x)$ and $\epsilon_y \sim Normal(0, \sigma_y)$.

In other words, we assume that for sample unit $i$ and repeat measurement $j$:

$$x_{ij}^{obs} \sim Normal(x_i, \sigma_x), \qquad y_i \sim Normal(\alpha + \beta x_i, \sigma_y)$$

The trick here is to use repeated measurements of the covariates to estimate and correct for measurement error. In order for this to be valid, the true covariate values cannot vary across repeat measurements. If the covariate was individual weight, you would have to ensure that the true weight did not vary across repeat measurements (for me, frogs urinating during handling would violate this assumption).

Below, I’ll simulate some data of this type in R. I’m assuming that we randomly select some sampling units to remeasure covariate values, and each is remeasured n.reps times.

n.reps <- 3
n.repeated <- 10
n <- 50

# true covariate values
x <- runif(n, -3, 3)
y <- x + rnorm(n)  # alpha=0, beta=1, sdy=1

# random subset to perform repeat covariate measurements
which.repeated <- sample(n, n.repeated)
xsd <- 1  # measurement error
xerr <- rnorm(n + (n.repeated * (n.reps - 1)), 0, xsd)

# indx assigns measurements to sample units
indx <- c(1:n, rep(which.repeated, each = n.reps - 1))
indx <- sort(indx)
nobs <- length(indx)
xobs <- x[indx] + xerr
plot(x[indx], xobs,
    xlab = "True covariate value",
    ylab = "Observed covariate value")
abline(0, 1, lty = 2)
segments(x0 = x[indx], x1 = x[indx],
    y0 = x[indx], y1 = xobs, col = "red")
abline(v = x[which.repeated], col = "green", lty = 3)

Here, the discrepancy due to measurement error is shown as a red segment, and the sample units that were measured three times are highlighted with green dashed lines.
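
Before fitting the latent-variable model, it is worth seeing the attenuation mentioned above in action. A quick sketch comparing a naive regression on the (averaged) observed covariate values to a regression on the true values:

# naive regression on error-contaminated covariate values vs. the truth
xbar <- tapply(xobs, indx, mean)  # average repeat measurements per sample unit
coef(lm(y ~ xbar))  # slope biased toward zero
coef(lm(y ~ x))     # slope near the true value of 1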

I’ll use Stan to estimate the model parameters, because I’ll be refitting the model to new datasets repeatedly below, and Stan is faster than JAGS for these models.

# write the .stan file
cat("
data{
  int n;
  int nobs;
  real xobs[nobs];
  real y[n];
  int indx[nobs];
}

parameters {
  real alpha;
  real beta;
  real<lower=0> sigmay;
  real<lower=0> sigmax;
  real x[n];
}

model {
  // priors
  alpha ~ normal(0,100);
  beta ~ normal(0,100);
  sigmay ~ uniform(0,1000);
  sigmax ~ uniform(0,1000);
  
  // model structure  
  for (i in 1:nobs){
    xobs[i] ~ normal(x[indx[i]], sigmax);
  }
  for (i in 1:n){
    y[i] ~ normal(alpha + beta*x[i], sigmay);
  }
}
  ",
    file = "latent_x.stan")

With the model specified, estimate the parameters.

library(rstan)
library(modeest)
stan_d <- c("y", "xobs", "nobs", "n", "indx")
chains <- 3
iter <- 1000
thin <- 1
mod1 <- stan(file = "latent_x.stan", data = stan_d,
    chains = chains, iter = iter,
    thin = thin)

How did we do? Let’s compare the true vs. estimated covariate values for each sample unit.

posteriors <- extract(mod1)

# highest density interval helper function (thanks to Joe Mihaljevic)
HDI <- function(values, percent = 0.95) {
    sorted <- sort(values)
    index <- floor(percent * length(sorted))
    nCI <- length(sorted) - index
    width <- rep(0, nCI)
    for (i in 1:nCI) {
        width[i] <- sorted[i + index] - sorted[i]
    }
    HDImin <- sorted[which.min(width)]
    HDImax <- sorted[which.min(width) + index]
    HDIlim <- c(HDImin, HDImax)
    return(HDIlim)
}

# comparing estimated true x values to actual x values
Xd <- array(dim = c(n, 3))
for (i in 1:n) {
    Xd[i, 1:2] <- HDI(posteriors$x[, i])
    Xd[i, 3] <- mlv(posteriors$x[, i], method = "shorth")$M
}

lims <- c(min(Xd), max(Xd))
plot(x, Xd[, 3], xlab = "True covariate value",
    ylab = "Estimated covariate value",
    col = "purple", pch = 19, ylim = lims)
abline(0, 1, lty = 2)
segments(x0 = x, x1 = x, y0 = Xd[, 1], y1 = Xd[, 2], col = "purple")

Here, purple points mark the posterior modes for the covariate values, vertical purple segments show their 95% highest density intervals, and the dashed black line is the one-to-one line we would expect if the estimates exactly matched the true values. In addition to estimating the true covariate values, we may wish to check how well we estimated the standard deviation of the measurement error in our covariate.

hist(posteriors$sigmax, breaks = 20,
    main = "Posterior for measurement error",
    xlab = "Measurement standard deviation")
abline(v = xsd, col = "red", lwd = 2)
legend("topright", legend = "True value", col = "red",
    lty = 1, bty = "n", lwd = 2)

How many sample units need repeat measurements?

You may want to know how many sample units need to be repeatedly measured to adequately estimate the degree of covariate measurement error. For instance, if $\sigma_x = 1$, how does the precision in our estimate of $\sigma_x$ improve as more sample units are repeatedly measured? Let’s see what happens when we repeatedly measure covariate values for $1, 2, …, N$ randomly selected sampling units.

n.repeated <- 1:n

# store the HDI and mode for the estimate of sigmax in an array
post.sdx <- array(dim = c(length(n.repeated), 3))
for (i in n.repeated) {
    n.repeats <- i
    which.repeated <- sample(n, n.repeats)
    xerr <- rnorm(n + (n.repeats * (n.reps - 1)), 0, xsd)
    indx <- c(1:n, rep(which.repeated, each = n.reps - 1))
    indx <- sort(indx)
    nobs <- length(indx)
    xobs <- x[indx] + xerr
    stan_d <- c("y", "xobs", "nobs", "n", "indx")
    mod <- stan(fit = mod1, data = stan_d, chains = chains,
        iter = iter, thin = thin)
    posteriors <- extract(mod)
    post.sdx[i, 1:2] <- HDI(posteriors$sigmax)
    post.sdx[i, 3] <- mlv(posteriors$sigmax, method = "shorth")$M
}

# Plot the relationship b/t number of sampling units revisited & sdx
plot(x = n.repeated, y = rep(xsd, length(n.repeated)),
    type = "l", lty = 2,
    ylim = c(0, max(post.sdx)),
    xlab = "Number of sampling units measured three times",
    ylab = "Estimated measurement error")
segments(x0 = n.repeated, x1 = n.repeated,
    y0 = post.sdx[, 1], y1 = post.sdx[, 2],
    col = "red")
points(x = n.repeated, y = post.sdx[, 3], col = "red")
legend("topright", legend = c("True value", "Posterior estimate"),
    col = c("black", "red"), lty = c(2, 1),
    pch = c(NA, 1), bty = "n")

Looking at this plot, you could eyeball the number of sample units that should be remeasured when designing a study. Realistically, you would want to explore how this number depends on the true amount of measurement error, and also simulate multiple realizations (rather than just one) for each scenario. Using a similar approach, you might also evaluate whether it’s more efficient to remeasure more sample units, or invest in more repeated measurements per sample unit.
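
For example, a fuller simulation experiment might loop over a grid of designs with several replicate datasets per design, tracking something like the width of the 95% HDI for $\sigma_x$. A rough sketch under those assumptions (the names n.sims, design, and hdi.width are just for illustration, and this would take a while to run):

n.sims <- 5  # replicate simulations per design
design <- expand.grid(n.repeated = c(5, 10, 25, 50), n.reps = c(2, 3, 5))
hdi.width <- matrix(NA, nrow(design), n.sims)
for (d in 1:nrow(design)) {
  for (s in 1:n.sims) {
    which.repeated <- sample(n, design$n.repeated[d])
    indx <- sort(c(1:n, rep(which.repeated, each = design$n.reps[d] - 1)))
    nobs <- length(indx)
    xobs <- x[indx] + rnorm(nobs, 0, xsd)
    stan_d <- c("y", "xobs", "nobs", "n", "indx")
    fit <- stan(fit = mod1, data = stan_d, chains = chains,
                iter = iter, thin = thin)
    hdi.width[d, s] <- diff(HDI(extract(fit)$sigmax))
  }
}
# average interval width per design; compare designs with similar total effort
cbind(design, mean.width = rowMeans(hdi.width))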