Here is a snippet to reproduce a similar bivariate shrinkage plot in ggplot2, adding a color-coded probability density surface and contours for the estimated multivariate normal distribution of random effects, using the same sleep study data that Bates used.

In an attempt to help clarify the utility of varying intercept models (and more generally, hierarchical modeling), specifically in terms of shrinkage and prediction, here is a GitHub repo with materials and a slideshow from our department’s graduate QDT (quantitative (th)ink tank) group.
For fun, I’ve included a simple example demonstrating the value of shrinkage when trying to rank NBA players by their free throw shooting ability, a situation with wildly varying amounts of information (free throw attempts) on each player. The example admittedly is not ecological, and sensitive readers may replace free throw attempts with prey capture attempts for topical consistency. Many if not most ecological datasets suffer from similar issues, with varying amounts of information from different sites, species, individuals, etc., so even without considering predation dynamics of NBA players, the example’s relevance should be immediate.
Spoiler alert: Mark Price absolutely dominated at the free throw line in the early nineties.
]]>Here’s a Stan implementation of a dynamic (multiyear) occupancy model of the sort described by MacKenzie et al. (2003).
First, the model statement:

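The original listing is not reproduced here, but a constant-parameter version of this kind of model can be sketched in Stan by marginalizing the latent occupancy states out of the likelihood. Data names, dimensions, and the implicit uniform priors below are assumptions, not necessarily those of the post's actual model statement:

```stan
data {
  int<lower=1> nsite;                           // number of sites
  int<lower=1> nyear;                           // number of years (seasons)
  int<lower=1> nrep;                            // surveys per site-year
  int<lower=0, upper=1> Y[nsite, nyear, nrep];  // detection histories
}
parameters {
  real<lower=0, upper=1> psi1;   // occupancy probability in year 1
  real<lower=0, upper=1> gamma;  // colonization probability
  real<lower=0, upper=1> phi;    // persistence probability
  real<lower=0, upper=1> p;      // detection probability
}
model {
  // uniform(0, 1) priors on all parameters are implicit
  for (i in 1:nsite) {
    real psi;                    // filtered occupancy probability, site i
    psi = psi1;
    for (t in 1:nyear) {
      int sy;
      real lp_occ;
      real lp_empty;
      real lse;
      sy = sum(Y[i, t]);
      if (sy > 0) {
        // at least one detection: the site must be occupied
        target += log(psi) + binomial_lpmf(sy | nrep, p);
        psi = phi;
      } else {
        // no detections: occupied-but-missed, or truly unoccupied
        lp_occ = log(psi) + nrep * log1m(p);
        lp_empty = log1m(psi);
        lse = log_sum_exp(lp_occ, lp_empty);
        target += lse;
        // condition next year's occupancy on the all-zero history
        psi = exp(lp_occ - lse) * phi + exp(lp_empty - lse) * gamma;
      }
    }
  }
}
```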
This model can be made faster by storing values for log(psi) and log(1 - psi), as done in Bob Carpenter's single-season example.
Fitting the model (in parallel):

Does it work? Let’s whip up 1000 simulated datasets and their corresponding estimates for colonization and extinction rates.

Here are the results for the probability of colonization $\gamma$ and the probability of persistence $\phi$. The blue dashed line shows the true value, and the dashed red line shows the mean of all 1000 posterior modes. The black lines represent the HPDI for each iteration, and the black points represent the posterior modes. This example uses a uniform prior on both of these parameters, probably an overrepresentation of prior ignorance in most real systems.
Based on some initial exploration, this approach seems much much (much?) faster than explicitly modeling the latent occurrence states in JAGS, with better chain mixing and considerably less autocorrelation. Extension to multispecies models should be straightforward too.
MacKenzie DI, Nichols JD, Hines JE, Knutson MG, Franklin AB. 2003. Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology 84(8): 2200-2207.
]]>Here, extract_cover() is a function to do the extraction (with the help of the raster package), and extraction.R makes a parallel call to the function using the doMC package:


Resources
]]>A common goal of community ecology is to understand how and why species composition shifts across space. Common techniques to determine which environmental covariates might drive such shifts typically rely on ordination of community data to reduce its dimensionality. These techniques include redundancy analysis (RDA), canonical correspondence analysis (CCA), and nonmetric multidimensional scaling (NMDS), each paired with permutation tests. However, as brought to light by Jackson et al. (2012: Ecosphere), these ordination techniques do not discern species-level covariate effects, making it difficult to attribute community-level pattern shifts to species-level changes. Jackson et al. (2012) propose a hierarchical modeling framework as an alternative, which we extend in this post to correct for imperfect detection.
Multilevel models can estimate species-level random and fixed covariate effects to determine the relative contribution of environmental covariates to changing composition across space (Jackson et al. 2012). For presence/absence data, such models are often formulated as:
Here $y_q$ is a vector of presence/absence of each species at each site ($q = 1, \ldots, nm$, where $n$ is the number of species and $m$ the number of sites). This model can be extended to incorporate multiple covariates.
We are interested in whether species respond differently to environmental gradients (e.g. elevation, temperature, precipitation). If this is the case, then we expect community composition to change along such gradients. Concretely, we are interested in whether $\sigma_{slope}^2$ for any covariate differs from zero.
Jackson et al. (2012) provide code for a maximum likelihood implementation of their model with data from Southern Appalachian understory herbs using the R package lme4. Here we present a simple extension of Jackson and colleagues' work, correcting for detection error with repeat surveys (i.e. multispecies occupancy modeling). Specifically, the above model could be changed slightly to:
Now $y_q$ is the number of times each species is observed at each site over $j$ surveys. $p_{spp[q]}$ represents the species-specific probability of detection when the species is present, and $z_q$ represents the 'true' occurrence of the species, a Bernoulli random variable with probability $\psi_q$.
To demonstrate the method, we simulate data for a 20-species community across 100 sites with 4 repeat surveys. We assume that three site-level environmental covariates were measured, two of which have variable effects on occurrence probabilities (i.e. random effects), and one of which has a consistent effect for all species (i.e. a fixed effect). We also assume that species-specific detection probabilities varied, but were independent of environmental covariates.

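A simulation along these lines might look like the following. The effect sizes, hyperparameters, and variable names are illustrative assumptions, not the post's actual values:

```r
# Simulate a 20-species x 100-site occupancy dataset with 4 repeat surveys.
# Covariates 1-2 have species-varying (random) slopes; covariate 3 has a
# fixed slope shared by all species. All values here are illustrative.
set.seed(1)
n_spp <- 20
n_site <- 100
n_rep <- 4
X <- matrix(rnorm(n_site * 3), n_site, 3)   # site-level covariates
a  <- rnorm(n_spp, 0, 1)                    # species intercepts
b1 <- rnorm(n_spp, 0.5, 1)                  # random slopes, covariate 1
b2 <- rnorm(n_spp, -0.5, 1)                 # random slopes, covariate 2
b3 <- rep(1, n_spp)                         # fixed slope, covariate 3
p  <- plogis(rnorm(n_spp, 0, 1))            # species-specific detection
Y <- matrix(NA, n_spp, n_site)              # detections out of n_rep surveys
for (s in 1:n_spp) {
  psi <- plogis(a[s] + b1[s] * X[, 1] + b2[s] * X[, 2] + b3[s] * X[, 3])
  z <- rbinom(n_site, 1, psi)               # latent 'true' occupancy
  Y[s, ] <- rbinom(n_site, n_rep, z * p[s]) # detections only where occupied
}
```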
We fit the following model in JAGS, with vague priors.

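A JAGS model statement for this kind of multispecies occupancy model, written as an R string, could look roughly like the following. The stacked species-site indexing, the variable names, and the priors are assumptions:

```r
# Sketch of a multispecies occupancy model in the JAGS language,
# stacked by species-site index q (names and priors are assumptions)
ms_model <- "
model {
  for (q in 1:nq) {
    logit(psi[q]) <- a[spp[q]] + inprod(b[spp[q], ], X[site[q], ])
    z[q] ~ dbern(psi[q])                # latent 'true' occupancy
    y[q] ~ dbin(z[q] * p[spp[q]], J)   # detections over J surveys
  }
  for (s in 1:nspp) {
    a[s] ~ dnorm(mu.a, tau.a)           # species-level intercepts
    p[s] ~ dunif(0, 1)                  # species-specific detection
    for (k in 1:ncov) {
      b[s, k] ~ dnorm(mu.b[k], tau.b[k])  # species-level slopes
    }
  }
  mu.a ~ dnorm(0, 0.001)
  tau.a ~ dgamma(0.01, 0.01)
  for (k in 1:ncov) {
    mu.b[k] ~ dnorm(0, 0.001)
    tau.b[k] ~ dgamma(0.01, 0.01)
  }
}
"
```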
Using information theory, specifically the Watanabe-Akaike information criterion (WAIC), we compared this model, which assumes all covariates have random effects, to all combinations of models varying whether each covariate has fixed or random effects. See this GitHub repository for all model statements and code.
A model that assumes all covariates have random effects, and the data-generating model, in which only covariates 1 and 2 have random effects, performed best, but were indistinguishable from one another:
This result makes sense because the model with all random effects is able to recover species-specific responses to site-level covariates very well:
However, this model estimates that the 95% HDI of $\sigma_{slope}$ for covariate 3 includes zero, indicating that this covariate effectively has a fixed, rather than random, effect among species.
Thus, we could conclude that the first two covariates have random effects, while the third covariate has a fixed effect. This means that composition shifts along gradients of covariates 1 and 2. We can visualize the relative contribution of covariate 1 and 2's random effects to composition using ordination, as discussed in Jackson et al. (2012). To do this, we compare the linear predictor (i.e. $logit^{-1}(\psi_q)$) of the best model that includes only significant random effects to a model that does not have any random effects.
The code to extract linear predictors and ordinate the community is provided on GitHub:
]]>Here’s a little R shiny app that could be used as a starting point for such a supplement. Currently it only includes two covariates for simplicity, and gives the user control over the covariate $R^2$ value, the residual variance, and the variance of both covariates.
As usual, the file server.R defines what you want to actually do in R:

The file ui.R defines the user interface:

Everything’s ready to fork or clone on GitHub.
]]>Assume we have a multiple regression problem:
We suspect only a subset of the elements of $\boldsymbol{\beta}$ are nonzero, i.e. some of the covariates have no effect.
Assume $\boldsymbol{\beta}$ arises from one of two normal mixture components, depending on a latent variable $\gamma_i$:
$\tau_i$ is positive but small s.t. $\beta_i$ is close to zero when $\gamma_i = 0$. $c_i$ is large enough to allow reasonable deviations from zero when $\gamma_i = 1$. The prior probability that covariate $i$ has a nonzero effect is $Pr(\gamma_i = 1) = p_i$. Important subtleties about priors are covered in George and McCulloch (1993) and elsewhere.
Let’s simulate a dataset in which some covariates have strong effects on the linear predictor, and others don’t: suppose we have 20 candidate variables, but only 60 observations.

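One way to simulate such a dataset (the true coefficient values here are illustrative):

```r
# 60 observations, 20 candidate covariates, only a few nonzero effects
set.seed(123)
n <- 60
p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- rep(0, p)
beta[1:3] <- c(3, -2, 1.5)        # strong effects; the rest are truly zero
sigma <- 1
y <- drop(X %*% beta) + rnorm(n, 0, sigma)
```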
Specifying the model:

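A spike-and-slab prior of the George and McCulloch form can be sketched as a JAGS model string. The spike and slab standard deviations (0.03 and 3) and the 0.5 prior inclusion probability below are illustrative choices, not necessarily the post's:

```r
# Stochastic search variable selection: each beta[j] gets either a
# narrow 'spike' prior (effectively zero) or a wide 'slab' prior,
# switched by the latent inclusion indicator gamma[j].
ssvs_model <- "
model {
  for (i in 1:n) {
    mu[i] <- inprod(X[i, ], beta[])
    y[i] ~ dnorm(mu[i], prec.y)
  }
  for (j in 1:p) {
    gamma[j] ~ dbern(0.5)                          # prior inclusion prob.
    sd.beta[j] <- gamma[j] * 3 + (1 - gamma[j]) * 0.03
    beta[j] ~ dnorm(0, pow(sd.beta[j], -2))        # spike or slab
  }
  prec.y ~ dgamma(0.01, 0.01)
}
"
```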
Fitting the model and assessing performance against known values:

On the left, green points indicate true coefficient values, with black posterior Bayesian credible intervals. The right plot shows the relationship between the true magnitude of the effect and the posterior probability that the coefficient was nonzero, $E(\gamma_i \mid \boldsymbol{Y})$.
]]>The zero-and-one-inflated beta distribution facilitates modeling fractional or proportional data that contain both 0’s and 1’s (Ospina and Ferrari 2010, highly recommended). The general idea is to model the response variable (call it $y$) as a mixture of Bernoulli and beta distributions, from which the true 0’s and 1’s, and the values between 0 and 1, are generated, respectively. The probability density function is
where $0 < \alpha, \gamma, \mu < 1$, and $\phi>0$. $f(y; \mu, \phi)$ is the probability density function for the beta distribution, parameterized in terms of its mean $\mu$ and precision $\phi$:
$\alpha$ is a mixture parameter that determines the extent to which the Bernoulli or beta component dominates the pdf. $\gamma$ determines the probability that $y=1$ if it comes from the Bernoulli component. $\mu$ and $\phi$ are the expected value and the precision for the beta component, which is usually parameterized in terms of $p$ and $q$ (Ferrari and Cribari-Neto 2004). $\mu = \frac{p}{p + q}$, and $\phi = p + q$.
Although ecologists often deal with proportion data, I haven’t found any examples of 0 & 1 inflated beta regression in the ecological literature. The closest thing I’ve found is Nishii and Tanaka (2012), who take a different approach in which values between 0 and 1 are modeled as logit-normal.
Here’s a quick demo in JAGS with simulated data. For simplicity, I’ll assume 1) there is one covariate that increases the expected value at the same rate for both the Bernoulli and beta components s.t. $\mu = \gamma$, and 2) the Bernoulli component dominates extreme values of the covariate, where the expected value is near 0 or 1.

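Data of this kind can be simulated along the following lines. The coefficient values, the precision $\phi$, and the way the discrete component is tied to the covariate are illustrative assumptions consistent with the two simplifications above:

```r
# Simulate zero-one-inflated beta data with one covariate.
# mu = gamma (shared logit link), and the discrete (Bernoulli)
# component dominates at extreme covariate values.
set.seed(1)
n <- 200
x <- runif(n, -3, 3)
mu <- plogis(0.8 * x)               # shared mean for both components
alpha <- plogis(2 * abs(x) - 3)     # P(discrete), largest at extremes
phi <- 10                           # beta precision (illustrative)
discrete <- rbinom(n, 1, alpha)     # which component generated y?
y <- ifelse(discrete == 1,
            rbinom(n, 1, mu),                        # exact 0s and 1s
            rbeta(n, mu * phi, (1 - mu) * phi))      # values in (0, 1)
```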
Now we can specify our model in JAGS, following the factorization of the likelihood given by Ospina and Ferrari (2010), estimate our parameters, and see how well the model performs.

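Following the factorization into discrete and continuous components, a JAGS model statement might be sketched as an R string like the one below. The indexing splits the data into the exact 0/1 values and the values in (0, 1); variable names and priors are assumptions:

```r
# Sketch of a factorized zero-one-inflated beta model in JAGS:
# d[i] indicates whether y[i] is exactly 0 or 1; the 0/1 values and the
# (0, 1) values then get Bernoulli and beta likelihoods, respectively.
zoib_model <- "
model {
  for (i in 1:n) {
    d[i] ~ dbern(alpha[i])                   # 1 if y[i] is exactly 0 or 1
    logit(alpha[i]) <- a0 + a1 * abs(x[i])   # discrete comp. at extremes
  }
  for (j in 1:n.disc) {                      # the exact 0s and 1s
    y.disc[j] ~ dbern(mu.d[j])
    logit(mu.d[j]) <- b0 + b1 * x.disc[j]    # gamma = mu (shared link)
  }
  for (k in 1:n.cont) {                      # the values in (0, 1)
    y.cont[k] ~ dbeta(mu.c[k] * phi, (1 - mu.c[k]) * phi)
    logit(mu.c[k]) <- b0 + b1 * x.cont[k]
  }
  a0 ~ dnorm(0, 0.01)
  a1 ~ dnorm(0, 0.01)
  b0 ~ dnorm(0, 0.01)
  b1 ~ dnorm(0, 0.01)
  phi ~ dgamma(0.1, 0.1)
}
"
```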
Here the posterior probability that $y$ comes from the discrete Bernoulli component is shown in grey, and the posterior expected value for both the Bernoulli and beta components across values of the covariate are shown in blue. The dashed green line shows the true expected value that was used to generate the data. Finally, the observed data are shown as jittered points, color-coded as being derived from the continuous beta component or the discrete Bernoulli component.
]]>For what follows, we’ll assume a simple linear regression, in which continuous covariates are measured with error. True covariate values are considered latent variables, with repeated measurements of covariate values arising from a normal distribution with a mean equal to the true value, and some measurement error $\sigma_x$. We can represent the latent variables in the model as circles, and observables as boxes:
with $\epsilon_x \sim Normal(0, \sigma_x)$ and $\epsilon_y \sim Normal(0, \sigma_y)$.
In other words, we assume that for sample unit $i$ and repeat measurement $j$:
The trick here is to use repeated measurements of the covariates to estimate and correct for measurement error. In order for this to be valid, the true covariate values cannot vary across repeat measurements. If the covariate was individual weight, you would have to ensure that the true weight did not vary across repeat measurements (for me, frogs urinating during handling would violate this assumption).
Below, I’ll simulate some data of this type in R. I’m assuming that we randomly select some sampling units to remeasure covariate values, and each is remeasured n.reps times.

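A sketch of such a simulation, with sample sizes, error magnitudes, and regression coefficients as illustrative assumptions:

```r
# Simulate a covariate measured with error; a random subset of units
# is measured n.reps times, the rest only once (values illustrative).
set.seed(42)
N <- 50
n.reps <- 3
n.repeated <- 10
sigma_x <- 1          # covariate measurement error
sigma_y <- 0.5        # residual error in the response
x_true <- rnorm(N)    # latent true covariate values
y <- rnorm(N, 1 + 2 * x_true, sigma_y)
which_rep <- sample(N, n.repeated)   # units selected for remeasurement
x_obs <- lapply(seq_len(N), function(i) {
  k <- if (i %in% which_rep) n.reps else 1
  rnorm(k, x_true[i], sigma_x)       # noisy measurement(s) of unit i
})
```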
Here, the discrepancy due to measurement error is shown as a red segment, and the sample units that were measured three times are highlighted with green dashed lines.
I’ll use Stan to estimate the model parameters, because I’ll be refitting the model to new data sets repeatedly below, and Stan is faster than JAGS for these models.

With the model specified, estimate the parameters.

How did we do? Let’s compare the true vs. estimated covariate values for each sample unit.

Here purple marks the posterior for the covariate values, and the dashed black line shows the one-to-one line that we would expect if the estimates exactly matched the true values. In addition to estimating the true covariate values, we may wish to check how well we estimated the standard deviation of the measurement error in our covariate.

You may want to know how many sample units need to be repeatedly measured to adequately estimate the degree of covariate measurement error. For instance, if $\sigma_x = 1$, how does the precision in our estimate of $\sigma_x$ improve as more sample units are repeatedly measured? Let’s see what happens when we repeatedly measure covariate values for $1, 2, …, N$ randomly selected sampling units.

Looking at this plot, you could eyeball the number of sample units that should be remeasured when designing a study. Realistically, you would want to explore how this number depends on the true amount of measurement error, and also simulate multiple realizations (rather than just one) for each scenario. Using a similar approach, you might also evaluate whether it’s more efficient to remeasure more sample units, or invest in more repeated measurements per sample unit.
]]>Being in grad school, I do a lot of scholarly writing that requires associated or embedded R analyses, figures, and tables, plus bibliographies.
Microsoft Word makes this unnecessarily difficult.
Many tools are now available to break free from the tyranny of Word.
The ones I like involve writing an article in markdown format, integrating all data preparation, analysis, and outputs with the document (e.g. with the excellent and accessible knitr package or with a custom make setup like this one).
Add in version control with Git, and you’ve got a nice stew going.
If you’re involved in the open source/reproducible research blogotwittersphere, this is probably old hat. To many in my department, this looks like black magic. It’s not.
I can’t give an authoritative overview, but here are some resources that helped me get through my divorce:
(I don’t study macaques, but they’re cute enough to warrant a toy example)
First, I downloaded a nexus file with DNA sequences from A molecular phylogeny of living primates by Polina Perelman et al., available here on TreeBASE. I culled the nexus file to include only the 14 species in the genus Macaca, and saved it as macaques.nex.
I used MrBayes to estimate the macaque phylogeny with the following file (analysis.nex):

Calling MrBayes from within R:


Here is the unrooted consensus tree:
Macaque weight data are available as part of a (much) larger dataset on the body mass of late Quaternary mammals, by Smith and colleagues. I extracted the log body mass data for the 13 available macaque species. No body mass data were available for the Siberut macaque (M. siberu).
We will account for phylogenetic nonindependence by considering average species weights to be multivariate normally distributed around a within-genus mean, with a covariance matrix $\Sigma$ that reflects phylogenetic distance (see de Villemereuil et al.). The off-diagonal elements of this matrix are scaled by Pagel’s $\lambda$, which reflects the degree of phylogenetic signal in the data.
With the help of the ape and MASS packages, covariance and precision matrices can be constructed for the trees comprising the posterior phylogeny.

Then we can fit the model with OpenBUGS, estimating the missing body mass of M. siberu, accounting for uncertainty in the phylogeny’s topology and branch lengths.

Here is our phylogenetically informed estimate of the body size for M. siberu. Pagel’s $\lambda$ indicates a weak but nonzero phylogenetic signal, with mean = 0.417 and 95% BCI = (0.0156, 0.949). It should go without saying that this is a toy example, and it may be better to go out and weigh some actual Siberut macaques (at a minimum, this would be a good excuse for a vacation).
Blomberg et al.: Independent contrasts and PGLS regression estimators are equivalent. Systematic Biology 2012.
Pagel: Inferring the historical patterns of biological evolution. Nature 1999.
Perelman et al.: A molecular phylogeny of living primates. PLoS Genetics 2011.
Smith et al.: Body mass of late Quaternary mammals. Ecology 2003.
de Villemereuil et al.: Bayesian models for comparative analysis integrating phylogenetic uncertainty. BMC Evolutionary Biology 2012.
I’ll use a simple example: estimating a population mean and standard deviation. We’ll define some population level parameters, collect some data, then use the Metropolis algorithm to simulate the joint posterior of the mean and standard deviation.

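A minimal Metropolis sampler for this problem fits in a few lines of R. The data, the flat priors, the proposal scale, and the chain length below are illustrative choices, not necessarily those used in the post:

```r
# Random-walk Metropolis for the mean and sd of a normal population
set.seed(1)
y <- rnorm(50, mean = 5, sd = 2)      # "collected" data

log_post <- function(mu, sigma) {
  if (sigma <= 0) return(-Inf)        # flat priors; sigma must be positive
  sum(dnorm(y, mu, sigma, log = TRUE))
}

n_iter <- 5000
chain <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("mu", "sigma")))
chain[1, ] <- c(0, 1)                 # deliberately poor starting values
for (t in 2:n_iter) {
  prop <- chain[t - 1, ] + rnorm(2, 0, 0.3)   # symmetric proposal
  log_r <- log_post(prop[1], prop[2]) -
           log_post(chain[t - 1, 1], chain[t - 1, 2])
  chain[t, ] <- if (log(runif(1)) < log_r) prop else chain[t - 1, ]
}
colMeans(chain[-(1:1000), ])          # posterior means after burn-in
```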
Then, to visualize the evolution of the Markov chains, we can make plots of the chains in two-parameter space, along with the posterior density at different iterations, joining these plots together using ImageMagick (in the terminal) to create an animated .gif:

lme4 package.
$R^2$ is usually reported as a point estimate of the variance explained by a model, using the maximum likelihood estimates of the model parameters and ignoring uncertainty around these estimates. Nakagawa and Schielzeth (2013) noted that it may be desirable to quantify the uncertainty around $R^2$ using MCMC sampling. So, here we are.
$R^2$ quantifies the proportion of observed variance explained by a statistical model. When it is large (near 1), much of the variance in the data is explained by the model.
Nakagawa and Schielzeth (2013) present two $R^2$ statistics for generalized linear mixed models:
1) Marginal $R^2_{GLMM(m)}$, which represents the proportion of variance explained by fixed effects:
where $\sigma^2_f$ represents the variance in the fitted values (on a link scale) based on the fixed effects:
$\boldsymbol{X}$ is the design matrix of the fixed effects, and $\boldsymbol{\beta}$ is the vector of fixed effects estimates.
$\sum_{l=1}^{u}\sigma^2_l$ represents the sum of the variance components for all $u$ random effects. $\sigma^2_d$ is the distribution-specific variance (Nakagawa & Schielzeth 2010), and $\sigma^2_e$ represents added dispersion.
2) Conditional $R^2_{GLMM(c)}$ represents the proportion of variance explained by the fixed and random effects combined:
Here, I’ll follow the example of an overdispersed Poisson GLMM provided in the supplement to Nakagawa & Schielzeth, available here. This is their most complicated example; the simpler ones ought to be relatively straightforward for those who are interested in normal or binomial GLMMs.

Having simulated a dataset, calculate the $R^2$ point estimates, using the lme4 package to fit the model.

Having stored our point estimates, we can now turn to Bayesian methods instead, and generate $R^2$ posteriors.
We need to fit two models in order to get the needed parameters for $R^2_{GLMM}$. First, a model that includes all random effects but only an intercept fixed effect is fit, to estimate the distribution-specific variance $\sigma^2_d$. Second, we fit a model that includes all random and all fixed effects to estimate the remaining variance components.
First I’ll clean up the data that we’ll feed to JAGS:

Then, fitting the intercept model:

Then, fit the full mixed model with all fixed and random effects:

For every MCMC draw, we can calculate $R^2_{GLMM}$, generating posteriors for both the marginal and conditional values.

This plot shows the posterior $R^2_{GLMM}$ distributions for both the marginal and conditional cases, with the point estimates generated with lmer shown as vertical blue lines. Personally, I find it a bit more informative and intuitive to think of $R^2$ as a probability distribution that integrates uncertainty in its component parameters. That said, it is unconventional to represent $R^2$ in this way, which could compromise the ease with which this handy statistic can be explained to the uninitiated (e.g. first-year biology undergraduates). But, being a derived parameter, those wishing to generate a posterior can do so relatively easily.
Some may be wondering whether the parameter estimates generated with lme4 are comparable to those generated using JAGS. Having used vague priors, we would expect them to be similar. We can plot the Bayesian credible intervals (in blue), with the previous point estimates (as open black circles):

Suppose you have discovered a statistically significant interaction effect between two continuous covariates in the context of a linear model.
Suppose also that you have decided to present the model results with the following table, and the reviewers requested no additional information:
|           | Estimate | SE    | P-value |
|-----------|----------|-------|---------|
| $\beta_0$ | 0.004    | 0.037 | 0.921   |
| $\beta_1$ | 1.055    | 0.038 | <0.05   |
| $\beta_2$ | 0.496    | 0.037 | <0.05   |
| $\beta_3$ | 2.002    | 0.040 | <0.05   |
| RSE       | 0.517    |       |         |
Without knowing the range of covariate values observed, this table tells an incomplete story about the relationship between the covariates and the response variable. Assuming the reader has a decent guess about the range of possible values for the covariates, this is what they can piece together:

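Using only the table's point estimates, a reader could reconstruct the implied surface like so. The model form ($\mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$) and especially the covariate ranges are assumptions, which is exactly the problem:

```r
# Expected response surface implied by the table's estimates,
# over guessed (not reported) covariate ranges
b <- c(0.004, 1.055, 0.496, 2.002)
x1 <- seq(-2, 2, length.out = 50)   # assumed range of covariate 1
x2 <- seq(-2, 2, length.out = 50)   # assumed range of covariate 2
mu <- outer(x1, x2, function(u, v) b[1] + b[2] * u + b[3] * v + b[4] * u * v)
image(x1, x2, mu, xlab = "Covariate 1", ylab = "Covariate 2")
contour(x1, x2, mu, add = TRUE)
```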
If the reader does not know where the observations fell in this plot, it is difficult to know whether the response variable was increasing or decreasing with each covariate across the range of observed values.
Consider the following two cases, where the observed covariate combinations are included as points.
These two plots tell somewhat different stories despite identical model parameters. On the left, across the range of observed covariates, the expected value of $y$ increases as either covariate increases, and the interaction term affects the magnitude of this increase. On the right, increases in covariate 1 or 2 could increase or decrease $\mu$, depending on the value of the other covariate.
I won’t get into the nitty gritty of how to present interaction effects (but if you’re interested, there are articles out there, e.g. Lamina et al. 2012). My main goal here is to point out the ambiguity associated with only presenting a table of parameter estimates. My preference would be that authors at least present observed covariate ranges (or better yet values), and provide a plot that illustrates the interaction.
]]>Suppose we are to fit a multiyear occupancy model for one species. We will evaluate the fit based on how well the model predicts occupancy in the final year of the project. Start by simulating some data (for details on the structure of these simulated data, refer to this post and references therein):

For illustration, I included a strong interaction between treatment and the continuous site-level covariate (could be elevation, area, etc.). As such, a measure of model fit such as AUC ought to identify a saturated model as the best fitting. Handily, AUC is a derived parameter, and common occupancy model parameters can be used to estimate a posterior. To generate a posterior AUC, we need predicted occupancy probabilities ($\psi$) and realized occupancy states ($Z$) in the final year. Predicted occupancy probabilities can be produced using data from previous years, and realized occupancy states are assumed to be represented by the posterior for $Z$ generated from a single-year model, fit to the data from the final year of the study.
Begin by modeling occupancy probabilities as a function of both covariates and their interaction, predicting $\psi$ in the final year:

Now that we have our posteriors for $\psi$ at each site in the final year, we can fit a single-year model to the final year’s data to estimate $Z$.

To set up the data for AUC calculations, produce site by iteration arrays for $\psi$ and $Z$:

Now generate the posterior for AUC and store data on the true and false positive rates to produce ROC curves.

Having fitted a saturated model, we can now fit a simpler model that includes only main effects:

How well did our models predict occupancy in the final year of the study, and was one better than the other? We can answer this question by inspecting posteriors for AUC (larger values are better), and the ROC curves.

As expected, the model that generated the data fits better than the model that excludes the strong interaction term. Note that AUC reflects the accuracy of model predictions, and does not penalize model complexity.
]]>

Boxplots are often used:

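For concreteness, here is a toy grouped dataset and the usual boxplot comparison (the data are made up for illustration):

```r
# Two groups with different means and spreads, compared with boxplot()
set.seed(7)
d <- data.frame(group = rep(c("A", "B"), each = 100),
                value = c(rnorm(100, 0, 1), rnorm(100, 1, 2)))
boxplot(value ~ group, data = d)
```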
This gives us a rough comparison of the distribution in each group, but sometimes it’s nice to visualize the kernel density estimates instead.
I recently ran into this issue and tweaked the vioplot() function from the vioplot package by Daniel Adler to make split violin plots. With vioplot2(), the side argument specifies whether to plot the density on “both”, the “left”, or the “right” side.

Last but not least, Peter Kampstra’s beanplot package uses beanplot() to make split density plots, but 1) plots a rug rather than a quantile box, 2) includes a line for the overall mean or median, and 3) makes it easier to change the kernel function.

There are more ways than one to skin a cat, and which one uses will probably come down to personal preference.
]]>Here’s a quick illustration of the problem: I’ll generate data from a known simple linear regression model, and fit models that ignore or incorporate error in the covariate.

Ignoring error in the covariate:

Incorporating error in the covariate: I’m assuming that we have substantive knowledge about covariate measurement represented in the prior for the precision in X. Further, the prior for the true X values reflects knowledge of the distribution of our X value in the population from which the sample was taken.

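An errors-in-variables regression of this form can be sketched as a JAGS model string. The priors below, including the informative prior on the measurement-error precision, are illustrative stand-ins for the substantive knowledge described above:

```r
# Sketch of an errors-in-variables regression in JAGS: the true covariate
# values are latent, with observed values scattered around them.
me_model <- "
model {
  for (i in 1:N) {
    x_true[i] ~ dnorm(mu.x, prec.pop)      # population distribution of X
    x_obs[i] ~ dnorm(x_true[i], prec.x)    # observed X, with error
    y[i] ~ dnorm(alpha + beta * x_true[i], prec.y)
  }
  alpha ~ dnorm(0, 0.001)
  beta ~ dnorm(0, 0.001)
  mu.x ~ dnorm(0, 0.001)
  prec.pop ~ dgamma(0.01, 0.01)
  prec.x ~ dgamma(1, 1)     # informative: prior knowledge of error in X
  prec.y ~ dgamma(0.01, 0.01)
}
"
```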
Now let’s see how the two models perform.

The dashed green line shows the model that generated the data, i.e. the “true” line. The red lines show the posterior for the naive model ignoring error in X, while the less-biased blue lines show the posterior for the model incorporating error in X.
]]>Conceptually, this is expected to occur when, on the left side of the curve, increasing habitat heterogeneity opens up new regions in niche space, facilitating colonization by new species. However, as heterogeneity continues to increase, each species has fewer habitat patches to utilize, population sizes decrease, and local extinction risk increases due to demographic stochasticity. To explore this idea theoretically, Allouche et al. (2012) developed an individual-based model using a continuous-time Markov process. The details of their modeling approach can be found in the supplement to their article, which I recommend. In this post, I’ll demonstrate how to implement a discrete-time version of their model in R. Thanks to the agent-based modeling working group at the University of Colorado for providing motivation to code up the model in R.
This model is spatially implicit, with $A$ equally connected sites. Each site falls on an environmental condition axis, receiving some value $E$ that characterizes local conditions. The environmental conditions for each site are uniformly distributed between two values that dictate the range of environmental conditions in a focal area. The local range of environmental conditions is a subset of some global range. There are $N$ species in the regional pool that can colonize habitat patches. Each species has some environmental optimum $\mu_i$, and some niche width $\sigma_i$, which together define a Gaussian function for the probability of establishment given a colonization attempt and a habitat patch environmental condition $E$.
The image above illustrates the probability of establishing for five species across the global range of environmental conditions possible. For any focal area, the realized range of environmental conditions is some subset of this global range.
It is assumed that all individuals that occupy a patch have the same per-timestep probabilities of death and reproduction. If an individual reproduces, the number of offspring it produces is a Poisson-distributed random variable, and each offspring attempts to colonize one randomly selected site. At each timestep, every site has an equal probability of a colonization attempt by an individual from each species in the regional pool. Every habitat patch holds only one individual.
Offspring and immigrants from the regional pool do not displace individuals from habitat patches when they attempt to colonize. In empty sites, offspring receive colonization priority, with regional colonization occurring after breeding. When multiple offspring or immigrants from the regional pool could establish in an empty site, one successful individual is randomly chosen to establish regardless of species identity.
The following parameters are supplied to the function alloucheIBM(): A = number of sites; N = number of species in the regional pool; ERmin = global environmental conditions minimum; ERmax = global environmental conditions maximum; Emin = local environmental minimum; Emax = local environmental maximum; sig = niche width standard deviation for all species; pM = per-timestep probability of mortality; pR = per-timestep probability of reproduction; R = per capita expected number of offspring; and I = per-timestep probability of attempted colonization by an immigrant from the regional pool for each patch.
The function alloucheIBM() does the majority of the work for this model:
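The original function is longer than what follows; here is a condensed sketch of the individual-based dynamics it implements. Internal details and defaults are my own, and the state is stored as a timestep-by-site matrix of species identities rather than a full 3-D array:

```r
# Condensed sketch of the discrete-time individual-based model
alloucheIBM <- function(A = 100, N = 10, ERmin = 0, ERmax = 100,
                        Emin = 40, Emax = 60, sig = 10,
                        pM = 0.1, pR = 0.5, R = 2, I = 0.001,
                        timesteps = 200) {
  E  <- runif(A, Emin, Emax)   # local site environmental conditions
  mu <- runif(N, ERmin, ERmax) # species environmental optima (global range)
  occ <- rep(0, A)             # 0 = empty, otherwise occupying species id
  state <- matrix(0, nrow = timesteps, ncol = A)
  richness <- numeric(timesteps)
  for (t in 1:timesteps) {
    # mortality: every individual has the same per-timestep death probability
    occ[runif(A) < pM] <- 0
    # reproduction: Poisson-distributed offspring, each attempting one
    # randomly chosen site (priority class 1)
    attempts <- list()
    parents <- which(occ > 0 & runif(A) < pR)
    for (j in parents) {
      n_off <- rpois(1, R)
      if (n_off > 0) {
        for (s in sample(A, n_off, replace = TRUE))
          attempts[[length(attempts) + 1]] <- c(s, occ[j], 1)
      }
    }
    # immigration from the regional pool (priority class 2)
    for (i in 1:N) {
      for (s in which(runif(A) < I))
        attempts[[length(attempts) + 1]] <- c(s, i, 2)
    }
    # establishment: empty sites only, offspring before immigrants; among
    # those passing the Gaussian establishment filter, one random winner
    if (length(attempts) > 0) {
      att <- do.call(rbind, attempts)
      for (priority in 1:2) {
        for (s in unique(att[att[, 3] == priority, 1])) {
          if (occ[s] == 0) {
            cand <- att[att[, 1] == s & att[, 3] == priority, 2]
            estab <- cand[runif(length(cand)) <
                            exp(-(E[s] - mu[cand])^2 / (2 * sig^2))]
            if (length(estab) > 0) occ[s] <- estab[sample(length(estab), 1)]
          }
        }
      }
    }
    state[t, ] <- occ
    richness[t] <- length(unique(occ[occ > 0]))
  }
  list(richness = richness,
       occupancy = rowMeans(state > 0),
       state = state,
       niches = data.frame(species = 1:N, mu = mu, sigma = sig))
}
```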

The function returns a list containing a vector of species richness at each timestep, the proportion of sites occupied at each timestep, a state array containing all occupancy information for each patch, species, and timestep, and lastly a dataframe containing information on the niches of each species in the regional pool.
Using this function we can begin to explore the dynamics of the model through time:
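For instance (parameter values here are assumptions):

```r
# run the model once and plot dynamics through time
if (exists("alloucheIBM")) {
  out <- alloucheIBM(A = 100, N = 10, I = 0.01, timesteps = 300)
  plot(out$richness, type = "l", xlab = "Timestep",
       ylab = "Species richness")
  plot(out$occupancy, type = "l", xlab = "Timestep",
       ylab = "Proportion of sites occupied")
}
```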

Repeating this process a few times reveals the expected species richness and its variance for a given set of parameters.
Finally, we can address the issue of habitat heterogeneity and its effect on species richness. There are many ways to approach this issue, and many parameter combinations to consider. Allouche et al. (2012) provides a thorough treatment of the subject; I’ll demonstrate just one result: that under certain conditions, species richness peaks at intermediate levels of habitat heterogeneity.
To construct a range of habitat heterogeneity values, let’s take an interval and then successively narrower intervals centered around the middle of the original interval.
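One way to do this (the interval widths are my own choices):

```r
# fixed center with successively narrower local environmental ranges
center <- 50
widths <- c(100, 80, 60, 40, 20, 10, 2)
intervals <- data.frame(width = widths,
                        Emin = center - widths / 2,
                        Emax = center + widths / 2)
intervals
```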

Here are the intervals:
Now, for each interval, we can iteratively run the model and track species richness. Because species richness tends to vary through time, let’s take the mean of the final 100 timesteps as a measure of species richness for each model run, and record the standard deviation to track variability.
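A sketch of that loop (the replicate count and interval widths are assumptions; with few replicates the result will be noisy):

```r
# mean and sd of species richness over the final 100 timesteps,
# for each level of habitat heterogeneity
if (exists("alloucheIBM")) {
  center <- 50
  widths <- c(100, 80, 60, 40, 20, 10, 2)
  nreps <- 5
  mean_rich <- sd_rich <- numeric(length(widths))
  for (k in seq_along(widths)) {
    finals <- replicate(nreps, {
      out <- alloucheIBM(Emin = center - widths[k] / 2,
                         Emax = center + widths[k] / 2,
                         timesteps = 300)
      mean(tail(out$richness, 100))
    })
    mean_rich[k] <- mean(finals)
    sd_rich[k]   <- sd(finals)
  }
  plot(widths, mean_rich, type = "b",
       xlab = "Range of environmental conditions (heterogeneity)",
       ylab = "Mean species richness")
}
```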

Of course, the shape of this relationship is sensitive to the parameters. As an example, changing niche width to increase or decrease niche overlap will mediate the strength of interspecific competition for space. Also, increasing reproductive rates may buffer each species from stochastic extinction so that the relationship between environmental heterogeneity and richness is monotonically increasing. Furthermore, here I centered all intervals around the same value, but the exact position of the environmental heterogeneity interval will affect the net establishment probability for each site, depending on how the interval relates to species niches. The parameter space is yours to explore.
These types of stochastic simulation models are fairly straightforward to implement in R. Indeed there’s a package dedicated to facilitating the implementation of such models: simecol. There’s even a book: A Practical Guide to Ecological Modelling: Using R as a Simulation Platform.
Allouche O, Kalyuzhny M, Moreno-Rueda G, Pizarro M, & Kadmon R. (2012) Area–heterogeneity trade-off and the diversity of ecological communities. Proceedings of the National Academy of Sciences of the United States of America, 109(43): 17495–17500.
]]>Before getting started, we can define two convenience functions:
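The original pair isn’t shown here; given the use of “antilogit” later in the post, a logit/antilogit pair is a plausible guess:

```r
# convenience functions for moving between probability and logit scales
logit <- function(p) log(p / (1 - p))
antilogit <- function(x) 1 / (1 + exp(-x))
```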

Then, initializing the number of sites, species, years, and repeat surveys (i.e. surveys within years, where the occupancy status of a site is assumed to be constant),
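For example (the values are assumptions):

```r
nsite <- 20   # sites
nspec <- 10   # species
nyear <- 5    # years
nrep  <- 3    # repeat surveys within years
```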

we can begin to consider occupancy. We’re interested in making inferences about the rates of colonization and population persistence for each species in a community, while estimating and accounting for imperfect detection.
Occupancy status at site $j$, by species $i$, in year $t$ is represented by $z(j,i,t)$. For occupied sites $z=1$; for unoccupied sites $z=0$. However, $Z$ is incompletely observed: it is possible that species $i$ is present at site $j$ in some year $t$ ($z(j,i,t)=1$) but was never seen there across all $k$ repeat surveys because of imperfect detection. These observations are represented by $x(j,i,t,k)$. Here we assume that there are no “false positive” observations. In other words, if $\sum_{1}^{k}x(j,i,t,k)>0$, then $z(j,i,t)=1$. If a site is occupied, the probability that $x(j,i,t,k)=1$ is represented as a Bernoulli trial with probability of detection $p(j,i,t,k)$, such that $x(j,i,t,k) \sim \text{Bernoulli}(z(j,i,t)\,p(j,i,t,k))$.
The occupancy status $z$ of species $i$ at site $j$ in year $t$ is modeled as a Markov Bernoulli trial. In other words, whether a species is present at a site in year $t$ is influenced by whether it was present in year $t-1$: $$ z(j,i,t) \sim \text{Bernoulli}(\psi_{j,i,t}) $$
where for $t>1$

$$ \psi_{j,i,t} = z(j,i,t-1)\,\text{logit}^{-1}(\phi_i) + (1 - z(j,i,t-1))\,\text{logit}^{-1}(\gamma_i) $$

and in year one $(t=1)$

$$ \psi_{j,i,1} = z(j,i,0)\,\text{logit}^{-1}(\phi_i) + (1 - z(j,i,0))\,\text{logit}^{-1}(\gamma_i) $$

where the occupancy status in year 0, $z(j,i,0)$, initializes the Markov process. The parameters $\gamma_i$ and $\phi_i$ control the probabilities of colonization and persistence: if a site was unoccupied by species $i$ in the previous year, the probability of colonization is given by the antilogit of $\gamma_i$; if a site was previously occupied, the probability of population persistence is given by the antilogit of $\phi_i$. We assume that the distributions of the species-specific parameters are defined by community-level hyperparameters, such that $\gamma_i \sim \text{Normal}(\mu_\gamma, \sigma_\gamma^2)$ and $\phi_i \sim \text{Normal}(\mu_\phi, \sigma_\phi^2)$. We can generate occupancy data as follows:
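A sketch of the data generation, consistent with the model described above (the hyperparameter values and the assumption that sites start unoccupied in year 0 are my own):

```r
set.seed(2)
if (!exists("nsite")) { nsite <- 20; nspec <- 10; nyear <- 5 }
if (!exists("antilogit")) antilogit <- function(x) 1 / (1 + exp(-x))

# community-level hyperparameters (values are assumptions)
mu_gamma <- -1; sd_gamma <- 0.5   # colonization, logit scale
mu_phi   <-  1; sd_phi   <- 0.5   # persistence, logit scale
gamma <- rnorm(nspec, mu_gamma, sd_gamma)
phi   <- rnorm(nspec, mu_phi, sd_phi)

z <- array(0, dim = c(nsite, nspec, nyear))
# year 1: with the year-0 state taken as unoccupied, initial occupancy
# is governed by the colonization probability
for (i in 1:nspec) {
  z[, i, 1] <- rbinom(nsite, 1, antilogit(gamma[i]))
}
# subsequent years: Markov Bernoulli trials
for (t in 2:nyear) {
  for (i in 1:nspec) {
    psi <- z[, i, t - 1] * antilogit(phi[i]) +
           (1 - z[, i, t - 1]) * antilogit(gamma[i])
    z[, i, t] <- rbinom(nsite, 1, psi)
  }
}
```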

For simplicity, we’ll assume that there are no differences in species detectability among sites, years, or repeat surveys, but that detectability varies among species. We’ll again use hyperparameters to specify a distribution of detection probabilities in our community, such that $\text{logit}(p_i) \sim \text{Normal}(\mu_p, \sigma_p^2)$.
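For instance (the hyperparameter values are assumptions):

```r
# species-specific detection probabilities from community hyperparameters
if (!exists("nspec")) nspec <- 10
mu_p <- 0; sd_p <- 1
p <- 1 / (1 + exp(-rnorm(nspec, mu_p, sd_p)))  # antilogit of logit-scale draws
```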

We can now generate our observations based on occupancy states and detection probabilities. Although this could be vectorized for speed, let’s stick with nested for loops in the interest of clarity.
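A sketch with nested loops (stand-in objects are included so the snippet runs on its own; use the simulated `z` and `p` from above):

```r
if (!exists("z")) {
  nsite <- 20; nspec <- 10; nyear <- 5
  z <- array(rbinom(nsite * nspec * nyear, 1, 0.5), c(nsite, nspec, nyear))
}
if (!exists("p")) p <- runif(dim(z)[2], 0.2, 0.8)
if (!exists("nrep")) nrep <- 3

dims <- dim(z)
x <- array(0, dim = c(dims, nrep))
for (j in 1:dims[1]) {
  for (i in 1:dims[2]) {
    for (t in 1:dims[3]) {
      for (k in 1:nrep) {
        # no false positives: detection is only possible at occupied sites
        x[j, i, t, k] <- rbinom(1, 1, z[j, i, t] * p[i])
      }
    }
  }
}
```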

Now that we’ve collected some data, we can specify our model:
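A sketch of what the JAGS model statement might look like, matching the data-generating process above (the file name, priors, and the unoccupied year-0 assumption are my own choices):

```r
cat("
model {
  # community-level hyperpriors
  mu_gamma ~ dnorm(0, 0.001)
  mu_phi ~ dnorm(0, 0.001)
  mu_p ~ dnorm(0, 0.001)
  tau_gamma ~ dgamma(0.01, 0.01)
  tau_phi ~ dgamma(0.01, 0.01)
  tau_p ~ dgamma(0.01, 0.01)

  # species-level parameters
  for (i in 1:nspec) {
    gamma[i] ~ dnorm(mu_gamma, tau_gamma)  # colonization, logit scale
    phi[i] ~ dnorm(mu_phi, tau_phi)        # persistence, logit scale
    lp[i] ~ dnorm(mu_p, tau_p)
    p[i] <- ilogit(lp[i])                  # detection probability
  }

  for (j in 1:nsite) {
    for (i in 1:nspec) {
      # year one: sites assumed unoccupied in year 0
      z[j, i, 1] ~ dbern(ilogit(gamma[i]))
      # subsequent years: Markov Bernoulli trials
      for (t in 2:nyear) {
        psi[j, i, t] <- z[j, i, t - 1] * ilogit(phi[i]) +
                        (1 - z[j, i, t - 1]) * ilogit(gamma[i])
        z[j, i, t] ~ dbern(psi[j, i, t])
      }
      # detection model: no false positives
      for (t in 1:nyear) {
        for (k in 1:nrep) {
          x[j, i, t, k] ~ dbern(z[j, i, t] * p[i])
        }
      }
    }
  }
}
", file = "dynocc.txt")
```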

Next, bundle up the data.
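For example (object names are assumptions; a zero stand-in keeps the snippet self-contained, but in practice use the simulated `x` from above):

```r
if (!exists("x")) {
  nsite <- 20; nspec <- 10; nyear <- 5; nrep <- 3
  x <- array(0, dim = c(nsite, nspec, nyear, nrep))
}
jags_data <- list(x = x, nsite = nsite, nspec = nspec,
                  nyear = nyear, nrep = nrep)
```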


Provide initial values.
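A sketch: `zinit` takes the maximum observation across repeat surveys, so it is 1 wherever a species was ever detected at a site in a year:

```r
if (!exists("x")) x <- array(0, dim = c(20, 10, 5, 3))
zinit <- apply(x, c(1, 2, 3), max)
inits <- function() {
  list(z = zinit,
       mu_gamma = rnorm(1), mu_phi = rnorm(1), mu_p = rnorm(1))
}
```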

As a side note, in JAGS it is helpful to provide initial values for the incompletely observed occupancy state $z$ that are consistent with observed presences, as provided in this example with zinit. In other words, if $x(j,i,t,k)=1$, provide an initial value of 1 for $z(j,i,t)$. Unlike in WinBUGS and OpenBUGS, if you do not do this in JAGS you’ll often (but not always) encounter an error message such as:
Error in node x[1,1,1,1]
Observed node inconsistent with unobserved parents at initialization

Now we’re ready to monitor and make inferences about some parameters of interest using JAGS.
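A sketch of the fitting step (assuming rjags; the file and object names follow the sketches above and are assumptions; chain and iteration counts are my own):

```r
params <- c("mu_gamma", "mu_phi", "mu_p", "gamma", "phi", "p")
if (requireNamespace("rjags", quietly = TRUE) &&
    exists("jags_data") && exists("inits") && file.exists("dynocc.txt")) {
  library(rjags)
  jm <- jags.model("dynocc.txt", data = jags_data, inits = inits,
                   n.chains = 3)
  update(jm, 2000)  # burn-in
  post <- coda.samples(jm, params, n.iter = 5000, thin = 5)
}
```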

At this point, you’ll want to run through the usual MCMC diagnostics to check for convergence and adjust the burnin or number of iterations accordingly. Once satisfied, we can check to see how well our model performed based on our known parameter values.
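One quick check is to plot posterior means against the known values used to simulate the data, e.g. for the colonization parameters (object names follow the sketches above and are assumptions):

```r
if (exists("post") && exists("gamma")) {
  post_mat <- as.matrix(post)
  # posterior means for the species-level colonization parameters
  est_gamma <- colMeans(post_mat[, grep("^gamma\\[", colnames(post_mat))])
  plot(gamma, est_gamma, xlab = "True gamma", ylab = "Posterior mean gamma")
  abline(0, 1, lty = 2)  # 1:1 line
}
```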



The model tracks the number of susceptible, infectious, and recovered individuals in two co-occurring host species. The rates of change for each class are represented as a system of differential equations:
Where $S_i$, $I_i$, and $R_i$ represent the density of susceptible, infectious, and recovered individuals respectively of species $i$. The total number of individuals of each species is $N_i$. Per capita birth and death rates are represented by $b_i$ and $d_i$, and the strength of density dependence in population growth is $\delta_i$. Transmission rates from species $j$ to species $i$ are given by $\beta_{ij}$. The pathogen imposes additional mortality on infected individuals at rate $\alpha_i$, and infected individuals recover at rate $\sigma_i$, so that the average infectious period is $1/\sigma_i$. Here, it is assumed that the pathogen does not castrate its hosts. Thus, susceptible, infectious, and recovered individuals reproduce at the same rate.
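A plausible form for this system, consistent with the parameter definitions above and with Dobson (2004), is:

$$ \begin{aligned} \frac{dS_i}{dt} &= b_i N_i - (d_i + \delta_i N_i) S_i - S_i \sum_j \beta_{ij} I_j \\ \frac{dI_i}{dt} &= S_i \sum_j \beta_{ij} I_j - (d_i + \delta_i N_i + \alpha_i + \sigma_i) I_i \\ \frac{dR_i}{dt} &= \sigma_i I_i - (d_i + \delta_i N_i) R_i \end{aligned} $$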
Epidemiological models often differentiate between two transmission dynamics. With density-dependent transmission, the number of host contacts and transmission events increases with the density of individuals (as in the above system of equations). In contrast, with frequency-dependent transmission, hosts have a constant contact rate, so that the transmission rate depends on the relative proportion of infectious individuals. For example, models of sexually transmitted infections often assume frequency-dependent transmission, implying that the number of sexual partners one has is independent of population density. To incorporate frequency-dependent transmission into the above model, it is necessary to divide the transmission term $S_i\sum_j{\beta_{ij} I_j}$ by $N_i$.
Based on this system of equations, a criterion for pathogen invasion, $R_0$, can be derived based on the dominant eigenvalue of the next generation matrix (Dobson 2004). If $R_0<1$, the pathogen does not invade; if $R_0>1$, the pathogen invades.
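A sketch of that computation in R (all parameter values are illustrative; the next generation matrix construction follows the description above, with each host fully susceptible at the disease-free state):

```r
# R0 as the dominant eigenvalue of the next generation matrix
S0 <- c(100, 80)                       # disease-free susceptible densities
beta <- matrix(c(0.01, 0.002,
                 0.001, 0.008), 2, 2, byrow = TRUE)  # beta[i, j]: j -> i
d <- c(0.1, 0.1); delta <- c(0.001, 0.001)
alpha <- c(0.05, 0.05); sigma <- c(0.2, 0.2)
N0 <- S0                               # all individuals susceptible
loss <- d + delta * N0 + alpha + sigma # per capita loss rate, infectious class

# G[i, j]: new infections in species i caused by one infectious individual
# of species j over its expected infectious lifetime
G <- sweep(beta, 1, S0, `*`)   # scale row i by S0[i]
G <- sweep(G, 2, loss, `/`)    # divide column j by loss rate of species j
R0 <- max(Re(eigen(G)$values))
```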
Shiny requires two files to run: a server file (server.R) containing all of the calculations, plotting functionality, etc., and a file defining a user interface (ui.R).
Here is the file defining what you want the server to do. Note the use of ifelse() to have either density- or frequency-dependent transmission.
Here is the file defining the user interface.
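A minimal skeleton of such an app (structure, input names, and parameter choices are my own; the actual server.R and ui.R live in the linked repository). The derivative function uses ifelse() to switch between density- and frequency-dependent transmission, and for simplicity shares demographic and disease parameters across the two hosts:

```r
two_host <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    N1 <- S1 + I1 + R1
    N2 <- S2 + I2 + R2
    # frequency-dependent transmission divides the transmission term by N
    inf1 <- ifelse(freq > 0, S1 * (b11 * I1 + b12 * I2) / N1,
                             S1 * (b11 * I1 + b12 * I2))
    inf2 <- ifelse(freq > 0, S2 * (b21 * I1 + b22 * I2) / N2,
                             S2 * (b21 * I1 + b22 * I2))
    dS1 <- b * N1 - (d + del * N1) * S1 - inf1
    dI1 <- inf1 - (d + del * N1 + alp + sig) * I1
    dR1 <- sig * I1 - (d + del * N1) * R1
    dS2 <- b * N2 - (d + del * N2) * S2 - inf2
    dI2 <- inf2 - (d + del * N2 + alp + sig) * I2
    dR2 <- sig * I2 - (d + del * N2) * R2
    list(c(dS1, dI1, dR1, dS2, dI2, dR2))
  })
}

if (requireNamespace("shiny", quietly = TRUE) &&
    requireNamespace("deSolve", quietly = TRUE)) {
  library(shiny)
  library(deSolve)
  # user interface: a couple of controls and a trajectory plot
  ui <- fluidPage(
    titlePanel("Two-host SIR model"),
    sidebarLayout(
      sidebarPanel(
        sliderInput("beta", "Transmission rate", min = 0, max = 0.1,
                    value = 0.01),
        checkboxInput("freq", "Frequency-dependent transmission", FALSE)
      ),
      mainPanel(plotOutput("traj"))
    )
  )
  # server: solve the ODE system and plot trajectories
  server <- function(input, output) {
    output$traj <- renderPlot({
      parms <- c(b = 0.5, d = 0.1, del = 0.001, alp = 0.05, sig = 0.2,
                 b11 = input$beta, b12 = 0.1 * input$beta,
                 b21 = 0.1 * input$beta, b22 = input$beta,
                 freq = as.numeric(input$freq))
      y0 <- c(S1 = 99, I1 = 1, R1 = 0, S2 = 100, I2 = 0, R2 = 0)
      out <- ode(y0, seq(0, 200, 0.5), two_host, parms)
      matplot(out[, 1], out[, -1], type = "l", lty = 1,
              xlab = "Time", ylab = "Density")
    })
  }
  # launch interactively with: shinyApp(ui, server)
}
```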
Here is a link to the resulting graphical user interface for the model, with example output below.
Feel free to clone the repository or alter the code to suit your own needs. Shiny seems like a tool with great potential to make some mathematical models more accessible (at some level). For instance, something like this could be used in an ecology class to demonstrate the different ways that pathogens can regulate host populations.
]]>