# Errors-in-variables Models in Stan

In a previous post, I gave a cursory overview of how prior information about covariate measurement error can reduce bias in linear regression. In the comments, Rasmus Bååth asked about estimation in the absence of strong priors. Here, I’ll describe a Bayesian approach for estimating and correcting for covariate measurement error using a latent-variable errors-in-variables model that does not rely on strong prior information. Recall that this matters because error in covariate measurements tends to bias slope estimates toward zero.

For what follows, we’ll assume a simple linear regression in which continuous covariates are measured with error. True covariate values are considered latent variables, with repeated measurements of covariate values arising from a normal distribution with mean equal to the true value and measurement error standard deviation $\sigma_x$. We can represent the latent variables in the model as circles, and observables as boxes:

with $\epsilon_x \sim Normal(0, \sigma_x)$ and $\epsilon_y \sim Normal(0, \sigma_y)$.

In other words, we assume that for sample unit $i$ and repeat measurement $j$:

$$x^{obs}_{ij} = x_i + \epsilon_{x, ij}, \qquad y_i = \alpha + \beta x_i + \epsilon_{y, i},$$

where $x_i$ is the true (latent) covariate value for unit $i$, $x^{obs}_{ij}$ is its $j$th measurement, and $\alpha$ and $\beta$ are the regression intercept and slope.

The trick here is to use repeated measurements of the covariates to estimate and correct for measurement error. In order for this to be valid, the true covariate values cannot vary across repeat measurements. If the covariate was individual weight, you would have to ensure that the true weight did not vary across repeat measurements (for me, frogs urinating during handling would violate this assumption).

Below, I’ll simulate some data of this type in R. I’m assuming that we randomly select some sampling units to remeasure covariate values, and each is remeasured `n.reps` times.
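Here is a minimal sketch of how data like this could be simulated; the sample sizes and parameter values below are my own choices for illustration, not necessarily those used in the original post.

```r
# Simulate a simple linear regression with covariate measurement error.
# All parameter values here are arbitrary choices for illustration.
set.seed(123)
N            <- 50    # number of sample units
n.reps       <- 3     # measurements per re-measured unit
n.remeasured <- 10    # how many units get repeat measurements
alpha   <- 1          # intercept
beta    <- 2          # slope
sigma_x <- 1          # sd of covariate measurement error
sigma_y <- 1          # residual sd of the response

x_true <- rnorm(N, 0, 1)                           # latent true covariate values
y      <- rnorm(N, alpha + beta * x_true, sigma_y)

# every unit is measured once; a random subset is measured n.reps times
remeasured <- sample(N, n.remeasured)
n_meas     <- ifelse(1:N %in% remeasured, n.reps, 1)
unit       <- rep(1:N, times = n_meas)             # unit index for each measurement
x_obs      <- rnorm(length(unit), x_true[unit], sigma_x)
```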

Here, the discrepancy due to measurement error is shown as a red segment, and the sample units that were measured three times are highlighted with green dashed lines.

I’ll use Stan to estimate the model parameters, because I’ll be refitting the model to new data sets repeatedly below, and Stan is faster than JAGS for these models.
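Below is a minimal sketch of an errors-in-variables model in Stan, written as a string for use with rstan; the variable names and the weakly informative priors are my own choices and may differ from the original post.

```r
library(rstan)

eiv_model <- "
data {
  int<lower=1> N;                   // number of sample units
  int<lower=1> M;                   // total number of covariate measurements
  int<lower=1, upper=N> unit[M];    // sample unit for each measurement
  vector[M] x_obs;                  // error-prone covariate measurements
  vector[N] y;                      // response
}
parameters {
  vector[N] x;                      // latent true covariate values
  real alpha;
  real beta;
  real<lower=0> sigma_x;
  real<lower=0> sigma_y;
}
model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  sigma_x ~ cauchy(0, 5);
  sigma_y ~ cauchy(0, 5);
  x ~ normal(0, 10);                      // weak prior on the latent covariates
  x_obs ~ normal(x[unit], sigma_x);       // measurement model
  y ~ normal(alpha + beta * x, sigma_y);  // regression on the true values
}
"
```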

With the model specified, we can estimate the parameters.
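A sketch of the fitting step with rstan, assuming the simulated objects and model string defined above:

```r
# Compile once so the model can be reused for the refits below,
# then bundle the data and sample from the posterior (settings are arbitrary).
eiv_compiled <- stan_model(model_code = eiv_model)
stan_d <- list(N = N, M = length(x_obs), unit = unit, x_obs = x_obs, y = y)
fit <- sampling(eiv_compiled, data = stan_d, chains = 4, iter = 2000)
```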

How did we do? Let’s compare the true vs. estimated covariate values for each sample unit.
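Here is one way to make that comparison (a sketch, using posterior medians of the latent covariate values):

```r
post  <- rstan::extract(fit)
x_hat <- apply(post$x, 2, median)   # posterior median of each latent covariate
plot(x_true, x_hat,
     xlab = "True covariate value", ylab = "Estimated covariate value")
abline(0, 1, lty = 2)               # one-to-one line
```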

Here, purple marks the posterior for the covariate values, and the dashed black line shows the one-to-one line we would expect if the estimates exactly matched the true values. In addition to estimating the true covariate values, we may wish to check how well we estimated the standard deviation of the measurement error in our covariate.
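For instance, a quick check of the posterior for $\sigma_x$ against the value used in the simulation (a sketch, assuming the objects above):

```r
print(fit, pars = c("alpha", "beta", "sigma_x", "sigma_y"))
hist(post$sigma_x, main = "", xlab = expression(sigma[x]))
abline(v = sigma_x, col = "red", lwd = 2)  # true value used to simulate the data
```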

### How many sample units need repeat measurements?

You may want to know how many sample units need to be repeatedly measured to adequately estimate the degree of covariate measurement error. For instance, if $\sigma_x = 1$, how does the precision in our estimate of $\sigma_x$ improve as more sample units are repeatedly measured? Let’s see what happens when we repeatedly measure covariate values for $1, 2, \ldots, N$ randomly selected sample units.
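A sketch of that simulation loop, refitting the model as more units are re-measured; the iteration settings and interval summary are my own choices:

```r
n_seq      <- 1:N                          # number of re-measured units per scenario
sigma_x_ci <- matrix(NA, length(n_seq), 2) # 95% interval bounds for sigma_x

for (k in seq_along(n_seq)) {
  re_k    <- sample(N, n_seq[k])
  n_meas  <- ifelse(1:N %in% re_k, n.reps, 1)
  unit_k  <- rep(1:N, times = n_meas)
  x_obs_k <- rnorm(length(unit_k), x_true[unit_k], sigma_x)
  d_k     <- list(N = N, M = length(x_obs_k), unit = unit_k,
                  x_obs = x_obs_k, y = y)
  fit_k   <- sampling(eiv_compiled, data = d_k, chains = 4, iter = 2000)
  sigma_x_ci[k, ] <- quantile(rstan::extract(fit_k)$sigma_x, c(0.025, 0.975))
}

# plot the width of the 95% interval for sigma_x against the
# number of re-measured sample units
plot(n_seq, sigma_x_ci[, 2] - sigma_x_ci[, 1], type = "l",
     xlab = "Number of re-measured sample units",
     ylab = "Width of 95% interval for sigma_x")
```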

Looking at this plot, you could eyeball the number of sample units that should be remeasured when designing a study. Realistically, you would want to explore how this number depends on the true amount of measurement error, and also simulate multiple realizations (rather than just one) for each scenario. Using a similar approach, you might also evaluate whether it’s more efficient to remeasure more sample units, or invest in more repeated measurements per sample unit.

# R and My Divorce From Word

Being in grad school, I do a lot of scholarly writing that requires associated or embedded R analyses, figures, and tables, plus bibliographies.

Microsoft Word makes this unnecessarily difficult.

Many tools are now available to break free from the tyranny of Word. The ones I like involve writing an article in markdown format, integrating all data preparation, analysis, and outputs with the document (e.g. with the excellent and accessible knitr package, or with a custom make setup like this one). Add in version control with Git, and you’ve got a nice stew going.

If you’re involved in the open source/reproducible research blogo-twittersphere, this is probably old hat. To many in my department, this looks like black magic. It’s not.

I can’t give an authoritative overview, but here are some resources that helped me get through my divorce:

# How Heavy Is the Siberut Macaque? A Bayesian Phylogenetic Approach

Among-species comparisons can include phylogenetic information to account for non-independence arising from shared evolutionary history. Often, phylogenetic topologies and branch lengths are not known exactly, but are estimated with uncertainty. This uncertainty can be accounted for using methods recently described in a neat paper called Bayesian models for comparative analysis integrating phylogenetic uncertainty by de Villemereuil et al. Here, I’ll demonstrate the method by estimating the body mass of the Siberut macaque (Macaca siberu).

(I don’t study macaques, but they’re cute enough to warrant a toy example)

### Building the phylogeny

First, I downloaded a nexus file with DNA sequences from A molecular phylogeny of living primates by Polina Perelman et al., available here on TreeBASE. I culled the nexus file to include only the 14 species in the genus Macaca, and saved it as macaques.nex.

I used MrBayes to estimate the macaque phylogeny with the following file (analysis.nex):
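I don’t reproduce the original file here, but a bare-bones analysis.nex along these lines could be written from R; the substitution model and MCMC settings below are placeholder assumptions, not those used in the post.

```r
# Write a minimal MrBayes command file (settings are placeholders).
analysis_block <- "#NEXUS
begin mrbayes;
  execute macaques.nex;
  lset nst=6 rates=invgamma;
  mcmc ngen=1000000 samplefreq=1000;
  sumt relburnin=yes burninfrac=0.25;
end;
"
writeLines(analysis_block, "analysis.nex")
```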

Calling MrBayes from within R:
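A sketch of shelling out to MrBayes and reading the consensus tree back in; the executable name and output file name are assumptions about a default MrBayes run.

```r
library(ape)

system("mb analysis.nex")   # assumes the MrBayes binary 'mb' is on the PATH

# sumt writes a consensus tree named after the data file by default;
# depending on MrBayes/ape versions, the annotated consensus tree may
# need cleaning before ape can read it
cons_tree <- read.nexus("macaques.nex.con.tre")
plot(unroot(cons_tree), type = "unrooted", cex = 0.8)
```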

Here is the unrooted consensus tree:

### Body mass data

Macaque weight data are available as part of a (much) larger dataset on the body mass of late Quaternary mammals, by Smith and colleagues. I extracted the log body mass data for the 13 available macaque species. No body mass data were available for the Siberut macaque (M. siberu).

### Body mass model

We will account for phylogenetic non-independence by considering average species weights to be multivariate normally distributed around a within-genus mean, with a covariance matrix $\Sigma$ that reflects phylogenetic distance (see de Villemereuil et al.). The off-diagonal elements of this matrix are scaled by Pagel’s $\lambda$, which reflects the degree of phylogenetic signal in the data.
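In my notation (with $V$ the phylogenetic variance-covariance matrix implied by a tree’s shared branch lengths, and $\sigma^2$ the among-species variance), one common way to write this scaling is

$$\Sigma_{ij} = \begin{cases} \sigma^2 V_{ii} & \text{if } i = j \\ \lambda \, \sigma^2 V_{ij} & \text{if } i \neq j, \end{cases}$$

so that $\lambda = 0$ gives independent species means and $\lambda = 1$ gives the full Brownian-motion covariance.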

With the help of the ape and MASS packages, covariance and precision matrices can be constructed for the trees comprising the posterior phylogeny.
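A sketch of that step, assuming the posterior trees were written by MrBayes to macaques.nex.run1.t; the file name, burn-in fraction, and object names are my assumptions.

```r
library(ape)
library(MASS)

post_trees <- read.nexus("macaques.nex.run1.t")   # posterior sample of trees
post_trees <- post_trees[-(1:floor(0.25 * length(post_trees)))]  # drop burn-in

K    <- length(post_trees)
n_sp <- length(post_trees[[1]]$tip.label)

V <- array(NA, dim = c(n_sp, n_sp, K))   # per-tree covariance matrices
P <- array(NA, dim = c(n_sp, n_sp, K))   # per-tree precision matrices
for (k in 1:K) {
  # trees from MrBayes may need rooting first, e.g. with ape::root();
  # rows/columns follow each tree's tip order, so match them to the
  # species order of the body mass data before use
  V[, , k] <- vcv.phylo(post_trees[[k]])
  P[, , k] <- ginv(V[, , k])             # (generalized) inverse via MASS
}
```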

Then we can fit the model with OpenBUGS, estimating the missing body mass of M. siberu, accounting for uncertainty in the phylogeny’s topology and branch lengths.
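Below is a simplified sketch of that model for a single tree; the full analysis in de Villemereuil et al. indexes into the array of per-tree matrices so that uncertainty in topology and branch lengths propagates into the estimates. Priors, object names, and MCMC settings here are my own assumptions, and the model relies on OpenBUGS imputing the missing element of a partially observed multivariate normal node.

```r
library(R2OpenBUGS)

phylo_model <- "
model {
  for (i in 1:n_sp) {
    for (j in 1:n_sp) {
      # Pagel's lambda scales only the off-diagonal (shared-history) elements
      Sigma[i, j] <- sigma2 * V[i, j] * (lambda + (1 - lambda) * equals(i, j))
    }
    mu[i] <- beta0
  }
  Omega[1:n_sp, 1:n_sp] <- inverse(Sigma[1:n_sp, 1:n_sp])
  y[1:n_sp] ~ dmnorm(mu[], Omega[, ])   # the NA for M. siberu is imputed

  beta0  ~ dnorm(0, 0.001)
  lambda ~ dunif(0, 1)
  tau    ~ dgamma(0.001, 0.001)
  sigma2 <- 1 / tau
}
"
writeLines(phylo_model, "phylo_model.txt")

# log_mass: vector of log body masses with NA for M. siberu, in the same
# species order as the rows of V_cons (a single tree's covariance matrix)
bugs_d <- list(n_sp = n_sp, V = V_cons, y = log_mass)
fit_bugs <- bugs(data = bugs_d, inits = NULL, model.file = "phylo_model.txt",
                 parameters.to.save = c("beta0", "lambda", "sigma2", "y"),
                 n.chains = 3, n.iter = 20000)
```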

Here is our phylogenetically informed estimate of the body size of M. siberu. Pagel’s $\lambda$ indicates a weak but non-zero phylogenetic signal, with mean = 0.417 and 95% BCI = (0.0156, 0.949). It should go without saying that this is a toy example, and it may be better to go out and weigh some actual Siberut macaques (at a minimum, this would be a good excuse for a vacation).