# Better living through zero-one inflated beta regression

Dealing with proportion data on the interval $[0, 1]$ is tricky. I realized this while trying to explain variation in vegetation cover. Unfortunately, cover is a true proportion and can’t be recast as a binary response. Further, true 0’s and 1’s rule out ordinary beta regression, because the beta distribution is only defined on the open interval $(0, 1)$. You could arcsine square root transform the data (but shouldn’t; Warton and Hui 2011). Enter zero-and-one inflated beta regression.

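The zero-and-one-inflated beta distribution facilitates modeling fractional or proportional data that contain both 0’s and 1’s (Ospina and Ferrari 2010, highly recommended). The general idea is to model the response variable (call it $y$) as a mixture of Bernoulli and beta distributions: the Bernoulli component generates the true 0’s and 1’s, and the beta component generates the values strictly between 0 and 1. The probability density function is

$$
\mathrm{bi}_{01}(y; \alpha, \gamma, \mu, \phi) =
\begin{cases}
\alpha (1 - \gamma) & \text{if } y = 0, \\
\alpha \gamma & \text{if } y = 1, \\
(1 - \alpha) f(y; \mu, \phi) & \text{if } 0 < y < 1,
\end{cases}
$$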

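where $0 < \alpha, \gamma, \mu < 1$, and $\phi>0$. $f(y; \mu, \phi)$ is the probability density function for the beta distribution, parameterized in terms of its mean $\mu$ and precision $\phi$:

$$
f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1 - \mu)\phi)} \, y^{\mu\phi - 1} (1 - y)^{(1 - \mu)\phi - 1}, \quad 0 < y < 1.
$$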

$\alpha$ is a mixture parameter: it is the probability that $y$ comes from the Bernoulli component, so it determines the extent to which the Bernoulli or beta component dominates the pdf. $\gamma$ is the probability that $y=1$ given that it comes from the Bernoulli component. $\mu$ and $\phi$ are the expected value and the precision of the beta component. The beta distribution is more commonly parameterized in terms of two shape parameters, $p$ and $q$; the two parameterizations are related by $\mu = \frac{p}{p + q}$ and $\phi=p+q$ (Ferrari and Cribari-Neto 2004), or equivalently $p = \mu\phi$ and $q = (1 - \mu)\phi$.
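The definitions above translate fairly directly into code. Here is a minimal Python sketch of the log density, using only the standard library. The function names (`beta_shapes`, `beta_logpdf`, `zoib_logpdf`) are mine, chosen for illustration; they are not from any package.

```python
import math


def beta_shapes(mu, phi):
    """Convert mean/precision (mu, phi) to the usual shape parameters (p, q)."""
    return mu * phi, (1.0 - mu) * phi


def beta_logpdf(y, mu, phi):
    """Log density of the beta distribution in mean/precision form, for 0 < y < 1."""
    p, q = beta_shapes(mu, phi)
    return (math.lgamma(phi) - math.lgamma(p) - math.lgamma(q)
            + (p - 1.0) * math.log(y) + (q - 1.0) * math.log(1.0 - y))


def zoib_logpdf(y, alpha, gamma, mu, phi):
    """Log of the zero-and-one-inflated beta density at y in [0, 1]."""
    if y == 0:
        # Bernoulli component, outcome 0
        return math.log(alpha) + math.log(1.0 - gamma)
    if y == 1:
        # Bernoulli component, outcome 1
        return math.log(alpha) + math.log(gamma)
    # Beta component, strictly between 0 and 1
    return math.log(1.0 - alpha) + beta_logpdf(y, mu, phi)
```

For example, `beta_shapes(0.25, 8.0)` returns `(2.0, 6.0)`, and the point masses at 0 and 1 always sum to $\alpha$.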

Although ecologists often deal with proportion data, I haven’t found any examples of 0 & 1 inflated beta regression in the ecological literature. The closest thing I’ve found is Nishii and Tanaka (2012), who take a different approach in which values between 0 and 1 are modeled as logit-normal.

Here’s a quick demo in JAGS with simulated data. For simplicity, I’ll assume 1) there is one covariate that increases the expected value at the same rate for both the Bernoulli and beta components, such that $\mu = \gamma$, and 2) the Bernoulli component dominates at extreme values of the covariate, where the expected value is near 0 or 1.
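To make the two assumptions concrete, here is a rough Python sketch of one way such data could be simulated. Everything specific in it is an assumption made for illustration: the logit links, the quadratic effect of the covariate on $\alpha$, the coefficient values, and the function name `simulate_zoib`.

```python
import math
import random


def inv_logit(z):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_zoib(n=200, b0=0.0, b1=1.0, a0=-3.0, a1=1.0, phi=20.0, seed=1):
    """Simulate (x, y, is_discrete) triples under the two assumptions above.

    All coefficient values are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-3.0, 3.0)
        mu = inv_logit(b0 + b1 * x)   # expected value of the beta component
        gamma = mu                    # assumption 1: mu = gamma
        # Assumption 2: the Bernoulli share rises at extreme x (assumed quadratic form).
        alpha = inv_logit(a0 + a1 * x * x)
        if rng.random() < alpha:
            # Discrete component: a true 0 or 1
            y = 1.0 if rng.random() < gamma else 0.0
            rows.append((x, y, True))
        else:
            # Continuous component: beta with shapes p = mu * phi, q = (1 - mu) * phi
            y = rng.betavariate(mu * phi, (1.0 - mu) * phi)
            rows.append((x, y, False))
    return rows
```

Calling `simulate_zoib()` returns a list of 200 tuples; the third element records which component generated each observation, which is handy for color coding a plot later.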

Now we can specify our model in JAGS, following the factorization of the likelihood given by Ospina and Ferrari (2010), estimate our parameters, and see how well the model performs.
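The factorization is what makes the model easy to write down: the likelihood splits into one term for whether each observation is discrete (involving only $\alpha$), one for whether a discrete observation is a 1 (only $\gamma$), and one beta term for the continuous observations (only $\mu$ and $\phi$). Here is a small Python sketch of that split. It is an illustration rather than the JAGS model, and it assumes constant parameters with no covariate to keep things short; `factorized_loglik` is a hypothetical helper name.

```python
import math


def beta_logpdf(y, mu, phi):
    """Log density of the beta distribution in mean/precision form, for 0 < y < 1."""
    p, q = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(p) - math.lgamma(q)
            + (p - 1.0) * math.log(y) + (q - 1.0) * math.log(1.0 - y))


def factorized_loglik(ys, alpha, gamma, mu, phi):
    """Return the three additive pieces of the log-likelihood.

    ll_alpha: Bernoulli term for "is y exactly 0 or 1?"
    ll_gamma: Bernoulli term for "is y equal to 1?", discrete observations only
    ll_beta:  beta term, continuous observations only
    """
    ll_alpha = ll_gamma = ll_beta = 0.0
    for y in ys:
        discrete = y in (0.0, 1.0)
        ll_alpha += math.log(alpha) if discrete else math.log(1.0 - alpha)
        if discrete:
            ll_gamma += math.log(gamma) if y == 1.0 else math.log(1.0 - gamma)
        else:
            ll_beta += beta_logpdf(y, mu, phi)
    return ll_alpha, ll_gamma, ll_beta


# Usage: the three pieces sum to the full log-likelihood.
pieces = factorized_loglik([0.0, 1.0, 1.0, 0.3, 0.8], alpha=0.4, gamma=0.6, mu=0.5, phi=4.0)
total = sum(pieces)
```

Because each piece involves a different set of parameters, a model can treat the discrete indicator, the 0/1 outcome, and the continuous values as three separately distributed nodes.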

Here the posterior probability that $y$ comes from the discrete Bernoulli component is shown in grey, and the posterior expected value for both the Bernoulli and beta components across values of the covariate is shown in blue. The dashed green line shows the true expected value that was used to generate the data. Finally, the observed data are shown as jittered points, color coded by whether they were generated by the continuous beta component or the discrete Bernoulli component.