A high-level overview, an example, and a call to action.
This post gives some background and a demo for the paper “Neural hierarchical models of ecological populations” out today in Ecology Letters (Joseph 2020).
Deep learning and model-based ecological inference may seem like totally separate pursuits. Yet, if you think about deep learning as a set of tools to approximate functions, it’s not much of a leap to begin seeing opportunities to unite deep learning with some standard ecological modeling approaches.
Hierarchical models have been around for a while, and are now one of the workhorse methods of modern quantitative ecology (e.g., occupancy models, capture-recapture models, N-mixture models, animal movement models, state-space models, etc.). Hierarchical models combine:

- a data model \([y \mid z, \theta]\) for the observed data \(y\), conditional on a latent ecological process \(z\) and parameters \(\theta\),
- a process model \([z \mid \theta]\) for the latent (partly or completely unobserved) ecological process, and
- a parameter model \([\theta]\) encoding prior knowledge about the parameters.

The posterior distribution of the unknowns, conditioned on the data, is:
\[[z, \theta \mid y] = \dfrac{[y \mid z, \theta] \times [z \mid \theta] \times [\theta]}{[y]}.\]
We might also have some explanatory variables \(x\) that tell us something about \(z\), \(\theta\), and/or \(y\).
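To make the factorization concrete, here is a minimal sketch (a toy example of my own, not from the paper) that evaluates the unnormalized log posterior \(\log [y \mid z, \theta] + \log [z \mid \theta] + \log [\theta]\) for a simple count model: the latent abundance \(z\) is Poisson with rate \(\theta\), the observed count \(y\) is a binomial thinning of \(z\) with a known detection probability of 0.5, and the Gamma prior on \(\theta\) is an arbitrary choice for illustration.

import torch
from torch.distributions import Binomial, Gamma, Poisson

# Toy hierarchical model (illustrative only):
#   parameter model: theta ~ Gamma(2, 1)
#   process model:   z | theta ~ Poisson(theta)
#   data model:      y | z ~ Binomial(z, 0.5)
theta = torch.tensor(4.0)
z = torch.tensor(6.0)
y = torch.tensor(3.0)

log_prior = Gamma(2.0, 1.0).log_prob(theta)                  # [theta]
log_process = Poisson(theta).log_prob(z)                     # [z | theta]
log_data = Binomial(z, probs=torch.tensor(0.5)).log_prob(y)  # [y | z]

unnormalized_log_posterior = log_data + log_process + log_prior

Bayesian inference amounts to characterizing how this quantity varies over the unknowns \(z\) and \(\theta\), whether by MCMC, variational methods, or (as in the demo below) by maximizing a marginal likelihood.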
Neural networks approximate functions. Off-the-shelf neural networks usually just map \(x\) to \(y\), allowing us to predict new values of \(y\) for new values of \(x\). Sometimes, though, predicting \(y\) is not really what we care about: we want to learn something about a process \(z\) or some parameters \(\theta\).
We can parameterize a hierarchical model with a neural network to learn about \(z\). So, for example, if \(\theta\) represents the parameters of a neural network, then we can construct a process model \([z \mid \theta]\) where our input \(x\) is mapped to a process \(z\) by way of some neural network:
\[[z \mid \theta] = f(x, \theta),\]
where \(f\) is a neural network that maps \(x\) and \(\theta\) to some probabilistic model for \(z\) (because \(x\) is observed, I'm not conditioning \(z\) on it on the left-hand side of the equation; \(x\) is assumed to be constant and known without error).
Graphically, you might consider a state-space model where some inputs \(x\) are mapped to a state transition matrix (for an example with an animal movement model, see Appendix S2 in the paper).
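As a rough sketch of that idea (my own illustration, not the parameterization used in the paper), a small network can map a covariate to the logits of each row of a transition matrix, with a softmax guaranteeing that every row is a valid probability distribution:

import torch
from torch import nn

class TransitionNet(nn.Module):
    """Illustrative sketch: map covariates x to a row-stochastic
    state transition matrix for a discrete state-space model."""

    def __init__(self, n_inputs, n_states, hidden_size=16):
        super().__init__()
        self.n_states = n_states
        self.net = nn.Sequential(
            nn.Linear(n_inputs, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_states * n_states),
        )

    def forward(self, x):
        logits = self.net(x).view(-1, self.n_states, self.n_states)
        return torch.softmax(logits, dim=-1)  # each row sums to one

# e.g., one covariate observed at 5 time steps, three latent states
psi = TransitionNet(n_inputs=1, n_states=3)(torch.randn(5, 1))  # shape (5, 3, 3)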
An N-mixture model can be used to estimate latent integer-valued abundance when unmarked populations are repeatedly surveyed and detection of individuals is assumed to be imperfect (Royle 2004). Assume that \(J\) spatial locations are each surveyed \(K\) times within a time interval short enough that the number of individuals at each location \(j = 1, ..., J\) can be treated as constant. Each spatial location has a continuous covariate value \(x_j\) that relates to detection probability and expected abundance.
Observations at site \(j\) in survey \(k\) yield counts of the number of unique individuals detected, denoted \(y_{j, k}\) for all \(j\) and all \(k\). Assuming that the detection of each individual is conditionally independent, and that each individual is detected with site-specific probability \(p_j\), the observations can be modeled with a Binomial distribution where the number of trials is the true (latent) population abundance \(n_j\):
\[y_{j, k} \sim \text{Binomial}(p_j, n_j).\]
The true population abundance \(n_j\) is treated as a Poisson random variable with expected value \(\lambda_j\):
\[n_j \sim \text{Poisson}(\lambda_j).\]
Heterogeneity among sites is accounted for using a single-layer neural network that ingests the one-dimensional covariate \(x_j\) for site \(j\), passes it through a single hidden layer, and outputs a two-dimensional vector containing a detection probability \(p_j\) and the expected abundance \(\lambda_j\):

\[ \begin{bmatrix} \lambda_j \\ p_j \end{bmatrix} = f(x_j), \]

where \(f\) is a neural network with two-dimensional output activations \(\vec{h}(x_j)\) computed via:

\[\vec{h}(x_j) = \vec{W}^{(2)} g(\vec{W}^{(1)} x_j ),\]

and final outputs computed using the log and logit link functions for expected abundance and detection probability:

\[f(x_j) = \begin{bmatrix} \text{exp}(h_1(x_j)) \\ \text{logit}^{-1}(h_2(x_j)) \end{bmatrix}.\]

Here \(\vec{W}^{(1)}\) is a parameter matrix that generates hidden-layer activations from the input, \(g\) is an activation function, and \(\vec{W}^{(2)}\) is a parameter matrix that maps the hidden layer to the outputs. Additionally, \(h_1(x_j)\) is the first element of the output activation vector and \(h_2(x_j)\) the second.
The negative log likelihood was used as the loss function, enumerating over a large range of potential values of the true abundance (from \(\min(y_{j \cdot})\) to \(5 \times \max(y_{j \cdot})\), where \(y_{j \cdot}\) is the length-\(K\) vector of counts at site \(j\)) to approximate the underlying infinite mixture implied by the Poisson model of abundance (Royle 2004).
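Written out, the contribution of site \(j\) to the likelihood is the marginal probability of its \(K\) counts, obtained by summing over the enumerated candidate abundances (a standard way of expressing the N-mixture likelihood; Royle 2004):

\[
[y_{j, 1}, \ldots, y_{j, K} \mid \lambda_j, p_j] \approx \sum_{n_j} \text{Poisson}(n_j \mid \lambda_j) \prod_{k=1}^{K} \text{Binomial}(y_{j, k} \mid n_j, p_j),
\]

where the sum runs over the truncated range of candidate abundances. The loss is the negative log of this quantity, which the code below computes with a log-sum-exp for numerical stability.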
First, load some Python dependencies.
import matplotlib.pyplot as plt
import multiprocessing
import numpy as np
import torch
from torch import nn
from torch.distributions import Binomial, Poisson
from torch.utils.data import DataLoader, TensorDataset
Simulate data at nsite sites, with nrep repeat surveys.
Here it’s assumed that there is one continuous site-level covariate \(x\) that has some nonlinear relationship with the expected number of individuals at a site.
np.random.seed(123456)
nsite = 200
nrep = 5
x = np.linspace(-2.5, 2.5, nsite, dtype=np.float32).reshape(-1, 1)
# Draw f(x) from a Gaussian process
def kernel(x, theta):
    m, n = np.meshgrid(x, x)
    sqdist = abs(m - n)**2
    return np.exp(-theta * sqdist)

K = kernel(x, theta=.2)
L = np.linalg.cholesky(K + 1e-5 * np.eye(nsite))
f_prior = np.dot(L, np.random.normal(size=(nsite, 1)))
Generate some abundance values from a Poisson distribution:
offset = 3
lam = np.exp(f_prior + offset)
n = np.random.poisson(lam)

plt.scatter(x, n, c='k', alpha=.3)
plt.plot(x, lam)
plt.xlabel('Covariate value')
plt.ylabel('True (latent) abundance')
For simplicity, assume that the probability of detection is constant across all sites and independent of \(x\).
pr_detection = np.array([0.5])
y = np.random.binomial(n=n,
                       p=pr_detection,
                       size=(nsite, nrep))

plt.plot(x, lam)
for i in range(nrep):
    plt.scatter(x, y[:, i], c='b', s=4, alpha=.3)
plt.xlabel('Covariate value')
plt.ylabel('Observed counts')
We will define a torch.nn.Module class for our neural network. This ingests \(x\) and outputs a value for \(\lambda\) and \(p\):
class Net(nn.Module):
    """ Neural N-mixture model

    This is a neural network that ingests x and outputs:
    - lam(bda): expected abundance
    - p: detection probability
    """
    def __init__(self, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(1, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 2)

    def forward(self, x):
        hidden_layer = torch.sigmoid(self.fc1(x))
        output = self.fc2(hidden_layer)
        lam = torch.exp(output[:, [0]])
        p = torch.sigmoid(output[:, [1]])
        return lam, p
We will use the negative log likelihood as our loss function:
def nmix_loss(y_obs, lambda_hat, p_hat, n_max):
    """ N-mixture loss.

    Args:
        y_obs (tensor): nsite by nrep count observation matrix
        lambda_hat (tensor): poisson abundance expected value
        p_hat (tensor): individual detection probability
        n_max (int): maximum abundance to consider

    Returns:
        negative log-likelihood (tensor)
    """
    batch_size, n_rep = y_obs.shape

    possible_n_vals = torch.arange(n_max).unsqueeze(0)
    n_logprob = Poisson(lambda_hat).log_prob(possible_n_vals)
    assert n_logprob.shape == (batch_size, n_max)

    y_dist = Binomial(
        possible_n_vals.view(1, 1, -1),
        probs=p_hat.view(-1, 1, 1),
        validate_args=False
    )
    y_obs = y_obs.unsqueeze(-1).repeat(1, 1, n_max)
    y_logprob = y_dist.log_prob(y_obs).sum(dim=1)  # sum over repeat surveys
    assert y_logprob.shape == (batch_size, n_max)

    log_lik = torch.logsumexp(n_logprob + y_logprob, -1)
    return -log_lik
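As a quick sanity check (my own addition, not part of the original demo), the loss should be smaller when evaluated at parameter values that plausibly generated a set of counts than at implausible values:

# Hypothetical example: counts from one site with four repeat surveys
y_toy = torch.tensor([[3., 2., 4., 3.]])
good = nmix_loss(y_toy, torch.tensor([[6.0]]), torch.tensor([[0.5]]), n_max=50)
bad = nmix_loss(y_toy, torch.tensor([[0.5]]), torch.tensor([[0.9]]), n_max=50)
assert good.item() < bad.item()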
Instantiate a model.
net = Net(hidden_size=32)
net
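Before training, it can be useful to confirm that the untrained network returns outputs with sensible shapes and ranges (this check is my own addition to the demo):

with torch.no_grad():
    lam0, p0 = net(torch.from_numpy(x))
print(lam0.shape, p0.shape)              # both (nsite, 1)
print(float(p0.min()), float(p0.max()))  # detection probabilities lie in (0, 1)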
Create a data loader.
dataset = TensorDataset(torch.tensor(x).float(), torch.tensor(y))
dataloader = DataLoader(dataset,
                        batch_size=16,
                        shuffle=True,
                        num_workers=multiprocessing.cpu_count())
Instantiate an optimizer and choose the number of training epochs:
n_epoch = 1000
optimizer = torch.optim.Adam(net.parameters(), lr=0.01, weight_decay=1e-6)
running_loss = []
Finally, train the model, visualizing the estimated relationship between \(x\) and \(N\) after each training epoch.
_ = plt.scatter(x, n, c='k')
_ = plt.xlabel('Covariate value')
_ = plt.ylabel('Abundance')
colors = plt.cm.viridis(np.linspace(0, 1, n_epoch))

for i in range(n_epoch):
    for i_batch, xy in enumerate(dataloader):
        x_i, y_i = xy

        optimizer.zero_grad()
        lambda_i, p_i = net(x_i)
        nll = nmix_loss(y_i, lambda_i, p_i, n_max=200)
        loss = torch.mean(nll)

        loss.backward()
        optimizer.step()  # Does the update
        running_loss.append(loss.data.detach().numpy())

    lam_hat, p_hat = net(torch.from_numpy(x))
    lam_hat = lam_hat.detach().numpy()
    _ = plt.plot(x, lam_hat, color=colors[i], alpha=.1)
plt.show()
Visualize loss:
n_step = len(running_loss)
_ = plt.scatter(x=np.arange(n_step), y=running_loss, s=2,
                color=plt.cm.viridis(np.linspace(0, 1, n_step)))
_ = plt.xlabel('Number of training iterations')
_ = plt.ylabel('N-mixture loss')
plt.show()
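As a final check (my own addition, not part of the original demo), the fitted detection probability can be compared against the value of 0.5 used in the simulation:

with torch.no_grad():
    _, p_fit = net(torch.from_numpy(x))
print(p_fit.numpy().mean())  # should be roughly 0.5 if training has converged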
If you are interested in digging into the details of how to build these models, check out the companion repository on GitHub, which has all of the code required to reproduce the paper, as well as links to Jupyter Notebooks (thanks Binder!) to play with some toy occupancy, capture-recapture, and N-mixture models: https://github.com/mbjoseph/neuralecology
Deep learning is somewhat of a mystery to many ecologists. Those who are currently applying it have done some amazing things (like counting plants and animals in imagery). But, I worry that as a community, ecologists are thinking about deep learning too narrowly.
We can look to hydrology and physics to get a sense for how we can advance science with deep learning. Here are a few key papers that are shaping my thinking around this topic, which might help motivate future work in science-based deep learning for ecology: