For my project work in my last quarter at Oregon State, I will be looking at small area estimation and handling sparse, zero-inflated data in spatial data. I think this is a nice tie-in to other statistical work that I’ve done, particularly with metagenomeSeq and the work that I did over the summer for the FAA.

This paper is a useful primer on block kriging and finite population estimation. The following document are my notes on this paper.


Managing wildlife populations, for example moose in Alaska, require some estimate of how large those populations are. A reasonable strategy for collecting data would be to split the area of interest into plots, randomly sample plots, and then count how many moose are in those plots. How do we determine an estimate for the size of a population through these plot-based sampling methods?

ver Hoef proposes finite population block kriging which has three distinct advantages over simple random sampling or stratified random sampling.

  1. FPBK is more precise
  2. FPBK allows for small area estimation
  3. FPBK allows for nonrandom sampling designs

Primer on block kriging

Geostatistics is predicated on point samples. Since points are infinitesimally small, infinite populations must be assumed. Small area estimation assumes that data comes from points, as opposed to a finite set of sample units.

Kriging is a spatial prediction method that minimizes prediction variance (MSPE, or mean squared prediction errors). The following derivation borrows heavily from Cressie 1993, p.151.

We assume the data follows a linear model:

where $z$ is our variable of interest with $Z(s)$ defined at each location $s$ in $D$, $\mu = X\beta$, and $\delta$ has second-order stationarity s.t. $E[\delta(s)] = 0$ and $E[Z(s)] = \mu(s) = x^T(s)\beta$.

The covariance between different errors is defined as

s.t. the covariance depends only on $h$, which is the distance between those points.

For universal block kriging, define

where $|B|$ is the area or volume of the block, $Z(B)$ is the average value within block $B$, and $\mu(B)$ is expected value in $B$.

Data is assumed to be collected at $n$ points so the data is really a realization of the random vector $z = [Z(s_1), \ldots, Z(s_n)]$. Let’s use use a linear predictor $a^T z$ of the random variable $Z(B)$ s.t. $E[a^T z] = E[Z(B)]$.

In order to minimize the MSPE, we need to find a $\lambda$ s.t.

Basically, the goal here is to minimize $E[\lambda^Tz - Z(B)]^2$ which yields the following set of equations:


  • $c_B = [c_1(B), \ldots, c_n(B)]^T \text{ where } c_i(B) : = \int_B C(s - s_i)ds / |B|, \text{ for } i = 1, 2, \ldots, n$
  • $x_B = [x_1(B), \ldots, x_n(B)]^T \text{ where } x_j(B) : = \int_B x_j(s)ds / |B|, \text{ for } j = 1, 2, \ldots, n$
  • $m$ are lagrange multipliers necessary for the unbiasedness correction

The solution for $\lambda$ and $m$ yields the block BLUP


  • The overall predicted mean is $\hat{\mu} := X\hat{\beta}_{GLS}$
  • The estimated mean for the observed block is
  • and $\hat{\beta}_{GLS}$ is the Generalized Least Squares estimate $(X^T\Sigma^{-1}X)^{-1}X^T\Sigma^{-1}z$

Then the block kriging variance is given by the second moment about the mean:


  • $\sigma^2_{B,B} = \int_B \int_B C(s - u)ds du / |B|^2$
  • $d_B = (x_B - X^T \Sigma^{-1} c_B)$

A lot of the matematical details are left out here, but for the purposes of this paper it was fine to leave them out.

Block prediction for finite populations

Let’s strictly define what we want to do here.

As ver Hoef states, the “objective of finite population sampling is to estimate the average or total of the values that are actually realized, rather than the mean of some superpopulation from which the data were drawn. In other words, we are after prediction, not estimation. We want to predict a function of the actual values that occurred, not estimate unobservable parameters of the model.

Here’s the set up for the finite population version.

Suppose that $z$ is a vector of random variables on a finite spatial lattice. This spatial lattice $D$ can be indexed from $i = 1, \ldots, N$. The random variable $Z_i$ is located at the $ith$ site in the lattice and let $\tau(z) = B^T z$ be a vector of random variables to be predicted. Note that the columns of $B$ can be structured in various ways such that we compute different $\tau(z)$, such as the average of realized values, the total of the realized values, the total of only a smaller subregion of realized values.

Our goal is to predict $B^T z$ using some linear combination of the data $\hat{\tau}(z_s) = A^T z_s$ where $z_s$ is the $n \text{x} 1$ vector of values from the subset $D$ that we sampled and $z_u$ is the $(N - n)\text{x}1$ vector of the unsampled locations, and $z = (z_s^T, z_u^T)^T$.

Definition 1

Let the Mean-Squared Prediction Error (MPSE) matrix for any particular matrix $A$ be

Definition 2

The predictor $\Lambda^T z$ is the best linear unbiased predictor if

  • $E[\Lambda^T z_s] = E[B^T z]$ and
  • $M_A - M_{\Lambda}$ is non-negative definite for every $A \neq \Lambda$

We will assume that the data $z$ follows the following linear model


  • $X$ is the matrix of fixed effects
  • $\beta$ is the parameter vector
  • $E[\delta] = 0$


To find the BLUP, we need to do two things: (1) establish the uniform unbiasedness condition for the predictor and (2) find the $\Lambda$ that minimizes the MSPE matrix.

Uniform unbiasedness condition

The BLUP is found by finding $\Lambda$ such that

is non-negative definite for all $A$ where $A^T z_s$ is unbiased. By minimizing the MSPE matrix, we obtain the prediction equations:

When the above equation is solved for $\Lambda$, the FPBK predictor is


  • The predicted unobserved values are
  • The predicted unobserved mean
  • The predicted observed mean is
  • The estimate of the parameters is

In plain english, the FPBK predictor $\Lambda^T z_s$ is found by multiplying the observed sample values times the corresponding values in $B_s$, then using universal block kriging to predict all the other unsampled units, then multipling these predicted values with the corresponding values in $B_u$.

Prediction variances of MSPE

If we take the formula for MPSE and substitute our solution for $\Lambda$ in for $A$, we get the following result:


  • $C = \Sigma_{SS} B_s + \Sigma_{SU} B_u$
  • $D = X^T B - X_s^T \Sigma_{SS}^{-1} C$
  • and finally the variance of the generalized least squares coefficients

Stepping back, $B^T \Sigma B$ is the variance of $B^T z$ and $B^T \Sigma B - C^T \Sigma_{SS}^{-1} C$ is the prediction variance. The last term $D^T V D$ arises from having to estimate $\beta$ where $D$ is like the distance between the predicted points in the design matrix $B^T X$ and the actual observed design matrix $C^T \Sigma_{SS}^{-1} X_s$.

For computing purposes, this can be turned into

where $W = X_u^T - X_s^T \Sigma_{SS}^{-1} \Sigma_{SU}$. If $B$ has more than one column, then the prediction variances are held in the diagonal elements and the prediction covariances are held in the off-diagonal elements.

Connections to sampling theory

Suppose that we are interested in predicting the mean over a lattice of $N$ sites, assuming no spatial autocorrelation.

We then end up at a predictor of $\lambda^T z_s = \bar{z}$ and an MPSE of $(\sigma^2 / n)(1 - f)$ where $(1-f)$ is a finite population correction factor.

In both simple random sampling and stratified random sampling, the blocking kriging estimate provides a reduction in variance, in part due to the strong second-order stationarity requirement. Data are often spatially patterned though, and we can get much more when they are modeled as spatially correlated.

Modelling autocorrelation

The following section makes heavy use of Cressie 1993, p. 61, 92-104, 176

We need to estimate $\Sigma$ somehow to model the spatial autocorrelation in the data. A popular model for spatial covariance is the exponential model

Choosing a model is difficult and might involve techniques like AIC, cross validation, or domain knowledge. WE can estimate the $\theta$ parameters using method of moments for variograms and covariances, then using weighted least squares or restricted maximum likelihood.


The increased precision of FPBK over SRS borrows from the fact that it uses information from neighboring locations and therefore outside the “small area”.

The model basis of FPBK does not rely on random sampling of plots, but instead second order stationarity. Thereofre, FPBK can be used with nonrandom sampling designs which has a lot of practical importance, considering the limitations of actual aerial and land surveys.

Systematic designs (see ver Hoef 2002) are easier to implement than SRS designs and have better precision as well. Additional samples can be allocated where necessary to fill “holes” as they appear in the spatial data (this perhaps ties in GRTS sampling that people really seem to love talking about?)

Finally, FPBK’s finite population correction gives it a great advantage over SRS.