This paper provides a useful framework for comparing two different methods of spatial estimation in forestry. What follows are my notes on the paper.

Abstract

TL;DR: SLM has smaller empirical root-mean-squared prediction error (RMSPE) than k-NN across many different data types, with less bias and better interval coverage. SLM is also more robust to spatially imbalanced sampling.

SLM should be used instead of k-NN if the goal is accurate mapping or estimation of population totals and averages.

Method Introduction

Both the SLM and k-NN predictors are linear. However, SLM is based on “an assumed spatial stochastic model,” and theory alone does not say how much better SLM will be in practice; hence the empirical comparison.

SLM

The spatial linear model includes ordinary kriging and universal kriging. Advantages include:

  • “If the data-generating process is multivariate normal, then the SLM predictor is equivalent to the conditional expectation which is optimal.”
    • Even if this MVN criterion is not met, SLM predictions are still optimal within the class of best linear unbiased predictors (BLUPs)
  • While BLUPs require specifying a covariance structure, predictions are “generally robust” to misspecification

Disadvantages:

  • The data are technically used twice: once to estimate the covariance parameters and again to form the BLUP (see the predictor written out after this list)
  • Requires estimating several covariance parameters
  • Assumes a spatial stochastic model
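
As a reminder of what gets estimated and where (this is standard universal-kriging theory, not something specific to this paper): with fitted covariance parameters, the SLM/BLUP prediction at an unobserved location with covariate vector $x_0$ is

$$\hat{y}_0 = x_0^\top \hat{\beta} + c_0^\top \Sigma^{-1}\left(y - X\hat{\beta}\right), \qquad \hat{\beta} = \left(X^\top \Sigma^{-1} X\right)^{-1} X^\top \Sigma^{-1} y,$$

where $\Sigma$ is the fitted covariance matrix among the observed locations and $c_0$ holds the fitted covariances between the unobserved location and the observed ones. This makes the “data used twice” point concrete: $\Sigma$ is estimated first (e.g., by REML), then plugged into the BLUP.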

k-NN

k-NN finds observed samples that are “close” to an unobserved location based on the covariates in the model, then either imputes the closest one directly as the predicted value or forms a weighted average of several neighbors. Some advantages include:

  • Predictions stay within the bounds of biological reality when $k = 1$, since each prediction is an actually observed value
  • Logical relationships between covariates will be maintained
  • Distribution-free: the only assumption is a fixed surface, not a stochastic data-generating model
  • k-NN can be made local by including geographic coordinates as covariates and restricting neighbors to a circular area around the unobserved point

Some disadvantages:

  • Highly biased at the edge of the data cloud
  • Extreme values will be over- or underestimated if the sample does not capture the whole range of variability
  • Not a statistically consistent estimator (increasing the sample size does not guarantee convergence to the true value)
  • Lacks a good measure of uncertainty

Methods

Two goals were considered in this paper:

  1. Point prediction at unobserved locations
  2. Block prediction of the total or average $T$

We assume that we have $n$ observed points and $m$ unobserved points.
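
Presumably the target is

$$T = \sum_{i=1}^{n+m} y_i$$

for a total, or $T / (n + m)$ for an average.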

Both prediction targets use a linear predictor of the general form

$$\hat{y} = \sum_{i=1}^{n} \lambda_i\, y_i = \boldsymbol{\lambda}^\top \mathbf{y},$$

where $\mathbf{y}$ is the vector of the $n$ observed values; SLM and k-NN differ only in how the weights $\boldsymbol{\lambda}$ are chosen.

Both SLM and k-NN use distance. Let $A$ be a matrix with coordinates in columns and the $i$th row denoted $a_i$. A general distance between the $i$th and $j$th rows of $A$ is

$$d(a_i, a_j) = \sqrt{(a_i - a_j)\, W\, (a_i - a_j)^\top},$$

where $W$ is a weighting matrix ($W = I$ gives Euclidean distance; $W$ set to an inverse covariance matrix gives Mahalanobis distance).

k-NN Review

Let $D$ be the distance matrix whose $(i, j)$ element is the distance between row $i$ and row $j$. $D$ can be partitioned by observed (o) and unobserved (u) locations into

$$D = \begin{bmatrix} D_{oo} & D_{ou} \\ D_{uo} & D_{uu} \end{bmatrix}.$$

For each column $j$ of $D_{ou}$ (an unobserved location), we can find the observed row $i$ with the minimum distance to $j$ and impute its value, or find the $k$ closest observed rows and take some kind of weighted average, as in the sketch below.
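
A minimal sketch of this imputation step in Python, assuming numpy; the Mahalanobis weighting (inverse sample covariance of the covariates) and the inverse-distance weights for $k > 1$ are my choices for illustration, not necessarily the paper's:

```python
import numpy as np

def knn_predict(X_obs, y_obs, X_new, k=1, W=None):
    """k-NN imputation in covariate space.

    X_obs: (n, p) covariates at observed locations
    y_obs: (n,)   responses at observed locations
    X_new: (m, p) covariates at unobserved locations
    W:     (p, p) weighting matrix; defaults to the inverse sample
           covariance of X_obs, i.e. Mahalanobis distance
    """
    if W is None:
        W = np.linalg.inv(np.cov(X_obs, rowvar=False))
    diff = X_new[:, None, :] - X_obs[None, :, :]       # (m, n, p) differences
    d2 = np.einsum("mnp,pq,mnq->mn", diff, W, diff)    # squared distances (m, n)
    idx = np.argsort(d2, axis=1)[:, :k]                # k nearest observed rows
    if k == 1:
        return y_obs[idx[:, 0]]                        # impute the closest value directly
    d = np.sqrt(np.take_along_axis(d2, idx, axis=1))
    w = 1.0 / np.maximum(d, 1e-12)                     # inverse-distance weights
    return (w * y_obs[idx]).sum(axis=1) / w.sum(axis=1)
```

With $k = 1$ the prediction is an actually observed value, which is exactly why k-NN predictions stay within biological reality.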

Cross-validation is the method most often used to compute prediction standard errors. Each sample is removed one at a time, and the remaining samples are used to predict the removed one. The average of these squared errors serves as a global estimator of the out-of-sample prediction error.

The k-NN prediction standard error can then be defined as

$$\hat{\sigma}_{\text{CV}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{(i)} - y_i\right)^2},$$

where $\hat{y}_{(i)}$ is the prediction of $y_i$ with observation $i$ left out.
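
A quick sketch of that leave-one-out computation, reusing `knn_predict` from the snippet above:

```python
import numpy as np

def knn_cv_rmspe(X_obs, y_obs, k=1):
    """Leave-one-out cross-validation RMSPE for k-NN."""
    n = len(y_obs)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                        # drop observation i
        pred = knn_predict(X_obs[keep], y_obs[keep], X_obs[i:i + 1], k=k)
        errs[i] = pred[0] - y_obs[i]
    return np.sqrt(np.mean(errs ** 2))                  # sigma-hat_CV
```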

An iterated variogram estimator for the variance has also been proposed.

SLM methodology has been reviewed in previous works by ver Hoef.

Simulation of artificial data

All data sets were repeatedly simulated on a $20 \times 20$ regular grid (400 locations) with 8 covariates.

We start by constructing

$$w_1 = z_1 + \epsilon_1,$$

where $z_1$ is a $400 \times 1$ vector containing zero-mean spatially autocorrelated random variables from a geostatistical model with partial sill $\delta_1^2$ and range parameter $\rho_1$, and $\epsilon_1$ is a $400 \times 1$ vector containing zero-mean independent random variables with variance $\sigma_1^2$.

We then set up an autoregressive model

$$w_{\eta} = \phi_{\eta}\, w_{\eta - 1} + z_{\eta} + \epsilon_{\eta}, \qquad \eta = 2, \ldots, 8,$$

where $z_{\eta}$ contains zero-mean spatially autocorrelated random variables from a geostatistical model with partial sill $\delta_{\eta}^2$ and range parameter $\rho_{\eta}$, and $\epsilon_{\eta}$ contains zero-mean independent random variables with variance $\sigma^2_{\eta}$.

Now let

$$x_{\eta} = \mu_{\eta} + w_{\eta},$$

where $\mu_{\eta}$ is a vector holding the constant true mean of the $\eta$th covariate.

Finally, we create the response variable as

$$y = X\beta + z_y + \epsilon_y,$$

where $X = [x_1 | x_2 | \ldots | x_8 ]$ is our matrix of covariates, $\beta$ is a vector of parameters, $z_y$ contains zero-mean spatially autocorrelated random variables from a geostatistical model with partial sill $\delta_{y}^2$ and range parameter $\rho_{y}$, and $\epsilon_{y}$ contains zero-mean independent random variables with variance $\sigma_y^2$.

ver Hoef let all $z_{\eta}$ and $\epsilon_{\eta}$ be normally distributed.

The autocorrelation of the $z_{\eta}$ used the spherical model

$$\mathrm{corr}(h) = \begin{cases} 1 - \dfrac{3h}{2\rho_{\eta}} + \dfrac{h^3}{2\rho_{\eta}^3}, & 0 \le h \le \rho_{\eta}, \\ 0, & h > \rho_{\eta}, \end{cases}$$

where $h$ is the distance between two locations.
Three types of data were simulated from these models, each time with $\phi_{\eta} = (\text{NA}, .5, .5, .5, .5, 0, .5, .5, .5)$. Each of the three simulation methods produced 2000 data sets with 400 simulated values per data set. In each simulation, 100 random locations were sampled and the remaining 300 points were predicted, along with an overall total.
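
A sketch of one simulated data set under these equations; the parameter values ($\delta^2$, $\rho$, $\sigma^2$, $\mu$, $\beta$) are placeholders I made up, since the actual values aren't in these notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# 20x20 regular grid -> 400 locations
gx, gy = np.meshgrid(np.arange(20.0), np.arange(20.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
h = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)  # 400x400 distances

def spherical_cov(h, psill, rho):
    """Spherical covariance: psill*(1 - 1.5h/rho + 0.5(h/rho)^3) for h <= rho, else 0."""
    r = h / rho
    return np.where(h <= rho, psill * (1 - 1.5 * r + 0.5 * r ** 3), 0.0)

def sim_z(psill, rho):
    """One zero-mean spatially autocorrelated 400-vector via Cholesky."""
    C = spherical_cov(h, psill, rho) + 1e-8 * np.eye(len(h))   # jitter for stability
    return np.linalg.cholesky(C) @ rng.standard_normal(len(h))

# placeholder parameters (not the paper's values)
psill, rho, sig2, mu = 1.0, 5.0, 0.25, 10.0
phi = [np.nan, .5, .5, .5, .5, 0, .5, .5]   # phi_1 unused; pattern from the notes
beta = np.ones(8)

w = sim_z(psill, rho) + np.sqrt(sig2) * rng.standard_normal(400)          # w_1
cols = [w]
for eta in range(1, 8):                                                   # w_2 .. w_8
    w = phi[eta] * w + sim_z(psill, rho) + np.sqrt(sig2) * rng.standard_normal(400)
    cols.append(w)
X = mu + np.column_stack(cols)               # x_eta = mu_eta + w_eta
y = X @ beta + sim_z(psill, rho) + np.sqrt(sig2) * rng.standard_normal(400)
```

From here, sampling 100 of the 400 rows at random and predicting the other 300 reproduces the setup described above.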

Several model misspecifications were explored, including inclusion of covariates with no effect, exclusion of covariates with real effects, and, for SLM, generating data from a spherical autocovariance model but fitting it with an exponential autocovariance model.

Real data

ver Hoef also used resampled real data from the Forest Inventory and Analysis databases for Oregon. However, due to privacy restrictions, the actual plot locations in this dataset are not publicly available.

Results

Methods

The following methods were used:

  • k-NN on Mahalanobis distance with $k = 1$
  • k-NN on Mahalanobis distance with $k = 5$
  • k-NN with most similar neighbor (MSN) distance, $k = 1$
  • k-NN with most similar neighbor (MSN) distance, $k = 5$
  • bstNN, which tries $k = 1, \ldots, 30$ and then chooses the distance matrix and $k$ with the smallest cross-validation RMSPE
  • SLM (all fixed effects, exponential autocovariance model estimated using REML)
  • Multiple linear regression (LM) assuming independent errors

Metrics

Three metrics were used:

  • Root-mean-squared prediction error for pointwise predictions:

$$\text{RMSPE} = \sqrt{\frac{1}{mR}\sum_{r=1}^{R}\sum_{j=1}^{m}\left(\hat{y}_{j|r} - y_{j|r}\right)^2},$$

where $R$ is the number of simulations or real-data resamplings and $m$ is the number of point predictions per data set

  • Root-mean-squared prediction error for total predictions:

$$\text{RMSPE}_T = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{T}_r - T_r\right)^2}$$

  • Signed relative bias, which is bias expressed as a fraction of the variability in the data set,

where the bias is $\tau_\kappa = \frac{1}{mR} \sum_{r=1}^R \sum_{j=1}^m (\hat{y}_{j|r} - y_{j|r})$ for point predictions or $\tau_\kappa = \frac{1}{R}\sum_{r=1}^R(\hat{T}_r - T_r)$ for totals.
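
A small sketch of the three metrics in Python; the SRB denominator is my assumption, since the notes only say "a fraction of the variability in the data set":

```python
import numpy as np

def rmspe_point(y_hat, y_true):
    """y_hat, y_true: (R, m) arrays of predicted / true point values."""
    return np.sqrt(np.mean((y_hat - y_true) ** 2))

def rmspe_total(T_hat, T_true):
    """T_hat, T_true: (R,) arrays of predicted / true totals."""
    return np.sqrt(np.mean((T_hat - T_true) ** 2))

def signed_relative_bias(y_hat, y_true):
    tau = np.mean(y_hat - y_true)     # tau_kappa, the mean signed error
    return tau / np.std(y_true)       # normalization by variability: assumed
```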

Comparison

SLM reduced RMSPE by 52.6% to 64.1% compared to the most popular k-NN methods, and by 31.8% to 34.8% compared to the ordinary linear regression model. MSN with $k = 5$ (MSN5) was the best k-NN method, but even it was not as good as plain LM.