Random displacement does not lead to the standard measurement error model: a geometrical explanation

In a previous blog post I considered a regression problem in which the distance between points is a covariate and one class of points is randomly displaced for privacy protection. This setup was directly derived from the Demographic and Health Surveys (DHS) program, which uses such a displacement procedure (Burgert et al., 2013). I further argued that the standard measurement error model may not be admissible, since the error is biased.

This is in direct contradiction to Warren et al. (2016), who not only state that the standard measurement error model may be used, but also claim to show that its assumptions (including unbiased error) hold:

"We also show that the observed distance covariate follows the classical measurement error model form [...] with $\mathrm{E}(u_i) = 0, \mathrm{Var}(u_i) = \sigma^2_u$". (Warren et al., 2016)

Their analysis is a dedicated companion piece to the DHS data publication and reappears in the official guidelines document (Perez-Heydrich et al., 2013). As the DHS is an important data source for epidemiological research, correct usage is crucial. So, who is right?

The Warren et al. (2016) argument

The authors' proof sketch can be found here (under 'Supplementary Materials'). It presumes the displacement procedure described in my previous post ("random perturbation"). Borrowing notation from there, let $\mathbf{x} = (x_{1}, x_{2})$ be the coordinates of an original cluster location that gets displaced to $\mathbf{x}' = (x'_{1}, x'_{2})$. Let the coordinates of our point of interest (POI) be $\mathbf{u} = (u_{1}, u_{2})$. The measurement error of interest is $\eta = D(\mathbf{x}', \mathbf{u}) - D(\mathbf{x}, \mathbf{u})$, where $D(\cdot, \cdot)$ is the Euclidean distance function. By the triangle inequality we have \[D(\mathbf{x}', \mathbf{u})\leq D(\mathbf{x}, \mathbf{u}) + D (\mathbf{x}, \mathbf{x}') \Rightarrow D(\mathbf{x}', \mathbf{u}) - D(\mathbf{x}, \mathbf{u}) = \eta \leq D(\mathbf{x}, \mathbf{x}')\] for $D(\mathbf{x}', \mathbf{u}) \geq D(\mathbf{x}, \mathbf{u})$. By symmetry $\eta \geq - D(\mathbf{x}, \mathbf{x}')$ for $D(\mathbf{x}', \mathbf{u}) \leq D(\mathbf{x}, \mathbf{u})$. The displacement distance $D(\mathbf{x}, \mathbf{x}')$ is itself bounded on $[0\,,r_{\max}]$, where $r_{\max}$ is a parameter of the displacement mechanism. It is therefore straightforward to see that $\eta \in [-r_{\max}\,, r_{\max}]$. From this, and the fact that the displacement is obviously point-symmetric with respect to the original location $\mathbf{x}$, the authors conclude that $\mathrm{E}(\eta) = 0$.

Reductio ad absurdum

While the derivations of Warren et al. (2016) are correct, their conclusion is fallacious. We can show this first with a reductio ad absurdum: Suppose the maximum displacement distance is arbitrarily large relative to the distance to the POI; formally $r_{\max} = k \cdot D(\mathbf{x}, \mathbf{u})$ with $k \to \infty$. Then the probability of obtaining a displaced location that is closer to the POI than the original must go to zero. For example, assume we have 100 households that are located at distances between 50 m and 1 km from a hospital. Now we displace the 100 households by distances picked randomly between 0 m and 10,000 km, according to a uniform distribution. How many of them do we expect to land (a) closer to the hospital than before, or (b) further from it than before? The answers are (a) almost none, and (b) almost all (in fact, most will likely land on different continents). Yet the symmetry argument behind $\mathrm{E}(\eta) = 0$ would have us expect about 50 for both (a) and (b).
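The thought experiment is easy to simulate. The following is a minimal sketch, assuming NumPy, with the hospital placed at the origin and far more than 100 households so that the frequencies are stable; the variable names and the sample size are illustrative choices, not part of the DHS procedure.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n = 100_000            # many more than 100 households, for stable frequencies
r_max = 10_000_000.0   # maximum displacement: 10,000 km, in metres

# Hospital (POI) at the origin; households 50 m to 1 km away, in random directions.
d = rng.uniform(50.0, 1_000.0, size=n)
phi = rng.uniform(0.0, 2.0 * np.pi, size=n)
x = np.column_stack((d * np.cos(phi), d * np.sin(phi)))

# Displacement: uniform random angle, uniform random distance in [0, r_max].
r = rng.uniform(0.0, r_max, size=n)
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
x_disp = x + np.column_stack((r * np.cos(theta), r * np.sin(theta)))

# Measurement error: new distance to the hospital minus original distance.
d_disp = np.linalg.norm(x_disp, axis=1)
eta = d_disp - d

share_closer = np.mean(eta < 0)
share_farther = np.mean(eta > 0)
mean_eta = eta.mean()

print(f"share closer:  {share_closer:.5f}")
print(f"share farther: {share_farther:.5f}")
print(f"mean error:    {mean_eta / 1000:.0f} km")
```

Every simulated error still respects the triangle-inequality bound $|\eta| \leq D(\mathbf{x}, \mathbf{x}')$, so the bound itself is not in question; only the step from the bound to a zero mean is.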

The geometrical counter-argument

Elkies et al. (2015) use a neat trick to show that in fact $\mathrm{E}[D(\mathbf{x}', \mathbf{u})] > D(\mathbf{x}, \mathbf{u})$ and that therefore $\mathrm{E}(\eta) > 0$. I provide here a more intuitive geometrical derivation. Recall that the full name of the displacement mechanism is random perturbation within a circle. The original location gets moved along a random angle for a random distance, which is uniformly distributed in $[0\,, r_{\max}]$. We can therefore imagine a circle of radius $r_{\max}$ around the original location $\mathbf{x}$, within which the new location must fall. This is nicely visualized, for instance, in Hunter et al. (2021) Fig.3(1), or Zandbergen (2014) Fig.5(b).

This way of viewing the displacement mechanism turns it into an instance of 'improper' disk point picking. Since the displacement angle is also drawn from a uniform distribution we have that $\mathrm{Pr}(\mathbf{x}' \in S_\alpha) = | S_\alpha | / (\pi r_{\max}^2) = \alpha / 360$ where $S_\alpha$ is a generic circle sector with central angle $\alpha$ in angular degrees. For instance, if we divide the displacement circle into two halves along a diameter, the probability that the new location will be in one half rather than the other is, unsurprisingly, 1/2 (note that with the 'improper' disk point picking problem, this holds only for circle sectors, not for arbitrary sub-areas). With this in mind, the full counter-argument can be summarized in the image below.
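Both the sector property and its failure for other sub-areas can be checked numerically. This is a minimal sketch, assuming NumPy and a unit displacement radius; the 90-degree sector and the inner disk of radius $r_{\max}/2$ are examples chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 200_000
r_max = 1.0

# 'Improper' disk point picking: angle and radius are both uniform.
theta = rng.uniform(0.0, 360.0, size=n)  # displacement angle in degrees
r = rng.uniform(0.0, r_max, size=n)      # displacement distance

# (1) A sector with central angle alpha receives probability alpha / 360.
alpha = 90.0
p_sector = np.mean(theta < alpha)

# (2) A sub-area that is not a sector: the inner disk of radius r_max / 2.
# It covers one quarter of the area but receives about half of the points,
# because a uniform radius concentrates points near the centre.
p_inner = np.mean(r < r_max / 2)
area_share_inner = 0.25

print(f"sector: simulated {p_sector:.3f}, alpha/360 = {alpha / 360:.3f}")
print(f"inner disk: simulated {p_inner:.3f}, area share = {area_share_inner:.3f}")
```

The second check is why the argument is phrased in terms of sectors: only for them does the area share coincide with the probability.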

Think of the original distance $d := D(\mathbf{x}, \mathbf{u})$ as the radius of a circle around $\mathbf{u}$. If a displaced location falls inside this circle, the new distance $d' := D(\mathbf{x}', \mathbf{u})$ is shorter than the original one; if it falls outside, it is longer. With the two blue radii connecting $\mathbf{x}$ and the intersection points of the circles' circumferences we can see clearly that \[\frac{\alpha}{360} < \mathrm{Pr}(d' < d) < \frac{1}{2}.\] The lower bound holds because the sector between the two blue radii lies entirely inside the circle around $\mathbf{u}$; the upper bound holds because every location with $d' < d$ lies strictly within the half of the displacement circle that faces $\mathbf{u}$. The size of $\alpha$ is a function of the relative sizes of the two circles: the larger the fraction $d / r_{\max}$, the larger $\alpha$, with $\alpha \to 180$ as $d / r_{\max} \to \infty$. This implies $\mathrm{Pr}(d' < d) \to \frac{1}{2}$ for $d / r_{\max} \to \infty$. For any finite ratio we therefore have $\mathrm{Pr}(d' > d) > \mathrm{Pr}(d' < d)$, with the corollary $\mathrm{E}(d'-d) = \mathrm{E}(\eta) > 0$.
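The two bounds can be verified by simulation for a few ratios $d / r_{\max}$. The sketch below assumes NumPy and uses the relation $\alpha = 2 \arccos\left(r_{\max} / (2d)\right)$, which is my own addition: it follows from the isosceles triangle with sides $d$, $d$ and $r_{\max}$ formed by $\mathbf{u}$, $\mathbf{x}$ and an intersection point, and it requires $r_{\max} \leq 2d$ so that the circles intersect. Function names and ratios are illustrative.

```python
import numpy as np

def alpha_degrees(d, r_max):
    """Central angle (in degrees) of the sector between the two intersection
    points, seen from x. Assumes r_max <= 2 * d, so the circles intersect."""
    return 2.0 * np.degrees(np.arccos(r_max / (2.0 * d)))

def simulate(d, r_max, n=400_000, seed=2):
    """Return the simulated Pr(d' < d) and the mean error E(eta)."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(0.0, r_max, size=n)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    # Place x at the origin and u at (d, 0); by rotational symmetry
    # this loses no generality.
    d_disp = np.hypot(d - r * np.cos(theta), r * np.sin(theta))
    eta = d_disp - d
    return np.mean(eta < 0), eta.mean()

r_max = 1.0
results = {}
for ratio in (1.0, 2.0, 5.0, 10.0):
    d = ratio * r_max
    p_closer, mean_eta = simulate(d, r_max)
    lower = alpha_degrees(d, r_max) / 360.0
    results[ratio] = (lower, p_closer, mean_eta)
    print(f"d/r_max = {ratio:4.1f}: alpha/360 = {lower:.4f}, "
          f"Pr(d' < d) = {p_closer:.4f}, E(eta) = {mean_eta:.4f}")
```

For each ratio the simulated probability should fall between $\alpha / 360$ and $1/2$, and the mean error should be positive while shrinking as the ratio grows.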

Conclusion

The model of Warren et al. (2016) works fine in $\mathbb{R}^1$: If we add a symmetric error from $[-r_{\max}\,,r_{\max}]$ to the single coordinate $x_1$, then the distance $|u_1 - x_1|$ becomes shorter or longer with equal probability and by equal amounts (assuming we do not overshoot past $u_1$). In $\mathbb{R}^2$, however, there are ways to move towards $\mathbf{u}$ in one dimension, but still away from it overall. We find that within a given perturbation circle there are more possibilities to enlarge the distance than to shrink it, resulting in a positive bias for distances to the POI.
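The role of the second dimension can be made explicit with a second-order expansion. This is only a sketch under the assumption $r \ll d$, where $d = D(\mathbf{x}, \mathbf{u})$ and $d' = D(\mathbf{x}', \mathbf{u})$ are the original and displaced distances, $r = D(\mathbf{x}, \mathbf{x}')$ is the realized displacement distance, and $\theta$ (a symbol introduced here) is the angle between the displacement direction and the direction from $\mathbf{x}$ to $\mathbf{u}$. By the law of cosines, \[d' = \sqrt{d^2 - 2 d r \cos\theta + r^2} \approx d - r \cos\theta + \frac{r^2 \sin^2\theta}{2d}.\] The first-order term $-r\cos\theta$ is the movement along the line through $\mathbf{x}$ and $\mathbf{u}$; it averages to zero over a uniform angle, just as in $\mathbb{R}^1$. The second-order term comes from the perpendicular component $r \sin\theta$, is never negative, and averages to $r^2 / (4d)$ over $\theta$. With $r$ uniform on $[0\,, r_{\max}]$, so that $\mathrm{E}(r^2) = r_{\max}^2 / 3$, this gives approximately \[\mathrm{E}(\eta) \approx \frac{r_{\max}^2}{12\, d} > 0,\] a bias that vanishes only as $d / r_{\max} \to \infty$.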

One way to force the error model to be of standard type would be to oversample the intersection area (the green area vis-a-vis the yellow one in our image) when drawing new locations, in inverse proportion to its share in the perturbation circle. This, however, requires an adapted displacement mechanism. With the DHS mechanism as it is, non-standard errors result, which call for special calibration, like the one described previously.

Literature

C.R. Burgert, J. Colston, T. Roy, B. Zachary, "Geographic displacement procedure and georeferenced data release policy for the Demographic and Health Surveys," DHS Spatial Analysis Report No.7, ICF International, 2013.

N. Elkies, G. Fink, T. Bärnighausen, "'Scrambling' geo-referenced data to protect privacy induces bias in distance estimation," Population and Environment, vol.37, pp.83-98, 2015.

L.M. Hunter, C. Talbot, W. Twine, J. McGlinchy, C. Kabudula, D. Ohene-Kwofie, "Working toward effective anonymization for surveillance data: innovation at South Africa's Agincourt Health and Socio-Demographic Surveillance site," Population and Environment, vol.42, pp.445-476, 2021.

C. Perez-Heydrich, J.L. Warren, C.R. Burgert, M.E. Emch, "Guidelines on the use of DHS GPS data," DHS Spatial Analysis Report No.8, ICF International, 2013.

J.L. Warren, C. Perez-Heydrich, C.R. Burgert, M.E. Emch, "Influence of Demographic and Health Survey point displacement on distance-based analyses," Spatial Demography, vol.4, no.2, pp.155-173, 2016.

P.A. Zandbergen, "Ensuring confidentiality of geocoded health data: Assessing geographic masking strategies for individual-level data," Advances in Medicine, 2014.
