Bias reduction in geodata masked for privacy: a simulation-based approach

Geographic coordinates of households are often displaced by a random vector before publication to protect the privacy of respondents. When these changed coordinates are used for statistical modelling, attenuation bias typically results. Inspired by work done by Karra et al. (2020) I discuss a method to reduce this bias based on a map of true prior location probabilities.

Background

Consider a data set of $n$ subjects, each identified by a pair of geographic coordinates $(x_{i1}, x_{i2})$. For reasons of data protection, these true coordinates will often not be releasable for public use. A common practice consists in "scrambling" or "masking" the coordinates by adding noise and to release the altered pairs $(x'_{i1}, x'_{i2})$ instead. 

One particularly prominent method of noise addition is random perturbation: For each observation a random angle $\gamma$ and a random distance $r$ is drawn and the location is moved in that direction by that amount. Commonly, but not necessarily, both $\gamma$ and $r$ are drawn from a uniform distribution. By simple trigonometry we get:\[\begin{aligned}
        x'_{i1} \;= & \; x_{i1} + r_i \, cos(\gamma_i) &\\
        x'_{i2} \; = & \; x_{i2} + r_i \, sin(\gamma_i) & \forall i = 1, \dots, n\\
        r \; \sim & \; \mathrm{unif}[0 \,,\, r_{\max}] &\\
        \gamma \; \sim & \; \mathrm{unif}[0 \,,\, 2\pi] &\\
    \end{aligned}\] The coordinates are subsequently used in some analysis task, where interest is in a function $g(x_{i1}, x_{i2})$ but only $g(x'_{i1}, x'_{i2})$ is available from the release. In particular, consider a second set of locations, pertaining to points of interest (POI), denoted $(u_{j1}, u_{j2}), j = 1,\dots, m$, where the information we care about is the distance of a location to its nearest POI. Examples are the distance of households to the nearest health facility (e.g. Kashima et al., 2012) or of consumers to a nearest store. For simplicity, consider a euclidean distance measure, such that: \[g(x_{i1}, x_{i2}) = \min_j \sqrt{(x_{i1} - u_{j1})^2 + (x_{i2} - u_{j2})^2} = d_i \]

Why geomasking calls for special calibration

We would like to use $d_i$ in a data science application, but only have $d'_i$, based on altered coordinates. Predictably, using the data 'as is' leads to attenuated effects estimates and reduces the quality of our conclusions (see e.g. Arbia et al., 2015). More importantly, we cannot treat the artificial measurement error induced by geomasking as being of Berkson-type. This would imply the implicit model $d' = d + \eta$ with $\mathrm{E}(\eta|d) = 0$. However, the latter assumption of mean-zero errors is violated: As Elkies et al. (2015) show formally, we have instead $\mathrm{E}(\eta | d) > 0$; in other words: random perturbation leads to a positive bias in the new distances.

An illustration of this effect is given in the animation below. Black dots indicate our original set of locations, among which are placed three POI (shown as blue crosses). One by one locations get 'scrambled' and at each step the distances to nearest POI are re-calculated. The blue line on the left traces the average distance - with mean-zero errors, we would expect it to fluctuate around its initial value, rather than rise with the number of altered locations.

In order to remove the attenuation bias, Karra et al. (2020) suggest the following approach: The publisher of the data set containing 'scrambled' coordinates provides with it a density map of original locations. Such maps, when sufficiently smooth, are typically safe for publication (de Jonge & de Wolf, 2016). Using the map and knowledge of the random perturbation process, a proxy for the true (distance) function $\mathrm{E}\big(g(x_{i1}, x_{i2}) | x'_{i1}, x'_{i2}\big)$ may be calculated.

Bias reduction using simulated synthetic locations

I investigate here a simulation-based spin for 'quick and dirty' reduction of the attenuation bias, where instead of using altered coordinates $(x'_{i1}, x'_{i2})$, the unknown original location is estimated using knowledge of the masking mechanism and synthetic locations simulated from the (true) density map.

The intution behind the approach is captured in the following figure. An original location (black cross) gets altered randomly (say, to the red cross). By reverse-applying the maximum location radius $r_{\max}$ we get the set of all feasible points, from which this location could have originated (the whole area within the circle). We now use the original density map to simulate a large number of synthetic locations. Our estimate for the original location is the centroid of all locations within the feasible range from the observed location - in this case the blue cross.

Formally, denote by $(s_{k1}, s_{k2}), k = 1, \dots, q$ the simulated locations (with $q$ large!) Define a set \[\mathcal{K} := \bigg\{k : \sqrt{(x'_{i1} - s_{k1})^2 + (x'_{i2} - s_{k2})^2} \leq r_{\max}\bigg\}\] Our 'quick and dirty'-estimate for the original location, given the altered location, is then \[(\hat{x}_{i1}, \hat{x}_{i2} \,|\, x'_{i1}, x'_{i2}) = \Bigg(\frac{\sum_{k \in \mathcal{K}} s_{k1}}{|\mathcal{K}|}, \frac{\sum_{k \in \mathcal{K}} s_{k2}}{|\mathcal{K}|} \Bigg)\] where $|\cdot|$ denotes set size. The idea being the following: If the density of true locations around an altered location is highly uneven, it is more likely that it originated from the high-density region than from elsewhere (as seen in the figure above). If the density within the vicinity is homogeneous, the observed location is already the best guess - in that case the centroid will not give a (much) different position than the observed one.

Monte Carlo results

To test the approach, I run a small Monte Carlo simulation. I use a set of realistic housing locations provided in the dwellings data set in the R package sdcSpatial. It contains roughly 90,000 pairs of location coordinates from the Dutch building register, together constituting a medium-sized town. In each simulation run, the following happens:

  1. From each quadrant of the map 2 locations are chosen at random to constitute POI (representing, e.g. health service providers). From the rest, a sample of size 100 is drawn to be our households.
  2. Based on the household locations, a density map is created by kernel density estimation, using a Gauss-kernel of size 150m. Random perturbation is applied to all household locations as described before. Those locations together with the map constitute the published data.
  3. Using the density map, 1 million synthetic locations are created and true locations are estimated. For each set of locations (true, altered, calibrated), the distances to closest POI are calculated.
  4. Based on the true distances, a target variable is created by the linear model $y = 50 + d + \epsilon$, with $\epsilon \sim \mathcal{N}(0, 10^4)$.
  5. Evaluation is two-fold: First, I calculate the mean squared error (MSE) of the approximated distances themselves, where the naive approach uses altered locations 'as is' and the calibrated approach uses distances based on estimated true locations. Secondly, I estimate the linear model, using each of the three distance types. 

Results from 1000 Monte Carlo runs are shown below.

The method indeed achieves a reduction in attenuation bias as well as in the accuracy with which distances to nearest POI may be estimated from geomasked data. However, the quality of true location data is not reached.

Implementation remarks

Locations for the animation were constructed by a Matérn cluster process, a point process model named after Swedish environmental statistician Bertil Matérn. I used the implementation in the useful and extensive R package spatstat, which also provides functionality for nonparametric estimation of the density map and for simulating a synthetic point pattern based on it. The data used for simulation is contained in the package sdcSpatial. The simulation code as well as code for creating all figures can be downloaded from my GitHub page.

Literature

G. Arbia, G. Espa, D. Giuliani, "Measurement error arising when using distances in microeconometric modelling and the individual's position is geo-masked for confidentiality," Econometrics, vol.3, no.4, pp.709-718, 2015.

E. de Jonge & P.-P. de Wolf, "Spatial smoothing and statistical disclosure control," Proceedings of the UNESCO Chair in Data Privacy's Privacy in Statistical Databases (PSD 2016) International Conference, pp.107-117, 2016.

N. Elkies, G. Fink, T. Bärnighausen, "'Scrambling' geo-referenced data to protect privacy induces bias in distance estimation," Population and Environment, vol.37, pp.83-98, 2015.

M. Karra, D. Canning, R. Sato, "Adding measurement error to location data to protect subject confidentiality while allowing for consistent estimation of exposure effects," Journal of the Royal Statistical Society - Series C (Application), vol.69, no.5, pp.1251-1268, 2020.

S. Kashima, E. Suzuki, T. Okayasu, R.J. Louis, A. Eboshida, S.V. Subramanian, "Association between proximity to a health center and early childhood mortality in Madagascar," PLoS ONE, vol.7, no.6, e38370, 2012.

Kommentare

Beliebte Posts aus diesem Blog

On the reversibility of Voronoi geomasking

Herfindahl-Hirschman-Index als Maß für die Diversität von Herkünften auf Gemeindeebene [deutsch]

Derivation of the expected nearest neighbor distance in a homogeneous Poisson process