On the reversibility of Voronoi geomasking

In a previous blog post I have talked about "scrambling" geographic point data in an attempt to protect the units, to which these points pertain, from privacy violations. Besides the simple 'random perturbation' discussed there, a whole range of different methods have been suggested. One is Voronoi masking, introduced by Seidl et al. (2015). 

Despite the comparatively good capacity of that method to preserve spatial patterns in masked data, its deterministic mechanism raises concerns about the possibility to reverse the procedure, revealing real locations. In this post, I explore the reversal potential of Voronoi geomasking.

What is Voronoi masking?

Seidl et al. (2015) describe the method (in its most vanilla version) as follows: For each point location, identified by planar geographic coordinates $(x_{i1}, x_{i2})$, its corresponding Voronoi polygon is computed. The point is then shifted onto the closest edge of said polygon along the shortest path. An example is shown below.

Displayed in the picture on the top are some locations (black squares), say of households with their corresponding dwellings. Blue lines show the Voronoi tesselation, grey dotted lines the Delauny triangulation that leads to it. Shown in the picture below are the new 'masked' locations, denoted by $(x'_{i1}, x'_{i2})$, that result from the process described before. 

In the words of Seidl et al., "each point is snapped to the closest part along the edges of its corresponding Voronoi polygon." Recall that such a polygon is defined as the set of all points that are closer to its assigned location than to any other. Therefore, the closest edge will always result between a location and its nearest neighbor. An equivalent (and arguably simpler) way to describe the procedure is then: Each point gets moved half the way towards its nearest neighbor.

Where two red arrows in the example point to the same square, this means that two locations have been each other's nearest neighbors and therefore are displaced to exactly the same new location. This pair subsequently becomes locationally indistinguishable, a supposed strength of the method and an aspect I will come back to.

Reversing a geomasking procedure

According to Kounadi and Leitner (2014): "Spatial re-identification risk is the probability of identifying the real location of an individual. Research on spatial reidentification has focused on the risk (a) when re-identifying locations from published maps that contain unmasked confidential locations and (b) when estimating the actual locations from published maps that contain masked confidential locations."

The problem at hand is in the second category. Knowledge of the masking procedure can be used to reverse-engineer as many original locations as possible (see e.g. Cassa et al., 2008).

Consider first the set of point locations $(x'_{i1}, x'_{i2}) \,,\, i = 1, \dots, n$ of which we know that a geographic masking procedure has been applied to them, moving them from their true positions. It is assumed that we also possess a set of candidate locations $(x_{j1}, x_{j2})\,,\, j = 1, \dots, N$ with $N \geq n$. For instance, if we know from context information that our units of interest are households in single family homes, the candidate locations may be derived by geocoding public adress registers. For small study areas it may be feasible to geocode all dwellings found in the area and consider them as candidates for true locations.

It is assumed that the masked set is derived (by some masking procedure) from true locations that can be found in the candidate set. The first case to consider for a reversal attack is $n = N$, i.e. there is a 1:1 correspondence between masked locations and candidate locations; the only problem is connecting them. In practice, of course, we encounter more often $n \leq N$, as the masked locations are only a sample from the population (e.g. households with sick people, households with unemployed, or the like). I discuss both cases in turn.

Case n = N

A strategy for reversing the mask is devised as follows:

  1. We know that if two original locations were in a mutual nearest neighbor relation, a paired location (duplicate) arises, but not otherwise. We can thus split the set of candidate locations in two parts: One that contains all points that are pairwise mutual nearest neighbors, one with the rest. Duplicates after masking must have come from the former, singletons from the latter.
  2. Since a masked location is placed at equidistance between the true candidate location and its nearest neighbor, we compute for each member of the masked set its two nearest neighbors from the corresponding subset of the candidate set (duplicates or singletons). The true location will always be one of them.
  3. Duplicates will naturally yield the same two nearest neighbors for both members. But we have no way of deciding, barring further information, which came from where. We can only assign at random with 50% probability of being right.
  4. For singletons there are two possibilities: Either the true location's nearest neighbor was itself some other location's mutual nearest neighbor. In that case, it doesn't appear in the shortened subset of candidates and the two nearest neighbors we find are not equidistant. The one with the shorter distance is then the true original location.
  5. Now we roll up the remainders from the edges. We start with those locations that only have one not-previously-taken candidate left among its two. If we decide this case, at least one other loses its second option and gets assigned the remaining one, etc. All further candidate locations can thus be assigned by iterative elimination.

Now a practical demonstration: For a somewhat plausible re-identification experiment I use the dwellings data set from the R package sdcSpatial. It includes 90,603 real household locations, together comprising a town in the Netherlands. Voronoi masking is applied to the whole data set (assuming $n = N$). Subsequently, the reversal procedure is applied as laid out above.

The data set is quite dense: We get a median distance between nearest neighbors of 5.11m. The median displacement distance (between the original and the masked location) is consequently exactly half that, slightly less than 2.56m; not much. The procedure results in 28,119 (ca. 31%) singletons. We expect to correctly identify all singletons plus half of the duplicates, i.e. 28,119 + 0.5 * 62,484 = 59,361 locations.

Actually running the experiment will always yield slightly more or slightly fewer identifications, depending on the random assignment of duplicates. My demonstration run gives 59,379 locations, or 65.6% of the data, that were correctly assigned to their candidates, including 100% of all singletons and 50.1% of locations from pairs. The computation time was negligible. 

Closeup of dwellings data; roughly two thirds of the masked locations are correctly matched to their sources in the candidate set (de-anonymized).

Case n < N

For most privacy mechanisms, sampling is assumed to add an additional layer of protection, as a single masked location could be associated, in principle, with very many candidate locations. However, this is not the case with Voronoi masking. Its workings instead allow us to exactly recreate the sample from the masked data. We do this as follows:

First we ask: how far away can the true location be (at most) from its masked version? Is there an upper bound for this search distance? There is: It is the distance to the first nearest neighbor (within masked data) with a different location; i.e. the first-nearest neighbor for singletons and the second-nearest neighbor for duplicates. This follows from the logic of the Voronoi masking procedure and is easily verified.

For each masked location, we look - within its individual search distance - for two locations from the larger candidate set, that are exactly equi-distant from it. Those are: the original location and its nearest neighbor (also in the sample). [Note: The probability of there being a second pair within the search area that is also equi-distant depends on the granularity of the geocoding. Typically, the chance will be so small as to be neglected. In the unlikely case of more than one pair, choosing the one with the shortest distance will usually give the candidates required for de-anonymization.]

Collecting all locations thus found will actually exactly reconstruct the sample from the larger candidate set. With this shortened set, we have again $n =N$ and proceed as described above for that case.

For a practical demonstration, consider again the dwellings data set. From the 90,603 households a random sample of 5,000 (ca. 5.5%) is drawn. Voronoi masking is applied to the sampled locations, yielding a median displacement distance of 15.73m (much larger than before). The sample after masking comprises 1,868 (ca. 37%) singletons, hence the expected number of identifications is 1,868 + 0.5 * (5,000 - 1,868) = 3,434. Running the full de-anonymization in two steps (I: sample reconstruction, II: location re-assignment) results in 3,462 (ca. 69.2%) actual identifications. As promised, sampling provided no further protection against disclosure.

Closeup of dwellings data after sampling; again ca. two thirds of locations are de-anonymized.

Note furthermore, that the ability to reconstruct the sample frees us from the necessity of duplicating the sampling frame when compiling a candidate set. Instead, we only require any superset of the sampled locations. For instance, if the masked set is comprised of households, but the publicly available geodata or adress repositories do not include a breakdown of household adresses vs. other adresses, we can simply geocode all buildings (households plus enterprises plus public buildings etc.) and let the decision rule above sort the wheat from the chaff. This severely decreases barriers for a reversal attack.

Conclusion

A reversal attack as described here always requires a candidate set, of which the original locations are a subset. As Wang et al. (2022, p.388) note, "there are many ways to generate such dataset with publicly available data (e.g. local government open data, OpenStreetMap, [...])" etc.

Voronoi geomasking may satisfy privacy models such as spatial k-anonymity on paper, but because of its deterministic mechanism is rather easy to reverse. The privacy after reversal is limited to $k = 2$ for duplicates and none for singletons. The apparent insufficiency of the vanilla version has inspired evolutions of the method, specifically Adaptive Voronoi masking (Polzin & Kounadi, 2021) that seem less vulnerable, but at the same time inferior in post-masking analytical value.

In conclusion: With deterministic masking mechanisms it is not sufficient to check privacy metrics like k-anonymity. Some plausible claim to irreversibility is also required.

Implementation remarks

The Delauny triangulations / Voronoi tesselations were computed using the R package deldir. Nearest neighbor functions for planar point patterns are taken from the spatstat package. R code, which includes the full de-anonymization algorithm as well as reproducible reversal experiments, can be downloaded from my GitHub page.

Literature

C.A. Cassa, S.C. Wieland, K.D. Mandl, "Re-identification of home addresses from spatial locations anonymized by Gaussian skew," International Journal of Health Geographics, vol.7, 2008.

O. Kounadi, M. Leitner, "Why does geoprivacy matter? The scientific publication of confidential data presented on maps," Journal of Empirical Research on Human Research Ethics, vol.9, no.4, pp.34-45, 2014.

F. Polzin, O. Kounadi, "Adaptive Voronoi masking: a method to protect confidential discrete spatial data," 11th International Conference on Geographic Information Science (GIScience), Proceedings Pt.II, 2021.

D.E. Seidl, G. Paulus, P. Jankowski, M. Regenfelder, "Spatial obfuscation methods for privacy protection of household-level data," Applied Geography, vol.63, pp.253-263, 2015.

J. Wang, J. Kim, M.-P. Kwan, "An exploratory assessment of the effectiveness of geomasking methods on privacy protection and analytical accuracy for individual-level geospatial data," Cartography and Geographic Information Science, vol.49, no.5, pp.385-406, 2022.

Kommentare

Beliebte Posts aus diesem Blog

Herfindahl-Hirschman-Index als Maß für die Diversität von Herkünften auf Gemeindeebene [deutsch]

Derivation of the expected nearest neighbor distance in a homogeneous Poisson process