Coordinate masking of multiple connected signals – Mind the centroid!
A while ago I came across a paper by Gao et al. (2019), who try to anonymize the coordinates of geo-located tweets by Twitter (now X) users. The data in question connects the geo-coordinates through a (pseudonymized) user ID, meaning an analyst can quickly find user-specific clusters, which typically identify the user's home and / or work location.
*[Figure: Toy example of a user's geo-located tweets – clusters clearly identify home location (blue) and a second one (red).]*
The problem is notoriously difficult (see, for instance, Zang & Bolot, 2011, or de Montjoye et al., 2013). Gao et al. can only manage a slight anonymization at the price of completely destroying clusters in the data, arguably squandering its analytical utility. This failure is illustrative.
The approach of Gao et al. – and why it doesn't work
The authors in Gao et al. (2019) apply an independent random perturbation per point, of the type I discussed in an earlier post. [Note: They also consider alternatives like a 'Gaussian perturbation', but the difference is inconsequential for the question I want to investigate here.] Suppose we geo-locate $N$ tweets over a number of users. Then for each tweet coordinate $(x_i \,, y_i), i = 1, \ldots, N$ a random displacement distance $r_i \sim \mathrm{Unif}[0, r_{\max}]$ is drawn, together with a random displacement angle $\gamma_i \sim \mathrm{Unif}[0, 2\pi]$. Both together determine the point offset: \[\begin{aligned} (x'_i \,, y'_i) &= (x_i + \Delta x_i\,, y_i + \Delta y_i) \;\;\;\;\; \forall i = 1, \ldots, N \\ \text{ with } \Delta x_i &= r_i \cdot \cos(\gamma_i) \\ \Delta y_i &= r_i \cdot \sin(\gamma_i)\end{aligned}\] However, when coordinates cluster tightly at the home location, applying an independent perturbation to each is basically equivalent to repeatedly perturbing a single home coordinate. When an analyst averages over the resulting cloud of points (i.e. takes the centroid), they will consequently retrieve the original home location by the law of large numbers! This known vulnerability was described by Cassa et al. (2008) as well as Zimmerman and Pavlik (2007), among others.
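To see the centroid attack in action, here is a small simulation (a sketch in Python/NumPy with illustrative parameters; the function name is my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def point_mask(xy, r_max, rng):
    """Independent per-point perturbation: r ~ Unif[0, r_max], gamma ~ Unif[0, 2*pi]."""
    n = len(xy)
    r = rng.uniform(0.0, r_max, size=n)
    gamma = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return xy + np.column_stack((r * np.cos(gamma), r * np.sin(gamma)))

# A tight "home cluster": 500 tweets scattered ~20 m around the true location (0, 0).
home = rng.normal(0.0, 20.0, size=(500, 2))
masked = point_mask(home, r_max=200.0, rng=rng)

# Centroid attack: averaging the masked cloud recovers the home location,
# because the random offsets have zero mean (law of large numbers).
print(np.linalg.norm(masked.mean(axis=0)))  # small compared to r_max = 200
```

Even though every single point was displaced by up to 200 m, the centroid of the masked cloud lands within a few meters of the true home location.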
*[Figure: Example of point-by-point masking – blue points still center on the original location.]*
An alternative approach using 'sticky noise'
The question is then: Can such data be effectively anonymized? To address the vulnerability of the centroid, I suggest a three-tiered approach. Suppose we grouped the data by user, and for each user computed notable clusters with an appropriate algorithm (for instance DBSCAN, as used by Gao et al.). Let's say we have, as before, $i = 1, \ldots, N$ tweets, now grouped into $j = 1, \ldots, M$ users and $h = 1, \ldots, K$ user-specific clusters. A general anonymization scheme could then look as follows: \[\begin{aligned}x'_i &= x_i + \lambda_1 \, \Delta x_j + \lambda_2 \, \Delta x_h + \lambda_3 \, \Delta x_i \\ y'_i &= y_i + \lambda_1 \, \Delta y_j + \lambda_2 \, \Delta y_h + \lambda_3 \, \Delta y_i \end{aligned}\] where $\Delta x_{\bullet} = r_{\bullet}\, \cos(\gamma_{\bullet})$, $\Delta y_{\bullet} = r_{\bullet}\, \sin(\gamma_{\bullet}), \bullet = i, j, h$ and $\lambda_1, \lambda_2, \lambda_3 \in [0 , 1]$. In other words, we can draw offsets per point as usual: $(\Delta x_i \,, \Delta y_i)$ is the point-specific offset. But we may also do it per user: $(\Delta x_j \,, \Delta y_j)$, or per cluster: $(\Delta x_h \,, \Delta y_h)$. Or we may do all of them by setting $\lambda_1 = \lambda_2 = \lambda_3 = 1$.
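To make the scheme concrete, here is a minimal NumPy sketch (the function and variable names are my own, and cluster IDs are assumed to be globally unique rather than numbered per user):

```python
import numpy as np

def draw_offsets(n, r_max, rng):
    """n random offsets: distance r ~ Unif[0, r_max], angle gamma ~ Unif[0, 2*pi]."""
    r = rng.uniform(0.0, r_max, size=n)
    gamma = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack((r * np.cos(gamma), r * np.sin(gamma)))

def sticky_mask(xy, user_ids, cluster_ids, r_max, lambdas, rng):
    """Three-tier perturbation: lambda_1 weights a per-user offset,
    lambda_2 a per-cluster offset, lambda_3 a per-point offset."""
    l1, l2, l3 = lambdas
    users, u_idx = np.unique(user_ids, return_inverse=True)
    clusters, c_idx = np.unique(cluster_ids, return_inverse=True)
    user_off = draw_offsets(len(users), r_max, rng)      # shared by all points of a user
    clus_off = draw_offsets(len(clusters), r_max, rng)   # shared within a cluster
    point_off = draw_offsets(len(xy), r_max, rng)        # independent per point
    return xy + l1 * user_off[u_idx] + l2 * clus_off[c_idx] + l3 * point_off
```

Setting `lambdas=(0, 0, 1)` reproduces the point-wise masking from above, while `lambdas=(1, 0, 0)` translates each user's entire point cloud rigidly, leaving all within-user distances untouched.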
It is easy to see that setting $\lambda_1 = \lambda_2 = 0, \lambda_3 = 1$ reduces the process again to the point-wise random perturbation shown above. On the other hand, $\lambda_1 = 1, \lambda_2 = \lambda_3 = 0$ turns off point-specific and cluster-specific perturbation and instead moves all coordinates associated with a user by the same amount and in the same direction. This effectively moves the centroids of home and work clusters away from the real locations, while keeping their position relative to one another as well as the cluster density unchanged.
Playing around with $\lambda_1$, $\lambda_2$, $\lambda_3$ therefore not only realizes different flavors of data protection, but also privileges, or otherwise inhibits, different types of analyses. If the primary interest is in the typical mobility radius of a user, then offsetting clusters against one another (setting $\lambda_2 > 0$) can distort results, but moving the user as a whole ($\lambda_1 > 0$) does not. If the average density of geo-tagged tweets within a cluster is to be researched, $\lambda_3$ could cause distortion. If we want to prevent the inference of true home or work locations via the cluster centroid, then $\lambda_1$ and / or $\lambda_2$ should not be zero. Some characteristic effects of $\lambda_1, \lambda_2, \lambda_3$ are visualized below. First, we look at using only a single tier:
We can see once more that offsetting only at the level of points approximately retains cluster centroids and is therefore likely to lead to unwanted privacy breaches. Moving clusters avoids the problem, but distorts between-cluster distance. Offsetting at the user-level keeps said distance intact. Next, we consider using a combination of two tiers:
Naturally, it also makes sense to pick different maximum displacement distances $r_{\max}$ for each of the three tiers. This can be achieved either by using three independently parameterized random perturbations or by drawing all tiers from the one with the highest $r_{\max}$ and then downscaling the other two by setting their $\lambda$s on the continuous scale $0 < \lambda < 1$, such that $(x'_i, y'_i)$ is the result of a weighted perturbation.
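The downscaling trick works because multiplying an offset drawn with radius $r_{\max}$ by $\lambda$ is distributionally the same as drawing it directly with radius $\lambda \cdot r_{\max}$. A quick numerical check with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
r_max = 500.0   # maximum displacement of the widest tier
lam = 0.2       # lambda of a lower tier

# Displacement distances drawn at the widest tier's radius, then downscaled ...
r = rng.uniform(0.0, r_max, size=100_000)
scaled = lam * r

# ... behave like distances drawn directly with r_max' = lam * r_max = 100:
print(scaled.min(), scaled.max())  # spans (almost exactly) [0, 100]
```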
In summary
Effectively anonymizing connected location measurements is a tightrope walk: Clearly, moving around individual points is not sufficient if the number of tweets measured for a given user and cluster is high enough. The cluster centroid converges to the true home / work location by the law of large numbers, and the central limit theorem puts the speed of that convergence at $1/\sqrt{n}$. Two or three dozen geo-located tweets from the same location can therefore already be enough to make point-level masking fully insufficient. Gao et al. considered users who averaged more than 300 geotagged tweets!
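A quick back-of-the-envelope calculation (using the offset model from above, with illustrative numbers) shows how fast this happens. With $r \sim \mathrm{Unif}[0, r_{\max}]$ and $\gamma \sim \mathrm{Unif}[0, 2\pi]$ independent, \[\mathrm{Var}(\Delta x) = \mathrm{E}[r^2] \cdot \mathrm{E}[\cos^2(\gamma)] = \frac{r_{\max}^2}{3} \cdot \frac{1}{2} = \frac{r_{\max}^2}{6}\] so the centroid of $n$ masked points scatters around the true location with a per-axis standard deviation of only $r_{\max} / \sqrt{6n}$. For $r_{\max} = 100$ m, $n = 36$ tweets already push this below $7$ m.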
Applying perturbation at the user-level and / or cluster-level seems to be the better choice. The intensities of the offsets must be balanced against the resulting degradation of analytical potential. However, bringing privacy and potential together for this kind of data remains a tough nut to crack: their "poor anonymizability" (Gramaglia & Fiore, 2015) means one often has to strongly inhibit (some) potentials or compromise equally strongly on privacy promises.
Literature
C.A. Cassa, S.C. Wieland, K.D. Mandl, "Re-identification of home addresses from spatial locations anonymized by Gaussian skew," International Journal of Health Geographics, vol.7, #45, 2008.
M. Gramaglia & M. Fiore, "On the anonymizability of mobile traffic datasets," arXiv:1501.00100v2, 2015.
Y.-A. de Montjoye, C.A. Hidalgo, M. Verleysen, V.D. Blondel, "Unique in the crowd: The privacy bounds of human mobility," Scientific Reports, vol.3, #1376, 2013.
S. Gao, J. Rao, X. Liu, Y. Kang, Q. Huang, J. App, "Exploring the effectiveness of geomasking techniques for protecting the geoprivacy of Twitter users," Journal of Spatial Information Science, vol.19, pp.105-129, 2019.
H. Zang & J. Bolot, "Anonymization of location data does not work: A large-scale measurement study," Proceedings of the 17th annual international conference on Mobile computing and networking, MobiCom '11, pp.145-156, 2011.
D.L. Zimmerman & C. Pavlik, "Quantifying the effects of mask metadata disclosure and multiple releases on the confidentiality of geographically masked health data," Geographical Analysis, vol.40, no.1, pp.52-76, 2007.



