Conference Paper: 'Protecting high-resolution grid data with additive noise while retaining fitness for use'

From March 11th to 13th I had the chance to attend the biennial NTTS conference (New Techniques and Technologies for Statistics) in Brussels. On the second day I presented new results in the session on Statistical Data Confidentiality. You can read the essentials in the short paper in the conference’s Book of Abstracts.

Quoting the introduction:

"In the 2021/22 European census round, countries are producing demographic aggregates for geographic grid cells. These data products pose challenges with respect to confidentiality, since they are unprecedented in terms of spatial granularity and can, furthermore, be combined with aggregates for administrative regions to derive values for even smaller areas by a process called ‘geographic differencing’. Germany and several other countries protect grid data with the Cell Key Method (CKM), a disclosure control method based on additive noise that is suited to protect against geographic differencing risks. We assess the protective effect of CKM w.r.t. geographic differencing, as well as its impact on analytical validity of the resulting ‘noisy’ grid data product, focusing on the highest resolution (100m by 100m)."

Highlights

  • A proposed solution to the problem of Geographic Differencing as formulated by Duke-Williams & Rees (1998) more than 25 years ago. 
Differencing problem: Assume counts for some subpopulation defined by a sensitive attribute, disseminated in two area systems: grid counts $x^{(G)}_1, \ldots, x^{(G)}_4$ and a local administrative unit (LAU) count $x^{(A)}$. Even though both seem reasonably well aggregated with $\mathbf{x}^{(G)} = (5, 4, 4, 4)$ and $x^{(A)} = 16$, the calculation $x_{G \setminus A} = (5 + 4 + 4 + 4) - 16 = 1$ discloses the isolated unit in red. Noisy confidentiality methods replace each count with an unbiased estimate $\hat{x}_j = x_j + \Delta_j$ with fixed variance $V$, and therefore replace the differenced value with the estimate $\hat{x}_{G \setminus A}$, where $\mathbb{E}(\hat{x}_{G \setminus A}) = x_{G \setminus A}$ and, for independent noise terms, $\mathrm{Var}(\hat{x}_{G \setminus A}) = 5V$ (four grid counts plus one LAU count). This variance is large relative to small values (protecting confidentiality) and small relative to large values (allowing high-accuracy population inference).
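
The differencing example can be checked with a small simulation. This is only a sketch under stated assumptions: the noise is drawn independently and uniformly from $\{-2, \ldots, 2\}$ (a hypothetical stand-in with variance $V = 2$, not the actual CKM perturbation), and the names `perturb` and `differenced_estimate` are made up for illustration.

```python
import random
import statistics

GRID_COUNTS = [5, 4, 4, 4]   # true grid cell counts from the example
LAU_COUNT = 16               # true LAU count from the example
D = 2                        # hypothetical noise bound: uniform on {-2, ..., 2}
V = D * (D + 1) / 3          # variance of that uniform distribution (2.0)


def perturb(count, rng):
    """Add independent zero-mean integer noise (a stand-in, not the CKM)."""
    return count + rng.randint(-D, D)


def differenced_estimate(rng):
    """Difference the noisy grid counts against the noisy LAU count."""
    noisy_grid_sum = sum(perturb(x, rng) for x in GRID_COUNTS)
    return noisy_grid_sum - perturb(LAU_COUNT, rng)


true_difference = sum(GRID_COUNTS) - LAU_COUNT   # 17 - 16 = 1

rng = random.Random(2025)    # fixed seed so the run is repeatable
draws = [differenced_estimate(rng) for _ in range(50_000)]
mean_estimate = statistics.fmean(draws)
var_estimate = statistics.pvariance(draws)

print(f"true difference:               {true_difference}")
print(f"mean of noisy differences:     {mean_estimate:.3f}")
print(f"variance of noisy differences: {var_estimate:.3f} (theory: {5 * V:.1f})")
```

With these assumed settings the noisy difference is centered on the true value 1, but its standard deviation is roughly three times that value, so a single published difference says little about the isolated unit.
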
  • Analyzing the effect of noisy confidentiality methods like the Cell Key Method (CKM) or the discrete Gaussian mechanism for Differential Privacy (DP) (Canonne et al., 2020). 

Suppose we aggregate noisy count data $\hat{x}_j$ for small areas of different expected population size, here grouped by routed travel time distance to some point of interest (POI). After noise addition, urban areas (distance to POI up to 2 km, avg. cell size > 10 people) have a more favorable signal-to-noise ratio (SNR) than rural areas (distance to POI more than 2 km, avg. cell size < 10 people). Therefore, while sums of noisy counts converge to their true values as more cells are added, convergence is much slower in rural areas, because confidentiality protection leaves each summand with a less favorable SNR.
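
The effect can be illustrated with a back-of-the-envelope calculation. The sketch below assumes independent zero-mean noise with the same variance $V$ in every cell, so a sum over $n$ cells with average size $\mu$ has relative root-mean-square error $\sqrt{nV} / (n\mu)$. The values for `V`, `URBAN_MEAN` and `RURAL_MEAN` are hypothetical and not taken from the paper.

```python
import math


def relative_rmse(n_cells, mean_count, noise_variance):
    """Relative RMSE of a sum of n noisy cell counts.

    Assumes independent zero-mean noise with equal variance per cell:
    the noise in the sum has variance n * V, the true sum is n * mean.
    """
    return math.sqrt(n_cells * noise_variance) / (n_cells * mean_count)


def cells_needed(target, mean_count, noise_variance):
    """Smallest number of cells whose sum reaches the target relative RMSE."""
    return math.ceil(noise_variance / (mean_count * target) ** 2)


V = 2.0             # hypothetical per-cell noise variance
URBAN_MEAN = 20.0   # hypothetical average cell size (> 10 people)
RURAL_MEAN = 3.0    # hypothetical average cell size (< 10 people)

print("cells  urban   rural")
for n in (1, 10, 100, 1000):
    urban = relative_rmse(n, URBAN_MEAN, V)
    rural = relative_rmse(n, RURAL_MEAN, V)
    print(f"{n:5d}  {urban:.4f}  {rural:.4f}")

urban_cells = cells_needed(0.01, URBAN_MEAN, V)
rural_cells = cells_needed(0.01, RURAL_MEAN, V)
print(f"cells needed for 1% relative RMSE: urban {urban_cells}, rural {rural_cells}")
```

Under these assumptions the relative error shrinks like $1/\sqrt{n}$ in both settings, but it starts from a much higher level in rural cells, so far more cells must be summed before the aggregate becomes comparably accurate.
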

Additional Information

Just like a previous conference contribution, this one was done as part of the AnigeD project, which is funded by the Federal Ministry of Education & Research and the European Union. This time I used data from the 2022 Census of population and housing in Germany, making use of the newest gridded population data. The use case I presented also relied on the openrouteservice and the German grid cell database (Würzler et al., 2023).

The session was also a good opportunity to advertise the recently published 'Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data' (Möhler et al., 2024), out of which this contribution ultimately grew. I also highly recommend Mark van der Loo's contribution to the same session, ‘Towards Statistical Disclosure Control for Complex Networks’, and the related paper (de Jong et al., 2024).

Literature

C.L. Canonne, G. Kamath, T. Steinke, "The Discrete Gaussian for Differential Privacy," 34th Conference on Neural Information Processing Systems (NeurIPS ’20), Vancouver, Canada, 2020.

R.G. de Jong, M.P.J. van der Loo, F.W. Takes, "The anonymization problem in social networks," https://arxiv.org/html/2409.16163v1, 2024.

O. Duke-Williams, P. Rees, "Can Census Offices publish statistics for more than one small area geography? An analysis of the differencing problem in statistical disclosure," International Journal of Geographical Information Science, vol.12, no.6, 1998, pp.579-605.

M. Möhler, J. Jamme, E. de Jonge, A. Mlodak, J. Gussenbauer, P.-P. de Wolf, "Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data," WP2 of STACE project - Grant agreement 899218-2019-BG-Methodology, Task 2.4, Deliverable D2.9, 2024.

B. Würzler, M. Massilge, A. Weiner, J. Bobrich, "German Grid Cell Database: From Spatial to Statistics," 31st International Cartographic Conference (ICC ’23), Aug. 13-18, Cape Town, South Africa, 2023.
