Posts

Support recovery for kernel density estimation - easy in theory, impossible in practice?

Here's a question: Suppose we are given a smoothed density, such as one obtained from kernel density estimation (KDE). Additionally, we have a set of data points that is a superset of the points from which the density was estimated. Is it possible to sort out the original support, i.e. the points that went into the estimation, from those that didn't? Besides showing the possibilities and limitations of signal recovery in an idealized setting, the question is of practical interest in the context of privacy-preserving data publishing and statistical disclosure control. A motivating example: Consider a data set of sensitive address coordinates, for example the coordinates of households with disease cases in an epidemiological context, or a collection of burglary cases in the analysis of crime (both contexts relying heavily on density estimation as an analytical tool). Let $x_i \in \mathbb{R}^2, i = 1, \ldots, n$ denote the sensitive coordinates in question. The KDE is on a regular grid of evalu...
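The setting above can be made concrete with a minimal hand-rolled Gaussian KDE evaluated on a regular grid; the synthetic points, bandwidth, and grid below are illustrative assumptions, not the post's actual data or settings:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 2))  # synthetic stand-in for the sensitive coordinates x_i

h = 0.5  # illustrative bandwidth
gx, gy = np.meshgrid(np.linspace(-3, 3, 40), np.linspace(-3, 3, 40))
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)        # (1600, 2) evaluation grid

# isotropic Gaussian KDE: average of kernels centred on the data points
d2 = ((grid[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # squared distances grid vs. points
dens = np.exp(-d2 / (2 * h**2)).mean(axis=1) / (2 * np.pi * h**2)
print(dens.shape)  # (1600,)
```

The support-recovery question then asks: given `dens` (and the grid) plus a superset of `pts`, which points were actually in `pts`?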

Consistent random sample queries using cell keys

Random sample querying is a method to ensure privacy of personal information when answering statistical queries. Rather than calculating the query response from all records in the database, a random sample is drawn and an estimate, based on sampling theory, is returned. The contribution of any individual data point to the answer thereby becomes uncertain, protecting the privacy of data subjects. The method has an undeniable charm, because sampling has intuitive and verifiable privacy-enhancing properties (Balle et al., 2018). Furthermore, sampling and sample-based estimation are well understood by statisticians. This should facilitate adoption, especially when other privacy-preserving mechanisms - like output perturbation, where noise is added to the query answer - are viewed unfavourably as 'messing with the data.' However, random sample queries suffer from inconsistency issues, which have hitherto hindered their adoption. In this post, I show how the cell key mechanis...
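The consistency idea can be sketched roughly as follows: a fixed pseudo-random key per record makes the inclusion decision deterministic, so repeating a query yields the same sample. This is a hypothetical hash-based sketch, not the cell key mechanism of the post, which may differ in detail:

```python
import hashlib

def record_key(record_id: str, seed: str = "secret") -> float:
    """Deterministic pseudo-random key in [0, 1) per record (hypothetical construction)."""
    h = hashlib.sha256((seed + record_id).encode()).hexdigest()
    return int(h[:12], 16) / 16**12

def in_sample(record_id: str, rate: float) -> bool:
    # record is sampled iff its fixed key falls below the sampling rate
    return record_key(record_id) < rate

ids = [f"r{i}" for i in range(1000)]
s1 = [i for i in ids if in_sample(i, 0.10)]
s2 = [i for i in ids if in_sample(i, 0.10)]
assert s1 == s2  # same query twice: identical sample, consistent answers

# nested consistency: the 5 % sample is a subset of the 10 % sample
s3 = {i for i in ids if in_sample(i, 0.05)}
assert s3 <= set(s1)
```

The point of the construction is that consistency comes for free from determinism, while an adversary without the seed still cannot predict inclusion.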

Now an official Eurostat publication: 'Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data' (2025 Edition)

In a previous post I introduced the Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data. This document was authored by a group of methodologists from European statistical agencies, including myself. I am glad to announce that it is now out as an official Eurostat publication in the Manuals and Guidelines series. Quoting the Summary: "Users of official statistics are often interested in aggregates for small geographic areas, but publishing such aggregates may violate confidentiality. These guidelines help practitioners of statistical disclosure control (SDC) when dealing with geo-referenced data. They outline current practices for finding problematic cases, applying protection methods and assessing their impact." The 2025 Eurostat version of the guidelines can be found here. The GitHub version (intended as a living document) continues to be accessible here. For more info on the contents, check out the previous ...

Herfindahl-Hirschman index as a measure of the diversity of origins at the municipality level [German]

In a recently published edited volume, Prenzel (2024) presented an analysis of the so-called cultural diversity of the German population, operationalized as an inverse concentration measure (fractionalization) over different countries of birth. The measure was computed from results of the 2011 census, which in this case, however, were not available at a very detailed spatial level. For the 2022 census, the required data are now available in much greater detail at the level of municipalities. In the following, I therefore present a short replication of Prenzel's analysis with newer data and higher spatial resolution. In addition, I sketch further analyses that only become possible with the higher spatial resolution. Concentration measure 'fractionalization': It is natural to understand 'diversity' as the opposite of homogeneity. Concentration measures therefore lend themselves to its quantification. The indicator used by Prenzel (2024, p. 78), the 'frac...
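Fractionalization is one minus the Herfindahl-Hirschman index of the birth-country shares, $1 - \sum_i p_i^2$, i.e. the probability that two randomly drawn residents have different countries of birth. A minimal sketch with made-up municipality counts:

```python
def fractionalization(counts):
    """1 - HHI: probability that two randomly drawn residents differ in country of birth."""
    total = sum(counts)
    shares = [c / total for c in counts]
    return 1.0 - sum(s * s for s in shares)

# hypothetical municipality: 800 residents born in country A, 150 in B, 50 in C
print(round(fractionalization([800, 150, 50]), 4))  # → 0.335
```

A perfectly homogeneous municipality scores 0; the measure grows both with the number of origin groups and with the evenness of their shares.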

Conference Paper: 'Protecting high-resolution grid data with additive noise while retaining fitness for use'

From March 11th to 13th I had the chance to visit the biennial NTTS conference (New Techniques and Technologies for Statistics) in Brussels. On the second day I got to present new results in the session on Statistical Data Confidentiality. You can read the essentials in the short paper found in the conference's Book of Abstracts. Quoting the introduction: "In the 2021/22 European census round, countries are producing demographic aggregates for geographic grid cells. These data products pose challenges with respect to confidentiality, since they are unprecedented in terms of spatial granularity and can, furthermore, be combined with aggregates for administrative regions to derive values for even smaller areas by a process called 'geographic differencing'. Germany and several other countries protect grid data with the Cell Key Method (CKM), a disclosure control method based on additive noise that is suited to protect against geographic differencing risks. We assess the protectiv...

Deriving the distribution of coordinate errors in displacement processes with minimum displacement distance

When dealing with data on sensitive subjects, one often encounters artificial measurement error introduced to protect confidentiality. In the field of geomasking specifically, survey units are geo-located and their coordinates randomly displaced. Measurement error models are useful to tackle this issue in an analysis, but they typically require analytical expressions for the distribution of coordinate errors under a given displacement mechanism. In a somewhat recent PhD thesis, Hossain (2023) derived an analytical expression for the error distribution of the most basic geomask, the circular uniform random displacement mask. In this post, I build on his work to derive an analogous formula for the more recent so-called donut mask. Background: The most straightforward geomask is the random displacement geomask, which has been treated in several previous posts. It draws a random angle and a random distance, both from a uniform distribution, then calculates the $x$- and $y$-of...
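The displacement step described above can be sketched in a few lines; with `r_min = 0` this is the basic circular uniform mask, and `r_min > 0` gives a donut mask with a minimum displacement distance (parameter values below are made up):

```python
import math
import random

def displace(x, y, r_max, r_min=0.0):
    """Displace a point by a uniform random angle and a uniform random
    distance in [r_min, r_max]. r_min = 0: circular mask; r_min > 0: donut mask."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    r = random.uniform(r_min, r_max)
    return x + r * math.cos(theta), y + r * math.sin(theta)

random.seed(0)
nx, ny = displace(0.0, 0.0, r_max=250.0, r_min=50.0)  # donut: between 50 and 250 units
```

Note that a uniform distance does not give a uniform density over the annulus (points pile up near the centre); deriving the resulting coordinate-error distribution is exactly what the post is about.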

Now online: 'Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data'

A guidelines document I co-authored together with colleagues from France, Austria, Poland, and the Netherlands is now available online. These Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data are the result of extensive work under the STACE project (Statistical Methods and Tools for Time Series, Seasonal Adjustment and Statistical Disclosure Control), together with my co-authors Julien Jamme, Edwin de Jonge, Andrzej Mlodak, Johannes Gussenbauer and Peter-Paul de Wolf. The document treats statistical methods to protect subject confidentiality when aggregates for small spatial areas are to be published. It extends and supplements the Handbook on Statistical Disclosure Control (currently in its new, second edition). Quoting from the Introduction: "Users of statistical data are often interested in spatial distribution patterns. A policy-maker may be interested in the distribution of income over the neighborhoods of a city, a health care profes...

Coordinate masking of multiple connected signals – Mind the centroid!

A while ago I came across a paper by Gao et al. (2019), who try to anonymize the coordinates of geo-located tweets by Twitter users (meanwhile X users). The data in question connects the geo-coordinates by a (pseudonymized) user ID, meaning an analyst can quickly find user-specific clusters, which typically identify the user's home and/or work location. Toy example of a user's geo-located tweets – clusters clearly identify home location (blue) and a second one (red). The problem is notoriously difficult (see, for instance, Zang & Bolot, 2011, or de Montjoye et al., 2013). Gao et al. can only manage a slight anonymization at the price of completely destroying clusters in the data, arguably squandering its analytical utility. This failure is illustrative. The approach of Gao et al. – and why it doesn't work: The authors in Gao et al. (2019) apply an independent random perturbation per point, of the type I discussed in an earlier post. [Note: They also consider alter...
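Why independent per-point perturbation fails for connected signals can be sketched numerically: with zero-mean noise, the centroid of a user's masked cluster converges back to the true cluster centre as the number of points grows. Coordinates and noise scales below are made up for illustration:

```python
import random

random.seed(1)
home = (52.52, 13.40)  # hypothetical true cluster centre (the 'home location')
points = [(home[0] + random.gauss(0, 0.001), home[1] + random.gauss(0, 0.001))
          for _ in range(200)]  # tweets clustered around home

# independent zero-mean perturbation per point, roughly ten times the cluster spread
masked = [(px + random.uniform(-0.01, 0.01), py + random.uniform(-0.01, 0.01))
          for px, py in points]

cx = sum(p[0] for p in masked) / len(masked)
cy = sum(p[1] for p in masked) / len(masked)
# the centroid of the masked cluster still sits almost on top of 'home'
print(abs(cx - home[0]), abs(cy - home[1]))
```

The noise averages out at rate $1/\sqrt{n}$, so an attacker who groups masked points by user ID and takes the mean recovers the protected location; hence the post's warning to mind the centroid.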