Posts

Support recovery for kernel density estimation - easy in theory, impossible in practice?

Here's a question: Suppose we are given a smoothed density, such as one obtained from kernel density estimation (KDE). Additionally, we have a set of data points that is a superset of the points from which the density was estimated. Is it possible to sort out the original support, i.e. the points that went into the estimation, from those that didn't? Besides showing the possibilities and limitations of signal recovery in an idealized setting, the question is of practical interest in the context of privacy-preserving data publishing and statistical disclosure control. A motivating example: Consider a data set of sensitive address coordinates, for example the coordinates of households with disease cases in an epidemiological context, or a collection of burglary cases in the analysis of crime (both contexts relying heavily on density estimation as an analytical tool). Let $x_i \in \mathbb{R}^2, i = 1, \ldots, n$ denote the sensitive coordinates in question. The KDE is on a regular grid of evalu...
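The setting above can be made concrete with a minimal hand-rolled Gaussian KDE evaluated on a regular grid; the synthetic points, bandwidth, and grid below are illustrative assumptions, not the post's actual data or settings:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 2))  # synthetic stand-in for the sensitive coordinates x_i

h = 0.5  # illustrative bandwidth
gx, gy = np.meshgrid(np.linspace(-3, 3, 40), np.linspace(-3, 3, 40))
grid = np.stack([gx.ravel(), gy.ravel()], axis=1)        # (1600, 2) evaluation grid

# isotropic Gaussian KDE: average of kernels centred on the data points
d2 = ((grid[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # squared distances grid vs. points
dens = np.exp(-d2 / (2 * h**2)).mean(axis=1) / (2 * np.pi * h**2)
print(dens.shape)  # (1600,)
```

The support-recovery question then asks: given `dens` (and the grid) plus a superset of `pts`, which points were actually in `pts`?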

Consistent random sample queries using cell keys

Random sample querying is a method to ensure privacy of personal information when answering statistical queries. Rather than calculating the query response from all records in the database, a random sample is drawn and an estimate, based on sampling theory, is returned. The contribution of any individual data point to the answer thereby becomes uncertain, protecting the privacy of data subjects. The method has an undeniable charm, because sampling has intuitive and verifiable privacy-enhancing properties (Balle et al., 2018). Furthermore, sampling and sample-based estimation are well understood by statisticians. This should facilitate adoption, especially when other privacy-preserving mechanisms - like output perturbation, where noise is added to the query answer - are viewed unfavourably as 'messing with the data.' However, random sample queries suffer from inconsistency issues, which have hitherto hindered their adoption. In this post, I show how the cell key mechanis...
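The consistency idea can be sketched roughly as follows: a fixed pseudo-random key per record makes the inclusion decision deterministic, so repeating a query yields the same sample. This is a hypothetical hash-based sketch, not the cell key mechanism of the post, which may differ in detail:

```python
import hashlib

def record_key(record_id: str, seed: str = "secret") -> float:
    """Deterministic pseudo-random key in [0, 1) per record (hypothetical construction)."""
    h = hashlib.sha256((seed + record_id).encode()).hexdigest()
    return int(h[:12], 16) / 16**12

def in_sample(record_id: str, rate: float) -> bool:
    # record is sampled iff its fixed key falls below the sampling rate
    return record_key(record_id) < rate

ids = [f"r{i}" for i in range(1000)]
s1 = [i for i in ids if in_sample(i, 0.10)]
s2 = [i for i in ids if in_sample(i, 0.10)]
assert s1 == s2  # same query twice: identical sample, consistent answers

# nested consistency: the 5 % sample is a subset of the 10 % sample
s3 = {i for i in ids if in_sample(i, 0.05)}
assert s3 <= set(s1)
```

The point of the construction is that consistency comes for free from determinism, while an adversary without the seed still cannot predict inclusion.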

Now an official Eurostat publication: 'Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data' (2025 Edition)

In a previous post I introduced the Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data. This document was authored by a group of methodologists from European statistical agencies, including myself. I am glad to announce that it is now out as an official Eurostat publication in the Manuals and Guidelines series. Quoting the Summary: "Users of official statistics are often interested in aggregates for small geographic areas, but publishing such aggregates may violate confidentiality. These guidelines help practitioners of statistical disclosure control (SDC) when dealing with geo-referenced data. They outline current practices for finding problematic cases, applying protection methods and assessing their impact." The 2025 Eurostat version of the guidelines can be found here. The GitHub version (intended as a living document) continues to be accessible here. For more info on the contents, check out the previous ...

Herfindahl-Hirschman index as a measure of the diversity of origins at the municipality level [German]

In a recently published edited volume, Prenzel (2024) presented an analysis of the so-called cultural diversity of the German population, operationalized as an inverse concentration measure (fractionalization) over different countries of birth. The measure was computed from results of the 2011 census, which in this case, however, were not available at a very detailed spatial level. For the 2022 census, the required data are now available in much greater detail at the level of municipalities. In the following, I therefore present a short replication of Prenzel's analysis with newer data and higher spatial resolution. In addition, I sketch further analyses that only become possible with the higher spatial resolution. Concentration measure 'fractionalization': It is natural to understand 'diversity' as the opposite of homogeneity. Concentration measures therefore lend themselves to its quantification. The indicator used by Prenzel (2024, p. 78), the 'frac...
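Fractionalization is one minus the Herfindahl-Hirschman index of the birth-country shares, $1 - \sum_i p_i^2$, i.e. the probability that two randomly drawn residents have different countries of birth. A minimal sketch with made-up municipality counts:

```python
def fractionalization(counts):
    """1 - HHI: probability that two randomly drawn residents differ in country of birth."""
    total = sum(counts)
    shares = [c / total for c in counts]
    return 1.0 - sum(s * s for s in shares)

# hypothetical municipality: 800 residents born in country A, 150 in B, 50 in C
print(round(fractionalization([800, 150, 50]), 4))  # → 0.335
```

A perfectly homogeneous municipality scores 0; the measure grows both with the number of origin groups and with the evenness of their shares.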

Conference Paper: 'Protecting high-resolution grid data with additive noise while retaining fitness for use'

From March 11th to 13th I had the chance to visit the biennial NTTS conference (New Techniques and Technologies for Statistics) in Brussels. On the second day I got to present new results in the session on Statistical Data Confidentiality. You can read the essentials in the short paper found in the conference's Book of Abstracts. Quoting the introduction: "In the 2021/22 European census round, countries are producing demographic aggregates for geographic grid cells. These data products pose challenges with respect to confidentiality, since they are unprecedented in terms of spatial granularity and can, furthermore, be combined with aggregates for administrative regions to derive values for even smaller areas by a process called 'geographic differencing'. Germany and several other countries protect grid data with the Cell Key Method (CKM), a disclosure control method based on additive noise that is suited to protect against geographic differencing risks. We assess the protectiv...

Deriving the distribution of coordinate errors in displacement processes with minimum displacement distance

When dealing with data on sensitive subjects, one often encounters artificial measurement error introduced to protect confidentiality. In the field of geomasking specifically, survey units are geo-located and their coordinates randomly displaced. Measurement error models are useful to tackle this issue in an analysis, but they typically require analytical expressions for the distribution of coordinate errors under a given displacement mechanism. In a somewhat recent PhD thesis, Hossain (2023) derived an analytical expression for the error distribution of the most basic geomask, the circular uniform random displacement mask. In this post, I build on his work to derive an analogous formula for the more recent so-called donut mask. Background: The most straightforward geomask is the random displacement geomask, which has been treated in several previous posts. It draws a random angle and a random distance, both from a uniform distribution, then calculates the $x$- and $y$-of...
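The displacement step described above can be sketched in a few lines; with `r_min = 0` this is the basic circular uniform mask, and `r_min > 0` gives a donut mask with a minimum displacement distance (parameter values below are made up):

```python
import math
import random

def displace(x, y, r_max, r_min=0.0):
    """Displace a point by a uniform random angle and a uniform random
    distance in [r_min, r_max]. r_min = 0: circular mask; r_min > 0: donut mask."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    r = random.uniform(r_min, r_max)
    return x + r * math.cos(theta), y + r * math.sin(theta)

random.seed(0)
nx, ny = displace(0.0, 0.0, r_max=250.0, r_min=50.0)  # donut: between 50 and 250 units
```

Note that a uniform distance does not give a uniform density over the annulus (points pile up near the centre); deriving the resulting coordinate-error distribution is exactly what the post is about.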

Now online: 'Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data'

A guidelines document I co-authored together with colleagues from France, Austria, Poland, and the Netherlands is now available online. These Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data are the result of extensive work under the STACE project (Statistical Methods and Tools for Time Series, Seasonal Adjustment and Statistical Disclosure Control), together with my co-authors Julien Jamme, Edwin de Jonge, Andrzej Mlodak, Johannes Gussenbauer and Peter-Paul de Wolf. The document treats statistical methods to protect subject confidentiality when aggregates for small spatial areas are to be published. It extends and supplements the Handbook on Statistical Disclosure Control (currently in its new, second edition). Quoting from the Introduction: "Users of statistical data are often interested in spatial distribution patterns. A policy-maker may be interested in the distribution of income over the neighborhoods of a city, a health care profes...

Coordinate masking of multiple connected signals – Mind the centroid!

A while ago I came across a paper by Gao et al. (2019), who try to anonymize the coordinates of geo-located tweets by Twitter users (meanwhile X users). The data in question connects the geo-coordinates by a (pseudonymized) user ID, meaning an analyst can quickly find user-specific clusters, which typically identify the user's home and/or work location. Toy example of a user's geo-located tweets – clusters clearly identify home location (blue) and a second one (red). The problem is notoriously difficult (see, for instance, Zang & Bolot, 2011, or de Montjoye et al., 2013). Gao et al. can only manage a slight anonymization at the price of completely destroying clusters in the data, arguably squandering its analytical utility. This failure is illustrative. The approach of Gao et al. – and why it doesn't work: The authors in Gao et al. (2019) apply an independent random perturbation per point, of the type I discussed in an earlier post. [Note: They also consider alter...
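Why independent per-point perturbation fails for connected signals can be sketched numerically: with zero-mean noise, the centroid of a user's masked cluster converges back to the true cluster centre as the number of points grows. Coordinates and noise scales below are made up for illustration:

```python
import random

random.seed(1)
home = (52.52, 13.40)  # hypothetical true cluster centre (the 'home location')
points = [(home[0] + random.gauss(0, 0.001), home[1] + random.gauss(0, 0.001))
          for _ in range(200)]  # tweets clustered around home

# independent zero-mean perturbation per point, roughly ten times the cluster spread
masked = [(px + random.uniform(-0.01, 0.01), py + random.uniform(-0.01, 0.01))
          for px, py in points]

cx = sum(p[0] for p in masked) / len(masked)
cy = sum(p[1] for p in masked) / len(masked)
# the centroid of the masked cluster still sits almost on top of 'home'
print(abs(cx - home[0]), abs(cy - home[1]))
```

The noise averages out at rate $1/\sqrt{n}$, so an attacker who groups masked points by user ID and takes the mean recovers the protected location; hence the post's warning to mind the centroid.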