Support recovery for kernel density estimation - easy in theory, impossible in practice?
Here's a question: Suppose we are given a smoothed density, such as from kernel density estimation (KDE). Additionally, we have a set of data points that is a superset of the points from which the density was estimated. Is it possible to sort out the original support, i.e. the points that went into the estimation, from those that didn't?
Besides showing the possibilities and limitations for signal recovery in an idealized setting, the question is of practical interest in the context of privacy-preserving data publishing and statistical disclosure control.
A motivating example
Consider a data set of sensitive address coordinates, for example the coordinates of households with disease cases in an epidemiological context, or a collection of burglary cases in the analysis of crime (both contexts relying heavily on density estimation as an analytical tool).
Let $x_i \in \mathbb{R}^2, i = 1, \ldots, n$ denote the sensitive coordinates in question. The KDE is evaluated on a regular grid of evaluation points $\tilde{x}_g \in \mathbb{R}^2, g = 1, \ldots, G$. For simplicity, we assume an isotropic bandwidth matrix $\mathbf{H}$: $$\mathbf{H} = \begin{pmatrix}h^2 & 0\\ 0 & h^2\end{pmatrix}.$$ The density estimate is $$ \hat{f}_h (\tilde{x}_g) = \frac{1}{n h^2} \sum_{i=1}^n K\left(\frac{\tilde{x}_g - x_i}{h}\right)$$ where $K$ is the kernel function, on which we place no a priori restrictions, but which will often be Gaussian in practice.
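To make the notation concrete, here is a minimal numpy sketch of the estimator above, assuming a Gaussian kernel; the coordinates, grid and bandwidth are purely illustrative (the post's own example code, linked below, is in R):

```python
import numpy as np

def gaussian_kernel(u):
    # Standard bivariate Gaussian kernel; u has shape (..., 2)
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi)

def kde(eval_points, data, h):
    # f_hat(x_g) = 1 / (n h^2) * sum_i K((x_g - x_i) / h)
    n = data.shape[0]
    diffs = (eval_points[:, None, :] - data[None, :, :]) / h  # shape (G, n, 2)
    return gaussian_kernel(diffs).sum(axis=1) / (n * h ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 2))              # n = 7 sensitive coordinates
grid = rng.uniform(-2, 2, size=(20, 2))  # G = 20 evaluation points
f_hat = kde(grid, x, h=0.5)
```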
Now add the reasonable proposition that the coordinates of all dwellings in the region are available, typically in the form of a master address file (MAF) from the local land surveying office or census-taking bureau. Let $x_1, \ldots, x_N$ denote the coordinates from the MAF (with $N > n$ and commonly $N \gg n$). This allows us to rewrite the formula above as a weighted KDE: $$ \hat{f}_h (\tilde{x}_g, \mathbf{w}) = \frac{1}{n h^2} \sum_{i=1}^N K\left(\frac{\tilde{x}_g - x_i}{h}\right)w_i$$ where $\mathbf{w} = (w_1, \ldots, w_N)$ is the vector of binary weights $w_i \in \{0, 1\}$ with $\sum_{i=1}^N w_i = n$. A weight of 1 means that the data point was used in the KDE, 0 that it wasn't. We can now use the observed density $\hat{f}_h$ to make an estimate of $\mathbf{w}$, which, if successful, allows us to identify the $n$ contributors from the size-$N$ MAF. This setup is akin to membership inference (Nergiz et al., 2007), by now a benchmark attack model in privacy-preserving data publishing.
The condition $\hat{\mathbf{w}} = \mathbf{w}$ for full support recovery is not verifiable by an outside analyst, given that the true $\mathbf{w}$ is unknown. However, note that for a candidate solution $\hat{\mathbf{w}}$ we can compute an error metric over the evaluation grid when compared with the observed density, conventionally the mean integrated squared error (MISE; strictly speaking an integrated squared error, as no expectation is taken here), which is \[\begin{aligned}MISE\left(\hat{\mathbf{w}}\right) &= \int \left(\hat{f}_h(x, \hat{\mathbf{w}}) - \hat{f}_h(x, \mathbf{w})\right)^2 dx\\ &\approx \sum_{g=1}^G \Delta^2 \cdot \left(\hat{f}_h(\tilde{x}_g, \hat{\mathbf{w}}) - \hat{f}_h(\tilde{x}_g, \mathbf{w})\right)^2\end{aligned}\] with $\Delta$ being the spacing of the evaluation grid, so that $\Delta^2$ is the area of one grid cell. We have $MISE(\hat{\mathbf{w}}) = 0$ if $\hat{\mathbf{w}} = \mathbf{w}$; the converse holds whenever distinct weight vectors produce distinct densities on the grid.
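The grid approximation of this error metric is straightforward to compute. A hedged numpy sketch, with made-up MAF coordinates and a Gaussian kernel standing in for the real setting:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi)

def weighted_kde(eval_points, data, w, h, n):
    # f_hat(x_g, w) = 1 / (n h^2) * sum_i K((x_g - x_i) / h) * w_i
    diffs = (eval_points[:, None, :] - data[None, :, :]) / h
    return gaussian_kernel(diffs) @ w / (n * h ** 2)

def mise(f_a, f_b, delta):
    # Riemann approximation of the integrated squared error on the grid
    return delta ** 2 * np.sum((f_a - f_b) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(0, 3, size=(13, 2))  # N = 13 candidate points (the MAF)
w_true = np.zeros(13)
w_true[:7] = 1                       # n = 7 actual contributors

delta = 0.25                         # grid spacing
xs = np.arange(0, 3, delta)
grid = np.array([(a, b) for a in xs for b in xs])  # regular evaluation grid
f_obs = weighted_kde(grid, X, w_true, 0.5, 7)      # the published density

w_wrong = np.zeros(13)
w_wrong[6:] = 1                      # a wrong candidate, also of size 7
```

The true weight vector reproduces the observed density exactly, so its error is zero; any other candidate leaves a positive residual.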
An attempt at support recovery
I am going to generalize an idea presented by Hut et al. (2020). For a start, we know that KDE is a kind of weighted average. Specifically, the density at some point $\tilde{x}_g$ is the sum of $N$ overlapping kernels with known distance weights and unknown inclusion weights: \[\hat{f}_h(\tilde{x}_g) = \frac{1}{n h^2} \left( K\left(\frac{\tilde{x}_g - x_1}{h}\right)w_1 + \ldots + K\left(\frac{\tilde{x}_g - x_N}{h}\right)w_N\right).\] Suppose (for now) that we have at least $N$ different evaluation points available, so we can read off an $N$-vector of density values $\mathbf{f}_h = \left(\hat{f}_h(\tilde{x}_1), \ldots, \hat{f}_h(\tilde{x}_N)\right)^T$. This allows us to write down the following linear system of equations (LSE) with $N$ equations and $N$ variables: \[\mathbf{f}_h = \mathbf{K}_h \mathbf{w}\] with the design matrix $\mathbf{K}_h$ set up as follows: \[\mathbf{K}_h = \frac{1}{n h^2} \begin{bmatrix} K\left(\frac{\tilde{x}_1 - x_1}{h}\right) & \cdots & K\left(\frac{\tilde{x}_1 - x_N}{h}\right) \\ \vdots &\ddots & \vdots \\ K\left(\frac{\tilde{x}_N - x_1}{h}\right) & \cdots & K\left(\frac{\tilde{x}_N - x_N}{h}\right) \end{bmatrix}\] Under certain, not too restrictive conditions on the kernel function, the LSE can be solved and the vector \[\hat{\mathbf{w}} = \mathbf{K}_h^{-1} \mathbf{f}_h\] would, in principle, recover the support of the original KDE.
[Remark: If we don't observe a KDE, but a Nadaraya-Watson-style kernel smoothed average of some variable, the only thing that changes is that $w_i \in \mathbb{R}_{\geq 0}$ and we use a row-normalized version of $\mathbf{K}_h$ with $\sum_j k_{ij} = 1 \; \forall i$. The solution to the LSE would then aim to recover the value for said variable at each contributing data point.]
Often, however, we will not have enough evaluation points to set up a well-conditioned LSE and therefore require a generalization of the procedure that works for $G < N$. One avenue follows from viewing the LSE as a regression problem with more predictors than observations. [The similarity between KDE and regression models for statistical disclosure control has been noted before (Green et al., 2024).] In particular, support recovery then becomes a variable selection problem, for which a range of tools are established:
1. Best Subset Selection
The combinatorial optimization formulation of the variable selection problem would be \[\begin{aligned} &\min_w ||\mathbf{f}_h - \mathbf{K}_h\mathbf{w}||^2_2\\ \mathrm{s.t. } \quad & w_i \in \{0, 1\}\;,\;\; \sum_{i=1}^N w_i = n. \end{aligned}\] This is a binary integer program with a quadratic objective; with a solution space of size $\binom{N}{n}$, it is intractable for all but the smallest instances. See Bertsimas et al. (2016) for more on this approach.
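For toy sizes the subset problem can simply be enumerated. The sketch below (numpy, Gaussian kernel, made-up coordinates) brute-forces all $\binom{N}{n}$ subsets, which is only feasible because $\binom{10}{3} = 120$:

```python
import numpy as np
from itertools import combinations

def gaussian_kernel(u):
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi)

rng = np.random.default_rng(3)
N, n, h = 10, 3, 0.6
X = rng.uniform(0, 3, size=(N, 2))
true_set = {1, 4, 8}                         # the actual contributors

diffs = (X[:, None, :] - X[None, :, :]) / h
K = gaussian_kernel(diffs) / (n * h ** 2)
f = K[:, sorted(true_set)].sum(axis=1)       # observed density values

def rss(subset):
    # Residual sum of squares for a candidate support of size n
    return np.sum((f - K[:, list(subset)].sum(axis=1)) ** 2)

best = min(combinations(range(N), n), key=rss)  # exhaustive search
```

The true support attains a residual of exactly zero and is therefore found; the same search over realistically sized $N$ and $n$ would be hopeless.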
2. Lasso
The Lasso (Tibshirani, 1996) is an efficient variable selector for high-dimensional sparse problems. Relaxing the binary weights to $w_i \in \mathbb{R}_{\geq 0}$, we solve \[\min_w \frac{1}{2N} ||\mathbf{f}_h - \mathbf{K}_h\mathbf{w}||^2_2 + \lambda \sum_{i=1}^N |w_i| \] with $\lambda$ the well-known penalty parameter, and read the estimated support off the non-zero entries of the solution. A generalization in form of the Elastic net (Zou & Hastie, 2005) can also be used.
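In practice one would reach for an off-the-shelf solver (glmnet in R, `Lasso` in scikit-learn). For illustration, here is a self-contained coordinate-descent sketch of the non-negative variant, applied to a made-up kernel design matrix as before:

```python
import numpy as np

def nonneg_lasso(X, y, lam, n_iter=2000):
    # Coordinate descent for: min_w 0.5 * ||y - X w||^2 + lam * sum(w), w >= 0
    p = X.shape[1]
    w = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with coordinate j removed
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            # soft-threshold, clipped at zero for the w >= 0 constraint
            w[j] = max(rho - lam, 0.0) / col_sq[j]
    return w

def gaussian_kernel(u):
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi)

rng = np.random.default_rng(4)
N, n, h = 13, 7, 0.5
X = rng.uniform(0, 4, size=(N, 2))
w_true = np.zeros(N)
w_true[rng.choice(N, size=n, replace=False)] = 1
K = gaussian_kernel((X[:, None, :] - X[None, :, :]) / h) / (n * h ** 2)
f = K @ w_true

w_hat = nonneg_lasso(K, f, lam=1e-6)
selected = np.flatnonzero(w_hat > 0.5)   # candidate support
```

With a tiny $\lambda$ this is close to non-negative least squares; with a huge $\lambda$ everything is shrunk to zero. Tuning $\lambda$ between those extremes is exactly the difficulty discussed below.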
3. Backward elimination
If we have a starting solution $\hat{\mathbf{w}}$ with $\sum_{i=1}^N \hat{w}_i = n' > n$, we can try to cut it down by iterating the following steps:
- For each active index $i$ (i.e. each $i$ with $\hat{w}_i = 1$):
- Estimate $\hat{f}_h(\cdot, \hat{\mathbf{w}}_{-i})$, where $-i$ means setting the $i$th weight to zero.
- Calculate $MISE(\hat{\mathbf{w}}_{-i})$.
- Set $i^* := \underset{i}{\arg\min} \; MISE(\hat{\mathbf{w}}_{-i})$.
- Update $\hat{\mathbf{w}} := \hat{\mathbf{w}}_{-i^*}$.
- Iterate until $\sum_{i=1}^N \hat{w}_i = n$.
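The elimination loop can be sketched as follows (numpy, Gaussian kernel, an illustrative lattice of candidate points; the starting solution is the true support plus two extra points):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi)

def backward_eliminate(K, f_obs, w_start, n):
    # Greedily zero out the active weight whose removal reduces the
    # squared error on the grid the most, until n weights remain
    w = w_start.astype(float).copy()
    while w.sum() > n:
        active = np.flatnonzero(w)
        errors = []
        for i in active:
            w_try = w.copy()
            w_try[i] = 0.0
            errors.append(np.sum((K @ w_try - f_obs) ** 2))
        w[active[int(np.argmin(errors))]] = 0.0
    return w

# N = 16 candidate points on a lattice; 5 true contributors, 2 extras
xs = np.arange(4.0)
X = np.array([(a, b) for a in xs for b in xs])
n, h = 5, 0.5
true_idx = [0, 3, 6, 9, 12]
extra_idx = [5, 14]
w_true = np.zeros(16)
w_true[true_idx] = 1
w_start = w_true.copy()
w_start[extra_idx] = 1                       # oversized starting solution

K = gaussian_kernel((X[:, None, :] - X[None, :, :]) / h) / (n * h ** 2)
f_obs = K @ w_true
w_final = backward_eliminate(K, f_obs, w_start, n)
```

In this well-separated configuration the two extra points are eliminated first and the true support is recovered with a residual of zero; with closely intermingled points the greedy choice can go wrong, as discussed below.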
A minimal example
Consider the small example setup below. From a hypothetical population of 13 data points, 7 (coloured red) contribute to the KDE, 6 noise points (black) don't. The density estimate itself is shown in the background.
Only knowing the 13 coordinates, but not their classes, we want to sort the wheat from the chaff. Setting up and solving the LSE directly, as suggested first, gives us the following estimated weights:
As can be seen, we indeed get non-zero weights for the contributing data points only and zero weights for the noise points. Next, we try the variable selection using the Lasso. This gives us the following suggested solution:
*Shrinkage paths of the Lasso estimator (left) and recovered support (right); selected data points are marked blue.*
Two points more than needed (#12 and #10) are selected. This is common Lasso behaviour - it tends to select rather too many variables (see, e.g., Bühlmann & van de Geer, 2011, p.183). One way around this would be to tune the penalty parameter $\lambda$ to yield a solution of size $n$. However, as the shrinkage paths show, the noise point #12 in the center of the density has a strong (yet phoney) signal and two real contributors (#5 and #7) would be de-selected before the noise points.
A smarter strategy is using Lasso to make a pre-selection that is, by intention, a little too large, then running the backward elimination algorithm a couple of times (here, twice). The following figure shows how the Lasso solution is refined:
*Left: 2 iterations of backward elimination, following Lasso pre-selection; right: refined solution with final selection marked in blue.*
The graph on the left plots, for each data point (starting from the Lasso selection), how $MISE$ would change were it discarded (its weight set to zero). As can be seen, in the first iteration, cutting #10 brings the strongest improvement. In the second, cutting #12 does. This leaves us with a $MISE$ of zero, which also verifies that we succeeded in recovering the KDE's support exactly (see above).
Going beyond small examples?
Though sufficient for toy examples, none of the variable selection approaches is good enough in a general setting.
- Best Subset Selection is NP-hard and generally not applicable when both $n$ and $N$ are large, as we would expect them to be in reality.
- Backward elimination will tend to get stuck in local optima with imperfectly recovered support, when started from the full model.
- Lasso produces both false positives and false negatives, because noise points that are close to true sample points give phoney signals.
But the killing blow when going from small examples, like the one above, to problems of real-world size is likely to be the numerical stability of the solution. Consider the column vectors $k_{\cdot i}$, $k_{\cdot j}$ of $\mathbf{K}_h$ for two points $x_i$, $x_j$ that are close together in $\mathbb{R}^2$ (e.g. two neighboring households). Their kernel-smoothed distance weights are nearly the same, meaning their columns in $\mathbf{K}_h$ are strongly correlated.
As active data points and noise points will likely intermingle closely, $\mathbf{K}_h$ is almost certainly ill-conditioned; it may even be close to singular. The so-called irrepresentability condition (see Hastie et al., 2015, p.302) tells us that strong correlations between active and inactive variables generally preclude consistent support recovery.
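The effect is easy to demonstrate numerically: moving one candidate point right next to another makes the corresponding columns of $\mathbf{K}_h$ nearly collinear and inflates the condition number (illustrative coordinates, Gaussian kernel):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi)

def kernel_matrix(X, h, n):
    return gaussian_kernel((X[:, None, :] - X[None, :, :]) / h) / (n * h ** 2)

# Well-separated candidate points vs. the same set with one point moved
# right next to another (think: two neighbouring households)
X_spread = np.array([(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)], dtype=float)
X_close = X_spread.copy()
X_close[4] = (0.02, 0.0)                 # nearly duplicates point (0, 0)

h, n = 0.7, 3
cond_spread = np.linalg.cond(kernel_matrix(X_spread, h, n))
cond_close = np.linalg.cond(kernel_matrix(X_close, h, n))

K_close = kernel_matrix(X_close, h, n)
# correlation of the two near-duplicate columns
corr = np.corrcoef(K_close[:, 0], K_close[:, 4])[0, 1]
```

Already with a single near-duplicate point, the condition number jumps by orders of magnitude and the two columns are correlated almost perfectly.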
Preliminary conclusion
Support recovery for KDE is a neat theoretical problem that shows, in principle, a lot of potential for a solution. The original approach from Hut et al. (2020) considered real-valued and non-zero weights, the latter assumption being somewhat far from reality. As I have tried to show, it can be generalized to a membership inference problem that is amenable to variable selection algorithms. But growing the setup to real-world proportions quickly leads to multicollinearity issues that prevent the active set from being identified.
For privacy-preserving data publishing, this is good news. KDE shows strong a priori robustness against reverse-engineering attempts, even without formal privacy methods. This may not be surprising: It is well-known that large-scale KDE is robust against the presence or absence of a few single data points. And this kind of robustness is in turn strongly related to privacy-preserving computations (Dwork & Lei, 2009). Outlying data points may, however, remain vulnerable.
Methods for signal recovery are also constantly evolving. Techniques like Stability Selection (Meinshausen & Bühlmann, 2010) extend previous notions of identifiability. I therefore do not conclude that support recovery for KDE is categorically "impossible in practice". Rather, I think it makes for a nice test case that is worth keeping track of.
Additional Notes
The material presented in this post was originally prepared for the fourth project meeting of the AnigeD project (Anonymität bei integrierten und georeferenzierten Daten) on April 7-8, 2025. The German-language presentation slides with some additional information may be found here. Code in R for the small example shown above can be found on the statshorts Github page.
Literature
D. Bertsimas, A. King, R. Mazumder, "Best subset selection via a modern optimization lens," The Annals of Statistics, vol.44, no.2, pp.813-852, 2016.
P. Bühlmann & S. van de Geer, Statistics for High-Dimensional Data. Methods, Theory and Applications, Springer, 2011.
C. Dwork & J. Lei, "Differential privacy and robust statistics," Proceedings of the 41st annual ACM symposium on theory of computing, STOC '09, pp.371-380, 2009.
E. Green, F. Ritchie, P. White, "The statbarn: A new model for output statistical disclosure control," Privacy in Statistical Databases, PSD '24, Proceedings, pp.284-293, 2024.
T. Hastie, R. Tibshirani, M. Wainwright, Statistical Learning with Sparsity. The Lasso and Generalizations, Chapman & Hall / CRC Press, 2015.
D.A. Hut, J. Goseling, M.-C. van Lieshout, P.-P. de Wolf, E. de Jonge, "Statistical disclosure control when publishing on thematic maps," Privacy in Statistical Databases, PSD '20, Proceedings, pp.195-205, 2020.
N. Meinshausen & P. Bühlmann, "Stability selection," Journal of the Royal Statistical Society. Series B (Statistical Methodology), vol.72, no.4, pp.417-473, 2010.
M.E. Nergiz, M. Atzori, C. Clifton, "Hiding the presence of individuals from shared databases," Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp.665-676, 2007.
R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B (Methodological), vol.58, no.1, pp.267-288, 1996.
H. Zou & T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society. Series B (Statistical Methodology), vol.67, no.2, pp.301-320, 2005.