Consistent random sample queries using cell keys
Random sample querying is a method to ensure privacy for personal information when answering statistical queries. Rather than calculating the query response from all records in the database, a random sample is drawn and an estimate, based on sampling theory, is returned. The contribution of any individual data point to the answer thereby becomes uncertain, protecting the privacy of data subjects.
The method has an undeniable charm, because sampling has intuitive and verifiable privacy-enhancing properties (Balle et al., 2018). Furthermore, sampling and sample-based estimation are well understood by statisticians. This should facilitate adoption, especially when other privacy-preserving mechanisms, like output perturbation (where noise is added to the query answer), are viewed unfavourably as 'messing with the data.' However, random sample queries suffer from inconsistency issues, which have hitherto hindered their adoption. In this post, I show how the cell key mechanism can be adapted to enforce consistency, making random sample queries more viable.
Random sample queries
The random sample query (RSQ) was first proposed by Denning (1980). Over the years, the method has received only scant attention (e.g. Clifton, 2000, or Dwork, 2011) and was eventually abandoned in favour of output perturbation.
How does an RSQ work? Consider a database of $N$ records and let $\mathcal{Q}$ specify the subset of records that fit the query predicate. First, let's look at a simple count query. In a non-confidential database, we would simply count as follows: $$N_{\mathcal{Q}} = \sum_{i=1}^N \mathbb{I}(i \in \mathcal{Q})$$ where $\mathbb{I}$ is an indicator function. For example, with the query set $\mathcal{Q} := \{i : age_i > 15\}$, $N_{\mathcal{Q}}$ gives the number of all person records with an age value above 15. The equivalent RSQ draws a sample $\mathcal{S}$ with sample fraction $p \in [0, 1]$, then calculates $$\hat{N}_{\mathcal{Q}} = \frac{ n_{\mathcal{Q}}}{p} \quad\text{ with }\quad n_{\mathcal{Q}} = \sum_{i=1}^N \mathbb{I}(i \in \mathcal{S} \land i \in \mathcal{Q}).$$ A sum query for metric variable $x$ works similarly. In the non-confidential setting, we would simply add up: $$S_{\mathcal{Q}}(x) = \sum_{i=1}^N x_i \cdot \mathbb{I}(i \in \mathcal{Q}).$$ The corresponding RSQ calculates a sum for the sample only, then outputs $$\hat{S}_{\mathcal{Q}}(x) = \frac{ s_{\mathcal{Q}}(x)}{p} \quad\text{ with }\quad s_{\mathcal{Q}}(x) = \sum_{i=1}^N x_i \cdot \mathbb{I}(i \in \mathcal{S} \land i \in \mathcal{Q}).$$ $\hat{N}_{\mathcal{Q}}$ and $\hat{S}_{\mathcal{Q}}$ are estimates of the count and sum respectively. Using simple random sampling (SRS) makes the estimation step straightforward. The results are unbiased; furthermore, since we only need to protect the privacy of individual records, the sampling fraction can be quite high, say 80% or higher, implying small standard errors.
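To make the estimators concrete, here is a minimal Python sketch (the post's replication code is in R; the toy data and function names here are my own invention) of the count and sum RSQs using simple random sampling:

```python
import math
import random

def rsq_count(records, predicate, p, rng=random):
    """Estimate N_Q: draw an SRS of size ceil(p*N), count the sampled
    records matching the predicate, and scale up by 1/p."""
    n = math.ceil(p * len(records))
    sample = rng.sample(records, n)
    return sum(1 for r in sample if predicate(r)) / p

def rsq_sum(records, predicate, value, p, rng=random):
    """Estimate S_Q(x): sum the metric variable over sampled records
    matching the predicate, then scale up by 1/p."""
    n = math.ceil(p * len(records))
    sample = rng.sample(records, n)
    return sum(value(r) for r in sample if predicate(r)) / p

# Toy data: (age, income) tuples.
people = [(14, 0.0), (22, 28000.0), (35, 41000.0), (70, 19000.0)]
n_hat = rsq_count(people, lambda r: r[0] > 15, p=0.8)
s_hat = rsq_sum(people, lambda r: r[0] > 15, lambda r: r[1], p=1.0)
```

Note that with only four toy records and $p = 0.8$, $\lceil pN \rceil = 4$, so the "sample" covers the whole table; with realistic $N$ the sample is a proper subset and the estimate varies from draw to draw.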
Consistency problem
One of the problems of RSQs is that querying the same statistic repeatedly would yield different results, because each time a new sample is drawn from which the estimate is calculated. For example, if you want to know the number of people in a certain age group, each time you ask could yield a slightly different count. Denning (1980) suggested a solution to this problem: The query text (e.g. an SQL statement) would be transformed into a selection function, so that identical queries select the same sample. However, syntactically different, but semantically equivalent queries would still lead to different results.
Example: Suppose we requested a 3 by 3 contingency table of counts over all persons in our database. As classifying attributes we select (1) an age variable with the categories $\text{\{15 or younger, 16 to 30, 31 and older\}}$, and (2) a marital status variable with the categories $\text{\{single, married, divorced\}}$. Reasonably, we expect the count in the field $\{age = \text{15 or younger} \land marital\_status = \text{single}\}$ to be the same as if we only queried $\{age = \text{15 or younger}\}$, because we expect the other two categories of marital status to be empty for that age group. But since the query text differs, we end up with a different sample and therefore (likely) a different estimate! This has led to RSQs showing "a low level of consistency" (Adam & Wortmann, 1989, p.553).
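The limitation of Denning's query-text fix is easy to reproduce. The Python sketch below (my own illustration; function and variable names are not from the original paper) seeds the sampler with a hash of the query string, so identical texts reproduce the same sample, but a semantically equivalent rewrite of the query still draws a different one:

```python
import hashlib
import math
import random

def sample_from_query_text(records, query_text, p):
    """Denning-style selection function: seed the sampler with a hash
    of the query text, so identical query strings give identical samples."""
    seed = int.from_bytes(hashlib.sha256(query_text.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    n = math.ceil(p * len(records))
    return rng.sample(records, n)

ages = list(range(100))
# Semantically equivalent predicates, syntactically different texts:
s1 = sample_from_query_text(ages, "age <= 15", 0.8)
s2 = sample_from_query_text(ages, "age <= 15 AND TRUE", 0.8)
c1 = sum(1 for a in s1 if a <= 15) / 0.8
c2 = sum(1 for a in s2 if a <= 15) / 0.8
# c1 and c2 will in general differ, although both queries mean the same.
```

Re-issuing the exact same text always reproduces `s1`; the inconsistency only arises between different-looking but equivalent texts.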
Introducing a cell key mechanism for consistent subsampling
What if we could condition the sample not on the query predicate, but on the set of records $\mathcal{Q}$ selected by it? In the case of our example, $\{age = \text{15 or younger} \land marital\_status = \text{single}\}$ and $\{age = \text{15 or younger}\}$ specify the same subset (assuming that the subset $\{age = \text{15 or younger} \land marital\_status \neq \text{single}\}$ is indeed empty). If we let $\mathcal{Q}$ be the key for sample selection, we derive the same sample and therefore the same estimate in both cases!
This brings us to the cell key method (CKM), known primarily as a variant of output perturbation. In output perturbation, a statistic is first calculated as $N_{\mathcal{Q}}$ or $S_{\mathcal{Q}}$ from the full database, but then returned with a small noise value added on top. CKM has been used for several years to identify or 'tag' logically equivalent outputs, such that they would receive the same noise and thereby avoid inconsistencies (see, e.g., Meindl, 2025). What I want to show here is how to adapt this mechanism to provide similar consistency for RSQs.
We start by appending a random number to each record. These numbers are uniformly distributed between 0 and 1 and are called record keys. The record keys become a fixed part of the data set for the duration of its shelf life. In the next step, we sort by these record keys, basically shuffling all entries. This is important, because it means we can select contiguous chunks of rows and get a proper random sample (this is known as the Sunter method, following Sunter, 1977).
Now, whenever we issue any query, we also issue the corresponding sum query for record keys. Suppose we want a count $N_{\mathcal{Q}}$. Then we first issue the query $S_{\mathcal{Q}}(r)$, where $r$ denotes the record key variable. We re-transform the sum to the interval $[0, 1)$ by keeping only the fractional part. The resulting number $K_{\mathcal{Q}}$ is called the cell key associated with the queried set $\mathcal{Q}$. Formally: $K_{\mathcal{Q}} = S_{\mathcal{Q}}(r) \text{ mod } 1$. We can prove that the fractional part of a sum of uniform random variables on $[0, 1]$ is again uniform on $[0, 1]$ (see Appendix).
Selecting a sample based on the cell key is then straightforward. Let $n = \lceil pN \rceil$ be the sample size implied by sample fraction $p$. We start with the first record in the sorted database for which $r_i \geq K_{\mathcal{Q}}$, i.e. the first whose record key is larger than (or equal to) the cell key. Beginning at that row, we pick the $n$ next entries as our sample. Should we reach the last ($N$th) row before having reached a sample size of $n$, we simply loop over to the first row and take however many we still need from the top of the data set. Since the record keys have effectively shuffled the rows, the chunk such selected represents a proper random sample. And since the mechanism, given a particular cell key, is fully deterministic, two queries that specify the same set of contributors $\mathcal{Q}$, however syntactically different they may be, return the same sample. Hence, consistency is restored.
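Putting the pieces together, here is an illustrative Python sketch of the whole mechanism (the class and all names are my own; the post's actual implementation is in R): record keys are attached and sorted once, the cell key is the fractional part of the record-key sum over $\mathcal{Q}$, and the sample is the contiguous, wrapping chunk starting at the first record key $\geq K_{\mathcal{Q}}$:

```python
import math
import random

class CellKeyRSQ:
    """Sketch of the cell key mechanism for consistent RSQs."""

    def __init__(self, records, rng=None):
        rng = rng or random.Random()
        # Attach a fixed uniform(0,1) record key to each record and
        # sort by it once (the Sunter shuffle).
        keyed = sorted((rng.random(), rec) for rec in records)
        self.keys = [k for k, _ in keyed]
        self.records = [r for _, r in keyed]

    def cell_key(self, predicate):
        # K_Q = S_Q(r) mod 1: fractional part of the record-key sum
        # over the query set Q.
        return sum(k for k, r in zip(self.keys, self.records)
                   if predicate(r)) % 1.0

    def sample(self, predicate, p):
        N = len(self.records)
        n = math.ceil(p * N)
        K = self.cell_key(predicate)
        # First row whose record key is >= K; wrap to row 0 if none.
        start = next((i for i, k in enumerate(self.keys) if k >= K), 0)
        return [self.records[(start + j) % N] for j in range(n)]

    def count(self, predicate, p):
        n_q = sum(1 for r in self.sample(predicate, p) if predicate(r))
        return n_q / p
```

Because the sample depends only on the cell key, and the cell key only on the set $\mathcal{Q}$, any two predicates selecting the same records return the same estimate.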
The schematic below shows a toy example to support the intuition.
Extended example
Let us now examine how this would work with actual data. In particular, I use the eusilcS data set provided with the R package simPop, structured like a common large-scale survey database (for replicable code see the Implementation Remarks section below). The following table shows count queries evaluated on the full data set, structured by age and economic status variables:
A corresponding set of results from RSQs (without the cell key mechanism) would look like this:
Note in particular that the number of people aged 80 to 100 (rightmost column) in retirement, inactivity or care work should equal the total number of people aged 80 to 100 (value in the bottom row). Due to different samples, however, this is not the case (335 vs. 332).
Now we repeat the queries, but using the cell key mechanism. We append random record keys to the data and calculate cell keys as described above. The following table shows them for each individual count query:
Each of these cell keys maps deterministically to a starting record for the sampling stage. The next table shows the row numbers (from 9522 rows total) at which a sample for the corresponding statistic would start:
We can see that the logically equivalent queries $age \in [80, 100]$ and $age \in [80, 100] \land economic\_status = \text{retired, inactive, or in care work}$ result in the same cell key (0.746415) and thereby in the same sample (starting from record #7107). In the final step, we run the RSQ (here using a sample fraction of 80%) and calculate estimates:
The logically equivalent queries now result in the same count (349). An added benefit accrues if we query metric variables, for instance the average net income:
Since the same records as in the count queries contribute to each queried average, we again end up with the same cell keys. Hence, income is estimated from the same sample as the count to which it pertains.
Privacy implications of adding cell keys to RSQs
There are two common privacy violation problems related to random sample queries. The first arises when logically equivalent queries give different answers. The second occurs when samples do not vary enough between queries with near-perfect overlap.
Firstly, if many logically equivalent queries were to be answered from different samples, each yielding an unbiased estimate, the noise could be cancelled out by averaging over the answers. This may reveal confidential small counts or narrow the noise range, weakening the privacy protection afforded by sampling (see the discussion in Denning, 1980, section 7). The cell key mechanism prevents this problem by giving the same noisy answer each time.
Secondly, breaches of confidentiality in statistical databases rely most heavily on finding two queries that differ by only a single record. Comparing the query results then allows inferring values for that record. It is therefore desirable that two queries that overlap on all but one record should not be evaluated on strongly correlated samples. With respect to the cell key mechanism, it can easily be shown that the samples are, in fact, independent: if we add a single record to a query (or remove one from it), the cell key is augmented (or diminished) by that record's record key, which is by design independent of all others. Since that record key is uniform between 0 and 1, this amounts to a complete re-roll of the sample.
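This re-roll property can be illustrated numerically. The short Python sketch below (my own construction, not from the post's code) shows that dropping one record from a query shifts the cell key by exactly that record's own key, modulo 1:

```python
import random

# Record keys for a query set Q of 1000 records.
rng = random.Random(42)
keys = [rng.random() for _ in range(1000)]

K_full = sum(keys) % 1.0        # cell key of Q
K_minus = sum(keys[:-1]) % 1.0  # cell key of Q minus its last record
# The shift between the two cell keys equals the dropped record's key
# (mod 1), which is uniform and independent of all other record keys,
# so the sample starting point is completely re-rolled.
shift = (K_full - K_minus) % 1.0
```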
In summary
Cell keys were invented as look-up values to derive consistent results in output perturbation. But the same method can be extended into a general-purpose approach for tagging logically equivalent query results, for use with any other statistical confidentiality mechanism. In the case of random sample queries (RSQs), it allows us to enforce a 'same statistic, same sample' property, which increases the consistency, and hence the utility, of the results.
Appendix: Proof that cell keys are uniformly distributed on the unit interval
First, we re-write the cell key variable as $Z = \left(\sum_{i=1}^n X_i \right) - \lfloor \sum_{i=1}^n X_i \rfloor$ with $X_i \sim \mathcal{U}(0, 1) \;\forall i = 1, \ldots, n$. It is to be shown that also $Z \sim \mathcal{U}(0, 1)$.
The proof is by induction: For the case $n = 1$ we have $Z = X_1$ and therefore $Z \sim \mathcal{U}(0,1)$ by definition. For the induction step, let $Y := (\sum_{i = 1}^k X_i) - \lfloor \sum_{i = 1}^k X_i \rfloor$ and assume, as the induction hypothesis, that $Y \sim \mathcal{U}(0, 1)$. We prove that the claim then holds for $$\begin{aligned} Z := Y + X - \lfloor Y + X \rfloor = \begin{cases} X + Y &\text{ if }\; X + Y < 1\\ X + Y - 1 &\text{ if }\; X + Y \geq 1 \end{cases} \end{aligned}$$ where $X$ is short for $X_{k+1}$. From $X, Y \in [0, 1]$ we conclude that also $Z \in [0, 1]$. The density functions $f_X$ and $f_Y$ of uniform variables on the unit interval are $$f_X(x) = f_Y(y) = \begin{cases} 1 & \text{ if } 0 \leq x, y \leq 1\\ 0 & \text{ else} \end{cases}$$ Since $X$ and $Y$ are independent, we get $$\begin{aligned} f_{X + Y}(z) &= \int_{-\infty}^\infty f_X(x) \; f_Y(z - x) \;dx\\ &= \begin{cases} z &\text{ if } 0 \leq z < 1 \\ 2 - z &\text{ if } 1 \leq z \leq 2 \\ 0 &\text{else}.\end{cases} \end{aligned}$$ And similarly $$\begin{aligned} f_{X + Y - 1}(z) &= \int_{-\infty}^\infty f_X(x) \; f_Y(z - x + 1) \;dx \\ &= \begin{cases} z + 1 &\text{ if } -1 \leq z < 0 \\ 1 - z &\text{ if } 0 \leq z \leq 1 \\ 0 &\text{else}. \end{cases} \end{aligned}$$ Since we already know that $Z \in [0, 1]$, we only need the first case from $f_{X+Y}$ and the second from $f_{X+Y-1}$. Using this, we write the distribution function of $Z$ as $$\begin{aligned} P(Z \leq u) &= P(0 \leq X + Y \leq u) + P(0 \leq X + Y - 1 \leq u) \\ &= \int_0^u z \; dz + \int_0^u 1-z \;dz \\ &= \int_0^u 1 \;dz = u \end{aligned}$$ which is the distribution function of a uniform random variable on the unit interval. This concludes the proof.
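The result is also easy to check empirically. A quick Monte Carlo sketch in Python (my own, not part of the original appendix) draws sums of independent uniforms, keeps the fractional part, and inspects simple summary statistics that should match a uniform distribution:

```python
import random

# Empirical check: the fractional part of a sum of n independent
# U(0,1) variables should again be U(0,1).
rng = random.Random(1)
n, reps = 7, 100_000
z = [sum(rng.random() for _ in range(n)) % 1.0 for _ in range(reps)]

mean = sum(z) / reps                          # should be close to 1/2
below_half = sum(v < 0.5 for v in z) / reps   # should be close to 1/2
```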
Implementation remarks
R code for the small demonstrator above using eusilcS data can be found on the statshorts GitHub page.
Literature
N.R. Adam & J.C. Wortmann, "Security control methods for statistical databases: A comparative study," ACM Computing Surveys (CSUR), vol.21, no.4, pp.515-556, 1989.
B. Balle, G. Barthe, M. Gaboardi, "Privacy amplification by subsampling: Tight analyses via couplings and divergences," Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, pp.6280-6290, 2018.
C. Clifton, "Protecting against data mining through samples," in: V. Atluri & J. Hale (Eds.), Research Advances in Databases and Information Systems Security, ch.13, pp.193-207, 2000.
D.E. Denning, "Secure statistical databases with random sample queries," ACM Transactions on Database Systems, vol.5, no.3, pp.291-315, 1980.
C. Dwork, "A firm foundation for private data analysis," Communications of the ACM, vol.54, no.1, pp.86-95, 2011.
B. Meindl, "cellKey - An R package to perturb statistical tables," Austrian Journal of Statistics, vol, no.4, pp.136-156, 2025.
A.B. Sunter, "List sequential sampling with equal or unequal probabilities without replacement," Journal of the Royal Statistical Society - Series C (Applied Statistics), vol.26, no.3, pp.261-268, 1977.
