CART synthesis of small-scale georeferences: an experiment using AMELIA data

Inspired by Drechsler & Hu (2021), this post tests a method for creating synthetic geodata based on classification and regression trees (CART).

Background

A common problem in the provision of microdata for public use is the potential for privacy violations. The problem is exacerbated if the data contain detailed geographic information, which is known to be particularly revealing (e.g. VanWey et al., 2005). One approach is to publish synthetic microdata instead (Drechsler, 2011), which are generated from a model trained on the original data in order to keep important relationships intact. Classification and regression trees were suggested for the task by Reiter (2005) and have subsequently shown promising results.

Suppose we want to synthesize variable $X_p$. We use CART to model, in a nonparametric fashion, the conditional distribution $f(X_p | \mathbf{X}_{-p})$, where $\mathbf{X}_{-p}$ is the matrix of predictor variables without the $p$th one. To make sure we have sufficient variety in the leaf nodes, we may impose a minimum node size or prune the tree after fitting. To create synthetic data, we follow each row of the predictor matrix $\mathbf{X}_{-p}$ to its leaf and draw from the observed values of $X_p$ there using a bootstrap method.
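To make the procedure concrete, here is a minimal sketch in plain Python. It is not the code used for this post (which relies on synthpop in R); the function names `fit_tree`, `leaf_donors` and `synthesize` are made up for illustration, the tree is a bare-bones Gini-based classifier on numeric predictors, pruning is left out, and simple sampling with replacement from the leaf stands in for the bootstrap draw.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, min_leaf):
    """Return the (feature, threshold) pair that lowers the weighted Gini
    impurity the most, or None if no admissible split improves on the parent."""
    best, best_score = None, gini(y)
    n = len(y)
    for j in range(len(X[0])):
        # Candidate thresholds: every observed value except the largest.
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            # Enforce the minimum node size on both children.
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best

def fit_tree(X, y, min_leaf=5):
    """Grow a classification tree; leaves keep the observed target values."""
    split = best_split(X, y, min_leaf)
    if split is None:
        return {"donors": list(y)}
    j, t = split
    left = [i for i in range(len(y)) if X[i][j] <= t]
    right = [i for i in range(len(y)) if X[i][j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": fit_tree([X[i] for i in left], [y[i] for i in left], min_leaf),
        "right": fit_tree([X[i] for i in right], [y[i] for i in right], min_leaf),
    }

def leaf_donors(tree, row):
    """Follow one predictor row down the tree and return its leaf's values."""
    while "donors" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["donors"]

def synthesize(tree, X, rng):
    """Draw one synthetic target value per row from the row's leaf."""
    return [rng.choice(leaf_donors(tree, row)) for row in X]

# Toy usage: one numeric predictor, two target categories.
X_toy = [[i] for i in range(10)]
y_toy = ["A"] * 5 + ["B"] * 5
toy_tree = fit_tree(X_toy, y_toy, min_leaf=2)
y_syn = synthesize(toy_tree, X_toy, random.Random(0))
```

Raising `min_leaf` forces more heterogeneous leaves, so the synthetic values deviate more often from the original ones.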

Data and application

Given the notorious confidentiality issues of geographically explicit information, public use data without severe access restrictions are rarely published. For the present purpose we will therefore create our own, building on the freely available AMELIA data set (Burgard et al., 2017 & 2020).

We use the household-level data, which in v0.2.3 contain 38 different variables and a total of 3,781,289 observations. Geographic information is given on multiple levels: 4 regions comprising 11 provinces, 40 districts and 1,592 cities. For our purpose we want more detail, so we simulate it: in each city we create as many unique point locations as there are households in it. For simplicity we assume these to be uniformly distributed over the city area.
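The simulation of point locations can be sketched as follows. This is an illustrative Python version using rejection sampling, not the spatstat-based R code behind this post; the function names and the L-shaped example polygon are hypothetical, and city outlines are assumed to be simple polygons given as lists of vertices in metres.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon,
    which is given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Does this edge cross the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def uniform_points(polygon, n_points, rng):
    """Draw n_points locations uniformly over the polygon by sampling
    from its bounding box and rejecting points that fall outside."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < n_points:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            points.append((x, y))
    return points

# Toy usage: an L-shaped "city" with 50 households.
city = [(0, 0), (2000, 0), (2000, 1000), (1000, 1000), (1000, 2000), (0, 2000)]
households = uniform_points(city, 50, random.Random(1))
```

Because the coordinates are continuous, duplicate locations are practically impossible, which gives the unique point per household that we need.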

Let $(x_i, y_i)$ be the coordinates of the $i$th household (measured in m). As our small-scale spatial reference we use a geographic grid cell of size 1 km by 1 km. It is derived in a straightforward manner as: \[\mathbf{C}_i = \Big(c^{(x)}_i, c^{(y)}_i\Big) = \bigg(\Big\lceil \frac{x_i}{1000}\Big\rceil, \Big\lceil \frac{y_i}{1000} \Big\rceil\bigg)\;\; \forall i = 1,\dots, N\] where $\lceil\cdot\rceil$ is the ceiling function. We encode $\mathbf{C}_i$ as a categorical variable with roughly as many categories as the size of the study area in square km. Subsequently, we use all non-geographic variables in AMELIA to create synthetic $\mathbf{C}_i$ based on the 'real' ones. To cut down on computation time, I synthesize for only one of AMELIA's eleven provinces, comprising a little below 300,000 households. [Note: As synthesizing is usually done independently for subsets, say provinces, the process may easily be parallelized for the whole data set. I restrict myself here to a size sufficient for demonstration.] The CART approach is implemented using the R package synthpop (Nowok et al., 2016).
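The grid cell derivation translates almost literally into code. The following Python sketch is for illustration only; the helper names `grid_cell` and `cell_ids` are my own, and the integer coding of cells is just one possible way to obtain a categorical variable.

```python
import math

def grid_cell(x, y, cell_size=1000.0):
    """Map a coordinate in metres to its grid cell (c_x, c_y)
    by applying the ceiling function to x and y divided by the cell size."""
    return (math.ceil(x / cell_size), math.ceil(y / cell_size))

def cell_ids(coords, cell_size=1000.0):
    """Return the grid cell of every coordinate plus an integer code
    per distinct cell, usable as a categorical variable."""
    cells = [grid_cell(x, y, cell_size) for x, y in coords]
    code = {cell: k for k, cell in enumerate(sorted(set(cells)))}
    return cells, [code[cell] for cell in cells]

# Toy usage: three households, two of them in the same cell.
cells, ids = cell_ids([(1500.0, 250.0), (1999.9, 999.0), (3200.0, 4100.0)])
```

Note that with the ceiling function a coordinate lying exactly on a cell boundary, say $x_i = 1000$, belongs to the lower cell.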

A first look at the results (mapping cumulative household sizes by grid cell) confirms that the spatial distribution of households is well replicated by the partially synthetic data set:

Spatial distribution of cell aggregates (number of people) - original vs. synthetic

Shown below is another heatmap, this time for total disposable household income per square km. If you look closely, you will notice, for instance, a cell in the middle of the eastern border that is dark in the map from original data and grey (meaning empty) in the synthetic variant. Here, a rare combination of features (living in a low-density region, being in the lower tail of the income distribution, ...) has led to at least one record being assigned a different grid cell by the synthesizer. This is an illustration of how partially synthetic data introduce uncertainty in order to protect confidentiality, while at the same time closely reproducing central patterns.

Spatial distribution of cell aggregates (total disposable household income) - original vs. synthetic

The following plots show more distinctly the relation between cell aggregates with original and synthetic georeferences:

Cell aggregates for original vs. synthetic data
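A comparison of this kind can be computed along the following lines. The Python sketch below is illustrative only: the helper names are hypothetical, and summarizing the agreement with a Pearson correlation is my own suggestion here, not a figure reported in this post.

```python
import math
from collections import defaultdict

def cell_totals(cells, values):
    """Sum a household-level variable (e.g. income) within each grid cell."""
    totals = defaultdict(float)
    for cell, value in zip(cells, values):
        totals[cell] += value
    return dict(totals)

def paired_aggregates(orig_cells, syn_cells, values):
    """Align the cell totals under original and synthetic georeferences.
    Cells that are empty in one of the two variants get a total of 0."""
    orig = cell_totals(orig_cells, values)
    syn = cell_totals(syn_cells, values)
    all_cells = sorted(set(orig) | set(syn))
    return (all_cells,
            [orig.get(c, 0.0) for c in all_cells],
            [syn.get(c, 0.0) for c in all_cells])

def pearson(a, b):
    """Pearson correlation of two equally long numeric lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Toy usage: four households, one of which moved from cell 1 to cell 2.
cells, orig_tot, syn_tot = paired_aggregates([1, 1, 2, 3], [1, 2, 2, 3],
                                             [2.0, 3.0, 4.0, 1.0])
agreement = pearson(orig_tot, syn_tot)
```

Only the georeference differs between the two variants, so the same household values enter both sets of totals.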

If we want more uncertainty (for more privacy protection), we can require larger leaf nodes or prune the fitted trees more rigorously. Additionally, as suggested by Drechsler & Hu (2021), we can synthesize additional variables (remember that we have here only synthesized the grid cell ID, while all other variables remained the same, which is likely insufficient for full protection).

To keep the post short, I forgo any formal assessment of privacy risk; see Wang & Reiter (2012) for suggestions. My aim was to indicate that synthetic georeferences can be of high geographic validity. However, their potential for protecting privacy must be seen in a broader context: typically, they will not be able to do all the heavy lifting; 'traditional' protection methods (like adding noise or rounding) for the non-geographic part of the data set will likely still be required.

Implementation remarks

R code for the experiment can be downloaded from my GitHub page. In order to replicate the results of this post, AMELIA v.0.2.3 is needed, available here, together with the dedicated map, published with v.0.2.1 (here). For the simulation of point locations, the spatstat package (Baddeley et al., 2015) is used. Additional information (including a helpful introduction) on the synthpop package can be found on its official website.

Literature

A. Baddeley, E. Rubak, R. Turner, Spatial Point Patterns: Methodology and Applications with R, Chapman and Hall / CRC Press, 2015.

J.P. Burgard, J.-P. Kolb, H. Merkle, R. Münnich, "Synthetic data for open and reproducible methodological research in social sciences and official statistics," AStA - Wirtschafts- und Sozialstatistisches Archiv, vol.11, pp.233-244, 2017.

J.P. Burgard, F. Ertz, H. Merkle, R. Münnich, "AMELIA - Data description v0.2.3.1.," http://amelia.uni-trier.de, 2020.

J. Drechsler, Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation, Springer, 2011.

J. Drechsler & J. Hu, "Synthesizing geocodes to facilitate access to detailed geographical information in large-scale administrative data," Journal of Survey Statistics and Methodology, vol.9, no.3, pp.523-548, 2021.

B. Nowok, G.M. Raab, C. Dibben, "synthpop: Bespoke creation of synthetic data in R," Journal of Statistical Software, vol.74, no.4, 2016.

J.P. Reiter, "Using CART to generate partially synthetic public use microdata," Journal of Official Statistics, vol.21, no.3, pp.441-462, 2005.

L.K. VanWey, R.R. Rindfuss, M.P. Gutmann, B. Entwisle, D.L. Balk, "Confidentiality and spatially explicit data: concerns and challenges," Proceedings of the National Academy of Sciences (PNAS), vol.102, no.43, pp.15337-15342, 2005.

H. Wang & J.P. Reiter, "Multiple imputation for sharing precise geographies in public use data," The Annals of Applied Statistics, vol.6, no.1, pp.229-252, 2012.
