statshorts

Posts

Showing posts from January 2023.

CART synthesis of small-scale georeferences: an experiment using AMELIA data

January 20, 2023

Inspired by Drechsler & Hu (2021), a method for creating synthetic geodata based on classification and regression trees (CART) is tested. Background: A common problem in providing microdata for public use is the risk of privacy violations. The problem is exacerbated if the data contain detailed geographic information, which is known to be particularly revealing (e.g. VanWey et al., 2005). One approach is to publish synthetic microdata instead (Drechsler, 2011), which are generated from a model trained on the original data in order to keep important relationships intact. Classification and regression trees were suggested for the task by Reiter (2005) and have subsequently shown promising results. Suppose we want to synthesize variable $X_p$. We use CART to model, in a nonparametric fashion, the conditional distribution $f(X_p | \mathbf{X}_{-p})$, where $\mathbf{X}_{-p}$ is the matrix of predictor variables without the $p$th one. To make sure we have sufficient var...

Bias reduction in geodata masked for privacy: a simulation-based approach

January 13, 2023

Geographic coordinates of households are often displaced by a random vector before publication to protect the privacy of respondents. When these displaced coordinates are used for statistical modelling, attenuation bias typically results. Inspired by the work of Karra et al. (2020), I discuss a method to reduce this bias based on a map of true prior location probabilities. Background: Consider a data set of $n$ subjects, each identified by a pair of geographic coordinates $(x_{i1}, x_{i2})$. For reasons of data protection, these true coordinates often cannot be released for public use. A common practice is to "scramble" or "mask" the coordinates by adding noise and to release the altered pairs $(x'_{i1}, x'_{i2})$ instead. One particularly prominent method of noise addition is random perturbation: for each observation, a random angle $\gamma$ and a random distance $r$ are drawn, and the location is moved in that direction by that amount. Commonly...
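The random perturbation described in this excerpt can be sketched in a few lines. This is only an illustration: the function name `perturb`, the parameter `max_dist`, and the choice of uniform distributions for both $\gamma$ and $r$ are assumptions made here, not details taken from the post.

```python
import math
import random

def perturb(x1, x2, max_dist, rng):
    """Displace one coordinate pair by a random angle and a random distance.

    Assumption: gamma ~ Uniform[0, 2*pi) and r ~ Uniform[0, max_dist];
    the distributions actually used in the post are not shown in the excerpt.
    """
    gamma = rng.uniform(0.0, 2.0 * math.pi)  # random direction
    r = rng.uniform(0.0, max_dist)           # random displacement length
    return x1 + r * math.cos(gamma), x2 + r * math.sin(gamma)

rng = random.Random(42)  # fixed seed so the example is reproducible
original = [(0.0, 0.0), (10.0, 5.0), (-3.0, 7.5)]  # hypothetical true locations
masked = [perturb(x1, x2, max_dist=2.0, rng=rng) for x1, x2 in original]

for (x1, x2), (m1, m2) in zip(original, masked):
    print(f"({x1:.2f}, {x2:.2f}) -> ({m1:.2f}, {m2:.2f})")
```

Each masked point lies within `max_dist` of its true location, which is what makes a prior map of plausible locations informative for bias reduction.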

Best practices in data visualization, with (negative) examples from reports of the Ministry of Economics [in German]

January 06, 2023

In a republic, one of the government's tasks is to communicate transparently the data relevant to its decisions. A high-quality presentation of the evidence base for political decisions in reports and press releases builds trust among the interested (expert) audience that the underlying evidence has been sufficiently understood, that the responsible ministry staff think clearly and comprehensibly, and that there is no reluctance to let the public take part in this thinking. This post deals with the visual presentation of data in a selected publication. It starts from Edward Tufte's premise: "When we reason about quantitative evidence, certain methods for displaying and analyzing data are better than others. Superior methods are more likely to produce truthful, credible, and precise findings." (Tufte, 1997, p. 27) The Jahreswirtschaftsbericht (Annual Economic Report, hereafter JWB) is at the beginning of a...

The average distance between points randomly placed in diagonally adjacent unit squares

January 02, 2023

The average distance between points randomly placed in certain geometric shapes appears in a wide range of contexts, e.g. in area sampling, experimental design, geology or remote sensing. (For me, it unexpectedly came up during my Master's thesis in the context of spatial regression models.) This post is concerned with the following question: What is the expected distance between two points, each placed randomly within a square that touches the other at a single corner, like same-colored fields on a chessboard? Background: A well-studied mathematical problem is the random distance between points within the same shape (so-called line picking). For a square, this comes to slightly more than half the length of one side (approximately $0.521$ times), as Wikipedia informs us. Less well established is the case where the points are placed in different shapes. The simplest such setup is two squares of the same size placed side by side (like a1 and b1 on a chessboard). For si...
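The question posed in this excerpt can be checked numerically with a simple Monte Carlo simulation. The sketch below is not the derivation from the post; the function name `mean_distance`, the sample size, and the seed are choices made here for illustration. It places one point uniformly in the unit square and a second point uniformly in a copy of that square shifted by `offset`.

```python
import math
import random

def mean_distance(offset, n=200_000, seed=0):
    """Monte Carlo estimate of the expected distance between a uniform point
    in [0,1]^2 and a uniform point in the unit square shifted by `offset`."""
    rng = random.Random(seed)
    dx, dy = offset
    total = 0.0
    for _ in range(n):
        x1, y1 = rng.random(), rng.random()            # point in the first square
        x2, y2 = rng.random() + dx, rng.random() + dy  # point in the shifted square
        total += math.hypot(x2 - x1, y2 - y1)
    return total / n

same_square = mean_distance((0, 0))  # both points in the same unit square
diagonal = mean_distance((1, 1))     # squares touching at a single corner

print(f"same square:         {same_square:.3f}")
print(f"diagonally adjacent: {diagonal:.3f}")
```

The same-square estimate should land close to the value of about $0.521$ quoted above, which serves as a sanity check before trusting the estimate for the diagonal case.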