Ridge regression in the logit model: an application to birth weight data

This post presents a short application of the ridge regression method to modeling the birth weights of infants. It is based on joint work with V. Kazakova. The logistic model employs a well-known data set by Lee & Scott (1986), notorious for its collinearity issues (see Seber & Wild, 1989, p.104ff.).

Background

Ridge regression (Hoerl, 1962) is an estimation method for models with strongly correlated covariates. An accessible introduction to ridge estimation in linear models was written by M. Taboga on StatLect. A broad-view introduction is Hastie (2020).

Ridge works by adding a penalty term to the sum-of-squared-residuals (SSR) minimization problem in linear regression. For generalized linear models (GLMs) we may instead penalize the log-likelihood function. Assume the target vector $y$ to be distributed with density $f(y; \beta)$, where $\beta$ is the vector of coefficients in the linear predictor of the GLM. The maximum likelihood estimate of $\beta$ is found by solving \[\max_{\beta} \sum_{i=1}^n \log f(y_i; \beta)\] The ridge estimate follows when solving the penalized version (Segerstedt, 1992): \[\max_{\beta} \sum_{i=1}^n \log f(y_i; \beta) - \frac{\lambda}{2} \|\beta\|_2^2\] where $\|\cdot\|_2$ is the Euclidean (L2) norm. In the case of a logit link (logistic regression) this becomes (Schaefer et al., 1984): \[\max_{\beta} \sum_{i=1}^n \Big[y_i x_i^T \beta - \log\big(1+\exp(x_i^T \beta)\big)\Big] - \frac{\lambda}{2}\beta^T\beta\]
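The penalized objective above can be maximized directly with Newton's method (penalized IRLS). Below is a minimal NumPy sketch; the post itself uses R and glmnet, so the solver, data, and names here are illustrative only:

```python
import numpy as np

def ridge_logit_fit(X, y, lam, n_iter=50):
    """Maximize sum_i [y_i x_i'b - log(1 + exp(x_i'b))] - (lam/2) b'b
    via Newton's method (penalized IRLS)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))        # fitted probabilities
        grad = X.T @ (y - mu) - lam * beta            # penalized score
        W = mu * (1.0 - mu)                           # IRLS weights
        H = X.T @ (X * W[:, None]) + lam * np.eye(p)  # penalized information
        beta = beta + np.linalg.solve(H, grad)
    return beta

# Synthetic data with two strongly correlated predictors
rng = np.random.default_rng(0)
z = rng.normal(size=200)
X = np.column_stack([z + 0.1 * rng.normal(size=200),
                     z + 0.1 * rng.normal(size=200)])
y = (z + rng.normal(scale=0.5, size=200) > 0).astype(float)

beta_ml = ridge_logit_fit(X, y, lam=0.0)     # plain maximum likelihood
beta_ridge = ridge_logit_fit(X, y, lam=5.0)  # shrunken ridge estimate
```

With $\lambda = 0$ the routine reduces to ordinary IRLS, so the same function covers both estimators compared later in the post.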

Data and application

The data set was collected by Lee & Scott and is reproduced in full in Seber & Wild (1989). It contains $n=50$ observations of three variables: the birth weight (BW) of infants in grams, plus two predictors obtained as ante-natal ultrasound measurements some time before birth: biparietal diameter (BPD) and abdominal circumference (AC).

We aim to model the occurrence of very low birth weight (VLBW), defined as a birth weight below 1500 g. Hence the target to be modeled is $\mathrm{VLBW} := 1(\mathrm{BW} < 1500)$, where $1(\cdot)$ denotes the indicator function.

Splitting the data by VLBW status yields $n=28$ VLBW cases and $n=22$ others.

Problematically, the two predictors are strongly correlated (Pearson correlation: 0.85; Spearman rank correlation: 0.81). This makes maximum likelihood estimates of the regression coefficients highly unstable. To address the issue, we estimate the model with the ridge method rather than plain maximum likelihood (ML).
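Both correlation measures are easy to reproduce; the series below are simulated stand-ins for BPD and AC (the real values are tabulated in Seber & Wild, 1989), so the numbers will differ from the post's:

```python
import numpy as np

# Simulated stand-ins for the BPD and AC measurements (illustrative only)
rng = np.random.default_rng(1)
bpd = rng.normal(loc=80.0, scale=8.0, size=50)
ac = 2.5 * bpd + rng.normal(scale=15.0, size=50)

# Pearson correlation of the raw values
pearson = np.corrcoef(bpd, ac)[0, 1]

# Spearman rank correlation = Pearson correlation of the ranks
ranks_bpd = np.argsort(np.argsort(bpd))
ranks_ac = np.argsort(np.argsort(ac))
spearman = np.corrcoef(ranks_bpd, ranks_ac)[0, 1]
```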

The predictors BPD and AC are strongly correlated, which inhibits sensible inference.

The ridge method stabilizes estimates by penalizing coefficients that deviate far from zero. A penalty parameter $\lambda$ determines the strength of this so-called shrinkage effect. The animation below shows how $\lambda$ relates to the resulting coefficient estimates in our logistic model. An MSE-optimal $\lambda$ for estimation is found via cross-validation (here: 10-fold).

To compare the penalized and non-penalized strategies more generally, we create 100 artificial data sets as bootstrap resamples of the original data (each with $n=50$). On each resample, a logistic regression model with BPD and AC as predictors is fit twice: once via plain IRLS and once via the ridge method.

Subsequently, each model fitted on a bootstrap sample is used to predict the occurrence of VLBW in the original sample. The hypothesis is that more stable coefficients also lead to less volatile out-of-sample performance.
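The bootstrap experiment can be sketched as follows; again this is Python/NumPy instead of the post's R code, with synthetic data instead of the Lee & Scott measurements, and a tiny penalty ($\lambda = 0.01$) standing in for plain IRLS to keep the unpenalized fits numerically tame:

```python
import numpy as np

def fit(X, y, lam, n_iter=30):
    # Newton's method for the L2-penalized logistic log-likelihood
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        H = X.T @ (X * (mu * (1 - mu))[:, None]) + (lam + 1e-8) * np.eye(p)
        beta = beta + np.linalg.solve(H, X.T @ (y - mu) - lam * beta)
    return beta

# Toy stand-in for the data: n = 50, two strongly correlated predictors
rng = np.random.default_rng(3)
z = rng.normal(size=50)
X = np.column_stack([z + 0.1 * rng.normal(size=50),
                     z + 0.1 * rng.normal(size=50)])
y = (z + rng.normal(scale=0.5, size=50) > 0).astype(float)

# 100 bootstrap resamples, each fit with both methods
betas_ml, betas_ridge = [], []
for _ in range(100):
    b = rng.integers(0, 50, size=50)              # resample row indices
    betas_ml.append(fit(X[b], y[b], lam=0.01))    # "almost" plain IRLS
    betas_ridge.append(fit(X[b], y[b], lam=5.0))  # ridge

# Coefficient variability across the resamples
sd_ml = np.std(betas_ml, axis=0)
sd_ridge = np.std(betas_ridge, axis=0)
```

In this toy setup the ridge coefficients vary far less across resamples than the near-unpenalized ones, mirroring the stabilization the post reports for the real data.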

As seen above, the ridge method greatly stabilizes the estimates of the model parameters. Since both BPD and AC are good predictors on their own, the corresponding gain in predictive accuracy is modest, but notable. Overall, the ridge method proves well suited to tackling the collinearity issue.

Implementation remarks

Ridge regression was performed using the R package glmnet (Friedman et al., 2010). Data and R code can be downloaded from my GitHub page.

Literature

J. Friedman, T. Hastie, R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol.33, no.1, pp.1-22, 2010.

T. Hastie, "Ridge regularization: An essential concept in data science," Technometrics, vol.62, no.4, pp.426-433, 2020.

A.E. Hoerl, "Application of ridge analysis to regression problems," Chemical Engineering Progress, vol.58, pp.54-59, 1962.

A.J. Lee & A.J. Scott, "Ultrasound in ante-natal diagnosis," in The Fascination of Statistics (R.J. Brook, G.C. Arnold, T.H. Hassard, and R.M. Pringle, eds.), ch.21, pp.277-294, CRC Press, 1986.

R. Schaefer, L. Roi, R. Wolfe, "A ridge logistic estimator," Communications in Statistics - Theory and Methods, vol.13, no.1, pp.1231-1257, 1984.

G.A.F. Seber & C.J. Wild, Nonlinear Regression, Wiley, 1989.

B. Segerstedt, "On ordinary ridge regression in generalized linear models," Communications in Statistics - Theory and Methods, vol.21, no.8, pp.2227-2246, 1992.

