Class Activity, February 3
In this class activity, we will explore the effects of multicollinearity on a fitted logistic regression model.
The following code generates correlated explanatory variables \(X_1,...,X_n \in \mathbb{R}^3\) from a multivariate normal distribution:
\[X_i \overset{iid}{\sim} N(0, \Sigma)\]
where the covariance matrix \(\Sigma \in \mathbb{R}^{3 \times 3}\) has diagonal entries \(\Sigma_{jj} = Var(X_{i,j}) = 1\) and off-diagonal entries \(\Sigma_{jk} = Cov(X_{i,j}, X_{i,k}) = \rho\). We simulate from the multivariate normal distribution using the rmvnorm(...) function from the mvtnorm package; you may need to install the mvtnorm package first.
We then simulate data \(Y_i \sim Bernoulli(p_i)\), with \(p_i = \dfrac{e^{\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3}}}{1 + e^{\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3}}}\).
library(mvtnorm)

n <- 1000
rho <- 0
beta <- c(0, 1, 1, 1)

# Create the covariance matrix sigma
sigma <- diag(rep(1 - rho, 3)) + matrix(rep(rho, 9), nrow = 3)

# Generate Xs from a multivariate normal distribution
X <- rmvnorm(n, sigma = sigma)

# Make a design matrix X_complete which includes the initial column of 1s for
# the intercept
X_complete <- cbind(1, X)

# Generate probabilities p
p <- exp(X_complete %*% beta)/(1 + exp(X_complete %*% beta))

# Generate Bernoulli response y
y <- rbinom(n, 1, p)

# Fit a model with all the explanatory variables
m1 <- glm(y ~ X, family = binomial)
summary(m1)
Questions
1. Run the above code several times with rho = 0 (no correlation between the explanatory variables), and note the coefficient estimates and their estimated standard errors in the summary output. Do the coefficient estimates generally appear to be close to the true values in the simulation (\(\beta = (0, 1, 1, 1)^T\))?

2. Fit a new logistic regression model for \(Y\), this time using only the first coordinate of \(X\) (i.e., the column X[,1] in R). Is \(\widehat{\beta}_1\) similar with and without the other two explanatory variables in the model?

3. Now change the correlation \(\rho\) to rho = 0.6 in the simulation, and run the code several times. How does the variability of the estimated regression coefficients \(\widehat{\beta}\) change?
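As a starting point for the questions above, one possible sketch reruns the simulation and fits the full and reduced models side by side. The object names m1 and m2, the use of set.seed() for reproducibility, and the choice rho <- 0.6 are our additions here, not prescribed by the activity:

```r
library(mvtnorm)  # for rmvnorm()

set.seed(1)       # for reproducibility; remove to see run-to-run variability
n <- 1000
rho <- 0.6        # try both rho = 0 and rho = 0.6
beta <- c(0, 1, 1, 1)
sigma <- diag(rep(1 - rho, 3)) + matrix(rep(rho, 9), nrow = 3)

# Simulate explanatory variables and the Bernoulli response
X <- rmvnorm(n, sigma = sigma)
p <- exp(cbind(1, X) %*% beta) / (1 + exp(cbind(1, X) %*% beta))
y <- rbinom(n, 1, p)

# Full model (all three explanatory variables) vs. reduced model (X[,1] only)
m1 <- glm(y ~ X, family = binomial)
m2 <- glm(y ~ X[, 1], family = binomial)

# Compare the two estimates of beta_1 and their standard errors
coef(summary(m1))[2, 1:2]
coef(summary(m2))[2, 1:2]
```

Rerunning this without set.seed() several times at each value of rho lets you watch how the standard errors in the full model change as the correlation grows.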