Using latent variables in logistic regression to reduce multicollinearity, A case-control example: breast cancer risk factors


Background: Logistic regression is one of the most widely used models to analyze the relation between one or more explanatory variables and a categorical response in the field of epidemiology, health and medicine. When there is strong correlation among explanatory variables, i.e.multicollinearity, the efficiency of model reduces considerably. The objective of this research was to employ latent variables to reduce the effect of multicollinearity in analysis of a case-control study about breast cancer risk factors.

Methods: The data belonged to a case-control study in which 300 women with breast cancer were compared to same number of controls. To assess the effect of multicollinearity, five highly correlated quantitative variables were selected. Ordinary logistic regression with collinear data was compared to two models contain latent variables were generated using either factor analysis or principal components analysis. Estimated standard errors of parameters were selected to compare the efficiency of models. We also conducted a simulation study in order to compare the efficiency of models with and without latent factors. All analyses were carried out using S-plus.

Results: Logistic regression based on five primary variables showed an unusual odds ratios for age at first pregnancy (OR=67960, 95%CI: 10184-453503) and for total length of breast feeding (OR=0). On the other hand the parameters estimated for logistic regression on latent variables generated by both factor analysis and principal components analysis were statistically significant (P<0.003). Their standard errors were smaller than that of ordinary logistic regression on original variables. The simulation showed that in the case of normal error and 58% reliability the logistic regression based on latent variables is more efficient than that model for collinear variables.

Conclusions: This research indicated that logistic regression based on latent variables is more efficient than logistic regression based on original collinear variables.

Full Text:






  • There are currently no refbacks.