Online Program

A semi-parametric approach to impute mixed continuous and categorical data

Hakan Demirtas, University of Illinois at Chicago 
*Irene B Helenowski, Northwestern University 

Keywords: multiple imputation, semi-parametric, categorical data, ordinal data

We previously developed a method for imputing mixed continuous and binary data based on pairwise correlations and non-parametric transformations of individual variables to normally distributed values (Helenowski and Demirtas, 2013). Here, we propose an extension of this technique to data involving categorical variables with three or more levels. We present a bivariate example with one continuous variable and one categorical variable. First, the median of the continuous variable is computed within each level of the categorical variable, and the levels are ranked according to these medians so that the categorical variable can be treated as ordinal. For example, if medianA < medianC < medianB, then the levels are ranked as A = 1, C = 2, and B = 3. The pairwise correlation between the continuous and ordinal variables is then computed. The data are then transformed to normally distributed values, imputed via joint modeling, and back-transformed to the original scale, using the Barton and Schruben (1993) technique for the continuous variable and quantiles based on the original probabilities of the categorical variable. The algorithm is iterated until the absolute difference between the pairwise correlations from the original and imputed data is less than some constant c, chosen to maximize the coverage rate and minimize the standardized bias.
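The median-based ranking step and the stopping rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (rank_levels_by_median, pearson, converged), the toy data, and the use of the Pearson correlation are assumptions made for illustration, and the normal transformation, joint-model imputation, and Barton and Schruben back-transformation are not shown.

```python
from math import sqrt
from statistics import mean, median


def rank_levels_by_median(values, labels):
    """Map each categorical level to an ordinal rank (1 = smallest median).

    Pairs with a missing value or label (None) are skipped.
    Hypothetical helper, not taken from the abstract.
    """
    groups = {}
    for v, g in zip(values, labels):
        if v is not None and g is not None:
            groups.setdefault(g, []).append(v)
    medians = {g: median(vs) for g, vs in groups.items()}
    ordered = sorted(medians, key=medians.get)
    return {g: rank for rank, g in enumerate(ordered, start=1)}


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def converged(r_original, r_imputed, c):
    """Stopping rule: absolute difference of correlations below c."""
    return abs(r_original - r_imputed) < c


# Toy complete data in which median(A) < median(C) < median(B).
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 4.0, 5.0, 6.0]
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

ranks = rank_levels_by_median(values, labels)   # A -> 1, C -> 2, B -> 3
ordinal = [ranks[g] for g in labels]
r = pearson(values, ordinal)                    # correlation on the ordinal coding
```

In this toy data the ranking reproduces the example in the text (A = 1, C = 2, B = 3). The threshold c passed to converged would have to be tuned as the abstract describes; no particular value is implied here.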