Statistical Analysis of Zero-Inflated Continuous Data

*Lei Liu, Northwestern University 

Keywords: health economics, substance abuse, medical expenditures, generalized linear mixed model

Zero-inflated continuous (or semi-continuous) data arise frequently in medical, economic, and ecological studies. Examples include medical costs, medical care usage, substance abuse, coronary artery calcium scores, and daily precipitation levels. Such data are characterized by a large proportion of zero values, together with continuous nonzero (i.e., positive) values that are often right-skewed and heteroscedastic. Both features suggest that no simple parametric distribution is suitable for describing such “zero-inflated continuous” data.

In this short course we will review statistical methods for analyzing this type of data. We will start with cross-sectional zero-inflated continuous data. Three approaches are presented to account for the point mass at zero: a two-part model, which separately describes the probability that the outcome is positive and the magnitude of the positive values; a sample selection approach (e.g., the Tobit model), in which zero values are treated as “censored” observations; and a zero-inflated Tobit model, which accommodates the characteristics of both the sample selection and the two-part approaches. We will then introduce flexible models to characterize right skewness and heteroscedasticity in the positive values, using, e.g., the log-normal, Gamma, generalized Gamma, and log-skew-normal distributions, the Box-Cox transformation, and nonparametric methods.
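As a rough illustration of the two-part idea described above, the following Python sketch fits a logistic regression for the probability of a positive outcome and a log-normal regression for the positive values, using simulated data. It is not the course's SAS code: the function names (`fit_logistic`, `fit_two_part`, `predict_mean`), the simulated coefficients, and the choice of a homoscedastic log-normal second part are all assumptions made for this example.

```python
import numpy as np


def fit_logistic(X, z, n_iter=25):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    coef = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ coef)))
        weights = p * (1.0 - p)
        gradient = X.T @ (z - p)
        hessian = X.T @ (X * weights[:, None])
        coef = coef + np.linalg.solve(hessian, gradient)
    return coef


def fit_two_part(X, y):
    """Part 1: P(y > 0) by logistic regression. Part 2: log(y) | y > 0 by least squares."""
    positive = y > 0
    alpha = fit_logistic(X, positive.astype(float))
    X_pos, log_y = X[positive], np.log(y[positive])
    beta, *_ = np.linalg.lstsq(X_pos, log_y, rcond=None)
    resid = log_y - X_pos @ beta
    sigma2 = resid @ resid / (len(log_y) - X.shape[1])
    return alpha, beta, sigma2


def predict_mean(X, alpha, beta, sigma2):
    """Overall mean E[y | x] = P(y > 0 | x) * E[y | y > 0, x] under log-normality."""
    p_positive = 1.0 / (1.0 + np.exp(-(X @ alpha)))
    return p_positive * np.exp(X @ beta + sigma2 / 2.0)


# Simulated data (hypothetical parameter values, for illustration only).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_alpha = np.array([0.5, 1.0])   # log-odds of a positive outcome
true_beta = np.array([1.0, 0.5])    # mean of log(y) among positive outcomes
true_sigma = 0.8                    # SD of log(y) among positive outcomes

is_positive = rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_alpha)))
y = np.where(is_positive,
             np.exp(X @ true_beta + true_sigma * rng.normal(size=n)),
             0.0)

alpha_hat, beta_hat, sigma2_hat = fit_two_part(X, y)
mean_hat = predict_mean(X, alpha_hat, beta_hat, sigma2_hat)
print("share of zeros:", np.mean(y == 0))
print("alpha_hat:", alpha_hat, "beta_hat:", beta_hat, "sigma2_hat:", sigma2_hat)
```

The two parts are estimated separately here, which is valid when they share no parameters. The retransformation term `sigma2 / 2` in `predict_mean` relies on the log-normal, constant-variance assumption, so a heteroscedastic or skewed positive part would call for one of the more flexible models listed above.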

The second section covers the modeling of repeated-measures zero-inflated continuous data. Random effects will be used to handle the correlation among repeated measures on the same subject and the correlation across different parts of the model. We will incorporate such random effects into the models introduced in the first section. We will also present joint models of longitudinal zero-inflated continuous data and survival, e.g., in the longitudinal medical cost setting, to account for a possibly dependent terminal event or informative dropout.
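One common way to write a two-part model with correlated random effects is sketched below. The notation is an assumption made for illustration and is not taken from the course materials: $Y_{ij}$ is the outcome for subject $i$ at occasion $j$, $x_{ij}$ is a covariate vector, and $a_i$ and $b_i$ are subject-level random intercepts for the two parts. The positive part is shown as log-normal, which is only one of the possible choices.

```latex
\begin{align*}
\operatorname{logit} \Pr(Y_{ij} > 0 \mid a_i) &= x_{ij}^{\top} \alpha + a_i, \\
\log Y_{ij} \mid (Y_{ij} > 0,\, b_i) &= x_{ij}^{\top} \beta + b_i + \varepsilon_{ij},
  \qquad \varepsilon_{ij} \sim N(0, \sigma^2), \\
\begin{pmatrix} a_i \\ b_i \end{pmatrix}
  &\sim N\!\left(
    \begin{pmatrix} 0 \\ 0 \end{pmatrix},
    \begin{pmatrix}
      \sigma_a^2 & \rho\,\sigma_a \sigma_b \\
      \rho\,\sigma_a \sigma_b & \sigma_b^2
    \end{pmatrix}
  \right).
\end{align*}
```

In this sketch, $a_i$ and $b_i$ induce correlation among repeated measures on the same subject within each part, while $\rho$ captures the correlation between the two parts.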

Finally, we will present applications to real datasets to illustrate our methods. We will use longitudinal medical costs, clustered medical costs, and alcohol drinking data as examples. SAS code will be provided to facilitate the application of these methods. Model comparisons will also be conducted.

The lecturer has 8 years of hands-on experience in the analysis of zero-inflated continuous data, especially medical cost and alcohol drinking data. He is the PI of three grants funded by NIH and AHRQ on this topic. This application-oriented short course is of interest to researchers who wish to apply up-to-date statistical tools to zero-inflated continuous data.