Modeling Multiple Categorical Measurements Using Linear Latent Structure Analysis
Keywords: Multidimensional categorical data, demographic surveys, latent analysis, health state
Linear Latent Structure (LLS) analysis is a method for analyzing high-dimensional categorical data. Such data are abundant i) in behavioral science, especially in demographic surveys, and ii) in genetic studies (whole-genome microarray data). LLS analysis assumes that the measurements reflect a hidden property of the subjects that can be described by a low-dimensional random vector. The components of this vector are interpreted as explanatory variables that can shed light on the mutual correlations observed among the measured categorical variables. LLS analysis is used to discover this hidden property and describe it as precisely as possible. In this report we discuss the formulation of the LLS model, its statistical properties, the algorithm for estimating the model and its implementation, simulation studies, and an application of the LLS model to the National Long Term Care Survey (NLTCS) data. We also discuss the relationship between LLS and Grade of Membership analysis. The basic steps of LLS analysis are i) determining the dimensionality of the explanatory vector, ii) identifying the linear subspace over which the explanatory vector ranges, iii) choosing a basis in that subspace using methods of cluster analysis and/or prior knowledge of the phenomenon of interest, iv) calculating empirical distributions of the so-called LLS scores, which represent individual responses in the linear subspace, and v) investigating properties of the LLS score distributions to capture population and individual effects (e.g., heterogeneity). Simulation studies assess the quality of reconstruction of the major model components (i.e., the low-dimensional subspace and the LLS score distribution). Their results indicate that the reconstruction quality is sufficient for typical sample sizes and demonstrate the potential of the methodology to analyze survey datasets with 1000 or more questions.
This methodology was applied to the 1994 and 1999 NLTCS datasets (5,000+ individuals) with responses to over 200 questions on behavioral factors and on self-reported functional status and comorbidities. We estimated the subspace that carries the latent vectors and obtained an interpretation of its basis as “pure-type individuals” (e.g., healthy, strongly disabled, having chronic diseases). The estimated distribution of the LLS scores reveals the heterogeneity structure of the population. The components of the individual LLS score vectors are used to predict individual lifespans.
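
To make the modeling assumption concrete, the following is a minimal simulation sketch, not the estimation algorithm discussed in the report. It assumes binary questions, two hypothetical pure types with made-up response probabilities, and LLS scores drawn uniformly from the simplex; all names (`PURE_TYPES`, `simulate_responses`, `covariance`) are illustrative. Each simulated individual's response probabilities are a convex combination of the pure-type probabilities, and answers are drawn independently given the scores, so any correlation between questions in the simulated data comes from the latent vector alone.

```python
import random

# Hypothetical pure types: P(answer = 1) on four binary questions.
# These numbers are invented for illustration, not taken from the NLTCS.
PURE_TYPES = [
    [0.05, 0.10, 0.05, 0.10],  # e.g. a "healthy" pure type
    [0.90, 0.80, 0.85, 0.90],  # e.g. a "strongly disabled" pure type
]


def simulate_responses(pure_types, n_subjects, seed=0):
    """Simulate binary answers from a linear latent structure.

    Returns (scores, answers): scores[i] holds K convex weights for
    subject i, answers[i] holds that subject's J answers (0 or 1).
    """
    rng = random.Random(seed)
    k = len(pure_types)
    j = len(pure_types[0])
    scores, answers = [], []
    for _ in range(n_subjects):
        # Assumed score distribution: uniform on the simplex, obtained
        # by normalizing independent exponential draws.
        raw = [rng.expovariate(1.0) for _ in range(k)]
        total = sum(raw)
        g = [r / total for r in raw]
        # Individual probabilities lie in the span of the pure types.
        probs = [sum(g[t] * pure_types[t][q] for t in range(k))
                 for q in range(j)]
        # Given the scores, answers to different questions are independent.
        answers.append([1 if rng.random() < p else 0 for p in probs])
        scores.append(g)
    return scores, answers


def covariance(xs, ys):
    """Population-style (divide by n) covariance of two sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


scores, answers = simulate_responses(PURE_TYPES, 5000)
q0 = [a[0] for a in answers]
q1 = [a[1] for a in answers]
cov01 = covariance(q0, q1)
print("covariance between questions 0 and 1:", round(cov01, 3))
```

In this sketch the covariance between two questions is positive even though answers are independent given the scores, which is the kind of mutual correlation the explanatory vector is meant to account for.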