Abstract:
|
Random forest (RF) and other tree-based methods have become an integral part of social and behavioral research, yet there is currently limited guidance on their use with data collected under complex sampling designs. As a result, applications of RF to sample survey data (including, for example, imputation, behavioral prediction, factor comparisons, and cross-method model validation) may be biased. The purpose of this paper is to evaluate methods for overcoming this limitation and improving the performance of RF, which we refer to as knotted branches, weighted bags, and post-hoc adjustments. Performance will be evaluated on a large but highly realistic synthetic population covering health risks and related behaviors in the United States.
|