Abstract:
|
The E step of EM-type algorithms is time-consuming in massive data applications because it requires multiple passes through the full data. We address this problem by proposing an asynchronous and distributed generalization of the EM called the Distributed EM (DEM). Using DEM, existing EM-type algorithms are easily extended to massive data settings by exploiting the divide-and-conquer technique and widely available computing power, such as grid computing. The DEM algorithm reserves two groups of computing processes called "workers" and "managers" for performing the E step and the M step, respectively. The samples are randomly partitioned into a large number of disjoint subsets and are stored on the worker processes. The E step of DEM algorithm is performed in parallel on all the workers, and every worker communicates its results to the managers at the end of the local E step. The managers perform the M step after they have received results from an r-fraction of the workers, where r is a fixed constant in (0, 1]. The sequence of parameter estimates generated by the DEM algorithm retains many attractive properties of the EM algorithm. (Joint work with Chuanhai Liu and Glen DePalma)
|