Abstract:
|
Modern large-scale datasets are often collected using multiple methods and from multiple data sources. This data heterogeneity makes it difficult for traditional algorithms that assume i.i.d. data to achieve good prediction performance on new, unseen data. One major remedy for this problem is domain adaptation, where one would like to predict the labels of new data after observing its covariates. While theoretical understanding is still largely missing, many recent works have demonstrated empirically that good models can be built for domain adaptation problems in image recognition or text classification. We propose to view the heterogeneous data in the domain adaptation problem from a causal inference perspective using structural causal models. These models not only lead to further theoretical understanding of when many existing domain adaptation algorithms succeed or fail to adapt, but also inspire a new algorithm for semi-supervised domain adaptation. Additionally, we provide theoretical prediction risk guarantees for the proposed method and quantify its improvement over previous methods.
|