Abstract:
|
Data that are not intentionally collected for research, such as electronic health record (EHR) data, have a well-known reputation for being “messy”. Potential consumers of such data often land at opposite ends of a spectrum: some perceive the data as so error-prone as to be unusable (“garbage in, garbage out”), while others treat them as amenable to off-the-shelf analyses that take the data at face value. We argue instead that the answer lies somewhere in the middle: complex data in, nuanced answers out. Through a series of case studies using a large, EHR-based oncology dataset collected from the Flatiron Health network, we illustrate potential pitfalls in analyzing these data and provide general principles for analytic guidance. Discussion will cover issues such as variable assessment timing, measurement error, the importance of incorporating metadata into analyses, and incomplete data.
|