Online Program

All Times ET

Program is Subject to Change

Wednesday, June 16
Wed, Jun 16, 10:30 AM - 12:00 PM
Leveraging Machine Learning to Improve Economic Surveys and Programs

Automating the Extraction of Health Care Provision Data from Standardized Insurance Plan PDF Documents (308058)

*Brandon Kopp, Bureau of Labor Statistics 

Keywords: machine learning, health insurance, autocoding

The National Compensation Survey (NCS), conducted by the U.S. Bureau of Labor Statistics, publishes data on numerous components of compensation, including detailed data on the provisions of health care plans offered by employers (e.g., deductibles and copays). These provisions are increasingly being communicated to NCS data collectors through Summary of Benefits and Coverage (SBC) documents. SBCs are short, semi-standardized pamphlets of plan attributes, and the Patient Protection and Affordable Care Act requires insurance companies to produce one for each health care plan they offer. Current collection procedures entail manually reviewing SBCs and keying the data into a data collection system. This process can take a considerable amount of time and is subject to transcription and other errors. The goal of this project was to determine the feasibility of automating the extraction of data from SBC documents and the coding of the extracted text into categories, in order to improve data collection by decreasing the burden on interviewers and improving data quality. In this presentation, we will describe the development of a proof-of-concept application that reads in an SBC document in PDF format and outputs a data table with plan name, overall deductible, out-of-pocket maximum expenses, and copay and coinsurance values for five medical services.
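
As a rough illustration of the kind of extraction step the abstract describes, the sketch below pulls a few plan fields out of SBC text that has already been converted from PDF to plain text. It is not the presenters' application: it swaps in simple rule-based pattern matching rather than machine learning, and the sample text, function names, and output field names are all hypothetical assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Hypothetical SBC text, as it might look after PDF-to-text conversion.
# Real SBC layouts vary; this sample is an assumption for illustration only.
SAMPLE_TEXT = """
Summary of Benefits and Coverage: Example Plan Gold
What is the overall deductible? $1,000 individual / $2,000 family
What is the out-of-pocket limit for this plan? $5,000 individual / $10,000 family
Primary care visit to treat an injury or illness $25 copay/visit
Specialist visit 20% coinsurance
"""

DOLLAR = re.compile(r"\$(\d[\d,]*)")
PERCENT = re.compile(r"(\d+)\s*%")


def first_dollar_amount(text: str, label_pattern: str) -> Optional[int]:
    """Return the first dollar amount on the first line matching label_pattern."""
    for line in text.splitlines():
        if re.search(label_pattern, line, flags=re.IGNORECASE):
            match = DOLLAR.search(line)
            if match:
                return int(match.group(1).replace(",", ""))
    return None


def classify_cost_sharing(line: str) -> Optional[Tuple[str, int]]:
    """Code a service line as ('copay', dollars) or ('coinsurance', percent)."""
    lowered = line.lower()
    if "copay" in lowered:
        match = DOLLAR.search(line)
        if match:
            return ("copay", int(match.group(1).replace(",", "")))
    if "coinsurance" in lowered:
        match = PERCENT.search(line)
        if match:
            return ("coinsurance", int(match.group(1)))
    return None


def extract_plan_fields(text: str) -> dict:
    """Build one output row with the (hypothetical) field names used here."""
    return {
        "overall_deductible": first_dollar_amount(text, r"overall deductible"),
        "out_of_pocket_max": first_dollar_amount(
            text, r"out-of-pocket (limit|maximum)"
        ),
    }


if __name__ == "__main__":
    print(extract_plan_fields(SAMPLE_TEXT))
    print(classify_cost_sharing("Specialist visit 20% coinsurance"))
```

A rule-based pass like this works only when the wording stays close to the standard template, which is one reason a production system might combine it with a trained classifier for lines that deviate from it.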