Saturday, February 25
PS3 Poster Session 3 and Continental Breakfast Sat, Feb 25, 8:00 AM - 9:15 AM
Conference Center AB

Using Shiny to Efficiently Process Survey Data (303391)

Akbar Akbari Esfahani, UCLA Center for Health Policy Research 
*Carl Ganz, UCLA Center for Health Policy Research 
Ninez Ponce, UCLA Center for Health Policy Research 
Hongjian Yu, UCLA Center for Health Policy Research 

Keywords: R, shiny, survey statistics, SAS, SQL, Excel, VBA

At the UCLA Center for Health Policy Research, we devote a large portion of our resources to conducting the continuous California Health Interview Survey (CHIS). Processing each CHIS cycle requires many months of work and collaboration among statisticians, programmers, and content experts at the Center. As a result, we are always looking for new ways to streamline data processing and increase efficiency. One area where we have found a lot of potential is the R package Shiny. With Shiny, we have developed several locally deployed applications that have substantially increased our efficiency. In our poster, we present several Shiny apps that have helped us with tasks including upcoding, generating public use files, and metadata management. We intend to demonstrate that Shiny is not purely for online dashboards and interactive graphics, but is also an effective alternative to Excel and VBA for statistics-intensive office applications.
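
To make "locally deployed" concrete, the following is a minimal sketch (not the Center's actual code) of a Shiny app run on an analyst's own machine rather than on a web server; the file-upload workflow shown here is an illustrative assumption.

    # Minimal locally run Shiny app: upload a CSV extract and preview it
    library(shiny)

    ui <- fluidPage(
      titlePanel("Local data-processing tool"),
      fileInput("file", "Upload a CSV extract"),
      tableOutput("preview")
    )

    server <- function(input, output) {
      output$preview <- renderTable({
        req(input$file)                      # wait until a file is chosen
        head(read.csv(input$file$datapath))  # show the first rows of the upload
      })
    }

    # shinyApp() opens the app in a local browser; no deployment is required
    shinyApp(ui, server)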

The first application is the public use file (PUF) generator. Each year when we release a PUF, the content experts must review the survey data to determine which variables are appropriate to release and which are too sensitive (due to small cell frequencies or other issues). Developing a PUF requires many iterations between the statisticians and the content experts: the content experts request frequency tables, the statisticians program and produce them, the content experts review the tables and suggest new constructs, the statisticians program and produce tables for the new constructs, and the cycle repeats until no problematic variables remain. This process can take weeks, and that is not all: before the PUF can be released, a document showing all the frequencies and explaining all the new constructs must be presented to the Data Disclosure Review Committee (DDRC). In the past this document was produced ad hoc with SAS and Excel. Using Shiny, we built an application that allows content experts to interactively produce frequency tables and construct new variables. More importantly, the app dynamically generates the SAS code necessary to incorporate the newly constructed variables into our main dataset, and uses R Markdown to generate the review document for the DDRC. This significantly reduces the burden on the statisticians and lowers the probability of error.
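
A hedged sketch of the frequency-table and SAS-generation idea is shown below; the toy dataset and variable names are placeholders, not CHIS data, and the generated SAS is illustrative rather than the app's actual output.

    # Review a variable's frequencies and emit matching SAS code
    library(shiny)

    chis <- data.frame(srsex = sample(c("Male", "Female"), 500, replace = TRUE),
                       srage = sample(18:85, 500, replace = TRUE))

    ui <- fluidPage(
      selectInput("var", "Variable to review", names(chis)),
      tableOutput("freq"),
      verbatimTextOutput("sas")   # SAS code the statistician could reuse
    )

    server <- function(input, output) {
      output$freq <- renderTable({
        as.data.frame(table(chis[[input$var]], dnn = input$var))
      })
      output$sas <- renderText({
        # Dynamically build SAS code for the same frequency table
        sprintf("proc freq data=chis;\n  tables %s;\nrun;", input$var)
      })
    }

    shinyApp(ui, server)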

The next application streamlines open-response upcoding. Many surveys include questions without predefined response categories (e.g., a respondent's occupation), so content experts must review the free-text responses and manually classify them. In the past this was done in Excel. With Shiny, we can take advantage of R's wealth of text-mining packages to create a customized UI that suggests upcodes based on similar responses coded in the past. The app also provides other auxiliary information that helps the content experts in the decision-making process. As more responses are upcoded, the text-mining algorithms have more data to work with and become more effective. Over time, we believe the application will become a platform for natural-language processing, which will improve efficiency even further.
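
As a simplified illustration of the suggestion step, the sketch below uses base R edit distance to propose a code from previously coded responses; the example responses, codes, and the nearest-neighbour rule are assumptions and stand in for the richer text-mining the app actually uses.

    # Suggest an upcode by finding the most similar previously coded response
    suggest_upcode <- function(response, coded) {
      # coded: data.frame with columns `text` (past responses) and `code`
      d <- adist(tolower(response), tolower(coded$text))  # edit distances
      coded$code[which.min(d)]                            # nearest past response wins
    }

    coded <- data.frame(text = c("registered nurse", "truck driver", "high school teacher"),
                        code = c("29-1141", "53-3032", "25-2031"),
                        stringsAsFactors = FALSE)

    suggest_upcode("school teacher", coded)   # returns "25-2031"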

The final application is our survey metadata manager. In the past, metadata for each CHIS cycle was kept in separate Excel documents, managed manually, and there were frequent referential-integrity issues across years. A SQL database is the natural alternative; however, Excel offers many benefits to the content experts because it is flexible and easier to use than a SQL database. Shiny gives us the benefits of both. The metadata is now stored in a SQL database, where referential integrity is enforced, but content experts can use the metadata manager to create an Excel copy of the database and to dynamically generate metadata crosswalks across years. They can update the metadata as they see fit in Excel and then upload the changes to the SQL database. We get the benefits of SQL without fundamentally changing how the content experts interact with the metadata and, more importantly, without requiring the content experts to understand SQL.
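
The round trip between the SQL store and an Excel workbook might look roughly like the sketch below; the package choices (RSQLite, openxlsx, readxl), file names, and table names are assumptions, and the real application adds validation before anything is written back.

    # Export metadata from SQL to Excel, then load the edited workbook back
    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "metadata.sqlite")

    # Export the metadata table to a workbook the content experts can edit
    meta <- dbReadTable(con, "variable_metadata")
    openxlsx::write.xlsx(meta, "metadata.xlsx")

    # ...content experts edit the workbook in Excel...

    # Read the edited workbook back and overwrite the table in SQL
    edited <- readxl::read_excel("metadata.xlsx")
    dbWriteTable(con, "variable_metadata", as.data.frame(edited), overwrite = TRUE)

    dbDisconnect(con)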

In the end, we hope to demonstrate how to integrate Shiny into a SAS-dominated work environment, how to make SQL databases accessible to non-programmers, and why Shiny is a good tool for data-driven office applications.