Concerns with reanalysis for ongoing data transparency initiatives
*Sara Hughes, GSK 

Keywords: anonymization, de-identification, re-analysis

When a sponsor prepares to conduct the primary analysis of a recently completed study, it has access to the full set of collected data and all relevant supporting documentation. As the statistical programmers operationalize the statistical analysis plan (SAP), they can draw on the full study team in the (likely) event that some judgment is needed to interpret the SAP, or in the (even more likely) event that data collected during the course of the study require special handling. Will an independent analyst with no involvement in the original study design and analysis have insight into the judgments that were made?

Many studies in many diseases have multiple analyses over time: one at the primary timepoint, followed by updated analyses after that. Some researchers speak of wanting a “dataset of record”. What does that mean in this context, and what is a reasonable expectation when we ask sponsors to share data which could have multiple cutpoints over many years? Studies conducted in recent years benefit from greater adherence to standards (and, more recently, more common standards) as well as much more rigorous data stewardship. This translates to a higher likelihood that the data shared will be accurate and complete. Older studies do not share this advantage. What are reasonable expectations when it comes to gaining access to older data and study information?

When data are shared with independent researchers, current best practice requires that the data be de-identified, or anonymized, in order to remove any personally identifiable information (PII) and to obfuscate the data sufficiently to make it difficult for anyone to re-identify a specific patient by combining the shared data with other publicly available records. In practice, de-identification is currently performed in a variety of ways.
While there is much similarity among the approaches in use, differences remain, and these differences could be important when deciding what types of re-analysis are appropriate and how to interpret the results. Consider a scenario in which one sponsor redacts the raw AE terms but leaves all coded terms intact, including rare events. Now consider another sponsor who, wishing to minimize the risk of re-identification, redacts the raw AE terms along with any coded terms that represent rare events. A meta-analysis comparing medicines from these two sponsors will likely show one medicine carrying a higher apparent risk for that rare event, a difference driven by redaction policy rather than by the medicines themselves, and almost certainly misleading. How can a researcher be sure which de-identification algorithms were used to produce the datasets they were granted access to? Is it acceptable to allow sponsors a range of choices for what constitutes an anonymized dataset?

With regard to what level of anonymization is acceptable, what is the role of the environment in which data are shared? If a researcher can only access the data in a secure “glove box” environment, does this imply that the data can be “less anonymized” than if the researcher is given more open access?

The quality of any re-analysis is a function of the quality and completeness of the data and metadata that inform it. In light of this, I would suggest that any independent researcher who “discovers” a new finding in a re-analysis which is inconsistent with the original results should consider initiating discussions with the original sponsor(s) to ensure these new findings are indeed newsworthy and not simply a function of one or more of the possible explanations described in this abstract. Such discussions with the sponsor(s) would not challenge the researcher’s independence.
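The divergence between the two sponsors’ redaction policies can be sketched in a few lines; the AE records, coded terms, and rarity threshold below are all hypothetical, chosen only to show how identical underlying safety data can yield different apparent event counts after de-identification:

```python
# Hypothetical sketch: two de-identification policies applied to the same
# adverse-event (AE) records produce different apparent rare-event counts.
from collections import Counter

RARE_THRESHOLD = 3  # hypothetical cutoff: coded terms seen < 3 times are "rare"

def redact_raw_only(records):
    """Sponsor A's policy: drop raw AE terms, keep all coded terms."""
    return [{"coded": r["coded"]} for r in records]

def redact_raw_and_rare(records):
    """Sponsor B's policy: drop raw terms AND any rare coded term."""
    counts = Counter(r["coded"] for r in records)
    return [{"coded": r["coded"]} for r in records
            if counts[r["coded"]] >= RARE_THRESHOLD]

# Hypothetical AE records: one common event, one rare event.
records = (
    [{"raw": "mild headache", "coded": "Headache"}] * 10
    + [{"raw": "liver failure", "coded": "Hepatic failure"}] * 2
)

a = Counter(r["coded"] for r in redact_raw_only(records))
b = Counter(r["coded"] for r in redact_raw_and_rare(records))
# Sponsor A's shared data still contains 2 hepatic-failure events;
# Sponsor B's shared data contains none.
print(a["Hepatic failure"], b["Hepatic failure"])  # prints "2 0"
```

A naive meta-analysis over the two shared datasets would conclude that Sponsor A’s medicine carries a hepatic-failure risk that Sponsor B’s does not, even though the underlying records were identical.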
Indeed, if after the new findings had been “tested” the researcher continued to believe they were newsworthy, this should give the researcher even more confidence to publish the results, allowing for an open scientific discussion. On the other hand, if discussions with the sponsor(s) identified issues with the data, the documentation, or the approach taken in the re-analysis, the researcher would be better informed as they continue their research, and the publication of an errant finding would have been avoided.

This talk will touch on each of these points and propose areas where greater collaboration across data-sharing bodies would be most useful.