NIHPO's Synthetic Health Data Platform is designed to create a parallel universe of health data that is fully synthetic yet scientifically valid.
This page describes the structure and functionality of this Python-based platform, which generates synthetic health data at scale. Among its outputs are standards-compliant CDISC SDTM files in SAS Transport File Format (XPORT).
First, the platform lets users generate realistic but entirely synthetic individuals ("SynthPerson"). Each SynthPerson receives a complete demographic profile, including date of birth, gender, and places of birth and residence ("SynthCity"), along with other user-defined variables.
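As a rough illustration of what a SynthPerson record could look like, here is a minimal sketch using only the Python standard library. The field names, the `make_synth_person` helper, the ID format, the date range, and the SynthCity labels are all assumptions made for this example; they are not the platform's actual schema or API.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical value lists, for illustration only.
GENDERS = ["Male", "Female"]
SYNTH_CITIES = ["SynthCity-001", "SynthCity-002", "SynthCity-003"]


@dataclass
class SynthPerson:
    person_id: str
    date_of_birth: date
    gender: str
    birth_place: str
    residence: str


def make_synth_person(rng: random.Random, index: int) -> SynthPerson:
    """Build one synthetic individual with independently drawn attributes."""
    earliest = date(1930, 1, 1)   # assumed lower bound for birth dates
    latest = date(2020, 12, 31)   # assumed upper bound for birth dates
    span_days = (latest - earliest).days
    return SynthPerson(
        person_id=f"SP-{index:06d}",
        date_of_birth=earliest + timedelta(days=rng.randint(0, span_days)),
        gender=rng.choice(GENDERS),
        birth_place=rng.choice(SYNTH_CITIES),
        residence=rng.choice(SYNTH_CITIES),
    )


# A seeded generator makes the synthetic population reproducible.
rng = random.Random(42)
people = [make_synth_person(rng, i) for i in range(1, 6)]
for person in people:
    print(person)
```

Seeding the generator is a design choice for this sketch: it lets the same synthetic population be regenerated when a test needs to be repeated.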
Second, the platform generates a full Personal Health Record ("SynthPHR") for each SynthPerson. Each SynthPHR combines Real-World Data from government agencies (EMA, FDA) with random values drawn from controlled terminologies (LOINC, SNOMED-CT). The goal of each SynthPHR is to provide a lifelong, realistic, comprehensive medical history for each SynthPerson.
Third, users can define the parameters of a synthetic clinical trial ("SynthTrial") and virtually enroll the synthetic subjects defined previously. The platform generates synthetic results for the trial based on the number of epochs, visits, arms, and other parameters the user defines. Users can vary the trial parameters and quickly generate output files for different scenarios of the same trial.
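The idea of re-running one trial under different parameters can be sketched as follows. This is an illustrative outline only: `SynthTrialParams` and `build_schedule` are hypothetical names, and the sketch builds just a visit schedule rather than trial results. The column names borrow the SDTM variable names USUBJID, ARM, EPOCH, and VISITNUM, but the rows are not a valid SDTM dataset.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthTrialParams:
    arms: tuple
    epochs: tuple
    visits_per_epoch: int


def build_schedule(subject_ids, params, rng):
    """Assign each subject to a random arm and list every planned visit."""
    rows = []
    for subject_id in subject_ids:
        arm = rng.choice(params.arms)
        visit_num = 0
        for epoch in params.epochs:
            for _ in range(params.visits_per_epoch):
                visit_num += 1
                rows.append(
                    {
                        "USUBJID": subject_id,
                        "ARM": arm,
                        "EPOCH": epoch,
                        "VISITNUM": visit_num,
                    }
                )
    return rows


subjects = ["SP-000001", "SP-000002", "SP-000003"]

# Two scenarios of the same trial that differ only in visit density.
scenario_a = SynthTrialParams(
    arms=("Placebo", "Treatment"),
    epochs=("Screening", "Treatment", "Follow-up"),
    visits_per_epoch=2,
)
scenario_b = SynthTrialParams(
    arms=scenario_a.arms,
    epochs=scenario_a.epochs,
    visits_per_epoch=4,
)

rows_a = build_schedule(subjects, scenario_a, random.Random(7))
rows_b = build_schedule(subjects, scenario_b, random.Random(7))
print(len(rows_a), len(rows_b))
```

Keeping the parameters in one immutable object means a new scenario is just a new parameter set applied to the same enrolled subjects.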
Finally, the platform generates several types of output files: CSV, JSON, SAS XPORT, and SQLite.
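To show how one set of records can be written to several of these formats, here is a small sketch using only the Python standard library. It covers CSV, JSON, and SQLite; SAS XPORT is left out because the standard library has no XPORT writer. The `write_outputs` function, the file names, the table name, and the record fields are assumptions for this example, not the platform's own.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical records, for illustration only.
records = [
    {"person_id": "SP-000001", "gender": "Female", "birth_year": 1980},
    {"person_id": "SP-000002", "gender": "Male", "birth_year": 1955},
]


def write_outputs(records, out_dir: Path) -> dict:
    """Write the same records as CSV, JSON, and a SQLite table."""
    out_dir.mkdir(parents=True, exist_ok=True)
    fields = list(records[0].keys())

    csv_path = out_dir / "synth_persons.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)

    json_path = out_dir / "synth_persons.json"
    json_path.write_text(json.dumps(records, indent=2), encoding="utf-8")

    db_path = out_dir / "synth_persons.sqlite"
    conn = sqlite3.connect(str(db_path))
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS synth_person ("
            "person_id TEXT PRIMARY KEY, gender TEXT, birth_year INTEGER)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO synth_person "
            "VALUES (:person_id, :gender, :birth_year)",
            records,
        )
        conn.commit()
    finally:
        conn.close()

    return {"csv": csv_path, "json": json_path, "sqlite": db_path}


out_dir = Path(tempfile.mkdtemp())
paths = write_outputs(records, out_dir)

# Read the SQLite file back to confirm that the rows were stored.
conn = sqlite3.connect(str(paths["sqlite"]))
row_count = conn.execute("SELECT COUNT(*) FROM synth_person").fetchone()[0]
conn.close()
print(row_count)
```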
The data generated by this platform is not intended to replace "real" healthcare data. Rather, the platform aims to encourage and facilitate the use of synthetic health data as a temporary placeholder for real data. We believe synthetic health data can accelerate testing, QA, and end-to-end system validation in life science applications. The platform’s initial focus is to provide synthetic health data across the lifecycle of a clinical trial.
Note that the platform deliberately generates fully random data: for example, a SynthPerson with gender equal to "Male" may be assigned a condition of "pregnancy". Such data is designed to test the assumptions and rules built into software that handles clinical trial data.
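The sketch below shows how a downstream system might use such records to exercise one of its own rules. Both the record layout and the `find_implausible` check are hypothetical and written for this example. The point is only that unconstrained random data gives a validation rule something to catch.

```python
# Hypothetical synthetic records, including one implausible combination.
records = [
    {"person_id": "SP-000001", "gender": "Female", "condition": "pregnancy"},
    {"person_id": "SP-000002", "gender": "Male", "condition": "pregnancy"},
    {"person_id": "SP-000003", "gender": "Male", "condition": "hypertension"},
]


def find_implausible(records):
    """Return records that a plausibility rule in the software under test should flag."""
    return [
        r
        for r in records
        if r["gender"] == "Male" and r["condition"].lower() == "pregnancy"
    ]


flagged = find_implausible(records)
for record in flagged:
    print("flagged:", record["person_id"])
```

If the software under test accepts the flagged record without complaint, that points to a rule that is missing or not enforced.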
This page describes our platform: