Identifying patient-level data risks in trusted research environments: Worked examples with synthetic data
M. Baragilly, A. Topham, S. Gallier, E. Sapey
Objective
This study examined and illustrated real-world risks of unintended patient-level data egress from Trusted Research Environments (TREs) and Secure Data Environments (SDEs), using synthetic data to recreate cases encountered in PIONEER, the HDR UK Hub in Acute Care.
Methods
Synthetic datasets containing demographics and NEWS2 vital signs were created using SciPy and NumPy for two fictitious populations. These datasets were transformed for machine learning and embedded in various file formats to simulate potential egress scenarios. The three worked examples cover binary serialisation of data, binary serialisation of complex objects, and plain-text mark-up reports.
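The second scenario, binary serialisation of a complex object, can be illustrated with a minimal sketch. It is not the study's code: it uses only the Python standard library as a stand-in for the SciPy and NumPy generators, and the field names, value ranges, and the dictionary-based "model" are all assumptions made for illustration.

```python
import pickle
import random


def make_synthetic_patients(n, seed=0):
    """Generate n fictitious patient rows (field names and ranges are illustrative)."""
    rng = random.Random(seed)
    return [
        {
            "patient_id": i,
            "age": rng.randint(18, 95),
            "respiratory_rate": rng.randint(8, 30),
            "heart_rate": rng.randint(40, 140),
        }
        for i in range(n)
    ]


rows = make_synthetic_patients(1000)

# Hypothetical "model": one aggregate statistic, plus the training rows
# that were (carelessly) kept alongside it.
model = {
    "mean_heart_rate": sum(r["heart_rate"] for r in rows) / len(rows),
    "training_rows": rows,  # patient-level data retained inside the object
}

# Binary serialisation: the resulting bytes are not human-readable,
# so the embedded rows are not visible on casual inspection.
blob = pickle.dumps(model)

# Anyone who receives the file can recover every row.
restored = pickle.loads(blob)
recovered_rows = restored["training_rows"]

print(f"serialised size: {len(blob)} bytes")
print(f"patient rows recovered: {len(recovered_rows)}")
```

The object looks like a single summary statistic to its author, yet the serialised file carries the full row-level dataset. Note that `pickle.loads` can execute arbitrary code, so it should only ever be run on files from a trusted source.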
Results
Initial screening of exported files included checking reported sizes. While absolute size alone cannot confirm the presence of patient-level data, unusually large files can signal the need for closer inspection. In several cases, this prompted manual review that uncovered sensitive information. File size is therefore a useful signal within a layered egress checking process, not a diagnostic measure. Standard tools such as Python or R do not warn of hidden data, reinforcing the need for explicit egress policies and independent verification. Converting binary formats works only for recognised code libraries and requires ongoing maintenance. Manual inspection alongside automation remains essential to identify and remove embedded data.
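A size-based first pass of this kind might look like the following sketch. The threshold value, function name, and file names are hypothetical, and the check only selects files for manual review; it does not decide whether a file contains patient-level data.

```python
import os
import pickle
import tempfile

# Hypothetical threshold; a real policy would set this per output type.
SIZE_THRESHOLD_BYTES = 10_000


def files_needing_review(paths, threshold=SIZE_THRESHOLD_BYTES):
    """Return the paths whose on-disk size exceeds the threshold."""
    return [p for p in paths if os.path.getsize(p) > threshold]


with tempfile.TemporaryDirectory() as tmp:
    summary_only = os.path.join(tmp, "summary_only.pkl")
    with_rows = os.path.join(tmp, "with_rows.pkl")

    # An export holding only an aggregate statistic.
    with open(summary_only, "wb") as f:
        pickle.dump({"mean_heart_rate": 82.4}, f)

    # The same export with synthetic row-level records embedded.
    embedded = [{"patient_id": i, "heart_rate": 60 + i % 60} for i in range(5000)]
    with open(with_rows, "wb") as f:
        pickle.dump({"mean_heart_rate": 82.4, "rows": embedded}, f)

    flagged = files_needing_review([summary_only, with_rows])
    flagged_names = [os.path.basename(p) for p in flagged]

print(flagged_names)
```

Only the file with embedded rows is flagged here, but a small file can still hold a handful of identifiable records, which is why the size check is one layer of the process and not a substitute for inspection.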
Conclusion
These cases highlight the complexity of identifying and preventing the egress of identifiable data from TREs. Key insights include the need for clear guidance for researchers, the limitations of binary serialisation for egress due to security vulnerabilities, and the importance of plain-text data exports for ease of verification.