Identifying patient-level data risks in trusted research environments: Worked examples with synthetic data

DOI: 10.1177/20552076261440981 ISSN: 2055-2076

M. Baragilly, A. Topham, S. Gallier, E. Sapey

Objective

This study examined and illustrated real-world risks of unintended patient-level data egress from Trusted Research Environments (TREs) and Secure Data Environments (SDEs), using synthetic data to recreate cases encountered in PIONEER, the HDR UK Hub in Acute Care.

Methods

Synthetic datasets containing demographics and NEWS2 vital signs were created with SciPy and NumPy for two fictitious populations. These datasets were transformed for machine learning and embedded in various file formats to simulate potential egress scenarios. The three worked examples cover binary serialisation of data, binary serialisation of complex objects, and plain-text mark-up reports.
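As a rough illustration of the kind of data described above, the following sketch generates synthetic records and writes them out once as a binary serialisation and once as plain text. It is not the authors' code: the study used SciPy and NumPy, whereas this sketch uses only the Python standard library so that it stays self-contained. The field names, value ranges, and the helper names `make_synthetic_rows`, `to_binary`, and `to_plain_text` are all assumptions made for illustration.

```python
import csv
import io
import pickle
import random

# Hypothetical field names, loosely based on NEWS2 vital signs.
FIELDS = ["patient_id", "age", "resp_rate", "spo2",
          "systolic_bp", "pulse", "temperature"]


def make_synthetic_rows(n, seed=0):
    """Return n fictitious records; the value ranges are illustrative only."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        rows.append({
            "patient_id": f"SYN{i:05d}",
            "age": rng.randint(18, 95),
            "resp_rate": rng.randint(8, 30),
            "spo2": rng.randint(85, 100),
            "systolic_bp": rng.randint(80, 200),
            "pulse": rng.randint(40, 140),
            "temperature": round(rng.uniform(35.0, 40.0), 1),
        })
    return rows


def to_binary(rows):
    """Binary serialisation: compact, but opaque to a human reviewer."""
    return pickle.dumps(rows)


def to_plain_text(rows):
    """Plain-text export: every row is visible to an egress checker."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = make_synthetic_rows(100)
binary_blob = to_binary(rows)
text_report = to_plain_text(rows)
```

Both outputs carry the same row-level records. The difference that matters for egress checking is that the plain-text version can be read directly, while the binary version has to be loaded by compatible code before its contents can be seen.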

Results

Initial screening of exported files included checking their reported sizes. Absolute size alone cannot confirm the presence of patient-level data, but unusually large files can signal the need for closer inspection. In several cases, this prompted a manual review that uncovered sensitive information. File size is therefore a useful signal within a layered egress-checking process, not a diagnostic measure. Standard tools such as Python and R do not warn of hidden data, which reinforces the need for explicit egress policies and independent verification. Converting binary formats works only for recognised code libraries and requires ongoing maintenance. Manual inspection alongside automation remains essential to identify and remove embedded data.
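The size signal and the absence of warnings can be sketched as follows. This is a minimal, hypothetical example and not the screening process used in the study: the "model" is a plain dictionary that keeps its training rows (as nearest-neighbour style methods typically do), and the function names and the 10,000-byte threshold are assumptions chosen only to make the point.

```python
import pickle
import random


def fit_nearest_neighbour(rows):
    """Hypothetical 'model': its state simply retains the training rows."""
    return {"kind": "nearest_neighbour", "training_rows": list(rows)}


def serialised_size(obj):
    """Size in bytes of the object's binary serialisation."""
    return len(pickle.dumps(obj))


def needs_closer_inspection(size_bytes, threshold_bytes=10_000):
    """Flag a file for manual review; the threshold is an assumed value."""
    return size_bytes > threshold_bytes


# Synthetic rows of (age, respiratory rate, SpO2); ranges are illustrative.
rng = random.Random(0)
rows = [(rng.randint(18, 95), rng.randint(8, 30), rng.randint(85, 100))
        for _ in range(5000)]

empty_size = serialised_size(fit_nearest_neighbour([]))
fitted_model = fit_nearest_neighbour(rows)
fitted_size = serialised_size(fitted_model)

# Serialising raises no warning, and the rows come back intact on loading.
recovered = pickle.loads(pickle.dumps(fitted_model))["training_rows"]
```

In this sketch the fitted object is flagged purely because of its size, and loading it recovers every training row. A small file could still contain row-level data, which is why size can only prompt inspection and cannot replace it.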

Conclusion

These cases highlight the complexity of identifying and preventing the egress of identifiable data from TREs. Key insights include the need for clear guidance for researchers, the limitations of binary serialisation for egress owing to security vulnerabilities, and the importance of plain-text data exports for ease of verification.
