Problem Statement
Over the last decades, the cost of computation has decreased and machine learning (ML) and artificial intelligence (AI) techniques have progressed rapidly. The advantages of AI can be seen in various applications that affect our daily lives (e.g., smart assistants, chat bots, and application processes).
Healthcare can also profit from these benefits through 'learning' models that support clinical practice in treatment decision support systems (TDSS). The robustness of the resulting model and the meaningfulness of its results generally depend on the number of training samples.
However, meaningful data for improving predictions in medical research and healthcare is often distributed across multiple sites and is not easily accessible. This data contains highly sensitive patient information, may be stored in a different format at each site, and cannot be shared without the explicit consent of the patient.
Central Analysis
Each data provider (e.g., a hospital) has to extract the data from its source systems and send it to the analysis site in a predefined format.
This process creates data silos that are hard to maintain and to update when new data formats need to be included.
Solution
The Personal Health Train (PHT) is a paradigm proposed within the GO:FAIR initiative as one solution for the distributed analysis of medical data that also enhances the FAIRness of the data. Rather than transferring data to a central analysis site, the analysis algorithm (wrapped in a 'train') travels between multiple sites (e.g., hospitals, the so-called 'train stations') that host the data in a secure fashion.
Implementing trains as light-weight containers enables even complex data analysis workflows to travel between sites, for example, genomics pipelines or deep-learning algorithms – analytics methods that are not easily amenable to established distributed queries or simple statistics.
To overcome the legal issues and the redundant, time- and cost-intensive work of integrating and exporting clinical data, Data Integration Centers (DICs) are being introduced at each medical center within the National Medical Informatics Initiative.
A DIC will act as the central hub at each site, providing secure and reliable access to integrated data from healthcare and research without the site losing control over the data.
Distributed Learning
Rather than sending the data to the algorithms, we transfer the algorithms to the data. First, an empty model is transferred to the first station, where it learns a first partial model. The model is then updated iteratively and locally at each subsequent site. The final model is obtained without patient data ever leaving the hospital.
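The iterative procedure described above can be illustrated with a minimal Python sketch. It is not the PHT implementation: the names `Model`, `Station` and `run_train` are hypothetical, and a one-dimensional linear model trained by stochastic gradient descent is assumed purely for illustration. The point is only that the model object moves from station to station while each station's data stays local.

```python
from dataclasses import dataclass
from typing import List, Tuple

Sample = Tuple[float, float]  # (feature, target); hypothetical toy data format


@dataclass
class Model:
    """The only object that travels between stations (assumed linear model)."""
    weight: float = 0.0
    bias: float = 0.0
    samples_seen: int = 0


class Station:
    """Hypothetical train station, e.g. a hospital holding local data."""

    def __init__(self, name: str, local_data: List[Sample]) -> None:
        self.name = name
        self._local_data = local_data  # stays at the station, never returned

    def train(self, model: Model, learning_rate: float = 0.05,
              epochs: int = 500) -> Model:
        """Update the incoming model on local data and return the new model."""
        w, b = model.weight, model.bias
        for _ in range(epochs):
            for x, y in self._local_data:
                error = (w * x + b) - y
                # Gradient step for the squared error of one sample.
                w -= learning_rate * error * x
                b -= learning_rate * error
        return Model(w, b, model.samples_seen + len(self._local_data))


def run_train(stations: List[Station]) -> Model:
    """Send an empty model through all stations, one after another."""
    model = Model()  # the 'empty' model
    for station in stations:
        model = station.train(model)  # only model parameters leave the site
    return model


if __name__ == "__main__":
    # Toy data following y = 2x + 1, split across two stations.
    stations = [
        Station("hospital_a", [(0.0, 1.0), (1.0, 3.0)]),
        Station("hospital_b", [(2.0, 5.0), (3.0, 7.0)]),
    ]
    final_model = run_train(stations)
    print(final_model)
```

In this sketch the stations are visited once and in a fixed order. A real deployment would also have to address transport security, container execution and the possibility that model parameters themselves leak information, none of which is modelled here.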