HUG implementation of SPHN

HUG DATA LAKE

Figure 1. Overview of the HUG Data Lake

The HUG DATA LAKE is a key component of the HUG information system. It addresses numerous essential needs, including reporting, activity planning and management, the provision of aggregated data according to requirements, and monitoring and alerting.

The HUG DATA LAKE is built on several key principles:

  • A flexible architecture, leveraging MongoDB

  • Rigorous security practices, including a well-defined process for access rights and governance

  • A seamless integration strategy to address specific needs

The HUG DATA LAKE aggregates information from heterogeneous sources (EHR, EMR, laboratories, intensive care, etc.) as close to real time as possible.
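
As an illustration of this flexible, MongoDB-based design, the sketch below shows how a raw laboratory result from a source system might be stored and queried with the MongoDB Java driver. The database, collection, and field names are assumptions made for the example, not the actual HUG schema.

    // Illustrative sketch only: database, collection, and field names are assumptions,
    // not the actual HUG DATA LAKE schema.
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.and;
    import static com.mongodb.client.model.Filters.eq;
    import static com.mongodb.client.model.Filters.gte;

    public class DataLakeReadExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> labResults =
                        client.getDatabase("datalake").getCollection("labResults");

                // A raw, minimally processed document as it might arrive from a source system.
                Document sample = new Document("patientId", "123456")
                        .append("source", "LAB")        // originating system (EHR, EMR, lab, ICU, ...)
                        .append("code", "718-7")        // e.g. a LOINC code, kept as delivered
                        .append("value", 13.2)
                        .append("unit", "g/dL")
                        .append("observedAt", java.time.Instant.parse("2024-05-01T08:30:00Z"));
                labResults.insertOne(sample);

                // Near-real-time consumers simply query the latest documents for a patient.
                labResults.find(and(eq("patientId", "123456"),
                                gte("observedAt", java.time.Instant.parse("2024-05-01T00:00:00Z"))))
                        .forEach(d -> System.out.println(d.toJson()));
            }
        }
    }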

The main objectives of HUG DATA LAKE are to:

  • Gather all patient medical data in a single database

  • Make this data available to users in an efficient manner

Data feeding: HUG DATA IMPORTER

The Data Lake is fed by a custom Java application called HUG DATA IMPORTER. This application imports data continuously, ensuring up-to-date information with minimal delay, and can also perform batch catch-ups if needed. The importer retrieves data directly from the source applications.
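
The sketch below illustrates the general idea of such a continuous importer: a polling loop that upserts source records into MongoDB and keeps a watermark so batch catch-ups can replay safely. It is not the actual HUG DATA IMPORTER; the source interface, field names, and polling interval are assumptions.

    // Minimal polling-importer sketch; the real HUG DATA IMPORTER is more elaborate
    // (per-source connectors, batch catch-up mode, error handling). All names are illustrative.
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.ReplaceOptions;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.eq;

    public class MiniImporter {

        // Placeholder for a source-system client (HL7 feed, database view, REST API, ...).
        interface SourceReader {
            java.util.List<Document> readSince(java.time.Instant watermark);
        }

        public static void run(SourceReader source, MongoCollection<Document> target) throws InterruptedException {
            java.time.Instant watermark = java.time.Instant.EPOCH; // batch catch-up: start from the beginning
            while (true) {
                for (Document record : source.readSince(watermark)) {
                    // Idempotent upsert keyed on the source identifier, so replays are safe.
                    target.replaceOne(eq("sourceId", record.getString("sourceId")),
                            record, new ReplaceOptions().upsert(true));
                    java.time.Instant ts = record.get("updatedAt", java.util.Date.class).toInstant();
                    if (ts.isAfter(watermark)) watermark = ts;
                }
                Thread.sleep(5_000); // near real time: poll every few seconds
            }
        }
    }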

Integration of SPHN

Any data source used within the SPHN framework must be fully available, as a continuous stream, in the HUG DATA LAKE, either as a copy or as a reference to the original data (e.g., for large files such as OMICS or DICOM).

This serves several purposes, notably:

  1. Automating deliverables for SPHN and generating RDF files.

  2. Utilizing this data in various projects beyond SPHN (research-related projects, as well as quality improvement initiatives or more operational projects)

  3. Preserving data sources that will eventually be replaced but hold significant “data value” (e.g., MetaVision, Clinisoft)

SPHN pipeline

Figure 2. The SPHN pipeline in HUG

The key steps

  1. Get patient data: the Data-Loader reads data from the HUG DATA LAKE. This component is configurable per project. A set of scripts contained in Magellan (stored in Git) enables several operations:

      1. Retrieve a patient cohort according to specific rules

      2. Extract data from the Data Lake for these patients

      3. Map the extracted data to SPHN concepts

      4. Retrieve coded values according to the various external terminologies used (SNOMED CT, LOINC, CHOP…)

  2. Write patient data: the Data-Loader loads the data into the SPHN DATA MART.

  3. Get pseudonyms: the SPHN-EXPORTER-BATCH is launched and calls sphn-patient-data-service, which reads the data from the SPHN DATA MART and de-identifies it using the pseudonym-service.

  4. POST to SPHN connector: the SPHN-EXPORTER-BATCH pushes the de-identified JSON data to the SPHN connector (one JSON file per patient).

  5. Write RDF files to S3: RDF data is generated and stored.

      1. For all projects (except TI4Health), the SPHN-EXPORTER-BATCH downloads the RDF data from the connector and stores it in the SPHN ODS. This data is then encrypted with the public key of the researcher and sent to BioMedIT using the SETT tool.

      2. For TI4Health / DEAS, the batch lets the connector write the data to S3, and Einstein is notified that it is ready to be read. (A high-level orchestration sketch of these steps follows this list.)
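
The following sketch summarizes the order of operations above in code form. Every type and method here is a hypothetical placeholder for the real components (Magellan, Data-Loader, sphn-patient-data-service, pseudonym-service, SPHN connector); only the sequence of calls reflects the steps described.

    // High-level orchestration sketch of the steps above. All interfaces and method names are
    // hypothetical placeholders; the sketch only illustrates the order of operations, with one
    // JSON document per patient.
    import java.util.List;

    public class SphnPipelineSketch {

        interface Magellan        { List<String> cohort(String project); Object extract(String patientId); }
        interface DataLoader      { void writeToDataMart(String project, String patientId, Object sphnConcepts); }
        interface PatientDataSvc  { String readAsJson(String project, String patientId); }
        interface PseudonymSvc    { String pseudonymize(String json); }
        interface SphnConnector   { void postPatientJson(String project, String pseudonymizedJson); }

        static void run(String project, Magellan magellan, DataLoader loader,
                        PatientDataSvc dataSvc, PseudonymSvc pseudoSvc, SphnConnector connector) {
            for (String patientId : magellan.cohort(project)) {          // 1. get patient data
                loader.writeToDataMart(project, patientId,
                        magellan.extract(patientId));                    // 2. write to the SPHN DATA MART
                String json = dataSvc.readAsJson(project, patientId);    // 3a. read back in SPHN format
                String deIdentified = pseudoSvc.pseudonymize(json);      // 3b. get pseudonyms
                connector.postPatientJson(project, deIdentified);        // 4. POST one JSON per patient
            }
            // 5. RDF generation and storage happen in the SPHN connector / Einstein (not shown here).
        }
    }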

The key components

  • HUG Data Lake

Stores all patient data from various systems. The stored data is raw and has undergone minimal processing.

  • Magellan

Magellan is a Java application designed for extracting and transforming data from the HUG DATA LAKE. This is achieved using MongoDB aggregations and JavaScript scripts.

The application supports input parameters and can retrieve data from external sources. Magellan consists of multiple modules and includes a graphical user interface (developed with Angular), allowing users to execute and interact with predefined queries.
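
For illustration, the snippet below shows the kind of cohort retrieval a Magellan script performs, expressed here with the MongoDB Java driver's aggregation API. The collection name, field names, and cohort rule are assumptions; the actual scripts are MongoDB aggregations and JavaScript stored in Git.

    // Illustrative cohort aggregation in the spirit of a Magellan script; collection and
    // field names are assumptions, not the real Data Lake schema.
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    import java.util.List;

    public class CohortAggregationExample {
        public static void main(String[] args) {
            try (var client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> stays =
                        client.getDatabase("datalake").getCollection("stays");

                // Example cohort rule: adult ICU stays admitted in 2024, one row per patient.
                List<Document> cohort = stays.aggregate(List.of(
                        Aggregates.match(Filters.and(
                                Filters.eq("unitType", "ICU"),
                                Filters.gte("admissionDate", "2024-01-01"),   // date stored as ISO string here
                                Filters.gte("ageAtAdmission", 18))),
                        Aggregates.group("$patientId",
                                Accumulators.sum("nbStays", 1))
                )).into(new java.util.ArrayList<>());

                cohort.forEach(d -> System.out.println(d.toJson()));
            }
        }
    }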

  • Data-Loader

The Data-Loader is a Spring Batch-based project running on Spring Cloud Data Flow (SCDF).

This component reads data from various sources, such as Magellan and/or CSV files, then transforms it if necessary before writing it to targets such as PostgreSQL, MongoDB, or files.

At the end of batch processing, a summary report is sent by e-mail. This report includes details of reads, writes and deletes, as well as any errors encountered during the process.

The Data-Loader is configured via files stored in Git. These files contain the general configuration shared by all projects using the Data-Loader, as well as specific configurations for each project.

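A minimal chunk-oriented job in the spirit of the Data-Loader is sketched below, assuming Spring Batch 5 APIs. The CSV source, MongoDB target, and field names are illustrative; the real Data-Loader is driven by its per-project configuration files and runs on Spring Cloud Data Flow.

    // Minimal chunk-oriented job sketch, assuming Spring Batch 5 APIs.
    // The CSV source, MongoDB target collection, and field names are illustrative.
    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.job.builder.JobBuilder;
    import org.springframework.batch.core.repository.JobRepository;
    import org.springframework.batch.core.step.builder.StepBuilder;
    import org.springframework.batch.item.data.builder.MongoItemWriterBuilder;
    import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.io.FileSystemResource;
    import org.springframework.data.mongodb.core.MongoTemplate;
    import org.springframework.transaction.PlatformTransactionManager;

    @Configuration
    public class DataLoaderJobSketch {

        // Illustrative row read from a CSV extract.
        public record LabRow(String patientId, String code, String value) { }

        @Bean
        Step loadStep(JobRepository jobRepository, PlatformTransactionManager txManager,
                      MongoTemplate mongoTemplate) {
            return new StepBuilder("loadLabRows", jobRepository)
                    .<LabRow, LabRow>chunk(500, txManager)
                    .reader(new FlatFileItemReaderBuilder<LabRow>()
                            .name("labRowCsvReader")
                            .resource(new FileSystemResource("/data/lab_rows.csv")) // illustrative path
                            .delimited().names("patientId", "code", "value")
                            .targetType(LabRow.class)
                            .build())
                    .writer(new MongoItemWriterBuilder<LabRow>()
                            .template(mongoTemplate)
                            .collection("sphn_datamart_lab")                        // illustrative target
                            .build())
                    .build();
        }

        @Bean
        Job dataLoaderJob(JobRepository jobRepository, Step loadStep) {
            return new JobBuilder("dataLoaderJob", jobRepository).start(loadStep).build();
        }
    }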

  • SIMED

The SIMED department is responsible for encoding the values present in our information system into the target reference standards when necessary. We provide extracts of these values, which the department then encodes using its medical expertise.

  • SPHN DATA MART

The SPHN DATA MART stores patient data in SPHN format. At this stage, the data is mapped to the ontologies defined in the SPHN schema, but has not yet been de-identified.

Each project feeds this Data Mart according to its patient cohort and the concepts requested.
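
As a rough illustration, a row of this Data Mart could be modeled as follows. The field names are assumptions; the point is that concepts and codes are already expressed in SPHN terms while the patient identifier is still the real one.

    // Hypothetical shape of a row in the SPHN DATA MART: data is already mapped to SPHN
    // concepts and external terminologies, but identifiers are not yet pseudonymized
    // (de-identification happens later in the pipeline). Field names are assumptions.
    public record DataMartEntry(
            String projectId,        // the SPHN project feeding this cohort
            String patientId,        // real identifier, not yet pseudonymized
            String sphnConcept,      // e.g. a lab test event concept from the SPHN schema
            String codeSystem,       // e.g. "LOINC", "SNOMED CT", "CHOP"
            String code,             // coded value in that terminology
            String value,
            java.time.Instant recordedAt) { }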

  • SPHN-EXPORTER-BATCH

This batch has several stages depending on the specifics of each project:

  • extract data from the SPHN Data Mart (using the sphn-patient-data-service script)

  • manage exchanges with pseudonym-service to pseudonymize data

  • manage communications with the SPHN connector (project creation, data ingestion, triggering of connector process steps); a sketch of these exchanges follows below
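
The sketch below gives a feel for these exchanges using the JDK HTTP client. The endpoint paths and payloads are hypothetical placeholders, not the SPHN connector's documented API; only the sequence (project creation, per-patient ingestion, triggering of processing) follows the stages listed above.

    // Sketch of the exchanges the batch drives with the SPHN connector. Endpoint paths and
    // payloads are hypothetical placeholders; only the sequence follows the text above.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ConnectorClientSketch {

        private final HttpClient http = HttpClient.newHttpClient();
        private final String baseUrl; // e.g. the internal URL of the SPHN connector

        public ConnectorClientSketch(String baseUrl) { this.baseUrl = baseUrl; }

        private HttpResponse<String> post(String path, String body) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + path))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            return http.send(request, HttpResponse.BodyHandlers.ofString());
        }

        public void createProject(String projectId) throws Exception {          // hypothetical endpoint
            post("/projects", "{\"id\":\"" + projectId + "\"}");
        }

        public void ingestPatient(String projectId, String patientJson) throws Exception {
            post("/projects/" + projectId + "/patients", patientJson);           // one JSON per patient
        }

        public void triggerProcessing(String projectId) throws Exception {       // e.g. RDF generation
            post("/projects/" + projectId + "/process", "{}");
        }
    }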

  • PSEUDONYMISATION-SERVICE

The pseudonymization microservice is called by the SPHN-EXPORTER-BATCH. Its role is to pseudonymize data, according to rules defined in a configuration file for each project.

All pseudonyms are stored in the database and then imported into the HUG DATA LAKE to improve processing speed.

Pseudonyms are generated by data type (patient, stay, laboratory result). Once a pseudonym has been generated for an identifier, it will no longer be modified (this ensures that a patient can always be found, even if there are several deliveries with the same data).
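
A minimal sketch of this "generate once, never modify" rule is shown below; the in-memory map stands in for the service's database, and all names are illustrative.

    // Minimal sketch of the stable-pseudonym rule described above: a pseudonym is generated per
    // data type (patient, stay, lab result) and, once stored, is always reused for the same
    // identifier. The in-memory map stands in for the real service's database.
    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    public class PseudonymStoreSketch {

        public enum DataType { PATIENT, STAY, LAB_RESULT }

        // Key: data type + original identifier; value: the pseudonym already delivered.
        private final Map<String, String> store = new ConcurrentHashMap<>();

        public String pseudonymFor(DataType type, String originalId) {
            // computeIfAbsent guarantees the "generate once, never modify" behaviour:
            // a second delivery with the same identifier gets the same pseudonym back.
            return store.computeIfAbsent(type + ":" + originalId,
                    key -> UUID.randomUUID().toString());
        }
    }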

TI4Health / DEAS / Einstein component FQS (Federated Query System)

Figure 3. Einstein and TI4Health components in HUG

Einstein & TI4Health

  1. The partner authenticates on https://auth.tuneinsight.com using two-factor authentication (2FA).

  2. Via the WAF:

      1. Users of the TI4Health instance can authenticate via auth.tuneinsight.com to request the results of static queries run at other hospitals.

      2. The admin of TI4Health can request the addition of a new user at https://portal.tuneinsight.com.

      3. The admin of TI4Health can update TI4Health by downloading a new docker-compose file and Docker images from https://registry.tuneinsight.com.

      4. Upon its first launch, the TI4Health instance requests to register with the TI4Health root instance of Tune Insight.

  3. Request statistical query results using TI4Health.

  4. TI4Health runs the query on the Virtuoso database, which has been populated by Einstein Manager (by converting patients’ data into RDF files in Virtuoso graph format), and returns the statistical query results. (An illustrative query sketch follows this list.)
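
For illustration, the sketch below sends a counting query to a Virtuoso SPARQL endpoint using the standard SPARQL protocol over HTTP. The endpoint URL, graph name, and class IRI are placeholders; in practice the queries are defined and executed by TI4Health and Einstein, not written by hand like this.

    // Illustrative statistical query against a Virtuoso SPARQL endpoint, using the standard
    // SPARQL protocol over HTTP. Endpoint, graph, and class IRIs below are placeholders.
    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class VirtuosoCountQuery {
        public static void main(String[] args) throws Exception {
            String endpoint = "http://virtuoso.internal:8890/sparql";   // placeholder endpoint
            String sparql = """
                    SELECT (COUNT(DISTINCT ?p) AS ?nbPatients)
                    WHERE {
                      GRAPH <https://example.org/sphn/project-graph> {
                        ?p a <https://example.org/sphn#SubjectPseudoIdentifier> .
                      }
                    }
                    """;   // graph name and class IRI are placeholders, not the real SPHN IRIs

            HttpRequest request = HttpRequest.newBuilder(URI.create(
                            endpoint + "?query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8)))
                    .header("Accept", "application/sparql-results+json")
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());   // JSON result bindings with the patient count
        }
    }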

Einstein & SPHN Connector

  1. The SPHN Connector receives new de-identified patient data in JSON.

  2. The SPHN Connector transforms the JSON into RDF and writes it to the S3 storage (sketched after this list).

  3. The SPHN Connector notifies Einstein to load the RDF files into the graph database.

  4. Einstein receives the notification of new data arrival, then retrieves the data and saves it in Virtuoso.
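
The sketch below illustrates the "write RDF to S3, then notify" hand-off with the AWS SDK for Java v2. The bucket name, object key, and notification call are assumptions; in the real setup the SPHN Connector writes the files and Einstein is notified to load them into Virtuoso.

    // Sketch of the "write RDF to S3, then notify" hand-off described above, using the
    // AWS SDK for Java v2 S3 client. Bucket, key, and the notification mechanism are
    // illustrative assumptions, not the actual connector implementation.
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class RdfToS3Sketch {

        public static void publishRdf(S3Client s3, String rdfTurtle, String patientPseudonym) {
            String key = "rdf/" + patientPseudonym + ".ttl";           // illustrative object key
            s3.putObject(PutObjectRequest.builder()
                            .bucket("sphn-connector-output")            // illustrative bucket name
                            .contentType("text/turtle")
                            .key(key)
                            .build(),
                    RequestBody.fromString(rdfTurtle));

            notifyEinstein(key);  // placeholder: in practice a notification tells Einstein
                                  // that a new RDF file is ready to be loaded into Virtuoso
        }

        private static void notifyEinstein(String objectKey) {
            System.out.println("New RDF object ready: " + objectKey);
        }
    }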