CHUV implementation of SPHN
CHUV research ecosystem
HORUS is the platform at the University Hospital of Lausanne (CHUV) dedicated to the research community. It stands for Hospital Research Unified Data & Analytics Services.
The platform encompasses clinical data (i.e. patient data, metadata, documents, images, IoT), also known as HORUS Data, and research applications or services, also known as HORUS Analytics & Services (see Figure 1).
Figure 1. Overview of the CHUV research ecosystem
Note
A specific data pipeline (blue arrows) has been developed by CHUV to deliver data releases to the SPHN community.
Key applications and services are:
- HORUS Data
Data integration into the Oracle platform (various clinical data sources)
Data standardization (cleansing and FAIR transformation)
Management of Terminologies (ontologies, …)
Management of Data Registries
Data and object storage
RDF graph databases
- HORUS Applications (e.g. HORUS Consent)
Research project registration (protocol, DMP, ethical approval, DTA)
Project patient cohorts
Project patient pseudo-codification (used for de-identification)
Other applications: HORUS Explorer, HORUS Restitution, HORUS Images, HORUS Registry
- HORUS Analytics and Services
CHORUS Digital workspace
Machine Learning platform (MLOps)
Analytics and data visualization tools
- SPHN Federated Query System (provided by SPHN; see the example query sketch after this list)
TI4Health from Tune Insight
Einstein API endpoint
Virtuoso graph database
- SPHN Data Pipeline (developed by CHUV)
Data release generation and delivery for SPHN projects (NDS and DEM)
Incremental data delivery for FQS (TI4Health/Einstein)
- SPHN Connector (provided by SPHN)
RDF data conversion based on the SPHN schema, triggered by the CHUV SPHN data pipeline
Data quality validation
Einstein notifications (patient RDF data)
- SPHN SETT (provided by SPHN)
Data transfer to BioMedIT
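To make the Federated Query System item above concrete, here is a hedged sketch of querying a Virtuoso SPARQL endpoint from Python; the endpoint URL and the SPHN namespace IRI are assumptions for illustration, not the actual FQS configuration.

```python
# Hypothetical FQS-style query against a Virtuoso SPARQL endpoint.
# The endpoint URL and the SPHN namespace IRI below are assumptions.
from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://example.org/sparql")  # assumed endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
    SELECT (COUNT(DISTINCT ?s) AS ?patients)
    WHERE { ?s a sphn:SubjectPseudoIdentifier . }
""")

# Count the subject pseudo-identifiers visible at this endpoint.
result = sparql.query().convert()
print(result["results"]["bindings"][0]["patients"]["value"])
```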
SPHN Data Provisioning Process
The process involves the following steps (see Figure 2):
Data Preparation
Data Release Generation
Data Release Validation
Data Release Delivery
Figure 2. SPHN Data Provisioning at CHUV
Note
CHUV developed a generic SPHN data pipeline in Python, orchestrated by a Jenkins Docker agent. The pipeline integrates the SPHN Connector.
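As an illustration of the four steps above, the following is a minimal Python sketch of a staged pipeline runner; the stage names, function bodies, and entry point are hypothetical, not CHUV's actual scripts or Jenkins configuration.

```python
"""Hypothetical sketch of a staged SPHN data pipeline runner."""
import logging

def prepare_data(project_id: str) -> None:
    ...  # build the de-identified release tables for the project cohort

def generate_release(project_id: str) -> None:
    ...  # export one JSON document per patient for the SPHN Connector

def validate_release(project_id: str) -> None:
    ...  # run data quality checks on the generated RDF

def deliver_release(project_id: str) -> None:
    ...  # hand the validated release over for transfer (e.g. via SETT)

STAGES = [prepare_data, generate_release, validate_release, deliver_release]

def run_pipeline(project_id: str) -> None:
    """Run all stages in order; a Jenkins agent would call this per project."""
    for stage in STAGES:
        logging.info("Running %s for project %s", stage.__name__, project_id)
        stage(project_id)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    run_pipeline("DEMO-PROJECT")  # hypothetical project identifier
```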
Data analysts and engineers can, if required, customize the default scripts to address data release requirements such as the following (see the configuration sketch after the list):
patient cohort definition
inclusion and exclusion criteria
de-identification rules
selection of concepts
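For illustration only, a per-project release configuration could look like the sketch below; all keys and values are hypothetical and do not reflect CHUV's actual scripts.

```python
# Hypothetical per-project release configuration (illustrative only).
RELEASE_CONFIG = {
    "project_id": "DEMO-PROJECT",
    "cohort": {
        # Patient cohort definition with inclusion/exclusion criteria.
        "inclusion": ["general consent signed"],
        "exclusion": ["age < 18"],
    },
    "deidentification": {
        # De-identification rules applied while preparing release tables.
        "date_shift": "per-project offset",
        "hash_identifiers": ["encounter_id"],
    },
    # Selection of SPHN concepts to include in the release.
    "concepts": ["BirthDate", "AdministrativeSex", "LabTestEvent"],
}
```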
Data standardization
Data originates from an Oracle platform that combines an Oracle data warehouse (Oracle Health Foundation) and a data lake, providing structured and unstructured data.
Data standardization is a continuous process, performed as a daily ETL at the data warehouse level.
The challenge consists in standardizing data by making it FAIR, in particular interoperable.
Data standardization components include (a small mapping example follows the list):
Mapping tables (aligning data fields across different data sources)
Business rules (ensuring data complies with predefined rules and logic)
Terminologies (using standard terminologies to unify data meaning)
Metadata (for PACS images, clinical documents, datasets, omics)
Clinical data (organizing descriptive metadata and clinical data for easier processing)
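To illustrate the first two components, here is a minimal sketch of a mapping table and a business rule; the source systems, local codes, and rule are hypothetical, not CHUV's actual mappings (the LOINC code shown is the standard code for serum/plasma glucose).

```python
# Illustrative standardization helpers; source systems and local codes
# are hypothetical, not CHUV's actual mappings.

# Mapping table: align source-specific lab codes to a standard
# terminology (here LOINC 2345-7, glucose in serum or plasma).
LAB_CODE_MAP = {
    ("LAB_SYSTEM_A", "GLU"): "2345-7",
    ("LAB_SYSTEM_B", "GLUC"): "2345-7",
}

def standardize_lab_code(source: str, local_code: str) -> str:
    """Map a local lab code to its standard terminology code."""
    return LAB_CODE_MAP[(source, local_code)]

# Business rule: a quantitative lab result must carry a unit.
def check_has_unit(result: dict) -> bool:
    return bool(result.get("unit"))

assert standardize_lab_code("LAB_SYSTEM_A", "GLU") == "2345-7"
assert check_has_unit({"value": 5.4, "unit": "mmol/L"})
```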
Note
Codification is done at the source for some data systems (e.g. billing and laboratory) or manually by specialists or clinicians.
Data de-identification
The de-identification of the data release is not done by the SPHN Connector but by the CHUV SPHN data pipeline while preparing the release tables.
Dates are shifted according to the project rules, and pseudo-codes (Subject Pseudo Identifiers) are provided by HORUS Consent (i.e. one unique pseudo-ID per patient and project) through an Oracle function.
All identifiers (e.g. encounter ID) are “hashed”. A minimal sketch of these operations follows.
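As an illustration only, a minimal Python sketch of date shifting and identifier hashing, assuming a per-project secret and a fixed offset; the actual Oracle function, project rules, and pseudo-code scheme are not shown here.

```python
# Illustrative de-identification sketch; the secret and offset are
# hypothetical, not CHUV's actual implementation.
import hashlib
import hmac
from datetime import date, timedelta

PROJECT_SECRET = b"per-project-secret"   # hypothetical, kept server-side
DATE_SHIFT = timedelta(days=-13)         # hypothetical project offset

def shift_date(d: date) -> date:
    """Shift a clinical date by the project-defined offset."""
    return d + DATE_SHIFT

def hash_identifier(identifier: str) -> str:
    """Replace an identifier (e.g. an encounter ID) with a keyed hash."""
    return hmac.new(PROJECT_SECRET, identifier.encode(), hashlib.sha256).hexdigest()

print(shift_date(date(2024, 5, 1)))      # 2024-04-18
print(hash_identifier("ENC-0001")[:16])  # stable pseudonymous token
```

A keyed hash (HMAC) is shown rather than a plain hash so that identifiers cannot be re-derived by hashing known values without the project secret; whether CHUV uses a keyed or plain hash is not stated here.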
CHUV pipeline and SPHN Connector
The SPHN pipeline can serve both NDS projects and the Federated Query System.
The diagram below outlines the workflow and the scripts of the pipeline (see Figure 3).
Figure 3. CHUV SPHN data pipeline scripts
Note
Standardized and de-identified CHUV data is transformed into one JSON document per patient and ingested into the SPHN Connector to obtain a validated RDF file (a sketch of the per-patient JSON layout follows this note).
The transfer with SETT is not operated by the CHUV pipeline; it is performed manually upon request.
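For illustration, a hedged sketch of writing one JSON document per patient; the field names and layout are assumptions, not the SPHN Connector's actual input format.

```python
# Hypothetical per-patient JSON release writer; field names are
# illustrative, not the SPHN Connector's actual input schema.
import json
from pathlib import Path

def build_patient_document(pseudo_id: str, concepts: list[dict]) -> dict:
    """Bundle the standardized, de-identified concepts of one patient."""
    return {"subject_pseudo_identifier": pseudo_id, "concepts": concepts}

def write_release(out_dir: Path, patients: dict[str, list[dict]]) -> None:
    """Write one JSON file per patient, as in the release layout above."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pseudo_id, concepts in patients.items():
        doc = build_patient_document(pseudo_id, concepts)
        (out_dir / f"{pseudo_id}.json").write_text(json.dumps(doc, indent=2))

write_release(Path("release"), {
    "PSEUDO-001": [{"concept": "LabTestEvent", "value": 5.4, "unit": "mmol/L"}],
})
```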
Federated Query System and Einstein
The CHUV pipeline can generate RDF data (as N-Quads) for Einstein in order to provide patient data to the FQS TI4Health (see Figure 4).
Figure 4. CHUV Federated Query System Architecture
Note
The CHUV SPHN data pipeline is able to load incremental data (one file per patient) as well as remove a patient if necessary (e.g. when the general consent is revoked). A minimal sketch follows.
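As an illustration, a hedged sketch of producing per-patient N-Quads with rdflib; the namespace IRI, graph naming, and properties are assumptions, not the actual CHUV or Einstein conventions.

```python
# Illustrative N-Quads generation with rdflib; the namespace IRI,
# graph naming, and properties below are assumptions.
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SPHN = Namespace("https://biomedit.ch/rdf/sphn-schema/sphn#")  # assumed IRI

ds = Dataset()
# One named graph per patient keeps incremental loading and
# per-patient removal (e.g. on consent revocation) simple.
patient_graph = ds.graph(URIRef("https://example.org/patient/PSEUDO-001"))
subject = URIRef("https://example.org/subject/PSEUDO-001")
patient_graph.add((subject, RDF.type, SPHN.SubjectPseudoIdentifier))
patient_graph.add((subject, SPHN.hasIdentifier, Literal("PSEUDO-001")))

# One .nq file per patient, matching the incremental delivery above.
ds.serialize(destination="PSEUDO-001.nq", format="nquads")
```

Removing a patient then amounts to dropping that patient's named graph from the triple store.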