SPHN Connector

Note

> Watch the SPHN Webinar introducing the SPHN Connector!

Introduction

Hospitals in the SPHN network store their clinical data in a variety of database systems and formats, ranging from structured SQL databases to unstructured formats. Validating data quality becomes difficult to manage at large data volumes and is often regarded as time-consuming and resource-intensive, which delays the work that researchers need to achieve.

The SPHN Connector is a dockerized solution that allows data-providing institutions to build a pipeline that converts their data from relational or JSON sources into graph data based on an RDF schema conforming to the SPHN Framework. The ingested data is converted into RDF and validated to check its conformity with the schema. Optionally, data providers without in-house de-identification functionality can activate the de-identification module of the SPHN Connector.

The SPHN Connector integrates a variety of tools developed or distributed by the DCC, such as the SHACLer and the SPHN RDF Quality Check Tool, to simplify the production of high-quality data. In the context of SPHN, any data provider can use the SPHN Connector to create and validate data in RDF more easily.

The SPHN Connector is built with flexibility and simplicity in mind. It requires only two inputs: the patient-level data and the base ontology, which is the SPHN RDF Schema, optionally complemented by a project-specific RDF Schema.

Almost everything else can be adapted by users to fit their needs, skills, and working environment. One example is the variety of input data supported by the SPHN Connector: a user can upload JSON files or RDF files, or set up a specific database import. The user also has the option to configure validation parameters. Experienced users can provide their own SHACL file for validation, while others can simply use the file that is created by the connector via the SHACLer.

Overview & architecture

SPHN Connector architecture

Figure 1. Core elements of the SPHN Connector architecture.

As shown in Figure 1, the process of creating and validating the data follows four steps:

  • 1) Ingestion

    In the ingestion phase, data provided by the user is imported into the MinIO object storage of the SPHN Connector, following a suitable formatting process. The user can either upload JSON files or RDF files, or import data from a database. In the latter case, the provided data is transformed into a JSON file. The project information associated with the data must also be given so that the SPHN Connector can bundle the data and its dependencies (like project-specific terminologies or validation queries) for validation. This enables the user to have multiple validation options in one SPHN Connector instance. Projects are initialized with the SPHN RDF Schema and optionally a project-specific RDF Schema. Additionally, external terminologies that are referenced in the ingested data should be uploaded.

  • 2) Pre-Check & De-Identification

    To ensure that the input data is valid, in the expected format, and suitable for producing RDF data, users can either define their own pre-checks or use the default ones. By default, the SPHN Connector checks for spaces and for French and German special characters so that they do not cause problems later during IRI creation. These default checks are limited to the id fields of the JSON file, but a user can define checks for any field as needed. Some checks only verify the input data, while others directly alter it (Replacechecks). If a user has no means of de-identification, the SPHN Connector can be configured to take care of this step.

  • 3) Integration

    During the Integration phase, the patient-based JSON data is mapped to RDF via the RMLmapper, resulting in one RDF file per patient. This step is necessary in order to later use the SHACLer or the SPHN RDF Quality Check Tool.

  • 4) Validation

    During the Validation phase, the RDF data is checked against SHACL shapes (see: Shapes Constraint Language (SHACL)) generated by the SHACLer from the provided ontologies, and against any manually provided SPARQL queries. The resulting validation report states whether the patient data complies with the provided semantic definitions and categorizes it as valid or not. The user can evaluate the output via different logging endpoints provided by the SPHN Connector.
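
The difference between a verify-only pre-check and a replace-style pre-check (step 2 above) can be illustrated with a minimal Python sketch. This is not the SPHN Connector's implementation or configuration format: the function names, the set of characters treated as IRI-safe, and the assumption that identifiers live in fields named "id" are all hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Assumption: only these characters are treated as safe inside an IRI segment.
UNSAFE = re.compile(r"[^A-Za-z0-9._~-]")


def verify_id(value: str) -> list[str]:
    """Verify-only check: report problems without changing the value."""
    problems = []
    if any(ch.isspace() for ch in value):
        problems.append("contains whitespace")
    if any(ord(ch) > 127 for ch in value):
        problems.append("contains non-ASCII characters")
    return problems


def replace_id(value: str) -> str:
    """Replace-style check: return an altered, IRI-friendly value."""
    # Decompose accented letters (e.g. "é" becomes "e" plus a combining accent)
    # and drop everything that is not ASCII. Characters with no ASCII
    # decomposition (for example "ß") are simply dropped in this sketch.
    decomposed = unicodedata.normalize("NFKD", value.strip())
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace spaces and any other unsafe character with an underscore.
    return UNSAFE.sub("_", ascii_only)


def find_id_problems(node, path="$"):
    """Walk nested JSON-like data and verify every string field named 'id'."""
    found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "id" and isinstance(value, str):
                problems = verify_id(value)
                if problems:
                    found[child] = problems
            else:
                found.update(find_id_problems(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.update(find_id_problems(item, f"{path}[{index}]"))
    return found
```

With these definitions, `verify_id("Hôpital Genève 01")` reports both whitespace and non-ASCII characters and leaves the value untouched, whereas `replace_id("Hôpital Genève 01")` returns `"Hopital_Geneve_01"`.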

The SPHN Connector thus provides the user with an entire pipeline for creating and validating patient data using Semantic Web technologies.

Tools used in SPHN Connector

The following tools are used in the SPHN Connector:

SPHN RDF Quality Check Tool

The SPHN RDF Quality Check tool (also referred to as “QC tool”) is a Java-based tool that facilitates the validation of data in compliance with the SPHN RDF Schema or an SPHN project-specific schema.

The QC tool is primarily intended for Data Providers to use for checking their data prior to sending it to Data Users or Projects. Although it is now integrated in the SPHN Connector, it can be used as a standalone tool by anyone who wants to check their data against the SPHN or a project-specific RDF Schema.

One major advantage of the QC tool is that there are no transaction size limits: bulk uploads can reach hundreds of millions of triples, depending on the machine’s resources.

Note

For more information about the QC tool hardware requirements and dependencies, please read the README.md.

The QC tool currently supports the following operations:

  • Checking compliance of data with the RDF schema and the SHACL constraints of the project

  • Quantitative profiling of the data for evaluating its completeness with pre-defined SPARQL queries

Statistical SPARQL queries

A set of statistical SPARQL queries that provide summaries of the data has been developed.

These queries give information about:

  • the data coverage of the SPHN RDF Schema (QC00010, QC00020)

  • data elements that are not part of SPHN, if any (QC00030, QC00031)

  • basic summary of the SPHN properties used and their annotated values (QC00012)

  • number of data elements, i.e. number of patients, hospitals, patients per hospital (QC00040, QC00041, QC00042).

These queries and other more specific ones are available on the DCC Git and are integrated in the QC tool.
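
To give a feeling for the kind of count these queries produce, the plain-Python sketch below computes the number of patients per hospital over a handful of triples. It is only an illustration: the actual QC queries are written in SPARQL and are not reproduced here, and the `ex:` prefix, the class and property names, and the sample data are hypothetical rather than taken from the SPHN RDF Schema.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, predicate, object).
# The "ex:" names are illustrative placeholders, not SPHN IRIs.
TRIPLES = [
    ("ex:patient1", "rdf:type", "ex:Patient"),
    ("ex:patient1", "ex:hasDataProvider", "ex:hospitalA"),
    ("ex:patient2", "rdf:type", "ex:Patient"),
    ("ex:patient2", "ex:hasDataProvider", "ex:hospitalA"),
    ("ex:patient3", "rdf:type", "ex:Patient"),
    ("ex:patient3", "ex:hasDataProvider", "ex:hospitalB"),
]


def patients_per_hospital(triples):
    """Count distinct patients linked to each hospital."""
    # First collect every subject that is typed as a patient.
    patients = {s for s, p, o in triples if p == "rdf:type" and o == "ex:Patient"}
    # Then group those patients by the hospital they are linked to.
    members = defaultdict(set)
    for s, p, o in triples:
        if p == "ex:hasDataProvider" and s in patients:
            members[o].add(s)
    return {hospital: len(found) for hospital, found in members.items()}


def summary(triples):
    """Return the three counts mentioned above in one dictionary."""
    per_hospital = patients_per_hospital(triples)
    return {
        "patients": sum(per_hospital.values()),
        "hospitals": len(per_hospital),
        "patients_per_hospital": per_hospital,
    }
```

In a SPARQL query the same result would typically come from a `COUNT(DISTINCT ...)` aggregate combined with `GROUP BY`; the sketch uses sets for the same de-duplication effect.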

Data validation with SHACL

SHACL rules generated by the SHACLer tool are integrated in the QC tool to validate the compliance of the RDF data produced with the SPHN RDF Schema.

Integrated SHACLs are described in SHACL constraint components implemented in SPHN.

Availability and usage rights

For information about the SPHN Connector, please contact DCC - dcc@sib.swiss.

The SPHN Quality Check tool is co-developed by SIB Swiss Institute of Bioinformatics and HUG members.

The SPHN Quality Check tool is available on Git. The SPHN Quality Check tool is licensed under the GPLv3 License.