The SPHN Connector is a dockerized solution that allows data-providing institutions to build a pipeline that converts their data from relational or JSON sources into graph data based on an RDF schema conforming to the SPHN Framework. The ingested data is converted into RDF and validated to check its conformity with the schema. Optionally, data providers that do not have in-house de-identification functionality can activate the de-identification module in the SPHN Connector.
The SPHN Connector integrates a variety of tools developed or distributed by the DCC, such as the SHACLer or the Quality Assurance Framework, to simplify the production of high-quality data. In the context of SPHN, the SPHN Connector can be used by any data provider to create and validate data in RDF more easily.
The SPHN Connector is built with flexibility and simplicity in mind. It expects only two elements: input data provided at the patient level, and a base ontology, which can be the SPHN RDF Schema and optionally a project-specific RDF Schema.
Almost everything else can be adapted by users to fit their needs, skills and working environment.
One example is the range of input data supported by the Connector.
A user can upload JSON files or RDF files, or set up a specific database import.
The setup of the validation is also user-centric.
Experienced users can provide their own SHACL file for validation, while others can
simply use the file that is created by the Connector via the SHACLer.
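To make the "patient-level input data" requirement concrete, the following is a minimal sketch of one patient's JSON document being written to disk for upload. The field names (`patient_id`, `concepts`, `value`, `unit`) are illustrative assumptions, not the Connector's actual JSON schema; consult the SPHN Connector documentation for the expected layout.

```python
import json

# Illustrative patient-level record; field names are assumptions,
# not the Connector's real JSON schema.
patient_record = {
    "patient_id": "patient-001",
    "concepts": [
        {
            "concept": "BodyWeight",
            "id": "bw-001",
            "value": 72.5,
            "unit": "kg",
        }
    ],
}

def write_patient_file(record: dict, path: str) -> None:
    """Serialize one patient record to a JSON file ready for upload."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2)

write_patient_file(patient_record, "patient-001.json")
```

The point of the sketch is the granularity: one self-contained JSON document per patient, which is what the later per-patient RDF conversion relies on.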
Overview & architecture
Figure 1. Core elements of the SPHN Connector architecture.
As seen in the figure, the process of creating and validating the data follows four steps:
- 1) Ingestion
In the ingestion phase, data provided by the user is ingested into the SPHN Connector, more precisely into its object storage (MinIO). The user can either upload JSON files or RDF files, or use the option to import data from a database. In the latter case, the provided data is transformed into a JSON file. In addition, project-related data has to be uploaded. The Connector uses projects to bundle the data for validation together with its dependencies (such as project-specific terminologies or validation queries). This enables the user to have multiple validation options in one SPHN Connector instance. Projects are initialized with the SPHN RDF Schema and optionally a project-specific RDF Schema. Additionally, the terminologies referenced in the ingested data should be uploaded.
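The database-import path can be sketched as follows: rows from a relational source are grouped per patient and turned into one JSON document each. The table and column names below are invented for illustration; the real Connector configures this mapping through its database-import setup.

```python
import json
import sqlite3

# Toy relational source; schema and names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (patient_id TEXT, concept TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("p1", "BodyWeight", 72.5), ("p1", "BodyHeight", 1.78), ("p2", "BodyWeight", 64.0)],
)

# Group rows per patient, mimicking the relational-to-JSON transformation.
patients: dict = {}
for pid, concept, value in conn.execute("SELECT patient_id, concept, value FROM measurements"):
    patients.setdefault(pid, []).append({"concept": concept, "value": value})

# One JSON document per patient, ready for ingestion into object storage.
patient_files = {
    pid: json.dumps({"patient_id": pid, "concepts": rows})
    for pid, rows in patients.items()
}
```

Whatever the source schema looks like, the outcome of this step is the same patient-level JSON granularity as a direct JSON upload.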
- 2) Pre-Check & De-Identification
To ensure that the input data is valid, in the expected format and suitable for producing RDF, pre-checks can either be defined by the user or the default ones can be used. By default, the Connector checks for spaces and for special characters common in French and German, so that they do not cause problems later during IRI creation. These default checks are limited to the id fields of the JSON file, but a user can define them for any field as needed. It is possible to define checks that only verify the input data, while others directly alter it (Replace checks). In case a user has no means of de-identification, the SPHN Connector can take care of this step. This step is optional.
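The two pre-check flavours can be sketched like this: a verify-only check that flags problematic id fields, and a Replace check that rewrites the input. The character set is an assumption covering spaces and common French/German special characters that would break IRI creation; the Connector's actual check definitions may differ.

```python
import re

# Assumed set of problematic characters: spaces plus common
# French/German special characters that complicate IRI creation.
PROBLEMATIC = re.compile(r"[ äöüÄÖÜßéèêëàâçîïôûù]")

def verify_id(value: str) -> bool:
    """Verify-only check: report whether an id field is safe for IRI creation."""
    return PROBLEMATIC.search(value) is None

def replace_check(value: str) -> str:
    """Replace check: alter the input by substituting offending characters."""
    replacements = {" ": "_", "ä": "ae", "ö": "oe", "ü": "ue",
                    "ß": "ss", "é": "e", "è": "e", "ç": "c"}
    return "".join(replacements.get(ch, ch) for ch in value)
```

For example, `verify_id("patient 001")` reports a problem without touching the data, while `replace_check("müller 01")` rewrites the value to `"mueller_01"` before it enters IRI creation.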
- 3) Integration
During the Integration phase, the patient-based JSON data is mapped to RDF via the RMLMapper, resulting in one RDF file per patient. This step is necessary in order to use the quality tools (or their output), such as the SHACLer or the Quality Assurance Framework.
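The idea of the per-patient JSON-to-RDF mapping can be illustrated with a hand-rolled sketch that emits N-Triples. In the real pipeline this is driven by RML mappings executed by the RMLMapper; the base IRI and predicate names below are illustrative stand-ins only.

```python
# Illustrative base IRI and predicates; the real mapping is defined in RML.
BASE = "https://example.org/sphn/"

def patient_to_ntriples(doc: dict) -> str:
    """Map one patient JSON document to a string of N-Triples."""
    subject = f"<{BASE}{doc['patient_id']}>"
    triples = []
    for c in doc["concepts"]:
        node = f"<{BASE}{doc['patient_id']}/{c['concept']}>"
        triples.append(f"{subject} <{BASE}hasConcept> {node} .")
        triples.append(f'{node} <{BASE}hasValue> "{c["value"]}" .')
    return "\n".join(triples)

doc = {"patient_id": "patient-001",
       "concepts": [{"concept": "BodyWeight", "value": 72.5}]}
ntriples = patient_to_ntriples(doc)  # one RDF serialization per patient
```

The one-file-per-patient output is what lets the downstream quality tools validate each patient's graph independently.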
- 4) Validation
During the Validation (QC checks), the RDF data is checked against SHACL shapes (see: Shapes Constraint Language (SHACL)) generated by the SHACLer from the provided ontologies, plus, additionally, manually provided SPARQL queries. The resulting validation report states whether the patient data complies with the provided semantic definitions, and the data is categorized as valid or not. The user can evaluate the output via the different logging endpoints provided by the SPHN Connector.
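The valid/invalid categorization can be pictured as follows: each patient's SHACL validation report carries a conformance flag, and the pipeline sorts patients by it. The report snippets and the naive `sh:conforms` string check below are simplified assumptions about the report's Turtle serialization, not the Connector's actual report handling.

```python
# Simplified stand-in for inspecting a SHACL validation report.
def is_valid(report_turtle: str) -> bool:
    """Crude check: sh:conforms true means the patient data passed validation."""
    return "sh:conforms true" in report_turtle

# Invented report snippets, one per patient.
reports = {
    "patient-001": "[] a sh:ValidationReport ; sh:conforms true .",
    "patient-002": "[] a sh:ValidationReport ; sh:conforms false ; sh:result [] .",
}

categorized = {pid: ("valid" if is_valid(r) else "invalid")
               for pid, r in reports.items()}
```

A real deployment would parse the report as RDF rather than match strings, but the outcome is the same per-patient valid/invalid verdict that the logging endpoints expose.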
The SPHN Connector provides the user with an entire pipeline for creating and validating patient data using Semantic Web technologies.
Availability and usage rights
Please contact DCC - firstname.lastname@example.org.