Quality Assurance Framework

Hospitals in the SPHN network use a variety of database systems to store their clinical data, which may be held in structured (SQL) or non-structured (NoSQL) formats. Validating and assuring data quality at this scale is difficult, and it is often regarded as a time-consuming and resource-intensive task that delays the researchers’ work. The Quality Assurance Framework is designed to address these challenges by combining the RDF standard format and constraints with semantic web tools. It includes scenario-based, multidimensional data quality checks that can be run automatically against a hospital’s RDF data to verify adherence to the SPHN schema and to validate the data against specific criteria.

The SPHN Quality Assurance Framework currently provides the Quality Check tool, a Java-based tool that facilitates validating data for compliance with the SPHN RDF schema or an SPHN project-specific schema.

The SPHN RDF Quality Check Tool

The SPHN RDF Quality Check Tool is primarily intended for Data Providers, who can use it to check their data before sending it to Data Users or Projects. It can also be used by anyone who wants to check their data against the SPHN RDF schema or a project-specific RDF schema.

Warning

The RDF Quality Check Tool repository is private. To request access, please contact dcc@sib.swiss.

Because the tool is delivered as a ready-to-run Java (.jar) file, it can be run on any operating system that supports Java. One major advantage of the tool is that it imposes no transaction size limit: bulk uploads can reach hundreds of millions of triples, depending on the machine’s resources.

Note

For more information about the QC tool’s hardware requirements and dependencies, please read the README.md.

The QC tool currently supports the following checks:

  • SHACL constraints against the RDF schema of a specific project, checking for RDF schema compliance and constraint validity.

  • Several SPARQL queries that evaluate the completeness and validity of the data. These queries also produce dataset profiling tables that help researchers understand descriptive statistics about the available data and confirm that the results are as expected.

Statistical SPARQL queries

SPARQL Protocol and RDF Query Language (SPARQL) is a W3C Recommendation that defines a standard language for querying databases and data sources provided in RDF. For more information, see the background section on SPARQL.

In SPHN, a set of statistical SPARQL queries has been developed to gain basic knowledge about the data being queried.

These statistical queries give qualitative information about:

  • the data coverage of the SPHN ontology (QC00010, QC00020)

  • data elements that are not part of SPHN, if any (QC00030, QC00031)

  • a basic summary of the SPHN properties used and their annotated values (QC00012)

  • the number of data elements, i.e. the number of patients, hospitals, and patients per hospital (QC00040, QC00041, QC00042).

These queries are available at https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-ontology/-/tree/master/quality_assurance/statistics and are integrated in the Quality Check tool. For further information on using and building statistical queries, please read the summary statistics section of the user guide.
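To make the kind of profiling described above more concrete, the following is a minimal conceptual sketch in plain Python. It is not one of the SPHN queries and does not use SPARQL or the Quality Check tool. It only mimics, over a small in-memory list of triples, the sort of counts such queries report (instances per class, patients per hospital, and which schema classes are covered by the data). All `ex:` names, the `TRIPLES` data and the `SCHEMA_CLASSES` set are hypothetical and are not taken from the SPHN schema.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples. The ex: names are
# illustrative assumptions, not identifiers from the SPHN schema.
TRIPLES = [
    ("ex:patient1", "rdf:type", "ex:Patient"),
    ("ex:patient2", "rdf:type", "ex:Patient"),
    ("ex:patient3", "rdf:type", "ex:Patient"),
    ("ex:patient1", "ex:hasProvider", "ex:hospitalA"),
    ("ex:patient2", "ex:hasProvider", "ex:hospitalA"),
    ("ex:patient3", "ex:hasProvider", "ex:hospitalB"),
    ("ex:hospitalA", "rdf:type", "ex:Hospital"),
    ("ex:hospitalB", "rdf:type", "ex:Hospital"),
]

# Hypothetical set of classes defined by a schema.
SCHEMA_CLASSES = {"ex:Patient", "ex:Hospital", "ex:Diagnosis"}


def instances_per_class(triples):
    """Count distinct subjects for each class used as an rdf:type object."""
    members = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            members[obj].add(subject)
    return {cls: len(subjects) for cls, subjects in members.items()}


def patients_per_hospital(triples):
    """Count distinct patients linked to each hospital via ex:hasProvider."""
    patients = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "ex:hasProvider":
            patients[obj].add(subject)
    return {hospital: len(ps) for hospital, ps in patients.items()}


def schema_coverage(triples, schema_classes):
    """Split schema classes into those with and without instances in the data."""
    used = set(instances_per_class(triples))
    return sorted(schema_classes & used), sorted(schema_classes - used)


if __name__ == "__main__":
    print(instances_per_class(TRIPLES))
    print(patients_per_hospital(TRIPLES))
    print(schema_coverage(TRIPLES, SCHEMA_CLASSES))
```

In practice these figures come from the SPARQL queries listed above, run against the actual RDF data. The sketch only illustrates what is being counted.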

Data validation with SHACL

Shapes Constraint Language (SHACL) is a W3C Recommendation that defines a language for validating RDF graphs against a set of conditions. For more information about SHACL, see the background section on SHACL.

In SPHN, a set of SHACL rules has been developed to validate that RDF data complies with the SPHN RDF schema. These rules are automatically generated by the SHACLer tool (see SHACLer) and integrated in the SPHN Quality Check tool to validate the compliance of the produced RDF data with the SPHN ontology.
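As a rough illustration of what such a rule checks, the sketch below imitates a single cardinality constraint in plain Python. It is a conceptual analogue of a SHACL property shape that combines `sh:targetClass`, `sh:path`, `sh:minCount` and `sh:maxCount`. It is not SHACL itself, and it is not how the SHACLer or the Quality Check tool is implemented. The `CardinalityShape` class, the `validate` function, the `ex:` names and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CardinalityShape:
    """Hypothetical stand-in for a SHACL property shape with cardinality bounds."""
    target_class: str                 # analogue of sh:targetClass
    path: str                         # analogue of sh:path
    min_count: int = 0                # analogue of sh:minCount
    max_count: Optional[int] = None   # analogue of sh:maxCount (None = unbounded)


# Hypothetical data: patient1 conforms, patient2 has no birth date,
# patient3 has two birth dates.
DATA = [
    ("ex:patient1", "rdf:type", "ex:Patient"),
    ("ex:patient1", "ex:hasBirthDate", "1980-01-01"),
    ("ex:patient2", "rdf:type", "ex:Patient"),
    ("ex:patient3", "rdf:type", "ex:Patient"),
    ("ex:patient3", "ex:hasBirthDate", "1975-05-05"),
    ("ex:patient3", "ex:hasBirthDate", "1976-06-06"),
]

SHAPE = CardinalityShape("ex:Patient", "ex:hasBirthDate", min_count=1, max_count=1)


def validate(triples, shape):
    """Return (focus_node, message) pairs for nodes violating the shape."""
    focus_nodes = sorted(
        {s for s, p, o in triples if p == "rdf:type" and o == shape.target_class}
    )
    violations = []
    for node in focus_nodes:
        # Count distinct values of the path for this focus node.
        values = {o for s, p, o in triples if s == node and p == shape.path}
        count = len(values)
        if count < shape.min_count:
            violations.append(
                (node, f"{count} value(s) for {shape.path}, expected at least {shape.min_count}")
            )
        elif shape.max_count is not None and count > shape.max_count:
            violations.append(
                (node, f"{count} value(s) for {shape.path}, expected at most {shape.max_count}")
            )
    return violations


if __name__ == "__main__":
    for node, message in validate(DATA, SHAPE):
        print(node, "->", message)
```

A real SHACL validator produces a validation report with much richer detail and supports many more constraint types. The sketch covers only the cardinality case.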

To find out which SHACL constraints are integrated, refer to section 2, “SHACL constraints implemented for SPHN”. For instructions on running these constraints, see the user guide.

Availability and usage rights

© Copyright 2022, Personalized Health Informatics Group (PHI), SIB Swiss Institute of Bioinformatics

The SPHN Quality Assurance Framework is co-developed by SIB Swiss Institute of Bioinformatics and HUG members.

The SPHN Quality Check tool is available at https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-ontology/-/tree/master/quality_assurance (access on request from the DCC at dcc@sib.swiss). The SPHN Quality Check tool is licensed under the GPLv3 License.

For any questions or comments, please contact the SPHN Data Coordination Center (DCC) at dcc@sib.swiss.