Improve data quality through validation

Validating data against a given schema helps ensure its quality and its usability by others. It therefore constitutes an important step in the context of SPHN.

Target Audience

This document is mainly intended for data providers and project data managers who wish to validate their data against the schema. There exist several ways to validate data produced in the SPHN RDF format. This document presents two of them:

  • one is by using the RDF Quality Check Tool and interpreting its validation report

  • the other is by validating data using SHACLs directly in GraphDB

1. Data quality validation using the RDF Quality Check Tool

The SPHN RDF Quality Check Tool is a ready-to-run Java (.jar) tool and can be run on any operating system that supports Java. Refer to Quality Assurance Framework for more information.

There are no transaction size limits: bulk uploads can reach hundreds of millions of triples, depending on the machine resources.

Note

For more information about the QC tool hardware requirements and dependencies, as well as how to set up and use the tool, please read the README.md.

Warning

The SPHN RDF Quality Check Tool repository is a private repository. For access, please reach out to dcc@sib.swiss.

The QC tool currently supports the following:

  • SHACL validation against the RDF schema of a specific project, checking for RDF schema compliance and constraint validity

  • Execution of several SPARQL queries that evaluate the completeness and validity of the data, and that provide dataset profiling tables to help researchers understand descriptive statistics about the available data and confirm that the results are as expected

QC tool usage

The QC tool is designed to read the configuration parameters defined by the user in a .properties file. The parameters refer to the tool’s input and output configuration options, which can be summarized as follows:

  • Input:
    1. Ontology: The path pointing to the location where the ontology file is stored as a Turtle file. It corresponds to the RDF schema that will be used to check the data.

    2. Data: The path pointing to the location where the data files are stored as Turtle files. They will be loaded by the tool into an Apache Jena TDB2 store on the filesystem.

    3. Query: The path pointing to the location where the statistical SPARQL queries are stored as .rq files. They will be used by the tool to run against the loaded data in the Jena TDB2 store.

    4. Shapes: The path pointing to the location where the SHACL rules are stored as a Turtle file. They will be used by the tool to run the shape constraint validation against the loaded data in the Jena TDB2 store.

  • Output:
    • The tool displays a live report on the screen, showing the output of each query in a table with the name of the query above it. For example, the output of the SPHN attributes count query against the resource files would look like the following table:

# Shows all the data attributes used for every particular concept and how many objects are linked to, with their min, max...
Executing queries2021/QC00012-count-sphn-attributes.rq against resources files
----------------------------------------------------------------------------------------
| concept | attribute | range | sphn_objects_count | min_value | max_value | avg_value |
========================================================================================
----------------------------------------------------------------------------------------
  • Another example is the output of the query that lists attributes that are not part of the SPHN ontology:

# Expected: This lists all properties (data + object) that are not part of SPHN ontology. Should be empty.
Executing queries2021/QC00031-shows-attributes-not-defined-in-ontology.rq against resources files
-----------------------------
| non_sphn_attribute        |
=============================
| :hasExtractionDate        |
| dct:conformsTo            |
| frailty:relatesToVariable |
| skos:altLabel             |
-----------------------------

(48 ms)
  • After running all the provided queries and showing their output, the tool displays the SHACL validation output in a table stating whether the data conforms to the given RDF schema and highlighting any violations. For a complete example, see test1_report.

  • The generated report can be exported as flat files per concept as CSV, TSV, JSON or XML. More details are available in the README.md file.
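
Checks such as "Should be empty" in the query output shown above can also be scripted. The following is a minimal Python sketch, not part of the QC tool, which assumes that the live report has been captured as text and that result tables use the pipe-delimited layout shown above. The function name parse_result_table is hypothetical.

```python
def parse_result_table(report_text):
    """Parse one pipe-delimited result table from the captured report text.

    Returns (header, rows); each row is a list of cell strings.
    Lines that do not start with '|' (comments, "Executing ..." lines,
    '-' and '=' separators, timings) are skipped.
    """
    header, rows = None, []
    for line in report_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first table line holds the column names
        else:
            rows.append(cells)
    return header, rows


# Excerpt of the example output shown above.
SAMPLE = """\
-----------------------------
| non_sphn_attribute        |
=============================
| :hasExtractionDate        |
| dct:conformsTo            |
-----------------------------
"""

header, rows = parse_result_table(SAMPLE)
unexpected = [row[0] for row in rows]
if unexpected:
    print(f"{len(unexpected)} attribute(s) not defined in the ontology: {unexpected}")
```

A table with a header but no rows, like the first example above, yields an empty row list, which is the expected outcome for queries documented as "Should be empty".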

2. SHACL validation in GraphDB

This section describes how to validate an RDF data graph against a set of constraints expressed in SHACL. This walkthrough uses GraphDB, a graph database for RDF with SPARQL support. A set of SHACLs to validate data according to the SPHN RDF schema can be downloaded here (learn more here about which SHACLs are included).

This document uses the following terminology:

  • SHACL: Shapes Constraint Language, standardized in https://www.w3.org/TR/shacl/. For an introduction to SHACL, visit the SHACL Background section.

  • Data Graph: Refers to an RDF graph with information about e.g. Drugs, BodyHeight. An example RDF file shacl_test_graph.ttl is provided for testing purposes.

  • Shapes Graph: Refers to constraints in SHACL, which are expressed as RDF.

Step 1: Preparing a new repository

In GraphDB, SHACL validation needs to be enabled during the creation of a repository. It is not possible to do this afterwards for an already existing repository. A new repository with SHACL validation can be created as follows:

  • Open the GraphDB Workbench, a web-based user interface, and log in with your credentials.

  • Navigate to Setup > Repositories > Create new repository:

../_images/01_create_new_repository.png
  • Select the Enable SHACL validation option on the options page:

../_images/02_create_new_repository_settings.png
../_images/03_connect_to_repository.png

Step 2: Importing SHACL shapes

Shape graphs can be inserted using any method for loading RDF data into GraphDB. The GraphDB Workbench provides three methods:

  • Upload RDF files

  • GET RDF data from a URL

  • Import RDF text snippet

In case a shape graph is uploaded directly to the server, it appears in the tab Server files.

As a first example, we use the Upload RDF files option:

  • Select a SHACL file from your local computer and it will be uploaded to the server.

  • An uploaded RDF file is added to the User data list of files available for importing.

  • Click on the Import button to initiate the import of the uploaded file to the repository.

../_images/04_import_shacl_shapes.png
  • In the import dialog box, select Named graph under Target graphs. SHACL validation requires a reserved graph name. Fill in: http://rdf4j.org/schema/rdf4j#SHACLShapeGraph

  • Your options pane should look like the following:

../_images/05_use_reserved_named_graph.png

A successful import is confirmed with the message: “imported successfully in less than a second.”
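
The same import can in principle be scripted instead of done through the Workbench. The sketch below only builds the HTTP request and does not send it. It assumes that GraphDB exposes the RDF4J-style REST endpoint /repositories/<id>/statements and accepts the reserved shape graph as the context parameter; the base URL, the repository id sphn-validation and the function name are hypothetical, so check them against your own installation.

```python
from urllib.parse import quote
from urllib.request import Request

# Reserved named graph for SHACL shapes.
SHACL_SHAPE_GRAPH = "http://rdf4j.org/schema/rdf4j#SHACLShapeGraph"


def build_shape_import_request(base_url, repository_id, shapes_turtle):
    """Build (but do not send) a POST that adds Turtle shapes to the reserved graph."""
    # The RDF4J REST protocol expects the context as an N-Triples encoded IRI: <iri>
    context = quote(f"<{SHACL_SHAPE_GRAPH}>", safe="")
    url = f"{base_url}/repositories/{repository_id}/statements?context={context}"
    return Request(
        url,
        data=shapes_turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )


# Hypothetical endpoint and repository id, for illustration only.
request = build_shape_import_request(
    "http://localhost:7200", "sphn-validation", "# SHACL shapes in Turtle go here\n"
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; authentication, if enabled on the server, is left out of this sketch.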

Step 3: Loading and validating a data graph

There are various options for loading data into GraphDB.

  • Import RDF text snippet allows us to copy and paste a few examples. Copy the following data graph and click on “Import”. Choose “The default graph” as Target Graph; no further options are required. Start the loading by pressing the “Import” button:

@prefix sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#> .
@prefix resource: <https://biomedit.ch/rdf/sphn-resource/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dg: <https://biomedit.ch/rdf/sphn-ontology/sphn/dataGraphValidation/> .
@prefix snomed: <http://snomed.info/id/> .

### AdministrativeCase
resource:CHE_108_904_325-AdministrativeCase-42A4EAC1-28DB-474F-0F1A-548008488DB6 sphn:hasDischargeDateTime "2020-04-15T11:00:00"^^xsd:dateTime;
         sphn:hasDischargeLocation resource:CHE_108_904_325-Location-RehabilitationHospital-Reha_a_Betrieb;
         sphn:hasAdmissionDateTime "2020-03-15T12:00:00"^^xsd:dateTime;
         sphn:hasIdentifier "42A4EAC1-28DB-474F-0F1A-548008488DB6"^^xsd:string;
         sphn:hasCareHandling resource:CareHandling-394656005;
         sphn:hasDataProviderInstitute resource:CHE_108_904_325-DataProviderInstitute;
         sphn:hasSubjectPseudoIdentifier resource:CHE_108_904_325-SubjectPseudoIdentifier-0938EAC1-1020-474F-CFB8-548008482DB1;
         sphn:hasSubjectPseudoIdentifier resource:CHE_108_904_325-SubjectPseudoIdentifier2-0938EAC1-1020-474F-CFB8-548008482DB1;
         a sphn:AdministrativeCase.

### Related classes
resource:CHE_108_904_325-Location-RehabilitationHospital-Reha_a_Betrieb sphn:hasClass resource:CHE_108_904_325-Location-Location_class-Reha_a_Betrieb;
         sphn:hasExact "Reha a.Betrieb"^^xsd:string;
         sphn:hasDataProviderInstitute resource:CHE_108_904_325-DataProviderInstitute;
         a sphn:Location.
resource:CHE_108_904_325-Location-Location_class-Reha_a_Betrieb
         a sphn:Location_class.

resource:CareHandling-394656005 sphn:hasTypeCode resource:Code-SNOMED-CT-394656005;
         a sphn:CareHandling.
resource:Code-SNOMED-CT-394656005 a snomed:394656005.

resource:CHE_108_904_325-DataProviderInstitute sphn:hasCode resource:CHE_108_904_325-Code-UID-CHE_108_904_325;
         a sphn:DataProviderInstitute.
resource:CHE_108_904_325-Code-UID-CHE_108_904_325 sphn:hasIdentifier "CHE_108_904_325"^^xsd:string;
         sphn:hasName "USZ"^^xsd:string;
         sphn:hasCodingSystemAndVersion "UID"^^xsd:string;
         a sphn:Code.

resource:CHE_108_904_325-SubjectPseudoIdentifier-0938EAC1-1020-474F-CFB8-548008482DB1 sphn:hasIdentifier "0938EAC1-1020-474F-CFB8-548008482DB1"^^xsd:string;
         sphn:hasDataProviderInstitute resource:CHE_108_904_325-DataProviderInstitute;
         a sphn:SubjectPseudoIdentifier.
resource:CHE_108_904_325-SubjectPseudoIdentifier2-0938EAC1-1020-474F-CFB8-548008482DB1 sphn:hasIdentifier "0938EAC1-1020-474F-CFB8-548008482DB1"^^xsd:string;
         sphn:hasDataProviderInstitute resource:CHE_108_904_325-DataProviderInstitute;
         a sphn:SubjectPseudoIdentifier.

While the data graph is being loaded, the SHACL validation is applied to the data. This example will stop with an error message referring to the instance and the failed constraint. In this case, the data graph has two SubjectPseudoIdentifier instances, where only one is allowed.

../_images/06_SHACL_validation_failed.png
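
To make the failed constraint concrete, the following Python sketch mimics a maximum-cardinality check on a list of triples. It is a simplified stand-in for what a SHACL engine does with a maximum count of one, not a SHACL implementation, and the shortened subject and object identifiers are hypothetical placeholders for the IRIs in the data graph above.

```python
from collections import defaultdict

SPHN = "https://biomedit.ch/rdf/sphn-ontology/sphn#"


def max_count_violations(triples, predicate, max_count=1):
    """Return {subject: [objects]} for subjects with more than max_count values."""
    values = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            values[subject].add(obj)
    return {s: sorted(objs) for s, objs in values.items() if len(objs) > max_count}


# Shortened, hypothetical identifiers standing in for the IRIs of the example.
triples = [
    ("case-42A4", SPHN + "hasSubjectPseudoIdentifier", "pseudo-id-1"),
    ("case-42A4", SPHN + "hasSubjectPseudoIdentifier", "pseudo-id-2"),
    ("case-42A4", SPHN + "hasIdentifier", "42A4EAC1"),
]

violations = max_count_violations(triples, SPHN + "hasSubjectPseudoIdentifier")
print(violations)
```

Removing one of the two hasSubjectPseudoIdentifier triples leaves the result empty, which corresponds to the corrected data graph.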

The following corrected data graph will pass the SHACL validation and will be inserted in the repository:

@prefix sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#> .
@prefix resource: <https://biomedit.ch/rdf/sphn-resource/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dg: <https://biomedit.ch/rdf/sphn-ontology/sphn/dataGraphValidation/> .
@prefix snomed: <http://snomed.info/id/> .

### AdministrativeCase
resource:CHE_108_904_325-AdministrativeCase-42A4EAC1-28DB-474F-0F1A-548008488DB6 sphn:hasDischargeDateTime "2020-04-15T11:00:00"^^xsd:dateTime;
         sphn:hasDischargeLocation resource:CHE_108_904_325-Location-RehabilitationHospital-Reha_a_Betrieb;
         sphn:hasAdmissionDateTime "2020-03-15T12:00:00"^^xsd:dateTime;
         sphn:hasIdentifier "42A4EAC1-28DB-474F-0F1A-548008488DB6"^^xsd:string;
         sphn:hasCareHandling resource:CareHandling-394656005;
         sphn:hasDataProviderInstitute resource:CHE_108_904_325-DataProviderInstitute;
         sphn:hasSubjectPseudoIdentifier resource:CHE_108_904_325-SubjectPseudoIdentifier-0938EAC1-1020-474F-CFB8-548008482DB1;
         a sphn:AdministrativeCase.

### Related classes
resource:CHE_108_904_325-Location-RehabilitationHospital-Reha_a_Betrieb sphn:hasClass resource:CHE_108_904_325-Location-Location_class-Reha_a_Betrieb;
         sphn:hasExact "Reha a.Betrieb"^^xsd:string;
         sphn:hasDataProviderInstitute resource:CHE_108_904_325-DataProviderInstitute;
         a sphn:Location.
resource:CHE_108_904_325-Location-Location_class-Reha_a_Betrieb
         a sphn:Location_class.

resource:CareHandling-394656005 sphn:hasTypeCode resource:Code-SNOMED-CT-394656005;
         a sphn:CareHandling.
resource:Code-SNOMED-CT-394656005 a snomed:394656005.

resource:CHE_108_904_325-DataProviderInstitute sphn:hasCode resource:CHE_108_904_325-Code-UID-CHE_108_904_325;
         a sphn:DataProviderInstitute.
resource:CHE_108_904_325-Code-UID-CHE_108_904_325 sphn:hasIdentifier "CHE_108_904_325"^^xsd:string;
         sphn:hasName "USZ"^^xsd:string;
         sphn:hasCodingSystemAndVersion "UID"^^xsd:string;
         a sphn:Code.

resource:CHE_108_904_325-SubjectPseudoIdentifier-0938EAC1-1020-474F-CFB8-548008482DB1 sphn:hasIdentifier "0938EAC1-1020-474F-CFB8-548008482DB1"^^xsd:string;
         sphn:hasDataProviderInstitute resource:CHE_108_904_325-DataProviderInstitute;
         a sphn:SubjectPseudoIdentifier.

We can see the following confirmation:

../_images/07_SHACL_validation_passed.png

Step 4: Updating and deleting shape graphs

Go to the SPARQL Editor and delete the SHACL Shape Graph explicitly with the following query:

CLEAR GRAPH <http://rdf4j.org/schema/rdf4j#SHACLShapeGraph>
../_images/08_Remove_SHACL_graph.png

Please note the following restrictions working with SHACL shape graphs in GraphDB:

  • Clearing the repository with the option “Explore > Graphs overview > Clear repository” does not remove the shape graph.

  • The “replacement of existing data” option in the Import settings does not work for SHACL shapes. SHACL shapes cannot be replaced; instead, the shape graph needs to be deleted as described above.

  • SHACL shapes cannot be accessed with SPARQL inside GraphDB.
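
Because shapes cannot be replaced in place, updating them amounts to clearing the shape graph and importing the new shapes. The clearing step can be sketched in Python as a SPARQL 1.1 Protocol update request. The request is only built here, not sent; the assumption is that the repository accepts updates at the RDF4J-style /repositories/<id>/statements endpoint, and the base URL and repository id are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The update shown in the SPARQL Editor step above.
CLEAR_SHAPES = "CLEAR GRAPH <http://rdf4j.org/schema/rdf4j#SHACLShapeGraph>"


def build_update_request(base_url, repository_id, update):
    """Build (but do not send) a form-encoded SPARQL update request."""
    url = f"{base_url}/repositories/{repository_id}/statements"
    body = urlencode({"update": update}).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical endpoint and repository id, for illustration only.
request = build_update_request("http://localhost:7200", "sphn-validation", CLEAR_SHAPES)
print(request.data.decode("ascii"))
```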

How to interpret a SHACL Validation Report?

A SHACL validation process produces a validation report as its result. A validation report for a data graph that satisfies the constraints specified in a shapes graph has the following content:

[   a sh:ValidationReport ;
    sh:conforms true ;
] .

The property sh:conforms with the value true indicates that no constraint violations have occurred. For simplicity, most implementations do not expose a validation report to the end user as long as the data graph conforms to the shapes graph. GraphDB, for instance, shows the message “imported successfully”.

A validation report for a data graph that does not satisfy all the constraints contains sh:conforms false. For each constraint violation, a validation result is added to the report. Each validation result contains information describing which data element violated which condition.

[   a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:focusNode ex:Bob ;
        sh:resultPath ex:age ;
        sh:value "twenty two" ;
        sh:resultMessage "ex:age expects a literal of datatype xsd:integer." ;
        sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
        sh:sourceShape ex:PersonShape-age ;
    ]
] .
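
When a report contains many results, it helps to reduce each one to a single readable line. The Python sketch below does this for a report that has already been read into plain Python objects. The ValidationResult class and the summarize function are hypothetical helpers for illustration; in practice the report would first be parsed with an RDF library.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    """Subset of the sh:ValidationResult properties shown above."""
    severity: str
    focus_node: str
    result_path: str
    value: str
    message: str


def summarize(conforms, results):
    """Turn a validation report into one human-readable line per result."""
    if conforms:
        return ["Data graph conforms to the shapes graph."]
    return [
        f"[{r.severity}] {r.focus_node} / {r.result_path} = {r.value!r}: {r.message}"
        for r in results
    ]


# The single validation result from the report above.
report = [
    ValidationResult(
        severity="sh:Violation",
        focus_node="ex:Bob",
        result_path="ex:age",
        value="twenty two",
        message="ex:age expects a literal of datatype xsd:integer.",
    )
]

for line in summarize(False, report):
    print(line)
```

Each line names the focus node (which data element), the path and value (what was wrong) and the message (which condition was violated), which is the order in which a report is usually read.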