Validate data with SHACL rules

Validating data against a given schema helps ensure its quality and its usability by others. It is therefore an important step in the context of SPHN.

Target Audience

This document is mainly intended for data providers and project data managers who wish to validate their data against the schema. There are several ways to validate data produced in the SPHN RDF format. This document shows how data can be validated using SHACL shapes in GraphDB.

Note

Data generated with the SPHN Connector has already been validated against the SPHN-generated SHACLs. The Connector uses the Quality Check Tool for this. Refer to SPHN RDF Quality Check Tool for more information.

SHACL validation in GraphDB

This section describes how to validate an RDF data graph against a set of constraints expressed in SHACL. This walkthrough uses GraphDB, a graph database for RDF with SPARQL support. A set of SHACLs to validate data according to the SPHN RDF Schema can be downloaded here (learn more about SHACL constraint components implemented in SPHN).

This document uses the following nomenclature:

  • SHACL: refers to the Shapes Constraint Language, standardized by the W3C in https://www.w3.org/TR/shacl/. For an introduction to SHACL, visit the SHACL Background section.

  • Data Graph: refers to an RDF graph with information about e.g. Drugs, BodyHeight. An example RDF file shacl_test_graph.ttl is provided for testing purposes.

  • Shapes Graph: refers to constraints in SHACL, which are themselves expressed as RDF.

Step 1: Preparing a new repository

In GraphDB, SHACL validation needs to be enabled during the creation of a repository. It is not possible to do this afterwards for an already existing repository. A new repository with SHACL validation can be created as follows:

  • Open the GraphDB Workbench, a web-based user interface, and log in with your credentials.

  • Navigate to Setup > Repositories > Create new repository:

../_images/01_create_new_repository.png
  • Click on Enable SHACL validation on the options page:

../_images/02_create_new_repository_settings.png ../_images/03_connect_to_repository.png

Step 2: Importing SHACL shapes

Shape graphs can be inserted using any method for loading RDF data into GraphDB. The GraphDB Workbench provides three methods:

  • Upload RDF files

  • Get RDF data from a URL

  • Import RDF text snippet

If a shape graph is uploaded directly to the server, it appears in the Server files tab.

As a first example, we use the Upload RDF files option:

  • Select a SHACL file from your local computer to upload it to the server.

  • An uploaded RDF file is added to the user data list of files available for importing.

  • Click on the Import button to initiate the import of the uploaded file to the repository.

../_images/04_import_shacl_shapes.png
  • In the import dialog box, select Named graph under Target graphs. SHACL validation requires a reserved graph name. Fill in: http://rdf4j.org/schema/rdf4j#SHACLShapeGraph

  • Your options pane should look like the following:

../_images/05_use_reserved_named_graph.png

A successful import is confirmed with the message: “imported successfully in less than a second.”
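
For scripted imports, the same upload can be sketched in Python. This is only a minimal sketch, not the documented SPHN procedure: it assumes that GraphDB exposes the RDF4J-compatible REST endpoint /repositories/<id>/statements, and the base URL http://localhost:7200 and the repository name sphn-validation are hypothetical. The graph name is the reserved one used in the CLEAR GRAPH query of Step 4. The request is built but not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Reserved graph name for SHACL shapes (same IRI as in the CLEAR GRAPH query).
SHAPE_GRAPH = "http://rdf4j.org/schema/rdf4j#SHACLShapeGraph"


def build_shape_upload(base_url: str, repository: str, turtle: str) -> Request:
    """Build (but do not send) a request that adds Turtle shapes to the shape graph."""
    # The context parameter is an N-Triples-encoded IRI (<...>), URL-escaped.
    context = quote(f"<{SHAPE_GRAPH}>", safe="")
    url = f"{base_url.rstrip('/')}/repositories/{repository}/statements?context={context}"
    return Request(
        url,
        data=turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )


# Hypothetical base URL and repository name; replace with your own.
request = build_shape_upload("http://localhost:7200", "sphn-validation", "# shapes here")
print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen(request) would additionally require credentials if the GraphDB instance has security enabled.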

Step 3: Loading and validating a data graph

There are various options for loading data into GraphDB.

  • Import RDF text snippet allows us to simply copy and paste a few examples. Copy the following data graph and click on “Import”. Choose “The default graph” as Target Graph; no further options are required. Start the loading by pressing the “Import” button:

@prefix sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#> .
@prefix resource: <https://biomedit.ch/rdf/sphn-resource/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix snomed: <http://snomed.info/id/> .
@prefix dg: <https://biomedit.ch/rdf/sphn-schema/sphn/dataGraphValidation/> .

### AdministrativeCase
resource:CHE_108_904_325-AdministrativeCase-42A4EAC1-28DB-474F-0F1A-548008488DB6 sphn:hasDischargeDateTime "2020-04-15T11:00:00"^^xsd:dateTime;
         sphn:hasDischargeLocation resource:CHE_108_904_325-Location-RehabilitationHospital-Reha_a_Betrieb;
         sphn:hasAdmissionDateTime "2020-03-15T12:00:00"^^xsd:dateTime;
         sphn:hasIdentifier "42A4EAC1-28DB-474F-0F1A-548008488DB6"^^xsd:string;
         sphn:hasCareHandling resource:CareHandling-394656005;
         sphn:hasDataProvider resource:CHE_108_904_325-DataProvider;
         sphn:hasSubjectPseudoIdentifier resource:CHE_108_904_325-SubjectPseudoIdentifier-0938EAC1-1020-474F-CFB8-548008482DB1;
         sphn:hasSubjectPseudoIdentifier resource:CHE_108_904_325-SubjectPseudoIdentifier2-0938EAC1-1020-474F-CFB8-548008482DB1;
         a sphn:AdministrativeCase.

### Related classes
resource:CHE_108_904_325-Location-RehabilitationHospital-Reha_a_Betrieb sphn:hasExact "Reha a.Betrieb"^^xsd:string;
         sphn:hasTypeCode resource:Code-SNOMED-CT-225728007;
         sphn:hasDataProvider resource:CHE_108_904_325-DataProvider;
         a sphn:Location.
resource:Code-SNOMED-CT-225728007 a snomed:225728007 .

resource:CareHandling-394656005 sphn:hasTypeCode resource:Code-SNOMED-CT-394656005;
         a sphn:CareHandling.
resource:Code-SNOMED-CT-394656005 a snomed:394656005.

resource:CHE_108_904_325-DataProvider sphn:hasCode resource:CHE_108_904_325-Code-UID-CHE_108_904_325;
         a sphn:DataProvider.
resource:CHE_108_904_325-Code-UID-CHE_108_904_325 sphn:hasIdentifier "CHE_108_904_325"^^xsd:string;
         sphn:hasName "USZ"^^xsd:string;
         sphn:hasCodingSystemAndVersion "UID"^^xsd:string;
         a sphn:Code.

resource:CHE_108_904_325-SubjectPseudoIdentifier-0938EAC1-1020-474F-CFB8-548008482DB1 sphn:hasIdentifier "0938EAC1-1020-474F-CFB8-548008482DB1"^^xsd:string;
         sphn:hasDataProvider resource:CHE_108_904_325-DataProvider;
         a sphn:SubjectPseudoIdentifier.
resource:CHE_108_904_325-SubjectPseudoIdentifier2-0938EAC1-1020-474F-CFB8-548008482DB1 sphn:hasIdentifier "0938EAC1-1020-474F-CFB8-548008482DB1"^^xsd:string;
         sphn:hasDataProvider resource:CHE_108_904_325-DataProvider;
         a sphn:SubjectPseudoIdentifier.

While the data graph is loading, the SHACL validation is applied to the data. This example will stop with an error message that refers to the instance and the failed constraint. In this case, the data graph has two SubjectPseudoIdentifier instances where only one is allowed.

../_images/06_SHACL_validation_failed.png
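
The kind of check that fails here can be illustrated with a few lines of Python. This is a simplified sketch, not how GraphDB evaluates SHACL: it assumes the failed constraint is a maximum cardinality of one (as sh:maxCount 1 would express), and the shortened node names case-1, spi-1, spi-2 and provider-1 are hypothetical stand-ins for the long resource IRIs above.

```python
from collections import defaultdict

# Simplified (subject, predicate, object) triples mirroring the failing example.
TRIPLES = [
    ("case-1", "sphn:hasSubjectPseudoIdentifier", "spi-1"),
    ("case-1", "sphn:hasSubjectPseudoIdentifier", "spi-2"),
    ("case-1", "sphn:hasDataProvider", "provider-1"),
]


def max_count_violations(triples, predicate, max_count):
    """Return {subject: sorted values} for subjects with more than max_count distinct values."""
    values = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            values[subject].add(obj)
    return {s: sorted(v) for s, v in values.items() if len(v) > max_count}


print(max_count_violations(TRIPLES, "sphn:hasSubjectPseudoIdentifier", 1))
# {'case-1': ['spi-1', 'spi-2']}
```

Removing one of the two sphn:hasSubjectPseudoIdentifier triples makes the function return an empty dictionary.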

The following corrected data graph will pass the SHACL validation and will be inserted in the repository:

@prefix sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#> .
@prefix resource: <https://biomedit.ch/rdf/sphn-resource/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix snomed: <http://snomed.info/id/> .
@prefix dg: <https://biomedit.ch/rdf/sphn-schema/sphn/dataGraphValidation/> .

### AdministrativeCase
resource:CHE_108_904_325-AdministrativeCase-42A4EAC1-28DB-474F-0F1A-548008488DB6 sphn:hasDischargeDateTime "2020-04-15T11:00:00"^^xsd:dateTime;
         sphn:hasDischargeLocation resource:CHE_108_904_325-Location-RehabilitationHospital-Reha_a_Betrieb;
         sphn:hasAdmissionDateTime "2020-03-15T12:00:00"^^xsd:dateTime;
         sphn:hasIdentifier "42A4EAC1-28DB-474F-0F1A-548008488DB6"^^xsd:string;
         sphn:hasCareHandling resource:CareHandling-394656005;
         sphn:hasDataProvider resource:CHE_108_904_325-DataProvider;
         sphn:hasSubjectPseudoIdentifier resource:CHE_108_904_325-SubjectPseudoIdentifier-0938EAC1-1020-474F-CFB8-548008482DB1;
         a sphn:AdministrativeCase.

### Related classes
resource:CHE_108_904_325-Location-RehabilitationHospital-Reha_a_Betrieb  sphn:hasExact "Reha a.Betrieb"^^xsd:string;
         sphn:hasTypeCode resource:Code-SNOMED-CT-225728007;
         sphn:hasDataProvider resource:CHE_108_904_325-DataProvider;
         a sphn:Location.
resource:Code-SNOMED-CT-225728007 a snomed:225728007 .

resource:CareHandling-394656005 sphn:hasTypeCode resource:Code-SNOMED-CT-394656005;
         a sphn:CareHandling.
resource:Code-SNOMED-CT-394656005 a snomed:394656005.

resource:CHE_108_904_325-DataProvider sphn:hasCode resource:CHE_108_904_325-Code-UID-CHE_108_904_325;
         a sphn:DataProvider.
resource:CHE_108_904_325-Code-UID-CHE_108_904_325 sphn:hasIdentifier "CHE_108_904_325"^^xsd:string;
         sphn:hasName "USZ"^^xsd:string;
         sphn:hasCodingSystemAndVersion "UID"^^xsd:string;
         a sphn:Code.

resource:CHE_108_904_325-SubjectPseudoIdentifier-0938EAC1-1020-474F-CFB8-548008482DB1 sphn:hasIdentifier "0938EAC1-1020-474F-CFB8-548008482DB1"^^xsd:string;
         sphn:hasDataProvider resource:CHE_108_904_325-DataProvider;
         a sphn:SubjectPseudoIdentifier.

We can see the following confirmation:

../_images/07_SHACL_validation_passed.png

Step 4: Updating and deleting shape graphs

Go to the SPARQL Editor and delete the SHACL Shape Graph explicitly with the following query:

CLEAR GRAPH <http://rdf4j.org/schema/rdf4j#SHACLShapeGraph>
../_images/08_Remove_SHACL_graph.png
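
Outside the Workbench, the same update could be submitted over HTTP. The following Python sketch builds (without sending) a SPARQL 1.1 Protocol update request. It assumes that GraphDB accepts updates on the RDF4J-compatible endpoint /repositories/<id>/statements, and the base URL and repository name are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The update from the SPARQL Editor step above.
CLEAR_SHAPES = "CLEAR GRAPH <http://rdf4j.org/schema/rdf4j#SHACLShapeGraph>"


def build_update_request(base_url: str, repository: str, update: str) -> Request:
    """Build (but do not send) a form-encoded SPARQL update request."""
    url = f"{base_url.rstrip('/')}/repositories/{repository}/statements"
    body = urlencode({"update": update}).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical base URL and repository name; replace with your own.
clear_request = build_update_request("http://localhost:7200", "sphn-validation", CLEAR_SHAPES)
print(clear_request.get_method(), clear_request.full_url)
```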

Please note the following restrictions when working with SHACL shape graphs in GraphDB:

  • Clearing the repository with the option “Explore > Graphs overview > Clear repository” does not remove the shape graph.

  • The “replacement of existing data” option in the Import settings does not work for SHACL shapes. SHACL shapes cannot be replaced; instead, the shape graph needs to be deleted as described above.

  • SHACL shapes cannot be accessed with SPARQL inside GraphDB.

How to interpret a SHACL Validation Report?

A SHACL validation process produces a validation report as its result, which states the conformance (true or false) and contains a set of validation results.

A validation report for a data graph that satisfies the constraints specified in a shapes graph has the following content:

[   a sh:ValidationReport ;
    sh:conforms true ;
] .

The property sh:conforms with the value true indicates that no constraint violations have occurred. For simplicity, most implementations do not expose a validation report to the end user as long as the data graph conforms to the shapes graph. GraphDB, for instance, only shows the message “imported successfully”.

When a data graph does not satisfy all the constraints, the validation report states sh:conforms false. For each constraint violation, a validation result is added to the report. Each validation result describes which data element violated which condition. The following is an example of a validation report where the data graph does not conform to the shapes graph.

[   a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [ a sh:ValidationResult ;  ...] ,
              [ a sh:ValidationResult ;  ...]
] .

The validation report is again an RDF graph. The following table summarizes the SHACL Validation Report Properties (note that the namespaces have been omitted to simplify the representation).

Validation Report Properties

| Property Name | Property | Description |
|---|---|---|
| Conformance Checking | sh:conforms | true if the validation produced no validation results, false otherwise |
| Validation Results | sh:result | One sh:result is added for each validation result produced |
| Focus node | sh:focusNode | A validation result has exactly one focus node, which was validated and has caused the violation |
| Path | sh:resultPath | Equivalent to the value of sh:path of the shape |
| Value | sh:value | RDF term (at most one) that caused the result |
| Source | sh:sourceShape | Shape name that the focus node was validated against |
| Constraint Component | sh:sourceConstraintComponent | Specifies the constraint component that caused the result, e.g., the constraint sh:minCount has sh:MinCountConstraintComponent |
| Details | sh:detail | May link to other violations for that shape |
| Message | sh:resultMessage | Communicates additional textual details to humans |
| Severity | sh:resultSeverity | The severity level of the shape that caused the result |

A validation report may provide guidance on how to identify or fix violations in the data graph. In the following example, the focus node ex:Bob violates the SHACL shape ex:PersonShape-age. Specifically, the property ex:age is only allowed to have integer values, but the literal "twenty two" was found (see the sh:value "twenty two" ; line).

[   a sh:ValidationReport ;
        sh:conforms false ;
        sh:result [
              a sh:ValidationResult ;
              sh:resultSeverity sh:Violation ;
              sh:focusNode ex:Bob ;
              sh:resultPath ex:age ;
              sh:value "twenty two" ;
              sh:resultMessage "ex:age expects a literal of datatype xsd:integer." ;
              sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
              sh:sourceShape ex:PersonShape-age ;
        ]
] .
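
To show how the properties of a validation result fit together, the following Python sketch turns such a result into a one-line summary. It is only an illustration: the report is transcribed by hand into a dictionary with hypothetical key names, whereas a real script would parse the RDF report with an RDF library.

```python
# The validation result from the report above, transcribed by hand into a dict.
REPORT = {
    "conforms": False,
    "results": [
        {
            "severity": "sh:Violation",
            "focusNode": "ex:Bob",
            "resultPath": "ex:age",
            "value": '"twenty two"',
            "message": "ex:age expects a literal of datatype xsd:integer.",
            "component": "sh:DatatypeConstraintComponent",
            "sourceShape": "ex:PersonShape-age",
        }
    ],
}


def summarise(report):
    """Return one human-readable line per validation result."""
    if report["conforms"]:
        return ["Data graph conforms: no validation results."]
    return [
        f'{r["severity"]}: {r["focusNode"]} {r["resultPath"]} {r["value"]} '
        f'failed {r["component"]} of {r["sourceShape"]} ({r["message"]})'
        for r in report["results"]
    ]


for line in summarise(REPORT):
    print(line)
```

Read from left to right, the summary answers the usual questions: how severe the problem is, which node and property are affected, which value caused it, and which constraint of which shape was violated.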

Data Graph Example

The following data graph contains two instances of Allergy (allergies:allergy1, allergies:allergy2) connected to several data points. Among them is a triple stating that allergies:allergy1 is connected to the DataProvider sib:hospital1.

Figure: Data Graph Example.

@prefix allergies: <http://sib.swiss/allergies/> .
@prefix patients: <http://sib.swiss/fictivePatients/> .
@prefix allergens: <http://sib.swiss/allergens/> .
@prefix sib: <http://sib.swiss/> .
@prefix sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#> .
@prefix snomed: <http://snomed.info/id/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# types
patients:anonymous1 rdf:type sphn:SubjectPseudoIdentifier .
patients:anonymous2 rdf:type sphn:SubjectPseudoIdentifier .
sib:hospital1 rdf:type sphn:DataProvider .
allergies:allergy1 rdf:type sphn:Allergy .
allergies:allergy2 rdf:type sphn:Allergy .
allergens:peanuts1 rdf:type snomed:762952008 .

# relations to the allergy
allergies:allergy1 sphn:hasSubjectPseudoIdentifier patients:anonymous1 .
allergies:allergy1 sphn:hasDataProvider sib:hospital1 .
allergies:allergy1 sphn:hasAllergen allergens:peanuts1 .
allergies:allergy2 sphn:hasSubjectPseudoIdentifier patients:anonymous2 .

Below, two example SHACL shapes constrain which DataProvider values an instance of the Allergy class may be connected to.

SHACL shape to which the Data Graph conforms (no validation results are produced)

@prefix sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://sib.swiss/examples#> .
@prefix sib:  <http://sib.swiss/> .

ex:SibShape2_correct
    a sh:NodeShape ;
    sh:targetClass sphn:Allergy ;
    sh:property [
        sh:path sphn:hasDataProvider ;
        sh:in ( sib:hospital1 sib:hospital2 sib:hospital3 )
    ].

The shape states that the sphn:hasDataProvider value of an instance of sphn:Allergy must be one of sib:hospital1, sib:hospital2 or sib:hospital3. Therefore, the triple allergies:allergy1 sphn:hasDataProvider sib:hospital1 . is valid. The tool will not produce any validation results since there are no errors according to this shape. The example data graph conforms to this SHACL shape.

Modified SHACL shape to which the Data Graph does not conform (validation results are produced)

@prefix sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://sib.swiss/examples#> .
@prefix sib:  <http://sib.swiss/> .

ex:SibShape2_wrong
    a sh:NodeShape ;
    sh:targetClass sphn:Allergy ;
    sh:property [
        sh:path sphn:hasDataProvider ;
        sh:in ( sib:hospital2 sib:hospital3 )
    ].

The shape now states that the sphn:hasDataProvider value of an instance of sphn:Allergy can only be sib:hospital2 or sib:hospital3. The triple allergies:allergy1 sphn:hasDataProvider sib:hospital1 . is no longer valid, since sib:hospital1 is not listed as an allowed DataProvider. The tool will produce a validation result indicating the violation of this shape (which should be interpreted as an error). With this shape, the example data graph no longer conforms.
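
The effect of the two sh:in lists can be reproduced with a small Python sketch. It is a simplified stand-in for a SHACL engine, covering only sh:targetClass and sh:in on a single property path. The triples are copied from the example data graph as prefixed names, and the function name in_violations is hypothetical.

```python
# (subject, predicate, object) triples taken from the example data graph.
DATA = [
    ("allergies:allergy1", "rdf:type", "sphn:Allergy"),
    ("allergies:allergy2", "rdf:type", "sphn:Allergy"),
    ("allergies:allergy1", "sphn:hasDataProvider", "sib:hospital1"),
]


def in_violations(triples, target_class, path, allowed):
    """Return (focus node, value) pairs whose value for `path` is not in `allowed`."""
    # Focus nodes are the instances of the target class.
    targets = {s for s, p, o in triples if p == "rdf:type" and o == target_class}
    return [
        (s, o)
        for s, p, o in triples
        if s in targets and p == path and o not in allowed
    ]


correct = in_violations(DATA, "sphn:Allergy", "sphn:hasDataProvider",
                        {"sib:hospital1", "sib:hospital2", "sib:hospital3"})
wrong = in_violations(DATA, "sphn:Allergy", "sphn:hasDataProvider",
                      {"sib:hospital2", "sib:hospital3"})
print(correct)  # []
print(wrong)    # [('allergies:allergy1', 'sib:hospital1')]
```

Note that allergies:allergy2 has no sphn:hasDataProvider value and therefore produces no result under either shape: sh:in only restricts values that are present.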

Generate your own SHACL rules with the SHACLer

The SHACLer generates a SHACL file for data validation based on an SPHN-compliant (project) schema. For additional information, check the SHACLer documentation.