Data quality assurance for validation

1. Summary statistics

Note

A set of statistical queries is provided for an initial evaluation of the data content and quality: https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-ontology/-/tree/master/quality_assurance/. These queries can be run in any triplestore that supports querying RDF data (e.g. GraphDB, Jena) by copying and pasting the content of a query into the query field. More information on how to run SPARQL queries can be found in the Training Video and in the user guide.

The queries are mostly built in the following manner:

First, prefixes used in the queries need to be defined to facilitate the (human) reading and writing of the query:

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-ontology/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX psss:<https://biomedit.ch/rdf/sphn-ontology/psss#>
PREFIX spo:<https://biomedit.ch/rdf/sphn-ontology/spo#>

Then the type of query should be specified. Here, the SELECT query form is used to specify the variables expected in the results.

Note that variables start with a question mark ? in SPARQL:

SELECT ?concept (COUNT(?resource) AS ?sphn_concepts_resources)
 (COUNT(distinct ?subject) as ?subject_cnt)
 (COUNT(distinct ?case) as ?case_cnt)
 (COUNT(distinct ?provider) as ?provider_cnt)

In this example, five variables are returned in the results:

  • a concept,

  • the count of resources for that concept,

  • the count of distinct subjects,

  • the count of distinct cases, and

  • the count of distinct providers for that concept.
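Many triplestores can return SELECT results in the W3C SPARQL 1.1 Query Results JSON format. The following is a minimal sketch of how the five variables could be read from such a response with the Python standard library. The helper name rows_from_sparql_json and the sample response (including all counts) are illustrative assumptions, not part of the SPHN tooling or real statistics.

```python
import json

# Illustrative response in the SPARQL 1.1 Query Results JSON format.
# The counts are made-up sample values; "datatype" keys are omitted for brevity.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["concept", "sphn_concepts_resources",
                    "subject_cnt", "case_cnt", "provider_cnt"]},
  "results": {"bindings": [
    {"concept": {"type": "uri",
                 "value": "https://biomedit.ch/rdf/sphn-ontology/sphn#AdministrativeCase"},
     "sphn_concepts_resources": {"type": "literal", "value": "4"},
     "subject_cnt": {"type": "literal", "value": "2"},
     "case_cnt": {"type": "literal", "value": "0"},
     "provider_cnt": {"type": "literal", "value": "1"}}
  ]}
}
"""

# The four aggregate variables defined in the SELECT clause.
COUNT_COLUMNS = ("sphn_concepts_resources", "subject_cnt", "case_cnt", "provider_cnt")


def rows_from_sparql_json(text):
    """Turn a SPARQL JSON result into a list of plain dictionaries."""
    payload = json.loads(text)
    rows = []
    for binding in payload["results"]["bindings"]:
        row = {"concept": binding["concept"]["value"]}
        for name in COUNT_COLUMNS:
            # Counts arrive as strings and are converted to integers here.
            row[name] = int(binding[name]["value"])
        rows.append(row)
    return rows


rows = rows_from_sparql_json(SAMPLE_RESPONSE)
for row in rows:
    print(row["concept"], row["sphn_concepts_resources"], row["subject_cnt"])
```

A count of 0 for cases or providers, as in the sample row, is what one would expect when none of the resources of a concept carry the corresponding optional link.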

Following the definition of variables, the graph pattern of interest must then be specified with the clause WHERE:

WHERE {
 { ?concept rdfs:subClassOf+ sphn:SPHNConcept } UNION { ?concept rdfs:subClassOf+ psss:PSSSConcept } UNION { ?concept rdfs:subClassOf+ spo:SPOConcept } .
 ?resource a ?concept .
 optional {?resource sphn:hasDataProviderInstitute ?provider}
 optional {?resource sphn:hasSubjectPseudoIdentifier ?subject}
 optional {?resource sphn:hasAdministrativeCase ?case}

Here, the query searches for patterns where a resource is typed with an RDF class ?concept. This concept must be a subclass, directly or indirectly, of at least one of the SPHN, PSSS or SPO root concepts. Finally, the resource can optionally be connected to:

  • a data provider institute,

  • a subject pseudo identifier, or

  • an administrative case.
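The + in rdfs:subClassOf+ is a SPARQL property path meaning "one or more steps". As a rough illustration of what this matches compared with a single rdfs:subClassOf step, the sketch below walks a toy class hierarchy in plain Python. The ex: class names and the hierarchy itself are hypothetical and do not reflect the real SPHN ontology.

```python
# Toy hierarchy: class -> set of direct superclasses.
# All ex: names are hypothetical and only serve the illustration.
DIRECT_SUPERCLASSES = {
    "ex:Measurement": {"sphn:SPHNConcept"},
    "ex:BodyHeight": {"ex:Measurement"},
    "ex:Unrelated": {"ex:Other"},
}


def direct_subclasses(root, direct_superclasses):
    """Classes exactly one subClassOf step below root."""
    return {cls for cls, parents in direct_superclasses.items() if root in parents}


def transitive_subclasses(root, direct_superclasses):
    """Classes one or more subClassOf steps below root (like subClassOf+)."""
    found = set()
    changed = True
    while changed:
        changed = False
        for cls, parents in direct_superclasses.items():
            # A class qualifies if a parent is the root or an already found class.
            if cls not in found and (root in parents or parents & found):
                found.add(cls)
                changed = True
    return found


one_step = direct_subclasses("sphn:SPHNConcept", DIRECT_SUPERCLASSES)
one_or_more = transitive_subclasses("sphn:SPHNConcept", DIRECT_SUPERCLASSES)
print(sorted(one_step))
print(sorted(one_or_more))
```

In this toy hierarchy, only the transitive variant reaches ex:BodyHeight, which is why the queries use the + path to collect concepts at any depth below the root concepts.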

Next, graph patterns that are not of interest can be excluded from the results with the clause FILTER NOT EXISTS:

FILTER NOT EXISTS {?concept rdfs:subClassOf sphn:ValueSet}

Here, the query filters out classes that are subClasses of the SPHN class ValueSet.

Finally, the query can end with query modifiers. Here, the results are grouped by one variable and sorted by another:

} group by ?concept order by desc(?sphn_concepts_resources)

In this example, the results are grouped by the concepts retrieved and sorted by descending resource count.
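For readers who prefer scripting over copy-pasting, the fragments above can be assembled into one query and wrapped in an HTTP request following the SPARQL 1.1 Protocol. This is a minimal sketch using only the Python standard library: the endpoint URL and the repository name my-repo are assumptions (GraphDB is assumed to listen on localhost:7200), and the request is only built here, not sent.

```python
import urllib.parse
import urllib.request

# Only the prefixes actually used by this query are declared.
PREFIXES = """\
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-ontology/sphn#>
PREFIX psss:<https://biomedit.ch/rdf/sphn-ontology/psss#>
PREFIX spo:<https://biomedit.ch/rdf/sphn-ontology/spo#>
"""

# The SELECT, WHERE, FILTER and modifier fragments from the text, joined together.
QUERY = PREFIXES + """\
SELECT ?concept (COUNT(?resource) AS ?sphn_concepts_resources)
 (COUNT(distinct ?subject) as ?subject_cnt)
 (COUNT(distinct ?case) as ?case_cnt)
 (COUNT(distinct ?provider) as ?provider_cnt)
WHERE {
 { ?concept rdfs:subClassOf+ sphn:SPHNConcept } UNION { ?concept rdfs:subClassOf+ psss:PSSSConcept } UNION { ?concept rdfs:subClassOf+ spo:SPOConcept } .
 ?resource a ?concept .
 optional {?resource sphn:hasDataProviderInstitute ?provider}
 optional {?resource sphn:hasSubjectPseudoIdentifier ?subject}
 optional {?resource sphn:hasAdministrativeCase ?case}
 FILTER NOT EXISTS {?concept rdfs:subClassOf sphn:ValueSet}
} group by ?concept order by desc(?sphn_concepts_resources)
"""


def build_query_request(endpoint_url, query=QUERY):
    """Build (but do not send) a SPARQL 1.1 Protocol POST request."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


# Hypothetical endpoint: adjust host, port and repository name to your setup.
request = build_query_request("http://localhost:7200/repositories/my-repo")

# To actually run the query against a reachable endpoint:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

If the repository requires authentication, the request would additionally need the credentials expected by the server, which are not covered in this sketch.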

For data following project-specific ontologies, it may be necessary to adjust the queries to fit the search for certain elements. For any help, please contact the DCC at dcc@sib.swiss.

2. Data validation

SPHN SHACL validation in GraphDB

This section describes how to validate an RDF data graph against a set of constraints expressed in SHACL. This walkthrough uses GraphDB, a graph database for RDF with SPARQL support. A set of SHACL shapes for validating data according to the SPHN RDF schema can be downloaded here (learn more here about which SHACLs are included).

This document uses the following terminology:

SHACL

Shapes Constraint Language, standardized in https://www.w3.org/TR/shacl/

Data Graph

refers to an RDF graph with information about e.g. Drugs, BodyHeight. An example RDF file “shacl_test_graph.ttl” is provided for testing purposes.

Shapes Graph

refers to constraints in SHACL which are expressed as RDF.

Step 1: Preparing a new repository

In GraphDB, SHACL validation needs to be enabled during the creation of a repository. It is not possible to do this afterwards for an already existing repository. A new repository with SHACL validation can be created as follows:

  • Open the GraphDB Workbench, a web-based user interface, and log in with your credentials.

  • Navigate to Setup > Repositories > Create new repository:

../_images/01_create_new_repository.png
  • Click on Enable SHACL validation on the options page:

../_images/02_create_new_repository_settings.png ../_images/03_connect_to_repository.png

Step 2: Importing SHACL shapes

Shape graphs can be inserted using any method for loading RDF data into GraphDB. The GraphDB Workbench provides three methods:

  • Upload RDF files,

  • GET RDF data from a URL, and

  • Import RDF text snippet.

In case a shape graph is uploaded directly to the server, it appears in the tab Server files.

As a first example, we use the Upload RDF files option:

  • Select a SHACL file from your local computer and it will be uploaded to the server.

  • An uploaded RDF file is added to the user data list of files available for importing.

  • Click on the Import button to initiate the import of the uploaded file to the repository.

../_images/04_import_shacl_shapes.png
  • In the import dialog box, select Named graph under Target graphs. A reserved graph name is required for SHACL validation. Fill in:

  • Your options pane should look like the following:

../_images/05_use_reserved_named_graph.png

A successful import is confirmed with the message: “imported successfully in less than a second.”
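Besides the Workbench, shapes could in principle also be loaded from a script. The sketch below builds an HTTP request that posts a Turtle shapes file into the reserved shape graph named in Step 4 of this guide. It assumes that the repository exposes the RDF4J REST API (a statements endpoint with a context parameter) at a hypothetical URL, and the tiny shape used as payload is a hand-written illustration, not one of the official SPHN shapes. The request is only built, not sent.

```python
import urllib.parse
import urllib.request

# Reserved graph name, as used in the CLEAR GRAPH statement of Step 4.
SHACL_SHAPE_GRAPH = "http://rdf4j.org/schema/rdf4j#SHACLShapeGraph"

# Hand-written illustrative shape (hypothetical, not the official SPHN SHACL).
SAMPLE_SHAPES = """\
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#> .
@prefix ex: <http://example.org/shapes/> .

ex:AdministrativeCaseShape a sh:NodeShape ;
    sh:targetClass sphn:AdministrativeCase ;
    sh:property [ sh:path sphn:hasSubjectPseudoIdentifier ; sh:maxCount 1 ] .
"""


def build_shapes_upload_request(repository_url, shapes_turtle):
    """Build (but do not send) a POST that adds shapes to the reserved graph."""
    # The RDF4J REST API expects the context as an IRI in angle brackets.
    params = urllib.parse.urlencode({"context": f"<{SHACL_SHAPE_GRAPH}>"})
    return urllib.request.Request(
        f"{repository_url}/statements?{params}",
        data=shapes_turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )


# Hypothetical repository URL: adjust host, port and repository name.
upload_request = build_shapes_upload_request(
    "http://localhost:7200/repositories/my-repo", SAMPLE_SHAPES
)
print(upload_request.full_url)
```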

Step 3: Loading and validating a data graph

There are various options for loading data into GraphDB.

  • Import RDF text snippet allows us to copy and paste a few examples. Copy the following data graph and click on “Import”. Choose “The default graph” as the target graph; no further options are required. Start the loading by pressing the “Import” button:

@prefix : <https://biomedit.ch/rdf/sphn-ontology/sphn#> .
@prefix dg: <https://biomedit.ch/rdf/sphn-ontology/sphn/dataGraphValidation/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
dg:AdministrativeCase_2
   a :AdministrativeCase ;
   :hasSubjectPseudoIdentifier dg:SubjectPseudoIdentifier_AdministrativeCase_1;
   :hasSubjectPseudoIdentifier dg:SubjectPseudoIdentifier_AdministrativeCase_2;
   :hasAdministrativeCaseAdmissionDateTime "2021-01-01T00:00:00"^^xsd:dateTime;
   :hasLocation dg:Location_AdministrativeCase_1;
   :hasDateTime "2021-01-01T00:00:00"^^xsd:dateTime;
   :hasIdentifier "AdministrativeCase_1_Identifier";
   :hasAdministrativeCaseCareHandling dg:hasAdministrativeCaseCareHandling_AdministrativeCase_1;
   :hasAdministrativeCaseDischargeDateTime     "2021-01-01T00:00:00"^^xsd:dateTime;
   :hasAdministrativeCaseOriginLocation    dg:Location_AdministrativeCase_1;
   :hasEndDateTime "2021-01-01T00:00:00"^^xsd:dateTime;
   :hasDestinationLocation dg:Location_AdministrativeCase_1;
   :hasOriginLocation  dg:Location_AdministrativeCase_1;
   :hasStartDateTime   "2021-01-01T00:00:00"^^xsd:dateTime;
   :hasDataProviderInstitute dg:DataProviderInstitute_AdministrativeCase_1;
   :hasAdministrativeCaseDischargeLocation dg:Location_AdministrativeCase_1.
dg:SubjectPseudoIdentifier_AdministrativeCase_1 a :SubjectPseudoIdentifier.
dg:SubjectPseudoIdentifier_AdministrativeCase_2 a :SubjectPseudoIdentifier.
dg:Location_AdministrativeCase_1 a :Location.
dg:hasAdministrativeCaseCareHandling_AdministrativeCase_1 a :CareHandling  .
dg:DataProviderInstitute_AdministrativeCase_1 a :DataProviderInstitute .

While the data graph is loading, the SHACL validation is applied to the data. This example stops with an error message that refers to the instance and the failed constraint. In this case, the data graph links two SubjectPseudoIdentifier instances to the case, where only one is allowed.

../_images/06_SHACL_validation_failed.png
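To make the failed constraint more tangible, the following sketch reproduces the idea of a maximum-cardinality check in plain Python on a few triples taken from the failing data graph above. It assumes that the violated rule behaves like a "max count 1" constraint on hasSubjectPseudoIdentifier, as the error description suggests; the function name max_count_violations is hypothetical and this is not how GraphDB implements SHACL.

```python
from collections import defaultdict

# (subject, predicate, object) triples copied from the failing data graph.
TRIPLES = [
    ("dg:AdministrativeCase_2", "a", ":AdministrativeCase"),
    ("dg:AdministrativeCase_2", ":hasSubjectPseudoIdentifier",
     "dg:SubjectPseudoIdentifier_AdministrativeCase_1"),
    ("dg:AdministrativeCase_2", ":hasSubjectPseudoIdentifier",
     "dg:SubjectPseudoIdentifier_AdministrativeCase_2"),
]


def max_count_violations(triples, predicate, max_count=1):
    """Return subjects that have more than max_count distinct values for predicate."""
    values = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            values[subject].add(obj)
    return {
        subject: sorted(objects)
        for subject, objects in values.items()
        if len(objects) > max_count
    }


violations = max_count_violations(TRIPLES, ":hasSubjectPseudoIdentifier")
for subject, objects in violations.items():
    print(f"{subject} has {len(objects)} values, at most 1 allowed: {objects}")
```

Removing one of the two hasSubjectPseudoIdentifier triples leaves the result empty, which mirrors the correction applied in the next data graph.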

The following corrected data graph will pass the SHACL validation and will be inserted in the repository:

@prefix : <https://biomedit.ch/rdf/sphn-ontology/sphn#> .
@prefix dg: <https://biomedit.ch/rdf/sphn-ontology/sphn/dataGraphValidation/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
dg:AdministrativeCase_1
   a :AdministrativeCase ;
   :hasSubjectPseudoIdentifier dg:SubjectPseudoIdentifier_AdministrativeCase_1;
   :hasAdministrativeCaseAdmissionDateTime "2021-01-01T00:00:00"^^xsd:dateTime;
   :hasLocation dg:Location_AdministrativeCase_1;
   :hasDateTime "2021-01-01T00:00:00"^^xsd:dateTime;
   :hasIdentifier "AdministrativeCase_1_Identifier";
   :hasAdministrativeCaseCareHandling dg:hasAdministrativeCaseCareHandling_AdministrativeCase_1;
   :hasAdministrativeCaseDischargeDateTime     "2021-01-01T00:00:00"^^xsd:dateTime;
   :hasAdministrativeCaseOriginLocation    dg:Location_AdministrativeCase_1;
   :hasEndDateTime "2021-01-01T00:00:00"^^xsd:dateTime;
   :hasDestinationLocation dg:Location_AdministrativeCase_1;
   :hasOriginLocation  dg:Location_AdministrativeCase_1;
   :hasStartDateTime   "2021-01-01T00:00:00"^^xsd:dateTime;
   :hasDataProviderInstitute dg:DataProviderInstitute_AdministrativeCase_1;
   :hasAdministrativeCaseDischargeLocation dg:Location_AdministrativeCase_1.
dg:SubjectPseudoIdentifier_AdministrativeCase_1 a :SubjectPseudoIdentifier.
dg:Location_AdministrativeCase_1 a :Location.
dg:hasAdministrativeCaseCareHandling_AdministrativeCase_1 a :CareHandling  .
dg:DataProviderInstitute_AdministrativeCase_1 a :DataProviderInstitute .

We can see the following confirmation:

../_images/07_SHACL_validation_passed.png

Step 4: Updating and deleting shape graphs

Go to the SPARQL Editor and delete the SHACL Shape Graph explicitly with the following query:

CLEAR GRAPH <http://rdf4j.org/schema/rdf4j#SHACLShapeGraph>
../_images/08_Remove_SHACL_graph.png

Please note the following restrictions when working with SHACL shape graphs in GraphDB:

  • Clearing the repository with the option “Explore > Graphs overview > Clear repository” does not remove the shape graph.

  • The “replacement of existing data” option in the Import settings does not work for SHACL shapes. SHACL shapes cannot be replaced; instead, the shape graph needs to be deleted as described above.

  • SHACL shapes cannot be accessed with SPARQL inside GraphDB.

How to interpret a SHACL Validation Report?

A SHACL validation process produces a validation report as its result. A validation report for a data graph that satisfies the constraints specified in a shapes graph has the following content:

[   a sh:ValidationReport ;
    sh:conforms true ;
] .

The property sh:conforms with the value true indicates that no constraint violations have occurred. For simplicity, most implementations do not expose a validation report to the end user as long as the data graph conforms to the shapes graph. GraphDB, for instance, shows the message “imported successfully”.

A validation report for a data graph that does not satisfy all the constraints is indicated by sh:conforms false. For each constraint violation, a validation result is added to the report. Each validation result describes which data element violated which constraint.

[   a sh:ValidationReport ;
        sh:conforms false ;
        sh:result [
              a sh:ValidationResult ;
              sh:resultSeverity sh:Violation ;
              sh:focusNode ex:Bob ;
              sh:resultPath ex:age ;
              sh:value "twenty two" ;
              sh:resultMessage "ex:age expects a literal of datatype xsd:integer." ;
              sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
              sh:sourceShape ex:PersonShape-age ;
    ]] .
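When reading such a report, each property of a validation result answers one question: which node (sh:focusNode), which property (sh:resultPath), which value (sh:value), and which shape raised the issue (sh:sourceShape). The sketch below maps the values of the example report onto a small Python structure and prints a one-line summary. It assumes the report has already been parsed into plain strings; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    """Plain-Python mirror of the main properties of a sh:ValidationResult."""
    severity: str       # sh:resultSeverity
    focus_node: str     # sh:focusNode: the data element that was checked
    result_path: str    # sh:resultPath: the property that failed
    value: str          # sh:value: the offending value
    message: str        # sh:resultMessage
    source_shape: str   # sh:sourceShape: the shape that raised the result


def describe(result):
    """Summarise one validation result in a single line."""
    return (
        f"{result.severity}: {result.focus_node} has invalid value "
        f"{result.value!r} for {result.result_path} "
        f"(shape {result.source_shape}): {result.message}"
    )


# Values copied from the example report above.
example = ValidationResult(
    severity="sh:Violation",
    focus_node="ex:Bob",
    result_path="ex:age",
    value="twenty two",
    message="ex:age expects a literal of datatype xsd:integer.",
    source_shape="ex:PersonShape-age",
)
summary = describe(example)
print(summary)
```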

Generate your own SHACL rules with the SHACLer

The SHACLer generates a SHACL file for data validation based on an SPHN-compliant (project) ontology and an optional exception file (available on request). The SHACLer is based on Python 3 and only requires minimal additional packages.

Installation of the SHACLer

Python 3 and the following libraries from requirements.txt need to be installed:

  • plac==1.3.3

  • rdflib==5.0.0

You can run pip install -r requirements.txt to install the required libraries.

The SHACLer is tested with Python 3.6 but is expected to work with most 3.x versions.

Running the SHACLer

To run the application in debug mode (-d) on the ontology file sphn_ontology_2021-1.ttl (in .ttl format) with the exceptions file exceptions.json, and to store the result in the output file shacl.ttl, use this command:

python shacl_generator.py -o 'ttl' -d sphn_ontology_2021-1.ttl -e exceptions.json shacl.ttl

Starting the generator with -h prints all available arguments:

usage: shacl_generator.py [-h] [-o ONTOLOGY_FILE_TYPE]
                        [-s SPHN Official SHACL] [-e None] [-d]
                        ontology shacl_output

Derives SHACL rules for SPHN Ontologies from an ontology file

positional arguments:
ontology              Ontology file to load
shacl_output          The path to the shacl outputfile

optional arguments:
-h, --help            show this help message and exit
-o ONTOLOGY_FILE_TYPE, --ontology-file-type ONTOLOGY_FILE_TYPE
                        Type of the ontology file
-s SPHN Official SHACL, --shapefile-comment SPHN Official SHACL
                        Comment on the ShapeFile
-e None, --exception None
                        Exception file to load
-d, --debug           enable debug mode

This command generates the SHACL rules for the ontology provided as argument.
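When the SHACLer has to be run for several ontologies, the command can also be assembled from a script. The sketch below builds the argument list from the options listed in the usage text above and offers a thin wrapper around subprocess. The function names are hypothetical, and the script is assumed to be located in the current working directory.

```python
import subprocess
import sys


def build_shacler_command(ontology, shacl_output, file_type=None,
                          exceptions=None, debug=False):
    """Assemble the shacl_generator.py call from the documented arguments."""
    command = [sys.executable, "shacl_generator.py"]
    if file_type:
        command += ["-o", file_type]      # type of the ontology file
    if debug:
        command.append("-d")              # enable debug mode
    if exceptions:
        command += ["-e", exceptions]     # exception file to load
    command += [ontology, shacl_output]   # positional arguments come last
    return command


def run_shacler(*args, **kwargs):
    """Run the SHACLer and raise an error if it exits with a failure code."""
    return subprocess.run(build_shacler_command(*args, **kwargs), check=True)


command = build_shacler_command(
    "sphn_ontology_2021-1.ttl",
    "shacl.ttl",
    file_type="ttl",
    exceptions="exceptions.json",
    debug=True,
)
print(" ".join(command[1:]))
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.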