Quality Assurance Framework

The SPHN data quality control tool contains a set of SHACL rules and statistical SPARQL queries to validate the compliance of the RDF data produced.

Statistical SPARQL queries

SPARQL Protocol and RDF Query Language (SPARQL) is a W3C recommendation of a standard language for querying databases and data sources provided in RDF.

In SPHN, a set of statistical SPARQL queries has been developed to gain basic knowledge about the data being queried. These queries are available at https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-ontology/-/tree/master/quality_assurance/statistics.

These statistical queries give qualitative information about:

  • the data coverage of the SPHN ontology (QC00010, QC00020)

  • data elements that are not part of SPHN, if any (QC00030, QC00031)

  • basic summary of the SPHN properties used and their annotated values (QC00012)

  • number of data elements, i.e. number of patients, hospitals, patients per hospital (QC00040, QC00041, QC00042).

For further information on the use and building of the statistical queries, please read the user guide.

Data validation with SHACL

Shapes Constraint Language (SHACL) is a W3C recommendation of a language for validating RDF graphs against a set of conditions. SHACL itself is expressed in RDF; the resulting graph is called the shapes graph, and the elements inside it are called shapes. "Constraint" carries its general meaning of restricting something; in the context of SHACL, a constraint can be understood as a shape or a combination of shapes. SHACL can be written manually or generated.

In SPHN, the SPHN RDF schema is the main ontology. Project ontologies are derived from the SPHN ontology and, in most cases, extend it. To generate useful constraints from an ontology, the ontology needs to contain most of the required information. This is the case for the SPHN RDF schema, and project ontologies are recommended to follow the same convention. This mainly affects the rdfs:domain and rdfs:range definitions of properties and the rdfs:subClassOf relations of classes. In general, the assumptions related to properties and individuals are based on the Closed World Assumption (CWA). Therefore, a property that the ontology does not describe as allowed on a class is also not added for that class in the SHACL graph. Similar reasoning is applied to individuals that are used for meaning binding: if there is an enumerated list of individuals for a class, these become the only allowed individuals of that class.

By using these assumptions we can automatically generate valid and useful SHACL files from the SPHN RDF schema as well as from derived project ontologies. For exceptions that cannot easily be covered by assumptions or definitions in the ontology, an exception mechanism is provided. This is particularly useful for modeling cardinality constraints.

The tool that generates SHACL files from the ontology and an optional exception file is called SHACLer (available on request). It is a single-file Python 3 program that requires only a few additional packages. If you are interested in how to run the SHACLer, see the user guide.

SPHN provides a set of SHACLs (https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-ontology/-/tree/master/quality_assurance/shacl) to validate the compliance of the RDF data produced with the SPHN ontology. This section describes which SHACLs are included in the SPHN set and how they are set up. If you are interested in how to run these SHACLs, see the user guide.

Principle

The SHACLer generates all validation rules as NodeShapes, each centered on one class from the ontology. All range and domain annotations and all individuals are collected from the ontology and stored in internal dictionaries, which carry the information to the SHACL generation step. In detail: to extract the information from the ontology, the generator looks for all owl:ObjectProperty and owl:DatatypeProperty definitions and parses their range and domain specifications. For range specifications it also parses the corresponding rdfs:subClassOf information. This is needed because some properties have an upper-level concept as their domain, which logically implies that the lower-level concepts have the property as well. Although we require RDFS inference for the validation, the upper-level concept may not be instantiable on its own and may therefore be excluded; for this reason we annotate the property at all allowed levels. This also makes each concept easier to read on its own for a human reader.
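The propagation of a property from an upper-level domain class to its subclasses can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SHACLer code: the function names, the dictionary layout, and the ex: properties are hypothetical, and the ontology excerpt is reduced to two plain dictionaries.

```python
from collections import defaultdict


def descendants(cls, subclass_of):
    """Return cls together with all of its direct and indirect subclasses."""
    children = defaultdict(set)
    for child, parents in subclass_of.items():
        for parent in parents:
            children[parent].add(child)
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children[current])
    return seen


def properties_per_class(domains, subclass_of):
    """Annotate every property on its domain classes and on all their subclasses."""
    per_class = defaultdict(set)
    for prop, domain_classes in domains.items():
        for domain_class in domain_classes:
            for cls in descendants(domain_class, subclass_of):
                per_class[cls].add(prop)
    return dict(per_class)


# Hypothetical ontology excerpt: child class -> set of parent classes.
subclass_of = {
    "sphn:HeartRate": {"sphn:Measurement"},
    "sphn:BodyWeight": {"sphn:Measurement"},
}
# Hypothetical domain declarations: property -> set of domain classes.
domains = {
    "ex:hasDateTime": {"sphn:Measurement"},
    "ex:hasRegularity": {"sphn:HeartRate"},
}

shape_properties = properties_per_class(domains, subclass_of)
print(sorted(shape_properties["sphn:HeartRate"]))
```

In this sketch the shape for sphn:HeartRate lists both its own property and the one inherited from sphn:Measurement, so it can be read without looking up the parent class.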

Assumptions

Before we go into details about the SHACL rules implemented for SPHN, here is the list of assumptions taken into account during the building of the constraints:

  • We require that SHACL validation is run with RDFS inference turned on. This is required because ranges refer to some upper-level concepts (e.g. SNOMED CT subtrees).

  • There are no object or data properties other than those defined in the ontology (although there might be further classes with predicates).

  • An rdfs:domain or rdfs:range annotation of an object property indicates that only these properties are allowed on the classes concerned (this also applies to inherited properties).

  • An rdfs:domain of a property pointing to an owl:unionOf list means that the property can be used on instances of any of the list items.

  • An rdfs:range of a property pointing to an owl:unionOf list means that the property always has to end in an instance of one of the referenced classes (or of one of their subclasses).

  • If a class has individuals (instances typed as both owl:NamedIndividual and the class), these individuals become the only allowed instances of that class.

  • owl:equivalentClass properties link SPHN concepts to external terminologies (e.g. SNOMED CT, LOINC). These properties are not picked up and evaluated in the SHACL generation. Although such links are logically valid, and also technically valid when OWL 2 inference is applied, the SHACL rules focus on SPHN concepts.

SHACL constraints implemented for SPHN

Please note that the expressions listed below are not complete shapes graphs. These simplified expressions can be mapped to the corresponding shapes in the shacl.ttl.

SHACL Constraint            Description
--------------------------  ------------------------------------------------------------
sh:closed true              value node has only those properties that have been explicitly enumerated via sh:property
sh:ignoredProperties        properties that are permitted in addition to those explicitly enumerated via sh:property
sh:datatype xsd:dateTime    verifies that a property value has the type xsd:dateTime
sh:datatype xsd:double      verifies that a property value has the type xsd:double
sh:datatype xsd:string      verifies that a property value has the type xsd:string
sh:class … sh:path          range of a property is used correctly, i.e. the class of an instance matches the specified type constraint
sh:maxCount, sh:minCount    checks that the cardinality of a property is applied correctly, e.g. there is just one value for a given property
sh:inversePath rdf:type     only those values are allowed that have been explicitly enumerated in the expression as a type
sh:or … sh:path             values of the specified sh:path need to correspond to one of the explicitly enumerated IRIs
sh:in … sh:inversePath      values need to correspond to explicitly enumerated value lists of individuals
sh:and … sh:or              list of non-instantiable classes, e.g. Measurement

Template of implemented SHACL constraints

Note

Some of the examples shown below are shortened, to improve readability. The original ones can be looked up in the shacl.ttl.

There are three different node shape patterns. The first and most frequently used one combines Cardinality constraints, Restriction on classes, and Literal type constraints. Restricting on individuals/instances and Non instantiable classes are the two other implemented patterns.

Cardinality constraints

Some properties have a specific cardinality, i.e. a restriction on how often the property can be used with a certain entity. The cardinalities defined in SPHN are documented separately. They include information on the links connecting each SPHN concept to patient (via sphn:hasSubjectPseudoIdentifier), provider (via sphn:hasDataProviderInstitute), and case (via sphn:hasAdministrativeCase).

One example of the application of these constraints is the property sphn:hasSubjectPseudoIdentifier. Entities are allowed to have at most one SubjectPseudoIdentifier. This rule is expressed by the following SHACL constraints:

constraints:Biobanksample a sh:NodeShape ;
sh:closed true ;
sh:ignoredProperties ( rdf:type ) ;
sh:property [ sh:class sphn:Biosample ;
        sh:path sphn:hasBiosample ],
    [ sh:class sphn:SubjectPseudoIdentifier ;
        sh:maxCount 1 ;
        sh:minCount 0 ;
        sh:path sphn:hasSubjectPseudoIdentifier ];
sh:targetClass sphn:Biobanksample .

We can interpret this rule as follows: for all instances of the class sphn:Biobanksample, the property sphn:hasSubjectPseudoIdentifier can be omitted (sh:minCount 0) or used at most once (sh:maxCount 1).
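The counting behind sh:minCount and sh:maxCount can be illustrated with plain Python. This is a minimal sketch, not a SHACL engine: the function name cardinality_ok, the tuple representation of triples, and the ex: identifiers are assumptions made for the example.

```python
def cardinality_ok(triples, subject, prop, min_count=0, max_count=None):
    """Check that subject has between min_count and max_count distinct values for prop."""
    values = {o for s, p, o in triples if s == subject and p == prop}
    if len(values) < min_count:
        return False
    return max_count is None or len(values) <= max_count


# Hypothetical data: ex:sample2 points to two different pseudo identifiers.
prop = "sphn:hasSubjectPseudoIdentifier"
triples = {
    ("ex:sample1", prop, "ex:patient1"),
    ("ex:sample2", prop, "ex:patient1"),
    ("ex:sample2", prop, "ex:patient2"),
}

for sample in ("ex:sample1", "ex:sample2", "ex:sample3"):
    print(sample, cardinality_ok(triples, sample, prop, min_count=0, max_count=1))
```

With minCount 0 and maxCount 1, ex:sample1 (one value) and ex:sample3 (no value) pass, while ex:sample2 (two values) does not.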

Restriction on classes

A common pattern is the restriction of a property to classes, i.e. a certain property has to refer to an instance of a specific class or of one of a specific set of classes. One example where this constraint is required is the property sphn:hasTimePatternTypeCode for instances of the class sphn:TimePattern. These constraints are expressed as follows:

constraints:TimePattern a sh:NodeShape ;
sh:closed true ;
sh:ignoredProperties ( rdf:type ) ;
sh:property [ sh:or ( [ sh:class <http://snomed.info/id/255238004> ] [ sh:class <http://snomed.info/id/385432009> ] [ sh:class <http://snomed.info/id/7087005> ] [ sh:class sphn:Code ] ) ;
        sh:path sphn:hasTimePatternTypeCode ] ;
sh:targetClass sphn:TimePattern .

The above constraints can be interpreted as follows: For all instances of the class sphn:TimePattern, it must hold that the property sphn:hasTimePatternTypeCode refers to an instance of at least one of the enumerated classes. This is ensured by the usage of the SHACL expression sh:or which lists all accepted classes.

Literal type constraints

Besides the object properties, for which Restrictions on classes are used, there are also data properties. On data properties we can restrict the possible datatypes using Literal type constraints. In the class sphn:Code, three of them are in use. For the properties sphn:hasCodeCodingSystemAndVersion, sphn:hasIdentifier, and sphn:hasCodeName, the SHACL file validates that the literal used is of type xsd:string.

constraints:Code a sh:NodeShape ;
   sh:closed true ;
   sh:ignoredProperties ( rdf:type ) ;
   sh:property [ sh:datatype xsd:string ;
            sh:path sphn:hasCodeCodingSystemAndVersion ],
      [ sh:datatype xsd:string ;
            sh:path sphn:hasIdentifier ],
      [ sh:datatype xsd:string ;
            sh:path sphn:hasCodeName ] ;
   sh:targetClass sphn:Code .

The interpretation of the above constraint is: whenever the property sphn:hasCodeName is used in an instance of sphn:Code, the object needs to be a literal of type xsd:string. The same holds for the other two properties.

Restricting on individuals/instances

There exist cases where it is forbidden to create new instances of a class, but only already existing so-called individuals (instances) are allowed. This constraint is, for instance, applied on entities of the type sphn:Biosample_fixationType as shown in the following:

constraints:Biosample_fixationType a sh:NodeShape ;
sh:closed true ;
sh:ignoredProperties ( rdf:type ) ;
sh:property [ sh:in ( sphn:PAXgeneTissue sphn:AlcoholBased sphn:Other sphn:AldehydeBased sphn:NonaldehydeWithAceticAcid sphn:VacuumTechnologyStabilization sphn:AllprotectTissueReagent sphn:NonbufferedFormalin sphn:OptimumCuttingTemperatureMedium sphn:NonaldehydeBasedWithoutAceticAcid sphn:UNK sphn:HeatStabilization sphn:SnapFreezing sphn:RNALater sphn:NeutralBufferedFormalin ) ;
        sh:path [ sh:inversePath rdf:type ] ] ;
sh:targetClass sphn:Biosample_fixationType .

This SHACL constraint ensures that only the explicitly enumerated individuals are used as instances of the class sphn:Biosample_fixationType. In addition, by means of the inverse path constraint sh:inversePath rdf:type, it forbids that new entities are derived as subclasses.
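The intent of this pattern, namely that nothing outside the enumerated list may be typed with the class, can be expressed as a simple membership check. The sketch below mirrors that stated intent only and does not reproduce how a SHACL engine evaluates the shape; the function name, the shortened list of individuals, and ex:MyOwnFixation are assumptions for the example.

```python
def unlisted_instances(triples, cls, allowed):
    """Return all subjects typed as cls that are not among the allowed individuals."""
    return sorted(
        s for s, p, o in triples
        if p == "rdf:type" and o == cls and s not in allowed
    )


cls = "sphn:Biosample_fixationType"
# Shortened list of the individuals enumerated in sh:in.
allowed = {"sphn:SnapFreezing", "sphn:RNALater", "sphn:UNK"}

# Hypothetical data: ex:MyOwnFixation is a newly invented instance.
triples = [
    ("sphn:SnapFreezing", "rdf:type", cls),
    ("ex:MyOwnFixation", "rdf:type", cls),
    ("ex:sample1", "rdf:type", "sphn:Biosample"),
]

print(unlisted_instances(triples, cls, allowed))  # ['ex:MyOwnFixation']
```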

Non instantiable classes

Some classes from the ontology are not allowed to be instantiated on their own; only their subclasses can be instantiated. One example is the :Measurement class. This rule can be expressed with the following SHACL constraints:

  constraints:Measurement a sh:NodeShape ;
  sh:and ( [ sh:property [ sh:hasValue sphn:Measurement ;
                    sh:path rdf:type ] ] [ sh:or ( [ sh:property [ sh:hasValue sphn:HeartRate ;
                                sh:path rdf:type ] ] [ sh:property [ sh:hasValue sphn:OxygenSaturation ;
                                sh:path rdf:type ] ] [ sh:property [ sh:hasValue sphn:BodyWeight ;
                                sh:path rdf:type ] ] [ sh:property [ sh:hasValue sphn:CircumferenceMeasure ;
                                sh:path rdf:type ] ] [ sh:property [ sh:hasValue sphn:RespiratoryRate ;
                                sh:path rdf:type ] ] [ sh:property [ sh:hasValue sphn:BodyTemperature ;
                                sh:path rdf:type ] ] [ sh:property [ sh:hasValue sphn:BodyHeight ;
                                sh:path rdf:type ] ] [ sh:property [ sh:hasValue sphn:SystemicArterialBloodPressure ;
                                sh:path rdf:type ] ] [ sh:property [ sh:hasValue sphn:CentralVenousPressure ;
                                sh:path rdf:type ] ] ) ] ) ;
sh:closed false ;
sh:targetClass sphn:Measurement .

We can interpret this rule as follows: for every entity of the class :Measurement, it must hold that the entity is also of one of the explicitly enumerated classes. If the entity is only of class :Measurement, it is invalid.
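The sh:and / sh:or combination boils down to a check on the set of rdf:type values of an entity. The following sketch assumes that the types have already been collected (with RDFS inference applied, an sphn:HeartRate instance also carries the type sphn:Measurement); the function name and the shortened subclass list are illustrative.

```python
def instantiation_ok(types, abstract_class, subclasses):
    """Accept an entity of abstract_class only if it also has one of the listed subclasses."""
    if abstract_class not in types:
        return True  # the rule only concerns entities of the abstract class
    return any(t in subclasses for t in types)


abstract_class = "sphn:Measurement"
# Shortened list of the subclasses enumerated in the shape.
subclasses = {"sphn:HeartRate", "sphn:BodyWeight", "sphn:BodyHeight"}

print(instantiation_ok({"sphn:Measurement", "sphn:HeartRate"}, abstract_class, subclasses))  # True
print(instantiation_ok({"sphn:Measurement"}, abstract_class, subclasses))  # False
```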

Implementation examples

Class Example

constraints:Unit a sh:NodeShape ;
    sh:closed true ;
    sh:ignoredProperties ( rdf:type ) ;
    sh:or ( [ sh:class :Code ;
                sh:path :hasCode ] [ sh:class :Terminology ;
                sh:path :hasCode ] ) ;
    sh:property [ sh:class <https://biomedit.ch/rdf/sphn-resource/ucum/UCUM> ;
            sh:path :hasUnitCode ],
        [ sh:path :hasCode ] ;
    sh:targetClass :Unit .

The NodeShape shown here is generated through various parts of the ontology. From bottom to the top:

  • there is a class :Unit in the Ontology (last Line: sh:targetClass :Unit)

  • the properties :hasCode and :hasUnitCode have :Unit in their domain specification (sh:property and following)

  • the property :hasUnitCode has the UCUM class from the Terminologies in its range (sh:property and following)

  • the property :hasCode has the :Terminology and :Code classes in the range (sh:or and following lines). The two target classes will have NodeShapes on their own.

  • the rdf:type is ignored unless explicitly specified

  • the shape is closed (sh:closed true) to define there are no other properties allowed.
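The effect of sh:closed true together with sh:ignoredProperties can be pictured as a set difference between the properties a node actually uses and the properties the shape enumerates. The snippet is a sketch under that reading; the function name, the tuple representation of triples, and the ex: identifiers are assumptions for the example.

```python
def closed_shape_violations(triples, node, allowed, ignored=("rdf:type",)):
    """Return the properties used on node that are neither enumerated nor ignored."""
    used = {p for s, p, _ in triples if s == node}
    return sorted(used - set(allowed) - set(ignored))


# Properties enumerated via sh:property in the :Unit shape.
allowed = {":hasCode", ":hasUnitCode"}

# Hypothetical data: ex:hasComment is not part of the shape.
triples = [
    ("ex:unit1", "rdf:type", ":Unit"),
    ("ex:unit1", ":hasUnitCode", "ex:ucum_mg"),
    ("ex:unit1", "ex:hasComment", "free text"),
]

print(closed_shape_violations(triples, "ex:unit1", allowed))  # ['ex:hasComment']
```

rdf:type does not show up as a violation because it is listed in the ignored properties, just as in the shape.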

Meaning Binding / Individual Example

constraints:OncologyTreatmentAssessment_result a sh:NodeShape ;
    sh:closed true ;
    sh:ignoredProperties ( rdf:type ) ;
    sh:property [ sh:in ( :CompleteResponse :StableDisease :Unknown :ProgressiveDisease :PartialResponse ) ;
            sh:inversePath rdf:type ] ;
    sh:targetClass :OncologyTreatmentAssessment_result .

A Meaning Binding or Individual also results in a NodeShape, as shown just above. From bottom to the top:

  • there is a class :OncologyTreatmentAssessment_result in the Ontology (last Line: sh:targetClass :OncologyTreatmentAssessment_result)

  • the inverse path sh:inversePath rdf:type means that all instances of the class :OncologyTreatmentAssessment_result have to be in the sh:in list. Only :CompleteResponse, :StableDisease, :Unknown, :ProgressiveDisease and :PartialResponse are allowed

  • the rdf:type is ignored unless explicitly specified

  • the shape is closed (sh:closed true) to define there are no other properties allowed.

Exceptions

A file using the JSON syntax is provided for handling exceptions.

The element that is searched for is “exceptions” which holds an array of exceptions. There are different types of exceptions, each with their own but similar elements. Exceptions are applied only when the specified element exists. For example, if the property does not exist at a certain class, then the cardinality exception will not be applied.

Exceptions are also applied by specificity, meaning that the more specific exception overrides the less specific one. This is used in the cardinality constraints by setting a default for, e.g., hasSubjectPseudoIdentifier with no class referenced, and then adding another cardinality constraint that references a certain class. The second one is then applied to that class, and the first one is applied to all others.

Cardinality Exception

To override any default cardinality on a specific property you can use the following constraint syntax:

{
    "type" : "cardinality",
    "property" : "https://biomedit.ch/rdf/sphn-ontology/sphn#hasSubjectPseudoIdentifier",
    "class" : null,
    "minCount" : 1,
    "maxCount" : 1
}

The class element can also point to the IRI of a specific class, so that the exception is applied to the property only on that class.
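The specificity rule for cardinality exceptions can be sketched as a lookup that prefers an entry naming the class over an entry with a null class. This is an illustration of the described behaviour, not the SHACLer implementation: the function resolve_cardinality is hypothetical, and the class-specific entry for Biobanksample is made up for the example.

```python
import json

SPHN = "https://biomedit.ch/rdf/sphn-ontology/sphn#"

# Exception file content: a generic default plus a hypothetical class-specific override.
config = json.loads("""
{
  "exceptions": [
    {"type": "cardinality", "property": "%(ns)shasSubjectPseudoIdentifier",
     "class": null, "minCount": 1, "maxCount": 1},
    {"type": "cardinality", "property": "%(ns)shasSubjectPseudoIdentifier",
     "class": "%(ns)sBiobanksample", "minCount": 0, "maxCount": 1}
  ]
}
""" % {"ns": SPHN})


def resolve_cardinality(exceptions, prop, cls):
    """Return (minCount, maxCount) for prop on cls, preferring a class-specific entry."""
    generic = None
    for exc in exceptions:
        if exc.get("type") != "cardinality" or exc.get("property") != prop:
            continue
        if exc.get("class") == cls:
            return exc["minCount"], exc["maxCount"]
        if exc.get("class") is None:
            generic = (exc["minCount"], exc["maxCount"])
    return generic


prop = SPHN + "hasSubjectPseudoIdentifier"
print(resolve_cardinality(config["exceptions"], prop, SPHN + "Biobanksample"))  # (0, 1)
print(resolve_cardinality(config["exceptions"], prop, SPHN + "HeartRate"))      # (1, 1)
```

A property without any matching entry yields None, which a caller could treat as "keep whatever cardinality was generated by default".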

Not Instantiable Class

Some classes present in the ontology may not be instantiated on their own. Only their subclasses are instantiable. This exception can be written down with the following statement (example of the :Measurement class):

{
    "type" : "notInstantiableClass",
    "class" : "https://biomedit.ch/rdf/sphn-ontology/sphn#Measurement"
}

Not Instantiable Property

Some properties present in the ontology may not be instantiated on their own. Only their subproperties are instantiable. This exception can be written down with the following statement (example of the :hasValue property):

{
    "type" : "notInstantiableProperty",
    "property" : "https://biomedit.ch/rdf/sphn-ontology/sphn#hasValue",
    "class" : null
}

Range Extension

Sometimes the ranges of properties (in general or in combination with specific domain classes) need to be extended in the validation. This can be done with the Range Extension. Setting the class element to null makes the extension generic, i.e. it applies to the property on all classes.

{
    "type" : "rangeExtension",
    "property" : "https://biomedit.ch/rdf/sphn-ontology/sphn#hasLabResultLabTestCode",
    "class" :  "https://biomedit.ch/rdf/sphn-ontology/sphn#LabResult",
    "extendedRange" :  "https://biomedit.ch/rdf/sphn-ontology/sphn#Code"
}
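How a range extension might be merged into the ranges collected from the ontology is sketched below. The data layout (a dictionary keyed by class and property), the function extend_ranges, and the placeholder range ex:SomeTerminology are assumptions; the original range of the property in the ontology is not shown here.

```python
SPHN = "https://biomedit.ch/rdf/sphn-ontology/sphn#"


def extend_ranges(ranges, exceptions):
    """Return a copy of ranges with every matching rangeExtension added."""
    extended = {key: set(value) for key, value in ranges.items()}
    for exc in exceptions:
        if exc.get("type") != "rangeExtension":
            continue
        for (cls, prop), allowed in extended.items():
            # A null class makes the extension apply to the property on every class.
            if prop == exc["property"] and exc.get("class") in (None, cls):
                allowed.add(exc["extendedRange"])
    return extended


# Hypothetical ranges collected from the ontology: (class, property) -> allowed classes.
ranges = {
    (SPHN + "LabResult", SPHN + "hasLabResultLabTestCode"): {"ex:SomeTerminology"},
    ("ex:OtherClass", SPHN + "hasLabResultLabTestCode"): {"ex:SomeTerminology"},
}
exceptions = [{
    "type": "rangeExtension",
    "property": SPHN + "hasLabResultLabTestCode",
    "class": SPHN + "LabResult",
    "extendedRange": SPHN + "Code",
}]

extended = extend_ranges(ranges, exceptions)
print(sorted(extended[(SPHN + "LabResult", SPHN + "hasLabResultLabTestCode")]))
```

Only the entry for LabResult gains the Code class here; with "class" set to null, both entries would be extended.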

Availability and usage rights

© Copyright 2021, Personalized Health Informatics Group (PHI), SIB Swiss Institute of Bioinformatics

The SPHN Quality Framework is available at https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-ontology/-/tree/master/quality_assurance. The SPHN Quality Framework is licensed under the CC BY-NC-SA 4.0 License. The SHACLer is licensed under the GPLv3 and is available on request at dcc@sib.swiss. For any question or comment, please contact the SPHN Data Coordination Center (DCC) at dcc@sib.swiss.