Quality Assurance Framework
The SPHN data quality control tool contains a set of SHACL rules and statistical SPARQL queries to validate the compliance of the RDF data produced.
Statistical SPARQL queries
SPARQL Protocol and RDF Query Language (SPARQL) is a W3C recommendation of a standard language for querying databases and data sources provided in RDF.
In SPHN, a set of statistical SPARQL queries has been developed to gain basic knowledge about the data being queried. These queries are available at https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-ontology/-/tree/master/quality_assurance/statistics.
These statistical queries give qualitative information about:
- data elements that are not part of SPHN, if any (QC00030, QC00031)
- a basic summary of the SPHN properties used and their annotated values (QC00012)
- the number of data elements, i.e. number of patients, hospitals, and patients per hospital (QC00040, QC00041, QC00042).
For further information on the use and building of the statistical queries, please read the user guide.
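To make the counting queries more concrete, here is a minimal Python sketch of the kind of numbers QC00040 to QC00042 report. It is not the SPARQL shipped with SPHN: it works on an in-memory list of (subject, predicate, object) tuples, and it assumes that patients and hospitals are reached through sphn:hasSubjectPseudoIdentifier and sphn:hasDataProviderInstitute. The function name summary_counts and all resource names in the usage note are invented for illustration.

```python
from collections import defaultdict

# Assumed linking properties; the real queries may traverse the graph differently.
PATIENT = "sphn:hasSubjectPseudoIdentifier"
PROVIDER = "sphn:hasDataProviderInstitute"


def summary_counts(triples):
    """Count distinct patients, providers, and patients per provider."""
    patient_of = defaultdict(set)    # resource -> patients linked to it
    provider_of = defaultdict(set)   # resource -> providers linked to it
    for s, p, o in triples:
        if p == PATIENT:
            patient_of[s].add(o)
        elif p == PROVIDER:
            provider_of[s].add(o)

    # A patient belongs to a provider if some resource links to both.
    patients_per_provider = defaultdict(set)
    for resource, providers in provider_of.items():
        for provider in providers:
            patients_per_provider[provider] |= patient_of.get(resource, set())

    return {
        "patients": len({x for v in patient_of.values() for x in v}),
        "providers": len(patients_per_provider),
        "patients_per_provider": {
            k: len(v) for k, v in patients_per_provider.items()
        },
    }
```

Calling summary_counts on a list of triples returns a dictionary with the three counts; on real data the SPARQL queries from the repository remain the reference.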
Data validation with SHACL
Shapes Constraint Language (SHACL) is a W3C recommendation of a language for validating RDF graphs against a set of conditions. SHACL itself is expressed in RDF; the resulting graph is called the shapes graph, and the elements inside it are called shapes. "Constraint" refers to the general meaning of restricting something; in the context of SHACL, a constraint can be understood as a shape or a combination of shapes. SHACL can be written manually or generated.
In SPHN, the SPHN RDF schema is the main ontology. Project ontologies are derived from the SPHN ontology and, in most cases, extend it. To generate useful constraints from an ontology, the ontology itself needs to contain most of the required information. This is the case for the SPHN RDF schema, and project ontologies are recommended to follow the same convention. This mainly affects the rdfs:domain and rdfs:range definitions of properties and the rdfs:subClassOf relations of classes. In general, the assumptions related to properties and individuals are based on the Closed World Assumption (CWA). Therefore, every property that the ontology does not describe as allowed for a class is also not added to the SHACL graph for that class. A similar reasoning is applied to individuals that are used as meaning bindings: if there is an enumerated list of individuals for a class, these become the only allowed individuals of that class.
By using these assumptions we can automatically generate valid and useful SHACL files from the SPHN RDF schema as well as from derived project ontologies. For exceptions that cannot easily be covered by assumptions or definitions in the ontology, an exception mechanism is provided. This is particularly useful for modeling cardinality constraints.
The tool that generates SHACL files from the ontology and an optional exception file is called SHACLer (available on request).
It is a single-file Python 3 program that requires only a minimal set of additional packages. For instructions on how to run the SHACLer, see the user guide.
SPHN provides a set of SHACLs (https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-ontology/-/tree/master/quality_assurance/shacl) to validate the compliance of the RDF data produced with the SPHN ontology. This section describes which SHACLs are included in the SPHN set and how they are set up. For instructions on how to run these SHACLs, see the user guide.
Principle
The SHACLer generates all validation rules based on NodeShapes centric to a class from the ontology. All range and domain annotations and individuals are collected based on the ontology. All information is stored in internal dictionaries to transport the information to the SHACL generation.
In detail: to extract the information from the ontology, the generator looks for all owl:ObjectProperty and owl:DatatypeProperty declarations and parses their range and domain specifications. For range specifications it also parses the corresponding rdfs:subClassOf information. This is needed because some properties have an upper-level concept as their domain, which logically implies that the lower-level concepts have the property as well. Although we require RDFS inference for the validation, the upper-level concept may not be instantiable on its own and may be excluded; therefore, we annotate the property at all allowed levels. This also improves readability on a per-concept basis for a human reader.
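The collection step described above can be pictured with a small Python sketch. It is not the SHACLer source, which is only available on request. It assumes the ontology has already been reduced to two plain dictionaries, and that each class has at most one parent, which is a simplification. The class and property assignments are invented for illustration, and ex:hasExampleValue is a purely hypothetical property.

```python
from collections import defaultdict

# Hypothetical, simplified ontology extract (not the real SPHN schema).
SUBCLASS_OF = {
    "sphn:HeartRate": "sphn:Measurement",
    "sphn:BodyWeight": "sphn:Measurement",
}
PROPERTY_DOMAINS = {
    "sphn:hasSubjectPseudoIdentifier": ["sphn:Measurement", "sphn:Biosample"],
    "ex:hasExampleValue": ["sphn:Measurement"],
}


def descendants(cls, subclass_of):
    """Return cls together with all of its direct and indirect subclasses."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in subclass_of.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def properties_per_class(property_domains, subclass_of):
    """Map every class to the sorted list of properties allowed on it.

    A property declared on an upper-level class is also recorded on each
    subclass, so every class carries its own complete property list.
    """
    allowed = defaultdict(set)
    for prop, domains in property_domains.items():
        for domain in domains:
            for cls in descendants(domain, subclass_of):
                allowed[cls].add(prop)
    return {cls: sorted(props) for cls, props in allowed.items()}
```

With this input, properties_per_class(PROPERTY_DOMAINS, SUBCLASS_OF) lists both properties for sphn:HeartRate even though neither names it directly in its domain, which mirrors the idea of annotating a property at all allowed levels.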
Assumptions
Before we go into details about the SHACL rules implemented for SPHN, here is the list of assumptions taken into account during the building of the constraints:
- We require that SHACL validation is run with RDFS inference turned on. This is required because ranges refer to some upper-level concepts (e.g. SNOMED CT subtrees).
- There are no further ObjectProperties/DataProperties than the ones that are defined in the ontology (although there might be further classes with predicates).
- An rdfs:domain or rdfs:range annotation of an object property indicates that only these properties are allowed in the classes (this also applies to inherited properties).
- An rdfs:domain of a property pointing to an owl:unionOf list means that the property can be used in instances of any of the list items.
- An rdfs:range of a property pointing to an owl:unionOf list means that the property always has to end in an instance of one of the referenced classes (or of one of their subclasses).
- If there are individuals (instances of both owl:NamedIndividual and a class), we make these individuals the only allowed instances of that class.
- owl:equivalentClass properties link SPHN concepts to other external terminologies (e.g. SNOMED CT, LOINC). These properties are not picked up and evaluated in the SHACL generation. Although such links are logically valid, and with OWL 2 inference also technically valid, the SHACL rules focus on SPHN concepts.
SHACL constraints implemented for SPHN
Please note that the expressions listed below are not complete shapes graphs. These simplified expressions can be mapped to the corresponding shapes in the shacl.ttl.
| SHACL Constraint | Description |
|---|---|
| sh:closed true | value node has only those properties that have been explicitly enumerated via sh:property |
| sh:ignoredProperties | properties that are also permitted in addition to those explicitly enumerated via sh:property |
| sh:datatype xsd:dateTime | verifies that a property value has the type xsd:dateTime |
| sh:datatype xsd:double | verifies that a property value has the type xsd:double |
| sh:datatype xsd:string | verifies that a property value has the type xsd:string |
| sh:class … sh:path | range of a property is used correctly, i.e. the class of an instance matches the specified type constraint |
| sh:maxCount, sh:minCount | checks that the cardinality of a property is applied correctly, e.g. there is just one value for a given property |
| sh:inversePath rdf:type | only those values are allowed that have been explicitly enumerated in the expression as a type |
| sh:or … sh:path | values of the specified sh:path need to correspond to one of the explicitly enumerated IRIs |
| sh:in … sh:inversePath | values need to correspond to the explicitly enumerated lists of individuals |
| sh:and … sh:or | list of non-instantiable classes, e.g. Measurement |
| … | … |
Template of implemented SHACL constraints
Note
Some of the examples shown below are shortened, to improve readability. The original ones can be looked up in the shacl.ttl.
There exist three different node shape patterns. The first and most used one combines Cardinality constraints, Restriction on classes, and Literal type constraints. Restricting on individuals/instances and Non instantiable classes are the two other implemented patterns.
- Cardinality constraints

There exist properties with a specific cardinality, which means that there is a restriction on how often a property can be used with a certain entity. The cardinalities defined in SPHN can be found here. They include information on links connecting each SPHN concept to patient (via sphn:hasSubjectPseudoIdentifier), provider (via sphn:hasDataProviderInstitute), and case (via sphn:hasAdministrativeCase).

One example of the application of these constraints is the property sphn:hasSubjectPseudoIdentifier. Entities are allowed to have at most one SubjectPseudoIdentifier. This rule is expressed by the following SHACL constraints:

    constraints:Biobanksample a sh:NodeShape ;
        sh:closed true ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:class sphn:Biosample ;
                sh:path sphn:hasBiosample ],
            [ sh:class sphn:SubjectPseudoIdentifier ;
                sh:maxCount 1 ;
                sh:minCount 0 ;
                sh:path sphn:hasSubjectPseudoIdentifier ] ;
        sh:targetClass sphn:Biobanksample .

We can interpret this rule as follows: For all instances of the class sphn:Biobanksample, the property sphn:hasSubjectPseudoIdentifier can be used either not at all (sh:minCount 0) or at most once (sh:maxCount 1).

- Restriction on classes
A common pattern is a restriction on the class of a property's values: a certain property has to refer to an instance of a specific class or of one of a specific set of classes. One example where this constraint is required is the property sphn:hasTimePatternTypeCode for instances of the class sphn:TimePattern. These constraints are expressed as follows:

    constraints:TimePattern a sh:NodeShape ;
        sh:closed true ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:or ( [ sh:class <http://snomed.info/id/255238004> ]
                    [ sh:class <http://snomed.info/id/385432009> ]
                    [ sh:class <http://snomed.info/id/7087005> ]
                    [ sh:class sphn:Code ] ) ;
                sh:path sphn:hasTimePatternTypeCode ] ;
        sh:targetClass sphn:TimePattern .

The above constraints can be interpreted as follows: For all instances of the class sphn:TimePattern, it must hold that the property sphn:hasTimePatternTypeCode refers to an instance of at least one of the enumerated classes. This is ensured by the SHACL expression sh:or, which lists all accepted classes.

- Literal type constraints
Besides the object properties where Restrictions on classes are used, there also exist data properties. On data properties we have the option to restrict the possible datatypes using Literal type constraints. In the class sphn:Code, three of them are in use. On the properties sphn:hasCodeCodingSystemAndVersion, sphn:hasIdentifier, and sphn:hasCodeName, the SHACL file validates that the literal used is of type xsd:string.

    constraints:Code a sh:NodeShape ;
        sh:closed true ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:datatype xsd:string ;
                sh:path sphn:hasCodeCodingSystemAndVersion ],
            [ sh:datatype xsd:string ;
                sh:path sphn:hasIdentifier ],
            [ sh:datatype xsd:string ;
                sh:path sphn:hasCodeName ] ;
        sh:targetClass sphn:Code .

The interpretation of the above constraint is: whenever the property sphn:hasCodeName is used in an instance of sphn:Code, the object needs to be a literal of type xsd:string.

- Restricting on individuals/instances
There exist cases where it is forbidden to create new instances of a class, and only already existing, so-called individuals (instances) are allowed. This constraint is, for instance, applied on entities of the type sphn:Biosample_fixationType, as shown in the following:

    constraints:Biosample_fixationType a sh:NodeShape ;
        sh:closed true ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:in ( sphn:PAXgeneTissue sphn:AlcoholBased sphn:Other
                    sphn:AldehydeBased sphn:NonaldehydeWithAceticAcid
                    sphn:VacuumTechnologyStabilization sphn:AllprotectTissueReagent
                    sphn:NonbufferedFormalin sphn:OptimumCuttingTemperatureMedium
                    sphn:NonaldehydeBasedWithoutAceticAcid sphn:UNK
                    sphn:HeatStabilization sphn:SnapFreezing sphn:RNALater
                    sphn:NeutralBufferedFormalin ) ;
                sh:path [ sh:inversePath rdf:type ] ] ;
        sh:targetClass sphn:Biosample_fixationType .

This SHACL constraint ensures that only the explicitly enumerated individuals are used as instances of the class sphn:Biosample_fixationType. In addition, it forbids, by means of an inversePath constraint (sh:inversePath rdf:type), that new entities are derived as subclasses.

- Non instantiable classes
Some classes from the ontology are not allowed to be instantiated on their own; only their subclasses can be. One example is the :Measurement class. This rule can be expressed with the following SHACL constraints:

    constraints:Measurement a sh:NodeShape ;
        sh:and ( [ sh:property [ sh:hasValue sphn:Measurement ; sh:path rdf:type ] ]
            [ sh:or (
                [ sh:property [ sh:hasValue sphn:HeartRate ; sh:path rdf:type ] ]
                [ sh:property [ sh:hasValue sphn:OxygenSaturation ; sh:path rdf:type ] ]
                [ sh:property [ sh:hasValue sphn:BodyWeight ; sh:path rdf:type ] ]
                [ sh:property [ sh:hasValue sphn:CircumferenceMeasure ; sh:path rdf:type ] ]
                [ sh:property [ sh:hasValue sphn:RespiratoryRate ; sh:path rdf:type ] ]
                [ sh:property [ sh:hasValue sphn:BodyTemperature ; sh:path rdf:type ] ]
                [ sh:property [ sh:hasValue sphn:BodyHeight ; sh:path rdf:type ] ]
                [ sh:property [ sh:hasValue sphn:SystemicArterialBloodPressure ; sh:path rdf:type ] ]
                [ sh:property [ sh:hasValue sphn:CentralVenousPressure ; sh:path rdf:type ] ] ) ] ) ;
        sh:closed false ;
        sh:targetClass sphn:Measurement .

We can interpret this rule as follows: For all instances of the class :Measurement, it must hold that if an entity is of the class :Measurement, that entity is also of one of the explicitly enumerated classes. If the entity is only of class :Measurement, it is invalid.
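To illustrate what two of these patterns check, the following Python sketch mimics the cardinality rule and the non-instantiable class rule on an in-memory list of (subject, predicate, object) tuples. It is only an approximation under stated assumptions: a real validation runs a SHACL engine against the shacl.ttl with RDFS inference enabled, the function names are invented here, and the data is assumed to already contain the inferred rdf:type statements.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def max_count_violations(triples, target_class, path, max_count):
    """Return instances of target_class that use `path` more than max_count times."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == target_class}
    usage = Counter(s for s, p, o in triples if p == path and s in instances)
    return sorted(s for s, n in usage.items() if n > max_count)


def not_instantiable_violations(triples, abstract_class, allowed_subclasses):
    """Return subjects typed as abstract_class but as none of the listed subclasses."""
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    allowed = set(allowed_subclasses)
    return sorted(
        s for s, ts in types.items()
        if abstract_class in ts and not (ts & allowed)
    )
```

A subject that carries two sphn:hasSubjectPseudoIdentifier values would be reported by max_count_violations with max_count=1, and a subject typed only as sphn:Measurement would be reported by not_instantiable_violations.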
Implementation examples
Class Example
    constraints:Unit a sh:NodeShape ;
        sh:closed true ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:or ( [ sh:class :Code ;
                    sh:path :hasCode ] [ sh:class :Terminology ;
                    sh:path :hasCode ] ) ;
        sh:property [ sh:class <https://biomedit.ch/rdf/sphn-resource/ucum/UCUM> ;
                sh:path :hasUnitCode ],
            [ sh:path :hasCode ] ;
        sh:targetClass :Unit .
The NodeShape shown here is generated from various parts of the ontology. From bottom to top:
- there is a class :Unit in the ontology (last line: sh:targetClass :Unit)
- the properties :hasCode and :hasUnitCode have :Unit in their domain specification (sh:property and following)
- the property :hasUnitCode has the UCUM class from the terminologies in its range (sh:property and following)
- the property :hasCode has the :Terminology and :Code classes in its range (sh:or and following lines). The two target classes will have NodeShapes of their own.
- the rdf:type is ignored unless explicitly specified
- the shape is closed (sh:closed true) to define that no other properties are allowed.
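As a rough picture of how collected domain and range information can be turned into such a NodeShape, the sketch below renders Turtle text from a plain dictionary. It is an illustration only and not how the SHACLer is implemented. The function name render_node_shape is hypothetical, prefix declarations are left out, and multiple range classes are placed in an sh:or inside the property shape (the layout of the TimePattern template), which differs slightly from the node-level sh:or of the :Unit shape.

```python
def render_node_shape(target_class, property_ranges):
    """Render a closed NodeShape for target_class as Turtle text.

    property_ranges maps each property allowed on the class to the list of
    classes in its range; several classes are combined with sh:or.
    """
    local_name = target_class.split(":")[-1]
    property_shapes = []
    for path, ranges in sorted(property_ranges.items()):
        if len(ranges) == 1:
            constraint = f"sh:class {ranges[0]}"
        else:
            alternatives = " ".join(f"[ sh:class {r} ]" for r in sorted(ranges))
            constraint = f"sh:or ( {alternatives} )"
        property_shapes.append(f"[ {constraint} ;\n            sh:path {path} ]")

    lines = [
        f"constraints:{local_name} a sh:NodeShape ;",
        "    sh:closed true ;",
        "    sh:ignoredProperties ( rdf:type ) ;",
    ]
    if property_shapes:  # skip the predicate entirely when nothing is allowed
        lines.append("    sh:property " + ",\n        ".join(property_shapes) + " ;")
    lines.append(f"    sh:targetClass {target_class} .")
    return "\n".join(lines)
```

Calling the function with ":Unit" and a dictionary holding the ranges of :hasUnitCode and :hasCode yields a shape with the same building blocks as the example: a closed shape, rdf:type ignored, one property shape per allowed property, and the target class as the last line.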
Meaning Binding / Individual Example
    constraints:OncologyTreatmentAssessment_result a sh:NodeShape ;
        sh:closed true ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:in ( :CompleteResponse :StableDisease :Unknown :ProgressiveDisease :PartialResponse ) ;
                sh:inversePath rdf:type ] ;
        sh:targetClass :OncologyTreatmentAssessment_result .
A Meaning Binding or Individual also results in a NodeShape, as shown just above. From bottom to top:
- there is a class :OncologyTreatmentAssessment_result in the ontology (last line: sh:targetClass :OncologyTreatmentAssessment_result)
- the inverse path sh:inversePath rdf:type means that all instances of the class :OncologyTreatmentAssessment_result have to be in the sh:in list. Only :CompleteResponse, :StableDisease, :Unknown, :ProgressiveDisease and :PartialResponse are allowed
- the rdf:type is ignored unless explicitly specified
- the shape is closed (sh:closed true) to define that no other properties are allowed.
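The effect of the sh:in list can be mimicked in a few lines of Python. This is a minimal sketch over (subject, predicate, object) tuples; the function name disallowed_individuals is an assumption made for this illustration, and a real validation would use a SHACL engine with the generated shapes.

```python
def disallowed_individuals(triples, target_class, allowed):
    """Return subjects typed as target_class that are not in the allowed list."""
    allowed = set(allowed)
    return sorted({
        s for s, p, o in triples
        if p == "rdf:type" and o == target_class and s not in allowed
    })
```

Any resource that is typed as :OncologyTreatmentAssessment_result but is not one of the five enumerated individuals would appear in the returned list.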
Exceptions
A file using the JSON syntax is provided for handling exceptions.
The top-level element that is searched for is "exceptions", which holds an array of exceptions. There are different types of exceptions, each with its own but similar elements. Exceptions are applied only when the specified element exists. For example, if the property does not exist on a certain class, the cardinality exception will not be applied.
Exceptions are also applied by specificity, meaning that the more specific exception overrides the less specific one. This is used in the cardinality constraints by setting a default for, e.g., hasSubjectPseudoIdentifier with no class referenced, and then adding another cardinality constraint that references a certain class. The second one is then applied to that class, and the first one is applied to all others.
Cardinality Exception
To override any default cardinality on a specific property you can use the following constraint syntax:
{
"type" : "cardinality",
"property" : "https://biomedit.ch/rdf/sphn-ontology/sphn#hasSubjectPseudoIdentifier",
"class" : null,
"minCount" : 1,
"maxCount" : 1
}
The class element can also point to the IRI of a specific class so that it is applied only on the property on that class.
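The specificity rule for cardinality exceptions can be sketched as follows. This is a simplified illustration, not the SHACLer's code: the function name resolve_cardinality is invented, the second entry (a class-specific override for Biobanksample) is an assumed example, and the sketch only looks at exact class matches.

```python
import json

# Assumed exception file content: one generic default and one class-specific override.
EXCEPTIONS_JSON = """
{
  "exceptions": [
    {"type": "cardinality",
     "property": "https://biomedit.ch/rdf/sphn-ontology/sphn#hasSubjectPseudoIdentifier",
     "class": null, "minCount": 1, "maxCount": 1},
    {"type": "cardinality",
     "property": "https://biomedit.ch/rdf/sphn-ontology/sphn#hasSubjectPseudoIdentifier",
     "class": "https://biomedit.ch/rdf/sphn-ontology/sphn#Biobanksample",
     "minCount": 0, "maxCount": 1}
  ]
}
"""


def resolve_cardinality(exceptions, prop, cls):
    """Return (minCount, maxCount) for prop on cls, or None if no exception applies.

    A class-specific exception wins over a generic one ("class": null).
    """
    generic = None
    for exc in exceptions:
        if exc.get("type") != "cardinality" or exc.get("property") != prop:
            continue
        if exc.get("class") == cls:
            return exc.get("minCount"), exc.get("maxCount")
        if exc.get("class") is None:
            generic = (exc.get("minCount"), exc.get("maxCount"))
    return generic
```

After loading the array with json.loads(EXCEPTIONS_JSON)["exceptions"], a lookup for Biobanksample returns the class-specific counts, while a lookup for any other class falls back to the generic entry.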
Not Instantiable Class
Some classes present in the ontology may not be instantiated on their own. Only their subclasses are instantiable. This exception can be written down with the following statement (example of the :Measurement class):
{
"type" : "notInstantiableClass",
"class" : "https://biomedit.ch/rdf/sphn-ontology/sphn#Measurement"
}
Not Instantiable Property
Some properties present in the ontology may not be instantiated on their own. Only their subproperties are instantiable. This exception can be written down with the following statement (example of the :hasValue property):
{
"type" : "notInstantiableProperty",
"property" : "https://biomedit.ch/rdf/sphn-ontology/sphn#hasValue",
"class" : null
}
Range Extension
Sometimes ranges of properties (in general or in combination with specific domain classes) need to be extended in the validation. This can be done with the Range Extension. Setting the class element to null makes the extension generic.
{
"type" : "rangeExtension",
"property" : "https://biomedit.ch/rdf/sphn-ontology/sphn#hasLabResultLabTestCode",
"class" : "https://biomedit.ch/rdf/sphn-ontology/sphn#LabResult",
"extendedRange" : "https://biomedit.ch/rdf/sphn-ontology/sphn#Code"
}
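Putting the four exception types together, a file could be read and sanity-checked along the following lines. This is a sketch under assumptions: the required keys are simply read off the four examples in this section (the actual tool may accept fewer or more), and the function name load_exceptions is hypothetical.

```python
import json

# Keys per exception type, taken from the examples above (an assumption,
# not a documented schema).
REQUIRED_KEYS = {
    "cardinality": {"property", "class", "minCount", "maxCount"},
    "notInstantiableClass": {"class"},
    "notInstantiableProperty": {"property", "class"},
    "rangeExtension": {"property", "class", "extendedRange"},
}


def load_exceptions(text):
    """Parse exception file content and group its entries by type."""
    grouped = {name: [] for name in REQUIRED_KEYS}
    for entry in json.loads(text).get("exceptions", []):
        kind = entry.get("type")
        if kind not in REQUIRED_KEYS:
            raise ValueError(f"unknown exception type: {kind!r}")
        missing = REQUIRED_KEYS[kind] - set(entry)
        if missing:
            raise ValueError(f"{kind} exception lacks {sorted(missing)}")
        grouped[kind].append(entry)
    return grouped
```

Grouping by type keeps the later generation steps simple: each step only needs to look at the list of exceptions relevant to it.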
Availability and usage rights
© Copyright 2021, Personalized Health Informatics Group (PHI), SIB Swiss Institute of Bioinformatics
The SPHN Quality Framework is available at https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-ontology/-/tree/master/quality_assurance. The SPHN Quality Framework is under the CC BY-NC-SA 4.0 License. The SHACLer is licensed under the GPLv3 and is available on request (dcc@sib.swiss). For any question or comment, please contact the SPHN Data Coordination Center (DCC) at dcc@sib.swiss.