SHACLer
Note
For an introduction to SHACL, visit the SHACL Background section.
SHACL
Shapes Constraint Language (SHACL) allows validating a dataset that has been specified following an RDF schema, for instance the SPHN RDF Schema or a project-specific RDF Schema (see Generate a project-specific RDF Schema and Generate data according to a RDF schema). For further information on SHACL, see Data validation with SHACL.
The SHACLer tool
The SPHN SHACL Generator (SHACLer) is a Python-based tool developed by the DCC.
The tool takes as input a SPHN-compliant RDF Schema and generates a set of SHACL rules in Turtle format.
Projects can use the SHACLer to generate a set of SHACL rules based on a project-specific RDF Schema.
Note
An (optional) exception file can be provided together with the RDF schema if any of the defined classes has an exception to be handled separately.
See the instructions on how to run the SHACLer to generate SHACL files here.
SHACLer internals
The SHACLer generates all validation rules based on NodeShapes, each centered on a class from the RDF schema. All domain, range, restriction and cardinality annotations, as well as individuals, are collected from the RDF schema.
Specifically, the SHACLer looks for all owl:ObjectProperty and owl:DatatypeProperty declarations before parsing their range and domain specifications. For range specifications, it also parses the corresponding rdfs:subClassOf information, since some properties have an upper-level concept as their domain, which logically implies that the property is also valid for the lower-level concepts.
In addition, the SHACLer looks for owl:Restriction and parses information according to specific criteria (i.e. is the information a cardinality restriction or a restriction on a property value?).
Although we require RDFS inference for the validation, it can happen that the upper-level concept should not be instantiable on its own and is excluded; therefore we annotate the property at all allowed levels. This also improves readability on a per-concept basis for a human reader.
All parsed information is stored in internal dictionaries and transported to the SHACL generator.
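The idea of annotating a property at every allowed level can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SHACLer's actual code: the function name `properties_per_class` and the two input dictionaries are hypothetical stand-ins for the internal dictionaries mentioned above.

```python
from collections import defaultdict


def properties_per_class(domains, subclass_of):
    """Map each class to the properties declared on it or on any ancestor.

    domains:     {property: set of classes in its rdfs:domain}
    subclass_of: {class: set of direct superclasses}
    Both structures are hypothetical stand-ins for the parsed schema.
    """
    def ancestors(cls, seen=None):
        # Collect all direct and indirect superclasses of cls.
        seen = set() if seen is None else seen
        for parent in subclass_of.get(cls, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    # Every class mentioned anywhere in the two inputs.
    classes = set(subclass_of)
    for parents in subclass_of.values():
        classes |= parents
    for domain in domains.values():
        classes |= domain

    result = defaultdict(set)
    for cls in classes:
        lineage = {cls} | ancestors(cls)
        for prop, domain in domains.items():
            # A property applies if its domain is the class or an ancestor.
            if domain & lineage:
                result[cls].add(prop)
    return dict(result)


domains = {":hasQuantity": {":Measurement"}, ":hasCode": {":Substance"}}
subclass_of = {":OxygenSaturation": {":Measurement"}}
shapes = properties_per_class(domains, subclass_of)
print(shapes[":OxygenSaturation"])
```

In this sketch the subclass :OxygenSaturation receives :hasQuantity because its parent :Measurement is in the domain of that property, so the property would be listed in both shapes.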
Assumptions made on the SHACLer
When building the SHACL rules, the SHACLer makes some assumptions about the RDF schema. When the assumptions hold, an RDF schema can be used to generate SHACL files. The RDF schema of SPHN starting with version 2021.1 conforms to these assumptions. Projects based on this version (and future versions) of the SPHN RDF Schema must also conform to the assumptions.
The assumptions are the following:
We require that SHACL validation is run with RDFS inference turned on. This is required, as ranges pick some upper-level concepts (e.g. SNOMED CT subtrees). As SNOMED CT in RDF is an OWL ontology, it has subclasses that use OWL syntax instead of RDFS syntax. To be able to apply only RDFS reasoning in the validation phase, the SNOMED CT exploit feature can be used to extend the ranges to all non-RDFS subclasses.
There are no further ObjectProperties/DataProperties than the ones that are defined in the RDF schema (although, there might be further classes with predicates).
An rdfs:domain or rdfs:range annotation of an Object Property indicates that only these properties are allowed in the classes (this also applies to inherited properties).
An rdfs:domain of a property pointing to an owl:unionOf list means that the property can be used in any of the listed instances.
An rdfs:range of a property pointing to an owl:unionOf list means that the property always has to end in an instance of one of (or a subclass of) the referred classes.
In case there are individuals/instances of owl:NamedIndividual and a class, we make these individuals the only allowed instances of that class.
owl:equivalentClass properties link SPHN concepts to other external terminologies (e.g. SNOMED CT, LOINC). These properties are not picked up and evaluated in the SHACL generation. Although logically valid, and also technically valid when applying OWL2 inference, the SHACL rules focus on SPHN concepts.
An owl:Restriction annotation on a property overwrites its rdfs:range annotation.
Constraints implemented in SPHN using SHACL
Note on the formatting: the level of the validation constraint is given in square brackets before each constraint type. See Validation constraint severity levels for more information about the levels.
For each class in the RDF Schema (standalone SPHN or in combination with a project; see Restriction on classes):
[ERROR] no properties other than those specified in the RDF schema are used for this class, with inference rules applied (same as displayed in the pyLODE visualization)
[ERROR] the properties occur with the right cardinality (see Cardinality constraints)
[ERROR] the properties lead to the right target type (datatype or class) (see Literal type constraints)
[ERROR] when terminology valuesets are used, the specification of whether children/descendants (direct and indirect subclasses) are allowed is checked
[ERROR] when terminology valuesets are used, the validity of the codes is checked according to the restricted valuesets
[ERROR] when specifying start and end datetimes in a class, it is asserted that the start is before the end datetime (see Restricting that the start is before the end)
For SPHN/project valuesets in the RDF Schema:
[ERROR] no individuals other than those specified for the valueset are used (see Restricting on individuals/instances)
In general:
[WARN] naming conventions are obeyed for instances of project/SPHN classes (see Naming convention on ontology instances)
[WARN] naming conventions are obeyed for instances of shared resources, e.g. external terminologies
SHACL constraint components implemented in SPHN
A specific set of constraints is implemented in the SHACLer in the context of SPHN, which are listed below:
SHACL Constraint | Description |
---|---|
sh:closed false | value node has only those properties that have been explicitly enumerated via sh:property |
sh:ignoredProperties | properties that are also permitted in addition to those explicitly enumerated via sh:property |
sh:datatype xsd:dateTime | verifies if a property value has the type xsd:dateTime |
sh:datatype xsd:double | verifies if a property value has the type xsd:double |
sh:datatype xsd:string | verifies if a property value has the type xsd:string |
sh:class … sh:path | range of a property is used correctly, i.e. the class of an instance matches the specified type constraint |
sh:maxCount, sh:minCount | checks if the cardinality of a property is applied correctly, e.g., there is just one value for a given property |
sh:inversePath rdf:type | only those values are allowed that have been explicitly enumerated in the expression as a type |
sh:or … sh:path | values of the specified sh:path need to correspond to one of the explicitly enumerated IRIs |
sh:in … sh:path | values of the specified sh:path need to correspond to one of the explicitly enumerated IRIs |
sh:in … sh:inversePath | values need to correspond to explicitly enumerated value lists of individuals |
sh:sparql … sh:select | verifies if a property value is correct, when subclasses of the specified codes are not allowed |
sh:SPARQLTarget … sh:select | the constraints are only validated for this class and not for the subclasses |
… | … |
Patterns of implemented SHACL constraints
Note
Some of the examples shown below are shortened to improve readability. The original ones can be looked up in the SHACL .ttl file generated for the SPHN RDF Schema (here).
Different node shape patterns are implemented in the SHACLer, such as Cardinality constraints, Restriction on classes, Literal type constraints, and Restricting on individuals/instances.
Cardinality constraints
In SPHN, properties may have a specific cardinality, which means that there exists a restriction on how often a property can be used with a certain type of data instance. The cardinalities defined in SPHN are implemented in the RDF schema. They include information on:
1. links connecting each SPHN concept to patient (via sphn:hasSubjectPseudoIdentifier), provider (via sphn:hasDataProviderInstitute), and case (via sphn:hasAdministrativeCase);
2. the number of times specific metadata (i.e. properties) can be connected to a certain concept.
One example of the application of these constraints is the property :hasAdministrativeCase. Entities are allowed to have at most one AdministrativeCase and exactly one SubjectPseudoIdentifier. This rule is expressed by the following SHACL constraints:

    constraints:Biobanksample a sh:NodeShape ;
        sh:closed false ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:class :Biosample ;
                sh:maxCount 1 ;
                sh:minCount 1 ;
                sh:path :hasBiosample ],
            [ sh:class :AdministrativeCase ;
                sh:maxCount 1 ;
                sh:minCount 0 ;
                sh:path :hasAdministrativeCase ],
            [ sh:class :SubjectPseudoIdentifier ;
                sh:maxCount 1 ;
                sh:minCount 1 ;
                sh:path :hasSubjectPseudoIdentifier ] ;
        sh:targetClass :Biobanksample .

We can interpret this rule as follows: For all instances of the class :Biobanksample, the property :hasAdministrativeCase can be used zero times (sh:minCount 0) or at most once (sh:maxCount 1). For all instances of the class :Biobanksample, the property :hasSubjectPseudoIdentifier must be used exactly once (sh:minCount 1 and sh:maxCount 1).
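To make the counting rule concrete, the following Python sketch mimics what sh:minCount and sh:maxCount express for the :Biobanksample shape. It is an illustration only and not how a SHACL engine is implemented; the function `check_cardinality`, the dictionary layout and the sample values are assumptions made for this example.

```python
def check_cardinality(instance, constraints):
    """Return messages for properties used too rarely or too often.

    instance:    {property: list of values}
    constraints: {property: (min_count, max_count)}
    """
    violations = []
    for prop, (min_count, max_count) in constraints.items():
        count = len(instance.get(prop, []))
        if count < min_count:
            violations.append(f"{prop}: {count} value(s), expected at least {min_count}")
        elif count > max_count:
            violations.append(f"{prop}: {count} value(s), expected at most {max_count}")
    return violations


# Cardinalities copied from the constraints:Biobanksample shape above.
biobanksample_constraints = {
    ":hasBiosample": (1, 1),
    ":hasAdministrativeCase": (0, 1),
    ":hasSubjectPseudoIdentifier": (1, 1),
}

# Hypothetical instances: the first satisfies all counts, the second
# has two administrative cases and no subject pseudo identifier.
valid = {":hasBiosample": ["b1"], ":hasSubjectPseudoIdentifier": ["p1"]}
invalid = {":hasBiosample": ["b1"], ":hasAdministrativeCase": ["c1", "c2"]}

print(check_cardinality(valid, biobanksample_constraints))
print(check_cardinality(invalid, biobanksample_constraints))
```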
Restriction on classes
A common pattern is the restriction of properties on classes: a certain property has to refer to an instance of a specific class or a specific set of classes. One example where this constraint is required is the property :hasCode for instances of the class :Substance. These constraints are expressed as follows:

    constraints:Substance a sh:NodeShape ;
        sh:closed false ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:maxCount 1 ;
                sh:minCount 0 ;
                sh:or ( [ sh:class :Code ] [ sh:class sphn-atc:ATC ] [ sh:class snomed:105590001 ] ) ;
                sh:path :hasCode ],
            [ sh:class :Quantity ;
                sh:maxCount 1 ;
                sh:minCount 0 ;
                sh:path :hasQuantity ] ;
        sh:targetClass :Substance .

The above constraints can be interpreted as follows: For all instances of the class :Substance, it must hold that the property :hasCode refers to an instance of at least one of the enumerated classes (i.e. an SPHN Code, an ATC class or a SNOMED CT class of the specific value or its children). This is ensured by the usage of the SHACL expression sh:or, which lists all accepted classes.

In addition, if a certain property has to refer to an instance of a specific class or a specific set of classes and their subclasses are not allowed as values, then the shape is complemented with a sh:sparql expression. One example where this constraint is required is the property :hasCode for instances of the class :AdministrativeGender. These constraints are expressed as follows:

    constraints:AdministrativeGender a sh:NodeShape ;
        sh:closed false ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:class :SubjectPseudoIdentifier ;
                sh:minCount 1 ;
                sh:path :hasSubjectPseudoIdentifier ],
            [ sh:datatype xsd:dateTime ;
                sh:maxCount 1 ;
                sh:minCount 0 ;
                sh:path :hasRecordDateTime ],
            [ sh:maxCount 1 ;
                sh:minCount 1 ;
                sh:or ( [ sh:class snomed:261665006 ] [ sh:class snomed:703117000 ] [ sh:class snomed:74964007 ] [ sh:class snomed:703118005 ] ) ;
                sh:path :hasCode ] ;
        sh:sparql [ a sh:SPARQLConstraint ;
                sh:message "No descendents (all subclasses) of the specified codes are allowed" ;
                sh:select """SELECT ?this (<https://biomedit.ch/rdf/sphn-ontology/sphn#hasCode> as ?path) (?class as ?value)
                    WHERE {
                        ?this <https://biomedit.ch/rdf/sphn-ontology/sphn#hasCode>/<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?class .
                        FILTER( ?values IN ( <http://snomed.info/id/261665006>, <http://snomed.info/id/703117000>, <http://snomed.info/id/74964007>, <http://snomed.info/id/703118005> )) .
                        FILTER (?class NOT IN ( ?values ) ) .
                        FILTER NOT EXISTS { ?values <http://www.w3.org/2000/01/rdf-schema#subClassOf>+ ?class .}
                    }""" ] ;
        sh:targetClass :AdministrativeGender .

The above constraint can be interpreted as follows: For all instances of the class :AdministrativeGender, it must hold that the property :hasCode refers to an instance of at least one of the enumerated classes (sh:or). No other values are allowed. If the property value points, for example, to an instance of a subclass of one of the enumerated classes, an error message is raised. This is ensured by the usage of the SHACL expression sh:sparql, which reports a message (sh:message) if it finds an instance of a subclass (sh:select).

Furthermore, if a certain property has to refer to an instance of a specific class or a specific set of classes and only instances of direct subclasses of the specified classes are allowed, the sh:sparql expression is again used for encoding such restrictions. One example where this constraint is required is the property :hasCode for instances of the class :Intent. These constraints are expressed as follows:

    constraints:Intent a sh:NodeShape ;
        sh:closed false ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:class snomed:363675004 ;
                sh:maxCount 1 ;
                sh:minCount 1 ;
                sh:path :hasCode ] ;
        sh:sparql [ a sh:SPARQLConstraint ;
                sh:message "Only children (direct subclasses) of the specified codes are allowed" ;
                sh:select """SELECT ?this (<https://biomedit.ch/rdf/sphn-ontology/sphn#hasCode> as ?path) (?class as ?value)
                    WHERE {
                        ?this <https://biomedit.ch/rdf/sphn-ontology/sphn#hasCode>/<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?class .
                        FILTER( ?values IN ( <http://snomed.info/id/363675004> )) .
                        ?child rdfs:subClassOf ?values .
                        FILTER (?class NOT IN ( ?values, ?child) ) .
                        FILTER NOT EXISTS { ?values <http://www.w3.org/2000/01/rdf-schema#subClassOf>+ ?class .}
                        FILTER NOT EXISTS { ?child <http://www.w3.org/2000/01/rdf-schema#subClassOf>+ ?class .}
                    }""" ] ;
        sh:targetClass :Intent .
Note
The no-subclasses-allowed and only-direct-subclasses-allowed constraints (sh:sparql) are not validated by GraphDB, but ignored.
SPARQL target constraints
To not cause unwanted validation errors when subclasses are validated against the constraints of their parent class, SPARQL target constraints are implemented for the SPHN classes with subclasses.
    constraints:Measurement a sh:NodeShape ;
        sh:closed false ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:class :Quantity ;
                sh:path :hasQuantity ] ;
        sh:target [ a sh:SPARQLTarget ;
                sh:select """SELECT ?this
                    WHERE {
                        ?this <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://biomedit.ch/rdf/sphn-ontology/sphn#Measurement> .
                        MINUS {
                            ?this <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://biomedit.ch/rdf/sphn-ontology/sphn#Measurement> .
                            ?this <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?other_type .
                            FILTER (?other_type != <https://biomedit.ch/rdf/sphn-ontology/sphn#Measurement> )
                            ?this <http://www.w3.org/2000/01/rdf-schema#subClassOf>+ <https://biomedit.ch/rdf/sphn-ontology/sphn#Measurement> .
                        }
                    }""" ] .

The above constraint can be interpreted as follows: Only instances of the class :Measurement that are not also instances of a subclass of :Measurement are validated against this constraint. Therefore, instances of subclasses (e.g. :OxygenSaturation) are validated only against the constraints:OxygenSaturation shape and not against the constraints of their parent class shape constraints:Measurement. This is ensured by the usage of the SHACL expression sh:SPARQLTarget, where instances of subclasses are excluded from the select query (sh:select).
Note
The target class constraint (sh:SPARQLTarget) is not validated by GraphDB, but causes errors.
Sequence paths
Some properties have a sequence of nodes specified as a path. This is expressed as follows:

    constraints:Age a sh:NodeShape ;
        sh:closed false ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:class :SubjectPseudoIdentifier ;
                sh:maxCount 1 ;
                sh:minCount 1 ;
                sh:path :hasSubjectPseudoIdentifier ],
            [ sh:in ( ucum:h ucum:wk ucum:a ucum:d ucum:mo ucum:min ) ;
                sh:maxCount 1 ;
                sh:minCount 1 ;
                sh:path ( :hasQuantity :hasUnit :hasCode ) ] ;
        sh:targetClass :Age .

The above constraint can be interpreted as follows: For all instances of the class :Age, it must hold that the value reached over the sequence path :hasQuantity / :hasUnit / :hasCode is one of the enumerated values (sh:in).
It means that when an age is given, the possible code values for its unit are only hour, week, year, day, month and minute.
Note
The sequence paths are not validated by GraphDB, but ignored.
Literal type constraints
Besides the object properties where Restrictions on classes are used, there also exist data properties. On data properties we have the option to restrict the possible datatypes using literal type constraints. In the class :Code, three of them are in use: on the properties :hasCodingSystemAndVersion, :hasIdentifier and :hasName the SHACL file validates that the literal used is of type xsd:string.

    constraints:Code a sh:NodeShape ;
        sh:closed false ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:datatype xsd:string ;
                sh:maxCount 1 ;
                sh:minCount 1 ;
                sh:path :hasCodingSystemAndVersion ],
            [ sh:datatype xsd:string ;
                sh:maxCount 1 ;
                sh:minCount 0 ;
                sh:path :hasName ],
            [ sh:datatype xsd:string ;
                sh:maxCount 1 ;
                sh:minCount 1 ;
                sh:path :hasIdentifier ] ;
        sh:targetClass :Code .

The interpretation of the above constraint is: Whenever the property :hasName is used in an instance of :Code, the object needs to be a literal of type xsd:string.
Restricting on individuals/instances
There exist cases where it is forbidden to create new instances of a class, and only already existing so-called individuals (instances) are allowed. This constraint is, for instance, applied to entities of the type :Biosample_fixationType as shown in the following:

    constraints:Biosample_fixationType a sh:NodeShape ;
        sh:closed false ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:in ( :AldehydeBased :RNALater :VacuumTechnologyStabilization :Other :AlcoholBased :HeatStabilization :AllprotectTissueReagent :NeutralBufferedFormalin :SnapFreezing :UNK :OptimumCuttingTemperatureMedium :PAXgeneTissue :NonaldehydeWithAceticAcid :NonaldehydeBasedWithoutAceticAcid :NonbufferedFormalin ) ;
                sh:path [ sh:inversePath rdf:type ] ] ;
        sh:targetClass sphn:Biosample_fixationType .

This SHACL constraint ensures that only explicitly enumerated individuals are used as instances of the class :Biosample_fixationType. In addition, it forbids, by means of the inverse path constraint sh:inversePath rdf:type, that new entities are derived as subclasses.
Restricting that the start is before the end
Whenever start and end datetimes are given in the schema, a constraint is created that ensures that they form a valid timeframe (start before end).
    constraints:ElectrocardiographicProcedure a sh:NodeShape ;
        sh:closed false ;
        sh:ignoredProperties ( rdf:type :hasIntent :hasPhysiologicState ) ;
        sh:sparql [ a sh:SPARQLConstraint ;
                sh:message "Invalid time frame between sphn:hasStartDateTime and sphn:hasEndDateTime" ;
                sh:select """SELECT ?this (<https://biomedit.ch/rdf/sphn-ontology/sphn#hasStartDateTime> as ?path) (?hasStartDateTime as ?value)
                    WHERE {
                        ?this <https://biomedit.ch/rdf/sphn-ontology/sphn#hasStartDateTime> ?hasStartDateTime .
                        ?this <https://biomedit.ch/rdf/sphn-ontology/sphn#hasEndDateTime> ?hasEndDateTime .
                        FILTER (?hasStartDateTime > ?hasEndDateTime)
                    }""" ] ;
        sh:targetClass :ElectrocardiographicProcedure .

This shortened excerpt of the SHACL shape of ElectrocardiographicProcedure ensures, via sh:sparql, that the dateTime used in hasStartDateTime is before the one used in hasEndDateTime.
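The comparison in the FILTER clause can be mirrored in plain Python to see which value pairs would be reported. This is a minimal sketch with a hypothetical helper `invalid_time_frame` and made-up timestamps; note that, like the query as written, it only flags a start that is strictly later than the end, so identical start and end values pass.

```python
from datetime import datetime


def invalid_time_frame(start, end):
    """Mirror FILTER (?hasStartDateTime > ?hasEndDateTime).

    Both arguments are ISO 8601 strings; True means the pair would be
    reported as a violation.
    """
    return datetime.fromisoformat(start) > datetime.fromisoformat(end)


# Hypothetical values for illustration.
print(invalid_time_frame("2023-01-10T09:00:00", "2023-01-10T08:00:00"))  # start after end
print(invalid_time_frame("2023-01-10T08:00:00", "2023-01-10T09:00:00"))  # start before end
```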
Naming convention on ontology instances
The naming convention in 2.2.1 Unique resource instantiation describes the convention that must be used to instantiate resources of SPHN and project classes. This convention is translated into a validation constraint.
    constraints:GenomicPosition_Warning_Naming a sh:NodeShape ;
        sh:severity sh:Warning ;
        sh:sparql [ a sh:SPARQLConstraint ;
                sh:message "Instantiated unique resource not matching naming convention '^https://biomedit.ch/rdf/sphn-resource/.*GenomicPosition-.*$'" ;
                sh:select """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                    SELECT ?this (?class as ?path) (?this as ?value)
                    WHERE {
                        ?this rdf:type ?class .
                        FILTER(!REGEX(STR(?this), "^https://biomedit.ch/rdf/sphn-resource/.*GenomicPosition-.*$"))
                    }""" ] ;
        sh:targetClass :GenomicPosition .
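The regular expression in this constraint can be tried out independently of a triple store. The sketch below reuses the pattern from the shape above; the helper name `violates_naming` and the two example IRIs are hypothetical and only serve to show which kind of IRI would trigger the warning.

```python
import re

# Pattern copied from the sh:select query of the shape above.
NAMING_PATTERN = re.compile(
    r"^https://biomedit.ch/rdf/sphn-resource/.*GenomicPosition-.*$"
)


def violates_naming(iri):
    """Return True if the IRI would be reported, i.e. !REGEX(...) holds."""
    return NAMING_PATTERN.search(iri) is None


# Hypothetical IRIs for illustration only.
print(violates_naming("https://biomedit.ch/rdf/sphn-resource/CHE-000-GenomicPosition-42"))
print(violates_naming("https://example.org/GenomicPosition-42"))
```

Since the shape carries sh:severity sh:Warning, an IRI for which the helper returns True corresponds to a warning rather than a failed validation.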
Implementation examples
Class Example
    constraints:Quantity a sh:NodeShape ;
        sh:closed true ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:class :Unit ;
                sh:maxCount 1 ;
                sh:minCount 1 ;
                sh:path :hasUnit ],
            [ sh:maxCount 1 ;
                sh:minCount 1 ;
                sh:or ( [ sh:datatype xsd:double ] [ sh:datatype xsd:string ] ) ;
                sh:path :hasValue ] ;
        sh:targetClass :Quantity .
The NodeShape shown here is generated from various parts of the ontology. From bottom to top:
there is a class :Quantity in the ontology (last line: sh:targetClass :Quantity)
the properties :hasUnit and :hasValue have :Quantity in their domain specification (sh:property and following)
both properties have given cardinalities (sh:minCount and sh:maxCount)
the property :hasUnit has the :Unit class in the range (sh:property and following). The target class will have a NodeShape of its own.
the property :hasValue has xsd:double and xsd:string from the Terminologies in the range (sh:or and following lines).
rdf:type is ignored unless explicitly specified
the shape is closed (sh:closed true) to define that no other properties are allowed.
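The combined effect of sh:closed true and sh:ignoredProperties can be sketched as a simple set difference. The function `unexpected_properties` and the property :hasComment are hypothetical and used only for illustration; a real SHACL engine works on the RDF graph rather than on Python lists.

```python
def unexpected_properties(used, declared, ignored=("rdf:type",)):
    """Return properties that a closed shape would reject.

    used:     properties found on the instance
    declared: properties enumerated via sh:property
    ignored:  properties listed in sh:ignoredProperties
    """
    return sorted(set(used) - set(declared) - set(ignored))


# Properties declared in the constraints:Quantity shape above.
quantity_declared = [":hasUnit", ":hasValue"]

# ":hasComment" is a made-up property that is not part of the shape.
used = [":hasUnit", ":hasValue", "rdf:type", ":hasComment"]
print(unexpected_properties(used, quantity_declared))
```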
Meaning Binding / Individual Example
    constraints:OncologyTreatmentAssessment_result a sh:NodeShape ;
        sh:closed true ;
        sh:ignoredProperties ( rdf:type ) ;
        sh:property [ sh:in ( :CompleteResponse :StableDisease :Unknown :ProgressiveDisease :PartialResponse ) ;
                sh:inversePath rdf:type ] ;
        sh:targetClass :OncologyTreatmentAssessment_result .
A Meaning Binding or Individual also results in a NodeShape, as shown just above. From bottom to top:
there is a class :OncologyTreatmentAssessment_result in the schema (last line: sh:targetClass :OncologyTreatmentAssessment_result)
the inverse property of the type sh:inversePath rdf:type means that all instances of the class OncologyTreatmentAssessment_result have to be in the list specified in sh:in. Only :CompleteResponse, :StableDisease, :Unknown, :ProgressiveDisease and :PartialResponse are allowed
rdf:type is ignored unless explicitly specified
the shape is closed (sh:closed true) to define that no other properties are allowed.
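The sh:in restriction boils down to a membership test against the enumerated individuals. The following sketch assumes the five individuals from the shape above; the function name `disallowed_individuals` and the value :Remission are hypothetical.

```python
# Individuals enumerated in sh:in for :OncologyTreatmentAssessment_result.
ALLOWED_RESULTS = {
    ":CompleteResponse",
    ":StableDisease",
    ":Unknown",
    ":ProgressiveDisease",
    ":PartialResponse",
}


def disallowed_individuals(used, allowed=ALLOWED_RESULTS):
    """Return the individuals that are not part of the enumerated list."""
    return sorted(set(used) - allowed)


# ":Remission" is a made-up individual that is not in the valueset.
print(disallowed_individuals([":StableDisease", ":Remission"]))
```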
Validating data with the SHACL file
Data producers can use this SHACL file to validate the data that has been exported according to the given RDF Schema (see Data validation with SHACL). Validating data before sending it to users avoids distributing data that is inconsistent with the RDF Schema (e.g., data with missing properties, data with properties that have not been specified in the RDF Schema, or data with wrong data types).
Validation constraint severity levels
There are three different severity levels implemented in the SHACLer:
[ERROR] : violating a constraint with this severity fails the validation. For a successful validation you must remediate the error. The validation error message gives you information about the issue.
[WARN] : violating a constraint with this severity does not fail the validation. It is recommended to check the data whether there is a potential error.
[INFO] : violating a constraint with this severity does not fail the validation. It has informational character only. It is used for e.g. informing that an old but still valid code is used for a historized terminology.
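How the three levels affect the overall outcome can be summarised in a short sketch. The representation of results as (severity, message) pairs and the function `validation_passes` are assumptions made for this illustration; actual validation reports are RDF graphs produced by the SHACL engine.

```python
def validation_passes(results):
    """Return True unless at least one result has severity ERROR.

    results: list of (severity, message) tuples, where severity is one of
    "ERROR", "WARN" or "INFO".
    """
    return all(severity != "ERROR" for severity, _ in results)


# Hypothetical reports for illustration.
only_hints = [("WARN", "naming convention not met"), ("INFO", "old code used")]
with_error = only_hints + [("ERROR", "cardinality violated")]

print(validation_passes(only_hints))
print(validation_passes(with_error))
```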
Availability and usage rights
© Copyright 2023, Personalized Health Informatics Group (PHI), SIB Swiss Institute of Bioinformatics
The SHACLer is available at https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-shacl-generator (send request to DCC - dcc@sib.swiss) and is licensed under the GPLv3 license.
For any question or comment, please contact the SPHN Data Coordination Center (DCC) at dcc@sib.swiss.