Terminology Service
Several terminologies are currently available within the biomedical and life sciences domain. However, not all of them are represented using a formal knowledge representation language. Some exist only as spreadsheets, CSV files, or even free text. Such representations are easy for humans to read and work with, but difficult for machines to process, and even more difficult to use as part of a broader semantic interoperability strategy.
It is, however, possible to adopt best practices and make use of existing W3C standards to build a formal representation of these terminologies. If done properly, any terminology can be adapted so that it is both human- and machine-readable and conforms to the FAIR principles.
The DCC Terminology Service provides SPHN-compatible, machine-readable versions of national (e.g. CHOP, ICD-10-GM) and international (e.g. SNOMED CT, LOINC, ATC, UCUM, HGNC, GENO, SO) terminologies and classifications in RDF formats (.ttl or .owl).
This page describes how the DCC distributes the external terminologies and the versioning strategy adopted for handling different versions of a terminology.
For further information on the external terminologies, please see External terminologies.
The DCC Terminology Service
The DCC developed the Terminology Service as a way to distribute external terminologies in RDF format without external dependencies and in compliance with the copyright statements of each of these terminologies.
As part of this effort,
terms in a terminology are translated into concept classes
term hierarchies are represented via parent-child relationships
term definitions, synonyms, and mappings are represented appropriately
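As a rough illustration of the three points above, the following Python sketch renders a few terms as Turtle. The namespace `https://example.org/terminology/`, the `Term` class, the `to_turtle` helper, and the codes and labels are all hypothetical. The use of `rdfs:Class`, `rdfs:label` and `rdfs:subClassOf` is an assumption made for this example and not a description of the DCC pipeline; definitions, synonyms and mappings are left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical namespace used only for this illustration.
BASE = "https://example.org/terminology/"


@dataclass
class Term:
    code: str
    label: str
    parent: Optional[str] = None  # code of the parent term, if any


def to_turtle(terms: Iterable[Term]) -> str:
    """Render terms as Turtle: one class per term, one subclass link per parent."""
    lines = ["@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .", ""]
    for term in terms:
        # Escape backslashes and double quotes so the label is a valid Turtle string.
        label = term.label.replace("\\", "\\\\").replace('"', '\\"')
        statements = ["a rdfs:Class", f'rdfs:label "{label}"']
        if term.parent is not None:
            statements.append(f"rdfs:subClassOf <{BASE}{term.parent}>")
        lines.append(f"<{BASE}{term.code}> " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines) + "\n"


# Hypothetical terms: C12345 is the parent of C12346.
example_terms = [
    Term("C12345", "Example parent term"),
    Term("C12346", "Example child term", parent="C12345"),
]
print(to_turtle(example_terms))
```

The resulting text can be saved with a `.ttl` extension and opened in an RDF tool to inspect the class hierarchy.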
The terminologies are available directly in the individual project spaces on the BioMedIT nodes. Furthermore, the DCC provides two modes of distribution, enabling projects and institutions to fetch the external terminologies in RDF format:
via the Terminology Service at the BioMedIT portal
via a standalone Terminology Server that uses the MinIO object storage service.
| | Terminology Service at the BioMedIT portal | Terminology Server |
|---|---|---|
| URL | | |
| Access | SWITCH edu-ID | A dedicated account to be requested at fair-data-team@sib.swiss |
| Terminology format | Provides a single file per external terminology | Provides bundles of external terminologies |
| Download modality | Web interface | Web interface or command-line interface |
| Intended users | Researchers | BioMedIT nodes, hospitals or other service providers |
The two modes of distribution are always synchronized and regularly provide updated versions of the external terminologies.
The bundle of external terminologies provided in the Terminology Server on the MinIO instance is a compressed folder that contains all versions of the terminologies stated in the Namespace of terminologies used in SPHN.
Downloading terminology files
Both modes of distribution allow for manually downloading the terminologies via the web browser. An additional tool, the terminology-server-downloader, enables automatically downloading terminologies from the Terminology Server via a command-line interface.
For instructions on downloading external terminology files from the Terminology
Service, please see our user guide Download external terminologies from the Terminology Service.
After the terminology files have been downloaded, they can be imported into RDF tools such as Protégé or GraphDB.
Versioning strategy of terminologies
Since 2023, the DCC has adopted the concept of versioning to enhance interoperability of codes from different versions of a terminology.
When working with terminologies (and translating them into RDF), certain assumptions are made:
Term Removal: Terms are never removed from successive versions of a terminology, i.e. if a term C12345 exists in the 2020 version and is retired in the 2021 version, it is not removed from the 2021 version of the terminology
Term Recycling: Terms are not recycled between versions of a terminology, i.e. if a term C12345 exists in the 2020 version and is retired in 2021, it is not reintroduced in 2022 as a term with a new meaning
Term Consistency: Terms are unique and consistent within a version and across versions of a terminology, i.e. if a term C12345 exists in the 2020 version of a terminology, the same term exists in the 2022 version with the same meaning
While these may be reasonable assumptions to make, there have been cases where one or more terminologies have violated these assumptions.
When the above assumptions are violated, there can be unexpected outcomes such as:
the meaning of a code with the same identifier changes between two versions
a code that existed in one version is not available in a newer version of a terminology
To manage this discrepancy, the DCC has adopted a versioning strategy that provides the ability to:
identify codes that are identical in their semantic meaning across successive versions of a terminology
identify codes that are deprecated/removed between two successive versions of a terminology
identify codes that are newly introduced in a newer version of a terminology
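The second and third abilities listed above come down to comparing the sets of code identifiers in two successive versions. The sketch below shows this with plain Python sets; the function name `compare_code_sets` and the example codes are hypothetical, and this is not presented as the DCC's implementation. Codes that appear in both versions are only "carried over" here: whether they are semantically identical still has to be decided by a separate identity check.

```python
from typing import Dict, Set


def compare_code_sets(older: Set[str], newer: Set[str]) -> Dict[str, Set[str]]:
    """Split code identifiers of two successive versions into three groups."""
    return {
        "carried_over": older & newer,  # present in both versions
        "removed": older - newer,       # present only in the older version
        "introduced": newer - older,    # present only in the newer version
    }


# Hypothetical code sets for a 2020 and a 2021 version of a terminology.
codes_2020 = {"C12345", "C12346", "C12347"}
codes_2021 = {"C12345", "C12347", "C12348"}

result = compare_code_sets(codes_2020, codes_2021)
for group, codes in result.items():
    print(group, sorted(codes))
```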
The strategy can be described in two parts:
Assign version: first assign version to each code from a terminology
Apply identity: then apply a methodology for identifying identical codes between versions of a terminology
Assign version
Before trying to apply versioning, it is always important to know where a code came from. This can be achieved by assigning a version to each code from a terminology.
So if a code C12345 exists in both the 2020 and 2021 versions of a terminology, then the C12345 from the 2020 version is assigned a unique version IRI and the C12345 from the 2021 version is assigned a separate unique version IRI.
This can be demonstrated with an example from ATC, where https://www.whocc.no/atc_ddd_index/?code=C07FB02 is a code that exists in both the 2016 and 2017 versions of ATC.
By assigning versioned IRIs, we can differentiate both codes as follows:
https://www.whocc.no/atc_ddd_index/2016/?code=C07FB02 for C07FB02 from ATC 2016
https://www.whocc.no/atc_ddd_index/2017/?code=C07FB02 for C07FB02 from ATC 2017
Since their IRIs are different, the codes are considered separate resources in RDF.
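The ATC example can be reproduced with a small Python helper that inserts the version as an extra path segment before the query string. The function name `versioned_iri` is hypothetical, and the placement of the version segment is inferred only from the ATC IRIs shown above; other terminologies may use a different IRI layout.

```python
from urllib.parse import urlsplit, urlunsplit


def versioned_iri(iri: str, version: str) -> str:
    """Append the version as a path segment, keeping query and fragment intact."""
    parts = urlsplit(iri)
    # Make sure the existing path ends with a slash before adding the version.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, f"{path}{version}/", parts.query, parts.fragment)
    )


unversioned = "https://www.whocc.no/atc_ddd_index/?code=C07FB02"
print(versioned_iri(unversioned, "2016"))
# https://www.whocc.no/atc_ddd_index/2016/?code=C07FB02
print(versioned_iri(unversioned, "2017"))
# https://www.whocc.no/atc_ddd_index/2017/?code=C07FB02
```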
Apply identity
Once all codes are differentiated by version, the next step is to establish identity based on some criteria. This is where we state that a code from one version of a terminology is identical to the same code from a different version of that terminology.
In an ideal scenario, every code in one version of a terminology would be identical to the same code in every other version of that terminology. But this is not the case, as noted above.
There are several ways to establish identity between terms. Two of the most common methods are:
lexical match: a simple string match where the labels of the same code are compared and the codes are treated as identical if the labels are identical; however, labels can change slightly without any change in semantics
semantic structure: a more holistic approach where the semantic scope of a code is compared between two or more versions; this is a more complex solution
As a first pass, the DCC has adopted the lexical match methodology to identify codes that are identical between successive versions of a terminology.
In lexical match, the names (or labels) of the same code in two different versions of a terminology are compared.
If the labels are identical, then the codes from the two versions are considered identical
If the labels are different, then it is assumed that the meaning of the code has changed between versions
This can be demonstrated with the previous ATC example.
We have https://www.whocc.no/atc_ddd_index/2016/?code=C07FB02 for C07FB02 from ATC 2016 and https://www.whocc.no/atc_ddd_index/2017/?code=C07FB02 for C07FB02 from ATC 2017.
But the label for https://www.whocc.no/atc_ddd_index/2016/?code=C07FB02 is ‘metoprolol and other antihypertensives’ whereas the label for https://www.whocc.no/atc_ddd_index/2017/?code=C07FB02 is ‘metoprolol and felodipine’.
Because the labels for C07FB02 do not match between 2016 and 2017, we treat the two codes as separate.
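A minimal Python sketch of the lexical match on this ATC example is shown here. It performs an exact string comparison of labels for every code present in both versions; the function name `lexical_match` and the dictionary layout (code mapped to label) are assumptions made for illustration, not the DCC's actual implementation.

```python
from typing import Dict


def lexical_match(older: Dict[str, str], newer: Dict[str, str]) -> Dict[str, bool]:
    """For each code present in both versions, report whether the labels are identical."""
    shared_codes = older.keys() & newer.keys()
    return {code: older[code] == newer[code] for code in shared_codes}


# Labels for C07FB02 as given in the text above.
atc_2016 = {"C07FB02": "metoprolol and other antihypertensives"}
atc_2017 = {"C07FB02": "metoprolol and felodipine"}

print(lexical_match(atc_2016, atc_2017))
# {'C07FB02': False}
```

Because the comparison is exact, even a minor change in spelling or punctuation marks a code as changed, which is the known limitation of this method.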
This versioning strategy is applied to ATC, CHOP, and ICD-10-GM.
As a result, the DCC provides a compiled historized version of a terminology that contains the current version of the terminology and all previous versions in a single file.
For example, in the case of ATC, the DCC Terminology Service provides the following:
sphn_atc_2017-2016-1.ttl: RDF Turtle file containing ATC 2017 vs ATC 2016
sphn_atc_2018-2016-1.ttl: RDF Turtle file containing ATC 2018 vs ATC 2017 vs ATC 2016
sphn_atc_2019-2016-1.ttl: RDF Turtle file containing ATC 2019 vs ATC 2018 vs ATC 2017 vs ATC 2016
And this pattern is applied to all versions leading up to the current (latest) version of ATC.
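The file names above suggest a pattern of the form `sphn_<terminology>_<latest>-<oldest>-<suffix>.ttl`. The Python sketch below parses such a name and lists the versions it would cover. This pattern is inferred only from the three ATC examples: the meaning of the trailing number is not stated here and is treated as an opaque suffix, and the assumption of one version per year with no gaps may not hold for every terminology. The function name `covered_versions` is hypothetical.

```python
import re
from typing import List

# Pattern inferred from the ATC file names above (an assumption, not a specification).
FILENAME_PATTERN = re.compile(
    r"^sphn_(?P<terminology>[a-z0-9-]+)_(?P<latest>\d{4})-(?P<oldest>\d{4})-(?P<suffix>\d+)\.ttl$"
)


def covered_versions(filename: str) -> List[int]:
    """Return the yearly versions covered by a file, newest first."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    latest, oldest = int(match["latest"]), int(match["oldest"])
    if latest < oldest:
        raise ValueError(f"latest version precedes oldest version: {filename}")
    return list(range(latest, oldest - 1, -1))


print(covered_versions("sphn_atc_2019-2016-1.ttl"))
# [2019, 2018, 2017, 2016]
```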
Note
The DCC is exploring how to improve the versioning strategy and adopt a more sophisticated solution. Future releases of the terminologies from the Terminology Service will include these improvements.
References
Krauss P, Touré V, Gnodtke K, Crameri K, Österle S. DCC Terminology Service—An Automated CI/CD Pipeline for Converting Clinical and Biomedical Terminologies in Graph Format for the Swiss Personalized Health Network. Applied Sciences. 2021; 11(23):11311. https://doi.org/10.3390/app112311311
Unni D, Touré V, Krauss P, Crameri K, Österle S. SPHN Strategy to Unravel the Semantic Drift Between Versions of Standard Terminologies. Preprints. 2023; 2023120508. https://doi.org/10.20944/preprints202312.0508.v1