Terminology Service

Many terminologies are available in the biomedical and life sciences domain, but not all of them are represented in a formal knowledge representation language. Some exist only as spreadsheets, CSV files, or even free text. Such representations are easy for humans to read and work with but difficult for machines to process, and they are even harder to use as part of a broader semantic interoperability strategy.

It is, however, possible to adopt best practices and use existing W3C standards to build a formal representation of these terminologies. Done properly, any terminology can be adapted so that it is both human- and machine-readable and conforms to the FAIR principles.

The DCC Terminology Service provides SPHN-compatible, machine-readable versions of national (e.g. CHOP, ICD-10-GM) and international (e.g. SNOMED CT, LOINC, ATC, UCUM, HGNC, GENO, SO) terminologies and classifications in RDF formats (.ttl or .owl). This page describes how the DCC distributes the external terminologies and the versioning strategy adopted for handling different versions of a terminology.

For further information on the external terminologies, please see External terminologies.

The DCC Terminology Service

The DCC developed the Terminology Service as a way to distribute external terminologies in RDF format without external dependencies and in compliance with the copyright statements of each of these terminologies.

As part of this effort,

  • terms in a terminology are translated into concept classes

  • term hierarchies are represented via parent-child relationships

  • term definitions, synonyms, and mappings are represented appropriately
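The translation described above can be illustrated with a minimal sketch that renders one term as an RDF class in Turtle. The `Term` type, the `term_to_turtle` helper, the example.org IRIs, and the choice of `owl:Class`, `rdfs:subClassOf`, `rdfs:label`, and `skos:altLabel` are assumptions made for illustration; the properties actually used by the DCC pipeline may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Prefix declarations for the vocabularies assumed in this sketch.
PREFIXES = (
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
)


@dataclass
class Term:
    """Hypothetical in-memory form of one terminology entry."""
    iri: str
    label: str
    parent_iri: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)


def escape(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def term_to_turtle(term: Term) -> str:
    """Render a term as a class with a label, an optional parent, and synonyms."""
    lines = [f"<{term.iri}> a owl:Class"]
    lines.append(f'    rdfs:label "{escape(term.label)}"')
    if term.parent_iri:
        # The parent-child relationship becomes a subclass axiom.
        lines.append(f"    rdfs:subClassOf <{term.parent_iri}>")
    for synonym in term.synonyms:
        lines.append(f'    skos:altLabel "{escape(synonym)}"')
    return " ;\n".join(lines) + " ."


# Hypothetical term; the IRIs and labels are placeholders.
child = Term(
    iri="https://example.org/term/C2",
    label="Child term",
    parent_iri="https://example.org/term/C1",
    synonyms=["Child synonym"],
)
print(PREFIXES)
print(term_to_turtle(child))
```

Keeping the hierarchy as subclass axioms means that a query for a parent class can also retrieve its descendants when reasoning or property paths are used.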

The terminologies are available directly in the individual project spaces on the BioMedIT nodes. Furthermore, the DCC provides two modes of distribution, enabling projects and institutions to fetch the external terminologies in RDF format:

| DCC Terminology Service | Terminology Service at the BioMedIT portal | Terminology Server |
| --- | --- | --- |
| URL | https://terminology.dcc.sib.swiss/ | https://terminology-server.dcc.sib.swiss |
| Access | SWITCH edu-ID | A dedicated account to be requested at fair-data-team@sib.swiss |
| Terminology format | Provides a single file per external terminology | Provides bundles of external terminologies |
| Download modality | Web interface | Web interface or command-line interface |
| Intended users | Researchers | BioMedIT nodes, hospitals or other service providers |

The two modes of distribution are always synchronized and regularly provide updated versions of the external terminologies.

The bundle of external terminologies provided in the Terminology Server on the MinIO instance is a compressed folder that contains all versions of the terminologies stated in the Namespace of terminologies used in SPHN.

Downloading terminology files

Both modes of distribution allow for manually downloading the terminologies via the web browser. An additional tool, the terminology-server-downloader, enables automatically downloading terminologies from the Terminology Server via a command-line interface. For instructions on downloading external terminology files from the Terminology Service, please see our user guide Download external terminologies from the Terminology Service.

After the terminology files have been downloaded, they can be imported into RDF tools such as Protégé or GraphDB.

Versioning strategy of terminologies

Since 2023, the DCC has adopted the concept of versioning to enhance interoperability of codes from different versions of a terminology.

When working with terminologies (and translating them into RDF), certain assumptions are made:

  • Term Removal: Terms are never removed from successive versions of a terminology, i.e. if a term C12345 exists in the 2020 version and is retired in the 2021 version, it is still present in the 2021 version of the terminology

  • Term Recycling: Terms are not recycled between versions of the terminology, i.e. if a term C12345 exists in the 2020 version and is retired in 2021, it is not reintroduced in 2022 as a term with a new meaning

  • Term Consistency: Terms are unique and consistent within the same version and across versions of the terminology, i.e. if a term C12345 exists in the 2020 version of a terminology, the same term exists in the 2022 version with the same meaning

While these may be reasonable assumptions to make, there have been cases where one or more terminologies have violated these assumptions.

When the above assumptions are violated, there can be unexpected outcomes such as:

  • the meaning of a code with the same identifier changes between two versions

  • a code that existed in one version is not available in a newer version of a terminology

To manage this discrepancy, the DCC has adopted a versioning strategy that provides the ability to:

  • identify codes that are identical in their semantic meaning across successive versions of a terminology

  • identify codes that are deprecated/removed between two successive versions of a terminology

  • identify codes that are newly introduced in a newer version of a terminology
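These three capabilities can be sketched as a comparison of two versions, each held as a mapping from code to label. The function name `compare_versions`, the `same_meaning` parameter, and the sample codes and labels are hypothetical; the criterion used to decide whether two codes share a meaning is left pluggable here, with exact label equality used only as a stand-in.

```python
from typing import Callable, Dict, List


def compare_versions(
    older: Dict[str, str],
    newer: Dict[str, str],
    same_meaning: Callable[[str, str], bool],
) -> Dict[str, List[str]]:
    """Classify codes by how they relate across two successive versions."""
    shared = older.keys() & newer.keys()
    return {
        # Present in both versions and judged to have the same meaning.
        "identical": sorted(c for c in shared if same_meaning(older[c], newer[c])),
        # Present in both versions but judged to differ in meaning.
        "changed": sorted(c for c in shared if not same_meaning(older[c], newer[c])),
        # Present only in the older version (deprecated or removed).
        "removed": sorted(older.keys() - newer.keys()),
        # Present only in the newer version (newly introduced).
        "introduced": sorted(newer.keys() - older.keys()),
    }


# Hypothetical versions of a terminology; codes and labels are placeholders.
v2020 = {"C12345": "label A", "C20000": "label B", "C30000": "label C"}
v2021 = {"C12345": "label A", "C20000": "label B, revised", "C40000": "label D"}

report = compare_versions(v2020, v2021, same_meaning=lambda a, b: a == b)
for category, codes in report.items():
    print(category, codes)
```

The "changed" and "removed" categories correspond to the two unexpected outcomes that arise when a terminology violates the assumptions listed earlier.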

The strategy can be described in two parts:

  • Assign version: first, assign a version to each code from a terminology

  • Apply identity: then, apply a methodology for identifying identical codes between versions of a terminology

Assign version

Before trying to apply versioning, it is always important to know where a code came from. This can be achieved by assigning a version to each code from a terminology.

If a code C12345 exists in both the 2020 and 2021 versions of a terminology, then C12345 from the 2020 version is assigned a unique versioned IRI and C12345 from the 2021 version is assigned a separate unique versioned IRI.

This can be demonstrated with an example from ATC, where https://www.whocc.no/atc_ddd_index/?code=C07FB02 is a code that exists in both the 2016 and 2017 versions of ATC.

By assigning versioned IRIs, we can differentiate both codes as follows:

  • https://www.whocc.no/atc_ddd_index/2016/?code=C07FB02 for C07FB02 from ATC 2016

  • https://www.whocc.no/atc_ddd_index/2017/?code=C07FB02 for C07FB02 from ATC 2017

Since their IRIs are different, the two codes are treated as separate resources in RDF.
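As a minimal sketch, the versioned IRIs in the ATC example can be derived by inserting the version as an extra path segment before the query string. The helper name `versioned_iri` is hypothetical, and the assumption that the version always goes at the end of the path holds for the ATC pattern shown here; other terminologies may follow a different IRI pattern.

```python
from urllib.parse import urlsplit, urlunsplit


def versioned_iri(iri: str, version: str) -> str:
    """Insert the version as the last path segment, keeping the query intact."""
    parts = urlsplit(iri)
    # Make sure the existing path ends with a slash before appending.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit(parts._replace(path=f"{path}{version}/"))


base = "https://www.whocc.no/atc_ddd_index/?code=C07FB02"
print(versioned_iri(base, "2016"))
print(versioned_iri(base, "2017"))
```

Because the two results are different strings, an RDF store treats them as two distinct resources, which is what allows statements about the 2016 and 2017 codes to be kept apart.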

Apply identity

Once all codes are differentiated by version, the next step is to establish identity based on a chosen criterion. This is where we state that a code from one version of a terminology is identical to the same code from a different version of that terminology.

In an ideal scenario, every code in one version of a terminology would be identical to the same code in every other version. But this is not the case, as we saw in the previous section.

There are several ways to establish identity between terms. Two of the most common methods are:

  • lexical match: a simple string match in which the labels for the same code are compared and the codes are treated as identical if the labels are identical; however, labels can change slightly without any change in semantics

  • semantic structure: a more holistic approach in which the semantic scope of a code is compared between two or more versions; this is a more complex solution

As a first pass, the DCC has adopted the lexical match methodology to identify codes that are identical between successive versions of a terminology.

In lexical match, the names (or labels) of the same code from two different versions of a terminology are compared.

  • If the labels are identical, then the codes from the different versions are considered identical

  • If the labels are different, then it is assumed that the code has changed in meaning between versions

This can be demonstrated by taking the previous example of ATC.

We have https://www.whocc.no/atc_ddd_index/2016/?code=C07FB02 for C07FB02 from ATC 2016 and https://www.whocc.no/atc_ddd_index/2017/?code=C07FB02 for C07FB02 from ATC 2017.

But the label for https://www.whocc.no/atc_ddd_index/2016/?code=C07FB02 is ‘metoprolol and other antihypertensives’ whereas the label for https://www.whocc.no/atc_ddd_index/2017/?code=C07FB02 is ‘metoprolol and felodipine’.

Because the labels for C07FB02 do not match between 2016 and 2017, the two codes are treated as separate.
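The comparison in this example can be written as a short sketch. The function name `lexical_match` is hypothetical and the check is a plain exact string comparison, which is an assumption about how strictly labels are compared; the capitalized label in the second call is an invented variant used only to show the limitation of exact matching.

```python
def lexical_match(older_label: str, newer_label: str) -> bool:
    """Return True only when the two labels are exactly the same string."""
    return older_label == newer_label


# Labels of C07FB02 as given in the ATC example.
label_2016 = "metoprolol and other antihypertensives"
label_2017 = "metoprolol and felodipine"

# The labels differ, so the 2016 and 2017 codes are not treated as identical.
print(lexical_match(label_2016, label_2017))  # False

# Hypothetical variant: a change in capitalization alone also fails an
# exact match, even though the meaning is unchanged.
print(lexical_match("metoprolol and felodipine", "Metoprolol and felodipine"))  # False
```

The second call shows why lexical match is described as a first pass: a cosmetic change to a label is indistinguishable from a real change in meaning.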

This versioning strategy is applied to ATC, CHOP, and ICD-10-GM.

As a result, the DCC provides a compiled historized version of a terminology that contains the current version of the terminology and all previous versions in a single file.

For example, in the case of ATC, the DCC Terminology Service provides the following:

  • sphn_atc_2017-2016-1.ttl: RDF Turtle file containing ATC 2017 vs ATC 2016

  • sphn_atc_2018-2016-1.ttl: RDF Turtle file containing ATC 2018 vs ATC 2017 vs ATC 2016

  • sphn_atc_2019-2016-1.ttl: RDF Turtle file containing ATC 2019 vs ATC 2018 vs ATC 2017 vs ATC 2016

This pattern continues for every subsequent version, up to the current (latest) version of ATC.
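The file naming pattern can be unpacked with a small sketch that parses a file name and lists the versions the file is expected to contain. The function name `parse_historized_filename` and the regular expression are assumptions inferred from the three ATC examples; the meaning of the trailing number is not stated here, so it is kept as an uninterpreted `suffix`.

```python
import re
from typing import Dict, List, Union

# Pattern inferred from the ATC examples: sphn_<terminology>_<latest>-<earliest>-<suffix>.ttl
FILENAME_PATTERN = re.compile(
    r"sphn_(?P<terminology>[^_]+)_(?P<latest>\d{4})-(?P<earliest>\d{4})-(?P<suffix>\d+)\.ttl"
)


def parse_historized_filename(name: str) -> Dict[str, Union[str, List[int]]]:
    """Split a historized file name into its parts and list the covered years."""
    match = FILENAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    latest = int(match.group("latest"))
    earliest = int(match.group("earliest"))
    return {
        "terminology": match.group("terminology"),
        # Versions from the latest down to the earliest, inclusive.
        "versions": list(range(latest, earliest - 1, -1)),
        # Trailing number kept as-is; its meaning is not documented here.
        "suffix": match.group("suffix"),
    }


print(parse_historized_filename("sphn_atc_2018-2016-1.ttl"))
```

For the 2018 file this lists 2018, 2017, and 2016, matching the description of that file as containing ATC 2018, ATC 2017, and ATC 2016.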

Note

The DCC is exploring how to improve the versioning strategy and adopt a more sophisticated solution. Future releases of the terminologies from the Terminology Service will include these improvements.

References

Krauss P, Touré V, Gnodtke K, Crameri K, Österle S. DCC Terminology Service—An Automated CI/CD Pipeline for Converting Clinical and Biomedical Terminologies in Graph Format for the Swiss Personalized Health Network. Applied Sciences. 2021; 11(23):11311. https://doi.org/10.3390/app112311311

Unni D, Touré V, Krauss P, Crameri K, Österle S. SPHN Strategy to Unravel the Semantic Drift Between Versions of Standard Terminologies. Preprints. 2023; 2023120508. https://doi.org/10.20944/preprints202312.0508.v1