FAIRification of External Terminologies in RDF for projects

SPHN projects may need to use external standards or terminologies. These are not always provided in RDF and therefore cannot be directly integrated into the SPHN framework to fulfil the data interoperability principles.

This documentation explains how external terminologies of interest for projects, which are not provided by the DCC, can be generated in RDF and integrated in the context of SPHN.

Target audience

This document is mainly intended for project data managers who wish to provide external terminologies in RDF for integrating them in the data and facilitating the data analysis process.

Identify the terminology of interest

The first step is to identify the terminology or classification of interest.

Then, it is important to know how the terminology will be used in the context of SPHN:

  • Will the concepts from the terminology be used as values? If so, for which SPHN Concept?

  • Will the terminology be updated regularly by the providers?

Define the set of metadata of interest

The second step is to define the set of metadata that are of interest for the project. This can be the whole terminology or only a subset of information provided in the terminology.

Translate terminology content to RDF elements

Usually, terminologies provide a file containing the list of their codes together with a definition and possibly other metadata. Most of the time, vocabularies that can be used in the clinical setting are provided as Excel documents. The translation of this file into RDF can be done in many ways.

One example is to write a Python script and use the rdflib library to generate an RDF representation of the original terminology file.

The steps below show how to translate a terminology file into RDF that is usable in the context of SPHN.

For the benefit of this user guide, we will use a simple example of a vocabulary with 10 concepts arranged in a hierarchy. We will call this the Pizza vocabulary:

Pizza vocabulary

name                   is-a                   description
---------------------  ---------------------  ----------------------
pizza                                         Any pizza
cheesy pizza           pizza                  A cheesy pizza
vegetarian pizza       pizza                  A vegetarian pizza
non vegetarian pizza   pizza                  A non-vegetarian pizza
margherita pizza       cheesy pizza           A margherita pizza
four cheese pizza      cheesy pizza           A four cheese pizza
soho pizza             vegetarian pizza       A SoHo pizza
garden pizza           vegetarian pizza       A garden pizza
americana pizza        non vegetarian pizza   An Americana pizza
american hot pizza     non vegetarian pizza   An American hot pizza
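If the vocabulary is not yet available as an Excel file, the same kind of table can be built in memory for experimentation. The sketch below assumes pandas is installed and recreates only a few rows of the Pizza vocabulary; the variable name pizza_df is purely illustrative and not part of the original script.

```python
import pandas as pd

# A few rows of the Pizza vocabulary, using the same column names as the table
pizza_df = pd.DataFrame(
    [
        {"name": "pizza", "is-a": None, "description": "Any pizza"},
        {"name": "vegetarian pizza", "is-a": "pizza", "description": "A vegetarian pizza"},
        {"name": "garden pizza", "is-a": "vegetarian pizza", "description": "A garden pizza"},
    ],
    columns=["name", "is-a", "description"],
)

# The root concept has no parent, so its 'is-a' cell is empty
root_flags = pizza_df["is-a"].isna().tolist()
print(root_flags)
```

A DataFrame built this way has the same shape as one read from a spreadsheet with these three columns, so the later steps can be tried out without the Excel file.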

Define the terminology namespace

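Make sure to define the namespace of the terminology. There are two options: reuse the namespace (link) already published by the terminology provider, if one exists, or define a new namespace with its own prefix for the terminology.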

For our example Pizza vocabulary, we will choose https://biomedit.ch/rdf/sphn-resource/pizza/ as the terminology namespace.

The code snippet below shows how to start writing your own Python script to generate a terminology. In this case, the code imports the required libraries and sets the namespace for the RDF representation of the Pizza vocabulary.

import stringcase
import pandas as pd

from rdflib import Graph, Literal, term
from rdflib.namespace import RDFS, RDF, OWL

# Set namespace
NAMESPACE = "https://biomedit.ch/rdf/sphn-resource/pizza/"

Create the unique classes

Once the namespace is defined, if you are reusing the terminology link, you can simply create the subjects and objects in RDF that correspond to the codes. Otherwise, generate each class with the defined prefix and add the unique code to the URI.

As a best practice, associate at least a label (rdfs:label) and a description (rdfs:comment) with each class that is created, both of which should be obtained from the terminology.

The code snippet below is a continuation of the code in the previous section. In this case, the code does the following:

  • uses rdflib to create an empty RDF graph

  • declares that the graph is an ontology

  • sets the ontology version

  • reads an Excel file that contains the Pizza vocabulary into a pandas DataFrame

  • iterates over the DataFrame and creates a new class for each concept in the vocabulary

NAMESPACE_URI = term.URIRef(NAMESPACE)
VERSION_URI = NAMESPACE_URI + "2022" + "/" + "1"

# Create rdflib graph
graph = Graph()

# Declare graph as an ontology
graph.add((NAMESPACE_URI, RDF.type, OWL.Ontology))

# Set version of the ontology
graph.add((NAMESPACE_URI, OWL.versionIRI, VERSION_URI))

# Bind standard namespaces to graph
graph.bind("rdfs", RDFS)
graph.bind("rdf", RDF)
graph.bind("owl", OWL)

# Bind pizza namespace to graph
graph.bind("pizza", NAMESPACE)

# Read the vocabulary from an excel file
df = pd.read_excel("pizza_vocabulary.xlsx")

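# Map each concept name to its URI so that parents can be looked up later
name_to_subject_map = {}

# Iterate over rows and create one class per concept
for index, row in df.iterrows():
    name = row['name']
    # Build a CamelCase local name, e.g. 'american hot pizza' -> 'AmericanHotPizza'
    formatted_name = stringcase.titlecase(name).replace(' ', '')
    uri = f"{NAMESPACE}{formatted_name}"
    subject = term.URIRef(uri)
    name_to_subject_map[name] = subject
    # Declare the class and attach its label and description
    graph.add((subject, RDF.type, OWL.Class))
    graph.add((subject, RDFS.label, Literal(name)))
    graph.add((subject, RDFS.comment, Literal(row['description'])))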

Define hierarchies

If the terminology provides hierarchy information, make sure to represent it in the RDF using rdfs:subClassOf statements.

This is an optional step since not all vocabularies or terminologies may provide hierarchy information.

The code snippet below is a continuation of the code in the previous section. In this case, the code does the following:

  • iterates over the DataFrame and creates a hierarchy between the concepts (since this information is already provided in the Pizza vocabulary).

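# Build concept hierarchy
for index, row in df.iterrows():
    name = row['name']
    subject = name_to_subject_map[name]
    # Rows without a parent (the root concept) have an empty 'is-a' cell
    if not pd.isna(row['is-a']):
        # Look up the parent by column label rather than by position
        parent = name_to_subject_map[row['is-a']]
        graph.add((subject, RDFS.subClassOf, parent))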

Add metadata using properties to connect elements

If the terminology provides metadata that is of interest for the project, make sure to represent it either with existing, already defined properties or by creating your own 'terminology' properties.

This is an optional step since not all vocabularies or terminologies may provide additional metadata information.

Export the newly created graph

The generated graph should be exported in an RDF-compliant format. The DCC usually provides Turtle and/or OWL/XML formats of the external terminologies.

One can do this by exporting the rdflib Graph into the appropriate format.

The code snippet below is a continuation of the code in the previous section. In this case, the code does the following:

  • opens an empty file

  • writes the contents of the RDF graph into this file in Turtle syntax

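# Export graph as RDF in TTL format
# (with rdflib 6 or later, serialize() returns a str when no destination is given)
with open('pizza_vocabulary.ttl', 'w', encoding='utf-8') as out:
    out.write(graph.serialize(format="turtle"))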

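If you inspect the newly created file pizza_vocabulary.ttl, you should see content similar to the following. Note that the ontology-level rdfs:comment shown here is not generated by the snippets above; it has to be added to the graph with an additional graph.add statement.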

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix pizza: <https://biomedit.ch/rdf/sphn-resource/pizza/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

pizza: a owl:Ontology ;
    rdfs:comment "RDF version of the Pizza vocabulary, adapted from the Pizza Ontology (https://protege.stanford.edu/ontologies/pizza/pizza.owl). The Pizza vocabulary is developed by the SPHN DCC (PHI, SIB Swiss Institute of Bioinformatics). The copyright follows instructions provided by the developers of the Pizza Ontology, and is licensed under the Creative Commons Attribution 3.0 (CC BY 3.0)" ;
    owl:versionIRI <https://biomedit.ch/rdf/sphn-resource/pizza/2022/1> .

pizza:AmericanHotPizza a owl:Class ;
    rdfs:label "american hot pizza" ;
    rdfs:comment "An American hot pizza" ;
    rdfs:subClassOf pizza:NonVegetarianPizza .

pizza:AmericanaPizza a owl:Class ;
    rdfs:label "americana pizza" ;
    rdfs:comment "An Americana pizza" ;
    rdfs:subClassOf pizza:NonVegetarianPizza .

pizza:FourCheesePizza a owl:Class ;
    rdfs:label "four cheese pizza" ;
    rdfs:comment "A four cheese pizza" ;
    rdfs:subClassOf pizza:CheesyPizza .

pizza:GardenPizza a owl:Class ;
    rdfs:label "garden pizza" ;
    rdfs:comment "A garden pizza" ;
    rdfs:subClassOf pizza:VegetarianPizza .

pizza:MargheritaPizza a owl:Class ;
    rdfs:label "margherita pizza" ;
    rdfs:comment "A margherita pizza" ;
    rdfs:subClassOf pizza:CheesyPizza .

pizza:SohoPizza a owl:Class ;
    rdfs:label "soho pizza" ;
    rdfs:comment "A SoHo pizza" ;
    rdfs:subClassOf pizza:VegetarianPizza .

pizza:CheesyPizza a owl:Class ;
    rdfs:label "cheesy pizza" ;
    rdfs:comment "A cheesy pizza" ;
    rdfs:subClassOf pizza:Pizza .

pizza:NonVegetarianPizza a owl:Class ;
    rdfs:label "non vegetarian pizza" ;
    rdfs:comment "A non-vegetarian pizza" ;
    rdfs:subClassOf pizza:Pizza .

pizza:VegetarianPizza a owl:Class ;
    rdfs:label "vegetarian pizza" ;
    rdfs:comment "A vegetarian pizza" ;
    rdfs:subClassOf pizza:Pizza .

pizza:Pizza a owl:Class ;
    rdfs:label "pizza" ;
    rdfs:comment "Any pizza" .

Hands-on example

To provide a more interactive tutorial, we have provided a Jupyter notebook that covers all previously mentioned steps in converting a Pizza vocabulary into its RDF representation.

The notebook can be found here: https://git.dcc.sib.swiss/sphn-semantic-framework/terminology-to-rdf-notebook. You can clone the repository and run the Jupyter notebook locally.

For any question or comment, please contact the SPHN Data Coordination Center (DCC) at dcc@sib.swiss.