.. _project-terminologies:

FAIRify external terminologies in RDF
=====================================

SPHN projects may need to use external standards or terminologies.
These are not always provided in RDF and therefore cannot be directly integrated into the SPHN framework to fulfil the data interoperability principles.
This page explains how external terminologies of interest for projects, which are not provided by SPHN, can be generated in RDF and integrated in the context of SPHN.

Target audience
---------------

This document is mainly intended for project data managers who wish to provide external terminologies in RDF in order to integrate them with the data and facilitate the data analysis process.

Identify the terminology of interest
------------------------------------

The first step is to identify the terminology or classification of interest.
Then, it is important to know how the terminology will be used in the context of SPHN:

- Will the concepts from the terminology be used for meaning binding? If so, for which SPHN concepts?
- Will the concepts from the terminology be used as values? If so, for which SPHN composedOfs?
- Will the terminology be updated regularly by the providers?

Define metadata of interest
---------------------------

The second step is to define the set of metadata that are of interest for the project.
This can be the whole terminology or only a subset of the information provided in the terminology.

Translate terminology to RDF
----------------------------

Usually, terminologies are provided as a file containing the list of codes together with a definition and possibly other metadata.
Most of the time, vocabularies that can be used in the clinical setting are provided as CSV, TSV, or Excel files.
The translation of such a file into RDF can be done in many ways.
If done properly, a terminology of your choice can be adapted such that it is both human and machine readable and conforms to the FAIR principles.
That means projects make a conscious effort to:

- translate the terms in the terminology into concept classes
- represent hierarchy via parent-child relationships
- represent definitions, synonyms, and mappings if available

Projects then provide this translated version for the project and others to use.

.. note::
   Since such an effort typically means you are redistributing the terminology, be sure to check that the license of the terminology is permissive enough to allow redistribution.

Projects can take any approach to translate terminologies into RDF.
One approach is to write a Python script and use the `rdflib `_ library to generate an RDF representation of the original terminology file.

In some cases, it might be useful (and even important) to take into consideration the versioning of terms from a terminology.
See :ref:`versioning-of-terminologies` for a quick introduction to the motivation behind versioning of terminologies and how it is implemented by SPHN.

The steps below highlight how to go about translating a terminology file into RDF that is usable in the context of SPHN:

- :ref:`Define the terminology namespace`
- :ref:`Create the unique classes from the terminology`
- :ref:`Define hierarchies (optional)`
- :ref:`Add metadata using properties to connect elements (optional)`
- :ref:`Export the newly created graph`

For the benefit of this user guide, we will use a simple example of a vocabulary with 10 concepts arranged in a hierarchy.
We will call this the Pizza vocabulary:

.. list-table:: Pizza vocabulary
   :widths: 25 25 25
   :header-rows: 1

   * - name
     - is-a
     - description
   * - pizza
     -
     - Any pizza
   * - cheesy pizza
     - pizza
     - A cheesy pizza
   * - vegetarian pizza
     - pizza
     - A vegetarian pizza
   * - non vegetarian pizza
     - pizza
     - A non-vegetarian pizza
   * - margherita pizza
     - cheesy pizza
     - A margherita pizza
   * - four cheese pizza
     - cheesy pizza
     - A four cheese pizza
   * - soho pizza
     - vegetarian pizza
     - A SoHo pizza
   * - garden pizza
     - vegetarian pizza
     - A garden pizza
   * - americana pizza
     - non vegetarian pizza
     - An Americana pizza
   * - american hot pizza
     - non vegetarian pizza
     - An American hot pizza

Define the terminology namespace
********************************

Make sure to define the namespace of the terminology. There are two options:

* Either the terminology already provides links to its codes which are accessible through the web (e.g. SNOMED CT, ATC and LOINC).
  For example, http://snomed.info/id/29836001 leads to the ``Hip region structure (body structure)`` code.
* Or you can generate a ‘biomedit’ compliant namespace starting with https://biomedit.ch/rdf/sphn-resource/ followed by the name of the terminology
  (for example, in the case of LOINC, SPHN has defined the following namespace: https://biomedit.ch/rdf/sphn-resource/loinc/).

For our example Pizza vocabulary, we will choose ``https://biomedit.ch/rdf/sphn-resource/pizza/`` as the terminology namespace.

The code snippet below shows how to start writing your own Python script to generate a terminology.
In this case, the code demonstrates how to generate an RDF representation of the Pizza vocabulary.

.. code-block:: python

    import stringcase
    import pandas as pd
    from rdflib import Graph, Literal, term
    from rdflib.namespace import RDFS, RDF, OWL

    # Set namespace
    NAMESPACE = "https://biomedit.ch/rdf/sphn-resource/pizza/"

Create a root class (optional)
******************************

If the terminology does not provide an element or term to group the full content of the terminology,
it will be necessary to create a root class to group all classes that will be generated.
This root class should be generated with the prefix defined and ideally be named after the terminology.
For instance, a root class for the Pizza vocabulary can be ``https://biomedit.ch/rdf/sphn-resource/pizza/Pizza``.

In the context of SPHN, several terminologies have been translated into RDF for which root nodes needed to be defined since the terminology did not have one.
For instance, the following IRIs are root nodes created for such cases:

- ``https://biomedit.ch/rdf/sphn-resource/chop/CHOP`` (for CHOP, where the last part "CHOP" corresponds to the class created for grouping all CHOP codes under the same class)
- ``https://biomedit.ch/rdf/sphn-resource/emdn/EMDN`` (for EMDN)
- ``https://biomedit.ch/rdf/sphn-resource/ucum/UCUM`` (for UCUM)
- ...

.. note::
   ATC has links to its codes which are accessible through the web.
   Therefore, all ATC codes are referenced with https://www.whocc.no/atc_ddd_index/?code= in SPHN.
   Nevertheless, ATC does not provide a root node.
   Hence, in the case of SPHN, we have defined a root node as follows: https://biomedit.ch/rdf/sphn-resource/atc/ATC.
   All the codes with the 'whocc.no' IRI are listed as subclasses (either directly or indirectly, depending on their level in the hierarchy) of that root node with the 'biomedit.ch' IRI.

Create the unique classes
*************************

Once the namespace is defined, if reusing the terminology link, you can simply create the subjects and objects in RDF that correspond to the codes.
Otherwise, generate each class with the prefix defined and add the unique code in the URI.
As a best practice, associate at least a label (``rdfs:label``) and a description (``rdfs:comment``) with each class that is created, both of which should be obtained from the terminology.

The code snippet below is a continuation of the code in the previous section.
In this case, the code does the following:

- creates an empty RDF graph with rdflib
- declares that the graph is an ontology
- sets the ontology version
- reads an Excel file that contains the Pizza vocabulary into a pandas DataFrame
- iterates over the DataFrame and creates a new class for each concept in the vocabulary

.. code-block:: python

    NAMESPACE_URI = term.URIRef(NAMESPACE)
    # Version IRI of the ontology: <namespace>2022/1
    VERSION_URI = term.URIRef(f"{NAMESPACE}2022/1")

    # Create rdflib graph
    graph = Graph()

    # Declare graph as an ontology
    graph.add((NAMESPACE_URI, RDF.type, OWL.Ontology))
    # Set version of the ontology
    graph.add((NAMESPACE_URI, OWL.versionIRI, VERSION_URI))

    # Bind standard namespaces to graph
    graph.bind("rdfs", RDFS)
    graph.bind("rdf", RDF)
    graph.bind("owl", OWL)
    # Bind pizza namespace to graph
    graph.bind("pizza", NAMESPACE)

    # Read the vocabulary from an Excel file
    df = pd.read_excel("pizza_vocabulary.xlsx")

    # Map each concept name to its class IRI, for reuse when building the hierarchy
    name_to_subject_map = {}

    # Iterate over rows and create one class per concept
    for _, row in df.iterrows():
        name = row['name']
        # e.g. "american hot pizza" becomes "AmericanHotPizza"
        formatted_name = stringcase.titlecase(name).replace(' ', '')
        uri = f"{NAMESPACE}{formatted_name}"
        subject = term.URIRef(uri)
        name_to_subject_map[name] = subject
        graph.add((subject, RDF.type, OWL.Class))
        graph.add((subject, RDFS.label, Literal(name)))
        graph.add((subject, RDFS.comment, Literal(row['description'])))

Define hierarchies
******************

If hierarchy information is provided in the terminology, make sure to represent it in the RDF using ``rdfs:subClassOf`` statements.
This is an optional step since not all vocabularies or terminologies provide hierarchy information.

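
Before building the hierarchy, it can help to check that the parent column is consistent, since a parent that does not appear in the ``name`` column would make the lookup of the parent class fail.
The sketch below is not part of the SPHN tooling: it uses only the Python standard library, and the helper name ``find_hierarchy_problems`` and the inlined rows are hypothetical, chosen for illustration.
It assumes the vocabulary has the ``name`` and ``is-a`` columns of the Pizza vocabulary table.

.. code-block:: python

    def find_hierarchy_problems(rows):
        """Return a list of human-readable problems found in the is-a column.

        `rows` is an iterable of dicts with the keys "name" and "is-a";
        an empty or missing "is-a" value marks a root concept.
        """
        rows = list(rows)
        names = [row["name"] for row in rows]
        known = set(names)
        problems = []

        # Each concept name must be unique, otherwise class IRIs would collide
        seen = set()
        for name in names:
            if name in seen:
                problems.append(f"duplicate name: {name!r}")
            seen.add(name)

        # Every parent must itself be a concept of the vocabulary
        for row in rows:
            parent = row.get("is-a") or ""
            if parent and parent not in known:
                problems.append(f"unknown parent {parent!r} for {row['name']!r}")
            if parent == row["name"]:
                problems.append(f"{row['name']!r} is its own parent")

        return problems


    # Hypothetical excerpt of the Pizza vocabulary, with one deliberate error
    rows = [
        {"name": "pizza", "is-a": ""},
        {"name": "vegetarian pizza", "is-a": "pizza"},
        {"name": "garden pizza", "is-a": "vegetarian pizza"},
        {"name": "soho pizza", "is-a": "vegan pizza"},  # parent is not defined
    ]

    for problem in find_hierarchy_problems(rows):
        print(problem)

With pandas, the rows could be obtained from the DataFrame, for example with ``df.fillna("").to_dict("records")``, so that empty ``is-a`` cells are treated as root concepts.
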
The code snippet below is a continuation of the code in the previous section.
In this case, the code does the following:

- iterates over the DataFrame and creates a hierarchy between the concepts (since this information is already provided in the Pizza vocabulary).

.. code-block:: python

    # Build concept hierarchy
    for _, row in df.iterrows():
        name = row['name']
        subject = name_to_subject_map[name]
        # Root concepts have an empty 'is-a' cell, which pandas reads as NaN
        if not pd.isna(row['is-a']):
            parent = name_to_subject_map[row['is-a']]
            graph.add((subject, RDFS.subClassOf, parent))

Add metadata using properties to connect elements
*************************************************

If the terminology provides metadata that is of interest for the project, make sure to represent it using either existing properties or by creating your own ‘terminology’ properties.
This is an optional step since not all vocabularies or terminologies provide additional metadata.

Add a copyright statement
*************************

When generating an RDF version of an external terminology, make sure to comply with the copyright and usage statements provided by the original terminology developers.
The created RDF file must also contain a copyright statement with information about:

- the provider of the RDF file
- possibly the copyright statement that applies to its content

For the Pizza vocabulary, the following shows an example of how to add a copyright statement:

.. code-block:: python

    # Adjacent string literals are concatenated, so each one ends with a space
    COPYRIGHT = Literal(
        "RDF version of the Pizza vocabulary, adapted from the Pizza Ontology (https://protege.stanford.edu/ontologies/pizza/pizza.owl). "
        "The Pizza vocabulary is developed by SPHN (SIB Swiss Institute of Bioinformatics). "
        "The copyright follows instructions provided by the developers of the Pizza Ontology, "
        "and is licensed under the Creative Commons Attribution 3.0 (CC BY 3.0)"
    )
    graph.add((NAMESPACE_URI, RDFS.comment, COPYRIGHT))

Export the newly created graph
******************************

The generated graph should be exported in an RDF-compliant format.
SPHN usually provides Turtle and/or OWL/XML formats of the external terminologies.
One can do this by exporting the rdflib Graph into the appropriate format.

The code snippet below is a continuation of the code in the previous section.
In this case, the code does the following:

- opens an empty file
- writes the contents of the RDF graph into this file in Turtle syntax

.. code-block:: python

    # Export graph as RDF in TTL format
    # (serialize() returns a str in rdflib 6 and later)
    with open('pizza_vocabulary.ttl', 'w', encoding='utf-8') as out:
        out.write(graph.serialize(format="turtle"))

If you inspect the newly created file ``pizza_vocabulary.ttl``, you should see the following:

.. code-block::

    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix pizza: <https://biomedit.ch/rdf/sphn-resource/pizza/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    pizza: a owl:Ontology ;
        rdfs:comment "RDF version of the Pizza vocabulary, adapted from the Pizza Ontology (https://protege.stanford.edu/ontologies/pizza/pizza.owl). The Pizza vocabulary is developed by SPHN (SIB Swiss Institute of Bioinformatics). The copyright follows instructions provided by the developers of the Pizza Ontology, and is licensed under the Creative Commons Attribution 3.0 (CC BY 3.0)" ;
        owl:versionIRI <https://biomedit.ch/rdf/sphn-resource/pizza/2022/1> .

    pizza:AmericanHotPizza a owl:Class ;
        rdfs:label "american hot pizza" ;
        rdfs:comment "An American hot pizza" ;
        rdfs:subClassOf pizza:NonVegetarianPizza .

    pizza:AmericanaPizza a owl:Class ;
        rdfs:label "americana pizza" ;
        rdfs:comment "An Americana pizza" ;
        rdfs:subClassOf pizza:NonVegetarianPizza .

    pizza:FourCheesePizza a owl:Class ;
        rdfs:label "four cheese pizza" ;
        rdfs:comment "A four cheese pizza" ;
        rdfs:subClassOf pizza:CheesyPizza .

    pizza:GardenPizza a owl:Class ;
        rdfs:label "garden pizza" ;
        rdfs:comment "A garden pizza" ;
        rdfs:subClassOf pizza:VegetarianPizza .

    pizza:MargheritaPizza a owl:Class ;
        rdfs:label "margherita pizza" ;
        rdfs:comment "A margherita pizza" ;
        rdfs:subClassOf pizza:CheesyPizza .

    pizza:SohoPizza a owl:Class ;
        rdfs:label "soho pizza" ;
        rdfs:comment "A SoHo pizza" ;
        rdfs:subClassOf pizza:VegetarianPizza .

    pizza:CheesyPizza a owl:Class ;
        rdfs:label "cheesy pizza" ;
        rdfs:comment "A cheesy pizza" ;
        rdfs:subClassOf pizza:Pizza .

    pizza:NonVegetarianPizza a owl:Class ;
        rdfs:label "non vegetarian pizza" ;
        rdfs:comment "A non-vegetarian pizza" ;
        rdfs:subClassOf pizza:Pizza .

    pizza:VegetarianPizza a owl:Class ;
        rdfs:label "vegetarian pizza" ;
        rdfs:comment "A vegetarian pizza" ;
        rdfs:subClassOf pizza:Pizza .

    pizza:Pizza a owl:Class ;
        rdfs:label "pizza" ;
        rdfs:comment "Any pizza" .

Hands on example
----------------

For a more interactive tutorial, we provide a Jupyter notebook that covers all previously mentioned steps in converting the Pizza vocabulary into its RDF representation.
The notebook can be found `here `_.
You can clone the repository and run the Jupyter notebook locally.

Another example of translating a small subset of ATC into RDF can be found `here `_.

For any question or comment, please contact the SPHN FAIR Data Team at fair-data-team@sib.swiss.