.. _userguide-project-schema:

Generate a project-specific RDF Schema
=======================================

Target Audience
---------------

This document is intended for project data managers and researchers interested in generating their project-specific RDF Schema. It gives guidance on how to create a project-specific RDF Schema from a Dataset Template.

Introduction
------------

A SPHN project can extend existing (SPHN) concepts and create new concepts (referred to as semantics in the following paragraphs) to fit its needs. Note that under no circumstances can a project modify **existing** content of the SPHN Dataset.

Extending the semantics for project-specific needs implies that the project must generate its own project-specific RDF Schema. The project shares this project-specific RDF Schema with data providers to get data compliant with the new schema.

The project-specific RDF Schema always extends the content (i.e. semantics) defined in the SPHN RDF Schema. There are two ways for a project to extend the SPHN semantics and produce its RDF Schema (see Figure 1):

.. image:: ../images/user-guide/templates/templates-options.*
   :height: 300
   :align: center
   :alt: Template options

**Figure 1: The two options to generate a project-specific RDF Schema.**

**Option 1**: starting from the SPHN Dataset Template (Excel file with the content of the SPHN Dataset) provided by SPHN, the project extends the file with the semantics it needs. The project then passes the modified SPHN Dataset Template (which becomes the project-specific Dataset) as input to the SPHN Schema Forge to produce the project-specific RDF Schema automatically.

**Option 2**: the project defines its semantics and directly edits the SPHN RDF Schema with any editor of its choice that is compliant with Semantic Web technologies (e.g. Protégé), to produce the project-specific RDF Schema manually.


Procedure to update the semantics
---------------------------------

In both options, the procedure for updating the semantics is the same. Figure 2 shows the process that must be followed when using and modifying the content of the SPHN Dataset (semantics) to fit the project-specific needs. The project can reuse existing SPHN concepts, extend SPHN concepts or create new concepts. Modified or newly built concepts may be integrated into the SPHN Dataset in the future. SPHN projects are required to design a new concept or modify an existing concept according to the :ref:`guiding-principles-concept-design`.

.. image:: ../images/user-guide/process-pillar1.png
   :align: center
   :alt: Process

**Figure 2: Process on how to use and modify the SPHN Dataset for the project-specific needs.**

The extension or modification of existing SPHN concepts can result in additional composedOfs, an alternative semantic standard that needs to be added, or a required extension of an existing value set. There are various reasons calling for extensions, e.g. implementation of a new standard in the applicable jurisdiction, change in availability of biomedical data, new needs of research projects, or expanded medical knowledge.

.. note::

   Three SPHN concepts have a special meaning in the processing: ``Subject Pseudo Identifier``, ``Administrative Case`` and ``Source System``. Any extension or modification of these concepts might result in invalid pipelines. Please inform SPHN (fair-data-team@sib.swiss) if you want to modify these concepts.

It may happen that you find the concept in the SPHN Dataset for the data you need, but a piece of information is missing. For example, you need to know the location where a specific measurement is taken, e.g. ``Body Temperature Measurement``. However, this is not defined in the SPHN concept of Body Temperature Measurement.
In this case you can extend the SPHN concept with the additional composedOf in your project-specific Dataset.

.. note::

   If you create an extension for your project, please submit a corresponding change request to fair-data-team@sib.swiss. A change request template is available on https://git.dcc.sib.swiss/sphn-semantic-framework/sphn-schema/-/tree/master/templates. The extension might be relevant to other projects. SPHN can coordinate an extension to the SPHN Dataset if needed.

Example of semantic extension
*****************************

.. list-table:: Table 1. Example of concept ``Body Temperature Measurement`` extended by composedOf location.
   :widths: 20 20 40 20
   :header-rows: 1

   * -
     -
     - description
     - type
   * - **concept**
     - **Body Temperature Measurement**
     - **body temperature of the individual**
     -
   * - composedOf
     - result
     - measured temperature
     - Body Temperature
   * - composedOf
     - start datetime
     - start datetime of the measurement
     - temporal
   * - composedOf
     - end datetime
     - end datetime of the measurement
     - temporal
   * - composedOf
     - body site
     - body site of the measurement
     - Body Site
   * - composedOf
     - method code
     - method code used to measure the temperature
     - Code
   * - composedOf
     - medical device
     - medical device used to measure the temperature
     - Medical Device
   * - composedOf
     - performer
     - performer of the measurement
     - Performer
   * - **composedOf**
     - **location**
     - **location of the measurement**
     - **Location**

For the example above, one possible next step would be to define your value set or subset for the new composedOf. Location, as an SPHN concept, already holds a ``type code`` property with a value restriction. If you choose to further restrict the existing value set of the Location type code, so that in the context of a Body Temperature Measurement it can only be 'Hospital', you are allowed to do so.

.. list-table:: Table 2.
   Example of value set restriction applied to the Location used in ``Body Temperature Measurement``
   :widths: 10 10 25 20 35
   :header-rows: 1

   * -
     -
     - description
     - type
     - value set or subset
   * - composedOf
     - location
     - location of the measurement
     - Location
     - type code restricted to: 276339004 \|Environment (environment)\|

.. note::

   The notation used in the value set is explained in :ref:`value-set-subset-scenarios`.

The semantics to be integrated in the project must be defined before moving on to the technical implementation detailed below, where either of the two options *in fine* produces the project-specific RDF Schema.

.. _option1-dataset-template:

Option 1: Produce an RDF Schema from the SPHN Dataset Template
---------------------------------------------------------------

The `Dataset Template `_ is provided as an Excel sheet to be modified by projects to extend the SPHN Dataset according to their needs. Definitions of terms used in the Dataset (i.e., concept, composedOf) can be found in the ``Guideline`` sheet of the Dataset Template and in :ref:`framework-sphn-dataset`.

Once the Dataset Template Excel file is opened, do the following:

.. _option1-add-project-metadata:

1. Add project's metadata
*************************

Select the sheet ``Metadata`` and add the following information below the already filled SPHN metadata line:

* **prefix**: define the prefix that will be used in your project
* **title**: provide a short title about the dataset
* **description**: provide a short description of the content of the dataset
* **version**: the version of the dataset you are building.
  It should be in the form of ``.``
* **prior version**: if any, provide the previous version of the dataset
* **copyright**: provide information about the copyright of the dataset
* **license**: provide the IRI of the license under which the content of the dataset and the schema is released
* **canonical_iri**: provide the full canonical IRI of the dataset that will be created
* **versioned_iri**: provide the versioned IRI of the dataset that will be created. It should match the version information provided in ``version``.

Example
!!!!!!!

A project called "Genotech" that wants to fill the Dataset Template starts by providing its metadata:

.. list-table:: Metadata of the Genotech project
   :widths: auto
   :header-rows: 1

   * - prefix
     - title
     - description
     - version
     - prior version
     - copyright
     - license
     - canonical_iri
     - versioned_iri
   * - genotech
     - The Genotech project Dataset
     - The Dataset of the Genotech project, based on the SPHN Dataset 2024.1
     - 2024.1
     -
     - © Copyright 2024, Genotech Institute
     - https://creativecommons.org/licenses/by/4.0/
     - https://www.biomedit.ch/rdf/sphn-schema/genotech#
     - https://www.biomedit.ch/rdf/sphn-schema/genotech/2024/1

.. note::

   The Genotech project builds a Dataset for the first time, therefore the 'prior version' field is left empty.

2. Add information about coding system
**************************************

SPHN provides information about terminologies, standards, vocabularies, and ontologies - henceforth collectively referred to as "coding systems" - that can be used in the SPHN Dataset for representing particular values with codes from the coding systems. This information is given in the ``Coding System and Version`` sheet. Some of these coding systems are provided in RDF, either by the original provider of the coding system or by SPHN, while others are not. In the latter case, codes from such coding systems are represented as instances of the ``Code`` concept in the data.


Since 2023.3 release of the SPHN Dataset Template, the ``Coding System and Version`` sheet has been updated to incorporate additional information about coding systems used in SPHN **and** SPHN projects.

.. note::

   This sheet is also updated in the SPHN Dataset 2024.2 release.

The intention for this change was to:

- clarify and differentiate which coding systems are used and/or provided in SPHN and SPHN projects,
- facilitate the import of coding systems in RDF by the :ref:`Dataset2RDF `

To that end, a project must update the ``Coding System and Version`` sheet to integrate information about supported and used coding systems in their projects independent of whether or not they are provided in RDF. If your project mentions a coding system that the SPHN Dataset does not, please add it to the list. If you are providing a coding system in an RDF version that SPHN is not, please add it to the list and fill the appropriate columns.

Following are the columns from the ``Coding System and Version`` sheet that can be populated:

* **short name**: common abbreviation of the coding system
* **full name**: full name or title of the coding system
* **coding system and version**: short name of the coding system followed by a pattern that represents the way the coding system is versioned by the provider
* **example**: example of an existing version of the coding system (that conforms to the pattern expressed in the 'coding system and version' column)
* **provided in RDF (yes/no)**: indicate whether the coding system is provided by the project in RDF
* **downloadable in RDF (yes/no)**: indicate whether the coding system is downloadable in RDF from any location on the web (typically from the original provider)
* **provided by**: the name of the project (should be the same as the prefix written in :ref:`option1-add-project-metadata`) to indicate that this coding system is provided/used in the project (i.e.
  the coding system is not provided/used in the SPHN Dataset and is specifically needed for the project)
* **prefix**: prefix of the coding system, typically corresponds to the 'short name' from the 'short name' column
* **root node**: indicate the root node that will be used to group all concepts from the coding system in RDF. The coding system may have multiple root nodes, in which case list them all separated by a semi-colon
* **canonical iri**: IRI of the codes taken from the coding system or defined by the project (it can be a ``biomedit.ch``-based IRI if the coding system does not have a web-resolvable IRI for its codes)
* **resource prefix**: if applicable, a specific resource node can be created to group all codes (including the root node) under a resource node. For this resource node, a specific prefix must be given
* **resource iri**: if applicable, the IRI for the resource node. The IRI must be of the form ``https://biomedit.ch/rdf/sphn-resource/...``. This column goes hand in hand with the 'resource prefix' column
* **versioned iri**: the versioned IRI of the coding system. This IRI is used to import the coding system in the RDF schema

Examples
!!!!!!!!

ATC - provided in RDF by SPHN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In SPHN, the ATC coding system is actively being used and provided in RDF by SPHN. ATC codes have dereferenceable links, which are encoded via the 'canonical iri' column. However, a root node is created in order to group all ATC codes under the same parent. This root node is defined as ``ATC`` and uses the IRI from the 'resource iri' column.

Information about ATC is provided as follows:

.. list-table:: Information about ATC
   :widths: auto
   :header-rows: 1

   * - short name
     - full name
     - coding system and version
     - example
     - provided in RDF (yes/no)
     - downloadable in RDF (yes/no)
     - provided by
     - prefix
     - root node
     - canonical iri
     - resource prefix
     - resource iri
     - versioned iri
   * - ATC
     - Anatomical Therapeutic Chemical classification
     - ATC-[YEAR]
     - ATC-2023
     - yes
     - no
     - SPHN
     - atc
     - ATC
     - https://www.whocc.no/atc_ddd_index/?code=
     - sphn-atc
     - https://biomedit.ch/rdf/sphn-resource/atc/
     - https://biomedit.ch/rdf/sphn-resource/atc/2023/1

ORPHA - provided in RDF on the web
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ORPHA is a coding system that provides codes that represent rare diseases. Let's assume that the Genotech project wants to use the `ORPHA `_ and aims to provide the ORPHA codes in RDF. ORPHA is already listed in the SPHN Dataset Template but it is not provided in RDF by SPHN. During the investigation phase, the Genotech project members discover `ORDO `_ (Orphanet Rare Disease Ontology) which represents ORPHA codes in a structured way and compliant with Semantic Web standards. This ORDO ontology fits their needs. Therefore, the Genotech project would like to use the ORDO ontology and will provide it in RDF.

The Genotech project can then update the line containing ORPHA to add metadata about the coding system as follows:

.. list-table:: Information about ORPHA
   :widths: auto
   :header-rows: 1

   * - short name
     - full name
     - coding system and version
     - example
     - provided in RDF (yes/no)
     - downloadable in RDF (yes/no)
     - provided by
     - prefix
     - root node
     - canonical iri
     - resource prefix
     - resource iri
     - versioned iri
   * - ORPHA
     - Orphanet nomenclature of rare diseases
     - ORPHA-[YEAR]-[MONTH]
     - ORPHA-2021-07
     - yes
     - yes
     - GENOTECH
     - orpha
     - ORPHA
     - http://www.orpha.net/ORDO/Orphanet\_
     - sphn-orpha
     - https://biomedit.ch/rdf/sphn-resource/orpha/
     - https://www.orphadata.com/data/ontologies/ordo/last_version/ORDO_en_4.3.owl

.. note::

   The 'canonical iri' corresponds to the IRI used for ORPHA codes in the ORDO ontology. The 'resource iri' and 'resource prefix' are internal to the Genotech project (and defined in the context of SPHN) in order to group all the content from the ORDO ontology under the same root node **ORPHA**. The 'versioned iri' follows the way ORDO is versioned; here it corresponds to version 4.3 of the ORDO ontology.

NANDA - provided in RDF by the project
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`NANDA `_ is an example of a coding system which is neither downloadable in RDF nor provided in RDF by SPHN. The project first needs to "FAIRify" and translate the coding system into RDF as much as possible (see :ref:`project-terminologies`) before using and sharing it. Again, let's assume that the Genotech project wants to use NANDA and decides to provide it in RDF. The following metadata is encoded in the ``Coding System and Version`` sheet:

.. list-table:: Information about NANDA
   :widths: auto
   :header-rows: 1

   * - short name
     - full name
     - coding system and version
     - example
     - provided in RDF (yes/no)
     - downloadable in RDF (yes/no)
     - provided by
     - prefix
     - root node
     - canonical iri
     - resource prefix
     - resource iri
     - versioned iri
   * - NANDA
     - NANDA Nursing Diagnoses
     - NANDA-[YEAR]-[YEAR]
     - NANDA-2018-2020
     - yes
     - no
     - GENOTECH
     - nanda
     - NANDA
     - https://biomedit.ch/rdf/sphn-resource/nanda
     -
     -
     - https://biomedit.ch/rdf/sphn-resource/nanda/2018-2020

.. note::

   In this example, the 'resource prefix' and 'resource iri' columns do not need to be defined: the root node (``NANDA``) and all resources of NANDA share the same namespace, since resources from NANDA do not have a properly defined and dereferenceable link.

Coding systems' copyright
!!!!!!!!!!!!!!!!!!!!!!!!!

Before providing any coding system in RDF and eventually sharing it with data providers and/or other data users, the project is responsible for checking the applicable terms and regulations for using and sharing the coding system. Copyright information must be stated in the RDF terminology file generated/used in the context of the project. This is an important step that should not be neglected.

The use and sharing of a coding system's file in RDF in the context of a project is the responsibility of the project data manager. In the future, if SPHN integrates that coding system in the SPHN Dataset and provides it in the :ref:`framework-terminology-service`, then it will be under the responsibility of SPHN.

3. Concept definition
**********************

The next step is to go to the ``Concepts`` sheet, which already contains all concepts and composedOfs defined in SPHN.

A project is only allowed to:

* extend an existing SPHN concept with a project-specific composedOf
* create a new project-specific concept and

  * add existing SPHN composedOfs to it (reuse of existing composedOfs)
  * add project-specific composedOfs to it

A project is **not allowed to**:

* edit any existing line of an SPHN concept and its related SPHN composedOfs
* delete any line of an SPHN concept

2.1 Add a new project-specific concept
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

A project can decide to add a new concept to its dataset. A concept is an idea or notion that represents, in the context of SPHN, clinical-, health- and genomic-related elements. A concept here can be compared to the notion of "class" in other fields.

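
Because a concept plays the role of a class, its name is used in two spellings in the Dataset Template: with spaces where the concept is named for readers, and in UpperCase convention without spaces where it acts as an identifier (at the end of the IRI and in the ``parent`` column). The conversion can be sketched as follows; the function name ``to_upper_case_convention`` is hypothetical and the snippet is only an illustration, not part of any SPHN tool.

```python
def to_upper_case_convention(name: str) -> str:
    """Turn a spaced concept name into an UpperCase identifier without spaces."""
    # Capitalise the first letter of each word and keep the rest untouched
    # (so acronyms such as "RDF" stay intact), then join without spaces.
    return "".join(word[0].upper() + word[1:] for word in name.split())


print(to_upper_case_convention("Body Temperature Measurement"))
# BodyTemperatureMeasurement
```
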

To add a new concept, insert a new line at the end of the SPHN Dataset Template and fill the following columns:

* **release version**: version of the dataset when the concept is created or last modified
* **IRI**: the versioned IRI ending with the concept name in UpperCase convention (it must point to the latest version of the project-specific schema)
* **active status**: ``yes`` or ``no`` are the allowed values. You must select ``yes`` for an active concept
* **concept reference**: provide the name of the concept being created with the following notation: ``:``
* **concept or concept compositions or inherited**: indicate by selecting one of the three options if the row corresponds to a concept, a composedOf or a composedOf that is inherited from another concept. In this case, this cell must be filled with ``concept``
* **general concept name**: provide the general name of the concept which will be used in the RDF schema
* **general description**: provide the general description of the concept which will be used in the RDF schema
* **contextualized concept name**: provide the contextualized name of the concept in the particular context it is used (this information is not carried over to the RDF schema)
* **contextualized concept description**: provide the contextualized description of the concept explaining its meaning in the particular context it is used (this information is not carried over to the RDF schema)
* **parent**: provide the general concept name of the parent with the following notation: ``:``

.. note::

   * Unlike in the ``Concept reference`` column, the ``Parent`` is written in an UpperCase convention **without space**!
   * The root concept in SPHN is called ``SPHNConcept``. It contains all concepts defined in SPHN. All SPHN concepts are children of the ``SPHNConcept`` but some concepts can be children of another SPHN concept, which becomes their parent concept. The child concept must then have a more specific meaning than the parent concept.
     Similarly, all project concepts should be children of the project root concept, which is defined as: ``:Concept``. As in the SPHN Dataset, multiple levels of hierarchy can be created. Therefore, a project concept can be the child of another project (or SPHN) concept if it has a more specific meaning.

* **type**: only to be filled if there exists a parent concept which is not the general ``:Concept``
* **meaning binding**: if available, a meaning binding of the concept to a coding system can be provided to further anchor the meaning of the concept defined in the project with external resources
* **additional information**: text can be added to provide details to the reader of the dataset (this information is not carried over to the RDF schema)
* **cardinality for the concept to Administrative Case**: provide the cardinality of the concept with respect to the Administrative Case by keeping in mind the following: *how should the instance of this concept be expected to be linked to the Administrative Case?*
* **cardinality for the concept to Subject Pseudo Identifier**: provide the cardinality of the concept with respect to the Subject Pseudo Identifier by keeping in mind the following: *how should the instance of this concept be expected to be linked to the Subject Pseudo Identifier?*
* **cardinality for the concept to Source System**: provide the cardinality of the concept with respect to the Source System by keeping in mind the following: *how should the instance of this concept be expected to be linked to the Source System? Does it make sense?*

.. note::

   You can highlight a concept by making the line bold for easier reading.

2.2 Add a new project-specific composedOf
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Once a concept is created, a project can add project-specific composedOfs to this concept. A project can also add a new project-specific composedOf to an existing SPHN concept.
In the latter case, two options are possible:

* either add a new line below the SPHN concept and its SPHN composedOfs, and write the project-specific composedOf in this new line,
* or add a new line at the end of the Dataset Template, copy the line of the SPHN concept of interest into it, and add the project-specific composedOf for this SPHN concept below the copied line.

A composedOf can be considered as metadata of a concept (i.e., specific information about the concept) and can be compared to properties or attributes of a concept. The following columns in the Dataset Template can (and should whenever possible) be filled:

* **release version**: version of the dataset when the composedOf is created or last modified
* **IRI**: the versioned IRI ending with the composedOf name in lowerCase convention
* **active status**: select ``yes`` for a new composedOf added to a concept
* **concept reference**: provide the name of the concept this composedOf belongs to with the following notation: ``:``
* **concept or concept compositions or inherited**: indicate by selecting in the list either ``composedOf``, or ``inherited`` if the composedOf is inherited from another concept
* **general concept name**: provide the general name of the composedOf which will be used in the RDF schema with the following notation: ``:``
* **general description**: provide the general description of the composedOf which will be used in the RDF schema
* **contextualized concept name**: provide the contextualized name of the composedOf in the particular context it is used (this information is not carried over to the RDF schema)
* **contextualized concept description**: provide the contextualized description of the composedOf explaining its meaning in the particular context it is used (this information is not carried over to the RDF schema)
* **parent**: provide the general composedOf name of the parent with the following notation: ``:``

.. note::

   * Projects are allowed to refer to SPHN properties as parent properties when relevant. For instance, for a project defining 'has criteria code', the parent can be stated as 'hasCode', which refers to the property defined in SPHN.
   * It is important to note that unlike in the ``Concept reference``, the ``Parent`` is written in a lowerCase convention **without space**.
   * There exist two root attributes in SPHN for composedOfs: ``SPHNAttributeDatatype`` for datatype attribute composedOfs and ``SPHNAttributeObject`` for object attribute composedOfs. Similarly, the parent of a project's composedOf should point to one of the project root attributes, ``:AttributeDatatype`` or ``:AttributeObject``, when the composedOf is not a descendant of another one.

* **type**: provide the type of the composedOf (e.g., quantitative, qualitative, Code, any SPHN/project concept)
* **excluded type descendants**: list the concepts, separated with a semi-colon, which are not considered as valid types (must be done for concepts where the type is a parent concept but children are not valid types)
* **standard**: when the type of the composedOf is ``Code``, a coding system can be referenced to indicate possible values. Indicate the name of that coding system in this column
* **value set or subset**: when the type of the composedOf is ``Code`` or ``qualitative``, a set of values or subset of values can be specified in this column

.. note::

   * Indicate a subset by starting with ``descendant of:`` followed by the identifiers/values.
   * Indicate a value set by listing the values and separating them with a semi-colon ``;``.
   * The standard nomenclature to write codes from coding system is: ``: |