SPHN Dataset

Introduction

The SPHN Dataset contains atomic building blocks, called concepts, that can be used to represent biomedical data and its meaning. With these well-defined concepts, clinical information can be understood the same way across hospitals and projects. Each concept contains all the elements, called composedOfs, needed to understand it. A concept refers to recommended value sets and/or semantic standards (e.g. LOINC, SNOMED CT, ICD-10-GM, CHOP, ATC, ICD-O-3) to express the data. Additionally, SPHN concepts are composed in a modular way so that the same information is expressed in the same way even if the context is different, e.g. a substance can be the substance someone is allergic to or the active ingredient of a drug. The use of internationally recognized standards as controlled vocabulary, such as SNOMED CT and LOINC, fosters semantic interoperability not only within SPHN but also with international partners. Creating links to international ontologies additionally allows us to leverage the domain knowledge that is represented in these ontologies.

Scope of the SPHN Dataset

The SPHN Dataset includes concepts for core clinical data, such as demographic data, administrative case, diagnoses, procedures, lab results, medications, different measurements and biosample information, as well as concepts for medical specialties, such as oncology and intensive care. The following criteria were applied to include a concept in the SPHN Dataset:

  • Concepts of general importance for research (e.g. Birth Date, Administrative Gender, FOPH Diagnosis, Drug Administration Event)

  • Concepts relevant for more than one use case in SPHN (e.g. Allergy) or of high importance for other future personalized health projects (e.g. Oncological Treatment Assessment, TNM Classification)

Guiding principles for concept design

Conceptualization and semantic representation

In the SPHN Dataset, a concept is represented by the elements it is composed of (see Table 1). Each of those elements can potentially be a concept itself, and the elements are separated so that each one carries a single meaning. Each project can choose the elements of interest according to its research question, following the principle of “take what is needed” instead of “all or nothing”.

Table 1. Example of concept Oxygen Saturation in the SPHN Dataset.

| concept | Oxygen Saturation | fraction of oxygen present in the blood |
| composedOf | quantity | value and unit of the concept |
| composedOf | measurement datetime | datetime of measurement |
| composedOf | body site | body site where the concept was measured, performed or collected |
| composedOf | measurement method | measurement method of the concept |

An alternative representation would be to create single concepts such as Arterial Oxygen Saturation, Intracardiac Oxygen Saturation, etc. However, in the SPHN Dataset the information about what is measured (quantity) is separated from the information about where it is measured (body site), so that different parts of the meaning are held in different composedOf elements. This allows concepts to be reused.

The principle of reuse requires generalized descriptions, such as “value and unit of the concept”. Such general descriptions are accompanied by contextualized descriptions that explain the meaning of each composedOf to the reader (see Table 2).

Table 2. Example of concept Oxygen Saturation in the SPHN Dataset with contextualized concept names and descriptions.

| concept | Oxygen Saturation | fraction of oxygen present in the blood |
| composedOf | saturation | measured oxygen saturation, and unit |
| composedOf | measurement datetime | datetime of measurement |
| composedOf | body site | body site of measurement |
| composedOf | measurement method | method of measurement |

Controlled vocabulary

International controlled vocabularies provide unambiguous semantic meaning for concepts. Linking the SPHN concepts to such controlled vocabularies, also referred to as meaning binding, provides expressions that are both machine-readable and human-readable. Meaning binding is not limited to a single controlled vocabulary; a concept can be encoded in several standards. For example, clinical concepts are bound to SNOMED CT concepts and/or to LOINC codes. For further guidance on meaning binding, please refer to the meaning binding section.

In addition, controlled vocabularies are used as standards for value set definitions. That means that, instead of local value set definitions such as 1=male; 2=female; 33=other; 99=unknown, the value set for the SPHN Administrative Gender concept contains the following values from the controlled vocabulary SNOMED CT: 703117000 |Masculine gender (finding)|; 703118005 |Feminine gender (finding)|; 74964007 |Other (qualifier value)|; 261665006 |Unknown (qualifier value)|. By using a standard vocabulary for value sets, the SPHN Dataset supports interoperability with datasets from other initiatives or organizations that use a controlled vocabulary, such as SNOMED CT.
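To make the idea concrete, the following minimal Python sketch shows how a source system might translate local codes into the SNOMED CT value set listed above. The SNOMED CT codes and terms are taken from the text; the local coding, the pairing between local and SNOMED CT codes, and the names LOCAL_TO_SNOMED and to_snomed are assumptions made for illustration only.

```python
# SNOMED CT codes and terms copied from the Administrative Gender example above.
ADMINISTRATIVE_GENDER_VALUE_SET = {
    "703117000": "Masculine gender (finding)",
    "703118005": "Feminine gender (finding)",
    "74964007": "Other (qualifier value)",
    "261665006": "Unknown (qualifier value)",
}

# Hypothetical local coding of one source system; the pairing with
# SNOMED CT codes is an assumption, not part of the SPHN Dataset.
LOCAL_TO_SNOMED = {1: "703117000", 2: "703118005", 33: "74964007", 99: "261665006"}


def to_snomed(local_code: int) -> str:
    """Translate a local code into a SNOMED CT code from the value set."""
    snomed_code = LOCAL_TO_SNOMED.get(local_code)
    if snomed_code is None or snomed_code not in ADMINISTRATIVE_GENDER_VALUE_SET:
        raise ValueError(f"local code {local_code!r} has no value-set mapping")
    return snomed_code
```

Once data is expressed with the shared codes, two institutions with different local codings can be compared without knowing each other's local conventions.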

Reuse of concepts: a meaning defined only once

As the SPHN Dataset grows iteratively, it becomes obvious that certain concepts are used in several medical specialties, such as intensive care and cardiology. To prevent a single meaning from being represented twice and differently, we define each meaning (concept) only once. For example, a substance can be the substance someone is allergic to or the active ingredient of a drug. The meaning of substance itself does not change, whether it is an active ingredient of a drug or a substance causing an allergic reaction. Therefore, the meaning of Substance is represented in the Dataset only once.

Table 3. Concept Substance in the SPHN Dataset.

| | | description | type |
| concept | Substance | any matter of defined composition that has discrete existence, whose origin may be biological, mineral or chemical | |
| composedOf | code | code, name, coding system and version describing the concept | Code |
| composedOf | generic name | name of the substance, for not yet approved medications the international nonproprietary name (INN) of a substance given by the World Health Organization (WHO) | string |
| composedOf | quantity | value and unit of the concept | Quantity |

The Substance concept is reused several times in the SPHN Dataset. In the concept Drug, Substance is used twice, once as an active ingredient, and once as an inactive ingredient. In the concept Allergy, Substance is used once as the substance considered to be responsible for the allergic reaction.

Figure 1. Concept Substance and its reuse.

For concept reuse in the SPHN Dataset, it is the type that specifies which concept is reused. The illustration below shows the concept Substance and its reuse in the concepts Drug and Allergy by setting Substance as a type.

Figure 2. Concept Substance with its composedOfs and its reuse.
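The reuse of Substance by type can be sketched in Python as follows. This is only an illustration of the principle, not the SPHN implementation: the class and field names are hypothetical, the composedOf types from Table 3 are simplified to strings, and the example value "penicillin" is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Substance:
    """Defined once; composedOfs follow Table 3 (types simplified to str)."""
    code: Optional[str] = None
    generic_name: Optional[str] = None
    quantity: Optional[str] = None


@dataclass
class Drug:
    # Substance is reused twice, by setting it as the type of both composedOfs.
    active_ingredient: Optional[Substance] = None
    inactive_ingredient: Optional[Substance] = None


@dataclass
class Allergy:
    # Substance considered responsible for the allergic reaction.
    substance: Optional[Substance] = None


# Invented example value: the same Substance meaning serves both contexts.
penicillin = Substance(generic_name="penicillin")
drug = Drug(active_ingredient=penicillin)
allergy = Allergy(substance=penicillin)
```

Because Drug and Allergy both point to the single Substance definition, a change to Substance is made in one place and applies in every context.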

Semantic inheritance

Semantic inheritance is a mechanism by which a specific concept can be derived from a broader concept. The specific concept (child concept) and the broad concept (parent concept) have a hierarchical relationship and share common composedOfs. For example, a diagnosis in general has a datetime when it was recorded, and any specific diagnosis, such as an ICD-O Diagnosis, also has the datetime information of when it was recorded. Therefore, both concepts share the same composedOf, record datetime. The following graphic illustrates the hierarchical relationship between the parent concept Diagnosis and its child concepts FOPH Diagnosis, Nursing Diagnosis and ICD-O Diagnosis.

Figure 3. Concept Diagnosis and its child concepts.

In the SPHN Dataset, inheritance is represented by the child concept having the parent concept as its type, and the elements that the child concept inherits from the parent concept are marked as inherited. The child concept inherits all composedOfs from the parent concept, but it can have additional composedOfs.

Figure 4. Concept FOPH Diagnosis inheriting elements from the Diagnosis concept.
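As a rough analogy, semantic inheritance behaves like class inheritance in a programming language. The Python sketch below assumes hypothetical class and field names; record_datetime corresponds to the shared composedOf named in the text, while rank is an invented additional composedOf used only to show that a child can extend its parent.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Diagnosis:
    """Parent concept; record_datetime is the shared composedOf from the text."""
    record_datetime: Optional[str] = None


@dataclass
class FOPHDiagnosis(Diagnosis):
    """Child concept: inherits every parent composedOf and may add its own."""
    rank: Optional[str] = None  # hypothetical additional composedOf


def inherited_composed_ofs(child: type, parent: type) -> List[str]:
    """List the composedOfs of the child that come from the parent."""
    parent_names = {f.name for f in fields(parent)}
    return [f.name for f in fields(child) if f.name in parent_names]
```

For example, inherited_composed_ofs(FOPHDiagnosis, Diagnosis) lists the elements that would be marked as inherited in the child concept.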

Specification of a concept

A concept is an independent element that carries a semantic meaning by itself. Every element used in a composed concept is a concept itself. A concept can refer to a data point (e.g. Data Provider Institute), or it can be an empty container where the data points are all represented by the concept’s composition (e.g. Blood Pressure). The elements (properties) of a concept are called composedOfs. A composedOf can be based on an already defined concept, and it can carry semantic information specific to the concept it is part of.

Unique identifier

Each concept and each composedOf is identified by a unique identifier (ID). The unique ID is a 10-digit numeric identifier. A new unique ID is created for a concept if one or more of the following changes are made to it:

  • Name of the concept changes

  • Description of the concept changes

  • Meaning binding of the concept changes

A new unique ID is created for a composedOf if one or more of the following changes are made to it:

  • Name of the composedOf changes

  • Description of the composedOf changes

  • Meaning binding of the composedOf changes

  • Value set or subset of the composedOf changes
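The rules above can be sketched as a small Python check. This is a minimal illustration under stated assumptions: the dictionary keys (name, description, meaning_binding, value_set) and the function names are hypothetical, and how new IDs are actually generated is not covered.

```python
import re

# A unique ID is a 10-digit numeric identifier.
ID_PATTERN = re.compile(r"[0-9]{10}")

# Hypothetical field names for the attributes whose change triggers a new ID.
CONCEPT_ID_FIELDS = ("name", "description", "meaning_binding")
COMPOSED_OF_ID_FIELDS = CONCEPT_ID_FIELDS + ("value_set",)


def is_valid_id(identifier: str) -> bool:
    """Check that the identifier consists of exactly ten digits."""
    return ID_PATTERN.fullmatch(identifier) is not None


def needs_new_id(old: dict, new: dict, is_composed_of: bool) -> bool:
    """Return True if any attribute relevant for the unique ID has changed."""
    tracked = COMPOSED_OF_ID_FIELDS if is_composed_of else CONCEPT_ID_FIELDS
    return any(old.get(key) != new.get(key) for key in tracked)
```

Note that a changed value set only triggers a new ID for a composedOf, which is why the function needs to know which kind of element it is comparing.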

Concept name

There is a general and a contextualized concept name; for composedOfs the two can differ. The general concept name aims to provide unique and consistent naming across the complete dataset for elements that have the same meaning, independent of the context in which they are used. The contextualized concept name aims to provide a more specific name that makes the composedOf understandable within the particular concept in which it is used; it mainly serves human understanding. As an example, the contextualized composedOf name “encounter identifier” is related to the general composedOf name “identifier”.

Description

For each concept and composedOf there is a concise description in natural language. The description needs to explain the general (context independent) meaning of the concept. Since there are already very well formulated descriptions for biomedical concepts, e.g. in the UMLS Metathesaurus, existing descriptions are reused wherever possible. Abbreviations should be avoided unless they are stated in the list of abbreviations in the SPHN Dataset. Descriptions can contain examples to illustrate the meaning of the concept.

Table 4. Example of general and contextualized descriptions in the SPHN Dataset.

| general concept name | general description | contextualized concept name | contextualized description |
| identifier | unique identifier identifying the concept | encounter identifier | a unique pseudonymized encounter ID for the given data delivery/research purpose |

Semantic type

For each composedOf there is a type indicating what kind of data can be mapped to this composedOf. The following types are used:

  • string: a sequence of characters; used for free text information such as “problem” in the Problem Condition concept,

  • temporal: any datetime information; used for time points such as assessment dates, start dates or end dates; granularity can vary from seconds (e.g. timestamps of a machine) to years (e.g. if only year of birth is allowed to be shared within a project); format should be: YYYY, YYYY-MM, YYYY-MM-DD or YYYY-MM-DDThh:mm:ss,

  • quantitative: expressing a certain value of a Quantity; technical types can be integer or float,

  • qualitative: expressing a certain characteristic with a pre-defined set of options, which are not expressed with controlled vocabulary (yet).

The type can also be a concept pointing to another concept in the SPHN Dataset, e.g. Code, Body Site.
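As an illustration of the temporal type, the following Python sketch checks a value against the four formats listed above and reports its granularity. The function name temporal_granularity and the table TEMPORAL_FORMATS are hypothetical; the sketch first checks the shape of the value and then relies on datetime.strptime from the standard library to reject impossible dates.

```python
import re
from datetime import datetime

# (shape check, strptime format, label) for each format named in the text.
TEMPORAL_FORMATS = [
    (re.compile(r"[0-9]{4}"), "%Y", "YYYY"),
    (re.compile(r"[0-9]{4}-[0-9]{2}"), "%Y-%m", "YYYY-MM"),
    (re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}"), "%Y-%m-%d", "YYYY-MM-DD"),
    (
        re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"),
        "%Y-%m-%dT%H:%M:%S",
        "YYYY-MM-DDThh:mm:ss",
    ),
]


def temporal_granularity(value: str) -> str:
    """Return the label of the matching format, or raise ValueError."""
    for pattern, strptime_format, label in TEMPORAL_FORMATS:
        if pattern.fullmatch(value):
            # Raises ValueError for impossible dates such as month 13.
            datetime.strptime(value, strptime_format)
            return label
    raise ValueError(f"{value!r} does not match any allowed temporal format")
```

A project that may only share the year of birth would accept values with granularity YYYY, while machine timestamps would use the full YYYY-MM-DDThh:mm:ss form.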

Standards and value sets

“Standards” are controlled terminologies, classification systems, ontologies or other coding systems. They are to be used to represent the data in an interoperable way, i.e. they serve as semantic standards for value set definitions. Value set definitions can be broad, medium-detailed or detailed. For a broad value set definition only the standard is stated in the “standards” column and the column “value set or subset” is empty. Medium-detailed definitions refer to a substructure of a standard. Detailed definitions contain a finite set of qualitative options or codes from a controlled vocabulary. The following examples illustrate the difference between these three types of definitions.

Table 5. Example of broad definition.

| | | description | standard | value set or subset |
| concept | Unit | unit of measurement | | |
| composedOf | code | code, name, coding system and version describing the concept | UCUM | |

Table 6. Example of medium-detailed definition.

| | | description | standard | value set or subset |
| concept | Body Site | any anatomical structure, any nonspecific and anatomical site, as well as morphologic abnormalities | | |
| composedOf | code | code, name, coding system and version describing the concept | SNOMED CT | child of: 123037004 |body structure (body structure)| |

Table 7. Example of detailed definition.

| | | description | standard | value set or subset |
| concept | Care Handling | describes the relationship between the individual and care provider institute | | |
| composedOf | code | code, name, coding system and version describing the type of the concept | SNOMED CT | 394656005 |Inpatient care (regime/therapy)|; 371883000 |Outpatient procedure (procedure)|; 304903009 |Provision of day care (regime/therapy)|; 261665006 |Unknown (qualifier value)| |
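The three levels of value set definition shown in Tables 5 to 7 can be summarized in a small Python sketch. The class ValueSetDefinition and its field names are hypothetical; the standards and codes are taken from the tables above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ValueSetDefinition:
    standard: str                      # always stated
    subset_root: Optional[str] = None  # medium-detailed: substructure of the standard
    codes: Tuple[str, ...] = ()        # detailed: finite set of allowed codes

    @property
    def level(self) -> str:
        """Classify the definition as broad, medium-detailed or detailed."""
        if self.codes:
            return "detailed"
        if self.subset_root is not None:
            return "medium-detailed"
        return "broad"


# Table 5: only the standard is stated.
unit_code = ValueSetDefinition(standard="UCUM")
# Table 6: children of a SNOMED CT concept.
body_site_code = ValueSetDefinition(standard="SNOMED CT", subset_root="123037004")
# Table 7: an explicit list of SNOMED CT codes.
care_handling_code = ValueSetDefinition(
    standard="SNOMED CT",
    codes=("394656005", "371883000", "304903009", "261665006"),
)
```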

Value sets are defined for concepts of type:

Meaning binding

SPHN concepts with a clinical meaning are associated with an international standard, e.g. SNOMED CT and/or LOINC, by a so-called meaning binding. These meaning bindings support the machine readability of the concepts and allow researchers to use the clinical knowledge contained in these terminologies in their research projects.

There are several criteria to consider in meaning binding, and the following guiding principles help to understand concept selection and find meaning bindings for new concepts:

  • Fit for purpose - binding to a single LOINC or SNOMED CT code (otherwise not usable in URIs);

  • Suitability instead of completeness - no binding if there is no suitable SNOMED CT concept or LOINC code;

  • Exact fit - no binding to more or less specific terms, e.g. ICD-O Diagnosis is not bound to 439401001 |Diagnosis (observable entity)|;

  • Best fit - independent of SNOMED CT's or LOINC's hierarchy, except:

      • LOINC: no binding to panel codes;

      • SNOMED CT: attribute hierarchy codes are not to be used for concepts;

  • Avoid using the same code for different items in the SPHN Dataset.

The following example illustrates how meaning bindings are stated in the SPHN Dataset. In the example, there is a meaning binding to a SNOMED CT concept and a meaning binding to a LOINC code.

Table 8. Example of meaning binding in the SPHN Dataset.

| | | description | meaning binding SNOMED CT | meaning binding LOINC |
| concept | Problem Condition | clinical condition, problem, diagnosis, or other event, situation, issue, or clinical concept that has risen to a level of concern | 55607006 |Problem (finding)| | 44100-6 Medical problem |
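One of the guiding principles, avoiding the same code for different items, lends itself to an automated check. The Python sketch below is an assumption about how such a check could look, not a description of SPHN tooling: the function find_shared_codes and the tuple layout are hypothetical, the Problem Condition bindings come from Table 8, and "Hypothetical Item" is invented to trigger the check.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Binding = Tuple[str, str, str]  # (item name, standard, code)


def find_shared_codes(bindings: Iterable[Binding]) -> Dict[Tuple[str, str], List[str]]:
    """Return every (standard, code) pair that is bound to more than one item."""
    items_by_code = defaultdict(set)
    for item, standard, code in bindings:
        items_by_code[(standard, code)].add(item)
    return {
        key: sorted(items)
        for key, items in items_by_code.items()
        if len(items) > 1
    }


bindings = [
    ("Problem Condition", "SNOMED CT", "55607006"),  # from Table 8
    ("Problem Condition", "LOINC", "44100-6"),       # from Table 8
    ("Hypothetical Item", "SNOMED CT", "55607006"),  # invented to show a clash
]
shared = find_shared_codes(bindings)
```

An empty result means that every code is bound to at most one item in the given list.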

Development process

Requests to add a new concept and/or make changes to the SPHN Dataset must be submitted to the SPHN Data Coordination Center (DCC). The development of the new SPHN Dataset is coordinated by the DCC, in close collaboration with IT experts of the five Swiss university hospitals and domain experts of the SPHN Driver projects. One to two releases per year are published after approval by the Hospital IT strategy alignment group of SPHN.

Availability and usage rights

The SPHN Dataset is available on the SPHN website.

The SPHN Dataset is under the CC-BY-NC-SA 4.0 License.

For any question or comment, please contact the Data Coordination Center (DCC) at dcc@sib.swiss.