SPHN Dataset

Introduction

The SPHN Dataset contains atomic building blocks, called concepts, that can be used to represent biomedical data and its meaning in a way accessible to human readers, but importantly also to machines. With these well-defined concepts, clinical information can be understood unambiguously across hospitals and projects.

Each concept contains all elements to understand it, including its specific properties called “composedOfs”. A concept refers to recommended value sets and/or semantic standards (e.g. LOINC, SNOMED CT, ICD-10-GM, CHOP, ATC, ICD-O-3, HGNC, GENO, SO) to express the data.

Additionally, SPHN concepts are composed in a modular way so that the same information is expressed in the same way even if the context differs, e.g. a substance can be the substance someone is allergic to or the active ingredient of a drug. The use of internationally recognized standards as controlled vocabulary, such as SNOMED CT and LOINC, fosters semantic interoperability not only within SPHN but also with international partners. Creating links to international terminologies additionally allows us to leverage the domain knowledge represented in these terminologies.

Scope of the SPHN Dataset

The SPHN Dataset includes concepts for clinical data such as Diagnosis, Transplant and Medical Procedure, for genomic data such as Assay, Genetic Variation and Genomic Position, for microbiology data such as Microorganism Identification Lab Test and Microorganism Identification Result, and for provenance data such as Data Provider and Source System. The following criteria were applied to include a concept in the SPHN Dataset:

  • Concepts of general importance for research (e.g. Birth Date, Administrative Sex, Drug Administration Event, Software, Gene)

  • Concepts relevant for more than one use case in SPHN (e.g. Allergy, Quality Control Metric) or of high importance for other future personalized health projects (e.g. Tumor Grade Assessment, Library Preparation)

Guiding principles for concept design

Conceptualization and semantic representation

In the SPHN Dataset, a concept is represented by the elements it is composed of (see Table 1). Each of those elements can potentially be a concept itself, and the elements are separated so that each carries a single meaning. Each project can choose the elements of interest according to its research question, in the sense of “take what is needed” instead of “all or nothing”.

Table 1. Example of concept Medical Procedure in the SPHN Dataset.

| concept | Medical Procedure | invasive or non-invasive intervention performed for, with or on behalf of an individual whose purpose is to assess, improve, maintain, promote or modify health, functioning or health conditions |
| composedOf | code | coded information specifying the concept |
| composedOf | body site | anatomical site or structure associated to the concept |
| composedOf | start datetime | datetime at which the concept started |
| composedOf | end datetime | datetime at which the concept ended |
| composedOf | intent | intention for the concept |

An alternative representation would be to create single concepts such as Angiographic Procedure, Abdominal Procedure, etc. However, in the SPHN Dataset the information of what was done (procedure code, e.g. 44491008 |Fluoroscopy (procedure)|) is separated from the information where it was done (body site, e.g. 71854001 |Colon structure (body structure)|) so that different parts of the meaning are held in different composedOf elements. This separation allows concepts to be reused.
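The separation of "what was done" from "where it was done" can be sketched in Python as follows. This is only an illustration, not part of the SPHN Dataset: the class and field names are hypothetical, the optional fields are an assumption (Table 1 does not state cardinalities), and the body site is simplified to a plain code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Code:
    """Coded information: the coding system, the identifier and its name."""
    system: str       # e.g. "SNOMED CT"
    identifier: str   # e.g. "44491008"
    name: str         # e.g. "Fluoroscopy (procedure)"


@dataclass
class MedicalProcedure:
    """Each composedOf holds exactly one part of the meaning."""
    code: Code                                 # what was done
    body_site: Optional[Code] = None           # where it was done (simplified)
    start_datetime: Optional[datetime] = None
    end_datetime: Optional[datetime] = None
    intent: Optional[Code] = None


# The example codes are the ones quoted in the text above.
fluoroscopy_of_colon = MedicalProcedure(
    code=Code("SNOMED CT", "44491008", "Fluoroscopy (procedure)"),
    body_site=Code("SNOMED CT", "71854001", "Colon structure (body structure)"),
)
```

Because the body site is its own element, the same procedure code can be combined with any other body site without defining a new concept.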

The principle of reuse requires generalized descriptions, such as “coded information specifying the concept”. Such general descriptions are accompanied by contextualized descriptions that explain the meaning of each composedOf to the reader.

Table 2. Example of concept Medical Procedure in the SPHN Dataset with contextualized concept names and descriptions.

| concept | Medical Procedure | invasive or non-invasive intervention performed for, with or on behalf of an individual whose purpose is to assess, improve, maintain, promote or modify health, functioning or health conditions |
| composedOf | code | coded procedure information |
| composedOf | body site | body site where procedure was performed |
| composedOf | start datetime | datetime the procedure started |
| composedOf | end datetime | datetime the procedure ended |
| composedOf | intent | intention for the procedure |

Knowledge-centric design

SPHN adopts a knowledge-centric approach to the design of concepts. By “knowledge-centric” we mean that the concepts should be designed in a way that represents either a process or an entity:

  • Process: A process is any event (or series of events) or activity (or series of activities) that has temporal parts and operates on some input and can yield some output

  • Entity: An entity is any “thing” that is an input and/or an output to a process

When modeling something that (i) occurs over a period of time, (ii) potentially has an input, and (iii) potentially generates an output, then it would be modeled as a Process. For example, Measurement (such as Heart Rate Measurement, Blood Pressure Measurement, Oxygen Saturation Measurement), Electrocardiographic Procedure, and Adverse Event can be considered as Processes.

When modeling something that exists as an input to a Process, it would in most cases be modeled as an Entity, unless it qualifies as a Process itself. For example, Sample, Outcome, Result, and File are Entities.

Applied to concept design, one should follow a logical train of thought from past to present. Such chronological order comes naturally when describing the series of events resulting in data generation. In general, some kind of input undergoes a process or event to result in an output.

For example, let's consider a patient who undergoes an electrocardiographic procedure which results in an electrocardiogram.

Figure 1. Electrocardiographic Procedure

The figure above illustrates, at the highest level, a Process concept that has an input Entity and an output Entity. At the next level down is the Medical Procedure concept (a type of Process), which has an input Patient (a.k.a. Subject Pseudo Identifier) and an output Result. At the level below that is the Electrocardiographic Procedure concept (a type of Medical Procedure), with an input Patient and an output Electrocardiogram.
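The three levels described above can be sketched as a small Python class hierarchy. This is an illustrative sketch only; the class and attribute names are hypothetical and are not taken from the SPHN schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    """Any 'thing' that is an input and/or an output of a process."""
    label: str


@dataclass
class Process:
    """Something with temporal parts that may take an input and yield an output."""
    input_entity: Optional[Entity] = None
    output_entity: Optional[Entity] = None


class MedicalProcedure(Process):
    """A type of Process: input Patient, output Result."""


class ElectrocardiographicProcedure(MedicalProcedure):
    """A type of Medical Procedure: input Patient, output Electrocardiogram."""


ecg_procedure = ElectrocardiographicProcedure(
    input_entity=Entity("Patient"),
    output_entity=Entity("Electrocardiogram"),
)
```

Each level only narrows the meaning of the level above it; the input/output pattern itself stays the same.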

Domain independence

Data concepts used in several medical specialties, such as Body Site and Intent, or general concepts, such as Quantity, should be defined in a general manner. This allows concepts to be reused in different contexts, e.g. the reuse of the concept Quantity in Body Height and Age. The general description of the Quantity concept and its elements is universal, as shown in the table below, and therefore reuse is possible.

Table 3. Example of reusable concept Quantity in the SPHN Dataset.

| concept | Quantity | an amount or a number of something |
| composedOf | value | value of the concept |
| composedOf | unit | unit of the concept |
| composedOf | comparator | qualifier describing imprecise values |
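As a rough sketch of this reuse, the same Quantity definition can serve both Body Height and Age. The Python names, the example values and the use of ">" as a comparator are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quantity:
    """An amount or a number of something."""
    value: float
    unit: str
    comparator: Optional[str] = None  # qualifier for imprecise values (assumed: ">", "<", ...)


@dataclass
class BodyHeight:
    quantity: Quantity


@dataclass
class Age:
    quantity: Quantity


# Illustrative values; "cm" and "a" (year) are written as UCUM units.
height = BodyHeight(Quantity(172.0, "cm"))
age = Age(Quantity(90.0, "a", comparator=">"))
```

Both concepts point to one definition of Quantity, so a change to Quantity is made in a single place.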

Controlled vocabulary

International controlled vocabularies provide unambiguous semantic meaning for concepts. Linking the SPHN concepts to such controlled vocabularies, also referred to as meaning binding, provides expressions that are both machine-readable and human-readable. For the meaning binding, we do not limit ourselves to a single controlled vocabulary: a concept can be encoded in several standards. For example, clinical concepts are bound to SNOMED CT concepts and/or LOINC codes, and genomic concepts can use genomics-specific terminologies such as HGNC, GENO and SO. For further guidance on meaning binding, please refer to the Meaning binding section.

In addition, controlled vocabularies are used as standards for value set definitions. That means that, instead of local value set definitions such as 1=male; 2=female; 3=indeterminate, the value set for the SPHN Administrative Sex concept contains the following values from the controlled vocabulary SNOMED CT: 248152002 |Female (finding)|; 248153007 |Male (finding)|; 32570681000036106 |Indeterminate sex (finding)|. By using a standard vocabulary for value sets, the SPHN Dataset supports interoperability with datasets from other initiatives or organizations that also use a controlled vocabulary such as SNOMED CT.
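A data provider holding the local codes from the example could translate them with a lookup like the one below. This is a minimal sketch: the local coding 1/2/3 is the hypothetical scheme quoted above, and the function name is invented for illustration.

```python
# Local value -> (SNOMED CT identifier, name), using the codes quoted in the text.
LOCAL_TO_SNOMED = {
    1: ("248153007", "Male (finding)"),
    2: ("248152002", "Female (finding)"),
    3: ("32570681000036106", "Indeterminate sex (finding)"),
}


def to_administrative_sex(local_value: int) -> str:
    """Return the SNOMED CT expression for a local value, or raise if unmapped."""
    try:
        identifier, name = LOCAL_TO_SNOMED[local_value]
    except KeyError:
        raise ValueError(f"no SNOMED CT mapping for local value {local_value!r}") from None
    return f"{identifier} |{name}|"
```

Raising on unmapped values, rather than passing them through, keeps locally defined codes from leaking into the shared value set.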

Examples of controlled vocabularies include:

… For other standards see External resources - International standards used by SPHN

Reuse of concepts: a meaning defined only once

With the iterative growth of the SPHN Dataset, it has become obvious that certain concepts are used in several medical specialties, such as intensive care and cardiology. To prevent a single meaning from being represented twice and differently, we define each meaning (concept) only once. For example, a body site can be the body site where a heart rate was measured, where oxygen saturation was measured, or where the patient felt pain. The meaning of body site itself does not change, regardless of whether a heart rate was measured there or another measurement or procedure was performed there. Therefore, the meaning of Body Site is represented in the Dataset only once.

Table 4. Concept Body Site in the SPHN Dataset.

| | | general description | type |
| concept | Body Site | any anatomical structure, any nonspecific and anatomical site, as well as morphologic abnormalities | |
| composedOf | code | coded information specifying the concept | Code |
| composedOf | laterality | localization with respect to the side of the body | Laterality |

The Body Site concept is reused several times in the SPHN Dataset. For example, in the concept Heart Rate Measurement and in the concept Oxygen Saturation Measurement, Body Site is used to describe the body site where the heart rate or oxygen saturation was measured. In the concept Medical Procedure, it is used to indicate the body site where the procedure was performed.

Figure 2. Concept Body Site and its reuse.

For concept reuse in the SPHN Dataset, it is the type that specifies which concept is reused. The illustration below shows the concept Body Site and its reuse in the concepts Procedure and Heart Rate by setting Body Site as a type.

Figure 3. Concept Body Site with its composedOfs and its reuse.

Semantic inheritance

Semantic inheritance is a mechanism where a specific concept can be derived from a broader concept. The specific concept (child concept) and the broad concept (parent concept) have a hierarchical relationship and share common composedOfs. For example, a diagnosis in general has a datetime when it was recorded, and any specific diagnosis, such as Oncology Diagnosis, also has the datetime information of when it was recorded. Therefore, both concepts share the same composedOf record datetime. The following graphic illustrates the hierarchical relationship between the parent concept Diagnosis and its child concepts Billed Diagnosis, Nursing Diagnosis and Oncology Diagnosis.

Figure 4. Concept Diagnosis and its child concepts.

In the SPHN Dataset, inheritance is represented by the child concept having the parent concept as its type, and the elements that the child concept inherits from the parent concept are specified as inherited. The child concept inherits all composedOfs from the parent concept, but it can have additional composedOfs.

Figure 5. Concept Billed Diagnosis inheriting elements from the Diagnosis concept.
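Class inheritance in a programming language is a close analogy to this mechanism. The sketch below assumes Python dataclasses; the field `rank` on the child is a hypothetical additional composedOf chosen for illustration, not a statement about the actual Billed Diagnosis concept.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Optional


@dataclass
class Diagnosis:
    """Parent concept."""
    code: Optional[str] = None
    record_datetime: Optional[datetime] = None


@dataclass
class BilledDiagnosis(Diagnosis):
    """Child concept: inherits every composedOf and may add its own."""
    rank: Optional[str] = None  # hypothetical additional composedOf


def inherited_composed_ofs(child: type, parent: type) -> list[str]:
    """List the child's fields that come from the parent, in declaration order."""
    parent_names = {field.name for field in fields(parent)}
    return [field.name for field in fields(child) if field.name in parent_names]
```

Calling `inherited_composed_ofs(BilledDiagnosis, Diagnosis)` lists the elements that would be marked as inherited in the Dataset.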

Meaning preservation

Existing concepts in the SPHN Dataset can be adapted to project requirements. However, the meaning of the concept itself should not change. For example, the Blood Pressure Measurement concept is described as “measurement process of a blood pressure on an individual”. This concept description is still valid and does not change when a new composedOf body position is added.

Table 5. Example of concept Blood Pressure Measurement in the SPHN Dataset.

| concept | Blood Pressure Measurement | measurement process of a blood pressure on an individual |
| composedOf | result | evaluation outcome associated to the concept |
| composedOf | start datetime | datetime at which the concept started |
| composedOf | end datetime | datetime at which the concept ended |
| composedOf | method code | coded information specifying the method of the concept |
| composedOf | medical device | medical device of the concept |
| composedOf | performer | person who performs or reports the concept |
| composedOf | body site | anatomical site or structure associated to the concept |

Table 6. Example of a possible concept extension for Blood Pressure Measurement without change of concept meaning.

| concept | Blood Pressure Measurement | measurement process of a blood pressure on an individual |
| composedOf | result | evaluation outcome associated to the concept |
| composedOf | start datetime | datetime at which the concept started |
| composedOf | end datetime | datetime at which the concept ended |
| composedOf | method code | coded information specifying the method of the concept |
| composedOf | medical device | medical device of the concept |
| composedOf | performer | person who performs or reports the concept |
| composedOf | body site | anatomical site or structure associated to the concept |
| composedOf | subject body position | body position of the subject |
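One way to check an extension like the one in Tables 5 and 6 automatically is sketched below. The dictionary layout and function name are hypothetical, and the rule that no existing composedOf may be dropped is an added assumption; the text itself only requires that the concept description stays valid.

```python
def is_meaning_preserving(original: dict, extended: dict) -> bool:
    """True if the description is unchanged and no existing composedOf was dropped."""
    same_description = original["description"] == extended["description"]
    kept_all = set(original["composedOfs"]) <= set(extended["composedOfs"])
    return same_description and kept_all


blood_pressure_measurement = {
    "description": "measurement process of a blood pressure on an individual",
    "composedOfs": [
        "result", "start datetime", "end datetime", "method code",
        "medical device", "performer", "body site",
    ],
}

# The extension from Table 6: one composedOf added, description untouched.
extended_measurement = {
    **blood_pressure_measurement,
    "composedOfs": blood_pressure_measurement["composedOfs"] + ["subject body position"],
}
```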

Step-wise creation of a concept

Investigating potentially existing concepts

Before starting concept design from scratch, it is advisable to check whether and how other standards and data models implement concepts in the scope of the desired use case. Similar concepts may exist and there is no need to reinvent the wheel!

Non-exhaustive selection of resources:

Comparison with other standards and data models will facilitate design choices. Moreover, it helps to avoid using terms in ways that contradict their common use in the field, and it increases interoperability.

Rationale behind a concept

The rationale for creating a concept, i.e., its purpose and use case, should be put down in writing in detail when designing a concept. This step will support the author in grasping the necessary details required.

Describe in detail the process that the concept is supposed to represent or by which it is supposed to be used. If there are different variants of the process, all of them should be described. Consider whether some steps or aspects should be mandatory or may be required multiple times (compare section Cardinalities).

Where there are multiple design options, make the choice transparent: mention the variants that were considered but not implemented, and give the rationale for choosing one variant and disregarding another!

Such verbosity seems like additional effort at first, but it will give the design choices a solid basis and help future users (and possibly the concept creator themselves later on). Moreover, it will facilitate the identification of the individual ‘building blocks’ of a concept.

Knowledge-centric design

To create a concept compliant with SPHN standards, one should follow the knowledge-oriented (process-oriented) design of SPHN (compare section on Knowledge-centric design).

Figure 6. A process featuring input and output.

Table 7. Process layout in tabular form.

| Concept name | Description | Type |
| Process | any event (or series of events) or activity (or series of activities) that has temporal parts and operates on some input and can yield some output | |
| Input | any ‘thing’ that is an input to the process | Entity |
| Output | any ‘thing’ that is an output to the process | Entity |

The steps and components recorded when describing the rationale of the concept should be categorized accordingly and ordered to match the knowledge-centric design pattern.

General concept design considerations

  • Reusability

    • Aim for reusability when designing a concept, its properties and their potential value sets.

    • Keep in mind a general design rather than very specific needs.

    • Avoid overengineering concepts by trying to cover all potential use cases right away. Rather, implement a basic solution first, which can be extended later as the need arises, possibly in the form of inheriting child concepts.

  • Concept naming

    • The concept name should clearly reflect what the concept is trying to represent, including whether it describes an entity or a process.

    • Avoid general concept names for specific use cases! Rather, use a more specific name to avoid “blocking” a general term that could then not be used later on, e.g., do not create a “Device” concept for a device only used to test lung function, but create a “Lung Function Test Device” concept instead.

    • Abbreviations cannot be used in concept names. Spell out or paraphrase abbreviations.

  • Integration

    • Concepts of a schema are linked and often form a hierarchy of “parent” and “child” relations, in which “parent” concepts with a broader meaning sit higher than related “child” concepts with a more specialized meaning (see section Semantic inheritance).

    • It is good practice to check whether new concepts would fit into an existing hierarchy, e.g., a new specialized “Microbiology Lab Test” concept could be created as a child of an existing “Lab Test” concept.

    • Integration will keep the structure of the schema meaningful but also facilitate data queries down the road.

  • Properties of a concept (composedOfs)

    • Different properties of a concept each describe the concept!

    • Such same-level properties DO NOT describe each other!

      • Hypothetical example (see figure below): concept Administration Event with two composedOfs:

        • medication: a drug containing active and inactive substances

        • quantity: This quantity attribute describes the Administration Event (e.g. 2/d, i.e. twice per day), not the Medication!

        • To describe the quantity of medication it needs to hold its own composedOf quantity!

        • The quantity of a medication in turn would not describe the quantities of the substances within the medication. Describing the quantities of the substances requires quantity attributes linked to the substances.

Figure 7. Relationships of concept attributes.

  • The concept name should not be repeated in composedOfs (properties), e.g., a composedOf specifying the classification of a hypothetical concept Sepsis should be called classification rather than sepsis classification.

  • Several composedOfs starting with the same phrase are an indication that they should form a separate concept (or names should be cleaned), e.g., composedOfs exposure datetime, exposure duration, exposure route code, …, should be combined to a separate concept Exposure.

  • Likewise, if patterns occur multiple times or composedOfs are repeated across concepts a new concept should be created.

  • Names of composedOfs of type Code should carry the suffix “code”, e.g., a composedOf holding coded information specifying a type should be named type code rather than type.

  • Whenever possible, attributes should be represented in coded form (rather than as free text)! Free text is hardly ever interoperable (differences in spelling, varying abbreviations, different languages, …). Compare section Standards and value sets.

    Usefulness of composedOf type: code > qualitative value set >> free text
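Some of the naming rules above lend themselves to a simple automated check. The following is a minimal sketch under stated assumptions: the function names are hypothetical, matching is done on lower-cased names, and the threshold of two repeated first words is an arbitrary choice for illustration.

```python
from collections import Counter


def check_composed_of_name(concept: str, name: str, type_: str) -> list[str]:
    """Return the naming problems found for one composedOf (empty list if none)."""
    problems = []
    if concept.lower() in name.lower():
        problems.append("repeats the concept name")
    if type_ == "Code" and not name.lower().endswith("code"):
        problems.append('type Code but name lacks the suffix "code"')
    return problems


def repeated_first_words(names: list[str], threshold: int = 2) -> list[str]:
    """First words shared by several composedOfs: candidates for a separate concept."""
    counts = Counter(name.split()[0] for name in names if name.split())
    return sorted(word for word, count in counts.items() if count >= threshold)
```

For instance, `repeated_first_words(["exposure datetime", "exposure duration", "exposure route code"])` flags "exposure", matching the suggestion to create a separate concept Exposure. Such checks can only hint at problems; whether a new concept is justified remains a design decision.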

Cardinalities

Some composedOfs should be mandatory while others are optional, and some may possibly be required more than once. These needs are represented by the cardinalities of a composedOf, indicating minimal and maximal number of times a composedOf can be part of a concept. Cardinalities are represented as <minimum>..<maximum> or <minimum>:<maximum>, e.g., 0..1 or 1:n.

0 indicates that a composedOf is optional, while n represents any number:

A Body Height, for example, has one and exactly one quantity, represented by a cardinality of 1:1. The mode of its data determination is optional, but there is at most one mode, represented by a cardinality of 0:1.

A Medical Device, for example, may optionally use software, and possibly more than one piece of software, which is reflected by a cardinality of 0:n.
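The cardinality notation can be parsed and checked mechanically. The sketch below accepts both the `..` and the `:` separator described above; the names `Cardinality` and `parse_cardinality` are hypothetical and not part of any SPHN tooling.

```python
import re
from typing import NamedTuple, Optional

# <minimum>..<maximum> or <minimum>:<maximum>, where the maximum may be "n".
_CARDINALITY = re.compile(r"([0-9]+)(?:\.\.|:)([0-9]+|n)")


class Cardinality(NamedTuple):
    minimum: int
    maximum: Optional[int]  # None stands for "n" (no upper bound)

    def allows(self, count: int) -> bool:
        """Check whether a composedOf may occur `count` times."""
        if count < self.minimum:
            return False
        return self.maximum is None or count <= self.maximum


def parse_cardinality(text: str) -> Cardinality:
    match = _CARDINALITY.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a cardinality: {text!r}")
    minimum = int(match.group(1))
    maximum = None if match.group(2) == "n" else int(match.group(2))
    if maximum is not None and maximum < minimum:
        raise ValueError(f"maximum is smaller than minimum: {text!r}")
    return Cardinality(minimum, maximum)
```

With this, `parse_cardinality("1:1").allows(1)` holds for the quantity of a Body Height, while `parse_cardinality("0:n")` accepts any number of software entries for a Medical Device, including none.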

Detailed Examples

During concept design and review, one should have use cases and examples for the concept, including all its properties, in mind. This tends to be underrated or ignored because it means additional effort. However, it will help to uncover caveats and flaws in the design. The examples can be written down in tabular form and should include all properties and subproperties of a concept, e.g., for a concept with a composedOf of type Body Site, all properties of Body Site should also be represented.

In case of different variants of a process, at least one example should be sketched out for each of them.

Concept visualization

People have individual preferences as to which type of data representation is most accessible, e.g. tables or text. During concept design it is often helpful to visualize your concepts graphically:

  • concepts → represented by a graph node (e.g., Diagnosis)

  • composedOf(s): attributes / properties of the concept → represented by graph node(s) as well (e.g., code or record datetime)

  • predicate(s): description of the relation between concept and attribute or between concepts → represented as a labelled connection between concept and attribute (e.g., hasCode, or hasRecordDatetime)
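The node/predicate view above can be turned into a drawable graph with a few lines of code. This sketch emits Graphviz DOT text; the predicate naming rule (prefix "has" plus the capitalized words of the composedOf name) is inferred from the two examples hasCode and hasRecordDatetime and is an assumption, as are the function names.

```python
def concept_triples(concept: str, composed_ofs: list[str]) -> list[tuple[str, str, str]]:
    """Build (concept, predicate, composedOf) triples with hasXxx predicates."""
    triples = []
    for name in composed_ofs:
        predicate = "has" + "".join(word.capitalize() for word in name.split())
        triples.append((concept, predicate, name))
    return triples


def to_dot(triples: list[tuple[str, str, str]]) -> str:
    """Render the triples as a Graphviz DOT digraph with labelled edges."""
    lines = ["digraph concept {"]
    for subject, predicate, obj in triples:
        lines.append(f'  "{subject}" -> "{obj}" [label="{predicate}"];')
    lines.append("}")
    return "\n".join(lines)


diagnosis_graph = to_dot(concept_triples("Diagnosis", ["code", "record datetime"]))
```

The resulting text can be pasted into any Graphviz renderer to obtain a picture of the concept and its attributes.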

Once a general design has materialized, a visualization of instantiation examples can be helpful for processes of higher complexity involving multiple concepts or steps. Like the examples, this will help to unveil design flaws.

Specification of a concept

A concept is an independent element that carries a semantic meaning by itself. Every element used in a composed concept is a concept itself. A concept can refer to a data point (e.g. Department), or it can be an empty container where the data points are all represented by the concept’s composition (e.g. Blood Pressure Measurement). The elements (properties) of a concept are called composedOf. A composedOf can be based on an already defined concept and it can carry semantic information specific to the concept it is part of.

IRI - Internationalized Resource Identifier

Each concept and each composedOf is uniquely identified by an IRI. The IRI is a resolvable versioned URL (Uniform Resource Locator) pointing to a website where details of the current version of the concept or composedOf can be found.

Concept name

Each element has a general and a contextualized concept name. The two can differ for composedOfs; for concepts they are identical. The general concept name aims to provide unique and consistent naming across the complete dataset for elements that have the same meaning, independent of the context in which they are used. The contextualized concept name aims to provide a more specific name for the composedOf so that it is understandable within the particular concept; it mainly serves human understanding. As an example, the contextualized composedOf name encounter identifier is related to the general composedOf name identifier. Consider the additional name requirements highlighted in the sections Concept naming and Concept properties.

Description

For each concept and composedOf there is a concise description in natural language. The description needs to explain the general (context independent) meaning of the concept. Since there are already very well formulated descriptions for biomedical concepts, e.g. in the UMLS Metathesaurus, existing descriptions are reused wherever possible. Abbreviations should be avoided unless they are stated in the list of abbreviations in the SPHN Dataset. Descriptions can contain examples to illustrate the meaning of the concept.

Table 8. Example of general and contextualized descriptions in the SPHN Dataset.

| general concept name | general description | contextualized concept name | contextualized description |
| identifier | unique identifier identifying the concept | encounter identifier | a unique pseudonymized encounter ID for the given data delivery/research purpose |

Semantic type

For each composedOf there is a type indicating what kind of data can be mapped to this composedOf. The following types are used:

  • string: a sequence of characters; used for free text information such as “problem” in the Problem Condition concept,

  • temporal: any datetime information; used for time points such as assessment dates, start dates or end dates; granularity can vary from seconds (e.g. timestamps of a machine) to years (e.g. if only year of birth is allowed to be shared within a project); format should be: YYYY, YYYY-MM, YYYY-MM-DD or YYYY-MM-DDThh:mm:ss,

  • quantitative: expressing a certain value of a Quantity; technical types can be integer, float,

  • qualitative: expressing a certain characteristic with a pre-defined set of options, which cannot be expressed with controlled vocabulary (yet).

The type can also be a concept, i.e., it is pointing to another concept in the SPHN Dataset, e.g. Code or Body Site. Multiple types are also allowed for a composedOf separated by semicolon, e.g. type “Substance; Drug” for a composedOf which can be of either type Substance or Drug.
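Two of the rules above are easy to check in code: the allowed formats for temporal values and the semicolon-separated list of types. The following is a minimal sketch with hypothetical function names; it checks only the four formats listed above and does not handle time zones or fractional seconds.

```python
import re
from datetime import datetime

# YYYY, YYYY-MM, YYYY-MM-DD or YYYY-MM-DDThh:mm:ss
_TEMPORAL_SHAPE = re.compile(
    r"[0-9]{4}(-[0-9]{2}(-[0-9]{2}(T[0-9]{2}:[0-9]{2}:[0-9]{2})?)?)?"
)
_FORMAT_BY_LENGTH = {
    4: "%Y",
    7: "%Y-%m",
    10: "%Y-%m-%d",
    19: "%Y-%m-%dT%H:%M:%S",
}


def is_valid_temporal(value: str) -> bool:
    """Check the shape first, then let strptime reject impossible dates."""
    if _TEMPORAL_SHAPE.fullmatch(value) is None:
        return False
    try:
        datetime.strptime(value, _FORMAT_BY_LENGTH[len(value)])
    except ValueError:
        return False
    return True


def parse_types(type_cell: str) -> list[str]:
    """Split a type entry such as "Substance; Drug" into its alternatives."""
    return [part.strip() for part in type_cell.split(";") if part.strip()]
```

The shape check comes first because `strptime` alone would also accept single-digit months and days.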

Standards and value sets

  • “Standards” are controlled terminologies, classification systems, ontologies or other coding systems.

    They are to be used to represent the data in an interoperable way, i.e. they serve as semantic standards for value set definitions.

  • Value set definitions can be broad, medium-detailed or detailed.

    • For a broad value set definition only the standard is stated in the “standard” column and the column “value set or subset” is left empty. Multiple standards are entered separated by semicolon, e.g., SNOMED CT; EDQM. If codes should be allowed in addition, “or other” is added after the last standard without semicolon, e.g., SNOMED CT; ATC; NCI Thesaurus or other.

    • Medium-detailed definitions refer to a substructure of a standard. The type of standard is entered into the “standard” column, and the “root” node of the substructure is specified, preceded by “descendant of:”, e.g. “descendant of: 117259009 |Microscopy (procedure)|”

    • Detailed definitions contain a finite set of qualitative options or codes from a controlled vocabulary, separated by semicolon.

Note that detailed and medium-detailed definitions cannot be mixed!
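The three levels of detail can be told apart from the content of the “value set or subset” column alone. The sketch below assumes that entries are separated by semicolons and that medium-detailed entries start with “descendant of”, as described above; the function name is hypothetical.

```python
def classify_value_set(value_set: str) -> str:
    """Classify a "value set or subset" cell as broad, medium-detailed or detailed."""
    entries = [entry.strip() for entry in value_set.split(";") if entry.strip()]
    if not entries:
        return "broad"  # only the standard is stated, the column is left empty
    is_subtree = [entry.lower().startswith("descendant of") for entry in entries]
    if all(is_subtree):
        return "medium-detailed"
    if any(is_subtree):
        raise ValueError("detailed and medium-detailed definitions cannot be mixed")
    return "detailed"
```

Raising an error for mixed entries mirrors the rule stated in the note above.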

The following examples illustrate the difference between these three types of definitions.

Table 9. Example of broad definition.

| | | description | standard | value set or subset |
| concept | Unit | unit of measurement | | |
| composedOf | code | coded information specifying the concept | UCUM | |

Table 10. Example of medium-detailed definition.

| | | description | standard | value set or subset |
| concept | Body Site | anatomical site or structure associated to the concept | | |
| composedOf | code | coded information specifying the concept | SNOMED CT | descendant of: 123037004 |body structure (body structure)| |

Table 11. Example of detailed definition.

| | | description | standard | value set or subset |
| concept | Care Handling | describes the relationship between the individual and care provider institute | | |
| composedOf | code | coded information specifying the concept | SNOMED CT | 394656005 |Inpatient care (regime/therapy)|; 371883000 |Outpatient procedure (procedure)|; 304903009 |Provision of day care (regime/therapy)| |

Detailed value set definitions must not contain values that overlap in meaning. An overlap in meaning would be, for example, mixing information about the type of surgery with regard to a minimally invasive or open approach with the access route (access through body site) chosen by the surgeon.

Table 12. Example of value set with overlapping meaning - to be avoided.

| | | standard | value set or subset |
| composedOf | surgery type | SNOMED CT | 129236007 |Open approach - access (qualifier value)|; 103388001 |Percutaneous approach - access (qualifier value)|; 129220005 |Transaxillary approach (qualifier value)| |

Table 13. Example of value set with non-overlapping meaning - best practice.

| | | standard | value set or subset |
| composedOf | surgery access type | SNOMED CT | 129236007 |Open approach - access (qualifier value)|; 103388001 |Percutaneous approach - access (qualifier value)| |

Value sets are defined for concepts of type:

Meaning binding

SPHN concepts with clinical or other meaning are associated by a so-called meaning binding to an international standard (e.g., SNOMED CT, LOINC or other). These meaning bindings support the machine readability of the concepts and allow researchers to use the clinical knowledge contained in these terminologies in their research projects.

There are several criteria to consider in meaning binding, and the following guiding principles help to understand concept selection and find meaning bindings for new concepts:

  • Fit for purpose - binding to a single concept or code from an external terminology (otherwise not usable in URIs);

  • Suitability instead of completeness - no binding if there is no suitable concept or code from an external terminology;

  • Exact fit - no binding to more or less specific terms, e.g. ICD-O Diagnosis is not bound to 439401001 |Diagnosis (observable entity)|

  • LOINC - don’t use panel or group codes

  • SNOMED CT

    • Use procedure codes for procedure concepts, e.g. 29303009 |Electrocardiographic procedure (procedure)|

    • Use observable entity codes for observables, e.g. 397155001 |Body position (observable entity)|

  • Avoid same code for different items in the SPHN Dataset.

The following example illustrates how meaning bindings are stated in the SPHN Dataset. In the example, there is a meaning binding to a SNOMED CT concept and a meaning binding to a LOINC code.

Table 14. Example of meaning binding in the SPHN Dataset.

| | | description | meaning binding |
| concept | Problem Condition | clinical condition, problem, diagnosis, or other event, situation, issue, or clinical concept that has risen to a level of concern | SNOMED CT: 55607006 |Problem (finding)|; LOINC: 44100-6 Medical problem |
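A meaning binding entry in the form shown in Table 14 can be split into one binding per standard, and a set of such bindings can be checked against the guideline to avoid the same code for different items. The sketch below assumes the "standard: expression" layout with semicolon separators; the function names and the input dictionary structure are hypothetical.

```python
from collections import defaultdict


def parse_meaning_bindings(cell: str) -> dict[str, str]:
    """Split "SNOMED CT: ...; LOINC: ..." into {standard: expression}."""
    bindings = {}
    for entry in cell.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        standard, _, expression = entry.partition(":")
        bindings[standard.strip()] = expression.strip()
    return bindings


def find_shared_bindings(
    bindings_by_item: dict[str, dict[str, str]],
) -> dict[tuple[str, str], list[str]]:
    """Return every (standard, expression) that is bound to more than one item."""
    items_by_binding = defaultdict(list)
    for item, bindings in bindings_by_item.items():
        for standard, expression in bindings.items():
            items_by_binding[(standard, expression)].append(item)
    return {key: items for key, items in items_by_binding.items() if len(items) > 1}
```

An empty result from `find_shared_bindings` means that no code is used for more than one item.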

Development process

Requests for adding a new concept and/or making changes to the SPHN Dataset need to be submitted to the SPHN Data Coordination Center (DCC). The SPHN Dataset is being developed in collaboration with experts from the five Swiss university hospitals, SPHN National Data Stream (NDS) data managers, and clinical and genomic experts, with the DCC coordinating the development. One to two releases per year are published after approval by the Hospital IT strategy alignment group of SPHN.

Availability and usage rights

The SPHN Dataset is available on the SPHN website.

The SPHN Dataset is under the CC-BY 4.0 License.

For any question or comment, please contact the Data Coordination Center (DCC) at dcc@sib.swiss.