Provenance concepts

Overview

Information about data provenance, i.e., details on the origin of the data, gives an indication of data type and quality. It allows researchers to assess whether the data are suitable for the intended type of data science (fit for purpose).

Such metadata can span various areas of interest. The following aspects were emphasised when introducing the Provenance concepts to the SPHN Schema:

  • Who has generated or provided the data?

This concerns the concepts Data Provider, Department, and Performer.

  • Where is the data extracted from?

Such details, including the source system, its purpose, or the primary system can be represented using the concepts Source System and Healthcare Primary Information System.

  • What data is provided?

This entails details on the raw data and insights into potential processing steps, e.g., mapping or data transformation events. It is covered by the concepts Source Data and Semantic Mapping.

Concept design

Data Provider and Department

The concept Data Provider replaces the concept Data Provider Institute from release 2024.1 onwards. In addition to a coded unique identifier it offers information on the category of the data provider.

Figure 1: Design of the concepts Data Provider and Department

As hospitals or other institutions are often divided into departments or divisions, there may be more than a single point of exit for data export. The concept Data Provider alone may therefore not offer sufficient granularity. Information on a potential subdivision from which the data originates is represented by the concept Department.
The Department concept is still at an early stage and currently only provides a name as a string (release 2024.1). It has nevertheless been shaped as a concept rather than an additional composedOf of Data Provider so that it remains open for future extensions.

Performer

The concept Performer describes the type of person (patient, physician, etc.) performing the activity which has resulted in the data of interest. This includes, for example, measurements or assessments.

Figure 2: Design of Performer concept

The type of the performer is provided as a SNOMED CT code. Two types of Performer can be expressed: (1) with semantic tag “occupation”, e.g., 106292003 |Professional nurse (occupation)| or 309343006 |Physician (occupation)|, and (2) with semantic tag “person”, e.g., 86372007 |Grandchild (person)| or 394863008 |Non-family member (person)|. The Performer concept is open for extensions in the future.


Source System and Healthcare Primary Information System

The proposed concept Source System describes the category and purpose of the source system, e.g., a hospital information system for clinical routine data intended for quality control. If a coded category is not applicable, the source system description can be expressed using free text. Source System is introduced as a novel core concept in release 2024.1 of the SPHN Schema, alongside Subject Pseudo Identifier, Administrative Case, and Data Provider (formerly Data Provider Institute).

Figure 3: Design of Source System concept

The Source System concept also links to the concept Healthcare Primary Information System, which describes the primary source system of the healthcare data, e.g., a clinical laboratory information system. If a coded category is not applicable, the healthcare primary information system description can be expressed using free text.

Figure 4: Design of Healthcare Primary Information System concept


Source Data

The Source Data-concept provides information on raw data that has undergone a transformation, e.g., a mapping or coding event. Source data can be provided as string or code. A string can be in local language, for example “féminin” (mapped to 248152002 |Female (finding)|). A code may be a local code from a vendor-specific coding system. In addition, the source system from which the source data is provided can be represented.

Figure 5: Design of Source Data concept

The concept Source Data is currently used by the Semantic Mapping concept only.


Semantic Mapping Event

The concept Semantic Mapping represents information about the transformation of data elements to an (ideally standardised) code. It refers to the source data, the output of the semantic mapping, as well as the mapping method, purpose, and time point of mapping.

From a knowledge-centric perspective, a process is an event which operates on input(s) and can yield output(s). In this sense, the concept Semantic Mapping represents the process which operates on source data (input) like non-standard codes or strings and transforms it into a standardised code (output) which may for example be linked to a Result.
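
The input/output view described above can be illustrated with a minimal Python sketch. It is not part of the SPHN tooling: the table `LOCAL_TO_SNOMED` and the function `semantic_mapping` are hypothetical names, and the single table entry is the “féminin” example from the Source Data section.

```python
from typing import Optional

# Hypothetical local mapping table: source string -> SNOMED CT code.
# The only entry is the example given in the Source Data section.
LOCAL_TO_SNOMED = {
    "féminin": "248152002",  # |Female (finding)|
}


def semantic_mapping(source_string: str) -> Optional[str]:
    """Map a source string (input) to a standardised code (output).

    Returns None when the local table holds no mapping for the input.
    """
    return LOCAL_TO_SNOMED.get(source_string.strip().lower())


output_code = semantic_mapping("Féminin")
```

In this sketch the source string plays the role of the Source Data and the returned code is the output that could then be linked to, for example, a Result.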

Figure 6: Design of Semantic Mapping concept

Figure 7: Semantic Mapping in the context of knowledge-centric (process-oriented) concept design.


Example for data delivery

Figure 8: Instantiation example of Provenance concepts

Guideline for data delivery

General

General guidelines on the concept use as well as specific guidelines for individual composedOfs are given in the following sections.


Data Provider

institution code: Unique identifier (UID) of the data provider from the UID-register of the Federal Statistical Office (FSO)

category: One of {Company; External Laboratory; Federal Office; Health Insurance; Hospital; Pharmacy; Private Practice; Research Organization; Service Provider; University}

department: Link to concept Department.


Department

name: The preferred language for values of the composedOf ‘name’ is English, which should be used whenever applicable and possible to increase accessibility across regions and language boundaries. Department abbreviations, if included, should also be spelled out.
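
The value constraints for Data Provider and Department can be checked before delivery. The following is a minimal Python sketch under stated assumptions, not an SPHN implementation: the class and field names are hypothetical, the department is assumed to be optional, and the UID shown is a placeholder. Only the category list is taken from the guideline above.

```python
from dataclasses import dataclass
from typing import Optional

# Value set for 'category', copied from the guideline text above.
DATA_PROVIDER_CATEGORIES = frozenset({
    "Company", "External Laboratory", "Federal Office", "Health Insurance",
    "Hospital", "Pharmacy", "Private Practice", "Research Organization",
    "Service Provider", "University",
})


@dataclass(frozen=True)
class Department:
    # Free-text name, preferably in English with abbreviations spelled out.
    name: str


@dataclass(frozen=True)
class DataProvider:
    institution_code: str  # UID from the FSO UID register
    category: str
    department: Optional[Department] = None  # assumed optional in this sketch

    def __post_init__(self) -> None:
        # Reject categories that are not in the documented value set.
        if self.category not in DATA_PROVIDER_CATEGORIES:
            raise ValueError(f"unknown Data Provider category: {self.category!r}")


# Placeholder values, for illustration only.
provider = DataProvider(
    institution_code="CHE-000.000.000",
    category="Hospital",
    department=Department(name="Department of Oncology"),
)
```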


Performer

As with the use of patient data, the written consent of performers, e.g., medical staff, needs to be obtained before data are requested. This applies in particular if the Performer concept is extended in the future by additional, possibly identifying attributes.


Source System

The use of Source System is described in detail here.

category: One of {biobank; case report form; clinical data platform; clinical registry; cohort; data repository; healthcare information system; OMICS facility; research laboratory}, see Table 2 for details

Note: The term “clinical data platform” is intended as an overarching term for data warehouses, data lakes etc. in a clinical context.

name: source system description expressed using free text where a coded category is not applicable or where free text holds additional information, e.g. National Cancer Registry (NCR)

purpose: One of {billing; patient care; quality control; research}, see Table 1 for details

primary information system: Link to concept Healthcare Primary Information System

Table 1: Usage examples for ‘purpose’ of ‘Source System’

| Example situation: the source system is primarily in place for | Corresponding value for ‘purpose’ |
| --- | --- |
| billing purposes, for example software used for assigning ICD-10 or CHOP codes | billing |
| diagnostic, screening, or therapeutic purposes, e.g., a radiology information system or an oncology information system | patient care |
| quality control purposes, e.g., holding data for a registry | quality control |
| a research project, e.g., capturing data for a clinical trial | research |

The composedOf ‘category’ of Source System provides information on the type of source system from which the data is extracted for the recipient. Table 2 provides example situations and the corresponding values.

Table 2: Usage examples for ‘category’ of ‘Source System’

| Example situation for data delivery to the project | Corresponding value for ‘category’ | Definition (ChatGPT) |
| --- | --- | --- |
| directly from the database or information system of a biobank | biobank | Biobank: Facility or institution with a focus on the collection, storage, and distribution of biological samples and associated data. |
| directly from a data capture system, e.g., a REDCap database, set up by the project | case report form | Case Report Form (CRF): A standardized document used in clinical research to collect data on each participant in a study. |
| from the clinical data platform, e.g., a clinical data warehouse or a clinical data lake (‘clinical data platform’ is used as an overarching term for data warehouses, data lakes, etc. in a clinical context) | clinical data platform | Clinical Data Platform: An integrated system for collecting, managing, and analyzing clinical data, often used in healthcare settings. |
| directly from a clinical registry database, e.g., a rare disease registry database | clinical registry | Clinical Registry: A database of information about individuals with a specific condition or disease, typically used for research and monitoring. |
| directly from a cohort database, e.g., the Swiss Spinal Cord Injury Cohort Study (SwiSCI) | cohort | Cohort: A group of individuals with shared characteristics, often studied over time to understand health outcomes. |
| directly from a data repository, e.g., data available from the Swiss Federal Statistical Office | data repository | Data Repository: A centralized location for storing and managing diverse datasets, including health data. |
| directly from a clinical information system, for example: (a) if a hospital does not have a clinical data platform and data is delivered directly from the clinical information system to the recipient; (b) if an oncology information system is not yet connected to the clinical data platform and data is therefore delivered directly from the oncology information system to the recipient; (c) if a private practice without a clinical data platform delivers data | healthcare information system | Healthcare Information System: An integrated information system designed for managing healthcare data within a hospital or a healthcare organization or provider. |
| directly from an OMICS facility information system, e.g., a genome center such as the Health 2030 Genome Center | OMICS facility | OMICS Facility: Facility or institution dealing with high-throughput technologies like genomics, proteomics, etc., generating large-scale molecular data. |
| directly from a research information system (non-clinical) | research laboratory | Research Laboratory: Physical space equipped for conducting scientific experiments and analyses, generating various types of research data. |
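
The two value sets for Source System lend themselves to a simple pre-delivery check. The sketch below is one possible way to express them in Python, not an SPHN implementation: the names `SourceSystemCategory`, `Purpose`, and `SourceSystem` are hypothetical, and treating every field as optional is an assumption of this sketch. The example instance combines the National Cancer Registry (NCR) name from the guideline with the registry row of Table 1.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceSystemCategory(Enum):
    # Values copied from the 'category' value set above.
    BIOBANK = "biobank"
    CASE_REPORT_FORM = "case report form"
    CLINICAL_DATA_PLATFORM = "clinical data platform"
    CLINICAL_REGISTRY = "clinical registry"
    COHORT = "cohort"
    DATA_REPOSITORY = "data repository"
    HEALTHCARE_INFORMATION_SYSTEM = "healthcare information system"
    OMICS_FACILITY = "OMICS facility"
    RESEARCH_LABORATORY = "research laboratory"


class Purpose(Enum):
    # Values copied from the 'purpose' value set above.
    BILLING = "billing"
    PATIENT_CARE = "patient care"
    QUALITY_CONTROL = "quality control"
    RESEARCH = "research"


@dataclass(frozen=True)
class SourceSystem:
    # All fields are optional in this sketch (an assumption, not a schema rule).
    category: Optional[SourceSystemCategory] = None
    name: Optional[str] = None  # free text, e.g. when no category applies
    purpose: Optional[Purpose] = None


# Looking up an enum member by value raises ValueError for unknown values,
# which gives the validation for free.
registry = SourceSystem(
    category=SourceSystemCategory("clinical registry"),
    name="National Cancer Registry (NCR)",
    purpose=Purpose("quality control"),
)
```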


Healthcare Primary Information System

The use of Healthcare Primary Information System is described in detail here.

code: SNOMED CT code specifying the category of healthcare primary information system, descendant of SNOMED CT: 706593004 |Information system (physical object)|

name: healthcare primary information system description expressed using free text where a coded category is not applicable or where free text holds additional information, e.g., Onkostar


Source Data

The Source Data concept is intended for two types of raw data: codes or character strings (free text). The source data usually represents non-standardised data.

code: coded information specifying the source data description, e.g. identifier: 1, name: verheiratet, coding system and version: KIS code book

string value: source data description expressed using free text, e.g. “Weichteilsarkom”. The string value must not contain patient identifying data.


Semantic Mapping

The concept ‘Semantic Mapping’ is designed to cover both mapping and coding events; its description, “process of transforming data elements to a code”, applies to both cases. For coding events, however, the input data may not always be fully available, or it may not be possible to properly represent in the schema all input components that were taken into account, for example when a diagnosis code is assigned. The cardinality of the composedOf ‘source data’ (type ‘Source Data’) has therefore been set to 0:n, so that an output code can still be represented without formal input.

source data: Link to the concept Source Data, i.e., the input for the Semantic Mapping

output: output of the Semantic Mapping, usually a standardised code

datetime: datetime of the Semantic Mapping. The point in time it refers to may vary between data providers: in one hospital it may refer to the time a semantic mapping was applied, in another to the time a mapping table was created. In both cases, however, it shall be possible to trace back which mapping was applied, e.g., for quality control, provided that internal operating procedures and tracking are in place.

method code: code from the Evidence and Conclusion Ontology (ECO) specifying the method used for the semantic mapping, such as manual or automatic assertion.

purpose: objective of the semantic mapping, one of {billing; patient care; quality control; research}. This is the same value set as for the ‘purpose’ of Source System; see further details there.
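
The 0:n cardinality of ‘source data’ can be illustrated with a minimal Python sketch. It is a simplified record under stated assumptions, not the SPHN schema itself: the class and field names are hypothetical, the method is given as plain text rather than as an ECO code, which fields are required is a choice of this sketch, and the dates and the diagnosis code placeholder are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

# Value set for 'purpose', copied from the guideline text above.
PURPOSES = frozenset({"billing", "patient care", "quality control", "research"})


@dataclass(frozen=True)
class SourceData:
    # Simplified: either a free-text string or a local code.
    string_value: Optional[str] = None
    code: Optional[str] = None


@dataclass(frozen=True)
class SemanticMapping:
    output: str  # usually a standardised code
    mapped_at: datetime
    purpose: str
    method: Optional[str] = None  # plain text here; an ECO code in a real delivery
    source_data: Tuple[SourceData, ...] = ()  # cardinality 0:n

    def __post_init__(self) -> None:
        if self.purpose not in PURPOSES:
            raise ValueError(f"unknown purpose: {self.purpose!r}")


# Mapping event: one source string is transformed into a SNOMED CT code.
mapping_event = SemanticMapping(
    output="248152002",
    mapped_at=datetime(2024, 1, 15, 9, 30),
    purpose="research",
    method="automatic assertion",
    source_data=(SourceData(string_value="féminin"),),
)

# Coding event: the output code is recorded without formal input.
coding_event = SemanticMapping(
    output="<diagnosis code>",  # placeholder, not a real code
    mapped_at=datetime(2024, 1, 15, 10, 0),
    purpose="billing",
    method="manual assertion",
)
```

An empty `source_data` tuple corresponds to the coding case described above, where the input cannot be represented in the schema.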