Provenance concepts
Overview
Information about data provenance, i.e., details on the origin of the data, gives an indication of data type and quality. It allows researchers to assess whether the data are suitable for the intended type of data science (fit for purpose).
Such metadata can span various areas of interest. The following aspects were emphasised when the Provenance concepts were introduced to the SPHN Schema:
Who has generated or provided the data?
This concerns the concepts Data Provider, Department, and Performer.
Where is the data extracted from?
Such details, including the source system, its purpose, or the primary system can be represented using the concepts Source System and Healthcare Primary Information System.
What data is provided?
This entails details on raw data and insights into potential processing steps, e.g., mapping or data transformation events. This is covered by the concepts Source Data and Semantic Mapping.
Concept design
Data Provider and Department
The concept Data Provider replaces the concept Data Provider Institute from release 2024.1 onwards. In addition to a coded unique identifier, it offers information on the category of the data provider.
Figure 1: Design of the concepts Data Provider and Department
Performer
The concept Performer describes the type of person (patient, physician, etc.) performing the activity which has resulted in the data of interest. This includes, for example, measurements or assessments.
Figure 2: Design of Performer concept
The type of the performer is provided as a SNOMED CT code. Two types of Performer can be expressed: (1) codes with the semantic tag “occupation”, e.g., 106292003 | Professional nurse (occupation) | or 309343006 | Physician (occupation) |, and (2) codes with the semantic tag “person”, e.g., 86372007 |Grandchild (person)| or 394863008 |Non-family member (person)|. The Performer concept is open for extensions in the future.
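The two kinds of performer code can be told apart by the semantic tag in parentheses at the end of the SNOMED CT term. The following is a minimal Python sketch, assuming the code is available in the `identifier | term (tag) |` notation used in this section. The function name `performer_semantic_tag` is hypothetical and not part of any SPHN tooling.

```python
import re

# Semantic tags currently allowed for Performer (taken from the text above).
ALLOWED_TAGS = {"occupation", "person"}

# Matches e.g. "106292003 | Professional nurse (occupation) |".
_EXPRESSION = re.compile(
    r"^\s*(?P<id>\d+)\s*\|\s*(?P<term>.+?)\s*\((?P<tag>[^()]+)\)\s*\|\s*$"
)


def performer_semantic_tag(expression: str) -> str:
    """Return the semantic tag of a performer code, or raise ValueError."""
    match = _EXPRESSION.match(expression)
    if match is None:
        raise ValueError(f"not in 'id | term (tag) |' form: {expression!r}")
    tag = match.group("tag")
    if tag not in ALLOWED_TAGS:
        raise ValueError(f"semantic tag {tag!r} is not allowed for Performer")
    return tag
```

A call such as `performer_semantic_tag("309343006 | Physician (occupation) |")` yields `"occupation"`, while a code with another tag, e.g. `(finding)`, is rejected.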
Source System and Healthcare Primary Information System
The proposed concept Source System describes the category and purpose of the source system, e.g., a hospital information system for clinical routine data intended for quality control. If a coded category is not applicable, the source system description can be expressed using free text. Source System is introduced as a novel core concept in release 2024.1 of the SPHN Schema, alongside Subject Pseudo Identifier, Administrative Case, and Data Provider (formerly Data Provider Institute).
Figure 3: Design of Source System concept
The Source System concept also links to an attribute Healthcare Primary Information System, which describes the primary source system of the healthcare data, e.g., a clinical laboratory information system. If a coded category is not applicable, the healthcare primary information system description can be expressed using free text.
Figure 4: Design of Healthcare Primary Information System concept
Source Data
The Source Data-concept provides information on raw data that has undergone a transformation, e.g., a mapping or coding event. Source data can be provided as string or code. A string can be in local language, for example “féminin” (mapped to 248152002 |Female (finding)|). A code may be a local code from a vendor-specific coding system. In addition, the source system from which the source data is provided can be represented.
Figure 5: Design of Source Data concept
The concept Source Data is currently used by the Semantic Mapping concept only.
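The string-or-code shape of Source Data can be illustrated with a small Python sketch. The class and field names below are hypothetical and only mirror the wording of this section. The rule that exactly one of the two forms is given is an assumption made for the example, not a statement about the schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalCode:
    """A code from a local or vendor-specific coding system (hypothetical)."""
    identifier: str
    name: str
    coding_system_and_version: str


@dataclass(frozen=True)
class SourceData:
    """Raw data before transformation, given as a string or as a code."""
    string_value: Optional[str] = None
    code: Optional[LocalCode] = None
    source_system: Optional[str] = None  # optional reference to the source system

    def __post_init__(self) -> None:
        # Assumption for this sketch: exactly one representation is provided.
        if (self.string_value is None) == (self.code is None):
            raise ValueError("provide exactly one of string_value or code")


# Raw string in local language, before it is mapped to a standard code.
raw_sex = SourceData(string_value="féminin")
```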
Semantic Mapping Event
The concept Semantic Mapping represents information about the transformation of data elements to an (ideally standardised) code. It refers to the source data, the output of the semantic mapping, as well as the mapping method, purpose, and time point of mapping.
From a knowledge-centric perspective, a process is an event which operates on input(s) and can yield output(s). In this sense, the concept Semantic Mapping represents the process which operates on source data (input) like non-standard codes or strings and transforms it into a standardised code (output) which may for example be linked to a Result.
Figure 6: Design of Semantic Mapping concept
Figure 7: Semantic Mapping in the context of knowledge-centric (process-oriented) concept design.
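The process view described above (input, transformation, output) can be sketched in Python as follows. The mapping table, the function name `semantic_mapping`, the dictionary keys, and the `"automatic"` method label are all hypothetical placeholders. In particular, the method label is not an actual method code.

```python
from datetime import datetime, timezone

# Hypothetical local mapping table: source string -> standardised code.
MAPPING_TABLE = {
    "féminin": "248152002 |Female (finding)|",
}


def semantic_mapping(source_string: str, purpose: str = "research") -> dict:
    """Map one source string to a code and record how the mapping happened."""
    # Normalise the lookup key; an unmapped string raises KeyError.
    output = MAPPING_TABLE[source_string.strip().lower()]
    return {
        "source_data": source_string,  # input of the process, kept unchanged
        "output": output,              # output of the process
        "method": "automatic",         # placeholder label, not a real method code
        "purpose": purpose,
        "datetime": datetime.now(timezone.utc),  # time the mapping was applied
    }
```

Keeping the unmodified source string next to the output is what allows a recipient to trace a standardised code back to the raw value.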
Example for data delivery
Figure 8: Instantiation example of Provenance concepts
Guideline for data delivery
General
General guidelines on the concept use as well as specific guidelines for individual composedOfs are given in the following sections.
Data Provider
institution code: Unique identifier (UID) of the data provider from the UID-register of the Federal Statistical Office (FSO)
category: One of {Company; External Laboratory; Federal Office; Health Insurance; Hospital; Pharmacy; Private Practice; Research Organization; Service Provider; University}
department: Link to concept Department.
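A delivery pipeline might check the ‘category’ value against the list above before export. The sketch below is a minimal Python illustration; the function name `validate_data_provider` is hypothetical, and the check on the institution code only tests that it is non-empty because the identifier format is not specified here.

```python
# Allowed values for 'category' of Data Provider (copied from the list above).
DATA_PROVIDER_CATEGORIES = frozenset({
    "Company", "External Laboratory", "Federal Office", "Health Insurance",
    "Hospital", "Pharmacy", "Private Practice", "Research Organization",
    "Service Provider", "University",
})


def validate_data_provider(institution_code: str, category: str) -> None:
    """Raise ValueError if the Data Provider values are obviously invalid."""
    if not institution_code.strip():
        raise ValueError("institution code must not be empty")
    if category not in DATA_PROVIDER_CATEGORIES:
        raise ValueError(f"unknown data provider category: {category!r}")
```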
Department
name: The preferred language for values of the composedOf ‘name’ is English, which should be used whenever applicable and possible to increase accessibility across regions and language boundaries. Department abbreviations, if included, should also be spelled out.
Performer
As with the use of patient data, it should be considered that the written consent of performers, e.g., medical staff, needs to be obtained before data are requested, in particular if the Performer concept is extended in the future with additional, possibly identifying attributes.
Source System
The use of Source System is described in detail here.
category: One of {biobank; case report form; clinical data platform; clinical registry; cohort; data repository; healthcare information system; OMICS facility; research laboratory}, see table below for details
Note: The term “clinical data platform” is intended as an overarching term for data warehouses, data lakes etc. in a clinical context.
name: source system description expressed using free text where a coded category is not applicable or where free text holds additional information, e.g. National Cancer Registry (NCR)
purpose: One of {billing; patient care; quality control; research}, see table below for details
primary information system: Link to concept Healthcare Primary Information System
Example situation the source system is primarily in place for | Corresponding value for purpose |
---|---|
billing purposes, for example software used for assigning ICD-10- or CHOP-codes | billing |
diagnostic, screening, or therapeutic purposes, e.g., a radiology information system or an oncology information system | patient care |
quality control purposes, e.g., holding data for a registry | quality control |
a research project, e.g., capturing data for a clinical trial | research |
The composedOf ‘category’ of Source System will provide information on the type of source system from which the data is extracted for the recipient. The table below provides example situations and the corresponding value.
Example situation for data delivery to the project | Corresponding value for ‘category’ | Definition (ChatGPT) |
---|---|---|
directly from the database or information system of a biobank | biobank | Biobank: Facility or institution with a focus on the collection, storage, and distribution of biological samples and associated data. |
directly from a data capture system, e.g., a REDCap database, set up by the project | case report form | Case Report Form (CRF): A standardized document used in clinical research to collect data on each participant in a study. |
from the clinical data platform, e.g., a clinical data warehouse, a clinical data lake, etc.; ‘clinical data platform’ is used as an overarching term for data warehouses, data lakes etc. in a clinical context. | clinical data platform | Clinical Data Platform: An integrated system for collecting, managing, and analyzing clinical data, often used in healthcare settings. |
directly from a clinical registry database, e.g., a rare disease registry database | clinical registry | Clinical Registry: A database of information about individuals with a specific condition or disease, typically used for research and monitoring. |
directly from a cohort database, e.g., the Swiss Spinal Cord Injury Cohort Study (SwiSCI) | cohort | Cohort: A group of individuals with shared characteristics, often studied over time to understand health outcomes. |
directly from a data repository, e.g., data available from the Swiss Federal Statistical Office | data repository | Data Repository: A centralized location for storing and managing diverse datasets, including health data. |
directly from a clinical information system | healthcare information system | Healthcare Information System: An integrated information system designed for managing healthcare data within a hospital or a healthcare organization or provider. |
directly from an OMICS facility information system, e.g., a genome center such as the Health 2030 Genome Center | OMICS facility | OMICS Facility: Facility or institution dealing with high-throughput technologies like genomics, proteomics, etc., generating large-scale molecular data. |
directly from a research information system (non-clinical) | research laboratory | Research Laboratory: Physical space equipped for conducting scientific experiments and analyses, generating various types of research data. |
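The value sets for ‘category’ and ‘purpose’ of Source System lend themselves to enumerations. The following Python sketch is one way to encode them; the class names and the helper `describe_source_system` are hypothetical and only serve as an illustration.

```python
from enum import Enum
from typing import Optional


class SourceSystemCategory(Enum):
    BIOBANK = "biobank"
    CASE_REPORT_FORM = "case report form"
    CLINICAL_DATA_PLATFORM = "clinical data platform"
    CLINICAL_REGISTRY = "clinical registry"
    COHORT = "cohort"
    DATA_REPOSITORY = "data repository"
    HEALTHCARE_INFORMATION_SYSTEM = "healthcare information system"
    OMICS_FACILITY = "OMICS facility"
    RESEARCH_LABORATORY = "research laboratory"


class Purpose(Enum):
    BILLING = "billing"
    PATIENT_CARE = "patient care"
    QUALITY_CONTROL = "quality control"
    RESEARCH = "research"


def describe_source_system(category: str, purpose: str,
                           name: Optional[str] = None) -> dict:
    """Build a Source System description; unknown values raise ValueError."""
    return {
        "category": SourceSystemCategory(category).value,
        "purpose": Purpose(purpose).value,
        "name": name,  # free text, e.g. where no coded category applies
    }
```

For instance, `describe_source_system("clinical registry", "quality control", "National Cancer Registry (NCR)")` passes, whereas a value such as `"data warehouse"` is rejected because the overarching term ‘clinical data platform’ is expected instead.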
Healthcare Primary Information System
The use of Healthcare Primary Information System is described in detail here.
code: SNOMED CT code specifying the category of healthcare primary information system, descendant of SNOMED CT: 706593004 |Information system (physical object)|
name: healthcare primary information system description expressed using free text where a coded category is not applicable or where free text holds additional information, e.g., Onkostar
Source Data
The Source Data concept is intended for two types of raw data, codes or character strings (free text). The source data usually represents non-standardised data.
code: coded information specifying the source data description, e.g. identifier: 1, name: verheiratet, coding system and version: KIS code book
string value: source data description expressed using free text, e.g. “Weichteilsarkom”. The string value must not contain patient identifying data.
Semantic Mapping
The concept ‘Semantic Mapping’ is designed and intended to cover both mapping and coding events. Its description “process of transforming data elements to a code” covers both of these cases. For coding events, however, the input data may not always be fully available, or it may not be possible to properly represent in the schema all input components that were taken into account, for example for the assignment of a diagnosis code. The cardinality of the composedOf ‘source data’ (type ‘Source Data’) has therefore been set to 0:n, so that an output code can still be represented when no formal input is available.
source data: Link to the concept Source Data, i.e., the input for the Semantic Mapping
output: output of the Semantic Mapping, usually a standardised code
datetime: date and time of the Semantic Mapping. What this refers to may vary between data providers: in one hospital it may be the time at which a semantic mapping was applied, in another the point in time at which a mapping table was created. In both cases, however, it shall be possible to trace back which mapping was applied, e.g., for quality control, provided that internal operating procedures and tracking are in place.
method code: ECO-code specifying the method used for semantic mapping, such as manual or automatic assertion.
purpose: objective of the semantic mapping, one of {billing; patient care; quality control; research}. Same value set as for the ‘purpose’ of Source System, see further details there.
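The 0:n cardinality of ‘source data’ means that a Semantic Mapping may carry no input at all, one input, or several. A minimal Python sketch of such a record follows; the class and field names are hypothetical, and treating ‘method code’ and ‘datetime’ as optional is an assumption of this example rather than a statement about the schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

# Value set for 'purpose' (copied from the list above).
PURPOSES = frozenset({"billing", "patient care", "quality control", "research"})


@dataclass(frozen=True)
class SemanticMappingRecord:
    output: str                          # usually a standardised code
    purpose: str
    source_data: Tuple[str, ...] = ()    # 0:n inputs; empty for coding events
    method_code: Optional[str] = None    # left unset in this sketch
    mapped_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        if self.purpose not in PURPOSES:
            raise ValueError(f"unknown purpose: {self.purpose!r}")


# Mapping event with one input string.
mapped = SemanticMappingRecord(
    output="248152002 |Female (finding)|",
    purpose="research",
    source_data=("féminin",),
)

# Coding event without formal input; "I10" is only an illustrative output code.
coded = SemanticMappingRecord(output="I10", purpose="billing")
```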