LUKS implementation of SPHN

LUKS has adopted a unified data warehouse strategy centered on the Caboodle platform, part of the Epic ecosystem. Caboodle consolidates legacy data infrastructures and integrates both clinical and financial domains. Data ingestion occurs through a nightly ETL process from source systems such as Chronicles (EMR) and SAP.

Figure: LUKS pipeline SPHN

Caboodle serves as the exclusive source for all SPHN-related data pipelines. These pipelines are stateless and follow an Extract-Load-Transform (ELT) paradigm. Each SPHN project has a dedicated data pipeline that uses a shared instance of the SPHN Connector to validate data and generate RDF files. The resulting RDF data is transmitted to the BioMedIT Network using Sett.

1. Development and Deployment Lifecycle

The development lifecycle follows a trunk-based model with three distinct environments: development, integration, and production. The integration environment is automatically deployed from the latest main branch commits and serves as a validation layer for testing new concepts and configurations. The production environment is version-frozen to the most recently verified stable release per SPHN project, ensuring data integrity and compliance in all SPHN submissions.

2. Data Extraction and Population Management

Caboodle uses a Kimball dimensional model with some vendor-specific concepts. A Kimball model is built on fact tables for transactional concepts and companion dimension tables for descriptive attributes, arranged in a star or snowflake schema. For each SPHN project and defined cohort, a population fact table is generated nightly through the ETL process. This table provides patient, case, and encounter identifiers to constrain downstream data extractions to the relevant cohort. Raw SPHN concept data is retrieved in a full load from Caboodle via dynamically generated SQL queries and loaded directly into the SPHN Connector's PostgreSQL database. For cohorts of up to 100k patients, this approach has provided an effective balance between pipeline throughput and agility during development.
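To illustrate how a population fact table can constrain a dynamically generated extraction query, here is a minimal Python sketch. It is not the LUKS implementation: the table and column names (ExampleMeasurementFact, ExamplePopulationFact, PatientKey and so on) and the ConceptSource helper are hypothetical, and the sketch assumes identifiers come from trusted, version-controlled configuration rather than user input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptSource:
    """Hypothetical description of one raw SPHN concept extraction."""
    table: str          # source fact table (illustrative name, not a real Caboodle table)
    columns: tuple      # columns to extract
    key_column: str     # identifier shared with the population table


def build_extraction_query(source: ConceptSource, population_table: str) -> str:
    """Render a full-load SELECT restricted to the project cohort."""
    column_list = ", ".join(f"src.{col}" for col in source.columns)
    return (
        f"SELECT {column_list}\n"
        f"FROM {source.table} AS src\n"
        f"INNER JOIN {population_table} AS pop\n"
        f"  ON pop.{source.key_column} = src.{source.key_column}"
    )


# Assumed example configuration for a single concept.
measurement = ConceptSource(
    table="ExampleMeasurementFact",
    columns=("PatientKey", "EncounterKey", "MeasuredValue"),
    key_column="PatientKey",
)
query = build_extraction_query(measurement, "ExamplePopulationFact")
print(query)
```

The inner join against the population table is what limits the full load to the cohort, so the same query template can be reused across projects by swapping the population table name.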

3. Data Processing and Transformation

The transformation of raw SPHN data into the target schema proceeds through modular stages implemented in SQL and Python, with dbt as the data transformation framework. dbt supports version-controlled, declarative modeling and integrates testing capabilities that are essential for data reliability. The overall workflow is orchestrated with Prefect, which manages task dependencies and triggers the SPHN Connector where required.

Figure: LUKS data processing

3.1 De-identification

A dedicated anonymization layer was developed to ensure consistent surrogate key generation and controlled date shifting before data ingestion into the SPHN Connector. The layer operates in a stateless manner, which guarantees reproducibility across deployments. De-identification outcomes are verified at runtime through both data and unit testing.
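One common way to obtain stateless, reproducible surrogate keys and date shifts is to derive both from a keyed hash, so that no lookup table has to be stored. The sketch below assumes this technique; the text does not state which method the LUKS layer uses. The function names, the 16-character key length, and the 30-day shift window are all illustrative assumptions.

```python
import hashlib
import hmac
from datetime import date, timedelta


def surrogate_key(secret: bytes, identifier: str) -> str:
    """Derive a deterministic surrogate key from an identifier (assumed HMAC-SHA256)."""
    digest = hmac.new(secret, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncation length is an arbitrary choice for this sketch


def date_shift_days(secret: bytes, patient_id: str, max_shift: int = 30) -> int:
    """Derive a per-patient shift in the range [-max_shift, +max_shift] days."""
    message = f"shift:{patient_id}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    span = 2 * max_shift + 1
    return int.from_bytes(digest[:8], "big") % span - max_shift


def shift_date(secret: bytes, patient_id: str, original: date, max_shift: int = 30) -> date:
    """Apply the same shift to every date of a patient, preserving intervals."""
    return original + timedelta(days=date_shift_days(secret, patient_id, max_shift))


# Hypothetical usage: the secret would come from a secrets store, not from source code.
secret = b"replace-with-managed-secret"
admission = shift_date(secret, "patient-001", date(2024, 1, 1))
discharge = shift_date(secret, "patient-001", date(2024, 1, 11))
print(surrogate_key(secret, "patient-001"), (discharge - admission).days)
```

Because the outputs depend only on the secret and the input identifier, rerunning the pipeline in another deployment with the same secret yields identical keys and shifts, and intervals between a patient's dates are preserved.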

3.2 Terminology Mapping

Wherever possible, mappings are maintained within the source systems, which keeps them consistent across institutional use cases beyond SPHN. SPHN-specific and advanced mapping logic, such as unit parsing (UCUM) and address derivation (SEP), is implemented within the pipeline itself.
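As a rough illustration of pipeline-side unit mapping, the sketch below normalizes free-text unit strings and looks them up in a small table of UCUM codes. The mapping table, the to_ucum function, and the normalization rules are assumptions made for this example; real unit parsing is considerably more involved, and lowercasing the input is only safe when local units do not rely on case to distinguish meaning.

```python
from typing import Optional

# Illustrative lookup from normalized local unit strings to case-sensitive UCUM codes.
LOCAL_TO_UCUM = {
    "mg/dl": "mg/dL",
    "mmol/l": "mmol/L",
    "g/l": "g/L",
    "umol/l": "umol/L",
}


def to_ucum(local_unit: str) -> Optional[str]:
    """Return the UCUM code for a local unit string, or None if it is unmapped."""
    normalized = local_unit.strip().lower().replace(" ", "")
    return LOCAL_TO_UCUM.get(normalized)


print(to_ucum(" mg / dl "))
print(to_ucum("furlongs"))
```

Returning None for unmapped units, rather than guessing, lets a downstream data test flag the gaps explicitly.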

3.3 Concept Factory

The Concept Factory is the central component responsible for modeling, transforming, and validating SPHN concept data. It is implemented using dbt, which provides a declarative, dependency-aware modeling framework. Versioned dbt models are generated according to the defined target schema for each SPHN concept, forming a directed acyclic graph that converts raw data into the SPHN project database schema expected by the SPHN Connector.
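The directed acyclic graph of models can be pictured with a small Python sketch using the standard library's graphlib module. dbt resolves its own build order internally, so this only illustrates the idea; the model names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical model names; each model maps to the set of models it depends on.
model_dependencies = {
    "raw_measurement": set(),
    "stg_measurement": {"raw_measurement"},
    "stg_unit_mapping": set(),
    "sphn_measurement": {"stg_measurement", "stg_unit_mapping"},
}

# static_order() yields every model after all of its dependencies.
build_order = list(TopologicalSorter(model_dependencies).static_order())
print(build_order)
```

Any valid order builds raw and staging models before the final concept model; a cycle in the dependencies would raise graphlib.CycleError instead of producing an order.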

3.4 Validation

The use of dbt promotes test-driven development and facilitates close collaboration with analysts who are familiar with SQL and the institutional data model. Its integrated testing suite supports early detection of bugs, data drift, and schema evolution. During development, this approach reduces reliance on the computationally expensive validation performed by the SPHN Connector. Full validation remains active in the integration and production environments to guarantee data integrity and compliance.
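dbt ships generic tests such as not_null and unique, which are declared in configuration and run as SQL. The Python sketch below mimics what those two checks assert on a handful of rows, to show the kind of cheap test that can catch problems before the Connector's full validation. The row structure and column names are invented for the example.

```python
from collections import Counter


def not_null_failures(rows, column):
    """Return the rows in which the given column is missing or None."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows, column):
    """Return the non-null values that occur more than once in the column."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


# Invented sample data with one duplicate identifier and one missing code.
rows = [
    {"id": "a1", "code": "mg/dL"},
    {"id": "a2", "code": None},
    {"id": "a1", "code": "mmol/L"},
]
print(not_null_failures(rows, "code"))
print(unique_failures(rows, "id"))
```

As in dbt, a test passes when it returns no failing records, so an empty list from either function means the check succeeded.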

4. Data Submission to BioMedIT

Data submission to the BioMedIT Network is automated via Sett. The upload is programmatically triggered upon successful completion of data processing and validation workflows. Only the production environment is authorized for submission, ensuring traceability and preventing premature data release.
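The submission rule described above can be expressed as a simple guard. This is a sketch under assumptions: the RunResult structure and its field names are hypothetical, and the actual transfer step is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one pipeline run."""
    environment: str              # "development", "integration", or "production"
    processing_succeeded: bool
    validation_succeeded: bool


def may_submit(run: RunResult) -> bool:
    """Allow submission only for successful, validated production runs."""
    return (
        run.environment == "production"
        and run.processing_succeeded
        and run.validation_succeeded
    )


print(may_submit(RunResult("production", True, True)))
print(may_submit(RunResult("integration", True, True)))
```

Keeping the rule in one small function makes it easy to unit test and to audit which conditions gate a release.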

5. Platform Architecture

The infrastructure follows a cloud-oriented roadmap designed for automation, reproducibility, and operational stability. System provisioning and configuration are automated using Infrastructure as Code practices implemented in Ansible, ensuring uniformity across environments.

Figure: LUKS service architecture

All components operate on Red Hat Enterprise Linux 9 with Podman as the container runtime. Containers are executed in unprivileged mode for improved security. Core services, including the Prefect server (red), worker agents (blue), and the SPHN Connector (green), are deployed in dedicated Pods defined via Kubernetes YAML manifests. These manifests are executed through Podman using Quadlet units, establishing a declarative and portable deployment model that can seamlessly transition to cloud-managed Kubernetes services in the future.
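To make the Quadlet step more concrete, the sketch below renders the text of a minimal .kube unit that points Podman at a Kubernetes YAML manifest. The render_kube_unit helper, the description, and the manifest file name are hypothetical; only the [Kube] section with its Yaml= key reflects the Quadlet unit format, and the [Install] target shown is an assumption suited to rootless user services.

```python
def render_kube_unit(description: str, manifest_path: str) -> str:
    """Render the text of a minimal Quadlet .kube unit for one Pod manifest."""
    return (
        "[Unit]\n"
        f"Description={description}\n"
        "\n"
        "[Kube]\n"
        f"Yaml={manifest_path}\n"  # Kubernetes YAML manifest that defines the Pod
        "\n"
        "[Install]\n"
        "WantedBy=default.target\n"  # assumed target for a rootless user service
    )


# Hypothetical service name and manifest file.
unit_text = render_kube_unit("Example orchestration server pod", "example-server.yaml")
print(unit_text)
```

Because the unit only references the manifest, the same YAML file could later be applied to a managed Kubernetes cluster without going through Quadlet.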