CHUV with Oracle
Note
The following documentation about implementation is provided by Le Centre hospitalier universitaire vaudois (CHUV). For additional information or questions, please send an email to SPHN Data Coordination Center at dcc@sib.swiss.
Introduction
The CHUV RDF Graph Solution mainly resides on Oracle Database 12 and Ontop Graph System 4. A Jenkins pipeline creates and deploys the RDF graphs.
Graph Life Cycle
The pipeline encompasses the following steps:
Data Preparation
Data Generation
Data Validation
Data Transfer & Load
Oracle Semantic Technologies
Note
Oracle Relational Database is the single source of truth (i.e. the Data Warehouse).
Oracle RDF Graph is the RDF triplestore.
CHUV Technology Stack
Note
HORUS is the CHUV platform that encompasses in particular a Patient Consent Manager for the de-identification of the data and a Clinical Data Warehouse (Oracle).
Oracle hosts the CHUV Data Warehouse and the RDF Graph triplestore.
Jenkins is the automation server for the execution of the pipeline.
Ontop is a virtual graph system used to materialize RDF data (Turtle format) based on the defined R2RML mappings (Oracle to RDF: Data Generation).
Protégé is used to edit the ontologies and to design R2RML (RDB to RDF Mapping Language) mappings (Ontop plugin: Data Preparation).
Jena Adapter is a Java-based interface used to load RDF data into the Oracle RDF triplestore (Data Load).
Oracle RDF Graph Server and Query UI provides REST services and a query UI (SPARQL endpoint: User Experience).
Implementation
Graph Life Cycle and Pipeline
Data Preparation
Design and curate ontologies (Protégé)
ETL all required data into Oracle (Staging Tables, …)
Develop Oracle views to expose de-identified data (Data Warehouse, Consent Manager, …)
Develop R2RML mappings (Protégé/Ontop: Oracle DB to RDF)
Data Generation
Generate RDF data files (Turtle format)
Data Validation
Check RDF data format (SHACL rules)
Check RDF data quality (SPARQL queries)
Data Transfer and Load
Transfer RDF data with SETT
Load RDF data (files) into Oracle Graph (model)
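In the CHUV setup the load itself goes through the Java-based Jena Adapter. For illustration only, an upload via the generic SPARQL 1.1 Graph Store HTTP Protocol can be sketched with the Python standard library (the endpoint URL and graph IRI below are hypothetical, and Oracle RDF Graph Server's REST services may differ from this generic protocol):

```python
from urllib.parse import urlencode
import urllib.request


def graph_store_url(base, graph_iri):
    """Build a SPARQL 1.1 Graph Store Protocol URL targeting a named graph."""
    return f"{base}?{urlencode({'graph': graph_iri})}"


def load_turtle(base, graph_iri, ttl_path):
    """PUT a Turtle file into the named graph (replaces its previous content)."""
    with open(ttl_path, "rb") as f:
        payload = f.read()
    req = urllib.request.Request(
        graph_store_url(base, graph_iri),
        data=payload,
        method="PUT",
        headers={"Content-Type": "text/turtle"},
    )
    return urllib.request.urlopen(req)


# Hypothetical endpoint and graph IRI, for illustration only
url = graph_store_url("https://example.org/rdf/data", "urn:example:graph")
print(url)
```

The named graph plays the role of the Oracle "model" mentioned above: each load targets one graph, so a failed batch can be replaced atomically.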
User Experience
Explore and analyze data
Example of use
From a Python notebook, a researcher can query the SPARQL endpoint (e.g. Fuseki) to explore and analyze the data.
Example of a Jupyter notebook for PSSS Data (Sepsis project)
SPARQL query
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-ontology/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX psss:<https://biomedit.ch/rdf/sphn-ontology/psss/>
select distinct ?patient ?stay ?temperature ?temperature_datetime ?infection_datetime
where {
    ?infection rdf:type psss:InfectionSuspected ;
        sphn:hasSubjectPseudoIdentifier ?patient ;
        sphn:hasAdministrativeCase ?stay ;
        psss:hasInfectionSuspectedDateTime ?infection_datetime ;
        psss:hasInfectionSuspectedValue ?infection_suspicion .
    FILTER(?infection_suspicion = "yes")
    ?temp rdf:type sphn:BodyTemperature ;
        sphn:hasSubjectPseudoIdentifier ?patient ;
        sphn:hasAdministrativeCase ?stay ;
        sphn:hasBodyTemperatureValue ?temperature ;
        sphn:hasDateTime ?temperature_datetime .
    FILTER(?stay = resource:CHE_108_910_225-AdministrativeCase-S28174321571)
}
order by ?patient ?temperature_datetime
Infection and Body Temperature over time for a Patient Encounter
Python code (Jupyter notebook)
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import json
import ssl
from getpass import getpass
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
%matplotlib inline
v_user = "admin"
v_password = getpass(prompt="Please type the password of Fuseki user " + v_user + ": ")
v_endpoint_url = "https://fuseki.horus-graph.intranet.chuv/fuseki/CHUV_PROD_WRK_RDF/query"
prefixes = {
"xml": "http://www.w3.org/XML/1998/namespace",
"xsd": "http://www.w3.org/2001/XMLSchema#",
"owl": "http://www.w3.org/2002/07/owl#",
"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
"rdfs": "http://www.w3.org/2000/01/rdf-schema#",
"sphn": "https://biomedit.ch/rdf/sphn-ontology/sphn#",
"resource": "https://biomedit.ch/rdf/sphn-resource/",
"psss": "https://biomedit.ch/rdf/sphn-ontology/psss/"
}
# SPARQL Query
v_sparql_query = """
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-ontology/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX psss:<https://biomedit.ch/rdf/sphn-ontology/psss/>
select distinct ?patient ?stay ?temperature ?temperature_datetime ?infection_datetime
where {
    ?infection rdf:type psss:InfectionSuspected ;
        sphn:hasSubjectPseudoIdentifier ?patient ;
        sphn:hasAdministrativeCase ?stay ;
        psss:hasInfectionSuspectedDateTime ?infection_datetime ;
        psss:hasInfectionSuspectedValue ?infection_suspicion .
    FILTER(?infection_suspicion = "yes")
    ?temp rdf:type sphn:BodyTemperature ;
        sphn:hasSubjectPseudoIdentifier ?patient ;
        sphn:hasAdministrativeCase ?stay ;
        sphn:hasBodyTemperatureValue ?temperature ;
        sphn:hasDateTime ?temperature_datetime .
    FILTER(?stay = resource:CHE_108_910_225-AdministrativeCase-S28174321571)
}
order by ?patient ?temperature_datetime
"""
# SPARQL Wrapper
sparql = SPARQLWrapper(v_endpoint_url)
sparql.setCredentials(v_user,v_password)
sparql.setQuery(v_sparql_query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
# Build a Pandas DataFrame from the SPARQL JSON bindings
# (DataFrame.append was removed in pandas 2.0)
rows = [{key: binding[key]["value"] for key in binding}
        for binding in results["results"]["bindings"]]
df = pd.DataFrame(rows)
print("Endpoint:", v_endpoint_url)
display(df)
# Pandas Dataframe formatting
df = df.set_index('stay')
df['temperature_datetime'] = df['temperature_datetime'].apply(pd.to_datetime)
df['infection_datetime'] = df['infection_datetime'].apply(pd.to_datetime)
df['temperature'] = df['temperature'].astype(float)  # np.float was removed in NumPy 1.24
# Results plotting
for s in df.index.unique()[:4]:
    d_infection = df.loc[s, 'infection_datetime'].drop_duplicates()
    temp = df.loc[s, 'temperature']
    fig, ax = plt.subplots(figsize=(12, 4))
    plt.plot(df.loc[s, 'temperature_datetime'], temp, label='temperature (C)')
    # Shade a 24-hour window after each suspected infection
    for i in d_infection:
        plt.fill_betweenx(y=np.linspace(temp.min(), temp.max(), 2),
                          x1=i,
                          x2=i + pd.to_timedelta('24 hours'),
                          facecolor='r',
                          alpha=0.4,
                          label='infection')
    plt.title(f'stay {s}')
    plt.xlabel('datetime')
    plt.ylabel('temperature (C)')
    plt.grid()
    plt.legend(loc='lower right')
    date_format = mdates.DateFormatter('%d/%m/%y %H:%M')
    ax.xaxis.set_major_formatter(date_format)
    fig.autofmt_xdate()
    plt.show()