Query data with SPARQL

Note

To find out more watch the Querying Data with SPARQL Training. For an introduction to SPARQL, visit the SPARQL Background section.

Target Audience

This document is mainly intended for researchers who are interested in querying their data for data profiling or exploration purposes. The document provides the following information:

  • how to setup the inference in GraphDB before running any query

  • how to build queries for getting statistical knowledge about data

  • some examples of queries for exploring the content of a particular data.

Data profiling querying - Summary statistics

A set of statistical queries is provided for making an initial evaluation of the data content and quality. These statistical queries can be run in any triplestore that enables the querying of RDF data (e.g. GraphDB, Jena) by simply copy-pasting the content of the queries into the querying field. They are integrated in the Quality Check tool.

More information on how to run SPARQL can be found in Training Video.

The queries are mostly built in the following manner:

First, prefixes used in the queries need to be defined to facilitate the (human) reading and writing of the query:

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX spo:<https://biomedit.ch/rdf/sphn-schema/spo#>

Then the type of query should be specified. Here, the query form SELECT is used for specifying variables expected to be provided in the results.

Note

Variables start with a question mark ? in SPARQL.

SELECT ?concept (COUNT(?resource) AS ?sphn_concepts_resources)
 (COUNT(distinct ?subject) as ?subject_cnt)
 (COUNT(distinct ?case) as ?case_cnt)
 (COUNT(distinct ?provider) as ?provider_cnt)

In this example, five variables will be given as result:

  • a concept,

  • the count of resource for that concept as well as

  • the count of subjects,

  • cases and

  • providers for that concept.

Following the definition of variables, the graph pattern of interest must then be specified with the clause WHERE:

WHERE {
 { ?concept rdfs:subClassOf+ sphn:SPHNConcept } UNION { ?concept rdfs:subClassOf+ spo:SPOConcept } .
 ?resource a ?concept .
 optional {?resource sphn:hasDataProvider ?provider}
 optional {?resource sphn:hasSubjectPseudoIdentifier ?subject}
 optional {?resource sphn:hasAdministrativeCase ?case}

Here, the query searches for patterns where a resource is defined with the RDF class type concept. This concept class must be at least either a subClass of SPHN or SPO. And finally, the resource can optionally be connected to:

  • a data provider or

  • a subject pseudo identifier or

  • a administrative case.

Next it is possible to filter out for graph patterns that are not of interest and that should not be returned in the results, with the clause FILTER NOT EXISTS:

FILTER NOT EXISTS {?concept rdfs:subClassOf sphn:ValueSet}

Here, the query filters out classes that are subClasses of the SPHN class ValueSet.

Finally, it is possible to end with some query modifiers, here the query ends by grouping results for a given variable:

} group by ?concept order by desc(?sphn_concepts_resources)

In this example, the results are grouped by concepts retrieved.

For data following project-specific schemas, it may be necessary to adjust the queries to fit the search for certain elements. For any help, please contact the fair-data-team@sib.swiss.

Data exploration querying

The following examples are meant to showcase ways to write queries for exploring the content of data in RDF following the SPHN RDF Schema. They are based on the mock-data introduced in previous sections (see mock-data description and loading instructions) and comply with the SPHN RDF Schema version 2024.2.

1. Patients allergic to Peanuts

Here, the question is which patients are allergic to the peanuts. To address this question, the graph pattern to be matched by the query is shown in Figure 1. In this graph, information about the patient is denoted by an instance of a sphn:SubjectPseudoIdentifier class. It is possible to find out which allergy episode links to this patient by matching an instance of an sphn:AllergyEpisode class linked by the sphn:hasSubjectPseudoIdentifier property. Similarly, the allergen causing the allergy episode can be found by matching an instance of a sphn:Allergen class linked by a sphn:hasAllergen property.

In order to search for patients allergic to Peanuts, the substance can be fixed to Peanut by matching instances of sphn:Allergen class linked to snomed:762952008 by the sphn:hasCode property.

Graph for Patients allergic to Peanuts.

Figure 1: Graph for Patients allergic to Peanuts.

Figure 2 visualizes the graph pattern used for matching Patients allergic to Peanuts, and is followed by the corresponding SPARQL query implementation.

The SPARQL query, after defining the prefixes, retrieves distinct patients (?patient variable of interest). The graph pattern starts by stating that the patients must be of type sphn:SubjectPseudoIdentifier. Next, the ?allergy_episode variable representing the sphn:AllergyEpisode, a ?allergen variable representing the sphn:Allergen class, and a ?code variable of type snomed:762952008 are defined.

In order to get the data, one still needs to link all of these variables together. To that end, ?allergy_episode is linked to ?patient through the sphn:hasSubjectPseudoIdentifier property, and to ?allergen through the sphn:hasAllergen property. Finally, ?allergen is linked to ?code through the sphn:hasCode property. Running this query in a given triplestore will retrieve any data (i.e., list of patients) matching this graph pattern (see Figure 3 for an excerpt of the results).

Diagram complementing the SPARQL query for Patients allergic to Peanuts.

Figure 2: Diagram complementing the SPARQL query for Patients allergic to Peanuts.

# List of patients with an allergy to peanuts

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX snomed: <http://snomed.info/id/>

SELECT distinct ?patient
WHERE {
    ?patient a sphn:SubjectPseudoIdentifier .
    ?allergy_episode a sphn:AllergyEpisode .
    ?allergen a sphn:Allergen .

    ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
    ?allergy_episode sphn:hasAllergen ?allergen .
    ?allergen sphn:hasCode ?code .

    ?code a snomed:762952008 .
}

By modifying the above SPARQL query with the COUNT(distinct ...) statement (see the following code block), it is possible to determine the exact count of the matched patients. For example, evaluating this query in GraphDB with the mock-data used throughout this guide produces the result shown in Figure 3. Note the use of the FILTER(...) statement to retrieve only labels from SNOMED CT.

 # Count of patients with an allergy to peanuts (snomed:762952008)

SELECT (COUNT(distinct ?patient) AS ?patients) ?snomed_code ?label
WHERE {
    ?patient a sphn:SubjectPseudoIdentifier .
    ?allergy_episode a sphn:AllergyEpisode .
    ?allergen a sphn:Allergen .

    ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
    ?allergy_episode sphn:hasAllergen ?allergen .
    ?allergen sphn:hasCode ?code .

    ?code a snomed:762952008 .
    ?code rdf:type ?snomed_code .
    ?snomed_code rdfs:label ?label .

    FILTER(strStarts(str(?snomed_code), "http://snomed.info/id/"))
} GROUP BY ?snomed_code ?label
Results of running the SPARQL query for Patients allergic to Peanuts in GraphDB on the mock-data.

Figure 3: Results of running the modified SPARQL query for Patients allergic to Peanuts in GraphDB on the mock-data.

2. Patients allergic to Pulse Vegetable

Here, the question is which patients are allergic to pulse vegetable. We do not find this information directly in the data, as the data is often collected at a more granular level e.g. allergy to lentils or beans or beansprouts. Therefore, the hierarchical structure of Pulse Vegetable from SNOMED CT needs to be considered, as shown in Figure 4. However, it is not needed to query individually for all levels in order to get all patients that are allergic to Pulse Vegetable. Thanks to the RDF graph structure and the hierarchy of SNOMED CT, the query can be done in a more straightforward way.

Hierarchical structure of Pulse Vegetable from SNOMED CT.

Figure 4: Hierarchical structure of Pulse Vegetable from SNOMED CT.

Query option 1: reasoning without RDF inference

List of patients

Figure 5 visualizes the graph pattern used for matching Patients allergic to Pulse vegetable, and is followed by an implementation enabling reasoning with SPARQL query (Note: without inference turned on).

../_images/graph_patients_allergy_pulse_veg_query.png

Figure 5: Diagram complementing the SPARQL query for Patients allergic to Pulse Vegetable.

The query is in large part the same as in the previous example in Figure 3. The difference occurs in the statements after the row with the ?allergen sphn:hasCode ?code triple:

  • the variable ?pulse_veg_and_descendants is introduced, representing the Pulse Vegetable causing the allergy

  • the ?pulse_veg_and_descendants is defined as being a subclass of the snomed:227313005 code using the rdfs:subClassOf* property. Note that the * following the rdfs:subClassOf will look into all the nested levels from Pulse Vegetable (i.e., not only Beansprouts, Peanut, and other concepts that are immediate subclasses of snomed:227313005, but also Peanut Butter, Salted Peanut, etc.).

Running this query in a given triplestore will retrieve any data (i.e., list of patients) matching this graph pattern.

# List of patients with an allergy to any kind of pulse vegetable

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX snomed: <http://snomed.info/id/>

SELECT DISTINCT ?patient
WHERE {
    ?patient a sphn:SubjectPseudoIdentifier .
    ?allergy_episode a sphn:AllergyEpisode .
    ?allergen a sphn:Allergen .

    ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
    ?allergy_episode sphn:hasAllergen ?allergen .
    ?allergen sphn:hasCode ?code .
    ?code a ?pulse_veg_and_descendants .
    ?pulse_veg_and_descendants rdfs:subClassOf* snomed:227313005 .
}

Count of patients

By modifying the above SPARQL query with the COUNT(distinct ...) statement (see the following code block), it is possible to determine the exact count of patients allergic to Pulse Vegetable grouped by labels of annotated substance they are allergic to. For example, evaluating this query in GraphDB with the mock-data used throughout this guide produces the result shown in Figure 6. Note the use of the FILTER(...) statement to retrieve only labels from SNOMED CT.

 # Count of patients with an allergy to any kind of pulse vegetable

 PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
 PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
 PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
 PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
 PREFIX snomed: <http://snomed.info/id/>

SELECT (COUNT (DISTINCT ?patient) AS ?patients) (?pulse_veg_and_descendants AS ?snomed_code) ?label
WHERE {
    ?patient a sphn:SubjectPseudoIdentifier .
    ?allergy_episode a sphn:AllergyEpisode .
    ?allergen a sphn:Allergen .

    ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
    ?allergy_episode sphn:hasAllergen ?allergen .
    ?allergen sphn:hasCode ?code .

    ?code a ?pulse_veg_and_descendants .
    ?pulse_veg_and_descendants rdfs:subClassOf* snomed:227313005 .
    ?pulse_veg_and_descendants rdfs:label ?label .
    FILTER(strStarts(str( ?pulse_veg_and_descendants), "http://snomed.info/id/"))

} GROUP BY ?pulse_veg_and_descendants ?label
Results of running the modified SPARQL query for Patients allergic to Pulse Vegetable in GraphDB on the mock-data.

Figure 6: Results of running the modified SPARQL query for Patients allergic to Pulse Vegetable in GraphDB on the mock-data.

3. RDF reasoning

RDF reasoning enables the computer to deduce knowledge based on provided information and using some logical statements. For example, given the information in an schema on hierarchies (Class vs. Subclass; Property vs Subproperty) and in the provided data, the computer is able to do some reasoning with respect to that hierarchy. Two such examples are shown in Figure 7:

  • based on information coming from the schema that HeartRate is a Subclass of Measurement, and that data X is a HeartRate, the computer is able to infer that X is a Measurement.

  • based on information coming from the schema that hasInsertionSite is a Subproperty of hasBodySite, and that data X hasInsertionSite Arm, the computer is able to infer that X hasBodySite Arm.

In general, patients can have information annotated at different levels of granularity. As mentioned previously, it is not necessary to query individually for all levels of information to get patients that match certain criteria thanks to the RDF graph structure and the hierarchical knowledge provided by SNOMED CT.

Example of RDF reasoning.

Figure 7: Example of RDF reasoning.

4. Patients allergic to Pulse Vegetable (inference turned on)

Setup inference in GraphDB

Inference enables to deduce new knowledge from existing information, which constitutes one of the main strength of the RDF data representation (see section about 3. RDF reasoning). Before writing queries in GraphDB, it is important to setup the tool properly for enabling such inference to be done, when needed. Here, we demonstrate how to setup inference in GraphDB.

For an overview of the options to load RDF data into GraphDB, please refer to loading instructions or GraphDB’s documentation.

Creating a new repository

In order to be able to use inference capabilities of GraphDB, inference needs to be enabled when creating a new repository, as this currently can not be done afterwards. Note that a default inference is already enabled, as shown in Figure 8.

Creating a new repository with default inference enabled.

Figure 8: Creating a new repository with default inference enabled.

User settings

Following the creation of the repository with inference enabled, one can choose to exclude/include inferred data in results by selecting the corresponding option in the SPARQL editor settings (see Figure 9 and GraphDB documentation for more information).

One can choose to exclude/include inferred data in results by selecting the corresponding option in the SPARQL editor settings.

Figure 9: One can choose to exclude/include inferred data in results by selecting the corresponding option in the SPARQL editor settings.

SPARQL Editor

The GraphDB SPARQL editor allows to include or exclude inferred statements in the results by clicking the >> icon, as shown in Figure 10 (see GraphDB documentation for more information).

Enable inclusion

Figure 10: Enabling inclusion (both elements of the arrow icon are a solid line) of inferred statements in the results in GraphDB built in SPARQL editor.

Note

GraphDB System Statements can be used to disable inference from within a SPARQL query, and also to enable it again (assuming the inference was enabled during the repository creation).

Warning

During offline bulk loading of data, no inference is enabled and GraphDB inference settings are ignored (see GraphDB documentation for further information).

List of patients

Using the reasoning possibilities, the query to retrieve patients allergic to Pulse Vegetable can be simplified. In comparison to when the inference is off, when inference is turned on the main difference occuring in the query statement regarding the pulse vegetables is:

  • the ?code a snomed:227313005 . line results again in retrieving all the nested levels from Pulse Vegetable, this time, however, without the need of using rdfs:subClassOf.

Running this query in a given triplestore with inference turned on retrieves the desired list of patients (note the arrows on the right side of the editor that can be used to turn inference ON/OFF).

# List of patients allergic to any kind of pulse vegetable (inference on)

 PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
 PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
 PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
 PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
 PREFIX snomed: <http://snomed.info/id/>

 SELECT DISTINCT ?patient
 WHERE {
     ?patient a sphn:SubjectPseudoIdentifier .
     ?allergy_episode a sphn:AllergyEpisode .
     ?allergen a sphn:Allergen .

     ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
     ?allergy_episode sphn:hasAllergen ?allergen .
     ?allergen sphn:hasCode ?code .
     ?code a snomed:227313005 .
 }

Count of patients

By modifying the above SPARQL query with the COUNT(distinct ...) statement (see the following code block) it is possible to determine the exact count of patients allergic to Pulse Vegetable grouped by labels of annotated substance they are allergic to, and with additional inference. For example, evaluating this query in GraphDB with the mock-data used throughout this guide and inference turned on produces the result shown in Figure 11. Note that, with inference turned on, the count of patients allergic to Pulse Vegetable has increased to 77, which is a more accurate statement regarding allergy to any type of Pulse Vegetable.

# Count of patients allergic to any kind of pulse vegetable (inference on)

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX snomed: <http://snomed.info/id/>

SELECT (COUNT (DISTINCT ?patient) as ?patients) ?snomed_code ?label
WHERE {
   ?patient a sphn:SubjectPseudoIdentifier .
   ?allergy_episode a sphn:AllergyEpisode .
   ?allergen a sphn:Allergen .

   ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
   ?allergy_episode sphn:hasAllergen ?allergen .
   ?allergen sphn:hasCode ?code.
   ?code a snomed:227313005 .
   ?code rdf:type ?snomed_code .
   ?snomed_code rdfs:label ?label .
   FILTER(strStarts(str(?snomed_code), "http://snomed.info/id/"))
} GROUP BY ?snomed_code ?label
Results of running the modified SPARQL query for Patients allergic to Pulse Vegetable with inference turned on in GraphDB on the mock-data.

Figure 11: Results of running the modified SPARQL query for Patients allergic to Pulse Vegetable with inference turned on in GraphDB on the mock-data.

5. Patient with measurements of Leukocytes in Blood by Automated count (LOINC 6690-2)

Here, the question is which patients have had a lab test done identified by a specific LOINC code. To address this question, the graph pattern to be matched by the query is shown in Figure 12. In this graph pattern, instances of a sphn:LabTestEvent class are linked by the sphn:hasSubjectPseudoIdentifier property to instances of a sphn:SubjectPseudoIdentifier class. Patients that had measurements of Leukocytes in Blood by Automated count are queried by matching instances of sphn:LabTest class linked by the sphn:hasCode property to loinc:6690-2. Finally, the instance of sphn:LabTestEvent (linked to a specific patient) must be linked to the instance of sphn:LabTest (linked to the specific LOINC lab test). Note that a sphn:LabTest holds a LabResult which usually contains the test’s resulting value (and associated unit).

: Graph for Patient with measurements of Leukocytes in Blood by Automated count.

Figure 12: Graph for Patient with measurements of Leukocytes in Blood by Automated count (LOINC 6690-2).

Figure 13 visualizes the graph pattern used for matching Patients with measurements of Leukocytes in Blood by Automated count, and is followed by the corresponding SPARQL query implementation.

Similar as in the previous example, the SPARQL query retrieves distinct patients, and the graph pattern starts by stating that the patients must be of type sphn:SubjectPseudoIdentifier. Next, the ?lab_test variable representing the sphn:LabTest, and a ?code variable of type loinc:6690-2 are defined. The ?lab_test_event variable represent the sphn:LabTestEvent that connect the patient and the lab test of interest. In order to get the data, the ?lab_test_event is linked to ?patient through the sphn:hasSubjectPseudoIdentifier property, and to ?lab_test through the sphn:hasLabTest. The ?lab_test is linked to the ?code through the sphn:hasCode property.

Running this query in a given triplestore will retrieve any data (i.e., list of patients) matching this graph pattern.

For example, evaluating this query in GraphDB with the mock-data used throughout this guide will retrieve all patients annotated with having a lab test code measurements of Leukocytes in Blood by Automated count (see Figure 13 for an excerpt of the results).

: Diagram complementing the SPARQL query for Patient with measurements of Leukocytes in Blood by Automated count.

Figure 13: Diagram complementing the SPARQL query for Patient with measurements of Leukocytes in Blood by Automated count.

# List of patients with a lab test event where their leucocyte in blood has been measured

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX loinc: <https://loinc.org/rdf/>

SELECT distinct ?patient
WHERE {
    ?patient a sphn:SubjectPseudoIdentifier
    ?lab_test_event a sphn:LabTestEvent .
    ?lab_test a sphn:LabTest .


    ?lab_test_event sphn:hasSubjectPseudoIdentifier ?patient .
    ?lab_test_event sphn:hasLabTest ?lab_test .
    ?lab_test sphn:hasCode ?test_code .
    ?test_code rdf:type loinc:6690-2 .
}

6. Min/max values for measurements of Leukocytes in Blood by Automated count (LOINC 6690-2)

Here, the question is what are the min/max values measured for the result of a lab test identified with a specific LOINC code. To address this question for measurements of Leukocytes in Blood by Automated count , the following query can be ran:

 # Get the list of min and max values for each LOINC tests performed

 PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
 PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
 PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
 PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
 PREFIX loinc: <https://loinc.org/rdf/>
 PREFIX sphn-loinc: <https://biomedit.ch/rdf/sphn-resource/loinc/>

SELECT ?code (MIN(?lab_res_value) AS ?min_value) (MAX(?lab_res_value) AS ?max_value) ?lab_res_unit
 WHERE {

    ?lab_res a sphn:LabResult .
    ?lab_test a sphn:LabTest .

    ?lab_test sphn:hasResult ?lab_res .

    ?lab_test sphn:hasCode ?code .
    ?code rdf:type loinc:6690-2 .

    ?lab_res sphn:hasQuantity ?result .
    ?result sphn:hasValue ?lab_res_value .
    ?result sphn:hasUnit/sphn:hasCode ?lab_res_unit.
 }
 GROUP BY ?code ?lab_res_unit

References

Further information is available in the following references: