Query data with SPARQL
Note
To find out more watch the Querying Data with SPARQL Training. For an introduction to SPARQL, visit the SPARQL Background section.
Target Audience
This document is mainly intended for researchers who are interested in querying their data for data profiling or exploration purposes. The document provides the following information:
how to setup the inference in GraphDB before running any query
how to build queries for getting statistical knowledge about data
some examples of queries for exploring the content of a particular data.
Data profiling querying - Summary statistics
A set of statistical queries is provided for making an initial evaluation of the data content and quality. These statistical queries can be run in any triplestore that enables the querying of RDF data (e.g. GraphDB, Jena) by simply copy-pasting the content of the queries into the querying field. They are integrated in the Quality Check tool.
More information on how to run SPARQL can be found in Training Video.
The queries are mostly built in the following manner:
First, prefixes used in the queries need to be defined to facilitate the (human) reading and writing of the query:
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX spo:<https://biomedit.ch/rdf/sphn-schema/spo#>
Then the type of query should be specified.
Here, the query form SELECT
is used for specifying variables
expected to be provided in the results.
Note
Variables start with a question mark ?
in SPARQL.
SELECT ?concept (COUNT(?resource) AS ?sphn_concepts_resources)
(COUNT(distinct ?subject) as ?subject_cnt)
(COUNT(distinct ?case) as ?case_cnt)
(COUNT(distinct ?provider) as ?provider_cnt)
In this example, five variables will be given as result:
a concept,
the count of resource for that concept as well as
the count of subjects,
cases and
providers for that concept.
Following the definition of variables,
the graph pattern of interest must then be specified with the clause WHERE
:
WHERE {
{ ?concept rdfs:subClassOf+ sphn:SPHNConcept } UNION { ?concept rdfs:subClassOf+ spo:SPOConcept } .
?resource a ?concept .
optional {?resource sphn:hasDataProvider ?provider}
optional {?resource sphn:hasSubjectPseudoIdentifier ?subject}
optional {?resource sphn:hasAdministrativeCase ?case}
Here, the query searches for patterns where a
resource
is defined with the RDF class type concept
.
This concept
class must be at least either a subClass of SPHN or SPO.
And finally, the resource can optionally be connected to:
a
data provider
ora
subject pseudo identifier
ora
administrative case
.
Next it is possible to filter out for graph patterns that are not of
interest and that should not be returned in the results,
with the clause FILTER NOT EXISTS
:
FILTER NOT EXISTS {?concept rdfs:subClassOf sphn:ValueSet}
Here, the query filters out classes that are subClasses of the SPHN class ValueSet
.
Finally, it is possible to end with some query modifiers
,
here the query ends by grouping results for a given variable:
} group by ?concept order by desc(?sphn_concepts_resources)
In this example, the results are grouped by concepts retrieved.
For data following project-specific schemas, it may be necessary to adjust the queries to fit the search for certain elements. For any help, please contact the fair-data-team@sib.swiss.
Data exploration querying
The following examples are meant to showcase ways to write queries for exploring the content of data in RDF following the SPHN RDF Schema. They are based on the mock-data introduced in previous sections (see mock-data description and loading instructions) and comply with the SPHN RDF Schema version 2024.2.
1. Patients allergic to Peanuts
Here, the question is which patients are allergic to the peanuts.
To address this question, the graph pattern to be matched by the query
is shown in Figure 1. In this graph, information about the
patient is denoted by an instance of a sphn:SubjectPseudoIdentifier
class.
It is possible to find out which allergy episode links to this patient by matching
an instance of an sphn:AllergyEpisode
class linked by the
sphn:hasSubjectPseudoIdentifier
property.
Similarly, the allergen causing the allergy episode can be found by
matching an instance of a sphn:Allergen
class linked by a
sphn:hasAllergen
property.
In order to search for patients allergic to Peanuts,
the substance can be fixed to Peanut
by matching instances of sphn:Allergen
class linked to
snomed:762952008
by the sphn:hasCode
property.
Figure 1: Graph for Patients allergic to Peanuts.
Figure 2 visualizes the graph pattern used for matching Patients allergic to Peanuts, and is followed by the corresponding SPARQL query implementation.
The SPARQL query, after defining the prefixes, retrieves distinct patients
(?patient
variable of interest).
The graph pattern starts by stating that the patients must be of type
sphn:SubjectPseudoIdentifier
.
Next, the ?allergy_episode
variable representing the
sphn:AllergyEpisode
, a ?allergen
variable representing
the sphn:Allergen
class, and a ?code
variable of type snomed:762952008
are defined.
In order to get the data, one still needs to link all of these variables together.
To that end, ?allergy_episode
is linked to ?patient
through the sphn:hasSubjectPseudoIdentifier
property,
and to ?allergen
through the sphn:hasAllergen
property.
Finally, ?allergen
is linked to ?code
through
the sphn:hasCode
property. Running this query in a given
triplestore will retrieve any data (i.e., list of patients) matching this
graph pattern (see Figure 3 for an excerpt of the results).
Figure 2: Diagram complementing the SPARQL query for Patients allergic to Peanuts.
# List of patients with an allergy to peanuts
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX snomed: <http://snomed.info/id/>
SELECT distinct ?patient
WHERE {
?patient a sphn:SubjectPseudoIdentifier .
?allergy_episode a sphn:AllergyEpisode .
?allergen a sphn:Allergen .
?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
?allergy_episode sphn:hasAllergen ?allergen .
?allergen sphn:hasCode ?code .
?code a snomed:762952008 .
}
By modifying the above SPARQL query with the
COUNT(distinct ...)
statement (see the following code block),
it is possible to determine the exact count of the matched patients.
For example, evaluating this query in GraphDB with the mock-data used throughout
this guide produces the result shown in Figure 3.
Note the use of the FILTER(...)
statement to retrieve only labels from SNOMED CT.
# Count of patients with an allergy to peanuts (snomed:762952008)
SELECT (COUNT(distinct ?patient) AS ?patients) ?snomed_code ?label
WHERE {
?patient a sphn:SubjectPseudoIdentifier .
?allergy_episode a sphn:AllergyEpisode .
?allergen a sphn:Allergen .
?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
?allergy_episode sphn:hasAllergen ?allergen .
?allergen sphn:hasCode ?code .
?code a snomed:762952008 .
?code rdf:type ?snomed_code .
?snomed_code rdfs:label ?label .
FILTER(strStarts(str(?snomed_code), "http://snomed.info/id/"))
} GROUP BY ?snomed_code ?label
Figure 3: Results of running the modified SPARQL query for Patients allergic to Peanuts in GraphDB on the mock-data.
2. Patients allergic to Pulse Vegetable
Here, the question is which patients are allergic to pulse vegetable. We do not find this information directly in the data, as the data is often collected at a more granular level e.g. allergy to lentils or beans or beansprouts. Therefore, the hierarchical structure of Pulse Vegetable from SNOMED CT needs to be considered, as shown in Figure 4. However, it is not needed to query individually for all levels in order to get all patients that are allergic to Pulse Vegetable. Thanks to the RDF graph structure and the hierarchy of SNOMED CT, the query can be done in a more straightforward way.
Figure 4: Hierarchical structure of Pulse Vegetable from SNOMED CT.
Query option 1: reasoning without RDF inference
List of patients
Figure 5 visualizes the graph pattern used for matching Patients allergic to Pulse vegetable, and is followed by an implementation enabling reasoning with SPARQL query (Note: without inference turned on).
Figure 5: Diagram complementing the SPARQL query for Patients allergic to Pulse Vegetable.
The query is in large part the same as in the previous example in Figure 3.
The difference occurs in the statements after the row with the ?allergen sphn:hasCode ?code
triple:
the variable
?pulse_veg_and_descendants
is introduced, representing the Pulse Vegetable causing the allergythe
?pulse_veg_and_descendants
is defined as being a subclass of thesnomed:227313005
code using therdfs:subClassOf*
property. Note that the*
following therdfs:subClassOf
will look into all the nested levels from Pulse Vegetable (i.e., not only Beansprouts, Peanut, and other concepts that are immediate subclasses ofsnomed:227313005
, but also Peanut Butter, Salted Peanut, etc.).
Running this query in a given triplestore will retrieve any data (i.e., list of patients) matching this graph pattern.
# List of patients with an allergy to any kind of pulse vegetable
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX snomed: <http://snomed.info/id/>
SELECT DISTINCT ?patient
WHERE {
?patient a sphn:SubjectPseudoIdentifier .
?allergy_episode a sphn:AllergyEpisode .
?allergen a sphn:Allergen .
?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
?allergy_episode sphn:hasAllergen ?allergen .
?allergen sphn:hasCode ?code .
?code a ?pulse_veg_and_descendants .
?pulse_veg_and_descendants rdfs:subClassOf* snomed:227313005 .
}
Count of patients
By modifying the above SPARQL query with the COUNT(distinct ...)
statement (see the following code block), it is possible to determine
the exact count of patients allergic to Pulse Vegetable grouped by labels
of annotated substance they are allergic to.
For example, evaluating this query in GraphDB with the mock-data used throughout
this guide produces the result shown in Figure 6.
Note the use of the FILTER(...)
statement to retrieve only labels from SNOMED CT.
# Count of patients with an allergy to any kind of pulse vegetable
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX snomed: <http://snomed.info/id/>
SELECT (COUNT (DISTINCT ?patient) AS ?patients) (?pulse_veg_and_descendants AS ?snomed_code) ?label
WHERE {
?patient a sphn:SubjectPseudoIdentifier .
?allergy_episode a sphn:AllergyEpisode .
?allergen a sphn:Allergen .
?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
?allergy_episode sphn:hasAllergen ?allergen .
?allergen sphn:hasCode ?code .
?code a ?pulse_veg_and_descendants .
?pulse_veg_and_descendants rdfs:subClassOf* snomed:227313005 .
?pulse_veg_and_descendants rdfs:label ?label .
FILTER(strStarts(str( ?pulse_veg_and_descendants), "http://snomed.info/id/"))
} GROUP BY ?pulse_veg_and_descendants ?label
Figure 6: Results of running the modified SPARQL query for Patients allergic to Pulse Vegetable in GraphDB on the mock-data.
3. RDF reasoning
RDF reasoning enables the computer to deduce knowledge based on provided information and using some logical statements. For example, given the information in an schema on hierarchies (Class vs. Subclass; Property vs Subproperty) and in the provided data, the computer is able to do some reasoning with respect to that hierarchy. Two such examples are shown in Figure 7:
based on information coming from the schema that HeartRate is a Subclass of Measurement, and that data X is a HeartRate, the computer is able to infer that X is a Measurement.
based on information coming from the schema that hasInsertionSite is a Subproperty of hasBodySite, and that data X hasInsertionSite Arm, the computer is able to infer that X hasBodySite Arm.
In general, patients can have information annotated at different levels of granularity. As mentioned previously, it is not necessary to query individually for all levels of information to get patients that match certain criteria thanks to the RDF graph structure and the hierarchical knowledge provided by SNOMED CT.
Figure 7: Example of RDF reasoning.
4. Patients allergic to Pulse Vegetable (inference turned on)
Setup inference in GraphDB
Inference enables to deduce new knowledge from existing information, which constitutes one of the main strength of the RDF data representation (see section about 3. RDF reasoning). Before writing queries in GraphDB, it is important to setup the tool properly for enabling such inference to be done, when needed. Here, we demonstrate how to setup inference in GraphDB.
For an overview of the options to load RDF data into GraphDB, please refer to loading instructions or GraphDB’s documentation.
Creating a new repository
In order to be able to use inference capabilities of GraphDB, inference needs to be enabled when creating a new repository, as this currently can not be done afterwards. Note that a default inference is already enabled, as shown in Figure 8.
Figure 8: Creating a new repository with default inference enabled.
User settings
Following the creation of the repository with inference enabled, one can choose to exclude/include inferred data in results by selecting the corresponding option in the SPARQL editor settings (see Figure 9 and GraphDB documentation for more information).
Figure 9: One can choose to exclude/include inferred data in results by selecting the corresponding option in the SPARQL editor settings.
SPARQL Editor
The GraphDB SPARQL editor allows to include or exclude inferred statements
in the results by clicking the >>
icon, as shown in Figure 10
(see GraphDB documentation
for more information).
Figure 10: Enabling inclusion (both elements of the arrow icon are a solid line) of inferred statements in the results in GraphDB built in SPARQL editor.
Note
GraphDB System Statements can be used to disable inference from within a SPARQL query, and also to enable it again (assuming the inference was enabled during the repository creation).
Warning
During offline bulk loading of data, no inference is enabled and GraphDB inference settings are ignored (see GraphDB documentation for further information).
List of patients
Using the reasoning possibilities, the query to retrieve patients allergic to Pulse Vegetable can be simplified. In comparison to when the inference is off, when inference is turned on the main difference occuring in the query statement regarding the pulse vegetables is:
the
?code a snomed:227313005 .
line results again in retrieving all the nested levels from Pulse Vegetable, this time, however, without the need of usingrdfs:subClassOf
.
Running this query in a given triplestore with inference turned on retrieves the desired list of patients (note the arrows on the right side of the editor that can be used to turn inference ON/OFF).
# List of patients allergic to any kind of pulse vegetable (inference on)
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX snomed: <http://snomed.info/id/>
SELECT DISTINCT ?patient
WHERE {
?patient a sphn:SubjectPseudoIdentifier .
?allergy_episode a sphn:AllergyEpisode .
?allergen a sphn:Allergen .
?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
?allergy_episode sphn:hasAllergen ?allergen .
?allergen sphn:hasCode ?code .
?code a snomed:227313005 .
}
Count of patients
By modifying the above SPARQL query with the COUNT(distinct ...)
statement (see the following code block) it is possible to determine the exact
count of patients allergic to Pulse Vegetable grouped by labels of annotated
substance they are allergic to, and with additional inference.
For example, evaluating this query in GraphDB with the mock-data used throughout
this guide and inference turned on produces the result shown in Figure 11.
Note that, with inference turned on, the count of patients allergic to Pulse Vegetable
has increased to 77, which is a more accurate statement regarding allergy to any type of Pulse Vegetable.
# Count of patients allergic to any kind of pulse vegetable (inference on)
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX snomed: <http://snomed.info/id/>
SELECT (COUNT (DISTINCT ?patient) as ?patients) ?snomed_code ?label
WHERE {
?patient a sphn:SubjectPseudoIdentifier .
?allergy_episode a sphn:AllergyEpisode .
?allergen a sphn:Allergen .
?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
?allergy_episode sphn:hasAllergen ?allergen .
?allergen sphn:hasCode ?code.
?code a snomed:227313005 .
?code rdf:type ?snomed_code .
?snomed_code rdfs:label ?label .
FILTER(strStarts(str(?snomed_code), "http://snomed.info/id/"))
} GROUP BY ?snomed_code ?label
Figure 11: Results of running the modified SPARQL query for Patients allergic to Pulse Vegetable with inference turned on in GraphDB on the mock-data.
5. Patient with measurements of Leukocytes in Blood by Automated count (LOINC 6690-2)
Here, the question is which patients have had a lab test done identified by a specific LOINC code.
To address this question, the graph pattern to be matched by the query is shown in Figure 12.
In this graph pattern, instances of a sphn:LabTestEvent
class are linked by the
sphn:hasSubjectPseudoIdentifier
property to instances of a sphn:SubjectPseudoIdentifier
class.
Patients that had measurements of Leukocytes in Blood by Automated count
are queried by matching instances of sphn:LabTest
class linked by the
sphn:hasCode
property to loinc:6690-2
. Finally, the instance of sphn:LabTestEvent
(linked to a specific patient)
must be linked to the instance of sphn:LabTest
(linked to the specific LOINC lab test).
Note that a sphn:LabTest
holds a LabResult
which usually contains the test’s resulting value (and associated unit).
Figure 12: Graph for Patient with measurements of Leukocytes in Blood by Automated count (LOINC 6690-2).
Figure 13 visualizes the graph pattern used for matching Patients with measurements of Leukocytes in Blood by Automated count, and is followed by the corresponding SPARQL query implementation.
Similar as in the previous example, the SPARQL query retrieves distinct patients,
and the graph pattern starts by stating that the patients must be of type
sphn:SubjectPseudoIdentifier
.
Next, the ?lab_test
variable representing the sphn:LabTest
,
and a ?code
variable of type loinc:6690-2
are defined.
The ?lab_test_event
variable represent the sphn:LabTestEvent
that connect the patient and the lab test of interest.
In order to get the data, the ?lab_test_event
is linked to ?patient
through the sphn:hasSubjectPseudoIdentifier
property, and to
?lab_test
through the sphn:hasLabTest
.
The ?lab_test
is linked to the ?code
through the sphn:hasCode
property.
Running this query in a given triplestore will retrieve any data (i.e., list of patients) matching this graph pattern.
For example, evaluating this query in GraphDB with the mock-data used throughout this guide will retrieve all patients annotated with having a lab test code measurements of Leukocytes in Blood by Automated count (see Figure 13 for an excerpt of the results).
Figure 13: Diagram complementing the SPARQL query for Patient with measurements of Leukocytes in Blood by Automated count.
# List of patients with a lab test event where their leucocyte in blood has been measured
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX loinc: <https://loinc.org/rdf/>
SELECT distinct ?patient
WHERE {
?patient a sphn:SubjectPseudoIdentifier
?lab_test_event a sphn:LabTestEvent .
?lab_test a sphn:LabTest .
?lab_test_event sphn:hasSubjectPseudoIdentifier ?patient .
?lab_test_event sphn:hasLabTest ?lab_test .
?lab_test sphn:hasCode ?test_code .
?test_code rdf:type loinc:6690-2 .
}
6. Min/max values for measurements of Leukocytes in Blood by Automated count (LOINC 6690-2)
Here, the question is what are the min/max values measured for the result of a lab test identified with a specific LOINC code. To address this question for measurements of Leukocytes in Blood by Automated count , the following query can be ran:
# Get the list of min and max values for each LOINC tests performed
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource:<https://biomedit.ch/rdf/sphn-resource/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX loinc: <https://loinc.org/rdf/>
PREFIX sphn-loinc: <https://biomedit.ch/rdf/sphn-resource/loinc/>
SELECT ?code (MIN(?lab_res_value) AS ?min_value) (MAX(?lab_res_value) AS ?max_value) ?lab_res_unit
WHERE {
?lab_res a sphn:LabResult .
?lab_test a sphn:LabTest .
?lab_test sphn:hasResult ?lab_res .
?lab_test sphn:hasCode ?code .
?code rdf:type loinc:6690-2 .
?lab_res sphn:hasQuantity ?result .
?result sphn:hasValue ?lab_res_value .
?result sphn:hasUnit/sphn:hasCode ?lab_res_unit.
}
GROUP BY ?code ?lab_res_unit
References
Further information is available in the following references: