SPARQL

Introduction

SPARQL (SPARQL Protocol and RDF Query Language) is the standard query language for RDF. More specifically, it is the declarative query language among the W3C standards for RDF. SPARQL borrows elements from RDF and is similar to SQL.

SPARQL queries are based on ‘graph pattern matching’, meaning that the tool doing the search will try to match the pattern in the query with the corresponding data and retrieve it.

Figure 1 shows a triple pattern in which the resource resource:HospitalA has the relation sphn:hasSubjectPseudoIdentifier to a variable ?patient. This is a valid pattern that can be used for a search; it yields the list of patients for resource:HospitalA. The syntax of SPARQL queries is similar to Turtle, but not identical.

Figure 1: Example of a graph. A resource HospitalA connects to a variable ?patient via a sphn:hasSubjectPseudoIdentifier link.

Note

A variable in SPARQL is always prefixed with a marker character in front of the variable name, usually a question mark (?); a dollar sign ($) is also accepted.

Structure of a query

At a minimum, a basic SPARQL query includes a SELECT clause and a WHERE clause.

It has the following structure:

SELECT <variables>
WHERE {
       <graph-pattern>
}

The WHERE clause consists of curly brackets that enclose the graph pattern.
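As a minimal sketch, the template can be filled with the pattern from Figure 1. The IRI used for HospitalA is a placeholder assumption, since the actual resource namespace depends on the project; full IRIs are written out here because no prefix is declared.

```sparql
SELECT ?patient
WHERE {
    # <https://example.org/resource/HospitalA> is a placeholder IRI (assumption)
    <https://example.org/resource/HospitalA>
        <https://biomedit.ch/rdf/sphn-schema/sphn#hasSubjectPseudoIdentifier>
        ?patient .
}
```

Each row of the result would hold one value of ?patient for which the triple exists in the data.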

Furthermore, a SPARQL query can include the following parts as well:

Prefix declarations

Prefix declarations are namespace declarations that allow prefixed names to be written in queries rather than full URIs. With prefix declarations, we can write shorter and clearer code:

prefix dc: <http://purl.org/dc/elements/1.1/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>

Note

With a prefix declared for SPHN, we can simply refer to sphn in our code rather than spell out the entire URI. Therefore, instead of writing:

 ?patient rdf:type <https://biomedit.ch/rdf/sphn-schema/sphn#SubjectPseudoIdentifier> .

We can write:

?patient a sphn:SubjectPseudoIdentifier .

Type of query declaration

There are four types of query declaration (more information in the W3C documentation on Query Forms):

SELECT
ASK
DESCRIBE
CONSTRUCT

Data set definition

If a triplestore provides multiple data graphs, the specific dataset against which the query should be run can be specified with:

FROM <...>
FROM NAMED <...>

Note

If the dataset is not defined, the query runs on the default dataset of the endpoint, which is usually the complete data set.
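The following sketch shows how the two clauses could be combined. Both graph IRIs are hypothetical, and the GRAPH keyword (not covered above) is used to match a pattern inside a named graph.

```sparql
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>

SELECT ?patient ?g
# Hypothetical graph IRIs (assumptions), for illustration only
FROM <https://example.org/graph/hospitalA>
FROM NAMED <https://example.org/graph/hospitalB>
WHERE {
    {
        # Matched against the default graph, which is built from the FROM clause
        ?patient a sphn:SubjectPseudoIdentifier .
    }
    UNION
    {
        # Matched inside a named graph; ?g is bound to the IRI of that graph
        GRAPH ?g { ?patient a sphn:SubjectPseudoIdentifier . }
    }
}
```

Patterns outside a GRAPH block only see the graphs listed with FROM, while patterns inside a GRAPH block only see the graphs listed with FROM NAMED.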

Graph pattern

The clause WHERE { ... } is used to define the graph pattern (in the form of triples) that the result of the query should comply with.

Query modifiers

Query modifiers change the way the output of the query is presented:

ORDER BY ... Establishes the order of a solution sequence
GROUP BY ... Divides the solutions into groups, over which aggregate values (e.g. COUNT, AVG) are calculated
HAVING ... Filters the grouped solutions by a condition, typically on an aggregate value
LIMIT ... Places an upper bound on the number of solutions returned
OFFSET ... Sets the position in the solution sequence at which the returned solutions start
BIND ... Assigns the value of an expression to a variable; unlike the other keywords in this list, BIND is written inside the graph pattern rather than after it
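As an illustration, the sketch below combines several of these keywords in one query. The property sphn:hasDataProviderInstitute is taken from the examples in this document, while the threshold of 100 patients and the page size of 10 are arbitrary values chosen for the example.

```sparql
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>

SELECT ?data_provider (COUNT(?patient) AS ?numPatients)
WHERE {
    ?patient a sphn:SubjectPseudoIdentifier .
    ?patient sphn:hasDataProviderInstitute ?data_provider .
}
GROUP BY ?data_provider            # one group per data provider
HAVING (COUNT(?patient) > 100)     # keep only groups with more than 100 patients
ORDER BY DESC(?numPatients)        # largest groups first
LIMIT 10                           # return at most 10 rows
OFFSET 10                          # skip the first 10 rows (second "page")
```

The modifiers are written after the closing bracket of the WHERE clause, in the order shown: GROUP BY, HAVING, ORDER BY, then LIMIT and OFFSET.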

Types of queries

There are four types of SPARQL queries:

`SELECT`

A SELECT query gets results for requested variables. The output is displayed in a table (see W3C documentation SELECT).

The example below retrieves all instances (?patient) where the type is a sphn:SubjectPseudoIdentifier.

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>

SELECT ?patient
WHERE {
    ?patient rdf:type sphn:SubjectPseudoIdentifier
 }

`ASK`

An ASK query checks for matches of a requested pattern, and results in a Boolean ‘yes/no’ output (see W3C documentation ASK).

In the example below, the question asked is whether patient77 had an allergy episode annotated.

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-schema/sphn#>

ASK
WHERE {
  ?patient a sphn:SubjectPseudoIdentifier .
  ?patient sphn:hasIdentifier "patient77" .
  ?allergy_episode a sphn:AllergyEpisode .
  ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .

}

`CONSTRUCT`

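A CONSTRUCT query returns an RDF graph that is built by filling a triple template with the solutions of the graph pattern. This makes it possible to extract specific parts of a graph or to create new triples as indicated in the query (see W3C documentation CONSTRUCT).

The query below creates a diagnosis for patients that have a lab test result coded with LOINC 6690-2. The result lists the patients having this new diagnosis, in the form of triples. Note that CONSTRUCT only returns these triples; it does not store them in the queried dataset.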

PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX snomed: <http://snomed.info/id/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# The two namespace IRIs below are not given in this document;
# they are assumptions and must be adapted to the actual data.
PREFIX resource: <https://example.org/resource/>
PREFIX loinc: <https://loinc.org/rdf/>

CONSTRUCT {
    resource:Diagnosis1 a sphn:Diagnosis .
    resource:Diagnosis1 sphn:hasSubjectPseudoIdentifier ?patient .
}
WHERE {
    ?patient a sphn:SubjectPseudoIdentifier .
    ?lab a sphn:LabTestEvent .
    ?lab sphn:hasSubjectPseudoIdentifier ?patient .
    ?lab sphn:hasLabTest/sphn:hasResult ?code .
    ?code a loinc:6690-2 .
}

`DESCRIBE`

A DESCRIBE query gets basic (triple) information about a variable or resource (see W3C documentation DESCRIBE).

In the example below, the query returns all information provided for patient78.

PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX snomed: <http://snomed.info/id/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
DESCRIBE ?thing
WHERE {
  ?thing a sphn:SubjectPseudoIdentifier .
  ?thing sphn:hasIdentifier "patient78" .

}

Query formation

In addition to the already mentioned query types, other constructs are also possible:

Nested queries

Nested queries are referred to as ‘subqueries’ in SPARQL (more information about subqueries). A subquery is a SELECT query placed inside the WHERE clause of another query; the results of the subquery are evaluated first and then projected to the outer query.

The following query calculates the average number of patients per data provider institute:

PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT (avg(?numPatients) AS ?avgNumPatientsByDataProvider)
WHERE {
    SELECT ?data_provider (count(?patient) AS ?numPatients)
    WHERE {
        ?patient a sphn:SubjectPseudoIdentifier .
        ?data_provider a sphn:DataProvider .
        ?patient sphn:hasDataProviderInstitute ?data_provider .
    } GROUP BY ?data_provider
}

Federated SPARQL

A federated query queries several SPARQL endpoints in the same query using a SERVICE clause (more information about federated querying). We can thereby combine information that lives in different datasets in one query.

In the example below, the assumption is that the SNOMED CT codes used to annotate the Substance come from the BioPortal instance of SNOMED CT. Using a SERVICE clause that connects to the BioPortal namespace of SNOMED CT (http://bioportal.bioontology.org/ontologies/SNOMEDCT/), it is possible to retrieve the preferred label of the SNOMED CT code 762952008, which corresponds to the Peanut substance that some patients are allergic to:

PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX snomed_bioportal: <http://purl.bioontology.org/ontology/SNOMEDCT/>

SELECT ?patient ?label
WHERE {
     ?patient a sphn:SubjectPseudoIdentifier .
     ?allergy_episode a sphn:AllergyEpisode .
     ?substance a sphn:Substance .

     ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
     ?allergy_episode sphn:hasSubstance ?substance .
     ?substance sphn:hasCode ?substance_code .

     SERVICE <http://bioportal.bioontology.org/ontologies/SNOMEDCT/> {
       ?substance_code a snomed_bioportal:762952008 .
       ?substance_code skos:prefLabel  ?label.
     }
}

Advanced queries

Querying with negation

There are two options for expressing a negation: either use FILTER with the ! operator, or use FILTER NOT EXISTS. FILTER with ! negates a single expression (for example !BOUND on a variable that comes from an OPTIONAL pattern), while FILTER NOT EXISTS can negate a pattern made of multiple statements.

The next two queries retrieve patients that have a diagnosis but no age information connected to that diagnosis. The first query shows the approach with FILTER and !, while the second query shows how the same is achieved with FILTER NOT EXISTS.

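PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>

SELECT DISTINCT ?patient

WHERE {
  ?diagnosis a sphn:Diagnosis .
  ?diagnosis sphn:hasSubjectPseudoIdentifier ?patient .
  # OPTIONAL leaves ?age unbound when the diagnosis has no age information;
  # as a required pattern, ?age would always be bound and the filter would remove every row
  OPTIONAL { ?diagnosis sphn:hasSubjectAge ?age . }

  # Keep only the solutions in which ?age is unbound
  FILTER ( !BOUND ( ?age ) )
}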

PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>

SELECT ?patient

WHERE {
  ?diagnosis a sphn:Diagnosis .
  ?diagnosis sphn:hasSubjectPseudoIdentifier ?patient .
  FILTER NOT EXISTS {
      ?diagnosis sphn:hasSubjectAge ?age .
  }
}

Querying with property paths

In SPARQL, property paths express routes through the graph that traverse one or more properties. They make it possible to reach, from one node, another node that is not necessarily connected to it by a single property. Property paths therefore allow ‘hopping’ from one node to indirectly connected nodes, and also finding connections between two nodes over paths of arbitrary length.

Below is an example query that finds, starting from a Lab Test Event, the results that are bound to it.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>

SELECT ?event ?result

WHERE {
  ?event a sphn:LabTestEvent .
  ?event sphn:hasLabTest/sphn:hasResult ?result .
}
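To make the ‘hop’ explicit, the sketch below writes the same sequence path in its long form with an intermediate variable (?lab_test, a name chosen for this example). It also shows a path of arbitrary length with the * operator; this part assumes that the data contains an rdfs:subClassOf class hierarchy, which is not stated in this document.

```sparql
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?event ?result
WHERE {
    # Arbitrary-length path: the direct type of ?event is sphn:LabTestEvent
    # or any class reachable through zero or more rdfs:subClassOf links
    ?event a/rdfs:subClassOf* sphn:LabTestEvent .

    # Long form of sphn:hasLabTest/sphn:hasResult with an explicit intermediate node
    ?event sphn:hasLabTest ?lab_test .
    ?lab_test sphn:hasResult ?result .
}
```

The sequence path with / is shorter when the intermediate node is not needed in the result; the long form is useful when ?lab_test itself should be returned or constrained further.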

Querying with nested statements

Nested queries, or subqueries, embed a query inside another query. From a query execution standpoint, the inner query is evaluated before the outer query. Subqueries are very useful to reduce the search space during the execution of a query. It is therefore recommended to write in the inner query the pattern that filters out the majority of statements, so that fewer statements are evaluated in the outer query. This can improve the execution time for big data analysis.

The following query shows an example that retrieves patients with a Body Mass Index (BMI) of 30 kg/m2 or above. The inner query filters for Quantity instances that have a value of at least 30 and the UCUM unit kg/m2. The outer query then finds the BMI instances connected to these quantities and links them to the associated patient (SubjectPseudoIdentifier).

PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
# The ucum namespace IRI is not given in this document; the one below is an assumption
PREFIX ucum: <https://biomedit.ch/rdf/sphn-resource/ucum/>

SELECT DISTINCT ?patient ?value
WHERE
{
    ?bmi a sphn:BodyMassIndex .
    ?bmi sphn:hasQuantity ?quantity .
    ?bmi sphn:hasSubjectPseudoIdentifier ?patient .
    {
        SELECT ?quantity ?value
        WHERE
        {
            ?quantity sphn:hasUnit ?ucum_resource .
            ?ucum_resource sphn:hasCode/a ucum:kgperm2 .

            ?quantity sphn:hasValue ?value .
            FILTER (?value >= 30)
        }
    }
}

Note

Some tips for working with SPARQL queries:

a is a shortcut for rdf:type
Prefixes are highly recommended for better readability
Being familiar with the dataset structure helps to write a query
A period ends a triple pattern in the WHERE clause; consecutive triple patterns are combined as a conjunction, i.e. AND
A semicolon at the end of a line in a WHERE clause introduces another property of the same subject
A comma at the end of a line in a WHERE clause introduces another object with the same predicate and subject
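SPARQL keywords (SELECT, WHERE, ...) are case-insensitive, with the exception of a, but variable names, prefixed names, IRIs and string literals are case-sensitive; make sure the statements are written accordingly.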
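The semicolon and comma shortcuts from the tips above can be sketched as follows. The classes and properties are taken from the examples in this document, and the variable names ?substance1 and ?substance2 are chosen for illustration only.

```sparql
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>

SELECT ?patient ?substance1 ?substance2
WHERE {
    # Semicolon: the subject ?allergy_episode is reused with a new predicate.
    # Comma: subject and predicate are reused with a new object.
    ?allergy_episode a sphn:AllergyEpisode ;
                     sphn:hasSubjectPseudoIdentifier ?patient ;
                     sphn:hasSubstance ?substance1, ?substance2 .

    # Long form of the same pattern:
    #   ?allergy_episode a sphn:AllergyEpisode .
    #   ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
    #   ?allergy_episode sphn:hasSubstance ?substance1 .
    #   ?allergy_episode sphn:hasSubstance ?substance2 .
}
```

Both forms describe the same graph pattern; the shortcuts only reduce repetition.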
