Warning

This page provides an introduction to SPARQL for querying RDF data. After reading it, you will know what SPARQL is, how a SPARQL query is structured, which kinds of SPARQL queries exist, and how to build and write SPARQL queries for validating RDF data.

SPARQL

Introduction

SPARQL (SPARQL Protocol and RDF Query Language) is the standard query language for RDF. More specifically, it is the declarative query language defined in the W3C standards. SPARQL borrows elements from RDF and is similar in style to SQL.

SPARQL queries are based on ‘graph pattern matching’, meaning that the query engine tries to match the pattern in the query against the data and retrieves the parts that match.

Figure 1 shows a triple in which the resource resource:HospitalA has a relation sphn:hasSubjectPseudoIdentifier to a variable ?patient. This is a valid pattern which can be used for a search, and it yields the list of patients for resource:HospitalA. The syntax of SPARQL queries is similar to Turtle (but not exactly the same).


Figure 1: Example of a graph. A resource HospitalA connects to a variable ?patient via a sphn:hasSubjectPseudoIdentifier link.

Note

A variable in SPARQL always includes a question mark in front of the variable name.
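
The pattern in Figure 1 could be written as the query below. This is only a minimal sketch: the IRI bound to the resource: prefix is an assumption and should be replaced by the namespace actually used in your data.

```sparql
PREFIX sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#>
# Assumed namespace for the resource: prefix; adjust to your dataset
PREFIX resource: <https://biomedit.ch/rdf/sphn-resource/>

SELECT ?patient
WHERE {
    # Subject and predicate are fixed; ?patient is bound to every matching object
    resource:HospitalA sphn:hasSubjectPseudoIdentifier ?patient .
}
```

Each row of the result table holds one value of ?patient for which the triple exists in the data.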

Structure of a query

At a minimum, a basic SPARQL query consists of a SELECT clause and a WHERE clause.

It has the following structure:

SELECT <variables>
WHERE {
       <graph-pattern>
}

The WHERE clause encloses the graph pattern in curly brackets.

Furthermore, a SPARQL query can include the following parts as well:

Prefix declarations

They are namespace declarations and allow for prefix names to be written in queries, rather than full URIs. With prefix declarations, we can write shorter and clearer code:

prefix dc: <http://purl.org/dc/elements/1.1/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#>

Note

With a prefix declared for SPHN, we can simply refer to sphn in our code rather than spell out the entire URI. Therefore, instead of writing:

?patient rdf:type <https://biomedit.ch/rdf/sphn-ontology/sphn#SubjectPseudoIdentifier>

We can write:

?patient a sphn:SubjectPseudoIdentifier

Type of query declaration

There are four types of query declaration (see Query Forms for more information):

  • SELECT

  • ASK

  • DESCRIBE

  • CONSTRUCT

Data set definition

If a triplestore holds multiple data graphs, the specific data set that the query should run against can be specified with:

  • FROM <...>

  • FROM NAMED <...>

Note

If the dataset is not defined, the query usually runs by default on the complete data set.
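
As an illustration, the sketch below combines both clauses. The two graph IRIs under http://example.org/ are hypothetical placeholders, not names taken from a real triplestore. Triples from the FROM graph form the default graph and are matched directly, while triples from a FROM NAMED graph are only reachable through a GRAPH pattern.

```sparql
PREFIX sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#>

SELECT ?patient ?g ?allergy_episode
# Hypothetical graph IRIs; replace with graphs that exist in your triplestore
FROM <http://example.org/graphs/patients>
FROM NAMED <http://example.org/graphs/allergies>
WHERE {
    # Matched against the default graph (built from the FROM clause)
    ?patient a sphn:SubjectPseudoIdentifier .

    # Matched against the named graph; ?g is bound to the graph IRI
    GRAPH ?g {
        ?allergy_episode a sphn:AllergyEpisode ;
                         sphn:hasSubjectPseudoIdentifier ?patient .
    }
}
```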

Graph pattern

The clause WHERE { ... } is used to define the graph pattern (in the form of triples) that the results of the query must match.

Query modifiers

They modify the way the output of the query is presented:

  • ORDER BY ... Establishes the order of a solution sequence

  • GROUP BY ... Divides the solutions into groups so that an aggregate value (for example a count or an average) can be calculated for each group

  • HAVING ... Operates over grouped solution sets and filters the groups by a condition, typically on an aggregate value

  • LIMIT ... Places a limit on the number of solutions returned

  • OFFSET ... Controls where the solutions start from

  • BIND ... Assigns a value or the result of an expression to a variable (unlike the other keywords in this list, BIND is written inside the graph pattern)
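
The following sketch shows how several of these modifiers can be combined. It reuses the sphn:AllergyEpisode class and the sphn:hasSubjectPseudoIdentifier property from the examples on this page. The threshold of one episode and the page size of ten are arbitrary values chosen for illustration.

```sparql
PREFIX sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#>

SELECT ?patient (COUNT(?allergy_episode) AS ?numEpisodes)
WHERE {
    ?allergy_episode a sphn:AllergyEpisode ;
                     sphn:hasSubjectPseudoIdentifier ?patient .
}
GROUP BY ?patient                      # one group per patient
HAVING (COUNT(?allergy_episode) > 1)   # keep only patients with more than one episode
ORDER BY DESC(?numEpisodes)            # most episodes first
LIMIT 10                               # return at most ten rows
OFFSET 0                               # start at the first row; use 10 for the next page
```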

Types of queries

There are four types of SPARQL queries:

SELECT

A SELECT query gets results for the requested variables. The output is displayed in a table (see W3C documentation SELECT).

The example below retrieves all instances (?patient) whose type is sphn:SubjectPseudoIdentifier.

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-ontology/sphn#>

SELECT ?patient
WHERE {
    ?patient rdf:type sphn:SubjectPseudoIdentifier
 }

ASK

An ASK query checks whether the requested pattern has at least one match, and results in a Boolean ‘true/false’ output (see W3C documentation ASK).

In the example below, the question asked is whether patient77 had an allergy episode annotated.

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX sphn:<https://biomedit.ch/rdf/sphn-ontology/sphn#>

ASK
WHERE {
  ?patient a sphn:SubjectPseudoIdentifier .
  ?patient sphn:hasIdentifier "patient77" .
  ?allergy_episode a sphn:AllergyEpisode .
  ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .

}

CONSTRUCT

A CONSTRUCT query gets specific parts of a graph and returns a new graph built from the triple template given in the query; the stored data itself is not modified (see W3C documentation CONSTRUCT).

The query below creates a diagnosis for patients that have a lab result with the LOINC lab test code 6690-2. The result lists the patients having this new diagnosis, in the form of triples.

PREFIX sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#>
PREFIX snomed: <http://snomed.info/id/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Assumed namespace IRIs for the resource: and loinc: prefixes; adjust to your dataset
PREFIX resource: <https://biomedit.ch/rdf/sphn-resource/>
PREFIX loinc: <https://loinc.org/rdf/>

CONSTRUCT {
    resource:Diagnosis1 a sphn:Diagnosis .
    resource:Diagnosis1 sphn:hasSubjectPseudoIdentifier ?patient .
}
WHERE {
    ?patient a sphn:SubjectPseudoIdentifier .
    ?lab a sphn:LabResult .
    ?lab sphn:hasSubjectPseudoIdentifier ?patient .
    ?lab sphn:hasLabResultLabTestCode ?code .
    ?code a loinc:6690-2 .
}

DESCRIBE

A DESCRIBE query gets basic (triple) information about a variable or resource (see W3C documentation DESCRIBE).

In the example below, the query returns all information provided for patient78.

PREFIX sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#>
PREFIX snomed: <http://snomed.info/id/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
DESCRIBE ?thing
WHERE {
  ?thing a sphn:SubjectPseudoIdentifier .
  ?thing sphn:hasIdentifier "patient78" .

}

Query formation

In addition to the already mentioned query types, other constructs are also possible:

Nested queries

Nested queries are referred to as ‘subqueries’ in SPARQL (more information about subqueries). A nested query is a SELECT clause within another SELECT clause: the subquery is evaluated first and its results are then projected to the outer query.

The following query calculates the average number of patients per data provider institute:

PREFIX sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT (avg(?numPatients) AS ?avgNumPatientsByDataProvider)
WHERE {
    SELECT ?data_provider (count(?patient) AS ?numPatients)
    WHERE {
        ?patient a sphn:SubjectPseudoIdentifier .
        ?data_provider a sphn:DataProviderInstitute .
        ?patient sphn:hasDataProviderInstitute ?data_provider .
    } GROUP BY ?data_provider
}

Federated SPARQL

A federated query allows for querying different SPARQL endpoints in the same query using a SERVICE clause (more information about federated querying). We can thereby combine information that lives in different datasets in one query.

In the example below, the assumption is that the SNOMED CT codes used to annotate the Substance come from the BioPortal instance of SNOMED CT. Using the SERVICE clause, which connects to the BioPortal namespace of SNOMED CT (http://bioportal.bioontology.org/ontologies/SNOMEDCT/), it is possible to retrieve the preferred label of the SNOMED CT code 762952008, which corresponds to the Peanut substance that some patients are allergic to:

PREFIX sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX snomed_bioportal: <http://purl.bioontology.org/ontology/SNOMEDCT/>

SELECT ?patient ?label
WHERE {
     ?patient a sphn:SubjectPseudoIdentifier .
     ?allergy_episode a sphn:AllergyEpisode .
     ?substance a sphn:Substance .

     ?allergy_episode sphn:hasSubjectPseudoIdentifier ?patient .
     ?allergy_episode sphn:hasSubstance ?substance .
     ?substance sphn:hasCode ?substance_code .

     SERVICE <http://bioportal.bioontology.org/ontologies/SNOMEDCT/> {
       ?substance_code a snomed_bioportal:762952008 .
       ?substance_code skos:prefLabel  ?label.
     }
}

Note

Some tips for working with SPARQL queries:

  • a is a shortcut for rdf:type

  • Prefixes are highly recommended for better readability

  • Being familiar with the dataset structure helps to write a query

  • The period at the end of a line in the WHERE clause is a conjunction, i.e. AND

  • A semicolon at the end of a line in a WHERE clause introduces another property of the same subject

  • A comma at the end of a line in a WHERE clause introduces another object with the same predicate and subject.
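
The punctuation tips above can be seen together in the sketch below, which reuses classes and properties from the examples on this page. The variable names ?substance_a and ?substance_b are only illustrative.

```sparql
PREFIX sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#>

SELECT ?patient ?allergy_episode ?substance_a ?substance_b
WHERE {
    # Semicolon: the second line reuses the subject ?patient
    ?patient a sphn:SubjectPseudoIdentifier ;
             sphn:hasIdentifier "patient77" .

    # Comma: both objects share the subject ?allergy_episode
    # and the predicate sphn:hasSubstance
    ?allergy_episode a sphn:AllergyEpisode ;
                     sphn:hasSubjectPseudoIdentifier ?patient ;
                     sphn:hasSubstance ?substance_a , ?substance_b .
}
```

Written without the shortcuts, the same pattern would need six full triples, each ending with a period. Note that ?substance_a and ?substance_b may bind to the same substance unless a FILTER excludes that case.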