Extract subsets of a graph

Many use cases and applications for a knowledge graph require generating subsets based on certain criteria. These criteria can be simple or complex depending on the use case. In SPHN, one of the key use cases is to create a subset graph that contains a selected set of patients and/or a selected set of concepts. The approach taken to generate the subset depends on the complexity of the criteria.

Patient data in named graphs

The simplest approach assumes that the data is already structured in a way that lends itself to creating sensible subsets, i.e. in named graphs.

For this approach, we make the following assumptions:

  • You are using GraphDB

  • Your triples in the graph repository are organized as one patient per named graph (see the documentation on Named Graphs and on Data Loading for loading data into named graphs)

  • You want a subset that corresponds to all the triples that belong to a patient

You can fetch the subset via GraphDB’s REST API. But first, you need to identify what the IRI of your named graph is. In SPHN, if your graph is organized such that data for each patient is in its own named graph then the named graph IRI will be the patient IRI.

Alternatively, you can retrieve the full list of named graphs in a given repository by querying GraphDB’s /contexts endpoint:

curl -X GET --header 'Accept: application/sparql-results+json' \
'http://localhost:7200/repositories/project-data/contexts'

This assumes that GraphDB is running at http://localhost:7200 and that the repository is named project-data.

This will give you a JSON object that contains a list of contexts where each context corresponds to a named graph.
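As a rough illustration of working with that response, the following Python sketch pulls the graph IRIs out of a SPARQL JSON results document. It assumes the binding is named contextID, as in the RDF4J REST protocol that GraphDB builds on; the sample payload and the function name context_iris are invented for this example.

```python
import json

# Hypothetical sample of what the /contexts endpoint might return
# (SPARQL JSON results; the binding name "contextID" is an assumption).
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["contextID"]},
  "results": {"bindings": [
    {"contextID": {"type": "uri", "value": "https://example.com/my-patient-graph1"}},
    {"contextID": {"type": "uri", "value": "https://example.com/my-patient-graph2"}}
  ]}
}
"""


def context_iris(payload: dict, var: str = "contextID") -> list[str]:
    """Return the IRI of every named graph listed in a SPARQL JSON result."""
    bindings = payload.get("results", {}).get("bindings", [])
    # Skip rows that do not bind the expected variable.
    return [row[var]["value"] for row in bindings if var in row]


for iri in context_iris(json.loads(SAMPLE_RESPONSE)):
    print(iri)
```

Each returned IRI can then be used as the graph identifier in the calls described next.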

Then you can call the /rdf-graphs endpoint to get all the triples from a given named graph:

curl --header 'Accept: application/x-turtle' \
'http://localhost:7200/repositories/project-data/rdf-graphs/patient-001'

Here, patient-001 is the name of the named graph retrieved from the context list.

One thing to keep in mind is that GraphDB makes a distinction between directly referencing a named graph vs indirectly referencing a named graph.

If your named graph has a name that is relative to the GraphDB instance, for example http://localhost:7200/repositories/<REPOSITORY_NAME>/rdf-graphs/<GRAPH_NAME>, then this is considered as directly referencing a named graph and the above API call is sufficient to fetch your triples.

But if your graph name is an IRI, for example https://example.com/my-patient-graph1, then this is considered as indirectly referencing a named graph, and you will have to make use of the /rdf-graphs/service endpoint instead. In SPHN, this is the common and likely scenario where data is requested from data providers in N-Quads or TriG formats and the data is loaded into GraphDB using IRIs for named graphs.

curl -X GET --header 'Accept: application/x-turtle' \
'http://localhost:7200/repositories/project-data/rdf-graphs/service?graph=https%3A%2F%2Fexample.com%2Fmy-patient-graph1'

Note: the IRI must be URL-encoded (percent-encoded) before it is used in the API call.
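The two ways of referencing a graph, and the encoding step, can be sketched in Python as follows. This is only an illustration: the constants reuse the host and repository name assumed in the examples above, and the function names direct_graph_url and indirect_graph_url are hypothetical.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:7200"  # assumed GraphDB address
REPOSITORY = "project-data"         # assumed repository name


def direct_graph_url(graph_name: str) -> str:
    """URL for a graph whose name is relative to the repository (direct reference)."""
    return f"{BASE_URL}/repositories/{REPOSITORY}/rdf-graphs/{quote(graph_name, safe='')}"


def indirect_graph_url(graph_iri: str) -> str:
    """URL for a graph identified by an arbitrary IRI (indirect reference)."""
    # safe="" makes quote() also escape ":" and "/", which must not
    # appear unescaped in the value of the graph parameter.
    encoded = quote(graph_iri, safe="")
    return f"{BASE_URL}/repositories/{REPOSITORY}/rdf-graphs/service?graph={encoded}"


print(direct_graph_url("patient-001"))
print(indirect_graph_url("https://example.com/my-patient-graph1"))
```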

When calling the API endpoints, you specify the format in which you expect the results via the Accept header. The following table lists commonly used formats:

Header                          Format
Accept: application/x-turtle    RDF Turtle
Accept: application/x-trig      TriG
Accept: application/n-triples   N-Triples
Accept: application/n-quads     N-Quads
Accept: application/ld+json     JSON-LD

For a full list of supported formats, see GraphDB Documentation on RDF Formats.

Note

The above list is applicable only for GraphDB 10.2.2 or greater.
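To show how the Accept header fits into a request, here is a minimal Python sketch that builds (without sending) a GET request for a chosen format. The mapping keys and the function name graph_request are hypothetical; the media types follow the table above, with JSON-LD given as its registered type application/ld+json.

```python
from urllib.request import Request

# Short format names (invented for this example) mapped to Accept values.
ACCEPT_BY_FORMAT = {
    "turtle": "application/x-turtle",
    "trig": "application/x-trig",
    "n-triples": "application/n-triples",
    "n-quads": "application/n-quads",
    "json-ld": "application/ld+json",
}


def graph_request(url: str, rdf_format: str = "turtle") -> Request:
    """Build (but do not send) a GET request asking for the given RDF format."""
    try:
        accept = ACCEPT_BY_FORMAT[rdf_format]
    except KeyError:
        raise ValueError(f"unsupported format: {rdf_format!r}") from None
    return Request(url, headers={"Accept": accept}, method="GET")
```

Passing the returned object to urllib.request.urlopen would send the request; whether a triple-only format such as Turtle or a quad format such as N-Quads is more suitable depends on whether the graph name should be kept in the export.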

Data not in named graphs

In this approach, we would like to export a subset of data according to specific criteria, but the data is not structured in a way that lends itself to creating the needed subset. This could be because the data is not organized as named graphs or because the subset you would like to export is highly specific.

For this approach, we make the following assumptions:

  • You are using GraphDB

  • Your triples in the graph repository are in the default graph (i.e. no named graphs)

  • You want a subset that corresponds to all the triples that belong to a particular concept. Here we take the example of a patient (i.e. sphn:SubjectPseudoIdentifier)

Background

  • Several concepts, known as core concepts, are directly connected to sphn:SubjectPseudoIdentifier. These concepts in turn have links to other concepts.

  • Some supporting concepts are always connected to the sphn:SourceSystem concept:

    • sphn:Interpretation

    • sphn:ReferenceInterpretation

    • sphn:SemanticMapping

    • sphn:SourceData

The following steps build a query capable of extracting all triples associated with a patient (here, patient-1).

Step 1

Write a SELECT query to extract all core and supporting concepts. The SELECT query will be used as a nested query in the final CONSTRUCT query.

We first extract the core concepts (?subj) that are directly connected to a specific patient (i.e. resource:patient-1). Then we extract the supporting concepts (?resource) by creating a pattern to identify a link between these supporting concepts and an instance of sphn:SourceSystem. This instance of sphn:SourceSystem is in turn connected to one of the core concepts (?subj).

Finally, to extract the relevant sphn:DataRelease instance, we create a pattern to identify the link between this instance and the instance of sphn:DataProvider linked to resource:patient-1.

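PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource: <https://biomedit.ch/rdf/sphn-resource/>

SELECT DISTINCT ?s
WHERE
{
    # Core concepts directly connected to the patient
    ?subj sphn:hasSubjectPseudoIdentifier resource:patient-1 .
    # Data provider linked to the patient
    resource:patient-1 sphn:hasDataProvider ?dp .
    # Supporting concepts that share a source system with a core concept
    ?subj sphn:hasSourceSystem ?sourceSys .
    ?resource sphn:hasSourceSystem ?sourceSys .
    # Data release linked to the same data provider
    ?datarelease a sphn:DataRelease .
    ?datarelease sphn:hasDataProvider ?dp .
    # Collect all three kinds of concept under the single variable ?s
    {
        BIND(?subj AS ?s)
    }
    UNION
    {
        BIND(?resource AS ?s)
    }
    UNION
    {
        BIND(?datarelease AS ?s)
    }
}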

Note

Use of the BIND clause

We use the BIND clause to bind all the required concepts (core concepts as well as supporting concepts) to the same variable ?s to make querying simpler. If there were different variables to get the core concepts and supporting concepts, we would have had to create separate query patterns to get the entire hierarchies of concepts connected to these core and supporting concepts.
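The query above is written for a single patient, resource:patient-1. One possible way to reuse it for other patients is to substitute the identifier before sending the query, as in the Python sketch below. The function name query_for_patient and the pattern used to validate identifiers are assumptions made for this example; adjust the pattern to the identifiers actually used in your project.

```python
import re

# Conservative pattern for the local part of a patient IRI (an assumption).
_LOCAL_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def query_for_patient(query: str, patient_id: str,
                      placeholder: str = "resource:patient-1") -> str:
    """Return a copy of `query` that targets resource:<patient_id>."""
    # Reject anything that could change the structure of the query.
    if not _LOCAL_NAME.fullmatch(patient_id):
        raise ValueError(f"unexpected patient identifier: {patient_id!r}")
    if placeholder not in query:
        raise ValueError("placeholder not found in query")
    return query.replace(placeholder, f"resource:{patient_id}")


template = (
    "?subj sphn:hasSubjectPseudoIdentifier resource:patient-1 .\n"
    "resource:patient-1 sphn:hasDataProvider ?dp ."
)
print(query_for_patient(template, "patient-42"))
```

Validating the identifier before substitution keeps arbitrary text from being spliced into the query.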

Step 2

Now we need to retrieve all concepts that are directly or indirectly connected to the concepts of interest (denoted by the variable ?s).

Property path expressions allow us to traverse any number of properties between connected concepts in an RDF graph. Normally, the properties to follow must be named explicitly in the path. In our scenario, however, we want to traverse all paths that start from the concepts connected to our patient and continue down to the leaf values, without knowing the exact property names. We can do this by using an expression like the one below in the WHERE clause of our outer CONSTRUCT query:

?s ?prop ?val ;
  (sphn:p|!sphn:p)+ ?child .

The (+) operator in a property path expression matches one or more occurrences of the specified path, which lets us extract all child concepts at any depth. For the path itself we use the alternative sphn:p|!sphn:p. Because !sphn:p matches every property other than sphn:p, the two alternatives together match any property; sphn:p serves only as a placeholder. The expression binds every reachable child concept to the variable ?child. Furthermore, to retrieve the properties of each ?child concept, we use the following triple pattern:

?child ?childProp ?childPropVal .

The combined expression would be:

?s ?prop ?val ;
  (sphn:p|!sphn:p)+ ?child .
?child ?childProp ?childPropVal .
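To make the effect of the "any property, one or more times" path more concrete, the following Python sketch computes the same kind of reachability over a small in-memory list of triples. It is only an analogy for what the triplestore evaluates, not how GraphDB implements property paths, and all identifiers in the toy data are invented.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object); names are invented for illustration.
TRIPLES = [
    ("ex:measurement-1", "ex:hasResult", "ex:result-1"),
    ("ex:result-1", "ex:hasQuantity", "ex:quantity-1"),
    ("ex:quantity-1", "ex:hasUnit", "ex:unit-1"),
    ("ex:other-1", "ex:hasCode", "ex:code-9"),
]


def reachable(start: str, triples) -> set[str]:
    """Nodes reachable from `start` over one or more edges, whatever the predicate."""
    outgoing = defaultdict(list)
    for subj, _pred, obj in triples:  # the predicate is deliberately ignored
        outgoing[subj].append(obj)
    seen: set[str] = set()
    stack = list(outgoing[start])     # nodes one step away from the start
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(outgoing[node])
    return seen


print(sorted(reachable("ex:measurement-1", TRIPLES)))
```

In this toy graph, everything below ex:measurement-1 is collected while the unrelated ex:other-1 branch is left out, which mirrors what ?child is bound to for a given ?s.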

The final CONSTRUCT query will look like this:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource: <https://biomedit.ch/rdf/sphn-resource/>

CONSTRUCT
{
    ?s ?prop ?val .
    ?child ?childProp ?childPropVal .
}
WHERE
{
    ?s ?prop ?val ;
    (sphn:p|!sphn:p)+ ?child .
    ?child ?childProp ?childPropVal .
    {
        SELECT DISTINCT ?s
        WHERE
        {
            ?subj sphn:hasSubjectPseudoIdentifier resource:patient-1.
            resource:patient-1 sphn:hasDataProvider ?dp.
            ?subj sphn:hasSourceSystem ?sourceSys .
            ?resource sphn:hasSourceSystem ?sourceSys.
            ?datarelease a sphn:DataRelease.
            ?datarelease sphn:hasDataProvider ?dp.
            {
                BIND (?subj as ?s)
            }
            UNION
            {
                BIND (?resource as ?s)
            }
            UNION
            {
                BIND (?datarelease as ?s)
            }
        }
    }
    FILTER ((!isBlank(?val)) && (!isBlank(?child)) && (!isBlank(?childPropVal)) && (STRSTARTS(STR(?s), "https://biomedit.ch/rdf/sphn-resource/")))
}

Note

Use of FILTER clause

We use the FILTER clause to exclude blank nodes that might appear in the result and to ensure that only concepts with IRIs starting with the prefix https://biomedit.ch/rdf/sphn-resource/ are retrieved. This helps avoid retrieving irrelevant metadata information or details about the concepts which are part of the schema.
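One way to run the CONSTRUCT query outside the GraphDB Workbench is to POST it to the repository's SPARQL endpoint. The Python sketch below only builds the request. It assumes the repository URL used in the earlier examples and that the endpoint accepts a query sent directly with the application/sparql-query content type, as described by the SPARQL 1.1 Protocol; the function name construct_request is hypothetical.

```python
from urllib.request import Request

# Assumed repository endpoint, matching the earlier examples.
REPOSITORY_URL = "http://localhost:7200/repositories/project-data"


def construct_request(query: str, accept: str = "text/turtle") -> Request:
    """Build (but do not send) a POST request carrying a CONSTRUCT query."""
    return Request(
        REPOSITORY_URL,
        data=query.encode("utf-8"),  # the query text is the request body
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": accept,        # RDF serialization of the returned subset
        },
        method="POST",
    )
```

Sending the request with urllib.request.urlopen and writing the response body to a file would give the extracted subset in the requested serialization.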