Extract subsets of a graph

Many use cases and applications for a knowledge graph require generating subsets based on certain criteria. These criteria can be simple or complex depending on the use case. In SPHN, one of the key use cases is to create a subset graph that contains a selected set of patients and/or a selected set of concepts. The approach taken to generate the subset depends on the complexity of the criteria.

Simple approach

The simplest approach assumes that the data is already structured in a way that lends itself to creating sensible subsets.

For this approach, we make the following assumptions:

  • You are using GraphDB

  • Your triples in the graph repository are organized as one patient per named graph (see the documentation on Named Graphs and on Data Loading for how to load data into named graphs)

  • You want a subset that corresponds to all the triples that belong to a patient

You can fetch the subset via GraphDB’s REST API. But first, you need to identify the IRI of your named graph. In SPHN, if your graph is organized such that the data for each patient is in its own named graph, then the named graph IRI will be the patient IRI.

Alternatively, you can also retrieve a full list of all the named graphs within a given repository by querying GraphDB’s /contexts endpoint:

curl -X GET --header 'Accept: application/sparql-results+json' \
'http://localhost:7200/repositories/project-data/contexts'

This assumes that GraphDB is running on http://localhost:7200 and that the repository name is project-data.

This will give you a JSON object that contains a list of contexts where each context corresponds to a named graph.
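If a plain list of graph IRIs is easier to script against than JSON, the contexts can also be requested as CSV. The following is a minimal sketch, assuming that the /contexts endpoint also honours Accept: text/csv and returns SPARQL CSV results (a header line followed by one IRI per line, with CRLF line endings). The function names context_iris and list_named_graphs, and the GRAPHDB_URL and REPOSITORY variables, are hypothetical names introduced here for illustration.

```shell
# Assumed defaults, matching the examples in this document.
GRAPHDB_URL=${GRAPHDB_URL:-http://localhost:7200}
REPOSITORY=${REPOSITORY:-project-data}

# Read a SPARQL CSV result on stdin and print one IRI per line:
# strip carriage returns, drop the header line and any empty lines.
context_iris() {
  tr -d '\r' | sed -e '1d' -e '/^$/d'
}

# Fetch the contexts as CSV (assumption: text/csv is accepted here)
# and print one named graph IRI per line.
list_named_graphs() {
  curl --fail --silent --show-error \
    --header 'Accept: text/csv' \
    "$GRAPHDB_URL/repositories/$REPOSITORY/contexts" | context_iris
}

# Usage (requires a running GraphDB instance):
#   list_named_graphs > graphs.txt
```

Keeping the parsing step in its own function makes it possible to check it against a saved response without contacting the server.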

Then you can call the /rdf-graphs endpoint to get all the triples from a given named graph:

curl --header 'Accept: application/x-turtle' \
'http://localhost:7200/repositories/project-data/rdf-graphs/patient-001'

Where patient-001 is the named graph retrieved from the context list.

One thing to keep in mind is that GraphDB makes a distinction between ‘directly referencing a named graph’ vs ‘indirectly referencing a named graph’.

If your named graph has a name that is relative to the GraphDB instance, for example http://localhost:7200/repositories/<REPOSITORY_NAME>/rdf-graphs/<GRAPH_NAME>, then this is considered as directly referencing a named graph and the above API call is sufficient to fetch your triples.

But if your graph name is an IRI, for example https://example.com/my-patient-graph1, then this is considered as indirectly referencing a named graph, and you will have to make use of the /rdf-graphs/service endpoint instead. In SPHN, this is the common and likely scenario where data is requested from data providers in N-Quads or TriG formats and the data is loaded into GraphDB using IRIs for named graphs.

curl -X GET --header 'Accept: application/x-turtle' \
'http://localhost:7200/repositories/project-data/rdf-graphs/service?graph=https%3A%2F%2Fexample.com%2Fmy-patient-graph1'

Note: the IRI needs to be URL encoded before being used in the API call.
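The encoding step can be scripted. The following is a minimal sketch in POSIX shell, assuming the graph IRI contains only ASCII characters. The function names urlencode and graph_service_url are hypothetical helpers introduced here for illustration.

```shell
# Percent-encode every character that is not an RFC 3986 unreserved
# character (letters, digits, '.', '_', '~', '-'). ASCII input is assumed.
# The body runs in a subshell so that LC_ALL=C does not leak to the caller.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# Build the /rdf-graphs/service URL for an indirectly referenced graph.
# $1 = GraphDB base URL, $2 = repository name, $3 = named graph IRI
graph_service_url() {
  printf '%s/repositories/%s/rdf-graphs/service?graph=%s\n' \
    "$1" "$2" "$(urlencode "$3")"
}

# Usage (requires a running GraphDB instance):
#   curl --header 'Accept: application/x-turtle' \
#     "$(graph_service_url http://localhost:7200 project-data \
#        https://example.com/my-patient-graph1)"
```

Alternatively, curl can do the encoding itself: with the -G option, a parameter passed as --data-urlencode 'graph=https://example.com/my-patient-graph1' is percent-encoded and appended to the request URL as a query string.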

When calling the API endpoints, you specify the format in which you expect the results via the Accept header. The following table lists some of the supported formats:

Header                          Format
Accept: application/x-turtle    RDF Turtle
Accept: application/x-trig      TriG
Accept: application/n-triples   N-Triples
Accept: application/n-quads     N-Quads
Accept: application/ld+json     JSON-LD

For a full list of supported formats, see GraphDB Documentation on RDF Formats.

Note: The above list is applicable only for GraphDB 10.2.2 or greater.
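When exporting to files, the Accept header can be derived from the output file name. The following is a small sketch. The function name accept_for_file and the mapping from file extensions to formats are assumptions made for illustration (the extensions are the conventional ones for each format), while the MIME types are the ones listed above, with JSON-LD written in its registered form application/ld+json.

```shell
# Print the Accept header value for a given output file name.
# Returns a non-zero status for an unknown extension.
accept_for_file() {
  case $1 in
    *.ttl)    echo 'application/x-turtle' ;;
    *.trig)   echo 'application/x-trig' ;;
    *.nt)     echo 'application/n-triples' ;;
    *.nq)     echo 'application/n-quads' ;;
    *.jsonld) echo 'application/ld+json' ;;
    *) echo "unsupported extension: $1" >&2; return 1 ;;
  esac
}

# Usage (requires a running GraphDB instance):
#   out=patient-001.nq
#   curl --header "Accept: $(accept_for_file "$out")" --output "$out" \
#     'http://localhost:7200/repositories/project-data/rdf-graphs/patient-001'
```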

Advanced approach

In this approach, we would like to export a subset of data according to specific criteria, but the data is not structured in a way that lends itself to creating the needed subset. This could be because the data is not organized as named graphs or because the subset you would like to export is highly specific.

For this approach, we make the following assumptions:

  • You are using GraphDB

  • Your triples in the graph repository are in the default graph (i.e. no named graphs)

  • You want a subset that corresponds to all the triples that belong to a particular concept. Here we take the example of a patient (i.e. sphn:SubjectPseudoIdentifier)

  • Inference is turned off in GraphDB at the time of executing SPARQL queries

Note

This approach is only valid for data that conforms to SPHN Schema 2024.2.

Note

Several concepts are connected directly to sphn:SubjectPseudoIdentifier (these are known as core concepts), and these concepts in turn have links to other concepts. Some supporting concepts have a direct or indirect link to sphn:DataProvider. These supporting concepts are:

  • sphn:Interpretation

  • sphn:ReferenceInterpretation

  • sphn:SemanticMapping

  • sphn:SourceData

  • sphn:DataRelease

The following steps build a query capable of extracting all triples associated with a patient (here, patient-1).

Step 1: Extract all core concepts

Extract all core concepts that are directly connected to a specific patient:

{
  ?subj sphn:hasSubjectPseudoIdentifier resource:patient-1 .
  BIND (?subj as ?s)
}

Step 2: Get all supporting concepts

To get all supporting concepts, we first start by fetching all instances of type sphn:SemanticMapping, sphn:Interpretation, sphn:ReferenceInterpretation, sphn:SourceData and sphn:DataRelease. We then bind these to the variable ?prevSubj:

?prevSubj rdf:type ?o .
VALUES ?o
{
  sphn:SemanticMapping
  sphn:Interpretation
  sphn:ReferenceInterpretation
  sphn:SourceData
  sphn:DataRelease
}

We need to fetch only the instances of these concepts that are connected through sphn:DataProvider to resource:patient-1.

sphn:SemanticMapping, sphn:Interpretation, sphn:ReferenceInterpretation and sphn:DataRelease have a direct link to sphn:DataProvider.

sphn:SourceData follows the path sphn:hasSourceSystem / sphn:hasDataProvider.

Hence the query pattern:

resource:patient-1 sphn:hasDataProvider ?dp .
{
  ?prevSubj ?p ?dp .
}
UNION
{
  ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
}

Thus, to get all instances of supporting concepts, the final query pattern that we use is as follows:

{
  ?prevSubj rdf:type ?o .
  VALUES ?o
  {
    sphn:SemanticMapping
    sphn:Interpretation
    sphn:ReferenceInterpretation
    sphn:SourceData
    sphn:DataRelease
  }
  resource:patient-1 sphn:hasDataProvider ?dp .
  {
    ?prevSubj ?p ?dp .
  }
  UNION {
    ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
  }
  BIND (?prevSubj as ?s)
}

Note

Use of BIND Clause

We use the BIND clause to bind all the required concepts (core concepts as well as supporting concepts) to the same variable ?s to make querying simpler. If two different variables, ?subj and ?prevSubj, held the core concepts and supporting concepts respectively, we would have to write two query patterns to get the entire hierarchies of concepts connected to these core and supporting concepts.

Step 3: Fetch all concepts of interest

Use the SPARQL UNION clause to combine the query patterns from Step 1 and Step 2 in a SELECT query that returns the variable ?s. This SELECT query will be used as a subquery in the final CONSTRUCT query.

At this level we will obtain all the concepts of interest (for which we want to fetch the hierarchies):

SELECT DISTINCT ?s
WHERE
{
  {
    ?prevSubj rdf:type ?o .
    VALUES ?o
    {
      sphn:SemanticMapping
      sphn:Interpretation
      sphn:ReferenceInterpretation
      sphn:SourceData
      sphn:DataRelease
    }
    resource:patient-1 sphn:hasDataProvider ?dp .
    {
      ?prevSubj ?p ?dp .
    }
    UNION
    {
      ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
    }
    BIND (?prevSubj as ?s)
  }
  UNION
  {
    ?subj sphn:hasSubjectPseudoIdentifier resource:patient-1 .
    BIND (?subj as ?s)
  }
}
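To try this SELECT query on its own for different patients, it can be generated with the prefix declarations it needs. The following is a minimal sketch. The function name concepts_of_interest_query is hypothetical, the prefix IRIs are the ones used in the final CONSTRUCT query of this document, and the patient identifier is assumed to consist only of letters, digits, '-' and '_'.

```shell
# Print the Step 3 SELECT query for one patient.
# $1 = local name of the patient IRI, for example patient-1
concepts_of_interest_query() {
  # Reject anything that could break the prefixed name in the query.
  case $1 in
    ''|*[!A-Za-z0-9_-]*) echo "invalid patient id: $1" >&2; return 1 ;;
  esac
  cat <<EOF
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource: <https://biomedit.ch/rdf/sphn-resource/>
SELECT DISTINCT ?s
WHERE
{
  {
    ?prevSubj rdf:type ?o .
    VALUES ?o
    {
      sphn:SemanticMapping
      sphn:Interpretation
      sphn:ReferenceInterpretation
      sphn:SourceData
      sphn:DataRelease
    }
    resource:$1 sphn:hasDataProvider ?dp .
    {
      ?prevSubj ?p ?dp .
    }
    UNION
    {
      ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
    }
    BIND (?prevSubj as ?s)
  }
  UNION
  {
    ?subj sphn:hasSubjectPseudoIdentifier resource:$1 .
    BIND (?subj as ?s)
  }
}
EOF
}

# Usage (requires a running GraphDB instance):
#   curl --header 'Accept: application/sparql-results+json' \
#     --data-urlencode "query=$(concepts_of_interest_query patient-1)" \
#     'http://localhost:7200/repositories/project-data'
```

Inspecting the returned list of ?s values before running the full export is a cheap way to confirm that the patient and data provider links are present in the repository.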

Step 4: Get all the concepts that are directly or indirectly connected

Now, we need to get all the concepts that are directly or indirectly connected to the concepts of interest (referred to by the variable ?s).

Property path expressions allow us to traverse any number of properties between connected concepts in an RDF graph. Normally, the exact property names must be specified in the path. However, in our scenario, we want to traverse all the paths that start from the concepts connected to our patient and go all the way to the leaf values, without knowing the exact property names. We can do this by using an expression like the one below in the WHERE clause of our outer CONSTRUCT query:

?s ?prop ?val ;
(sphn:p|!sphn:p)+ ?child .

Using the + operator with the property path expression matches all child concepts that are reachable through one or more steps along the path. To match any property at each step, we say that it is either sphn:p or !sphn:p.

!sphn:p matches every property other than sphn:p, so the alternative sphn:p|!sphn:p holds true for every property. Thus the expression binds every connected child concept to the ?child variable. Further, to fetch the properties of the ?child concept, we use the following triple pattern:

?child ?childProp ?childPropVal .

The final CONSTRUCT query will look like this:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource: <https://biomedit.ch/rdf/sphn-resource/>
CONSTRUCT
{
    # triples of the concepts of interest
    ?s ?prop ?val .
    # triples of every concept reachable from them
    ?child ?childProp ?childPropVal .
}
WHERE
{
  ?s ?prop ?val ;
  (sphn:p|!sphn:p)+ ?child .
  ?child ?childProp ?childPropVal .
  {
    SELECT DISTINCT ?s
    WHERE
    {
      {
        ?prevSubj rdf:type ?o .
        VALUES ?o
        {
            sphn:SemanticMapping
            sphn:Interpretation
            sphn:ReferenceInterpretation
            sphn:SourceData
            sphn:DataRelease
        }
        resource:patient-1 sphn:hasDataProvider ?dp .
        {
          ?prevSubj ?p ?dp .
        }
        UNION
        {
          ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
        }
        BIND (?prevSubj as ?s)
      }
      UNION
      {
        ?subj sphn:hasSubjectPseudoIdentifier resource:patient-1 .
        BIND (?subj as ?s)
      }
    }
  }
  FILTER ((!isBlank(?val)) && (!isBlank(?child)) && (!isBlank(?childPropVal)) && (STRSTARTS(STR(?s), "https://biomedit.ch/rdf/sphn-resource/")))
}

Note

Use of FILTER clause

We use the FILTER clause to eliminate blank nodes that might appear in the final graph and to fetch only those concepts whose IRI starts with https://biomedit.ch/rdf/sphn-resource/. This avoids retrieving metadata or details about concepts that are part of the schema.

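An alternative query, which matches the core and supporting patterns together and then binds each of them to ?s in a nested UNION, would be: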

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource: <https://biomedit.ch/rdf/sphn-resource/>
PREFIX : <https://biomedit.ch/rdf/sphn-schema/sphn#>
CONSTRUCT
{
    # triples of the concepts of interest
    ?s ?prop ?val .
    # triples of every concept reachable from them
    ?child ?childProp ?childPropVal .
}
WHERE
{
  ?s ?prop ?val ;
  (sphn:p|!sphn:p)+ ?child .
  ?child ?childProp ?childPropVal .
  {
    SELECT DISTINCT ?s
    WHERE
    {
      {
        ?prevSubj rdf:type ?o .
        VALUES ?o
        {
          sphn:SemanticMapping
          sphn:Interpretation
          sphn:ReferenceInterpretation
          sphn:SourceData
          sphn:DataRelease
        }
        ?subj sphn:hasSubjectPseudoIdentifier resource:patient-1 .
        resource:patient-1 sphn:hasDataProvider ?dp .
        {
          ?prevSubj ?p ?dp .
        }
        UNION
        {
          ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
        }
        {
          BIND (?prevSubj as ?s)
        }
        UNION
        {
          BIND (?subj as ?s)
        }
      }
    }
  }
  FILTER ((!isBlank(?val)) && (!isBlank(?child)) && (!isBlank(?childPropVal)) && (STRSTARTS(STR(?s), "https://biomedit.ch/rdf/sphn-resource/")))
}
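Either CONSTRUCT query can be saved to a file and sent to the repository's SPARQL endpoint to write the subset to disk. The following is a minimal sketch, assuming the query is stored in a file (subset.rq is a hypothetical name) and that GraphDB runs on http://localhost:7200 with the repository project-data, as in the simple approach. The function names query_for_patient and run_construct are hypothetical helpers introduced for illustration.

```shell
# Assumed defaults, matching the examples earlier in this document.
GRAPHDB_URL=${GRAPHDB_URL:-http://localhost:7200}
REPOSITORY=${REPOSITORY:-project-data}

# Print the query stored in file $1 with resource:patient-1 replaced by
# resource:$2. The patient id is restricted to letters, digits, '-' and '_'
# so that it is safe to use inside the sed expression. The trailing
# character class prevents patient-10 from being rewritten as well.
query_for_patient() {
  case $2 in
    ''|*[!A-Za-z0-9_-]*) echo "invalid patient id: $2" >&2; return 1 ;;
  esac
  sed "s/resource:patient-1\([^A-Za-z0-9_-]\)/resource:$2\1/g" "$1"
}

# POST the query in file $1 to the SPARQL endpoint and save the
# Turtle result to file $2.
run_construct() {
  curl --fail --silent --show-error \
    --header 'Accept: application/x-turtle' \
    --data-urlencode "query@$1" \
    --output "$2" \
    "$GRAPHDB_URL/repositories/$REPOSITORY"
}

# Usage (requires a running GraphDB instance):
#   query_for_patient subset.rq patient-2 > patient-2.rq
#   run_construct patient-2.rq patient-2.ttl
```

Sending the query as a form-encoded POST body avoids URL length limits, which matters for a query of this size.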