Extract subsets of a graph
A knowledge graph can serve many use cases and applications that require generating subsets based on certain criteria. These criteria can be simple or complex depending on the use case. In SPHN, one of the key use cases is to create a subset graph that contains a select set of patients and/or a select set of concepts. The approach taken to generate the subset depends on the complexity of the criteria.
Simple approach
The simplest approach assumes that the data is already structured in a way that is amenable to creating sensible subsets.
For this approach, we make the following assumptions:
You are using GraphDB
Your triples in the graph repository are organized as one patient per named graph (see the documentation on Named Graphs and on Data Loading for loading data into named graphs)
You want a subset that corresponds to all the triples that belong to a patient
You can fetch the subset via GraphDB’s REST API. But first, you need to identify the IRI of your named graph. In SPHN, if your graph is organized such that the data for each patient is in its own named graph, then the named graph IRI will be the patient IRI.
Alternatively, you can retrieve a full list of all the named graphs within a given repository by querying GraphDB’s /contexts endpoint:
curl -X GET --header 'Accept: application/sparql-results+json' \
'http://localhost:7200/repositories/project-data/contexts'
This assumes that GraphDB is running on http://localhost:7200 and that the repository name is project-data.
This will give you a JSON object that contains a list of contexts where each context corresponds to a named graph.
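The same lookup can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive client: the function names are hypothetical, and it assumes the response follows the SPARQL JSON results layout with each named graph bound to a variable called contextID, as in the RDF4J REST protocol that GraphDB builds on.

```python
import json
import urllib.request


def contexts_url(base_url, repository):
    """Build the /contexts URL for a repository (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/repositories/{repository}/contexts"


def parse_contexts(payload):
    """Extract named graph IRIs from a SPARQL JSON results document.

    Assumption: each binding exposes the graph under the key "contextID".
    """
    doc = json.loads(payload)
    return [b["contextID"]["value"] for b in doc["results"]["bindings"]]


def list_named_graphs(base_url, repository):
    """Query the /contexts endpoint and return the named graph IRIs."""
    request = urllib.request.Request(
        contexts_url(base_url, repository),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_contexts(response.read().decode("utf-8"))
```

With a running instance, list_named_graphs("http://localhost:7200", "project-data") would return the list of named graph IRIs to choose from.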
Then you can call the /rdf-graphs endpoint to get all the triples from a given named graph:
curl --header 'Accept: application/x-turtle' \
'http://localhost:7200/repositories/project-data/rdf-graphs/patient-001'
Here, patient-001 is the named graph retrieved from the context list.
One thing to keep in mind is that GraphDB distinguishes between ‘directly referencing a named graph’ and ‘indirectly referencing a named graph’.
If your named graph has a name that is relative to the GraphDB instance, for example http://localhost:7200/repositories/<REPOSITORY_NAME>/rdf-graphs/<GRAPH_NAME>, then this is considered directly referencing a named graph, and the above API call is sufficient to fetch your triples.
But if your graph name is an IRI, for example https://example.com/my-patient-graph1, then this is considered indirectly referencing a named graph, and you will have to use the /rdf-graphs/service endpoint instead. In SPHN, this is the common and likely scenario: data is requested from data providers in N-Quads or TriG format and loaded into GraphDB using IRIs for named graphs.
curl -X GET --header 'Accept: application/x-turtle' \
'http://localhost:7200/repositories/project-data/rdf-graphs/service?graph=https%3A%2F%2Fexample.com%2Fmy-patient-graph1'
Note: the IRI needs to be URL encoded before being used in the API call.
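As an illustration of the encoding step, the following sketch builds the /rdf-graphs/service URL for an indirectly referenced graph. The helper name graph_service_url is hypothetical; the host and repository are the ones assumed in the examples above.

```python
from urllib.parse import quote


def graph_service_url(base_url, repository, graph_iri):
    """Return the /rdf-graphs/service URL for a named graph IRI.

    safe="" makes quote() percent-encode ':' and '/' as well, which is
    required because the IRI is passed as a query parameter value.
    """
    encoded = quote(graph_iri, safe="")
    return (
        f"{base_url.rstrip('/')}/repositories/{repository}"
        f"/rdf-graphs/service?graph={encoded}"
    )


url = graph_service_url(
    "http://localhost:7200",
    "project-data",
    "https://example.com/my-patient-graph1",
)
print(url)
```

The resulting URL can be passed to curl or any HTTP client together with the desired Accept header.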
When calling the API endpoints, you specify the format in which you expect the results via the Accept header. The following table lists commonly used formats:
Header | Format |
---|---|
application/x-turtle | RDF Turtle |
application/trig | TriG |
application/n-triples | N-Triples |
application/n-quads | N-Quads |
application/ld+json | JSON-LD |
For a full list of supported formats, see GraphDB Documentation on RDF Formats.
Note: The above list is applicable only for GraphDB 10.2.2 or greater.
Advanced approach
In this approach, we would like to export a subset of data according to specific criteria, but the data is not structured in a way that is amenable to creating the needed subset. This could be because the data is not organized as named graphs or because the subset you would like to export is highly specific.
For this approach, we make the following assumptions:
You are using GraphDB
Your triples in the graph repository are in the default graph (i.e. no named graphs)
You want a subset that corresponds to all the triples that belong to a particular concept; here we’ll take the example of a patient (i.e. sphn:SubjectPseudoIdentifier)
Inference is turned off in GraphDB at the time of executing SPARQL queries
Note
This approach is only valid for data that conforms to SPHN Schema 2024.2.
Note
There are several concepts directly connected to the sphn:SubjectPseudoIdentifier (also known as core concepts), and these concepts in turn have links to other concepts.
Some supporting concepts have a direct or indirect link to sphn:DataProvider. These supporting concepts are:
sphn:Interpretation
sphn:ReferenceInterpretation
sphn:SemanticMapping
sphn:SourceData
sphn:DataRelease
The following steps build a query capable of extracting all triples associated with a patient (i.e. patient-1).
Step 1: Extract all core concepts
Extract all core concepts that are directly connected to a specific patient:
{
  ?subj sphn:hasSubjectPseudoIdentifier resource:patient-1 .
  BIND (?subj as ?s)
}
Step 2: Get all supporting concepts
To get all supporting concepts, we start by fetching all instances of type sphn:SemanticMapping, sphn:Interpretation, sphn:ReferenceInterpretation, sphn:SourceData and sphn:DataRelease. We then bind these to the variable ?prevSubj:
?prevSubj rdf:type ?o .
VALUES ?o
{
  sphn:SemanticMapping
  sphn:Interpretation
  sphn:ReferenceInterpretation
  sphn:SourceData
  sphn:DataRelease
}
We need to fetch only the instances of these concepts that are connected through sphn:DataProvider to resource:patient-1.
sphn:SemanticMapping, sphn:Interpretation, sphn:ReferenceInterpretation and sphn:DataRelease have a direct link to sphn:DataProvider.
sphn:SourceData follows the path sphn:hasSourceSystem / sphn:hasDataProvider.
Hence the query pattern:
resource:patient-1 sphn:hasDataProvider ?dp .
{
  ?prevSubj ?p ?dp .
}
UNION
{
  ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
}
Thus, to get all instances of supporting concepts, the final query pattern that we use is as follows:
{
  ?prevSubj rdf:type ?o .
  VALUES ?o
  {
    sphn:SemanticMapping
    sphn:Interpretation
    sphn:ReferenceInterpretation
    sphn:SourceData
    sphn:DataRelease
  }
  resource:patient-1 sphn:hasDataProvider ?dp .
  {
    ?prevSubj ?p ?dp .
  }
  UNION
  {
    ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
  }
  BIND (?prevSubj as ?s)
}
Note
Use of BIND Clause
We use the BIND clause to bind all the required concepts (core concepts as well as supporting concepts) to the same variable ?s to make querying simpler.
If there were two different variables, ?subj and ?prevSubj, holding the core concepts and supporting concepts respectively, we would have had to create two query patterns to get the entire hierarchies of concepts connected to these core and supporting concepts.
Step 3: Fetch all concepts of interest
Use the SPARQL UNION clause with the query patterns from Step 1 and Step 2 in a SELECT query to fetch the variable ?s. This SELECT query will be used as a nested query in the final CONSTRUCT query.
At this level we obtain all the concepts of interest (those for which we want to fetch the hierarchies):
SELECT DISTINCT ?s
WHERE
{
  {
    ?prevSubj rdf:type ?o .
    VALUES ?o
    {
      sphn:SemanticMapping
      sphn:Interpretation
      sphn:ReferenceInterpretation
      sphn:SourceData
      sphn:DataRelease
    }
    resource:patient-1 sphn:hasDataProvider ?dp .
    {
      ?prevSubj ?p ?dp .
    }
    UNION
    {
      ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
    }
    BIND (?prevSubj as ?s)
  }
  UNION
  {
    ?subj sphn:hasSubjectPseudoIdentifier resource:patient-1 .
    BIND (?subj as ?s)
  }
}
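To make the logic of this SELECT query easier to follow, here is a small in-memory sketch of what it computes. It is not how GraphDB evaluates the query: the triples, the resource names and the function concepts_of_interest are all hypothetical, and only the property and class names are taken from the patterns above.

```python
SUPPORTING_TYPES = {
    "sphn:SemanticMapping",
    "sphn:Interpretation",
    "sphn:ReferenceInterpretation",
    "sphn:SourceData",
    "sphn:DataRelease",
}

# Toy data: (subject, predicate, object). All resources are made up.
TRIPLES = [
    ("resource:patient-1", "sphn:hasDataProvider", "resource:provider-A"),
    ("resource:allergy-1", "sphn:hasSubjectPseudoIdentifier", "resource:patient-1"),
    ("resource:allergy-2", "sphn:hasSubjectPseudoIdentifier", "resource:patient-2"),
    ("resource:release-1", "rdf:type", "sphn:DataRelease"),
    ("resource:release-1", "sphn:hasDataProvider", "resource:provider-A"),
    ("resource:source-1", "rdf:type", "sphn:SourceData"),
    ("resource:source-1", "sphn:hasSourceSystem", "resource:system-1"),
    ("resource:system-1", "sphn:hasDataProvider", "resource:provider-A"),
    ("resource:release-2", "rdf:type", "sphn:DataRelease"),
    ("resource:release-2", "sphn:hasDataProvider", "resource:provider-B"),
]


def concepts_of_interest(patient, triples):
    """Return the values that the SELECT query would bind to ?s."""
    # Step 1: core concepts pointing directly at the patient.
    core = {s for s, p, o in triples
            if p == "sphn:hasSubjectPseudoIdentifier" and o == patient}
    # Data providers of the patient (?dp).
    providers = {o for s, p, o in triples
                 if s == patient and p == "sphn:hasDataProvider"}
    # Step 2: instances of the supporting types (?prevSubj).
    typed = {s for s, p, o in triples
             if p == "rdf:type" and o in SUPPORTING_TYPES}
    # First UNION branch: any property leading straight to ?dp.
    direct = {s for s, p, o in triples if s in typed and o in providers}
    # Second UNION branch: hasSourceSystem followed by hasDataProvider.
    systems = {s for s, p, o in triples
               if p == "sphn:hasDataProvider" and o in providers}
    via_system = {s for s, p, o in triples
                  if s in typed and p == "sphn:hasSourceSystem" and o in systems}
    return core | direct | via_system


print(sorted(concepts_of_interest("resource:patient-1", TRIPLES)))
```

In this toy graph, resource:allergy-2 and resource:release-2 are left out because they belong to another patient and another data provider respectively.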
Step 4: Get all the concepts that are directly or indirectly connected
Now, we need to get all the concepts that are directly or indirectly connected to the concepts of interest (referred to by the variable ?s).
Property path expressions allow us to traverse any number of properties between connected concepts in an RDF graph. Normally, we would need to specify exact property names to traverse paths of arbitrary length. In our scenario, however, we want to traverse all the paths starting from the subject concepts connected to our required patient, going all the way to the leaf values, without knowing the exact property names. We can do this by using an expression like the one below in the WHERE clause of our outer CONSTRUCT query:
?s ?prop ?val ;
(sphn:p|!sphn:p)+ ?child .
Using the + sign with the property path expression extracts all child concepts connected through one or more repetitions of the specified path.
To match any property, we say the path is either sphn:p or !sphn:p. The choice of sphn:p is arbitrary: !sphn:p matches every property other than sphn:p, so the alternation of the two matches every property. Thus the expression binds every connected child concept to the ?child variable. Further, to fetch the properties of the ?child concept, we use the following triple pattern:
?child ?childProp ?childPropVal
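The effect of a path such as (sphn:p|!sphn:p)+ can be pictured as a graph traversal that follows every outgoing edge, whatever its property, for one or more hops. The following sketch illustrates that idea on a toy triple list; the data, the ex: names and the function reachable are hypothetical and are not part of the SPHN schema or of GraphDB.

```python
from collections import deque

# Toy data: (subject, predicate, object). All names are made up.
TRIPLES = [
    ("ex:measurement-1", "ex:hasSubject", "ex:patient-1"),
    ("ex:measurement-1", "ex:hasResult", "ex:result-1"),
    ("ex:result-1", "ex:hasQuantity", "ex:quantity-1"),
    ("ex:quantity-1", "ex:hasValue", "42"),
    ("ex:unrelated-1", "ex:hasValue", "7"),
]


def reachable(start, triples):
    """Return every node reachable from `start` in one or more hops.

    The predicate is ignored on purpose, mirroring a path that matches
    any property. `start` itself is only included if a cycle leads back
    to it, which matches the "one or more" meaning of '+'.
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for subject, _predicate, obj in triples:
            if subject == node and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen


print(sorted(reachable("ex:measurement-1", TRIPLES)))
```

Every node in the returned set plays the role of ?child; nodes that are only reachable from other starting points, such as the value "7" here, are not included.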
The final CONSTRUCT query will look like this:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource: <https://biomedit.ch/rdf/sphn-resource/>
CONSTRUCT
{
  ?s ?prop ?val .
  ?child ?childProp ?childPropVal .
  ?dr ?p ?objVal .
}
WHERE
{
  ?s ?prop ?val ;
     (sphn:p|!sphn:p)+ ?child .
  ?child ?childProp ?childPropVal .
  {
    SELECT DISTINCT ?s
    WHERE
    {
      {
        ?prevSubj rdf:type ?o .
        VALUES ?o
        {
          sphn:SemanticMapping
          sphn:Interpretation
          sphn:ReferenceInterpretation
          sphn:SourceData
          sphn:DataRelease
        }
        resource:patient-1 sphn:hasDataProvider ?dp .
        {
          ?prevSubj ?p ?dp .
        }
        UNION
        {
          ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
        }
        BIND (?prevSubj as ?s)
      }
      UNION
      {
        ?subj sphn:hasSubjectPseudoIdentifier resource:patient-1 .
        BIND (?subj as ?s)
      }
    }
  }
  FILTER ((!isBlank(?val)) && (!isBlank(?child)) && (!isBlank(?childPropVal)) && (STRSTARTS(STR(?s), "https://biomedit.ch/rdf/sphn-resource/")))
}
Note
Use of FILTER clause
We use the FILTER clause to eliminate blank nodes that might appear in the final graph and to fetch only those concepts whose IRI starts with https://biomedit.ch/rdf/sphn-resource/. This avoids fetching metadata or details about concepts that are part of the schema.
An alternative query would be:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sphn: <https://biomedit.ch/rdf/sphn-schema/sphn#>
PREFIX resource: <https://biomedit.ch/rdf/sphn-resource/>
PREFIX : <https://biomedit.ch/rdf/sphn-schema/sphn#>
CONSTRUCT
{
  ?s ?prop ?val .
  ?child ?childProp ?childPropVal .
  ?dr ?p ?objVal .
}
WHERE
{
  ?s ?prop ?val ;
     (sphn:p|!sphn:p)+ ?child .
  ?child ?childProp ?childPropVal .
  {
    SELECT DISTINCT ?s
    WHERE
    {
      {
        ?prevSubj rdf:type ?o .
        VALUES ?o
        {
          sphn:SemanticMapping
          sphn:Interpretation
          sphn:ReferenceInterpretation
          sphn:SourceData
          sphn:DataRelease
        }
        ?subj sphn:hasSubjectPseudoIdentifier resource:patient-1 .
        resource:patient-1 sphn:hasDataProvider ?dp .
        {
          ?prevSubj ?p ?dp .
        }
        UNION
        {
          ?prevSubj sphn:hasSourceSystem/sphn:hasDataProvider ?dp .
        }
        {
          BIND (?prevSubj as ?s)
        }
        UNION
        {
          BIND (?subj as ?s)
        }
      }
    }
  }
  FILTER ((!isBlank(?val)) && (!isBlank(?child)) && (!isBlank(?childPropVal)) && (STRSTARTS(STR(?s), "https://biomedit.ch/rdf/sphn-resource/")))
}
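A CONSTRUCT query like the ones above can be submitted to the repository's SPARQL endpoint and the result saved as an RDF file. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the host and repository are the ones used in the curl examples of the simple approach, and the query is sent as a direct POST with the application/sparql-query content type defined by the SPARQL 1.1 Protocol.

```python
import re
import urllib.request

# Assumed defaults, matching the curl examples of the simple approach.
GRAPHDB_URL = "http://localhost:7200"
REPOSITORY = "project-data"


def for_patient(query, patient_id):
    """Return `query` with resource:patient-1 swapped for another patient."""
    # Only accept characters that are safe in a prefixed name, so the
    # identifier cannot inject extra SPARQL into the query.
    if not re.fullmatch(r"[A-Za-z0-9_][A-Za-z0-9_-]*", patient_id):
        raise ValueError(f"unexpected patient identifier: {patient_id!r}")
    # The lookahead avoids touching longer names such as patient-10.
    pattern = r"resource:patient-1(?![A-Za-z0-9_-])"
    return re.sub(pattern, lambda _m: f"resource:{patient_id}", query)


def build_construct_request(query, base_url=GRAPHDB_URL,
                            repository=REPOSITORY,
                            accept="application/x-turtle"):
    """Prepare (but do not send) the POST request for a CONSTRUCT query."""
    endpoint = f"{base_url.rstrip('/')}/repositories/{repository}"
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": accept,
        },
        method="POST",
    )


def export_subset(query, out_path, **kwargs):
    """Send the query and write the returned triples to `out_path`."""
    request = build_construct_request(query, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

For example, export_subset(for_patient(query, "patient-2"), "patient-2.ttl") would write the subset for a hypothetical patient-2 to a Turtle file, where query holds the text of one of the CONSTRUCT queries above.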