Visually explore data with GraphDB
To find out more, watch the Schema and Data Visualization Training.
This document is mainly intended for researchers interested in visually exploring their data using the GraphDB triplestore. It provides information about data loading and about the visualization of both the schema and the data in GraphDB.
Loading data in the GraphDB triplestore
RDF data can be loaded into, and in some cases also visualized in, triplestores. Here, we demonstrate how to load and visualize data in GraphDB. GraphDB’s documentation gives a good overview of the options for loading RDF data into GraphDB.
Step 1: Create and configure a repository
First, you have to create a new repository which will hold the data:
Figure 1: Create a new repository.
Fill in the necessary information:
Figure 2: Fill in the necessary information.
Select the repository on the top right before importing data into the created repository:
Figure 3: First select the repository on the top right, then import data into the repository.
Enable the Autocomplete setting to ease your searches in the tool:
Figure 4: Enabling the Autocomplete setting in this repository eases the search.
The indexing time will depend on the size of the data. Once the autocomplete index is built, searching for labels becomes easier across the tools (e.g., in the visual graph or the SPARQL editor).
For more details and information about GraphDB’s persistence strategy and storage options, please refer to the GraphDB user guide.
Step 2: Import data
There are several options for loading the data into GraphDB.
Option A: Import from a text snippet
In this example the data is imported via a text snippet that is copied into the web front-end.
In the Import / RDF menu, on the User data tab, select Import RDF text snippet.
Figure 5: Import RDF text snippet.
Copy and paste the following text into the text field:
@prefix sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#> .
@prefix snomed: <http://snomed.info/id/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix resource: <https://biomedit.ch/rdf/sphn-resource/> .

# types
resource:hospital1-SubjectPseudoIdentifier-anonymous1 rdf:type sphn:SubjectPseudoIdentifier .
resource:hospital1-DataProviderInstitute rdf:type sphn:DataProviderInstitute .
resource:hospital1-Allergy-allergy1 rdf:type sphn:Allergy .
resource:hospital1-Allergen-peanuts1 rdf:type sphn:Allergen ;
    sphn:hasCode resource:Code-SNOMED-CT-762952008 .
resource:Code-SNOMED-CT-762952008 rdf:type snomed:762952008 .

# relations to the allergy
resource:hospital1-Allergy-allergy1 sphn:hasSubjectPseudoIdentifier resource:hospital1-SubjectPseudoIdentifier-anonymous1 .
resource:hospital1-Allergy-allergy1 sphn:hasDataProviderInstitute resource:hospital1-DataProviderInstitute .
resource:hospital1-Allergy-allergy1 sphn:hasAllergen resource:hospital1-Allergen-peanuts1 .
Figure 6: This shows the import from a text snippet.
Accept the default settings:
Figure 7: We accept the default settings.
A message appears showing the successful import:
Figure 8: Message showing the successful import.
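To quickly verify the import, a generic query can be run in the SPARQL editor. This is a minimal sketch that simply lists a few of the triples that were just loaded:

# List a sample of the imported triples.
SELECT ?s ?p ?o
WHERE {
    ?s ?p ?o .
}
LIMIT 10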
Option B: Import from server files
If enabled on your GraphDB instance, data placed in a dedicated folder on the GraphDB server can be made available to the user interface. To list and load these files and folders, navigate to the Server files tab and import either selected files or all of them. The folder where the data must be placed is indicated in the Help section, and it can be changed in the settings.
Figure 9: Data import via server files.
When prompted, accept all default settings (as above).
A message appears showing the successful import:
Figure 10: Message showing the successful import.
Option C: Import via the preload tool
For large datasets, GraphDB’s preload tool offers better performance than the import via the user interface. The preload command needs to be executed directly on the GraphDB server, so please get in touch with your instance’s system administrators.
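For orientation, an invocation might look like the following sketch; the exact flags and paths depend on your GraphDB version and installation, and the repository ID and file name here are hypothetical:

# Run on the GraphDB server (sketch; verify the flags against
# your GraphDB version's documentation).
# -f forces the load, -i names the target repository.
./preload -f -i myrepo /data/mock-data.ttl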
Option D: Use sciCORE’s load script
To automate the import of server files, sciCORE has created a bash script for loading data into GraphDB using the GraphDB API to trigger the “import server file” procedure. This script is provided in the users’ home folder upon request to the BioMedIT node.
The script allows for a fast and efficient import of large datasets, with minimal disruptions and fewer of the import failures that may be experienced when importing data using the GraphDB GUI. Compared to GraphDB’s preload tool, the script allows for the creation of named graphs, which are recommended, especially when multiple data deliveries and imports are expected. For more information on named graphs, see the Named Graphs section below.
The workflow to import RDF files using this script is:
1. The user copies the data to import into a pre-defined data sync folder which is visible to the GraphDB daemon.
2. The user executes the script, which contacts the GraphDB API; the GraphDB daemon then reads all the RDF files from the import folder. The destination repository and target graph are defined as script input variables.
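The API call made by such a script might resemble the following curl sketch. The endpoint path and JSON payload follow GraphDB 10’s REST API for importing server files, but they are assumptions here: the host, repository name, and file name are hypothetical, and the actual sciCORE script may differ.

# Trigger the "import server file" procedure via the GraphDB REST API
# (sketch; the endpoint and payload may vary between GraphDB versions).
curl -X POST http://localhost:7200/rest/repositories/myrepo/import/server \
    -H 'Content-Type: application/json' \
    -d '{"fileNames": ["mock-data.ttl"]}'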
Users interested in using the load script can contact their BioMedIT node admin and request it.
The BioMedIT nodes offer support with adjusting the provided data upload script to the specific project needs, creating the corresponding folders needed for the script to run, as well as with instructions for the users on how to use it. It can be expected that the first data uploads for each project will be performed with full BioMedIT node support.
Monitoring resources while importing data
System resources, such as memory or CPU consumption, can be monitored via GraphDB’s resource monitoring view (Figure 11).
Figure 11: Resource monitoring.
This can be helpful to debug issues with excessive resource consumption, especially when importing large datasets.
Named graphs
Named graphs are a key concept of the Semantic Web architecture. Essentially, they allow users to assign a URI to a collection of triples, and thus to make statements about that specific set. In other words, a subset of the data is referenced with a unique identifier.
To some extent, the URI of an RDF file that contains a number of triples can be considered to be a named graph. We can understand named graphs as the formalization of the idea that the content of an RDF document (a graph) on the web can be considered to be named by the URI of the document. This allows for fine-grained access control of the source data.
Some RDF formats allow the specification of named graphs, while others do not support that functionality. In the latter case, when data is imported into GraphDB and the target graphs parameter is not specified, all data is imported into the default graph as a single graph. With each new import, the existing graph is then deleted in its entirety and replaced with the contents of the new upload; it is not possible to delete only parts of the graph.
With named graphs, it is possible to upload data into a graph different from the default one, thus preventing the deletion of the existing graph and all its data.
This strategy of using named graphs within the same repository to separate datasets into smaller “chunks” is preferred over loading all data into the default graph. However, when loading multiple named graphs into the same repository, user access control can be difficult. In GraphDB, user management is implemented at the repository level, and it is not possible to give different users access to different graphs. Hence, all named graphs inside the repository can be seen by any user who is authorized to run queries. This limitation was deemed critical for some use cases, so the principle of internal SPARQL federation was implemented. In this case, it is recommended to load different graphs into different repositories with specific user access, and subsequently create a federated repository that is able to query data from all the others.
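With internal federation, a query in the federated repository can reach into another repository on the same GraphDB instance via a SERVICE clause. A minimal sketch, assuming a second repository named schema-repo (the name is hypothetical):

# Query another repository on the same GraphDB instance via
# internal SPARQL federation (the repository name is hypothetical).
SELECT ?s ?p ?o
WHERE {
    SERVICE <repository:schema-repo> {
        ?s ?p ?o .
    }
}
LIMIT 10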
For more details, please refer to GraphDB’s documentation.
In the example shown in Figure 12, in addition to the default graph, there is a named graph for the schema, and a named graph for the mock-data. Having data in a separate graph allows for more flexibility in its management. For instance, it becomes possible now to only replace the mock-data without having to delete all the content and reload both the mock-data and the schema.
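Replacing only the mock-data could then start with a SPARQL Update such as the following sketch (the graph URI is hypothetical):

# Remove all triples from the mock-data graph only;
# the schema graph and the default graph remain untouched.
CLEAR GRAPH <https://example.org/graphs/mock-data>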
If multiple datasets are provided in different named graphs, it is also possible to write queries that will only look into one of the datasets instead of the whole content. Simply add the statement FROM <Named_Graph_URI> to your SPARQL query, as in the sketch below.
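A minimal sketch, again with a hypothetical graph URI:

# Restrict the query to the mock-data graph only.
SELECT ?s ?p ?o
FROM <https://example.org/graphs/mock-data>
WHERE {
    ?s ?p ?o .
}
LIMIT 10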
Figure 12: Graphs overview section in GraphDB.
Schema and data visualization
In order to demonstrate the visualization capabilities of GraphDB, some mock-data will be used, an overview of which is shown in Figure 13. The mock-data is modeled with the SPHN RDF Schema and centered around patients, denoted by the class sphn:SubjectPseudoIdentifier.
In this mock-data, each patient has an AllergyEpisode, triggered by an Allergen and confirmed by a LabResult.
Codes from the external terminologies are used: SNOMED CT for encoding substances, LOINC for encoding laboratory tests, and UCUM for encoding units of measurement.
For the sake of didactic simplicity, it is assumed that each patient is linked to a single AllergyEpisode and to a single LabResult.
Figure 13: Mock-data overview.
Shown in Figure 14 is a class hierarchy visualization in GraphDB, with a focus on the classes from the SPHN RDF Schema used in the mock-data. Here, the levels in the hierarchy are represented by packing circles inside other circles (nested structure). Further information on class hierarchy visualization can be found in GraphDB’s documentation.
Figure 14: Class hierarchy visualization.
Shown in Figure 15 is a visualization of class hierarchy relationships in GraphDB. Here, the relationships between instances of classes are depicted as bundles of links in both directions. The bundles vary in thickness (indicating the number of links) and in color (indicating the class with the higher number of incoming links). By default, only the classes with the most incoming and/or outgoing links are included. Classes can be added or removed by clicking on the corresponding icons.
For the mock-data used in this example, we find that the Quantity class is tied for the top spot regarding the total number of links. It is strongly connected to the LabResult class and has both incoming and outgoing links. The AllergyEpisode class, on the other hand, only has outgoing links and connects, among others, to the Allergen class.
Further information on class relationships visualization can be found in GraphDB’s documentation.
Figure 15: Class relationships visualization.
The GraphDB visual graph functionality enables the visualization of a specific class or data of interest that was imported. For example, Figure 16 shows the search for LabResult, along with suggestions provided by the Autocomplete functionality (see GraphDB’s documentation for more information).
Figure 16: Search for visual graph of the LabResult class.
Following the search for the LabResult class, the corresponding class is shown along with its first hop neighbours. Both the imported RDF Schema and instances of LabResult (purple nodes) are included in the displayed visual graph (see Figure 17).
Figure 17: Visual graph for the LabResult class.
By default, only the first 20 links to other resources are shown. This limit, as well as the types and predicates being shown, can be adjusted in the settings (see Figure 18).
Figure 18: Settings for the visual graph display.
Through the settings one can, for example, exclude all instances of the sphn:LabResult class (i.e., by adding it to the Ignored types: type sphn:LabResult and press Enter), yielding a visual graph of the LabResult schema only (see Figure 19).
Here, in addition to classes, object and datatype properties (blue) are shown.
The object properties link instances of classes to other instances, while the datatype properties link instances of classes to literal values.
Figure 19: LabResult schema.
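To illustrate the distinction, here is a minimal Turtle sketch in the style of the snippet above. sphn:hasCode appears in the mock-data, while the resource names and the property ex:hasComment are hypothetical:

# Sketch: object vs. datatype properties (the resource names and
# ex:hasComment are hypothetical; sphn:hasCode is taken from the
# mock-data snippet above).
@prefix sphn: <https://biomedit.ch/rdf/sphn-ontology/sphn#> .
@prefix resource: <https://biomedit.ch/rdf/sphn-resource/> .
@prefix ex: <https://example.org/> .

resource:hospital1-LabResult-lab1
    sphn:hasCode resource:Code-LOINC-2339-0 ;   # object property: links to another resource
    ex:hasComment "fasting sample" .            # datatype property: links to a literal value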
One can also search for instances of a class, as shown in Figure 20.
Figure 20: Search for visual graph of an instance of LabResult class.
A closer inspection of the visual graph for an instance of the LabResult class (e.g., CHE-229_707_417-LabResult-e8624fab-5186-4e32-ae8d-7ee9770348a0 in Figure 21) reveals object property links to many other instances.
Figure 21: Visual graph for an instance of the LabResult class.
By clicking once on a LabResult instance, a side panel appears, providing additional information as shown in Figure 22. In addition to annotations (label, description, etc.), the side panel also contains the datatype properties along with their values.
Figure 22: Side panel for an instance of the LabResult class.
Double clicking on a node expands it by showing its first hop neighbours, as demonstrated in Figure 23 for a LOINC code instance. Note that a single code instance is shared among different LabResult instances. One can learn more about the LOINC code either by inspecting the side panel, or by visiting the URI of the LOINC code.
Figure 23: Exploring a LOINC code instance in a visual graph.
One can also explore the UCUM code instance, which is connected to the Quantity instance, and learn that it has cal as unit code (see Figure 24).
Figure 24: Exploring a UCUM code instance in a visual graph.
Now, in order to find out more about what is causing the allergy, we traverse the visual graph by visiting Allergen instances (see Figure 25). Here, we find that the Allergen instance is linked to a SNOMED CT code instance, and in Figure 26 we observe that this code is of type Egg white.
Figure 25: Exploring a SNOMED CT code instance in a visual graph.
Figure 26: SNOMED CT code of Egg White.
Once again, we can learn more by visiting the URI of the SNOMED CT code.
Querying and aggregating data for visualization
Similarly to querying relational databases using SQL, one can also query RDF graph databases using SPARQL. The queried data can then be aggregated for visualization, e.g., with the built-in Google Chart functionality in GraphDB. An example of this process is shown in Figure 27, where the mock-data is queried for instances of classes. The retrieved instances are then aggregated per class, and the aggregated counts are visualized using Google Chart.
Figure 27: Example of querying and aggregating data for visualization using SPARQL and Google Chart.
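The query behind such a chart could look like the following sketch; it counts instances per class, and the result table can then be rendered via the Google Chart options of the SPARQL editor:

# Count how many instances each class has, largest first.
SELECT ?class (COUNT(?instance) AS ?count)
WHERE {
    ?instance a ?class .
}
GROUP BY ?class
ORDER BY DESC(?count)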
The mock data is available in GitLab.
External terminologies are available through the Terminology Service accessible on the BioMedIT Portal (for additional information, please read about the Terminology Service). If you wish to use the minified versions (which contain only a subset of the codes) of SNOMED CT and LOINC, please get in touch with the DCC.