Managing your GraphDB instance

This section is intended for developers and system administrators who have sufficient privileges to set up and manage GraphDB.

Setting up your GraphDB instance

When setting up your GraphDB instance, there are several factors to consider.

Virtualization

It is advisable to run your GraphDB instance in a dedicated Virtual Machine (VM). Virtualization offers numerous advantages, such as isolation, resource management, and scalability. By deploying GraphDB in its own VM, you ensure that it operates independently, minimizing potential conflicts with other software or services running on the same hardware.

Storage

The performance of a GraphDB instance depends on the storage where it is installed. When you create a GraphDB repository, its data files are stored on disk, and at runtime GraphDB reads these files back from disk. These operations depend on the disk speed and the random input/output operations per second (IOPS) it can sustain. Thus, where the GraphDB data files are stored is important.

Whenever possible, we recommend placing the GraphDB data directory on fast SSD storage. You can configure the location of the data directory via the graphdb.home.data property, which can be set in the GraphDB properties file (graphdb.properties) in the conf directory of the GraphDB distribution. Alternatively, you can pass it as a Java system property at startup via -Dgraphdb.home.data.
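As a concrete sketch (the paths below are illustrative, not defaults), the property can be set either persistently in graphdb.properties or once at startup:

```shell
# In graphdb.properties (persistent setting; path is illustrative):
#   graphdb.home.data = /mnt/ssd/graphdb-data

# Or as a Java system property at startup, which takes precedence
# over the properties file (installation path is illustrative):
/opt/graphdb/dist/bin/graphdb -Dgraphdb.home.data=/mnt/ssd/graphdb-data
```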

CPU

The performance of GraphDB is influenced by CPU usage during different phases of its operation. When loading data into a GraphDB repository, the primary resource-intensive aspects are I/O operations and memory utilization. However, during query execution, having multiple CPU cores can significantly enhance performance, as they enable parallel processing of queries.

For an optimized GraphDB setup, it is recommended to allocate a minimum of 4 CPU cores to the GraphDB Virtual Machine (VM). Note that the number of CPU cores GraphDB actually uses depends on the GraphDB License, so be sure to review your license and request one with more cores if needed.

RAM

RAM is critical for the performance of GraphDB, both when loading data and when querying it. To ensure optimal operation and responsiveness, it is crucial to allocate an adequate amount of RAM to the GraphDB VM. We recommend allocating at least 64GB of RAM for the GraphDB VM.

However, be careful when configuring RAM allocation. When starting the GraphDB instance, avoid allocating the entirety of the available RAM to GraphDB itself. A balanced distribution of memory resources is crucial for preventing performance bottlenecks.

In scenarios where your GraphDB instance is expected to handle a substantial amount of RDF data, such as 1 billion statements, the following memory settings are recommended:

  • Java heap (on-heap): 32GB

  • Minimum RAM reserved for the OS (off-heap): 4GB

  • Total: 36GB

Note that off-heap memory is also important: the operating system uses it for file-system caching, so leaving more memory off-heap improves query performance.

You can configure these memory settings when starting the GraphDB instance:

/opt/graphdb/dist/bin/graphdb -Xms32g -Xmx32g -Dgraphdb.home.data=/path/to/data

Where:

  • -Xms32g configures the Java minimum heap size

  • -Xmx32g configures the Java maximum heap size

  • -Dgraphdb.home.data=/path/to/data configures the data directory for GraphDB
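The same settings can also be supplied via the GDB_JAVA_OPTS environment variable, which the GraphDB startup scripts pass to the JVM (the installation and data paths below are illustrative):

```shell
# Set JVM options for GraphDB via the environment instead of the command line;
# the startup script forwards GDB_JAVA_OPTS to the JVM. Paths are illustrative.
export GDB_JAVA_OPTS="-Xms32g -Xmx32g -Dgraphdb.home.data=/path/to/data"
/opt/graphdb/dist/bin/graphdb
```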

Tip

See GraphDB Documentation on Hardware Requirements for recommended memory settings at different scales of data.

Managing graphs within a GraphDB repository

There are many ways to organize data that is already in a GraphDB repository. How the data is organized in the repository is important, as this can affect the strategies you can use to load and manage your data.

  • Bulk Load: One common strategy is to load all the RDF data into the default graph of a GraphDB repository. Whenever there is a new dataset, the existing repository is deleted and a new one is created to accommodate the new data. This method is practical when the new dataset contains both pre-existing and new data. However, a notable drawback is that as the dataset grows with each iteration, the loading times increase proportionally.

  • Delta Load: An alternative strategy is to initially load all the RDF data into the default graph of a GraphDB repository. Whenever there is a new dataset, only the triples that have changed (the delta) are loaded into the repository. This is more complex, since the old and new data must be compared to determine what changed: triples that no longer appear in the new dataset are deleted, and triples that appear only in the new dataset are inserted.
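A minimal sketch of computing such a delta, assuming both the old and the new datasets are available as N-Triples exports (the sample triples and file names below are purely illustrative):

```shell
# Hypothetical exports: old.nt is the previous dataset, new.nt the incoming one.
printf '<urn:a> <urn:p> "1" .\n<urn:b> <urn:p> "2" .\n' > old.nt
printf '<urn:b> <urn:p> "2" .\n<urn:c> <urn:p> "3" .\n' > new.nt

# N-Triples is line-based, so sorting both files lets comm(1) diff them.
sort old.nt > old.sorted
sort new.nt > new.sorted
comm -23 old.sorted new.sorted > to-delete.nt   # triples only in the old data
comm -13 old.sorted new.sorted > to-insert.nt   # triples only in the new data

cat to-delete.nt   # <urn:a> <urn:p> "1" .
cat to-insert.nt   # <urn:c> <urn:p> "3" .
```

The two resulting files can then be applied to the repository as a DELETE and an INSERT, respectively, instead of reloading the full dataset.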

Regardless of the strategy, there are approaches you can take to organize your data within a repository. For example, you can use Named Graphs to maintain multiple independent graphs within the same repository. Named Graphs provide a mechanism to partition and categorize data, enabling more efficient querying and management of data within a repository.

Named Graphs

Named Graphs are a useful concept from the Semantic Web. Essentially, they allow users to assign a URI to a collection of triples, making it possible to make statements about that specific subset of triples. In other words, a subset of the data (a graph) is referenced with a unique URI (its name).

When working with triples from an RDF file, one can treat the URI of the RDF file as unique and use it as the identifier of the named graph. Named graphs can be understood as the formalization of the idea that the content of an RDF document (a graph) on the web is named by the URI of the document. This allows for fine-grained control over the source data.

Some RDF formats (such as TriG and N-Quads) can specify named graphs, while others (such as Turtle and N-Triples) do not support that functionality. In the latter case, when data is imported into a GraphDB repository and the target graphs parameter is not specified, the incoming data is treated as part of the default graph. Hence, all data is imported into the default graph as a single graph. With this approach, each new import means deleting the existing data and replacing it with the contents of the new data load; it is not possible to delete only parts of the data without unintended side effects.

With named graphs, it is possible to load data into a graph other than the default graph. This allows for easier management of a subset of the data, since one can delete or update the named graph without affecting the default graph. This strategy of using named graphs within the same repository to organize datasets into smaller subsets is preferred over loading all data into the default graph. Having data in separate named graphs allows for more flexibility in its management.
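As a sketch using the RDF4J-style REST API that GraphDB exposes (the server URL, repository ID, file name, and graph URI below are hypothetical), a named graph can be loaded and later removed independently; the context parameter takes the graph URI wrapped in angle brackets, URL-encoded:

```shell
# Load dataset.ttl into the named graph <urn:example:dataset> of repository
# "myrepo". Server URL, repository ID, file, and graph URI are hypothetical.
curl -X POST "http://localhost:7200/repositories/myrepo/statements?context=%3Curn:example:dataset%3E" \
     -H "Content-Type: text/turtle" \
     --data-binary @dataset.ttl

# Remove only that named graph, leaving the rest of the repository untouched.
curl -X DELETE "http://localhost:7200/repositories/myrepo/statements?context=%3Curn:example:dataset%3E"
```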

Note

When loading multiple named graphs into the same repository, user access control can be difficult. In GraphDB, user management is implemented at the repository level, so it is not possible to grant different users access to different named graphs. Hence, all named graphs inside a repository can be seen by any user who is authorized to run queries on that repository.

If this limitation is critical, one can look into the concept of 'Internal SPARQL Federation': load the subsets into different repositories, each with specific user access, and subsequently run federated queries that combine data from multiple repositories.
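A sketch of such a federated query, sent over HTTP and using GraphDB's internal federation SERVICE syntax (the server URL and the repository IDs "main" and "private" are hypothetical):

```shell
# Query repository "main" and pull in data from repository "private" on the
# same GraphDB instance via internal federation (SERVICE <repository:...>).
# Server URL and repository IDs are hypothetical.
curl "http://localhost:7200/repositories/main" \
     -H "Accept: application/sparql-results+json" \
     --data-urlencode 'query=
SELECT ?s ?p ?o
WHERE {
  SERVICE <repository:private> {   # federate into the "private" repository
    ?s ?p ?o
  }
}
LIMIT 10'
```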

For more details please refer to GraphDB Documentation on Internal SPARQL Federation.