KGCW2023 Challenge

Knowledge graph construction from heterogeneous data has seen considerable uptake in the last decade, from compliance with specifications to performance optimizations with respect to execution time. However, beyond execution time, other metrics such as CPU and memory usage are rarely considered when comparing knowledge graph construction systems. This challenge benchmarks RDF graph construction systems to determine which systems optimize for execution time, CPU usage, memory usage, or a combination of these metrics.



Task Description

The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state of the art of existing tools and the baseline results provided by this challenge. The challenge is not limited to execution time (the fastest pipeline); computing resources also count towards the most efficient pipeline.

We provide a tool which can execute such pipelines end-to-end. The tool collects and aggregates the metrics required for this challenge (execution time, CPU and memory usage) as CSV files. It also records information about the hardware used during the execution of the pipeline, so that different pipelines can be compared fairly. Your pipeline should consist of Docker images which can be executed on Linux by the tool.

You are strongly encouraged to use this tool to participate in the challenge. If you prefer to use a different tool, or our tool imposes technical requirements you cannot meet, please contact us directly.
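For reference, the sketch below shows how per-step metrics collected as CSV files could be aggregated with pandas. The directory layout and column names (a metrics.csv per run with step, duration_s, cpu_time_s, memory_rss_kb) are assumptions made for this example and do not necessarily match the output format of the provided tool.

    # Minimal sketch: aggregate per-run metric CSVs into one summary table.
    # ASSUMPTION: each run directory contains a metrics.csv with the columns
    # step, duration_s, cpu_time_s, memory_rss_kb; the tool's real format may differ.
    from pathlib import Path

    import pandas as pd

    def summarize(results_dir: str) -> pd.DataFrame:
        frames = []
        for csv_file in Path(results_dir).glob("*/metrics.csv"):
            df = pd.read_csv(csv_file)
            df["run"] = csv_file.parent.name  # remember which run the rows belong to
            frames.append(df)
        all_runs = pd.concat(frames, ignore_index=True)
        # Aggregate per step across runs: mean times, min/max memory.
        return all_runs.groupby("step").agg(
            mean_duration_s=("duration_s", "mean"),
            mean_cpu_time_s=("cpu_time_s", "mean"),
            min_memory_kb=("memory_rss_kb", "min"),
            max_memory_kb=("memory_rss_kb", "max"),
        )

    if __name__ == "__main__":
        print(summarize("results"))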

The tool is available at https://github.com/kg-construct/challenge-tool; the benchmark resources are listed under Resources & Ground Truth below.



Submissions

Workflow for submissions:

  1. Fill in this form about your tool before 31 March 2023 (AoE).
  2. Submit results (details for submission will be provided here soon):
    • Submit parameters dataset (Part 1) results before 12th of May (AoE).
    • Submit GTFS-Madrid-Bench (Part 2) results before 19th of May (AoE).
  3. OPTIONAL (Proceedings):
    • Submit a report paper AFTER the workshop.
    • The report paper will be reviewed by the PC/OC.
    • If the paper is accepted, it will be published in the KGCW proceedings.

Do you want to ask questions? Join us on Slack.

At least one author of each tool needs to present the results during the workshop (virtual presentations are not allowed).



Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline. The data is provided as CSV (with the SQL schema included) and the mappings in R2RML.

Data Parameters

  • Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
  • Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
  • Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
  • Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
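As an illustration of these data parameters, the sketch below generates a synthetic CSV file with a configurable number of records, properties, duplicate values, and empty values. The column naming scheme (id, p1..pN) and the output file name are assumptions for this example; the challenge provides its own generated datasets.

    # Sketch of a synthetic CSV generator for the data parameters listed above.
    # ASSUMPTIONS: column names (id, p1..pN) and the output file name are
    # illustrative; the challenge's own generator and SQL schemas may differ.
    import csv
    import random

    def generate(path: str, records: int, properties: int,
                 duplicate_pct: float = 0.0, empty_pct: float = 0.0) -> None:
        random.seed(42)  # reproducible output
        header = ["id"] + [f"p{i}" for i in range(1, properties + 1)]
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            for row_id in range(records):
                row = [row_id]
                for col in range(1, properties + 1):
                    r = random.random()
                    if r < empty_pct:
                        row.append("")                     # empty value
                    elif r < empty_pct + duplicate_pct:
                        row.append(f"dup-{col}")           # shared (duplicate) value
                    else:
                        row.append(f"val-{row_id}-{col}")  # unique value
                writer.writerow(row)

    # e.g. 100K records, 10 properties, 25% duplicate values, 0% empty values
    generate("synthetic_100k.csv", records=100_000, properties=10, duplicate_pct=0.25)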

Mappings Parameters

  • Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 triples maps (TMs)).
  • Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 predicate-object maps (POMs)).
  • Number and type of joins: scaling the number of joins and the type of join relations (1-1, N-1, 1-N, N-M).
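To make the mapping parameters concrete, the following sketch generates a single R2RML triples map with a configurable number of predicate-object maps. The table name, column names, and IRI templates are hypothetical; the challenge provides its own mappings.

    # Sketch: generate one R2RML triples map with N predicate-object maps (POMs).
    # ASSUMPTIONS: table name "data", columns p1..pN, and the example.com IRIs are
    # hypothetical; the challenge ships its own mappings.
    def r2rml_mapping(n_poms: int) -> str:
        lines = [
            "@prefix rr: <http://www.w3.org/ns/r2rml#> .",
            "",
            "<#TriplesMap1>",
            '    rr:logicalTable [ rr:tableName "data" ] ;',
            '    rr:subjectMap [ rr:template "http://example.com/resource/{id}" ] ;',
        ]
        for i in range(1, n_poms + 1):  # one predicate-object map per property
            lines.append("    rr:predicateObjectMap [")
            lines.append(f"        rr:predicate <http://example.com/p{i}> ;")
            lines.append(f'        rr:objectMap [ rr:column "p{i}" ]')
            lines.append("    ] ;")
        lines.append("    .")
        return "\n".join(lines)

    print(r2rml_mapping(10))  # a triples map with 10 predicate-object maps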

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights into the pipeline using real data from the public transport domain in Madrid. Mappings are provided in R2RML for the scaling cases and in RML for the heterogeneity cases.

Scaling

  • GTFS-1 SQL
  • GTFS-10 SQL
  • GTFS-100 SQL
  • GTFS-1000 SQL

Heterogeneity

  • Files-only: GTFS-100 with JSON, XML, and CSV
  • Tabular-only: GTFS-100 with SQL and CSV
  • Nested-only: GTFS-100 with JSON and XML
  • Mixed: GTFS-100 with SQL, CSV, JSON, and XML

Evaluation Criteria

Submissions must evaluate the following metrics:

  • Execution time of all steps in the pipeline. The execution time of a step is the difference between its start and end time.
  • CPU time, i.e. the time spent on the CPU during all steps of the pipeline. The CPU time of a step is the difference between its start and end CPU time.
  • Minimum and maximum memory consumption for each step of the pipeline, calculated over the memory usage measured during the execution of that step.
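For illustration, the sketch below shows how these three metrics could be measured for a single pipeline step executed as a child process. The provided tool already collects these metrics; the command and sampling interval here are assumptions for the example.

    # Illustration only: measuring execution time, CPU time, and min/max memory
    # for one pipeline step. The challenge tool already does this; the docker
    # command and sampling interval below are illustrative assumptions.
    import time

    import psutil  # third-party: pip install psutil

    def measure_step(command: list[str], interval: float = 0.1) -> dict:
        start = time.monotonic()
        proc = psutil.Popen(command)  # subprocess.Popen wrapper with psutil methods
        min_mem, max_mem, cpu_time = float("inf"), 0, 0.0
        while proc.poll() is None:  # sample until the step's process exits
            try:
                rss = proc.memory_info().rss
                min_mem, max_mem = min(min_mem, rss), max(max_mem, rss)
                cpu = proc.cpu_times()
                cpu_time = cpu.user + cpu.system
            except psutil.NoSuchProcess:  # process ended between poll() and sampling
                break
            time.sleep(interval)
        return {
            "execution_time_s": time.monotonic() - start,  # wall-clock duration
            "cpu_time_s": cpu_time,                        # last sampled CPU time
            "min_memory_bytes": min_mem if min_mem != float("inf") else 0,
            "max_memory_bytes": max_mem,
        }

    # e.g. measure one containerised step (the image name is hypothetical)
    print(measure_step(["docker", "run", "--rm", "example/kgc-step"]))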



Resources & Ground Truth

All resources and ground truth are openly available on Zenodo: https://doi.org/10.5281/zenodo.7837289. When using the provided tool (https://github.com/kg-construct/challenge-tool), everything is set up automatically.