KGCW2024 Challenge

Knowledge graph construction from heterogeneous data has seen considerable uptake over the last decade, from compliance testing to performance optimizations with respect to execution time. However, beyond execution time, other metrics such as CPU or memory usage are rarely considered when comparing knowledge graph construction systems. This challenge aims to spark interest among RDF graph construction systems in complying with the new RML specification and its modules, while benchmarking them on execution time, CPU usage, memory usage, or a combination of these metrics.



Sponsors

We thank Orange for generously sponsoring Virtual Machines for the Challenge. Each participant can now use a dedicated Virtual Machine, which ensures that all participants run on exactly the same hardware.



Task Description

The task is to comply with the new RML specification and its modules while also aiming for an efficient implementation in terms of execution time and computing resources such as CPU and memory usage. The challenge is not limited to execution time (the fastest pipeline); computing resources (the most efficient pipeline) are considered as well.

We provide a tool that executes such pipelines end-to-end. It also collects and aggregates the metrics required for this challenge, such as execution time, CPU, and memory usage, as CSV files. Information about the hardware used during the execution of a pipeline is recorded as well, so that different pipelines can be compared fairly. Your pipeline should consist of Docker images that can be executed on Linux by the tool.

We strongly encourage you to use this tool when participating in the challenge. If you prefer a different tool, or if our tool imposes technical requirements you cannot meet, please contact us directly.

Click here to go to the website of the tool.

Click here to go to the website of the resources.



Submissions

Workflow for submissions to the Challenge:

  1. Expression of interest in participation by January 31st, 2024 (AoE).
  2. Results submission by April 30th, 2024 (AoE); participants received submission details via e-mail.
  3. OPTIONAL (Proceedings):
    • Submit the report paper by May 21st, 2024.
    • The report paper will be reviewed by the PC/OC.
    • If the paper is accepted, it will be published in the KGCW proceedings.
    • Camera-ready submission by June 22nd, 2024.

Do you have questions? Join us on Slack.

At least one author of each tool must present the results during the workshop (virtual presentations are not allowed).



Track 1: Conformance: Test cases of all RML modules

Test compliance of an engine with all new RML modules (an illustrative conformance-check sketch follows the list of modules):

  1. RML-Core: https://github.com/kg-construct/rml-core/
  2. RML-IO: https://github.com/kg-construct/rml-io/
  3. RML-CC: https://github.com/kg-construct/rml-cc/
  4. RML-FNML: https://github.com/kg-construct/rml-fnml/
  5. RML-Star: https://github.com/kg-construct/rml-star/

The test cases are also available on Zenodo, where the official version used during the Challenge is published: https://zenodo.org/doi/10.5281/zenodo.10721874
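
As an illustration only (not the official harness), the following Python sketch shows one way to check a single test case by running an engine and comparing its output to the expected RDF with rdflib. The directory layout, file names, and the engine CLI (my-rml-engine) are assumptions and do not come from the test case repositories.

    # Hypothetical conformance check for one test case (illustration only).
    # Assumed layout (not from the official test cases): each case directory
    # holds a mapping (mapping.ttl) and the expected output (expected.nt);
    # "my-rml-engine" stands in for the engine under test.
    import subprocess
    from pathlib import Path

    from rdflib import Graph
    from rdflib.compare import isomorphic


    def run_test_case(case_dir: Path) -> bool:
        """Run the engine on one test case and compare the graphs."""
        generated = case_dir / "generated.nt"
        subprocess.run(
            ["my-rml-engine",                                # hypothetical CLI
             "--mapping", str(case_dir / "mapping.ttl"),
             "--output", str(generated)],
            check=True,
        )
        expected_graph = Graph().parse(str(case_dir / "expected.nt"), format="nt")
        generated_graph = Graph().parse(str(generated), format="nt")
        # Blank nodes make textual comparison unreliable; use RDF isomorphism.
        return isomorphic(expected_graph, generated_graph)


    if __name__ == "__main__":
        cases = sorted(p for p in Path("test-cases").iterdir() if p.is_dir())
        passed = sum(run_test_case(c) for c in cases)
        print(f"{passed}/{len(cases)} test case(s) passed")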

Track 2: Performance: Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline. The data is in CSV (with an SQL schema provided) and the mappings are in R2RML; an illustrative data-generation sketch follows the list of data parameters below.

Data Parameters

  • Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
  • Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
  • Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
  • Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
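
As a sketch of how such data could look, the snippet below generates a synthetic CSV file with a configurable number of records, columns, duplicate values, and empty values. The column names, value patterns, and file name are illustrative assumptions; the official datasets are the ones published on Zenodo.

    # Illustrative generator of synthetic CSV data (not the official generator).
    # The parameters mirror the data parameters above: number of records,
    # number of data properties (columns), and the fractions of duplicate
    # and empty values.
    import csv
    import random


    def generate_csv(path: str, records: int, properties: int,
                     duplicates: float = 0.0, empties: float = 0.0) -> None:
        random.seed(42)  # reproducible output
        header = ["id"] + [f"p{i}" for i in range(1, properties + 1)]
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            for record in range(records):
                row = [record]
                for col in range(1, properties + 1):
                    r = random.random()
                    if r < empties:
                        row.append("")                     # empty value
                    elif r < empties + duplicates:
                        row.append(f"dup-{col}")           # duplicate value
                    else:
                        row.append(f"val-{record}-{col}")  # unique value
                writer.writerow(row)


    # Hypothetical example: 10K records, 10 columns, 25% duplicates, 25% empties.
    generate_csv("synthetic_10K.csv", records=10_000, properties=10,
                 duplicates=0.25, empties=0.25)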

Mappings Parameters

  • Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 Triples Maps (TMs)).
  • Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 Predicate-Object Maps (POMs)); see the mapping sketch after this list.
  • Number and type of joins: scaling the number and the type of the joins (1-1, N-1, 1-N, N-M).
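
Purely to illustrate what scaling the mapping parameters means, the snippet below generates a minimal R2RML mapping with one triples map and a configurable number of predicate-object maps. The table name, column names, and namespace are hypothetical; the official mappings are the ones published on Zenodo.

    # Illustrative generator of an R2RML mapping that scales the number of
    # predicate-object maps (POMs) for a single triples map (TM).
    # The table name, column names, and namespace are hypothetical.
    def r2rml_with_poms(poms: int) -> str:
        prefixes = (
            "@prefix rr: <http://www.w3.org/ns/r2rml#> .\n"
            "@prefix ex: <http://example.com/> .\n\n"
        )
        triples_map = (
            "<#TM0> a rr:TriplesMap ;\n"
            '  rr:logicalTable [ rr:tableName "data" ] ;\n'
            '  rr:subjectMap [ rr:template "http://example.com/{id}" ] ;\n'
        )
        pom_lines = [
            f"  rr:predicateObjectMap [ rr:predicate ex:p{i} ; "
            f'rr:objectMap [ rr:column "p{i}" ] ] ;'
            for i in range(1, poms + 1)
        ]
        # Replace the final semicolon with a period to close the triples map.
        body = "\n".join(pom_lines).rstrip(" ;") + " .\n"
        return prefixes + triples_map + body


    print(r2rml_with_poms(10))  # one TM with 10 POMs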

The KGC Parameters are also available on Zenodo, where the official version used during the Challenge is published: https://zenodo.org/doi/10.5281/zenodo.10721874

Track 2: Performance: Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid. Mappings are provided in R2RML for the scaling datasets and in RML for the heterogeneity datasets.

Scaling

  • GTFS-1 SQL
  • GTFS-10 SQL
  • GTFS-100 SQL
  • GTFS-1000 SQL

Heterogeneity

  • Files-only: GTFS-100 with JSON, XML, and CSV
  • Tabular-only: GTFS-100 with SQL and CSV
  • Nested-only: GTFS-100 with JSON and XML
  • Mixed: GTFS-100 with SQL, CSV, JSON, and XML

The GTFS-Madrid-Bench is also available on Zenodo, where the official version used during the Challenge is published: https://zenodo.org/doi/10.5281/zenodo.10721874

Evaluation Criteria

Submissions must evaluate the following metrics (an illustrative measurement sketch follows the list):

  • Execution time of all the steps in the pipeline. The execution time of a step is the difference between its end time and its begin time.
  • CPU time as the time spent on the CPU for all steps of the pipeline. The CPU time of a step is the difference between its end CPU time and its begin CPU time.
  • Min and max memory consumption for each step of the pipeline, calculated as the minimum and maximum of the memory consumption measured during the execution of that step.
  • Compliance with the new RML specification and its modules.
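
These metrics are collected and aggregated by the provided tool. Purely as an illustration of how they can be derived, the sketch below measures one step launched as a subprocess: wall-clock time, CPU time, and min/max memory sampled with psutil. The sampling interval, the example command, and the fact that only the launched process is observed are simplifying assumptions, not the tool's actual implementation.

    # Illustrative measurement of a single pipeline step (not the provided tool).
    # Execution time = end wall-clock time - begin wall-clock time of the step.
    # CPU time       = end CPU time - begin CPU time (user + system).
    # Min/max memory = min/max of the RSS samples taken while the step runs.
    # Simplification: only the launched process itself is observed.
    import time

    import psutil


    def measure_step(cmd: list[str], interval: float = 0.1) -> dict:
        start = time.monotonic()
        proc = psutil.Popen(cmd)            # subprocess.Popen + psutil.Process
        cpu = proc.cpu_times()
        mem_samples = []
        while proc.poll() is None:          # sample until the step finishes
            try:
                mem_samples.append(proc.memory_info().rss)
                cpu = proc.cpu_times()      # keep the latest CPU times
            except psutil.NoSuchProcess:
                break
            time.sleep(interval)
        return {
            "execution_time_s": time.monotonic() - start,
            "cpu_time_s": cpu.user + cpu.system,
            "min_memory_bytes": min(mem_samples, default=0),
            "max_memory_bytes": max(mem_samples, default=0),
        }


    # Hypothetical step: running an engine image with Docker.
    print(measure_step(["docker", "run", "--rm", "my-engine-image"]))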