A Collection of Benchmark Datasets for Systematic Evaluations of Machine Learning on the Semantic Web

In recent years, several approaches for machine learning on the Semantic Web have been proposed. However, no extensive comparisons between those approaches have been undertaken, in particular due to a lack of publicly available, acknowledged benchmark datasets. Here, we present a collection of 22 benchmark datasets of different sizes, derived from existing Semantic Web datasets as well as from external classification and regression problems linked to datasets in the Linked Open Data cloud. Such a collection can be used for quantitative performance testing and systematic comparisons of approaches and, thanks to the number of datasets, also allows the statistical significance of the findings to be determined.
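As an illustration of how per-dataset results could be tested for significance, the following minimal sketch implements a two-sided sign test over paired scores. The choice of the sign test, the function name `sign_test_p_value`, and the example scores are assumptions made for illustration; they are not taken from the benchmark or its paper, and other tests (for example a Wilcoxon signed-rank test) may be more appropriate.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Two-sided sign test p-value for paired per-dataset scores.

    Each position holds the scores of two approaches on the same dataset.
    Ties are dropped, as is conventional for the sign test.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # all ties: no evidence of a difference
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up accuracies for two hypothetical approaches on six datasets.
# These are placeholders, not results obtained on the benchmark.
approach_a = [0.81, 0.77, 0.90, 0.68, 0.73, 0.85]
approach_b = [0.79, 0.75, 0.88, 0.70, 0.71, 0.80]
p_value = sign_test_p_value(approach_a, approach_b)
print(p_value)  # 0.21875
```

With only six datasets, even five wins out of six do not reach the usual 0.05 level, which is why a larger collection makes such tests more informative.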

The datasets, as well as a detailed description for each of them, can be found here.

Datasets

Our dataset collection consists of 22 datasets divided into three categories:

  1. Existing datasets that are commonly used in machine learning experiments
  2. Datasets that were generated from official observations
  3. Datasets generated from existing RDF datasets

 

Each of the datasets in the first two categories is initially linked to DBpedia. There are two main reasons for this: (1) DBpedia is a cross-domain knowledge base and can therefore be used for datasets from very different topical domains, and (2) tools such as DBpedia Lookup and DBpedia Spotlight make it easy to link external datasets to DBpedia. Moreover, DBpedia can be seen as an entry point to the Web of Linked Data, with many datasets linking to and from it. We use the initial DBpedia links to retrieve, for each entity, external links to YAGO and Wikidata. Such links could be exploited to systematically evaluate how relevant the data of different LOD datasets is for different learning tasks.
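The step from a DBpedia link to YAGO and Wikidata links could be sketched roughly as follows. This is only an illustration under assumptions: it supposes that the external links are exposed as `owl:sameAs` statements, that the public endpoint at `https://dbpedia.org/sparql` is queried, and that the two namespace prefixes below identify the target datasets. The helper names are hypothetical, and the sketch does not send any request itself.

```python
# Public DBpedia SPARQL endpoint (assumed; not named in the text above).
DBPEDIA_SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

# Namespace prefixes assumed for the two target datasets.
TARGET_NAMESPACES = {
    "yago": "http://yago-knowledge.org/resource/",
    "wikidata": "http://www.wikidata.org/entity/",
}


def build_same_as_query(dbpedia_uri):
    """Return a SPARQL query listing all owl:sameAs targets of one entity."""
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        f"SELECT ?target WHERE {{ <{dbpedia_uri}> owl:sameAs ?target }}"
    )


def group_links_by_dataset(target_uris):
    """Group sameAs target URIs by dataset, ignoring unknown namespaces."""
    grouped = {name: [] for name in TARGET_NAMESPACES}
    for uri in target_uris:
        for name, prefix in TARGET_NAMESPACES.items():
            if uri.startswith(prefix):
                grouped[name].append(uri)
    return grouped


query = build_same_as_query("http://dbpedia.org/resource/Berlin")
links = group_links_by_dataset([
    "http://www.wikidata.org/entity/Q64",
    "http://yago-knowledge.org/resource/Berlin",
    "http://sws.geonames.org/2950159/",  # other LOD dataset, filtered out
])
```

The query string would be sent to the endpoint with any SPARQL client; the grouping step then keeps only the targets that belong to the datasets of interest.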

Existing ML datasets

| Name | # Instances | Source | Task | Licence |
|---|---|---|---|---|
| Auto MPG | 371 | UCI ML | Regression | pending |
| AAUP | 960 | JSE | Regression/Classification (c=3) | pending |
| Auto 93 | 93 | JSE | Regression | pending |
| Zoo | 101 | UCI ML | Classification (c=3) | pending |

Generated datasets from official observations

| Name | # Instances | Source | Task | Licence |
|---|---|---|---|---|
| Forbes | 1,585 | Forbes | Regression/Classification (c=2) | pending |
| Cities | 212 | Mercer | Regression/Classification (c=3) | pending |
| Facebook Books | 1,600 | Facebook | Regression/Classification (c=2) | pending |
| Facebook Movies | 1,600 | Facebook | Regression/Classification (c=2) | pending |
| Metacritic Albums | 1,600 | Metacritic | Regression/Classification (c=2) | pending |
| Metacritic Movies | 2,000 | Metacritic | Regression/Classification (c=2) | pending |
| HIV Deaths Country | 114 | WHO | Regression/Classification (c=2) | Open |
| Traffic Accidents Country | 146 | WHO | Regression/Classification (c=2) | Open |
| Energy Savings Country | 162 | WorldBank | Regression/Classification (c=2) | Open |
| Inflation Country | 160 | WorldBank | Regression/Classification (c=2) | Open |
| Scientific Journals Country | 160 | WorldBank | Regression/Classification (c=2) | Open |
| Unemployment French Region | 26 | SemStats 2013 | Regression/Classification (c=2) | pending |
| Endangered Species | 301 | a-z-animals | Regression/Classification (c=2) | pending |
| Drug-Food Interaction | 2,000 | FinkiLOD | Classification (c=2) | odc-by |

Datasets generated from existing RDF datasets

| Name | # Instances | Task | Licence |
|---|---|---|---|
| AIFB | 176 | Classification (c=4) | CC-BY |
| AM | 1,000 | Classification (c=11) | cc-by-sa |
| MUTAG | 340 | Classification (c=2) | CC-BY |
| BGS | 146 | Classification (c=2) | Open |

Link Quality Evaluation

To evaluate the quality of the DBpedia links, we randomly selected at least 100 instances from each dataset (for datasets with fewer than 100 instances, we selected all instances) and manually evaluated the correctness of their links.

Link quality evaluation

| Dataset | # Test Links | # Correct Links | Precision (%) |
|---|---|---|---|
| Auto MPG | 100 | 100 | 100.00 |
| AAUP | 100 | 100 | 100.00 |
| Auto 93 | 93 | 93 | 100.00 |
| Zoo | 101 | 99 | 98.01 |
| Forbes | 100 | 100 | 100.00 |
| Cities | 100 | 100 | 100.00 |
| Facebook Books | 100 | 98 | 98.00 |
| Facebook Movies | 130 | 130 | 100.00 |
| Metacritic Albums | 100 | 100 | 100.00 |
| Metacritic Movies | 130 | 128 | 98.46 |
| HIV Deaths Country | 114 | 114 | 100.00 |
| Traffic Accidents Country | 146 | 146 | 100.00 |
| Energy Savings Country | 162 | 162 | 100.00 |
| Inflation Country | 160 | 160 | 100.00 |
| Scientific Journals Country | 160 | 160 | 100.00 |
| Unemployment French Region | 26 | 26 | 100.00 |
| Endangered Species | 100 | 100 | 100.00 |
| Drug-Food Interaction | 100 | 100 | 100.00 |
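The sampling and precision computation behind such an evaluation can be sketched as below. This is a simplified illustration, not the procedure actually used: the function names, the fixed random seed, and the choice to draw exactly 100 instances (the text only says "at least 100") are assumptions. The correctness judgement itself is manual and is therefore passed in as a count.

```python
import random


def sample_for_evaluation(instances, min_size=100, seed=0):
    """Select instances for manual link checking.

    Returns all instances if there are at most `min_size` of them,
    otherwise a random sample of exactly `min_size` (a simplification).
    """
    instances = list(instances)
    if len(instances) <= min_size:
        return instances
    return random.Random(seed).sample(instances, min_size)


def link_precision(num_correct, num_tested):
    """Share of correct links among the tested links, in percent."""
    if num_tested <= 0:
        raise ValueError("num_tested must be positive")
    return 100.0 * num_correct / num_tested


# A dataset with 1,585 instances yields a sample of 100 ...
large_sample = sample_for_evaluation(range(1585))
# ... while one with 26 instances is checked completely.
small_sample = sample_for_evaluation(range(26))

# 128 correct links out of 130 tested, as in the Metacritic Movies row.
precision = round(link_precision(128, 130), 2)
print(precision)  # 98.46
```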

Dataset Download

The datasets, as well as a detailed description for each of them, can be found here.

Citation

If you use the collection of datasets in your research, please cite the following paper:

  • Ristoski, P., de Vries, G.K.D., Paulheim, H.: A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In: International Semantic Web Conference, 2016