Benchmark for Data Web Crawlers

ORCA is a Crawler Analysis benchmark for Data Web Crawlers which runs on the HOBBIT platform. Currently, the following types of data nodes are available:

RDF data served in various formats over HTTP (dump and dereferencing variants)
RDFa data embedded in HTML, based on the RDFa Test Suite
SPARQL endpoints, based on Virtuoso
CKAN instances

License

This project is licensed under the GNU Affero General Public License v3.0. For the full license text, see LICENSE.

Source code

Permanent URL: http://w3id.org/dice-research/orca/code
GitHub: https://github.com/dice-group/orca

Documentation

Crawlers

Experiments

Parameters

Parameter	Description	Ontology resources
Number of nodes	The number of nodes in the synthetic graph.	orca:numberOfNodes
Average node degree	The average degree of the nodes in the generated graph.	orca:averageNodeGraphDegree
RDF dataset size	The average number of triples in the generated RDF graphs.	orca:averageTriplesPerNode
Average resource degree	The average degree of the resources in the RDF graphs.	orca:averageRdfGraphDegree
Node type amounts	For each node type, the user can define the proportion of nodes that should have this type.	orca:httpDumpNodeWeight orca:dereferencingHttpNodeWeight orca:sparqlNodeWeight orca:ckanNodeWeight
Dump file serialisations	For each available dump file serialisation, a boolean flag can be set.	orca:useNtDumps orca:useN3Dumps orca:useRdfXmlDumps orca:useTurtleDumps
Dump file compression ratio	Proportion of dump files that are compressed.	orca:httpDumpNodeCompressedRatio
Average ratio of disallowed resources	Proportion of resources that are generated within a node and marked as disallowed for crawling.	orca:averageDisallowedRatio
Average crawl delay	The average crawl delay defined in the nodes' `robots.txt` files.	orca:averageCrawlDelay
Seed	A seed value for initialising the random number generators, which ensures that experiments are repeatable.	orca:seed
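
As an illustration of how the node type weights and the seed could interact, the following minimal Python sketch splits a number of nodes across node types in proportion to their weights and shuffles the result with a seeded generator. It is not ORCA's implementation: the function name `allocate_node_types`, the largest-remainder rounding, and the short type labels are assumptions made for this example only.

```python
import random


def allocate_node_types(number_of_nodes, weights, seed):
    """Split number_of_nodes across node types in proportion to weights.

    Hypothetical helper for illustration; ORCA may distribute nodes differently.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one node type weight must be positive")

    # Ideal (fractional) share of nodes per type.
    shares = {t: number_of_nodes * w / total for t, w in weights.items()}
    counts = {t: int(s) for t, s in shares.items()}

    # Hand out the nodes lost to rounding, largest fractional part first.
    remaining = number_of_nodes - sum(counts.values())
    by_remainder = sorted(shares, key=lambda t: shares[t] - counts[t], reverse=True)
    for t in by_remainder[:remaining]:
        counts[t] += 1

    # Seeded shuffle: the same seed always yields the same assignment order.
    types = [t for t, c in counts.items() for _ in range(c)]
    random.Random(seed).shuffle(types)
    return counts, types


if __name__ == "__main__":
    # Assumed weights, loosely mirroring the orca:*NodeWeight parameters.
    weights = {"httpDump": 2, "dereferencingHttp": 1, "sparql": 1, "ckan": 0}
    counts, types = allocate_node_types(8, weights, seed=42)
    print(counts)
    print(types)
```

Because the shuffle uses its own `random.Random(seed)` instance, two runs with the same seed produce the same node type sequence, which is the property the seed parameter is meant to guarantee.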

Key performance indicators

KPI	Description	Ontology resources
Recall	Number of true positives divided by the number of checked triples.	orca:microRecall orca:macroRecall
Runtime	The time it takes from starting the crawling process to termination.	orca:runtime
Requested disallowed resources	The number of disallowed resources requested by the crawler, divided by the number of all resources disallowed by the `robots.txt` file.	orca:ratioOfRequestedDisallowedResources
Crawl delay fulfilment	The average measured delay between the requests received by a single node, divided by the delay defined in the `robots.txt` file. If the value is below 1.0, the crawler does not strictly follow the delay instruction.	orca:minAverageCrawlDelayFulfillment orca:maxAverageCrawlDelayFulfillment orca:macroAverageCrawlDelayFulfillment
Consumed hardware resources	The CPU, memory, and disk consumption of the benchmarked crawler.	orca:totalCpuUsage orca:averageDiskUsage orca:averageMemoryUsage
Triples over time	The number of triples in the sink over time.	orca:tripleCountOverTime
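
The ratio-based KPIs above can be sketched in a few lines of Python. This is a minimal illustration of the descriptions in the table, not the code of ORCA's evaluation module: all function names are hypothetical, and the micro and macro variants follow the usual definitions (pooling counts across nodes versus averaging per-node values), which is an assumption about how ORCA aggregates them.

```python
def recall(true_positives, checked_triples):
    """True positives divided by the number of checked triples."""
    return true_positives / checked_triples if checked_triples else 0.0


def micro_recall(per_node):
    """Pool (true_positives, checked_triples) pairs of all nodes, then divide."""
    tp = sum(t for t, _ in per_node)
    checked = sum(c for _, c in per_node)
    return recall(tp, checked)


def macro_recall(per_node):
    """Average of the per-node recall values."""
    return sum(recall(t, c) for t, c in per_node) / len(per_node)


def disallowed_ratio(requested, disallowed):
    """Share of the disallowed resources that the crawler requested anyway."""
    disallowed = set(disallowed)
    if not disallowed:
        return 0.0
    return len(set(requested) & disallowed) / len(disallowed)


def crawl_delay_fulfilment(request_times, crawl_delay):
    """Average gap between consecutive requests divided by the defined delay.

    Returns None if fewer than two requests were seen or no delay is defined.
    """
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]
    if not gaps or crawl_delay <= 0:
        return None
    return (sum(gaps) / len(gaps)) / crawl_delay


if __name__ == "__main__":
    per_node = [(9, 10), (10, 90)]  # made-up (true positives, checked triples)
    print(micro_recall(per_node), macro_recall(per_node))
    print(disallowed_ratio(["/a", "/b", "/x"], ["/a", "/b", "/c", "/d"]))
    # Requests every 2 seconds against a 4 second crawl delay: below 1.0.
    print(crawl_delay_fulfilment([0, 2, 4, 6], 4))
```

Under these assumptions, the minimum, maximum, and macro average of the per-node fulfilment values would correspond to the three crawl delay resources listed in the table.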

Maintenance

This project is maintained by the Data Science Group at Paderborn University in its role as a member of special group 7 of task force 6 of the BDVA.

Citation

ORCA has been accepted by the IEEE International Conference on Semantic Computing (ICSC). The paper should be cited as follows:

@InProceedings{roeder2021orca,
  author    = {Michael Röder and Geraldo de Souza Jr. and Denis Kuchelev and Abdelmoneim Amer Desouki and Axel-Cyrille Ngonga Ngomo},
  booktitle = {Proceedings of the 15th IEEE International Conference on Semantic Computing (ICSC)},
  title     = {ORCA – a Benchmark for Data Web Crawlers},
  year      = {2021},
  pages     = {62--69},
  publisher = {IEEE Computer Society},
  keywords  = {dice raki daikiri opal limbo sys:relevantFor:limbo sys:relevantFor:opal group_aksw roeder ngonga kuchelev gsjunior},
  url       = {https://papers.dice-research.org/2021/ICSC2021_ORCA/ORCA_public.pdf},
}
