dice-group/orca

Benchmark for Data Web Crawlers

ORCA is a Crawler Analysis benchmark for Data Web Crawlers which runs on the HOBBIT platform. Currently, the following types of data nodes are available:

  • RDF data served in various formats over HTTP (dump and dereferencing variants)
  • RDFa data embedded in HTML, based on the RDFa Test Suite
  • SPARQL endpoints, based on Virtuoso
  • CKAN instances

License

This project is licensed under the GNU Affero General Public License v3.0. For the full license text, see LICENSE.

Source code

Documentation

Crawlers

Experiments

Parameters

| Parameter | Description | Ontology resources |
|---|---|---|
| Number of nodes | The number of nodes in the synthetic graph. | orca:numberOfNodes |
| Average node degree | The average degree of the nodes in the generated graph. | orca:averageNodeGraphDegree |
| RDF dataset size | Average number of triples of the generated RDF graphs. | orca:averageTriplesPerNode |
| Average resource degree | The average degree of the resources in the RDF graphs. | orca:averageRdfGraphDegree |
| Node type amounts | For each node type, the user can define the proportion of nodes that should have this type. | orca:httpDumpNodeWeight, orca:dereferencingHttpNodeWeight, orca:sparqlNodeWeight, orca:ckanNodeWeight |
| Dump file serialisations | For each available dump file serialisation, a boolean flag can be set. | orca:useNtDumps, orca:useN3Dumps, orca:useRdfXmlDumps, orca:useTurtleDumps |
| Dump file compression ratio | Proportion of dump files that are compressed. | orca:httpDumpNodeCompressedRatio |
| Average ratio of disallowed resources | Proportion of resources that are generated within a node and marked as disallowed for crawling. | orca:averageDisallowedRatio |
| Average crawl delay | The crawl delay of the node's robots.txt file. | orca:averageCrawlDelay |
| Seed | A seed value for initialising random number generators is used to ensure the repeatability of experiments. | orca:seed |
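
The interplay of the node type weights and the seed can be illustrated with a small sketch. It is not taken from the ORCA code base: the function name `allocate_node_types` is hypothetical, the weight values are made up, and drawing node types at random in proportion to their weights is only an assumption about how such weights might be applied. The point it shows is that a fixed seed makes the generated setup repeatable.

```python
import random


def allocate_node_types(number_of_nodes, weights, seed):
    """Draw a node type for each node, proportionally to the given weights.

    Hypothetical helper for illustration; ORCA's own allocation may differ.
    """
    if sum(weights.values()) <= 0:
        raise ValueError("at least one node type weight must be positive")
    # A fixed order keeps the result independent of dictionary ordering.
    types = sorted(weights)
    # A local generator: the same seed always yields the same assignment.
    rng = random.Random(seed)
    return rng.choices(types, weights=[weights[t] for t in types], k=number_of_nodes)


# Example values (assumed, not defaults of the benchmark).
weights = {
    "httpDumpNodeWeight": 2.0,
    "dereferencingHttpNodeWeight": 1.0,
    "sparqlNodeWeight": 1.0,
    "ckanNodeWeight": 0.0,  # a weight of 0 means this type is never drawn
}
assignment = allocate_node_types(10, weights, seed=42)
print(assignment)
```

Because the types are sampled, the resulting proportions only approximate the weights for small node counts; running the function twice with the same seed returns the same list.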

Key performance indicators

| KPI | Description | Ontology resources |
|---|---|---|
| Recall | Number of true positives divided by the number of checked triples. | orca:microRecall, orca:macroRecall |
| Runtime | The time it takes from starting the crawling process to termination. | orca:runtime |
| Requested disallowed resources | The number of forbidden resources crawled by the crawler, divided by the number of all resources forbidden by the robots.txt file. | orca:ratioOfRequestedDisallowedResources |
| Crawl delay fulfilment | The average measured delay between the requests received by a single node divided by the delay defined in the robots.txt file. If the measure is below 1.0 the crawler does not strictly follow the delay instruction. | orca:minAverageCrawlDelayFulfillment, orca:maxAverageCrawlDelayFulfillment, orca:macroAverageCrawlDelayFulfillment |
| Consumed hardware resources | The RAM and CPU consumption of the benchmarked crawler. | orca:totalCpuUsage, orca:averageDiskUsage, orca:averageMemoryUsage |
| Triples over time | The number of triples in the sink over time. | orca:tripleCountOverTime |
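
As a rough illustration of how the recall and crawl delay KPIs could be computed, the following sketch uses made-up numbers. It assumes the conventional meaning of micro averaging (pool all nodes, then divide) and macro averaging (compute per node, then take the mean); the function names and the input layout are hypothetical and are not taken from the ORCA implementation.

```python
from statistics import mean


def micro_recall(per_node):
    """per_node: list of (true_positives, checked_triples), one pair per node."""
    total_tp = sum(tp for tp, _ in per_node)
    total_checked = sum(checked for _, checked in per_node)
    return total_tp / total_checked


def macro_recall(per_node):
    """Mean of the per-node recall values, so every node counts equally."""
    return mean(tp / checked for tp, checked in per_node)


def crawl_delay_fulfilment(request_times, required_delay):
    """Average gap between consecutive requests divided by the required delay.

    request_times: sorted timestamps (seconds) of requests seen by one node.
    """
    gaps = [later - earlier for earlier, later in zip(request_times, request_times[1:])]
    return mean(gaps) / required_delay


# Made-up example: one large node crawled well, one small node crawled poorly.
per_node = [(90, 100), (5, 10)]
micro = micro_recall(per_node)  # 95 / 110
macro = macro_recall(per_node)  # (0.9 + 0.5) / 2

# Two nodes, both asking for a 2 second delay in robots.txt (assumed values).
fulfilment = [
    crawl_delay_fulfilment([0.0, 2.0, 4.0, 7.0], 2.0),  # gaps 2, 2, 3
    crawl_delay_fulfilment([0.0, 1.0, 2.0], 2.0),       # gaps 1, 1
]
summary = {"min": min(fulfilment), "max": max(fulfilment), "macro": mean(fulfilment)}
print(micro, macro, summary)
```

In this example the second node yields a fulfilment of 0.5, which is below 1.0 and therefore indicates that the crawl delay was not strictly followed for that node.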

Maintenance

This project is maintained by the Data Science Group at Paderborn University within its role as a member of the special group 7 of task force 6 of the BDVA.

Citation

ORCA has been accepted by the IEEE International Conference on Semantic Computing (ICSC). The paper should be cited as follows:

@InProceedings{roeder2021orca,
  author    = {Michael Röder and Geraldo de Souza Jr. and Denis Kuchelev and Abdelmoneim Amer Desouki and Axel-Cyrille Ngonga Ngomo},
  booktitle = {Proceedings of the 15th IEEE International Conference on Semantic Computing (ICSC)},
  title     = {ORCA – a Benchmark for Data Web Crawlers},
  year      = {2021},
  pages     = {62--69},
  publisher = {IEEE Computer Society},
  keywords  = {dice raki daikiri opal limbo sys:relevantFor:limbo sys:relevantFor:opal group_aksw roeder ngonga kuchelev gsjunior},
  url       = {https://papers.dice-research.org/2021/ICSC2021_ORCA/ORCA_public.pdf},
}