Data visualisation over semantic datasets and Knowledge Graphs becomes a challenging task as the data volume increases. The amount of nodes and edges linking them produce overwhelming diagrams in which understanding the information displayed is cumbersome. In this work we introduce Graphless, a visualisation interface that exploits the statistical information of the usage of instances and relations to produce graphical summaries of entities. The system supports several heuristics, providing different visualisations, according to configurable parameters.
Graphless is an online tool that provides a user interface for graphically exploring RDF-based datasets. The main goal of this tool is to provide a way for summarizing entity graphs,supporting a better understanding of how individuals and their relations are stated. In a way, our aim is to reduce big portions of data to a more easy to digest representation, trying to preserve the most relevant elements while doing so.
For doing so, we have to consider the data features that semantic datasets have. In the most basic and common scenario, semantic data is represented as a graph of individuals and properties linking them. Even when an schema, defined as an ontology, could be expected, having it is not always the case. Many available SPARQL endpoints hosting datasets do not contain the ontology (or ontologies) used to generate them. In this work we aim to provide a generic approach for producing visualization summaries. We rely only on the statistical features that are common to RDF data graphs. We consider the following two main features:
Connectivity degree: the number of ingoing and outgoing links from/to a given node is an indicator of the relevance of the node in the dataset. Central nodes (e.g Countries as birthplaces, Years for dates) tend to have many incoming links, as they are referenced by many other instances. Also, nodes of prominent entities usually assert many outgoing properties, either to other entities or to literal values (e.g. Authors to the books they wrote, Places to their names in different languages).
Property usage: the distribution of properties along the instances on a dataset is, as expected, not strictly homogeneous. Properties such as name or birthplace are more likely to be instantiated on instances of type Person than others such as capital of or foundation year. Thus, classes allow us to estimate which are the most used properties for their instances.
In order for the interface to be able to filter and display the data properly, the target graph has to be analyzed according to the metrics described previously. This process starts by retrieving the list resources and their corresponding types from the dataset. Using this information the system counts how many times each property is instantiated, keeping track of the type of the subject of the triple as well as of the object (only for object properties). Only resources with types are thus considered in the system.
In this process, the system records the statistical usage of properties, both as ingoing and outgoing links of instances of each type. We define a property as outgoing for a given type when an instance of such type is used as the subject of a triple. Similarly, ingoing properties are those with a resource of a given type as the object.
With this information, the system is able to generate a profile for each type in the dataset, assigning weights to each type-property combination. In order to have a relative measure, raw counts are normalized for each type, generating a weight value from 0 to 1 (that is, dividing all weights by the maximun one found for the given type). Values closer to 1 would imply highly used properties, and thus highly relevant, following the discussion introduced previous section. Those properties with weights close to 0 imply that these properties are low in relevance as they are rarely used.
The result of this process is exemplified in which displays the statistic analysis of the Jamendo dataset, which we will describe in . The table depicts the property counts for the class mo:MusicArtist in the dataset, including the total and normalized counts for both ingoing and outgoing properties.
Property | Prop.Type | Count | Normalized weight |
---|---|---|---|
foaf:made |
Outgoing |
5786 |
1.0 |
foaf:based_near |
Outgoing |
3244 |
0.5606 |
foaf:name |
Outgoing |
3505 |
0.6057 |
foaf:homepage |
Outgoing |
3006 |
0.5195 |
foaf:img |
Outgoing |
2982 |
0.5154 |
mo:biography |
Outgoing |
485 |
0.0838 |
owl:sameAs |
Outgoing |
119 |
0.0206 |
foaf:maker |
Ingoing |
5786 |
1.0 |
Besides the property usage for each type, the system also analyzes how many properties link to each resource in the dataset, obtaining the ingoing and outgoing degree for each node (i.e. individual) in the graph. These measurements indicate the connectivity degree of the node, and serve as indicators of the relevance of the node in the graph. Intuitively, the more connected a node is, the more relevant it is.
shows the information generated during this process for a set of resources belonging to different classes. As explained for the previous process, the system normalizes the raw counts, so as to have a consistent measurement of the importance of the node in the graph. This is achieved by dividing each ingoing and outgoing count by the corresponding maximum value found on the graph.
Resource | Type | Ingoing Deg. | Outgoing Deg. | Norm. In. Deg. | Norm. Out. Deg. |
---|---|---|---|---|---|
http://dbtune.org/jamendo/record/3280 |
mo:Record |
1 |
109 |
0.00009 |
1.0 |
http://dbtune.org/jamendo/artist/1442 |
mo:MusicArtist |
25 |
54 |
0.0022 |
0.2660 |
http://dbtune.org/jamendo/lyrics/23855 |
mo:Lyrics |
1 |
1 |
0.00009 |
0.0091 |
Once these two analysis (i.e. properties usage and connectivity degree) have been conducted, their results are combined, assigning to each node on the graph the corresponding information. That is, each node in the graph will be annotated with the connectivity degrees and the usage weights for each property instantiated for it, both for ingoing and outgoing properties. As discussed in the property weights annotations are done using the relationship features of Neo4j.
As depicted in , each property of the resource http://dbtune.org/jamendo/artist/1442 is annotated with their corresponding weights, along with the ingoing and outgoing degrees of the resource. These parameters can be used to filter and reduce the elements to be displayed on the final diagram, as we introduce in the following section.
When comparing and , we can observe how the system provides different visualizations based on the same dataset, by applying different heuristics. Whereas these two heuristics are rather simple and straightforward they showcase the potential of our approach. The Top-K heuristic provides a more consistent view along the graph, depicting always the same amount of nodes each step. The Threshold heuristic is intented to generate more restrictive visualizations, in which only strictly relevant nodes are depicted.
In the coming sections we will discuss about how this system is implemented and how the usage of a graph-oriented database allows to store the required annotations and perform queries using them efficiently.
The information extracted from the process described in has to be stored in a way that can be accessed efficiently. This requires a data system able to support the annotation of properties and nodes. Both things could be done using just RDF over the original dataset, using custom properties for annotating nodes and reification statements for the properties. However we decided to use a more graph-oriented solution, providing better performance and ease of use when querying the data. The use of extra properties, and specially the use of reification, would increase the complexity of SPARQL queries, difficulting the process of exploring the graph.
For these reasons, we selected Neo4j as our data backend solution, for storing the dataset and the related annotations. This graph database offers the required features for storing the nodes and the properties (known as relationships according to Neo4j's naming conventions), with annotations for both of them. These annotations can be used directly in Cypher queries, the query language used in Neo4j. , shows the two types queries used for the Top-K and Threshold algorithms respectively. As we can see, we can exploit the annotations over the properties ("r" in both examples) to either sort by weight and select the top two properties (top-k heuristic) or to filter based on the weight being higher than a given value (threshold heuristic).
The frontend of the system is developed as a web interface, which aims to provide a clean and clear visualization experience to the final user. The graphs visualization is supported by vis.js, a popular Javascript library for drawing statical reports, graphs and relation diagrams on the web. The interface itself is composed by two parts, the main area, in which graphs are displayed, and the right menu, which is hidden by default in order to maximize the visualisation space.
As depicted in the GUI figures in this paper, the graphs are displayed in the main area, following a shade color scheme. The root node, the one the user queried, is displayed with the strongest red hue. As the distance from this root node increases, the color code of the node gets lighter and lighter. This is intended for the user to have a more clear view of which are the most relevant nodes, assuming that those closer to the main one are more informative for it. Circles and nodes are used to differentiate between object nodes and literal ones respectively. In both cases, the local name or the literal value are used as the node label. Properties are also labeled with their name and weight.
All nodes are clickable objects. On double click, the right menu is opened and the information about the selected node is displayed, as seen in .
The code, implemented in Java, is divided in two different projects, for the analysis process and interface, which are available online.
In this section we introduce the result of applying our approach to the Jamendo dataset, currently available in Graphless. This dataset contains information about records, musical artists, torrents, lyrics, signals, tags, documents and tracks. It combines five different ontologies and contains 412,564 object nodes, 251,698 literals, and a total of 1,267,012 instantiated relations.
The data ingestion process, undertaken during the analysis process, starts by querying the public Jamendo endpoint, from which the system retrieves all the subjects, objects, and their related properties and types. This is done by means of SPARQL queries, which are executed with a time delay between them to avoid overloading the system.
Jamendo individuals and properties can be visualized on the currently available version of Graphless, using the two exploratory algorithms described above. The final dataset is also available as a Neo4j database
In this paper we introduce Graphless, a data visualization tool for semantic datasets, which combines statistical analysis and graph traverse heuristics to produce graph summaries. Combining these two approaches the system is able to provide smaller diagrams, which are easier to read and interpret.
This work represents a first version of the tool, in which only the Jamendo dataset has been included. We plan to extend the collection of datasets, including more representative ones, such as DBpedia. Our intuition is that a tool such as this would be of help specially in heterogeneous datasets, containing information from different sources and with a wide range of varied types and properties.
From the technical point of view, we plan to include new features, currently under development, such as the generation of RDF snippets from the visualizations. We also plan to support new algorithms implementing novel heuristics. Our final goal is to provide a programmable interface in which developers can integrate their algorithms based on the available data properties.
Finally, we are currently conducting a usability evaluation with several users, which will provide feedback for improving the tool. This will allow us to measure how suitable the two assumptions introduced in are, as well as to explore new ones.
This work was partially funded by the esTextAnalytics (RTC-2016-4952-7) project, from the Spanish State Investigation Agency of the MINECO and FEDER Funds.
Y Raimond, MB Sandler, Q Mary: A Web of Musical Information, ISMIR, 2008
Heiko Paulheim and Christian Bizer: Type Inference on Noisy RDF Data. In: Lecture notes in computer scienceThe Semantic Web - ISWC 2013 : 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part I; 510-525. Springer, Berlin [u.a.], 2013.
Thalhammer, Andreas, Nelia Lasierra and Achim Rettinger. "LinkSUM: Using Link Analysis to Summarize Entity Data." ICWE (2016).
Thalhammer, Andreas and Achim Rettinger. "Browsing DBpedia Entities with Summaries." ESWC (2014).
Kalpa Gunaratna, Krishnaprasad Thirunarayan, and Amit Sheth. 2015. FACES: diversity-aware entity summarization using incremental hierarchical conceptual clustering. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI'15). AAAI Press 116-122.
Vitalis Wiens, Steffen Lohmann, and Sören Auer. 2017. Semantic Zooming for Ontology Graph Visualizations. In Proceedings of the Knowledge Capture Conference (K-CAP 2017). ACM, New York, NY, USA, Article 4, 8 pages. DOI: https://doi.org/10.1145/3148011.3148015
Lohmann, S.; Link, V.; Marbach, E.; Negru, S.: WebVOWL: Web-Based Visualization of Ontologies. Proceedings of EKAW 2014 Satellite Events, LNAI 8982, pp. 154-158, Springer, 2015
Chawuthai R., Takeda H. (2016) RDF Graph Visualization by Interpreting Linked Data as Knowledge. In: Qi G., Kozaki K., Pan J., Yu S. (eds) Semantic Technology. JIST 2015. Lecture Notes in Computer Science, vol 9544. Springer, Cham
Nikos Bikakis, Timos K. Sellis: Exploration and Visualization in the Web of Big Linked Data: A Survey of the State of the Art. EDBT/ICDT Workshops 2016
Andrea Giovanni Nuzzolese, Aldo Gangemi, Valentina Presutti, Paolo Ciancarini: Encyclopedic Knowledge Patterns from Wikipedia Links. International Semantic Web Conference (1) 2011: 520-536
The online version of this paper can be found in http://w3id.org/people/idafensp/papers/graphless/
The following prefixes are used throughout the text: mo (http://purl.org/ontology/mo/), foaf (http://xmlns.com/foaf/0.1/), tag (http://www.holygoat.co.uk/owl/redwood/0.1/tags/), owl (http://www.w3.org/2002/07/owl#)