RML logical views is an extension of the RDF Mapping Language (RML) that increases the language's capability to
construct RDF datasets from nested input data, to join data sources (also across data hierarchies),
and to handle data sources that mix source formats,
by allowing users to specify a logical view: a flattened, source-format-agnostic view
over one or more existing data sources.
Additionally, it provides a mechanism to express relationships between data sources,
as well as additional information about their fields, through structural annotations.
This document describes RML logical views through definitions and examples.
The examples are contained in color-coded boxes. We use the Turtle syntax [Turtle] to write RDF.
1.2 Conformance
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY, MUST, and SHOULD in this document
are to be interpreted as described in
BCP 14
[RFC2119] [RFC8174]
when, and only when, they appear in all capitals, as shown here.
2. Problem
This section is non-normative.
RML Logical Views aims to resolve challenges such as handling the hierarchy of nested data, joining data more flexibly (also across data hierarchies), and handling data sources that mix source formats.
2.1 Nested data structures
References to nested data structures, like JSON or XML, may return multiple values. These values can be composite: they may again contain multiple values.
RML-Core defines mapping constructs that produce results by combining the results of other mapping constructs in a specific order.
For example, a triples map combines the results of a subject map and a predicate-object map in that order.
Another example is a template expression,
which combines character strings and zero or more reference expressions in declared order.
When mapping constructs produce multiple results, the combining mapping constructs will apply an n-ary Cartesian product over the sets of results, maintaining the order of the mapping constructs. In the case of nested data structures, this may cause the generation of results that do not match the source hierarchy, i.e. do not follow the root-to-leaf paths in the source data, since values are combined irrespective of it.
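The effect described above can be illustrated with a minimal Python sketch. The input document and the names `data`, `names`, and `items` are hypothetical and only serve this illustration; the sketch is not how an RML processor is implemented.

```python
from itertools import product

# Hypothetical nested input: each person owns their own list of items.
data = {
    "people": [
        {"name": "alice", "items": ["sword", "shield"]},
        {"name": "bob", "items": ["flower"]},
    ]
}

# Two independent references evaluated against the whole document
# each return a flat list of values; the nesting is lost.
names = [person["name"] for person in data["people"]]
items = [item for person in data["people"] for item in person["items"]]

# Combining the two result sets with a Cartesian product, in declared order.
combined = list(product(names, items))

# The pairs that actually follow root-to-leaf paths in the source.
faithful = [
    (person["name"], item)
    for person in data["people"]
    for item in person["items"]
]

# Pairs produced by the product that do not match the source hierarchy.
spurious = [pair for pair in combined if pair not in faithful]
print(spurious)
```

Here the product yields six pairs, while only three of them correspond to a person and one of that person's own items.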
Furthermore, there is varying expressiveness in data source expression and query languages, and many languages have limited support for hierarchy traversal. For example, JSONPath has no operator to refer to an ancestor in the document hierarchy.
This limits the ability of RML-Core to map nested data.
2.2 Mixed data formats
Data in one format can contain multiple or composite values stored in another format, e.g. a CSV dataset could contain columns containing JSON values. To define the expected form of references to input data, RML-Core employs the notion of a reference formulation, which is a property of every logical source. However, a logical source is currently limited to a single reference formulation, meaning mixed-format data can only be referenced using a query language that supports just one of the formats.
2.3 Joining of data sources
RML-Core restricts join operations to referencing object maps. Since a referencing object map can only generate an object that is an IRI or blank node subject as specified by a parent triples map, it is not possible to combine data from two sources in one term, use data from a join on another position than the object, or generate a literal using data from a join.
Moreover, RML-Core cannot declare join operations correctly across hierarchies.
3. Records
A record is created using an iterator or an expression. Depending on the source type, records might take different forms: for tabular data sources, a record might be a row or a cell; for tree-structured sources like XML, a record might be a node; for document-structured sources like JSON, a record might be a document or property value.
A record MUST have a string representation. It MAY be possible to derive other records from a record using an expression.
For a given record, the evaluation of an expression against it MUST either result in an ordered sequence of records, called the expression values, or throw an error. An expression MUST be valid for the given reference formulation.
A record sequence is an ordered sequence of sets of key-value pairs, where each key is a string and each value a record. A record sequence MUST have a finite set of keys that appear in each set in the sequence. In any particular set in a record sequence, the value of a key MAY be a null value.
An iterator defines a record sequence from the iterator's logical source, called the iterator record sequence. This record sequence has two keys:
An index key #, whose values are the positions of the entries in the sequence defined by the iterator.
A key <it>, whose values are the records in the sequence defined by the iterator.
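As a rough illustration of the two keys, the following Python sketch builds an iterator record sequence from a plain list of records. The function names are hypothetical, and the zero-based index is an assumption: the text above does not state where positions start.

```python
def iterator_record_sequence(records, start=0):
    """Build a record sequence with the keys '#' and '<it>'.

    `start=0` is an assumption of this sketch, not something the
    definition above prescribes.
    """
    return [
        {"#": index, "<it>": record}
        for index, record in enumerate(records, start)
    ]


def has_consistent_keys(sequence):
    """Check that every set of key-value pairs in the sequence has the same keys.

    Values may be None, which stands in for a null value here.
    """
    key_sets = {frozenset(entry) for entry in sequence}
    return len(key_sets) <= 1


sequence = iterator_record_sequence(["first record", "second record"])
print(sequence)
```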
An iterable field (rml:IterableField) is a type of iterable.
Consequently, an iterable field MUST have a reference formulation and a logical iterator.
If no reference formulation is declared for a field, the reference formulation of the field's parent is implied.
5.1 Field parents
A field MUST have a parent that is either an abstract logical source or another field. The parent relation MUST not contain cycles: it is tree-shaped with a logical view as its root. The transitive parents of a field, i.e., the field's parent, the parent of the field's parent, and so on, are called the field's ancestors.
5.2 Field names
A field MUST have a declared name that is an alphanumerical string. Fields with the same parent MUST have different declared names. If a field's parent is another field, we distinguish between the field's declared name and the field's name. A field's name is the concatenation of the name of the parent field, a dot ., and the field's declared name.
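The naming rule can be sketched in Python as follows. The `Field` class and the helper `has_unique_sibling_names` are hypothetical names for this illustration; a parent of `None` stands in for "the parent is the abstract logical source", and Python's `str.isalnum` is used as an approximation of "alphanumerical".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    declared_name: str
    # None is this sketch's stand-in for an abstract logical source parent.
    parent: Optional["Field"] = None

    def __post_init__(self):
        # Approximation: str.isalnum also accepts non-ASCII letters and digits.
        if not self.declared_name.isalnum():
            raise ValueError(f"not alphanumerical: {self.declared_name!r}")

    @property
    def name(self):
        """Parent field's name, a dot, and the declared name."""
        if self.parent is None:
            return self.declared_name
        return f"{self.parent.name}.{self.declared_name}"

    def ancestors(self):
        """Field ancestors, nearest first (the root view itself is not modelled)."""
        result = []
        current = self.parent
        while current is not None:
            result.append(current)
            current = current.parent
        return result


def has_unique_sibling_names(fields):
    """Check that fields sharing a parent have different declared names."""
    seen = set()
    for f in fields:
        key = (id(f.parent), f.declared_name)
        if key in seen:
            return False
        seen.add(key)
    return True


person = Field("person")
item = Field("item", parent=person)
item_type = Field("type", parent=item)
print(item_type.name)
```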
A logical view join (rml:LogicalViewJoin) is an operation that extends the logical iteration of one logical view (the child logical view) with fields derived from another logical view (the parent logical view).
A logical view join has:
exactly one parent logical view property (rml:parentLogicalView), whose value is a logical view (rml:LogicalView) that supplies the additional fields, fulfills the role of the parent logical source in the join condition(s) of the logical view join, and is referred to as parent logical view.
at least one join condition property (rml:joinCondition), whose value is a join condition.
at least one field property (rml:field), whose value is an expression field (rml:ExpressionField). This field SHOULD only contain field references that can be evaluated on the parent logical view.
A left join (rml:leftJoin) is the equivalent of a left (outer) join in SQL, where the child logical view is the left part of the join, and the parent logical view is the right part of the join. If any of the join conditions evaluates to false, the fields from the logical view join in the extended logical iteration contain a null value.
An inner join (rml:innerJoin) is the equivalent of an inner join in SQL. If any of the join conditions evaluates to false, the logical iteration is removed from the child logical view.
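The difference between the two join types can be sketched in Python over lists of dictionaries, where each dictionary plays the role of one logical iteration. The function `logical_view_join` and its parameters are hypothetical; join conditions are simplified to equality between one child key and one parent key, and, following SQL, a null (`None`) child value never matches.

```python
def logical_view_join(child_rows, parent_rows, conditions, fields, kind="left"):
    """Extend child iterations with fields taken from matching parent iterations.

    conditions: list of (child_key, parent_key) pairs that must all be equal.
    fields:     dict mapping a new field name to the parent key it is read from.
    kind:       "left" keeps unmatched child iterations with null fields,
                "inner" removes them.
    """
    result = []
    for child in child_rows:
        matches = [
            parent
            for parent in parent_rows
            if all(
                child[c] is not None and child[c] == parent[p]
                for c, p in conditions
            )
        ]
        if matches:
            for parent in matches:
                added = {name: parent[key] for name, key in fields.items()}
                result.append({**child, **added})
        elif kind == "left":
            # No parent iteration satisfies all conditions: fill with nulls.
            result.append({**child, **{name: None for name in fields}})
        # For kind == "inner", the unmatched child iteration is dropped.
    return result


people = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
items = [{"owner": 1, "type": "sword"}, {"owner": 1, "type": "shield"}]

left = logical_view_join(people, items, [("id", "owner")], {"item_type": "type"}, "left")
inner = logical_view_join(people, items, [("id", "owner")], {"item_type": "type"}, "inner")
```

In this example `left` keeps the iteration for bob with a null `item_type`, while `inner` contains only the two iterations for alice.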
6.2 Logical view join examples
6.2.1 Left join
6.2.2 Inner join
6.2.3 Two left joins
7. Structural Annotations
Structural annotations provide a mechanism to express relationships between logical views, as well as additional information about fields.
Each logical view MAY have zero or more structural annotation properties (rml:structuralAnnotation), connecting the logical view to a structural annotation object (i.e., of type rml:StructuralAnnotation).
All structural annotations of a logical view lv MUST have an on fields property (rml:onFields), linking the structural annotation to a list of field names occurring in lv. Intuitively, the on fields property specifies the fields in lv that the structural annotation involves. The semantics of this involvement depends on the specific annotation.
Property | Domain | Range
rml:onFields | rml:StructuralAnnotation | rdf:List
7.1 Invariance Principle
Structural annotations provide additional information about the data that might be used by the RML processor to optimize the KG construction process. If this additional information is incorrect, then the RML processor might either fail or produce wrong results. When using structural annotations, users should make sure that the following invariance principle is satisfied:
For any source instances, the RDF graph produced by the RML engine using an RML mapping with annotations, and the same RML mapping where annotations have been removed, MUST be the same.
We emphasize that RML engines might exploit structural annotations, or might ignore them entirely. It is the responsibility of the user to make sure that the annotations provided are indeed correct (that is, the data complies with the annotations). Sanity checks MAY be provided by the RML engines themselves, but this is not mandatory. Note that providing wrong annotations to an engine that takes annotations into account, for instance to apply optimizations, could result in a violation of the invariance principle, with unpredictable results.
7.2 IriSafe
An IriSafe structural annotation (rml:IriSafeAnnotation) on fields F indicates that the content of each field in F is IRI safe, that is, each field in F does not contain any character that is not in the iunreserved production in RFC3987.
7.3 PrimaryKey
A PrimaryKey structural annotation (rml:PrimaryKeyAnnotation) on fields (f1, ..., fn) imposes two conditions:
no duplicate record sequences are present over the list of fields (f1, ..., fn);
no NULL value is admitted in any of the fields f1, ..., fn.
7.4 Unique
The Unique structural annotation (rml:UniqueAnnotation) is analogous to the notion of UNIQUE constraints in databases. Specifically, a Unique annotation on fields (f1, ..., fn) imposes the following condition:
no duplicate record sequences are present over the list of fields (f1, ..., fn).
7.5 NotNull
The NotNull structural annotation (rml:NotNullAnnotation) is analogous to the notion of NOT NULL constraints in databases. Specifically, a NotNull annotation on fields F imposes that each field in F does not contain NULL values.
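The conditions imposed by the PrimaryKey, Unique, and NotNull annotations can be sketched as sanity checks over a list of logical iterations, here represented as Python dictionaries. All function names are hypothetical. One assumption is worth stating: this sketch compares `None` like any other value when looking for duplicates, whereas SQL UNIQUE constraints treat NULLs as distinct; the text above does not settle this point.

```python
def project(rows, fields):
    """Project each row onto the listed fields, as a tuple of values."""
    return [tuple(row[f] for f in fields) for row in rows]


def satisfies_not_null(rows, fields):
    """No null (None) value in any of the listed fields."""
    return all(value is not None for values in project(rows, fields) for value in values)


def satisfies_unique(rows, fields):
    """No duplicate value combination over the listed fields.

    Assumption: None is compared like an ordinary value.
    """
    tuples = project(rows, fields)
    return len(tuples) == len(set(tuples))


def satisfies_primary_key(rows, fields):
    """PrimaryKey combines the Unique and NotNull conditions."""
    return satisfies_unique(rows, fields) and satisfies_not_null(rows, fields)


rows = [
    {"id": 1, "name": "alice", "email": None},
    {"id": 2, "name": "bob", "email": "bob@example.org"},
    {"id": 3, "name": "alice", "email": "alice@example.org"},
]
print(satisfies_primary_key(rows, ["id"]))
```

An engine that offers sanity checks could run such tests before relying on an annotation, but nothing in this specification requires it to do so.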
7.6 ForeignKey
The ForeignKey structural annotation (rml:ForeignKeyAnnotation) is analogous to the notion of foreign key constraint in databases. Specifically, a ForeignKey annotation on fields (f1, ..., fn), target view lv, and target fields (tf1, ..., tfn) imposes the following conditions:
each NULL-free record sequence over the list of fields (f1, ..., fn) occurs also as a record sequence in (tf1,...,tfn);
The target view is a logical view specified through the property rml:targetView, whereas the target fields are an RDF list of field names specified through the property rml:targetFields. These two properties are specified as follows:
Property | Domain | Range
rml:targetView | rml:InclusionAnnotation | rml:LogicalView
rml:targetFields | rml:InclusionAnnotation | rdf:List
Therefore, each ForeignKey annotation MUST specify (in addition to the inherited rml:onFields property):
Exactly one rml:targetView property
Exactly one rml:targetFields property.
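The inclusion condition stated above, that each NULL-free combination of values over (f1, ..., fn) also occurs over (tf1, ..., tfn) in the target view, can be sketched in Python as follows. The function name and the representation of views as lists of dictionaries are assumptions of this illustration.

```python
def satisfies_inclusion(rows, fields, target_rows, target_fields):
    """Check that every NULL-free value combination over `fields` in `rows`
    also occurs over `target_fields` in `target_rows`."""
    if len(fields) != len(target_fields):
        raise ValueError("fields and target fields must have the same length")

    # Value combinations available in the target view.
    targets = {tuple(row[f] for f in target_fields) for row in target_rows}

    for row in rows:
        values = tuple(row[f] for f in fields)
        if any(value is None for value in values):
            # Combinations containing a null are exempt from the condition.
            continue
        if values not in targets:
            return False
    return True


items = [
    {"owner": 1, "type": "sword"},
    {"owner": None, "type": "shield"},
]
people = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

print(satisfies_inclusion(items, ["owner"], people, ["id"]))
```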
7.7 Inclusion
The Inclusion structural annotation (rml:InclusionAnnotation) is analogous to the notion of inclusion dependency in databases. Specifically, an Inclusion annotation on fields (f1, ..., fn), target view lv, and target fields (tf1, ..., tfn) imposes the following condition:
each NULL-free record sequence over the list of fields (f1, ..., fn) occurs also as a record sequence in (tf1, ..., tfn).
As for the ForeignKey annotation, the target view MUST be a logical view specified through the property rml:targetView, whereas the target fields MUST be an RDF list of field names specified through the property rml:targetFields.
Therefore, each Inclusion annotation MUST specify (in addition to the inherited rml:onFields property):