CLASSIFICATION OF SPARQL QUERIES INTO EQUIVALENCE CLASSES OF RELEVANT QUERIES

This paper is inspired by ideas from theoretical mathematics that are used to partition abstract spaces into equivalence classes, and applies analogous concepts to propose a classification of SPARQL queries into equivalence classes. The novel concepts of relevant queries and covering query are introduced in a manner appropriate for the study of SPARQL queries. These definitions shed new light on the relations among SPARQL queries. They enable the formal identification of similar queries, which leads to the partition of SPARQL queries into equivalence classes of relevant queries. This work also discusses how the covering query relating two or more relevant queries can be useful from the perspective of computational cost when evaluating composite queries composed of simpler relevant queries. Hence, the concept of relevance between queries provides not only theoretical advantages but also practical ones, which in many cases have the potential to lower the computational cost of query evaluation.


INTRODUCTION
Today one of the most important areas of research is undoubtedly the Semantic Web. During the last decade, the Semantic Web, combined with a spectrum of related technologies, e.g., Linked Open Data [1], has transformed the way we perceive the World Wide Web. One of the key reasons for the success of the Semantic Web is the fact that it is based on standards. The Resource Description Framework (RDF) and SPARQL are probably the two most important standards of the Semantic Web.
The Resource Description Framework is used to store data in the form of a directed graph [2]. The contents of the directed graph are viewed as triples (subject, predicate, object), where the subject is related to the object through the predicate. SPARQL [3] is the de facto standard language that is used for querying RDF datasets. Of particular importance from our viewpoint is the class of Regular Path Queries (RPQ for short). These are SPARQL queries that concern pairs of nodes of the RDF graph. An underlying path consisting of directed edges of the RDF graph begins from the first node and terminates at the second node. This path satisfies certain properties and these properties are formulated in terms of simple regular expressions that are suitable for this purpose.
In this context, the so-called "transitive" predicates play a particularly important role. A predicate R, which can conveniently be viewed as a label of one or more directed edges of the RDF graph, is called transitive if one can validly infer the triple (a, R, c) from the existence of the two triples (a, R, b) and (b, R, c) in the RDF dataset. SPARQL queries that utilize transitive predicates are the most suitable examples for demonstrating the concept of "relevant" queries, which is the main theme of this paper. This work is inspired by theoretical ideas from the field of Mathematics which are used in order to assess the similarity between abstract mathematical notions. We investigate how these ideas can be infused in the context of SPARQL queries so as to provide a theoretical partition of the set of SPARQL queries into equivalence classes, where each class contains "relevant" queries, that is, queries that are connected in a precise formal way.
Contribution. The main contribution of this work lies in its novelty. This paper advocates the use of mathematical notions for the classification of SPARQL queries into equivalence classes. Mathematical ideas have always been used in a fruitful way to tackle concrete computational problems. Following this line of thought, this work proposes the use of abstract mathematical concepts as a tool for the classification and subsequent evaluation of SPARQL queries. The idea of relevant SPARQL queries, which is introduced here, has far-reaching ramifications because it reveals hidden connections between queries. These connections, apart from being of interest in their own right, can also be used to improve the computational cost of evaluating those composite SPARQL queries that are comprised of relevant queries. In such composite queries, which are often encountered in practice, a covering query, that is a query that establishes the formal connection among the relevant queries, can be used instead of the individual relevant queries. The use of a covering query is advantageous because it will enable the evaluation of the composite query in a more efficient manner, requiring less computational time.
The paper is organized as follows: Section 2 contains references to other related works, Section 3 presents the definitions and the notation used in this work, Section 4 lists and analyzes the main results, and, finally, Section 5 summarizes conclusions and suggests some possible ideas for future work.

RELATED WORK
The notion of "similar" queries in the general context of web searching has been studied extensively (see [4] and [5] for some recent progress and more references on the subject). However, it should be emphasized that in this context the query is not a SPARQL query applicable to an RDF dataset, but just a keyword-based query, typically submitted by the user when searching for some information on the Internet. The present paper focuses on SPARQL queries and establishes a type of similarity among such queries based upon a rigorous definition. To avoid any potential confusion, we shall henceforth use the characterization "relevant" in our study of SPARQL queries. In the rest of this section we briefly mention a few other works that are related to the present article in the sense that they focus on SPARQL and RDF graphs from a theoretical viewpoint.
In [6] Schmidt et al. study equivalences in the context of SPARQL algebra. The main theme of their work is the classification of SPARQL fragments in complexity classes. They extensively use SPARQL set algebra and study both set and bag semantics. Our work differs from theirs in that we give a different and new definition for the equivalence of SPARQL queries, introducing at the same time the novel concept of covering query, and we also avoid the use of SPARQL algebra.
Zhang et al. [7] proposed an extension of navigational path queries using elements from the theory of context-free languages. Since context-free constructs are more expressive than regular expressions, this approach enhances the expressive power of SPARQL queries. The resulting language is named cfSPARQL and, as the name suggests, endows standard SPARQL with context-free grammars. cfSPARQL enables the user to formulate more powerful and complex queries than SPARQL. The authors claim that the increased expressive power does not come with an increased computational cost, i.e., in most practical examples query evaluation in cfSPARQL remains efficient.
An important theoretical work by Sistla et al. [8] demonstrated the relationship of database queries with finite automata. In [8] the authors developed a technique by which database queries, e.g., nearest neighbor queries, are expressed using an automata-theoretic approach. Ideas and methods from the theory of finite automata motivated Wang et al. in [9] to devise an algorithm suitable for evaluating RDF queries. They also presented experimental results that confirm that the methodology they propose is capable of handling efficiently certain categories of regular path queries on large scale RDF graphs.
Another theoretical work that investigated the correlation of queries on RDF datasets to certain types of finite automata appeared in [10]. There the emphasis was on the practically infinite nature of Linked Data repositories, which is a reasonable abstraction if one takes into account their ever increasing size. This line of thought was further pursued in [11], where a connection is established between SPARQL queries involving transitive predicates, ω-regular languages, i.e., the analog of regular languages in the case of infinite words, and finite automata accepting infinite inputs. Tools and techniques from the theory of probabilistic automata can also be used when dealing with data characterized by a certain degree of uncertainty, e.g., biomedical data, as was demonstrated in [12].
All the previous references serve to indicate that ideas and methods originating from theoretical disciplines can be successfully adapted to more concrete and practical environments, such as the evaluation of SPARQL queries. It is this point of view that characterizes this paper, where the inspiration comes from the field of mathematics and leads to the introduction of novel notions like relevant queries and covering query.

DEFINITIONS AND NOTATION
SPARQL queries return information stored in an RDF graph. The underlying syntax is rather user-friendly and enables the user to retrieve data that match a certain pattern. In this work we shall use the notation designated in the following definition for the answer set returned by a query q when applied to the dataset D. The examples used to demonstrate the concept of relevant queries will rely on the use of so-called transitive predicates and will take advantage of the new navigational capabilities of SPARQL 1.1 [3].
Definition 1. Let q(x 1 , …, x n ) be a SPARQL query involving the n projection variables ?x 1 , …, ?x n in the SELECT clause of the query, and let D be an RDF dataset. The result of applying q(x 1 , …, x n ) to D will be called the answer set of q over D and will be denoted by q(x 1 , …, x n )[D].
Consider a simple SPARQL query Q(x 1 , x 2 ) like the one shown in Figure 1a. If the projection variable x 2 is removed from the SELECT clause of Q(x 1 , x 2 ) and all other occurrences of x 2 are replaced by the constant destination, then the result is another query Q´(x 1 ), shown in Figure 1b, containing the single projection variable x 1 . Symmetrically, if the projection variable x 1 is removed from the SELECT clause of Q(x 1 , x 2 ) and all other occurrences of x 1 are replaced by the constant source, then the result is the query Q´´(x 2 ), shown in Figure 1c, containing the single projection variable x 2 . It will be convenient to introduce the following notation to describe such substitutions of variables by constants.
Definition 2. Let q(x 1 , …, x n ) be a SPARQL query involving the n projection variables ?x 1 , …, ?x n in the SELECT clause of the query. We write q(x 1 , …, x n ){x i1 |c 1 , …, x im |c m } to denote the query q´ arising from q, if all the m projection variables ?x i1 , …, ?x im are removed from the SELECT clause of q(x 1 , …, x n ) and all remaining occurrences of ?x i1 , …, ?x im are replaced by the m constants c 1 , …, c m , respectively. Obviously, m ≤ n.
With the above notation, the queries Q´(x 1 ) and Q´´(x 2 ), depicted in Figure 1b and Figure 1c, respectively, can be written as Q(x 1 , x 2 ){x 2 |destination} and Q(x 1 , x 2 ){x 1 |source}, which immediately reveals that they are special instances of the more general query Q(x 1 , x 2 ) of Figure 1a. In the sequel, we will often refer to such an action as the application of a substitution to a given query, e.g., applying the substitution {x 2 |destination} to Q(x 1 , x 2 ) gives rise to Q´(x 1 ).
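The substitution notation of Definition 2 can be mimicked in a few lines of Python. The sketch below is only an illustration: the tuple-based query representation, the helper name `substitute`, and the single pattern `?x1 P+ ?x2` standing in for Q(x 1 , x 2 ) are assumptions made here and are not taken from the figures.

```python
# Hypothetical representation: a query is (projection variables, triple patterns).
def substitute(query, bindings):
    """Apply a substitution {var|const, ...} in the sense of Definition 2."""
    select, patterns = query
    # Substituted variables leave the SELECT clause ...
    new_select = tuple(v for v in select if v not in bindings)
    # ... and every remaining occurrence is replaced by its constant.
    new_patterns = [
        tuple(bindings.get(term, term) for term in pattern)
        for pattern in patterns
    ]
    return new_select, new_patterns


# Assumed stand-in for Q(x1, x2); the predicate name P is illustrative only.
Q = (("?x1", "?x2"), [("?x1", "P+", "?x2")])

Q_prime = substitute(Q, {"?x2": "destination"})    # Q{x2|destination}
Q_double_prime = substitute(Q, {"?x1": "source"})  # Q{x1|source}

print(Q_prime)
print(Q_double_prime)
```

Under these assumptions, each substituted query keeps exactly one projection variable, mirroring Q´(x 1 ) and Q´´(x 2 ).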

Remark 1.
If a query q´(y 1 , …, y n ) containing exactly n projection variables results from the query q(x 1 , …, x n ), also containing exactly n projection variables, by renaming all occurrences of x 1 , …, x n to y 1 , …, y n , respectively, then the queries q and q´ will be considered identical. In other words, consistent renaming of the projection variables in a query leaves the query unchanged and so q(x 1 , …, x n ) and q´(y 1 , …, y n ) are in fact the same query.
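Consider two SPARQL queries q 1 and q 2 and let us further assume that both queries involve n projection variables. We call q 1 and q 2 relevant if they can be related by another query Q that utilizes at least n variables. Formally, the following definition captures the notion of relevance between queries.
Definition 3. Let q 1 (x 1 , …, x n ) and q 2 (x 1 , …, x n ) be two SPARQL queries, each involving n projection variables, and let D be an RDF dataset. The queries q 1 and q 2 are called relevant if there exist a query Q(x 1 , …, x n , x n+1 , …, x n+m ) with n+m projection variables, where m ≥ 0, and two substitutions θ 1 and θ 2 , each replacing m projection variables of Q by constants (c 1 , …, c m and d 1 , …, d m , respectively), such that the queries Q 1 (x i1 , …, x in ) = Qθ 1 and Q 2 (x k1 , …, x kn ) = Qθ 2 satisfy Q 1 (x i1 , …, x in )[D] = q 1 (x 1 , …, x n )[D] and Q 2 (x k1 , …, x kn )[D] = q 2 (x 1 , …, x n )[D]. Such a query Q is called a covering query for q 1 and q 2 .
We write q 1 ~ q 2 to denote that q 1 and q 2 are relevant.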
Some clarifications are perhaps necessary in order to better understand the above definition.
• First, we emphasize that the covering query Q involves n+m projection variables, where m ≥ 0, whereas each of the two relevant queries q 1 and q 2 involves exactly n projection variables.
• In writing Q 1 (x i1 , …, x in ) and Q 2 (x k1 , …, x kn ), the meaning is that both Q 1 (x i1 , …, x in ) and Q 2 (x k1 , …, x kn ) result from the covering query Q(x 1 , …, x n , x n+1 , …, x n+m ) by substituting the m remaining projection variables by m constants. In the first case the m constants are c 1 , …, c m and in the second case the m constants are d 1 , …, d m .
• The resulting query Q 1 (x i1 , …, x in ) involves the n projection variables x i1 , …, x in . Likewise, Q 2 (x k1 , …, x kn ) involves the n projection variables x k1 , …, x kn . These n projection variables are in general different and are also different from the n initial variables x 1 , …, x n of the covering query Q.
• The answer sets Q 1 (x i1 , …, x in )[D] and Q 2 (x k1 , …, x kn )[D] are sets of n-tuples, as required to achieve the equality with the answer sets q 1 (x 1 , …, x n )[D] and q 2 (x 1 , …, x n )[D], respectively.
• The constants c 1 , …, c m and d 1 , …, d m correspond to URIs appearing in D and will also in general be different.
The following example will hopefully serve as a useful introduction to the notion of relevant queries.

Example 1.
Consider the SPARQL query q 1 shown in Figure 2a. This query, when applied to an RDF graph that contains the transitive predicate P, will return all those nodes that are connected to the node destination through one or more edges labeled by the same transitive predicate P.
Let us emphasize that in this query we regard predicate P as transitive in the sense that if (a, P, b) and (b, P, c) are two triples stored in the RDF dataset, then, on a semantic level, we may infer the triple (a, P, c). Moreover, q 1 utilizes the capabilities of SPARQL 1.1 [3], the syntax of which enables us to define and process property paths. The special symbol + is interpreted as asserting the existence of one or more edges labeled by the transitive predicate P. Let us consider now the SPARQL query q 2 shown in Figure 2b. This query, when applied to an appropriate RDF graph, will return all those nodes that can be reached from the node source through one or more edges labeled by P.
The two queries q 1 and q 2 can be regarded as similar in view of the fact that both return nodes that form a path of length at least one, which is labeled by the same predicate (in our case the transitive predicate P). The difference is that in the first case the path terminates at a specific node, namely the node destination, whereas in the second case the path begins at a specific node (the node source).
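The behaviour of q 1 and q 2 can be imitated on a toy dataset. The following Python sketch is an assumption-laden illustration rather than the evaluation strategy of any SPARQL engine: the helper `plus_pairs`, the dataset `D`, and the reading of the two queries as `?x P+ destination` and `source P+ ?x` are introduced here for demonstration only.

```python
def plus_pairs(triples, pred):
    """Return all (a, b) linked by a path of one or more `pred` edges."""
    successors = {}
    for s, p, o in triples:
        if p == pred:
            successors.setdefault(s, set()).add(o)
    pairs = set()
    for start in successors:
        seen, stack = set(), list(successors[start])
        while stack:  # depth-first walk; `seen` makes cycles harmless
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        pairs.update((start, node) for node in seen)
    return pairs


# Made-up dataset: source -P-> a -P-> b -P-> destination, plus one R edge.
D = {
    ("source", "P", "a"),
    ("a", "P", "b"),
    ("b", "P", "destination"),
    ("c", "R", "a"),
}

p_paths = plus_pairs(D, "P")
q1 = {x for (x, y) in p_paths if y == "destination"}  # assumed: ?x P+ destination
q2 = {y for (x, y) in p_paths if x == "source"}       # assumed: source P+ ?x

print(sorted(q1))
print(sorted(q2))
```

Both answer sets are obtained by restricting the same set of P-connected pairs, once on the terminal node and once on the initial node, which is the similarity described above.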
It should therefore come as no surprise that there is another SPARQL query Q, the one depicted in Figure 2c, which is closely related to both queries q 1 and q 2 , or, from another viewpoint, that explicitly relates q 1 and q 2 . It is rather straightforward to see that Q returns all the ordered pairs (x 1 , x 2 ) such that x 1 is connected to x 2 through one or more edges labeled by P. By substituting the constant destination for all occurrences of the projection variable x 2 in Q, the resulting query Q 1 (x 1 ) = Q(x 1 , x 2 ){x 2 |destination} becomes precisely the query q 1 (x). Symmetrically, by substituting the constant source for all occurrences of the projection variable x 1 in Q, the resulting query Q 2 (x 2 ) = Q(x 1 , x 2 ){x 1 |source} becomes precisely the query q 2 (x). Therefore, according to Definition 3, Q(x 1 , x 2 ) is indeed a covering query for q 1 (x) and q 2 (x) because q 1 (x)[D] = Q 1 (x 1 )[D] and q 2 (x)[D] = Q 2 (x 2 )[D]. ▲
The previous Example 1 is quite simple, but the following example will demonstrate that relevant queries can be significantly more complex. From now on, for brevity, we shall adopt the following terminology: a path consisting of edges labeled by the same transitive predicate P will simply be called a P-path. Whenever we want to express the fact that x is the first node and y is the terminal node of such a P-path we shall write x ⇒ P y.
Figure caption (fragment): … (x 1 , x 2 , x 3 ) such that there exists a P-path from x 1 to x 2 and an R-path from x 2 to x 3 . Both paths are of length at least one.

Example 2.
In this example, we begin by considering the SPARQL query q 1 shown in Figure 3a. This query contains not just one but two transitive predicates: P and R and involves two variables x 1 and x 2 . When applied on a suitable RDF graph it will return all those ordered pairs (x 1 , x 2 ) such that x 1 is connected to x 2 via a P-path of length at least one and x 2 is connected to the node destination through an R-path of length at least one.
The SPARQL query q 2 shown in Figure 3b will list all ordered pairs (x 1 , x 3 ) such that x 1 is connected to the node intermediate via a P-path of length at least one and, in turn, intermediate is connected to x 3 through an R-path of length at least one.
A similar examination of the query q 3 of Figure 3c shows that q 3 outputs all ordered pairs (x 2 , x 3 ) such that there exists a P-path of length at least one from the node source to x 2 and there also exists an R-path of length at least one from x 2 to x 3 .
The relevance of queries q 1 , q 2 and q 3 is rather obvious. All three of them return nodes that form precisely two paths: a P-path followed by an R-path. The difference among the three queries is in the specifics. For the q 1 query the R-path must terminate at the node destination, for the q 2 query the P-path must terminate at the node intermediate and the R-path must begin at the node intermediate, and for the q 3 query the P-path must begin at the node source.
The SPARQL query Q(x 1 , x 2 , x 3 ) depicted in Figure 3d is the covering query for q 1 , q 2 and q 3 . Q(x 1 , x 2 , x 3 ) is more complex than q 1 , q 2 and q 3 . While each of q 1 , q 2 and q 3 involves two projection variables, Q(x 1 , x 2 , x 3 ) involves three projection variables. As a result Q returns ordered triples (x 1 , x 2 , x 3 ); in each such triple x 1 is connected to x 2 via a P-path and x 2 is connected to x 3 through an R-path. More formally, by evaluating Q on the dataset D, we get the answer set Q(x 1 , x 2 , x 3 )[D] = {(x 1 , x 2 , x 3 ) : x 1 ⇒ P x 2 and x 2 ⇒ R x 3 }. It is clear that by substituting the constant destination for all occurrences of the projection variable ?x 3 in Q, the resulting query Q 1 (x 1 , x 2 ) = Q(x 1 , x 2 , x 3 ){x 3 |destination} is precisely the query q 1 (x 1 , x 2 ). Reasoning in a similar manner, we see that by substituting the constant intermediate for all occurrences of the projection variable ?x 2 in Q, the resulting query Q 2 (x 1 , x 3 ) = Q(x 1 , x 2 , x 3 ){x 2 |intermediate} is just the query q 2 (x 1 , x 3 ). Finally, by substituting the constant source for all occurrences of the projection variable ?x 1 in Q, the resulting query Q 3 (x 2 , x 3 ) = Q(x 1 , x 2 , x 3 ){x 1 |source} is simply the query q 3 (x 2 , x 3 ).
Obviously, Q(x 1 , x 2 , x 3 ) is a covering query for q 1 , q 2 and q 3 , since q 1 (x 1 , x 2 )[D] = Q 1 (x 1 , x 2 )[D], q 2 (x 1 , x 3 )[D] = Q 2 (x 1 , x 3 )[D], and q 3 (x 2 , x 3 )[D] = Q 3 (x 2 , x 3 )[D]. ▲

FUNDAMENTAL PROPERTIES OF RELEVANT QUERIES
From a theoretical point of view, the relevance relation between queries satisfies certain important properties. This is expressed in the next proposition.
Proposition 1. The relevance relation ~ between queries is an equivalence relation.
Proof.
We must check that the relation ~ satisfies the following three properties that characterize equivalence.
(1) The reflexive property requires us to show that for every query q(x 1 , …, x n ), it holds that q(x 1 , …, x n ) ~ q(x 1 , …, x n ). This is rather trivial because we can take the query q(x 1 , …, x n ) itself as the covering query, with m = 0 and the empty substitution.
(2) The symmetric property requires us to show that q 1 (x 1 , …, x n ) ~ q 2 (x 1 , …, x n ) implies q 2 (x 1 , …, x n ) ~ q 1 (x 1 , …, x n ). This is immediate, because the same covering query serves both directions once the roles of the two substitutions are interchanged.
(3) Finally, suppose that q 1 (x 1 , …, x n ) ~ q 2 (x 1 , …, x n ) and q 2 (x 1 , …, x n ) ~ q 3 (x 1 , …, x n ). To establish the transitive property, we must show that also q 1 (x 1 , …, x n ) ~ q 3 (x 1 , …, x n ). The two hypotheses imply the existence of two covering queries Q 1 (x 1 , …, x n , x n+1 , …, x n+m ) and Q 2 (x 1 , …, x n , x n+1 , …, x n+m´ ), and four substitutions θ 1 , θ 2 , θ 3 , θ 4 , such that q 1 [D] = (Q 1 θ 1 )[D], q 2 [D] = (Q 1 θ 2 )[D], q 2 [D] = (Q 2 θ 3 )[D], and q 3 [D] = (Q 2 θ 4 )[D].
We construct a new query Q that contains as subqueries the queries Q 1 and Q 2 . We may assume that Q 1 and Q 2 have no variable names in common. Even if this is not the case, we may rename the projection variables of Q 2 to ensure that all the variables are distinct. This renaming does not change the semantics of Q 2 (recall Remark 1) and the resulting query is semantically equivalent to Q 2 . The projection variables of Q are comprised of the projection variables of Q 1 , the projection variables of Q 2 (after they have been renamed, if necessary), and a new variable, which we call ?choice. Moreover, we construct a new substitution θ 1´ by augmenting θ 1 with substitutions of the projection variables y 1 , …, y n , y n+1 , …, y n+m´ of Q 2 by constants d 1 , …, d n , d n+1 , …, d n+m´, and the substitution of choice by a string constant, e.g., "first". The resulting substitution θ 1´ is θ 1 ∪ {y 1 |d 1 , …, y n |d n , y n+1 |d n+1 , ..., y n+m´ |d n+m´, choice|"first"}. The subquery Q 1 is also augmented with a FILTER statement testing whether ?choice is equal to the string constant used in θ 1´, e.g., "first". This guarantees that the augmented subquery returns exactly the same answer set as Q 1 θ 1 when θ 1´ is used and nothing whenever a different substitution for ?choice is used. Symmetrically, starting from θ 4 , we construct the new substitution θ 4´ = θ 4 ∪ {x 1 |c 1 , …, x n |c n , x n+1 |c n+1 , ..., x n+m |c n+m , choice|"second"}.
Likewise, Q 2 is also augmented with a FILTER statement involving ?choice that passes the results only when the substitution θ 4´ is used.
Therefore, by the above construction, we conclude that q 1 [D] = W 1 [D] and q 3 [D] = W 2 [D], where W 1 = Q(x 1 , …, x n , x n+1 , …, x n+m , y 1 , …, y n , y n+1 , …, y n+m´, choice) θ 1´ and W 2 (x t1 , …, x tn ) = Q(x 1 , …, x n , x n+1 , …, x n+m , y 1 , …, y n , y n+1 , …, y n+m´, choice) θ 4´. This proves that Q is a covering query for q 1 and q 3 and, thus, q 1 ~ q 3 .

Example 3.
This example will shed some light on the construction we used in Proposition 1 in order to establish the transitive property of the ~ relation.
The queries q 1 and q 2 shown in Figure 4a are relevant and the covering query Q 1 that establishes this fact is also shown in Figure 4a. The two substitutions that, when applied to Q 1 , establish the relation q 1 ~ q 2 are {x 3 |IsSolid} and {x 2 |metallicObject} for q 1 and q 2 , respectively.
The queries q 2 and q 3 , shown in Figure 4b, are also relevant. A covering query for q 2 and q 3 is the query Q 2 also depicted in Figure 4b. The two substitutions that establish that q 2 ~ q 3 are {x 2 |metallicObject} for q 2 and {x 1 |bolt} for q 3 , respectively.
The algorithm described in Proposition 1 results in the construction of the query Q shown in Figure 4c. To avoid any clash of names and any possible ambiguity, the projection variables x 1 , x 2 , x 3 of Q 2 are consistently renamed to y 1 , y 2 , y 3 . This ensures that there are no variable names in common between Q 1 and Q 2 . Moreover, this renaming does not change the semantics of Q 2 (recall Remark 1), meaning that the resulting query is the same as Q 2 .
Figure 4c. The above query Q is a covering query for q 1 and q 3 .

Hence, the variables appearing in the SELECT clause of Q are the variables x 1 , x 2 , x 3 of Q 1 , the variables y 1 , y 2 , y 3 of Q 2 , and a new projection variable ?choice, which will be used to filter the results returned by the two subqueries.
… FILTER ( "first" = "second" ) } } }
Figure 5a. The above SPARQL query W 1 (x 1 , x 2 ) arises from the query Q(x 1 , x 2 , x 3 , y 1 , y 2 , y 3 , choice) of Figure 4c with the substitution {x 3 |IsSolid, y 1 |d 1 , y 2 |d 2 , y 3 |d 3 , choice|"first"}, where d 1 , d 2 , d 3 are arbitrary constants. The FILTER statements in the two subqueries guarantee that W 1 returns all ordered pairs (x 1 , x 2 ) from subquery Q 1 but none from subquery Q 2 .
… FILTER ( "second" = "second" ) } } }
Figure 5b. The above query W 2 (y 2 , y 3 ) arises from the query Q(x 1 , x 2 , x 3 , y 1 , y 2 , y 3 , choice) of Figure 4c with the substitution {x 1 |c 1 , x 2 |c 2 , x 3 |c 3 , y 1 |bolt, choice|"second"}, where c 1 , c 2 , c 3 are arbitrary constants. The FILTER statements in the two subqueries guarantee that W 2 returns all ordered pairs (y 2 , y 3 ) from subquery Q 2 but none from Q 1 .

Applying the substitution {x 3 |IsSolid, y 1 |d 1 , y 2 |d 2 , y 3 |d 3 , choice|"first"}, where d 1 , d 2 , d 3 are arbitrary constants, to the query Q(x 1 , x 2 , x 3 , y 1 , y 2 , y 3 , choice), results in the query W 1 (x 1 , x 2 ) depicted in Figure 5a. In view of the fact that the second FILTER statement will exclude everything, while the first FILTER statement will allow everything, we conclude that W 1 (x 1 , x 2 ) is equivalent to Q 1 {x 3 |IsSolid}. Therefore, W 1 (x 1 , x 2 )[D] = q 1 (x 1 , x 2 )[D]. Similarly, the query W 2 (y 2 , y 3 ) of Figure 5b results from the application of the substitution {x 1 |c 1 , x 2 |c 2 , x 3 |c 3 , y 1 |bolt, choice|"second"}, where c 1 , c 2 , c 3 are arbitrary constants, to Q(x 1 , x 2 , x 3 , y 1 , y 2 , y 3 , choice). In this case, the first FILTER statement will exclude everything, while the second FILTER statement will allow everything. This implies that W 2 (y 2 , y 3 ) is equivalent to Q 2 {y 1 |bolt}, and, therefore, W 2 (y 2 , y 3 )[D] = q 3 (y 2 , y 3 )[D]. This concludes the proof that Q is a covering query for q 1 and q 3 and, thus, q 1 ~ q 3 . ▲
The construction of the query Q that establishes the transitivity of the relevance relation ~ was somewhat artificial and mechanical. It serves only to complete the proof. Clearly, there is a high degree of redundancy in Q, which is not at all optimized. In most practical cases, things will be much easier. For instance, in Example 3, query Q 1 alone suffices to establish that q 1 ~ q 3 . This is achieved with the substitutions {x 3 |IsSolid} and {x 1 |bolt} for q 1 and q 3 , respectively.
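The role played by ?choice in the proof of Proposition 1 can be imitated with plain sets. The Python sketch below rests on two assumptions that are not confirmed by the surviving figure fragments: that the two guarded subqueries are combined by a union, and that each branch already has its own substitution applied. The function name `covering_answers` and the sample tuples are made up for illustration.

```python
def covering_answers(branch_q1, branch_q2, choice):
    """Answers of the combined query for a given constant bound to ?choice.

    Assumption: the two guarded subqueries are combined by a union.
    """
    # FILTER(?choice = "first") guards the branch built from Q1.
    first = branch_q1 if choice == "first" else set()
    # FILTER(?choice = "second") guards the branch built from Q2.
    second = branch_q2 if choice == "second" else set()
    return first | second


# Made-up answer sets standing in for Q1{x3|IsSolid}[D] and Q2{y1|bolt}[D].
answers_q1 = {("nail", "metallicObject"), ("screw", "metallicObject")}
answers_q3 = {("metallicObject", "IsSolid")}

W1 = covering_answers(answers_q1, answers_q3, "first")   # choice|"first"
W2 = covering_answers(answers_q1, answers_q3, "second")  # choice|"second"

print(W1 == answers_q1, W2 == answers_q3)
```

With any other constant bound to ?choice both guards fail and the combined query returns nothing, which is why the construction is sound but redundant.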
Proposition 1 is important because it means that the set of SPARQL queries is partitioned into equivalence classes, and each SPARQL query q belongs to one such class.
Definition 4. Let q be a SPARQL query. The equivalence class to which q belongs is denoted by [q]. Alternatively, we say that q is a representative of the class [q].
Having established this theoretical classification of SPARQL queries into equivalence classes, let us turn our attention to possible ways to take advantage of this situation for practical purposes.
Consider a scenario where we have the two relevant queries q 1 and q 2 . We may further assume that we know a third query Q that is a covering query for q 1 and q 2 via the substitutions θ 1 and θ 2 , respectively. Whenever we are confronted with the evaluation of a composite query involving q 1 and q 2 , we may use our knowledge of the covering query Q to our advantage in order to speed up the computation. Specifically, instead of having to compute two queries, we can arrive at the same answer set by computing just one. This can be achieved by applying the substitution θ = θ 1 ∪ θ 2 to Q and then evaluating the resulting query Q´ = Qθ. Taking into account the properties of the covering queries, we see that the soundness of this method is immediate. Furthermore, and more importantly, this approach can take considerably less time, because a single query with fewer projection variables is evaluated instead of two.

Example 4.
In this example, we assume that we want to compute the SPARQL equivalent of the join of the query q 1 with the query q 3 , shown in Figures 3a and 3c, respectively. We also know that Q, depicted in Figure 3d, is a covering query for q 1 and q 3 .
We recall that q 1 returns the ordered pairs (x 1 , x 2 ) such that x 1 is connected to x 2 via a P-path and x 2 is connected to the node destination through an R-path, while q 3 lists the ordered pairs (x 2 , x 3 ) such that there exists a P-path from source to x 2 and an R-path from x 2 to x 3 . All paths have length at least one.
Formally, the answer sets of q 1 and q 3 on a dataset D are q 1 (x 1 , x 2 )[D] = {(x 1 , x 2 ): x 1 ⇒ P x 2 and x 2 ⇒ R destination} and q 3 (x 2 , x 3 )[D] = {(x 2 , x 3 ): source ⇒ P x 2 and x 2 ⇒ R x 3 }, respectively. Therefore the answer set of their join is {x 2 : source ⇒ P x 2 and x 2 ⇒ R destination}, that is the nodes x 2 for which there exists a P-path from source to x 2 and an R-path from x 2 to destination.

SELECT ?x2 WHERE {
  source P+ ?x2 .
  ?x2 R+ destination .
}
Figure 6. The above SPARQL query lists all the nodes x 2 for which there exists a P-path from source to x 2 and an R-path from x 2 to destination. Again, both paths are of length at least one.
The query Q lists the ordered triples (x 1 , x 2 , x 3 ) such that there exists a P-path from x 1 to x 2 and an R-path from x 2 to x 3 . More formally, applying Q to the dataset D produces the answer set Q(x 1 , x 2 , x 3 )[D] = {(x 1 , x 2 , x 3 ) : x 1 ⇒ P x 2 and x 2 ⇒ R x 3 }. By simultaneously substituting the constants source and destination for all occurrences of the projection variables x 1 and x 3 in Q, we get the resulting query Q´(x 2 ) = Q(x 1 , x 2 , x 3 ){x 1 |source, x 3 |destination} shown in Figure 6. It is easy to see that the answer set of Q´ is precisely {x 2 : source ⇒ P x 2 and x 2 ⇒ R destination}.
What this means in terms of efficiency is that instead of evaluating two queries, each involving two projection variables, and then computing their join, we can, equivalently, evaluate a single query involving one projection variable. This approach has the potential to reduce the computational cost significantly. ▲
It is important to point out that this technique is valid not only for two relevant queries; it can be readily generalized to an arbitrary (finite) number of relevant queries, due to the transitive nature of the ~ relation.
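The claim of Example 4 can be checked on a small dataset. The Python sketch below is an illustration under stated assumptions: the helper `plus_pairs`, the dataset `D`, and the node names other than source and destination are made up, and the set comprehensions stand in for query evaluation rather than reproducing how a SPARQL engine would work.

```python
def plus_pairs(triples, pred):
    """Return all (a, b) linked by a path of one or more `pred` edges."""
    successors = {}
    for s, p, o in triples:
        if p == pred:
            successors.setdefault(s, set()).add(o)
    pairs = set()
    for start in successors:
        seen, stack = set(), list(successors[start])
        while stack:  # depth-first walk; `seen` makes cycles harmless
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        pairs.update((start, node) for node in seen)
    return pairs


# Made-up dataset for illustration.
D = {
    ("source", "P", "m"), ("m", "P", "n"), ("z", "P", "n"), ("other", "P", "o"),
    ("n", "R", "destination"), ("n", "R", "w"),
    ("m", "R", "k"), ("k", "R", "destination"), ("o", "R", "w2"),
}
P, R = plus_pairs(D, "P"), plus_pairs(D, "R")

# q1(x1, x2): x1 =>P x2 and x2 =>R destination
q1 = {(x1, x2) for (x1, x2) in P if (x2, "destination") in R}
# q3(x2, x3): source =>P x2 and x2 =>R x3
q3 = {(x2, x3) for (x2, x3) in R if ("source", x2) in P}

# Two evaluations followed by a join on x2 ...
join = {x2 for (_, x2) in q1} & {x2 for (x2, _) in q3}
# ... versus the single query Q{x1|source, x3|destination} of Figure 6.
q_prime = {x2 for (x1, x2) in P
           if x1 == "source" and (x2, "destination") in R}

print(sorted(join), sorted(q_prime))
```

On this dataset both routes yield the same set of nodes, while the second route never materializes the two intermediate answer sets of ordered pairs.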

CONCLUSION
In this paper we have analyzed SPARQL queries using concepts and ideas inspired by the field of abstract mathematics. This novel approach, besides its theoretical merits, has the potential to provide important practical benefits regarding the computational aspects of SPARQL query evaluation. Quite often in practice we may encounter composite queries that are comprised of simpler queries that happen to be relevant. This situation was demonstrated in the toy-scale Example 4, where the evaluation of the conjunction of two SPARQL queries was considered. The current approach requires the evaluation of both queries in order to achieve the evaluation of their conjunction. Knowledge of the fact that the queries in question happen to be relevant, along with a covering query establishing their relation, opens up another possibility. By using only one query, specifically one arising from the covering query via an appropriate substitution, the evaluation of the conjunction can be completed in a more efficient manner, requiring less computational time.