Recursive Queries in SBQL
by Tomasz Pieciukiewicz, Krzysztof Stencel, Kazimierz Subieta
Back to Description
of SBA and SBQL.
There are many important tasks
that require recursive processing. The most widely known is Bill-Of-Material
(BOM), a part of Materials Requirements Planning (MRP)
systems. BOM acts on a recursive data structure representing a hierarchy of
parts and subparts of some complex material products, e.g. cars or airplanes.
Typical BOM software processes such data structures by proprietary routines and
applications implemented in a classical programming language. Frequently,
however, there is a need to use ad hoc
queries or programs addressing such structures. In such cases the user requires
special user-friendly facilities dealing with recursion in a query language.
Similar problems concern processing and optimization tasks on genealogic trees,
stock market dependencies, various types of networks (transportation,
telecommunication, electricity, gas, water, and so on), etc. Recursion is also
necessary for internal purposes of computer systems, such as processing
recursive metadata structures (e.g. Interface Repository of the CORBA
standard), software configuration management repositories, hierarchical
structures of XML or RDF files, and others.
In many cases recursion can be
substituted by iteration, but this implies much lower programming level and
less elegant problem specification. Iteration may also cause higher cost of
program maintenance since it implies a clumsy code that is more difficult to
debug and change.
Despite importance, recursion is not supported
in SQL standards SQL-89 and SQL-92. Beyond the standards, it is implemented
(differently) in relational database management systems, in particular, in
Oracle and DB2, in the form of transitive closure and linear recursion. The
newer standards SQL-3, SQL-99, SQL-2003, SQL-2008 introduce both transitive
closure and deductive rules a la
Datalog. Unfortunately these standards are very huge and eclectic thus there
are doubts among database professionals if they will ever be fully implemented,
especially concerning such advanced properties.
The ODMG standard for OO databases does not
mention corresponding facilities. This will obviously lead to extensive
programming in programming languages such as C++, Java and Smalltalk. Recursion
is considered a desirable feature of XML and RDF-oriented query languages, but
current proposals and implementations do not introduce corresponding features
to the corresponding query languages or introduce them with limitations.
The possibility of recursive
processing has been highlighted in the deductive databases paradigm, notably
Datalog. The paradigm has roots in logic programming and has several variants.
Some time ago it was advocated as a true successor of relational databases, as
an opposition to the emerging wave of object-oriented databases. Despite high
hype and pressure of academic communities it seems that Datalog falls short of
the software engineering perspective. It has several recognized disadvantages,
in particular:
·
Lack of efficient methodology supporting the
developers of applications in transition from business conceptual models to
Datalog programs. For real applications an average developer or programmer has
no idea how to formulate Datalog rules in response to typical results of
analysis and design processes (stated e.g. in UML).
·
Although Datalog is claimed to be a universal query
language, its real application area is very limited to niche applications
requiring some “intelligence” expressed through sylogisms and
recursive rules.
·
Limitations of data structures that Datalog deals
with. Current object-oriented analysis and design methodologies as well as
programming languages and environments deal with much more sophisticated data
structures (e.g. complex objects with associations, classes, inheritance, etc.)
than relational structures that Datalog deals with. Complex data structures
allow one to get the software complexity under control.
·
„Flatness” of Datalog programs, i.e. lack
of abstraction and encapsulation mechanisms, such as procedures, views or
classes. This flaw means no support for hierarchical decomposition of a big
problem to sub-problems and no support for top-down program design and
refinement and encapsulation of problem details.
·
Datalog is stateless thus it gives no direct
possibility to express data manipulation operations. Majority of applications
require update operations, which are possible to express in Datalog only via
side effects, with no clear formal semantics.
·
Datalog, in principle, does not separate data and
programs. Both are “rules” in the same abstraction frame. However,
the fundamental differences between these two aspects concern the following
factors: design of programs (databases are designed before programs that
operate on them), client-server architecture (databases are on servers,
programs are usually on clients); structural constructs (data structures are
specifically constructed and constrained, especially in object-oriented
databases; this does not concern programs); updating (data can be updated,
while it makes little sense for programs). These fundamental differences cause
that the unification of data and programs must meet severe limitations.
·
Datalog implies significant problems with performance.
Current optimization methods, such as magic sets, do not seem to be
sufficiently mature and efficient.
Thus Datalog applications
supporting all the scale that databases require are till now unknown.
Nevertheless, the idea of Datalog semantics based on fixed-point equations
seems to be very attractive to formulate complex recursive tasks. Note however
that fixed-point equations can be introduced not only to languages based on
logic programming, but to any query language, including SQL, OQL and XQuery.
Below we show how to incorporate fixed point equations in SBQL. In contrast to
Datalog, we assume that such a feature can be used by the SBQL programmer only
when needed rather than as the major programming paradigm.
Besides transitive closures
and fixed-point equations there are classical methods for recursive processing
known from programming languages, namely recursive functions (procedures,
methods). In the database domain a similar concept is known as recursive views.
Integration of recursive functions or recursive views with a query language requires
generalizations beyond the solutions known from typical programming languages
or databases. First, functions have to be prepared to return bulk types that a
corresponding query language deals with, i.e. a function output should be
compatible with the output of queries. Second, both functions and views should
possess parameters, which could be bulk types compatible with query output too.
Currently very few existing query languages have such possibilities, thus using
recursive functions or views in a query language is practically unexplored.
This chapter discusses three different approaches to
recursive queries:
The first approach is implemented in a few industrial
commercial systems with severe limitations [PiSu04]. Two next approaches are
implemented so far in research prototype systems only, with limitations too.
Till now no system implements all three approaches in a sufficiently robust
version, thus comparison of them from the software designer and programmer
point of view is unfeasible. Moreover, all widely known implementations
concerns relational database systems which are successful, but with
well-recognized disadvantages such as impedance mismatch and poor conceptual
modeling. There are few theoretical proposals concerning recursive queries for
object-oriented and XML-oriented systems, but implementations are not widely
known or are very limited. Only SBQL implements all three approaches in a
sufficiently robust and simple versions.
Recursive queries can imply performance problems.
There are several typical optimization methods related to recursive queries,
such as factoring out independent subqueries, indices, or caching
(materializing) results of queries to utilize them during evaluation of next
(identical) queries. For fixpoint equations there are special methods know as
magic sets. In the chapter we briefly discuss the above issues.
A transitive
closure in SBQL is a non-algebraic operator having the following syntax:
query ::= query close by query
query ::= query leaves by query
The
query q_{1} close by q_{2} is
evaluated as follows. Let final_result be the final result of the query
and union be the bag union. The
pseudo-code accomplishing abstract implementation of q_{1} close by q_{2} is the following:
final_result := result_of (q_{1});
for each r in final_result do {
push nested(r)
at top of ENVS;
final_result
:=
final_result union result_of (q_{2});
pop ENVS;
}
Note that each
element r added to final_result by q_{2} is
subsequently processed by the for each
command. The above operational semantic can be described in the denotational
setting as the least fixed point equation (started from final_result = empty_bag and continued till fixpoint):
final_result = q_{1}
union final_result.q_{2}
where dot is
identical with the dot operator in SBQL. Similarly, the semantics can be
expressed by iteration (continued till result_of
(q_{2}) = empty_bag):
final_result = q_{1} union q_{1}.q_{2 }union q_{1}.q_{2}.q_{2 }union q_{1}.q_{2}.q_{2}.q_{2 }union ....
Naive
implementation of the close by
operator is as easy as the implementation of the dot operator. Note that if q_{2} returns a previously processed element, an infinite loop
will occur. Checking for such situations in queries is sometimes troublesome
and may introduce unnecessary complexity into the queries. Another operator close unique by has been
introduced to avoid infinite loops due to duplicates returned by q_{2}.
The query q_{1}
leaves by q_{2} is
evaluated as q_{1} close by q_{2}, but
in the result only leaves of the transitive closure graph are left. The
semantics can be expressed as follows:
final_result := result_of (q_{1});
for each r in final_result do {
push nested(r)
at top of ENVS;
temporary_result := result_of (q_{2});
final_result
:=
final_result union temporary_result;
if temporary_result <> bag() then final_result := final_result minus { r };
pop ENVS;
}
bag() stands for empty bag. As q_{1} and q_{2} can be any queries, simple or
complex, the relation between elements which is used for transitive closure is
calculated on the fly during the query evaluation; thus the relation need not
be explicitly stored in the database.
Fig.
Fig.55 depicts a simple database schema used in
our examples. It is a description of parts, similar to descriptions used in
Bill of Material (BOM) applications. Each Part has a name. The part may be either a Detail
or Aggregate. Each detail has attributes cost and mass (its cost
and mass). Each Aggregate has attributes assemblyCost and assemblyMass
which represent the cost of assembling this aggregate and mass added to the mass
of its components as the result of the assembly process. Aggregates have one or
more Component subobjects. Each Component has the attribute amount (number of components of specific
type in a part), and a pointer object leadsTo
leading to a Part object.
The following SBQL query with a transitive
closure over this schema finds all components of a part named
“engine”.
E.TC.1 |
All components of a part named
“engine”. |
SBQL: |
(Part
where name = ”engine”)
close by Component.leadsTo.Part |
This query
first selects parts having the attribute name equal to
“engine”. The transitive closure relation is described by the
subquery Component.leadsTo.Part. It returns all Part
objects which can be reached by the pointer leadsTo from already
selected objects.
This query takes advantage of an assumption
made in the AS1 model. After reaching an object by its name all its attributes
are accessible, not only those of the class whose name has been used to
retrieve the object. The assumption allows us to simplify queries by removing
the necessity to use cast operators.
If an object does not have the queried
attribute (because it belongs to a subclass which does not have this attribute
or the attribute is optional), the empty collection is returned as the result
of binding of the name of this attribute. In this query, the subquery Component.leadsTo.Part is evaluated for Part objects which can be
either Detail or Aggregate. In case of Aggregate, the name
Component is properly bound and returns the collection of the Component
subobjects. In case of Detail, the name Component cannot be bound
and returns the empty bag.
Assuming there is no access from a Part object
to Aggregate and Detail attributes the query can be formulated with a cast:
E.TC.2 |
All components of a part named
“engine”. |
SBQL: |
((Part
where name = ”engine”) as
e) close by ((Aggregate)e). Component.leadsTo.Part |
One of the basic BOM problems, i.e. finding all
components of a specific part, together with their amount, may be formulated
using the transitive closure as follows:
E.TC.3 |
Find all components of a specific part, along
with their amount required to make this part. |
SBQL: |
((Part where name=”engine”),
1 as qty) close by Component.((leadsTo.Part), (qty*amount) as qty) |
The query
uses a named value in order to calculate the number of components. The number
of parts the user wants to assemble (in this case 1) is named qty and paired with the found part. In
subsequent iterations the qty value
from parent object is used to calculate the required amount of child elements.
It is also named qty and paired with
the child object.
The above
query does not sum up amounts of identical sub-parts from different branches of
the BOM tree. Below we present a modified query which returns aggregated data
– sums of distinct components from all branches of the BOM tree:
E.TC.4 |
Find all components of a specific part, along
with their amount required to make this part, summing quantities of identical
parts. |
SBQL: |
((((Part where name=”engine”)
as x, 1 as qty) close by x.Component.((leadsTo.Part) as x, (qty*amount) as qty)) group as allEngineParts ).
((distinct(allEngineParts.x) as y). (y,
sum((allEngineParts where x=y).qty))) |
This query
uses grouping in order to divide the problem into two parts. First, all the
components named x, along with their
amounts named qty are found. The pairs
are then grouped and named allEngineParts.
The grouped pairs are further processed by finding all distinct elements and
summing the amounts for each distinct element.
This query could be further refined in order to
remove all aggregate parts (so only
the detail parts will be
returned). There are many ways to accomplish this goal. One of them is to use
the operator leaves by
in place of close by.
The operator leaves by
returns only leaf objects, i.e. objects which do not result in adding any
further objects to the result bag:
E.TC.5 |
Find all details of a specific part, along
with their amount required to make this part, summing quantities of identical
details. |
SBQL: |
((((Part where name=”engine”)
as x, 1 as qty) leaves
by x.Component.((leadsTo.Part) as x, (qty*amount) as qty))
group as allEngineDet ).
((distinct(allEngineDet.x) as y). (y,
sum((allEngineDet where x=y).qty))) |
The other way
to sort the aggregates out of the result of the previous query is to use a cast.
The cast operator takes a collection of objects and returns only these items of
it which belong to the given class. The cast applied to a single object returns
this object if it belongs to the class; otherwise it returns the empty bag.
Here is a query with the same retrieval goal but written using a cast.
E.TC.6 |
Find all details of a specific part, along
with their amount required to make this part, summing quantities of identical
details. |
SBQL: |
((((Part where name=”engine”)
as x, 1 as qty) close
by x.Component.((leadsTo.Part) as x, (qty*amount) as qty))
group as allEnginePar).
((distinct((Detail)
(allEnginePar.x)) as y). (y,
sum((allEnginePar where x=y).qty))) |
Here the
result of the subquery whose result is further named allEngineDet is first cast to the class Detail. This cast
drops all objects which do not belong to this class.
Although the
complexity of the SBQL solution is still high, SBQL supports facilities to
manage the complexity. In this case the grouping operator allows decomposing
the problem into easier subproblems.
SBQL queries may be used to perform even more
complex tasks. The query below calculates the cost and mass of the part named
“engine”, taking into account cost and mass of each engine part,
amount of engine parts and cost and mass increment connected with assembly.
This task has been used in [AtBu87] as an
example of lack of power and flexibility of currently used query languages. In
SBQL the task can be formulated with no essential problems:
E.TC.7 |
Calculate the cost and mass of the part named
“engine”, taking into account cost and mass of each engine part,
amount of engine parts and cost and mass increment connected with the
assembly. |
SBQL: |
((((Part where name=”engine”)
as x, 1 as qty) close by x.Component.((leadsTo.Part) as x, (amount*qty) as qty)) group as allEngineParts ). (allEngineParts.((Detail) x as d). ((qty * d.cost) as c, (qty * d.mass) as m) group as CostMassIncreaseD,
allEngineParts.((Aggregate)
x as a). ((qty*a.assemblyCost) as
c, (qty*a.assemblyMass) as
m) group as CostMassIncreaseA
). ( ((sum(CostMassIncreaseD.c)
+ sum(CostMassIncreaseA.c)) as
engineCost, (sum(CostMassIncreaseD.m) + sum(CostMassIncreaseA.m))
as engineMass) |
Such a typical
BOM task cannot be formulated so comprehensively as a single query in other
query languages. For example, this query formulated in SQL-99 exceeds 100 lines
of code and is totally illegible. Furthermore, some of its subqueries have to
be repeated several times in its text.
SBQL can also perform quite complex
calculations (not even considered in other query languages). Let the object
named a store some non-negative
number. The query below calculates approximation of the square root of a, using the fixed point equation x = (a/x + x)/2, starting from x=1 and making 50
iterations:
E.TC.8 |
Calculate the approximation of the square
root of a. |
SBQL: |
((1 as x, 1 as counter) leaves by (((a/x + x)/2 as x, counter+1
as counter) where counter£50)). (x where
counter = 50) |
In this
example we use a kind of internal variable counter as a counter of iterations.
Another example of a recursive task:
E.TC.9 |
Calculate
50 Fibonacci numbers according to the recursive definition: F(0) = 0; F(1) = 1; F(n) = F(n-2) + F(n-1).
The output should contain the argument of F and the result of F. |
SBQL: |
((0 as arg, 0 as fib, 1 as nextfib) close by ((arg+1) as arg, nextfib as fib, (fib + nextfib) as nextfib) where
arg < 50)). (arg as arg, fib as F) |
Although the
calculations remind programming, we underline that we are still using query
operators only. The examples show the power
of recursive queries in SBQL so far not even considered in other query
languages.
Cycles in the
queried graph can be easily dealt with by means of another variant of the close by operator – close unique by. This variant
removes duplicates on the fly after each closure iteration; thus cycles do not
imply infinite loops. Another variant of the close by operator is the leaves unique by operator. It is a combination of the two
previous variants. It returns only leaf objects, while preventing problems with
cycles in graphs. Note that cycles in the queried graph do not have to be an
effect of database inconsistency. It is easy to imagine a database which
contains a graph with cycles that is consistent with the real world situation
(see Fig. 56).
Fig.
A Company has name and may own
shares in other Companies (in this example we abstract from the amount
of shares). Other companies in turn may own shares in further companies; and so
on. It is possible for a company (lets call it “ACME”) to own
shares in another company which in turn owns shares in “ACME”
(directly or by owning shares in other companies). To find the names of all
companies owned (directly or indirectly) by “ACME” we cannot use a
simple query using the close by
operator as it would not function correctly upon encountering the cycle. We can
use the close unique by
operator instead, as shown in the example below:
E.TC.10 |
Find the names of all companies owned
(directly or indirectly) by “ACME”. |
SBQL: |
(Company where name =
”ACME”) close unique by hasSharesIn.Company |
A simple case of
using the operator leaves unique by is shown below.
E.TC.11 |
Calculate the approximation of the square
root of a up to 5-th position after
the dot. |
SBQL: |
1 as x leaves unique by ((integer)(100000 * (a/x
+ x)/2)/100000) as x |
In the above case we have no counter variable
that terminates the closure. The termination is achieved after the expression
following leaves unique by returns
the same value as previously. However, the case is less safe, as it is possible
in general that each next value of the closure will always be different than
the previous one (the final results may form a cycle).
As shown
above, some advanced tasks may lead to very complex queries which semantics
could be difficult to grasp for the programmers. However, the complexity is
caused by the tasks rather than by the language. According to our tests, such
tasks written e.g. in C++ may take several source code pages. SBQL offers also
other recursive querying options which may support easier problem
decomposition, in particular fixed point equation and recursive procedures.
Sometimes,
apparently quite easy tasks cannot be formulated without a transitive closure.
In example E.TC.12 we show the case when calculating a proper bag of numbers
require such an operator.
E.TC.12 |
Assume Emp
objects with an attribute salary.
For each interval [i, i+100), i = 0, 100, 200, ... get the number of employees having the
salary within this interval. The maximal salary is unknown for the person
asking the query. |
SBQL: |
(0 as
i close by ((i+100) where i <= max(Emp.salary)) as i)
join (count(Emp where salary >= i and
salary < i+100) as c) |
The result
is a bag of structures { i(lower_bound), c(nbr_of_emps) }. The
transitive closure is used to calculate the bag {i(0), i(100), i(200), … i(maxsal)}, where maxsal is the maximal salary rounded to
full hundreds. Note that the number of elements in this bag is initially
unknown, hence the use of the transitive closure. Even such a simple query is
impossible to express in SQL.
In mathematics a fixed point equation has the
form x = f(x), where x is a variable or (a vector of
variables) of type T, f is a function T → T. For
instance, such a function can look like x
= (2/x + x)/2. The solution of this equation is known as a least fixed
point. For our example function the solution is the square root of 2.
Well-known context-free grammars can be considered least fixed point equations.
Recursive procedures can also be described as least fixed point equations. The
essence of such equations is that they precisely model recursive processes and
structures. Starting from some initial value x_{0} we calculate x_{1}
= f(x_{0}), x_{2}
= f(x_{1}), x_{3}
= f(x_{2}), … etc. till reaching the fixed point, i.e.
when x_{n} = x_{n-1}. Sometimes reaching the
fixed point requires infinitely many steps, but mathematical apparatus is well
prepared for such situations. It may also happen that the fixed point cannot be
achieved in this way (hence the above procedure will result in an infinite
loop). This means that function f
must be specifically constrained; these constraints also form a mathematical
theory (unfortunately, inapplicable for our purposes).
The theory of fixed point equations has great
meaning in mathematics (in particular, it is the foundation of a semantic
specification method known as denotational
semantics). In case of SBQL we have introduced fixed point equations as a
facility for programmers to formulate recursive queries. For some tasks such
equations could be more legible than transitive closures or recursive
procedures and could better support decomposition of the task into subtasks
with clear conceptual meaning.
Essentially, the evaluation mechanism of our
fixed point equations is the same as the described above mathematical
procedure. However, we must define some important syntactic and semantic
details, such as “variables”, because SBQL binding rules for names
are more semantically specific than the simple semantics of mathematical variables. The syntax of an SBQL
fixpoint equations is as follows:
query
::= fixpoint [‘(‘set_of_variables’)’]
‘{‘set_of_rules’}’
set_of_variables ::= variable {‘,’ variable}
set_of_rules ::= rule{rule}
rule ::= variable ‘:-‘ query ’;’
Let fixpoint(x_{i1}, x_{i2},…, x_{in})
{x_{1}_{ }:- q_{1}; x_{2}_{ }:-
q_{2};… x_{m}_{
}:- q_{m};} be a
fixpoint system, where:
· x_{1}, x_{2},…,
x_{m} are names of variables in this equation system,
· x_{i1},
x_{i2},…, x_{in} are returned
variables, {x_{i1}, x_{i2},…, x_{in}}
is a subset of {x_{1}, x_{2},…,
x_{m}},
· q_{1}, q_{2},…,
q_{m} are SBQL queries with free variables x_{1}, x_{2},…, x_{m};
The basic semantics of this construct is
the following:
1.
Variables x_{1}, x_{2},…, x_{m} are
initialized to empty bags. More
precisely, a new section with binders x_{1}(bag()), x_{2}(bag()),…, x_{m}(bag()) is pushed at top of ENVS; (bag() stands for empty bag).
2.
Queries q_{1}, q_{2},…, q_{m} are
evaluated. Their results are on the top of QRES.
3.
Pop ENVS.
4.
If the results of q_{1}, q_{2},…,
q_{m} are equal to the values of x_{1}, x_{2},…,
x_{m}, correspondingly, then the fixpoint is reached, hence go to
step 5. Otherwise assign the results of q_{1}, q_{2},…, q_{m} to the
values of x_{1}, x_{2},…, x_{m}. More precisely, a new section with
binders x_{1}(result of q_{1}), x_{2}(result of q_{2}),…, x_{m}(result of q_{m}) is pushed at top of ENVS. Then remove results
of q_{1}, q_{2},…, q_{m} from QRES
and go to step 2.
5.
Results of queries that are not among q_{i1},
q_{i2},…, q_{in
}(corresponding to variables x_{i1}, x_{i2},…, x_{in})
are dropped. Remaining results are changed into binders x_{i1}(result of q_{i1}), x_{i2}(result of q_{i2}),…,
x_{in}(result of q_{in})
and left at the top of the QRES stack. Hence only variables x_{i1}, x_{i2},…, x_{in}
can be used in an outside query that consumes the results of the fixpoint operator. If set_of_variables is absent (in our
syntax it is optional), then all the variables x_{1}, x_{2},…, x_{m} form the final
result.
The symbol :- we use instead of = to
distinguish the fact that we are dealing with equations having the above
semantics rather than with regular equality predicates. As queries q_{1}, q_{2},…,
q_{m} can reference variables x_{1}, x_{2},…, x_{m}, fixpoint equations
provides recursive capabilities. A fixpoint query should be used within some
non-algebraic operator (or within a for
each statement) which pushes the binders from top of QRES onto the top of
ENVS.
Note that a transitive closure query q_{1} close by q_{2} can be written as a fixed
point equation x = q_{1} union x.q_{2}, hence the general fixed point equation x = f(x) offers the power of transitive
closures. However, it is not sure if such general fixed point equations would
be sufficiently usable and understandable by the programmers. Moreover, such
complex equations could be difficult to optimize and could result in
unpredictable infinite loops. It also easy to see that the power of the
transitive closure operator is sufficient to express any system of fixpoint
equations. The fixpoint system
fixpoint(x_{i1}, x_{i2},…, x_{in})
{x_{1}_{ }:- q_{1}; x_{2}_{ }:-
q_{2};… x_{m}_{
}:- q_{m};}
is semantically equivalent to the following
query with the leaves unique by
operator:
(struct( bag() group as x_{1}, bag() group as x_{2}, … , bag() group as x_{m} )
leaves unique by
struct( q_{1} group as x_{1}, q_{2} group as x_{2},
… , q_{m} group as x_{m} ) ).
struct( x_{i1} group as x_{i1}, x_{i2} group as x_{i2}, … , x_{in} group as x_{in})
The equivalence stems from the fact that in
both cases the evaluation procedure is the same. However, fixpoint equations
offer much more compact and legible formulation of some recursive problems.
Below we
present examples of SBQL fixed point equations; they refer to the schema from Fig.55.
E.FIX.1 |
All components of a part named
“engine”. |
SBQL: |
fixpoint(engparts){
myPart
:- Part where name=”engine”; engparts:- myPart union (engparts.Component.leadsTo.Part);} |
The first rule returns a reference to a part
named engine. The second rule
involves recursion on the engparts variable.
The operator union in this query is
the union of bags. Indeed, it may happen that the fixpoint equation returns a
bag rather than a set. This makes our construct different from Datalog, which
assumes sets (relations) as results of calculations. To illustrate this
situation consider the parts graph from Fig.60.
Fig.60.
Example of a graph of parts
In Fig.60 arrows represent instances of the
relationship Component. We can see that the part “bolt10x30” is a
component of (at least) two sub-parts of the “engine” part. Hence,
to be consistent, the reference to the bolt10x30 object should occur in the engparts variable (at least) two times.
The necessity of such semantics will be more evident in next examples.
Fixpoint equations are regular SBQL queries, fully
consistent with the entire SBQL semantics. As such may be used as parts of
other SBQL queries, for instance:
E.FIX.2 |
All unique components of a part named
“engine”. |
SBQL: |
distinct(
fixpoint(engparts){ myPart
:- Part where name=”engine”; engparts:- myPart union (engparts.Component.leadsTo.Part);}) |
E.FIX.3 |
From all components of a part named
“engine” take parts named “carburettor” (two queries below are equivalent). |
SBQL: |
(fixpoint( engparts ){ mypart
:- Part where name=”engine”; engparts
:- mypart union (engparts.Component.leadsTo.Part);}. engparts where name = “carburettor”) as engcarb |
SBQL: |
fixpoint( engcarb
){ mypart
:- Part where name=”engine”; engparts
:- mypart union (engparts.Component.leadsTo.Part); engcarb :-
engparts where name = “carburettor”;} |
In the first query we use the fixpoint query
within an outer query. In the second query the same effect is achieved by three
rules.
A fixpoint construct may use some variables as a way
to break down the problem into smaller, more manageable parts, for instance
(c.f. Fig.56):
E.FIX.4 |
For each component of the engine get the
number of identical elements that the engine consists of. |
SBQL: |
fixpoint (final) { engine :- (Part where name=”engine”) as x, 1 as howMany; engineParts :- engine union (engineParts.Component.((leadsTo.Part)
as x, (amount*howMany)
as howMany); final :- ((distinct(engineParts.x)
as y). (y,sum(engineParts where x=y).howMany)); } |
Only variable final
is returned as the fixpoint. The other two variables are used to perform
calculations, but never returned, as their final values are inessential to the
user. Variable engine is used to find
the top element of the hierarchy (the “engine” part), while engineParts is the variable in which the
results of recursive calculations are stored. Variables final and engine are not
defined recursively – their purpose is to make the query easier to read
and create. Note that in this case the fact that variables store bags (and
operator union is the union of bags)
is essential. If duplicates are removed, the result would be warped. Assume
that in Fig.60 bolt10x30 occurs once
in carburetor and once in
starter. Hence the structure struct{x(i_{bolt10x30}), howMany(1)} occurs two times within the result of
the variable engineParts; i_{bolt10x30} is the identifier of the bolt10x30 object (finally, 2 such bolts are components of the engine, while
removing duplicates will cause that the number of such bolts will be 1).
The same principle is used in the next example.
E.FIX.5 |
Get the total cost and mass of the engine. |
SBQL: |
fixpoint (cost, mass){
engine :- (Part where name=”engine”)
as x, 1 as howMany;
engineParts :-
engine union engineParts.Component.((leadsTo.Part) as x, (amount*howMany)
as howMany);
detailsMass :-
sum((Detail)engineParts. (howMany*x.mass));
detailsCost :-
sum((Detail)engineParts.(howMany*x.cost));
addedMass :-
sum((Aggregate)engineParts.
(howMany*x.assemblyMass));
addedCost:- sum((Aggregate)engineParts.
(howMany*x.assemblyCost));
cost :- detailsCost + addedCost; mass :- detailsMass + addedMass;} |
Queries utilizing fixpoint equations, unlike those
utilizing transitive closure, are capable of evaluating more than one recursive
problem in each recursion step, in a manner similar to the Datalog. It is
believed that such capabilities will allow users to formulate complex and
intelligent “business rules”. This topic may be an interesting area
for further research, although most of the practical recursive problems authors
are aware of can be solved using only a single recursion.
Similarly to transitive closure operator, fixpoint
equations may be used to perform recursive calculations without referring to
the database at all. For instance, the example below shows a fixpoint equations
version similar to Example E.TC.8:
E.FIX.6 |
Calculate
the cubic root of a, according to
the fixpoint equation x = (a/x^{2} + x)/2, starting from x = 1 and making
50 iterations |
SBQL: |
fixpoint( x ){ y :- (1 as r, 1 as c) union y.(((a/(r*r)+r)/2 as r, c+1
as c) where c <=50); x :- (y
where c = 50).r; } |
Next example is analogous to E.TC.9:
E.FIX.7 |
Calculate 50 Fibonacci numbers. The output should
contain arguments and results of the Fibonacci function. |
SBQL: |
fixpoint( Fib ){ init :- (0 as a, 0 as fib, 1 as n); f :- init union f.(((a+1) as a, n as fib, (fib+n) as n) where a < 50); Fib :- f.(a as
arg, fib as F); } |
In the example variable a is a counter (which is the argument of the function), fib represents a Fibonacci number, n represents a next Fibonacci number.
If the graph to be closed contains cycles then the
transitive closure is unsafe (thus the special
closure variant close unique by).
This problem does not occur for systems of fixed point equations, as function distinct can be used as usual for
removing duplicates. For instance (c.f. Fig.56):
E.FIX.8 |
Find the names of all companies owned
(directly or indirectly) by “ACME”. |
SBQL: |
fixpoint( comps ) { acme :- Company where name =
”ACME”; comps :- distinct(acme union comps.hasSharesIn.Company);} |
Assume the following class definition (ODRA) for
typical genealogical trees:
class
PersonClass {
name: string;
birthYear: integer;
sex: enum{“m”,
“f”};
alive: boolean;
parent: ref Person [0..2] reverse child;
child: ref Person [0..*] reverse parent;
}
Person: PersonClass[0..*];
In PersonClass we have used bidirectional
pointers parent/child that represent relationships in the genealogical tree.
Then we declare a collection of objects named Person. This structure can be used to formulate interesting queries
with the help of fixed point equations.
E.FIX.9 |
Get names and sex for all living and younger
cousins of “Doe”. |
SBQL: |
fixpoint( result ) {
myperson :- Person
where name = ”Doe”;
ancestors
:- myperson union ancestors.parent.Person; allcousins
:- ancestors union allcousins.child.Person;
livingYoungerCousins :- allcousins where alive
and birthYear > myperson.birthYear
; result
:- livingYoungerCousins.(name, sex); } |
E.FIX.10 |
Get all cousins of “Doe” of the same generation. |
SBQL: |
fixpoint( result ) {
myperson :- (Person
where name = ”Doe”) as
p, 1 as gen;
ancestors
:- myperson union (ancestors.((p.parent.Person) as p, (gen-1) as gen); allcousins
:- ancestors union (allcousins.((p.child.Person) as p, (gen+1) as gen); result
:- ((allcousins where gen = myperson.gen) minus myperson).p; } |
These examples much remind typical cases that
can be found in the Datalog literature. Of course, we are not sure if all
examples of Datalog programs can be formulated in this way. However, we do not
expect that SBQL fixed point equations alone would have the full algorithmic
power, as they are one of many SBQL constructs and features.
Strong Typing
For simplicity, in the presented constructs we
skip typing issues. Because fixed point rules can be recursive, type inferences
can be problematic (as initially the type of a variable on the right side of a
rule is unknown). Probably this is not a very difficult problem to solve,
however to avoid it each variable could be typed. This makes the syntax of the
rules a bit more complex, however, the strong type checking would be more
efficient.
Optimization
Evaluation of a fixpoint
system may be expensive, especially when bags to be assigned to variables are
very large. There are several opportunities for optimization. All optimization
methods that are developed for SBQL can be used, in particular, indices,
factoring out independent subqueries, removing dead subqueries, pushing
selections before structure constructors (joins), caching, factoring out weakly
dependent subqueries and pipelining.
There are also methods specific
for fixed point equations. The method from Datalog is known as seminaive evaluation. The main idea of
this technique is to divide the fixpoint rules into groups, which may be
evaluated independently from the other groups, utilizing only the final results
of previous group evaluation. For instance, in E.FIX.9 we can evaluate myperson at the beginning, because this
rule does not depend on other variables. Then, we can evaluate ancestors, as
this variable depends only on myperson
and ancestors. Then, we can evaluate allcousins, which depends on ancestors and allcousins. Finally, we evaluate result that depend on allcousins and myperson. In general, the method is easy to implement through
constructing a dependency graph that for our example is shown in Fig.61.
Calculations of a variable can start when all its predecessor variables are
calculated.
Fig.61. A dependency graph
between variables in E.FIX.9,
Another optimization opportunity concerns rules
of the form
x :- c union f(x)
where c
is some already counted part independent from x. All recursive rules that we have introduced in our examples have
this form. Under some assumptions (distributivity of f w.r.t. elements of its bag argument) such a recursion can be
turned into the iteration:
x_{0} = c; x_{1}
= f(x_{0}); ; x_{2}
= f(x_{1}); x_{3}
= f(x_{2}); x_{4}
= f(x_{3}); …
The iteration is continued till some next
component returns the empty bag. Then,
x = x_{0} union x_{1} union x_{2} union x_{3} union x_{4} union …
Probably the magic sets technique,
also known from Datalog, can be adapted to SBQL queries – further
research in this direction is needed.
Convergence of Fixpoint Equations
Evaluation of fixed point equations can result
in an infinite loop. That is, the fixed point is never reached. For instance
(c.f. Fig.56):
E.FIX.11 |
Find the names of all companies owned
(directly or indirectly) by “ACME” (without removing duplicates). |
SBQL: |
fixpoint( comps ) { acme :- Company where name =
”ACME”; comps :- acme
union comps.hasSharesIn.Company;} |
If objects Company
form a cycle that can be accessed from the ACME object via the hasSharesIn relationship, then the
evaluation process of comps will
never be finished. This situation can be tested, as usual, but without any
guarantee that a test will be reliable. For instance, it is possible that the
above example is convergent for “ACME” as a parameter, but it leads
to an infinite loop for another parameter.
In general,
the convergence
of fixpoint equations depends both on data structures to be processed and on
query operators that are used in queries on right of the rules. For instance,
the equation x = 1 – x starting from x = 1 will produce an infinite loop 1, 0, 1, 0, 1, 0,…. (despite there is a fixpoint x = 0.5).
In our opinion, there is no sufficiently
general theory or algorithm that could discover and prevent all such cases. In
Datalog there is the stratification
concept that prevents infinite loops in case when a negation operator is used.
However, there are a lot of other operators that may cause infinite loops, for
instance, arithmetic minus, divide, set-theoretic difference, function sinus, etc. Moreover, as shown in
E.FIX.10, infinite loops can be caused also by cyclic data structures, which
can be explicit (navigation via pointers) or implicit (a cycle due to some
non-evident dependencies among data). Hence some “rules of thumb”
are necessary to prevent such cases, in particular:
Of course, these rules do not prevent all such
bugs, hope that eliminate them in sufficiently reliable manner.
Recursive
Procedures and Functions
As almost all popular
programming languages, SBQL allows for declaring procedures (and methods),
including recursive ones. Any procedure can be recursive due to the stack-based
semantics of procedure calls. There is no special syntax separating recursive
and non-recursive procedures. In principle, the number of levels of recursion
is unlimited, however, there could be some physical and performance
constraints. In contrast to languages such as Java, C++, C#, etc., SBQL is
unique due to the following features:
·
SBQL functional procedures (methods) can return a bulk
output and can be called within SBQL queries. Hence, SBQL functional procedures
can encapsulate queries.
·
Parameters of SBQL procedures can be SBQL queries too.
Parameters can be transmitted in the call-by-value,
call-by-reference and strict-call-by-value modes.
·
Output returned by SBQL functional procedures can
contain references. Hence, the output from a procedure calls can be used for updating.
·
As follows from the previous features, SBQL functional
procedures can work as virtual updatable views. However, for the well
recognized view updating problem we consider procedures and views as separate
notions. (SBQL updatable views are the subject of a separate chapter).
In this chapter we do not
specify precisely all the syntactic and semantic details of SBQL procedures.
This is the subject of a separate chapter. Below we show on examples how SBQL
recursive procedures can be used to formulate recursive tasks.
The syntax for SBQL procedures
is shown below.
A procedure declaration consists of
a procedure name, a typed parameter list, a type of its output and a list of
imperative SBQL statements.
procDeclaration ::= name’(‘[parameter_list]’)’[‘:’returntype] ‘{‘statement_list’}’
parameterList ::= parDeclaration {‘;’ parDeclaration}
parDeclaration
::= name ‘:’ type [ cardinality ]
If the cardinality is not specified,
the default cardinality [1..1] is assumed.
Procedure invocation
procInvocation ::= name’(‘[actual_par_list]’)’
actual_par_list ::= parameter {‘;’ parameter}
parameter ::= query
Return
statement
statement ::= return [query]’;’
Semantics: When a procedure
is called, the statements in the procedure are evaluated in the consecutive
order. When the return statement is
reached, the result of the query being parameter of that statement is put on
top of QRES. If no return statement
is reached before the end of procedure evaluation, the procedure returns
nothing (thus it is not a functional procedure).
Statements in SBQL procedures
use SBQL queries. SBQL includes several imperative operators (object creation,
assignment, deletion, etc.) and control flow statements (including loops) such
as if, while, for each, etc.
A simplest recursive procedure
consists of a single return statement, which depends on a procedure parameter
either returning an empty collection or returning the result from invocation of
the same procedure with different parameter value. A sample recursive procedure
finding all components of specified parts, along with their amount required to
make these parts, is shown below (see Fig.55):
E.PRO.1 |
Find all components of specified parts, along
with their amount required to make these parts. |
SBQL: |
type PCType
is record { p: PartType; c: integer; } SubPartsPC( myPC: PCType[0..*] ): PCType[0..*] { return
if
exists myPC then bag(myPC, SubPartsPC ( myPC.p.Component.((leadsTo.Part) as p, (c * amount) as c ))) else bag()} |
The procedure takes a bag of structures as the
parameter (named myPC). Each structure contains a reference to a part
object named p and the amount of
parts of this type named c. Note that
a query being the parameter of this procedure can have unlimited complexity.
E.PRO.2 |
Find all the details necessary to make 125
engines and 150 gear boxes. |
SBQL: |
(Detail) SubPartsPC ( bag( ((Part where name = ”engine”) as p, 125 as c), ((Part where name = ”gear box”) as p, 150 as c))) |
E.PRO.3 |
How many bolts 5X50 is necessary to make 125
engines and 150 gear boxes? |
SBQL: |
sum((SubPartsPC ( bag( ((Part where name = ”engine”) as p, 125 as c), ((Part where name =
”gear box”) as p, 150 as c)))
where p.name =
“bolt5X50”).c) |
The advantage of recursive procedures is simplicity of
problem decomposition. A recursive task can be easily distributed among several
procedures (some of which may be reused in other tasks), and calculations
within a single procedure can be performed in easy to comprehend steps,
comparable to the simplicity of fixed point equations. A procedure calculating
the cost and mass of a part illustrating this possibility is shown in the next
example. The procedure utilizes the previously defined SubPartsPC procedure in order to perform the recursive
processing and then performs calculations, utilizing local variables.
E.PRO.4 |
Get the total cost and mass of a part given
as a parameter. |
SBQL: |
PartCostMass( myPart:
PartType ): record{ p: PartType; cost: real; mass: real; }{
allSubParts: PCType[0..*];
detailsMass: real; detailCost: real;
addedMass; real; addedCost: real;
create local allSubParts( SubPartsPC( bag((myPart as p; 1 as c)) ));
detailMass := sum((Detail) allSubParts.(c * p.mass)); detailCost := sum((Detail) allSubParts.(c * p.cost)); addedMass := sum((Aggregate) allSubParts.(c * p.assemblyMass)); addedCost := sum((Aggregate) allSubParts.(c * p.assemblyCost)); return (myPart as p, (addedCost+detailsCost) as cost, (addedMass+detailsMass) as mass); } |
E.PRO.5 |
Get the cost for all parts having the mass
higher than 100. |
SBQL: |
(Part as p).(PartCostMass(p) where mass > 100).(p, cost)) |
Recursive procedures in
SBQL offer many advantages, when compared to stored procedures in relational
DBMSs. Most of them are consequences of the fact that recursive procedures in
SBQL are a natural extension of the SBA, working on the same principles and
evaluated by the same evaluation engine – unlike relational systems,
where stored procedures are an addition to the system, but are evaluated
separately from SQL queries. The main advantages of SBQL recursive procedures
are the following:
·
SBQL queries are valid as expressions, procedure
parameters, etc.;
·
The type system is the same;
·
There is no impedance mismatch.
SBQL procedures may be also
used as a way to reuse queries or fixpoint equations. Instead of writing a
stand-alone fixpoint equation for a single use, it is possible to write a
procedure utilizing the fixpoint equation, while providing a way to
parameterize it easily. A sample procedure doing this is shown below:
E.PRO.6 |
Get all subparts of a part being a parameter. |
SBQL: |
subParts(myPart:
PartType): PartType[0..*] { return
distinct(fixpoint (parts){ parts :- myPart
union (parts.Component.leadsTo.Part); });} |
Recursive procedures can also substitute fixed
point equations or transitive closures, for example:
E.PRO.7 |
Calculate
the square root of a, according to
the fixpoint equation x = (a/x + x)/2, starting from x = 1 and making
50 iterations. |
SBQL: |
Root( a: real ): real{
return internalRoot(a; 1;
50);} internalRoot(b:real;
x:real; c:integer): real { if c = 0 then return x; else
return internalRoot(b; (b/x
+ x)/2 ; c - 1); } |
Formulation of recursive definitions is equally
simple:
E.PRO.8 |
Calculate n-th
Fibonacci number. |
SBQL: |
Fib(n: integer):integer { return
if n = 0 or n = 1 then n else Fib(n-2) + Fib(n-1); } |
Similarly, recursive procedures can be used for
processing cyclic structures, for instance (c.f. Fig.56):
E.PRO.9 |
Find the names of all companies owned
(directly or indirectly) by “ACME”. |
SBQL: |
visited: string[0..*]; //global variable storing names of visited objects … allOwnedBy( comp: CompType): CompType[1..*] { delete
visited; intAllOwned(comp); return
Company where name in visited; } intAllOwned( comp: CompType) { if
not comp.name in visited then { create
visited( comp.name ); for
each (comp.hasSharesIn.Company)
as own do
intAllOwned ( own ); } } |
SBQL: |
allOwnedBy(Company where name = ”ACME”) |
In the example we use a global variable visited to avoid starting next
navigation via hasSharesIn from an
already visited object. This task can be of course solved in many different
ways.
Conclusion
In this chapter we have presented and compared several
implemented and postulated techniques for expressing recursive problems in
database query and programming languages. We have considered three techniques:
transitive closures (four variants), fixed point equations and recursive
procedures and functions. All of them
are implemented in the ODRA system, hence we have gained some experience.
Practical examples have shown that usability of these techniques can be
different in various situations and none of them is dominating or
disadvantageous. Although recursive tasks are not much frequent, in some
important situations presence of the techniques in a database programming
language can much simplify the programmers’ effort.
Last modified: November 24, 2009