Syntax, semantics and
pragmatics of query languages
Back to Description of SBA and SBQL.
Each computer language can be
characterized by three aspects, known as syntax, semantics and pragmatics.
Developers of computer languages, especially from the computer industry,
frequently confuse these aspects, usually assuming that syntax, plus informal
explanation of semantics, plus some examples completely specify a proposed
language. Such an approach is represented by the ODMG and SQL-99 standards,
which specify syntax and intuitive explanations, give some examples, but
essentially present no idea concerning formal semantics and (recursive)
semantic interdependencies between various language constructs. This leaves a
lot of room for different understanding of syntactic constructs, hence for sure
will result in different, incompatible implementations. Thus, lack of formal
semantics undermines the initial goals of the standards, such as portability of
programs and interoperability between applications or program libraries made by
independent vendors. Lack of formal semantics causes also problems with
specification of a strong type checker (which must simulate the actual
computations during the parse/compile time) and undermines query optimization,
which have to be based on strong rules and easy reasoning concerning
equivalence of different plans of actual computations for all possible database
states.
Syntax is determined by formal rules
saying how to construct expressions of the language from the set of atomic
tokens (known as alphabet). From the mathematical point of view the syntax of a
language can be defined as a (usually infinite) set of all correct strings of
tokens from the alphabet. Unfortunately, such definition of syntax (that can be
found in popular textbooks e.g. on context-free grammars or regular
expressions) is incomplete from the point of view of semantics. Semantic rules
are usually associated with syntactic rules. It may happen that for the same
set of correct strings there are two sets of syntactic rules A and B, but the
set A is correct, while the set B is wrong. For instance, consider the SQL
statement
select X from Y where Z
and two sets of rules A and B determining its syntax. However, the set A
assumes implicitly the parentheses as in select X from (Y where
Z), while the set B assumes implicitly the parentheses as in (select
X from Y) where Z. Although both A and B produce the same set of
SQL statements, A is correct, while B is wrong (inconsistent with the actual
semantics of SQL).
Syntax is usually specified by rules of context-free grammars, perhaps
with additional constraints (for instance, concerning typing). In SBA we take
little attention to concrete syntax, leaving this issue for possible
implementations. All our definitions will be based on abstract syntax. Perhaps,
at some moment SBQL will require standardization, and at that moment the
unified concrete SBQL syntax will be determined. In existing implementations of
SBQL it is a bit different.
Semantics determines the meaning of syntactic constructs, that is, the
relationship between syntactic constructs and elements of some universe of
meanings. In the popular understanding semantics addresses human minds and in
this case it can be expressed in terms of a natural language. Such semantics we
can see, for instance, in UML diagrams, whose syntactic constructs are
explained through more or less understandable phrases in our everyday tongue or
by relationships with other (also informal) constructs. Such semantics,
however, is valueless for a machine, it is not enough precise, too ambiguous
and contains a lot of unspecified or poorly specified details.
The machine requires precise formal semantics. The general definition of
formal semantics is not as easy as the definition of syntax because it requires
the formal definition of the mentioned universe of meanings and the definition
of mappings of the syntax into the universe of meanings. Such a definition is
also not univocal, as it depends on who or what is the addressee of the
definition. In particular, the description of semantics can be different for
compiler writers and for application programmers who are/will be the users of
the language. In our explanation we take the point of view of compiler or
interpreter designers rather than application programmers. Obviously, this
point of view requires from us to be extremely precise in specification and
sensitive to all, even smallest semantic details.
In particular, we can assume that the universe of the meanings is the
set of all the sequences of instructions of the Java virtual machine (JVM). The
definition would be a mapping of the set of all the language expressions into
the set of sequences of instructions of JVM. The definition of semantics
assumes that the meaning of JVM instructions is non-definable; it is given as
an axiom. Such an approach assumes that the definition of the semantics is done
by the designers of a compiler or interpreter of the given language. The
definition would be, however, valueless for application programmers, who rarely
understands the actions of the Java virtual machine. Moreover, every team that
attempts to implement Java (or other language) defines its own semantics,
incompatible with the semantics of other teams.
For these reasons the semantics must be defined in abstract terms and
concepts that are to be easily understood by programmers and leave no room for
different interpretations by designers of compilers or interpreters of the
language. The abstract terms and concepts should be much more abstract than the
level of the machine code (assembler) or instructions of a Java-like virtual
machine. Such semantics should address two kinds of subjects: (1) system
programmers, who precisely and formally map the semantic definition
into actions of the machine; (2) application programmers who will use informally
the language to make applications. Formal actions of the machine and informal
understanding of the semantics by application programmers must coincide. Lack
of the coincidence is referred to as semantic reef. It is a property of
the language that most frequently causes application programmers errors due to
improper informal understanding of formal language constructs. (SQL is famous
also for well known semantic reefs, in particular, with the group by
clause and with null values; the issue is discussed, e.g. in several papers by
C.J.Date.)
In general, precision of semantics is so important for implementation
and standardization of computer languages that the general rule can be
formulated as an oxymoron:
A smallest
semantic problem is the very big problem.
Even smallest ambiguities on single bits or minor details cause that
such goals as portability and interoperability become unfeasible.
Unfortunately, this fact is ignored by next and next proponents of „standards”
of query languages; thus the low quality of the standards, difficulties with
consistent and entire implementation, and a lot of incompatible implementations
of the same specification.
The database theory has already proposed
several theoretical frameworks that are claimed to be formal bases for
definition of semantics of query languages; in particular, relational algebra
(originally proposed by E.F.Codd in 1970), relational calculus and formal
logic. There are also theories that are claimed to be semantic foundations for
queries addressing more sophisticated database models, such as object-oriented
and XML-oriented models, in particular, object algebras, an XML algebra, object
calculi, variants of mathematical logic (e.g. F-logic), structural recursion,
monoid calculus, etc. Our attitude to these proposals is definitely negative:
in many cases we do not believe in their conceptual, mathematical and pragmatic
soundness (c.f. object algebras)
and in all cases they offer too narrow formal basis and are inadequate
to represent a lot of phenomena that occur in database models and database
query/programming languages. In another section we present a list of common
flaws that undermine their mathematical, conceptual and pragmatic significance.
Basically the flaws concern crippled conception, too narrow scope, inadequate
mathematics and wishful thinking concerning their practical potential.
Currently we believe that the Stack-Based Approach is one and the only paradigm
that is acceptable as a formal basis for the description of query languages'
semantics.
Our focus on formal semantics has apparently
led us to mathematical methods. Unfortunately, we have no good message for
mathematicians.
As a
method of formal specification, mathematics has two fundamental disadvantages:
Could we speak on formal language specification
methods that are not mathematical? Mathematicians working in the computer
science are trying to convince everybody that „formal” always means strong
mathematical discipline. Fortunately, such claims are not justified and are not
reasonable looking on the above fundamental disadvantages of mathematics as a
formal specification method. They can be considered as attempts to make a false
stereotype defending the (actually lost) position of mathematics in the
computer science. The stereotype is not justified by common practice in other
technical branches. For instance, the documentation of a building by an
architect presents fully formal and precise specification that is sufficient to
construct the building according to his/her intention. The specification uses a
lot of drawings, plans, diagrams, texts and tables, but we see neither
mathematical formulas nor theorems and proofs that the specification is
mathematically „sound and correct”. Obviously, the specification allows for
many kinds of reasoning concerning e.g. what is the optimal plan of the building
construction. Similarly we can make the analogy with the car, electronic or
other industries. In all these branches mathematics plays some part, but not
the major one. Still, specifications of the artifacts produced by these
industries are enough formal and precise for manufacturing processes and for
inferences concerning properties of products.
In no way computer languages are different in this respect. The
difference concerns only the fact that computer languages are relatively young,
thus methods of formal specification are not as mature as in traditional
technical branches. Computers and their software are constantly changing
causing the necessity of new and new formal specification methods. As in other
cases of formal specification of technical artifacts we can use mathematics in
all the places when mathematical concepts can be helpful for understanding.
These concepts, however, will not be used in the strictly mathematical sense;
in fact, we will rely only on common intuitive understanding of simple
mathematical notions such as number, set, function, relation, union of sets,
set containment, etc. Obviously, within this approach we are unable and we will
not strive to make any mathematical proof of theorems concerning e.g. query
optimization. However, we will show that our formal model can lead us to deep
inferences (concerning query optimization, in particular), which are based on
precise understanding of introduced concepts and relationships between them
rather than on mathematical reasoning. Eventually, in every case (including
mathematical proofs) implementing and practical testing of inferences is the
only credible and believable proof of their correctness and usability.
Early our approach to the semantics of query languages was based on the
denotational method, where each syntactic construct was associated with some
abstract mathematical concept, like a function. The method is based on defining
such functions by sophisticated mathematical notions known as least fixed point
equations. Despite big effort to promote this approach for different software
specification areas, it was totally unsuccessful, in particular, for
specification of query languages' semantics. In general, the denotational
semantics is a great theory and we recommend it as beautiful exercise for
everybody who is interested in top-level achievements of human intellects.
Unfortunately, this way of specification was not understandable for typical
designers of computer languages and deeply involves the two mentioned above
fundamental disadvantages of mathematical specifications.
Currently we rely on another formal specification method known as operational
semantics. The idea of the method is perhaps as old as the computer
science. Many years ago it was formalized by E. Dijkstra and the A. van
Wijngaarden group, but in SBA we do not refer to these old efforts. In
operational semantics we have to define some machine and then, to
specify the semantics of particular language constructs through operations of
the machine (and through data structures that it involves). We are looking for
a machine that is defined on a much more abstract level than e.g. JVM, but
still is able to map formally and precisely all the language constructs. Our
machine involves basic data structures that are necessary to specify the
semantics of query/programming languages, that is:
· Abstract data/object store;
· Environment stack;
· Query result stack.
The method appeared to be very successful for understanding of the
semantics of query languages by many people. It is sensitive to any detail of a
datamodel that we want to consider and to any operation that we would like to
introduce into a query/programming language. The operational semantics presents
an abstract implementation of a language that can be directly used to
make the concrete implementation in our favorite programming language. The
method is also very efficient from the point of view of query optimization,
i.e. it allows for general and very deep inferences concerning how to construct
an optimal query evaluation plan.
Pragmatics of a language determines its function in interaction between
humans or between a human and a machine. Pragmatics describes how to use the
language in practical situations, what are the reasons for the use and what
goals can be achieved. Pragmatics requires learning how to match expressions of
the language to concrete real-life situations, what will be the response from
the machine and how we have to interpret the response.
A computer language should be pragmatically efficient, i.e. the language
must have the potential to accomplish some important practical goals.
A
computer language that is pragmatically inefficient is not a serious computer
language.
In particular, one can perfectly understand syntax and semantics of a
programming language, but cannot use it in pragmatically efficient way to make
some usable system (the case of many so-called "theoreticians of computer
science"). Pragmatics cannot be formalized. It can be explained by showing
some use cases, examples, patterns, anti-patterns, best practices, wrong
practices, etc. Majority of the user textbooks and documentations of languages
are devoted to their pragmatics. However, the only way to teach and learn
pragmatics is to use the language for concrete practical situations. By
analogy, we can explain in many ways how to drive cars; however, eventually the
efficient teaching requires to go into the car and drive it through crowded
streets.
Pragmatics
of a computer language dominates over its syntax and semantics.
Pragmatics is the most important aspect of a language. Syntax and
semantics are important, but only if serve the pragmatic goals of the language.
The arguments in favor or against some syntactic or semantic constructs must
refer to the pragmatics of the language. In particular, we reject all arguments
that stem from some ideology (e.g. "the language must have sound
mathematical basis"), analogy (e.g. "the language syntax must be
similar to SQL") or silly associations (e.g. "a query language for
XML must have XML syntax").
In the commercial literature and documentation pragmatics is frequently
confused with semantics. A typical manual presents a business intention, e.g.
"Get names and salaries of employees working as programmers", and
then presents a corresponding query accomplishing the intention. Such an
approach to semantics is presented e.g. in the ODMG standard. Pragmatics,
however, is not good as a method of specification of semantics, because it is
able to present by examples some isolated semantic islands. The complete formal
picture of semantics and (recursive) interdependencies between different
notions and language constructs require systematic specification method,
independent from examples of the use.
In the SBA and SBQL description we frequently refer to pragmatics,
nevertheless our ultimate goal is the formal description of query languages'
semantics.
Data model and
database schema as inherent pragmatic components of a query language
Due to the data
independence principle, the pragmatics of a query language must be extended
outside the languaage itself. The necessary condition of pragmatic efficiency
of queries is that the programmer fully understands what the database contains
and how it is organized. Because the concrete state of the database is unknown
for the programmer, he/she must recognize it on some abstract level, through a
database schema and business-oriented description (called ontology) that
determine the business meaning of the data and services stored in the database.
To understand operations of a query language the programmer first of all
must understand the database model on the level of algorithmic precision (which
is required for efficient programming). Then, he/she must understand the
database schema and data structures that are implied by the schema, also with
algorithmic precision.
Both a
data model and a concrete database schema are inevitable pragmatic parts of a
query language.
In case of the relational model the situation is simple, because the
model (in a pure version) introduces only named tables with named columns. Each
value stored on the cross of a tuple and a column is atomic. However,
simplicity of the data model and data structures that it involves is at the
cost of the complexity of conceptual modeling of applications, the complexity
of queries and the complexity of nesting queries within programs written in
popular programming languages. Currently, the conceptual modeling tools are
object-oriented, as a rule (c.f. UML). To reduce the gap between the business
model and data storage model, there is a need for object-oriented database
models and management systems (although the need is questioned by the
commercial vendors of relational systems, for the obvious commercial reasons).
Object-oriented models introduce many useful notions that support
conceptual modeling, such as complex objects with no limitations concerning
size and object hierarchy levels, associations among objects, classes and
inheritance among classes, behavior stored inside classes, polymorphism, and
others. While for conceptual modeling these notions are (more or less) precise,
attempts to map them into data structures are not trivial and not univocal. The
same concepts can be mapped on many, many ways. For instance, UML class
diagrams are quite understandable as a way of human thinking on some business
environment. However, they are far from being precise when we try to map them
to data structures. How to map inheritance and multiple inheritance? How to map
associations, aggregations, qualified associations, association roles, methods,
polymorphism, etc.? There is no a single answer for these questions, mapping
UML class diagrams to object-oriented data structures having the property of
algorithmic precision is a non-trivial research task.
The task has been solved in CORBA by IDL and mapping IDL to declarations
of data structures in popular programming languages. The same is done with
CORBA objects that are mapped to objects (or other structures) of the
programming languages. In this way the algorithmic precision is assured.
However, such an approach has disadvantages. First of all, it forces low-level
programming model, in comparison to programming through query languages. The
CORBA object model is also far incomplete from the point of view of modeling of
business applications: it has no collections, associations, a query language
and many other features (which are introduced on the level of CORBA services,
with a lot of limitations and too complex notions).
The disadvantages of CORBA caused the next standardization effort known
as the ODMG standard. In this framework a lot of notions that are necessary for
conceptual modeling of object database applications were introduced. The
standard proposes the database schema language ODL and the query language OQL.
Unfortunately, the specification of object-oriented model and corresponding
schema and query languages is very far from being algorithmically precise. The
standard presents also a lot of other flaws, thus we consider it as
unsatisfactory for our purposes.
In an object-oriented data model and a database schema we have to
reconcile several concepts and demand of different agents. Object-oriented
models introduce concepts such as complex objects, object identity, classes,
types, interfaces, associations, inheritance and multi-inheritance, dynamic
object roles, and so on. The must present some consistent and universal whole. The
model and a database schema should be clear for designers of a database, which
use the schema language for determining the database structure. It should also
be clear for application programmers, who have to understand the schema and
objects induced by the schema with algorithmic precision. The schema must also
be clear for designers of database query engine, who have to develop algorithms
enabling storing, maintaining, accessing and checking data in the database,
algorithms for strong type checking of queries and programs based on queries,
and perhaps other algorithms. Unfortunately, understanding of object-oriented
notions is very different and depends on a school, some ideology, theory or a
concrete programming tool.
Many designers of database models and their schema and query languages
start the job from the definition of the concept of type. (c.f. the
CORBA and ODMG standards, or XML technologies). Types are main components of a
database schema. However, the concept of type is not obvious. Many languages
have no types (e.g. Smalltalk) and nevertheless are useful. However, we
advocate the view that any query/programming language should be supported by
strong type checking. Types are easy when they address human minds and support
conceptual modeling only. However, a type system for strong static query and
program checking presents a non-trivial research issue. So far there is
practically no good pattern that would be sufficiently consistent and
universal. There are type theories coming from functional languages such as SML
and Haskell; for our purposes these theories (although mathematical) are too
restrictive. Types proposed by ODMG are inconsistent, thus non-implementable.
Types introduced in the XML world (DTD, XML Schema and RDF Schema) are too limited
for our purposes. After implementing a type system for SBQL we have concluded
that type systems for database query/programming languages are nowadays
immature and must be revisited. To have precise motivation for our type system
we will introduce it at the end of our semantic considerations, when we
introduce all the data structures that we want to address (on the level of
algorithmic precision) and define all query operators that we want to involve
into a query language.
Because in examples we need some schema language, till introducing it
formally we use informal notation similar to UML. We correct it according to
our needs. Below we present a simple self-explained schema built according this
notation. The left part of Fig.1 shows the database schema in an UML-like
notation. The right part shows example database structures complying with the
schema. i10, i11, i12,..., i64
denote object identifiers. Arrows denote pointers (abstract implementation of
the association between Emp and Dept objects).

Fig.1. A database schema and database structures complying to the schema
Abstract syntax and
syntax-driven semantics
Old discussions on the syntax of programming languages were the subject
of jokes, which present it as an issue for fools (c.f. the David Harel's shortest
paper "do considered od" considered odder than
"do considered ob"). From that time the syntax arouses some disrespect among
professionals, who coined the term „syntactic sugar” to denote meaningless
tokens of the language that can be arbitrarily taken by the designers. Nowadays
a lot of similar jokes can be invented looking at assertions produced by the
fans of XML. XML is a syntactic convention, nothing more, it has no semantics.
Indeed, some discussions on
the syntax are meaningless. However, it is not true that all such
discussions are meaningless. Syntax is important for conceptual modeling (a
pragmatic part of the language). Due to associations with tokens or phrases of
the natural language, the meaning and the use of formal statements is easier to
understand. For instance, in the SQL query:
select Name, Salary
from Employee where Job =
'programmer'
the keywords select, from and where and characters
"," and "=" are syntactic sugar. From the point of view of
formal semantics the sugar is not essential, e.g. the same statement can be
written in some hypothetical grammar as:
(Employee%Job:"programmer")/Name++Salary
Due to the sugar, the SQL statement is easier to understand and use. Too
much sugar, however, combined recursively by various language constructs, may
lead to clumsy queries. For instance, in SQL-92 there is possibility to insert select
statements inside select, from and where clauses. In
effect we can obtain constructs like
select ... from (select ... from ... where
x = (select y from ... where ...)) where
...
Understanding of such nested statements can be problematic. In SQL too
much sugar (plus non-orthogonality, plus lack of some operators, plus some
other syntactic and semantic problems) causes that some statements are quite
illegible; this is known as "SQL puzzles".
Syntax is also important as a basis for definition of semantics.
Semantic definitions are independent on the syntactic sugar, hence it can be
removed (or more precisely, reduced to some minimal set of abstract tokens). The
syntax without sugar is referred to as abstract syntax. The syntax with
all the sugar is referred to as concrete syntax. For instance, the SQL
query having the concrete syntax:
select Name,
DeptName from Employee, Dept where some_predicate
in abstract syntax can be
written as:
((Employee × Dept) σ some_predicate) π (Name ° DeptName)
where symbols ×, σ, π and ° denote some
operators (Cartesian product, selection, projection and tuple composition,
respectively). In abstract syntax we isolate all operators and syntactic
constructs having independent semantic meaning (as in the above SQL example),
in the right order (possibly forced by parentheses), even if some operators are
not explicitly seen from the concrete syntax. In the abstract syntax it is also
necessary to resolve some syntactic ambiguities that may occur in the concrete
syntax. For instance, in the above SQL example commas have different meaning,
depending on the context (tuple composition and Cartesian product,
respectively). In our abstract syntax we map each comma to a proper abstract
operator (° and ×). Operators can
be written in prefix, infix and postfix style; this has no meaning for
semantics. The above abstract syntax presents the infix style. In the postfix
style (a.k.a. the Polish notation) it can be written as:
Employee Dept × some_predicate
σ Name DeptName ° π
In our presentation we
prefer the infix style, which is traditional in the elementary mathematics thus
more legible for the readers. The syntax will be based on context-free
grammars, but for our current goals this is a secondary issue. Usually the
abstract syntax is an internal part of the language definition and frequently
takes the form of a special data structure known as a syntactic tree of
a language expression.
In almost all computer
languages semantics is syntax-driven. First, the designers of a language
define its syntactic rules. Then, a semantic rule is associated with each
syntactic rule. Symbols denoting operators are defined through corresponding
mathematical concepts and/or some routines. Names occurring in language
expressions are bound to some compile time or run time entities.
Semantically, the binding means that the name fires some search in the computer
or application environment for an entity (or entities) having this name. In
this method syntactic rules are not arbitrary, but must reflect corresponding
semantic rules. Syntactic rules are usually presented as productions of
context-free grammars. Because syntactic rules are usually recursive, semantic
rules must be recursive too. This method of definition we use in the
description of the SBQL semantics.
The definition of semantics
is much easier if the syntax is abstract rather than concrete. For query
languages the most optimal case from the point of view of semantic definition
is when each syntactic rule involves one operator and zero, one or at most two
arguments. This allows for the most compact and most orthogonal definition of
semantics (thus short implementation and much easier and more general
optimization methods). SBQL is the only known query language that follows this
advice. SQL and OQL present completely different syntactic style, based on very
big syntactic constructs with far context dependencies (e.g. dependency between
the group by clause and select clause). The style has severe
disadvantages; we do not follow it.
General assumptions of the
semantics of query languages
We take the assumption that the semantics of any query is a
(mathematical) function that maps the set of all states (basically, the
database states, but not only) into the set of all query results. That
is, for a given query each state is mapped into some result. More formally, let
assume that Query is a set of syntactically correct queries of a given
query language, State be the set of all states and Result be the
set of all possible query results. Then, the semantics |q| of any query q ∊ Query is a function mapping State
into Result:
|q|: State → Result
Such view on the semantics is correct for queries having purely retrieval
capabilities. If a query has side-effects, i.e. it can change a state,
then we must assume that each query determines a function mapping a state into
a result and a new state, i.e.:
|q|: State → (Result × State)
Such queries may e.g. invoke a method, which makes some changes on
computer, database or/and application environment.
Usually we assume the terminology in which a query always returns a
result. SQL, however, has introduced queries that do not return a result, but
make operations on a database. They are known as update, insert
and delete clauses. According to the tradition of programming languages
we will call such constructs imperative statements or instructions,
avoiding the term query. If p is an imperative statement, than
the semantics |p| of p can be described as a function mapping a state into a state:
|p|: State → State
The sets State and Result are usually different, although
(as will be shown later) they are built from the same primitives. The relational
model professionals worked out a stereotype known as the closure property,
which can be formulated as follows: an element of State is a set of
relations and an element of Result is a relation. The closure property
has been considered as a necessary condition for nesting queries. This
inference is based on false reasoning. After importing the closure property to
an object-oriented model one can say that an element of State is a set
of objects and an element of Result is a set of objects too. However, we
do not buy such assertions. For object-oriented models that we intend to deal
with we conclude that:
The closure property is nonsense.
In particular, it leads to subdividing queries into object-preserving
and object-generating, which is nonsense too (stemming from the previous
nonsense and inappropriate understanding of object-oriented concepts).
Moreover, for relational systems, in particular for SQL, the closure property
is nonsense too. SQL is not dealing with mathematical relations, but with tables.
Tables stored in the relational database have fundamentally different
properties in comparison to tables returned by SQL queries. For instance, the
first ones can be updated (are mutable) while the second ones cannot (are
immutable). On some abstraction level we can describe them as mathematical
relations (this is actually done in all the theories devoted to the relational
model), but in this way a lot of semantic properties and constructs of SQL are
not expressible formally (for instance, SQL updating clauses). Hence we will
not follow such doubtful assertions and will define the sets State and Result
according to our own perception of the issue and according to 40-years
tradition of programming languages. (More critique of the closure property you
can find here.)
Now it becomes more and more clear what we have to do to define the
formal semantics of a query language. We have to define formally:
·
Set
Query determining the abstract syntax of queries. We will not strive to
define all constructs of Query, showing only the basic constructs. Other
constructs can be easily added by analogy with already defined ones. Query
will be defined by context-free rules; each of them will be associated with a
semantic rule.
·
·
Set
Result determining derived data structures that can be returned
by queries in the result of their execution. To some extent, this set depends
on the assumed data model too. We will define it in a quite general fashion.
·
A
semantic rule associated with every syntactic rule of our abstract
syntax. According to the operational semantics (abstract implementation) each
semantic rule will be represented by actions of some abstract machine that for
the given syntactic rule performs the mapping of a state into a result.
Our approach to the formal semantics presented in the above steps will
be extended to imperative constructs and to abstractions such as procedures,
function and methods. We will show that the approach is an efficient theory,
which is able to explain all linguistic phenomena that occur in data
models and corresponding query/programming languages. This is not possible with
the current theories, such as relational algebra, relational calculus, formal
logic, which are able to cover only a very small subset of these phenomena and
are very hard to extend to more advanced database models, such as
object-oriented and XML-oriented models. The theory explains construction of
query/programming languages in highly intuitive terms, allowing for efficient
designing of new query/programming languages for various data models and fast direct
implementation. We also intend to show that the theory is very efficient
concerning reasoning on query optimization methods.
Compositionality of queries
As we have noted before, because syntactic rules are recursive, semantic
rules expressed by the operational machine must be recursive too. To this end
we need to follow a programming language principle known as compositionality.
The principle is the basis for recursive definitions and implementations of
practically all programming languages. In short, the principle requires that
the semantics of a compound statement is a function of semantics of its
components. For instance, if we have a compound query q = q1
θ q2, where q, q1, q2
are queries and θ is an operator, then the semantics |q| is the result of some function funθ having |q1| and |q2| as arguments: |q| = funθ (|q1|, |q2|). Function funθ depends on the operator θ. The compositionality property
allows for orthogonal combination of operators and recursive nesting of
queries. According to this principle, the designer has to determine the
semantics of atomic queries, i.e. queries having no sub-queries. Then, the
semantics of complex queries is build recursively from the semantics of their
components. To exploit the full potential of this principle, the syntactic
rules of the query language should be as short as possible, i.e. the designers
should avoid big syntactic patterns and far context semantic dependencies. The
principle will be applied to all constructs of our abstract syntax, including
queries, imperative statements, programming abstractions and perhaps other
constructs.
Database professionals usually assume (more or less explicitly) that in
the context of query languages the concept of state is equivalent to database
state. Moreover, it is assumed that the database state purely reflects our
favorite database model (e.g. the relational model), i.e. a state is a
composition of data structures that are allowed in the model (e.g. relational
tables). Our view on the concept of state will be much more extended.
Generally, we assume that:
A state
involves all run-time entities that can influence the results returned by
queries.
A state involves all run-time entities whose names can occur in
queries. In particular, a database state may include not only passive database
objects, but also stored procedures, views, etc., because their names can be
used in queries. In object-oriented databases the concept of state must include
classes and methods stored within classes. In our approach we follow the orthogonal
persistence principle, which implies, in particular, that a state involves
persistent and volatile data on equal rights. A state includes a metabase,
i.e. data structures that store database schema in a structural form. Entities
stored in the metabase can be explicitly or implicitly referred from queries,
e.g. to accomplish the method of generic programming known as reflection.
Queries may also refer to a local state of the application that the queries are
nested in, e.g. to programming language variables. Hence local state of an
application is also a component of the state. Queries can refer to some
computer environment information, such as date, time, version of operating
system, amount of free memory, etc.; hence they also augment the concept of
state. Queries can also refer to some libraries (i.e. classes, procedures) that
are bound to an application, thus libraries (plus states of their own
variables) constitute another part of the state concept.
Query languages require also augmenting the state concept by internal
data structures that store the state of query evaluation. Consider the SQL
query ("get employees earning more than their boss"):
select * from Emp e
where e.salary > (select b.salary
from Dept d, Emp b
where e.dept#
= d.dept# and d.boss = b.emp# )
The query contains the subquery:
select b.salary from Dept d, Emp b
where e.dept# = d.dept# and d.boss = b.emp#
having e as a free correlation variable. When the query processor
evaluates this subquery, it meets e and then must find some data
structure storing its value. This structure is internal to the query, it will
disappear after the entire query will be evaluated. However, during evaluation
of the subquery this temporal structure augments the concept of a state.
In object-oriented query languages there are more such situations when
for evaluation of queries the query engine must create some data structures
storing a temporary state of query evaluation. Some formal discipline
concerning such temporal structures is necessary. Not surprisingly, the
situation is very well recognized in programming languages and leads to the
well-known concept referred to as environment stack or call stack.
We prefer to use environment (environmental) stack, because the stack that we
would like to consider will be used not only for calling functions, procedures
and methods, but also for resolving some query operators (called non-algebraic).
The environment stack concept is central for our approach to query languages,
thus the Stack-Based Approach (SBA).
Some professionals from the database community used to claim that the
stack should not be introduced on the level of formal semantics, because it is
a purely implementation notion. We strongly disagree - the definitions of some
query operators cannot be formally correct without the concept. The stack is a
data structure that is internally manages by the query engine. Within the
description of SBA and SBQL we deal with it on the abstract rather than on the
physical level. Implementation of the stack can be of course very different,
especially after optimizations. The environment stack (denoted here ENVS) introduces
necessary discipline for all temporary data structures that are necessary
during evaluation of queries. It is a client-side structure (i.e. it is the
property of a client application rather than the property of a database server)
and usually it is stored and managed in the main memory.
In SBA we introduce another stack, called query result stack
(abbreviated as QRES). This stack has different role in comparison to ENVS.
Although it is also a temporary data structure maintained by the query engine, we
do not involve it into the concept of state. The reason is obvious: semantics
of queries does not depend on the state of this stack. In the operational
method that we have assumed it must be introduced explicitly, but in an
equivalent denotational method this stack is unnecessary - the semantics can be
explained without it. However, all methods of description of semantics of
queries must refer to ENVS. Some professionals of relational databases may
claim that such a stack is unnecessary for relational languages, in particular,
for SQL, because they can be explained through the relational algebra or
relational calculus. Please don't believe them, please reject all such
simplistic relational ideas. The mentioned theories are far too primitive to
explain all constructs of SQL. The environment stack in some form must also
occur within the SQL engine. This will be clear after we explain the semantics
of basic constructs of SBQL and show examples much remaining SQL.
In general, all entities that are external to the query engine
and can influence the results returned by queries we call object store.
At some abstraction level we can make no distinction between different concepts
of the store, e.g. client, server, computer environment, application
environment, persistent objects, database, volatile objects, etc. All internal
entities that are to be introduced by the query engine in order to evaluate
queries will be collected at the environment stack. In conclusion:
The
state concept consists of two components: the state of an object store
and the state of an environment stack.
A query result concept is not as difficult as the concept of state.
Because our queries cover also programming expressions, then almost any value
that can be stored in the object store or can be computed from other values can
be the result of a query. For instance, the query 2+2 returns 4. (The result is
no surprise, but such a query can be shocking for lovers of relational theories
and SQL.) However, not all values that are stored in the object store can be
returned by queries. This concerns multimedia values, such e.g. as a DVD file
having some gigabytes. In such cases a query should return a reference
to such a value (e.g. a file name) rather than a value. In general, because
according to the principle
of total internal identification every object has an identifier, a query
can return an identifier of an object; such a result we traditionally call reference.
Reference is another fundamental concept of SBA. Queries can also introduce
more complex data structures which include values, references, names, structure
constructors and collection constructors. The exact definition concerning the
set Result will be given later, when we will start to describe SBQL.
In the
stack-based approach queries never return objects, but references to
objects, perhaps within some complex data structures.
Such an assumption is the tradition in programming languages, whose
expressions never return programming variables, but references to variables.
Objects are data structures that are stored within the object store. Similarly
to Smalltalk, we use the terminology that everything that can be stored in the
store is an object, even a single value 2. Queries can refer to objects, but
the result of the retrieval is volatile, it will be consumed by some agent
(e.g. the print command) and after that the result no more exists. Some
database theoreticians claim that there is no need to distinguish objects and
references to objects, but such unification (assumed e.g. in the ODMG database standard)
causes a lot of ambiguities. It worth as many as the assertion that a person
named Smith and his name Smith are the same entities. The assumption that
queries may return objects is the consequence of the closure property which we
have totally rejected as conceptually invalid, internally inconsistent and
leading to next semantically doubtful concepts.
As we have said before, in the operational semantics we have to define a
machine that each syntactic rule associates with a semantic rule in the form of
actions of the machine. We define this machine as a recursive procedure eval
having a query q written in an abstract syntax as an argument. The
procedure eval evaluates the query q and returns the result.
Inside the procedure we have syntactic cases, and then, actions of the machine
for particular cases. A case presents just a semantic rule subordinated to the
corresponding syntactic rule. Below we present a skeleton of the procedure eval
in the form of self-explained pseudo-code.
procedure eval( q : query )
begin
parse ( q ); //the
parser recognizes top-level subqueries in the query q
case q is recognized as literal l :
// a simplest query consisting of a single literal, e.g. 2, "Smith",
etc.
push the value
denoted by l on the query result stack QRES; //the simplest semantic
rule
case q is recognized as name n : // a
simple query consisting of a single name n, e.g. Employee, Salary,
etc.
bind the name n on the
environment stack ENVS;
push the result of
the binding on QRES; // these two statements present simple semantic rule
case q is recognized as ∆ q1 : // the query q consists of
a unary algebraic operator ∆ applied to a subquery q1
eval( q1 );
apply the operator denoted by ∆ to the result returned by q1
;
push the result of
the application on QRES; // these three statements present a semantic rule for
a unary algebraic operator
case q is recognized as q1
∆
q2 : // the query q consists of
a binary algebraic operator ∆ joining subqueries q1
and q2
eval( q1 );
eval( q2 );
apply the operator denoted by ∆ to the results returned by q1
and q2;
push the result of
the application on QRES; // these four statements present a semantic rule for a
binary algebraic operator
case q is recognized as .... : //next
syntactic rule
.... //next semantic
rule
case q is recognized as .... : //next
syntactic rule
.... //next semantic
rule
.....
end
The core of
SBA and SBQL concerns semantic rules for queries involving operators that we
call non-algebraic. We present them later. In our terminology, operator where,
dot (projection, navigation), dependent (navigational) join, quantifiers, ordering
and transitive closures are all non-algebraic operators. Quite surprisingly,
their semantic rules are very similar and based on a simple, intuitive pattern.
In the relational model some of such operators (selection, projection, join)
are called algebraic. We have rejected this terminology as not fully
precise from the point of view of mathematics, because it too much confuses the
language and meta-language levels of mathematical description. Concerning full
generality, e.g. some constructs of SQL, algebraization of these operators is
possible, but the style must be fundamentally different than the style of the
relational algebra. Traditionally, in mathematics operators such as quantifiers
are non-algebraic (again, algebraization of them is possible, but through very
advanced notions, such as cylindrical algebras, very far from primitive notions
of the relational algebra). Because in SBA the semantics of quantifiers and the
semantics of selection, projection and join are very similar and introduce the
same notions, we unify all such operators under the name
"non-algebraic". This unification is especially important for object
oriented models, which are much more sophisticated and where simplistic
theories coming from the relational models have totally failed.
"Completeness" of
Query Languages
In general, the concept of
"completeness" of query languages remains undefined. The relational
fathers introduced so-called "relational completeness", but it was based
on the wrong assumption that relational query languages do not require more
than it is available in the relational algebra (and calculus). Even 2+2 is not
expressible in the algebra, hence this sort of "completeness" is
nothing more than a random, very poorly motivated point on the scale of the
universality of query languages. SQL (and SBQL) does much more than
"relationally complete" languages are able to do.
Another group of researchers tried
to promote so-called "Turing power", i.e. the concept of universality
based on the Turing machine. Datalog and OCL people are frequently proud that
Datalog and OCL have the power equivalent to the Turing machine. Unfortunately,
this is a false argument for anybody who knows mathematics. The Turing machine
was invented for totally different goals than data retrieval and manipulation.
It is quite easy to have the "Turing power" in a typical query and
programming languages. Enough to say that the language Basic has the Turing
power even if we remove from it 90% of its functionalities. Surely, Datalog
advocates would not be happy if they hear that Datalog is as powerful as 10% of
Basic. SQL has no Turing power, because it has no recursive capabilities.
Still, nobody cares about that. A lot of other languages, including SBQL, have
probably the Turing power, but formal proof of this fact is completely
irrelevant and not interesting from the practical point of view.
Everything that we can say on
universality of query languages is pragmatic completeness. A
query language should reasonably support everything that programmers
want and expect for the given application domain. The pragmatic completeness
depends on the current state-of-the-art concerning expectation of programmers,
thus by definition is a moving goal. If some feature of a data model or data
processing is not supported, the programmers will complain that this is a
disadvantage. Such complains can concern both declarations of data structures
and language that are intended to process these structures.
Pragmatic completeness requires also
proper qualities in the use of the language. If formulation of some queries
remain puzzles (c.f. SQL) then programmers will rarely or never use them in
practice. Note that programs are written once and then read 100 times. Any query
that is complicated and illegible decrease the maintainability of an
application. Hence proper syntax, clean semantics, orthogonality, simplicity,
following typical programmers habits, etc. can be considered as factors
influencing the completeness.
A query language is also
pragmatically incomplete if the performance of some queries is bad or below
expectation. Queries with bad performance will not be used by programmers,
hence are non-existent. They should not be taken into account when considering
the pragmatic completeness.
The only method to investigate
pragmatic completeness is to prepare a bunch of sample queries showing the
power of the language and compare this bunch to other languages. If there is an
efficient and usable query in another language that are impossible to express
as an efficient and usable query in our language, then our language is
incomplete.
Summing up, pragmatic completeness
is subjective, relative and dependent on a lot of factors that are difficult to
measure. In this sense no computer language is complete, because different
applications may require quite different options and nobody is able to predict
all expectations of future programmers.
Last modified: April 21, 2008