©Copyright by Kazimierz Subieta.

Syntax, semantics and pragmatics of query languages

by Kazimierz Subieta

Back to Description of SBA and SBQL.

Each computer language can be characterized by three aspects, known as syntax, semantics and pragmatics. Developers of computer languages, especially from the computer industry, frequently confuse these aspects, usually assuming that syntax, plus informal explanation of semantics, plus some examples completely specify a proposed language. Such an approach is represented by the ODMG and SQL-99 standards, which specify syntax and intuitive explanations, give some examples, but essentially present no idea concerning formal semantics and (recursive) semantic interdependencies between various language constructs. This leaves a lot of room for different understanding of syntactic constructs, hence for sure will result in different, incompatible implementations. Thus, lack of formal semantics undermines the initial goals of the standards, such as portability of programs and interoperability between applications or program libraries made by independent vendors. Lack of formal semantics causes also problems with specification of a strong type checker (which must simulate the actual computations during the parse/compile time) and undermines query optimization, which have to be based on strong rules and easy reasoning concerning equivalence of different plans of actual computations for all possible database states.


Syntax

Syntax is determined by formal rules saying how to construct expressions of the language from the set of atomic tokens (known as alphabet). From the mathematical point of view the syntax of a language can be defined as a (usually infinite) set of all correct strings of tokens from the alphabet. Unfortunately, such definition of syntax (that can be found in popular textbooks e.g. on context-free grammars or regular expressions) is incomplete from the point of view of semantics. Semantic rules are usually associated with syntactic rules. It may happen that for the same set of correct strings there are two sets of syntactic rules A and B, but the set A is correct, while the set B is wrong. For instance, consider the SQL statement

select X from Y where Z

and two sets of rules A and B determining its syntax. However, the set A assumes implicitly the parentheses as in  select X from (Y where Z), while the set B assumes implicitly the parentheses as in  (select X from Y) where Z. Although both A and B produce the same set of SQL statements, A is correct, while B is wrong (inconsistent with the actual semantics of SQL).

Syntax is usually specified by rules of context-free grammars, perhaps with additional constraints (for instance, concerning typing). In SBA we take little attention to concrete syntax, leaving this issue for possible implementations. All our definitions will be based on abstract syntax. Perhaps, at some moment SBQL will require standardization, and at that moment the unified concrete SBQL syntax will be determined. In existing implementations of SBQL it is a bit different.


Semantics

Semantics determines the meaning of syntactic constructs, that is, the relationship between syntactic constructs and elements of some universe of meanings. In the popular understanding semantics addresses human minds and in this case it can be expressed in terms of a natural language. Such semantics we can see, for instance, in UML diagrams, whose syntactic constructs are explained through more or less understandable phrases in our everyday tongue or by relationships with other (also informal) constructs. Such semantics, however, is valueless for a machine, it is not enough precise, too ambiguous and contains a lot of unspecified or poorly specified details.

The machine requires precise formal semantics. The general definition of formal semantics is not as easy as the definition of syntax because it requires the formal definition of the mentioned universe of meanings and the definition of mappings of the syntax into the universe of meanings. Such a definition is also not univocal, as it depends on who or what is the addressee of the definition. In particular, the description of semantics can be different for compiler writers and for application programmers who are/will be the users of the language. In our explanation we take the point of view of compiler or interpreter designers rather than application programmers. Obviously, this point of view requires from us to be extremely precise in specification and sensitive to all, even smallest semantic details.

In particular, we can assume that the universe of the meanings is the set of all the sequences of instructions of the Java virtual machine (JVM). The definition would be a mapping of the set of all the language expressions into the set of sequences of instructions of JVM. The definition of semantics assumes that the meaning of JVM instructions is non-definable; it is given as an axiom. Such an approach assumes that the definition of the semantics is done by the designers of a compiler or interpreter of the given language. The definition would be, however, valueless for application programmers, who rarely understands the actions of the Java virtual machine. Moreover, every team that attempts to implement Java (or other language) defines its own semantics, incompatible with the semantics of other teams.

For these reasons the semantics must be defined in abstract terms and concepts that are to be easily understood by programmers and leave no room for different interpretations by designers of compilers or interpreters of the language. The abstract terms and concepts should be much more abstract than the level of the machine code (assembler) or instructions of a Java-like virtual machine. Such semantics should address two kinds of subjects: (1) system programmers, who precisely and formally map the semantic definition into actions of the machine; (2) application programmers who will use informally the language to make applications. Formal actions of the machine and informal understanding of the semantics by application programmers must coincide. Lack of the coincidence is referred to as semantic reef. It is a property of the language that most frequently causes application programmers errors due to improper informal understanding of formal language constructs. (SQL is famous also for well known semantic reefs, in particular, with the group by clause and with null values; the issue is discussed, e.g. in several papers by C.J.Date.)

In general, precision of semantics is so important for implementation and standardization of computer languages that the general rule can be formulated as an oxymoron:

A smallest semantic problem is the very big problem.

Even smallest ambiguities on single bits or minor details cause that such goals as portability and interoperability become unfeasible. Unfortunately, this fact is ignored by next and next proponents of „standards” of query languages; thus the low quality of the standards, difficulties with consistent and entire implementation, and a lot of incompatible implementations of the same specification.

The database theory has already proposed several theoretical frameworks that are claimed to be formal bases for definition of semantics of query languages; in particular, relational algebra (originally proposed by E.F.Codd in 1970), relational calculus and formal logic. There are also theories that are claimed to be semantic foundations for queries addressing more sophisticated database models, such as object-oriented and XML-oriented models, in particular, object algebras, an XML algebra, object calculi, variants of mathematical logic (e.g. F-logic), structural recursion, monoid calculus, etc. Our attitude to these proposals is definitely negative: in many cases we do not believe in their conceptual, mathematical and pragmatic soundness (c.f. object algebras) and in all cases they offer too narrow formal basis and are inadequate to represent a lot of phenomena that occur in database models and database query/programming languages. In another section we present a list of common flaws that undermine their mathematical, conceptual and pragmatic significance. Basically the flaws concern crippled conception, too narrow scope, inadequate mathematics and wishful thinking concerning their practical potential. Currently we believe that the Stack-Based Approach is one and the only paradigm that is acceptable as a formal basis for the description of query languages' semantics.

Our focus on formal semantics has apparently led us to mathematical methods. Unfortunately, we have no good message for mathematicians.

As a method of formal specification, mathematics has two fundamental disadvantages:

  • For real languages the full mathematical description of their semantics becomes too complex, because it must take into account all the concepts and properties of data structures (a database model) that are addressed by the language and all the language concepts, constructs and interdependencies that are welcomed by the users and designers. The complexity violates the current view on mathematical theories, which are to be based on a small number of very general concepts and aesthetic, elegant reasoning.
  • Fully formal mathematical specifications would become a tight corset that practically would disallow efficient reasoning on possible changes, mutations, variants and extensions concerning the properties of data structures addressed by the language and the language constructs. These changes, mutations, variants and extensions are made due to pragmatic considerations that involve human preferences, behavior, psychology and ergonomics, hence cannot be anticipated and formalized. There are myriads of permutations of these variants and extensions and each of them could violate the given mathematical specification (thus would require new and new specifications). Moreover, a lot of these permutations are unknown: they are the subject of future inventions. Mathematical specification would be thus the burden for the progress and evolution of the language.

Could we speak on formal language specification methods that are not mathematical? Mathematicians working in the computer science are trying to convince everybody that „formal” always means strong mathematical discipline. Fortunately, such claims are not justified and are not reasonable looking on the above fundamental disadvantages of mathematics as a formal specification method. They can be considered as attempts to make a false stereotype defending the (actually lost) position of mathematics in the computer science. The stereotype is not justified by common practice in other technical branches. For instance, the documentation of a building by an architect presents fully formal and precise specification that is sufficient to construct the building according to his/her intention. The specification uses a lot of drawings, plans, diagrams, texts and tables, but we see neither mathematical formulas nor theorems and proofs that the specification is mathematically „sound and correct”. Obviously, the specification allows for many kinds of reasoning concerning e.g. what is the optimal plan of the building construction. Similarly we can make the analogy with the car, electronic or other industries. In all these branches mathematics plays some part, but not the major one. Still, specifications of the artifacts produced by these industries are enough formal and precise for manufacturing processes and for inferences concerning properties of products.

In no way computer languages are different in this respect. The difference concerns only the fact that computer languages are relatively young, thus methods of formal specification are not as mature as in traditional technical branches. Computers and their software are constantly changing causing the necessity of new and new formal specification methods. As in other cases of formal specification of technical artifacts we can use mathematics in all the places when mathematical concepts can be helpful for understanding. These concepts, however, will not be used in the strictly mathematical sense; in fact, we will rely only on common intuitive understanding of simple mathematical notions such as number, set, function, relation, union of sets, set containment, etc. Obviously, within this approach we are unable and we will not strive to make any mathematical proof of theorems concerning e.g. query optimization. However, we will show that our formal model can lead us to deep inferences (concerning query optimization, in particular), which are based on precise understanding of introduced concepts and relationships between them rather than on mathematical reasoning. Eventually, in every case (including mathematical proofs) implementing and practical testing of inferences is the only credible and believable proof of their correctness and usability.

Early our approach to the semantics of query languages was based on the denotational method, where each syntactic construct was associated with some abstract mathematical concept, like a function. The method is based on defining such functions by sophisticated mathematical notions known as least fixed point equations. Despite big effort to promote this approach for different software specification areas, it was totally unsuccessful, in particular, for specification of query languages' semantics. In general, the denotational semantics is a great theory and we recommend it as beautiful exercise for everybody who is interested in top-level achievements of human intellects. Unfortunately, this way of specification was not understandable for typical designers of computer languages and deeply involves the two mentioned above fundamental disadvantages of mathematical specifications.

Currently we rely on another formal specification method known as operational semantics. The idea of the method is perhaps as old as the computer science. Many years ago it was formalized by E. Dijkstra and the A. van Wijngaarden group, but in SBA we do not refer to these old efforts. In operational semantics we have to define some machine and then, to specify the semantics of particular language constructs through operations of the machine (and through data structures that it involves). We are looking for a machine that is defined on a much more abstract level than e.g. JVM, but still is able to map formally and precisely all the language constructs. Our machine involves basic data structures that are necessary to specify the semantics of query/programming languages, that is:

·         Abstract data/object store;

·         Environment stack;

·         Query result stack.

The method appeared to be very successful for understanding of the semantics of query languages by many people. It is sensitive to any detail of a datamodel that we want to consider and to any operation that we would like to introduce into a query/programming language. The operational semantics presents an abstract implementation of a language that can be directly used to make the concrete implementation in our favorite programming language. The method is also very efficient from the point of view of query optimization, i.e. it allows for general and very deep inferences concerning how to construct an optimal query evaluation plan.


Pragmatics

Pragmatics of a language determines its function in interaction between humans or between a human and a machine. Pragmatics describes how to use the language in practical situations, what are the reasons for the use and what goals can be achieved. Pragmatics requires learning how to match expressions of the language to concrete real-life situations, what will be the response from the machine and how we have to interpret the response.

A computer language should be pragmatically efficient, i.e. the language must have the potential to accomplish some important practical goals.

A computer language that is pragmatically inefficient is not a serious computer language.

In particular, one can perfectly understand syntax and semantics of a programming language, but cannot use it in pragmatically efficient way to make some usable system (the case of many so-called "theoreticians of computer science"). Pragmatics cannot be formalized. It can be explained by showing some use cases, examples, patterns, anti-patterns, best practices, wrong practices, etc. Majority of the user textbooks and documentations of languages are devoted to their pragmatics. However, the only way to teach and learn pragmatics is to use the language for concrete practical situations. By analogy, we can explain in many ways how to drive cars; however, eventually the efficient teaching requires to go into the car and drive it through crowded streets.

Pragmatics of a computer language dominates over its syntax and semantics.

Pragmatics is the most important aspect of a language. Syntax and semantics are important, but only if serve the pragmatic goals of the language. The arguments in favor or against some syntactic or semantic constructs must refer to the pragmatics of the language. In particular, we reject all arguments that stem from some ideology (e.g. "the language must have sound mathematical basis"), analogy (e.g. "the language syntax must be similar to SQL") or silly associations (e.g. "a query language for XML must have XML syntax").

In the commercial literature and documentation pragmatics is frequently confused with semantics. A typical manual presents a business intention, e.g. "Get names and salaries of employees working as programmers", and then presents a corresponding query accomplishing the intention. Such an approach to semantics is presented e.g. in the ODMG standard. Pragmatics, however, is not good as a method of specification of semantics, because it is able to present by examples some isolated semantic islands. The complete formal picture of semantics and (recursive) interdependencies between different notions and language constructs require systematic specification method, independent from examples of the use.

In the SBA and SBQL description we frequently refer to pragmatics, nevertheless our ultimate goal is the formal description of query languages' semantics.


Data model and database schema as inherent pragmatic components of a query language

Due to the data independence principle, the pragmatics of a query language must be extended outside the languaage itself. The necessary condition of pragmatic efficiency of queries is that the programmer fully understands what the database contains and how it is organized. Because the concrete state of the database is unknown for the programmer, he/she must recognize it on some abstract level, through a database schema and business-oriented description (called ontology) that determine the business meaning of the data and services stored in the database.

To understand operations of a query language the programmer first of all must understand the database model on the level of algorithmic precision (which is required for efficient programming). Then, he/she must understand the database schema and data structures that are implied by the schema, also with algorithmic precision.

Both a data model and a concrete database schema are inevitable pragmatic parts of a query language.

In case of the relational model the situation is simple, because the model (in a pure version) introduces only named tables with named columns. Each value stored on the cross of a tuple and a column is atomic. However, simplicity of the data model and data structures that it involves is at the cost of the complexity of conceptual modeling of applications, the complexity of queries and the complexity of nesting queries within programs written in popular programming languages. Currently, the conceptual modeling tools are object-oriented, as a rule (c.f. UML). To reduce the gap between the business model and data storage model, there is a need for object-oriented database models and management systems (although the need is questioned by the commercial vendors of relational systems, for the obvious commercial reasons).

Object-oriented models introduce many useful notions that support conceptual modeling, such as complex objects with no limitations concerning size and object hierarchy levels, associations among objects, classes and inheritance among classes, behavior stored inside classes, polymorphism, and others. While for conceptual modeling these notions are (more or less) precise, attempts to map them into data structures are not trivial and not univocal. The same concepts can be mapped on many, many ways. For instance, UML class diagrams are quite understandable as a way of human thinking on some business environment. However, they are far from being precise when we try to map them to data structures. How to map inheritance and multiple inheritance? How to map associations, aggregations, qualified associations, association roles, methods, polymorphism, etc.? There is no a single answer for these questions, mapping UML class diagrams to object-oriented data structures having the property of algorithmic precision is a non-trivial research task.

The task has been solved in CORBA by IDL and mapping IDL to declarations of data structures in popular programming languages. The same is done with CORBA objects that are mapped to objects (or other structures) of the programming languages. In this way the algorithmic precision is assured. However, such an approach has disadvantages. First of all, it forces low-level programming model, in comparison to programming through query languages. The CORBA object model is also far incomplete from the point of view of modeling of business applications: it has no collections, associations, a query language and many other features (which are introduced on the level of CORBA services, with a lot of limitations and too complex notions).

The disadvantages of CORBA caused the next standardization effort known as the ODMG standard. In this framework a lot of notions that are necessary for conceptual modeling of object database applications were introduced. The standard proposes the database schema language ODL and the query language OQL. Unfortunately, the specification of object-oriented model and corresponding schema and query languages is very far from being algorithmically precise. The standard presents also a lot of other flaws, thus we consider it as unsatisfactory for our purposes.

In an object-oriented data model and a database schema we have to reconcile several concepts and demand of different agents. Object-oriented models introduce concepts such as complex objects, object identity, classes, types, interfaces, associations, inheritance and multi-inheritance, dynamic object roles, and so on. The must present some consistent and universal whole. The model and a database schema should be clear for designers of a database, which use the schema language for determining the database structure. It should also be clear for application programmers, who have to understand the schema and objects induced by the schema with algorithmic precision. The schema must also be clear for designers of database query engine, who have to develop algorithms enabling storing, maintaining, accessing and checking data in the database, algorithms for strong type checking of queries and programs based on queries, and perhaps other algorithms. Unfortunately, understanding of object-oriented notions is very different and depends on a school, some ideology, theory or a concrete programming tool.

Many designers of database models and their schema and query languages start the job from the definition of the concept of type. (c.f. the CORBA and ODMG standards, or XML technologies). Types are main components of a database schema. However, the concept of type is not obvious. Many languages have no types (e.g. Smalltalk) and nevertheless are useful. However, we advocate the view that any query/programming language should be supported by strong type checking. Types are easy when they address human minds and support conceptual modeling only. However, a type system for strong static query and program checking presents a non-trivial research issue. So far there is practically no good pattern that would be sufficiently consistent and universal. There are type theories coming from functional languages such as SML and Haskell; for our purposes these theories (although mathematical) are too restrictive. Types proposed by ODMG are inconsistent, thus non-implementable. Types introduced in the XML world (DTD, XML Schema and RDF Schema) are too limited for our purposes. After implementing a type system for SBQL we have concluded that type systems for database query/programming languages are nowadays immature and must be revisited. To have precise motivation for our type system we will introduce it at the end of our semantic considerations, when we introduce all the data structures that we want to address (on the level of algorithmic precision) and define all query operators that we want to involve into a query language.

Because in examples we need some schema language, till introducing it formally we use informal notation similar to UML. We correct it according to our needs. Below we present a simple self-explained schema built according this notation. The left part of Fig.1 shows the database schema in an UML-like notation. The right part shows example database structures complying with the schema. i10, i11, i12,..., i64 denote object identifiers. Arrows denote pointers (abstract implementation of the association between Emp and Dept objects).

Fig.1. A database schema and database structures complying to the schema


Abstract syntax and syntax-driven semantics

Old discussions on the syntax of programming languages were the subject of jokes, which present it as an issue for fools (c.f. the David Harel's shortest paper "do considered od" considered odder than "do considered ob"). From that time the syntax arouses some disrespect among professionals, who coined the term „syntactic sugar” to denote meaningless tokens of the language that can be arbitrarily taken by the designers. Nowadays a lot of similar jokes can be invented looking at assertions produced by the fans of XML. XML is a syntactic convention, nothing more, it has no semantics.

Indeed, some discussions on the syntax are meaningless. However, it is not true that all such discussions are meaningless. Syntax is important for conceptual modeling (a pragmatic part of the language). Due to associations with tokens or phrases of the natural language, the meaning and the use of formal statements is easier to understand. For instance, in the SQL query:

select Name, Salary from Employee where Job = 'programmer'

the keywords select, from and where and characters "," and "=" are syntactic sugar. From the point of view of formal semantics the sugar is not essential, e.g. the same statement can be written in some hypothetical grammar as:

(Employee%Job:"programmer")/Name++Salary

Due to the sugar, the SQL statement is easier to understand and use. Too much sugar, however, combined recursively by various language constructs, may lead to clumsy queries. For instance, in SQL-92 there is possibility to insert select statements inside select, from and where clauses. In effect we can obtain constructs like

select ... from (select ... from ... where x = (select y from ... where ...)) where ...

Understanding of such nested statements can be problematic. In SQL too much sugar (plus non-orthogonality, plus lack of some operators, plus some other syntactic and semantic problems) causes that some statements are quite illegible; this is known as "SQL puzzles".

Syntax is also important as a basis for definition of semantics. Semantic definitions are independent on the syntactic sugar, hence it can be removed (or more precisely, reduced to some minimal set of abstract tokens). The syntax without sugar is referred to as abstract syntax. The syntax with all the sugar is referred to as concrete syntax. For instance, the SQL query having the concrete syntax:

select Name, DeptName from Employee, Dept where some_predicate

in abstract syntax can be written as:

((Employee × Dept) σ some_predicate) π (Name ° DeptName)

where symbols ×, σ, π and ° denote some operators (Cartesian product, selection, projection and tuple composition, respectively). In abstract syntax we isolate all operators and syntactic constructs having independent semantic meaning (as in the above SQL example), in the right order (possibly forced by parentheses), even if some operators are not explicitly seen from the concrete syntax. In the abstract syntax it is also necessary to resolve some syntactic ambiguities that may occur in the concrete syntax. For instance, in the above SQL example commas have different meaning, depending on the context (tuple composition and Cartesian product, respectively). In our abstract syntax we map each comma to a proper abstract operator (° and ×). Operators can be written in prefix, infix and postfix style; this has no meaning for semantics. The above abstract syntax presents the infix style. In the postfix style (a.k.a. the Polish notation) it can be written as:

Employee Dept × some_predicate σ Name DeptName ° π

In our presentation we prefer the infix style, which is traditional in the elementary mathematics thus more legible for the readers. The syntax will be based on context-free grammars, but for our current goals this is a secondary issue. Usually the abstract syntax is an internal part of the language definition and frequently takes the form of a special data structure known as a syntactic tree of a language expression.

In almost all computer languages semantics is syntax-driven. First, the designers of a language define its syntactic rules. Then, a semantic rule is associated with each syntactic rule. Symbols denoting operators are defined through corresponding mathematical concepts and/or some routines. Names occurring in language expressions are bound to some compile time or run time entities. Semantically, the binding means that the name fires some search in the computer or application environment for an entity (or entities) having this name. In this method syntactic rules are not arbitrary, but must reflect corresponding semantic rules. Syntactic rules are usually presented as productions of context-free grammars. Because syntactic rules are usually recursive, semantic rules must be recursive too. This method of definition we use in the description of the SBQL semantics.

The definition of semantics is much easier if the syntax is abstract rather than concrete. For query languages the most optimal case from the point of view of semantic definition is when each syntactic rule involves one operator and zero, one or at most two arguments. This allows for the most compact and most orthogonal definition of semantics (thus short implementation and much easier and more general optimization methods). SBQL is the only known query language that follows this advice. SQL and OQL present completely different syntactic style, based on very big syntactic constructs with far context dependencies (e.g. dependency between the group by clause and select clause). The style has severe disadvantages; we do not follow it.


General assumptions of the semantics of query languages

We take the assumption that the semantics of any query is a (mathematical) function that maps the set of all states (basically, the database states, but not only) into the set of all query results. That is, for a given query each state is mapped into some result. More formally, let assume that Query is a set of syntactically correct queries of a given query language, State be the set of all states and Result be the set of all possible query results. Then, the semantics |q| of any query q Query is a function mapping State into Result:

|q|: State Result

Such view on the semantics is correct for queries having purely retrieval capabilities. If a query has side-effects, i.e. it can change a state, then we must assume that each query determines a function mapping a state into a result and a new state, i.e.:

|q|: State (Result × State)

Such queries may e.g. invoke a method, which makes some changes on computer, database or/and application environment.

Usually we assume the terminology in which a query always returns a result. SQL, however, has introduced queries that do not return a result, but make operations on a database. They are known as update, insert and delete clauses. According to the tradition of programming languages we will call such constructs imperative statements or instructions, avoiding the term query. If p is an imperative statement, than the semantics |p| of p can be described as a function mapping a state into a state:

|p|: State State

The sets State and Result are usually different, although (as will be shown later) they are built from the same primitives. The relational model professionals worked out a stereotype known as the closure property, which can be formulated as follows: an element of State is a set of relations and an element of Result is a relation. The closure property has been considered as a necessary condition for nesting queries. This inference is based on false reasoning. After importing the closure property to an object-oriented model one can say that an element of State is a set of objects and an element of Result is a set of objects too. However, we do not buy such assertions. For object-oriented models that we intend to deal with we conclude that:

The closure property is nonsense.

In particular, it leads to subdividing queries into object-preserving and object-generating, which is nonsense too (stemming from the previous nonsense and inappropriate understanding of object-oriented concepts). Moreover, for relational systems, in particular for SQL, the closure property is nonsense too. SQL is not dealing with mathematical relations, but with tables. Tables stored in the relational database have fundamentally different properties in comparison to tables returned by SQL queries. For instance, the first ones can be updated (are mutable) while the second ones cannot (are immutable). On some abstraction level we can describe them as mathematical relations (this is actually done in all the theories devoted to the relational model), but in this way a lot of semantic properties and constructs of SQL are not expressible formally (for instance, SQL updating clauses). Hence we will not follow such doubtful assertions and will define the sets State and Result according to our own perception of the issue and according to 40-years tradition of programming languages. (More critique of the closure property you can find here.)

Now it becomes more and more clear what we have to do to define the formal semantics of a query language. We have to define formally:

·         Set Query determining the abstract syntax of queries. We will not strive to define all constructs of Query, showing only the basic constructs. Other constructs can be easily added by analogy with already defined ones. Query will be defined by context-free rules; each of them will be associated with a semantic rule.

·         Set State determining stored data structures that are to be queried. The data structures depend on some initial ideological assumptions and constraints that the designers of a database system (or some other data repository) want to promote. In databases such ideological assumptions and constraints are referred to as data model or database model. Because there are a lot of data models and a lot of peculiarities of them, the essential question concerns a proper choice of features that are able to cover majority of them. We show that it will be possible by relatively few (currently 4) store models, and all of them  have some core foundation.

·         Set Result determining derived data structures that can be returned by queries in the result of their execution. To some extent, this set depends on the assumed data model too. We will define it in a quite general fashion.

·         A semantic rule associated with every syntactic rule of our abstract syntax. According to the operational semantics (abstract implementation) each semantic rule will be represented by actions of some abstract machine that for the given syntactic rule performs the mapping of a state into a result.

Our approach to the formal semantics presented in the above steps will be extended to imperative constructs and to abstractions such as procedures, function and methods. We will show that the approach is an efficient theory, which is able to explain all linguistic phenomena that occur in data models and corresponding query/programming languages. This is not possible with the current theories, such as relational algebra, relational calculus, formal logic, which are able to cover only a very small subset of these phenomena and are very hard to extend to more advanced database models, such as object-oriented and XML-oriented models. The theory explains construction of query/programming languages in highly intuitive terms, allowing for efficient designing of new query/programming languages for various data models and fast direct implementation. We also intend to show that the theory is very efficient concerning reasoning on query optimization methods.


Compositionality of queries

As we have noted before, because syntactic rules are recursive, semantic rules expressed by the operational machine must be recursive too. To this end we need to follow a programming language principle known as compositionality. The principle is the basis for recursive definitions and implementations of practically all programming languages. In short, the principle requires that the semantics of a compound statement is a function of semantics of its components. For instance, if we have a compound query q = q1 θ q2, where q, q1, q2 are queries and θ is an operator, then the semantics |q| is the result of some function funθ having |q1| and |q2| as arguments: |q| = funθ (|q1|, |q2|). Function funθ depends on the operator θ. The compositionality property allows for orthogonal combination of operators and recursive nesting of queries. According to this principle, the designer has to determine the semantics of atomic queries, i.e. queries having no sub-queries. Then, the semantics of complex queries is build recursively from the semantics of their components. To exploit the full potential of this principle, the syntactic rules of the query language should be as short as possible, i.e. the designers should avoid big syntactic patterns and far context semantic dependencies. The principle will be applied to all constructs of our abstract syntax, including queries, imperative statements, programming abstractions and perhaps other constructs.


What is State?

Database professionals usually assume (more or less explicitly) that in the context of query languages the concept of state is equivalent to database state. Moreover, it is assumed that the database state purely reflects our favorite database model (e.g. the relational model), i.e. a state is a composition of data structures that are allowed in the model (e.g. relational tables). Our view on the concept of state will be much more extended. Generally, we assume that:

A state involves all run-time entities that can influence the results returned by queries.

A state involves all run-time entities whose names can occur in queries. In particular, a database state may include not only passive database objects, but also stored procedures, views, etc., because their names can be used in queries. In object-oriented databases the concept of state must include classes and methods stored within classes. In our approach we follow the orthogonal persistence principle, which implies, in particular, that a state involves persistent and volatile data on equal rights. A state includes a metabase, i.e. data structures that store database schema in a structural form. Entities stored in the metabase can be explicitly or implicitly referred from queries, e.g. to accomplish the method of generic programming known as reflection. Queries may also refer to a local state of the application that the queries are nested in, e.g. to programming language variables. Hence local state of an application is also a component of the state. Queries can refer to some computer environment information, such as date, time, version of operating system, amount of free memory, etc.; hence they also augment the concept of state. Queries can also refer to some libraries (i.e. classes, procedures) that are bound to an application, thus libraries (plus states of their own variables) constitute another part of the state concept.

Query languages require also augmenting the state concept by internal data structures that store the state of query evaluation. Consider the SQL query ("get employees earning more than their boss"):

select * from Emp e

where e.salary > (select b.salary from Dept d, Emp b

                             where e.dept# = d.dept# and d.boss = b.emp# )

The query contains the subquery:

select b.salary from Dept d, Emp b where e.dept# = d.dept# and d.boss = b.emp#

having e as a free correlation variable. When the query processor evaluates this subquery, it meets e and then must find some data structure storing its value. This structure is internal to the query, it will disappear after the entire query will be evaluated. However, during evaluation of the subquery this temporal structure augments the concept of a state.

In object-oriented query languages there are more such situations when for evaluation of queries the query engine must create some data structures storing a temporary state of query evaluation. Some formal discipline concerning such temporal structures is necessary. Not surprisingly, the situation is very well recognized in programming languages and leads to the well-known concept referred to as environment stack or call stack. We prefer to use environment (environmental) stack, because the stack that we would like to consider will be used not only for calling functions, procedures and methods, but also for resolving some query operators (called non-algebraic). The environment stack concept is central for our approach to query languages, thus the Stack-Based Approach (SBA).

Some professionals from the database community used to claim that the stack should not be introduced on the level of formal semantics, because it is a purely implementation notion. We strongly disagree - the definitions of some query operators cannot be formally correct without the concept. The stack is a data structure that is internally manages by the query engine. Within the description of SBA and SBQL we deal with it on the abstract rather than on the physical level. Implementation of the stack can be of course very different, especially after optimizations. The environment stack (denoted here ENVS) introduces necessary discipline for all temporary data structures that are necessary during evaluation of queries. It is a client-side structure (i.e. it is the property of a client application rather than the property of a database server) and usually it is stored and managed in the main memory.

In SBA we introduce another stack, called query result stack (abbreviated as QRES). This stack has different role in comparison to ENVS. Although it is also a temporary data structure maintained by the query engine, we do not involve it into the concept of state. The reason is obvious: semantics of queries does not depend on the state of this stack. In the operational method that we have assumed it must be introduced explicitly, but in an equivalent denotational method this stack is unnecessary - the semantics can be explained without it. However, all methods of description of semantics of queries must refer to ENVS. Some professionals of relational databases may claim that such a stack is unnecessary for relational languages, in particular, for SQL, because they can be explained through the relational algebra or relational calculus. Please don't believe them, please reject all such simplistic relational ideas. The mentioned theories are far too primitive to explain all constructs of SQL. The environment stack in some form must also occur within the SQL engine. This will be clear after we explain the semantics of basic constructs of SBQL and show examples much remaining SQL.

In general, all entities that are external to the query engine and can influence the results returned by queries we call object store. At some abstraction level we can make no distinction between different concepts of the store, e.g. client, server, computer environment, application environment, persistent objects, database, volatile objects, etc. All internal entities that are to be introduced by the query engine in order to evaluate queries will be collected at the environment stack. In conclusion:

The state concept consists of two components: the state of an object store and the state of an environment stack.


What is Result?

A query result concept is not as difficult as the concept of state. Because our queries cover also programming expressions, then almost any value that can be stored in the object store or can be computed from other values can be the result of a query. For instance, the query 2+2 returns 4. (The result is no surprise, but such a query can be shocking for lovers of relational theories and SQL.) However, not all values that are stored in the object store can be returned by queries. This concerns multimedia values, such e.g. as a DVD file having some gigabytes. In such cases a query should return a reference to such a value (e.g. a file name) rather than a value. In general, because according to the principle of total internal identification every object has an identifier, a query can return an identifier of an object; such a result we traditionally call reference. Reference is another fundamental concept of SBA. Queries can also introduce more complex data structures which include values, references, names, structure constructors and collection constructors. The exact definition concerning the set Result will be given later, when we will start to describe SBQL.

In the stack-based approach queries never return objects, but references to objects, perhaps within some complex data structures.

Such an assumption is the tradition in programming languages, whose expressions never return programming variables, but references to variables. Objects are data structures that are stored within the object store. Similarly to Smalltalk, we use the terminology that everything that can be stored in the store is an object, even a single value 2. Queries can refer to objects, but the result of the retrieval is volatile, it will be consumed by some agent (e.g. the print command) and after that the result no more exists. Some database theoreticians claim that there is no need to distinguish objects and references to objects, but such unification (assumed e.g. in the ODMG database standard) causes a lot of ambiguities. It worth as many as the assertion that a person named Smith and his name Smith are the same entities. The assumption that queries may return objects is the consequence of the closure property which we have totally rejected as conceptually invalid, internally inconsistent and leading to next semantically doubtful concepts.


What is a semantic rule?

As we have said before, in the operational semantics we have to define a machine that each syntactic rule associates with a semantic rule in the form of actions of the machine. We define this machine as a recursive procedure eval having a query q written in an abstract syntax as an argument. The procedure eval evaluates the query q and returns the result. Inside the procedure we have syntactic cases, and then, actions of the machine for particular cases. A case presents just a semantic rule subordinated to the corresponding syntactic rule. Below we present a skeleton of the procedure eval in the form of self-explained pseudo-code.

 

procedure eval( q : query )

begin

parse ( q );   //the parser recognizes top-level subqueries in the query q

case q is recognized as literal l : // a simplest query consisting of a single literal, e.g. 2, "Smith", etc.

push the value denoted by l on the query result stack QRES; //the simplest semantic rule

case q is recognized as name n : // a simple query consisting of a single name n, e.g. Employee, Salary, etc.

bind the name n on the environment stack ENVS;

push the result of the binding on QRES; // these two statements present simple semantic rule

case q is recognized as q1 : // the query q consists of a unary algebraic operator applied to a subquery q1

eval( q1 );

apply the operator denoted by to the result returned by q1 ;

push the result of the application on QRES; // these three statements present a semantic rule for a unary algebraic operator

case q is recognized as q1 q2 : // the query q consists of a binary algebraic operator joining subqueries q1 and q2

eval( q1 );

eval( q2 );

apply the operator denoted by to the results returned by q1 and q2;

push the result of the application on QRES; // these four statements present a semantic rule for a binary algebraic operator

case q is recognized as .... : //next syntactic rule

.... //next semantic rule

case q is recognized as .... : //next syntactic rule

.... //next semantic rule

.....

end

 

The core of SBA and SBQL concerns semantic rules for queries involving operators that we call non-algebraic. We present them later. In our terminology, operator where, dot (projection, navigation), dependent (navigational) join, quantifiers, ordering and transitive closures are all non-algebraic operators. Quite surprisingly, their semantic rules are very similar and based on a simple, intuitive pattern. In the relational model some of such operators (selection, projection, join) are called algebraic. We have rejected this terminology as not fully precise from the point of view of mathematics, because it too much confuses the language and meta-language levels of mathematical description. Concerning full generality, e.g. some constructs of SQL, algebraization of these operators is possible, but the style must be fundamentally different than the style of the relational algebra. Traditionally, in mathematics operators such as quantifiers are non-algebraic (again, algebraization of them is possible, but through very advanced notions, such as cylindrical algebras, very far from primitive notions of the relational algebra). Because in SBA the semantics of quantifiers and the semantics of selection, projection and join are very similar and introduce the same notions, we unify all such operators under the name "non-algebraic". This unification is especially important for object oriented models, which are much more sophisticated and where simplistic theories coming from the relational models have totally failed.


"Completeness" of Query Languages

In general, the concept of "completeness" of query languages remains undefined. The relational fathers introduced so-called "relational completeness", but it was based on the wrong assumption that relational query languages do not require more than it is available in the relational algebra (and calculus). Even 2+2 is not expressible in the algebra, hence this sort of "completeness" is nothing more than a random, very poorly motivated point on the scale of the universality of query languages. SQL (and SBQL) does much more than "relationally complete" languages are able to do.

Another group of researchers tried to promote so-called "Turing power", i.e. the concept of universality based on the Turing machine. Datalog and OCL people are frequently proud that Datalog and OCL have the power equivalent to the Turing machine. Unfortunately, this is a false argument for anybody who knows mathematics. The Turing machine was invented for totally different goals than data retrieval and manipulation. It is quite easy to have the "Turing power" in a typical query and programming languages. Enough to say that the language Basic has the Turing power even if we remove from it 90% of its functionalities. Surely, Datalog advocates would not be happy if they hear that Datalog is as powerful as 10% of Basic. SQL has no Turing power, because it has no recursive capabilities. Still, nobody cares about that. A lot of other languages, including SBQL, have probably the Turing power, but formal proof of this fact is completely irrelevant and not interesting from the practical point of view.

Everything that we can say on universality of query languages is pragmatic completeness.  A  query language should reasonably support everything that programmers want and expect for the given application domain. The pragmatic completeness depends on the current state-of-the-art concerning expectation of programmers, thus by definition is a moving goal. If some feature of a data model or data processing is not supported, the programmers will complain that this is a disadvantage. Such complains can concern both declarations of data structures and language that are intended to process these structures.

Pragmatic completeness requires also proper qualities in the use of the language. If formulation of some queries remain puzzles (c.f. SQL) then programmers will rarely or never use them in practice. Note that programs are written once and then read 100 times. Any query that is complicated and illegible decrease the maintainability of an application. Hence proper syntax, clean semantics, orthogonality, simplicity, following typical programmers habits, etc. can be considered as factors influencing the completeness.

A query language is also pragmatically incomplete if the performance of some queries is bad or below expectation. Queries with bad performance will not be used by programmers, hence are non-existent. They should not be taken into account when considering the pragmatic completeness.

The only method to investigate pragmatic completeness is to prepare a bunch of sample queries showing the power of the language and compare this bunch to other languages. If there is an efficient and usable query in another language that are impossible to express as an efficient and usable query in our language, then our language is incomplete.

Summing up, pragmatic completeness is subjective, relative and dependent on a lot of factors that are difficult to measure. In this sense no computer language is complete, because different applications may require quite different options and nobody is able to predict all expectations of future programmers.


Last modified: April 21, 2008