Principles of modern database
query and programming languages
by Kazimierz Subieta (December 2005)
Back to Description
of SBA and SBQL.
Majority of the principles presented below are
speculative ideals that stem from aesthetic feelings and considerations. We
have to strive to ideals but in real life it is impossible to achieve all of
them. Moreover, ideals can be contradictory to real-life critical factors such
as performance, implementation effort and development time. Many principles are
violated due to the bottom-up development, where initially designers think
about a small and easy language, but after success of the language they are
trying to extend it incrementally with all the functionalities that are
required by different applications and groups of users. For instance, we can
claim that strong type checking is a principle of programming languages, but if
some language was initially developed without this feature it would be
extremely difficult to introduce it incrementally without affecting thousands
of already existing programs and violating knowledge, habits and customs of
thousands of programmers. The bottom-up development syndrome affects many
commercial artifacts, including SQL, object-relational databases, PHP, Web
Services, and others. In such artifacts we can observe a lot of violated
principles, thus they are commonly perceived as non-aesthetic and sometimes
even chaotic. Nevertheless, they are useful, and this property should be
considered as a meta-principle dominating over all other principles. The
usability of a language is not in contradiction to the mentioned below
principles; indeed, we believe that eventually they support the usability.
Every combination of
language features that makes sense should be allowed (providing the given
combination does not violate typing constraints). Orthogonality
allows the designers to minimize language definition and to increase its
retrieval or computational power. Orthogonality
supports language implementation because shorter language definition implies
shorter implementation code and easier optimization. Orthogonal languages are
much easier to learn and use. Orthogonality may
become essential in the case when programs in the given language are generated
automatically as an output of some tools, e.g. CASE tools or graphical user
interfaces.
Avoiding big
syntactic and semantic patterns (cf. the SQL select...from...where...group by...having...order by....pattern),
avoiding implicit semantic dependencies between distant pieces of the code,
avoiding tangled structure of programs with no clean modules and their
hierarchy (cf. the goto
statements). Compositionality means that query/program pieces that have their
own semantic meaning should be as small as possible (for example, only literals
and names) and can be orthogonally composed into bigger and bigger units by a
set of operators, where each of them has a small number of arguments (ideally,
one or two arguments).
The use of
names and bindings in programming languages must be consistent. In particular, the
mechanism of bindings names used in declarations of data structures must
coincide with the mechanism of binding parameters of procedural abstractions;
the methods of binding are the same for declarations and parameters. This one-to-one correspondence
includes methods of parameter passing (call by value, call by reference etc.)
and time of binding (static or dynamic).
Each new feature A of
a query/programming language should be smoothly combined with all the already
existing features B, C, D...., Z. The new feature A
must be smoothly integrated with the typing system of the language, its binding
mechanisms, scoping rules, imperative constructs, programming abstractions,
database abstractions, etc. It must be consistently introduced into
documentation, user manuals and training courses. For instance, if the
designers of a language decide to introduce a new data structure, say, dynamic
arrays, the corresponding questions are: “How such an array will be typed?”,
“Can it be an element of a structure?”, “Can it be an element
of a collection?”, “Can contain collections as elements?”,
“Can be a parameter of a procedure?”, “Can be returned as a
procedure output?”, etc. Because a new feature have to be combined with
all the existing features, one can conclude that the complexity of semantics
and pragmatics grows in square to the number of features introduced in a given
query/programming language. Thus implementation and optimization efforts grow
probably more than in square to the number of features. Hence advices:
(1) Avoid redundant
features (the Occam’s razor);
(2) Make the power
through orthogonality;
(3) Reduce kinds of
introduced data structures to some minimal but still universal set;
(4) Generalize existing
features rather than introduce incrementally new and new features
independently.
Object-oriented principles
assumed by SBA
If a subclass B inherits from a superclass
A, then an
object b being an instantiation of B can be used in all places of a
query/program in which object a being
an instantiation of A can be used.
For instance, an object Student can
be used in all places when the object Person
can be used, because a class StudentClass inherits from the class PersonClass. Although this
principle (known also as LSP, Liskov Substitutability
Principle) seems to be obvious, it must be taken with care, at least for three
reasons. The first concerns updating, where straightforward application of this
principle leads to anomalies or (at least) different semantic interpretations.
The second case concerns parameter passing, where the requirement of strong
type checking leads to a long (and a bit academic) dilemma concerning
co-variance and contra-variance. The last, most difficult case concerns the
case when an object name is a property of its type (this does not occur in
programming languages, but is a strong rule in database schemata). For such a
case the object-oriented model based on substitutability makes little sense and
must be superseded by some more general model. Substitutability, together with
the open-close principle, is also contradictory to the
straightforward concept of collection, fundamental for databases. From these
three concepts you can take any two, but not the three together.
Each class can be
closed for modifications, but it is still open for specializations. The
principle is the basis for program reuse and method polymorphism. For instance,
we can buy within some compiled library a class Person, together with all the methods that are implemented for the
class. We have no possibility to change anything in this class, because its
source code is unavailable. Nevertheless, we can specialize
this class by classes Student, Employee, Customer, etc. Each specialization inherits the implementation that
is already done within the class Person
and augments the implementation by new methods, specific for the given
specialization. Moreover, each specialization can override some method implemented within the class Person by a specialized method
implemented within this specialized class and having the same name (plus
compatible parameters) as the name of the method within Person.
The open-close
principle must be taken with care, because it is contradictory to the substitutability principle and the straightforward
concept of collections of objects, fundamental for databases. From these three
concepts you can take any two, but not the three together.
Semantic Relativism
of Objects
If some object O1
can be defined, then object O2 having O1 as a component
can also be defined. There should be no limitations concerning the number of
hierarchy levels of objects. Objects on any hierarchy level should be treated
uniformly. In particular, an atomic object (having no sub-objects inside)
should be allowed as a regular data structure independent from other
structures. The relativism of data structures implies the relativism of
corresponding query capabilities, i.e. there should be no difference in
language concepts and constructs acting on different object hierarchy levels.
Traditionally, an object consists of attributes, an
attribute consists of sub-attributes, etc. Assuming the semantic relativism of
objects there is no need for such distinction: attributes, sub-attributes,
pointer links between objects, procedures, methods, views, etc. are objects
too. The semantic relativism of objects radically cuts the size of database
model, the size of specification of query languages addressing the model, the
size of implementation, and the size of documentation. It also supports easier
learning of both a database model and a corresponding query language. By minimizing
the number of concepts the semantic relativism of objects supports development
of a universal theory of query languages, which is necessary to reason about
query optimization methods.
Total internal
object identification
Each object which
could be separately retrieved, updated, inserted, deleted, authorized, indexed,
protected, locked, etc. should possess a unique internal identifier. The
identifier is not printable and the programmer never uses it explicitly. A
unique internal identifier should be assigned not only to objects on the top
level of their hierarchy, but to all sub-objects, including atomic ones. If
some atomic objects create a repeating group, e.g. a person has many hobbies, each object in the group should possess a unique
identifier. For persistent objects (i.e. database objects) their identifiers
should be persistent too, i.e. invariant during all the life of the objects.
We are not interested
in the structure and meaning of internal identifiers, e.g., they can be disc
addresses, RAM addresses, some symbolic internal names, some tuple identifiers (tid-s), pairs
<tid, attribute name>, triples <relation
name, primary key value, attribute name>, etc. For us it is essential that
all objects and all their sub-objects can be unambiguously identified through
its internal unique name. The principle makes it possible to make references
and pointers to all possible objects, thus to avoid conceptual problems with
binding, scoping, updating, deleting, parameter passing, and other
functionalities that require references as query primitives.
The total object
internal identification principle is not satisfied in value-oriented database
models, such as the relational model, models based on the mathematical logic (Datalog), functional database models, etc. This, however,
introduces a lot of limitations and inconsistencies. For instance, without
assuming that each tuple has a unique identifier it
is impossible to give the correct semantics of the SQL delete clause or SQL cursors. Even more, without identifiers of
values that are stored within a tuple it is
impossible to express the semantics of the SQL update clauses. Value-based database models are thus idealistic and
utopistic. More advanced functionalities (e.g.
updating) sooner or latter will require correcting them by introducing some
concept of internal identifiers.
Unfortunately, the
principle is also not satisfied by XML. However, some functionalities developed
for this technology require internal identification of XML sub-objects, thus
special methods and languages have appeared to repair this flaw, c.f. Xpath, Xlink and Xpointer. All such inventions would be unnecessary assuming
the total internal identification (by symbolic addresses) for parsed XML
objects. In this case all the functionalities assumed in Xpath,
Xlink, Xpointer, XQuery (and perhaps, next and next XML querying tools) can
be covered by a single, simple and much more powerful query language such as
SBQL.
Orthogonal persistence
Traditionally,
persistent data was stored in databases and they were collections. On the other
hand, popular programming languages refer to volatile data only (i.e.
programming variables) and such data are individual. (Some languages, such as
Pascal and Java, introduce volatile collections too, but with severe constraints
on their construction and use.) Nowadays programming languages become more and
more abstract, thus the border between different storage kinds is more and more
transparent. Moreover, some professionals anticipate that in few years
databases will be stored in main memory only; there are already such prototypes
and commercial solutions. The size of main memories is reaching dozens of
gigabytes, i.e. is comparable to the capacity of disks, but the access time to
main memory is millions time faster. In some solutions (e.g. CORBA-based middlewares) the programmer is in fact unaware if he/she
deals with persistent or volatile data. Lack of individual data stored in
databases causes that they must be artificially stored as (one element)
collections. This violates conceptual modeling and is error-prone. Lack of
collections in programming languages causes that the programmer is forced to
use lower-level mechanisms such as heaps, with obvious disadvantages (no stack
discipline, dangling pointers, memory leak, etc.). Summing up, nowadays there
is little justification for the assumption that persistent data are to be
collections, volatile data are to be individual and typing systems of
persistent and volatile data should be different.
The orthogonal
persistence principle requires that the persistence property is orthogonal to
kinds of data (or objects). In particular, a database can store individual
objects (not only collections) and the volatile main memory of an application
can contain collections of objects. Moreover, orthogonal persistence requires
that the type system is the same for persistent and volatile data/objects.
Looking at query languages that we would like to develop, there should also no
syntactic and semantic differences in access to persistent and volatile data.
Unfortunately,
orthogonal persistency is against the tradition in software industry, thus some
professionals attack it as impractical. In our opinion such attacks have
ideological rather technical foundations (mostly caused by attempts to defend
current commercial software products). From the technical point of view it is
difficult to imagine why orthogonal persistence would result in some critical
disadvantages.
Data independence
means that a database is designed and maintained independently of applications
that retrieve and manipulate the data. There are several levels of data
independence. Physical data independence
means that physical details of data organization and access are transparent for
the application programmer. The physical details are determined by the database
designers or even earlier, by the designers of a database management system.
Some physical details (e.g. indices supporting the access to data, special file
organization, special methods of performing operations, etc.) are under control
of a database administrator (DBA). DBA uses special administration module to
tune the database operation according to the actual demands of applications,
but still, nothing in applications has to be changed due to the tuning. Logical data independence means that DBA
is able to perform some operations on the database structure, for instance, add
new data kinds, add or remove some object or table attributes, change user
privileges, add and remove views, database procedures, triggers etc. without
unconscious influencing the applications. Conceptual
data independence means that DBA is able to change the structure of the
database conceptually without changing existing (legacy) applications, for
instance, through special wrappers, mediators, views, updatable views, do instead of triggers, and other means
that allow to change significantly the database schema and its organization,
perhaps with minor changes of applications. The last level of data independence
is referred to as schema evolution and conceptually is close to software change
management methods and the aspect-orientation in databases.
Data independence
means that a query language has no notions referring to physical details of
data organization such indices. Logical and conceptual data independence
requires that a query after compilation is still correct even in case of
significant logical or conceptual changes in the database. This implies that
the compiled query has no elements dependent on physical details, such as offsets
of attributes within objects or tuples. This usually
leads to the necessity of run-time (late, dynamic) binding of all names
occurring in a query.