Principles of modern database query and programming languages
by Kazimierz Subieta (December 2005)
Back to Description of SBA and SBQL.
Majority of the principles presented below are speculative ideals that stem from aesthetic feelings and considerations. We have to strive to ideals but in real life it is impossible to achieve all of them. Moreover, ideals can be contradictory to real-life critical factors such as performance, implementation effort and development time. Many principles are violated due to the bottom-up development, where initially designers think about a small and easy language, but after success of the language they are trying to extend it incrementally with all the functionalities that are required by different applications and groups of users. For instance, we can claim that strong type checking is a principle of programming languages, but if some language was initially developed without this feature it would be extremely difficult to introduce it incrementally without affecting thousands of already existing programs and violating knowledge, habits and customs of thousands of programmers. The bottom-up development syndrome affects many commercial artifacts, including SQL, object-relational databases, PHP, Web Services, and others. In such artifacts we can observe a lot of violated principles, thus they are commonly perceived as non-aesthetic and sometimes even chaotic. Nevertheless, they are useful, and this property should be considered as a meta-principle dominating over all other principles. The usability of a language is not in contradiction to the mentioned below principles; indeed, we believe that eventually they support the usability.
Every combination of language features that makes sense should be allowed (providing the given combination does not violate typing constraints). Orthogonality allows the designers to minimize language definition and to increase its retrieval or computational power. Orthogonality supports language implementation because shorter language definition implies shorter implementation code and easier optimization. Orthogonal languages are much easier to learn and use. Orthogonality may become essential in the case when programs in the given language are generated automatically as an output of some tools, e.g. CASE tools or graphical user interfaces.
Avoiding big syntactic and semantic patterns (cf. the SQL select...from...where...group by...having...order by....pattern), avoiding implicit semantic dependencies between distant pieces of the code, avoiding tangled structure of programs with no clean modules and their hierarchy (cf. the goto statements). Compositionality means that query/program pieces that have their own semantic meaning should be as small as possible (for example, only literals and names) and can be orthogonally composed into bigger and bigger units by a set of operators, where each of them has a small number of arguments (ideally, one or two arguments).
The use of names and bindings in programming languages must be consistent. In particular, the mechanism of bindings names used in declarations of data structures must coincide with the mechanism of binding parameters of procedural abstractions; the methods of binding are the same for declarations and parameters. This one-to-one correspondence includes methods of parameter passing (call by value, call by reference etc.) and time of binding (static or dynamic).
Each new feature A of a query/programming language should be smoothly combined with all the already existing features B, C, D...., Z. The new feature A must be smoothly integrated with the typing system of the language, its binding mechanisms, scoping rules, imperative constructs, programming abstractions, database abstractions, etc. It must be consistently introduced into documentation, user manuals and training courses. For instance, if the designers of a language decide to introduce a new data structure, say, dynamic arrays, the corresponding questions are: “How such an array will be typed?”, “Can it be an element of a structure?”, “Can it be an element of a collection?”, “Can contain collections as elements?”, “Can be a parameter of a procedure?”, “Can be returned as a procedure output?”, etc. Because a new feature have to be combined with all the existing features, one can conclude that the complexity of semantics and pragmatics grows in square to the number of features introduced in a given query/programming language. Thus implementation and optimization efforts grow probably more than in square to the number of features. Hence advices:
(1) Avoid redundant features (the Occam’s razor);
(2) Make the power through orthogonality;
(3) Reduce kinds of introduced data structures to some minimal but still universal set;
(4) Generalize existing features rather than introduce incrementally new and new features independently.
Object-oriented principles assumed by SBA
If a subclass B inherits from a superclass A, then an object b being an instantiation of B can be used in all places of a query/program in which object a being an instantiation of A can be used. For instance, an object Student can be used in all places when the object Person can be used, because a class StudentClass inherits from the class PersonClass. Although this principle (known also as LSP, Liskov Substitutability Principle) seems to be obvious, it must be taken with care, at least for three reasons. The first concerns updating, where straightforward application of this principle leads to anomalies or (at least) different semantic interpretations. The second case concerns parameter passing, where the requirement of strong type checking leads to a long (and a bit academic) dilemma concerning co-variance and contra-variance. The last, most difficult case concerns the case when an object name is a property of its type (this does not occur in programming languages, but is a strong rule in database schemata). For such a case the object-oriented model based on substitutability makes little sense and must be superseded by some more general model. Substitutability, together with the open-close principle, is also contradictory to the straightforward concept of collection, fundamental for databases. From these three concepts you can take any two, but not the three together.
Each class can be closed for modifications, but it is still open for specializations. The principle is the basis for program reuse and method polymorphism. For instance, we can buy within some compiled library a class Person, together with all the methods that are implemented for the class. We have no possibility to change anything in this class, because its source code is unavailable. Nevertheless, we can specialize this class by classes Student, Employee, Customer, etc. Each specialization inherits the implementation that is already done within the class Person and augments the implementation by new methods, specific for the given specialization. Moreover, each specialization can override some method implemented within the class Person by a specialized method implemented within this specialized class and having the same name (plus compatible parameters) as the name of the method within Person.
The open-close principle must be taken with care, because it is contradictory to the substitutability principle and the straightforward concept of collections of objects, fundamental for databases. From these three concepts you can take any two, but not the three together.
If some object O1 can be defined, then object O2 having O1 as a component can also be defined. There should be no limitations concerning the number of hierarchy levels of objects. Objects on any hierarchy level should be treated uniformly. In particular, an atomic object (having no sub-objects inside) should be allowed as a regular data structure independent from other structures. The relativism of data structures implies the relativism of corresponding query capabilities, i.e. there should be no difference in language concepts and constructs acting on different object hierarchy levels. Traditionally, an object consists of attributes, an attribute consists of sub-attributes, etc. Assuming the semantic relativism of objects there is no need for such distinction: attributes, sub-attributes, pointer links between objects, procedures, methods, views, etc. are objects too. The semantic relativism of objects radically cuts the size of database model, the size of specification of query languages addressing the model, the size of implementation, and the size of documentation. It also supports easier learning of both a database model and a corresponding query language. By minimizing the number of concepts the semantic relativism of objects supports development of a universal theory of query languages, which is necessary to reason about query optimization methods.
Each object which could be separately retrieved, updated, inserted, deleted, authorized, indexed, protected, locked, etc. should possess a unique internal identifier. The identifier is not printable and the programmer never uses it explicitly. A unique internal identifier should be assigned not only to objects on the top level of their hierarchy, but to all sub-objects, including atomic ones. If some atomic objects create a repeating group, e.g. a person has many hobbies, each object in the group should possess a unique identifier. For persistent objects (i.e. database objects) their identifiers should be persistent too, i.e. invariant during all the life of the objects.
We are not interested in the structure and meaning of internal identifiers, e.g., they can be disc addresses, RAM addresses, some symbolic internal names, some tuple identifiers (tid-s), pairs <tid, attribute name>, triples <relation name, primary key value, attribute name>, etc. For us it is essential that all objects and all their sub-objects can be unambiguously identified through its internal unique name. The principle makes it possible to make references and pointers to all possible objects, thus to avoid conceptual problems with binding, scoping, updating, deleting, parameter passing, and other functionalities that require references as query primitives.
The total object internal identification principle is not satisfied in value-oriented database models, such as the relational model, models based on the mathematical logic (Datalog), functional database models, etc. This, however, introduces a lot of limitations and inconsistencies. For instance, without assuming that each tuple has a unique identifier it is impossible to give the correct semantics of the SQL delete clause or SQL cursors. Even more, without identifiers of values that are stored within a tuple it is impossible to express the semantics of the SQL update clauses. Value-based database models are thus idealistic and utopistic. More advanced functionalities (e.g. updating) sooner or latter will require correcting them by introducing some concept of internal identifiers.
Unfortunately, the principle is also not satisfied by XML. However, some functionalities developed for this technology require internal identification of XML sub-objects, thus special methods and languages have appeared to repair this flaw, c.f. Xpath, Xlink and Xpointer. All such inventions would be unnecessary assuming the total internal identification (by symbolic addresses) for parsed XML objects. In this case all the functionalities assumed in Xpath, Xlink, Xpointer, XQuery (and perhaps, next and next XML querying tools) can be covered by a single, simple and much more powerful query language such as SBQL.
Traditionally, persistent data was stored in databases and they were collections. On the other hand, popular programming languages refer to volatile data only (i.e. programming variables) and such data are individual. (Some languages, such as Pascal and Java, introduce volatile collections too, but with severe constraints on their construction and use.) Nowadays programming languages become more and more abstract, thus the border between different storage kinds is more and more transparent. Moreover, some professionals anticipate that in few years databases will be stored in main memory only; there are already such prototypes and commercial solutions. The size of main memories is reaching dozens of gigabytes, i.e. is comparable to the capacity of disks, but the access time to main memory is millions time faster. In some solutions (e.g. CORBA-based middlewares) the programmer is in fact unaware if he/she deals with persistent or volatile data. Lack of individual data stored in databases causes that they must be artificially stored as (one element) collections. This violates conceptual modeling and is error-prone. Lack of collections in programming languages causes that the programmer is forced to use lower-level mechanisms such as heaps, with obvious disadvantages (no stack discipline, dangling pointers, memory leak, etc.). Summing up, nowadays there is little justification for the assumption that persistent data are to be collections, volatile data are to be individual and typing systems of persistent and volatile data should be different.
The orthogonal persistence principle requires that the persistence property is orthogonal to kinds of data (or objects). In particular, a database can store individual objects (not only collections) and the volatile main memory of an application can contain collections of objects. Moreover, orthogonal persistence requires that the type system is the same for persistent and volatile data/objects. Looking at query languages that we would like to develop, there should also no syntactic and semantic differences in access to persistent and volatile data.
Unfortunately, orthogonal persistency is against the tradition in software industry, thus some professionals attack it as impractical. In our opinion such attacks have ideological rather technical foundations (mostly caused by attempts to defend current commercial software products). From the technical point of view it is difficult to imagine why orthogonal persistence would result in some critical disadvantages.
Data independence means that a database is designed and maintained independently of applications that retrieve and manipulate the data. There are several levels of data independence. Physical data independence means that physical details of data organization and access are transparent for the application programmer. The physical details are determined by the database designers or even earlier, by the designers of a database management system. Some physical details (e.g. indices supporting the access to data, special file organization, special methods of performing operations, etc.) are under control of a database administrator (DBA). DBA uses special administration module to tune the database operation according to the actual demands of applications, but still, nothing in applications has to be changed due to the tuning. Logical data independence means that DBA is able to perform some operations on the database structure, for instance, add new data kinds, add or remove some object or table attributes, change user privileges, add and remove views, database procedures, triggers etc. without unconscious influencing the applications. Conceptual data independence means that DBA is able to change the structure of the database conceptually without changing existing (legacy) applications, for instance, through special wrappers, mediators, views, updatable views, do instead of triggers, and other means that allow to change significantly the database schema and its organization, perhaps with minor changes of applications. The last level of data independence is referred to as schema evolution and conceptually is close to software change management methods and the aspect-orientation in databases.
Data independence means that a query language has no notions referring to physical details of data organization such indices. Logical and conceptual data independence requires that a query after compilation is still correct even in case of significant logical or conceptual changes in the database. This implies that the compiled query has no elements dependent on physical details, such as offsets of attributes within objects or tuples. This usually leads to the necessity of run-time (late, dynamic) binding of all names occurring in a query.