©Copyright by Kazimierz Subieta.

Persistence

by Kazimierz Subieta



A programming entity is persistent if it lives longer than the run of the program that created it. A persistent entity retains its state between subsequent runs of the program. All entities stored in databases are assumed to be persistent. Data structures of programming languages (in particular, variables or objects) are not persistent (they are volatile), because after the program finishes they are no longer available to subsequent runs of this or any other program. The concept of persistence is orthogonal to object-orientedness and can be discussed in the context of any data model. However, the object-oriented literature treats persistence with special attention.
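As a minimal sketch of this distinction (in Python, which is not a language discussed in this text; the function name run_program, the JSON file used as a store and its "counter" field are assumptions made only for illustration), each call below stands for one run of a program. The volatile counter is recreated on every run, while the persistent counter retains its state between runs:

```python
import json
import os
import tempfile


def run_program(store_path):
    """Simulate one program run; return (volatile_counter, persistent_counter)."""
    volatile_counter = 0  # volatile: starts from scratch on every run

    # persistent: the state left by an earlier run (if any) is read back
    if os.path.exists(store_path):
        with open(store_path) as f:
            persistent_counter = json.load(f)["counter"]
    else:
        persistent_counter = 0

    volatile_counter += 1
    persistent_counter += 1

    # save the persistent state for the next run
    with open(store_path, "w") as f:
        json.dump({"counter": persistent_counter}, f)
    return volatile_counter, persistent_counter


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "store.json")
        for _ in range(3):
            print(run_program(path))
```

Three consecutive calls on the same store yield the pairs (1, 1), (1, 2) and (1, 3): only the second component survives the end of a run. Note that the programmer here still calls file routines explicitly; the sketch shows the persistence status itself, not the unified access discussed later.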

The concept of persistence was not coined in the database domain. Originally, the domain was based on the data independence principle, which assumes that a database is designed, administered, maintained, secured, catalogued, published and accessed independently of any application programs that act on it. Moreover, there is usually no assumption that database applications are to be written in a single programming language. On the contrary, data independence implicitly assumes that a database will be available to any programming language, provided that it implements a corresponding library (or a “driver”). Because all structures in databases are persistent, there is no need to introduce the concept of persistence. Similarly, for operating systems there is no need to characterize files as “persistent”, because they exist independently of any application program that may act on them. In this sense database management systems, with their data independence principle, are closer to operating systems than to programming languages. In the literature there are several proposals to connect these two domains. This is actually done in big DBMSs such as Oracle, which take over (and refine) many functionalities that were traditionally on the side of operating systems (for instance, granting access privileges).

Originally, the concept of persistence did not appear in the domain of programming languages either. From the very beginning programming languages worked with files, which were treated very differently from a program’s data structures. Usually, files were created, read, updated, deleted, etc. by special routines collected in a library available for a given programming language, but a file itself was still external to the programming language, on the same principle as, e.g., a keyboard or a mouse is external. An application programmer explicitly uses these routines to perform actions on these external resources, e.g. creating a file, recording some data (stored in program variables) in a file, or reading data from a file (into program variables). This idea was naturally extended to databases, although with different (more complex) options and libraries (APIs) equipped with query languages.

The concept of persistence was born from the marriage of databases and programming languages. It concerns a new type of programming languages (so-called database programming languages, DBPLs) that are specially designed for building applications that act on databases. None of the popular programming languages (Pascal, C, C++, Java, etc.) explicitly involves the concept of persistence. To some extent the concept violates the old principle of data independence, because it implicitly assumes that database application programs will be written in a single programming language (or in a single family of languages sharing the same typing and/or data representation system). It is not certain that the software community is currently prepared for such a “monopoly” (which apparently violates our sense of democracy, free commercial competition, freedom of invention and the need for diversity as a factor of progress). However, as a matter of fact, many database applications are actually based on such a monopoly, hence the concept of persistence is worth attention. Moreover, because of persistent procedural entities, some kind of monopoly is inevitable.

The persistence concept is based on the observation that both programming languages and databases deal with data structures that are very similar in conceptual models, construction, representation and typing. The only conceptual difference is that data structures stored in databases exist independently of the lifetime of a program run, while data structures processed by a run disappear completely when it terminates. Hence they differ by only one factor, which has been named the “persistence status”. If so, let this factor be separated, while all other properties and functionalities related to data structures are unified for both cases. Obviously, this point of view is programming-language-centric and (to some extent, as we will discuss) contradicts the data independence principle that established the domain of databases.

The concept of persistence does not assume a pure separation of persistent and volatile entities. Different proposals assume various ways of unifying their construction and functionality. The most common ideas are the following:

·         Unified naming, scoping and binding. Programming variables and database structures are named in the same way, follow the same scope rules, and the binding of a volatile variable has the same syntax as the binding of a persistent database variable. For instance, a programmer can create a persistent variable X that is stored in a database but accessed from a program simply as X, with no special syntax such as procedure calls or special keywords.

·         Unified typing system. Programming variables and database structures follow the same typing system and the same (strong, static) type checking. Because of traditional cultural differences (programming languages deal with individual variables, while databases deal with collections), this idea requires some combination of the two cultures. In particular, it should be possible to create volatile collections and persistent individual objects. Many languages (Pascal/R, DBPL, Napier88, PS-Algol, Galileo, Fibonacci, Tycoon, PJama and others) follow this idea.

·         Unified query and expression language. Traditionally, volatile structures were accessed by programming expressions, while database structures were accessed by queries. This subdivision is related to the previous point: expressions deal with individual variables, while queries deal with collections. The subdivision is justified only by historical reasons. The idea is to merge the two notions so that there is no difference between expressions and queries; e.g. x+y is a query on equal rights with, e.g., Employee where Salary = (x+y).

·         Integrated database programming language. The language makes no distinction between volatile and persistent entities (except for the persistence status). It is fully based on queries, which are used as expressions within imperative (updating) statements, as parameters of procedures, functions and methods, and as specifications of database abstractions (views, constraints, rules, triggers, etc.). The language follows a unified strong typing system. Procedural abstractions written in the language can persist, i.e. can be stored on the side of a database server as (stored) procedures, functions, methods, (updatable) views, triggers, etc. The language supports orthogonal persistence, i.e. free, unlimited combination of the persistence status with any feature of the language, including data structures, types, procedural abstractions and database abstractions.

So far, only SBQL implemented in ODRA fully accomplishes the last, most complete idea of persistence. Other database programming languages and implemented database systems always make some eclectic tradeoffs, motivated mainly by historical (legacy) development, reluctance towards revolutionary changes, reluctance to develop new programming languages, and many unsolved research problems (strong typing, query optimization, object-oriented updatable views, etc.).
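The unification of expressions and queries can be hinted at with an ordinary language. The following sketch assumes Python and a hypothetical Employee class; it is not SBQL and involves no database, it only shows that one expression syntax can serve both individual variables and collections:

```python
class Employee:
    """Hypothetical record type used only for this illustration."""

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary


# A volatile collection and two volatile individual variables.
employees = [Employee("Ann", 3000), Employee("Bob", 2500), Employee("Eve", 3000)]
x, y = 1000, 2000

# An "expression" over individual variables ...
total = x + y

# ... and a "query" over a collection, written in the same syntax and
# nesting the former (compare: Employee where Salary = (x+y)).
matching = [e.name for e in employees if e.salary == x + y]
```

In an integrated database programming language the collection could equally well be persistent, with no change to the syntax of the query.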

 

Persistence, impedance mismatch and data independence

The concept of persistence is a consequence of attempts to avoid the impedance mismatch, i.e. the incompatibility of data models, types, access and updating facilities, program abstractions, maintenance, refactoring, etc. between the data structures of programming languages and those of databases. The impedance mismatch is an inherent consequence of the data independence principle. Hence, to some extent, the concept of persistence is in opposition to the principle.

In the relationship between impedance mismatch and data independence there is no ideal solution, only tradeoffs. In particular, a tradeoff is necessary for the data independence principle. The principle was formulated at a time when databases (especially relational databases) contained pure data only. Current database servers, including relational ones, store many entities that must be prepared in a query and programming language. These entities include:

·         Stored procedures and functions.

·         Triggers, constraints and (business) rules.

·         Stored classes, including methods that are defined within these classes, inheritance, and other features of object-orientedness.

·         Database views, in particular, updatable database views.

·         Definitions of workflow processes.

·         Definitions of wrappers, mediators, adapters, integrators, exporters, importers and other interoperability or data distribution facilities.

Other entities are possible and are currently being considered, such as persistent threads, pre- and post-conditions, assertions and so on. One can imagine that these entities could be written in many languages, but for several reasons such freedom would be disadvantageous or unrealistic. All such languages would have to be based on the same data structures (determined by the database model and types), and this limitation greatly reduces the freedom. The assumption that any programming language can be used for this purpose is unrealistic for at least two reasons: (1) the early binding assumed in popular languages (which would exclude many database features such as views, changes in the database schema, etc.); (2) severe problems with impedance mismatch. Hence, as a final conclusion, for a given DBMS all such active entities should be written in a single, integrated query and programming language that deals with persistence as a regular option. For these reasons the development of database programming languages and their standards makes great sense.

We also note that these (persistent) entities are prepared during the database design phase or during database maintenance by a database server administrator. They can be used by client applications, but are not under the control of these applications: they are to be designed, programmed and administered by a database designer or a database administrator. Obviously, these persistent entities may create and maintain volatile data during their runs, which makes the distinction between persistent and volatile entities quite fuzzy.

 

Relativity of the persistence status, data sharing and transactions

The persistence concept is relatively clear if it is considered with respect to the subdivision between main memory and magnetic discs as data storage media, or with respect to the program life cycle. Volatile data are stored in main memory, while persistent data are stored on discs. Volatile data are available during a run of a program and unavailable when the program is terminated or inactive, while persistent data are available on a disc and can be activated whenever required. This subdivision, however, becomes much less clear when data are stored in main memory only, which is now the case for many modern DBMSs (including ODRA) and other data environments. In such systems magnetic discs may not exist at all or may be used only as a back-up facility (e.g. ODRA uses memory-mapped files for this purpose). Such an unclear attitude to persistence can be observed especially in data-intensive grid solutions and P2P (peer-to-peer) networks, where many servers cooperate, each of them can be switched off at any moment, and all the data that the server supplies then become unavailable. This problem with a clear definition of the persistence concept has appeared in particular in ODRA, where all the data are kept within abstract stores and the programmer has no means of determining how and where the data physically reside. A very similar problem appears in environments based on CORBA or other transparent middleware tools.

For such environments the traditional criteria for subdividing data into volatile and persistent make little sense. For instance, some application A may create volatile data and then make them available to application B. From the point of view of application A these data are volatile, and from the point of view of application B they are persistent. How can the sense of such a persistence concept be kept?

One conclusion is that the persistence status is relative: some data can be volatile or persistent depending on the application acting on them. If so, how and where is the persistence status of data to be declared?

The concept of persistence makes sense for a single local application, when one would like to distinguish data that are available only while a program is running from data that retain their state after the program is terminated. This situation seems to be the main case for so-called persistent programming languages such as PJama. However, it is not typical for large databases. Typically, database applications are subdivided into client and server processes, and in this case the server keeps persistent data that are shared among many client applications. The concept of sharing is in this case more relevant than the concept of persistence, because if the server process is not active then the data kept on the server are unavailable anyway, even if they are persistent. If the server is active, it can export to clients not only “persistent” data but any data, including volatile ones. The persistence status becomes even less clear when clients and servers are hard to distinguish, as e.g. in P2P networks. Whether persistent or volatile, data on a server are unavailable for external use when the server is switched off or is down.

Confusing persistence with data sharing is a common mistake of many authors in the context of transaction processing. If data are not shared, then transactional semantics is inessential, regardless of whether the data are persistent or not. If data are shared but cannot be updated simultaneously by many processes, then transactional semantics is inessential too, regardless of the persistence status. Transactional semantics is essential only when data are shared and can be updated by many processes, but in this case the persistence status of the data does not matter.
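The argument above can be restated as a small decision rule. The sketch below is only a paraphrase of the three cases in Python; the function name and its parameters are hypothetical:

```python
from itertools import product


def needs_transactions(shared, concurrently_updatable, persistent):
    """Return True when transactional semantics is essential.

    `persistent` is accepted only to make explicit that it plays no role.
    """
    return shared and concurrently_updatable


def truth_table():
    """List all eight combinations together with the resulting verdict."""
    rows = []
    for shared, updatable, persistent in product([False, True], repeat=3):
        verdict = needs_transactions(shared, updatable, persistent)
        rows.append((shared, updatable, persistent, verdict))
    return rows


if __name__ == "__main__":
    for row in truth_table():
        print(row)
```

In every row of the table, flipping the persistence flag leaves the verdict unchanged, which is exactly the point of the paragraph.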

Assuming any data server, its programmer or administrator should be equipped with facilities for determining which data entities are to be shared among clients and other servers. Moreover, he/she may be allowed to determine how the data entities are to be shared. The facilities may include access and updating rights, some (database) views and some specific protocols of sharing, such as transactional semantics. During the development of the ODRA system we have tried to solve these issues, but without deep feedback from practical applications it is difficult to assess whether our solutions are optimal for the majority of cases.

 

Persistence models

Orthogonal persistence

Orthogonal persistence [Atki95] is the simplest and richest persistence model, in which the persistence status is orthogonal to all other features of the (database) programming environment, including data representation, data types, strong type checking, expression or query languages, scoping and binding rules, etc. This is the most universal, simple and obvious model. Orthogonal persistence greatly reduces the size of the implementation, simplifies many optimizations, makes the user documentation much shorter, makes programs simpler and shorter, and greatly supports program maintenance. For these reasons it makes little sense to consider other persistence models. We have assumed the orthogonal persistence model in the ODRA system (with some cautions concerning the persistence concept presented in the previous subsection).

Traditionally, however, volatile data were kept on the side of client applications, and volatile collections were usually not available. On the other hand, persistent data were kept on the side of a server and (as a rule) their data types were restricted to collections (e.g. in relational databases). Moreover, there are significant differences in access and updating syntax, semantics, typing, binding phases and pragmatics of use, maintenance, etc. In the opinion of many professionals such a subdivision between volatile and persistent data is only a consequence of historical development. There is no reason why volatile data cannot be collections, why persistent data cannot be individual objects, or why access to volatile and persistent data must be based on different syntax and semantics. The lack of (nested) collections in popular programming languages such as C has led to the concept of a heap, which violates programming discipline and strong typing. It is also a cause of memory leaks, pointer processing and other features that compromise the reliability of programs.

Orthogonal persistence was a feature of many prototype database programming languages such as PS-Algol, DBPL, Napier88, Galileo and Tycoon. Popular commercial languages such as C/C++, Java and Smalltalk do not deal with persistence at all. There have been attempts to introduce orthogonal persistence into some popular languages such as Java (persistent Java, or PJama). These projects, however, are too modest with respect to query languages and are based on a limited approach to database architecture and to the concept of persistence that we have discussed in the previous subsection.

For historical reasons, the idea of orthogonal persistence is criticized within commercial communities as impractical and unnecessary. In particular, the ODMG standard does not assume such a feature. However, because orthogonal persistence has many advantages for new systems, it is almost certain to be the eventual winner in the longer term.

Some authors claim that achieving perfect orthogonal persistence is impossible because of features such as transaction processing, which is relevant to persistent data but irrelevant to volatile data. As we have argued in the previous section, the configuration of concepts, especially concerning transactions, is more complex than simple observations suggest. In our opinion, these doubts concerning orthogonal persistence are caused by a particular understanding of the problem.

Persistence through reachability

Persistence through reachability can be considered a supplement to orthogonal persistence, motivated by some configurations of persistent and volatile entities that apparently make little sense. There are a few such situations that we want to avoid:

·         A volatile object contains a persistent object as a component. In this case removing the volatile object will cause removing the persistent object, hence its persistence is a fiction. Alternatively, a persistent object stored within a volatile object can be moved somehow to another logical place, but this would require some extra semantics that makes little practical sense.

·         A persistent object contains a volatile object as a component. This case (implemented in the Loqis system) is imaginable and can have practical meaning for programming (e.g. keeping temporary results of calculations within persistent objects). However, mixing up volatile and persistent entities within one entity may be problematic for database maintenance, query optimization and garbage collection. Hence, it should be avoided too.

·         A persistent object contains a pointer (a reference) to a volatile object. This case is disadvantageous similarly to the previous one. If the volatile object is removed, then the persistent object will contain a dangling pointer (i.e. a pointer leading to garbage or to an improper object). However, there are well recognized methods of avoiding dangling pointers and this case does not imply anything special.

·         A volatile object contains a pointer (a reference) to a persistent object. This case is quite reasonable, as in many applications the programmer may need to refer to persistent objects. Sometimes, however, this case implies implementation problems connected with, e.g., dangling pointers, garbage collection and transaction processing.

Indeed, persistence through reachability makes the data organization cleaner, but it does not undermine the general idea of orthogonal persistence. As an analogy, in every language it is possible to write senseless statements (e.g. an infinite loop), but they rarely influence the construction of the language. Most frequently they are presented in manuals as practical warnings and rules of thumb. In any case, in databases and programming languages everything is in the hands of the designer or programmer, and he/she will be the first victim of his/her unreasonable decisions during database design or program writing. This does not mean that designers of database systems should ignore the above disadvantageous situations: if they imply additional implementation effort, the designers are within their rights to forbid them.

The persistence through reachability principle is assumed in the ODMG standard within Smalltalk and Java bindings. In the C++ binding the principle does not hold due to lack of automatic garbage collection.
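The principle itself can be sketched as follows: an object is saved exactly when it is reachable from a persistent root. The sketch assumes Python, a hypothetical Obj class whose names are unique, and a JSON string standing for the persistent store; it does not reproduce the ODMG bindings or the ODRA implementation:

```python
import json


class Obj:
    """Hypothetical object with a unique name and references to other objects."""

    def __init__(self, name):
        self.name = name
        self.refs = []


def snapshot(roots):
    """Return a JSON string describing every object reachable from `roots`."""
    saved = {}
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj.name in saved:  # already visited; this also handles cycles
            continue
        saved[obj.name] = [r.name for r in obj.refs]
        stack.extend(obj.refs)
    return json.dumps(saved, sort_keys=True)


# root -> a -> b -> root (a cycle); temp -> a, but nothing refers to temp.
root, a, b, temp = Obj("root"), Obj("a"), Obj("b"), Obj("temp")
root.refs.append(a)
a.refs.append(b)
b.refs.append(root)
temp.refs.append(a)

stored = json.loads(snapshot([root]))
```

Here temp refers to a persistent object (the fourth situation listed above) but is not itself reachable from the root, so it is absent from the stored data and remains volatile.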

Persistence through inheritance

Persistence through inheritance assumes that persistence is an invariant of a class that is inherited by all subclasses and by objects that are members of these classes. No class can contain, directly or indirectly, both persistent and volatile objects. It is possible to create a persistent class that is a specialization of a volatile class, but not vice versa.

The intention of this concept is clearly motivated by the physical subdivision between persistent and volatile objects. Basically, it assumes quite different typing and different access or updating interfaces that inherently depend on the storage medium. This persistence model makes little sense when the designers abstract from storage media (the system automatically decides where persistent data are stored) or make storage media transparent to programmers (as, e.g., in CORBA). The idea makes it impossible (or more problematic) to write methods that act on both persistent and volatile data. For instance, if the programmer wants to copy a persistent object to a volatile store, he or she must create two classes, one for persistent objects and another for volatile ones, and then write similar methods in both classes. This is an obvious waste of resources (time, money, size and complexity of program artifacts, etc.) caused by physical features. The negative consequences of such an approach are obvious: more complex schemas and programming interfaces, larger programs, harder maintenance of applications, etc. The history of computer technologies clearly shows that such a waste of resources sooner or later becomes critical and unacceptable. The idea of persistence through inheritance is obviously in conflict with the idea of orthogonal persistence and the idea of an integrated database query and programming language. It also obviously promotes impedance mismatch and assumes it as an inevitable rule.

The advantages of this idea lie in the historical legacy. For many years database management systems have been constructed in such a way that operations on database structures are very different from operations on volatile program variables. The data independence principle mentioned before was the main catalyst of this subdivision. Current technologies, however, strongly call for revisiting these concepts, especially in the triangle of persistence, data independence and impedance mismatch.


Last modified: January 10, 2008