©Copyright by Kazimierz Subieta.

Collections

by Kazimierz Subieta



Collections are data structures consisting of data entities or objects of a similar structure. Sometimes it is assumed that “similar” means “of the same type”; however, such an assertion is vague, especially for environments that are not typed. It is assumed that the size of a collection can vary and can be dynamically changed at any moment. Typical collections are sets, relations (tables), bags, sequences, lists, trees, some kinds of arrays (with open ends), indices and perhaps others. Differences between kinds of collections concern their structural properties, as perceived by programmers, as well as their behavioral properties, i.e. the operations that are allowed for a particular collection kind. Usually collections are built into a particular data structure environment, that is, their structural properties and the generic primitive operators acting on collections are predefined in the given language and cannot be changed by application programmers. Some languages (C++, Java) make it possible to define new kinds of collections, but this concerns rather primitive, non-optimized and non-scalable cases. In databases all kinds of collections are predefined and built into the data environment.

Collections are not present in some popular programming languages, for instance, in C. In Pascal collections are represented by the concept of a file, burdened by implementation restrictions. The lack of collections forces the programmer to use dynamically allocated memory (the heap) to implement tasks requiring collections, with many negative consequences such as error-prone operations on pointers, lack of typing safety and memory leaks. Popular object-oriented programming languages (Smalltalk, C++, Java) do not introduce the collection concept explicitly, although there are some possibilities to introduce some kinds of collections (e.g., vectors of references). The lack of explicit collections causes many limitations, in particular, major problems with defining nested collections (e.g. attributes with any number of values), pointer links or binary relationships between elements of collections, user-friendly queries acting on collections, and others. Collections (sets of objects) are explicitly introduced in popular analysis and design notations such as UML. However, the level of algorithmic precision of such modeling tools is much below the level of precision required for programming languages. Moreover, collections introduced in analysis and design notations and tools are usually not associated with operations (especially updating operations) that are inevitable in programming languages.

On the other hand, databases mostly deal with collections. In particular, relational databases deal with collections known as relations or tables. In databases collections usually have a (conceptually) unlimited size that may reach millions or billions of data items. Due to their size, collections in databases are carefully tuned both from the point of view of their structure and organization and from the point of view of access. In relational databases access to collections is accomplished through the query language SQL, which is supported by a powerful query optimizer that sometimes reduces the access time by several orders of magnitude. Due to the query optimization that is built into the SQL interpreter, any other processing style (e.g. processing a collection element-by-element) is discouraged or not supported.

The lack of collections (or their limited features) in programming languages is the direct cause of the infamous effect known as impedance mismatch. The impedance mismatch reduces programmer productivity and causes overhead in the application program code. It is highly disadvantageous for program maintenance. It is also unaesthetic or even sloppy from the point of view of software manufacturing tools.

Object-oriented databases in their early assumptions (cf. the object-oriented database manifesto) provided no impedance mismatch by unifying the types of programming languages and databases. Unfortunately, this tenet has remained among nice wishes. In the ODMG standard of object-oriented databases (commonly perceived as flawed) the impedance mismatch still persists, due to the fact that the standard provides Smalltalk, Java and C++ as the only tools for application programming.

 

Kinds of collections

Kinds of collections depend on whether we consider collections as data structures stored in data or object stores, or collections that are returned by queries. The majority of theoretical models make no distinction between these two situations, but looking at SQL we see a fundamental difference. Inside a relational database we have only unordered tables of tuples. Considering the output from SQL queries we can distinguish the previous case, as well as ordered tables (sequences) returned by SQL queries with the order by clause. Moreover, collections returned by SQL differ from collections stored in a database by many semantic properties. For instance, it makes sense to insert a new tuple into a stored collection, but it makes little sense (and is impossible) to insert a tuple into a (volatile) collection returned by an SQL query. Although the distinction is obvious, some strong typing professionals and some industrial proposals (in particular, the ODMG standard) try to unify the cases by a common typing system. In our opinion, this must lead to inconsistencies. We intend to consider collections precisely, according to their semantic nature and practical needs.

In general, we can distinguish two kinds of collections that are structurally different:

·      Sets or bags, where the order of elements is inessential,

·      Sequences or arrays, where the order of elements carries some information.

All other kinds of collections are derived from these two kinds by assuming some special constraints (e.g. no duplicate elements in sets), by special physical implementation (e.g. indices implemented as hash tables or B-trees) and/or by a special set of generic operations assigned to a particular collection kind (e.g. taking the i-th element from a collection). In particular, in the ODMG standard the following collection kinds are proposed:

·      Sets: can contain elements of any (but the same) type, without duplicates,

·      Bags: as before, but duplicates are allowed,

·      Sequences: as before, but the order of elements is essential,

·      Arrays: as before, with some restriction on inserting and deleting elements (allowed only on the array’s top),

·      Dictionaries: sets of pairs (t, v), where t is a string of characters and v is a value of any type. The intention behind dictionaries is to organize indices.

The ODMG standard makes no distinction between collection kinds stored in a database and collection kinds returned by queries. For instance, concerning stored collections, it makes little sense to consider bags, because every element of a bag is distinct due to object identifiers. On the other hand, the practical applicability of sets is rather low, and sets can be fully substituted by bags, perhaps with an explicit operator for removing duplicates (applicable only to bags returned as results of queries). We present the above doubts only to show that the number of choices concerning collections is large, and that the ODMG proposal can be perceived only as one particular composition of these choices: not the only possible one, not minimal and not optimal. The general principles that should guide an optimal choice are minimality and universality of the collection constructs, as well as their full orthogonality.

If a particular typing system introduces sets as a collection kind, then it assumes automatic removal of duplicates. This is required after any operation of insertion or assignment and after any query that returns a set as its output. However, removing duplicates is a costly operation which (as a rule) requires sorting. For this reason sets should be used by programmers with caution. In some proposals (SQL, ODMG) the operation of removing duplicates is explicitly in the hands of the programmer (a distinct or unique clause). The existence of this operator undermines the necessity of sets as a type constructor: it is enough to have only bags, together with the duplicate-removing operator. Note that this is just the solution of SQL. Having both the explicit duplicate-removing operator and the set collection kind is redundant, because declaring the result of a query as a set should automatically cause the removal of duplicates. The ODMG standard includes both options, which does not look very reasonable.
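
To make this concrete, below is a minimal Java sketch (the names are illustrative, not part of any standard) in which query results are kept as bags and duplicates are removed only on explicit demand, in the spirit of SQL’s distinct:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class BagDistinctDemo {
    public static void main(String[] args) {
        // A bag of query results: duplicates are allowed and kept cheaply.
        List<String> jobs = new ArrayList<>(List.of("clerk", "analyst", "clerk"));

        // An explicit duplicate-removing operator, applied only on demand
        // (the counterpart of SQL's distinct); first-occurrence order is kept.
        List<String> distinctJobs = new ArrayList<>(new LinkedHashSet<>(jobs));

        System.out.println(jobs);          // [clerk, analyst, clerk]
        System.out.println(distinctJobs);  // [clerk, analyst]
    }
}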

Sets, duplicate-removing operators and some operations on collections (equality, containment, etc.) imply the necessity of a criterion of when two elements of the same or different collections are identical. In some cases this is not obvious. For stored collections the issue is clear, as “identical” means “having the same object identifier”. Some authors, however, consider deep comparisons, where “identical” means “isomorphic up to object identifiers”. However, we do not think that the latter point of view makes sense for programmers. For collections returned by queries the question of identical elements becomes more vague. For instance, are the two structures {name: “Smith”, salary: 2500} and {salary: 2500, name: “Smith”} identical? Query operators (except equality) are not able to distinguish them; however, the doubt remains. At least, they are probably instances of two different types, hence they are not identical. In general, the definition of identical elements may depend heavily on implementation, representation and the assumed strong typing system.
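
The two criteria can be illustrated in Java, where == compares object identity (references) and equals() of a record performs the deep, structural comparison; a small hypothetical sketch (note that in such a statically typed language the two structures with swapped attributes would indeed be instances of two different record types):

public class IdentityDemo {
    record Emp(String name, int salary) {}

    public static void main(String[] args) {
        Emp a = new Emp("Smith", 2500);
        Emp b = new Emp("Smith", 2500);

        // Reference identity: two distinct "object identifiers".
        System.out.println(a == b);       // false

        // Deep comparison: isomorphic up to object identifiers.
        System.out.println(a.equals(b));  // true
    }
}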

Collections can be nested. The kind of nesting and the number of nesting levels should not be constrained. In Fig.51 we show nested collections, where elements of the collection Employees have nested bags Children and nested sequences Employments. Such flexibility in nesting collections is the consequence of the orthogonality of the typing system, the object relativism, and a conceptual closure saying that everything that could make sense for anybody should be allowed. Such a conceptual closure is a feature of almost all programming languages. In databases, however, it has been assumed only recently (cf. the ODMG and SQL-99 standards). Relational databases have assumed the first normal form, according to which nesting tables is impossible. This was motivated by the necessity to simplify the conceptual model, as well as by the mathematical theories that supported the relational database concept (such as the relational algebra). In our opinion, such limitations are not reasonable from the point of view of conceptual modeling of business applications. Moreover, claims that nested collections decrease the possibility to use mathematical theories are obviously tendentious. If necessary, proper mathematical theories could be developed for nested collections as well.

 

Fig.51. Nested collections
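
As an illustration only, the structure of Fig.51 might be rendered in Java roughly as follows (class and attribute names are taken from the text; the use of List for both the nested bag and the nested sequence is an assumption of this sketch):

import java.util.List;

public class NestedCollections {
    record Child(String name) {}
    record Employment(String company, int fromYear) {}   // order matters here
    record Employee(String name,
                    List<Child> children,                // nested bag
                    List<Employment> employments) {}     // nested sequence

    public static void main(String[] args) {
        Employee e = new Employee("Doe",
                List.of(new Child("Ann"), new Child("Bob")),
                List.of(new Employment("IBM", 1999),
                        new Employment("Oracle", 2005)));
        System.out.println(e.employments().get(0).company()); // IBM
    }
}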

 

Operators and constructors for processing collections

Operators on collections depend on whether we deal with stored collections or collections returned by queries. They can be grouped into two categories:

·      Macroscopic operators: arguments are collections, the results are collections too.

·      Iteration operators: they process collections element-by-element.

Operators can also be subdivided into functional and updating operators. Functional operators do not change the state of an object store, while updating operators make it possible to change the state. There are several macroscopic operators for bags, in particular: the sum (union), intersection, difference, equality and inclusion of bags, removing duplicates from a bag, changing a bag into a sequence (with no sorting), membership of an element in a bag, the size of a bag, inserting an element into a bag, removing an element from a bag, assignment to an element of a bag, and perhaps others. Similarly, for sequences (and for arrays) the following operations are possible: concatenation, intersection, difference, equality and inclusion of sequences, the first, last and i-th element of a sequence, the size of a sequence, inserting an element into a sequence, removing an element from a sequence, assignment to an element of a sequence, membership of an element in a sequence, changing a sequence into a bag, removing duplicates from a sequence, sorting a sequence according to some key, and perhaps others.
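
For illustration, here is a minimal Java sketch of two of these macroscopic bag operators, implemented over lists treated as bags (so duplicates are kept); the helper names are hypothetical:

import java.util.ArrayList;
import java.util.List;

public class BagOps {
    static <T> List<T> union(List<T> a, List<T> b) {      // bag sum
        List<T> r = new ArrayList<>(a);
        r.addAll(b);
        return r;
    }

    static <T> List<T> difference(List<T> a, List<T> b) { // bag difference
        List<T> r = new ArrayList<>(a);
        for (T x : b) r.remove(x);  // removes one occurrence per element of b
        return r;
    }

    public static void main(String[] args) {
        List<Integer> a = List.of(1, 1, 2), b = List.of(1, 3);
        System.out.println(union(a, b));       // [1, 1, 2, 1, 3]
        System.out.println(difference(a, b));  // [1, 2]
    }
}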

In practice, even a large set of operators like the above can be insufficient. Fully universal processing of collections requires processing element-by-element by constructs known as iterators. Note that iterators can be nested and some nested iterator instances may act on the same collection. Iterators are sets of methods such as get_first, get_next, is_last?. Each iterator instance has a state, which is the reference to the element that is currently processed. States of iterators must be somehow identified and stored somewhere; this may present a non-trivial conceptual problem. Programming with iterators is challenging, especially concerning achieving good performance when processing large collections. Alternatively to explicit iterators, a query language, perhaps with full algorithmic power, can be developed. Queries can be considered as nested iterators; however, each application of a query operator (selection, projection, join, etc.) has some meaning for the programmer. Query languages usually adopt the above macroscopic operations on collections as query operators. In well-developed query languages queries are internally optimized, thus the performance problems are much easier.
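
Below is a small sketch of nested iterator instances acting on the same collection, expressed with Java’s standard iterator protocol (hasNext/next play the roles of is_last?/get_next; the example is illustrative only):

import java.util.Iterator;
import java.util.List;

public class IteratorDemo {
    public static void main(String[] args) {
        List<String> names = List.of("Smith", "Jones", "Brown");

        Iterator<String> outer = names.iterator();
        while (outer.hasNext()) {
            String a = outer.next();
            // A second, independent iterator state on the same collection.
            Iterator<String> inner = names.iterator();
            while (inner.hasNext()) {
                String b = inner.next();
                if (!a.equals(b)) System.out.println(a + " - " + b);
            }
        }
    }
}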

A tradeoff between iterators and queries is the for each operator, which iterates over a collection. This collection, however, can be returned by a query, for instance (SBQL):

for each ( Employee where Job = “analyst” ) as a do {
    print( a.Surname );
    a.Salary := a.Salary + 100;
};

In the ODMG standard, operators for processing collections are grouped into interfaces, which (syntactically) are not different from normal interfaces:

interface Collection : Object {
    unsigned long cardinality();
    boolean is_empty();
    void insert_element( in Object element );
    boolean contains_element( in Object element );
    ...
};

interface Set : Collection {
    Set union_with( in Set other );
    Set difference_with( in Set other );
    boolean is_subset_of( in Set other );
    ...
};

Note that the parameters of the methods are typed as Object. Hence the above interfaces are semantically different from normal interfaces, because they are parameterized, in principle, by any type. Such interfaces are type-polymorphic, with parametric polymorphism, which is one of the difficult notions in the theory of programming languages. In C++ terms such interfaces correspond to templates. However, templates are parameterized by concrete types, not by the most general type Object. In effect, the strong typing potential is much reduced [Alag97].
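
The difference can be observed in Java, whose generics correspond to such type parameterization; a minimal sketch:

import java.util.ArrayList;
import java.util.List;

public class TypingDemo {
    public static void main(String[] args) {
        // ODMG style: everything is Object, so type errors surface at run time.
        List<Object> odmgStyle = new ArrayList<>();
        odmgStyle.add("Smith");
        odmgStyle.add(42);                        // accepted, though unintended
        // String s = (String) odmgStyle.get(1);  // would fail at run time

        // Generic style: the element type parameterizes the collection,
        // so the same error is rejected at compile time.
        List<String> typed = new ArrayList<>();
        typed.add("Smith");
        // typed.add(42);                         // compile-time error
    }
}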

 

Contradiction of collections with object-oriented principles

The collection concept leads to problems with substitutability and the open-closed principle, the basic principles of object-orientedness. Assume the following:

·    No two different stored collections have common elements;

·    Substitutability is supported, i.e. each member of a specialized class is also a member of a more general class;

·     Open-closed is supported, i.e. a class can be closed for changes, but still open to specialization through subclasses.

The first assumption says that the collections Persons and Students are disjoint. The second assumption requires that methods inside the class PersonClass should also process the collection Students, because each student is a person. Assume that, according to the third assumption, PersonClass is closed for changes. If during development one concludes that a new class EmployeeClass of objects Employee is necessary, it would be necessary to alter PersonClass to take the collection Employees into account. This violates the third assumption. Similar problems arise due to multiple inheritance and, in general, with collections that store objects of different types. One of the solutions to this dilemma is the concept of dynamic object roles and dynamic inheritance described previously. In this concept each object can (conceptually) belong to many collections.
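
A hypothetical Java sketch of the dilemma (class and collection names follow the text; the code is illustrative, not a recommended design):

import java.util.ArrayList;
import java.util.List;

class Person { String name; }
class Student extends Person { }
class Employee extends Person { }   // added later during development

class PersonClass {
    // Disjoint stored collections, as in the first assumption.
    List<Person> persons = new ArrayList<>();
    List<Student> students = new ArrayList<>();
    List<Employee> employees = new ArrayList<>();  // field added afterwards

    // Every method that conceptually processes "all persons" must visit
    // each disjoint collection, and must be edited whenever a new subclass
    // (and its collection) appears, which violates the open-closed principle.
    void processAllPersons() {
        persons.forEach(p -> { /* ... */ });
        students.forEach(s -> { /* ... */ });
        employees.forEach(e -> { /* ... */ });  // edit forced by EmployeeClass
    }
}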

The concept of collections is not obligatory. In XML the concept is not introduced (at least, on the level of XML files). Instead, collections are represented as many objects having the same name. For instance, an XML object Book may contain many subobjects Author. However, the concept of collections is necessary for strong typing. On the level of types, collections can be declared by special keywords (cf. the ODMG standard) or by cardinalities (cf. XML Schema and UML). Representing collections by cardinalities allows for more precision in specification. For instance, the specification with cardinalities Person: PersonType[0..*] denotes a collection Person which can be empty, while Person: PersonType[1..*] denotes a collection Person which is never empty. Such precision is not possible in the specification Person: Bag PersonType. This change of notation does not, however, remove the problems with object-oriented principles.

 

Null values and collections

A lot of literature is devoted to null values in databases, especially in the context of relational databases. There are many reasons for null values, such as incomplete, irrelevant, delayed or pending information. Languages such as SQL do not consider the external (real world) semantics of nulls, treating them as a technical facility that is in the hands of database designers and programmers, who can use them for any purpose, including bearing some known information. For instance, we can imagine that for all patients of some health service the attribute Cancer is null for anybody who does not suffer from cancer, and holds the cancer kind for patients who do.

Null values in databases have a bad reputation. [Date86] contains severe criticism of null values as the reason for many semantic anomalies in SQL. Following this criticism, nulls as special values stored in the database should be avoided. Instead, [Date86] proposes to use default values; however, this solution also has disadvantages [Subi96a].

Null values were the subject of many theoretical proposals, but none of them has found essential practical application. The majority of them assumed that null values are the manifestation of incomplete (unknown) information. They describe situations when some information exists (e.g. John surely has some job), but we do not know its state precisely (e.g. that John is a teacher). Unfortunately, such assertions are wrong for the majority of cases. Null values can appear for many other reasons, not related to incomplete information. Moreover, the research on null values was reduced to some limited query languages (e.g. the SPJ operators of the relational algebra). This is in sharp contradiction with the real scope of null values, which must be taken into account in all query operators. Even more, they influence the entire application programming environment, including updating capabilities such as the insert, update and delete operators, and including all features of application programming languages. Hence the presented research devoted to null values was indeed meaningless, oriented towards the next conference paper rather than towards contributing to solving real problems. Until now, there has been no consistent proposal showing how to consistently incorporate null values into query operators, updating capabilities and programming languages. Solutions known from SQL are error-prone and inconsistent in many places; see [Date86].

The majority of reasons for null values can be found in the irregularity of information that cannot fit into a predefined data structure format. This is especially true for the relational model, which is based on very inflexible, fixed data structures. For instance, null values within the attribute MaidenName in the Employees table stem from the fact that the assumed data structure unifies males, unmarried females and married females. In principle, the table Employees should have no attribute MaidenName, and there should be a separate table EmployeesMarriedFemales which contains such an attribute. However, provided there are many such reasons for nulls in the Employees table, the number of such additional tables could be unreasonably high, with negative consequences for the complexity of the database schema and the complexity and performance of SQL queries. Null values prevent unreasonable growth of these factors, but at the cost of extra complexity of SQL, with options for processing nulls which are inconvenient, illogical and error-prone. Null values behave differently from normal data values, hence a lot of special syntax, options and special treatment, which (as practice has shown) cannot be consistently designed.

A somewhat different (and indeed reasonable) treatment of null values appeared in the concepts known as semi-structured data, manifested in XML and related technologies. In these concepts null values are represented by absent data. For instance, in an XML file lack of information results in the lack of the corresponding XML object (including its tags). This idea fits well with the idea of collections that are determined by cardinalities. If the database designer or programmer wants to say that some object or attribute A can be null, he/she declares its type with the cardinality [0..1]. Other rules related to such an option are the same as for regular collections. In this way we obtain the possibility of expressing nulls in data structures with no special null-related options in query and programming languages. For instance, let the attribute Job of objects Employee be declared as Job: JobType[0..1]. It means that the attribute Job is optional in an Employee object. If Job is absent in a particular object, then it can be filled in by the insert operator. Similarly, the delete operator can nullify the attribute. Other query operators, such as quantifiers, also apply to such a collection. Below we present some examples (SBQL):

·      Find all employees with Job filled in:

Employee where exists(Job)

·      Find employees without Job:

Employee where Job ≡ ∅    (≡ denotes equality of bags, ∅ denotes the empty bag)

·      Get all designers:

Employee where ∃ Job as j (j = “designer”)

Employee where Job ≡ bag{“designer”}

Note that the query

Employee where Job = “designer”

is incorrect with respect to typing, because a bag is compared with a string. However, such a query can be made more friendly by using a special syntax, e.g.

Employee where exists Job = “designer”

The equivalent query is a bit more complex in SQL:

select * from Employee where Job is not null and Job = 'designer'
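
For comparison, the [0..1] idea can also be mimicked in a programming language. Below is a hedged Java sketch using Optional as the [0..1] collection (all names are illustrative):

import java.util.List;
import java.util.Optional;

public class OptionalJobDemo {
    record Employee(String surname, Optional<String> job) {}

    public static void main(String[] args) {
        List<Employee> employees = List.of(
                new Employee("Smith", Optional.of("designer")),
                new Employee("Jones", Optional.empty()));  // Job absent

        // Employee where exists Job = "designer":
        // the "exists" test replaces any special null machinery.
        employees.stream()
                 .filter(e -> e.job().filter("designer"::equals).isPresent())
                 .forEach(e -> System.out.println(e.surname()));  // Smith
    }
}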

The advantage of this idea is that it is completely universal concerning storing and processing nulls. Because a null value is not introduced explicitly, the majority of semantic problems with null values disappear. Some problems remain, such as the definition of the outer join operator, but they do not seem to be hard. The idea is also good for representing absent complex objects, which could be difficult or impossible in relational systems.

 

Problems with sequences

Sequences are frequently listed as “regular” collections, similar to bags and sets. In some cases a sequence naturally represents an order of elements in conceptual modeling. For instance, it can be important to represent the sequence of authors of a book, where (frequently) the sequence represents the importance of particular authors or the weight of their contribution to the book. A more careful look at sequences reveals their peculiarities, especially concerning stored sequences. The first observation is that stored sequences can be considered an optimization option: some collections are stored as sequences because of some particular access pattern (e.g. via their ordering indices) or because of particular applications (e.g. making fast ordered reports). In the majority of cases sequences are supported by some attributes of their elements (e.g. the time of insertion) that can be used to sort them. Hence we can distinguish the following cases of sequences:

·        The programmer is responsible for inserting a new sequence element into a proper place of the sequence according to his/her own criteria; no sorting procedure is possible. This case is probably very rare and can be neglected.

·        The programmer is responsible for inserting a new sequence element into a proper place of the sequence, but there is a sorting procedure that makes it possible to correct his/her decision. The sorting procedure can also be called after some updates of sequence elements. In this case a sequence can be considered as a bag supported by a sorting procedure.

·        The programmer is not responsible for inserting a new sequence element into the proper place of the sequence. There is an automatic rule that keeps the sequence properly ordered after each operation, including insertions and updates. In this case the ordering rule is a component of the sequence concept. This important component is neglected in all approaches to sequences known nowadays. We can note that this element, in a simplified form, was a property of the very old DBTG CODASYL standard devoted to the network data model. In modern approaches the feature can be implemented by special triggers.

·        A bag can possess many stored orders, as many as the database designer or administrator wishes. Orders are named (for identification) and can be dynamically attached to and detached from a bag. This option makes it possible to produce quick ordered reports of different kinds. For instance, employees can be ordered according to their names and simultaneously according to their salaries. In this proposal the sequence concept does not exist independently: it belongs to some layer on top of bags. Such a feature, in a simplified form, was a property of the DBTG CODASYL standard. The feature can be supported by rules that keep the order within all sequences attached to a bag; see the sketch after this list.
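
As announced above, here is a hypothetical Java sketch of a bag with named orders attached to it (the order names byName and bySalary are assumptions of the example):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class OrderedBag {
    record Emp(String name, int salary) {}

    static final List<Emp> bag = new ArrayList<>(List.of(
            new Emp("Smith", 2500), new Emp("Adams", 3100)));

    // Named orders attached to the bag; they could be attached or
    // detached dynamically by the administrator.
    static final Map<String, Comparator<Emp>> orders = Map.of(
            "byName", Comparator.comparing(Emp::name),
            "bySalary", Comparator.comparingInt(Emp::salary));

    static List<Emp> report(String order) {
        List<Emp> copy = new ArrayList<>(bag);  // the stored bag is untouched
        copy.sort(orders.get(order));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(report("byName"));
        System.out.println(report("bySalary"));
    }
}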

If we consider sequences as an optimization option, then they should be in the hands of the database administrator rather than the programmer. The programmer uses a sorting operator (e.g. order by in SQL) and, depending on the current ordering determined by the administrator, this operator can trigger sorting or can do nothing because the collection is already sorted. If the administrator decides to remove this ordering, then the sorting operator will indeed sort, but only the result returned by the corresponding query. The administrator can decide to keep many orderings for a given collection. For instance, the Person collection can be equipped by the administrator with orderings according to persons’ names, professions, incomes, etc., depending on the sorting operators currently used in applications. Hence the sorting property is similar to indices, which are also in the hands of the administrator. The application programmer does not use indices explicitly and need not be aware of them. A similar approach can be implemented for different kinds of collections. In this way the level of abstraction of end-user application programming is higher, because the concept of ordered collections no longer burdens application programs. On the other hand, such an approach gives the database administrator much more freedom concerning tuning options for minimizing the global processing time.

As we can see, sequences present a bunch of options that may support important practical needs. Development of these options in database models and programming languages is worth attention.

Delete operator and garbage collection

Some authors postulate avoiding the delete operator, as it may lead to the inconsistency known as dangling pointers. They argue that if an object is deleted, but pointers leading to the object are not deleted (or nullified), then these pointers would lead to garbage or to a random object created within the storage space of the deleted object. Hence the programmer anyway has to remove or nullify all the pointers that lead to the object being deleted. If he or she does that, deleting the object is inessential, as it is no longer accessible via pointers; hence it can be removed automatically by the garbage collector.

This view is obviously inherited from programming languages such as C/C++ and Java. It assumes a specific understanding of a pointer (or a reference). In these languages pointers have an obvious physical representation; usually they are memory addresses. Indeed, in such situations removing an object without nullifying (or removing) pointers leading to it might cause the appearance of dangling pointers.

This view is nonsense for collections that are considered at the conceptual level. First of all, delete is a conceptual notion that is well understood by database application programmers. Removing an object from a collection occurs in a lot of business modeling situations. Substituting this conceptual notion by purely physical or representation-oriented operations is a step back from conceptual modeling towards machine-level folklore. But this view is also nonsense for purely technical reasons, in particular, the following:

·        It is quite easy to implement pointers in such a way that the appearance of dangling pointers is impossible, despite deletion of objects. For instance, in Loqis and in ODRA each pointer from object A to object B is associated with a twin backward pointer from B to A. Backward pointers are transparent to programmers. Hence, the operation of deleting an object is preceded by navigation along the backward pointers, nullifying or deleting the proper pointers, and then deleting the object (see the sketch after this list). The additional storage necessary for backward pointers can be disregarded for the majority of practical applications.

·        On the conceptual level it is possible to create collections of objects with no pointers leading to them; such objects are accessible only through the physical representation of the collection. How is the programmer to remove pointers that belong to the physical representation rather than to his or her schema?

·        If the programmer has to nullify or remove all the pointers leading to an object, his or her knowledge of the database schema must be much more complete than in the case of removing a single object. An object of a given class can be referenced from dozens of classes. It is not reasonable to require such knowledge for such a purpose as the deletion of an object.

·        In a shared environment, parts of the entire schema can be unavailable to a given programmer for security, privacy or other reasons. How can the programmer nullify a pointer if he or she has no access to it?

·        If the programmer trying to remove an object forgets about only one pointer and does not nullify it, then this object will not be physically removed and will cause inconsistent behaviour of some application. For instance, if an employee object is to be removed by nullifying pointers, but a pointer from the healthcare service is not removed, then this employee object will still be processed by the healthcare application.

·        Pointers themselves can also form collections, hence some operation of deleting a pointer would be necessary anyway. Thus the question concerns the principles of conceptual closure and the relativity of objects: why should the delete operation be available on pointers, but not on other kinds of objects?
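
To illustrate the first item of this list, here is a hypothetical Java sketch of the twin backward-pointer technique (the class names are illustrative and not taken from Loqis or ODRA):

import java.util.ArrayList;
import java.util.List;

class Obj {
    List<Pointer> incoming = new ArrayList<>();  // transparent back-pointers

    void delete() {
        // Navigate the backward pointers and nullify every reference to
        // this object before it disappears: no dangling pointer can arise.
        for (Pointer p : List.copyOf(incoming)) p.nullify();
        incoming.clear();
        // ...now the object itself can be removed from its collection.
    }
}

class Pointer {
    Obj target;

    Pointer(Obj target) {
        this.target = target;
        target.incoming.add(this);   // maintain the twin backward pointer
    }

    void nullify() {
        if (target != null) { target.incoming.remove(this); target = null; }
    }
}

class DeleteDemo {
    public static void main(String[] args) {
        Obj b = new Obj();
        Pointer p = new Pointer(b);   // forward pointer A -> B
        b.delete();                   // nullifies p transparently
        System.out.println(p.target); // null: no dangling pointer
    }
}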

The above arguments clearly show that this view of the delete operation is a conceptual weed arising from transferring some reasonable argumentation to a totally different environment, where it leads to nonsense. In databases, especially in SQL, the delete operation has its normal meaning and pragmatics, and there is no reason for object-oriented databases to introduce any new quality in this respect.


Last modified: May 23, 2009