©Copyright by Kazimierz Subieta.

Storing and Processing Irregular Data (Semi-Structured Data)

by Kazimierz Subieta


 

Irregular data, that is, data that do not follow some fixed format, are a property of many current technologies, especially those based on XML. Irregular data are also specific to conceptual modeling, which frequently requires some form of loose format. Irregularity in data takes many forms and raises many issues, especially concerning strong type checking and query languages. Irregular data have also given rise to a lot of naïve research, inconsistent solutions, poor imagination on the part of developers and false opinions. In this chapter we discuss the more important issues related to irregular data and present our views and concepts concerning irregular data in the context of SBA, SBQL and object-oriented store and query models.

The most recognized concept related to irregular data is the null value known from the relational model, relational databases and SQL. The concept, inevitable and obvious in relational data structures, becomes quite difficult in query and programming languages. In SQL it is supported with apparently little care for semantic consistency and its consequences. Depending on the SQL construct, nulls are treated very differently, and the whole design seems contradictory to common logic and sense. This, however, is perhaps not the fault of the SQL designers. The devil of chaotic design is inside the null value concept itself. According to [DaDa92, DaDa95] there is no consistent way to introduce nulls into database structures and the corresponding query and programming languages.

A new wave known under the term “semi-structured data” approaches the problem from a different angle and actually does not refer to null values. The problem arises in the context of looser and weakly typed data structures that are common in some applications, in particular in Web and XML technologies. However, the semi-structured data problem inherits a lot from the problem of null values.

In general, any DBMS must eventually deal with the complete computational and pragmatic power of programming interfaces. Hence features related to irregular data must be combined with all aspects of database systems: data description, query languages, updating and other imperative constructs, typing, object-orientedness, procedures, methods, views, active rules, metadata, etc. Support for irregular data is especially challenging if one assumes that the data are to be processed by strongly typed programming languages, such as C++, C# and Java. Usually such languages support regular data formats only. The necessity of mapping between irregular data and regular formats must lead to some impedance mismatch. Hence the support for irregular data must be designed with extreme care concerning conceptual simplicity, minimality, consistency, universality and openness to external data processing tools.

During the development of SBA and SBQL we have addressed the problem of irregular data in object bases and considered how such data are to be manipulated in integrated query/programming languages addressing object models. The difference between our approach and the approaches known in the literature is that we extend the scope of issues related to irregular data to all the aspects that are necessary to make a professional object-oriented database query and programming language. Our approach to irregular data is very simple and is based on the idea of collections and cardinalities of UML. Null values are represented as absent data. Any data that can be absent are typed with cardinalities whose lower bound equals zero, for instance [0..1] or [0..*]. In particular, optional data are treated as a collection that is either empty or has exactly one element. To process such data the programmer can use the facilities that are specific to processing collections rather than to processing values. In this way we can deal with nulls without introducing an explicit null value. Within this approach all the disadvantages of null values presented in [DaDa92, DaDa95] become irrelevant. This chapter discusses the situations that are related to this new approach to nulls and presents the corresponding SBQL solutions.
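To give a flavour of the approach before the detailed discussion, the sketch below uses the declaration and query syntax introduced later in this chapter (examples E.IRR.1 and E.IRR.2); the optional attribute salary and the constants are only illustrative.

SBQL (sketch):

salary: real [0..1];                          // the bag is empty or has exactly one element

Emp where forsome salary as s (s > 1000)      // employees with absent salary are excluded
Emp where forall salary as s (s > 1000)       // employees with absent salary are included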

Null Values

Many authors (more than 500 papers) have associated null values with “incomplete information”, producing a lot of research aimed at solving this apparently important problem. After decades almost all this research has come to be commonly recognized as scholastic, with little (or perhaps no) practical impact in the domain of database management systems. Some of this research, however, is claimed to be applicable in other domains, in particular artificial intelligence, data mining and data warehouses.

From the practical point of view, null values are loosely related to the problem of uncertain or missing information. The term irregular data expresses the problem better. It concerns situations when particular information does not fit exactly the predefined data format, or when some spots in the storage media (or in their formal or conceptual model) are not filled in with meaningful data. The interpretation of the fact that a particular data spot is unfilled is a data semantics issue, outside of any formal model.

Actually, SQL adopts just such an attitude to null values. The meaning of SQL null values is not predefined. Null values are a technical feature of programming that can be used for different purposes and informally interpreted by designers, programmers, database administrators and other users. Because of the many sources of null values and the many reasons for which they are introduced, it is difficult to imagine that another, more semantically specific approach to null values makes sense. As an example, if we assume that almost all persons in some collection are not handicapped, then the value of the attribute handicapKind can be null for almost all objects. Only handicapped people have some information within this attribute. But the interpretation of these nulls belongs to the informal business semantics of the data. In this case nulls have nothing in common with uncertain or incomplete information – the nulls are used as regular and certain information determining that the corresponding persons are not handicapped. Some methodologists might not recommend using nulls in such a way, but this would be only a recommendation, and the decision anyway belongs to application and database designers.

The irregularity of data is relative to the typing system that is assumed in the given database and its query/programming environment. Irregularity with respect to some older data models (such as the relational one) is no longer an irregularity if the typing system is prepared for it and there are corresponding constructs serving all the types in a corresponding query and programming language. Hence the right approach to irregular data is to turn them into regular ones by preparing a proper typing system. Such data are then to be addressed by a database query and programming language with full universality. However, irregular features require extra programming options, thus the resulting languages can be more (or even too) sophisticated for programmers.

In the development of SBQL we have tried to introduce irregularities to object data structures without compromising the simplicity of the language. Eventually, however, any design must be validated by the common practice of the language's users.

Variants

In the domain of programming languages there is another well-known feature that can be considered as irregular data. It is called a “variant” in the Pascal family of languages, or a “union” in the C family. In the following we will use the term “variant” (“union” is already used as an SBQL query operator). Originally (e.g. in Pascal) variants were introduced to save memory space. This motivation no longer holds for current data models, where variants have mostly conceptual meaning. A variant may be required when objects of a conceptually similar type can change the collection of their attributes. For example, Employee objects may describe both regular employees and junior employees. In the variant for a regular employee the object contains the attribute salary and no attribute scholarship, while in the variant for a junior employee – vice versa.

Variants and null values are conceptually similar. If a null value is interpreted as an absent attribute, then the assumption that an attribute A can be null-valued might be modelled as a variant of record types R(A, B, C,...) and R(B, C, ...). The notions of null values and unions are, however, not equivalent. For example, if some record type involves n attributes that can be null-valued, then the number of corresponding variants is 2^n. Variants also cover the situation when the names of attributes are the same but their types are different. It is interesting to note that the PL community noticed variants and ignored null values, while the database community did vice versa. Because variants do not match the concept of a relation, they were unacceptable to the proponents of the relational database model. This is one of many cases when a stiff mathematical frame is an impediment to progress.

One can observe that in object-oriented models variants are unnecessary, as they can be substituted by specialized classes. However, classes mostly carry conceptual information from the business model. It would be improper to burden the model for technical reasons such as variants. Variants would cause the appearance of many artificially introduced classes, with exotic names and unclear meaning. Moreover, if there is more than one variant in one business object type, this may result in a combinatorial explosion of (permuted) specialized classes. If a type introduces n binary variants on different attributes, then the number of specialized classes that conceptually cover this situation is 2^n. This is an argument that variants should be introduced as a feature orthogonal to classes.

Variants, as introduced in Pascal, require a special attribute (a so-called discrimination attribute) that allows determining at runtime which variant is currently processed. Such an attribute makes strong type checking of variants possible (in this case the checking must be delegated to runtime). In C variants (unions) were introduced without a discrimination attribute, hence the responsibility for determining the variant is in the hands of the programmer. To this end he or she can use any option and any information. This solution makes strong typing of variants impossible.

In the literature there is a subdivision between exclusive and non-exclusive variants. Exclusive variants mean that in the actual object we can have A or B, but not A and B together. For instance, an Employee record can have the attribute salary or the attribute scholarship, but not both. Non-exclusive variants do not imply such a constraint and hence are easily substituted by optional data: a non-exclusive variant between A and B can be represented as two attributes A and B with cardinalities [0..1], as sketched below. Hence only exclusive variants present an essentially new situation for conceptual modelling, typing and programming capabilities.
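Using the declaration style of the SBQL schema presented later in E.IRR.1, the non-exclusive case could be modelled directly by optional attributes (a sketch; the names R, A, B, C are taken from the text above):

SBQL (sketch):

class RClass {
    instance R: {
        A: real [0..1];      // may be absent
        B: real [0..1];      // may be absent; nothing forbids A and B being present together
        C: string;
    }
}

Note that such a declaration cannot express the exclusivity constraint (at most one of A and B present), which is why exclusive variants need a construct of their own.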

Variants are also similar to dynamic object roles, as described in the AS2 object store model. Dynamic object roles provide capabilities much more general and flexible than the capabilities assumed in variants. Most of all, the difference concerns conceptual modelling. Variants can be introduced for any technical reason, while dynamic object roles are assumed to have some business-oriented semantics. In particular, each dynamic object role must possess a distinguished external name, while this is not required for a variant. Actually, however, these distinctions between variants and dynamic object roles are secondary from the technical point of view. Object roles can be provided as a substitute and generalization of variants. This leads to some new situations with strong typing. Such an approach to typing roles is described in the PhD thesis by A.Jodłowski [Jodl03].

Typing of Null Values and Variants

Irregular data require a new attitude to typing, especially to static (compile-time) strong type checking. Some authors argue that due to the irregular nature of the data strong typing is impossible: such data are type-less or schema-less. We disagree with such opinions. Lack of types or data schemata leads to a lot of pathological situations in programming. In particular, a database schema is a carrier of business-oriented data semantics and determines the structure and representation of data. It is difficult to imagine that the programmer can write programs without any schema, because in such cases he or she must guess the structures from examples of data or apply generic programming in which everything, including data names, is perceived as strings. At least for conceptual modelling, types or schemata must be introduced anyway, explicitly or implicitly. This is the motive for such languages as DTD or XML Schema. The question arises how such facilities can be used for static type checking, for determining the representation of data, for resolving some ambiguous situations (such as ellipses and automatic coercions), for query optimization, etc. Within SBQL we have proposed a typing system that we call “semi-strong”, to underline the fact that our types are prepared for typing irregular data, with all the features that are usually associated with types, such as conceptual modelling, static type checking of queries and programs, resolving ambiguities in queries and programs, and preparing information for query optimization.

Concerning null values and variants, static type checking (at compilation time) has limits, because both a null value and a variant are runtime features. Hence in such cases type checking must usually be delegated to runtime. Alternatively, the developers of a language can assume that null values and variants are outside the typing system and, instead of typing errors, they assume exceptions in reaction to some illegal runtime situation (e.g. when a null is an argument of an arithmetic operator). Such exceptions, however, anyway require semantically clean detection of null values and variants during runtime.

More information on typing irregular data will be given in a section devoted to typing.

Current Proposals Concerning Irregular Data

The current view on null values is materialized in relational databases and SQL. A null value can be stored in a database and is returned by an SQL expression if it cannot be evaluated correctly. This may happen because of null-valued arguments of an operation, as well as in the case of wrong arguments of operations. For example, the function sum returns NULL for an empty argument table. SQL does not allow explicit comparisons of nulls and ordinary values but provides a special predicate is [not] null. A special function if_null returns an attribute's value if it is not null, or some constant value otherwise. Comparison of an expression which returns NULL with any other value returns a third truth value UNKNOWN; in the WHERE clause, however, it is equivalent to FALSE. In embedded SQL the admission of null values requires, for every such attribute, two variables in the underlying host language program. An extra indicator variable is used to store boolean information determining whether the particular value of an attribute is null or not.

Several authors (notably Date and Darwen) point out difficulties related to null values, which make the semantics of user/programmer interfaces obscure and inconsistent. Date [Date86] presents a severe criticism of null values in the relational model (“...the null value concept is far more trouble than it is worth”, “...the SQL null value concept introduces far more problems than it solves”). He gives many striking examples of anomalies implied by null values. For instance, although in SQL “officially” every null value is distinct (thus A = B returns UNKNOWN if both A and B return nulls), the group by and unique operators treat them as identical, and the aggregate functions sum, avg, min, max, count totally ignore them. Another striking example: if a relation R contains numerical attributes A and B which can be null-valued, then in general select sum(A+B) from R may return a result different from select sum(A) + sum(B) from R. Date concludes that these flaws are not only the property of an inadequate SQL design, but the inevitable consequence of the idea of null values.

The descendants of SQL also promote the use of null values. The SQL-99 standard [SQL99] proposes introducing an arbitrary number of application-specific null values such as “Unknown”, “Missing”, “Not Applicable”, “Pending”, etc. It is not clear, however, how this feature is to be reflected in the semantics of particular query/manipulation constructs.

Variants are proposed in the IDL of the OMG CORBA standard, with an explicit discrimination attribute. It is assumed that they are to be addressed in programming languages, but only C and C++ provide corresponding facilities. A similar concept is proposed in the ODMG standard. However, the standard contains no construct to deal with variants in the query language OQL and no suggestion how variants are to be treated by the assumed strong typing system. As far as we know, there is actually no powerful query language that deals consistently with variants. Null values are introduced in the ODMG standard in a manner much more restrictive than in SQL. This is perhaps motivated by the possible problems with nulls that the designers of the ODMG standard wanted to avoid. No operators such as outer joins are allowed in ODMG OQL. In general, object-oriented database models neglect the problem of irregular data. Perhaps SBA and SBQL are the first proposals that address powerful object-oriented models and support non-trivial forms of irregular data.

There are also proposals for loosely typed or typeless data structures. Examples are XML and RDF-oriented technologies. For instance, in XML one can introduce any tags or attributes within tags and put inside them any values[1]. In RDF one can change ad hoc the type of some entity. Such approaches are even advocated as “more flexible”, hence having advantages over disciplined data structures. However, flexibility should not be confused with chaos. Tags (data names) and types are features of a schema, which is necessary for the user to ask correct queries. If tags and types are unknown, then the user or programmer must guess them and discover the meaning of values based on some side information, which can also be poorly defined. Our conclusion is that loosely typed or typeless data structures are unmanageable, especially for very large databases. Such proposals can be used for small, non-critical applications, where the user can see the entire database (and its next possible states) before asking queries. In real applications such situations are difficult to imagine.

Irregular Data in Theories

The fundamental defect of theories w.r.t. null values and other forms of irregular data is the scope mismatch. The phenomenon is similar to the impedance mismatch and concerns the difference between the scope of theories and the scope in which irregular data need to be considered in practice. Theories usually address an idealized query language with very limited capabilities (e.g., SPJ queries in the relational algebra, or the language of some predicate logic). The scope for null values is much wider and should include:

·      All advanced query constructs, such as grouping, aggregate functions, quantifiers, ordering;

·      Imperative constructs based on queries: creating, updating, inserting, deleting;

·      The interface for embedding a query language into a programming language;

·      Database semantic enhancements: views, database procedures, integrity constraints, deductive rules, active rules;

·      Object-oriented extensions: ADT-s, types, classes, interfaces, encapsulation, inheritance;

·      Static strong or semi-strong type checking;

·      Privacy, security, transactions;

·      Interfaces for end users and programming interfaces: graphical query languages, 4GLs, and others;

·      Metadata (an internal representation of a database schema).

The scope mismatch is the main reason for the very low applicability of the theories addressing null values and other forms of irregular data. From the position of designers and vendors of DBMSs it is very risky to introduce some particular solution without a clear view of how it could be extended to the whole conceivable environment of database application programming. In our opinion, theories concerning null values in databases have severely impeded progress rather than supported it.

Irregular Data in Object Databases

In practice there are many sources of irregular data. Sometimes they can be qualified as uncertain information, but in general irregularities are related to specific aspects of data storing and processing.

·      Information is irrelevant, for example, “nickname” for a person who has no nickname.

·      Information is known, but not filled yet because data input lasts some time. For example, a new employee is hired, all his/her data are already introduced to the database, except “salary” which has to be input by another administration division.

·      Information is valid only for a fixed time; after that it must be nullified (to prevent errors) and then input again or recalculated; for example, weather or stock market forecasts.

·      Information is known and could be filled in, but it is considered inessential for the particular domain of applications. For example, in some airline ticket reservation system the nationality of passengers is not filled in for domestic flights.

·      Information consists of “legacy” records, with a limited amount of information, and actual records having an extended set of attributes. Hence in the integrated resources the legacy records must possess null values.

·      The result of some processing is an intermediary data structure containing null-valued elements; e.g. outer joins.

Nowadays there are further reasons to deal with irregular data. The following aspects of object database technologies may require them:

·      Object-oriented analysis and design. In many cases the admission of null values, unions and repeating data presents an important facility for the analyst or designer. For example, the designer can merge two classes into one by introducing some null-valued attributes; and vice versa.

·      Schema evolution. As a result of schema evolution some new attributes may appear, some may disappear, some may change type, some single attributes can be changed into repeating ones. This naturally leads to variants and optional data.

·      Interoperability. Heterogeneous and/or distributed databases require the resolution of missing or conflicting data values that occur when semantically identical data items have some attribute values different or missing in some data sources.

·      Web and XML. HTML and XML files are weakly formatted. There is a lot of valuable research devoted to querying semi-structured data, especially XML; however, there is little doubt that they will require efficient programming techniques to deal with their irregularities and inconsistencies.

·      Data warehouses. There is a tendency for collecting information concerning particular topics (e.g. a stock market) from various heterogeneous sources. Analytical processing of such heterogeneous information (statistical analysis, overview reports, data mining, etc.) requires capabilities of integrated query/programming languages that will make querying null values and variants possible.

Note however that null values and variants are additional programming options that can easily be avoided both at the stage of database design and at the stage of programming. For instance, if an attribute A of type integer can be null-valued, it is enough to introduce two attributes: the attribute A with no null allowed and an additional Boolean attribute is_A_null, with the value true indicating that the value of A should not be taken into account because it is invalid. This approach is actually used in host languages embedding SQL. It corresponds to the discrimination attribute that is used to distinguish variants. The attribute is_A_null can also store more information about the status of A, for example “not initialized”, “pending”, “irrelevant”, etc. Such an approach does not require introducing irregularities of data explicitly; however, it can be more difficult for conceptual modelling and makes the description of objects more complex. A minimal sketch of this workaround is shown below.
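In the declaration style used later in this chapter (E.IRR.1) the workaround could be expressed as two attribute declarations; the names follow the text above and the boolean type is assumed here only for illustration.

SBQL (sketch):

A: integer;               // always present, no null allowed
is_A_null: boolean;       // true indicates that the stored value of A must be ignored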

Date’s Default Values

As an alternative to null values, C.J. Date and other authors propose the concept of default values, with the intention that default values work as nulls. They are ordinary values that are filled into a tuple in the case of absent information. Default values can be specified in a relational schema. According to Date, default values do not require any special options concerning irregular data in databases and in programming languages.

Unfortunately, this claim is obviously false and based on a fundamental misunderstanding. Null values and other irregularities are conceptual issues, while default values are a purely technical option for recording irregularities. If the programmer forgets to consider default values as a special means for storing irregularities, then his/her program will probably be formally correct, but will have conceptual flaws from the point of view of the business that is supported by the program. To avoid this danger, default values must be indicated in the conceptual database schema, which thereby becomes more complicated. Processing default values may require special care or special options. For instance, if for salaries the default value were -1, then calculating the average salary through the aggregate function avg would be erroneous. The programmer cannot forget that before calculating the average he/she must filter out all objects with salary = -1. Moreover, for some domains there may be no good default, e.g. for the Boolean type. Default values are also extremely error-prone: if the programmer forgets or does not properly recognize defaults, then the typing system will not be able to recognize the error, and it will appear at runtime with no warning.

Note that this intention of default values is unusual for programmers. Usually a programmer expects that default values are the most common values that are filled in during object creation, for example the value “no” for the attribute IsSmoking. Using default values with another intention would require – at least – another term to denote them and special explanation in the manual.

Summing up, the proposal of using default values as a substitute for null values is as doubtful and risky as the criticized null values themselves. In our opinion, default values are not worth considering seriously as a special option in data structures and query languages.

SBA - Approach to Irregular Data

We assume that a schema is an inevitable property of any data store, including stores of irregular data. A schema is necessary for conceptual modelling. Before writing a query or a program the programmer must clearly understand what the database contains and how it is logically organized. Such a schema can be informal or even not explicit, but it must anyway exist.

If a schema is expressed formally, it can be used for dynamic (runtime) checking of processed objects. For example, DTD and XMLSchema play this part for XML repositories. A bit more formality and discipline is necessary if we assume that types declared within a schema are to be used for static (compile time) type checking of queries and programs.

Given this assumption, the new qualities introduced by irregular data are the following:

·        A schema should be able to express the fact that some data are optional and may not be present in a particular data instance. This feature is specified by the cardinality [0..1] and is equivalent to the SQL clause “NULL IS ALLOWED”. In this way every datum that can be optional is modelled as a bag which is empty or has exactly one element. All operators that address bags can be applied to such a bag, in particular quantifiers. Properly designed implicit type coercions between bags and individual elements (and vice versa) make it possible to reduce the conceptual overhead related to optional data to a minimum.

·        A schema should be able to express the fact that some data are alternative. This issue is known as “exclusive variants”. We assume that a variant must possess a discriminator that allows type checking of variants during runtime.

·        A schema should be able to express the fact that some data can be repeated some number of times, including zero times or any number of times. This issue is known as the cardinality of a collection. Cardinalities can also specify pointer links (association instances) among objects.

·        The schema language should allow orthogonal combination of the above features at any data hierarchy level.

Any binding to data that are absent results in an empty bag rather than in a special value “null”. Although this looks like a very subtle change, it has essential consequences. We show that this idea makes it possible to overcome the difficulties encountered in SQL and can be smoothly combined with variants and object-oriented concepts. It can also be consistently incorporated into query languages and programming languages integrated with queries.

As far as we are aware, this model is the first simple, intuitive and universal idea which is capable of expressing uniformly all features related to semi-structured information and which is acceptable to an average database engineer. The idea is consistently incorporated into the data structures, query language, programming interfaces and other capabilities of the prototype object-oriented database management system ODRA. The ODRA query and programming language SBQL is much more powerful than Lorel and UnQL, which are well recognized for their capabilities to process irregular data. Moreover, SBQL has very precise, intuitive semantics and is seamlessly integrated with programming constructs, programming abstractions and object-oriented notions.

Querying Optional Data, Variants and Repeating Data

Absent data (including variants) imply the necessity of special care in queries. Consider the SBQL schema (from the ODRA system):

E.IRR.1

Database schema involving departments and their employees

SBQL:

type AddressType is record { city: string; street: string; house: integer; }

class EmpClass {

    instance Emp: {     

        name: string;

        address : AddressType [0..1];               //absent data allowed

        salary: real[0..1];                                 //absent data allowed

        job: string;

        worksIn: ref DeptClass[0..1] reverse employs; //absent data allowed

    }

}

class DeptClass {

    instance Dept: {

        employs: ref EmpClass [0..*] reverse worksIn;

        boss: ref EmpClass[0..1];                  //pending absent data allowed

        dname: string;

        budget: real;

        location: string [1..*];                        //multiple values allowed

    }

}

Emp: EmpClass [0..*];

Dept: DeptClass [0..*];

 

The schema declares one type, AddressType, and two classes, EmpClass and DeptClass. It then declares two collections of objects, Emp and Dept, of the defined classes. For Emp objects the attributes address and salary and the reference worksIn are optional. In Dept objects the reference employs is multi-valued, the reference boss is optional and the attribute location can assume one or more values.

Consider the query Emp where salary > 1000. If for some employee the attribute salary is absent, then binding the name salary will return an empty bag, which is then compared by > with 1000. The comparison > is not defined for such a case. We can therefore consider the following possible definitions of the semantics:

·        Predefined truth value: in this case the formula will return FALSE.

·        The formula will return the third logical value UNKNOWN.

·        A runtime type error or exception, i.e., the case is semantically forbidden.

The first approach leads to unreliable programs because, for example, some obvious tautologies do not hold: not (salary > 1000) is not equivalent to salary <= 1000. The second case requires introducing into the integrated query/programming language a lot of features that process the UNKNOWN value. As experience has shown, such a new truth value is extremely cumbersome in constructing and using the language. Although the two previous cases are allowed in some languages (e.g. SQL, LINQ and OCL), our decision is restrictive: only the last case is acceptable. We do not accept a false “user friendliness” in which queries are simpler at the cost of clean semantics and the reliability of applications.

The standard query capabilities that exist in SBQL can be used to handle the last case. Depending on preferences, the designer or programmer can choose between the options presented below. For each described capability we list two options. In the first one, marked by a), an identifier of an Emp object with absent salary is included in the result of the query. In the second option, marked by b), such an identifier is excluded from the result.

E.IRR.2

Get employees earning more than 1000

SBQL:

// Quantifiers:

a)  Emp where forall salary as s (s > 1000)

b)  Emp where forsome salary as s (s > 1000)

// The SBQL function exists tests salary:

a)  Emp where if exists(salary) then salary > 1000 else true

b)  (Emp where exists(salary)) where salary > 1000

// The SBQL function count calculates the size of the argument:

a)  Emp where if count(salary) = 1 then salary > 1000 else true

b)  (Emp where count(salary) = 1) where salary > 1000

// In some languages the operators and, or are “lazy”, i.e. if the
// argument before or returns true (the argument before and returns false)
// then the second argument is not calculated.
// With such semantics we have one more pair of options:

a)  Emp where not exists(salary) or salary > 1000

b)  Emp where exists(salary) and salary > 1000

// In general, lazy operators can lead to problems in query languages,
// because query optimization methods based on rewriting
// can change the order of arguments of boolean operators.

 

Optional data may concern not only atomic objects, but also complex ones. Note that a null value for a complex attribute is not expressible in relational systems. SBQL deals with this case as usual:

E.IRR.3

Get names and cities of programmers; if address is absent, return the string “no address”.

SBQL:

(Emp where job = “programmer”).

(name, if exists(address) then address.city else “no address” )

 

Similar methods apply to variants; a sketch is given below. Querying variants does not imply any special option in SBQL; however, variants with discriminators require special features of a strong typing system.
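For instance, assume the regular/junior employee variant from the Variants section is modelled by two optional attributes, salary: real [0..1] and scholarship: real [0..1] (the latter is not part of the schema E.IRR.1 and is assumed here only for illustration). A query addressing both variants needs only the standard operators shown above:

SBQL (sketch):

Emp.(name, if exists(salary) then salary else scholarship)    // the income, whichever variant applies

According to the semantics of the structure constructor “,” (see the discussion of E.IRR.7 below), an employee for whom both attributes are absent contributes nothing to the result.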

SBQL is also prepared to process any other cardinalities of collections. The query below processes location of Dept, which has the cardinality [1..*].

E.IRR.4

Get cities hosting all departments.

SBQL:

(unique(deref(Dept.location)) as deptcity)

where forall Dept (deptcity in location)

 

In the first line the query collects the locations of all departments; deref converts the location identifiers to strings and unique removes duplicate strings. Each element of the result is named deptcity. The second line of the query selects the cities hosting all departments.

As a side effect of the above assumptions, aggregate functions have the same semantics as in SQL: absent data do not influence the result. Consider the query

E.IRR.5

Get the average salary of all employees.

SBQL:

avg(Emp.salary)

 

For Emp objects with no salary sub-object, the name salary occurring in the query will return an empty table. This table will be merged with the other tables according to the union operator, hence it does not influence the result. This approach is simple, consistent and should be easily understood by programmers. Consider Date's example

E.IRR.6

 

SQL:

select sum(A+B) from R

select sum(A) + sum(B) from R

where the first query may return a result different from the second one. In SBQL the first query cannot be formulated as sum(R.(A+B)) because if A or B denote an absent attribute then the operator + fails. The correct formulation is:

E.IRR.7

 

SBQL:

sum(R.(A as a, B as b).(a+b))

 

The operator “,” (structure constructor) returns an empty table if any of the arguments is absent. Hence, a+b is never evaluated with a or b returning an empty table. The second query in SBQL

E.IRR.8

 

SBQL:

sum( R.A ) + sum( R.B )

 

can hardly be confused with the first one.

 

 Capabilities Equivalent to Outer Joins

In the relational model the outer join is an important operator that is relevant for semantic modelling and design methodologies, and convenient for programmers. In many cases the ordinary (inner) join is not satisfactory because it loses information. For example, the name and the department name of an employee can be obtained by the query (assuming a dno column within the Emp and Dept tables):

E.IRR.9

 

SQL:

select name, dname from Emp, Dept where Emp.dno = Dept.dno

 

The resulting table, however, will not contain information on those employees for which the dno attribute is null-valued. The outer join enables the user to formulate a query returning a tuple <name, NULL> for an employee having a null-valued dno attribute. The operator is usually considered non-primitive, as it can be expressed by other operators of the relational algebra. Date [Date86] advocates the “default style” outer join, which avoids null values. There are also more specialized outer joins: the left-hand side outer join preserves unmatched rows of the left-hand argument table only, and the right-hand side outer join does vice versa.

A similar operator could be also useful in object-oriented query languages. Consider the corresponding query in SBQL:

E.IRR.10

 

SBQL:

Emp.(name, (worksIn.Dept.dname))

 

It returns no information on employees for which the worksIn sub-object is absent. We can avoid special options because the existing SBQL operators are sufficiently powerful. For example, assuming that the user would like to return the string “???” for Emp objects with absent worksIn, the above query can be formulated as:

E.IRR.11

 

SBQL:

Emp.(name, if exists(worksIn) then (worksIn.Dept.dname) else “???”)

 

There are other solutions, still without introducing special features to SBQL. For instance, we can use groupas operator:

E.IRR.12

 

SBQL:

Emp.(name as  n, (worksIn.Dept.dname) groupas d)

 

This solution is just in the spirit of SQL: instead of a null, however, we return an empty bag named d. Such a solution can be applied in all cases that in relational languages require outer joins and can be specialized to cases equivalent to left-side and right-side outer joins, as the sketch below illustrates.
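For example, the symmetric case – an analogue of the right-side outer join, which preserves departments that employ nobody – also needs nothing beyond groupas (a sketch based on the schema E.IRR.1; the auxiliary names d and staff are only illustrative):

SBQL (sketch):

Dept.(dname as d, (employs.Emp.name) groupas staff)

For a department without employees the name staff denotes an empty bag rather than a null.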

Default Values in SBA

A class object in an object-oriented PL may store several kinds of invariants inherited by class members. Default values can be considered as such invariants. We distinguish two roles for them:

·        Initial values for some attributes, which are used when a new object is created. They are physically copied into the created object, thus need not be inherited. Such default values are used only by a distinguished method commonly denoted new. This kind of default value is not relevant to irregular data.

·        Common default values, which are dynamically inherited by class members and can be overridden by values within a particular member. Such a value is used in situations when a given member object has the corresponding attribute absent. Default values stored inside classes have the same external name as the corresponding attribute. They can be atomic or complex.

 

 

Fig.57. Overriding of default values

 

Default values in the object-oriented approach do not imply special features in query languages, assuming the model and scoping/binding rules explained in the chapters [SBQL non-algebraic operators] and [SBQL for the AS1 model]. However, we must consider a new semantics of assignments. When one updates a value which is actually inherited from a class, the semantics should not update that value, but create a new sub-object with the value determined by the right-hand side of the assignment. For example, in the case shown in Fig.57 the assignment

 

E.IRR.13

 

SBQL:

for each Supplier where Sname = ”Black” do Status := 45;

 

should result in creating a new atomic object <i_new, Status, 45> and then inserting it into Black's object. Updating of class properties via the class members should not be allowed.

Note that the runtime structures should allow navigation from the identifier of a default object to the identifier of the object that is currently processed (to make the proper change within it). In SBA such navigation is quite easy, because the environment of the currently processed object is on the top of the stack. However, some minor modification of the stack-based mechanism could be necessary.

The presented definition of the default value capability can easily be generalized, for example in order to cater for cases when an inherited property is not only a “static” object, but also a functional procedure. Assume that when the status of a supplier is not defined, it is calculated as the number of products he or she supplies. In this case, instead of introducing the object <.., Status, -1> into the SupplierClass, we introduce the following functional procedure (a method), where SP(Sno, Pno,...) denotes the relation connecting suppliers and parts:

E.IRR.14

 

SBQL:

integer procedure Status {

      return count((SP as x) where x.Sno = self.Sno);

};

 

There are many examples of situations in the conceptual database design, when such dynamic defaults might be useful. As far as we know they have not been considered in the database literature.

Note that the procedure, like all inherited procedures, is evaluated within an environment of a Supplier object, thus the second occurrence of Sno in the procedure's body is properly bound to Sno of Supplier (what is indicated by self). Semantically, the construct SP as x creates so-called binders, i.e., named runtime entities. For detailed semantics related to binders see chapter [Defining Auxiliary Names].

 

Default Values and Scoping/Binding Rules

 

Default values stored within classes require special attention concerning scoping/binding rules. In some situations it is difficult to determine consistent scoping rules that obey the natural expectations of the programmer. These expectations can be summarised in two points:

·        First visit the internal properties of an object; if there is no proper attribute, then look for the default within its class, superclass, etc.

·        During execution of a method first visit its internal environment, then visit its class, superclass, etc. looking for local properties; next, visit the internal properties of an object being the receiver of a message.

These rules are contradictory for default values. For example, assume that a method m is stored within a class C1 and refers to an attribute a stored within an object O; O is an instance of a class C2, and C2 contains a sub-object a storing a default value for the attribute a, Fig.58a.

 

Fig.58. A conflict between natural scoping rules and default values

Assume that class C1 inherits from C2, that the object O receives the message m being a property of C2, and that within the body of m there is a reference to the attribute a. The natural order of visiting particular environments during the binding of the name a occurring within the body of m is shown in Fig.58a as a sequence of curved arrows:

·        Visit the local environment of m ;

·        Visit private and exported properties of C2 (to bind some other methods, class variables or procedures that may be called within m);

·        Visit the sub-objects of O ;

·        Visit exported properties of the classes that O belongs to; in this case, the exported properties of C2. The default a will be bound here if a was not bound in the previous step;

·        Visit the rest of the environment (database objects, global temporary objects, functions and procedures from available global libraries).

The question arises what will happen if C1 and C2 are the same class. In such a case the above rules make our idea of default values invalid for the method m, because the default a will be visited and bound before the attribute a, Fig.58b.

There are of course several ways to avoid this conflict. For instance, inside a method, if a default attribute is to be bound, it is first checked whether an attribute with the same name exists in the currently processed object. The above case shows, however, that for object-oriented database programming languages the scoping/binding rules may imply some pitfalls, hence all possible situations should be carefully analysed, designed and tested.

Possibility of False Binding

 

If a query language is not strongly typed, irregular data together with assumed full orthogonality of query operators may cause a binding problem. To explain it, consider the query:

E.IRR.15

Get employees who have salary and earn the same as Brown

SBQL:

(Emp where exists(salary))

          where salary = ((Emp where name = ”Brown”).salary)

 

If Brown's salary is absent, the query should cause a runtime error. However, during the binding of the third occurrence of salary the environment stack is as shown in Fig.59.

Fig.59. An example of false binding

Since the top of the environment stack contains no salary binder the search will be continued in lower sections of the stack. The section below contains the salary binder, thus the binding will succeed. As a result, the predicate after the second where will be TRUE for any Emp having a salary. The query will return all such employees.

Obviously, the result is wrong. Let us analyze the reason. The third occurrence of salary refers to Brown's object; if evaluated independently, the nested sub-query ((Emp where name = ”Brown”).salary) would return an empty bag. However, because the query is nested, salary is considered as referring to the outer query. In terms of our stack model, the search for the salary binder should finish at the top of the stack, but because of the absent data there is no such binder and the search is continued in lower sections of the stack. Hence the false binding is caused by the lack of information about the semantic qualification of this occurrence of the name salary in the query. This information should be given somewhere, and it should influence the scoping rules and/or the structures introduced in the model.

This information is usually present in the type of an object. It can be used during compilation to determine statically which section of the environment stack (relative to its top) is relevant for binding a particular name, so the problem does not occur. Analogously, the problem can be easily solved by introducing into the EmpClass a special entity named salary, whose only role is to stop the search in lower sections of the stack; the binding of the name salary to such an object will return an empty table.

If the data environment is untyped[2], and there is no possibility to introduce the corresponding information into a class, the programmer must be supported with special facilities to deal with such cases. Among various options we present the following. We assume that every name occurring in a query can be augmented by an arbitrary number of names of its attributes. Each attribute name is preceded by the keyword with or without. For example,

Emp with name with salary without job

This phrase is evaluated (bound) as a single semantic unit (i.e., name, salary and job are not bound independently). It returns the references to all Emp objects which are present in the actual scope and satisfy the “with” and “without” conditions: that is, which contain the attributes name and salary but do not contain the attribute job. Inherited properties are not taken into account.

The described facility gives the programmer full control over the binding, despite possible defaults and absent data. For example, assuming that salary and job can be absent, and a default job is stored within the EmpClass, the following query

E.IRR.16

 

SBQL:

( Emp without salary with job) where job <> “clerk”

returns references to all employees that contain a job subobject, have no salary and are not clerks. Note that the default jobs and salaries do not influence the result. The query E.IRR.15 can be formulated as follows:

E.IRR.17

Get employees who have salary and earn the same as Brown

SBQL:

(Emp with salary) where salary =

                                ((Emp with salary where name = ”Brown”).salary)

Assignment to Absent Object

If an object is typed with the cardinality [0..1], then in general a simple assignment may result in an exception if the object does not exist. Hence each assignment to such an object should be preceded by testing whether the object exists and, if it does not, by inserting it, for instance:

E.IRR.18

Let Brown earn 5000.

SBQL:

for each (Emp where name = ”Brown”) as B do {

      if exists B.salary then B.salary := 5000

      else B :<<  5000 as salary;

}

 

The operator :<< means “create and insert” – on the right-hand side there is a query that creates an object and on the left-hand side there is a reference to the object to be altered. If there are many objects typed in this way, such assignments can be annoying and the program becomes much less legible. A solution to this problem is to introduce a special operator (or to extend the meaning of the := operator) that creates the object if it does not exist, for instance:

E.IRR.19

Let Brown earn 5000.

SBQL:

 (Emp where name = ”Brown”).salary := 5000;

 

Unfortunately, such a solution will not work if we compile this statement as usual and try to perform the proper operation at runtime. The left-hand side of the assignment will return an empty bag and it will be impossible to recognize what the name of the object to be inserted is and where it should be inserted. Recognizing the situation during compilation and correspondingly altering the generated code seems too challenging (if not hopeless). Because the object to be assigned can be the result of a query of arbitrary complexity, which may involve function, method and view calls, in the general case altering the generated code could be totally impossible.

Such a feature requires altering the runtime structures, on the same principle as the default values discussed above. Assume that each declaration of an object (attribute, sub-attribute, etc.) with lower cardinality 0 always means that the corresponding runtime class will be augmented with a dummy object having the same name as the declared object and some dummy value. For instance, for EmpClass from E.IRR.1, after compilation we insert the dummy objects

<id_dummy_1, address, dummy> , <id_dummy_2, salary, dummy> , <id_dummy_3, worksIn, dummy>,

where dummy is an internally distinguished value that cannot be used by the programmer and cannot be stored in the object store. Its only role is the binding of names that cannot be bound in the normal way. Having such a dummy object named x, it will be bound in all situations where x cannot be bound in the normal way, because the data is absent. After such a binding the system can recognize that the update is to be performed on a dummy object and, instead of that, it goes to the top of the ENVS stack, which in this case should contain the environment of the currently processed object. Hence it can recognize into which object the new object is to be inserted. For instance, in example E.IRR.19, if Brown has no salary, then the query returns id_dummy_2. An attempt to make the assignment will cause navigation to the top of ENVS, finding the reference of the Brown object, and inserting into it the sub-object <some_new_id, salary, 5000>. To make this possible the system should be a bit lazy in reducing ENVS: the relevant section should not be removed before the operator := is completed.

Note that the dummy value is not equivalent to a null value. A dereference of a reference to a dummy object returns an empty bag. Dummy objects and the dummy value are never used by the programmer, and he or she need not be aware of such a feature. It is assumed to be an internal option of the binding mechanism that is necessary to solve the problem of assignments to absent objects.

Typing Irregular Data in SBQL

Well-known typing systems are not prepared to type irregular data. Actually, only the type system assumed in XMLSchema is able to specify some non-trivial irregularities in data. However, XMLSchema is used as a grammar of XML files rather than as a typing system for strong type checking of queries or programs. In general, because of new database models, new semantic properties of query languages and semi-structured data, all the known strong typing systems developed for programming languages are too limited and not flexible enough. Sometimes there are even opinions that irregularity in data (semi-structured data) is contradictory to strong type checking of queries and programs.

We disagree with such opinions. Irregularity of data is relative. If a typing system is prepared to deal with irregular data, they are no longer irregular and become ordinary and typical. The actual problem is developing a new typing system, because the known typing systems are not prepared for this goal. This also concerns advanced typing systems based on inclusion and parametric polymorphism, F-bounded polymorphism, etc.

We have developed a new strong typing theory and implemented it for SBQL. We call it “semi-strong” just because it deals with semi-structured data as described above. We distinguish internal and external type systems. The internal type system reflects the behavior of the type checking mechanism, while the external type system is used by the programmer. A static strong type checking mechanism simulates run-time computations during compile time, reflecting the run-time semantics with the precision that is available at compile time. The roles of the SBQL typing system are the following: compile-time type checking of query operators, imperative constructs, procedures, functions, methods, views and modules; user-friendly, context-dependent reporting of type errors; resolving ambiguities connected with automatic type coercions, ellipses, dereferences, literals and the binding of irregular data structures; shifting type checks to run time if it is impossible to do them at compile time; restoring the type checking process after a type error, in order to discover more than one type error in a single run; and preparing information for query optimization by properly decorating the query syntax tree.

The internal SBQL type system includes three basic data structures that are compile-time counterparts of the run-time structures: a metabase, a static environment stack and a static result stack. The static stacks process type signatures – the typing counterparts of the corresponding run-time entities. Signatures are additionally associated with attributes such as mutability, cardinality, collection kind, type name, multimedia, etc. For each query/program operator a decision table is provided which determines the allowed combinations of signatures and attributes, the resulting signature and its attributes, and additional actions.

Processing types with cardinalities requires delegating part of the type checking to run time. For each creating, inserting and deleting operation performed on an object typed with non-trivial cardinalities, the SBQL compiler inserts a piece of code that checks the corresponding cardinality during run time.

More about the semi-strong typing system of SBQL is the subject of a special chapter and of the description of the ODRA system. An example of SBQL types is presented in E.IRR.1.

Irregular Queries

Irregular data are usually associated with irregular queries that are impossible to ask in SQL. Examples of irregular queries can be as follows:

·        Get names (as strings) of objects having an attribute storing the string “Monty Python”.

·        Get names (as strings) of attributes that store the string “Monty Python”.

·        Get objects of any type that store the string “Monty Python” in any attribute, at any level of the hierarchy of their attributes.

·        Get the number of all attributes in the person object with the name “Tom Jones”.

·        Get Emp objects with an attribute having a name starting with the letter “S” and a value terminated by the string “logy”.

We could give more complicated examples of such unusual queries. All such queries require generic capabilities, sometimes beyond the assumed strong typing system. It is difficult to say whether such queries are necessary. If such a necessity exists, it usually means that the design of the data schema is flawed. However, current experience with XML and its query language XQuery shows that for some environments such queries make sense.

Actually, the above queries require some generic functions that provide reflection capabilities. We list some of these functions:

·        Navigation from a reference of an object to references of all its components (toSubordinates). Obviously, the typing system is violated, hence it must be somehow changed.

·        Navigation from a subobject to its direct superobject (toOwner).

·        Returning the collection of names of all objects (allObjects).

·        Given a reference to an object, returning its name as a string (nameOf).

·        Given a string, use it as a data name (asName).

·        Navigation from a reference to an object to its class/type (toClass, toType).

We can of course invent more such functions, and perhaps their number is unlimited. It may happen that some conceivable query is still not expressible within a given set of such functions, but inventing new ones is quite easy (although it is not certain that it makes sense).

Within this set of functions augmenting SBQL we can ask the following queries:

E.IRR.20

Get names (as strings) of objects having an attribute storing the string “Monty Python”.

SBQL:

 ((allObjects as m join toSubordinates(m) as s)

where s = “Monty Python”).nameOf(m)

 

E.IRR.21

Get Emp objects with an attribute having a name starting with the letter “S” and a value terminated by the string “logy”

SBQL:

 ((Emp as m join toSubordinates(m) as s) where nameOf(s)[1] = “S”

and s like “*logy”)
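The remaining queries from the list above can be expressed in a similar style. For instance, the number of all attributes in the person object named “Tom Jones” might be obtained as follows (a sketch; it assumes a Person collection, a single matching object and the generic functions listed above):

SBQL (sketch):

count((Person where name = “Tom Jones”) as p join toSubordinates(p) as s)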

 

These examples show that such (relatively simple to implement) functions are able to raise the power of a query language. However, one cannot exclude that such power would mainly compensate for an undisciplined design of the database schema.

Conclusion

Although it is a common belief that null values and other forms of semi-structured information are among the inevitable features of database applications, object-oriented database research and technology tend to neglect the issue. Practical proposals, such as the ODMG standard, present very few examples of relevant capabilities. Theories proposed for object bases and their query languages, such as object algebras, F-logic, comprehensions and structural recursion, do not supply comprehensive solutions. Moreover, the discussed SQL approach ultimately leads to numerous flaws and inconsistencies.

In this chapter we have presented a systematic approach to the problem based on the idea of cardinalities assigned to types of objects and briefly described our solutions. These can be summarized into the following recommendations:

·        We agree with C.J. Date: a special generic value called null (nil) should be absolutely avoided. We argue that it acts like a little devil that is able to spoil the semantic clarity and consistency of all language constructs; this especially concerns query languages.

·        C.J. Date suggests using default values instead of nulls. In general, we doubt whether this approach offers advantages and reduces the possibility of inconsistencies and programmers' errors. However, in this chapter we have presented simple methods to incorporate default values into classes.

·        In all cases when the use of default values is inconvenient or impossible (e.g., unions, repetitions, complex attributes) binding a name of an absent object should result in an empty collection, which can be processed by standard language facilities, such as exists, count, and quantifiers.

Static strong type checking of queries and programs addressing semi-structured data in the context of object-oriented query languages is not considered in current typing systems and their theories (including polymorphic ones). SBA and SBQL offer a semi-strong typing system based on compile-time simulation of run-time semantics. This approach makes it possible to provide type checking to the extent that is possible for irregular data.


Last modified: July 21, 2009

 



[1] This can of course be constrained by a DTD or XMLSchema, but such constraints are not obligatory.

[2] We strongly discourage making untyped programming interfaces for queries. SBQL is strongly typed. As will be shown in the next sections, strong typing plays a crucial role in query optimization. One can argue that discovering typing errors at compilation time is not so important and we can live without it, cf. SQL. However, the lack of query optimization totally undermines the query interface for its users.