Storing and
Processing Irregular Data (Semi-Structured Data)
Back to Description
of SBA and SBQL.
Irregular data, that is, data that do not follow some fixed format, are
properties of many current technologies, especially based on XML. Irregular
data are also specific to conceptual modeling, which frequently requires some
forms of loose formats. Irregularity in data has many forms and presents a lot
of issues, especially concerning strong type checking and query languages.
Irregular data is the reason of a lot of naïve research, inconsistent
solutions, poor imagination of the
developers and false opinions. In this chapter we try to discuss more important
issues related to irregular data and present our views and concepts concerning
irregular data in the context of SBA and SBQL and object-oriented store and
query models.
The most recognized concept related to irregular data is a null value known from the relational
model, relational databases and SQL. The concept, inevitable and obvious in
relational data structures, becomes quite difficult in query and programming
languages. In SQL it is supported apparently with little care about semantic
consistency and consequence. Depending on SQL constructs, nulls are treated
very differently and all the design seems to be contradictory to common logic
and sense. However, there is perhaps no fault of SQL designers. The devil of
chaotic design is inside the null value concept itself. According to [DaDa92,
DaDa95] there is no consistent way to introduce nulls to database structures
and corresponding query and programming languages.
A new wave known under the term “semi-structured data”
approaches the problem from a different angle and actually does not refer to
null values. The problem arises in the context of more loose and weakly typed
data structures that are common for some applications, in particular, for Web
and XML technologies. However, the semi-structured data problem inherits a lot
from the problem of null values.
In general, any DBMS must eventually deal with complete computational
and pragmatic power of programming interfaces. Hence, features related to
irregular data must be combined with all aspects of database systems: data
description, query languages, updating and other imperative constructs, typing,
object-orientedness, procedures, methods, views, active rules, metadata, etc.
Support for irregular data is especially challenging if one assumes that the
data is to be processed by strongly typed programming languages, such as C++,
C# and Java. Usually such languages support regular data formats only. The
necessity of mapping between irregular data and regular formats must lead to
some impedance mismatch. Hence, the support for irregular data must be designed
with an extreme care concerning conceptual simplicity, minimality, consistency,
universality and openness for external data processing tools.
During the development of SBA and SBQL we address the problem of
irregular data in object bases and consider how they are to be manipulated in
the integrated query/programming languages addressing object models. The
difference between our approach to the problem and the approaches known in the
literature is that we extend the scope of issues related to irregular data to
all the aspects that are necessary to make a professional object-oriented
database query and programming language. Our approach to irregular data is very
simple and is based on the idea of collections and cardinalities of UML. Null
values are represented as absent data. Any data that can be absent are to be
typed by cardinalities with lower value equal to zero, for instance [0..1] or
[0..*]. In particular, optional data are treated as a collection being empty or
having exactly one element. To process such a data the programmer can uses facilities
that are specific for processing collections rather than for processing values.
In this way we can deal with nulls without introducing an explicit a null
value. Within this approach all the disadvantages of null values presented in
[DaDa92, DaDa95] become irrelevant. This chapter discusses the situations that
are related to this new attitude to nulls and presents corresponding SBQL
solutions.
Many authors (more than 500 papers) have associated null values with
“incomplete information”, making a lot of research aiming at
solving this apparently important problem. After dozens of years almost all
this research has been commonly recognized as scholastic, with little (or
perhaps no) practical impact in the domain of database management systems. Some
of these researches, however, are claimed to be applicable in other domains, in
particular, artificial intelligence, data mining and data warehouses.
From the practical point of view, null values are loosely related to the
problem of uncertain or missing information. The term irregular data better expresses the problem. It concerns situations
when particular information does not fit exactly the predefined data format, or
when some spots in the storage media (or in their formal or conceptual model)
are not filled in by meaningful data. The interpretation of the fact that a
particular data spot is unfilled is a data semantics issue, outside of any
formal model.
Actually, SQL adopts such attitude to null values as described above.
The meaning of SQL null values is not predefined. Null values present some
technical feature in programming that can be used for different purposes and
informally interpreted by designers, programmers, database administrators and
other users. Because of many sources of null values and many reasons for which
they are introduced it is difficult to imagine that another, more semantically
specific approach to null values makes a sense. As an example, if we assume
that almost all persons in some collection are not handicapped, then the value
of the attribute handicapKind can be
null for almost all objects. Only handicapped people have some information
within this attribute. But interpretation of these nulls belongs to informal
business semantics of data. In this case nulls have nothing in common with
uncertain or incomplete information – just nulls are used as regular and
certain information determining that corresponding persons are not handicapped.
Some methodologists might not recommend to use nulls in such a way, but actually
this would be only recommendation and the decision anyway belongs to
application and database designers.
The irregularity of data is relative to the typing system that is
assumed in the given database and its query/programming environment. Irregularity
with respect to some older data models (such as the relational one) is no more
irregularity if the typing system is prepared for it and there are
corresponding constructs that serve all the types in a corresponding query and
programming language. Hence the true attitude to irregular data is to change
them into regular ones through preparing a proper typing system. Such data are
to be addressed by a database query and programming language with all the
universality. However, irregular
features require extra programming options, thus resulting languages can
be more (or even too much) sophisticated for programmers.
In the development of SBQL we tried to introduce irregularities to
object data structures without compromising the simplicity of the language.
Eventually, however, any design must be acknowledged by common practice of
language’s users.
In the domain of PLs there is another well-known feature that can be
considered as irregular data. It is called “variant” in the Pascal
family of languages, or “union” in the C family. In the following
we will use the term “variant” (”union” is already used
as an SBQL query operator). Originally (e.g. in Pascal) variants were
introduced to save the memory space. This motivation no more holds for current
data models where variants have mostly conceptual meaning. A variant may be
required when objects of a conceptually similar type can change the collection
of their attributes. For example, an Employee
object may concern regular employees and junior employees. In a variant for a
regular employee the object contains the attribute salary and no attribute scholarship,
while in a variant for a junior employee – vice versa.
Variants and null values are conceptually similar. If a null value we
interpret as an absent attribute, than the assumption that an attribute A can
be null-valued might be modelled as a variant of record types R(A, B, C,...)
and R(B, C, ...). The notions of null values and unions are, however, not
equivalent. For example, if some record type involves n attributes that can be null valued, then the number of
corresponding variants is 2n. Variants cover also the situation when
names of attributes are the same, but types are different. It is interesting to
note that the PL community noticed variants and ignored null values, and the
database community did v/v. Because variants do not match the concept of a
relation, they were unacceptable for the proponents the relational database
model. This is one of many cases when a stiff mathematical frame is an
impediment for the progress.
One can observe that in object-oriented models variants are unnecessary,
as they can be substituted by specialized classes. However, classes mostly bear
conceptual information from the business model. It would be improper to burden
the model by some technical reasons such as variants. Variants would cause the
appearance of many artificially introduced classes, with exotic names and
unclear meaning. Moreover, if there is more than one variant in one business
object type, then this may result in the combinatorial explosion of (permuted)
specialized classes. If a type introduces n
binary variants on different attributes, then the number of specialized classes
that conceptually cover this situation is 2n. This presents an argument
that variants should be introduced as a feature orthogonal to classes.
Variants, as introduced in Pascal, require a special attribute (so
called discrimination attribute) that
allow during runtime to determine which variant is currently processed. Such an
attribute makes it possible strong type checking of variants (which in this
case must be delegated to runtime). In C variants (unions) were introduced
without the discrimination attribute, hence the responsibility for determining
a variant is in hands of the programmer. To this end he or she can choose any
option and any information. The solution makes strong typing of variants
impossible.
In the literature there is a subdivision between
exclusive and non-exclusive variants. Exclusive variants mean that in the actual
object we can choose A or B, but not A and B together. For instance, an Employee record can have the attribute salary or the attribute scholarship, but not the both.
Non-exclusive variants do not imply such a constraint and hence are easy to be
substituted by optional data: non-exclusive variant between A and B can be
represented as two attributes A and B with cardinalities [0..1]. Hence, only
exclusive variants present essentially new situation for conceptual modelling,
typing and programming capabilities.
Variants are also similar to dynamic object roles, as described in the AS2 object store model.
Dynamic object roles provide capabilities much more general and flexible than
the capabilities assumed in variants. Most of all, the difference concerns
conceptual modelling. Variants can be introduced for any technical reason,
while dynamic object roles are assumed to have some business-oriented
semantics. In particular, each dynamic object role must possess a distinguished
external name, while this is not required for a variant. Actually, however,
these distinctions between variants and dynamic object roles are secondary from
the technical point of view. Object roles can be provided as a substitute and
generalization of variants. This leads to some new situations with strong
typing. Such an approach to typing roles is described in the PhD thesis by
A.Jodłowski [Jodl03].
Typing of Null Values and Variants
Irregular data require a new attitude to typing, especially to static
(compile time) strong type checking. Some authors argue that due to the
irregular nature of data strong typing is impossible: such data are type-less
or schema-less. We disagree with such opinions. Lack of types or data schemata
leads to a lot of pathological situations in programming. In particular, a
database schema is a carrier of business-oriented data semantics and determines
the structure and representation of data. It is difficult to imagine that the
programmer can write programs without any schema, because in such cases he or
she must guess the structures form examples of data or apply a generic
programming where everything, including data names, are perceived as strings.
At least for conceptual modelling, types or schemas must be introduced anyway,
explicitly or implicitly. This is the motive for such languages as DTD or XML
Schema. The question arises how such facilities can be used for static type
checking, for determining representation of data, for resolving some ambiguous
situations (such as ellipses and automatic coercions), for query optimizations,
etc. Within SBQL we have proposed a typing system that we call
“semi-strong”, to underline the fact that our types are prepared to
typing irregular data, with all the features that are usually associated with
types, such as conceptual modelling, static type checking of queries and
programs, resolving ambiguities in queries and programs and preparing
information for query optimization.
Concerning null values and variants, static type checking (during
compilation time) has limits because both a null value and a variant are
runtime features. Hence in such cases type checking must be usually delegated
to runtime. Alternatively, developers of a language can assume that null values
and variants are outside the typing system and instead of typing errors they
assume exceptions in reaction of some illegal runtime situation (e.g. when a
null is an argument of an arithmetic operator). Such exceptions, however,
require anyway semantically clean detection of null values and variants during
runtime.
More information on typing irregular data will be given in a section
devoted to typing.
Current
Proposals Concerning Irregular Data
The current view on null values is materialized in relational databases
and SQL. A null value can be stored in a database and returned by an SQL
expression if it cannot be evaluated correctly. This may happen because of
null-valued arguments of an operation, as well as in the case of wrong
arguments of operations. For example, function sum returns NULL for an
empty argument table. SQL does not allow explicit comparisons of nulls and
ordinary values but involves a special predicate is [not] null. A special function if_null returns an attribute's value if it is not null, or some
constant value otherwise. Comparison of an expression which returns NULL with any other value returns a
third truth value UNKNOWN; in the WHERE clause, however, it is equivalent
to FALSE. In embedded SQL the
admission of null values requires for every such attribute two variables in an
underlying host language program. An extra indicator
variable is used to store boolean information determining if the particular
value of an attribute is null or not.
Several authors (notably Date and Darwen) point out difficulties related
to null values, which make the semantics of user/programmer interfaces obscure
and inconsistent. Date [Date86] presents a severe criticism of null values in
the relational model (“...the null
value concept is far more trouble than it is worth”, “...the SQL
null value concept introduces far more problems than it solves”). He
gives many striking examples of anomalies implied by null values. For instance,
although in SQL “officially” every null value is distinct (thus A = B
returns UNKNOWN if both A and B return nulls), the group by
and unique operators treat them as
identical, and aggregate functions sum,
avg, min, max, count totally ignore them. Another striking example: if
relation R contains numerical
attributes A and B which could be null valued, than in general select sum(A+B) from R may return the result different from select sum(A) + sum(B) from R. Date concludes that these flaws are not only the
property of an inadequate SQL design, but they are the inevitable consequence
of the idea of null values.
The descendant model of SQL also promotes the use of null values. The
SQL 99 standard [SQL99] proposes to introduce an arbitrary number of
application-specific null values such as “Unknown”, “Missing”,
“Not Applicable”, “Pending”, etc. It is not clear,
however, how this feature will be reflected in the semantics of particular
query/manipulation constructs.
Variants are proposed in IDL of the OMG CORBA standard, with an explicit discrimination
attribute. It is assumed that they are to be addressed in programming
languages. Only C and C++ provide corresponding facilities. A similar concept
is proposed in the ODMG standard. However, the standard contains no construct
to deal with variants in the query language OQL and contains no suggestion how
variants are to be treated by the assumed strong typing system. As far as we
know, actually there is no powerful query language consistently dealing with
variants. Null values are introduced in the ODMG standard in a manner that is
much more restrictive than it is done in SQL. This is perhaps motivated by the
possible problems with nulls that the designers of the ODMG standard want to
avoid. No operators such as outer joins are allowed in ODMG OQL. In general,
object-oriented database models neglect the problem of irregular data. Perhaps
SBA and SBQL are first proposals that address powerful object-oriented models
and support non-trivial forms of irregular data.
There are also proposals to make loosely typed or typeless data
structures. Examples are XML and RDF- oriented technologies. For instance, in
XML one can introduce any tags or attributes within tags and to put inside them
any values[1].
In RDF one can change ad hoc the type of some entity. Such approaches are even
advocated as “more flexible”, hence having advantages over
disciplined data structures. However, the flexibility should not be confused
with chaos. Tags (data names) and types are features of a schema, which is
necessary for the user to ask correct queries. If tags and types are unknown,
then the user or programmer must guess them and to discover the meaning of
values basing on some side information, which can also be poorly defined. Our
conclusion is that loosely typed or typeless data structures are unmanageable,
especially for very large databases. Such proposals can be used for small,
non-resposible applications, where the user can see the entire database (and
its next possible states) before asking queries. In real applications such situations are
difficult to imagine.
The fundamental defect of theories w.r.t. null values and other forms of
irregular data is the scope mismatch.
The phenomenon is similar to the impedance mismatch and concerns the difference
between the scope of theories and the scope in which irregular data need to be
considered in practice. Theories usually address an idealized query language
with very limited capabilities (e.g., SPJ queries in the relational algebra, or
the language of some predicate logic). The scope for null values is much wider
and should include:
· All advanced query constructs, such as grouping, aggregate functions,
quantifiers, ordering;
· Imperative constructs based on queries: creating, updating, inserting,
deleting;
· The interface for embedding a query language into a programming
language;
· Database semantic enhancements: views, database procedures, integrity
constraints, deductive rules, active rules;
· Object-oriented extensions: ADT-s, types, classes, interfaces, encapsulation,
inheritance;
· Static strong or semi-strong type checking;
· Privacy, security, transactions;
· Interfaces for end users and programming interfaces: graphical query
languages, 4GLs, and other.
· Metadata (an internal representation of a database schema)
The scope mismatch is the main reason for a very low applicability of
the theories addressing null values and other forms of irregular data. From the
position of designers and vendors of DBMS it is very risky to introduce some
particular solution without a clear view how it could be expanded to the whole
conceivable environment of database application programming. In our opinion,
theories concerning null values in databases have severely impeded the progress
rather than supported it.
In practice there are many sources of irregular data. Sometimes they can
be qualified as uncertain information, but in general irregularities are
related to specific aspects of data storing and processing.
·
Information is irrelevant,
for example, “nickname” for a person who has no nickname.
·
Information is known,
but not filled yet because data input lasts some time. For example, a new
employee is hired, all his/her data are already introduced to the database,
except “salary” which has to be input by another administration
division.
·
Information is valid
only for fixed time; after it must be nullified (to prevent errors) and then
input again or recalculated; for example, weather or stock market forecasts.
·
Information is known and
could be filled in, but it is considered inessential for the particular domain
of applications. For example, in some airline ticket reservation system the
nationality of passengers is not filled in for domestic flights.
·
Information consists of
“legacy” records, with limited amount of information, and actual
records having an extended set of attributes. Hence in the integrated resources
legacy records must possess null values.
·
The result of some
processing is an intermediary data structure containing null-valued elements;
e.g. outer joins.
Nowadays, there are next reasons to deal with irregular data. The
following aspects of object database technologies may require them:
·
Object-oriented analysis and design.
In many cases the admission of null values, unions and repeating data presents
an important facility for the analyst or designer. For example, the designer
can stick two classes into one introducing some null-valued attributes; and
vice versa.
·
Schema evolution. As a result of schema
evolution some new attributes may appear, some may disappear, some may change
type, some single attributes can be changed into repeating ones. This naturally
leads to variants and optional data.
·
Interoperability. Heterogeneous and/or distributed
databases require resolution of missing or conflicting data values that occur
when semantically identical data items may have some attribute values different
of missing in some data sources.
·
Web and XML. HTML and XML files are
weakly formatted. There is a lot of valuable research devoted to querying
semi-structured data, especially XML; however, there is little doubt that they
will require efficient programming techniques to deal with their irregularities
and inconsistencies.
·
Data warehouses. There is a tendency
for collecting information concerning particular topics (e.g. a stock market)
from various heterogeneous sources. Analytical processing of such heterogeneous
information (statistical analysis, overview reports, data mining, etc.) requires
capabilities of integrated query/programming languages that will make querying
null values and variants possible.
Note however that null values and variants are additional programming
options that can be easily avoided both on the stage of database design and on
the stage of programming. For instance, if attribute A of type integer can be
null valued, it is enough to introduce two attributes: attribute A with no null allowed and an additional
Boolean attribute is_A_null?, with
value true indicating that the value
of A should not be taken into account because it is invalid. This approach is
actually used in host languages embedding SQL. It corresponds to discrimination
attribute that is used to distinguish variants. The attribute is_A_null can also store more
information about the status of A, for example, “not initialized”,
“pending”, “irrelevant”, etc. Such an approach do not
require introducing irregularities of data explicitly; however, it could be
more difficult for conceptual modelling and makes description of objects more
complex.
As an alternative to null values C.J.Date and other authors propose the
concept of default values, with the
intention that default values work as nulls. They are ordinary values that are
filled in into a tuple in the case of absent information. Default values can be
determined in a relational schema. According to Date, default values do not require any special options concerning
irregular data in databases and in programming languages.
Unfortunately, this claim is obviously false and based on fundamental
misunderstanding. Null values and other irregularities are conceptual issues,
while default values present purely technical option to record irregularities.
If the programmer forgets to consider default values as a special mean for
storing irregularities, then probably his/her program will be formally correct,
but having conceptual flaws from the point of view of the business that is
supported by the program. To avoid this danger, default values must be indicated
in the conceptual database schema, which would become more complicated.
Processing default values may require special care or special options. For
instance, if for salaries the default value would be -1, then calculation of
the average salary through the aggregate function avg would be erroneous. The
programmer cannot forget that before calculating the average he/she must filter
all objects with salary = -1.
Moreover, for some domains there could be no good default, e.g. for the Boolean
type. Default values are also extremely error-prone: if the programmer forgets
or not properly recognizes defaults, then the typing system will not be able to
recognize an error and it appears during runtime with no warning.
Note that this intention of default values is unusual for programmers.
Usually a programmer expects that default values are the most common values
that are filled in during object creation, for example the value
“no” for the attribute IsSmoking?
Using default values with another intention will require – at least
– another term to denote them and special explanation in manual.
Summing up, the proposal of using default values as the substitution for
null values is doubtful and risky in the same way as the criticized null values
are. In our opinion, default values are not worth to consider seriously as a
special option in data structures and query languages.
We assume that a schema is an inevitable property of any data store,
including storing irregular data. A schema is necessary for conceptual
modelling . The programmer before writing a query or a program must clearly
understand what the database contains and how it is logically organized. Such a
schema can be informal or even not explicit, but anyway must exist.
If a schema is expressed formally, it can be used for dynamic (runtime)
checking of processed objects. For example, DTD and XMLSchema play this part
for XML repositories. A bit more formality and discipline is necessary if we
assume that types declared within a schema are to be used for static (compile
time) type checking of queries and programs.
The new qualities introduced by
irregular data to this assumption are
the following:
·
A schema should be able
to express the fact that some data are optional and may not be present in a
particular data instance. The feature is specified by the cardinality [0..1]
and is equivalent to the SQL clause “NULL IS ALLOWED”. In this way
every data that can be optional is modelled as a bag which is empty or has
exactly one element. All operators that address bags can be applied to such a
bag, in particular, quantifiers. Properly designed implicit type coercions
between bags and individual elements (and v/v) make it possible to reduce the
conceptual overhead related to optional data to minimum.
·
A schema should be able
to express the fact that some data are alternative. This issue is known as
“exclusive variants”. We assume that a variant must possess a
discriminator that allow for type checking of variants during runtime.
·
A schema should be able
to express the fact that some data can be repeated some number of times,
including zero times or any number of times. The issue is known as cardinality
of a collections. Cardinalities can also specify pointer links (association
instances) among objects.
·
The schema language
should allow orthogonal combination of the above features ot any data hierarchy
level.
Any binding to a data that is absent
results in empty bag rather than in a special value “null”.
Although this looks as a very subtle change, there are essential consequences.
We show that the idea makes it possible to overcome difficulties encountered in
SQL and can be smoothly combined with variants, and object-oriented concepts.
It can also be consistently incorporated in query languages and programming
languages integrated with queries
As far as we are aware of, this
model is the first, simple, intuitive and universal idea, which is capable to
express uniformly all features related to semi-structured information and which
is acceptable to an average database engineer. The idea is consistently
incorporated into data structures, a query language, programming interfaces and
other capabilities of the prototype object-oriented database management system
ODRA. The ODRA query and programming language SBQL is much more powerful than
Lorel and UnQL, well recognized for their capabilities to process irregular
data. Moreover, SBQL has very precise, intuitive semantics and is seamlessly
integrated with programming constructs, programming abstractions and object
oriented notions.
Absent data (including variants) imply the necessity of special care in
queries. Consider the SBQL schema (from the ODRA system):
|
Database schema involving departments and
their employees |
|
|
SBQL: |
type AddressType is record { city: string; street: string; house: integer; } class EmpClass { instance
Emp: { name:
string; address
: AddressType [0..1]; //absent data allowed salary:
real[0..1]; //absent
data allowed job:
string; worksIn:
ref DeptClass[0..1] reverse
employs; //absent data allowed } } class DeptClass { instance
Dept: { employs:
ref EmpClass [0..*] reverse
worksIn; boss:
ref EmpClass[0..1];
//pending absent data allowed dname:
string; budget:
real; location:
string [1..*]; //multiple values
allowed } } Emp: EmpClass [0..*]; Dept: DeptClass [0..*]; |
The schema declares one type AddressType
and two classes EmpClass and DeptClass.
Then it declares two collections of objects, Emp and Dept, of the
defined classes. For Emp objects the
attribute address is optional.
Similarly, in the Emp object the attribute salary is optional too. In Dept objects the reference employs is multi-valued, the reference boss is optional and the attribute location can assume one or more values.
Consider the query Emp where salary >1000 . If for some employee the attribute salary is absent, then binding name salary will return an empty bag, which
is compared by > with 1000. The comparison > is not defined for such a
case. We can therefore consider the following possible definitions of the
semantics:
·
Predefined truth value:
in this case the formula will return FALSE.
·
The formula will return
the third logical value UNKNOWN.
·
A runtime type error or
exception, i.e., the case is semantically forbidden.
The first approach leads to unreliable programs, because, e.g., some
obvious tautologies do not hold. For instance, not (salary > 1000) is
not equivalent to salary <= 1000.
The second case requires introducing a lot of features to the integrated
query/programming languages, which would process the UNKNOWN value. As experience has shown, such a new truth value is
extremely cumbersome in constructing and using the language. Although two
previous cases are allowed in some languages (e.g. SQL, LINQ and OCL), our
decision is restrictive: only the last case is acceptable. We do not accept
some false “user friendliness” in which queries are simpler at the
cost of clean semantics and reliability of applications.
The standard query capabilities that exist in SBQL can be adopted to
handle the last case. Depending on preferences, the designer or programmer can
choose between the options which we present below. For each described
capability we list two options. In the first one, marked by (a), an identifier
of an Emp object with absent salary is included in the result of the
query. In the second option, marked by (b), an identifier of an Emp object with absent salary is excluded from the result.
|
E.IRR.2 |
Get employees earning more than 1000 |
|
SBQL: a) b) a) b) a) b)
a) b) |
// Quantifiers: Emp
where forall salary as s (s > 1000) Emp
where forsome salary as s (s > 1000) // The SBQL function exists
tests salary: Emp
where if exists(salary) then salary >1000 else true (Emp where exists(salary)) where salary > 1000 // The SBQL function count
calculates the size of the argument: Emp
where if count(salary) = 1 then salary >1000 else true (Emp where count(salary) = 1) where salary > 1000 //In
some languages operators and, or are “lazy”, i.e. if
the //
argument before or returns true (argument before and returns false) //
then the second argument is not calculated. // With such semantics we have one more case: Emp
where not exists(salary) or salary >1000 Emp
where exists(salary) and salary > 1000 // In
general lazy operators can lead to problems in query languages, //
because some query optimization method based on rewriting // can change the order of arguments of
boolean operators. |
The optional data may concern not only atomic objects, but also complex
ones. Note that a null value concerning a complex attribute is not expressible
in relational systems. SBQL can also deal with this case as usual:
|
E.IRR.3 |
Get names and cities of programmers; if address is absent, return the string
“no address”. |
|
SBQL: |
(Emp where job =
“programmer”). (name,
if exists(address) then address.city else “no address” ) |
Similar methods concern variants. Querying variants does not imply any
special option in SBQL; however variants with discriminators require special
features of a strong typing system.
SBQL is also prepared to process any other cardinalities of collections.
The query below processes location of
Dept, which has the cardinality
[1..*].
|
E.IRR.4 |
Get cities hosting all departments. |
|
SBQL: |
(unique(deref(Dept.location)) as deptcity) where forall Dept (deptcity in location) |
In the first line the query collects locations of all departments, then
dereference converts location identifiers
to strings. Then, unique removes
duplicate strings. Each element of the result is named deptcity. The second query line selects cities hosting all
departments.
As a side effect of the above assumptions, aggregate functions have the
same semantics as in SQL: absent data do not influence the result. Consider the
query
|
E.IRR.5 |
Get the average salary of all employees. |
|
SBQL: |
avg(Emp.salary) |
For Emp objects with no salary sub-object the name salary occurring in the query will
return the empty table. This table will be merged with other tables according
to the union operator, hence it does not influence the result. This approach is
simple, consistent and should be easily understood by the programmers. Consider
the Date's example
|
E.IRR.6 |
|
|
SQL: |
select sum(A+B) from R select sum(A) + sum(B) from R |
where the first query may return a result different from the second one.
In SBQL the first query cannot be formulated as sum(R.(A+B))
because if A or B denote an absent attribute then the operator + fails. The correct
formulation is:
|
E.IRR.7 |
|
|
SBQL: |
sum(R.(A
as a, B as b).(a+b)) |
The operator “,” (structure constructor) returns an empty
table if any of the arguments is absent. Hence, a+b is never evaluated
with a or b returning an empty table. The second query in SBQL
|
E.IRR.8 |
|
|
SBQL: |
sum( R.A
) + sum( R.B ) |
can hardly be confused with the first one.
In the relational model the outer join is an important operator that is
relevant for semantic modelling and design methodologies, and convenient for
programmers. In many cases the ordinary (inner) join is not satisfactory
because it loses information. For example, the name and the department name of
an employee can be obtained by the query (providing dno column within Emp and
Dept tables):
|
E.IRR.9 |
|
|
SQL: |
select name, dname from Emp, Dept where Emp.dno = Dept.dno |
The resulting table, however, will not contain information on such
employees for which the dno attribute
is null valued. The outer join enables the user to formulate the query
returning a tuple <name, NULL> for an employee having null
valued dno attribute. The operator is
usually considered non-primitive, as it can be expressed by other operators of
the relational algebra. Date [Date86] advocates the “default style”
outer joins which avoids null values. There could be more specialized outer joins:
left-hand side outer join ignores nulls that occur at right side join argument
table, and right-hand side outer join does vice versa.
A similar operator could be also useful in object-oriented query
languages. Consider the corresponding query in SBQL:
|
E.IRR.10 |
|
|
SBQL: |
Emp.(name, (worksIn.Dept.dname)) |
It returns no information on employees for which the worksIn sub-object is absent. We can
avoid special options because the existing SBQL operators are sufficiently
powerful. For example, assuming that the user would like to return the string
“???” for Emp objects
with absent worksIn, the above query
can be formulated as:
|
E.IRR.11 |
|
|
SBQL: |
Emp.(name, if exists(worksIn) then (worksIn.Dept.dname) else
“???”) |
There are other solutions, still without introducing special features to
SBQL. For instance, we can use groupas
operator:
|
E.IRR.12 |
|
|
SBQL: |
Emp.(name as n, (worksIn.Dept.dname)
groupas d) |
This solution is just in the spirit of SQL: instead of null, however, we
return an empty bag named d. Such a
solution can be applied in all cases that in relational languages require outer
joins and can be specialized to cases equivalent to left-side and right-side
outer joins.
A class object in an object-oriented PL may store several kinds of
invariants inherited by class members. Default values can be considered as such
invariants. We distinguish two roles for them:
·
Initial values for some
attributes, which are used when a new object is created. They are physically
copied into a created object, thus need not to be inherited. Such default
values are used only by a distinguished method commonly denoted as new. This kind of default values is not
relevant to irregular data.
·
Common default values,
which are dynamically inherited by class members and can be overridden by
values within a particular member. Such a value is used in situations when a
given member object has a corresponding attribute absent. Default values stored
inside classes have the same external name as a corresponding attribute. They
can be atomic or complex.

Fig.57. Overriding of default values
Default values in the object-oriented approach do not imply special
features in query languages, assuming the model and scoping/binding rules as
explained in chapters [SBQL
non-algebraic operators] and [SBQL
for the AS1 model]. However, we must consider a new semantics of
assignments. When one updates a value which actually is inherited from a class,
the semantics should not update the value, but create a new sub-object with the
value determined by the right hand side of the assignment. For example, in the
case shown in Fig.57 the assignment
|
E.IRR.13 |
|
|
SBQL: |
for each Supplier where Sname = ”Black”
do Status := 45; |
should result in creating a new atomic object <inew, Status, 45> and then inserting it
into the Black's object. Updating of class properties via its members should
not be allowed.
Note that the runtime structures should allow the navigation from the
identifier of a default object to the identifier of the object that is
currently processed (to make within it a proper change). In SBA such navigation
is quite easy, because the environment of the currently processed object is
currently on the top of the stack. However, some minor modification to the
stack-based mechanism could be
necessary.
The presented definitions of default values capabilities can be easily
generalized, for example, in order to cater for cases when an inherited
property can be not only a “static” object, but also a functional
procedure. Assume that when the status of a supplier is not defined, it is
calculated as the number of products she supplies. In this case instead of
introducing the object <.., Status,
-1> to the SupplierClass we
introduce the following functional procedure (a method), where SP( Sno,
Pno,...) denotes the relation
connecting suppliers and parts:
|
E.IRR.14 |
|
|
SBQL: |
integer procedure
Status { return count((SP as x) where x.Sno = self.Sno); }; |
There are many examples of situations in the conceptual database design,
when such dynamic defaults might be useful. As far as we know they have not
been considered in the database literature.
Note that the procedure, like all inherited procedures, is evaluated
within an environment of a Supplier object,
thus the second occurrence of Sno in
the procedure's body is properly bound to Sno
of Supplier (what is indicated by self). Semantically, the construct SP as
x creates so-called binders,
i.e., named runtime entities. For detailed semantics related to binders see
chapter [Defining
Auxiliary Names].
Default Values and Scoping/Binding Rules
Default values stored within classes require special attention
concerning scoping/binding rules. In some situations it is a difficult to
determine consistent scoping rules that obey natural expectations of the
programmer. The expectations can be summarised in two points:
·
First visit the internal
properties of an object; if there is no proper attribute, then look for the
default within its class, superclass, etc.
·
During execution of a
method first visit its internal environment, then visit its class, superclass,
etc. looking for local properties; next, visit the internal properties of an
object being the receiver of a message.
These rules are contradictory for default values. For example, assume
that a method m is stored within a
class c1 and refers to an
attribute a stored within object O; O
is an instance of a class c2, and c2 contains a sub-object a storing a default value for the
attribute a, Fig.7a.

Fig.58.
A conflict between natural scoping rules and default values
Assume that class C1 inherits
from C2, the object O received the message m being a property of C2, and within the body of m there is a reference to the attribute a. The natural order of visiting
particular environments during the binding name a occurring within the body of m
is shown on Fig.58a as a sequence of curved arrows:
·
Visit the local
environment of m ;
·
Visit private and
exported properties of C2 (to bind
some other methods, class variables or procedures that may be called within m);
·
Visit the sub-objects of
O ;
·
Visit exported
properties of the classes that O
belongs to; in this case, the exported properties of C2. The default a will be bound here if a was not bound in the
previous step;
·
Visit the rest of the
environment (database objects, global temporary objects, functions and
procedures from available global libraries).
The questions arises what will happen if C1 and C2 are
the same class? In such a case the above rules make our idea of default values
invalid for the method m, because the
default a will be visited and bound
before the attribute a, Fig.7b.
There are of course several ways to avoid this conflict. For instance,
inside a method if a default attribute is to be bound, it checked if an
attribute with the same name exists in the currently processed object. The
above case shows, however, that for object-oriented database programming
languages the scoping/binding rules may imply some pitfalls, hence all possible
situations should be carefully analysed, designed and tested.
If a query language is not strongly typed, irregular data together with
assumed full orthogonality of query operators may cause a binding problem. To
explain it, consider the query:
|
E.IRR.15 |
Get employees who have salary and earn the
same as Brown |
|
SBQL: |
(Emp where exists(salary)) where salary = ((Emp where name =
”Brown”).salary) |
If the Brown's salary is absent the query should cause a runtime error.
However, during the binding of the third occurrence of salary the environment stack is as shown in Fig.59.

Fig.59.
An example of false binding
Since the top of the environment stack contains no salary binder the search will be continued in lower sections of the
stack. The section below contains the salary
binder, thus the binding will succeed. As a result, the predicate after the
second where will be TRUE for any Emp having a salary. The
query will return all such employees.
Obviously, the result is wrong. Let us analyze the reason. The third salary occurrence is referring to the
Brown's object; if evaluated independently, the nested sub-query ((Emp where
name = ”Brown”).salary) will return the empty bag.
However, because the query is nested, salary
is considered as referring to the outer query. In terms of our stack model, the
search for the salary binder should
be finished at the top of the stack, but because of absent data there is no
such a binder and the search is continued in lower sections of the stack. Hence
such a false binding is caused by the lack of information about the semantic
qualification of the second occurrence of the name salary in the query. This information should be somewhere given,
and should influence the scoping rules and/or structures introduced in the
model.
This information is usually present in the type of an object. It can be
used during compilation to determine statically which section of the
environment stack (relatively to its top) is relevant for binding a particular
name; so the problem does not occur. Analogously, the problem can be easily
solved by introducing to the EmpClass
a special entity named salary, which
the only role is to stop the search in lower sections of the stack; the binding
of name salary to such objects will
return an empty table.
If the data environment
is untyped[2],
and there is no possibility to introduce the corresponding information to a
class, the programmer must be supported with special facilities to deal with
such cases. Among various options we present the following. We assume that
every name occurring in a query can be augmented by an arbitrary number of
names of its attributes. Each attribute name is preceded by a keyword with or without. For example,
Emp with name with salary without job
This phrase is evaluated (bound) as a single semantic unit (i.e., name, salary and job are not
bound independently). It returns the references to all Emp objects which are present in the actual scope an satisfy the
“with” and “without” conditions: that is, contain the
attributes name and salary but do not contain the attribute job. Inherited properties are not taken
into account.
The described facility gives the programmer a full control over the
binding, despite possibly defaults and absent data. For example, assuming salary and job can be absent, and a default job is stored within the EmpClass,
the following query
|
E.IRR.16 |
|
|
SBQL: |
( Emp without salary with job) where job <> “clerk” |
returns references to all
employees containing a job subobject,
have no salary and are not clerks. Note
that the default jobs and salaries do not influence the result. The query
E.IRR.15 can be formulated as follows:
|
E.IRR.17 |
Get employees who have salary and earn the
same as Brown |
|
SBQL: |
(Emp with salary) where salary = ((Emp with salary where name =
”Brown”).salary) |
If an object is typed by the cardinality [0..1] in general the simple
assignment may result in an exception if the object does not exist. Hence each assignment
to such an object should be preceded by testing whether the object exists and
then inserting it, for instance:
|
E.IRR.17 |
Let Brown earn 5000. |
|
SBQL: |
for each
(Emp where name =
”Brown”) as B do
{ if
exists B.salary then B.salary := 5000 else
B :<< 5000 as salary; } |
The operator :<< means “create and insert” – on
the right side there is a query that creates an object and on the left side
there is a reference to an altered object.
If there are many objects that are typed in this way, such assignments
can be annoying and the program will become much less legible. A solution of
this problem is to introduce a special operator (or to extend the meaning of
the := operator) that creates an object if it does not exist, for instance:
|
E.IRR.18 |
Let Brown earns 5000. |
|
SBQL: |
(Emp
where name = ”Brown”).salary
:= 5000; |
Unfortunately,
such a solution will not work if we compile this statement as usually and try to
do the proper operation at runtime. The left side of the assignment will return
an empty bag and it will be impossible to recognize what is the name of an
object to be inserted and where it should be inserted. Recognition of the
situation during compilation and corresponding altering of the generated code
seem too challenging (if not hopeless). Because the object to be assigned can
be the result of a query with any complexity, which may involve function,
method and view calls, in the general case altering of the generated code could
be totally impossible.
Such a
feature require altering the run time structures, on the same principle as
default values discussed above. Assume that each declaration of an object
(attribute, sub-attribute, etc.) with lower cardinality 0 always means that the
corresponding runtime class will be augmented by a dummy object having the same
name as the object and with some dummy value. For instance, for EmpClass from E.IRR.1 after compilation we insert
dummy objects
<id_dummy_1,
address, dummy> , <id_dummy_2, salary,
dummy> , <id_dummy_3, worksIn, dummy>,
where dummy is a value distinguished
internally that cannot be used by the programmer and cannot be stored in the
object store. Its only role is binding of names that cannot be bound in normal
ways. Having such a dummy object named x
it will be bound in all situations where x
cannot be bound in normal way, because it is absent. After such a binding the
system can recognize that updating is to be performed on the dummy object, and
instead of that, it goes to the top of the ENVS stack, which in this case
should contain the environment of the currently processed object. Hence, it can
recognize to which object the new object is to be inserted. For instance, for
example E.IRR.18, if Brown has no salary, then the query returns id_dummy_2. An
attempt to make the assignment will cause navigation to the top of ENVS,
finding the reference of the Brown object, and inserting into it the sub-object
<some_new_id, salary, 5000>. To
make this possible the system should be a bit lazy with reducing ENVS. It
should not be reduced before the operator := is completed.
Note that the
dummy value is not equivalent to the null value. A dereference of a reference
to a dummy object should return an empty bag. Dummy objects and the dummy value are never used by the
programmer and he or she needs not to be aware of such a feature. It is assumed
to be an internal option of the binding mechanism that is necessary to solve
the problem of assignments to absent
objects.
Well-known
typing systems are not prepared to type irregular data. Actually, only the type
system assumed in the XMLSchema is able to specify some non-trivial irregularities
in data. However, XMLSchema is used as a grammar of XML files rather than a
typing system for strong type checking of queries or programs. In general, because of new database models, new semantic
properties of query languages and semi-structured data, all the known strong
typing systems developed for programming languages are too limited and not
flexible enough. Sometimes there are even opinions that irregularity in data
(semi-structured data) are contradictory to strong type checking of queries and
programs.
We disagree with such opinions. Irregularity of data is relative. If a
typing system is prepared to deal with irregular data, they are no more
irregular and become ordinary and typical. Actually, the problem is in
developing a new typing system, because for such situations the known typing
systems are not prepared to this goal. This concerns also advanced typing
systems based on inclusion and parametric polymorphisms, F-bounded
polymorphism, etc.
We have developed a new strong typing theory and implement it for SBQL.
We call it “semi-strong” just because it deals with semi-structured
data as described above. We distinguish internal
and external type systems. The
internal type system reflects the behavior of the type checking mechanism,
while the external type system is used by the programmer. A static strong type
checking mechanism simulates run-time computations during compile time by
reflecting the run-time semantics with the precision that is available at the
compile time. Roles of the SBQL typing system are the following: compile-time
type checking of query operators, imperative constructs, procedures, functions,
methods, views and modules; user-friendly, context dependent reporting on type
errors; resolving ambiguities with automatic type coercions, ellipses,
dereferences, literals and binding irregular data structures; shifting type
checks to run-time, if it is impossible to do them during compile time;
restoring a type checking process after a type error, to discover more than one
type error in one run; preparing information for query optimization by properly
decorating a query syntax tree.
The internal SBQL type system includes three basic data structures that
are compile-time counterparts of run time structures: a metabase, a static
environment stack and a static result stack. Static stacks process type
signatures – typing counterparts of corresponding run time entities.
Signatures are additionally associated with attributes, such as mutability,
cardinality, collection kind, type name, multimedia, etc. For each
query/program operator a decision table is provided, which determines allowed
combinations of signatures and attributes, the resulting signature and its
attributes, and additional actions.
Processing types with cardinalities require delegating type checking to
run time. For each creating, inserting and deleting operation performed on an
object typed with non-trivial cardinalities the SBQL compiler inserts a piece
of code that checks the corresponding cardinality during run time.
More about semi-strong typing system of SBQL is the subject of the
special chapter and in description
of the ODRA system. An example of SBQL types is presented in E.IRR.1 .
Irregular data are
usually associated with irregular queries that are impossible to ask in SQL.
Examples of irregular queries can be as follows:
·
Get names (as strings) of objects having an
attribute storing the string “Monthy Pyton”.
·
Get names (as strings) of attributes that store
the string “Monthy Pyton”.
·
Get objects of any type that store the string
“Monthy Pyton” at any attribute and at any level of the hierarchy
of their attributes.
·
Get the number of all attributes in the person
object with the name “Tom Jones”.
·
Get Emp
objects with an attribute having the name starting with letter “S”
and with the value terminated by the string “logy”.
We can give more
complicated examples of such unusual queries. All such queries require generic
capabilities, sometimes beyond the
assumed strong typing system. It is difficult to say if such queries are
necessary. If such a necessity exist, it usually means that the design of a
data schema is ill. However, current experiences with XML and its query
language XQuery show that for some environments such queries make a sense.
Actually, the above
queries require some generic functions that remain reflexive capabilities. We
list some of these functions:
·
Navigation from a reference of an object to
references of all its components (toSubordinates).
Obviously, the typing system is violated, hence it must be somehow changed.
·
Navigation from a subobject to its direct
superobject (toOwner).
·
Returning the collection of names of all objects
(allObjects).
·
Given a reference to an object, returning its
name as a string (nameOf).
·
Given a string, use it as a data name (asName).
·
Navigation from a reference to an object to its
class/type (toClass, toType).
We can invent of course
more such functions and perhaps their number is unlimited. It may happen that
some conceivable query is still not possible within the given set of such
functions, but inventing them is quite easy (although not sure if makes a
sense).
Within this set of
functions augmenting SBQL we can ask the following queries:
|
E.IRR.20 |
Get names (as strings) of objects having an
attribute storing the string “Monthy Pyton”. |
|
SBQL: |
((allObjects
as m join toSubordinates(m) as s) where s = “Monthy Pyton”).nameOf(m) |
|
E.IRR.21 |
Get Emp
objects with an attribute having the name starting with letter
“S” and with the value terminated by the string
“logy” |
|
SBQL: |
((Emp
as m join toSubordinates(m) as s) where nameOf(s)[1] = “S” and s like
“*logy” |
The examples has shown that such (relatively simple to implement)
functions are able to rise the power of a query language. However, it is not
such that such a power would not be the reason of some non-disciplined design
of a database schema.
Although it is a common belief that null values and other forms of
semi-structured information are one of the inevitable features of database
applications the object-oriented database research and technology tend to
neglect the issue. Practical proposals, such as the ODMG standard, present very
few examples of relevant capabilities. Theories proposed for object bases and
their query languages, such as object algebras, F-logic, comprehensions and
structural recursion, do not supply comprehensive solutions. Moreover, the
discussed SQL approach ultimately leads to numerous flaws and inconsistencies.
In this chapter we have presented a systematic approach to the problem
based on the idea of cardinalities assigned to types of objects and briefly
described our solutions. These can be summarized into the following
recommendations:
·
We agree with Ch. Date:
a special generic value called null (nil) should be absolutely avoided. We argue that it acts like a little devil that
is able to spoil the semantic clarity and consistency of all language
constructs; this especially concerns query languages.
·
Ch. Date suggests to use
default values instead of nulls. In general, we doubt if this approach offers
advantages and reduces the possibility of inconsistencies and
programmers’ errors. However, in this chapter we present simple methods
to incorporate default values into classes.
·
In all cases when the
use of default values is inconvenient or impossible (e.g., unions, repetitions,
complex attributes) binding a name of an absent object should result in an
empty collection, which can be processed by standard language facilities, such
as exists, count, and quantifiers.
Static strong type checking of queries and programs addressing
semi-structured data in the context of object-oriented query languages is not
considered in currently typing systems and their theories (including
polymorphic ones). SBA and SBQL offer semi-strong typing system based on
compile-time simulation of run-time semantics. The approach makes it possible
to provide the type checking as far as it is possible for irregular data.
Last modified: July 21, 2009
[1] This can be
of course constrained by DTD or XMLSchema, but such constraints are not
obligatory.
[2] We strongly
discourage making untyped programming interfaces for queries. SBQL is strongly
typed. As will be shown in next sections, strong typing has a crucial role for
query optimization. One can argue that discovering typing errors during
compilation time is not so important and we can live without it, c.f. SQL.
However, lack of query optimization totally undermines the query interface for
its users.