©Copyright by Kazimierz Subieta.

 

Why the group by Operator is not Necessary in Object Query Languages

 

by Kazimierz Subieta

Back to Description of SBA and SBQL.

The operator group by occurs in SQL and is proposed in other query languages, in particular, in the object-oriented query language OQL (by ODMG). In relational query languages the operator allows one to formulate a lot of useful queries that cannot be formulated otherwise. In particular, these queries cannot be formulated in the relational algebra[1], making one more evidence that the “relational completeness” of query languages is only a pseudo-scientific buzzword, with no sense and meaning[2]. The operator was especially useful in connection with aggregate functions, for example (c.f. Fig.30, the relational case):

 

E.GB.1

For each department get its dept number, the number of employees and the average salary of employees:

SQL:

select worksIn, count(*), avg(sal) from Emp

group by worksIn

Semantics of this construct is very simple. The table Emp is horizontally subdivided into groups according to the same vale of worksIn. Then, for each such group the select clause calculates the number of tuples count(*) and the average salary avg(sal). The result is a table with three columns, where the first can be named worksIn and two next are unnamed.

In SQL the operator is extended by the having clause that is a conditional expression allowing one to filter proper groups. The having clause can coexist with the where clause, for instance:

 

E.GB.2

For each department having more than 50 employees get its dept number and the average salary:

SQL:

select worksIn, avg(sal) from Emp

group by worksIn having count(*) > 50

The operator is relatively simple for a single table. In case of several joined tables it is not easy to use:

 

E.GB.3

For each department having more than 50 employees get its dept number, dept name and the average salary:

SQL:

select e.worksIn, d.dname, avg(e.sal) from Emp e, Dept d

where e.worksIn = d.d#

group by e.worksIn, d.dname having count(e.*) > 50

There are some additional rules of using this operator, in particular, all attribute names that occur in the select clause outside aggregate functions must also occur in the group by clause. For large SQL queries such far context dependency compromises orthogonality and could be the reason of programmer’s bugs. The far context dependency makes also some problem for static strong type checking; however, current SQL has no such a feature.

Some professionals consider the group by operator as a consequence of the flat (unnested) nature of relational tables. The group by operator temporarily subdivides the table into sub-tables, but such an operation is impossible in the classical relational model. Hence, a nested relational model could change the nature or the necessity of this operator. It is also considered disadvantageous for performance, as its definition is imperative rather than declarative. Hence many optimization methods become invalid if this operator is used. Some new methods are developed, but in general, because the operator requires sorting, the methods are not always as efficient as required.

The operator also leads to the well known semantic reefs where the result calculated by the query processor is different from the result expected by the programmer. The first reef is connected with empty groups, which are not taken into account by the operator.

 

E.GB.4

Get the departments together with the number of their employees:

SQL:

select worksIn, count(*) from Emp group by worksIn

Assume that the company has three departments, one with 10 employees, second with 20 employees and third with no employees, and the programmer wants to calculate the average number of employees in the departments. Obviously, the average is 10, but if it would be calculated according to the result returned by the above SQL statement, the average is 15. The empty department is not taken into account. Another semantic reef is connected with null values. If in the worksIn attribute nulls are allowed, then the group by operator collects all nulls within an additional group. Hence, the number of departments is increased by one, one group has null as its department number. We can imagine how many difficult bugs such a feature can generate, especially in the maintenance phase, when the administrator is forced to add “null is allowed” for definitions of some columns in the relational tables. Checking all group by operators in all applications of all clients that work with this database would not be an easy job.

In SQL the group by operator is tightly associated with aggregate functions sum, min, max, avg and count. If the functions act on a table that is not subdivided by group by into subtables, then they act as usual. However, if the aggregate functions are put in the context of the group by operator, then they calculate proper values for groups rather then on an entire table. This is a kind of semantic schizophrenia that is disadvantageous from the point of view of  the perception of programmers and obvious evidence of non-orthogonality of aggregate functions with other query operators.

Despite the above well recognized disadvantages, ODMG OQL introduces the group by operator. Unfortunately, the syntax and especially semantics of this operator are not clear. It makes little sense to repeat the definitions and examples presented in the “standard”, as they are rather below the commonly accepted scientific and technical formal specification standards. It seems that this operator is introduced with no deeper motivation, as a clone of the corresponding SQL operator motivated by the (false) statement that OQL is only a minor extension of SQL.

During the development of SBA and SBQL we have tried to develop a “civilized” version of the operator that would satisfy the following assumptions:

Unfortunately, despite many investigations, trials and dozens of examples we did not find the solution that would satisfy these requirements. It is possible that other researchers will find it, but actually we have lost any hope. Our conclusion is that the group by operator, motivated purely by physical implementation, makes no chances to be integrated with other query operators in such a way that all assumptions presented above would be fully satisfied.

Our attempts to specify this operator have led us to the conclusion that for object-oriented models the operator is unnecessary, redundant and awkward. It can be smoothly and directly substituted by dependent (navigational) join, dot and other SBQL operators. As we have observed at the beginning, the operator requires subdividing a table into subtables, i.e. it implicitly assumes a relational model with nested tables. But our store models AS0-AS3 allow for arbitrary nesting of stored data structures. Nesting is also the assumption concerning query results. Hence the conceptual motivation for the group by operator totally disappears.  It is unable to introduce to a query language essential new quality. Moreover, a lot of complex queries that in SQL require the group by operator become incredible simpler in SBQL, without this operator.

Below we illustrate on examples how we can avoid the group by operator in SBQL (c.f. Fig.30, the object-oriented case).

 

E.GB.5

(Compare E.GB.1) For each department get its dept number, the number of employees and the average salary of employees:

SBQL:

Dept.(d#, count(employs), avg(employs.Emp.sal))

 

E.GB.6

(Compare E.GB.2) For each department having more than 50 employees get its dept number and the average salary:

SBQL:

(Dept where count(employs) > 50). (d#, avg(employs.Emp.sal))

 

E.GB.7

(Compare E.GB.3) For each department having more than 50 employees get its dept number, dept name and the average salary:

SBQL:

(Dept where count(employs) > 50).

(d#, dname, avg(employs.Emp.sal))

Note that having clauses are not necessary.

 

E.GB.8

(Compare E.GB.4) Get the average number of employees in departments:

SQL:

avg( Dept.count( employs ) )

Note that unlike SQL there is no danger of the semantic reef causing omitting empty departments. All the departments having no employees will return count(employs) = 0, hence the average number of employees in departments is calculated correctly.  To deliver the average number of employees in non-empty departments one can write:

 

E.GB.9

(Compare E.GB.4) Get the average number of employees in non-empty departments:

SQL:

avg(( Dept where exists( employs )).count( employs ) )

SBQL avoids also the semantic reefs related to null values, because in SBQL null values are not supported. According to [Date86] and [SLU98] no consistent definition of null values is possible. Instead, in SBQL a null value is represented as a collection with the cardinality [0..1].

In SBQL we can also address relational databases. Below we show that all the above queries that require the group by operator in SQL can be formulated in SBQL without this operator. The prescription for such SBQL queries is very simple:

Sometimes naming suggested in the first step can be omitted. Below we present examples that follow this prescription (c.f. Fig.30, the relational case):

 

E.GB.10

(Compare E.GB.1) For each department get its dept number, the number of employees and the average salary:

SBQL:

Dept.(d#, count(Emp where worksIn = d#),

                   avg((Emp where worksIn = d#).sal)))

SBQL:

Dept join ((Emp where worksIn = d#) group as e) .

(d#, count(e), avg(e.sal))

SBQL:

(Dept.d# as d) join ((Emp where worksIn = d) group as e) .

(d, count(e), avg(e.sal))

 

E.GB.11

(Compare E.GB.2) For each department having more than 50 employees get its dept number and the average salary:

SBQL:

((Dept as d) join ((Emp where worksIn = d.d#) group as e )

where count(e) > 50) . (d.d#, avg(e.sal))

 

E.GB.12

(Compare E.GB.3) For each department having more than 50 employees get its dept number, dept name and the average salary:

SBQL:

((Dept as d) join ((Emp where worksIn = d.d#) group as e )

where count(e) > 50) . (d.d#, d.dname, avg(e.sal))

Note that in this case (in contrast to SQL) including dname into the final result has required a small obvious change only.

 

E.GB.13

(Compare E.GB.4) Get the average number of employees in departments:

SBQL:

avg( Dept.count(Emp where d# = worksIn))

Note that in this case (ulike SQL) no semantic reef is occurring: empty departments are taken into account during calculation of the average. One can also create the semantic equivalent of E.GB.4, but this must be explicitly determined by the programmer:

 

E.GB.14

(Compare E.GB.4) Get the average number of employees in non-empty departments:

SBQL:

avg((Dept  join count(Emp where d# = worksIn) as c where c > 0).c)

Some queries in SQL that require the group by operator are difficult to formulate. Consider, for instance, the following SBQL query:

 

E.GB.15

(Compare E.GB.7) For each department having more than 50 employees get its dept number, dept name and the average salary of its clerks:

SBQL:

(Dept where count(employs) > 50).

(d#, dname, avg(employs.(Emp where job = “clerk”). sal))

Note that in comparison to E.GB.7 the above query requires only inserting the condition job = “clerk”. In SQL this is not so easy, because count requires the group by operator acting on the entire Emp table, while avg requires a similar grouping, but addressing only on a part of the Emp table selected by the condition job = “clerk”. This leads to a very awkward SQL query or some sequence of queries, with the help of materialized views. This can be considered another sign of the conceptual weakness of the group by operator.

Below we give some quite easy (but a bit more sophisticated) examples in SBQL that for sure would require the group by option in SQL, but cannot be so easily formulated - not excluding that due to the limitations of SQL they cannot be formulated at all (c.f. Fig.30, the object-oriented case).

 

E.GB.16

For each location give the set of department names that are located at it, the average salary of bosses of these departments and the number of clerks that are (possibly) employed at these locations.

SBQL:

(distinct(Dept.loc) as X). ((Dept where X in loc) group as Xdepts). (

    (Xdepts.dname) group as XdeptsNames,

    avg( Xdepts.boss.Emp.sal) as XdeptsBossAvgSal,

    count( Xdepts.employs.(Emp where job = “clerk”)) as XdeptsClerks# )

 

E.GB.17

For each salary range 0-99, 100-199, 200-299,…, etc. get the range, the number of employees getting the salary from the range and the average salary of bosses of the employees getting the salary from the range. Formulate the result as report messages. Ranges with no employee belonging to should be omitted.

SBQL:

distinct( integerOf(Emp.sal/100) ) as R .

   ((100* R) as Rmin join (Rmin+99) as Rmax  join

      (Emp where salRmin and sal < Rmax) group as Remps). (

          count(Remps) + “ employees earn between “ + Rmin +

          “ and “ + Rmax + “; the average salary of their bosses: “ +

          avg(distinct(Remps.worksIn.Dept.boss.Emp).sal)               )

The first line calculates the range discriminators R. The second line calculates the minimal and maximal salaries within a particular range. The third line calculates identifiers of employees that earn salary within the given range and forms from such employees a group named Remps. Three next lines form the output message for the given range using + as a concatenation of strings.

Some professionals can consider as a positive property of the group by operator some performance gain. Because the operator is closely related to its physical implementation, it can be tuned to be very efficient. This argument, however, works only for simple cases, when the grouping concerns stored tables. For tables that are calculated in a query the argument may be no more valid. Moreover, this argument reminds the old argument that programming should be done in assembler, because it is the most efficient and programs can be tuned by the programmers with no limits. The current point of view that governs the development of computer languages emphasizes human and economic aspects of programming processes, conceptual closure (orthogonality) of programming options and achieving proper quality of programs. Performance is very important, but it is the subject of internal optimizations rather than the conceptual construction of the language. In particular, query optimization methods developed for SBQL (and further methods) give some promises that the overall performance of SBQL queries that in SQL require grouping would be at least not worse. This of course requires testing and very detailed analysis of situations in query processing that are disadvantageous for performance.

As a side effect of our attempts to define the group by operator for object-oriented store models we concluded that another grouping operator is necessary. This operator we call group as; see examples E.GB.12, E.GB.16 and E.GB.17.  The operator groups a result returned by a query under a single name; then the group can be manipulated as a single element. The group us operator is earlier known as the “nest” operator. Despite the origin, the group us operator has semantically almost nothing in common with the group by operator. The group as operator is analogous to grouping some set of figures into a single figure, as for instance, in Power Point.  The semantics of the group as operator is very simple and it is explained in the section devoted to algebraic operators in SBQL.


Last modified: February 21, 2008



[1] There are special relational algebras that include the group by operator. In our opinion they are too opportunistic (in the negative sense) to consider them seriously.

 

[2] The “relational completeness” concept makes still a lot of misunderstanding, especially in industrial communities not always realizing properly its meaning and limitations. As “Turing completeness”, the “relational completeness” introduces no valid yardstick of the universality of computer  languages, including query languages and programming languages. The term “completeness” is abused here and has no expected technical meaning. Instead of Turing or relational “completeness” we can talk on “pragmatic completeness”, i.e. a language reasonably offering everything that programmers want for the given application domain. In this relative sense no computer language is “complete”.